

This page documents the input file formats required by NEXT-RNAi, additional input formats as well as the variety of output formats provided.




Input files

Input FASTA file

Sequences for reagent designs or evaluations must be provided in FASTA format. The FASTA format is a general sequence format composed of a header starting with '>' followed by the sequence identifier. The next line then contains the sequence. More detailed information is available here.

NEXT-RNAi only allows sequences containing A, C, G, T nucleotides. Sequences can contain newlines.
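The two rules above (header lines starting with '>', sequences restricted to A, C, G, T and possibly wrapped over several lines) can be checked before a run. The following is only a minimal sketch and not part of NEXT-RNAi: the function names and the example records are hypothetical, and it assumes that lowercase letters may be uppercased before checking.

```python
from typing import Dict, Iterable

ALLOWED = set("ACGT")


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {identifier: sequence}, joining wrapped lines."""
    records: Dict[str, str] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            current = parts[0] if parts else ""
            records[current] = ""
        elif current is not None:
            # Assumption: case is normalised before validation.
            records[current] += line.upper()
    return records


def invalid_records(records: Dict[str, str]) -> Dict[str, str]:
    """Return {identifier: offending characters} for non-ACGT sequences."""
    bad = {}
    for name, seq in records.items():
        extra = set(seq) - ALLOWED
        if extra:
            bad[name] = "".join(sorted(extra))
    return bad


# Hypothetical input: seq1 is wrapped over two lines, seq2 contains N.
example = """>seq1
ACGTAC
GTTA
>seq2
ACGNNT
""".splitlines()

records = read_fasta(example)
print(records["seq1"])           # ACGTACGTTA
print(invalid_records(records))  # {'seq2': 'N'}
```

Sequences reported by `invalid_records` would have to be corrected or removed before being passed to NEXT-RNAi.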

Depending on the running mode one or two input files need to be provided:

Design of long dsRNAs or siRNAs (-e NO): one FASTA input file containing desired target sequences.



Evaluation of long dsRNAs:

  • One file with long dsRNA sequences (-e DSRNA)...



  • ... or one file containing forward and reverse primer sequences (with same base-identifier and '_f' or '_r' endings) (-e OLIGO) ...



  • ... or two files, one file containing long dsRNA sequences and the other containing forward and reverse primer sequences (with same base-identifier and '_f' or '_r' endings in the primer file, the base-identifier must be identical to the identifier belonging to the corresponding long dsRNA) (-e DSRNA+OLIGO).
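The '_f' / '_r' naming convention for primer files can be verified with a few lines of code. This is a hedged sketch: `pair_primers` and the identifiers 'amp1' and 'amp2' are made up for illustration, and the primer sequences are placeholders.

```python
from typing import Dict, List, Tuple


def pair_primers(primers: Dict[str, str]) -> Tuple[Dict[str, Tuple[str, str]], List[str]]:
    """Group {identifier: sequence} by base identifier.

    Returns ({base: (forward, reverse)}, [identifiers without a partner]).
    """
    forward: Dict[str, str] = {}
    reverse: Dict[str, str] = {}
    unpaired: List[str] = []
    for name, seq in primers.items():
        if name.endswith("_f"):
            forward[name[:-2]] = seq
        elif name.endswith("_r"):
            reverse[name[:-2]] = seq
        else:
            unpaired.append(name)  # no '_f' or '_r' ending at all
    pairs: Dict[str, Tuple[str, str]] = {}
    for base in sorted(set(forward) | set(reverse)):
        if base in forward and base in reverse:
            pairs[base] = (forward[base], reverse[base])
        else:
            unpaired.append(base + ("_f" if base in forward else "_r"))
    return pairs, unpaired


# Placeholder primers: amp1 is complete, amp2 lacks its reverse primer.
primers = {
    "amp1_f": "ACGTACGTACGTACGTACGT",
    "amp1_r": "TTGCATGCATGCATGCATGC",
    "amp2_f": "GGATCCGGATCCGGATCCGG",
}
pairs, unpaired = pair_primers(primers)
print(sorted(pairs))  # ['amp1']
print(unpaired)       # ['amp2_f']
```

For the -e DSRNA+OLIGO mode, the keys of `pairs` would additionally have to match the identifiers in the long dsRNA file.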


Primer sequences

long dsRNA sequences


Bowtie off-target database / index

NEXT-RNAi uses Bowtie to map siRNAs to an 'off-target' database (in most cases the transcriptome of the organism used). Bowtie requires an index built from the original FASTA file using the bowtie-build script that is contained in the Bowtie package. A Bowtie index consists of six files with the file extensions '*.1.ebwt', '*.2.ebwt', '*.3.ebwt', '*.4.ebwt', '*.rev.1.ebwt' and '*.rev.2.ebwt'. NEXT-RNAi only requires the base name of the Bowtie index (here denoted as '*').

Example: download the latest Drosophila transcriptome FASTA file from FlyBase and unpack it, then run:

bowtie-build dmel-all-transcript-r5.27.fasta dmel-all-transcript-r5.27

This will generate a Bowtie index with the base-name 'dmel-all-transcript-r5.27' that can be used for NEXT-RNAi (-d path_to_index/dmel-all-transcript-r5.27).
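Because only the base name is passed to NEXT-RNAi, it can be useful to confirm that all six index files listed above exist next to it. The helper names in this sketch are hypothetical, and the path in the example is a placeholder.

```python
from pathlib import Path
from typing import List

# The six file extensions of a Bowtie index as listed above.
EBWT_SUFFIXES = [
    ".1.ebwt", ".2.ebwt", ".3.ebwt", ".4.ebwt",
    ".rev.1.ebwt", ".rev.2.ebwt",
]


def expected_index_files(base_name: str) -> List[str]:
    """Return the six file names that belong to an index base name."""
    return [base_name + suffix for suffix in EBWT_SUFFIXES]


def missing_index_files(base_name: str) -> List[str]:
    """Return the index files that are not present on disk."""
    return [f for f in expected_index_files(base_name) if not Path(f).is_file()]


# Placeholder location; replace with the real path_to_index.
base = "path_to_index/dmel-all-transcript-r5.27"
for name in missing_index_files(base):
    print("missing:", name)
```

An empty result from `missing_index_files` suggests that the base name can be passed via -d.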


Additional input files

Options file (TAG=VALUE format)

Most settings are passed to NEXT-RNAi via an options file (-o optionsFile). This file has a simple 'TAG=VALUE' format described in detail here, e.g.:


DESIGNWINDOW=80,250
DESIGNNUM=50
OUTPUTNUM=1
SIRNALENGTH=19
EFFICIENCY=SIR,0
REDESIGN=ON
INTRON=90
BOWTIE=../software/bowtie/
TARGETGROUPS=../testData/fbgn_fbtr_fbpp_fb_2010_03.tsv
PRIMER3=../software/primer3/src/
GENOMEBOWTIE=../testData/dmel-all-chromosome-r5.26
GFF=GFF3
GBROWSEBASE=http://www.dkfz.de/signaling/cgi-bin/gbrowse_img/flybase/
GBROWSETRACK=GENE+TXN
AFF=YES
LOWCOMPEVAL=../software/mdust/
CANEVAL=6
HOMOLOGY=../software/blast/bin/,../testData/dmel-all-transcript-r5.26.fasta,1e-10
MIRSEED=7,../testData/miRBase_r14_Dmel.fa
RANKD=SPEC
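A 'TAG=VALUE' file such as the one above is easy to read programmatically, e.g. to document the settings of a run. The sketch below is an illustration only: `parse_options` is a hypothetical helper, it assumes one setting per line, and it does not interpret the meaning of individual values.

```python
from typing import Dict, Iterable


def parse_options(lines: Iterable[str]) -> Dict[str, str]:
    """Read TAG=VALUE lines into a dictionary; other lines are skipped."""
    options: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '='.
        tag, value = line.split("=", 1)
        options[tag.strip()] = value.strip()
    return options


# A few settings taken from the example options file above.
text = """DESIGNWINDOW=80,250
SIRNALENGTH=19
EFFICIENCY=SIR,0
HOMOLOGY=../software/blast/bin/,../testData/dmel-all-transcript-r5.26.fasta,1e-10
"""
options = parse_options(text.splitlines())
print(options["SIRNALENGTH"])          # 19
print(options["HOMOLOGY"].split(","))  # three comma-separated parts
```

Values with several comma-separated parts (such as HOMOLOGY) are kept as plain strings and can be split by the caller.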


Targetgroup file (tab-delimited)

The 'Targetgroups' file is required by NEXT-RNAi to connect transcripts to genes. NEXT-RNAi determines the specificity of RNAi reagents by mapping siRNA sequences to the transcriptome. If a given siRNA targets multiple transcripts, NEXT-RNAi needs to determine whether these belong to the same gene or to different genes (the latter would indicate an 'off-target' match).
The 'Targetgroups' file must be tab-delimited and must contain the headers 'Target' (e.g. transcripts) and 'TargetGroup' (e.g. the corresponding gene); it can contain additional columns, which are ignored by NEXT-RNAi. The transcript identifiers must be the same as those used in the 'off-target' database (-d option).
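The role of the 'Targetgroups' file can be illustrated with a small lookup: transcripts hit by one siRNA are translated to genes, and more than one gene indicates an off-target match. This is a sketch of the idea rather than the NEXT-RNAi implementation; the identifiers 'tx1' to 'tx3', 'geneA' and 'geneB' and the extra 'Note' column are made up.

```python
import csv
import io
from typing import Dict, Iterable, Set


def load_target_groups(handle) -> Dict[str, str]:
    """Map each 'Target' (transcript) to its 'TargetGroup' (gene)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Target"]: row["TargetGroup"] for row in reader}


def genes_hit(transcripts: Iterable[str], groups: Dict[str, str]) -> Set[str]:
    """Return the set of genes behind the transcripts hit by one siRNA."""
    return {groups[t] for t in transcripts if t in groups}


# Hypothetical table; the third column is ignored by the loader.
table = (
    "Target\tTargetGroup\tNote\n"
    "tx1\tgeneA\tisoform 1\n"
    "tx2\tgeneA\tisoform 2\n"
    "tx3\tgeneB\tisoform 1\n"
)
groups = load_target_groups(io.StringIO(table))

print(genes_hit(["tx1", "tx2"], groups))       # one gene: specific
print(len(genes_hit(["tx1", "tx3"], groups)))  # 2 genes: off-target match
```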



Bowtie mapping database / index

NEXT-RNAi can use Bowtie to localize designed reagents, e.g. in the genome (GENOMEBOWTIE setting in the options file). This is required to generate AFF and GFF files (see below), to calculate feature contents (see below) and to visualize designs in their genomic context using GBrowse. The Bowtie index for a given genome can be built using the bowtie-build script as described above. For many genomes these indexes are already available for download from the Bowtie webpage.

Feature file (tab-delimited)

NEXT-RNAi can compare the location of a designed RNAi reagent (only if it was mapped to the same reference, e.g. the genome) with the locations of all kinds of sequence features, such as SNPs or UTRs. This requires a tab-delimited feature file (FEATURE setting in the options file) with four columns containing the headers 'FeatureName' (the same for a certain kind of feature, e.g. 'UTR'), 'FeatureLoc' (reference, e.g. the chromosome), 'FeatureStart' (start in the reference) and 'FeatureEnd' (end in the reference); the file can also contain additional columns, which are ignored by NEXT-RNAi.
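How a feature file with these four headers relates to a mapped reagent can be sketched as an interval overlap. The code below is an assumption-laden illustration, not the NEXT-RNAi algorithm: it assumes inclusive coordinates, it simply sums the overlap per 'FeatureName' (features that overlap each other would be counted twice), and all coordinates are invented.

```python
import csv
import io
from collections import defaultdict
from typing import Dict


def overlap_by_feature(handle, ref: str, start: int, end: int) -> Dict[str, int]:
    """Sum the overlap (in nt) between a reagent and each kind of feature."""
    totals: Dict[str, int] = defaultdict(int)
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["FeatureLoc"] != ref:
            continue  # feature lies on another chromosome or contig
        f_start, f_end = int(row["FeatureStart"]), int(row["FeatureEnd"])
        # Inclusive coordinates are assumed here.
        shared = min(end, f_end) - max(start, f_start) + 1
        if shared > 0:
            totals[row["FeatureName"]] += shared
    return dict(totals)


# Invented features on chromosome X and 2L.
features = (
    "FeatureName\tFeatureLoc\tFeatureStart\tFeatureEnd\n"
    "UTR\tX\t100\t150\n"
    "UTR\tX\t300\t320\n"
    "SNP\tX\t140\t140\n"
    "UTR\t2L\t100\t150\n"
)
result = overlap_by_feature(io.StringIO(features), "X", 120, 310)
print(result)  # {'UTR': 42, 'SNP': 1}
```

In the example, the reagent at X:120..310 shares 31 nt with the first UTR and 11 nt with the second, which gives 42 nt in total.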



Further additional input files in FASTA format

The following options also require files in FASTA format:

  • SEEDMATCH: For calculating the number of seed matches (seed complement frequency) a FASTA file (e.g. a file containing 3'-UTR sequences) needs to be provided. Alternatively, a Bowtie index (described above) can be used.
  • INDEPENDENT: To calculate designs that do not overlap with previously designed reagents, the sequences of these can be provided in FASTA format. NEXT-RNAi will exclude them from new designs. Alternatively, a Bowtie index (described above) can be used.
  • GENOMEFASTA: In case Blat is used for reagent mappings (e.g. to the genome), a FASTA file needs to be provided.
  • HOMOLOGY: Designed reagents can be evaluated for (unwanted) homologies to a given Blast database. The Blast database consists of the FASTA file and a Blast index ('*.nhr', '*.nin', '*.nsd', '*.nsi', '*.nsq'). The Blast index can be generated by using the formatdb script contained in the Blast package, e.g. by running:
    formatdb -i dmel-all-transcript-r5.27.fasta -p F -o T
    
    NEXT-RNAi needs the location of the FASTA file, the complete index must be in the same location.
  • MIRSEED: NEXT-RNAi can calculate the number of miRNA seeds contained in RNAi reagents. The seeds to be searched must be provided as complete miRNA sequences (e.g. obtained from miRBase) in FASTA format.



Analysis of siRNA pools (tab-delimited)

siRNAs are often used as pools. NEXT-RNAi can evaluate siRNA pools and generate a summarized report (POOL option). This requires a tab-delimited file containing the headers 'siRNAID' and 'POOLID' (further columns are ignored by NEXT-RNAi). 'siRNAID' is the sequence identifier of the siRNA (the FASTA header used in the sequence input file -i) and 'POOLID' is the pool identifier, which must be the same for all sequence identifiers belonging to the same pool.

In the example below two pools were defined (M-003000-01 and M-003001-01). Each pool consists of 4 siRNAs (e.g. D-003000-05, D-003000-06, D-003000-07, D-003000-08 belong to the pool M-003000-01).
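The structure of such a pool file can be sketched as follows. The identifiers of the first pool are taken from the text above; the member identifiers of the second pool (D-003001-05 to D-003001-08) are invented for illustration, and `load_pools` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List


def load_pools(handle) -> Dict[str, List[str]]:
    """Group siRNA identifiers by their pool identifier."""
    pools: Dict[str, List[str]] = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        pools[row["POOLID"]].append(row["siRNAID"])
    return dict(pools)


pool_table = (
    "siRNAID\tPOOLID\n"
    "D-003000-05\tM-003000-01\n"
    "D-003000-06\tM-003000-01\n"
    "D-003000-07\tM-003000-01\n"
    "D-003000-08\tM-003000-01\n"
    # Invented members of the second pool:
    "D-003001-05\tM-003001-01\n"
    "D-003001-06\tM-003001-01\n"
    "D-003001-07\tM-003001-01\n"
    "D-003001-08\tM-003001-01\n"
)
pools = load_pools(io.StringIO(pool_table))
for pool_id, members in pools.items():
    print(pool_id, len(members))
```

Each 'siRNAID' must also occur as a FASTA header in the sequence input file (-i).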



Define intended targets of input sequences (tab-delimited)

It is possible to define the intended target for each input sequence. This can be important, e.g. when reagents have multiple perfect targets: the intended target is then always reported in the first position (which matters, e.g., when parsing the NEXT-RNAi output). To do so, a tab-delimited file linking input identifiers (header 'Query') to intended genes (header 'Intended') needs to be provided (further columns are ignored by NEXT-RNAi).



Exclude targets from report

Targets identified from mapping RNAi reagents to the 'off-target' database (-d input) and to previous designs (option INDEPENDENT) can be excluded from the report (as some might interfere with, e.g., parsing of NEXT-RNAi results). This only requires a single-column file with the header 'Exclude'.



Output report and files

NEXT-RNAi provides a comprehensive HTML output that can be viewed in any browser by opening the index.html file in the output folder generated.

For the output presented here NEXT-RNAi was queried to design long dsRNAs against common regions for all genes in the Drosophila genome (details are provided here).

Summary statistics

The HTML output first provides summary statistics about the number of successful designs.


In this example 74,907 target sequences were queried; for 70,149 of them designs could be calculated according to the selected design options.


Links to HTML results

These links (only a few are shown) lead to detailed reports for each designed RNAi reagent. In the images below the output for one of these designs targeting the gene csw (FBgn0000382) is discussed.


dsRNA information


The first box is focussed on the properties of the primer pair designed to synthesize this long dsRNA as well as on the properties of the dsRNA itself.



Primer sequence

Sequence (5'-3') of the primer. If queried, tags are attached to the 5'-end of the primer (here the T7 promoter 'taatacgactcactataggg') in lowercase letters.

Primer length [nt]

Length (in nt) of the primer (without tag sequence).

Primer Tm[°C]

Melting temperature of the primer in degrees Celsius (calculated without tag sequence).

Primer GC[%]

GC content (in percent) of the primer (calculated without tag sequence).

Primer pair penalty

Combined penalty of the forward and reverse primer (from primer3). The higher the penalty, the worse the primer pair.

Amplicon sequence

Sequence (5'-3') of the amplified product (without tags).

Amplicon length [nt]

Length (in nt) of the amplified product.

Amplicon location

Location of the amplified product in the genome of the organism in the format chromosome:start..end(orientation). For some organisms, where the genome is not completely assembled, the amplified product is mapped to contigs.
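Location strings in the format chromosome:start..end(orientation) can be split into their parts with a regular expression. The sketch below makes assumptions that the text does not confirm: the orientation is taken to be whatever stands between the brackets (e.g. '+' or '-'), the coordinates are treated as inclusive, and the example location is invented.

```python
import re
from typing import NamedTuple


class Location(NamedTuple):
    reference: str
    start: int
    end: int
    orientation: str


LOCATION_RE = re.compile(
    r"^(?P<ref>[^:]+):(?P<start>\d+)\.\.(?P<end>\d+)\((?P<orientation>[^)]+)\)$"
)


def parse_location(text: str) -> Location:
    """Split 'chromosome:start..end(orientation)' into typed fields."""
    match = LOCATION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unexpected location format: {text!r}")
    return Location(
        match.group("ref"),
        int(match.group("start")),
        int(match.group("end")),
        match.group("orientation"),
    )


# Invented example location on chromosome X.
loc = parse_location("X:2005000..2005220(+)")
print(loc.reference, loc.start, loc.end, loc.orientation)
print(loc.end - loc.start + 1)  # 221 nt if coordinates are inclusive
```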


Target information


This part of the output summarizes all genes and annotated transcripts targeted by the RNAi reagent. This information is based on mapping all siRNA sequences calculated from the designed reagent to the organism's transcriptome. The long dsRNA designed in this example has a length of 221 nt. The settings were adjusted such that any (perfect) 19 nt off-target matches should be avoided. With an offset of 1, 203 siRNAs of 19 nt can be generated from this sequence (221 - 19 + 1 = 203). Here only one gene (FBgn0000382) is targeted, and its annotated isoforms are targeted by all 203 possible siRNAs.
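The number of siRNAs derived from a long dsRNA follows from a sliding window over its sequence. The sketch below illustrates the arithmetic only; `sirna_windows` is a hypothetical helper and the 221 nt sequence is a placeholder, not the csw design.

```python
from typing import List


def sirna_windows(sequence: str, length: int = 19, offset: int = 1) -> List[str]:
    """Return all windows of the given length, shifted by `offset` each time."""
    last_start = len(sequence) - length
    return [sequence[i:i + length] for i in range(0, last_start + 1, offset)]


# Placeholder sequence of 221 nt (55 x 'ACGT' plus one 'A').
dsrna = "ACGT" * 55 + "A"
windows = sirna_windows(dsrna)

print(len(dsrna))    # 221
print(len(windows))  # 203 = 221 - 19 + 1
```

Each of these windows is what gets mapped to the transcriptome to count on-target, off-target and no-target siRNAs.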



Intended target gene

Best target gene(s) (considered as intended) based on the number of siRNAs matching its annotated transcripts. If multiple genes are targeted with the same number of siRNAs they are separated by comma.

Intended target transcripts (hits)

Transcripts annotated for the intended target gene(s) with the number of siRNA hits in brackets. All annotated transcripts are listed here, even if not all of them are targeted. If multiple intended genes were found, transcripts belonging to the same intended target gene are separated by comma. A semicolon then indicates that the following transcripts belong to the next intended target gene.

Other targeted gene(s)

All other genes targeted by fewer siRNAs are listed here, separated by comma. 'NA' indicates that no other target genes were found.

Other targeted transcripts (hits)

Transcripts annotated for the other targeted gene(s) with the number of siRNA hits in brackets. All annotated transcripts are listed here, even if not all of them are targeted. If multiple other targeted genes were found, transcripts belonging to the same targeted gene are separated by comma. A semicolon then indicates that the following transcripts belong to the next targeted gene.


Reagent quality



siRNAs [19 nt]

Number of possible siRNAs calculated from the RNAi reagent sequences (here 19 nt).

On-target

Number of specific siRNAs, only targeting transcripts of a single (intended) target gene.

Off-target

Number of unspecific siRNAs targeting transcripts of unintended genes besides those of the intended target gene.

No-target

Number of siRNAs with no target. This might occur, e.g., if an intron-spanning long dsRNA was designed.

Efficient siRNAs

Number of siRNAs within a designed long dsRNA with a predicted efficiency above the selected 'Minimal siRNA efficiency score' and according to the selected efficiency calculation method ('rational' or 'weighted'). The efficiency calculation is based on 19 nt siRNAs and can take values between 0 and 100. Here 0 was used as the cut-off, which is why all siRNAs are counted as 'efficient'. A more detailed description of the efficiency prediction is available here.

mirSeed

Number of conserved miRNA seeds (in this example annotations from miRBase were used) in the siRNAs contained in the designed long dsRNA.

Avg efficiency score

Average efficiency score of all 19 nt siRNAs contained in the designed long dsRNA according to the selected efficiency calculation method ('rational' or 'weighted'). A more detailed description of the efficiency prediction is available here.

LowComplexRegions

Number of low complexity regions (calculated with mDust) in the reagent's sequence. This value optimally should be 0.

CAN

Number of stretches with at least 6 contiguous CA[ACGT] (CAN) repeats. This value optimally should be 0.
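One way to read this definition is as a regular expression: a stretch is a run of at least six consecutive triplets of the form CA followed by any nucleotide. The sketch below follows that reading and counts non-overlapping stretches; whether NEXT-RNAi counts in exactly this way is an assumption, and the test sequences are invented.

```python
import re


def count_can_stretches(sequence: str, min_repeats: int = 6) -> int:
    """Count non-overlapping runs of at least `min_repeats` CA[ACGT] triplets."""
    pattern = re.compile(r"(?:CA[ACGT]){%d,}" % min_repeats)
    return sum(1 for _ in pattern.finditer(sequence.upper()))


# Six CAN triplets in a row (CAA CAG CAT CAC CAA CAG), flanked by other bases.
with_stretch = "GG" + "CAACAGCATCACCAACAG" + "TTTT" + "CAACAG"
# Only five CAN triplets in a row.
without_stretch = "CAACAGCATCACCAA"

print(count_can_stretches(with_stretch))     # 1
print(count_can_stretches(without_stretch))  # 0
```

The threshold of six corresponds to the value given in the description above and could be lowered or raised via `min_repeats`.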


Additional quality evaluation


This section reports on additional evaluations (e.g. homology and feature contents) of the designed RNAi reagent.



UTR

Overlap [nt] with UTR features as described above.

Sequence homology (e-value)

Predicted homology of the RNAi reagent to genes in the genome. Only homologies with E-values (according to BLAST output) below the cut-off selected on the settings page are reported. Optimally the RNAi reagent has significant homology to the intended target gene only (multiple homologous targets are separated by '&'). Significant homologies to other genes might give rise to off-target effects.


Genome browser


This image visualizes the designed reagent in its genomic context using the generic genome browser (GBrowse). It shows four tracks: the absolute location on the chromosome or contig (e.g. here X from ca. 2004 kbp to 2008 kbp), a 'Gene span track' containing the gene model, a 'Transcripts' track containing annotated transcripts and the 'RNAi' track showing the location of the actual design.



Results as flat files


In this part of the HTML report all design results and statistics as well as reports from running NEXT-RNAi are linked as flat files that can be easily loaded and manipulated in any kind of editor or spreadsheet software.



Tab-delimited result file


Regarding its content this file resembles the 'HTML results' for all designed reagents in a simple tab-delimited format. An example is linked here.

QueryID

Identifier of the queried target sequence.

QuerySubID

Identifier of the RNAi reagent composed of the target sequence identifier, '_' and an index that is incremented by 1 if multiple designs are queried for the same target sequence.

Length[nt]

Length (in nt) of the RNAi reagent.

SeqFor

Sequence (5'-3') of the forward primer. If queried, tags are attached to the 5'-end of the primer (here the T7 promoter 'taatacgactcactataggg') in lowercase letters.

PosFor

Starting position of the forward primer in the queried target sequence (starting at 0).

LenFor

Length (in nt) of the primer (without tag sequence).

GCFor[%]

GC content (in percent) of the primer (calculated without tag sequence).

TmFor[*C]

Melting temperature of the primer in degrees Celsius (calculated without tag sequence).

SeqRev

Sequence (5'-3') of the reverse primer. If queried, tags are attached to the 5'-end of the primer (here the T7 promoter 'taatacgactcactataggg') in lowercase letters.

PosRev

Starting position of the reverse primer in the queried target sequence (starting at 0).

LenRev

Length (in nt) of the primer (without tag sequence).

GCRev[%]

GC content (in percent) of the primer (calculated without tag sequence).

TmRev[*C]

Melting temperature of the primer in degrees Celsius (calculated without tag sequence).

ForRevPenalty

Combined penalty of the forward and reverse primer (from primer3). The higher the penalty, the worse the primer pair.

Specificity[Abs]

Absolute predicted specificity of siRNAs calculated from the RNAi reagent in the format 'Overall number of siRNAs calculated from the RNAi reagent / On-target siRNAs / Off-target siRNAs / No-target siRNAs'. 'On-target' refers to siRNAs only targeting transcripts of a single (intended) target gene. 'Off-target' refers to unspecific siRNAs targeting transcripts of unintended genes besides those of the intended target gene. 'No-target' refers to siRNAs targeting no transcript at all (e.g. in case of intron-spanning designs).

Specificity[%]

Percentage of specific siRNAs of the number of all siRNAs.
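The relation between the two specificity columns can be sketched in a few lines. The example values are invented, the exact spacing around the '/' separators in the result file is an assumption, and the helper names are hypothetical.

```python
from typing import Dict


def parse_specificity(value: str) -> Dict[str, int]:
    """Split 'total/on/off/no' into named siRNA counts."""
    total, on_target, off_target, no_target = (
        int(part.strip()) for part in value.split("/")
    )
    return {
        "total": total,
        "on_target": on_target,
        "off_target": off_target,
        "no_target": no_target,
    }


def percent_specific(counts: Dict[str, int]) -> float:
    """Percentage of on-target siRNAs among all siRNAs."""
    if counts["total"] == 0:
        return 0.0
    return 100.0 * counts["on_target"] / counts["total"]


perfect = parse_specificity("203/203/0/0")
mixed = parse_specificity("203 / 190 / 10 / 3")

print(percent_specific(perfect))          # 100.0
print(round(percent_specific(mixed), 1))  # 93.6
```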

EfficientsiRNAs

Number of siRNAs within a designed long dsRNA with a predicted efficiency above the selected 'Minimal siRNA efficiency score' and according to the selected efficiency calculation method ('rational' or 'weighted'). The efficiency calculation is based on 19 nt siRNAs and can take values between 0 and 100. Here 20 was used as cut-off. A more detailed description of the efficiency prediction is available here.

AvgEfficiencyScore

Average efficiency score of all 19 nt siRNAs contained in the designed long dsRNA according to the selected efficiency calculation method ('rational' or 'weighted'). A more detailed description of the efficiency prediction is available here.

Sequence

Sequence (5'-3') of the RNAi reagent (without tags).

IntendedGene

Best target gene(s) (considered as intended) based on the number of siRNAs matching its annotated transcripts. If multiple genes are targeted with the same number of siRNAs they are separated by '&'.

IntendedTxn

Transcripts annotated for the intended target gene(s). All annotated transcripts are listed here, even if not all of them are targeted. If multiple intended genes were found, transcripts belonging to the same intended target gene are separated by '+'. A '&' then indicates that the following transcripts belong to the next intended target gene.

IntendedTxnHits

Number of siRNAs targeting the transcripts in the column 'IntendedTxn' in the same order and separation.
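Since 'IntendedGene', 'IntendedTxn' and 'IntendedTxnHits' share the same order and separators ('&' between genes, '+' between transcripts of one gene), they can be combined into one nested structure. The following sketch uses invented identifiers and a hypothetical helper name; it does not handle 'NA' values.

```python
from typing import Dict


def combine_hits(genes: str, transcripts: str, hits: str) -> Dict[str, Dict[str, int]]:
    """Build {gene: {transcript: siRNA hits}} from three parallel columns."""
    gene_list = genes.split("&")
    txn_groups = transcripts.split("&")
    hit_groups = hits.split("&")
    if not (len(gene_list) == len(txn_groups) == len(hit_groups)):
        raise ValueError("columns describe different numbers of genes")
    combined: Dict[str, Dict[str, int]] = {}
    for gene, txn_group, hit_group in zip(gene_list, txn_groups, hit_groups):
        names = txn_group.split("+")
        counts = [int(h) for h in hit_group.split("+")]
        if len(names) != len(counts):
            raise ValueError(f"transcripts and hits differ for {gene}")
        combined[gene] = dict(zip(names, counts))
    return combined


# Invented values for the three columns of one result row.
combined = combine_hits("geneA&geneB", "tx1+tx2&tx3", "203+203&150")
print(combined)
# {'geneA': {'tx1': 203, 'tx2': 203}, 'geneB': {'tx3': 150}}
```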

OtherGene

All other genes targeted by fewer siRNAs are listed here, separated by comma. 'NA' indicates that no other target genes were found.

OtherTxn

Transcripts annotated for the other targeted gene(s). All annotated transcripts are listed here, even if not all of them are targeted. If multiple other targeted genes were found, transcripts belonging to the same targeted gene are separated by '+'. A '&' then indicates that the following transcripts belong to the next targeted gene.

OtherTxnHits

Number of siRNAs targeting the transcripts in the column 'OtherTxn' in the same order and separation.

UTR

Overlap [nt] with UTR features as described above.

mirSeed

Number of conserved miRNA seeds (in this example annotations from miRBase were used) in the siRNAs contained in the designed long dsRNA.

Location

Location of the amplified product in the genome of the organism in the format chromosome:start..end(orientation). For some organisms, where the genome is not completely assembled, the amplified product is mapped to contigs.

LowComplexRegions

Number of low complexity regions (calculated with mDust) in the reagent's sequence. This value optimally should be 0.

CANRepeats

Number of stretches with at least 6 contiguous CA[ACGT] (CAN) repeats. This value optimally should be 0.

Homology

Predicted homology of the RNAi reagent to genes in the genome. Only homologies with E-values (according to BLAST output) below the cut-off selected on the settings page are reported. Optimally the RNAi reagent has significant homology to the intended target gene only (multiple homologous targets are separated by '&'). Significant homologies to other genes might give rise to off-target effects.


Statistics result file



This file reports on some statistics of the designed reagents and is also presented in HTML format on the main output page.

This file reports the:

  • average and standard deviation of primer length (in nt), GC content (in %), melting temperature (in °C) and penalty (important for similar PCR efficiency)
  • average and standard deviation of siRNAs predicted to be efficient (for long dsRNA designs) or predicted siRNA efficiency (for siRNA designs)
  • summarized predicted specificity of all reagents (reagents with only 1 target, with multiple targets, with no target, with CAN repeats, with low-complexity regions etc.)
  • number of designs that could be mapped to the genome


FASTA result file



This file contains the designed RNAi reagents in FASTA format (see example above).


GFF result file



This file contains the designed RNAi reagents in generic feature format version 3 (GFF3, see example above), which can be used with GBrowse or other genome browsers to visualize the location of the designs in their genomic context. Further information about the GFF3 format is available here.


AFF result file



This file contains the designed RNAi reagents in annotation file format (AFF, see example above), which can be directly uploaded to GBrowse to visualize the designs as a new track in a publicly available genome browser. Documentation about this format can, e.g., be found here.


Location(s) of mapped reagent and oligo(s) that could not be mapped (tab-delimited)


For reagents that could be mapped a tab-delimited file is provided with detailed mapping information.



Column 1

Design identifier.

Column 2

Identifier of forward primer.

Column 3

Start position of forward primer.

Column 4

End position of forward primer.

Column 5

Orientation of forward primer.

Column 6

Identifier of reverse primer.

Column 7

Start position of reverse primer.

Column 8

End position of reverse primer.

Column 9

Orientation of reverse primer.

Column 10

Reference / chromosome.

Column 11

Start position of RNAi reagent in reference.

Column 12

End position of RNAi reagent in reference.

Column 13

Whether the reported match is a FULL match or a PARTIAL match to the reference.


For reagents (primers and dsRNAs) that could not be mapped a list containing the corresponding reagent identifiers is provided.


Homology of RNAi reagents (tab-delimited)

This file contains the original tab-delimited BLAST output to determine the homology of the designed reagents to sequences in the transcriptome (see example above).


miRNA seeds in RNAi reagents (tab-delimited)



Tab-delimited file listing the design identifiers that are positive for at least one miRNA seed (separated by comma in the second column) from the input (see above).


Links to input text files



Reagent sequence input file

FASTA file containing the original target sequences queried.

Validated reagent sequence input file

FASTA file containing the validated target sequences queried (e.g. N's are not tolerated in query sequences).

Options input file

Summary of queried design settings in an 'OPTION=VALUE' style.

Targetgroups input file

Targetgroups file used for this run.

Input file for intended targets

Input file mapping input sequences to their intended targets.


Further the names of all databases used for this run are reported.


Links to output report files



Error log file

This file reports all design errors that occurred, e.g. sequences that were excluded from the input (because they were too short, contained invalid characters etc.), sequences that could not be covered by designs or for which fewer designs than requested were found, and sequences that required a re-design run (if enabled by the user).

NEXT-RNAi report file

Report of all steps done by NEXT-RNAi in this run.

Failed design(s)

This file contains a list of identifiers of all queried target sequences where no reagent design was possible using the adjusted settings.


Outputs specific to siRNA reagents


The design and evaluation of siRNAs comes with specific outputs described below.


siRNA information



siRNA sequence

siRNA sense-strand sequence (5'-3').

Position in queried target

Starting position of the siRNA in the queried target sequence (starting at 0).

Length [nt]

Length (in nt) of the siRNA.

siRNA location(s)

Location of the siRNA in the genome of the organism in the format chromosome:start..end(orientation). For some organisms, where the genome is not completely assembled, the siRNA is mapped to contigs.


Reagent quality (siRNA)



siRNAs [19 nt]

Number of siRNA sequences.

On-target

Number of specific siRNAs, only targeting transcripts of a single (intended) target gene.

Off-target

Number of unspecific siRNAs targeting transcripts of unintended genes besides those of the intended target gene.

No-target

Number of siRNAs with no target. This might occur, e.g., if the siRNA spans an intron.

SCF

Number of seed matches (seed complement frequency) to the defined seed-match databases (see above).

Efficiency score

Efficiency score of the designed siRNA according to the selected efficiency calculation method ('rational' or 'weighted'). The efficiency calculation is based on 19 nt siRNAs and can take values between 0 and 100. A more detailed description of the efficiency prediction is available here.

CAN

Number of stretches with at least 6 contiguous CA[ACGT] (CAN) repeats. This value optimally should be 0.


Evaluation of siRNA pools


The evaluation of pooled siRNA reagents is a summarized report of the properties of the single siRNAs. The order of siRNA sequences, locations, efficiencies etc. equals the order of the siRNA identifiers in the header (separated by comma). An example is presented below.



