RNA-Seq

1471-2105-11-422.pdf (application/pdf Object)

DESeq vs. edgeR vs. baySeq | Musings. 6th April 2012: For a more updated version of this post, please see this post.

A very simple comparison. Using the TagSeqExample.tab file from the DESeq package as the benchmark dataset. According to the DESeq authors, T1a and T1b are similar, so I removed the second column in the file, corresponding to T1a.

Hierarchical clustering. To get an idea of how similar the libraries are, I performed hierarchical clustering using the Spearman correlation of libraries (see here). Note that these raw reads have not been adjusted for sequencing depth, but hopefully by using a ranked correlation method we can bypass this necessity. From the dendrogram, we observe the marked difference between the normal samples and the tumour samples.

Installing the R packages. I then installed all the required packages (R version 2.12.0, biocinstall version 2.7.4).

DESeq. DESeq code adapted from the vignette.

edgeR. First transform the tag_seq_example_less_t1a.tsv file to the edgeR format using this script.

baySeq. Overlap.
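The sequencing-depth remark in the clustering step above can be checked with a small sketch. This is not the post's code (the post used R). It is a minimal standard-library illustration that computes Spearman correlation as the Pearson correlation of ranks. The helper names `ranks`, `pearson` and `spearman` are hypothetical, and the count vectors are made-up toy values, not data from TagSeqExample.tab.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(a), ranks(b))


# Toy tag counts for one library (hypothetical values).
library = [10, 0, 250, 33, 7]
# The same library sequenced three times deeper (every count scaled by 3).
deeper = [3 * count for count in library]

rho_scaled = spearman(library, deeper)
rho_reversed = spearman(library, list(reversed(sorted(library))))
print(rho_scaled, rho_reversed)
```

Scaling a library by a constant leaves its ranks unchanged, so the rank correlation between the two toy libraries stays at 1. A distance such as 1 - rho could then be passed to a hierarchical clustering routine.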

FDR

DESeq.pdf (application/pdf Object) Hansen.

Counting reads in features with htseq-count — HTSeq v0.5.0p4 documentation. Given a file with aligned sequencing reads and a list of genomic features, a common task is to count how many reads map to each feature. A feature is here an interval (i.e., a range of positions) on a chromosome or a union of such intervals. In the case of RNA-Seq, the features are typically genes, where each gene is considered here as the union of all its exons. One may also consider each exon as a feature, e.g., in order to check for alternative splicing. For comparative ChIP-Seq, the features might be binding regions from a pre-determined list.

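Special care must be taken to decide how to deal with reads that overlap more than one feature. The three overlap resolution modes of htseq-count work as follows. For each position i in the read, a set S(i) is defined as the set of all features overlapping position i. The set S is then the union of all the sets S(i) for mode union, the intersection of all the sets S(i) for mode intersection-strict, and the intersection of all non-empty sets S(i) for mode intersection-nonempty. If S contains precisely one feature, the read (or read pair) is counted for this feature. If S contains more than one feature, the read is counted as ambiguous, and if S is empty, it is counted as no_feature. The following figure illustrates the effect of these three modes:

Usage: htseq-count [options] <alignment_file> <gff_file>

Options: -f <format>, --format=<format>

Testing/benchmarking RNASeq DE tools.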
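The htseq-count overlap resolution rule excerpted above can be sketched in a few lines. This is an illustrative re-implementation, not HTSeq code. The mode names intersection-strict and intersection-nonempty follow the HTSeq documentation, while `features_at`, `resolve_read`, the half-open coordinates and the two toy genes are assumptions made for the example.

```python
def features_at(pos, features):
    """S(i): names of all features whose half-open interval covers position pos."""
    return {name for name, (start, end) in features.items() if start <= pos < end}


def resolve_read(read_start, read_end, features, mode="union"):
    """Assign a read covering [read_start, read_end) to a feature, or to
    'no_feature' / 'ambiguous', following the three overlap resolution modes."""
    per_position = [features_at(p, features) for p in range(read_start, read_end)]
    if mode == "union":
        s = set().union(*per_position)
    elif mode == "intersection-strict":
        s = set.intersection(*per_position) if per_position else set()
    elif mode == "intersection-nonempty":
        nonempty = [x for x in per_position if x]
        s = set.intersection(*nonempty) if nonempty else set()
    else:
        raise ValueError(f"unknown mode: {mode}")
    if len(s) == 1:
        return next(iter(s))
    return "no_feature" if not s else "ambiguous"


# Two hypothetical genes that overlap on positions 90..99.
FEATURES = {"geneA": (0, 100), "geneB": (90, 200)}

for mode in ("union", "intersection-strict", "intersection-nonempty"):
    # The first read starts in the overlap and ends inside geneB only;
    # the second read runs past the end of geneB.
    print(mode, resolve_read(95, 120, FEATURES, mode), resolve_read(190, 210, FEATURES, mode))
```

In this toy case, a read that starts in the overlap of the two genes is ambiguous under union but assigned to geneB under both intersection modes. A read that runs off the end of geneB is counted for geneB under union and intersection-nonempty, but becomes no_feature under intersection-strict.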

I've commented on this subject several times before, but as this is scattered through the forums, I'll try a comprehensive answer. My apologies to advanced readers for repeating many basic facts. The reason you see such strong differences between DESeq and edgeR on the one hand and cuffdiff on the other hand is not because of the different methods used. Rather, there are two fundamentally different questions you may ask when analysing expression data: (A) Is the concentration of a given transcript different in two samples? (B) May this difference be attributed to the difference in experimental conditions, i.e., may we be confident that it is due to the experimental treatment and not due to fluctuations not under the experimenter's control (not due to "biological variation")?

Allow me to discuss these in detail: (A) Given two samples, you measure the concentration of your transcript of interest by some technique (here RNA-Seq, but could also be microarray and RNA-Seq).

Converting FPKM from Cufflinks to raw counts for DESeq. Quote: If it were that each read maps to one transcript, you could multiply the FPKM values with the transcript length to get raw counts again. However, the whole point of cufflinks is to deal with the fact that most reads will map to several transcripts, and each read can hence influence the FPKM values of all these transcripts, and it will definitely not augment each count by one. A crucial aspect of DESeq and edgeR is that they assess the shot noise by assuming that each counting unit is evidence of one sequencing read, and hence the counting noise follows a Poisson distribution.

In cufflinks' output this is not the case. Instead, cufflinks calculates the uncertainty for you using more involved math. There are two fundamentally different tasks that you should not mix up, and that are served by different tools: 1. 2. This is what DESeq and edgeR do. The tool that sits awkwardly in between the two use cases is cuffdiff.

Alternative splicing in RNA-Seq. Ok thanks guys, we have run DEXSeq and the results look more promising. I have a question regarding statistics: We ran 2 experiments, the first was mum vs affected female (4 replicates each) and the second was scrambled siRNA vs knockdown (4 replicates each) - the affected female has a mutation in the same gene we knocked down.

In the mum vs affected female comparison, DEXSeq detected 2951 genes with differential exon usage at FDR 0.1. However, in the scrambled vs knockdown comparison it only detected 81 genes at FDR 0.1 (39 of these were also present in the mum vs affected female list). I was wondering whether we could be less stringent with our cut-off for the knockdown data - when we filter using p-value<0.01, for example, we get 1057 genes, with 360 of them also found in the patient data. I know next to nothing about stats, but are p-values completely worthless in this type of analysis?

Nbt.1621-S1.pdf (application/pdf Object)

RnaSeqAnalysis < Cores/BioinformaticsCore < TWiki.
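The difference between a raw p-value cut-off and an FDR cut-off, raised in the DEXSeq question above, can be illustrated with the Benjamini-Hochberg adjustment. This is a minimal sketch and not DEXSeq's code. The function name `benjamini_hochberg` and the five p-values are hypothetical, chosen only to show the mechanics.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each adjusted value is the
    # minimum of p * m / rank over this rank and all larger ranks.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values for five tests.
raw = [0.001, 0.008, 0.039, 0.041, 0.9]
padj = benjamini_hochberg(raw)

passing_raw = sum(p < 0.05 for p in raw)
passing_fdr = sum(p < 0.05 for p in padj)
print(padj, passing_raw, passing_fdr)
```

A raw cut-off ignores how many tests were run. When all m tested genes are null, roughly alpha times m of them are still expected to pass p < alpha by chance. The adjusted values account for the number of tests, which is why fewer hits survive the same nominal threshold.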

RNA-seq starts with a fastq-format file with a whole bunch (millions!) of base pair reads. The idea is to map these reads to the genome, and then see to what degree various genes were represented in the set of reads. We have three pipelines for doing this:

bowtie -> tophat -> cufflinks -> cuffcompare -> cuffdiff
bowtie -> tophat -> Scripture
bowtie -> tophat -> rankexpectation

bowtie maps reads to the genome. tophat handles alt-spliced reads and splice junctions.

Bowtie Genome Index (For bowtie / tophat)

You have to have an index specified in order for bowtie / tophat to run properly! You can find relevant files for the bowtie alignment indices here: . After you download or generate the index, you should store it locally in /work/Common/Data/Bowtie_Indexes , which is a symlink to /opt/Bio/bowtie/current-bowtie/indexes/ . You'll see a bunch of files like mm9.1.ebwt . You can also inspect the current status of the Bowtie indexes with:

Tophat, Cufflinks and replicates. I double checked the Cuffdiff code, and I didn't see any problems that would be leading to incorrect behavior. Therefore, I am confident that it is reporting the correct results according to our method. Perhaps this is an explanation for the differences: Cufflinks calculates FPKM at the gene level by summing the transcript-level FPKMs. While it might seem at first that this is equivalent to calculating the expression of the gene based on its read counts, this is not the case. Consider a gene of length 100 (sum of all exon lengths) that has 3 isoforms (A, B, and C) with lengths 25, 50, and 100, respectively.
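The arithmetic behind this example can be written out as a short sketch. It uses the isoform lengths given above and the read split quoted in the continuation of the reply (50, 30 and 20 reads for A, B and C in sample 1). The per-kilobase and per-million scaling is dropped, as in the reply's own simplified numbers, and the helper name `density` is hypothetical. This is not Cufflinks code.

```python
# Isoform lengths and gene length from the example.
lengths = {"A": 25, "B": 50, "C": 100}
gene_length = 100

# Reads assigned to each isoform in sample 1, as quoted in the reply.
reads_sample1 = {"A": 50, "B": 30, "C": 20}


def density(reads, length):
    """Reads per unit length (scaling constants omitted, as in the example)."""
    return reads / length


# Gene-level value from total read count and total exon length.
count_based = density(sum(reads_sample1.values()), gene_length)

# Gene-level value obtained by summing the isoform-level values.
summed = sum(density(reads_sample1[name], lengths[name]) for name in lengths)

print(count_based, summed)
```

Here the count-based value is 1, while the sum over isoforms is about 2.8 (2 + 0.6 + 0.2). Because short isoforms contribute more per read, a different split of the same 100 reads among the isoforms would change the summed value even though the count-based value stays at 1.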

If in sample 1, the gene has 100 reads mapping to it, the RPKM would be 1. However, at the isoform level, Cufflinks may find that 50 of these reads come from isoform A, 30 from B, and 20 from C. If in sample 2, the gene again has 100 reads, the RPKM would still be 1.

Package Index : HTSeq 0.5.0p4.

Quantile normalization for RNA seq data?

Cufflinks with -G option give 0 FPKM, why?? Hello All, I have a Drosophila GTF file which I created, since I wanted to use the latest release from FlyBase and the UCSC annotated GTF is from 2006. Anyway, here is what it looks like… After running the latest version of Tophat with the one-directional RNA-Seq reads, I get a .bam file. Using this, my input to cufflinks was… cufflinks -p 4 -G ALLEXONS.gtf -o . After running this using the LSF command 'bsub', I get an error output file, the start of which looks like this: GFF warning: merging adjacent/overlapping segments of FBtr0084817 on 3R (21094383-21094697, 21094700-21095435) [20:04:14] Inspecting reads and determining fragment length distribution.

> Processing Locus 2L:21918-25151 [ ] 0%
> Processing Locus 2L:76445-77211 [ ] 0%
> Processing Locus 2L:102381-106718 [ ] 0%

I have three output files, the smallest of which is the genes.expr file and the largest is the transcripts.gtf file. 1) How will I know if cufflinks has accepted the GTF file that I created correctly or not? Regards, Abhijit.

Maize Endosperm RNAseq: Cufflinks.

Counting reads in features with htseq-count — HTSeq v0.5.0 documentation. Given a file with aligned sequencing reads and a list of genomic features, a common task is to count how many reads map to each feature. A feature is here an interval (i.e., a range of positions) on a chromosome or a union of such intervals.

Cufflinks :: Center for Bioinformatics and Computational Biology.

What is Cufflinks? Cufflinks is a program that assembles aligned RNA-Seq reads into transcripts, estimates their abundances, and tests for differential expression and regulation transcriptome-wide. Cufflinks runs on Linux and OS X. Cufflinks is described in our recent paper, and much of the algorithmic and mathematical material is presented in the supplemental methods.

How does Cufflinks assemble transcripts? Cufflinks constructs a parsimonious set of transcripts that "explain" the reads observed in an RNA-Seq experiment. It does so by reducing the comparative assembly problem to a problem in maximum matching in bipartite graphs.

While Cufflinks works well with unpaired RNA-Seq reads, it is designed with paired reads in mind. This matching is then extended to a minimum cost path cover of the DAG, with each path representing a different transcript.

How does Cufflinks calculate transcript abundances? How does Cufflinks estimate the fragment length distribution?

TopHat :: Center for Bioinformatics and Computational Biology.

What is TopHat? TopHat is a program that aligns RNA-Seq reads to a genome in order to identify exon-exon splice junctions. It is built on the ultrafast short read mapping program Bowtie. TopHat runs on Linux and OS X.

What types of reads can I use TopHat with? TopHat was designed to work with reads produced by the Illumina Genome Analyzer, although users have been successful in using TopHat with reads from other technologies.

How does TopHat find junctions? TopHat can find splice junctions without a reference annotation. Short read sequencing machines can currently produce reads 100bp or longer, but many exons are shorter than this, so they would be missed in the initial mapping.

TopHat generates its database of possible splice junctions from two sources of evidence.

Prerequisites. To use TopHat, you will need the following programs in your PATH: bowtie2 and bowtie2-align (or bowtie), bowtie2-inspect (or bowtie-inspect), bowtie2-build (or bowtie-build), and samtools.

Obtaining and installing TopHat.