RNA-seq Exercises
Exercise 1: Visualizing Expression
Dr. Jones is interested to identify which of her favorite 100 genes in E.coli
are differentially expressed over 10 timepoints. See if you can't help her out!
In this first exercise, the gene expression matrix will be provided for you, in
later exercises you will learn how to construct it from the sequence data.
The file expression.txt has a gene expression matrix for
100 genes (gene_1 through gene_100) over 10 timepoints (exp_1 through exp_10).
Most genes have a stable background expression level, but some special genes
show increased expression over the timecourse and some show decreased
expression. See if you can characterize the background expression level, and
identify the special genes.
This dataset is small enough and simple enough that you could probably answer
this by inspecting the file by hand. Nevertheless, the point is to give you a
warm up exercise working with data in a scripting language like Python, Perl or R
to make plots, fit values to a distribution, etc.
You can download the Python development environment Canopy from:
Enthought
You can download R for Mac from:
http://cran.cnr.berkeley.edu/bin/macosx/
Hint: google for "python matplotlib", "python read file", OR "R read.table", "R lines", "R heatmap", "R hist", etc
Exercise 2: Reads to Expression
Dr. Jones is interested to identify which of her favorite 100 genes in E.coli
are differentially expressed over 10 timepoints. See if you can't help her out!
Unlike the first exercise which started from the expression matrix, this time
you will learn how to align the RNA-sequencing (RNAseq) reads to the reference
and compute the coverage levels of the genes.
Inputs
I posted a tarball with the data for the challenge on my website:
rnaseq.2.pileup.tgz
- ecoli.fa is the reference genome
- refgenes.ptt has a list of the 100 genes of interest
- tX.1.fq, tX.2.fq are the paired-end RNAseq reads collected from timepoint X for 1<=X<=10.
The reads are 50bp long from 200bp fragments with errors at a constant 2% error rate.
Outputs
- Construct an expression matrix with the average depth of coverage across each gene
- Plot of expression matrix (parallel coordinates, heatmap, etc)
- List of neutral, increasing, decreasing, and other "interesting" genes.
Getting Started
BWA is a widely used tool for mapping individual reads to a reference genome, and SAMTools
is a widely used program for scanning the read alignments to find & report variations
or measure coverage. Download and install BWA (http://bio-bwa.sourceforge.net)
and SAMTools (http://samtools.sourceforge.net/) to learn how to run the tools on a
small genome with simulated reads. Both run on any Unix or Mac system (possibly Windows under Cygwin).
The basic steps are:
- 'bwa index' to index the reference genome (only needs to be done once per genome).
- 'bwa aln' to align the reads
- 'bwa sampe' to report the alignments.
After aligning, run SAMTools to find variations. The basic steps are
- 'samtools faidx' to index the reference genome
- 'samtools view' to load the alignments
- 'samtools index' to index the BAM file
- 'samtools mpileup' to find variations.
- 'samtools view' is a useful command to inspect how the reads align to the genome at a given position
Here is a more detailed description on how to use mpileup: http://samtools.sourceforge.net/mpileup.shtml
Project Sketch
Index genome (bwa index)
For each time point
Align reads to genome (bwa aln, bwa sampe)
Convert and sort alignments (samtools view, samtools sort)
Report coverage at each position samtools (samtools depth)
Compute average depth of each exon (your own code)
Merge depths into expression matrix (your own code)
Display heatmap / timeseries, identify special genes (your own R code)
Exercise 3: Tuxedo Tools
The leading RNA-seq analysis pipeline uses a collection of tools:
Bowtie, TopHat,
Cufflinks (collectively called the tuxedo tools).
In this exercise, you will analyze the same read data as in Exercise 2, but this time you will
use Bowtie/TopHat/Cufflinks for the analysis instead of your own pipeline. Once you have
completed the run, correlate the coefficient of your depth of coverage pipeline and the
Cufflinks RPKM values. Plot these values, and draw the linear regression curve
|