RNA-seq Exercises

Exercise 1: Visualizing Expression

Dr. Jones is interested to identify which of her favorite 100 genes in E.coli are differentially expressed over 10 timepoints. See if you can't help her out! In this first exercise, the gene expression matrix will be provided for you, in later exercises you will learn how to construct it from the sequence data.

The file expression.txt has a gene expression matrix for 100 genes (gene_1 through gene_100) over 10 timepoints (exp_1 through exp_10). Most genes have a stable background expression level, but some special genes show increased expression over the timecourse and some show decreased expression. See if you can characterize the background expression level, and identify the special genes.

This dataset is small enough and simple enough that you could probably answer this by inspecting the file by hand. Nevertheless, the point is to give you a warm up exercise working with data in a scripting language like Python, Perl or R to make plots, fit values to a distribution, etc.

You can download the Python development environment Canopy from: Enthought

You can download R for Mac from: http://cran.cnr.berkeley.edu/bin/macosx/

Hint: google for "python matplotlib", "python read file", OR "R read.table", "R lines", "R heatmap", "R hist", etc

Exercise 2: Reads to Expression

Dr. Jones is interested to identify which of her favorite 100 genes in E.coli are differentially expressed over 10 timepoints. See if you can't help her out! Unlike the first exercise which started from the expression matrix, this time you will learn how to align the RNA-sequencing (RNAseq) reads to the reference and compute the coverage levels of the genes.


I posted a tarball with the data for the challenge on my website: rnaseq.2.pileup.tgz

  1. ecoli.fa is the reference genome
  2. refgenes.ptt has a list of the 100 genes of interest
  3. tX.1.fq, tX.2.fq are the paired-end RNAseq reads collected from timepoint X for 1<=X<=10.

The reads are 50bp long from 200bp fragments with errors at a constant 2% error rate.


  1. Construct an expression matrix with the average depth of coverage across each gene
  2. Plot of expression matrix (parallel coordinates, heatmap, etc)
  3. List of neutral, increasing, decreasing, and other "interesting" genes.

Getting Started

BWA is a widely used tool for mapping individual reads to a reference genome, and SAMTools is a widely used program for scanning the read alignments to find & report variations or measure coverage. Download and install BWA (http://bio-bwa.sourceforge.net) and SAMTools (http://samtools.sourceforge.net/) to learn how to run the tools on a small genome with simulated reads. Both run on any Unix or Mac system (possibly Windows under Cygwin).

The basic steps are:
  1. 'bwa index' to index the reference genome (only needs to be done once per genome).
  2. 'bwa aln' to align the reads
  3. 'bwa sampe' to report the alignments.
After aligning, run SAMTools to find variations. The basic steps are
  1. 'samtools faidx' to index the reference genome
  2. 'samtools view' to load the alignments
  3. 'samtools index' to index the BAM file
  4. 'samtools mpileup' to find variations.
  5. 'samtools view' is a useful command to inspect how the reads align to the genome at a given position

Here is a more detailed description on how to use mpileup: http://samtools.sourceforge.net/mpileup.shtml

Project Sketch

Index genome (bwa index)
For each time point
  Align reads to genome (bwa aln, bwa sampe)
  Convert and sort alignments (samtools view, samtools sort)
  Report coverage at each position samtools (samtools depth)
  Compute average depth of each exon (your own code)
Merge depths into expression matrix (your own code)
Display heatmap / timeseries, identify special genes (your own R code)

Exercise 3: Tuxedo Tools

The leading RNA-seq analysis pipeline uses a collection of tools: Bowtie, TopHat, Cufflinks (collectively called the tuxedo tools).

In this exercise, you will analyze the same read data as in Exercise 2, but this time you will use Bowtie/TopHat/Cufflinks for the analysis instead of your own pipeline. Once you have completed the run, correlate the coefficient of your depth of coverage pipeline and the Cufflinks RPKM values. Plot these values, and draw the linear regression curve