Beyond the Genome 2011

2012 Beyond the Genome Informatics Challenge: Digital Encoding

James David Dooling, Michael Schatz, and James Taylor

The goal of this challenge is to identify a secret message inserted into an unknown microbial genome. See the presentation for all the details. Download the data here: btg12.tgz

The tarball contains several read sets that are the starting point for the challenge. The reads were generated by taking a portion of an organism's reference sequence and inserting a DNA-encoded famous quote into the sequence. Your challenge is to identify the inserted sequence, decode the quote, and identify its speaker.

You can use the included dna-encode.pl script to decode the message. It uses the algorithm defined in G. M. Church, Y. Gao, and S. Kosuri (2012), "Next-Generation Digital Information Storage in DNA," Science, DOI: 10.1126/science.1226355.
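As a rough illustration of the encoding idea in that paper, each base carries one bit: A or C stands for 0, and G or T stands for 1. The Python sketch below decodes a DNA string under that mapping. The function names are hypothetical, and the choice of 8 bits per character with the most significant bit first is an assumption; the included dna-encode.pl script is the authoritative implementation and may differ in these details.

```python
def dna_to_bits(seq):
    """Map each base to one bit: A or C -> 0, G or T -> 1."""
    bit_for_base = {"A": "0", "C": "0", "G": "1", "T": "1"}
    return "".join(bit_for_base[base] for base in seq.upper())


def bits_to_text(bits):
    """Turn a bit string into text.

    Assumption: 8 bits per character, most significant bit first.
    Trailing bits that do not fill a whole byte are ignored.
    """
    usable = len(bits) - len(bits) % 8
    return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, usable, 8))


def decode_dna(seq):
    """Decode a DNA string into text under the assumptions above."""
    return bits_to_text(dna_to_bits(seq))


if __name__ == "__main__":
    # 'H' is 01001000 and 'i' is 01101001 in ASCII.
    print(decode_dna("AGAATAACCTGATCAG"))
```

Because two bases map to each bit value, many different DNA strings decode to the same text, so an encoder is free to choose between A/C and G/T at each position.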

If you have Perl installed on your computer, you can get the documentation for the dna-encode.pl script with the following command:

  $ perldoc dna-encode.pl

For a brief synopsis of the script's usage, run:

  $ perl dna-encode.pl --help

The types of reads in each FASTQ (.fq) file are described in detail below.

i2x100f180.1.fq   Read 1 of Illumina 2x100 reads from 180+/-20 bp fragments
i2x100f180.2.fq   Read 2 of Illumina 2x100 reads from 180+/-20 bp fragments
i2x50f2000.1.fq   Read 1 of Illumina 2x50 reads from 2+/-0.2 kbp fragments
i2x50f2000.2.fq   Read 2 of Illumina 2x50 reads from 2+/-0.2 kbp fragments
i2x250f700.fq     Interleaved reads 1 and 2 of Illumina 2x250 reads from
                  700+/-50 bp fragments
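Some tools expect read 1 and read 2 in separate files, so the interleaved file may need to be split before use. The Python sketch below reads FASTQ records and separates an interleaved stream into its two mates. It assumes the plain four-lines-per-record FASTQ layout and strict alternation of read 1 and read 2; the function names are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) tuples.

    Assumption: each record is exactly four lines
    (header, sequence, '+' separator, quality).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def split_interleaved(handle):
    """Return (read1_records, read2_records) from an interleaved stream.

    Assumption: records strictly alternate read 1, read 2, read 1, ...
    """
    read1, read2 = [], []
    for index, record in enumerate(read_fastq(handle)):
        (read1 if index % 2 == 0 else read2).append(record)
    return read1, read2


if __name__ == "__main__":
    # Tiny made-up example standing in for a real interleaved file.
    example = io.StringIO(
        "@pair1/1\nACGT\n+\nIIII\n"
        "@pair1/2\nTTGA\n+\nIIII\n"
    )
    first, second = split_interleaved(example)
    print(len(first), len(second))
```

To run it on the real data, open i2x250f700.fq with open() and pass the file handle in place of the StringIO object.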

Hints

See this presentation for background on genome assembly and whole genome alignment. Try assembling the reads, BLASTing the contigs to identify the microbe, and then aligning the contigs to the reference to identify the inserted sequence. Finally, decode the inserted sequence using the included dna-encode.pl script.
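The last alignment step can be pictured with a toy example. The Python sketch below locates a single insertion by trimming the longest shared prefix and suffix of a reference region and a contig. It assumes the two sequences are on the same strand, start and end at the same reference positions, and differ only by one insertion; real contigs will usually need a whole genome aligner instead. The function name is hypothetical.

```python
def find_insertion(reference, contig):
    """Return (position, inserted_sequence) for a single insertion.

    Assumptions: reference and contig cover the same region on the same
    strand and differ only by one inserted block in the contig.
    """
    limit = min(len(reference), len(contig))

    # Length of the longest common prefix.
    prefix = 0
    while prefix < limit and reference[prefix] == contig[prefix]:
        prefix += 1

    # Length of the longest common suffix that does not overlap the prefix.
    suffix = 0
    while (suffix < limit - prefix
           and reference[-1 - suffix] == contig[-1 - suffix]):
        suffix += 1

    return prefix, contig[prefix:len(contig) - suffix]


if __name__ == "__main__":
    # Made-up sequences: 'GGG' inserted after the first four bases.
    print(find_insertion("ACGTACGT", "ACGTGGGACGT"))
```

When the inserted block happens to begin with the same bases that follow it in the reference, the reported position and sequence can shift by a few bases, so it is worth trying neighboring offsets if the decoded text looks garbled.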

Solution Guide

The solution guide is available here.