Beyond the Genome 2013

2013 Beyond the Genome Informatics Challenge: Metagenomics Variant Encoding

Sven-Eric Schelhorn and Michael Schatz

The goal of this challenge is to identify a secret message that we encoded as variants in a metagenomics sample. The sample was generated by mixing portions of the reference sequences of several microbial species, and sequence reads were simulated from these portions. Within each portion of a reference sequence, a foreign insert (i.e., one not originating from any of the microbial species) was placed. These inserts encode a message. See the presentation for all the details. Download the data here: btg2013.tgz

You can use the included dna-encode.pl script to decode the message. It uses the algorithm defined in GM Church, Y Gao, and S Kosuri. (2012) Next-Generation Digital Information Storage in DNA. Science. DOI: 10.1126/science.1226355
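
As a rough illustration of the one-bit-per-base idea in Church et al., the sketch below maps A or C to 0 and G or T to 1 and packs eight bits into each byte. It is a minimal Python sketch, not a reimplementation of dna-encode.pl: the function names, the most-significant-bit-first ordering, and the omission of any address blocks are assumptions, so use the included script for the actual challenge.

```python
import random

# Assumed mapping after Church et al. (2012): one bit per base.
BASE_TO_BIT = {"A": "0", "C": "0", "G": "1", "T": "1"}
BIT_TO_BASES = {"0": "AC", "1": "GT"}


def encode_text(text, rng=None):
    """Encode ASCII text as DNA, picking one of the two allowed bases per bit."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    bits = "".join(f"{byte:08b}" for byte in text.encode("ascii"))
    return "".join(rng.choice(BIT_TO_BASES[bit]) for bit in bits)


def decode_dna(seq):
    """Decode DNA back to text.

    Trailing bases that do not fill a whole byte are ignored.
    Bases other than A, C, G, T (for example N) raise KeyError.
    """
    bits = "".join(BASE_TO_BIT[base] for base in seq.upper())
    usable = len(bits) - len(bits) % 8
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
    return data.decode("ascii", errors="replace")
```

Because each bit can be written with either of two bases, many different DNA strings decode to the same text; `decode_dna(encode_text("BTG"))` round-trips under the assumptions above.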

If you have Perl installed on your computer, you can get the documentation for the dna-encode.pl script with the following command:

  $ perldoc dna-encode.pl

You can get a brief synopsis of the script with this command:

  $ perl dna-encode.pl --help

The files in the archive, including the types of reads in each FASTQ (.fastq.gz) file, are described below.

dna-encode.pl           Perl script to encode/decode text to/from DNA

sh_end_{1,2}.fastq.gz   Paired-end read data from the mixed references,
                        FASTQ format, 2x250bp from 1000+/-50bp fragments

lo_end_{1,2}.fastq.gz   Paired-end read data from the mixed references,
                        FASTQ format, 2x150bp from 5300+/-500bp fragments
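
Before choosing tools, it can help to confirm that the read files match the description above. The following is a minimal Python sketch under stated assumptions: the function names are hypothetical, each FASTQ record is assumed to span exactly four lines (no wrapped sequences), and mates are assumed to appear in the same order in both files.

```python
import gzip
from itertools import islice


def read_fastq(path):
    """Yield (name, sequence, quality) from a gzipped FASTQ file with 4-line records."""
    with gzip.open(path, "rt") as handle:
        while True:
            record = list(islice(handle, 4))
            if len(record) < 4:
                break  # end of file (or a truncated final record)
            header, seq, _plus, qual = (line.rstrip("\n") for line in record)
            # Drop the leading '@' and any description after the first space.
            yield header[1:].split()[0], seq, qual


def summarize_pairs(path_1, path_2, limit=1000):
    """Count up to `limit` read pairs and collect the read lengths seen in each file."""
    lengths_1, lengths_2, pairs = set(), set(), 0
    paired = zip(read_fastq(path_1), read_fastq(path_2))
    for (_, seq_1, _), (_, seq_2, _) in islice(paired, limit):
        lengths_1.add(len(seq_1))
        lengths_2.add(len(seq_2))
        pairs += 1
    return pairs, lengths_1, lengths_2
```

For example, `summarize_pairs("sh_end_1.fastq.gz", "sh_end_2.fastq.gz")` returns the number of pairs inspected and the sets of read lengths observed in each file, which can be compared against the lengths listed above.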

Hints

It’s a metagenomic sample: choose your tools accordingly.
After you have identified an insert, you need to identify the insert wildtype. There are several ways to distinguish it from the variants: BLAST, consensus, pairwise similarities...
NCBI BLAST may be unreliable due to the government shutdown. If it is, try the public BLAST server at EBI/EMBL (WU-BLAST).
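
The consensus and pairwise-similarity hints can be sketched as follows. This is a minimal Python illustration, not the solution method: it assumes the recovered copies of an insert are already aligned and of equal length, and that each variant position differs from the wildtype in only a minority of copies. The function names and any example sequences are hypothetical.

```python
from collections import Counter


def consensus(seqs):
    """Column-wise majority base of equal-length, pre-aligned sequences."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be non-empty and aligned to the same length")
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*seqs))


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def closest_to_consensus(seqs):
    """Return (candidate wildtype, consensus): the copy with the fewest
    mismatches to the consensus, plus the consensus itself."""
    cons = consensus(seqs)
    return min(seqs, key=lambda s: hamming(s, cons)), cons
```

Positions where a copy disagrees with the consensus are then the candidate variants for that copy. If the assumptions above do not hold for the assembled inserts, a BLAST search of each copy is the more reliable check.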

Solution Guide

Solution Guide available here