Genome Assembly Exercises

Exercise 1: Whole Genome Alignment

Immediately after completing an assembly, the newly assembled sequence is often compared to either a prior assembly or the genome of a related species. This computation is called whole genome alignment. One of the leading tools for whole genome alignment is nucmer, distributed with MUMmer and available on SourceForge at http://mummer.sf.net.

In this exercise you will compare the two genome files available in: wga.challenge1.tgz

For this exercise the most important tools are:

nucmer: align the two sequences
delta-filter: filter out repetitive alignments
show-coords: display the alignment information
mummerplot: draw a dotplot of the alignments
dnadiff: summarize differences between the sequences

Question 1: How many insertions, deletions, and rearrangements are there between the two sequences?

Hint: check out mummerplot and dnadiff
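
The tools listed above are usually chained in the order shown. The sketch below only builds and prints the command lines; it does not run them. The file names ref.fasta and qry.fasta, the prefix "out", and the name out.filter.delta are placeholders rather than names from the challenge archive. The flags are the ones we would reach for with MUMmer 3 (-1 for one-to-one filtering, -rcl for sorted coordinates with coverage and lengths), so check each tool's help output before relying on them.

```python
import shlex

def mummer_commands(ref, qry, prefix="out"):
    """Return the alignment steps as argument lists (nothing is executed)."""
    delta = prefix + ".delta"            # written by nucmer
    filtered = prefix + ".filter.delta"  # hypothetical name for the filtered delta
    return [
        ["nucmer", "-p", prefix, ref, qry],
        # delta-filter prints to stdout; redirect it into `filtered` when running
        ["delta-filter", "-1", delta],
        ["show-coords", "-rcl", filtered],
        ["mummerplot", "--png", "-p", prefix, filtered],
        # separate prefix so dnadiff does not overwrite the nucmer output
        ["dnadiff", "-p", prefix + "_diff", ref, qry],
    ]

for cmd in mummer_commands("ref.fasta", "qry.fasta"):
    print(shlex.join(cmd))
```

Keeping each step as an argument list makes it easy to pass to subprocess.run later, with the delta-filter output redirected to a file.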

Exercise 2: Assembly with ALLPATHS-LG

You will now assemble a microbe using one of the leading algorithms, ALLPATHS-LG. For this exercise, we will use sequencing data from Staphylococcus aureus distributed as part of the Genome Assembly Gold-Standard Evaluations (GAGE) project. You will then compare your new assembly to the reference genome to evaluate its accuracy.

Steps:

Download and install ALLPATHS

Download the S. aureus sequencing reads

Assemble the reads using ALLPATHS-LG:

Prepare the in_libs.csv and in_groups.csv files (see the ALLPATHS-LG manual)
Prepare the data using PrepareAllPathsInputs.pl
Launch ALLPATHS-LG using RunAllPathsLG
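
As a starting point for the two CSV files, here is a minimal sketch that renders them as text. The column names follow the layout we recall from the ALLPATHS-LG manual, so compare them against your copy of the manual. Every value in the rows (library names, project name, insert sizes, standard deviations, and the file globs) is a placeholder that you must replace with the details of the reads you downloaded.

```python
import csv
import io

LIBS_HEADER = ["library_name", "project_name", "organism_name", "type", "paired",
               "frag_size", "frag_stddev", "insert_size", "insert_stddev",
               "read_orientation", "genomic_start", "genomic_end"]
GROUPS_HEADER = ["group_name", "library_name", "file_name"]

def render_csv(header, rows):
    """Render a header plus rows as CSV text with plain newlines."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()

# Placeholder values: one fragment library and one jumping library.
in_libs = render_csv(LIBS_HEADER, [
    ["frag_lib", "gage", "S.aureus", "fragment", 1, 180, 20, "", "", "inward", 0, 0],
    ["jump_lib", "gage", "S.aureus", "jumping", 1, "", "", 3500, 500, "outward", 0, 0],
])
in_groups = render_csv(GROUPS_HEADER, [
    ["frag", "frag_lib", "frag_*.fastq"],   # hypothetical file glob
    ["jump", "jump_lib", "jump_*.fastq"],   # hypothetical file glob
])

print(in_libs)
print(in_groups)
```

Write the two strings to in_libs.csv and in_groups.csv, then point PrepareAllPathsInputs.pl at them as described in the manual.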

Align the contigs to the reference genome (included with the read data) with nucmer

Summarize the quality of the assembly (also see assembly.report)

Question 2: How many insertions, deletions, and rearrangements are there between the assembled sequence and the reference?
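
If you ran dnadiff, the counts for this question can be pulled from its report file. The sketch below assumes the report lists each feature as a label followed by a reference column and a query column, and that labels such as Insertions, Relocations, Translocations, Inversions, TotalSNPs and TotalIndels appear as we recall them from MUMmer 3. The sample text is invented for illustration and is not real output.

```python
# Labels assumed to appear in a dnadiff report; adjust to match your file.
LABELS = ("Relocations", "Translocations", "Inversions",
          "Insertions", "TotalSNPs", "TotalIndels")

def parse_report(text, labels=LABELS):
    """Return {label: (ref_count, qry_count)} for rows of the form 'Label N M'."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if (len(fields) == 3 and fields[0] in labels
                and fields[1].isdigit() and fields[2].isdigit()):
            counts[fields[0]] = (int(fields[1]), int(fields[2]))
    return counts

# Invented sample, only to show the expected shape of the input.
SAMPLE = """\
[Feature Estimates]
Relocations                  2                    3
Translocations               0                    0
Inversions                   1                    1
Insertions                  12                    9
[SNPs]
TotalSNPs                   40                   40
TotalIndels                 15                   15
"""

for label, (ref, qry) in parse_report(SAMPLE).items():
    print(f"{label:15s} ref={ref:5d} qry={qry:5d}")
```

To use it on a real run, read the report with open("out_diff.report").read(), where out_diff is whatever prefix you gave dnadiff.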

Exercise 3: Assembly with SOAPdenovo

In this exercise, you'll assemble the same genome using another leading algorithm, SOAPdenovo. You will run the assembly twice, first with the raw reads and then with reads corrected by the Quake error correction pipeline, to evaluate whether error correction is useful. Note that ALLPATHS-LG has an integrated error correction pipeline.

Raw Read Assembly:

Download and install SOAPdenovo.

Download the S. aureus sequencing reads (same as above)

Prepare the SOAPdenovo config file (see the SOAPdenovo manual)

Run SOAPdenovo using k=45

Align the scaffolds to the reference genome (included with the read data) with nucmer

Summarize the quality of the assembly (also see the SOAPdenovo log file)
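
The config file for the raw read assembly can be generated with a few lines of Python. This is a sketch under assumptions: the read length, insert sizes, and fastq file names are placeholders, and the executable may have a different name in your installation (some releases ship separate builds for larger k values). The config keys themselves (max_rd_len, avg_ins, reverse_seq, asm_flags, rank, q1, q2) are the ones described in the SOAPdenovo manual.

```python
def soap_config(libs, max_rd_len):
    """Build SOAPdenovo config text; libraries are ranked in the order given."""
    lines = [f"max_rd_len={max_rd_len}"]
    for rank, lib in enumerate(libs, start=1):
        lines += [
            "[LIB]",
            f"avg_ins={lib['avg_ins']}",
            f"reverse_seq={lib['reverse_seq']}",  # 1 for outward-facing jumping pairs
            f"asm_flags={lib['asm_flags']}",      # 3 = contigs and scaffolds, 2 = scaffolds only
            f"rank={rank}",
            f"q1={lib['q1']}",
            f"q2={lib['q2']}",
        ]
    return "\n".join(lines) + "\n"

def soap_command(config, k=45, prefix="asm", exe="SOAPdenovo"):
    """Command line for a full run; `exe` and `prefix` are placeholders."""
    return [exe, "all", "-s", config, "-K", str(k), "-o", prefix]

# Placeholder libraries: replace sizes and file names with the real ones.
libs = [
    {"avg_ins": 180, "reverse_seq": 0, "asm_flags": 3,
     "q1": "frag_1.fastq", "q2": "frag_2.fastq"},
    {"avg_ins": 3500, "reverse_seq": 1, "asm_flags": 2,
     "q1": "jump_1.fastq", "q2": "jump_2.fastq"},
]
config_text = soap_config(libs, max_rd_len=101)
print(config_text)
print(" ".join(soap_command("soap.config")))
```

For the error corrected run later in this exercise, only the q1 and q2 entries of the fragment library need to change.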

Question 3: How many insertions, deletions, and rearrangements are there between the assembled sequence and the reference?

Error Corrected Assembly:

Download and install Quake (directly from the source repository)

Prepare a file with the fragment pairs (pairs.list)

Error correct the fragment reads using quake.py -f pairs.list -k 17

Prepare a new SOAPdenovo config file using the error corrected fragments

Run SOAPdenovo using k=45

Align the scaffolds to the reference genome (included with the read data) with nucmer

Summarize the quality of the assembly (also see the SOAPdenovo log file)
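
For the pairs.list step, Quake expects (as far as we recall from its manual) one line per library file, with the two fastq files of a pair on the same line separated by a space. The sketch below renders that text and the quake.py command from the step above. The fastq names are placeholders.

```python
def pairs_list_text(pairs):
    """One line per pair: 'mate1.fastq mate2.fastq'."""
    return "".join(f"{first} {second}\n" for first, second in pairs)

def quake_command(list_file="pairs.list", k=17):
    """The command given in the steps above, as an argument list."""
    return ["quake.py", "-f", list_file, "-k", str(k)]

# Placeholder file names for the fragment library.
text = pairs_list_text([("frag_1.fastq", "frag_2.fastq")])
print(text, end="")
print(" ".join(quake_command()))
```

Save the text as pairs.list in the directory that holds the fastq files, or use absolute paths in the list.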

Question 4: How many insertions, deletions, and rearrangements are there between the assembled sequence and the reference?

Question 5: How useful was the error correction?

Question 6: How does the SOAPdenovo assembly compare to the ALLPATHS assembly?
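
When comparing assemblies, contiguity is often summarized alongside the dnadiff counts. The exercise does not prescribe a statistic; N50 is one common choice, and the sketch below computes it from FASTA text. N50 here is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length.

```python
def fasta_lengths(text):
    """Return the length of each sequence in FASTA-formatted text."""
    lengths = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            lengths.append(0)          # start a new record
        elif line and lengths:
            lengths[-1] += len(line)   # sequence lines may be wrapped
    return lengths

def n50(lengths):
    """Length L such that sequences of length >= L cover half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

example = ">contig1\nACGTAC\nGT\n>contig2\nACG\n"
print(fasta_lengths(example), n50(fasta_lengths(example)))
```

Run it on the scaffold or contig FASTA files from each assembler, for example n50(fasta_lengths(open("scaffolds.fasta").read())) with your own file name.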

Exercise 4: Assembly with Celera Assembler

The last exercise is to assemble the reads using the Celera Assembler. You should error correct the fragment reads with Quake before assembly.

Celera assembly:

Download and install Celera Assembler

Error correct the fragment reads with Quake (see above)

Convert the fragment pairs to FRG format using fastqToCA

Convert the jumping pairs to FRG format using fastqToCA

Assemble the reads using runCA fragment.frg jump.frg

Align the scaffolds to the reference genome (included with the read data) with nucmer

Summarize the quality of the assembly (also see the Celera Assembler output)
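
The conversion and assembly steps can be sketched as command lines. Treat the flags as assumptions to verify against the help output of your Celera Assembler version: option names for fastqToCA have changed between releases, and runCA, as we recall it, also expects an output directory (-d) and a file prefix (-p) in addition to the FRG files. Library names, insert sizes, and fastq names are placeholders.

```python
def fastq_to_ca(library, mean, stddev, fq1, fq2, outward=False):
    """fastqToCA command for one paired library; FRG text goes to stdout."""
    return ["fastqToCA",
            "-libraryname", library,
            "-technology", "illumina",
            "-insertsize", str(mean), str(stddev),
            "-outtie" if outward else "-innie",
            "-mates", f"{fq1},{fq2}"]

def run_ca(out_dir, prefix, frg_files):
    """runCA command over the given FRG files."""
    return ["runCA", "-d", out_dir, "-p", prefix, *frg_files]

# Placeholder inputs: corrected fragment pairs and raw jumping pairs.
steps = [
    fastq_to_ca("frag", 180, 20, "frag_1.cor.fastq", "frag_2.cor.fastq"),
    fastq_to_ca("jump", 3500, 500, "jump_1.fastq", "jump_2.fastq", outward=True),
    run_ca("celera_out", "asm", ["fragment.frg", "jump.frg"]),
]
for cmd in steps:
    print(" ".join(cmd))
```

Redirect the output of each fastqToCA call into fragment.frg and jump.frg respectively before launching runCA.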

Question 7: How many insertions, deletions, and rearrangements are there between the assembled sequence and the reference?

Question 8: Which of the 3 assemblers worked the best?