Genome Assembly Exercises
Exercise 1: Whole Genome Alignment
Immediately after completing an assembly, the newly assembled sequence will often be compared
to either a prior assembly or to the genome of a related species. This computation is called whole
genome alignment. One of the leading tools for whole genome alignment is nucmer, distributed
with MUMmer and available on SourceForge at http://mummer.sf.net.
In this exercise you will compare two genome files, available from wga.challenge1.tgz.
For this exercise the most important tools are:
- nucmer: align the two sequences
- delta-filter: filter out repetitive alignments
- show-coords: display the alignment information
- mummerplot: draw a dotplot of the alignments
- dnadiff: summarize differences between the sequences
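The tools above are usually chained in the order listed. The Python sketch below only builds the command lines and prints them; run_all is there if you want to execute them. It assumes the MUMmer 3 programs are on your PATH. The file names ref.fasta and qry.fasta and the prefix wga are placeholders, not the names inside wga.challenge1.tgz.

```python
import subprocess


def mummer_commands(ref, qry, prefix="wga"):
    """Return (argv, stdout_file) pairs; stdout_file is None when the
    tool writes its own output files."""
    delta = prefix + ".delta"
    filtered = prefix + ".filter"
    return [
        # nucmer writes its alignments to <prefix>.delta
        (["nucmer", "--prefix=" + prefix, ref, qry], None),
        # -1 keeps the 1-to-1 alignments; the filtered delta goes to stdout
        (["delta-filter", "-1", delta], filtered),
        # -r sort by reference, -c add coverage, -l add sequence lengths
        (["show-coords", "-rcl", filtered], prefix + ".coords"),
        # draw the dotplot as a PNG image
        (["mummerplot", "--png", "-p", prefix, filtered], None),
        # dnadiff runs its own alignment, so give it a separate prefix
        (["dnadiff", "-p", prefix + ".dnadiff", ref, qry], None),
    ]


def run_all(commands):
    """Execute each command, redirecting stdout to a file where requested."""
    for argv, out in commands:
        if out is None:
            subprocess.run(argv, check=True)
        else:
            with open(out, "w") as handle:
                subprocess.run(argv, check=True, stdout=handle)


if __name__ == "__main__":
    for argv, out in mummer_commands("ref.fasta", "qry.fasta"):
        print(" ".join(argv) + (" > " + out if out else ""))
```

The -1 filter is one reasonable choice for comparing two assemblies of the same genome; delta-filter offers other modes that may suit your data better.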
Question 1: How many insertions, deletions, and rearrangements are there between the two sequences?
Hint: check out mummerplot and dnadiff
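dnadiff writes a plain-text summary (a file ending in .report) whose rows give a value for the reference and a value for the query. A minimal Python sketch for pulling those rows into a dictionary follows. It assumes the three-column layout of name, reference value, query value. The SAMPLE text uses made-up numbers purely to show the layout; it is not output from the exercise data.

```python
def parse_dnadiff_report(text):
    """Map each report row name to its (reference, query) values."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly three columns. Section headers such as
        # "[Feature Estimates]" and blank lines are skipped.
        if len(fields) == 3 and not fields[0].startswith("["):
            rows[fields[0]] = (fields[1], fields[2])
    return rows


# Made-up numbers, for illustration only.
SAMPLE = """\
[Feature Estimates]
Breakpoints                       10                   12
Relocations                        1                    1
Translocations                     0                    0
Inversions                         2                    2

Insertions                         3                    5
"""

if __name__ == "__main__":
    for name, (ref_value, qry_value) in parse_dnadiff_report(SAMPLE).items():
        print(name, ref_value, qry_value)
```

Because insertions are reported per sequence, extra sequence present only in the reference can be read as a deletion from the query's point of view.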
Exercise 2: Assembly with Allpaths
You will now assemble a microbial genome using one of the leading algorithms, ALLPATHS-LG. For the
exercise, we will use sequencing data from Staphylococcus aureus distributed as part of the
Genome Assembly Gold-Standard Evaluations (GAGE) project. You will then compare your new
assembly to the reference genome to evaluate its accuracy.
Steps:
Download and install ALLPATHS
Download the S. aureus sequencing reads
Assemble the reads using ALLPATHS:
- Prepare the in_libs.csv and in_groups.csv files (see the ALLPATHS manual)
- Prepare the data using PrepareAllPathsInputs.pl
- Launch ALLPATHS using RunAllPathsLG
Align the contigs to the reference genome (included with the read data) with nucmer
Summarize the quality of the assembly (also see assembly.report)
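The two CSV files from the steps above describe the libraries and the read files that belong to them. The Python sketch below generates both as text, using the column names from the ALLPATHS manual as I recall them; check them against your copy of the manual. Every value is an assumption for illustration: the library names, the 180 bp fragment size, the 3500 bp jumping insert size, the standard deviations, and the file paths should all be replaced with the values documented with the read data.

```python
import csv
import io

LIB_COLUMNS = [
    "library_name", "project_name", "organism_name", "type", "paired",
    "frag_size", "frag_stddev", "insert_size", "insert_stddev",
    "read_orientation", "genomic_start", "genomic_end",
]
GROUP_COLUMNS = ["group_name", "library_name", "file_name"]


def to_csv(columns, rows):
    """Render rows (dicts) as CSV text; missing columns are left empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical library descriptions; all numbers are placeholders.
LIBS = [
    {"library_name": "frag", "project_name": "gage",
     "organism_name": "S.aureus", "type": "fragment", "paired": 1,
     "frag_size": 180, "frag_stddev": 20, "read_orientation": "inward",
     "genomic_start": 0, "genomic_end": 0},
    {"library_name": "shortjump", "project_name": "gage",
     "organism_name": "S.aureus", "type": "jumping", "paired": 1,
     "insert_size": 3500, "insert_stddev": 500,
     "read_orientation": "outward",
     "genomic_start": 0, "genomic_end": 0},
]
# "?" stands for the read number (1 or 2) in each pair of files.
GROUPS = [
    {"group_name": "frag", "library_name": "frag",
     "file_name": "/path/to/frag_?.fastq"},
    {"group_name": "shortjump", "library_name": "shortjump",
     "file_name": "/path/to/shortjump_?.fastq"},
]

if __name__ == "__main__":
    with open("in_libs.csv", "w") as handle:
        handle.write(to_csv(LIB_COLUMNS, LIBS))
    with open("in_groups.csv", "w") as handle:
        handle.write(to_csv(GROUP_COLUMNS, GROUPS))
```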
Question 2: How many insertions, deletions, and rearrangements are there between the assembled sequence and the reference?
Exercise 3: Assembly with SOAPdenovo
In this exercise, you'll assemble the same genome using another leading algorithm, SOAPdenovo.
We will try the same assembly using the raw reads and then again with the reads corrected using
the error correction pipeline Quake, to evaluate whether error correction is useful. Note that
ALLPATHS has an integrated error correction pipeline.
Raw Read Assembly:
Download and install SOAPdenovo.
Download the S. aureus sequencing reads (same as above)
Prepare the SOAPdenovo config file (see the SOAPdenovo manual)
Run SOAPdenovo using k=45
Align the scaffolds to the reference genome (included with the read data) with nucmer
Summarize the quality of the assembly (also see the SOAPdenovo log file)
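The SOAPdenovo config file is a short text file with a global maximum read length followed by one [LIB] section per library. A Python sketch that generates one is shown below. The read length, insert sizes, and file names are assumptions for illustration; take the real values from the read data and the SOAPdenovo manual.

```python
def soap_library(avg_ins, reverse_seq, asm_flags, rank, q1, q2):
    """Return one [LIB] section for a paired-end FASTQ library."""
    return "\n".join([
        "[LIB]",
        f"avg_ins={avg_ins}",          # average insert size
        f"reverse_seq={reverse_seq}",  # 0 = forward-reverse, 1 = reverse-forward
        f"asm_flags={asm_flags}",      # 3 = contigs and scaffolds, 2 = scaffolds only
        f"rank={rank}",                # order in which libraries are used for scaffolding
        f"q1={q1}",
        f"q2={q2}",
    ])


def soap_config(max_rd_len, libraries):
    """Join the global setting and the library sections into config text."""
    return "\n".join([f"max_rd_len={max_rd_len}"] + list(libraries)) + "\n"


if __name__ == "__main__":
    # Placeholder values: 101 bp reads, 180 bp fragments, 3500 bp jumps.
    config = soap_config(101, [
        soap_library(180, 0, 3, 1, "frag_1.fastq", "frag_2.fastq"),
        soap_library(3500, 1, 2, 2, "shortjump_1.fastq", "shortjump_2.fastq"),
    ])
    with open("soap.config", "w") as handle:
        handle.write(config)
```

With the config written, the assembly is typically launched with the all subcommand, for example SOAPdenovo-63mer all -s soap.config -K 45 -o asm. The binary name differs between releases, and k=45 needs a build that supports k-mers longer than 31.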
Question 3: How many insertions, deletions, and rearrangements are there between the assembled sequence and the reference?
Error Corrected Assembly:
Download and install Quake (directly from the source repository)
Prepare a file with the fragment pairs (pairs.list)
Error correct the fragment reads using quake.py -f pairs.list -k 17
Prepare a new SOAPdenovo config file using the error corrected fragments
Run SOAPdenovo using k=45
Align the scaffolds to the reference genome (included with the read data) with nucmer
Summarize the quality of the assembly (also see the SOAPdenovo log file)
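For paired reads, the file passed to quake.py -f lists the two FASTQ files of each pair on one line. The Python sketch below writes such a file and builds the Quake command given in the steps above. The FASTQ names are placeholders for the fragment read files in your download.

```python
def pairs_list(pairs):
    """Return pairs.list text: one 'first second' line per read pair."""
    return "".join(f"{first} {second}\n" for first, second in pairs)


def quake_command(list_file, k):
    """Build the quake.py command line used in the exercise."""
    return ["quake.py", "-f", list_file, "-k", str(k)]


if __name__ == "__main__":
    # Placeholder file names for the fragment library.
    with open("pairs.list", "w") as handle:
        handle.write(pairs_list([("frag_1.fastq", "frag_2.fastq")]))
    print(" ".join(quake_command("pairs.list", 17)))
```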
Question 4: How many insertions, deletions, and rearrangements are there between the assembled sequence and the reference?
Question 5: How useful was the error correction?
Question 6: How does the SOAPdenovo assembly compare to the ALLPATHS assembly?
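When comparing assemblies, a contiguity statistic such as N50 is a common companion to the dnadiff summary. N50 is the length of the contig or scaffold at which the running total, taken from longest to shortest, first reaches half of the assembly size. The exercise does not prescribe this metric, so treat the Python sketch below as one optional way to put numbers side by side. The file name scaffolds.fasta is a placeholder.

```python
def read_fasta_lengths(lines):
    """Return the length of each sequence in FASTA-formatted lines."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length at which the longest-first running sum reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


if __name__ == "__main__":
    with open("scaffolds.fasta") as handle:
        sizes = read_fasta_lengths(handle)
    print("sequences:", len(sizes), "total:", sum(sizes), "N50:", n50(sizes))
```

N50 says nothing about correctness: a misassembled scaffold can raise it, so read it together with the rearrangement counts from the alignment.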
Exercise 4: Assembly with Celera Assembler
The last exercise is to assemble the reads using the Celera Assembler.
You should error correct the fragment reads with Quake before assembly.
Celera assembly:
Download and install Celera Assembler
Error correct the fragment reads with Quake (see above)
Convert the fragment pairs to FRG format using fastqToCA
Convert the jumping pairs to FRG format using fastqToCA
Assemble the reads using runCA fragment.frg jump.frg
Align the scaffolds to the reference genome (included with the read data) with nucmer
Summarize the quality of the assembly (also see the Celera Assembler log files)
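The conversion and assembly steps above can be sketched as command lines. The Python below only builds them. It assumes the fastqToCA options -insertsize, -libraryname, -innie/-outtie, and -mates, and the runCA options -d and -p, as found in Celera Assembler releases that ship fastqToCA; confirm them against the usage message of your installed version. The insert sizes, library names, and FASTQ file names are placeholders.

```python
def fastq_to_ca(library, mean, stddev, mate1, mate2, outtie=False):
    """Build a fastqToCA command; its stdout should be saved as a .frg file."""
    return [
        "fastqToCA",
        "-insertsize", str(mean), str(stddev),
        "-libraryname", library,
        # Jumping libraries are usually outward-facing, fragments inward.
        "-outtie" if outtie else "-innie",
        "-mates", f"{mate1},{mate2}",
    ]


def run_ca(directory, prefix, frg_files):
    """Build the runCA command for the given FRG files."""
    return ["runCA", "-d", directory, "-p", prefix] + list(frg_files)


if __name__ == "__main__":
    # Placeholder inputs: corrected fragment reads and raw jumping reads.
    frag = fastq_to_ca("frag", 180, 20, "frag_1.cor.fastq", "frag_2.cor.fastq")
    jump = fastq_to_ca("jump", 3500, 500, "shortjump_1.fastq",
                       "shortjump_2.fastq", outtie=True)
    print(" ".join(frag), "> fragment.frg")
    print(" ".join(jump), "> jump.frg")
    print(" ".join(run_ca("ca_out", "asm", ["fragment.frg", "jump.frg"])))
```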
Question 7: How many insertions, deletions, and rearrangements are there between the assembled sequence and the reference?
Question 8: Which of the 3 assemblers worked the best?