Genome Assembly Exercises

Exercise 1: Whole Genome Alignment


Immediately after an assembly is completed, the newly assembled sequence is often compared to either a prior assembly or the genome of a related species. This computation is called whole genome alignment. One of the leading tools for whole genome alignment is nucmer, distributed with MUMmer and available on SourceForge at http://mummer.sf.net.

In this exercise you will compare two genome files available in: wga.challenge1.tgz

For this exercise the most important tools are:

  1. nucmer: align the two sequences
  2. delta-filter: filter out repetitive alignments
  3. show-coords: display the alignment information
  4. mummerplot: draw a dotplot of the alignments
  5. dnadiff: summarize differences between the sequences
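
The five tools above are normally run in sequence. The following is a minimal sketch of that sequence, not a prescribed recipe: the file names ref.fasta and qry.fasta and the prefix are placeholders (the contents of the archive are not listed here), and the option choices (-1 for one-to-one filtering, -rcl for show-coords, --png for mummerplot) are one reasonable selection among several. mummerplot also needs gnuplot installed to render the image.

```python
import shutil
import subprocess


def wga_steps(ref, qry, prefix="wga"):
    """Return (argv, stdout_file) pairs for the five tools, in list order.

    ref, qry and prefix are placeholder names chosen for illustration.
    """
    delta = prefix + ".delta"
    filtered = prefix + ".filter.delta"
    return [
        # 1. align the query to the reference; nucmer writes <prefix>.delta
        (["nucmer", "--prefix=" + prefix, ref, qry], None),
        # 2. keep one-to-one alignments; delta-filter prints to stdout
        (["delta-filter", "-1", delta], filtered),
        # 3. tabular coordinates, sorted by reference, with coverage and lengths
        (["show-coords", "-rcl", filtered], prefix + ".coords"),
        # 4. dotplot of the filtered alignments as a PNG
        (["mummerplot", "--png", "-p", prefix, filtered], None),
        # 5. summary report; a separate prefix avoids overwriting <prefix>.delta
        (["dnadiff", "-p", prefix + ".dnadiff", ref, qry], None),
    ]


def run_steps(steps):
    """Run each step, redirecting stdout to a file where one is given."""
    for argv, stdout_file in steps:
        if shutil.which(argv[0]) is None:
            raise FileNotFoundError(argv[0] + " is not on PATH")
        if stdout_file is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_file, "w") as out:
                subprocess.run(argv, check=True, stdout=out)


# Example (requires MUMmer): run_steps(wga_steps("ref.fasta", "qry.fasta"))
```

Keeping the command lists separate from the runner makes it easy to print and inspect the commands before anything is executed.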

Question 1: How many insertions, deletions, and rearrangements are there between the two sequences?

Hint: check out mummerplot and dnadiff
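
The counts asked for in Question 1 can be read off the dnadiff report by hand, or pulled out with a few lines of code. The sketch below assumes the report lists each feature on its own line as a name followed by a reference count and a query count; the function name and the default list of features are illustrative. dnadiff reports typically have no separate row for deletions, since an insertion in one sequence is a deletion in the other, so compare the two columns.

```python
def parse_dnadiff_report(text, wanted=("Breakpoints", "Relocations",
                                       "Translocations", "Inversions",
                                       "Insertions")):
    """Collect (reference, query) counts for selected report rows.

    Assumes rows of the form: <name> <ref_count> <qry_count>.
    Rows whose name is not in `wanted` are ignored.
    """
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[0] in wanted:
            counts[fields[0]] = (int(fields[1]), int(fields[2]))
    return counts


# Example (file name is a placeholder):
# with open("wga.dnadiff.report") as handle:
#     print(parse_dnadiff_report(handle.read()))
```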


Exercise 2: Assembly with Allpaths


You will now assemble a microbe using one of the leading algorithms, ALLPATHS-LG. For this exercise, we will use sequencing data from Staphylococcus aureus distributed as part of the Genome Assembly Gold-Standard Evaluations (GAGE) project. You will then compare your new assembly to the reference genome to evaluate its accuracy.

Steps:
  • Download and install ALLPATHS
  • Download the S. aureus sequencing reads
  • Assemble the reads using ALLPATHS:
    1. Prepare the in_libs.csv and in_groups.csv files (see the ALLPATHS manual)
    2. Prepare the data using PrepareAllPathsInputs.pl
    3. Launch ALLPATHS using RunAllPathsLG
  • Align the contigs to the reference genome (included with the read data) with nucmer
  • Summarize the quality of the assembly (also see assembly.report)
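
The two CSV files in step 1 describe the libraries and where their reads live. The following is a rough sketch of generating them, assuming the column names given in the ALLPATHS manual; every value below (library names, insert sizes, standard deviations, file paths) is a placeholder to be replaced with the real figures for the S. aureus data, and the manual remains the authority on how paired files are matched.

```python
import csv
import io

# Column names as recalled from the ALLPATHS manual; verify against your copy.
LIB_FIELDS = ["library_name", "project_name", "organism_name", "type",
              "paired", "frag_size", "frag_stddev", "insert_size",
              "insert_stddev", "read_orientation", "genomic_start",
              "genomic_end"]
GROUP_FIELDS = ["group_name", "library_name", "file_name"]

# Placeholder values only: substitute the real library statistics.
EXAMPLE_LIBS = [
    {"library_name": "frag", "project_name": "gage",
     "organism_name": "S.aureus", "type": "fragment", "paired": 1,
     "frag_size": 180, "frag_stddev": 20,
     "read_orientation": "inward", "genomic_start": 0, "genomic_end": 0},
    {"library_name": "jump", "project_name": "gage",
     "organism_name": "S.aureus", "type": "jumping", "paired": 1,
     "insert_size": 3500, "insert_stddev": 500,
     "read_orientation": "outward", "genomic_start": 0, "genomic_end": 0},
]
EXAMPLE_GROUPS = [
    # Hypothetical paths; "?" is meant to match the _1 and _2 files.
    {"group_name": "frag", "library_name": "frag",
     "file_name": "/path/to/frag_?.fastq"},
    {"group_name": "jump", "library_name": "jump",
     "file_name": "/path/to/shortjump_?.fastq"},
]


def to_csv(fields, rows):
    """Render rows (dicts) as CSV text; missing columns are left blank."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Example:
# open("in_libs.csv", "w").write(to_csv(LIB_FIELDS, EXAMPLE_LIBS))
# open("in_groups.csv", "w").write(to_csv(GROUP_FIELDS, EXAMPLE_GROUPS))
```

Fragment libraries fill in frag_size and frag_stddev, while jumping libraries fill in insert_size and insert_stddev; the unused pair is left empty.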

Question 2: How many insertions, deletions, and rearrangements are there between the assembled sequence and the reference?
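
When summarizing assembly quality, contiguity is usually reported next to the difference counts. The following is a minimal sketch of computing N50 (the length L such that contigs of length L or longer contain at least half of the assembled bases), assuming a plain FASTA file of contigs; the function names are illustrative.

```python
def read_fasta_lengths(lines):
    """Return the length of each sequence in an iterable of FASTA lines."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Smallest length L such that contigs >= L cover half the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


# Example (file name is a placeholder):
# with open("contigs.fasta") as handle:
#     print(n50(read_fasta_lengths(handle)))
```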



Exercise 3: Assembly with SOAPdenovo


In this exercise, you'll assemble the same genome using another leading algorithm, SOAPdenovo. We will run the assembly first with the raw reads and then again with reads corrected by the Quake error correction pipeline, to evaluate whether error correction is useful. Note that ALLPATHS has an integrated error correction pipeline.

Raw Read Assembly:
  • Download and install SOAPdenovo
  • Download the S. aureus sequencing reads (same as above)
  • Prepare the SOAPdenovo config file (see the SOAPdenovo manual)
  • Run SOAPdenovo using k=45
  • Align the scaffolds to the reference genome (included with the read data) with nucmer
  • Summarize the quality of the assembly (also see the SOAPdenovo log file)
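
The config file lists a maximum read length followed by one [LIB] section per library. The following is a sketch of generating it and the matching command line, under several assumptions: the insert sizes, read length, and file names are placeholders, and the binary name depends on the SOAPdenovo release (k=45 needs a build that supports k-mers larger than 31, named SOAPdenovo-63mer in some releases).

```python
def soap_config(max_rd_len, libraries):
    """Render a SOAPdenovo config with one [LIB] section per library."""
    lines = ["max_rd_len=" + str(max_rd_len)]
    for lib in libraries:
        lines.append("[LIB]")
        for key in ("avg_ins", "reverse_seq", "asm_flags", "rank"):
            lines.append(key + "=" + str(lib[key]))
        # q1/q2 are the two FASTQ files of a paired library
        lines.append("q1=" + lib["q1"])
        lines.append("q2=" + lib["q2"])
    return "\n".join(lines) + "\n"


def soap_command(config, prefix, k=45, binary="SOAPdenovo-63mer"):
    """Command line for a full run; the binary name is an assumption."""
    return [binary, "all", "-s", config, "-K", str(k), "-o", prefix]


# Placeholder library descriptions: replace with the real values.
EXAMPLE_LIBRARIES = [
    # fragment pairs face inward: reverse_seq=0, used for contigs and scaffolds
    {"avg_ins": 180, "reverse_seq": 0, "asm_flags": 3, "rank": 1,
     "q1": "frag_1.fastq", "q2": "frag_2.fastq"},
    # jumping pairs face outward: reverse_seq=1, used for scaffolding only
    {"avg_ins": 3500, "reverse_seq": 1, "asm_flags": 2, "rank": 2,
     "q1": "shortjump_1.fastq", "q2": "shortjump_2.fastq"},
]

# Example:
# open("soap.config", "w").write(soap_config(101, EXAMPLE_LIBRARIES))
# print(" ".join(soap_command("soap.config", "saureus_k45")))
```

For the error corrected run, only the q1 and q2 entries of the fragment library need to change.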

Question 3: How many insertions, deletions, and rearrangements are there between the assembled sequence and the reference?

Error Corrected Assembly:
  • Download and install Quake (directly from the source repository)
  • Prepare a file listing the fragment pairs (pairs.list)
  • Error correct the fragment reads using quake.py -f pairs.list -k 17
  • Prepare a new SOAPdenovo config file using the error corrected fragments
  • Run SOAPdenovo using k=45
  • Align the scaffolds to the reference genome (included with the read data) with nucmer
  • Summarize the quality of the assembly (also see the SOAPdenovo log file)
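
The pairs.list file names the two FASTQ files of each pair on one line. The following is a small sketch of writing it and building the quake.py command given above, assuming the two file names on a line are separated by whitespace; the FASTQ names are placeholders.

```python
def pairs_list(pairs):
    """One line per paired library: the two FASTQ files, space separated."""
    return "".join(first + " " + second + "\n" for first, second in pairs)


def quake_command(list_file="pairs.list", k=17):
    """The quake.py invocation from the step list, as an argument vector."""
    return ["quake.py", "-f", list_file, "-k", str(k)]


# Example (placeholder file names):
# open("pairs.list", "w").write(pairs_list([("frag_1.fastq", "frag_2.fastq")]))
# print(" ".join(quake_command()))
```

Check the names of the corrected files Quake writes before editing the SOAPdenovo config, since the new config must point at those files.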

Question 4: How many insertions, deletions, and rearrangements are there between the assembled sequence and the reference?

Question 5: How useful was the error correction?

Question 6: How does the SOAPdenovo assembly compare to the ALLPATHS assembly?



Exercise 4: Assembly with Celera Assembler


The last exercise is to assemble the reads using the Celera Assembler. You should error correct the fragment reads with Quake before assembly.

Celera Assembly:
  • Download and install Celera Assembler
  • Error correct the fragment reads with Quake (see above)
  • Convert the fragment pairs to FRG format using fastqToCA
  • Convert the jumping pairs to FRG format using fastqToCA
  • Assemble the reads using runCA fragment.frg jump.frg
  • Align the scaffolds to the reference genome (included with the read data) with nucmer
  • Summarize the quality of the assembly (also see the Celera Assembler output)
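
The conversion and assembly steps above can be sketched as argument vectors. This assumes a Celera Assembler release whose fastqToCA accepts -mates (older releases use a different option for the input files), and that runCA is given an output directory and prefix with -d and -p in addition to the FRG files. Library names, insert sizes, and file paths are placeholders.

```python
def fastq_to_ca(library, mean, stddev, fastq1, fastq2, outtie=False):
    """fastqToCA argument vector for one paired library.

    fastqToCA prints the FRG description to stdout, so redirect it to a file.
    """
    argv = ["fastqToCA", "-libraryname", library,
            "-insertsize", str(mean), str(stddev)]
    if outtie:
        # jumping libraries are typically oriented outward
        argv.append("-outtie")
    argv += ["-mates", fastq1 + "," + fastq2]
    return argv


def run_ca(directory, prefix, frg_files):
    """runCA argument vector: work directory, output prefix, then FRG files."""
    return ["runCA", "-d", directory, "-p", prefix] + list(frg_files)


# Example (placeholder values), shown as the shell commands they correspond to:
# fastqToCA -libraryname frag -insertsize 180 20 \
#     -mates /path/frag_1.cor.fastq,/path/frag_2.cor.fastq > fragment.frg
# fastqToCA -libraryname jump -insertsize 3500 500 -outtie \
#     -mates /path/shortjump_1.fastq,/path/shortjump_2.fastq > jump.frg
# runCA -d ca_out -p saureus fragment.frg jump.frg
```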

Question 7: How many insertions, deletions, and rearrangements are there between the assembled sequence and the reference?

Question 8: Which of the three assemblers worked the best?