appliedgenomics2023

Assignment 2: Genome Assembly

Assignment Date: Wednesday, September 6, 2023
Due Date: Wednesday, September 13, 2023 @ 11:59pm

Assignment Overview

In this assignment, you are given a set of unassembled reads from a mysterious pathogen that contains a secret message encoded someplace in the genome. The secret message will be recognizable as a novel insertion of sequence not found in the reference. Your task is to assess the quality of the reads, assemble the genome, identify, and decode the secret message. If all goes well the secret message should decode into a recognizable english text, otherwise double check your coordinates and try again. As a reminder, any questions about the assignment should be posted to Piazza.

For this assignment, we recommend you install and run the tools using bioconda. There are some tips below in the Resources section. Note on Mac, you must install the x86_64 package even if you are using an M1 chip.

Question 1. Coverage Analysis [20 pts]

Download the reads and reference genome from: https://github.com/schatzlab/appliedgenomics2023/blob/main/assignments/assignment2/asm.tgz?raw=true

Note we have provided both paired-end and mate-pairs reads (see included README for details). Make sure to look at all of the reads for the coverage analysis and kmer analysis, as well as in the assembly.

Question 2. Kmer Analysis [20 pts]

Use Jellyfish to count the 21-mers in the reads data. Make sure to use the “-C” flag to count cannonical kmers, otherwise your analysis will not correctly account for the fact that your reads come from either strand of DNA.

Question 3. De novo assembly [20 pts]

Assemble the reads using Spades. Spades will not run on Windows, you must use a linux/mac environment (The Ubuntu subsystem might work?).

Note: N50 size is the size such that half of the total amount of bases are in contigs this size or larger (a weighted median). For example, if you have contig sizes of 10kbp, 5kb, 3kbp, 1kbp, 1kbp, 1kbp. The total size is 21kbp. Half of this value is 10.5kbp, so the N50 size is 5kbp. To compute the N50 value, sort the contigs from largest to small, and then iterative through until their cummulative span reaches 50% of the total span.

Question 4. Whole Genome Alignment [15 pts]

Question 5. Decoding the insertion [20 pts]

Packaging

The solutions to the above questions should be submitted as a single PDF document that includes your name, email address, and all relevant figures (as needed). If you use ChatGPT for any of the code, also record the prompts used. Submit your solutions by uploading the PDF to GradeScope, and remember to select where in your submission each question/subquestion is. The Entry Code is: JK5VB4.

If you submit after this time, you will use your late days. Remember, you are only allowed 4 late days for the entire semester!

Resources

Bioconda - Package manager for bioinformatics software

On linux or mac I highly recommend that you use bioconda to install the packages rather than installing from source.

The easiest way to install conda is with Miniconda. For M1 macs, you must use the x86 installation in emulation mode since M1/arm support is still limited. For this you will use the “Rosette 2” subsystem that will convert M1 arm instructions into x86_64 on the fly. Rosette will be automatically installed when you go to run it. I also recommend using mamba instead of the default conda command for installing new packages:

## Replace MacOS-x86_64 with the version you downloaded from https://github.com/conda-forge/miniforge#mambaforge
$ chmod +x ./Mambaforge-MacOSX-x86_64.sh
$ ./Mambaforge-MacOSX-x86_64.sh

## After mamba is installed add bioconda as a default channel
$ conda config --add channels conda-forge
$ conda config --add channels defaults
$ conda config --add channels bioconda
$ conda config --set channel_priority strict

Once bioconda is configured, all of the tools needed for this assignment except spades can be installed using:

$ mamba install samtools bowtie bwa mummer4 jellyfish fastqc fastx_toolkit

For spades, download the precompiled version from here (installing with conda is tricky because there are conflicts in the dependencies): [https://cab.spbu.ru/files/release3.15.4/manual.html#sec2]

FastQC - Raw read quality assessment

$ fastqc /path/to/reads.fq

If you have problems, make sure java is installed (sudo apt-get install default-jre)

Jellyfish - Fast Kmer Counting

When counting kmers, make sure to count “canonical” kmers on both strands (-C):

$ jellyfish count -m 21 -C -s 1000000 /path/to/reads*.fq
$ jellyfish histo mer_counts.jf > reads.histo

GenomeScope - Analyze Kmer Profile to determine genome size and other properties

GenomeScope is a web-based tool so there is nothing to install. Hooray! Just make sure to use the -C when running jellyfish count so that the reads are correctly processed.

Spades - Short Read Assembler.

Normally spades would try several values of k and merge the results together, but here we will force it to just use k=31 to save time. The assembly should take a few minutes.

$ spades.py --isolate --pe1-1 frag180.1.fq --pe1-2 frag180.2.fq --mp1-1 jump2k.1.fq --mp1-2 jump2k.2.fq -o asm -t 4 -k 31

Note: On mac you may need to run spades like this with the -m flag:

$ spades.py --isolate -m 1024 --pe1-1 frag180.1.fq --pe1-2 frag180.2.fq --mp1-1 jump2k.1.fq --mp1-2 jump2k.2.fq -o asm -t 4 -k 31

MUMmer4 - Whole Genome Alignment

$ dnadiff /path/to/ref.fa /path/to/qry.fa
$ nucmer /path/to/ref.fa /path/to/qry.fa
$ show-coords out.delta

WARNING: nucmer and related tools do not like it if/when you have spaces or special characters (‘@’) in the path to the binaries*

SAMTools - Extract part of a genome sequence using ‘samtools faidx’ (this will extract from contig_id bases 1234 through 5678)

$ ./samtools faidx /path/to/genome.fa contig_id:1234-5678