appliedgenomics2023

Assignment 3: Mapping and WDLs

Assignment Date: Wednesday, September 20, 2023
Due Date: Wednesday, September 27, 2023 @ 11:59pm

Assignment Overview

In this assignment you will explore WDLs as a workflow language to orchestrate a variety of read mapping tasks. You can execute your WDLs using miniwdl after running pip install miniwdl. You may also find learn-wdl to be very helpful.

Miniwdl works best on Linux or older (non-M1) macs. If you are using a Mac, make sure to disable gRPC FUSE for file sharing. On M1 macs (and PCs) miniwdl wont run, but you can run miniwdl in the cloud via https://github.com/schatzlab-teaching/wdl101. Follow the instructions there to get started. Alternatively you can try to use the cromwell execution engine. Note you can install cromwell via mamba install cromwell.

The bioinformatics tools you will need for the assignment (bowtie, samtools, etc) are bundled into a Docker container so you wont need to install other software packages. This will get you ready to run your code in the cloud for future assignments. Instructions for running the container are available here: https://github.com/mschatz/wga-essentials. This is known to work on new macs (M1) and older macs (intel chip). It should also run on Linux and Windows but let us know if you have any issues.

As a reminder, any questions about the assignment should be posted to Piazza.

Question 1. de Bruijn Graph construction [10 pts]

ATTCA
ATTGA
CATTG
CTTAT
GATTG
TATTT
TCATT
TCTTA
TGATT
TTATT
TTCAT
TTCTT
TTGAT

Question 2. Hello World [5 pts]

Question 3. Mapping Analysis [10 pts]

The goal of this question is to develop a WDL that will align pairs of read file (paired or matepairs) and computes alignment statistics. Download the reads and reference genome (same as assignment 2) from: https://github.com/schatzlab/appliedgenomics2023/blob/main/assignments/assignment2/asm.tgz?raw=true

Note that the build comamnd outputs multiple files that you will need to mark as output files. You can either copy them one by one, or bundle them up in a tarball or zipfile.

  1. bowtie2 -x GENOME -1 PREFIX.1.fq -2 PREFIX.2.fq -S PREFIX.sam
  2. samtools sort PREFIX.sam -o PREFIX.bam
  3. samtools stats PREFIX.bam > PREFIX.stats

Please submit the code for this but you do not need to include screenshots. Note that to run bowtie2, you will need all of the output files from the indexing step. You can either pass them in one by one, or you can bundle them all together as a tarball or zipfile as a “reference bundle”.

Question 4. Mappability Analysis [10 pts]

To extract the kmers, you will need to write your own script. You should label the kmer with the position in the genome so you can confirm it is mapped to the corrected position. For example, if the genome is:

>genome
GATTACA

It should output

>1
GAT
>2
ATT
>3
TTA
>4
TAC
>5
ACA

Packaging

The solutions to the above questions should be submitted as a single PDF document that includes your name, email address, and all relevant figures (as needed). If you use ChatGPT for any of the code, also record the prompts used. Submit your solutions by uploading the PDF to GradeScope, and remember to select where in your submission each question/subquestion is. The Entry Code is: JK5VB4.

If you submit after this time, you will use your late days. Remember, you are only allowed 4 late days for the entire semester!

Graphview Tips

It is very easy to render a graph using graphviz. Install with mamba install graphviz. For example, these commands will render graph.dot using neato:

$ cat graph.dot
digraph{
    ATT -> TAC
    TAC -> ACG
    ACG -> CGA
    CGA -> GAC
    GAC -> ACG
    ACG -> CGT
    CGT -> GTA
    GTA -> TAT
}

$ neato -T png graph.dot -o neato.png

View neato Graph

Alternatively use dot

$ dot -T png graph.dot -o dot.png

View dot Graph