appliedgenomics2023

Assignment 1: Chromosome Structures

Assignment Date: Wednesday, August 30, 2023
Due Date: Wednesday, Sept. 6, 2023 @ 11:59pm

Assignment Overview

In this assignment you will profile the overall structure of the genomes of several important species and then study human chromosome 22 in more detail. As a reminder, any questions about the assignment should be posted to Piazza.

Question 1: Chromosome structures [10 pts]

Download the chomosome size files for the following genomes (Note these have been preprocessed to only include main chromosomes):

  1. Arabidopsis thaliana (TAIR10) - An important plant model species [info]
  2. Tomato (Solanum lycopersicum v4.00) - One of the most important food crops [info]
  3. E. coli (Escherichia coli K12) - One of the most commonly studied bacteria [info]
  4. Fruit Fly (Drosophila melanogaster, dm6) - One of the most important model species for genetics [info]
  5. Human (hg38) - us :) [info]
  6. Wheat (Triticum aestivum, IWGSC) - The food crop which takes up the largest land area [info]
  7. Worm (Caenorhabditis elegans, ce10) - One of the most important animal model species [info]
  8. Yeast (Saccharomyces cerevisiae, sacCer3) - an important eukaryotic model species, also good for bread and beer [info]

Using these files, make a table with the following information per species:

Question 2. Coverage simulator [20 pts]

num_reads = calculate_number_of_reads(genomesize, readlength, coverage)

## use an array to keep track of the coverage at each position in the genome
genome_coverage = initialize_array_with_zero(genomesize)

for (i = 0; i < num_reads; i++)
{
  startpos = uniform_random(1,genomelength-readlength)
  endpos = startpos + readlength - 1
  for (x = startpos; x <= endpos; x++)
  {
    genomecoverage[x] = genomecoverage[x] + 1
  }
}

maxcoverage = max(genomecoverage)

## use an array count how many positions have 0x coverage, have 1x coverage, have 2x coverage, ...
histogram = initialize_array_with_zero(maxcoverage)

for (x = 0; x < genomelength; x++)
{
  cov = genomecoverage[x]
  histogram[cov] = histogram[cov] + 1
}

## now plot the histogram
...

Question 3: Kmer Uniqueness [20 pts]

Download the human chomosome 22 from here: https://schatz-lab.org/appliedgenomics2023/assignments/assignment1/chr22.fa.gz

Notes:

Questions:

## initialize kmer length
k=19

## read genome, convert to upper canse and convert non-DNA to 'A'
genome_string = read_from_file("chr22.fa")

## dictionary that maps a kmer (like GAT) to a frequency (like 3)
kmer_frequency = initialize_dictionary()

## now scan the genome, extract kmers, and tally up their frequencies
for(i = 0; i < length(genome_string) - k + 1; i++)
{
  kmer = substring(genome_string, i, k)  
  kmer_frequency[kmer] = kmer_frequency[kmer] + 1
}

## now tally the frequencies in a dictionary that maps kmer frequency to count
## also determine the maximum kmer frequency

tally = initialize_dictionary()
all_kmers = kmer_frequency.keys()
max_frequency = 0

for (i = 0; i < length(all_kmers); i++)
{
  kmer = all_kmers[i]
  freq = kmer_frequency[kmer]
  tally[freq] = tally[freq] + 1
  
  if (freq > max_frequency)
  {
    max_frequency = freq
  }
}

## now print in sorted order
for (i = 1; i <= max_frequency; i++)
{
  if (tally.has_key(i))
  {
    freq = tally[i]
    print_to_file(i, freq)
  }
}

Question 4: Why Genomics? [10 pts]

Hints

Packaging

The solutions to the above questions should be submitted as a single PDF document that includes your name, email address, and all relevant figures (as needed). Make sure to clearly label each of the subproblems and give the exact commands and/or code snippets you used for solving the question. Submit your solutions by uploading the PDF to GradeScope, and remember to select where in your submission each question/subquestion is. The Entry Code is: JK5VB4.

If you submit after this time, you will start to use up your late days. Remember, you are only allowed 96 hours (4 days) for the entire semester!