appliedgenomics2023

Assignment 4: BWT and Modimizers

Assignment Date: Wednesday, October 4, 2023
Due Date: Wednesday, October 11, 2023 @ 11:59pm

Assignment Overview

In this assignment you will implement the BWT and modimizers. These exercises can be computed in any programming language, although we recommend python (or C++, Java, or Rust). R is generally inefficient at string processing unless you take great care.

As a reminder, any questions about the assignment should be posted to Piazza.

Question 1. BWT Encoding [25 pts]

In the language of your choice, implement a BWT encoder and encode the string below. Faster (Linear time) methods exist for computing the BWT, although for this assignment you can use the simple method based on standard sorting techniques. Your solution does not need to be an optimal algorithm and can use O(n^2) space and O(n^2 lg n) time.

Here is the recommended pseudo code (make sure to submit your code as well as the encoded string):

computeBWT(string s)
  ## add the magic end-of-string character
  s = s + "$"
 
  ## build up the BWM from the cyclic permutations
  ## note the ith cyclic permutation is just "s[i..n] + s[0..i]"
  rows = []
  for (i = 0; i < length(s); i++)
    rows.append(cyclic_permutation(s, i))

  ## just use the builtin sort command to sort the cyclic permutations
  sort(rows)

  ## now extract the last column
  bwt = ""
  for (i = 0; i < length(s); i++)
    bwt += substr(rows[i], length(s), 1)
  return bwt

String to encode:

I_am_fully_convinced_that_species_are_not_immutable;_but_that_those_belonging_to_what_are_called_the_same_genera_are_lineal_descendants_of_some_other_and_generally_extinct_species,_in_the_same_manner_as_the_acknowledged_varieties_of_any_one_species_are_the_descendants_of_that_species._Furthermore,_I_am_convinced_that_natural_selection_has_been_the_most_important,_but_not_the_exclusive,_means_of_modification.

Question 2. BWT Decoder [25 pts]

In the language of your choice, implement a BWT decoder and decode the string below.

One of the essential properties of the BWT is that it can be decoded back into the source text without any other additional information. This is accomplished by iteratively applying the Last-First property starting with the first character of the BWT until reaching the end of string character '$'. The Last-First property states there is an equivalence between the ith occurrence of a character in the first column and the ith occurrence of that character in the last column. This equivalence can be evaluated by counting how many occurrences of a character are present in the BWT string (the last column of the BWM) or by counting characters in the first column (which you will have to determine from the BWT itself). Again, faster methods exist (the FM-index) to determine the rank of each character but you can just count it explicitly here.

The pseudocode for decoding the string is as follows:

decodeBWT(String bwt) 
  firstCol = makeFirstColumn(bwt)
  text = ""
  
  ## By construction, '$' always starts the zeroth row
  row = 0;
  while (bwt.charAt(row) != '$')
      text.append(bwt.charAt(row));
      row = applyLF(firstCol, bwt.charAt(row), rank(bwt, row));
  
  return reverse(text)

String to decode:

.uspe_gexr_______$..,e.orrs,sdddeedkdsuoden-tf,tyewtktttt,sewteb_ce__ww__h_PPsm_u_naseueeennnrrlmwwhWcrskkmHwhttv_no_nnwttzKt_l_ocoo_be___aaaooaAakiiooett_oooi_sslllfyyD__uouuueceetenagan___rru_aasanIiatt__c__saacooor_ootjeae______ir__a

Hint: use your sourcecode from Q1 to debug Q2. Also start with simple strings like GATTACA or your own name.

Question 3. Randomers, modimizers, minimizers, and minhash [50 pts]

For these questions you will need to extract to segments of chr22 [Hint use samtools faidx with https://schatz-lab.org/appliedgenomics2023/assignments/assignment1/chr22.fa.gz.

Recall the Jaccard coefficient between sets A and B is |A intersect B| / |A union B|. To compute the set union and set intersect, we recommend storing the kmers (modimizers, minhashes, minimizers) from A in a dictionary, and then scanning through the kmers in B to check whether they are already in the dictionary. As you process the kmers in A and B, just keep track of how many there are total and how many are shared between A and B.

Randomers

Modimizers

Recall that a modimizer is a kmer such that hash(kmer) % M == 0. For this question you should use the default hash() function in your programming language (see below). Use k = 21.

Minhash

Recall minhash is computed as the N smallest hash values computed from the kmers in the string. [Hint: compute the hash value of every kmer in the sequence, then sort those values to find smallest 1000 values for N=1000]. Use k = 21

Minimizers [Bonus 10pts]

Recall (w, k)-mimimizers are the set of k-mers that have minimal value within each posible window of size w along a sequence. [Hint: use a nested for loop to consider every window, and then within each window, extract every possible kmer & sort]. Use k = 21.

Comparison

Packaging

The solutions to the above questions should be submitted as a single PDF document that includes your name, email address, and all relevant figures (as needed). If you use ChatGPT for any of the code, also record the prompts used. Submit your solutions by uploading the PDF to GradeScope, and remember to select where in your submission each question/subquestion is. The Entry Code is: JK5VB4.

If you submit after this time, you will use your late days. Remember, you are only allowed 4 late days for the entire semester!

Resources

Python

Python hash() function:

C++

C++ hash<std:string>:

Java

String.hashCode():

R

digest::digest2int()

Other languages

Post to piazza for recommendations