appliedgenomics2023

Assignment 5: Bedtools and RNA-seq

Assignment Date: Wednesday October 18, 2023
Due Date: Wednesday October 25 2023 @ 11:59pm

Assignment Overview

In this assignment you will explore the capabiltiies of bedtools, plus a key couple of aspects of RNA-seq (with a small introduction to clustering). For this assignment, you will have to generate some visualizations - we recommend R or Python, but use a language you are comfortable with!

Make sure to show your work/code in your writeup!

As a reminder, any questions about the assignment should be posted to Piazza.

Question 1. De novo mutation analysis [20 pts]

For this question, we will be focusing on the de novo variants identified in this paper:
http://www.nature.com/articles/npjgenmed201627

Download the de novo variant positions from here (Supplementary Table S4):
https://github.com/schatzlab/appliedgenomics2023/blob/main/assignments/assignment4/41525_2016_BFnpjgenmed201627_MOESM431_ESM.xlsx?raw=true

Download the gene annotation of the human genome here:
https://ftp.ensembl.org/pub/release-87/gff3/homo_sapiens/Homo_sapiens.GRCh38.87.gff3.gz

Download the annotation of regulatory variants from here:
https://ftp.ensembl.org/pub/release-87/regulation/homo_sapiens/homo_sapiens.GRCh38.Regulatory_Build.regulatory_features.20161111.gff.gz

NOTE The variants are reported using version 37 of the reference genome, but the annotation is for version 38. Fortunately, you can ‘lift-over’ the variants to the coordinates on the new reference genome using several avaible tools. I recommmend the UCSC liftover tool that can do this in batch by converting the variants into BED format. Note, some variants may not successfully lift over, especially if they become repetitive and/or missing in the new reference, so please make a note of how many variants fail liftover.

Question 2. Time Series [20 pts]

This file contains normalized expression values for 100 genes over 10 time points. Most genes have a stable background expression level, but some special genes show increased expression over the timecourse and some show decreased expression.

Question 3. Sampling Simulation [10 pts]

A typical human cell has ~250,000 transcripts, and a typical bulk RNA-seq experiment may involve millions of cells. Consequently in an RNAseq experiment you may start with trillions of RNA molecules, although your sequencer will only give a few tens of millions of reads. Therefore your RNAseq experiment will be a small sampling of the full composition. We hope the sequences will be a representative sample of the total population, but if your sample is very unlucky or biased it may not represent the true distribution. We will explore this concept by sampling a small subset of transcripts (500 to 50000) out of a much larger set (1M) so that you can evaluate this bias.

In data1.txt with 100,000 lines we provide an abstraction of RNA-seq data where normalization has been performed and the number of times a gene name occurs corresponds to the number of transcripts in the sample.

Question 4. Differential Expression [20 pts]

Question 5. Motif clustering and classifications [BONUS: 20 points]

In this question you will need to cluster and classify transcription factor sequence motifs from the JASPAR database. The sequences were generated from the position weight matrices using the script printsequence.py which loads the position frequency matrix (pfm) file and randomly generates a sequence according to the nucleotide probabilities in the matrix. We have provided three files to evaluate: training.txt which contains 2500 training examples (500 examples from 5 different transcription factors; validation.txt which contains 500 examples from these 5 TFs that can be used for validating your model; and secret.txt which contains 100 unlabeled examples to run your code against. Note, you may find it helpful to generate additional examples using the provided printsequence.py script. We have provided the matrix for TP53.pfm as an example.

def one_hot_encode_sequence(seq):
    mapping = {'A': [1, 0, 0, 0], 'C': [0, 1, 0, 0], 'G': [0, 0, 1, 0], 'T': [0, 0, 0, 1]}
    return np.array([mapping[base] for base in seq])

one_hot_encoded_sequences = np.array([one_hot_encode_sequence(seq) for seq in dna_sequences])

You will then need to flatten the one-hot encoded arrays (which are each 4x18) into a long vector of length 72. In this encoding, each sequence is a point in 72-dimensional space where each dimension is either 0 or 1. You will also need to scale the matrix so that PCA will work correctly. Finally color each point with the transcription factor that generates it.

    # Flatten the one-hot encoded data
    flattened_data = one_hot_encoded_sequences.reshape(one_hot_encoded_sequences.shape[0], -1)

    # Standardize the flattened data
    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(flattened_data)

Packaging

The solutions to the above questions should be submitted as a single PDF document that includes your name, email address, and all relevant figures (as needed). If you use ChatGPT for any of the code, also record the prompts used. Submit your solutions by uploading the PDF to GradeScope, and remember to select where in your submission each question/subquestion is. The Entry Code is: JK5VB4.

If you submit after this time, you will use your late days. Remember, you are only allowed 4 late days for the entire semester!

Resources

Bedtools

Bedtools has really nice documentation available at: https://bedtools.readthedocs.io/en/latest/

Installation

If you were able to use mamba in assignment 2, then you can install bedtools with:

$ mamba install -c bioconda bedtools

If you had any issues installing/using mamba on your computer, you can use the Github compute environment from assignment 3.

PCA

In python, see sklearn.decomposition.PCA

Neural Networks

In python, see Keras