appliedgenomics2023

Project Ideas

Here are a few selected projects organized by theme, although you are also free to present your own project ideas. A good project is one that:

  1. Has a well-defined goal;
  2. Has a well-defined method proposed for solving it; and
  3. Has appropriate data and resources available.

If any of these are not available, your project will not be successful, especially in the limited time remaining for class. A successful model for the class project is to apply a technique developed in a different context to the biological problem of your choice. Another successful model is to identify an important paper of interest and then try to improve upon its method in some way (faster, less memory, more sensitive, more precise, novel species, novel data sets, etc.).

Basecalling and signal processing

  1. Develop an improved base calling algorithm (potentially modification aware) for Oxford Nanopore or PacBio sequencing using recurrent neural networks or another ML approach: Nanocall, Nanopolish for Methylation

  2. Align the raw signal from Oxford Nanopore or PacBio reads to a reference genome to improve variant detection, methylation detection, and/or polishing: Nanopolish

  3. Develop a signal-level analysis tool to detect repeat expansions/contractions from nanopore data, especially for cancer or autism genomics: NanoSatellite, Dynamic Time Warping

  4. Develop a new application for ONT adaptive sequencing using UNCALLED or ReadFish
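
As a starting point for the signal-level ideas above, here is a minimal sketch of classic dynamic time warping between two one-dimensional signals. It is not how NanoSatellite or Nanopolish is implemented; the function name `dtw_distance`, the absolute-difference cost, and the toy current values are all assumptions chosen for illustration.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences.

    Uses an absolute-difference local cost and the standard three-way
    recurrence (match, insertion, deletion). Runs in O(len(a) * len(b)) time
    and keeps only two rows of the table in memory.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    prev = [inf] * (m + 1)
    prev[0] = 0.0  # aligning two empty prefixes costs nothing
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]


# Hypothetical current levels for one repeat unit and for an observed signal
# in which the pore dwelled longer on the middle level.
unit = [80.0, 95.0, 110.0]
observed = [80.0, 95.0, 95.0, 95.0, 110.0]
print(dtw_distance(unit, observed))
```

Because warping absorbs differences in dwell time, a real repeat-counting tool would need more than this distance alone, for example comparing the observed signal against templates with different numbers of repeat units.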

Genome assembly and variant calling

  1. Extend GenomeScope to infer genome characteristics from long, error-prone reads: GenomeScope2

  2. Benchmark and/or develop a de novo assembler for polyploid genomes. Note that you should use existing tools for overlapping and focus on unitigging/scaffolding: FALCON, HiCanu

  3. Benchmark several SV callers, and create a classifier for distinguishing high-confidence from low-confidence SV calls: Parliament2

  4. Develop an enhanced SV calling algorithm that leverages known SVs: 1000 Genomes SVs, Paragraph

  5. Jointly analyze (de novo assemble or index the raw reads) the genomes of multiple varieties of some species (human, rice, etc) at once to identify sequences not present in the reference genome: PanRice; Population BWT

  6. Study the performance of the MinHash algorithm for binning long reads for copy number analysis or other genomics applications: Mash

  7. Optimize an assembler and/or SV caller for Nanopore ultralong reads and/or PacBio HiFi reads: Telomere-to-telomere consortium, HiFi reads

  8. Develop methods for detecting mis-assemblies or consensus errors from long-read data: Quast, Clair

  9. Develop a Nanopore-based algorithm that leverages ReadUntil to dynamically assemble a genome with minimal coverage: BOSS-RUNS
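
For the MinHash idea in this list, the following is a minimal sketch of a bottom-s MinHash sketch and the corresponding Jaccard estimate, in the spirit of Mash but not its implementation. The simplifications are assumptions made for brevity: k-mers are not canonicalized (reverse complements are ignored), `hashlib.blake2b` stands in for Mash's hash function, and the helper names and toy sequences are hypothetical.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers in seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer using BLAKE2b (an assumed stand-in hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_sketch(seq, k=5, size=16):
    """Bottom-s sketch: the `size` smallest distinct k-mer hashes of seq."""
    hashes = sorted({hash_kmer(km) for km in kmers(seq, k)})
    return set(hashes[:size])


def jaccard_estimate(sketch_a, sketch_b, size=16):
    """Estimate Jaccard similarity from two bottom-s sketches.

    Takes the `size` smallest hashes of the merged sketches and reports the
    fraction of them that occur in both inputs.
    """
    merged = sorted(sketch_a | sketch_b)[:size]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in sketch_a and h in sketch_b)
    return shared / len(merged)


# Toy reads (hypothetical): two identical reads and one unrelated read.
read_a = "ACGTACGTTAGCCGATAGCTAGGATCCA"
read_b = "ACGTACGTTAGCCGATAGCTAGGATCCA"
read_c = "TTTTTTTTTTTTTTTTTTTTTTTTTTTT"
print(jaccard_estimate(minhash_sketch(read_a), minhash_sketch(read_b)))
print(jaccard_estimate(minhash_sketch(read_a), minhash_sketch(read_c)))
```

With error-prone long reads, the choice of `k` and sketch size strongly affects how many k-mers survive sequencing errors, which is one of the things a project on this topic could measure.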

Functional Genomics

  1. Benchmark different RNA-seq aligners when using a phased diploid genome compared to a standard reference genome: Allele-Seq

  2. Develop an RNA-seq aligner/pipeline that incorporates variants known from the population (graph genome, pangenomes): GraphGenomes

  3. Develop methods to identify genome variants from RNA-seq data, and apply them to individuals with many tissues profiled to identify somatic mutations: SNPiR

  4. Run ChromHMM/Segway on a phased diploid genome (NA12878) and evaluate how that compares to annotating the reference genome: ChromHMM

  5. Run ChromHMM/Segway on a non-human species such as rice or Arabidopsis: ChromHMM, Segway Protocol

  6. Develop a ChromHMM/Segway postprocessing algorithm to label the states with their biological functions: Segway Protocol

  7. Explore how single cell analysis works with minimal amounts of coverage. For example, reproduce the results from the Monocle paper, and experiment with how well it performs using lower amounts of coverage: Monocle

  8. Benchmark how well different single cell pipelines recognize different cell types: MetaNeighbor

Evolution and Disease Genomics

  1. Benchmark different non-coding mutation analysis schemes on a collection of diseased genomes (cancer, autism, etc): CADD, funSeq2, fitCons

  2. Develop methods for identifying somatic mutations using high-error long reads (PacBio or Oxford Nanopore): Short read benchmarks

  3. Metagenomics: Benchmark sailfish/salmon/kallisto approaches for inferring the abundance of different species present in a population: Meta-kallisto

  4. Develop a new metagenomics classifier using deep learning or other advanced ML techniques: PhymmBL

  5. Benchmark and/or develop a method for inferring the ethnicity of an individual from their genotype: Genealogical DNA test

  6. Apply metagenomics approaches to identifying species present in food samples or correlated with other diseases: AllFoodSeq, Centrifuge

  7. Investigate the rate of heterozygosity within and among human populations using consortium data. Variation in Heterozygosity Predicts…

CS Theory and Systems

  1. Adapt one or more learned data structures for genomics data: Learned Data Structures

  2. Accelerate an important genomics pipeline using GPUs or cloud computing, and use that to study a larger dataset: Rail-RNA

  3. Implement a genomics processing pipeline using WebAssembly and/or Objective-C/Swift/Android: fastq.bio

  4. Develop/apply a scalable data structure for genomics: Sequence Bloom Trees; Mantis

  5. Develop a novel visualization of genomics data (especially from 23andMe reports or single cell data): Circos

  6. Apply deep learning techniques to a problem in genomics: Primer

  7. Develop a novel fastq/BAM compression scheme for long term storage (which may require a large precomputed dictionary and/or extensive compute)

Or your own idea!

This should be more than you are already doing for your PhD work, but can be a novel twist to a dataset/idea you are already using. If you have a research idea but not the right data, let me know and I’ll help you find some.

Data Resources

Pointers to open access data sets to help get you started!

Human

Dog