appliedgenomics2023

Project Ideas

Here are a few selected projects organized by theme, although you are also free to present your own project ideas. A good project is one that:

Has a well-defined goal;
Has a well-defined method proposed for solving it; and
Has appropriate data and resources available.

If any of these is missing, your project is unlikely to succeed, especially in the limited time remaining in the class. A successful model for the class project is to apply a technique developed in a different context to the biological problem of your choice. Another successful model is to identify an important paper of interest and then try to improve upon its method in some way (faster, less memory, more sensitive, more precise, novel species, novel data sets, etc.).

Basecalling and signal processing

Develop an improved base calling algorithm (potentially modification aware) for Oxford Nanopore or PacBio sequencing using recurrent neural networks or another ML approach: Nanocall, Nanopolish for Methylation
Align the raw signal from Oxford Nanopore or PacBio reads to a reference genome to improve variant detection, methylation detection, and/or polishing: Nanopolish
Develop a signal-level analysis tool to detect repeat expansions/contractions from nanopore data, especially for cancer or autism genomics: NanoSatellite, Dynamic Time Warping
Develop a new application for ONT adaptive sequencing using UNCALLED or ReadFish
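For the signal-level ideas above that mention Dynamic Time Warping, a minimal sketch of the classic DTW recurrence may help as a starting point. It compares two 1-D signals (for example, a raw current trace and an expected reference trace) using an absolute-difference cost. The function name `dtw_distance` is hypothetical, and this is not the implementation used by NanoSatellite or Nanopolish; real tools typically add banding, normalization, and event segmentation.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences.

    Runs in O(len(a) * len(b)) time and memory; the cost of matching two
    samples is their absolute difference (an assumption for this sketch).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: stretch a, stretch b, or advance both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Because DTW allows one sample to match several samples of the other signal, a trace that dwells longer on one level (as nanopore signals often do) can still align to its reference at zero cost.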

Genome assembly and variant calling

Extend GenomeScope to infer genome characteristics from long, error-prone reads: GenomeScope2
Benchmark and/or develop a de novo assembler for polyploid genomes. Note that you should use existing tools for overlapping and focus on unitigging/scaffolding: FALCON, HiCanu
Benchmark several SV callers, and create a classifier for identifying high confidence from low confidence SV calls: Parliament2
Develop an enhanced SV calling algorithm that leverages known SVs: 1000 Genomes SVs, Paragraph
Jointly analyze (de novo assemble or index the raw reads) the genomes of multiple varieties of some species (human, rice, etc) at once to identify sequences not present in the reference genome: PanRice; Population BWT
Study the performance of the MinHash algorithm for binning long reads for copy number analysis or other genomics applications: Mash
Optimize an assembler and/or SV caller for Nanopore ultralong reads and/or PacBio HiFi reads: Telomere-to-telomere consortium, HiFi reads
Develop methods for detecting mis-assemblies or consensus errors from long read data: Quast, Clair
Develop a Nanopore-based algorithm that leverages ReadUntil to dynamically assemble a genome with minimal coverage: BOSS-RUNS
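For the MinHash binning idea above, the following is a minimal sketch of a bottom-s MinHash sketch and the Jaccard estimate computed from two sketches, in the spirit of Mash. It makes several simplifying assumptions: the helper names (`kmers`, `hash64`, `sketch`, `jaccard_estimate`) are hypothetical, `hashlib.blake2b` stands in for the hash function a production tool would use, and reverse-complement (canonical) k-mers are ignored.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers in seq (no canonicalization)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Deterministic 64-bit hash of a k-mer (blake2b is a stand-in choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=5, s=64):
    """Bottom-s sketch: the s smallest k-mer hashes of seq, sorted."""
    return sorted(hash64(x) for x in kmers(seq, k))[:s]


def jaccard_estimate(sk_a, sk_b, s=64):
    """Estimate Jaccard similarity from two bottom-s sketches.

    Takes the s smallest hashes of the merged sketches and counts how many
    of them occur in both inputs.
    """
    merged = sorted(set(sk_a) | set(sk_b))[:s]
    if not merged:
        return 0.0
    shared = set(sk_a) & set(sk_b)
    return sum(1 for h in merged if h in shared) / len(merged)
```

With error-prone long reads, the choice of `k` and `s` strongly affects how many k-mers survive sequencing errors, so both are natural parameters to vary in a benchmark.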

Functional Genomics

Benchmark different RNA-seq aligners when using a phased diploid genome compared to a standard reference genome: Allele-Seq
Develop an RNA-seq aligner/pipeline that incorporates variants known from the population (graph genome, pangenomes): GraphGenomes
Develop methods to identify genome variants from RNA-seq data, and apply them to individuals with many tissues profiled to identify somatic mutations: SNPiR
Run ChromHMM/Segway on a phased diploid genome (NA12878) and evaluate how that compares to annotating the reference genome: ChromHMM
Run ChromHMM/Segway on a non-human species such as rice or Arabidopsis: ChromHMM, Segway Protocol
Develop a ChromHMM/Segway postprocessing algorithm to label the states with their biological functions: Segway Protocol
Explore how single cell analysis works with minimal amounts of coverage. For example, reproduce the results from the Monocle paper, and experiment with how well it performs using lower amounts of coverage: Monocle
Benchmark how different single cell pipelines work at recognizing different cell types: MetaNeighbor

Evolution and Disease Genomics

Benchmark different non-coding mutation analysis schemes on a collection of diseased genomes (cancer, autism, etc): CADD, funSeq2, fitCons
Develop methods for identifying somatic mutations using high error long reads (PacBio or Oxford Nanopore): Short read benchmarks
Metagenomics: Benchmark sailfish/salmon/kallisto approaches for inferring the abundance of different species present in a population: Meta-kallisto
Develop a new metagenomics classifier using deep learning or other advanced ML techniques: PhymmBL
Benchmark and/or develop a method for inferring the ethnicity of an individual from their genotype: Genealogical DNA test
Apply metagenomics approaches to identify species present in food samples or correlated with diseases: AllFoodSeq, Centrifuge
Investigate the rate of heterozygosity within and among human populations using consortium data: Variation in Heterozygosity Predicts…
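For the heterozygosity idea above, a first pass could simply count the fraction of called diploid genotypes that are heterozygous. The sketch below assumes genotypes are given as VCF-style `GT` strings such as `0/1` or `1|1`. The function names `is_het` and `heterozygosity` are hypothetical, and the result is a fraction of called sites in the input, not a per-base rate across the genome.

```python
import re


def is_het(gt):
    """Classify a VCF-style GT string.

    Returns True for heterozygous, False for homozygous, and None when the
    genotype is missing (contains '.') or is not diploid.
    """
    alleles = re.split(r"[/|]", gt)
    if len(alleles) != 2 or "." in alleles:
        return None
    return alleles[0] != alleles[1]


def heterozygosity(genotypes):
    """Fraction of called diploid genotypes that are heterozygous."""
    calls = [h for h in map(is_het, genotypes) if h is not None]
    return sum(calls) / len(calls) if calls else float("nan")
```

Applied per individual and then grouped by population label, this gives a simple summary that can be compared within and among populations.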

CS Theory and Systems

Adapt one or more learned data structures for genomics data: Learned Data Structures
Accelerate an important genomics pipeline using GPUs or cloud computing and use that to study a larger dataset: Rail-RNA
Implement a genomics processing pipeline using WebAssembly and/or Objective-C/Swift/Android: fastq.bio
Develop/apply a scalable data structure for genomics: Sequence Bloom Trees; Mantis
Develop a novel visualization of genomics data (especially from 23andMe reports or single cell data): Circos
Apply deep learning techniques to a problem in genomics: Primer
Develop a novel fastq/BAM compression scheme for long-term storage (which may require a large precomputed dictionary and/or extensive compute)

Or your own idea!

This should be more than you are already doing for your PhD work, but can be a novel twist to a dataset/idea you are already using. If you have a research idea but not the right data, let me know and I’ll help you find some.

Data Resources

Pointers to open access data sets to help get you started!

Human

1000 Genomes: open access DNA sequencing reads for 3202 diverse human genomes: https://pubmed.ncbi.nlm.nih.gov/36055201/
Simons Genome Diversity Project (SGDP): open access DNA sequencing reads from 279 diverse human genomes: https://www.nature.com/articles/nature18964
Personal Genome Project (PGP): self-reported 23andMe reports (DNA data along with other phenotype data): https://www.personalgenomes.org/
T2T Genome: https://github.com/marbl/CHM13
Genome in a Bottle (GIAB): diverse sequencing data from several trios: https://github.com/genome-in-a-bottle
Human Pangenome Reference Consortium (HPRC): long read data from up to 350 individuals: https://humanpangenome.org/data-and-resources/
ENCODE: functional genomics data of all types: https://www.encodeproject.org/
GTEx: variants and expression data from ~1000 people in dozens of tissues (raw data is protected, but expression levels are open access): https://gtexportal.org/home/
ICGC Data Portal: somatic variants (open) and germline variants (protected) in many patients: https://dcc.icgc.org/
SG-NEx: long read RNA-seq data from many samples: https://github.com/GoekeLab/sg-nex-data

Dog

Darwin’s Ark: VCF file from >600 varieties of dogs: https://www.science.org/doi/10.1126/science.abk0639; https://data.broadinstitute.org/DogData/
