Quantitative Biology & Bioinformatics - 2011
Lecture 0: Schatzlab Research Projects
Introduction to DNA sequence analysis, with an emphasis on large scale analysis
using cloud computing (parallel computing). Includes a description of
embarassingly parallel, loosely coupled, and tightly coupled problems, and
techniques for addressing each type using many computers.
Lecture 1: Computational Analysis Primer
Introduction to unix and unix scripting. Introduction to analysis of DNA replication.
Lecture 2: Exact Matching and Fundamental Topics in CS
In this class we explored the problem of finding exact occurrences of a query
sequence in a large genome or database of sequences. Under this theme, we
started by analyzing the brute force approach introducing the concepts of
algorithm, complexity analysis, E-values, and programming in Matlab. Next we
discussed suffix arrays as an index for accelerating the search, including
analyzing the performance of binary search and its implementation in Matlab. We
also considered two traditional algorithms for sorting (Selection Sort versus
QuickSort) and their relative performance. Next we considered the question of if
it is possible to sort items in linear time (Bucket Sort), to segue towards
describing suffix trees and their applications. This also introduced the notion
of trees and their properties, and the fundamental depth-first-search algorithm.
Finally we considered hash tables and hash functions as way to transform very
large, intractable search domains into tractable ranges, with applications
towards developing a seed-and-extend search algorithm.
Lecture 3: Sequence Alignment & Applications
The theme of this class was the problem of finding approximate occurrences of a
short query sequence in a large genome or database of sequences. We first
defined the problem by considering various metrics of an approximate occurrence:
hamming distance, edit distance, and similarity score including BLOSSUM matrices
and affine gap penalties. We then considered different methods for computing
inexact alignments including dynamic programming for computing global &
local alignments and seed-and-extend algorithms. In the second half of the class
we discussed the sequence alignment algorithms BLAST as a hash-table based
seed-and-extend algorithm for finding statistically significant sequence homology.
Lecture 4: Alignment and Assembly
In the first half of the class we covered two important applications of alignment:
(1) MUMmer as a suffix-tree based seed-and-extend algorithm for whole genome alignment, introducing the
concepts of dot-plot and large scale genomic variations; and (3) Bowtie as a
Burrows-Wheeler short read mapping algorithm using a depth-first-search of the
implicit suffix array. We also discussed subsequent algorithms for interpreting
short read mapping results to detect polymorphisms, including the SOAPsnp model
of SNP detection, the SAMTools model of small indels, the RDexplorer model of
copy number variation detection, and the Hydra algorithm for detecting
large-scale indels and rearrangements.
The theme of the second half of the class class was graphs and methods for graph
analysis. The emphasis was on genome assembly but included a discussion of other
biological networks including PPI networks, regulation networks, neuron
interaction networks, and cell cycle graphs. We first discussed fundamental
properties of graphs, such as nodes, edges, degrees, shortest paths, connected
components, diameters, small world networks, and scale free networks. We then
examined in detail algorithms for searching graphs and contrasted
depth-first-search with breadth-first-search by considering LIFO stacks and FIFO
queues. We next considered approaches for finding minimum cost paths through
weighed graphs (traveling salesman problem), including exhaustive search, greedy
algorithms, and branch-and-bound. This lead to a discussion of the intractable
nature of NP-complete problems, and reviewed several important examples (vertex
cover, clique finding, knapsack problem, Hamiltonian cycle) and
shortest-common-superstring as a model for genome assembly. We concluded with a
review of DNA sequencing protocols, and the Lander-Waterman model of shotgun
sequencing with respect to coverage and connectivity. We then contrasted the
overlap-layout-consensus model of genome assembly, as exemplified by the Celera
Assembler, with the de Bruijn genome approach, as exemplified by SOAPdenovo, and
discussed the results of several recent large scale genome assembly projects.
Lecture 5: Gene Finding and HMMs
In this class we examined methods for ab initio gene prediction. We began with a
discussion of detecting open reading frames (ORFs) in prokaryotes and then using
increasingly sophisticated probabilistic models for distinguishing between true
genes and random ORFs: Fixed Order Markov models, Interpolated Markov Models as
implemented in Glimmer 1, and Interpolated Context Models as implemented in
Glimmer 2. We concluded with a discussion of using dynamic programming to select
the highest scoring set of non-overlapping ORFs and reverse scoring to
accurately determine 5’ start codons, as implemented in Glimmer 3.
Next we reviewed the structure of eukaryotic genes, and how ORF graphs represent
all possible gene models in a genome. We then examined Hidden Markov Models
(HMMs) in detail as a probabilistic model for scoring potential parses of the
ORF graph, focusing on the classic HMM problems of evaluation (including a
concrete example using the Forward algorithm with a trellis), decoding
(including a concrete example using the Viterbi algorithm), and learning
(including an overview of EM). We then considered how HMMs could be used for
eukaryotic gene finding. We then discussed how Generalized HMMs (GHMMs) overcome
the limitations of HMMs, and presented the GHMM used by the leading method
GlimmerHMM. We then considered the weight matrices and features GlimmerHMM uses
for evaluating potential start, splice, and stop sites.
Frontiers in Genomics, National University of Mexico - 2011
Answering the demands of digital genomics
This lecture reviews the past, present, and future of genomics especially in light of the increasing
computational demands required for analyzing large volumes of biological sensor data.
Computational Genomics
This lecture consists of 3 main topics: (1) sequencing alignment, focusing on suffix arrays and
the burrows-wheeler transform; (2) genome assembly, reviewing the theoretical and practical aspects
of creating a good assembly; and (3) cloud computing as a platform for large scale sequence analysis.
Undergraduate Research Program in Bioinformatics - 2011
Lecture 2: Sequence Alignment and Computational Thinking
In this class we explored the problem of finding exact occurrences of a query
sequence in a large genome or database of sequences. Under this theme, we
started by analyzing the brute force approach introducing the concepts of
algorithm, complexity analysis, and E-values. Next we discussed suffix arrays
as an index for accelerating the search, including analyzing the performance of
binary search. We also considered two traditional algorithms for sorting
(Selection Sort versus QuickSort) and their relative performance. In the second
half of the class we discussed finding approximate occurrences of a short query
sequence in a large genome or database of sequences. We first defined the
problem by considering various metrics of an approximate occurrence such as
hamming distance, or edit distance. We then considered different methods for
computing inexact alignments including brute force global & local
alignments, and seed-and-extend algorithms. Finally we discussed Bowtie as a
Burrows-Wheeler transform based short read mapping algorithm for discovering
alignments to reference genome.
Lecture 3: Graphs and Genomes
The theme of this class was graphs and methods for graph analysis. The emphasis
was on genome assembly but included a discussion of other biological networks
including PPI networks, regulation networks, neuron interaction networks, and
cell cycle graphs. In the class, we considered fundamental properties of graphs,
such as nodes, edges, degrees, and shortest paths. We then examined in detail
algorithms for searching graphs with a with breadth-first-search, and then
approaches for finding minimum cost paths through weighed graphs (traveling
salesman problem), including exhaustive search, greedy algorithms, and
branch-and-bound. This lead to a discussion of the intractable nature of
NP-complete problems, and reviewed several important examples (vertex cover,
clique finding, knapsack problem, Hamiltonian cycle).
|