Quantitative Biology & Bioinformatics - 2010
Lecture 0: Research Topics in Biology: Scalable Solutions for DNA Sequence Analysis
Introduction to DNA sequence analysis, with an emphasis on large scale analysis
using cloud computing (parallel computing). Includes a description of
embarassingly parallel, loosely coupled, and tightly coupled problems, and
techniques for addressing each type using many computers.
Lecture 1: Exact Matching and Fundamental Topics in Computer Science
In this class we explored the problem of finding exact occurrences of a query
sequence in a large genome or database of sequences. Under this theme, we
started by analyzing the brute force approach introducing the concepts of
algorithm, complexity analysis, E-values, and programming in Matlab. Next we
discussed suffix arrays as an index for accelerating the search, including
analyzing the performance of binary search and its implementation in Matlab. We
also considered two traditional algorithms for sorting (Selection Sort versus
QuickSort) and their relative performance. Next we considered the question of if
it is possible to sort items in linear time (Bucket Sort), to segue towards
describing suffix trees and their applications. This also introduced the notion
of trees and their properties, and the fundamental depth-first-search algorithm.
Finally we considered hash tables and hash functions as way to transform very
large, intractable search domains into tractable ranges, with applications
towards developing a seed-and-extend search algorithm.
Lecture 2: Sequence Alignment & Applications
The theme of this class was the problem of finding approximate occurrences of a
short query sequence in a large genome or database of sequences. We first
defined the problem by considering various metrics of an approximate occurrence:
hamming distance, edit distance, and similarity score including BLOSSUM matrices
and affine gap penalties. We then considered different methods for computing
inexact alignments including dynamic programming for computing global &
local alignments and seed-and-extend algorithms.
In the second half of the class we discussed 3 specific sequence alignment
algorithms: (1) BLAST as a hash-table based seed-and-extend algorithm for
finding statistically significant sequence homology; (2) MUMmer as a suffix-tree
based seed-and-extend algorithm for whole genome alignment, introducing the
concepts of dot-plot and large scale genomic variations; and (3) Bowtie as a
Burrows-Wheeler short read mapping algorithm using a depth-first-search of the
implicit suffix array. We also discussed subsequent algorithms for interpreting
short read mapping results to detect polymorphisms, including the SOAPsnp model
of SNP detection, the SAMTools model of small indels, the RDexplorer model of
copy number variation detection, and the Hydra algorithm for detecting
large-scale indels and rearrangements.
Lecture 3: Graphs and Genome Assembly
The theme of this class was graphs and methods for graph analysis. The emphasis was on genome assembly but included a discussion of other
biological networks including PPI networks, regulation networks, neuron interaction networks, and cell cycle graphs. In the first half of
the class, we considered fundamental properties of graphs, such as nodes, edges, degrees, shortest paths, connected components, diameters,
small world networks, and scale free networks. We then examined in detail algorithms for searching graphs and contrasted depth-first-search
with breadth-first-search by considering LIFO stacks and FIFO queues. We next considered approaches for finding minimum cost paths through
weighed graphs (traveling salesman problem), including exhaustive search, greedy algorithms, and branch-and-bound. This lead to a discussion
of the intractable nature of NP-complete problems, and reviewed several important examples (vertex cover, clique finding, knapsack problem,
Hamiltonian cycle) and shortest-common-superstring as a model for genome assembly. We completed the first half by describing methods for
finding Eulerian tours, noting while they can be discovered in linear time there may be an exponential number of such tours.
In the second half of the class we focused on genome assembly starting with a review of DNA sequencing protocols, and the Lander-Waterman model
of shotgun sequencing with respect to coverage and connectivity. We then contrasted the overlap-layout-consensus model of genome assembly, as
exemplified by the Celera Assembler, with the de Bruijn genome approach, as exemplified by SOAPdenovo, and discussed the results of several
recent large scale genome assembly projects. The final section of the class described various techniques for detecting mis-assemblies
(so-called Assembly Forensics), and discussed the mate-pair compression/expansion statistics, read alignment heterogeneity, and read coverage
as signals for mis-assemblies.
Lecture 4: Gene Finding and HMMs
In this class we examined methods for ab initio gene prediction. We began with a discussion of detecting open reading frames (ORFs) in prokaryotes
and then using increasingly sophisticated probabilistic models for distinguishing between true genes and random ORFs: Fixed Order Markov models,
Interpolated Markov Models as implemented in Glimmer 1, and Interpolated Context Models as implemented in Glimmer 2. We concluded with a discussion
of using dynamic programming to select the highest scoring set of non-overlapping ORFs and reverse scoring to accurately determine 5’ start codons,
as implemented in Glimmer 3.
Next we reviewed the structure of eukaryotic genes, and how ORF graphs represent all possible gene models in a genome. We then examined Hidden
Markov Models (HMMs) in detail as a probabilistic model for scoring potential parses of the ORF graph, focusing on the classic HMM problems of
evaluation (including a concrete example using the Forward algorithm with a trellis), decoding (including a concrete example using the Viterbi
algorithm), and learning (including an overview of EM). We then considered how HMMs could be used for eukaryotic gene finding. We then discussed
how Generalized HMMs (GHMMs) overcome the limitations of HMMs, and presented the GHMM used by the leading method GlimmerHMM. We then considered
the weight matrices and features GlimmerHMM uses for evaluating potential start, splice, and stop sites.