

Presentations

2019


129. 100 Genomes in 100 Days: The Structural Variation Landscape of Tomato Genomes.
PAG 2019 San Diego, CA. Jan 15, 2019


2018


128. Advances in Genome Sequencing and Assembly.
UCLA Computational Genomics Summer Institute UCLA, CA. July 17, 2018

127. Analyzing -omic Instability in breast cancer with nanopore sequencing of patient-derived organoids.
Nanopore Day DC College Park, MD. March 22, 2018

126. Analyzing -omic Instability in breast cancer with nanopore sequencing of patient-derived organoids.
Advances in Genome Biology and Technology (AGBT) Orlando, FL. Feb 15, 2018

125. In pursuit of perfect personal genomes.
Advances in Genome Biology and Technology (AGBT) Orlando, FL. Feb 13, 2018

124. Phased diploid genomes using short, long and linked reads.
Plant and Animal Genomes Conference (PAG XXVI) San Diego, CA. Jan 16, 2018

123. Reference-quality diploid genomes without de novo assembly.
Plant and Animal Genomes Conference (PAG XXVI) San Diego, CA. Jan 16, 2018


2017


122. In pursuit of perfect genome sequencing.
Institute for Genome Sciences Baltimore, MD. Dec 7, 2017

121. In pursuit of perfect genome sequencing.
PacBio Users Meeting Baltimore, MD. June 28, 2017

120. In pursuit of perfect genome sequencing.
Joint Institute for Metrology in Biology Stanford, CA. May 22, 2017

119. Personalized Phased Diploid Genomes of the EN-Tex Samples.
Advances in Genome Biology and Technology (AGBT) Hollywood, FL. Feb 15, 2017

118. Heterozygosity, Phased Genomes, and Personalized-omics.
University of Maryland College Park, MD. Jan 12, 2017


2016


117. Accurate and fast detection of complex and nested structural variations using long read technologies. Presented by Fritz Sedlazeck
Biological Data Science Cold Spring Harbor, NY. Oct 27, 2016

116. Scikit-ribo reveals precise codon-level translational control by dissecting ribosome pausing and codon elongation. Presented by Han Fang
Biological Data Science Cold Spring Harbor, NY. Oct 27, 2016

115. Scikit-ribo - Accurate A-site prediction and robust modeling of translational control. Presented by Han Fang
Advances in Genome Biology and Technology (AGBT) Orlando, FL. Feb 10-13, 2016

114. SplitThreader: A graphical algorithm for analysis of highly rearranged and amplified cancer genomes. Presented by Maria Nattestad
Advances in Genome Biology and Technology (AGBT) Orlando, FL. Feb 10-13, 2016

113. Recurrent noncoding regulatory mutations in pancreatic ductal adenocarcinoma. Presented by Tyler Garvin
Advances in Genome Biology and Technology (AGBT) Orlando, FL. Feb 10-13, 2016

112. The Resurgence of Reference Quality Genomes.
Plant and Animal Genomes Conference (PAG XXIV) San Diego, CA. Jan 12, 2016

111. Analysis of Structural Variations using 3rd gen Sequencing.
Plant and Animal Genomes Conference (PAG XXIV) San Diego, CA. Jan 12, 2016


2015


110. Comprehensive Genome and Transcriptome Structural Analysis of a Breast Cancer Cell Line using PacBio Long Read Sequencing. Presented by Maria Nattestad
Genome Informatics Cold Spring Harbor, NY. Oct 28-31, 2015

109. Ginkgo—Interactive analysis and quality assessment of single-cell CNV data. Presented by Robert Aboukhalil
Genome Informatics Cold Spring Harbor, NY. Oct 28-31, 2015

108. Scikit-ribo - Accurate A-site prediction and robust modeling of translational control. Presented by Han Fang
Genome Informatics Cold Spring Harbor, NY. Oct 28-31, 2015

107. Single Cell and Single Molecule Approaches to Studying Cancer.
JHU Genomics Symposium Baltimore, MD. Oct 22, 2015

106. Comprehensive Genome and Transcriptome Structural Analysis of a Breast Cancer Cell Line using PacBio Long Read Sequencing. Presented by Maria Nattestad
American Society of Human Genetics (ASHG) Baltimore, MD. Oct 7, 2015

105. Single cell and single molecule approaches for studying cancer.
NYU Genomics Symposium New York, NY. May 22, 2015

104. The resurgence of reference quality genomes.
Laufer Center for Quantitative Biology Stony Brook University, Stony Brook, NY. April 20, 2015

103. The resurgence of reference quality genomes.
University of Minnesota St. Paul, MN. April 9, 2015

102. Algorithms for studying the structure and function of genomes.
UNAM LIIGH Queretaro, Mexico. April 7, 2015

101. Algorithms for Single Cell and Single Molecule Biology.
Simons Foundation New York, NY. March 27, 2015

100. Single Cell Copy Number Analysis. Presented by Robert Aboukhalil
VIZBI 2015 Cambridge, MA. March 24, 2015

99. Error Correction and Assembly of Oxford Nanopore Sequencing. Presented by James Gurtowski
AGBT Marco Island, FL. Feb 27, 2015

98. PacBio Long Read Sequencing and Structural Analysis of a Breast Cancer Cell Line. Presented by W. Richard McCombie
AGBT Marco Island, FL. Feb 27, 2015

97. Part II: Algorithms for studying the structure and function of genomes (CS)
Johns Hopkins University Baltimore, MD. Feb 6, 2015

96. Part I: Algorithms for studying the structure and function of genomes (Biology)
Johns Hopkins University Baltimore, MD. Feb 5, 2015

95. The resurgence of reference quality genomes.
Plant and Animal Genome XXIII (PAG) San Diego, CA. Jan 13, 2015

94. Hybrid Error Correction and De Novo Assembly with Oxford Nanopore.
Plant and Animal Genome XXIII (PAG) San Diego, CA. Jan 13, 2015

93. Sugarcane Genome De Novo Assembly Challenges. Presented by Hayan Lee
Plant and Animal Genome XXIII (PAG) San Diego, CA. Jan 13, 2015

92. The resurgence of reference quality assemblies using 3rd generation sequencing.
Penn State State College, PA. Jan 6, 2015


2014


91. The resurgence of reference quality assemblies using 3rd generation sequencing.
American Museum of Natural History New York, NY. Dec 9, 2014

90. Assembly Quality
USDA/ARS Workshop Cold Spring Harbor, NY. Dec 8, 2014

89. Reducing INDEL calling errors in whole genome and exome sequencing data. Presented by Han Fang
Biological Data Sciences Cold Spring Harbor, NY. Nov 5-8, 2014

88. Biological Data Sciences Introduction
Biological Data Sciences Cold Spring Harbor, NY. Nov 5-8, 2014

87. Error Correction and Assembly of Single Molecule Sequencing Data. Presented by James Gurtowski
Genome Informatics Cambridge, UK. Sept 23, 2014

86. Assembly Workshop Roundtable
Genome Reference Consortium Cambridge, UK. Sept 20, 2014

85. Pan-genomics: theory and practice
Genome Reference Consortium Cambridge, UK. Sept 20, 2014

84. Special Seminar: Algorithms for Genome Sequencing and Disease Analytics
CSHL Cold Spring Harbor, NY. Sept 9, 2014

83. SplitMEM: Graphical Pan-Genome Analysis with Suffix Skips. Presented by Shoshana Marcus
HITSeq/ISMB Boston, MA. July 11, 2014

82. Optimizing Eukaryotic De Novo Genome Assembly. Presented with James Gurtowski and Sergey Koren
PacBio Webinar Menlo Park, CA. June 26, 2014

81. Big Data Meets DNA: How Biological Data Sciences is improving our health, foods, and energy needs [Flyer]
CSHL Public Lectures Cold Spring Harbor, NY. June 18, 2014

80. Near perfect de novo assemblies of eukaryotic genomes using PacBio long read sequencing. Presented by James Gurtowski
Sequencing, Finishing, and Analysis in the Future Meeting Santa Fe, NM. May 29, 2014

79. Big Data Meets DNA
Procter and Gamble Mason, OH. May 22, 2014

78. SplitMEM: Graphical Pan Genome Analysis with Suffix Skips. Presented by Shoshana Marcus
CSHL Quantitative Biology Seminar Series Cold Spring Harbor, NY. May 5, 2014

77. Big Data Meets DNA
IEEE Fellows Night Dinner [Flyer] Syracuse, NY. April 8, 2014

76. Genome Assembly and Disease Analytics
Department of Biology. Hamilton College. Clinton, NY. April 7, 2014

75. The next 10 years of quantitative biology
Keystone Symposia: Big Data in Biology. San Francisco, CA. March 25, 2014

74. Genome Assembly and Disease Analytics
University of Florida Genetics Institute. Gainesville, FL. March 11, 2014

73. Genome Assembly and Disease Analytics
Center for Computational and Molecular Biology Brown University. Providence RI. Feb 19, 2014

72. A near perfect de novo assembly of a eukaryotic genome. Presented by W. Richard McCombie
Advances in Genome Biology and Technology (AGBT) Marco Island, FL. Feb 14, 2014

71. Genome Assembly and Disease Analytics
Institute for Data Intensive Engineering and Science Johns Hopkins University. Baltimore MD. Feb 11, 2014

70. KBase: Variation and RNA-seq services.
Plant and Animal Genomes (PAGXXII) San Diego, CA. Jan 14, 2014

69. De novo assembly of complex genomes using single molecule sequencing.
Plant and Animal Genomes (PAGXXII) San Diego, CA. Jan 14, 2014


2013


68. SCALPEL: Micro-assembly approach to detect indels within exome-capture data.
Genome Informatics Cold Spring Harbor, NY. Oct 31, 2013

67. Algorithms for the analysis of complex genomes.
Cold Spring Harbor In-house seminar series Cold Spring Harbor, NY. Oct 10, 2013

66. De novo assembly of complex genomes.
Beyond the Genome Mission Bay Conference Center, San Francisco, CA. Oct 1-3, 2013

65. De novo assembly of complex genomes.
Institute for Computational Biomedicine Weill Cornell Medical College, New York, NY. Sept 18, 2013

64. De novo assembly of complex genomes.
Institute for Genomic Biology Seminar UIUC, Urbana, IL. Sept 10, 2013

63. KBase Variation Services with James Gurtowski.
KBase Webinar. CSHL, Cold Spring Harbor, NY. June 28, 2013

62. Hybrid De Novo Assembly of Eukaryotic Genomes. Presented by James Gurtowski
PacBio Users Meeting. University of Maryland School of Medicine, Baltimore, MD. June 18, 2013

61. Genome Sequencing and Assembly.
Human Microbiome Consortium Virtual Meeting: Approaches in Microbiome Assembly. University of Maryland School of Medicine, Baltimore, MD. May 2, 2013

60. IT Considerations: Hurdles and Solutions.
Developing a Neuroscience Consortium. Banbury Center, Cold Spring Harbor, NY. April 29, 2013

59. De novo assembly of complex genomes.
University of Virginia. Charlottesville, VA. April 10, 2013

58. Cloud-scale Sequence Analysis.
New York Genome Center / AWS. New York, NY. Mar 18, 2013.

57. Assembling Crop Genomes with Single Molecule Sequencing.
AGBT 2013. Marco Island, FL. Feb 22, 2013.


2012


56. Human Genetics and Plant Genomics: The long and the short of it.
CSHL In House Symposium XXVI. Cold Spring Harbor, NY. Nov 20, 2012.

55. De novo assembly of complex crop genomes.
PacBio Users Meeting. Menlo Park, CA. Oct 18, 2012.

54. Assembling crop genomes with 2nd and 3rd generation sequencing.
Strategies for de novo assemblies of complex crop genomes. The Genome Analysis Centre, Norwich, England. Oct 8, 2012

53. Genome Assembly and Alignment Primer.
Beyond the Genome. Harvard Medical School, Boston, MA. Sept 27, 2012

53. Illuminating the genetics of complex human diseases.
Beyond the Genome. Harvard Medical School, Boston, MA. Sept 27, 2012

52. De novo assembly of complex genomes.
Purdue Statistical Bioinformatics Seminar Series. Purdue University, West Lafayette, IN. Sept 18, 2012

51. SMRT-assembly: Error correction and de novo assembly of complex genomes using single molecule, real time sequencing.
Biology of Genomes. Cold Spring Harbor, NY. May 10, 2012

50. Scalable Solutions for 2nd and 3rd generation sequencing.
NYU HiTS Seminar. New York, NY. March 29, 2012

49. Entering the era of mega-genomics.
JGI Users Meeting. Walnut Creek, CA. March 20, 2012

48. Entering the era of mega-genomics.
iPlant Tech Talk. Webinar. March 13, 2012

47. Entering the era of mega-genomics.
UNC Charlotte. Charlotte, NC. March 2, 2012

46. Metassembler: Improving de novo genome assembly.
AGBT. Marco Island, FL. Feb 12, 2012

45. Entering the era of mega-genomics.
Pioneer. Des Moines, IA. Feb 7, 2012

44. SMRT-assembly: Error correction and de novo assembly of complex genomes using SMRT sequencing.
PAG-XX: PacBio Workshop. San Diego, CA. Jan 17, 2012

43. De novo assembly of complex genomes using 3rd generation sequencing.
PAG-XX: Sequencing Complex Genomes. San Diego, CA. Jan 15, 2012


2011


42. Applications of micro-, mega-, and meta-assembly.
CSHL In house seminar series. Cold Spring Harbor, NY. Dec 9, 2011.

41. Applications of micro-, mega-, and meta-assembly.
CSHL Genome Informatics. Cold Spring Harbor, NY. Nov 3, 2011.

40. Frontiers in Genomics: Answering the demands of digital genomics.
Frontiers of Genomics, Center for Genomic Sciences, University of Mexico. Cuernavaca Mexico. Oct 3-4, 2011.

39. Beyond the genome: Answering the demands of digital genomics. [Demo] [Panel]
Beyond the Genome. Rockville MD. Sept 20, 2011.

38. SMRT-assembly approaches.
PacBio Users Meeting. Menlo Park, CA. Sept 7, 2011.

38. Rapid Parallel Genome Indexing using MapReduce.
2nd Annual Workshop on MapReduce. HPDC 2011, San Jose, CA. June 8, 2011.

37. Cloud Computing and the DNA Data Race.
Emerging Computational Methods for the Life Sciences. HPDC 2011, San Jose, CA. June 8, 2011.

36. Cloud Computing and the DNA Data Race.
Data-Intensive Analysis, Analytics, and Informatics. Pittsburgh Supercomputing Center, Pittsburgh, PA. Apr 14-15, 2011.

35. Cloud Computing and the DNA Data Race.
Columbia University Medical Center. New York, NY. Mar 28, 2011.

34. Assembly and Validation of Large Genomes from Short Reads.
2011 Genome Assembly Workshop & Genome 10k Project Meeting. Chaminade, Santa Cruz, CA. Mar 14-16, 2011.

33. Cloud Computing and the DNA Data Race.
Laufer Center for Physical and Quantitative Biology. Stony Brook University, Stony Brook, NY. Feb. 15, 2011.

2010


32. Ultra Large DNA Sequence Analysis.
2010 CSHL In House Symposium. CSHL. Cold Spring Harbor NY, Nov. 23, 2010.

31. Assembly in the Clouds.
9th IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS). CSHL. Cold Spring Harbor NY, Nov. 12, 2010.

30. Cloud Computing and the DNA Data Race: Theory and Practice.
Advanced Sequencing Technologies & Applications. CSHL. Cold Spring Harbor NY, Oct. 25, 2010.

29. Assembly in the Clouds.
Beyond the Genome. Harvard Medical School. Boston MA, Oct 11-13, 2010.

28. Cloud Technical Challenges.
Beyond the Genome. Harvard Medical School. Boston MA, Oct 11-13, 2010.

27. Assembly of Large Genomes using Cloud Computing / How to compute with 1000s of cores.
Illumina Sequencing Panel. Toronto, ON Canada. July 23, 2010.

26. Design Patterns for Efficient Graph Algorithms in MapReduce.
Presented by Jimmy Lin
Hadoop Summit 2010. Santa Clara, CA. June 29, 2010.

25. Cloud Computing and the DNA Data Race.
Mayo Clinic Genomics Interest Group. Rochester, MN. June 16, 2010.

24. High Performance Computing for DNA Sequence Alignment and Assembly.
Stone Ridge Technology. Bel Air, MD. May 18, 2010.

23. Computational Architecture of Cloud Environments.
NHGRI Cloud Computing Meeting. NHGRI. Bethesda, MD. April 1, 2010.

22. Scalable Solutions for DNA Sequence Analysis.
Cold Spring Harbor Laboratory. Cold Spring Harbor, NY. March 23, 2010.

21. CloudBurst, Crossbow, and Contrail: Scaling Up Bioinformatics with Cloud Computing.
Sequencing Data Analysis and Storage, XGen Conference. San Diego, CA. March 15, 2010.

One of the main challenges for computational biologists is creating efficient algorithms to match improvements in high throughput sequencing. Here we describe how CloudBurst and Crossbow use cloud computing for mapping and genotyping whole human genomes at deep coverage in an afternoon. We'll also describe how our new program, Contrail, uses cloud computing to scale up de Bruijn graph construction and analysis for the assembly of large genomes from short reads.
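
As a rough illustration of the de Bruijn graph construction mentioned in this abstract, here is a minimal single-machine sketch. It is not Contrail's implementation, which distributes this work with cloud computing; the function name `build_de_bruijn` and the choice of k are assumptions made only for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_de_bruijn(reads: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Build a de Bruijn graph as adjacency sets.

    Every k-mer in every read contributes one edge from its
    (k-1)-base prefix to its (k-1)-base suffix.
    """
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        # Slide a window of width k across the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    # Two overlapping toy reads; real inputs would be millions of short reads.
    for node, successors in sorted(build_de_bruijn(["ACGT", "CGTA"], k=3).items()):
        print(node, "->", sorted(successors))
```

Overlapping reads share k-mers, so their edges merge into one path through the graph; an assembler then looks for unambiguous paths to report as contigs.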


20. Scalable Solutions for DNA Sequence Analysis.
Argonne Leadership Computing Seminar, Argonne National Lab. Argonne, IL. March 11, 2010.

19. Scalable Solutions for DNA Sequence Analysis.
NHGRI/UMD Computational Workshop, NHGRI. Rockville, MD. Jan. 29, 2010.

2009


18. Scalable Solutions for DNA Sequence Analysis.
DC Hadoop Users Group. College Park, MD. Dec. 16, 2009.

17. Scalable Solutions for DNA Sequence Analysis.
JHU/UMD Joint Sequencing Meeting, JHU Biostatistics Department. Baltimore, MD. Dec. 4, 2009.

16. GPGPU and Cloud Computing for DNA Sequence Analysis.
Doctoral Showcase at SC09. Portland, OR. Nov. 19, 2009.

Recent advances in DNA sequencing technology have dramatically increased the scale and scope of DNA sequence analysis, but these analyses are complicated by the volume and complexity of the data involved. For example, genome assembly computes the complete sequence of a genome from billions of short fragments read by a DNA sequencer, and read mapping computes how short sequences match a reference genome to discover conserved and polymorphic regions. Both of these important computations benefit from recent advances in high performance computing, such as in MUMmerGPU, which uses highly parallel graphics processing units as high performance parallel processors, and in CloudBurst and Crossbow, which use the MapReduce framework coupled with cloud computing to parallelize read mapping across large remote compute grids. These techniques have demonstrated orders of magnitude improvements in computation time for these problems, and have the potential to make otherwise infeasible studies practical.


15. Commodity Computing in Genomics Research.
Workshop on Using Clouds for parallel computations in systems biology in SC09. Portland, OR. Nov. 16, 2009.

In the next few years the data generated by DNA sequencing instruments around the world will exceed petabytes, surpassing the amounts of data generated by large-scale physics experiments such as the Large Hadron Collider. Even today, the terabytes of data generated every few days by each sequencing instrument test the limits of existing network and computational infrastructures. Our project is aimed at evaluating whether cloud computing technologies and the MapReduce/Hadoop infrastructure can enable the analysis of the large data sets being generated. We will report on initial results in two specific applications: human genotyping and genome assembly using next generation sequencing data.


14. Commodity Computing in Genomics Research.
Presented by Mihai Pop. Authored by Mihai Pop, Michael Schatz, and Dan Sommer.
NSF CLuE PI Meeting. Mountain View CA. Oct 5, 2009.

13. Upper bounds on the ability to reconstruct prokaryotic genomes with next generation sequencing technologies.
Presented by Mihai Pop. Authored by Joshua Wetzel, Michael Schatz, Carl Kingsford, and Mihai Pop.
WABI 2009. Philadelphia, PA. Sept 12, 2009.

12. Assembly Boot Camp.
UMD Institute for Genome Sciences. Baltimore, MD. June 26, 2009.

The theory and practice of genome assembly using the Celera Assembler. Special emphasis is given to tuning the parameters and settings to get the best results for your data.


11. High Throughput Sequence Analysis with MapReduce.
J. Craig Venter Institute Informatics Seminar. Rockville, MD. June 18, 2009.

MapReduce is the parallel distributed computing framework developed by Google for large data computations, including analyzing their collection of more than 1 trillion web pages on clusters with tens of thousands of nodes. This system enables rapid development of highly scalable applications, because developers write just a few application-specific functions, and the system automatically and intelligently provides the scheduling, monitoring, and partitioning necessary to scale to this size. Furthermore, MapReduce is becoming a de facto standard for executing large computations within the cloud, where remote compute resources are used generically under a pay-as-you-go pricing model.

In this presentation, I will describe the leading open-source implementation of MapReduce called Hadoop, the cloud computing capabilities of Amazon, and outline MapReduce-based sequence analysis algorithms for read alignment, SNP discovery, and genome assembly. Scalable algorithms for these problems are essential given that current sequencing technologies routinely generate tens or hundreds of gigabytes of data for a single experiment, and can require hundreds or thousands of hours of computation. The results show MapReduce is an extremely effective system for analyzing these datasets, with near linear speedups as the size of the cluster grows. Furthermore, the Amazon compute cloud can be an efficient and cost-effective resource, especially for periodic or unusually large compute tasks.
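
To make the "few application-specific functions" idea concrete, here is a minimal single-process sketch of the MapReduce pattern applied to k-mer counting. It does not use the Hadoop API; the `shuffle` helper merely stands in for the grouping that the framework would perform across a cluster, and all names and the value of `K` are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

K = 3  # hypothetical k-mer length chosen for this example


def map_read(read: str, k: int = K) -> Iterator[Tuple[str, int]]:
    """Map step: emit (k-mer, 1) for every k-mer in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group values by key (handled by the framework on a real cluster)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: sum all counts emitted for one k-mer."""
    return key, sum(values)


def count_kmers(reads: Iterable[str], k: int = K) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence within a single process."""
    emitted = (pair for read in reads for pair in map_read(read, k))
    grouped = shuffle(emitted)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(count_kmers(["ACGTA", "CGTAC"]))
```

Only `map_read` and `reduce_counts` are specific to the problem; in a real deployment the framework would run many copies of each in parallel and handle the grouping, scheduling, and fault tolerance between them.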


10. Genetic Sequence Analysis in the Clouds.
Presented by Jimmy Lin at the Hadoop Summit 2009. Santa Clara, CA. June 10, 2009.

9. CloudBurst: Highly Sensitive Read Mapping with MapReduce.
Amazon Web Services Start Up Event. Washington, DC. May 27, 2009.

8. High Throughput Sequence Alignment using Graphics Processing Units.
UMD nVidia Partnership Workshop. College Park, MD. May 21, 2009.

2008 and earlier


7. Revealing Biological Modules via Graph Summarization.
Presented by Saket Navlakha at the RECOMB-SB/RG/DREAM3 2008 satellite conference. Boston, MA. Oct. 30, 2008.

A technique called Graph Summarization (GS) can be used to partition protein-protein interaction networks to reveal modules that are more biologically relevant than the clusters produced by other graph partitioning techniques. We apply GS to predict Gene Ontology annotations of biological process for proteins of unknown annotations. We also apply it to detecting membership in protein complexes, as annotated in the MIPS catalog. GS outperforms other approaches such as MCODE, MCL and modularity.


6. Hunting Down the Papaya Transgenes.
Plant and Animal Genomes Conference-XVI. San Diego, CA. Jan. 16, 2008.

In the middle of the last century, the Papaya ringspot potyvirus (PRSV) devastated the papaya industry on the island of Oahu in Hawaii and in other fields throughout the world. With the imminent threat of the disease spreading to the fields in the Puna district of Hawaii island, researchers in the mid 1980s developed PRSV-resistant transgenic lines of papaya using the pathogen-derived resistance approach, in which genes from PRSV were inserted into the papaya genome using a gene gun. The commercialization of these transgenic lines in the late 1990s virtually saved the Hawaiian papaya industry, but without a full genome sequence, there was lingering concern as to the exact nature of the transgenic insertions.

In my presentation, I will report on the draft genome sequence of the virus-resistant ‘SunUp’ papaya, created in collaboration with the University of Hawaii, the University of Illinois at Urbana-Champaign, and other institutions. I will focus on the computational methods used for assembling the genome, validating its correctness, and the subsequent search for transgenic inserts. Our genome wide analysis, combined with Southern blot analysis and directed PCR, confirms the efficiency of the gene gun technology, with only 3 conclusive transgenic insertions. In addition, even though the papaya genome is nearly twice the size of the Arabidopsis genome, it contains fewer genes, and thus makes it an excellent candidate for further study of biosynthetic pathways and networks.


5. High-throughput sequence alignment using Graphics Processing Units.
Co-presented with Cole Trapnell at the CBCB Seminar. Univ of Maryland. Sept. 20, 2007

The recent availability of new, less expensive high-throughput DNA sequencing technologies has yielded a dramatic increase in the volume of sequence data that must be analyzed. Sequence alignment programs such as MUMmer have proven essential for analysis of these data, but researchers will need ever faster, high-throughput alignment tools running on inexpensive hardware to keep up with new sequence technologies. We present MUMmerGPU, a high-throughput parallel sequence alignment program that runs on commodity Graphics Processing Units (GPUs) in common workstations. MUMmerGPU uses the new Compute Unified Device Architecture (CUDA) from nVidia to align multiple query sequences against a single reference sequence stored as a suffix tree. By processing the queries in parallel on the highly parallel graphics card, MUMmerGPU achieves more than a 10-fold speedup over a serial CPU version of the sequence alignment kernel, and outperforms MUMmer by more than 3-fold in total application time when aligning reads from recent sequencing projects using Solexa/Illumina, 454, and Sanger sequencing technologies.


4. Interactive visual analytic tools for genome assembly.
9th Annual Computational Genomics Conference. Baltimore, MD. Oct. 29, 2006.

Genome assembly remains an inexact science. Even when accomplished with the best software available, the assembly of a genome often contains numerous errors, both small and large. Hawkeye is a visual analytics tool for genome assembly analysis and validation, designed to aid in identifying and correcting assembly errors. Hawkeye blends the best practices from information and scientific visualization to facilitate inspection of large-scale assembly data while minimizing the time needed to detect mis-assemblies and make accurate judgments of assembly quality.

All levels of the assembly data hierarchy are made accessible to users, along with summary statistics and common assembly metrics. A ranking component guides investigation towards likely mis-assemblies or interesting features to support the task at hand. Wherever possible, high-level overviews, dynamic filtering, and automated clustering are leveraged to focus attention and highlight anomalies in the data. Hawkeye's effectiveness has been proven on several genome projects, where it has been used both to improve quality and to validate the correctness of complex genomes.


3. AMOS Assembly Validation and Visualization.
The Institute for Genomic Research. Rockville, MD. April 7, 2006.

During my talk, I will discuss the techniques and tools to discover and correct misassemblies in genome assemblies. The three primary sources of information used to detect misassemblies are the "happiness" of the mate-pairs, the base call agreement within the multiple alignment of reads, and the depth of coverage of those reads.

The open source AMOS assembly package provides tools for systematically analyzing these qualities to discover regions with potential misassemblies. The AMOS Assembly Investigator is a powerful genome assembly visualizer with semantic zooming capabilities. It allows one to navigate and visually inspect these potential misassemblies in a systematic fashion at all levels of detail. Once regions with misassemblies have been identified, users can correct the misassemblies with the AMOS contig patching tools.


2. Improving Genome Assemblies without Sequencing.
CBCB Seminar. Univ. of Maryland. College Park, MD. Sept. 28, 2005.

Genome assembly is the problem of reconstructing the genome sequence of an organism from a collection of short sequenced reads. An assembly takes the form of contiguous stretches of DNA sequence (contigs) linked together in scaffolds by mate-pair and other information. Genome assembly is scientifically one of the most important areas of bioinformatics research, as an accurate genome sequence is needed for addressing several fundamental biological questions. Unfortunately, it is also one of the most computationally complex, having been proved NP-hard under various formalisms, with typical problem sizes of thousands or millions of inputs.

During my talk, I will discuss some of the algorithmic challenges and trade-offs in genome assembly. I will also discuss some computational methods for improving an assembly, which can be applied generally but without requiring additional laboratory results. One method was implemented in AutoEditor, which acts as a second generation base-caller to find and correct base-calling errors in reads using the original chromatogram trace and the multiple alignment of reads. A second was implemented in AutoJoiner, which attempts to automatically close gaps between linked contigs, and generally enhance contig quality, by extending the usable portion of reads within an assembly.


1. Improving Genome Assemblies without Sequencing.
The Institute for Genomic Research. Rockville, MD. April 25, 2005.