Hadoop Exercises
I recommend you first review the Hadoop developer tutorial at Yahoo:
http://developer.yahoo.com/hadoop/tutorial/index.html
I also recommend you download the Cloudera CDH4 virtual machine:
https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads#CDHDownloads-CDH4PackagesandDownloads
Exercise 1: Hadoop Pi
The first test with Hadoop will be to run an existing Hadoop program, to make
sure you can launch a job, monitor its progress, and get/put files on HDFS.
The simplest program estimates pi in parallel using 5 map tasks with 5 samples each:
$ hadoop jar /usr/lib/hadoop/hadoop-examples.jar pi 5 5
Question 1: What is the value of pi that it computes?
Exercise 2: Hadoop Word Count
The next program to test is the Hadoop word count program. This example reads
text files and counts how often words occur. The input is text files and the
output is text files, each line of which contains a word and the count of how
often it occurred, separated by a tab.
Each mapper takes a line as input and breaks it into words. It then emits a
key/value pair of the word and 1. Each reducer sums the counts for each word and
emits a single key/value pair with the word and its total. As an optimization, the
reducer is also used as a combiner on the map outputs. This reduces the amount of
data sent across the network by combining the counts for each word on a mapper
into a single record.
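To make the logic concrete, here is a rough sketch of the same map and reduce steps
written in Python, in the style used by Hadoop Streaming (which is introduced in
Exercise 3). This is only an illustration, not the implementation inside
hadoop-examples.jar; the file names mapper.py and reducer.py are placeholders, and it
assumes Streaming hands the reducer its input sorted by key.

#!/usr/bin/env python
# mapper.py: emit "word<TAB>1" for every word on every input line
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%d' % (word, 1))

#!/usr/bin/env python
# reducer.py: sum the counts for each word
# Streaming delivers map output sorted by key, so identical words arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print('%s\t%d' % (current_word, current_count))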
Before you can run the example, you'll have to copy some data into the
distributed filesystem (HDFS). Here we will create an input directory and copy
in the complete works of Shakespeare and the bible (a standard large corpus for
text mining).
The data file is also available for download - make sure to gunzip it after downloading:
bible+shakes.nopunc.gz
$ hadoop fs -mkdir /user/USERNAME/wordcount
$ hadoop fs -mkdir /user/USERNAME/wordcount/input
$ hadoop fs -put /bluearc/data/schatz/data/textmining/bible+shakes.nopunc /user/USERNAME/wordcount/input
To run the example, the command syntax is
$ hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount \
/user/USERNAME/wordcount/input \
/user/USERNAME/wordcount/output
After this completes, download the results to your local directory like this:
$ hadoop fs -get /user/USERNAME/wordcount/output output
Question 2: What are the top 10 most frequently used words in the corpus?
Hint: Use the Unix commands sort and head to scan the output files.
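For example, assuming each output line is a word and a count separated by a tab,
something like this should list the top 10 (the part-* file names may differ on your cluster):

$ sort -k2,2nr output/part-* | head -10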
Exercise 3: Hadoop Kmer Counting
The next exercise will be to implement a kmer counter using Hadoop. Conceptually
this is very similar to the word count program, but since there are no spaces in
a genome, we will count overlapping kmers instead of discrete words.
The idea is that if the genome is:
>chr1
ACACACAGT
and we are counting 3-mers, your map function will output:
ACA 1
CAC 1
ACA 1
CAC 1
ACA 1
CAG 1
AGT 1
The shuffle phase will sort the records so that pairs with the same key come one right after the other:
ACA 1
ACA 1
ACA 1
AGT 1
CAC 1
CAC 1
CAG 1
And your reducer will output:
ACA 3
AGT 1
CAC 2
CAG 1
You can implement this in Java, using the WordCount program as an example, or
you can use Hadoop Streaming to implement it in any language you would like.
The Hadoop Streaming documentation describes how to use it:
http://hadoop.apache.org/common/docs/r0.20.2/streaming.html
And here is a nice tutorial using Python:
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
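As a starting point, a Streaming mapper for kmer counting might look roughly like the
sketch below. This is an assumption-laden illustration: it treats each sequence line
independently (so kmers spanning line breaks in the FASTA file are missed), it reuses a
word-count style reducer like the one sketched in Exercise 2 to do the summing, and the
name kmer_mapper.py is just a placeholder.

#!/usr/bin/env python
# kmer_mapper.py: emit "kmer<TAB>1" for every overlapping kmer on each sequence line
import sys

K = 9  # kmer length; set K = 3 to reproduce the small example above

for line in sys.stdin:
    seq = line.strip().upper()
    if not seq or seq.startswith('>'):   # skip blank lines and FASTA headers
        continue
    for i in range(len(seq) - K + 1):
        print('%s\t%d' % (seq[i:i+K], 1))

It can be launched with something along these lines, where STREAMING_JAR points at the
hadoop-streaming jar on your install (the exact path varies between Hadoop packages):

$ hadoop jar $STREAMING_JAR \
    -input /user/USERNAME/kmercount/input \
    -output /user/USERNAME/kmercount/output \
    -mapper kmer_mapper.py -reducer reducer.py \
    -file kmer_mapper.py -file reducer.py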
The genome file is available here: ecoli.fa.gz
Question 3: What are the top 10 most frequently occurring 9-mers in E. coli?
Once this is working, you can test on the entire human genome, available here (hg19 means human genome, build 19):
/bluearc/data/schatz/data/genomes/hg19/hg19.fa
Question 4: What are the top 10 most frequently occurring 9-mers in HG19?
Exercise 4: Hadoop Kmer Frequency Map
The next step is to invert the kmer-frequency table so that the kmer counts are
shown at each position along the genome. This is needed so that we can see how
repeats in the genome line up with genes and other features.
For example, from the example above
>chr1
ACACACAGT
We want to construct this table (chromosome, offset, kmer-count)
chr1 1 3
chr1 2 2
chr1 3 3
chr1 4 2
chr1 5 3
chr1 6 1
chr1 7 1
This cannot be done in a single MapReduce cycle - we have to first count kmer
occurrences and then re-sort them by position. The easiest approach is to modify
your kmer counter to record the positions of each kmer, and then add a second
MapReduce step that inverts the index and uses the sort capabilities of Hadoop
to build the sorted map:
Input:
>chr1
ACACACAGT
Map Output 1 (mer, offset)
ACA 1
CAC 2
ACA 3
CAC 4
ACA 5
CAG 6
AGT 7
Output1 (cnt, offset-list):
3 chr1:1,chr1:3,chr1:5
2 chr1:2,chr1:4
1 chr1:6
1 chr1:7
Map2 output (chromosome, offset, cnt):
chr1 1 3
chr1 3 3
chr1 5 3
chr1 2 2
chr1 4 2
chr1 6 1
chr1 7 1
Shuffle and sort the final output (chromosome, offset, cnt):
chr1 1 3
chr1 2 2
chr1 3 3
chr1 4 2
chr1 5 3
chr1 6 1
chr1 7 1
Once this is done, it is straightforward to scan the catalog to find unique
regions. Here there is just one small unique region chr1:[6,7] but there could
be many.
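One way to sketch the second pass in Python for Streaming is shown below. It assumes the
first pass wrote lines shaped like the Output1 example above (a count, a tab, then a
comma-separated list of chrom:offset positions), and it relies on Hadoop's shuffle to do
the final sorting, using an identity reducer such as /bin/cat together with the numeric
sort options mentioned after the hint below. The name invert_mapper.py is a placeholder.

#!/usr/bin/env python
# invert_mapper.py: explode "cnt<TAB>chrom:off1,chrom:off2,..." into one
# "chrom<TAB>offset<TAB>cnt" record per position; Hadoop's sort then orders them.
import sys

for line in sys.stdin:
    cnt, offsets = line.rstrip('\n').split('\t', 1)
    for pos in offsets.split(','):
        chrom, offset = pos.split(':')
        print('%s\t%s\t%s' % (chrom, offset, cnt))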
Question 5: What are the top 10 longest unique regions in E. coli using k=21?
Hint: Check out Hadoop's secondary sort.
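With Hadoop Streaming, one way to get that behaviour is the KeyFieldBased partitioner and
comparator described in the streaming documentation linked above. Roughly (a sketch,
reusing the hypothetical invert_mapper.py from above): treat the first two tab-separated
fields as the key, partition on the chromosome, and compare the offsets numerically:

$ hadoop jar $STREAMING_JAR \
    -D stream.num.map.output.key.fields=2 \
    -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -D mapred.text.key.comparator.options="-k1,1 -k2,2n" \
    -D mapred.text.key.partitioner.options=-k1,1 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -input /user/USERNAME/kmermap/input -output /user/USERNAME/kmermap/output \
    -mapper invert_mapper.py -reducer /bin/cat -file invert_mapper.py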
Question 6: What are the top 10 longest unique regions in the whole human genome using k=21?
Since the map for the whole human genome will be so large (3 billion positions x
~20 bytes = 60 GB), we will probably want to scan it in parallel too, but then we
have to be very careful with regions that span the boundaries between map tasks. You
may also want to investigate Hadoop's compression options, since the runtime will be
roughly proportional to the amount of data shuffled.
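For example, with Streaming the map output compression can be switched on with the
MR1-era property names used by CDH4 (newer releases spell them mapreduce.*); the codec
choice is an assumption for you to benchmark on your cluster:

$ hadoop jar $STREAMING_JAR \
    -D mapred.compress.map.output=true \
    -D mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
    -input ... -output ... -mapper ... -reducer ...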