- June 23, 2020
ITNPBD7 Spring 2020 – Resit/Deferred Assessment

DNA Sequence Analysis

Your task for this assignment is to use Hadoop and the MapReduce approach to find the average number of letters between pairs of DNA tags across a sample genome. You are provided with two files: one containing the sample genome and the second containing the set of tag pairs that you should search for. Examples of some of the data held in these two files are given below:

Sample Genome

CAGGAAAGACAATTCCAAAATCAGTTAGAGTCCTGTTGGCGCGTGTAATACATCTCCACTTTGAAAATGAAGACAGGGGGTTACGAGTGTTATTAATGAG
TGGGAATGTAAATTAGTCCAGCCACTCTGGAGAACCGTATGGAGGTTCCTCCAAAAATTACAAATAGAACTACCATATGATCCAGCAATCCCATGCTATG
AGATTTCCCTGAGAAAGTCATATTTAAGCTGCCATTTGAAGACCAAGGAATCATGACTAGAGACAAGAAGAGAGAACATAGAGTGATTATGGAGAATCTT
AGTATCAGTCCAGTCCTCAGTGACGGGACCCTAACTGACCTGCCCTTCTTTGGCTTAGATTGCTTAAATGGTTCTGGATGTGATGATGGTGCACCTTGCC
TATATTAGAGTAGAGTCTAAAGATTAGAATGATCCACAGGTTAATATGGGCCATTATAAAGAGATTAGTGATATTAACAATNTAGTATCAACATGGAGAT
TCTATTATTTCATTGGGGTTGCAAAATTGTGATTTTCTAATCATTTCACTTTTCCTATATTTATTGCCTGGAACTTTGTAAAGAAGAAATTGATCTTATT

Sample Start/End Tag Pairs

CAG,AGA
CCA,TGT
TGG,TCA
TGG,TCC
TGG,TCT
CCA,TGA
CCA,TGC
CCA,TGG
GTG,TGA
GAA,CAT

As an example of what you must do, consider the first two lines of the above data, which are individually sent to a mapper:

CAGGAAAGACAATTCCAAAATCAGTTAGAGTCCTGTTGGCGCGTGTAATACATCTCCACTTTGAAAATGAAGACAGGGGGTTACGAGTGTTATTAATGAG
TGGGAATGTAAATTAGTCCAGCCACTCTGGAGAACCGTATGGAGGTTCCTCCAAAAATTACAAATAGAACTACCATATGATCCAGCAATCCCATGCTATG

Your program should identify that the start and end tag pairs above are located at the following positions in the first line (each position is the zero-based index of the first letter of the tag):

CAG…AGA: 0..6
CAG…AGA: 21..26
CCA…TGT: 14..33
CCA…TGT: 55..87
TGG…TCC: 36..54
TGG…TCT: 36..52
CCA…TGA: 14..61
CCA…TGG: 14..36
GTG…TGA: 42..61
GTG…TGA: 86..96
GAA…CAT: 3..50

with the number of letters between these tags (not including the tags themselves) being:

CAG…AGA 3 & 2
CCA…TGT 16 & 29
TGG…TCC 15
TGG…TCT 13
CCA…TGA 44
CCA…TGG 19
GTG…TGA 16 & 7
GAA…CAT 44

For the second line, the tag pairs are located at:

CAG…AGA: 18..30
TGG…TCC: 0..16
TGG…TCC: 27..46
TGG…TCT: 0..25
CCA…TGA: 17..77
CCA…TGC: 17..93
CCA…TGG: 17..27
GAA…CAT: 3..73

with the number of letters between tags of:

CAG…AGA 9
TGG…TCC 13 & 16
TGG…TCT 22
CCA…TGA 57
CCA…TGC 73
CCA…TGG 7
GAA…CAT 67

For the two data lines shown at the start of this example, the average gap between tags would therefore be:

CAG…AGA 4.6666665
TGG…TCC 14.666667
TGG…TCT 17.5
CCA…TGA 50.5
CCA…TGT 22.5
CCA…TGC 73.0
CCA…TGG 13.0
GTG…TGA 11.5
GAA…CAT 55.5

Your task is to write the Map/Reduce code in Java needed to process the above data in such a way that it produces the final output of the averages shown above, but for the entire genome data rather than just the two sample lines shown. You will submit a written report detailing your design and the results you found. You must also submit a Java file containing your code.

Step 1, HDFS – 20 Marks

Before you write any code, you will need to copy the data onto your own space in HDFS. In your report, give details of how HDFS stores data such as this (assume the file is much bigger than it really is for the purpose of your description). This section should be around half a page long, plus a diagram. Describe what HDFS is for, the architecture it uses, and the roles of different nodes in the cluster. Document the hdfs commands you used to create a directory for the data and place it there. Make sure everything you put here, including the diagram, is your own work. Do not copy anything from other sources.

Step 2, Design – 20 Marks

Now consider the Map/Reduce design you will implement. Compare and contrast producing a design with and without a Combiner, and describe the role that the Combiner plays in improving the efficiency of your solution. You should also describe what keys and values the mapper will emit, what the combiner will emit, and what the final reducer will emit.
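As a way of checking the worked example by hand, the following is a minimal sketch in plain Java (no Hadoop API). It rests on two assumptions that are not stated in the brief: that tag pairs are matched like a non-overlapping, shortest-match regular expression (an interpretation that reproduces the gaps listed for the two sample lines), and that partial results are carried as a (sum, count) pair so they can be merged before the final division. The class name `GapSketch` and its method names are hypothetical and are not part of DNASeqCount.java.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only; not the provided starting code.
public class GapSketch {
    // The first two sample genome lines from the brief.
    public static final String LINE1 =
        "CAGGAAAGACAATTCCAAAATCAGTTAGAGTCCTGTTGGCGCGTGTAATACATCTCCACTTTGAAAATGAAGACAGGGGGTTACGAGTGTTATTAATGAG";
    public static final String LINE2 =
        "TGGGAATGTAAATTAGTCCAGCCACTCTGGAGAACCGTATGGAGGTTCCTCCAAAAATTACAAATAGAACTACCATATGATCCAGCAATCCCATGCTATG";

    // Gap lengths between start and end tags in one line.
    // Assumption: the reluctant quantifier (.*?) stops at the first end tag after
    // a start tag, and find() resumes after that end tag, so matches never overlap.
    public static List<Integer> gaps(String line, String start, String end) {
        Pattern p = Pattern.compile(Pattern.quote(start) + "(.*?)" + Pattern.quote(end));
        Matcher m = p.matcher(line);
        List<Integer> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group(1).length()); // letters strictly between the two tags
        }
        return out;
    }

    // Summarise a list of gaps as {sum, count}.
    public static long[] partial(List<Integer> gaps) {
        long sum = 0;
        for (int g : gaps) {
            sum += g;
        }
        return new long[] {sum, gaps.size()};
    }

    // Merging two {sum, count} pairs is associative, so it can be applied
    // to any grouping of partial results without changing the final answer.
    public static long[] merge(long[] a, long[] b) {
        return new long[] {a[0] + b[0], a[1] + b[1]};
    }

    // Final division; the caller must ensure count is non-zero.
    public static float average(long[] sumCount) {
        return (float) sumCount[0] / sumCount[1];
    }

    public static void main(String[] args) {
        long[] p1 = partial(gaps(LINE1, "CAG", "AGA")); // gaps 3 and 2 -> {5, 2}
        long[] p2 = partial(gaps(LINE2, "CAG", "AGA")); // gap 9        -> {9, 1}

        // Merging sums and counts first, then dividing once:
        System.out.println(average(merge(p1, p2)));           // prints 4.6666665

        // Averaging the two per-line averages gives a different number:
        System.out.println((average(p1) + average(p2)) / 2);  // prints 5.75
    }
}
```

The first printed value matches the CAG…AGA average given in the example, while the second does not, which is one reason to think carefully about what a value must contain if it is to be partially aggregated before the final average is taken.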
You should consider how much data will be moved across the network in each of your two designs and how many different reducers will be used in each case.

Step 3, Implement – 60 Marks

Once you have completed your designs, you should implement the design that uses a Combiner and show how it improves the performance of the overall solution. It is advisable to use the DNASeqCount.java file provided on the assignment page in Canvas as a starting point. A file called TestSeqCount has been provided that will use the code from DNASeqCount.java and run it on the mochadoop Hadoop simulator. You are advised to develop your solution with this first before finally running it on Hadoop. TestSeqCount uses the sample data and tag pairs shown above, so you can use it to check that you are getting the final answers shown above. The Hadoop run will use the full set of data and a larger set of tag pairs to produce a more detailed result, so do not expect the two alternatives to produce the same output (although you can test your Hadoop job with the smaller data files if you wish).

If you have problems accessing Hadoop remotely, you can instead run your code only with mochadoop on the dna-40.txt sample, which contains the first 40 lines of DNA sequences, and submit the results for this. However, your submission may be tested on a much larger data set on Hadoop, so you should be sure that it works. Whether or not you use Hadoop, you should still provide the commands that would be needed to run your solution on the real Hadoop system.

Submission Details

Please write up your work in a report and submit it via Canvas. Clearly note your 7-digit student ID number on the front of your report, but do not provide your name. Additionally, please submit your DNASeqCount.java file via Canvas. Ensure that your code is very well commented and that you have put your 7-digit ID number at the top of your Java code in the commented area. Make sure your report also contains the results you got when you ran your code.

The deadline for submission is Monday 22nd of June at 4pm.

Plagiarism

Work which is submitted for assessment must be your own work. All students should note that the University has a formal policy on academic misconduct which can be found here. Plagiarism means presenting the work of others as though it were your own. The University takes a very serious view of plagiarism, and the penalties can be severe (ranging from a reduced grade in the assessment, through a fail for the module, to expulsion from the University for more serious or repeated offences). Specific guidance in relation to Computing Science assignments may be found in the Computing Science Student Handbook. We check submissions carefully for evidence of plagiarism, and pursue those cases we find.

Late submission

If you cannot meet the assignment hand-in deadline and have good cause, please see the module coordinator to explain your situation and ask for an extension. Coursework will be accepted up to seven days after the hand-in deadline (or expiry of any agreed extension), but the mark will be lowered by three marks per day or part thereof. After seven days the work will be deemed a non-submission and will receive an X.