PA 8: Huffman Encoding and Compression

Due Date: November 26th (Tuesday, 11:59 PM)
Checkpoint Submission: November 24th (Sunday, 11:59 PM)
PA Quiz: November 23rd (Saturday, 11:59 PM)
100 points

Overview

In this assignment you will implement Huffman Encoding, a classic lossless compression algorithm discovered by David A. Huffman while he was a student. It is an optimal lossless compression strategy. After finishing this assignment, you will be able to compress and decompress any file using the algorithm that you implemented.

This assignment is an individual assignment. You may ask Professors/TAs/Tutors for guidance and help, but you may not copy code. You may discuss the assignment conceptually with your classmates, including bugs that you ran into and how you fixed them. However, do not look at or copy code. Doing so constitutes an Academic Integrity Violation. START EARLY!

Style Requirements

You will be graded on your code's style (see the Style Guide). IntelliJ has a great plugin to check your style automatically. Use it! There are style considerations beyond indentation and such:

● Choose names for test methods and variables that are descriptive and useful when you see them in test output.
● Consider writing helper methods if you repeatedly use the same set of statements.
● Use Javadoc comments to describe your methods and classes, and inline comments where the logic of your program is not captured by your variable names and the natural flow of the code.

Part 0 – Get Starter Code

Our starter code will be distributed from https://github.com/ucsd-ets/dsc30fa19-startercode. Please follow the same procedures as before to get your starter code and set up IntelliJ.

Part 1 – PA Quiz

Fill out this quiz about the PA to make sure you have understood the assignment thoroughly. You have 5 attempts to complete it.
PA Quiz on Canvas

Part 2 – Huffman Coding Tree

In this part, you will implement methods in HCTree.java and the compareTo method in the inner class HCNode to help you build and use a Huffman Coding Tree. We have provided most of the inner class HCNode for you. Please read through the Javadoc comments in the starter code to understand the purpose of each instance variable in HCNode. After that, implement the following methods:

public void buildTree(int[] freq)

This method takes in an int array of size 256, which is the size of extended ASCII. (Since each byte is 8 bits, the number of all possible bytes is simply 2^8 = 256.) The given array keeps track of the frequency of each byte in the input file. For example, if 'a' appears in the input file 7 times, then the element at index 97 (which is the ASCII value of 'a') of freq would be 7. Note that to get the symbol 'a' of type byte back from its ASCII value 97, you can simply cast the int to byte: byte a = (byte) 97

1. In this method, you first need to create all leaf nodes of your HCTree with frequency values corresponding to each symbol in the freq array. Note that HCTree contains an instance variable called leaves, an array that references each of the leaf nodes in your HCTree (it will be useful later in your encode method). Think about where to store each leaf node in the leaves array so that, given a symbol, you can immediately find the reference to the leaf node containing that particular symbol.

2. After creating these HCNodes, add them to a Java built-in priority queue to help you build the tree later. (See the Java API documentation of PriorityQueue.) Remember that this priority queue should give higher priority to the HCNode with the lower frequency, and if the frequencies are the same, the HCNode whose symbol has the smaller ASCII value should have higher priority.
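The frequency-counting and leaf-indexing idea described above can be sketched as follows. This is a minimal sketch, not the assignment's actual HCTree code; the class name FreqSketch and the demo input string are illustrative assumptions.

```java
import java.nio.charset.StandardCharsets;

public class FreqSketch {
    static final int NUM_SYMBOLS = 256; // size of extended ASCII

    public static void main(String[] args) {
        // Stand-in for the bytes of an input file (illustrative only).
        byte[] input = "aab".getBytes(StandardCharsets.US_ASCII);

        // Index the frequency array by the byte's unsigned (ASCII) value,
        // so that a symbol maps directly to its slot.
        int[] freq = new int[NUM_SYMBOLS];
        for (byte b : input) {
            // b & 0xff converts a (possibly negative) byte to 0..255
            freq[b & 0xff]++;
        }

        System.out.println(freq['a']); // prints 2
        System.out.println(freq['b']); // prints 1
    }
}
```

The same indexing scheme works for the leaves array: store the leaf node for symbol s at index s & 0xff, and lookup becomes a single array access.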
Now you need to go back to the inner class HCNode and implement the compareTo method to satisfy the requirement above, as Java's built-in priority queue uses the compareTo method defined in HCNode to determine the priority of different nodes.

Hint: First, recall the default behavior of the compareTo method: return a positive number if "this" is considered "larger" than the given object. Then read the documentation for PriorityQueue and find the information about its default behavior: is it a min-heap or a max-heap by default? Then think about the following questions:

a. Which instance variables define the priority of an HCNode?
b. Do we want min-heap or max-heap behavior from the priority queue?
c. If you want the default behavior of the priority queue, should compareTo keep its default behavior?
d. If you want the opposite of the default behavior of the priority queue, should compareTo keep its default behavior?

3. Your next step is to build the HCTree using the HCNodes currently stored in the priority queue. The algorithm was discussed in class: iteratively remove two nodes from the priority queue, then use the first node as the '0' child and the second node as the '1' child of a newly created parent node. The frequency of this parent node is the sum of the frequencies of its two children. Make sure that the parent node takes its symbol from one of its children (choose either one, but be consistent). Then add this parent node back to the priority queue, so that it later becomes a "branch" of your final tree. You will need to think about when the HCTree is completely built: we are essentially building the tree from the bottom up (from leaves to branches and finally to the root). The main reason why Huffman coding can compress data comes from this simple method.
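The compareTo ordering and the build loop above can be sketched as follows. This is a minimal stand-in, assuming Java's PriorityQueue (a min-heap, so the "smallest" element per compareTo is polled first); the Node class and the hard-coded frequencies are illustrative, not the assignment's HCNode.

```java
import java.util.PriorityQueue;

public class BuildSketch {
    // Minimal stand-in for the assignment's HCNode (illustrative names).
    static class Node implements Comparable<Node> {
        byte symbol;
        int freq;
        Node c0, c1; // '0' child and '1' child

        Node(byte symbol, int freq) { this.symbol = symbol; this.freq = freq; }

        // PriorityQueue polls the "smallest" element first, so returning a
        // negative number for the lower frequency gives it higher priority;
        // ties break toward the smaller ASCII value.
        public int compareTo(Node o) {
            if (freq != o.freq) return Integer.compare(freq, o.freq);
            return Integer.compare(symbol & 0xff, o.symbol & 0xff);
        }
    }

    public static void main(String[] args) {
        PriorityQueue<Node> pq = new PriorityQueue<>();
        pq.add(new Node((byte) 'a', 17));
        pq.add(new Node((byte) 'f', 1));
        pq.add(new Node((byte) 'b', 8));

        // Build loop: combine the two highest-priority (lowest-frequency)
        // nodes until a single node, the root, remains.
        while (pq.size() > 1) {
            Node n0 = pq.poll(); // becomes the '0' child
            Node n1 = pq.poll(); // becomes the '1' child
            Node parent = new Node(n0.symbol, n0.freq + n1.freq); // symbol consistently taken from the '0' child
            parent.c0 = n0;
            parent.c1 = n1;
            pq.add(parent);
        }

        Node root = pq.poll();
        System.out.println(root.freq); // prints 26, the total frequency
    }
}
```

Note that the loop condition doubles as the "tree is completely built" check: once only one node remains in the queue, that node is the root.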
From a high-level perspective, since we use a priority queue to build the tree bottom-up, the HCNode with the most frequent symbol ends up with the shortest path from the root, while the HCNode with the least frequent symbol ends up with the longest path. As a result, the most frequent symbol corresponds to the shortest sequence of encoded bits. This is where the compression actually happens: we use far fewer bits to represent the most frequent symbols. Although the least frequent symbol may have the longest encoding, that symbol appears rarely, so the overall encoding is shorter in the end.

public void encode(byte symbol, BitOutputStream out) throws IOException

For a given symbol, use the HCTree built before to find its encoding bits and write those bits to the given BitOutputStream. In this method, to avoid inefficiently searching the whole tree for the encoding bits of the given symbol, you must take advantage of the leaves array that we created before: it is much faster to start from the leaf node containing the given symbol and traverse back to the root. Simply collect all the bits on that path and reverse them at the end to get the encoding.

IMPORTANT: The following code converts a byte c to its corresponding ASCII value:

int ascii = c & 0xff;

Once you have the encoding bits, use out.writeBit(int i), which takes in either 0 or 1 as a bit, to write the encoding bits one by one to the BitOutputStream.

public byte decode(BitInputStream in) throws IOException

Decodes the bits from the BitInputStream and returns a byte representing the symbol encoded by that sequence of bits.

Testing HCTree:

A Huffman Coding Tree can be hard to debug, since most of the I/O is done with bits, and bits are hard to visualize directly.
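The leaf-to-root bit collection described for encode above can be sketched as follows. This is a minimal sketch with an illustrative Node class (the real assignment would use HCNode's parent/child references and write each bit with out.writeBit); the tiny hand-built tree is an assumption for demonstration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class EncodeSketch {
    // Minimal stand-in for HCNode: only the links needed for the walk.
    static class Node {
        Node parent, c0, c1;
    }

    // Walk from a leaf up to the root, recording 0 or 1 depending on which
    // child the current node is, then reverse so the bits read
    // root-to-leaf (the actual encoding).
    static List<Integer> pathBits(Node leaf) {
        List<Integer> bits = new ArrayList<>();
        for (Node cur = leaf; cur.parent != null; cur = cur.parent) {
            bits.add(cur == cur.parent.c0 ? 0 : 1);
        }
        Collections.reverse(bits);
        return bits;
    }

    public static void main(String[] args) {
        // Tiny hand-built tree: root's '0' child is leafA; root's '1' child
        // is an inner node whose '0' child is leafB.
        Node root = new Node(), inner = new Node(), leafA = new Node(), leafB = new Node();
        root.c0 = leafA; leafA.parent = root;
        root.c1 = inner; inner.parent = root;
        inner.c0 = leafB; leafB.parent = inner;

        System.out.println(pathBits(leafA)); // prints [0]
        System.out.println(pathBits(leafB)); // prints [1, 0]
    }
}
```

In the real encode, you would look the leaf up in the leaves array by symbol & 0xff and then call out.writeBit for each collected bit.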
To save you time debugging later, we require you to test buildTree() to make sure the structure of your HCTree is correct before you move on to the next part. Now create a JUnit test file HCTreeTester.java to verify the correctness of your buildTree() method. (You are not required to test methods other than buildTree().)

First, write an in-order traversal method that recursively traverses your HCTree and prints all of its nodes in order. (Note that the toString() method for HCNode is defined for you, so you can use it directly to print a node.)

public void inorder(HCNode root)

Then, in the JUnit test method, create a freq array of size 256 and modify the array so that it contains the correct information about the provided check1.txt file:

a: 17
b: 8
c: 7
d: 14
e: 9
f: 1
\n (newline char): 1

(You may refer to the ASCII table at http://www.asciitable.com/ for the ASCII value of each symbol above.)

Now, from your own knowledge of Huffman coding and the description in the writeup above, draw the resulting Huffman Coding Tree in a file HCT.pdf. Make sure you clearly indicate the symbol and frequency of each HCNode, as well as what each edge stands for, in your HCTree drawing. You will need to submit this as part of the assignment. Then pass this array to buildTree() and use your inorder method to print out your HCTree and see if it matches your drawing. (You don't need to use assertEquals() here; you may just print the HCTree out.)

Part 3 – MyCompressor

Once you are done building the HCTree, we implement our compressor. For this basic implementation, we will build the HCTree based on the character counts of a large corpus. Use the corpus file, corpus.txt, to build your HCTree.

public HCTree buildHuffmanTree(String corpusLocation) throws IOException

This method creates the HCTree using the frequencies of the characters in the corpus. In this method, you will populate the frequency array freq by counting the occurrences of each character in the file corpusLocation. The size of the array should be 256, which is the size of extended ASCII.
While creating the frequency array, make sure to keep track of the index of each character. For example, the element at index 97 (the ASCII value of 'a') should hold the frequency of 'a' in the file. Note: see the provided tools below. In short, this method takes in the corpus file (corpusLocation), builds the HCTree, and stores it in the instance variable huffmanTree.

public void compress(String inputFile, String compressedFile) throws IOException

This method takes in two file names: inputFile, the name of the file that you want to compress, and compressedFile, the name of the new compressed file that your compression algorithm should generate.

public void decompress(String compressedFile, String outputFile) throws IOException

Similarly, this method takes in two file names: compressedFile, the name of the compressed file that you want to decompress, and outputFile, the name of the new decompressed file that your algorithm should generate after decompressing.

Part 4 – Compress and Decompress

Overview

Now we build our compression program. The main method of MyCompressor.java takes in args in the following format:

First argument: corpus file
Second argument: signal command, either "compress" or "decompress"
Remaining arguments: the names of the files to compress or decompress

Case "compress": each compressed file should be named compressed_[original].txt
Case "decompress": each decompressed file should be named decompressed_[original].txt

E.g.
1. src/corpus.txt compress src/f1.txt src/f2.txt
2.
src/corpus.txt decompress src/compressed_f1.txt src/compressed_f2.txt

In the first case the output should be two files: compressed_f1.txt and compressed_f2.txt. In the second case, the output should be two files: decompressed_compressed_f1.txt and decompressed_compressed_f2.txt.

Provided Tools

Compression: First you will need to scan through the corpus file to compute the frequencies of all 256 bytes. The following line of code reads through the given text file byte by byte and puts all of those bytes into a byte array:

byte[] input = Files.readAllBytes(Paths.get(/*file path*/));

IMPORTANT: The following code converts a byte c to its corresponding ASCII value:

int ascii = c & 0xff;

Notice that these output stream initializations are given to you:

FileOutputStream file = new FileOutputStream(/*file path*/);
DataOutputStream out = new DataOutputStream(file);
BitOutputStream bitOut = new BitOutputStream(out);

We won't go into detail about how these output streams work, but you should know how to write to the given file using them. For example, say your third command line argument is a text file named "compressed.txt". Then to write an int 25 to "compressed.txt", simply use DataOutputStream by calling out.writeInt(25). (Read the DataOutputStream documentation for more information.) Similarly, to write a single bit 0 to "compressed.txt", use BitOutputStream by calling bitOut.writeBit(0);

Decompression: Notice that these input stream initializations are given to you:

FileInputStream inFile = new FileInputStream(/*file path*/);
DataInputStream in = new DataInputStream(inFile);
BitInputStream bitIn = new BitInputStream(in);

You should also know how to read from the given file using these input streams. For example, say your third command line argument is a text file named "compressed.txt".
Then to read an int from "compressed.txt", simply use DataInputStream by calling int byteCount = in.readInt(). (Read the DataInputStream documentation for more information.) Similarly, to read a single bit from "compressed.txt", use BitInputStream by calling int bit = bitIn.readBit().

FileOutputStream outFile = new FileOutputStream(/*file path*/);
DataOutputStream out = new DataOutputStream(outFile);

Also, to write to the output file byte by byte, you can simply call out.writeByte(byte b).

Part 5 – Results

Overview

You are given 10 files (f1.txt – f10.txt). Compress each file using your compression algorithm and report the results in the form of the following table in a file 'CompressionReport.txt' or 'CompressionReport.pdf':

File  | Original Size [KB] (OS) | Compressed Size [KB] (CS) | CS/OS
File1 | ...                     | ...                       | ...

Also, explain in a few lines the variation in the CS/OS values.

Critical Thinking: Answer the following questions in 2-3 lines each. Note that this method of compression using the Huffman Coding Tree might not be optimal.

1. What would happen if you used the file that you want to compress as the corpus? Would you get better results? Why?
2. Assuming you only have a compressed version, can you directly decompress a file that was compressed using an HCTree built with the file itself as the corpus? [Yes/No]
3. If yes, explain how. If no, explain how you would modify your code to enable the decompression.

Part 6 – Test Compress and Decompress

Once you are done implementing compress and decompress, you need to test your program. We have provided 3 basic check files for you to test your Huffman coding (check1.txt and so on). You should use command line arguments to compress a check file to a compressed file, then decompress that compressed file back and see if it matches the original file. Next, test your program on the larger 10 input files (f1.txt – f10.txt).
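One way to check the round trip above is to compare the original and decompressed files byte for byte. This is a small sketch, not part of the assignment's required code; the class name and the temp-file demo (standing in for e.g. check1.txt and its decompressed counterpart) are illustrative assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class RoundTripCheck {
    // Byte-for-byte comparison; after compress then decompress, the
    // decompressed file should match the original exactly.
    static boolean sameContents(Path a, Path b) throws IOException {
        return Arrays.equals(Files.readAllBytes(a), Files.readAllBytes(b));
    }

    public static void main(String[] args) throws IOException {
        // Demo with temp files; a real run would compare, say, check1.txt
        // against the file produced by decompressing its compressed form.
        Path original = Files.createTempFile("orig", ".txt");
        Path decompressed = Files.createTempFile("dec", ".txt");
        Files.write(original, "abcdef\n".getBytes());
        Files.write(decompressed, "abcdef\n".getBytes());
        System.out.println(sameContents(original, decompressed)); // prints true
    }
}
```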
Some edge cases to handle:

1. Empty file: compressing an empty file should generate an empty file. After decompressing, your program should generate an empty file as well.
2. File with only a single kind of char: we've provided singleChar.txt to test this case (a file with only a newline char on each line).

Submission

Due – November 26th (Tuesday, 11:59 PM)

Files to submit:
1. HCTree.java
2. HCTreeTester.java
3. MyCompressor.java
4. HCT.pdf
5. CompressionReport.txt

Checkpoint – November 24th (Sunday, 11:59 PM)
1. HCTree.java
2. HCTreeTester.java
3. HCT.pdf

After you submit your files to GradeScope, wait for the output of the submission script. Make sure you see this after your submission:

THIS IS A SUCCESSFUL SUBMISSION. YOUR SCORE WILL BE UPDATED ONCE GRADED.

Otherwise, please fix the error and resubmit your files.

New to PA 8: For extra assurance on the code that you submit, we've included a basic script for you to test the input files that came with your starter code. Under the output of a successful submission, you should see the test output of input1.txt, input2.txt, and input3.txt. Passing these submission tests is not a requirement for a successful submission; these tests are there to help you ensure that you get identical files from compression followed by decompression.

(The original writeup includes two screenshots here: an example of a successful submission that passes the basic script, and an example of a successful submission that does not.)
