辅导案例-ISTA 311 HW

ISTA 311 HW 4: HUFFMAN CODES AND DATA COMPRESSION Due Friday, May 1, 2020, 11:59 PM 1. Introduction For this assignment, we will write a simple file compression utility for text files implementing the Huffman code algorithm. This utility reads a text file, analyzes the frequency distribution of characters in the file, and calculates an optimal symbol code (Huffman code) for that distribution. It then writes a binary file encoded in that Huffman code, and an auxiliary file containing the Python representation of the code in the form of a dictionary so that it can be decoded later. 2. Instructions 2.1. Code and submission. Create a Python module called compress.py and upload it to the assignment folder on D2L. You will need to do the following imports: from bitstring import Bits import numpy as np and then implement the functions described in the specifications below. You might have to install the bitstring library if you haven’t yet. This can be installed through pip or conda. Documentation for the Bits class can be found at https://pythonhosted.org/bitstring/constbitarray. html 2.2. Documentation. Your script must contain a header docstring containing your name, ISTA 311, the date, and a brief description of the module. Each function must contain a docstring. Each function docstring should include a description of the function’s purpose, the name, type, and purpose of each parameter, and the type and meaning of the function’s return value. 2.3. Testing and required files. A test script, test compress.py, will be available on D2L along with some auxiliary text and data files necessary to run the tests. 2.4. Grading. Your submission will be graded by running it through the test script, running some tests on other files, and examining your code. Code should be clear and concise, but you will only lose points for style if your code is a real mess. Include inline comments to explain tricky lines and summarize sections of code. 3. Function specifications Implement the following functions: • get frequency dict: This function takes a single string argument and returns a dictionary whose keys are the characters appearing in the string, and whose values are the normalized frequencies (number of times each character appears, divided by the total number of characters in the string). 1 2 ISTA 311 HW 4: HUFFMAN CODES AND DATA COMPRESSION • dict entropy: This function takes a frequency dictionary and returns the entropy of the probability distribution. Recall that the entropy is computed by the formula: H(X) = − ∑ i P (xi) log2 P (xi) where xi are the individual outcomes. Here that means the characters in the dictionary. • huffman code: This takes a frequency dictionary and returns a Huffman code dictionary, where the keys are the characters appearing in the string and the values are the code words (strings of 0s and 1s) assigned to each character. (This may be simpler with some helper functions; see the tree algorithm in the Jupyter notebook encoding trees filled.ipynb, for instance.) • invert dict: This function takes a dictionary representing an encoding, with characters mapping to code words, and reverses it so that it maps code words to characters. (This is useful in decoding.) • encode string: This function takes two arguments: a string and a dictionary representing an encoding. Encode the string into a string of 0s and 1s and return it. • decode string: This function takes two arguments: a string of 0s and 1s and a dictionary repre- senting a reversed encoding. Use the dictionary to decode the string. • pad bitstring: Remember that bit strings we write to a file must be a whole number of bytes (i.e. a multiple of 8 bits.) This function takes a string of 1s and 0s and appends the number of 0s necessary onto the end of the string to make the length of the string a multiple of 8. Then it prepends onto the string an 8-bit bitstring representation of that number of 1s and 0s. Then return the padded string. • write encoded file: This function takes a string as its only argument, and does the following: – Passes the string to get frequency dict to get a frequency dictionary; – Finds the Huffman code for that frequency dictionary; – Converts the contents of the file into a bit string using the Huffman code, pads it using pad bitstring, and writes it to a (binary) file. The filename of the binary file should be the original filename with ’.compressed’ replacing the .txt extension; – Writes (in plain text) a representation of the Huffman code dictionary. Use repr to get a string representation of the dictionary. The filename of the text file should be the original filename with .code replacing the .txt extension. After all this, the function should return [file len, dict len], where file len is the length of the compressed file in bytes (which is the length of your bit string divided by 8) and dict len is the length of the string representation of the dictionary. • decode encoded file: This function takes a filename stem (a filename without an extension). Then do the following: – Open the file filename.compressed in binary read mode and read it to a Bits object. – Open the file filename.code in text read mode, read it to a string, and call eval() on that string to obtain the encoding dictionary; reverse it to get a decoding dictionary. – Access the bin attribute of the Bits object to obtain a string of 0s and 1s. – Interpret the first 8 bits in the bit string as an integer to learn how many digits of zero padding are on the end of the bit string. – Strip the initial 8 bits and any zero padding from the end, and pass the resulting bit string to decode string along with the decoding dictionary, then return the resulting string. • main: Finally, write a main function. The main function should: – Open the file saguaros.txt, reads the text, and passes the text to write encoded file to write a compressed version of the file. – Print the length of the file saguaros.txt in bytes (which is the same as the length of the string you read from it). Also print the entropy of the frequency dictionary, length of the compressed file in bytes and the length of the dictionary, formatting the output like this: Length of input file: xxxx Input file entropy: xxxx Length of compressed file: xxxx Length of encoding dict: xxxx where the xxxx are the various numerical quantities. ISTA 311 HW 4: HUFFMAN CODES AND DATA COMPRESSION 3 – Pass the filename stem poem to decode encoded file. Take the string that is returned and write it to a file called poem.txt. Hint: The code necessary for encode string, decode string, pad bitstring, invert dict, get frequency dict, and part of write encoded file can be found in the Jupyter notebook encoding trees filled.ipynb on D2L.

辅导案例-ISTA 311 HW

Related

Previous Post辅导案例-COMP3314

Next Post辅导案例-CSI 2132B

Author admin