- June 20, 2020
Sample Take-home Exam Paper
Semester 1, 2020
COMP5046 Natural Language Processing

*This sample exam paper contains one or two questions for each weekly topic, and it is intended to show the structure and style of the final exam, including which questions are worth 2 marks and which are worth 5 marks. Unlike the final exam, however, you are not expected to finish it in 1 hour.
*The final exam will be an open-book, unsupervised exam.

Week 1. Count-based Word Representation

Q1. Calculate the TFIDF scores for the terms listed below for documents 1 to 4. There are 10,000 documents in the collection. The number of times each of these terms occurs in documents 1 to 4, as well as the number of documents in the collection containing each term, is stated below. Use this information to calculate the TFIDF scores. (2 marks)

Number of documents containing each term:
● machine: 3
● university: 50
● headphone: 10
● perfume: 3

Term frequency (the number of times each of the 4 terms occurs in documents 1 to 4):

             Doc1   Doc2   Doc3   Doc4
machine         8     10      0      0
university      3      1      2      2
headphone       0      0      8      7
perfume         2      2      2      9

Solution

IDF calculation (using the natural logarithm):
● machine: IDF = log(10000/3) ≈ 8.11
● university: IDF = log(10000/50) ≈ 5.30
● headphone: IDF = log(10000/10) ≈ 6.91
● perfume: IDF = log(10000/3) ≈ 8.11

TFIDF calculation (TF × IDF):

             Doc1               Doc2                Doc3               Doc4
machine      8 × 8.11 = 64.88   10 × 8.11 = 81.10   0                  0
university   3 × 5.30 = 15.90   1 × 5.30 = 5.30     2 × 5.30 = 10.60   2 × 5.30 = 10.60
headphone    0                  0                   8 × 6.91 = 55.28   7 × 6.91 = 48.37
perfume      2 × 8.11 = 16.22   2 × 8.11 = 16.22    2 × 8.11 = 16.22   9 × 8.11 = 72.99

Week 2. Word Embeddings and Representation

Q2. Illustrate an advantage of FastText over Word2Vec. Give application examples to support your argument. (2 marks)

Solution

The main advantage of FastText embeddings over Word2Vec is that they take into account the internal structure of words while learning word representations, which is very useful for morphologically rich languages, and also for words that occur rarely.
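This subword idea can be sketched in a few lines of Python (a minimal illustration of character n-gram extraction, not gensim's actual implementation; the function name and parameters are chosen for illustration):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """FastText-style character n-grams, with < and > as word boundary markers."""
    w = "<" + word + ">"
    return {w[i:i + n] for n in range(n_min, n_max + 1) for i in range(len(w) - n + 1)}

# Morphologically related forms share many subword units, so their FastText
# vectors (sums of the n-gram vectors) end up close to each other:
shared = char_ngrams("teaching") & char_ngrams("teacher")
# "<te", "tea", "teac" and "teach" are among the shared n-grams
```

Because a word's vector is built from its n-gram vectors, even an unseen form such as "teached" still receives a reasonable representation, which Word2Vec cannot do for out-of-vocabulary words.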
For example, for a word that appears in various forms such as “teaching”, “teacher” and “teached”, the internal structure of each form can be learned and represented using the character n-gram based FastText embeddings.

Q3. Illustrate 2 examples of how we can evaluate word vectors. For each example, please indicate whether it is intrinsic or extrinsic. (2 marks)

Solution

● Intrinsic: word vector analogies; word vector distances and their correlation with human judgments; word clustering and categorization.
● Extrinsic: named entity recognition (finding a person, location, organization, and so on); various text classification tasks, such as the sentiment analysis we did in Assignment 1 (identifying and extracting subjective information in the source text or document).

Week 3 and 4. Word Classification with Machine Learning

Q4. In class, we learned that the family of recurrent neural networks has many important advantages and can be used in a variety of NLP tasks. For each of the following tasks and inputs, state how you would run an RNN to do that task: (5 marks)
1. How many outputs ŷ(t) there are, i.e. the number of times the softmax is called from your RNN. If the number of outputs is not fixed, state it as arbitrary.
2. What each ŷ(t) is a probability distribution over.
3. Which inputs are fed at each time step to produce each output.

Task A: Named Entity Recognition: For each word in a sentence, classify that word as either a person, organization, location, or none. (Inputs: A sentence containing n words.)
Task B: Sentiment Analysis: Classify the sentiment of a sentence ranging from negative to positive (integer values from 0 to 4). (Inputs: A sentence containing n words.)

Solution

Task A: Named Entity Recognition
1. Number of outputs: n outputs.
2. Each ŷ(t) is a probability distribution over the 4 NER categories.
3. Each word in the sentence is fed into the RNN, and one output is produced at every time step, corresponding to the predicted tag/category for that word.

Task B: Sentiment Analysis
1. Number of outputs: 1 output. (n outputs is also acceptable if the average of all outputs is taken.)
2. Each ŷ(t) is a probability distribution over the 5 sentiment values.
3. Each word in the sentence is fed into the RNN, and one output is produced from the hidden states (by taking either only the final state, or the max or mean across all states), corresponding to the sentiment value of the sentence.

Q5. Assume that you build a sentiment analysis system that feeds a sentence into an RNN, and then computes the sentiment class between 0 (very negative) and 4 (very positive) based only on the final hidden state of the RNN. Illustrate one advantage that an RNN would have over a neural window-based model for this task. (2 marks)

Solution

There are multiple acceptable answers:
● It can process arbitrary-length inputs.
● It can encode temporal information. (“Takes ordering into consideration” is only partially correct, because theoretically window-based models can too, although with difficulty.)
● Shared weights, and therefore fewer parameters: the number of parameters in a neural window-based network grows in proportion to the input size, whereas it stays constant for an RNN, since weights are shared at every time step.

Week 5. Language Fundamentals

Q6. Describe the difference between lemmatization and stemming. Give application examples to support your argument. (2 marks)

Solution

Stemming is a procedure that reduces all words with the same stem to a common form, whereas lemmatization removes inflectional endings and returns the base or dictionary form of a word. For example, the words “trouble”, “troubling” and “troubled” may all be stemmed to “troubl” (not a valid English word), but will be lemmatized to “trouble” for comparison. Another good example of lemmatization is the words “was” and “is” being mapped to “be”.

Week 6. Part of Speech Tagging

Q7. Find one tagging error in each of the following sentences, which are tagged with the Penn Treebank tagset. Briefly explain why. (2 marks)
1.
I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NN
2.
Can/MD I/PRP serve/VB the/DT dinner/NNS

Solution
1. Atlanta should be tagged NNP. Atlanta is the capital city of the U.S. state of Georgia, so it should be a proper noun, singular: a name used for an individual person, place, or organization, spelled with an initial capital letter.
2. Dinner should be tagged NN. NN represents a singular noun, whereas NNS is a plural noun. The word “dinner” here is a singular noun.

Q8-a. A hidden Markov model includes states, observations, transition probabilities, and observation likelihoods. Describe what each of these corresponds to when using an HMM for POS tagging. (2 marks)

Solution
● States: the POS tags at specific points in the sentence.
● Observations: the words that are observed as the sentence is read in.
● Transition probabilities: the probability of finding POS tag N following POS tag N−1.
● Observation likelihoods: the probability of seeing a particular word given a particular POS tag.

Q8-b. Given the sentence “I promise to back the bill”, show how you would compute the probability of “back” as a verb versus the probability of “back” as a noun, using the probabilities in Tables a and b and the Viterbi algorithm. You are given the values for the third column of the Viterbi table, which correspond to observation 3, “to”: VB: 0, TO: .00000018, NN: 0, PRP: 0. Thus, you will show two computations, both of which use these values. You do not need to do the arithmetic; just show the formula that would be computed. (5 marks) (*Assume all verb tags are VB.)

Table a. Observation likelihoods

        I      promise   to     back
VB      0      .0093     0      .00008
TO      0      0         .99    0
NN      0      .0085     0      .00068
PRP     .37    0         0      0

Table b. Tag transition probabilities (row = previous tag; the first row gives the probabilities from the sentence-start state <s>)

        VB       TO        NN        PRP
<s>     .019     .0043     .041      .067
VB      .0038    .035      .047      .0070
TO      .83      0         .00047    0
NN      .0040    .016      .087      .0045
PRP     .23      .00079    .0012     .00014

Solution
● back as a verb: .00000018 × P(VB | TO) × P(back | VB) = .00000018 × .83 × .00008
● back as a noun: .00000018 × P(NN | TO) × P(back | NN) = .00000018 × .00047 × .00068

Week 7. Dependency Parsing

Q9. State a sequence of transitions that makes a transition-based dependency parser produce the following dependency tree. Explain how to get the sequence of transitions. (5 marks)

Solution

Suppose SH = Shift, RA = Right Arc, LA = Left Arc. The transition sequence is:
SH SH SH SH RA SH SH LA RA RA RA

To obtain this dependency tree using the arc-standard algorithm, we perform the following steps based on the three possible transitions (SH, RA, LA):
Step 1. SH: shift ROOT (0) onto the stack; all the other words, 1 to 5, are in the buffer as our initial state.
Step 2. SH: shift word 1 from the buffer to the stack.
Step 3. SH: shift word 2 from the buffer to the stack.
Step 4. SH: shift word 3 from the buffer to the stack.
Step 5. RA: add a right arc from 2 to 3, and remove 3 from the stack.
Step 6. SH: shift word 4 from the buffer to the stack.
Step 7. SH: shift word 5 from the buffer to the stack.
Step 8. LA: add a left arc from 5 to 4, and remove 4 from the stack.
Step 9. RA: add a right arc from 2 to 5, and remove 5 from the stack.
Step 10. RA: add a right arc from 1 to 2, and remove 2 from the stack.
Step 11. RA: add a right arc from 0 to 1, and remove 1 from the stack.

Head and modifier refer to the two words in a dependency relation: the head is the governor (parent), and the modifier is the dependent (daughter). When drawing dependencies with arrows, the arrow points from the head to the modifier. For example, in the phrase “red hat”, “red” is the modifier and “hat” is the head, so the arrow points from “hat” to “red”. A full answer should include a detailed explanation of how to derive the dependency tree using the SH, LA and RA definitions, and of what the head and modifier are.

Week 8. Language Model and Natural Language Generation

Q10.
During training of a neural language model, we normally apply teacher forcing. Describe what the teacher forcing technique is. Give application examples to support your argument. (2 marks)

Solution

Teacher forcing is the technique where the target (ground-truth) word, rather than the model's own prediction, is passed as the next input to the decoder. Assume we want to train a sentence generation model, and the ground-truth sentence is “Two people reading a book”. Our model makes a mistake in predicting the 2nd word, so we have “Two” and “birds” for the 1st and 2nd predictions respectively. With teacher forcing, we would feed “people” to our RNN for the 3rd prediction, after computing and recording the loss for the 2nd prediction. (Optional: without teacher forcing, we would feed “birds” back to our RNN to predict the 3rd word. Say the 3rd prediction is “flying”: even though it makes sense for the model to predict “flying” given the input “birds”, it is different from the ground truth.)

Week 9. Named Entity Recognition and Coreference Resolution

Q11. The IOB format categorizes tagged tokens as I, O and B. Why are three tags necessary? What problem would be caused if we used the I and O tags exclusively? Give application examples to support your argument. (2 marks)

Solution

The IOB format (short for inside, outside, beginning) is a common tagging format for tagging tokens in a chunking task. If two chunks follow each other, it would not be possible to make clear that they are two chunks rather than one chunk consisting of two words, nor where the first chunk ends and the second begins. For example, consider the NER tags using only the IO format for the sentence ‘Tom/I-PER, Amy/I-PER and/O Tony/I-PER went/O to/O Usyd/I-LOC’: the two adjacent words Tom and Amy cannot be distinguished as two separate chunks, as intended. This is solved by using the IOB format: ‘Tom/B-PER, Amy/B-PER and/O Tony/B-PER went/O to/O Usyd/B-LOC’.

Week 10. Attention and Reading Comprehension

Q12. Describe the main intuition behind attention in a neural network model. (2 marks)

Solution

Attention gives us a fixed-size summary (as a weighted sum) of an arbitrary set of representations (the values), dependent on some other representation (the query). The weighted sum is a selective summary of the information contained in the values, where the query determines which values to focus on.

Week 11. Transformer and Machine Translation

Q13. What is the motivation for using an attention-only model? (2 marks)

Solution

In an RNN, sequential and recursive computation prevents parallelization. However, an attention module allows us to search over any internal state of a model, so perhaps we do not need the RNN at all.

Q14. Explain the motivation for positional encoding as used in a Transformer network. (2 marks)

Solution

Position and order of words are essential parts of any language: they define the grammar and thus the actual semantics of a sentence. The Transformer architecture abandoned the recurrence mechanism in favor of a multi-head self-attention mechanism. Since every word in a sentence flows through the Transformer’s encoder/decoder stack simultaneously, the model itself has no sense of the position/order of each word. Consequently, we still need a way to incorporate the order of the words into the model. Positional encoding is that solution: it adds to each word a piece of information about its position in the sentence.
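The idea can be sketched with the sinusoidal encoding from the original Transformer paper (a minimal NumPy illustration; learned positional embeddings are an equally valid alternative):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    pos = np.arange(max_len)[:, None]          # positions 0..max_len-1
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe

# Each position gets a distinct vector, which is simply added
# to the corresponding word embedding before the first layer.
pe = positional_encoding(max_len=50, d_model=16)
```

Because each position maps to a unique pattern of sines and cosines at different frequencies, the model can distinguish (and, to some extent, relate) word positions without any recurrence.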