Tutoring Case: COMP 5840M01

May 17, 2020

Module Code: COMP 5840M01
Module Title: Data Mining and Text Analytics
© UNIVERSITY OF LEEDS
School of Computing
Semester Two 2018/2019

Calculator instructions:
- You are allowed to use a nonprogrammable calculator only from the following list of approved models in this exam: Casio fx-82 (all variants), Casio fx-83 (all variants), Casio fx-85 (all variants)

Dictionary instructions:
- You are not allowed to use your own dictionary in this exam. A basic English dictionary is available to use: raise your hand and ask an invigilator, if you need it.

Exam information:
- There are 7 pages to this exam.
- There will be 2 hours to complete this exam.
- Answer all 3 questions.
- The number in brackets [ ] indicates the marks available for each question or part question.
- You are reminded of the need for clear presentation in your answers.
- The total number of marks for this examination paper is 60.
- You are allowed to use annotated materials.

Question 1

(a) Marvel Studios make films around Marvel super-heroes, such as Iron Man; and they want to promote diversity and inclusion, including more women and minority group super heroes in their movies, such as Captain Marvel, Black Widow, and T’Challa, the Black Panther. Marvel Studios wants to learn whether movie-goers who have liked women and minority Marvel super-heroes will like their latest movie, “Endgame”. They asked a group of 7 movie-goers whether they liked movies starring Captain Marvel, Black Widow, and T’Challa, the Black Panther; then they asked the group to watch “Endgame” and report whether they liked this new movie.
The following csv file represents data about the 7 movie-goers and which super-heroes they liked: 1 = yes, 0 = no; E = Endgame, M = Captain Marvel, W = Black Widow, and T = T’Challa, the Black Panther

E,M,W,T
1,1,0,1
0,0,1,0
1,0,1,1
0,1,1,1
0,1,0,0
1,0,0,1
1,0,1,1

Construct a J48-style decision tree from this training data, to predict class E = “like the movie Endgame” with at least 85% accuracy when evaluated on the training set. Justify your choice of features for decision points. [6 marks: 2 method, 2 justification, 2 full decision tree]

(b) Apply your decision tree from (a) to a new movie-goer, who likes Black Widow and Black Panther and Captain Marvel; will they like Endgame? [1 mark for answer with justification]

(c) Extend your J48-style decision tree from (a) to a decision tree with 100% accuracy when evaluated on the training set. Draw or write down your decision tree, and justify your choice of features for decision points. [4 marks: 100% accurate decision tree, justification]

(d) Apply your decision tree from (c) to the new movie-goer, who likes Black Widow and Black Panther and Captain Marvel; does the revised decision tree predict they will like Endgame? [1 mark for answer with justification]

(e) Marvel Studios put the same questions to a new, different group of 7 movie-goers, to collect a separate test data-set. The decision tree from (c) is more accurate than the decision tree from (a) when evaluated on the training set. Which decision tree, (a) or (c), is better to use for predicting whether the test-set of movie-goers will like Endgame? Justify your answer. [2 marks: (a) or (c) with justification]

(f) Apply the a priori association rule mining algorithm to the Marvel Studios movie-goer training data-set, to find all association rules linking 2 or more features, with at least 90% accuracy and coverage of at least 4 instances.
[6 marks]

[Question 1 total: 20 marks]

Question 2

English and Arabic are both official languages in Sudan. English is used in many official and scientific documents, but most Sudanese people speak Arabic as their first or main language. The Sudan government wants to promote use of the Arabic language terms for plants and animals found in Sudan, by replacing English terms for these plants and animals with Arabic words (transliterated to the Roman alphabet) in all Sudanese English-language government documents. For example, in National Park official documents, references to palm trees will be replaced with the Arabic word for “palm”.

To achieve this, the Sudan government have acquired some text analytics data-set resources which could be useful: a list of plants and animals found in Sudan, in both English and Arabic; a large text corpus of existing Sudan government English-language documents; and the text of a Sudanese English dictionary, a Sudanese equivalent of LDOCE Longman Dictionary of Contemporary English.

However, some English words are ambiguous, for example “palm” can be a type of plant but also has another sense “the inside of a hand”. It is important that an English word is replaced by its Arabic translation only when the word in context is used in a plant or animal sense. The government needs a method to automatically classify such ambiguous words in English documents, to solve the problem of identifying plant or animal senses of ambiguous words.

(a) Outline a supervised machine learning solution to the problem of classifying the sense of words in context. Explain why it could be expensive to develop the training data-set.
[4 marks: 3 marks for outline supervised ML solution, 1 mark for cost explanation]

(b) If the training data-set from (a) is converted to ARFF format, then you could load it into WEKA to test several different classifiers on the data-set, and comparatively evaluate results, to identify the best classifier. The WEKA Explorer Classify tab offers four Test options for evaluating classifiers:
(i) Use training set
(ii) Supplied test set
(iii) Cross validation
(iv) Percentage split
State an advantage and a disadvantage of each of these Test options in selecting the best classifier for this task. [8 marks: 1 mark advantage, 1 mark disadvantage for each option]

(c) Outline an unsupervised or semi-supervised method for the task of identifying plant or animal senses of ambiguous words. Explain why this model could be less expensive to build than your answer to (a). [4 marks: 3 marks for outline un/semi-supervised solution, 1 mark for cost explanation]

(d) The Sudan government wants to train Sudanese university computer scientists in data mining and text analytics, by offering them postgraduate scholarships to study overseas; and as a first step, they want to collate a database of MSc and PhD programmes related to Data Mining and Text Analytics offered by universities worldwide. Each degree programme is to be represented in a database record, with fields for degree title, names of modules included, location, cost, information source, etc. The Sudan government seeks advice on what techniques to use to gather this information from Web sources. Should they use Information Retrieval, or Information Extraction, or both? Outline the difference between IR and IE, and give an overall recommendation to the Sudan government, with justification.
[4 marks: 2 marks difference between IR/IE, 2 marks justified recommendation]

[Question 2 total: 20 marks]

Question 3

Kaggle.com offers online discussion forums, where users can post comments and questions. Kaggle wants to develop a Machine Learning classifier to detect forum comments that use offensive language and could be offensive to other users. The Kaggle forums receive a large volume of comments. Only a small proportion of these comments are offensive, but to maintain the Kaggle reputation for quality and fairness, it is important that all offensive comments are identified and dealt with urgently.

For Kaggle, it is most important that a classifier flags all offensive instances, to be investigated further by customer service experts; the customer service experts will focus on comments flagged by the classifier, and they do not want offensive comments to “slip through the net” and not be dealt with. It does not matter as much if the classifier incorrectly labels some innocent comments as offensive, because the customer service experts should spot these mistaken instances and discount them. However, the customer services managers would prefer to minimize time wasted on examining innocent comments incorrectly flagged as offensive.

(a) Outline how to apply the CRISP-DM methodology to this data mining consultancy project. [6 marks: 1 mark for each CRISP-DM phase applied to this task]

(b) Kaggle has provided a sample data-set of 100 comments, where each instance is labelled with Class value: OFF or NOT. This data-set was used in experiments with three classifiers, which we will call X, Y, and Z. The following are Confusion Matrix outputs for each classifier; for example, for Classifier X, 90 NOT (innocent) instances were classified as NOT, and 10 OFF (offensive) instances were classified as NOT; in other words, all 100 comments were classified as NOT.

X:
 a  b
90  0 | a = NOT
10  0 | b = OFF

Y:
 a  b
10 80 | a = NOT
 0 10 | b = OFF

Z:
 a  b
80 10 | a = NOT
 5  5 | b = OFF

Which WEKA classifier behaves like X? What is the accuracy of X? [2 marks]

(c) For classifiers Y and Z, calculate
(i) accuracy
(ii) precision in predicting offensive comments
(iii) recall in predicting offensive comments
[6 marks]

(d) Which of X, Y and Z is worst and which is best in meeting Kaggle’s requirements, and why? [6 marks: 3 for worst and reason; 3 for best and reason]

[Question 3 total: 20 marks]
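
As a rough illustration of the arithmetic behind Question 3(b) and (c), the sketch below computes accuracy, precision and recall for the OFF class from the three confusion matrices. It assumes the usual WEKA layout, where rows are actual classes and columns are predicted classes. The names `off_metrics`, `fmt` and `MATRICES` are illustrative and not part of the exam paper. Precision is returned as `None` when a classifier never predicts OFF, because the ratio is then undefined.

```python
# Confusion matrices as printed in the paper: rows = actual, columns = predicted,
# class order (NOT, OFF). This row/column reading is an assumption about the layout.
MATRICES = {
    "X": [[90, 0], [10, 0]],
    "Y": [[10, 80], [0, 10]],
    "Z": [[80, 10], [5, 5]],
}


def off_metrics(matrix):
    """Return (accuracy, precision, recall), treating OFF as the positive class."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Precision: of the comments flagged OFF, how many really are OFF.
    precision = tp / (tp + fp) if (tp + fp) else None
    # Recall: of the truly OFF comments, how many were flagged.
    recall = tp / (tp + fn) if (tp + fn) else None
    return accuracy, precision, recall


def fmt(value):
    """Format a metric, showing 'undefined' for a zero denominator."""
    return "undefined" if value is None else f"{value:.3f}"


for name, matrix in MATRICES.items():
    acc, prec, rec = off_metrics(matrix)
    print(f"{name}: accuracy={fmt(acc)} precision={fmt(prec)} recall={fmt(rec)}")
```

Under these assumptions X never flags a comment, so its recall on OFF is 0 despite its high accuracy, while Y flags every offensive comment at the cost of many false alarms. Recall on OFF is the figure most closely tied to the requirement that no offensive comment slips through the net.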

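One common way to justify the choice of decision points in Question 1(a) is to compare the information gain of each feature on the training rows. The sketch below does this for the seven movie-goers. It is a simplification: C4.5, which J48 implements, ranks splits by gain ratio rather than plain information gain. The names `ROWS`, `FEATURES`, `entropy` and `information_gain` are illustrative and not taken from the exam paper.

```python
from collections import Counter
from math import log2

# Training rows from Question 1, in column order (E, M, W, T).
ROWS = [
    (1, 1, 0, 1),
    (0, 0, 1, 0),
    (1, 0, 1, 1),
    (0, 1, 1, 1),
    (0, 1, 0, 0),
    (1, 0, 0, 1),
    (1, 0, 1, 1),
]
FEATURES = {"M": 1, "W": 2, "T": 3}  # feature name -> column index


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, col):
    """Entropy of class E minus the weighted entropy after splitting on column col."""
    base = entropy([r[0] for r in rows])
    remainder = 0.0
    for value in {r[col] for r in rows}:
        subset = [r[0] for r in rows if r[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


GAINS = {name: information_gain(ROWS, col) for name, col in FEATURES.items()}
for name, gain in sorted(GAINS.items(), key=lambda item: -item[1]):
    print(f"{name}: information gain = {gain:.3f}")

# Training accuracy of a one-level tree that simply predicts E = T.
STUMP_ACCURACY = sum(r[0] == r[3] for r in ROWS) / len(ROWS)
print(f"Predicting E from T alone: {STUMP_ACCURACY:.1%} training accuracy")
```

On this data T has the largest information gain, and a single split on T classifies 6 of the 7 training rows correctly, which is one way to reach the accuracy threshold asked for in part (a).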
Author admin