- May 19, 2020
Information School. INF6028 Coursework 2019-20

Mining and Evaluating a Structured Dataset

1. Introduction

The assessment for INF6028 Data Mining consists of a piece of individual coursework that assesses your ability to understand key data mining, analysis and evaluation concepts. You will be assigned a single dataset and an associated complete KNIME workflow. Each workflow applies appropriate data mining methods to the dataset in order to solve a supervised prediction problem (either regression or classification) and to evaluate the relative performance of the different approaches/algorithms. You will interpret and critically discuss the various techniques and best practices employed in the workflow and evaluate the performance of the algorithms. Note: a video taking you through the workflow step-by-step will also be provided.

You should write a 2,000-word structured report (see Section 3) that includes the following headings (more details on how the report will be assessed are provided below):

• Introduction: introduce the prediction problem.
• Data mining theory: provide a theoretical description of the two supervised data mining methods used in the workflow (for example, the classification or regression techniques that have been used) and explain why they are appropriate to the prediction task.
• Data exploration and preparation: describe the approaches used in the workflow for feature selection, transformation and normalisation, where appropriate.
• Experimental setup: describe the experimental setup and the evaluation measures used in the workflow, and how the data has been handled to ensure that the models were not over-fitted. You should explain which nodes were used in KNIME and provide a rationale for the various parameter settings that were used.
• Results: present the results for each data mining method and compare the performance of the different methods using graphical and tabular methods. What insights can you gain from the models? For example, which are the most important features, and are there any outliers in the predictions?
• Conclusion and reflections: summarise the main findings of your report and reflect on the methods used.

Charts, tables, references and appendices are not included in the word count.

Remember: your report should be a critical evaluation of the workflow in the context of the data mining problem posed; it should not merely describe what was done.

This assessment is worth 100% of the overall module mark for INF6028. A pass mark of 50 is required to pass the module.

Submission deadline: June 8 via Turnitin. See Section 4 for more general information about Coursework Submission Requirements within the Information School.

2. The Datasets and KNIME Workflows

You will be assigned a single dataset and KNIME workflow to base your report on. Before you start working on the assessment, please ensure that you are using the correct dataset and workflow. Note: you should try to open the workflow in KNIME and work from there. However, should you be unable to open the workflow or install KNIME on your machine, you will also be provided with a video, which will take you through the workflow step-by-step.

The datasets have been derived from Kaggle competitions and are downloadable from MOLE in the Coursework Brief & Information section. A brief description of the attributes in each dataset is given at the end of this document. Note that in both cases the data are different to the standard Kaggle datasets.

Titanic-derived dataset
The data is split across two files, each of which contains 1204 entries representing 1204 passengers, although the passengers are not necessarily the same in the two files. The two files are titanic_ticket_data.csv and titanic_personal_data.csv. The aim of this challenge is to build a model that is able to predict whether or not a passenger will survive the sinking of the Titanic.
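Because the passengers in the two Titanic files are not necessarily the same, it can help to see which identifiers appear in both files before combining them. The following is a minimal sketch using only the Python standard library. It assumes the files share the PassengerId column listed in the attribute description at the end of this document; the sample rows are made up for illustration, and the inner join shown is one possible choice, not necessarily what the provided KNIME workflow does.

```python
import csv
import io

# Tiny in-memory stand-ins for the two files; the values are invented.
# Column names follow the attribute list at the end of this brief.
TICKET_CSV = """PassengerId,Survived,Fare
1,0,7.25
2,1,71.28
3,1,7.92
"""
PERSONAL_CSV = """PassengerId,Sex,Age
2,female,38
3,female,26
4,male,35
"""


def read_by_id(text, key="PassengerId"):
    """Return a dict mapping each passenger ID to its row."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def compare_and_join(ticket, personal):
    """Inner-join two ID-keyed tables and list the IDs found in only one."""
    shared = sorted(ticket.keys() & personal.keys(), key=int)
    joined = [{**ticket[pid], **personal[pid]} for pid in shared]
    only_ticket = sorted(ticket.keys() - personal.keys(), key=int)
    only_personal = sorted(personal.keys() - ticket.keys(), key=int)
    return joined, only_ticket, only_personal


ticket_rows = read_by_id(TICKET_CSV)
personal_rows = read_by_id(PERSONAL_CSV)
joined, only_ticket, only_personal = compare_and_join(ticket_rows, personal_rows)
print(len(joined), only_ticket, only_personal)  # prints: 2 ['1'] ['4']
```

An inner join keeps only passengers present in both files, so the merged table can be smaller than either input; whether that is appropriate is one of the choices to discuss in the report.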
Australian Weather-derived dataset
The Australian weather dataset consists of weather data for 16 cities and towns in Australia over a period of nearly 10 years. The aim of this challenge is to predict the total daily rainfall based on other features of the weather.

3. Report Structure

You are required to produce a structured report that includes all the sections detailed in Table 1. You must state the word count somewhere in the report. As there is a word count limit, you should aim to make your writing as concise and informative as possible. The emphasis of the report should be on clarity, accuracy and quality in communicating your findings.

Table 1: Required content of the structured report.

Section | Description | Maximum allocated marks
Structured abstract | This should provide a summary of your report in a structured manner. This is not included in the word count. | Required, but 0 marks
Introduction | This section should introduce the data mining task that is addressed in the report. You should indicate the property/data value that is predicted and give a brief overview of the dataset and methods used. | 10 marks
Data Mining Theory | This section should provide an overview of the algorithms for predictive data mining used in the workflow from a theoretical aspect. Explain why they are relevant to the prediction problem. Support your rationale by providing references to the literature where the techniques have been applied to similar problems. Include a short discussion of the most appropriate methods for evaluating the performance of these data mining methods. | 25 marks
Data Exploration and Preparation | This section should provide a brief description of the data and of the approaches used to pre-process the data. You should present an investigation of the attributes (including the data value to be predicted) and describe any data cleaning employed, including handling of missing data, data transformations and data aggregations. | 10 marks
Experimental Setup | This section should describe the experimental design in the workflow. You should describe the process followed in order to find the best performing model for each method and how this was validated. For example, which KNIME nodes were used? How were they configured? Was cross-validation or a separate validation set used, and why? | 20 marks
Results and Discussion | Present the results of the data mining process, including the results of experiments to find the best model for each data mining method. Compare the best performance of the different methods and, if appropriate, consider which attribute contributes most to each model. Discuss the advantages and disadvantages of the data mining methods. Which of the chosen methods produced the best model and why? | 20 marks
Conclusion and reflections | Summarise the main findings of the analysis and reflect on the choice of methods for the problem; for example, how might the models be improved with hindsight? Use evidence from the literature to support your arguments. | 15 marks

4. Information School Coursework Submission Requirements

It is the student’s responsibility to ensure no aspect of their work is plagiarised or the result of other unfair means. The University’s and Information School’s advice on unfair means can be found in your Student Handbook, available via http://www.sheffield.ac.uk/is/current

Your assignment has a word count limit. A deduction of 3 marks will be applied for coursework that is 5% or more above or below the word count specified above, or that does not state the word count.

It is your responsibility to ensure your coursework is correctly submitted before the deadline. It is highly recommended that you submit well before the deadline.
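As a worked reading of the word-count rule above, the sketch below checks a report's word count against the 2,000-word limit. It is an illustration only: the function name and parameters are hypothetical, the boundary is assumed to be inclusive (a deviation of exactly 5% counts), and it is assumed that a single 3-mark deduction applies whether the count is out of range or simply not stated.

```python
def word_count_penalty(word_count, limit=2000, tolerance_pct=5, deduction=3):
    """Return the marks deducted under this reading of the word-count rule.

    word_count is None when the report does not state a word count.
    """
    if word_count is None:
        return deduction
    # Integer arithmetic avoids floating-point rounding at the boundary.
    if abs(word_count - limit) * 100 >= tolerance_pct * limit:
        return deduction
    return 0


for count in (1900, 1901, 2000, 2099, 2100, None):
    print(count, word_count_penalty(count))
```

Under these assumptions, a report of 1,901 to 2,099 words that states its word count avoids the deduction.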
Coursework submitted after 10am on the stated submission date will result in a deduction of 5% of the mark awarded for each working day after the submission date/time, up to a maximum of 5 working days, where ‘working day’ includes Monday to Friday (excluding public holidays) and runs from 10am to 10am. Coursework submitted after the maximum period will receive zero marks.

Work submitted electronically, including through Turnitin, should be reviewed to ensure it appears as you intended.

Before the submission deadline, you can submit coursework to Turnitin numerous times. Each submission will overwrite the previous submission. Only your most recent submission will be assessed. However, after the submission deadline, the coursework can only be submitted once. Details about the submission of work via Turnitin can be found at http://youtu.be/C_wO9vHHheo

If you encounter any problems during the electronic submission of your coursework, you should immediately contact the module coordinator and a member of the Information School Teaching Support Team [email protected] (Julie Priestley, 0114 2222839). This does not negate your responsibility to submit your coursework on time and correctly.

Titanic Dataset

The Titanic data consist of two files that need to be merged.

The ticket data file, titanic_ticket_data.csv, consists of the following variables:
PassengerId: the identifier
Survived: the value to predict
Ticket: the ticket number
Fare: the passenger fare
Cabin: cabin number
Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

The personal data file, titanic_personal_data.csv, consists of the following variables:
PassengerId: the identifier
Name: the name of the passenger
Sex: male or female
Age: age is fractional if less than 1. If the age is estimated, it is in the form xx.5
SibSp: number of siblings/spouses, where family relations are defined as follows: Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife
Parch: number of parents/children, where family relations are defined as follows: Parent = mother, father; Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore Parch = 0 for them
Salary: in dollars
Job: job title

Australian Weather Dataset

The Australian weather dataset consists of a single CSV file, which contains weather data for 16 cities and towns in Australia over a period of nearly 10 years. The file consists of the following variables:
Date: date of observation
Location: name of town/city where observation was made
MinTemp: minimum temperature recorded (Celsius)
MaxTemp: maximum temperature recorded (Celsius)
Rainfall: total daily rainfall (mm)
Sunshine: total daily sunshine (hours)
WindDir9am: wind direction at 9am
WindDir3pm: wind direction at 3pm
WindSpeed9am: wind speed at 9am (kph)
WindSpeed3pm: wind speed at 3pm (kph)
Humidity9am: humidity at 9am (%)
Humidity3pm: humidity at 3pm (%)
Pressure9am: atmospheric pressure at 9am (hPa)
Pressure3pm: atmospheric pressure at 3pm (hPa)
Temp9am: temperature at 9am (Celsius)
Temp3pm: temperature at 3pm (Celsius)
RainToday: did it rain? (Boolean)
RISK_MM: total daily rainfall the following day (mm)
RainTomorrow: did it rain the following day? (Boolean)

Note: this dataset is different from the “Rain in Australia” dataset on Kaggle.
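Since the weather data differ from the Kaggle original, a quick check that a downloaded file contains the variables listed above may be useful before working with it. The sketch below is a minimal illustration using the Python standard library. It assumes the CSV header uses exactly the variable names given in this document, which may not hold for the real file; the sample header line is hypothetical.

```python
import csv
import io

# Variable names as listed in this brief (assumed to match the CSV header).
EXPECTED_WEATHER_COLUMNS = [
    "Date", "Location", "MinTemp", "MaxTemp", "Rainfall", "Sunshine",
    "WindDir9am", "WindDir3pm", "WindSpeed9am", "WindSpeed3pm",
    "Humidity9am", "Humidity3pm", "Pressure9am", "Pressure3pm",
    "Temp9am", "Temp3pm", "RainToday", "RISK_MM", "RainTomorrow",
]


def check_header(csv_text, expected=EXPECTED_WEATHER_COLUMNS):
    """Return (missing, unexpected) column names for the CSV's header row."""
    header = next(csv.reader(io.StringIO(csv_text)))
    missing = [name for name in expected if name not in header]
    unexpected = [name for name in header if name not in expected]
    return missing, unexpected


# Hypothetical header with one listed column dropped and one extra added.
sample = ",".join(c for c in EXPECTED_WEATHER_COLUMNS if c != "Sunshine") + ",Notes\n"
missing, unexpected = check_header(sample)
print(missing, unexpected)  # prints: ['Sunshine'] ['Notes']
```

To check a real file, the header text could be read with open(path, newline="") and passed to check_header; the path is left to the reader because the file name is not given in this document.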