- May 26, 2020
Informatics MSc Programme Area
Henley Business School
University of Reading
Assessed Coursework Set Front Page

Module code: INMR77
Module name: Business Intelligence and Data Mining
Lecturer responsible: Dr Yin Leng Tan
Work to be handed in by:
Full time students: 26 May 2020
Part time students: 15 June 2020

Assignment Specification

The module is assessed 100% through this coursework assignment. The aim of this coursework is to assess your understanding of business intelligence and your ability to perform data mining tasks by applying concepts, methods and techniques learned during the lectures and practical sessions. The coursework is carried out individually. Students are required to produce an individual report for the tasks as set out below. The complete report should not exceed 20 pages of A4 (with a variation of 20%) with a minimum font size of 10, including tables and diagrams but excluding references and appendices. An appendix can be used to include more detailed materials to back up main body points but will not be assessed. In addition, you are also required to submit the supplementary materials of your output from SAS Enterprise Miner via Blackboard by the specified deadline.

Case Study – Airbnb and Inside Airbnb

Airbnb – Holiday Lets, Homes, Experiences & Places (airbnb.co.uk)

Airbnb is an online marketplace for arranging or offering lodging, i.e. temporary accommodation (primarily homestays), or tourism experiences. It was founded in August 2008 and had 12,736 employees as of 2019.

Service overview: Airbnb provides a platform for hosts to accommodate guests with short-term lodging and tourism-related activities. Guests can search for accommodation using filters such as location, price, and specific types of homes. Before booking, users must provide personal and payment information. Some hosts also require a scan of government-issued identification before accepting a reservation. Hosts provide prices and other details for their rental or listing, e.g. number of guests included in the price, type of property, type of room, number of bathrooms, number of bedrooms, number of beds and type of bed, minimum number of nights for a reservation, and amenities. In addition, Airbnb also provides a review system where hosts and guests can leave reviews about their experience, and rate each other after a stay. By October 2019, two million people were staying with Airbnb each night.

Cancellation policy: Airbnb allows hosts to choose between five types of cancellation policy, designed to protect both hosts and guests. Options include: strict_14_with_grace_period, moderate, flexible, super_strict_30, super_strict_60. (See https://www.airbnb.co.uk/home/cancellation_policies for the definition of each category.)

Security deposits: Some reservations include a security deposit, which can be required by either Airbnb or the host. This helps build trust for both guests and hosts. Some hosts require a security deposit for their listing. If you are a guest booking a listing with a host-required security deposit, you will be shown the amount before you make your reservation. The amount is set by the host, not Airbnb. In this case, no authorisation hold will be placed, and you will only be charged if a host makes a claim on the security deposit. (See https://www.airbnb.co.uk/help/article/140/how-does-airbnb-handle-security-deposits)

Sources: Wikipedia, Airbnb.co.uk
For further information on Airbnb, please visit: https://www.airbnb.co.uk/

Inside Airbnb – adding data to the debate (http://insideairbnb.com/index.html)

Inside Airbnb is an independent, non-commercial set of tools and data that allows an individual to explore how Airbnb is really used in cities around the world. It was set up by Murray Cox and John Morris in 2016. Airbnb claims to be part of the “sharing economy” and disrupting the hotel industry.
However, data shows that the majority of Airbnb listings in most cities are entire homes, many of which are rented all year round, disrupting housing and communities. For example, local residents and governments are more concerned with hosts who are not present when the rental takes place and those who have multiple listings on the site than with a user who is renting a spare room. By analysing publicly available information about a city’s Airbnb listings, Inside Airbnb provides filters and key metrics so users can see how Airbnb is being used to compete with the residential housing market. With Inside Airbnb, users can ask fundamental questions about Airbnb in any neighbourhood, or across the city as a whole, such as:
• how many listings are in my neighbourhood and where are they?
• how many houses and apartments are being rented out frequently to tourists and not to long-term residents?
• how much are hosts making from renting to tourists (compare that to long-term rentals)?
• which hosts are running a business with multiple listings and where are they?

These questions (and the answers) get to the core of the debate for many cities around the world, with Airbnb claiming that their hosts only occasionally rent the homes in which they live. In addition, many city or state laws and ordinances that address residential housing, short-term or vacation rentals, and zoning make reference to allowed use, including:
• how many nights a dwelling is rented per year
• minimum nights stay
• whether the host is present
• how many rooms are being rented in a building
• the number of occupants allowed in a rental
• whether the listing is licensed

The Inside Airbnb tool or data can be used to answer some of these questions. Some understanding of how the Airbnb platform is being used will help clear up the laws as they change.
Source: insideairbnb.com
For further information on Inside Airbnb, please visit: http://insideairbnb.com/index.html

Airbnb in Greater Manchester, UK

Dataset: Airbnb_man_reduced.csv (available to download on Blackboard). Two additional datasets, man_reviews.csv and man_calander.csv, are also provided for information only.

Description of the dataset: The Airbnb data for Greater Manchester is made available by Inside Airbnb. The original data set was downloaded from the website in November 2019. The number of variables, however, is reduced from the original data set. There are 4,848 listings in the data set with a total of 57 variables. Each row represents a single listing and contains information about the host of the property, the property’s characteristics, and guests’ ratings of the property overall and of its associated features. Table 1 shows the name, description, and type of the 57 variables.

Table 1: Variable name, description, and type for each variable in the dataset.

# | Variable Name | Description | Variable Type
1 | listing_id | Unique identifier for each Airbnb listing | Numeric
2 | listing_url | URL of the listing | Text
3 | description | Description of the listing | Text
4 | house_rule | Description of house rules | Text
5 | host_id | Unique identifier of the host | Numeric
6 | host_url | URL of the host | Text
7 | host_name | Name of the host | Text
8 | host_since | Date since the host has been a member | Date
9 | host_about | Description of the host | Text
10 | host_response_time | How quickly the host responds to inquiries. 5 categories: within a day, within an hour, a few days or more, within a few hours, N/A | Categorical
11 | host_response_rate | Rate at which host responded to inquiries (percentage value) | Numeric
12 | host_is_superhost | Is the host a superhost (1 = Yes, 0 = No) | Binary
13 | host_identity_verified | Whether the host is verified or not (1 = Yes, 0 = No) | Binary
14 | neighbourhood_cleased | Name of the neighbourhood (41 categories) | Categorical
15 | borough | Name of the borough (10 categories) | Categorical
16 | property_type | Type of the property (30 categories) | Categorical
17 | room_type | Type of the room. 4 categories: Entire home/apt, Private room, Shared room, Hotel room | Categorical
18 | accomodates | Number of people that can be accommodated | Numeric
19 | bathrooms | Number of bathrooms | Numeric
20 | bedrooms | Number of bedrooms | Numeric
21 | beds | Number of beds | Numeric
22 | bed_type | Type of bed. 6 categories: Real Bed, Pull-out Sofa, Futon, Couch, Airbed | Categorical
23 | amenities | List of amenities included | Text
24 | price | Price per night (in GBP) | Numeric
25 | weekly_price | Price per week (in GBP) | Numeric
26 | monthly_price | Price per month (in GBP) | Numeric
27 | Security_deposit | Amount of host-required security deposit | Numeric
28 | cleaning_fee | One-time fee charged by host to cover the cost of cleaning their space | Numeric
29 | guest_included | Number of guests included in the price | Numeric
30 | extra_people | Additional charge per person (GBP) | Numeric
31 | minimum_nights | Minimum number of nights for a reservation | Numeric
32 | maximum_nights | Maximum number of nights for a reservation | Numeric
33 | calendar_updated | Calendar last updated by the host (70 categories) | Categorical
34 | has_availability | Whether the host has availability or not (1 = Yes, 0 = No) | Binary
35 | availability_30 | Number of days available for the next 30 days | Numeric
36 | availability_60 | Number of days available for the next 60 days | Numeric
37 | availability_90 | Number of days available for the next 90 days | Numeric
38 | availability_365 | Number of days available for the next 365 days | Numeric
39 | number_reviews | Number of reviews in total | Numeric
40 | first_review | Date of first review | Date/Time
41 | last_review | Date of last review | Date/Time
42 | review_scores_rating | Overall rating of the property (percentage value) | Numeric
43 | review_scores_accuracy | Rating for the accuracy of the description | Numeric
44 | review_scores_cleanliness | Rating for the cleanliness of the property | Numeric
45 | review_scores_checkin | Rating for the check-in experience | Numeric
46 | review_scores_communication | Rating for the host’s communication with guests | Numeric
47 | review_scores_location | Rating for the location of the property | Numeric
48 | review_scores_value | Rating for the value of the property | Numeric
49 | instant_bookable | Whether the property can be booked instantly (1 = Yes, 0 = No) | Binary
50 | cancellation_policy | The cancellation policy for the host. 5 categories: strict_14_with_grace_period, moderate, flexible, super_strict_30, super_strict_60 | Categorical
51 | require_guest_profile_picture | Whether guest profile picture is required or not (1 = Yes, 0 = No) | Binary
52 | require_guest_phone_verification | Whether guest phone verification is required or not (1 = Yes, 0 = No) | Binary
53 | host_listings_count | The number of listings of the host | Numeric
54 | host_listings_count_entire_homes | The number of listings of entire homes | Numeric
55 | host_listings_count_private_rooms | The number of listings of private rooms | Numeric
56 | host_listings_count_shared_rooms | The number of listings of shared rooms | Numeric
57 | reviews_per_month | Number of reviews per month for the property | Numeric

The local government and residents would like to know how Airbnb is used in the region and seek your help on this. They would particularly like to know how many of the listings/hosts are offering genuine lodging (i.e. temporary accommodation, primarily homestays, or tourism experiences) and not running as a business, as opposed to hosts offering long-term lets with multiple listings and no owner present (likely to be running a business), which could be illegal.

Your goals are to:

a) identify clusters of listings based on different sets (or a combination) of variables, e.g. host’s characteristics, listing/property characteristics and availability, and reviews from guests, so as to provide insights to the local government and residents.

Note: There are many measurements that could be used to differentiate the two, e.g.
single listing vs multiple listings, although a host may list separate rooms in the same apartment, or multiple apartments or entire homes. Availability is another measure; likewise, occupancy. You are asked to justify the variables/measurements used for your clustering tasks. Greater Manchester uses the following parameters for the measurements:
• a high availability metric and filter of 60 days per year
• a frequently rented filter of 60 days per year
• a review rate of 50% for the number of guests making a booking who leave a review
• an average booking of 3 nights unless a higher minimum nights is configured for a listing
• a maximum occupancy rate of 70% to ensure the occupancy model does not produce artificially high results based on the available data
(see http://insideairbnb.com/greater-manchester/?neighbourhood=&filterEntireHomes=false&filterHighlyAvailable=false&filterRecentReviews=false&filterMultiListings=false)

b) select what you think is the best segmentation/clustering based on the results obtained in a) and comment on the characteristics, e.g. clusters that best separate listings that are genuine lodging from those that could be illegal, i.e. running as a business.

c) develop a classification model to identify which listings/hosts are genuine and which could be considered illegal, based on your results obtained in b).

Useful information/websites:
• Clampet (2014) Airbnb in NYC: The Real Numbers Behind the Sharing Story – available at https://skift.com/2014/02/13/airbnb-in-nyc-the-real-numbers-behind-the-sharing-story/
• Inside Airbnb http://insideairbnb.com/index.html

What to deliver in the final report:

Your report should include the following sections:

1. Introduction: This should include the background of Airbnb and Inside Airbnb; the opportunities and challenges of the sharing economy for the business (Airbnb), home owners (hosts), local residents and governments, and guests/tourists; and how business intelligence and data mining could be used to address the opportunities and challenges for the various stakeholders. It should also outline how the report is structured. Justify your answer with examples/data and findings from literature and related work in this area.

2. Model building and Results Discussion

a) Identify clusters of listings. In this section, you should discuss the purpose of the data mining tasks, the data mining process, including data exploration and data preparation/preprocessing, and the approaches taken, e.g. variables used for the clustering. You are expected to justify and discuss any action/decision you made during the data mining process and model building, and to make references to your output in SAS Enterprise Miner within your report where necessary.

Note: In deciding what k to use (and also how many variables to include), the following factors should be considered:
• How distinct are the clusters? Is good separation achieved?
• How consistent are they? If cluster #1 shows low values on one measure, does it also show low values on other measures?
• How simple are they to describe? Simple clusters are more interpretable by domain knowledge experts, easier to take action on, and are more likely to be statistically stable and not the result of random chance.

b) Discuss what is the best segmentation/clustering based on the results obtained from the process in a). You should discuss what you think is the best segmentation and comment on the characteristics of these clusters. Consider how this information could be used by local government and residents. Use screenshots and/or make references to your output in SAS Enterprise Miner to illustrate important and interesting findings where necessary.
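
The separation question in the note under a) can be made concrete with a silhouette score. The sketch below is a minimal, self-contained Python illustration, not the SAS Enterprise Miner workflow that the coursework requires. It runs a plain k-means for several values of k on made-up, already-scaled two-variable data and reports the mean silhouette for each k. The silhouette measure, the helper names (farthest_first, kmeans, mean_silhouette) and the toy points are all assumptions introduced for illustration; none of them come from this brief or from the dataset.

```python
import math


def farthest_first(points, k):
    """Deterministic seeding: start at the first point, then repeatedly add
    the point that is farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iters=100):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def mean_silhouette(points, labels):
    """Mean silhouette over all points; singleton clusters contribute 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1 or len(clusters) == 1:
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical, already-scaled values standing in for two listing variables.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (10, 10), (10, 11), (11, 10), (11, 11)]

scores = {k: mean_silhouette(points, kmeans(points, k)) for k in range(2, 5)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

A higher mean silhouette suggests tighter, better-separated clusters, but it speaks only to the distinctness criterion. Consistency across measures and ease of description still have to be judged from the cluster profiles themselves.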
c) Develop a classification model that classifies the data into these segments. In this section, you should discuss the purpose of the data mining, including the target segment/cluster, the data mining process, including data preparation/preprocessing, and the rationale and approaches taken, e.g. variables used for the model building. You are expected to justify and discuss any action/decision you made during the data mining process and model building, as well as model evaluation, and to make references to your output in SAS Enterprise Miner within your report where necessary.

3. Conclusion, critical evaluation and suggestion for improvement

In this section, you are required to conclude and provide a summary of your key findings, and discuss the limitations of your data models/mining/analyses and suggestions for improvement by taking into consideration current research issues in data mining.

The criteria used for grading the assignment:

Aspects/Criteria, % Range, Descriptors

Introduction (ILO-1, ILO3, ILO5)
70% and above: A highly effective introduction, setting context and indicating content that will follow. Wide background reading; novel examples and use of relevant literature/sources in supporting the arguments/viewpoints.
60-69%: A very good introduction, setting context and indicating content that will follow. Good background reading; generally very good use of examples and relevant sources/literature in supporting the arguments/viewpoints.
50-59%: Adequate introduction incorporating one or more of the above, yet lacking in clarity in some area(s). Good use of examples and sources/literature in supporting the arguments/viewpoints.
49% and below: A basic introduction with a narrow or limited reference to defining the area, setting the context and indicating content that will follow. Little evidence of appropriate reading or ability to synthesise information. Few or no examples given.

Model Building, Results Discussion and Model Evaluation (ILO2, ILO3, ILO4, ILO6)
70% and above: Novelty and originality. A coherent, well-focused, original approach to the model building, entirely relevant to the tasks, with excellent support and justification for the variables and techniques used for the modelling. Excellent discussion and interpretation of the obtained results/analysis with original insights. Excellent model evaluations and comparisons provided with clear evidence of critical analysis of findings.
60-69%: A generally clear and coherent discussion with good support or justification for the model building, which is directly relevant to the tasks. Clear rationale for the approaches taken. Very good discussion and interpretation of the obtained results/analysis. Very good model evaluations and comparisons provided with some critical analysis of findings.
50-59%: Reasonable attempt at the modelling but prone to being descriptive or narrative; little rationale for the approaches taken or justification of the variables used. Generally relevant to the stated tasks. Reasonable discussion and interpretation of the obtained results/analysis. Reasonable discussion of model evaluations and comparisons though with little evidence of critical analysis of findings.
49% and below: Little discussion and evidence of model building. Failure to understand the purpose of the task. Little discussion and interpretation of the obtained results/analysis. Little or no discussion of model evaluations and comparisons.

Conclusion, critical evaluation and future improvements (ILO1, ILO5 and ILO6)
70% and above: Comprehensive and extremely well discussed with original insights drawing from the analyses conducted and suggestions for future improvements.
60-69%: Very well discussed with interesting insight, drawing from the results/analyses conducted. Very good critical evaluation and suggestions for future improvement.
50-59%: Reasonably discussed but prone to being descriptive with little critical analysis based on the results/analyses conducted. Generally relevant to the stated tasks. Some critical analysis but prone to being descriptive or narrative; evidence supports the conclusion, but not always very directly/clearly. The question is not fully addressed.
49% and below: Largely descriptive. The discussion is limited in scope and/or relevance. The question is only partially addressed.