Tutoring Case: INF553 Assignment 4

INF553 Foundations and Applications of Data Mining
Spring 2020
Assignment 4
Deadline: Apr. 6th 11:59 PM PST

1. Overview of the Assignment

In this assignment, you will explore the Spark GraphFrames library and implement your own Girvan-Newman algorithm using the Spark framework to detect communities in graphs. You will use the ub_sample_data.csv dataset to find users who have a similar business taste. The goal of this assignment is to help you understand how to use the Girvan-Newman algorithm to detect communities efficiently within a distributed environment.

2. Requirements

2.1 Programming Requirements

a. You must use Python and Spark to implement all tasks. There will be a 10% bonus for each task if you also submit a Scala implementation and both your Python and Scala implementations are correct.
b. You can use the Spark DataFrame and GraphFrames library for task1, but for task2 you can ONLY use Spark RDD and standard Python or Scala libraries. (P.S. For Scala, you can try GraphX, but for the assignment, you need to use GraphFrames.)

2.2 Programming Environment

Python 3.6, Scala 2.11 and Spark 2.3.2

We will use Vocareum to automatically run and grade your submission. You must test your scripts on the local machine and the Vocareum terminal before submission.

2.3 Write your own code

Do not share code with other students!!

For this assignment to be an effective learning experience, you must write your own code! We emphasize this point because you will be able to find Python implementations of some of the required functions on the web. Please do not look for or at any such code!

TAs will combine all the code we can find from the web (e.g., Github) as well as other students’ code from this and other (previous) sections for plagiarism detection. We will report all detected plagiarism.

2.4 What you need to turn in

You need to submit the following files on Vocareum (all lowercase):

a. [REQUIRED] two Python scripts, named: task1.py, task2.py
b1. [REQUIRED FOR SCALA] two Scala scripts, named: task1.scala, task2.scala
b2. [REQUIRED FOR SCALA] one jar package, named: hw4.jar
c. [OPTIONAL] You can include other scripts called by your main program.
d. You don’t need to include your results. We will grade your code with our testing data (data will be in the same format).

3. Datasets

You will continue to use the Yelp dataset. We have generated a sub-dataset, ub_sample_data.csv, from the Yelp review dataset containing user_id and business_id. You can download it from Vocareum.

4. Tasks

4.1 Graph Construction

To construct the social network graph, each node represents a user, and there is an edge between two nodes if the number of businesses that both users reviewed is greater than or equal to the filter threshold. For example, suppose user1 reviewed [business1, business2, business3] and user2 reviewed [business2, business3, business4, business5]. If the threshold is 2, there will be an edge between user1 and user2.

If a user node has no edge, we will not include that node in the graph.

In this assignment, we use filter threshold 7.

4.2 Task1: Community Detection Based on GraphFrames (2 pts)

In task1, you will explore the Spark GraphFrames library to detect communities in the network graph you constructed in 4.1. The library provides an implementation of the Label Propagation Algorithm (LPA), which was proposed by Raghavan, Albert, and Kumara in 2007. It is an iterative community detection solution whereby information “flows” through the graph based on the underlying edge structure. For the details of the algorithm, you can refer to the paper posted on Piazza. In this task, you do not need to implement the algorithm from scratch; you can call the method provided by the library.
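The edge rule from section 4.1 can be illustrated with a minimal sketch in plain Python. This is not a Spark solution, only a way to check the rule on a tiny input. The function name `build_edges` and the input shape (a list of `(user_id, business_id)` pairs) are assumptions made for illustration, and the sketch also assumes that repeated reviews of the same business by the same user count once.

```python
from collections import defaultdict
from itertools import combinations


def build_edges(reviews, threshold):
    """Return undirected edges between users who share >= threshold businesses.

    reviews: iterable of (user_id, business_id) pairs (hypothetical input shape).
    Assumption: a business reviewed several times by one user counts once.
    """
    businesses_by_user = defaultdict(set)
    for user_id, business_id in reviews:
        businesses_by_user[user_id].add(business_id)

    edges = set()
    # Compare every unordered pair of users; each edge is stored with the
    # lexicographically smaller user_id first.
    for u, v in combinations(sorted(businesses_by_user), 2):
        shared = businesses_by_user[u] & businesses_by_user[v]
        if len(shared) >= threshold:
            edges.add((u, v))
    return edges


# The example from section 4.1, plus a third user who shares only one business.
reviews = [
    ("user1", "business1"), ("user1", "business2"), ("user1", "business3"),
    ("user2", "business2"), ("user2", "business3"),
    ("user2", "business4"), ("user2", "business5"),
    ("user3", "business5"),
]
edges = build_edges(reviews, threshold=2)

# Only users that appear in at least one edge become nodes of the graph.
nodes = {user for edge in edges for user in edge}
print(edges)
print(sorted(nodes))
```

With threshold 2, user1 and user2 share business2 and business3, so they are connected; user3 has no edge and is left out of the graph. Comparing all pairs is quadratic in the number of users, which is fine for a sanity check but is one of the things a distributed version would need to handle differently.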
The following websites may help you get started with Spark GraphFrames:

https://docs.databricks.com/spark/latest/graph-analysis/graphframes/user-guide-python.html
https://docs.databricks.com/spark/latest/graph-analysis/graphframes/user-guide-scala.html

4.2.1 Execution Detail

The version of GraphFrames should be 0.6.0.

For Python:
• In PyCharm, run pip install graphframes and add the line below to your code:
os.environ["PYSPARK_SUBMIT_ARGS"] = ("--packages graphframes:graphframes:0.6.0-spark2.3-s_2.11")
• In the terminal, you need to assign the parameter "packages" of spark-submit:
--packages graphframes:graphframes:0.6.0-spark2.3-s_2.11

For Scala:
• In IntelliJ IDEA, you need to add library dependencies to your project:
"graphframes" % "graphframes" % "0.6.0-spark2.3-s_2.11"
"org.apache.spark" %% "spark-graphx" % sparkVersion
• In the terminal, you need to assign the parameter "packages" of spark-submit:
--packages graphframes:graphframes:0.6.0-spark2.3-s_2.11

For the parameter "maxIter" of the LPA method, you should set it to 5.

4.2.2 Output Result

In this task, you need to save your result of communities in a txt file. Each line represents one community and the format is:

'user_id1', 'user_id2', 'user_id3', 'user_id4', …

Your result should be sorted first by the size of the communities in ascending order and then by the first user_id in the community in lexicographical order (the user_id is of type string). The user_ids in each community should also be in lexicographical order.

If there is only one node in the community, we still regard it as a valid community.

Figure 1: community output file format

4.3 Task2: Community Detection Based on Girvan-Newman algorithm (6 pts)

In task2, you will implement your own Girvan-Newman algorithm to detect the communities in the network graph. Because your task1 and task2 code will be executed separately, you need to construct the graph again in this task following the rules in section 4.1. You can refer to Chapter 10 of the Mining of Massive Datasets book for the algorithm details.

For task2, you can ONLY use Spark RDD and standard Python or Scala libraries. Remember to delete your code that imports graphframes.

4.3.1 Betweenness Calculation (3 pts)

In this part, you will calculate the betweenness of each edge in the original graph you constructed in 4.1. Then you need to save your result in a txt file. The format of each line is:

('user_id1', 'user_id2'), betweenness value

Your result should be sorted first by the betweenness values in descending order and then by the first user_id in the tuple in lexicographical order (the user_id is of type string). The two user_ids in each tuple should also be in lexicographical order. You do not need to round your result.

Figure 2: betweenness output file format

4.3.2 Community Detection (3 pts)

You are required to divide the graph into suitable communities, reaching the globally highest modularity. The formula of modularity is shown below:

Q = \frac{1}{2m} \sum_{s \in S} \sum_{i, j \in s} \left[ A_{ij} - \frac{k_i k_j}{2m} \right]

Here S is the set of communities and k_i is the degree of node i.

According to the Girvan-Newman algorithm, after removing one edge, you should re-compute the betweenness. The "m" in the formula represents the edge number of the original graph. The "A" in the formula is the adjacency matrix of the original graph. (Hint: In each remove step, "m" and "A" should not be changed.)

If the community only has one user node, we still regard it as a valid community.

You need to save your result in a txt file. The format is the same as the output file from task1.

4.4 Execution Format

Execution example:

Python:
spark-submit --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11 task1.py <filter threshold>
spark-submit task2.py

Scala:
spark-submit --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11 --class task1 hw4.jar
spark-submit --class task2 hw4.jar

Input parameters:
1. Filter threshold: the filter threshold to generate edges between user nodes.
2. Input file path: the path to the input file including path, file name and extension.
3. Betweenness output file path: the path to the betweenness output file including path, file name and extension.
4. Community output file path: the path to the community output file including path, file name and extension.

Execution time:
The overall runtime limit of your task1 (from reading the input file to finishing writing the community output file) is 200 seconds.
The overall runtime limit of your task2 (from reading the input file to finishing writing the community output file) is 250 seconds.
If your runtime exceeds the above limit, you will receive no points for that task.

5. About Vocareum

a. You can use the provided datasets under the directory resource: /asnlib/publicdata/
b. You should upload the required files under your workspace: work/
c. You must test your scripts on both the local machine and the Vocareum terminal before submission.
d. During the submission period, Vocareum will automatically test task1 and task2.
e. During the grading period, Vocareum will use another dataset that has the same format for testing.
f. We do not test the Scala implementation during the submission period.
g. Vocareum will automatically run both Python and Scala implementations during the grading period.
h. Please start your assignment early! You can resubmit any script on Vocareum. We will only grade your last submission.

6. Grading Criteria
(% penalty = % penalty of possible points you get)

a. You can use your free 8-day extension separately or together. You must submit a late-day request via https://forms.gle/worKTbCRBWKQ6jSu6. This form records the number of late days you use for each assignment. By default, we will not count the late days if no request is submitted.
b. There will be a 10% bonus for each task if your Scala implementations are correct. The Scala bonus is calculated only when your Python results are correct. There is no partial credit for Scala.
c. There will be no points if your submission cannot be executed on Vocareum.
d. There is no regrading. Once the grade is posted on Blackboard, we will only regrade your assignments if there is a grading error. No exceptions.
e. There will be a 20% penalty for late submission within one week and no points after that.
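As a companion to sections 4.3.1 and 4.3.2, the two quantities involved can be sketched in plain Python on a small graph. This is a single-machine illustration, not a Spark RDD solution. The function names `edge_betweenness` and `modularity` are hypothetical, the seven-node graph is a small hand-made example, and the modularity function assumes the standard form Q = (1/2m) * sum over communities of sum over node pairs (A_ij - k_i*k_j/2m), with the degrees k_i taken from the original graph (an assumption the assignment text does not settle).

```python
from collections import defaultdict, deque


def edge_betweenness(adjacency):
    """Girvan-Newman edge betweenness for an undirected, unweighted graph.

    adjacency: dict mapping each node to the set of its neighbours.
    """
    totals = defaultdict(float)
    for root in adjacency:
        # BFS from root: record depth, number of shortest paths, and parents.
        depth = {root: 0}
        num_paths = defaultdict(float)
        num_paths[root] = 1.0
        parents = defaultdict(list)
        order = []
        queue = deque([root])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nbr in adjacency[node]:
                if nbr not in depth:
                    depth[nbr] = depth[node] + 1
                    queue.append(nbr)
                if depth[nbr] == depth[node] + 1:
                    num_paths[nbr] += num_paths[node]
                    parents[nbr].append(node)
        # Walk back from the deepest nodes, splitting each node's credit
        # among its parents in proportion to their shortest-path counts.
        credit = {node: 1.0 for node in order}
        for node in reversed(order):
            for parent in parents[node]:
                share = credit[node] * num_paths[parent] / num_paths[node]
                totals[tuple(sorted((parent, node)))] += share
                credit[parent] += share
    # Every shortest path was counted once from each endpoint, so halve.
    return {edge: value / 2 for edge, value in totals.items()}


def modularity(communities, original_adjacency, m):
    """Modularity of a partition, using A and m from the original graph.

    Assumption: degrees k_i are also taken from the original graph.
    """
    total = 0.0
    for community in communities:
        for i in community:
            for j in community:
                a_ij = 1.0 if j in original_adjacency[i] else 0.0
                k_i = len(original_adjacency[i])
                k_j = len(original_adjacency[j])
                total += a_ij - k_i * k_j / (2.0 * m)
    return total / (2.0 * m)


# Small illustrative graph: two tight groups joined by the edge B-D.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("D", "E"),
         ("D", "F"), ("D", "G"), ("E", "F"), ("F", "G")]
graph = defaultdict(set)
for u, v in edges:
    graph[u].add(v)
    graph[v].add(u)
graph = dict(graph)

scores = edge_betweenness(graph)
# Sort as section 4.3.1 asks: betweenness descending, then edge lexicographically.
ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
for edge, value in ranked:
    print(edge, value)

split = [{"A", "B", "C"}, {"D", "E", "F", "G"}]
print(modularity(split, graph, m=len(edges)))
```

In this example the bridge B-D carries every shortest path between the two groups, so it ranks first; removing it yields the two-community split whose modularity is printed last. A full Girvan-Newman loop would repeat the betweenness step on the graph with edges removed while still evaluating modularity against the original "m" and "A", as the hint in section 4.3.2 says.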
