• May 15, 2020

Assignment 1
COMP9418 – Advanced Topics in Statistical Machine Learning
Lecturer: Gustavo Batista
Last revision: Tuesday 24th September, 2019 at 12:52

Instructions

Submission deadline: Friday, October 11th, 2019, at 21:00:00.

Late Submission Policy: The penalty is set at 20% per late day. This is a ceiling penalty: each late day lowers the maximum achievable mark, so a group that is marked 60/100 and submitted two days late still gets 60/100.

Form of Submission: This is a group assignment. Each group can have up to three students. Write the names and zIDs of each student in both the report and the Jupyter notebook. Only one member of the group should submit the assignment.

The group should submit its solution as a single file in zip format with the name solution.zip. There is a maximum file size cap of 5MB, so make sure your submission does not exceed this size. The zip file should contain one Jupyter notebook file and one pdf file. The Jupyter notebook should contain all your source code. Use markdown text to organise and explain your implementation. The pdf file is a 2-page report summarising your findings. The report can include text and plots to illustrate your results.

Submit your files using give. On a CSE Linux machine, type the following on the command line:

$ give cs9418 ass1 solution.zip

Alternatively, you can submit your solution via the course website.

Recall the guidance regarding plagiarism in the course introduction: this applies to this homework, and if evidence of plagiarism is detected, it may result in penalties ranging from loss of marks to suspension.

The dataset and breast cancer domain description in the Background section are from the assignment developed by Peter Lucas, Institute for Computing and Information Sciences, Radboud Universiteit.

Introduction

In this assignment, you will develop some sub-routines in Python to create useful operations on Bayesian Networks.
You will implement an efficient independence test, learn parameters from data, sample from the joint distribution and classify examples.

We will use a Bayesian Network for diagnosis of breast cancer. We start with some background information about the problem.

Background

Breast cancer is the most common form of cancer and the second leading cause of cancer death in women. One in nine women will develop breast cancer in her lifetime. Although it is not possible to say exactly what causes breast cancer, some factors may increase or change the risk of developing it. These include age, genetic predisposition, history of breast cancer, breast density and lifestyle factors. Age, for example, is the most significant risk factor for non-hereditary breast cancer: women aged 50 or older have a higher chance of developing breast cancer than younger women. Presence of BRCA1/2 genes leads to an increased risk of developing breast cancer irrespective of other risk factors. Furthermore, breast characteristics, such as high breast density, are determining factors for breast cancer.

The main technique currently used for detection of breast cancer is mammography, an X-ray image of the breast. It is based on the differential absorption of X-rays between the various tissue components of the breast, such as fat, connective tissue, tumour tissue and calcifications. On a mammogram, radiologists can recognise breast cancer by the presence of a focal mass, architectural distortion or microcalcifications. Masses are localised findings, generally asymmetrical in relation to the other breast and distinct from the surrounding tissues. Masses on a mammogram are characterised by several features, such as size, margin and shape, which help distinguish between malignant and benign (non-cancerous) masses. For example, a mass with irregular shape and ill-defined margin is highly suspicious for cancer, whereas a mass with round shape and well-defined margin is likely to be benign.
Architectural distortion is focal disruption of the normal breast tissue pattern, which appears on a mammogram as a distortion in which surrounding breast tissues appear to be “pulled inward” into a focal point, often leading to spiculation (star-like structures). Microcalcifications are tiny bits of calcium, which may show up in clusters or in patterns (like circles or lines) and are associated with extra cell activity in breast tissue. They can also be benign or malignant. It is also known that most of the cancers are located in the upper outer quadrant of the breast. Finally, breast cancer is characterised by several physical symptoms: nipple discharge, skin retraction and palpable lump.

Breast cancer develops in stages. The early stage is referred to as in situ (“in place”), meaning that cancer remains confined to its original location. When it has invaded the surrounding fatty tissue and possibly has spread to other organs or the lymph, so-called metastasis, it is referred to as invasive cancer. It is known that early detection of breast cancer can help improve the survival rates.

[25 Marks] Task 1 – Efficient d-separation test

In this part of the assignment, you will implement an efficient version of the d-separation algorithm. Let us start with a definition for d-separation:

Definition. Let X, Y and Z be disjoint sets of nodes in a DAG G. We will say that X and Y are d-separated by Z, written dsep(X,Z,Y), iff every path between a node in X and a node in Y is blocked by Z, where a path is blocked by Z iff there is at least one inactive triple on the path.

This definition of d-separation considers all paths connecting a node in X with a node in Y. The number of such paths can be exponential. The following algorithm provides a more efficient implementation of the test that does not require enumerating all paths.

Algorithm.
Testing whether X and Y are d-separated by Z in a DAG G is equivalent to testing whether X and Y are disconnected in a new DAG G′, which is obtained by pruning DAG G as follows:

1. We delete any leaf node W from DAG G as long as W does not belong to X ∪ Y ∪ Z. This process is repeated until no more nodes can be deleted.
2. We delete all edges outgoing from nodes in Z.

Implement the efficient version of the d-separation algorithm in a function d_separation(G,X,Y,Z) that returns a boolean: true if X is d-separated from Y given Z and false otherwise. Comment on the time complexity of this procedure.

[05 Marks] Task 2 – Estimate Bayesian Network parameters from data

Estimating the parameters of a Bayesian Network is a relatively simple task if we have complete data. The file bc.csv has 20,000 complete instances, i.e., without missing values. The task is to estimate and store the conditional probability tables for each node of the graph. As we will see in more detail in the Naive Bayes and Bayesian Network learning lectures, the Maximum Likelihood Estimates (MLE) for those probabilities are simply the empirical probabilities (normalised counts) obtained from data.

Implement a function learn_bayes_net(G, file, outcomeSpace, prob_tables) that learns the parameters of the Bayesian Network G. This function should output a dictionary prob_tables with all the conditional probability tables (one for each node), as well as the outcomeSpace with the variables' domain values.

We are working with a small Bayesian Network with 16 nodes. What will be the size of the joint distribution with all 16 variables?

As we have implemented most of this function in the tutorials, Task 2 has a value of 5 marks.

[25 Marks] Task 3 – Sampling

We can sample a Bayesian Network to create instances according to the joint distribution. This procedure has many applications; one of them is to answer probabilistic queries in an efficient but approximate way.

A simple sampling procedure is known as forward or ancestral sampling.
It consists of traversing the graph G in topological order. We use a random number generator to draw a value of each variable X according to P(X | Parents(X)).

The next figure illustrates this idea. One possible topological order for this graph is C, S and R. We can use a random number generator to draw a number between 0 and 1. For the node C, we use a cutoff of 0.5. If the number is less than the cutoff then C = +c, otherwise C = -c. Let us suppose the random number is 0.3 and, therefore, we take C = +c. We continue in topological order and sample a value for the variable S according to P(S | +c). The cutoff is now 0.1. Suppose the random number generator returns 0.7. Thus, we assign the value -s to S. We continue to variable R and sample a value according to P(R | +c), leading to a cutoff of 0.8. We use the random number generator again and sample R = +r. In the end, the generated sample is +c, -s, +r. We can repeat this process to generate more instances.

Use forward sampling to generate 1000 samples from the Breast Cancer Bayesian Network. Comment on the time complexity of the procedure and the accuracy of the estimates. What happens to the accuracy and the effective sample size as you add more observed variables to the query?

[25 Marks] Task 4 – Classification

This particular Bayesian Network has a variable that plays a central role in the analysis. The variable BC (Breast Cancer) can assume the values No, Invasive and InSitu. Accurately identifying its correct value would lead to an automatic system that could help in early breast cancer diagnosis.

Use the Bayesian Network to classify cases of the dataset. Propose an experimental setup to estimate the classification error. Compare the classification error of the Bayesian Network with that of your favourite Machine Learning classifier.

[20 Marks] Task 5 – Report

Write a two-page report (around 1000 words) summarising your findings in this assignment.
Some suggestions for the report are:

• What were the main challenges, and how did you solve them?
• Answer the questions of each task.
• Discuss the complexity of the implemented algorithms.
• Include plots to illustrate your results.
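
As a reference for the forward-sampling walk-through in Task 3, the three-node example (C, S, R) can be sketched in Python roughly as follows. This is only an illustration on the toy network, not a solution for the Breast Cancer network. The names PARENTS, P_POSITIVE, TOPOLOGICAL_ORDER and forward_sample are hypothetical, and the probabilities conditioned on -c are placeholder assumptions, since the text only gives P(+c) = 0.5, P(+s | +c) = 0.1 and P(+r | +c) = 0.8. The third fixed draw (0.5) is also assumed; the text only says that it leads to R = +r.

```python
import random

# Toy network from the walk-through: C -> S and C -> R.
PARENTS = {"C": (), "S": ("C",), "R": ("C",)}

# Maps each variable to P(variable is "+" | parent values).
# The rows conditioned on -c are NOT given in the text: they are assumptions.
P_POSITIVE = {
    "C": {(): 0.5},
    "S": {("+c",): 0.1, ("-c",): 0.5},  # -c row: assumed placeholder
    "R": {("+c",): 0.8, ("-c",): 0.2},  # -c row: assumed placeholder
}

# One valid topological order: every parent appears before its children.
TOPOLOGICAL_ORDER = ["C", "S", "R"]


def forward_sample(draw=random.random):
    """Draw one joint sample, visiting variables in topological order."""
    sample = {}
    for var in TOPOLOGICAL_ORDER:
        # Parents are already sampled, so the CPT row can be looked up.
        parent_values = tuple(sample[p] for p in PARENTS[var])
        cutoff = P_POSITIVE[var][parent_values]
        # A uniform draw below the cutoff selects the "+" value.
        sign = "+" if draw() < cutoff else "-"
        sample[var] = sign + var.lower()
    return sample


# Replay the walk-through: 0.3 and 0.7 come from the text, 0.5 is assumed.
fixed_draws = iter([0.3, 0.7, 0.5])
walkthrough = forward_sample(draw=lambda: next(fixed_draws))
print(walkthrough)

# Repeat with real random numbers to estimate a marginal from 1000 samples.
random.seed(0)
samples = [forward_sample() for _ in range(1000)]
estimate = sum(s["R"] == "+r" for s in samples) / len(samples)
print(f"Estimated P(+r): {estimate:.3f}")
```

Passing the random source as the draw argument makes the walk-through reproducible with the fixed numbers from the text. Each sample visits every node once, so the cost of one sample grows linearly with the number of nodes, assuming a constant-time table lookup.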