- May 15, 2020
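Task: The aim of this project is to implement a number of information retrieval methods, evaluate them and compare them in the context of a real use case. This involves setting up the evaluation infrastructure (collection and index, topics, qrels), implementing common information retrieval baselines, implementing rank fusion methods, and evaluating, comparing and analysing the baseline and rank fusion methods.

Analysis: In part 1, for all methods, train on the training sets for the 2017 and 2018 topics (that is, use this data to tune any parameter of the retrieval models) and test on the testing sets for the 2017 and 2018 topics, using the parameter values selected on the training set. Report the results of every method on the training and on the testing set, separately, in one table. Perform a topic-by-topic gains/losses analysis of the 2017 and 2018 test results, taking BM25 as the baseline and each of TF-IDF, Borda, CombSUM and CombMNZ as the comparison.

Topics covered: information retrieval, file processing, data analysis.

INFS7410 Project – Part 1 – v3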
Note: these instructions were modified on 28/08/2019.

Preamble

The due date for this assignment is 19 September 2019 17:00 Eastern Australia Standard Time, together with part 2. This supersedes the earlier due dates of 29 August 2019 17:00 and 5 September 2019 17:00 (the latter an extension from 29/08).

This project is worth 5% of the overall mark for INFS7410. A detailed marking sheet for this assignment is provided at the end of this document.

We recommend that you make an early start on this assignment and proceed in steps. There are a number of activities you may already tackle, including setting up the pipeline, manipulating the queries, implementing some retrieval functions and performing evaluation and analysis. There are some activities you do not yet know how to perform, in particular the implementation of the rank fusion algorithms: this will be the topic of the week 5 lecture and tutorials.

Aim

Project aim: The aim of this project is to implement a number of information retrieval methods, evaluate them and compare them in the context of a real use-case.

Project Part 1 aim

The aim of part 1 is to:
• set up the evaluation infrastructure, including collection and index, topics, qrels
• implement common information retrieval baselines
• implement ranking fusion methods
• evaluate, compare and analyse baseline and ranking fusion methods

The Information Retrieval Task: Ranking of studies for Systematic Reviews

In this project we will consider the problem of ranking research studies identified as part of a systematic review. Systematic reviews are a widely used method to provide an overview of the current scientific consensus, by bringing together multiple studies in a reliable, transparent way.

We will use the CLEF 2017 and 2018 eHealth TAR (task 2) collections. In CLEF TAR 2017, the task we consider is referred to as subtask 1 (and is the only task); in CLEF TAR 2018, the task we consider is referred to as subtask 2.
We provide the CLEF 2017 and 2018 TAR task overview papers in the assignment folder in Blackboard for your reference. These contain details about the topics, the collection, the task, etc. These details are not necessary to complete the assignment, but you may nevertheless want to know more about this task, its importance, approaches that have been tried, and so on.

The task consists of ranking the set of provided documents, taking as the starting point the results of the Boolean search created by the researchers undertaking a systematic review (the documents are given as PMIDs, i.e. Pubmed IDs, in the files provided; for each PMID there is an associated title and abstract). The goal is to produce an ordering of the documents such that all the relevant documents are retrieved above the irrelevant ones. This is to be achieved through automatic methods that rank all abstracts, with the goal of retrieving relevant documents as early in the ranking as possible.

There are two datasets to consider in this project: the CLEF 2017 TAR dataset and the CLEF 2018 TAR dataset. Each dataset consists of material for training and material for testing the developed information retrieval methods.

What we provide you with

We provide:
• for each dataset, a list of topics to be used for training. Each topic is organised into a file. Each topic contains a title and a Boolean query.
• for each dataset, a list of topics to be used for testing. Each topic is organised into a file. Each topic contains a title and a Boolean query.
• each topic file (both those for training and those for testing) includes a list of retrieved documents in the form of their PMIDs: these are the documents that you have to rank. Take note: you do not need to perform the retrieval from scratch (i.e. execute the query against the whole index); instead you need to rank (order) the provided documents.
• for each dataset, and for each train and test partition, a qrels file, containing relevance assessments for the documents to be ranked.
This is to be used for evaluation.
• for each dataset, and for test partitions, a set of runs from retrieval systems that participated in CLEF 2017/2018, to be considered for fusion.
• a Terrier index of the entire Pubmed collection. This index has been produced using the Terrier stopword list and Porter stemmer.
• a Java Maven project that contains the Terrier dependencies and skeleton code to give you a start. NOTE: Tip #1 provides you with restructured skeleton code to make the processing of queries more efficient.
• a template for your project report.

What you need to produce

You need to produce:
• correct implementations of the methods required by this project specification
• correct evaluation, analysis and comparison of the evaluated methods, written up into a report following the provided template
• a project report that, following the provided template, details: an explanation of the retrieval methods used, an explanation of the evaluation settings followed, the evaluation of results (as described above), inclusive of analysis, and a discussion of the findings.

Required methods to implement

In part 1 of the project you are required to implement the following retrieval methods:
1. TF-IDF: you can create your own implementation using the Terrier API to extract index statistics, or use the implementation available through the Terrier API
2. BM25: you can create your own implementation using the Terrier API to extract index statistics, or use the implementation available through the Terrier API
3. The ranking fusion method Borda; you need to create your own implementation of this
4. The ranking fusion method CombSUM; you need to create your own implementation of this
5. The ranking fusion method CombMNZ; you need to create your own implementation of this

We strongly recommend you use the provided Maven project to implement these methods. You should have already attempted many of the implementations above as part of the tutorial exercises.

In the report, detail how the methods were implemented, i.e.
(i) which formula you implemented, (ii) whether you did your own implementation or leveraged Terrier’s ones (for TF-IDF and BM25).

For ranking fusion methods, consider fusing the runs from previous CLEF 2017/2018 participants that we provide, together with the TF-IDF and BM25 runs you will produce.

What queries to use

We ask you to consider two types of queries for each topic (the second type is optional and attracts bonus points):
1. for each topic, a query created from the topic title. For example, consider the example (partial) topic listed below: the query will be Rapid diagnostic tests for diagnosing uncomplicated P. falciparum malaria in endemic countries (you may consider performing text processing).
2. (OPTIONAL: 2% bonus if done) for each topic, a query created from the Boolean query associated with the topic. This query will be made up of the terms that appear in the Boolean query, but will ignore any operator (e.g., will ignore and, or, Exp, /, etc.) and field restrictions (e.g., .ti, .ab, .ti,ab, etc.). Note that some keywords in the Boolean query have been manually stemmed, e.g. diagnos* in the example topic below. As part of the query creation process, we ask you to use the Entrez API. For documentation on the Entrez esearch API, please refer to the Entrez Programming Utilities Help reference available at: https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch. Example usage can be found at the following URL: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=diagnos*. Note the terms in the TranslationStack field. These are the terms you would use to replace diagnos* and therefore concatenate to form the query (along with the other terms).

(The example topic file referred to above is reproduced further below.)

More on the Entrez API

The Entrez API provides access to the Pubmed search functionalities. In this part of the project we will not use this API for retrieval. However, it also provides some additional methods.
One in particular is useful for expanding terms in the Boolean query that have been “wildcarded” (manually stemmed): the TranslationStack. We have shown you above an example of how to obtain the output of the TranslationStack for a stem term. You will have to use this method for all terms in the Boolean query that contain the wildcard operator *. Practically, you will need to make a call to this API by constructing an appropriate URL, requesting that URL, and finally parsing the response to obtain the list of index terms that substitute for the wildcarded term from the Boolean query in your text query. Note that it is likely that one wildcarded term will give rise to many terms you will add to your query.

Example topic file (partial):

Title: Rapid diagnostic tests for diagnosing uncomplicated P. falciparum malaria in endemic countries
Query:
1. Exp Malaria/
2. Exp Plasmodium/
3. Malaria.ti,ab
4. 1 or 2 or 3
5. Exp Reagent kits, diagnostic/
6. rapid diagnos* test*.ti,ab
7. RDT.ti,ab
8. Dipstick*.ti,ab

Tips on making query processing efficient

A number of tips have been provided in Blackboard to make the execution of queries more efficient. Please consider these tips to reduce the execution time of the experiments.

Required evaluation to perform

In part 1 of the project you are required to perform the following evaluation:
1. For all methods, train on the training set for the 2017 topics (train here means you use this data to tune any parameter of a retrieval model, e.g. the BM25 parameters, the runs to be considered for the rank fusion methods, etc.) and test on the testing set for the 2017 topics (using the parameter values you selected from the training set). Report the results of every method on the training and on the testing set, separately, into one table. Perform statistical significance analysis across the results of the methods.
2. Comment on the results reported in the previous table by comparing the methods on the 2017 dataset.
3. For all methods, train on the training set for the 2018 topics (train here means you use this data to tune any parameter of a retrieval model, e.g. the BM25 parameters, the runs to be considered for the rank fusion methods, etc.) and test on the testing set for the 2018 topics (using the parameter values you selected from the training set). Report the results of every method on the training and on the testing set, separately, into one table. Perform statistical significance analysis across the results of the methods.
4. Comment on the results reported in the previous table by comparing the methods on the 2018 dataset.
5. Perform a topic-by-topic gains/losses analysis for both 2017 and 2018 results on the testing datasets, by considering BM25 as the baseline, and as comparison each of TF-IDF, Borda, CombSUM and CombMNZ.
6. Comment on trends and differences observed when comparing the findings from 2017 and 2018 results. Is there a method that consistently outperforms the others?
7. Provide insights into when ranking fusion works, and when it does not, e.g. with respect to runs to be considered in the fusion process, queries, etc.

In terms of evaluation measures, evaluate the retrieval methods with respect to mean average precision (MAP) using trec_eval. Remember to set the cut-off value (-M, i.e. the maximum number of documents per topic to use in evaluation) to the number of documents to be re-ranked for each of the queries. Using trec_eval, also compute R-precision (Rprec), which is the precision after R documents have been retrieved (by default, R is the total number of relevant docs for the topic).

For all statistical significance analysis, use a paired t-test; distinguish between p<0.05 and p<0.01.

Perform the above analysis for: 1. queries created from topic files using the topic title; 2. (OPTIONAL) queries created from the topic files using the Boolean queries.
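
As an illustration of the gains/losses analysis and the paired t-test requested above, the following is a minimal Python sketch rather than a prescribed implementation. It assumes you already have per-topic average precision values for each method (for example, as printed by trec_eval when run with its per-query option, -q). The function names, the topic identifiers and the numbers are all made up for illustration. The sketch computes only the t statistic and the degrees of freedom; the p-value still has to be obtained from the t distribution, for instance with a statistics package such as SciPy (scipy.stats.ttest_rel).

```python
import math
from statistics import mean, stdev


def gains_losses(baseline, method):
    """Per-topic difference (method - baseline) over the topics both runs share."""
    topics = sorted(set(baseline) & set(method))
    return {t: method[t] - baseline[t] for t in topics}


def paired_t_statistic(diffs):
    """Return (t, degrees_of_freedom) for a paired t-test on per-topic differences.

    The p-value is not computed here; look it up in the t distribution
    with the returned degrees of freedom.
    """
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


# Hypothetical per-topic average precision values, for illustration only.
bm25 = {"T1": 0.20, "T2": 0.35, "T3": 0.10, "T4": 0.50}
fused = {"T1": 0.30, "T2": 0.30, "T3": 0.25, "T4": 0.55}

deltas = gains_losses(bm25, fused)          # positive = gain over BM25
t_stat, dof = paired_t_statistic(list(deltas.values()))
wins = sum(d > 0 for d in deltas.values())
losses = sum(d < 0 for d in deltas.values())

for topic, delta in deltas.items():
    print(f"{topic}\t{delta:+.3f}")
print(f"wins={wins} losses={losses} t={t_stat:.3f} dof={dof}")
```

The per-topic differences are what a gain-loss plot would show as one bar per topic, with BM25 as the zero line.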
Finish your analysis by comparing the effectiveness difference between the methods using topic titles and those using queries extracted from the Boolean queries (OPTIONAL: to do only if you consider Boolean queries and want to obtain the bonus points).

How to submit

You will have to submit 3 files:
1. the report, formatted according to the provided template, saved as a PDF or MS Word document
2. a zip file containing a folder called runs-part1, which itself contains the runs (result files) you have created for the implemented methods.
3. a zip file containing a folder called code-part1, which itself contains all the code to re-run your experiments. You do not need to include in this zip file the runs we have given to you. You may need to include additional files, e.g. if you manually process the topic files into an intermediate format (rather than automatically process them from the files we provide you), so that we can re-run your experiments to confirm your results and implementation.

All items need to be submitted via the relevant Turnitin link in the INFS7410 Blackboard site by 19 September 2019 17:00 Eastern Australia Standard Time, together with part 2 (revised from 29 August 2019 17:00), unless you have been given an extension (according to UQ policy) before the due date of the assignment.

INFS 7410 Project Part 1 – Marking Sheet – v2

Columns: Criterion | % | 7 (100%) | 4 (50%) | FAIL 1 (0%)

IMPLEMENTATION (%: 2)
The ability to:
• Understand, implement and execute common IR baselines
• Understand, implement and execute rank fusion methods
• Perform text processing

7 (100%):
• Correctly implements the specified baselines and the rank fusion methods
• Implemented methods to deal with title queries
• (OPTIONAL) Implemented methods deal with Boolean queries, and wildcards are appropriately handled via expansion to possible forms using the provided API (2% bonus)

4 (50%):
• Correctly implements the specified baselines and the rank fusion methods

FAIL 1 (0%):
• No implementation
• Implements only baselines, but not the rank fusion methods

EVALUATION (%: 2)
The ability to:
• Empirically evaluate and compare IR methods
• Analyse the results of empirical IR evaluation
• Analyse the statistical significance difference between IR methods’ effectiveness

7 (100%):
• Correct empirical evaluation has been performed
• Uses all required evaluation measures
• Correct handling of the tuning regime (train/test)
• Reports all results for the provided query sets into appropriate tables
• Provides graphical analysis of results on a query-by-query basis using appropriate gain-loss plots
• Provides correct statistical significance analysis within the result table, and correctly describes the statistical analysis performed
• Provides a written understanding and discussion of the results with respect to the methods
• Provides examples of where fusion works, and where it does not, and why, e.g., discussion with respect to queries, runs.

4 (50%):
• Correct empirical evaluation has been performed
• Uses all required evaluation measures
• Correct handling of the tuning regime (train/test)
• Reports all results for the provided query sets into appropriate tables
• Provides graphical analysis of results on a query-by-query basis using appropriate gain-loss plots
• Does not perform statistical significance analysis, or errors are present in the analysis

FAIL 1 (0%):
• No or only partial empirical evaluation has been conducted, e.g. only on a topic set, or a subset of topics
• Only reports a partial set of evaluation measures
• Fails to correctly handle training and testing partitions, e.g. trains on test, reports only overall results

WRITE UP (%: 1; binary score: 0/1)
The ability to:
• use fluent language with correct grammar, spelling and punctuation
• use appropriate paragraph, sentence structure
• use appropriate style and tone of writing
• produce a professionally presented document, according to the provided template

7 (100%):
• Structure of the document is appropriate and meets expectations
• Clarity promoted by consistent use of standard grammar, spelling and punctuation
• Sentences are coherent
• Paragraph structure effectively developed
• Fluent, professional style and tone of writing
• No proofreading errors
• Polished professional appearance

FAIL 1 (0%):
• Written expression and presentation are incoherent, with little or no structure, well below required standard
• Structure of the document is not appropriate and does not meet expectations
• Meaning unclear as grammar and/or spelling contain frequent errors
• Disorganised or incoherent writing
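
For reference, the three rank fusion methods named in this specification (Borda, CombSUM and CombMNZ) can be sketched in Python as below. This is a hedged illustration under stated assumptions, not the required implementation: each run is represented as a hypothetical dictionary from document ID to score for a single topic; scores are min-max normalised per run before summing (one common choice, assumed here); the Borda variant gives a document (n - position) / n points in a run of n documents and no points in runs that do not contain it. Other formulations exist, so check the variant presented in the lectures before relying on this one.

```python
from collections import defaultdict


def min_max(run):
    """Rescale one run's scores to [0, 1]; a run with constant scores maps to 1.0."""
    lo, hi = min(run.values()), max(run.values())
    if hi == lo:
        return {doc: 1.0 for doc in run}
    return {doc: (score - lo) / (hi - lo) for doc, score in run.items()}


def comb_sum(runs):
    """CombSUM: sum of the normalised scores a document receives across runs."""
    fused = defaultdict(float)
    for run in runs:
        for doc, score in min_max(run).items():
            fused[doc] += score
    return dict(fused)


def comb_mnz(runs):
    """CombMNZ: CombSUM score multiplied by the number of runs containing the document."""
    sums = comb_sum(runs)
    hits = defaultdict(int)
    for run in runs:
        for doc in run:
            hits[doc] += 1
    return {doc: sums[doc] * hits[doc] for doc in sums}


def borda(runs):
    """Borda count (assumed variant): (n - position) / n points per run, summed."""
    fused = defaultdict(float)
    for run in runs:
        ranked = sorted(run, key=run.get, reverse=True)
        n = len(ranked)
        for position, doc in enumerate(ranked):
            fused[doc] += (n - position) / n
    return dict(fused)


def ranking(fused):
    """Documents by decreasing fused score; ties broken by document ID."""
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


# Two made-up runs for one topic (document IDs and scores are illustrative).
run_a = {"d1": 3.0, "d2": 2.0, "d3": 1.0}
run_b = {"d2": 0.9, "d3": 0.5, "d4": 0.1}
runs = [run_a, run_b]

for name, method in [("CombSUM", comb_sum), ("CombMNZ", comb_mnz), ("Borda", borda)]:
    print(name, ranking(method(runs)))
```

Because every run for a topic re-ranks the same provided PMIDs, documents missing from a run should be rare in this task; the handling of missing documents above matters mainly if some participant runs are truncated.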