FIT5166 Information Retrieval Systems
Practical Assignment - Semester 2 2018
(18% of Total Marks)
Your task is to write an information retrieval engine, which will be able to index a collection of documents and, in response to a keyword query, retrieve matching documents. The information retrieval model your program will use is the vector-space model.
You must follow all of the instructions below:
SEPARATE SUBMISSIONS ARE REQUIRED FOR THE CREDIT LEVEL ASSIGNMENT
AND THE HIGH-DISTINCTION LEVEL ASSIGNMENT (IF ATTEMPTING THE HIGH-DISTINCTION LEVEL).
I. INSTRUCTIONS FOR THE CREDIT LEVEL ASSIGNMENT
(MAXIMUM MARK 69%)
1. Your program can be written in Java, Python or any other programming language of your choice. Note that since programming skills are a prerequisite of this unit, your tutor will not help you with the coding part of the assignment.
2. All your programming source files must be submitted as specified in Section III, and must all follow the standard convention of having a file extension that depends on the programming language you use (e.g. .java, .py). Do not use package statements in your code.
3. The name of your program must be MySearchEngine (i.e. at a minimum your source code directory must contain a file called MySearchEngine.java which contains the main() method). You may split your code into multiple source files, as long as they compile to produce the final MySearchEngine.class file by issuing the command in instruction #4.
4. It must be possible to compile your program on the server by issuing the relevant runtime command from within the source code directory e.g.
javac *.java
5. Your program should be able to run from the command line and send its output to standard output (except for the index, which must be stored as a file).
6. Your program must be able to be invoked from the command line with the following usage/parameters:
java MySearchEngine [command]
where [command] is one of:
a. index collection_dir index_dir stopwords.txt
Index all the documents stored in collection_dir. The index so constructed should be stored in index_dir. The index file should be named index.txt. See instructions #7 and #8 for the prescribed tokenization/stemming rules and index format. Stopwords are contained in the file stopwords.txt, a plain text file with one stopword per line. Do not consider the stopwords in the file stopwords.txt for stemming into index terms.
for example:
java MySearchEngine index ~/mydocs ~/myindex ~/stopwords.txt
b. search index_dir num_docs keyword_list
Return a ranked list of the top num_docs documents that match the query specified in keyword_list. The most relevant document must appear first in the list. Note that keywords in the query are separated by white space on the command line. Refer to instruction #9 for a more detailed description of what should be returned by this command.
for example:
java MySearchEngine search ~/myindex 10 monash university
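The two commands in instruction #6 could be dispatched along the following lines. This is only a sketch in Python (one of the languages the specification permits); the function name parse_command and the dictionary keys are illustrative assumptions, not something the specification prescribes.

```python
def parse_command(argv):
    """Parse the arguments that follow the program name.

    Returns a (command, options) tuple. Raises ValueError on bad usage.
    The function name and option keys are hypothetical.
    """
    if not argv:
        raise ValueError("missing command: expected 'index' or 'search'")
    command, rest = argv[0], argv[1:]

    if command == "index":
        # index collection_dir index_dir stopwords.txt
        if len(rest) != 3:
            raise ValueError("usage: index collection_dir index_dir stopwords.txt")
        collection_dir, index_dir, stopwords_file = rest
        return command, {
            "collection_dir": collection_dir,
            "index_dir": index_dir,
            "stopwords_file": stopwords_file,
        }

    if command == "search":
        # search index_dir num_docs keyword_list (one or more keywords)
        if len(rest) < 3:
            raise ValueError("usage: search index_dir num_docs keyword_list")
        return command, {
            "index_dir": rest[0],
            "num_docs": int(rest[1]),  # int() raises ValueError if not a number
            "keywords": rest[2:],      # the shell has already split on whitespace
        }

    raise ValueError("unknown command: " + command)
```

In a real program this would typically be called with sys.argv[1:], with any ValueError reported as a usage message.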
7. When indexing documents, your program must first perform appropriate tokenization and stemming on the source document content.
You can assume the source documents will be English language and in plaintext.
Tokenization of the documents must follow these rules:
a. Any words hyphenated across a line break must be joined into a single token (with the final token not containing the hyphen).
b. Email addresses, web URLs and IP addresses must be preserved as a single token.
c. Text within single quotation marks or inverted commas (e.g. ‘Word Press’) should be placed in a single token.
d. Two or more words separated by whitespace, all of which begin with a capital letter, must be preserved as a single token (i.e. include the whitespace in the token).
e. Acronyms should be preserved as a single token with or without full stops (periods) (e.g. C.A.T can result in CAT or C.A.T).
f. For all other text, split the text into tokens using as delimiters either whitespace or elements of the following subset of
(note this set includes the braces themselves).
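As an illustration of rules (a) and (b) only, the following Python sketch joins words hyphenated across a line break and keeps URLs, email addresses and IPv4 addresses intact. The regular expressions are simplified assumptions rather than complete definitions of those formats, the name tokenize is hypothetical, and rules (c) to (f) are not implemented: all remaining text is split on whitespace alone.

```python
import re

# Rule (a): a hyphen immediately before a line break joins two word halves.
HYPHEN_BREAK = re.compile(r"(\w)-\s*\n\s*(\w)")

# Rule (b): simplified, illustrative patterns (assumptions, not full grammars).
PROTECTED = re.compile(
    r"https?://\S+"                     # web URLs
    r"|[\w.+-]+@[\w-]+(?:\.[\w-]+)+"    # email addresses
    r"|\b\d{1,3}(?:\.\d{1,3}){3}\b"     # IPv4 addresses
)

def tokenize(text):
    """Return a list of tokens, applying rules (a) and (b) only."""
    # Join hyphenated line breaks first, dropping the hyphen.
    text = HYPHEN_BREAK.sub(r"\1\2", text)

    tokens = []
    pos = 0
    for match in PROTECTED.finditer(text):
        # Text before the protected span is split on whitespace.
        tokens.extend(text[pos:match.start()].split())
        # The protected span is kept whole as one token.
        tokens.append(match.group())
        pos = match.end()
    tokens.extend(text[pos:].split())
    return tokens
```

One known limitation of this sketch is that punctuation directly following a URL (such as a sentence-ending full stop) is captured as part of the URL token.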
After tokenization, tokens must be stemmed into index terms using the Porter stemmer. You may use code from the following website to implement the Porter stemmer (remember to reference the website in the comments in your code):
http://tartarus.org/martin/PorterStemmer/
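The stopword and stemming step might be wired together as sketched below. The names load_stopwords and to_index_terms are hypothetical, the stemmer is passed in as a function so that a Porter stemmer implementation can be plugged in, and comparing stopwords case-insensitively is an assumption that the specification does not state.

```python
def load_stopwords(lines):
    """Build a stopword set from an iterable of lines (one stopword per line).

    Accepts an open file object or a list of strings. Blank lines are skipped.
    """
    return {line.strip().lower() for line in lines if line.strip()}

def to_index_terms(tokens, stopwords, stem):
    """Drop stopwords, then stem the remaining tokens into index terms.

    'stem' is any function mapping a token to its stem, for example a
    Porter stemmer. Case-insensitive stopword matching is an assumption.
    """
    return [stem(token) for token in tokens if token.lower() not in stopwords]
```

For example, the stopword set could be built with load_stopwords(open(stopwords_path)) and the resulting terms fed to the index builder.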
8. Each record in your index must have the following format (with fields separated by commas, lines separated by the end of line character and any non-integer quantities rounded to 3 decimal places). Inverse document frequencies should be calculated using natural log. Also, the denominator of the classical idf formula should be incremented by one to allow for query terms that do not appear in the index. Note, below, {} indicates a repeating group, but the {} characters will not appear in your index.
term,{doc-id,tf},idf
For example, suppose in a corpus of 10 documents, that the stemmed term cat appears twice in document d4 and once in both documents d6 and d7. The index entry will be:
cat,d4,2,d6,1,d7,1,0.916
Here the idf is ln(10 / (3 + 1)) = 0.916, since the term appears in 3 of the 10 documents.
The document-id (doc-id) will be the simple filename of the document (e.g. the text that follows the last directory separator character in the absolute pathname of the file).
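The record format and idf calculation in instruction #8 can be sketched in Python as follows. The function names are hypothetical, and postings are assumed to be supplied as a list of (doc-id, tf) pairs in the order they should be written.

```python
import math
import os

def idf(num_docs, doc_freq):
    """Natural-log idf with the denominator incremented by one."""
    return math.log(num_docs / (1 + doc_freq))

def fmt(value):
    """Integers are written as-is; other numbers get 3 decimal places."""
    if isinstance(value, int):
        return str(value)
    return f"{value:.3f}"

def index_record(term, postings, num_docs):
    """Format one index line: term,{doc-id,tf},idf

    postings is a list of (doc_id, tf) pairs (an assumed representation).
    """
    fields = [term]
    for doc_id, tf in postings:
        fields.append(doc_id)
        fields.append(fmt(tf))
    fields.append(fmt(idf(num_docs, len(postings))))
    return ",".join(fields)

def doc_id_from_path(path):
    """The doc-id is the simple filename after the last directory separator."""
    return os.path.basename(path)
```

Calling index_record("cat", [("d4", 2), ("d6", 1), ("d7", 1)], 10) reproduces the example record cat,d4,2,d6,1,d7,1,0.916. Note that with this formula a term that appears in every document gets a slightly negative idf, because the denominator exceeds the number of documents.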
9. When used with the search parameter, your program will return a ranked list of documents (i.e. in decreasing order of cosine similarity) matching the query (as represented by the user-supplied query terms). There will be one line in your output for each returned document. The format of each line in your output must be (cosine-score rounded to 3 decimal places):
doc-id,cosine-score
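A minimal Python sketch of the ranking step follows. It assumes tf × idf weights for both documents and the query, which is one common choice for the vector-space model but is not fixed by the instructions above. The function name rank and the in-memory index layout (term mapped to a dictionary of doc-id to tf) are also assumptions, and query terms are expected to have been tokenized and stemmed in the same way as the documents.

```python
import math
from collections import Counter, defaultdict

def rank(index, idf, query_terms, num_docs):
    """Return up to num_docs lines of 'doc-id,cosine-score', best first.

    index: dict mapping term -> {doc_id: tf}   (assumed layout)
    idf:   dict mapping term -> idf value
    """
    # Squared length of each document vector over all of its terms.
    doc_norm_sq = defaultdict(float)
    for term, postings in index.items():
        for doc_id, tf in postings.items():
            doc_norm_sq[doc_id] += (tf * idf[term]) ** 2

    # Query vector: tf * idf for query terms present in the index.
    query_tf = Counter(t for t in query_terms if t in index)
    query_w = {t: tf * idf[t] for t, tf in query_tf.items()}
    query_norm = math.sqrt(sum(w * w for w in query_w.values()))

    # Dot products, accumulated only for documents sharing a query term.
    dots = defaultdict(float)
    for term, qw in query_w.items():
        for doc_id, tf in index[term].items():
            dots[doc_id] += qw * tf * idf[term]

    scores = []
    for doc_id, dot in dots.items():
        denom = query_norm * math.sqrt(doc_norm_sq[doc_id])
        if denom > 0:
            scores.append((doc_id, dot / denom))

    # Decreasing cosine score; ties broken by doc-id for stable output.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return [f"{doc_id},{score:.3f}" for doc_id, score in scores[:num_docs]]
```

Documents that share no term with the query never enter the dots dictionary, so they are not returned at all rather than being listed with a zero score.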
10. To submit the credit level assignment, follow the appropriate instructions in Section III.
II. INSTRUCTIONS FOR THE HIGH-DISTINCTION LEVEL ASSIGNMENT
(MAXIMUM MARK 100%)
1. Students may wish to gain further marks by extending the capability of their engine. To do so, you must first implement all of the instructions for the credit level. Remember to keep the high-distinction submission separate from the submission for the CREDIT level (refer to Section III).
2. Seek your tutor's approval of how you wish to extend your program for extensions worthy of the HD grade (see footnote 1) by sending an email describing what additional capabilities you wish to add, with the following subject line:
Student-id-number FIT5166 HD extension
If your proposal is considered to be worthy of the HD grade should it be successfully implemented, your tutor will send you email approval. Only then may you implement the changes in your code.
3. Document the nature of your extensions and how they might improve the indexing and/or retrieval process, and provide instructions as to how to use your program.
4. To submit the high-distinction level assignment, follow the appropriate instructions in Section III.
III. ASSIGNMENT SUBMISSION INSTRUCTIONS AND DUE DATES
For assignment specifications, refer to documents provided separately.
Please follow these instructions exactly. Any amendments/clarifications will be posted on the unit website.
Plagiarism warning:
All assignment submissions will be put through plagiarism detection software which automatically checks their similarity to other submissions in all years, and to websites. Any plagiarism found will be dealt with under the University's procedures and may result in severe penalties.
Make sure you properly reference any code and resources that you submit but that were produced by other people.
1 Generally, extensions will require modifications of both the indexing and searching components to achieve the HD grade.
The assignments are divided into 2 stages:
1. Assignment Stage 1 due on Moodle on Tuesday 18 September 2018 at 9am (Week 9)
Students are to upload a write-up of their plan on how to complete the assignment. The
report should include the list of possible functionalities and test cases for each
functionality.
This part of the assignment is not marked but is a hurdle requirement (must be
submitted for Assignment Stage 2 to be marked).
2. Assignment Stage 2 due on Moodle on Tuesday 9 October 2018 at 9am (Week 11)
Students are to submit the CREDIT level and, if applicable, the HIGH DISTINCTION level
assignment with the following details.
Each submission is to include an Experiment Report of the tests that are conducted, the
results of each test, and the analysis and conclusions drawn from the experiments.
Assignment Credit Level
1. Ensure that before the due date/time, all your Java source code files and test data
files are zipped into a file called FirstName-Surname-Assignment-Credit.zip.
2. Submit the Experiment Report (if not submitting HD level) and the zip file online
on Moodle.
You will be required to attend an interview regarding your assignment submission.
Assignment High Distinction Level
1. Ensure that before the due date/time, all your Java source code files and test data
files are zipped into a file called FirstName-Surname-Assignment-HD.zip.
2. Submit the Experiment Report and the zip file online on Moodle.
You will be required to attend an interview regarding your assignment submission for your
assignment to be marked.
