Project 3 – Search Engine
In this assignment you will build a simple search engine based on the concepts you have
learned in the lectures so far. The project has two milestones, each with its own deliverables
and deadline. When planning for milestone #1, keep the requirements of milestone #2 in mind
so that you stay on track to complete the project successfully. Once you have completed
milestone #2, you will be required to demonstrate your working search engine to the TAs in
face-to-face sessions (F2Fs).
You can use code that you or any classmate wrote for the previous projects. You cannot use code
written for this project by non-group-member classmates. Use code found over the Internet at your
own peril -- it may not do exactly what the assignment requests. If you do end up using code you
find on the Internet, you must disclose the origin of the code. As stated in the course policy
document, concealing the origin of a piece of code is plagiarism.
Use the Discussion Board on Canvas to post your questions about this assignment so that the
answers can help you and other students as well.
Goal: Implement a complete search engine.
Milestones Overview
Milestone | Deadline | Goal | Deliverables | Score (out of 50)
#1 | 2/28 | Produce an initial index for the corpus and a basic retrieval component | Short report (no demo) | 20% (12)
#2 | 3/11 | Complete search system | Code or artifacts + demonstration | 80% (48)
Information Retrieval - COMPSCI 121 / IN4MATX 141
PROJECT: SEARCH ENGINE
Corpus: all ICS web pages
We will provide you with the crawled data as a zip file (webpages_clean.zip). This contains the
downloaded content of the ICS web pages that were crawled by us. You are expected to build your
search engine index off of this data.
Main challenges: full HTML parsing, file/DB handling, and handling user input (through a command
line, a desktop GUI application, or a web interface)
COMPONENT 1 - INDEX:
Create an inverted index for the entire corpus given to you. You can either use a database to store
your index (MongoDB, Redis, and memcached are some examples) or you can store the index in a
file. You are free to choose an approach here.
The index should store more than just a simple list of documents where the token occurs. At the
very least, your index should store the TF-IDF weight of every term/document pair.
Sample Index:
Note: This is a simplistic example provided for your understanding. Please do not consider
this as the expected index format. A good inverted index will store more information than
this.
Index Structure: token – docId1, tf-idf1 ; docId2, tf-idf2
Example: informatics – doc_1, 5 ; doc_2, 10 ; doc_3, 7
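The structure above can be sketched in a few lines of Python. This is only a minimal illustration, not the expected index format: the tokenizer, the function name build_index, the in-memory dictionary, and the particular weighting (1 + log10 tf) * log10(N / df) are all assumptions chosen for the example, and the input is assumed to be plain text with the HTML already stripped.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each token to a list of (doc_id, tf-idf) postings.

    docs: dict of doc_id -> plain text (assumed to be HTML-free already).
    """
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

    # Document frequency: in how many documents each token appears.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    n_docs = len(docs)
    index = defaultdict(list)
    for doc_id, counts in term_counts.items():
        for token, tf in counts.items():
            # One common weighting variant (an assumption, not a requirement).
            weight = (1 + math.log10(tf)) * math.log10(n_docs / doc_freq[token])
            index[token].append((doc_id, round(weight, 4)))
    return dict(index)

# Toy corpus with made-up contents.
docs = {
    "doc_1": "Informatics at Irvine",
    "doc_2": "Informatics informatics research",
    "doc_3": "Computer science at Irvine",
}
index = build_index(docs)
```

A token that occurs in every document gets a weight of 0 under this formula, which is one reason you may want to experiment with other variants.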
You are encouraged to come up with heuristics that make sense and will help in retrieving relevant
search results. For example, words in bold and in headings (h1, h2, h3) could be treated as more
important than other words. These are useful metadata that could be added to your inverted
index data.
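One possible way to apply such a heuristic is to count each word with a weight that depends on the tag it appears in. The sketch below uses Python's standard html.parser module; the class name WeightedTextParser and the weights in TAG_WEIGHTS are hypothetical values picked for illustration, and real pages in the corpus may be malformed enough that a more forgiving parser is needed.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights; tune them against your own test queries.
TAG_WEIGHTS = {"title": 4, "h1": 3, "h2": 3, "h3": 2, "b": 2, "strong": 2}
SKIPPED_TAGS = ("script", "style")

class WeightedTextParser(HTMLParser):
    """Counts tokens, giving extra weight to text inside important tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []          # important tags currently open
        self.skip_depth = 0          # > 0 while inside <script> or <style>
        self.weighted_counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.open_tags:
            # Close the most recently opened occurrence of this tag.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]

    def handle_data(self, data):
        if self.skip_depth:
            return
        # Use the largest weight among the enclosing important tags.
        weight = max((TAG_WEIGHTS[t] for t in self.open_tags), default=1)
        for token in re.findall(r"[a-z0-9]+", data.lower()):
            self.weighted_counts[token] += weight

html = (
    "<html><head><title>Informatics</title><script>var x = 1;</script></head>"
    "<body><h1>Search</h1><p>search <b>engine</b> basics</p></body></html>"
)
parser = WeightedTextParser()
parser.feed(html)
weighted_counts = parser.weighted_counts
```

The weighted counts could replace raw term frequencies when computing TF-IDF, or be stored alongside them as separate metadata.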
Optional:
Extra credit will be given for ideas that improve the quality of the retrieval, so you may add more
metadata to your index if you think it will help. For this, instead of storing only a TF-IDF weight
for every page, you can store more information related to the page (e.g. the positions of the words
in the page). To store this information, you need to design your index in such a way that it can
store and retrieve all this metadata efficiently. Index lookup during search should not be
noticeably slow, so pay attention to the structure of your index.
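As an illustration of position metadata, the sketch below stores the word positions for each token and document, and uses them to check whether two query words appear next to each other. The function names build_positional_index and phrase_match, and the nested-dictionary layout, are assumptions made for this example rather than a required design.

```python
import re
from collections import defaultdict

def build_positional_index(docs):
    """Map token -> {doc_id: [positions of the token in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        for pos, token in enumerate(tokens):
            index[token].setdefault(doc_id, []).append(pos)
    return index

def phrase_match(index, first, second):
    """Return doc ids where `second` occurs immediately after `first`."""
    hits = []
    for doc_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(p + 1 in following for p in positions):
            hits.append(doc_id)
    return hits

# Toy corpus with made-up contents.
docs = {
    "doc_1": "artificial intelligence lab",
    "doc_2": "intelligence is not artificial",
}
positional_index = build_positional_index(docs)
matches = phrase_match(positional_index, "artificial", "intelligence")
```

Dictionary lookups keep each query term's postings reachable in constant time; for a corpus that does not fit in memory, the same layout could be kept in a database or a file with per-token offsets.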
COMPONENT 2 – SEARCH AND RETRIEVE:
Your program should prompt the user for a query. This doesn’t need to be a Web interface; it can be
a console prompt. At the time of the query, your program will look up your index, perform some
calculations (see ranking below) and output the ranked list of pages that are relevant to the
query.
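A minimal console version of this flow might look like the sketch below. It assumes an in-memory index shaped like the sample index (each token mapped to a list of docId and tf-idf pairs) and simply sums the weights of the query terms per document; the function names search and run_console are hypothetical.

```python
from collections import defaultdict

def search(index, query, top_k=10):
    """Return up to top_k doc ids, ordered by summed tf-idf of the query terms."""
    scores = defaultdict(float)
    for token in query.lower().split():
        for doc_id, weight in index.get(token, []):
            scores[doc_id] += weight
    # Highest score first; ties broken by doc id for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ranked[:top_k]]

def run_console(index):
    """Prompt for queries until the user enters an empty line."""
    while True:
        query = input("Search: ").strip()
        if not query:
            break
        for rank, doc_id in enumerate(search(index, query), start=1):
            print(f"{rank}. {doc_id}")

# Toy index: each token maps to (doc id, tf-idf) postings.
index = {
    "informatics": [("doc_1", 5), ("doc_2", 10), ("doc_3", 7)],
    "irvine": [("doc_1", 4), ("doc_3", 6)],
}
results = search(index, "informatics irvine")
```

Calling run_console(index) starts the interactive prompt. In a real system the doc ids would be mapped back to URLs before printing.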
Optional:
Extra credit will be given if your search interface has a GUI.
COMPONENT 3 - RANKING:
At the very least, your ranking formula should include tf-idf scoring, but you should feel free to add
additional components to this formula if you think they improve the retrieval.
Optional:
Extra credit will be given if your ranking formula includes parameters other than tf-idf.
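One way to combine tf-idf with an additional parameter is sketched below. It scores documents by length-normalized tf-idf (treating every query term with weight 1, so the query length is a constant that does not affect the ordering) and adds an optional per-document boost. The function name rank, the boosts dictionary, and boost_weight are hypothetical; the boost could come from any signal you choose, such as a query word appearing in the page title.

```python
import math
from collections import defaultdict

def rank(index, query_tokens, boosts=None, boost_weight=0.5):
    """Rank documents by normalized tf-idf plus an optional boost.

    index:  token -> {doc_id: tf-idf weight}
    boosts: doc_id -> extra score (hypothetical signal, e.g. a title match)
    """
    boosts = boosts or {}

    # Length of each document vector. Computed here for brevity; a real
    # system would precompute and store these at indexing time.
    norms = defaultdict(float)
    for postings in index.values():
        for doc_id, weight in postings.items():
            norms[doc_id] += weight * weight

    # Dot product between the query (all weights 1) and each document.
    dots = defaultdict(float)
    for token in query_tokens:
        for doc_id, weight in index.get(token, {}).items():
            dots[doc_id] += weight

    final = {}
    for doc_id, dot in dots.items():
        norm = math.sqrt(norms[doc_id])
        tfidf_part = dot / norm if norm else 0.0
        final[doc_id] = tfidf_part + boost_weight * boosts.get(doc_id, 0.0)
    return sorted(final.items(), key=lambda item: (-item[1], item[0]))

# Toy index with made-up weights.
index = {
    "computer": {"d1": 3.0, "d2": 4.0},
    "science": {"d1": 4.0},
    "art": {"d2": 3.0},
}
plain = rank(index, ["computer", "science"])
boosted = rank(index, ["computer", "science"], boosts={"d2": 1.0}, boost_weight=1.0)
```

With these toy numbers, the boost is large enough to move d2 above d1, which shows why any extra parameter needs to be weighted carefully against the tf-idf part.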
Milestone #1
Goal: Build an index and a basic retrieval component
By basic retrieval component, we mean that at this point you just need to be able to query your
index for links (the query can be as simple as a single word at this point).
These links do not need to be accurate or ranked. We will cover ranking in the next milestone.
At least the following queries should be used to test your retrieval:
1 – Informatics
2 – Mondego
3 – Irvine
4 – artificial intelligence
5 – computer science
Note: queries 4 and 5 are for milestone #2
Deliverables: Submit a report (pdf) in Canvas with the following content:
1. A table with assorted numbers pertaining to your index. It should have, at least, the number
of documents, the number of [unique] words, and the total size (in KB) of your index on
disk.
2. URLs retrieved for each of the queries above
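The numbers for the report table can be gathered with a short script such as the sketch below. It assumes the index is a dictionary from token to (doc id, weight) postings that has been written to a single file; the function name index_stats, the JSON serialization, and the file name index.json are assumptions for the example. Note that it only counts documents that contributed at least one token to the index.

```python
import json
import os
import tempfile

def index_stats(index, index_path):
    """Collect the figures requested for the milestone #1 report."""
    doc_ids = {doc_id for postings in index.values() for doc_id, _ in postings}
    return {
        "documents": len(doc_ids),
        "unique_tokens": len(index),
        "size_kb": os.path.getsize(index_path) / 1024,
    }

# Toy index with made-up postings.
index = {
    "informatics": [("doc_1", 5), ("doc_2", 10), ("doc_3", 7)],
    "irvine": [("doc_1", 4), ("doc_3", 6)],
}

# Write the index to a temporary file so its on-disk size can be measured.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(index, f)
    stats = index_stats(index, path)
```

If the index lives in a database rather than a file, the size would have to come from the database's own storage statistics instead.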
Evaluation criteria:
• Was the report submitted on time?
• Are the reported numbers plausible?
• Are the reported URLs plausible?
Milestone #2
Goal: Complete search engine
Deliverables:
• Submit a zip file containing all the artifacts/programs you wrote for your search engine
• A live demonstration of your search engine
Evaluation criteria:
- Does your program work as expected of search engines?
- How general are the heuristics that you employed to improve the retrieval?
- How complete is the UI? (e.g. links to the actual pages, snippets, etc.)
- Do you demonstrate in-depth knowledge of how your search engine works? Are you able to
answer detailed questions pertaining to any aspect of its implementation?