
DEGREES OF MSc, MSci, MEng, BEng, BSc, MA and MA (Social Sciences)

INFORMATION RETRIEVAL M

COMPSCI 5011

Friday 29 April 2022

1 (a)

The following documents have been processed by an IR system where stemming is not applied:

DocID    Text
Doc1     france is world champion 1998 france won
Doc2     croatia and france played each other in the semifinal
Doc3     croatia was in the semifinal 1998
Doc4     croatia won the other semifinal in russia 2018

(i) Assume that the following terms are stopwords: and, in, is, the, was.

Construct an inverted file for these documents, showing clearly the dictionary and posting list components. Your inverted file needs to store sufficient information for computing a simple tf*idf term weight, where w_ij = tf_ij * log2(N/df_i).    [5]

(ii) Compute the term weights of the two terms "champion" and "1998" in Doc1. Show your working.    [2]
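As an illustrative aid (not part of the paper), the inverted file and the weights asked for in parts (i)–(ii) can be sketched in Python, assuming the stopword list above and the weight w_ij = tf_ij * log2(N/df_i):

```python
import math

# The four documents; stopwords (and, in, is, the, was) are removed on tokenisation.
stop = {"and", "in", "is", "the", "was"}
docs = {
    "Doc1": "france is world champion 1998 france won",
    "Doc2": "croatia and france played each other in the semifinal",
    "Doc3": "croatia was in the semifinal 1998",
    "Doc4": "croatia won the other semifinal in russia 2018",
}
tokens = {d: [t for t in text.split() if t not in stop] for d, text in docs.items()}

# Inverted file: dictionary maps each term to its posting list {doc -> tf};
# df is the length of the posting list, which is all tf*idf needs.
index = {}
for d, toks in tokens.items():
    for t in toks:
        index.setdefault(t, {}).setdefault(d, 0)
        index[t][d] += 1

N = len(docs)

def weight(term, doc):
    """w_ij = tf_ij * log2(N / df_i)."""
    postings = index.get(term, {})
    if not postings:
        return 0.0
    return postings.get(doc, 0) * math.log2(N / len(postings))

print(weight("champion", "Doc1"))  # tf=1, df=1 -> 1 * log2(4/1) = 2.0
print(weight("1998", "Doc1"))      # tf=1, df=2 -> 1 * log2(4/2) = 1.0
```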

(iii) Assuming the use of a best match ranking algorithm, rank all documents using their relevance scores for the following query:

1998 croatia

Show your working. Note that log2(0.75) = -0.4150 and log2(1.3333) = 0.4150.    [3]
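A hedged sketch of part (iii), reusing the same corpus: here a best-match score is taken to be the sum of the tf*idf weights of the query terms occurring in the document (consistent with the formula in part (i)):

```python
import math

# Same corpus as part (i), stopwords removed.
stop = {"and", "in", "is", "the", "was"}
docs = {
    "Doc1": "france is world champion 1998 france won",
    "Doc2": "croatia and france played each other in the semifinal",
    "Doc3": "croatia was in the semifinal 1998",
    "Doc4": "croatia won the other semifinal in russia 2018",
}
tokens = {d: [t for t in txt.split() if t not in stop] for d, txt in docs.items()}
N = len(docs)

# Document frequencies.
df = {}
for toks in tokens.values():
    for t in set(toks):
        df[t] = df.get(t, 0) + 1

def score(query, doc):
    # Best-match score: sum of tf*idf weights of the query terms in the doc.
    return sum(tokens[doc].count(t) * math.log2(N / df[t])
               for t in query.split() if t in df)

ranking = sorted(docs, key=lambda d: score("1998 croatia", d), reverse=True)
print(ranking)  # Doc3 ranks first: 1 + log2(4/3) = 1.4150
```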

(b)

(i) In Web search, explain why the use of raw term frequency (TF) counts in scoring documents can hurt the effectiveness of the search engine.    [2]

(ii) Suggest a solution to alleviate the problem, and show through examples how it might work. Explain through examples how modern term weighting models in IR control the raw term frequency counts.    [3]

(c) Assume that you have decided to modify the approach you use to rank the documents of your collection. You have developed a new Web ranking approach that makes use of recent advances in neural networks. All other components of the system remain the same. Explain in detail the steps you need to undertake to determine whether your new Web ranking approach produces a better retrieval performance than the original ranking approach.    [5]

2.

(a) Consider a corpus of documents C written in English, where the frequency distribution of words approximately follows Zipf's law r * p(w_r|C) = 0.1, where r = 1, 2, ..., n is the rank of a word by decreasing order of frequency, w_r is the word at rank r, and p(w_r|C) is the probability of occurrence of word w_r in the corpus C.

Compute the probability of occurrence of the most frequent word in C. Compute the probability of occurrence of the 2nd most frequent word in C. Justify your answers.    [4]
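As a quick check, the stated law r * p(w_r|C) = 0.1 rearranges directly to p(w_r|C) = 0.1 / r, which the following sketch evaluates:

```python
# Zipf's law as stated: r * p(w_r | C) = 0.1  =>  p(w_r | C) = 0.1 / r
def zipf_p(r):
    """Probability of occurrence of the word at rank r."""
    return 0.1 / r

print(zipf_p(1))  # most frequent word: 0.1
print(zipf_p(2))  # 2nd most frequent word: 0.05
```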

(b)   Consider the query “michael jackson music” and the following term frequencies for the three documents D1, D2 and D3, where the search engine is using raw term frequency (TF) but no IDF:

 

        indiana   jackson   life   michael   music   pop   really
D1      0         4         1      3         0       6     1
D2      4         0         3      4         1       0     2
D3      0         4         0      5         4       4     0

Assume that the system has returned the following ranking: D2, D3, D1. The user judges D3 to be relevant and both D1 and D2 to be non-relevant.

(i) Show the original query vector, clearly stating the dimensions of the vector.    [2]

(ii) Use Rocchio's relevance feedback algorithm (with α=β=γ=1) to provide a revised query vector for "michael jackson music". Terms in the revised query that have negative weights can be dropped, i.e. their weights can be changed back to 0. Show all your calculations.    [4]
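An illustrative sketch of the Rocchio update with α=β=γ=1, using the term-frequency vectors from the table above (D3 relevant; D1 and D2 non-relevant) and clipping negative weights to 0 as the question allows:

```python
# Dimensions in alphabetical order, matching the table above.
terms = ["indiana", "jackson", "life", "michael", "music", "pop", "really"]
q  = [0, 1, 0, 1, 1, 0, 0]            # query "michael jackson music"
d1 = [0, 4, 1, 3, 0, 6, 1]
d2 = [4, 0, 3, 4, 1, 0, 2]
d3 = [0, 4, 0, 5, 4, 4, 0]

rel, nonrel = [d3], [d1, d2]
alpha = beta = gamma = 1.0

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

cr, cn = centroid(rel), centroid(nonrel)

# Rocchio: q' = alpha*q + beta*centroid(rel) - gamma*centroid(nonrel),
# with negative components dropped (set back to 0).
revised = [max(0.0, alpha * qi + beta * ri - gamma * ni)
           for qi, ri, ni in zip(q, cr, cn)]
print(revised)  # -> [0.0, 3.0, 0.0, 2.5, 4.5, 1.0, 0.0]
```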

(c) Suppose we have a corpus of documents with a dictionary of 6 words w1, ..., w6. Consider the table below, which provides the estimated language model p(w|C) using the entire corpus of documents C (second column) as well as the word counts for doc1 (third column) and doc2 (fourth column), where ct(w, doci) is the count of word w (i.e. its term frequency) in document doci. Let the query q be the following:

q = w1 w2

Word   p(w|C)   ct(w, doc1)   ct(w, doc2)
w1     0.8      2             7
w2     0.1      3             1
w3     0.025    2             1
w4     0.025    2             1
w5     0.025    1             0
w6     0.025    0             0
SUM    1.0      10            10

(i) Assume that we do not apply any smoothing technique to the language model for doc1 and doc2. Calculate the query likelihood for both doc1 and doc2, i.e. p(q|doc1) and p(q|doc2) (do not compute the log-likelihood, i.e. do not apply any log scaling). Show your calculations. Provide the resulting ranking of documents and state the document that would be ranked the highest.    [3]
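A sketch of the unsmoothed computation, using the counts from the table above: with maximum-likelihood estimates, p(q|doc) is the product of ct(w, doc)/|doc| over the query words:

```python
# Word counts per document, from the table above; both documents have length 10.
ct = {
    "doc1": {"w1": 2, "w2": 3, "w3": 2, "w4": 2, "w5": 1, "w6": 0},
    "doc2": {"w1": 7, "w2": 1, "w3": 1, "w4": 1, "w5": 0, "w6": 0},
}
doclen = {d: sum(c.values()) for d, c in ct.items()}

def likelihood(query, doc):
    """Unsmoothed query likelihood: product of ct(w, doc)/|doc| over query words."""
    p = 1.0
    for w in query.split():
        p *= ct[doc][w] / doclen[doc]
    return p

print(likelihood("w1 w2", "doc1"))  # (2/10) * (3/10) ≈ 0.06
print(likelihood("w1 w2", "doc2"))  # (7/10) * (1/10) ≈ 0.07
```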

(ii) Suppose we now smooth the language model for doc1 and doc2 using Jelinek-Mercer smoothing with λ = 0.1. Recalculate the likelihood of the query for both doc1 and doc2, i.e. p(q|doc1) and p(q|doc2) (do not compute the log-likelihood, i.e. do not apply any log scaling). Show your calculations. Provide the resulting ranking of documents and state the document that would be ranked the highest.    [4]
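A sketch of the Jelinek-Mercer version, assuming the common convention p(w|doc) = (1−λ)·ct(w, doc)/|doc| + λ·p(w|C), i.e. λ weights the collection model:

```python
# Collection model and per-document counts, from the table above.
pc = {"w1": 0.8, "w2": 0.1, "w3": 0.025, "w4": 0.025, "w5": 0.025, "w6": 0.025}
ct = {
    "doc1": {"w1": 2, "w2": 3, "w3": 2, "w4": 2, "w5": 1, "w6": 0},
    "doc2": {"w1": 7, "w2": 1, "w3": 1, "w4": 1, "w5": 0, "w6": 0},
}
lam = 0.1  # interpolation weight on the collection model p(w|C)

def jm_likelihood(query, doc):
    """p(q|doc) with Jelinek-Mercer smoothing:
    product over query words of (1-lam)*ct(w,doc)/|doc| + lam*p(w|C)."""
    n = sum(ct[doc].values())
    p = 1.0
    for w in query.split():
        p *= (1 - lam) * ct[doc][w] / n + lam * pc[w]
    return p

print(jm_likelihood("w1 w2", "doc1"))  # (0.9*0.2 + 0.08) * (0.9*0.3 + 0.01) ≈ 0.0728
print(jm_likelihood("w1 w2", "doc2"))  # (0.9*0.7 + 0.08) * (0.9*0.1 + 0.01) ≈ 0.0710
```

Note that smoothing reverses the unsmoothed ranking: doc1 now scores higher than doc2.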

(iii) Explain which document you think should reasonably be ranked higher (doc1 or doc2), and why.    [3]

3.

(a) How would the IDF score of a word w change (i.e., increase, decrease or stay the same) in each of the following cases: (1) adding the word w to a document; (2) making each document twice as long as its original length by concatenating the document with itself; (3) adding some documents to the collection. You must suitably justify your answers.    [4]

(b) Explain in detail why positive feedback is likely to be more useful than negative feedback to an information retrieval system. Illustrate your answer using an example from a suitable search scenario.    [4]

(c)  Neural retrieval models often use a re-ranking strategy over BM25 to reduce computational overhead.

Explain the key limitation of this strategy. Describe in sufficient detail an approach that you might use to overcome this problem.    [5]

(d)  Consider a query q, which returns all webpages shown in the hyperlink structure below.

(i)        Write the adjacency matrix A for the above graph.       [1]

(ii) Using the iterative HITS algorithm, provide the hub and authority scores for all the webpages of the above graph after a complete single iteration of the algorithm. Show your working.    [3]

(iii) Describe in sufficient detail an alternative approach to compute the hub and authority scores for the above graph. You need to show all required steps to generate the scores, but you do not need to actually compute the final scores.    [3]
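Since the exam's graph figure is not reproduced here, the following sketch runs one HITS iteration on a small hypothetical three-page graph (an assumption for illustration only). The update order shown (authorities from hubs, then hubs from the new authorities) is the standard one:

```python
# Hypothetical 3-page graph (the exam's figure is not reproduced here):
# page 0 -> 1, page 0 -> 2, page 1 -> 2.  A[i][j] = 1 iff i links to j.
A = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
n = len(A)
hub = [1.0] * n   # initial scores
auth = [1.0] * n

for _ in range(1):  # one complete HITS iteration
    # Authority update: a(p) = sum of h(q) over pages q linking to p.
    auth = [sum(hub[q] for q in range(n) if A[q][p]) for p in range(n)]
    # Hub update: h(p) = sum of a(q) over pages q that p links to.
    hub = [sum(auth[q] for q in range(n) if A[p][q]) for p in range(n)]
    # Normalise (here by the sum of scores; L2 normalisation is also common).
    sa, sh = sum(auth), sum(hub)
    auth = [a / sa for a in auth]
    hub = [h / sh for h in hub]

print(auth)  # page 2 (linked to by both others) has the highest authority
print(hub)   # page 0 (linking to both others) has the highest hub score
```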




