(COMP336)[2000](s)midterm~=joji^_55236.pdf

=========================preview======================
(COMP336)[2000](s)midterm~=joji^_55236.pdf
Back to COMP336 Login to download
======================================================
COMP 336 Information Retrieval Midterm Examination, Spring 2000 March 20, 2000 Name: Student ID:
1. Using the basic KMP algorithm, determine the number of character positions to shift in the shift[] array for the pattern tomatoes [10]
shift
t 1
o 1
m 2
a 3
t 4
o 4
e 4
s 7

Explain briefly how the algorithm can be improved by examining the pattern character which causes a mismatch with the text. Show the new values of the shift array when the improvement is applied. [10]
If after shifting the pattern, the new pattern character to be compared with the text string is the same as the pattern character that caused the mismatch, we can conclude that a match is not possible and thus shift the pattern further by a number of position as determined by the shift table entry for the new pattern character.
shift
t 1
o 1
m 2
a 3
t 5
o 5
e 4
s 7

Give two major differences between the KMP and BM algorithms. [6]
Any two of the following (although the first two characterize BM better):
1) comparison is from right to left
2) the text character causing a mismatch is using to determine the shift (delta 1)
3) repeated substrings in the pattern are obtained from right to left to determine the shift (delta 2)

Explain how we can use KMP pattern matching to support prefix, suffix or infix truncation in the
pattern. [9]

Prefix and suffix: nothing need to be done; e.g., *xyz and xyz* are equivalent to searching for xyz
infix: ijk*xyz is equivalent to searching for ijk and upon a match search onwards for xyz

[morale of this question is that infix, prefix, and suffix truncation can be easily handled in full text
scanning compared to an inverted file]

2. Fill in the precision, recall and fallout rates in the following table. It is assumed that there are a total of 100 documents, and there are only 5 relevant documents, which are marked with a in the first column. [30]
Rank doc ID Recall Precision Fallout Rate
1 1001 0 0 1/95
2 2873 1/5 1/2 1/95
3 3916 2/5 2/3 1/95
4 0983 2/5 2/4 2/95
5 8310 2/5 2/5 3/95
6 7892 3/5 3/6 3/95
7 4562 3/5 3/7 4/95
8 4921 4/5 4/8 4/95
9 7934 1 5/9 4/95
10 9248 1 5/10 5/95

... . . . ... ... ... ...
100 3861 1 5/100 95/95

Is the following precision/recall graph possible? Justify your answer. [5]
1

1
precision
precision

11
recall recall
precision = number of relevant document retrieved / number of documents retrieved recall = number of relevant retrieved / number of relevant documents in collection
We can see that when precision = 0, recall must be zero. Note that the diagram to the right is possible but interpolation would not produce the graph on the left.
3. The following inverted file shows the values of the postings list for three terms. In each postings entry, the first value is the document ID, and the s