COMP 336 Information Retrieval Mid-Term Examination, Spring 2002
April 18, 2002
Time Allowed: 12:10pm to 1:10pm (1 hour)
Name: Student ID:
[Note: Answer all questions in the space provided. Answers must be precise and to the point.]
1. [10] Given a document-term matrix of a large collection of documents, describe a method for identifying thesaurus groups (i.e., terms that are used in similar ways within the collection). For example, given a document-term matrix of n documents with a vocabulary of t terms:
- treat each column as a term vector
- compute the similarity between every pair of columns
- construct a term-term similarity matrix
- group term pairs whose similarity values are large enough (sketched below)
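A minimal Python/NumPy sketch of these steps, using a hypothetical toy matrix (the term names and the 0.8 threshold are illustrative assumptions, not part of the exam):

```python
import numpy as np

# Hypothetical toy document-term matrix: n = 4 documents, t = 5 terms.
# Rows are documents, columns are terms; all values are illustrative.
terms = ["car", "auto", "retrieval", "search", "engine"]
D = np.array([
    [2, 1, 0, 0, 1],
    [1, 2, 0, 0, 0],
    [0, 0, 3, 2, 1],
    [0, 0, 2, 3, 2],
], dtype=float)

# Steps 1-3: treat each column as a term vector and compute the cosine
# similarity between every pair of columns, giving a t x t term-term matrix.
norms = np.linalg.norm(D, axis=0)
sim = (D.T @ D) / np.outer(norms, norms)

# Step 4: group term pairs whose similarity exceeds a chosen threshold
# (0.8 is an assumed cutoff; a real choice is collection-dependent).
threshold = 0.8
for i in range(len(terms)):
    for j in range(i + 1, len(terms)):
        if sim[i, j] >= threshold:
            print(f"{terms[i]} ~ {terms[j]}  (sim = {sim[i, j]:.2f})")
```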
2. [5] Explain why stemming can improve both retrieval efficiency and effectiveness.
Stemming collapses a number of terms into their common stem, thus reducing the vocabulary size. It also increases recall by allowing words of the same meaning but different spellings to match (this assumes that the stemming algorithm is accurate enough).
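A deliberately crude suffix-stripping sketch in Python, only to show the vocabulary collapse; the suffix list is an illustrative assumption, and real systems use something like the Porter stemmer:

```python
# Crude suffix stripping: enough to show several surface forms collapsing
# to one index term. The suffix list below is illustrative, not Porter's.
SUFFIXES = ("ing", "ers", "er", "ed", "es", "s", "e")

def crude_stem(word: str) -> str:
    for suf in SUFFIXES:
        # Keep a minimum stem length so short words are left alone.
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

forms = ["retrieve", "retrieves", "retrieved", "retriever", "retrieving"]
print({crude_stem(w) for w in forms})  # {'retriev'}: five forms, one term
```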
3. [5] (a) Explain why idf cannot be pre-computed for every term.
idf is based on N and df, both of which change as documents in the collection are inserted, updated, and deleted.
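(For reference, a common formulation, not spelled out in the solution, is idf_t = log(N / df_t), where N is the collection size and df_t is the number of documents containing term t; both quantities change as the collection changes.)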
Note that saying N is unknown at insertion time is not a correct answer, because pre-computation could simply be carried out after all documents have been inserted.
(b) When document scores are computed against a submitted query, the cosine similarity measure appears expensive to compute because of the complex normalization factor in the formula. Explain how you can speed up the computation of the normalization factor.
The cosine similarity measure requires the inner product between the document and query vectors to be normalized by both the document vector length and the query vector length. Several key points are expected in the answer:
- Document vector lengths cannot be precomputed when idf is used (see the reason above)
- Avoid using idf in term weights (e.g., use binary weights)
- If idf is used, precompute document vector lengths and update them periodically (say, after 5% of the collection has been updated)
- Query length is computed only once, or can be ignored altogether without affecting the relative ranks of the results
It is important to note here that cosine similarity (or any other similarity measure) does NOT require a particular term-weight formula such as tf.idf to be used. Thus, saying that cosine similarity is difficult to compute without stating which term-weighting method is used is not correct!
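A minimal Python sketch along these lines, using plain tf weights (no idf) so that document vector lengths can be safely precomputed; the toy vectors are assumptions for illustration:

```python
import math

# Minimal sketch: score documents against a query with cosine similarity,
# using *precomputed* document vector lengths. Weights here are plain tf
# (no idf), so the lengths stay valid as other documents come and go.
docs = {                       # hypothetical tf vectors, term -> count
    "d1": {"cat": 2, "dog": 1},
    "d2": {"dog": 3, "fish": 1},
}

# Precompute each document's vector length once, at indexing time.
doc_len = {d: math.sqrt(sum(tf * tf for tf in vec.values()))
           for d, vec in docs.items()}

def cosine_scores(query):
    # The query length is a constant across documents, so dividing by it
    # does not change the ranking; we skip it here.
    scores = {}
    for d, vec in docs.items():
        dot = sum(qtf * vec.get(t, 0) for t, qtf in query.items())
        if dot:
            scores[d] = dot / doc_len[d]
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(cosine_scores({"dog": 1}))  # d2 ranks above d1
```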
4. [10] Give THREE situations under which recall and precision are meaningless or fail to evaluate the effectiveness of an information retrieval system.
Precision and recall become meaningless when their computation involves division by zero. There are two such cases: the number of documents returned is zero (precision is undefined), and the number of relevant documents in the collection is zero (recall is undefined). The third problem
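A short Python sketch of the two division-by-zero cases named above (the helper function is hypothetical, written only to illustrate the undefined values):

```python
# Precision = |relevant AND retrieved| / |retrieved|
# Recall    = |relevant AND retrieved| / |relevant|
# Either denominator can be zero, making the measure undefined.
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else None  # 0 returned
    recall = hits / len(relevant) if relevant else None       # 0 relevant
    return precision, recall

print(precision_recall([], {"d1"}))   # (None, 0.0): precision undefined
print(precision_recall({"d1"}, []))   # (0.0, None): recall undefined
```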