Compass was created for a Computer Science independent project course. Shown here is a short report describing it. You can also download the full report, including results and screenshots, here.

Compass: An Information Retrieval Tool


Introducing Compass

Compass is an automatic information retrieval tool designed to enhance information retrieval from the web. Its approach rests on the observation that human search queries are rarely generated in a vacuum: they usually arise in a larger context relating to the user's information retrieval needs. The context that Compass is designed to work with is that of the document.

How Does It Work?

Searches in Compass are done directly from a source document; the source document provides the context representing the user's information retrieval goals. When a user opens a document in Compass, it is analyzed and important words (keywords) are extracted and stored. The user is then free to select any portion of text that interests them. Once they have done this, they can right-click and select the "get more information" option. Compass uses the previously extracted keywords that fall within the user's selection to form the search query, which is passed on to a web-based search engine. Compass currently works with Google and Yahoo!. The result set is obtained from the search engine, and the documents it points to are downloaded, analyzed and compared to the source document's keywords. Each document is assigned a score, and the result set is re-ranked to promote documents that are more closely related to the source document as a whole.
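The query-forming step described above can be illustrated with a small sketch. This is not Compass's actual code: the function name build_query, the tokenizing regular expression and the sample keywords are all assumptions made for illustration. It simply keeps the stored keywords that occur in the selected text, in order of first appearance.

```python
import re


def build_query(selection: str, keywords: set[str]) -> str:
    """Join the stored keywords that occur in the selection (hypothetical helper)."""
    picked = []
    # Assumed tokenization: lowercase runs of letters and digits.
    for token in re.findall(r"[a-z0-9]+", selection.lower()):
        if token in keywords and token not in picked:
            picked.append(token)
    return " ".join(picked)


# Keywords assumed to have been extracted when the document was opened.
document_keywords = {"retrieval", "keyword", "ranking"}
selected_text = "Keyword extraction feeds the ranking step; ranking uses cosine."
query = build_query(selected_text, document_keywords)
print(query)
```

The resulting string is what would be handed to the search engine; how Compass actually submits it is not described in this report.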

Keyword Extraction

Keywords are significant words occurring in a particular text. The appropriate selection of keywords is crucial to the precision of the result set in commercial search systems. If keywords are poorly selected and too general, the result set will often be of low precision. On the other hand, if too many words are used in a search query, the size of the result set often decreases drastically and in some cases becomes empty. The keyword extraction algorithm is based on an algorithm described in the paper Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information by Matsuo & Ishizuka (2004) [3]. It has the advantage of not requiring a corpus while maintaining performance comparable to tf-idf (Term Frequency-Inverse Document Frequency) techniques that do employ a corpus. Matsuo & Ishizuka describe the procedure as follows: "Frequent terms are extracted first, then a set of co-occurrences between each term and the frequent terms, i.e., occurrences in the same sentences, is generated. Cooccurrence distribution shows importance of a term in the document as follows. If the probability distribution of co-occurrence between term a and the frequent terms is biased to a particular subset of frequent terms, then term a is likely to be a keyword." [3]

In brief, if a word co-occurs selectively with certain frequent terms, it is more likely to be a significant word than a word that co-occurs fairly evenly with all of the frequent terms. The bias of the co-occurrence distribution is measured using a refined version of the χ² test.
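To make the idea of co-occurrence bias concrete, here is a minimal sketch of a plain χ² score over sentence-level co-occurrences. It is a simplification under stated assumptions, not the refined measure from [3] and not Compass's implementation: the function name, the num_frequent parameter, the way expected counts are estimated and the toy sentences are all illustrative choices.

```python
from collections import Counter


def chi_square_scores(sentences, num_frequent=2):
    """Score each term by how unevenly it co-occurs with the frequent terms.

    sentences: list of token lists, one list per sentence (assumed input shape).
    """
    term_counts = Counter(t for s in sentences for t in s)
    frequent = [t for t, _ in term_counts.most_common(num_frequent)]
    total_tokens = sum(len(s) for s in sentences)

    # Assumed estimate of p_g: the share of all tokens that lie in
    # sentences containing the frequent term g.
    p = {g: sum(len(s) for s in sentences if g in s) / total_tokens
         for g in frequent}

    scores = {}
    for w in term_counts:
        # n_w: number of tokens in the sentences where w appears.
        n_w = sum(len(s) for s in sentences if w in s)
        score = 0.0
        for g in frequent:
            if g == w:
                continue
            observed = sum(1 for s in sentences if w in s and g in s)
            expected = n_w * p[g]
            score += (observed - expected) ** 2 / expected
        scores[w] = score
    return scores


toy_sentences = [
    ["search", "keyword"],
    ["search", "keyword"],
    ["search", "query"],
    ["ranking", "query"],
    ["ranking", "vector"],
    ["ranking", "cosine"],
]
toy_scores = chi_square_scores(toy_sentences)
for term in ("keyword", "query"):
    print(term, toy_scores[term])
```

In the toy data, "keyword" appears only alongside "search", while "query" is split between "search" and "ranking", so "keyword" receives the higher score. The refined test mentioned above adjusts this basic statistic further; those adjustments are left out here.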

Document Ranking

Document ranking is a central part of Compass. A number of methods are being tested for re-ranking the result set (I describe it as re-ranking because individual search engines already impose their own ranking on their search results), but the method currently in use is the Vector Space Model, with vector similarity measured using the cosine measure. Term frequencies in the vectors are dampened using the function f(tf) = √(tf), where tf is the term frequency [2].

In the Vector Space Model, documents are represented in a high-dimensional space in which each dimension corresponds to a word in the document; any collection of words from that document (such as a set of keywords) therefore represents a vector in that space. A vector over the keywords extracted from the source document is computed for each document in the result set, and the documents are ranked in descending order of the cosine between their vector and the vector representing the source document. The cosine measure is a common measure of vector similarity. In this manner, documents from the result set that are closer to the source document, and presumably more precise with respect to the user's search goals, are moved higher up in the list.
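The re-ranking step can be sketched as follows. This is a hedged illustration rather than Compass's code: the function names, the use of the keyword set as the vector dimensions, and the (url, tokens) shape of the results are assumptions. The square-root dampening and the cosine measure follow the description above.

```python
import math
from collections import Counter


def dampened_vector(tokens, vocabulary):
    """Build a vector of sqrt-dampened term frequencies, one entry per vocabulary term."""
    counts = Counter(tokens)
    return [math.sqrt(counts[term]) for term in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rerank(source_tokens, keywords, results):
    """Order result URLs by descending cosine similarity to the source document.

    results: list of (url, tokens) pairs (assumed shape for downloaded documents).
    """
    vocabulary = sorted(keywords)  # fixed dimension order
    source_vec = dampened_vector(source_tokens, vocabulary)
    scored = [(cosine(source_vec, dampened_vector(tokens, vocabulary)), url)
              for url, tokens in results]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [url for _, url in scored]


source = ["cosine", "cosine", "vector", "ranking"]
results = [
    ("doc-a", ["weather", "today"]),
    ("doc-b", ["cosine", "vector", "vector"]),
]
print(rerank(source, {"cosine", "vector"}, results))
```

Here "doc-b" shares keywords with the source and so moves ahead of "doc-a", which shares none. Documents with equal scores keep the order the search engine gave them, because Python's sort is stable.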

References:

1. C. J. van Rijsbergen. Information Retrieval. Butterworths, London, 1979.

2. C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.

3. Y. Matsuo and M. Ishizuka. Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information. Int'l Journal on Artificial Intelligence Tools, Vol. 13, No. 1, pp. 157-169, 2004.

Credits

Compass is a project by Yannick Assogba, under the supervision of Dr. Sabine Bergler.