irlib v1.1.0

irlib.py

Description
This library provides the IRDomain class for constructing and using information retrieval indices. Complementary to this library is irlibdts, which includes a generic class for representing parsed documents. The generic document class includes properties for the document's name, frequency table, and a dictionary of attribute/value pairs. This module also depends on the ipage.py module.

Given a set of parsed documents, IRDomain can compile a search index. This index is typically 33% to 40% of the size of the original text of the indexed documents. In use, the IRDomain index is not loaded into memory in its entirety, but it can nevertheless consume a sizeable fraction of memory resources.

IRDomain provides query mechanisms for attribute queries, Boolean keyword queries, and statistical keyword queries. Queries typically execute quite quickly.

In my test case of 6,331 archived e-mail messages totaling about 18MB of text, the compiled (and optimized) index is about 7MB. After the index has loaded (about 4 seconds), queries of 1 to 3 keywords execute in about 0.05 seconds each. (All tests were performed on a PIII-450MHz Windows NT box with 128MB of RAM.)

Version
This document covers version 1.1.0, November 29, 2000.

Author and copyright
Written by Nathan Denny. This code is in the public domain. No warranty on correctness.

Example
#-- Interface to an already compiled index.
from irlib import *
I=IRDomain('test.index')

#-- Execute a statistically ranked keyword search.
resultSet=I.statisticalQuery(['some','keywords'])

#-- The resultSet consists of (score,docID) tuples.
#-- For each result print the score and document name.
for result in resultSet:
    score,docID=result
    print score, I.getName(docID)
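The query methods documented below can also be chained. The following is a minimal sketch, not part of irlib itself: searchInDateRange is a hypothetical helper, and it assumes the indexed documents carry a 'date' attribute stored as a sortable string such as '20001108'. It relies only on attributeQuery and statisticalQuery with the signatures listed in this document.

```python
def searchInDateRange(domain, terms, start, end, resultSize=100):
    """Rank documents dated between start and end against the given terms.

    domain is expected to be an IRDomain instance (or any object that
    offers the same attributeQuery and statisticalQuery methods).
    """
    # Step 1: collect the IDs of documents whose 'date' falls in the range.
    # The (min,max) tuple form requests a range match.
    candidateIDs = domain.attributeQuery({'date': (start, end)})

    # Step 2: rank only those candidates by passing them as filterList.
    return domain.statisticalQuery(terms, resultSize, candidateIDs)
```

With the index object I from the example above, a call such as searchInDateRange(I, ['some','keywords'], '20001108', '20001130') would be expected to return (score,docID) tuples restricted to that date range.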

irlib.IRDomain

__init__(self,sourceFileName)

getName(self,docID)

Returns the associated name of the document with id docID.

getAttribute(self,docID,attribute)

Returns the value of the specified attribute of the document with id docID.

addDocuments(self,documentList)

documentList is a list of instances of the irlibdts.Document class (or one of its subclasses). This method updates the current index set and sets the modified bit for automatic commit when the domain is being deleted from memory.

commit(self)

Flushes any modifications to the index to the file and reloads the index. If the index set has been modified, commit will automatically be called when the domain is being deleted from memory.

attributeQuery(self,query)

This method returns a list of the IDs of all documents that satisfy the given query. query is a dictionary of attribute/pattern pairs. A pattern may be a single string, requiring an exact match; a list, requiring a match with any one of its components (logical OR); or a tuple of the form (min,max), requiring a match within that range. For instance, assume that each document in the index possesses at least date, subject, and from attributes. Then the following query:
{'date':('20001108','20001130'), 'subject':'Re: irlib', 'from':['schcats','Nathan Denny']}
would return a list of document IDs where each document in the list had a date between '20001108' and '20001130' *AND* had the subject of 'Re: irlib' *AND* contained a from of either 'schcats' *OR* 'Nathan Denny'.
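The three pattern forms can be illustrated with a small stand-alone sketch. This is not irlib's implementation: matchesPattern and satisfies are hypothetical names, the range test is assumed to be inclusive at both ends, and list components are assumed to be compared by exact equality.

```python
def matchesPattern(value, pattern):
    """Return True if one attribute value satisfies one query pattern."""
    if isinstance(pattern, tuple):
        # (min,max) tuple: range test (inclusive bounds are an assumption).
        low, high = pattern
        return low <= value <= high
    if isinstance(pattern, list):
        # List: match any one of the listed alternatives (logical OR).
        return value in pattern
    # Single string: exact match.
    return value == pattern


def satisfies(attributes, query):
    """Return True if the attributes satisfy every query entry (logical AND)."""
    for name, pattern in query.items():
        if name not in attributes:
            return False
        if not matchesPattern(attributes[name], pattern):
            return False
    return True


doc = {'date': '20001115', 'subject': 'Re: irlib', 'from': 'Nathan Denny'}
query = {'date': ('20001108', '20001130'),
         'subject': 'Re: irlib',
         'from': ['schcats', 'Nathan Denny']}
isMatch = satisfies(doc, query)
```

Dates written as 'YYYYMMDD' strings sort in chronological order under plain string comparison, which is why the range test in this sketch needs no date parsing.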

booleanQuery(self,terms,resultSize=100,filterList=None)

Returns a list of tuples of the form (score,docID) such that each document in the list contains at least one of the words in the terms list. resultSize limits the resultSet to the most significant subset of documents. By default resultSize is 100. filterList is a list of document IDs to which the resultSet is restricted. This is intended to be used in conjunction with the results of a call to attributeQuery, for example to perform a keyword search on all documents within a particular date range. The score assigned to each result in the result set is the number of distinct keywords that occur in the document (*not* the number of instances of the keywords).
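The scoring rule described above can be sketched without the library. The code below is an illustration under stated assumptions, not irlib's own code: booleanScores is a hypothetical function, the index is modeled as a plain dictionary mapping docID to a frequency table, and ties are broken by ascending docID.

```python
def booleanScores(frequencyTables, terms, resultSize=100, filterList=None):
    """Score documents by how many distinct query terms they contain.

    frequencyTables maps docID -> {term: occurrence count}.
    """
    if filterList is None:
        candidates = list(frequencyTables)
    else:
        # Restrict the search to the supplied document IDs.
        candidates = [d for d in filterList if d in frequencyTables]

    uniqueTerms = set(terms)
    results = []
    for docID in candidates:
        table = frequencyTables[docID]
        # Count each matching term once, however often it occurs.
        score = sum(1 for term in uniqueTerms if table.get(term, 0) > 0)
        if score > 0:
            results.append((score, docID))

    # Highest scores first; keep only the top resultSize entries.
    results.sort(key=lambda pair: (-pair[0], pair[1]))
    return results[:resultSize]


tables = {
    1: {'index': 5, 'query': 1},
    2: {'index': 1},
    3: {'python': 2},
}
ranked = booleanScores(tables, ['index', 'query'])
```

In this toy data, document 1 scores 2 even though 'index' occurs five times in it, and document 3 is left out because it contains neither keyword.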

statisticalQuery(self,terms,resultSize=100,filterList=None)

Similar to booleanQuery except that the scores are real valued and are computed using the TF*IDF method of statistical weighting. For most searches, this method results in better retrieval and ranking than the booleanQuery method.
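For readers unfamiliar with TF*IDF, the sketch below shows one common textbook form of the weighting: each document's score is the sum, over the query terms, of the term's count in the document multiplied by log(N/df), where N is the number of documents and df the number of documents containing the term. This document does not state the exact formula irlib uses, so the formula, the function name tfidfScores, and the dictionary-based index are all assumptions.

```python
import math


def tfidfScores(frequencyTables, terms, resultSize=100):
    """Rank documents by the sum of tf * log(N / df) over the query terms.

    frequencyTables maps docID -> {term: occurrence count}.
    """
    totalDocs = len(frequencyTables)
    scores = {}
    for term in set(terms):
        # Document frequency: which documents contain the term at all.
        containing = [d for d, table in frequencyTables.items()
                      if table.get(term, 0) > 0]
        if not containing:
            continue
        idf = math.log(float(totalDocs) / len(containing))
        for docID in containing:
            tf = frequencyTables[docID][term]
            scores[docID] = scores.get(docID, 0.0) + tf * idf

    results = [(score, docID) for docID, score in scores.items()]
    # Highest scores first; ties broken by ascending docID.
    results.sort(key=lambda pair: (-pair[0], pair[1]))
    return results[:resultSize]


tables = {
    1: {'index': 5, 'query': 1},
    2: {'index': 1},
    3: {'python': 2},
}
ranked = tfidfScores(tables, ['index', 'query'])
```

Unlike the keyword count used by booleanQuery, this weighting lets a rare term ('query', found in one document) contribute more per occurrence than a common one ('index', found in two).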

termFrequency(self,thisDocID,term)

Returns the number of occurrences of the specified term in the document with identifier thisDocID.

inverseDocumentFrequency(self,term)

Returns the proportion of documents that contain the specified term. For example, if the word wubba occurred in 12 documents in an index of 100 documents, then this method would return 0.12.
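The arithmetic in the wubba example can be reproduced with a short sketch. documentProportion is a hypothetical stand-in written for illustration, and the dictionary of frequency tables is an assumed model of the index, not irlib's storage format.

```python
def documentProportion(frequencyTables, term):
    """Return the fraction of documents whose frequency table contains term."""
    if not frequencyTables:
        return 0.0
    containing = sum(1 for table in frequencyTables.values()
                     if table.get(term, 0) > 0)
    return float(containing) / len(frequencyTables)


# Build 100 toy documents, the first 12 of which contain 'wubba'.
tables = {}
for docID in range(100):
    if docID < 12:
        tables[docID] = {'wubba': 1}
    else:
        tables[docID] = {'other': 1}

proportion = documentProportion(tables, 'wubba')  # 12 / 100
```

Note that a proportion of this kind grows as a term becomes more common, whereas the conventional inverse document frequency is usually taken as the logarithm of its reciprocal and so shrinks for common terms.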