|Expanding Access to Science and Technology (UNU, 1994, 462 pages)|
|Session 3: New technologies and media for information retrieval and transfer|
|Information retrieval: Theory, experiment, and operational systems|
Experimenters and theorists in IR have been working for many years with alternatives to Boolean search statements. In particular, they have been using what might be described as "associative" methods, where the retrieved documents need not match the search statement exactly but may be allowed to match approximately. For example, the search statement may consist of a list of desirable characteristics, but the system may present as possibly useful some items that lack some of those characteristics. The output of the system may then be a ranked list, where the items at the top of the list are those that match best in some sense, but the list also includes items that match less well.
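As an illustration of the ranked-list idea, the following is a minimal sketch of one very simple associative scheme: each document is scored by the number of query characteristics it possesses, and documents that match only partially are still returned, lower down the list. The function name `rank_by_overlap`, the document identifiers, and the sample terms are all hypothetical, chosen for this example only; the text above does not prescribe any particular scoring rule.

```python
def rank_by_overlap(query_terms, documents):
    """Rank documents by how many of the query terms each one contains.

    `documents` maps a document id to its list of index terms.
    Documents sharing no term with the query are left out.
    """
    query = set(query_terms)
    scored = []
    for doc_id, terms in documents.items():
        score = len(query & set(terms))  # number of shared characteristics
        if score > 0:
            scored.append((doc_id, score))
    # Best match first; ties are broken by document id for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


# Hypothetical sample collection (an assumption for illustration).
documents = {
    "d1": ["fuzzy", "retrieval", "ranking"],
    "d2": ["boolean", "retrieval"],
    "d3": ["vector", "space", "ranking", "retrieval"],
    "d4": ["chemistry"],
}

overlap_ranking = rank_by_overlap(["ranking", "retrieval", "vector"], documents)
for doc_id, score in overlap_ranking:
    print(doc_id, score)
```

A Boolean AND of the three query terms would return only `d3`; here `d1` and `d2` also appear, ranked below it.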
Until relatively recently, this work had had little impact on operational systems. Most such systems continued to use Boolean (or extended Boolean) search logic. However, some recent systems have adopted some associative retrieval ideas. For reasons that will become apparent, I believe that associative retrieval offers far better possibilities for systems that genuinely help end-users to resolve information problems or ASKs. Therefore I welcome this development and indeed see it as long overdue.
There is a wide range of possible approaches to the problem of providing associative retrieval. This section will give a very brief overview of some of the approaches, and the following section will look at one particular model in somewhat more detail.
The associative approach with the most substantial history of development is the vector space model of Salton and others. In this model, the documents and queries are seen as points in a vector space: retrieval involves finding the nearest document points to the query point. This model leads naturally to various associative-retrieval ideas, such as ranking, document clustering, relevance feedback, etc. It has been the basis of a number of experimental systems from the early 1960s on, and many different ideas have been incorporated at different times and subjected to experimental test. Mostly these tests have been along the lines described in section 4 above, with the system treated in input-output fashion.
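The idea of "nearest document points" can be sketched as follows. This is an illustration under stated assumptions, not a description of any particular experimental system: documents and queries are represented as raw term-count vectors, and nearness is measured by the cosine of the angle between vectors. Both choices are assumptions made here for brevity; other term weights and other distance or similarity measures are possible. The names `cosine` and `rank_by_cosine` and the sample data are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as matching nothing
    return dot / (norm_u * norm_v)


def rank_by_cosine(query_terms, documents):
    """Return (doc_id, similarity) pairs, most similar document first."""
    query_vector = Counter(query_terms)  # raw term counts (an assumption)
    scores = [
        (doc_id, cosine(query_vector, Counter(terms)))
        for doc_id, terms in documents.items()
    ]
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical sample collection (an assumption for illustration).
vs_documents = {
    "d1": ["fuzzy", "sets"],
    "d2": ["vector", "space", "model"],
    "d3": ["vector", "retrieval"],
}

vs_ranking = rank_by_cosine(["vector", "space"], vs_documents)
for doc_id, similarity in vs_ranking:
    print(doc_id, round(similarity, 3))
```

Because every document receives a numerical similarity to the query, ranking falls out directly. The same similarity function could in principle be applied between pairs of documents, which is one way to see why clustering fits naturally into this model.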
A second approach is suggested by Zadeh's fuzzy set theory. There have been a few attempts to apply fuzzy set theory to information retrieval, though it has not received nearly as much attention as the vector space model. (This reference describes the original fuzzy set theory. Much related work has occurred since in, for example, fuzzy logic or fuzzy decision-making. An example of an application in IR is: W.M. Sachs, "An Approach to Associative Retrieval through the Theory of Fuzzy Sets," Journal of the American Society for Information Science, 27: 85-87.) The main attraction for the IR application is that it seems to present the possibility of combining associative ideas with Boolean logic, although there are actually some serious theoretical problems in that combination. There is a conspicuous lack of any attempt to evaluate systems based on fuzzy set theory.
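To show how fuzzy sets might combine graded matching with Boolean logic, here is a minimal sketch using the standard operators of Zadeh's original theory: intersection (AND) as the minimum of the membership degrees, union (OR) as the maximum, and complement (NOT) as one minus the degree. The membership values, the document identifiers, and the sample query are hypothetical, and nothing here is taken from any specific fuzzy retrieval system.

```python
def fuzzy_and(*degrees):
    """Fuzzy intersection: the minimum membership degree."""
    return min(degrees)


def fuzzy_or(*degrees):
    """Fuzzy union: the maximum membership degree."""
    return max(degrees)


def fuzzy_not(degree):
    """Fuzzy complement."""
    return 1.0 - degree


# Hypothetical degrees (between 0 and 1) to which each document
# belongs to the fuzzy set of documents "about" each term.
memberships = {
    "d1": {"fuzzy": 0.9, "retrieval": 0.4, "boolean": 0.0},
    "d2": {"fuzzy": 0.5, "retrieval": 0.8, "boolean": 0.7},
    "d3": {"fuzzy": 0.0, "retrieval": 1.0, "boolean": 0.2},
}


def evaluate(m):
    """Evaluate the query: fuzzy AND retrieval AND NOT boolean."""
    return fuzzy_and(m["fuzzy"], m["retrieval"], fuzzy_not(m["boolean"]))


fuzzy_scores = {doc_id: evaluate(m) for doc_id, m in memberships.items()}
fuzzy_ranking = sorted(fuzzy_scores, key=lambda d: (-fuzzy_scores[d], d))
for doc_id in fuzzy_ranking:
    print(doc_id, round(fuzzy_scores[doc_id], 2))
```

The query keeps its Boolean form, yet the result is a graded score that can be used for ranking. Note that with the minimum operator the score of a conjunction depends only on its weakest component, so a document with degrees 0.4 and 0.4 scores the same as one with degrees 0.4 and 1.0.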
A third approach is that based on statistical (probabilistic) models. Although statistical ideas have been around in IR for a very long time, most such work nowadays is based on a specific probabilistic approach, which attempts to assess the probability that a given item will be found relevant by the user. In this sense it belongs firmly with the evaluation tradition discussed in section 4 and with the ideas of relevance that emerged from that tradition, although it turns out to fit very naturally with more recent ideas of highly interactive systems. The probabilistic approach is discussed in more detail in the next section.
It is not strictly necessary to regard these three approaches as incompatible. It is possible to devise methods that make use of ideas based on more than one approach. However, they do suggest very different conceptions of the notion of degree of match between documents and queries.