Expanding Access to Science and Technology (UNU, 1994, 462 pages)
Session 3: New technologies and media for information retrieval and transfer
The potential offered by "extended retrieval"

Michael K. Buckland

Abstract

The traditional form of information retrieval is composed of a single resource file and a single retrieval mechanism. In the environment created by the new information technology, many resources and many computers are linked by networks. This environment requires an extension of information retrieval techniques to include retrieval from multiple files and the use of multiple retrieval mechanisms. Some benefits and technical consequences of "extended retrieval" are reviewed.

1. Introduction

The traditional form of an information retrieval system is composed of two parts: a resource file and a retrieval mechanism. A bibliographic retrieval system or an on-line library catalogue, for example, is composed of a file of bibliographic records and a retrieval mechanism designed to perform the most commonly desired searches on that file of records, such as a search by author, title, or subject. It is, in effect, a unitary system, a single system composed of one resource file and of one retrieval mechanism.

The new information technology is leading to a new computing environment. The cost-effectiveness of computer hardware is increasing, the cost of electronic storage is decreasing, and connectivity through telecommunications is becoming pervasive and less expensive. Meanwhile, labour costs and building costs continue to rise. These changing conditions are resulting in a new environment in which:

- workstations are becoming widely available;
- very large sets of data can be stored economically;
- many thousands of computers are interconnected over local, national, and international networks; and
- the standards and protocols necessary for effective cooperation are being developed and adopted.

In this situation, we find a rapidly growing number of databases, an increasing use of databases, and a trend for individuals to use a number of heterogeneous databases. The result is increased complexity for the searcher and a greater need for expertise to identify what resources exist and how to use them cost-effectively. (For a convenient general introduction, see Lynch and Preston [7].)

This changed information technology has created a new information retrieval environment in which the potential for information retrieval now extends far beyond the traditional form of a unitary retrieval system composed of one file and one retrieval mechanism. I use the term "extended retrieval" to denote this more general form of information retrieval. In this paper, I describe what I mean by extended retrieval and provide examples. Some technical consequences of the extension of information retrieval from a traditional, unitary form to an extended network environment will be noted.

2. Four information retrieval "architectures"

The generalization of information storage and retrieval beyond the traditional, unitary case of one file and one retrieval system to the more general model of multiple files and multiple retrieval systems can be expressed as four combinations:

1. Traditional, unitary retrieval systems with one file and one retrieval mechanism. An on-line library catalogue would be an example. A search on MELVYL, the on-line catalogue of the nine campuses of the University of California, for example, retrieves 266 records for books using the combination of subject keywords "science" and "Japan."

2. Multi-stage retrieval from a single file. Here the results of a search by one retrieval system on a file are subjected to additional retrieval operations by a second retrieval mechanism as "post-processing." At Berkeley, for example, an experimental system known as OASIS can be used to refine the results of MELVYL searches [5]. If the 266 MELVYL catalogue records for books on "science" and "Japan" from the previous example are downloaded into OASIS, additional processing can identify numerous subsets defined by date, by language, and by the libraries where copies are held (see table 1).

3. Retrieval from multiple files by a single retrieval mechanism. Here a single retrieval mechanism can search and derive records from two or more files simultaneously. An example is the "Onesearch" feature of the DIALOG retrieval service.

4. Retrieval using multiple files and multiple retrieval mechanisms. This is the most general case and the logical consequence of the development of the new information technology environment: a networked environment in which many different files of resources and many different retrieval systems exist and are, in principle, widely accessible over the network.

Note that we are assuming that these systems are heterogeneous. We are not talking about the relatively simple case of distributed database systems designed for compatible, distributed use. We do not and cannot assume that software, hardware, and data structures are standardized. We are concerned with retrieving resources that are related in their meaning rather than in their form, so the problems are those of information retrieval rather than data retrieval.
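The first three combinations can be sketched in a few lines of Python. This is an illustrative sketch only: the sample records, the field names, and the functions `keyword_search`, `search_many`, and `postprocess` are hypothetical and do not describe how MELVYL, OASIS, or DIALOG work internally.

```python
from collections import Counter

# Hypothetical records; titles and field names are assumptions made for this sketch.
CATALOGUE = [
    {"title": "Science in Japan", "subjects": {"science", "japan"},
     "language": "English", "year": 1989},
    {"title": "Nihon no kagaku", "subjects": {"science", "japan"},
     "language": "Japanese", "year": 1985},
    {"title": "Science in France", "subjects": {"science", "france"},
     "language": "English", "year": 1990},
]
BIBLIOGRAPHY = [
    {"title": "Japanese science policy", "subjects": {"science", "japan", "policy"},
     "language": "English", "year": 1991},
]


def keyword_search(file, *keywords):
    """One retrieval mechanism applied to one file (unitary retrieval)."""
    wanted = set(keywords)
    return [record for record in file if wanted <= record["subjects"]]


def search_many(files, *keywords):
    """The same mechanism applied to several files (multiple-file searching)."""
    return [record for file in files for record in keyword_search(file, *keywords)]


def postprocess(records, field):
    """A second mechanism applied to the first one's results (multi-stage retrieval)."""
    return Counter(record[field] for record in records)


unitary = keyword_search(CATALOGUE, "science", "japan")
multi_file = search_many([CATALOGUE, BIBLIOGRAPHY], "science", "japan")
by_language = postprocess(unitary, "language")
```

Fully extended retrieval would combine `search_many` with several different post-processing mechanisms. The sketch sidesteps the hard part noted above, because both sample files share one record structure, whereas heterogeneous systems do not.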

These four cases are summarized in table 2.

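Table 1 Analysis of 266 records by language, date, and campus. Search request: "Science" and "Japan"

| Date | Berkeley: Japanese | Berkeley: English | Berkeley: Other | UCLA: Japanese | UCLA: English | UCLA: Other | Other campuses: Japanese | Other campuses: English | Other campuses: Other | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| 1991 | 3 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 5 |
| 1987-1990 | 17 | 9 | 1 | 8 | 2 | 0 | 6 | 11 | 0 | 54 |
| 1984-1986 | 3 | 8 | 0 | 10 | 4 | 0 | 7 | 8 | 0 | 40 |
| 1978-1983 | 3 | 10 | 2 | 7 | 6 | 1 | 10 | 7 | 0 | 46 |
| 1972-1977 | 0 | 8 | 1 | 15 | 5 | 0 | 7 | 5 | 0 | 41 |
| 1963-1971 | 0 | 7 | 0 | 14 | 7 | 0 | 6 | 4 | 0 | 38 |
| 1962 | 1 | 6 | 0 | 19 | 10 | 0 | 3 | 3 | 0 | 42 |
| Total | 27 | 49 | 4 | 73 | 34 | 1 | 40 | 38 | 0 | 266 |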

Table 2 Unitary and extended retrieval (rows: number of retrieval mechanisms; columns: number of files)

| Retrieval mechanisms | Single file | Two or more files |
|---|---|---|
| One | a. Unitary retrieval, e.g. on-line library catalogue | c. Multiple file searching, e.g. DIALOG Onesearch |
| Two or more | b. Single file, multiple processing, e.g. postprocessing | d. Fully extended retrieval |

3. Illustrations of extended retrieval

To illustrate some of the potential of extended retrieval, let us consider two kinds of data: bibliographic data and scientific data.

3.1 Bibliographic Data

A record in a library catalogue will include author, title, location (call number), and a subject heading (e.g. from the Library of Congress Subject Headings list). A record in a bibliography representing the same document will likewise include author and title, but may also include an abstract and a subject heading probably from a different list, such as Medical Subject Headings. A citation index would include, again, the author and title and the references from the document. These overlapping contents are shown in figure 1. What we have for the same document is three quite different bibliographic descriptions by different publishers, in different formats, and ordinarily searched on different retrieval systems. These three records contain:

- information that is the same, though possibly expressed differently and not necessarily recognizable as being the same, and
- some information not provided by the others, e.g. the catalogue has the location of a copy; the bibliography has an abstract; and the citation index shows references to and from the document.

Figure 1 Related bibliographic files

The relationship between the records retrieved from the different databases is that they all represent the same document. But this is only one of many possible relationships. Figure 1 also shows two further relationships:

- a book review index may include a record for a different but related document, a book review; and
- outside of the bibliography and library catalogue may be some object that the book is about.

In this way, one's knowledge can be significantly increased by extending one's search to two or more heterogeneous databases. However, although the various bibliographies may refer to a single document, there is no assurance that they will do so in a consistent way. The form and contents of records in bibliographies (like references at the end of papers) vary considerably. This is not normally a problem for human beings, who can recognize what is meant, but it is a serious problem for recognition by a computer.

Differences in subject description can be substantial and significant. Consider, for example, a searcher interested in coastal pollution. A search on "coastal pollution" in the Library of Congress Subject Headings in the University of California MELVYL on-line catalogue yielded nothing either as a phrase ("exact subject") or as a pair of subject terms ("subject keyword search"). Nor does either form of search yield anything in the MELVYL file (1988 to date) of the MEDLINE bibliography. Nevertheless, material on coastal pollution does exist in both, and some of it can be found by searching for documents that contain the words "coastal" and "pollution" in their title. Analysis of these records shows that the subject headings actually assigned to these documents include:

| LCSH (MELVYL Catalogue) | MeSH (MELVYL MEDLINE) |
|---|---|
| Marine pollution | Seawater |
| Coastal zone management | Water pollution |
| Water—Pollution | Bacteria |
| Petroleum industry and trade | Water microbiology |
| Waste disposal in oceans | Water pollutants |

Not only is the plausible phrase "coastal pollution" not used in either set of subject headings, even as a cross-reference, but there is remarkably little overlap in the terms that are used.
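The lack of overlap can be checked mechanically. The following Python sketch compares the two sets of headings listed above, first as literal strings and then after a simple normalization. The `normalize` function is an assumption made for illustration, not a method used by either indexing system.

```python
import re

# Subject headings listed above for the "coastal pollution" records.
LCSH = ["Marine pollution", "Coastal zone management", "Water—Pollution",
        "Petroleum industry and trade", "Waste disposal in oceans"]
MESH = ["Seawater", "Water pollution", "Bacteria",
        "Water microbiology", "Water pollutants"]


def normalize(heading):
    """Lower-case a heading and reduce punctuation to single spaces."""
    return " ".join(re.findall(r"[a-z]+", heading.lower()))


# Headings that are literally identical in both lists.
exact_overlap = set(LCSH) & set(MESH)

# Headings that coincide once case and punctuation are ignored.
normalized_overlap = {normalize(h) for h in LCSH} & {normalize(h) for h in MESH}

# Is the searcher's own phrase used as a heading in either list?
query_is_heading = normalize("coastal pollution") in {normalize(h) for h in LCSH + MESH}
```

With these ten headings, no string is shared exactly, and normalization recovers only the single pair "Water—Pollution" and "Water pollution". The searcher's phrase matches nothing in either list.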

3.2 Science Data

Consider the range of different data that could be relevant and available for studying a geographical area such as Kyoto Prefecture or the Sacramento delta:

- Topographic: latitude, longitude, altitude
- Political map
- Satellite image
- Land-use map
- Gazetteer: place names
- Weather: temperature, precipitation, humidity, wind
- Textual documents
- Census and socio-economic statistics
- Photographs, etc.

Handling the retrieval of such diverse kinds of data from quite different sources is a major challenge.

4. Some technical issues

We use the phrase "extended retrieval" to refer to the extension of information retrieval to include search and retrieval in multiple files and/or using multiple retrieval mechanisms. In the new environment, a number of interesting problems arise and need to be resolved:

4.1 What Resources Exist?

The fact that many electronic resources exist in many places does not mean it is easy to identify or find them. Files stored on computers are just that: files that are stored. There is, as yet, little or no tradition of cataloguing computer files so that they can be identified and found, as there is for cataloguing library books and museum objects. The task of developing "directories to the Internet" is not likely to be simple or inexpensive, but it is now receiving increasing attention. The question of identifying which resources one might wish to search is a bibliographic problem, although the description of electronic resources is still undeveloped.

There is also the question of which retrieval system to choose for any given search, if there is a choice. For example, the catalogue records of the library of the Berkeley School of Library and Information Studies can be searched using four different retrieval systems. Different information retrieval systems have different retrieval capabilities. For a specialized search, one may need to select the retrieval mechanism as much as the resource to be searched. This implies a knowledge and understanding of the differing characteristics of the retrieval mechanisms available, with which resources they can be used, and how to use them, singly or in combination. This knowledge is inadequately developed, though Belkin and Croft [1] provide a useful review.

4.2 Search and Retrieve Protocols

It is now possible to access databases at remote sites as well as databases at one's local computer centre. This ordinarily requires establishing a telecommunications connection, a personal account and password, and the use of an unfamiliar command language (as in figure 2a). This is inconvenient and requires expertise. A significant new development is the creation of national (e.g. US NISO Z39.50) and international (ISO 10162/10163) standards for computer-to-computer "search and retrieve." The adoption of these standards will enable one to delegate to one's local retrieval system the extension of a search to some other, different retrieval system. The Search and Retrieve standards translate searches from one retrieval command language to another [3, 4]. This development started among librarians to enable convenient access to each other's catalogues, but it has wider application. The effect is shown in figure 2b.


Figure 2 "Search and retrieve" (Z39.50) protocol (a) A user can connect with various on-line catalogues and must know how to use each. (b) With the "Search and retrieve" (Z39.50) protocol, the user need only know how to use the local catalogue and to instruct it to extend a search to other systems.
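The idea of delegating translation to the local system can be sketched as follows. This Python fragment is a conceptual illustration only: the two command syntaxes are invented, and nothing here reproduces the message formats or attribute sets defined by the standards themselves.

```python
# Two invented command syntaxes; neither is the syntax of a real system
# or of the search-and-retrieve standards.
SYNTAX = {
    "system_1": {"author": "FIND AUTHOR {term}", "title": "FIND TITLE {term}"},
    "system_2": {"author": "au={term}", "title": "ti={term}"},
}


def render(query, system):
    """Translate a neutral (attribute, term) query into one system's command string."""
    attribute, term = query
    return SYNTAX[system][attribute].format(term=term)


def extend_search(query, systems):
    """Express the same neutral query in the command language of each target system."""
    return {system: render(query, system) for system in systems}


commands = extend_search(("author", "Buckland"), ["system_1", "system_2"])
```

The searcher states the query once in a neutral form. Knowledge of each remote command language lives in the `SYNTAX` table rather than in the searcher's head, which is the effect shown in figure 2b.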

Because different retrieval systems have different capabilities, one could do more than simply extend a search to another database. For example, suppose that the on-line catalogue at library A does not support searching for individual words in titles, but that the on-line catalogue of library B does. A title keyword search desired at A could, instead, be performed on the online catalogue at B. The records of any books found at B could then be transferred back to A and, with the benefit of full descriptions, the catalogue at A could be searched to see if they are also held at A. Any such books also found to be held at A would provide the effect of a title word search - admittedly probably incomplete - even though the on-line catalogue at A did not support title word searching. The point is that specialized retrieval capabilities available on a remote machine but not available on a local machine could, within limits, be used to enhance local searching.
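A minimal Python sketch of this borrowing of capabilities follows. The two catalogues, their identifiers, and the function names are hypothetical. Matching on the exact title stands in for matching on "full descriptions", which in practice is a harder problem.

```python
# Hypothetical holdings; identifiers and titles are invented for this sketch.
CATALOGUE_A = {
    "A1": "Marine pollution and its control",
    "A2": "Coastal zone management handbook",
    "A3": "Pollution of estuaries",            # held at A only
}
CATALOGUE_B = {
    "B1": "Marine pollution and its control",
    "B2": "Coastal zone management handbook",
    "B3": "Coastal pollution in the Pacific",  # held at B only
}


def title_keyword_search_b(word):
    """Capability that only library B offers: match a single word within titles."""
    return {rid: title for rid, title in CATALOGUE_B.items()
            if word.lower() in title.lower().split()}


def known_item_search_a(title):
    """Capability that library A offers: look up a full, exact title."""
    return [rid for rid, held_title in CATALOGUE_A.items() if held_title == title]


def emulated_title_search_at_a(word):
    """Run the keyword search at B, then check each hit against A's holdings."""
    held_at_a = []
    for title in title_keyword_search_b(word).values():
        held_at_a.extend(known_item_search_a(title))
    return sorted(held_at_a)


pollution_at_a = emulated_title_search_at_a("pollution")
```

The result illustrates the incompleteness noted above: record `A3` has "pollution" in its title but is never found, because B does not hold it.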

The idea of a "knowledge robot" or "knowbot" that could be sent off into the networks searching for and retrieving information on any specified topic has aroused interest. The essence of a "knowbot" is the idea of a conditional search command. A searcher at A might send a search command in the following form: Search in Resource B for data with attribute X. If found, retrieve it and transport it to A; if not found, extend the search to Resource C, and so on.
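The conditional search command can be written as a short loop. In this Python sketch the resources are plain lists held in memory and the test for "attribute X" is an arbitrary predicate. Both are assumptions made for illustration, and no network access is involved.

```python
def conditional_search(resources, has_attribute):
    """Try each resource in turn and stop at the first one that yields results.

    `resources` is an ordered list of (name, records) pairs, and
    `has_attribute` is a predicate standing in for "data with attribute X".
    """
    for name, records in resources:
        found = [record for record in records if has_attribute(record)]
        if found:
            return name, found  # found: retrieve it and bring it back
    return None, []             # every resource was searched without success


# Hypothetical resources B and C.
RESOURCE_B = ["rainfall 1990", "rainfall 1991"]
RESOURCE_C = ["land use 1990", "census 1990"]

source, hits = conditional_search(
    [("B", RESOURCE_B), ("C", RESOURCE_C)],
    lambda record: "census" in record,
)
```

Here nothing in B matches, so the search is extended to C, which supplies the result.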

4.3 Questions of Relatedness

Extended retrieval among heterogeneous resources raises difficult questions of relatedness. If a name found in one database is similar to a name in another database, are they variant forms of the name of the same person? If the names look the same, could they still refer to different people? The same problem arises with records in bibliographic databases that may or may not represent the same document [2]. More generally, in extended retrieval in heterogeneous databases, one is concerned with the retrieval of related material but the nature of the relationship may be difficult to define or to determine.
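Both sides of the name problem can be seen in a deliberately crude matching rule. The Python sketch below reduces a name to its surname and first initial. The rule and the sample names are assumptions chosen to show the difficulty, not a recommended matching method.

```python
def name_key(name):
    """Reduce 'Surname, Forenames' to (surname, first initial): a crude key."""
    surname, _, forenames = name.partition(",")
    forenames = forenames.strip()
    initial = forenames[0].upper() if forenames else ""
    return surname.strip().lower(), initial


def possibly_same_person(a, b):
    """True when two name strings share a key: a candidate match, not proof."""
    return name_key(a) == name_key(b)


variant_forms = possibly_same_person("Buckland, M.K.", "Buckland, Michael K.")
different_people = possibly_same_person("Buckland, M.K.", "Buckland, Mary")
```

The rule correctly links two variant forms of one name, but it also links "Buckland, M.K." with a hypothetical "Buckland, Mary". A stricter rule would reduce such false matches at the cost of missing genuine variants.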

4.4 Anatomy of the Retrieval Process

Retrieval in a unitary retrieval system is easily viewed as an event rather than a process. When considering information retrieval in a networked environment, one might think in terms of local "client" search machines and remote resource "servers," which implies a distinction between a search and retrieve mechanism and a resource file. This may exist when a retrieval system is built to retrieve from two or more files. But information retrieval is, in practice, a complex process including several different components. The question arises of how the retrieval process could be optimally divided into different stages on different machines. In fact, when one analyses the individual elements of the retrieval process, considerable complexity and choice emerge. For example, with bibliographic searches:

- different bibliographies represent different, more or less overlapping, populations of documents;
- different bibliographies will have more or less different descriptions, even for the same document;
- the access points ("indexes") that can be searched vary between systems;
- there can be more or less cross-referencing between different index terms ("see," "see also," and other kinds of syndetic relationships); and
- different retrieval systems support different types of searching (matching, comparing), even in the same bibliographic files. Some allow searches for keywords and/or composite Boolean search requests and others do not.

So, correspondingly, one can immediately identify five different classes of reasons for extending searches to two or more on-line bibliographies and/or on-line catalogues. Depending on the circumstances, different options might be chosen:

1. Because different bibliographies represent different populations of documents, one may want to extend a search to another bibliography or catalogue, and so to a new population of documents, when what was desired was not found in those already searched.

2. Because different bibliographies contain different descriptions for the same document, one may want to extend a search to another bibliography or catalogue for a document that has already been found because differing bibliographic descriptions can be used to accumulate additional information. As noted above, a book might be present in a library catalogue, in a subject bibliography, and in a citation index. All three will have more or less differing bibliographic records for the same document: The catalogue will have a standard catalogue record and a note of the location of each copy; the medical bibliography may contribute a different, detailed subject description and an abstract; and a citation index might contribute a list of other works cited in it and another list of works that cite it. Combining these descriptions could improve bibliographic access substantially.

3. For a more complete search, one may want to extend a search to two or more other systems in order to use additional access points. Citations have to be searched in a citation index. The ability to search on other features, such as searching by words within a title or within an abstract, or by the language or date of the document, varies significantly between systems.

4. Because of the complexity and vagueness of language, one may prefer using the system that has the best network of cross-references, the best "vocabulary control," to guide the searcher from the searcher's terms (e.g. "coastal pollution") to the system's terms.

5. It may be worthwhile to extend a search to another system because it has special searching abilities, such as identifying pairs of words that occur close to each other, or to extend the search by downloading results into a personal computer for more detailed analysis ("postprocessing").
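The second reason, accumulating information from differing descriptions of one document, can be sketched in Python as a merge of records. The three sample records and their field names are hypothetical. The sketch also assumes that the records are already known to describe the same document, which section 4.3 shows is itself a difficult question.

```python
def merge_descriptions(*records):
    """Combine several descriptions of one document, keeping each distinct value per field."""
    merged = {}
    for record in records:
        for field, value in record.items():
            values = merged.setdefault(field, [])
            if value not in values:
                values.append(value)
    return merged


# Hypothetical records for a single document from three different sources.
catalogue_record = {"author": "Doe, J.", "title": "Coastal pollution",
                    "subject": "Marine pollution", "location": "Main Library"}
bibliography_record = {"author": "Doe, J.", "title": "Coastal pollution",
                       "subject": "Water pollution",
                       "abstract": "A survey of pollution in coastal waters."}
citation_record = {"author": "Doe, J.", "title": "Coastal pollution",
                   "cites": "Roe, R. (1985)"}

combined = merge_descriptions(catalogue_record, bibliography_record, citation_record)
```

The combined record carries both subject descriptions along with the location, the abstract, and the citation, none of which any single source provides on its own.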

There are other possibilities. For example, the extent to which texts are subdivided into fields can affect retrieval performance [6].

We can observe that information retrieval theory remains significantly incomplete, even for unitary retrieval systems, until the effects of changes in any one or more of these variables on retrieval outcomes are properly understood.

5. Conclusion

Automating a catalogue, placing a bibliography on-line, or providing access to any other electronic resource on-line is a substantial technological development. But to think only of an individual on-line catalogue, of an individual on-line bibliography, or of any other individual resource - even being aware that there are several different on-line catalogues, numerous individual bibliographies available on-line, and many other resources on-line - is to think in terms of the card catalogues, published paper bibliographies, and the unitary retrieval systems of the past. Instead of thinking of individual retrieval systems, we should now base our thinking on the awareness that there are large and growing populations of electronic resources and of retrieval mechanisms, increasingly connected by telecommunications networks, and containing data sets that can, in principle, be linked, combined, and rearranged. What could happen if, instead of thinking of information retrieval in traditional, unitary terms, as when using a bibliography on-line, we were to follow the logic of electronic technology one step further and think instead of a collectivity of bibliographies on-line? We need to think in terms of using a whole electronic reference library, even multiple libraries, on-line.

Extended retrieval provides much wider opportunities, but it also increases the difficulties in selecting what is needed from so large and complex a universe.

References

1. Belkin, N.J., and W.B. Croft (1987). "Retrieval Techniques." Annual Review of Information Science and Technology 22: 109-145.

2. Buckland, M.K., A. Hindle, and P.M. Walker (1975). "Methodological Problems in Assessing the Overlap between Bibliographic Files and Library Holdings." Information Processing and Management 11: 89-105.

3. Buckland, M.K., and C.A. Lynch (1987). "The Linked Systems Protocol and the Future of Bibliographic Networks and Systems." Information Technology and Libraries 6: 83-88.

4. Buckland, M.K., and C.A. Lynch (1988). "National and International Implications of the Linked Systems Protocol for Online Bibliographic Systems." Cataloging and Classification Quarterly 8: 15-33.

5. Buckland, M.K., B.A. Norgard, and C. Plaunt (1992). "Design for an Adaptive Library Catalog." In: Networks, Telecommunications, and the Networked Revolution: Proceedings of the ASIS 1992 Mid-Year Meeting May 27-30, 1992. Silver Spring, Md.: American Society for Information Science, pp. 165-171.

6. Lynch, C.A. (1992). "Online Searching on the Internet: The Challenge of Information Semantics for Networked Information." In: Proceedings of the National Online Meeting, New York, May 5-7, 1992. Medford, NJ: Learned Information. Forthcoming.

7. Lynch, C.A., and C.M. Preston (1990). "Internet Access to Information Resources." Annual Review of Information Science and Technology 25: 263-312.

8. Stonebraker, M., and J. Dozier (1991). Large Capacity Object Servers to Support Global Change Research. SEQUOIA 2000 Report 91/1. Berkeley, Calif.: University of California, Electronics Research Laboratory.