|Expanding Access to Science and Technology (UNU, 1994, 462 pages)|
|Session 1: Access to science and technology and the information revolution|
|Keynote presentation: the impact of information technology on the access to science|
The potential benefits of storing numerical scientific data in computerized formats were recognized in the early days of digital computers. Geophysics was one of the first fields to take advantage of electronic data storage, driven by the enormous quantity of data obtained from instruments on satellites. Distribution of data on magnetic tapes began during the International Geophysical Year. Other disciplines followed somewhat later. In the field of physics, for example, the International Atomic Energy Agency began to exchange tapes of neutron cross-section data in the 1960s. Databases of crystallographic data and certain types of spectroscopic data were established by 1970.
One of the most important advances in the 1960s was the introduction of computer storage and retrieval of chemical structures. This was led by the Chemical Abstracts Service, which was faced with handling records on the millions of chemical compounds reported in the literature. Chemical structures represent a rather special type of data, essentially a record of the connections between atoms in the molecule. With this database available, it became possible to check a newly reported compound against the database and determine whether it actually was new. Furthermore, one can search the database and retrieve structures having specified features that are associated with particular chemical or biological activity. This has become a powerful tool for the development of new drugs.
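The idea of a structure record as "the connections between atoms" can be illustrated with a minimal sketch. It assumes a toy connection table (a list of atom symbols plus a dictionary of bonds, with hydrogens left implicit) and a brute-force comparison that is practical only for very small molecules. The names `same_structure` and `is_new_compound` are hypothetical, and this is not the method used by the Chemical Abstracts Service.

```python
from itertools import permutations

# A toy connection table: atom symbols plus bonds keyed by (atom index, atom index)
# with the bond order as value. Hydrogens are implicit to keep the example small.
ETHANOL = (["C", "C", "O"], {(0, 1): 1, (1, 2): 1})
ETHANOL_RENUMBERED = (["O", "C", "C"], {(0, 1): 1, (1, 2): 1})
DIMETHYL_ETHER = (["C", "O", "C"], {(0, 1): 1, (1, 2): 1})


def _bond_set(bonds, mapping):
    """Renumber bonds through `mapping`; each bond becomes direction-independent."""
    return {(frozenset((mapping[a], mapping[b])), order)
            for (a, b), order in bonds.items()}


def same_structure(mol_a, mol_b):
    """Brute force: does some renumbering of mol_a's atoms reproduce mol_b?"""
    atoms_a, bonds_a = mol_a
    atoms_b, bonds_b = mol_b
    if sorted(atoms_a) != sorted(atoms_b) or len(bonds_a) != len(bonds_b):
        return False
    target = _bond_set(bonds_b, list(range(len(atoms_b))))
    for perm in permutations(range(len(atoms_a))):
        # perm[i] is the index in mol_b assigned to atom i of mol_a.
        if all(atoms_a[i] == atoms_b[perm[i]] for i in range(len(atoms_a))):
            if _bond_set(bonds_a, perm) == target:
                return True
    return False


def is_new_compound(candidate, registry):
    """A compound is 'new' if no registered structure matches it."""
    return not any(same_structure(candidate, known) for known in registry)


registry = [ETHANOL]
print(is_new_compound(ETHANOL_RENUMBERED, registry))  # same molecule, atoms renumbered
print(is_new_compound(DIMETHYL_ETHER, registry))      # same atoms, different connections
```

The two test molecules contain the same atoms, so only the connection pattern tells them apart; that is the property that makes a structure database useful for novelty checks. Production systems rely on canonical numbering rather than trying every permutation.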
In spite of these successes, the introduction of computer storage and retrieval of scientific data on a broad scale did not begin until the 1980s. Two technological developments were mainly responsible. One was the introduction of the personal computer, which enabled individual scientists to gain access to data sources, avoiding the cost and frustrations of dealing with a mainframe computer. The other was the growth of low-cost telecommunications networks. Scientists now have many options for communicating with remote computers from their own desk or laboratory.
Of course, the availability of spectacular technology does not assure benefits to the scientific community. Indeed, some of the early demonstrations of computerized data retrieval were in the nature of a tour de force and did not offer any real improvement over the traditional means of data access. It is important to look realistically at the potential advantages of electronic data dissemination in comparison with traditional methods. The principal advantages are:
1. More efficient access: the search software allows more detailed indexing than is practical with printed works; expert systems provide a new dimension in leading the user to the data he needs.
2. Linkage of related databases: different databases, even in remote physical locations, can be searched and manipulated simultaneously.
3. Multidimensional searches: desired combinations of properties (attributes) can be specified and materials selected that satisfy those requirements. This is very difficult to do with print media.
4. Direct data transfer: retrieved data can be transferred directly to computer programs for further manipulation, thus avoiding additional effort and possible errors.
5. Simplicity of updating: data files can be updated quickly and inexpensively, in contrast to printed reference works.
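The multidimensional search named in the list above can be sketched in a few lines. The catalogue entries, property names, and numerical values below are hypothetical placeholders chosen for illustration; they are not taken from any real database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Material:
    name: str
    density_g_cm3: float
    melting_point_c: float
    tensile_strength_mpa: float


# Hypothetical catalogue with invented values, for illustration only.
CATALOGUE = [
    Material("alloy-A", 2.7, 650.0, 300.0),
    Material("alloy-B", 7.8, 1450.0, 550.0),
    Material("polymer-C", 1.2, 220.0, 60.0),
    Material("ceramic-D", 3.9, 2050.0, 350.0),
]


def search(catalogue, **ranges):
    """Return materials whose every named property lies within its (low, high) range."""
    return [
        m for m in catalogue
        if all(low <= getattr(m, prop) <= high for prop, (low, high) in ranges.items())
    ]


# Two properties constrained at once: light AND strong.
hits = search(
    CATALOGUE,
    density_g_cm3=(0.0, 4.0),
    tensile_strength_mpa=(250.0, float("inf")),
)
print([m.name for m in hits])  # ['alloy-A', 'ceramic-D']
```

With a printed handbook the same query would require scanning one table for density, another for strength, and reconciling the two by hand.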
Electronic access to scientific data is developing along two parallel, and to some extent competing, paths. The first is the on-line approach, where the data banks are maintained at one geographical location (or perhaps on several linked mainframes) and users access this computer over a telecommunications network. The second approach is to distribute individual databases on floppy disks or CD-ROM to each user for installation into his own PC or workstation. Both modes have their advantages and disadvantages. Networks generally offer more powerful software and provide the capability of accessing many different databases from one entry point. Files can be kept up to date and errors corrected in an orderly fashion. However, the cost tends to be high and many people feel a psychological barrier in using an on-line network when they are being charged for every minute they are connected to the system. When databases are provided on diskettes or other transportable media, the user can load the file into his PC and then work with it at leisure, without incurring any further cost. In this mode, innovative uses of databases would seem more likely. However, updates and revisions must be physically sent to each user, who must make a conscious effort to replace the old data with the new.
It seems likely that both modes of dissemination will go through further evolution, without one mode being completely displaced by the other. The outcome will depend on economic and sociological factors, as well as strictly technical ones.
The following is a brief overview of computerized numerical databases that are currently available in different areas of science. Since the number of databases is very large and new issues appear frequently, no claim can be made that this survey is comprehensive. However, an effort will be made to give a flavour of what is now available.
6.1 Physical Sciences and Engineering
Numerical databases in physics, chemistry, materials science, and related engineering fields range from very large, multi-megabyte data collections to low-priced diskettes intended for educational purposes. Some representative examples are given below, categorized by the scientific subject matter they cover.
6.1.1 Spectroscopic Data
Most spectroscopic databases are intended for use in identifying unknown chemical substances. Therefore, they tend to be very large. The most widely used are in mass spectroscopy, which is a common analytical technique for identifying chemicals in industrial process control, environmental monitoring, and many other areas. The NIST/EPA/NIH Mass Spectral Database, available from the National Institute of Standards and Technology in the United States, contains data on more than 60,000 compounds. The version of this database for personal computers is distributed on diskettes and CD-ROM; it includes search software that permits a user to match peaks in an unknown sample with peaks in the database and thereby identify the chemical compound (fig. 1). The Wiley Register of Mass Spectral Data is a similar database, with over 150,000 substances. Several carbon-13 NMR databases exist, some designed for installation in NMR spectrometers and some accessible on-line. The largest one, containing about 70,000 entries, is maintained by the Fachinformationszentrum (FIZ) in Karlsruhe, Germany. It can be accessed on-line through STN International. Many collections of infrared spectral data are available in electronic form. Again, some are installed in instruments and others are on-line. The Handbook of Data on Organic Chemistry (HODOC) database combines spectroscopic and physical property data in one database covering over 25,000 organic compounds. It is maintained by CRC Press and can be accessed on-line through STN International; a CD-ROM version is planned (figs. 2, 3, 4, 5).
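Matching the peaks of an unknown sample against a library can be sketched as follows. The sketch assumes spectra stored as mappings from integer m/z values to relative intensities and uses cosine similarity as the score; that scoring choice is an assumption made for illustration, not a description of the search software shipped with the NIST/EPA/NIH database. The library entries are hypothetical.

```python
import math


def cosine_similarity(spec_a, spec_b):
    """Cosine of the angle between two spectra treated as intensity vectors."""
    shared = set(spec_a) & set(spec_b)
    dot = sum(spec_a[mz] * spec_b[mz] for mz in shared)
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(unknown, library):
    """Return the name of the library spectrum most similar to the unknown."""
    return max(library, key=lambda name: cosine_similarity(unknown, library[name]))


# Hypothetical library: m/z -> relative intensity (invented values).
LIBRARY = {
    "compound-X": {15: 20.0, 43: 100.0, 58: 30.0},
    "compound-Y": {31: 100.0, 45: 50.0, 46: 20.0},
}

# A noisy measurement resembling compound-X, with one extra minor peak.
unknown = {15: 15.0, 27: 5.0, 43: 95.0, 58: 35.0}

print(best_match(unknown, LIBRARY))  # compound-X
```

A real library search would also have to cope with tens of thousands of reference spectra, usually by pre-filtering on the strongest peaks before scoring.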
Increasing attention is being given to integrated data banks that allow searching of several types of spectra simultaneously. This is potentially a powerful technique for establishing positive identification of chemical substances that may be present in complex mixtures. In Japan, the National Chemical Laboratory for Industry has developed a large Spectral Database System of this type, and in Germany the Chemical Concepts organization is working on a similar system called "SPECINFO."
Quality control in such large databases is not an easy task. Certain automated procedures have been developed, but a trained spectroscopist is often needed to resolve discrepancies between spectra obtained from different sources.
6.1.2 Crystallographic Data
Like spectroscopy, crystallography is a field of science characterized by the need to handle very large amounts of data. So, it is not surprising that crystallographers were among the first to adapt digital computer technology to their needs. The Cambridge Crystallographic Data Centre in the United Kingdom maintains a data bank on the crystal structures of organic compounds that contains over 80,000 substances. This database stores the three-dimensional coordinates of all the atoms in each molecule, which allows the molecular structure to be displayed in graphical form. One of the most important applications of the database is in drug design, where it aids in pinpointing complex chemical compounds whose three-dimensional structure includes features that are likely to produce the desired biological activity. The Cambridge Centre licenses the data bank to pharmaceutical and chemical companies and makes it available to academic research groups through affiliates in all major countries. It can also be searched on-line through the CAN/SND system in Canada.
Figure 1 The NIST/EPA/NIH Mass Spectral Database
Figure 2 The CRC Press Database on Properties
Figure 3 The CRC Press Database on Properties
Figure 4 The CRC Press Database on Properties
Figure 5 The CRC Computer-assisted Handbook
Another important crystallographic database is the NIST Crystal Data, which contains data on over 150,000 inorganic and organic crystals (figs. 6, 7, 8). This database does not contain the full structure, but only the lattice constants that are used to identify crystalline materials. Two other databases of this type are the Powder Diffraction File and the NIST/Sandia/ICDD Electron Diffraction Database. All of these are distributed by the International Centre for Diffraction Data.
6.1.3 Physical Properties of Chemical Substances
Two German data collections of great historical importance have recently become available in computerized versions. These are the Beilstein Database, covering a wide range of properties of organic compounds, and the Gmelin Database, which serves a similar function for inorganic chemistry. Both can be accessed on STN International and other networks. The HODOC Database, mentioned earlier, is a much smaller database of organic compounds, but combines physical property with spectroscopic information.
Figure 6 The NIST Structure and Properties Database
Figure 7 The NIST Structure and Properties Database
Figure 8 The NIST Structure and Properties Database
6.1.4 Thermodynamic Properties
Data on thermodynamic and thermophysical properties find wide use in both industry and basic research. Many databases in this category are available. The Design Institute for Physical Property Data (DIPPR) has created a data bank of about 1,200 chemical substances of highest importance to industry. The DIPPR Database includes programs that calculate about 20 properties as a function of temperature. It is an example of a system that combines a collection of numerical constants with powerful computational software that generates data for the exact conditions requested by the user. It is available in magnetic tape and diskette formats and is on-line on STN. Another data bank designed for chemical industry needs is DECHEMA, developed in Germany; it also includes data on mixtures. The Thermodynamic Research Center (TRC) at Texas A&M University offers a large database on STN; they also distribute a PC diskette that provides vapour pressure data on about 5,000 substances. The NIST Chemical Thermodynamics Database contains standard-state thermodynamic properties for about 15,000 inorganic substances, and the database version of the JANAF Thermochemical Tables covers high temperature data on 1,800 substances. Both are distributed by NIST in magnetic tape form and are available on STN. Other NIST databases cover pure fluid properties, hydrocarbon mixtures, refrigerants, water and steam, molten salts, and thermodynamic property estimation.
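The combination of stored constants with software that generates a value at the requested temperature can be illustrated with a small sketch. It uses the Antoine vapour-pressure correlation, log10(P) = A - B / (C + T), chosen here only because it is simple; it is not claimed to be the correlation used in the DIPPR data bank. The water constants are commonly quoted textbook values (P in mmHg, T in degrees Celsius, roughly 1 to 100 °C) and should be treated as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AntoineRecord:
    """Stored constants for log10(P / mmHg) = a - b / (c + T / degC)."""
    substance: str
    a: float
    b: float
    c: float
    t_min_c: float
    t_max_c: float


# Commonly quoted constants for water; an assumption, not taken from the source.
WATER = AntoineRecord("water", 8.07131, 1730.63, 233.426, 1.0, 100.0)


def vapour_pressure_mmhg(record, temperature_c):
    """Evaluate the stored correlation at the temperature the user asks for."""
    if not record.t_min_c <= temperature_c <= record.t_max_c:
        raise ValueError(
            f"{temperature_c} degC is outside the fitted range for {record.substance}"
        )
    return 10.0 ** (record.a - record.b / (record.c + temperature_c))


# Close to 760 mmHg (one atmosphere) at the normal boiling point.
print(round(vapour_pressure_mmhg(WATER, 100.0), 1))
```

Storing the fitted range alongside the constants lets the software refuse to extrapolate, which is one way such a system can protect users from misusing the data.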
Figure 9 The NIST Chemical Kinetics Database
6.1.5 Chemical Kinetics
Data on rates of chemical reactions are very important for environmental modelling and other applications. The NIST Chemical Kinetics Database contains data on over 5,000 gas-phase reactions (figs. 9, 10). This PC database displays data in both tabular and graphical form and permits the user to choose different models for fitting the data.
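Fitting a model to rate data can be sketched with the simplest common choice, the Arrhenius form k = A exp(-Ea / (R T)), which becomes a straight line when ln k is plotted against 1/T. The data below are synthetic, generated from assumed parameters so that the fit can be checked; nothing here reproduces the models or data of the NIST Chemical Kinetics Database.

```python
import math

R = 8.314462618  # molar gas constant, J/(mol K)


def fit_arrhenius(temps_k, rate_constants):
    """Least-squares line through (1/T, ln k); returns (A, Ea in J/mol)."""
    xs = [1.0 / t for t in temps_k]
    ys = [math.log(k) for k in rate_constants]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                  # equals -Ea / R
    intercept = mean_y - slope * mean_x  # equals ln A
    return math.exp(intercept), -slope * R


# Synthetic data generated from assumed parameters (not measured values).
A_TRUE = 1.0e10      # pre-exponential factor, 1/s
EA_TRUE = 50_000.0   # activation energy, J/mol
temps = [300.0, 350.0, 400.0, 450.0, 500.0]
rates = [A_TRUE * math.exp(-EA_TRUE / (R * t)) for t in temps]

a_fit, ea_fit = fit_arrhenius(temps, rates)
print(f"A = {a_fit:.3e} 1/s, Ea = {ea_fit / 1000.0:.2f} kJ/mol")
```

Because the synthetic points lie exactly on an Arrhenius curve, the fit recovers the assumed parameters; real measurements scatter, which is why a choice among alternative fitting models is useful.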
6.1.6 Nuclear and Particle Physics
The International Atomic Energy Agency (IAEA) in Vienna has led the international efforts to create databases of neutron cross-sections and other nuclear properties that are essential to the design of nuclear power reactors. These databases are distributed as magnetic tapes by the IAEA, as well as by national organizations such as the Nuclear Data Center of the Japan Atomic Energy Research Institute and the National Nuclear Data Center at Brookhaven National Laboratory in the United States. Data on fundamental particles and other data for high energy physics may be obtained on magnetic tape from the Particle Data Center at Lawrence Berkeley Laboratory.
Figure 10 The NIST Chemical Kinetics Database
6.1.7 Engineering Materials
An extensive effort has been organized in the last few years to create databases on properties of materials of engineering interest. These emphasize structural materials such as metals, ceramics, polymers, and composites, but also include special types such as electronic materials. The effort has been led by the Materials Properties Data Network (MPDN), a non-profit organization established in the United States for the purpose of encouraging the production of databases and the development of an on-line delivery system. The MPDN is now operating under the Chemical Abstracts Service, and its databases form a separate module of STN International. Databases now available to the public on MPDN cover aluminium and copper alloys, steels, plastics, structural ceramics, and advanced materials for aerospace design. A sophisticated menu-driven interface helps to lead users to the right database, and results can be displayed in any desired set of units.
Many other groups are developing materials databases for specialized applications. Groups in France have been especially active in this area. The NIST in the United States distributes PC databases on corrosion performance, phase diagrams of ceramics, and tribology. CODATA has an active Task Group on Materials Database Management that serves as a coordinating mechanism for these various efforts.
6.2 Biosciences
The systematic collection of information on the taxonomy and behaviour of biological species has a long and distinguished history. Numerous efforts are in progress to convert this type of information from paper to computerized form. The International Union of Biological Sciences (IUBS) has a Commission on Plant Taxonomic Databases that is attempting to establish standards for data exchange and promote cooperation between the databases. It is now studying how the community of taxonomic institutions and database developers might design and organize a global database system that would cover all the world's plants. This is clearly a long-range task, but the first steps of agreeing on nomenclature and adopting data exchange formats are in progress. Similar efforts have been started in zoology; an example is the MEDIFAUNE Databank, which covers the fauna of the Mediterranean region.
It is a daunting challenge to computerize massive amounts of data, some of it going back several centuries. One of the major problems is the variability in names of plant and animal species. The intelligent design of a database (or a series of linked databases) requires absolute conformity to an agreed-upon terminology. In an effort to alleviate this problem, CODATA has established a Commission on Standardized Terminology for Access to Biological Databanks. The plan is for this group to coordinate the efforts of the ICSU unions and other international bodies to reach agreement on standardized terminology for use in data banks.
The situation is different in the realm of cellular and molecular biology. Since these are much newer fields, such a large backlog of data does not exist, and it is easier to establish mechanisms that facilitate the transfer of new data from the laboratory to data banks in a systematic fashion. In microbiology, a detailed coding scheme was published in 1988 under the auspices of CODATA and the International Union of Microbiological Sciences (IUMS). This scheme provides standardized codes for all the data elements likely to be of interest in a data bank on micro-organisms. At the molecular biology level, CODATA established a Task Group on Protein Sequence Databanks in 1984. This group has promoted collaboration among the organizations throughout the world that maintain protein sequence data banks. A standard interchange format has been adopted that permits groups to exchange sequence data easily, even though each group has its own computer hardware and software.
The greatest challenge in the area of biological data banks is DNA sequences, especially the human genome. In the brief period since techniques for determining DNA sequences in chromosomes were first introduced, an immense amount of data has been accumulated. Even so, it is small compared to the 3 billion base pairs of the human genome. Major programmes have been started in several countries to map the human genome and eventually determine the full DNA sequence. Fortunately, the problem of storing all this data in data banks has been faced before the sequencing began on a large scale. Genome data banks have been established at the National Center for Biotechnology Information, part of the National Library of Medicine in the United States, the European Molecular Biology Laboratory in Germany, Los Alamos Scientific Laboratory (GENBANK), and elsewhere. Mechanisms are in place for these groups to maintain common standards and formats.
From this discussion, it is evident that the development of publicly accessible computerized data banks is at an earlier stage in the biosciences than in the physical sciences. Nevertheless, a considerable amount of biological data can already be accessed. The Microbial Strain Data Network (MSDN) was established about five years ago with support from UNEP, CODATA, and other organizations. The MSDN serves as a gateway to a wide range of reference sources on micro-organisms and cell strains, some in computerized form and others not. The CODATA/IUIS Hybridoma and Monoclonal Antibody Databank provides data on about 20,000 hybridomas. Both protein and DNA sequence data are distributed in magnetic tape form by the groups that maintain the various databases. Mention should also be made of the Protein Data Bank at Brookhaven National Laboratory in the United States, which distributes tapes with data on the three-dimensional structures of protein molecules.
6.3 Geosciences
Some areas of the geosciences have a long-established pattern of compiling and organizing observational data. Geology and astronomy are examples of fields where records go back many centuries. Fields such as oceanography and meteorology also feature a considerable amount of historical data, but modern measurement techniques have led to a major expansion in the quantity of data that must be managed. Finally, remote sensing measurements from satellites and space probes are now causing a staggering data explosion.
This great size and diversity of geoscience data have given rise to data centres that maintain a multiplicity of parameter-specific data banks and provide a dissemination service to the scientific community. The system of World Data Centers coordinated by the ICSU Panel on World Data Centers links about 40 centres, each supported by its own national government. The centres exchange information and maintain some duplicate records to prevent loss in case of natural disasters. While many of the older records are in the form of paper, microfilm, and photographs, an increasing part of the holdings of these centres is in computerized form.
The scope of the World Data Centers is now described as "geophysical, solar, and environmental." Among the many topics covered are seismology, volcanology, geomagnetism, cosmic rays, solar emissions, and tsunamis. The individual centres disseminate computerized data to the scientific community in the form of magnetic tapes and optical disks. A summary of the services available may be found in the ICSU publication Guide to the World Data Center System.
Another ICSU activity is represented by the Federation of Astronomical and Geophysical Data Services (FAGS), which links 10 permanent data services conducted by several scientific unions. The participating centres analyse observational data from all parts of the world, checking for quality and consistency. Subjects covered include geomagnetic indexes, sunspot observations, gravity variations, and precise data on the rotation of the earth. Also included is the stellar data centre in Strasbourg, France, which maintains extensive computerized records on the positions and other features of stars.
Research on the atmosphere and oceans is another area where large quantities of data are being generated. These data come from satellite observations as well as land-based and ship-based measurements. The World Ocean Circulation Experiment (WOCE) and the Tropical Ocean Global Atmosphere (TOGA) are examples of scientific programmes that have the objective of synthesizing these data through the use of global models. Data exchange in oceanography is coordinated by the Intergovernmental Oceanographic Commission (IOC), while the World Meteorological Organization (WMO) serves a similar function in meteorology.
Many centres in different countries serve as distribution points for geoscience data. Taking the United States as an example, meteorological data are handled by the National Climatic Data Center in Asheville, NC; space science data by the National Space Science Data Center in Greenbelt, Md.; data on the oceans by the National Oceanographic Data Center in Washington, D.C.; various geophysical and geological records by the National Geophysical Data Center in Boulder, Colo.; and so on. Each of these centres maintains tens of thousands of magnetic tapes and provides copies on request. Much concern has been expressed about the longevity of these tapes, and studies are in progress on ways to assure that scientists will have access to the data a century from now.
In view of this bewildering assortment of geoscience data sources, most users need a great deal of help in locating what they want. In an effort to put some order into the situation, the National Space Science Data Center has developed a Global Change Master Directory. The Master Directory, which can be accessed on-line, utilizes software that switches the user to actual data records, which may exist in many different computer centres throughout the world. If a direct connection is not possible, the user is told how to contact other centres that may help him. The National Oceanic and Atmospheric Administration (NOAA) is using the same software to provide access to some of its data.
Finally, a very interesting project designed to provide geoscience data at low cost should be mentioned. This is the Global Change Database Project (GCDP), developed by the ICSU Panel on World Data Centers with the aid of the US National Geophysical Data Center, which is intended to provide data at medium or low spatial resolution in forms that can be easily used by individual scientists. The pilot phase produced a diskette displaying vegetation index data for Africa. Plans call for other types of data to be added, such as topography, soil type, land use, ecosystem classification, and climate summaries. A Global Ecosystem Database in CD-ROM format is now being tested. Emphasis is being given to research needs in developing countries, and the pilot diskettes have already been used in training workshops under the UNEP/GRID programme.