Rule Induction for Semantic Query Optimization
Chun-Nan Hsu and Craig A. Knoblock
Information Sciences Institute and Department of Computer Science
University of Southern California
4676 Admiralty Way, Marina del Rey, CA 90292 fchunnan,firstname.lastname@example.org
Semantic query optimization can dramatically speed up database query answering by knowledge intensive reformulation. But the problem of how to learn required semantic rules has not previously been solved. This paper describes an approach using an inductive learning algorithm to solve the problem. In our approach, learning is triggered by user queries and then the system induces semantic rules from the information in databases. The inductive learning algorithm used in this approach can select an appropriate set of relevant attributes from a potentially huge number of attributes in real-world databases. Experimental results demonstrate that this approach can learn sufficient background knowledge to reformulate queries and provide a 57 percent average performance improvement.
Speeding up a system's performance is one of the major goals of machine learning. Explanation-based learning is typically used for speedup learning, while applications of inductive learning are usually limited to data classifiers. In this paper, we present an approach in which inductively learned knowledge is used for semantic query optimization to speed up query answering for data/knowledge-based systems.
The principle of semantic query optimization (King 1981) is to use semantic rules, such as all Tunisian seaports have railroad access, to reformulate a query into a less expensive but equivalent query, so as to reduce the query evaluation cost. For example, suppose we have a query to find all Tunisian seaports with railroad access and 2,000,000 ft3 of storage space. From the rule given above, we can reformulate the query so that there is no need to check the railroad access of seaports, which may save some execution time.
Many algorithms for semantic query optimization have been developed (Hsu & Knoblock 1993; King 1981; Shekhar, Srivastava, & Dutta 1988; Shenoy & Ozsoyoglu 1989). Average speedup ratios from 20 to 40 percent using hand-coded knowledge are reported in the literature. This approach to query optimization has gained increasing attention recently because it is applicable to almost all existing data/knowledge-base systems. This feature makes it particularly suitable for intelligent information servers connected to various types of remote information sources.
A critical issue of semantic query optimization is how to encode useful background knowledge for reformulation. Most of the previous work in semantic query optimization in the database community assume that the knowledge is given. (King 1981) proposed using semantic integrity constraints for reformulation to address the knowledge acquisition problem. Examples of semantic integrity constraints are The salary of an employee is always less than his manager's, and Only female patients can be pregnant. However, the integrity rules do not reflect properties of the contents of databases, such as related size of conceptual units, cardinality and distribution of attribute values. These properties determine the execution cost of a query. Moreover, integrity constraints rarely match query usage patterns. It is difficult to manually encode semantic rules that both reflect cost factors and match query usage patterns. The approach presented in this paper uses example queries to trigger the learning in order to match query usage patterns and uses an inductive learning algorithm to derive rules that reflect the actual contents of databases.
An important feature of our learning approach is that the inductive algorithm learns from complex real-world information sources. In these information sources, data objects are clustered into conceptual units. For example, the conceptual unit of a relational databases is a relation, or simply a table. For object-based databases, it is a class. In descriptionlogic knowledge bases (Brachman & Schmolze 1985), data instances are clustered into concepts. Each con-