Substructure Discovery Using Minimum Description Length and Background Knowledge

D. J. Cook and L. B. Holder

Volume 1, 1994

Links to Full Text:

The ability to identify interesting and repetitive substructures is an essential component to discovering knowledge in structural data. We describe a new version of our SUBDUE substructure discovery system based on the minimum description length principle. The SUBDUE system discovers substructures that compress the original data and represent structural concepts in the data. By replacing previously-discovered substructures in the data, multiple passes of SUBDUE produce a hierarchical description of the structural regularities in the data. SUBDUE uses a computationally-bounded inexact graph match that identifies similar, but not identical, instances of a substructure and finds an approximate measure of closeness of two substructures when under computational constraints. In addition to the minimum description length principle, other background knowledge can be used by SUBDUE to guide the search towards more appropriate substructures. Experiments in a variety of domains demonstrate SUBDUE's ability to find substructures capable of compressing the original data and to discover structural concepts important to the domain.

Description of Online Appendix: This is a compressed tar file containing the SUBDUE discovery system, written in C. The program accepts as input databases represented in graph form, and will output discovered substructures with their corresponding value.

Extracted Text

Journal of Artificial Intelligence Research 1 {1994} 231-255 Submitted 12/93;
published 2/94
 Substructure Discovery Using Minimum Description
 Length and Background Knowledge Diane J. Cook Lawrence
B. Holder Department of Computer Science Engineering Box
 University of Texas at Arlington Arlington, TX 76019 USA
 Abstract The ability to identify interesting and repetitive substructures
isan essential compo- nent to discovering knowledge in structural data. We
describe a new version of our Sub-
 due substructure discovery system based on the minimum description length
principle. The Subdue system discovers substructures that compress the original
data and represent structural concepts in the data. By replacing previously-discovered
substructures inthe data, multiple passes of Subdue produce a hierarchical
description of the structural reg- ularities in the data. Subdue uses a computationally-bounded
inexact graph match that identifies similar, but not identical, instances
of a substructure and finds an approximate measure of closeness of two substructures
when under computationalconstraints. Inaddi- tion to the minimum description
length principle, other background knowledge can be used by Subdue to guide
the search towards more appropriate substructures. Experiments in
 a variety of domains demonstrate Subdue's abilityto find substructures capable
of com-
 pressing the original data and to discover structural concepts importantto
the domain. 1. Introduction
 The large amount of data collected today is quickly overwhelming researchers'
abilities to
 interpret the data and discover interesting patterns within the data. In
response to this
 problem, a number of researchers have developed techniques for discovering
concepts in
 databases. These techniques work well for data expressed in a non-structural,
 value representation, and address issues of data relevance, missing data,
noise and uncer-
 tainty, and utilization of domain knowledge. However, recent data acquisition
 are collecting structural data describing the relationships among the data
objects. Corre-
 spondingly, there exists a need for techniques to analyze and discover concepts
in structural
 One method for discovering knowledge in structural data is the identification
of com-
 monsubstructures within the data. The motivation for this process is to
find substructures
 capable of compressing the data and to identify conceptually interesting
substructures that
 enhance the interpretation of the data. Substructurediscovery is the process
of identifying
 concepts describing interesting and repetitive substructures within structural
data. Once
 discovered, the substructure concept can be used to simplify the data by
replacing instances of the substructure with a pointer to the newly discovered
concept. The discovered sub- structure concepts allow abstraction over detailed
structure in the original data and provide c
 fl1994 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved.Cook
& Holder
 new, relevant attributes for interpreting the data. Iteration of the substructure
 and replacement process constructs a hierarchical description of the structural
data in terms of the discovered substructures. This hierarchy provides varying
levels of interpretation that
 can be accessed based on the goals of the data analysis.
 We describe a system called Subdue {Holder, Cook, & Bunke, 1992; Holder
& Cook,
 1993} that discovers interesting substructures in structural data based
on the minimum
 description length principle. The Subdue system discovers substructures
that compress
 the original data and represent structural concepts in the data. By replacing
 discoveredsubstructures in the data, multiple passes of Subdue produce a
hierarchical de- scription of the structural regularities in the data. Subdue
uses a computationally-bounded inexact graph match that identifies similar,
but not identical, instances of a substructure and finds an approximate measure
of closeness of two substructures when under computational
 constraints. In addition to the minimum description length principle, other
 knowledgecan be used by Subdue to guide the search towards more appropriate
 The following sections describe the approach in detail. Section 2 describes
the process of substructure discovery and introduces needed definitions.
Section 3 compares the Subdue discovery system to other work found in the
literature. Section 4 introduces the minimum description length encoding
used by this approach, and Section 5 presents the inexact
 graph match algorithm employed by Subdue. Section 6 describes methods of
 background knowledge into the substructure discovery process. The experiments
 in Section 7 demonstrate Subdue's ability to find substructures that compress
the data and
 to re-discover known concepts in a variety of domains. Section 8 details
the hierarchical
 discovery process. Weconclude with observations and directions for future
 2. Substructure Discovery
 The substructure discovery system represents structured data as a labeled
graph. Objects
 in the data map to vertices or small subgraphs in the graph, and relationships
between objectsmap to directed or undirected edges in the graph. A substructure
is a connected subgraph within the graphical representation. This graphical
representation serves as input to the substructure discovery system. Figure
1 shows a geometric example of such an input
 graph. The objects in the figure {e.g., T1, S1, R1} become labeled vertices
in the graph, and
 the relationships {e.g.,on{T1,S1}, shape{C1,circle}} become labeled edges
in the graph. The graphical representation of the substructure discovered
by Subdue from this data is also shown in Figure 1.
 An instance of a substructure in an input graph is a set of vertices and
edges from
 the input graph that match, graph theoretically, tothe graphical representation
of the
 substructure. For example, the instances of the substructure in Figure 1
are shown in
 Figure 2.
 The substructure discovery algorithm used by Subdue is a computationally-constrained
 beam search. The algorithm begins with the substructure matching a single
vertex in the
 graph. Each iteration through the algorithm selects the best substructure
and expands the
 instances of the substructure by one neighboring edge in all possible ways.
The new unique
 generated substructures become candidates for further expansion. The algorithm
 232Substructure DiscoveryS1T1T2T3T4S2S3S4C1R1SubstructureInput GraphonshapeshapetrianglesquareFigure
1: Examplesubstructure in graph form.
T2S2T3S3T4S4T1S1Instance  1       Instance  2       Instance  3       Instance
4Figure 2: Instances of the substructure. for the best substructure until
all possible substructures have been considered or the total amountof computation
exceeds a given limit. The evaluation of each substructure is guided by the
MDL principle and other background knowledge provided by the user.
 Typically, once the description length of an expanding substructure begins
to increase,
 further expansion of the substructure will not yield a smaller description
length. As a
 result, Subdue makes use of an optional pruning mechanism that eliminates
substructure expansions from consideration when the description lengths for
these expansions increases.
 3. Related Work
 Several approaches to substructure discovery have been developed. Winston's
Arch pro- gram {Winston, 1975} discovers substructures in order to deepen
the hierarchical description of a scene and to group objects into more general
concepts. The Arch program searches for
 two types of substructure in the blocks-world domain. The first type involves
a sequence
 of objects connected by a chain of similar relations. The second type involves
a set of
 objects each having a similar relationship to some \grouping" object. The
main difference
 between the substructure discovery procedures used by the Arch program and
Subdue is
 that the Arch program is designed specifically for the blocks-world domain.
For instance, the sequence discovery method looks for supported-by and in-front-of
relations only. Subdue's substructure discovery method is domain independent,
although the inclusion of domain-specific knowledge would improve Subdue's
 Motivated by the need to construct a knowledge base of chemical structures,
 {Levinson, 1984} developed a system for storing labeled graphs in which
individual graphs
 233Cook & Holder
 are represented by the set of vertices in a universal graph. In addition,
the individual graphs are maintained in a partial ordering defined by the
subgraph-of relation, which improves the performance of graph comparisons.
The universal graph representation provides a method for compressing the
set of graphs stored in the knowledge base. Subgraphs of
 the universal graph used by several individual graphssuggest common substructure
in the
 individual graphs. One difference between the two approaches is that Levinson's
 isdesigned to incrementally process smaller individual graphs;whereas, Subdue
 larger graphs all at once. Also, Levinson's system discovers common substructure
 as an indirect result of the universal graph construction; whereas, Subdue's
main goal is to discover and output substructure definitions that reduce
the minimum description
 length encoding of the graph. Finally,the subgraph-of partial ordering used
by Levinson's system is not included in Subdue, but maintaining this partial
ordering would improve the performance of the graph matching procedure by
pruning the number of possible matching graphs.
 Segen {Segen, 1990} describes a system for storing graphs using a probabilistic
graph model to represent subsets of the graph. Alternativemodels are evaluated
based on a min- imum description length measure of the information needed
to represent the stored graphs
 using the model. In addition, Segen's system clusters the graphs into classes
based on minimizing the description length of the graphs according to the
entire clustering. Apart from the probabilistic representation, Segen's approach
is similar to Levinson's system in that both methods take advantage of commonalities
in the graphs to assist in graph stor- age and matching. The probabilistic
graphs contain information for identifying common
 substructure in the exact graphs they represent. The portion of the probabilistic
graph withhigh probability defines a substructure that appears frequently
in the exact graphs. This notion was not emphasized in Segen's work, but
provides an alternative method to substructure discovery by clustering subgraphs
of the original input graphs. As with Levin- son's approach, graphs are processed
incrementally, and substructure is found across several graphs, not within
a single graph as in Subdue.
 The Labyrinth system {Thompson & Langley, 1991} extends the Cobweb incremental
conceptual clustering system {Fisher, 1987} to handle structured objects.
 Cobweb to form hierarchical concepts of the individual objectsin the domain
based on their primitive attributes. Concepts of structured objects are formed
in a similar manner using the individual objectsas attributes. The resulting
hierarchy represents a componential
 model of the structured objects. Because Cobweb's concepts are probabilistic,
 produces probabilistic models of the structured objects, but with an added
 organization. The upper-level components of the structured-object hierarchy
produced by Labyrinth represent substructures common to the examples. Therefore,
although not the
 primary focus, Labyrinth is discovering substructure, but in a more constrained
context than the general graph representation used by Subdue.
 Conklin et al. {Conklin & Glasgow, 1992} have developed the i-mem system
for con-
 structing an image hierarchy, similar to that of Labyrinth, used for discovering
 substructuresin a set of images and for efficient retrieval of images similar
to a given image.
 Images are expressed in terms of a set of relations defined by the user.
Specificand general
 {conceptual} images are stored in the hierarchy based on a subsumption relation
 234Substructure Discovery to Levinson's subgraph-of partial ordering. Image
matching utilizes a transformational approach{similar to Subdue's inexact
graph match} as a measure of image closeness. As with the approaches of Segen
and Levinson, i-mem is designed to process individual images. Therefore,
the general image concepts that appear higher in i-mem's hierarchy will represent
common substructures across several images. Subdue is designed to discover
common substructures within a single image. Subdue can mimic the individual
approach of these systems by processing a set of individual imagesas one
disconnected graph. The
 substructures found will be common to the individual images. The hierarchy
also represents a componential view of the images. This same view can be
constructed by Subdue using
 multiplepasses over the graph after replacing portions of the input graph
with substructures
 discovered during previous passes. i-mem has performed well in a simple
chess domain
 and molecular chemistry domains {Conklin & Glasgow, 1992}. However, i-mem
 domain-specific relations for expressing images in order for the hierarchy
to find relevant
 substructures and for image matching to be efficient. Again, maintaining
the concepts {images,graphs} in a partially-ordered hierarchy improves the
efficiency of matching and retrieval, and suggests a possible improvement
to Subdue.
 The CLiP system {Yoshida, Motoda, & Indurkhya, 1993} for graph-based induction
 more similar to Subdue than the previous systems. CLiP iteratively discovers
patterns in
 graphs by expanding and combining patterns discovered in previous iterations.
 are grouped into views based on their collective ability to compress the
original input
 graph. During each iteration CLiP uses existing views to contract the input
graph and then considers adding to the views new patterns consisting of two
vertices and an edge from the contracted graph. The compression of the new
proposed views is estimated, and the best views {according to a given beam
width} are retained for the next iteration.
 CLiP discovers substructures {patterns} differently than Subdue. First,
CLiP produces a set of substructures that collectively compress the input
graph; whereas, Subdue produces only single substructures evaluated using
the more principled minimum description length.
 CLiP has the ability to grow substructures agglomeratively {i.e., merging
two substructures together}; whereas, Subdue always produces new substructures
using incremental growth along one new edge. CLiP initially estimates the
compression value of new views based on the compression value of the parent
view; whereas, Subdue performs an expensive exact measurement of compression
for each new substructure. Finally, CLiP employs an efficient
 graph match based on graph identity, not graph isomorphism as in Subdue.
Graph identity
 assumes an ordering over the incident edges of a vertex and does not consider
all possible
 mappings when looking for occurrences of a pattern in an input graph. These
 in CLiP suggest possible enhancements to Subdue.
 Research in pattern recognition has begun to investigate the use of graphs
and graph
 grammars as an underlying representation for structural problems {Schalkoff,
1992}. Many
 resultsin grammatical inference are applicable to constrained classes of
graphs {e.g., trees}
 {Fu, 1982; Miclet, 1986}. Theapproach begins with a set of sample graphs
and produces a generalized graph grammar capable of deriving the original
sample graphs and many others.
 The production rules of this general grammar capture regularities {substructures}
in the
 sample graphs. Jeltschand Kreowski {Jeltsch & Kreowski, 1991} describe an
approach that
 begins with a maximally-specific grammarand iteratively identifies common
subgraphs in
 the right-hand sides of the production rules. These common subgraphs are
used to form
 235Cook & Holder
 new,more general production rules. Although their method does not address
the underlying
 combinatorial nondeterminism, heuristic approaches could provide a feasible
method for
 extracting substructures in the form of graph grammars. Furthermore, the
graph grammar
 production-rule may provide a suitable representation for background knowledge
during the substructure discovery process.
 4. Minimum Description Length Encoding of Graphs
 The minimum description length principle {MDLP}introduced by Rissanen {Rissanen,
 1989} states that the best theory to describe a set of data is that theory
which minimizes
 thedescription length of the entire data set. The MDL principle hasbeen
used for decision
 tree induction {Quinlan & Rivest, 1989}, image processing {Pednault, 1989;
Pentland, 1989;
 Leclerc, 1989},concept learning from relational data {Derthick, 1991}, and
learning models
 of non-homogeneous engineering domains {Rao & Lu, 1992}.
 We demonstrate how the minimum description length principle canbe used to
 substructures in complex data. In particular, a substructure is evaluated
based on how well it can compress the entire dataset using the minimum description
length. We define the minimum description length of a graph to be the number
of bits necessary to completely describe the graph.
 According to the minimum description length {MDL} principle, the theory
that best
 accounts for a collection of data is the one that minimizes I{S} + I{GjS},
where S is the
 discovered substructure, G is the input graph, I{S} is the number of bits
required to encode the discovered substructure, and I{GjS} is the number
of bits required to encode the input
 graph G with respect to S.
 The graph connectivity can be represented by an adjacency matrix. Consider
a graph
 that has n vertices, which are numbered 0; 1; : : :; n 000 1. Ann 002
n adjacency matrix A can
 be formed with entry A[i; j] set to 0 or 1. If A[i; j] = 0, then there is
no connection from
 vertex i to vertex j. If A[i; j] = 1, then there is at least one connection
from vertex i to
 vertex j. Undirected edges are recorded in only one entry of the matrix.
The adjacency
 matrixfor the graph in Figure 3 is shown below.
 square r
 2 6
 6 6
 0 1 1 0 0 0 0 0 0 0 0 0
 0 0 0 1 1 0 0 0 0 0 0 0
 0 0 0 0 0 1 0 0 0 0 0 0
 7 7
 The encoding of the graph consists of the following steps. We assume that
the decoder
 has a table of the l
 unique labels in the original graph G.
 1. Determine the number of bits vbits needed to encode the vertex labels
of the graph.
 First, we need {lg v} bits to encode the number of vertices v in the graph.
 encoding the labels of all v vertices requires {v lg l
 u } bits. We assume the vertices are specified in the same order they appear
in the adjacency matrix. The total number of bits to encode the vertex labels
is vbits = lg v + v lg l
 236Substructure DiscoveryshapeshapetrianglesquarexonyonshaperectanglerFigure
3: MDLexample graph.
 For the example in Figure 3, v = 6, and we assume that there are l
 u = 8 unique
 labels in the original graph. The number of bits needed to encode these
vertices is
 lg 6 + 6 lg 8 = 20:58 bits.
 2. Determine the number of bits rbits needed to encode the rows of the adjacency
 A. Typically, in large graphs, a single vertex has edges to only a small
percentage of
 the vertices in the entire graph. Therefore, a typical row in the adjacency
matrix will
 have much fewer than v 1s, where v is the total number of vertices in the
graph. We
 apply a variant of the coding scheme used by {Quinlan & Rivest, 1989} to
encode bit strings with length n consisting of k 1s and {n 000 k} 0s, where
k 034 {n 000 k}. In our
 case, row i {1 024 i 024 v} can be represented as a bit string of length
v containing k
 1s. Ifwe let b= max
 k i
 , then the i
 row of the adjacency matrix can be encoded as
 {a} Encoding the value of k
 i requires lg{b + 1} bits. {b} Given that only k
 1s occur in the row bit string of length v, only
 021 strings
 of 0s and 1s are possible. Since all of these strings have equal probability
 020 v
 021 bits are needed to encode the positions of 1s in row i. The
 value of v is known from the vertex encoding. Finally, we need an additional
lg{b + 1} bits to encode the number of bits needed to
 specify the value of k
 for each row. The total encoding length in bits for the adjacency
 matrix is
 rbits = lg{b + 1} +
 X i=1
 lg{b + 1} + lg 020
 i 021
 = {v + 1} lg{b + 1}
 021 237Cook & Holder
 For the example in Figure 3, b = 2, and the number of bits needed to encode
 adjacency matrix is {7lg 3}+lg
 020 6
 +lg 020
 021 +lg
 020 6
 +lg 020
 021 +lg
 = 21:49 bits.
 3. Determine the number of bits ebits needed to encode the edges represented
by the
 entries A[i; j] = 1 of the adjacency matrix A. The number of bits needed
to encode entry A[i; j] is {lg m} + e{i; j}[1 + lg l
 ], where e{i; j} is the actual number of edges between vertex i and j in
the graph and m = max
 i;j e{i; j}. The {lg m} bits are needed
 to encode the number of edges between vertex i and j, and [1 + lg l
 ] bits are needed
 per edge to encode the edge label and whether the edge is directed or undirected.
 additionto encoding the edges, we need to encode the number of bits {lg
m} needed
 to specify the number of edges per entry. Thetotal encoding of the edges
 ebits = lg m +
 X i=1
 lg m + e{i; j}[1 + lg l
 = lg m + e{1 + lg l
 } +
 v X
 A[i; j] lg m
 = e{1 + lg l
 } + {K + 1} lg m
 where e is the number of edges in the graph, and K is the number of 1s in
the adjacency
 matrix A. For the example in Figure 3, e = 5, K = 5, m = 1, l
 = 8, and the number
 of bits needed to encode the edges is 5{1 + lg 8} + 6 lg 1 = 20.
 The total encoding of the graph takes {vbits + rbits + ebits} bits. For
the example in
 Figure 3, this value is 62:07 bits.
 Both the input graph and discovered substructure can be encoded using the
 scheme. After a substructure is discovered, each instance of the substructure
in the input
 graph is replaced by a single vertex representing the entire substructure.
The discovered
 substructure is represented in I{S} bits, and the graph after the substructure
replacement is
 represented in I{GjS}bits. Subdue searches for the substructure S in graph
G minimizing
 I{S} + I{GjS}.
 5. Inexact Graph Match
 Although exact structure match can be used to find many interesting substructures,
 of the most interesting substructures show up in a slightly different form
throughout the
 data. These differences may be due to noise and distortion, or may just
illustrate slight differences between instances of the same general class
of structures. Consider the image shownin Figure 9. The pencil and the cube
would make ideal substructures in the picture, but an exact match algorithm
may not consider these as strong substructures, because they rarely occur
in the same form and level of detail throughout the picture.
 Givenan input graph and a set of defined substructures, we want to find
those subgraphs
 of the input graph that most closely resemble the given substructures. Furthermore,
we want
 to associate a distance measure between a pair of graphs consisting of a
given substructure
 and a subgraph of the input graph. We adopt the approach to inexact graph
match given
 by Bunke and Allermann {Bunke & Allermann, 1983}.
 238Substructure Discovery45123ABBBAababg1g2baFigure 4: Two similar graphs
 and g
 2 .
 Inthis inexact match approach, each distortion of a graph is assigned a
cost. A distortion
 is described in terms of basic transformations such as deletion, insertion,
and substitution
 of vertices and edges. The distortion costs can be determined by the user
to bias the match
 for or against particular types of distortions.
 An inexact graph match between two graphs g
 and g
 maps g
 to g 2
 such that g
 interpreted as a distorted version of g
 . Formally, an inexact graph match is a mapping
 f : N
 ! N 2
 [f025g, where N
 and N
 2 are the sets of vertices of g 1
 and g
 2 , respectively. A
 vertex v 2 N
 that is mapped to 025 {i.e., f {v} = 025} is deleted. That is, it has
no corresponding
 vertex in g
 . Given a set of particular distortion costs as discussed above, we define
the cost
 of an inexact graph match cost{f }, as the sum of the cost of the individual
 resultingfrom f , and we define matchcost{g
 ; g
 } as the value of the least-cost function that
 maps graph g
 1 onto graph g
 Given g
 , g
 , and a set of distortion costs, the actual computation of matchcost{g
 ; g
 can be determined using a tree search procedure. A state in the search tree
corresponds to
 a partial match that maps a subset of the vertices of g
 to a subset of the vertices in g
 Initially, we start with an empty mapping at the root of the search tree.
Expandinga state
 corresponds to adding a pair of vertices, one from g 1
 and one from g
 , to the partial mapping
 constructed so far. A final state in the search tree is a match that maps
all vertices of g
 or to 025. The complete search tree of the example in Figure 4 is shown
in Figure 5. For
 this example we assign a value of 1 to each distortion cost. The numbers
in circles in this
 figure represent the cost of a state. As we are eventually interested in
the mapping with
 minimum cost, each state in the search tree gets assigned the cost of the
partial mapping
 that it represents. Thus the goal state to be found by our tree search procedure
is the finalstate with minimum cost among all final states. From Figure 5
we conclude that the minimum cost inexact graph match of g
 and g
 2 is given by the mapping f {1} = 4, f {2} = 3.
 The cost of this mapping is 4.
 Given graphs g
 with n vertices and g
 with m vertices, m 025 n, the complexity of the full inexact graph match
is O{n
 }. Because this routine is used heavily throughout the 239Cook & Holder
{1, 3}{1, 4}{1, 5}{1,  }{2,4} {2,5} {2, } {2,3} {2,5} {2, } {2,3} {2,4} {2,
} {2,3} {2,4} {2,5} {2, }101761036977109109111Figure 5: Search tree for computing
 ,g 2
 } from Figure 4.
 discovery and evaluation process, the complexity of the algorithm can significantly
 the performance of the system.
 To improve the performance of the inexact graph match algorithm, we extend
 approach by applying a branch-and-bound search to the tree. The cost from
the root of the
 tree to a given node is computed as described above. Nodes are considered
for pairings in
 order from the most heavily connected vertex to the least connected, as
this constrains the
 remaining match. Because branch-and-bound search guarantees an optimal solution,
 search ends as soon as the first complete mapping is found.
 In addition, the user can place a limit on the number of search nodes considered
 the branch-and-bound procedure {defined as a function of the size of the
input graphs}.
 Once the number of nodes expanded in the search tree reaches the defined
limit, the search
 resorts to hill climbing using the cost of the mapping so far as the measure
for choosing the
 best node at a given level. By defining such a limit, significant speedup
can be realized at
 the expense of accuracy for the computed match cost.
 Another approach to inexact graph match would be to encode the difference
between two graphs using the MDL principle. Smaller encodings would indicate
a lower match cost
 between the two graphs. Weleave this as a future research direction. 6.
Guiding the Discovery Process with Background Knowledge
 Although the principle of minimum description length is useful for discovering
 tures that maximize compression of the data, scientists may realize more
benefit from the
 discoveryof substructures that exhibit other domain-specific and domain-independent
char- acteristics.
 To make Subdue more powerful across a wide variety of domains, we have added
 ability to guide the discovery process with background knowledge. Although
the minimum
 description length principle still drives the discovery process, the background
knowledge can
 be used to input a bias toward certain types of substructures. Thisbackground
 is encoded in the form of rules for evaluating substructures, and can represent
 independent or domain-dependent rules. Each time a substructure is evaluated,
these input
 240Substructure Discovery rules are used to determine the value of the substructure
under consideration. Because
 only the most-favored substructures are kept and expanded, these rules bias
the discovery process of the system.
 Each background rule can be assigned a positive, zero, or negative weight,
that biases
 the procedure toward a type of substructure, eliminates the use of the rule,
or biases the
 procedure away from a type of substructure, respectively. The value of a
substructure is
 defined as the description length {DL} of the input graph using the substructure
 plied by the weighted value of each background rule from a set of rules
R applied to the
 substructure. value{s} = DL{G; s} 002
 rule r
 e r
 Three domain-independent heuristics that have been incorporated as rules
into the Sub-
 due system are compactness, connectivity, and coverage. Forthe definitions
of these rules,
 we will let G represent the input graph, s represent a substructure in the
graph, and I
 represent the set of instances of the substructure s in G. The instance
weight w of an
 instancei2 I ofa substructure s is defined to be
 w{i; s} = 1 000
 matchcost{i; s}size{i} ; {2}
 where size{i} = #vertices{i}+ #edges{i}. If the match cost is greater than
the size of the larger graph, then w{i; s} = 0. The instance weights are
used in these rules to compute a
 weighted average over instances of a substructure. A value of 1 is added
to each formula so that the exponential weights can be used to control the
rule's significance.
 The first rule, compactness, is a generalization of Wertheimer's Factor
of Closure, which
 states that human attention is drawn to closed structures {Wertheimer, 1939}.
A closed
 substructure has at least as many edges as vertices, whereas a non-closed
substructure hasfewer edges than vertices {Prather, 1976}. Thus, closed substructures
have a higher compactness value. Compactness is defined as the weighted average
of the ratio of the numberof edges in the substructure to the number of vertices
in the substructure. compactness{s} = 1+ 1jIj X
 w{i; s} 002
 The second rule, connectivity, measures the amount of external connection
in the in-
 stances of the substructure. The connectivity rule is a variant of Wertheimer's
 ofProximity {Wertheimer, 1939}, and is related to earlier numerical clustering
 {Zahn, 1971}. Theseworks demonstrate the human preference for \isolated"
 that is, substructures that are minimally related to adjoining structure.
 sures the \isolation" of a substructure by computing the inverse of the
average number of
 external connections over all the weighted instances of the substructure
in the input graph.
 An external connection is defined here as an edge that connects a vertex
in the substructure
 to a vertex outside the substructure. The formula for determining the connectivity
of a
 substructures with instances I in the input graph G is given below.
 241Cook & Holder
 connectivity{s}= 1 +
 X i2I
 w{i; s} 002 numexternalconns{i}
 # 0001
 The third rule, coverage, measures the fraction of structure in the input
graph described
 by the substructure. The coverage rule is motivated from research in inductive
learning and
 provides that concept descriptions describing more input examples are considered
 {Michalski & Stepp, 1983}. AlthoughMDL measures the amount of structure,
the coverage
 rule includes the relevance of this savings with respect to the size of
the entire input graph.
 Coverage is defined as the number of unique vertices and edges in the instances
of the
 substructure divided by the total number of vertices and edges in the input
graph. In this
 formula, the uniquestructure{i} of an instance i is the number of vertices
and edges in i
 that have not already appeared in previous instances in the summation.
 coverage{s}= 1+ P
 w{i; s} 002 uniquestructure{i}size{G}
 {5} Domain-dependent rules can also be used to guide the discovery process
in a domain where scientists can contribute their expertise. For example,
CAD circuits generally consist of two types of components, active and passive
components. The active components are
 the main driving components. Identifying the active components is the first
step in under-
 standing the main function of the circuit. To add this knowledge to Subdue
we include
 a rule that assigns higher values to substructures {circuit components}
representing active
 components and lower values to substructures representing passive components.
 active components have higher scores, they are expected to be selected.
The system can
 then focus the attention on the active components which will be expanded
to the functional
 substructures. Another method of biasing the discovery process with background
knowledge is to let background rules affect the prior probabilities of possible
substructures. However, choosing the appropriate prior probabilities to express
desired properties of substructures is diffi-
 cult, but indicates a future direction for the inclusion ofbackground knowledge
into the
 substructurediscovery process.
 7. Experiments
 The experiments in this section evaluate Subdue's substructure discovery
capability in
 several domains, including chemical compound analysis, scene analysis, CAD
circuit design
 analysis, and analysis of an artificially-generated structural database.
 Two goals of our substructure discovery system are to find substructures
that can reduce the amount of information needed to describe the data, and
to find substructures that are considered interesting for the given database.
Asa result, we evaluate the Subdue system in this section along these two
criteria. First, we measure the amount of compression that Subdueprovides
across a variety of databases. Second, we use the Subdue system with the
 additional background knowledge rules to re-discover substructures that
have been identified
 as interesting by experts in each specific domain. Section 7.1 describes
the domains used
 in these experiments, and Section 7.2 presents the experimental results.
242Substructure DiscoveryCH3OOCH3C   OCH OH2OHFigure 6: Cortisone.
rubber {all-cis polyisoprene}.
 7.1 Domains
 7.1.1 Chemical Compound Analysis
 Chemical compounds are rich in structure. Identification of the common and
 substructures can benefit scientists by identifying recurring components,
simplying the data
 description, and focusing on substructures that stand out and merit additional
attention. Chemical compounds are represented graphically by mapping individual
atoms, such as
 carbon and oxygen, to labeled vertices in the graph, and by mapping bonds
between the
 atoms onto labeled edges in the graph. Figures 6, 7, and 8 show the graphs
 the chemical compound databases for cortisone, rubber, and a portion of
a DNA molecule.
 7.1.2 Scene Analysis
 Images and scene descriptions provide a rich source of structure. Images
that humans
 encounter, both natural and synthesized, have many structured subcomponents
that draw
 our attention and that help us to interpret the data or the scene.
 Discovering common structures in scenes can be useful to a computer vision
 First,automatic substructure discovery can help a system interpret an image.
Instead of
 working from low-level vertices and edges, Subdue can provide more abstract
 components, resulting in a hierarchical view of the image that the machine
can analyze at
 many levels of detail and focus, depending on the goal of the analysis.
Second, substructure
 discovery that makes use of an inexact graph match can help identify objects
in a 2D image
 of a 3D scene where noise and orientation differences are likely to exist.
If an object ap-
 pears often in the scene, the inexact graph match driving the Subdue system
may capture
 slightly different views of the same object. Although an object may be difficult
to identify 243Cook & Holder
8: Portion of a DNA molecule.Figure 9: Scene analysis example. 244Substructure
DiscoveryfalkxptmFigure 10: Possiblevertices and labels.
lllllllllllllllltttttltltlttmalllaafaFigure 11: Portionof graph representing
scene in Figure 4.
 from just one 2D picture, Subdue will match instances of similar objects,
and the differ-
 ences between these instances can provide additional information for identification.
 substructure discovery can be used to compress the image. Replacing commoninteresting
 substructures by a single vertex simplifies the image description and reduces
the amount of
 storage necessary to represent the image. To apply Subdue to image data,
we extract edge information from the image and construct a graph representing
the scene. The graph representation consists of eight types
 of vertices and two types of arcs {edge and space}. The vertex labels {f
, a, l, t, k, x, p, and
 m} follow the Waltz labelings {Waltz, 1975} of junctions of edges in the
image and represent
 the types of vertices shown in Figure 10. An edge arc represents the edge
of an object in the
 image, and a space arc links non-connecting objects together. The edge arcs
represent an
 edge in the scene that connects two vertices, and the space arcs connect
the closest vertices from two disjoint neighboring objects. Distance,curve,
and angle information has not been included in the graph representation,
but can be added to give additional information about the scene. Figure 11
shows the graph representation of a portion of the scene depicted in
 Figure 9. Inthis figure, the edge arcs are solid and the space arcs are
 245Cook & Holder
12: Amplifier circuit and graph representation.
 7.1.3 CAD Circuit Analysis
 In this domain, we employ Subdue to find circuit components in CAD circuit
data. Discov-
 ery of substructures in circuit data can be a valuable tool to an engineer
who is attempting to
 identify common reusable parts in a circuit layout. Replacingindividual
componentsin the
 circuit description by larger substructure descriptions will also simplify
the representation of the circuit.
 The data for the circuit domain was obtained from National Semiconductor,
and con-
 sists of a set of components making up a circuit as output by the Cadence
Design System. The particular circuit used for this experiment is a portion
of an analog-to-digital con- verter. Figure 12 presents a circuit for an
amplifier and gives the corresponding graph representation.
 7.1.4 Artificial Domain
 In the final domain, we artificially generate graphs to evaluate Subdue's
ability to discover
 substructures capable of compressing the graph. Four substructures are created
of varying sizes with randomly-selected vertices and edges {see Figure 13}.
The name of a substructure reflects the number of vertices and edges in its
graph representation. Next,these substruc- tures are embedded in larger graphs
whose size is 15 times the size of the substructure. The graphs vary across
four parameters: number of possible vertex and edge labels {one
 times and two times the number of labels used in the substructure}, connectivity
of the
 substructure {1 or 2 external connections}, coverage of the instances {60045
and 80045}, and 246Substructure Discoveryn1n4n3n2e1e2e3n4n3n2n1e1e3e3e3e2e2n7n5n2n5n3n1n6e6e3e1e4e2e6n3n1n7n4n2n3n1e7e8e3e5e3e8e6e9e1Figure
13: Fourartificial substructures used to evaluate Subdue.
 the amount of distortion in the instances {0, 1 or 2 distortions}. This
yields a total of 96
 graphs {24 for each different substructure}.
 7.2 Experimental Results
 7.2.1 Experiment 1: Data compression
 In the first experiment, we test Subdue's ability to compress a structural
database. Using
 a beam width of 4 and Subdue's pruning mechanism, we applied the discovery
algorithm to each of the databases mentioned above. We repeat the experiment
with match thresholds
 rangingfrom 0.0 to 1.0 in increments of 0.1. Table 1 shows the description
length {DL} of the
 original graph, the description length of the best substructure discovered
by Subdue, and
 the value of compression. Compression here is defined as
 DL of compressed graphDL of original graph
 . Figure 14, shows the actual discovered substructures for the first four
 As can be seen from Table 1, Subdue was able to reduce the database to slightly
 larger than
 of its original size in the best case. The average compression value over
 all of these domains {treating the artificial graphs as one value} is 0.62.
The results of
 this experiment demonstrate that the substructure discovered by Subdue can
 reduce the amount of data needed to represent an input graph. We expect
that compressing
 the graph using combinations of substructures and hierarchies of substructures
will realize
 even greater compression in some databases.
 247Cook & HolderDatabaseDL
{ M14095.730.72148.80.52CAD { S1SegDec1860.140.71149.290.62CAD { S1DrvBlk12715.120.79070.210.71CAD
{ BlankSub8606.690.76204.740.72CAD { And2427.730.1324.520.76Artificial {avg.
over 96 graphs}1636.250.0: : :1.01164.020.71Table 1: Graphcompression results.
laaCCOCH2OCCOCCCCCH2CH3HCH2{a}{b}{c}{d}Figure 14: Best substructure for {a}
rubber database, {b} cortisone database, {c} DNA database, and {d} image
 248Substructure DiscoveryFigure 15: Benzenering discovered by Subdue.
 7.2.2 Experiment 2: Re-discoveryof known substructures using background
 Another way of evaluating the discovery process is to evaluate the interestingness
of the
 discovered substructures. The determination of this value will change from
domain to
 domain. As a result, in this second set of experiments we test Subdue's
ability to discover
 substructures that have already been labeled as important by experts in
the domains under consideration.
 In the chemical compound domain, chemists frequently describe compounds
in terms of
 the building-block components that are heavily used. For example, in the
rubber compound
 database shown in Figure 7, the compound is made up of a chain of structures
that are
 labeled by chemists as isoprene units. Subdue's ability to re-discover this
structure is
 exemplified in Figure 14a. This substructure, which was discovered using
the MDL principle
 with no extra background knowledge, represents an isoprene unit.
 Although Subdue was able to re-discover isoprene units without extra background
 knowledge, the substructure affording the most compression will not always
be the most in-
 teresting or important substructure in the database. For example, in the
cortisone database
 the benzene ring which consists of a ring of carbons is not discovered using
only the MDL principle. However, the additional background rules can be used
to increase the chance of
 finding interesting substructures in these domains. In the case of the cortisone
compound, we know that the interesting structures exhibit a characteristic
of closure. Therefore, we give a strong weight {8.0} to the compactness background
rule and use a match threshold of
 0.2 to allow for deviations in the benzene ring instances. In the resulting
output, Subdue
 finds the benzene ring shown in Figure 15.
 In the same way, we can use the background rules to find the pencil substructure
 the image data. When the image in Figure 9 is viewed, the substructure of
interest is the pencil in its various forms. However, the substructure that
afforded the most compression does not make up an entire pencil. We know
that the pencils have a high degree of closure
 and of coverage, so the weights for these rules are set to 1.0. With these
weights, Subdue
 is able to find the pencil substructure shown in Figure 16 for all tested
match thresholds
 between 0.0 and 1.0.
 8. Hierarchical Concept Discovery
 After a substructure is discovered, each instance of the substructure in
the input graph can be replaced by a single vertex representing the entire
substructure. The discovery procedure
 can then be repeated on the compressed data set, resulting in new interesting
substructures. If the newly-discovered substructures are defined in terms
of existing substructure concepts,
 the substructure definitions form a hierarchy of substructure concepts.
 249Cook & Holder
lllaaFigure 16: Pencil substructure discovered by Subdue.
 Hierarchical concept discovery also adds the capability to improve Subdue's
 mance. When Subdue is applied to a large input graph, the complexity of
the algorithm
 prevents consideration of larger substructures. Using hierarchical concept
discovery, Sub- due can first discover those smaller substructures which
best compress the data. Applying
 the compression reduces the graph to a more manageable size, increasing
the chance that Subdue will find the larger substructures on the subsequent
passes through the database. Once Subdue selects a substructure, all vertices
that comprise the exact instances of
 the substructure are replaced in the graph by a single vertex representing
the discovered
 substructure. Edges connecting vertices outside the instance to vertices
inside the instance
 now connect to the new vertex. Edgesinternal to the instance are removed.
 process is then applied to the compressed data. If a hierarchical description
of concepts is
 particularly desired, heavier weight can be given to substructures which
utilize previously
 discovered substructures. The increased weight reflects increased attention
to this substruc-
 ture. Figure 17 illustrates the compressed rubber compound graph using the
 shown in Figure 14a.
 To demonstrate the ability of Subdue to find a hierarchy of substructures,
we let the sys-
 tem make multiple passes through a database that represents a portion of
a DNA molecule.
 Figure 8 shows a portion of two chains of a double helix, using three pairs
of bases which
 are held together by hydrogen bonds. Figure 18 shows the substructures found
by Subdue
 after each of three passes through the data. Note that, on the third pass,
Subdue linked
 together the instances of the substructure in the second pass to find the
chains of the double
 Although replacing portions of the input graph with the discovered substructures
 presses the data and provides a basis for discovering hierarchical concepts
in the data, the
 substructure replacement procedure becomes more complicated when concepts
with inexact instances are discovered. When inexact instances of a discovered
concept are replaced by
 a single vertex in the data, all distortions of the graph {differences between
the instance
 graph and the substructure definition} mustbe attached as annotations to
the vertex label.
 250Substructure DiscoveryS  =CCCH2CH3HCH21Highest-valued substructureSSSSSCCCH2CH3HCH2CCCH2CH3HCH2CCCH2CH3HCH2CCCH2CH3HCH2CCCH2CH3HCH2=G
=11111Compressed graph using discovered substructureFigure 17: Compressedgraph
for rubber compound data.
 251Cook & Holder
=3=Highest-valued substructure)(after First PassHighest-valued substructure)(after
Second PassHighest-valued substructure)(after Third PassOFigure 18: Hierarchical
discovery in DNA data.
 252Substructure Discovery 9. Conclusions
 Extracting knowledge from structural databases requires the identification
of repetitive sub-
 structures in the data. Substructurediscovery identifies interesting and
repetitive structure
 in structural data. The substructures represent concepts found in the data
and a means of
 reducing the complexity of the representation by abstracting over instances
of the substruc-
 ture. Wehave shown how the minimum description length {MDL} principle can
be used to
 perform substructure discovery in a variety of domains. The substructure
discovery process
 can also be guided by background knowledge. The use of an inexact graph
match allows
 deviation in the instances of a substructure. Once a substructure is discovered,
 of the substructure can be replaced by the concept definition, affording
compression of the data description and providing a basis for discovering
hierarchically-defined structures.
 Future work will combine structural discovery with discovery of concepts
using a linear- based representation such as AutoClass {Cheeseman, Kelly,
Self, Stutz, Taylor, & Freeman, 1988}. In particular, we will use Subdue
to compress the data fed to AutoClass, and let Subdue evaluate the interesting
structures in the classes generated by AutoClass. In
 addition, we will be developing a parallel implementation of the AutoClass
/ Subdue system that will enable application of substructure discovery to
larger structural databases. Acknowledgements
 This project is supported by NASA grant NAS5-32337. The authors would like
to thank
 Mike Shay at National Semiconductor for providing the circuit data. We would
also like
 to thank Surnjani Djoko and Tom Lai for their help with this project. Thanks
also to the
 reviewers for their numerous insightful comments.
 References Bunke, H., & Allermann, G. {1983}. Inexactgraph matching for
structural pattern recog- nition. Pattern Recognition Letters, 1 {4}, 245{253.
 Cheeseman, P., Kelly, J., Self, M., Stutz, J., Taylor, W., & Freeman, D.
{1988}. Autoclass:
 A bayesian classification system. InProceedings of the Fifth International
 on Machine Learning, pp. 54{64.
 Conklin, D., & Glasgow, J. {1992}. Spatial analogy and subsumption. InProceedings
of the
 Ninth International Machine Learning Workshop, pp. 111{116. Derthick, M.
{1991}. Aminimal encoding approach to feature discovery. In Proceedings of
 the Ninth National Conference on Artificial Intelligence,pp. 565{571. Fisher,
D. H. {1987}. Knowledge acquisition via incremental conceptual clustering.
Machine Learning, 2 {2}, 139{172. Fu, K. S. {1982}. Syntactic Pattern Recognition
and Applications. Prentice-Hall. Holder, L. B., Cook, D. J., & Bunke, H.
{1992}. Fuzzy substructure discovery. In Proceedings
 of the Ninth International Machine Learning Conference, pp. 218{223. 253Cook
& Holder
 Holder, L. B., & Cook, D. J. {1993}. Discovery of inexact concepts from
structural data.
 IEEE Transactions on Knowledge and Data Engineering,5 {6}, 992{994.
 Jeltsch, E., & Kreowski, H. J. {1991}. Grammatical inference based on hyperedge
 ment. In Fourth International Workshop on Graph Grammars and Their Application
 to Computer Science,pp. 461{474.
 Leclerc, Y. G. {1989}. Constructing simple stable descriptions for image
partitioning. In-
 ternational journal of Computer Vision, 3 {1}, 73{102. Levinson, R. {1984}.
A self-organizing retrieval system for graphs. In Proceedings of the
 Second National Conference on Artificial Intelligence, pp. 203{206. Michalski,
R. S., & Stepp, R. E. {1983}. Learning from observation: Conceptual clustering.
In Michalski, R. S., Carbonell, J. G., & Mitchell, T. M. {Eds.}, Machine
Learning: AnArtificial Intelligence Approach, Vol. I, pp. 331{363. Tioga
Publishing Company. Miclet, L. {1986}. Structural Methods in Pattern Recognition.
Chapman and Hall. Pednault, E. P. D. {1989}. Some experiments in applying
inductive inference principles
 to surfa ce reconstruction. In Proceedings of the International Joint Conference
 Artificial Intelligence, pp. 1603{1609.
 Pentland, A. {1989}. Part segmentation for object recognition. Neural Computation,
 Prather, R. {1976}. Discrete Mathemetical Structures for Computer Science.
 Miffin Company. Quinlan, J. R., & Rivest, R. L. {1989}. Inferring decision
trees using the minimum descrip- tion length principle. Information and Computation,
80, 227{248.
 Rao, R. B., & Lu, S. C. {1992}. Learning engineering models with the minimum
 tion length principle. In Proceedings of the Tenth National Conference on
 Intelligence, pp. 717{722.
 Rissanen, J. {1989}. Stochastic Complexity in Statistical Inquiry. World
Scientific Publishing
 Company. Schalkoff, R. J. {1992}. Pattern Recognition: Statistical, Structural
and Neural Approaches.
 John Wiley & Sons.
 Segen, J. {1990}. Graph clustering and model learning by data compression.
In Proceedings of the Seventh International Machine Learning Workshop, pp.
 Thompson, K., & Langley, P. {1991}. Concept formation in structured domains.
 D. H., & Pazzani, M. {Eds.}, Concept Formation: Knowledge and Experience
in Un-
 supervised Learning, chap. 5. Morgan Kaufmann Publishers, Inc. Waltz, D.
{1975}. Understandingline drawings of scenes with shadows. InWinston, P.
H. {Ed.}, The Psychology of Computer Vision.McGraw-Hill.
 254Substructure Discovery Wertheimer, M. {1939}. Laws of organization in
perceptual forms. In Ellis, W. D. {Ed.}, A
 Sourcebook of Gestalt Psychology, pp. 331{363. Harcourt, Brace and Company.
Winston, P. H. {1975}. Learning structural descriptions from examples. In
Winston, P. H. {Ed.}, The Psychology of Computer Vision,pp. 157{210. McGraw-Hill.
 Yoshida, K., Motoda, H., & Indurkhya, N. {1993}. Unifying learning methods
by colored digraphs. In Proceedings of the Learning and Knowledge Acquisition
Workshop at IJCAI-93.
 Zahn, C. T. {1971}. Graph-theoretical methods for detecting and describing
gestalt clusters.
 IEEE Transactions on Computers, 20 {1}, 68{86.