The Difficulties of Learning Logic Programs with Cut

F. Bergadano, D. Gunetti and U. Trinchero

Volume 1, 1993

Journal of Artificial Intelligence Research 1 (1993) 91-107. Submitted 8/93; published 11/93.

The Difficulties of Learning Logic Programs with Cut

Francesco Bergadano (bergadan@di.unito.it)
Università di Catania, Dipartimento di Matematica, via Andrea Doria 6, 95100 Catania, Italy

Daniele Gunetti (gunetti@di.unito.it)
Umberto Trinchero (trincher@di.unito.it)
Università di Torino, Dipartimento di Informatica, corso Svizzera 185, 10149 Torino, Italy

Abstract

As real logic programmers normally use cut (!), an effective learning procedure for logic programs should be able to deal with it. Because the cut predicate has only a procedural meaning, clauses containing cut cannot be learned using an extensional evaluation method, as is done in most learning systems. On the other hand, searching a space of possible programs (instead of a space of independent clauses) is infeasible. An alternative solution is first to generate a candidate base program which covers the positive examples, and then to make it consistent by inserting cut where appropriate. The problem of learning programs with cut has not been investigated before, and this seems to be a natural and reasonable approach. We generalize this scheme and investigate the difficulties that arise. Some of the major shortcomings are actually caused, in general, by the need for intensional evaluation. In conclusion, the analysis of this paper suggests, on precise and technical grounds, that learning cut is difficult, and that current induction techniques should probably be restricted to purely declarative logic languages.

1. Introduction

Much recent research in AI and Machine Learning is addressing the problem of learning relations from examples, especially under the title of Inductive Logic Programming (Muggleton, 1991). One goal of this line of research, although certainly not the only one, is the inductive synthesis of logic programs. More generally, we are interested in the construction of program development tools based on Machine Learning techniques. Such techniques now include efficient algorithms for the induction of logical descriptions of recursive relations. However, real logic programs contain features that are not purely logical, most notably the cut (!) predicate. The problem of learning programs with cut has not been studied before in Inductive Logic Programming, and this paper analyzes the difficulties involved.

1.1 Why Learn Programs with Cut?

There are two main motivations for learning logic programs with cut:

1. ILP should provide practical tools for developing logic programs, in the context of some general program development methodology (e.g., Bergadano, 1993b); as real-size logic programs normally contain cut, learning cut will be important for creating an integrated Software Engineering framework.


2. Extensive use of cut can make programs considerably shorter, and the difficulty of learning a given logic program is very much related to its length.

For both of these objectives, we need not only cuts that make the programs
more efficient without changing their input-output behavior ("green cuts"),
but also cuts that eliminate some possible computed results ("red cuts").
Red cuts are sometimes considered bad programming style, but are often useful.
Moreover, only the red cuts are effective in making programs shorter. Green
cuts are also important, and less controversial. Once a correct program has
been inferred via inductive methods, it could be made more efficient through
the insertion of green cuts, either manually or by means of automated program
transformation techniques (Lau & Clement, 1993).
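To make the distinction concrete, here is a standard textbook illustration (ours, not taken from the paper), showing max/3 written with a green cut and with a red cut:

max_green(X,Y,X) :- X >= Y, !.     /* green: the cut only saves work,   */
max_green(X,Y,Y) :- X < Y.         /* since the test X < Y remains      */

max_red(X,Y,X) :- X >= Y, !.       /* red: the second clause is correct */
max_red(X,Y,Y).                    /* only because the cut prevents it  */
                                   /* from being used when X >= Y       */

Removing the cut from max_green does not change the computed answers; removing it from max_red would make, e.g., max_red(3,1,1) succeed. Note also that the red cut makes the program shorter, by allowing the test in the second clause to be dropped.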

1.2 Why Standard Approaches Cannot Be Used

Most Machine Learning algorithms generate rules or clauses one at a time and independently of each other: if a rule is useful (it covers some positive example) and correct (it does not cover any negative example), then it is added to the description or program which is being generated, until all positive examples have been covered. This means that we are searching a space of possible clauses, without backtracking. This is obviously a great advantage, as programs are sets of clauses, and therefore the space of possible programs is exponentially larger.

The one principle which allows this simplification of the problem is the extensional evaluation of possible clauses, used to determine whether a clause C covers an example e. The fact that a clause C covers an example e is then used as an approximation of the fact that a logic program containing C derives e. Consider, for instance, the clause C = "p(X,Y) :- α", and suppose the example e is p(a,b). In order to see whether C covers e, the extensionality principle makes us evaluate any literal in α as true if and only if it matches some given positive example. For instance, if α = q(X,Z) ∧ p(Z,Y), then the example p(a,b) is extensionally covered iff there is a ground term c such that q(a,c) and p(c,b) are given as positive examples. In particular, in order to obtain the truth value of p(c,b), we will not need to call other clauses that were learned previously. For this reason, determining whether C covers e depends only on C and on the positive examples. Therefore, the learning system will decide whether to accept C as part of the final program P independently of the other clauses P will contain.
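As a concrete sketch, the extensional test itself can be expressed in Prolog (the predicates pos_ex/1 and covers_ext/1 below are our own illustration, not part of the systems cited):

/* The given positive examples, stored as ground facts. */
pos_ex(p(a,b)).
pos_ex(p(c,b)).
pos_ex(q(a,c)).

/* covers_ext(Body): every body literal must unify with some given
   positive example; previously learned clauses are never called. */
covers_ext([]).
covers_ext([L|Ls]) :- pos_ex(L), covers_ext(Ls).

/* For C = "p(X,Y) :- q(X,Z), p(Z,Y)" and e = p(a,b), unify the head
   with e and test the instantiated body:
   ?- covers_ext([q(a,Z), p(Z,b)]).   succeeds with Z = c. */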

The extensionality principle is found in Foil (Quinlan, 1990) and its derivatives,
but is also used in bottom-up methods such as Golem (Muggleton & Feng, 1990).
Shapiro's MIS system (Shapiro, 1983) uses it when refining clauses, although
it does not when backtracing inconsistencies. We have also used an extensional
evaluation of clauses in the FILP system (Bergadano & Gunetti, 1993).

When learning programs with cut, clauses are no longer independent and their
standalone extensional evaluation is meaningless. When a cut predicate is
evaluated, other possible clauses for proving the same goal will be ignored.
This changes the meaning of these other clauses. Even if a clause extensionally
covers some example e, it may be the case that the final program does not
derive e, because some derivation paths have been eliminated by the evaluation
of a cut predicate.
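A small example of this effect (our own, in the spirit of the paper's discussion): let p(b) be a positive example, with the given facts q(b), r(b) and s(a), and consider the clauses

c1: p(X) :- q(X), !, s(X).
c2: p(X) :- r(X).

Clause c2 extensionally covers p(b), since r(b) holds. Yet the program [c1,c2] does not derive p(b): q(b) succeeds, the cut discards c2, and s(b) fails.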


However, an exhaustive search in a space of programs is prohibitive. Learning methods, even if based on extensionality, are often considered inefficient if sufficient prior information is not available; searching for sets of clauses will be exponentially worse. This would amount to a brute-force enumeration of all possible logic programs containing cut, until a program that is consistent with the given examples is found.

1.3 Is There an Alternative Method?

Cut will only eliminate some computed results, i.e., after adding cut to some program, it may be the case that some example is no longer derived. This observation suggests a general learning strategy: a base program P is induced with standard techniques, given the positive and maybe some of the negative examples; then the remaining negative examples are ruled out by inserting cut in some clause of P. Obviously, after inserting cut, we must make sure that the positive examples can still be derived.

Given the present technology and the discussion above, this seems to be the only viable path to a possible solution. Using standard techniques, the base program P would be generated one clause at a time, so that the positive examples are extensionally covered. However, we think this view is too restrictive, as there are programs which derive all given positive examples, although they do not cover them extensionally (Bergadano, 1993a; De Raedt, Lavrac, & Dzeroski, 1993). More generally, we consider traces of the positive examples:

Definition 1. Given a hypothesis space S of possible clauses, and an example e such that S ⊢ e, the set of clauses T ⊆ S which is used during the derivation of e is called a trace for e.

We will use as a candidate base program P any subset of S which is the union of some traces for the positive examples. If P ⊆ S extensionally covers the positive examples, then it is also the union of such traces, but the converse is not always true. After a candidate program has been generated, an attempt is made to insert cuts so that the negative examples are not derived. If this is successful, we have a solution; otherwise, we backtrack to another candidate base program. We will analyze the many problems inherent in learning cut with this class of trace-based learning methods, but, as we discuss later (Section 4), the same problems need to be faced in the more restrictive framework of extensional evaluation. In other words, even if we choose to learn the base program P extensionally, and then try to make it consistent by using cut, the same computational problems would still arise. The main difference is that standard approaches based on extensionality do not allow for backtracking and do not guarantee that a correct solution is found (Bergadano, 1993a).

As far as computational complexity is concerned, trace-based methods have a complexity standing between the search in a space of independent clauses (for the extensional methods) and the exhaustive search in a space of possible programs. We need the following:

Definition 2. Given a hypothesis space S, the depth of an example e is the maximum number of clauses in S successfully used in the derivation of e.

For example, if we are in a list-processing domain, and S only contains recursive calls of the type "P([H|T]) :- ..., P(T), ...", then the depth of an example P(L) is the length of L. For practical program induction tasks, it is often the case that the depth of an example is related to its complexity, and not to the hypothesis space S. If d is the maximum depth for the given m positive examples, then the complexity of trace-based methods is of the order of |S|^(md), while extensional methods will just enumerate possible clauses with a complexity which is linear in |S|, and enumerating all possible programs is exponential in |S|.
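Summarizing the orders of growth just stated, with m positive examples of maximum depth d:

    search over independent clauses (extensional methods):  O(|S|)
    trace-based search (this paper):                        O(|S|^(md))
    exhaustive search over programs:                        O(2^(|S|))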

2. A Simple Induction Procedure

The trace-based induction procedure we analyze here takes as input a finite set of clauses S and sets of positive and negative examples E+ and E-, and tries to find a subset T of S such that T derives all the positive examples and none of the negative examples. For every positive example e+ ∈ E+, we assume that S is large enough to derive it. Moreover, we assume that all clauses in S are flattened.[1] If this is not the case, clauses are flattened as a preprocessing step.

We consider one possible proof for S ⊢ e+, and we build an intermediate program T ⊆ S containing a trace of the derivation. The same is done for the other positive examples, and the corresponding traces T are merged. Every time T is updated, it is checked against the negative examples. If some of them are derived from T, cut (!) is inserted in the antecedents of the clauses in T, so that a consistent program is found, if it exists. If this is not the case, the procedure backtracks to a different proof for S ⊢ e+. The algorithm can be informally described as follows:

input:  a set of clauses S
        a set of positive examples E+
        a set of negative examples E-

S := flatten(S)
T := ∅
for each positive example e+ ∈ E+:
    find T1 ⊆ S such that T1 ⊢SLD e+     (backtracking point 1)
    T := T ∪ T1
    if T derives some negative example e- then trycut(T,e-)
    if trycut(T,e-) fails then backtrack
output the clauses listed in T

trycut(T,e-): insert ! somewhere in T (backtracking point 2) so that

1. all previously covered positive examples are still derived from T, and
2. T ⊬SLD e-

The complexity of adding cut somewhere in the trace T, so that the negative example e- is no longer derived, obviously depends only on the size of T. But this size depends on the depth of the positive examples, not on the size of the hypothesis space S. Although more clever ways of doing this can be devised, based on the particular example e-, we propose a simple enumerative technique in the implementation described in the Appendix.

[1] A clause is flattened if it does not contain any function symbol. Given an unflattened clause, it is always possible to flatten it (by turning functions into new predicates with an additional argument representing the result of the function) and vice versa (Rouveirol, in press).
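As an illustration of the flattening described in footnote 1, here is append in both forms (a sketch; the flattened version is renamed append_f only to keep the two side by side, and cons/3 is added to the support predicates used in Section 3):

/* Unflattened: the function symbols [] and [H|T] occur in the clauses. */
append([],Z,Z).
append([H|X],Y,[H|Z]) :- append(X,Y,Z).

/* Flattened: functions become predicates with an extra result argument. */
append_f(L,Z,Z) :- null(L).
append_f(L1,Y,L2) :- head(L1,H), tail(L1,X), append_f(X,Y,Z), cons(H,Z,L2).

null([]).   head([H|_],H).   tail([_|T],T).   cons(H,T,[H|T]).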

3. Example: Simplifying a List

In this section we show an example of the use of the induction procedure to learn the logic program simplify. Simplify takes as input a list whose members may be lists, and transforms it into a "flattened" list of single members, containing no repetitions and no lists as members. This program appears as exercise number 25 in (Coelho & Cotta, 1988); it is composed of nine clauses (plus the clauses for append and member), six of them are recursive, one is doubly recursive, and cut is extensively used. Even if simplify is not a very complex logic program, it is more complex than the usual ILP test cases. For instance, the quicksort and partition program, which is very often used, is composed of only five clauses (plus those for append), and three of them are recursive. Moreover, note that the conciseness of simplify is essentially due to the extensive use of cut. Without cut, this program would be much longer. In general, the longer a logic program, the more difficult it is to learn.

As a consequence, we start with a relatively strong bias; suppose that the following hypothesis space of N = 8449 possible clauses is defined by the user:

• The clause "simplify(L,NL) :- flatten(L,L1), remove(L1,NL)."

• All clauses whose head is "flatten(X,L)" and whose body is composed of a conjunction of any of the following literals:

  head(X,H), tail(X,L1), equal(X,[L1,T]), null(T), null(H), null(L1), equal(X,[L1]), flatten(H,X1), flatten(L1,X2), append(X1,X2,L), assign(X1,L), assign(X2,L), list(X,L).

• All clauses whose head is "remove(IL,OL)" and whose body is composed of a conjunction of any of the following literals:

  cons(X,N,OL), null(IL), assign([],OL), head(IL,X), tail(IL,L), member(X,L), remove(L,OL), remove(L,N).

• The correct clauses for null, head, tail, equal, assign, member, append are given:

  null([]).
  head([H|_],H).
  tail([_|T],T).
  equal(X,X).
  assign(X,X).
  member(X,[X|_]).
  member(X,[_|T]) :- member(X,T).
  append([],Z,Z).
  append([H|X],Y,[H|Z]) :- append(X,Y,Z).

By using various kinds of constraints, the initial number of clauses can be strongly reduced. Possible constraints are the following:

• Once an output is produced, it must not be instantiated again. This means that no variable can occur as output in the antecedent more than once.

• Inputs must be used: all input variables in the head of a clause must also occur in its antecedent.

• Some conjunctions of literals are ruled out because they can never be true, e.g. null(IL) ∧ head(IL,X).

By applying various combinations of these constraints it is possible to strongly restrict the initial hypothesis space, which is then given as input to the learning procedure. The set of positive and negative examples used in the learning task is:

simplify_pos([[[],[b,a,a]],[]],[b,a]).
remove_pos([a,a],[a]).
(simplify_neg([[[],[b,a,a]],[]],X), not equal(X,[b,a])).
simplify_neg([[a,b,a],[]],[a,[b,a]]).
remove_neg([a,a],[a,a]).

Note that we define some negative examples of simplify to be all the examples with the same input as a given positive example and a different output, for instance simplify_neg([[[],[b,a,a]],[]],[a,b]). Obviously, it is also possible to give negative examples as normal ground literals. The learning procedure outputs the program for simplify reported below, which turns out to be substantially equivalent to the one described in (Coelho & Cotta, 1988) (we have kept clauses unflattened).

simplify(L,NL) :- flatten(L,L1), remove(L1,NL).

flatten(X,L) :- equal(X,[L1,T]), null(T), !, flatten(L1,X2), assign(X2,L).
flatten(X,L) :- head(X,H), tail(X,L1), null(H), !, flatten(L1,X2), assign(X2,L).
flatten(X,L) :- equal(X,[L1]), !, flatten(L1,X2), assign(X2,L).
flatten(X,L) :- head(X,H), tail(X,L1), !, flatten(H,X1), !, flatten(L1,X2), append(X1,X2,L).
flatten(X,L) :- list(X,L).

remove(IL,OL) :- head(IL,X), tail(IL,L), member(X,L), !, remove(L,OL).
remove(IL,OL) :- head(IL,X), tail(IL,L), remove(L,N), cons(X,N,OL).
remove(IL,OL) :- null(IL), assign([],OL).
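The learned clauses can be checked directly against the training examples given above (assuming the support predicates of the hypothesis space, including cons/3 and list/2, are loaded as well):

?- simplify([[[],[b,a,a]],[]],NL).     /* NL = [b,a] */
?- remove([a,a],OL).                   /* OL = [a]   */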

The learning task takes about 44 seconds on our implementation. However, this is obtained under some special conditions, which are thoroughly discussed in the next sections:

• All the constraints listed above are applied, so that the final hypothesis space is reduced to fewer than one hundred clauses.

• Clauses in the hypothesis space are generated in the correct order, as they must appear in the final program. Moreover, literals in each clause are in the correct position. This is important, since in a logic program with cut the relative position of clauses and literals is significant. As a consequence, we can learn simplify without having to test for different clause and literal orderings (see Subsections 4.2 and 4.5).

• We tell the learning procedure to use at most two cuts per clause. This seems to be quite an intuitive constraint since, in fact, many classical logic programs have no more than one cut per clause (see Subsections 4.1 and 5.4).

4. Problems

Experiments with the above induction procedure have shown that many problems arise when learning logic programs containing cut. In the following, we analyze these problems; this analysis is a major contribution of the present paper. As cut cannot be evaluated extensionally, the analysis is general, and does not depend on the specific induction method adopted. Some possible partial solutions will be discussed in Section 5.

4.1 Problem 1: Intensional Evaluation, Backtracking and Cut

The learning procedure of Section 2 is very simple, but it can be inefficient. However, we believe this is common to every intensional method, because clauses cannot be learned independently of one another. As a consequence, backtracking cannot be avoided, and this can have some impact on the complexity of the learning process. Moreover, cut must be added to every trace covering negative examples. If no constraints are in force, the possibilities range from a single cut in the whole trace to a cut between every two literals of each clause in the trace. Clearly, the number of possibilities is exponential in the number of literals in the trace. Fortunately, this number is usually much smaller than the size of the hypothesis space, as it depends on the depth of the positive examples.

However, backtracking also has some advantages; in particular, it can be useful to search for alternative solutions. These alternative programs can then be compared on the basis of any required characteristic, such as simplicity or efficiency. For example, using backtracking we discovered a version of simplify equivalent to the one given, but without the cut predicate between the two recursive calls of the fourth clause of flatten.

4.2 Problem 2: Ordering of Clauses in the Trace

In a logic program containing cut, the mutual position of clauses is significant, and a different ordering can lead to a different (perhaps wrong) behavior of the program. For example, the following program for intersection:

c1) int(X,S2,Y) :- null(X), null(Y).
c2) int(X,S2,Y) :- head(X,H), tail(X,Tail), member(H,S2), !, int(Tail,S2,S), cons(H,S,Y).
c3) int(X,S2,Y) :- head(X,H), tail(X,Tail), int(Tail,S2,Y).

behaves correctly only if c2 comes before c3. Suppose the hypothesis space given in input to the induction procedure consists of the same three clauses as above, but with c3 before c2. If ¬int([a],[a],[]) is given as a negative example, then the learning task fails, because clauses c1 and c3 derive that example.
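To see why, consider the SLD derivation with c3 tried before c2 (a sketch, with cons(H,S,Y) read as Y = [H|S]):

    int([a],[a],Y)    by c3:  head([a],a), tail([a],[]), int([],[a],Y)
    int([],[a],Y)     by c1:  null([]), null(Y)          -->  Y = []

so c1 and c3 together derive the negative example. With c2 first, member(a,[a]) succeeds and the cut commits to c2, which computes Y = [a] as the only answer.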

In other words, learning a program containing cut means learning not only a set of clauses, but also a specific ordering for those clauses. In terms of our induction procedure this means that for every trace T covering some negative example, we must check not only every position for inserting cuts, but also every possible clause ordering in the trace. This "generate and test" behavior is not difficult to implement, but it can dramatically decrease the performance of the learning task. In the worst case all possible permutations must be generated and checked, and this requires a time proportional to (md)! for a trace of md clauses.[2]

The necessity of testing different permutations of clauses in a trace is a primary source of inefficiency when learning programs with cut, and probably the most difficult problem to solve.

4.3 Problem 3: Kinds of Given Examples

Our induction procedure is only able to learn programs which are traces, i.e., where every clause in the program is used to derive at least one positive example. When learning definite clauses, this is not a problem, because derivation is monotone, and for every program P, complete and consistent w.r.t. the given examples, there is a program P′ ⊆ P which is also complete and consistent and is a trace.[3] On the other hand, when learning clauses containing cut, it may happen that the only complete and consistent program in the hypothesis space is neither a trace nor contains one as a subset. This is because derivation is no longer monotone, and it can be the case that a negative example is derived by a set of clauses, but not by a superset of them, as in the following simple example:

S = { sum(A,B,C) :- A > 0, !, M is A-1, sum(M,B,N), C is N+1.
      sum(A,B,C) :- C is B. }

sum_pos(0,2,2).     sum_neg(2,2,2).

The two clauses in the hypothesis space represent a complete and consistent
program for the given examples, but our procedure is unable to learn it.
Observe that the negative example is derived by the second clause, which
is a trace for the positive example, but not by the first and the second
together.
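Concretely (a sketch of the two behaviors):

/* Second clause alone, the trace for sum_pos(0,2,2): */
?- sum(2,2,C).     /* C = 2: the negative example is derived */

/* Both clauses together: the cut commits to the first clause when A > 0: */
?- sum(2,2,C).     /* C = 4 only: sum_neg(2,2,2) is not derived */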

This problem can be avoided if we require that, for every negative example, a corresponding positive example with the same input be given (in the above case, the example required is sum_pos(2,2,4)). In this way, if a complete program exists in the hypothesis space, then it is also a trace, and can be learned. It can then be made consistent using cut, in order to rule out the derivation of negative examples. The constraint on positive and negative examples seems to be quite intuitive. In fact, when writing a program, a programmer usually thinks in terms of what a program should compute on given inputs, and then tries to avoid wrong computations for those inputs.

[2] It must be noted that if we are learning programs for two different predicates, of j and k clauses respectively (that is, md = j+k), then we have to consider not (j+k)! different programs, but only j!+k!. We can do better if, inside a program, it is known that non-recursive clauses have a fixed position, and can be put before or after all the recursive clauses.

[3] A learned program P is complete if it derives all the given positive examples, and it is consistent if it does not derive any of the given negative examples.

4.4 Problem 4: Ordering of Given Examples

When learning clauses with cut, even the order of the positive examples may be significant. In the example above, if sum_pos(2,2,4) comes after sum_pos(0,2,2), then the learning task fails to learn a correct program for sum, because it cannot find a program consistent w.r.t. the first positive example and the negative one(s).

In general, for a given set of positive examples this problem can be remedied by testing different example orderings. Again, in the worst case k! different orderings of a set of k positive examples must be checked. Moreover, in some situations a favorable ordering does not exist. Consider the following hypothesis space:

c1) int(X,Y,W) :- head(X,A), tail(X,B), notmember(A,Y), int(B,Y,W).
c2) int(X,Y,W) :- head(X,A), tail(X,B), notmember(A,Y), !, int(B,Y,W).
c3) int(X,Y,Z) :- head(X,A), tail(X,B), int(B,Y,W), cons(A,W,Z).
c4) int(X,Y,Z) :- head(X,A), tail(X,B), !, int(B,Y,W), cons(A,W,Z).
c5) int(X,Y,Z) :- null(Z).

together with the set of examples:

e1) int_pos([a],[b],[]).
e2) int_pos([a],[a],[a]).
e3) int_neg([a],[b],[a]).
e4) int_neg([a],[a],[]).

Our induction procedure will not be able to find a correct program for any
ordering of the two positive examples, even if such a program does exist
([c2,c4,c5]). This program is the union of two traces: [c2,c5], which covers
e1, and [c4,c5], which covers e2. Both of these traces are inconsistent,
because the first covers e4, and the second covers e3. This problem can be
remedied only if all the positive examples are derived before the check against
negative examples is done.

However, in that case we have a further loss of efficiency, because some inconsistent traces are discarded only at the end. In other words, we would need to learn a program covering all the positive examples, and then make it consistent by using cut and by reordering clauses. Moreover, there may be no way to make a program consistent by using cut and reorderings. As a consequence, all the time used to build that program is wasted. As an example, suppose we are given the following hypothesis space:

c01) int(X,Y,Z) :- head(X,A), tail(X,B), int(B,Y,W), cons(A,W,Z).
c02) int(X,Y,Z) :- null(X), null(Z).
c03) int(X,Y,Z) :- null(Z).

with the examples:

e01) int_pos([a],[a],[a]).
e02) int_pos([a,b],[c],[]).
e03) int_neg([a],[b],[a]).

Then we can learn the trace [c01,c02] from e01 and the trace [c03] from e02.
But [c01,c02,c03] covers e03, and there is no way to make it consistent using
cut or by reordering its clauses. In fact, the first partial trace is responsible
for this inconsistency, and hence the time used to learn [c03] is totally
wasted.

Here it is also possible to understand why we need flattened clauses. Consider the following program for intersection, which is equivalent to [c2,c4,c5], but with the three clauses unflattened:

u2) int([A|B],Y,W) :- notmember(A,Y), !, int(B,Y,W).
u4) int([A|B],Y,[A|W]) :- !, int(B,Y,W).
u5) int(_,_,[]).

Now, this program covers int_neg([a],[a],[]), i.e., [u2,u4,u5] ⊢ int([a],[a],[]). In fact, clause u2 fails on this example because a is a member of [a]. Clause u4 fails because the empty list cannot be matched with [A|W]. But clause u5 succeeds because its arguments match those of the negative example. As a consequence, this program would be rejected by the induction procedure.

The problem is that, if we use unflattened clauses, it may happen that a clause body is not evaluated because an example does not match the head of the clause. As a consequence, possible cuts in that clause are not evaluated and cannot influence the behavior of the entire program. In our example, the cut in clause u4 has no effect because the output argument of int([a],[a],[]) does not match [A|W], and the body of u4 is not evaluated at all. Then u5 is fired and the negative example is covered. In the flattened version, clause c4 fails only when cons(a,[],[]) is reached, but at that point a cut is in force and clause c5 cannot be activated. Note that program [u2,u4,u5] behaves correctly on the query int([a],[a],X), and gives X=[a] as the only output.
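Both behaviors can be checked directly on [u2,u4,u5]:

?- int([a],[a],X).
/* X = [a] only: u2 fails, u4's head matches and its cut then excludes u5 */

?- int([a],[a],[]).
/* succeeds: u2 fails, u4's head does not match ([] does not unify with
   [A|W]), so its cut is never executed, and u5 succeeds */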

4.5 Problem 5: Ordering of Literals

Even the relative position of literals and cut in a clause is significant. Consider again the correct program for intersection as above ([c2,c4,c5]), but with c4 modified by putting the cons literal at the front of the antecedent:

c04) int(X,Y,Z) :- cons(A,W,Z), head(X,A), tail(X,B), int(B,Y,W).

Then, there is no way to get a correct program for intersection using this clause. To rule out the negative example int_neg([a],[a],[]) we must put a cut before the cons predicate, in order to prevent the activation of c5. But, then, some positive examples are no longer covered, such as int_pos([a],[],[]). In fact, we have a wrong behavior every time clause c04 is called and fails, since it prevents the activation of c5. In general, this problem cannot be avoided even by reordering clauses: if we put c04 after c2 and c5, then int_neg([a],[a],[]) will be covered. As a consequence, we should also test for every possible permutation of literals in every clause of a candidate program.

5. Situations where Learning Cut is still Practical

From the above analysis, learning cut appears to be difficult since, in general, a learning procedure should be able to backtrack on the candidate base programs (e.g., traces), on the position of cut(s) in the program, on the order of the clauses in the program, on the order of literals in the clauses, and on the order of the given positive examples. However, we have identified some general conditions under which learning cut could still be practical. Clearly, these conditions cannot be a final solution to learning cut but, if applicable, they can alleviate the computational problems of the task.

5.1 Small Hypothesis Space

First of all, a restricted hypothesis space is necessary. If clauses cannot be learned independently of one another, a small hypothesis space helps to limit the backtracking required on candidate traces (problem 1). Moreover, the number of clauses in a trace would probably be smaller, and hence also the number of different permutations and the number of different positions for inserted cuts (problems 2 and 1). A small trace would also have a slight positive impact on the need to test for different literal orderings in clauses (problem 5).

In general, many kinds of constraints can be applied to keep a hypothesis space small, such as ij-determinism (Muggleton & Feng, 1990), rule sets and schemata (Kietz & Wrobel, 1991; Bergadano & Gunetti, 1993), determinations (Russell, 1988), locality (Cohen, 1993), etc. (in fact, some of these restrictions and others, such as those listed in Section 3, are available in the actual implementation of our procedure; see the Appendix[4]). Moreover, candidate recursive clauses must be designed so that no infinite chains of recursive calls can take place (Bergadano & Gunetti, 1993), otherwise the learning task itself could be non-terminating. In general, the number of possible recursive calls must be kept small, in order to avoid too much backtracking when searching for possible traces. However, general constraints may not be sufficient. The hypothesis space must be designed carefully from the very beginning, and this can be difficult. In the example of learning simplify, an initial hypothesis space of "only" 8449 clauses was obtained by specifying not only the set of required predicates, but even the variables occurring in every literal.

If clauses cannot be learned independently, experiments have shown that a dramatic improvement of the learning task can be obtained by generating the clauses in the hypothesis space so that recursive clauses, and in general more complex clauses, are taken into consideration after the simpler and non-recursive ones. Since simpler and non-recursive clauses require less time to be evaluated, they will have a small impact on the learning time. Moreover, learning simpler (i.e., shorter) clauses also alleviates problem 5.

[4] We found these constraints particularly useful. By using them we were often able to restrict a hypothesis space by one order of magnitude without ruling out any possible solution.

Finally, it must be noted that our induction procedure does not necessarily require that the hypothesis space S of possible clauses be represented explicitly. The learning task could start with an empty set S and an implicit description of the hypothesis space, for example the one given in Section 3. When a positive example cannot be derived from S, a new clause is requested from a clause generator and added to S. This step is repeated until the example is derivable from the updated S, and then the learning task can proceed normally.

5.2 Simple Examples

Another improvement can be achieved by using examples that are as simple as possible. In fact, each example which may involve a recursive call is potentially responsible for the activation of all the corresponding clauses in the hypothesis space. The more complex the example, the larger the number of consecutive recursive activations of clauses and the larger the number of traces to be considered for backtracking (problem 1). For instance, to learn the append relation, it may be sufficient to use an example like append([a],[b],[a,b]) instead of one like append([a,b,c,d],[b],[a,b,c,d,b]). Since simple examples would probably require a smaller number of different clauses to be derived, this would result in smaller traces, alleviating the problem of permuting clauses and literals in a trace (problems 2 and 5) and decreasing the number of positions for cuts (problem 1).

5.3 Small Number of Examples

Since a candidate program is formed by taking the union of partial traces learned for single examples, if we want a small trace (problems 2 and 5) we must use as few examples as possible, while still completely describing the required concept. In other words, we should avoid redundant information. For example, if we want to learn the program for append, it will normally be sufficient to use only one of the two positive examples append([a],[b],[a,b]) and append([c],[d],[c,d]). Obviously, it may happen that different examples are derived by the same set of clauses, and in this case the final program does not change.

Since we have to check all possible orderings of the set of positive examples, a small number of examples is also a solution to problem 4. Fortunately, experiments have shown that normally very few positive examples are needed to learn a program, and hence the corresponding number of different orderings is, in any case, small. Moreover, since in our method one positive example is sufficient to learn all the clauses necessary to derive it, most of the time a complete program can be learned using only one well-chosen example. If such an example can be found (as in the case of the learning task of Section 3, where only one example of simplify and one of remove are given), the computational problem of testing different example orderings is automatically solved.

However, it must be noted that, in general, a small number of examples may not be sufficient, except for very simple programs. In fact, if we want to learn logic programs such as member, append, reverse and so on, then any example involving recursion will be sufficient. But for more complex programs the choice may not be trivial. For example, our procedure is able to learn the quicksort (plus partition) program with only one "good" example. But if one does not know how quicksort and partition work, it is likely that he or she will provide an example that allows only a partial description of partition to be learned. This is particularly clear in the example of simplify. Had we used the positive example simplify_pos([[[],[b,a,a]]],[b,a]) (which is very close to the one actually used), the first clause of flatten would not have been learned. In other words, to give few examples we must give good examples, and often this is possible only by having in mind (at least partially and in an informal way) the target program. Moreover, for complex programs, good examples can mean complex examples, and this is in contrast with the previous requirement. For further studies of learning from good examples we refer the reader to the work of Ling (1991) and Aha, Ling, Matwin and Lapointe (1993).

5.4 Constrained Positions for Cut and Literals

Experiments have shown that it is not practical to allow the learning procedure to test all possible positions of cut in a trace, even if we are able to keep the number of clauses in a trace small. The user must be able to indicate the positions where a cut is allowed to occur, e.g., at the beginning of a clause body, or before a recursive call. In this case, many alternative programs with cut are automatically ruled out and thus do not have to be tested against the negative examples. It may also be useful to limit the maximum number of cuts per clause or per trace. For example, most of the time one cut per clause is sufficient to learn a correct program. In the actual implementation of our procedure, it is in fact possible to specify the exact position of cut w.r.t. a literal or a group of literals within each clause of the hypothesis space, when this information is known.

To eliminate the need to test for different orderings of literals (problem 5), we may also impose a particular global order, which must be maintained in every clause of the hypothesis space. However, this requires deep knowledge of the program we want, otherwise some (or even all) solutions will be lost. Moreover, this solution can conflict with the use of constrained positions for cut, since a solution program for a particular literal ordering and particular cut positions may not exist.

6. Conclusion

Our induction procedure is based on an intensional evaluation of clauses. Since the cut predicate has no declarative meaning, we believe that intensional evaluation of clauses cannot be abandoned, independently of the kind of learning method adopted. This can decrease the performance of the learning task, compared with extensional methods, which examine clauses one at a time without backtracking. However, the computational problems outlined in Section 4 remain even if we choose to learn a complete program extensionally, and then try to make it consistent by inserting cut. The only difference is that we do not have backtracking (problem 1), but the situation is probably worse, since extensional methods can fail to learn a complete program even if it exists in the hypothesis space (Bergadano, 1993a).

Even if the ability to learn clauses containing procedural predicates like cut seems to be fundamental to learning "real" logic programs, in particular short and efficient ones, many problems influencing the complexity of the learning task must be faced. These include the number and the relative ordering of clauses and literals in the hypothesis space, and the kind and relative ordering of the given examples. Such problems seem to be related to the need for an intensional evaluation of clauses in general, and not to the particular learning method adopted. Even just to alleviate these problems, it seems necessary to know a lot about the target program. An alternative solution is simply to ignore some of the problems, that is, to avoid testing for different clause and/or literal and/or example orderings. Clearly, in this way the learning process can become feasible, but it can fail to find a solution even when one exists. However, many ILP systems (such as Foil) adopt such an "incomplete-but-fast" approach, guided by heuristic information.

As a consequence, we view the results presented in this paper as, at least partially, negative. The problems we raised appear computationally difficult, and suggest that attention should be restricted to purely declarative logic languages, which are, in any case, sufficiently expressive.

Acknowledgements

This work was in part supported by BRA ESPRIT project 6020 on Inductive Logic Programming.

Appendix A

The induction procedure of Section 2 is written in C-Prolog (interpreted) and runs on a Sun SPARCstation 1. We are planning to translate it into Quintus Prolog. This Appendix contains a simplified description of its implementation. As a preliminary step, in order to record a trace of the clauses deriving a positive example e+, every clause in the hypothesis space[5] S must be numbered and modified by adding two literals to its body. The first one, allowed(n,m), is used to activate only the clauses which must be checked against the negative examples. The second one, marker(n), is used to remember that clause number n has been successfully used while deriving e+. Hence, in general, a clause in the hypothesis space S takes the following form:

P(X1,...,Xm) :- allowed(n,m), γ, marker(n).

where γ is the actual body of the clause, n is the number of the clause in the set and m is a number used to deal with cuts. For every clause n, the version without cut is augmented with allowed(n,0), while those containing a cut somewhere in their body are augmented with allowed(n,1), allowed(n,2), and so on. Moreover, for every augmented clause as above, a fact "alt(n,m)." is inserted in S, in order to implement an enumeration mechanism.
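For instance, taking the third clause of flatten from the learned program of Section 3 as clause number 3, with one cut alternative, the transformation just described would produce (our reconstruction of the scheme, for illustration only):

flatten(X,L) :- allowed(3,0), equal(X,[L1]), flatten(L1,X2), assign(X2,L), marker(3).
flatten(X,L) :- allowed(3,1), equal(X,[L1]), !, flatten(L1,X2), assign(X2,L), marker(3).
alt(3,0).
alt(3,1).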

A simplified (but running) version of the learning algorithm is reported below. In the algorithm, the output, if any, is the variable Trace containing the list of the (numbers of the) clauses representing the learned program P. By using the backtracking mechanism of Prolog, more than one solution (trace) can be found. We assume that the two predicates listpositive and listnegative build a list of the given positive and negative examples, respectively.

consult(file containing the set of clauses S).

[5] We assume clauses in the hypothesis space to be flattened.

allowed(X,0).
marker(X) :- assert(trace(X)).
marker(X) :- retract(trace(X)), !, fail.

main :- listpositive(Posexamplelist), tracer([],Posexamplelist,Trace).

tracer(Covered,[Example|Cdr],Trace) :-
    Example,                          /* backtracking point 1 */
    setof(L,trace(L),Trace1),
    notneg(Trace1,[Example|Covered],Cdr),
    tracer([Example|Covered],Cdr,Trace).
tracer(_,[],Trace) :-
    setof((I,J),allowed(I,J),Trace),
    asserta((marker(X) :- true, !)).

assertem([]).
assertem([I|Cdr]) :- alt(I,J), backassert(allowed(I,J)), assertem(Cdr).

prep(T) :- retract(allowed(X,0)), assertem(T).

backassert(X) :- assert(X).
backassert(X) :- retract(X), !, fail.

resetallowed([]) :- !.
resetallowed(_) :- abolish(allowed,2), assert(allowed(X,0)), !.

notneg(T,Covered,Remaining) :- listnegative([]).
notneg(T,Covered,Remaining) :-
    listnegative(Negexamplelist),
    asserta((marker(X) :- true, !)),
    prep(T),                          /* backtracking point 2 */
    trypos(Covered),
    trynegs(Negexamplelist),
    resetallowed(Remaining),
    retract((marker(X) :- true, !)).
notneg(T,Covered,Remaining) :-
    resetallowed(Remaining),
    retract((marker(X) :- true, !)), !, fail.

trypos([Example|Cdr]) :- Example, !, trypos(Cdr).
trypos([]) :- !.

trynegs([Example|Cdr]) :- Example, !, fail.
trynegs([Example|Cdr]) :- trynegs(Cdr).
trynegs([]) :- !.

Actually, our complete implementation is more complex, partly in order to achieve greater efficiency. The behavior of the learning task is quite simple. Initially, the set S of clauses is read into the Prolog interpreter, together with the learning algorithm. Then the learning task can be started by calling the predicate main. A list of the positive examples is formed and the tracer procedure is called on that list. For every positive example, tracer calls the example itself, firing all the clauses in S that may be resolved against that example. Observe that, initially, an allowed(X,0) predicate is asserted in the database: in this way only clauses not containing a cut are allowed to be used (this is because clauses with cut are employed only if some negative example is derived). Then a trace, if any, of (the numbers associated with) the clauses successfully used in the derivation of that example is built, using the setof predicate.

The trace is added to the traces found for the previous examples, and the result is checked against the set of negative examples by calling the notneg procedure. If notneg does not fail (i.e., no negative examples are covered by this trace), then a new positive example is taken into consideration. Otherwise, notneg modifies the trace with cut and tests it again. If this also fails, backtracking occurs and a new trace for the current example (and possibly for the previous ones) is searched for.

The notneg procedure works as follows. First, only the clauses in the trace are allowed to be checked against the negative examples, by retracting the allowed(X,0) clause and asserting an allowed(n,0) if the n-th clause (without cut) is in the trace. This is done with the prep and assertem predicates. Then a list of the negative examples is formed and we check whether they can be derived from the clauses in the trace. If at least one negative example is covered (i.e., if trynegs fails), then we backtrack to the prep procedure (backtracking point 2), where a clause of the trace is substituted with an equivalent one with cut inserted somewhere (or in a different position). If no correct program can be found in this way by trying all possible alternatives (i.e., by using cut in all possible ways), notneg fails, and backtracking to backtracking point 1 occurs, where another trace is searched for. Otherwise, all clauses in S without cut are reactivated by asserting allowed(X,0) again, and the next positive example is considered. Note that trypos is used in notneg to verify that a modified trace still derives the set of positive examples derived initially. The possibility of substituting clauses in the current trace with others having cut inserted somewhere is achieved through the alt predicate in the assertem procedure. Finally, note that this simplified version of the learning procedure is not able to generate and test different orderings of clauses in a trace or different orderings of literals in each clause, nor to use different orderings for the set of positive examples.

In order to derive all the positive examples before the check against the negative ones (see Subsection 4.4), we must change the first clause of the tracer procedure into:

tracer([Pos1, ... ,Posn]) :- Pos1, ... , Posn, setof(L,trace(L),T), notneg(T).

The actual implementation of the above induction procedure is available through ftp. For further information contact gunetti@di.unito.it.

References

Aha, D., Ling, C., Matwin, S., & Lapointe, S. (1993). Learning singly recursive relations from small datasets. In Proceedings of the IJCAI-93 Workshop on ILP.

Bergadano, F. (1993a). Inductive database relations. IEEE Transactions on Knowledge and Data Engineering, 5(6).

Bergadano, F. (1993b). Test case generation by means of learning techniques. In Proceedings of ACM SIGSOFT-93.

Bergadano, F., & Gunetti, D. (1993). An interactive system to learn functional logic programs. In Proceedings of IJCAI-93.

Coelho, H., & Cotta, J. C. (1988). Prolog by Example: How to Learn, Teach and Use It. Berlin: Springer-Verlag.

Cohen, W. (1993). Rapid prototyping of ILP systems using explicit bias. In Proceedings of the IJCAI-93 Workshop on ILP.

De Raedt, L., Lavrac, N., & Dzeroski, S. (1993). Multiple predicate learning. In Proceedings of IJCAI-93.

Kietz, J. U., & Wrobel, S. (1991). Controlling the complexity of learning in logic through syntactic and task-oriented models. In Muggleton, S. (Ed.), Inductive Logic Programming. London: Academic Press.

Lau, K. K., & Clement, T. (Eds.). (1993). Logic Program Synthesis and Transformation. Berlin: Springer-Verlag.

Ling, X. C. (1991). Learning from good examples. In Proceedings of IJCAI-91.

Muggleton, S. (Ed.). (1991). Inductive Logic Programming. London: Academic Press.

Muggleton, S., & Feng, C. (1990). Efficient induction of logic programs. In Proceedings of the First Conference on Algorithmic Learning Theory.

Quinlan, R. (1990). Learning logical definitions from relations. Machine Learning, 5, 239-266.

Rouveirol, C. (in press). Flattening: A representation change for generalization. Machine Learning.

Russell, S. (1988). Tree-structured bias. In Proceedings of AAAI-88.

Shapiro, E. Y. (1983). Algorithmic Program Debugging. Cambridge, MA: MIT Press.