
Inductive Logic Programming

Inductive logic programming [24] is situated at the intersection of machine learning and data mining on the one hand, and logic programming on the other. It shares with the former fields the goal of finding patterns in data, patterns that can be used to build predictive models or to gain insight into the data. With logic programming it shares the use of clausal first-order logic as a representation language for both data and hypotheses. In the remainder of this text we will use some basic notions from logic programming, such as literals, conjunctive queries, and variable substitutions. We will use Prolog notation throughout the paper. For an introduction to Prolog and logic programming see [10].

Inductive logic programming can be used for many different purposes, and the problem statements found in ILP papers consequently vary. In this article we consider the so-called learning from interpretations setting [16,14]. It has been argued elsewhere that this setting, while slightly less powerful than the standard ILP setting (it has problems with, e.g., learning recursive predicates), is sufficient for most practical purposes and scales up better [6].

We formulate the learning task in such a way that it covers a number of different problem statements. More specifically, we consider the problem of determining, for a set of conjunctive queries, for which instantiations of certain variables each query succeeds. These variables are called key variables, and a grounding substitution for them is called a key instantiation. The intuition is that each example in the learning task is uniquely identified by a single key instantiation.

The link with ILP systems that learn clauses is then as follows. The search performed by an ILP system is directed by regularly evaluating candidate clauses. Let us denote such a candidate clause by $Head(X) \leftarrow Body(X,Y)$ where $X$ represents a vector of variables appearing in the head of the clause and $Y$ represents additional variables that occur in the body. We assume that the head is a single literal and that a list of examples is given, where each example is of the form $Head(X)\theta$ with $\theta$ a substitution that grounds $X$. Examples may be labelled (e.g., as positive or negative), but this is not essential in our setting. While an example can be represented as a fact $Head(X)\theta$ when learning definite Horn clauses, we can also consider it just a tuple $X\theta$. Both notations will be used in this paper.

Intuitively, when positive and negative examples are given, one wants to find a clause that covers as many positive examples as possible, while covering few or no negatives. Whether a single example $Head(X)\theta$ is covered by the clause or not can be determined by running the query $?- Body(X,Y)\theta$. In other words, evaluating a clause boils down to running a number of queries consisting of the body of the clause. For simplicity of notation, we will often denote a conjunctive query by just the conjunction (without the $?-$ symbol).
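As a minimal sketch (not taken from any particular ILP system), this coverage test can be written in Prolog as follows; the predicates body_of/2 and covered/1 and the placeholder relations r/2 and s/1 are introduced here purely for illustration.

      % Sketch only: r/2 and s/1 stand for relations in the deductive
      % database; a few facts are included so the sketch runs as-is.
      r(a, b).   r(b, c).   s(a).

      % body_of(Key, Body): the body of a candidate clause, indexed by
      % the tuple Key of its head variables.
      body_of((X,Y), (r(X,Z), r(Z,Y), s(X))).

      % covered(Key): the example identified by the ground key tuple Key
      % is covered iff the instantiated body query has a solution.
      covered(Key) :-
          body_of(Key, Body),   % binds the head variables in Body to Key
          once(Body).           % one solution is enough

      % ?- covered((a, c)).   succeeds        ?- covered((b, a)).   fails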

In some less typical ILP settings, the ILP algorithm does not search for Horn clauses but rather for general clauses, e.g., CLAUDIEN [15], or for frequent patterns that can be expressed as conjunctive queries, e.g., WARMR [18]. These settings can be handled by our approach as well: all that is needed is a mapping from hypotheses to queries that allows these hypotheses to be evaluated. Such a mapping is defined by [15] for CLAUDIEN; for WARMR it is trivial.

Given a set of queries $S$ and a set of examples $E$, the main task is to determine which queries $Q \in S$ cover which examples $e \in E$. We formalise this using the notion of a result set:

Definition 1 (Result set)   The result set of a set of queries $S$ in a deductive database $D$ for key $K$ and example set $E$ is

\begin{displaymath}RS(S,K,D,E) = \{ (K\theta,i) \vert Q_i \in S \mbox{ and } K\theta \in E \mbox{ and } Q_i\theta \mbox{ succeeds in }D\}\end{displaymath}



Similarly to the learning from interpretations setting defined in [14], the problem setting can now be stated as follows:

Given: a set of conjunctive queries $S$, a deductive database $D$, a tuple $K$ of variables that occur in each query in $S$, and an example set $E$.

Find: the result set $RS(S,K,D,E)$; i.e., find for each query $Q$ in $S$ those ground instantiations $\theta$ of $K$ for which $K\theta \in E$ and $Q\theta$ succeeds in $D$.

Example 1   Assume an ILP system learning a definition for grandfather/2 wants to evaluate the following hypotheses:
      grandfather(X,Y) :- parent(X,Z), parent(Z,Y), male(X).
      grandfather(X,Y) :- parent(X,Z), parent(Z,Y), female(X).

Examples are of the form grandfather($gf$,$gc$) where $gf$ and $gc$ are constants; hence each example is uniquely identified by a ground substitution of the tuple $(X,Y)$. So in the above problem setting the set of Prolog queries $S$ equals $\{($?- parent(X,Z), parent(Z,Y), male(X)$)$, $($?- parent(X,Z), parent(Z,Y), female(X)$) \}$ and the key $K$ equals $(X,Y)$. Given a query $Q_i \in S$, finding all tuples $(x,y)$ for which $((x,y), i) \in R$ (with $R$ the result set as defined above) is equivalent to finding which of the grandfather($x$,$y$) facts in the example set are predicted by the clause grandfather(X,Y) :- $Q_i$.
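For illustration only, the following sketch sets up a small hypothetical database together with the two query bodies of Example 1 and computes, per query, the key instantiations from the example set for which it succeeds; the predicates example/1, query/3 and covered_keys/2 and all constants are invented for this sketch.

      % Illustration only: a tiny database, a hypothetical example set,
      % and the two query bodies of S.
      parent(jef, an).   parent(an, tom).   parent(mia, an).
      male(jef).   male(tom).   female(an).   female(mia).

      example((jef, tom)).          % keys (X,Y) of the examples in E
      example((mia, tom)).

      query(1, (X,Y), (parent(X,Z), parent(Z,Y), male(X))).
      query(2, (X,Y), (parent(X,Z), parent(Z,Y), female(X))).

      % covered_keys(I, Keys): the key instantiations in E for which
      % query I succeeds, i.e. the tuples contributing (Key, I) to R.
      covered_keys(I, Keys) :-
          findall(Key,
                  ( example(Key), query(I, Key, Body), once(Body) ),
                  Keys).

      % ?- covered_keys(1, Keys).   gives Keys = [(jef,tom)]
      % ?- covered_keys(2, Keys).   gives Keys = [(mia,tom)]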

The generality of our problem setting follows from the fact that once it is known which queries succeed for which examples, the statistics and heuristics that typical ILP systems use can be readily obtained from this. A few examples:

After transforming the grandfather/2 clauses into

      grandfather((X,Y),I) :- parent(X,Z), parent(Z,Y), male(X), I = 1.
      grandfather((X,Y),I) :- parent(X,Z), parent(Z,Y), female(X), I = 2.
the result set can clearly be computed by collecting, for all grounding substitutions $\theta$ with $K\theta \in E$, the answers to the query ?- grandfather($K\theta$,I). In Section 3 the queries will have a literal I = i at the end, or another goal whose side effects collect the result set.
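Assuming the transformed grandfather/2 clauses above together with the small database and example/1 facts of the earlier sketch, this collection step could look roughly as follows (result_set/1 is again a name of our own choosing):

      % Sketch: collect, for every example key, the identifiers I of the
      % transformed clauses whose body succeeds on it.
      result_set(RS) :-
          findall(Key-I,
                  ( example(Key), grandfather(Key, I) ),
                  Pairs),
          sort(Pairs, RS).   % drop duplicate answers; only existence matters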

In practice, it is natural to compute the result set using a double loop, one over the examples and one over the queries, and one has the choice as to which is the outer loop. Both the ``examples in outer loop'' and the ``queries in outer loop'' strategies have been used in data mining systems; in the context of decision trees, see for instance [25] and [22]. We shall see further on that the redundancy removal approach we propose uses the ``examples in outer loop'' strategy. In both approaches, however, given a query and a key instantiation, we are interested only in whether the query succeeds for that key instantiation. This implies that after a particular query has succeeded on an example, its execution can be stopped.

In other words: computing the result set defined above boils down to evaluating each query on each example, where we are only interested in the existence of success for each such evaluation. Computing more than one solution for one query on one example is unnecessary.
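Reusing the hypothetical example/1 and query/3 predicates of the earlier sketch, the two loop orders differ only in the order of the enumerating goals, and in both once/1 stops each evaluation after its first success:

      % Sketch of the two strategies; both record exactly the pairs of
      % the result set and never compute more than one solution per
      % query/example combination.
      result_set_examples_outer(RS) :-          % examples in outer loop
          findall(Key-I,
                  ( example(Key), query(I, Key, Body), once(Body) ),
                  RS).

      result_set_queries_outer(RS) :-           % queries in outer loop
          findall(Key-I,
                  ( query(I, Key, Body), example(Key), once(Body) ),
                  RS).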

