J. M. Siskind

Volume 15, 2001

**Links to Full Text:**

- PDF format
- PS format
- Online HTML Version
- Appendix - Software and video sequences

This paper presents an implemented system for recognizing the occurrence of events described by simple spatial-motion verbs in short image sequences. The semantics of these verbs is specified with event-logic expressions that describe changes in the state of force-dynamic relations between the participants of the event. An efficient finite representation is introduced for the infinite sets of intervals that occur when describing liquid and semi-liquid events. Additionally, an efficient procedure using this representation is presented for inferring occurrences of compound events, described with event-logic expressions, from occurrences of primitive events. Using force dynamics and event logic to specify the lexical semantics of events allows the system to be more robust than prior systems based on motion profile.

Journal of Artiï¬cial Intelligence Research 15 (2001) 31-90 Submitted 8/00; published 8/01 Grounding the Lexical Semantics of Verbs in Visual Perception using Force Dynamics and Event Logic Jeffrey Mark Siskind QOBI@RESEARCH.NJ.NEC.COM NEC Research Institute, Inc. 4 Independence Way Princeton, NJ 08540 USA Abstract This paper presents an implemented system for recognizing the occurrence of events described by simple spatial-motion verbs in short image sequences. The semantics of these verbs is speciï¬edwith event-logic expressions that describe changes in the state of force-dynamic relations betweenthe participants of the event. An efï¬cient ï¬nite representation is introduced for the inï¬nite setsof intervals that occur when describing liquid and semi-liquid events. Additionally, an efï¬cientprocedure using this representation is presented for inferring occurrences of compound events, de-scribed with event-logic expressions, from occurrences of primitive events. Using force dynamicsand event logic to specify the lexical semantics of events allows the system to be more robust thanprior systems based on motion proï¬le. 1. Introduction If one were to look at the image sequence in Figure 1(a), one could describe the event depictedin that sequence by saying Someone picked the red block up off the green block. Similarly, ifone were to look at the image sequence in Figure 1(b), one could describe the event depicted inthat sequence by saying Someone put the red block down on the green block. One way that onerecognizes that the former is a pick up1 event is that one notices a state change in the force-dynamicrelations between the participant objects. Prior to Frame 13, the red block is supported by thegreen block by a substantiality constraint, the fact that solid objects cannot interpenetrate (Spelke,1983; Baillargeon, Spelke, & Wasserman, 1985; Baillargeon, 1986, 1987; Spelke, 1987, 1988).From Frame 13 onward, it is supported by being attached to the hand. Similarly, one way that onerecognizes that the latter is a put down event is that one notices the reverse state change in Frame 14.This paper describes an implemented computer system, called LEONARD, that can produce similarevent descriptions from such image sequences. A novel aspect of this system is that it produces eventdescriptions by recognizing state changes in force-dynamic relations between participant objects.Force dynamics is a term introduced by Talmy (1988) to describe a variety of causal relationsbetween participants in an event, such as allowing, preventing, and forcing. In this paper, I useforce dynamics in a slightly different sense, namely, to describe the support, contact, and attachmentrelations between participant objects. A number of systems have been reported that can produce event descriptions from video input. Examples of such systems include the work reported in Yamoto, Ohya, and Ishii (1992), Starner(1995), Siskind and Morris (1996), Brand (1996), and Bobick and Ivanov (1998). LEONARD differs 1. Throughout this paper, I treat verb-particle constructions, like pick up, as atomic verbs, despite the fact that the verb and its associated particle may be discontinuous. Methods for deriving the semantics of a verb-particle constructioncompositionally from its (discontinuous) constituents are beyond the scope of this paper. c 2001 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved.

SISKIND (a) Frame 0 Frame 1 Frame 2 Frame 13 Frame 14 Frame 20 (b) Frame 1 Frame 13 Frame 14 Frame 22 Frame 23 Frame 27 Figure 1: Image sequences depicting (a) a pick up event and (b) a put down event. 32

GROUNDING THE LEXICAL SEMANTICS OF VERBS from these prior systems in two crucial ways. First, the prior systems classify events based on themotion proï¬le of the participant objects. For example, Siskind and Morris (1996) characterize a pickup event as a sequence of two subevents: the agent moving towards the patient while the patient isat rest above the source, followed by the agent moving with the patient away from the source whilethe source remains at rest. Similarly, a put down event is characterized as the agent moving withthe patient towards the destination while the destination is at rest, followed by the agent movingaway from the patient while the patient is at rest above the destination. In contrast, LEONARDcharacterizes events as changes in the force-dynamic relations between the participant objects. Forexample, a pick up event is characterized as a change from a state where the patient is supported bya substantiality constraint with the source to a state where the patient is supported by being attachedto the agent. Similarly, a put down event is characterized as the reverse state change. Irrespectiveof whether motion proï¬le or force dynamics is used to recognize events, event recognition is aprocess of classifying time-series data. In the case of motion proï¬le, this time-series data takes theform of relative-and-absolute positions, velocities, and accelerations of the participant objects as afunction of time. In the case of force dynamics, this time-series data takes the form of the truthvalues of force-dynamic relations between the participant objects as a function of time. This leadsto the second difference between LEONARD and prior systems. The prior systems use stochasticreasoning, in the form of hidden Markov models, to classify the time-series data into event types.In contrast, LEONARD uses logical reasoning, in the form of event logic, to do this classiï¬cation. Using force dynamics and event logic (henceforth the ânew approachâ) to recognize events of- fers several advantages over using motion proï¬le and hidden Markov models (henceforth the âpriorapproachâ). First, the new approach will correctly recognize an event despite a wider variance inmotion proï¬le than the prior approach. For example, when recognizing, say, a pick up event, theprior approach is sensitive to aspects of event execution, like the approach angle and velocity of thehand, that are irrelevant to whether or not the event is actually a pick up. The new approach is notsensitive to such aspects of event execution. Second, the new approach will correctly recognize anevent despite the presence of unrelated objects in the ï¬eld of view. The prior approach computesthe relative-and-absolute positions and motions of all objects and pairs of objects in the ï¬eld ofview. It then selects the subset of objects whose positions and motions best matched some model.This could produce incorrect descriptions when some unintended subset matched some unintendedmodel better than the intended subset matched the intended model. The new approach does notexhibit such deï¬ciencies. Extraneous objects typically do not exhibit the precise sequence of statechanges in force-dynamic relations needed to trigger the event-classiï¬cation process and thus willnot generate spurious claims of event occurrences. Third, the new approach performs temporal andspatial segmentation of events. The prior approach matches an entire image sequence against anevent model. It fails if that image sequence depicts multiple event executions, either in sequenceor in parallel. In contrast, the new approach can segment a complex image sequence into a collec-tion of sequential and/or overlapping events. In particular, it can handle hierarchal events, such asmove, that consist of a pick up event followed by a put down event. It can recognize that all threeevents, and precisely those three events, occur in an appropriate image sequence whereas the priorapproach would try to ï¬nd the single best match. Finally, the new approach robustly detects thenon-occurrence of events as well as the occurrence of events. The prior approach always selects thebest match and reports some event occurrence for every image sequence. Thresholding the matchcost does not work because an approach based on motion proï¬le can be fooled into triggering recog-nition of an event occurrence by an event whose motion proï¬le is similar to one or more target event 33

SISKIND (a) Frame 4 Frame 11 Frame 12 Frame 18 Frame 19 Frame 21 (b) Frame 0 Frame 6 Frame 7 Frame 12 Frame 13 Frame 18 Figure 2: Image sequences depicting non-events. classes even though that event is not actually in any of those target event classes. Consider, forexample, the two image sequences in Figure 2. Suppose that an event-recognition system containedtwo target event classes, namely pick up and put down. Neither of the image sequences depict pickup or put down events. Nonetheless, the prior approach might mistakingly classify Figure 2(a) as apick up event because the second half of this image sequence matches the second half of the motionproï¬le of a pick up event. Alternatively, it might mistakingly classify this image sequence as a putdown event because the ï¬rst half of this image sequence matches the ï¬rst half of the motion proï¬leof a put down event. Similarly, the prior approach might mistakingly classify Figure 2(b) as a pickup event because the ï¬rst half of this image sequence matches the ï¬rst half of the motion proï¬leof a pick up event. Alternatively, it might mistakingly classify this image sequence as a put downevent because the second half of this image sequence matches the second half of the motion proï¬leof a put down event. In contrast, the new approach correctly recognizes that neither of these imagesequences exhibit the necessary state changes in force-dynamic relations to qualify as either pick upor put down events. All four of these advantages will be discussed in greater detail in Section 5. 34

GROUNDING THE LEXICAL SEMANTICS OF VERBS The techniques described in this paper have been implemented in a system called LEONARD. LEONARD is a comprehensive system that takes image sequences as input and produces event de- scriptions as output. The overall architecture of LEONARD is shown in Figure 6. The input to LEONARD consists of a sequence of images taken by a Canon VC-C3 camera and Matrox Me- teor frame grabber at 320Ã240 resolution at 30fps. This image sequence is ï¬rst processed by asegmentation-and-tracking component. A real-time colour- and motion-based segmentation algo-rithm places a convex polygon around each coloured and moving object in each frame. A trackingalgorithm then forms a correspondence between the polygons in each frame and those in temporallyadjacent frames. The output of the segmentation-and-tracking component consists of a sequence ofscenes, each scene being a set of polygons. Each polygon is represented as a sequence of imagecoordinates corresponding to a clockwise traversal of the polygonâs vertices. The tracker guar-antees that each scene contains the same number of polygons and that they are ordered so thatthe ith polygon in each scene corresponds to the same object. Figure 3 shows the output of thesegmentation-and-tracking component on the image sequences from Figure 1. The polygons havebeen overlayed on the input images for ease of comprehension. This scene sequence is passed to a model-reconstruction component. This component produces a force-dynamic model of each scene. This model speciï¬es three types of information: whichobjects are grounded, i.e. are supported by an unseen mechanism that is not associated with any vis-ible object, which objects are attached to other objects by rigid or revolute joints, and the qualitativedepth of each object, i.e. a qualitative representation of the relative distance of different objects inthe ï¬eld of view from the observer, in the form of a same layer relation specifying which objectsare at the same qualitative depth. Figure 4 shows the output of the model-reconstruction componenton the scene sequences from Figure 3. The models are depicted graphically, overlayed on the inputimages, for ease of comprehension. The details of this depiction scheme will be described momen-tarily. For now, it sufï¬ces to point out that Figure 4(a) shows the red block on the same layer as thegreen block up through Frame 1 and attached to the hand from Frame 14 onward. Figure 4(b) showsthe reverse sequence of relations, with the red block attached to the hand up through Frame 13 andon the same layer as the green block from Frame 23 onward. This model sequence is passed to an event-classiï¬cation component. This component ï¬rst de- termines the intervals over which certain primitive event types are true. These primitive event typesinclude SUPPORTED(x), SUPPORTS(x, y), CONTACTS(x, y), and ATTACHED(x, y). This compo-nent then uses an inference procedure to determine the intervals over which certain compoundevent types are true. These compound event types include PICKUP(x, y, z), PUTDOWN(x, y, z), STACK(w, x, y, z), UNSTACK(w, x, y, z), MOVE(w, x, y, z), ASSEMBLE(w, x, y, z), andDISASSEMBLE(w, x, y, z) and are speciï¬ed as expressions in event logic over the primitive event types. The output of the event-classiï¬cation component consists of an indication of which com-pound event types occurred in the input movie as well as the subsequence(s) of frames during whichthose event types occurred. Figure 5 shows the output of the event-classiï¬cation component on themodel sequences from Figure 4. The subsequences of frames during which the events occurred aredepicted as spanning intervals. Spanning intervals will be described in Section 4.1. LEONARD is too complex to describe completely in one paper. This paper provides a detailed description of the event-classiï¬cation component and, in particular, the event-logic inference pro-cedure. The segmentation and tracking algorithms are extensions of the algorithms presented inSiskind and Morris (1996) and Siskind (1999), modiï¬ed to place convex polygons around the par-ticipant objects instead of ellipses. The model-reconstruction techniques are extensions of those 35

SISKIND (a) Frame 0 Frame 1 Frame 2 Frame 13 Frame 14 Frame 20 (b) Frame 1 Frame 13 Frame 14 Frame 22 Frame 23 Frame 27 Figure 3: The output of the segmentation-and-tracking component applied to the image sequences from Figure 1. (a) depicts a pick up event. (b) depicts a put down event. The polygonshave been overlayed on the input images for ease of comprehension. 36

GROUNDING THE LEXICAL SEMANTICS OF VERBS (a) Frame 0 Frame 1 Frame 2 Frame 13 Frame 14 Frame 20 (b) Frame 1 Frame 13 Frame 14 Frame 22 Frame 23 Frame 27 Figure 4: The output of the model-reconstruction component applied to the scene sequences from Figure 3. (a) depicts a pick up event. (b) depicts a put down event. The models havebeen overlayed on the input images for ease of comprehension. In (a), the red block ison the same layer as the green block up through Frame 1 and is attached to the handfrom Frame 14 onward. In (b), the reverse sequence of relations holds, with the red blockattached to the hand up through Frame 13 and on the same layer as the green block fromFrame 23 onward. 37

SISKIND (PICK-UP MOVING RED GREEN)@[[0,1],[14,22]) (SUPPORTED? RED)@[[0:22])(SUPPORTED? MOVING)@[[1:13]), [[24:26])(SUPPORTS? RED MOVING)@[[1:13]), [[24:26])(SUPPORTS? MOVING RED)@[[13:22])(SUPPORTS? GREEN RED)@[[0:14])(SUPPORTS? GREEN MOVING)@[[1:13])(CONTACTS? RED GREEN)@[[0:2]), [[6:14])(ATTACHED? RED MOVING)@[[1:26])(ATTACHED? RED GREEN)@[[1:6]) (a) (PUT-DOWN MOVING RED GREEN)@[[0,14],[23,32]) (SUPPORTED? MOVING)@[[14:23])(SUPPORTED? RED)@[[0:32])(SUPPORTS? MOVING RED)@[[0:14])(SUPPORTS? RED MOVING)@[[14:23])(SUPPORTS? GREEN MOVING)@[[14:23])(SUPPORTS? GREEN RED)@[[14:32])(CONTACTS? RED GREEN)@[[22:32])(ATTACHED? MOVING RED)@[[0:23])(ATTACHED? RED GREEN)@[[14:22]) (b) Figure 5: The output of the event-classiï¬cation component applied to the model sequences from Figure 4. Note that the pick up event is correctly recognized in (a) and the put down eventis correctly recognized in (b). 38

GROUNDING THE LEXICAL SEMANTICS OF VERBS image scene model event sequence sequence sequence labels Segmentation Model Event and Tracking Reconstruction Classification Figure 6: The overall architecture of LEONARD. presented in Siskind (1997, 2000). The model-reconstruction techniques will be described brieï¬ybelow to allow the reader to understand the event-classiï¬cation techniques without reference tothose papers. 2. Model Reconstruction Certain properties of objects are visible. For example, position, orientation, shape, size, colour,texture, and so forth. Furthermore, relational variants of these properties are also visible, as wellas changes in such properties and relations over time. In contrast, force-dynamic properties andrelations are not visible. One cannot see the fact that the door knob is attached to, and supported by,the door. One must infer that fact using physical knowledge of the world. Such knowledge includesthe fact that unsupported objects fall and attachment is one way of offering support. Using physi-cal knowledge to infer force-dynamic properties and relations was ï¬rst discussed by Siskind (1991,1992, 1993). This later became known as the perceiver framework advanced by Jepson and Richards(1993). The perceiver framework states that perception involves four levels. First, one must spec-ify the observables, what properties and relations can be discerned by direct observation. Second,one must specify an ontology, what properties and relations must be inferred from the observables.Descriptions of the observables in terms of such properties and relations are called interpretations.There may be multiple interpretations of a given observation. Third, one must specify a theory,a way of differentiating consistent interpretations from inconsistent ones. The consistent interpre-tations are the models of the observation. There may be multiple models of a given observation.Finally, one must specify a preference relation, a way of ordering the models. The most-preferredmodels of the observations are the percepts. One can instantiate the perceiver framework for dif-ferent observables, ontologies, theories, and preference relations. Siskind (1991, 1992, 1993, 1994,1995, 1997) instantiated this framework for a kinematic theory applied to simulated video. Mann,Jepson, and Siskind (1996, 1997) and Mann and Jepson (1998) instantiated this framework for adynamics theory applied to real video. Siskind (2000) instantiated this framework for a kinematictheory applied to real video. This paper uses this later approach. The input to the model-reconstruction process consists of a sequence of scenes, each scene be- ing a set of convex polygons. Each polygon is represented as a sequence of points correspondingto a clockwise traversal of the polygonâs vertices. The tracker guarantees that each scene containsthe same number of polygons and that they are ordered so that the ith polygon in each scene corre-sponds to the same object. The output of the model-reconstruction process consists of a sequence ofinterpretations, one interpretation per scene. The interpretations are formulated out of the followingprimitive properties of, and relations between, the objects in each scene. GROUNDED(p) Polygon p is grounded. It is constrained to occupy a ï¬xed position and orientation by an unseen mechanism that is not associated with any visible object and thus cannot moveeither translationally or rotationally. 39

SISKIND 0 0 2 1 Figure 7: The graphical method for depicting interpretations that is used in this paper. The symbol â â indicates that a polygon is grounded. A solid circle indicates a rigid joint. A hollowcircle indicates a revolute joint. Two polygons with the same layer index are on the samelayer. RIGID(p, q, r) Polygons p and q are attached by a rigid joint at point r. Both the relative position and orientation of p and q are constrained. REVOLUTE(p, q, r) Polygons p and q are attached by a revolute joint at point r. The relative posi- tion of p and q is constrained but the relative orientation is not. SAMELAYER(p, q) Polygons p and q are on the same layer. Layers are a qualitative representation of depth, or distance from the observer. This representation is impoverished. There is nonotion of âin-front-ofâ or âbehindâ and there is no notion of adjacency in depth. The onlyrepresentable notion is whether two objects are on the same or different layers. The same-layer relation is constrained to be an equivalence relation, i.e. it must be reï¬exive, symmetric,and transitive. Furthermore, two objects on the same layer must obey the substantiality con-straint, the constraint that they not interpenetrate (Spelke, 1983; Baillargeon et al., 1985;Baillargeon, 1986, 1987; Spelke, 1987, 1988). An interpretation I is a 4-tuple: GROUNDED, RIGID, REVOLUTE, SAMELAYER . Throughout this paper, interpretations will be depicted graphically, overlayed on scene images, for ease of com-prehension. Figure 7 gives a sample interpretation depicted graphically. The symbol â â attachedto a polygon indicates that it is grounded. A solid circle indicates that two polygons are rigidlyattached at the center of the circle. A hollow circle indicates that two polygons are attached by arevolute joint at the center of the circle. The same-layer relation is indicated by giving a layer index,a small nonnegative integer, to each polygon. Polygons with the same layer index are on the samelayer, while those with different layer indices are on different layers. Model reconstruction can be viewed as a generate-and-test process. Initially, all possible inter- pretations are generated for each scene. Then, inadmissible and unstable interpretations are ï¬lteredout. Admissibility and stability can be collectively viewed as a consistency requirement. The sta-ble admissible interpretations are thus models of a scene. The nature of the theory guarantees thatthere will always be at least one model for each scene, namely the model where all objects aregrounded. There may, however, be multiple models for a given scene. Therefore, a preference re- 40

GROUNDING THE LEXICAL SEMANTICS OF VERBS lation is then applied through a sequence of circumscription processes (McCarthy, 1980) to selectthe minimal, or preferred, models for each scene. While there will always be at least one mini-mal model for each scene, there may be several, since the preference relation may not induce atotal order. If there are multiple minimal models for a given scene, one is chosen arbitrarily as themost-preferred model for that scene. The precise details of the admissibility criteria, the stabilitychecking algorithm, the preference relations, and the circumscription process are beyond the scopeof this paper. They are discussed in Siskind (2000). What is important, for the purpose of this paper,is that, given a scene sequence, model reconstruction produces a sequence of interpretations, onefor each scene, and that these interpretations are 4-tuples containing the predicates GROUNDED, RIGID, REVOLUTE, and SAMELAYER. Figure 4 shows sample interpretation sequences produced by the model-reconstruction component on the scene sequences from Figure 3. 3. Event Logic Model reconstruction determines the truth values of the force-dynamic relations on a frame-by-frame basis in the input movie. Intervals of constant truth value for a given force-dynamic rela-tion are taken to be primitive event occurrences. LEONARD uses event logic to infer compoundevent occurrences from primitive event occurrences. For example, for the image sequence in Fig-ure 1(a), model reconstruction determines that the green block supports the red block from Frame 0to Frame 13 and that the hand is attached to the red block from Frame 13 to Frame 20. This will bedenoted as SUPPORTS(green-block, red-block)@[0, 13) and ATTACHED(hand, red-block)@[13, 20), i.e. that the primitive event typesSUPPORTS(green-block, red-block) and ATTACHED(hand, red-block) occurred during the inter- vals [0, 13) and [13, 20) respectively. The compound event type PICKUP(hand, red-block, green-block) might be deï¬ned as SUPPORTS(green-block, red-block); ATTACHED(hand, red-block) i.e. SUPPORTS(green-block, red-block) followed by ATTACHED(hand, red-block). (In the above,I use â;â informally as a sequence operator. The precise deï¬nition of â;â will be given momentarily.)The task of the event-logic inference procedure is to infer PICKUP(hand, red-block, green-block)@[0, 20), i.e. that the compound event typePICKUP(hand, red-block, green-block) occurred during the interval [0, 20). Event logic provides a calculus for forming compound event types as expressions over primitive event types. The syntax and semantics of event logic will be described momentarily. Event-logicexpressions denote event types, not event occurrences. As such, they do not have truth values.Rather, they are predicates that describe the truth conditions that must hold of an interval for anevent to occur. In contrast, an event-occurrence formula does have a truth value. If Î¦ is an event-logic expression that denotes a primitive or compound event type, and i is an interval, then Î¦@iis an atomic event-occurrence formula that is true if and only if the truth conditions for the eventtype Î¦ hold of the interval i. Î¦@i denotes coincidental occurrence, the fact that an occurrence of Î¦ started at the beginning of i and ï¬nished at the end of i. Î¦@i would not hold if an occurrence of Î¦ did not precisely coin- 41

SISKIND cide with i, but instead overlapped,2 partially or totally, with i. Event types have internal temporalstructure that render this distinction important. In the case of primitive event types, that structure issimple. Each primitive event type is derived from a predicate. If Ï is a predicate, then Ï denotes theprimitive event type derived from Ï. A primitive event type Ï holds of an interval if the correspond-ing predicate Ï holds of every instant in that interval.3 This means that Â¬(Ï@i) and Â¬Ï@i mighthave different truth values. For example, if Ï is true of every instant in [0, 2) and false of every otherinstant, then Â¬(Ï@[1, 3)) is true while Â¬Ï@[1, 3) is false. Event logic takes coincidental occurrenceto be a primitive notion. As will be demonstrated below, overlapping occurrence is a derived notionthat can be expressed in terms of coincidental occurrence using compound event-logic expressions. Two auxiliary notions are needed to deï¬ne the syntax and semantics of event logic. First, there are thirteen possible relations between two intervals. Following Allen (1983), I denote these rela-tions as =, <, >, m, mi, o, oi, s, si, f, ï¬, d, and di and refer to them collectively as Allen relationsthroughout the paper. The names m, o, s, f, and d are mnemonics for meet, overlap, start, ï¬nish,and during respectively. The inverse relations, such as mi, whose names end in i are the same asthe corresponding relations, such as m, whose names do not end in i except that the arguments arereversed. Figure 8 depicts all thirteen Allen relations graphically. Second, I deï¬ne the span of twointervals i and j, denoted SPAN(i, j), as the smallest super-interval of both i and j. The syntax of event logic is deï¬ned as follows. We are given ï¬nite disjoint sets of constant symbols along with a ï¬nite set of primitive event-type symbols, each of a speciï¬ed arity. Constantsymbols, such as red-block and hand, denote objects in the input movie while primitive event-typesymbols, such as SUPPORTS, denote parameterized primitive event types. An atomic event-logicexpression is a primitive event-type symbol of arity n applied to a sequence of n constants. Forexample, SUPPORTS(hand, red-block). An event-logic expression is either an atomic event-logicexpression or one of the compound event-logic expressions Â¬Î¦, Î¦ â¨ Î¨, Î¦ â§R Î¨, or 3RÎ¦, where Î¦and Î¨ are event-logic expressions and R â =, <, >, m, mi, o, oi, s, si, f, ï¬, d, di. Informally, the semantics of compound event-logic expressions is deï¬ned as follows: â¢ Â¬Î¦ denotes the non-occurrence of Î¦. An occurrence of Â¬Î¦ coincides with i if no occurrence of Î¦ coincides with i. Note that (Â¬Î¦)@i could be true, even if an occurrence of Î¦ overlappedwith i, so long as no occurrence of Î¦ coincided with i. â¢ Î¦ â¨ Î¨ denotes the occurrence of either Î¦ or Î¨. â¢ Î¦ â§R Î¨ denotes the occurrence of both Î¦ and Î¨. The occurrences of Î¦ and Î¨ need not be simultaneous. The subscript R speciï¬es a set of allowed Allen relations between theoccurrences of Î¦ and Î¨. If occurrences of Î¦ and Î¨ coincide with i and j respectively, andirj for some r â R, then an occurrence of Î¦ â§R Î¨ coincides with the span of i and j. Iabbreviate the special case Î¦ â§= Î¨ simply as Î¦ â§ Î¨ without any subscript. Î¦ â§ Î¨ describesan aggregate event where both Î¦ and Î¨ occur simultaneously. I also abbreviate the specialcase Î¦ â§m Î¨ as Î¦; Î¨. Î¦; Î¨ describes an aggregate event where an occurrence of Î¦ isimmediately followed by an occurrence of Î¨. 2. I am using the term âoverlapâ here in a different sense than the o relation from Allen (1983). Here, I am using the term overlap in the sense that two intervals overlap if they have a nonempty intersection. This corresponds to theunion of the o, oi, s, si, f, ï¬, d, and di relations from Allen (1983). 3. To deal with noise, in the actual implementation, the primitive event type Ï is derived from the predicate Ï by ï¬rst passing Ï through a low-pass ï¬lter that takes Ï(t) to be the majority vote of a ï¬ve-frame window centered on t. 42

GROUNDING THE LEXICAL SEMANTICS OF VERBS x x x < y x > y y y x x x m y x mi y y y x x x o y x oi y y y x x x s y x si y y y x x x f y x fi y y y x x x d y x di y y y x x = y y Figure 8: The Allen relations, the thirteen possible relations between two intervals. 43

SISKIND â¢ An occurrence of 3RÎ¦ coinciding with i denotes an occurrence of Î¦ at some other inter- val j such that jri for some r â R. 3R can act as a tense operator. Expressions such as 3<Î¦, 3>Î¦, 3mÎ¦, and 3miÎ¦ specify that Î¦ happened in the noncontiguous past,noncontiguous future, contiguous past, or contiguous future respectively. The 3R operatorcan also be used to derive overlapped occurrence from coincidental occurrence. An occur-rence of 3=,o,oi,s,si,f,ï¬,d,diÎ¦ coincides with i if an occurrence of Î¦ overlaps with i. I ab-breviate 3=,o,oi,s,si,f,ï¬,d,diÎ¦ simply as 3Î¦ without any subscript. Note that while (Â¬Î¦)@iindicates that no occurrence of Î¦ coincided with i, (Â¬3Î¦)@i indicates that no occurrenceof Î¦ overlapped with i. Formally, the truth of an atomic event-occurrence formula Î¦@i is deï¬ned relative to a model. Let I be the set of all intervals and O be the set of all objects in the movie. A model M is a mapfrom primitive event-type symbols of arity n to subsets of I Ã O Ã Â· Â· Â· Ã O. (M can be viewed as n either a model or as a movie.) M thus maps primitive event-type symbols to relations that take aninterval as their ï¬rst argument, in addition to the remaining object parameters. The semantics ofevent logic is formally deï¬ned by specifying an entailment relation M |= Î¦@i as follows: â¢ M |= p(t1, . . . , tn)@i if and only if i, t1, . . . , tn â M(p). â¢ M |= (Â¬Î¦)@i if and only if M |= Î¦@i. â¢ M |= (Î¦ â¨ Î¨)@i if and only if M |= Î¦@i or M |= Î¨@i. â¢ M |= (Î¦ â§R Î¨)@i if and only if there exist two intervals j and k such that i = SPAN(j, k), jrk for some r â R, M |= Î¦@j, and M |= Î¨@k. â¢ M |= (3RÎ¦)@i if and only if there exists some interval j such that jri for some r â R and M |= Î¦@j. Figure 9 shows the primitive event types currently used by LEONARD. The deï¬nitions in Fig- ure 9 are formulated in terms of the predicates GROUNDED, RIGID, REVOLUTE, and SAMELAYERthat are produced by model reconstruction as described in Section 2. An object is supported if it isnot grounded. Two objects contact if their polygons touch and they are on the same layer. Two poly-gons p and q touch, denoted TOUCHES(p, q), if they intersect and their intersection has zero area.Two objects are attached if there is a rigid or revolute joint that joins them. Determining whetheran object x supports an object y is a little more complex and requires counterfactual reasoning.Let P be the set of polygons in a scene, I = GROUNDED, RIGID, REVOLUTE, SAMELAYER bethe most-preferred model of the scene as produced by model reconstruction, and STABLE(P, I) betrue if the scene P is stable under an interpretation I. Stability analysis, i.e. the STABLE predicate,is a subroutine used by the model-reconstruction component. An object x supports an object y if yis not grounded in the most-preferred model I and a variant of P with x removed is not stable undera variant of I where all objects except for y and those rigidly attached, directly or indirectly, to y aregrounded. In Figure 9, P x denotes the variant of P with x removed, RIGATT(x, y) denotes thefact that x is rigidly attached to y, RIGATTâ denotes the reï¬exive transitive closure of the RIGATTrelation, z|Â¬RIGATTâ(z, y) denotes the set of objects that are rigidly attached, directly or indi-rectly, to y, and GROUNDED âª z|Â¬RIGATTâ(z, y), RIGID, REVOLUTE, SAMELAYER denotes 44

GROUNDING THE LEXICAL SEMANTICS OF VERBS x = y = x = y SUPPORTED(x) = Â¬GROUNDED(x) RIGATT(x, y) = (âr)RIGID(x, y, r) ï£® Â¬GROUNDED(y)â§ ï£¹ ï£¯ ï£« GROUNDED âª z|Â¬RIGATTâ(z, y), ï£¶ ï£º SUPPORTS(x, y) = ï£¯ï£¯ RIGID, ï£º ï£¯ Â¬STABLE ï£¬ P x, ï£· ï£º ï£¯ ï£¬ REVOLUTE, ï£· ï£º ï£° ï£¬ï£ ï£· ï£º SAMELAYER ï£¸ ï£» CONTACTS(x, y) = TOUCHES(x, y) â§ SAMELAYER(x, y) ATTACHED(x, y) = (âr)RIGID(x, y, r) â¨ REVOLUTE(x, y, r) Figure 9: Deï¬nition of the primitive event types used by LEONARD. a variant of I where all objects except for y and those rigidly attached, directly or indirectly, to yare grounded. Figure 10 shows the compound event-type deï¬nitions currently used by LEONARD. PICKUP(x, y, z) denotes an event type where x picks y up off of z. It is speciï¬ed as a sequence of three intervals, where x is not attached to and does not support y in the ï¬rst interval but is at-tached to and does support y in the third interval. Additionally, z supports y in the ï¬rst intervalbut does not support y in the third interval. Furthermore, several conditions must hold in both theï¬rst and third intervals: x must be unsupported, y must not support either x or z, x and z mustnot support each other, and y must not be attached to z. During the second interval, intermediatebetween the ï¬rst and third intervals, either x is attached to y or y is attached to z.4 Additionally,several conditions must hold throughout the entire event: x, y, and z must be distinct and y mustbe supported. PUTDOWN(x, y, z) denotes an event type where x puts y down on z. It is speciï¬edin a fashion that is similar to PICKUP(x, y, z) but where the three subevents occur in reverse order. STACK(w, x, y, z) denotes an event type where w puts x down on y which is resting on z. It is spec- iï¬ed as PUTDOWN(w, x, y), where z supports but is not attached to y and z is distinct from w, x,and y. UNSTACK(w, x, y, z) denotes an event type where w picks x up off of y which is resting on z.It is speciï¬ed as PICKUP(w, x, y), where z supports but is not attached to y and z is distinct from w,x, and y. MOVE(w, x, y, z) denotes an event type where w picks x up off of y and puts it downon z which is distinct from y. ASSEMBLE(w, x, y, z) denotes an event type where w ï¬rst puts ydown on z then sometime later stacks x on top of y. Finally, DISASSEMBLE(w, x, y, z) denotes 4. Originally, a two-interval deï¬nition was used, consisting of only the ï¬rst and third intervals. Such a deï¬nition better reï¬ects human intuition. This requires that x be unsupported, y not support either x or z, x and z not support eachother, and y not be attached to z throughout the event. Unfortunately, however, the model-reconstruction processhas some quirks. When the hand grasps the patient while the patient is still resting on the source it produces a most-preferred model where all three objects are attached and collectively supported by one of them being grounded. Whilesuch a model is consistent with the theory and is minimal, it does not match human intuition. Pending improvementsin the model-reconstruction process to better reï¬ect human intuition, the compound event-type deï¬nition for pickup was modiï¬ed to reï¬ect the force-dynamic interpretations produced by the current model-reconstruction process.Fortunately, the current model-reconstruction process is robust in reproducing this counterintuitive interpretation. 45

SISKIND an event type where w ï¬rst unstacks x from on top of y (which is resting on z) and then some-time later picks y up off of z. Figure 1 shows sample movies depicting occurrences of the eventtypes PICKUP(x, y, z) and PUTDOWN(x, y, z). Figures 11 through 15 show sample movies de-picting occurrences of the event types STACK(w, x, y, z), UNSTACK(w, x, y, z), MOVE(w, x, y, z), ASSEMBLE(w, x, y, z), and DISASSEMBLE(w, x, y, z) respectively. Nominally, all atomic event-logic expressions are primitive event types. However, we allow giving a name to a compound event-logic expression and using this name in another event-logicexpression as short hand for the named expression with appropriate parameter substitution. Thisis simply a macro-expansion process and, as such, no recursion is allowed. This feature is used inFigure 10 to deï¬ne UNSTACK, MOVE, and DISASSEMBLE in terms of PICKUP; STACK, MOVE,and ASSEMBLE in terms of PUTDOWN; ASSEMBLE in terms of STACK, which is itself deï¬ned interms of PUTDOWN; and DISASSEMBLE in terms of UNSTACK, which is itself deï¬ned in terms of PICKUP. The overall goal of the event-classiï¬cation component is to infer all occurrences of a given set of compound event types from a given set of primitive event occurrences. The model-reconstructioncomponent combined with the primitive event-type deï¬nitions given in Figure 9 produces a set ofprimitive event occurrences for a given scene sequence. Figure 10 lists parameterized compoundevent types. These are instantiated for all tuples of objects in the scene sequence to yield groundcompound event-logic expressions. The event-classiï¬cation component infers all occurrences ofthese compound event types that follow from the set of primitive event occurrences. Let us deï¬neE(M, Î¦) to be i|M |= Î¦@i. The model-reconstruction component combined with the primitiveevent-type deï¬nitions given in Figure 9 produces M . Instantiating the parameterized compoundevent types from Figure 10 for all object tuples yields a set of event-logic expressions. The event-classiï¬cation component computes E(M, Î¦) for every Î¦ in this set. In principle, E(M, Î¦) could by implemented as a straightforward application of the formal se- mantics for event logic as speciï¬ed above. There is a difï¬culty in doing so, however. The primitiveevent types have the property that they are liquid. Liquid events have the following two properties.First, if they are true during an interval i, then they are also true during any subinterval of i. Second,if they are true during two overlapping intervals i and j, then they are also true during SPAN(i, j) andany subinterval of SPAN(i, j). For example, if an object is supported during [1, 10], then it also issupported during [2, 5], [3, 8], and all other subintervals of [1, 10]. Similarly, if an object is supportedduring [1, 5] and [4, 10], then it also is supported during [1, 10] and all of its subintervals. Shoham(1987) introduced the notion of liquidity and Vendler (1967), Dowty (1979), Verkuyl (1989), andKrifka (1992) have observed that many event types have this property. Because the primitive eventtypes are liquid, they will hold over an inï¬nite number of subintervals. This renders the formalsemantics inappropriate for a computational implementation. Even if one limits oneself to intervalswith integral endpoints, the primitive event types will hold over quadratically many subintervals ofthe scene sequence. Furthermore, a straightforward computational implementation of the formal se-mantics would be inefï¬cient, because it requires quantifying over subintervals to implement 3R andquantifying over pairs of subintervals to implement â§R. The central result of this paper is a novelrepresentation, called spanning intervals, that allows an efï¬cient representation of the inï¬nite setsof subintervals over which liquid event types hold along with an efï¬cient inference procedure thatoperates on that representation. This representation, and the inference procedure that implementsE(M, Î¦), are presented in the next section. 46

GROUNDING THE LEXICAL SEMANTICS OF VERBS ï£« Â¬3x = y â§ Â¬3z = x â§ Â¬3z = yâ§ ï£¶ ï£¬ SUPPORTED(y) â§ Â¬3ATTACHED(x, z)â§ ï£¬ ï£· ï£¬ ï£± Â¬3ATTACHED(x, y) â§ Â¬3SUPPORTS(x, y)â§ ï£¼ ï£· ï£¬ ï£® ï£¹ ï£· ï£¬ ï£´ SUPPORTS(z, y)â§ ï£´ ï£· ï£¬ ï£´ ï£¯ ï£º ï£´ ï£· ï£¬ ï£´ ï£¯ ï£º ï£´ ï£· ï£¬ ï£´ ï£¯ Â¬3SUPPORTED(x) â§ Â¬3ATTACHED(y, z)â§ ï£º ; ï£´ ï£· ï£¬ ï£´ ï£¯ ï£º ï£´ ï£· ï£¬ ï£´ï£´ ï£´ ï£´ ï£¯ Â¬3SUPPORTS(y, x) â§ Â¬3SUPPORTS(y, z)â§ ï£º ï£´ ï£·ï£· PICKUP(x, y, z) = ï£¬ï£¬ ï£´ ï£° Â¬ ï£» ï£´ ï£· ï£¬ ï£´ 3SUPPORTS(x, z) â§ Â¬3SUPPORTS(z, x) ï£´ ï£· ï£¬ ï£´ï£² [A ï£´ TTACHED(x, y) â¨ ATTACHED(y, z)] ; ï£½ ï£· ï£¬ ï£· ï£¬ ï£· ï£¬ ï£´ ï£® ATTACHED(x, y) â§ SUPPORTS(x, y)â§ ï£¹ ï£´ ï£· ï£¬ ï£´ ï£´ ï£· ï£¬ ï£´ Â¬3SUPPORTS(z, y)â§ ï£´ ï£· ï£¬ ï£´ ï£¯ ï£º ï£´ ï£· ï£¬ ï£´ ï£¯ Â¬ ï£º ï£´ ï£· ï£¬ ï£´ ï£¯ 3SUPPORTED(x) â§ Â¬3ATTACHED(y, z)â§ ï£º ï£´ ï£· ï£¬ ï£´ ï£¯ Â¬ ï£º ï£´ ï£· ï£ ï£´ï£´ ï£´ ï£´ ï£¯ 3SUPPORTS(y, x) â§ Â¬3SUPPORTS(y, z)â§ ï£º ï£· ï£´ ï£° ï£» ï£´ ï£¸ ï£³ Â¬ ï£´ 3SUPPORTS(x, z) â§ Â¬3SUPPORTS(z, x) ï£´ï£¾ ï£« Â¬3x = y â§ Â¬3z = x â§ Â¬3z = yâ§ ï£¶ ï£¬ SUPPORTED(y) â§ Â¬3ATTACHED(x, z)â§ ï£¬ ï£· ï£¬ ï£± ATTACHED(x, y) â§ SUPPORTS(x, y)â§ ï£¼ ï£· ï£¬ ï£® ï£¹ ï£· ï£¬ ï£´ Â¬ ï£´ ï£· ï£¬ ï£´ ï£¯ 3SUPPORTS(z, y)â§ ï£º ï£´ ï£· ï£¬ ï£´ ï£¯ ï£º ï£´ ï£· ï£¬ ï£´ ï£¯ Â¬3SUPPORTED(x) â§ Â¬3ATTACHED(y, z)â§ ï£º ; ï£´ ï£· ï£¬ ï£´ ï£¯ ï£º ï£´ ï£· ï£¬ ï£´ï£´ ï£¯ Â¬3SUPPORTS(y, x) â§ Â¬3SUPPORTS(y, z)â§ ï£º ï£´ï£´ ï£·ï£· PUTDOWN(x, y, z) = ï£¬ï£¬ ï£´ ï£° Â¬ ï£» ï£´ ï£· ï£¬ ï£´ 3SUPPORTS(x, z) â§ Â¬3SUPPORTS(z, x) ï£´ ï£· ï£¬ ï£´ï£² [A ï£´ TTACHED(x, y) â¨ ATTACHED(y, z)] ; ï£½ ï£· ï£¬ ï£· ï£¬ ï£· ï£¬ ï£´ ï£® Â¬3ATTACHED(x, y) â§ Â¬3SUPPORTS(x, y)â§ ï£¹ ï£´ ï£· ï£¬ ï£´ ï£´ ï£· ï£¬ ï£´ SUPPORTS(z, y)â§ ï£´ ï£· ï£¬ ï£´ ï£¯ ï£º ï£´ ï£· ï£¬ ï£´ ï£¯ Â¬ ï£º ï£´ ï£· ï£¬ ï£´ ï£¯ 3SUPPORTED(x) â§ Â¬3ATTACHED(y, z)â§ ï£º ï£´ ï£· ï£¬ ï£´ ï£¯ Â¬ ï£º ï£´ ï£· ï£ ï£´ï£´ ï£´ ï£´ ï£¯ 3SUPPORTS(y, x) â§ Â¬3SUPPORTS(y, z)â§ ï£º ï£· ï£´ ï£´ ï£´ ï£° ï£» ï£¸ ï£³ Â¬3SUPPORTS(x, z) â§ Â¬3SUPPORTS(z, x) ï£´ï£¾ ï£® Â¬3z = w â§ Â¬3z = x â§ Â¬3z = yâ§ ï£¹ STACK(w, x, y, z) = ï£¯ PUTDOWN(w, x, y) â§ SUPPORTS(z, y)â§ ï£° ï£º Â¬ATTACHED(z, y) ï£» Â¬ U 3z = w â§ Â¬3z = x â§ Â¬3z = yâ§ NSTACK(w, x, y, z) = PICKUP(w, x, y) â§ SUPPORTS(z, y) â§ Â¬ATTACHED(z, y) MOVE(w, x, y, z) = Â¬3y = z â§ [PICKUP(w, x, y); PUTDOWN(w, x, z)] ASSEMBLE(w, x, y, z) = PUTDOWN(w, y, z) â§< STACK(w, x, y, z) DISASSEMBLE(w, x, y, z) = UNSTACK(w, x, y, z) â§< PICKUP(x, y, z) Figure 10: The lexicon of compound event types used by LEONARD. 47

SISKIND Frame 3 Frame 11 Frame 12 Frame 23 Frame 24 Frame 26 Figure 11: An image sequence depicting a stack event. Frame 1 Frame 10 Frame 11 Frame 24 Frame 25 Frame 29 Figure 12: An image sequence depicting an unstack event. 48

GROUNDING THE LEXICAL SEMANTICS OF VERBS Frame 0 Frame 8 Frame 9 Frame 16 Frame 17 Frame 33 Frame 34 Frame 45 Frame 46 Frame 47 Figure 13: An image sequence depicting a move event. 4. An Efï¬cient Representation and Inference Procedure for Event Logic One might try to implement event logic using only closed intervals of the form [q, r], where q â¤ r.Such a closed interval would represent the set p|q â¤ p â¤ r of real numbers. With such closedintervals, one would deï¬ne Allenâs relations as follows: [q1, r1] = [q2, r2] = (q1 = q2) â§ (r1 = r2) [q1, r1] < [q2, r2] = r1 < q2 [q1, r1] > [q2, r2] = q1 > r2 [q1, r1] m [q2, r2] = r1 = q2 [q1, r1] mi [q2, r2] = q1 = r2 [q1, r1] o [q2, r2] = q1 < q2 < r1 < r2 [q1, r1] oi [q2, r2] = r1 > r2 > q1 > q2 [q1, r1] s [q2, r2] = (q1 = q2) â§ (r1 < r2) [q1, r1] si [q2, r2] = (q1 = q2) â§ (r1 > r2) [q1, r1] f [q2, r2] = (q1 > q2) â§ (r1 = r2) [q1, r1] ï¬ [q2, r2] = (q1 < q2) â§ (r1 = r2) [q1, r1] d [q2, r2] = (q1 > q2) â§ (r1 < r2) 49

SISKIND Frame 0 Frame 9 Frame 10 Frame 17 Frame 18 Frame 32 Frame 33 Frame 40 Frame 41 Frame 46 Frame 47 Frame 56 Frame 57 Frame 67 Frame 68 Frame 80 Figure 14: An image sequence depicting an assemble event. 50

GROUNDING THE LEXICAL SEMANTICS OF VERBS Frame 18 Frame 19 Frame 22 Frame 23 Frame 49 Frame 50 Frame 56 Frame 57 Frame 62 Frame 63 Frame 87 Frame 88 Figure 15: An image sequence depicting a disassemble event. [q1, r1] di [q2, r2] = (q1 < q2) â§ (r1 > r2) One difï¬culty with doing so is that it would be possible for more than one Allen relation to holdbetween two intervals when one or both of them are instantaneous intervals, such as [q, q]. Both mand s would hold between [q, q] and [q, r], both mi and si would hold between [q, r] and [q, q],both m and ï¬ would hold between [q, r] and [r, r], both mi and f would hold between [r, r] and [q, r],and =, m, and mi would all hold between [q, q] and itself. To create a domain where exactly oneAllen relation holds between any pair of intervals, let us consider both open and closed intervals.Closed intervals contain their endpoints while open intervals do not. The intervals (q, r], [q, r), and(q, r), where q < r, represent the sets p|q < p â¤ r, p|q â¤ p < r, and p|q < p < r of realnumbers respectively. The various kinds of open and closed intervals can be uniï¬ed into a singlerepresentation Î±[q, r]Î², where Î± and Î² are true or false to indicate the interval being closed or openon the left or right respectively.5 More speciï¬cally, [q, r] denotes [q, r], [q, r] denotes (q, r], T T F T [q, r] denotes [q, r), and [q, r] denotes (q, r). To do this, let us use q T F F F â¤Î± r to mean q â¤ r when Î± is true and q < r when Î± is false. Similarly, let us use q â¥Î± r to mean q â¥ r when Î±is true and q > r when Î± is false. More precisely, q â¤Î± r = (q < r) â¨ [Î± â§ (q = r)] andq â¥Î± r = (q > r) â¨ [Î± â§ (q = r)]. With these, Î±[q, r]Î² represents the set p|q â¤Î± p â¤Î² r of realnumbers. 5. Throughout this paper, I use lowercase Greek letters, such as Î±, Î², Î³, Î´, , and Î¶, to denote Boolean values, lowercase Latin letters, such as p, q, r, i, j, k, and l, to denote real numbers, lowercase bold Latin letters, such as i, j, and k, todenote intervals or spanning intervals, and uppercase Latin letters, such as I, J , and K, to denote sets of intervals orsets of spanning intervals. 51

SISKIND One can extend the deï¬nition of Allenâs relations to both open and closed intervals as follows. The relation i1 = i2 holds if the corresponding endpoints of i1 and i2 are equal and have the sameopenness. The relation i1 < i2 holds if the right endpoint of i1 precedes the left endpoint of i2 orif they are equal and both open. For example, [1, 3] < [4, 5] and [1, 3) < (3, 5], but [1, 3] < (3, 5],[1, 3) < [3, 5], and [1, 3] < [3, 5]. The relation i1 m i2 holds if the right endpoint of i1 equalsthe left endpoint of i2 and one of those endpoints is open while the other is closed. For example,[1, 3] m (3, 5] and [1, 3) m [3, 5] but [1, 3] m / [3, 5] and [1, 3) m / (3, 5]. The relation i1 o i2 holds if â¢ either the left endpoint of i1 precedes the left endpoint of i2 or they are equal while the former is closed and the latter is open, â¢ either the left endpoint of i2 precedes the right endpoint of i1 or they are equal while both endpoints are closed, and â¢ either the right endpoint of i1 precedes the right endpoint of i2 or they are equal while the former is open and the latter is closed. For example, [1, 3] o [2, 4], [1, 3] o (1, 4], [1, 2] o [2, 4], and [1, 4) o [2, 4], but [1, 3] o / [1, 4], [1, 2) o / [2, 4], and [1, 4] o / [2, 4]. The relation i1 s i2 holds if â¢ the left endpoints of i1 and i2 are equal and have the same openness and â¢ either the right endpoint of i1 precedes the right endpoint of i2 or they are equal while the former is open and the latter is closed. For example, [1, 3] s [1, 4], (1, 3] s (1, 4], and [1, 3) s [1, 3], but [1, 3] s/ (1, 4], [1, 3] s/ [1, 3],[1, 3) s/ [1, 3), and [1, 3] s/ [1, 3). The relation i1 f i2 holds if â¢ the right endpoints of i1 and i2 are equal and have the same openness and â¢ either the left endpoint of i1 follows the left endpoint of i2 or they are equal while the former is open and the latter is closed. For example, [2, 4] f [1, 4], [2, 4) f [1, 4), and (2, 4] f [2, 4], but [2, 4) f/ [1, 4], (2, 4] f/ (2, 4],[2, 4] f/ [2, 4], and [2, 4] f/ (2, 4]. The relation i1 d i2 holds if â¢ either the left endpoint of i1 follows the left endpoint of i2 or they are equal while the former is open and the latter is closed and â¢ either the right endpoint of i1 precedes the right endpoint of i2 or they are equal while the former is open and the latter is closed. For example, [2, 3] d [1, 4] and (1, 4) d [1, 4], but [1, 4) d / [1, 4], (1, 4] d / [1, 4], (1, 4) d / (1, 4], and (1, 4) d / [1, 4). The inverse Allen relations >, mi, oi, si, ï¬, and di are deï¬ned analogously to the <, m, o, s, f, and d relations respectively with the arguments reversed. The above deï¬nitions can be stated more precisely as follows: Î± [q = [q = (q 1 1, r1]Î²1 Î±2 2, r2]Î²2 1 = q2) â§ (Î±1 = Î±2) â§ (r1 = r2) â§ (Î²1 = Î²2) (1) Î± [q < [q = r 1 1, r1]Î²1 Î±2 2, r2]Î²2 1 â¤(Â¬Î²1â§Â¬Î±2) q2 (2) 52

GROUNDING THE LEXICAL SEMANTICS OF VERBS Î± [q > [q = q 1 1, r1]Î²1 Î±2 2, r2]Î²2 1 â¥(Â¬Î±1â§Â¬Î²2) r2 (3) Î± [q m [q = (r 1 1, r1]Î²1 Î±2 2, r2]Î²2 1 = q2) â§ (Î²1 = Î±2) (4) Î± [q mi [q = (q 1 1, r1]Î²1 Î±2 2, r2]Î²2 1 = r2) â§ (Î±1 = Î²2) (5) Î± [q o [q = q 1 1, r1]Î²1 Î±2 2, r2]Î²2 1 â¤(Î±1â§Â¬Î±2) q2 â¤(Î²1â§Î±2) r1 â¤(Â¬Î²1â§Î²2) r2 (6) Î± [q oi [q = r 1 1, r1]Î²1 Î±2 2, r2]Î²2 1 â¥(Î²1â§Â¬Î²2) r2 â¥(Î±1â§Î²2) q1 â¥(Â¬Î±1â§Î±2) q2 (7) Î± [q s [q = (q 1 1, r1]Î²1 Î±2 2, r2]Î²2 1 = q2) â§ (Î±1 = Î±2) â§ [r1 â¤(Â¬Î²1â§Î²2) r2] (8) Î± [q si [q = (q 1 1, r1]Î²1 Î±2 2, r2]Î²2 1 = q2) â§ (Î±1 = Î±2) â§ [r1 â¥(Î²1â§Â¬Î²2) r2] (9) Î± [q f [q = [q 1 1, r1]Î²1 Î±2 2, r2]Î²2 1 â¥(Â¬Î±1â§Î±2) q2] â§ (r1 = r2) â§ (Î²1 = Î²2) (10) Î± [q ï¬ [q = [q 1 1, r1]Î²1 Î±2 2, r2]Î²2 1 â¤(Î±1â§Â¬Î±2) q2] â§ (r1 = r2) â§ (Î²1 = Î²2) (11) Î± [q d [q = [q 1 1, r1]Î²1 Î±2 2, r2]Î²2 1 â¥(Â¬Î±1â§Î±2) q2] â§ [r1 â¤(Â¬Î²1â§Î²2) r2] (12) Î± [q di [q = [q 1 1, r1]Î²1 Î±2 2, r2]Î²2 1 â¤(Î±1â§Â¬Î±2) q2] â§ [r1 â¥(Î²1â§Â¬Î²2) r2] (13) With the above deï¬nitions, exactly one Allen relation holds between any pair of intervals. I refer to the set of real numbers represented by an interval as its extension. Given the above deï¬nition of interval, any interval, such as [5, 4], (5, 4], [5, 4), or (5, 4), where the upper endpointis less than the lower endpoint represents the empty set. Furthermore, any open interval, suchas [5, 5), (5, 5], or (5, 5), where the upper endpoint equals the lower endpoint also represents theempty set. To create a situation where the extension of each interval has a unique representation,let us represent all such empty sets of real numbers as . Thus whenever we represent an interval Î±[q, r]Î² explicitly, it will have a nonempty extension and will satisfy the following normalizationcriterion: q â¤(Î±â§Î²) r. 4.1 Spanning Intervals When using event logic, we wish to compute and represent the set I of all intervals over which someevent-logic expression Î¦ holds. Many event types, including all of the primitive event types used in LEONARD, are liquid (Shoham, 1987) in the sense that if some event holds of an interval then that event holds of every subinterval of that interval. With real-valued interval endpoints, this creates theneed to compute and represent an inï¬nite set of intervals for a liquid event. Even limiting ourselvesto integer-valued interval endpoints, a liquid event will require the computation and representationof quadratically many intervals. To address this problem, let us introduce the notion of spanning interval. A spanning interval [i : j] represents the set of all subintervals of [i, j], in other words [q, r]|(i â¤ q â¤ j) â§ (i â¤ r â¤ j).Similarly (i : j], [i : j), and (i : j) represent (q, r]|(i â¤ q â¤ j) â§ (i â¤ r â¤ j),[q, r)|(i â¤ q â¤ j) â§ (i â¤ r â¤ j), and (q, r)|(i â¤ q â¤ j) â§ (i â¤ r â¤ j) respectively. We wishto use spanning intervals to represent the set of all intervals over which the primitive event typeshold and to compute and represent the set of all intervals over which compound event types hold viastructural induction over the compound event-logic expressions. A problem arises however. Giventwo liquid event types Î¦ and Î¨, the compound event type Î¦; Î¨ is not liquid. If Î¦ holds over [i : j)and Î¨ holds over [j : k), then Î¦; Î¨ might not hold over every subinterval of [i, k). It holds over only 53

SISKIND those subintervals that include j. For example, if Î¦ holds over [1 : 10) and Î¨ holds over [8 : 20)then Î¦; Î¨ holds for every interval that starts between 1 and 10 and ends between 8 and 20. Butit doesnât hold for every subinterval of [1, 20). For example, it doesnât hold of [12, 20). I refer tosuch event types as semi liquid. Since spanning intervals are not sufï¬cient to efï¬ciently representsemi-liquid events, let us extend the notion of spanning interval. A spanning interval [[i, j], [k, l]]represents the set of intervals [q, r]|(i â¤ q â¤ j) â§ (k â¤ r â¤ l). Similarly the spanning intervals([i, j], [k, l]], [[i, j], [k, l]), and ([i, j], [k, l]) represent the sets (q, r]|(i â¤ q â¤ j) â§ (k â¤ r â¤ l),[q, r)|(i â¤ q â¤ j) â§ (k â¤ r â¤ l), and (q, r)|(i â¤ q â¤ j) â§ (k â¤ r â¤ l) respectively. This ex-tended notion of spanning interval subsumes the original notion. The spanning intervals [i : j],(i : j], [i : j), and (i : j) can be represented as the spanning intervals [[i, j], [i, j]], ([i, j], [i, j]],[[i, j], [i, j]), and ([i, j], [i, j]) respectively. For reasons that will become apparent in Section 4.4,it is necessary to also allow for spanning intervals where the ranges of endpoint values are open.In other words, we will need to consider spanning intervals like [(i, j], [k, l]] to represent sets like[q, r]|(i < q â¤ j) â§ (i â¤ r â¤ j). All told, there are six endpoints that can independently be eitheropen or closed, namely q, r, i, j, k, and l, yielding sixty four kinds of spanning intervals. These canall be uniï¬ed into a single representation Î±[Î³[i, j]Î´, [k, l]Î¶]Î², where Î±, Î², Î³, Î´, , and Î¶ are true orfalse if the endpoints q, r, i, j, k, and l are closed or open respectively. More precisely, the spanninginterval Î±[Î³[i, j]Î´, [k, l]Î¶]Î² represents the set Î±[q, r]Î²|(i â¤Î³ q â¤Î´ j) â§ (k â¤ r â¤Î¶ l) (14) of intervals. I refer to the set of intervals represented by a spanning interval as its extension. More-over, a set of spanning intervals will represent the union of the extensions of its members. Addi-tionally, the empty set of spanning intervals will represent the empty set of intervals. I further referto the set of intervals represented by a set of spanning intervals as its extension. A key result ofthis paper is that if the set of all intervals over which some set of primitive event types hold canbe represented as ï¬nite sets of spanning intervals then the set of all intervals over which all eventtypes that are expressible as compound event-logic expressions over those primitives hold can alsobe represented as ï¬nite sets of spanning intervals. While we require that all intervals have ï¬nite endpoints, for reasons that will also become appar- ent in Section 4.4, it is necessary to allow spanning intervals to have inï¬nite endpoints, for example[[ââ, j], [k, l]]. Such spanning intervals with inï¬nite endpoints represent sets of intervals withï¬nite endpoints but where the range of possible endpoints is unconstrained from above or below. 4.2 Normalizing Spanning Intervals Just as we desire that the extension of every interval have a unique representation, we also desirethat the extension of every spanning interval have a unique representation. There are a number ofsituations where two different spanning intervals will have the same extension. First, all spanningintervals Î±[Î³[i, j]Î´, [k, l]Î¶]Î² where i = â, j = ââ, k = â, or l = ââ represent the empty set ofintervals, because there are no intervals with an endpoint that is less than or equal to minus inï¬nityor greater than or equal to inï¬nity. Second, if i = ââ, j = â, k = ââ, or l = â, the valueof Î³, Î´, , or Î¶ does not affect the denotation respectively, because there are no intervals with inï¬niteendpoints. Third, if j > l, j can be decreased as far as l without changing the denotation, becauseall intervals where the upper endpoint is less than the lower endpoint equivalently denote the emptyinterval. Similarly, if k < i, k can be increased as far as i without changing the denotation. Fourth, 54

GROUNDING THE LEXICAL SEMANTICS OF VERBS all spanning intervals where i > j or k > l represent the empty set of intervals, because the rangeof possible endpoints would be empty. Fifth, all spanning intervals where i = j and either Î³ or Î´ isfalse (indicating an open range for the lower endpoint) represent the empty set of intervals, becausethe range of possible endpoints would be empty. Similarly, all spanning intervals where k = l andeither or Î¶ is false (indicating an open range for the upper endpoint) also represent the empty set of intervals. Sixth, all spanning intervals where i = l and either Î± or Î² is false (indicating an openinterval) also represent the empty set of intervals, because the endpoints of an open interval must bedifferent. Seventh, if j = l and Î¶ is false, the value of Î´ does not affect the denotation, because ifj = l and Î¶ is false, the upper endpoint must be less than l and the lower endpoint must be less thanor equal to j which equals l, so the lower endpoint must be less than j. Similarly, if k = i and Î³ isfalse, the value of does not affect the denotation. Eighth, if j = l and either Î± or Î² is false, the value of Î´ does not affect the denotation, because the lower endpoint of an open interval must beless than its upper endpoint. Similarly, if k = i and either Î± or Î² is false, the value of does not affect the denotation. To create a situation where the extension of every spanning interval has a unique representation, let us represent all empty sets of intervals as . When the values of i, j, k, l, Î±, Î², Î³, Î´, , or Î¶can be changed without changing the denotation, we will select the tightest such values. In otherwords, false values for the Boolean parameters, maximal values for the lower bounds, and minimalvalues for the upper bounds. Thus whenever we represent a spanning interval Î±[Î³[i, j]Î´, [k, l]Î¶]Î²explicitly, it will have a nonempty extension and will satisfy the following normalization criterion: (1) (i = â) â§ (j = ââ) â§ (k = â) â§ (l = ââ)â§ (2) [(i = ââ) â Â¬Î³] â§ [(j = â) â Â¬Î´] â§ [(k = ââ) â Â¬ ] â§ [(l = â) â Â¬Î¶]â§ (3) (j â¤ l) â§ (k â¥ i)â§ (4) (i â¤ j) â§ (k â¤ l)â§ (5) [(i = j) â¨ (Î³ â§ Î´)] â§ [(k = l) â¨ ( â§ Î¶)]â§ (6) [(i = l) â¨ (Î± â§ Î²)]â§ (7) [(j = l) â§ Â¬Î¶] â Â¬Î´ â§ [(k = i) â§ Â¬Î³] â Â¬ â§ (8) [(j = l) â§ (Â¬Î± â¨ Â¬Î²)] â Â¬Î´ â§ [(k = i) â§ (Â¬Î± â¨ Â¬Î²)] â Â¬ Criteria (1) through (8) correspond to points one through eight above. A spanning interval Î±[Î³[i, j]Î´, [k, l]Î¶]Î² is normalized if i, j, k, l, Î±, Î², Î³, Î´, , and Î¶ cannot be changed without changing its denotation. Given a (potentially non-normalized) spanning interval i,its normalization i is the smallest set of normalized spanning intervals that represents the extension 55

SISKIND of i. One can compute i as follows: ï£± Î±[ ï£´ Î³ [i, j ]Î´ , [k , l]Î¶ ]Î² ï£´ï£´ where j = min(j, l) ï£´ï£´ï£´ k = max(k, i) ï£´ï£´ Î³ = Î³ â§ (i = ââ) ï£´ï£´ï£´ Î´ = Î´ â§ [min(j, l) = â] â§ [(j < l) â¨ (Î¶ â§ Î± â§ Î²)] ï£´ï£´ = â§ [max(k, i) = ââ] â§ [(k > i) â¨ (Î³ â§ Î² â§ Î±)] Î±[Î³ [i, j]Î´ , [k, l]Î¶ ]Î² = ï£² Î¶ = Î¶ â§ (l = â) ï£´ï£´ when (i ï£´ â¤ j ) â§ (k â¤ l)â§ ï£´ï£´ [(i = j ) â (Î³ â§ Î´ )] â§ [(k = l) â ( â§ Î¶ )]â§ ï£´ï£´ [(i = l) ï£´ â (Î± â§ Î²)]â§ ï£´ï£´ (i = â) â§ (j = ââ) â§ (k = â) â§ (l = ââ) ï£´ï£´ï£³ otherwise An important property of spanning intervals is that for any spanning interval i, i contains at mostone normalized spanning interval.6 4.3 Computing the Intersection of Two Normalized Spanning Intervals Given two normalized spanning intervals i1 and i2, their intersection i1 â© i2 is a set of normal-ized spanning intervals whose extension is the intersection of the extensions of i1 and i2. One cancompute i1 â© i2 as follows: Î± [ [i , [k ] â© [ [i , [k ] = 1 Î³1 1, j1]Î´1 1 1, l1]Î¶1 Î²1 Î±2 Î³2 2, j2]Î´2 2 2, l2]Î¶2 Î²2 ï£± Î± [1Î³[max(i1, i2), min(j1, j2)]Î´, [max(k1, k2), min(l1, l2)]Î¶]Î²1 ï£´ï£´ ï£± Î³ ï£´ 1 i1 > i2 ï£´ ï£´ ï£´ where Î³ = ï£² Î³ ï£´ 1 â§ Î³2 i1 = i2 ï£´ï£´ Î³ ï£´ ï£´ 2 i1 < i2 ï£´ ï£³ ï£´ï£´ ï£± Î´ ï£´ 1 j1 < j2 ï£´ ï£´ ï£´ Î´ = ï£² Î´ ï£´ 1 â§ Î´2 j1 = j2 ï£´ï£´ Î´ ï£´ ï£´ 2 j1 > j2 ï£² ï£³ï£± 1 k1 > k2 ï£´ ï£´ ï£´ = ï£² ï£´ 1 â§ 2 k1 = k2 ï£´ï£´ï£´ ï£´ 2 k1 < k2 ï£´ ï£³ ï£´ï£´ ï£± Î¶ ï£´ 1 l1 < l2 ï£´ ï£´ ï£´ Î¶ = ï£² Î¶ ï£´ 1 â§ Î¶2 l1 = l2 ï£´ï£´ Î¶ ï£´ ï£´ 2 l1 > l2 ï£´ ï£³ ï£´ï£´ when (Î±1 = Î±2) â§ (Î²1 = Î²2) ï£´ï£³ otherwise 6. The reason that i contains at most one normalized spanning interval and not exactly one normalized spanning inter- val is that i may denote the empty set of intervals. For example, normalizing the (non-normalized) spanning interval[[10, 10], [1, 1]] yields the empty set. Many of the deï¬nitions in the coming sections compute sets of normalizedspanning intervals as unions of one or more applications of the normalization operator Â· . Each such applicationmight yield either the empty set or a set containing a single normalized spanning interval. This leads to upper, butnot lower, bounds on the size of the computed unions. 56

GROUNDING THE LEXICAL SEMANTICS OF VERBS An important property of normalized spanning intervals is that for any two normalized spanningintervals i1 and i2, i1 â© i2 contains at most one normalized spanning interval. The intuition behind the above deï¬nition is as follows. All of the intervals in the extension of a spanning interval are of the same type, namely [q, r], (q, r], [q, r), or (q, r). The intersection oftwo spanning intervals has a nonempty extension only if the two spanning intervals contain the sametype of intervals in their extension. If they do, and the sets contain intervals whose lower endpoint isbound from below by i1 and i2 respectively, then the intersection will contain intervals whose lowerendpoint is bound from below by both i1 and i2. The resulting bound is open or closed dependingon which of the input bounds is tighter. Similarly for the upper bound on the lower endpoint andthe lower and upper bounds on the upper endpoint. 4.4 Computing the Complement of a Normalized Spanning Interval Given a normalized spanning intervals i, its complement Â¬i is a set of normalized spanning intervalswhose extension is the complement of the extension of i. One can compute Â¬i as follows: ï£« Î±[ [ , [ T ââ, â]T T ââ, k]Â¬ ]Î² âª ï£¶ ï£¬ Î±[ [ââ, â] , ]Î² âª ï£¬ T T Â¬Î¶ [l, â]T ï£· ï£¬ ï£· Î±[ [ [ ] T ââ, i]Â¬Î³ ,T ââ, â]T Î² âª ï£· Â¬ ï£¬ ï£· Î±[Î³ [i, j]Î´ , [k, l]Î¶ ]Î² = ï£¬ ï£¬ Î±[Â¬Î´[j, â] , [ ] T T ââ, â]T Î² âª ï£· ï£¬ ï£· ï£¬ Â¬Î±[ [ , [ ] T ââ, â]T T ââ, â]T Î² âª ï£· ï£¬ ï£· ï£¬ Î±[ [ââ, â] , [ââ, â] ] ï£· ï£ T T T T Â¬Î² âª ï£· ï£¸ Â¬Î±[ [ , [ ] T ââ, â]T T ââ, â]T Â¬Î² An important property of normalized spanning intervals is that for any normalized spanning inter-val i, Â¬i contains at most seven normalized spanning intervals. The intuition behind the above deï¬nition is as follows. First note that the negation of q â¤Î± r is q â¥Â¬Î± r. Next note that the extension of i contains intervals whose endpoints q and r satisfy(q â¥Î³ i) â§ (q â¤Î´ j) â§ (r â¥ k) â§ (r â¤Î¶ l). Thus the extension of Â¬i contains intervals whoseendpoints satisfy the negation of this, namely (q â¤Â¬Î³ i) â¨ (q â¥Â¬Î´ j) â¨ (r â¤Â¬ k) â¨ (r â¥Â¬Î¶ l). Sucha disjunction requires four spanning intervals, the ï¬rst four in the above deï¬nition. Additionally, ifthe extension of i contains intervals of the form [q, r], the extension of Â¬i will contain all intervalsnot of the form [q, r], namely (q, r], [q, r), and (q, r). Similarly for the cases where the extensionof i contains intervals of the form (q, r], [q, r), or (q, r). This accounts for the last three spanningintervals in the above deï¬nition. We now see why it is necessary to allow spanning intervals to have open ranges of endpoint values as well as inï¬nite endpoints. The complement of a spanning interval, such as [[i, j], [k, l]],with closed endpoint ranges and ï¬nite endpoints includes spanning intervals, such as[[ââ, i), [ââ, â]], with open endpoint ranges and inï¬nite endpoints. 4.5 Computing the Span of two Normalized Spanning Intervals The span of two intervals i1 and i2, denoted SPAN(i1, i2), is the smallest interval whose extensioncontains the extensions of both i1 and i2. For example, the span of (1, 4) and [2, 6] is (1, 6]. Simi-larly, the span of [3, 7) and (3, 7] is [3, 7]. More generally, the lower endpoint of SPAN(i1, i2) is theminimum of the lower endpoints of i1 and i2. The lower endpoint of SPAN(i1, i2) is open or closeddepending on whether the smaller of the lower endpoints of i1 and i2 is open or closed. Analogously, 57

SISKIND the upper endpoint of SPAN(i1, i2) is the maximum of the upper endpoints of i1 and i2. The upperendpoint of SPAN(i1, i2) is open or closed depending on whether the larger of the upper endpointsof i1 and i2 is open or closed. More precisely, SPAN(i1, i2) can be computed as follows: SPAN(Î± [q , [q ) = 1 1, r1]Î²1 Î±2 2, r2]Î²2 [Î±1â§(q1â¤q2)]â¨[Î±2â§(q1â¥q2)][min(q1, q2), max(r1, r2)][Î²1â§(r1â¥r2)]â¨[Î²2â§(r1â¤r2)] The notion of span will be used in Section 4.7. Let us extend the notion of span to two sets of intervals by the following deï¬nition: SPAN(I1, I2) = SPAN(i1, i2) i1âI1 i2âI2 We will want to compute the span of two sets of intervals I1 and I2, when both I1 and I2 arerepresented as spanning intervals. Additionally, we will want the resulting span to be representedas a small set of spanning intervals. Given two normalized spanning intervals i1 and i2, their span SPAN(i1, i2) is a set of normalized spanning intervals whose extension is the span of the extensions of i1 and i2. One can compute SPAN(i1, i2) as follows: SPAN(Î± [ [i , [k ] , [ [i , [k ] ) = 1 Î³1 1, j1]Î´1 1 1, l1]Î¶1 Î²1 Î±2 Î³2 2, j2]Î´2 2 2, l2]Î¶2 Î²2 ï£« Î± [ [i ] âª 1 Î³1 1, j]Î´ , [k, l1]Î¶1 Î²1 ï£¶ ï£¬ Î± [Î³ [i1, j]Î´, [k, l2]Î¶ ]Î² âª ï£¬ 1 1 2 2 ï£· ï£¬ [ [i ] âª ï£· ï£ Î±2 Î³2 2, j]Î´, [k, l1]Î¶1 Î²1 ï£·ï£¸ Î± [ [i ] 2 Î³2 2, j]Î´ , [k, l2]Î¶2 Î²2 where j = min(j1, j2) k = max(k1, k2) Î´ = [Î´1 â§ (j1 â¤ j2)] â¨ [Î´2 â§ (j1 â¥ j2)] = [ 1 â§ (k1 â¥ k2)] â¨ [ 2 â§ (k1 â¤ k2)] An important property of normalized spanning intervals is that for any two normalized spanningintervals i1 and i2, SPAN(i1, i2) contains at most four normalized spanning intervals. In practice,however, fewer normalized spanning intervals are needed, often only one. The intuition behind the above deï¬nition is as follows. Consider, ï¬rst, the lower endpoint. Suppose that the lower endpoints q1 and q2 of i1 and i2 are in Î³ [i and [i respectively. 1 1, j1]Î´1 Î³2 2, j2]Î´2 That means that i1 â¤Î³ q j q j 1 1 â¤Î´1 1 and i2 â¤Î³2 2 â¤Î´2 2. The lower endpoint of SPAN(i1, i2) will be q1, when q1 â¤ q2, and q2, when q1 â¥ q2. Thus it will be q1, for all i1 â¤Î³ q 1 1 â¤Î´ min(j1, j2), and will be q2, for all i2 â¤Î³ q 2 2 â¤Î´ min(j1, j2), where Î´ = Î´1, when j1 â¤ j2, and Î´ = Î´2, when j1 â¥ j2. Thus there will be two potential ranges for the lower endpoint of SPAN(i1, i2): Î³ [i [i 1 1, min(j1, j2)]Î´ and Î³2 2, min(j1, j2)]Î´ . When the lower endpoint of SPAN(i1, i2) is taken from the former, it will be open or closed depending on whether the lower endpoint of i1is open or closed. When it is taken from the later, it will be open or closed depending on whetherthe lower endpoint of i2 is open or closed. Thus the lower endpoint of SPAN(i1, i2) can be either Î± [ [i [ [i 1 Î³1 1, min(j1, j2)]Î´ or Î±1 Î³2 2, min(j1, j2)]Î´ . Analogous reasoning can be applied to the upper endpoints. If the upper endpoints of i1 and i2 are [k ] and [k ] respectively, then 1 1, l1]Î¶1 Î²1 2 2, l2]Î¶2 Î²2 there are two possibilities for the upper endpoint of SPAN(i1, i2), namely [max(k1, k2), l1]Î¶ ] and 1 Î²1 [max(k1, k2), l2]Î¶ ] , where = 2 Î²2 1, when k1 â¥ k2, and = 2, when k1 â¤ k2. 58

GROUNDING THE LEXICAL SEMANTICS OF VERBS 4.6 Computing the D of a Normalized Spanning Interval Given an Allen relation r and a set I of intervals, let D(r, I) denote the set J of all intervals j suchthat irj for some i â I. Given an Allen relation r and a normalized spanning interval i, let D(r, i)denote a set of normalized spanning intervals whose extension is D(r, I), where I is the extensionof i. One can compute D(r, i) as follows: D(=, i) = i D(<,Î± [ [i , [k ] ) = [ , [ââ, â] ] 1 Î³1 1, j1]Î´1 1 1, l1]Î¶1 Î²1 Î±2 (Â¬Î²1â§Â¬Î±2â§ 1)[k1, â]T T T Î²2 Î±2,Î²2âT,F D(>,Î± [ [i , [k ] ) = [ [ââ, â] , [ââ, j 1 Î³1 1, j1]Î´1 1 1, l1]Î¶1 Î²1 Î±2 T T T 1](Â¬Î±1â§Â¬Î²2â§Î´1)]Î²2 Î±2,Î²2âT,F D(m,Î± [ [i , [k ] ) = [ [k , [ââ, â] ] 1 Î³1 1, j1]Î´1 1 1, l1]Î¶1 Î²1 Â¬Î²1 1 1, l1]Î¶1 T T Î²2 Î²2âT,F D(mi,Î± [ [i , [k ] ) = [ [ââ, â] , [i ] 1 Î³1 1, j1]Î´1 1 1, l1]Î¶1 Î²1 Î±2 T T Î³1 1, j1]Î´1 Â¬Î±1 Î±2âT,F D(o,Î± [ [i , [k ] ) = 1 Î³1 1, j1]Î´1 1 1, l1]Î¶1 Î²1 Î± [ ] 2 (Î±1â§Â¬Î±2â§Î³1)[i1, l1](Î²1â§Î±2â§Î¶1),(Â¬Î²1â§Î²2â§ 1) [k1, â]T Î²2 Î±2,Î²2âT,F D(oi,Î± [ [i , [k ] ) = 1 Î³1 1, j1]Î´1 1 1, l1]Î¶1 Î²1 Î± [ [ââ, j 2 T 1](Â¬Î±1â§Î±2â§Î´1),(Î±1â§Î²2â§Î³1) [i1, l1](Î²1â§Â¬Î²2â§Î¶1)]Î²2 Î±2,Î²2âT,F D(s,Î± [ [i , [k ] ) = [ [i , ] 1 Î³1 1, j1]Î´1 1 1, l1]Î¶1 Î²1 Î±1 Î³1 1, j1]Î´1 (Â¬Î²1â§Î²2â§ 1) [k1, â]T Î²2 Î²2âT,F D(si,Î± [ [i , [k ] ) = [ [i , [ââ, l 1 Î³1 1, j1]Î´1 1 1, l1]Î¶1 Î²1 Î±1 Î³1 1, j1]Î´1 T 1](Î²1â§Â¬Î²2â§Î¶1)]Î²2 Î²2âT,F D(f,Î± [ [i , [k ] ) = [ [ââ, j [k ] 1 Î³1 1, j1]Î´1 1 1, l1]Î¶1 Î²1 Î±2 T 1](Â¬Î±1â§Î±2â§Î´1), 1 1, l1]Î¶1 Î²1 Î±2âT,F D(ï¬,Î± [ [i , [k ] ) = [ , [k ] 1 Î³1 1, j1]Î´1 1 1, l1]Î¶1 Î²1 Î±2 (Î±1â§Â¬Î±2â§Î³1)[i1, â]T 1 1, l1]Î¶1 Î²1 Î±2âT,F D(d,Î± [ [i , [k ] ) = 1 Î³1 1, j1]Î´1 1 1, l1]Î¶1 Î²1 Î± [ [ââ, j ] 2 T 1](Â¬Î±1â§Î±2â§Î´1),(Â¬Î²1â§Î²2â§ 1) [k1, â]T Î²2 Î±2,Î²2âT,F D(di,Î± [ [i , [k ] ) = 1 Î³1 1, j1]Î´1 1 1, l1]Î¶1 Î²1 Î± [ , [ââ, l 2 (Î±1â§Â¬Î±2â§Î³1)[i1, â]T T 1](Î²1â§Â¬Î²2â§Î¶1)]Î²2 Î±2,Î²2âT,F An important property of normalized spanning intervals is that for any normalized spanning inter-val i, D(r, i) contains at most 1, 4, 4, 2, 2, 4, 4, 2, 2, 2, 2, 4, or 4 normalized spanning intervals 59

SISKIND when r is =, <, >, m, mi, o, oi, s, si, f, ï¬, d, or di respectively. In practice, however, fewer normalizedspanning intervals are needed, often only one. The intuition behind the above deï¬nition is as follows. Let us handle each of the cases sepa- rately. r =< For any intervals i1 and i2 in the extensions of i1 and i2 respectively we want i1 < i2. From (2) we get r1 â¤(Â¬Î² r 1â§Â¬Î±2) q2. Furthermore, from (14) we get k1 â¤ 1 1. Combining these we get k1 â¤(Â¬Î²1â§Â¬Î±2â§ 1) q2. In this case, both Î±2 and Î²2 are free indicating that either endpointof i2 can be open or closed. r => For any intervals i1 and i2 in the extensions of i1 and i2 respectively we want i1 > i2. From (3) we get q1 â¥(Â¬Î± j 1â§Â¬Î²2) r2. Furthermore, from (14) we get q1 â¤Î´1 1. Combining these we get r2 â¤(Â¬Î±1â§Â¬Î²2â§Î´1) j1. In this case, both Î±2 and Î²2 are free indicating that either endpoint of i2can be open or closed. r = m For any intervals i1 and i2 in the extensions of i1 and i2 respectively we want i1 m i2. From (4) we get r1 = q2 and Î²1 = Î±2. Furthermore, from (14) we get k1 â¤ r l 1 1 â¤Î¶1 1. Combining these we get k1 â¤ q l 1 2 â¤Î¶1 1 and Î²1 = Î±2. In this case, only Î²2 is free indicating that the upper endpoint of i2 can be open or closed. r = mi For any intervals i1 and i2 in the extensions of i1 and i2 respectively we want i1 mi i2. From (5) we get q1 = r2 and Î±1 = Î²2. Furthermore, from (14) we get i1 â¤Î³ q j 1 1 â¤Î´1 1. Combining these we get i1 â¤Î³ r j 1 2 â¤Î´1 1 and Î±1 = Î²2. In this case, only Î±2 is free indicating that the lower endpoint of i2 can be open or closed. r = o For any intervals i1 and i2 in the extensions of i1 and i2 respectively we want i1 o i2. From (6) we get q1 â¤(Î± q 1â§Â¬Î±2) q2 â¤(Î²1â§Î±2) r1 â¤(Â¬Î²1â§Î²2) r2. Furthermore, from (14) we get i1 â¤Î³1 1 and k1 â¤ r l 1 1 â¤Î¶1 1. Combining these we get i1 â¤(Î±1â§Â¬Î±2â§Î³1) q2 â¤(Î²1â§Î±2â§Î¶1) l1 and k1 â¤(Â¬Î²1â§Î²2â§ 1) r2. In this case, both Î±2 and Î²2 are free indicating that either endpoint of i2can be open or closed. r = oi For any intervals i1 and i2 in the extensions of i1 and i2 respectively we want i1 oi i2. From (7) we get q2 â¤(Â¬Î±1â§Î±2) q1 â¤(Î±1â§Î²2) r2 â¤(Î²1â§Â¬Î²2) r1. Furthermore, from (14) we getr1 â¤Î¶ l q j 1 1 and i1 â¤Î³1 1 â¤Î´1 1. Combining these we get i1 â¤(Î±1â§Î²2â§Î³1) r2 â¤(Î²1â§Â¬Î²2â§Î¶1) l1 and q2 â¤(Â¬Î±1â§Î±2â§Î´1) j1. In this case, both Î±2 and Î²2 are free indicating that either endpointof i2 can be open or closed. r = s For any intervals i1 and i2 in the extensions of i1 and i2 respectively we want i1 s i2. From (8) we get q1 = q2, Î±1 = Î±2, and r1 â¤(Â¬Î²1â§Î²2) r2. Furthermore, from (14) we get i1 â¤Î³1q1 â¤Î´ j r q j 1 1 and k1 â¤ 1 1. Combining these we get Î±1 = Î±2, i1 â¤Î³1 2 â¤Î´1 1, and k1 â¤(Â¬Î²1â§Î²2â§ 1) r2. In this case, only Î²2 is free indicating that the upper endpoint of i2 canbe open or closed. r = si For any intervals i1 and i2 in the extensions of i1 and i2 respectively we want i1 si i2. From (9) we get q1 = q2, Î±1 = Î±2, and r1 â¥(Î²1â§Â¬Î²2) r2. Furthermore, from (14) we geti1 â¤Î³ q j l q j 1 1 â¤Î´1 1 and r1 â¤Î¶1 1. Combining these we get Î±1 = Î±2, i1 â¤Î³1 2 â¤Î´1 1, and r2 â¤(Î²1â§Â¬Î²2â§Î¶1) l1. In this case, only Î²2 is free indicating that the upper endpoint of i2 canbe open or closed. 60

GROUNDING THE LEXICAL SEMANTICS OF VERBS r = f For any intervals i1 and i2 in the extensions of i1 and i2 respectively we want i1 f i2. From (10) we get q1 â¥(Â¬Î± r 1â§Î±2) q2, r1 = r2, and Î²1 = Î²2. Furthermore, from (14) we get k1 â¤ 1 1 â¤Î¶1 l1 and q1 â¤Î´ j r l 1 1. Combining these we get Î²1 = Î²2, k1 â¤ 1 2 â¤Î¶1 1, and q2 â¤(Â¬Î±1â§Î±2â§Î´1) j1. In this case, only Î±2 is free indicating that the lower endpoint of i2 can be open or closed. r = ï¬ For any intervals i1 and i2 in the extensions of i1 and i2 respectively we want i1 ï¬ i2. From (11) we get q1 â¥(Î±1â§Â¬Î±2) q2, r1 = r2, and Î²1 = Î²2. Furthermore, from (14) weget k1 â¤ r l q r l 1 1 â¤Î¶1 1 and i1 â¤Î³1 1. Combining these we get Î²1 = Î²2, k1 â¤ 1 2 â¤Î¶1 1, and i1 â¤(Î±1â§Â¬Î±2â§Î³1) q2. In this case, only Î±2 is free indicating that the lower endpoint of i2 canbe open or closed. r = d For any intervals i1 and i2 in the extensions of i1 and i2 respectively we want i1 d i2. From (12) we get q1 â¥(Â¬Î±1â§Î±2) q2 and r1 â¤(Â¬Î²1â§Î²2) r2. Furthermore, from (14) we getq1 â¤Î´ j r 1 1 and k1 â¤ 1 1. Combining these we get q2 â¤(Â¬Î±1â§Î±2â§Î´1) j1 and k1 â¤(Â¬Î²1â§Î²2â§ 1) r2. In this case, both Î±2 and Î²2 are free indicating that either endpoint of i2 can be open orclosed. r = di For any intervals i1 and i2 in the extensions of i1 and i2 respectively we want i1 di i2. From (13) we get q1 â¤(Î±1â§Â¬Î±2) q2 and r1 â¥(Î²1â§Â¬Î²2) r2. Furthermore, from (14) we geti1 â¤Î³ q l 1 1 and r1 â¤Î¶1 1. Combining these we get i1 â¤(Î±1â§Â¬Î±2â§Î³1) q2 and r2 â¤(Î²1â§Â¬Î²2â§Î¶1) l1. In this case, both Î±2 and Î²2 are free indicating that either endpoint of i2 can be open or closed. 4.7 Computing the I of two Normalized Spanning Intervals Given an Allen relation r and two sets I and J of intervals, let I(I, r, J) denote the set K of allintervals k such that k = SPAN(i, j) for some i â I and j â J, where irj. Given an Allen relation rand two normalized spanning intervals i and j, let I(i, r, j) denote a set of normalized spanningintervals whose extension is I(I, r, J), where I and J are the extensions of i and j respectively.One can compute I(i, r, j) as follows: I(i, r, j) = SPAN(i , j ) i âD(râ1,j) i âi â©i j âD(r,i) j âj â©j Here, râ1 denotes the inverse relation corresponding to r, i.e. the same relation as r but with thearguments reversed. It is easy to see that |I(Â·, r, Â·)| â¤ 4|D(r, Â·)|2. Thus an important property ofnormalized spanning intervals is that for any two normalized spanning intervals i and j, I(i, r, j)contains at most 4, 64, 64, 16, 16, 64, 64, 16, 16, 16, 16, 64, or 64 normalized spanning intervals,when r is =, <, >, m, mi, o, oi, s, si, f, ï¬, d, or di respectively. While simple combinatorialenumeration yields the above weak bounds on the number of normalized spanning intervals neededto represent I(i, r, j), in practice, far fewer normalized spanning intervals are needed, in most casesonly one. The intuition behind the above deï¬nition is as follows. Let I and J be the extensions of i and j respectively. The extension of the set of all i is the set of all intervals i such that irj for some j in J .Furthermore, the extension of the set of all i is the set of all intervals i in I such that irj for some jin J . Similarly, the extension of the set of all j is the set of all intervals j such that irj for some iin I. Analogously, the extension of the set of all j is the set of all intervals j in J such that irj forsome i in I. Thus the extension of the set of all SPAN(i , j ) is the set of all intervals k such thatk = SPAN(i, j) where i is in I, j is in J , and irj. 61

SISKIND 4.8 An Efï¬cient Inference Procedure for Event Logic Given the above procedures for computing i , i1 â© i2, Â¬i, SPAN(i1, i2), D(r, i), and I(i, r, j), onecan now deï¬ne a procedure for computing E(M, Î¦). This procedure takes a model M along withan event-logic expression Î¦ and computes a set of normalized spanning intervals that represents theset I of intervals i for which Î¦@i is true. The model M is a set of atomic event-occurrence formulaeof the form p(c1, . . . , cn)@i, where p(c1, . . . , cn) is a ground primitive event-logic expression and iis a normalized spanning interval. A model entry p(c1, . . . , cn)@i indicates that the primitive eventp(c1, . . . , cn) occurred during all intervals in the extension of i. E(M, p(c1, . . . , cn)) = i|p(c1, . . . , cn)@i â M E(M, Î¦ â¨ Î¨) = E(M, Î¦) âª E(M, Î¨) E(M, Â¬Î¦) = Â· Â· Â· i1 â© Â· Â· Â· â© in i i 1âÂ¬i1 nâÂ¬in where E(M, Î¦) = i1, . . . , in E(M, Î¦ â§R Î¨) = I(i, r, j) iâE(M,Î¦) jâE(M,Î¨) râR E(M, 3RÎ¦) = D(r, i) iâE(M,Î¦) râR The procedure performs structural induction on Î¦. It computes a set of normalized spanning in-tervals to the represent the occurrence of each atomic event-logic expression in Î¦ and recursivelycombines the sets so computed for each child subexpression to yield the sets for each parent subex-pression. An important property of this inference procedure is that for any ï¬nite model M , E(M, Î¦),the set I of intervals i for which Î¦@i is true, can be represented by a ï¬nite set of normalized spanningintervals. Nominally, the number of normalized spanning intervals in E(M, Î¦) can be exponentialin the subexpression depth of Î¦ because each step in the structural induction can introduce a con-stant factor growth in the size of the set. However, in practice, such exponential growth does notoccur. Computing E(M, Î¦) for all of the event types given in Figure 10 for all of the movies thathave been tried so far have yielded sets of fewer than a dozen normalized spanning intervals. 5. Experimental Results The techniques described in this paper have been implemented as a system called LEONARD andtested on a number of video sequences.7 LEONARD successfully recognizes the events pick up, putdown, stack, unstack, move, assemble, and disassemble using the deï¬nitions given in Figure 10.Figures 1 and 11 through 15 show the key frames from movies that depict these seven event types.These movies were ï¬lmed using a Canon VC-C3 camera and a Matrox Meteor frame grabber at320Ã240 resolution at 30fps. Figures 4 and 16 through 20 show the results of segmentation, track-ing, and model reconstruction for those key frames superimposed on the original images. Figures 5and 21 through 25 show the results of event classiï¬cation for these movies. These ï¬gures show LEONARD correctly recognizing the intended event classes for each movie. 7. The code for LEONARD, the video input sequences discussed in this paper, and the full frame-by-frame output of LEONARD on those sequences is available as Online Appendix 1, as well as from ftp://ftp.nj.nec.com/pub/qobi/leonard.tar.Z. 62

GROUNDING THE LEXICAL SEMANTICS OF VERBS Frame 3 Frame 11 Frame 12 Frame 23 Frame 24 Frame 26 Figure 16: The output of the segmentation-and-tracking and model-reconstruction components ap- plied to the image sequence from Figure 11, an image sequence that depicts a stackevent. Frame 1 Frame 10 Frame 11 Frame 24 Frame 25 Frame 29 Figure 17: The output of the segmentation-and-tracking and model-reconstruction components ap- plied to the image sequence from Figure 12, an image sequence that depicts an unstackevent. 63

SISKIND Frame 0 Frame 8 Frame 9 Frame 16 Frame 17 Frame 33 Frame 34 Frame 45 Frame 46 Frame 47 Figure 18: The output of the segmentation-and-tracking and model-reconstruction components ap- plied to the image sequence from Figure 13, an image sequence that depicts a moveevent. 64

GROUNDING THE LEXICAL SEMANTICS OF VERBS Frame 0 Frame 9 Frame 10 Frame 17 Frame 18 Frame 32 Frame 33 Frame 40 Frame 41 Frame 46 Frame 47 Frame 56 Frame 57 Frame 67 Frame 68 Frame 80 Figure 19: The output of the segmentation-and-tracking and model-reconstruction components ap- plied to the image sequence from Figure 14, an image sequence that depicts an assembleevent. 65

SISKIND Frame 18 Frame 19 Frame 22 Frame 23 Frame 49 Frame 50 Frame 56 Frame 57 Frame 62 Frame 63 Frame 87 Frame 88 Figure 20: The output of the segmentation-and-tracking and model-reconstruction components ap- plied to the image sequence from Figure 15, an image sequence that depicts a disassem-ble event. 66

GROUNDING THE LEXICAL SEMANTICS OF VERBS (PUT-DOWN MOVING RED BLUE)@[[0,12],[24,30])(STACK MOVING RED BLUE GREEN)@[[0,12],[24,30]) (SUPPORTED? MOVING)@[[13:24])(SUPPORTED? RED)@[[0:30])(SUPPORTED? BLUE)@[[0:30])(SUPPORTS? MOVING RED)@[[0:12])(SUPPORTS? RED MOVING)@[[13:24])(SUPPORTS? RED BLUE)@[[19:20]), [[21:22])(SUPPORTS? GREEN MOVING)@[[19:20]), [[21:22])(SUPPORTS? GREEN RED)@[[19:20]), [[21:22])(SUPPORTS? GREEN BLUE)@[[0:30])(SUPPORTS? BLUE MOVING)@[[13:24])(SUPPORTS? BLUE RED)@[[12:30])(CONTACTS? RED BLUE)@[[12:19]), [[20:21]), [[22:30])(CONTACTS? GREEN BLUE)@[[0:30])(ATTACHED? MOVING RED)@[[0:24])(ATTACHED? RED BLUE)@[[19:20]), [[21:22]) Figure 21: The output of the event-classiï¬cation component applied to the model sequence from Figure 16. Note that the stack event is correctly recognized, as well as the constituentput down event. 67

SISKIND (PICK-UP MOVING RED BLUE)@[[0,11],[25,33])(UNSTACK MOVING RED BLUE GREEN)@[[0,11],[25,33]) (SUPPORTED? MOVING)@[[11:23])(SUPPORTED? RED)@[[0:36])(SUPPORTED? BLUE)@[[0:36])(SUPPORTS? MOVING RED)@[[23:36])(SUPPORTS? RED MOVING)@[[11:23])(SUPPORTS? RED BLUE)@[[13:14])(SUPPORTS? GREEN MOVING)@[[13:14])(SUPPORTS? GREEN RED)@[[13:14])(SUPPORTS? GREEN BLUE)@[[0:36])(SUPPORTS? BLUE MOVING)@[[11:23])(SUPPORTS? BLUE RED)@[[0:25])(CONTACTS? MOVING RED)@[[34:36])(CONTACTS? RED BLUE)@[[0:13]), [[14:24])(CONTACTS? GREEN BLUE)@[[0:13]), [[14:36])(ATTACHED? MOVING RED)@[[11:33])(ATTACHED? RED BLUE)@[[13:14])(ATTACHED? GREEN BLUE)@[[13:14]) Figure 22: The output of the event-classiï¬cation component applied to the model sequence from Figure 17. Note that the unstack event is correctly recognized, as well as the constituentpick up event. 68

GROUNDING THE LEXICAL SEMANTICS OF VERBS (PICK-UP MOVING RED GREEN)@[[0,9],[17,46])(PUT-DOWN MOVING RED BLUE)@[[17,35],[46,52])(MOVE MOVING RED GREEN BLUE)@[[0,9],[46,52]) (SUPPORTED? MOVING)@[[9:15])(SUPPORTED? RED)@[[0:52])(SUPPORTED? BLUE)@[[35:46])(SUPPORTS? MOVING RED)@[[17:46])(SUPPORTS? MOVING BLUE)@[[35:46])(SUPPORTS? RED MOVING)@[[9:15])(SUPPORTS? RED BLUE)@[[35:46])(SUPPORTS? GREEN MOVING)@[[9:15])(SUPPORTS? GREEN RED)@[[0:17])(SUPPORTS? BLUE RED)@[[46:52])(CONTACTS? RED GREEN)@[[0:17])(CONTACTS? RED BLUE)@[[46:52])(ATTACHED? MOVING RED)@[[9:46])(ATTACHED? RED BLUE)@[[35:46]) Figure 23: The output of the event-classiï¬cation component applied to the model sequence from Figure 18. Note that the move event is correctly recognized, as well as the constituentpick up and put down subevents. 69

SISKIND (PUT-DOWN MOVING RED GREEN)@[[57,68],[68,87])(PUT-DOWN MOVING GREEN BLUE)@[[18,35],[41,47])(STACK MOVING RED GREEN BLUE)@[[57,68],[68,87])(ASSEMBLE MOVING RED GREEN BLUE)@[[18,35],[68,87]) (SUPPORTED? MOVING)@[[10:18]), [[47:57])(SUPPORTED? RED)@[[57:87])(SUPPORTED? GREEN)@[[11:87])(SUPPORTED? BLUE)@[[35:41])(SUPPORTS? MOVING RED)@[[57:68])(SUPPORTS? MOVING GREEN)@[[11:41])(SUPPORTS? MOVING BLUE)@[[35:41])(SUPPORTS? RED MOVING)@[[10:18]), [[47:57])(SUPPORTS? RED GREEN)@[[11:16])(SUPPORTS? GREEN RED)@[[68:87])(SUPPORTS? GREEN BLUE)@[[35:41])(SUPPORTS? BLUE GREEN)@[[41:87])(CONTACTS? RED GREEN)@[[68:87])(CONTACTS? GREEN BLUE)@[[41:87])(ATTACHED? MOVING RED)@[[11:16]), [[49:68])(ATTACHED? MOVING GREEN)@[[11:41])(ATTACHED? GREEN BLUE)@[[35:41]) Figure 24: The output of the event-classiï¬cation component applied to the model sequence from Figure 19. Note that the assemble event is correctly recognized, as well as the constituentput down and stack subevents. 70

GROUNDING THE LEXICAL SEMANTICS OF VERBS (PICK-UP MOVING RED GREEN)@[[0,19],[23,50])(PICK-UP MOVING GREEN BLUE)@[[22,58],[62,87])(UNSTACK MOVING RED GREEN BLUE)@[[0,19],[23,50])(DISASSEMBLE MOVING RED GREEN BLUE)@[[0,19],[62,87]) (SUPPORTED? MOVING)@[[19:22])(SUPPORTED? RED)@[[0:50])(SUPPORTED? GREEN)@[[0:87])(SUPPORTED? BLUE)@[[58:62])(SUPPORTS? MOVING RED)@[[23:50])(SUPPORTS? MOVING GREEN)@[[58:87])(SUPPORTS? MOVING BLUE)@[[58:62])(SUPPORTS? RED MOVING)@[[19:22])(SUPPORTS? GREEN MOVING)@[[19:22])(SUPPORTS? GREEN RED)@[[0:23])(SUPPORTS? GREEN BLUE)@[[58:62])(SUPPORTS? BLUE GREEN)@[[0:58])(CONTACTS? RED GREEN)@[[0:23])(CONTACTS? GREEN BLUE)@[[0:58])(ATTACHED? MOVING RED)@[[19:50])(ATTACHED? MOVING GREEN)@[[58:87])(ATTACHED? GREEN BLUE)@[[58:62]) Figure 25: The output of the event-classiï¬cation component applied to the model sequence from Figure 20. Note that the disassemble event is correctly recognized, as well as the con-stituent pick up and unstack subevents. 71

SISKIND In Figure 4(a), Frames 0 through 1 correspond to the ï¬rst subevent of a pick up event, Frames 2 through 13 correspond to the second subevent, and Frames 14 through 22 correspond to the thirdsubevent. In Figure 4(b), Frames 0 through 13 correspond to the ï¬rst subevent of a put down event,Frames 14 through 22 correspond to the second subevent, and Frames 23 through 32 correspondto the third subevent. LEONARD correctly recognizes these as instances of pick up and put downrespectively. In Figure 16, Frames 0 through 11, 12 through 23, and 24 through 30 correspond tothe three subevents of a put down event. LEONARD correctly recognizes this as a put down eventand also as a stack event. In Figure 17, Frames 0 through 10, 11 through 24, and 25 through 33correspond to the three subevents of a pick up event. LEONARD correctly recognizes this as a pickup event and also as an unstack event. In Figure 18, Frames 0 through 8, 9 through 16, and 17through 45 correspond to the three subevents of a pick up event and Frames 17 through 33, 34through 45, and 46 through 52 correspond to the three subevents of a put down event. LEONARDcorrectly recognizes the combination of these two events as a move event. In Figure 19, Frames 18through 32, 33 through 40, and 41 through 46 correspond to the three subevents of a put down eventand Frames 57 through 67 and 68 through 87 correspond to the ï¬rst and third subevents of a secondput down event, with the second subevent being empty. The second put down event is also correctlyrecognized as a stack event and the combination of these two events is correctly recognized as anassemble event. In Figure 20, Frames 0 through 18, 19 through 22, and 23 through 50 correspond tothe three subevents of a pick up event and Frames 23 through 56, 57 through 62, and 63 through 87correspond to the three subevents of a second pick up event. The ï¬rst pick up event is also correctlyrecognized as an unstack event and the combination of these two events is correctly recognized asa disassemble event. These examples show that LEONARD correctly recognizes each of the sevenevent types with no false positives. As discussed in the introduction, using force dynamics and event logic to recognize events offers several advantages over the prior approach of using motion proï¬le and hidden Markov models. â¢ robustness against variance in motion proï¬le â¢ robustness against presence of extraneous objects in the ï¬eld of view â¢ ability to perform temporal and spatial segmentation of events â¢ ability to detect non-occurrence of events Figures 26 through 35 illustrate these advantages. Figure 26 shows a pick up event from the leftin contrast to Figure 4(a) which is from the right. Even though these have different motion pro-ï¬les, Figure 31 shows that LEONARD correctly recognizes that these exhibit the same sequenceof changes in force-dynamic relations and constitute the same event type, namely pick up. Fig-ure 27 shows a pick up event with two extraneous blocks in the ï¬eld of view. Figure 32 shows that LEONARD correctly recognizes that these extraneous blocks do not participate in any events and, despite their presence, the truth conditions for a pick up event still hold between the other objects.Figure 28 shows a pick up event, followed by a put down event, followed by another pick up event,followed by another put down event. Figure 33 shows that LEONARD correctly recognizes this se-quence of four event occurrences. Figure 29 shows two simultaneous pick up events. Figure 34shows that LEONARD correctly recognizes these two simultaneous event occurrences. Finally, Fig-ure 30 shows two non-events. Figure 35 shows that LEONARD is not fooled into thinking that theseconstitute pick up or put down events, even though portions of these events have similar motion 72

GROUNDING THE LEXICAL SEMANTICS OF VERBS Frame 5 Frame 10 Frame 11 Frame 17 Frame 18 Frame 22 Figure 26: The output of the segmentation-and-tracking and model-reconstruction components on an image sequence depicting a pick up event from the left instead of from the right. Frame 6 Frame 7 Frame 8 Frame 18 Frame 19 Frame 24 Figure 27: The output of the segmentation-and-tracking and model-reconstruction components on an image sequence depicting a pick up event with extraneous objects in the ï¬eld of view. proï¬le to pick up and put down events. LEONARD correctly recognizes that these movies do notmatch any known event types. An approach to even classiï¬cation is valid and useful only if it is robust. A preliminary evalua- tion of the robustness of LEONARD was conducted. Thirty ï¬ve movies were ï¬lmed, ï¬ve instancesof each of the seven event types pick up, put down, stack, unstack, move, assemble, and disassemble.These movies resemble those in Figures 1 and 11 through 15. The same subject performed all thirty 73

SISKIND Frame 4 Frame 8 Frame 9 Frame 18 Frame 19 Frame 44 Frame 45 Frame 51 Frame 52 Frame 69 Frame 70 Frame 77 Frame 78 Frame 102 Frame 103 Frame 109 Frame 110 Frame 111 Figure 28: The output of the segmentation-and-tracking and model-reconstruction components on an image sequence depicting a sequence of a pick up event, followed by a put downevent, followed by another pick up event, followed by another put down event. 74

GROUNDING THE LEXICAL SEMANTICS OF VERBS Frame 0 Frame 5 Frame 6 Frame 10 Frame 11 Frame 15 Frame 16 Frame 21 Figure 29: The output of the segmentation-and-tracking and model-reconstruction components on an image sequence depicting two simultaneous pick up events. ï¬ve events. These movies were processed by LEONARD.8 The results of this preliminary evaluationare summarized in Table 1. A more extensive evaluation of LEONARD will be conducted in thefuture. 6. Discussion This paper presents a new approach to event recognition that differs from the prior approach intwo ways. First, it uses force dynamics instead of motion proï¬le as the feature set to differentiatebetween event types. Second, it uses event logic instead of hidden Markov models as the compu-tational framework for classifying time-series data containing these features. Nominally, these twodifferences are independent. One can imagine using hidden Markov models to classify time seriesof force-dynamic features or using event logic to classify time series of motion-proï¬le features.While such combinations are feasible in principle, they are unwieldy in practice. Consider using event logic to classify time series of motion-proï¬le features. Motion-proï¬le features, such as position, velocity, and acceleration, are typically continuous. A given event usu-ally corresponds to a vague range of possible feature values. This vagueness is well modeled bycontinuous-output hidden Markov models. Event logic, which is discrete in nature, requires quan-tizing precise feature-value ranges. Such quantization can lead to a high misclassiï¬cation rate.Furthermore, continuous distributions allow partitioning a multidimensional feature space into dif-ferent classes where the boundaries between classes are more complex than lines along the featureaxes. Emulating this in event logic would require complex disjunctive expressions. Similarly, consider using hidden Markov models to classify time series of force-dynamic fea- tures. Suppose that a feature vector contains n features. Since both force-dynamic and motion-proï¬le features typically relate pairs of objects, n is often quadratic in the number of event par- 8. The movies, as well as the results produced by LEONARD when processing the movies, are available as Online Appendix 1, as well as from ftp://ftp.nj.nec.com/pub/qobi/leonard.tar.Z. 75

SISKIND (a) Frame 4 Frame 11 Frame 12 Frame 18 Frame 19 Frame 21 (b) Frame 0 Frame 6 Frame 7 Frame 12 Frame 13 Frame 18 Figure 30: The output of the segmentation-and-tracking and model-reconstruction components ap- plied to the image sequences from Figure 2, image sequences that depict non-events. 76

GROUNDING THE LEXICAL SEMANTICS OF VERBS (PICK-UP MOVING RED GREEN)@[[0,11],[18,30]) (SUPPORTED? RED)@[[0:30])(SUPPORTED? GREEN)@[[11:18])(SUPPORTS? MOVING RED)@[[11:30])(SUPPORTS? MOVING GREEN)@[[11:18])(SUPPORTS? RED GREEN)@[[11:18])(SUPPORTS? GREEN RED)@[[0:11])(CONTACTS? RED GREEN)@[[0:11])(ATTACHED? MOVING RED)@[[11:30])(ATTACHED? RED GREEN)@[[11:18]) Figure 31: The output of the event-classiï¬cation component applied to the model sequence from Figure 26. Note that the pick up event is correctly recognized despite the fact that it wasperformed from the left instead of from the right. (PICK-UP MOVING RED GREEN)@[[0,8],[19,30]) (SUPPORTED? MOVING)@[[8:19])(SUPPORTED? RED)@[[0:30])(SUPPORTED? BLUE)@[[0:30])(SUPPORTS? MOVING RED)@[[19:30])(SUPPORTS? RED MOVING)@[[8:19])(SUPPORTS? GREEN MOVING)@[[8:19])(SUPPORTS? GREEN RED)@[[0:19])(SUPPORTS? YELLOW BLUE)@[[0:30])(CONTACTS? RED GREEN)@[[0:10]), [[16:19])(CONTACTS? BLUE YELLOW)@[[0:30])(ATTACHED? MOVING RED)@[[8:30])(ATTACHED? RED GREEN)@[[10:16]) Figure 32: The output of the event-classiï¬cation component applied to the model sequence from Figure 27. Note that the pick up event is correctly recognized despite the presence ofextraneous objects in the ï¬eld of view. 77

SISKIND (PICK-UP MOVING RED GREEN)@[[52,70],[78,102]), [[0,9],[19,44]) (PUT-DOWN MOVING RED GREEN)@[[19,44],[52,70]), [[78,102],[110,117]) (SUPPORTED? MOVING)@[[9:18]), [[44:52]), [[70:77]), [[102:110]) (SUPPORTED? RED)@[[0:117])(SUPPORTS? MOVING RED)@[[18:44]), [[78:102])(SUPPORTS? RED MOVING)@[[9:18]), [[44:52]), [[70:77]), [[102:110]) (SUPPORTS? GREEN MOVING)@[[9:18]), [[44:52]), [[70:77]), [[102:110]) (SUPPORTS? GREEN RED)@[[0:19]), [[44:78]), [[102:117])(CONTACTS? RED GREEN)@[[0:9]), [[13:18]), [[46:70]), [[106:117]) (ATTACHED? MOVING RED)@[[9:52]), [[70:110])(ATTACHED? RED GREEN)@[[9:13]), [[70:76]), [[104:106]) Figure 33: The output of the event-classiï¬cation component applied to the model sequence from Figure 28. Note that LEONARD correctly recognizes a pick up event, followed by a putdown event, followed by another pick up event, followed by another put down event. 78

GROUNDING THE LEXICAL SEMANTICS OF VERBS (PICK-UP MOVING RED GREEN)@[[0,6],[16,22])(PICK-UP MOVING YELLOW BLUE)@[[0,12],[17,22]) (SUPPORTED? MOVING)@[[6:16])(SUPPORTED? MOVING)@[[12:15])(SUPPORTED? RED)@[[0:22])(SUPPORTED? YELLOW)@[[0:22])(SUPPORTS? MOVING RED)@[[16:22])(SUPPORTS? MOVING YELLOW)@[[17:22])(SUPPORTS? RED MOVING)@[[6:16])(SUPPORTS? GREEN MOVING)@[[6:16])(SUPPORTS? GREEN RED)@[[0:16])(SUPPORTS? BLUE MOVING)@[[12:15])(SUPPORTS? BLUE YELLOW)@[[0:17])(SUPPORTS? YELLOW MOVING)@[[12:15])(CONTACTS? RED GREEN)@[[0:15])(CONTACTS? BLUE YELLOW)@[[0:17])(ATTACHED? MOVING RED)@[[6:22])(ATTACHED? MOVING YELLOW)@[[12:22]) Figure 34: The output of the event-classiï¬cation component applied to the model sequence from Figure 29. Note that the two simultaneous pick up events are correctly recognized. 79

SISKIND (SUPPORTED? RED)@[[0:19])(SUPPORTED? MOVING)@[[13:31])(SUPPORTS? RED MOVING)@[[13:31])(SUPPORTS? MOVING RED)@[[0:13])(SUPPORTS? GREEN RED)@[[12:19])(SUPPORTS? GREEN MOVING)@[[13:19])(ATTACHED? RED MOVING)@[[0:31])(ATTACHED? RED GREEN)@[[13:19]) (a) (SUPPORTED? RED)@[[0:25])(SUPPORTED? GREEN)@[[7:13])(SUPPORTS? MOVING RED)@[[7:13])(SUPPORTS? MOVING GREEN)@[[7:13])(SUPPORTS? RED GREEN)@[[7:13])(SUPPORTS? GREEN RED)@[[0:7]), [[13:25])(CONTACTS? RED GREEN)@[[0:7]), [[13:25])(ATTACHED? MOVING RED)@[[7:13])(ATTACHED? RED GREEN)@[[7:13]) (b) Figure 35: The output of the event-classiï¬cation component applied to the model sequences from Figure 30. Note that LEONARD correctly recognizes that no events occurred in thesesequences. 80

GROUNDING THE LEXICAL SEMANTICS OF VERBS pick up put down stack unstack move assemble disassemble pick up 5/5 put down 5/5 stack 5/5 5/5 unstack 5/5 4/5 move 4/5 5/5 4/5 assemble 5/10 1/5 1/5 disassemble 10/10 5/5 5/5 Table 1: An evaluation of the robustness of LEONARD on a test set of ï¬ve movies of each of seven event types. The rows represent movies of the indicated event types. The columns repre-sent classiï¬cations of the indicated event type. The entries x/y indicate x, the number oftimes that a movie of the indicated event type was classiï¬ed as the indicated event type,and y, the number of times that the movie should have been classiï¬ed as the indicated eventtype. Note that stack entails put down, unstack entails pick up, move entails both a pick upand a put down, assemble entails both a put down and a separate stack, and disassembleentails both a pick up and a separate unstack. Thus off-diagonal entries are expected inthese cases. There were six false negatives and no false positives. Four of the false neg-atives were for the event type assemble. In three of those cases, LEONARD successfullyrecognized the constituent put down subevent but failed to recognize the constituent stacksubevent as well as the associated put down subevent. In one case, LEONARD failed torecognize both the constituent put down and stack subevents along with the associated putdown constituent of the stack subevent. One of the false negatives was for the event typemove. In this case, LEONARD successfully recognized the constituent put down subeventbut failed to recognize the constituent pick up subevent. The remaining false negative wasfor the event type unstack. In this case, LEONARD successfully recognized the constituentpick up subevent but failed to recognize the aggregate unstack event. 81

SISKIND ticipants. Let us contrast the number of parameters needed to represent these features in both themotion-proï¬le and force-dynamic approaches. Since, as discussed above, motion-proï¬le featuresare typically continuous, hidden Markov models with continuous outputs can be used in the motion-proï¬le approach. When the features are independent, such a model requires O(n) parameters perstate to specify the output distributions. Even if one uses, say, a multivariate Gaussian to modeldependent features, this requires only O(n2) parameters per state to specify the output distributionsin the motion-proï¬le approach. However, force-dynamic features are Boolean. This requires us-ing discrete-output hidden Markov models. Such models output a stream of symbols, not featurevectors. Constructing an appropriate alphabet of output symbols requires considering all possiblesubsets of features. This requires O(2n) parameters per state to specify the output distributions inthe force-dynamic approach. Thus continuous-output hidden Markov models appear to be bettersuited to an approach that uses motion-proï¬le features while event logic appears to be better suitedto an approach that uses force-dynamic features. Humans use language for three fundamental purposes: we describe what we see, we ask others to perform actions, and we engage in conversation. The ï¬rst two require grounding language inperception and action. Only the third involves disembodied use of language. Almost all researchin computational linguistics has focused on such disembodied language use. Data-base query pro-cessing, information extraction and retrieval, and spoken-language dialog all use language solely tomanipulate internal representations. In contrast, the work described in this paper grounds languagein perception of the external world. It describes an implemented system, called LEONARD, that useslanguage to describe events observed in short image sequences. Why is perceptual grounding of language important and relevant to computational linguistics? Current approaches to lexical semantics suffer from the âbold-face syndrome.â All too often, themeanings of words, like throw, are taken to be uninterpreted symbols, like throw, or expressionsover uninterpreted symbols, like cause to go (Leech, 1969; Miller, 1972; Schank, 1973; Jackendoff,1983, 1990; Pinker, 1989). Since the interpretation of such symbols is left to informal intuition, thecorrectness of any meaning representation constructed from such uninterpreted symbols cannot beveriï¬ed. In other words, how is one to know whether cause to go is the correct meaning of throw?Perceptual grounding offers a way to verify semantic representations. Having an implemented sys-tem use a collection of semantic representations to generate appropriate descriptions of observationsgives evidence that those semantic representations are correct. This paper takes a small step in thisdirection. In contrast to prior work, which presents informal semantic representations whose in-terpretation is left to intuition, it presents perceptually-grounded semantic representations. Whilethe system described in this paper addresses only perceptual grounding of language, the long-termgoal of this research is to provide a uniï¬ed semantic representation that is sufï¬ciently powerful tosupport all three forms of language use: perception, action, and conversation. Different parts of speech in language typically describe different aspects of visual percepts. Nouns typically describe objects. Verbs typically describe events. Adjectives typically describeproperties. Prepositions typically describe spatial and temporal relations. Grounding language invisual perception will require construction of semantic representations for all of these different partsof speech. It is likely that different parts of speech will require different machinery to represent theirlexical semantics. In other words, whatever the ultimate representation of apple and chair are, theyare likely to be based on very different principles than the ultimate representation of pick up and putdown. These, in turn, are likely to be further different from those needed to represent in, on, red, andbig. Indeed, machine vision research, at least that aspect of machine vision research that focuses 82

GROUNDING THE LEXICAL SEMANTICS OF VERBS on object recognition, can be viewed as an attempt to perceptually ground the lexical semantics ofnouns. In contrast, this paper focuses solely on verbs. Accordingly, it develops machinery that isvery different from what is typically used in the machine-vision community, machinery that is morereminiscent of that which is used in the knowledge-representation community. On the other hand,unlike typical knowledge-representation work, it grounds that machinery in image processing. When one proposes a representation, such as cause to go, as the meaning of a word, such as throw, one must specify three things to effectively specify the meaning of that word. First,one must specify the lexical semantics of the individual primitives, how one determines the truthconditions of items like cause and to go. Second, one must specify the compositional semantics ofthe representation, how one combines the truth conditions of primitives like cause and to go to getthe aggregate truth conditions of compound expressions like cause to go. Third, one must specify alexical entry, a map from a word, like throw, to a compound expression, like cause to go. All threeare necessary in order to precisely specify the word meaning. Prior work in lexical semantics, such as the work of Leech (1969), Miller (1972), Schank (1973), Jackendoff (1983, 1990), and Pinker (1989), is deï¬cient in this regard. It speciï¬es the third com-ponent without the ï¬rst two. In other words, it formulates lexical entries in terms of compoundexpressions like cause to go, without specifying the meanings of the primitives, like cause and togo, and without specifying how these meanings are combined to form the aggregate meaning of thecompound expression. This paper attempts to address that deï¬ciency by specifying all three com-ponents. First, the lexical semantics of the event-logic primitives is precisely speciï¬ed in Figure 9.Second, the compositional semantics of event logic is precisely speciï¬ed in Section 3. Third, lexi-cal entries for several verbs are precisely speciï¬ed in Figure 10. These three components togetherformally specify the meanings of those verbs with a level of precision that is absent in prior work. While these lexical entries are precise, there is no claim that they are accurate. Lexical entries are precise when their meaning is reduced to an impartial mechanical procedure. Lexical entriesare accurate when they properly reï¬ect the truth conditions for the words that they deï¬ne. Evenignoring homonymy and metaphor, words such as move and assemble clearly have meanings thatare much more complex than what is, and even can be, represented with the machinery presentedin this paper. But that holds true of prior work as well. The lexical entries given in, for example,Leech (1969), Miller (1972), Schank (1973), Jackendoff (1983, 1990), and Pinker (1989) also donot accurately reï¬ect the truth conditions for the words that they deï¬ne. The purpose of this paper isnot to improve the accuracy of deï¬nitions. In fact, the deï¬nitions given in prior work might be moreaccurate, in some ways, than those given here. Rather, its purpose is to improve the precision ofdeï¬nitions. The deï¬nitions given in prior work are imprecise and that imprecision makes assessingtheir accuracy a subjective process: do humans think an informally speciï¬ed representation matchestheir intuition. In contrast, precision allows objective assessment of accuracy: does the output of amechanical procedure applied to sample event occurrences match human judgments of which wordscharacterize those occurrences. Precision is the key methodological advance of this work. Precise speciï¬cation of the meaning of lexical semantic representations, by way of perceptual grounding, makes accuracy assessmentpossible by way of experimental evaluation. Taking this ï¬rst step of advancing precision and per-ceptual grounding will hopefully allow us to take future steps towards improving accuracy. 83

SISKIND 7. Related Work Most prior work uses motion proï¬le, some combination of relative-and absolute linear-and-angularpositions, velocities, and accelerations, as the features that drive event classiï¬cation. That work fol-lows the tradition of linguists and cognitive scientists, such as Leech (1969), Miller (1972), Schank(1973), Jackendoff (1983, 1990), and Pinker (1989), that represent the lexical semantics of verbsvia the causal, aspectual, and directional qualities of motion. Some linguists and cognitive scien-tists, such as Herskovits (1986) and Jackendoff and Landau (1991), have argued that force-dynamicrelations (Talmy, 1988), such as support, contact, and attachment, are crucial for representing thelexical semantics of spatial prepositions. For example, in some situations, part of what it means forone object to be on another object is for the former to be in contact with, and supported by, the latter.In other situations, something can be on something else by way of attachment, as in the knob on thedoor. Siskind (1992) has argued that change in the state of force-dynamic relations plays a morecentral role in specifying the lexical semantics of simple spatial motion verbs than motion proï¬le.The particular relative-and-absolute linear-and-angular positions, velocities, and accelerations donâtmatter when picking something up or putting something down. What matters is a state change inthe source of support of the patient. Similarly, what distinguishes putting something down fromdropping it is that, in the former, the patient is always supported, while in the latter, the patientundergoes unsupported motion. The work described in this paper differs from prior work in visual-event perception in a num- ber of respects. Waltz and Boggess (1979), Waltz (1981), Marr and Vaina (1982), and Rubinand Richards (1985) describe unimplemented frameworks that are not based on force dynamics.Thibadeau (1986) describes a system that recognizes when an event occurs but not what eventoccurs. His system processes simulated video and is not based on force dynamics. Badler (1975),Adler (1977), Tsuji, Morizono, and Kuroda (1977), Okada (1979), Tsuji, Osada, and Yachida (1979,1980), Abe, Soga, and Tsuji (1981), Abe and Tsuji (1982), Novak and Bulko (1990), and Regier(1992) describe systems that process simulated video and that are not based on force dynamics.Borchardt (1984, 1985) presents event deï¬nitions that are based on force-dynamic relations butdoes not present techniques for recovering those relations automatically from either simulated orreal video. Yamoto et al. (1992), Starner (1995), Siskind and Morris (1996), Siskind (1996), Brand(1996, 1997a), Brand, Oliver, and Pentland (1997), and Bobick and Ivanov (1998) present systemsthat recognize event occurrences from real video using motion proï¬le but not force dynamics. Thesesystems use hidden Markov models rather than event logic as the event-classiï¬cation engine. Funt(1980) presents a heuristic approach to stability analysis that operates on simulated video but doesnot perform model reconstruction or event classiï¬cation. Brand, Birnbaum, and Cooper (1993) andBrand (1997b) present a heuristic approach to stability analysis that operates on real video but do notuse stability analysis to perform model reconstruction and event classiï¬cation. Blum, Grifï¬th, andNeumann (1970) and Fahlman (1974) present stability-analysis algorithms that are based on linearprogramming but do not use stability analysis to perform model reconstruction or event classiï¬ca-tion. These stability-analysis algorithms use dynamics rather than kinematics. Siskind (1991, 1992,1993, 1994, 1995, 1997) presents systems that operate on simulated video and use force dynamicsto recognize event occurrences. All of that work, except Siskind (1997), uses heuristic approachesto stability analysis, model reconstruction, and event classiï¬cation. Siskind (1997) presents an earlyversion of the stability-analysis and event-logicâbased event-recognition techniques used in the cur-rent system. Mann et al. (1996, 1997) and Mann and Jepson (1998) present a system that does 84

GROUNDING THE LEXICAL SEMANTICS OF VERBS model reconstruction from real video but does not use the recovered force-dynamic relations toperform event classiï¬cation. That work uses an approach to stability analysis based on dynamicsinstead of the kinematic approach used in this paper. There is also a body of prior work that grounds fragments of natural-language semantics in physical relations between objects in graphically represented blocks worlds or for solving physicsword problems. Examples of such work include Bobrow (1964), Winograd (1972), and Palmer(1990) as well the ISSAC system (Novak, 1976) and the MECHO project (Bundy, Luger, Palmer,& Welham, 1998; Bundy, Byrd, Luger, Mellish, Milne, & Palmer, 1979; Luger, 1981). Whilethat work does not focus on recognizing events, per se, it does relate lexical semantics to physicalrelations between represented objects. LEONARD currently does not contain a learning component. It is given a ï¬xed physical theory of the world, implicitly represented in the model-reconstruction procedure, and a ï¬xed collectionof event-type descriptions, explicitly formulated as event-logic expressions. One potential area forfuture work would be to automatically learn a physical theory of the world and/or event-type descrip-tions. Adding a learning component could potentially produce more robust model-reconstructionand event-classiï¬cation components than those currently constructed by hand. Techniques such asthose presented in Martin and Geffner (2000) and Cumby and Roth (2000) might be useful for thistask. 8. Conclusion This paper has presented LEONARD, a comprehensive implemented system for recovering eventoccurrences from video input. It differs from the prior approach to the same problem in two funda-mental ways. First, it uses state changes in the force-dynamic relations between objects, instead ofmotion proï¬le, as the key descriptive element in deï¬ning event types. Second, it uses event logic,instead of hidden Markov models, to perform event classiï¬cation. One key result of this paper isthe formulation of spanning intervals, a novel efï¬cient representation of the inï¬nite sets of intervalsthat arise when processing liquid and semi-liquid events. A second key result of this paper is theformulation of an efï¬cient procedure, based on spanning intervals, for inferring all occurrences ofcompound event types from occurrences of primitive event types. The techniques of force-dynamicmodel reconstruction, spanning intervals, and event-logic inference have been used to successfullyrecognize seven event types from real video: pick up, put down, stack, unstack, move, assemble,and disassemble. Using force dynamics and event logic to perform event recognition offers fourkey advantages over the prior approach of using motion proï¬le and hidden Markov models. First,it is insensitive to variance in the motion proï¬le of an event occurrence. Second, it is insensitive tothe presence of extraneous objects in the ï¬eld of view. Third, it allows temporal segmentation ofsequential and parallel event occurrences. Fourth, it robustly detects the non-occurrence of eventsas well as their occurrence. At a more fundamental level, this paper advances a novel methodology: grounding lexical- semantic representations in visual-event perception as a means for assessing the accuracy of suchrepresentations. Prior work in lexical-semantic representations has used calculi whose semanticswere not precisely speciï¬ed. Lexical entries formulated in such calculi derived their meaning fromintuition and thus could not be empirically tested. By providing a lexical-semantic representationwhose semantics is precisely speciï¬ed via perceptual grounding, this paper opens up the ï¬eld oflexical semantics to empirical evaluation. The particular representations advanced in this paper 85

SISKIND are clearly only approximations to the ultimate truth. This follows from the primitive state of ourunderstanding of language and perception. Nonetheless, I hope that this paper offers an advancetowards the ultimate truth, both through its novel methodology and the particular details of itsmechanisms. References Abe, N., Soga, I., & Tsuji, S. (1981). A plot understanding system on reference to both image and language. In Proceedings of the Seventh International Joint Conference on Artiï¬cialIntelligence, pp. 77â84, Vancouver, BC. Abe, N., & Tsuji, S. (1982). A learning of object structures by verbalism. In Horecky, J. (Ed.), COLING 92, pp. 1â6. Elsevier North-Holland. Adler, M. R. (1977). Computer interpretation of Peanuts cartoons. In Proceedings of the Fifth International Joint Conference on Artiï¬cial Intelligence, p. 608, Cambridge, MA. Allen, J. F. (1983). Maintaining knowledge about temporal intervals. Communications of the ACM, 26(11), 832â843. Badler, N. I. (1975). Temporal scene analysis: Conceptual descriptions of object movements. Tech. rep. 80, University of Toronto Department of Computer Science. Baillargeon, R. (1986). Representing the existence and the location of hidden objects: Object per- manence in 6- and 8-month-old infants. Cognition, 23, 21â41. Baillargeon, R. (1987). Object permanence in 3 1 - and 4 1 -month-old infants. Developmental Psy- 2 2 chology, 23(5), 655â664. Baillargeon, R., Spelke, E. S., & Wasserman, S. (1985). Object permanence in ï¬ve-month-old infants. Cognition, 20, 191â208. Blum, M., Grifï¬th, A. K., & Neumann, B. (1970). A stability test for conï¬gurations of blocks. A. I. Memo 188, MIT Artiï¬cial Intelligence Laboratory. Bobick, A. F., & Ivanov, Y. A. (1998). Action recognition using probabilistic parsing. In Proceed- ings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition,pp. 196â202, Santa Barbara, CA. Bobrow, D. (1964). Natural language input for a computer problem solving system. Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, MA. Borchardt, G. C. (1984). A computer model for the representation and identiï¬cation of physical events. Tech. rep. Tâ142, Coordinated Sciences Laboratory, University of Illinois at Urbana-Champaign. Borchardt, G. C. (1985). Event calculus. In Proceedings of the Ninth International Joint Conference on Artiï¬cial Intelligence, pp. 524â527, Los Angeles, CA. Brand, M. (1996). Understanding manipulation in video. In Proceedings of the 2nd International Conference on Face and Gesture Recognition, pp. 94â99, Killington, VT. Brand, M. (1997a). The inverse hollywood problem: From video to scripts and storyboards via causal analysis. In Proceedings of the Fourteenth National Conference on Artiï¬cial Intelli-gence, pp. 132â137, Providence, RI. 86

GROUNDING THE LEXICAL SEMANTICS OF VERBS Brand, M. (1997b). Physics-based visual understanding. Computer Vision and Image Understand- ing, 65(2), 192â205. Brand, M., Birnbaum, L., & Cooper, P. (1993). Sensible scenes: Visual understanding of com- plex scenes through causal analysis. In Proceedings of the Eleventh National Conference onArtiï¬cial Intelligence, pp. 588â593. Brand, M., Oliver, N., & Pentland, A. (1997). Coupled hidden Markov models for complex action recognition. In Proceedings of the IEEE Computer Society Conference on Computer Visionand Pattern Recognition. Bundy, A., Byrd, L., Luger, G., Mellish, C., Milne, R., & Palmer, M. (1979). MECHO: a program to solve mechanics problems. Tech. rep. Working paper 50, Department of Artiï¬cial Intelli-gence, Edinburgh University. Bundy, A., Luger, G., Palmer, M., & Welham, R. (1998). MECHO: Year one. In Brady, M. (Ed.), Proceedings of the 2nd AISB Conference, pp. 94â103, Edinburgh. Cumby, C., & Roth, D. (2000). Relational representations that facilitate learning. In Proceedings of the 7th International Conference on Knowledge Representation and Reasoning, pp. 425â434,Colorado. Morgan Kaufmann. Dowty, D. R. (1979). Word Meaning and Montague Grammar. D. Reidel Publishing Company, Boston, MA. Fahlman, S. E. (1974). A planning system for robot construction tasks. Artiï¬cial Intelligence, 5(1), 1â49. Funt, B. V. (1980). Problem-solving with diagrammatic representations. Artiï¬cial Intelligence, 13(3), 201â230. Herskovits, A. (1986). Language and Spatial Cognition: An Interdisciplinary Study of the Preposi- tions in English. Cambridge University Press, New York, NY. Jackendoff, R. (1983). Semantics and Cognition. MIT Press, Cambridge, MA. Jackendoff, R. (1990). Semantic Structures. MIT Press, Cambridge, MA. Jackendoff, R., & Landau, B. (1991). Spatial language and spatial cognition. In Napoli, D. J., & Kegl, J. A. (Eds.), Bridges Between Psychology and Linguistics: A Swarthmore Festschriftfor Lila Gleitman. Erlbaum, Mahwah, NJ. Jepson, A. D., & Richards, W. A. (1993). What is a percept?. Tech. rep. RBCVâTRâ93â43, Uni- versity of Toronto Department of Computer Science. Krifka, M. (1992). Thematic relations as links between nominal reference and temporal constitution. In Sag, I. A., & Szabolcsi, A. (Eds.), Lexical Matters. CSLI. Leech, G. N. (1969). Towards a Semantic Description of English. Indiana University Press. Luger, G. (1981). Mathematical model building in the solution of mechanics problems: Human protocols and the MECHO trace. Cognitive Science, 5, 55â77. Mann, R., & Jepson, A. D. (1998). Toward the computational perception of action. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.794â799, Santa Barbara, CA. 87

SISKIND Mann, R., Jepson, A. D., & Siskind, J. M. (1996). The computational perception of scene dynam- ics. In Proceedings of the Fourth European Conference on Computer Vision, pp. 528â529,Cambridge, UK. Springer-Verlag. Mann, R., Jepson, A. D., & Siskind, J. M. (1997). The computational perception of scene dynamics. Computer Vision and Image Understanding, 65(2). Marr, D., & Vaina, L. (1982). Representation and recognition of the movements of shapes. Proc. R. Soc. Lond. B, 214, 501â524. Martin, M., & Geffner, H. (2000). Learning generalized policies in planning using concept lan- guages. In Proceedings of the 7th International Conference on Knowledge Representationand Reasoning, Colorado. Morgan Kaufmann. McCarthy, J. (1980). Circumscriptionâa form of non-monotonic reasoning. Artiï¬cial Intelligence, 13(1â2), 27â39. Miller, G. A. (1972). English verbs of motion: A case study in semantics and lexical memory. In Melton, A. W., & Martin, E. (Eds.), Coding Processes in Human Memory, chap. 14, pp.335â372. V. H. Winston and Sons, Inc., Washington, DC. Novak, G. (1976). Computer understanding of physics problems stated in natural language. Ameri- can Journal of Computational Linguistics, 53. Novak, G. S., & Bulko, W. C. (1990). Understanding natural language with diagrams. In Proceed- ings of the Eighth National Conference on Artiï¬cial Intelligence, Boston, MA. Okada, N. (1979). SUPP: Understanding moving picture patterns based on linguistic knowledge. In Proceedings of the Sixth International Joint Conference on Artiï¬cial Intelligence, pp. 690â692, Tokyo. Palmer, M. (1990). Semantic Processing for Finite Domains. ACL Book Series. Cambridge Uni- versity Press, New York, NY. Pinker, S. (1989). Learnability and Cognition. MIT Press, Cambridge, MA. Regier, T. P. (1992). The Acquisition of Lexical Semantics for Spatial Terms: A Connectionist Model of Perceptual Categorization. Ph.D. thesis, University of California at Berkeley. Rubin, J. M., & Richards, W. A. (1985). Boundaries of visual motion. A. I. Memo 835, MIT Artiï¬cial Intelligence Laboratory. Schank, R. C. (1973). The fourteen primitive actions and their inferences. Memo AIMâ183, Stan- ford Artiï¬cial Intelligence Laboratory. Shoham, Y. (1987). Temporal logics in AI: Semantical and ontological considerations. Artiï¬cial Intelligence, 33(1), 89â104. Siskind, J. M. (1991). Naive physics, event perception, lexical semantics and language acquisition. In The AAAI Spring Symposium Workshop on Machine Learning of Natural Language andOntology, pp. 165â168. Siskind, J. M. (1992). Naive Physics, Event Perception, Lexical Semantics, and Language Acquisi- tion. Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, MA. Siskind, J. M. (1993). Grounding language in perception. In Proceedings of the Annual Conference of the Society of Photo-Optical Instrumentation Engineers, Boston, MA. 88

GROUNDING THE LEXICAL SEMANTICS OF VERBS Siskind, J. M. (1994). Axiomatic support for event perception. In McKevitt, P. (Ed.), Proceedings of the AAAI Workshop on the Integration of Natural Language and Vision Processing, pp.153â160, Seattle, WA. Siskind, J. M. (1995). Grounding language in perception. Artiï¬cial Intelligence Review, 8, 371â391. Siskind, J. M. (1996). Unsupervised learning of visually-observed events. In The AAAI Fall Sympo- sium Workshop on Learning Complex Behaviors in Adaptive Intelligent Systems, pp. 82â83. Siskind, J. M. (1997). Visual event perception. In Ikeuchi, K., & Veloso, M. (Eds.), Symbolic Visual Learning, chap. 9. Oxford University Press, New York, NY. Siskind, J. M. (1999). Visual event perception. In Proceedings of the 9th NEC Research Symposium. Also available as Technical report 99â033, NEC Research Institute, Inc. Siskind, J. M. (2000). Visual event classiï¬cation via force dynamics. In Proceedings of the Seven- teenth National Conference on Artiï¬cial Intelligence, pp. 149â155, Austin, TX. Siskind, J. M., & Morris, Q. (1996). A maximum-likelihood approach to visual event classiï¬ca- tion. In Proceedings of the Fourth European Conference on Computer Vision, pp. 347â360,Cambridge, UK. Springer-Verlag. Spelke, E. S. (1983). Cognition in infancy. Occasional papers in cognitive science 28, Massachusetts Institute of Technology. Spelke, E. S. (1987). Where perceiving ends and thinking begins: the apprehension of objects in infancy. In Yonas, A. (Ed.), Perceptual development in infancy. Minnesota symposia in childpsychology, pp. 197â234. Erlbaum, Mahwah, NJ. Spelke, E. S. (1988). The origins of physical knowledge. In Weiskrantz, L. (Ed.), Thought Without Language, chap. 7, pp. 168â184. Oxford University Press, New York, NY. Starner, T. E. (1995). Visual recognition of American Sign Language using hidden Markov models. Masterâs thesis, Massachusetts Institute of Technology, Cambridge, MA. Talmy, L. (1988). Force dynamics in language and cognition. Cognitive Science, 12, 49â100. Thibadeau, R. (1986). Artiï¬cial perception of actions. Cognitive Science, 10(2), 117â149. Tsuji, S., Morizono, A., & Kuroda, S. (1977). Understanding a simple cartoon ï¬lm by a com- puter vision system. In Proceedings of the Fifth International Joint Conference on Artiï¬cialIntelligence, pp. 609â610, Cambridge, MA. Tsuji, S., Osada, M., & Yachida, M. (1979). Three dimensional movement analysis of dynamic line images. In Proceedings of the Sixth International Joint Conference on Artiï¬cial Intelligence,pp. 896â901, Tokyo. Tsuji, S., Osada, M., & Yachida, M. (1980). Tracking and segmentation of moving objects in dynamic line images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2(6),516â522. Vendler, Z. (1967). Linguistics in Philosophy. Cornell University Press, Ithaca, NY. Verkuyl, H. J. (1989). Aspectual classes and aspectual composition. Linguistics and Philosophy, 12(1), 39â94. 89

SISKIND Waltz, D. L. (1981). Toward a detailed model of processing for language describing the physical world. In Proceedings of the Seventh International Joint Conference on Artiï¬cial Intelligence,pp. 1â6, Vancouver, BC. Waltz, D. L., & Boggess, L. (1979). Visual analog representations for natural language understand- ing. In Proceedings of the Sixth International Joint Conference on Artiï¬cial Intelligence, pp.926â934, Tokyo. Winograd, T. (1972). Understanding Natural Language. Academic Press, New York, NY. Yamoto, J., Ohya, J., & Ishii, K. (1992). Recognizing human action in time-sequential images using hidden Markov model. In Proceedings of the 1992 IEEE Conference on Computer Vision andPattern Recognition, pp. 379â385. IEEE Press. 90