M. A. Walker, I. Langkilde-Geary, H. Wright Hastie, J. Wright and A. Gorin
Volume 16, 2002
Journal of Artificial Intelligence Research 16 (2002) 293-319. Submitted 11/01; published 5/02.

Automatically Training a Problematic Dialogue Predictor for a Spoken Dialogue System

Marilyn A. Walker (walker@research.att.com)
Irene Langkilde-Geary (ilangkil@isi.edu)
Helen Wright Hastie (hhastie@research.att.com)
Jerry Wright (jwright@research.att.com)
Allen Gorin (algor@research.att.com)
AT&T Shannon Laboratory, 180 Park Ave., Bldg 103, Room E103, Florham Park, NJ 07932

Abstract

Spoken dialogue systems promise efficient and natural access to a large variety of information sources and services from any phone. However, current spoken dialogue systems are deficient in their strategies for preventing, identifying and repairing problems that arise in the conversation. This paper reports results on automatically training a Problematic Dialogue Predictor to predict problematic human-computer dialogues using a corpus of 4692 dialogues collected with the How May I Help You℠ spoken dialogue system. The Problematic Dialogue Predictor can be immediately applied to the system's decision of whether to transfer the call to a human customer care agent, or be used as a cue to the system's dialogue manager to modify its behavior to repair problems, and even perhaps, to prevent them. We show that a Problematic Dialogue Predictor using automatically obtainable features from the first two exchanges in the dialogue can predict problematic dialogues 13.2% more accurately than the baseline.

1. Introduction

Spoken dialogue systems promise efficient and natural access to a large variety of information sources and services from any phone. Systems that support short utterances to select a particular function (through a statement such as "Say credit card, collect or person-to-person") are saving companies millions of dollars per year.
Deployed systems and research prototypes exist for applications such as personal email and calendars, travel and restaurant information, and personal banking (Baggia, Castagneri, & Danieli, 1998; Walker, Fromer, & Narayanan, 1998; Seneff, Zue, Polifroni, Pao, Hetherington, Goddeau, & Glass, 1995; Sanderman, Sturm, den Os, Boves, & Cremers, 1998; Chu-Carroll & Carpenter, 1999) inter alia. Yet there are still many research challenges: current systems are limited in the interaction they support and brittle in many respects. This paper investigates methods by which spoken dialogue systems can learn to support more natural interaction on the basis of their previous experience.

One way that current spoken dialogue systems are quite limited is in their strategies for detecting and repairing problems that arise in conversation, such as misunderstandings due to speech recognition error or misinterpretation. If a problem can be detected, the system can either transfer the call to a human customer care agent or modify its dialogue strategy in an attempt to repair the problem. We can train systems to improve their ability to detect problems by exploiting dialogues collected in interactions with human users where the initial segments of these dialogues are used to train a Problematic Dialogue Predictor (PDP) to predict that a problem is likely to occur. The output of the PDP can be immediately applied to the system's decision of whether to transfer the call to a human customer care agent, or it could potentially be used as a cue to the system's Dialogue Manager to modify its behavior to repair problems, and even perhaps, to prevent them.

© 2002 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved.

In previous work, we reported initial results for training a PDP using a variety of different feature sets (Langkilde, Walker, Wright, Gorin, & Litman, 1999; Walker, Langkilde, Wright, Gorin, & Litman, 2000b).
When analyzing the performance of the fully automatic feature set, we examined which hand-labelled features made large performance improvements, under the assumption that future work should focus on developing automatic features that approximate the information provided by these hand-labelled features. The analysis singled out the hand-labelled SLU-success feature, which encodes whether the spoken language understanding (SLU) component captured the meaning of each exchange correctly. When this hand-labelled feature is added to the automatic features, it improves the performance of the PDP by almost 7.6%. This finding led us to develop an SLU-success predictor (Walker, Wright, & Langkilde, 2000c) and a new version of the PDP that we report on here. The new version of the PDP takes as input a fully automatic version of the SLU-success feature, which we call auto-SLU-success.

We train and test both the auto-SLU-success predictor and the PDP on a corpus of 4692 dialogues collected in an experimental trial of AT&T's How May I Help You (HMIHY℠) spoken dialogue system (Gorin, Riccardi, & Wright, 1997; Abella & Gorin, 1999; Riccardi & Gorin, 2000; E. Ammicht & Alonso, 1999). In this trial, the HMIHY system was installed at an AT&T customer care center. HMIHY answered calls from live customer traffic and successfully automated a large number of customer requests. An example dialogue that HMIHY completed successfully is shown in Figure 1. The phone numbers, card numbers, and pin numbers in the sample dialogues are artificial.

S1: AT&T How may I help you?
U1: I need to [ uh ] put a call on my calling card please
S2: May I have your card number, please?
U2: 7 6 5 4 3 2 1 0 9 8 7 6 5 4
S3: What number would you like to call?
U3: 8 1 4 7 7 7 6 6 6 6 (misunderstood)
S4: May I have that number again?
U4: 8 1 4 7 7 7 6 6 6 6
S5: Thank you.
Figure 1: Sample tasksuccess Dialogue

Note that the system's utterance in S4 consists of a repair initiation, motivated by the system's ability to detect that the user's utterance U3 was likely to have been misunderstood. The goal of the auto-SLU-success predictor is to improve the system's ability to detect such misunderstandings. The dialogues that have the desired outcome, in which HMIHY successfully automates the customer's call, are referred to as the tasksuccess dialogues. Dialogues in which the HMIHY system did not successfully complete the caller's task are referred to as problematic. These are described in further detail below.

This paper reports results from experiments that test whether it is possible to learn to automatically predict that a dialogue will be problematic on the basis of information the system has: (1) early in the dialogue; and (2) in real time. We train an automatic classifier for predicting problematic dialogues from features that can be automatically extracted from the HMIHY corpus. As described above, one of these features is the output of the auto-SLU-success predictor, the auto-SLU-success feature, which predicts whether or not the current utterance was correctly understood (Walker et al., 2000c). The results show that it is possible to predict problematic dialogues using fully automatic features with an accuracy ranging from 69.6% to 80.3%, depending on whether the system has seen one or two exchanges. When the classifier has access to features representing the whole dialogue, it is possible to identify problematic dialogues with an accuracy of up to 87%.

Section 2 describes HMIHY and the dialogue corpus that the experiments are based on. Section 3 discusses the type of machine learning algorithm adopted, namely RIPPER, and gives a description of the experimental design. Section 4 gives a breakdown of the features used in these experiments. Section 5 presents the method of predicting the feature auto-SLU-success and gives accuracy results.
Section 6 presents methods used for utilizing RIPPER to train the automatic Problematic Dialogue Predictor and gives the results. We delay our discussion of related work to Section 7 when we can compare it to our approach. Section 8 summarizes the paper and describes future work.

2. The HMIHY Data

HMIHY is a spoken dialogue system based on the notion of call routing (Gorin et al., 1997; Chu-Carroll & Carpenter, 1999). In the HMIHY call routing system, services that the user can access are classified into 14 categories, plus a category called other for tasks that are not covered by the automated system and must be transferred to a human operator (Gorin et al., 1997). Each category describes a different task, such as person-to-person dialing, or receiving credit for a misdialed number. The system determines which task the caller is requesting on the basis of its understanding of the caller's response to the open-ended system greeting "AT&T, How May I Help You?". Once the task has been determined, the information needed for completing the caller's request is obtained using dialogue submodules that are specific for each task (Abella & Gorin, 1999).

The HMIHY system consists of an automatic speech recognizer, a spoken language understanding module, a dialogue manager, and a computer telephony platform. During the trial, the behaviors of all the system modules were automatically recorded in a log file, and later the dialogues were transcribed by humans and labelled with one or more of the 15 task categories, representing the task that the caller was asking HMIHY to perform, on a per utterance basis. The log files also included labels indicating whether the wizard had taken over the call or the user had hung up. Our experiments use the log files to extract automatically obtainable features used as predictors, and to define the classes of dialogues that we want to learn to predict.
The corpus of 4692 dialogues used in our experiments was collected in several experimental trials of HMIHY on live customer traffic (Riccardi & Gorin, 2000; E. Ammicht & Alonso, 1999), and is referred to as HM2 in (Riccardi & Gorin, 2000). The dialogues vary in length: 97% are five exchanges or fewer, and 23% of all the dialogues consist of only two exchanges.

As mentioned above, dialogues in which HMIHY successfully automates the customer's call, as illustrated in Figure 1, are referred to as tasksuccess. Other calls, which are problematic, are divided into three categories. The first category, referred to as hangup, results from a customer's decision to hang up on the system. A sample hangup dialogue is in Figure 2. A caller may hang up because s/he is frustrated with the system; our goal is to learn from the corpus which system behaviors led to the caller's frustration.

S1: AT&T How may I help you?
U1: I need credit please.
S2: Okay. What is the reason for the credit?
U2: Miss, uh, Miss, different area code than I needed.
S3: Sorry. I need to know whether you need credit for a wrong number, bad connection or a call that was cut off.
U3: It is a wrong number. (impatient humming)
S4: What was the number that you dialed?
USER HANGS UP

Figure 2: Sample hangup Dialogue

The second problematic category (wizard) results from a human customer care agent's decision to take over the call from the system. Because HMIHY is experimental, each call during the field trial was monitored by a human agent serving as a wizard who could override the system. There were a number of agents who participated as wizards during the trial of HMIHY and each wizard was simply told to take over the call if s/he perceived problems with the system's performance. The wizard's decision was logged by the experimental setup, resulting in labelling the call as one that the wizard took over.
Of course we can only infer what might have motivated the wizard to take over the call, but we assume that the wizard had good reason for doing so. A dialogue where the wizard decided that the dialogue was problematic and took over the call is shown in Figure 3.

S1: AT&T How may I help you?
U1: (silence)
S2: Sorry. Please briefly tell me how I may help you?
U2: I'm trying to call 1 8 hundred call A T T.
S3: Do you want to place a call?
U3: Through my calling card.
S4: May I have your card number, please?
U4: It's 8 7 6 5 4 3 2 1 0 and then my pin number is 8 7 6 5. (misrecognized)
S5: Sorry. Please enter or say your card number again.
WIZARD STEPS IN

Figure 3: Sample wizard Dialogue

The third problematic category, the taskfailure dialogues, are cases where the system completed the call, but carried out a task that was not the one that the customer was actually requesting. An example taskfailure dialogue is given in Figure 4: HMIHY interpreted utterance U2 as a request to make a third-party call, e.g., to bill it to my home phone. HMIHY then asked the caller for the information it needed to carry out this task, the caller complied, and the system completed the call.

S1: AT&T How may I help you?
U1: (silence)
S2: Sorry. Please briefly tell me how I may help you?
U2: I wanta find out how much my bill's worth.
S3: What phone number would you like the call billed to?
U3: Um the access number.
S4: Sorry. Could I have that phone number again?
U4: 1 8 hundred 2 2 5 5 2 8 8.
S5: What number would you like to call?
U5: 800 225 5288 (touchtone input)
S6: Thank you.

Figure 4: Sample taskfailure Dialogue

3. Experimental Design

Our experiments apply the machine learning program RIPPER (Cohen, 1995, 1996) to automatically classify the dialogues as problematic or successful. RIPPER is a fast and efficient rule learning system described in more detail in (Cohen, 1995, 1996); we describe it briefly here for completeness.
RIPPER is based on the incremental reduced error pruning (IREP) algorithm described in (Fürnkranz & Widmer, 1994). RIPPER improves on IREP with an information gain metric to guide rule pruning and a Minimum Description Length or MDL-based heuristic for determining how many rules should be learned (see Cohen 1995, 1996 for more details). Like other learners, RIPPER takes as input the names of a set of classes to be learned, the names and ranges of values of a fixed set of features, and training data specifying the class and feature values for each example in a training set. Its output is a classification model for predicting the class of future examples, expressed as an ordered set of if-then rules.

Although any one of a number of learners could be applied to this problem, we had a number of reasons for choosing RIPPER. First, it was important to be able to integrate the results of applying the learner back into the HMIHY spoken dialogue system. Previous work suggests that the if-then rules that RIPPER uses to express the learned classification model are easy for people to understand (Catlett, 1991; Cohen, 1995), making it easier to integrate the learned rules into the HMIHY system. Second, RIPPER supports continuous, symbolic and textual bag (set) features (Cohen, 1996), while other learners, such as Classification and Regression trees (CART) (Breiman, Friedman, Olshen, & Stone, 1984), do not support textual bag features. There are several textual features in this dataset that prove useful in classifying the dialogues. One of the features that we wished to use was the string representing the recognizer's hypothesis. This is supported in RIPPER because there is no a priori limitation on the size of the set. The usefulness of the textual features is exemplified in Section 6.3.
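To make the form of such a model concrete, the sketch below applies an ordered list of if-then rules to one example: the first rule whose condition holds decides the class, and a default class applies otherwise. The second rule tests word membership in the recognizer string, in the spirit of a textual bag (set) feature. The rules, thresholds, default class, and the helper name `classify` are hypothetical illustrations that borrow feature names from this paper; they are not rules learned by RIPPER and do not follow RIPPER's own rule syntax.

```python
from typing import Any, Callable, Dict, List, Tuple

Example = Dict[str, Any]
Rule = Tuple[Callable[[Example], bool], str]

# Hypothetical rules for illustration only (not learned from the HMIHY corpus).
RULES: List[Rule] = [
    # Continuous and symbolic tests combined in one rule.
    (lambda ex: ex["e2-reprompt"] == "reprompt" and ex["e2-top-confidence"] < 0.5,
     "problematic"),
    # Set-valued (textual bag) test: does the recognized string contain a word?
    (lambda ex: "help" in ex["e2-recog"].split(), "problematic"),
]
DEFAULT_CLASS = "tasksuccess"


def classify(example: Example,
             rules: List[Rule] = RULES,
             default: str = DEFAULT_CLASS) -> str:
    """Return the class of the first matching rule, else the default class."""
    for condition, label in rules:
        if condition(example):
            return label
    return default
```

Because the rules are ordered, moving a rule earlier or later in the list can change the predicted class, which is one reason such models are usually read top to bottom.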
Finally, previous work in which we had applied other learners to the auto-SLU-success predictor, utilizing the best performing feature set with the textual bag features removed, suggested that we could not expect any significant performance improvements from using other learners (Walker et al., 2000c).

In order to train the problematic dialogue predictor (PDP), RIPPER uses a set of features. As discussed above, initial experiments showed that the hand-labelled SLU-success feature, which encodes whether an utterance has been misunderstood or not, is highly discriminatory in identifying problematic dialogues. However, all the features used to train the PDP must be totally automatic if we are to use the PDP in a working spoken dialogue system. In order to improve the performance of the fully automatic PDP, we developed a fully automatic approximation of the hand-labelled feature, which we call the auto-SLU-success feature, in separate experiments with RIPPER. The training of the auto-SLU-success feature is discussed in Section 5.

Evidence from previous trials of HMIHY suggests that it is important to identify problems within a couple of exchanges, and 97% of the dialogues in the corpus are five exchanges or fewer. Thus features for the first two exchanges are encoded, since the goal is to predict failures before they happen. The experimental architecture of the PDP is illustrated in Figure 5. This shows how RIPPER is used first to predict auto-SLU-success for the first and second exchanges. This feature is fed into the PDP along with the other automatic features. The output of the PDP determines whether the system continues, or if a problem is predicted, the Dialogue Manager may adapt its dialogue strategy or transfer the customer to a customer agent.
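As a rough sketch of this two-stage arrangement, the snippet below flattens the features of the first two exchanges into a single example for the PDP and adds a predicted auto-SLU-success value for each exchange. The `e1-`/`e2-` prefixes mirror the exchange prefix used in the paper's feature encoding; the function name `build_pdp_example` and the predictor passed in are assumptions, since the actual predictor is a trained RIPPER rule set that is not reproduced here.

```python
from typing import Any, Callable, Dict, List

Features = Dict[str, Any]


def build_pdp_example(exchanges: List[Features],
                      predict_slu_success: Callable[[Features], str],
                      max_exchanges: int = 2) -> Features:
    """Merge per-exchange features into one flat PDP input example."""
    example: Features = {}
    for i, feats in enumerate(exchanges[:max_exchanges], start=1):
        prefix = f"e{i}-"
        for name, value in feats.items():
            example[prefix + name] = value
        # Stage 1 output becomes one more automatic feature for stage 2.
        example[prefix + "auto-SLU-success"] = predict_slu_success(feats)
    return example
```

Only the first `max_exchanges` exchanges contribute, so the resulting example contains nothing that would be unavailable early in a live call.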
Since 23% of the dialogues consisted of only two exchanges, we exclude the second exchange features for those dialogues where the second exchange consists only of the system playing a closing prompt. We also excluded any features that indicated to the classifier that the second exchange was the last exchange in the dialogue. We compare results for predicting problematic dialogues with results for identifying problematic dialogues, where the classifier has access to features representing the whole dialogue.

In order to test the auto-SLU-success predictor as input to the PDP, we first defined a training and test set for the combined problem. The test set for the auto-SLU-success predictor contains the exchanges that occur in the dialogues of the PDP test set. We randomly selected 867 dialogues as the test set and then extracted the corresponding exchanges (3829 exchanges). Similarly for training, the PDP training set contains 3825 dialogues, which corresponds to a total of 16901 exchanges for training the auto-SLU-success predictor.

The feature auto-SLU-success is predicted for each utterance in the test set, thus enabling the system to be used on new data without the need for hand-labelling. However, there are two possibilities for the origin of this feature in the training set. The first possibility is for the training set to also consist of solely automatic features. This method has the potential advantage that the trained PDP will compensate, if necessary, for whatever noise exists in the auto-SLU-success predictions (Wright, 2000). An alternative to training the PDP on the automatically derived auto-SLU-success feature is to train it on the hand-labelled SLU-success while still testing it on the automatic feature. This second method is referred to as "hand-labelled-training" or hlt-SLU-success. This may provide a more accurate model but it may not capture the characteristics of the automatic feature in the test set.
Results for these two methods are presented in Section 6.4.

[Figure 5: System architecture using features from the first 2 exchanges. SLU features from Exchange 1 and Exchange 2 feed the auto-SLU-success feature predictors; their output, together with the automatic Exchange 1 and 2 features, is the input to the PDP. If P(Success) > T_PDP the system continues; otherwise it adapts the DM or transfers to a human customer agent.]

[Figure 6: Data for segmentation using cross-validation. The PDP training set is divided into partitions A, B, C and D, which rotate between training and testing the auto-SLU-success predictor; the PDP test set is kept separate.]

The problem with using auto-SLU-success for training the PDP is that the same data is used to train the auto-SLU-success predictor. Therefore, we used a cross-validation technique (also known as jack-knifing) (Weiss & Kulikowski, 1991), whereby the training set is partitioned into 4 sets. Three of these sets are used for training and the fourth for testing. The results for the fourth set are noted and the process is repeated, rotating the sets from training to testing. This results in a complete list of predicted auto-SLU-success for the training set. The features for the test set exchanges are derived by training RIPPER on the whole training set. This process is illustrated in Figure 6.

The following section gives a breakdown of the input features. Section 5 describes the training and results of the auto-SLU-success predictor and Section 6 reports the accuracy results for the PDP.

4. The Features

• Acoustic/ASR Features
  - recog, recog-numwords, asr-duration, dtmf-flag, rg-modality, rg-grammar, tempo
• SLU Features
  - a confidence measure for each of the 15 possible tasks that the user could be trying to do
  - salience-coverage, inconsistency, context-shift, top-task, nexttop-task, top-confidence, diff-confidence, confpertime, salpertime, auto-SLU-success
• Dialogue Manager and Discourse History Features
  - sys-label, utt-id, prompt, reprompt, confirmation, subdial
  - running tallies: num-utts, num-reprompts, percent-reprompts, num-confirms, percent-confirms, num-subdials, percent-subdials
  - whole dialogue: dial-duration
• Hand-Labelled Features
  - tscript, human-label, age, gender, user-modality, clean-tscript, cltscript-numwords, SLU-success

Figure 7: Features for spoken dialogues.

A dialogue consists of a sequence of exchanges where each exchange consists of one turn by the system followed by one turn by the user. Each dialogue and exchange is encoded using the set of 53 features in Figure 7. Each feature was either automatically logged by one of the system modules, hand-labelled by humans, or derived from raw features. The hand-labelled features are used to produce a topline, an estimation of how well a classifier could do that had access to perfect information. To see whether our results can generalize, we also experiment with a task-independent subset of the features, described in detail below.

Features logged by the system are utilized because they are produced automatically, and thus can be used during runtime to alter the course of the dialogue. The system modules for which logging information was collected were the acoustic processor/automatic speech recognizer (ASR) (Riccardi & Gorin, 2000), the spoken language understanding (SLU) module (Gorin et al., 1997), and the Dialogue Manager (DM) (Abella & Gorin, 1999).
Each module and the features obtained from it are described below.

Automatic Speech Recognition: The automatic speech recognizer (ASR) takes as input the caller's speech and produces a potentially errorful transcription of what it believes the caller said. The ASR features for each exchange include the output of the speech recognizer (recog), the number of words in the recognizer output (recog-numwords), the duration in seconds of the input to the recognizer (asr-duration), a flag for touchtone input (dtmf-flag), the input modality expected by the recognizer (rg-modality) (one of: none, speech, touchtone, speech+touchtone, touchtone-card, speech+touchtone-card, touchtone-date, speech+touchtone-date, or none-final-prompt), and the grammar used by the recognizer (rg-grammar) (Riccardi & Gorin, 2000). We also calculate a feature called tempo by dividing the value of the asr-duration feature by the recog-numwords feature.

The motivation for the ASR features is that any one of them may reflect recognition performance with a concomitant effect on spoken language understanding. For example, other work has found asr-duration to be correlated with incorrect recognition (Hirschberg, Litman, & Swerts, 1999). The name of the grammar (rg-grammar) could also be a predictor of SLU errors since it is well known that the larger the grammar is, the more likely an ASR error is. In addition, the rg-grammar feature also encodes expectations about user utterances at that point in the dialogue, which may correlate with differences in the ease with which any one recognizer could correctly understand the user's response. One motivation for the tempo feature is that previous work suggests that users tend to slow down their speech when the system has misunderstood them (Levow, 1998; Shriberg, Wade, & Price, 1992); this strategy actually leads to more errors since the speech recognizer is not trained on this type of speech.
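The tempo computation defined above is a single division; the sketch below spells it out. As defined, tempo is measured in seconds per recognized word, so a larger value corresponds to slower speech. Returning `None` when the recognizer output is empty (for example, after silence) is an assumption made here to avoid dividing by zero; the paper does not say how that case was encoded.

```python
from typing import Optional


def tempo(asr_duration: float, recog_numwords: int) -> Optional[float]:
    """Seconds of recognizer input per recognized word."""
    if recog_numwords == 0:
        # Assumption: undefined when no words were recognized.
        return None
    return asr_duration / recog_numwords
```

For an exchange with asr-duration 6.68 and recog-numwords 10, as in the paper's example encoding of a second exchange, this gives roughly 0.67 seconds per word.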
The tempo feature may also indicate hesitations, pauses, or interruptions, which could also lead to ASR errors. On the other hand, touchtone input in combination with speech, as encoded by the feature dtmf-flag, might increase the likelihood of understanding the speech: since the touchtone input is unambiguous it can constrain spoken language understanding.

Spoken Language Understanding: The goal of the spoken language understanding (SLU) module is to identify which of the 15 possible tasks the user is attempting and extract from the utterance any items of information that are relevant to completing that task, e.g., a phone number is needed for the task dial for me.

Fifteen of the features from the SLU module represent the distribution for each of the 15 possible tasks of the SLU module's confidence in its belief that the user is attempting that task (Gorin et al., 1997). We also include a feature to represent which task has the highest confidence score (top-task), and which task has the second highest confidence score (nexttop-task), as well as the value of the highest confidence score (top-confidence), and the difference in values between the top and next-to-top confidence scores (diff-confidence).

Other features represent other aspects of the SLU processing of the utterance. The inconsistency feature is an intra-utterance measure of semantic diversity, according to a task model of the domain (Abella & Gorin, 1999). Some task classes occur together quite naturally within a single statement or request, e.g., the dial for me task is compatible with the collect call task, but is not compatible with the billing credit task. The salience-coverage feature measures the proportion of the utterance which is covered by the salient grammar fragments. This may include the whole of a phone or card number if it occurs within a fragment.
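The four summary features derived from the per-task confidences (top-task, nexttop-task, top-confidence and diff-confidence) can be sketched as follows. The task keys `task1` to `task15` and the use of `"none"` when a slot has zero confidence follow the paper's example encoding of one exchange; treating `"none"` as a general rule, and the function name `summarize_confidences`, are assumptions for illustration.

```python
from typing import Dict, Union


def summarize_confidences(conf: Dict[str, float]) -> Dict[str, Union[str, float]]:
    """Derive top/next-top task features from per-task SLU confidences.

    Expects at least two tasks in `conf`.
    """
    ranked = sorted(conf.items(), key=lambda kv: kv[1], reverse=True)
    (top_task, top_conf), (next_task, next_conf) = ranked[0], ranked[1]
    return {
        # Assumption: a task with zero confidence is reported as "none".
        "top-task": top_task if top_conf > 0 else "none",
        "top-confidence": top_conf,
        "nexttop-task": next_task if next_conf > 0 else "none",
        "diff-confidence": top_conf - next_conf,
    }
```

With a confidence of .81 on one task and 0 on the other fourteen, this yields top-confidence .81, nexttop-task none and diff-confidence .81.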
The context-shift feature is an inter-utterance measure of the extent of a shift of context away from the current task focus, caused by the appearance of salient phrases that are incompatible with it, according to a task model of the domain.

In addition, similar to the way we calculated the tempo feature, we normalize the salience-coverage and top-confidence features by dividing them by asr-duration to produce the salpertime and confpertime features. The tempo and the confpertime and salpertime features are used only for predicting auto-SLU-success.

The motivation for these SLU features is to make use of information that the SLU module has as a result of processing the output of ASR and the current discourse context. For example, for utterances that follow the first utterance, the SLU module knows what task it believes the caller is trying to complete. The context-shift feature incorporates this knowledge of the discourse history, with the motivation that if it appears that the caller has changed her mind, then the SLU module may have misunderstood an utterance.

Dialogue Manager: The function of the Dialogue Manager is to take as input the output of the SLU module, decide what task the user is trying to accomplish, decide what the system will say next, and update the discourse history (Abella & Gorin, 1999). The Dialogue Manager decides whether it believes there is a single unambiguous task that the user is trying to accomplish, and how to resolve any ambiguity.

Features based on information that the Dialogue Manager logged about its decisions or features representing the ongoing history of the dialogue might be useful predictors of SLU errors or task failure. Some of the potentially interesting Dialogue Manager events arise due to low SLU confidence levels which lead the Dialogue Manager to reprompt the user or confirm its understanding.
A reprompt might be a variant of the same question that was asked before, or it could include asking the user to choose between two tasks that have been assigned similar confidences by the SLU module. For example, in the dialogue in Figure 2 the system utterance in S3 counts as a reprompt because it is a variant of the question in utterance S2.

The features that we extract from the Dialogue Manager are the task-type label, sys-label, whose set of values includes a value to indicate when the system had insufficient information to decide on a specific task-type, the utterance id within the dialogue (utt-id), the name of the prompt played to the user (prompt), and whether the type of prompt was a reprompt (reprompt), a confirmation (confirm), or a subdialogue prompt (subdial), a superset of the reprompts and confirmation prompts. The sys-label feature is intended to capture the fact that some tasks may be harder than others. The utt-id feature is motivated by the idea that the length of the dialogue may be important, possibly in combination with other features like sys-label. The different prompt features for initial prompts, reprompts, confirmation prompts and subdialogue prompts are motivated by results indicating that reprompts and confirmation prompts are frustrating for callers and that callers are likely to hyperarticulate when they have to repeat themselves, which results in ASR errors (Shriberg et al., 1992; Levow, 1998; Walker, Kamm, & Litman, 2000a).

The discourse history features included running tallies for the number of reprompts (num-reprompts), number of confirmation prompts (num-confirms), and number of subdialogue prompts (num-subdials), that had been played before the utterance currently being processed, as well as running percentages (percent-reprompts, percent-confirms, percent-subdials).
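A minimal sketch of these running tallies and percentages is given below. It assumes each exchange is described by boolean flags for its system prompt and that a percentage is the tally divided by the number of system utterances so far, which is how the paper describes its percent-subdials example; the flag names and the function `running_tallies` are otherwise illustrative.

```python
from typing import Dict, List

PROMPT_TYPES = ("reprompt", "confirm", "subdial")


def running_tallies(prompts: List[Dict[str, bool]]) -> List[Dict[str, float]]:
    """Cumulative prompt counts and percentages, one row per exchange."""
    counts = {t: 0 for t in PROMPT_TYPES}
    rows: List[Dict[str, float]] = []
    for n, flags in enumerate(prompts, start=1):  # n = system utterances so far
        row: Dict[str, float] = {}
        for t in PROMPT_TYPES:
            counts[t] += int(flags.get(t, False))
            row[f"num-{t}s"] = counts[t]
            row[f"percent-{t}s"] = counts[t] / n
        rows.append(row)
    return rows
```

For a dialogue whose first system turn is an initial prompt and whose second is a reprompt that also opens a subdialogue, the second row has num-reprompts 1, percent-reprompts 0.5, num-subdials 1, percent-subdials 0.5 and num-confirms 0.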
The use of running tallies and percentages is based on previous work suggesting that normalized features are more likely to produce generalized predictors (Litman, Walker, & Kearns, 1999). One feature, dial-duration, is available for identifying problematic dialogues but not for predicting them from initial segments of the dialogue.

Hand Labelling: As mentioned above, the features obtained via hand-labelling are used to provide a topline against which to compare the performance of the fully automatic features. The hand-labelled features include human transcripts of each user utterance (tscript), a set of semantic labels that are closely related to the system task-type labels (human-label), age (age) and gender (gender) of the user, the actual modality of the user utterance (user-modality) (one of: nothing, speech, touchtone, speech+touchtone, non-speech), and a cleaned transcript with non-word noise information removed (clean-tscript). From these features, we calculated two derived features. The first was the number of words in the cleaned transcript (cltscript-numwords), again on the assumption that utterance length is strongly correlated with ASR and SLU errors. The second derived feature was based on calculating whether the human-label matches the sys-label from the Dialogue Manager (SLU-success). This feature is described in detail in the next section.

In the experiments, the features in Figure 7, excluding the hand-labelled features, are referred to as the automatic feature set. The experiments test how well misunderstandings can be identified and whether problematic dialogues can be predicted using the automatic features. We compare the performance of the automatic feature set to the full feature set including the hand-labelled features and to the performance of the automatic feature set with and without the auto-SLU-success feature.
Figure 8 gives an example of the encoding of some of the automatic features for the second exchange of the wizard dialogue in Figure 3. The prefix "e2-" designates the second exchange. We discuss several of the feature values here to ensure that the reader understands the way in which the features are used. In utterance S2 in Figure 3, the system says Sorry please briefly tell me how I may help you. In Figure 8, this is encoded by several features. The feature e2-prompt gives the name of that prompt, top-reject-rep. The feature e2-reprompt specifies that S2 is a reprompt, a second attempt by the system to elicit a description of the caller's problem. The feature e2-confirm specifies that S2 is not a confirmation prompt. The feature e2-subdial specifies that S2 initiates a subdialogue and e2-num-subdials encodes that this is the first subdialogue so far, while e2-percent-subdials encodes that out of all the system utterances so far, 50% of them initiate subdialogues.

As mentioned earlier, we are also interested in generalizing our problematic dialogue predictor to other systems. Thus, we trained ripper using only features that are both automatically acquirable during runtime and independent of the hmihy task. The subset of features from Figure 7 that fits this qualification is given in Figure 9. We refer to it as the auto, task-indept feature set. Examples of features that are not task-independent include recog-grammar, sys-label, prompt and the hand-labelled features.

5. Auto-SLU-success Predictor

The goal of the auto-SLU-success predictor is to identify, for each exchange, whether or not the system correctly understood the user's utterance. As mentioned above, when the dialogues were transcribed by humans after the data collection was completed, the human labelers not only transcribed the users' utterances, but also labelled each utterance with a semantic category representing the task that the user was asking hmihy to perform. This label is called the human-label.
The system's Dialogue Manager decides among several different hypotheses produced by the slu module, and logs its hypothesis about what task the user was asking hmihy to perform; the Dialogue Manager's hypothesis is known as the sys-label. We distinguish four classes of spoken language understanding outcomes based on comparing the human-label, the sys-label and recognition results for card and telephone numbers: (1) rcorrect: slu correctly identified the task and any digit strings were also correctly recognized; (2) rpartial-match: slu correctly recognized the task but there was an error in recognizing a calling card number or a phone number; (3) rmismatch: slu did not correctly identify the user's task; (4) no-recog: the recognizer did not get any input to process and so the slu module did not either. This can arise either because the user did not say anything or because the recognizer was not listening when the user spoke.

e2-recog: can charge no one eight hundred call A T T
e2-rg-modality: speech-plus-touchtone
e2-recog-numwords: 10
e2-user-modality: speech
e2-dtmf-flag: 0
e2-rg-grammar: Reprompt-gram
e2-asr-duration: 6.68
e2-top-task: dial-for-me
e2-top-confidence: .81
e2-nexttop-task: none
e2-diff-confidence: .81
e2-salience-coverage: 0.000
e2-task1: 0
e2-inconsistency: 0.000
e2-task2: 0
e2-context-shift: 0.000
e2-task3: 0
e2-prompt: top-reject-rep
e2-task4: 0
e2-reprompt: reprompt
e2-task5: 0
e2-num-reprompts: 1
e2-task6: .81
e2-percent-reprompts: 0.5
e2-task7: 0
e2-confirm: not-confirm
e2-task8: 0
e2-num-confirms: 0
e2-task9: 0
e2-percent-confirms: 0
e2-task10: 0
e2-subdial: subdial
e2-task11: 0
e2-num-subdials: 1
e2-task12: 0
e2-percent-subdials: 0.5
e2-task13: 0
e2-cltscript-numwords: 11
e2-task14: 0
e2-sys-label: DIAL-FOR-ME
e2-task15: 0
e2-human-label: no-info digitstr
e2-no-info: 1
e2-tscript: epr I'm trying to call uh 1 8 hundred call A T T nspn
e2-clean-tscript: I'm trying to call 1 8 hundred call A T T
Figure 8: Feature encoding for Second Exchange of wizard dialogue.

• Acoustic/ASR Features
  - recog, recog-numwords, asr-duration, dtmf-flag, rg-modality actual modality of the user utterance.
• SLU Features
  - salience-coverage, inconsistency, context-shift, top-confidence, diff-confidence, auto-SLU-success.
• Dialogue Manager Features
  - utterance by utterance: utt-id, reprompt, confirmation, subdial
  - running tallies: num-utts, num-reprompts, percent-reprompts, num-confirms, percent-confirms, num-subdials, percent-subdials, dial-duration
Figure 9: Automatic task-independent features available at runtime.

The rcorrect class accounts for 7481 (36.1%) of the exchanges in the corpus. The rpartial-match class accounts for 109 (0.5%) of the exchanges. The rmismatch class accounts for 4197 (20.2%) of the exchanges and the no-recog class accounts for 8943 (43.1%) of the exchanges.

The auto-SLU-success predictor is trained using 45 fully automatic features. These features are the Acoustic/asr features, slu features and Dialogue Manager and Discourse History features, given in Figure 7. Hand-labelled features were not used. We evaluate the four-way auto-SLU-success classifier by reporting accuracy, precision, recall and the categorization confusion matrix. This classifier is trained on all the features for the whole training set, and then tested on the held-out test set.

Table 1 summarizes the overall accuracy results of the system trained on the whole training set and tested on the test set described in Section 3. The first row of Table 1 represents the accuracy from always guessing the majority class (no-recog); this is the baseline against which the other results should be compared. The second row, labelled automatic, shows the accuracy based on using all the features available from the system modules.
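The majority-class baseline can be reproduced from the corpus-wide class counts quoted above. The following is a small sketch of that arithmetic; the variable names are ours, and it uses the corpus-wide exchange counts rather than counts restricted to the held-out test set.

```python
# Exchange counts per class, as quoted in the text for the whole corpus.
counts = {
    "rcorrect": 7481,
    "rpartial-match": 109,
    "rmismatch": 4197,
    "no-recog": 8943,
}

total = sum(counts.values())
shares = {label: 100 * n / total for label, n in counts.items()}

# Always guessing the most frequent class is right exactly as often as
# that class occurs, so its share is the baseline accuracy.
majority = max(counts, key=counts.get)
baseline_accuracy = shares[majority]

for label, pct in shares.items():
    print(f"{label:15s} {pct:5.1f}%")
print(f"majority class: {majority}, baseline accuracy: {baseline_accuracy:.1f}%")
```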
The automatic classifier identifies slu errors 47.0% better than the baseline (90.1% versus 43.1%). An experiment was run to see if the cross-validation method described in Section 3 performs worse than using the whole data on the same test set. This experiment showed that there was little loss of accuracy when using cross-validation (0.6%).

Features Used                Accuracy
baseline (majority class)    43.1%
automatic                    90.1%
Table 1: Results for detecting slu Errors using ripper

Figure 10 shows some top performing rules that ripper learns when given all the features. These rules directly reflect the usefulness of the slu features. Note that some of the rules use asr features in combination with slu features such as salpertime. Previous studies (Walker et al., 2000c) have also shown slu features to be useful. We had also hypothesized that features from the Dialogue Manager and the discourse history might be useful predictors of slu errors; however, these features rarely appear in the rules, with the exception of sys-label. This is in accordance with previous experiments which show that these features do not add significantly to the performance of the slu only feature set (Walker et al., 2000c).

We also report precision and recall for each category on the held-out test set. The results are shown in Tables 2 and 3. Table 2 shows that the classification accuracy rate is a result of a high rate of correct classification for the rcorrect and no-recog classes, at the cost of a lower rate for rmismatch and rpartial-match. This is probably due to the fact that there are fewer examples of these categories in the training set. In some situations, one might not need to distinguish between the different misunderstanding categories: no-recog, rmismatch and rpartial-match.
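The per-class figures in Table 2 follow directly from the confusion matrix in Table 3. The sketch below recomputes them; it assumes that rows of Table 3 are actual classes and columns are predicted classes, an orientation inferred from the fact that the row-wise ratios reproduce the recall column of Table 2. The function name per_class_scores is ours.

```python
labels = ["rcorrect", "no-recog", "rmismatch", "rpartial-match"]

# Counts from Table 3. Assumed orientation: rows are actual classes,
# columns are predicted classes.
matrix = [
    [2784, 6, 211, 5],
    [9, 3431, 44, 0],
    [409, 83, 1204, 10],
    [6, 0, 28, 10],
]


def per_class_scores(matrix):
    """Return a list of (recall %, precision %) pairs, one per class."""
    n = len(matrix)
    scores = []
    for k in range(n):
        true_pos = matrix[k][k]
        actual = sum(matrix[k])                          # row sum
        predicted = sum(matrix[r][k] for r in range(n))  # column sum
        scores.append((100 * true_pos / actual, 100 * true_pos / predicted))
    return scores


scores = per_class_scores(matrix)
correct = sum(matrix[k][k] for k in range(len(matrix)))
accuracy = 100 * correct / sum(map(sum, matrix))

for label, (recall, precision) in zip(labels, scores):
    print(f"{label:15s} recall {recall:5.1f}%  precision {precision:5.1f}%")
print(f"overall accuracy {accuracy:.1f}%")
```

The overall accuracy computed from the matrix is about 90.2%, in line with the 90.1% reported in Table 1.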
Experiments were therefore performed that collapsed these three problematic categories (no-recog, rmismatch and rpartial-match) into one category (rincorrect). This resulted in a recognition accuracy of 92.4%, a 29.4% improvement over the baseline of 63%, which is the percentage of rincorrect exchanges. The precision and recall matrix is given in Table 4.

if (sys-label = DIAL-FOR-ME) ∧ (dtmf-flag = 0) ∧ (recog contains "from") ∧ (recog-numwords ≤ 8) then rpartial-match
if (sys-label = DIAL-FOR-ME) ∧ (dtmf-flag = 0) ∧ (asr-duration ≥ 4.08) ∧ (recog-grammar = Billmethod-gram) ∧ (recog contains "my") ∧ (recog-numwords ≤ 8) then rpartial-match
if (spoken-digit = 1) ∧ (salpertime ≤ 0.05) ∧ (top-confidence ≤ 0.851) then rmismatch
if (spoken-digit = 1) ∧ (salpertime ≤ 0.05) ∧ (confpertime ≤ 0.076) then rmismatch
if (spoken-digit = 1) ∧ (top-confidence ≤ 0.836) then rmismatch
if (spoken-digit = 1) ∧ (salpertime ≤ 0.05) ∧ (sys-label = CALLING-CARD) then rmismatch
if (asr-duration ≥ 1.04) ∧ (utt-id ≥ 2) ∧ (sys-label = DIAL-FOR-ME) ∧ (diff-confidence ≤ 0.75) then rmismatch
Figure 10: A subset of rules learned by ripper when given the automatic features for determining auto-SLU-success

Class             Recall    Precision
rcorrect          92.6%     86.8%
no-recog          98.5%     97.5%
rmismatch         70.6%     81.0%
rpartial-match    22.7%     40.0%
Table 2: Precision and Recall for Test set using Automatic features

                  rcorrect  no-recog  rmismatch  rpartial
rcorrect          2784      6         211        5
no-recog          9         3431      44         0
rmismatch         409       83        1204       10
rpartial-match    6         0         28         10
Table 3: Confusion Matrix for Test set using Automatic features

Class             Recall    Precision
rcorrect          91.2%     89.0%
rincorrect        93.1%     94.5%
Table 4: Precision and Recall for Test set using Automatic features

6. Problematic Dialogue Predictor

The goal of the PDP is to predict, on the basis of information that it has early in the dialogue, whether or not the system will be able to complete the user's task.
The output classes are based on the four dialogue categories described above. However, as hangup, wizard and taskfailure are treated as equivalently problematic by the system, as illustrated in Figure 5, these 3 categories are collapsed into problematic. Note that this categorization is inherently noisy because it is impossible to know the real reasons why a caller hangs up or a wizard takes over the call. The caller may hang up because she is frustrated with the system, or she may simply dislike automation, or her child may have started crying. Similarly, one wizard may have low confidence in the system's ability to recover from errors and use a conservative approach that results in taking over many calls, while another wizard may be more willing to let the system try to recover. Nevertheless, we take these human actions as a human labelling of these calls as problematic. Given this binary classification, approximately 33% of the calls in the corpus of 4692 dialogues are problematic and 67% are tasksuccess.

6.1 Problematic Dialogue Predictor Results

This section presents results for predicting problematic dialogues. Taking into account the fact that a problematic dialogue must be predicted at a point in the dialogue where the system can do something about it, we compare prediction accuracy after having seen only the first exchange or the first two exchanges with identification accuracy after having seen the whole dialogue. For each of these situations, we also compare results for the automatic feature set (as described earlier) with and without the auto-SLU-success feature and with the hand-labelled feature SLU-success.

Table 5 summarizes the overall accuracy results. The three columns present results for Exchange 1, Exchanges 1&2 and over the whole dialogue. The first row gives the baseline result which represents the prediction accuracy from always guessing the majority class.
Since 67.1% of the dialogues are tasksuccess dialogues, we can achieve 67.1% accuracy from simply guessing tasksuccess for each dialogue. The second row gives results using only automatic features, but without the auto-SLU-success feature. The third row uses the same automatic features but adds in auto-SLU-success. This feature is obtained for both the training and the test set, using the cross-validation method discussed in Section 3. The fourth and fifth rows show results using the subset of features that are both fully automatic and task-independent as described in Section 4.

The automatic results given in row 2 are significantly higher by a paired t-test than the baseline for all three sections of the dialogue (df=866, t=2.1, p=0.035; df=866, t=7.2, p=0.001; df=866, t=13.4, p=0.001).

Rows 6 and 7 show accuracy improvements gained by the addition of hand-labelled features. These rows give a topline against which to compare the results in rows 2, 3, 4 and 5. Results using all the automatic features plus the hand-labelled SLU-success are given in row 6. In these experiments, the hand-labelled SLU-success feature is used for training and testing. Comparing this result with the second row shows that if one had a perfect predictor of auto-SLU-success in the training and the test set, then this feature would increase accuracy by 5.5% for Exchange 1 (from 70.1% to 75.6%); by 7.6% for Exchanges 1&2 (from 78.1% to 85.7%); and by 5.9% for the whole dialogue (from 87.0% to 92.9%). These increases are significant by a paired t-test (df=866, t=5.1, p=0.0001; df=866, t=2.1, p=0.035; df=866, t=6.7, p=0.001).

Comparing the result in row 6 with the result in row 3 shows that the auto-SLU-success predictor that we have trained can improve performance, but could possibly help more with different training methods.
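The significance tests above report df=866, which is consistent with a paired comparison over the 867 test dialogues. As a hedged illustration only, the sketch below computes a paired t statistic from per-dialogue outcomes of two classifiers; that the test was run on per-dialogue correctness is our assumption, the function name paired_t is ours, and the toy outcome lists are invented for the example.

```python
import math
from statistics import mean, stdev


def paired_t(correct_a, correct_b):
    """Paired t statistic and degrees of freedom for two 0/1 outcome lists.

    Each list holds one entry per test dialogue (1 = predicted correctly).
    Raises ZeroDivisionError if all paired differences are identical.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both classifiers must be scored on the same dialogues")
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical toy outcomes for eight dialogues, not data from the paper.
with_feature = [1, 1, 1, 0, 1, 1, 1, 1]
without_feature = [1, 0, 1, 0, 0, 1, 1, 0]
t, df = paired_t(with_feature, without_feature)
print(f"t = {t:.2f}, df = {df}")
```

With 867 paired outcomes the same function returns 866 degrees of freedom, matching the values reported in the text.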
Ideally, the result in row 3, for automatic features plus auto-SLU-success, should fall between the figures in rows 2 and 6, and be closer to the results in row 6. With Exchanges 1&2, adding auto-SLU-success results in an increase of 1.1% which is not significant (compare rows 2 and 3). For Exchange 1 only, ripper does not use the auto-SLU-success feature in its ruleset and does not yield an improvement over the system trained only on the automatic features. The system trained on the whole dialogue with automatic features plus auto-SLU-success also does not yield an improvement over the system trained without auto-SLU-success.

6.1.1 Task-independent Features

Rows 4 and 5 give the results using the auto, task-indept feature set described in Figure 9 without and with the auto-SLU-success feature, respectively. These results are significantly above the baseline using a paired t-test, with Exchanges 1&2 giving an increase of 13.2% (from 67.1% to 80.3%; df=866, t=8.6, p=0.001) using task-indept features with auto-SLU-success. By comparing rows 4 and 5, one observes an increase in accuracy for the auto, task-indept feature set when the feature auto-SLU-success is added, using Exchanges 1&2 and the whole dialogue. The 1.9% increase for Exchanges 1&2 shows a trend (df=866, t=1.7, p=0.074), whereas the 2% increase for the whole dialogue is statistically significant by a paired t-test (df=866, t=3.0, p=0.003).

Although the task-indept feature sets are a subset of those features used in row 3, it is possible for them to perform better because the task-indept features are more general, and because ripper uses a greedy algorithm to discover its rule sets. For Exchanges 1&2, the increase from row 3 to row 5 (both of which use auto-SLU-success) is not significant. Comparing rows 2 and 4, neither of which uses auto-SLU-success, one sees a slight degradation in results for the whole dialogue using task-indept features.
However, the increase from row 2 to row 5, from 78.1% to 80.3% for Exchanges 1&2, is statistically significant (df=866, t=2.0, p=0.042). This shows that using auto-SLU-success in combination with the set of task-indept features produces a statistically significant increase in accuracy over a set of automatic features that does not include this feature.

Row  Features                                   Exchange 1  Exchange 1&2  Whole
1    Baseline                                   67.1        67.1          67.1
2    auto (no auto-SLU-success)                 70.1        78.1          87.0
3    auto + auto-SLU-success                    69.6        79.2          84.9
4    auto, task-indept (no auto-SLU-success)    70.1        78.4          83.4
5    auto, task-indept + auto-SLU-success       69.2        80.3          85.4
6    auto + SLU-success                         75.6        85.7          92.9
7    all (auto + Hand-labelled)                 77.1        86.9          91.7
Table 5: Accuracy % results for predicting problematic dialogues.

The main purpose of these experiments is to determine whether a dialogue is potentially problematic while it is still in progress, so using the whole dialogue is not useful in a dynamic system. Using Exchanges 1&2 produces accurate results and would enable the system to adapt in order to complete the dialogue in the appropriate manner.

6.1.2 Hand-labelled Features

Row 7 in Table 5 gives the results using hand-labelled and automatic features including both SLU-success and auto-SLU-success. By comparing rows 6 and 7, one can see that there is not very much to be gained by adding the other hand-labelled features given in Figure 7 to the automatic plus SLU-success feature set of row 6. Only the increase for Exchange 1 from 75.6% to 77.1% is significant (df=866, t=2.3, p=0.024). For the whole dialogue there is actually a degradation of results from 92.9% to 91.7%.

6.2 Precision and Recall

The performance of the system that uses automatic features (including auto-SLU-success) for the first exchange is given in Table 6.

Class         Occurred  Predicted  Recall   Precision
tasksuccess   67.0%     81.7%      88.1%    72.5%
problematic   33.0%     18.3%      31.6%    56.6%
Table 6: Precision and Recall with Exchange 1 Automatic Features
This system has an overall accuracy of 69.6%. These results show that, given the first exchange, the ruleset predicts that 18.3% of the dialogues will be problematic, while 33% of them actually will be. Of the problematic dialogues, it can predict 31.6% of them. Once it predicts that a dialogue will be problematic, it is correct 56.6% of the time.

The performance of the system that uses automatic features for Exchanges 1&2 is summarized in Table 7. These results show that, given the first two exchanges, this ruleset predicts that 20% of the dialogues will be problematic, while 33% of them actually will be. Of the problematic dialogues, it can predict 49.5% of them. Once it predicts that a dialogue will be problematic, it is correct 79.7% of the time. This classifier has an improvement of 17.87% in recall and 23.09% in precision, for an overall improvement in accuracy of 9.6% over using the first exchange alone.

Class         Occurred  Predicted  Recall   Precision
tasksuccess   67.0%     80.0%      94.8%    79.1%
problematic   33.0%     20.0%      49.5%    79.7%
Table 7: Precision and Recall with Exchange 1&2 Automatic Features

if (e2-salience-coverage ≤ 0.7) ∧ (e2-asr-duration ≥ 0.04) ∧ (e2-auto-SLU-success = no-recog) then problematic
if (e2-auto-SLU-success = rmismatch) ∧ (e2-sys-label = DIAL-FOR-ME) ∧ (e2-asr-duration ≥ 3.12) then problematic
if (e1-salience-coverage ≤ 0.727) ∧ (e2-salience-coverage ≤ 0.706) ∧ (e1-recog contains "help") ∧ (e1-asr-duration ≤ 2.44) then problematic
if (e1-top-confidence ≤ 0.924) ∧ (e2-auto-SLU-success = rmismatch) ∧ (e2-sys-label = CALLING-CARD) then problematic
if (e1-top-confidence ≤ 0.924) ∧ (e2-diff-confidence ≤ 0.918) ∧ (e2-collect ≤ 0.838) ∧ (e2-asr-duration ≥ 9.36) ∧ (e2-dial-for-me ≥ 0.5) then problematic
Figure 11: A subset of rules learned by ripper when given the automatic features for determining problematic dialogues

if (e2-top-confidence ≤ 0.897) ∧ (e2-asr-duration ≥ 0.04) ∧ (e2-auto-SLU-success = no-recog) then problematic
if (e2-auto-SLU-success = rmismatch) ∧ (e2-recog-numwords ≤ 7) ∧ (e2-asr-duration ≥ 2.28) then problematic
if (e2-salience-coverage ≤ 0.9) ∧ (e2-asr-duration ≥ 0.04) ∧ (e2-inconsistency ≥ 0.022) ∧ (e1-asr-duration ≤ 3.96) ∧ (e2-inconsistency ≥ 0.18) then problematic
if (e1-salience-coverage ≤ 0.667) ∧ (e2-salience-coverage ≤ 0.692) ∧ (e1-recog contains "help") ∧ (e1-asr-duration ≤ 7.36) then problematic
if (e1-top-confidence ≤ 0.924) ∧ (e2-diff-confidence ≤ 0.918) ∧ (e2-recog contains "my") ∧ (e1-asr-duration ≥ 4) then problematic
if (e1-salience-coverage ≤ 0.647) ∧ (e1-asr-duration ≥ 10.24) ∧ (e2-asr-duration ≥ 5.32) then problematic
Figure 12: A subset of rules learned by ripper when given the task-indept features for determining problematic dialogues

6.3 Examination of the Rulesets

A subset of the rules from the system that uses automatic features for Exchanges 1&2 is given in Figure 11 (row 3, Table 5). One observation from these hypotheses is the classifier's preference for the asr-duration feature over the feature for the number of words recognized (recog-numwords). One would expect longer utterances to be more difficult, but the learned rulesets indicate that duration is a better measure of utterance length than the number of words. Another observation is the usefulness of the slu confidence scores and the slu salience-coverage in predicting problematic dialogues. These features seem to provide good general indicators of the system's success in recognition and understanding. The fact that the main focus of the rules is detecting asr and slu errors and that none of the dm behaviors are used as predictors also indicates that, in all likelihood, the dm is performing as well as it can, given the noisy input that it is getting from asr and slu.
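To make the reading of such rulesets concrete, the sketch below encodes the first two rules of Figure 11 as conjunctions of tests and applies them to a feature dictionary. It is an illustration rather than ripper itself: the predict function and the RULES structure are hypothetical, reading the comparison symbols in the figure as "at most" and "at least" is an assumption, and falling back to tasksuccess when no rule fires is assumed from the fact that all listed rules predict problematic.

```python
import operator

OPS = {"<=": operator.le, ">=": operator.ge, "==": operator.eq}

# Two rules transcribed from Figure 11. Each rule is a conjunction of
# (feature, comparison, value) tests; the comparison directions are assumed.
RULES = [
    [("e2-salience-coverage", "<=", 0.7),
     ("e2-asr-duration", ">=", 0.04),
     ("e2-auto-SLU-success", "==", "no-recog")],
    [("e2-auto-SLU-success", "==", "rmismatch"),
     ("e2-sys-label", "==", "DIAL-FOR-ME"),
     ("e2-asr-duration", ">=", 3.12)],
]


def predict(features, rules=RULES, default="tasksuccess"):
    """Return 'problematic' if every test of some rule holds, else the default."""
    for rule in rules:
        if all(name in features and OPS[op](features[name], value)
               for name, op, value in rule):
            return "problematic"
    return default


# Hypothetical second-exchange features: a long utterance that the
# auto-SLU-success predictor flags as a mismatch labelled DIAL-FOR-ME.
long_mismatch = {
    "e2-salience-coverage": 0.0,
    "e2-asr-duration": 6.68,
    "e2-auto-SLU-success": "rmismatch",
    "e2-sys-label": "DIAL-FOR-ME",
}
print(predict(long_mismatch))
```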
An alternative explanation for the absence of dm features in these rules is that two utterances are not enough to provide meaningful dialogue features such as counts and percentages of reprompts, confirmations, etc.

In Figure 11, the top two rules use auto-SLU-success. The first rule basically states that if there is no recognition for the second exchange (as predicted by auto-SLU-success) then the dialogue will fail. The second rule is more interesting, as it states that if a misunderstanding has been predicted for the second exchange, the system label is DIAL-FOR-ME and the utterance is long, then the system will fail. In other words, the system frequently misinterprets long utterances as DIAL-FOR-ME, resulting in task failure.

Figure 12 gives a subset of the ruleset for the task-indept feature set for Exchanges 1&2. One can see a similarity between this ruleset and the one given in Figure 11. This is due to the fact that when all the automatic features are available, ripper has a tendency to pick out the more general task-independent ones, with the exception of sys-label. If one compares the second rule in both figures, one can see that ripper uses recog-numwords as a substitute for the task-specific feature sys-label.

6.4 Cross-validation Method vs. Hand-labelled-training Method

As mentioned above, an alternative to training the PDP on the automatically derived auto-SLU-success feature is to train it on the hand-labelled SLU-success while still testing it on the automatic feature. This second method is referred to as "hand-labelled-training" and the resulting feature is hlt-SLU-success. This may provide a more accurate model but it may not capture the characteristics of the automatic feature in the test set. Table 8 gives results for the two methods. One can see from this table that there is a slight, insignificant increase in accuracy for Exchange 1 and the whole dialogue using the hand-labelled-training method.
However, the totally automated method yields a better result (79.2% compared to 77.4%) for Exchanges 1&2, which, as mentioned above, is the most important result for these experiments. This increase shows a trend but is not significant (df=866, t=1.8, p=0.066). The final row of the table gives the results using the hand-labelled feature SLU-success in both training and testing and is taken as the topline result.

Features                  Exchange 1  Exchange 1&2  Whole
Baseline                  67.1        67.1          67.1
auto                      70.1        78.1          87.0
auto + hlt-SLU-success    70.4        77.4          86.2
auto + auto-SLU-success   69.6        79.2          84.9
auto + SLU-success        75.6        85.7          92.9
Table 8: Accuracy % results including hlt-SLU-success derived using the hand-labelled-training method

6.5 Feature Sets

It is interesting to examine what types of features are the most discriminatory in determining whether a dialogue is problematic or not. ripper was trained separately on sets of features based on the groups given in Figure 7, namely Acoustic/asr, slu, Dialogue and Hand-labelled (including SLU-success). These results are given in Table 9.

Features            Exchange 1  Exchange 1&2  Whole
Baseline            67.1        67.1          67.1
asr                 66.7        75.9          85.6
slu                 67.7        71.9          79.8
Dialogue            65.5        74.5          82.6
Hand-labelled       76.9        84.7          86.2
Auto-SLU-success    69.0        70.9          77.1
Hlt-SLU-success     69.0        74.1          77.2
Table 9: Accuracy % results for subsets of features

For Exchange 1, only the slu features, out of the automatic feature sets, yield an improvement over the baseline. Interestingly, training the system on the asr features yields the best result out of the automatic feature sets for Exchanges 1&2 and the whole dialogue. These systems, for example, use asr-duration, number of recognized words, and type of recognition grammar as features in their rulesets.

Finally, we give results for the system trained only on auto-SLU-success and hlt-SLU-success. One can see that there is not much difference in the two sets of results.
For Exchanges 1&2, the system trained on hlt-SLU-success has an accuracy which is significantly higher than the system trained on auto-SLU-success by a paired t-test (df=866, t=3.0, p=0.003). On examining the rulesets, one finds that the hlt-SLU-success ruleset uses rpartial-match where the auto-SLU-success ruleset does not. The lower accuracy may be due to the fact that the auto-SLU-success predictor has a low recall and precision for rpartial-match, as seen in Table 2.

6.6 Types of Problematic Dialogues

As mentioned in Section 2, there are 3 types of problematic dialogues: taskfailure, wizard and hangup. In order to determine whether some of these types of problematic dialogues are more difficult to predict than others, we conducted a post-hoc analysis of the proportion of prediction failures for each type of problematic dialogue. Since we were primarily interested in the performance of the PDP using the full automatic feature set, after having seen Exchanges 1&2, we conducted our analysis on this version of the PDP. Table 10 shows the distribution of the 4 types of dialogue in the test set and whether the Exchanges 1&2 PDP was able to predict correctly that the dialogue would be tasksuccess or problematic.

True Values   Predicted successful  Predicted problematic  Total
tasksuccess   94.1% (548)           5.9% (34)              67.1% (582)
taskfailure   68.5% (74)            31.5% (34)             12.5% (108)
wizard        41.3% (43)            58.7% (61)             12.0% (104)
hangup        35.6% (26)            64.4% (47)             8.4% (73)
Total         79.7% (691)           20.3% (176)            100% (867)
Table 10: Matrix of recognized tasksuccess and taskfailures

True Values   Predicted successful  Predicted problematic  Total
tasksuccess   61.1% (66)            38.9% (42)             50% (108)
taskfailure   21.3% (23)            78.6% (85)             50% (108)
Total         41.2% (89)            58.8% (127)            100% (216)
Table 11: Matrix of recognized tasksuccess and taskfailures using equal training and testing
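The per-type comparison can be recomputed from the raw counts in Table 10. The sketch below derives, for each type of problematic dialogue, the percentage that the Exchanges 1&2 PDP wrongly predicted as successful; the variable names are ours.

```python
# (predicted successful, predicted problematic) counts per true category,
# taken from Table 10.
table10 = {
    "tasksuccess": (548, 34),
    "taskfailure": (74, 34),
    "wizard": (43, 61),
    "hangup": (26, 47),
}

# Percentage of each problematic type that was missed, i.e. predicted successful.
missed = {}
for category, (pred_ok, pred_problem) in table10.items():
    if category != "tasksuccess":
        missed[category] = 100 * pred_ok / (pred_ok + pred_problem)

hardest = max(missed, key=missed.get)

for category, pct in missed.items():
    print(f"{category:12s} missed {pct:4.1f}%")
print(f"hardest type to predict: {hardest}")
```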
Table 10 shows that the worst performing category is taskfailure: the PDP incorrectly predicts that 68.5% of the taskfailure dialogues are tasksuccess. One reason that this might occur is that this sub-category of dialogues is much more difficult to predict, since in this case the hmihy system has no indication that it is not succeeding in the task. However, another possibility is that the PDP performs poorly on this category because there are fewer examples in the training set, although it does better on the hangup subset, which is about the same proportion. We can distinguish between these possibilities by examining how a learner performs when trained on equal proportions of tasksuccess and taskfailure dialogues. We conducted an experiment using a subset of tasksuccess dialogues in the same proportion as taskfailure for the training and the test set and trained a second PDP using the fully automatic Exchange 1&2 features. This resulted in a training set of 690 dialogues and a test set of 216. The binary classifier has an accuracy of 70%; the corresponding recognition matrix is presented in Table 11. The results show that fewer taskfailures are predicted as successful, suggesting that taskfailures are not inherently more difficult to predict than other classes of problematic dialogues. Below we discuss the potential of using ripper's loss ratio to weight different types of classification errors in future work.

7. Related Work

The research reported here is the first that we know of to automatically analyze a corpus of logs from a spoken dialogue system for the purpose of learning to predict problematic situations. This work builds on two strands of earlier research.
First, this approach was inspired by work on the paradise evaluation framework for spoken dialogue systems, which utilizes both multivariate linear regression and CART to predict user satisfaction as a function of a number of other metrics (Walker, Litman, Kamm, & Abella, 1997; Walker et al., 2000a). Research using paradise has found that task completion is always a major predictor of user satisfaction, and has examined predictors of task completion. Here, our goals are similar in that we attempt to understand the factors that predict task completion.

Secondly, this work builds on earlier research on learning to identify dialogues in which the user experienced poor speech recognizer performance (Litman et al., 1999). Because that work was based on features synthesized over the entire dialogue, the hypotheses that were learned could not be used for prediction during runtime. In addition, in contrast to the current study, the previous work automatically approximated the notion of a good or bad dialogue using a threshold on the percentage of recognition errors. There is a danger of this approach being circular when recognition performance at the utterance level is a primary predictor of a good or bad dialogue. In this work, the notion of a good (tasksuccess) and bad (problematic) dialogue was labelled by humans.

In previous work, Walker et al. (2000b) reported results from training a problematic dialogue predictor and noted the extent to which the hand-labelled SLU-success feature improves classifier performance. As a result of this prior analysis, in this work we report results from training an auto-SLU-success classifier for each exchange and using its predictions as an input feature to the Problematic Dialogue Predictor.
There are a number of previous studies on predicting recognition errors and user corrections which are related to the auto-SLU-success predictor that we report on here (Hirschberg et al., 1999; Hirschberg, Litman, & Swerts, 2000, 2001b; Levow, 1998; Litman, Hirschberg, & Swerts, 2000; Swerts, Litman, & Hirschberg, 2000).

Hirschberg et al. (1999) apply ripper to predict recognition errors in a corpus of 2067 utterances. In contrast to our work, they utilize prosodic features in combination with acoustic confidence scores. They report a best-classifier accuracy of 89%, which is a 14% improvement over their baseline of 74%. This result can be compared with our binary auto-SLU-success predictor (rcorrect vs. rincorrect) discussed in Section 5. Examination of the rules learned by their classifier suggests that durational features are important. While we do not use amplitude or F0 features, we do have an asr-duration feature which is logged by the recognizer. Without any of the other prosodic features, the auto-SLU-success predictor has an accuracy of 92.4%, a 29.4% improvement over the baseline of 63%. It is possible that including prosodic features in the auto-SLU-success predictor could improve this result even further.

Previous studies on error correction recognition are also related to our method of misunderstanding recognition. Levow (1998) applied similar techniques to learn to distinguish between utterances in which the user originally provided some information to the system, and corrections, which provided the same information a second time, following a misunderstanding. This may be more related to our research than it first appears since corrections are often misunderstood due to hyperarticulation. Levow's experiments train a decision tree using features such as duration, tempo, pitch, amplitude, and within-utterance pauses.
Examination of the trained tree in this study also reveals that the durational features are the most discriminatory. Similarly, in our experiments, RIPPER uses asr-duration frequently in the developed rule set. Levow obtains an accuracy rate of 75% with a baseline of 50%. Swerts et al. (2000) and Hirschberg et al. (2001b) perform similar studies for automatically identifying corrections using prosody, ASR features and dialogue context. Corrections are likely to be misrecognized, due to hyperarticulation. They observe that corrections that are more distant from the error they correct are more likely to exhibit prosodic differences. Their system automatically differentiates corrections from non-corrections with an error rate of 15.72%.

Dialogue context is used in the study by Hirschberg, Litman, and Swerts (2001a), who incorporate whether the user is aware of a mistake at the current utterance to help predict misunderstandings and misrecognition of the previous utterances. This study is similar to ours in that they use a predicted feature about an utterance (the 'aware' feature) to predict concept or word accuracy, as we use a predicted feature, auto-SLU-success, in the PDP. However, our auto-SLU-success feature is automatically available at the time the prediction is being made, whereas they are making the predictions retroactively. In addition, they train their system on the hand-labelled feature rather than the predicted one, which they leave as further work.

Kirchhoff (2001) performs error correction identification using task-independent acoustic and discourse variables. This is a two-way distinction between positive and negative error correction. She uses two cascaded classifiers; the first is a decision tree trained using 80% of the data and validating on 10%. Examples that have confidence scores below a threshold go into an exception training set for a second classifier.
During testing, if confidence scores are below a threshold then the utterance is passed on to the second classifier. She finds that the most discriminatory features are dialogue context (the type of previous system utterance) followed by lexical features, with prosodic features being the least discriminatory. The system recognizes error corrections with an accuracy of 90% compared to a baseline of 81.9%. In this study, Kirchhoff (2001) deliberately eschews the use of system-specific features, while in our work, we examine the separate contribution of different feature sets. Our results suggest that the use of more general features does not negatively impact performance.

Krahmer, Swerts, Theune, and Weegels (1999a, 1999b) look at different features related to responses to problematic system turns. The disconfirmations they discuss are responses to explicit or implicit system verification questions. They observe that disconfirmations are longer, have a marked word order, and contain specific lexical items such as "no". In addition, there are specific prosodic cues such as boundary tones and pauses. Some of these features, such as length and choice of words, are captured in our RIPPER ruleset as discussed above.

As described in Section 5, two methodologies were compared for incorporating the feature SLU-success into the PDP. The first was to use the hand-labelled feature in the training set, the second to perform separate experiments to predict the feature for the training set. As the features in the training set are automatically predicted, it is hoped that the system would pick up the idiosyncrasies of the noisy data. This training method has been used previously in (Wright, 2000), where automatically identified intonation event features are used to train an automatic speech-act detector.
These automatically derived features provide a better training model than the hand-labelled ones. This is true also in the current study as discussed in Section 6.1.

8. Discussion and Future Work

This paper reports results on automatically training a Problematic Dialogue Predictor to predict problematic human-computer dialogues using a corpus of 4692 dialogues collected with the How May I Help You spoken dialogue system. The Problematic Dialogue Predictor can be immediately applied to the system's decision of whether to transfer the call to a human customer care agent, or be used as a cue to the system's Dialogue Manager to modify its behavior to repair the problems identified. The results show that: (1) Most feature sets significantly improve over the baseline; (2) Using automatic features from the whole dialogue, we can identify problematic dialogues 20% better than the baseline; (3) Just the first exchange provides significantly better prediction (3%) than the baseline; (4) The second exchange provides an additional significant (13%) improvement; (5) A classifier based on task-independent automatic features performs slightly better than one trained on the full automatic feature set.

The improved ability to predict problematic dialogues is important for fielding the HMIHY system without the need for the oversight of a human customer care agent. These results are promising and we expect to be able to improve upon them, possibly by incorporating prosody into the feature set (Hirschberg et al., 1999) or expanding on the SLU feature sets. In addition, the results suggest that the current PDP is likely to generalize to other dialogue systems.

In future work, we plan to integrate the learned rulesets into the HMIHY dialogue system and evaluate the impact that this would have on the system's overall performance. There are several ways we might be able to show this.
Recall that one use of the PDP is to improve the system's decision of whether and when to transfer a call to the human customer care agent. The other use would be as input to the Dialogue Manager's dialogue strategy selection mechanism. Demonstrating the utility of the PDP for dialogue strategy selection requires experiments that test out several different ways that this information could be used by the Dialogue Manager. Demonstrating the utility of the PDP on the decision to transfer a call necessarily involves examining the tradeoffs among different kinds of errors. This is because every call that the HMIHY system can handle successfully saves a company the cost of using a human customer care agent to handle the call. Thus, we can associate this cost with the decision that HMIHY makes to transfer the call. When HMIHY transfers the call unnecessarily, we call this cost the lost automation cost. On the other hand, every call that HMIHY attempts to handle and fails would potentially accrue a different cost, namely the lost revenue from customers who become irritated with faulty customer service and take their business elsewhere. We call this cost the system failure cost. In the results that we presented here, we report only overall accuracy results and treat lost automation cost and system failure cost as equally costly. However, in any particular installation of the HMIHY system, there may be differences between these costs that would need to be accounted for in the training of the PDP. It would be possible to use RIPPER to do this, if these costs were known, by using its ability to vary the loss ratio.

Another potential issue for future work is the utility of a dialogue level predictor, e.g. the PDP, vs. an utterance level predictor, e.g. the auto-SLU-success predictor, for the goal of automatically adapting a system's dialogue strategy.
A dialogue level predictor is shown to be effective for this purpose by Litman and Pan (2000), who use a problematic dialogue detector to adapt the dialogue strategy for a train enquiry system. It is possible, as others have argued (Levow, 1998; Hirschberg et al., 1999; Kirchhoff, 2001), that the dialogue manager's adaptation decisions can be made on the basis of local behavior, i.e. on the basis of recognizing that the current utterance has been misunderstood, or that the current utterance is a correction. However, it is clear that the decision to transfer the call to a human customer care agent cannot be made on the basis of only local information, because the system can often recover from a single error. Thus, we expect that the ability to predict the dialogue outcome as we do here will continue to be important even in systems that use local predictors for understanding and correction.

9. Acknowledgments

Thanks to Ron Prass, Diane Litman, Richard Sutton, Mazin Rahim and Michael Kearns for discussions on various aspects of this work.

References

Abella, A., & Gorin, A. (1999). Construct algebra: An analytical method for dialog management. In Proceedings of Thirty Seventh Annual Meeting of the Association for Computational Linguistics.

Ammicht, E., Gorin, A., & Alonso, T. (1999). Knowledge collection for natural language spoken dialog systems. In Proceedings of the European Conference on Speech Communication and Technology.

Baggia, P., Castagneri, G., & Danieli, M. (1998). Field Trials of the Italian ARISE Train Timetable System. In Interactive Voice Technology for Telecommunications Applications, IVTTA, pp. 97-102.

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Wadsworth and Brooks, Monterey California.

Catlett, J. (1991). Megainduction: A test flight. In Proceedings of the Eighth International Conference on Machine Learning.

Chu-Carroll, J., & Carpenter, B. (1999). Vector-based natural language call routing. Computational Linguistics, 25 (3), 361-387.

Cohen, W. (1995). Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning.

Cohen, W. (1996). Learning trees and rules with set-valued features. In Fourteenth Conference of the American Association of Artificial Intelligence.

Fürnkranz, J., & Widmer, G. (1994). Incremental reduced error pruning. In Proceedings of the Eleventh National Conference on Machine Learning.

Gorin, A., Riccardi, G., & Wright, J. (1997). How May I Help You? Speech Communication, 23, 113-127.

Hirschberg, J. B., Litman, D. J., & Swerts, M. (1999). Prosodic cues to recognition errors. In Proc. of the Automatic Speech Recognition and Understanding Workshop.

Hirschberg, J. B., Litman, D. J., & Swerts, M. (2000). Generalizing prosodic prediction of speech recognition errors. In Proceedings of the 6th International Conference of Spoken Language Processing (ICSLP-2000).

Hirschberg, J. B., Litman, D. J., & Swerts, M. (2001a). Detecting misrecognitions and corrections in spoken dialogue systems from 'aware' sites. In Proceedings of the Workshop on Prosody in Speech Recognition and Understanding.

Hirschberg, J. B., Litman, D. J., & Swerts, M. (2001b). Identifying user corrections automatically in spoken dialogue system. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics.

Kirchhoff, K. (2001). A comparison of classification techniques for the automatic detection of error corrections in human-computer dialogues. In Proceedings of the North American Meeting of the NAACL Workshop on Adaptation in Dialogue Systems.

Krahmer, E., Swerts, M., Theune, M., & Weegels, M. (1999a). Problem spotting in human-machine interaction. In Proc. Eurospeech 99.

Krahmer, E., Swerts, M., Theune, M., & Weegels, M. (1999b). Prosodic correlates of disconfirmations. In ESCA Workshop on Interactive Dialogue in Multi-Modal Systems.

Langkilde, I., Walker, M. A., Wright, J., Gorin, A., & Litman, D. (1999). Automatic prediction of problematic human-computer dialogues in How May I Help You? In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU99.

Levow, G.-A. (1998). Characterizing and recognizing spoken corrections in human-computer dialogue. In Proceedings of the 36th Annual Meeting of the Association of Computational Linguistics, pp. 736-742.

Litman, D. J., Hirschberg, J. B., & Swerts, M. (2000). Predicting automatic speech recognition performance using prosodic cues. In Proceedings of the First Meeting of the North American Chapter of the Association for Computational Linguistics.

Litman, D. J., & Pan, S. (2000). Predicting and adapting to poor speech recognition in a spoken dialogue system. In Proc. of the Seventeenth National Conference on Artificial Intelligence, AAAI-2000.

Litman, D. J., Walker, M. A., & Kearns, M. J. (1999). Automatic detection of poor speech recognition at the dialogue level. In Proceedings of the Thirty Seventh Annual Meeting of the Association of Computational Linguistics, pp. 309-316.

Riccardi, G., & Gorin, A. (2000). Spoken language adaptation over time and state in a natural spoken dialog system. IEEE Transactions on Speech and Audio Processing, 8 (1), 3-10.

Sanderman, A., Sturm, J., den Os, E., Boves, L., & Cremers, A. (1998). Evaluation of the Dutch train timetable information system developed in the ARISE project. In Interactive Voice Technology for Telecommunications Applications, IVTTA, pp. 91-96.

Seneff, S., Zue, V., Polifroni, J., Pao, C., Hetherington, L., Goddeau, D., & Glass, J. (1995). The preliminary development of a displayless PEGASUS system. In ARPA Spoken Language Technology Workshop.

Shriberg, E., Wade, E., & Price, P. (1992). Human-machine problem solving using spoken language systems (SLS): Factors affecting performance and user satisfaction. In Proceedings of the DARPA Speech and NL Workshop, pp. 49-54.

Swerts, M., Litman, D. J., & Hirschberg, J. B. (2000). Corrections in spoken dialogue systems. In Proceedings of the 6th International Conference of Spoken Language Processing (ICSLP-2000).

Walker, M. A., Fromer, J. C., & Narayanan, S. (1998). Learning optimal dialogue strategies: A case study of a spoken dialogue agent for email. In Proceedings of the 36th Annual Meeting of the Association of Computational Linguistics, COLING/ACL 98, pp. 1345-1352.

Walker, M. A., Kamm, C. A., & Litman, D. J. (2000a). Towards developing general models of usability with PARADISE. In Natural Language Engineering: Special Issue on Best Practice in Spoken Dialogue Systems.

Walker, M. A., Langkilde, I., Wright, J., Gorin, A., & Litman, D. (2000b). Learning to Predict Problematic Situations in a Spoken Dialogue System: Experiments with How May I Help You? In Proceedings of the North American Meeting of the Association for Computational Linguistics.

Walker, M. A., Litman, D., Kamm, C. A., & Abella, A. (1997). PARADISE: A general framework for evaluating spoken dialogue agents. In Proceedings of the 35th Annual Meeting of the Association of Computational Linguistics, ACL/EACL 97, pp. 271-280.

Walker, M. A., Wright, J., & Langkilde, I. (2000c). Using natural language processing and discourse features to identify understanding errors in a spoken dialogue system. In Proceedings of the Seventeenth International Conference on Machine Learning.

Weiss, S. M., & Kulikowski, C. (1991). Computer Systems That Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. San Mateo, CA: Morgan Kaufmann.

Wright, H. (2000). Modelling Prosodic and Dialogue Information for Automatic Speech Recognition. Ph.D. thesis, University of Edinburgh.