Journal of Artificial Intelligence Research 18 (2003) 491-516. Submitted 11/02; published 6/03
Acquiring Correct Knowledge for Natural Language Generation
Ehud Reiter ereiter@csd.abdn.ac.uk
Somayajulu G. Sripada ssripada@csd.abdn.ac.uk
Department of Computing Science, University of Aberdeen, Aberdeen AB24 3UE, UK
Roma Robertson roma.robertson@ed.ac.uk
Division of Community Health Sciences - General Practice Section,
University of Edinburgh, Edinburgh EH8 9DX, UK
Abstract
Natural language generation (NLG) systems are computer software systems that
produce texts in English and other human languages, often from non-linguistic
input data. NLG systems, like most AI systems, need substantial amounts of
knowledge. However, our experience in two NLG projects suggests that it is
difficult to acquire correct knowledge for NLG systems; indeed, every knowledge
acquisition (KA) technique we tried had significant problems. In general terms,
these problems were due to the complexity, novelty, and poorly understood nature
of the tasks our systems attempted, and were worsened by the fact that people
write so differently. This meant in particular that corpus-based KA approaches
suffered because it was impossible to assemble a sizable corpus of high-quality
consistent manually written texts in our domains; and structured expert-oriented
KA techniques suffered because experts disagreed and because we could not get
enough information about special and unusual cases to build robust systems. We
believe that such problems are likely to affect many other NLG systems as well.
In the long term, we hope that new KA techniques may emerge to help NLG system
builders. In the shorter term, we believe that understanding how individual KA
techniques can fail, and using a mixture of different KA techniques with
different strengths and weaknesses, can help developers acquire NLG knowledge
that is mostly correct.
1. Introduction
Natural language generation (NLG) systems use artificial intelligence (AI) and
natural language processing techniques to automatically generate texts in
English and other human languages, typically from some non-linguistic input data
(Reiter & Dale, 2000). As with most AI systems, an essential part of building an
NLG system is knowledge acquisition (KA), that is acquiring relevant knowledge
about the domain, the users, the language used in the texts, and so forth.
©2003 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved.
KA for NLG can be based on structured expert-oriented techniques, such as
think-aloud protocols and sorting, or on machine learning and corpus analysis,
which are currently very popular in other areas of Natural Language Processing.
We have used both types of techniques in two NLG projects that included
significant KA efforts: STOP (Reiter, Robertson, & Osman, 2003), which generated
tailored smoking cessation letters, and SumTime-Mousam (Sripada, Reiter, Hunter,
Yu, & Davy, 2001), which generated weather forecasts. In both projects, and for
all techniques tried, the main problem turned out to be knowledge quality;
evaluation and validation exercises identified flaws in the knowledge acquired
using every technique. The flaws were due to a variety of factors, but perhaps
the basic underlying reason for them was the nature of the writing tasks we were
attempting to automate. They were:
* complex (as are many tasks that involve interacting with humans): hence a lot
  of knowledge was needed to cover the numerous special cases and unusual
  circumstances;
* sometimes novel (not done by humans): hence sometimes there were no experts at
  the task as a whole, and no existing corpora of texts to analyse;
* poorly understood: hence we did not have good theoretical models to structure
  the knowledge being acquired, and fill in gaps in the knowledge acquired from
  experts or corpora; and
* ambiguous (allowed multiple solutions): hence different experts and corpus
  authors produced very different texts (solutions) from the same input data.
These problems of course occur to some degree in KA for other expert system and
natural language processing tasks, but we believe they may be especially severe
for NLG.
We do not have a good solution for these problems, and indeed believe that KA is
one of the biggest problems in applied NLG. After all, there is no point in
using AI techniques to build a text-generation system if we cannot acquire the
knowledge needed by the AI techniques.
In the longer term, more basic research into KA for NLG is badly needed. In the
shorter term, however, we believe that developers are more likely to acquire
correct knowledge when building an NLG system if they understand likely types of
errors in the knowledge acquired from different KA techniques. Also, to some
degree the different KA techniques we have tried have complementary strengths
and weaknesses; this suggests using a variety of different techniques, so that
the weaknesses of one technique are compensated for by the strengths of other
techniques.
In the remainder of this paper we give background information on NLG, KA, and
our systems; describe the various KA techniques we used to build our systems and
the problems we encountered; and then discuss more generally why KA for NLG is
difficult and how different KA techniques can be combined.
2. Background
In this section we give some background information on natural language
generation and knowledge acquisition and validation. We also introduce and
briefly describe the STOP and SumTime-Mousam systems.
2.1 Natural Language Generation
Natural Language Generation is the subfield of artificial intelligence that is
concerned with automatically generating written texts in human languages, often
from non-linguistic input data. NLG systems often have three stages (Reiter &
Dale, 2000):
* Document Planning decides on the content and structure of the generated text;
  for example that a smoking-cessation letter should start with a section that
  discusses the pros and cons of smoking.
* Microplanning decides on how information and structure should be expressed
  linguistically; for example, that the phrase "by mid afternoon" should be used
  in a weather report to refer to the time 1500.
* Surface Realisation generates an actual text according to the decisions made
  in previous stages, ensuring that the text conforms to the grammar of the
  target language (English in our systems).
NLG systems require many types of knowledge in order to carry out these tasks.
In particular, Kittredge, Korelsky, and Rambow (1991) point out that NLG systems
need domain knowledge (similar to that needed by expert systems), communication
knowledge (similar to that needed by other Natural Language Processing systems),
and also domain communication knowledge (DCK). DCK is knowledge about how
information in a domain is usually communicated, including standard document
structures, sublanguage grammars, and specialised lexicons. DCK plays a role in
all aspects of language technology (for example, a speech recogniser will work
better in a given domain if it is trained on a corpus of texts from that
domain), but it may be especially important in NLG.
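The three stages listed above can be illustrated with a toy pipeline. This is only a minimal sketch under our own assumptions, not the architecture of STOP or SumTime-Mousam: the `Message` type, the function names, the `TIME_PHRASES` lexicon, and the choice to report every change of wind direction are all hypothetical. Only the mappings of 1500 to "by mid afternoon" and 1800 to "by early evening" echo examples given in this paper.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """A unit of content selected by document planning (hypothetical type)."""
    value: str  # new wind direction, e.g. "SW"
    time: int   # 24-hour clock time, e.g. 1500


# Hypothetical lexicon; only these two entries echo examples in the text.
TIME_PHRASES = {1500: "by mid afternoon", 1800: "by early evening"}


def document_planning(readings):
    """Decide content: keep one message per change of wind direction."""
    messages = []
    for (_, prev_dir), (time, direction) in zip(readings, readings[1:]):
        if direction != prev_dir:
            messages.append(Message(direction, time))
    return messages


def microplanning(messages):
    """Decide expression: pick a time phrase for each message."""
    return [(m.value, TIME_PHRASES.get(m.time, f"by {m.time:04d}")) for m in messages]


def surface_realisation(initial_direction, phrase_specs):
    """Produce the final string from the microplanner's decisions."""
    if not phrase_specs:
        return f"{initial_direction}."
    changes = " and ".join(f"{value} {phrase}" for value, phrase in phrase_specs)
    return f"{initial_direction} becoming {changes}."


def generate(readings):
    """Run the three stages in sequence on (time, direction) readings."""
    messages = document_planning(readings)
    return surface_realisation(readings[0][1], microplanning(messages))


print(generate([(1200, "WSW"), (1500, "SW")]))
```

Each stage needs a different kind of knowledge: the content rule in `document_planning` is a stand-in for domain knowledge, while the time lexicon is a small example of domain communication knowledge.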
2.2 Knowledge Acquisition and Validation
Knowledge acquisition is the subfield of artificial intelligence that is
concerned with acquiring the knowledge needed to build AI systems. Broadly
speaking the two most common types of KA techniques are:
* Techniques based on working with experts in a structured fashion, such as
  structured interviews, think-aloud protocols, sorting, and laddered grids
  (Scott, Clayton, & Gibson, 1991; Buchanan & Wilkins, 1993); and
* Techniques based on learning from data sets of correct solutions (such as text
  corpora); these are currently very popular in natural language processing and
  used for many different types of knowledge, ranging from grammar rules to
  discourse models (for an overview, see Jurafsky & Martin, 2000).
There are of course other possible KA techniques as well, including directly
asking experts for knowledge, and conducting scientific experiments. Some
research has been done on evaluating and comparing KA techniques, but such
research can be difficult to interpret because of methodological problems
(Shadbolt, O'Hara, & Crow, 1999).
Research has also been done on verifying and validating knowledge to check that
it is correct (Adelman & Riedel, 1997). Verification techniques focus on
detecting logical anomalies and inconsistencies that often reflect mistakes in
the elicitation or coding process; we will not further discuss these, as such
errors are not our primary concern in this paper. Validation techniques focus on
detecting whether the knowledge acquired is indeed correct and will enable the
construction of a good system; these are very relevant to efforts to detect
problems in knowledge acquired for NLG. Adelman and Riedel (1997) describe two
general types of validation techniques: (1) having experts check the acquired
knowledge and built systems, and (2) using a library of test cases with known
inputs and outputs. In other words, just as knowledge can be acquired from
experts or from data sets of correct solutions, knowledge can also be validated
by experts or by data sets of correct solutions. Knowledge can also be validated
experimentally, by determining if the system as a whole works and has the
intended effect on its users. Of course care must be taken that the validation
process uses different resources than the acquisition process. For example,
knowledge acquired from an expert should not be validated by that expert, and
knowledge learned from a data set should not be validated by that data set.
There has not been a great deal of previous research on knowledge acquisition
for NLG; Reiter, Robertson, and Osman (2000) summarise previous efforts in this
area. Generally corpus analysis (analysis of collections of manually written
texts) has been the most popular KA technique for NLG, as in other areas of
Natural Language Processing, although sometimes it is supplemented by
expert-oriented techniques (Goldberg, Driedger, & Kittredge, 1994; McKeown,
Kukich, & Shaw, 1994). Walker, Rambow, and Rogati (2002) have attempted to learn
NLG rules from user ratings of generated texts, which can perhaps be considered
a type of experiment-based KA.
2.3 STOP
STOP (Reiter, Robertson, & Osman, 2003) is an NLG system that generates tailored
smoking-cessation letters. Tailoring is based on a 4-page multiple-choice
questionnaire about the smoker's habits, health, concerns, and so forth. An
extract from a questionnaire is shown in Figure 1, and an extract from the STOP
letter generated from this questionnaire is shown in Figure 2 (we have changed
the name of the smoker to preserve confidentiality).
From a KA perspective, the most important knowledge needed in STOP is what
content and phrasing is appropriate for an individual smoker; for example,
* What information should be given in a letter? The example letter in Figure 2,
  for instance, emphasises things the smoker dislikes about smoking, confidence
  building, and dealing with stress and weight gain; but it does not recommend
  specific techniques for stopping smoking.
* Should a letter adopt a positive 'you'll feel better if you stop' tone (as
  done in the letter in Figure 2), or should it adopt a negative 'smoking is
  killing you' tone?
STOP was never operationally deployed, but it was tested with real smokers in a
clinical trial, during which 857 smokers received STOP letters (Lennox, Osman,
Reiter, Robertson, Friend, McCann, Skatun, & Donnan, 2001). This evaluation,
incidentally, showed that STOP letters were no more effective than control
non-tailored letters.
[Figure 1: First page of example smoker questionnaire]
Smoking Information for Heather Stewart
You have good reasons to stop...
People stop smoking when they really want to stop. It is encouraging that you
have many good reasons for stopping. The scales show the good and bad things
about smoking for you. They are tipped in your favour.
You could do it...
Most people who really want to stop eventually succeed. In fact, 10 million
people in Britain have stopped smoking - and stayed stopped - in the last 15
years. Many of them found it much easier than they expected.
Although you don't feel confident that you would be able to stop if you were to
try, you have several things in your favour.
* You have stopped before for more than a month.
* You have good reasons for stopping smoking.
* You expect support from your family, your friends, and your workmates.
We know that all of these make it more likely that you will be able to stop.
Most people who stop smoking for good have more than one attempt.
Overcoming your barriers to stopping...
You said in your questionnaire that you might find it difficult to stop because
smoking helps you cope with stress. Many people think that cigarettes help them
cope with stress. However, taking a cigarette only makes you feel better for a
short while. Most ex-smokers feel calmer and more in control than they did when
they were smoking. There are some ideas about coping with stress on the back
page of this leaflet.
You also said that you might find it difficult to stop because you would put on
weight. A few people do put on some weight. If you did stop smoking, your
appetite would improve and you would taste your food much better. Because of
this it would be wise to plan in advance so that you're not reaching for the
biscuit tin all the time. Remember that putting on weight is an overeating
problem, not a no-smoking one. You can tackle it later with diet and exercise.
And finally...
We hope this letter will help you feel more confident about giving up
cigarettes. If you have a go, you have a real chance of succeeding.
With best wishes,
The Health Centre.
THINGS YOU LIKE: it's relaxing; it stops stress; you enjoy it; it relieves
boredom; it stops weight gain; it stops you craving
THINGS YOU DISLIKE: it makes you less fit; it's a bad example for kids; you're
addicted; it's unpleasant for others; other people disapprove; it's a smelly
habit; it's bad for you; it's expensive; it's bad for others' health
Figure 2: Extract from letter generated from Figure 1 questionnaire
day       hour  wind direction  wind speed (10m altitude)  wind speed (50m alt)
12-06-02  6     WSW             10                         12
12-06-02  9     WSW             9                          11
12-06-02  12    WSW             7                          9
12-06-02  15    WSW             7                          9
12-06-02  18    SW              7                          9
12-06-02  21    SSW             8                          10
13-06-02  0     SSW             10                         12
Figure 3: Wind data extract from 12-Jun-2002 numerical weather prediction
Knowledge acquisition in STOP was primarily based on structured expert-oriented
KA techniques, including in particular sorting and think-aloud protocols.
Knowledge was acquired from five health professionals: three doctors, a nurse,
and a health psychologist. These experts were knowledgeable about smoking and
about patient information, but they were not experts on writing tailored
smoking-cessation letters. In fact there are no experts at this task, since no
one manually writes tailored smoking-cessation letters. It is not unusual for an
NLG system to attempt a task which is not currently performed by human experts;
other examples include descriptions of software models (Lavoie, Rambow, &
Reiter, 1997), customised descriptions of museum items (Oberlander, O'Donnell,
Knott, & Mellish, 1998), and written feedback for adult literacy students
(Williams, Reiter, & Osman, 2003).
Knowledge validation in STOP was mostly based on feedback from users (smokers),
and on the results of the clinical trial.
2.4 SumTime-Mousam
SumTime-Mousam (Sripada, Reiter, Hunter, & Yu, 2002) is an NLG system that
generates marine weather forecasts for offshore oil rigs, from numerical weather
simulation data. An extract from SumTime-Mousam's input data is shown in Figure
3, and an extract from the forecast generated from this data is shown in Figure
4.
From a KA perspective, the main knowledge needed by SumTime-Mousam was again
what content and expression was best for users; for example,
* What changes in a meteorological parameter are significant enough to be
  reported in the text? The forecast in Figure 4, for example, mentions changes
  in wind direction but not changes in wind speed.
* What words and phrases should be used to communicate time? For example, should
  1800 be described as "early evening" (as in Figure 4) or as "late afternoon"?
SumTime-Mousam is currently being used operationally by a meteorological
company, to generate draft forecasts which are post-edited by human forecasters.
Knowledge acquisition in SumTime-Mousam was based on both corpus analysis of
manually-written forecasts and structured KA with expert meteorologists. Unlike
the experts we worked with in STOP, the meteorologists we worked with in
SumTime-Mousam were experienced at writing the target texts (weather forecasts).
The forecast corpus included the numerical weather simulation data that the
forecasters used when writing the forecasts, as well as the actual forecast
texts (Sripada, Reiter, Hunter, & Yu, 2003).
    FORECAST 6 - 24 GMT, Wed 12-Jun 2002
    WIND(KTS)
      10M: WSW 8-13 gradually backing SW by early evening and SSW by midnight.
      50M: WSW 10-15 gradually backing SW by early evening and SSW by midnight.
    WAVES(M)
      SIG HT: 0.5-1.0 mainly SW swell.
      MAX HT: 1.0-1.5 mainly SW swell.
    PER(SEC)
      WAVE PERIOD: Wind wave 3-5 mainly 6 second SW swell.
      WINDWAVE PERIOD: 3-5.
      SWELL PERIOD: 5-7.
    WEATHER: Partly cloudy becoming overcast with light rain around midnight.
    VIS(NM): Greater than 10 reduced to 5-8 in precipitation.
    AIR TEMP(C): 8-10 rising 9-11 around midnight.
    CLOUD(OKTAS/FT): 2-4 CU/SC 1300-1800 lowering 7-8 ST/SC 700-900 around
      midnight.
Figure 4: Extract from forecast generated for 12-Jun-2002
Knowledge validation in SumTime-Mousam has mostly been conducted by checking
knowledge acquired from the corpus with the experts, and checking knowledge
acquired from the experts against the corpus. In other words, we have tried to
make the validation technique as different as possible from the acquisition
technique. We are currently evaluating SumTime-Mousam as a system by measuring
the number of edits that forecasters make to the computer-generated draft
forecasts.
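One simple way to count post-edits is a word-level edit distance between the draft and the post-edited forecast. The sketch below is an illustration under our own assumptions: this section does not say how edits are counted in the evaluation, and the function name `count_edits`, the whitespace tokenisation, and the unit cost per inserted, deleted, or substituted word are all hypothetical choices.

```python
def count_edits(draft: str, final: str) -> int:
    """Word-level Levenshtein distance between a draft and its edited version.

    Assumption: every inserted, deleted, or replaced word counts as one edit.
    """
    a, b = draft.split(), final.split()
    # previous[j] holds the distance between the words of a seen so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        current = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            current.append(min(previous[j] + 1,          # delete word_a
                               current[j - 1] + 1,       # insert word_b
                               previous[j - 1] + cost))  # keep or substitute
        previous = current
    return previous[-1]


draft = "WSW 8-13 gradually backing SW by early evening"
edited = "WSW 8-13 gradually backing SW by late afternoon"
print(count_edits(draft, edited))
```

A raw count like this treats all edits alike; in practice one would probably also want to inspect which words forecasters change, since a changed time phrase and a changed wind speed are different kinds of correction.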
3. Knowledge Acquisition Techniques Tried
In this section we summarise the main KA techniques we used in STOP and
SumTime-Mousam. For each technique we give an example of the knowledge acquired,
and discuss what we learned when we tried to validate the knowledge. Table 1
gives a very high level overview of the major advantages and disadvantages of
the different techniques we tried, when the different techniques were perhaps
most useful, and what types of knowledge they were best suited to acquiring
(using the classification of Section 2.1). As this table shows, no one technique
is clearly best; they all have different strengths and weaknesses. Probably the
best overall KA strategy is to use a mix of different techniques; we will
further discuss this in Section 5.
Techniques                  Advantages                     Disadvantages                                         When Useful                Types of Knowledge
directly ask experts        get big picture                many gaps, may not match practice                     initial prototype          domain, DCK
structured KA with experts  get details, get rationale     limited coverage, experts variable                    flesh out prototype        depends on expert
corpus analysis             get lots of knowledge quickly  hard to create, texts inconsistent, poor models for NLG  robustness, unusual cases  DCK, communication
expert revision             fix problems in knowledge      local optimisation, not major changes                 improve system             all
Table 1: Summary Evaluation of KA techniques for NLG
3.1 Directly Asking Experts for Knowledge
The simplest and perhaps most obvious KA technique for NLG is to simply ask
experts how to write the texts in question. In both STOP and SumTime-Mousam,
experts initially gave us spreadsheets or flowcharts describing how they thought
texts should be generated. In both projects, it also turned out that the
experts' description of how texts should be generated did not in fact match how
people actually wrote the texts in question. This is a common finding in KA, and
it is partially due to the fact that it is difficult for experts to
introspectively examine the knowledge they use in practice (Anderson, 1995);
this is why proponents of expert-oriented KA prefer structured KA techniques.
For example, at the beginning of SumTime-Mousam, one of the meteorologists gave
us a spreadsheet which he had designed, which essentially encoded how he thought
some parts of weather forecasts should be generated (the spreadsheet did not
generate a complete weather forecast). We analysed the logic used in the
spreadsheet, and largely based the first version of SumTime-Mousam on this
logic.
One goal of our analysis was to create an algorithm that could decide when a
change in a parameter value was significant enough so that it should be
mentioned in the weather report. The spreadsheet used context-dependent change
thresholds to make this decision. For example, a change in the wind speed would
be mentioned if
* the change was 10 knots or more, and the final wind speed was 15 knots or
  less;
* the change was 5 knots or more, and the final wind speed was between 15 and 40
  knots; or
* the change was 10 knots or more, and the final wind speed was over 40 knots.
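The three threshold rules above translate directly into a small function. This is a minimal sketch rather than the spreadsheet's actual logic: the function name is hypothetical, we assume the rules apply to the magnitude of the change (increase or decrease), and we assume a final speed of exactly 15 knots falls in the lowest band and exactly 40 knots in the middle band, as the wording "15 knots or less" and "over 40 knots" suggests.

```python
def should_mention_wind_speed_change(change_knots: float,
                                     final_speed_knots: float) -> bool:
    """Return True if a wind speed change passes the context-dependent threshold.

    Assumptions (not stated in the text): the sign of the change is ignored,
    and the band boundaries are 15 (inclusive, low band) and 40 (inclusive,
    middle band).
    """
    if final_speed_knots <= 15:
        threshold = 10  # light winds: only large changes matter
    elif final_speed_knots <= 40:
        threshold = 5   # in-between speeds: moderate changes matter
    else:
        threshold = 10  # very heavy winds: only large changes matter
    return abs(change_knots) >= threshold


# A 5 knot change is reported only when the final speed is in the middle band.
for final_speed in (12, 20, 45):
    print(final_speed, should_mention_wind_speed_change(5, final_speed))
```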
The context-dependent thresholds reflect the usage of the weather reports by the
users (in this case, oil company staff making decisions related to North Sea
offshore oil rigs). For example, if a user is deciding how to unload a supply
boat, moderate changes in wind speed don't matter at low speeds (because light
winds have minimal impact on supply boat operations) and at high speeds (because
the boat won't even attempt to unload in very heavy winds), but may affect
decisions at in-between speeds. The context-dependent thresholds would be
expected to vary according to the specific forecast recipient, and should be set
in consultation with the recipient.
From our perspective, there were two main pieces of knowledge encoded in this
algorithm:
1. The absolute size of a change determines whether it should be mentioned or
   not, and
2. The threshold for significance depends on the context and ultimately on how
   the user will use the information.
3.1.1 Validation of Direct Expert Knowledge
We checked these rules by comparing them to what we observed in our corpus
analysis of manually written forecasts (Section 3.3). This suggested that while
(2) above is probably correct, (1) may be incorrect. In particular, a linear
segmentation model (Sripada et al., 2002), which basically looks at changes in
slope rather than changes in the absolute value of a parameter, better matches
the corpus texts. The expert who designed the spreadsheet model agreed that
segmentation was probably a better approach. He also essentially commented that
one reason for his use of the absolute size model was that this was something
that was easily comprehensible to someone who was neither a programmer nor an
expert at numerical data analysis techniques.
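To make the contrast with absolute-size thresholds concrete, the following sketch splits a time series into straight-line segments, so that reportable events correspond to points where the slope changes. It is a generic sliding-window piecewise-linear approximation chosen for illustration, not the segmentation model of Sripada et al. (2002); the function name `segment`, the `max_error` tolerance, and its value are assumptions. The sample data are the 10m wind speeds from Figure 3, with midnight written as hour 24 so that times increase.

```python
def segment(times, values, max_error):
    """Split a series into straight-line segments (sliding-window sketch).

    Returns a list of (start_index, end_index) pairs. A segment is extended
    for as long as the straight line joining its two end points stays within
    max_error of every sample it spans.
    """
    def fits(start, end):
        t0, v0 = times[start], values[start]
        slope = (values[end] - v0) / (times[end] - t0)
        return all(abs(v0 + slope * (times[i] - t0) - values[i]) <= max_error
                   for i in range(start + 1, end))

    segments = []
    start = 0
    while start < len(values) - 1:
        end = start + 1
        # Grow the segment one sample at a time while the line still fits.
        while end + 1 < len(values) and fits(start, end + 1):
            end += 1
        segments.append((start, end))
        start = end
    return segments


hours = [6, 9, 12, 15, 18, 21, 24]
speeds = [10, 9, 7, 7, 7, 8, 10]  # 10m wind speed in knots, from Figure 3
print(segment(hours, speeds, max_error=0.8))
```

With this tolerance the series is described as a fall, a roughly flat stretch, and a rise, even though no single step between samples is large; an absolute-size threshold applied to the same data would report nothing.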
In other words, in addition to problems in introspecting knowledge, it also
perhaps is not reasonable to expect a domain expert to be able to write a
sophisticated data analysis algorithm based on his expertise. This is not an
issue if the knowledge needed is purely declarative, as it is in many AI
applications; but if we need procedural or algorithmic knowledge, we must bear
in mind that domain experts may not have sufficient computational expertise to
express their knowledge as a computer algorithm.
3.1.2 Role of Directly Asking Experts for Knowledge
Although the expert's spreadsheet in SumTime-Mousam was far from ideal, it was
extremely useful as a starting point. It specified an initial system which we
could build fairly easily, and which produced at least vaguely plausible output.
Much the same in fact happened in STOP, when one of the doctors gave us a
flowchart which certainly had many weaknesses, but which was useful as an
initial specification of a relatively easy-to-build and somewhat plausible
system. In both STOP and SumTime-Mousam, as indeed in other NLG projects we have
been involved in, having an initial prototype system working as soon as possible
was very useful for developing our ideas and for explaining to domain experts
and other interested parties what we were trying to do.
In terms of the types of knowledge mentioned in Section 2.1, both the STOP
flowchart and the SumTime-Mousam spreadsheet specified domain knowledge (for
example, how smokers should be categorised) and domain communication knowledge
(for example, the use of ranges instead of single numbers to communicate wind
speed). The STOP flowchart did not specify any generic communication knowledge
such as English grammar and morphology; the author probably believed we knew
more about such things than he did. The SumTime-Mousam spreadsheet did in effect
include a few English grammar rules, but these were just to get the spreadsheet
to work; the author did not have much confidence in them.
In summary, we think directly asking experts for knowledge is an excellent way
to quickly build an initial system, especially if the NLG developers can supply
communication knowledge that the domain expert may not possess. But once the
initial system is in place, it is probably best to use other KA techniques, at
least in poorly understood areas such as NLG. However, in applications where
there is a solid theoretical basis, and the expert can simply say 'build your
system according to theory X', an expert's direct knowledge may perhaps be all
that is needed.
3.2 Structured Expert-Oriented KA: Think-Aloud Protocols
There are numerous types of structured expert-oriented KA techniques, including
think-aloud protocols, sorting, and structured interviews (Scott et al., 1991).
We will focus here on think-aloud protocols, which is the technique we have used
the most. We have tried other structured KA techniques as well, such as sorting
(Reiter et al., 2000); we will not describe these here, but our broad
conclusions about other structured KA techniques were similar to our conclusions
about think-aloud protocols.
In a think-aloud protocol, an expert carries out the task in question (in our
case, writing a text) while 'thinking aloud' into an audio (or video) recorder.
We used think-aloud protocols in both STOP and SumTime-Mousam. They were
especially important in STOP, where they provided the basis for most content and
phrasing rules.
A simple example of the think-aloud process is as follows. One of the doctors
wrote a letter for a smoker who had tried to stop before, and managed to stop
for several weeks before starting again. The doctor made the following comments
in the think-aloud transcript:
    Has he tried to stop smoking before? Yes, and the longest he has managed
    to stop - he has ticked the one week right up to three months and that's
    encouraging in that he has managed to stop at least once before, because
    it is always said that the people who have had one or two goes are more
    likely to succeed in the future.
He also included the following paragraph in the letter that he wrote for this
smoker:
    I see that you managed to stop smoking on one or two occasions before but
    have gone back to smoking, but you will be glad to know that this is very
    common and most people who finally stop smoking have had one or two
    attempts in the past before they finally succeed. What it does show is
    that you are capable of stopping even for a short period, and that means
    you are much more likely to be able to stop permanently than somebody who
    has never ever stopped smoking at all.
After analysing this session, we proposed two rules:
* IF (previous attempt to stop) THEN (message: more likely to succeed)
* IF (previous attempt to stop) THEN (message: most people who stop have a few
  unsuccessful attempts first)
The final system incorporated a rule (based on several KA sessions, not just the
above one) that stated that if the smoker had tried to stop before, and if the
letter included a section on confidence building, then the confidence-building
section should include a short message about previous attempts to stop. If the
smoker had managed to quit for more than one week, this should be mentioned in
the message; otherwise the message should mention the recency of the smoker's
previous cessation attempt if this was within the past 6 months. The actual text
generated from this rule in the example letter of Figure 2 is
    Although you don't feel confident that you would be able to stop if you
    were to try, you have several things in your favour.
    * You have stopped before for more than a month.
Note that the text produced by the actual STOP code is considerably simpler than
the text originally written by the expert. This is fairly common, as are
simplifications in the logic used to decide whether to include a message in a
letter or not. In many cases this is due to the expert having much more
knowledge and expertise than the computer system (Reiter & Dale, 2000, pp.
30-36). In general, the process of deriving implementable rules for NLG systems
from think-aloud protocols is perhaps more of an art than a science, not least
because different experts often write texts in very different ways.
3.2.1 Validation of Structured KA Knowledge
We attempted to verify some of the rules acquired from STOP think-aloud
sessions by performing a series of small experiments where we asked smokers to
comment on a letter, or to compare two versions of a letter. Many of the rules
were supported by these experiments; for example, people in general liked the
recap of smoking likes and dislikes (see the 'You have good reasons to
stop...' section of Figure 2). However, one general negative finding of these
experiments was that the tailoring rules were insufficiently sensitive to
unusual or atypical aspects of individual smokers; and most smokers were
probably unusual or atypical in some way. For example, STOP letters did not go
into the medical details of smoking (as none of the think-aloud expert-written
letters contained such information), and while this seemed like the right
choice for many smokers, a few smokers did say that they would have liked to
see more medical information about smoking. Another example is that (again
based on the think-aloud sessions) we adopted a positive tone and did not try
to scare smokers; and again this seemed right for most smokers, but some
smokers said that a more 'brutal' approach would be more effective for them.
The fact that our experts did not tailor letters in such ways may possibly
reflect the fact that such tailoring would not have been appropriate for the
relatively small number of specific cases they considered in our think-aloud
sessions. We had 30 think-aloud sessions with experts, who looked at 24
different smoker questionnaires (6 questionnaires were considered by two
experts). This may sound like a lot, but it is a drop in the ocean when we
consider how tremendously variable people are.
Comments made by smokers during the STOP clinical trial (Reiter, Robertson, &
Osman, 2003) also revealed some problems with think-aloud derived rules. For
example, we decided not to include practical 'how-to-stop' information in
letters for people not currently intending to stop smoking; smoker comments
suggest that this was a mistake. In fact, some experts did include such
information in think-aloud letters for such people, and some did not. Our
decision not to include this information was influenced by the Stages of
Change theoretical model (Prochaska & diClemente, 1992) of behaviour change,
which states that 'how-to-stop' advice is inappropriate for people not
currently intending to stop; in retrospect, this decision was probably a
mistake.
We repeated two of our think-aloud exercises 15 months after we originally
performed them; that is, we went back to one of our experts and gave him two
questionnaires he had analysed 15 months earlier, and asked him to once again
think aloud while writing letters based on the questionnaires. The letters
that the expert wrote in the second session were somewhat different from the
ones he had originally written, and were preferred by smokers over the letters
he had originally written (Reiter et al., 2000). This suggests that our
experts were not static knowledge sources, but were themselves learning about
the task of writing tailored smoking-cessation letters during the course of
the project. Perhaps this should not be a surprise given that none of the
experts had ever attempted to write such letters before getting involved with
our project.
3.2.2 Role of Structured Expert-Oriented KA
Structured expert-oriented KA was certainly a useful way to expand, refine,
and generally improve initial prototypes constructed on the basis of experts'
direct knowledge. By focusing on actual cases and by structuring the KA
process, we learned many things which the experts did not mention directly. We
obtained all the types of knowledge mentioned in Section 2.1, by working with
experts with the relevant expertise. For example, in STOP we acquired domain
knowledge (such as the medical effects of smoking) from doctors, domain
communication knowledge (such as which words to use) from a psychologist with
expertise in writing patient information leaflets, and communication knowledge
about graphic design and layout from a graphic designer.
However, structured expert-oriented KA did have some problems, including in
particular coverage and variability. As mentioned above, 30 sessions that
examined 24 smoker questionnaires could not possibly give good coverage of the
population of smokers, given how complex and variable people are. As for
variation, the fact that different experts wrote texts in very different ways
made it difficult to extract rules from the think-aloud protocols. We
undoubtedly made some mistakes in this regard, such as not giving
'how-to-stop' information to people not currently intending to stop smoking.
Perhaps we should have focused on a single expert in order to reduce
variation. However, our experiences suggested that different experts were
better at different types of information, and also that experts changed over
time (so we might see substantial variation even in texts from a single
author); these observations raise doubts about the wisdom and usefulness of a
single-expert strategy.
In short, the complexity of NLG tasks means that a very large number of
structured KA sessions may be needed to get good coverage; and the fact that
there are numerous ways to write texts to fulfill a communicative goal means
that different experts tend to write very differently, which makes analysis of
structured KA sessions difficult.
3.3 Corpus Analysis
In recent years there has been great interest in Natural Language Processing
and other areas of AI in using machine learning techniques to acquire
knowledge from relevant data sets. For example, instead of building a medical
diagnosis system by trying to understand how expert doctors diagnose diseases,
we can instead analyse data sets of observed symptoms and actual diseases, and
use statistical and machine learning techniques to determine which symptoms
predict which disease. Similarly, instead of building an English grammar by
working with expert linguists, we can instead analyse large collections of
grammatical English texts in order to learn the allowable structures (grammar)
of such texts. Such collections of texts are called corpora in Natural
Language Processing.
There has been growing interest in applying such techniques to learn the
knowledge needed for NLG. For example, Barzilay and McKeown (2001) used
corpus-based machine learning to learn paraphrase possibilities; Duboue and
McKeown (2001) used corpus-based machine learning to learn how NP constituents
should be ordered; and Hardt and Rambow (2001) used corpus-based machine
learning to learn rules for VP ellipsis.
Some NLG researchers, such as McKeown et al. (1994), have used the term
'corpus analysis' to refer to the manual analysis (without using machine
learning techniques) of a small set of texts which are written explicitly for
the NLG project by domain experts (and hence are not naturally occurring).
This is certainly a valid and valuable KA technique, but we regard it as a
form of structured expert-oriented KA, in some ways similar to think-aloud
protocols. In this paper, 'corpus analysis' refers to the use of machine
learning and statistical techniques to analyse collections of naturally
occurring texts.
Corpus analysis in our sense of the word was not possible in STOP because we
did not have a collection of naturally occurring texts (since doctors do not
currently write personalised smoking-cessation letters). We briefly considered
analysing the example letters produced in the think-aloud sessions with
machine learning techniques, but we only had 30 such texts, and we believed
this would be too few for successful learning, especially given the high
variability between experts. In other words, perhaps the primary strength of
corpus analysis is its ability to extract information from large data sets;
but if there are no large data sets to extract information from, then corpus
analysis loses much of its value.
In SumTime-Mousam, we were able to acquire and analyse a substantial corpus
of 1099 human-written weather forecasts, along with the data files that the
forecasters looked at when writing the forecasts (Sripada et al., 2003).
Details of our corpus analysis procedures and results have been presented
elsewhere (Reiter & Sripada, 2002a; Sripada et al., 2003), and will not be
repeated here.
3.3.1 Validation of Corpus Analysis Knowledge
While many of the rules we acquired from corpus analysis were valid, some
rules were problematical, primarily due to two factors: individual variations
between the writers, and writers making choices that were appropriate for
humans but not for NLG systems.
A simple example of individual variation and the problems it causes is as
follows. One of the first things we attempted to learn from the corpus was how
to express numbers in wind statements. We initially did this by searching for
the most common textual realisation of each number. This resulted in rules
that said that 5 should be expressed as 5, but 6 should be expressed as 06.
Now it is probably acceptable for a forecast to always include leading zeros
for single digits (that is, use 05 and 06), or to never include leading zeros
(that is, use 5 and 6). However, it is probably not acceptable to mix the two
(that is, use 5 and 06 in the same forecast), which is what our rules would
have led to.
form   F1   F2   F3   F4   F5   unknown   total
5       0    7    0    0  122         4     133
05      0    0    1   46    0         2      49
6       0   44    0    0   89         2     135
06      0    0  364  154    0        13     531
Table 2: Usage of 5, 05, 6, 06 in wind statements, by forecaster
The usage of 5, 05, 6, and 06 by each individual forecaster is shown in
Table 2. As this table suggests, each individual forecaster is consistent;
forecasters F3 and F4 always include leading zeros, while forecasters F2 and
F5 never include leading zeros. F1 in fact is also consistent and always omits
leading zeros; for example he uses 8 instead of 08. The reason that the
overall statistics favour 5 over 05 but 06 over 6 is that individuals also
differ in which descriptions of wind speed they prefer to use. For example, F1
never explicitly mentions low wind speeds such as 5 or 6 knots, and instead
always uses generic phrases such as 10 OR LESS; F2, F4, and F5 use a mix of
generic phrases and explicit numbers for low wind speeds; and F3 always uses
explicit numbers and never uses generic phrases. Some of the forecasters
(especially F3) also have a strong preference for even numbers. This means
that the statistics for 5 vs. 05 are dominated by F5 (the only forecaster who
both explicitly mentions low wind speeds and does not prefer even numbers);
while the statistics for 6 vs. 06 are dominated by F3 (who uses this number a
lot because he avoids both generic phrases and odd numbers). Hence the
somewhat odd result that the corpus overall favours 5 over 05 but 06 over 6.
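The effect described above can be reproduced directly from the counts in Table 2. The sketch below is illustrative rather than the analysis code actually used: the function names are hypothetical, and summarising each forecaster by pooling 5/05 with 6/06 is a simplifying assumption.

```python
# Counts copied from Table 2: USAGE[form][forecaster].
USAGE = {
    "5":  {"F1": 0, "F2": 7,  "F3": 0,   "F4": 0,   "F5": 122, "unknown": 4},
    "05": {"F1": 0, "F2": 0,  "F3": 1,   "F4": 46,  "F5": 0,   "unknown": 2},
    "6":  {"F1": 0, "F2": 44, "F3": 0,   "F4": 0,   "F5": 89,  "unknown": 2},
    "06": {"F1": 0, "F2": 0,  "F3": 364, "F4": 154, "F5": 0,   "unknown": 13},
}
# Each number has a plain form and a zero-padded form.
FORMS = {5: ("5", "05"), 6: ("6", "06")}


def pooled_choice(number):
    """Most common realisation of one number, ignoring who wrote it."""
    return max(FORMS[number], key=lambda form: sum(USAGE[form].values()))


def forecaster_style(forecaster):
    """Leading-zero habit of one forecaster, pooled over both numbers."""
    plain = sum(USAGE[FORMS[n][0]][forecaster] for n in FORMS)
    padded = sum(USAGE[FORMS[n][1]][forecaster] for n in FORMS)
    if plain == 0 and padded == 0:
        return "no data"
    return "leading zero" if padded > plain else "no leading zero"


if __name__ == "__main__":
    for number in FORMS:
        print(number, "->", pooled_choice(number))
    for forecaster in ("F1", "F2", "F3", "F4", "F5"):
        print(forecaster, "->", forecaster_style(forecaster))
```

Pooling across authors selects 5 and 06, the inconsistent pair, whereas every forecaster with data for these numbers falls cleanly into one style. F1 has no data here because he never mentions such low speeds explicitly.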
This example is by no means unique. Reiter and Sripada (2002b) explain how a
more complex analysis using this corpus, whose goal was to determine the most
common time phrase for each time, similarly led to unacceptable rules, again
largely because of individual differences between the forecasters.
There are obvious methods to deal with the problems caused by individual
variation. For example, we could restrict the corpus to texts from one author,
although this does have the major drawback of significantly reducing the size
of the corpus. We could also use a more sophisticated model, such as learning
one rule for how all single-digit numbers are expressed, not separate rules
for each number. Or we could analyse the behaviour of individuals and identify
choices (such as presence of a leading zero) that vary between individuals but
are consistently made by any given individual; and then make such choices
parameters which the user of the NLG system can specify. This last option is
probably the best for NLG systems (Reiter, Sripada, & Williams, 2003), and is
the one used in SumTime-Mousam for the leading-zero choice.
Our main point is simply that we would have been in trouble if we had just
accepted our initial corpus-derived rules (use 5 and 06) without question. As
most corpus researchers are of course aware, the result of corpus analysis
depends on what is being learned (for example, a rule on how to realise 5, or
a rule on how to realise all single-digit numbers) and on what features are
used in the learning (for example, just the number, or the number and the
author). In more complex analyses, such as our analysis of time-phrase choice
rules (Reiter & Sripada, 2002b), the result also depends on the algorithms
used for learning and alignment. The dependence of corpus analysis on these
choices means that the results of a particular analysis are not guaranteed to
be correct and need to be validated (checked) just like the results of other
KA techniques. Also, what is often the best approach from an NLG perspective,
namely identifying individual variations and letting the user choose which
variation he or she prefers, requires analysing differences between individual
writers. To the best of our knowledge most published NL corpus analyses have
not done this, perhaps in part because many popular corpora do not include
author information.
The other recurring problem with corpus-derived rules was cases where the
writers produced sub-optimal texts that in particular were shorter than they
should have been, probably because such texts were quicker to write. For
instance, we noticed that when a parameter changed in a more or less steady
fashion throughout a forecast period, the forecasters often omitted a time
phrase. For example, if a S wind rose steadily in speed from 10 to 20 over the
course of a forecast period covering a calendar day, the forecasters might
write S 8-12 RISING TO 18-22, instead of S 8-12 RISING TO 18-22 BY MIDNIGHT. A
statistical corpus analysis showed that the 'null' time phrase was the most
common one in such contexts, used in 33% of cases. The next most common time
phrase, later, was only used in 14% of cases. Accordingly, we programmed our
system to omit the time phrase in such circumstances. However, when we asked
experts to comment on and revise our generated forecasts (Section 3.4), they
told us that this behaviour was incorrect, and that forecasts were more useful
to end users if they included explicit time phrases and did not rely on the
readers remembering when forecast periods ended. In other words, in this
example the forecasters were doing the wrong thing, which of course meant that
the rule produced by corpus analysis was incorrect.
We don't know why the forecasters did this, but discussions with the forecast
managers about this and other mistakes (such as forecast authors describing
wind speed and direction as changing at the same time, even when they actually
were predicted to change at different times) suggested that one possible cause
is the desire to write forecasts quickly. In particular, numerical weather
predictions are constantly being updated, and customers want their forecasts
to be based on the most up-to-date prediction; this can limit the amount of
time available to write forecasts.
In fact it can be perfectly rational for human writers to 'cut corners'
because of time limitations. If the forecasters believe, for example, that
quickly writing a forecast at the last minute will let them use more
up-to-date prediction data, and that the benefits of more up-to-date data
outweigh the costs of abbreviated texts, then they are making the right
decision when they write shorter-than-optimal texts. An NLG system, however,
faces a very different set of tradeoffs (for example, omitting a time phrase
is unlikely to speed up an NLG system), which means that it should not blindly
imitate the choices made by human writers.
This problem is perhaps a more fundamental one than the individual variation
problem, because it cannot be solved by appropriate choices as to what is
being learned, what features are considered, and so forth. Corpus analysis,
however it is performed, learns the choice rules used by human authors. If
these rules are inappropriate for an NLG system, then the rules learned by
corpus analysis will be inappropriate ones as well, regardless of how the
corpus analysis is carried out.
In very general terms, corpus analysis certainly has many strengths, such as
looking at what people do in practice, and collecting large data sets which
can be statistically analysed. But pure corpus analysis does perhaps suffer
from the drawback that it gives no information on why experts made the choices
they made, which means that blindly imitating a corpus can lead to
inappropriate behaviour when the human writers face a different set of
constraints and tradeoffs than the NLG system.
3.3.2 Role of Corpus Analysis
Corpus analysis and machine learning are wonderful ways to acquire knowledge if
1. there is a large data set (corpus) that covers unusual and boundary cases as
   well as normal cases;
2. the members of the data set (corpus) are correct in that they are what we
   would like the software system to produce; and
3. the members of the data set (corpus) are consistent (modulo some noise), for
   example any given input generally leads to the same output.
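Condition 3 can be checked before any learning is attempted. The sketch below is one hypothetical way to do so, not a procedure described in this paper: for a single input it reports the share of texts that use the most common output, so values well below 1.0 signal the kind of inconsistency that makes learned rules unreliable. The function name `majority_share` is an assumption made for illustration.

```python
def majority_share(output_counts):
    """Share of observations that use the most common output for one input.

    output_counts maps each observed output string to its frequency.
    Returns a value in (0, 1]; 1.0 means the corpus is fully consistent.
    """
    total = sum(output_counts.values())
    if total == 0:
        raise ValueError("no observations for this input")
    return max(output_counts.values()) / total


# Totals for the two realisations of each number, taken from Table 2.
observed = {
    5: {"5": 133, "05": 49},
    6: {"6": 135, "06": 531},
}

for number, counts in observed.items():
    print(number, round(majority_share(counts), 2))
```

For the wind-statement data this gives about 0.73 for 5 and about 0.80 for 6. What counts as acceptable noise is a judgement call that depends on the application.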
These conditions are probably satisfied when learning rules for medical
diagnosis or speech recognition. However, they were not satisfied in our
projects. None of the above conditions were satisfied in STOP, and only the
first was satisfied in SumTime-Mousam.
Of course, there may be ways to alleviate some of these problems. For
example, we could try to acquire general communication knowledge which is not
domain dependent (such as English grammar) from general corpora such as the
British National Corpus; we could argue that certain aspects of manually
written texts (such as lexical usage) are unlikely to be adversely affected by
time pressure and hence are probably correct; and we could analyse the
behaviour of individual authors in order to enhance consistency (in other
words, treat author as an input feature on a par with the actual numerical or
semantic input data). There is scope for valuable research here, which we hope
will be considered by people interested in corpus-based techniques in NLG.
We primarily used corpus analysis in SumTime-Mousam to acquire domain
communication knowledge, such as how to linguistically express numbers and
times in weather forecasts, when to elide information, and sublanguage
constraints on the grammar of our weather forecasts. Corpus analysis of course
can also be used to acquire generic communication knowledge such as English
grammar, but as mentioned above this is probably best done on a large general
corpus such as the British National Corpus. We did not use corpus analysis to
acquire domain knowledge about meteorology. Meteorological researchers in fact
do use machine learning techniques to learn about meteorology, but they
analyse numeric data sets of actual and predicted weather; they do not analyse
textual corpora.
In summary, machine learning and corpus-based techniques are extremely
valuable if the above conditions are satisfied, and in particular offer a
cost-effective solution to the problem of acquiring the large amount of
knowledge needed in complex NLG applications (Section 3.2.2). Acquiring large
amounts of knowledge using expert-oriented KA techniques is expensive and
time-consuming because it requires many sessions with experts; in contrast, if
a large corpus of consistent and correct texts can be created, then large
amounts of knowledge can be extracted from it at low marginal cost. But like
all learning techniques, corpus analysis is very vulnerable to the 'Garbage
In, Garbage Out' principle; if the corpus is small, incorrect, and/or
inconsistent, then the results of corpus analysis may not be correct.
3.4 Expert Revision
In both STOP and SumTime-Mousam, we made heavy use of expert revision. That
is, we showed generated texts to experts and asked them to suggest changes
that would improve them. In a sense, expert revision could be considered to be
a type of structured expert-oriented KA, but it seems to have somewhat
different strengths and weaknesses than the techniques mentioned in Section
3.2, so we treat it separately.
As an example of expert revision, an early version of the STOP system used the
phrase 'there are lots of good reasons for stopping'. One of the experts
commented during a revision session that the phrasing should be changed to
emphasise that the reasons listed (in this particular section of the STOP
letter) were ones the smoker himself had selected in the questionnaire he
filled out. This eventually led to the revised wording 'It is encouraging that
you have many good reasons for stopping', which is in the first paragraph of
the example letter in Figure 2. An example of expert revision in
SumTime-Mousam was mentioned in Section 3.3; when we showed experts generated
texts that omitted some end-of-period time phrases, they told us this was
incorrect, and we should include such time phrases.
In STOP, we also tried revision sessions with recipients (smokers). This was
less successful than we had hoped. Part of the problem was the smokers knew
very little about STOP (unlike our experts, who were all familiar with the
project), and often made comments which were not useful for improving the
system, such as 'I did stop for 10 days til my daughter threw a wobbly and
then I wanted a cigarette and bought some'. Also, most of our comments came
from well-educated and articulate smokers, such as university students. It was
harder to get feedback from less well-educated smokers, such as single mothers
living in council (public housing) estates. Hence we were unsure if the
revision comments we obtained were generally applicable or not.
3.4.1 Validation of Expert Revision Knowledge
We did not validate expert revision knowledge as we did with the other
techniques. Indeed, we initially regarded expert revision as a validation
technique, not a KA technique, although in retrospect it probably makes more
sense to think of it as a KA technique.
On a qualitative level, though, expert revision has certainly resulted in a
lot of useful knowledge and ideas for changing texts, and in particular proved
a very useful way of improving the handling of unusual and boundary cases. For
example, we changed the way we described uneventful days in SumTime-Mousam
(when the weather changed very little during a day) based on revision
sessions.
The comment was made during STOP that revision was best at suggesting
specific localised changes to generated text, and less useful in suggesting
larger changes to the system. One of the STOP experts suggested, after the
system was built, that he might have been able to suggest larger changes if we
had explained the system's reasoning to him, instead of just giving him a
letter to revise. In other words, just as we asked experts to 'think aloud' as
they wrote letters, in order to understand their reasoning, it could be useful
in revision sessions if experts understood what the computer system was
'thinking' as well as what it actually produced. Davis and Lenat (1982, page
260) have similarly pointed out that explanations can help experts debug and
improve knowledge-based systems.
3.4.2 Role of Expert Revision
We have certainly found expert revision to be an extremely useful technique
for improving NLG systems; and furthermore it is useful for improving all
types of knowledge (domain, domain communication, and communication). But at
the same time revision does seem to largely be a local optimisation technique.
If an NLG system is already generating reasonable texts, then revision is a
good way of adjusting the system's knowledge and rules to improve the quality
of generated text. But like all local optimisation techniques, expert revision
may tend to push systems towards a 'local optimum', and may be less well
suited to finding radically different solutions that give a better result.
4. Discussion: Problems Revisited
In Section 1 we explained that writing tasks can be difficult to automate
because these are complex, often novel, poorly understood, and allow multiple
solutions. In this section we discuss each of these problems in more detail,
based on our experiences with STOP and SumTime-Mousam.
4.1 Complexity
Because NLG systems communicate with humans, they need knowledge about
people, language, and how people communicate; since all of these are very
complex, that means that in general NLG systems need a lot of complex
knowledge. This is one of the reasons why knowledge acquisition for NLG is so
difficult. If we recall the distinction in Section 2.1 between domain
knowledge, domain communication knowledge, and communication knowledge, it may
be that communication knowledge (such as grammar) is generic and hence can be
acquired once (perhaps by corpus-based techniques) and then used in many
applications. And domain knowledge is similar to what is needed by other AI
systems, so problems acquiring it are not unique to NLG. But domain
communication knowledge, such as the optimal tone of a smoking letter and how
this tone can be achieved, or when information in a weather forecast can be
elided, is application dependent (and hence cannot be acquired generically)
and is also knowledge about language and communication (and hence is complex).
Hence KA for NLG may always require acquiring complex knowledge.
In our experience, the best way to acquire complex knowledge robustly is to
get information on how a large number of individual cases are handled. This
can be done by corpus analysis if a suitable corpus can be created. It can
also sometimes be done by expert revision, if experts have the time to look at
a large number of generated texts; in this regard it may be useful to tell
them to only comment on major problems and to ignore minor difficulties. But
however the knowledge is acquired, it will require a substantial effort.
4.2 Novelty
Of course, many AI systems need complex knowledge, so the above comments are
hardly unique to NLG. But one aspect of NLG which perhaps is more unusual is
that many of the tasks NLG systems are expected to perform are novel tasks
that are not currently done by humans. Most AI 'expert systems' attempt to
replicate the performance of human experts in areas such as medical diagnosis
and credit approval. Similarly, most language technology systems attempt to
replicate the performance of human language users in tasks such as speech
recognition and information retrieval. But many NLG applications are like
STOP, and attempt a task that no human performs. Even in SumTime-Mousam, an
argument could be made that the task humans actually perform is writing
weather forecasts under time constraints, which is in fact different from the
task performed by SumTime-Mousam.
Novelty is a fundamental problem, because it means that knowledge acquired
from expert-oriented KA may not be reliable (since the experts are not in fact
experts at the actual NLG task), and that a corpus of manually-written texts
probably does not exist. This means that none of the KA techniques described
above are likely to work. Indeed, acquiring novel knowledge is almost the
definition of scientific research, so perhaps the only way to acquire such
knowledge is to conduct scientific research in the domain. Of course, only
some knowledge will need to be acquired in this way; even in a novel
application it is likely that much of the knowledge needed (such as grammar
and morphology) is not novel.
On the other hand, novelty perhaps is also an opportunity for NLG. One of the
drawbacks of conventional expert systems is that their performance is often
limited to that of human experts, in which case users may prefer to consult
actual experts instead of computer systems. But if there are no experts at a
task, an NLG system may be used even if its output is far from ideal.
4.3 Poorly Understood Tasks
A perhaps related problem is that there are no good theoretical models for
many of the choices that NLG systems need to make. For example, the ultimate
goal of STOP is to change people's behaviour, and a number of colleagues have
suggested that we base STOP on argumentation theory, as Grasso, Cawsey, and
Jones (2000) did for their dietary advice system. However, argumentation
theory focuses on persuading people to change their beliefs and desires,
whereas the goal of STOP was more to encourage people to act on beliefs and
desires they already had. In other words, STOP's main goal was to encourage
people who already wanted to stop smoking to make a serious cessation attempt,
not to convince people who had no desire to quit that they should change their
mind about the desirability of smoking. The most applicable theory we could
find was Stages of Change (Prochaska & diClemente, 1992), and indeed we
partially based STOP on this theory. However, the results of our evaluation
suggested that some of the choices and rules that we based on Stages of Change
were incorrect, as mentioned in Section 3.2.1.
Similarly, one of the problems in SumTime-Mousam is generating texts that will
be interpreted correctly despite the fact that different readers have
different idiolects and in particular probably interpret words in different
ways (Reiter & Sripada, 2002a; Roy, 2002). Theoretical guidance on how to do
this would have been very useful, but we were not able to find any such
guidance.
The lack of good theoretical models means that NLG developers cannot use such
models to 'fill in the cracks' between knowledge acquired from experts or from
data sets, as can be done by AI systems in better understood areas such as
scheduling or configuring machinery. This in turn means that a lot of
knowledge must be acquired. In applications where there is a good theoretical
basis, the goal of KA is perhaps to acquire a limited amount of high-level
information about search strategies, taxonomies, the best way to represent
knowledge, etc; once these have been determined, the details can be filled in
by theoretical models. But in applications where details cannot be filled in
from theory and need to be acquired, much more knowledge is needed. Acquiring
such knowledge with structured expert-oriented KA could be extremely expensive
and time consuming. Corpus-based techniques are cheaper if a large corpus is
available; however, the lack of a good theoretical understanding perhaps
contributes to the problem that we do not know which behaviour we observe in
the corpus is intended to help the reader (and hence should be copied by an
NLG system) and which behaviour is intended to help the writer (and hence
perhaps should not be copied).
4.4Expert Variation
Perhapsin part because of the lack of goodtheories, in both stop and SumTime-Mousam
we observed considerable variationbetween experts. In other words,different
experts wrote
quitedifferent texts from the same input data.In stop we also discovered
that experts
changed how they wrote overtime {Section 3.2.1}.
Variability caused problems for both structured expert-oriented ka {because
different experts told us different things} and for corpus analysis {because
variation among corpus authors made it harder to extract a consistent set of
rules with good coverage}. However, variation seems to have been less of a
problem with revision. We suspect this is because experts vary less when they
are very confident about a particular decision; and in revision experts tended
to focus on things they were confident about, which was not the case with the
other ka techniques.
In a sense variability may be especially dangerous in corpus analysis, because
there is no information in a corpus about the degree of confidence authors have
in individual decisions, and also because developers may not even realise that
there is variability between authors, especially if the corpus does not include
author information. In contrast, structured expert-oriented techniques such as
think-aloud do sometimes give information about experts' confidence, and also
variations between experts are usually obvious.
We experimented with various techniques for resolving differences between
experts/authors, such as group discussions and focusing on the decisions made by
one particular expert. None of these were really satisfactory. Given our
experiences with revision, perhaps the best way to reduce variation is to develop
ka techniques that very clearly distinguish between decisions experts are
confident in and decisions they have less confidence in.
5. Development Methodology: Using Multiple KA Techniques
From a methodological perspective, the fact that different ka techniques have
different strengths and weaknesses suggests that it makes sense to use a mixture
of several different ka techniques. For example, if both structured
expert-oriented ka and corpus analysis are used, then the explanatory information
from the expert-oriented ka can be used to help identify which decisions are
intended to help the reader and which are intended to help the writer, thus
helping overcome a problem with corpus analysis; and the broader coverage of
corpus analysis can show how unusual and boundary cases should be handled, thus
overcoming a problem with expert-oriented ka.
It also may make sense to use different techniques at different points in the
development process. For example, directly asking experts for knowledge could be
stressed at the initial stages of a project, and used to build a very simple
initial prototype; structured ka with experts and corpus analysis could be
stressed during the middle phases of a project, when the prototype is fleshed out
and converted into something resembling a real system; and revision could be used
in the later stages of a project, when the system is being refined and improved.
This strategy, which is shown graphically in Figure 5, is basically the one we
followed in both stop and SumTime-Mousam. Note that it suggests that knowledge
acquisition is something that happens throughout the development process. In
other words, we do not first acquire knowledge and then build a system; knowledge
acquisition is an ongoing process which is closely coupled with the general
software development effort. Of course, this is hardly a novel observation, and
there are many development methodologies for knowledge-based systems that stress
iterative development and continual ka {Adelman & Riedel, 1997}.
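As an illustration only, the phased strategy described above can be written down
as a small lookup table. This is a minimal sketch in Python; the names Phase,
KA_SCHEDULE and techniques_for are hypothetical and do not come from stop or
SumTime-Mousam.

```python
from enum import Enum


class Phase(Enum):
    """Project phases as described in the text (names are hypothetical)."""
    INITIAL = "initial"
    MIDDLE = "middle"
    LATE = "late"


# KA techniques stressed in each phase, following the text above.
# The mapping is illustrative; a real project may blend techniques differently.
KA_SCHEDULE = {
    Phase.INITIAL: ["directly asking experts"],
    Phase.MIDDLE: ["structured KA with experts", "corpus analysis"],
    Phase.LATE: ["expert revision"],
}


def techniques_for(phase):
    """Return the KA techniques stressed during the given phase."""
    return list(KA_SCHEDULE[phase])


if __name__ == "__main__":
    for phase in Phase:
        print(phase.value, "->", ", ".join(techniques_for(phase)))
```

The table makes explicit that knowledge acquisition is scheduled across the whole
project rather than completed before system building begins.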
In the short term, we believe that using a development methodology that combines
different ka techniques in this manner, and also validating knowledge as much as
possible, are the best strategies for acquiring nlg knowledge. We also believe
that whenever possible knowledge that is acquired in one way should be validated
in another way. In other words, we do not recommend validating corpus-acquired
knowledge using corpus techniques {even if the validation is done with a held-out
test set}; or validating expert-acquired knowledge using expert-based validation
{even if the validation is done using a different expert}. It is preferable
{although not always possible} to validate corpus-acquired knowledge with
experts, and to validate expert-acquired knowledge with a corpus.
Another
issue related to development methodology is the relationship between knowledge
acquisition and system evaluation. Although these are usually considered to be
separate activities, in fact they can be closely related. For example, we are
currently running an evaluation of SumTime-Mousam which is based on the number of
edits that forecasters manually make to computer-generated forecasts before
publishing them; this is similar to edit-cost evaluations of machine translation
systems {Jurafsky & Martin, 2000, page 823}. However, these edits are also an
excellent source of data for improving the system via expert revision. To take
one recent example, a forecaster edited the computer-generated text
SSE 23-28 GRADUALLY BACKING SE 20-25 by dropping the last speed range, giving
SSE 23-28 GRADUALLY BACKING SE. This can be considered as evaluation data
{2 token edits needed to make text acceptable}, or as ka data {we need to adjust
our rules for eliding similar but not identical speed ranges}. In other words,
real-world feedback on the effectiveness and quality of generated texts can often
be used to either improve or evaluate an nlg system. How such data should be used
depends on the goals of the project. In scientific projects whose goal is to test
hypotheses, it may be appropriate at some point to stop improving a system and
use all new effectiveness data purely for evaluation and hypothesis testing; in a
sense this is analogous
to holding back part of a corpus for testing purposes. In applied projects whose
goal is to build a maximally useful system, however, it may be more appropriate
to use all of the effectiveness data to improve the quality of the generated
texts.
Figure 5: Our Methodology {figure labels: Directly Ask Experts for Knowledge;
Structured KA with Experts; Corpus Analysis; Expert Revision; Initial prototype;
Initial version of full system; Final System}
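The forecast edit discussed above can be counted mechanically. The following is a
minimal sketch, assuming that tokens are obtained by splitting on whitespace and
hyphens, so that the speed range 20-25 counts as two tokens, and that edits are
counted with a standard token-level Levenshtein distance. Both are assumptions
made for illustration; the function names are hypothetical and this is not the
evaluation code used in SumTime-Mousam.

```python
import re


def tokenize(forecast):
    """Split a forecast phrase into tokens.

    Assumption: whitespace and hyphens both separate tokens, so a speed
    range such as "20-25" yields the two tokens "20" and "25".
    """
    return [tok for tok in re.split(r"[\s-]+", forecast.strip()) if tok]


def token_edit_distance(generated, edited):
    """Token-level Levenshtein distance (insert, delete, substitute cost 1)."""
    a, b = tokenize(generated), tokenize(edited)
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if tok_a == tok_b else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


generated = "SSE 23-28 GRADUALLY BACKING SE 20-25"
edited = "SSE 23-28 GRADUALLY BACKING SE"

if __name__ == "__main__":
    print(token_edit_distance(generated, edited))
```

Under this tokenization the example edit costs 2, which agrees with the count
given above; with whitespace-only tokenization the same edit would cost 1, so the
choice of tokenizer needs to be fixed before edit counts are compared.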
6. Conclusion
Acquiring correct knowledge for nlg is very difficult, because the knowledge
needed is largely knowledge about people, language, and communication, and such
knowledge is complex and poorly understood. Furthermore, perhaps because writing
is more of an art than a science, different people write very differently, which
further complicates the knowledge acquisition process; and many nlg systems
attempt novel tasks not currently done manually, which makes it very hard to find
knowledgeable experts or good quality corpora. Perhaps because of these problems,
every single ka technique we tried in stop and SumTime-Mousam had major problems
and limitations.
There is no easy solution to these problems. In the short term, we believe it is
useful to use a mixture of different ka techniques {since techniques have
different strengths and weaknesses}, and to validate knowledge whenever possible,
preferably using a different technique than the one used to acquire the
knowledge. It also helps if developers understand the weaknesses of different
techniques, such as the fact that structured expert-oriented ka may not give good
coverage of the complexities of people and language, and the fact that
corpus-based ka does not distinguish between behaviour intended to help the
reader and behaviour intended to help the writer.
In the longer term, we need more research on better ka techniques for nlg. If we
cannot reliably acquire the knowledge needed by ai approaches to text generation,
then there is no point in using such approaches, regardless of how clever our
algorithms or models are. The first step towards developing better ka techniques
is to acknowledge that current ka techniques are not working well, and understand
why this is the case; we hope that this paper constitutes a useful step in this
direction.
Acknowledgements
Numerous people have given us valuable comments over the past five years as we
struggled with ka for nlg, too many to acknowledge here. But we would like to
thank Sandra Williams for reading several drafts of this paper and considering it
in the light of her own experiences, and the anonymous reviewers for their very
helpful comments. We would also like to thank the experts we worked with in stop
and SumTime-Mousam, without whom this work would not have been possible. This
work was supported by the UK Engineering and Physical Sciences Research Council
{EPSRC}, under grants GR/L48812 and GR/M76881, and by the Scottish Office
Department of Health under grant K/OPR/2/2/D318.
References
Adelman, L., & Riedel, S. {1997}. Handbook for Evaluating Knowledge-Based
Systems. Kluwer.
Anderson, J. {1995}. Cognitive Psychology and its Implications {Fourth edition}.
Freeman.
Barzilay, R., & McKeown, K. {2001}. Extracting paraphrases from a parallel
corpus. In Proceedings of the 39th Meeting of the Association for Computational
Linguistics {ACL-01}, pp. 50-57.
Buchanan, B., & Wilkins, D. {Eds.}. {1993}. Readings in Knowledge Acquisition and
Learning. Morgan Kaufmann.
Davis, R., & Lenat, D. {1982}. Knowledge-Based Systems in Artificial
Intelligence. McGraw Hill.
Duboue, P., & McKeown, K. {2001}. Empirically estimating order constraints for
content planning in generation. In Proceedings of the 39th Meeting of the
Association for Computational Linguistics {ACL-01}, pp. 172-179.
Goldberg, E., Driedger, N., & Kittredge, R. {1994}. Using natural-language
processing to produce weather forecasts. IEEE Expert, 9 {2}, 45-53.
Grasso, F., Cawsey, A., & Jones, R. {2000}. Dialectical argumentation to solve
conflicts in advice giving: a case study in the promotion of healthy nutrition.
International Journal of Human Computer Studies, 53, 1077-1115.
Hardt, D., & Rambow, O. {2001}. Generation of VP-ellipsis: A corpus-based
approach. In Proceedings of the 39th Meeting of the Association for Computational
Linguistics {ACL-01}, pp. 282-289.
Jurafsky, D., & Martin, J. {2000}. Speech and Language Processing. Prentice-Hall.
Kittredge, R., Korelsky, T., & Rambow, O. {1991}. On the need for domain
communication language. Computational Intelligence, 7 {4}, 305-314.
Lavoie, B., Rambow, O., & Reiter, E. {1997}. Customizable descriptions of
object-oriented models. In Proceedings of the Fifth Conference on Applied
Natural-Language Processing {ANLP-1997}, pp. 253-256.
Lennox, S., Osman, L., Reiter, E., Robertson, R., Friend, J., McCann, I.,
Skatun, D., & Donnan, P. {2001}. The cost-effectiveness of computer-tailored and
non-tailored smoking cessation letters in general practice: A randomised
controlled study. British Medical Journal, 322, 1396-1400.
McKeown, K., Kukich, K., & Shaw, J. {1994}. Practical issues in automatic
document generation. In Proceedings of the Fourth Conference on Applied
Natural-Language Processing {ANLP-1994}, pp. 7-14.
Oberlander, J., O'Donnell, M., Knott, A., & Mellish, C. {1998}. Conversation in
the museum: experiments in dynamic hypermedia with the intelligent labelling
explorer. New Review of Hypermedia and Multimedia, 4, 11-32.
Prochaska, J., & diClemente, C. {1992}. Stages of Change in the Modification of
Problem Behaviors. Sage.
Reiter, E., & Dale, R. {2000}. Building Natural Language Generation Systems.
Cambridge University Press.
Reiter, E., Robertson, R., & Osman, L. {2000}. Knowledge acquisition for natural
language generation. In Proceedings of the First International Conference on
Natural Language Generation, pp. 217-215.
Reiter, E., Robertson, R., & Osman, L. {2003}. Lessons from a failure: Generating
tailored smoking cessation letters. Artificial Intelligence, 144, 41-58.
Reiter, E., & Sripada, S. {2002a}. Human variation and lexical choice.
Computational Linguistics, 28, 545-553.
Reiter, E., & Sripada, S. {2002b}. Should corpora texts be gold standards for
NLG? In Proceedings of the Second International Conference on Natural Language
Generation, pp. 97-104.
Reiter, E., Sripada, S., & Williams, S. {2003}. Acquiring and using limited user
models in NLG. In Proceedings of the 2003 European Workshop on Natural Language
Generation, pp. 87-94.
Roy, D. {2002}. Learning visually grounded words and syntax for a scene
description task. Computer Speech and Language, 16, 353-385.
Scott, A. C., Clayton, J., & Gibson, E. {1991}. A Practical Guide to Knowledge
Acquisition. Addison-Wesley.
Shadbolt, N., O'Hara, K., & Crow, L. {1999}. The experimental evaluation of
knowledge acquisition techniques and methods: History, problems and new
directions. International Journal of Human Computer Studies, 51, 729-755.
Sripada, S., Reiter, E., Hunter, J., & Yu, J. {2002}. Segmenting time series for
weather forecasting. In Applications and Innovations in Intelligent Systems X,
pp. 105-118. Springer-Verlag.
Sripada, S., Reiter, E., Hunter, J., & Yu, J. {2003}. Summarising neonatal
time-series data. In Proceedings of the Research Note Sessions of the EACL-2003,
pp. 167-170.
Sripada, S., Reiter, E., Hunter, J., Yu, J., & Davy, I. {2001}. Modelling the
task of summarising time series data using KA techniques. In Applications and
Innovations in Intelligent Systems IX, pp. 183-196. Springer-Verlag.
Walker, M., Rambow, O., & Rogati, M. {2002}. Training a sentence planner for
spoken dialogue using boosting. Computer Speech and Language, 16, 409-433.
Williams, S., Reiter, E., & Osman, L. {2003}. Experiments with discourse-level
choices and readability. In Proceedings of the 2003 European Workshop on Natural
Language Generation, pp. 127-134.