
Natural Interactive Communication for Edutainment
NICE was a European Union Human Language Technologies research project (IST-2001-35293) that ran from 2002 to 2005. Its goal was to push the boundaries of what computers could do in natural, spoken conversation — and to deploy that technology in an ambitious, playful context: letting children and teenagers talk to fairy-tale characters brought to life in a 3D computer game.
The project united five partner institutions across four countries, developing two distinct interactive systems, both inspired by the universe of Danish author Hans Christian Andersen. One system let users hold free-ranging, domain-oriented conversation with an embodied, animated HCA himself. The other placed users inside his fairy-tale world, where they had to cooperate with — and negotiate against — animated story characters using spontaneous spoken and gestural dialogue.
What made NICE unusual was its deliberate move away from strictly task-oriented dialogue. Real human conversation is unpredictable, socially rich, and full of personality. NICE set out to demonstrate that speech technology could support exactly that — not just information retrieval, but character, emotion, storytelling, and collaborative problem-solving.
— Gustafson et al., SIGdial 2004
When NICE began in March 2002, the consortium identified four challenges that had never been satisfactorily met in any existing system:
Before any prototype system existed, NICE partners collected extensive Wizard-of-Oz (WoZ) data — human-simulated interactions that provided training material for speech recognisers and insights into how children actually talk when addressing these characters.
The detailed specification for the first HCA prototype was simulated Wizard-of-Oz-style over two weeks at the Hans Christian Andersen Museum in Odense, Denmark. Roughly 30 hours of English conversation were recorded and approximately 500 dialogues transcribed. Young museum visitors (aged 9–18) talked to a cartoon-like animated HCA face via a headset microphone, while the wizard controlled HCA's responses and lip movements in real time.
These recordings unexpectedly became a gold standard for natural human conversation: the wizards' thoroughly human response style revealed conversation patterns (topic shifts, handling of slow users, graceful follow-along) that were later built into the second prototype (PT2).
TeliaSonera set up an animated speaking agent in a museum exhibition to continuously collect children's speech data from visitors. By December 2002, roughly 20,000 utterances had been recorded and transcribed, 40% of them from users aged 15 or under.
A separate Lo-Fi WoZ session tested an early FTW prototype: 16 adult subjects (300 utterances) interacted with Cloddy Hans on a massive 3×5 metre screen, helping him solve a simple visual puzzle. This data informed the FTW system design and speech recogniser language models.
LIMSI conducted WoZ experiments with a simple multimodal game featuring 2D cartoon characters. 7 adults and 10 children interacted using speech and pen gestures on an interactive pen display. Sessions were video-recorded and annotated using behavioural metrics and questionnaires. This early data informed the design of the gesture recognition, interpretation, and input fusion modules.
The HCA museum WoZ data yielded an unexpected methodological insight: even though the wizards followed the detailed system specification, they did so in a thoroughly human way — understanding users' full communicative intention even when it was expressed obliquely across multiple turns. The resulting transcripts revealed that the simulated HCA "virtually always followed the user" when the user changed topic or domain — and this naturalness was later built into the second HCA prototype's conversation manager as a design principle.
Children's speech recognition was one of NICE's core technical challenges. ScanSoft (originally Philips Speech Processing) developed dedicated approaches for both the English (HCA) and Swedish (FTW) systems.
Training from scratch was not feasible with limited in-domain data. Instead, a commercial adult-speech acoustic model was adapted using two combined techniques:
Sufficient NICE-specific data allowed training from scratch. Key complication: most users were non-native English-speaking Danish children, producing high noise and hesitation rates. Solution:
Potamianos (2003) achieved similar improvements (~45% relative WER reduction) with 2–3× more data. NICE's results were therefore at or beyond the published state-of-the-art despite tighter data constraints.
For the Fairy-Tale World, TeliaSonera developed a bespoke unit-selection speech synthesis system from scratch for Swedish — giving each character a distinct voice that conveyed their personality. This went far beyond standard TTS in scope and design.
For the HCA system, the English TTS was an off-the-shelf AT&T synthesiser selected for voice quality and suitability for a 55-year-old male character. Eleven visemes were used for lip synchronisation, with up to 17 non-verbal behaviour primitives per output turn.
NICE brought together expertise from dialogue systems, human-computer interaction, speech technology, computer animation, and games across four European countries.
NICE produced two complementary systems, each tackling the challenge of natural dialogue from a different angle. Both were tested with real children.
Users hold free-form conversation with an animated, life-like Hans Christian Andersen in his 19th-century Copenhagen study. Andersen knows about his life, his fairy tales, his surroundings — and reacts emotionally to the user.
A 3D computer game where children use speech and gesture to cooperate with Cloddy Hans and negotiate with Karen to progress through scenes inspired by Andersen's fairy tales.
The HC Andersen system aimed to demonstrate domain-oriented conversation — a form of interaction halfway between strict task-oriented dialogue and full Turing-test conversation. Users could talk to HCA about anything within his knowledge domains, in any order, with mixed initiative, just as in real social conversation.
HCA appeared in his 19th-century Copenhagen study (Figure: his virtual writing desk surrounded by pictures, a feather pen, travel bag and personal objects). Users could speak to him freely in English and point at objects in his study using a touch screen or mouse to prompt him to tell stories about them.
HCA had knowledge of: his fairy tales (The Little Mermaid, The Ugly Duckling, etc.), his life and childhood in Odense, his study and its objects, the user (he asked about age, name, country, interests), and meta-communication to handle errors and keep conversation flowing.
Core traits (from the project specification): imaginative, sensitive, extrovert, around 55 years old, and "a child inside".
Unlike task-oriented systems where the goal is to complete a booking or answer a query, HCA had no task constraints. Users could jump between topics, back-channel, ask personal questions, or even be rude. The system aimed to make HCA feel like a real character from history, not a database. The target was to produce genuine conversation for edutainment — learning about Andersen's life and era while being entertained.
The HCA system used an event-driven, modular, asynchronous architecture with modules communicating through the KTH Message Broker — an open-source TCP server that routed calls and time-stamped all events.
The NLU pipeline consisted of: (1) a Keyphrase Spotter that mapped phrases to syntactic/semantic categories; (2) a Syntactic Analyser with number spotting, lexicon lookup, and rule engine; (3) a Domain/Topic Spotter mapping categories to knowledge domains; and (4) an FSA Processor (Finite State Automaton) for deep parsing. The vocabulary had ~1977 word forms in PT2.
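As a rough illustration of how the first and third stages might operate (not the project's actual code; the phrase inventory, category names, and domain labels below are invented for the example), a minimal keyphrase-and-domain spotter in Python could look like this:

```python
# Minimal sketch of keyphrase spotting followed by domain spotting, loosely
# mirroring stages (1) and (3) of the HCA NLU pipeline described above.
# The phrase inventory, categories, and domain names are illustrative only.

KEYPHRASES = {
    "little mermaid": "FAIRYTALE_TITLE",
    "ugly duckling": "FAIRYTALE_TITLE",
    "your mother": "FAMILY",
    "feather pen": "STUDY_OBJECT",
    "how old are you": "PERSONAL_QUESTION",
}

DOMAINS = {
    "FAIRYTALE_TITLE": "works",
    "FAMILY": "life",
    "STUDY_OBJECT": "study",
    "PERSONAL_QUESTION": "user",
}

def spot(utterance: str) -> list[tuple[str, str]]:
    """Return (category, domain) pairs for every key phrase found."""
    text = utterance.lower()
    hits = []
    for phrase, category in KEYPHRASES.items():
        if phrase in text:
            hits.append((category, DOMAINS[category]))
    return hits

print(spot("Can you tell me about the Little Mermaid?"))
# [('FAIRYTALE_TITLE', 'works')]
```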
The heart of HCA's intelligence. Three output states: Non-Communicative Action (HCA working alone in his study), Communicative Function (real-time awareness of user input), and Communicative Action (HCA's actual conversational turns). Domain Agents handled specific knowledge areas (life, works, study, user, meta). The Emotion Calculator updated HCA's emotional state on each user turn. Up to 17 non-verbal behaviour primitives per output turn.
Three gesture shapes were recognized: pointer (pointing), connect (line), surrounder (circle). The Gesture Interpreter identified which of 16–21 "referenceable objects" in the study the user gestured at. Input Fusion merged speech semantics and gesture using temporal coincidence with configurable time windows.
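The fusion rule can be pictured as follows: a deictic spoken reference is merged with a gesture when the two arrive within a configurable time window. This is a hypothetical sketch; the field names and the 1.5-second window are assumptions, not values from the NICE system.

```python
from dataclasses import dataclass

# Illustrative sketch of temporal input fusion: an unresolved spoken reference
# ("this", "that one") is filled from a pointing gesture if the two events
# fall within a configurable time window.

@dataclass
class SpeechInput:
    timestamp: float          # seconds
    semantics: dict           # e.g. {"act": "tell_about", "object": None}

@dataclass
class GestureInput:
    timestamp: float
    object_id: str            # one of the "referenceable objects" in the study

def fuse(speech: SpeechInput, gesture: GestureInput, window: float = 1.5) -> dict:
    """Fill the unresolved object slot from the gesture if temporally close."""
    merged = dict(speech.semantics)
    if merged.get("object") is None and abs(speech.timestamp - gesture.timestamp) <= window:
        merged["object"] = gesture.object_id
    return merged

speech = SpeechInput(10.2, {"act": "tell_about", "object": None})   # "what is this?"
gesture = GestureInput(10.6, "feather_pen")                          # user points
print(fuse(speech, gesture))   # {'act': 'tell_about', 'object': 'feather_pen'}
```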
The HCA PT2 was evaluated with 13 Danish children (age 11–16) speaking English (their second language) at NISLab in February 2005. Each session lasted 60–75 minutes.
This is a composite example from two actual conversations with HCA PT2, showing mixed-initiative dialogue, emotional response, and topic shifting:
From NICE Deliverable D7.2-2, Chapter 3.
The NICE Fairy-Tale World is an interactive 3D computer game where children and teenagers engage in spontaneous spoken and gestural dialogue with animated fairy-tale characters. What sets it apart from other dialogue systems — and from other games — is that spoken dialogue is the primary means of game control. There are no menus, no text commands. You just talk.
The Fairy-Tale World system was developed at TeliaSonera R&D in Stockholm, Sweden, led by Joakim Gustafson, Johan Boye, Mats Wirén, Linda Bell and Anders Lindström. The 3D world and character animation were created by the game company Liquid Media.
| Milestone | Details |
|---|---|
| Early 2003 | System design, character design (Liquid Media), message broker integration (KTH) |
| Jan 2004 | Prototype 1 (Scene 1 only: lab + Cloddy Hans). Tested at Technical Museum Stockholm with 10 children |
| Apr 2004 | D7.2b evaluation report; data analysis; design of Prototype 2 |
| Nov 2004 | Prototype 2 testing begins — "School" session with 31 children (Scene 1) |
| Dec 2004 | "Lab 1" session (11 teenagers, Scene 1, large display + gyro mouse) |
| Feb 2005 | "Lab 2" session (20 children, Scene 1+2 including Karen at the bridge) |
| Mar 2005 | "Lab 3" session (13 children, Scene 1+2). Corpus complete. |
| May 2005 | Final evaluation deliverable D7.2-2 submitted |
The fairy-tale world features five animated characters, each with a distinct personality encoded at every level — appearance, voice, vocabulary, gesture style, and long-term behaviour.
Loosely inspired by H.C. Andersen's tale "Klods-Hans" (often translated as "Clumsy Hans"). He is the user's faithful companion throughout the entire game. Seemingly slow-witted and not very bright, but honest, friendly, and anxious to help. His limitations are deliberate: they force the user to use multimodal dialogue to collaborate effectively.
OCEAN model: High Agreeableness, Low Conscientiousness, Low Openness. Users perceived him as kind, stupid, lazy, calm, polite.
A sullen, uncooperative girl who guards the drawbridge separating the island from the fairy-tale world. She is the obstacle — and the negotiation partner. Karen will only lower the bridge when given an object she deems acceptable. She will reject the first two offers regardless.
Users perceived her as smart, quick, self-confident, defiant. Deliberately designed to contrast with Cloddy Hans — creating dramatic tension and encouraging children to take sides.
Appears in the laboratory scene. Non-verbal — she points at slots in the fairy-tale machine to indicate where she wants objects placed, and expresses her satisfaction or discontent through large emotional body gestures when Cloddy Hans places objects correctly or incorrectly.
One of five animated characters developed for the full fairy-tale world. Feature characters have their own goals and drive sub-plots in the story, asking the user for help completing tasks in exchange for information or passage.
NICE introduced a three-way character taxonomy for interactive storytelling, extending the feature/supporting dichotomy from the literature:
Personality was not modelled as explicit internal parameters but used as a design guide to ensure consistent character behaviour perceived by users as compatible with intended traits. Every aspect of the characters — dialogue rules, wording, speaking style, non-verbal behaviour, idle animations — was designed around the OCEAN personality framework.
| Character | Openness | Conscientiousness | Extroversion | Agreeableness | Neuroticism |
|---|---|---|---|---|---|
| Cloddy Hans | Dunce | Thorough | Quiet · Uncertain | Friendly · Polite | Calm · Even-tempered |
| Karen | Intellectual | Frivolous | Outspoken · Self-confident | Unfriendly · Touchy | Anxious · Over-emotional |
| Character | Voice pitch | Speaking rate | Frequency range |
|---|---|---|---|
| Cloddy Hans | Low | Slow | Small |
| Karen | High | Fast | Large |
Voice actors were instructed to read utterances in personality-matching styles; all recordings were then resampled, slowing Cloddy Hans (so he sounds larger) and speeding up Karen (so she sounds younger). TD-PSOLA modified both pitch and rate simultaneously.
Both characters share the same underlying animation primitives — since both have been provided with identical animation sets it is possible to switch their non-verbal behaviours. The personality difference is expressed through the amplitude, speed, and frequency of identical gesture types, not through entirely different motion libraries. This was a key architectural decision enabling efficient character creation.
Karen never generates Problem or Safe utterances: by design, she always appears confident and in control, which makes her seem smarter than Cloddy Hans.
The game begins in Andersen's Copenhagen study in the 19th century. Andersen has just left on a trip to Odense, leaving Cloddy Hans to guard his fairy-tale laboratory. At the centre of the lab is the fairy-tale machine — forbidden to all except Andersen himself.
Cloddy Hans has sneaked an object into the machine and wants to surprise Andersen with a new fairy-tale. But he can't understand the symbols labelling the machine's four slots: "useful", "magical", "precious", "dangerous". He needs the user's help to figure out which object goes where. On the shelf: a key, hammer, diamond, magic wand, sword, ruby, axe, and more.
When Cloddy Hans pulls the lever, the machine activates — and a trapdoor drops him (and the user) into the fairy-tale world. They land on a small island with a vast gap between them and the rest of the world: windmills, villages, fields, all unreachable. The objects from the machine are now scattered in the grass.
On the far side of the gap stands Karen, guarding a raised drawbridge. She refuses to lower it — unless the user negotiates successfully. She wants one of the objects on the island, but she is picky: she will reject the first two offers, and what she ultimately wants changes each playthrough. The user must discover what she wants, convince Cloddy Hans to pick it up and bring it to her, and complete the exchange.
While the user talks to Karen (shown in camera), Cloddy Hans is out of view but still active. He receives BroadcastEvents from Karen, delivered because he is within audibility range even when off-camera, and uses these to fire contextual side-comments: hints, observations, and opinions.
This creates genuine three-party dialogue — user, Cloddy Hans, and Karen — with each character having distinct communicative roles.
Characters navigate between locations using triggers — 3D zones that fire events when entered. Walking into a trigger could cause a character to turn, speak, or initiate a new dialogue phase.
The following seven requirements guided the overall game design from the start of the project:
The concept of a story-functional event is central to measuring narrative progression: a key moment in the story that represents genuine advancement and cannot be undone. Progress is defined as the set of story-functional events achieved.
Scene 1 — The Laboratory
Scene 2 — The Fairy-Tale World
All 5,580 utterances were aligned at phoneme level. Duration z-scores were computed across all 85,000 phonemes. Key findings:
The Fairy-Tale World system is event-driven and asynchronous — enabling characters to be autonomous when no-one is talking to them and fully reactive when the user engages. All modules communicate via a central Message Dispatcher hub. Figure 3.5 (from NICE Deliverable D7.2-2) shows the complete architecture.
The system ran on three machines: a high-performance desktop running animation/rendering, speech synthesis, and gesture recognition; one laptop running ASR, the dispatcher, and the supervision GUI; a second laptop running NLU, dialogue managers, NLG, and animation planner.
Users interacted via a wireless headset microphone and a gyro mouse facing a large back-projection screen showing Cloddy Hans at life-size (180 cm tall).
For data collection, the system ran in supervised mode: an operator could approve the top suggestion, select from the N-best list, edit the ASR result, or type free text — enabling high-quality data collection while the system was still imperfect.
The NLU component addressed the interpretation of noisy, spontaneous speech from children. The parser works in two phases, described in detail in the section on robust spoken language understanding below.
"Pick up the axe" → request(user, cloddy, pickUp(cloddy, axe)). Concept accuracy: 53% from 1-best ASR, 70% from 10-best — a major improvement over raw ASR word accuracy of 39%.
Snacka, developed by Kåre Sjölander at KTH in collaboration with TeliaSonera, was one of the first unit-selection TTS systems built for Swedish. Cloddy Hans has a slow, deep voice (recordings slowed via TD-PSOLA); Karen has a faster, sharper delivery. The synthesiser outputs a time-stamped phone transcription that drives lip-sync and gesture coordination in the Liquid Media animation engine.
The NICE fairy-tale dialogue manager uses a two-layer architecture: a kernel providing generic dialogue management functionality common to all characters and situations, and scripting code that specialises the kernel for a particular character, scene, or phase. This mirrors game scripting patterns used in commercial game engines.
The dialogue manager operates on a stream of dialogue events — both external (from other modules) and internal (from the kernel's reasoning). Scripting code registers callbacks for events, creating asynchronous, flexible behaviour.
| Event Type | Description |
|---|---|
| ParserEvent | Parser analysed the user's latest utterance and produced a dialogue act |
| GestureEvent | Gesture interpreter identified an object the user pointed at (multimodal input) |
| RecognitionFailureEvent | ASR detected speech but could not recognise it |
| PerformedEvent | Animation system completed a requested action, carrying an ok or failed flag; essential for sequencing agenda goals |
| BroadcastEvent | Another character spoke or performed an action — only delivered if the sending character is within visibility range and/or audibility distance of the receiving character |
| TriggerEvent | Character entered or exited a 3D trigger zone while walking — used for contextual remarks, head-turns, waypoints, and camera changes |
| WorldEvent | A world-state change: object inserted into machine slot, bridge raised or lowered, etc. |
| TimeOutEvent | Silence threshold exceeded — character should prompt, hint, or take initiative |
| NoReactionEvent | Character's agenda is empty and no incoming events are pending — must decide whether to act proactively |
| IntentionEvent | Character has formed an intention to say or do something (internal) |
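The kernel/scripting split can be pictured as an event dispatcher with registered callbacks. The sketch below is a simplification using assumed names; the project's actual scripting interface is not shown in this document. Event names follow the table above, while the handler bodies are invented placeholders.

```python
from collections import defaultdict

# Minimal sketch of the kernel/scripting split: the kernel dispatches dialogue
# events to whatever handlers the current character's scripting code registered.

class DialogueKernel:
    def __init__(self):
        self.handlers = defaultdict(list)

    def on(self, event_type, handler):
        """Scripting code registers a callback for an event type."""
        self.handlers[event_type].append(handler)

    def post(self, event_type, **payload):
        """Deliver an event (external or internal) to all registered handlers."""
        for handler in self.handlers[event_type]:
            handler(payload)

# --- illustrative scripting code for Cloddy Hans, laboratory scene ---
kernel = DialogueKernel()
kernel.on("ParserEvent",
          lambda e: print(f"Cloddy reacts to dialogue act: {e['dialogue_act']}"))
kernel.on("RecognitionFailureEvent",
          lambda e: print("Cloddy: 'Sorry, I didn't quite catch that.'"))
kernel.on("TimeOutEvent",
          lambda e: print("Cloddy: 'Perhaps we should put something in the machine?'"))

kernel.post("ParserEvent", dialogue_act="request(user, cloddy, pickUp(cloddy, axe))")
kernel.post("TimeOutEvent")
```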
Each character maintains an inner state consisting of: a world model (beliefs about the 3D world), a discourse history (past interactions), and an agenda (tree-structured current goals and their causal relations).
Task specification uses declarative rules encoding action-world relationships, similar to STRIPS planning but with non-monotonicity: true propositions can become false (another character might move an object). Preconditions are always checked before execution. The agenda tree makes reasoning explainable — Cloddy Hans can say "I am going to the shelf because I want to pick up the axe because you asked me to."
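One way to picture the agenda is as a goal tree whose nodes record why they exist, with preconditions re-checked at execution time rather than assumed to persist. The sketch below is illustrative only; the class names and the dictionary world model are assumptions.

```python
# Illustrative agenda sketch: goals record why they were added (so the
# character can verbalise causal chains), and preconditions are re-checked at
# execution time, since another character may have changed the world.

class Goal:
    def __init__(self, action, because=None, preconditions=()):
        self.action = action
        self.because = because              # parent goal, or None for a user request
        self.preconditions = preconditions  # callables over the world model

    def explain(self):
        chain = [self.action]
        node = self.because
        while node is not None:
            chain.append(node.action)
            node = node.because
        return " because I want to ".join(chain) + " because you asked me to"

world = {"axe_location": "shelf", "cloddy_location": "machine"}

pick_up = Goal("pick up the axe",
               preconditions=(lambda w: w["axe_location"] == "shelf",))
go_to_shelf = Goal("go to the shelf", because=pick_up)

print(go_to_shelf.explain())
# go to the shelf because I want to pick up the axe because you asked me to

# Precondition re-check just before executing the parent goal:
world["axe_location"] = "karen"   # Karen grabbed it while Cloddy was walking
if not all(p(world) for p in pick_up.preconditions):
    print("Goal failed at execution time; replanning needed.")
```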
The character in focus is always in camera. The user can change dialogue partner — and camera — by saying the character's name ("Cloddy" / "Karen"). The system also triggers camera changes automatically, such as when Cloddy Hans reaches the bridge, causing the camera to swing to show Karen. Characters can make side-comments without being in camera (Cloddy whispering hints while Karen is on screen). This enables genuine three-party dialogue without parallel rendering.
Unlike the strict turn-alternation model used by virtually all spoken dialogue systems of the era — where user speech was the only input channel — NICE characters are driven by five independent event channels running simultaneously: (1) user dialogue acts from the parser; (2) completion signals from the character's own actions (PerformedEvent, ok/failed); (3) broadcast events from other characters, filtered by spatial visibility and audibility; (4) 3D trigger zones entered during movement; and (5) timeouts and empty-agenda signals. A character does not wait for the user to speak — any of these channels can push dialogue forward at any moment. This means the user can give a new instruction while Cloddy Hans is already walking, and when he arrives his PerformedEvent picks up the thread — creating dialogue that feels genuinely alive rather than turn-by-turn.
Data was collected across four sub-corpora over five months during 2004–2005, with the system developed iteratively between sessions.
| Condition | School | Lab 1 | Lab 2 | Lab 3 |
|---|---|---|---|---|
| Date | Nov–Dec 2004 | Dec 2004 | Feb 2005 | Mar 2005 |
| Location | School (small room) | TeliaSonera Vision Centre | TeliaSonera Multimodal Lab | TeliaSonera Multimodal Lab |
| Display | CRT, mouse | Large display, gyro mouse | Large display, gyro mouse | Large display, gyro mouse |
| Data | Audio, logs | Audio, video, logs | Audio, video, logs | Audio, video, logs |
| Gameplay | Scene 1 | Scene 1 | Scene 1+2 | Scene 1+2 |
| Age span | 8–11 | 14–15 | 9–10 | 11–12 |
| Net users | 26 | 7 | 15 | 9 |
Cloddy Hans rated as:
Karen rated as:
Cloddy Hans perceived as more…
Karen perceived as more…
Children produced rich, socially naturalistic dialogue behaviour previously unseen in human-machine interaction. Six such phenomena, documented with actual transcripts, are summarised below.
The conflict between Cloddy Hans and Karen was a deliberate plot device: each character tried to get the user to take their side.
Users were free to use their own words. Most picked up Cloddy Hans's terminology, and several reported that figuring out which words to use was itself part of the game.
Several users temporarily digressed from the task — then returned. This is impossible in menu-driven interfaces.
When Cloddy asked "why?", a user spontaneously explained the goal — motivating the character. Natural in human conversation, remarkable in HCI.
A user accused Cloddy Hans of still holding the diamond despite two denials. When proven wrong, they apologised — entirely natural social behaviour.
Users referenced shared dialogue history ("what was her name — now again?") and made spatial references to the 3D world as if it were real.
This journal paper presents and rigorously evaluates the spoken language understanding component of the NICE Fairy-Tale World. It describes a novel semantic representation formalism and a two-phase robust parsing algorithm designed to extract meaningful dialogue acts from the noisy, disfluent speech produced by young users, even when the ASR word error rate is as high as 61%.
User utterances are mapped to typed, tree-structured dialogue acts — a semantic representation that sits between the shallow slot-filler lists typical of task-oriented systems and the full logical forms of general-purpose NLP. The type system enforces constraints on argument positions, providing both robustness and expressiveness.
| Type | Example Utterance | Representation |
|---|---|---|
| request | "Pick up the axe" | request(user, cloddy, pickUp(cloddy, axe)) |
| ask | "What color is the ruby?" | λx.ask(user, cloddy, x[ruby.color=x]) |
| tell | "I'm fourteen years old" | tell(user, cloddy, 14[user.age=14]) |
| offer | "I'll give you the ruby" | offer(user, karen, ruby) |
| confirm | "Yes, do that" | confirm(user, cloddy) |
| disconfirm | "No! Stop!" | disconfirm(user, cloddy) |
| askForHelp | "What should we do?" | askForSuggestion(user, cloddy) |
| Action | Argument Structure |
|---|---|
| goTo | goTo(character, place) |
| pickUp | pickUp(character, thing) |
| putDown | putDown(character, thing, location) |
| giveTo | giveTo(character, thing, character) |
| raiseDrawbridge | raiseDrawbridge(character) |
| lowerDrawbridge | lowerDrawbridge(character) |
Type constraints — e.g. the first argument of pickUp must be a character, the second a thing — let the parser infer types and reject impossible parses without full logical reasoning.
Anaphoric utterances ("pick it up", "put it down there") are represented as functions over the dialogue context using typed lambda abstractions, a key innovation over slot-filler systems. For example, "pick it up" can be represented as λx.request(user, cloddy, pickUp(cloddy, x)), where x is later bound to a salient object of type thing during contextual interpretation.
The robust parser works in two phases. The key insight is that meaning can be recovered even when words are missing, inserted, or wrong — because the algorithm throws away words that don't match any pattern and keeps accumulating constraints from whatever does match.
Input: "take the ehh hammer"
The full rule take the hammer fails because of "ehh". But the shorter rule take matches. Then "the" and "ehh" are discarded one by one. Finally "hammer" matches as an object of type thing. Result: the correct constraints are extracted despite the disfluency.
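A toy version of this skip-and-match behaviour, assuming simple single-word rules (the real parser uses richer phrase patterns and typed constraints), might look like the following:

```python
# Toy sketch of the robust parser's skip-and-match phase: words that fit a
# rule contribute constraints, words that fit nothing are discarded, so a
# disfluency like "ehh" does not block interpretation. The rule inventory
# here is invented for the example.

RULES = {
    "take":   {"act": "pickUp"},
    "pick":   {"act": "pickUp"},
    "hammer": {"object": "hammer", "object_type": "thing"},
    "axe":    {"object": "axe", "object_type": "thing"},
    "shelf":  {"place": "shelf"},
}

def robust_parse(utterance: str) -> dict:
    constraints: dict = {}
    for word in utterance.lower().split():
        rule = RULES.get(word)
        if rule is None:
            continue              # "the", "ehh", ... are skipped
        constraints.update(rule)  # accumulate whatever does match
    return constraints

print(robust_parse("take the ehh hammer"))
# {'act': 'pickUp', 'object': 'hammer', 'object_type': 'thing'}
```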
The pattern-matching phase operates on the ASR 1-best hypothesis by default, but the paper shows that using the 10-best list can raise concept accuracy from 53% to 70% — a major potential gain for future systems.
The most common type. Infers from the utterance that an object of a certain type exists, e.g. the word "hammer" triggers the constraint that there is an object hammer of type thing.
Asserts two objects are different. e.g. "What is…" implies the user is asking someone else: ask(user, x, y) with x ≠ user
Relates objects to attributes of other objects. e.g. "Where is…" implies: y = z.position — the thing being asked about is the position of some object z.
The parser was evaluated on 3,400 utterances from the Swedish NICE corpus — spontaneous speech from children aged 9–15 collected over 4 sessions. All figures are based on unedited ASR output; the supervised-mode operator's corrections were not used for evaluation.
Word accuracy is the bottleneck. Even with a perfect oracle choosing the best hypothesis from the 10-best list, word accuracy only reaches 55%, reflecting how hard children's spontaneous speech is for ASR systems.
Concept accuracy measures how many semantic nodes in the dialogue act tree are correct — a finer-grained metric than sentence accuracy. The parser recovers substantial meaning even from very noisy input.
| Metric | Definition | Analogue |
|---|---|---|
| Sentence accuracy | % of utterances where ASR output exactly matches transcription | Standard ASR metric |
| Word accuracy | % of word tokens correctly recognised (accounting for substitutions, insertions, deletions) | Standard WER-based metric |
| Semantic accuracy | % of utterances where the parser output exactly matches the gold-standard dialogue act | Semantic analogue of sentence accuracy |
| Concept accuracy | % of tree nodes in the dialogue act correctly reconstructed by the parser | Semantic analogue of word accuracy |
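Concept accuracy, as defined in the table above, can be approximated by comparing the nodes of the parser's dialogue-act tree against the gold-standard tree. The sketch below makes simplifying assumptions (trees as nested tuples, node-by-position matching) and is not the paper's exact scoring procedure:

```python
# Simplified sketch of a concept-accuracy measure: the fraction of nodes in
# the reference dialogue-act tree that the parser's output tree reproduces at
# the same position. Trees are nested tuples: (label, child, child, ...).

def nodes(tree, path=()):
    """Yield (path, label) pairs for every node in the tree."""
    if isinstance(tree, tuple):
        label, *children = tree
        yield (path, label)
        for i, child in enumerate(children):
            yield from nodes(child, path + (i,))
    else:
        yield (path, tree)

def concept_accuracy(reference, hypothesis):
    ref = dict(nodes(reference))
    hyp = dict(nodes(hypothesis))
    correct = sum(1 for p, label in ref.items() if hyp.get(p) == label)
    return correct / len(ref)

# Reference: request(user, cloddy, pickUp(cloddy, axe))
reference  = ("request", "user", "cloddy", ("pickUp", "cloddy", "axe"))
# Parser output got the action right but the wrong object:
hypothesis = ("request", "user", "cloddy", ("pickUp", "cloddy", "hammer"))
print(concept_accuracy(reference, hypothesis))   # 0.833... (5 of 6 nodes correct)
```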
Parsing produces a context-independent semantic representation — the meaning of an utterance without resolving references to the dialogue context. A second step, contextual interpretation, resolves anaphora (pronouns, ellipsis) by applying the lambda-abstract dialogue acts to the most salient discourse entities.
The contextual interpretation step uses a focus management algorithm that maintains a ranked set of salient objects — the entities most recently mentioned or acted upon. When a pronoun ("it") or an elliptic utterance ("now the hammer") is encountered, the algorithm applies the lambda function to the most salient entity of the appropriate type.
Focus management is tightly integrated with the dialogue manager (described in D5.2b) rather than the parser, since salience depends on the full history of actions and utterances — not just the current turn's linguistic content.
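Putting the lambda-abstract representation and the salience ranking together, contextual interpretation can be sketched as applying the function to the most salient entity of the appropriate type. The data structures and ranking policy below are illustrative assumptions:

```python
# Illustrative sketch of contextual interpretation: an anaphoric dialogue act
# is a function over the context, and the focus manager supplies the most
# salient entity of the appropriate type. Salience ranking (most recently
# mentioned first) and entity typing are simplified assumptions.

TYPES = {"axe": "thing", "hammer": "thing", "karen": "character", "shelf": "place"}

class FocusManager:
    def __init__(self):
        self.salient = []                     # most recent first

    def mention(self, entity):
        if entity in self.salient:
            self.salient.remove(entity)
        self.salient.insert(0, entity)

    def most_salient(self, wanted_type):
        for entity in self.salient:
            if TYPES.get(entity) == wanted_type:
                return entity
        return None

focus = FocusManager()
focus.mention("karen")
focus.mention("hammer")     # "do you see the hammer?"

# "Pick it up"  ->  lambda x . request(user, cloddy, pickUp(cloddy, x))
pick_it_up = lambda x: f"request(user, cloddy, pickUp(cloddy, {x}))"

referent = focus.most_salient("thing")       # resolves "it" -> hammer
print(pick_it_up(referent))                  # request(user, cloddy, pickUp(cloddy, hammer))
```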
In a fast-paced game interaction, users naturally use pronouns and ellipsis: "pick up that one", "now the red one", "put it there" while pointing at the screen. A system that cannot handle anaphora forces users into unnatural, verbose instructions for every single action — breaking immersion and making the game feel like a command-line interface rather than a conversation.
Boye, J., Gustafson, J., & Wirén, M. (2006). "Robust spoken language understanding in a computer game." Speech Communication, 48(6), pp. 335–353. Elsevier. Special issue on spoken language understanding. DOI: 10.1016/j.specom.2005.06.015
This paper gives a detailed technical overview of the implemented dialogue manager in the NICE Fairy-Tale World — covering the character taxonomy, world representation, goal-based planning, message protocol, dialogue event system, scene/subscene structure, turn-taking, and asynchronous dialogue. It is the most complete published account of how the NICE dialogue engine actually works.
The paper introduces a three-way character taxonomy for interactive storytelling, extending the feature/supporting dichotomy from earlier work (Charles & Cavazza 2004) with a new third type — the helper character — that is central to the NICE design.
Drive the story. Have their own goals and desires. Represent obstacles or information sources. Require a full NLU and dialogue manager. Their agenda trees can be complex and deeply nested.
Provide atmosphere and react to events. Non-verbal or limited verbal output. Express emotions through body language. Do not drive the plot. Thumbelina points at machine slots and reacts with large body gestures when displeased.
A new type, not in prior literature. Guides and assists the user throughout the game. Has no goals of his own — only doing what the user asks and helping them when stuck. Requires a uniquely broad set of capabilities.
The paper identifies that helper characters demand a uniquely broad capability set — broader in some ways than feature characters:
Each fairy-tale character maintains a rich inner state that drives all its behaviour — both proactive (pursuing its own agenda) and reactive (responding to user input or world events).
The character's beliefs about the current state of the 3D world — represented as a set of interrelated objects with attributes and relations. Includes: where objects are located, who is holding what, which slots are filled, whether the bridge is raised or lowered, etc.
Key property: beliefs can be wrong. A character's world model is updated by perceptual events and broadcasts from other characters, but may lag behind reality.
A record of past interactions — what has been said, what has been done, what agreements have been reached. Used for anaphora resolution, detecting repeated requests, and deciding whether something has already been explained.
Enables Cloddy Hans to say "But I already told you — the axe is in the machine!" and to remember that the user previously asked him to go somewhere.
A set of tree-structures encoding the character's current goals, past and future actions, and their causal relationships. The trees can be deeply nested. Crucially, the tree encodes why as well as what:
These causal chains can be verbalized — Cloddy Hans can explain "I'm walking to the shelf because I want to pick up the axe."
Task specifications are sets of declarative rules encoding relationships between actions and world propositions. The dialogue manager uses these to build agenda trees — similar to classical STRIPS planning, but with a critical difference:
In classical STRIPS planning, once a proposition is established as true, it stays true. In the fairy-tale world, this assumption breaks down. For example:
available(axe) may be true when Cloddy Hans starts walking towards it, but become false if Karen grabs it first.
Solution: the system checks all necessary preconditions before executing any action, even those already found to be true at an earlier point. A precondition is only assumed to hold at the moment of checking, not persistently.
Certain key moments are designated story-functional events — irreversible milestones that mark genuine narrative progression. They cannot be undone. Examples:
Scene 1 (Lab)
Scene 2 (Fairy-Tale World)
All characters load the story-functional events at scene initialisation and add corresponding goals to their agendas. This allows Cloddy Hans to proactively guide the user towards unmet milestones when the user is stuck.
The dialogue manager communicates with the rest of the system through a strictly defined message protocol. It receives a stream of input messages and generates output messages.
| Message | Description |
|---|---|
| convey <dialogue_act> | Turn into words via NLG → TTS + animation |
| perform <action> | Send command to the Animation Planner |
Every observable character behaviour is produced through exactly one of these two primitives. convey produces speech; perform produces physical action.
| Message | Description |
|---|---|
| parserInput <dialogue_act> | User said something; parser result attached |
| gestureInput <object> | User pointed at a specific 3D object |
| recognitionFailure | User spoke; ASR could not interpret it |
| broadcast <message> | Another character spoke or acted — only delivered if sender is within visibility range and/or audibility distance |
| performed <id> <flag> | Character completed own requested action (ok/failed) |
| trigger <id> | Character moved into/out of a 3D trigger zone |
The performed message might seem redundant — after all, the character itself requested the action. But since some actions (like walking across the world) take considerable time, and actions may fail for unexpected reasons, the character cannot treat an action as complete until the animation system explicitly confirms it. This feedback is essential for correctly sequencing agenda goals.
Event-based dialogue progression — the five input channels that drive dialogue forward in NICE
At the time NICE was developed (2002–2005), virtually all spoken dialogue research — whether in phone-based information systems, in-car assistants, or early conversational agents — followed the same fundamental model: the user speaks, the system produces a response, and so on in strict alternation. The only currency was user speech, and the only driving force was the parser's output from that speech.
The NICE dialogue engine discarded this assumption entirely. Each character's dialogue manager is driven by a stream of heterogeneous events arriving asynchronously from five independent sources. The character's scripting code responds to whichever event arrives next, regardless of whether the user has spoken. This makes it possible for characters to behave proactively, to react to each other, and to interact with the 3D world around them — in a way that had no precedent in spoken dialogue systems.
When the ASR recognises speech and the NLU parser succeeds, a ParserEvent is delivered carrying a fully interpreted dialogue act — e.g. instruction(go, drawbridge) or question(location, ruby). The scripting code for the current scene and phase catches this event, performs contextual interpretation (resolving pronouns, ellipsis, and screen references), and determines the appropriate response from the character's repertoire.
A RecognitionFailureEvent fires when the user speaks but ASR fails — characters handle this gracefully rather than silently failing. A GestureEvent fires when the user clicks on a 3D object (multimodal input: "pick this up" + click).
When a character performs a physical action — walking to a location, picking up an object, pulling a lever — the animation system executes it asynchronously. It reports back via a PerformedEvent carrying an ok or failed flag. This event is essential for sequencing: the character can only say "I'm at the machine, what now?" after actually arriving.
Because walking can take considerable time, the user may have issued new instructions in the meantime. The PerformedEvent lets the agenda cleanly resolve whether those intervening requests should now be acted upon or whether to resume the original goal chain.
Whenever any character speaks or performs an action, it broadcasts that event to all other characters. Each receiving character's dialogue manager fires a BroadcastEvent — but only if the sending character is currently within visibility range and/or within audibility distance of the receiving character. The system tracked spatial relationships between characters in the 3D world to enforce this filter.
This is what enables Cloddy Hans to whisper side-comments during the Karen negotiation ("Don't trust her!") even while off-camera, and for Karen to react to things Cloddy does across the world only if she could plausibly have seen or heard them. Characters could thus be genuinely surprised by events they had no line of sight to.
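The spatial filter on broadcasts can be approximated with a distance check against per-character audibility and visibility ranges. The positions, range values, and function names below are invented for the sketch:

```python
import math

# Sketch of BroadcastEvent filtering: a character only receives another
# character's broadcast if the sender is within its audibility distance
# (and, for purely visual events, within visibility range). The positions
# and range values are invented for illustration.

POSITIONS = {"cloddy": (2.0, 0.0), "karen": (8.0, 1.0)}
AUDIBILITY_RANGE = 10.0   # metres, assumed
VISIBILITY_RANGE = 15.0   # metres, assumed

def distance(a, b):
    return math.dist(POSITIONS[a], POSITIONS[b])

def deliver_broadcast(sender, receiver, spoken=True):
    """Return True if the receiver should get the sender's BroadcastEvent."""
    d = distance(sender, receiver)
    return d <= (AUDIBILITY_RANGE if spoken else VISIBILITY_RANGE)

# Karen speaks; Cloddy is off-camera but close enough to overhear:
print(deliver_broadcast("karen", "cloddy", spoken=True))    # True
```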
The virtual world is populated with invisible trigger zones — 3D volumes defined by coordinates. As a character walks through the world, the animation system monitors their position and fires a TriggerEvent each time they enter or exit a zone. The dialogue manager uses these to make characters react to their immediate surroundings in real time.
If no other event arrives within a configurable time window, a TimeOutEvent fires. Characters use this to decide that the user needs prompting — perhaps they are confused, distracted, or have simply stopped speaking for too long. The response is determined by the current scene and agenda state: a first timeout might produce a gentle hint ("Perhaps we should go to the drawbridge?"); a repeated timeout might escalate to a more explicit suggestion, or Cloddy Hans might use his knowledge of unmet story-functional events to guide the user towards the next goal.
The NoReactionEvent is a related internal event fired when the agenda is empty and no incoming events are pending — the character has nothing queued and must decide whether to take initiative. Together, timeouts and NoReactionEvents are what keep the game from stalling when a child doesn't know what to say or do next.
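The escalation from a gentle hint to explicit guidance grounded in unmet story-functional events can be pictured as follows; the hint wording, milestone names, and thresholds are placeholders rather than project content:

```python
# Sketch of timeout handling: repeated silences escalate from a gentle hint
# to explicit guidance derived from unmet story-functional events.

UNMET_MILESTONES = ["negotiate_with_karen", "lower_drawbridge"]

HINTS = {
    "negotiate_with_karen": "Maybe we should ask Karen what she wants before she lowers the bridge?",
    "lower_drawbridge": "We still need to get that drawbridge down somehow.",
}

def on_timeout(timeout_count: int) -> str:
    """Choose Cloddy Hans's prompt after the user has been silent too long."""
    if timeout_count == 1:
        return "Perhaps we should go to the drawbridge?"
    # Repeated timeouts: steer towards the next unmet story-functional event.
    if UNMET_MILESTONES:
        return HINTS[UNMET_MILESTONES[0]]
    return "What would you like to do next?"

print(on_timeout(1))
print(on_timeout(2))
```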
Underlying all five event channels is the character's agenda — a set of goal-trees whose content is determined by the current scene and subscene. When a scene initialises, each character loads the set of story-functional events for that scene and populates their agenda accordingly. This means the character knows at all times what still needs to happen for the narrative to progress, and can use that knowledge to decide how to respond to any incoming event.
At scene initialisation, story-functional milestones are added as top-level agenda goals. As events arrive and goals are fulfilled, the agenda shrinks. The NoReactionEvent fires when it is empty — signalling that the scene is complete or the character must find something proactive to do. Cloddy Hans uses the remaining agenda to generate unprompted hints ("We still haven't spoken to Karen about the bridge").
Unlike classical STRIPS planning, the agenda does not assume that satisfied preconditions remain satisfied. When Karen grabs the axe just as Cloddy Hans was walking to pick it up, his goal fails and must be replanned. Each precondition is re-checked at execution time, not assumed true from when it was last verified. This is essential in a dynamic multi-character world where any character can change the state of any object.
| Event | Triggered when… |
|---|---|
| ParserEvent | Parser delivers analysis of user's utterance |
| GestureEvent | User pointed at a 3D object |
| RecognitionFailureEvent | ASR detected speech but could not recognise it |
| PerformedEvent | Animation system completed a requested action (ok / failed) |
| BroadcastEvent | Another character spoke or acted (filtered by visibility & audibility) |
| WorldEvent | A world-state change (bridge raised/lowered, object inserted in machine…) |
| TriggerEvent | Character entered or exited a 3D trigger zone while walking |
| Event | Meaning |
|---|---|
| IntentionEvent | Character has formed an intention to say or do something |
| NoReactionEvent | Agenda empty — character must decide whether to take initiative |
| TimeOutEvent | Silence threshold exceeded — prompt or hint |
| AlreadySatisfiedEvent | A goal already true was added to the agenda |
| CannotSolveEvent | An unsolvable goal was added to the agenda |
| PossibleGoalConflictEvent | New goal conflicts with an existing agenda item |
| RequestEvent | Derived from ParserEvent — user made an instruction |
| QuestionEvent | Derived from ParserEvent — user asked a question |
This step-by-step trace shows how a single user instruction flows through the entire event/scripting system:
interpret operation → contextual interpretation resolves references
convey(acknowledge) → Cloddy says "OK, I'll do that" · Then calls addGoal(standNextTo(cloddy, machine))
perform(walkTo(machine)) → Cloddy Hans starts walking. TriggerEvents fire as he passes objects of interest along the route.
The game is divided into scenes and subscenes (and sub-subscenes), forming a hierarchical dramatic structure. Scenes map to game levels; scene transitions mark significant narrative events. This hierarchy is entirely modular — new scenes can be added without touching the kernel.
The dialogue management kernel provides generic functionality — input handling, reference resolution, goal reasoning, event dispatching. The scripting code per character per scene per phase provides the specific personality and behaviour. Adding a new scene means writing new scripting code; the kernel is unchanged.
Cloddy Hans tells the user about the fairy-tale world. Story-functional events: Cloddy introduces the world; Karen introduces herself; both characters give their opinions of each other.
The player and Cloddy Hans explore the island together, finding the fairy-tale objects from Scene 1 scattered in the grass. Camera and dialogue are entirely user-driven.
Starts when the player meets Karen. Multi-party negotiation: user tries to find what Karen wants, Cloddy Hans makes side-comments, Karen rejects first two offers unconditionally. Rich three-party dialogue.
After successful negotiation. Cloddy Hans delivers the agreed object to Karen. The drawbridge is lowered. Story-functional event: Cloddy crosses the bridge and pays Karen. Scene complete.
At any moment exactly one subscene phase is active. Phases transition on events — for example, the Negotiation phase begins the first time the user speaks to Karen. The phase can also change based on geographic location (which trigger zone the character is in) or on character actions. This allows non-linear story progression within the overall scene structure.
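A subscene phase machine of this kind can be sketched as a handful of named phases with event-driven transitions. The phase names follow the scene descriptions above, while the transition conditions are simplified assumptions:

```python
# Sketch of event-driven subscene phases for Scene 2. Phase names follow the
# descriptions above; the transition logic is a simplified assumption.

PHASE = "exploration"

def on_event(event: dict) -> None:
    """Advance the subscene phase when a qualifying event arrives."""
    global PHASE
    if (PHASE == "exploration" and event.get("type") == "ParserEvent"
            and event.get("addressee") == "karen"):
        PHASE = "negotiation"            # first time the user speaks to Karen
    elif (PHASE == "negotiation" and event.get("type") == "WorldEvent"
            and event.get("change") == "offer_accepted"):
        PHASE = "delivery"               # Cloddy brings the agreed object
    elif (PHASE == "delivery" and event.get("type") == "WorldEvent"
            and event.get("change") == "bridge_lowered"):
        PHASE = "scene_complete"

on_event({"type": "ParserEvent", "addressee": "karen"})
print(PHASE)   # negotiation
```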
Unlike most spoken dialogue systems — which use a strict turn-alternation model where user and system must speak one at a time — the NICE dialogue manager is fully asynchronous. This is one of its most important innovations.
A trigger is a three-dimensional zone with specific coordinates in the virtual world. When a character enters or exits a trigger, a TriggerEvent is fired to its dialogue manager. Uses include contextual remarks about nearby objects, head-turns, navigation waypoints, and camera changes.
The paper identifies only one comparable system: the Mission Rehearsal Exercise (MRE) from the USC Institute for Creative Technologies (Swartout et al. 2004). MRE has more complex character-character interactions and a more sophisticated emotion model. However, MRE operates in a military domain with more codified language, whereas NICE targets spontaneous children's speech in a domain with negotiation and social interaction. To the authors' knowledge, NICE and MRE were the only two multi-party spoken dialogue systems in existence at the time of publication.
Boye, J., & Gustafson, J. (2005). "How to do dialogue in a fairy-tale world." In Proceedings of the 6th SIGdial Workshop on Discourse and Dialogue, Lisbon, Portugal.
This paper describes the design and implementation of Snacka — a new corpus-based unit-selection speech synthesis system for Swedish developed specifically for the NICE project. It was one of the first unit-selection TTS systems built for Swedish, and the first designed from the ground up to convey animated character personality in a real-time interactive application. The paper covers the unique requirements that a fairy-tale game domain places on speech synthesis, the corpus design, the system architecture, voice transformation, and early evaluation results.
The central challenge driving the development of a new TTS system was straightforward: the NICE characters engage in open-ended clarification dialogues that reference any combination of objects, slots, actions, and dialogue contexts. This makes pre-recording all possible utterances impossible — a unit-selection synthesiser is the only viable path to both quality and coverage.
Existing off-the-shelf Swedish TTS was not suitable for several reasons: generic voices have no personality; commercial systems are not designed for real-time integration into game engines; and at the time of development (2003–2004), no suitable Swedish unit-selection system was available that met the project's requirements. The solution was to build Snacka — a completely new corpus-based unit-selection synthesiser extending the open-source Snack sound toolkit by co-author Kåre Sjölander (KTH).
Cloddy Hans was voiced by Mattias Heldner, a speech scientist at KTH. The voice actor was instructed to read all corpus utterances in a style matching Cloddy Hans's intended personality — friendly, uncertain, and slow — creating a performance-driven foundation that the acoustic transformations then further shaped to match the character's visual appearance.
The defining architectural principle of Snacka is symmetry between analysis and synthesis: every module used to analyse the recorded corpus is also used during synthesis. This is critical for quality — it ensures the unit inventory and the synthesis output are described in exactly the same terms, maximising the chance that available units match what synthesis needs.
A dedicated quality assessment tool was built on top of WaveSurfer. It allows simple point-and-click synthesis of typical domain sentences. When a faulty utterance is found, the tool can be used to inspect the automatic segmentation, correct errors, and prune bad units — in most cases the utterance can then be re-synthesised instantly without rebuilding the entire voice. This rapid iteration cycle was essential for achieving acceptable quality in a research timeline.
The corpus was designed to cover everything the system needed to say as well as everything it was designed to understand — a key symmetry principle. Since the system handles clarification dialogues about arbitrary object/slot combinations, utterances were carefully designed to cover all grammatical positions and emphasis patterns.
To allow Cloddy Hans to ask targeted clarification questions, all domain utterances were recorded with objects and slots in both medial and final positions:
All utterances were also tagged with prosodic phrase boundaries and emphasized words before recording, and recorded in all combinations of position and emphasis — giving the synthesiser the full range of prosodic variation needed for natural clarification dialogues.
In addition to domain-specific utterances, a rich set of task-independent dialogue acts was recorded — covering all the ways Cloddy Hans needs to regulate turns, handle errors, and express attitude. These are used across all tasks and plots.
The turn regulation utterances and attitudinal feedback serve two purposes: (1) to buy time while the system generates Cloddy Hans's next utterance, and — more importantly — (2) to convey his uncertain, hesitant personality. A confident, fast character would not produce filler words and sighs; Cloddy Hans's acoustic hesitancy is as much a personality signal as his slow walking speed.
Making different characters sound like themselves — not just like different speed settings of the same voice — required a combination of directed recording and post-hoc acoustic transformation.
Voice actors were given explicit personality profiles and instructed to read all corpus utterances in a performance style that matched the target character. For Cloddy Hans, the instruction was to sound friendly, uncertain, and slow — producing a baseline recording already shaped by character intent.
After recording, further acoustic shaping was applied using the Time-Domain Pitch-Synchronous Overlap-Add (TD-PSOLA) algorithm, a well-established technique for modifying speech prosody without full re-synthesis: Cloddy Hans's recordings were slowed down (lower and more ponderous), while Karen's were sped up (higher and quicker).
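The simpler resampling transformation mentioned elsewhere in this article (playing the same samples back at a different rate, which shifts pitch and tempo together) can be illustrated as below. This is a conceptual sketch only; the project's TD-PSOLA processing allows finer, pitch-synchronous control.

```python
# Sketch of simple resampling: stretching or compressing the sample sequence
# and playing it back at the original sample rate changes speaking rate and
# pitch together (slower and deeper for Cloddy Hans, faster and higher for
# Karen). The toy waveform and factors are invented for the example.

def resample(samples: list[float], factor: float) -> list[float]:
    """Stretch (factor > 1) or compress (factor < 1) a signal by linear
    interpolation; duration scales by `factor`, pitch by roughly 1/factor."""
    n_out = int(len(samples) * factor)
    out = []
    for i in range(n_out):
        pos = i / factor
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append((1 - frac) * samples[lo] + frac * samples[hi])
    return out

original = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]   # one cycle of a toy wave
cloddy = resample(original, 1.25)   # longer signal: slower, lower-pitched
karen  = resample(original, 0.8)    # shorter signal: faster, higher-pitched
print(len(original), len(cloddy), len(karen))   # 8 10 6
```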
The approach was grounded in empirical research on synthetic voice personality. Nass & Lee had shown that providing a synthetic voice with stereotypical extrovert vocal features — high loudness, increased pitch, wide frequency range, fast speaking rate — caused users to describe it using extrovert personality adjectives. NICE applied this in reverse: design the vocal parameters to match the target personality, and users will perceive the character accordingly.
At the time of the workshop paper, the synthesiser was already deployed for user testing at the Telecommunication Museum in Stockholm. Adults and children interacted with a full-size Cloddy Hans on a large back-projected video wall, in semi-automatic mode (human wizard available to override). Users were asked post-interaction about their perception of:
Future work identified at the time: improved prosodic models, and experimenting with making repetitions acoustically different from original utterances — so that repeated attempts at the same phrase don't sound identical.
One of Snacka's most important features for the NICE application was its tight integration with the animation system — achieved through a custom XML markup language that carried synthesis instructions, prosodic markup, and gesture coordination tags in a single pass.
The dialogue system outputs XML-tagged text to the synthesiser. A real example from the paper:
| Tag | Function |
|---|---|
| <EMPH> | Mark a word for prosodic emphasis — synthesiser increases pitch and energy on the emphasised word |
| <PointAt object="X"> | Coordinate synthesis with a pointing gesture at object X in the 3D world — triggers animation system |
| <PointAt slot="Y"> | Coordinate synthesis with a pointing gesture at slot Y in the fairy-tale machine |
| Prosodic phrase markup | Mark phrase boundaries for more natural prosody across longer utterances |
| Extralinguistic tags | Trigger recorded non-speech sounds (cough, laughter, sigh) at specific points |
| Meta-information tags | Carry additional context used by the animation planner |
In addition to the audio output, the synthesiser produces a time-stamped phone transcription containing:
This output is fed directly to the Liquid Media animation system, which drives Cloddy Hans's lip movements and body gestures in precise synchrony with the speech output.
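The synchronisation track can be thought of as a list of time-stamped phones that the animation engine turns into mouth-shape keyframes. The phone labels, timings, and viseme mapping below are invented for the illustration:

```python
# Sketch of consuming a time-stamped phone transcription for lip sync: each
# phone is mapped to a mouth shape (viseme) with its start time, so the
# animation engine can schedule mouth movements in step with the audio.

PHONE_TO_VISEME = {
    "m": "closed_lips", "a": "open_wide", "t": "tongue_tip",
    "s": "narrow", "sil": "rest",
}

phone_track = [            # (phone, start_seconds, end_seconds)
    ("m", 0.00, 0.08),
    ("a", 0.08, 0.22),
    ("t", 0.22, 0.30),
    ("a", 0.30, 0.45),
    ("sil", 0.45, 0.60),
]

def viseme_schedule(track):
    """Turn the synthesiser's phone timing into animation keyframes."""
    return [(start, PHONE_TO_VISEME.get(phone, "rest")) for phone, start, _ in track]

for t, viseme in viseme_schedule(phone_track):
    print(f"{t:0.2f}s  ->  {viseme}")
```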
Most TTS systems output audio only. Snacka outputs audio plus a complete synchronisation track — enabling Cloddy Hans to point at the axe precisely as he says "that", and to gesture towards the useful slot as he says "there". Without this tight coupling, lip sync would slip, gestures would arrive late, and the character would lose believability. This integration of synthesis and animation was a key technical contribution of the NICE project.
Gustafson, J., & Sjölander, K. (2004). "Voice creation for conversational fairy-tale characters." In Proceedings of the 5th ISCA Speech Synthesis Workshop (SSW5), Pittsburgh, PA.
This paper presents the complete NICE Fairy-Tale World system — the game scenario, the characters' conversational requirements, the animation system designed by Liquid Media, the voice creation approach, and the results of the user studies with children. It argues that spoken dialogue technology has the potential to greatly enrich future computer games, and demonstrates a working system in which children used speech as the primary means of game progression.
Personality traits were not explicitly modelled as internal variables, but used as design guidelines to ensure that every aspect of a character's behaviour — appearance, voice, wording, speaking style, non-verbal gestures — was perceived as consistent with the intended personality. The characters were designed around contrasting OCEAN (Big Five) trait profiles.
Slow, deep voice. Lumbering walk. Honest, anxious to help, no agenda of his own. His limitations force the user to cooperate using multimodal dialogue. His apparent stupidity is deliberate — it makes the game harder in a way users perceive as intentional design.
Perceived by users as: Kind · Stupid · Lazy · Calm · Polite · Distressed
Fast, sharp voice — made to sound younger by speeding up the recordings. Drives the negotiation, rejects every offer for the first two attempts, and makes side-comments about Cloddy Hans. She appears smarter partly because she can "see" what Cloddy brings without needing to understand the user's speech.
Perceived by users as: Smart · Quick · Self-confident · Defiant · Secretive
Karen's animated face conveys a wide range of emotions through the phonemic, intonational and emotional channels described in the paper. Facial actions — eyebrow position, mouth shape, gaze direction — communicate her sullen, uncooperative personality even before she speaks.
Cloddy Hans's animation repertoire spans autonomous idle behaviour (walking, looking around) and reactive dialogue behaviour (picking up objects, gesturing, celebrating). His slow, lumbering movement style — deliberate in the animation design — reinforces his dunce personality at every frame.
The animation system designed by Liquid Media was central to making the characters feel alive and reactive. It consisted of two main modules — the Animation Handler and the Animation Renderer — operating asynchronously within the event-driven architecture.
A bespoke unit-selection Swedish TTS was developed in cooperation with KTH. Voice actors were told to read utterances in a style matching each character's intended personality. The recordings were then processed to further accentuate the personality differences.
The simple procedure of resampling all utterances — simultaneously changing speaking rate and voice pitch — had a desirable emergent property: the personality differences were visually reinforced. Cloddy Hans's slower, lower voice matched his large, lumbering body; Karen's faster, higher voice matched her small, quick character design. The audio and visual channels reinforced each other without requiring complex acoustic modelling.
One of the most striking findings from the corpus analysis was how children adapted their speaking rate dynamically across the two scenes and across the two different character interlocutors:
Karen was designed so she could respond to what Cloddy Hans physically brought her — she could "see" the object — without needing to parse the user's speech. This meant she appeared far more intelligent than Cloddy Hans (who needed correct verbal instructions) even though her underlying language understanding was actually simpler. This asymmetry was a deliberate design choice that also helped cover for ASR failures: Karen's reliable responsiveness made the system feel more robust overall.
Gustafson, J., Boye, J., Fredriksson, M., Johannesson, L., & Königsmann, J. (2005). "Providing Computer Game Characters with Conversational Abilities." In Intelligent Virtual Agents (IVA 2005), Lecture Notes in Computer Science, Springer, Kos, Greece.
Watch recordings from both NICE systems — system demonstrations, user interactions, and game footage from the fairy-tale world.
The NICE project produced over 15 research papers across international venues in speech technology, dialogue systems, human-computer interaction, and intelligent virtual agents.
The NICE Final Report (May 2005) reflected on the project's broader significance and each partner's plans to build on the work. The consortium's conclusion was unambiguous: all the original challenges had been satisfactorily met, and the systems pointed towards entirely new categories of interactive application.
The NICE project attracted attention from industry press covering the emerging intersection of speech technology and video games.