
Natural Interactive Communication for Edutainment
NICE was a European Union Human Language Technologies research project (IST-2001-35293) that ran from 2002 to 2005. Its goal was to push the boundaries of what computers could do in natural, spoken conversation — and to deploy that technology in an ambitious, playful context: letting children and teenagers talk to fairy-tale characters brought to life in a 3D computer game.
The project united five partner institutions across four countries, developing two distinct interactive systems, both inspired by the universe of Danish author Hans Christian Andersen. One system let users hold free-ranging, domain-oriented conversation with an embodied, animated HCA himself. The other placed users inside his fairy-tale world, where they had to cooperate with — and negotiate against — animated story characters using spontaneous spoken and gestural dialogue.
What made NICE unusual was its deliberate move away from strictly task-oriented dialogue. Real human conversation is unpredictable, socially rich, and full of personality. NICE set out to demonstrate that speech technology could support exactly that — not just information retrieval, but character, emotion, storytelling, and collaborative problem-solving.
— Gustafson et al., SIGdial 2004
When NICE began in March 2002, the consortium identified four challenges that had never been satisfactorily met in any existing system:
Before any prototype system existed, NICE partners collected extensive Wizard-of-Oz (WoZ) data — human-simulated interactions that provided training material for speech recognisers and insights into how children actually talk when addressing these characters.
The detailed specification for the first HCA prototype was simulated Wizard-of-Oz-style over two weeks at the Hans Christian Andersen Museum in Odense, Denmark. Roughly 30 hours of English conversation were recorded and approximately 500 dialogues transcribed. Young museum visitors (aged 9–18) talked to a cartoon-like animated HCA face via a headset microphone, while the wizard controlled HCA's responses and lip movements in real time.
These recordings unexpectedly became a gold standard for natural human conversation: the wizards' thoroughly human response style revealed conversation patterns (topic shifts, handling of slow users, graceful follow-along) that were later built into the second prototype (PT2).
TeliaSonera set up an animated speaking agent in a museum exhibition to continuously collect children's speech data from visitors. By December 2002, roughly 20,000 utterances had been recorded and transcribed, 40% of them from users aged 15 or under.
A separate Lo-Fi WoZ session tested an early FTW prototype: 16 adult subjects (300 utterances) interacted with Cloddy Hans on a massive 3×5 metre screen, helping him solve a simple visual puzzle. This data informed the FTW system design and speech recogniser language models.
LIMSI conducted WoZ experiments with a simple multimodal game featuring 2D cartoon characters. 7 adults and 10 children interacted using speech and pen gestures on an interactive pen display. Sessions were video-recorded and annotated using behavioural metrics and questionnaires. This early data informed the design of the gesture recognition, interpretation, and input fusion modules.
The HCA museum WoZ data yielded an unexpected methodological insight: even though the wizards followed the detailed system specification, they did so in a thoroughly human way — understanding users' full communicative intention even when it was expressed obliquely across multiple turns. The resulting transcripts revealed that the simulated HCA "virtually always followed the user" when the user changed topic or domain — and this naturalness was later built into the second HCA prototype's conversation manager as a design principle.
Children's speech recognition was one of NICE's core technical challenges. ScanSoft (originally Philips Speech Processing) developed dedicated approaches for both the English (HCA) and Swedish (FTW) systems.
Training from scratch was not feasible with limited in-domain data. Instead, a commercial adult-speech acoustic model was adapted using two combined techniques:
Sufficient NICE-specific data allowed training from scratch. Key complication: most users were non-native English-speaking Danish children, producing high noise and hesitation rates. Solution:
Potamianos (2003) achieved similar improvements (~45% relative WER reduction) with 2–3× more data. NICE's results were therefore at or beyond the published state-of-the-art despite tighter data constraints.
For the Fairy-Tale World, TeliaSonera developed a bespoke unit-selection speech synthesis system from scratch for Swedish — giving each character a distinct voice that conveyed their personality. This went far beyond standard TTS in scope and design.
For the HCA system, the English TTS was an off-the-shelf AT&T synthesiser selected for voice quality and suitability for a 55-year-old male character. Eleven visemes were used for lip synchronisation, with up to 17 non-verbal behaviour primitives per output turn.
NICE brought together expertise from dialogue systems, human-computer interaction, speech technology, computer animation, and games across four European countries.
NICE produced two complementary systems, each tackling the challenge of natural dialogue from a different angle. Both were tested with real children.
Users hold free-form conversation with an animated, life-like Hans Christian Andersen in his 19th-century Copenhagen study. Andersen knows about his life, his fairy tales, his surroundings — and reacts emotionally to the user.
A 3D computer game where children use speech and gesture to cooperate with Cloddy Hans and negotiate with Karen to progress through scenes inspired by Andersen's fairy tales.
The HC Andersen system aimed to demonstrate domain-oriented conversation — a form of interaction halfway between strict task-oriented dialogue and full Turing-test conversation. Users could talk to HCA about anything within his knowledge domains, in any order, with mixed initiative, just as in real social conversation.
HCA appeared in his 19th-century Copenhagen study (Figure: his virtual writing desk surrounded by pictures, a feather pen, travel bag and personal objects). Users could speak to him freely in English and point at objects in his study using a touch screen or mouse to prompt him to tell stories about them.
HCA had knowledge of: his fairy tales (The Little Mermaid, The Ugly Duckling, etc.), his life and childhood in Odense, his study and its objects, the user (he asked about age, name, country, interests), and meta-communication to handle errors and keep conversation flowing.
Core traits (from the project specification): imaginative, sensitive, extrovert, around 55 years old, and "a child inside".
Unlike task-oriented systems where the goal is to complete a booking or answer a query, HCA had no task constraints. Users could jump between topics, back-channel, ask personal questions, or even be rude. The system aimed to make HCA feel like a real character from history, not a database. The target was to produce genuine conversation for edutainment — learning about Andersen's life and era while being entertained.
The HCA system used an event-driven, modular, asynchronous architecture with modules communicating through the KTH Message Broker — an open-source TCP server that routed calls and time-stamped all events.
The NLU pipeline consisted of: (1) a Keyphrase Spotter that mapped phrases to syntactic/semantic categories; (2) a Syntactic Analyser with number spotting, lexicon lookup, and rule engine; (3) a Domain/Topic Spotter mapping categories to knowledge domains; and (4) an FSA Processor (Finite State Automaton) for deep parsing. The vocabulary had ~1977 word forms in PT2.
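As a rough illustration of how the first and third stages might operate (not the project's actual code; the phrase inventory, category names, and domain labels below are invented for the example), a minimal keyphrase-and-domain spotter in Python could look like this:

```python
# Minimal sketch of keyphrase spotting followed by domain spotting, loosely
# mirroring stages (1) and (3) of the HCA NLU pipeline described above.
# The phrase inventory, categories, and domain names are illustrative only.

KEYPHRASES = {
    "little mermaid": "FAIRYTALE_TITLE",
    "ugly duckling": "FAIRYTALE_TITLE",
    "your mother": "FAMILY",
    "feather pen": "STUDY_OBJECT",
    "how old are you": "PERSONAL_QUESTION",
}

DOMAINS = {
    "FAIRYTALE_TITLE": "works",
    "FAMILY": "life",
    "STUDY_OBJECT": "study",
    "PERSONAL_QUESTION": "user",
}

def spot(utterance: str) -> list[tuple[str, str]]:
    """Return (category, domain) pairs for every key phrase found."""
    text = utterance.lower()
    hits = []
    for phrase, category in KEYPHRASES.items():
        if phrase in text:
            hits.append((category, DOMAINS[category]))
    return hits

print(spot("Can you tell me about the Little Mermaid?"))
# [('FAIRYTALE_TITLE', 'works')]
```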
The heart of HCA's intelligence. Three output states: Non-Communicative Action (HCA working alone in his study), Communicative Function (real-time awareness of user input), and Communicative Action (HCA's actual conversational turns). Domain Agents handled specific knowledge areas (life, works, study, user, meta). The Emotion Calculator updated HCA's emotional state on each user turn. Up to 17 non-verbal behaviour primitives per output turn.
Three gesture shapes were recognized: pointer (pointing), connect (line), surrounder (circle). The Gesture Interpreter identified which of 16–21 "referenceable objects" in the study the user gestured at. Input Fusion merged speech semantics and gesture using temporal coincidence with configurable time windows.
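The fusion rule can be pictured as follows: a deictic spoken reference is merged with a gesture when the two arrive within a configurable time window. This is a hypothetical sketch; the field names and the 1.5-second window are assumptions, not values from the NICE system.

```python
from dataclasses import dataclass

# Illustrative sketch of temporal input fusion: an unresolved spoken reference
# ("this", "that one") is filled from a pointing gesture if the two events
# fall within a configurable time window.

@dataclass
class SpeechInput:
    timestamp: float          # seconds
    semantics: dict           # e.g. {"act": "tell_about", "object": None}

@dataclass
class GestureInput:
    timestamp: float
    object_id: str            # one of the "referenceable objects" in the study

def fuse(speech: SpeechInput, gesture: GestureInput, window: float = 1.5) -> dict:
    """Fill the unresolved object slot from the gesture if temporally close."""
    merged = dict(speech.semantics)
    if merged.get("object") is None and abs(speech.timestamp - gesture.timestamp) <= window:
        merged["object"] = gesture.object_id
    return merged

speech = SpeechInput(10.2, {"act": "tell_about", "object": None})   # "what is this?"
gesture = GestureInput(10.6, "feather_pen")                          # user points
print(fuse(speech, gesture))   # {'act': 'tell_about', 'object': 'feather_pen'}
```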
The HCA PT2 was evaluated with 13 Danish children (age 11–16) speaking English (their second language) at NISLab in February 2005. Each session lasted 60–75 minutes.
This is a composite example from two actual conversations with HCA PT2, showing mixed-initiative dialogue, emotional response, and topic shifting:
From NICE Deliverable D7.2-2, Chapter 3.
The NICE Fairy-Tale World is an interactive 3D computer game where children and teenagers engage in spontaneous spoken and gestural dialogue with animated fairy-tale characters. What sets it apart from other dialogue systems — and from other games — is that spoken dialogue is the primary means of game control. There are no menus, no text commands. You just talk.
The Fairy-Tale World system was developed at TeliaSonera R&D in Stockholm, Sweden, led by Joakim Gustafson, Johan Boye, Mats Wirén, Linda Bell and Anders Lindström. The 3D world and character animation were created by the game company Liquid Media.
| Milestone | Details |
|---|---|
| Early 2003 | System design, character design (Liquid Media), message broker integration (KTH) |
| Jan 2004 | Prototype 1 (Scene 1 only: lab + Cloddy Hans). Tested at Technical Museum Stockholm with 10 children |
| Apr 2004 | D7.2b evaluation report; data analysis; design of Prototype 2 |
| Nov 2004 | Prototype 2 testing begins — "School" session with 31 children (Scene 1) |
| Dec 2004 | "Lab 1" session (11 teenagers, Scene 1, large display + gyro mouse) |
| Feb 2005 | "Lab 2" session (20 children, Scene 1+2 including Karen at the bridge) |
| Mar 2005 | "Lab 3" session (13 children, Scene 1+2). Corpus complete. |
| May 2005 | Final evaluation deliverable D7.2-2 submitted |
The fairy-tale world features five animated characters, each with a distinct personality encoded at every level — appearance, voice, vocabulary, gesture style, and long-term behaviour.
Loosely inspired by H.C. Andersen's tale "Klods-Hans" (often translated as "Clumsy Hans"). He is the user's faithful companion throughout the entire game. Seemingly slow-witted and not very bright, but honest, friendly, and anxious to help. His limitations are deliberate: they force the user to use multimodal dialogue to collaborate effectively.
OCEAN model: High Agreeableness, Low Conscientiousness, Low Openness. Users perceived him as kind, stupid, lazy, calm, polite.
A sullen, uncooperative girl who guards the drawbridge separating the island from the fairy-tale world. She is the obstacle — and the negotiation partner. Karen will only lower the bridge when given an object she deems acceptable. She will reject the first two offers regardless.
Users perceived her as smart, quick, self-confident, defiant. Deliberately designed to contrast with Cloddy Hans — creating dramatic tension and encouraging children to take sides.
Appears in the laboratory scene. Non-verbal — she points at slots in the fairy-tale machine to indicate where she wants objects placed, and expresses her satisfaction or discontent through large emotional body gestures when Cloddy Hans places objects correctly or incorrectly.
One of five animated characters developed for the full fairy-tale world. Feature characters have their own goals and drive sub-plots in the story, asking the user for help completing tasks in exchange for information or passage.
NICE introduced a three-way character taxonomy for interactive storytelling, extending the feature/supporting dichotomy from the literature:
Personality was not modelled as explicit internal parameters but used as a design guide to ensure consistent character behaviour perceived by users as compatible with intended traits. Every aspect of the characters — dialogue rules, wording, speaking style, non-verbal behaviour, idle animations — was designed around the OCEAN personality framework.
| Character | Openness | Conscientiousness | Extroversion | Agreeableness | Neuroticism |
|---|---|---|---|---|---|
| Cloddy Hans | Dunce | Thorough | Quiet · Uncertain | Friendly · Polite | Calm · Even-tempered |
| Karen | Intellectual | Frivolous | Outspoken · Self-confident | Unfriendly · Touchy | Anxious · Over-emotional |
| Character | Voice pitch | Speaking rate | Frequency range |
|---|---|---|---|
| Cloddy Hans | Low | Slow | Small |
| Karen | High | Fast | Large |
Voice actors were instructed to read utterances in personality-matching styles; all recordings were then resampled, slowing Cloddy Hans (so he sounds larger) and speeding up Karen (so she sounds younger). TD-PSOLA modified both pitch and rate simultaneously.
Both characters share the same underlying animation primitives — since both have been provided with identical animation sets it is possible to switch their non-verbal behaviours. The personality difference is expressed through the amplitude, speed, and frequency of identical gesture types, not through entirely different motion libraries. This was a key architectural decision enabling efficient character creation.
Karen never generates Problem or Safe utterances: by design, she always appears confident and in control, which makes her seem smarter than Cloddy Hans.
The game begins in Andersen's Copenhagen study in the 19th century. Andersen has just left on a trip to Odense, leaving Cloddy Hans to guard his fairy-tale laboratory. At the centre of the lab is the fairy-tale machine — forbidden to all except Andersen himself.
Cloddy Hans has sneaked an object into the machine and wants to surprise Andersen with a new fairy-tale. But he can't understand the symbols labelling the machine's four slots: "useful", "magical", "precious", "dangerous". He needs the user's help to figure out which object goes where. On the shelf: a key, hammer, diamond, magic wand, sword, ruby, axe, and more.
When Cloddy Hans pulls the lever, the machine activates — and a trapdoor drops him (and the user) into the fairy-tale world. They land on a small island with a vast gap between them and the rest of the world: windmills, villages, fields, all unreachable. The objects from the machine are now scattered in the grass.
On the far side of the gap stands Karen, guarding a raised drawbridge. She refuses to lower it — unless the user negotiates successfully. She wants one of the objects on the island, but she is picky: she will reject the first two offers, and what she ultimately wants changes each playthrough. The user must discover what she wants, convince Cloddy Hans to pick it up and bring it to her, and complete the exchange.
While the user talks to Karen (shown in camera), Cloddy Hans is out of view but still active. He receives BroadcastEvents from Karen, delivered because he is within audibility range even when off-camera, and uses these to fire contextual side-comments: hints, observations, and opinions.
This creates genuine three-party dialogue — user, Cloddy Hans, and Karen — with each character having distinct communicative roles.
Characters navigate between locations using triggers — 3D zones that fire events when entered. Walking into a trigger could cause a character to turn, speak, or initiate a new dialogue phase.
The following seven requirements guided the overall game design from the start of the project:
The concept of a story-functional event is central to measuring narrative progression: a key moment in the story that represents genuine advancement and cannot be undone. Progress is defined as the set of story-functional events achieved.
Scene 1 — The Laboratory
Scene 2 — The Fairy-Tale World
All 5,580 utterances were aligned at phoneme level. Duration z-scores were computed across all 85,000 phonemes. Key findings:
The Fairy-Tale World system is event-driven and asynchronous — enabling characters to be autonomous when no-one is talking to them and fully reactive when the user engages. All modules communicate via a central Message Dispatcher hub. Figure 3.5 (from NICE Deliverable D7.2-2) shows the complete architecture.
The system ran on three machines: a high-performance desktop running animation/rendering, speech synthesis, and gesture recognition; one laptop running ASR, the dispatcher, and the supervision GUI; a second laptop running NLU, dialogue managers, NLG, and animation planner.
Users interacted via a wireless headset microphone and a gyro mouse facing a large back-projection screen showing Cloddy Hans at life-size (180 cm tall).
For data collection, the system ran in supervised mode: an operator could approve the top suggestion, select from the N-best list, edit the ASR result, or type free text — enabling high-quality data collection while the system was still imperfect.
The NLU component addressed the interpretation of noisy, spontaneous speech from children. The parser works in two phases, described in detail in the section on robust spoken language understanding below.
"Pick up the axe" → request(user, cloddy, pickUp(cloddy, axe)). Concept accuracy: 53% from 1-best ASR, 70% from 10-best — a major improvement over raw ASR word accuracy of 39%.
Snacka, developed by Kåre Sjölander at KTH in collaboration with TeliaSonera, was one of the first unit-selection TTS systems built for Swedish. Cloddy Hans has a slow, deep voice (recordings slowed via TD-PSOLA); Karen has a faster, sharper delivery. The synthesiser outputs a time-stamped phone transcription that drives lip-sync and gesture coordination in the Liquid Media animation engine.
The NICE fairy-tale dialogue manager uses a two-layer architecture: a kernel providing generic dialogue management functionality common to all characters and situations, and scripting code that specialises the kernel for a particular character, scene, or phase. This mirrors game scripting patterns used in commercial game engines.
The dialogue manager operates on a stream of dialogue events — both external (from other modules) and internal (from the kernel's reasoning). Scripting code registers callbacks for events, creating asynchronous, flexible behaviour.
| Event Type | Description |
|---|---|
| ParserEvent | Parser analysed the user's latest utterance and produced a dialogue act |
| GestureEvent | Gesture interpreter identified an object the user pointed at (multimodal input) |
| RecognitionFailureEvent | ASR detected speech but could not recognise it |
| PerformedEvent | Animation system completed a requested action, carrying an ok or failed flag; essential for sequencing agenda goals |
| BroadcastEvent | Another character spoke or performed an action — only delivered if the sending character is within visibility range and/or audibility distance of the receiving character |
| TriggerEvent | Character entered or exited a 3D trigger zone while walking — used for contextual remarks, head-turns, waypoints, and camera changes |
| WorldEvent | A world-state change: object inserted into machine slot, bridge raised or lowered, etc. |
| TimeOutEvent | Silence threshold exceeded — character should prompt, hint, or take initiative |
| NoReactionEvent | Character's agenda is empty and no incoming events are pending — must decide whether to act proactively |
| IntentionEvent | Character has formed an intention to say or do something (internal) |
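The kernel/scripting split can be pictured as an event dispatcher with registered callbacks. The sketch below is a simplification using assumed names; the project's actual scripting interface is not shown in this document. Event names follow the table above, while the handler bodies are invented placeholders.

```python
from collections import defaultdict

# Minimal sketch of the kernel/scripting split: the kernel dispatches dialogue
# events to whatever handlers the current character's scripting code registered.

class DialogueKernel:
    def __init__(self):
        self.handlers = defaultdict(list)

    def on(self, event_type, handler):
        """Scripting code registers a callback for an event type."""
        self.handlers[event_type].append(handler)

    def post(self, event_type, **payload):
        """Deliver an event (external or internal) to all registered handlers."""
        for handler in self.handlers[event_type]:
            handler(payload)

# --- illustrative scripting code for Cloddy Hans, laboratory scene ---
kernel = DialogueKernel()
kernel.on("ParserEvent",
          lambda e: print(f"Cloddy reacts to dialogue act: {e['dialogue_act']}"))
kernel.on("RecognitionFailureEvent",
          lambda e: print("Cloddy: 'Sorry, I didn't quite catch that.'"))
kernel.on("TimeOutEvent",
          lambda e: print("Cloddy: 'Perhaps we should put something in the machine?'"))

kernel.post("ParserEvent", dialogue_act="request(user, cloddy, pickUp(cloddy, axe))")
kernel.post("TimeOutEvent")
```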
Each character maintains an inner state consisting of: a world model (beliefs about the 3D world), a discourse history (past interactions), and an agenda (tree-structured current goals and their causal relations).
Task specification uses declarative rules encoding action-world relationships, similar to STRIPS planning but with non-monotonicity: true propositions can become false (another character might move an object). Preconditions are always checked before execution. The agenda tree makes reasoning explainable — Cloddy Hans can say "I am going to the shelf because I want to pick up the axe because you asked me to."
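One way to picture the agenda is as a goal tree whose nodes record why they exist, with preconditions re-checked at execution time rather than assumed to persist. The sketch below is illustrative only; the class names and the dictionary world model are assumptions.

```python
# Illustrative agenda sketch: goals record why they were added (so the
# character can verbalise causal chains), and preconditions are re-checked at
# execution time, since another character may have changed the world.

class Goal:
    def __init__(self, action, because=None, preconditions=()):
        self.action = action
        self.because = because              # parent goal, or None for a user request
        self.preconditions = preconditions  # callables over the world model

    def explain(self):
        chain = [self.action]
        node = self.because
        while node is not None:
            chain.append(node.action)
            node = node.because
        return " because I want to ".join(chain) + " because you asked me to"

world = {"axe_location": "shelf", "cloddy_location": "machine"}

pick_up = Goal("pick up the axe",
               preconditions=(lambda w: w["axe_location"] == "shelf",))
go_to_shelf = Goal("go to the shelf", because=pick_up)

print(go_to_shelf.explain())
# go to the shelf because I want to pick up the axe because you asked me to

# Precondition re-check just before executing the parent goal:
world["axe_location"] = "karen"   # Karen grabbed it while Cloddy was walking
if not all(p(world) for p in pick_up.preconditions):
    print("Goal failed at execution time; replanning needed.")
```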
The character in focus is always in camera. The user can change dialogue partner — and camera — by saying the character's name ("Cloddy" / "Karen"). The system also triggers camera changes automatically, such as when Cloddy Hans reaches the bridge, causing the camera to swing to show Karen. Characters can make side-comments without being in camera (Cloddy whispering hints while Karen is on screen). This enables genuine three-party dialogue without parallel rendering.
Unlike the strict turn-alternation model used by virtually all spoken dialogue systems of the era — where user speech was the only input channel — NICE characters are driven by five independent event channels running simultaneously: (1) user dialogue acts from the parser; (2) completion signals from the character's own actions (PerformedEvent, ok/failed); (3) broadcast events from other characters, filtered by spatial visibility and audibility; (4) 3D trigger zones entered during movement; and (5) timeouts and empty-agenda signals. A character does not wait for the user to speak — any of these channels can push dialogue forward at any moment. This means the user can give a new instruction while Cloddy Hans is already walking, and when he arrives his PerformedEvent picks up the thread — creating dialogue that feels genuinely alive rather than turn-by-turn.
Data was collected across four sub-corpora over five months during 2004–2005, with the system developed iteratively between sessions.
| Condition | School | Lab 1 | Lab 2 | Lab 3 |
|---|---|---|---|---|
| Date | Nov–Dec 2004 | Dec 2004 | Feb 2005 | Mar 2005 |
| Location | School (small room) | TeliaSonera Vision Centre | TeliaSonera Multimodal Lab | TeliaSonera Multimodal Lab |
| Display | CRT, mouse | Large display, gyro mouse | Large display, gyro mouse | Large display, gyro mouse |
| Data | Audio, logs | Audio, video, logs | Audio, video, logs | Audio, video, logs |
| Gameplay | Scene 1 | Scene 1 | Scene 1+2 | Scene 1+2 |
| Age span | 8–11 | 14–15 | 9–10 | 11–12 |
| Net users | 26 | 7 | 15 | 9 |
Cloddy Hans rated as:
Karen rated as:
Cloddy Hans perceived as more…
Karen perceived as more…
Children produced rich, socially naturalistic dialogue behaviour previously unseen in human-machine interaction. Six such phenomena, documented with actual transcripts, are summarised below.
The conflict between Cloddy Hans and Karen was a deliberate plot device: each character tried to get the user to take their side.
Users were free to use their own words. Most picked up Cloddy Hans's terminology, and several reported that figuring out which words to use was itself part of the game.
Several users temporarily digressed from the task — then returned. This is impossible in menu-driven interfaces.
When Cloddy asked "why?", a user spontaneously explained the goal — motivating the character. Natural in human conversation, remarkable in HCI.
A user accused Cloddy Hans of still holding the diamond despite two denials. When proven wrong, they apologised — entirely natural social behaviour.
Users referenced shared dialogue history ("what was her name — now again?") and made spatial references to the 3D world as if it were real.
This journal paper presents and rigorously evaluates the spoken language understanding component of the NICE Fairy-Tale World. It describes a novel semantic representation formalism and a two-phase robust parsing algorithm designed to extract meaningful dialogue acts from the noisy, disfluent speech produced by young users, even when the ASR word error rate is as high as 61%.
User utterances are mapped to typed, tree-structured dialogue acts — a semantic representation that sits between the shallow slot-filler lists typical of task-oriented systems and the full logical forms of general-purpose NLP. The type system enforces constraints on argument positions, providing both robustness and expressiveness.
| Type | Example Utterance | Representation |
|---|---|---|
| request | "Pick up the axe" | request(user, cloddy, pickUp(cloddy, axe)) |
| ask | "What color is the ruby?" | λx.ask(user, cloddy, x[ruby.color=x]) |
| tell | "I'm fourteen years old" | tell(user, cloddy, 14[user.age=14]) |
| offer | "I'll give you the ruby" | offer(user, karen, ruby) |
| confirm | "Yes, do that" | confirm(user, cloddy) |
| disconfirm | "No! Stop!" | disconfirm(user, cloddy) |
| askForHelp | "What should we do?" | askForSuggestion(user, cloddy) |
| Action | Argument Structure |
|---|---|
| goTo | goTo(character, place) |
| pickUp | pickUp(character, thing) |
| putDown | putDown(character, thing, location) |
| giveTo | giveTo(character, thing, character) |
| raiseDrawbridge | raiseDrawbridge(character) |
| lowerDrawbridge | lowerDrawbridge(character) |
Type constraints — e.g. the first argument of pickUp must be a character, the second a thing — let the parser infer types and reject impossible parses without full logical reasoning.
Anaphoric utterances ("pick it up", "put it down there") are represented as functions over the dialogue context using typed lambda abstractions, a key innovation over slot-filler systems. For example, "pick it up" can be represented as λx.request(user, cloddy, pickUp(cloddy, x)), where x is later bound to a salient object of type thing during contextual interpretation.
The robust parser works in two phases. The key insight is that meaning can be recovered even when words are missing, inserted, or wrong — because the algorithm throws away words that don't match any pattern and keeps accumulating constraints from whatever does match.
Input: "take the ehh hammer"
The full rule take the hammer fails because of "ehh". But the shorter rule take matches. Then "the" and "ehh" are discarded one by one. Finally "hammer" matches as an object of type thing. Result: the correct constraints are extracted despite the disfluency.
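A toy version of this skip-and-match behaviour, assuming simple single-word rules (the real parser uses richer phrase patterns and typed constraints), might look like the following:

```python
# Toy sketch of the robust parser's skip-and-match phase: words that fit a
# rule contribute constraints, words that fit nothing are discarded, so a
# disfluency like "ehh" does not block interpretation. The rule inventory
# here is invented for the example.

RULES = {
    "take":   {"act": "pickUp"},
    "pick":   {"act": "pickUp"},
    "hammer": {"object": "hammer", "object_type": "thing"},
    "axe":    {"object": "axe", "object_type": "thing"},
    "shelf":  {"place": "shelf"},
}

def robust_parse(utterance: str) -> dict:
    constraints: dict = {}
    for word in utterance.lower().split():
        rule = RULES.get(word)
        if rule is None:
            continue              # "the", "ehh", ... are skipped
        constraints.update(rule)  # accumulate whatever does match
    return constraints

print(robust_parse("take the ehh hammer"))
# {'act': 'pickUp', 'object': 'hammer', 'object_type': 'thing'}
```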
The pattern-matching phase operates on the ASR 1-best hypothesis by default, but the paper shows that using the 10-best list can raise concept accuracy from 53% to 70% — a major potential gain for future systems.
The most common type. Infers from the utterance that an object of a certain type exists, e.g. the word "hammer" triggers the constraint that there is an object hammer of type thing.
Asserts two objects are different. e.g. "What is…" implies the user is asking someone else: ask(user, x, y) with x ≠ user
Relates objects to attributes of other objects. e.g. "Where is…" implies: y = z.position — the thing being asked about is the position of some object z.
The parser was evaluated on 3,400 utterances from the Swedish NICE corpus — spontaneous speech from children aged 9–15 collected over 4 sessions. All figures are based on unedited ASR output; the supervised-mode operator's corrections were not used for evaluation.
Word accuracy is the bottleneck. Even with a perfect oracle choosing the best hypothesis from the 10-best list, word accuracy only reaches 55%, reflecting how hard children's spontaneous speech is for ASR systems.
Concept accuracy measures how many semantic nodes in the dialogue act tree are correct — a finer-grained metric than sentence accuracy. The parser recovers substantial meaning even from very noisy input.
| Metric | Definition | Analogue |
|---|---|---|
| Sentence accuracy | % of utterances where ASR output exactly matches transcription | Standard ASR metric |
| Word accuracy | % of word tokens correctly recognised (accounting for substitutions, insertions, deletions) | Standard WER-based metric |
| Semantic accuracy | % of utterances where the parser output exactly matches the gold-standard dialogue act | Semantic analogue of sentence accuracy |
| Concept accuracy | % of tree nodes in the dialogue act correctly reconstructed by the parser | Semantic analogue of word accuracy |
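Concept accuracy, as defined in the table above, can be approximated by comparing the nodes of the parser's dialogue-act tree against the gold-standard tree. The sketch below makes simplifying assumptions (trees as nested tuples, node-by-position matching) and is not the paper's exact scoring procedure:

```python
# Simplified sketch of a concept-accuracy measure: the fraction of nodes in
# the reference dialogue-act tree that the parser's output tree reproduces at
# the same position. Trees are nested tuples: (label, child, child, ...).

def nodes(tree, path=()):
    """Yield (path, label) pairs for every node in the tree."""
    if isinstance(tree, tuple):
        label, *children = tree
        yield (path, label)
        for i, child in enumerate(children):
            yield from nodes(child, path + (i,))
    else:
        yield (path, tree)

def concept_accuracy(reference, hypothesis):
    ref = dict(nodes(reference))
    hyp = dict(nodes(hypothesis))
    correct = sum(1 for p, label in ref.items() if hyp.get(p) == label)
    return correct / len(ref)

# Reference: request(user, cloddy, pickUp(cloddy, axe))
reference  = ("request", "user", "cloddy", ("pickUp", "cloddy", "axe"))
# Parser output got the action right but the wrong object:
hypothesis = ("request", "user", "cloddy", ("pickUp", "cloddy", "hammer"))
print(concept_accuracy(reference, hypothesis))   # 0.833... (5 of 6 nodes correct)
```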
Parsing produces a context-independent semantic representation — the meaning of an utterance without resolving references to the dialogue context. A second step, contextual interpretation, resolves anaphora (pronouns, ellipsis) by applying the lambda-abstract dialogue acts to the most salient discourse entities.
The contextual interpretation step uses a focus management algorithm that maintains a ranked set of salient objects — the entities most recently mentioned or acted upon. When a pronoun ("it") or an elliptic utterance ("now the hammer") is encountered, the algorithm applies the lambda function to the most salient entity of the appropriate type.
Focus management is tightly integrated with the dialogue manager (described in D5.2b) rather than the parser, since salience depends on the full history of actions and utterances — not just the current turn's linguistic content.
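Putting the lambda-abstract representation and the salience ranking together, contextual interpretation can be sketched as applying the function to the most salient entity of the appropriate type. The data structures and ranking policy below are illustrative assumptions:

```python
# Illustrative sketch of contextual interpretation: an anaphoric dialogue act
# is a function over the context, and the focus manager supplies the most
# salient entity of the appropriate type. Salience ranking (most recently
# mentioned first) and entity typing are simplified assumptions.

TYPES = {"axe": "thing", "hammer": "thing", "karen": "character", "shelf": "place"}

class FocusManager:
    def __init__(self):
        self.salient = []                     # most recent first

    def mention(self, entity):
        if entity in self.salient:
            self.salient.remove(entity)
        self.salient.insert(0, entity)

    def most_salient(self, wanted_type):
        for entity in self.salient:
            if TYPES.get(entity) == wanted_type:
                return entity
        return None

focus = FocusManager()
focus.mention("karen")
focus.mention("hammer")     # "do you see the hammer?"

# "Pick it up"  ->  lambda x . request(user, cloddy, pickUp(cloddy, x))
pick_it_up = lambda x: f"request(user, cloddy, pickUp(cloddy, {x}))"

referent = focus.most_salient("thing")       # resolves "it" -> hammer
print(pick_it_up(referent))                  # request(user, cloddy, pickUp(cloddy, hammer))
```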
In a fast-paced game interaction, users naturally use pronouns and ellipsis: "pick up that one", "now the red one", "put it there" while pointing at the screen. A system that cannot handle anaphora forces users into unnatural, verbose instructions for every single action — breaking immersion and making the game feel like a command-line interface rather than a conversation.
Boye, J., Gustafson, J., & Wirén, M. (2006). "Robust spoken language understanding in a computer game." Speech Communication, 48(6), pp. 335–353. Elsevier. Special issue on spoken language understanding. DOI: 10.1016/j.specom.2005.06.015
This paper gives a detailed technical overview of the implemented dialogue manager in the NICE Fairy-Tale World — covering the character taxonomy, world representation, goal-based planning, message protocol, dialogue event system, scene/subscene structure, turn-taking, and asynchronous dialogue. It is the most complete published account of how the NICE dialogue engine actually works.
The paper introduces a three-way character taxonomy for interactive storytelling, extending the feature/supporting dichotomy from earlier work (Charles & Cavazza 2004) with a new third type — the helper character — that is central to the NICE design.
Drive the story. Have their own goals and desires. Represent obstacles or information sources. Require a full NLU and dialogue manager. Their agenda trees can be complex and deeply nested.
Provide atmosphere and react to events. Non-verbal or limited verbal output. Express emotions through body language. Do not drive the plot. Thumbelina points at machine slots and reacts with large body gestures when displeased.
A new type, not in prior literature. Guides and assists the user throughout the game. Has no goals of his own — only doing what the user asks and helping them when stuck. Requires a uniquely broad set of capabilities.
The paper identifies that helper characters demand a uniquely broad capability set — broader in some ways than feature characters:
Each fairy-tale character maintains a rich inner state that drives all its behaviour — both proactive (pursuing its own agenda) and reactive (responding to user input or world events).
The character's beliefs about the current state of the 3D world — represented as a set of interrelated objects with attributes and relations. Includes: where objects are located, who is holding what, which slots are filled, whether the bridge is raised or lowered, etc.
Key property: beliefs can be wrong. A character's world model is updated by perceptual events and broadcasts from other characters, but may lag behind reality.
A record of past interactions — what has been said, what has been done, what agreements have been reached. Used for anaphora resolution, detecting repeated requests, and deciding whether something has already been explained.
Enables Cloddy Hans to say "But I already told you — the axe is in the machine!" and to remember that the user previously asked him to go somewhere.
A set of tree-structures encoding the character's current goals, past and future actions, and their causal relationships. The trees can be deeply nested. Crucially, the tree encodes why as well as what:
These causal chains can be verbalized — Cloddy Hans can explain "I'm walking to the shelf because I want to pick up the axe."
Task specifications are sets of declarative rules encoding relationships between actions and world propositions. The dialogue manager uses these to build agenda trees — similar to classical STRIPS planning, but with a critical difference:
In classical STRIPS planning, once a proposition is established as true, it stays true. In the fairy-tale world, this assumption breaks down. For example:
available(axe) may be true when Cloddy Hans starts walking towards it, but become false if Karen grabs it first.
Solution: the system checks all necessary preconditions before executing any action, even those already found to be true at an earlier point. A precondition is only assumed to hold at the moment of checking, not persistently.
Certain key moments are designated story-functional events — irreversible milestones that mark genuine narrative progression. They cannot be undone. Examples:
Scene 1 (Lab)
Scene 2 (Fairy-Tale World)
All characters load the story-functional events at scene initialisation and add corresponding goals to their agendas. This allows Cloddy Hans to proactively guide the user towards unmet milestones when the user is stuck.
The dialogue manager communicates with the rest of the system through a strictly defined message protocol. It receives a stream of input messages and generates output messages.
| Message | Description |
|---|---|
| convey <dialogue_act> | Turn into words via NLG → TTS + animation |
| perform <action> | Send command to the Animation Planner |
Every observable character behaviour is produced through exactly one of these two primitives. convey produces speech; perform produces physical action.
| Message | Description |
|---|---|
| parserInput <dialogue_act> | User said something; parser result attached |
| gestureInput <object> | User pointed at a specific 3D object |
| recognitionFailure | User spoke; ASR could not interpret it |
| broadcast <message> | Another character spoke or acted — only delivered if sender is within visibility range and/or audibility distance |
| performed <id> <flag> | Character completed own requested action (ok/failed) |
| trigger <id> | Character moved into/out of a 3D trigger zone |
The performed message might seem redundant — after all, the character itself requested the action. But since some actions (like walking across the world) take considerable time, and actions may fail for unexpected reasons, the character cannot treat an action as complete until the animation system explicitly confirms it. This feedback is essential for correctly sequencing agenda goals.
Event-based dialogue progression — the five input channels that drive dialogue forward in NICE
At the time NICE was developed (2002–2005), virtually all spoken dialogue research — whether in phone-based information systems, in-car assistants, or early conversational agents — followed the same fundamental model: the user speaks, the system produces a response, and so on in strict alternation. The only currency was user speech, and the only driving force was the parser's output from that speech.
The NICE dialogue engine discarded this assumption entirely. Each character's dialogue manager is driven by a stream of heterogeneous events arriving asynchronously from five independent sources. The character's scripting code responds to whichever event arrives next, regardless of whether the user has spoken. This makes it possible for characters to behave proactively, to react to each other, and to interact with the 3D world around them — in a way that had no precedent in spoken dialogue systems.
When the ASR recognises speech and the NLU parser succeeds, a ParserEvent is delivered carrying a fully interpreted dialogue act — e.g. instruction(go, drawbridge) or question(location, ruby). The scripting code for the current scene and phase catches this event, performs contextual interpretation (resolving pronouns, ellipsis, and screen references), and determines the appropriate response from the character's repertoire.
A RecognitionFailureEvent fires when the user speaks but ASR fails — characters handle this gracefully rather than silently failing. A GestureEvent fires when the user clicks on a 3D object (multimodal input: "pick this up" + click).
When a character performs a physical action — walking to a location, picking up an object, pulling a lever — the animation system executes it asynchronously. It reports back via a PerformedEvent carrying an ok or failed flag. This event is essential for sequencing: the character can only say "I'm at the machine, what now?" after actually arriving.
Because walking can take considerable time, the user may have issued new instructions in the meantime. The PerformedEvent lets the agenda cleanly resolve whether those intervening requests should now be acted upon or whether to resume the original goal chain.
Whenever any character speaks or performs an action, it broadcasts that event to all other characters. Each receiving character's dialogue manager fires a BroadcastEvent — but only if the sending character is currently within visibility range and/or within audibility distance of the receiving character. The system tracked spatial relationships between characters in the 3D world to enforce this filter.
This is what enables Cloddy Hans to whisper side-comments during the Karen negotiation ("Don't trust her!") even while off-camera, and for Karen to react to things Cloddy does across the world only if she could plausibly have seen or heard them. Characters could thus be genuinely surprised by events they had no line of sight to.
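The spatial filter on broadcasts can be approximated with a distance check against per-character audibility and visibility ranges. The positions, range values, and function names below are invented for the sketch:

```python
import math

# Sketch of BroadcastEvent filtering: a character only receives another
# character's broadcast if the sender is within its audibility distance
# (and, for purely visual events, within visibility range). The positions
# and range values are invented for illustration.

POSITIONS = {"cloddy": (2.0, 0.0), "karen": (8.0, 1.0)}
AUDIBILITY_RANGE = 10.0   # metres, assumed
VISIBILITY_RANGE = 15.0   # metres, assumed

def distance(a, b):
    return math.dist(POSITIONS[a], POSITIONS[b])

def deliver_broadcast(sender, receiver, spoken=True):
    """Return True if the receiver should get the sender's BroadcastEvent."""
    d = distance(sender, receiver)
    return d <= (AUDIBILITY_RANGE if spoken else VISIBILITY_RANGE)

# Karen speaks; Cloddy is off-camera but close enough to overhear:
print(deliver_broadcast("karen", "cloddy", spoken=True))    # True
```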
The virtual world is populated with invisible trigger zones — 3D volumes defined by coordinates. As a character walks through the world, the animation system monitors their position and fires a TriggerEvent each time they enter or exit a zone. The dialogue manager uses these to make characters react to their immediate surroundings in real time.
If no other event arrives within a configurable time window, a TimeOutEvent fires. Characters use this to decide that the user needs prompting — perhaps they are confused, distracted, or have simply stopped speaking for too long. The response is determined by the current scene and agenda state: a first timeout might produce a gentle hint ("Perhaps we should go to the drawbridge?"); a repeated timeout might escalate to a more explicit suggestion, or Cloddy Hans might use his knowledge of unmet story-functional events to guide the user towards the next goal.
The NoReactionEvent is a related internal event fired when the agenda is empty and no incoming events are pending — the character has nothing queued and must decide whether to take initiative. Together, timeouts and NoReactionEvents are what keep the game from stalling when a child doesn't know what to say or do next.
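The escalation from a gentle hint to explicit guidance grounded in unmet story-functional events can be pictured as follows; the hint wording, milestone names, and thresholds are placeholders rather than project content:

```python
# Sketch of timeout handling: repeated silences escalate from a gentle hint
# to explicit guidance derived from unmet story-functional events.

UNMET_MILESTONES = ["negotiate_with_karen", "lower_drawbridge"]

HINTS = {
    "negotiate_with_karen": "Maybe we should ask Karen what she wants before she lowers the bridge?",
    "lower_drawbridge": "We still need to get that drawbridge down somehow.",
}

def on_timeout(timeout_count: int) -> str:
    """Choose Cloddy Hans's prompt after the user has been silent too long."""
    if timeout_count == 1:
        return "Perhaps we should go to the drawbridge?"
    # Repeated timeouts: steer towards the next unmet story-functional event.
    if UNMET_MILESTONES:
        return HINTS[UNMET_MILESTONES[0]]
    return "What would you like to do next?"

print(on_timeout(1))
print(on_timeout(2))
```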
Underlying all five event channels is the character's agenda — a set of goal-trees whose content is determined by the current scene and subscene. When a scene initialises, each character loads the set of story-functional events for that scene and populates their agenda accordingly. This means the character knows at all times what still needs to happen for the narrative to progress, and can use that knowledge to decide how to respond to any incoming event.
At scene initialisation, story-functional milestones are added as top-level agenda goals. As events arrive and goals are fulfilled, the agenda shrinks. The NoReactionEvent fires when it is empty — signalling that the scene is complete or the character must find something proactive to do. Cloddy Hans uses the remaining agenda to generate unprompted hints ("We still haven't spoken to Karen about the bridge").
Unlike classical STRIPS planning, the agenda does not assume that satisfied preconditions remain satisfied. When Karen grabs the axe just as Cloddy Hans was walking to pick it up, his goal fails and must be replanned. Each precondition is re-checked at execution time, not assumed true from when it was last verified. This is essential in a dynamic multi-character world where any character can change the state of any object.
| Event | Triggered when… |
|---|---|
| ParserEvent | Parser delivers analysis of user's utterance |
| GestureEvent | User pointed at a 3D object |
| RecognitionFailureEvent | ASR detected speech but could not recognise it |
| PerformedEvent | Animation system completed a requested action (ok / failed) |
| BroadcastEvent | Another character spoke or acted (filtered by visibility & audibility) |
| WorldEvent | A world-state change (bridge raised/lowered, object inserted in machine…) |
| TriggerEvent | Character entered or exited a 3D trigger zone while walking |
| Event | Meaning |
|---|---|
| IntentionEvent | Character has formed an intention to say or do something |
| NoReactionEvent | Agenda empty — character must decide whether to take initiative |
| TimeOutEvent | Silence threshold exceeded — prompt or hint |
| AlreadySatisfiedEvent | A goal already true was added to the agenda |
| CannotSolveEvent | An unsolvable goal was added to the agenda |
| PossibleGoalConflictEvent | New goal conflicts with an existing agenda item |
| RequestEvent | Derived from ParserEvent — user made an instruction |
| QuestionEvent | Derived from ParserEvent — user asked a question |
This step-by-step trace shows how a single user instruction flows through the entire event/scripting system:
interpret operation → contextual interpretation resolves references
convey(acknowledge) → Cloddy says "OK, I'll do that" · Then calls addGoal(standNextTo(cloddy, machine))
perform(walkTo(machine)) → Cloddy Hans starts walking. TriggerEvents fire as he passes objects of interest along the route.
The game is divided into scenes and subscenes (and sub-subscenes), forming a hierarchical dramatic structure. Scenes map to game levels; scene transitions mark significant narrative events. This hierarchy is entirely modular — new scenes can be added without touching the kernel.
The dialogue management kernel provides generic functionality — input handling, reference resolution, goal reasoning, event dispatching. The scripting code per character per scene per phase provides the specific personality and behaviour. Adding a new scene means writing new scripting code; the kernel is unchanged.
Cloddy Hans tells the user about the fairy-tale world. Story-functional events: Cloddy introduces the world; Karen introduces herself; both characters give their opinions of each other.
The player and Cloddy Hans explore the island together, finding the fairy-tale objects from Scene 1 scattered in the grass. Camera and dialogue are entirely user-driven.
Starts when the player meets Karen. Multi-party negotiation: user tries to find what Karen wants, Cloddy Hans makes side-comments, Karen rejects first two offers unconditionally. Rich three-party dialogue.
After successful negotiation. Cloddy Hans delivers the agreed object to Karen. The drawbridge is lowered. Story-functional event: Cloddy crosses the bridge and pays Karen. Scene complete.
At any moment exactly one subscene phase is active. Phases transition on events — for example, the Negotiation phase begins the first time the user speaks to Karen. The phase can also change based on geographic location (which trigger zone the character is in) or on character actions. This allows non-linear story progression within the overall scene structure.
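A subscene phase machine of this kind can be sketched as a handful of named phases with event-driven transitions. The phase names follow the scene descriptions above, while the transition conditions are simplified assumptions:

```python
# Sketch of event-driven subscene phases for Scene 2. Phase names follow the
# descriptions above; the transition logic is a simplified assumption.

PHASE = "exploration"

def on_event(event: dict) -> None:
    """Advance the subscene phase when a qualifying event arrives."""
    global PHASE
    if (PHASE == "exploration" and event.get("type") == "ParserEvent"
            and event.get("addressee") == "karen"):
        PHASE = "negotiation"            # first time the user speaks to Karen
    elif (PHASE == "negotiation" and event.get("type") == "WorldEvent"
            and event.get("change") == "offer_accepted"):
        PHASE = "delivery"               # Cloddy brings the agreed object
    elif (PHASE == "delivery" and event.get("type") == "WorldEvent"
            and event.get("change") == "bridge_lowered"):
        PHASE = "scene_complete"

on_event({"type": "ParserEvent", "addressee": "karen"})
print(PHASE)   # negotiation
```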
Unlike most spoken dialogue systems — which use a strict turn-alternation model where user and system must speak one at a time — the NICE dialogue manager is fully asynchronous. This is one of its most important innovations.
A trigger is a three-dimensional zone with specific coordinates in the virtual world. When a character enters or exits a trigger, a TriggerEvent is fired to its dialogue manager. Uses include contextual remarks about nearby objects, head-turns, navigation waypoints, and camera changes.
The paper identifies only one comparable system: the Mission Rehearsal Exercise (MRE) from the USC Institute for Creative Technologies (Swartout et al. 2004). MRE has more complex character-character interactions and a more sophisticated emotion model. However, MRE operates in a military domain with more codified language, whereas NICE targets spontaneous children's speech in a domain with negotiation and social interaction. To the authors' knowledge, NICE and MRE were the only two multi-party spoken dialogue systems in existence at the time of publication.
Boye, J., & Gustafson, J. (2005). "How to do dialogue in a fairy-tale world." In Proceedings of the 6th SIGdial Workshop on Discourse and Dialogue, Lisbon, Portugal.
This paper describes the design and implementation of Snacka — a new corpus-based unit-selection speech synthesis system for Swedish developed specifically for the NICE project. It was one of the first unit-selection TTS systems built for Swedish, and the first designed from the ground up to convey animated character personality in a real-time interactive application. The paper covers the unique requirements that a fairy-tale game domain places on speech synthesis, the corpus design, the system architecture, voice transformation, and early evaluation results.
The central challenge driving the development of a new TTS system was straightforward: the NICE characters engage in open-ended clarification dialogues that reference any combination of objects, slots, actions, and dialogue contexts. This makes pre-recording all possible utterances impossible — a unit-selection synthesiser is the only viable path to both quality and coverage.
Existing off-the-shelf Swedish TTS was not suitable for several reasons: generic voices have no personality; commercial systems are not designed for real-time integration into game engines; and at the time of development (2003–2004), no suitable Swedish unit-selection system was available that met the project's requirements. The solution was to build Snacka — a completely new corpus-based unit-selection synthesiser extending the open-source Snack sound toolkit by co-author Kåre Sjölander (KTH).
Cloddy Hans was voiced by Mattias Heldner, a speech scientist at KTH. The voice actor was instructed to read all corpus utterances in a style matching Cloddy Hans's intended personality — friendly, uncertain, and slow — creating a performance-driven foundation that the acoustic transformations then further shaped to match the character's visual appearance.
The defining architectural principle of Snacka is symmetry between analysis and synthesis: every module used to analyse the recorded corpus is also used during synthesis. This is critical for quality — it ensures the unit inventory and the synthesis output are described in exactly the same terms, maximising the chance that available units match what synthesis needs.
A dedicated quality assessment tool was built on top of WaveSurfer. It allows simple point-and-click synthesis of typical domain sentences. When a faulty utterance is found, the tool can be used to inspect the automatic segmentation, correct errors, and prune bad units — in most cases the utterance can then be re-synthesised instantly without rebuilding the entire voice. This rapid iteration cycle was essential for achieving acceptable quality in a research timeline.
The corpus was designed to cover everything the system needed to say as well as everything it was designed to understand — a key symmetry principle. Since the system handles clarification dialogues about arbitrary object/slot combinations, utterances were carefully designed to cover all grammatical positions and emphasis patterns.
To allow Cloddy Hans to ask targeted clarification questions, all domain utterances were recorded with objects and slots in both medial and final positions:
All utterances were also tagged with prosodic phrase boundaries and emphasized words before recording, and recorded in all combinations of position and emphasis — giving the synthesiser the full range of prosodic variation needed for natural clarification dialogues.
In addition to domain-specific utterances, a rich set of task-independent dialogue acts was recorded — covering all the ways Cloddy Hans needs to regulate turns, handle errors, and express attitude. These are used across all tasks and plots.
The turn regulation utterances and attitudinal feedback serve two purposes: (1) to buy time while the system generates Cloddy Hans's next utterance, and — more importantly — (2) to convey his uncertain, hesitant personality. A confident, fast character would not produce filler words and sighs; Cloddy Hans's acoustic hesitancy is as much a personality signal as his slow walking speed.
Making different characters sound like themselves — not just like different speed settings of the same voice — required a combination of directed recording and post-hoc acoustic transformation.
Voice actors were given explicit personality profiles and instructed to read all corpus utterances in a performance style that matched the target character. For Cloddy Hans, the instruction was to sound friendly, uncertain, and slow — producing a baseline recording already shaped by character intent.
After recording, further acoustic shaping was applied using the Time-Domain Pitch-Synchronous Overlap-Add (TD-PSOLA) algorithm, a well-established technique for modifying speech prosody without full re-synthesis: Cloddy Hans's recordings were slowed down (lower and more ponderous), while Karen's were sped up (higher and quicker).
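The simpler resampling transformation mentioned elsewhere in this article (playing the same samples back at a different rate, which shifts pitch and tempo together) can be illustrated as below. This is a conceptual sketch only; the project's TD-PSOLA processing allows finer, pitch-synchronous control.

```python
# Sketch of simple resampling: stretching or compressing the sample sequence
# and playing it back at the original sample rate changes speaking rate and
# pitch together (slower and deeper for Cloddy Hans, faster and higher for
# Karen). The toy waveform and factors are invented for the example.

def resample(samples: list[float], factor: float) -> list[float]:
    """Stretch (factor > 1) or compress (factor < 1) a signal by linear
    interpolation; duration scales by `factor`, pitch by roughly 1/factor."""
    n_out = int(len(samples) * factor)
    out = []
    for i in range(n_out):
        pos = i / factor
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append((1 - frac) * samples[lo] + frac * samples[hi])
    return out

original = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]   # one cycle of a toy wave
cloddy = resample(original, 1.25)   # longer signal: slower, lower-pitched
karen  = resample(original, 0.8)    # shorter signal: faster, higher-pitched
print(len(original), len(cloddy), len(karen))   # 8 10 6
```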
The approach was grounded in empirical research on synthetic voice personality. Nass & Lee had shown that providing a synthetic voice with stereotypical extrovert vocal features — high loudness, increased pitch, wide frequency range, fast speaking rate — caused users to describe it using extrovert personality adjectives. NICE applied this in reverse: design the vocal parameters to match the target personality, and users will perceive the character accordingly.
At the time of the workshop paper, the synthesiser was already deployed for user testing at the Telecommunication Museum in Stockholm. Adults and children interacted with a full-size Cloddy Hans on a large back-projected video wall, in semi-automatic mode (human wizard available to override). Users were asked post-interaction about their perception of:
Future work identified at the time: improved prosodic models, and experimenting with making repetitions acoustically different from original utterances — so that repeated attempts at the same phrase don't sound identical.
One of Snacka's most important features for the NICE application was its tight integration with the animation system — achieved through a custom XML markup language that carried synthesis instructions, prosodic markup, and gesture coordination tags in a single pass.
The dialogue system outputs XML-tagged text to the synthesiser. A real example from the paper:
| Tag | Function |
|---|---|
| <EMPH> | Mark a word for prosodic emphasis — synthesiser increases pitch and energy on the emphasised word |
| <PointAt object="X"> | Coordinate synthesis with a pointing gesture at object X in the 3D world — triggers animation system |
| <PointAt slot="Y"> | Coordinate synthesis with a pointing gesture at slot Y in the fairy-tale machine |
| Prosodic phrase markup | Mark phrase boundaries for more natural prosody across longer utterances |
| Extralinguistic tags | Trigger recorded non-speech sounds (cough, laughter, sigh) at specific points |
| Meta-information tags | Carry additional context used by the animation planner |
In addition to the audio output, the synthesiser produces a time-stamped phone transcription containing:
This output is fed directly to the Liquid Media animation system, which drives Cloddy Hans's lip movements and body gestures in precise synchrony with the speech output.
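The synchronisation track can be thought of as a list of time-stamped phones that the animation engine turns into mouth-shape keyframes. The phone labels, timings, and viseme mapping below are invented for the illustration:

```python
# Sketch of consuming a time-stamped phone transcription for lip sync: each
# phone is mapped to a mouth shape (viseme) with its start time, so the
# animation engine can schedule mouth movements in step with the audio.

PHONE_TO_VISEME = {
    "m": "closed_lips", "a": "open_wide", "t": "tongue_tip",
    "s": "narrow", "sil": "rest",
}

phone_track = [            # (phone, start_seconds, end_seconds)
    ("m", 0.00, 0.08),
    ("a", 0.08, 0.22),
    ("t", 0.22, 0.30),
    ("a", 0.30, 0.45),
    ("sil", 0.45, 0.60),
]

def viseme_schedule(track):
    """Turn the synthesiser's phone timing into animation keyframes."""
    return [(start, PHONE_TO_VISEME.get(phone, "rest")) for phone, start, _ in track]

for t, viseme in viseme_schedule(phone_track):
    print(f"{t:0.2f}s  ->  {viseme}")
```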
Most TTS systems output audio only. Snacka outputs audio plus a complete synchronisation track — enabling Cloddy Hans to point at the axe precisely as he says "that", and to gesture towards the useful slot as he says "there". Without this tight coupling, lip sync would slip, gestures would arrive late, and the character would lose believability. This integration of synthesis and animation was a key technical contribution of the NICE project.
Gustafson, J., & Sjölander, K. (2004). "Voice creation for conversational fairy-tale characters." In Proceedings of the 5th ISCA Speech Synthesis Workshop (SSW5), Pittsburgh, PA.
This paper presents the complete NICE Fairy-Tale World system — the game scenario, the characters' conversational requirements, the animation system designed by Liquid Media, the voice creation approach, and the results of the user studies with children. It argues that spoken dialogue technology has the potential to greatly enrich future computer games, and demonstrates a working system in which children used speech as the primary means of game progression.
Personality traits were not explicitly modelled as internal variables, but used as design guidelines to ensure that every aspect of a character's behaviour — appearance, voice, wording, speaking style, non-verbal gestures — was perceived as consistent with the intended personality. The characters were designed around contrasting OCEAN (Big Five) trait profiles.
Slow, deep voice. Lumbering walk. Honest, anxious to help, no agenda of his own. His limitations force the user to cooperate using multimodal dialogue. His apparent stupidity is deliberate — it makes the game harder in a way users perceive as intentional design.
Perceived by users as: Kind · Stupid · Lazy · Calm · Polite · Distressed
Fast, sharp voice — made to sound younger by speeding up the recordings. Drives the negotiation, rejects every offer for the first two attempts, and makes side-comments about Cloddy Hans. She appears smarter partly because she can "see" what Cloddy brings without needing to understand the user's speech.
Perceived by users as: Smart · Quick · Self-confident · Defiant · Secretive
Karen's animated face conveys a wide range of emotions through the phonemic, intonational and emotional channels described in the paper. Facial actions — eyebrow position, mouth shape, gaze direction — communicate her sullen, uncooperative personality even before she speaks.
Cloddy Hans's animation repertoire spans autonomous idle behaviour (walking, looking around) and reactive dialogue behaviour (picking up objects, gesturing, celebrating). His slow, lumbering movement style — deliberate in the animation design — reinforces his dunce personality at every frame.
The animation system designed by Liquid Media was central to making the characters feel alive and reactive. It consisted of two main modules — the Animation Handler and the Animation Renderer — operating asynchronously within the event-driven architecture.
A bespoke unit-selection Swedish TTS was developed in cooperation with KTH. Voice actors were told to read utterances in a style matching each character's intended personality. The recordings were then processed to further accentuate the personality differences.
The simple procedure of resampling all utterances — simultaneously changing speaking rate and voice pitch — had a desirable emergent property: the personality differences were visually reinforced. Cloddy Hans's slower, lower voice matched his large, lumbering body; Karen's faster, higher voice matched her small, quick character design. The audio and visual channels reinforced each other without requiring complex acoustic modelling.
One of the most striking findings from the corpus analysis was how children adapted their speaking rate dynamically across the two scenes and across the two different character interlocutors:
Karen was designed so she could respond to what Cloddy Hans physically brought her — she could "see" the object — without needing to parse the user's speech. This meant she appeared far more intelligent than Cloddy Hans (who needed correct verbal instructions) even though her underlying language understanding was actually simpler. This asymmetry was a deliberate design choice that also helped cover for ASR failures: Karen's reliable responsiveness made the system feel more robust overall.
Gustafson, J., Boye, J., Fredriksson, M., Johannesson, L., & Königsmann, J. (2005). "Providing Computer Game Characters with Conversational Abilities." In Intelligent Virtual Agents (IVA 2005), Lecture Notes in Computer Science, Springer, Kos, Greece.
Watch recordings from both NICE systems — system demonstrations, user interactions, and game footage from the fairy-tale world.
The NICE project produced over 15 research papers across international venues in speech technology, dialogue systems, human-computer interaction, and intelligent virtual agents.
The NICE Final Report (May 2005) reflected on the project's broader significance and each partner's plans to build on the work. The consortium's conclusion was unambiguous: all the original challenges had been satisfactorily met, and the systems pointed towards entirely new categories of interactive application.
The NICE project attracted attention from industry press covering the emerging intersection of speech technology and video games.