EU Project · IST-2001-35293 · 2002–2005

NICE

Natural Interactive Communication for Edutainment

🧒 Children & Teenagers 🎮 Computer Game AI 🗣️ Spoken Dialogue 🤖 Embodied Agents 📖 H.C. Andersen

What was NICE?

NICE was a European Union Human Language Technologies research project (IST-2001-35293) that ran from 2002 to 2005. Its goal was to push the boundaries of what computers could do in natural, spoken conversation — and to deploy that technology in an ambitious, playful context: letting children and teenagers talk to fairy-tale characters brought to life in a 3D computer game.

The project united five partner institutions across four countries, developing two distinct interactive systems both inspired by the universe of Danish author Hans Christian Andersen. One system let users hold free-ranging, domain-oriented conversation with an embodied, animated HCA himself. The other placed users inside his fairy-tale world, where they had to cooperate with — and negotiate against — animated story characters using spontaneous spoken and gestural dialogue.

What made NICE unusual was its deliberate move away from strictly task-oriented dialogue. Real human conversation is unpredictable, socially rich, and full of personality. NICE set out to demonstrate that speech technology could support exactly that — not just information retrieval, but character, emotion, storytelling, and collaborative problem-solving.

  • 5 partner institutions
  • 4 countries
  • 2 interactive systems
  • 57+ child users tested
  • €4.57M total project budget
  • €3.01M EU Commission funding
  • 40+ hours of Wizard-of-Oz data collected
  • 15+ research papers
"Computer games provide an excellent application area for research in spoken dialogue technology, requiring an advance of the state-of-the-art on several fronts — and game players have a lot to gain from a successful incorporation of spoken dialogue technology."

— Gustafson et al., SIGdial 2004

Background & Early Development

When NICE began in March 2002, the consortium identified four challenges that had never been satisfactorily met in any existing system:

Uncharted territory in 2002

  • No computer game included real spoken language dialogue — only isolated voice commands
  • No speech recogniser was optimised for children's and adolescents' speech patterns and linguistic behaviour
  • Advanced spoken dialogue systems had hardly been used for edutainment or entertainment
  • Domain-oriented conversation — unconstrained dialogue across semi-open domains — had never been demonstrated in a running system

Additional research challenges

  • Integrating 2D gesture input with spontaneous spoken dialogue
  • Personality, knowledge and mental state modelling for believable characters
  • Multi-party dialogue — several characters interacting simultaneously
  • NLU robust enough to handle the error rates of children's ASR
  • New evaluation methods for entertainment-type human-system interaction
  • Rapid prototyping framework portable to new embodied character applications
"There is an important sense in which the NICE systems are simply interactive spoken computer games. This potentially revolutionary field of computer games was non-existent when the NICE project began."
— NICE Final Report, May 2005

Before any prototype system existed, NICE partners collected extensive Wizard-of-Oz (WoZ) data — human-simulated interactions that provided training material for speech recognisers and insights into how children actually talk when addressing these characters.

🏛️
HCA Museum, Odense
NISLab — Summer 2003

The detailed first HCA prototype specification was Wizard-of-Oz-simulated over two weeks at the Hans Christian Andersen Museum in Odense, Denmark. About 30 hours of English conversation were recorded and approximately 500 dialogues transcribed. Young museum visitors (age 9–18) talked to a cartoon-like animated HCA face via headset microphone, while the wizard controlled HCA's responses and lip movements in real time.

These recordings unexpectedly became a gold standard for human-natural conversation — the wizards' thoroughly human response style revealed conversation patterns (graceful handling of topic shifts, patience with slow users, willingness to follow the user's lead) that were later built into PT2.

📡
Telecom Museum, Stockholm
TeliaSonera — 2002–2003

TeliaSonera set up an animated speaking agent in a museum exhibition to continuously collect children's speech data from visitors. By December 2002 alone, ~20,000 utterances had been recorded and transcribed, of which 40% came from users aged 15 or under.

A separate Lo-Fi WoZ session tested an early FTW prototype: 16 adult subjects (300 utterances) interacted with Cloddy Hans on a massive 3×5 metre screen, helping him solve a simple visual puzzle. This data informed the FTW system design and speech recogniser language models.

🖊️
LIMSI Lab, Orsay
LIMSI-CNRS — 2002

LIMSI conducted WoZ experiments with a simple multimodal game featuring 2D cartoon characters. 7 adults and 10 children interacted using speech and pen gestures on an interactive pen display. Sessions were video-recorded and annotated using behavioural metrics and questionnaires. This early data informed the design of the gesture recognition, interpretation, and input fusion modules.

The "Gold Standard" Insight

The HCA museum WoZ data yielded an unexpected methodological insight: even though the wizards followed the detailed system specification, they did so in a thoroughly human way — understanding users' full communicative intention even when it was expressed obliquely across multiple turns. The resulting transcripts revealed that the simulated HCA "virtually always followed the user" when the user changed topic or domain — and this naturalness was later built into the second HCA prototype's conversation manager as a design principle.

Children's speech recognition was one of NICE's core technical challenges. ScanSoft (originally Philips Speech Processing) developed dedicated approaches for both the English (HCA) and Swedish (FTW) systems.

Swedish System (FTW)

Training from scratch was not feasible with limited in-domain data. Instead, a commercial adult-speech acoustic model was adapted using two combined techniques:

  • Frequency-domain speech transformation: children's input speech was mapped in the frequency domain to better match the adult-trained model → 22% WER reduction
  • Acoustic model adaptation: optimised adaptation algorithms on ~20 hours of children's data → ~45% relative WER reduction (additive to the 22%)
  • Noise and hesitation modelling also improved in both acoustic and language models
WER reduction from transformation: 22%
WER reduction from adaptation: ~45% (relative)

English System (HCA)

Sufficient NICE-specific data allowed training from scratch. Key complication: most users were non-native English-speaking Danish children, producing high noise and hesitation rates. Solution:

  • Iterative training: started with simple utterances, expanded the training set in criteria-governed steps. Standard full-data training produced significantly inferior models.
  • Vocabulary of ~1977 word forms in PT2
  • Out-of-vocabulary rate: only 2.5% (excluding proper names)

Comparison with state-of-the-art

Potamianos (2003) achieved similar improvements (~45% relative WER reduction) with 2–3× more data. NICE's results were therefore at or beyond the published state-of-the-art despite tighter data constraints.

For the Fairy-Tale World, TeliaSonera developed a bespoke unit-selection speech synthesis system from scratch for Swedish — giving each character a distinct voice that conveyed their personality. This went far beyond standard TTS in scope and design.

Design Criteria

Acoustic & Quality

  • High-quality, natural-sounding Swedish voices
  • Personal speaking style per character — not generic TTS
  • Faster than real-time speech generation
  • Prosodic control: emphasis and speaking rate
  • Voice design via acoustic signal processing (e.g. giant/gnome effects)

Conversational & Expressive

  • Filled pauses and filler words with prosodic variation — used for attitudinal feedback and turn management
  • Extralinguistic sounds: inhalations, coughs, laughter — convey emotional and attitudinal state
  • Easy to add new languages and build new voices
  • Designed for easy integration into a live dialogue system

Animation Coordination

  • Automatic generation of lip-sync track
  • Coordinated body and facial gesture output with verbal content
  • Sub-second control of character movement enabling highly reactive animated characters
  • Blending of multiple concurrent animation tracks (Liquid Media engine)
"The faster-than-real-time automatic generation of conversational spoken Swedish coordinated with facial as well as bodily animations for several ECAs with distinct personalities represents a breakthrough — to the best of our knowledge, this is the first time all these features have been combined in one and the same system."
— NICE Final Report, May 2005

For the HCA system, the English TTS was an off-the-shelf AT&T synthesiser selected for voice quality and suitability for a 55-year-old male character. Eleven visemes were used for lip synchronisation, with up to 17 non-verbal behaviour primitives per output turn.

Project Partners

NICE brought together expertise from dialogue systems, human-computer interaction, speech technology, computer animation, and games across four European countries.

🇩🇰 Denmark
NISLab
University of Southern Denmark, Odense
HC Andersen system: NLU, character modelling, response generation, emotion, evaluation. Led by Niels Ole Bernsen & Laila Dybkjær.
🇸🇪 Sweden
TeliaSonera R&D
(Voice Technologies)
Haninge / Farsta, Sweden
Fairy-tale game: NLU, dialogue management, speech recognition training, corpus collection, system integration. Team led by Joakim Gustafson, Johan Boye, Mats Wirén, Linda Bell and Anders Lindström.
🇫🇷 France
LIMSI-CNRS
Orsay, France
Gesture recognition, gesture interpretation, multimodal input fusion for both systems. Jean-Claude Martin, Stéphanie Buisine, Sarkis Abrilian.
🇩🇪 Germany
ScanSoft
Originally Philips Speech Processing — acquired by ScanSoft during the project
Speech recognition development and training. Dedicated acoustic models for children's and non-native speakers' voices — a central technical challenge. Swedish: adapted commercial adult models + frequency-domain speech transformation (45% + 22% WER reduction). English: trained iteratively from scratch on NICE data. Reinhard Blasig.
🇸🇪 Sweden
Liquid Media
Stockholm, Sweden
3D world creation, character animation, graphics rendering engine for the fairy-tale game. Morgan Fredriksson, Lasse Johannesson, Jürgen Königsmann.

Overview

NICE produced two complementary systems, each tackling the challenge of natural dialogue from a different angle. Both were tested with real children.

📜
The H.C. Andersen System
English · Museum Deployment

Users hold free-form conversation with an animated, life-like Hans Christian Andersen in his 19th-century Copenhagen study. Andersen knows about his life, his fairy tales, his surroundings — and reacts emotionally to the user.

  • Domain-oriented open conversation
  • Speech + 2D gesture input
  • Emotional state (happy, sad, angry, neutral)
  • Target: museums, 10–18 year olds
  • Language: English
🏰
The Fairy-Tale World
Swedish · Computer Game

A 3D computer game where children use speech and gesture to cooperate with Cloddy Hans and negotiate with Karen to progress through scenes inspired by Andersen's fairy tales.

  • Multi-character, multi-scene game
  • Spontaneous dialogue as primary game control
  • Cooperative + negotiation dialogues
  • Language: Swedish
  • 57 child users tested

The H.C. Andersen System

H.C. Andersen System Overview

The HC Andersen system aimed to demonstrate domain-oriented conversation — a form of interaction halfway between strict task-oriented dialogue and full Turing-test conversation. Users could talk to HCA about anything within his knowledge domains, in any order, with mixed initiative, just as in real social conversation.

HCA appeared in his 19th-century Copenhagen study (Figure: his virtual writing desk surrounded by pictures, a feather pen, travel bag and personal objects). Users could speak to him freely in English and point at objects in his study using a touch screen or mouse to prompt him to tell stories about them.

HCA had knowledge of: his fairy tales (The Little Mermaid, The Ugly Duckling, etc.), his life and childhood in Odense, his study and its objects, the user (he asked about age, name, country, interests), and meta-communication to handle errors and keep conversation flowing.

HCA's Personality

Core traits (from project specification): imaginative, sensitive, extrovert, 55ish — a child inside

  • Emotionally responsive — happy, sad, angry, or neutral depending on conversation
  • Pursues a general conversational agenda; initiates when user is silent
  • Asks about the user and remembers what they said
  • Tells stories, makes jokes, and reacts to compliments or insults
  • Gets annoyed if insulted; delighted if praised about his fairy tales
  • ~450 spoken utterance templates, ~130 non-verbal behaviour primitives
  • Up to 17 non-verbal primitives per output turn; 11 visemes for lip sync

Knowledge Domains

  • His fairy tales and literary works
  • His life (childhood in Odense, youth in Copenhagen)
  • His physical presence and study
  • The user (name, age, country, interests)
  • Games children like (shared ground)
  • Meta-communication (repeat, clarify, rephrase)

The "Domain-Oriented" Approach

Unlike task-oriented systems where the goal is to complete a booking or answer a query, HCA had no task constraints. Users could jump between topics, back-channel, ask personal questions, or even be rude. The system aimed to make HCA feel like a real character from history, not a database. The target was to produce genuine conversation for edutainment — learning about Andersen's life and era while being entertained.

Prototype History

  • Prototype 1 (PT1), January 2004: Wizard-of-Oz speech recognition (a human typed what the user said). Tested with children at NISLab. ~300 utterance types, 100 non-verbal primitives.
  • Prototype 2 (PT2), February 2005: Full speech recognition (ScanSoft, trained on ~50 hours of children's speech). Tested with 13 users (age 11–16, Danish children speaking English). 26 conversations, 8 hours of recorded speech.

The HCA system used an event-driven, modular, asynchronous architecture with modules communicating through the KTH Message Broker — an open-source TCP server that routed calls and time-stamped all events.

Input Pipeline
Microphone → Speech Recognition (ScanSoft SpeechPearl) → NLU (NISLab)
Touch Screen → Gesture Recognition (LIMSI) → Gesture Interpretation (LIMSI)

Fusion & Reasoning
Input Fusion (LIMSI) → Character Module / Mind-State Agent (NISLab) — conversation planning, domain agents, emotion calculator, knowledge base

Output Pipeline
Character Module → Response Generation (NISLab) → Speech Synthesis (AT&T) + Animation (Liquid Media)

Infrastructure
KTH Message Broker (TCP)
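The broker-centred flow above can be sketched in miniature. This is a hedged, in-process stand-in: the real KTH Broker was a TCP server routing time-stamped calls between separate processes, and the module and message names below are invented for illustration.

```python
import time
from collections import defaultdict

class MessageDispatcher:
    """Broker-style hub sketch: modules subscribe to message types, and
    every dispatched event is time-stamped and routed to all subscribers."""

    def __init__(self):
        self.subscribers = defaultdict(list)
        self.log = []  # time-stamped record of all events

    def subscribe(self, msg_type, callback):
        self.subscribers[msg_type].append(callback)

    def dispatch(self, msg_type, payload):
        event = {"type": msg_type, "payload": payload, "ts": time.time()}
        self.log.append(event)
        for callback in self.subscribers[msg_type]:
            callback(event)

# Wire up a toy input pipeline: ASR result -> NLU -> character module.
broker = MessageDispatcher()
understood = []

def nlu(event):
    # Pretend parse: spot a known keyphrase and tag its domain.
    if "fairy tale" in event["payload"]:
        broker.dispatch("nlu.result", {"domain": "works"})

def character_module(event):
    understood.append(event["payload"]["domain"])

broker.subscribe("asr.result", nlu)
broker.subscribe("nlu.result", character_module)
broker.dispatch("asr.result", "tell me a fairy tale")
print(understood)  # ['works']
```

Because modules only know message types, not each other, a component (say, a new gesture recogniser) can be swapped in by subscribing it to the same types — the property the event-driven design was chosen for.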

Natural Language Understanding (NISLab)

The NLU pipeline consisted of: (1) a Keyphrase Spotter that mapped phrases to syntactic/semantic categories; (2) a Syntactic Analyser with number spotting, lexicon lookup, and rule engine; (3) a Domain/Topic Spotter mapping categories to knowledge domains; and (4) an FSA Processor (Finite State Automaton) for deep parsing. The vocabulary had ~1977 word forms in PT2.
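Stages (1) and (3) of that pipeline can be sketched as table lookups. The lexicon entries below are hypothetical — the real NISLab lexicon had ~1977 word forms and far richer categories — but the phrase → category → domain mapping is the idea being shown.

```python
# Hypothetical lexicon fragments (invented for illustration).
KEYPHRASES = {
    "little mermaid": "fairytale_title",
    "ugly duckling": "fairytale_title",
    "odense": "place_childhood",
    "how old": "question_age",
}
CATEGORY_TO_DOMAIN = {
    "fairytale_title": "works",
    "place_childhood": "life",
    "question_age": "user_or_hca",
}

def spot(utterance):
    """Map surface phrases to semantic categories, then categories to
    knowledge domains."""
    text = utterance.lower()
    categories = [cat for phrase, cat in KEYPHRASES.items() if phrase in text]
    domains = sorted({CATEGORY_TO_DOMAIN[c] for c in categories})
    return categories, domains

print(spot("Tell me about the Ugly Duckling"))  # (['fairytale_title'], ['works'])
```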

Character Module — The Mind-State Agent

The heart of HCA's intelligence. Three output states: Non-Communicative Action (HCA working alone in his study), Communicative Function (real-time awareness of user input), and Communicative Action (HCA's actual conversational turns). Domain Agents handled specific knowledge areas (life, works, study, user, meta). The Emotion Calculator updated HCA's emotional state on each user turn. Up to 17 non-verbal behaviour primitives per output turn.

Gesture & Multimodal Input (LIMSI)

Three gesture shapes were recognized: pointer (pointing), connect (line), surrounder (circle). The Gesture Interpreter identified which of 16–21 "referenceable objects" in the study the user gestured at. Input Fusion merged speech semantics and gesture using temporal coincidence with configurable time windows.
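Temporal-coincidence fusion can be illustrated with a small sketch: pair each speech event with the gesture whose timestamp falls closest within a window, otherwise leave it speech-only. The 1.5-second window and the event values are assumed for illustration, not the project's actual settings.

```python
def fuse(speech_events, gesture_events, window=1.5):
    """Pair each speech event with the closest gesture within
    +/- window seconds of it; otherwise leave it unimodal."""
    fused = []
    for s in speech_events:
        nearby = [g for g in gesture_events if abs(g["t"] - s["t"]) <= window]
        best = min(nearby, key=lambda g: abs(g["t"] - s["t"]), default=None)
        fused.append({"text": s["text"],
                      "object": best["object"] if best else None})
    return fused

speech = [{"t": 10.2, "text": "what is this"}]
gestures = [{"t": 9.8, "object": "picture_jenny_lind"},
            {"t": 14.0, "object": "feather_pen"}]
print(fuse(speech, gestures))
# [{'text': 'what is this', 'object': 'picture_jenny_lind'}]
```

A deictic utterance like "what is this" only becomes interpretable once fused with the gesture that identifies the referenced object.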

The HCA PT2 was evaluated with 13 Danish children (age 11–16) speaking English (their second language) at NISLab in February 2005. Each session lasted 60–75 minutes.

Speech Recognition Performance

Perfectly recognised sentences (Danish group): 23%
Perfectly recognised sentences (English group): 33%
User input understood (Danish group): 49%
User input understood (English group): 60%
Understood + meta-communication handled (Danish group): 85%
Understood + meta-communication handled (English group): 87%

Usability Interview Scores (1=good → 3=poor)

Overall user evaluation: 1.5 / 3
Fun to talk to HCA: positive
Learned something: 12/13 users
Interested in spoken games: 10/13 users
Graphics quality: rated good
Gesture recognition accuracy: 87.2%

Key Findings

  • Users unanimously found talking to HCA fun, even those who struggled with the speech recognition
  • Most users learned something — about HCA's life, about English, or both
  • System scored 1.5/3.0 overall (1=great), an improvement over PT1's 1.8, despite PT2 being harder (real ASR vs. Wizard-of-Oz)
  • Biggest challenges: animation naturalness (the walk looked like gliding) and emotion expression (subtle facial changes that users did not notice)
  • All 18 referenceable objects were gestured at; average user gestured at 62% of them
  • Word Error Rate: ~70% (test condition 1) / ~61% (test condition 2) — children speaking English as second language

This is a composite example from two actual conversations with HCA PT2, showing mixed-initiative dialogue, emotional response, and topic shifting:

HCA: Hello there, welcome to my study. I am glad to meet you. It is always so interesting when I get visitors. I would like to ask you a few questions. Please tell me your name.
User: My name is Aske.
HCA: How old are you?
User: I'm twelve.
HCA: When I was your age I was a big fan of the theatre. Whenever there was an opportunity I went to the theatre. It was my favourite place. I would like to know if you are a boy or a girl?
User: I am a boy.
HCA: I was a curious boy. Once I ventured into the hospital for the dangerous lunatics. A woman sat in her cell singing with a beautiful voice. Suddenly she got up and ran screaming towards the door. She reached for me and her fingertips touched my clothes. I was half dead from fear when the attendant came. Which country do you come from?
User: I come from Denmark.
HCA: Excellent! Now let us talk. I can tell you a lot about my family and my childhood in Odense and about my youth in Copenhagen.
User: How old are you?
HCA: I was born in 1805 and died in 1875. So my lifetime lasted 70 years. It was a time full of great events in the world and also for me.
User: [gestures at a picture on the wall] What is this?
HCA: [turns to look at the picture, then back to user] Ah, that is a picture of Jenny Lind — the Swedish Nightingale, as she was called. She was a wonderful singer. I was very fond of her.

The Fairy-Tale World

From NICE Deliverable D7.2-2, Chapter 3

Fairy-Tale World System Overview

The NICE Fairy-Tale World is an interactive 3D computer game where children and teenagers engage in spontaneous spoken and gestural dialogue with animated fairy-tale characters. What sets it apart from other dialogue systems — and from other games — is that spoken dialogue is the primary means of game control. There are no menus, no text commands. You just talk.

"The interaction with the characters is domain-oriented. The dialogue concerns different subplots in the fairy-tales, but without a clear goal-orientation and without other demands than it being entertaining to the user."

The project was developed at TeliaSonera R&D in Stockholm, Sweden, led by Joakim Gustafson, Johan Boye, Mats Wirén, Linda Bell and Anders Lindström. The 3D world and character animation was created by game company Liquid Media.

Why games? Why children?

  • Computer games provide a natural, engaging context for spontaneous dialogue — users are motivated to communicate
  • Spoken interaction can enable richer, more immersive game experiences than menus and mouse clicks
  • Children are an underexplored user group for speech technology — their speech patterns and interaction styles differ from adults
  • Social dialogue, negotiation and cooperation — areas where speech excels — are natural game mechanics

Development Timeline

Early 2003: System design, character design (Liquid Media), message broker integration (KTH)
Jan 2004: Prototype 1 (Scene 1 only: lab + Cloddy Hans). Tested at the Technical Museum Stockholm with 10 children
Apr 2004: D7.2b evaluation report; data analysis; design of Prototype 2
Nov 2004: Prototype 2 testing begins — "School" session with 31 children (Scene 1)
Dec 2004: "Lab 1" session (11 teenagers, Scene 1, large display + gyro mouse)
Feb 2005: "Lab 2" session (20 children, Scenes 1+2, including Karen at the bridge)
Mar 2005: "Lab 3" session (13 children, Scenes 1+2). Corpus complete.
May 2005: Final evaluation deliverable D7.2-2 submitted

The fairy-tale world features five animated characters, each with a distinct personality encoded at every level — appearance, voice, vocabulary, gesture style, and long-term behaviour.

Cloddy Hans
The Helper Character

Loosely inspired by H.C. Andersen's story "Cloddy Hans." He is the user's faithful companion throughout the entire game. Seemingly slow-witted and not very bright, but honest, friendly, and anxious to help. His limitations are deliberate — they force the user to use multimodal dialogue to collaborate effectively.

Friendly Slow-witted Loyal Honest No hidden agenda Deep slow voice

OCEAN model: High Agreeableness, Low Conscientiousness, Low Openness. Users perceived him as kind, stupid, lazy, calm, polite.

Karen
The Gatekeeper Character

A sullen, uncooperative girl who guards the drawbridge separating the island from the fairy-tale world. She is the obstacle — and the negotiation partner. Karen will only lower the bridge when given an object she deems acceptable. She will reject the first two offers regardless.

Sullen Uncooperative Shrewd Defiant Secretive

Users perceived her as smart, quick, self-confident, defiant. Deliberately designed to contrast with Cloddy Hans — creating dramatic tension and encouraging children to take sides.

Thumbelina
Supporting Character (non-verbal)

Appears in the laboratory scene. Non-verbal — she points at slots in the fairy-tale machine to indicate where she wants objects placed, and expresses her satisfaction or discontent through large emotional body gestures when Cloddy Hans places objects correctly or incorrectly.

Non-verbal Expressive Gesture-based
The Prince
Feature Character

One of five animated characters developed for the full fairy-tale world. Feature characters have their own goals and drive sub-plots in the story, asking the user for help completing tasks in exchange for information or passage.

Goal-driven Feature character
"The characters need to be responsive, and be able to engage in conversation which makes sense to the player. Each character must have its own voice that conveys the nature of that character's personality, and be able to use prosodic cues to signal mood and emotions."

Character Classification

NICE introduced a three-way character taxonomy for interactive storytelling, extending the feature/supporting dichotomy from the literature:

  • Feature characters — drive the story, have their own goals, need a full NLU + dialogue manager
  • Supporting characters — provide atmosphere and react, like Thumbelina's non-verbal expressions
  • Helper characters (new) — like Cloddy Hans: guides the user, no goals of his own, but needs knowledge of all plots, cooperative conversation skills, and spatial awareness

Personality was not modelled as explicit internal parameters but used as a design guide to ensure consistent character behaviour perceived by users as compatible with intended traits. Every aspect of the characters — dialogue rules, wording, speaking style, non-verbal behaviour, idle animations — was designed around the OCEAN personality framework.

OCEAN Personality Profiles (Table 3.1)

Character | Openness | Conscientiousness | Extroversion | Agreeableness | Neuroticism
Cloddy Hans | Dunce | Thorough | Quiet · Uncertain | Friendly · Polite | Calm · Even-tempered
Karen | Intellectual | Frivolous | Outspoken · Self-confident | Unfriendly · Touchy | Anxious · Over-emotional

Voice Design (Table 3.2)

Character | Voice pitch | Speaking rate | Frequency range
Cloddy Hans | Low | Slow | Small
Karen | High | Fast | Large

Voice actors were instructed to read utterances in personality-matching styles; all recordings were then resampled — slowing Cloddy's voice so he sounds larger, and speeding up Karen's so she sounds younger. TD-PSOLA was used to modify both pitch and rate simultaneously.
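The simultaneous pitch-and-rate change is easy to state as arithmetic: playing a recording back at k times its original rate scales pitch up by k and duration down by k. A toy calculation with invented numbers:

```python
def resample_effect(pitch_hz, duration_s, rate_factor):
    """Playing audio at rate_factor x the original sample rate scales
    pitch up by rate_factor and duration down by the same factor --
    which is why plain resampling changes pitch and speaking rate
    together. Input values here are invented for illustration."""
    return pitch_hz * rate_factor, duration_s / rate_factor

# rate_factor < 1: lower pitch, slower speech (the Cloddy Hans direction);
# rate_factor > 1: higher pitch, faster speech (the Karen direction).
print(resample_effect(100.0, 1.0, 2.0))  # (200.0, 0.5)
```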

Animation & Non-Verbal Behaviour

Cloddy Hans — Introvert, Calm

  • Small, slow, deliberate body gestures generated by the AnimationHandler
  • Maintains attention on the user at all times while idle
  • Idle sequence: small encouraging gestures → impatience → walks around the scene
  • Displays states of mind: listening, thinking, understanding, turn regulation

Karen — Outspoken, Anxious

  • Larger, faster body gestures rendered directly by the animators
  • Enters idle phase faster — less patient than Cloddy Hans
  • Gaze wanders to the environment — less focused on the user
  • If ignored long enough, walks away — user must call her back to resume negotiation

Same Animation Set, Different Personality

Both characters share the same underlying animation primitives — because they are provided with identical animation sets, their non-verbal behaviours can even be swapped. The personality difference is expressed through the amplitude, speed, and frequency of identical gesture types, not through entirely different motion libraries. This was a key architectural decision, enabling efficient character creation.
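One way to picture this design is as per-character parameters applied to shared primitives. The parameter values below are invented for illustration; only the contrast between the two characters matters.

```python
from dataclasses import dataclass

@dataclass
class GestureStyle:
    """Personality expressed as parameters over shared animation
    primitives (numbers invented for illustration)."""
    amplitude: float      # how large the movement is
    speed: float          # playback-rate multiplier
    idle_patience: float  # seconds before idle behaviour starts

STYLES = {
    "cloddy_hans": GestureStyle(amplitude=0.4, speed=0.6, idle_patience=20.0),
    "karen":       GestureStyle(amplitude=1.0, speed=1.4, idle_patience=6.0),
}

def render_gesture(character, primitive):
    """One shared primitive, realised differently per character."""
    style = STYLES[character]
    return {"primitive": primitive,
            "scale": style.amplitude,
            "duration": round(1.0 / style.speed, 2)}

print(render_gesture("cloddy_hans", "nod"))  # small, slow nod
print(render_gesture("karen", "nod"))        # large, quick nod
```

Adding a new character then means defining a new parameter set, not animating a new motion library.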

Utterance Distribution by Character (Tables 3.5–3.7)

User Utterances (5,580 total)

Domain (instructions, questions): 39%
Repetition: 17%
Meta (help, "what can I do?"): 17%
Yes/No: 12%
Fragment: 8%
Social/fun (jokes, testing): 7%

Cloddy Hans Utterances (5,445)

Safe (confirmations, "I'm going to…"): 39%
Domain (subtask initiatives): 15%
Problem ("I didn't understand"): 15%
Meta (acknowledgements, fillers): 13%
Nice (social, polite): 12%
Bad (mocking Karen): 7%

Karen Utterances (255 total)

Domain (demands, what she wants): 45%
Nice (pleasantries): 27%
Bad (mockery, rejections): 27%
Problem (confusion): 0%

Karen never generates Problem or Safe utterances — by design, she always appears confident and in control, which makes her seem smarter than Cloddy Hans.

Scene 1: H.C. Andersen's Fairy-Tale Laboratory

The game begins in Andersen's Copenhagen study in the 19th century. Andersen has just left on a trip to Odense, leaving Cloddy Hans to guard his fairy-tale laboratory. At the centre of the lab is the fairy-tale machine — forbidden to all except Andersen himself.

Cloddy Hans has sneaked an object into the machine and wants to surprise Andersen with a new fairy-tale. But he can't understand the symbols labelling the machine's four slots: "useful", "magical", "precious", "dangerous". He needs the user's help to figure out which object goes where. On the shelf: a key, hammer, diamond, magic wand, sword, ruby, axe, and more.

Why this "grounding game" design?

  • Training: Users learn how to interact with Cloddy Hans through trial and error in a low-stakes task
  • Grounding: User and Cloddy agree on names for objects — "the ruby", "the sword" — which persist into Scene 2
  • Data collection: The repetitive structure makes it easy to measure dialogue success
  • Deliberate limitation: The user cannot directly touch objects — only dialogue moves the game forward

Scene 2: The Fairy-Tale World — The Drawbridge

When Cloddy Hans pulls the lever, the machine activates — and a trapdoor drops him (and the user) into the fairy-tale world. They land on a small island with a vast gap between them and the rest of the world: windmills, villages, fields, all unreachable. The objects from the machine are now scattered in the grass.

On the far side of the gap stands Karen, guarding a raised drawbridge. She refuses to lower it — unless the user negotiates successfully. She wants one of the objects on the island, but she is picky: she will reject the first two offers, and what she ultimately wants changes each playthrough. The user must discover what she wants, convince Cloddy Hans to pick it up and bring it to her, and complete the exchange.

Cloddy Hans's Side-Commentary

While the user talks to Karen (shown in camera), Cloddy Hans is out of view but still active. He receives BroadcastEvents from Karen — delivered because he is within audibility range even when off-camera — and uses these to fire contextual side-comments as hints, observations, and opinions:

  • "Maybe she will lower the bridge if we give her something nice."
  • "Girls like shiny things, don't they?"
  • "She is a bit grumpy today!"

This creates genuine three-party dialogue — user, Cloddy Hans, and Karen — with each character having distinct communicative roles.

Game Locations

  • H.C. Andersen's study / fairy-tale lab — Scene 1
  • The island — starting point of Scene 2
  • The gap and drawbridge — Karen's domain
  • The village — across the bridge
  • The windmill — a key location in the world

Characters navigate between locations using triggers — 3D zones that fire events when entered. Walking into a trigger could cause a character to turn, speak, or initiate a new dialogue phase.
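A trigger of this kind can be sketched as a zone that fires its event once on first entry. The zone name, coordinates, radius, and event label below are invented for illustration.

```python
import math

class Trigger:
    """Spherical trigger-zone sketch: fires its event once, the first
    time a position enters the radius. All values below are invented."""

    def __init__(self, name, center, radius, event):
        self.name, self.center, self.radius, self.event = name, center, radius, event
        self.fired = False

    def update(self, position):
        if not self.fired and math.dist(position, self.center) <= self.radius:
            self.fired = True
            return self.event
        return None

bridge_zone = Trigger("drawbridge_zone", (12.0, 0.0, 5.0), 2.0,
                      "start_negotiation_dialogue")
# A character walks across the scene; the trigger fires on the last step.
path = [(0.0, 0.0, 0.0), (8.0, 0.0, 4.0), (11.5, 0.0, 5.2)]
events = [e for p in path if (e := bridge_zone.update(p)) is not None]
print(events)  # ['start_negotiation_dialogue']
```

The fired event would then be dispatched to the dialogue manager, which is how walking into a zone can make a character turn, speak, or start a new dialogue phase.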

Design Requirements (from D1.1b)

The following seven requirements guided the overall game design from the start of the project:

  1. Spoken multimodal dialogue should be appreciated as the primary means of progressing in the game, not merely an add-on
  2. Primary intended users are children aged nine and older
  3. Domain built on a small collection of autonomous fairy-tale characters from H. C. Andersen
  4. Both the overall story and component subplots should have goals easily explained in ten seconds
  5. The introduction to multimodal dialogue should be part of the game and fun in its own right
  6. The game should be "many-off" — player and character actions influence events and how the story ends, so each playthrough can differ
  7. The game should be enjoyable regardless of whether played to completion or just for one plot

Narrative Progression — Story-Functional Events

The concept of a story-functional event is central to measuring narrative progression: a key moment in the story that represents genuine advancement and cannot be undone. Progress is defined as the set of story-functional events achieved.

Scene 1 — The Laboratory

  1. Cloddy Hans introduces himself
  2. Cloddy Hans introduces the plot
  3. Cloddy Hans picks up an object for the first time
  4. Cloddy Hans drops an object in the fairy-tale machine (irreversible — can't be retrieved)
  5. Cloddy Hans pulls the lever → they enter the fairy-tale world

Scene 2 — The Fairy-Tale World

  1. Cloddy Hans informs user about the fairy-tale world
  2. The user meets Karen for the first time
  3. Cloddy Hans gives his opinion of Karen
  4. Karen gives her opinion of Cloddy Hans
  5. Karen demands payment to lower the bridge
  6. The user offers Karen an object as payment
  7. Karen lowers the drawbridge
  8. Cloddy Hans crosses the drawbridge
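The set-based definition of progress can be made concrete in a few lines. The event labels below paraphrase the Scene 2 list above; the point being illustrated is that recording is irreversible and progress is just the achieved set.

```python
# Labels paraphrase the Scene 2 story-functional events listed above.
SCENE2_EVENTS = [
    "cloddy_describes_world", "user_meets_karen",
    "cloddy_opinion_of_karen", "karen_opinion_of_cloddy",
    "karen_demands_payment", "user_offers_object",
    "karen_lowers_bridge", "cloddy_crosses_bridge",
]

class Progress:
    """Progress = the set of story-functional events achieved so far.
    Events can be added, never removed."""

    def __init__(self, all_events):
        self.all_events = list(all_events)
        self.achieved = set()

    def record(self, event):
        self.achieved.add(event)

    def fraction_complete(self):
        return len(self.achieved) / len(self.all_events)

p = Progress(SCENE2_EVENTS)
for e in ["cloddy_describes_world", "user_meets_karen",
          "karen_demands_payment"]:
    p.record(e)
print(p.fraction_complete())  # 0.375
```

Because a set ignores repeats, replaying a conversational loop (e.g. offering Karen several rejected objects) does not inflate the progress measure.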

Speech Rate Adaptation Across Phases

All 5,580 utterances were aligned at phoneme level. Duration z-scores were computed across all 85,000 phonemes. Key findings:

With Cloddy Hans

  • Users took fewer turns per object over time (learning), but spoke progressively slower — hyper-articulating to ensure the sluggish character understood
  • In Scene 2, they initially spoke faster, then slowed again when talking with Cloddy
  • Confirmed earlier Bell, Gustafson & Heldner (2003) finding that users adapt speech rate to system pace

With Karin

  • Users spoke significantly faster with Karin than in the Cloddy dialogues (Wilcoxon, p<0.05)
  • Speech rate increased further with each successive Karin turn — the negotiation felt more lively and dynamic
  • Karin drives the dialogue and doesn't show comprehension difficulties, making the interaction feel quicker

The Fairy-Tale World system is event-driven and asynchronous — enabling characters to be autonomous when no-one is talking to them and fully reactive when the user engages. All modules communicate via a central Message Dispatcher hub. Figure 3.5 (from NICE Deliverable D7.2-2) shows the complete architecture.
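
The hub-and-spoke idea can be sketched in a few lines — module names here are illustrative, and the real message protocol is defined in the NICE deliverables:

```python
# Minimal sketch of a central Message Dispatcher: all inter-module traffic
# passes through the hub, so modules only know the dispatcher, not each other.
class MessageDispatcher:
    def __init__(self):
        self.modules = {}          # module name -> receive callback

    def register(self, name, callback):
        self.modules[name] = callback

    def send(self, sender, receiver, message):
        """Route a message from one registered module to another."""
        if receiver in self.modules:
            self.modules[receiver](sender, message)

hub = MessageDispatcher()
log = []
hub.register("dialogue_manager", lambda s, m: log.append((s, m)))
hub.register("asr", lambda s, m: None)
hub.send("asr", "dialogue_manager",
         {"type": "parserInput", "act": "request(...)"})
print(log)
```

Because modules are decoupled this way, a module can be replaced (or run on a different machine, as in the three-machine setup below) without the others noticing.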

Figure 3.5. The NICE Fairy-Tale World system architecture (NICE Deliverable D7.2-2).

Speech Recognition

  • Scansoft OSR3; bigram statistical language model
  • Symmetric design: LM trained on both what Cloddy Hans says and what users say to him
  • Domain model: all object/slot names, action verbs, clarification questions, grounding and error-handling utterances
  • Supplemented with socialising utterances from August and Pixie corpora

Animation System (Liquid Media)

  • Characters react with: emotional displays, state-of-mind signals, turn regulation gestures, attention gestures
  • Can look at and point at interactive/non-interactive objects and landmarks
  • Idle behaviour: encouraging gestures → impatience → walking around
  • 8 concurrent animation tracks; lip-sync track from TTS time-stamps

Hardware Setup

The system ran on three machines: a high-performance desktop running animation/rendering, speech synthesis, and gesture recognition; one laptop running ASR, the dispatcher, and the supervision GUI; a second laptop running NLU, dialogue managers, NLG, and animation planner.

Users interacted via a wireless headset microphone and a gyro mouse, facing a large back-projection screen that showed Cloddy Hans at life size (180 cm tall).

Supervision Interface (Figure 3.4)

Figure 3.4. The supervision interface.

For data collection, the system ran in supervised mode: an operator could approve the top suggestion, select from the N-best list, edit the ASR result, or type free text — enabling high-quality data collection while the system was still imperfect.

Natural Language Understanding: Robust Parsing

The NLU component addressed interpreting noisy, spontaneous, child-directed speech. The parser works in two phases:

  1. Pattern-matching phase (domain-dependent): Scans ASR output left-to-right, accumulating typed semantic constraints from syntactic patterns — discarding unrecognised words.
  2. Rewriting phase (domain-independent): Applies object merging, constraint inference, filtering, and abstraction to produce the final dialogue act tree.

"Pick up the axe" → request(user, cloddy, pickUp(cloddy, axe)). Concept accuracy: 53% from 1-best ASR, 70% from 10-best — a major improvement over raw ASR word accuracy of 39%.

Speech Synthesis: Snacka Unit-Selection TTS

Snacka — developed by Kåre Sjölander at KTH in collaboration with TeliaSonera — was the first unit-selection TTS system built for Swedish. Cloddy Hans has a slow, deep voice (recordings slowed via TD-PSOLA); Karin has a faster, sharper delivery. The synthesiser outputs a time-stamped phone transcription that drives lip-sync and gesture coordination in the Liquid Media animation engine.
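
The coupling between synthesis and animation can be sketched as follows. The phone-to-viseme mapping below is invented for illustration; the real mapping lives inside the Liquid Media animation engine:

```python
# Hedged sketch: turning a time-stamped phone transcription from the
# synthesiser into lip-sync keyframes for the animation track.
# PHONE_TO_VISEME is an illustrative stand-in, not the actual mapping.
PHONE_TO_VISEME = {"m": "closed", "a": "open_wide", "o": "rounded", "s": "narrow"}

def lipsync_keyframes(phones):
    """phones: list of (phone, start_s, end_s) from the TTS output.
    Returns (time, viseme) keyframes aligned with the audio."""
    frames = []
    for phone, start, _end in phones:
        viseme = PHONE_TO_VISEME.get(phone, "neutral")
        frames.append((start, viseme))
    return frames

print(lipsync_keyframes([("m", 0.00, 0.08), ("a", 0.08, 0.22)]))
# [(0.0, 'closed'), (0.08, 'open_wide')]
```

The same time-stamps can also anchor gesture strokes, which is why the synthesiser's timing output matters beyond lip movement.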

Kernel + Scripting Architecture

The NICE fairy-tale dialogue manager uses a two-layer architecture: a kernel providing generic dialogue management functionality common to all characters and situations, and scripting code that specialises the kernel for a particular character, scene, or phase. This mirrors game scripting patterns used in commercial game engines.

Why kernel + scripting?

  • Generic functions (input handling, reference resolution, belief reasoning, response generation) in the kernel — written once
  • Character personality, scene-specific behaviour, phase transitions — in modular scripting code
  • New scenes and characters can be added without modifying the kernel
  • Theoretical benefit: distinguishes genuinely generic dialogue management from situation-specific aspects
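
The split can be illustrated with a minimal sketch: a generic kernel dispatches dialogue events to handlers that scripting code registers per scene and character (class and event names here are illustrative):

```python
# Sketch of the kernel-plus-scripting split.
class DialogueKernel:
    """Generic layer: written once, shared by all characters."""
    def __init__(self):
        self.handlers = {}   # (scene, event_type) -> callback

    def on(self, scene, event_type, handler):
        """Scripting code calls this to specialise behaviour."""
        self.handlers[(scene, event_type)] = handler

    def dispatch(self, scene, event_type, payload):
        handler = self.handlers.get((scene, event_type))
        return handler(payload) if handler else None

# Scripting layer: specialises the kernel for Cloddy Hans in the lab scene.
kernel = DialogueKernel()
kernel.on("lab", "ParserEvent", lambda act: f"cloddy reacts to {act}")
kernel.on("lab", "TimeOutEvent", lambda _: "cloddy offers a hint")

print(kernel.dispatch("lab", "TimeOutEvent", None))  # cloddy offers a hint
```

Adding a new scene means registering a new set of handlers; the kernel itself stays untouched, which is exactly the portability argument made above.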

Event-Driven Dialogue Model

The dialogue manager operates on a stream of dialogue events — both external (from other modules) and internal (from the kernel's reasoning). Scripting code registers callbacks for events, creating asynchronous, flexible behaviour.

Event Type | Description
ParserEvent | Parser analysed the user's latest utterance and produced a dialogue act
GestureEvent | Gesture interpreter identified an object the user pointed at (multimodal input)
RecognitionFailureEvent | ASR detected speech but could not recognise it
PerformedEvent | Animation system completed a requested action, carrying an ok or failed flag; essential for sequencing agenda goals
BroadcastEvent | Another character spoke or performed an action — only delivered if the sending character is within visibility range and/or audibility distance of the receiving character
TriggerEvent | Character entered or exited a 3D trigger zone while walking — used for contextual remarks, head-turns, waypoints, and camera changes
WorldEvent | A world-state change: object inserted into machine slot, bridge raised or lowered, etc.
TimeOutEvent | Silence threshold exceeded — character should prompt, hint, or take initiative
NoReactionEvent | Character's agenda is empty and no incoming events are pending — must decide whether to act proactively
IntentionEvent | Character has formed an intention to say or do something (internal)

World Model and Task Planning

Each character maintains an inner state consisting of: a world model (beliefs about the 3D world), a discourse history (past interactions), and an agenda (tree-structured current goals and their causal relations).

Task specification uses declarative rules encoding action-world relationships, similar to STRIPS planning but with non-monotonicity: true propositions can become false (another character might move an object). Preconditions are always checked before execution. The agenda tree makes reasoning explainable — Cloddy Hans can say "I am going to the shelf because I want to pick up the axe because you asked me to."

Turn-Taking & Camera Control

The character in focus is always in camera. The user can change dialogue partner — and camera — by saying the character's name ("Cloddy" / "Karin"). The system also triggers camera changes automatically, such as when Cloddy Hans reaches the bridge, causing the camera to swing to show Karin. Characters can make side-comments without being in camera (Cloddy whispering hints while Karin is on screen). This enables genuine three-party dialogue without parallel rendering.

Asynchronous Dialogue

Unlike the strict turn-alternation model used by virtually all spoken dialogue systems of the era — where user speech was the only input channel — NICE characters are driven by five independent event channels running simultaneously: (1) user dialogue acts from the parser; (2) completion signals from the character's own actions (PerformedEvent, ok/failed); (3) broadcast events from other characters, filtered by spatial visibility and audibility; (4) 3D trigger zones entered during movement; and (5) timeouts and empty-agenda signals. A character does not wait for the user to speak — any of these channels can push dialogue forward at any moment. This means the user can give a new instruction while Cloddy Hans is already walking, and when he arrives his PerformedEvent picks up the thread — creating dialogue that feels genuinely alive rather than turn-by-turn.
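
A minimal sketch of the idea, using Python threads to stand in for the independent modules: several producers feed one event queue, and the character simply reacts to whichever event arrives first. Event names follow the text; everything else is invented for illustration.

```python
# Sketch of asynchronous, multi-channel dialogue progression: any channel
# can push an event into the queue at any moment; the character never
# waits specifically for user speech.
import queue
import threading
import time

events = queue.Queue()

def channel(name, delay):
    """Stand-in for an independent event source (ASR, animation, timers...)."""
    time.sleep(delay)
    events.put(name)

# Two channels race: the animation system confirms a completed walk while
# the parser is still working on the user's next utterance.
threading.Thread(target=channel, args=("PerformedEvent", 0.01)).start()
threading.Thread(target=channel, args=("ParserEvent", 0.05)).start()

first = events.get()    # character reacts to whichever event arrives first
second = events.get()
print(first, second)
```

In the real system the dialogue manager's scripting code would dispatch each event to a registered handler; the point here is only that the queue, not the user's turn, drives progression.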

Data Collection

Data was collected across four sub-corpora over five months during 2004–2005, developing the system iteratively between each session.

Condition | School | Lab 1 | Lab 2 | Lab 3
Date | Nov–Dec 2004 | Dec 2004 | Feb 2005 | Mar 2005
Location | School (small room) | TeliaSonera Vision Centre | TeliaSonera Multimodal Lab | TeliaSonera Multimodal Lab
Display | CRT, mouse | Large display, gyro mouse | Large display, gyro mouse | Large display, gyro mouse
Data | Audio, logs | Audio, video, logs | Audio, video, logs | Audio, video, logs
Gameplay | Scene 1 | Scene 1 | Scene 1+2 | Scene 1+2
Age span | 8–11 | 14–15 | 9–10 | 11–12
Net users | 26 | 7 | 15 | 9
57 · Total child users
5,580 · User utterances
5,583 · Cloddy Hans turns
255 · Karin turns
90 · Avg turns per user
~6h · Speech recorded

Results

Gameplay Questionnaire (median, 1–5)

The game was fun | 4.0
The game was interesting | 4.0
Easy to get started | 4.0
I understood what to do | 3.5
The game was easy | 3.0
The game was irritating | 2.0 (low = good)

Character Personality Perception (Wilcoxon p<0.05)

Cloddy Hans rated as:

  • Kind
  • Stupid
  • Lazy
  • Calm
  • Polite
  • Distressed

Karin rated as:

  • Smart
  • Quick
  • Self-confident
  • Defiant
  • Secretive

ASR Performance

Word accuracy | 39% (1-best), 55% (oracle 10-best)
Concept accuracy (1-best) | 53%
Concept accuracy (10-best) | 70%

Utterance Types (5,580 user utterances)

Domain (commands, questions) | 39%
Repetition | 17%
Meta (help requests) | 17%
Yes/No | 12%
Social/fun | 7%

Usability Evaluation

Key Findings

  • Most users found it natural to use speech in games and expected future games to work this way
  • Cloddy Hans's sluggishness was perceived as intentional design; Karin's negotiations were considered fun
  • A few users said speaking with the characters was "almost like talking to real persons"
  • No significant effect of dialogue success on perceived fun — fun was independent of task completion
  • Personality traits "smart" and "stupid" were not affected by Cloddy Hans's actual dialogue performance — the strong visual/acoustic persona design overrode evidence from dialogue quality

Conversational Abilities — Cloddy vs Karin

Cloddy Hans perceived as more…

  • Obedient (does what users tell him)

Karin perceived as more…

  • Understands the user better
  • Easier to understand
  • More like a real human
  • Does what she wants herself

Dialogue Phenomena

Children produced rich, socially naturalistic dialogue behaviour previously unseen in human-machine interaction. Six phenomena are documented with actual transcripts.

The conflict between Cloddy Hans and Karin was a deliberate plot device — each character tried to get the user to take their side.

Karin: Why are you dragging along that Cloddy Hans by the way?
User: Because he is fun.
Cloddy: I don't like talking to her.
Karin: Why do you keep dragging along that Cloddy Hans figure, by the way?
User: Because he is fair — a lot more so than you, in any case.
Karin: If you want me to lower the bridge, I want something in return.
Cloddy: She is so snappy.
User: You are very very very kind. [sarcastically]
Cloddy: Yes! Now we have crossed the bridge!
Karin: Hey — wasn't I supposed to get something in return?
User: Give her the emerald.
Cloddy: OK. [hands over emerald]
Karin: Thanks! It's a pleasure doing business with you.
User: I know! [sarcastically]
Cloddy: I have a bit of a problem in telling right and left apart.
User: My brain is disconnected. And this occurred to you only now, or what?

Users were free to use their own words. Most picked up Cloddy Hans's terminology, and several reported that figuring out which words to use was itself part of the game.

User: Put the hammer by er — that tube with the hammer and key.
Cloddy: Is it OK if we name that "useful"?
User: Take the bag in your hand, the sack ehh —
Cloddy: I'm already holding the sack.
User: Take the book from the shelf as well.
Cloddy: I can only carry one thing at a time.

Several users temporarily digressed from the task — then returned. This is impossible in menu-driven interfaces.

User: Try pulling the lever.
User: Pick up that little man. [about Thumbelina]
Cloddy: Now I don't understand. Oh — you mean Thumbelina.
User: Ask her something.
Cloddy: I don't think she wants that.
User: Coward! [laughter]
Cloddy: What did you say about me?
User: Put that lamp in the pipe to your farthest right.

When Cloddy asked "why?", a user spontaneously explained the goal — motivating the character. Natural in human conversation, remarkable in HCI.

Cloddy: Why do you want me to do that?
User: To put the wand in the magic slot.
Cloddy: I'm going to walk over to the machine.
User: Put the wand in the magic tube.
Cloddy: I'm going to put the wand in magic.

A user accused Cloddy Hans of still holding the diamond despite two denials. When proven wrong, they apologised — entirely natural social behaviour.

User: Would you mind letting go of the diamond?
Cloddy: I've already done that.
User: Put the diamond down. Put the diamond down Cloddy —
Cloddy: I've already done that.
User: Oh — it seems you actually did! Could you pick up the emerald?

Users referenced shared dialogue history ("what was her name — now again?") and made spatial references to the 3D world as if it were real.

User: What was her name — now again? [lowers voice, sounds intimate]
Cloddy: Her name is Karin.
User: Oh yes! Karin!
Cloddy: Take it easy — I am dashing! [appears to climb bridge rail and levitate]
User: What are you doing up there?! Come down immediately, Cloddy Hans!
"The addition of spoken language and social relations between several animated characters and the user actually did evoke social behaviour and emotively coloured reactions."
— NICE D7.2-2, Chapter 3 Conclusions

Robust Spoken Language Understanding in a Computer Game

Johan Boye, Joakim Gustafson, Mats Wirén · TeliaSonera R&D, Haninge, Sweden
Speech Communication 48(6), pp. 335–353, Elsevier, 2006 · Special issue on Spoken Language Understanding

This journal paper presents and rigorously evaluates the spoken language understanding component of the NICE Fairy-Tale World. It describes a novel semantic representation formalism and a two-phase robust parsing algorithm designed to extract meaningful dialogue acts from the noisy, disfluent, child-directed speech produced by young users — even when the ASR word error rate is as high as 61%.

User utterances are mapped to typed, tree-structured dialogue acts — a semantic representation that sits between the shallow slot-filler lists typical of task-oriented systems and the full logical forms of general-purpose NLP. The type system enforces constraints on argument positions, providing both robustness and expressiveness.

Dialogue Act Types

Core Types

Type | Example Utterance | Representation
request | "Pick up the axe" | request(user, cloddy, pickUp(cloddy, axe))
ask | "What color is the ruby?" | λx.ask(user, cloddy, x[ruby.color=x])
tell | "I'm fourteen years old" | tell(user, cloddy, 14[user.age=14])
offer | "I'll give you the ruby" | offer(user, karen, ruby)
confirm | "Yes, do that" | confirm(user, cloddy)
disconfirm | "No! Stop!" | disconfirm(user, cloddy)
askForHelp | "What should we do?" | askForSuggestion(user, cloddy)

Action Types (World Model)

Action | Argument Structure
goTo | goTo(character, place)
pickUp | pickUp(character, thing)
putDown | putDown(character, thing, location)
giveTo | giveTo(character, thing, character)
raiseDrawbridge | raiseDrawbridge(character)
lowerDrawbridge | lowerDrawbridge(character)

Type constraints — e.g. the first argument of pickUp must be a character, the second a thing — let the parser infer types and reject impossible parses without full logical reasoning.

Anaphora via Lambda Abstractions

Anaphoric utterances ("pick it up", "put it down there") are represented as functions over the dialogue context using typed lambda abstractions — a key innovation over slot-filler systems:

User: "Cloddy Hans, please pick up the axe."
Parse: request(user, cloddy, pickUp(cloddy, axe))
User: "Now put it down."
Parse: λx:thing λy:location . request(user, cloddy, putDown(cloddy, x, y))
Context: The lambda function is then applied to the most salient thing (the axe, established in the previous turn) and a location inferred from context — yielding a fully-grounded action.
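
The mechanism can be sketched directly, with the lambda abstraction as a Python function and a toy salience ranking standing in for the dialogue context (representations here are simplified strings, not the system's typed trees):

```python
# Sketch of contextual interpretation: a lambda-abstracted dialogue act is
# applied to the most salient entities of the required types.
def parse_anaphoric():
    # "Now put it down." -> λx:thing λy:location .
    #     request(user, cloddy, putDown(cloddy, x, y))
    return lambda x, y: f"request(user, cloddy, putDown(cloddy, {x}, {y}))"

# Toy salience ranking, most salient first (the axe was just mentioned).
salience = [("axe", "thing"), ("machine", "location"), ("shelf", "location")]

def most_salient(wanted_type):
    for name, etype in salience:
        if etype == wanted_type:
            return name
    return None

act = parse_anaphoric()
grounded = act(most_salient("thing"), most_salient("location"))
print(grounded)  # request(user, cloddy, putDown(cloddy, axe, machine))
```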
"Previous systems based on pattern matching have been restricted to producing relatively simple semantic structures, such as variable-free slot–filler lists. Unfortunately, such structures are not suitable as input to a dialogue manager in our domain, which involves information-seeking utterances, commands and simple negotiation."

The robust parser works in two phases. The key insight is that meaning can be recovered even when words are missing, inserted, or wrong — because the algorithm throws away words that don't match any pattern and keeps accumulating constraints from whatever does match.

Phase 1: Pattern-Matching (Domain-Dependent)

  • Scans the ASR word string left-to-right
  • Tries to match each position against DCG (Definite Clause Grammar) rules
  • On match: appends semantic constraints, advances past matched segment
  • On failure: discards the first word and tries again — this is the robustness mechanism
  • Accumulates a flat list of typed semantic constraints

Handling Disfluencies

Input: "take the ehh hammer"
The full rule take the hammer fails because of "ehh". But the shorter rule take matches. Then "the" and "ehh" are discarded one by one. Finally "hammer" matches the existence constraint hammer:thing. Result: correct constraints extracted despite the disfluency.
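
The discard-and-retry mechanism can be sketched in a few lines. The patterns and constraint names below are illustrative stand-ins, not the actual DCG rules:

```python
# Sketch of the robustness mechanism: scan left-to-right, try patterns at
# each position, and discard the first word whenever no pattern matches.
PATTERNS = {
    ("take",): ["action:pickUp"],
    ("hammer",): ["hammer:thing"],
    ("axe",): ["axe:thing"],
}

def robust_parse(words):
    constraints = []
    i = 0
    while i < len(words):
        for pattern, cs in PATTERNS.items():
            if tuple(words[i:i + len(pattern)]) == pattern:
                constraints.extend(cs)        # match: accumulate constraints
                i += len(pattern)             # advance past matched segment
                break
        else:
            i += 1                            # no match: discard word, retry
    return constraints

print(robust_parse("take the ehh hammer".split()))
# ['action:pickUp', 'hammer:thing']
```

Because unmatched words are simply dropped, ASR insertions and disfluencies degrade the constraint set gracefully rather than failing the parse outright.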

Phase 2: Rewriting (Domain-Independent)

  • Object merging: combines compatible constraints referring to the same entity
  • Constraint inference: derives implicit type constraints (e.g. if something is being picked up, it must be a thing)
  • Filtering: removes inconsistent or underspecified constraint sets
  • Abstraction: constructs the final typed tree-structured dialogue act
  • Domain-independent — the same rewriter works for any domain; only the pattern matcher needs rewriting when porting

N-Best Input

The pattern-matching phase operates on the ASR 1-best hypothesis by default, but the paper shows that using the 10-best list can raise concept accuracy from 53% to 70% — a major potential gain for future systems.

Three Types of Semantic Constraints

Existence Constraints

The most common type. Infers that an object of a certain type exists from the utterance. e.g. the word "hammer" triggers: hammer:thing

Disequality Constraints

Asserts two objects are different. e.g. "What is…" implies the user is asking someone else: ask(user, x, y) with x ≠ user

Equality Constraints

Relates objects to attributes of other objects. e.g. "Where is…" implies: y = z.position — the thing being asked about is the position of some object z.

The parser was evaluated on 3,400 utterances from the Swedish NICE corpus — spontaneous speech from children aged 9–15 collected over 4 sessions. All figures are based on unedited ASR output; the supervised-mode operator's corrections were not used for evaluation.

Corpus Statistics

3,400 · Utterances evaluated
810 · Unique words
11,925 · Word tokens
14.4% · OOV rate
525 · System vocabulary

Performance Results

Speech Recognition Accuracy

Word accuracy (1-best ASR) | 39%
Word accuracy (oracle 10-best) | 55%

Word accuracy is the bottleneck. Even with a perfect oracle choosing the best hypothesis from the 10-best list, accuracy only reaches 55% — reflecting how hard children's spontaneous speech is for ASR systems.

Parser Semantic Accuracy

Concept accuracy (1-best input) | 53%
Concept accuracy (10-best input) | 70%
Parser improvement over raw ASR | +14 pts

Concept accuracy measures how many semantic nodes in the dialogue act tree are correct — a finer-grained metric than sentence accuracy. The parser recovers substantial meaning even from very noisy input.

The Core Finding

  • ASR word accuracy: only 39% — the recognition of children's spontaneous speech is extremely hard
  • Yet concept accuracy reaches 53% from 1-best, and 70% from 10-best — the parser "rescues" a large portion of meaning that would be lost if only the raw ASR string were used
  • The 17-point gap between 1-best (53%) and 10-best (70%) concept accuracy represents "potentially a lot to be gained" by N-best re-ranking in future systems
  • The typed tree structure was essential for negotiative dialogue (Karen scene) — flat slot-filler representations cannot express the alternative-referencing needed for bargaining utterances like "I'll give you the ruby if you lower the bridge"

Measurement Definitions

Metric | Definition | Analogue
Sentence accuracy | % of utterances where ASR output exactly matches transcription | Standard ASR metric
Word accuracy | % of word tokens correctly recognised (accounting for substitutions, insertions, deletions) | Standard WER-based metric
Semantic accuracy | % of utterances where the parser output exactly matches the gold-standard dialogue act | Semantic analogue of sentence accuracy
Concept accuracy | % of tree nodes in the dialogue act correctly reconstructed by the parser | Semantic analogue of word accuracy
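
Concept accuracy can be illustrated as a node-level comparison of dialogue-act trees. This is a simplified sketch of the idea (the paper's exact scoring procedure may differ in details such as node alignment):

```python
# Sketch: concept accuracy as the fraction of gold-tree nodes the parser
# reconstructed. Trees are (label, children) tuples for illustration.
def nodes(tree):
    """Flatten a (label, children) tree into a list of labelled paths."""
    label, children = tree
    out = [label]
    for child in children:
        out.extend(f"{label}/{p}" for p in nodes(child))
    return out

def concept_accuracy(hypothesis, gold):
    hyp, ref = nodes(hypothesis), nodes(gold)
    correct = sum(1 for n in ref if n in hyp)
    return correct / len(ref)

gold = ("request", [("user", []), ("cloddy", []),
                    ("pickUp", [("cloddy", []), ("axe", [])])])
hyp = ("request", [("user", []), ("cloddy", []),
                   ("pickUp", [("cloddy", []), ("hammer", [])])])
print(concept_accuracy(hyp, gold))  # 5 of 6 nodes correct, ~0.83
```

A single wrong leaf ("hammer" for "axe") costs one node out of six, whereas exact-match semantic accuracy would score the whole utterance as wrong — which is why concept accuracy is the finer-grained metric.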

Parsing produces a context-independent semantic representation — the meaning of an utterance without resolving references to the dialogue context. A second step, contextual interpretation, resolves anaphora (pronouns, ellipsis) by applying the lambda-abstract dialogue acts to the most salient discourse entities.

Reference Resolution in Action

User: "Cloddy Hans, please pick up the axe and put it in the machine."
Cloddy: "OK, I'll do that." (Picks up the axe)
User: "Now put it down."
Parse: λx:thing λy:location . request(user, cloddy, putDown(cloddy, x, y))
After context: request(user, cloddy, putDown(cloddy, axe, machine))
Cloddy: "OK." (Puts the axe in the machine)

Focus Management

The contextual interpretation step uses a focus management algorithm that maintains a ranked set of salient objects — the entities most recently mentioned or acted upon. When a pronoun ("it") or an elliptic utterance ("now the hammer") is encountered, the algorithm applies the lambda function to the most salient entity of the appropriate type.

Focus management is tightly integrated with the dialogue manager (described in D5.2b) rather than the parser, since salience depends on the full history of actions and utterances — not just the current turn's linguistic content.
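
A minimal sketch of such a salience ranking, assuming a simple recency-based ordering (the data structures here are invented for illustration, not taken from D5.2b):

```python
# Sketch of focus management: a salience ranking updated on each mention
# or action, consulted when resolving "it" or elliptic references by type.
class FocusStack:
    def __init__(self):
        self.ranked = []            # most salient first: (entity, type)

    def mention(self, entity, etype):
        """Promote an entity to most-salient on mention or action."""
        self.ranked = [(e, t) for e, t in self.ranked if e != entity]
        self.ranked.insert(0, (entity, etype))

    def resolve(self, wanted_type):
        """Return the most salient entity of the required type, if any."""
        for entity, etype in self.ranked:
            if etype == wanted_type:
                return entity
        return None

focus = FocusStack()
focus.mention("shelf", "location")
focus.mention("axe", "thing")
focus.mention("machine", "location")
print(focus.resolve("thing"))      # axe  ("put it down" -> the axe)
print(focus.resolve("location"))   # machine (most recent salient location)
```

The type filter is what makes the typed lambda abstractions pay off: "put it down" asks for the most salient thing, not merely the most recent entity of any kind.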

Why This Matters for Games

In a fast-paced game interaction, users naturally use pronouns and ellipsis: "pick up that one", "now the red one", "put it there" while pointing at the screen. A system that cannot handle anaphora forces users into unnatural, verbose instructions for every single action — breaking immersion and making the game feel like a command-line interface rather than a conversation.

Paper Citation

Boye, J., Gustafson, J., & Wirén, M. (2006). "Robust spoken language understanding in a computer game." Speech Communication, 48(6), pp. 335–353. Elsevier. Special issue on spoken language understanding. DOI: 10.1016/j.specom.2005.06.015

How to Do Dialogue in a Fairy-Tale World

Johan Boye & Joakim Gustafson · TeliaSonera R&D, Sweden
6th SIGdial Workshop on Discourse and Dialogue, Lisbon, 2005 (demo paper)

This paper gives a detailed technical overview of the implemented dialogue manager in the NICE Fairy-Tale World — covering the character taxonomy, world representation, goal-based planning, message protocol, dialogue event system, scene/subscene structure, turn-taking, and asynchronous dialogue. It is the most complete published account of how the NICE dialogue engine actually works.

The Drawbridge. Karen on the far side; Cloddy Hans and the user trapped on the island. The waterfall gap separates them.
The Fairy-Tale World. Aerial view of the two-island landscape. The user and Cloddy Hans start on the left plateau; the village and further quests lie beyond.

The paper introduces a three-way character taxonomy for interactive storytelling, extending the feature/supporting dichotomy from earlier work (Charles & Cavazza 2004) with a new third type — the helper character — that is central to the NICE design.

Feature Characters
e.g. The Prince, Karen

Drive the story. Have their own goals and desires. Represent obstacles or information sources. Require a full NLU and dialogue manager. Their agenda trees can be complex and deeply nested.

Own goals · Story drivers · Full dialogue manager
Supporting Characters
e.g. Thumbelina

Provide atmosphere and react to events. Non-verbal or limited verbal output. Express emotions through body language. Do not drive the plot. Thumbelina points at machine slots and reacts with large body gestures when displeased.

Atmosphere · Non-verbal · Reactive
Helper Characters
e.g. Cloddy Hans

A new type, not in prior literature. Guides and assists the user throughout the game. Has no goals of his own — only doing what the user asks and helping them when stuck. Requires a uniquely broad set of capabilities.

No own goals · Faithful guide · Grounding · Error handling · Visual perception · Multi-plot aware

Requirements for a Helper Character

The paper identifies that helper characters demand a uniquely broad capability set — broader in some ways than feature characters:

  • Grounding & cooperation: must continuously confirm shared understanding and cooperate towards goals set by the user
  • Dialogue regulation & error handling: must manage communication breakdowns, recognition failures, and misunderstandings gracefully
  • Knowledge of all plots: unlike feature characters who only know their own subplot, the helper must know every subplot in every scene to be able to guide the user
  • Visual perception: must be able to notice objects the user hasn't seen yet and suggest actions involving them ("Hey, there's a ruby in the grass over there")
  • Multi-character awareness: must track what other characters say and do, to provide timely hints and side-comments

Each fairy-tale character maintains a rich inner state that drives all its behaviour — both proactive (pursuing its own agenda) and reactive (responding to user input or world events).

🌍 World Model

The character's beliefs about the current state of the 3D world — represented as a set of interrelated objects with attributes and relations. Includes: where objects are located, who is holding what, which slots are filled, whether the bridge is raised or lowered, etc.

Key property: beliefs can be wrong. A character's world model is updated by perceptual events and broadcasts from other characters, but may lag behind reality.

📜 Discourse History

A record of past interactions — what has been said, what has been done, what agreements have been reached. Used for anaphora resolution, detecting repeated requests, and deciding whether something has already been explained.

Enables Cloddy Hans to say "But I already told you — the axe is in the machine!" and to remember that the user previously asked him to go somewhere.

🗂️ Agenda

A set of tree-structures encoding the character's current goals, past and future actions, and their causal relationships. The trees can be deeply nested. Crucially, the tree encodes why as well as what:

Goal: hold(cloddy, axe)
  → because: standing-next-to(cloddy, shelf)
  → because: perform(walk-to, shelf)

These causal chains can be verbalized — Cloddy Hans can explain "I'm walking to the shelf because I want to pick up the axe."
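
A minimal sketch of a verbalisable agenda node, with an invented `explain` method that walks the causal chain upwards (field names are illustrative):

```python
# Sketch of an agenda node whose because-chain can be verbalised.
class AgendaNode:
    def __init__(self, description, reason=None):
        self.description = description    # what the node is about
        self.reason = reason              # why: the parent goal, or None

    def explain(self):
        """Walk the causal chain and join it into a 'because' sentence."""
        parts = []
        node = self
        while node is not None:
            parts.append(node.description)
            node = node.reason
        return " because ".join(parts)

root = AgendaNode("you asked me to pick up the axe")
goal = AgendaNode("I want to pick up the axe", reason=root)
step = AgendaNode("I am walking to the shelf", reason=goal)
print(step.explain())
# I am walking to the shelf because I want to pick up the axe because you asked me to pick up the axe
```

In the real system the descriptions would come from NLG over the goal terms (hold, standing-next-to, perform), but the explainability rests on the same parent links.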

STRIPS-Like Planning with Non-Monotonicity

Task specifications are sets of declarative rules encoding relationships between actions and world propositions. The dialogue manager uses these to build agenda trees — similar to classical STRIPS planning, but with a critical difference:

Non-Monotonicity

In classical STRIPS planning, once a proposition is established as true, it stays true. In the fairy-tale world, this assumption breaks down. For example:

  • available(axe) may be true when Cloddy Hans starts walking towards it — but become false if Karen grabs it first
  • The drawbridge may be raised when Cloddy approaches — but lowered (or re-raised) by Karen independently

Solution: The system checks all necessary preconditions before executing any action — even those already found to be true at an earlier point. A precondition is only assumed to hold at the moment of checking, not persistently.
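
The execution discipline can be sketched as follows, with invented world-state propositions and actions standing in for the real task specifications:

```python
# Sketch of non-monotonic execution: every precondition is re-checked
# immediately before its action runs, because the world may have changed
# since planning time.
world = {"available(axe)": True, "at(cloddy, shelf)": False}

# Each plan step: (action, preconditions, effects). Names are illustrative.
plan = [
    ("walk-to shelf", [], {"at(cloddy, shelf)": True}),
    ("pick-up axe", ["available(axe)", "at(cloddy, shelf)"],
     {"holds(cloddy, axe)": True}),
]

def execute(plan, world):
    for action, preconds, effects in plan:
        # Re-check at execution time: a precondition found true during
        # planning may have become false (e.g. Karen grabbed the axe).
        if not all(world.get(p, False) for p in preconds):
            return f"abort: precondition failed before '{action}'"
        world.update(effects)
    return "plan completed"

world["available(axe)"] = False        # Karen takes the axe mid-plan
print(execute(plan, world))
# abort: precondition failed before 'pick-up axe'
```

An abort like this would feed back into the agenda as a failed goal, prompting replanning or a verbal report to the user rather than a silent error.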

Story-Functional Events

Certain key moments are designated story-functional events — irreversible milestones that mark genuine narrative progression. They cannot be undone. Examples:

Scene 1 (Lab)

  • Cloddy Hans introduces himself
  • Cloddy Hans introduces the plot
  • Cloddy Hans picks up an object for the first time
  • Cloddy Hans puts object X in the fairy-tale machine (irreversible — objects cannot be retrieved)
  • Cloddy Hans pulls the lever → they fall into the fairy-tale world

Scene 2 (Fairy-Tale World)

  • Cloddy Hans introduces the fairy-tale world
  • Karen introduces herself
  • Cloddy Hans gives his opinion of Karen
  • Karen gives her opinion of Cloddy Hans
  • Karen informs user she demands payment
  • Karen accepts an object and lowers the drawbridge
  • Cloddy Hans crosses the drawbridge and gives Karen payment

All characters load the story-functional events at scene initialisation and add corresponding goals to their agendas. This allows Cloddy Hans to proactively guide the user towards unmet milestones when the user is stuck.

The dialogue manager communicates with the rest of the system through a strictly defined message protocol. It receives a stream of input messages and generates output messages.

Output Messages (DM → System)

Message | Description
convey <dialogue_act> | Turn into words via NLG → TTS + animation
perform <action> | Send command to the Animation Planner

Every observable character behaviour is produced through exactly one of these two primitives. convey produces speech; perform produces physical action.

Input Messages (System → DM)

Message | Description
parserInput <dialogue_act> | User said something; parser result attached
gestureInput <object> | User pointed at a specific 3D object
recognitionFailure | User spoke; ASR could not interpret it
broadcast <message> | Another character spoke or acted — only delivered if sender is within visibility range and/or audibility distance
performed <id> <flag> | Character completed own requested action (ok/failed)
trigger <id> | Character moved into/out of a 3D trigger zone

Why is "performed" needed?

The performed message might seem redundant — after all, the character itself requested the action. But since some actions (like walking across the world) take considerable time, and actions may fail for unexpected reasons, the character cannot treat an action as complete until the animation system explicitly confirms it. This feedback is essential for correctly sequencing agenda goals.

Dialogue Act Categories

User → Character

  • Instructions: "Go to the drawbridge", "Pick it up"
  • Domain questions: "What is that red object?", "How old are you?"
  • Giving information: "I'm fourteen years old"
  • Negotiating: "I can give you the ruby if you lower the bridge"
  • Confirmations: "Yes please!", "OK, do that"
  • Disconfirmations: "No!", "Stop!", "I didn't say that!"
  • Help requests: "Help!", "What can I do?", "Do you hear me?"
  • Explanation requests: "Why did you say that?", "Why are you doing this?"

Character → User

  • Responses to instructions: "OK, I'll do that" / "No, the knife is already in the machine"
  • Answers: "The ruby is red", "I am 30 years old"
  • Stating intentions: "I'm going to the drawbridge now"
  • Confirmation questions: "So you want me to go to the shelf?"
  • Clarification questions: "Where do you want me to go?", "What should I put on the shelf?"
  • Suggestions: "Perhaps we should go to the drawbridge now?"
  • Negotiating: "I won't do that for nothing", "What a piece of junk! Find something better"
  • Explanations: "Because I want the axe in the machine"

The character's dialogue act repertoire overlaps with — but is not identical to — the user's: only characters can state intentions, make suggestions, and give explanations; only users can make help requests and explanation requests.

Event-based dialogue progression — the five input channels that drive dialogue forward in NICE

"In traditional spoken dialogue systems of the era, one thing drove the system: the user spoke, the system replied. In NICE, user speech was only one of five distinct classes of event that could push dialogue forward — making characters feel genuinely alive in their world rather than waiting passively for the next utterance."

A Novel Approach to Dialogue Progression

At the time NICE was developed (2002–2005), virtually all spoken dialogue research — whether in phone-based information systems, in-car assistants, or early conversational agents — followed the same fundamental model: the user speaks, the system produces a response, and so on in strict alternation. The only currency was user speech, and the only driving force was the parser's output from that speech.

The NICE dialogue engine discarded this assumption entirely. Each character's dialogue manager is driven by a stream of heterogeneous events arriving asynchronously from five independent sources. The character's scripting code responds to whichever event arrives next, regardless of whether the user has spoken. This makes it possible for characters to behave proactively, to react to each other, and to interact with the 3D world around them — in a way that had no precedent in spoken dialogue systems.
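This event-driven model can be sketched in miniature. The event class names below follow the paper; the single dispatch loop, the handler bodies, and the timeout mechanism are illustrative assumptions, not NICE's actual implementation (which ran scripting code per scene and phase).

```python
import queue

# Event classes named as in the paper; everything else is a sketch.
class Event: pass

class ParserEvent(Event):            # interpreted user dialogue act
    def __init__(self, act): self.act = act

class PerformedEvent(Event):         # own action completed (ok/failed)
    def __init__(self, action, ok): self.action, self.ok = action, ok

class BroadcastEvent(Event):         # another character spoke or acted
    def __init__(self, sender, what): self.sender, self.what = sender, what

class TriggerEvent(Event):           # entered/exited a 3D trigger zone
    def __init__(self, zone, entered): self.zone, self.entered = zone, entered

class TimeOutEvent(Event):           # nothing arrived within the window
    pass

class DialogueManager:
    def __init__(self, timeout=8.0):
        self.events = queue.Queue()  # all channels feed this one queue
        self.timeout = timeout
        self.log = []

    def run_once(self):
        # Block until *any* channel delivers an event; synthesise a
        # TimeOutEvent if the silence window elapses first.
        try:
            ev = self.events.get(timeout=self.timeout)
        except queue.Empty:
            ev = TimeOutEvent()
        self.dispatch(ev)

    def dispatch(self, ev):
        # Scene/phase scripting code would hook in here.
        if isinstance(ev, ParserEvent):
            self.log.append(f"interpret {ev.act}")
        elif isinstance(ev, PerformedEvent):
            self.log.append(f"action {ev.action} {'ok' if ev.ok else 'failed'}")
        elif isinstance(ev, BroadcastEvent):
            self.log.append(f"heard {ev.sender}: {ev.what}")
        elif isinstance(ev, TriggerEvent):
            self.log.append(f"{'entered' if ev.entered else 'left'} {ev.zone}")
        elif isinstance(ev, TimeOutEvent):
            self.log.append("prompt user")
```

The point of the sketch is that user speech (ParserEvent) is just one producer among several feeding the same queue, so the character keeps reacting whether or not the user speaks.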

The Five Channels of Dialogue Progression

🎤 1. User Dialogue Acts

When the ASR recognises speech and the NLU parser succeeds, a ParserEvent is delivered carrying a fully interpreted dialogue act — e.g. instruction(go, drawbridge) or question(location, ruby). The scripting code for the current scene and phase catches this event, performs contextual interpretation (resolving pronouns, ellipsis, and screen references), and determines the appropriate response from the character's repertoire.

A RecognitionFailureEvent fires when the user speaks but ASR fails — characters handle this gracefully rather than silently failing. A GestureEvent fires when the user clicks on a 3D object (multimodal input: "pick this up" + click).

▶️ 2. The Character's Own Completed Actions

When a character performs a physical action — walking to a location, picking up an object, pulling a lever — the animation system executes it asynchronously. It reports back via a PerformedEvent carrying an ok or failed flag. This event is essential for sequencing: the character can only say "I'm at the machine, what now?" after actually arriving.

Because walking can take considerable time, the user may have issued new instructions in the meantime. The PerformedEvent lets the agenda cleanly resolve whether those intervening requests should now be acted upon or whether to resume the original goal chain.

📣 3. Broadcast Actions from Other Characters

Whenever any character speaks or performs an action, it broadcasts that event to all other characters. Each receiving character's dialogue manager fires a BroadcastEvent — but only if the sending character is currently within visibility range and/or within audibility distance of the receiving character. The system tracked spatial relationships between characters in the 3D world to enforce this filter.

This is what enables Cloddy Hans to whisper side-comments during the Karen negotiation ("Don't trust her!") even while off-camera, and for Karen to react to things Cloddy does across the world only if she could plausibly have seen or heard them. Characters could thus be genuinely surprised by events they had no line of sight to.
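A minimal sketch of such a spatial filter, assuming simple distance thresholds and a line-of-sight predicate (both hypothetical; the paper does not give the actual ranges or geometry test):

```python
import math

AUDIBLE_RANGE = 30.0   # assumed thresholds, in world units
VISIBLE_RANGE = 60.0

class Character:
    def __init__(self, name, pos):
        self.name, self.pos, self.events = name, pos, []

def should_deliver(sender_pos, receiver_pos, has_line_of_sight):
    d = math.dist(sender_pos, receiver_pos)
    audible = d <= AUDIBLE_RANGE
    visible = d <= VISIBLE_RANGE and has_line_of_sight
    return audible or visible

def broadcast(sender, characters, event, line_of_sight):
    # Deliver only to characters that could plausibly have seen or
    # heard the sender; everyone else stays genuinely unaware.
    for c in characters:
        if c is sender:
            continue
        if should_deliver(sender.pos, c.pos, line_of_sight(sender, c)):
            c.events.append(event)
```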

🌎 4. 3D Environment Triggers

The virtual world is populated with invisible trigger zones — 3D volumes defined by coordinates. As a character walks through the world, the animation system monitors their position and fires a TriggerEvent each time they enter or exit a zone. The dialogue manager uses these to make characters react to their immediate surroundings in real time.

  • Turn to look at an object of interest while passing it
  • Make a contextual remark near a landmark ("That's where Karen lives")
  • Generate waypoints for long walks — giving the character things to say or do at each waypoint rather than walking in silence
  • Trigger a camera pan when Cloddy Hans reaches the bridge gap and Karen comes into view
  • Know when the destination has been reached and transition dialogue to the next subscene phase
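Enter/exit detection over such zones can be sketched with axis-aligned boxes and positional polling; the polling approach and zone representation are assumptions, since the paper does not describe the animation system's internals:

```python
class TriggerZone:
    """Axis-aligned 3D box; lo and hi are opposite (x, y, z) corners."""
    def __init__(self, name, lo, hi):
        self.name, self.lo, self.hi = name, lo, hi
    def contains(self, p):
        return all(l <= c <= h for l, c, h in zip(self.lo, p, self.hi))

class TriggerMonitor:
    """Polls a character's position; fires only on enter/exit
    transitions, not on every frame spent inside a zone."""
    def __init__(self, zones):
        self.zones = zones
        self.inside = set()
        self.fired = []          # (zone_name, "enter" | "exit")
    def update(self, pos):
        for z in self.zones:
            now, was = z.contains(pos), z.name in self.inside
            if now and not was:
                self.inside.add(z.name)
                self.fired.append((z.name, "enter"))
            elif was and not now:
                self.inside.discard(z.name)
                self.fired.append((z.name, "exit"))
```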

⏳ 5. Timeouts

If no other event arrives within a configurable time window, a TimeOutEvent fires. Characters use this to decide that the user needs prompting — perhaps they are confused, distracted, or have simply stopped speaking for too long. The response is determined by the current scene and agenda state: a first timeout might produce a gentle hint ("Perhaps we should go to the drawbridge?"); a repeated timeout might escalate to a more explicit suggestion, or Cloddy Hans might use his knowledge of unmet story-functional events to guide the user towards the next goal.

The NoReactionEvent is a related internal event fired when the agenda is empty and no incoming events are pending — the character has nothing queued and must decide whether to take initiative. Together, timeouts and NoReactionEvents are what keep the game from stalling when a child doesn't know what to say or do next.
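The escalation behaviour described above might look like this in miniature; the hint texts and the reset-on-speech policy are illustrative assumptions:

```python
class TimeoutPolicy:
    """Escalates from gentle hints to explicit guidance; resets
    once the user speaks again."""
    def __init__(self, hints):
        self.hints = hints       # ordered gentle -> explicit
        self.count = 0
    def on_timeout(self, agenda):
        self.count += 1
        if self.count <= len(self.hints):
            return self.hints[self.count - 1]
        # Past the scripted hints: point at an unmet story goal.
        return f"We still need to: {agenda[0]}" if agenda else "What now?"
    def on_user_spoke(self):
        self.count = 0

policy = TimeoutPolicy([
    "Perhaps we should go to the drawbridge?",
    "Say 'go to the drawbridge' and I'll walk there.",
])
```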

The Agenda: What the Character Wants to Do Next

Underlying all five event channels is the character's agenda — a set of goal-trees whose content is determined by the current scene and subscene. When a scene initialises, each character loads the set of story-functional events for that scene and populates their agenda accordingly. This means the character knows at all times what still needs to happen for the narrative to progress, and can use that knowledge to decide how to respond to any incoming event.

Scene-Driven Agenda Population

At scene initialisation, story-functional milestones are added as top-level agenda goals. As events arrive and goals are fulfilled, the agenda shrinks. The NoReactionEvent fires when it is empty — signalling that the scene is complete or the character must find something proactive to do. Cloddy Hans uses the remaining agenda to generate unprompted hints ("We still haven't spoken to Karen about the bridge").

Non-Monotonic Goal Reasoning

Unlike classical STRIPS planning, the agenda does not assume that satisfied preconditions remain satisfied. When Karen grabs the axe just as Cloddy Hans was walking to pick it up, his goal fails and must be replanned. Each precondition is re-checked at execution time, not assumed true from when it was last verified. This is essential in a dynamic multi-character world where any character can change the state of any object.
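A sketch of execute-time precondition re-checking, with a toy world state; the dictionary-based goal representation is a hypothetical stand-in for the agenda's goal-trees:

```python
def pursue(goal, world, replan):
    """Re-check every precondition at execution time rather than
    trusting the state observed at planning time."""
    for pre in goal["preconditions"]:
        if not pre(world):
            return replan(goal, world)
    return goal["execute"](world)

# Toy example: Cloddy walks over to pick up the axe, but Karen may
# have grabbed it in the meantime.
pickup_axe = {
    "preconditions": [lambda w: w["axe_location"] == "grass"],
    "execute": lambda w: w.update(axe_location="cloddy") or "picked up",
}

def replan(goal, world):
    return "replanning needed"
```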

Complete Event Reference

External Events (arriving from outside the DM)

Event: Triggered when…
ParserEvent: Parser delivers analysis of the user's utterance
GestureEvent: User pointed at a 3D object
RecognitionFailureEvent: ASR detected speech but could not recognise it
PerformedEvent: Animation system completed a requested action (ok / failed)
BroadcastEvent: Another character spoke or acted (filtered by visibility & audibility)
WorldEvent: A world-state change (bridge raised/lowered, object inserted in machine…)
TriggerEvent: Character entered or exited a 3D trigger zone while walking

Internal Events (generated within the kernel)

Event: Meaning
IntentionEvent: Character has formed an intention to say or do something
NoReactionEvent: Agenda empty — character must decide whether to take initiative
TimeOutEvent: Silence threshold exceeded — prompt or hint
AlreadySatisfiedEvent: A goal that was already true was added to the agenda
CannotSolveEvent: An unsolvable goal was added to the agenda
PossibleGoalConflictEvent: New goal conflicts with an existing agenda item
RequestEvent: Derived from ParserEvent — user made an instruction
QuestionEvent: Derived from ParserEvent — user asked a question

Kernel Operations Available to Scripting Code

  • interpret utterance in context
  • convey <dialogue act>
  • perform <action>
  • add goal to agenda
  • remove goal from agenda
  • find & pursue next goal on agenda

Worked Example: "Go to the fairy-tale machine"

This step-by-step trace shows how a single user instruction flows through the entire event/scripting system:

1. ParserEvent arrives → scripting code of current scene is called
2. Scripting code calls kernel's interpret operation → contextual interpretation resolves references
3. Kernel generates RequestEvent → scripting code catches it
4. Scripting code calls convey(acknowledge) → Cloddy says "OK, I'll do that", then calls addGoal(standNextTo(cloddy, machine))
5. Kernel pursues goal → calls perform(walkTo(machine)) → Cloddy Hans starts walking. TriggerEvents fire as he passes objects of interest along the route.
Meanwhile: user can speak again. New events arrive and can open a new dialogue line in parallel. A BroadcastEvent from Karen may also arrive if she is within audibility range.
6. PerformedEvent (ok) arrives → Cloddy reached machine. Scripting code generates next utterance: "I'm at the machine. What now?"
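The trace above can be condensed into scripting-style Python. The kernel operations (interpret, convey, perform, add goal) follow the operation list given earlier, but their signatures and the simplified contextual interpretation are assumptions:

```python
class Kernel:
    """Toy stand-in for the dialogue management kernel."""
    def __init__(self):
        self.said, self.done, self.agenda = [], [], []
    def interpret(self, utterance, context):
        # Contextual interpretation: "the fairy-tale machine"
        # resolves against the current scene's context.
        return ("request", "standNextTo", context["machine"])
    def convey(self, act):
        self.said.append(act)
    def perform(self, action):
        self.done.append(action)     # animation runs asynchronously
    def add_goal(self, goal):
        self.agenda.append(goal)

def on_parser_event(kernel, utterance, context):
    act, predicate, target = kernel.interpret(utterance, context)  # steps 1-3
    if act == "request":
        kernel.convey("OK, I'll do that")                # step 4: acknowledge
        kernel.add_goal((predicate, "cloddy", target))   # step 4: agenda goal
        kernel.perform(("walkTo", target))               # step 5: start walking

def on_performed_event(kernel, ok):
    if ok:                                               # step 6
        kernel.convey("I'm at the machine. What now?")
```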

The game is divided into scenes and subscenes (and sub-subscenes), forming a hierarchical dramatic structure. Scenes map to game levels; scene transitions mark significant narrative events. This hierarchy is entirely modular — new scenes can be added without touching the kernel.

Kernel + Scripting = Modular Personality

The dialogue management kernel provides generic functionality — input handling, reference resolution, goal reasoning, event dispatching. The scripting code per character per scene per phase provides the specific personality and behaviour. Adding a new scene means writing new scripting code; the kernel is unchanged.

  • Practical benefit: simpler to understand and maintain
  • Theoretical benefit: clearly separates generic dialogue management from character/situation-specific aspects — advancing our understanding of dialogue in general
  • Game-industry precedent: this pattern is standard in commercial game engines (Varanese & LaMothe 2003)

Scene 2 — Four Subscenes

🏝️
1. Introduction

Cloddy Hans tells the user about the fairy-tale world. Story-functional events: Cloddy introduces the world; Karen introduces herself; both characters give their opinions of each other.

🔍
2. Exploration

The player and Cloddy Hans explore the island together, finding the fairy-tale objects from Scene 1 scattered in the grass. Camera and dialogue are entirely user-driven.

🤝
3. Negotiation

Starts when the player meets Karen. Multi-party negotiation: user tries to find what Karen wants, Cloddy Hans makes side-comments, Karen rejects first two offers unconditionally. Rich three-party dialogue.

🌉
4. Bridge Crossing

After successful negotiation. Cloddy Hans delivers the agreed object to Karen. The drawbridge is lowered. Story-functional event: Cloddy crosses the bridge and pays Karen. Scene complete.

At any moment exactly one subscene phase is active. Phases transition on events — for example, the Negotiation phase begins the first time the user speaks to Karen. The phase can also change based on geographic location (which trigger zone the character is in) or on character actions. This allows non-linear story progression within the overall scene structure.
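Event-driven phase transitions of this kind amount to a small state machine. The transition table below is an illustrative reading of Scene 2's four subscenes, with hypothetical event names:

```python
# (current phase, event) -> next phase; None marks scene completion.
TRANSITIONS = {
    ("introduction", "intro_done"): "exploration",
    ("exploration", "user_spoke_to_karen"): "negotiation",
    ("negotiation", "offer_accepted"): "bridge_crossing",
    ("bridge_crossing", "bridge_crossed"): None,
}

class Scene:
    def __init__(self, start="introduction"):
        self.phase = start
    def on_event(self, event):
        key = (self.phase, event)
        if key in TRANSITIONS:          # unmatched events leave the phase alone
            self.phase = TRANSITIONS[key]
        return self.phase
```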

Asynchronous Dialogue

Unlike most spoken dialogue systems — which use a strict turn-alternation model where user and system must speak one at a time — the NICE dialogue manager is fully asynchronous. This is one of its most important innovations.

What "asynchronous" means

  • The user can make multiple dialogue contributions in sequence without waiting for the system to reply
  • Characters are triggered to speak by five independent channels — not just user utterances, but also their own completed actions (PerformedEvent), broadcasts from other characters (filtered by visibility/audibility), 3D trigger zones entered while walking, and timeouts. E.g. Cloddy Hans entering the bridge trigger zone fires a TriggerEvent that pans the camera to Karen.
  • Multiple dialogue threads can be open simultaneously — a new utterance can be addressed while Cloddy Hans is still walking to fulfil a previous request
  • Characters resume the original dialogue thread after handling an interruption — via PerformedEvent upon completing the action

Why this matters for games

  • Games are dynamic — things happen continuously, not in polite alternating turns
  • Natural human conversation is not strictly turn-based — interruptions, overlaps, side-comments all occur
  • Strict turn-taking would make the system feel unresponsive and robotic during actions that take time (walking, picking up objects)
  • Side-comments from off-camera characters (Cloddy whispering hints during Karen negotiations) would be impossible in a strict turn model
"A character may take the turn and start speaking for a number of reasons: because the user has said something, because some other fairy-tale character has said or done something, because of an event in the fairy-tale world, or because a certain amount of time has elapsed."

Turn-Taking & Camera Control

User-Initiated Camera Changes

  • The character in focus is always in camera
  • User says a character's name ("Cloddy" / "Karen") → camera pans, dialogue partner changes
  • This is the primary mechanism for switching between characters in multi-party dialogue

System-Initiated Camera Changes

  • Triggered by events — e.g. Cloddy Hans reaching the bridge gap → camera automatically pans to Karen
  • Characters can make side-comments while off-camera — Cloddy whispering hints during the Karen negotiation
  • Creates the feel of a cinematic multi-character scene without requiring simultaneous rendering

3D Triggers

A trigger is a three-dimensional zone with specific coordinates in the virtual world. When a character enters or exits a trigger, a TriggerEvent is fired to its dialogue manager. Uses include:

  • Character turns head when passing object of interest
  • Character makes contextual utterance near a landmark
  • Generate waypoints for long walks across the world
  • Camera pan when character reaches the gap/bridge
  • Initiate negotiation when user approaches Karen

Comparison with Related Systems

The paper identifies only one comparable system: the Mission Rehearsal Exercise (MRE) from the USC Institute for Creative Technologies (Swartout et al. 2004). MRE has more complex character-character interactions and a more sophisticated emotion model. However, MRE operates in a military domain with more codified language — whereas NICE targets spontaneous children's speech in a domain with negotiation and social interaction. To the authors' knowledge, NICE and MRE were the only two multi-party spoken dialogue systems in existence at the time of publication.

Paper Citation

Boye, J., & Gustafson, J. (2005). "How to do dialogue in a fairy-tale world." In Proceedings of the 6th SIGdial Workshop on Discourse and Dialogue, Lisbon, Portugal.

Voice Creation for Conversational Fairy-Tale Characters

Voice Creation for Conversational Fairy-Tale Characters
Joakim Gustafson (TeliaSonera) & Kåre Sjölander (KTH Centre for Speech Technology, Stockholm)
5th ISCA Speech Synthesis Workshop (SSW5), Pittsburgh, 2004

This paper describes the design and implementation of Snacka — a new corpus-based unit-selection speech synthesis system for Swedish developed specifically for the NICE project. It was one of the first unit-selection TTS systems built for Swedish, and the first designed from the ground up to convey animated character personality in a real-time interactive application. The paper covers the unique requirements that a fairy-tale game domain places on speech synthesis, the corpus design, the system architecture, voice transformation, and early evaluation results.

The fairy-tale world and its characters. Each character — the Prince, Thumbelina, Cloddy Hans — required a distinct synthesised voice designed to match their personality and visual appearance.
Cloddy Hans in the fairy-tale lab. The fairy-tale machine with its four symbol-labelled slots. Cloddy Hans's synthesised voice must handle clarification dialogues about every object and slot combination.

The central challenge driving the development of a new TTS system was straightforward: the NICE characters engage in open-ended clarification dialogues that reference any combination of objects, slots, actions, and dialogue contexts. This makes pre-recording all possible utterances impossible — a unit-selection synthesiser is the only viable path to both quality and coverage.

"For the fairy-tale characters to be lifelike and believable in their roles in the game they have to be provided with natural sounding voices with distinct personalities. The characters also have to be able to engage in clarification dialogues that make the number of possible system utterances prohibitively large."

Why a New System?

Existing off-the-shelf Swedish TTS was not suitable for several reasons: generic voices have no personality; commercial systems are not designed for real-time integration into game engines; and at the time of development (2003–2004), no suitable Swedish unit-selection system was available that met the project's requirements. The solution was to build Snacka — a completely new corpus-based unit-selection synthesiser extending the open-source Snack sound toolkit by co-author Kåre Sjölander (KTH).

Design Criteria

Language & Quality

  • Swedish language as primary target; architecture designed to be extensible to new languages
  • High quality output with a personal speaking style — not a generic neutral voice
  • Easy to build new voices for different characters

System Integration

  • Easy to use as a component in a real-time dialogue system
  • Faster than real-time speech generation — critical for interactive use
  • Prosodic control over emphasis and speaking rate

Personality & Transformation

  • Voice design via acoustic signal processing — the "giant/gnome" approach to character differentiation
  • Support for extralinguistic sounds: inhalations, coughs, laughter, sighs
  • Coordination of verbal output with body gestures and lip-sync via time-stamped output

The Voice Actor: Mattias Heldner

Cloddy Hans was voiced by Mattias Heldner, a speech scientist at KTH. The voice actor was instructed to read all corpus utterances in a style matching Cloddy Hans's intended personality — friendly, uncertain, and slow — creating a performance-driven foundation that the acoustic transformations then further shaped to match the character's visual appearance.

The Symmetric Analysis–Synthesis Principle

The defining architectural principle of Snacka is symmetry between analysis and synthesis: every module used to analyse the recorded corpus is also used during synthesis. This is critical for quality — it ensures the unit inventory and the synthesis output are described in exactly the same terms, maximising the chance that available units match what synthesis needs.

Analysis (voice building): recorded speech corpus + orthographic transcriptions → HMM segmentation (automatic phone & word labels) → unit inventory (half-phones, clustered with decision trees) → voice definition file (all unit descriptions + pointers to the original sound files)

Synthesis (runtime): XML-tagged text input from the dialogue system → target description (built with the same modules as analysis) → Viterbi search (best unit sequence) → output utterance + time-stamped phone transcription for lip-sync

Core Technical Components

Speech Segmentation

  • HMM-based automatic phone and word label generation
  • Can process very long files — initial experiments used talking books as one long file + transcription
  • No manual segmentation required — fully automatic pipeline

Unit Inventory

  • Basic unit: half-phones with variable boundaries using optimal-coupling
  • Clustered using decision trees
  • Acoustic vectors: Mel cepstrum, F0, and first-order regression coefficients
  • Decision tree features: phone context, stress, phone position in syllable, syllable position in word, lexical stress, focus

Pronunciation & Prosody

  • Pronunciation modelling via co-occurrence statistics — captures speaker-specific preferences for word pronunciations in context
  • Prosody model trained on corpus using decision trees
  • Letter-to-sound, text analysis, co-articulation modelling all shared between analysis and synthesis

Implementation

  • Entirely built on Snack sound toolkit (Sjölander & Beskow 2000) with new primitives added: decision tree modelling, Viterbi search
  • Whole voice-building process in one single script — no external tools needed
  • Voice definition file + original sound files = complete deployable voice
  • Achieves faster than real-time synthesis — essential for interactive use
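The Viterbi search at the heart of unit selection minimises the sum of target costs (how well a candidate unit matches the requested phone in context) and join costs (how smoothly adjacent units concatenate). A toy sketch, with made-up units and costs standing in for Snacka's cepstral/F0-based measures:

```python
def viterbi_select(targets, inventory, target_cost, join_cost):
    """Return the unit sequence minimising summed target + join cost.
    inventory maps each phone to its list of candidate units."""
    prev = [(target_cost(targets[0], u), [u]) for u in inventory[targets[0]]]
    for t in targets[1:]:
        cur = []
        for u in inventory[t]:
            # Best way to reach candidate u from any previous path.
            cost, path = min(
                (c + join_cost(p[-1], u) + target_cost(t, u), p)
                for c, p in prev
            )
            cur.append((cost, path + [u]))
        prev = cur
    return min(prev)[1]
```

Units recorded contiguously in the corpus typically get a near-zero join cost, which is why corpus design (covering the needed word combinations) matters so much for quality.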

Quality Assessment Tool

A dedicated quality assessment tool was built on top of WaveSurfer. It allows simple point-and-click synthesis of typical domain sentences. When a faulty utterance is found, the tool can be used to inspect the automatic segmentation, correct errors, and prune bad units — in most cases the utterance can then be re-synthesised instantly without rebuilding the entire voice. This rapid iteration cycle was essential for achieving acceptable quality in a research timeline.

The corpus was designed to cover everything the system needed to say as well as everything it was designed to understand — a key symmetry principle. Since the system handles clarification dialogues about arbitrary object/slot combinations, utterances were carefully designed to cover all grammatical positions and emphasis patterns.

Clefted Clarification Coverage

To allow Cloddy Hans to ask targeted clarification questions, all domain utterances were recorded with objects and slots in both medial and final positions:

Cloddy: "Is it the axe you want me to put in the useful slot?" [object in medial position]
Cloddy: "Was it in the useful slot you wanted me to put the axe?" [slot in final position]

All utterances were also tagged with prosodic phrase boundaries and emphasised words before recording, and recorded in all combinations of position and emphasis — giving the synthesiser the full range of prosodic variation needed for natural clarification dialogues.

Coverage Boosting

  • Sentences containing only function words recorded separately to ensure adequate coverage of common Swedish words
  • A set of old Swedish sayings (ordspråk) recorded specifically to increase coverage of Swedish diphones — ensuring smooth transitions between units in unexpected word combinations

Task-Independent Dialogue Speech Acts (Table 1 from the paper)

In addition to domain-specific utterances, a rich set of task-independent dialogue acts was recorded — covering all the ways Cloddy Hans needs to regulate turns, handle errors, and express attitude. These are used across all tasks and plots.

Plan Regulating

  • agree / disagree
  • apology / non-apology
  • grateful / ungrateful
  • ask for clarification / repetition / rephrase
  • open question / bounce question
  • ask for accept / reject / correct request
  • accept / reject offer
  • report knowing / not knowing
  • report correct / wrong action

Error Handling

  • report hearing / not hearing
  • ask for repetition
  • report understanding / not understanding / misunderstanding
  • error acknowledgement

Turn Handling

  • feedback continuer
  • floorholder
  • easy question / hard question
  • backchannel question
  • neutral filled pauses

Attitudinal Feedback

  • positive / negative filler words and filled pauses
  • attitude to grateful / ungrateful
  • attitude to success / failure
  • attitude to good / bad

Extralinguistic Sounds

  • Cough
  • Exhalation
  • Inhalation
  • Laughter
  • Clear throat
  • Sigh

Why Attitudinal Feedback and Extralinguistic Sounds?

The turn regulation utterances and attitudinal feedback serve two purposes: (1) to buy time while the system generates Cloddy Hans's next utterance, and — more importantly — (2) to convey his uncertain, hesitant personality. A confident, fast character would not produce filler words and sighs; Cloddy Hans's acoustic hesitancy is as much a personality signal as his slow walking speed.

Making different characters sound like themselves — not just like different speed settings of the same voice — required a combination of directed recording and post-hoc acoustic transformation.

Directed Recording

Voice actors were given explicit personality profiles and instructed to read all corpus utterances in a performance style that matched the target character. For Cloddy Hans, the instruction was to sound friendly, uncertain, and slow — producing a baseline recording already shaped by character intent.

TD-PSOLA Voice Transformation

After recording, further acoustic shaping was applied using the Time-Domain Pitch-Synchronous Overlap-Add (TD-PSOLA) algorithm — a well-established technique for modifying speech prosody without re-synthesis:

🐢 Cloddy Hans — Slowed & Lowered

  • Speaking rate reduced — all utterances slowed down via TD-PSOLA
  • Vocal tract length modified (longer → lower formants)
  • Combined effect: sounds slower and larger / heavier
  • Matches his lumbering walk and large body visually
  • Transformation applied to the entire recorded corpus at once, after building the voice description

🌟 Other Characters — Gnome/Giant Axis

  • The "giant/gnome" approach: vocal tract length modification shifts character up or down a personality axis
  • Shorter vocal tract → higher formants → sounds smaller, younger, more sprite-like
  • Longer vocal tract → lower formants → sounds larger, older, more imposing
  • Rate change: faster → more energetic/smart; slower → more lumbering/dull
  • A 2003 study (Bell, Gustafson, Heldner at ICPhS) showed these transformations measurably influence how users interact with the character
"The personalities of the two characters were deliberately chosen so that this simple voice transformation would also make their voices more matching with the visual appearance of the two animated characters."
— Gustafson & Sjölander, SSW 2004
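TD-PSOLA places its analysis frames pitch-synchronously, which is what lets it change duration and pitch independently without artefacts. As a much-simplified stand-in, the plain (non-pitch-synchronous) overlap-add time-stretch below shows the core idea of changing speaking rate by re-spacing windowed frames:

```python
import math

def ola_time_stretch(x, speed, frame=512, hop_out=128):
    """Plain overlap-add time-stretch: speed < 1 slows speech down.
    TD-PSOLA instead spaces frames pitch-synchronously, avoiding
    the phasiness this naive version can introduce."""
    hop_in = max(1, int(round(hop_out * speed)))
    # Hann analysis/synthesis window.
    win = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame) for n in range(frame)]
    n_frames = max(1, (len(x) - frame) // hop_in + 1)
    out = [0.0] * (n_frames * hop_out + frame)
    norm = [0.0] * len(out)
    for i in range(n_frames):
        start = i * hop_in
        for n in range(frame):
            if start + n >= len(x):
                break
            out[i * hop_out + n] += x[start + n] * win[n]
            norm[i * hop_out + n] += win[n]
    # Compensate for window overlap so amplitude stays roughly flat.
    return [o / m if m > 1e-8 else 0.0 for o, m in zip(out, norm)]
```

Lowering formants (the "longer vocal tract" half of the Cloddy Hans transformation) would additionally require resampling or spectral-envelope warping, which this sketch omits.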

Personality via Vocal Parameters (Nass & Lee 2000)

The approach was grounded in empirical research on synthetic voice personality. Nass & Lee had shown that providing a synthetic voice with stereotypical extrovert vocal features — high loudness, increased pitch, wide frequency range, fast speaking rate — caused users to describe it using extrovert personality adjectives. NICE applied this in reverse: design the vocal parameters to match the target personality, and users will perceive the character accordingly.

Early Evaluation at Telecom Museum

At the time of the workshop paper, the synthesiser was already deployed for user testing at the Telecommunication Museum in Stockholm. Adults and children interacted with a full-size Cloddy Hans on a large back-projected video wall, in semi-automatic mode (human wizard available to override). Users were asked post-interaction about their perception of:

  • Cloddy Hans's personality traits
  • The quality and naturalness of his voice

Future work identified at the time: improved prosodic models, and experimenting with making repetitions acoustically different from original utterances — so that repeated attempts at the same phrase don't sound identical.

One of Snacka's most important features for the NICE application was its tight integration with the animation system — achieved through a custom XML markup language that carried synthesis instructions, prosodic markup, and gesture coordination tags in a single pass.

XML-Tagged Input

The dialogue system outputs XML-tagged text to the synthesiser. A real example from the paper:

"do you want <EMPH>me</EMPH> to take <PointAt object="Axe">that</PointAt> and put it <PointAt slot="Useful">there</PointAt>"

XML Tag Types

Tag: Function
<EMPH>: Mark a word for prosodic emphasis — synthesiser increases pitch and energy on the emphasised word
<PointAt object="X">: Coordinate synthesis with a pointing gesture at object X in the 3D world — triggers the animation system
<PointAt slot="Y">: Coordinate synthesis with a pointing gesture at slot Y in the fairy-tale machine
Prosodic phrase markup: Mark phrase boundaries for more natural prosody across longer utterances
Extralinguistic tags: Trigger recorded non-speech sounds (cough, laughter, sigh) at specific points
Meta-information tags: Carry additional context used by the animation planner
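A front-end might split such tagged input into plain words plus word-indexed annotations before synthesis. The tag names come from the paper; the parsing code below is an illustrative assumption (it handles one level of markup, which the example above requires):

```python
import xml.etree.ElementTree as ET

def parse_tagged(utterance):
    # Wrap in a root element so the fragment is well-formed XML.
    root = ET.fromstring(f"<utt>{utterance}</utt>")
    words, emphasised, gestures = [], [], []
    if root.text:
        words.extend(root.text.split())
    for child in root:
        start = len(words)
        if child.text:
            words.extend(child.text.split())
        if child.tag == "EMPH":
            emphasised.extend(range(start, len(words)))
        elif child.tag == "PointAt":
            gestures.append((start, child.get("object") or child.get("slot")))
        if child.tail:
            words.extend(child.tail.split())
    return words, emphasised, gestures
```

The word indices in `emphasised` and `gestures` are what a synthesiser and animation planner would later map onto phone-level time stamps.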

Lip-Sync Output

In addition to the audio output, the synthesiser produces a time-stamped phone transcription containing:

  • Precise start and end times for every phone in the utterance
  • Animation tags at the time points corresponding to each phoneme's viseme
  • Animation trigger points for gesture coordination (when Cloddy Hans should point at an object)

This output is fed directly to the Liquid Media animation system, which drives Cloddy Hans's lip movements and body gestures in precise synchrony with the speech output.

Why This Integration Matters

Most TTS systems output audio only. Snacka outputs audio plus a complete synchronisation track — enabling Cloddy Hans to point at the axe precisely as he says "that", and to gesture towards the useful slot as he says "there". Without this tight coupling, lip sync would slip, gestures would arrive late, and the character would lose believability. This integration of synthesis and animation was a key technical contribution of the NICE project.

Paper Citation

Gustafson, J., & Sjölander, K. (2004). "Voice creation for conversational fairy-tale characters." In Proceedings of the 5th ISCA Speech Synthesis Workshop (SSW5), Pittsburgh, PA.

Providing Computer Game Characters with Conversational Abilities

Providing Computer Game Characters with Conversational Abilities
Joakim Gustafson, Johan Boye (TeliaSonera) · Morgan Fredriksson, Lasse Johannesson, Jürgen Königsmann (Liquid Media)
Intelligent Virtual Agents (IVA 2005), Kos, Greece · Lecture Notes in Computer Science, Springer

This paper presents the complete NICE Fairy-Tale World system — the game scenario, the characters' conversational requirements, the animation system designed by Liquid Media, the voice creation approach, and the results of the user studies with children. It argues that spoken dialogue technology has the potential to greatly enrich future computer games, and demonstrates a working system in which children used speech as the primary means of game progression.

The Two Scenes

Scene 1: The Fairy-Tale Laboratory. Cloddy Hans stands beside the shelf of fairy-tale objects — a teapot, a gem, a hammer, a book, a potion and more. The user must instruct him to place objects in the right slots of the fairy-tale machine.
Scene 2: The Drawbridge. Karen (left) guards the raised drawbridge across the gap. Cloddy Hans (right) waits on the island with the user. The user must negotiate with Karen to lower the bridge.

Character Personalities — The OCEAN Model

Personality traits were not explicitly modelled as internal variables, but used as design guidelines to ensure that every aspect of a character's behaviour — appearance, voice, wording, speaking style, non-verbal gestures — was perceived as consistent with the intended personality. The characters were designed around contrasting OCEAN (Big Five) trait profiles.

Cloddy Hans
The Helper · Faithful Companion

Slow, deep voice. Lumbering walk. Honest, anxious to help, no agenda of his own. His limitations force the user to cooperate using multimodal dialogue. His apparent stupidity is deliberate — it makes the game harder in a way users perceive as intentional design.

Dunce · Uncertain · Friendly · Polite · Calm · Even-tempered

Perceived by users as: Kind · Stupid · Lazy · Calm · Polite · Distressed

Karen
The Gatekeeper · Obstacle & Negotiator

Fast, sharp voice — made to sound younger by speeding up the recordings. Drives the negotiation, rejects every offer for the first two attempts, and makes side-comments about Cloddy Hans. She appears smarter partly because she can "see" what Cloddy brings without needing to understand the user's speech.

Intellectual · Frivolous · Self-confident · Unfriendly · Touchy · Anxious

Perceived by users as: Smart · Quick · Self-confident · Defiant · Secretive

Karen — Facial Expressions & Body Language

Karen's animated face conveys a wide range of emotions through the phonemic, intonational and emotional channels described in the paper. Facial actions — eyebrow position, mouth shape, gaze direction — communicate her sullen, uncooperative personality even before she speaks.

Facial Expressions

Body Poses & Gestures

Cloddy Hans — Actions & Expressions

Cloddy Hans's animation repertoire spans autonomous idle behaviour (walking, looking around) and reactive dialogue behaviour (picking up objects, gesturing, celebrating). His slow, lumbering movement style — deliberate in the animation design — reinforces his dunce personality at every frame.

In the Fairy-Tale World

Facial Expressions

Animation System (Liquid Media)

The animation system designed by Liquid Media was central to making the characters feel alive and reactive. It consisted of two main modules — the Animation Handler and the Animation Renderer — operating asynchronously within the event-driven architecture.

Processing pipeline:

  • Dialogue Manager → Animation Handler: deconstructs requests, manages queues & priorities
  • Speech Synthesiser: generates audio + lip-sync track
  • Animation Renderer: 8 concurrent tracks, blending, real-time
  • Receipt → Handler: confirms completion, triggers next action

Animation Handler

  • Deconstructs high-level requests (e.g. "go to the machine") into fine-grained sequences: change camera → walk → change camera → turn to face
  • One queue per character — instructions sent one at a time, next sent on receipt confirmation
  • Priority system for speech: High = reply to user; Medium = suggestion when user is silent; Low = chit-chat. Higher priority interrupts ongoing utterance.
  • If two characters try to speak at the same time (which the event-driven system permits), the handler resolves the conflict by priority
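The queueing and interruption behaviour described above can be sketched in a few lines. This is an illustrative Python sketch, not the NICE implementation; all class and method names are hypothetical.

```python
import heapq
from itertools import count

# Priority levels from the text: lower value = higher priority.
HIGH, MEDIUM, LOW = 0, 1, 2  # reply to user / suggestion / chit-chat

class CharacterQueue:
    """One instruction queue per character (names are hypothetical)."""

    def __init__(self):
        self._heap = []      # entries: (priority, seq, instruction)
        self._seq = count()  # FIFO tiebreaker within the same priority
        self.playing = None  # (priority, instruction) currently in progress

    def request(self, priority, instruction):
        """Queue an instruction; higher priority interrupts ongoing speech."""
        if self.playing is not None and priority < self.playing[0]:
            self.playing = (priority, instruction)  # interrupt and replace
        else:
            heapq.heappush(self._heap, (priority, next(self._seq), instruction))

    def on_receipt(self):
        """Renderer confirmed completion: start the next queued instruction."""
        if self._heap:
            priority, _, instruction = heapq.heappop(self._heap)
            self.playing = (priority, instruction)
        else:
            self.playing = None
        return self.playing

q = CharacterQueue()
q.request(LOW, "chit-chat")           # queued
q.on_receipt()                        # chit-chat starts playing
q.request(HIGH, "reply to the user")  # interrupts the chit-chat
assert q.playing == (HIGH, "reply to the user")
```

A real handler would additionally deconstruct each request into fine-grained steps (camera change, walk, turn) and send them one at a time, waiting for a receipt before sending the next.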

Animation Renderer

  • 8 concurrent animation tracks per character — different body parts can animate independently and simultaneously
  • Facial animations blended with bodily movement in real time (e.g. lip-sync while walking)
  • Façade architecture — subsystems have virtual interfaces, making it easy to swap implementations
  • Resource system tracks all graphical meshes and animated models by type and name
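The idea of several concurrent tracks driving different body parts can be illustrated with a toy weighted blend. This is a hypothetical Python sketch: the real renderer blends skeletal animations, not scalar joint values.

```python
def blend_tracks(tracks):
    """Blend concurrent animation tracks into one pose.

    tracks: list of (weight, {joint: value}) pairs. Tracks that drive
    disjoint joints run independently (e.g. lip-sync while walking);
    tracks that share a joint are mixed as a weighted average.
    """
    weighted, totals = {}, {}
    for weight, joint_values in tracks:
        for joint, value in joint_values.items():
            weighted[joint] = weighted.get(joint, 0.0) + weight * value
            totals[joint] = totals.get(joint, 0.0) + weight
    return {joint: weighted[joint] / totals[joint] for joint in weighted}

idle    = (1.0, {"neck": 0.0, "hip": 0.4, "knee": 0.7})  # full-body track
nod     = (1.0, {"neck": 0.4})                           # gesture overlay
lipsync = (1.0, {"jaw": 0.3})                            # facial track

pose = blend_tracks([idle, nod, lipsync])
assert pose["neck"] == 0.2  # idle and nod mixed 50/50 on the shared joint
assert pose["jaw"] == 0.3   # facial animation independent of body tracks
```

Because independent body parts occupy separate tracks, the face can continue lip-syncing while a walk cycle plays, which is what makes the characters feel continuously alive.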

Voice Design

A bespoke unit-selection Swedish TTS was developed in cooperation with KTH. Voice actors were told to read utterances in a style matching each character's intended personality. The recordings were then processed to further accentuate the personality differences.

🐢 Cloddy Hans — Slow & Deep

  • Voice actor told to sound uncertain, friendly, slow
  • All utterances slowed down in post-processing
  • Side-effect: slowing makes him sound larger / more massive
  • Low frequency range — deep, lumbering quality
  • Fillers and hesitations reinforce his uncertainty

⚡ Karen — Fast & Sharp

  • Voice actor told to sound intellectual, self-confident, touchy
  • All utterances speeded up in post-processing
  • Side-effect: speeding makes her sound younger
  • Higher frequency range — crisp, quick delivery
  • Deliberate contrast with Cloddy Hans — users immediately sense the personality difference

The Voice Transformation Insight

The simple procedure of resampling all utterances, which changes speaking rate and voice pitch together, had a desirable emergent property: the acoustic contrast reinforced the visual one. Cloddy Hans's slower, lower voice matched his large, lumbering body; Karen's faster, higher voice matched her small, quick character design. The audio and visual channels reinforced each other without requiring complex acoustic modelling.
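The coupled rate/pitch change can be demonstrated with a toy resampler. This is a pure-Python sketch under assumed parameters (16 kHz rate, sine-tone input), not the tool used in NICE.

```python
import math

SR = 16_000  # sampling rate in Hz (assumed for the example)

def resample(signal, factor):
    """Read the samples 'factor' times faster with linear interpolation.
    Duration shrinks by 'factor' and pitch rises by 'factor' at once:
    this method cannot change one without the other."""
    out, pos = [], 0.0
    while pos < len(signal) - 1:
        i, frac = int(pos), pos - int(pos)
        out.append(signal[i] * (1 - frac) + signal[i + 1] * frac)
        pos += factor
    return out

tone = [math.sin(2 * math.pi * 100 * n / SR) for n in range(SR)]  # 1 s, 100 Hz

karen  = resample(tone, 1.25)  # ~0.8 s and 125 Hz: faster and higher
cloddy = resample(tone, 0.8)   # ~1.25 s and 80 Hz: slower and deeper

assert abs(len(karen) / SR - 0.80) < 0.01
assert abs(len(cloddy) / SR - 1.25) < 0.01
# The sped-up signal is exactly a 125 Hz tone at the original rate:
assert abs(karen[100] - math.sin(2 * math.pi * 125 * 100 / SR)) < 1e-9
```

Resampling a recording by a factor below 1 thus makes the voice both slower and deeper (Cloddy Hans), while a factor above 1 makes it faster and higher (Karen).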

User Study Findings — Speaking Rate Adaptation

One of the most striking findings from the corpus analysis was how children adapted their speaking rate dynamically across the two scenes and across the two different character interlocutors:

Scene 1 — With Cloddy Hans (Lab)

  • Users became more efficient over time — fewer turns needed per object placed in the machine
  • But they also spoke progressively slower with each object, hyper-articulating to help Cloddy understand
  • Users adapted their speech rate down to match his apparent cognitive pace

Scene 2 — Island & Drawbridge

  • At start of Scene 2 (exploration with Cloddy), users began faster, then gradually slowed again when talking to him
  • When switching to Karen, users spoke significantly faster
  • Speech rate with Karen increased further with each turn — the negotiation felt more lively and dynamic
  • Karen's side of the interaction felt more engaging partly because she never struggled to "understand"
"Users rated Cloddy Hans as stupid and Karen as smart — even though the trick of having Karen tell users what to say to Cloddy was used in both directions."
— Gustafson et al., IVA 2005

The "Smart Karen" Design Trick

Karen was designed so she could respond to what Cloddy Hans physically brought her — she could "see" the object — without needing to parse the user's speech. This meant she appeared far more intelligent than Cloddy Hans (who needed correct verbal instructions) even though her underlying language understanding was actually simpler. This asymmetry was a deliberate design choice that also helped cover for ASR failures: Karen's reliable responsiveness made the system feel more robust overall.

Paper Citation

Gustafson, J., Boye, J., Fredriksson, M., Johannesson, L., & Königsmann, J. (2005). "Providing Computer Game Characters with Conversational Abilities." In Intelligent Virtual Agents (IVA 2005), Lecture Notes in Computer Science, Springer, Kos, Greece.

Videos

Watch recordings from both NICE systems — system demonstrations, user interactions, and game footage from the fairy-tale world.

H.C. Andersen System
Demonstration of the conversational H.C. Andersen system — domain-oriented spoken dialogue with the animated author in his 19th-century Copenhagen study.
Fairy-Tale System — Real User Interaction
A real interaction from one of the user studies with the NICE Fairy-Tale system. A child user engages in spontaneous spoken dialogue with Cloddy Hans and Karen in the 3D fairy-tale world.
NICE System Overview
Watch on YouTube ↗
NICE Video — System Overview [with subtitles]
Overview of the NICE fairy-tale game system with subtitles. Covers the project concept, the fairy-tale world, characters, and spoken dialogue interaction. Opens on YouTube.

Publications

The NICE project produced over 15 research papers across international venues in speech technology, dialogue systems, human-computer interaction, and intelligent virtual agents.

2007
Bell, L., Gustafson, J.
Children's Convergence in Referring Expressions to Graphical Objects in a Speech-Enabled Computer Game
Proceedings of Interspeech 2007, Antwerp, Belgium. ISCA, pp. 2209–2212.
PDF
2006
Boye, J., Gustafson, J., Wirén, M.
Robust spoken language understanding in a computer game
Speech Communication, 48(6), pp. 335–353 (Elsevier). Special issue on spoken language understanding.
PDF
2005
Gustafson, J., Boye, J., Fredriksson, M., Johannesson, L., Königsmann, J.
Providing Computer Game Characters with Conversational Abilities
Proceedings of Intelligent Virtual Agents (IVA 2005), Kos, Greece.
PDF
2005
Bell, L., Boye, J., Gustafson, J., Heldner, M., Lindström, A., Wirén, M.
The Swedish NICE Corpus – Spoken dialogues between children and embodied characters in a computer game scenario
Proceedings of Interspeech 2005, Lisbon, Portugal.
PDF
2005
Boye, J., Gustafson, J.
How to do dialogue in a fairy-tale world
Proceedings of SIGdial Workshop on Discourse and Dialogue 2005.
PDF
2005
Buisine, S., Martin, J.-C., Bernsen, N.O.
Children's Gesture and Speech in Conversation with 3D Characters
Proceedings of HCII 2005 (Human-Computer Interaction International).
PDF
2005
Corradini, A., Mehta, M., Bernsen, N.O., Charfuelán, M.
Animating an Interactive Conversational Character for an Educational Game System
Proceedings of the 10th International Conference on Intelligent User Interfaces (IUI 2005), San Diego, CA, USA. ACM Press, pp. 183–190.
PDF
2005
Bernsen, N.O., Dybkjær, L. (Eds.)
Evaluation of the Second NICE Prototype
NICE Deliverable D7.2-2 (99 pages). Includes evaluation of both HCA and Fairytale World systems.
2004
Gustafson, J., Bell, L., Boye, J., Lindström, A., Wirén, M.
The NICE Fairy-tale Game System
Proceedings of SIGdial Workshop on Discourse and Dialogue 2004, Cambridge, MA.
PDF
2004
Bernsen, N.O., Charfuelán, M., Corradini, A., Dybkjær, L., Hansen, T., et al.
Conversational H.C. Andersen — First Prototype Description
Proceedings of Affective Dialogue Systems (ADS 2004), Kloster Irsee, Germany.
PDF
2004
Bernsen, N.O., Charfuelán, M., Corradini, A., Dybkjær, L., Hansen, T., Kiilerich, S., Kolodnytsky, M., Kupkin, D., Mehta, M.
First Prototype of Conversational H.C. Andersen
Proceedings of the ACM International Working Conference on Advanced Visual Interfaces (AVI 2004), Gallipoli, Italy. ACM Press.
PDF
2004
Bernsen, N.O., Dybkjær, L.
Domain-Oriented Conversation with H.C. Andersen
ICSLP 2004 / Interspeech 2004.
PDF
2004
Gustafson, J., Boye, J., Bell, L., Wirén, M.
NICE System Description
LREC 2004 (Language Resources and Evaluation Conference), Lisbon, Portugal.
PDF
2004
Gustafson, J. et al.
Speech synthesis for fairy-tale characters
SSW 5th ISCA Workshop on Speech Synthesis, Pittsburgh 2004.
PDF
2004
Boye, J., Wirén, M., Gustafson, J.
Contextual Reasoning in Multimodal Dialogue Systems: Two Case Studies
Proceedings of the 8th Workshop on the Semantics and Pragmatics of Dialogue (Catalog'04), Barcelona, Spain.
PDF
2004
Corradini, A., Fredriksson, M., Mehta, M., Königsmann, J., Bernsen, N.O., Johannesson, L.
Towards Believable Behavior Generation for Embodied Conversational Agents
Proceedings of Intelligent Virtual Agents (IVA 2004), Lecture Notes in Computer Science, vol. 3038, Springer.
PDF
2004
Boye, J., Wirén, M.
Natural language understanding for the NICE fairy-tale game
NICE Deliverable D3.5b.
2004
Gustafson, J., Boye, J., Bell, L., Wirén, M.
Evaluation of the first NICE fairy-tale game prototype
NICE Deliverable D7.2b. (Corpus: 10 children at Technical Museum Stockholm.)
2003
Boye, J., Wirén, M.
Spoken language processing for NICE
DiaBruck 2003 / Eurospeech 2003.
PDF (DiaBruck) PDF (Eurospeech)
2002
Gustafson, J., Sjölander, K.
Voice Transformations for Improving Children's Speech Recognition in a Publicly Available Dialogue System
Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP 2002 / Interspeech 2002), Denver, CO, USA.
PDF

Impact & Legacy

The NICE Final Report (May 2005) reflected on the project's broader significance and each partner's plans to build on the work. The consortium's conclusion was unambiguous: all the original challenges had been satisfactorily met, and the systems pointed towards entirely new categories of interactive application.

"The user tests of both NICE prototype systems showed that multimodal applications using speech I/O are highly enjoyable and indicated that there is a large market potential for games using the combination of AI character modeling, nonverbal behavior and speech I/O."
— NICE Final Report D9.3, May 2005

What NICE Made Possible

For Children & Society

  • First workable children's speech recognition for spontaneous, conversational speech in large domains
  • Demonstrated that children readily accept and enjoy speech-driven games
  • A democratic argument: speech services that cannot understand young users shut them out of information access; NICE helped address this
  • Results also improved Swedish speech recognition for adult users generally

For the Games Industry

  • First demonstration of non-command-based spontaneous spoken dialogue as primary game control
  • Proved multi-party conversational characters with distinct personalities are technically feasible
  • Liquid Media built a commercially available skeletal animation engine with real-time lip sync and combinable animation tracks as a direct result
  • Showed that fast visual-motor gaming and "think-before-you-speak" dialogue interaction require different design philosophies

For Dialogue Research

  • One of only two multi-party dialogue systems existing at the time (alongside USC ICT's MRE system)
  • First domain-oriented (non-task-oriented) conversational system with a famous historical character
  • New metrics for measuring conversation success, narrative progression, and character believability
  • Modality-independent, mixed-initiative conversation manager reusable for new applications

Data Resources

  • The Swedish NICE corpus: 5,500+ dialogue turns from children aged 8–15
  • NISLab's HCA corpus: 8 hours of children's English, made available for further research
  • LIMSI's gesture corpus: 2D gesture data during interaction with 3D objects — unique at the time
  • All corpora made available to the research community post-project

Partner Outlooks (from the Final Report)

NISLab
Planned to continue advancing ontology-based natural interactive systems, mixed-initiative conversation management, and systems with emotion. Invited to join consortia in mobile phones, cultural heritage, and emotion-sensing character technologies — including one with Liquid Media.
TeliaSonera
Planned to apply NICE methods to more entertaining commercial service interfaces — where "entertainment value" is increasingly a competitive factor — and to improve Swedish speech recognition broadly. The FTW dialogue architecture was seen as directly portable to new edutainment applications.
LIMSI-CNRS
Continued using the GR/GI/IF software modules and annotation methodology in new research projects on multimodal interfaces. Established a national working group on Embodied Conversational Agents in France. Protocols adapted for evaluating emotional expressivity in ECAs.
Liquid Media
Built a commercially available advanced skeletal animation system with multiple concurrent animation tracks, hierarchical sorting, high-level nonverbal gesture controls, and real-time lip sync — all implemented in a commercial game engine. NICE "opened up a new market" in speech-driven games.
ScanSoft
The NICE prototypes served as a "very helpful test bed" for future complex multimodal system design and component requirements. The children's speech recognition advances informed ScanSoft's broader European product development and future multilingual EU project collaborations.

Press Coverage

The NICE project attracted attention from industry press covering the emerging intersection of speech technology and video games.

Citation: Oberteuffer, J. A. (2005). "Speech Technologies Make Video Games Complete." Speech Technology Magazine, April 26, 2005. Available at: speechtechmag.com/Articles/ReadArticle.aspx?ArticleID=29432