Research Project · 1999 – 2003

Multimodal Spoken Dialogue

AdApt

A multimodal conversational dialogue system for browsing apartments in Stockholm — combining natural speech, an animated talking agent, and an interactive map into a single coherent interface.

14 Research Papers · 4 Years Active · 1,845 Corpus Utterances · 33 WoZ Participants
AdApt GUI: Stockholm map with apartment icons and animated agent Urban
Spoken Dialogue · Animated Agent · Interactive Map · Negotiative Dialogue · Wizard-of-Oz
About the Project

A New Generation of Human–Computer Dialogue

How do humans really talk to computers?

AdApt was built on a simple but radical idea: to understand how people actually speak to computers, you had to let them speak freely — and then listen carefully to what happened.

Running at the Centre for Speech Technology (CTT) at KTH in collaboration with Telia Research, AdApt set out to build a multimodal conversational dialogue system in the domain of apartment browsing in Stockholm. Users could speak naturally to an animated agent named Urban, point and click on an interactive map, and receive spoken responses accompanied by visual feedback — all in a fluid search for their ideal Stockholm apartment.

The project followed a deliberate empirical methodology: first collect authentic human–computer dialogues through Wizard-of-Oz simulations, analyze the linguistic and interactional patterns that emerge, then use those findings to engineer robust system components. Each iteration brought the system closer to handling the spontaneous, fragmentary, and multimodal nature of real conversation.

AdApt built on earlier KTH dialogue systems — Waxholm, Gulan, and August — but pushed further into negotiative dialogue, constraint manipulation, multimodal output coordination, and real-time handling of fragmented speech. The project ultimately contributed to Joakim Gustafson's doctoral dissertation (2002) and Linda Bell's PhD thesis on linguistic adaptations in spoken human–computer dialogue (2003).

Project Details

Funder VINNOVA / Centre for Speech Technology
Duration 1999 – 2003
Principal Investigator Joakim Gustafson
Host institution KTH Royal Institute of Technology
Department Speech, Music and Hearing (TMH)
Industrial partner Telia Research
Application domain Apartment search, Stockholm
Centre for Speech Technology · CTT
The System

Multimodal Interaction with Agent Urban

AdApt combined speech recognition, an animated 3D talking head, and an interactive Stockholm map into a single conversational experience. Nine loosely coupled services — spanning Prolog dialogue logic, Tcl/Tk interfaces, and a Java message broker — communicated through XML over TCP sockets.

[Architecture diagram: input devices (ASR, GUI input, vision) feed the input understanding modules (multimodal parser, reference resolution, speech act identifier) via the Input/Output Manager (input fusion module, message handling module, output fission module) to the Dialogue Manager and response planner; the multimodal output generator drives the output devices (GUI output, animated character, speech synthesizer).]

Figure from Gustafson 2002. A system architecture for a multimodal conversational system — almost identical to AdApt. The I/O Manager is divided into three sub-modules: input fusion (merging multimodal inputs), message handling (coordinating timing and routing), and output fission (decomposing DM output for each output channel).

User Interface
🗣

Conversational Speech Interface

Users spoke freely in Swedish — no fixed command syntax required. An open microphone (rather than a push-to-talk button) enabled more natural turn-taking. The system used continuous speech recognition (Nuance) and a robust parser to handle incomplete utterances, disfluencies, topicalized phrases, and mixed verbal–gestural input.

🏙

Interactive Map of Stockholm

The graphical interface centered on a detailed, scrollable map of Stockholm. Apartment icons appeared as colored markers. Users could click icons to select apartments and cross-reference a live table of addresses, sizes, and prices — all while speaking. Mouse clicks and speech were fused in real time by the I/O Manager.

👤

Animated Talking Agent "Urban"

The system's voice was embodied in a 3D animated talking head — a parameterized facial model driven by a TTS engine with gesture rules. Urban displayed listening and thinking expressions in real time, used gaze shifts to signal turn-taking, and showed constraint icons in a thought-bubble to confirm what the system had understood.

🏠

Live Apartment Database

Rather than a static dataset, AdApt used a dynamic database scraped from Stockholm real-estate websites, updated six times daily. Structured fields (address, size, price) were augmented with information extracted from free-text descriptions — capturing features like fireplaces, balconies, dishwashers, and cable TV.

Technical Components

Java Message Broker

Every component registered a service name with the broker on startup — UttPars (parser), DM (dialogue manager), DB (database), NuanceServer (ASR) — and all inter-component traffic flowed through it as XML over TCP on port 2345. Each client declared an input filter (xml_2_term) and output filter (nice_xml) so components could speak native Prolog terms while the wire format remained XML. Two call modes supported synchronous queries (callfunc) and fire-and-forget events (callproc). The main loop ran accept_request_from_client continuously, dispatching incoming messages to registered Prolog callback predicates.
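The broker's dispatch semantics can be sketched in miniature. This is a Python illustration only: the real broker was Java speaking XML over TCP on port 2345, and everything here beyond service registration and the callfunc/callproc distinction is an assumption.

```python
# Minimal in-process sketch of the broker's dispatch semantics.
# The real broker spoke XML over TCP; here, service registration and
# the two call modes are modelled with plain Python callables.

class Broker:
    def __init__(self):
        self.services = {}          # service name -> handler callback

    def register(self, name, handler):
        """A component registers its service name on startup."""
        self.services[name] = handler

    def callfunc(self, name, message):
        """Synchronous query: block until the service returns a result."""
        return self.services[name](message)

    def callproc(self, name, message):
        """Fire-and-forget event: dispatch and ignore any return value."""
        self.services[name](message)


broker = Broker()
broker.register("DB", lambda msg: {"hits": 3, "query": msg})
result = broker.callfunc("DB", "rooms=2")
```

The input/output filters (xml_2_term, nice_xml) would sit at the boundary of `callfunc`/`callproc`, translating between the XML wire format and each component's native representation.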

🔤

Two-Pass SICStus Prolog Parser

A two-phase robust shallow parser converted Swedish speech-recogniser output into typed semantic expressions. The first pass (hintfinder.pl) extracted over 22 hint types left-to-right — numbers, addresses, room counts, area names, comparative phrases, and discourse markers — merging them with graphical mouse-click events via a parallel graphic_hintfinder.pl pass. The second pass (structurer.pl) assembled the hints into semantic forms and classified each utterance as closing or non_closing. The four expression types and the closing/non-closing decision algorithm are described in depth in the Fragmented Utterances section.
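The two-pass idea can be sketched as follows. The real passes were SICStus Prolog (hintfinder.pl, then structurer.pl); the hint names, vocabulary, and closing rule below are simplified assumptions for illustration.

```python
# Illustrative two-pass parse: pass 1 scans left-to-right for typed
# hints; pass 2 assembles them into a semantic form and classifies
# the utterance as closing or non_closing.
import re

def hintfinder(utterance):
    """Pass 1: extract typed hints left-to-right (toy vocabulary)."""
    hints = []
    for tok in utterance.lower().strip("?.!").split():
        if re.fullmatch(r"\d+", tok):
            hints.append(("number", int(tok)))
        elif tok in ("södermalm", "vasastan", "kungsholmen"):
            hints.append(("area", tok))
        elif tok in ("room", "rooms"):
            hints.append(("room_marker", tok))
        elif tok in ("how", "what", "where"):
            hints.append(("wh_marker", tok))
    return hints

def structurer(hints):
    """Pass 2: assemble hints and decide closing vs non_closing."""
    types = [t for t, _ in hints]
    if "wh_marker" in types:
        return ("wh", hints, "closing")        # wh-questions always close
    return ("frag", hints, "non_closing")      # bare fragments: wait

form = structurer(hintfinder("How much does the apartment cost?"))
```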

🧠

Goal-Based Dialogue Manager

State was represented as an immutable triple dialogue_state(Objects, History, Agenda). Each turn passed through an eight-stage context pipeline — from reference resolution through ellipsis expansion, feature-based tagging, and multi-hypothesis voting — culminating in goal-directed action selection. Eight named goals drove the agenda, each decomposing into hierarchical sub-plans. When database searches returned no results, relax_one_constraint/3 iteratively widened soft-constraint bounds. The context resolution machinery is described in depth in the Contextual Reasoning section.
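The immutable-triple idea can be sketched like this, assuming each turn produces a fresh state rather than mutating the old one. Field names follow the Prolog term dialogue_state(Objects, History, Agenda); the update function and its arguments are illustrative.

```python
# Sketch of the immutable dialogue-state triple: advancing a turn
# builds a new state; the previous one is left untouched, which makes
# backtracking over hypotheses cheap.
from typing import NamedTuple

class DialogueState(NamedTuple):
    objects: tuple   # apartments currently in focus
    history: tuple   # resolved utterances, most recent first
    agenda: tuple    # pending goals, in order

def advance(state, resolved_utterance, new_goals=()):
    """One turn: push the utterance onto the history, extend the agenda."""
    return DialogueState(
        objects=state.objects,
        history=(resolved_utterance,) + state.history,
        agenda=state.agenda + tuple(new_goals),
    )

s0 = DialogueState((), (), ("greet_user",))
s1 = advance(s0, "wh(money-Y, [...])", ["answer_price"])
```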

🗄

Apartment Database Server

Apartments scraped from Stockholm real-estate websites were pre-parsed into 23-field associative structures covering street name, number, floor, living space, rooms, price, monthly fee, construction year, renovation history, and a coordinate pair (x, y) for map placement. Free-text ad descriptions were mined by a Prolog NL pre-parser to populate list fields such as bathroom (tub, shower…), kitchen (furnished, island…), balcony direction, and amenities. At query time db_select/3 applied relational algebra — filter, project, join — over in-memory Prolog facts. Hard constraints (number of rooms, floor) were never relaxed; soft constraints (price, area, year, coordinates) could be widened iteratively. A WOZ-quality filter ensured all loaded records had valid coordinates and core fields.
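The select-then-relax behaviour can be sketched as follows. The hard/soft split and the price field come from the text; the data, helper names, and the ±50% widening step are illustrative assumptions.

```python
# Sketch of db_select-style filtering with one-step soft-constraint
# relaxation: hard constraints (rooms, floor) are never touched; a
# failing query widens one soft constraint's bounds and retries.

APARTMENTS = [
    {"street": "hagagatan",  "rooms": 2, "price": 2_400_000},
    {"street": "nybrogatan", "rooms": 3, "price": 1_900_000},
]

HARD = {"rooms", "floor"}                     # never relaxed

def db_select(rows, constraints):
    """Relational 'filter': keep rows satisfying every interval bound."""
    def ok(row):
        return all(lo <= row[f] <= hi for f, (lo, hi) in constraints.items())
    return [r for r in rows if ok(r)]

def relax_one_constraint(constraints, field, factor=0.5):
    """Widen one soft constraint by `factor` of its span on each side."""
    assert field not in HARD
    lo, hi = constraints[field]
    span = hi - lo
    widened = dict(constraints)
    widened[field] = (lo - span * factor, hi + span * factor)
    return widened

c = {"rooms": (2, 2), "price": (0, 2_000_000)}
hits = db_select(APARTMENTS, c)               # empty: price bound too tight
hits = db_select(APARTMENTS, relax_one_constraint(c, "price"))
```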

I/O Manager & Fragmented Utterance Timing

The I/O Manager coordinated all multimodal timing: merging speech and mouse-click events before forwarding to the parser, and decomposing Dialogue Manager responses into parallel commands for the Agent, Map Handler, and Icon Handler. Its central innovation was deciding in real time whether a user utterance was complete or needed continuation — a problem explored in full in the Fragmented Utterances section. It also drove Urban’s turn-taking state machine, detailed in the Multimodal Output section.

🎤

Agent "Urban" — Gesture & Speech Synthesis

Urban was driven by a Tcl/Tk engine combining Snack 2.0, TTS 1.5, and GUB 2.1 for gesture animation. Two gesture classes — pre-recorded jog-gestures and procedurally generated proc-gestures — were synchronised to the TTS token stream via a hotspot offset marking each gesture’s peak. Urban’s full turn-taking state machine (five states, six transitions, gesture co-articulation algorithm) is described in the Multimodal Output section.
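Hotspot synchronisation amounts to a subtraction: if the gesture's peak lies a fixed offset into its animation, the animation must start that offset before the target token's onset. The timings and function below are invented for illustration.

```python
# Sketch of hotspot-based gesture scheduling: place the gesture's peak
# (its hotspot) on a chosen token in the TTS token stream.

def schedule_gesture(token_onsets, token_index, hotspot):
    """Return the animation start time (seconds, clamped at zero) that
    aligns the gesture's peak with the chosen token's onset."""
    return max(0.0, token_onsets[token_index] - hotspot)

# TTS onsets (seconds) for a three-token phrase; peak on the second token.
onsets = [0.00, 0.18, 0.52]
start = schedule_gesture(onsets, 1, hotspot=0.12)
```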

Research Goals

Six Long-Term Research Questions

AdApt was designed as an empirical testbed to investigate fundamental questions about how people interact with multimodal spoken dialogue systems.

01

Modality Use and System Architecture

How do people make use of different modalities — speech, pointing, clicking — and what are the implications of their choices for system architecture? When and why do users switch between speech and graphical interaction?

02

Reference and Context Resolution

How should the system interpret references not only to objects previously mentioned in dialogue, but also to objects currently visible on screen? Deictic and discourse references frequently co-occur in multimodal utterances.

03

Multimodality and Robustness

Can multimodality improve the robustness of a spoken dialogue system? Users often switch modality when speech interaction becomes difficult. Can the system exploit this switching behavior to recover from recognition errors?

04

Speech in Multimodal Settings

How does the multimodal setting — mouse, interactive map, animated speaking agent — influence how people speak? Does the presence of graphical referents change utterance structure, vocabulary, and prosody?

05

Influencing Users' Modality Choices

Can the system actively steer users’ choice of input modality? If the system consistently uses one modality for output, will users mirror that choice in their input?

06

Handling Fragmented Utterances

How can a system handle the large number of topicalized, incomplete, and fragmented utterances produced in multimodal interaction? What linguistic and contextual cues distinguish a pause mid-utterance from a genuine end-of-turn?

Key Findings

What AdApt Discovered

These four findings emerged from the Wizard-of-Oz corpus and subsequent user studies — each is explored in depth in the User Studies section below.

Disfluencies

Average Disfluency Rate

The disfluency rate per word (excluding unfilled pauses) was 6% — high but within literature ranges. Filled pauses and segment prolongations were most frequent, alongside repetitions and truncated words. Individual variation was substantial.

Convergence

Modality Convergence

The system’s own output modality directly shaped users’ subsequent input choices. See the User Studies section for the full convergence experiment.

Fragments

Topicalized and Fragmented Input

Users produced large numbers of topicalized phrases: "The apartment on Nybrogatan — how much is that?" Pause lengths between fragments varied systematically: avg. 0.5 s after a direct question, up to 3 s after a list presentation.

Adaptation

Linguistic Adaptation

Linda Bell’s PhD research (2003) documented systematic accommodation — users adjusted vocabulary, syntax, and prosody to mirror the system’s own language, a pattern analysed in depth in her dissertation.

User Studies

Wizard-of-Oz Data Collection

Before the fully automated system existed, a simulated version was used to collect naturalistic human–computer dialogue data. A human wizard operated the system's dialogue management and response generation behind the scenes, giving subjects the impression of interacting with a working system. The resulting corpus — 33 subjects, 50 dialogues, 1,845 utterances — became the empirical basis for three independent studies on disfluency, modality convergence, and user feedback. (Gustafson, Bell, Beskow, Boye, Carlson, Edlund, Granström, House & Wirén, 2000.)

The Wizard-of-Oz interface showing a form-based apartment query panel on the left with fields for street, price, rooms and amenities, a colour Stockholm map with apartment icons in the centre, and a panel of pre-made response templates at the bottom
The wizard interface. The left panel let the wizard enter the user’s verbally expressed preferences; the centre map displayed matching apartment icons; the bottom panel provided ready-made answer templates. Wizard response times were typically 1–2 seconds, giving subjects the impression of a fully functional system.

The Wizard-of-Oz Setup

The wizard had access to a frozen snapshot of the live apartment database and selected responses from a menu of pre-formulated templates — verbal descriptions, map highlights, and table entries — covering the most common dialogue situations. The system automatically displayed a “thinking” gesture on Urban’s face as soon as speech was detected, so the wizard could take one to two seconds to select the right template without users noticing the human-in-the-loop.

To avoid verbal biasing — where written task instructions cause subjects to repeat the exact words used in the scenario — tasks were presented as pictorial scenarios. Each scenario showed a shaded area on the Stockholm map, a timeline indicating a preferred construction year range, room-count bars, and photographs of interior features (bathroom fittings, tiled stoves) rather than naming them explicitly. Subjects were told only to “find apartments you are interested in.”

A pictorial scenario strip showing: a Stockholm map with two shaded pink search areas; a timeline from 1800 to 2000 with a highlighted range; a room-count axis from 1 to 5 with a highlighted range; a photograph of a bathroom with a claw-foot tub; a photograph of a room with a tiled stove
A pictorial scenario. From left: a preferred area on the Stockholm map; a construction-year range (timeline); a room-count range; a bathroom photograph; a tiled-stove photograph. No text descriptions were given, so subjects had to express the constraints in their own words.

Three Studies from the Corpus

Study 1 — Bell, Eklund & Gustafson (2000)

Disfluency Distribution: Unimodal vs. Multimodal

Does adding a visual channel make people speak more fluently? A representative subset of the AdApt WOZ corpus (16 speakers, 847 utterances from the full 33-subject, 1,845-utterance collection) was compared against a unimodal telephone corpus of the same size. Every disfluency — filled pauses, repairs, prolongations, truncations — was hand-annotated.

6.0%
Multimodal disfluency rate per word
14.3%
Unimodal telephone rate per word
1.2%
Repairs (multimodal)
4.3%
Repairs (unimodal)

The multimodal interface produced significantly fewer disfluencies across all categories (p < 0.001 for repairs and prolongations). Individual speaker variation was the strongest predictor of disfluency rate — stronger than gender, task type, or utterance length. The visual display appears to reduce working-memory load, allowing users to plan utterances more fluently.

Study 2 — Bell, Boye, Gustafson & Wirén (2000)

Modality Convergence

Does the system’s output modality influence how users choose to refer to apartments? 16 subjects each used two simulated systems in counterbalanced order: System G (highlighted apartment icons graphically) and System S (referred to apartments by colour name verbally).

Users who encountered System G first showed a significant increase in colour references in their second dialogue (cNorm: 0.04 → 0.48, p = 0.012). They adopted the system’s graphical strategy and kept it even when switching systems.

Weak convergence was supported: users adopted new referential behaviours while retaining old ones. Strong convergence was not: users did not abandon previously adopted strategies. Instead they amplified behaviours from the first dialogue into the second — a form of entrainment that persisted across system changes.

Post-session interviews revealed that users perceived graphical input as less efficient than speech — yet 94% used it anyway, suggesting the visual interface actively shaped referential choices regardless of perceived efficiency.

Study 3 — Bell & Gustafson (2000)

Positive & Negative User Feedback

Users naturally produce feedback cues — “yes”, “okay”, “no”, “not really” — even when the system neither requests nor explicitly acknowledges them. All 1,845 utterances in the WOZ corpus were annotated along three dimensions: valence (positive/negative), explicitness (explicit/implicit), and function (attention/attitude).

18%
Of all utterances contained feedback
94%
Of subjects used feedback at least once
65%
Positive
35%
Negative

Individual variation was extreme: feedback rates ranged from 0% to 70% across subjects. Negative feedback — especially attitude feedback — was found to carry actionable information: it signals error recovery opportunities, preference corrections, and moments where a clarification subdialogue should be initiated. 94% of feedback occurred within longer user turns rather than as standalone utterances.

Technical Highlights

Engineering the System

Building AdApt required novel solutions across speech processing, dialogue management, and multimodal output generation.

Real-time Fragmented Utterance Handling

A new system architecture with a dedicated I/O handler allowed AdApt to dynamically judge whether a user utterance was complete or required waiting for continuation — a critical capability given the high prevalence of topicalized and fragmented input.

🔎

Robust Parsing for Negotiative Dialogue

A robust parser handled ellipsis, anaphora, out-of-vocabulary words, and mixed speech–mouse input. A mouse click combined with "How much is it?" was interpreted as a complete, fully specified query based on the current dialogue context.

🎭

Turn-taking Gestures and Hourglasses

Agent Urban used real-time facial feedback: a "listening" expression while the user spoke, and a "thinking" gesture the moment silence was detected. Combined with visual hourglass icons, this reduced user uncertainty about system state and kept turn-taking smooth.

🗺

Constraint Manipulation and Visualization

Users could express and refine apartment search constraints incrementally across turns. The system tracked active constraints, visualized their effects on the map in real time, and resolved conflicts through negotiative sub-dialogues with the user.

🔊

Multimodal Output Coordination (GESOM)

A GESOM-based model specified and coordinated multimodal output — deciding what to convey verbally, what to display on the map, and what to show in the table, while ensuring consistency across modalities and appropriate prosodic marking of new information.

🧠

Contextual Reasoning

The dialogue manager maintained a rich context model for resolving cross-turn references, handling follow-up questions, and managing the interplay between comparison, browsing, and information-seeking sub-tasks — all within a single coherent dialogue session.

Constraint Visualization

Making the System's Reasoning Transparent

A central challenge in conversational database browsing is that the system's internal state — which constraints it is currently using, whether it has relaxed any of them — is invisible to the user. AdApt addressed this through a real-time constraint visualization strategy: every constraint the system understood was immediately rendered as a symbolic icon on screen, giving users a continuous, graspable picture of the system's current search query. (Gustafson, Bell, Boye, Edlund & Wirén, 2002.)

AdApt interface showing Urban's thought balloon with '=2M:-' and the Constraints panel displaying icons for 2-room apartment, elevator, apartment type, and the relaxed price range 1.5–2.5M
Figure from Gustafson et al. 2002. Urban's thought balloon shows the just-recognized constraint "=2M:-". The Constraints panel (bottom left) reveals the system's actual search state: the price has been automatically relaxed to the range 1.5–2.5M. The trashcan icon allows drag-and-drop removal of any constraint.

Immediate Iconic Feedback After Each Turn

Immediately after the user finishes speaking, the system responds by displaying icons that represent each information unit it successfully recognised and parsed. This replaces the verbose verbal confirmation prompts ("You said you wanted a two-room apartment…") that make speech-only systems feel slow and unnatural.

Crucially, the system shows two distinct things: the thought balloon above Urban's head displays what was just said in the current turn, while the persistent Constraints panel shows the system's actual inner state — including any automatic relaxations. In the example above, the user said "two million" but the system is actually searching the range 1.5–2.5M. Without the icon panel, this silent relaxation would be entirely invisible.

The constraint icons are also interactive. A user can remove any constraint by dragging its icon to the trashcan, or multimodally by pointing at it and saying "forget this" or "forget about the balcony". This turns the icon layer into a direct manipulation interface for query editing — without requiring the user to remember or re-state earlier turns.

Hard constraint — never relaxed (rooms, floor)
Soft constraint — relaxable (price ±50%, year ±20yr, coordinates)

Designing Icons Across Abstraction Levels

A key design question was how abstract or concrete the constraint icons should be. Abstract icons have a shorter visual form but a steeper learning curve; concrete icons are immediately recognisable but can feel cluttered. The team evaluated three levels of abstraction for each apartment attribute — from pure geometric symbols (top row) to semi-realistic pictograms (middle) to photographic-style illustrations (bottom).

In the end, AdApt opted for the more concrete level. User studies showed that abstract icons were harder to interpret during short sessions, whereas concrete icons were picked up intuitively — consistent with the "idiomatic paradigm" in icon design, where users unconsciously learn symbol meanings through repeated exposure during naturalistic tasks.

Each icon could also carry a small lock button. By clicking the lock, the user explicitly marked a constraint as necessary — instructing the system not to relax it regardless of the relaxation strategy in use. This gave advanced users direct control over the boundary between hard and soft constraints.

The Icon Handler module translated each drag-and-drop or verbal reference to an icon into a semantic operation — retraction, modification, or locking — and sent it as an XML message to the I/O Manager, which merged it with any concurrent speech input before forwarding to the parser.
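The event-to-operation mapping can be sketched as follows. The element and attribute names in the generated XML are invented for illustration; the project's actual message schema is not documented here.

```python
# Sketch of the Icon Handler's translation from a GUI event on a
# constraint icon into an XML operation message for the I/O Manager.
import xml.etree.ElementTree as ET

def icon_event_to_xml(event, constraint):
    op = {"drag_to_trash": "retract",   # icon dropped on the trashcan
          "lock_click":    "lock",      # user marks constraint as hard
          "edit":          "modify"}[event]
    msg = ET.Element("icon_op", op=op, constraint=constraint)
    return ET.tostring(msg, encoding="unicode")

msg_xml = icon_event_to_xml("drag_to_trash", "balcony")
```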

Three rows of icons for bath tub, freezer and microwave oven, ranging from abstract geometric symbols at the top to photographic-style concrete illustrations at the bottom
Figure from Gustafson et al. 2002. Icon abstraction spectrum for three household appliances: bath tub (left), freezer (centre), microwave oven (right). Top row: abstract geometric symbols. Middle: semi-realistic pictograms. Bottom: concrete photographic-style illustrations. AdApt used the more concrete level to minimise learning time in short user sessions.
Real-time Processing

Handling Fragmented Utterances

An open microphone, a multimodal interface, and a system that encourages mixed initiative all conspire to produce the same effect: users speak in fragments, pausing mid-utterance to think, to gesture at the screen, or to react to what the system has just shown them. AdApt needed to decide in real time — at every silence — whether the user had finished speaking or was about to say more. (Bell, Boye & Gustafson, 2001.)

The Core Problem: Silence Does Not Mean Finished

Spontaneous spoken dialogue is full of pauses that fall inside an utterance rather than at its end. A user might say "I would like a… [pause] …three-room apartment… [pause] …on Södermalm or… [pause] …Vasastan". Each silence looks like end-of-turn from the recogniser's perspective, yet acting on the first fragment would produce a wrong or premature system response.

Analysis of 800 utterances from the Wizard-of-Oz corpus revealed the scale of the problem: while 60% of utterances contained a single, unambiguously complete fragment, one third of all utterances contained at least one non-closing fragment followed by more input. Average pause length was around 1 second for both closing and non-closing pauses, making duration alone an unreliable signal; 90% of closing pauses were 2.5 seconds or shorter, but non-closing pauses could extend to 3.5 seconds.

A further complication: 18% of all utterances began with positive or negative feedback on the system's previous turn ("Yes… ehh, is there one with a stuccoed ceiling?"). In isolation the "yes" would look like a complete acknowledgement; in context it is the preamble to a question.

The AdApt user interface showing Stockholm map with apartment markers on the left and the animated talking head Urban on the right
Figure. The AdApt user interface: an interactive Stockholm map with apartment markers (left) and the 3D animated talking head Urban (right). The open-microphone design and simultaneous map display both contributed to the high prevalence of fragmented user utterances.

Four Semantic Expression Types

wh(X, P)

Find X with property P. "How much does the apartment cost?" → always treated as closing.

yn(X, P)

Yes/no question about X. "Does it have a balcony?" → always treated as closing.

frag(X, P)

A definite NP fragment referring to an object. "The apartment on Hagagatan…" → closing or non-closing depending on dialogue context.

ack(T)

Acknowledgement (positive, neutral, negative). "Yes…" → closing only when DM asks a yes/no question.

The Incremental Interpretation Algorithm

After each system turn the Dialogue Manager sends the I/O Manager (IOM) a list of closing patterns — utterance types that should be treated as complete given the current dialogue state. For instance, if the DM has just asked "How much are you willing to pay?", then frag(money-X,[]) is added to the closing list, so that a bare price fragment like "Two million" will be recognised as a complete answer.

When speech is detected, the IOM calls the parser and receives a semantic expression S tagged as closing or non-closing:

  • If closing: pass S to the DM, generate response R, output it. Urban performs a turn-taking gesture, then looks back at the user.
  • If non-closing: wait up to 4 seconds. Urban raises his eyebrows and holds an attentive expression — a back-channelling signal that encourages the user to continue speaking.
  • If new input arrives within the timeout: append it to the previous fragment, re-parse the combined utterance, and repeat the decision.
  • If the timeout fires with no new input: send the non-closing expression to the DM anyway; this typically results in a clarification or an "I'm sorry, I didn't understand" response.

A second decision point arises while the DM is computing a response: if new user input arrives before the response has been sent to Urban, the IOM checks whether the combined expression S₂ is closing and different from the original S. If so, S₂ is sent to the DM to generate a new response R₂ — the earlier response is discarded. This allows the system to handle utterances like "I want to look at apartments in the Old Town… with two bedrooms" coherently even when the fragments straddle the system's response computation window.
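The decision loop above can be sketched compactly. This is a deliberately reduced model: parse() is assumed to return a (semantic form, is-closing) pair, fragments are concatenated before re-parsing, and the 4-second timer is collapsed into a flag.

```python
# Sketch of the IOM's incremental interpretation loop: buffer fragments,
# re-parse the concatenation after each one, and hand the expression to
# the DM once it is closing (or once the timeout fires regardless).

def interpret(fragments, parse, timeout_expired=False):
    """Return the expression to send to the DM, or None while still
    waiting for a continuation within the timeout window."""
    form = None
    buffered = ""
    for frag in fragments:
        buffered = (buffered + " " + frag).strip()
        form, closing = parse(buffered)
        if closing:
            return form                    # complete: send to DM now
    # no closing fragment arrived in this window
    return form if timeout_expired else None

def toy_parse(text):
    # Toy closing rule for the sketch: a question mark closes the turn.
    return (text, "?" in text)

done = interpret(["The apartment on Hagagatan",
                  "how much does it cost?"], toy_parse)
```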

Figure. The AdApt system architecture.

A Worked Example: Fragment Merging in Practice

User says (with pause between fragments)

"The apartment on Hagagatan…"
[silence — IOM waits]
"…how much does it cost?"

What the IOM does

Fragment 1 parsed as frag(apartment-X, [street_name=hagagatan, ref(apartment-X, definite_np)]) — non-closing. IOM waits.

Fragment 2 appended → combined parse becomes wh(money-Y, [price=Y, street_name=hagagatan, ref(apartment-X, definite_np)]) — closing. Sent to DM.

DM answers: "The apartment on Hagagatan costs 1,695,000 crowns."

Contextual Reasoning

Reference Resolution Across Two Domains

How should a multimodal dialogue system interpret "the green apartment", "What about the red one?", or "Is there anything cheaper?" — expressions that only make sense in light of what has been shown, said, and done in previous turns? This paper by Boye, Wirén & Gustafson (2004) presents a single, uniform mechanism — typed combinators resolved by β-reduction — validated across two radically different dialogue domains: the AdApt apartment browser and the NICE fairy-tale game.

The ADAPT system interface showing an apartment listing table at top and a Stockholm map with apartment markers alongside Urban's face
Figure 1 from Boye et al. 2004. The AdApt graphical interface. Apartment icons appear as distinct coloured markers on the map; users refer to them deictically ("the green apartment") or anaphorically ("the red one"). The visual context — the set of currently displayed icons — serves as the primary source of candidate antecedents for reference resolution.
Screenshot from the NICE fairy-tale game showing the animated character Cloddy Hans standing in front of a shelf with objects and a large fairy-tale machine
Figure. The NICE fairy-tale game. Cloddy Hans — an animated character inspired by H.C. Andersen — stands in Andersen's authoring laboratory. The user collaborates with Cloddy Hans to put objects from the shelf into the correct slots of the fairy-tale machine. Unlike AdApt's scrollable map, the camera follows Cloddy Hans through an immersive 3D world, making visual context management more complex.

Three Knowledge Sources for Reference Resolution

Visual Context

Objects currently visible on screen. In AdApt: apartments whose coloured icons are shown on the map. In NICE: all objects in the room — shelf, machine, slots, and their labelled properties — kept constant for the scene. Grounds deictic expressions like "the green apartment" and "the slot furthest away".

Dialogue History

A recency-ordered list of resolved utterances and system turns. Used to resolve pronouns ("it"), ellipses ("What about the red one?"), and comparative references ("Is there anything cheaper?" — referring to the price of the previously discussed apartment). The most recently type-compatible antecedent wins.

Domain Model

A typed object hierarchy defining which attributes belong to which types. In NICE: picking up an object is a pickUp(agent, patient:thing) action. This prevents "it" from being wrongly resolved to the machine when the hammer was last picked up — the type constraint eliminates impossible antecedents.

A Single Mechanism for All Reference Types

The key contribution of the paper is that pronouns, definite descriptions, ellipses, and comparative references can all be resolved by the same formal operation: β-reduction in lambda calculus. User utterances are parsed into typed combinators — lambda expressions over the domain model. Resolving a reference amounts to applying the expression to the correct antecedent object found in the visual or dialogue context.

For example, utterance A5 — "I see… The green apartment… how much does it cost?" — is represented as:

λx:apartment ?p:money (x.price = p & x.color = green)

The system finds apt1 (the green apartment) as the antecedent and applies the expression via β-reduction, yielding "Give me the price of apt1" — a direct database query.

The subsequent elliptic question A7 — "What about the red one?" — is handled by reverse functional application: the system extracts a higher-order function from the resolved representation of A5 (λy:apartment ?p:money (y.price = p)) and applies it to the red apartment. The same β-reduction step produces a new query without any special-case ellipsis handling.
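The resolve-then-apply pattern can be sketched with ordinary functions standing in for the paper's typed combinators. Apartments as dicts, the visual context as a list of on-screen icons, and the recency rule are assumptions of this sketch; applying the extracted function to the resolved antecedent plays the role of the β-reduction step.

```python
# Sketch of reference resolution as function application: A5's question
# becomes a function over apartments; resolving "the green apartment"
# picks the antecedent, and applying the function answers the query.
# A7 reuses the same function on a new antecedent (the ellipsis case).

visual_context = [
    {"id": "apt1", "color": "green", "price": 1_695_000},
    {"id": "apt2", "color": "red",   "price": 2_100_000},
]

def resolve(pred, context):
    """Most-recent context object satisfying the description (recency
    rule: later entries are assumed more recent)."""
    return next(o for o in reversed(context) if pred(o))

# A5: "The green apartment… how much does it cost?"
a5_query = lambda x: x["price"]                  # stands in for λx:apartment ?p:money
antecedent = resolve(lambda o: o["color"] == "green", visual_context)
price_a5 = a5_query(antecedent)                  # the β-reduction step

# A7: "What about the red one?" — same extracted function, new referent
price_a7 = a5_query(resolve(lambda o: o["color"] == "red", visual_context))
```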

How AdApt and NICE Differ

The two domains stress the resolution mechanism in different ways. In AdApt, the visual context changes continuously: each new search produces a new set of coloured apartment icons, replacing the old set. The system tracks a visual context history — a recency-ordered list of icon sets — but in practice the current set provides almost all necessary antecedents. Users rarely refer back to apartments no longer on screen.

In NICE, the visual context is fixed for the duration of a scene — all objects in the room are always salient — but past events become important. The utterance "Where we put the magic wand… there you can put it" refers to a slot identified by a previous physical action, not by anything currently visible. An event history (planned for future work at the time of the paper) is needed to resolve such references.

Both systems use the same recency principle for disambiguation: when a reference is type-compatible with multiple antecedents, the system prefers the one that appeared most recently in the dialogue or visual context. In AdApt this is almost always sufficient; in NICE, domain model constraints do additional work ruling out impossible resolutions.

The NICE project continued beyond AdApt, exploring fairy-tale dialogue with children in public museum settings. Visit the NICE project page ↗ for system demonstrations, collected corpora, and further publications.

AdApt — Apartment Dialogue Fragment

A1. User: Are there any two-room apartments on the South Side that cost less than 2 million?

A3. User: A balcony would be nice. [elliptic preference]

A5. User: I see… The green apartment… how much does it cost? [deictic + fragmented]

A7. User: What about the red one? [elliptic question — same question type, new referent]

A9. User: Okay… Is there anything cheaper? [comparative — price of previously discussed apt]

NICE — Fairy-tale Dialogue Fragment (age 11)

N1. User: I want you to go to the shelf.

N3. User: I want you to pick up the bag.

N5. User: Yes, pick up the sachet. [clicks on money sachet — graphical + verbal]

N7. User: Then I want you to go to the slots.

N9. User: Now I want you to put the money sachet in the farthest slot. [spatial reference in 3D]

Multimodal Output

Specifying What to Say Without Knowing How to Say It

Coordinating speech, facial expressions, gaze, head movements and map updates across a single dialogue turn is a hard engineering problem. This paper by Beskow, Edlund & Nordstrand (2002) presents the XML-based output specification layer that sits between AdApt's Dialogue Manager and its animated agent — letting the DM declare communicative intent without knowing anything about which gestures are available or how they are rendered.

Colourful block diagram of the AdApt system showing the AdApt logo, Parser (pink), Database (pink), Speech Recogniser (orange), I/O Manager (green), Dialogue Manager (teal), Map Handler (blue), Agent (purple), Response Generator (pink), all sitting on a GUI Manager layer
Figure from Beskow et al. 2002. The AdApt system architecture. The Agent module (purple) receives output commands from the Dialogue Manager via the I/O Manager. The GUI Manager (blue base layer) provides a shared window frame for the Map Handler, Agent, and Icon Handler so all output modules share the same screen real estate and keyboard input.

The Abstraction Layer Problem

A naive design forces the Dialogue Manager to know which specific gestures the animated agent can perform — tying the DM to a particular output device and making the system hard to port. The solution is an XML-based output specification that describes communicative functions (emphasis, acknowledgement, listening) without specifying their physical realisation. The Agent module then looks up the appropriate gesture from a library and executes it.

This separation has two benefits. First, the same DM can drive radically different output hardware — a 3D animated agent, a simple hourglass icon, or a blinking lamp — by swapping only the Agent module and gesture library. Second, once gesture realisations have been implemented they can be re-used across dialogue systems and domains without touching the DM at all.

Five Dialogue System Events

Rather than waiting for the user to finish speaking before reacting, AdApt gives Urban something to do at every stage of processing. Five events trigger feedback during each turn:

  1. Start of speech — recogniser detects voice → enter listening state
  2. End of speech — silence detected → hold state
  3. Recognition done — words passed to parser → fire has_heard event gesture
  4. Semantics done — parse complete: if closing → busy; if non-closing → continued attention
  5. Planning done — DM has decided on a response → prepare to speak
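The event-to-feedback mapping above can be written down as a small dispatch table. This sketch uses the event and state names from the list, but the table encoding itself is an assumption:

```python
# Sketch of the per-turn feedback loop. Event and state names follow
# the five-event list; the dispatch structure is illustrative.
FEEDBACK = {
    "start_of_speech":  ("state", "listening"),
    "end_of_speech":    ("state", "hold"),
    "recognition_done": ("event", "has_heard"),
    "planning_done":    ("state", "talking"),
}

def on_semantics_done(fragment_is_closing):
    # Step 4 branches on the parser's closing / non-closing judgement.
    return ("state", "busy" if fragment_is_closing else "continued_attention")

def feedback_for(event, **kwargs):
    if event == "semantics_done":
        return on_semantics_done(kwargs["closing"])
    return FEEDBACK[event]

feedback_for("recognition_done")               # fire the has_heard gesture
feedback_for("semantics_done", closing=False)  # sustain continued attention
```

Because every pipeline stage emits one of these events, Urban is never left motionless while the user waits.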
Event Gestures — Transient

Finite-duration signals triggered at a specific moment. The gesture fires, completes, and leaves Urban in exactly the same pose as before. Examples:

  • has_heard — fired on Recognition done; head nod or eyebrow flash to signal the recogniser has processed the utterance
  • emphasis — fired mid-sentence on a stressed word; eyebrow raise synchronised to the TTS stroke
  • Agreement, surprise, or focus-marking nods triggered inline within a response
State Gestures — Continuous

Arbitrary-duration behaviour that Urban sustains until a new state is entered. Five states in AdApt:

  • idle — default waiting posture
  • listening — attentive expression while user speaks
  • continued attention — non-closing fragment; slight lean, eyebrows raised
  • busy — DM computing; gaze averted, "thinking" expression
  • talking — lip-synced speech with sustain blinks

Each state has enter, sustain, and exit gesture segments. Enter/exit are performed once; sustain gestures loop at random intervals for naturalness.

The XML Output API

The DM emits a single <response> XML element containing the spoken text interleaved with <state> and <event> tags. A background attribute sets global context (positive, negative, neutral) that influences gesture selection weights across the whole response.

<response background="positive">
  ja den röda lägenheten
  <state>talking</state>
  har öppen spis
  <event>emphasis</event>
</response>

(Swedish: "yes, the red apartment has a fireplace.")

Tags are TTS-synchronised: a <state> tag takes effect when the first stressed vowel of the following word is spoken. This means gesture timing is driven by the prosodic structure of the spoken output rather than by wall-clock delays.
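As a sketch of what the receiving side has to do, the mixed-content element can be walked with Python's standard xml.etree, yielding speech and gesture commands in document order (the command representation is our own; only the tag and attribute names come from the paper):

```python
import xml.etree.ElementTree as ET

response = ('<response background="positive">'
            'ja den röda lägenheten'
            '<state>talking</state>'
            'har öppen spis'
            '<event>emphasis</event>'
            '</response>')

def walk_response(xml_string):
    """Yield (kind, value) commands in document order.

    'speak' items carry text for the TTS; 'state'/'event' items are
    gesture commands, to be triggered at the first stressed vowel of
    the following word (the TTS synchronisation itself lives elsewhere).
    """
    root = ET.fromstring(xml_string)
    commands = [("background", root.get("background", "neutral"))]
    if root.text and root.text.strip():
        commands.append(("speak", root.text.strip()))
    for child in root:                       # mixed content: text in .tail
        commands.append((child.tag, child.text))
        if child.tail and child.tail.strip():
            commands.append(("speak", child.tail.strip()))
    return commands

walk_response(response)
# [('background', 'positive'), ('speak', 'ja den röda lägenheten'),
#  ('state', 'talking'), ('speak', 'har öppen spis'),
#  ('event', 'emphasis')]
```

Note how the interleaving survives: the traversal preserves exactly where in the spoken string each state change and event gesture belongs.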

Gesture Co-articulation

Treating each gesture as independent produces unnatural movement: after a head nod the agent would snap back to neutral before beginning a gaze shift. A co-articulation algorithm merges overlapping and adjacent gestures. It preserves the area around each gesture's stroke (the moment of maximum expressiveness) while reducing the approach and recovery segments and interpolating them with a smooth spline curve — the same principle used in articulatory co-articulation in human speech.

Gesture realisations are selected by weighted random choice within each semantic group, ensuring Urban never performs exactly the same sequence twice. The background variable shifts the weights — an emphatic nod is de-weighted for negative-valence responses where an eyebrow raise is more appropriate.
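The weighted draw itself is straightforward. A sketch with an invented emphasis group, where the background value rescales weights before selection (all gesture names, weights, and multipliers are illustrative):

```python
import random

# Illustrative gesture library: realisations in the "emphasis" group
# share a meaning but differ in style. Each has a base weight plus
# per-background multipliers (all numbers are assumptions).
EMPHASIS = [
    {"name": "head_nod",      "weight": 3, "background": {"negative": 0.2}},
    {"name": "eyebrow_raise", "weight": 2, "background": {"negative": 2.0}},
    {"name": "eye_widening",  "weight": 1, "background": {}},
]

def pick_gesture(group, background="neutral", rng=random):
    # Rescale each realisation's weight by its multiplier for the
    # current background, then draw one at weighted random.
    weights = [g["weight"] * g["background"].get(background, 1.0)
               for g in group]
    return rng.choices(group, weights=weights, k=1)[0]["name"]

# With a negative background the nod is de-weighted (3 -> 0.6) and the
# eyebrow raise boosted (2 -> 4), so the same group behaves differently.
pick_gesture(EMPHASIS, background="negative")
```

Randomising within a semantically equivalent group is what keeps Urban from ever repeating an identical gesture sequence, while the background multiplier keeps the choice appropriate to the valence of the response.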

Table 1 showing a complete AdApt dialogue turn: User says 'den röda lägenheten... har den öppen spis?' System responds 'ja den röda lägenheten har öppen spis'. Events, states and gestures are listed for each processing stage.
Table 1 (Beskow et al. 2002). A complete question–answer turn in AdApt. The user asks a fragmented yes/no question ("the red apartment… does it have a fireplace?"). The table maps each processing stage to the corresponding dialogue system event, Urban's current state, and any event gesture fired. Note that the continued attention state is triggered when the parser classifies the first fragment as non-closing — before the user has finished speaking.
State machine diagram showing Urban’s five turn-taking facial states: Attention, End-of-speech detection, Continued attention, Taking turn, and Talking, connected by labelled transitions for Speech detected, Non-closing, Closing, Timeout, Answer prepared, and Utterance finished
Turn-handling gestures in AdApt (Gustafson 2002, Fig. 15). Each face represents a distinct system state; transitions are driven by speech recogniser and parser events. The state machine runs in parallel with — and visually reflects — the incremental fragment-handling algorithm in the I/O Manager.

Urban’s Turn-Taking State Machine

Rather than a fixed “thinking face” between user turns and system replies, Urban moves through a principled state machine whose transitions are driven directly by the processing pipeline. The diagram shows five distinct facial states and six labelled transitions that together cover every possible dialogue moment.

The cycle begins in Attention — a neutral, forward-looking posture that signals readiness. The moment the speech recogniser detects voice, Urban shifts to End-of-speech detection, a subtly more focused expression that confirms the system is actively listening. When silence is detected the parser evaluates the fragment:

  • Non-closingContinued attention: eyebrows slightly raised, gaze held on the user — a back-channelling signal encouraging more speech. The system waits up to four seconds.
  • Timeout (from Continued attention) → Talking: the system gives up waiting and delivers whatever response it can compute from the fragment so far.
  • ClosingTaking turn: gaze averts slightly as the DM begins computing — a natural signal that the conversational floor is changing hands.
  • Answer preparedTalking: lip-synchronised speech begins, with sustain blinks and emphasis gestures fired inline at stressed words.
  • Utterance finishedAttention: Urban looks back at the user to signal the floor is theirs again.

Every transition except the four-second timeout is triggered by a real processing event rather than a wall-clock delay or heuristic, which means Urban's behaviour is always an accurate reflection of the system's internal state. User studies confirmed this approach reduced uncertainty about whether the system had understood and was responding, making dialogues feel markedly more natural than a push-to-talk baseline.
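The five states and six labelled transitions can be captured in a small transition table. In this sketch the state and trigger names follow the diagram; the table encoding, and the re-entry from continued attention on renewed speech, are our reading of it:

```python
# Transition table for Urban's turn-taking faces (states and triggers
# after Gustafson 2002, Fig. 15; the encoding is illustrative).
TRANSITIONS = {
    ("attention",                "speech_detected"):    "end_of_speech_detection",
    ("end_of_speech_detection",  "non_closing"):        "continued_attention",
    ("end_of_speech_detection",  "closing"):            "taking_turn",
    ("continued_attention",      "speech_detected"):    "end_of_speech_detection",
    ("continued_attention",      "timeout"):            "talking",
    ("taking_turn",              "answer_prepared"):    "talking",
    ("talking",                  "utterance_finished"): "attention",
}

def step(state, trigger):
    # Unknown (state, trigger) pairs leave the face unchanged.
    return TRANSITIONS.get((state, trigger), state)

# One complete question-answer cycle returns to the start state:
state = "attention"
for trigger in ["speech_detected", "closing",
                "answer_prepared", "utterance_finished"]:
    state = step(state, trigger)
```

Driving the face from a table like this is what guarantees the property claimed above: the expression shown is always a function of the pipeline's actual progress, never of elapsed time alone.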

Gesture Library Structure

Realisations

Each event and state maps to a set of alternative gesture realisations. Gestures in a group share the same semantic meaning but differ in style or duration — a head nod vs. an eye-widening for emphasis. Each is given a weight; the agent picks by weighted random draw, avoiding repetitive behaviour.

State Segments

State gestures are divided into enter, sustain, and exit segments. Enter and exit are performed once on transition; sustain gestures loop at random intervals for as long as the state persists. Enter/exit pairs are chosen together so the exit gesture correctly undoes the parameter changes made by the enter gesture.

Stroke Timing

Every gesture realisation records a time offset to stroke — the point of peak expressiveness (e.g. the apex of a head nod). This offset is used to align the gesture's stroke with the first stressed vowel of the corresponding TTS word, producing authentic prosodic co-expression between speech and face.
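Scheduling then reduces to simple arithmetic: start the gesture early by its recorded stroke offset so that the apex lands on the stressed vowel. A minimal sketch with invented timing values:

```python
def schedule_gesture(stressed_vowel_time, stroke_offset):
    """Return the time at which to *start* a gesture so that its
    stroke (the point of peak expressiveness) coincides with the
    first stressed vowel of the corresponding TTS word."""
    return stressed_vowel_time - stroke_offset

# A head nod whose apex occurs 0.18 s after the gesture begins,
# aligned to a stressed vowel the TTS reports at t = 2.40 s:
start = schedule_gesture(stressed_vowel_time=2.40, stroke_offset=0.18)
# start is about 2.22 s, so the nod's apex lands on the vowel onset
```

The same offset bookkeeping works for any realisation in a gesture group, which is what lets the weighted random selection swap realisations freely without breaking the prosodic alignment.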

See It in Action

AdApt System Demo

Watch a live demonstration of the AdApt system — a user browsing Stockholm apartments through natural speech and interactive map input, with agent Urban responding in real time.

AdApt system demonstration. The user browses apartments in Stockholm using natural spoken Swedish, map clicks, and constraint manipulation — all handled in real time by the distributed dialogue architecture.

Publications

Research Output

Papers published by the AdApt team spanning system architecture, corpus linguistics, multimodal interaction, and user evaluations.

Doctoral Theses
Thesis
Developing multimodal spoken dialogue systems: Empirical studies of spoken human–computer interaction
Gustafson, J. (2002). Doctoral dissertation, Department of Speech, Music and Hearing, KTH Royal Institute of Technology, Stockholm, Sweden.
PDF
Thesis
Linguistic adaptations in spoken human–computer dialogues
Bell, L. (2003). Doctoral dissertation, Department of Speech, Music and Hearing, KTH Royal Institute of Technology, Stockholm, Sweden.
PDF
2003–2005
Eurospeech
Robust parsing of utterances in negotiative dialogue
Boye, J., & Wirén, M. (2003). In Proceedings of Eurospeech 2003 (Interspeech). Geneva, Switzerland.
PDF
Workshop
Negotiative spoken-dialogue interfaces to databases
Boye, J., & Wirén, M. (2003). TeliaSonera Voice Technologies, Sweden. (Workshop paper; EU/HLT-NICE project IST-2001-35293.)
PDF
Workshop
Coordination of referring expressions in multimodal human-computer dialogue
Skantze, G. (2003). Centre for Speech Technology, Department of Speech, Music and Hearing, KTH Royal Institute of Technology, Stockholm, Sweden.
PDF
Workshop
Contextual reasoning in multimodal dialogue systems: Two case studies
Boye, J., Wirén, M., & Gustafson, J. (2004). TeliaSonera Voice Technologies, Sweden.
PDF
Book Ch.
A model for generalised multi-modal dialogue system output applied to an animated talking head
Beskow, J., Edlund, J., & Nordstrand, M. (2005). In W. Minker, D. Bühler, & L. Dybkjær (Eds.), Spoken Multimodal Human-Computer Dialogue in Mobile Environments. Dordrecht: Kluwer Academic Publishers.
PDF
2002
ISCA WS
Constraint manipulation and visualization in a multimodal dialogue system
Gustafson, J., Bell, L., Boye, J., Edlund, J., & Wirén, M. (2002). In Proceedings of the ISCA Workshop on Multi-Modal Dialogue in Mobile Environments. Kloster Irsee, Germany, June 17–19.
PDF
ISCA WS
GESOM – A model for describing and generating multi-modal output
Edlund, J., Beskow, J., & Nordstrand, M. (2002). In Proceedings of the ISCA Workshop on Multi-Modal Dialogue in Mobile Environments. Kloster Irsee, Germany, June 17–19.
PDF
ISCA WS
Turn-taking gestures and hourglasses in a multi-modal dialogue system
Edlund, J., & Nordstrand, M. (2002). In Proceedings of the ISCA Workshop on Multi-Modal Dialogue in Mobile Environments. Kloster Irsee, Germany, June 17–19.
PDF
Workshop
Specification and realisation of multimodal output in dialogue systems
Beskow, J., Edlund, J., & Nordstrand, M. (2002). Centre for Speech Technology (CTT), KTH Royal Institute of Technology, Stockholm, Sweden.
PDF
2001
NAACL
Real-time handling of fragmented utterances
Bell, L., Boye, J., & Gustafson, J. (2001). In Proceedings of the NAACL 2001 Workshop on Adaptation in Dialogue Systems. Pittsburgh, PA, USA, June.
PDF
2000
ICSLP
AdApt – a multimodal conversational dialogue system in an apartment domain
Gustafson, J., Bell, L., Beskow, J., Boye, J., Carlson, R., Edlund, J., Granström, B., House, D., & Wirén, M. (2000). In Proceedings of the 6th International Conference on Spoken Language Processing (ICSLP 2000), Vol. 2, pp. 1732–1735. Beijing, China, October 16–20.
PDF
Götalog
Modality convergence in a multimodal dialogue system
Bell, L., Boye, J., Gustafson, J., & Wirén, M. (2000). In Proceedings of Götalog 2000: Fourth Workshop on the Semantics and Pragmatics of Dialogue, pp. 29–34. Göteborg, Sweden.
PDF
ICSLP
Positive and negative user feedback in a spoken dialogue corpus
Bell, L., & Gustafson, J. (2000). In Proceedings of the 6th International Conference on Spoken Language Processing (ICSLP 2000), Vol. 1, pp. 589–592. Beijing, China, October 16–20.
PDF
ICSLP
A comparison of disfluency distribution in a unimodal and a multimodal speech interface
Bell, L., Eklund, R., & Gustafson, J. (2000). In Proceedings of the 6th International Conference on Spoken Language Processing (ICSLP 2000), Vol. 3, pp. 626–629. Beijing, China, October 16–20.
PDF
The Team

Researchers

AdApt was a collaborative project at KTH's Centre for Speech Technology, with contributions from Telia Research.

JG
Joakim Gustafson
KTH
LB
Linda Bell
KTH
JoB
Jonas Beskow
KTH
JE
Jens Edlund
KTH
JB
Johan Boye
Telia Research
MW
Mats Wirén
Telia Research