Multimodal Spoken Dialogue
A multimodal conversational dialogue system for browsing apartments in Stockholm — combining natural speech, an animated talking agent, and an interactive map into a single coherent interface.
AdApt was built on a simple but radical idea: to understand how people actually speak to computers, you had to let them speak freely — and then listen carefully to what happened.
Running at the Centre for Speech Technology (CTT) at KTH in collaboration with Telia Research, AdApt set out to build a multimodal conversational dialogue system in the domain of apartment browsing in Stockholm. Users could speak naturally to an animated agent named Urban, point and click on an interactive map, and receive spoken responses accompanied by visual feedback — all in a fluid search for their ideal Stockholm apartment.
The project followed a deliberate empirical methodology: first collect authentic human–computer dialogues through Wizard-of-Oz simulations, analyze the linguistic and interactional patterns that emerge, then use those findings to engineer robust system components. Each iteration brought the system closer to handling the spontaneous, fragmentary, and multimodal nature of real conversation.
AdApt built on earlier KTH dialogue systems — Waxholm, Gulan, and August — but pushed further into negotiative dialogue, constraint manipulation, multimodal output coordination, and real-time handling of fragmented speech. The project ultimately contributed to Joakim Gustafson's doctoral dissertation (2002) and Linda Bell's PhD thesis on linguistic adaptations in spoken human–computer dialogue (2003).
AdApt combined speech recognition, an animated 3D talking head, and an interactive Stockholm map into a single conversational experience. Nine loosely coupled services — spanning Prolog dialogue logic, Tcl/Tk interfaces, and a Java message broker — communicated through XML over TCP sockets.
Figure from Gustafson 2002. A system architecture for a multimodal conversational system — almost identical to AdApt. The I/O Manager is divided into three sub-modules: input fusion (merging multimodal inputs), message handling (coordinating timing and routing), and output fission (decomposing DM output for each output channel).
Users spoke freely in Swedish — no fixed command syntax required. An open microphone (rather than a push-to-talk button) enabled more natural turn-taking. The system used continuous speech recognition (Nuance) and a robust parser to handle incomplete utterances, disfluencies, topicalized phrases, and mixed verbal–gestural input.
The graphical interface centered on a detailed, scrollable map of Stockholm. Apartment icons appeared as colored markers. Users could click icons to select apartments and cross-reference a live table of addresses, sizes, and prices — all while speaking. Mouse clicks and speech were fused in real time by the I/O Manager.
The system's voice was embodied in a 3D animated talking head — a parameterized facial model driven by a TTS engine with gesture rules. Urban displayed listening and thinking expressions in real time, used gaze shifts to signal turn-taking, and showed constraint icons in a thought-bubble to confirm what the system had understood.
Rather than a static dataset, AdApt used a dynamic database scraped from Stockholm real-estate websites, updated six times daily. Structured fields (address, size, price) were augmented with information extracted from free-text descriptions — capturing features like fireplaces, balconies, dishwashers, and cable TV.
Every component registered a service name with the broker on startup — UttPars (parser), DM (dialogue manager), DB (database), NuanceServer (ASR) — and all inter-component traffic flowed through it as XML over TCP on port 2345. Each client declared an input filter (xml_2_term) and output filter (nice_xml) so components could speak native Prolog terms while the wire format remained XML. Two call modes supported synchronous queries (callfunc) and fire-and-forget events (callproc). The main loop ran accept_request_from_client continuously, dispatching incoming messages to registered Prolog callback predicates.
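The broker pattern can be illustrated with a small Python sketch. This is not AdApt's actual Prolog/Java code; the class, the message shape, and the service callback are assumptions standing in for the real XML-over-TCP protocol.

```python
import xml.etree.ElementTree as ET

class Broker:
    """Toy message broker: services register by name, messages route by 'to' attribute."""
    def __init__(self):
        self.services = {}

    def register(self, name, callback):
        # In AdApt, each component registered its service name on startup.
        self.services[name] = callback

    def dispatch(self, xml_message):
        # Parse the XML wire format and route to the named service.
        msg = ET.fromstring(xml_message)
        target = msg.get("to")
        if target not in self.services:
            raise KeyError(f"no service registered as {target!r}")
        return self.services[target](msg)

broker = Broker()
# A stand-in "parser" service: echoes back the utterance text it received.
broker.register("UttPars", lambda msg: f"parsed: {msg.findtext('utterance')}")

reply = broker.dispatch('<message to="UttPars"><utterance>två miljoner</utterance></message>')
```

In the real system the input/output filters (xml_2_term, nice_xml) performed the XML-to-Prolog-term conversion that `ET.fromstring` loosely stands in for here.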
A two-phase robust shallow parser converted Swedish speech-recogniser output into typed semantic expressions. The first pass (hintfinder.pl) extracted over 22 hint types left-to-right — numbers, addresses, room counts, area names, comparative phrases, and discourse markers — merging them with graphical mouse-click events via a parallel graphic_hintfinder.pl pass. The second pass (structurer.pl) assembled the hints into semantic forms and classified each utterance as closing or non_closing. The four expression types and the closing/non-closing decision algorithm are described in depth in the Fragmented Utterances section.
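A toy version of the first-pass hint extraction might look as follows. The pattern set and Swedish surface forms are illustrative, not the actual hintfinder.pl rules.

```python
import re

# Hypothetical hint patterns covering three of the 20+ hint types.
HINT_PATTERNS = [
    ("room_count", re.compile(r"\b(ett|två|tre|fyra|fem)\s+rum\b")),
    ("price",      re.compile(r"\b(\d+(?:,\d+)?)\s+miljoner?\b")),
    ("area_name",  re.compile(r"\b(södermalm|vasastan|östermalm|kungsholmen)\b")),
]

def find_hints(utterance):
    """First pass: scan the utterance and emit (type, surface) hint tuples."""
    hints = []
    lowered = utterance.lower()
    for hint_type, pattern in HINT_PATTERNS:
        for match in pattern.finditer(lowered):
            hints.append((match.start(), hint_type, match.group(1)))
    # Sort by match position to preserve left-to-right order across patterns.
    return [(t, s) for _, t, s in sorted(hints)]

hints = find_hints("Jag vill ha tre rum på Södermalm")
# hints → [("room_count", "tre"), ("area_name", "södermalm")]
```

The real second pass (structurer.pl) would then assemble such hints into one of the four typed semantic expressions and decide closing vs. non-closing.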
State was represented as an immutable triple dialogue_state(Objects, History, Agenda). Each turn passed through an eight-stage context pipeline — from reference resolution through ellipsis expansion, feature-based tagging, and multi-hypothesis voting — culminating in goal-directed action selection. Eight named goals drove the agenda, each decomposing into hierarchical sub-plans. When database searches returned no results, relax_one_constraint/3 iteratively widened soft-constraint bounds. The context resolution machinery is described in depth in the Contextual Reasoning section.
Apartments scraped from Stockholm real-estate websites were pre-parsed into 23-field associative structures covering street name, number, floor, living space, rooms, price, monthly fee, construction year, renovation history, and a coordinate pair (x, y) for map placement. Free-text ad descriptions were mined by a Prolog NL pre-parser to populate list fields such as bathroom (tub, shower…), kitchen (furnished, island…), balcony direction, and amenities. At query time db_select/3 applied relational algebra — filter, project, join — over in-memory Prolog facts. Hard constraints (number of rooms, floor) were never relaxed; soft constraints (price, area, year, coordinates) could be widened iteratively. A WOZ-quality filter ensured all loaded records had valid coordinates and core fields.
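The hard/soft distinction and iterative widening can be sketched in Python. This is a simplified analogue of db_select/3 and relax_one_constraint/3; the field names, records, and widening factor are assumptions.

```python
APARTMENTS = [
    {"street": "Hagagatan",  "rooms": 2, "price": 1_695_000},
    {"street": "Nybrogatan", "rooms": 3, "price": 2_400_000},
]

HARD = {"rooms"}  # hard constraints are never relaxed

def db_select(records, constraints):
    """Filter records; soft numeric constraints are (lo, hi) intervals."""
    def ok(rec):
        for key, val in constraints.items():
            if isinstance(val, tuple):
                lo, hi = val
                if not (lo <= rec[key] <= hi):
                    return False
            elif rec[key] != val:
                return False
        return True
    return [r for r in records if ok(r)]

def relax_one_constraint(constraints, widen=0.25):
    """Widen the first soft interval constraint by a fraction of its span."""
    for key, val in constraints.items():
        if key not in HARD and isinstance(val, tuple):
            lo, hi = val
            pad = (hi - lo) * widen
            return {**constraints, key: (lo - pad, hi + pad)}
    return constraints

query = {"rooms": 3, "price": (1_000_000, 2_000_000)}
hits = db_select(APARTMENTS, query)   # empty: the only 3-roomer costs 2.4M
while not hits:                        # a real system would cap the iterations
    query = relax_one_constraint(query)
    hits = db_select(APARTMENTS, query)
```

After two widening steps the price interval covers 2.4M and the Nybrogatan apartment is returned, while the hard room-count constraint stays fixed.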
The I/O Manager coordinated all multimodal timing: merging speech and mouse-click events before forwarding to the parser, and decomposing Dialogue Manager responses into parallel commands for the Agent, Map Handler, and Icon Handler. Its central innovation was deciding in real time whether a user utterance was complete or needed continuation — a problem explored in full in the Fragmented Utterances section. It also drove Urban’s turn-taking state machine, detailed in the Multimodal Output section.
Urban was driven by a Tcl/Tk engine combining Snack 2.0, TTS 1.5, and GUB 2.1 for gesture animation. Two gesture classes — pre-recorded jog-gestures and procedurally generated proc-gestures — were synchronised to the TTS token stream via a hotspot offset marking each gesture’s peak. Urban’s full turn-taking state machine (five states, six transitions, gesture co-articulation algorithm) is described in the Multimodal Output section.
AdApt was designed as an empirical testbed to investigate fundamental questions about how people interact with multimodal spoken dialogue systems.
How do people make use of different modalities — speech, pointing, clicking — and what are the implications of their choices for system architecture? When and why do users switch between speech and graphical interaction?
How should the system interpret references not only to objects previously mentioned in dialogue, but also to objects currently visible on screen? Deictic and discourse references frequently co-occur in multimodal utterances.
Can multimodality improve the robustness of a spoken dialogue system? Users often switch modality when speech interaction becomes difficult. Can the system exploit this switching behavior to recover from recognition errors?
How does the multimodal setting — mouse, interactive map, animated speaking agent — influence how people speak? Does the presence of graphical referents change utterance structure, vocabulary, and prosody?
Can the system actively steer users’ choice of input modality? If the system consistently uses one modality for output, will users mirror that choice in their input?
How can a system handle the large number of topicalized, incomplete, and fragmented utterances produced in multimodal interaction? What linguistic and contextual cues distinguish a pause mid-utterance from a genuine end-of-turn?
These four findings emerged from the Wizard-of-Oz corpus and subsequent user studies — each is explored in depth in the User Studies section below.
The disfluency rate per word (excluding unfilled pauses) was 6% — high but within literature ranges. Filled pauses and segment prolongations were most frequent, alongside repetitions and truncated words. Individual variation was substantial.
The system’s own output modality directly shaped users’ subsequent input choices. See the User Studies section for the full convergence experiment.
Users produced large numbers of topicalized phrases: "The apartment on Nybrogatan — how much is that?" Pause lengths between fragments varied systematically: avg. 0.5 s after a direct question, up to 3 s after a list presentation.
Linda Bell’s PhD research (2003) documented systematic accommodation — users adjusted vocabulary, syntax, and prosody to mirror the system’s own language, a pattern analysed in depth in her dissertation.
Before the fully automated system existed, a simulated version was used to collect naturalistic human–computer dialogue data. A human wizard operated the system's dialogue management and response generation behind the scenes, giving subjects the impression of interacting with a working system. The resulting corpus — 33 subjects, 50 dialogues, 1,845 utterances — became the empirical basis for three independent studies on disfluency, modality convergence, and user feedback. (Gustafson, Bell, Beskow, Boye, Carlson, Edlund, Granström, House & Wirén, 2000.)
The wizard had access to a frozen snapshot of the live apartment database and selected responses from a menu of pre-formulated templates — verbal descriptions, map highlights, and table entries — covering the most common dialogue situations. The system automatically displayed a “thinking” gesture on Urban’s face as soon as speech was detected, so the wizard could take one to two seconds to select the right template without users noticing the human-in-the-loop.
To avoid verbal biasing — where written task instructions cause subjects to repeat the exact words used in the scenario — tasks were presented as pictorial scenarios. Each scenario showed a shaded area on the Stockholm map, a timeline indicating a preferred construction year range, room-count bars, and photographs of interior features (bathroom fittings, tiled stoves) rather than naming them explicitly. Subjects were told only to “find apartments you are interested in.”
Does adding a visual channel make people speak more fluently? A representative subset of the AdApt WOZ corpus (16 speakers, 847 utterances from the full 33-subject, 1,845-utterance collection) was compared against a unimodal telephone corpus of the same size. Every disfluency — filled pauses, repairs, prolongations, truncations — was hand-annotated.
The multimodal interface produced significantly fewer disfluencies across all categories (p < 0.001 for repairs and prolongations). Individual speaker variation was the strongest predictor of disfluency rate — stronger than gender, task type, or utterance length. The visual display appears to reduce working-memory load, allowing users to plan utterances more fluently.
Does the system’s output modality influence how users choose to refer to apartments? 16 subjects each used two simulated systems in counterbalanced order: System G (highlighted apartment icons graphically) and System S (referred to apartments by colour name verbally).
Weak convergence was supported: users adopted new referential behaviours while retaining old ones. Strong convergence was not: users did not abandon previously adopted strategies. Instead they amplified behaviours from the first dialogue into the second — a form of entrainment that persisted across system changes.
Post-session interviews revealed that users perceived graphical input as less efficient than speech — yet 94% used it anyway, suggesting the visual interface actively shaped referential choices regardless of perceived efficiency.
Users naturally produce feedback cues — “yes”, “okay”, “no”, “not really” — even when the system neither requests nor explicitly acknowledges them. All 1,845 utterances in the WOZ corpus were annotated along three dimensions: valence (positive/negative), explicitness (explicit/implicit), and function (attention/attitude).
Individual variation was extreme: feedback rates ranged from 0% to 70% across subjects. Negative feedback — especially attitude feedback — was found to carry actionable information: it signals error recovery opportunities, preference corrections, and moments where a clarification subdialogue should be initiated. 94% of feedback occurred within longer user turns rather than as standalone utterances.
Building AdApt required novel solutions across speech processing, dialogue management, and multimodal output generation.
A new system architecture with a dedicated I/O handler allowed AdApt to dynamically judge whether a user utterance was complete or required waiting for continuation — a critical capability given the high prevalence of topicalized and fragmented input.
A robust parser handled ellipsis, anaphora, out-of-vocabulary words, and mixed speech–mouse input. A mouse click combined with "How much is it?" was interpreted as a complete, fully specified query based on the current dialogue context.
Agent Urban used real-time facial feedback: a "listening" expression while the user spoke, and a "thinking" gesture the moment silence was detected. Combined with visual hourglass icons, this reduced user uncertainty about system state and kept turn-taking smooth.
Users could express and refine apartment search constraints incrementally across turns. The system tracked active constraints, visualized their effects on the map in real time, and resolved conflicts through negotiative sub-dialogues with the user.
A GESOM-based model specified and coordinated multimodal output — deciding what to convey verbally, what to display on the map, and what to show in the table, while ensuring consistency across modalities and appropriate prosodic marking of new information.
The dialogue manager maintained a rich context model for resolving cross-turn references, handling follow-up questions, and managing the interplay between comparison, browsing, and information-seeking sub-tasks — all within a single coherent dialogue session.
A central challenge in conversational database browsing is that the system's internal state — which constraints it is currently using, whether it has relaxed any of them — is invisible to the user. AdApt addressed this through a real-time constraint visualization strategy: every constraint the system understood was immediately rendered as a symbolic icon on screen, giving users a continuous, graspable picture of the system's current search query. (Gustafson, Bell, Boye, Edlund & Wirén, 2002.)
Immediately after the user finishes speaking, the system responds by displaying icons that represent each information unit it successfully recognised and parsed. This replaces the verbose verbal confirmation prompts ("You said you wanted a two-room apartment…") that make speech-only systems feel slow and unnatural.
Crucially, the system shows two distinct things: the thought balloon above Urban's head displays what was just said in the current turn, while the persistent Constraints panel shows the system's actual inner state — including any automatic relaxations. In the example above, the user said "two million" but the system is actually searching the range 1.5–2.5M. Without the icon panel, this silent relaxation would be entirely invisible.
The constraint icons are also interactive. A user can remove any constraint by dragging its icon to the trashcan, or multimodally by pointing at it and saying "forget this" or "forget about the balcony". This turns the icon layer into a direct manipulation interface for query editing — without requiring the user to remember or re-state earlier turns.
A key design question was how abstract or concrete the constraint icons should be. Abstract icons have a shorter visual form but a steeper learning curve; concrete icons are immediately recognisable but can feel cluttered. The team evaluated three levels of abstraction for each apartment attribute — from pure geometric symbols (top row) to semi-realistic pictograms (middle) to photographic-style illustrations (bottom).
In the end, AdApt opted for the more concrete level. User studies showed that abstract icons were harder to interpret during short sessions, whereas concrete icons were picked up intuitively — consistent with the "idiomatic paradigm" in icon design, where users unconsciously learn symbol meanings through repeated exposure during naturalistic tasks.
Each icon could also carry a small lock button. By clicking the lock, the user explicitly marked a constraint as necessary — instructing the system not to relax it regardless of the relaxation strategy in use. This gave advanced users direct control over the boundary between hard and soft constraints.
The Icon Handler module translated each drag-and-drop or verbal reference to an icon into a semantic operation — retraction, modification, or locking — and sent it as an XML message to the I/O Manager, which merged it with any concurrent speech input before forwarding to the parser.
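A minimal sketch of that translation step follows; the operation names and XML message shape are assumptions, not the documented protocol.

```python
def icon_action_to_operation(action, constraint):
    """Map a GUI action on a constraint icon to a semantic operation message."""
    ops = {
        "drag_to_trash": "retract",   # icon dragged to the trashcan
        "click_lock":    "lock",      # lock button: mark constraint as hard
        "say_forget":    "retract",   # "forget this" while pointing at the icon
    }
    op = ops[action]
    # Serialised as XML for the I/O Manager, like all inter-component traffic.
    return f'<operation type="{op}" constraint="{constraint}"/>'

msg = icon_action_to_operation("drag_to_trash", "balcony")
```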
An open microphone, a multimodal interface, and a system that encourages mixed initiative all conspire to produce the same effect: users speak in fragments, pausing mid-utterance to think, to gesture at the screen, or to react to what the system has just shown them. AdApt needed to decide in real time — at every silence — whether the user had finished speaking or was about to say more. (Bell, Boye & Gustafson, 2001.)
Spontaneous spoken dialogue is full of pauses that fall inside an utterance rather than at its end. A user might say "I would like a… [pause] …three-room apartment… [pause] …on Södermalm or… [pause] …Vasastan". Each silence looks like end-of-turn from the recogniser's perspective, yet acting on the first fragment would produce a wrong or premature system response.
Analysis of 800 utterances from the Wizard-of-Oz corpus revealed the scale of the problem: while 60% of utterances contained a single, unambiguously complete fragment, one third of all utterances contained at least one non-closing fragment followed by more input. Average pause length was around 1 second for both closing and non-closing pauses, making duration alone an unreliable signal: 90% of closing pauses were 2.5 seconds or shorter, but non-closing pauses ranged up to 3.5 seconds.
A further complication: 18% of all utterances began with positive or negative feedback on the system's previous turn ("Yes… ehh, is there one with a stuccoed ceiling?"). In isolation the "yes" would look like a complete acknowledgement; in context it is the preamble to a question.
wh(X, P) — find X with property P. "How much does the apartment cost?" → always treated as closing.
yn(X, P) — yes/no question about X. "Does it have a balcony?" → always treated as closing.
frag(X, P) — a definite NP fragment referring to an object. "The apartment on Hagagatan…" → closing or non-closing depending on dialogue context.
ack(T) — acknowledgement (positive, neutral, negative). "Yes…" → closing only when the DM has just asked a yes/no question.
After each system turn the Dialogue Manager sends the I/O Manager (IOM) a list of closing patterns — utterance types that should be treated as complete given the current dialogue state. For instance, if the DM has just asked "How much are you willing to pay?", then frag(money-X,[]) is added to the closing list, so that a bare price fragment like "Two million" will be recognised as a complete answer.
When speech is detected, the IOM calls the parser and receives a semantic expression S tagged as closing or non-closing. If S is closing, either intrinsically or because its type matches one of the DM's current closing patterns, it is forwarded to the DM; otherwise the IOM waits, merges the next fragment into S, and re-evaluates.
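This decision can be sketched in Python. The class, method names, and string-typed expressions below are illustrative simplifications, not AdApt's actual code.

```python
class IOManager:
    """Sketch of the IOM's closing/non-closing decision."""
    def __init__(self, send_to_dm):
        self.send_to_dm = send_to_dm
        self.closing_patterns = set()   # e.g. {"frag:money"} after a price question
        self.pending = []               # non-closing fragments awaiting continuation

    def set_closing_patterns(self, patterns):
        # Updated by the DM after every system turn.
        self.closing_patterns = set(patterns)

    def on_parse(self, expr_type, expr, closing):
        # A fragment also closes if the DM declared its type a closing pattern.
        if closing or expr_type in self.closing_patterns:
            combined = self.pending + [expr]
            self.pending = []
            self.send_to_dm(combined)
            return "sent"
        self.pending.append(expr)
        return "waiting"

responses = []
iom = IOManager(responses.append)

# DM has just asked "How much are you willing to pay?": bare price fragments close.
iom.set_closing_patterns({"frag:money"})
first = iom.on_parse("frag:money", "two million", closing=False)

# Without that pattern, a bare NP fragment makes the IOM wait for a continuation.
iom.set_closing_patterns(set())
iom.on_parse("frag:apartment", "the apartment on Hagagatan", closing=False)
iom.on_parse("wh:money", "how much is it", closing=True)
```

In the first exchange the price fragment is forwarded immediately; in the second, the NP fragment is held back and sent together with the follow-up question.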
A second decision point arises while the DM is computing a response: if new user input arrives before the response has been sent to Urban, the IOM checks whether the combined expression S₂ is closing and different from the original S. If so, S₂ is sent to the DM to generate a new response R₂ — the earlier response is discarded. This allows the system to handle utterances like "I want to look at apartments in the Old Town… with two bedrooms" coherently even when the fragments straddle the system's response computation window.
Worked example — "The apartment on Hagagatan… [pause] …how much does it cost?":
Fragment 1 is parsed as frag(apartment-X, [street_name=hagagatan, ref(apartment-X, definite_np)]) — non-closing, so the IOM waits.
Fragment 2 is appended, and the combined parse becomes wh(money-Y, [price=Y, street_name=hagagatan, ref(apartment-X, definite_np)]) — closing, so it is sent to the DM.
The DM answers: "The apartment on Hagagatan costs 1,695,000 crowns."
How should a multimodal dialogue system interpret "the green apartment", "What about the red one?", or "Is there anything cheaper?" — expressions that only make sense in light of what has been shown, said, and done in previous turns? This paper by Boye, Wirén & Gustafson (2004) presents a single, uniform mechanism — typed combinators resolved by β-reduction — validated across two radically different dialogue domains: the AdApt apartment browser and the NICE fairy-tale game ↗.
Objects currently visible on screen. In AdApt: apartments whose coloured icons are shown on the map. In NICE: all objects in the room — shelf, machine, slots, and their labelled properties — kept constant for the scene. Grounds deictic expressions like "the green apartment" and "the slot furthest away".
A recency-ordered list of resolved utterances and system turns. Used to resolve pronouns ("it"), ellipses ("What about the red one?"), and comparative references ("Is there anything cheaper?" — referring to the price of the previously discussed apartment). The most recently type-compatible antecedent wins.
A typed object hierarchy defining which attributes belong to which types. In NICE: picking up an object is a pickUp(agent, patient:thing) action. This prevents "it" from being wrongly resolved to the machine when the hammer was last picked up — the type constraint eliminates impossible antecedents.
The key contribution of the paper is that pronouns, definite descriptions, ellipses, and comparative references can all be resolved by the same formal operation: β-reduction in lambda calculus. User utterances are parsed into typed combinators — lambda expressions over the domain model. Resolving a reference amounts to applying the expression to the correct antecedent object found in the visual or dialogue context.
For example, utterance A5 — "I see… The green apartment… how much does it cost?" — is represented as a typed lambda expression, λx:apartment ?p:money (x.price = p), whose argument slot must be filled by an apartment drawn from the context.
The system finds apt1 (the green apartment) as the antecedent and applies the expression via β-reduction, yielding "Give me the price of apt1" — a direct database query.
The subsequent elliptic question A7 — "What about the red one?" — is handled by reverse functional application: the system extracts a higher-order function from the resolved representation of A5 (λy:apartment ?p:money (y.price = p)) and applies it to the red apartment. The same β-reduction step produces a new query without any special-case ellipsis handling.
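The resolution-by-application idea can be sketched in Python, with plain callables standing in for typed combinators and dictionaries for domain objects. All names and representations here are illustrative; the real system used typed lambda expressions over a Prolog domain model.

```python
# Visual context: apartments whose icons are currently on the map.
apt_green = {"id": "apt1", "colour": "green", "price": 1_695_000}
apt_red   = {"id": "apt2", "colour": "red",   "price": 2_100_000}
visual_context = [apt_green, apt_red]

# A5 "how much does it cost?" as a function over apartments: λy. ?p (y.price = p)
price_of = lambda y: ("price_query", y["id"], y["price"])

def resolve(description, context):
    """Find the most recent antecedent matching an attribute description."""
    for obj in reversed(context):          # recency-ordered: latest first
        if all(obj.get(k) == v for k, v in description.items()):
            return obj
    raise LookupError("no antecedent")

# Beta-reduction: apply the combinator to the resolved antecedent.
a5 = price_of(resolve({"colour": "green"}, visual_context))

# A7 "What about the red one?": re-apply the same extracted function.
a7 = price_of(resolve({"colour": "red"}, visual_context))
```

The ellipsis in A7 needs no special machinery: the function extracted from A5 is simply applied to a new antecedent.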
The two domains stress the resolution mechanism in different ways. In AdApt, the visual context changes continuously: each new search produces a new set of coloured apartment icons, replacing the old set. The system tracks a visual context history — a recency-ordered list of icon sets — but in practice the current set provides almost all necessary antecedents. Users rarely refer back to apartments no longer on screen.
In NICE, the visual context is fixed for the duration of a scene — all objects in the room are always salient — but past events become important. The utterance "Where we put the magic wand… there you can put it" refers to a slot identified by a previous physical action, not by anything currently visible. An event history (planned for future work at the time of the paper) is needed to resolve such references.
Both systems use the same recency principle for disambiguation: when a reference is type-compatible with multiple antecedents, the system prefers the one that appeared most recently in the dialogue or visual context. In AdApt this is almost always sufficient; in NICE, domain model constraints do additional work ruling out impossible resolutions.
The NICE project continued beyond AdApt, exploring fairy-tale dialogue with children in public museum settings. Visit the NICE project page ↗ for system demonstrations, collected corpora, and further publications.
A1. User: Are there any two-room apartments on the South Side that cost less than 2 million?
A3. User: A balcony would be nice. [elliptic preference]
A5. User: I see… The green apartment… how much does it cost? [deictic + fragmented]
A7. User: What about the red one? [elliptic question — same question type, new referent]
A9. User: Okay… Is there anything cheaper? [comparative — price of previously discussed apt]
N1. User: I want you to go to the shelf.
N3. User: I want you to pick up the bag.
N5. User: Yes, pick up the sachet. [clicks on money sachet — graphical + verbal]
N7. User: Then I want you to go to the slots.
N9. User: Now I want you to put the money sachet in the farthest slot. [spatial reference in 3D]
Coordinating speech, facial expressions, gaze, head movements and map updates across a single dialogue turn is a hard engineering problem. This paper by Beskow, Edlund & Nordstrand (2002) presents the XML-based output specification layer that sits between AdApt's Dialogue Manager and its animated agent — letting the DM declare communicative intent without knowing anything about which gestures are available or how they are rendered.
A naive design forces the Dialogue Manager to know which specific gestures the animated agent can perform — tying the DM to a particular output device and making the system hard to port. The solution is an XML-based output specification that describes communicative functions (emphasis, acknowledgement, listening) without specifying their physical realisation. The Agent module then looks up the appropriate gesture from a library and executes it.
This separation has two benefits. First, the same DM can drive radically different output hardware — a 3D animated agent, a simple hourglass icon, or a blinking lamp — by swapping only the Agent module and gesture library. Second, once gesture realisations have been implemented they can be re-used across dialogue systems and domains without touching the DM at all.
Rather than waiting for the user to finish speaking before reacting, AdApt gives Urban something to do at every stage of processing. Five events trigger feedback during each turn:
Finite-duration signals triggered at a specific moment. The gesture fires, completes, and leaves Urban in exactly the same pose as before. Examples:
has_heard — fired on Recognition done; a head nod or eyebrow flash signals that the recogniser has processed the utterance.
emphasis — fired mid-sentence on a stressed word; an eyebrow raise synchronised to the TTS stroke.
Arbitrary-duration behaviour that Urban sustains until a new state is entered. Five states in AdApt:
idle — default waiting posture.
listening — attentive expression while the user speaks.
continued attention — non-closing fragment; slight lean, eyebrows raised.
busy — DM computing; gaze averted, "thinking" expression.
talking — lip-synced speech with sustain blinks.
Each state has enter, sustain, and exit gesture segments. Enter and exit are performed once; sustain gestures loop at random intervals for naturalness.
The DM emits a single <response> XML element containing the spoken text interleaved with <state> and <event> tags. A background attribute sets global context (positive, negative, neutral) that influences gesture selection weights across the whole response.
Tags are TTS-synchronised: a <state> tag takes effect when the first stressed vowel of the following word is spoken. This means gesture timing is driven by the prosodic structure of the spoken output rather than by wall-clock delays.
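An illustrative response specification is shown below. The element names <response>, <state>, <event> and the background attribute come from the paper; the exact attribute syntax and text content are assumptions.

```xml
<response background="positive">
  <state name="talking"/>
  There are <event name="emphasis"/>three apartments on Södermalm.
  <state name="idle"/>
</response>
```

The Agent module would enter the talking state as the utterance begins, fire the emphasis gesture so its stroke lands on the stressed vowel of "three", and return to idle when speech ends.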
Treating each gesture as independent produces unnatural movement: after a head nod the agent would snap back to neutral before beginning a gaze shift. A co-articulation algorithm merges overlapping and adjacent gestures. It preserves the area around each gesture's stroke (the moment of maximum expressiveness) while reducing the approach and recovery segments and interpolating them with a smooth spline curve — the same principle used in articulatory co-articulation in human speech.
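The effect of the merge can be sketched numerically. Here a linear blend stands in for the spline interpolation, and trajectories are short lists of a single facial parameter; segment lengths are illustrative.

```python
def coarticulate(gesture_a, gesture_b, blend=3):
    """Merge two adjacent gesture trajectories (lists of parameter values).

    Keeps both trajectories around their strokes but replaces gesture A's
    recovery tail and gesture B's approach head with an interpolated bridge,
    so the face never snaps back to neutral between gestures.
    """
    keep_a = gesture_a[:-blend] if len(gesture_a) > blend else gesture_a[:1]
    keep_b = gesture_b[blend:] if len(gesture_b) > blend else gesture_b[-1:]
    start, end = keep_a[-1], keep_b[0]
    bridge = [start + (end - start) * i / (blend + 1) for i in range(1, blend + 1)]
    return keep_a + bridge + keep_b

# A head-nod recovering to neutral, followed by an eyebrow gesture rising from
# neutral: the merge skips the return-to-neutral dip between them.
merged = coarticulate([0, 5, 10, 5, 0], [0, 4, 8, 4, 0], blend=2)
```

The strokes (values 10 and 8) survive intact, while the intervening approach and recovery segments are replaced by a smooth transition.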
Gesture realisations are selected by weighted random choice within each semantic group, ensuring Urban never performs exactly the same sequence twice. The background variable shifts the weights — an emphatic nod is de-weighted for negative-valence responses where an eyebrow raise is more appropriate.
Rather than a fixed “thinking face” between user turns and system replies, Urban moves through a principled state machine whose transitions are driven directly by the processing pipeline. The diagram shows five distinct facial states and six labelled transitions that together cover every possible dialogue moment.
The cycle begins in Attention — a neutral, forward-looking posture that signals readiness. The moment the speech recogniser detects voice, Urban shifts to End-of-speech detection, a subtly more focused expression that confirms the system is actively listening. When silence is detected the parser evaluates the fragment: a closing parse sends Urban into the busy ("thinking") state while the Dialogue Manager computes a response, whereas a non-closing parse triggers continued attention, signalling that more speech is expected.
Every transition is triggered by a real processing event — not a timer or a heuristic — which means Urban’s behaviour is always an accurate reflection of the system’s internal state. User studies confirmed this approach reduced uncertainty about whether the system had understood and was responding, making dialogues feel markedly more natural than a push-to-talk baseline.
Each event and state maps to a set of alternative gesture realisations. Gestures in a group share the same semantic meaning but differ in style or duration — a head nod vs. an eye-widening for emphasis. Each is given a weight; the agent picks by weighted random draw, avoiding repetitive behaviour.
State gestures are divided into enter, sustain, and exit segments. Enter and exit are performed once on transition; sustain gestures loop at random intervals for as long as the state persists. Enter/exit pairs are chosen together so the exit gesture correctly undoes the parameter changes made by the enter gesture.
Every gesture realisation records a time offset to stroke — the point of peak expressiveness (e.g. the apex of a head nod). This offset is used to align the gesture's stroke with the first stressed vowel of the corresponding TTS word, producing authentic prosodic co-expression between speech and face.
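A sketch of weighted gesture selection and stroke alignment follows; the group contents, weights, and offsets are invented for illustration.

```python
import random

# Hypothetical gesture group: alternative realisations of the same meaning.
# Each carries a selection weight and a time offset from onset to stroke.
EMPHASIS_GROUP = [
    {"name": "head_nod",      "weight": 3, "stroke_offset": 0.18},  # seconds
    {"name": "eyebrow_raise", "weight": 2, "stroke_offset": 0.12},
    {"name": "eye_widening",  "weight": 1, "stroke_offset": 0.10},
]

def pick_gesture(group, rng=random):
    """Weighted random draw within a semantic group, avoiding repetitiveness."""
    weights = [g["weight"] for g in group]
    return rng.choices(group, weights=weights, k=1)[0]

def schedule(gesture, stressed_vowel_time):
    """Start the gesture early enough that its stroke lands on the stressed vowel."""
    return stressed_vowel_time - gesture["stroke_offset"]

g = pick_gesture(EMPHASIS_GROUP)
start = schedule(g, stressed_vowel_time=2.40)
```

A background-valence variable would be implemented here by scaling the weights before the draw, shifting which realisations are likely without excluding any.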
Watch a live demonstration of the AdApt system — a user browsing Stockholm apartments through natural speech and interactive map input, with agent Urban responding in real time.
AdApt system demonstration. The user browses apartments in Stockholm using natural spoken Swedish, map clicks, and constraint manipulation — all handled in real time by the distributed dialogue architecture.
Papers published by the AdApt team spanning system architecture, corpus linguistics, multimodal interaction, and user evaluations.
AdApt was a collaborative project at KTH's Centre for Speech Technology, with contributions from Telia Research.