A world-class shared research infrastructure for generative AI, human-machine interaction, and robotics - jointly operated by the Departments of Speech, Music & Hearing (TMH) and Robotics, Perception & Learning (RPL) at KTH.
End-to-end experimental infrastructure for human-robot interaction, from motion-capture data collection to real-world deployment.
Hands-on environment for MSc and PhD students to work with professional-grade hardware, capture systems, and humanoid robots.
Open to industrial partners for joint R&D. Past collaborators include Electrolux and Akademiska Hus. A proven high-profile venue.
Shared booking across KTH schools and departments. CloudGripper opens remote robot access to researchers worldwide.
KTH IRL is a one-of-a-kind shared facility bringing together world-leading expertise in spoken dialogue, computer vision, robotics, and machine learning — all under one roof on the KTH main campus in Stockholm.
The lab is jointly operated by the Department of Speech, Music and Hearing (TMH) and the Department of Robotics, Perception and Learning (RPL) — two of KTH's most research-intensive units within the EECS School. Together they house over 25 faculty members, approximately 25 postdocs, and close to 100 PhD students, making KTH IRL one of Europe's densest concentrations of expertise in social robotics, spoken dialogue, computer vision, and robot learning.
Located at Lindstedtsvägen 24, KTH IRL provides purpose-built spaces for the full research cycle: from motion-capture data collection and multimodal interaction studies, through AI model development, to deployment and evaluation with real users in domestic-like settings.
The facility has hosted landmark events — including the announcement of Sweden's national AI commission by Prime Minister Ulf Kristersson in December 2023 — and has attracted major external funding from Promobilia, Vinnova, Digital Futures, WASP, Vetenskapsrådet, and the European Research Council.
Lindstedtsvägen 24, KTH Main Campus, Stockholm. Ground floor, open layout with six dedicated research environments.
GPU server clusters, three dedicated control rooms with cooling and 100 Gb/s wired networking, and professional AV capture throughout.
Humanoid robots (Furhat, Softbank Pepper, PAL ARI, Rainbow HRN-Y1, Unitree H1), Boston Dynamics Spot, dual-arm manipulators, UAVs, social robots.
Collaboratively managed by TMH and RPL, KTH EECS School. Shared booking open to all KTH researchers.
KTH IRL hosts six purpose-built research environments — from an AI-enabled smart kitchen and a professional motion-capture studio, to an audience-research theatre, humanoid robot hall, aerial robotics workshop, and cloud-connected manipulation lab.
The primary development and testing environment for robotic manipulation and full-body autonomy research. The lab houses a growing fleet of humanoid robots and dual-arm platforms for dexterous manipulation, loco-manipulation, and physical human-robot collaboration studies.
Through the SAInt project (Promobilia), the fleet has been expanded with Rainbow Robotics HRN-Y1 and Unitree H1 humanoid robots. Over the SAInt project period, the fleet will grow to up to five full humanoid platforms, both legged and wheeled, enabling parallel research threads across perception, planning, and adaptive control.
The lab also maintains a collection of social robots and expressive robot heads used in dialogue and HRI research by the TMH department.
SAInt is the primary driver of the humanoid robot fleet expansion. The project develops robots that understand context, learn from interaction, and provide proactive physical and verbal assistance — targeting independent living for older adults and people with special needs. The Humanoid Robot Lab is where hardware development, manipulation learning, and physical collaboration research take place across WP2 and WP3.
Legged and wheeled humanoid platforms from leading vendors run in parallel, enabling simultaneous development across multimodal scene representation (Kragic, WP2) and safe compliant physical interaction (Jaquier, WP3).
Led by Noémie Jaquier (WASP Assistant Professor, RPL), the GeoRob Lab develops data-efficient robot learning, optimization, and control algorithms with sound theoretical guarantees. Research treats differential geometry and physics as core inductive biases — enabling robots to generalise from fewer demonstrations and operate reliably under real-world constraints. The lab was recently awarded a Swedish Research Council Starting Grant (2025).
Funded by Knut och Alice Wallenbergs Stiftelse, this decade-long programme develops robotic systems that interpret their surroundings like human senses and learn through interaction with people and the environment. The goal is robots that can perform a wide range of tasks in homes and complex real-world settings — handling unpredictable, non-repetitive situations that are impossible to pre-program. Key advances include multimodal AI models processing sound, images, and force data simultaneously; learning from human demonstration and VR teleoperation; and manipulation of soft and deformable objects such as textiles, clothing, and groceries.
"Now we can train robots instead of programming robots, which makes it much easier to find solutions to environments where things change."
A ten-year Swedish Research Council Distinguished Professor grant developing new self-supervised and meta-learning methodologies with causal reasoning for perception, control, and reasoning in robotics. The project targets robots capable of complex interactions with both rigid and deformable objects and humans in unstructured, real-world environments — moving beyond carefully structured lab settings to handle the open challenges of limited data, unknown unknowns, and transfer across tasks.
A five-year ERC Advanced Grant (€2.4M) addressing one of the core open challenges in robotics: enabling machines to interact with deformable objects as naturally as humans do. BIRD created new informative and compact representations of deformable objects that combine analytical and learning-based approaches, encoding geometric, topological, and physical properties of the robot, object, and environment. Research focused on multimodal, bimanual interaction tasks, combining theoretical methods with rigorous experimental evaluation to model skilled sensorimotor behaviour in dual-arm robot systems.
The IA-Lab is a fully-equipped, sensor-rich smart home kitchen built in collaboration with Akademiska Hus and Electrolux. Designed to resemble a real domestic environment, it bridges the gap between laboratory AI research and everyday life.
The space features a fully functioning Electrolux kitchen, an adjacent living room area (convertible to a warehouse layout for robot picking tasks), and a dedicated control room with GPU servers and professional video capture equipment — all connected on a 100 Gb/s wired network backbone.
Research focus areas include AI and robot-supported cooking, human cooking behaviour studies in partnership with Electrolux, and zero-waste cooking through AI assistance.
The IA-Lab kitchen is the primary human-subject testing environment for SAInt. All five interaction scenarios — from passive observation to full robot-human collaboration — are designed around realistic cooking and domestic tasks performed here. The lab's sensor array (overhead cameras, Meta Aria glasses, embedded displays, smart appliances) provides the rich multimodal data streams needed for each work package.
Five scenarios of increasing complexity run in this kitchen: Observer → Apprentice → Instructor → Teacher → Collaborator, building from passive task observation to full shared autonomy between human and robot.
PerCorSo designs the most appropriate ways for robots to behave in human-crowded environments. Its novelty lies in integrating spatial and social context understanding, multimodal communication, and autonomous motion strategies to advance real-world social robots. The project addresses trust in autonomous robots in two complementary senses: verifiability (provably safe systems) and social acceptability (perceived as safe and trustworthy by humans). The IA-Lab kitchen is central to the project's real-world validation — building from controlled lab scenarios toward deployment in environments such as elderly care, autonomous driving, and human-robot coworking.
A multimodal dataset for referential expression grounding collected in the IA-Lab kitchen. Participants wore Meta Aria smart glasses for first-person gaze-tracked video while a GoPro captured the exocentric view — synchronising eye tracking, speech, and spatial grounding during live cooking tasks.
Each recording pairs egocentric Aria video (30 fps) with exocentric GoPro footage, real-time gaze coordinates, 48kHz audio, and word-level transcription via WhisperX. Designed for referential expression grounding, gaze–speech synchronisation, and embodied dialogue research. Created by Anna Deichler (KTH); presented at NeurIPS 2025 SpaVLE Workshop.
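Working with such a corpus typically means aligning gaze samples (recorded at the glasses' own rate) with video frames. The sketch below shows one common way to do this, assuming timestamped `(t, x, y)` gaze tuples and 30 fps video; the function names and the 50 ms gap tolerance are illustrative choices, not part of the released dataset's tooling.

```python
from bisect import bisect_left

def nearest_gaze(gaze, t, max_gap=0.05):
    """Return the gaze sample closest in time to t, or None if the
    nearest sample is further than max_gap seconds away.
    `gaze` is a non-empty list of (timestamp, x, y) tuples sorted by timestamp."""
    times = [g[0] for g in gaze]
    i = bisect_left(times, t)
    # Only the neighbours around the insertion point can be closest.
    candidates = gaze[max(i - 1, 0):i] + gaze[i:i + 1]
    best = min(candidates, key=lambda g: abs(g[0] - t))
    return best if abs(best[0] - t) <= max_gap else None

def align_gaze_to_frames(gaze, n_frames, fps=30.0):
    """Map each video frame index to its nearest gaze sample (or None)."""
    return [nearest_gaze(gaze, k / fps) for k in range(n_frames)]
```

The same nearest-neighbour pattern extends to aligning word-level transcript timestamps with frames.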
A Vinnova-funded project (Swedish: Prata mat) developing a proactive conversational AI cooking assistant for elderly users. In Wizard-of-Oz experiments in the IA-Lab smart kitchen, six senior participants (ages 63–66) compared two AI chef personas, an Instructional variant and a Chatty variant; the Chatty AI Chef was perceived as more situationally aware and intelligent. The project addressed food waste reduction and independent living for older adults.
A Digital Futures postdoctoral fellowship (June 2022–June 2024) developing robots capable of lifelong personalised dialogue — learning and recalling a person’s attributes, preferences, and shared history over long time horizons. Research addressed how foundation models and LLMs can enable open-domain conversation that adapts to individual elderly users, supporting daily reminders, collaborative tasks, and social engagement.
Deep learning models for predicting turn-taking in spoken interaction, identifying when speakers will yield or hold the floor. The project produced two key models: TurnGPT (language-model-based, trained on transcripts with speaker-shift markers) and Voice Activity Projection (VAP) (audio-based, trained on ~2,000 hours of telephone conversations, preserving acoustic features such as intonation and pausing). HRI experiments used the Furhat social robot, made by a KTH spin-off company co-founded by PI Gabriel Skantze and in business since 2014; driven by these models, the robot interrupted less often and responded faster in face-to-face dialogue. The research also showed that turn-taking cues are largely language-specific across three language families.
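The idea behind voice-activity projection can be illustrated with a toy decision rule: given per-speaker probabilities of being active in successive future time bins, compare the two speakers' projected activity later in the horizon. This is a deliberately simplified illustration; the actual VAP model predicts a learned discrete vocabulary of projection windows, and all names and numbers below are assumptions for the sketch.

```python
def shift_probability(p_current, p_other, horizon=(2, 8)):
    """Toy estimate of an upcoming turn shift. p_current / p_other are
    per-bin probabilities (0..1) that each speaker is active in successive
    future windows; we average each speaker's projected activity over the
    later part of the horizon and normalise. Illustrative only."""
    lo, hi = horizon
    cur = sum(p_current[lo:hi]) / (hi - lo)
    oth = sum(p_other[lo:hi]) / (hi - lo)
    return oth / (cur + oth) if (cur + oth) > 0 else 0.5

# The current speaker trails off while the listener is projected to start:
p_cur = [0.9, 0.8, 0.3, 0.2, 0.1, 0.1, 0.1, 0.1]
p_oth = [0.1, 0.1, 0.4, 0.6, 0.8, 0.9, 0.9, 0.9]
```

Here `shift_probability(p_cur, p_oth)` comes out well above 0.5, i.e. a turn shift is projected, which is the signal a robot can use to respond promptly without interrupting.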
PMIL is a professional full-body motion capture studio and the primary facility for capturing high-fidelity human movement, gesture, dance, and multimodal interaction data for AI training and HRI research.
The studio features 25+ Optitrack infrared tracking cameras mounted on the ceiling and walls, providing sub-millimetre accuracy across the full capture volume. A Peel Capture system and Tentacle timecode devices synchronize all recording modalities into a unified data stream.
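Timecode-based synchronisation of this kind boils down to stamping every recording with a shared clock and computing per-stream offsets afterwards. A minimal sketch, assuming non-drop-frame SMPTE timecode of the form `HH:MM:SS:FF` (the function names are illustrative, not part of the Peel Capture or Tentacle tooling):

```python
def timecode_to_seconds(tc, fps=25):
    """Convert a non-drop-frame SMPTE timecode 'HH:MM:SS:FF' to seconds."""
    hh, mm, ss, ff = (int(p) for p in tc.split(":"))
    return hh * 3600 + mm * 60 + ss + ff / fps

def stream_offset(tc_a, tc_b, fps=25):
    """Offset in seconds to shift stream B so it lines up with stream A,
    given the jam-synced timecode stamped at the start of each recording."""
    return timecode_to_seconds(tc_a, fps) - timecode_to_seconds(tc_b, fps)
```

With every device jam-synced to the same timecode source, aligning any two modalities reduces to trimming one stream by `stream_offset` seconds.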
PMIL is collaboratively managed: there is no dedicated technician, so users who book the space are responsible for following the usage guidelines. The space connects to the adjacent IA-Lab lounge for larger recording sessions.
PMIL has a foldable wall that, when opened, expands the space to host large AI workshops and press events.
Room name: eecs_lv24_pmil.
Maximum 4 consecutive days per booking. Master and bachelor students require a supervisor to hold the booking.
BodyTalk develops unified models that synthesize speech, facial expressions, and gestures simultaneously from text, generating spontaneous and non-repetitive conversational behaviors for virtual characters and social robots. The project tackles two core challenges: (1) joint multimodal synthesis maintaining congruence across all modalities, and (2) high-level style control (engagement, agitation) that consistently shapes all output channels.
SignBot employs state-of-the-art generative AI to create high-quality sign language animations and language processing systems. Sign languages — used by over 70 million people globally — have unique visuo-spatial and highly parallel structure that challenges conventional NLP methods. The project uses PMIL's motion capture infrastructure to record and model sign language motion, training neural synthesis systems capable of end-to-end sign language generation to improve accessibility and inclusion for deaf communities.
A multimodal corpus of two-party conversations recorded in PMIL using the Meta Quest VR headset and the Optitrack motion capture system. Participants engaged in referential communication tasks inside the AI2-THOR physics simulator — describing, identifying, and giving instructions about objects and locations in a virtual 3D environment — while their full-body motion, speech, and gaze were captured simultaneously.
The dataset is designed to advance co-speech gesture generation in spatially grounded contexts: skeletal motion capture data is streamed directly from the VR headset into the simulator, enabling synchronised capture of embodied referential communication with full scene-graph annotations. Created by Anna Deichler, Jim O'Regan & Jonas Beskow (KTH); presented at the ECCV Multimodal Agents Workshop.
Artificial Actors developed virtual digital humans that function like actors receiving directorial guidance — agents with psychological inner states that govern their nonverbal behaviours, enabling motion synthesis with specific emotional qualities such as shyness or social anxiety rather than neutral, generic movement. The project combined three strands of work: (1) recording a database of acted behaviours across diverse psychological states using PMIL's motion capture infrastructure, (2) developing probabilistic generative methods for gesture synthesis conditioned on high-level personality traits, and (3) establishing a cognitive modelling framework to guide behavioural synthesis. A key outcome was a virtual agent simulating a therapy patient, evaluated with practising therapists and trainees.
Adapts diffusion models to synthesise co-speech gesture and dance from audio in real time. A Conformer-based architecture replaces dilated convolutions for improved modelling power, achieving state-of-the-art motion quality with controllable, speaker-distinctive styles. Generalised guidance enables product-of-expert ensembles of diffusion models for style interpolation and blending.
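The product-of-experts construction can be written down compactly. A standard reading of generalised guidance (the paper's exact formulation may differ) extends classifier-free guidance from one condition to a weighted ensemble of conditions, so that varying the weights interpolates and blends styles:

```latex
% Classifier-free guidance with a single condition c and weight \gamma:
\hat{\epsilon}(x_t) = \epsilon_\theta(x_t,\varnothing)
  + \gamma\,\bigl(\epsilon_\theta(x_t,c) - \epsilon_\theta(x_t,\varnothing)\bigr)

% Generalised guidance over experts c_1,\dots,c_N with weights \gamma_i:
\hat{\epsilon}(x_t) = \epsilon_\theta(x_t,\varnothing)
  + \sum_{i=1}^{N} \gamma_i\,\bigl(\epsilon_\theta(x_t,c_i) - \epsilon_\theta(x_t,\varnothing)\bigr)
```

Since the predicted noise is proportional to a negative score, the second expression corresponds (approximately) to sampling from a product of tilted expert distributions, one per conditioning style.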
Houses a fleet of custom-built aerial robots and ground vehicles, including multiple quadrotor UAV designs, the Boston Dynamics Spot quadruped, and wheeled autonomous ground systems. Equipped with onboard compute, custom sensor arrays, and dedicated workshop space for hardware development.
Research areas include autonomous navigation and mapping, aerial perception, multi-robot coordination, and long-range inspection. The Spot robot — with its Spot Arm — is a regular presence throughout KTH IRL and is frequently used in the IA-Lab for mobile manipulation studies.
A Vinnova-funded project coordinated by Scania developing scene perception capabilities for safe autonomous driving. The KTH RPL team contributes two work packages: WP4 builds comprehensive local scene models by fusing data from multiple sensors, algorithms, and HD maps; WP5 uses these models for proactive situation interpretation and prediction — detecting objects that individual sensors would miss in occluded or complex traffic scenarios (e.g. vehicles emerging from behind stopped buses, or at intersections with limited sightlines).
A Digital Futures postdoctoral project (Jan 2025–Dec 2026) building a collaborative spatial perception framework for city-scale digital twinning. Multiple autonomous agents jointly construct multi-level abstract representations of urban environments from LiDAR point clouds, RGB-D imagery, and remote sensing data — enabling robots to autonomously create and continuously update a complete digital mirror of the physical world with minimal human involvement.
CloudGripper is an open-source cloud robotics testbed for remote robotic manipulation research, benchmarking, and large-scale data collection — hosted at KTH IRL and accessible to researchers worldwide over the internet.
The lab currently houses 32 small robot arm cells, each enclosed in a transparent acrylic frame with overhead cameras that capture ground-truth images of every grasp and push interaction. Researchers anywhere in the world can log in, operate the hardware remotely, collect training data, and run manipulation experiments — no physical presence required.
The transparent enclosure design ensures clean overhead ground-truth images for computer vision and robot learning training. The system is expanding to include industrial arms for more complex dexterous manipulation tasks.
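Remote operation of a testbed like this usually means sending structured commands over the internet to an addressed robot cell. The sketch below is hypothetical: the real CloudGripper API (cloudgripper.org) defines its own endpoints and command schema, and every name and field here is an illustrative assumption, shown only to convey the shape of such a client.

```python
import json

class CloudGripperClient:
    """Hypothetical sketch of a remote-manipulation client.
    Class, method, and field names are illustrative assumptions,
    not the actual CloudGripper API."""

    def __init__(self, robot_id, token):
        self.robot_id = robot_id   # e.g. a specific arm cell
        self.token = token         # access credential

    def move_command(self, x, y, z, gripper):
        """Serialise a Cartesian move plus gripper command as the JSON
        payload a cloud API like this might accept (values normalised 0..1)."""
        if not all(0.0 <= v <= 1.0 for v in (x, y, z, gripper)):
            raise ValueError("normalised coordinates must be in [0, 1]")
        return json.dumps({"robot": self.robot_id,
                           "cmd": "move",
                           "pos": [x, y, z],
                           "gripper": gripper})
```

In a real session the serialised payload would be posted to the testbed over HTTPS, and the overhead camera images returned alongside each command would serve as the ground-truth observations mentioned above.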
A unique theatre-style space with tiered red velvet seating for up to 50 people, purpose-built for studying group dynamics, audience attention, and social perception in larger gatherings.
Equipped with wireless audience-response clickers for real-time participant input. The front wall holds a 2×2 video wall of 55″ displays, connected both to a control room and to the Intelligence Augmentation lab; this lets groups, panels, and public audiences watch (and control) an ongoing human-robot interaction in the kitchen lab.
The Group Perception Lab doubles as a high-quality venue for seminars and shared presentations with peer institutions (e.g. USC-ICT).
KTH hosts the speech technology node of Sweden’s national language infrastructure. Språkbanken Tal develops and maintains open resources for Swedish speech — including automatic speech recognition (ASR), text-to-speech synthesis (TTS), and forced alignment tools — as part of Nationella språkbanken and the European CLARIN ERIC network. Funded jointly by the Swedish Research Council (VR) and KTH, the infrastructure serves universities, broadcasters, public agencies, and cultural institutions across Sweden.
KTH IRL serves as the physical substrate for externally funded research projects spanning humanoid robotics, AI-supported daily life, conversational agents, and distributed robot learning.
A five-year project developing humanoid robots that understand context, learn from interaction, and provide proactive verbal and physical support — bridging the care gap for older adults and people with special needs. SAInt will expand the IRL humanoid fleet to five platforms and conduct longitudinal user studies in the IA-Lab kitchen.
Project Webpage

Designs robot behaviour for human-crowded environments by integrating spatial & social context understanding, multimodal communication, and formal-methods decision-making. Research targets trust through both verifiability and social acceptability, with real-world validation in the IA-Lab. PIs: Leite, Tumova, Gustafson, Jensfelt (KTH).
WASP project page

Investigated conversational AI and gaze-grounded multimodal interaction in the IA-Lab kitchen and PMIL. Produced two publicly released corpora: the KTH-ARIA-referential dataset (egocentric cooking with Meta Aria glasses) and MM-Conv (motion-captured VR dialogue with AI2-THOR scene graphs). Both datasets are open for research use.
An open cloud robotics platform enabling researchers worldwide to collect large-scale robot manipulation datasets via internet teleoperation. Hosted in the Cloud Robotics Lab, CloudGripper democratises access to robotic manipulation hardware and enables AI training data collection at unprecedented scale.
cloudgripper.org

Develops unified generative models that synthesise speech, facial expressions, and gestures simultaneously from text, producing spontaneous and non-repetitive conversational behaviours for virtual characters and social robots. Addresses joint multimodal congruence and high-level style control (engagement, agitation). Hosted in PMIL.
KTH TMH project page

Employs state-of-the-art generative AI to create high-quality sign language animations and language processing systems, addressing the visuo-spatial and highly parallel structure of signed languages. Uses PMIL's motion capture to record and train neural sign synthesis models, improving accessibility for 70M+ sign language users worldwide.
KTH TMH project page

Developed virtual digital humans with psychological inner states governing nonverbal behaviour — enabling synthesis with specific emotional qualities such as shyness or social anxiety. Combined PMIL motion capture recordings, probabilistic gesture synthesis, and cognitive modelling. Key output: Listen, Denoise, Action! (ACM TOG / SIGGRAPH 2023), a diffusion model for audio-driven gesture and dance synthesis.
Digital Futures project page

Led by Noémie Jaquier (WASP Assistant Professor, RPL), the GeoRob Lab develops data-efficient robot learning, optimisation, and control algorithms using differential geometry and physics as core inductive biases. Recently awarded a Swedish Research Council Starting Grant (2025). Hosted in the Humanoid Robot Lab.
GeoRob Lab website

A decade-long Wallenberg Scholar programme developing robotic systems that interpret their surroundings like human senses and learn through interaction. Focus areas include multimodal AI (sound, image, force), learning from demonstration and VR teleoperation, and manipulation of deformable objects such as textiles and groceries. Hosted in the Humanoid Robot Lab.
KAW project page

A ten-year VR Rådprofessor grant developing self-supervised and meta-learning methodologies with causal reasoning for robot perception, control, and interaction with rigid and deformable objects in unstructured environments. Addresses open problems in transfer learning, modelling unknown unknowns, and multisensory physical interaction. Hosted in the Humanoid Robot Lab.
KTH RPL project page

A five-year ERC Advanced Grant (€2.4M, KTH) creating informative representations of deformable objects combining analytical and learning-based approaches. Research covered geometric, topological, and physical modelling for skilled sensorimotor behaviour in bimanual robot systems. PI: Danica Kragic. Hosted in the Humanoid Robot Lab.
CORDIS project page

Vinnova-funded project (Swedish: Prata mat) developing a proactive conversational AI cooking assistant for elderly users. Wizard-of-Oz experiments in the IA-Lab smart kitchen compared Instructional vs Chatty AI Chef personas with 6 senior participants — the chatty variant was rated as more aware and intelligent. Partners: KTH, Electrolux, Nagoon. PI: Joakim Gustafson.
KTH IRL is jointly operated by two of KTH's most research-intensive departments. Their combined expertise spans the full stack — from dialogue and social perception to planning and physical robot control.
TMH investigates how humans use voice, music, and sound to communicate and interact. Faculty expertise covers multimodal dialogue systems, generative models for speech and gesture, social robotics, expressive speech synthesis, and face-to-face interaction modelling. TMH leads interaction design and user research across KTH IRL projects, and operates the IA-Lab, PMIL, and Group Perception Lab.
Visit TMH

RPL bridges machine learning, computer vision, and robotics to create autonomous systems that perceive and act in the physical world. Faculty expertise includes dexterous manipulation, geometric robot learning, human-robot interaction, formal methods, and autonomous navigation. RPL leads hardware, perception, and control research within KTH IRL, and operates the Humanoid Robot Lab, Mobile Robotics Lab, and Cloud Robotics Lab.
Visit RPL

Whether you are a KTH researcher ready to book the motion-capture studio, an industry partner exploring collaboration, or a research group wanting to use CloudGripper remotely — KTH IRL welcomes you.
KTH faculty, postdocs, and PhD students can request access to PMIL via the online form and book through the KTH Outlook calendar system. Room name: eecs_lv24_pmil. Maximum 4 consecutive days per booking. Master and bachelor students may be granted time-limited access under supervisor responsibility.
KTH IRL has a strong track record of joint R&D with Electrolux, Akademiska Hus, and other partners. We welcome collaboration with industry, care organisations, and international research groups. Contact us to explore how KTH IRL can support your research, product development, or innovation goals.