Contextual Interactive Evaluation of TTS Models in Dialogue Systems

Siyang Wang, Éva Székely and Joakim Gustafson

In Proceedings of Interspeech 2024, Kos, Greece

PDF available at the ISCA Archive

More TTS papers with audio samples are available at KTH TTS-demos

In this paper we evaluate our conversational TTS model with prosody control in a contextual, interactive setting.
To do this we developed a 20 Questions game using GPT-4 as a back-end, Google ASR and our TTS server.
We also developed an interaction manager that kept track of progress in the game and controlled the prosodic realization accordingly.
The first table illustrates this prosody control with two example sentences.

Bad progress

Moderate progress

Good progress


Speaking style

Positive text

Negative text

Slow speech rate and low pitch
Slower speech rate and lower pitch
Moderate speech rate and pitch
Higher speech rate and higher pitch
High speech rate and high pitch
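
The mapping from game progress to prosody shown in the table above can be sketched roughly as follows. This is a minimal illustration, not the interaction manager used in the study: the progress score in [0, 1], the three bucket thresholds, the `Prosody` class, the function names and the numeric rate and pitch scale are all hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prosody:
    """Relative prosody targets; 0.0 means the voice's neutral setting."""
    rate: float   # hypothetical scale: negative = slower, positive = faster
    pitch: float  # hypothetical scale: negative = lower, positive = higher


def progress_label(progress: float) -> str:
    """Bucket a progress score in [0, 1] into the three table columns.

    The thresholds (1/3 and 2/3) are assumptions made for illustration.
    """
    if progress < 1 / 3:
        return "bad"
    if progress < 2 / 3:
        return "moderate"
    return "good"


def prosody_for(progress: float, max_shift: float = 1.0) -> Prosody:
    """Map progress linearly onto speech rate and pitch.

    Bad progress gives slower, lower-pitched speech; good progress gives
    faster, higher-pitched speech. Input is clamped to [0, 1].
    """
    p = min(max(progress, 0.0), 1.0)
    shift = (2.0 * p - 1.0) * max_shift  # -max_shift at 0, +max_shift at 1
    return Prosody(rate=shift, pitch=shift)
```

For example, `prosody_for(0.0)` yields the slowest, lowest setting and `prosody_for(0.5)` the neutral one. A real system would translate such relative values into whatever control inputs its TTS model accepts.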


Using an animal guessing game, we evaluated a Tacotron 2 TTS voice with prosody control (TTS-Pros),
which controls speaking style (from read to conversational) and prosody (mean pitch and speaking rate).
We compared it with a Tacotron 2 voice without prosody control (TTS-T2) and a Matcha TTS voice (TTS-Matcha).
In the study, 21 subjects were invited to our smart home lab to play the game.
In the game, the user and the system took turns thinking of an animal and asking questions and guessing.
Each user played 4 games with a system that used one of the three TTS systems.
Below are some examples of these interactions. The tables only contain the ASR transcript of the user input.


Tacotron with prosody control, System guessing

Time (start-end)

Speaker

Audio

Text

00:00:04,096 -- 00:00:05,798 TTS-Pros: Hi there. Let's play 10 questions.
00:00:06,096 -- 00:00:08,798 TTS-Pros: Is the animal you're thinking of typically kept as a pet?

00:00:09,638 -- 00:00:09,858 USER: Yes
00:00:12,359 -- 00:00:13,119 TTS-Pros: Ah, fantastic.
00:00:13,640 -- 00:00:14,820 TTS-Pros: That certainly narrows it down.
00:00:22,584 -- 00:00:25,825 TTS-Pros: Is the animal you're thinking of small enough to be held in your arms?

00:00:27,406 -- 00:00:27,906 USER: Of course, yes.
00:00:30,069 -- 00:00:31,991 TTS-Pros: Oh wonderful that narrows it down quite a bit.
00:00:36,615 -- 00:00:38,336 TTS-Pros: Is this animal a type of rodent?

00:00:40,878 -- 00:00:41,739 USER: I'm not quite sure.
00:00:41,759 -- 00:00:46,063 TTS-Pros: Okay, so then back to the drawing board it seems.
00:00:49,486 -- 00:00:50,627 TTS-Pros: Does this animal have fur?

00:00:51,548 -- 00:00:51,748 USER: Yes.
00:00:54,210 -- 00:00:54,830 TTS-Pros: Ah, splendid.
00:00:56,712 -- 00:00:57,573 TTS-Pros: That narrows it down.
00:01:01,428 -- 00:01:02,549 TTS-Pros: Does this animal have fur?

00:01:03,630 -- 00:01:03,850 USER: Yes!
00:01:06,192 -- 00:01:06,792 TTS-Pros: Ah, that's great.
00:01:07,413 -- 00:01:08,614 TTS-Pros: Seems like I'm getting closer.
00:01:14,759 -- 00:01:16,520 TTS-Pros: Is it known for its ability to purr?

00:01:20,444 -- 00:01:21,444 USER: Can you repeat that again?
00:01:23,626 -- 00:01:25,888 TTS-Pros: Is it known for its ability to purr?

00:01:26,769 -- 00:01:35,342 USER: To purr? What does that mean? I'm gonna skip this.
00:01:31,921 -- 00:01:35,342 TTS-Pros: Is the animal primarily a carnivore known for eating meat?

00:01:35,982 -- 00:01:37,502 USER: Noo, I don't think so.
00:01:39,843 -- 00:01:40,283 TTS-Pros: Uh, no?
00:01:40,923 -- 00:01:43,084 TTS-Pros: Alright, I need to consider other options.

00:01:43,724 -- 00:01:45,645 USER: I mean, yes, but not raw meat.
00:01:48,186 -- 00:01:49,286 TTS-Pros: Is it a type of bird?

00:01:50,907 -- 00:01:51,307 USER: No.
00:01:51,327 -- 00:01:57,869 TTS-Pros: Oof I thought I was out to do something interesting let's see what else could it be?
00:02:02,540 -- 00:02:04,862 TTS-Pros: Does this animal commonly reside indoors?

00:02:05,402 -- 00:02:11,286 USER: yes yes uh both but more or less indoor yeah
00:02:11,546 -- 00:02:12,667 TTS-Pros: Phew okay I'm getting warmer with this.
00:02:17,610 -- 00:02:22,113 TTS-Pros: Is this animal prone to climbing or jumping onto elevated surfaces like furniture?

00:02:24,815 -- 00:02:33,115 USER: it depends on what species species
00:02:39,785 -- 00:02:43,010 TTS-Pros: I have asked my questions and have failed to guess correctly.

Tacotron with prosody control, User guessing

Time (start-end)

Speaker

Audio

Text

00:05,754 - 00:07,134 TTS-Pros: I am thinking about an animal.
00:11,537 - 00:13,117 TTS-Pros: Please start asking me questions.

00:15,879 - 00:17,180 USER: Does it have four legs?
00:21,822 - 00:23,123 TTS-Pros: No, it does not have four legs.
00:24,103 - 00:26,825 TTS-Pros: Keep on your toes. You might just stumble on the answer.

00:27,219 - 00:28,100 USER: sir
00:28,926 - 00:29,026 TTS-Pros: Um...
00:35,698 - 00:39,219 TTS-Pros: Looks like you took a tiny misstep there, don't trip over your own questions.

00:39,219 - 00:45,100 USER: /SILENCE/
00:45,100 - 00:46,001 TTS-Pros: You have to say something?

00:46,741 - 00:48,621 USER: Does it have more than four legs?
00:53,102 - 00:54,923 TTS-Pros: No, it does not have more than four legs.
00:56,163 - 00:58,944 TTS-Pros: Definitely not crawling with possibilities, are we?

01:03,740 - 01:05,001 USER: Is it a reptile?
01:05,101 - 01:06,001 TTS-Pros: You have to say something.

01:10,623 - 01:11,603 USER: Is it a reptile?
01:13,224 - 01:14,144 TTS-Pros: Just a second.
01:16,605 - 01:17,865 TTS-Pros: No, it is not a reptile.
01:23,608 - 01:27,009 TTS-Pros: guess we're not slithering towards the answer this time.

01:34,349 - 01:35,790 USER: Does it live in the water?
01:37,370 - 01:39,071 TTS-Pros: Engaging deeper thought.
01:40,071 - 01:41,411 TTS-Pros: Yes, it resides in the water.
01:42,312 - 01:43,812 TTS-Pros: You're not floundering anymore.
01:44,572 - 01:45,833 TTS-Pros: Keep swimming for the answer.

01:46,973 - 01:48,513 USER: Is it a type of fish?
01:49,370 - 01:50,071 TTS-Pros: Let's see
01:52,654 - 01:54,215 TTS-Pros: Yes, you're asking if it's a fish.
01:55,195 - 01:58,096 TTS-Pros: Cast a wider net, but you're swimming in the right school now.

02:08,692 - 02:10,714 USER: Is it larger than a cat?
02:12,535 - 02:13,075 TTS-Pros: Hold on.
02:15,517 - 02:17,158 TTS-Pros: Yes, it is larger than a cat.
02:18,179 - 02:21,021 TTS-Pros: Seems like you're not fishing for compliments, just bigger fish.

02:26,805 - 02:29,047 USER: Would you consider this animal dangerous?
02:30,788 - 02:31,368 TTS-Pros: In a jiffy.
02:33,390 - 02:34,811 TTS-Pros: Yes, it is considered dangerous.
02:35,836 - 02:38,418 TTS-Pros: Careful now, don't bite off more than you can chew with your guesses.

02:41,940 - 02:44,701 USER: Is it extra dangerous for surfers?
02:46,823 - 02:47,463 TTS-Pros: uh, let me see.
02:51,545 - 02:53,707 TTS-Pros: Yes, it is particularly risky for those who surf.
02:54,547 - 02:56,589 TTS-Pros: You're definitely riding the wave to success now.

03:02,092 - 03:03,733 USER: Does it have very many teeth?
03:05,666 - 03:06,227 TTS-Pros: in a jiffy.
03:07,871 - 03:09,676 TTS-Pros: You have asked ten questions and have to guess.

03:10,678 - 03:11,500 USER: Is it a shark?
03:13,906 - 03:15,710 TTS-Pros: Yes, it is a shark, you guessed correctly.
03:16,612 - 03:18,316 TTS-Pros: This concludes this round of the game.

Tacotron, User guessing

Time (start-end)

Speaker

Audio

Text

00:00:07,469 -- 00:00:09,070 TTS-T2: I am thinking about an animal.

00:00:11,752 -- 00:00:13,974 USER: Is the animal larger than a human?
00:00:19,697 -- 00:00:21,859 TTS-T2: Yes, it is larger than a human.
00:00:22,699 -- 00:00:25,201 TTS-T2: You're not sizing me up too quickly, are you?

00:00:28,143 -- 00:00:29,524 USER: Does the animal live on land?
00:00:31,167 -- 00:00:31,707 TTS-T2: Let's see.
00:00:36,948 -- 00:00:39,069 TTS-T2: Yes, it does reside on land.
00:00:40,269 -- 00:00:43,550 TTS-T2: Absolutely towering over the competition with that guess.

00:01:04,378 -- 00:01:07,259 USER: yeah is it uh domesticated a domestic animal
00:01:08,760 -- 00:01:09,300 TTS-T2: Hold on.
00:01:11,481 -- 00:01:13,703 TTS-T2: No, it is not a domestic animal.
00:01:14,803 -- 00:01:18,125 TTS-T2: Looks like someone's barking up the wrong family tree.

00:01:22,227 -- 00:01:23,848 USER: uh is it uh carnivore
00:01:25,369 -- 00:01:26,009 TTS-T2: Quick scan.
00:01:28,835 -- 00:01:30,116 TTS-T2: No, not a carnivore.
00:01:31,477 -- 00:01:34,319 TTS-T2: uh you might want to sharpen those claws of thought.

00:01:34,319 -- 00:01:40,123 USER: /SILENCE/
00:01:40,123 -- 00:01:41,284 TTS-T2: You have to say something.

00:01:41,304 -- 00:01:43,705 USER: Is it a lot heavier than a human?
00:01:45,286 -- 00:01:46,207 TTS-T2: Just a second.
00:01:50,730 -- 00:01:53,512 TTS-T2: Yes, it is significantly heavier than a human.
00:01:54,560 -- 00:01:57,581 TTS-T2: You're definitely not lightweight in your questioning.

00:01:58,781 -- 00:02:00,121 USER: Is it an elephant?
00:02:01,702 -- 00:02:02,202 TTS-T2: Hold on.
00:02:04,982 -- 00:02:07,183 TTS-T2: No, your guess is not correct.
00:02:08,063 -- 00:02:10,623 TTS-T2: keep trunk ting along with those questions.

00:02:18,414 -- 00:02:19,414 USER: does uh is the animal dangerous to humans
00:02:15,164 -- 00:02:15,885 TTS-T2: Quick scan.
00:02:27,472 -- 00:02:30,533 TTS-T2: No, it's not considered dangerous to humans.
00:02:31,553 -- 00:02:34,394 TTS-T2: Don't worry, you won't need any armor for this one.

00:02:39,916 -- 00:02:44,797 USER: Does the animal have any patterns on itself?
00:02:46,458 -- 00:02:47,618 TTS-T2: Reflecting on that.
00:02:59,710 -- 00:03:02,355 TTS-T2: You have asked 10 questions and have to guess.

00:03:03,076 -- 00:03:04,539 USER: It's the animal, a giraffe.
00:03:06,923 -- 00:03:08,867 TTS-T2: I am sorry you guessed wrong.

Matcha, User guessing

Time (start-end)

Speaker

Audio

Text

00:00:06,915 -- 00:00:08,436 TTS-Matcha: I am thinking about an animal.
00:00:11,218 -- 00:00:13,459 TTS-Matcha: Please start asking me questions.

00:00:14,240 -- 00:00:15,320 USER: Is it able to swim?
00:00:21,024 -- 00:00:23,005 TTS-Matcha: No, it's not known for swimming.
00:00:23,926 -- 00:00:27,928 TTS-Matcha: I'd throw you a life jacket, but we're on dry land with this one.

00:00:29,645 -- 00:00:30,525 USER: Is it a carnivore?
00:00:32,266 -- 00:00:32,926 TTS-Matcha: Hold on.
00:00:35,647 -- 00:00:37,147 TTS-Matcha: Yes, it is a carnivore.
00:00:37,828 -- 00:00:38,808 TTS-Matcha: Chomp chomp.
00:00:39,568 -- 00:00:41,909 TTS-Matcha: You've got some sharp teeth on that one.

00:00:43,529 -- 00:00:44,450 USER: Does it have scales?
00:00:46,250 -- 00:00:47,331 TTS-Matcha: Just a second.
00:00:51,872 -- 00:00:53,573 TTS-Matcha: No, it does not have scales.
00:00:54,453 -- 00:00:57,974 TTS-Matcha: Looks like you're fishing for clues in the wrong pond.

00:01:00,330 -- 00:01:01,191 USER: Is it a K9?
00:01:03,072 -- 00:01:03,712 TTS-Matcha: In a jiffy
00:01:05,374 -- 00:01:07,055 TTS-Matcha: No, it is not a canine.
00:01:07,875 -- 00:01:10,277 TTS-Matcha: Maybe try howling up a different tree.

00:01:11,658 -- 00:01:16,321 USER: Is it part of the cat family?
00:01:18,658 -- 00:01:20,321 TTS-Matcha: Let's see..
00:01:22,506 -- 00:01:24,887 TTS-Matcha: Yes, it is part of the cat family.
00:01:25,708 -- 00:01:28,350 TTS-Matcha: purr fect you're clawing your way to the top.

00:01:31,195 -- 00:01:33,196 USER: Is it known to reside in a jungle?
00:01:35,116 -- 00:01:36,377 TTS-Matcha: Reflecting on that.
00:01:38,677 -- 00:01:41,938 TTS-Matcha: No, it's not typically associated with the jungle.
00:01:54,282 -- 00:01:58,944 TTS-Matcha: Looks like you're not quite the king of this particular guessing jungle yet.

00:02:00,151 -- 00:02:01,331 USER: Does it live in the savannah?
00:02:03,052 -- 00:02:05,312 TTS-Matcha: Excuse me, I need to think a bit.
00:02:09,053 -- 00:02:11,353 TTS-Matcha: Yes, it does reside in the savannah.
00:02:12,073 -- 00:02:13,053 TTS-Matcha: Hot, hot, hot.
00:02:13,853 -- 00:02:16,774 TTS-Matcha: Your instincts are on a safari to success.

00:02:18,414 -- 00:02:19,414 USER: Does it have a mane?
00:02:21,115 -- 00:02:21,715 TTS-Matcha: Let's see.
00:02:24,975 -- 00:02:26,496 TTS-Matcha: Yes, it does have a mane.
00:02:27,276 -- 00:02:28,176 TTS-Matcha: Looking sharp.
00:02:29,013 -- 00:02:31,616 TTS-Matcha: You're nearly the king of this guessing game.
00:02:37,942 -- 00:02:40,725 TTS-Matcha: You have asked 10 questions and have to guess.

00:02:42,166 -- 00:02:42,847 USER: Is it a lion?
00:02:45,149 -- 00:02:47,792 TTS-Matcha: Yes, it is a lion, you guessed correctly.

Finally, since we have the log file from the dialogue system, we have access to the interaction manager's
prosody control parameters, which were not used by the TTS systems without prosody control.
Below we take a dialogue with the vanilla Tacotron 2 voice and add a column showing what it would
have sounded like if the dialogue had used Tacotron 2 equipped with prosody control.

Vanilla Tacotron 2 and Tacotron 2 with prosody control, User guessing

Time (start-end)

Speaker

Original (T2)

Alternative (prosody)

Text

00:00:07,470 -- 00:00:09,031 SYSTEM: I am thinking about an animal.
00:00:15,316 -- 00:00:17,678 SYSTEM: Please start asking me questions.

00:00:18,899 -- 00:00:20,980 USER: Does the animal live in water?
00:00:25,243 -- 00:00:27,385 SYSTEM: No, it does not reside in water.
00:00:28,388 -- 00:00:31,369 SYSTEM: Hope you're not fishing for clues in the wrong pond.

00:00:32,569 -- 00:00:34,350 USER: Does your animal fly?
00:00:36,050 -- 00:00:37,210 SYSTEM: Swiftly checking.
00:00:38,911 -- 00:00:40,351 SYSTEM: No, it does not fly.
00:00:41,692 -- 00:00:45,073 SYSTEM: Seems like your guesses are having a bit of turbulence.

00:01:04,378 -- 00:01:07,259 USER: Does your animal usually reside in Europe?
00:00:51,194 -- 00:00:53,255 SYSTEM: Excuse me, I need to think a bit.
00:00:53,975 -- 00:00:55,676 SYSTEM: Yes, it can reside in Europe.
00:00:56,436 -- 00:00:56,796 SYSTEM: Eureka.
00:00:57,802 -- 00:00:59,943 SYSTEM: somewhat closer to the truth you are.

00:01:02,245 -- 00:01:05,146 USER: Is your animal usually kept as a pet?
00:01:07,148 -- 00:01:07,908 SYSTEM: Let me see.
00:01:11,530 -- 00:01:13,732 SYSTEM: Yes, it is usually kept as a pet.
00:01:14,932 -- 00:01:16,593 SYSTEM: you're barking up the right wait i'm not supposed to say that.
00:01:19,115 -- 00:01:20,896 SYSTEM: Keep fetching those questions.

00:01:20,896 -- 00:01:32,557 USER: /SILENCE/
00:01:32,557 -- 00:01:33,818 SYSTEM: You have to say something.

00:01:34,499 -- 00:01:35,940 USER: Is your animal carnivorous?
00:01:39,884 -- 00:01:41,345 SYSTEM: Yes, it is carnivorous.
00:01:42,386 -- 00:01:45,069 SYSTEM: salivating at the right scent i see.

00:01:48,152 -- 00:01:50,536 USER: uh does your animal
00:01:50,936 -- 00:01:52,536 SYSTEM: Swiftly checking.
00:01:56,265 -- 00:01:59,286 SYSTEM: uh I think you might have missed a part of your question there.

00:01:59,666 -- 00:02:00,166 USER: Yes, I did.
00:02:00,186 -- 00:02:02,687 SYSTEM: That was certainly a mysterious attempt.

00:02:06,147 -- 00:02:07,928 USER: Is your animal a type of cat?
00:02:09,928 -- 00:02:10,128 SYSTEM: uh..
00:02:13,589 -- 00:02:15,609 SYSTEM: No, it is not a type of cat.
00:02:16,530 -- 00:02:19,070 SYSTEM: You're barking up the wrong tree with that guess.
00:02:19,990 -- 00:02:22,351 SYSTEM: Try a different category, perhaps.

00:02:23,751 -- 00:02:25,416 USER: Is your animal a type of dog?
00:02:27,864 -- 00:02:30,050 SYSTEM: Yes, it is a dog you guessed correctly.
00:02:30,933 -- 00:02:33,200 SYSTEM: This concludes this round of the game.