Magnus Lundeberg and Jonas Beskow
Centre for Speech Technology, KTH, Sweden.
ABSTRACT

In our continuing work on high-quality multimodal text-to-speech synthesis for speechreading, a new talking head has been developed to act as an interactive agent in a dialogue system set up in a public exhibition area in downtown Stockholm. The new agent conforms to the same set of basic control parameters as our earlier faces, allowing us to control it using existing rules for visual speech synthesis. To add to the realism and believability of the dialogue system, the agent has been given a rich repertoire of extra-linguistic gestures and expressions, including emotional cues, turn-taking signals and prosodic cues such as punctuators and emphasizers. Studies of user reactions indicated that people have a positive attitude towards our new agent.
Animated synthetic talking faces and characters have been developed using a variety of techniques and for a variety of purposes during the last two decades; for an overview of the field, see [7]. Our approach is based on parameterized deformable 3D facial models, controlled by rules within a text-to-speech framework [8]. The rules generate the parameter tracks for the face from a representation of the text, taking coarticulation into account [5].

We employ a generalized parameterization technique to adapt a static 3D wireframe of a face for visual speech animation [9]. Based on concepts first introduced by Parke [10], we define a set of parameters that deform the wireframe by applying weighted transformations to its vertices. One critical difference from Parke's system, however, is that we have decoupled the model definitions from the animation engine, which greatly increases flexibility. For example, this allows us to define and edit the weighted transformations using a graphical modeling interface, rather than hand-coding them into e.g. C source files. The parameterization information is then stored, together with the rest of the model, in a specially designed file format. For all our models developed to date, we have decided to conform to the same set of basic control parameters for the articulators that was used in [5]. This has the advantage of making the models independent of the rules that control them during visual speech synthesis: all models that conform to the parameter set will produce compatible results for any given set of parameter tracks.

The animation engine and the modeling interface currently run under Windows and several UNIX dialects, including IRIX, Linux and HP-UX. Cross-platform portability is achieved by using the industry-standard OpenGL library for 3D graphics and Tcl/Tk for the graphical user interface and scripting. For the August dialogue system we chose SGI/IRIX as the animation platform.
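As a concrete illustration of this parameterization scheme, the sketch below shows how a control parameter might deform a wireframe by applying a weighted transformation to a subset of its vertices. This is a minimal sketch of the general idea, not the actual CTT implementation; the class names, data layout and toy values are our own assumptions.

import numpy as np

class Parameter:
    """A deformation parameter: weighted displacement of some vertices."""
    def __init__(self, name, vertex_ids, weights, direction):
        self.name = name
        self.vertex_ids = np.asarray(vertex_ids)
        self.weights = np.asarray(weights, dtype=float)      # one weight per vertex
        self.direction = np.asarray(direction, dtype=float)  # displacement direction (x, y, z)

class FaceModel:
    """A static wireframe plus a set of deformation parameters."""
    def __init__(self, rest_vertices, parameters):
        self.rest = np.asarray(rest_vertices, dtype=float)   # (n_vertices, 3)
        self.parameters = {p.name: p for p in parameters}

    def deform(self, values):
        """Return deformed vertices for a dict of parameter values (e.g. 0..1)."""
        v = self.rest.copy()
        for name, value in values.items():
            p = self.parameters[name]
            # Each affected vertex moves along the parameter's direction,
            # scaled by its weight and the current parameter value.
            v[p.vertex_ids] += value * p.weights[:, None] * p.direction
        return v

# Toy usage: a three-vertex "lower lip" pulled downward by 'jaw_open'.
lip = FaceModel(
    rest_vertices=[[0, 0, 0], [1, 0, 0], [2, 0, 0]],
    parameters=[Parameter("jaw_open", [0, 1, 2], [0.5, 1.0, 0.5], [0, -1, 0])],
)
print(lip.deform({"jaw_open": 0.8}))

Because the parameter definitions live in the model file rather than in the engine, any model exposing the same parameter names responds to the same parameter tracks, which is what makes the models interchangeable under the synthesis rules.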
For the August system we wanted an agent that looked like the 19th century Swedish author August Strindberg. The new face was developed using as a base a static 3D model available as freeware from Viewpoint DataLabs [11]. This model was built from a moderate number of polygons, and its head contours could be made to resemble Strindberg's with only a few adjustments (Figure 1).
Before parameterization could begin, the static Viewpoint model had to undergo some surgery. The lips were split to create a mouth opening, and the static eyes were replaced with movable 3D eyeballs. We also added eyelashes, eyebrows, teeth and a tongue, and increased the level of detail around the eyes. The teeth and tongue were borrowed from the Holger model but enhanced to add to the realism: the teeth visible through the mouth opening were refined by adding polygons, giving their surfaces a more convincing 3D appearance than the basically flat teeth used in Holger and Olga, and the tongue was likewise refined to obtain a smoother look. It should be clearly stated that the teeth and tongue were modeled to look good from the outside (i.e. through the mouth opening), not to be anatomically correct when viewed through the skin. An improved 3D model of the internal articulatory organs [12] is currently under development at CTT.

The face was parameterized using the methods described above and in [9]. The parameter set was chosen to conform to our earlier models, allowing us to control the face using existing target values for the visemes. When adjusting the weighted transformations for the vertices, Vowel-Consonant-Vowel words (VCV words) from the Teleface project speech corpus [4] were used as prototypes. By viewing this speech material, both frame by frame and at full speed, and adjusting the weights in the model to resemble the articulation of our video-recorded natural speaker, smooth and realistic articulatory movements were defined.

This pre-August agent, created for use as a general agent in different experimental setups, was named Alf. After getting Alf up and talking with all the basic speech features, the creation of August began. The shape of the face was adjusted to make it more of a "Strindberg lookalike". A small beard and a mustache were added, and August was also given some supernatural behavior, like the ability to twist and stretch the beard and the mustache.
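To illustrate how timed viseme targets can drive such a parameter, the following sketch samples a track for one hypothetical parameter from a short target sequence. The actual rule system [5] models coarticulation; the plain linear interpolation used here, and all names and values, are simplifying assumptions.

import numpy as np

# (time in seconds, target value for one parameter, e.g. lip rounding)
viseme_targets = [(0.00, 0.0), (0.12, 0.9), (0.30, 0.1), (0.45, 0.8)]

def parameter_track(targets, frame_rate=25):
    """Sample a piecewise-linear parameter track at the given frame rate."""
    times = np.array([t for t, _ in targets])
    values = np.array([v for _, v in targets])
    frames = np.arange(times[0], times[-1], 1.0 / frame_rate)
    # Linear interpolation stands in for the rule-based coarticulation model.
    return frames, np.interp(frames, times, values)

frames, track = parameter_track(viseme_targets)
for t, v in zip(frames, track):
    print(f"t={t:.2f}s  lip_rounding={v:.2f}")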
When designing the agent, it was considered of paramount importance that August should not only be able to generate convincing lip-synchronized speech, but also exhibit rich and believable non-verbal behavior. To facilitate this, we have developed a library of gestures that serve as building blocks in the dialogue generation. This library consists of communicative gestures of varying complexity and purpose, ranging from primitive punctuators such as blinks and nods to complex gestures tailored for particular sentences (Table 1). They are used to communicate non-verbal information such as emotion, attitude and turn-taking, and to highlight prosodic information in the speech, like stressed syllables and phrase boundaries. The parameters used to signal prosodic information in our model are primarily eyebrow and head motion.
Each gesture is defined in terms of a set of parameter tracks, which can be invoked at any point in time, either in-between or during an utterance. Several gestures can be executed in parallel. If there is a conflict, articulatory movements created by the text-to-speech system always supersede the movements of the non-verbal gestures. Scheduling and coordination of the gestures is controlled through a scripting language.

4.1 Prosodically motivated gestures

Having the agent accentuate the auditory speech with non-articulatory movements proved very important for the perceived reactivity and believability of the system. The main rules of thumb for creating the prosodic gestures were to use a combination of head movements and eyebrow motion, and to maintain a high level of variation between different utterances. Earlier experiments [13], in which eyebrow movements were associated with words in focal position, have indicated that this is a feasible way of controlling prosodic facial motion. Potential problems with such automatic methods are that the agent can look somewhat nervous or intense, and that the eyebrow motion can become predictable. To avoid predictability and to obtain a more natural flow, we have tried to create subtle and varying cues employing a combination of head and eyebrow motion. A typical utterance from August can consist of either a raising of the eyebrows early in the sentence followed by a small vertical nod on a focal word or stressed syllable, or a small initial raising of the head followed by eyebrow motion on selected stressed syllables. A small tilting of the head forward or backward often highlights the end of a phrase. A number of standard gestures, typically with one or two eyebrow raises and some head motion, were defined. The standard gestures work well with short answering sentences like "Yes, I believe so." or "Stockholm is more than 700 years old."
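The sketch below illustrates this layering: gestures and articulation are both represented as parameter tracks, several gestures can be sampled in parallel, and articulatory tracks win on conflicts, as described above. It is not the system's actual scripting language; the track format, merge policy and all names are illustrative assumptions.

def sample(track, t):
    """Piecewise-linear lookup in a track given as [(time, value), ...]."""
    pts = sorted(track)
    if t <= pts[0][0]:
        return pts[0][1]
    if t >= pts[-1][0]:
        return pts[-1][1]
    for (t0, v0), (t1, v1) in zip(pts, pts[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

# A prosodic gesture: small eyebrow raise around a focal word at t=0.6 s.
eyebrow_raise = {"brow_height": [(0.4, 0.0), (0.6, 1.0), (0.9, 0.0)]}
# Articulation track for the same interval (produced by the TTS rules).
articulation = {"jaw_open": [(0.0, 0.0), (0.5, 0.7), (1.0, 0.1)]}

def merged_value(param, t, gestures, articulation):
    # Articulatory parameters always take precedence over gesture tracks.
    if param in articulation:
        return sample(articulation[param], t)
    vals = [sample(g[param], t) for g in gestures if param in g]
    return max(vals, default=0.0)   # simple policy for overlapping gestures

print(merged_value("brow_height", 0.6, [eyebrow_raise], articulation))  # gesture: 1.0
print(merged_value("jaw_open", 0.6, [eyebrow_raise], articulation))     # articulation wins

In this toy merge policy, overlapping gestures are combined by taking the maximum value; a real system could equally well sum or cross-fade them.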
4.2 Agents have feelings too

To enable display of the agent's different moods, six basic emotions similar to the six universal emotions defined by Ekman [14] were implemented (Figure 2), in a way similar to that described by Pelachaud, Badler and Steedman [15]. Due to the limited resolution of our current 3D model, some facial actions, such as wrinkling of the nose, are not possible and were therefore left out. Appropriate emotional cues were assigned to a number of utterances in the system, often paired with other gestures.
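One simple way to realize such emotional cues, sketched below under our own assumptions about parameter names and values, is to store each basic emotion as a preset of parameter targets that can be scaled by an intensity and blended into the current facial state.

# Hypothetical presets: each basic emotion as parameter targets.
EMOTIONS = {
    "happiness": {"mouth_corner_up": 0.8, "brow_height": 0.3},
    "sadness":   {"mouth_corner_up": -0.5, "brow_inner_up": 0.7},
    "anger":     {"brow_lower": 0.9, "lip_press": 0.6},
    "fear":      {"brow_inner_up": 0.9, "eye_open": 0.8},
    "surprise":  {"brow_height": 1.0, "jaw_open": 0.4},
    "disgust":   {"upper_lip_raise": 0.7, "brow_lower": 0.4},
}

def apply_emotion(base_values, emotion, intensity=1.0):
    """Scale an emotion preset and add it to the current parameter values."""
    out = dict(base_values)
    for param, target in EMOTIONS[emotion].items():
        out[param] = out.get(param, 0.0) + intensity * target
    return out

# Usage: blend a half-strength 'surprise' into an ongoing articulation pose.
print(apply_emotion({"jaw_open": 0.2}, "surprise", intensity=0.5))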
4.3 An example sentence

Specially tailored, utterance-specific gestures were created for about 200 sentences in the August system. An example of how eyebrow motion has been modeled in such a sentence is shown in Figure 3; it can also be viewed on the CD-ROM from the workshop. The utterance is a Strindberg quotation [16]: "Regelbundna konstverk bli lätt tråkiga liksom regelbundna skönheter; fullkomliga människor eller felfria äro ofta odrägliga." (Translation: "Symmetrical works of art easily become dull just like symmetrical beauties; impeccable or flawless people are often unbearable.") The eyebrows are raised on the syllables "-verk", "-het", "-er", "-skor" and "-fria", and there is a final long rise that peaks on the "a" in the last word, "odrägliga". Notice the lowering of the eyebrows that starts at t=5.8 s, which is intended to convey not prosodic but emotional information. Not shown is the rotation of the head in the same utterance. At the first phrase boundary (after "tråkiga" at t=3.1 s), August tilts his head forward. At the next phrase boundary (before "fullkomliga" at t=5.5 s) he tilts the head even further forward and slightly sideways, lowers his eyebrows and looks very serious. A slow continuous raising of the head follows to the end of the utterance.
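Written out as data, such an utterance-specific gesture might look like the keyframe lists below. Only the 3.1 s, 5.5 s and 5.8 s time points are taken from the description above; the remaining times and all values are invented placeholders, and the parameter names are our own.

# Keyframes: (time in seconds, normalized parameter value).
brow_height = [
    (0.5, 0.0), (0.7, 0.6), (0.9, 0.0),   # raise on "-verk" (hypothetical timing)
    (2.1, 0.0), (2.3, 0.6), (2.5, 0.0),   # raise on "-het" (hypothetical timing)
    (5.8, 0.0), (6.2, -0.5),              # emotional lowering starting at t=5.8 s
    (7.5, 0.8),                           # long final rise peaking on "odrägliga"
]
head_tilt = [
    (3.1, 0.3),                           # forward tilt at first phrase boundary
    (5.5, 0.6),                           # deeper tilt before "fullkomliga"
]
print(f"{len(brow_height)} eyebrow keyframes, {len(head_tilt)} head keyframes")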
4.4 Listening, thinking and turn-taking

To handle turn-taking, visual cues were created, such as raising the eyebrows and tilting the head slightly at the end of question phrases. Visual cues were also used to further emphasize the message (e.g. showing directions by turning the head). To enhance the perceived reactivity of the system, a set of listening gestures and thinking gestures was created. When the user presses the push-to-talk button, the agent immediately starts a randomly selected listening gesture, for example raising the eyebrows. At the release of the push-to-talk button, the agent changes to a randomly selected thinking gesture, like frowning or looking upwards with the eyes searching (Figure 4).
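A minimal sketch of this push-to-talk behavior is given below. The gesture names are invented; in the real system these would be procedures from the gesture library, and the callback would drive the animation engine.

import random

LISTENING = ["raise_eyebrows", "tilt_head_toward_user", "widen_eyes"]
THINKING = ["frown", "look_up_searching", "purse_lips"]

class TurnTakingController:
    def __init__(self, play_gesture):
        self.play_gesture = play_gesture    # callback into the animation engine
        self.last = None

    def _pick(self, options):
        # Avoid repeating the previous gesture so the behavior stays varied.
        choice = random.choice([g for g in options if g != self.last] or options)
        self.last = choice
        return choice

    def on_button_press(self):
        # User starts speaking: show a listening gesture immediately.
        self.play_gesture(self._pick(LISTENING))

    def on_button_release(self):
        # User finished: switch to a thinking gesture while the system works.
        self.play_gesture(self._pick(THINKING))

ctrl = TurnTakingController(play_gesture=print)
ctrl.on_button_press()    # e.g. raise_eyebrows
ctrl.on_button_release()  # e.g. look_up_searching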
The August system was set up in an exhibition area at Kulturhuset (Stockholm Culture Center) in downtown Stockholm as part of the program of the Cultural Capital of Europe '98. To catch the attention of people visiting the exhibition, various entertaining behaviors of the agent were needed. For example, August can roll his head in various directions, he can pretend he is watching a game of tennis, and he can perform a flirting gesture where he looks toward the entrance and whistles while raising his eyebrows. These idling motions are important in making the system look alive, as opposed to a motionless, staring face, which can easily make the system look 'crashed' or 'hung'. A high degree of variation in these idling motions is needed to prevent users from predicting the actions of the dialogue system.
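The following sketch shows one way such varied idling behavior could be scheduled: a loop that picks gestures at random while excluding the most recent ones, so the agent never looks frozen or predictable. Gesture names and timings are invented for illustration.

import random
import time

IDLE_GESTURES = ["roll_head", "watch_tennis", "flirt_whistle", "blink", "look_around"]

def idle_loop(play_gesture, rng=random.Random(), cycles=5):
    recent = []
    for _ in range(cycles):
        # Exclude the two most recent gestures to keep behavior unpredictable.
        options = [g for g in IDLE_GESTURES if g not in recent[-2:]]
        gesture = rng.choice(options)
        recent.append(gesture)
        play_gesture(gesture)
        time.sleep(rng.uniform(0.1, 0.3))   # short pause, for demo purposes only

idle_loop(play_gesture=print)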
It is difficult to objectively judge the success of the agent in a system like this one - an open system with completely unsupervised user interaction. However, some indications can be gained by studying the speech material collected in the August database [17]. It shows a remarkable variety of questions that people have asked the system. Although some of these questions were obviously posed to test the capabilities of the system, many of them are highly complex and were asked by people expecting a good answer. This could indicate that the appearance and the audio-visual speech quality of our agent are believable enough to raise people's expectations of the knowledge of the system.

The development of August and the August dialogue system has taught us how to create 3D agents as human-machine interfaces to dialogue engines with minor effort. A new agent named Per (Figure 5) was recently developed using the methods described here. Per acts as a gatekeeper/receptionist in a current speaker verification/dialogue system project at CTT. A new version of Alf with a reduced number of polygons is under development, suitable for deployment on lower-end platforms. The reduction is concentrated in areas with low surface curvature that were considered to be of less communicative importance, such as the cheeks, forehead and the back of the head, while the regions around the mouth and eyes are left intact.
Most of the gestures created for August can also be used with Alf or Per, but the result will look less convincing. To increase the portability of gestures between our different models, the facial parameters defined in the models and the gesture procedures need to be further generalized. A gesture library is under construction, containing procedures with general emotion settings and non-speech-specific gestures, as well as some procedures with linguistic cues. These procedures can serve as a base for the creation of new communicative gestures in future dialogue systems.

Acknowledgments

The work with multimodal speech synthesis is done within the Centre for Speech Technology (CTT) at KTH. Some of the audio output of the text-to-speech synthesis used in the August system is generated using an MBROLA database [18] for Swedish, created by Marcus Filipsson and Gösta Bruce of the Department of Linguistics and Phonetics at Lund University, Sweden.

References