The synthetic face from a hearing impaired view (Fonetik98)
Eva Agelfors, Jonas Beskow, Martin Dahlquist, Björn Granström, Magnus Lundeberg, Karl-Erik Spens & Tobias Öhman
Department of Speech, Music and Hearing, KTH, Stockholm, Sweden
Email: teleface@speech.kth.se
Abstract
In the Teleface project, the possibilities for a visual telephone communication aid for hearing impaired persons are being evaluated. Multimodal speech intelligibility experiments showed a marked intelligibility advantage when face information was added to the audio signal, for both natural and synthetic faces. This holds for hearing impaired persons as well as for normal hearing persons in a noisy environment. In this paper we present results from two series of tests with hearing impaired subjects.
Introduction
The Teleface project at KTH focuses on the usage of multimodal speech technology for hearing impaired persons. The aim of the first phase of the project is to evaluate the increased intelligibility hearing impaired persons experience from an auditory signal when it is complemented by a synthesised face. A demonstrator of a system for telephony with a synthetic face that articulates in synchrony with a natural voice will be implemented in phase two of the project.
Audio-Visual Speech Synthesis and Analysis
The project's different stages involve different kinds of processing of acoustic and visual speech data. In the intelligibility study, we utilize a rule-based audio-visual text-to-speech synthesis framework (Beskow, 1995) to generate synthetic acoustic as well as visual speech stimuli. This parametrically controlled visual speech synthesis will also form the basis for the intended telephone conversation aid in phase two of the project.
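As a rough illustration only of the core idea behind parametrically controlled visual synthesis (the actual framework is described in Beskow, 1995): each phoneme is assigned targets for a set of articulatory control parameters, and per-frame trajectories are interpolated between successive targets. All parameter names, target values and timings below are invented for illustration.

```python
# Hedged sketch of parametric visual speech synthesis: phonemes map to
# articulatory parameter targets, and per-frame trajectories are
# interpolated between successive targets. Names, values and timing are
# invented illustrations, not the actual rules of the Beskow (1995)
# framework.

# Target values (0..1) for a few control parameters per phoneme.
TARGETS = {
    "a": {"jaw_open": 0.7, "lip_rounding": 0.1, "lip_closure": 0.0},
    "p": {"jaw_open": 0.1, "lip_rounding": 0.2, "lip_closure": 1.0},
    "u": {"jaw_open": 0.3, "lip_rounding": 0.9, "lip_closure": 0.0},
}

def synthesize_track(segments, fps=25):
    """segments: list of (phoneme, duration_s). Returns one dict of
    parameter values per video frame, interpolating linearly from each
    phoneme's target toward the next phoneme's target over the
    phoneme's duration."""
    frames = []
    for i, (ph, dur) in enumerate(segments):
        cur = TARGETS[ph]
        nxt = TARGETS[segments[i + 1][0]] if i + 1 < len(segments) else cur
        n = max(1, round(dur * fps))
        for f in range(n):
            t = f / n  # position within the segment, 0..1
            frames.append({k: (1 - t) * cur[k] + t * nxt[k] for k in cur})
    return frames

# Example: the nonsense word /apa/ with rule-assigned durations.
track = synthesize_track([("a", 0.15), ("p", 0.10), ("a", 0.20)])
print(len(track), track[0])
```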
Automatic extraction of facial parameters from the acoustic signal requires extensive analysis of the relationship between the facial parameters and the acoustics. To this end, we have built a framework for automatic measurements of visible speech movements (Öhman, 1998). A database of video sequences of a male speaker has been recorded. Parts of the speaker's face have been marked with blue color to facilitate image analysis of the lips and other parts of the face that are important for speechreading. The parameters from the optical measurements will be statistically analyzed together with the acoustic signal, providing knowledge about the relationship between the visual and acoustic modes of speech. The optical measurements will also be used to improve the naturalness of the visual speech synthesis.
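The measurement framework itself is described in Öhman (1998); purely as an illustration of the general idea, here is a minimal sketch of locating blue facial markers by color thresholding with OpenCV. The HSV thresholds and the use of OpenCV are assumptions, not the actual method.

```python
# Hedged sketch: locating blue facial markers in a video frame by color
# thresholding, as a generic illustration of marker-based measurement.
# The HSV range and the use of OpenCV are assumptions, not the actual
# pipeline of Öhman (1998).
import cv2

def find_blue_markers(frame_bgr, min_area=20):
    """Return (x, y) centroids of blue regions in a BGR video frame."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    # Rough HSV range for saturated blue; would need tuning per recording.
    mask = cv2.inRange(hsv, (100, 120, 60), (130, 255, 255))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    centroids = []
    for c in contours:
        m = cv2.moments(c)
        if m["m00"] >= min_area:  # skip specks of noise
            centroids.append((m["m10"] / m["m00"], m["m01"] / m["m00"]))
    return centroids
```

Distances between such centroids over time (e.g. between upper- and lower-lip markers) then give frame-by-frame articulation measures that can be correlated with the acoustic signal.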
Intelligibility tests
Preliminary test with normal hearing subjects
A preliminary test was performed with normal hearing subjects: 18 fourth-year engineering students at KTH. The audio signal was degraded by adding white noise. The results for the VCV corpus showed that adding a synthetic face to a natural male voice increased correct responses from 63% to 70%. The corresponding result for adding a natural face was 76% (Beskow et al., 1997).
First test with hearing impaired subjects
Subjects and method
The first round of the test series involved 12 subjects with a severe to profound hearing loss in the better ear, with a mean pure-tone average hearing level (HL) of 88.4 dB (range 62-103 dB) at the frequencies 0.5, 1 and 2 kHz. Eleven subjects were experienced hearing-aid (HA) users and one used a cochlear implant (CI). The subjects were between 23 and 76 years old, with a mean age of 58.2 years. The speech was presented at about 70 dB SPL, and each subject was allowed to adjust their hearing aid to their most comfortable listening level during a training session (2 x 5 sentences with the natural and the synthetic face respectively).
The test material consisted of VCV-syllables, "everyday sentences" and a questionnaire.
The VCV syllables consisted of seventeen consonants presented in /aCa, iCi, uCu/ context. Three test lists (3 x 17 stimuli per list) were presented to the subjects in different conditions (2 audio-visual and 1 audio alone). Subjects were asked to respond with the consonant. Responses for the VCV corpus were forced-choice and made with a graphical interface on a computer screen. The sentences were unrelated "everyday sentences" developed at TMH by Öhngren, based on MacLeod and Summerfield (1990). Each list contained 15 sentences (45 keywords, i.e. three keywords per sentence). Six test lists were presented, two lists per condition (2 audio-visual and 1 audio alone). The number of correctly repeated keywords was counted and expressed as percent keywords correct. Responses for the sentences were given verbally.
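As a small illustration of the scoring described above, percent keywords correct could be computed as follows; the keyword-matching details (case folding, exact word match against the transcribed response) are assumptions.

```python
# Hedged sketch of the keyword scoring: count correctly repeated
# keywords over sentence lists and express the result as percent
# keywords correct. Matching details are assumptions.

def percent_keywords_correct(lists):
    """lists: [(keywords, response_words), ...], three keywords per
    sentence; a keyword counts as correct if it occurs in the
    transcribed verbal response."""
    correct = total = 0
    for keywords, response in lists:
        words = {w.lower() for w in response}
        correct += sum(1 for k in keywords if k.lower() in words)
        total += len(keywords)
    return 100.0 * correct / total

# Example with one hypothetical (English-glossed) sentence.
score = percent_keywords_correct([
    (["boy", "ball", "garden"], ["the", "boy", "kicked", "a", "ball"]),
])
print(f"{score:.0f}% keywords correct")  # 67%
```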
The visual database is identical to the database used for the optical measurements, but without the color markings in the face. The speaker is a Swedish male from Stockholm. During recording of the database, the speech rate was kept constant by prompting the speaker with text-to-speech synthesis set to a normal speech rate.
Results
The results for the VCV corpus showed a significant gain in intelligibility when a face was added to the natural voice. The result for the natural voice alone was 31% correct responses. Adding a synthetic face improved the result to 54% correct; with a natural face it was 57%. Results for the sentence corpus are presented in Table 1. The benefit of adding a synthetic face varied considerably between subjects, probably due to a combination of the type of hearing loss and the ability to speechread. Some subjects scored very close with the synthetic face to their result with the natural face. On the other hand, one subject with a profound hearing loss seemed to get very little information from the synthetic face, although the same subject got 82% keywords correct with the natural face (Figure 1).
Second test with hearing impaired subjects
Subjects and method
The second group consisted of 12 adult subjects with a mild to profound hearing loss (11 HA users, 1 CI user), and 3 subjects who participated in the first round and were now retested. The subjects were between 37 and 82 years old (mean 59.1), with a mean hearing loss of 83.2 dB HL (range 32-113 dB). The method and corpus differed from those of the first round in the following ways: a) Only the sentence corpus was used; the sentence lists were scrambled compared to round one. b) The audio signal was filtered to telephone bandwidth in order to bring the test conditions closer to those of the intended Teleface application (a sketch of such filtering is given after this list). c) Visual distraction was introduced as one condition of the test, without the subjects' knowledge, by displaying the synthetic face with articulatory movements generated by a simple phoneme recognizer not trained on our corpus. d) A visual-only condition was introduced; this had not previously been tried with the synthetic face.
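The paper does not specify the filter used for condition b); purely as an illustration, a minimal sketch of band-limiting a speech signal to telephone bandwidth (roughly 300-3400 Hz) with a standard Butterworth filter. The filter order and exact cutoffs are assumptions.

```python
# Hedged sketch: filtering a speech signal to telephone bandwidth
# (~300-3400 Hz), illustrating condition (b). Filter type, order and
# cutoffs are assumptions, not the paper's actual setup.
import numpy as np
from scipy import signal

def telephone_band(x, fs):
    """Band-pass a mono signal x (sampled at fs Hz) to ~300-3400 Hz
    with a 4th-order Butterworth filter applied forward-backward."""
    sos = signal.butter(4, [300, 3400], btype="bandpass", fs=fs,
                        output="sos")
    return signal.sosfiltfilt(sos, x)

# Example: one second of white noise at 16 kHz.
fs = 16000
y = telephone_band(np.random.randn(fs), fs)
```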
Table 1. Results for the sentence corpus (% keywords correct). The last row shows results from the second round with the articulation generated by a phoneme recognizer.
For the retested subjects, we found no significant difference between the results from the first and the second round. Given the characteristics of their hearing loss, filtering the speech signal to telephone bandwidth should make little difference in intelligibility, so results from the first and the second round can be compared. However, comparisons between the two groups should be made with caution, since the individual differences in hearing loss are large.
Results from the second round are presented in Table 1. As in the first round, the differences between individual subjects are large. The visual distraction generated with the phoneme recognizer was obvious to the subjects with a profound hearing loss, and they got a lower score in that condition. For subjects with a mild hearing loss, the distracted visual stimuli did not affect the results much, probably because these subjects relied on their hearing.
The answers from the questionnaire showed that the subjective opinions about the synthetic face match the objective results for the subjects with a severe to profound hearing loss (Figure 1). The milder the hearing loss, the larger the variation between the subjective and objective results. The retested subjects rated the benefit of the synthetic face slightly lower in the second round than in the first. The reason is probably the sometimes incorrect articulatory movements that we introduced as one condition of the tests without the subjects' knowledge.
Every subject was presented with two lists of sentences per audio-visual combination. We found a significant learning effect between the first and the second list with the synthetic face, but not with the natural face. The average over all subjects was 62% keywords correct for the first list and 71% for the second.
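The paper does not name the significance test used for the learning effect; one plausible analysis, sketched here with made-up scores, is a paired t-test on per-subject results for the two synthetic-face lists (the actual analysis may have differed).

```python
# Hedged sketch: testing for a learning effect between list 1 and
# list 2 in the synthetic-face condition. The scores are made-up
# illustrations, not the paper's data; the paper reports means of 62%
# and 71% keywords correct but does not name its significance test.
from scipy import stats

# One percent-keywords-correct score per subject and list (hypothetical).
list1 = [55, 60, 48, 70, 63, 58, 66, 71, 52, 64, 69, 68]
list2 = [64, 70, 55, 78, 72, 66, 74, 80, 60, 71, 77, 75]

# Paired t-test: the same subjects heard both lists, so we compare
# within-subject differences rather than two independent groups.
t, p = stats.ttest_rel(list2, list1)
print(f"t = {t:.2f}, p = {p:.4f}")
```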
Figure 1. The subjective ratings of benefit (maximum 100%) and objective results of speechreading from the first round (left) and the second round (right). The subjects are presented in order of increasing performance on the audio only condition of the test.
Conclusions and future work
The highest absolute benefit from speechreading a synthetic face is obtained by subjects whose hearing loss places them between 40 and 80% keywords correct in the audio only condition of our experiment. These are the subjects who would benefit most from a visual hearing aid such as the intended Teleface application. Subjects with a lower score often show a good gain from adding a synthetic face to the natural voice, but even so they will probably not benefit enough in a telephone situation, since their starting point is too low (Figure 2). Subjects scoring above approximately 80% keywords correct seem to rely on their hearing, and therefore do not gain much from adding a face, synthetic or natural, in our experiment. We have shown earlier (Beskow et al., 1997) that normal hearing persons benefit from a synthetic face when the audio is degraded by adding white noise. This most probably also applies to persons with a hearing loss.
Figure 2. The bars show the contribution of a synthetic face added to a natural voice, in relation to the performance on the audio only condition (x). The subjects are presented in order of increasing performance on the audio only condition of the test.
The results for the first phase of the project presuppose a perfect mapping from acoustics to facial gestures. Preliminary studies in the second round of phase one, with a simple phoneme recognizer not trained on our database, show that displaying poor articulatory movements decreases intelligibility. Current research in this area is therefore focused on how to train a speech recognizer for the special needs of the intended Teleface application.
Acknowledgements
The work is funded by KFB, the Swedish Transport and Communications Research Board, and by CTT, the Centre for Speech Technology, KTH.
References
Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K-E. & Öhman, T. (1997). The Teleface project - multimodal speech communication for the hearing impaired. Proceedings of Eurospeech '97, Rhodos, Greece.
MacLeod, A. & Summerfield, Q. (1990). A procedure for measuring auditory and audiovisual speech-reception thresholds for sentences in noise: rationale, evaluation and recommendations for use. British Journal of Audiology, 24: 29-43.
Öhman, T. (1998). An audio-visual database in Swedish for bimodal speech processing. TMH-QPSR, KTH, 1/1998.