WEB-BASED EDUCATIONAL TOOLS FOR SPEECH
TECHNOLOGY
Kåre Sjölander,
Jonas Beskow, Joakim Gustafson, Erland Lewin, Rolf Carlson, and Björn
Granström
Department of Speech, Music, and Hearing, KTH, Sweden
ABSTRACT
This paper describes the efforts at KTH
in creating educational tools for speech technology. The demand for such
tools is increasing with the advent of speech as a medium for man-machine
communication. The World Wide Web was chosen as our platform in order to
increase the usability and accessibility of our computer exercises. The
aim was to provide dedicated educational software instead of exercises
based on complex research tools. Currently, the set of exercises comprises
basic speech analysis, multi-modal speech synthesis and spoken dialogue
systems. Students access web pages in which the exercises have been embedded
as applets. This makes it possible to use them in a classroom setting,
as well as from the students' home computers.
1. INTRODUCTION
The speech group at KTH has developed
a number of speech technology tools for use in the education of undergraduate
students and researchers in the speech field. Many of these tools have so
far been tied to a particular computer environment and have also required
teacher guidance. During the last year, development work was started on
a toolkit for spoken language technology that can be used over the Internet.
The aim is to free the students from the need of using a particular computer
at a particular time and place.
A speech technology toolkit has been developed
that serves as a basis for the creation of spoken language systems.
This toolkit is partly based on the software technology in our existing
spoken dialogue systems. A new addition is the design of an architecture
for communication between programs on different computers using the Internet,
the Broker Architecture [1]. We have used the speech toolkit to build three
web-based educational modules.
Our courses on speech technology include
an introductory section on basic phonetics and speech analysis. A set of
exercises for this section has been developed in which students analyze
their own speech in various ways.
An interactive tool for working with
parametric speech synthesis has been developed. The tool facilitates editing
of parameter tracks, and it provides real-time feedback of the synthesized
speech. It serves as an interface to KTH's multilingual rule based synthesis
system, and can be used to control a formant synthesizer as well as a 3-D
"talking head".
The integrated lab environment GULAN
[2] for spoken dialogue systems has seen further development in the area
of dialogue management [3] in cooperation with the NLP lab at the University
of Linköping. The system has also been redesigned for web deployment.
2. IMPLEMENTATION
The Broker Architecture relays function
calls, results and error conditions between modules in text form over standard
TCP internet connections. A number of programming languages are used in
the modules of our toolkit.
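The actual wire format of the Broker is documented in [1] and is not reproduced here. As a purely hypothetical illustration of the general idea of relaying calls and results as text, the following Java sketch formats a call as a single text line and parses a reply line; the message layout, the method names, and the CALL/RESULT/ERROR keywords are all assumptions, not the real protocol:

```java
// Hypothetical sketch of text-based message framing in the spirit of the
// Broker Architecture. The layout "CALL <id> <function> <args...>" and the
// reply forms "RESULT <id> <value>" / "ERROR <id> <message>" are invented
// for illustration; the actual KTH wire format may differ.
public class BrokerMessage {
    // Format a remote function call as one newline-free text line.
    public static String formatCall(int id, String function, String... args) {
        StringBuilder sb = new StringBuilder("CALL " + id + " " + function);
        for (String a : args) sb.append(' ').append(a);
        return sb.toString();
    }

    // Parse a reply line, returning the result payload or signalling a
    // remote error condition.
    public static String parseReply(String line) {
        String[] parts = line.split(" ", 3);
        if (parts[0].equals("ERROR")) {
            throw new RuntimeException("remote error: " + parts[2]);
        }
        return parts[2];
    }

    public static void main(String[] args) {
        System.out.println(formatCall(1, "synthesize", "hello")); // CALL 1 synthesize hello
        System.out.println(parseReply("RESULT 1 ok"));            // ok
    }
}
```

In a real deployment, such lines would be written to and read from a TCP socket connecting the module to the Broker.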
Java [4], a powerful object-oriented
programming language, is used in components that require complex data structures.
C/C++ is used where maximum performance
is required, for example, for the speech recognition and speech synthesis
engines.
The Tcl/Tk language [5] is used for
user interfaces and as a glue language in some modules. It was chosen because
of several useful characteristics. It is platform independent and makes
it simple to integrate existing modules and applications. It has powerful
and easy to use networking facilities. The accompanying Tk toolkit provides
facilities for quickly creating GUIs. Being a scripting language it is
easy to change and extend the functionality of applications as well as
to maintain them. The main drawbacks of the Tcl language are its execution
speed and primitive syntax. However, these can be overcome by implementing
complex and time-critical code in a more powerful language. Also, it is
possible to run scripts which are embedded in web pages and which download
quickly because of their relatively compact text representation. All in all,
it is an ideal solution for computer-based instruction and distance learning.
3. SPEECH ANALYSIS MODULE
One of our extensions to the Tcl language
is the Snack speech analysis module [6]. It provides a uniform interface
to the audio hardware on a number of platforms, adding commands to play,
record and manipulate sound in many audio formats, as well as disk I/O
in common audio file formats. Also, it has streaming audio capabilities
which makes it easy to create client/server audio applications. There are
commands to visualize sounds using waveforms, spectrograms, and spectrum
sections. The Snack module serves as a basis when creating customized recording
tools, speech analysis applications, audio annotation tools, demonstrators,
and the like. The module has a powerful and intuitive way of handling sound
as objects. A spectrogram object connected to a sound object will update
automatically and in real time as the sound data changes. The module also
supports PostScript printing, which can be used to produce hard copies
and illustrations. Currently, it is possible to
write platform-independent scripts which run on Unix (Linux, Solaris, HP-UX,
IRIX) and Windows 95/NT using the Snack module. It is also possible to run
scripts embedded in web pages through the use of the Tcl plug-in. The Snack
module can be freely downloaded from http://www.speech.kth.se/SNACK/download.html.
A short example of how this module can be used follows:
#!/usr/local/bin/wish
package require snack
sound snd
pack [spectrogram .s -sound snd -height 200]
pack [button .r -text Record -command {snd record}]
pack [button .t -text Stop -command {snd stop}]
The example creates a simple real-time
spectrogram application. A sound object called snd is created, which
is initially empty. Next, a spectrogram is created and linked to the
sound. Finally, two buttons labeled Record and Stop are created. When clicked,
these buttons execute the commands record and stop, respectively,
of the sound object. As recording commences, the spectrogram updates
in real time to reflect the changing contents of the sound object. There
are numerous options for controlling analysis bandwidth, scales, and similar
properties. The example script could easily be extended with, for example,
the ability to play a recording:
pack [button .p -text Play -command {snd play}]
In order to be able to save the recording
the following line could be added:
pack [button .w -text Save -command {snd write [tk_getSaveFile]}]
Also, the script above would run without
modification if embedded in a web page, except for the save function,
which would require special privileges.
3.1. Speech Analysis Exercises
In our courses on speech technology we
have an introductory section on basic phonetics and speech analysis. For
this section a set of exercises were developed in which students analyze
their own speech in various ways. These exercises are accessed through
web pages, in which simple speech analysis tools have been embedded as
small applications (applets) dedicated to the task at hand (http://www.speech.kth.se/labs/analysis/).
In this way, the exercises could be made available to students working
in our laboratory, at Linköping University, and from their
home computers. The big advantage of using a web browser as a platform
is that all installation issues are solved, except for the download and
installation of plug-ins. Instructions and other useful information can
also accompany the tools in a natural and easily accessible way, using
HTML. A screen-shot of one of the exercises is shown in Figure 1. The exercises
covered measurements of vowel formant frequencies, comparisons of speakers
and speaking styles, Swedish word accent, and phonetic segmentation.
4. SPEECH SYNTHESIS TOOL
An interactive tool for working with parametric
speech synthesis has been developed. The tool facilitates editing of parameter
tracks, and it provides real-time feedback of the synthesized speech. It
serves as an interface to KTH's multilingual rule based synthesis system
[7, 8], and can be used to control a formant synthesizer as well as a 3-D
"talking head" [9]. The tool can run either as a stand-alone application
or as an applet in a Web browser. It has been used in research and education
during the past two years.
The tool gives full control over all
parameters involved in the formant synthesis process, including formant
frequencies and bandwidths, fundamental frequency and voice source parameters.
The user can select a language (Swedish, French, American English or German)
and synthesize arbitrary text, either in orthographic or phonetic mode.
Once the phonetic transcription is generated, the synthesizer produces
the control parameter tracks in a two-step process: first, the phonetic
rules generate a series of control points for each parameter that define
a target track. Next, filters are applied to the target track to create
a smoothed, continuous track to be output to the synthesizer. The filter
type and coefficients may differ between parameters, and the filter coefficients
for a given parameter may be time-varying, under the control of another
parameter.
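This two-step process can be sketched in a few lines of Java, one of the toolkit's implementation languages. The piecewise-constant target expansion and the one-pole low-pass filter below are simplifying assumptions chosen for illustration; they are not the actual KTH rules or filter coefficients:

```java
// Illustrative sketch of the two-step track generation described above:
// rule-generated control points define a target track, which is then
// smoothed by a filter. Both the piecewise-constant targets and the
// one-pole low-pass filter are assumptions, not KTH's actual design.
public class TargetSmoother {
    // Step 1: expand (frame, value) control points into a piecewise-constant
    // target track sampled at every frame.
    public static double[] targetTrack(int frames, int[] times, double[] values) {
        double[] track = new double[frames];
        int seg = 0;
        for (int t = 0; t < frames; t++) {
            if (seg + 1 < times.length && t >= times[seg + 1]) seg++;
            track[t] = values[seg];
        }
        return track;
    }

    // Step 2: one-pole low-pass smoothing. As noted in the text, alpha could
    // itself be made time-varying under the control of another parameter.
    public static double[] smooth(double[] target, double alpha) {
        double[] out = new double[target.length];
        double y = target[0];
        for (int t = 0; t < target.length; t++) {
            y = alpha * y + (1 - alpha) * target[t];
            out[t] = y;
        }
        return out;
    }

    public static void main(String[] args) {
        // A hypothetical F2 target stepping from 1100 Hz to 1700 Hz at frame 5.
        double[] f2 = targetTrack(10, new int[]{0, 5}, new double[]{1100.0, 1700.0});
        double[] smoothed = smooth(f2, 0.5);
        // The smoothed track approaches 1700 Hz gradually after the step.
        System.out.println(smoothed[4] + " " + smoothed[9]);
    }
}
```

The key point the sketch conveys is that the rules specify only sparse targets, while the filtering stage is what yields the continuous, naturally varying parameter tracks seen in the editor panels.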
The main interface, shown in Figure
2, consists of a number of panels that display the parameter tracks, a
time scale, a menu bar, a horizontal scrollbar and a status bar. Each of
the panels has an associated value-scale and controls for vertical zooming
and scrolling. The panels are stacked and aligned vertically, in such a
way that all panels share the same time scale, and horizontal scrolling
affects all panels. Typically, related parameters or parameters of the
same unit are displayed together in one panel. The default configuration
contains three panels, displaying formant parameters, fundamental frequency
and source parameters respectively. The parameter tracks in the formant
panel can be overlaid on top of a spectrogram. Parameter tracks are edited
in an intuitive way by dragging the control points. Control points can
be freely inserted or deleted, and segment durations can be lengthened
or shortened using a time scale at the bottom of the display.
4.1. Speech Synthesis Exercises
Students are given a number of tasks to
accomplish using the editing tool. The first task is to change the identity
of the consonant in a synthesised CV syllable by manipulating the formant
transitions. For example, to change /ba/ to /da/, the transitions of the
second and third formant into the vowel part will have to be changed from
rising to falling. In the second task, the students will apply the knowledge
gained in the previous speech analysis exercises, where they study the
pitch contour of Swedish tones. Using a set of minimal pairs with respect
to tone, the task is to synthesise the first word in the pair and manipulate
the F0 contour to arrive at the other word. A similar exercise involves
changing the meaning of a word by modifying vowel length, vowel quality
and stress. The last task is to experiment with prosodic modifications
at sentence level, such as changing a statement into a question.
Formant based synthesis is sometimes
compared to more commercially popular synthesis methods based on concatenation
of segments recorded from natural speech. Concatenation based synthesis
can offer high voice quality, but is limited in flexibility. Typically,
only pitch and duration can be altered freely. In contrast, we feel that
formant-based synthesis has significant pedagogical value. By using a
parametric synthesis paradigm based on a familiar phonetic representation,
controlled from an intuitive graphical interface, exercises can be designed
that give the students a deeper understanding not only of fundamental
speech synthesis techniques, but also of acoustic-phonetic correlates
in general.
5. EDUCATIONAL DIALOGUE SYSTEM
The educational dialogue system GULAN
[2] has been redesigned into an application which is accessed through a
web page. In this dialogue system users can make simple queries in the
web-based Yellow Pages on selected topics using speech. Results are presented
using a combination of synthesized speech and an interactive map. Our aim
is to give the students hands-on experience by letting them use the system
on their own, examining it in detail, and extending its functionality.
In this way, we hope to give them an understanding of the problems and
issues involved in building dialogue systems and to spur their interest
for spoken dialogue technology and its possibilities. Recently, the system
has been equipped with an improved dialogue manager, described in [3].
The system also makes extensive use of the Broker for modules such as
recognition and synthesis, in order to keep the system lightweight.
5.1. Dialogue System Exercises
The students were given a set of tasks
to complete. First of all, they had to use the system in order to figure
out what it could and could not do. This also included experimenting with
the speech recognition component itself in order to understand its current
limitations regarding, for example, speaking style, vocabulary, and grammar.
The principal task was to extend the system with new fields from the Yellow
Pages. New words and phrases had to be added to the lexicon and the grammar
had to be modified accordingly. This is an interactive process where the
students can listen to the transcriptions using the text-to-speech system.
Immediately after they have loaded the updated lexicon into the running
recognizer they can use the new words. They also have to extend the text
generation capabilities in order to handle the new fields. The students
could also modify the prosodic patterns of the synthesized responses.
6. CONCLUSIONS AND FUTURE WORK
In this paper, some of our efforts in creating
web-based educational tools for speech technology have been presented.
Freeing our students from the need to use certain computers or special
laboratory set-ups at certain hours has proved highly useful, and indeed
necessary, considering that the total number of students has risen from
about 20 to 150 in two years, and that one of our courses was given at
Linköping University.
Much remains to be done, but the basic
framework has shown its strength. Our current systems will be continuously
developed and updated, and their scope will also be widened. The
speech analysis module will be extended in order to make resynthesis possible
in conjunction with the speech synthesis tool. The educational dialogue
system will be improved. Modules for multimodal synthesis and prosodic
analysis will be added, as well as dialogue dependent speech recognition
and speech synthesis. A main focus will be the continued development of
the highly flexible dialogue manager in cooperation with Linköping
University.
We believe that the Internet
will play an increasingly important role in making speech technology available
anywhere for educational and co-operative purposes. Our investment in the
web-based modular approach has already paid off in terms of effortless
portability and easy implementation of demonstrators.
7. ACKNOWLEDGEMENTS
This work was in part supported by the
Centre for Speech Technology (CTT) and the Swedish Language Technology
Program.
8. REFERENCES
1. Lewin, E., "The Broker Architecture," http://www.speech.kth.se/proj/broker/.
2. Sjölander, K., and Gustafson, J., "An Integrated System for Teaching Spoken Dialogue Systems Technology," Proceedings of Eurospeech '97, Rhodes, Greece, 1997.
3. Gustafson, J., Elmberg, P., Carlson, R., and Jönsson, A., "An educational dialogue system with a user controllable dialogue manager," Proceedings of ICSLP '98, Sydney, Australia, 1998.
4. Sun Microsystems, Inc., "The Java Technology Homepage," http://java.sun.com/.
5. Ousterhout, J. K., "Tcl and the Tk Toolkit," Addison-Wesley, ISBN 3-89319-793-1, 1994.
6. Sjölander, K., "The Snack Sound Visualization Module," http://www.speech.kth.se/SNACK/.
7. Carlson, R., Granström, B., and Hunnicutt, S., "A multi-language text-to-speech module," Proceedings of ICASSP '82, Paris, Vol. 3, pp. 1604-1607, 1982.
8. Carlson, R., Granström, B., and Karlsson, I., "Experiments with voice modelling in speech synthesis," Speech Communication 10, pp. 481-489, 1991.
9. Beskow, J., "Rule-based Visual Speech Synthesis," Proceedings of Eurospeech '95, Madrid, Spain, September 1995.