
Giampiero Salvi :: Publications



Kumar Dhaka, A., & Salvi, G. (2016). Optimising The Input Window Alignment in CD-DNN Based Phoneme Recognition for Low Latency Processing. CoRR, abs/1606.09163. [link]

Kumar Dhaka, A., & Salvi, G. (2016). Semi-supervised Learning with Sparse Autoencoders in Phone Classification. CoRR, abs/1610.00520. [link]


Lopes, J., Salvi, G., Skantze, G., Abad, A., Gustafson, J., Batista, F., Meena, R., & Trancoso, I. (2015). Detecting Repetitions in Spoken Dialogue Systems Using Phonetic Distances. In Proceedings of Interspeech 2015 (pp. 1805-1809). Dresden, Germany. [abstract] [pdf]
Abstract: Repetitions in Spoken Dialogue Systems can be a symptom of problematic communication. Such repetitions are often due to speech recognition errors, which in turn make it harder to use the output of the speech recognizer to detect repetitions. In this paper, we combine the alignment score obtained using phonetic distances with dialogue-related features to improve repetition detection. To evaluate the proposed method we compare several alignment techniques, from edit distance to DTW-based distance, previously used in spoken-term detection tasks. We also compare two different methods to compute the phonetic distance: the first using the phoneme sequence, and the second using the distance between the phone posterior vectors. Two different datasets were used in this evaluation: a bus-schedule information system (in English) and a call routing system (in Swedish). The results show that approaches using phonetic distances outperform approaches using Levenshtein distances between ASR outputs for repetition detection.

Salvi, G. (2015). An Analysis of Shallow and Deep Representations of Speech Based on Unsupervised Classification of Isolated Words. In Proceedings of Nonlinear Speech Processing. [abstract]
Abstract: We analyse the properties of shallow and deep representations of speech. Mel frequency cepstral coefficients (MFCC) are compared to representations learned by a four-layer Deep Belief Network (DBN) in terms of discriminative power and invariance to irrelevant factors such as speaker identity or gender. To avoid the influence of supervised statistical modelling, an unsupervised isolated word classification task is used for the comparison. The deep representations are also obtained with unsupervised training (no back-propagation pass is performed). The results show that DBN features provide a more concise clustering and a higher match between clusters and word categories in terms of adjusted Rand score. Some of the confusions present with the MFCC features are, however, retained even with the DBN features.

Strömbergsson, S., Salvi, G., & House, D. (2015). Acoustic and perceptual evaluation of category goodness of /t/ and /k/ in typical and misarticulated children's speech. Journal of the Acoustical Society of America, 137(6), 3422-3435. [abstract] [pdf]
Abstract: This investigation explores perceptual and acoustic characteristics of children's successful and unsuccessful productions of /t/ and /k/, with the specific aim of exploring perceptual sensitivity to phonetic detail, and the extent to which this sensitivity is reflected in the acoustic domain. Recordings were collected from 4- to 8-year-old children with a speech sound disorder (SSD) who misarticulated one of the target plosives, and compared to productions recorded from peers with typical speech development (TD). Perceptual responses were registered on a visual-analog scale, ranging from “clear [t]” to “clear [k].” Statistical models of prototypical productions were built, based on spectral moments and discrete cosine transform features, and used in the scoring of SSD productions. In the perceptual evaluation, “clear substitutions” were rated as less prototypical than correct productions. Moreover, target-appropriate productions of /t/ and /k/ produced by children with SSD were rated as less prototypical than those produced by TD peers. The acoustic modelling could to a large extent discriminate between the gross categories /t/ and /k/, and scored the SSD utterances on a continuous scale that was largely consistent with the category of production. However, none of the methods exhibited the same sensitivity to phonetic detail as the human listeners.
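As an aside, the phoneme-sequence comparison mentioned in the Interspeech 2015 abstract can be illustrated with a plain Levenshtein (edit) distance between two ASR phoneme hypotheses. This is a generic sketch of the standard algorithm, not code from the paper; the function names and the normalised score are invented for the example:

```python
def levenshtein(a, b):
    """Edit distance between two phoneme sequences (lists of symbols)."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (pa != pb)))    # substitution
        prev = cur
    return prev[-1]

def repetition_score(seq1, seq2):
    """Similarity in [0, 1]; values near 1 suggest a (near-)repetition."""
    d = levenshtein(seq1, seq2)
    return 1.0 - d / max(len(seq1), len(seq2), 1)

# two hypotheses differing in one phone: distance 1, score 2/3
print(levenshtein(["k", "a", "t"], ["k", "a", "p"]))  # 1
```

A DTW-based variant, as compared in the paper, would replace the 0/1 substitution cost with a distance between phone posterior vectors.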


Pieropan, A., Salvi, G., Pauwels, K., & Kjellström, H. (2014). A dataset of human manipulation actions. In Proc. of the IEEE International Conference on Robotics and Automation (ICRA). Hong Kong, China.

Pieropan, A., Salvi, G., Pauwels, K., & Kjellström, H. (2014). Audio-Visual Classification and Detection of Human Manipulation Actions. In Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Chicago, Illinois. [pdf]

Salvi, G., & Vanhainen, N. (2014). The WaveSurfer Automatic Speech Recognition Plugin. In Proceedings of LREC. Reykjavik, Iceland. [pdf]

Strömbergsson, S., Salvi, G., & House, D. (2014). Gradient evaluation of /k/-likeness in typical and misarticulated child speech. In Proc. of ICPLA 2014. Stockholm, Sweden. [abstract] [pdf]
Abstract: Phonetic transcription is an important instrument in the evaluation of misarticulated speech. However, this instrument is not sensitive to fine acoustic-phonetic detail – information that can provide insight into the processes underlying speech production [1]. An objective and fine-grained measure of children’s efforts at producing a specific speech target would be clinically valuable, both in assessment and when monitoring progress in therapy. Here, we describe the first steps towards such a measure. This study describes the perceptual and acoustic evaluation of children’s successful and inaccurate efforts at producing /k/. A corpus of 2990 recordings of isolated words, beginning with either /tV/ or /kV/, produced by 4-8-year-old children, was used. The recordings were labelled according to whether they were a) correct productions, b) clear substitutions (i.e. [t] for /k/, or [k] for /t/), or c) intermediate productions between [t] and [k]. In the perceptual evaluation, 10 adult listeners judged 120 typical and misarticulated productions of /t/ and /k/ on a scale from “clear /t/” to “clear /k/”. The listeners utilized the whole scale, thus exhibiting sensitivity to sub-phonemic detail. This finding demonstrates that listeners perceive more detail than is conveyed in phonetic transcription. However, despite their experience of evaluating misarticulated child speech, the listeners did not discriminate between correct productions and clear substitutions, i.e. they did not distinguish successful productions of [t] for /t/ from cases where [t] was a misarticulated production of /k/ (and vice versa). In order to explore the existence of covert contrasts, i.e. sub-perceptual differentiation between correct productions and clear substitutions, acoustic analysis was performed. Here, a frequently described approach [2] to the analysis of voiceless plosives was compared to more recent methods. We report on the performance of the different methods, regarding how well they modelled the human evaluations and their sensitivity to covert contrast.

Vanhainen, N., & Salvi, G. (2014). Free Acoustic and Language Models for Large Vocabulary Continuous Speech Recognition in Swedish. In Proceedings of LREC. Reykjavik, Iceland. [pdf]

Vanhainen, N., & Salvi, G. (2014). Pattern Discovery in Continuous Speech Using Block Diagonal Infinite HMM. In Proceedings of ICASSP. Florence, Italy. [pdf]
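The spectral moments used as features in the JASA 2015 and ICPLA 2014 studies above are a standard description of plosive burst spectra. The following sketch illustrates the general technique only; it is not code from the papers, and the function name and parameters are chosen for the example:

```python
import numpy as np

def spectral_moments(frame, sr):
    """First four spectral moments of a speech frame:
    centroid (Hz), variance, skewness, excess kurtosis."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    p = spec / spec.sum()                              # spectrum as a pdf
    m1 = np.sum(freqs * p)                             # centroid
    m2 = np.sum((freqs - m1) ** 2 * p)                 # variance
    m3 = np.sum((freqs - m1) ** 3 * p) / m2 ** 1.5     # skewness
    m4 = np.sum((freqs - m1) ** 4 * p) / m2 ** 2 - 3   # excess kurtosis
    return m1, m2, m3, m4
```

For a pure tone, the centroid lands at the tone frequency; for a real /t/ burst it typically sits higher in frequency than for /k/, which is what makes these moments useful for scoring /t/-/k/ category goodness.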


Franovic, T., Herman, P., Salvi, G., Benjaminsson, S., & Lansner, A. (2013). Cortex-inspired network architecture for large-scale temporal information processing. In Frontiers in Neuroinformatics.

Koniaris, C., Salvi, G., & Engwall, O. (2013). On Mispronunciation Analysis of Individual Foreign Speakers Using Auditory Periphery Models. Speech Communication, 55(5), 691-706. [abstract]
Abstract: In second language (L2) learning, a major difficulty is to discriminate between the acoustic diversity within an L2 phoneme category and that between different categories. We propose a general method for automatic diagnostic assessment of the pronunciation of non-native speakers based on models of the human auditory periphery. Considering each phoneme class separately, the geometric shape similarity between the native auditory domain and the non-native speech domain is measured. The phonemes that deviate the most from the native pronunciation for a set of L2 speakers are detected by comparing the geometric shape similarity measure with that calculated for native speakers on the same phonemes. To evaluate the system, we have tested it with different non-native speaker groups from various language backgrounds. The experimental results are in accordance with linguistic findings and human listeners’ ratings, particularly when both the spectral and temporal cues of the speech signal are utilized in the pronunciation analysis.

Neiberg, D., Salvi, G., & Gustafson, J. (2013). Semi-supervised methods for exploring the acoustics of simple productive feedback. Speech Communication, 55(3), 451-469.

Oertel, C., Salvi, G., Götze, J., Edlund, J., Gustafson, J., & Heldner, M. (2013). The KTH Games Corpora: How to Catch a Werewolf. In IVA 2013 Workshop Multimodal Corpora: Beyond Audio and Video - MMC 2013. [pdf]

Oertel, C., & Salvi, G. (2013). A Gaze-based Method for Relating Group Involvement to Individual Engagement in Multimodal Multiparty Dialogue. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI). Sydney, Australia. [abstract] [pdf]
Abstract: This paper is concerned with modelling individual engagement and group involvement, as well as their relationship, in an eight-party, multimodal corpus. We propose a number of features (presence, entropy, symmetry and maxgaze) that summarise different aspects of eye-gaze patterns and allow us to describe individual as well as group behaviour in time. We use these features to define similarities between the subjects, and we compare this information with the engagement rankings the subjects expressed at the end of each interaction about themselves and the other participants. We analyse how these features relate to four classes of group involvement and we build a classifier that is able to distinguish between those classes with 71% accuracy.

Salvi, G. (2013). Biologically Inspired Methods for Automatic Speech Understanding. In Advances in Intelligent Systems and Computing (AISC) (pp. 283). Palermo, Italy. [abstract]
Abstract: Automatic Speech Recognition (ASR) and Understanding (ASU) systems heavily rely on machine learning techniques to solve the problem of mapping spoken utterances into words and meanings. The statistical methods employed, however, greatly deviate from the processes involved in human language acquisition in a number of key aspects. Although ASR and ASU have recently reached a level of accuracy that is sufficient for some practical applications, there are still severe limitations due, for example, to the amount of training data required and the lack of generalization of the resulting models. In our opinion, there is a need for a paradigm shift, and speech technology should address some of the challenges that humans face when learning a first language and that are currently ignored by ASR and ASU methods. In this paper, we point out some of the aspects that could lead to more robust and flexible models, and we describe some of the research we and other researchers have performed in the area.

Saponaro, G., Salvi, G., & Bernardino, A. (2013). Robot Anticipation of Human Intentions through Continuous Gesture Recognition. In Proc. 4th International Workshop on Collaborative Robots and Human Robot Interaction (CR-HRI 2013). San Diego, USA. [pdf]


Koniaris, C., Engwall, O., & Salvi, G. (2012). Auditory and Dynamic Modeling Paradigms to Detect L2 Mispronunciations. In Interspeech 2012. Portland, OR, USA. [abstract] [pdf]
Abstract: This paper expands our previous work on automatic pronunciation error detection that exploits knowledge from psychoacoustic auditory models. The new system has two additional important features, i.e., auditory and acoustic processing of the temporal cues of the speech signal, and classification feedback from a trained linear dynamic model. We also perform a pronunciation analysis by considering the task as a classification problem. Finally, we evaluate the proposed methods by conducting a listening test on the same speech material and comparing the judgment of the listeners with that of the methods. The automatic analysis based on spectro-temporal cues is shown to have the best agreement with the human evaluation, particularly with that of language teachers, and with previous plenary linguistic studies.

Koniaris, C., Engwall, O., & Salvi, G. (2012). On the Benefit of Using Auditory Modeling for Diagnostic Evaluation of Pronunciations. In Inter. Symp. on Auto. Detect. Errors in Pronunc. Training (IS ADEPT), 2012 (pp. 59-64). Stockholm, Sweden. [abstract] [pdf]
Abstract: In this paper we demonstrate that a psychoacoustic-model-based distance measure performs better than a speech signal distance measure in assessing the pronunciation of individual foreign speakers. The experiments show that the perceptual-based method performs not only quantitatively better than a speech spectrum-based method, but also qualitatively better, hence showing that auditory information is beneficial in the task of pronunciation error detection. We first present the general approach of the method, which uses the dissimilarity between the native perceptual domain and the non-native speech power spectrum domain. The problematic phonemes for a given non-native speaker are determined by the degree of disparity between the dissimilarity measure for the non-native and a group of native speakers. The two methods compared here are applied to different groups of non-native speakers of various language backgrounds and validated against a theoretical linguistic study.

Salvi, G., Montesano, L., Bernardino, A., & Santos-Victor, J. (2012). Language bootstrapping: Learning word meanings from perception-action association. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 42(3), 660-671. [abstract] [pdf]
Abstract: We address the problem of bootstrapping language acquisition for an artificial system, similarly to what is observed in experiments with human infants. Our method works by associating meanings to words in manipulation tasks, as a robot interacts with objects and listens to verbal descriptions of the interactions. The model is based on an affordance network, i.e., a mapping between robot actions, robot perceptions and the perceived effects of these actions upon objects. We extend the affordance model to incorporate spoken words, which allows us to ground the verbal symbols to the execution of actions and the perception of the environment. The model takes verbal descriptions of a task as the input, and uses temporal co-occurrence to create links between speech utterances and the involved objects, actions and effects. We show that the robot is able to form useful word-to-meaning associations, even without considering grammatical structure in the learning process and in the presence of recognition errors. These word-to-meaning associations are embedded in the robot’s own understanding of its actions. Thus, they can be directly used to instruct the robot to perform tasks and also allow the incorporation of context in the speech recognition task. We believe that the encouraging results with our approach may afford robots the capacity to acquire language descriptors in their operating environment, as well as shed some light on how this challenging process develops in human infants.

Vanhainen, N., & Salvi, G. (2012). Word Discovery with Beta Process Factor Analysis. In Proceedings of Interspeech. Portland, Oregon. [abstract] [pdf]
Abstract: We propose the application of a recently developed non-parametric Bayesian method for factor analysis to the problem of word discovery from continuous speech. The method, based on Beta Process priors, has a number of advantages compared to previously proposed methods, such as Non-negative Matrix Factorisation (NMF). Beta Process Factor Analysis (BPFA) is able to estimate the size of the basis, and therefore the number of recurring patterns, or word candidates, found in the data. We compare the results obtained with BPFA and NMF on the TIDigits database, showing that our method is capable of not only finding the correct words, but also the correct number of words. We also show that the method can infer the approximate number of words for different vocabulary sizes by testing on randomly generated sequences of words.


Ananthakrishnan, G., & Salvi, G. (2011). Using Imitation to learn Infant-Adult Acoustic Mappings. In Proceedings of Interspeech (pp. 765-768). Florence, Italy. [abstract] [pdf]
Abstract: This paper discusses a model which conceptually demonstrates how infants could learn the normalization between infant-adult acoustics. The model proposes that the mapping can be inferred from the topological correspondences between the adult and infant acoustic spaces, which are clustered separately in an unsupervised manner. The model requires feedback from the adult in order to select the right topology for clustering, which is a crucial aspect of the model. The feedback is in terms of an overall rating of the imitation effort by the infant, rather than a frame-by-frame correspondence. Using synthetic, but continuous, speech data, we demonstrate that clusters which have a good topological correspondence are perceived to be similar by a phonetically trained listener.

Lindblom, B., Diehl, R., Park, S-H., & Salvi, G. (2011). Sound systems are shaped by their users: The recombination of phonetic substance. In Clements, G. N., & Ridouane, R. (Eds.), Where Do Phonological Features Come From?: Cognitive, physical and developmental bases of distinctive speech categories. CNRS & Sorbonne-Nouvelle. [abstract] [pdf]
Abstract: Computational experiments were run using an optimization criterion based on independently motivated definitions of perceptual contrast, articulatory cost and learning cost. The question: if stop+vowel inventories are seen as adaptations to perceptual, articulatory and developmental constraints, what would they be like? Simulations successfully predicted typologically widely observed place preferences and the re-use of place features (‘phonemic coding’) in voiced stop inventories. These results demonstrate the feasibility of user-based accounts of phonological facts and indicate the nature of the constraints that over time might shape the formation of both the formal structure and the intrinsic content of sound patterns. While phonetic factors are commonly invoked to account for substantive aspects of phonology, their explanatory scope is here also extended to a fundamental attribute of its formal organization: the combinatorial re-use of phonetic content. Keywords: phonological universals; phonetic systems; formal structure; intrinsic content; behavioral origins; substance-based explanation

Salvi, G., & Al Moubayed, S. (2011). Spoken Language Identification using Frame Based Entropy Measures. TMH-QPSR, 51(1), 69-72. [pdf]

Salvi, G., Tesser, F., Zovato, E., & Cosi, P. (2011). Analisi Gerarchica degli Inviluppi Spettrali Differenziali di una Voce Emotiva [Hierarchical Analysis of Differential Spectral Envelopes of an Emotional Voice]. In Contesto comunicativo e variabilità nella produzione e percezione della lingua (AISV). Lecce, Italy.


Salvi, G., Tesser, F., Zovato, E., & Cosi, P. (2010). Cluster Analysis of Differential Spectral Envelopes on Emotional Speech. In Proceedings of Interspeech (pp. 322-325). Makuhari, Japan. [abstract] [pdf]
Abstract: This paper reports on the analysis of the spectral variation of emotional speech. Spectral envelopes of time-aligned speech frames are compared between emotionally neutral and active utterances. Statistics are computed over the resulting differential spectral envelopes for each phoneme. Finally, these statistics are classified using agglomerative hierarchical clustering and a measure of dissimilarity between statistical distributions, and the resulting clusters are analysed. The results show that there are systematic changes in spectral envelopes when going from neutral to sad or happy speech, and those changes depend on the valence of the emotional content (negative, positive) as well as on the phonetic properties of the sounds such as voicing and place of articulation.


Al Moubayed, S., Beskow, J., Öster, A-M., Salvi, G., Granström, B., van Son, N., Ormel, E., & Herzke, T. (2009). Studies on Using the SynFace Talking Head for the Hearing Impaired. In Proceedings of Fonetik'09. Dept. of Linguistics, Stockholm University, Sweden. [abstract] [pdf]
Abstract: SynFace is a lip-synchronized talking agent which is optimized as a visual reading support for the hearing impaired. In this paper we present the large-scale hearing-impaired user studies carried out for three languages in the Hearing at Home project. The user tests focus on measuring the gain in Speech Reception Threshold in Noise and the effort scaling when SynFace is used by hearing-impaired people, where groups of hearing-impaired subjects with impairment levels from mild to severe, including cochlear implants, are tested. Preliminary analysis of the results does not show a significant gain in SRT or in effort scaling. However, given the large cross-subject variability in both tests, it is clear that many subjects benefit from SynFace, especially with speech in stereo babble.

Al Moubayed, S., Beskow, J., Öster, A., Salvi, G., Granström, B., van Son, N., & Ormel, E. (2009). Virtual Speech Reading Support for Hard of Hearing in a Domestic Multi-media Setting. In Proceedings of Interspeech 2009.

Beskow, J., Salvi, G., & Al Moubayed, S. (2009). SynFace - Verbal and Non-verbal Face Animation from Audio. In Proceedings of The International Conference on Auditory-Visual Speech Processing AVSP'09. Norwich, England. [abstract] [pdf]
Abstract: We give an overview of SynFace, a speech-driven face animation system originally developed for the needs of hard-of-hearing users of the telephone. For the 2009 LIPS challenge, SynFace includes not only articulatory motion but also non-verbal motion of gaze, eyebrows and head, triggered by detection of acoustic correlates of prominence and cues for interaction control. In perceptual evaluations, both verbal and non-verbal movements have been found to have a positive impact on word recognition scores.

Krunic, V., Salvi, G., Bernardino, A., Montesano, L., & Santos-Victor, J. (2009). Affordance based word-to-meaning association. In IEEE International Conference on Robotics and Automation (ICRA). Kobe, Japan. [abstract] [pdf]
Abstract: This paper presents a method to associate meanings to words in manipulation tasks. We base our model on an affordance network, i.e., a mapping between robot actions, robot perceptions and the perceived effects of these actions upon objects. We extend the affordance model to incorporate words. Using verbal descriptions of a task, the model uses temporal co-occurrence to create links between speech utterances and the involved objects, actions and effects. We show that the robot is able to form useful word-to-meaning associations, even without considering grammatical structure in the learning process and in the presence of recognition errors. These word-to-meaning associations are embedded in the robot’s own understanding of its actions. Thus, they can be directly used to instruct the robot to perform tasks and also allow the incorporation of context in the speech recognition task.

Salvi, G., Beskow, J., Al Moubayed, S., & Granström, B. (2009). SynFace—Speech-Driven Facial Animation for Virtual Speech-Reading Support. EURASIP Journal on Audio, Speech, and Music Processing, 2009. [abstract] [pdf]
Abstract: This paper describes SynFace, a supportive technology that aims at enhancing audio-based spoken communication in adverse acoustic conditions by providing the missing visual information in the form of an animated talking head. Firstly, we describe the system architecture, consisting of a 3D animated face model controlled from the speech input by a specifically optimised phonetic recogniser. Secondly, we report on speech intelligibility experiments with focus on multilinguality and robustness to audio quality. The system, already available for Swedish, English, and Flemish, was optimised for German and for the Swedish wide-band speech quality available in TV, radio, and Internet communication. Lastly, the paper covers experiments with non-verbal motions driven from the speech signal. It is shown that turn-taking gestures can be used to affect the flow of human-human dialogues. We have focused specifically on two categories of cues that may be extracted from the acoustic signal: prominence/emphasis and interactional cues (turn-taking/back-channelling).


Al Moubayed, S., Beskow, J., & Salvi, G. (2008). SynFace Phone Recognizer for Swedish Wideband and Narrowband Speech. In Proceedings of The Second Swedish Language Technology Conference (SLTC). Stockholm, Sweden. [abstract] [pdf]
Abstract: In this paper, we present new results and comparisons for the real-time lip-synchronized talking head SynFace on different Swedish databases and bandwidths. The work involves training SynFace on narrow-band telephone speech from the Swedish SpeechDat, and on the narrow-band and wide-band Speecon corpus. Auditory perceptual tests are being established for SynFace as an audio-visual hearing support for the hearing-impaired. Preliminary results show high recognition accuracy compared to other languages.

Beskow, J., Granström, B., Nordqvist, P., Al Moubayed, S., Salvi, G., Herzke, T., & Schulz, A. (2008). Hearing at Home – Communication support in home environments for hearing impaired persons. In Proceedings of Interspeech 2008. Brisbane, Australia. [abstract] [pdf]
Abstract: The Hearing at Home (HaH) project focuses on the needs of hearing-impaired people in home environments. The project is researching and developing an innovative media-center solution for hearing support, with several integrated features that support perception of speech and audio, such as individual loudness amplification, noise reduction, audio classification and event detection, and the possibility to display an animated talking head providing real-time speechreading support. In this paper we provide a brief project overview and then describe some recent results related to the audio classifier and the talking head. As the talking head expects clean speech input, an audio classifier has been developed for the task of classifying audio signals as clean speech, speech in noise or other. The mean accuracy of the classifier was 82%. The talking head (based on technology from the SynFace project) has been adapted for German, and a small speech-in-noise intelligibility experiment was conducted where sentence recognition rates increased from 3% to 17% when the talking head was present.

Krunic, V., Salvi, G., Bernardino, A., Montesano, L., & Santos-Victor, J. (2008). Associating word descriptions to learned manipulation task models. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Nice, France.

Lindblom, B., Diehl, R., Park, S-H., & Salvi, G. (2008). (Re)use of place features in voiced stop systems: Role of phonetic constraints. In Proceedings of Fonetik 2008. University of Gothenburg. [abstract] [pdf]
Abstract: Computational experiments focused on place of articulation in voiced stops were designed to generate ‘optimal’ inventories of CV syllables from a larger set of ‘possible CV:s’ in the presence of independently and numerically defined articulatory, perceptual and developmental constraints. Across vowel contexts the most salient places were retroflex, palatal and uvular. This was evident from acoustic measurements and perceptual data. Simulation results using the criterion of perceptual contrast alone failed to produce systems with the typologically widely attested set [b] [d] [g], whereas using articulatory cost as the sole criterion produced inventories in which bilabial, dental/alveolar and velar onsets formed the core. Neither perceptual contrast, nor articulatory cost (nor the two combined), produced a consistent re-use of place features (‘phonemic coding’). Only systems constrained by ‘target learning’ exhibited a strong recombination of place features.


Agelfors, E., Beskow, J., Karlsson, I., Kewley, J., Salvi, G., & Thomas, N. (2006). User Evaluation of the SYNFACE Talking Head Telephone. Lecture Notes in Computer Science, 4061, 579-586. [abstract] [pdf]
Abstract: The talking-head telephone, Synface, is a lip-reading support for people with hearing impairment. It has been tested by 49 users with varying degrees of hearing impairment in the UK and Sweden, in lab and home environments. Synface was found to give support to the users, especially in perceiving numbers and addresses, and to be an enjoyable way to communicate. A majority deemed Synface to be a useful product.

Salvi, G. (2006). Dynamic behaviour of connectionist speech recognition with strong latency constraints. Speech Communication, 48(7), 802-818. [abstract] [pdf]
Abstract: This paper describes the use of connectionist techniques in phonetic speech recognition with strong latency constraints. The constraints are imposed by the task of deriving the lip movements of a synthetic face in real time from the speech signal, by feeding the phonetic string into an articulatory synthesiser. Particular attention has been paid to analysing the interaction between the time evolution model learnt by the multi-layer perceptrons and the transition model imposed by the Viterbi decoder, in different latency conditions. Two experiments were conducted in which the time dependencies in the language model (LM) were controlled by a parameter. The results show a strong interaction between the three factors involved, namely the neural network topology, the length of time dependencies in the LM and the decoder latency.

Salvi, G. (2006). Mining Speech Sounds: Machine Learning Methods for Automatic Speech Recognition and Analysis. Doctoral dissertation, KTH, School of Computer Science and Communication. [pdf]

Salvi, G. (2006). Segment boundaries in low latency phonetic recognition. Lecture Notes in Computer Science, 3817, 267-276. [abstract] [pdf]
Abstract: The segment boundaries produced by the Synface low latency phoneme recogniser are analysed. The precision in placing the boundaries is an important factor in the Synface system, as the aim is to drive the lip movements of a synthetic face for lip-reading support. The recogniser is based on a hybrid of recurrent neural networks and hidden Markov models. In this paper we analyse how the look-ahead length in the Viterbi-like decoder affects the precision of boundary placement. The properties of the entropy of the posterior probabilities estimated by the neural network are also investigated in relation to the distance of the frame from a phonetic transition.

Salvi, G. (2006). Segment boundary detection via class entropy measurements in connectionist phoneme recognition. Speech Communication, 48(12), 1666-1676. [abstract] [pdf]
Abstract: This article investigates the possibility of using the class entropy of the output of a connectionist phoneme recogniser to predict time boundaries between phonetic classes. The rationale is that the value of the entropy should increase in proximity of a transition between two segments that are well modelled (known) by the recognition network, since it is a measure of uncertainty. The advantage of this measure is its simplicity, as the posterior probabilities of each class are available in connectionist phoneme recognition. The entropy, and a number of measures based on differentiation of the entropy, are used in isolation and in combination. The decision methods for predicting the boundaries range from simple thresholds to neural-network-based procedures. The different methods are compared with respect to their precision, measured in terms of the ratio between the number C of predicted boundaries within 10 or 20 ms of the reference and the total number of predicted boundaries, and recall, measured as the ratio between C and the total number of reference boundaries.
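The class-entropy measure underlying the two boundary-detection papers above is simple to state: the Shannon entropy of the per-frame posterior distribution over phoneme classes, which peaks where the network is uncertain, i.e. near segment transitions. A minimal sketch of the measure itself (not the decision methods from the papers; names are illustrative):

```python
import numpy as np

def posterior_entropy(post, eps=1e-12):
    """Per-frame entropy (in nats) of a (frames x classes) posterior matrix.
    Peaks in this curve are candidate segment boundaries."""
    p = np.clip(post, eps, 1.0)
    return -np.sum(p * np.log(p), axis=1)

# a confident frame vs. a maximally uncertain one (4 classes)
post = np.array([[0.97, 0.01, 0.01, 0.01],
                 [0.25, 0.25, 0.25, 0.25]])
h = posterior_entropy(post)
# h[1] equals log(4), the maximum possible for 4 classes
```

In practice one would threshold this curve, or its frame-to-frame difference, to predict boundaries; the papers compare several such decision methods.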


Salvi, G. (2005). Advances in regional accent clustering in Swedish. In Proceedings of European Conference on Speech Communication and Technology (Eurospeech) (pp. 2841-2844). Lisbon, Portugal. [abstract] [pdf] Abstract: The regional pronunciation variation in Swedish is analysed on a large database. Statistics over each phoneme and for each region of Sweden are computed using the EM algorithm in a hidden Markov model framework, to overcome the difficulties of transcribing the whole set of data at the phonetic level. The model representations obtained this way are compared using a distance measure in the space spanned by the model parameters, and hierarchical clustering. The regional variants of each phoneme may group with those of any other phoneme, on the basis of their acoustic properties. The log likelihood of the data given the model is shown to display interesting properties regarding the choice of the number of clusters, given a particular level of detail. Discriminative analysis is used to find the parameters that most contribute to the separation between groups, adding an interpretative value to the discussion. Finally, a number of examples are given of phenomena revealed by examining the clustering tree.

Salvi, G. (2005). Ecological language acquisition via incremental model-based clustering. In Proceedings of European Conference on Speech Communication and Technology (Eurospeech) (pp. 1181-1184). Lisbon, Portugal. [abstract] [pdf] Abstract: We analyse the behaviour of Incremental Model-Based Clustering on child-directed speech data, and suggest a possible use of this method to describe the acquisition of phonetic classes by an infant. The effects of two factors are analysed, namely the number of coefficients describing the speech signal, and the frame length of the incremental clustering procedure. The results show that, although the number of predicted clusters varies in different conditions, the classifications obtained are essentially consistent. Different classifications were compared using the variation of information measure.

Salvi, G. (2005). Segment Boundaries in Low Latency Phonetic Recognition. In Proceedings of Non Linear Speech Processing (NOLISP). Barcelona, Spain.
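The variation of information mentioned in the clustering abstract above is a metric between two clusterings of the same items, VI(A;B) = H(A|B) + H(B|A); it is zero exactly when the clusterings agree up to relabelling. A minimal sketch from the joint label distribution (the function name is my own, not from the paper):

```python
import numpy as np
from collections import Counter

def variation_of_information(labels_a, labels_b):
    """VI(A;B) = H(A|B) + H(B|A), computed from the joint label counts."""
    n = len(labels_a)
    assert len(labels_b) == n
    joint = Counter(zip(labels_a, labels_b))
    pa, pb = Counter(labels_a), Counter(labels_b)
    vi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # each joint cell contributes p(a,b) * [log p(a)/p(a,b) + log p(b)/p(a,b)]
        vi += p_ab * (np.log(pa[a] / n / p_ab) + np.log(pb[b] / n / p_ab))
    return vi
```

Unlike raw classification accuracy, this measure needs no mapping between cluster labels, which makes it suited to comparing unsupervised classifications obtained under different conditions.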


Beskow, J., Karlsson, I., Kewley, J., & Salvi, G. (2004). SYNFACE - A talking head telephone for the hearing-impaired. In Miesenberger, K., Klaus, J., Zagler, W., & Burger, D. (Eds.), Computers Helping People with Special Needs (pp. 1178-1186). Springer-Verlag. [abstract] [pdf] Abstract: SYNFACE is a telephone aid for hearing-impaired people that shows the lip movements of the speaker at the other telephone, synchronised with the speech. The SYNFACE system consists of a speech recogniser that recognises the incoming speech and a synthetic talking head. The output from the recogniser is used to control the articulatory movements of the synthetic head. SYNFACE prototype systems exist for three languages: Dutch, English and Swedish, and the first user trials have just started.

Siciliano, C., Williams, G., Faulkner, A., & Salvi, G. (2004). Intelligibility of an ASR-controlled synthetic talking face. Journal of the Acoustical Society of America, 115(5), 2428.

Spens, K-E., Agelfors, E., Beskow, J., Granström, B., Karlsson, I., & Salvi, G. (2004). SYNFACE, a talking head telephone for the hearing impaired. In Proc IFHOH 7th World Congress. Helsinki, Finland.


Karlsson, I., Faulkner, A., & Salvi, G. (2003). SYNFACE - a talking face telephone. In Proc of EuroSpeech 2003 (pp. 1297-1300). Geneva, Switzerland. [abstract] [pdf] Abstract: The primary goal of the SYNFACE project is to make it easier for hearing-impaired people to use an ordinary telephone. This will be achieved by using a talking face connected to the telephone. The incoming speech signal will govern the speech movements of the talking face; hence the talking face will provide lip-reading support for the user. The project will define the visual speech information that supports lip-reading, and develop techniques to derive this information from the acoustic speech signal in near real time for three different languages: Dutch, English and Swedish. This requires the development of automatic speech recognition methods that detect information in the acoustic signal that correlates with the speech movements. This information will govern the speech movements in a synthetic face and synchronise them with the acoustic speech signal. A prototype system is being constructed, containing the results achieved so far in SYNFACE. This system will be tested and evaluated for the three languages by hearing-impaired users. SYNFACE is an IST project (IST-2001-33327) with partners from the Netherlands, UK and Sweden. SYNFACE builds on experiences gained in the Swedish Teleface project.

Salvi, G. (2003). Accent clustering in Swedish using the Bhattacharyya distance. In Proceedings of the International Congress of Phonetic Sciences (ICPhS) (pp. 1149-1152). Barcelona, Spain. [abstract] [pdf] Abstract: In an attempt to improve automatic speech recognition (ASR) models for Swedish, accent variations were considered. These have proved to be important variables in the statistical distribution of the acoustic features usually employed in ASR. The analysis of feature variability has revealed phenomena that are consistent with what is known from phonetic investigations, suggesting that a substantial part of the information about accents could be derived from those features. A graphical interface has been developed to simplify the visualization of the geographical distributions of these phenomena.

Salvi, G. (2003). Truncation Error and Dynamics in Very Low Latency Phonetic Recognition. In Proceedings of Non Linear Speech Processing (NOLISP). Le Croisic, France. [abstract] [pdf] Abstract: The truncation error for a two-pass decoder is analyzed in a problem of phonetic speech recognition under very demanding latency constraints (look-ahead length < 100 ms) and for applications where successive refinements of the hypotheses are not allowed. This is done empirically in the framework of hybrid MLP/HMM models. The ability of recurrent MLPs, as a posteriori probability estimators, to model time variations is also considered, and its interaction with the dynamic modeling in the decoding phase is shown in the simulations.

Salvi, G. (2003). Using accent information in ASR models for Swedish. In Proceedings of European Conference on Speech Communication and Technology (Eurospeech) (pp. 2677-2680). Geneva, Switzerland. [abstract] [pdf] Abstract: In this study accent information is used in an attempt to improve acoustic models for automatic speech recognition (ASR). First, accent-dependent Gaussian models were trained independently. The Bhattacharyya distance was then used in conjunction with agglomerative hierarchical clustering to define optimal strategies for merging those models. The resulting allophonic classes were analyzed and compared with the phonetic literature. Finally, accent “aware” models were built, in which the parametric complexity for each phoneme corresponds to the degree of variability across accent areas and to the amount of training data available for it. The models were compared to models with the same, but evenly spread, overall complexity, showing in some cases a slight improvement in recognition accuracy.
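For two multivariate Gaussians, the Bhattacharyya distance used in the accent-clustering papers above has a closed form. The sketch below shows only the distance computation; feeding the resulting pairwise matrix into agglomerative hierarchical clustering is the general idea, and any detail beyond the standard formula is an assumption, not the papers' exact setup.

```python
import numpy as np

def bhattacharyya_gaussian(mu1, cov1, mu2, cov2):
    """Closed-form Bhattacharyya distance between N(mu1, cov1) and N(mu2, cov2)."""
    cov = 0.5 * (cov1 + cov2)               # average covariance
    diff = mu1 - mu2
    # Mahalanobis-like term for the difference in means
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    # log-determinant term for the difference in covariances
    _, ld = np.linalg.slogdet(cov)
    _, ld1 = np.linalg.slogdet(cov1)
    _, ld2 = np.linalg.slogdet(cov2)
    term2 = 0.5 * (ld - 0.5 * (ld1 + ld2))
    return term1 + term2
```

The distance is zero for identical models and grows with both mean separation and covariance mismatch, which is why it is a natural criterion for deciding which accent-dependent models are similar enough to merge.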


Johansen, F. T., Warakagoda, N., Lindberg, B., Lehtinen, G., Kai, Z., Gank, A., Elenius, K., & Salvi, G. (2000). The COST 249 SpeechDat multilingual reference recogniser. In Gavrilidou, M., Caryannis, G., Markantonatou, S., Piperidis, S., & Stainhaouer, G. (Eds.), Proc. of LREC 2000, 2nd Intl Conf on Language Resources and Evaluation (pp. 1351-1356). Athens, Greece. [pdf]

Lindberg, B., Johansen, F. T., Warakagoda, N., Lehtinen, G., Kai, Z., Gank, A., Elenius, K., & Salvi, G. (2000). A noise robust multilingual reference recogniser based on SpeechDat(II). In Proc of ICSLP 2000, 6th Intl Conf on Spoken Language Processing (pp. 370-373). Beijing. [pdf]


Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Salvi, G., Spens, K-E., & Öhman, T. (1999). A Synthetic Face as a Lip-reading Support for Hearing Impaired Telephone Users - Problems and Positive Results. In Proceedings of the 4th European Conference on Audiology. Oulu, Finland.

Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Salvi, G., Spens, K-E., & Öhman, T. (1999). Two methods for Visual Parameter Extraction in the Teleface Project. In Proceedings of Fonetik. Gothenburg, Sweden.

Agelfors, E., Beskow, J., Granström, B., Lundeberg, M., Salvi, G., Spens, K-E., & Öhman, T. (1999). Synthetic visual speech driven from auditory speech. In Proceedings of Audio-Visual Speech Processing (AVSP). Santa Cruz, USA. [pdf]

Salvi, G. (1999). Developing acoustic models for automatic speech recognition in Swedish. The European Student Journal of Language and Speech. [abstract] [pdf] Abstract: This paper is concerned with automatic continuous speech recognition using trainable systems. The aim of this work is to build acoustic models for spoken Swedish. This is done employing hidden Markov models and using the SpeechDat database to train their parameters. Acoustic modeling has been worked out at a phonetic level, allowing general speech recognition applications, even though a simplified task (digits and natural number recognition) has been considered for model evaluation. Different kinds of phone models have been tested, including context-independent models and two variations of context-dependent models. Furthermore, many experiments have been done with bigram language models to tune some of the system parameters. System performance over various speaker subsets with different sex, age and dialect has also been examined. Results are compared to previous similar studies, showing a remarkable improvement.

Öhman, T., & Salvi, G. (1999). Using HMMs and ANNs for mapping acoustic to visual speech. TMH-QPSR, 40(1-2), 045-050. [abstract] [pdf] Abstract: In this paper we present two different methods for mapping auditory, telephone-quality speech to visual parameter trajectories specifying the movements of an animated synthetic face. In the first method, Hidden Markov Models (HMMs) were used to obtain phoneme strings and time labels. These were then transformed by rules into parameter trajectories for visual speech synthesis. In the second method, Artificial Neural Networks (ANNs) were trained to directly map acoustic parameters to synthesis parameters. Speaker-independent HMMs were trained on a phonetically transcribed telephone speech database. Different underlying units of speech were modelled by the HMMs, such as monophones, diphones, triphones, and visemes. The ANNs were trained on male, female, and mixed speakers. The HMM method and the ANN method were evaluated through audio-visual intelligibility tests with ten hearing-impaired persons, and compared to “ideal” articulations (where no recognition was involved), a natural face, and to the intelligibility of the audio alone. It was found that the HMM method performs considerably better than the audio-alone condition (54% and 34% keywords correct, respectively), but not as well as the “ideal” articulating artificial face (64%). The intelligibility for the ANN method was 34% keywords correct.


Salvi, G. (1998). Developing acoustic models for automatic speech recognition. Master's thesis, KTH, TMH, CTT. [pdf]