Publications by Daniel Neiberg
2013

(2013). Semi-supervised methods for exploring the acoustics of simple productive feedback. Speech Communication, 55(3), 451-469.
2012

(2012). Exploring the Predictability of Non-unique Acoustic-to-Articulatory Mappings. IEEE Transactions on Audio, Speech, and Language Processing. [abstract] [link]
Abstract: This paper explores statistical tools that help analyze the predictability in the acoustic-to-articulatory inversion of speech, using an Electromagnetic Articulography database of simultaneously recorded acoustic and articulatory data. Since it has been shown that speech acoustics can be mapped to non-unique articulatory modes, the variance of the articulatory parameters is not sufficient to understand the predictability of the inverse mapping. We therefore estimate an upper bound on the conditional entropy of the articulatory distribution. This provides a probabilistic estimate of the range of articulatory values (either over a continuum or over discrete non-unique regions) for a given acoustic vector in the database. The analysis is performed for different British/Scottish English consonants with respect to which articulators (lips, jaw or tongue) are important for producing the phoneme. The paper shows that acoustic-articulatory mappings for the important articulators have a low upper bound on the entropy, but can still have discrete non-unique configurations.

(2012). Towards letting machines humming in the right way - prosodic analysis of six functions of short feedback tokens in English. In Fonetik 2012. Göteborg, Sweden. [pdf]

(2012). Modelling Paralinguistic Conversational Interaction: Towards social awareness in spoken human-machine dialogue. Doctoral dissertation, KTH School of Computer Science and Communication. [link]

(2012). Cues to perceived functions of acted and spontaneous feedback expressions. In The Interdisciplinary Workshop on Feedback Behaviors in Dialog. [abstract] [pdf]
Abstract: We present a two-step study where the first part aims to determine the phonemic prior bias (conditioned on "ah", "m-hm", "m-m", "n-hn", "oh", "okay", "u-hu", "yeah" and "yes") in subjects' perception of six feedback functions (acknowledgment, continuer, disagreement, surprise, enthusiasm and uncertainty). The results showed a clear phonemic prior bias for some tokens: e.g., "ah" and "oh" are commonly interpreted as surprise, but "yeah" and "yes" less so. The second part aims to examine determinants of judged typicality, or graded structure, within the six functions of "okay". Typicality was correlated with four determinants: prosodic central tendency within the function (CT); phonemic prior bias as an approximation to frequency instantiation (FI); the posterior, i.e. CT x FI; and judged ideality (ID), i.e. similarity to ideals associated with the goals served by the function. The results tentatively suggest that acted expressions are more effectively communicated, and that the functions of feedback to a greater extent constitute goal-based categories determined by ideals and to a lesser extent a taxonomy determined by CT and FI. However, it is possible to automatically predict typicality with a correlation of r = 0.52 via the posterior.

(2012). Exploring the implications for feedback of a neurocognitive theory of overlapped speech. In The Interdisciplinary Workshop on Feedback Behaviors in Dialog. [abstract] [pdf]
Abstract: Neurocognitive evidence suggests that the cognitive load caused by decoding an interlocutor's speech while one is talking depends on two factors: the type of incoming speech, i.e. non-lexical feedback, lexical feedback or non-feedback; and the duration of the speech segment. This predicts that the fraction of overlap should be high for non-lexical feedback, medium for lexical feedback and low for non-feedback, and that short segments have a higher fraction of overlapped speech than long segments.
By normalizing for duration, it is indeed shown that the fraction of overlap is 32% for non-lexical feedback, 27% for lexical feedback and 12% for non-feedback. Investigating non-feedback tokens for the durational factor shows that the fraction of overlap can be modeled by linear regression on a logarithmic transform of duration, giving R² = 0.57 (p < 0.01, F-test) and a slope b = -0.04 (p < 0.01, t-test). However, it is not enough to take duration into account when modeling overlap in feedback tokens.
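The durational model in the abstract above (overlap fraction regressed on log-transformed duration) can be sketched with ordinary least squares. The function and data below are hypothetical illustrations, not the paper's implementation or data:

```python
import math

def fit_log_duration_model(durations, overlap_fractions):
    """Ordinary least-squares fit of overlap = a + b * log(duration).

    A sketch of the kind of durational model described above; inputs
    would be per-token duration (seconds) and the fraction of the
    token spoken in overlap.
    """
    xs = [math.log(d) for d in durations]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(overlap_fractions) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(xs, overlap_fractions))
    b = sxy / sxx
    a = mean_y - b * mean_x
    # Coefficient of determination R^2.
    ss_res = sum((y - (a + b * x)) ** 2
                 for x, y in zip(xs, overlap_fractions))
    ss_tot = sum((y - mean_y) ** 2 for y in overlap_fractions)
    r2 = 1.0 - ss_res / ss_tot
    return a, b, r2

# Hypothetical data: longer tokens tend to be less overlapped,
# so the fitted slope b comes out negative, as in the abstract.
durations = [0.1, 0.2, 0.4, 0.8, 1.6, 3.2]
fractions = [0.35, 0.30, 0.28, 0.22, 0.18, 0.15]
a, b, r2 = fit_log_duration_model(durations, fractions)
```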
2011

(2011). Syllabification of conversational speech using bidirectional long-short-term memory neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on (pp. 5256-5259). Prague, Czech Republic. [abstract] [pdf]
Abstract: Segmentation of speech signals is a crucial task in many types of speech analysis. We present a novel approach to segmentation at the syllable level, using a Bidirectional Long-Short-Term Memory Neural Network. It performs estimation of syllable nucleus positions based on regression of perceptually motivated input features to a smooth target function. Peak selection is performed to attain valid nucleus positions. Performance of the model is evaluated on the levels of both syllables and the vowel segments making up the syllable nuclei. The general applicability of the approach is illustrated by good results for two common databases - Switchboard and TIMIT - for both read and spontaneous speech, and a favourable comparison with other published results.

(2011). Expression of Affect in Spontaneous Speech: Acoustic Correlates and Automatic Detection of Irritation and Resignation. Computer Speech and Language, 25(1), 84-104. [link]

(2011). Visualizing prosodic densities and contours: Forming one from many. TMH-QPSR, 51(1), 57-60. [abstract] [pdf]
Abstract: This paper summarizes a flora of explorative visualization techniques for prosody developed at KTH. It is demonstrated how analyses can be made which go beyond conventional methodology. Examples are given for turn-taking, affective speech, response tokens and Swedish accent II.

(2011). Tracking pitch contours using minimum jerk trajectories. In INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association. Florence, Italy. [abstract] [pdf]
Abstract: This paper proposes a fundamental frequency tracker, with the specific purpose of comparing the automatic estimates with pitch contours that are sketched by trained phoneticians. The method uses a frequency-domain approach to estimate pitch tracks that form minimum jerk trajectories, thereby mimicking the motor movements of the hand made while sketching. When the fundamental frequencies tracked by the proposed method on the oral and laryngograph signals of the MOCHA-TIMIT database were compared, the correlation was 0.98 and the root mean squared error was 4.0 Hz, slightly better than a state-of-the-art pitch tracking algorithm included in the ESPS. We also demonstrate how the proposed algorithm can be applied when comparing with sketches made by phoneticians of the variation in accent II among the Swedish dialects.

(2011). A Dual Channel Coupled Decoder for Fillers and Feedback. In INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association. Florence, Italy. [abstract] [pdf]
Abstract: This study presents a dual channel decoder capable of modeling cross-speaker dependencies for segmentation and classification of fillers and feedback in conversational speech found in the DEAL corpus. For the same number of Gaussians per state, we show improvement in terms of average F-score for the successive addition of 1) an increased frame rate from 10 ms to 50 ms, 2) Joint Maximum Cross-Correlation (JMXC) features in a single channel decoder, 3) a joint transition matrix which captures dependencies symmetrically across the two channels, and 4) coupled acoustic model retraining symmetrically across the two channels. The final step gives a relative improvement of over 100% for fillers and feedback compared to our previously published results. The F-scores are in a range that makes it possible to use the decoder as both a voice activity detector and an illocutionary act decoder for semi-automatic annotation.

(2011). Predicting Speaker Changes and Listener Responses With and Without Eye-contact. In INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association. Florence, Italy. [abstract] [pdf]
Abstract: This paper compares turn-taking in terms of timing and prediction in human-human conversations under the condition when participants have eye-contact versus when there is no eye-contact, as found in the HCRC Map Task corpus. By measuring between-speaker intervals it was found that a larger proportion of speaker shifts occurred in overlap for the no eye-contact condition.
For prediction we used prosodic and spectral features parametrized by time-varying, length-invariant discrete cosine coefficients. With Gaussian mixture modeling and variations of classifier fusion schemes, we explored the task of predicting whether there is an upcoming speaker change (SC) or not (HOLD) at the end of an utterance (EOU) with a pause lag of 200 ms. The label SC was further split into LRs (listener responses, e.g. back-channels) and other TURN-SHIFTs. The prediction was found to be somewhat easier for the eye-contact condition, for which the average recall rates were 60.57%, 66.35%, and 62.00% for TURN-SHIFT, LR and SC respectively.

(2011). Intra-, Inter-, and Cross-cultural Classification of Vocal Affect. In INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association (pp. 1581-1584). Florence, Italy. [abstract] [pdf]
Abstract: We present intra-, inter- and cross-cultural classifications of vocal expressions. Stimuli were selected from the VENEC corpus and consisted of portrayals of 11 emotions, each expressed with 3 levels of intensity. Classification (nu-SVM) was based on acoustic measures related to pitch, intensity, formants, voice source and duration. Results showed that mean recall across emotions was around 2.4-3 times higher than chance level for both intra- and inter-cultural conditions. For cross-cultural conditions, the relative performance dropped 26%, 32%, and 34% for high, medium, and low emotion intensity, respectively. This suggests that intra-cultural models were more sensitive to mismatched conditions for low emotion intensity. Preliminary results further indicated that recall rate varied as a function of emotion, with lust and sadness showing the smallest performance drops in the cross-cultural condition.

(2011). Online Detection Of Vocal Listener Responses With Maximum Latency Constraints. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on (pp. 5836-5839). Prague, Czech Republic. [abstract] [pdf]
Abstract: When human listeners utter listener responses (e.g. back-channels or acknowledgments) such as 'yeah' and 'mmhmm', interlocutors commonly continue to speak or resume their speech even before the listener has finished the response. This type of interactivity results in frequent speech overlap, which is common in human-human conversation. To allow this type of interactivity to occur between humans and spoken dialog systems, resulting in more human-like, continuous and smoother human-machine interaction, we propose an online classifier which can classify incoming speech as a listener response. We show that it is possible to detect vocal listener responses using maximum latency thresholds of 100-500 ms, thereby obtaining equal error rates ranging from 34% to 28% using an energy-based voice activity detector.

(2011). Continuous Interaction with a Virtual Human. Journal on Multimodal User Interfaces, 4(2), 97-118. [link]
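The equal error rates quoted in the online-detection abstract above are the operating points where the false rejection rate equals the false acceptance rate. A generic sketch of how such a rate can be computed from detector scores follows; the threshold sweep and the score values are illustrative, not the paper's implementation:

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the equal error rate (EER): the threshold at which
    the false rejection rate (FRR) equals the false acceptance rate
    (FAR). Higher scores mean "more likely a listener response".
    """
    candidates = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for thr in candidates:
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(frr - far)
        if gap < best_gap:
            # Report the rate at the threshold where FRR and FAR are closest.
            best_gap, eer = gap, (frr + far) / 2.0
    return eer

# Hypothetical detector scores for listener responses (targets)
# and other speech (non-targets).
targets = [0.9, 0.8, 0.75, 0.6, 0.4]
nontargets = [0.5, 0.3, 0.2, 0.15, 0.1]
eer = equal_error_rate(targets, nontargets)
```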
2010

(2010). Directing conversation using the prosody of mm and mhm. In Proceedings of SLTC 2010. Linköping, Sweden.

(2010). Prosodic cues to engagement in non-lexical response tokens in Swedish. In Proceedings of DiSS-LPSS Joint Workshop 2010. Tokyo, Japan. [pdf]

(2010). Modeling Conversational Interaction Using Coupled Markov Chains. In Proceedings of DiSS-LPSS Joint Workshop 2010. Tokyo, Japan. [pdf]

(2010). Prosodic Characterization and Automatic Classification of Conversational Grunts in Swedish. In Fonetik 2010. Lund.

(2010). The Prosody of Swedish Conversational Grunts. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association (pp. 2562-2565). Makuhari, Chiba, Japan. [pdf]

(2010). Classification of Affective Speech using Normalized Time-Frequency Cepstra. In Fifth International Conference on Speech Prosody (Speech Prosody 2010). Chicago, Illinois, U.S.A. [abstract] [pdf]
Abstract: Subtle temporal and spectral differences between categorical realizations of para-linguistic phenomena (e.g. affective vocal expressions) are hard to capture and describe. In this paper we present a signal representation based on Time-Varying Constant-Q Cepstral Coefficients (TVCQCC) derived for this purpose. A method which utilizes the special properties of the constant-Q transform for mean F0 estimation and normalization is described. The coefficients are invariant to utterance length and, as a special case, a representation for prosody is considered. Speaker-independent classification results using nu-SVM on the Berlin EMO-DB and two closed sets of basic (anger, disgust, fear, happiness, sadness, neutral) and social/interpersonal (affection, pride, shame) emotions recorded by forty professional actors from two English dialect areas are reported. The accuracy for the Berlin EMO-DB was 71.2%, while the accuracy for the first set, including basic emotions, was 44.6%, and for the second set, including basic and social emotions, 31.7%. It was found that F0 normalization boosts the performance, and a combined feature set shows the best performance.

(2010). A Maximum Latency Classifier for Listener Responses. In Proceedings of SLTC 2010. Linköping, Sweden. [abstract]
Abstract: When listener responses such as "yeah", "right" or "mhm" are uttered in a face-to-face conversation, it is not uncommon for the interlocutor to continue to speak in overlap, i.e. before the listener becomes silent. We propose a classifier which can classify incoming speech as a listener response or not before the talk-spurt ends. The classifier is implemented as an upgrade of the Embodied Conversational Agent developed in the SEMAINE project during the eNTERFACE 2010 workshop.
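The constant-Q transform underlying the TVCQCC features above uses geometrically spaced frequency bins with a constant ratio Q of centre frequency to bandwidth. A minimal sketch of that spacing follows; the parameter values are illustrative, not taken from the paper:

```python
def constant_q_bins(f_min, bins_per_octave, n_bins):
    """Centre frequencies f_k = f_min * 2**(k / b) for a constant-Q
    analysis with b bins per octave.

    The quality factor Q = f_k / bandwidth_k is the same for every
    bin, which is what gives the constant-Q transform its name.
    """
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    freqs = [f_min * 2.0 ** (k / bins_per_octave)
             for k in range(n_bins)]
    return q, freqs

# Illustrative parameters: two octaves starting at 55 Hz,
# 12 bins per octave (semitone spacing).
q, freqs = constant_q_bins(f_min=55.0, bins_per_octave=12, n_bins=24)
```

Because the bins are geometrically spaced, a shift in mean F0 becomes a simple translation along the bin axis, which is what makes F0 normalization in this domain straightforward.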
2009

(2009). Cross-modal Clustering in the Acoustic-Articulatory Space. In Fonetik 2009. Stockholm. [abstract] [pdf]
Abstract: This paper explores cross-modal clustering in the acoustic-articulatory space. A method to improve clustering using information from more than one modality is presented. Formants and Electromagnetic Articulography measurements are used to study corresponding clusters formed in the two modalities. A measure for estimating the uncertainty in correspondences between one cluster in the acoustic space and several clusters in the articulatory space is suggested.

(2009). In search of Non-uniqueness in the Acoustic-to-Articulatory Mapping. In INTERSPEECH 2009 - 10th Annual Conference of the International Speech Communication Association (pp. 2799-2802). Brighton, UK. [abstract] [pdf]
Abstract: This paper explores the possibility and extent of non-uniqueness in the acoustic-to-articulatory inversion of speech from a statistical point of view. It proposes a technique to estimate the non-uniqueness, based on finding peaks in the conditional probability function of the articulatory space. The paper corroborates the existence of non-uniqueness in a statistical sense, especially in stop consonants, nasals and fricatives. The relationship between the importance of the articulator position and non-uniqueness at each instance is also explored.

(2009). On Acquiring Speech Production Knowledge from Articulatory Measurements for Phoneme Recognition. In INTERSPEECH 2009 - 10th Annual Conference of the International Speech Communication Association (pp. 1387-1390). Brighton, UK. [abstract] [pdf]
Abstract: The paper proposes a general version of a coupled Hidden Markov/Bayesian Network model for performing phoneme recognition on acoustic-articulatory data. The model uses knowledge learned from the articulatory measurements, available for training, for phoneme recognition on the acoustic input. After training on the articulatory data, the model is able to predict 71.5% of the articulatory state sequences using the acoustic input. Using optimized parameters, the proposed method shows a slight improvement for two speakers over the baseline phoneme recognition system which does not use articulatory knowledge. However, the improvement is only statistically significant for one of the speakers. While there is an improvement in recognition accuracy for the vowels, diphthongs and, to some extent, the semi-vowels, there is a decrease in accuracy for the remaining phonemes.

(2009). Emotion Recognition. In Waibel, A., & Stiefelhagen, R. (Eds.), Computers in the Human Interaction Loop (pp. 96-105). Berlin/Heidelberg: Springer. [pdf]
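The non-uniqueness estimate in the 2009 INTERSPEECH abstract above is based on finding peaks in a conditional probability function. A toy sketch of counting such modes over a discretised one-dimensional density follows; the bin values are invented for illustration and the paper's actual estimator is more elaborate:

```python
def count_modes(density, min_prominence=0.0):
    """Count local maxima in a discretised probability function.

    A toy version of a peak-based non-uniqueness check: more than one
    mode in p(articulation | acoustics) suggests that the inverse
    mapping is non-unique for that acoustic observation. `density` is
    a list of probability values over a binned articulatory axis.
    """
    peaks = 0
    for i in range(1, len(density) - 1):
        # A bin is a peak if it exceeds both neighbours by more than
        # the required prominence.
        if (density[i] - density[i - 1] > min_prominence
                and density[i] - density[i + 1] > min_prominence):
            peaks += 1
    return peaks

# A bimodal toy density: two articulatory configurations
# compatible with the same acoustics.
density = [0.01, 0.05, 0.20, 0.08, 0.03, 0.10, 0.30, 0.12, 0.02]
n_modes = count_modes(density)
```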
2008

(2008). Vocal Expression in spontaneous and experimentally induced affective speech: Acoustic correlates of anxiety, irritation and resignation. In Workshop on Corpora for Research on Emotion and Affect. Marrakesh, Morocco. [abstract] [pdf]
Abstract: We present two studies on authentic vocal affect expressions. In Study 1, the speech of social phobics was recorded in an anxiogenic public speaking task both before and after treatment. In Study 2, the speech material was collected from real-life human-computer interactions. All speech samples were acoustically analyzed and subjected to listening tests. Results from Study 1 showed that a decrease in experienced state anxiety after treatment was accompanied by corresponding decreases in a) several acoustic parameters (i.e., mean and maximum F0, proportion of high-frequency components in the energy spectrum, and proportion of silent pauses), and b) listeners' perceived level of nervousness. Both speakers' self-ratings of state anxiety and listeners' ratings of perceived nervousness were further correlated with similar acoustic parameters. Results from Study 2 revealed that mean and maximum F0, mean voice intensity and H1-H2 were higher for speech perceived as irritated than for speech perceived as neutral. Also, speech perceived as resigned had lower mean and maximum F0, and lower mean voice intensity, than neutral speech. Listeners' ratings of irritation, resignation and emotion intensity were further correlated with several acoustic parameters. The results complement earlier studies on vocal affect expression, which have been conducted on posed, rather than authentic, emotional speech.

(2008). On the Non-uniqueness of Acoustic-to-Articulatory Mapping. In Fonetik. Göteborg.

(2008). The Acoustic to Articulation Mapping: Non-linear or Non-unique? In INTERSPEECH 2008 - 9th Annual Conference of the International Speech Communication Association (pp. 1485-1488). Brisbane, Australia. [pdf]

(2008). Automatic Recognition of Anger in Spontaneous Speech. In INTERSPEECH 2008 - 9th Annual Conference of the International Speech Communication Association (pp. 1485-1488). Brisbane, Australia. [pdf]
2006

(2006). Emotion Recognition in Spontaneous Speech. In Working Papers 52: Proceedings of Fonetik 2006 (pp. 101-104). Lund, Sweden: Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics. [pdf]

(2006). Emotion Recognition in Spontaneous Speech Using GMMs. In INTERSPEECH 2006 - ICSLP, Ninth International Conference on Spoken Language Processing. Pittsburgh, PA, USA. [pdf]
2005

(2005). A Haptic Enabled Multimodal Pre-Operative Planner for Hip Arthroplasty. In WorldHaptics Conference. Pisa, Italy. [pdf]

(2005). An innovative multisensorial environment for pre-operative planning of total hip replacement. In 5th Annual Meeting of Computer Assisted Orthopaedic Surgery. Helsinki, Finland.

(2005). A multimodal and multisensorial pre-operative planning environment for total hip replacement. IEEE Computer Society Press. [pdf]
2002

(2002). Textoberoende talarverifiering med adapterade Gaussian-Mixture-modeller [Text-independent speaker verification with adapted Gaussian mixture models]. Master's thesis, KTH, TMH, CTT.