Track 1 – 09:30-13:00
Duration: 3 hours, plus a 30-minute coffee break
Location: Conference 1
Presenters: Bernd T. Meyer (1,5), Hynek Hermansky (2,3,5), Nelson Morgan (4,5)
1) Medizinische Physik and Cluster of Excellence Hearing4all, University of Oldenburg, Germany
2) Center for Language and Speech Processing, The Johns Hopkins University
3) Human Language Technology Center of Excellence, The Johns Hopkins University
4) EECS Department, University of California at Berkeley, Berkeley, CA, USA
5) International Computer Science Institute, Berkeley, CA, USA
Target Audience: The tutorial is intended for researchers and research students with a background in signal processing and some experience in speech recognition who wish to gain insight into bio-inspired speech recognition.
This tutorial picks up the main theme of Interspeech 2016, "Understanding Speech Processing in Humans and Machines," and is centered on bio-inspired processing of speech signals. Specifically, the following topics will be addressed: (A) differences and similarities between human and machine listening, (B) consequences for machine listening, with a focus on auditory-inspired signal processing and feature extraction, and (C) how to exploit these rich auditory representations in many-stream systems based on neural networks. Despite the advances in speech recognition systems, automatic speech recognition (ASR) remains a fragile technology, especially under mismatched training and test conditions, and the gap between machine listening and human listeners is therefore often found to be very large.
(A) One approach to narrowing this gap is to learn from research results in human speech recognition (HSR) (Moore and Cutler, 2001; Scharenborg, 2007), where the comparison of HSR and ASR responses is an important tool (i) to quantify the gap, in order to determine how far we have come in designing ASR systems that achieve human performance levels, and (ii) to identify specific differences between HSR and ASR that can be used to overcome weaknesses of ASR systems. This tutorial reviews studies on man-machine comparison, starting with the classic overview by Lippmann (1997) and covering research that focused on different aspects of the recognition process, such as identifying phones in the presence of different noise types (Sroka and Braida, 2005; Cooke and Scharenborg, 2008) or when intrinsic parameters such as speaking rate or effort are systematically varied (Meyer et al., 2011). Other comparisons employ more natural (albeit more difficult to analyze) speech data, e.g., noisy digits (Meyer, 2013), and have recently also investigated man-machine differences in natural acoustic scenes with typical noises in home environments (Vincent et al., 2013; Spille and Meyer, 2014), as well as the effect of imperfect language models on continuous speech recognition (Shen et al., 2008). In the tutorial, we will highlight the implications of man-machine differences for ASR and discuss insights from follow-up studies that build upon these differences to improve speech recognizers. Several of the datasets containing HSR responses are freely available; we will present a comprehensive list of these collections and give examples of how to use this data to obtain a different view on ASR results.
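As a minimal illustration of how such response data can be analyzed, the sketch below compares per-phone accuracies of human listeners and an ASR system on the same stimuli. The data format and the helper per_phone_accuracy are hypothetical placeholders, not the interface of any of the published collections.

    from collections import defaultdict

    def per_phone_accuracy(targets, responses):
        # Fraction of correct responses per presented phone, given parallel
        # lists of target phones and listener/recognizer responses.
        correct, total = defaultdict(int), defaultdict(int)
        for tgt, rsp in zip(targets, responses):
            total[tgt] += 1
            correct[tgt] += int(tgt == rsp)
        return {p: correct[p] / total[p] for p in total}

    # Hypothetical parallel responses for humans (HSR) and a recognizer (ASR).
    targets  = ["b", "d", "g", "b", "d", "g"]
    hsr_resp = ["b", "d", "g", "b", "t", "g"]
    asr_resp = ["p", "d", "g", "b", "t", "k"]

    acc_hsr = per_phone_accuracy(targets, hsr_resp)
    acc_asr = per_phone_accuracy(targets, asr_resp)
    gap = {p: acc_hsr[p] - acc_asr[p] for p in acc_hsr}
    print(gap)  # positive values mark phones where humans outperform the ASR

Per-phone differences of this kind point directly at the error patterns that distinguish the two systems, rather than averaging everything into a single word error rate.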
(B) We will also address a topic that is closely linked to speech processing in humans and machines: mimicking the relevant signal processing strategies of the human auditory system to improve speech recognizers (Hermansky, 1998). Some of the pattern analysis schemes to be presented emerge from results of the human-machine comparison; others are inspired by physiological or psychoacoustic findings about the healthy auditory system. The latter techniques can be coarsely categorized into peripheral versus mid- or central-level-inspired coding strategies (Stern and Morgan, 2012), which will be described and discussed. Mel-frequency cepstral coefficients (MFCCs) already incorporate two important properties of peripheral auditory processing, i.e., a non-linear frequency warping resembling the frequency-place mapping on the basilar membrane and a logarithmic amplitude compression. The widely used perceptual linear predictive (PLP) features (Hermansky, 1990) employ Bark scaling and a power law for frequency and amplitude compression, respectively, and incorporate additional properties related to psychoacoustic findings, such as asymmetric filter shapes in the frequency analysis step. Sensitivity to slowly varying channel characteristics can be alleviated by RASTA processing, i.e., the analysis of relative spectra, a scheme that was also inspired by psychoacoustic evidence (Hermansky and Morgan, 1994). Human listeners are sensitive to temporal modulations encoded in mid- and high-level stages of the auditory pathway, which is taken into account by features that explicitly cover the long-term temporal evolution of the underlying signal, namely TRAPs (Hermansky and Sharma, 1999) and frequency-domain linear prediction (FDLP) features, which approximate the Hilbert envelope of the signal by an auto-regressive model (Athineos and Ellis, 2003). The last feature type to be discussed are spectro-temporal Gabor features, which provide a method for simultaneous time and frequency analysis (Kleinschmidt and Gelbart, 2002; Ravuri and Morgan, 2010; Schädler et al., 2012) and are motivated by physiological measurements showing that neurons in the primary auditory cortex of mammals are explicitly sensitive to diagonal patterns in the time-frequency representation of the stimulus.
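To make the last idea concrete, here is a minimal sketch of spectro-temporal Gabor feature extraction: a complex sinusoid windowed by a Gaussian envelope is convolved with a log-mel spectrogram. The filter sizes and modulation frequencies are illustrative choices, not the parameter set of Schädler et al. (2012), and a random array stands in for a real spectrogram.

    import numpy as np
    from scipy.signal import convolve2d

    def gabor_filter(omega_t, omega_f, size_t=31, size_f=15):
        # Complex 2D Gabor filter: a Gaussian envelope multiplied by a plane
        # wave; (omega_t, omega_f) selects the spectro-temporal modulation
        # (diagonal patterns correspond to non-zero values on both axes).
        t = np.arange(size_t) - size_t // 2  # time (frames)
        f = np.arange(size_f) - size_f // 2  # frequency (mel bands)
        T, F = np.meshgrid(t, f, indexing="ij")
        envelope = np.exp(-0.5 * (T / (size_t / 6.0)) ** 2
                          - 0.5 * (F / (size_f / 6.0)) ** 2)
        carrier = np.exp(1j * 2 * np.pi * (omega_t * T + omega_f * F))
        return envelope * carrier

    # Apply a small filter bank to a log-mel spectrogram (frames x bands).
    log_mel = np.random.randn(200, 23)  # placeholder spectrogram
    bank = [gabor_filter(wt, wf) for wt in (0.0, 0.06) for wf in (0.0, 0.25)]
    feats = np.stack([np.abs(convolve2d(log_mel, g, mode="same")) for g in bank],
                     axis=-1)
    print(feats.shape)  # (200, 23, 4): one feature channel per filter

Each filter in the bank responds to one combination of temporal and spectral modulation, so stacking the filter outputs yields a rich, redundant representation that downstream classifiers can select from.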
(C) In the last part of the tutorial, we discuss the application of bio-inspired feature processing in multi-layer perceptron (MLP)-based multi-stream ASR, which aims to deal with unexpected speech distortions that are not seen in the training data. We present a short description of two approaches for embedding neural nets in recognizers based on hidden Markov models: replacing the Gaussian mixture component in a hybrid system (Morgan and Bourlard, 1995), or combining an MLP with a conventional recognizer in a tandem approach (Hermansky et al., 2000). We discuss how heterogeneous information extracted from speech signals can be exploited in multi-stream systems (Ravuri and Morgan, 2010), and how bio-inspired features contribute to the robustness of speech recognizers using current deep neural nets in hybrid systems (Castro Martínez et al., 2014).
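The contrast between the two approaches can be summarized in a few lines: in the hybrid case, MLP posteriors are converted to scaled likelihoods by dividing by the state priors, and these scores replace the Gaussian mixture scores in the HMM decoder; in the tandem case, (transformed) log posteriors serve as acoustic features for a conventional GMM-HMM recognizer. The sketch below uses toy data and function names of our own choosing; a full tandem front end would also add a decorrelating transform such as PCA.

    import numpy as np

    def hybrid_scaled_likelihoods(posteriors, priors, eps=1e-10):
        # Hybrid approach: the MLP estimates P(state | x); dividing by the
        # state priors P(state) gives p(x | state) up to a constant factor,
        # which replaces the Gaussian mixture scores in the HMM decoder.
        return np.log(posteriors + eps) - np.log(priors + eps)

    def tandem_features(posteriors, eps=1e-10):
        # Tandem approach: log posteriors are treated as acoustic features
        # for a conventional GMM-HMM recognizer (a decorrelating transform
        # such as PCA would typically follow at this point).
        logp = np.log(posteriors + eps)
        return logp - logp.mean(axis=0)  # per-dimension mean normalization

    # Toy example: 100 frames of posteriors over 40 HMM states.
    post = np.random.dirichlet(np.ones(40), size=100)
    priors = post.mean(axis=0)
    print(hybrid_scaled_likelihoods(post, priors).shape)  # (100, 40)
    print(tandem_features(post).shape)                    # (100, 40)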
- Athineos, M., and Ellis, D. P. W. (2003). “Frequency-domain linear prediction for temporal features,” in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 261-266.
- Cooke, M., and Scharenborg, O. (2008). “The Interspeech 2008 Consonant Challenge,” in Proc. Interspeech, pp. 1765-1768.
- Kleinschmidt, M., and Gelbart, D. (2002). “Improving word accuracy with Gabor feature extraction,” in Proc. ICSLP.
- Lippmann, R. P. (1997). “Speech recognition by machines and humans,” Speech Commun., 22, pp. 1-15.
- Moore, R. K., and Cutler, A. (2001). “Constraints on Theories of Human vs Machine Recognition of Speech,” in Proc. SPRAAC Workshop on Speech Recognition as Pattern Classification, pp. 145-150.
- Scharenborg, O. (2007). “Reaching over the gap: A review of efforts to link human and automatic speech recognition research,” Speech Commun., 49, pp. 336–347.
- Shen, W., Olive, J., and Jones, D. (2008). “Two protocols comparing human and machine phonetic recognition performance in conversational speech,” in Proc. Interspeech, pp. 1630-1633.
- Sroka, J. J., and Braida, L. D. (2005). “Human and machine consonant recognition,” Speech Commun., 45, pp. 401–423.
- Vincent, E., Barker, J., Watanabe, S., Le Roux, J., and Nesta, F. (2013). “The second CHiME speech separation and recognition challenge: Datasets, tasks and baselines,” in Proc. ICASSP, pp. 126-130.