WeB.SS: Structure-Based and Template-Based Automatic Speech Recognition --- Comparing parametric and non-parametric approaches
Wednesday, August 29, 2007, Astrid Plaza hotel, Room Scala 1
Session chairs: Li Deng, Microsoft, USA and Helmer Strik, Radboud University Nijmegen, the Netherlands
While hidden Markov modeling (HMM) has been the dominant technology for acoustic modeling in automatic speech recognition, its weaknesses are also well known and have become the focus of intensive research. One prominent weakness of current HMMs is their handicap in representing long-span temporal dependency in the acoustic feature sequence of speech, which is nevertheless an essential property of speech dynamics. The main cause of this handicap is the conditionally independent and identically distributed (IID) observation assumption inherent in the HMM formalism. Furthermore, the standard HMM approach focuses on verbal information, whereas experiments have shown that non-verbal information also plays an important role in human speech recognition, an issue the HMM framework has not attempted to address directly. Numerous approaches have been taken over the past dozen years to address these weaknesses. They can be broadly classified into the following two categories.
The first, parametric, structure-based approach establishes mathematical models for stochastic trajectories/segments of speech utterances using various forms of parametric characterization, including polynomials, linear dynamic systems, and nonlinear dynamic systems that embed the hidden structure of speech dynamics. In this parametric modeling framework, systematic speaker variation can also be handled satisfactorily. The essence of such a hidden-dynamic approach is that it exploits knowledge and mechanisms of human speech production to provide the structure of multi-tiered stochastic process models; a dedicated layer in this type of model represents long-range temporal dependency in parametric form.
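The core idea of a target-directed hidden-dynamic layer can be illustrated with a minimal sketch. The function below simulates the deterministic backbone of a first-order linear dynamic system in which a hidden articulatory-like state relaxes exponentially toward a phonetic target; the function name, parameters, and scalar state are our own illustrative choices, not taken from any specific system described in this session.

```python
import numpy as np

def hidden_dynamic_trajectory(target, x0, rate, steps):
    """Deterministic core of a target-directed linear dynamic system:
    x_t = x_{t-1} + rate * (target - x_{t-1}).
    The state relaxes smoothly toward the phonetic target, producing
    the kind of long-span temporal dependency a frame-wise IID model
    cannot capture."""
    x = np.empty(steps)
    x[0] = x0
    for t in range(1, steps):
        x[t] = x[t - 1] + rate * (target - x[t - 1])
    return x

# Example: a hidden state moving from 0 toward target 1.
traj = hidden_dynamic_trajectory(target=1.0, x0=0.0, rate=0.3, steps=50)
```

In a full model, this hidden trajectory would be mapped to acoustic observations through a (possibly nonlinear) observation equation with noise, and the targets would switch with the phone sequence.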
The second, non-parametric, template-based approach to overcoming the HMM weaknesses involves direct exploitation of speech feature trajectories (i.e., “templates”) in the training data without strong modeling assumptions. Due to the dramatic increase in speech databases and computer storage available for training, as well as greatly expanded computational power, non-parametric methods using the traditional pattern recognition techniques of kNN (k-nearest-neighbor decision rule) and DTW (dynamic time warping) have recently received substantial attention. Such template-based methods have also been called exemplar-based or data-driven techniques in the literature.
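The kNN/DTW combination mentioned above can be sketched in a few lines. The code below is a textbook DTW distance with Euclidean local costs plus a k-nearest-neighbor vote over labeled templates; it is a minimal illustration of the general technique, not the implementation used in any of the papers in this session.

```python
import numpy as np

def dtw_distance(x, y):
    """Classic dynamic time warping distance between two feature
    sequences x (n x d) and y (m x d), with Euclidean local costs
    and the standard three-way recursion."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def knn_classify(query, templates, k=1):
    """Label a query sequence by majority vote among its k nearest
    templates (a list of (sequence, label) pairs) under DTW distance."""
    ranked = sorted(templates, key=lambda t: dtw_distance(query, t[0]))
    labels = [label for _, label in ranked[:k]]
    return max(set(labels), key=labels.count)
```

In practice the cost of comparing a query against every training template is the main drawback of this approach, which is why the session also addresses the computational resource requirements of the two approaches.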
The purpose of this special session is to bring together researchers with a special interest in novel techniques aimed at overcoming the weaknesses of HMMs for acoustic modeling in speech recognition. In particular, we plan to address issues related to the representation and exploitation of long-range temporal dependency in speech feature sequences, the incorporation of fine phonetic detail in speech recognition algorithms and systems, comparisons of the pros and cons of the parametric and non-parametric approaches, and the computational resource requirements of the two approaches.
This Special Session addresses key issues of Sound to Sense (S2S), a Marie Curie Research Training Network that started in 2007. S2S's unifying theme is the role of fine phonetic detail (FPD) in speech processing. This special session focuses on alternative theoretical and computational modeling paradigms for encoding FPD.
10:00 – 10:45 Poster presentations
Temporal Episodic Memory Model: An Evolution of MINERVA2, Viktoria Maier and Roger K. Moore, University of Sheffield (United Kingdom)
Speech Recognition with Factorial-HMM Syllabic Acoustic Models, Gianpaolo Coro, Francesco Cutugno and Fulvio Caropreso, University of Naples “Federico II”, Naples and ABLA srl (Italy)
Evaluating Acoustic Distance Measures for Template Based Recognition, Mathias De Wachter, Kris Demuynck, Patrick Wambacq and Dirk Van Compernolle, K.U.Leuven (Belgium)
Hierarchical Acoustic Modeling Based on Random-Effects Regression for Automatic Speech Recognition, Yan Han and Lou Boves, Radboud University Nijmegen (the Netherlands)
Construction and Analysis of Multiple Paths in Syllable Models, Annika Hämäläinen, Louis ten Bosch and Lou Boves, Radboud University Nijmegen (the Netherlands)
Landmark-based Approach to Speech Recognition: An Alternative to HMMs, Carol Espy-Wilson, Tarun Pruthi, Amit Juneja and Om Deshmukh, University of Maryland and Think-A-Move, Ltd. (USA)
Automatic Recognition of Connected Vowels Only Using Speaker-invariant Representation of Speech Dynamics, Satoshi Asakawa, Nobuaki Minematsu and Keikichi Hirose, The University of Tokyo (Japan)
A Structured Speech Model Parameterized by Recursive Dynamics and Neural Networks, Roberto Togneri and Li Deng, The University of Western Australia (Australia) and Microsoft Research (USA)
10:45 – 11:30 Oral presentations
10:45 - Structure-Based and Template-Based Automatic Speech Recognition --- Comparing parametric and non-parametric approaches, Li Deng and Helmer Strik, Microsoft Research (USA) and Radboud University (the Netherlands)
11:00 - Learning the Inter-frame Distance for Discriminative Template-based Keyword Detection, David Grangier and Samy Bengio, IDIAP Research Institute (Switzerland) and Google Inc (USA)
11:15 - Handling Phonetic Context and Speaker Variation in a Structure-Based Speech Recognizer, Dong Yu, Li Deng and Alex Acero, Microsoft Research (USA)
11:30 – 12:00 Panel discussion, panelists: Janet Baker (Saras Institute & MIT, USA), Chin-Hui Lee (Georgia Tech, USA), Roger Moore (University of Sheffield, UK), Dirk Van Compernolle (KU Leuven, Belgium), Helmer Strik (Radboud University, the Netherlands), Li Deng (Microsoft Research, USA)