August 27-31, 2007

Antwerp, Belgium

Interspeech 2007 Session ThB.P1b: Speech coding and transmission


Type: poster
Date: Thursday, August 30, 2007
Time: 10:00 – 12:00
Room: Foyer
Chair: Isabel Trancoso (INESC, Lisboa)

ThB.P1b‑1

Normalized Two Stage SVQ for Minimum Complexity Wide-band LSF Quantization
Saikat Chatterjee, Indian Institute of Science
T.V. Sreenivas, Indian Institute of Science

We develop a two-stage split vector quantization method with optimum bit allocation to achieve minimum computational complexity. This also results in a much lower memory requirement than the recently proposed switched split vector quantization method. To further improve the rate-distortion performance, a region-specific normalization is introduced, which yields a 1 bit/vector improvement over the typical two-stage split vector quantizer for wide-band LSF quantization.
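As a rough illustration of the two-stage split VQ structure described in this abstract, the sketch below quantizes a vector with per-subvector codebooks and then quantizes the residual in a second stage. The function names, split sizes, and toy codebooks are hypothetical, and the bit allocation and region-specific normalization of the paper are not reproduced.

```python
def nearest(codebook, x):
    """Return (index, codevector) of the entry closest to x in squared error."""
    best = min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], x)))
    return best, codebook[best]


def split(x, sizes):
    """Cut x into consecutive subvectors with the given lengths."""
    parts, start = [], 0
    for n in sizes:
        parts.append(x[start:start + n])
        start += n
    return parts


def svq_encode(x, sizes, codebooks):
    """Split VQ: quantize each subvector with its own codebook."""
    indices, recon = [], []
    for part, cb in zip(split(x, sizes), codebooks):
        i, c = nearest(cb, part)
        indices.append(i)
        recon.extend(c)
    return indices, recon


def two_stage_svq_encode(x, sizes, stage1, stage2):
    """Stage 1 quantizes x; stage 2 quantizes the stage-1 residual."""
    idx1, rec1 = svq_encode(x, sizes, stage1)
    residual = [a - b for a, b in zip(x, rec1)]
    idx2, rec2 = svq_encode(residual, sizes, stage2)
    recon = [a + b for a, b in zip(rec1, rec2)]
    return idx1, idx2, recon


# Toy (hypothetical) codebooks: two splits of dimension 2, 1 bit per split per stage.
SIZES = [2, 2]
STAGE1 = [[[0.0, 0.0], [1.0, 1.0]], [[2.0, 2.0], [3.0, 3.0]]]
STAGE2 = [[[-0.25, -0.25], [0.25, 0.25]], [[-0.25, -0.25], [0.25, 0.25]]]

idx1, idx2, recon = two_stage_svq_encode([0.75, 0.75, 2.25, 2.25], SIZES, STAGE1, STAGE2)
```

Because each split is searched independently and the second stage only refines the residual, the search cost grows with the sum of the codebook sizes rather than their product, which is the general motivation for this kind of structure.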
ThB.P1b‑2

A Novel 2kb/s Waveform Interpolation Speech Coder Based on Non-negative Matrix Factorization
Peng Zhang, Beijing University of Technology
Changchun Bao, Beijing University of Technology

In this paper, a 2 kb/s Waveform Interpolation (WI) speech coder based on non-negative matrix factorization (NMF) is proposed. In the decomposition of the characteristic waveforms (CWs), band-partitioning initialization constraints are imposed on the basis vectors before NMF is carried out. This decomposition method requires only the speech signal from the current frame and yields high decomposition quality with low computational complexity. In addition, the high-dimensional CW matrix can be expressed by a low-dimensional coding matrix, which facilitates CW quantization. A listening test shows that the proposed 2 kb/s NMF-WI coder produces smooth speech with quality close to that of a 2.4 kb/s SVD-based WI coder.
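For readers unfamiliar with NMF, the following minimal sketch factorizes a non-negative matrix V into W·H using the standard multiplicative update rules for the squared-error cost. It is not the coder's decomposition: the random initialization stands in for the band-partitioning initialization mentioned in the abstract, and all names and the toy matrix are assumptions.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(r) for r in zip(*a)]


def frob_error(v, w, h):
    """Squared Frobenius norm of V - W*H."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))


def nmf(v, rank, iters=200, seed=0, eps=1e-12):
    """Multiplicative-update NMF; random init is an assumption of this sketch."""
    rng = random.Random(seed)
    rows, cols = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(rows)]
    h = [[rng.random() + 0.1 for _ in range(cols)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(cols)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(rows)]
    return w, h


# Toy non-negative matrix of rank 2 (made-up values).
V = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [1.0, 3.0, 3.0]]
W, H = nmf(V, rank=2)
```

The low-dimensional matrix H plays the role of the compact representation that would be quantized, while W holds the basis vectors.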
ThB.P1b‑3

A Novel Energy Distribution Comparison Approach for Robust Speech Spectrum Vector Quantization
Ahmed Ismail, Mentor Graphics Corp.
Yasser Dakroury, Computer & Systems Eng. Department, Faculty of Engineering, Ain Shams University, Cairo
Hazem Abbas, Mentor Graphics Corp.

Vector Quantization (VQ) has been extensively used in speech vocoders. The training process normally requires a very large training set. This paper introduces a novel energy distribution comparison distortion measure for the high-band speech spectrum that enables the vector quantizer to operate with a relatively small training set. This measure has been used in the construction of a segmental vocoder that uses pitch periods as segments. A description of the proposed approach, the Energy-Mass distortion measure, is given and compared to the use of MFCC as a distortion measure, showing the ability of the proposed approach to better represent the speech formants when operating under the small training-set constraint. Finally, the performance of the new Energy-Mass measure is evaluated using Spectral Distortion (SD). Speech quality perceived by the receiver is evaluated using the recently standardized objective quality measure PESQ, where an improvement of 0.3 in PESQ score was obtained.
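The Spectral Distortion figure mentioned above is commonly computed as the root-mean-square difference between two log power spectra, in dB. The sketch below follows that common definition; the function name and the sample spectra are illustrative assumptions, and the abstract does not state the exact variant (frequency range, averaging) used by the authors.

```python
import math


def spectral_distortion_db(power_a, power_b):
    """RMS difference (in dB) between two power spectra sampled on the same bins."""
    if len(power_a) != len(power_b) or not power_a:
        raise ValueError("spectra must be non-empty and of equal length")
    diffs = [10.0 * math.log10(a) - 10.0 * math.log10(b)
             for a, b in zip(power_a, power_b)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Made-up spectra: the second is the first scaled by a factor of 10 in power.
reference = [1.0, 2.0, 4.0]
scaled = [10.0, 20.0, 40.0]
sd = spectral_distortion_db(scaled, reference)
```

A uniform factor of 10 in power is a 10 dB offset in every bin, so the distortion for this toy pair is 10 dB.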
ThB.P1b‑4

Novel Low-Band Phase Representation for Low Bit-Rate Speech Coding
Ahmed Ismail, Mentor Graphics Corp.
Yasser Dakroury, Computer & Systems Eng. Department, Faculty of Engineering, Ain Shams University, Cairo
Hazem Abbas, Mentor Graphics Corp.

Vector Quantization (VQ) has been extensively used in speech vocoders. Phase information is often ignored or coarsely represented in parametric coders because of the difficulty of phase quantization. This paper introduces a novel distortion measure for the low-band speech signal that takes phase information into account with no increase in bit rate. This measure has been used in the construction of a segmental vocoder that uses pitch periods as segments. A description of the proposed Time-Domain Phase-Aware (TDPA) distortion measure is given and compared to the use of MFCC as a distortion measure, showing how the phase information represented in the TDPA model improves the inter-frame correlation of the synthesized speech. Finally, the performance of the TDPA measure is evaluated using the Segmental Signal-to-Noise Ratio (SNR) and Spectral Distortion (SD). Speech quality is evaluated using the recently standardized objective quality measure PESQ.
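Segmental SNR, one of the evaluation measures named above, averages per-frame SNR values in dB instead of computing a single ratio over the whole signal. The sketch below is one common formulation; the frame length, the clamping limits of -10 dB and 35 dB, and the skipping of silent frames are conventions assumed here, not details taken from the paper.

```python
import math


def segmental_snr_db(clean, processed, frame_len, floor_db=-10.0, ceil_db=35.0):
    """Mean of per-frame SNRs (dB), each clamped to [floor_db, ceil_db]."""
    snrs = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        frame = clean[start:start + frame_len]
        error = [a - b for a, b in zip(frame, processed[start:start + frame_len])]
        signal_energy = sum(a * a for a in frame)
        noise_energy = sum(e * e for e in error)
        if signal_energy == 0.0:
            continue  # silent frame: skipped (an assumption of this sketch)
        if noise_energy == 0.0:
            snr = ceil_db  # perfect reconstruction saturates at the ceiling
        else:
            snr = 10.0 * math.log10(signal_energy / noise_energy)
        snrs.append(min(max(snr, floor_db), ceil_db))
    if not snrs:
        raise ValueError("no usable frames")
    return sum(snrs) / len(snrs)


# Made-up signals: a constant signal and a copy attenuated by 10%.
clean = [1.0] * 8
processed = [0.9] * 8
seg_snr = segmental_snr_db(clean, processed, frame_len=4)
```

Since the measure works on waveform differences, it is sensitive to phase errors that a magnitude-only spectral measure would not register.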
ThB.P1b‑5

Perceptual-Based Playout Mechanisms for Multi-Stream Voice over IP Networks
Chun-Feng Wu, National Chiao-Tung University
Cheng-Lung Lee, National Chiao-Tung University
Wen-Whei Chang, National Chiao-Tung University

Packet loss and delay are two essential problems for real-time voice transmission over best-effort packet networks. In the proposed system, multiple descriptions of the speech are transmitted to take advantage of the largely uncorrelated delay and loss characteristics of different network paths. Adaptive playout scheduling of multiple voice streams is formulated as an optimization problem leading to a better delay-loss tradeoff. Also proposed is a perceptually motivated optimization criterion based on a simplified version of the ITU-T E-model. Experimental results show that the proposed playout buffer algorithm improves the delay-loss tradeoff as well as speech reconstruction quality.
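To make the delay-loss tradeoff concrete, the sketch below scores candidate playout delays for a single stream with a simplified E-model rating and picks the best one. The rating-to-MOS mapping is the one given in ITU-T G.107; the delay and loss impairment formulas are a commonly cited simplification for G.711-like coding and are assumptions here, as are all function names and sample delays. The paper's multi-stream formulation and its own simplified criterion are not reproduced.

```python
import math


def r_to_mos(r):
    """ITU-T G.107 mapping from transmission rating R to estimated MOS."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + 7e-6 * r * (r - 60.0) * (100.0 - r)


def delay_impairment(delay_ms):
    """Assumed simplified delay impairment: grows faster above about 177 ms."""
    extra = 0.11 * (delay_ms - 177.3) if delay_ms > 177.3 else 0.0
    return 0.024 * delay_ms + extra


def loss_impairment(loss_rate):
    """Assumed simplified loss impairment for a G.711-like codec."""
    return 30.0 * math.log(1.0 + 15.0 * loss_rate)


def rating(arrival_delays_ms, playout_delay_ms):
    """R-factor when packets arriving after the playout deadline count as lost."""
    late = sum(1 for a in arrival_delays_ms if a > playout_delay_ms)
    loss_rate = late / len(arrival_delays_ms)
    return 93.2 - delay_impairment(playout_delay_ms) - loss_impairment(loss_rate)


def best_playout_delay(arrival_delays_ms, candidates_ms):
    """Pick the candidate playout delay with the highest rating."""
    return max(candidates_ms, key=lambda d: rating(arrival_delays_ms, d))


# Made-up network delays: 98 packets at 40 ms and 2 stragglers at 400 ms.
delays = [40.0] * 98 + [400.0] * 2
choice = best_playout_delay(delays, [50.0, 400.0])
```

In this toy case, waiting 400 ms to rescue two late packets costs more in delay impairment than dropping them costs in loss impairment, so the shorter playout delay is selected.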
ThB.P1b‑6

Time-Warping and Re-Phasing in Packet Loss Concealment
Robert Zopf, Broadcom Corporation
Jes Thyssen, Broadcom Corporation
Juin-Hwey Chen, Broadcom Corporation

This paper proposes two techniques to improve packet loss concealment (PLC). In the first technique, time-warping is used to stretch or shrink the time axis of the signal received in the first good frame after frame loss to align it with the extrapolated signal used to conceal the bad frame. This alignment avoids the destructive interference that might otherwise occur when the two signals are out of phase and overlap-added. The second technique applies to speech codecs with memory and is particularly suited to backward-adaptive systems. In this technique, called “re-phasing”, the internal states of the codec are phase-aligned with the signal in the first good frame. Both techniques are part of the ITU-T G.722 Appendix III packet loss concealment standard and provide significant quality improvement.
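The alignment problem described above can be illustrated with a lag search: find the shift at which the received signal best matches the extrapolated one, then cross-fade the two. The sketch below uses normalized cross-correlation and a linear overlap-add window. It is only an illustration of the idea under these assumptions; it does not perform the actual stretching or shrinking of the time axis, and it is not the procedure specified in G.722 Appendix III.

```python
import math


def best_lag(extrapolated, received, max_lag):
    """Lag in [-max_lag, max_lag] maximizing the normalized correlation
    between extrapolated[n] and received[n + lag]."""
    def score(lag):
        pairs = [(extrapolated[n], received[n + lag])
                 for n in range(len(extrapolated))
                 if 0 <= n + lag < len(received)]
        num = sum(a * b for a, b in pairs)
        den = math.sqrt(sum(a * a for a, _ in pairs) * sum(b * b for _, b in pairs))
        return num / den if den > 0.0 else 0.0
    return max(range(-max_lag, max_lag + 1), key=score)


def overlap_add(fade_out, fade_in):
    """Linear cross-fade from fade_out to fade_in (equal-length segments)."""
    n = len(fade_out)
    return [fade_out[i] * (1.0 - (i + 1) / (n + 1)) + fade_in[i] * ((i + 1) / (n + 1))
            for i in range(n)]


# Made-up signals: a sine with a 20-sample period, and a copy delayed by 3 samples.
PERIOD = 20
extrapolated = [math.sin(2.0 * math.pi * n / PERIOD) for n in range(40)]
received = [math.sin(2.0 * math.pi * (n - 3) / PERIOD) for n in range(40)]
lag = best_lag(extrapolated, received, max_lag=5)
```

Once the lag is known, the received signal can be shifted (or, as in the paper, time-warped) so that the two segments are in phase before they are overlap-added.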
ThB.P1b‑7

The Harmonic Model Codec (HMC) framework for VoIP
Yannis Agiomyrgiannakis, Computer Science Department, University Of Crete, Hellas
Yannis Stylianou, Computer Science Department, University Of Crete, Hellas

A framework for joint source/channel coding of speech is presented. It is based on a harmonic representation of the speech signal and facilitates efficient quantization of harmonic amplitudes and phases in both a single description and a multiple description setting. Furthermore, it combines high-quality packet loss concealment with efficient source coding and multiple description coding. Two proof-of-concept codecs are presented: a single description codec that is equivalent to iLBC in terms of bitrate and quality but more robust under increased packet loss, and a multiple description codec that is capable of accepting loss rates up to 40% for a DCR score of 3.8.
ThB.P1b‑8

Bit-Erasure Channel Decoding for GMM-based Multiple Description Coding
Yannis Agiomyrgiannakis, Computer Science Department, University Of Crete
Yannis Stylianou, Computer Science Department, University Of Crete

Multiple Description Coding (MDC) is a plausible way to use the diversity of packet networks to increase the robustness of the transmission to packet losses. The redundancy that is introduced via MDC can also be used to increase the robustness of the transmission to bit errors. This paper presents a novel decoding method for GMM (Gaussian Mixture Model)-based MDC in the presence of detected bit errors. In particular, for speech transmission over bit-erasure channels, it is shown that the proposed method considerably improves the quality of the received speech spectral envelopes when one side description is damaged. In highly correlated descriptions, for example, single and double bit errors can almost be corrected.
ThB.P1b‑9

Degradation-Classification Assisted Single-Ended Quality Measurement of Speech
Hua Yuan, Queen's University
Tiago H. Falk, Queen's University
Wai-Yip Chan, Queen's University

We propose an algorithm to classify speech degradations at network endpoints and to estimate the speech quality based on the degradation classification decision. Perceptual features from degraded speech signals are used to form statistical reference models of different degradation classes. Consistency measures, calculated between degraded speech signals and the reference models, are used to train a degradation classifier and mean opinion score (MOS) mappings. The quality of a received speech signal is estimated based on its degradation class and the MOS mapping associated with the class. Experimental results show that the proposed algorithm achieves high classification accuracy, and degradation classification improves the accuracy of the quality estimate.
ThB.P1b‑10

Concept and Evaluation of a Downward-Compatible System for Spatial Teleconferencing using Automatic Speaker Clustering
Alexander Raake, Deutsche Telekom Laboratories, Berlin University of Technology
Sascha Spors, Deutsche Telekom Laboratories, Berlin University of Technology
Jens Ahrens, Deutsche Telekom Laboratories, Berlin University of Technology
Jitendra Ajmera, Deutsche Telekom Laboratories, Berlin University of Technology

In multi-party teleconferencing, the transport of separate speech streams to a particular user and the subsequent spatial rendering of the different streams enable more efficient communication. A simple means of spatial presentation at the client side is binaural rendering with headphone presentation. For downward compatibility, e.g. when the transport mechanism does not support multiple parallel downlink streams, a system is proposed that combines an automatic speaker classification mechanism with spatial rendering of the segregated streams. The combined system aims at better separability of the speakers than conventional systems. The paper details the two basic components, namely automatic speaker classification and binaural rendering. Based on a first evaluation of the approach, a proof of concept is provided, and directions for further improvement are discussed.
ThB.P1b‑11

Speech Quality Estimation using Packet Loss Effects in CELP-type Speech Coders
Min-Ki Lee, Yonsei University
Kyung-Tae Kim, Yonsei University
Hong-Goo Kang, Yonsei University
Dae Hee Youn, Yonsei University

This paper proposes an objective quality assessment method for voice communication systems using packet loss information. Because the effect of packet loss on perceptual quality varies with the signal characteristics of the lost packets, different weighting factors are applied when predicting overall quality. Since low bit rate speech coders rely on mechanisms such as parameter prediction across consecutive frames, we also include the effect of neighboring frames. To verify the performance of the proposed algorithm, we apply it to two well-known speech codecs, G.729 and AMR-NB. Simulation results with a large speech database verify the superiority of the proposed algorithm. The normalized correlation between the degraded listening quality (LQ) scale of PESQ and the proposed method is 0.9121 for G.729A and 0.9289 for AMR when the SMV classification is based on the input signal, and 0.8505 for G.729A and 0.8586 for AMR when it is based on the decoded signal.
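The correlation figures quoted above compare predicted quality scores against PESQ scores. A minimal sketch of such a comparison, assuming the normalized correlation is the usual Pearson coefficient, is given below; the function name and the score lists are made up for illustration and are not data from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores: reference quality values and a predictor's estimates.
reference_scores = [1.0, 2.0, 3.0, 4.0]
predicted_scores = [2.0, 4.0, 6.0, 8.0]
rho = pearson(reference_scores, predicted_scores)
```

A value near 1 means the predictor ranks and spaces the conditions much like the reference measure, even if the two use different scales.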
ThB.P1b‑12

An 8-32 kbit/s Scalable Wideband Coder Extended with MDCT-based Bandwidth Extension on top of a 6.8 kbit/s Narrowband CELP Coder
Masahiro Oshikiri, Next-Generation Mobile Communications Development Center, Matsushita Electric (Panasonic)
Hiroyuki Ehara, Next-Generation Mobile Communications Development Center, Matsushita Electric (Panasonic)
Toshiyuki Morii, Next-Generation Mobile Communications Development Center, Matsushita Electric (Panasonic)
Tomofumi Yamanashi, Next-Generation Mobile Communications Development Center, Matsushita Electric (Panasonic)
Kaoru Satoh, Next-Generation Mobile Communications Development Center, Matsushita Electric (Panasonic)
Koji Yoshida, Next-Generation Mobile Communications Development Center, Matsushita Electric (Panasonic)

In this paper, we present a 6.8-32 kbit/s scalable speech and audio coder using a modified-discrete-cosine-transform (MDCT)-based bandwidth extension on top of a 6.8 kbit/s code-excited-linear-prediction (CELP) coder. The proposed coder comprises a 6.8 kbit/s narrowband CELP coder as its core layer and eight enhancement layers with bitrates of 0.8, 1.2, 3.2, or 4.0 kbit/s. After the core layer encodes the narrowband signal, the first enhancement layer extends the bandwidth of the narrowband decoded signal, and the other enhancement layers increase the fidelity of the extended wideband signal or the robustness against frame erasure. Subjective evaluation results demonstrate that the proposed coder outperforms G.729.1, in particular for music signals at 16 and 24 kbit/s, with competitive or even better performance in other conditions such as clean speech, background noise, and frame erasure.
