Interspeech 2007
August 27-31, 2007
Antwerp, Belgium

Interspeech 2007 Session ThC.O3: Speech synthesis I


Type oral
Date Thursday, August 30, 2007
Time 13:30 – 15:30
Room Marble
Chair Alan Black (Language Technologies Institute, CMU)

ThC.O3‑1
13:30
An HMM-Based Speech Synthesis System applied to German and its Adaptation to a Limited Set of Expressive Football Announcements
Sacha Krstulovic, DFKI GmbH
Anna Hunecke, DFKI GmbH
Marc Schröder, DFKI GmbH

The paper assesses the capability of an HMM-based TTS system to produce German speech. The results are discussed in qualitative terms and compared across three different choices of context features. In addition, the system is adapted to a small set of football announcements, in an exploratory attempt to synthesise expressive speech. We conclude that the HMMs are able to produce highly intelligible neutral German speech with stable quality, and that expressivity is partially captured in spite of the small size of the football dataset.
ThC.O3‑2
13:50
Statistical Vowelization of Arabic Text for Speech Synthesis in Speech-to-Speech Translation Systems
Liang Gu, IBM T. J. Watson Research Center
Wei Zhang, IBM T. J. Watson Research Center
Lazkin Tahir, IBM T. J. Watson Research Center
Yuqing Gao, IBM T. J. Watson Research Center

Vowelization presents a principal difficulty in building text-to-speech synthesizers for speech-to-speech translation systems. In this paper, a novel log-linear modeling method is proposed that takes into account vowel and diacritical information at both the word level and the character level. A unique syllable-based normalization algorithm is then introduced to enhance both word coverage and data consistency. A recursive data generation and model training scheme is further devised to jointly optimize speech synthesizers and vowelizers for an English-Arabic speech translation system. The diacritization error rate is reduced by over 50% in vowelization experiments.
ThC.O3‑3
14:10
A Pair-based Language Model for the Robust Lexical Analysis in Chinese Text-to-Speech Synthesis
Wu Liu, France Telecom R&D Beijing
Dezhi Huang, France Telecom R&D Beijing
Yuan Dong, Beijing University of Posts and Telecommunications
Xinnian Mao, France Telecom R&D Beijing
Haila Wang, France Telecom R&D Beijing

This paper presents a robust method of lexical analysis for Chinese text-to-speech (TTS) synthesis using a pair-based Language Model (LM). The traditional approach to Chinese lexical analysis treats word segmentation and part-of-speech (POS) tagging as two separate phases, each with its own algorithms and models. In fact, POS information is useful for word segmentation, and vice versa. Therefore, a pair-based language model is proposed to integrate basic word segmentation, POS tagging, and named entity (NE) identification into a unified framework. The objective evaluation indicates that the proposed method achieves top-level performance and confirms its effectiveness in Chinese lexical analysis.
ThC.O3‑4
14:30
A trainable excitation model for HMM-based speech synthesis
Ranniery Maia, NiCT/ATR-SLC
Tomoki Toda, Nara Institute of Science and Technology
Heiga Zen, Nagoya Institute of Technology
Yoshihiko Nankaku, Nagoya Institute of Technology
Keiichi Tokuda, Nagoya Institute of Technology

This paper introduces a novel excitation approach for speech synthesizers in which the final waveform is generated from parameters directly obtained from Hidden Markov Models (HMMs). Despite the attractiveness of the HMM-based speech synthesis technique, namely its utilization of small corpora and its flexibility in achieving different voice styles, the synthesized speech presents a characteristic "buzziness" caused by the simple excitation model employed during speech production. This paper presents an innovative scheme in which mixed excitation is modeled through closed-loop training of a set of state-dependent filters and pulse trains, minimizing the error between excitation and residual sequences. The proposed method proves effective, yielding synthesized speech with quality far superior to the simple excitation baseline and comparable to the best excitation schemes reported thus far for HMM-based speech synthesis.
ThC.O3‑5
14:50
Cross-Language Phonemisation In German Text-To-Speech Synthesis
Jochen Steigner, DFKI GmbH
Marc Schröder, DFKI GmbH

We present a TTS component for transcribing English words in German text. In addition to loan words, whose form does not change, we also cover xenomorphs, English stems with German morphology. We motivate the need for such a processing component, and present the algorithm in some detail. In an evaluation on unseen material, we find a precision of 0.85 and a recall of 0.997.
ThC.O3‑6
15:10
Preliminary Experiments toward Automatic Generation of New TTS Voices from Recorded Speech Alone
Ryuki Tachibana, Tokyo Research Laboratory, IBM Japan
Tohru Nagano, Tokyo Research Laboratory, IBM Japan
Gakuto Kurata, Tokyo Research Laboratory, IBM Japan
Masafumi Nishimura, Tokyo Research Laboratory, IBM Japan
Noboru Babaguchi, Graduate School of Engineering, Osaka University

To generate a new concatenative text-to-speech (TTS) voice from recordings of a human voice, not only the recordings but also additional information such as transcriptions, prosodic labels, and phonemic alignments is necessary. Since some of this information depends on the speaking style of the narrator, it must be added manually by listening to the recordings, which is costly and time-consuming. To tackle this problem, we have been working on a fully trainable TTS system, every component of which, including the text processing module, can be automatically trained from a speech corpus. In this paper, we refine the framework and propose several submodules to collect all of the linguistic and acoustic information necessary for generating a TTS voice from the recorded speech. Though completely automatic generation of a new voice is not yet possible, we report progress in the submodules by showing experimental results.
