Tuesday, August 28, 2007, Astrid Plaza hotel, Room Scala 1
Session Chair: Gerrit Bloothooft, Utrecht University, the Netherlands
Singing is perhaps the most expressive use of the human voice and speech. An excellent singer, whether in classical opera, musical, pop, folk music, or any other style, can express a message and emotion so intensely that it moves and delights a wide audience. Synthesizing singing may therefore be considered the ultimate challenge to our understanding and modeling of the human voice. In this two-hour interactive special session of INTERSPEECH 2007 on synthesized singing, an enjoyable demonstration of the current state of the art was given, with active evaluation by the audience.
The session was special in many ways:
Participants not only contributed a short paper, including an audio demonstration of their system, but also produced their own version of The Synthesizer Song.
Each contribution was commented on by the audience and by a panel consisting of Nick Campbell (ATR), Peter Pabon (Utrecht University), and Lieve Geuens (soprano singer).
Evaluative statements were voted on by everyone, using a real-time voting box system with a five-point Likert scale from excellent (1) to poor (5); they concern
1. bass-baritone voice
Let me sing
Let me sing
Let me sing by bits and bytes
Let me bring
Let me bring
Let me bring divine delights
----
I sing an /a/
I sing an /i/
I sing an /u/
for you, for you, for you
2. soprano voice
Let me sing
Let me sing
Let me sing by bits and bytes
Let me bring
Let me bring
Let me bring divine delights
----
It is an art
This is the best
That I can do
for you, for you, for you
3. voice of choice
Let me sing
Let me sing
Let me sing by bits and bytes
Let me bring
Let me bring
Let me bring divine delights
----
I sing an /a/
I sing an /i/
I sing an /u/
for you, for you, for you
The musical score is written for soprano voice. Transposing it down by one octave gives the score for tenor or alto voice, and transposing it down by two octaves gives the score for bass-baritone voice. For the realization of the Synthesizer Song, the first verse should be sung by a bass-baritone voice and the second verse by a soprano voice. No accompaniment or reverberation is allowed. For the third verse, any voice can be chosen, and accompaniment is permitted.
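The octave transposition described above can be sketched in a few lines. This is only an illustration: it assumes the notes are stored as MIDI note numbers (A4 = 69 = 440 Hz), a convention the session description does not mention, and the helper names and the example phrase are hypothetical rather than taken from the actual score.

```python
# Hypothetical helpers; the MIDI note-number convention (A4 = 69 = 440 Hz)
# is an assumption and is not part of the session description.

def transpose_octaves(midi_notes, octaves):
    """Shift every MIDI note number by a whole number of octaves (12 semitones each)."""
    return [note + 12 * octaves for note in midi_notes]


def midi_to_hz(note, a4_hz=440.0):
    """Convert a MIDI note number to a frequency in Hz (equal temperament)."""
    return a4_hz * 2.0 ** ((note - 69) / 12)


# Illustrative notes only, not the actual Synthesizer Song score.
soprano_phrase = [72, 74, 76]
tenor_phrase = transpose_octaves(soprano_phrase, -1)
bass_baritone_phrase = transpose_octaves(soprano_phrase, -2)
```

Each octave down halves the fundamental frequency, so the bass-baritone version sounds at a quarter of the soprano frequencies.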
Programme
The papers and materials submitted to the Synthesis of Singing Challenge did not follow the regular review procedure, but were selected by the session organizer as suitable for this session.
During the session, judgments were given by 60 voters from the audience (of 150 people), partly through an electronic voting system and partly on a paper form. The average scores are presented below. It should be noted that the audience had a difficult task, since not all systems produced both a baritone and a soprano version, and the quality of the voices used could differ considerably (with weaker results for the female voice). Also, the speech-to-singing systems had a considerably different starting position from the text-to-singing systems.
The Synthesizer Song sung by Lieve Geuens during the Special Session (.mov; 30 MB)
A system for the synthesis of singing on the basis of an articulatory speech synthesizer is presented. To enable the synthesis of singing, the speech synthesizer was extended in many respects. Most importantly, a rule-based transformation of a musical score into a gestural score for articulatory gestures was developed. Furthermore, a pitch-dependent articulation of vowels was implemented. The results of these extensions are demonstrated by the synthesis of the canon “Dona nobis pacem”. The two voices in the canon were generated with the same underlying articulatory models and the same musical score, the only difference being that their pitches differ by one octave.
Note: See http://www.vocaltractlab.de/ for background information, including a download of the "Vocal Tract Laboratory", an interactive multimedia software tool to demonstrate the mechanism of speech production (in due course).
Takeshi Saitou1, Masataka Goto1, Masashi Unoki2, and Masato Akagi2 [1 National Institute of Advanced Industrial Science and Technology (AIST), Japan; 2 School of Information Science, Japan Advanced Institute of Science and Technology, Japan], "Vocal Conversion from Speaking Voice to Singing Voice Using STRAIGHT"
A vocal conversion system that can synthesize a singing voice given a speaking voice and a musical score is proposed. It is based on the speech manipulation system STRAIGHT [1], and comprises three models controlling three acoustic features unique to singing voices: the F0, duration, and spectral envelope. Given the musical score and its tempo, the F0 control model generates the F0 contour of the singing voice by controlling four F0 fluctuations: overshoot, vibrato, preparation, and fine fluctuation. The duration control model lengthens the duration of each phoneme in the speaking voice by considering the duration of its musical note. The spectral control model converts the spectral envelope of the speaking voice into that of the singing voice by controlling both the singing formant and the amplitude modulation of formants in synchronization with vibrato. Experimental results showed that the proposed system could convert speaking voices into singing voices whose quality resembles that of actual singing voices.
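To make the idea of an F0 control model more concrete, here is a deliberately simplified sketch. It is not the authors' model and does not use STRAIGHT: it covers only a flat target pitch per note plus sinusoidal vibrato, and leaves out overshoot, preparation, and fine fluctuation. The function name, the frame rate, and the vibrato rate and depth are all illustrative assumptions.

```python
import math


def f0_contour(notes, frame_rate=200, vibrato_rate_hz=5.5, vibrato_depth_cents=50.0):
    """Return one F0 value (Hz) per frame for a list of (frequency_hz, duration_s) notes.

    Each note is a flat target pitch modulated by a sinusoidal vibrato.
    All default parameter values are illustrative assumptions.
    """
    contour = []
    for freq_hz, duration_s in notes:
        n_frames = round(duration_s * frame_rate)
        for i in range(n_frames):
            t = i / frame_rate  # time since note onset; vibrato phase restarts per note
            cents = vibrato_depth_cents * math.sin(2 * math.pi * vibrato_rate_hz * t)
            contour.append(freq_hz * 2.0 ** (cents / 1200))
    return contour


# Two half-second notes (roughly A3 and B3), given directly in Hz.
contour = f0_contour([(220.0, 0.5), (246.94, 0.5)])
```

Expressing the vibrato depth in cents keeps the modulation perceptually symmetric around the target pitch regardless of the note's frequency.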
The technique used for this composition is a semi-automatic system for speech-to-chant conversion. The transformation is performed using an implementation of shape-invariant signal modifications in the phase vocoder and a recent technique for envelope estimation known as True Envelope estimation. We first describe the compositional idea and give an overview of the preprocessing steps that were required to identify the parts of the speech signal that can be used to carry the singing voice. Furthermore, we describe the envelope processing that was used to continuously transform the original voice of the actor into different female singing voices.
The song submitted here to the “Synthesis of Singing Challenge” is synthesized by the latest version of the singing synthesizer “Vocaloid”, which is now commercially available. In this paper, we present an overview of Vocaloid, its product lineup, a description of each component, and the synthesis technique used in Vocaloid.
In this paper we describe the different investigations that are part of the development of a new singing digital musical instrument, adapted to real-time performance. These concern the improvement of low-level synthesis modules, mapping strategies underlying the development of a coherent and expressive control space, and the building of a concrete bimanual controller.
Sten Ternström, Johan Sundberg [Department of Speech, Music and Hearing, School of Computer Science and Communication, Kungliga Tekniska Högskolan, Sweden], "Formant-based synthesis of singing"
Rule-driven formant synthesis is a legacy technique that still has certain advantages over currently prevailing methods. The memory footprint is small and the flexibility is high. Using a modular, interactive synthesis engine, it is easy to test the perceptual effect of different source waveform and formant filter configurations. The rule system allows the investigation of how different styles and singer voices are represented in the low-level acoustic features, without changing the score. It remains difficult to achieve natural-sounding consonants and to integrate the higher abstraction levels of musical expression.
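As a rough illustration of the source-filter idea behind formant synthesis, the sketch below passes an impulse train through a cascade of two-pole resonators. It is not the synthesis engine described in the paper: the impulse-train source is a crude stand-in for a glottal waveform, the formant frequencies and bandwidths are approximate values chosen for illustration, and all function names are hypothetical.

```python
import math


def resonator_coeffs(freq_hz, bandwidth_hz, sample_rate):
    """Coefficients of a two-pole digital resonator with unity gain at 0 Hz."""
    c = -math.exp(-2 * math.pi * bandwidth_hz / sample_rate)
    b = (2 * math.exp(-math.pi * bandwidth_hz / sample_rate)
         * math.cos(2 * math.pi * freq_hz / sample_rate))
    a = 1 - b - c
    return a, b, c


def resonate(signal, freq_hz, bandwidth_hz, sample_rate):
    """Filter a signal through one resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    a, b, c = resonator_coeffs(freq_hz, bandwidth_hz, sample_rate)
    y1 = y2 = 0.0  # previous two output samples
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(f0_hz, formants, duration_s=0.5, sample_rate=16000):
    """Excite a cascade of formant resonators with an impulse train at f0_hz."""
    period = round(sample_rate / f0_hz)  # samples between impulses
    n = int(duration_s * sample_rate)
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq_hz, bandwidth_hz in formants:
        signal = resonate(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal


# Rough /a/-like (frequency, bandwidth) pairs in Hz, chosen for illustration only.
samples = synthesize_vowel(110.0, [(700, 80), (1200, 90), (2600, 120)])
```

Because the whole voice is described by a handful of numbers per formant, swapping the source or the filter settings is a one-line change, which reflects the small footprint and flexibility noted above.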
Contact
Session organizer:
Gerrit Bloothooft
UiL-OTS, Utrecht University, The Netherlands