SPEECH-TO-SINGING SYNTHESIS: …

2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics October 21-24, 2007, New Paltz, NY
Takeshi Saitou, Masataka Goto, Masashi Unoki, and Masato Akagi
National Institute of Advanced Industrial Science and Technology (AIST), 1-1-1 Umezono, Tsukuba, Ibaraki 305-8568, Japan, {saitou-t,m.goto}[at]aist.go.jp
School of Information Science, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, Ishikawa 923-1292, Japan, {unoki,akagi}[at]jaist.ac.jp
ABSTRACT

This paper describes a speech-to-singing synthesis system that can synthesize a singing voice, given a speaking voice reading the lyrics of a song and its musical score. The system is based on the speech manipulation system STRAIGHT and comprises three models controlling three acoustic features unique to singing voices: the fundamental frequency (F0), phoneme duration, and spectrum. Given the musical score and its tempo, the F0 control model generates the F0 contour of the singing voice by controlling four types of F0 fluctuations: overshoot, vibrato, preparation, and fine fluctuation. The duration control model lengthens the duration of each phoneme in the speaking voice by considering the duration of its musical note. The spectral control model converts the spectral envelope of the speaking voice into that of the singing voice by controlling both the singing formant and the amplitude modulation of formants in synchronization with vibrato. Experimental results show that the proposed system can convert speaking voices into singing voices whose naturalness is almost the same as actual singing voices.

1. INTRODUCTION

The goal of this research is to synthesize natural singing voices by controlling the acoustic features unique to them. Most previous research approaches [1, 2, 3] have focused on text-to-singing (lyrics-to-singing) synthesis, which generates a singing voice from scratch, as speech is generated in text-to-speech synthesis. On the other hand, our approach focuses on speech-to-singing synthesis, which converts a speaking voice reading the lyrics of a song to a singing voice given its musical score. Research on speech-to-singing synthesis is important for investigating the acoustic differences between speaking and singing voices. It will also be useful for developing practical applications for computer-based music productions where the pitch of singing voices is often manipulated (corrected or intentionally modified) [4] but their naturalness is sometimes degraded. Our research will make it possible to manipulate singing voices while keeping their naturalness. In addition, speech-to-singing synthesis itself is interesting for end users because even if the original speaker of a speaking voice is not good at singing, end users, including the speaker, can listen to the converted good singing voice having the speaker's voice timbre.

Although many studies have investigated the acoustic features unique to singing voices [5, 6] and their perceptual effects [7, 8, 9, 10], few have investigated the acoustic differences between speaking and singing voices [7, 11]. For example, by modifying (deteriorating) one of the two main acoustic features (the F0 contour [8, 10] and the spectrum [7]) of singing voices, the perceptual effect of each feature has been individually investigated, but there has been no comparison between those two features in terms of their perceptual contributions. Although Ohishi et al. [11] developed a method for automatically discriminating between speaking and singing voices, they did not attempt speech-to-singing synthesis. In our preliminary study [7], we found that a speaking voice could potentially be converted to a singing voice by manually controlling its three acoustic features: the F0, phoneme duration, and spectrum. In that work, we hand-tuned those control parameters by trial and error; there were no acoustic-feature control models except for the F0 control model [8]. In addition, the naturalness of the converted singing voice was not evaluated.

We therefore propose an automatic speech-to-singing synthesis system that integrates acoustic-feature control models for the F0, phoneme duration, and spectrum. Section 2 describes the three models having experimentally optimized control parameters. Section 3 shows experimental results indicating that converted singing voices are natural enough compared to actual singing voices and that the perceptual contribution of the F0 control is stronger than that of the spectral control. Finally, Section 4 summarizes the contributions of this research.

This research was supported in part by CrestMuse, CREST, JST.
978-1-4244-1619-6/07/$25.00 ©2007 IEEE

2. SPEECH-TO-SINGING SYNTHESIS SYSTEM

A block diagram of the proposed speech-to-singing synthesis system is shown in Fig. 1. The system takes as the input a speaking voice reading the lyrics of a song, the musical score of a singing voice, and their synchronization information in which each phoneme of the speaking voice is manually segmented and associated with a musical note in the score. This system converts the speaking voice to the singing voice in six steps by: (1) decomposing the speaking voice into three acoustic parameters, namely the F0 contour, spectral envelope, and aperiodicity index (AP), estimated by using the analysis part of the speech manipulation system STRAIGHT [12]; (2) generating the continuous F0 contour of the singing voice from discrete musical notes by using the F0 control model; (3) lengthening the duration of each phoneme by using the duration control model; (4) modifying the spectral envelope and AP by using the spectral control model 1; (5) synthesizing the singing voice by using the synthesis part of STRAIGHT; and (6) modifying the amplitude of the synthesized voice by using the spectral control model 2.

2.1. F0 control model

When converting a speaking voice to a singing voice, the F0 contour of the speaking voice is discarded and the target F0 contour
of the singing voice is generated by using the musical notes of a song. The target F0 contour should have the following characteristics: (a) global F0 changes that correspond to the musical notes and (b) local F0 changes that include F0 fluctuations. There are four types of F0 fluctuations, which are defined as follows:
1. Overshoot: a deflection exceeding the target note after a note change [13].
2. Vibrato: a quasi-periodic frequency modulation (4-7 Hz) [14].
3. Preparation: a deflection in the direction opposite to a note change observed just before the note change.
4. Fine fluctuation: an irregular frequency fluctuation higher than 10 Hz [15].
Figure 2 shows examples of these fluctuations. Our previous study [8] confirmed that all of the above F0 fluctuations are contained in various singing voices and affect the naturalness of singing voices.

[Figure 1: Block diagram of the speech-to-singing synthesis system.]

[Figure 2: Examples of F0 fluctuations (overshoot, vibrato, preparation) in the singing voice of an amateur singer, shown together with the musical notes in the musical score. Horizontal axis: Time [ms].]

Figure 3 shows a block diagram of the proposed F0 control model [8]. This model can generate the target F0 contour by adding the four types of F0 fluctuations to a score-based melody contour, which is the input of this model as shown in Fig. 3. The melody contour is described by the sum of consecutive step functions, each corresponding to a musical note.

[Figure 3: Block diagram of the F0 control model for singing voices. Blocks: melody contour, switch, overshoot (second-order damping model), vibrato (second-order oscillation model), preparation (second-order damping model), fine fluctuation (white noise through a high-pass filter), F0 contour of singing voice.]

The overshoot, vibrato, and preparation are added by using the transfer function of a second-order system represented as

$$H(s) = \frac{k}{s^2 + 2\zeta\Omega s + \Omega^2}, \tag{1}$$

where $\Omega$ is the natural frequency, $\zeta$ is the damping coefficient and $k$ is the proportional gain of the system. Here, the impulse response of $H(s)$ can be obtained as

$$h(t) = \begin{cases}
\dfrac{k}{2\sqrt{\zeta^2 - 1}} \left( \exp(\lambda_1 t) - \exp(\lambda_2 t) \right), & |\zeta| > 1 \\
\dfrac{k}{\sqrt{1 - \zeta^2}} \exp(-\zeta\Omega t) \sin\left( \sqrt{1 - \zeta^2}\, \Omega t \right), & 0 < |\zeta| < 1 \\
k t \exp(-\Omega t), & |\zeta| = 1 \\
k \sin(\Omega t), & |\zeta| = 0
\end{cases} \tag{2}$$

where $\lambda_1 = -\zeta\Omega + \Omega\sqrt{\zeta^2 - 1}$ and $\lambda_2 = -\zeta\Omega - \Omega\sqrt{\zeta^2 - 1}$. The above three fluctuations are represented by Eq. (2) as follows:
1. Overshoot: the second-order damping model ($0 < |\zeta| < 1$).
2. Vibrato: the second-order oscillation model ($|\zeta| = 0$).
3. Preparation: the second-order damping model ($0 < |\zeta| < 1$).
Characteristics of each F0 fluctuation are controlled by the system parameters $\Omega$, $\zeta$, and $k$. In this study, the system parameters ($\Omega$, $\zeta$, and $k$) were set to (0.0348 [rad/ms], 0.5422, 0.0348) for overshoot, (0.0345 [rad/ms], 0, 0.0018) for vibrato, and (0.0292 [rad/ms], 0.6681, 0.0292) for preparation. These parameter values were determined using the nonlinear least-squared-error method [16] to minimize errors between the generated F0 contours and actual ones.

The fine fluctuation is generated from white noise. The white noise is first high-pass-filtered and its amplitude is normalized. It is then added to the generated F0 contour having the other three F0 fluctuations. In this study, the cutoff frequency of the high-pass filter was 10 Hz, its damping rate was -20 dB/oct, and the amplitude was normalized so that its maximum is 5 Hz.

2.2. Duration control model

Because the duration of each phoneme of the speaking voice is different from that of the singing voice, it should be lengthened or shortened according to the duration of the corresponding musical note. Note that each phoneme of the speaking voice is manually segmented and associated with a musical note in the score in advance. The duration of each phoneme is determined by the kind of musical note (e.g., crotchet or quaver) and the given local tempo.

Figure 4 shows a schema of the duration control model. This model assumes that each segmented boundary between a consonant and a succeeding vowel consists of a consecutive combination of a consonant part, a boundary part, and a vowel part. The boundary part occupies a region ranging from 10 ms before the boundary to 30 ms after the boundary, so its duration is 40 ms. The three parts are controlled as follows:
1. The consonant part is lengthened according to fixed rates that were determined experimentally by comparing speaking and singing voices (1.58 for a fricative, 1.13 for a plosive, 2.07 for a semivowel, 1.77 for a nasal, and 1.13 for a /y/).
2. The boundary part is not lengthened.
3. The vowel part is lengthened so that the duration of the whole combination corresponds to the note duration.

[Figure 4: Schema of the duration control model. Panels for the speaking voice and the singing voice (Frequency [Hz] versus Time [ms]) around the boundary between a consonant and a vowel, with Tc: consonant duration, Tb: boundary duration (40 ms), Tv: vowel duration, and k: lengthening rate. The durations are related by Tc_sig = k Tc_spk, Tb_sig = Tb_spk, and Tv_sig = Note duration - (Tc_sig + Tb_sig).]

2.3. Spectral control model

To generate the spectral envelope of the singing voice, the spectral envelope of the speaking voice is modified by controlling the spectral characteristics unique to singing voices as reported in the previous works [9, 17]. Sundberg [9] showed that the spectral envelope of a singing voice has a remarkable peak called the "singing formant" near 3 kHz. Oncley [17] reported that the formant amplitude of a singing voice is modulated in synchronization with the frequency modulation of each vibrato in the F0 contour. Figure 5 shows examples of the singing formant, and Fig. 6 shows an example where the formant amplitude in the lower panel as well as the amplitude envelope in the upper panel is modulated in synchronization with the frequency modulation of the F0 contour. Our previous study [7] also confirmed that these two types of acoustic features are contained in various kinds of singing voices and that they affect how a singing voice is perceived.

[Figure 5: Examples of singing formant near 3 kHz. Curves: tenor (singing), tenor (speaking), Japanese ballad (singing). Axes: Frequency [Hz], Amplitude [dB].]

[Figure 6: Example of formant amplitude modulation (AM) in synchronization with vibrato of the F0. Upper panel: F0 and amplitude envelope. Lower panel: formant amplitude. Horizontal axis: Time [ms].]

As shown in Fig. 1, the spectral envelope of the speaking voice is modified by two spectral control models (1 and 2) corresponding to the two acoustic features. The spectral control model 1 adds the singing formant to the speaking voice by emphasizing the peak of the spectral envelope and the dip of the aperiodicity index (AP) at about 3 kHz during vowel parts of the speaking voice. The peak of the spectral envelope can be emphasized by the following equation:

$$S_{sg}(f) = W_{sf}(f) S_{sp}(f), \tag{3}$$

where $S_{sp}(f)$ and $S_{sg}(f)$ are the spectral envelopes of the speaking and singing voices, respectively. $W_{sf}(f)$ is a weighting function for emphasizing the formant in $S_{sp}(f)$ and represented as

$$W_{sf}(f) = \begin{cases}
(1 + k_{sf}) \left( 1 - \cos\left( 2\pi \dfrac{f}{F_b + 1} \right) \right), & |f - F_s| \le \dfrac{F_b}{2} \\
1, & \text{otherwise}
\end{cases} \tag{4}$$

where $F_s$ is the frequency of the peak in $S_{sp}(f)$ near 3 kHz, $F_b$ is the bandwidth for the emphasis, and $k_{sf}$ is the gain for adjusting the degree of emphasis. In this study $F_b$ was set to 2000 Hz, and $k_{sf}$ was set to emphasize $F_s$ by 12 dB. These values were determined by analyzing the characteristics of singing formants in several singing voices [15]. The dip of AP can also be emphasized in the same way.

After synthesizing the singing voice, the spectral control model 2 adds the corresponding AM to the amplitude envelope of the synthesized singing voice. During each vibrato in the generated F0 contour, the AM is added as follows:

$$E_{sg}(t) = \left( 1 + k_{am} \sin(2\pi f_{am} t) \right) E_{sp}(t), \tag{5}$$

where $E_{sp}(t)$ and $E_{sg}(t)$ are the amplitude envelopes of the speaking and singing voices, respectively. $f_{am}$ is the rate (frequency) of AM and $k_{am}$ is the extent (amplitude) of AM. In this study, $f_{am}$ and $k_{am}$ were set to 5.5 Hz and 0.2, respectively. These values were determined by considering the characteristics of the vibrato generated by the F0 control model.

3. EVALUATION

We examined the performance of the proposed speech-to-singing synthesis system by evaluating the quality of synthesized singing voices in a psychoacoustics experiment. In this experiment, perceptual contributions of the F0 control and the spectral control were also investigated.
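
As an illustration of the second-order model in Section 2.1, the following minimal Python sketch evaluates the impulse response of Eq. (2) and filters a unit step of a melody contour through the overshoot model. The function names, the 1 ms sampling step, the 600 ms truncation of the impulse response, and the direct convolution are assumptions made for this sketch and are not specified in the paper. How the overshoot, vibrato, and preparation outputs are switched and summed (Fig. 3) is not reproduced here.

```python
import math

def impulse_response(t_ms, omega, zeta, k):
    """Impulse response h(t) of Eq. (2); t_ms in ms, omega in rad/ms, zeta >= 0."""
    if zeta > 1.0:  # overdamped case
        root = math.sqrt(zeta * zeta - 1.0)
        lam1 = -zeta * omega + omega * root
        lam2 = -zeta * omega - omega * root
        return k / (2.0 * root) * (math.exp(lam1 * t_ms) - math.exp(lam2 * t_ms))
    if zeta == 1.0:  # critically damped case
        return k * t_ms * math.exp(-omega * t_ms)
    if zeta == 0.0:  # oscillation model (vibrato)
        return k * math.sin(omega * t_ms)
    # damping model with 0 < zeta < 1 (overshoot, preparation)
    root = math.sqrt(1.0 - zeta * zeta)
    return k / root * math.exp(-zeta * omega * t_ms) * math.sin(root * omega * t_ms)

def filter_contour(contour, omega, zeta, k, dt_ms=1.0, length_ms=600.0):
    """Convolve a sampled contour with the truncated impulse response (assumed discretization)."""
    taps = [impulse_response(i * dt_ms, omega, zeta, k) * dt_ms
            for i in range(int(length_ms / dt_ms))]
    out = []
    for n in range(len(contour)):
        acc = 0.0
        for m in range(min(n + 1, len(taps))):
            acc += taps[m] * contour[n - m]
        out.append(acc)
    return out

# Parameter values reported in Section 2.1: (Omega [rad/ms], zeta, k).
OVERSHOOT = (0.0348, 0.5422, 0.0348)
VIBRATO = (0.0345, 0.0, 0.0018)

# Hypothetical melody contour: a unit step (note change) at t = 100 ms, sampled every 1 ms.
melody = [0.0] * 100 + [1.0] * 900
f0 = filter_contour(melody, *OVERSHOOT)
peak = max(f0)                       # largest value reached after the note change
peak_time_ms = f0.index(peak) - 100  # delay of that peak relative to the note change
vibrato_rate_hz = VIBRATO[0] * 1000.0 / (2.0 * math.pi)  # rad/ms converted to Hz
```

With the reported overshoot parameters, the filtered step exceeds the target by roughly 13% about 107 ms after the note change and then settles on the target, because k equals the natural frequency and the gain at zero frequency is therefore one. The reported vibrato natural frequency of 0.0345 rad/ms corresponds to about 5.5 Hz, which lies inside the 4-7 Hz range given for vibrato.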
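
The fine fluctuation of Section 2.1 (high-pass-filtered white noise normalized to a maximum of 5 Hz) can be sketched as follows. This is only an approximation under stated assumptions: a first-order high-pass filter (about 6 dB/oct) stands in for the filter with the reported -20 dB/oct damping rate, the noise is Gaussian, and the function name, the seed, and the 1 kHz contour sampling rate are hypothetical choices.

```python
import math
import random

def fine_fluctuation(n_samples, sample_rate_hz, cutoff_hz=10.0, peak_hz=5.0, seed=0):
    """Return high-pass-filtered white noise scaled so that its maximum magnitude is peak_hz."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

    # First-order RC high-pass filter (assumed stand-in, not the paper's filter design).
    dt = 1.0 / sample_rate_hz
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = rc / (rc + dt)
    filtered = [noise[0]]
    for n in range(1, n_samples):
        filtered.append(alpha * (filtered[-1] + noise[n] - noise[n - 1]))

    # Normalize the amplitude so that the largest deviation equals peak_hz.
    peak = max(abs(v) for v in filtered)
    return [peak_hz * v / peak for v in filtered]

# Two seconds of fluctuation for a contour sampled at 1 kHz (assumed rate).
fluctuation = fine_fluctuation(2000, 1000.0)
```

The returned values would be added sample by sample to an F0 contour that already contains the overshoot, vibrato, and preparation.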
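
The three rules of the duration control model in Section 2.2 amount to simple arithmetic, sketched below. The lengthening rates and the 40 ms boundary are the values reported above; the function names, the dictionary keys, and the convention that the tempo is given in crotchets per minute are assumptions for illustration.

```python
# Consonant lengthening rates reported in Section 2.2.
LENGTHENING_RATE = {
    "fricative": 1.58,
    "plosive": 1.13,
    "semivowel": 2.07,
    "nasal": 1.77,
    "y": 1.13,
}
BOUNDARY_MS = 40.0  # 10 ms before to 30 ms after the consonant-vowel boundary

def note_duration_ms(beats, tempo_bpm):
    """Duration of a note lasting `beats` crotchets at `tempo_bpm` crotchets per minute (assumed convention)."""
    return beats * 60000.0 / tempo_bpm

def sung_durations(consonant_class, tc_spk_ms, note_ms):
    """Return (Tc_sig, Tb_sig, Tv_sig) in ms for one consonant-vowel combination."""
    tc_sig = LENGTHENING_RATE[consonant_class] * tc_spk_ms  # rule 1: fixed rate
    tb_sig = BOUNDARY_MS                                    # rule 2: not lengthened
    tv_sig = note_ms - (tc_sig + tb_sig)                    # rule 3: vowel fills the note
    if tv_sig < 0.0:
        raise ValueError("note is too short for the lengthened consonant and boundary")
    return tc_sig, tb_sig, tv_sig

# Hypothetical example: a nasal with a 50 ms spoken consonant part, sung on a crotchet at 120 bpm.
tc, tb, tv = sung_durations("nasal", 50.0, note_duration_ms(1.0, 120.0))
```

In this example the note lasts 500 ms, so the consonant part becomes 88.5 ms, the boundary stays at 40 ms, and the vowel part takes the remaining 371.5 ms.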

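
Equation (5) of Section 2.3 can be applied to a sampled amplitude envelope as in the sketch below, using the reported values of 5.5 Hz for the AM rate and 0.2 for its extent. The function name and the sampling rate are assumptions, and the sketch modulates the whole segment passed to it, starting at zero phase; in the paper the AM is added only during each vibrato, and the phase alignment with the vibrato is not specified in this excerpt.

```python
import math

def add_vibrato_am(envelope, sample_rate_hz, f_am=5.5, k_am=0.2):
    """Apply Eq. (5): E_sg(t) = (1 + k_am * sin(2*pi*f_am*t)) * E_sp(t)."""
    out = []
    for n, e in enumerate(envelope):
        t = n / sample_rate_hz  # time in seconds from the start of the segment
        out.append((1.0 + k_am * math.sin(2.0 * math.pi * f_am * t)) * e)
    return out

# Hypothetical flat envelope of one second sampled at 1 kHz.
flat = [1.0] * 1000
modulated = add_vibrato_am(flat, 1000.0)
```

For a flat envelope the result swings between about 0.8 and 1.2 times the original amplitude, that is, by the extent of 0.2 around the unmodified level.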