Journal of Multidisciplinary Engineering Science Studies (JMESS)
ISSN: 2458-925X
Vol. 8 Issue 7, July - 2022
A Singing Text-to-Speech Conversion
Based on UTAU Software
Hung-Che Shen
Dept. of digital multimedia design
I-Shou University
Kaohsiung, Taiwan
Abstract--This paper describes a singing text-to-speech system that synthesizes a talking voice from an input singing voice and the song lyrics. The system controls four acoustic features that distinguish speaking from singing voices: pitch, phoneme duration, tempo and velocity. By changing these features of a singing voice, the system synthesizes a talking voice while retaining the timbre of the singing voice. The UTAU software was originally designed for singing voice synthesis. By setting the features of each musical note to target values for the input lyrics, a singing text-to-speech system is derived. The resulting talking voice preserves the timbre of the singing voice but has speech-like features. Currently, only Mandarin text-to-speech is implemented. Experimental results show that this singing text-to-speech system can convert singing voices into speech voices whose timbre is almost the same as that of the original singing voices and whose quality is natural.

Keywords--singing text-to-speech; UTAU; Mandarin

I. INTRODUCTION
This paper attempts to show that singing voice synthesis with the UTAU software [1] can also be used as a text-to-speech system. Singing text-to-speech is achieved by controlling the acoustic features unique to speech. In terms of signal generation, singing has a close affinity to speech. Previous studies on speech-to-singing [2,3] focus on converting the speaking voice to the singing voice. The success of this conversion also suggests that the ability to sing is a good indicator of the ability to imitate speech. Therefore, singing text-to-speech synthesis, which converts a voice singing any text (e.g., the lyrics of a song) into a speaking voice, is possible.

Research on singing text-to-speech synthesis can facilitate conversion in both directions between singing and speech voices by investigating their acoustic differences [4,5]. Manipulating singing voices for speech means that a singing synthesizer also becomes a talking synthesizer with the same timbre. Users of virtual-singer tools like UTAU may also find this synthesis technique interesting for rapping.

We propose a novel talking voice synthesis system, "Singing Text-to-Speech", that can convert a singing voice to a speaking voice while keeping the voice quality of the singer's voice bank. The conversion is based on speech analysis in the spectrogram, which allows visualization of pitch contours and helps in controlling the change from singing voice to talking voice. The conversion primarily requires four acoustic features from the input singing voice: the pitch (F0 contour), the duration of each phoneme of the lyrics, velocity and tempo. The timbre is taken directly from the UTAU voice bank. The process of singing text-to-speech conversion is shown in Fig. 1.

Fig. 1. The process of singing text-to-speech conversion

To obtain the target values of these features, Singing Text-to-Speech uses the available UTAU software (singing voice synthesis) and supplies the text of the song lyrics to obtain a talking voice. Note that the talking voice is derived from a singing voice in order to obtain natural quality. Since UTAU singing voice banks include many male and female voice types, singing text-to-speech applications extend the variety of synthesized voices that can be obtained.

II. SINGING VOICE AND TALKING VOICE
Conventional studies focus on three points to clarify the differences between singing and speaking voices: pitch contour, phoneme duration and power. These acoustic differences are explained below, using the example of spectrogram analysis shown in Fig. 2.

Fig. 2. The process of singing text-to-speech conversion
There are three characteristic differences between the F0 contours of speech and singing voices [6]:
(a) The dynamic range of speech F0 contours is wider than that of singing voices, while the singing voice has a higher pitch than speech.
(b) In the singing voice, each steady state of the F0 contour corresponds to a note, and the note changes in the F0 contour correspond to the melody.
(c) Many F0 fluctuations are observed only in singing voices, while speech has tonal pitch bends.

In short, singing consists primarily of notes that follow a consistent tempo and have mostly flat pitch that can be written down on a music sheet, and UTAU's interface reflects this. In speech, by contrast, the tempo is more irregular and the pitch moves much more freely.

A. Pitch in Tones
In singing, the duration of syllables changes according to the reference score, and a syllable is not stretched uniformly. To investigate the stretch characteristics of consonants, we calculated the ratio between the duration of consonants in singing and in speech. The result shows that the stretch ratio of a consonant is stable and depends on the type of consonant. Mandarin Chinese is a tone language with four pitched tones. In speech, the pitch variation inside a syllable depends on its tone. A composer should be able to find a melody that matches the tones of the lyrics, or vice versa. Unfortunately, not every melody matches its lyrics perfectly. Fig. 3 shows the ranges of pitch variation in speech and singing. It can be observed that, in singing, the pitch inside a syllable is stable and independent of the syllable's tone. The pitch variation range inside a syllable in singing is about 1.6 semitones, while that of a tone 4 syllable in spoken Chinese can be 8.4 semitones.

Fig. 3. Pitch variety ranges in speech and singing

For the singing voice, a musical note corresponds to a steady state of the pitch contour. A musical score therefore corresponds to a pitch contour with a step-like shape [7], as shown in Figure 1. For the speaking voice, the pitch contour has a fluid shape with a low frequency at the beginning and end of each utterance.

B. Phoneme Duration
For the singing voice, the duration of each phoneme changes in accordance with the musical score. For the speaking voice, on the other hand, the durations of the phonemes are relatively similar. To be precise, corresponding consonant parts have approximately the same length, the boundaries between a consonant and the succeeding vowel also have approximately the same length, and vowel parts have different lengths.

It has been reported that, in singing, note onsets are located at vowel onsets rather than at consonant onsets. The phonemes of the lyrics must therefore be distributed among the notes so that the transitions between notes coincide with the onset of the vowel or group of vowels. Within one syllable, the consonants located before the first vowel are thus pronounced within the previous note interval. The result is a redefinition of the syllable borders.

C. Accents
Mandarin is a tonal language. The perceptual correlates of phrase accent in Mandarin include pitch and timing. That is, words marked by a larger fundamental frequency range and a longer duration are generally perceived as carrying phrase accent. These two features are the acoustic correlates of pitch and timing, respectively. Recent perceptual experiments and acoustic studies showed that timing might serve as the primary cue to prominence, and that the presence of prominence increases word duration.

D. Tempo
It is well known that each Mandarin character is pronounced as one syllable. A Mandarin speaking rate of 120 words per minute therefore equals 120 syllables sung at 120 BPM, i.e., one syllable per beat.

E. Velocity
For the singing voice, velocity changes are synchronized with pitch. For the speaking voice, the power always varies continuously.

III. MAKING UTAU TALK
The singing-to-speaking synthesis system has the following input and output:
Input: Singing voice and lyrics of the song.
Output: Synthesized speaking voice.
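The two numeric comparisons made in Section II (pitch variation in semitones, and speaking rate expressed as a tempo) can be checked with a short sketch. It assumes the standard equal-tempered definition of a semitone (12 per octave) and one syllable per note; the function names are illustrative only and are not part of UTAU or PRAAT.

```python
import math


def semitones_between(f_low_hz: float, f_high_hz: float) -> float:
    """Size of the interval between two frequencies, in semitones."""
    return 12.0 * math.log2(f_high_hz / f_low_hz)


def frequency_ratio(semitones: float) -> float:
    """Frequency ratio spanned by an interval of the given size."""
    return 2.0 ** (semitones / 12.0)


def syllables_per_minute(bpm: float, syllables_per_beat: float = 1.0) -> float:
    """Syllable rate when each note carries one syllable (assumption)."""
    return bpm * syllables_per_beat


# Pitch variation ranges quoted in the text, expressed as frequency ratios.
sung_ratio = frequency_ratio(1.6)    # roughly a 10% change in F0
tone4_ratio = frequency_ratio(8.4)   # roughly a 62% change in F0

# One syllable per beat at 120 BPM gives 120 syllables per minute.
rate = syllables_per_minute(120)
```

Seen this way, a spoken tone 4 syllable sweeps a far larger share of the F0 range than a sung syllable does, which is the gap the pitch bend step in Section III has to reintroduce.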
The voice conversion is achieved by changing the characteristics of three acoustic features, i.e., phoneme duration, F0 contour, and power, into the characteristics of acoustic features generated by TTS. These three features are chosen because they are the main differences between singing and speaking voices, as discussed in Section II. The system extracts the three acoustic features by comparing the singing voice and the speech voice using the spectrogram analysis program PRAAT [8,9]. The following procedure is used to make UTAU talk.

A. Set the Tempo
The tempo can be set to any value, but we often find it easier to make it faster than the default 120 BPM. For talking speed, the tempo is usually set to 180 BPM, since a syllable can comfortably take up an entire eighth note at that speed. The BPM may be raised or lowered depending on the emotion of the talking voice. For this example, the tempo is set to 175 BPM.

B. Entering Notes and Lyrics
For singing text-to-speech, the first step is to lay out the number of notes needed for the string of talking words. A music score editor that accepts a MIDI file and lyric input was used as the front-end user interface for this synthesis system. In Fig. 4, the Mandarin phrase has 12 syllables, so 12 notes are placed and the lyrics are then typed in. For this particular example, all the notes are quarter notes. We can play it back to make sure it is at the speed we want, and adjust the tempo as needed.

Fig. 4. The process of singing text-to-speech conversion

In general, the duration of a syllable tends to be longer at the beginning of a Mandarin phrase, and as the phrase gets longer, the duration of the syllable at the end of the phrase tends to be shorter. Therefore, as shown in Fig. 5, the piano view editor of the UTAU software can be used to adjust the duration of each syllable.

Fig. 5. The process of singing text-to-speech conversion

C. Spacing Out Notes
Next, we adjust the notes in relation to each other to make them sound more natural for speech. Unlike singing, speech does not fit comfortably into a consistent rhythm, so some notes will fall outside the boundaries of half notes, quarter notes, eighth notes, etc. Make sure the default quantization and the length quantization are both set to 1/64; otherwise the notes cannot be moved around so easily.

D. Pitch Bending
There are four tones in Mandarin Chinese, which differ from each other in their pitch changes. As shown in Table I, every syllable in Mandarin can take one of the four tones, and each tone can represent a different meaning.

TABLE I. THE FOUR MANDARIN TONES
Type    Syllable  Tone         Gloss
Tone 1  Ma1       High level   "mother"
Tone 2  Ma2       Rising       "hemp"
Tone 3  Ma3       Low-falling  "horse"
Tone 4  Ma4       Falling      "scold"

Fig. 6 shows that UTAU has a plugin called the pitch bend editor. Control points are used to shape the four Mandarin tones so that they sound nice and natural.

Fig. 6. Pitch bend editor is used to create the four Mandarin tones

E. Moving Notes Around
Now that the result is starting to resemble speech, the last step is to move notes around vertically. Throughout this demonstration, all notes have been kept on a single pitch, C3 on the piano roll. This was intentional: it allows us to choose which parts of the phrase are stressed once the placement of the pitch bends has been decided.

Unfortunately, doing this properly relies mostly on intuition; without it, the process is largely trial and error until the result sounds good. After some experimentation, this was the result we obtained.

IV. EXPERIMENTS
We evaluated our system by conducting two psychoacoustic experiments. First, we compared the timbre of synthesized speaking and singing voices and evaluated the perceptual similarity of their voice timbre. Second, we evaluated the naturalness of synthesized speaking voices when the mean F0 was varied. For the evaluation of Mandarin MIDI-to-Singing [10], the synthesized songs are available online at http://sing.dmd.isu.edu.tw/en/mandarin.html.
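The tempo and quantization settings described in Section III (Set the Tempo, Spacing Out Notes) reduce to simple arithmetic, sketched here. The sketch assumes 480 ticks per quarter note, a common resolution for UTAU project files and MIDI; that constant and all function names are assumptions for illustration, not values taken from the paper.

```python
TICKS_PER_QUARTER = 480  # assumption: typical UTAU/MIDI resolution


def note_seconds(length_ticks: int, bpm: float) -> float:
    """Duration of a note of the given tick length at the given tempo."""
    beats = length_ticks / TICKS_PER_QUARTER
    return beats * 60.0 / bpm


def grid_ticks(division: int) -> int:
    """Tick size of a 1/division note, e.g. division=64 for 1/64 notes."""
    return TICKS_PER_QUARTER * 4 // division


def snap_length(length_ticks: float, division: int = 64) -> int:
    """Round a length to the nearest grid step, never below one step."""
    step = grid_ticks(division)
    return max(step, round(length_ticks / step) * step)


eighth = TICKS_PER_QUARTER // 2
eighth_at_180 = note_seconds(eighth, 180)              # one syllable per eighth note
quarter_at_175 = note_seconds(TICKS_PER_QUARTER, 175)  # the example tempo
finest_step = grid_ticks(64)                           # 30 ticks under the assumption
shortened = snap_length(200)                           # a length that is off the 1/8 grid
```

Under these assumptions, a 1/64 grid lets a syllable be lengthened or shortened in steps of roughly 21 ms at 175 BPM, which is fine enough to break the regular note rhythm that makes the output sound sung.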