Vocal Inflection, Part I
Feb 19th, 2008 | By Trevor Baca | Category: TTS | Text to Speech
Communications-enabled business processes (CEBP) take many forms. Think school- and jobsite-closing messages broadcast simultaneously and automatically to many phones at once some morning when there’s bad weather and you get the idea.
When we at Jaduka collaborate with clients on a new CEBP improvement project, the question of text-to-speech, or TTS, frequently comes up. Not all CEBP improvement projects need TTS. But some can benefit from careful TTS somewhere. Our general advice is to be smart about TTS — make sure you need it and then use it sparingly. And we find that we sometimes have to go back over this point because executives tend to want TTS even when they don’t need it. Think Flash webpage intros in the 1.0 bubble.
This post introduces vocal inflection as part of our continued series on what makes TTS tricky.
Vocal inflection — aka intonation, aka tone of voice, aka prosody — is the combination of pitch, loudness, speed, pauses, stops and starts that modulate words up and down during speech. Speakers of all languages make use of vocal inflection, though unevenly. As far as we have data, English makes greater use of vocal inflection than most other languages (leading German, for example, with French probably near the bottom of the list of languages that admit a wide array of different inflection patterns), though this may change as better data appear.
The array of different inflection patterns in English is enormous. For reasons of space, we’ll go over just a single example set here. And, as you’re reading, imagine you’re a developer responsible for coming up with some sort of algorithm to handle this sort of thing programmatically … precisely like a good TTS system.
First, here are some tones in English. Click on each of these to listen to the sound of the word “up” in each example:
Example #1: “She put it up.”
Example #2: “Did she put it up?”
Example #3: “She didn’t put it up but down.”
It generally comes as no surprise to native speakers that the “up” in example #2 occurs with an up-tone because we are taught that the voice rises on a question (though this is far from always the case). But it generally does come as a surprise that the “up” in example #1 occurs with a down-tone. And it is astonishing indeed that English makes very regular use of the compound falling-rising tone on the “up” in example #3. Listen to the examples again and hear the falling tone in #1, the rising tone in #2, and the falling-rising tone in #3.
These things are easier to hear side-by-side. So thanks to the wonder of the audio editor, we’ve cut the three different “ups” out and spliced them together here. We’ve also added accent marks:
Example 4: “Up … Up … Up.”
These results are surprising because we’re not used to thinking about vocal inflection in English as an independent phenomenon. It’s just something we kinda do, but that we are expected to do correctly (and that non-native speakers very frequently do not do correctly). Alan Cruttenden, in his textbook-length treatment of the subject, identifies the need for at least seven so-called “nuclear tones” for the analysis of spoken English, with an even greater number of tones required in certain special cases.
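To make the developer's problem concrete, here is a deliberately naive sketch in Python that picks a tone for one word using only the three patterns from examples #1 through #3. The function name `guess_tone`, the `NEGATION` pattern, and the rules themselves are our own illustrative assumptions. This is not how Mike or any real TTS engine works, and it ignores the other nuclear tones entirely.

```python
import re

# Hypothetical, toy rule set: covers only the three example sentences' patterns.
NEGATION = re.compile(r"\b(?:not|never|\w+n't)\b")

def guess_tone(sentence: str, focus: str) -> str:
    """Guess the tone on `focus`: 'fall', 'rise', or 'fall-rise'."""
    # Normalize curly apostrophes so "didn’t" and "didn't" look the same.
    text = sentence.lower().replace("\u2019", "'")
    words = re.findall(r"[a-z']+", text)
    focus = focus.lower()
    if focus not in words:
        raise ValueError(f"{focus!r} does not occur in the sentence")
    # Use the last occurrence of the focus word.
    i = len(words) - 1 - words[::-1].index(focus)

    # Rule 1: negation earlier in the sentence, contrast ("but") right after.
    contrast_follows = i + 1 < len(words) and words[i + 1] == "but"
    if contrast_follows and NEGATION.search(" ".join(words[:i])):
        return "fall-rise"
    # Rule 2: sentence-final word of a question. Real questions often
    # do NOT rise, so this rule is an oversimplification.
    if text.rstrip().endswith("?") and i == len(words) - 1:
        return "rise"
    # Rule 3: everything else is treated as a plain declarative.
    return "fall"

if __name__ == "__main__":
    for s in ("She put it up.",
              "Did she put it up?",
              "She didn’t put it up but down."):
        print(s, "->", guess_tone(s, "up"))
```

Three hand-written rules handle three hand-picked sentences and not much else, which is rather the point: punctuation and a keyword or two are weak evidence for something speakers decide from meaning and context.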
So what does this tell us? If we’re a text-to-speech robot, the data tell us that we had better be able to figure out which tone to use when. Let’s test “Mike”, the TTS robot at AT&T Labs that we introduced in an earlier post.
Example #5: Mike says, “She put it up.”
Example #6: Mike says, “Did she put it up?”
Example #7: Mike says, “She didn’t put it up but down.”
And side by side:
Example #8: Mike says, “Up … Up … Up.”
So how does Mike do?
About 1 1/2 out of 3. Mike knows, like native speakers, that the voice should rise in the question in example #6. But Mike uses exactly this same rising inflection in the declarative example #5. This isn’t wrong (it may come across as something akin to “cheerfulness”), but it’s less likely than the falling tone. Example #7 Mike gets completely wrong: Mike appears to have no contrastive falling-rising tone at all, and the substitution of a flat-low tone looks like Mike’s programmers just trying to escape the problem.
—
These are the absolute most basic cases possible of vocal inflection in English, and AT&T Labs starts off with a score of about 50% relative to a native speaker. The results are guaranteed only to get worse as we consider more tones and more sentence types.
The conclusion for voice application developers? Approach TTS with a healthy distance. Your app probably doesn’t need it. But if it does, expect TTS to be comprehensible at best, not idiomatic.

