Messing Around with Amazon Polly

By John

November 23, 2020

I heard good things about Amazon Polly and wanted to take a peek and judge for myself. It’s very easy to use. The examples below were done with the AWS CLI.

In case you’re not familiar, Polly is a text-to-speech service created by Amazon for AWS. Learn more at https://aws.amazon.com/polly/

First Example

Let's start with the standard text-to-speech engine, which uses concatenative synthesis. I generated some random text using Nietzsche Ipsum.

Some randomized pseudo philosophy

Then I fed this text into Polly like this.

aws polly synthesize-speech \
--engine "standard" \
--language-code "en-US" \
--voice-id "Matthew" \
--output-format "mp3" \
--text-type "text" \
--text 'Decrepit endless salvation salvation god revaluation deceptions evil. Horror deceptions free insofar free faith overcome disgust. Philosophy marvelous faithful virtues will joy passion chaos battle battle justice madness. Madness transvaluation transvaluation christian virtues christianity.

Pinnacle fearful grandeur mountains ultimate philosophy ascetic reason derive gains chaos. Will victorious zarathustra salvation endless inexpedient society self overcome ultimate. Decrepit philosophy play spirit insofar law enlightenment strong. Law contradict of superiority abstract. Faithful depths.' \
"polly-nietzsche-mathew.mp3"

And it sounds like this.

Polly example using Matthew and Nietzsche text
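
Long passages like this one are awkward to quote inline. As a rough sketch, the AWS CLI's file:// prefix can load a parameter value from a local file instead. The helper name synthesize_from_file and the file nietzsche.txt are hypothetical, chosen only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the AWS CLI): synthesize the contents of a
# plain-text file with the standard engine.
# $1 = voice id, $2 = path to a plain-text file, $3 = output MP3 file
synthesize_from_file() {
  local voice="$1"
  local textfile="$2"
  local outfile="$3"
  aws polly synthesize-speech \
    --engine "standard" \
    --language-code "en-US" \
    --voice-id "$voice" \
    --output-format "mp3" \
    --text-type "text" \
    --text "file://$textfile" \
    "$outfile"
}

# Usage (needs configured AWS credentials; nietzsche.txt is an assumed file):
# synthesize_from_file "Matthew" "nietzsche.txt" "polly-nietzsche-from-file.mp3"
```

Keeping the text in a file also sidesteps shell quoting problems with apostrophes and double quotes.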

Standard voice examples

Here are some other examples of the standard engine at work. I picked different voices (voice-id) this time and generated the text using Obama Ipsum.

Joey

aws polly synthesize-speech \
--engine "standard" \
--language-code "en-US" \
--voice-id "Joey" \
--output-format "mp3" \
--text-type "text" \
--text 'She was born in a town on the other side of the world, in Kansas. Back home, my grandmother raised their baby and went to work on a bomber assembly line. A common dream, born of two continents. Let us be our brother'\''s keeper, Scripture tells us. He said that our economy has made "great progress" under this President. It was Islam - at places like Al-Azhar University - that carried the light of learning through so many centuries, paving the way for Europe'\''s Renaissance and Enlightenment.' \
"polly-obama-joey.mp3"

Polly Joey example with Obama text

Aditi

aws polly synthesize-speech \
--engine "standard" \
--language-code "en-US" \
--voice-id "Aditi" \
--output-format "mp3" \
--text-type "text" \
--text 'She was born in a town on the other side of the world, in Kansas. Back home, my grandmother raised their baby and went to work on a bomber assembly line. A common dream, born of two continents. Let us be our brother'\''s keeper, Scripture tells us. He said that our economy has made "great progress" under this President. It was Islam - at places like Al-Azhar University - that carried the light of learning through so many centuries, paving the way for Europe'\''s Renaissance and Enlightenment.' \
"polly-obama-aditi.mp3"

Polly standard Aditi example with Obama text

Enrique

aws polly synthesize-speech \
--engine "standard" \
--language-code "en-US" \
--voice-id "Enrique" \
--output-format "mp3" \
--text-type "text" \
--text 'She was born in a town on the other side of the world, in Kansas. Back home, my grandmother raised their baby and went to work on a bomber assembly line. A common dream, born of two continents. Let us be our brother'\''s keeper, Scripture tells us. He said that our economy has made "great progress" under this President. It was Islam - at places like Al-Azhar University - that carried the light of learning through so many centuries, paving the way for Europe'\''s Renaissance and Enlightenment.' \
"polly-obama-enrique.mp3"

Polly standard Enrique example with Obama text
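
The three commands above differ only in the voice and the output filename, so they could be collapsed into a loop. Here is a minimal sketch of that idea. The helpers synthesize_standard and output_name, and the file obama.txt, are hypothetical names introduced for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the AWS CLI): render one standard-engine MP3.
# $1 = voice id, $2 = text to speak, $3 = output MP3 file
synthesize_standard() {
  local voice="$1"
  local text="$2"
  local outfile="$3"
  aws polly synthesize-speech \
    --engine "standard" \
    --language-code "en-US" \
    --voice-id "$voice" \
    --output-format "mp3" \
    --text-type "text" \
    --text "$text" \
    "$outfile"
}

# Hypothetical helper: build a filename such as polly-obama-joey.mp3
# from a voice id by lowercasing it.
output_name() {
  local lower
  lower="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  printf 'polly-obama-%s.mp3\n' "$lower"
}

# Usage (needs configured AWS credentials; obama.txt is an assumed file
# holding the sample text):
# text="$(cat obama.txt)"
# for voice in Joey Aditi Enrique; do
#   synthesize_standard "$voice" "$text" "$(output_name "$voice")"
# done
```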

Neural examples

Matthew

Matthew is the same voice as in the first example, but instead of concatenative synthesis it uses a neural net to synthesize the voice. To learn more about Amazon’s NTTS algorithm and its usage, see https://docs.aws.amazon.com/polly/latest/dg/NTTS-main.html

aws polly synthesize-speech \
--engine "neural" \
--language-code "en-US" \
--voice-id "Matthew" \
--output-format "mp3" \
--text-type "ssml" \
--text '<speak>Back home, my grandmother raised their baby and went to work on a bomber assembly line. A common dream, born of two continents.</speak>' \
"polly-matthew-neural.mp3"

Polly neural Matthew example with text

Olivia

aws polly synthesize-speech \
--engine "neural" \
--language-code "en-US" \
--voice-id "Olivia" \
--output-format "mp3" \
--text-type "ssml" \
--text '<speak>Back home, my grandmother raised their baby and went to work on a bomber assembly line. A common dream, born of two continents.</speak>' \
"polly-olivia-neural.mp3"

Polly neural Olivia example with text

Conversational examples

To create conversational speech we have to use SSML, the Speech Synthesis Markup Language. You may have noticed <speak> in the previous example text. That is part of SSML. It allows for more expression in your rendered speech, including high-level speaking styles. Let's listen to what conversational speech sounds like.
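
Because SSML is XML, characters such as & and < in the source text have to be escaped before the text is wrapped in <speak>. The sketch below shows one way to do that in the shell. The helper names xml_escape and ssml_wrap are hypothetical, and only &, < and > are escaped here, which is an assumption that covers text placed between tags but not inside attribute values.

```shell
#!/usr/bin/env bash
# Hypothetical helper: escape the XML special characters &, < and >.
# The ampersand is handled first so the other entities are not double-escaped.
xml_escape() {
  printf '%s' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Hypothetical helper: wrap plain text in <speak>, optionally inside an
# amazon:domain style such as "conversational" or "news".
# $1 = style name (empty string for no style), $2 = plain text
ssml_wrap() {
  local style="$1"
  local escaped
  escaped="$(xml_escape "$2")"
  if [ -n "$style" ]; then
    printf '<speak><amazon:domain name="%s">%s</amazon:domain></speak>\n' "$style" "$escaped"
  else
    printf '<speak>%s</speak>\n' "$escaped"
  fi
}

# Usage: pass the result to --text together with --text-type "ssml".
# ssml_wrap "conversational" "Ben & Jerry are here."
```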

Matthew

aws polly synthesize-speech \
--engine "neural" \
--language-code "en-US" \
--voice-id "Matthew" \
--output-format "mp3" \
--text-type "ssml" \
--text '<speak><amazon:domain name="conversational">I can also speak in a Conversational style, which simulates the tone of a friendly conversation.</amazon:domain></speak>' \
"polly-conversational-matthew.mp3"

Polly conversation Matthew example with SSML

Joanna

aws polly synthesize-speech \
--engine "neural" \
--language-code "en-US" \
--voice-id "Joanna" \
--output-format "mp3" \
--text-type "ssml" \
--text '<speak><amazon:domain name="conversational">I can also speak in a Conversational style, which simulates the tone of a friendly conversation.</amazon:domain></speak>' \
"polly-conversational-joanna.mp3"

Polly conversation Joanna example with SSML

News examples

Polly also has a great newscaster voice.

aws polly synthesize-speech \
--engine "neural" \
--language-code "en-US" \
--voice-id "Matthew" \
--output-format "mp3" \
--text-type "ssml" \
--text '<speak><amazon:domain name="news">I can also speak in a news style, which simulates the tone of a someone reading the news.</amazon:domain></speak>' \
"polly-conversational-mathew-news.mp3"

Polly example using Matthew neural news voice

Comparison

Compare conversational to news.

Joanna

aws polly synthesize-speech \
--engine "neural" \
--language-code "en-US" \
--voice-id "Joanna" \
--output-format "mp3" \
--text-type "ssml" \
--text '<speak><amazon:domain name="conversational">I can also speak in a Conversational style, which simulates the tone of a friendly conversation.</amazon:domain><amazon:domain name="news">I can also speak in a news style, which simulates the tone of a someone reading the news.</amazon:domain></speak>' \
"polly-conversational-joanna-mix.mp3"

Polly comparison example using Joanna conversation and news voices

Lupe

aws polly synthesize-speech \
--engine "neural" \
--language-code "en-US" \
--voice-id "Lupe" \
--output-format "mp3" \
--text-type "ssml" \
--text '<speak><amazon:domain name="conversational">I can also speak in a Conversational style, which simulates the tone of a friendly conversation.</amazon:domain><amazon:domain name="news">I can also speak in a news style, which simulates the tone of a someone reading the news.</amazon:domain></speak>' \
"polly-conversational-lupe-mix.mp3"

Polly comparison example using Lupe conversation and news voices

Gangster

Now let's listen to something a little more fun!

aws polly synthesize-speech \
--engine "neural" \
--language-code "en-US" \
--voice-id "Brian" \
--output-format "mp3" \
--text-type "ssml" \
--text '<speak>Lorizzle ipsum fizzle doggy amizzle, consectetuer adipiscing shut the shizzle up. Nullizzle phat velizzle, dang volutpizzle, phat izzle, gizzle away, ghetto. Pellentesque fo shizzle my nizzle tortizzle. Shizzle my nizzle crocodizzle erizzle. For sure izzle bizzle get down get down mammasay mammasa mamma oo sa tempizzle tempizzle. Mauris pellentesque nibh et turpizzle. Vestibulum izzle for sure. Pellentesque eleifend rhoncizzle . In ass habitasse ma nizzle dictumst. Donec dapibus. Doggy own yo urna, pretium eu, mah nizzle phat, eleifend vitae, nunc. Bizzle suscipizzle. Integer sempizzle velit sizzle fo shizzle.</speak>' \
"polly-lorizzle-jf.mp3"

Polly neural Brian example with lorizzle ipsum text

The text was generated at http://lorizzle.nl/

Other Voices

You can see all the available voices at https://docs.aws.amazon.com/polly/latest/dg/voicelist.html
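
The same list can probably be pulled from the command line with aws polly describe-voices. This is a small sketch. The wrapper name list_voices is hypothetical, and the --query expression assumes the response keeps its Voices array with Id and LanguageCode fields.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: print the voice ids and language codes available
# for a given engine.
# $1 = engine ("standard" or "neural")
list_voices() {
  local engine="$1"
  aws polly describe-voices \
    --engine "$engine" \
    --query 'Voices[].[Id,LanguageCode]' \
    --output text
}

# Usage (needs configured AWS credentials):
# list_voices "neural"
```

Checking a voice's language code this way is a quick way to see which voices will bring an accent when paired with en-US text.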

Conclusion

Overall it sounds great and generates realistic enough speech for many different use cases. It renders fast and the API is easy to use. It has a good selection of voices and nice styles. You may have noticed various accents in the examples above. These were created by using voices from different countries and then setting the language to en-US.

In terms of cost, it is cost-effective for basic voiceover, especially if you want to add a voice reader to your blog. For industrial applications like a high-volume speech assistant it would probably be expensive. I haven’t done a full analysis, but it seems that if you are rendering long text quite frequently you could end up with a high AWS bill. If you are just messing around you probably won’t pay more than a couple of cents.
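
For a back-of-the-envelope estimate, Polly bills per character, so the arithmetic is simple. The sketch below takes the per-million-character price as an argument. The rates in the usage comments (4 USD for standard, 16 USD for neural) are assumptions for illustration only and should be checked against the current Polly pricing page. The helper names count_chars and estimate_cost are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the characters in a string.
count_chars() {
  printf '%s' "$1" | wc -m | tr -d '[:space:]'
}

# Hypothetical helper: estimate the cost in USD.
# $1 = number of characters, $2 = ASSUMED price in USD per 1 million characters
estimate_cost() {
  awk -v chars="$1" -v rate="$2" 'BEGIN { printf "%.4f\n", chars / 1000000 * rate }'
}

# Usage, with assumed (unverified) rates:
# estimate_cost "$(count_chars "some short text")" 4    # standard engine
# estimate_cost 500000 16                               # neural engine
```

Under those assumed rates, a single 500-character sample costs a fraction of a cent, while millions of characters per day add up quickly.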