Messing Around with Amazon Polly
By John
I heard good things about Amazon Polly and wanted to take a peek and judge for myself. It’s very easy to use. The examples below were done with the AWS CLI.
In case you’re not familiar, Polly is a text-to-speech service created by Amazon for AWS. Learn more at https://aws.amazon.com/polly/
First Example
Let’s start with the standard text-to-speech engine, which uses concatenative synthesis. I generated some random text using Nietzsche Ipsum.
Then I fed this text into Polly like this.
aws polly synthesize-speech \
--engine "standard" \
--language-code "en-US" \
--voice-id "Matthew" \
--output-format "mp3" \
--text-type "text" \
--text 'Decrepit endless salvation salvation god revaluation deceptions evil. Horror deceptions free insofar free faith overcome disgust. Philosophy marvelous faithful virtues will joy passion chaos battle battle justice madness. Madness transvaluation transvaluation christian virtues christianity.
Pinnacle fearful grandeur mountains ultimate philosophy ascetic reason derive gains chaos. Will victorious zarathustra salvation endless inexpedient society self overcome ultimate. Decrepit philosophy play spirit insofar law enlightenment strong. Law contradict of superiority abstract. Faithful depths.' \
"polly-nietzsche-mathew.mp3"
And it sounds like this.
Standard voice examples
Here are some other examples of the standard engine at work. I picked different voices (voice-id) this time and generated the text using Obama Ipsum.
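The commands in this section differ only in the voice ID and the output file, so they could also be written as a loop. This is a minimal sketch rather than the exact commands used for the samples: the function name `render_voices` and the text file `obama.txt` are assumptions for illustration. It relies on the AWS CLI's `file://` prefix to read the text from a file, which also sidesteps shell quoting of apostrophes.

```shell
#!/usr/bin/env bash
# Sketch: render one text file with several standard-engine voices.
# render_voices and the output naming scheme are illustrative assumptions.
render_voices() {
  local text_file="$1"; shift
  local voice out
  for voice in "$@"; do
    # Lowercase the voice ID for the file name, e.g. Joey -> polly-obama-joey.mp3
    out="polly-obama-$(printf '%s' "$voice" | tr '[:upper:]' '[:lower:]').mp3"
    aws polly synthesize-speech \
      --engine "standard" \
      --language-code "en-US" \
      --voice-id "$voice" \
      --output-format "mp3" \
      --text-type "text" \
      --text "file://${text_file}" \
      "$out" || return 1
  done
}

# Usage (assumes obama.txt holds the sample text):
#   render_voices obama.txt Joey Aditi Enrique
```

Stopping at the first failed call keeps a bad voice ID from silently producing a partial set of files.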
Joey
aws polly synthesize-speech \
--engine "standard" \
--language-code "en-US" \
--voice-id "Joey" \
--output-format "mp3" \
--text-type "text" \
--text 'She was born in a town on the other side of the world, in Kansas. Back home, my grandmother raised their baby and went to work on a bomber assembly line. A common dream, born of two continents. Let us be our brother'\''s keeper, Scripture tells us. He said that our economy has made "great progress" under this President. It was Islam - at places like Al-Azhar University - that carried the light of learning through so many centuries, paving the way for Europe'\''s Renaissance and Enlightenment.' \
"polly-obama-joey.mp3"
Aditi
aws polly synthesize-speech \
--engine "standard" \
--language-code "en-US" \
--voice-id "Aditi" \
--output-format "mp3" \
--text-type "text" \
--text 'She was born in a town on the other side of the world, in Kansas. Back home, my grandmother raised their baby and went to work on a bomber assembly line. A common dream, born of two continents. Let us be our brother'\''s keeper, Scripture tells us. He said that our economy has made "great progress" under this President. It was Islam - at places like Al-Azhar University - that carried the light of learning through so many centuries, paving the way for Europe'\''s Renaissance and Enlightenment.' \
"polly-obama-aditi.mp3"
Enrique
aws polly synthesize-speech \
--engine "standard" \
--language-code "en-US" \
--voice-id "Enrique" \
--output-format "mp3" \
--text-type "text" \
--text 'She was born in a town on the other side of the world, in Kansas. Back home, my grandmother raised their baby and went to work on a bomber assembly line. A common dream, born of two continents. Let us be our brother'\''s keeper, Scripture tells us. He said that our economy has made "great progress" under this President. It was Islam - at places like Al-Azhar University - that carried the light of learning through so many centuries, paving the way for Europe'\''s Renaissance and Enlightenment.' \
"polly-obama-enrique.mp3"
Neural examples
Matthew
Matthew is the same voice as in the first example, but instead of concatenative synthesis it uses a neural net to synthesize the speech. To learn more about Amazon’s NTTS algorithm and its usage, see https://docs.aws.amazon.com/polly/latest/dg/NTTS-main.html
aws polly synthesize-speech \
--engine "neural" \
--language-code "en-US" \
--voice-id "Matthew" \
--output-format "mp3" \
--text-type "ssml" \
--text '<speak>Back home, my grandmother raised their baby and went to work on a bomber assembly line. A common dream, born of two continents.</speak>' \
"polly-matthew-neural.mp3"
Olivia
aws polly synthesize-speech \
--engine "neural" \
--language-code "en-US" \
--voice-id "Olivia" \
--output-format "mp3" \
--text-type "ssml" \
--text '<speak>Back home, my grandmother raised their baby and went to work on a bomber assembly line. A common dream, born of two continents.</speak>' \
"polly-olivia-neural.mp3"
Conversational examples
To create conversational speech we have to use SSML, the Speech Synthesis Markup Language. You may have noticed <speak> in the previous example text. That is part of SSML. It allows for more expression in your rendered speech, including adding high-level styles. Let’s listen to what conversational speech sounds like.
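Because SSML is XML, characters such as `&` and `<` in the input text have to be escaped before they go inside `<speak>`. Here is a small sketch of a helper that does the wrapping. `ssml_wrap` is a hypothetical function, not something the AWS CLI provides, and the style names it accepts are simply passed through to `amazon:domain` without validation.

```shell
#!/usr/bin/env bash
# Sketch: wrap plain text in SSML with an optional amazon:domain style.
# ssml_wrap is a hypothetical helper, not part of the AWS CLI.
ssml_wrap() {
  local style="$1" text="$2"
  # Escape the characters reserved in XML; & must be handled first.
  text=$(printf '%s' "$text" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g')
  if [ -n "$style" ]; then
    printf '<speak><amazon:domain name="%s">%s</amazon:domain></speak>\n' "$style" "$text"
  else
    printf '<speak>%s</speak>\n' "$text"
  fi
}

# Usage: pass the result to --text together with --text-type "ssml", e.g.
#   --text "$(ssml_wrap conversational 'Salt & pepper')"
```

Passing an empty style produces a bare `<speak>` wrapper like the one in the neural examples.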
Matthew
aws polly synthesize-speech \
--engine "neural" \
--language-code "en-US" \
--voice-id "Matthew" \
--output-format "mp3" \
--text-type "ssml" \
--text '<speak><amazon:domain name="conversational">I can also speak in a Conversational style, which simulates the tone of a friendly conversation.</amazon:domain></speak>' \
"polly-conversational-matthew.mp3"
Joanna
aws polly synthesize-speech \
--engine "neural" \
--language-code "en-US" \
--voice-id "Joanna" \
--output-format "mp3" \
--text-type "ssml" \
--text '<speak><amazon:domain name="conversational">I can also speak in a Conversational style, which simulates the tone of a friendly conversation.</amazon:domain></speak>' \
"polly-conversational-joanna.mp3"
News examples
Polly also has a great newscaster voice.
aws polly synthesize-speech \
--engine "neural" \
--language-code "en-US" \
--voice-id "Matthew" \
--output-format "mp3" \
--text-type "ssml" \
--text '<speak><amazon:domain name="news">I can also speak in a news style, which simulates the tone of a someone reading the news.</amazon:domain></speak>' \
"polly-conversational-mathew-news.mp3"
Comparison
Compare conversational to news.
Joanna
aws polly synthesize-speech \
--engine "neural" \
--language-code "en-US" \
--voice-id "Joanna" \
--output-format "mp3" \
--text-type "ssml" \
--text '<speak><amazon:domain name="conversational">I can also speak in a Conversational style, which simulates the tone of a friendly conversation.</amazon:domain><amazon:domain name="news">I can also speak in a news style, which simulates the tone of a someone reading the news.</amazon:domain></speak>' \
"polly-conversational-joanna-mix.mp3"
Lupe
aws polly synthesize-speech \
--engine "neural" \
--language-code "en-US" \
--voice-id "Lupe" \
--output-format "mp3" \
--text-type "ssml" \
--text '<speak><amazon:domain name="conversational">I can also speak in a Conversational style, which simulates the tone of a friendly conversation.</amazon:domain><amazon:domain name="news">I can also speak in a news style, which simulates the tone of a someone reading the news.</amazon:domain></speak>' \
"polly-conversational-lupe-mix.mp3"
Gangster
Now let’s listen to something a little more fun!
aws polly synthesize-speech \
--engine "neural" \
--language-code "en-US" \
--voice-id "Brian" \
--output-format "mp3" \
--text-type "ssml" \
--text '<speak>Lorizzle ipsum fizzle doggy amizzle, consectetuer adipiscing shut the shizzle up. Nullizzle phat velizzle, dang volutpizzle, phat izzle, gizzle away, ghetto. Pellentesque fo shizzle my nizzle tortizzle. Shizzle my nizzle crocodizzle erizzle. For sure izzle bizzle get down get down mammasay mammasa mamma oo sa tempizzle tempizzle. Mauris pellentesque nibh et turpizzle. Vestibulum izzle for sure. Pellentesque eleifend rhoncizzle . In ass habitasse ma nizzle dictumst. Donec dapibus. Doggy own yo urna, pretium eu, mah nizzle phat, eleifend vitae, nunc. Bizzle suscipizzle. Integer sempizzle velit sizzle fo shizzle.</speak>' \
"polly-lorizzle-jf.mp3"
The text was generated at http://lorizzle.nl/
Other Voices
You can see all the available voices at https://docs.aws.amazon.com/polly/latest/dg/voicelist.html
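The same list is available from the CLI through `describe-voices`, which is handy for checking which voices support a given engine. The wrapper below is only a sketch: `list_voices` and its default arguments are assumptions for illustration, while `--engine`, `--language-code`, `--query` and `--output` are standard AWS CLI options.

```shell
#!/usr/bin/env bash
# Sketch: list voice IDs from the CLI instead of the docs page.
# list_voices is a hypothetical wrapper; engine and language are its arguments.
list_voices() {
  local engine="${1:-neural}" language="${2:-en-US}"
  aws polly describe-voices \
    --engine "$engine" \
    --language-code "$language" \
    --query 'Voices[].Id' \
    --output text
}

# Usage:
#   list_voices                 # neural voices for en-US
#   list_voices standard en-GB  # standard voices for British English
```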
Conclusion
Overall it sounds great and generates realistic enough speech for many different use cases. It renders fast and the API is easy to use. It has a good selection of voices and nice styles. You may have noticed various accents in the examples above. These were created by using voices from different countries and then setting the language to en-US.
For basic voiceover it is cost-effective, especially if you want to add a voice reader to your blog. For industrial applications like a high-volume speech assistant it would probably be expensive. I haven’t done a full analysis, but it seems that if you are rendering long text quite frequently you could end up with a high AWS bill. If you are just messing around you probably won’t pay more than a couple of cents.
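For a rough feel of the numbers, Polly bills by the number of characters synthesized, so a back-of-the-envelope estimate is just a multiplication. The sketch below is not a full analysis: `estimate_cost` is a hypothetical helper, and the rates of 4 and 16 USD per million characters for the standard and neural engines are assumptions that should be checked against the current Polly pricing page.

```shell
#!/usr/bin/env bash
# Sketch: rough cost estimate from a character count.
# The per-million-character rates are ASSUMPTIONS; check the Polly pricing page.
estimate_cost() {
  local chars="$1" engine="${2:-standard}" rate
  case "$engine" in
    standard) rate="4.00" ;;   # assumed USD per 1M characters
    neural)   rate="16.00" ;;  # assumed USD per 1M characters
    *) echo "unknown engine: $engine" >&2; return 1 ;;
  esac
  # awk does the floating-point arithmetic that plain shell cannot.
  awk -v c="$chars" -v r="$rate" 'BEGIN { printf "%.4f\n", c / 1000000 * r }'
}

# Usage (post.txt is a placeholder file name):
#   estimate_cost "$(wc -m < post.txt)" neural
```

Under these assumed rates a blog post of a few thousand characters costs a fraction of a cent, while millions of characters a day add up quickly.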