Rob Moorman

Founder, technology consultant, architect and full stack developer
May 23


Listen to this blog post


Turn blog posts into lifelike speech with Wagtail and Amazon Polly

Synthesize speech that sounds like a human voice with AI services that use advanced deep learning technologies

A while ago Amazon Web Services launched Polly, a service that turns text into lifelike speech. Today we ran some tests with this service by building an integration for Django Wagtail. The goal is to convert the structured data (StreamFields) of our blog posts into streaming audio objects.

Polly offers quite a lot of voices across multiple languages, currently 48 in total (for us it's important that Dutch is supported too, with two voices). The first thing I noticed while playing around with the API is the latency: it's fast! Text is converted into an audio stream almost instantly, which makes real-time text-to-speech (e.g. "while typing") possible.

As core developers of Django Wagtail, we find it important to explain why Wagtail has great potential as a headless CMS. This means data from pages and/or components can easily be transformed for other media, such as text-to-speech. Wagtail already offers a great REST API which you can use. However, we don't want to read data via the REST API and transform it for this use case; we want our blog posts to be rendered as audio streams whenever a new version is published, using Wagtail hooks.

So, say hello to wagtail-speech: a tiny module that helps you render your Wagtail pages and StreamField content to audio streams with Polly.

Actually, this blog post is already using it! If you haven't noticed it yet, hit the play button at the top to listen to an audio version of this post. In this case only headers, quotes and call-to-actions are provided as text-to-speech, but you can imagine how easily you can extend this (see the usage section of our module).
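
To give an idea of what "only headers, quotes and call-to-actions" can look like in practice, here is a minimal sketch in Python. It is not the actual wagtail-speech code. It assumes the StreamField content is available as a list of dicts with "type" and "value" keys (the shape Wagtail uses when it stores a StreamField as JSON), and the block type names and the blocks_to_text helper are hypothetical.

```python
from typing import Any, Dict, Iterable, Set

# Hypothetical block type names; use the names defined in your own StreamField.
SPOKEN_BLOCK_TYPES = {"heading", "quote", "call_to_action"}


def blocks_to_text(
    blocks: Iterable[Dict[str, Any]],
    spoken_types: Set[str] = SPOKEN_BLOCK_TYPES,
) -> str:
    """Collect the plain-text value of every block that should be spoken."""
    parts = []
    for block in blocks:
        if block.get("type") not in spoken_types:
            continue  # e.g. skip rich-text paragraphs and images
        value = block.get("value")
        # Only plain string values are handled in this sketch.
        if isinstance(value, str) and value.strip():
            parts.append(value.strip())
    return "\n".join(parts)
```

Because the content is structured, adding paragraphs later is just a matter of adding another block type to the set (and stripping any markup from its value).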

You can use Amazon Polly to develop applications that increase engagement and accessibility.


Rob Moorman

Limitations

There is just one major limitation with Amazon Polly: the maximum number of characters you can provide as input. It's limited to 1500 characters per audio stream, so rendering a complete blog post, paragraphs included, as audio isn't possible in one API call at the moment. I hope this is something AWS works on in the near future. Also, the maximum audio length is currently set to 5 minutes; everything after that will be cut off.

However, making multiple API calls and concatenating the audio streams into a single file (e.g. an mp3) is possible. We just haven't had the time to get this up and running in our module (we will pick this up in the future).
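
A rough sketch of that multi-call approach in Python could look like the code below. It is not part of wagtail-speech. It assumes client is a boto3 Polly client (boto3.client("polly")) and uses its synthesize_speech call. The split_text and synthesize_post helpers are hypothetical names, "Lotte" is just one of the Dutch voices picked as a default, and the 1500 character limit is the one mentioned above. Simply appending the mp3 bytes is the naive way to join streams; it usually plays fine, but joining the parts with a proper audio tool is the safer route.

```python
import re
from typing import List

MAX_CHARS = 1500  # per-request limit quoted in this post; check the current AWS docs


def split_text(text: str, limit: int = MAX_CHARS) -> List[str]:
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-cut into pieces.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def synthesize_post(client, text: str, voice_id: str = "Lotte") -> bytes:
    """Call Polly once per chunk and append the mp3 bytes (naive concatenation)."""
    audio = b""
    for chunk in split_text(text):
        response = client.synthesize_speech(
            Text=chunk, OutputFormat="mp3", VoiceId=voice_id
        )
        audio += response["AudioStream"].read()
    return audio
```

Usage would be along the lines of synthesize_post(boto3.client("polly"), post_text), writing the returned bytes to an .mp3 file. Note that this does nothing about the 5 minute audio limit per request; shorter chunks are the only lever there.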

Speech Synthesis Markup Language

Polly gladly also supports SSML, which lets you control how your audio stream is spoken. You have probably come across text-to-speech engines where you fill in some text and the output mostly sounds like someone spitting a rap: the pronunciation and the pauses just don't make sense when you listen to it.

With SSML you can do for example:

  • Add breaks (1s, medium or very long)
  • Use substitutions (e.g. have an abbreviation read out as World Wide Web)
  • Add audio effects like whispering (imagine Polly speaking to children before bedtime)
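
As an illustration of the three options above, here is a small Python sketch that builds an SSML string. The break, sub and amazon:effect tags are the SSML tags Polly documents for pauses, substitutions and whispering; the helper functions (pause, substitute, whisper, speak) are hypothetical names made up for this example, and "WWW" is assumed as the abbreviation being substituted.

```python
from xml.sax.saxutils import escape, quoteattr


def pause(time: str = "1s") -> str:
    """A break of a fixed duration, e.g. "1s" or "500ms"."""
    return f"<break time={quoteattr(time)}/>"


def substitute(text: str, alias: str) -> str:
    """Write `text`, but have it spoken as `alias`."""
    return f"<sub alias={quoteattr(alias)}>{escape(text)}</sub>"


def whisper(text: str) -> str:
    """Polly-specific whispering effect."""
    return f'<amazon:effect name="whispered">{escape(text)}</amazon:effect>'


def speak(*parts: str) -> str:
    """Wrap already-escaped parts in the required <speak> root element."""
    return "<speak>" + "".join(parts) + "</speak>"


ssml = speak(
    escape("Welcome to the "),
    substitute("WWW", "World Wide Web"),
    pause("1s"),
    whisper("Sleep well."),
)
print(ssml)
```

To have Polly interpret the result as SSML instead of plain text, the string is passed as Text together with TextType="ssml" in the synthesize_speech call. Keep in mind that escaping matters here: a stray & or < in your content would otherwise make the SSML invalid.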

Conclusion

The service really fills the gap between easy API integration and decent audio quality, especially when it comes to pronunciation. Picking the right voice is important, as we noticed quite some difference in quality between voices. Polly makes you think about improving your accessibility, especially for large pieces of content like blog posts. If a website offers me the option to listen to a blog post instead of scrolling down and reading from my screen, I'll definitely use it.

I actually don't get why large publishing platforms like newspapers don't already use these kinds of services at scale. It would definitely improve quality and give a better user experience.

There is one thing you should take care of: making your data available in a structured way. So try to get away from storing rendered HTML content (as most WYSIWYG editors do) in your database.

Wagtail is a platform that keeps these things in mind (it is moving towards a headless CMS). That's why the wagtail-speech integration could be built in a very short period of time: the structured data is already there. This is really a big plus compared with other, more traditional CMS platforms.

Do you want to know more about the importance of structured content and its advantages?


Contact us

