Photoshop for speech (Skill Zone News, 24-Nov-2016)

Adobe has demonstrated VoCo, software which it says will become the Photoshop of speech manipulation. It is scarily good. SCARILY good.

What does the VoCo software do? It lets you assemble segments of speech by typing in words on the keyboard. That in itself isn't so revolutionary. We already have software that can synthesise speech from text. The revolutionary part is that VoCo can take a sample of about 20 minutes' worth of speech from someone, analyse that speech, break it down into speech patterns, and learn how the person speaks: the timings, the cadences, the quirks.

When you type a phrase into VoCo, it looks for those words in the speech sample. If it cannot find all the words in the sample, it can construct the missing words from word segments and basic pronunciation rules. The constructed sound wave is then polished to ensure a smooth-flowing sentence which is enunciated in the same style as the original speaker. Adobe has released a video of the demonstration.
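For readers who like to see ideas as code, the lookup-and-fallback step described above can be sketched in a few lines of Python. This is only a toy illustration and not Adobe's method: the function plan_utterance, the sample data, and the segment labels are all hypothetical, and the sketch only decides where each word would come from. It does not touch any audio or do the polishing step.

```python
def plan_utterance(phrase, recorded_words, lexicon, recorded_segments):
    """Decide, word by word, how a typed phrase could be assembled.

    All names here are hypothetical. Punctuation is not handled:
    the phrase is simply lower-cased and split on whitespace.
    """
    plan = []
    for word in phrase.lower().split():
        if word in recorded_words:
            # The speaker said this word in the sample: reuse it whole.
            plan.append((word, "whole-word", [word]))
            continue
        # Otherwise look up the word's segments in a pronunciation lexicon.
        segments = lexicon.get(word)
        if segments and all(s in recorded_segments for s in segments):
            # Every segment was heard somewhere in the sample.
            plan.append((word, "from-segments", list(segments)))
        else:
            # Unknown word, or a segment the sample never contained.
            plan.append((word, "unavailable", []))
    return plan


# Hypothetical sample data, invented for this illustration.
RECORDED_WORDS = {"the", "meeting", "is", "on", "monday"}
RECORDED_SEGMENTS = {"F", "R", "AY", "D", "EY", "M", "AH", "N"}
LEXICON = {
    "friday": ["F", "R", "AY", "D", "EY"],
    "thursday": ["TH", "ER", "Z", "D", "EY"],
}

if __name__ == "__main__":
    phrase = "The meeting is on Friday"
    for step in plan_utterance(phrase, RECORDED_WORDS, LEXICON, RECORDED_SEGMENTS):
        print(step)
```

In this sketch "friday" is never spoken in the sample, but all of its segments are, so it is marked as buildable from segments. "thursday" would be marked unavailable because some of its segments are missing from the made-up sample.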

This makes constructing a spoken message as easy as using cut and paste in a word processor. One obvious use for this sort of technology is to greatly improve the quality of the voices in interfaces such as satnavs and automated switchboards.

How might it be used in the creative world? One benefit to production companies is that if, during the editing stages, they need to change something said by one of the actors, they would not need to call the actor back in just to re-record a few seconds of dialogue. They could simply use VoCo to generate the correct phrases. Voice-over artists might find it increases their marketability. Instead of having to go to a studio to record the voice-over phrases, they could simply license their voice sample and allow ad agencies to try out numerous permutations of a phrase until they found the one that had the most impact when tested on consumers.

It is this ability not just to rearrange speech but also to create completely new sentences which is the most frightening. Already we can no longer trust photographs because they are too easily photoshopped, and even video is now so easily manipulated by CGI that we see the fantastically impossible every day. Now, with the addition of speech manipulation, we will soon no longer trust our ears either. People with malicious agendas will be able to put damning words into the mouths of politicians, celebrities, or enemies of the state. It is hard to see what good uses will come out of this.

Sadly, it will probably also be the end of creative cut-and-chop parodies like this one, where the charm is that it is so obviously fake that we know it is a parody.

24th November 2016

About SKILLZONE News
This article comes from the SKILLZONE email newsletter, published monthly since January 2008 and covering topics related to technology and the internet. All articles and artwork in the SKILLZONE newsletter are original content.