Q&A

Matthew Aylett, expert in speech synthesis technology

"The more critical we are the less likely a deep fake will fool us."

Tags: 'Matthew Aylett' 'Speech synthesis' 'Speech technology'


Reading Time: 4 minutes

Dr Matthew Aylett (October 1964) holds a BA in Computing and Artificial Intelligence from the University of Sussex, and an MSc with Distinction and PhD in Speech and Language Technology from the University of Edinburgh, where he spent five years researching speech and dialogue technologies.

He is a recognized world authority in speech technology research and development. Before returning to Scotland in late 2005 to co-found CereProc, Dr Aylett was working at the International Computer Science Institute (ICSI) at Berkeley, California. Prior to ICSI he was the senior development engineer at speech synthesis leader Rhetorical Systems (later acquired by ScanSoft), which he joined at its formation. At Rhetorical, Dr Aylett was responsible for the design, implementation, and testing of the company's core speech technology.

Could you give us a short explanation of the technology you work on?

I work with speech synthesis technology, sometimes called text-to-speech (TTS) technology. The objective is to take text and generate natural-sounding, appropriate human speech for any content. In some cases there is also a requirement to control the perceived emotion and how the artificial voice expresses itself in the output speech. This is often implemented by adding markup to the text.
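
To make the idea of markup on text concrete, here is a minimal sketch in Python. It assumes SSML, the W3C markup standard for speech synthesis, and its standard `prosody` element. The interview does not say which markup is actually used, and emotion-specific tags are typically vendor extensions that are not shown here. The function name `to_ssml` is hypothetical.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET


def to_ssml(text: str, rate: str = "medium", pitch: str = "medium") -> str:
    """Wrap plain text in a minimal SSML document with prosody hints.

    Assumption: the synthesiser accepts standard SSML 1.1; `rate` and
    `pitch` take SSML labels such as "slow"/"fast" and "low"/"high".
    """
    # Escape &, < and > so arbitrary input text cannot break the markup.
    body = escape(text)
    return (
        '<speak version="1.1" '
        'xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-GB">'
        f'<prosody rate="{rate}" pitch="{pitch}">{body}</prosody>'
        "</speak>"
    )


if __name__ == "__main__":
    ssml = to_ssml("I can't believe we won!", rate="fast", pitch="high")
    ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
    print(ssml)
```

A faster rate and higher pitch only roughly suggest excitement; how a given voice renders these hints depends on the synthesiser.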

Which opportunities and benefits arise from speech synthesis?

Computers are increasingly entering the social domain.

If we want to communicate with computer systems using voice, then they need to communicate back to us with speech.

When applied correctly, this can support new types of interaction and applications, and is particularly useful for eyes-free and hands-free contexts. In addition, we supply synthetic voices for people with communication difficulties. For example, if you have throat cancer and life-saving surgery means you can no longer speak, you could use an artificial voice and a computer application to continue to communicate verbally. In this context you might want to copy (or clone) a user’s voice before they lose the ability to speak. In this way the user can retain their character and identity within the artificial voice that is created.

What are the risks in weaponising the technology of speech synthesis?

As we get better and better at copying (cloning) voices, it opens the possibility of cloning a person’s voice without their permission, using audio that is publicly available or has been recorded for some other purpose. Often, when we communicate over the phone, our identity is assumed on the basis of how our voice sounds. By illicitly copying a person’s voice, it becomes possible to impersonate them and deceive people over the phone. Furthermore, with the increased ability to alter video to change mouth movements and then add speech created using a cloned voice, we can create fake videos which mislead or deceive.

Will deepfakes change our lives? How?

For a short period of time, analogue pictures, film and recordings were hard to manipulate and could be trusted to be accurate. With digital systems such as Photoshop, the extent to which we can depend on photos has changed. This uncertainty now exists with fake video and fake audio.

We can no longer assume that a video represents the person who is in it accurately.

This will have a big impact on the way we respond to video and audio material. Just as with text, we now also need to know who is the source of the information and whether it can be trusted.

According to Hao Li in the MIT Technology Review, "perfect and virtually undetectable deepfakes are only a few years away". How true is that in the case of voice faking?

We can certainly create audio which is very hard to detect as artificial. For example, have a look at this video, where we cloned Donald Trump’s voice. We even got him to sing. As with fake news stories, a precondition is that the person consuming the fake is uncritical and keen to see beliefs they already hold supported. The more critical we are, the less likely a deepfake will fool us. Creating convincing conversational speech is still relatively hard. We speak at around 240-300 words per minute, and communicating that fast with a synthetic voice in an interactive setting is difficult.

Could we and should we be educated as children to detect and protect ourselves from deepfakes?

Children should always be taught to be critical of all information given to them. What is the source of the information? Is there corroboration? Social media companies have long attempted to avoid responsibility for the information they supply, seeing themselves as platforms rather than publishers. The issues raised by deepfake video are the same issues raised by bogus and fake news sources. At a time when good quality objective journalism is under threat from social news sources, the veracity of the material available is also becoming deeply compromised.

In some respects the ability to create deepfakes highlights what we have known all along: a good education which helps children respond to and question news sources is very, very important. A second serious issue is the use of deepfake audio to defraud and impersonate somebody over the phone. Here education is critical. Just as internet and phone scams can be very plausible, by adding deepfake audio they can be made even harder to detect. The key is to always question the source of any information and not to give out bank details or login information without being certain who is contacting you and why. Currently it is hard to chat using deepfake audio, so if a caller takes a long time to respond and doesn’t seem able to hold a normal conversation, don’t assume the voice is real.


Should governments and regulators look at introducing dedicated legislation designed to protect against deepfake videos and audio?

International legislation is required to protect people from liars, cheats, crooks and thieves. The disinterest of companies like Facebook in regulating their content, to the extent of allowing lies in political advertising, and the increasing complexity of information technology mean that protecting people from disinformation is very challenging indeed. Legislation is slow to create and enforce, for many good reasons. Technology is fluid and changing very rapidly. As we have seen in the very slow process of creating legislation against internet scams, the real difficulty is in apprehending those responsible. Governments and regulators have a real challenge in dealing with this sort of information misuse.

Could deepfake technology play a role in the 2020 US presidential elections?

Deepfake video is yet another weapon that can be used to distort the truth and misinform. However, very prominent fakes (such as the manipulated video of Pelosi) can be counterproductive. They will be used in the 2020 election, but in the end, it is the bare-faced lies told by real politicians that need to be addressed first.


Have you ever been deceived by your own technology?

For the science lates event at the Science Museum in London in 2017, we created a little quiz to see whether people could tell if synthesised voices were artificial or natural.

I set the test, and when I took it a few weeks later I still got two questions wrong!

Our technology has improved dramatically since then, so sometimes you really just can’t tell if a voice is real or not.