Is That You or is it AI TTS?

Photo: palesa on Unsplash

Have you ever typed something into your computer and had it read back to you? That technology is called text-to-speech (TTS). It’s been around for quite a while (your parents may remember Microsoft Sam), but more recently AI has supercharged the experience, taking it from a robotic sound to a voice that can imitate real ones, like your favorite TV or movie character reading back what you typed.

For example, TikTok regularly adds voices from famous sources (like Rocket Raccoon from Guardians of the Galaxy) that will read out any text you put on your videos.

Hear. Say.

But how does AI know what these real people sound like? In real life, you can’t ask someone to impersonate a voice they’ve never heard before. But if you show them a few YouTube videos of the person talking, they can try to imitate it.

It’s the same with AI and TTS. The computer first needs to learn what the person sounds like so it can recreate that voice. And, just as with a human impersonator, the more examples it gets, the better the impersonation will be.

Computers analyze voices the same way they analyze other types of data: through deep learning. You feed the computer some voice samples and it tries to learn as much about the voice as possible. This includes the speaker’s tone, how fast they talk, and any speech habits they may have.
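To make "learning about a voice" a little more concrete, here is a toy sketch in Python that measures just one property of a sound: its pitch. It is not how deep learning works, and it is not taken from any real TTS system. The made-up "voices" are plain computer-generated tones, and the names `make_tone` and `estimate_pitch` are invented for this example. Real systems learn far richer patterns from real recordings.

```python
import math

SAMPLE_RATE = 16_000  # samples per second (an assumed value for this toy example)


def make_tone(freq_hz, seconds=1.0):
    """Create a pure tone as a list of numbers, standing in for a voice recording."""
    n = int(SAMPLE_RATE * seconds)
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


def estimate_pitch(samples):
    """Estimate pitch in Hz by counting how often the wave crosses zero going upward."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    duration = len(samples) / SAMPLE_RATE
    return crossings / duration


low_voice = make_tone(120)   # a deeper "voice"
high_voice = make_tone(220)  # a higher "voice"

print("low voice pitch: ", estimate_pitch(low_voice))
print("high voice pitch:", estimate_pitch(high_voice))
```

The two estimates should land close to 120 and 220, so even this tiny program can tell a deep voice from a high one. A voice-cloning model does something similar in spirit, but across many properties at once.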

For a while, it took quite a lot of samples to get a voice to work. You had to read aloud a long series of different sentences so the computer knew what you sounded like. Then you could feed the computer some text and it would try its best to read it out in your voice.

Imitation Nation

Technology companies have gotten very good at capturing voices lately, which is a bit scary, really. For example, Microsoft is working on its own engine, which does a much better job than Sam ever did. It’s called VALL-E, and its website claims that the engine can copy someone’s voice from a three-second clip. That’s fast.

Of course, as with any new AI technology, this can be used for both good and evil. For example, if you were developing a video game, you could record your voice once and then have the AI read your character’s lines in it. You could even tweak the script and the AI would read the new version. No more re-recording.

On the flip side, a lot of our current tech uses voice identification. Some banks will ask you to read out a password and then match your voice to what they have on file. If someone can take a three-second clip of your voice and use it to replicate what you sound like, they could use the copy to get past these security checks.
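Here is a rough sketch of why a good copy is a problem for that kind of check. It assumes a voice can be boiled down to a short list of numbers (a "voiceprint") and that a match means two voiceprints point in nearly the same direction. The numbers, the 0.95 threshold, and the function names are all made up for illustration. This is not how any particular bank does it.

```python
import math


def cosine_similarity(a, b):
    """Return how closely two lists of numbers point in the same direction (1.0 = identical)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


THRESHOLD = 0.95  # hypothetical cutoff; a real system would tune this carefully


def voice_matches(on_file, attempt, threshold=THRESHOLD):
    """Accept the attempt only if it is similar enough to the voiceprint on file."""
    return cosine_similarity(on_file, attempt) >= threshold


# Invented voiceprints, purely for illustration.
on_file = [0.90, 0.10, 0.40]      # saved when the customer signed up
same_person = [0.88, 0.12, 0.41]  # the real customer on another day
stranger = [0.10, 0.90, 0.20]     # someone with a very different voice
good_clone = [0.89, 0.11, 0.39]   # an AI copy that sounds almost the same

print("same person:", voice_matches(on_file, same_person))
print("stranger:   ", voice_matches(on_file, stranger))
print("good clone: ", voice_matches(on_file, good_clone))
```

In this sketch the stranger is turned away, but the clone is accepted along with the real customer, because the check only measures how similar the voice is, not who is actually speaking.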

Is this just the start of AI voice generation? It’s hard to say. But it’ll be interesting to see how people use these AI TTS systems to revolutionize the computing industry, from video games to personal assistants.

How do you think AI TTS voices will change our lives?

Learn More

Explanation of voice cloning

https://www.voxwaveai.com/understanding-voice-cloning-technology-what-you-need-to-know

VALL-E

https://www.microsoft.com/en-us/research/project/vall-e-2/

What is voice cloning?

https://builtin.com/artificial-intelligence/what-is-voice-cloning

Voice Cloning 101

https://www.veritonevoice.com/blog/voice-cloning-101/#voice-cloning/

Voice Cloning Guide

https://elevenlabs.io/safety