Synthesia’s AI clones are more expressive than ever. Soon they’ll be able to talk back.

When Synthesia launched in 2017, its primary purpose was to match AI versions of real human faces—for example, the former footballer David Beckham—with dubbed voices speaking in different languages. A few years later, in 2020, it started giving the companies that signed up for its services the opportunity to make professional-level presentation videos starring either AI versions of staff members or consenting actors. But the technology wasn’t perfect. The avatars’ body movements could be jerky and unnatural, their accents sometimes slipped, and the emotions indicated by their voices didn’t always match their facial expressions.

Now Synthesia’s avatars have been updated with more natural mannerisms and movements, as well as expressive voices that better preserve the speaker’s accent—making them appear more humanlike than ever before. For Synthesia’s corporate clients, these avatars will make for slicker presenters of financial results, internal communications, or staff training videos.

I found the video demonstrating my avatar to be as unnerving as it is technically impressive. It’s slick enough to pass as a high-definition recording of a chirpy corporate speech, and if you didn’t know me, you’d probably think that’s exactly what it was. This demonstration shows how much harder it’s becoming to distinguish the artificial from the real. And before long, these avatars will even be able to talk back to us. But how much better can they get? And what might interacting with AI clones do to us?

The creation process

When my former colleague Melissa visited Synthesia’s London studio to create an avatar of herself last year, she had to go through a long process of calibrating the system, reading out a script in different emotional states, and mouthing the sounds needed to help her avatar form vowels and consonants. As I stand in the brightly lit room 15 months later, I’m relieved to hear that the creation process has been significantly streamlined. Josh Baker-Mendoza, Synthesia’s technical supervisor, encourages me to gesture and move my hands as I would during natural conversation, while simultaneously warning me not to move too much. I duly repeat an overly glowing script that’s designed to encourage me to speak emotively and enthusiastically. The result is a bit as if Steve Jobs had been resurrected as a blond British woman with a low, monotonous voice.

It also has the unfortunate effect of making me sound like an employee of Synthesia. “I am so thrilled to be with you today to show off what we’ve been working on. We are on the edge of innovation, and the possibilities are endless,” I parrot eagerly, trying to sound lively rather than manic. “So get ready to be part of something that will make you go, ‘Wow!’ This opportunity isn’t just big—it’s monumental.”

Just an hour later, the team has all the footage it needs. A couple of weeks later I receive two avatars of myself: one powered by the previous Express-1 model and the other made with the latest Express-2 technology. The latter, Synthesia claims, makes its synthetic humans more lifelike and true to the people they’re modeled on, complete with more expressive hand gestures, facial movements, and speech. You can see the results for yourself below. 

[Video: the author’s Express-1 and Express-2 avatars. Courtesy of Synthesia]

Last year, Melissa found that her Express-1-powered avatar failed to match her transatlantic accent. Its range of emotions was also limited—when she asked her avatar to read a script angrily, it sounded more whiny than furious. In the months since, Synthesia has improved Express-1, but the version of my avatar made with the same technology blinks furiously and still struggles to synchronize body movements with speech.

By way of contrast, I’m struck by just how much my new Express-2 avatar looks like me: Its facial features mirror my own perfectly. Its voice is spookily accurate too, and although it gesticulates more than I do, its hand movements generally marry up with what I’m saying. 
