With Koe Recast, you can change your voice as easily as your clothing

A colorful waveform that actually has nothing to do with Koe: Recast.
A colorful waveform dramatically swirls through latent space, seeking kawaii.

Thanks to a web demo of a new AI tool called Koe Recast, you can transform up to 20 seconds of your voice into different styles, including an anime character, a deep male narrator, an ASMR whisper, and more. It’s an eye-opening preview of a potential commercial product currently undergoing private alpha testing.

Koe Recast emerged recently from a Texas-based developer named Asara Near, who is working independently to develop a desktop app with the aim of allowing people to change their voices in real time through other apps like Zoom and Discord. “My goal is to help people express themselves in any way that makes them happier,” said Near in a brief interview with Ars.

Several demos on the Koe website show altered clips of Mark Zuckerberg talking about augmented reality with a female voice, a deep male narrator voice, and a high-pitched anime voice, all powered by Recast.

This kind of realistic AI-powered voice transformation technology isn’t new. Google made waves with similar tech in 2018, and audio deepfakes of celebrities have caused controversy for several years now. But seeing this capability in an independent startup funded by one person (“I’ve funded this project entirely by myself thus far,” Near said) shows how far AI vocal synthesis tech has come and perhaps hints at how close voice transformation might be to widespread adoption through a low-cost or open source release.

When asked what specific kind of AI powers Recast’s voice transformation under the hood, Near withheld specifics but described in general terms how it works: “We’re able to dive in and alter the characteristics of voices within the embedding space that we’ve created. Our goal, then, is to modify the parts of audio that correspond to a speaker’s personal style or timbre while preserving the parts of the audio that correspond to the spoken content such as prosody and words. This allows us to change the style of someone’s voice to any other style, including their perceived gender, age, ethnicity, and so on.”
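
Near did not disclose Recast’s architecture, so the following is only a toy illustration of the general idea in that quote, not Koe’s actual method. It assumes, hypothetically, that a voice clip has already been encoded into two separate vectors, one for spoken content and one for speaker style, and shows how swapping or blending the style part leaves the content part untouched. The names `VoiceEmbedding` and `recast` and all of the numbers are invented for this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VoiceEmbedding:
    """Toy embedding split into two hypothetical parts (an assumption, not Koe's format)."""
    content: List[float]  # stands in for the words and prosody
    style: List[float]    # stands in for timbre / speaker identity


def recast(source: VoiceEmbedding, target_style: List[float],
           strength: float = 1.0) -> VoiceEmbedding:
    """Blend the style part toward target_style while copying content unchanged.

    strength=0.0 keeps the original style; strength=1.0 fully adopts the target.
    """
    if len(target_style) != len(source.style):
        raise ValueError("style vectors must have the same length")
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0.0 and 1.0")
    blended = [(1 - strength) * s + strength * t
               for s, t in zip(source.style, target_style)]
    return VoiceEmbedding(content=list(source.content), style=blended)


# Made-up vectors: a speaker's clip and a "deep narrator" target style.
speaker = VoiceEmbedding(content=[0.2, 0.7, 0.1], style=[1.0, 0.0])
narrator_style = [0.0, 1.0]

converted = recast(speaker, narrator_style)
halfway = recast(speaker, narrator_style, strength=0.5)
```

In a real system, a trained encoder would produce these vectors from audio and a decoder would turn the edited embedding back into sound. Those are the hard parts, and this sketch leaves them out entirely.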

Recast supports 10 different voices, and more are on the way. “It’s currently undecided if we will be offering existing voices of celebrities or other well-known persons,” said Near.

Offering celebrity voices (or those imitating non-celebrity living persons) may pose ethical and legal questions, however. When asked about the potential misuse of Recast, Near replied, “As with any technology, it’s possible for there to be both positives and negatives, but I think the vast majority of humanity consists of wonderful people and will benefit greatly from this.” Near also pointed out that Recast includes a Terms of Service policy prohibiting illegal and hateful usage.

As for a release timeline, Near is pursuing commercial options but isn’t ruling out an open source release, which could potentially have an impact similar to Stable Diffusion by putting realistic audio deepfakes into the hands of many without hard restrictions. “We’re exploring some monetization strategies,” Near said. “If the profit models I have in mind don’t work out, open-sourcing this technology may be an option in the future.”

As deep learning technology continues to peel away the 20th-century concept (or some might say “illusion”) of media as a fixed and accurate record of reality, we are looking at a near future in which digital representations of a living human’s voice, much like images and video, will be one more thing you can’t take at face value without significant trust in the source. Still, the technology could empower many people who might otherwise be discriminated against while doing business, or simply having fun, online.

https://arstechnica.com/?p=1879606