AI that Speaks for itself

Or sings for others if you want to learn how

Apr 28, 2023

Just this week, an open-source text-to-speech model called Bark came out (see link here). You type any prompt and it speaks the text for you, like this:

[Audio clip: 8 seconds]

Here is an AI talking with text I have given it, complete with the “uh” and pauses like a human naturally would.

What sets Bark apart from ElevenLabs, Tortoise, and Resemble AI is that it uses a transformer-based approach (like GPT) and bases its next spoken word on probabilities of how it should be pronounced. Thus it can sound “naturally” human, but it can also go off the rails when it picks a poor match for what the next spoken word should sound like. For more detail on how this works, check this paper about VALL-E, where researchers trained a model on 60k hours of English speech. Given 3 seconds of someone’s voice, it could finish the sentence from the prompt, complete with emotions and acoustics. Amazing. This paper only came out this January, and we are already seeing working tools come out.

With only 3 seconds, you can complete the prompt
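
To make the idea of choosing the next sound by probability concrete, here is a toy sketch in Python. It is not Bark's code: the three candidates and their scores are made up for illustration, and a real model chooses among thousands of audio tokens. It only shows that sampling usually picks the likely continuation but sometimes picks an unlikely one, which is how a generation can go off the rails.

```python
import math
import random

def softmax(scores, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(candidates, scores, temperature=1.0, rng=random):
    """Pick one candidate at random, weighted by its probability."""
    probs = softmax(scores, temperature)
    return rng.choices(candidates, weights=probs, k=1)[0]

# Hypothetical candidates and scores, invented for this example.
candidates = ["hello", "uh", "[laughs]"]
scores = [3.0, 1.0, 0.2]

rng = random.Random(0)  # fixed seed so the run is repeatable
picks = [sample_next(candidates, scores, rng=rng) for _ in range(1000)]
for candidate in candidates:
    print(candidate, picks.count(candidate))
```

Raising `temperature` makes the unlikely picks more frequent, and lowering it makes the output more predictable.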

The caveat with GPT-type voice synthesis is that because it generates one spoken word at a time, each new generation can come out sounding like a different person. That means the base git repo, with its default settings, can’t keep a voice consistent. Bark comes with several preset speakers in the form of npz files, a NumPy format used in Python for storing arrays. Each file contains only the semantic, coarse, and fine prompts for processing a voice, and is surprisingly only 34 KB, which means you can store a large number of them easily.
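
If you are curious what lives inside an npz file, here is a small sketch that writes a stand-in speaker file and lists its contents with NumPy. The array shapes and values are invented, and the key names `semantic_prompt`, `coarse_prompt`, and `fine_prompt` are my assumption about how the three prompts are labeled, so point `describe_npz` at a real speaker file to see the actual layout.

```python
import os
import tempfile

import numpy as np

# Stand-in arrays: shapes and values are invented for illustration.
# The key names are an assumption that mirrors the three prompts described above.
fake_speaker = {
    "semantic_prompt": np.zeros(256, dtype=np.int64),
    "coarse_prompt": np.zeros((2, 768), dtype=np.int64),
    "fine_prompt": np.zeros((8, 768), dtype=np.int64),
}

path = os.path.join(tempfile.mkdtemp(), "example_speaker.npz")
np.savez(path, **fake_speaker)

def describe_npz(npz_path):
    """Return {array name: (shape, dtype)} for every array stored in the file."""
    with np.load(npz_path) as data:
        return {name: (data[name].shape, str(data[name].dtype)) for name in data.files}

summary = describe_npz(path)
for name, (shape, dtype) in summary.items():
    print(name, shape, dtype)
print("size on disk:", os.path.getsize(path), "bytes")
```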

Bark has already been forked and added to a lot of other projects. One example is oobabooga’s web UI platform, where you can install this extension and it will work with a chatbot (not in real time yet, but that is coming soon).

Compare this with the popular and dominant ElevenLabs:

[Audio clip: 6 seconds]

As you can hear, it’s pretty straightforward in converting text to speech, and lacks the human emotions that come with Bark. However, with training, ElevenLabs can also include a lot of human emotive signals. ElevenLabs was smart to offer not only a simple interface for creating voice from text but also a VoiceLab, where you can design or clone a voice from only a few samples.

Simply add voice samples and it will attempt to clone it

ElevenLabs also charges for this service and caps usage by number of characters, with plans starting at roughly 30 minutes of audio:

You’ll need the Starter plan or higher to get Instant Voice Cloning, the feature behind the viral posts going around the internet, or go all the way up to the “Let’s talk” tier to get the best of everything (or more than 40 hours of audio per month).

Similar to OpenAI, ElevenLabs makes their service extensible by allowing users to generate audio through their API. This is how AutoGPT can talk back to you.
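
Here is a rough sketch of what a call to that API can look like from Python, using only the standard library. The endpoint path and the `xi-api-key` header are what I recall from the ElevenLabs documentation and may have changed, so treat them as assumptions and check the current docs. The voice ID and key are placeholders, and the snippet only builds the request without sending it.

```python
import json
import urllib.request

API_BASE = "https://api.elevenlabs.io/v1"  # assumption: verify against the current docs

def build_tts_request(text, voice_id, api_key):
    """Build (but do not send) a text-to-speech request."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/text-to-speech/{voice_id}",
        data=body,
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

# Placeholder values: swap in your own voice ID and API key.
request = build_tts_request("Hello from my script.", "YOUR_VOICE_ID", "YOUR_API_KEY")
print(request.get_method(), request.full_url)

# To actually call the service (needs a real key and voice ID):
# with urllib.request.urlopen(request) as response:
#     with open("speech.mp3", "wb") as out:
#         out.write(response.read())
```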

Last week, we also saw the fake Drake and The Weeknd collaboration song that got taken down. Surprisingly, that wasn’t hard to do. If you want to clone someone’s voice and make a song with it, you’ll likely need a Singing Voice Conversion (SVC) tool such as so-vits-svc (this is the fork). This tool lets you take a song and replace the voice with another. That means you can record a song yourself, then replace your own voice with a famous person’s and pretend you are them.

This is actually very easy to install if you have Python. Just use these commands:

python -m pip install -U pip setuptools wheel
pip install -U torch torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install -U so-vits-svc-fork

Once installed, just type svcg and it will launch the UI shown below. Here I’ll try a sample for you:

You can take a popular hit song like Blank Space:

[Audio clip: 10 seconds]

Then run it through SVC:

And get a voice that is trained to another artist, yourself, or just something different:

[Audio clip: 10 seconds]

As you can hear, the AI accurately takes the voice and transforms it into something else. In this example I just used a pretrained model called Applejack, but you can use an existing artist or yourself instead. I am guessing the music industry is going to have a field day hunting down AI versions of its artists that are freely available on the internet. In fact, it’s questionable how long SVC can stay up…

If you want the whole song, you’ll also have to extract the instrumental track from the original song (the song without vocals), and then layer the converted vocals on top.
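
The layering step is just adding the two tracks together sample by sample. Here is a minimal sketch that assumes both tracks are already loaded as mono lists of floats between -1.0 and 1.0 at the same sample rate. The sample values are made up, and the vocal separation itself is a separate step not shown here.

```python
def mix_tracks(instrumental, vocals, vocal_gain=1.0):
    """Overlay two mono tracks of float samples in the range [-1.0, 1.0]."""
    length = max(len(instrumental), len(vocals))
    mixed = []
    for i in range(length):
        a = instrumental[i] if i < len(instrumental) else 0.0
        b = vocals[i] * vocal_gain if i < len(vocals) else 0.0
        # Clip the sum so loud passages stay inside the valid range.
        mixed.append(max(-1.0, min(1.0, a + b)))
    return mixed

# Made-up sample values, just to show the shape of the data.
backing = [0.2, 0.4, -0.3, 0.1]
converted_vocals = [0.5, 0.7, -0.9]
mixed = mix_tracks(backing, converted_vocals)
print(mixed)
```

Lowering `vocal_gain` is the simple way to keep the new voice from drowning out the backing track.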

To train this on your own voice (or another artist’s), there are a few extra steps. You have to provide around 50 or more samples of the voice, but you can break up a 5 minute file into 5 second clips for that.
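
A 5 minute recording is 300 seconds, so 5 second clips give you 60 samples, which clears the 50 mark. Here is one way to do the chopping with Python's built-in wave module. This is a sketch that assumes an uncompressed WAV file; the file names are placeholders, and the demo at the bottom uses a short silent file in place of a real recording.

```python
import os
import tempfile
import wave

def split_wav(src_path, out_dir, clip_seconds=5):
    """Cut a WAV file into consecutive clips and return the clip paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames_per_clip = src.getframerate() * clip_seconds
        index = 0
        while True:
            frames = src.readframes(frames_per_clip)
            if not frames:
                break
            clip_path = os.path.join(out_dir, f"clip_{index:03d}.wav")
            with wave.open(clip_path, "wb") as dst:
                dst.setparams(params)  # the frame count in the header is fixed on close
                dst.writeframes(frames)
            paths.append(clip_path)
            index += 1
    return paths

# Demo: 12 seconds of silence stands in for a real voice recording.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "voice.wav")
with wave.open(source, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(8000)
    w.writeframes(b"\x00\x00" * 8000 * 12)

clips = split_wav(source, os.path.join(workdir, "clips"))
print(len(clips), "clips written")
```

The last clip is shorter than the rest when the recording does not divide evenly, so you may want to drop it.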

So just to summarize all the tools we talked about:

Bark: An open-source, GPT-style model that generates speech from text, complete with acoustics and human mannerisms. It’s only 1 week old.
SVC: An open-source tool that converts one voice into another trained voice. The fork is only a month old.
ElevenLabs: A paid platform that converts text to speech using trained voices.
Other text to speech: Tortoise TTS, Resemble.ai
Other versions of SVC: Diff-SVC

As you can see, the most disruptive products only came out in the last month. Things are going to move fast from here.

The ramifications of voice cloning are huge. Just to name a few:

Music that is just as good without the artist. There will be a huge legal race to take down likenesses of existing artists. Darn those .pth files!
Scams where a loved one is telling you to give money, or even corporate espionage where the CEO is telling you to send gift cards
Podcasts where the host or guest doesn’t actually need to be there. I should start a new podcast where you guess who’s real and who’s not
Conversational ChatGPT or agents that are able to run tasks (AutoGPT already has ElevenLabs built in)
Support or sales calls run completely by AI

Eventually, AI will talk to AI and close decision loops, all with pretrained transformers.

What will go on behind the scenes in the future

In fact, in the near future, people may create LoRAs of themselves to leave voice messages or even to reply in their own style of speech. This is already happening with emails!

This is still early, since Bark voice cloning is not freely open yet, but within weeks we will probably see a decent competitor to ElevenLabs, and cloned music will eventually be plug and play. You can see I had to go through a few steps. Soon someone will cut that down to a few clicks, with a $ charge on top. Grimes is already ahead of this trend, saying she’ll split the royalties 50/50. Until then, stay vigilant out there and think a little bit about whether what you are hearing is real!

AI Relevance - Jeff's Substack
