Hey there! Allison here, CTO of Tonic Audio. Hello! 👋
Though I spend most of my time at Tonic behind code editors and terminal windows, I do get the chance to work on music here and there. And one thing I’ve been meaning to try out is the current state-of-the-art for AI as it applies to songwriting–specifically vocal replacement.
To start, this is quite a controversial issue, as you can imagine. And in case it isn’t absolutely clear, Tonic and its founding team are extremely pro-musicmaker. I personally take a pretty dim view of the enshittification happening within the music industry and want to do something about it. So much so that we made Rebecca Giblin and Cory Doctorow’s book Chokepoint Capitalism required reading for us.
That said, is there a world where AI can work for the independent musician, and actually protect the creators against the current choke points erected in the music industry? Before we can answer that, first we need to know what we’re working with.
While AI spans many different areas within the music creation space, I wanted to focus on one that is near and dear to my heart: the voice. Ever since I began playing music at the ripe age of 14, I’ve found singing difficult, and I was pretty bad at it. Since then I’ve learned a bit about harmonies, voice timbre, and the mechanics of a typically good vocal take. However, even after vocal lessons, I just don’t really have one of those voices that does what I want it to. So figuring out how to add the vocals I can hear in my head, but cannot sing, to the songs I write is really interesting to me.
I figured if I wanted to see what the current state-of-the-art was, I ought to go directly to the place that is infuriating a lot of music industry execs, and that would be the AI Hub Discord server. It has all the vibe of early gray-market internet IRC chat rooms–a flurry of activity, braggadocio, and a constant stream of new content.
But seeing the huge stream of music covers there made me a bit uneasy. Had royalties been paid for the covers of these songs? I don’t support blatant copyright infringement of artists who work hard for their craft. So seeing all these songs posted made me a bit unsure of this experiment.
However, in this morass of channels and content, there was a link to a Google doc, instructing one on how they might go about exploring this new technology. And, being a technologist for most of my life, and believing that the best way to predict the future is to invent it, I needed to see just how difficult it was, and just how good the quality had become.
Putting the tools to the test
First order of business: determine what song to work on. I’m involved with a few smaller Discord servers with other music creators, where we socialize, do silly song challenges, give feedback on each other’s songs, and generally just try to be decent to each other while we all attempt to hone our craft. In fact, it was these early interactions that had a deep influence on how we built out Tonic’s community features.
It just so happened that one of the current song challenges I was involved with was for everyone to pick the most irritating, hated song they could possibly think of–you know, the one that gives you a visceral negative reaction the moment it comes on, and compels you to tell your friend how much you hate this fucking song? Yeah, so the challenge was for everyone to submit their most hated song, and then an online random number generator shuffled the submissions. We were each assigned a random hated song from someone else’s submission, which we then had to remix or cover, and then share.
I feel bad for the poor soul who got my “Macarena” submission. I’m not sure they ever finished the challenge. I’m so sorry.
I was assigned the song “Rude,” by a band called Magic!. I had never heard this song before, which to me was really great. My sonic palette was clear of any annoying earworm residue from prior listening. Even better, despite the reggae influence, this song was a traditional pop song with a pretty straightforward chord progression, and not a lot going on otherwise. Most importantly, it had a prominent vocal track that made it a great candidate to see how the AI voice tech really works.
As an aside, I am a huge fan of making covers of songs to hone my music production skills. Doing so does so many things for you, and pushes you to dissect the song on multiple levels. What are the lyrics saying? How does the harmony support or contrast with the lyrical content (e.g. happy words, sad chords)? How tight is the rhythm? How do the musical elements combine from an arrangement standpoint? When covering a song, you can then decide how you’d like to reinterpret it. I chose to move it into a darker, moodier, queerer place–one that resonated more closely with my past experience and worldview. And as always, could I make this a song that I’d actually be proud of, and would want to listen to? Only one way to find out.
Let’s do this
I followed the half-dozen steps to accomplish this: separating the vocal track from the original song, de-noising, de-reverbing, and de-echoing it, extracting the pitch information out of that, then feeding it into a plethora of additional tools and utilities, and tweaking parameters and listening, until I had a pretty good female-voiced version of the original vocal track. This was by no means a straightforward process, and there were lots of footguns around that could mess up the whole thing.
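To make the "extracting the pitch information" step a little more concrete, here is a minimal sketch of one classic way to estimate the pitch of a short audio frame: autocorrelation. To be clear, this is not the tooling from the guide (the post doesn't say which pitch extractor was used), and real voice conversion tools rely on far more robust methods. The function names `synth_tone` and `estimate_pitch`, the sample rate, and the 80–1000 Hz search range are all assumptions made up for illustration, and a synthetic sine wave stands in for an isolated vocal.

```python
import math


def synth_tone(freq_hz, sample_rate=16000, duration_s=0.1):
    """Generate a pure sine tone as a stand-in for an isolated vocal frame."""
    n = int(sample_rate * duration_s)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def estimate_pitch(samples, sample_rate=16000, fmin=80.0, fmax=1000.0):
    """Estimate the fundamental frequency of one frame via autocorrelation.

    Slides the frame against a delayed copy of itself and picks the delay
    (lag) where the two line up best. That lag approximates one period of
    the waveform. Returns None if no lag correlates positively (e.g. silence).
    """
    min_lag = int(sample_rate / fmax)  # shortest period we search for
    max_lag = int(sample_rate / fmin)  # longest period we search for
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(
            samples[i] * samples[i + lag] for i in range(len(samples) - lag)
        )
        if score > best_score:
            best_lag, best_score = lag, score
    if best_lag is None:
        return None
    return sample_rate / best_lag


if __name__ == "__main__":
    frame = synth_tone(220.0)  # A3, a plausible sung note
    print(estimate_pitch(frame))
```

Because the lag is a whole number of samples, the estimate is only approximate (a 220 Hz tone lands within about 1 Hz here). A real vocal also has breaths, consonants, and leftover reverb, which is a big part of why the clean-up steps come before pitch extraction.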
But dear reader, once I finally had the new vocal track created, and soloed it to listen to the result, my. mind. was. blown. From a voice timbre and human expression standpoint, it was so incredibly detailed–it was unlike any artificial audio I’d ever heard before. The voice broke into falsetto at all the right spots, it sounded natural, had believable prosody, and even the low register didn’t sound robotic or gender-bending. While there were slight glitches here and there, and the top frequency range was a little lacking, it was nothing some editing and EQ/effects couldn’t address. Dropping my favorite tape delay onto it made it sit just perfectly.
It really felt like magic–like the first time I made a drum machine play a beat I heard in my head and played along with it. Or like what I imagine Peter Gabriel felt the first time he heard the E-mu Emulator II Shakuhachi flute sample that he (along with Enigma and many other artists) used everywhere.
Now that I had a new vocal track, it was time to write the musical reinterpretation to match this vocal part. I spent another week putting together an arrangement that I felt would best support this vocal track–as I have a strict no-original-samples rule when doing a cover (in contrast to doing a remix of an existing song, where samples of the original are a-ok). Inspired by several of my favorite artists–hello dear Lali Puna!–I put together something I thought was new and fresh for me.
Finally, out came this little gem.
Despite my distaste for Spotify, and despite Tonic Audio being a perfectly viable place to post music, I still posted it on Spotify, since the distribution provider I use allows me to properly license the songwriting rights from the original composers. That’s really important to me.
The takeaway for me is twofold. First, the technology is finally here. While it may not pass for exacting, perfect vocal takes in a critical listening environment, it is completely viable as a vocal track for pop songs, and is already being used for this today.
Second, what about proper attribution and compensation? Purchasing a license to cover the song is currently available through my distribution provider, but I cannot license the vocal model I used. There are so many more questions now, too. Should voices be copyrightable? If so, what prevents those copyrights from falling into the hands of the major record labels and being abused? If we can copyright voices, does that mean we can copyright other tonal characteristics, like guitar distortion or channel strips? Can we get the license fees directly to the original vocalists? How can we prevent enshittification of this nascent market?
I’m hopeful that projects like Benn Jordan’s Vocal Swap can help address some of these new questions. Of course, they’ll need to ensure–just like we do at Tonic Audio–that they stay immune to the inevitable pull of becoming enshittifiers themselves.
And in that space, Tonic Audio has some cool new ideas we’re cookin’ up. I’m hopeful we can get them dialed in and shared with you in a future update!