MIT Creates AI That Can Extract Individual Music Instruments From Videos

If a song is a cake, the instruments played to create it are the ingredients.

When the ingredients are mixed and baked, they become the cake, and once baked, they can no longer be separated. Music is similar: each instrument contributes its own notes and tones, which blend together into a single song.

AI researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have created an app that can undo this.

Using a deep neural network trained on more than 60 hours of video of musicians playing instruments, the software can identify more than 20 different instruments without being told what they are. It can then isolate the sound of each instrument played in the song.

All the user has to do is click on the instrument they want to hear. What is usually a tedious process requiring hours of sound processing and tweaking by engineers can be done almost instantly.

The software is called PixelPlayer.

The CSAIL researchers said that as the software improves and learns to tell apart different instruments in the same family, it could become a tool for remixing and mastering performances for which the original recordings no longer exist.

For example, in old concert footage featuring a piano and a trumpet, each instrument could be adjusted individually to restore the sound quality. Musicians who want to perfect their playing could also easily focus on a specific instrument in a song they're trying to master.

The software also has the potential to revolutionize the process of remixing songs, or creating mashups.

It could also be used to train robots to identify environmental sounds, such as those made by animals, vehicles and appliances.

The researchers describe the AI as a “self-supervised” deep learning system. This means that it requires no direct human supervision to learn.

What makes it interesting is that the team did not know exactly how the AI was doing this when they first created it.

The AI locates the actual pixels in the video that are producing sound, hence its name, and identifies instruments at the pixel level by isolating the sounds as they are produced. It does this by ‘seeing’ where each sound originates, which lets it split the sounds accurately.
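To make the idea of tying sound to a pixel more concrete, here is a toy sketch in Python. It is not the researchers' model. It assumes the hard part is already done: the audio has been split into a few component spectrograms, and every pixel has been given a weight for each component. In a real system, both would have to be learned by a trained network. The function name `separate_at_pixel` and all of the data below are hypothetical and serve only as illustration.

```python
import numpy as np

def separate_at_pixel(mixture, components, pixel_weights, row, col):
    """Return the share of the mixture spectrogram attributed to one pixel.

    mixture:       (F, T) magnitude spectrogram of the full recording
    components:    (K, F, T) non-negative audio components (assumed given)
    pixel_weights: (H, W, K) non-negative per-pixel weights (assumed given)
    """
    weights = pixel_weights[row, col]                      # (K,)
    # Sound "owned" by this pixel: weighted sum of the audio components.
    selected = np.tensordot(weights, components, axes=1)   # (F, T)
    # Ratio mask: fraction of each time-frequency bin that belongs to the pixel.
    mask = np.clip(selected / (components.sum(axis=0) + 1e-8), 0.0, 1.0)
    return mask * mixture

# Toy data: two instruments that occupy different frequency bins.
components = np.array([
    [[1.0, 1.0], [0.0, 0.0]],   # component 0, e.g. a piano
    [[0.0, 0.0], [2.0, 2.0]],   # component 1, e.g. a trumpet
])
mixture = components.sum(axis=0)

# A 1 x 2 pixel "video frame": left pixel shows the piano, right the trumpet.
pixel_weights = np.zeros((1, 2, 2))
pixel_weights[0, 0] = [1.0, 0.0]
pixel_weights[0, 1] = [0.0, 1.0]

piano = separate_at_pixel(mixture, components, pixel_weights, 0, 0)
trumpet = separate_at_pixel(mixture, components, pixel_weights, 0, 1)
```

In this sketch, "clicking" on the left pixel returns only the piano's part of the spectrogram, and clicking on the right pixel returns only the trumpet's.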

"We were surprised that we could actually spatially locate the instruments at the pixel level," said Hang Zhao, lead author of the paper and a PhD at CSAIL. "Being able to do that opens up a lot of possibilities, like being able to edit the audio of individual instruments by a single click on the video."

While the technology is still in its infancy, it has already shown a lot of potential.