This AI Can Model The World Using Sound: Opening Up A Multimodal Metaverse


AIs can be trained to do what they need to do, as long as they are trained properly and with the right data.

And researchers at the Massachusetts Institute of Technology (MIT) have successfully created a machine learning (ML) model capable of predicting what a listener would hear at a variety of locations within a 3D space. This should open up the possibility of creating a multimodal metaverse.

Just as the metaverse is being hyped by Meta, the rebranded Facebook, with others following suit, this AI is able to build up a picture of a 3D room in the same way people use sound to understand their environment.

While Meta's founder Mark Zuckerberg introduced AI-generated "legs" to make people in the metaverse look more like real humans, this AI should be able to make the metaverse a lot more immersive.

Partnering with IBM Watson AI Lab, the team at MIT first used the ML model to understand how any sound in a room will propagate through the space.

In a paper authored by Yilun Du, an MIT graduate student in the Department of Electrical Engineering and Computer Science (EECS), and Andrew Luo, a graduate student at Carnegie Mellon University (CMU), the researchers show how techniques similar to visual 3D modeling can be applied to acoustics.

At first, the team struggled with the ways in which sound and light diverge.

For example, changing the location of the listener in a room can create a very different impression of the sound due to obstacles, the shape of the room, and the nature of the sound itself. This makes the outcome extremely difficult to predict.

To address this issue, according to the project's research page, the researchers developed their AI model to understand key features of acoustics.

First, the source of the sound and the listener can swap places without changing what the listener hears, all other things being equal. The researchers also tweaked the system to understand that sound depends heavily on local features, such as obstacles between the source and the listener.
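To illustrate the first property, the sketch below shows one way a model's input could be made invariant to swapping the emitter and the listener, so the predicted response is identical in both directions. The encoding is a hypothetical illustration of the reciprocity idea, not the formulation used in the paper.

```python
# Hypothetical sketch: make the model's input invariant to swapping the
# emitter and listener, mirroring the acoustic reciprocity described above.
import numpy as np

def symmetric_encoding(emitter_xy, listener_xy):
    """Return features that are identical if the two positions are swapped."""
    a = np.asarray(emitter_xy, dtype=float)
    b = np.asarray(listener_xy, dtype=float)
    # Sum and element-wise absolute difference are both order-invariant.
    return np.concatenate([a + b, np.abs(a - b)])

# Swapping the two points yields the same features, so any network fed this
# encoding automatically predicts the same response in both directions.
assert np.allclose(
    symmetric_encoding([1.0, 2.0], [4.0, -1.0]),
    symmetric_encoding([4.0, -1.0], [1.0, 2.0]),
)
```

Any order-invariant combination of the two positions would serve the same purpose here; the point is only that the network cannot tell which point is the source and which is the listener.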

"Most researchers have only focused on modeling vision so far. But as humans, we have multimodal perception. Not only is vision important, sound is also important. I think this work opens up an exciting research direction on better-utilizing sound to model the world," Du said, in a post at an MIT web page.

Using this approach, the neural acoustic field (NAF) model is fed both visual information and spectrograms of what a given audio sample sounds like at selected points in the area. This gives it the ability to predict how the sound changes when the listener moves.
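As a rough picture of the kind of training signal described above, the snippet below computes a log-magnitude spectrogram from a short audio clip, the time-frequency representation a model like this would be asked to predict at each listener position. The audio here is synthetic and the STFT settings are arbitrary choices, not values from the paper.

```python
# Illustrative sketch: turn audio heard at one listener position into a
# log-magnitude spectrogram, the kind of target the model learns to predict.
import numpy as np
from scipy.signal import stft

rate = 16000
# Stand-in for audio recorded at one listener position (a decaying burst of noise).
audio = np.random.randn(rate) * np.exp(-np.linspace(0.0, 6.0, rate))

# Short-time Fourier transform: time-frequency magnitudes of the signal.
freqs, times, spec = stft(audio, fs=rate, nperseg=512)
log_magnitude = np.log(np.abs(spec) + 1e-6)

print(log_magnitude.shape)  # (frequency bins, time frames)
```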

The end result is that the model can randomly sample points on a grid covering the scene to learn the acoustic features at specific locations.

This works because the AI is able to take inputs, like the proximity of a doorway, and weight them as factors that strongly affect what the listener would hear, relative to geometric features that are farther away on the other side of the room.
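A minimal sketch of this idea, assuming a PyTorch-style setup, is to store local geometric features in a learnable 2D grid over the floor plan and interpolate it at a queried position, so that nearby structure such as a doorway dominates the lookup. The grid size, feature dimension, and coordinates below are arbitrary placeholders, not values from the paper.

```python
# Hypothetical sketch: look up local geometric features from a learnable grid
# at the listener's position via bilinear interpolation.
import torch
import torch.nn.functional as F

feature_dim, grid_h, grid_w = 16, 32, 32
# Learnable grid of local features covering the floor plan of the room.
local_grid = torch.nn.Parameter(torch.randn(1, feature_dim, grid_h, grid_w))

def sample_local_features(xy):
    """Bilinearly interpolate the grid at a position given in [-1, 1] coords."""
    # grid_sample expects sampling coordinates shaped (N, H_out, W_out, 2).
    coords = xy.view(1, 1, 1, 2)
    feats = F.grid_sample(local_grid, coords, align_corners=True)
    return feats.reshape(feature_dim)

listener_xy = torch.tensor([0.25, -0.6])  # normalized position in the room
features = sample_local_features(listener_xy)
print(features.shape)  # torch.Size([16])
```

Because the lookup is local, features stored near the listener (say, around a doorway) influence the result far more than geometry stored on the other side of the grid.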

As a result, the AI is able to predict what a listener would hear from a specific acoustic stimulus based on the relative locations of the source and the listener in the room.

(a): Top-down view of a room. (b): Walkable regions shown in grey. (c)-(f): Spatial acoustic field generated by an emitter placed at the red dot.

According to the research paper (PDF):

"By modeling acoustic propagation in a scene as a linear time-invariant system, NAFs learn to continuously map all emitter and listener location pairs to a neural impulse response function that can then be applied to arbitrary sounds. "

"We demonstrate that the continuous nature of NAFs enables us to render spatial acoustics for a listener at an arbitrary location, and can predict sound propagation at novel locations."

In most research and development, AIs are trained using images and videos.

This is because computer vision has become a field with wide demand, primarily because of its immediate and obvious applications in building autonomous vehicles and other tools that can "see" the world as humans do.

But the researchers at MIT and the IBM Watson AI Lab worked with sound in order to avoid the model relying on photometric consistency, a phenomenon in which an object looks roughly the same regardless of where people are standing.

By allowing ML models to build up an idea of how sound propagates through a space, the models should be able to simulate what a person standing at any given location would hear.

According to Chuang Gan, a principal research staff member at the MIT-IBM Watson AI Lab who also worked on the project, this should help advance the metaverse:

"This new technique might open up new opportunities to create a multimodal immersive experience in the metaverse application."

The researchers also pitched its potential applications in helping AIs understand the world around them.

"For instance, by modeling the acoustic properties of the sound in its environment, an underwater exploration robot could sense things that are farther away than it could with vision alone."
Published: 03/11/2022