The AI field was once relatively quiet, with little impact beyond its own niche.
That changed when OpenAI introduced ChatGPT, sparking an arms race among tech giants and startups alike to develop the best Large Language Models (LLMs).
This race has driven rapid advancements in creating AI that interacts with users in increasingly human-like ways.
From OpenAI to Google, Meta, X, and smaller players like Anthropic and Stability AI, most have focused on developing multi-purpose AI targeting a similar audience.
Chipmaker Nvidia, however, is taking a different approach.
Riding the AI wave, the company has introduced an LLM-powered AI—but instead of aiming to be all-knowing, Nvidia focuses on creating an AI that masters sound.
The world’s most flexible sound machine?
With text and audio inputs, this new #generativeAI model, named Fugatto, can create any combination of music, voices, and sounds.
Read more in our blog by @RichardKerris https://t.co/AvTAbjn1iJ #NVIDIAResearch
— NVIDIA AI Developer (@NVIDIAAIDev) November 25, 2024
In a blog post, Nvidia said:
"While some AI models can compose a song or modify a voice, none have the dexterity of the new offering."
"Called Fugatto (short for Foundational Generative Audio Transformer Opus 1), it generates or transforms any mix of music, voices and sounds described with prompts using any combination of text and audio files."
Among other things, Fugatto can create a music snippet from a text prompt, remove or add instruments in an existing song, and change the accent or emotion in a voice.
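Nvidia has not published a public API for Fugatto, so purely as an illustrative sketch, a text-plus-audio prompting interface of this kind might be modeled as below. Every name here (`AudioPrompt`, `describe`, the field names) is hypothetical, not Fugatto's actual interface:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical prompt structure: a Fugatto-style model accepts a free-form
# text instruction, optionally paired with a reference audio file to transform.
@dataclass
class AudioPrompt:
    text: str                          # free-form instruction
    audio_path: Optional[str] = None   # optional reference audio to transform

def describe(prompt: AudioPrompt) -> str:
    """Render the prompt as the kind of request such a model would receive."""
    if prompt.audio_path:
        return f"transform '{prompt.audio_path}' so that: {prompt.text}"
    return f"generate audio: {prompt.text}"

# Pure generation from text vs. transformation of an existing file.
print(describe(AudioPrompt("a melancholy piano melody in a minor key")))
print(describe(AudioPrompt("remove the drums", audio_path="song.wav")))
```

The point of the sketch is the dual input mode the article describes: the same model handles both generation from text alone and transformation of supplied audio.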
Nvidia takes its innovation a step further, enabling users to generate sounds that have never been heard before.
This is possible because Fugatto supports numerous audio generation and transformation tasks.
Nvidia claims that Fugatto is "the first foundational generative AI model that showcases emergent properties — capabilities that arise from the interaction of its various trained abilities — and the ability to combine free-form instructions."
According to Rafael Valle, a manager of applied audio research at Nvidia and himself an orchestral conductor and composer:
"Fugatto is our first step toward a future where unsupervised multitask learning in audio synthesis and transformation emerges from data and model scale."
Use cases include music producers, who could use Fugatto to quickly prototype or edit an idea for a song, trying out different styles, voices and instruments.
They could also add effects and enhance the overall audio quality of an existing track.
Advertising agencies could use Fugatto to quickly target an existing campaign for multiple regions or situations, applying different accents and emotions to voiceovers.
Developers of language learning tools could use the AI to personalize the experience by allowing users to use any voice they want.
Video game developers could also use the model to modify prerecorded assets in their title to fit the changing action as users play the game. Or they could create new assets on the fly from text instructions and optional audio inputs.
Because Fugatto is an LLM at heart, it can respond to users and tailor its output to their queries.
In this case, Fugatto can, for example, make a trumpet bark or a saxophone meow.
"Whatever users can describe, the model can create," said Nvidia.
With fine-tuning and small amounts of singing data, researchers found it could handle tasks it was not pretrained on, like generating a high-quality singing voice from a text prompt.
Fugatto is essentially a foundational generative transformer model that is built on top of Nvidia's previous work in areas like speech modeling, audio vocoding and audio understanding.
The full model features 2.5 billion parameters, trained on Nvidia DGX systems equipped with 32 Nvidia H100 Tensor Core GPUs.
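As a back-of-the-envelope check (my arithmetic, not a figure from Nvidia), 2.5 billion parameters stored in 16-bit floating point occupy roughly 5 GB for the weights alone, which helps explain why training was done on multi-GPU DGX systems:

```python
# Rough memory footprint of a 2.5B-parameter model (weights only).
params = 2.5e9          # parameter count reported by Nvidia
bytes_per_param = 2     # fp16/bf16 storage, a common choice for inference

weights_gb = params * bytes_per_param / 1e9
print(f"weights: ~{weights_gb:.1f} GB")  # ~5.0 GB

# Training needs far more: weights, gradients and Adam optimizer state in
# mixed precision are commonly estimated at ~16 bytes per parameter.
training_gb = params * 16 / 1e9
print(f"training state: ~{training_gb:.0f} GB")  # ~40 GB
```

The ~16-bytes-per-parameter training estimate is a widely used rule of thumb for mixed-precision Adam, not a published detail of Fugatto's training setup.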
A key challenge was creating a blended dataset of millions of audio samples for training. To address this, the team used a multifaceted approach to expand the range of tasks the model could handle, improve accuracy, and unlock new capabilities without needing extra data.
By meticulously analyzing existing datasets, they uncovered new relationships within the data, enhancing the model's versatility.
The entire development of Fugatto, from initial concept through training to completion, took more than a year.