Meta Introduces 'CM3leon' As A State-Of-The-Art Generative AI Model For Text And Images

AI is about intelligence demonstrated by machines, as opposed to natural intelligence displayed by living things.

OpenAI made news quite that just a few times, following the introduction of ChatGPT and GPT-4.

But before that, it has been in the 'GPT' business, after previously creating GPT-3, an AI capable of producing text rich with context, nuance and even humor, and GPT-2.

Besides them all, OpenAI also has what it calls the DALL·E, and DALL·E 2, which are essentially GPTs for images.

While quite late in the competition, Meta introduces its own generative AI, but one model for both creating text and images.

Calling it the 'CM3leon', Meta wants to use it to revolutionize multimodal AI generation models.

Introducing CM3leon, a first-of-its-kind multimodal model that achieves state-of-the-art performance for text-to-image generation with 5x the compute efficiency of competitive models.

More details https://t.co/VR12zkmLDs pic.twitter.com/jUnG7G1Fxf
— Meta AI (@MetaAI) July 14, 2023

In a blog post:

"Interest and research in generative AI models has accelerated in recent months with advancements in natural language processing that lets machines understand and express language, as well as systems that can generate images based on text input. Today, we’re showcasing CM3leon (pronounced like 'chameleon'), a single foundation model that does both text-to-image and image-to-text generation."

What CM3leon does, is becoming "the first multimodal model trained with a recipe adapted from text-only language models, including a large-scale retrieval-augmented pre-training stage and a second multitask supervised fine-tuning (SFT) stage."

According to Meta, this allows the model to be much more efficient than other existing generative diffusion-based models.

This is possible because CM3leon achieves better performance for text-to-image generation, even when it was trained with five times less compute than previous transformer-based methods.

The result is a more versatile and effective autoregressive model.

In turn, this allows the Meta to maintain a low training cost and inference efficiency.

What's more, the name CM3 comes from the term causal masked mixed-modal (CM3) model, which means that it can generate both text and images based on arbitrary sequences of other image and text content.

This greatly expands its functionality compared to other models that are specialized for either text-to-image or image-to-text generation.

What's more, CM3leon also performs well across a variety of vision-language tasks, including image caption generation, visual question answering, and text-based editing. It can generate complex compositional objects and produce coherent imagery that follows input prompts accurately.

To make this happen, the researchers at Meta trained CM3Leon using a dataset of millions of licensed images from Shutterstock.

The most capable of several versions of CM3Leon that Meta built has 7 billion parameters, which is more than two times as many as DALL-E 2.

But the main thing that makes CM3leon superior, is because it was trained using a technique called supervised fine-tuning, or SFT for short.

This method is already being used to train text-generating models like OpenAI’s ChatGPT. However, the researchers at Meta theorized that it could be useful when applied to the image domain. And indeed they were right.

As most image generators struggle with "complex" objects and text prompts that include too many constraints, CM3Leon shows less effort to do the same tasks.

Because CM3leon does these all well, Meta boasts it as a state-of-the-art performance for text-to-image generation model.

CM3leon outperforms all in the competition in terms of quality and efficiency.

As AI efforts from competitors, mainly from the aforementioned OpenAI products, Stability AI's Stable Diffusion, Google Imagen, and Adobe Firefly, have garnered a lot of attention, Meta believes its model could be one of the best available, at least when it comes to AI art generation.

And CM3leon does deliver.

In conclusion, CM3leon represents a significant advancement in the field of generative AI models.

Because its ability to generate both text and images with high quality and efficiency, Meta opens up new possibilities for creative applications, as the company continues its way to further advance in multimodal language models.

By making the work transparent, the creators of CM3leon aim to encourage collaboration and innovation in the field of generative AI.

Published:

15/07/2023