'Pixtral 12B' Is Mistral AI's First Multimodal AI That Can Process Both Text And Images

It's an arms race, and things just get fiercer.

Soon after OpenAI introduced ChatGPT, the trend that follows see a lot of tech companies, large and small, trying to develop their own solutions to make use of the emerging Large Language Model technology. And among them, include Mistral AI.

This time, the French company is finally entering the multimodal arena.

In order to take on the likes of more established rivals, like the OpenAI's GPT-4 and Google's Gemini, Mistral released 'Pixtral 12B', its first ever multimodal model with both language and vision processing capabilities baked in.

In all, the AI is designed for tasks like captioning images, identifying objects, and answering image-related queries.

Initially not available to the public, Pixtral 12B's source code is available on GitHub and on Hugging Face for free under the Apache 2.0 license, meaning anyone can use, modify, or commercialize it without restrictions.

magnet:?xt=urn:btih:7278e625de2b1da598b23954c13933047126238a&dn=pixtral-12b-240910&tr=udp%3A%2F%https://t.co/OdtBUsbMKD%3A1337%2Fannounce&tr=udp%3A%2F%https://t.co/2UepcMHjvL%3A1337%2Fannounce&tr=http%3A%2F%https://t.co/NsTRgy7h8S%3A80%2Fannounce
— Mistral AI (@MistralAI) September 11, 2024

Pixtral 12B's 12 billion parameter model is built on Mistral’s Nemo 12B, an AI model previously released by the company capable of understanding text, with the addition of a 400-million-parameter vision adapter.

The adapter allows users to add images through URLs or encode them via base64 within the inputted text.

Users can upload images or provide links and receive detailed insights on the subjects within the images. Unlike some competitors, Pixtral 12B is designed to support an arbitrary number of images of any size natively.

Early testers have shared insights into the model’s architecture, which includes 40 layers, a hidden dimension size of 14,336, and 32 attention heads. Its 24GB structure also features a dedicated vision encoder with support for 1024x1024 resolution and 24 hidden layers, promising advanced image processing capabilities.

However, this may evolve once the model is available via API.

Sophia Yang, Mistral's head of developer relations, hinted in a post on X that the model shall soon be accessible through its own web chatbot and through Mistral’s La Platforme, which offers API endpoints for developers to interact with the company's models.

You can download the model via the torrent link. It'll be available on le Chat and la Plateforme soon.
— Sophia Yang, Ph.D. (@sophiamyang) September 11, 2024

Multimodal models like Pixtral 12B are seen as the next frontier in generative AI, and Mistral AI, backed by Microsoft and others, is positioning itself as Europe's answer to OpenAI.

The release of Pixtral 12B highlights Mistral's determined push to compete with leading AI labs.

However, with the rise of similar tools from competitors, concerns are growing about the data sources used to train these models, raising important questions about transparency and ethical AI practices.

As noted by various media publications, Mistral AI, like many AI firms, likely trained their AI models using vast quantities of publicly available web data - a practice that’s sparked lawsuits from copyright holders challenging the "fair use" argument often made by tech companies.

Besides Pixtral 12B, Mistral AI's portfolio also includes Mixtral 8x22B, a mixture-of-experts model, Codestral, a 22B parameter open-weight coding model, and a math and scientific reasoning-focused model.

Published:

12/09/2024

Dark Mode

Search form

'Pixtral 12B' Is Mistral AI's First Multimodal AI That Can Process Both Text And Images