In the age of Large Language Models, tech companies are racing to build increasingly advanced generative AI.
Since OpenAI’s ChatGPT announcement, many have joined the intense competition, striving for dominance in the field. However, the race isn't limited to private or commercial companies—non-profits can also compete.
And here, the Allen Institute for AI (Ai2) has unveiled 'Molmo.'
This state-of-the-art family of multimodal AI models is reported to outperform several top competitors, including OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5, across multiple third-party benchmarks.
Molmo, which stands for 'Multimodal Open Language Model,' is able to interpret images as well as converse through a chat interface.
According to Ali Farhadi, CEO of Ai2:
"It should be an enabler for next-generation apps."
Rivals have developed similar AI models, but many of them are proprietary, meaning the magic that makes them work is hidden from view and accessible only via a paid application programming interface, or API.
Meta has released a family of AI models called LLaMA, whose license limits commercial use, and it has yet to provide developers with a multimodal version.
Molmo is different because it is not only free, but also open source.
With abilities that match some of its more established rivals, Molmo is among the most capable open models available, and its open-source release could help more developers, researchers, and startups build more powerful AI agents in the future.
For starters, because Molmo can interpret images as well as converse through a chat interface, it can understand the content it sees on a computer screen.
This could help users with tasks like browsing the web, navigating file directories, and drafting documents.
Having an open-source, multimodal model means that any startup or researcher with an idea can try it, with no restrictions on what they can or cannot build.
Molmo's open-source nature allows developers to more easily fine-tune AI agents for specific tasks, such as working with spreadsheets, by providing additional training data.
In contrast, proprietary rivals can only be fine-tuned to a limited extent through APIs.
And because Molmo is entirely open-source, Ai2 is also releasing the training data, providing transparency for researchers.
The first release of Molmo includes a demo, inference code, a technical report on arXiv, and the following model weights:
- Molmo-7B-O, the most open 7B model.
- Molmo-7B-D, the demo model.
- Molmo-72B, the best model.
The two Molmo-7B models run on 7 billion parameters, while Molmo-72B runs on 72 billion.
Ai2, however, also introduces MolmoE-1B, a mixture-of-experts model with 1 billion active parameters out of 7 billion total.
This mini version is small enough to run on mobile devices.
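The "active versus total parameters" idea behind a mixture-of-experts model can be illustrated with a toy sketch. This is a generic illustration, not Ai2's actual OLMoE architecture: a router scores all experts for each input, but only the top-k highest-scoring experts actually run, so the parameters exercised per token are a fraction of the total.

```python
# Toy mixture-of-experts routing sketch (illustrative only; not
# Ai2's OLMoE implementation). A router scores every expert, but
# only the top-k run, so active parameters << total parameters.
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, experts, router_weights, k=2):
    """Route input vector x through the top-k of len(experts) experts."""
    # Router: one score per expert (dot product of its weights with x).
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    probs = softmax(scores)
    # Keep only the k highest-probability experts.
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    # Output is the renormalized weighted sum of the chosen experts.
    out = [0.0] * len(x)
    for i in top:
        y = experts[i](x)
        for j in range(len(x)):
            out[j] += (probs[i] / norm) * y[j]
    return out, top

# 8 experts in total, but only 2 run per input: only ~1/4 of the
# expert parameters are "active", mirroring MolmoE-1B's
# 1B-active / 7B-total ratio in spirit.
experts = [lambda x, s=s: [s * v for v in x] for s in range(1, 9)]
router_weights = [[0.1 * i, 0.05 * i] for i in range(8)]
out, chosen = moe_forward([1.0, 2.0], experts, router_weights, k=2)
print(len(chosen))  # 2 experts active out of 8
```

In a real mixture-of-experts LLM the experts are full feed-forward layers and the routing happens per token inside every MoE block, but the principle is the same: compute follows only the selected experts.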
While a model's parameter count roughly reflects its capability, Ai2 claims the smaller Molmo models can rival much larger commercial models thanks to their high-quality training data.
"The billion-parameter model performs on par with models at least 10 times larger," Farhadi said.
In its findings, Ai2 said that its MolmoE-1B model, which is based on the family's fully open OLMoE-1B-7B mixture-of-experts LLM, "nearly matches the performance of GPT-4V on both academic benchmarks and human evaluation."
Its two Molmo-7B models "perform comfortably between GPT-4V and GPT-4o on both academic benchmarks and human evaluation, and significantly outperform recently released Pixtral 12B models on both benchmarks."
And its best model, the Molmo-72B, achieves the highest academic benchmark score, and is second on human evaluation, "just slightly behind GPT-4o."
Ai2 added that Molmo-72B outperforms Gemini 1.5 Pro, Gemini 1.5 Flash, and Claude 3.5 Sonnet.
Releasing such powerful AI models comes with risks, and Ai2 knows this well.
As powerful AI becomes increasingly accessible, Ai2 joins the ranks of providers whose tools could be misused for malicious purposes, such as automating cyberattacks.
Farhadi acknowledges this.
But he believes Molmo's presence offers many positives, because its open-source nature should encourage innovation and collaboration rather than misuse for harmful purposes.
It's worth noting that as Large Language Models become increasingly powerful, researchers also want these models to be able to reason.
With AI hallucination and LLMs' ability to make misinformation sound legitimate, researchers want AI models to be able to explain their answers.
This is an area OpenAI is currently exploring with its o1 model, which exhibits step-by-step reasoning.
In the future, giving multimodal models these capabilities should be the next breakthrough.