Google Gemini 1.5 Pro, With Its 'Long-Context Understanding,' Can Understand Details In Films

Google Gemini 1.5

The war rages on, and it's getting fiercer than ever.

The industry was once quite dull, and AI rarely made ripples outside its own realm. But since generative AI took the world by storm, especially after OpenAI introduced ChatGPT, pretty much all big tech companies have been pursuing the same target, trying to please the same audience.

And Google, being one of those giants, has introduced Gemini, a rebrand of its Bard chatbot.

This time, the company introduces an update to Gemini.

Called Gemini 1.5 Pro, the AI just got a lot smarter.

"Gemini 1.5 Pro can understand tasks and questions across different modalities because of its long context understanding. When given a 44-minute Buster Keaton film, it's able to find small details in the film and understand plot points," wrote Google in a post on X.

In a blog post co-written by Google CEO Sundar Pichai:

"Our teams continue pushing the frontiers of our latest models with safety at the core. They are making rapid progress. In fact, we’re ready to introduce the next generation: Gemini 1.5. It shows dramatic improvements across a number of dimensions and 1.5 Pro achieves comparable quality to 1.0 Ultra, while using less compute."

"This new generation also delivers a breakthrough in long-context understanding."

In one of the examples provided by Google, it's said that Gemini 1.5 Pro "can perform highly-sophisticated understanding and reasoning tasks for different modalities," and this includes videos.

And even films.

In the example, Google said that it gave Gemini 1.5 Pro a 44-minute Buster Keaton movie, and that "the model can accurately analyze various plot points and events, and even reason about small details in the movie that could easily be missed."

Gemini 1.5 Pro can identify a scene in the silent movie when given a simple line drawing as reference material for a real-life object.

The AI is able to do this thanks to an "experimental feature in long-context understanding."

Gemini 1.5 Pro comes with a standard 128,000-token context window. But following the introduction, Google said that a limited group of developers and enterprise customers can try it with a context window of up to 1 million tokens via AI Studio and Vertex AI in private preview.

This larger window enables significant advances in what the model can do, and opens up new possibilities.

1.5 Pro can seamlessly analyze, classify and summarize large amounts of content within a given prompt. For example, when given the 402-page transcripts from Apollo 11’s mission to the moon, it can reason about conversations, events and details found across the document.
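For developers and enterprises in the preview, working with that long context would look much like any other Gemini API call. The sketch below uses Google's google-generativeai Python SDK; the model name, the transcript file and exact availability are assumptions rather than confirmed details.

```python
# Minimal sketch of a long-context request via the google-generativeai SDK.
# Assumes preview access; the model name and transcript file are illustrative.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder, not a real key

model = genai.GenerativeModel("gemini-1.5-pro-latest")  # assumed model name

# Load a very large document, e.g. a mission transcript, as plain text.
with open("apollo11_transcript.txt", "r", encoding="utf-8") as f:
    transcript = f.read()

# See how much of the context window the document consumes.
print("Prompt tokens:", model.count_tokens(transcript).total_tokens)

# Ask a question that requires reasoning across the whole document.
response = model.generate_content(
    [transcript, "Summarize three key conversations between the crew and mission control."]
)
print(response.text)
```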

But its token count isn't the only thing that Gemini 1.5 Pro boasts.

According to Google, the AI also "delivers dramatically enhanced performance."

While the first Gemini 1.5 model Google is releasing for early testing, Gemini 1.5 Pro, is a mid-size multimodal model optimized for scaling across a wide range of tasks, it actually performs at a similar level to 1.0 Ultra, Google's largest model at this time.

Google achieved this by building Gemini 1.5 Pro on its Transformer and Mixture-of-Experts (MoE) architecture.

"While a traditional Transformer functions as one large neural network, MoE models are divided into smaller 'expert' neural networks," explained Google.

"Depending on the type of input given, MoE models learn to selectively activate only the most relevant expert pathways in its neural network. This specialization massively enhances the model’s efficiency. Google has been an early adopter and pioneer of the MoE technique for deep learning through research such as Sparsely-Gated MoE, GShard-Transformer, Switch-Transformer, M4 and more."

This way, Gemini 1.5 Pro is able to learn complex tasks more quickly while maintaining quality.

It's also more efficient to train and serve.

"These efficiencies are helping our teams iterate, train and deliver more advanced versions of Gemini faster than ever before, and we’re working on further optimizations," said Google.

According to Google, 1.5 Pro can also "perform more relevant problem-solving tasks across longer blocks of code."

In another example, Google said that when given a prompt with more than 100,000 lines of code, it can better reason across examples, suggest helpful modifications and explain how different parts of the code work.
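A request like that would follow the same pattern as the transcript example above: gather the code into the prompt and ask a question that spans all of it. The sketch below again uses the google-generativeai Python SDK, with the project path and model name as illustrative assumptions.

```python
# Minimal sketch: reason across a whole codebase in one long-context prompt.
# The project directory and model name are illustrative assumptions.
from pathlib import Path
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder, not a real key
model = genai.GenerativeModel("gemini-1.5-pro-latest")  # assumed model name

# Concatenate every Python file in the project, labeled by path.
codebase = ""
for path in sorted(Path("my_project").rglob("*.py")):
    codebase += f"\n\n# ===== {path} =====\n{path.read_text(encoding='utf-8')}"

response = model.generate_content(
    [codebase, "Explain how the request-handling code works and suggest one improvement."]
)
print(response.text)
```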

According to Google, other feats Gemini 1.5 Pro can pull off include processing 11 hours of audio.

Gemini 1.5 Pro can do these things because Google has given it more tokens to work with.

An AI model's "context window" is made up of tokens, and that a token can represent an part or just a subsection of words, images, videos, audio or code.

Because tokens are essentially the building blocks an AI uses to process information, giving the model more tokens lets it churn through more information.

What this means is that more information can be taken in and processed in a given prompt.

And in turn, this will make results more consistent, relevant and useful.
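To give a rough sense of scale, here is a back-of-envelope comparison of the two window sizes. The words-per-token and words-per-page figures are common rules of thumb, not Gemini-specific numbers.

```python
# Rough back-of-envelope: what a bigger token budget means in practice.
# ~0.75 English words per token and ~500 words per page are rules of thumb.
WORDS_PER_TOKEN = 0.75
WORDS_PER_PAGE = 500

for window in (128_000, 1_000_000):
    words = window * WORDS_PER_TOKEN
    pages = words / WORDS_PER_PAGE
    print(f"{window:>9,} tokens ≈ {words:>9,.0f} words ≈ {pages:>5,.0f} pages")
```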

Because of how controversial AIs like Gemini can be, Google has put it through extensive ethics and safety testing.

"In line with our AI Principles and robust safety policies, we’re ensuring our models undergo extensive ethics and safety tests. We then integrate these research learnings into our governance processes and model development and evaluations to continuously improve our AI systems," said Google.

Published: 
19/02/2024