Background

With 'Pico-Banana-400K,' Apple Makes A Contribution To The War, Before Making Its Next Big Move

The war of large language models (LLMs) isn't stopping. It's only getting more intense as newer and more powerful products are introduced.

What followed the arrival of OpenAI's ChatGPT was a frenzy unlike anything the tech world had seen in years: Google rushing out Gemini, Anthropic crafting Claude, Meta scaling up LLaMA, and Microsoft doubling down on OpenAI.

Apple, however, seemed to stay on the sidelines.

While rivals raced to dominate the spotlight, Apple was, and still is, lagging in the competition. Let alone competing, it is far from even being ready.

However, the company quietly kept publishing research, and each paper was a subtle signal that it was preparing something of its own.

Its latest, 'Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing,' may be the clearest clue yet.

The project reveals Apple’s growing focus on AI-driven creativity, building a foundation for machines that can understand and transform images through natural language.

Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing
Overview of the Pico-Banana-400K dataset: 386K edited images from OpenImages, processed with Nano-Banana and filtered by Gemini-2.5-Pro across single, preference, and multi-turn edits.

The dataset contains around 400,000 real-world images paired with AI-generated edits, each described in plain text.

Instead of relying on artificial or synthetic data, Apple built Pico-Banana-400K from authentic photos sourced from the OpenImages collection, grounding its AI training in reality.

Every edit was created using Google’s Gemini-2.5-Flash-Image model, nicknamed Nano-Banana. Edits include moving an object in the image, adding artistic effects, and zooming in. The results were then evaluated by the more powerful Gemini-2.5-Pro for accuracy and visual quality.

Based on that evaluation, the researchers either accepted or rejected each resulting image.
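
The generate, judge, then accept-or-reject loop described above can be sketched roughly as follows. This is only an illustration, not Apple's actual pipeline: `curate`, `EditAttempt`, the stand-in `fake_edit` and `fake_judge` functions, the 0-to-1 score scale, the `0.7` threshold, and the retry count are all hypothetical, and no real Gemini API is called.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class EditAttempt:
    source_image: str   # identifier of the original photo
    instruction: str    # plain-text edit instruction
    edited_image: str   # identifier of the edited result
    score: float        # judge score in [0, 1] (hypothetical scale)


def curate(
    source_image: str,
    instruction: str,
    edit_fn: Callable[[str, str, int], str],
    judge_fn: Callable[[str, str, str], float],
    threshold: float = 0.7,   # hypothetical acceptance threshold
    max_attempts: int = 3,    # hypothetical retry budget
) -> Tuple[Optional[EditAttempt], List[EditAttempt]]:
    """Return (accepted attempt or None, list of rejected attempts)."""
    rejected: List[EditAttempt] = []
    for attempt in range(max_attempts):
        edited = edit_fn(source_image, instruction, attempt)
        score = judge_fn(source_image, instruction, edited)
        result = EditAttempt(source_image, instruction, edited, score)
        if score >= threshold:
            return result, rejected
        # Failed edits are kept rather than thrown away.
        rejected.append(result)
    return None, rejected


# Stand-ins for the editing model and the judging model (assumptions).
def fake_edit(source: str, instruction: str, attempt: int) -> str:
    return f"{source}#edit{attempt}"


def fake_judge(source: str, instruction: str, edited: str) -> float:
    # Pretend the first attempt fails and the second succeeds.
    return 0.9 if edited.endswith("1") else 0.4


accepted, rejected = curate(
    "photo_001", "warm the lighting", fake_edit, fake_judge
)
```

Keeping the rejected attempts alongside the accepted one is what would let a pipeline like this also produce preference pairs later on.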

"The result became Pico-Banana-400K, which includes images produced through single-turn edits (a single prompt), multi-turn edit sequences (multiple iterative prompts)," say the researchers, "and preference pairs comparing successful and failed results (so models can also learn what undesirable outcomes look like)."

Having now produced this large image dataset, Apple's researchers say Pico-Banana-400K "establishes a robust foundation for training" AI image editors.

The research is motivated by what the researchers describe as a limitation of the field, even though current systems are "remarkable [at] text-guided image editing."

"[The] research community's progress remains constrained by the absence of large-scale, high-quality, and openly accessible datasets built from real images," says the full paper.

Example single-turn edits from the Pico-Banana-400K dataset, showing original (left) and edited (right) images across various transformations, from color and lighting changes to object and style adjustments, demonstrating Nano-Banana’s realism and precision.

Apple’s researchers didn’t just compile images.

Instead, they categorized them into 35 types of edits, including color correction, background changes, object relocation, and stylistic transformations.

What sets this dataset apart is its realism.

By blending automation with careful human curation, Apple ensured that the edits feel natural, consistent, and visually coherent. Even failed edits were kept, teaching future models not only what to do but also what to avoid.

Beyond scale, Pico-Banana-400K is meticulously structured. It includes:

  • 250,000 single-edit examples for training base models.
  • 72,000 multi-turn sequences showing how an image evolves through a series of edits.
  • 50,000 preference pairs comparing good versus poor outcomes.

Multi-turn editing example: beginning with a pumpkin image, the model adds film grain, swaps the background for a haunted house, shifts it to a snowy scene, then warms the lighting to a golden-hour glow for the final result.
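
To make the three subsets listed above more concrete, here is a minimal sketch of how such records might be represented. The class names, field names, and file names are hypothetical and do not reflect the dataset's actual schema; the multi-turn example simply mirrors the pumpkin sequence in the figure caption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SingleTurnExample:
    source_image: str
    instruction: str
    edited_image: str


@dataclass
class EditTurn:
    instruction: str
    edited_image: str


@dataclass
class MultiTurnExample:
    source_image: str
    turns: List[EditTurn]


@dataclass
class PreferencePair:
    source_image: str
    instruction: str
    preferred_image: str   # the successful edit
    rejected_image: str    # the failed edit


def unroll(sequence: MultiTurnExample) -> List[SingleTurnExample]:
    """Turn a multi-turn sequence into consecutive single-step edits."""
    steps: List[SingleTurnExample] = []
    current = sequence.source_image
    for turn in sequence.turns:
        steps.append(
            SingleTurnExample(current, turn.instruction, turn.edited_image)
        )
        # The output of one turn is the input of the next.
        current = turn.edited_image
    return steps


# Hypothetical file names following the pumpkin example in the caption.
pumpkin = MultiTurnExample(
    source_image="pumpkin.jpg",
    turns=[
        EditTurn("add film grain", "pumpkin_t1.jpg"),
        EditTurn("swap the background for a haunted house", "pumpkin_t2.jpg"),
        EditTurn("shift it to a snowy scene", "pumpkin_t3.jpg"),
        EditTurn("warm the lighting to a golden-hour glow", "pumpkin_t4.jpg"),
    ],
)

steps = unroll(pumpkin)
```

Each step takes the previous turn's output as its input, which is how a sequence shows an image evolving through a series of edits.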

The results highlight both potential and limits.

Global changes like tone and color adjustments succeed over 90% of the time, but complex tasks such as moving objects, altering text, or changing spatial relationships remain difficult. Still, this kind of honesty in Apple’s evaluation reflects its methodical approach: progress grounded in precision, not hype.
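
A per-edit-type success rate like the ones mentioned above could be computed along these lines. This is a sketch under stated assumptions: the function name, the category labels, and the toy verdicts are invented for illustration and are not figures from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_rate_by_type(
    records: Iterable[Tuple[str, bool]]
) -> Dict[str, float]:
    """Map each edit type to the fraction of its edits that were accepted."""
    totals: Dict[str, int] = defaultdict(int)
    wins: Dict[str, int] = defaultdict(int)
    for edit_type, was_accepted in records:
        totals[edit_type] += 1
        if was_accepted:
            wins[edit_type] += 1
    return {t: wins[t] / totals[t] for t in totals}


# Toy verdicts (invented): (edit type, accepted by the judge?)
toy_records = (
    [("color_tone", True)] * 9
    + [("color_tone", False)]
    + [("object_relocation", True), ("object_relocation", False)]
)

rates = success_rate_by_type(toy_records)
```

Grouping by edit type rather than reporting one overall number is what makes the gap between global adjustments and spatial edits visible.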

Though the dataset currently relies on Google’s models, its implications point straight to Apple’s ecosystem.

It could strengthen Apple Intelligence, power future creative tools in Photos, or make Siri capable of understanding commands like "brighten this picture" or "remove that background."

More importantly, Pico-Banana-400K, which is available for free, represents one of Apple’s largest open contributions to the AI research community.

This rare move reflects both confidence and restraint. While rivals chase attention, Apple, well aware it’s trailing in the AI race, is shaping a strategy that contributes to the field before making its next big move.

Published: 02/11/2025