Background

Anthropic Announces 'Prompt Caching' To Significantly Speed Up Its Claude AI

Gemini AI

In computing, there are instances where data needs to be reused repeatedly.

Instead of retrieving it each time from a slower source of storage, a computer can 'cache' it by storing it in a faster, temporary location. Caching is common in various contexts, such as web browsers, databases, applications, and hardware systems, to enhance efficiency. And here, generative AI should also utilize it to some extent.

With it, things should be more efficient, and also faster.

Generative AI products, powered by Large Language Models, can be made to process a lot of data at once.

Traditionally, these AI products, like chatbots, have to be able to build complex “prompts” or natural language data blocks in order to generate responses each time they run.

A prompt could be something as simple as "What is today's weather?" or something as long and complex prompts, like to summarize an entire document.

In the latter case, when an LLM is about to process a large document, that document must be part of the overall prompt for every subsequent conversation.

What this means, that very document, needs to be reloaded in its entirety into the AI each time a conversation is made.

For the AI, this can eat up a lot of resources.

Anthropic has a method to address this.

Using what it calls 'Prompt Caching.' the idea is that, developers can save frequently used prompts between API calls.

Introduced to its Claude family of generative AI models, this Prompt Caching feature allows developers working on AI that is powered by Claude, to be able to have their products work with long prompts that can then be referred to in subsequent requests without having to send the prompt again.

“With prompt caching, customers can provide Claude with more background knowledge and example outputs—all while reducing costs by up to 90% and latency by up to 85% for long prompts,” the company said in its announcement.

What's more, Cached Prompt can also allow developers to store detailed instructions, example responses and relevant information.

This allow them to easily set up a way to produce a consistent response between separate instances of the chatbot, without having to inject them on top of the user prompt every time.

While this feature can work in any kind of prompt, according to Anthropic, this feature is only most effective when sending a large amount of prompt context in one go, and then referring to that information repeatedly in new requests.

Using Prompt Caching, the document doesn't have to be reloaded every time a query is made.

Cached Prompt is workaround to how LLM products are supposed to work.

LLMs process prompts by splitting them up into tokens. By adding more information, means that there are more tokens to process.

More tokens mean more response time, and this means more time needed before the LLM can respond.

By reducing the amount of times a document is loaded, according to Anthropic, Prompt Caching can reduce overall costs for businesses and developers by up to 90% and improve response times by up to two times.

Not just that, because Prompt Caching also has another powerful use case, which is enhancing the performance of AI agents, where the LLM needs to make multiple calls to third-party tools, execute iterative code changes and step through complex instructions.

Initially, the feature is introduced in beta on the Anthropic API for Claude 3.5 Sonnet, the company’s most powerful multimodal LLM model, and the high-speed model Claude 3 Haiku.

Published: 
16/08/2024