Developers, programmers, and coders all have one thing in common: they love speed.
There's nothing more satisfying than an error-free project, smooth and seamless, running at peak performance. In reality, though, that is hard to achieve: when processing prowess and the internet are involved, there is always that dreaded latency that can kill all the joy.
Latency is the delay between a user's action and the response they receive from a system. It’s essentially the time it takes for data to travel from one point to another in a network.
In tech terms, it's often measured in milliseconds (ms).
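To make that concrete, here is a minimal Python sketch that times a single HTTP round trip; the URL is a placeholder rather than a real endpoint.

```python
import time
import urllib.request

# Time one HTTP round trip; the URL below is a placeholder endpoint.
start = time.perf_counter()
urllib.request.urlopen("https://api.example.com/health").read()
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Round-trip latency: {elapsed_ms:.1f} ms")
```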
OpenAI, the creator of ChatGPT, has introduced a new feature aimed at helping developers accelerate their projects. The feature, called "Predicted Outputs," lets developers improve their applications' performance by supplying the model with a prediction of the expected output, so responses come back faster.
With Predicted Outputs, developers can reduce latency in responses, improving the efficiency and speed of their applications, especially in cases where quick responses are critical to user experience.
This tool is designed to give developers using OpenAI’s platform an edge by making interactions more seamless and efficient, ultimately driving better performance in their projects.
Introducing Predicted Outputs—dramatically decrease latency for gpt-4o and gpt-4o-mini by providing a reference string. https://t.co/n6mqjQwQV1
Speed up:
- Updating a blog post in a doc
- Iterating on prior responses
- Rewriting code in an existing file, like @exponent_run here: pic.twitter.com/c9O3YtHH7N— OpenAI Developers (@OpenAIDevs) November 4, 2024
On a dedicated web page, OpenAI explained the method: pass existing files, such as class files, as predicted text.
This way, developers using GPT-4o and GPT-4o mini can have their apps regenerate the entire file much more quickly.
"Generating tokens with Predicted Outputs should result in much lower latency on these types of requests," said OpenAI.
OpenAI then went on to explain other ways developers can speed up their projects and further reduce latency.
Among them is improving inference speed by making the app process tokens faster.
To do this, developers may opt for a smaller model.
To maintain high-quality performance with smaller models, developers can try longer and more detailed prompts, more few-shot examples, fine-tuning, and more.
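As a rough illustration (not an official OpenAI example), here is a minimal sketch of that approach: gpt-4o-mini with a detailed instruction and a couple of few-shot examples for a made-up sentiment-labeling task.

```python
from openai import OpenAI

client = OpenAI()

# A smaller model with a precise instruction and a few examples often matches
# a larger model on a narrow task while responding faster.
few_shot = [
    {"role": "system", "content": "Classify each review as 'positive' or 'negative'. Reply with one word."},
    {"role": "user", "content": "The checkout flow was painless and fast."},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "The app crashed twice before I could log in."},
    {"role": "assistant", "content": "negative"},
]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=few_shot + [{"role": "user", "content": "Support never answered my ticket."}],
)
print(response.choices[0].message.content)
```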
See @FactoryAI's results: https://t.co/OgTS7ZP4Lo — OpenAI Developers (@OpenAIDevs) November 4, 2024
OpenAI also suggests developers have their app generate fewer tokens, because "generating tokens is almost always the highest latency step when using an LLM."
They can also reduce the number of input tokens, and make fewer requests.
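One way to make fewer requests is to batch several small items into a single prompt. The sketch below uses a made-up headline-tagging task to show the idea; it is not an official OpenAI example.

```python
from openai import OpenAI

client = OpenAI()

titles = [
    "Quarterly earnings beat expectations",
    "New framework release breaks older plugins",
    "Data center outage resolved after six hours",
]

# One request that handles all items at once, instead of one request per title.
prompt = (
    "Tag each headline as 'finance', 'software', or 'operations'. Answer as a numbered list.\n"
    + "\n".join(f"{i + 1}. {t}" for i, t in enumerate(titles))
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```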
Developers can also use the max_tokens or stop_tokens settings to end their generation early.
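In the Chat Completions API, the corresponding parameters are max_tokens and stop (a list of stop sequences). A minimal sketch with illustrative values:

```python
from openai import OpenAI

client = OpenAI()

# Cap the output length and stop at the first newline so generation ends
# early instead of producing a long answer.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "In one short sentence, what is latency?"}],
    max_tokens=40,   # hard ceiling on generated tokens
    stop=["\n"],     # stop sequence ends generation early
)
print(response.choices[0].message.content)
```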
OpenAI continues by saying that developers can parallelize when performing multiple steps with an LLM, and if possible, make users watch progress rather than have them wait.
"There's a huge difference between waiting and watching progress happen – make sure your users experience the latter," explained OpenAI.
Lastly, OpenAI advised developers not to default to an LLM.
"LLMs are extremely powerful and versatile, and are therefore sometimes used in cases where a faster classical method would be more appropriate. Identifying such cases may allow you to cut your latency significantly."