
Real-time video agents have arrived, and they are changing how we think about digital characters forever.
Runway ML has announced Runway Characters, which lets anyone turn a single reference image into a fully expressive, conversational video agent streaming live at 24 frames per second in high definition with only 1.75 seconds of end-to-end latency. No fine-tuning, no extra training: just one image, and suddenly a photorealistic person, a cartoon mascot, or a fantasy creature starts talking back to you with natural lip sync, facial expressions, and head movements, all driven by real-time audio input.
The demo video captures the magic perfectly. An orange tabby cat with bright green eyes and a cheeky, toothy grin answers a simple question about the capital of France, complete with animated blinks, wide smiles, and enthusiastic head tilts.
The timing feels alive and responsive as the character reacts instantly to spoken prompts.
Then the screen fills with a grid of diverse characters: an elderly bearded man, a wide-eyed white monster, a blond animated boy, a professional woman with her hair in a bun, and more, each one speaking fluidly and maintaining a consistent style and personality. This kind of interaction once felt decades away.
But what sets this apart is the sophisticated engineering that makes real-time performance possible.
Runway Can Now Let Users' AI Characters Join Zoom, Meet, And Teams Calls On Their Behalf
Real-time video agents are here.
Today, we’re sharing how we built Runway Characters, allowing you to turn one image into a fully expressive, conversational video agent streaming at 24 frames per second in HD. With just 1.75 seconds of end-to-end latency.
Learn more below. pic.twitter.com/CJqv3Kdl0v — Runway (@runwayml) May 4, 2026
Traditional video generation works offline, where you can afford seconds per frame, but a live conversation demands a strict budget of roughly 42 milliseconds per frame to hit 24 frames per second.
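The arithmetic behind that number is straightforward, as the quick check below shows.

```python
# Per-frame time budget for a live 24 fps stream.
TARGET_FPS = 24
frame_budget_ms = 1000 / TARGET_FPS  # ~41.7 ms per frame

print(f"Frame budget at {TARGET_FPS} fps: {frame_budget_ms:.1f} ms")
# Frame budget at 24 fps: 41.7 ms
```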
Runway solved this with autoregressive frame-by-frame generation, streaming each new frame directly to the viewer instead of rendering entire clips at once.
They applied distribution matching distillation to cut down the number of denoising steps, parallelized inference across multiple devices, and used techniques like KV-cache management, CUDA graphs, and custom kernels to keep everything running smoothly at HD resolution.
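Runway has not published the code behind this pipeline, so the snippet below is only a rough sketch of what an autoregressive, frame-by-frame loop with a bounded KV cache can look like; the function names and the placeholder encoder and decoder are assumptions for illustration, not Runway's implementation.

```python
import time
from collections import deque

TARGET_FPS = 24
FRAME_BUDGET_S = 1.0 / TARGET_FPS   # ~41.7 ms of wall-clock time per frame

def encode_audio_chunk(chunk):
    """Stand-in for an audio encoder that turns live speech into per-frame conditioning."""
    return {"audio_features": chunk}

def generate_frame(conditioning, kv_cache):
    """Stand-in for a distilled, few-step frame decoder.

    A real decoder would run its reduced denoising schedule here, attending to
    the kv_cache of previously generated frames rather than re-encoding them.
    """
    frame = {"pixels": f"frame conditioned on {conditioning['audio_features']}"}
    kv_cache.append(frame)          # extend the cached context with the new frame
    return frame

def stream_frames(audio_chunks, send_frame):
    """Generate and ship one frame at a time instead of rendering a whole clip."""
    kv_cache = deque(maxlen=48)     # bounded window of past-frame context
    for chunk in audio_chunks:
        start = time.perf_counter()
        frame = generate_frame(encode_audio_chunk(chunk), kv_cache)
        send_frame(frame)           # push to the viewer immediately
        elapsed = time.perf_counter() - start
        time.sleep(max(0.0, FRAME_BUDGET_S - elapsed))  # pace output at 24 fps

if __name__ == "__main__":
    stream_frames(range(5), send_frame=lambda f: print("sent:", f["pixels"]))
```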
The result is an effective 37 milliseconds of model time per frame, with the video pipeline itself handling the heavy lifting in just 567 milliseconds while the voice agent takes about 1.185 seconds, leaving room for network travel time.
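Those figures are consistent with each other: the reported stage times add up to roughly the quoted 1.75-second response, and 37 milliseconds of model time leaves a few milliseconds of headroom under the per-frame budget of a 24 fps stream.

```python
# End-to-end latency breakdown, using the figures Runway reports.
voice_agent_ms = 1185      # speech-to-response portion
video_pipeline_ms = 567    # frame generation and delivery portion
total_ms = voice_agent_ms + video_pipeline_ms
print(f"Total: {total_ms} ms (~{total_ms / 1000:.2f} s)")   # ~1.75 s

# Per-frame headroom: effective model time vs. the 24 fps budget.
model_time_ms = 37
frame_budget_ms = 1000 / 24
print(f"Headroom per frame: {frame_budget_ms - model_time_ms:.1f} ms")  # ~4.7 ms
```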
Practical applications open up immediately.
Creators can build tutoring sessions where a character explains concepts while sharing a screen, game companions that react to player choices, or brand mascots that handle customer questions with personality intact.
Voice customization is seamless through text prompts or instant cloning from short audio samples, and characters can pull from uploaded documents for accurate, domain-specific answers or call tools to perform actions like updating a game state or fetching live data.
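Runway's exact schema for tools and document grounding is not shown here, so the following is a purely illustrative sketch in the generic JSON style many conversational agents use; the update_game_state tool and its handler are hypothetical, not Runway's API.

```python
import json

# Hypothetical tool definition in a generic JSON-schema style; not Runway's actual API.
UPDATE_GAME_STATE_TOOL = {
    "name": "update_game_state",
    "description": "Record a change the player made, so the companion reacts to it.",
    "parameters": {
        "type": "object",
        "properties": {
            "player": {"type": "string"},
            "event": {"type": "string", "description": "e.g. 'opened the vault'"},
        },
        "required": ["player", "event"],
    },
}

def handle_tool_call(call_json: str) -> str:
    """Dispatch a tool call the character emits mid-conversation (illustrative only)."""
    call = json.loads(call_json)
    if call["name"] == "update_game_state":
        args = call["arguments"]
        # In a real integration this would hit the game backend or a live data source.
        return f"Noted: {args['player']} {args['event']}."
    return "Unknown tool."

print(handle_tool_call(json.dumps({
    "name": "update_game_state",
    "arguments": {"player": "Mira", "event": "opened the vault"},
})))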
Embedding these agents into websites takes a single line of code, and they even integrate directly into Zoom, Google Meet, or Teams for meetings that feel genuinely interactive.
From when the user stops speaking to when the Character starts replying is just 1.75 seconds. Under the hood, the video model runs at 37 milliseconds of effective model time per frame, at more than 24fps HD. pic.twitter.com/9Qp0lcSb61
— Runway (@runwayml) May 4, 2026
Compared to pure generative video tools like Kling or Pika, Runway Characters shifts the game from offline clip creation to live interaction.
While those platforms shine at high-quality short videos with strong motion and realism, they do not deliver streaming conversational agents with the low-latency perception-to-response loop described here. Runway combines the visual fidelity of modern generators with the interactivity of avatar platforms in one unified system.
Runway has made its Characters feature live across the Runway API, web app, and mobile apps, so developers and creators can jump in without waiting.
The focus on low latency unlocks experiences that previously felt clunky or delayed, turning static images into dynamic partners that respond with the fluidity of a real conversation.