
AdaCodec reduces visual token count in video language models by encoding only frame-to-frame changes, borrowing the residual logic of H.264 and HEVC codec…

Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet most video multimodal large language models (MLLMs) encode each sampled frame as an independent RGB image, so visual tokens repeat content the model already saw a second or so earlier. AdaCodec proposes a predictive code for those tokens, borrowing a trick that classical video codecs have used since the 1990s.

## What the paper changes

The relevant work is AdaCodec: A Predictive Visual Code for Video MLLMs. Its observation is mundane and its consequence is large: when a video MLLM samples one frame every 0.5 to 2 seconds, the encoder still treats those frames as independent images. The vision tower produces tokens; the projector maps them into the language model; the LLM then has to spend context budget on near-duplicates.

Video codecs solved an analogous problem decades ago. H.264 and HEVC encode keyframes (I-frames) fully and represent the remaining frames as residuals against motion-compensated predictions: P-frames predict from earlier frames, and B-frames can also draw on later ones. The bits that actually flow over the wire are dominated by what changed, not what stayed the same. AdaCodec applies the same structure to the visual code consumed by a language model.

At a high level, the design has three properties worth naming:

- An anchor frame is encoded with full visual tokens, similar to a standard video MLLM.
- Subsequent frames are represented by a smaller predictive code that captures the delta against a prediction derived from the anchor.
- The number of tokens spent per frame becomes adaptive: static stretches collapse, scene cuts and motion-heavy spans get more budget.

This is not novel as an idea in vision; it is novel as an interface to a language model that was not built to understand residuals.

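
As a rough illustration of the adaptive-budget property, the sketch below assigns a token count to each frame from a per-frame change score. Everything in it is an assumption made for illustration: the function name `allocate_tokens`, the change score in [0, 1], the 0.6 cut threshold, and the 16-token floor. It is not the paper's allocation mechanism, which may work quite differently.

```python
from typing import Sequence


def allocate_tokens(
    change_scores: Sequence[float],
    full_tokens: int = 256,
    min_tokens: int = 16,
    cut_threshold: float = 0.6,
) -> list[int]:
    """Assign a visual-token budget to each sampled frame.

    change_scores[i] is a hypothetical measure in [0, 1] of how much frame i
    differs from the prediction derived from the most recent anchor.
    """
    budgets: list[int] = []
    for index, score in enumerate(change_scores):
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"change score out of range at frame {index}: {score}")
        if index == 0 or score >= cut_threshold:
            # First frame or a likely scene cut: encode as a full anchor.
            budgets.append(full_tokens)
        else:
            # Predicted frame: the budget grows with the amount of change
            # but never drops below the floor.
            budgets.append(max(min_tokens, round(full_tokens * score)))
    return budgets


scores = [0.0, 0.0, 0.25, 0.9, 0.5]
print(allocate_tokens(scores))  # [256, 16, 64, 256, 128]
```

Static frames fall to the floor, the frame scoring 0.9 is treated as a cut and gets a full anchor, and moderate motion lands in between.
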

## Why this matters for agentic systems

For teams running video-consuming agents (surveillance triage, long-form meeting review, robotic perception, sports analytics), the binding constraint is almost always tokens per second of footage. A 30-minute meeting at one frame per second with 256 tokens per frame is 460,800 visual tokens before any text. Most production systems respond by sampling more sparsely, which trades recall for latency, or by adding a separate summarization layer, which trades fidelity for a smaller context.

A predictive code changes the shape of that tradeoff. If 80 percent of frames in a typical clip can be represented in, say, one-quarter the tokens of the anchor, an agent can either:

1. Process longer clips in a single forward pass without raising context length.
2. Sample more densely at the same context budget, which improves event localization.
3. Run the same workload at lower cost, which matters when the agent is one of many in a larger pipeline.

The third option is the one operators usually care about. Token cost dominates inference economics for video tasks far more than it does for text, and the per-frame token count, not the model size, is typically the most effective lever.

### Where it slots into an agent stack

Most video-consuming agents today look something like this: a sampler picks frames from a stream, a vision encoder produces tokens, a projector aligns them with a language model embedding space, and a planner or evaluator consumes the resulting trajectory.

AdaCodec sits at the projector boundary. It does not change the vision tower and it does not change the LLM. It changes what gets written into the prompt, and that is the part that scales with footage length.

The practical implication: if you already have an evaluation harness that scores a video MLLM on, say, EgoSchema or VideoMME, you can in principle swap the projector and rerun the same evals. You do not need to retrain the underlying language model.
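
To make the projector-boundary swap and the token arithmetic in this section concrete, here is a minimal sketch that counts prompt tokens for the 30-minute meeting under two interchangeable codes. `VisualCode`, `FlatCode`, `PredictiveCode`, and `prompt_tokens` are hypothetical names, not AdaCodec's API, and the one-anchor-in-five schedule is an assumption chosen only to match the 80 percent figure used above. Under those assumptions the predictive code writes 40 percent of the flat token count into the prompt while the rest of the pipeline stays untouched.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


class VisualCode(Protocol):
    """Anything that decides how many tokens a frame adds to the prompt."""

    def tokens_for(self, is_anchor: bool) -> int: ...


@dataclass(frozen=True)
class FlatCode:
    """Baseline: every frame costs the same number of tokens."""

    tokens_per_frame: int = 256

    def tokens_for(self, is_anchor: bool) -> int:
        return self.tokens_per_frame


@dataclass(frozen=True)
class PredictiveCode:
    """Hypothetical predictive code: anchors are full, other frames cost a quarter."""

    anchor_tokens: int = 256
    predicted_tokens: int = 64

    def tokens_for(self, is_anchor: bool) -> int:
        return self.anchor_tokens if is_anchor else self.predicted_tokens


def prompt_tokens(code: VisualCode, anchor_flags: Iterable[bool]) -> int:
    """Total visual tokens written into the prompt for one clip."""
    return sum(code.tokens_for(flag) for flag in anchor_flags)


frames = 30 * 60 * 1  # 30 minutes sampled at 1 frame per second
# Assumption: every fifth frame is an anchor, so 80% of frames are predicted.
anchor_flags = [i % 5 == 0 for i in range(frames)]

flat_total = prompt_tokens(FlatCode(), anchor_flags)
predictive_total = prompt_tokens(PredictiveCode(), anchor_flags)
print(flat_total)                     # 460800
print(predictive_total)               # 184320
print(predictive_total / flat_total)  # 0.4
```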
Swapping only the projector is the kind of substitution that makes adoption tractable for teams who would rather not own a multimodal training stack.

## What to verify before adopting it

The paper makes claims that need to be checked against the workloads you actually run. A predictive code is, by construction, a lossy code; the question is whether the loss falls in places that matter for your downstream task.

Areas worth probing in an internal evaluation:

- Scene cut handling. Predictive codes assume temporal continuity. When the camera cuts hard, the prediction fails and the residual is effectively a new anchor. Whether the encoder detects this automatically or whether you need to inject keyframes is a question you should answer empirically.
- Fine-grained motion. Tasks that hinge on small but semantically important movements (a hand entering a frame, a label being read off a screen) may be exactly the frames where token reduction hurts most.
- Long-horizon recall. A predictive code reduces tokens per frame but lengthens the dependency chain across frames. If your task asks the model to recall something from 20 minutes ago, the anchor structure matters more than the per-frame compression ratio.
- Eval drift versus a flat encoder. Run a paired evaluation on your own benchmark set, not only the public ones in the paper. Public video benchmarks are heavy on cooking videos and instructional content, which have predictable temporal structure. Your traffic may not.

A reasonable acceptance test: hold token budget per minute of video constant, run AdaCodec and your current encoder against the same eval suite, and look for tasks where AdaCodec actually loses. Those losses are more informative than the average win.

## The connection to eval-driven operations

The reason this paper is interesting beyond the architectural point is what it implies about how to operate video agents. If the visual code is adaptive, then the token cost of a given clip is a function of its content, not only its length.
That breaks an assumption many cost models quietly make: that inference cost per minute of footage is roughly constant.

For an operations team, the consequences are concrete:

- Per-tenant cost forecasting needs to look at content statistics, not just minutes ingested. A customer sending mostly static security camera footage will be cheaper than one sending edited social video at the same duration.
- SLA design should distinguish between latency on anchor frames and latency on predicted frames, because the work is not symmetric.
- Eval suites need stratification by motion and cut density. A model that wins on the average score but loses badly on high-cut content is a model that will produce support tickets in production.

This is the part of agentic operations that gets under-discussed. Most teams treat the model as a black box with a price per token. Predictive visual codes make the relationship between content and cost explicit, which is good for budgeting and inconvenient for any team that has been amortizing variance across customers.

## How this fits with other compression directions

AdaCodec is one of several lines of work attacking the video token problem. The others are worth naming so the design choice is legible:

- Spatial token reduction. Methods that prune or merge tokens within a single frame. These are content-agnostic across time and tend to plateau because they cannot exploit temporal redundancy.
- Temporal pooling. Methods that average or attend across windows of frames before producing tokens. These collapse time but tend to lose precise event boundaries.
- Sparser sampling. Pick fewer frames. Simple, often effective, but it caps the frame rate at which the model can localize events.
- Predictive coding (AdaCodec and similar). Keeps the sampling rate, reduces tokens per frame based on temporal context.

These are not mutually exclusive.
A production system might sample at 2 fps, apply spatial token merging within each frame, then apply a predictive code across frames. The interesting question for a team is which combination passes their eval suite at their target cost.

Related compression work on the language side, such as From Layers to Submodules on replacement-based LLM compression, is orthogonal. It compresses the model; AdaCodec compresses the input. A serious cost program will end up doing both.

## A short integration checklist

For teams considering a serious evaluation, the steps are roughly:

1. Pick three to five video tasks that represent your actual production traffic. Avoid leaning entirely on public benchmarks.
2. Establish a baseline with your current encoder at a fixed token-per-minute budget.
3. Replace the projector with the predictive code under test, holding the vision tower and LLM constant.
4. Rerun the eval suite and stratify results by content type: static, low-motion, high-motion, high-cut.
5. Measure not only accuracy but token cost per task instance, and the variance of that cost across instances.
6. Audit failure cases. If failures cluster around scene cuts or fine motion, decide whether you can route those segments to the flat encoder.

A hybrid routing layer is often the most pragmatic outcome: predictive coding for the bulk of footage, flat encoding for segments flagged by a cheap upstream classifier. That is the kind of architecture that survives contact with production.

## What is still open

The paper is a proof of concept; it is not the final shape of predictive visual codes for MLLMs. Several questions are unresolved:

- How does the predictive code interact with reasoning models that produce long chains of thought referencing specific frames? If the model needs to point at frame 47, the addressing scheme has to remain coherent across compressed and uncompressed frames.
- How robust is the code under streaming conditions where frames arrive incrementally and the anchor cannot be chosen with global knowledge?
- How does it compose with retrieval over video? If a retriever pulls clip segments out of context, the predictive code is broken at the boundary; the retrieved clip needs its own anchor.

These are tractable engineering problems, not fundamental obstacles. Expect follow-up work that addresses streaming and retrieval, because those are the deployment shapes that matter for autonomous agents that watch video continuously rather than analyzing fixed clips on demand.
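
The retrieval question above can be sketched in a few lines: when a segment is cut out of a longer stream, its first frame is promoted to an anchor so the remaining predicted frames have something to predict from. The `"anchor"` and `"predicted"` labels and the `reanchor_segment` helper are hypothetical, introduced here only for illustration. A real system would also have to re-encode the promoted frame with full tokens, which this sketch does not attempt.

```python
from typing import Sequence

ANCHOR = "anchor"
PREDICTED = "predicted"


def reanchor_segment(kinds: Sequence[str], start: int, end: int) -> list[str]:
    """Return frame kinds for kinds[start:end] with a guaranteed leading anchor."""
    if not 0 <= start < end <= len(kinds):
        raise ValueError("bounds must satisfy 0 <= start < end <= len(kinds)")
    segment = list(kinds[start:end])
    # A predicted frame at the boundary refers to an anchor that lies outside
    # the retrieved segment, so promote it to a fresh anchor.
    segment[0] = ANCHOR
    return segment


stream = [ANCHOR, PREDICTED, PREDICTED, PREDICTED, ANCHOR, PREDICTED]
print(reanchor_segment(stream, 2, 5))  # ['anchor', 'predicted', 'anchor']
```

The same boundary rule would apply to a hybrid routing layer that hands segments back and forth between a flat encoder and a predictive one.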