Eat Tokens Wisely

A context-compression layer for LLMs and agents, with two modes. The lossless mode is a byte-exact structural codec for repetitive or structured input (JSON tool output, logs). The lossy mode is an extractive selector for prose (docs, search results, conversations) that keeps the smallest set of original, verbatim spans under a token budget. There is no generative model in the compression path: it adds no fabricated content, and it is measured against gold labels, not an LLM judge.

We treat compression as rate–distortion source coding: the token budget is a knob that slides along a Pareto frontier, which we plot on real data (HotpotQA, with gold answers and gold supporting facts). Every kept token comes from the source and carries provenance, so “zero unsupported claims” is a string-containment property, not an opinion.
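The string-containment property can be checked mechanically. The sketch below is a minimal illustration, not the project's code: `Span`, `make_span` and `unsupported_spans` are hypothetical names, and it assumes provenance is stored as character offsets into the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset into the source
    end: int    # exclusive character offset into the source
    text: str   # kept text, expected to equal source[start:end]


def make_span(source: str, text: str) -> Span:
    """Locate `text` in `source` and record its offsets (raises ValueError if absent)."""
    start = source.index(text)
    return Span(start, start + len(text), text)


def unsupported_spans(source: str, kept: list[Span]) -> list[Span]:
    """Return every kept span whose text is not the exact source slice at its offsets."""
    return [s for s in kept if source[s.start:s.end] != s.text]


source = "Paris is the capital of France. It hosts the Louvre."
kept = [make_span(source, "It hosts the Louvre.")]
fabricated = Span(0, 5, "Rome")  # a span that was never in the source

print(unsupported_spans(source, kept))          # empty list: fully supported
print(unsupported_spans(source, [fabricated]))  # the fabricated span is flagged
```

An empty result means every kept span is a verbatim slice of the source, which is the whole claim.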

Prove it live · watch compressed context answer correctly

real Claude calls · held-out HotpotQA · scored vs the gold answer · not pre-recorded

The proof · accuracy vs tokens (Pareto frontier)

Downstream answer quality

A frozen Claude Haiku reader answers from the compressed context only, and its answers are scored with the official HotpotQA EM and token-F1 metrics against the gold answer. The same reader is used across all arms: a ruler, not a judge.
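For readers unfamiliar with the two metrics, here is a minimal sketch of exact match and token-level F1 using the SQuAD-style answer normalization (lowercase, strip punctuation and articles, collapse whitespace). It is an approximation for illustration only and does not replace the official HotpotQA evaluation script; the function names are assumptions.

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)


def normalize_answer(s: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    s = "".join(ch for ch in s.lower() if ch not in _PUNCT)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred = normalize_answer(prediction).split()
    ref = normalize_answer(gold).split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower!"))
print(token_f1("Paris France", "Paris"))
```

Both functions are deterministic string operations, which is why the reader can act as a ruler: the scoring step involves no model call.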

Gold-fact retention (zero-API)

Fraction of human-labelled supporting facts kept, vs tokens sent. Distortion = facts dropped. Pure set comparison — no model call anywhere.
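The set comparison is small enough to write out. This sketch assumes each fact is identified by a (title, sentence index) pair, as in HotpotQA's supporting_facts field; the function names and the empty-gold convention are assumptions.

```python
Fact = tuple[str, int]  # (article title, sentence index)


def gold_fact_retention(gold: set[Fact], kept: set[Fact]) -> float:
    """Fraction of gold supporting facts that survive compression."""
    if not gold:
        return 1.0  # nothing to drop; a convention assumed for this sketch
    return len(gold & kept) / len(gold)


def distortion(gold: set[Fact], kept: set[Fact]) -> float:
    """Fraction of gold supporting facts that were dropped."""
    return 1.0 - gold_fact_retention(gold, kept)


gold = {("Louvre", 0), ("Paris", 2)}
kept = {("Louvre", 0), ("Seine", 1)}
print(gold_fact_retention(gold, kept), distortion(gold, kept))
```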

Learned scorer beats lexical ranking

Held-out AUC at the unit level: predict “is this span a gold supporting fact”. A tiny CPU logistic regression trained on gold labels.

AUC ↑ better · trained on HotpotQA supporting_facts
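As a reminder of what the number means, AUC is the probability that a randomly chosen gold span receives a higher score than a randomly chosen non-gold span. A dependency-free sketch of that definition follows (quadratic in the number of spans, so for illustration only; `roc_auc` is a hypothetical helper, not the project's evaluation code):

```python
def roc_auc(labels: list[int], scores: list[float]) -> float:
    """Pairwise AUC: P(score of a positive > score of a negative), ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = gold supporting fact) and scorer outputs.
labels = [1, 0, 1, 0]
scores = [0.9, 0.1, 0.4, 0.6]
print(roc_auc(labels, scores))
```

A value of 0.5 corresponds to random ranking and 1.0 to a scorer that ranks every gold span above every non-gold span.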

Robust to duplicate retrieval

Inject R near-identical copies of one relevant passage (multi-tool agents re-surface the same content). Near-dup suppression holds recall flat; ablating it crashes recall and wastes budget on copies.
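To make the idea concrete, here is one simple way to suppress near-duplicates: compare word shingles with Jaccard similarity and drop any passage too close to one already kept. This is a swapped-in illustration, not the project's method (the pipeline description mentions submodular diversity); the shingle size and the 0.8 threshold are assumptions.

```python
def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Set of lowercase word n-grams; short texts become a single shingle."""
    toks = text.lower().split()
    if len(toks) < n:
        return {tuple(toks)}
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Intersection over union of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def suppress_near_dups(passages: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a passage only if it is not too similar to any passage already kept."""
    kept, kept_shingles = [], []
    for p in passages:
        sh = shingles(p)
        if all(jaccard(sh, k) < threshold for k in kept_shingles):
            kept.append(p)
            kept_shingles.append(sh)
    return kept


passages = [
    "The Louvre is the most visited museum in the world.",
    "the louvre is  the most visited museum in the WORLD.",  # same text, different casing/spacing
    "Mount Everest is the tallest mountain on Earth.",
]
print(suppress_near_dups(passages))
```

Without this step, R copies of one passage can each score highly and consume the budget R times over.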

Lossless mode · structured / tool-output data (zero information loss)

For structured, repetitive LLM inputs (MCP/API tool output, logs) we factor repeated sub-trees and string values into shared definitions — provably lossless (decode(encode(x))==x, unit-tested). The compact "definitions + references" form is read natively by the LLM, so we cut the tokens *sent to the LLM* with no information loss and no accuracy drop.
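A toy version of the "definitions + references" idea is sketched below. It factors only repeated string values (not sub-trees), checks equality of the parsed values instead of raw bytes, and uses an "@" reference syntax with escaping that is invented for this sketch; none of it is the project's actual wire format.

```python
import json
from collections import Counter


def _strings(node):
    """Yield every string value in a JSON-like structure (dict keys are left alone)."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, list):
        for item in node:
            yield from _strings(item)
    elif isinstance(node, dict):
        for value in node.values():
            yield from _strings(value)


def encode(data, min_len: int = 8):
    """Replace repeated string values with '@<index>' references into a defs table."""
    counts = Counter(_strings(data))
    defs = [s for s, c in counts.items() if c >= 2 and len(s) >= min_len]
    index = {s: i for i, s in enumerate(defs)}

    def walk(node):
        if isinstance(node, str):
            if node in index:
                return f"@{index[node]}"
            # Escape literal strings starting with '@' so they never look like references.
            return "@" + node if node.startswith("@") else node
        if isinstance(node, list):
            return [walk(x) for x in node]
        if isinstance(node, dict):
            return {k: walk(v) for k, v in node.items()}
        return node

    return {"defs": defs, "body": walk(data)}


def decode(packed):
    """Invert encode(): expand references and undo the '@' escaping."""
    defs = packed["defs"]

    def walk(node):
        if isinstance(node, str):
            if node.startswith("@@"):
                return node[1:]
            if node.startswith("@"):
                return defs[int(node[1:])]
            return node
        if isinstance(node, list):
            return [walk(x) for x in node]
        if isinstance(node, dict):
            return {k: walk(v) for k, v in node.items()}
        return node

    return walk(packed["body"])


records = [
    {"id": i, "status": "completed_successfully", "region": "eu-west-1-production", "owner": "@home"}
    for i in range(20)
]
packed = encode(records)
assert decode(packed) == records  # round trip
print(len(json.dumps(records)), "->", len(json.dumps(packed)), "characters")
```

The round-trip assertion is the property that matters: whatever the compact form looks like, decoding it must reproduce the input exactly.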

Generalization · same compressor, different use cases (proven, not assumed)

The exact same HotpotQA-trained compressor — never retrained — run on three structurally different QA datasets. It compresses and preserves the majority of answer quality on all of them, always far above random. Honest caveat: out-of-domain, the learned scorer's edge over BM25 disappears (BM25 transfers as well or better) — the framework generalizes; the learned edge is in-domain.

The new idea · reader-grounded label-source ablation

Compressors are usually trained on generic statistics or human relevance labels. We ask a sharper question: what if the selector is trained on a signal derived from the reader model itself, keeping a span only if removing it actually changes claude-haiku's answer, and does that signal survive distillation into a cheap CPU scorer? Same selector, same features, same budget; only the training label differs: A = human supporting facts · B = LLMLingua-2-style importance · C = reader-grounded answer impact. We report the C−A and C−B deltas with confidence intervals, whichever way they fall.
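The reader-grounded label (C) can be pictured as a leave-one-out ablation. In the sketch below, `reader` is a hypothetical callable standing in for the model call, and "changes the answer" is assumed to mean plain string inequality; the real pipeline may compare answers differently.

```python
from typing import Callable


def answer_impact_labels(
    spans: list[str],
    question: str,
    reader: Callable[[str, str], str],
) -> list[int]:
    """Label a span 1 if removing it changes the reader's answer, else 0."""
    full_answer = reader(question, " ".join(spans))
    labels = []
    for i in range(len(spans)):
        ablated = " ".join(spans[:i] + spans[i + 1:])
        labels.append(int(reader(question, ablated) != full_answer))
    return labels


def toy_reader(question: str, context: str) -> str:
    """Stand-in for a real model call: answers 'Paris' only if the context mentions it."""
    return "Paris" if "Paris" in context else "unknown"


spans = ["Paris is the capital of France.", "France is in Europe."]
print(answer_impact_labels(spans, "What is the capital of France?", toy_reader))
```

Labelling costs one reader call per span plus one for the full context, which is why these labels are generated once for training and then distilled into the CPU scorer.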

Label source → downstream F1 @ matched tokens

Three scorers, identical except the training label, at budget 240 tokens. The contribution is whether C (reader-grounded) beats A (human) and B (importance) — with paired bootstrap CIs.
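A paired bootstrap resamples questions, not individual scores, so each resample compares the two scorers on the same questions. The sketch below is one common percentile variant; the resample count, the 95% level and the example F1 values are assumptions made up for illustration.

```python
import random


def paired_bootstrap_ci(
    scores_a: list[float],
    scores_b: list[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (mean delta B-A, lower bound, upper bound) using a percentile bootstrap."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]  # per-question paired deltas
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lo, hi


# Made-up per-question F1 for scorer A (human labels) and scorer C (reader-grounded).
f1_a = [0.50, 0.80, 0.00, 1.00, 0.67, 0.40]
f1_c = [0.60, 0.80, 0.20, 1.00, 0.67, 0.50]
delta, lo, hi = paired_bootstrap_ci(f1_a, f1_c)
print(f"C - A = {delta:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

If the interval excludes zero, the delta is unlikely to be resampling noise; if it straddles zero, the honest reading is "no detectable difference at this sample size".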

Economics · real $ saved (the sponsor's business)

Compression saves the same number of tokens regardless of which model reads them, so the dollars saved scale with the reader's input price. The compressor runs on CPU and makes no frontier-model call, so the savings are real, not circular.
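The arithmetic behind that scaling is a single multiplication. In this sketch the token counts, call volume and price are placeholders, not measured results or real price-list figures.

```python
def dollars_saved(
    tokens_before: int,
    tokens_after: int,
    price_per_million_input_tokens: float,
    calls: int = 1,
) -> float:
    """Input-token cost avoided across `calls` requests, in dollars."""
    saved_tokens = (tokens_before - tokens_after) * calls
    return saved_tokens * price_per_million_input_tokens / 1_000_000


# Placeholder numbers: 2,000-token contexts compressed to 500 tokens, one million calls,
# and a hypothetical input price of $3.00 per million tokens.
print(dollars_saved(2000, 500, 3.0, calls=1_000_000))
```

Doubling the reader's input price doubles the saving while the compressor's CPU cost stays the same.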

Live compressor · paste your own input


🧠 semantic rerank: an embedding model that catches differently worded answers (still no LLM)
🛡️ safe mode: auto-widens the budget if the answer's terms aren't covered, so the answer is never silently dropped
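One way safe mode could work is sketched here: select under the budget, check whether the terms the task needs appear in the kept text, and widen the budget stepwise until they do. Everything in the sketch is an assumption, including the whitespace token count, the step size, the substring coverage test, and the names `select` and `safe_select`.

```python
def select(spans: list[str], scores: list[float], budget: int) -> list[str]:
    """Greedy: take spans by descending score while they fit; return them in source order."""
    order = sorted(range(len(spans)), key=lambda i: scores[i], reverse=True)
    kept, used = [], 0
    for i in order:
        cost = len(spans[i].split())  # crude token count, assumed for illustration
        if used + cost <= budget:
            kept.append(i)
            used += cost
    return [spans[i] for i in sorted(kept)]


def safe_select(spans, scores, budget, need_terms, step=2):
    """Widen the budget until every needed term is covered or all spans fit."""
    max_budget = sum(len(s.split()) for s in spans)
    while True:
        kept = select(spans, scores, budget)
        text = " ".join(kept).lower()
        missing = [t for t in need_terms if t.lower() not in text]
        if not missing or budget >= max_budget:
            return kept, budget, missing
        budget = min(budget + step, max_budget)


spans = ["The Louvre is in Paris.", "It opened in 1793.", "Tickets can be bought online."]
scores = [0.9, 0.4, 0.2]
kept, final_budget, missing = safe_select(spans, scores, budget=5, need_terms=["1793"])
print(kept, final_budget, missing)
```

Returning the final budget and any still-missing terms keeps the widening visible to the caller, so nothing is dropped or added silently.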

NEED-SLOTS COVERED (from the task)

SPANS — kept are verbatim & source-tagged; dim = dropped

How it works

raw context
  ─▶ [lossless structural codec] (JSON: factor repeated values, byte-reversible)
  ─▶ unitize + char-offset provenance
  ─▶ learned keep-scorer (LogReg on gold supporting_facts, CPU, ms)
  ─▶ budgeted selection + near-dup suppression (submodular diversity)
  ─▶ verbatim kept spans ── 0 generative tokens, full provenance
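The "unitize + char-offset provenance" step can be sketched as a sentence splitter that records where each unit came from. The regex below is deliberately naive (it will split after abbreviations such as "Dr.") and the function name is an assumption; the point is only that every unit is a verbatim slice of the input.

```python
import re


def unitize(text: str) -> list[tuple[int, int, str]]:
    """Split text into sentence-like units as (start, end, unit) with exact char offsets."""
    units = []
    for m in re.finditer(r"[^.!?]+[.!?]*", text):
        raw = m.group()
        stripped = raw.strip()
        if not stripped:
            continue
        start = m.start() + (len(raw) - len(raw.lstrip()))  # skip leading whitespace
        end = start + len(stripped)
        units.append((start, end, stripped))
    return units


text = "The Louvre is in Paris. It opened in 1793! Is it free?"
for start, end, unit in unitize(text):
    assert text[start:end] == unit  # provenance: each unit is a verbatim slice
    print(start, end, unit)
```

Because later stages only keep or drop whole units, the offsets recorded here are what make the final output traceable back to the source.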

What is learned: the per-span keep-scorer, on measured human labels. What is principled: rate–distortion selection under a token budget + the lossless codec (round-trip unit-tested). What is killed: abstractive “capsules”, hand-weighted scoring formulas, and LLM-judge metrics. The compressor touches the expensive model zero times per compression — strictly cheaper than the tokens it saves.