A context-compression layer for LLMs & agents. Two modes: a lossless structural codec for repetitive/structured input (JSON tool output, logs) — byte-exact — and a lossy extractive selector for prose (docs, search, conversations) that keeps the smallest set of original, verbatim spans under a token budget. No generative model in the compression path: it adds no fabricated content and is measured against gold labels, not an LLM judge.
A frozen Claude Haiku reader answers from the compressed context only; scored with HotpotQA official EM/token-F1 vs the gold answer. Same reader across all arms — a ruler, not a judge.
Fraction of human-labelled supporting facts kept, vs tokens sent. Distortion = facts dropped. Pure set comparison — no model call anywhere.
Held-out AUC at the unit level: predict “is this span a gold supporting fact”. A tiny CPU logistic regression trained on gold labels.
supporting_factsInject R near-identical copies of one relevant passage (multi-tool agents re-surface the same content). Near-dup suppression holds recall flat; ablating it crashes recall and wastes budget on copies.
For structured, repetitive LLM inputs (MCP/API tool output, logs) we factor repeated sub-trees and string values into shared definitions — provably lossless (decode(encode(x))==x, unit-tested). The compact "definitions + references" form is read natively by the LLM, so we cut the tokens *sent to the LLM* with no information loss and no accuracy drop.
The exact same HotpotQA-trained compressor — never retrained — run on three structurally different QA datasets. It compresses and preserves the majority of answer quality on all of them, always far above random. Honest caveat: out-of-domain, the learned scorer's edge over BM25 disappears (BM25 transfers as well or better) — the framework generalizes; the learned edge is in-domain.
Three scorers, identical except the training label, at budget 240 tokens. The contribution is whether C (reader-grounded) beats A (human) and B (importance) — with paired bootstrap CIs.
Compression saves the same tokens regardless of which model reads them, so $ scales with the reader's input price. CPU compressor → no frontier-model call to compress → savings are real, not circular.
What is learned: the per-span keep-scorer, on measured human labels. What is principled: rate–distortion selection under a token budget + the lossless codec (round-trip unit-tested). What is killed: abstractive “capsules”, hand-weighted scoring formulas, and LLM-judge metrics. The compressor touches the expensive model zero times per compression — strictly cheaper than the tokens it saves.