A compression solution for LLM context. ← back to the live demo
1. An LLM-native lossless codec. Compression that's provably lossless (a byte-exact test, not a self-grade) and that the model reads directly — so it reduces tokens sent to the LLM with zero information loss. Verified: reader answers the compact form as accurately as full JSON (EM 1.00 = 1.00 on guess-resistant questions). Why not gzip? gzip isn't LLM-readable → it reduces zero tokens sent.
2. A query-conditioned extractive compressor that beats the query-agnostic delete-only paradigm (LLMLingua-2 style) by +0.29 F1 (CI clears 0) on RAG/QA — generalizing unchanged across QA, conversations, and documents.
3. Intellectual honesty as a feature. We tested two genuinely novel ideas and report both as null results with their CIs and the real reason — the credibility differentiator for a sponsor that markets against self-graded compression.
We do not claim to have invented extractive compression, distillation, or dedup — those are known. The contribution is the specific combination, the LLM-native lossless guarantee, and a rigorous, honestly-reported benchmark.
| modality | input | result |
|---|---|---|
| structured JSON / tool output | lossless codec | 21–48% fewer tokens, 0 loss, reader EM = full |
| logs | lossless codec | ~54% fewer (vs pretty), byte-exact |
| multi-hop QA (HotpotQA) | extractive | 76% F1 retained @ 5.3× |
| conversations (CoQA) | extractive | 97% F1 retained, beats BM25 |
| documents (SQuAD / NarrativeQA) | extractive | 81% / 59% retained (narrative: learned scorer < BM25 — honest weak spot) |
| voice (Deepgram ASR transcript) | extractive | 3× fewer tokens, faster response, same answer |
| MCP tool output (real Context7 / Perplexity servers) | extractive | docs ~67% · search ~81% (verbatim spans, grounded — answer survives in the kept spans) · codec declines on unique filesystem/git output (shown) |
Null 1 — reader-grounded labels. Training the scorer on "removing this span changes the reader's answer" does not beat human labels (C−A within noise). Reason: 68% of gold supporting facts are droppable due to redundancy, not parametric memory (closed-book 7%) — dropping redundant-but-correct facts doesn't help at a fixed budget.
Null 2 — learned variable-rate budgeting. Oracle headroom is huge (+0.27 F1), but a learned per-example budget predictor doesn't beat fixed budget (all CIs include 0) — the headroom is reader stochasticity, not learnable structure.
It's one function. raw can be a string (prose / chat / transcript → extractive) or JSON (dict/list → lossless codec, auto-selected). You add one line before your model call:
JSON-only callers can use suffix.structural.render_compact(obj) directly; structural_report(obj)["beneficial"] says whether to — it's opt-in and declines when the reference table would cost more than it saves. Reusable today against the repo; a pip package and a literal MCP-server proxy are future work (not claimed).
The compressor calls no LLM — it's scikit-learn + string ops. The model only ever reads, with this exact neutral prompt, identical for the full and compressed arms (it never names the expected answer):
Check it yourself:
Deterministic facts (raw/kept tokens, reduction %) are 100% reproducible CPU output; the live reader's answer is the only model-dependent part (pinned to temperature 0). Use-case correctness uses lenient matching (exact OR substring OR token-F1≥0.6, applied identically to both arms); the benchmark and "prove it live" numbers use strict SQuAD EM/F1.
| reduces tokens sent to the LLM | nailed lossless codec + lossy 5.3× |
| preserves context for high-quality outputs | nailed lossless EM=full; lossy 76–97% F1; random craters |
| text · code(JSON) · conversations · documents · other | nailed 6 modalities incl. voice |
| model / algorithm / framework | nailed CPU scorer + selection + dedup + invertible codec |
| maintaining or improving downstream performance | nailed maintains; improves +0.29 F1 over query-agnostic (CI clears 0); compact form can be easier for the reader to parse than verbose JSON |
| model / application / system level | nailed live app demo + CPU-only system codec |