Eat Tokens Wisely — architecture & novelty

A compression solution for LLM context.

The bottleneck for agents isn't model intelligence — it's context bandwidth. Agents and RAG pipelines pour long, noisy, repetitive context into every LLM call: tool outputs, logs, documents, conversations, transcripts. That is expensive, slow, and often hurts answers. Eat Tokens Wisely is a two-mode codec that cuts the tokens sent to the LLM while preserving — and sometimes improving — answer quality, with guarantees you can check, not take on faith.

The architecture · two modes, matched to the input

① Lossless structural codec · ZERO LOSS

For structured / repetitive inputs: JSON tool output, logs, API responses.

repeated sub-trees + string values
↓
factor into shared definitions: {user}, {labels} defined once; issues reference them by id
↓
collision-safe sentinels: compact form (plaintext)
↓
the LLM reads it natively

decode(encode(x)) == x, byte-exact — unit-tested incl. 4,000 adversarial fuzz cases (0 failures). The compact "definitions + references" form is read natively by the reader, so fewer tokens are sent with no information lost.
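To make the "definitions + references" idea concrete, here is a minimal sketch of a round-trip codec in that spirit. It is not the repo's implementation: the names sketch_encode / sketch_decode, the "~r<id>" reference sentinel, the "~" escape rule, and the MIN_STR threshold are all assumptions chosen for illustration, and the counting is deliberately naive.

```python
import json
from collections import Counter

MIN_STR = 8  # hypothetical threshold: only factor repeated strings this long


def _key(node):
    # Order-preserving canonical form, so dict key order survives the round trip.
    return json.dumps(node, ensure_ascii=False)


def _factorable(node):
    if isinstance(node, str):
        return len(node) >= MIN_STR
    return isinstance(node, (dict, list)) and len(node) > 0


def _count(node, counts):
    # Count every factorable sub-tree and string value (naive: nested repeats
    # are counted even when their parent is factored too).
    if _factorable(node):
        counts[_key(node)] += 1
    if isinstance(node, dict):
        for child in node.values():
            _count(child, counts)
    elif isinstance(node, list):
        for child in node:
            _count(child, counts)


def _escape(s):
    # Literal strings starting with "~" gain one extra "~", so they can never
    # be mistaken for a "~r<id>" reference (the collision-safety rule here).
    return "~" + s if s.startswith("~") else s


def sketch_encode(obj):
    counts = Counter()
    _count(obj, counts)
    ids, defs = {}, []

    def body(node):
        if isinstance(node, str):
            return _escape(node)
        if isinstance(node, dict):
            return {k: enc(v) for k, v in node.items()}
        if isinstance(node, list):
            return [enc(v) for v in node]
        return node

    def enc(node):
        if _factorable(node) and counts[_key(node)] > 1:
            k = _key(node)
            if k not in ids:
                ids[k] = len(defs)
                defs.append(None)  # reserve the id before encoding children
                defs[ids[k]] = body(node)
            return f"~r{ids[k]}"
        return body(node)

    encoded = enc(obj)
    return json.dumps({"defs": defs, "body": encoded}, ensure_ascii=False)


def sketch_decode(text):
    packed = json.loads(text)
    defs = packed["defs"]

    def dec(node):
        if isinstance(node, str):
            if node.startswith("~~"):
                return node[1:]  # escaped literal
            if node.startswith("~r"):
                return dec(defs[int(node[2:])])  # reference
            return node
        if isinstance(node, dict):
            return {k: dec(v) for k, v in node.items()}
        if isinstance(node, list):
            return [dec(v) for v in node]
        return node

    return dec(packed["body"])


# Repeated {user} and {labels} sub-trees, plus strings that look like sentinels.
user = {"login": "octocat", "url": "https://example.com/users/octocat"}
issues = [{"id": i, "user": user, "labels": ["bug", "~r0", "~~tricky"]}
          for i in range(5)]
packed = sketch_encode(issues)
print(len(json.dumps(issues)), "->", len(packed), "characters")
print("round trip ok:", sketch_decode(packed) == issues)
```

The round-trip check compares serialized JSON as well as objects, which is the property the byte-exact claim rests on; character counts are only a rough proxy for tokens.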

② Lossy extractive selector · VERBATIM

For prose: QA context, documents, conversations, transcripts.

unitize into spans (+provenance)
↓
learned keep-scorer (CPU LogReg): score each span vs the task
↓
budgeted selection + near-dup suppression: keep smallest set under a token budget
↓
verbatim spans only

Output is a verbatim subsequence of the source (no fabrication in the compression step — unit-tested). A token-budget knob slides along a rate–distortion curve. No LLM in the compression path (milliseconds, CPU) — so savings are real, not circular.
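The selection steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the shipped selector: a plain term-overlap score stands in for the learned LogReg keep-scorer, whitespace word counts stand in for real token counts, and select_spans, the sentence regex, and the 0.8 Jaccard threshold are hypothetical.

```python
import re


def words(s):
    return re.findall(r"\w+", s.lower())


def unitize(text):
    # Sentence-like spans with (start, end) character offsets as provenance.
    return [(m.start(), m.end(), m.group())
            for m in re.finditer(r"\S[^.!?\n]*[.!?]?", text)]


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_spans(task, text, budget, dup_threshold=0.8):
    task_terms = set(words(task))

    def score(span):
        # Stand-in scorer: number of task terms the span shares.
        return len(task_terms & set(words(span[2])))

    # sorted() is stable, so ties keep source order.
    ranked = sorted(unitize(text), key=lambda s: -score(s))
    kept, used = [], 0
    for span in ranked:
        cost = len(span[2].split())  # word count as a token proxy
        terms = set(words(span[2]))
        if score(span) == 0 or used + cost > budget:
            continue
        if any(jaccard(terms, set(words(k[2]))) >= dup_threshold for k in kept):
            continue  # near-duplicate of something already kept
        kept.append(span)
        used += cost
    return sorted(kept)  # back to source order: a verbatim subsequence


doc = ("The Eiffel Tower is in Paris. It was completed in 1889. "
       "The Eiffel Tower is in Paris. Bananas are yellow.")
kept = select_spans("When was the Eiffel Tower completed?", doc, budget=12)
for start, end, span in kept:
    print(f"[{start}:{end}] {span}")
```

Every kept span is a slice of the source at its recorded offsets, so the verbatim property can be checked mechanically rather than trusted.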

Coverage-confidence + safe fallback: every result reports how much of the question's key terms the kept spans cover — a silent drop becomes a signal. In safe mode the budget auto-widens until they're covered rather than losing the answer.
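A rough sketch of how coverage reporting and the safe-mode fallback could fit together. Everything named here is an assumption for illustration: the tiny stopword list, the greedy stand-in selector, the 1.5x budget growth, and the function names are not taken from the repo.

```python
import re

# Tiny illustrative stopword list; a real one would be longer.
STOPWORDS = {"a", "an", "the", "is", "was", "of", "in", "on", "to",
             "what", "when", "who", "which", "did", "does", "how"}


def key_terms(task):
    return {w for w in re.findall(r"\w+", task.lower()) if w not in STOPWORDS}


def coverage(task, kept_spans):
    # Fraction of the question's key terms that appear in the kept spans.
    terms = key_terms(task)
    if not terms:
        return 1.0
    seen = set(re.findall(r"\w+", " ".join(kept_spans).lower()))
    return len(terms & seen) / len(terms)


def greedy_select(task, spans, budget):
    # Stand-in selector: highest key-term overlap first, word count as cost.
    terms = key_terms(task)
    order = sorted(
        range(len(spans)),
        key=lambda i: -len(terms & set(re.findall(r"\w+", spans[i].lower()))),
    )
    kept, used = [], 0
    for i in order:
        cost = len(spans[i].split())
        if used + cost <= budget:
            kept.append(i)
            used += cost
    return [spans[i] for i in sorted(kept)]


def compress_safe(task, spans, budget, target=1.0, growth=1.5):
    # Widen the budget until the key terms are covered or everything is kept.
    total = sum(len(s.split()) for s in spans)
    while True:
        kept = greedy_select(task, spans, budget)
        cov = coverage(task, kept)
        if cov >= target or budget >= total:
            return {"kept": kept, "coverage": cov, "budget": budget}
        budget = min(total, int(budget * growth) + 1)


spans = ["Marie Curie was born in Warsaw.",
         "She was awarded the Nobel Prize in Physics in 1903.",
         "The weather was mild."]
task = "When was Curie awarded the Nobel Prize?"
strict = greedy_select(task, spans, budget=10)
print("strict coverage:", coverage(task, strict))
print("safe mode:", compress_safe(task, spans, budget=10))
```

In the strict run the span naming Curie is dropped and coverage falls below 1, which is the signal; safe mode widens the budget until that span fits.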

Optional semantic rerank: the scorer is mostly lexical, so it can miss an answer worded differently from the task. A small embedding model (generation-free, still no LLM) blends in semantic similarity to close that gap — off by default, toggle it live in the demo's compressor.
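One plausible way to blend the two signals is a weighted sum of the lexical score and an embedding cosine similarity. The sketch below assumes the vectors are already computed by some small embedding model; the alpha weight, the function names, and the toy 2-d vectors are hypothetical and only show the mechanics.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def blended_scores(lexical, task_vec, span_vecs, alpha=0.5):
    # alpha = 0 reproduces the lexical-only ranking (rerank off).
    return [(1 - alpha) * lex + alpha * cosine(task_vec, vec)
            for lex, vec in zip(lexical, span_vecs)]


# Span 0 shares words with the task; span 1 is a paraphrase with no shared
# words but a similar embedding (toy vectors standing in for a real model).
lexical = [0.6, 0.0]
task_vec = [1.0, 0.0]
span_vecs = [[0.0, 1.0], [1.0, 0.0]]

off = blended_scores(lexical, task_vec, span_vecs, alpha=0.0)
on = blended_scores(lexical, task_vec, span_vecs, alpha=0.5)
print("rerank off, best span:", off.index(max(off)))
print("rerank on, best span:", on.index(max(on)))
```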

What's novel (and what isn't — we're precise)

1. An LLM-native lossless codec. Compression that is verifiably lossless (a byte-exact test, not a self-grade) and that the model reads directly, so it reduces tokens sent to the LLM with zero information loss. Verified: the reader answers from the compact form as accurately as from full JSON (EM 1.00 vs 1.00 on guess-resistant questions). Why not gzip? gzip output isn't LLM-readable, so it saves no tokens sent to the model.

2. A query-conditioned extractive compressor that beats the query-agnostic delete-only paradigm (LLMLingua-2 style) by +0.29 F1 (CI clears 0) on RAG/QA — generalizing unchanged across QA, conversations, and documents.

3. Intellectual honesty as a feature. We tested two genuinely novel ideas and report both as null results with their CIs and the real reason — the credibility differentiator for a sponsor that markets against self-graded compression.

We do not claim to have invented extractive compression, distillation, or dedup — those are known. The contribution is the specific combination, the LLM-native lossless guarantee, and a rigorous, honestly-reported benchmark.

Results · every number measured (gold labels, frozen reader, CIs)

Generalization across the challenge's modalities

modality	mode	result
structured JSON / tool output	lossless codec	21–48% fewer tokens, 0 loss, reader EM = full
logs	lossless codec	~54% fewer (vs pretty), byte-exact
multi-hop QA (HotpotQA)	extractive	76% F1 retained @ 5.3×
conversations (CoQA)	extractive	97% F1 retained, beats BM25
documents (SQuAD / NarrativeQA)	extractive	81% / 59% retained (narrative: learned scorer < BM25 — honest weak spot)
voice (Deepgram ASR transcript)	extractive	3× fewer tokens, faster response, same answer
MCP tool output (real Context7 / Perplexity servers)	extractive	docs ~67% · search ~81% (verbatim spans, grounded — answer survives in the kept spans) · codec declines on unique filesystem/git output (shown)

The extractive framework generalizes (it always beats random selection and preserves most of the answer quality); the learned scorer's edge over BM25 holds only in-domain, so use BM25 cross-domain.

What we honestly report didn't work

Null 1 — reader-grounded labels. Training the scorer on "removing this span changes the reader's answer" does not beat human labels (C−A within noise). Reason: 68% of gold supporting facts are droppable due to redundancy, not parametric memory (closed-book 7%) — dropping redundant-but-correct facts doesn't help at a fixed budget.

Null 2: learned variable-rate budgeting. Oracle headroom is huge (+0.27 F1), but a learned per-example budget predictor doesn't beat a fixed budget (all CIs include 0). The headroom comes from reader stochasticity, not learnable structure.
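Both null results, and the "+0.29 F1 (CI clears 0)" claim, turn on whether a confidence interval for a paired F1 difference includes 0. The page does not say how its CIs were computed; the percentile paired bootstrap below is one common choice, shown as an assumption, with made-up per-example scores.

```python
import random


def paired_bootstrap_ci(a, b, n_boot=2000, level=0.95, seed=0):
    """Mean of (a - b) per example, with a percentile bootstrap interval."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # Resample examples with replacement and record each resample's mean.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    lo = means[int((1 - level) / 2 * n_boot)]
    hi = means[int((1 + level) / 2 * n_boot) - 1]
    return sum(diffs) / n, lo, hi


# Illustrative per-example F1 scores (not the benchmark's data).
clear_win = paired_bootstrap_ci([0.8, 0.9, 0.7, 0.85] * 10, [0.5, 0.6, 0.5, 0.5] * 10)
null_case = paired_bootstrap_ci([0.7, 0.3, 0.6, 0.4] * 10, [0.5] * 40)

for name, (mean, lo, hi) in [("clear win", clear_win), ("null", null_case)]:
    verdict = "CI clears 0" if lo > 0 or hi < 0 else "CI includes 0"
    print(f"{name}: mean diff {mean:+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}] -> {verdict}")
```

Pairing matters: both arms are scored on the same examples, so resampling the per-example differences removes between-example variance that would otherwise widen the interval.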

Use it as a drop-in compression layer

It's one function. raw can be a string (prose / chat / transcript → extractive) or JSON (dict/list → lossless codec, auto-selected). You add one line before your model call:

    import anthropic
    from suffix.pipeline import compress

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    res = compress(task=user_question, raw=tool_output_or_docs, budget=240)
    context = res["compressed_text"]  # verbatim spans (prose) OR compact codec form (JSON)

    client.messages.create(
        model="claude-...",
        max_tokens=256,  # required by the Messages API; value chosen for illustration
        messages=[{"role": "user", "content": f"{context}\n\n{user_question}"}],
    )
    print(res["reduction_pct"], "% fewer tokens,", res["compression_x"], "x")

JSON-only callers can use suffix.structural.render_compact(obj) directly; structural_report(obj)["beneficial"] says whether to — it's opt-in and declines when the reference table would cost more than it saves. Reusable today against the repo; a pip package and a literal MCP-server proxy are future work (not claimed).
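For JSON-only callers, the opt-in check can be wrapped in a small helper. This sketch takes render_compact and structural_report as parameters so it runs without the repo; it assumes render_compact returns a string and that falling back to plain JSON is the desired behavior when the codec declines. The helper name json_context is hypothetical.

```python
import json


def json_context(obj, render_compact, structural_report):
    """Use the compact codec form only when the report says it pays off."""
    if structural_report(obj)["beneficial"]:
        return render_compact(obj)
    # Codec declined (reference table would cost more than it saves).
    return json.dumps(obj, ensure_ascii=False)


# Intended wiring against the repo (not executed here):
#   from suffix.structural import render_compact, structural_report
#   context = json_context(tool_output, render_compact, structural_report)

# Stand-in callables so the sketch is runnable on its own.
demo = json_context({"a": 1}, lambda o: "COMPACT", lambda o: {"beneficial": False})
print(demo)
```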

Not a prompt trick — verify it in under 2 minutes

The compressor calls no LLM — it's scikit-learn + string ops. The model only ever reads, with this exact neutral prompt, identical for the full and compressed arms (it never names the expected answer):

    SYSTEM: You answer questions from the provided context only. Reply with the shortest exact answer (a name, phrase, number, or yes/no). No explanation.
    USER: Context:\n{context}\n\nQuestion: {question}\nAnswer:

Check it yourself:

    # 1. lossless is math, not a model (no API key, no server, <5s):
    python -c "import json,sys; sys.path.insert(0,'.'); from suffix.structural import encode,decode; \
    o=json.load(open('data/usecases.json'))[3]['json']; print('byte-exact:', decode(encode(o))==o)"

    # 2. the unit tests prove it (incl. adversarial sentinel-collision inputs):
    python -m pytest tests/ -q   # 6 passed

    # 3. compression runs with the model key UNSET, so the model isn't in the loop:
    unset ANTHROPIC_API_KEY ; curl -s localhost:8000/api/compress -d '{"task":"q","text":"<long text>"}'

Deterministic facts (raw/kept tokens, reduction %) are 100% reproducible CPU output; the live reader's answer is the only model-dependent part (pinned to temperature 0). Use-case correctness uses lenient matching (exact OR substring OR token-F1≥0.6, applied identically to both arms); the benchmark and "prove it live" numbers use strict SQuAD EM/F1.
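The two matching regimes can be written down precisely. The sketch below follows the standard SQuAD normalization (lowercase, strip punctuation, drop articles, collapse whitespace) for strict EM/F1, and adds the lenient rule on top. The substring direction (gold answer found inside the prediction) is an assumption, since the page does not specify it, and the function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(s):
    # SQuAD-style: lowercase, strip punctuation, drop articles, fix whitespace.
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(pred, gold):
    return normalize(pred) == normalize(gold)


def token_f1(pred, gold):
    p, g = normalize(pred).split(), normalize(gold).split()
    if not p or not g:
        return float(p == g)  # both empty counts as a match
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)


def lenient_match(pred, gold, f1_threshold=0.6):
    # exact OR substring OR token-F1 >= threshold, same rule for both arms.
    g = normalize(gold)
    return (exact_match(pred, gold)
            or (bool(g) and g in normalize(pred))
            or token_f1(pred, gold) >= f1_threshold)


print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(round(token_f1("Paris France", "Paris"), 3))
print(lenient_match("It was in 1889, I think", "1889"))
```

Strict EM rejects the third example while the lenient rule accepts it, which is why the page keeps the two sets of numbers separate.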

Challenge-clause coverage

clause	status	evidence
reduces tokens sent to the LLM	nailed	lossless codec + lossy 5.3×
preserves context for high-quality outputs	nailed	lossless EM=full; lossy 76–97% F1; random craters
text · code(JSON) · conversations · documents · other	nailed	6 modalities incl. voice
model / algorithm / framework	nailed	CPU scorer + selection + dedup + invertible codec
maintaining or improving downstream performance	nailed	maintains; improves +0.29 F1 over query-agnostic (CI clears 0); compact form can be easier for the reader to parse than verbose JSON
model / application / system level	nailed	live app demo + CPU-only system codec
