Live retrieval / Workers AI / Same model both runs

The proof, run live.

Pick a question. Both runs hit the same Workers AI model with the same corpus. One stuffs all 12 cards into the prompt. The other retrieves the top 3 cards by cosine similarity. We capture token counts, latency, and cost in real time. No mock numbers.
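The retrieval run described above can be sketched like this. This is a minimal illustration, not the demo's actual code: `Card`, `cosine`, and `topK` are assumed names, and each card is assumed to already carry its cached embedding vector.

```typescript
// Illustrative sketch: rank cards by cosine similarity to the question
// embedding and keep the top 3. `Card` and `topK` are hypothetical names.
interface Card {
  id: string;
  embedding: number[];
}

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Score every card against the question embedding, sort descending,
// and return the k best matches (k = 3 in the demo).
function topK(question: number[], cards: Card[], k = 3): Card[] {
  return [...cards]
    .sort((x, y) => cosine(question, y.embedding) - cosine(question, x.embedding))
    .slice(0, k);
}
```

Only the selected cards go into the prompt, which is where the token savings come from: the full-context run pays for all 12 cards on every question.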

Step 1 / pick a question

Or run with your own question

Methodology

  • Corpus: 12 cards, ~3,000 tokens, source: why-ai-agents-need-pre-chunking.json.
  • Embeddings: @cf/baai/bge-base-en-v1.5. Cosine similarity. Cards embedded once + cached in KV.
  • Text generation: @cf/google/gemma-4-26b-a4b-it. Falls back to @cf/meta/llama-3.1-8b-instruct-fast if unavailable. Same model on both runs.
  • Cost band: $0.30/M input + $0.60/M output tokens (Workers AI Paid plan, public estimate).
  • Caching: results cached 7 days per question in Workers KV. Add ?refresh=1 to the API to force a fresh run.
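The cost figures follow directly from the token counts and the stated band. A sketch, using the $0.30/M input and $0.60/M output rates from the methodology; `estimateCostUSD` is an illustrative name, not an API from Workers AI:

```typescript
// Public-estimate rates from the methodology above (Workers AI Paid plan).
const INPUT_USD_PER_M = 0.30;   // $ per million input tokens
const OUTPUT_USD_PER_M = 0.60;  // $ per million output tokens

// Hypothetical helper: turn the captured token counts into a dollar figure.
function estimateCostUSD(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1_000_000) * INPUT_USD_PER_M
       + (outputTokens / 1_000_000) * OUTPUT_USD_PER_M;
}
```

For example, a full-context run that sends the whole ~3,000-token corpus as input costs about $0.0009 before output tokens, while a 3-card retrieval run sends roughly a quarter of that.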

Three doors / pick one

Saw the savings? Want them on your stack?

Drop your email. We send the whitepaper, schedule a 30-min call, and run a free CARD-readiness audit. The audit is just an audit. No sales theatre.

Book a 30-min call instead

No spam. We use your email to send the whitepaper, schedule the call, and follow up on the audit. That is it.