Live retrieval / Workers AI / Same model both runs
The proof, run live.
Pick a question. Both runs hit the same Workers AI model with the same corpus. One stuffs all 12 cards into the prompt. The other retrieves the top 3 cards by cosine similarity. We capture token counts, latency, and cost in real time. No mock numbers.
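The retrieval step on the Card Network side is plain cosine-similarity top-k over precomputed card embeddings. A minimal sketch, assuming a `Card` shape and embedding vectors that stand in for the real corpus (names are illustrative, not the worker's actual code):

```typescript
interface Card {
  id: string;
  text: string;
  embedding: number[]; // precomputed once with the embedding model, then cached
}

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank every card against the question embedding, keep the top k.
function topK(cards: Card[], query: number[], k = 3): Card[] {
  return [...cards]
    .sort((x, y) => cosine(y.embedding, query) - cosine(x.embedding, query))
    .slice(0, k);
}
```

The traditional run skips `topK` and concatenates all 12 cards; the Card Network run prompts with only the 3 winners, which is where the token delta in the table comes from.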
Step 1 / pick a question
Or run with your own question
First run pays ~2-4 seconds. Subsequent loads hit the KV cache.
Workers AI error
| Metric | Traditional RAG | Card Network | Delta |
|---|---|---|---|
| Input tokens | -- | -- | -- |
| Output tokens | -- | -- | -- |
| Total latency | -- | -- | -- |
| Cost (est., USD) | -- | -- | -- |
| Cards in context | -- | -- | -- |
Traditional RAG answer
Model:
Card Network answer
Model:
Both answers come from the same Workers AI model, the same article corpus, and the same question. Card Network retrieval used --% fewer input tokens. That savings compounds across millions of agent queries. It is the mechanical heart of the support-burn scenario, the AI-bill-spike scenario, and every Magic Sprint we ship.
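The cost row is simple arithmetic over the rate band quoted in the methodology ($0.30 per million input tokens, $0.60 per million output tokens). A sketch of that estimate; the example token counts are illustrative, not measured:

```typescript
// Workers AI Paid plan band used on this page (public estimate).
const INPUT_USD_PER_M = 0.30;  // USD per 1M input tokens
const OUTPUT_USD_PER_M = 0.60; // USD per 1M output tokens

function estimateCostUSD(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1_000_000) * INPUT_USD_PER_M
       + (outputTokens / 1_000_000) * OUTPUT_USD_PER_M;
}

// Illustrative: a full-corpus prompt (~3,000 input tokens) vs a top-3 prompt.
const fullRun = estimateCostUSD(3_000, 500);
const cardRun = estimateCostUSD(800, 500);
```

Per query the absolute numbers are tiny; the point of the table is the ratio, which holds at any volume.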
Methodology
- Corpus: 12 cards, ~3,000 tokens. Source: why-ai-agents-need-pre-chunking.json.
- Embeddings: @cf/baai/bge-base-en-v1.5, cosine similarity. Cards are embedded once and cached in KV.
- Text generation: @cf/google/gemma-4-26b-a4b-it, falling back to @cf/meta/llama-3.1-8b-instruct-fast if unavailable. Same model on both runs.
- Cost band: $0.30/M input + $0.60/M output tokens (Workers AI Paid plan, public estimate).
- Caching: results cached for 7 days per question in Workers KV. Add ?refresh=1 to the API call to force a fresh run.
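The per-question caching can be sketched as a read-through cache over Workers KV. This is a sketch under stated assumptions: `RESULTS` is a hypothetical KV namespace binding (the real worker's binding name may differ), and the minimal `KVNamespace` interface below mirrors only the `get`/`put` calls used here:

```typescript
// Minimal slice of the Workers KV binding surface used below.
interface KVNamespace {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, opts?: { expirationTtl?: number }): Promise<void>;
}

interface Env {
  RESULTS: KVNamespace; // hypothetical binding name
}

const SEVEN_DAYS = 7 * 24 * 60 * 60; // expirationTtl is in seconds

async function cachedRun(
  env: Env,
  question: string,
  refresh: boolean, // true when the request carries ?refresh=1
  run: (q: string) => Promise<string>,
): Promise<string> {
  const key = `result:${question}`;
  if (!refresh) {
    const hit = await env.RESULTS.get(key);
    if (hit !== null) return hit; // cache hit: skip the model entirely
  }
  const fresh = await run(question); // first run pays the model latency
  await env.RESULTS.put(key, fresh, { expirationTtl: SEVEN_DAYS });
  return fresh;
}
```

This is why only the first run of a question pays the ~2-4 second model round trip.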
Three doors / pick one
Saw the savings? Want them on your stack?
Drop your email. We send the whitepaper, schedule a 30-min call, and run a free CARD-readiness audit. The audit is the audit. No sales theatre.
No spam. We use your email to send the whitepaper, schedule the call, and follow up on the audit. That is it.