Live retrieval / Workers AI / Same model both runs

The proof, run live.

Pick a question. Both runs hit the same Workers AI model with the same corpus. One stuffs all 12 cards into the prompt. The other retrieves the top 3 cards by cosine similarity. We capture token counts, latency, and cost in real time. No mock numbers.
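The retrieval run described above can be sketched like this. This is a minimal illustration, not the demo's actual code: `Card`, `cosine`, and `topK` are assumed names, and each card is assumed to already carry its cached embedding vector.

```typescript
// Illustrative sketch: rank cards by cosine similarity to the question
// embedding and keep the top 3. `Card` and `topK` are hypothetical names.
interface Card {
  id: string;
  embedding: number[];
}

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Score every card against the question embedding, sort descending,
// and return the k best matches (k = 3 in the demo).
function topK(question: number[], cards: Card[], k = 3): Card[] {
  return [...cards]
    .sort((x, y) => cosine(question, y.embedding) - cosine(question, x.embedding))
    .slice(0, k);
}
```

Only the selected cards go into the prompt, which is where the token savings come from: the full-context run pays for all 12 cards on every question.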

Step 1 / pick a question

Or run with your own question

Methodology

  • Corpus: 12 cards, ~3,000 tokens, source: why-ai-agents-need-pre-chunking.json.
  • Embeddings: @cf/baai/bge-base-en-v1.5. Cosine similarity. Cards embedded once + cached in KV.
  • Text generation: @cf/google/gemma-4-26b-a4b-it. Falls back to @cf/meta/llama-3.1-8b-instruct-fast if unavailable. Same model on both runs.
  • Cost band: $0.30/M input + $0.60/M output tokens (Workers AI Paid plan, public estimate).
  • Caching: results cached 7 days per question in Workers KV. Add ?refresh=1 to the API to force a fresh run.
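The cost figures follow directly from the token counts and the stated band. A sketch, using the $0.30/M input and $0.60/M output rates from the methodology; `estimateCostUSD` is an illustrative name, not an API from Workers AI:

```typescript
// Public-estimate rates from the methodology above (Workers AI Paid plan).
const INPUT_USD_PER_M = 0.30;   // $ per million input tokens
const OUTPUT_USD_PER_M = 0.60;  // $ per million output tokens

// Hypothetical helper: turn the captured token counts into a dollar figure.
function estimateCostUSD(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1_000_000) * INPUT_USD_PER_M
       + (outputTokens / 1_000_000) * OUTPUT_USD_PER_M;
}
```

For example, a full-context run that sends the whole ~3,000-token corpus as input costs about $0.0009 before output tokens, while a 3-card retrieval run sends roughly a quarter of that.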

Three doors / pick one

Saw the savings? Want them on your stack?

Drop your email. We send the whitepaper, schedule a 30-min call, and run a free CARD-readiness audit. The audit is just an audit. No sales theatre.

Book a 30-min call instead

No spam. We use your email to send the whitepaper, schedule the call, and follow up on the audit. That is it.