Hey,

We run 14 AI agents on a single RTX 3090 — no cloud API, no subscription. One GPU from 2020.

The setup: two models loaded simultaneously on one card.

  • Qwen3.5-9B (thinking ON) — orchestration, compliance, code architecture, SEO writing. 7 agents that need deep reasoning.

  • Qwen3.5-2B (thinking OFF) — trend scouting, social posts, voice workflows, quick patches. 7 agents that need raw speed at 210 tok/s.

Total VRAM used: 14.3 GB out of 24. Still 40% headroom.
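A setup like this can be sketched with two `llama-server` instances sharing the card. This is a minimal illustration, not the author's actual script: the model filenames, quantization, ports, and context sizes here are assumptions.

```shell
#!/bin/sh
# Hypothetical launch script: two llama-server (llama.cpp) instances on one GPU.
# Model paths, ports, and context sizes are illustrative assumptions.

# Reasoning model: all layers offloaded to the GPU (-ngl 99), larger context
# for the seven deep-reasoning agents.
llama-server -m qwen-9b-q4.gguf -ngl 99 -c 16384 --port 8080 &

# Fast model: also fully on the GPU, smaller context, serves the
# speed-focused agents.
llama-server -m qwen-2b-q4.gguf -ngl 99 -c 8192 --port 8081 &

wait
```

Agents then talk to port 8080 or 8081 depending on whether they need depth or speed; both instances draw from the same 24 GB of VRAM.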

The secret? KV cache reuse: the system prompt is processed once and its KV cache is saved, so every subsequent request skips straight to the new tokens. Response time drops from 6 minutes to 52 milliseconds, roughly 6,667× faster.
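With llama.cpp's server, this behavior comes from the `cache_prompt` field on the `/completion` endpoint: when the new prompt shares a prefix with the previous one, the server reuses the cached KV state and only evaluates the tokens that changed. A hedged sketch (the port and prompt text are assumptions, and this needs a running llama-server):

```shell
# Hypothetical request: with "cache_prompt": true, llama-server keeps the KV
# cache between requests, so a repeated system-prompt prefix is not
# re-processed; only the new tokens at the end are evaluated.
curl -s http://localhost:8080/completion -d '{
  "prompt": "<long shared system prompt>\nUser: draft a tweet about local AI.",
  "cache_prompt": true,
  "n_predict": 64
}'
```

Because every agent on a given model shares the same long system prompt, almost all of the prefill work happens exactly once.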

The full architecture — which agents go where, how thinking budgets work per request, and the exact VRAM breakdown — is in the free PDF:

Want the implementation code? Patched llama.cpp source, server launch scripts, and thinking budget enforcement:

Local AI. Real hardware. No cloud.

— Andrew Darius

P.S. — The PDF shows WHAT we built. The Academy has HOW — every patch, every config flag, every optimization explained step by step.
