Hey
Everyone says their model is "state of the art." We decided to test that.
We built ClawEval — an open-source benchmark that puts LLMs through 59 real agent jobs. Not trivia. Not chat. Actual work:
Route support tickets
Review code and find bugs
Analyze financial data
Draft legal documents
Plan sprints
Score leads
…and 53 more
Every test has an exact expected answer. No "vibes." No LLM-as-judge. Either the model nails it or it doesn't.
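To make "exact expected answer" concrete, here is a minimal sketch of what exact-match scoring can look like. This is not ClawEval's actual grader: the function names, the sample ticket labels, and the normalization step (trimming whitespace and lowercasing) are all assumptions made for illustration.

```python
def normalize(answer: str) -> str:
    """Collapse whitespace and lowercase (an assumed normalization, not ClawEval's)."""
    return " ".join(answer.split()).lower()


def score_role(outputs: list[str], expected: list[str]) -> int:
    """Count outputs that equal their expected answer after normalization."""
    if len(outputs) != len(expected):
        raise ValueError("outputs and expected must have the same length")
    return sum(normalize(o) == normalize(e) for o, e in zip(outputs, expected))


# Hypothetical ticket-routing cases: the third output is simply wrong.
expected = ["billing", "technical", "billing"]
outputs = ["Billing", "technical ", "sales"]

print(f"{score_role(outputs, expected)}/{len(expected)}")  # prints 2/3
```

There is no partial credit and no second model grading the first one. A wrong label is a miss, however plausible it sounds.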
What we found
A quantized 35B model running on a single 24GB GPU (an RTX 3090 — we got ours for $799) scored 10/10 on 37 out of 59 agent roles — including code generation, legal review, financial analysis, and security auditing.
That's not a typo. A $799 GPU running a free, open-source model, matching or beating what many people pay cloud APIs hundreds of dollars a month for.
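If you are wondering how a 35B model fits in 24GB at all, the rough arithmetic is below. It is a back-of-the-envelope sketch, not a measurement: the bit widths are assumptions (this dispatch does not state which quantization we ran), and the estimate covers the weights only. The KV cache and runtime overhead need extra room on top.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 weights * (bits / 8) bytes each, divided by 1e9 bytes per GB
    return params_billion * bits_per_weight / 8


VRAM_GB = 24  # the RTX 3090 from the text

for bits in (16, 8, 4):  # assumed precisions, for comparison only
    gb = weight_memory_gb(35, bits)
    verdict = "fits" if gb < VRAM_GB else "does not fit"
    print(f"{bits}-bit: {gb:.1f} GB of weights, {verdict} in {VRAM_GB} GB")
```

Under these assumptions, only the 4-bit case (about 17.5 GB of weights) leaves headroom on a 24GB card.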
Some roles where even small models crush it:
✅ Router / Triage — 10/10
✅ Code Generation — 10/10
✅ Fact-Checking — 10/10
✅ Customer Support — 10/10
✅ Legal Document Review — 10/10
✅ Financial Analysis — 10/10
And roles where you still need bigger models or thinking mode:
🔴 Calendar scheduling — 0/10 (timezone math is hard)
🔴 Data Analysis — 3/10 (needs reasoning)
🔴 Math / Logic — 4/10 (needs thinking enabled)
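Why is timezone math hard? Here is one small illustration using Python's standard zoneinfo module. The meeting, the date, and the helper name are hypothetical and are not taken from the ClawEval test suite. The catch is that the US and the UK switch to daylight saving time on different dates, so the usual five-hour gap between New York and London shrinks to four for a few weeks each spring.

```python
from datetime import datetime
from zoneinfo import ZoneInfo


def convert(local_time: datetime, target_zone: str) -> datetime:
    """Convert a timezone-aware datetime to another IANA timezone."""
    return local_time.astimezone(ZoneInfo(target_zone))


# Hypothetical meeting: 09:00 in New York on 11 March 2024.
# US daylight saving began on 10 March 2024; UK summer time began on 31 March 2024.
ny_meeting = datetime(2024, 3, 11, 9, 0, tzinfo=ZoneInfo("America/New_York"))
london_meeting = convert(ny_meeting, "Europe/London")

print(london_meeting.strftime("%Y-%m-%d %H:%M"))  # prints 2024-03-11 13:00
```

A model that assumes "London is five hours ahead" books this meeting for 14:00 and gets it wrong, and with exact-match scoring a near miss still counts as a miss.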
Full results — every role, every model, side by side
We published the complete per-role breakdown for all 59 agent roles across multiple models and configurations.
The repo includes:
VRAM guides for 16GB, 24GB, 32GB, 48GB, 64GB, and 96GB GPUs
The complete test suite so you can run it on your own models
Side-by-side comparisons of thinking vs non-thinking modes
Join the community
Want to discuss results, share your own benchmarks, or get help setting up local AI agents for your business?
The community is free and built for freelancers, solopreneurs, and entrepreneurs who are putting AI agents to work, not just talking about them. Come share what's working for you.
More dispatches coming soon — including which sub-agent tasks you can hand off to tiny models, and what actually needs a big one.
Andrew Darius, Founder, AIgenteur

