Hey

Everyone says their model is "state of the art." We decided to test that.

We built ClawEval — an open-source benchmark that puts LLMs through 59 real agent jobs. Not trivia. Not chat. Actual work:

  • Route support tickets

  • Review code and find bugs

  • Analyze financial data

  • Draft legal documents

  • Plan sprints

  • Score leads

  • …and 53 more

Every test has an exact expected answer. No "vibes." No LLM-as-judge. Either the model nails it or it doesn't.
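To make "exact expected answer" concrete, here is a minimal sketch of what pass/fail grading of this kind can look like. It is not ClawEval's actual grader: the function names and the whitespace/case normalization are assumptions made for illustration.

```python
def grade_exact(model_output: str, expected: str) -> bool:
    """Pass only if the output equals the expected answer.

    Assumption: we ignore case and extra whitespace. The real
    benchmark may normalize differently, or not at all.
    """
    def normalize(text: str) -> str:
        return " ".join(text.strip().lower().split())

    return normalize(model_output) == normalize(expected)


def score_role(outputs: list[str], expected_answers: list[str]) -> int:
    """Count how many test cases in a role were answered exactly right."""
    return sum(grade_exact(o, e) for o, e in zip(outputs, expected_answers))


# Hypothetical ticket-routing cases: two exact hits, one miss.
outputs = ["  Billing ", "technical", "billing team"]
expected = ["billing", "Technical", "billing"]
print(f"{score_role(outputs, expected)}/{len(expected)}")
```

A near miss like "billing team" gets no partial credit, which is the point: no judge model decides whether it was close enough.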

What we found

A quantized 35B model running on a single 24GB GPU (an RTX 3090 — we got ours for $799) scored 10/10 on 37 out of 59 agent roles — including code generation, legal review, financial analysis, and security auditing.

That's not a typo. A $799 GPU running a free, open-source model, matching or beating what many people pay cloud APIs hundreds of dollars a month for.
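If you're wondering how a 35B model fits on a 24GB card, the rough arithmetic is below. This is a back-of-the-envelope sketch, not a measurement: the 4-bit figure is an assumption (we only said "quantized" above), and it counts weights only, ignoring the KV cache and runtime overhead.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in decimal GB.

    bytes = parameters * bits_per_weight / 8
    Assumption: every weight uses the same bit width; real quantization
    formats mix widths and add some metadata.
    """
    return params_billions * bits_per_weight / 8


# 16-bit weights vs. an assumed 4-bit quantization of a 35B model.
print(weight_memory_gb(35, 16))  # 70.0 GB, far beyond a 24GB card
print(weight_memory_gb(35, 4))   # 17.5 GB, leaves headroom on 24GB
```

Whatever is left after the weights has to cover context and activations, so longer prompts eat into that headroom.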

Some roles where even small models crush it:

  • Router / Triage — 10/10

  • Code Generation — 10/10

  • Fact-Checking — 10/10

  • Customer Support — 10/10

  • Legal Document Review — 10/10

  • Financial Analysis — 10/10

And roles where you still need bigger models or thinking mode:

  • 🔴 Calendar scheduling — 0/10 (timezone math is hard)

  • 🔴 Data Analysis — 3/10 (needs reasoning)

  • 🔴 Math / Logic — 4/10 (needs thinking enabled)

Full results — every role, every model, side by side

We published the complete per-role breakdown for all 59 agent roles across multiple models and configurations.

The repo includes:

  • VRAM guides for 16GB, 24GB, 32GB, 48GB, 64GB, and 96GB GPUs

  • The complete test suite so you can run it on your own models

  • Side-by-side comparisons of thinking vs non-thinking modes

Join the community

Want to discuss results, share your own benchmarks, or get help setting up local AI agents for your business?

It's a free community for freelancers, solopreneurs, and entrepreneurs who are putting AI agents to work — not just talking about them. Come share what's working for you.

More dispatches coming soon — including which sub-agent tasks you can hand off to tiny models, and what actually needs a big one.

Andrew Darius
Founder, AIgenteur
