AIBench: Benchmarking 7 LLM Configurations on Real-World Code Generation
We evaluated seven configurations of six frontier and budget LLMs on 10 graded code generation tasks across React and Rust. Quality still costs money, but the gap is narrowing faster than the price sheet suggests.

📄 Prefer the full version? Download the complete whitepaper (PDF, 12 pages) — includes the full methodology, per-task result tables for every configuration, threats to validity, and deployment guidance.
Abstract
We evaluated seven configurations of six large language models on a suite of 10 code generation tasks spanning two technology stacks: React (TypeScript, frontend) and Rust (systems programming). Each task was machine-verified against a test harness, with up to three retry attempts permitted on failure. Results show a clear quality–cost frontier: Claude Fable 5 achieved a perfect 10/10 at $2.38 total, while DeepSeek V4 Flash passed 8/10 at $0.015, a 159× cost difference for a two-task quality gap. We also find that retry strategy is a first-order variable: enabling retries moved Claude Sonnet 4.6 from 3/10 to 9/10 on identical tasks.
1. Methodology
1.1 Task suites
The benchmark consists of two suites of five tasks each, ordered by increasing difficulty (L1–L5):
| Level | React suite | Rust suite |
|---|---|---|
| L1 | Stat card component | Merge intervals algorithm |
| L2 | Login form with validation | Word-count CLI (wc clone) |
| L3 | Sortable users table | LRU cache implementation |
| L4 | Posts CRUD interface | Concurrent HTTP fetcher |
| L5 | Multi-page application | SPSC lock-free ring buffer |
The React suite tests component composition, state management, and form handling. The Rust suite escalates from basic algorithms to ownership-heavy concurrent code, where the L5 lock-free ring buffer requires correct use of atomics and memory ordering.
1.2 Evaluation protocol
- Verification: every task is graded by an automated harness (compilation, unit tests, and behavioral checks). No human judgment is involved in pass/fail decisions.
- Trials: 1 trial per task, with up to 3 retry attempts on failure. On retry, the model receives the failure output and may revise its solution.
- Cost accounting: total API cost per suite, including all retry attempts. Failed attempts count toward cost.
- Date: all runs executed June 9, 2026.
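
The protocol above can be sketched as a verify-and-retry loop. This is a minimal illustration under stated assumptions, not the AIBench harness itself: `run_trial`, `TrialResult`, `generate`, and `verify` are hypothetical names, and the default `max_retries=3` simply mirrors the retry limit stated in the protocol.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class TrialResult:
    passed: bool
    attempts: int      # total attempts made, including the first
    cost_usd: float    # cost of every attempt, failed ones included


def run_trial(
    task: str,
    generate: Callable[[str, Optional[str]], Tuple[str, float]],
    verify: Callable[[str], Tuple[bool, str]],
    max_retries: int = 3,
) -> TrialResult:
    """Run one trial: a first attempt plus up to `max_retries` retries.

    `generate(task, feedback)` is a hypothetical model call returning
    (solution, cost_usd); `verify(solution)` is a hypothetical harness call
    returning (passed, failure_output).
    """
    feedback: Optional[str] = None
    total_cost = 0.0
    for attempt in range(1, max_retries + 2):
        solution, cost = generate(task, feedback)
        total_cost += cost  # failed attempts count toward cost
        passed, failure_output = verify(solution)
        if passed:
            return TrialResult(True, attempt, total_cost)
        feedback = failure_output  # the next attempt sees the failure output
    return TrialResult(False, max_retries + 1, total_cost)
```

Passing `max_retries=0` gives a single-pass configuration with the same cost accounting.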
1.3 Models under test
Seven configurations were evaluated: Claude Fable 5, Claude Opus 4.5, Claude Sonnet 4.6 (in both single-pass and retry configurations), DeepSeek V4 Pro, DeepSeek V4 Flash, and Tencent HY3 Preview.
2. Results
2.1 Overall pass rates
| Model | React | Rust | Total | Total cost |
|---|---|---|---|---|
| Claude Fable 5 | 5/5 | 5/5 | 10/10 | $2.38 |
| Claude Opus 4.5 | 4/5 | 5/5 | 9/10 | $1.00 |
| Claude Sonnet 4.6 (retries) | 4/5 | 5/5 | 9/10 | $0.75 |
| DeepSeek V4 Flash | 4/5 | 4/5 | 8/10 | $0.015 |
| DeepSeek V4 Pro | 3/5 | 2/5 | 5/10 | $0.11 |
| Claude Sonnet 4.6 (1-pass) | 1/5 | 2/5 | 3/10 | $0.27 |
| Tencent HY3 Preview | 2/5 | 0/5 | 2/10 | $0.02 |
Three observations stand out:
- The frontier is tight at the top. Fable 5, Opus 4.5, and Sonnet 4.6 (with retries) are separated by a single task — the React L4 CRUD interface, which only Fable 5 completed.
- DeepSeek V4 Flash punches far above its price. At 8/10 it trails Opus 4.5 and Sonnet 4.6 (retries) by one task and Fable 5 by two, while costing 50× to 159× less in total.
- Rust separates the field. Tencent HY3 passed zero Rust tasks, and DeepSeek V4 Pro dropped to 2/5 — the ownership and concurrency requirements of L3–L5 are a meaningful capability bar.
2.2 Cost efficiency
| Model | Passes | Total cost | Cost per pass |
|---|---|---|---|
| DeepSeek V4 Flash | 8 | $0.015 | $0.002 |
| Tencent HY3 Preview | 2 | $0.02 | $0.010 |
| DeepSeek V4 Pro | 5 | $0.11 | $0.022 |
| Claude Sonnet 4.6 (retries) | 9 | $0.75 | $0.083 |
| Claude Sonnet 4.6 (1-pass) | 3 | $0.27 | $0.090 |
| Claude Opus 4.5 | 9 | $1.00 | $0.111 |
| Claude Fable 5 | 10 | $2.38 | $0.238 |
DeepSeek V4 Flash delivers a passed task for about $0.002 ($0.015 / 8), roughly 59× cheaper per pass than Claude Opus 4.5 and 127× cheaper than Fable 5 when computed from the unrounded totals. For workloads where an 80% first-line success rate is acceptable (with a stronger model as fallback), the economics are difficult to ignore.
Notably, cost per pass does not track total spend: Sonnet 4.6 in single-pass mode costs less overall ($0.27 versus $0.75) but more per pass ($0.090 versus $0.083), because failed runs still cost money but produce nothing.
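The per-pass figures follow directly from the pass counts and total costs in the tables. The short script below reproduces that arithmetic from the unrounded totals, so its ratios can differ slightly from ones derived from the rounded per-pass column. The variable and function names are ours; the inputs are the reported totals.

```python
# (passes, total cost in USD) per configuration, taken from the results tables
results = {
    "DeepSeek V4 Flash": (8, 0.015),
    "Tencent HY3 Preview": (2, 0.02),
    "DeepSeek V4 Pro": (5, 0.11),
    "Claude Sonnet 4.6 (retries)": (9, 0.75),
    "Claude Sonnet 4.6 (1-pass)": (3, 0.27),
    "Claude Opus 4.5": (9, 1.00),
    "Claude Fable 5": (10, 2.38),
}


def cost_per_pass(passes: int, total_cost: float) -> float:
    """Total spend divided by passed tasks; infinite if nothing passed."""
    return total_cost / passes if passes else float("inf")


per_pass = {name: cost_per_pass(p, c) for name, (p, c) in results.items()}
baseline = per_pass["DeepSeek V4 Flash"]

# Print configurations from cheapest to most expensive per passed task.
for name, value in sorted(per_pass.items(), key=lambda kv: kv[1]):
    print(f"{name:<30} ${value:.4f} per pass  ({value / baseline:.0f}x Flash)")
```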
2.3 The retry effect
| Model | Single attempt | With retries | Δ |
|---|---|---|---|
| Claude Sonnet 4.6 | 3/10 | 9/10 | +6 |
| Claude Fable 5 | 6/10 | 10/10 | +4 |
This is the most consequential finding of the run. Feeding the model its own failure output and allowing further attempts tripled Sonnet 4.6's pass rate, from 3/10 to 9/10. Both models reach 9/10 or better within the retry budget, which suggests the failures in single-pass mode are largely recoverable errors (missed edge cases, test misreads) rather than capability gaps.
The practical implication: a benchmark score without a retry policy attached is close to meaningless, and production agent loops should treat verification-plus-retry as baseline architecture, not an optimization.
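One way to attach a retry policy to a score is to report passes as a function of the attempt budget. The sketch below assumes the harness records, for each task, the attempt on which it first passed (or `None` if it never did). The function name is hypothetical, and the sample data is invented for illustration; it is not an AIBench result.

```python
from typing import Dict, List, Optional


def passes_by_budget(
    first_pass_attempt: List[Optional[int]], max_attempts: int
) -> Dict[int, int]:
    """Map each attempt budget k (1..max_attempts) to tasks passed within k attempts."""
    return {
        k: sum(1 for a in first_pass_attempt if a is not None and a <= k)
        for k in range(1, max_attempts + 1)
    }


# Invented example: 5 tasks; two pass on the first attempt, one on the second,
# one on the third, and one never passes.
example = [1, 1, 2, 3, None]
print(passes_by_budget(example, max_attempts=3))  # → {1: 2, 2: 3, 3: 4}
```

Reporting the whole curve rather than a single number makes scores comparable across runs that used different attempt limits.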
3. Recommendations
Based on this run, our guidance for production code generation workloads:
- Cost-sensitive pipelines: DeepSeek V4 Flash. 8/10 at $0.015 total (~159× cheaper than Fable 5). Use it as the first line and escalate failures to a frontier model.
- Maximum quality: Claude Fable 5. The only model to clear all 10 tasks, including the React L4 CRUD task that defeated every other model.
- Best balance: Claude Sonnet 4.6 with retries — 9/10 at $0.75, matching Opus 4.5's score at 25% less cost.
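- Always enable retries. In this run, verification feedback loops were worth more pass rate than a model tier upgrade, at a lower added cost: retries added six passes to Sonnet 4.6 for an extra $0.48, while moving from Sonnet 4.6 with retries to Fable 5 added one pass for an extra $1.63.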
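
The first-line-plus-escalation pattern from the cost-sensitive recommendation can be sketched as a simple cascade. This is an illustrative sketch only: `Tier`, `solve_with_cascade`, and the stub tiers are hypothetical names, the stub costs are placeholders rather than measured values, and each tier is assumed to run its own verify-and-retry loop and report whether the harness passed and what the run cost.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Tier:
    name: str
    # Hypothetical callable: runs the model (with its own retries) on a task
    # and returns (passed, cost_usd) as judged by the verification harness.
    run: Callable[[str], Tuple[bool, float]]


def solve_with_cascade(task: str, tiers: List[Tier]) -> Tuple[Optional[str], float]:
    """Try tiers in order, cheapest first, and stop at the first verified pass.

    Returns (name of the tier that passed or None, total cost across tiers tried).
    """
    total_cost = 0.0
    for tier in tiers:
        passed, cost = tier.run(task)
        total_cost += cost  # spend on failed tiers still counts
        if passed:
            return tier.name, total_cost
    return None, total_cost


# Stub tiers with placeholder behavior and costs, for demonstration only.
cheap = Tier("budget-model", lambda task: (task != "hard", 0.002))
strong = Tier("frontier-model", lambda task: (True, 0.25))

print(solve_with_cascade("easy", [cheap, strong]))  # handled by the budget tier
print(solve_with_cascade("hard", [cheap, strong]))  # escalated to the frontier tier
```

The cascade only pays the frontier price on tasks the budget tier fails, which is why it depends on automated verification being in place.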
4. Limitations and future work
This run used a single trial per task, so results carry run-to-run variance that we have not yet quantified. Per-test cost data was not captured for three of the seven configurations. Future iterations will add multi-trial runs with confidence intervals, additional language suites (Python, Go), and latency measurements alongside cost.
The AIBench harness and task suites are developed internally at AtomicoLabs. For the full analysis, download the whitepaper (PDF). Questions about methodology or requests for specific model evaluations: c@atomicolabs.com.

