AI Research · AtomicoLabs Research

AIBench: Benchmarking 7 LLMs on Real-World Code Generation

We evaluated seven frontier and budget LLMs on 10 graded code generation tasks across React and Rust. Quality still costs money — but the gap is narrowing faster than the price sheet suggests.


📄 Prefer the full version? Download the complete whitepaper (PDF, 12 pages) — includes the full methodology, per-task result tables for every configuration, threats to validity, and deployment guidance.

Abstract

We evaluated seven model configurations (six distinct large language models, one of them run both with and without retries) on a suite of 10 code generation tasks spanning two technology stacks: React (TypeScript, frontend) and Rust (systems programming). Each task was machine-verified against a test harness, with up to three retry attempts permitted on failure in every configuration except the single-pass one. Results show a clear quality–cost frontier: Claude Fable 5 achieved a perfect 10/10 at $2.38 total, while DeepSeek V4 Flash passed 8/10 at $0.015, roughly a 159× cost difference for a two-task quality gap. We also find that retry strategy is a first-order variable: enabling retries moved Claude Sonnet 4.6 from 3/10 to 9/10 on identical tasks.

1. Methodology

1.1 Task suites

The benchmark consists of two suites of five tasks each, ordered by increasing difficulty (L1–L5):

| Level | React suite | Rust suite |
|-------|-------------|------------|
| L1 | Stat card component | Merge intervals algorithm |
| L2 | Login form with validation | Word-count CLI (wc clone) |
| L3 | Sortable users table | LRU cache implementation |
| L4 | Posts CRUD interface | Concurrent HTTP fetcher |
| L5 | Multi-page application | SPSC lock-free ring buffer |

The React suite tests component composition, state management, and form handling. The Rust suite escalates from basic algorithms to ownership-heavy concurrent code, where the L5 lock-free ring buffer requires correct use of atomics and memory ordering.

1.2 Evaluation protocol

  • Verification: every task is graded by an automated harness (compilation, unit tests, and behavioral checks). No human judgment is involved in pass/fail decisions.
  • Trials: 1 trial per task, with up to 3 retry attempts on failure. On retry, the model receives the failure output and may revise its solution.
  • Cost accounting: total API cost per suite, including all retry attempts. Failed attempts count toward cost.
  • Date: all runs executed June 9, 2026.
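The protocol above can be sketched as a small verify-and-retry loop. This is a minimal illustration, not the AIBench harness itself: `run_task`, `generate`, `verify`, and the stub functions are hypothetical names, and the sketch assumes "up to 3 retry attempts" means one initial attempt plus at most three retries.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class TaskResult:
    passed: bool
    attempts: int
    cost_usd: float


def run_task(
    generate: Callable[[str, Optional[str]], Tuple[str, float]],
    verify: Callable[[str], Tuple[bool, str]],
    prompt: str,
    max_retries: int = 3,
) -> TaskResult:
    """Run one task: an initial attempt plus up to `max_retries` retries.

    `generate(prompt, feedback)` returns (solution, cost_usd).
    `verify(solution)` returns (passed, harness_output).
    """
    total_cost = 0.0
    feedback: Optional[str] = None
    for attempt in range(1, max_retries + 2):
        solution, cost = generate(prompt, feedback)
        total_cost += cost  # failed attempts still count toward cost
        passed, output = verify(solution)
        if passed:
            return TaskResult(True, attempt, total_cost)
        feedback = output  # the next attempt sees the failure output
    return TaskResult(False, max_retries + 1, total_cost)


# Stub model and harness, for illustration only (costs are made up).
def stub_generate(prompt: str, feedback: Optional[str]) -> Tuple[str, float]:
    # Wrong on the first try, corrected once it sees failure output.
    return ("good" if feedback else "bad", 0.01)


def stub_verify(solution: str) -> Tuple[bool, str]:
    ok = solution == "good"
    return ok, "" if ok else "test_renders_value failed"


result = run_task(stub_generate, stub_verify, "Build a stat card component")
print(result)
```

Setting `max_retries=0` gives the single-pass configuration, which makes it straightforward to run the same task list under both policies.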

1.3 Models under test

Seven configurations were evaluated: Claude Fable 5, Claude Opus 4.5, Claude Sonnet 4.6 (in both single-pass and retry configurations), DeepSeek V4 Pro, DeepSeek V4 Flash, and Tencent HY3 Preview.

2. Results

2.1 Overall pass rates

Overall pass rate by model

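| Model | React | Rust | Total | Total cost |
|-------|-------|------|-------|------------|
| Claude Fable 5 | 5/5 | 5/5 | 10/10 | $2.38 |
| Claude Opus 4.5 | 4/5 | 5/5 | 9/10 | $1.00 |
| Claude Sonnet 4.6 (retries) | 4/5 | 5/5 | 9/10 | $0.75 |
| DeepSeek V4 Flash | 4/5 | 4/5 | 8/10 | $0.015 |
| DeepSeek V4 Pro | 3/5 | 2/5 | 5/10 | $0.11 |
| Claude Sonnet 4.6 (1-pass) | 1/5 | 2/5 | 3/10 | $0.27 |
| Tencent HY3 Preview | 2/5 | 0/5 | 2/10 | $0.02 |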

Three observations stand out:

  1. The frontier is tight at the top. Fable 5, Opus 4.5, and Sonnet 4.6 (with retries) are separated by a single task — the React L4 CRUD interface, which only Fable 5 completed.
  2. DeepSeek V4 Flash punches far above its price. At 8/10 it trails Opus 4.5 and Sonnet 4.6 (with retries) by one task and Fable 5 by two, while costing roughly 50× to 159× less in total.
  3. Rust separates the field. Tencent HY3 passed zero Rust tasks, and DeepSeek V4 Pro dropped to 2/5 — the ownership and concurrency requirements of L3–L5 are a meaningful capability bar.

2.2 Cost efficiency

Cost per passed task

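| Model | Passes | Total cost | Cost per pass |
|-------|--------|------------|---------------|
| DeepSeek V4 Flash | 8 | $0.015 | $0.002 |
| Tencent HY3 Preview | 2 | $0.02 | $0.010 |
| DeepSeek V4 Pro | 5 | $0.11 | $0.022 |
| Claude Sonnet 4.6 (retries) | 9 | $0.75 | $0.083 |
| Claude Sonnet 4.6 (1-pass) | 3 | $0.27 | $0.090 |
| Claude Opus 4.5 | 9 | $1.00 | $0.111 |
| Claude Fable 5 | 10 | $2.38 | $0.238 |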

DeepSeek V4 Flash delivers a passed task for about $0.002 ($0.015 / 8 ≈ $0.0019), roughly 59× cheaper per pass than Claude Opus 4.5 and about 127× cheaper than Fable 5. For workloads where an 80% first-line success rate is acceptable (with a stronger model as fallback), the economics are difficult to ignore.

Notably, cost per pass does not simply track total spend: Sonnet 4.6 in single-pass mode costs less in total ($0.27 versus $0.75) but more per pass ($0.090 versus $0.083) than with retries, because failed runs still cost money but produce nothing.
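For readers who want to rerun the arithmetic, here is a short sketch that recomputes cost per pass from the passes and total cost reported in the table. The variable names are ours. Ratios are computed from unrounded per-pass costs, so they can differ from ratios taken from the rounded table column.

```python
# (passes, total cost in USD) as reported in the cost-efficiency table.
results = {
    "DeepSeek V4 Flash": (8, 0.015),
    "Tencent HY3 Preview": (2, 0.02),
    "DeepSeek V4 Pro": (5, 0.11),
    "Claude Sonnet 4.6 (retries)": (9, 0.75),
    "Claude Sonnet 4.6 (1-pass)": (3, 0.27),
    "Claude Opus 4.5": (9, 1.00),
    "Claude Fable 5": (10, 2.38),
}

# Cost per pass = total cost / number of passed tasks.
cost_per_pass = {name: cost / passes for name, (passes, cost) in results.items()}

# Express every configuration relative to the cheapest one per pass.
baseline = cost_per_pass["DeepSeek V4 Flash"]
ratio_vs_flash = {name: cpp / baseline for name, cpp in cost_per_pass.items()}

# Print cheapest first.
ranking = sorted(cost_per_pass, key=cost_per_pass.get)
for name in ranking:
    print(f"{name:30s} ${cost_per_pass[name]:.4f}  {ratio_vs_flash[name]:6.1f}x")
```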

2.3 The retry effect

Impact of retries on pass rate

| Model | Single attempt | With retries | Δ |
|-------|----------------|--------------|---|
| Claude Sonnet 4.6 | 3/10 | 9/10 | +6 |
| Claude Fable 5 | 6/10 | 10/10 | +4 |

This is the most consequential finding of the run. Giving the model its own failure output and the retry budget described in section 1.2 tripled Sonnet 4.6's pass rate. Both models reach or approach a perfect score within that budget, which suggests the failures in single-pass mode are largely recoverable errors (missed edge cases, test misreads) rather than capability gaps.

The practical implication: a benchmark score without a retry policy attached is close to meaningless, and production agent loops should treat verification-plus-retry as baseline architecture, not an optimization.
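One way to put a price on the retry effect is the marginal cost of each additional pass. The sketch below does this for Sonnet 4.6, the only model for which both single-pass and retry costs are reported; the variable names are ours.

```python
# Claude Sonnet 4.6 figures from the results tables.
single_pass = {"passes": 3, "cost_usd": 0.27}
with_retries = {"passes": 9, "cost_usd": 0.75}

extra_passes = with_retries["passes"] - single_pass["passes"]
extra_cost = with_retries["cost_usd"] - single_pass["cost_usd"]

# Marginal cost of each pass that retries added.
marginal_cost_per_pass = extra_cost / extra_passes

print(f"{extra_passes} extra passes for ${extra_cost:.2f} "
      f"(${marginal_cost_per_pass:.3f} per extra pass)")
```

On these figures, each pass gained through retries cost slightly less than the average cost per pass in single-pass mode.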

3. Recommendations

Based on this run, our guidance for production code generation workloads:

  1. Cost-sensitive pipelines: DeepSeek V4 Flash. 8/10 at $0.015 total (~159× cheaper than Fable 5). Use it as the first line, escalate failures to a frontier model.
  2. Maximum quality: Claude Fable 5. The only model to clear all 10 tasks, including the React L4 CRUD task that defeated every other model.
  3. Best balance: Claude Sonnet 4.6 with retries — 9/10 at $0.75, matching Opus 4.5's score at 25% less cost.
  4. Always enable retries. In this run, retries lifted Sonnet 4.6 by six tasks (3/10 to 9/10) for $0.48 of additional spend, a larger gain than the three-task gap between Sonnet 4.6 and Fable 5 in single-attempt mode.
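The "first line, then escalate" pattern from the first recommendation can be sketched as a simple cascade. Everything here is hypothetical: the `Solver` type, the function names, and the stub costs are placeholders for illustration, and each solver is assumed to run its own verify-and-retry loop before reporting a result.

```python
from typing import Callable, Optional, Sequence, Tuple

# A solver takes a task prompt and returns (passed, cost_usd).
Solver = Callable[[str], Tuple[bool, float]]


def solve_with_escalation(
    task: str, tiers: Sequence[Tuple[str, Solver]]
) -> Tuple[Optional[str], float]:
    """Try each tier in order, cheapest first; stop at the first pass.

    Returns (name of the tier that passed, or None, total cost in USD).
    """
    total_cost = 0.0
    for name, solve in tiers:
        passed, cost = solve(task)
        total_cost += cost  # a failed cheap attempt is still paid for
        if passed:
            return name, total_cost
    return None, total_cost


# Stub solvers with made-up costs, for illustration only.
def cheap_model(task: str) -> Tuple[bool, float]:
    return ("crud" not in task.lower(), 0.002)


def frontier_model(task: str) -> Tuple[bool, float]:
    return (True, 0.24)


tiers = [("cheap", cheap_model), ("frontier", frontier_model)]
easy = solve_with_escalation("Stat card component", tiers)
hard = solve_with_escalation("Posts CRUD interface", tiers)
print(easy, hard)
```

The design choice worth noting is that escalated tasks pay for both tiers, so the cascade only saves money when the first tier passes often enough to offset those double charges.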

4. Limitations and future work

This run used a single trial per task, so results carry run-to-run variance that we have not yet quantified. Per-test cost data was not captured for three of the seven configurations. Future iterations will add multi-trial runs with confidence intervals, additional language suites (Python, Go), and latency measurements alongside cost.
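To give a sense of how much uncertainty a 10-task, single-trial score carries, here is a sketch of a 95% Wilson score interval for a pass rate. It is an illustration only, not part of the AIBench analysis: it treats tasks as independent trials with a shared pass probability, which graded-difficulty tasks are not, and it does not capture run-to-run variance on the same task.

```python
import math
from typing import Tuple


def wilson_interval(passes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("need n > 0 and 0 <= passes <= n")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


lo_8, hi_8 = wilson_interval(8, 10)     # e.g. an 8/10 score
lo_10, hi_10 = wilson_interval(10, 10)  # e.g. a 10/10 score
print(f"8/10:  {lo_8:.2f} to {hi_8:.2f}")
print(f"10/10: {lo_10:.2f} to {hi_10:.2f}")
```

With only 10 tasks the intervals are wide and overlap heavily, which is the motivation for the multi-trial runs planned above.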


The AIBench harness and task suites are developed internally at AtomicoLabs. For the full analysis, download the whitepaper (PDF). Questions about methodology or requests for specific model evaluations: c@atomicolabs.com.