AIBench: Benchmarking 8 LLMs on Real-World Code Generation

Prefer the full version? Download the complete whitepaper (PDF, 13 pages). It includes the full methodology, per-task result tables, latency analysis, methodological notes, and deployment guidance.

Abstract

We evaluated eight large language models on 10 code generation tasks spanning two technology stacks: React (TypeScript, frontend) and Rust (systems programming). Each model ran three independent trials per task, for 30 trials per model and 240 trials total. Every submission was machine-verified by a test harness, with up to three attempts allowed per trial (the first attempt plus retries after failure).

The headline result: Claude Fable 5 was the only model to achieve a perfect score, passing 30/30 trials at a total cost of $7.97. DeepSeek V4 Flash passed 24/30 trials for $0.053, creating a roughly 150x total-cost spread for a six-trial quality gap. We also find that verification-driven retries are not optional infrastructure: across all eight models, first-attempt pass rate averaged 40%, while final pass rate within three attempts averaged 87%.

1. Methodology

1.1 Task suites

The benchmark consists of two suites of five tasks each, ordered by increasing difficulty (L1–L5):

Level	React suite	Rust suite
L1	Stat card component	Merge intervals algorithm
L2	Login form with validation	Word-count CLI (`wc` clone)
L3	Sortable users table	LRU cache implementation
L4	Posts CRUD interface	Concurrent HTTP fetcher
L5	Multi-page application	SPSC lock-free ring buffer

The React suite tests component composition, state management, and form handling. The Rust suite escalates from basic algorithms to ownership-heavy concurrent code, where the L5 lock-free ring buffer requires correct use of atomics and memory ordering.

1.2 Evaluation protocol

Verification: every task is graded by an automated harness (compilation, unit tests, and behavioral checks). No human judgment is involved in pass/fail decisions.
Trials: 3 trials per task, with up to 3 attempts per trial. On a retry, the model receives the failure output from the harness and may revise its solution.
Cost accounting: total API cost includes all retry attempts. Failed attempts count toward cost.
Latency accounting: throughput probes and per-attempt wall-clock generation times were captured for every model.
Date: all runs executed June 12-13, 2026.
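The protocol above can be sketched as a small generate-verify-retry loop. This is a minimal illustration, not the AIBench harness itself: `attempt_fn`, `Attempt`, and `TrialResult` are hypothetical names, and `max_attempts=3` assumes the "within 3 attempts" reading used in the results tables.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Attempt:
    passed: bool
    cost_usd: float
    feedback: str  # harness output shown to the model on the next attempt


@dataclass
class TrialResult:
    passed: bool
    attempts_used: int
    total_cost_usd: float


def run_trial(
    attempt_fn: Callable[[Optional[str]], Attempt],
    max_attempts: int = 3,
) -> TrialResult:
    """Run one trial: generate, verify, and retry with the failure output.

    attempt_fn is a hypothetical callable that asks the model for a solution
    (given the previous failure output, or None on the first attempt), runs
    the harness on it, and reports the outcome and the API cost incurred.
    """
    feedback: Optional[str] = None
    total_cost = 0.0
    for attempt_no in range(1, max_attempts + 1):
        attempt = attempt_fn(feedback)
        total_cost += attempt.cost_usd  # failed attempts count toward cost
        if attempt.passed:
            return TrialResult(True, attempt_no, total_cost)
        feedback = attempt.feedback  # passed back to the model on retry
    return TrialResult(False, max_attempts, total_cost)
```

Keeping the model call and the harness behind a single callable means the same loop, and therefore the same retry policy, can be applied to every model under test.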

1.3 Models under test

Eight models were evaluated under the same retry policy: Claude Fable 5, Claude Opus 4.8, Claude Opus 4.5, Claude Sonnet 4.6, DeepSeek V4 Pro, Qwen 3.7 Plus, DeepSeek V4 Flash, and GPT-5.4-mini.

2. Results

2.1 Overall pass rates

Overall pass rate by model

Model	React	Rust	Total	Pass rate	Total cost
Claude Fable 5	15/15	15/15	30/30	100%	$7.97
Claude Opus 4.8	13/15	15/15	28/30	93%	$2.80
Claude Sonnet 4.6	12/15	15/15	27/30	90%	$2.05
Claude Opus 4.5	12/15	14/15	26/30	87%	$3.19
DeepSeek V4 Pro	13/15	13/15	26/30	87%	$0.60
Qwen 3.7 Plus	10/15	15/15	25/30	83%	$0.59
DeepSeek V4 Flash	12/15	12/15	24/30	80%	$0.053
GPT-5.4-mini	12/15	11/15	23/30	77%	$0.33

Three observations stand out:

Fable 5 is the reliability ceiling. It was the only model to clear all 30 trials.
Opus 4.8 is the premium value upgrade. It outperformed Opus 4.5 while costing less.
Budget models are close enough to matter. DeepSeek V4 Pro matched Opus 4.5's 87% pass rate at roughly one-fifth the cost.

2.2 Cost efficiency

Cost per passed task

Model	Passes	Total cost	Cost per pass
DeepSeek V4 Flash	24	$0.053	$0.0022
GPT-5.4-mini	23	$0.33	$0.014
DeepSeek V4 Pro	26	$0.60	$0.023
Qwen 3.7 Plus	25	$0.59	$0.024
Claude Sonnet 4.6	27	$2.05	$0.076
Claude Opus 4.8	28	$2.80	$0.100
Claude Opus 4.5	26	$3.19	$0.123
Claude Fable 5	30	$7.97	$0.266

DeepSeek V4 Flash delivers a passed trial for $0.0022, about 121x cheaper per pass than Fable 5. That does not make it the best model; it makes it a serious first-line candidate for high-volume workflows where failed trials can be escalated to a stronger model.

The more interesting budget result is DeepSeek V4 Pro: it passed 26/30, tied with Opus 4.5, for $0.60 total. Quality tiers still exist, but the economic gap is wider than the capability gap.
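For readers who want to reproduce the cost-per-pass column, the arithmetic is just total cost divided by passed trials. The sketch below uses the rounded totals published in the tables above, so ratios derived from it can differ slightly from figures computed on unrounded billing data.

```python
# (passed trials out of 30, total cost in USD), copied from the tables above.
RESULTS = {
    "Claude Fable 5": (30, 7.97),
    "Claude Opus 4.8": (28, 2.80),
    "Claude Sonnet 4.6": (27, 2.05),
    "Claude Opus 4.5": (26, 3.19),
    "DeepSeek V4 Pro": (26, 0.60),
    "Qwen 3.7 Plus": (25, 0.59),
    "DeepSeek V4 Flash": (24, 0.053),
    "GPT-5.4-mini": (23, 0.33),
}


def cost_per_pass(passes: int, total_cost: float) -> float:
    """Total spend (including failed attempts) divided by passed trials."""
    return total_cost / passes


# Cheapest cost per pass first.
ranked = sorted(RESULTS, key=lambda m: cost_per_pass(*RESULTS[m]))
for model in ranked:
    passes, cost = RESULTS[model]
    print(f"{model:<20} {passes:>2}/30  ${cost_per_pass(passes, cost):.4f} per pass")

flash = cost_per_pass(*RESULTS["DeepSeek V4 Flash"])
fable = cost_per_pass(*RESULTS["Claude Fable 5"])
print(f"Fable 5 vs Flash, cost per pass: {fable / flash:.0f}x")
```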

2.3 The retry effect

Impact of retries on pass rate

Model	First attempt	Within 3 attempts	Improvement
Qwen 3.7 Plus	8/30	25/30	+17
Claude Sonnet 4.6	11/30	27/30	+16
DeepSeek V4 Flash	8/30	24/30	+16
Claude Fable 5	16/30	30/30	+14
Claude Opus 4.5	12/30	26/30	+14
DeepSeek V4 Pro	12/30	26/30	+14
Claude Opus 4.8	16/30	28/30	+12
GPT-5.4-mini	14/30	23/30	+9

This is the most consequential finding of the run. Across all eight models, first-attempt pass rate averaged 40%; final pass rate within three attempts averaged 87%. Even Fable 5, the perfect scorer, passed only 16/30 trials on the first attempt.

The practical implication: a benchmark score without a retry policy attached is close to meaningless. Production agent loops should treat verification-plus-retry as baseline architecture, not an optimization.
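The two averages quoted above follow directly from the retry table. A short check, pooling all 240 trials (which equals the mean of the per-model rates here because every model ran 30 trials):

```python
TRIALS_PER_MODEL = 30

# (first-attempt passes, passes within 3 attempts), copied from the table above.
RETRY_RESULTS = {
    "Qwen 3.7 Plus": (8, 25),
    "Claude Sonnet 4.6": (11, 27),
    "DeepSeek V4 Flash": (8, 24),
    "Claude Fable 5": (16, 30),
    "Claude Opus 4.5": (12, 26),
    "DeepSeek V4 Pro": (12, 26),
    "Claude Opus 4.8": (16, 28),
    "GPT-5.4-mini": (14, 23),
}

first_total = sum(first for first, _ in RETRY_RESULTS.values())
final_total = sum(final for _, final in RETRY_RESULTS.values())
denominator = TRIALS_PER_MODEL * len(RETRY_RESULTS)

first_rate = first_total / denominator
final_rate = final_total / denominator
rescued = final_total - first_total  # passes that only arrived on a retry

print(f"first attempt: {first_rate:.1%}")      # first attempt: 40.4%
print(f"within 3 attempts: {final_rate:.1%}")  # within 3 attempts: 87.1%
print(f"passes that needed a retry: {rescued} of {final_total}")  # 112 of 209
```

Put differently, slightly more than half of all eventual passes in this run came from a retry rather than from the first attempt.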

2.4 Latency and throughput

Median end-to-end solve time by model

Model	TTFT (short prompt)	TTFT (16k prompt)	Output tok/s	Median solve time
GPT-5.4-mini	0.75s	0.73s	123	9.8s
Claude Opus 4.8	1.17s	1.64s	72	20.7s
Claude Opus 4.5	1.61s	2.37s	45	21.1s
Claude Sonnet 4.6	1.47s	3.00s	45	21.7s
Claude Fable 5	3.81s	4.35s	80	36.7s
DeepSeek V4 Flash	2.58s	2.17s	103	44.3s
DeepSeek V4 Pro	1.55s	--	82	62.2s
Qwen 3.7 Plus	4.52s	3.70s	593*	186.9s

Latency is its own axis. GPT-5.4-mini is the fastest model in the study by a wide margin, despite landing last on pass rate. Qwen 3.7 Plus is inexpensive and excellent on Rust, but its median solve time was over three minutes. Interactive assistants and background batch pipelines should not optimize for the same model.

* Qwen's output-throughput figure (593 tok/s) appears inflated by provider-side batched streaming; the median solve time is the more useful operational metric.
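One way to make "latency is its own axis" concrete is to treat median solve time as a hard budget and pick the most reliable model that fits. The helper below is a hypothetical sketch using the figures from the tables above; `best_within_latency` is not part of AIBench, and a real selection would also weigh cost and tail latency.

```python
from typing import Optional

# (passed trials out of 30, median solve time in seconds), from the tables above.
MODELS = {
    "GPT-5.4-mini": (23, 9.8),
    "Claude Opus 4.8": (28, 20.7),
    "Claude Opus 4.5": (26, 21.1),
    "Claude Sonnet 4.6": (27, 21.7),
    "Claude Fable 5": (30, 36.7),
    "DeepSeek V4 Flash": (24, 44.3),
    "DeepSeek V4 Pro": (26, 62.2),
    "Qwen 3.7 Plus": (25, 186.9),
}


def best_within_latency(max_median_solve_s: float) -> Optional[str]:
    """Return the model with the most passes whose median solve time fits.

    Ties on passes are broken in favor of the faster model. Returns None
    when no model meets the budget.
    """
    eligible = {
        name: stats
        for name, stats in MODELS.items()
        if stats[1] <= max_median_solve_s
    }
    if not eligible:
        return None
    return max(eligible, key=lambda name: (eligible[name][0], -eligible[name][1]))


print(best_within_latency(10))   # GPT-5.4-mini
print(best_within_latency(25))   # Claude Opus 4.8
print(best_within_latency(300))  # Claude Fable 5
```

The same data yields three different answers for three latency budgets, which is why an interactive assistant and a batch pipeline rarely end up on the same model.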

3. Recommendations

Based on this run, our guidance for production code generation workloads:

Maximum reliability: Claude Fable 5. It is the only model to clear all 30 trials.
Best premium value: Claude Opus 4.8. It reached 93% while beating Opus 4.5 on both quality and price.
Best balance: Claude Sonnet 4.6. It reached 90% at the lowest premium-tier cost.
Cost-conscious production: DeepSeek V4 Pro. It matched Opus 4.5's pass rate at about one-fifth the cost.
High-volume first line: DeepSeek V4 Flash. It passed 80% of trials for roughly five cents total.
Interactive assistants: GPT-5.4-mini. It is the only model here with a sub-10-second median solve time.
Rust-heavy workloads: Qwen 3.7 Plus. It went 15/15 on Rust at budget-tier pricing, if latency is acceptable.
Always enable retries. The measured retry effect, 40% to 87% average pass rate, is larger than any model-tier upgrade in this run.
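The "high-volume first line" recommendation implies an escalation pattern: run the cheap model first and send only its failures to a stronger one. The estimate below is a back-of-the-envelope sketch under two loud assumptions that this run did not measure: the stronger model's average cost per trial also applies to escalated trials (they are probably harder, so this likely understates the cost), and trial outcomes are summarized by the overall pass rate. `cascade_cost_per_trial` is a hypothetical helper.

```python
TRIALS = 30


def cascade_cost_per_trial(
    first_cost: float, first_pass_rate: float, fallback_cost: float
) -> float:
    """Expected cost per trial when failures of the first model are escalated.

    Every trial pays for the first model; only the failed fraction also pays
    for the fallback model. Assumes average per-trial costs carry over.
    """
    return first_cost + (1.0 - first_pass_rate) * fallback_cost


# Average cost per trial, derived from the published totals.
flash_cost = 0.053 / TRIALS  # DeepSeek V4 Flash
fable_cost = 7.97 / TRIALS   # Claude Fable 5

estimate = cascade_cost_per_trial(flash_cost, 24 / TRIALS, fable_cost)

print(f"Fable 5 alone:       ${fable_cost:.4f} per trial")
print(f"Flash, then Fable 5: ${estimate:.4f} per trial (estimate)")
print(f"Estimated saving:    {fable_cost / estimate:.1f}x")
```

Under these assumptions the cascade is dominated by the escalated 20% of trials, so the saving depends far more on the first-line pass rate than on the first-line model's price.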

4. Methodological notes

This run resolves the biggest gaps from the earlier pilot: it uses three trials per task, captures cost telemetry across the full field, applies one retry policy to every model, and adds latency data.

The remaining caveats are narrower. Tasks come from our own production work, so they may not represent every engineering domain. Provider prices and routing change frequently. And even at temperature 0, provider-side non-determinism means rankings separated by one or two trials should be treated as directional, not permanent.
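To see why a one- or two-trial gap is only directional, it helps to put an interval around a 30-trial pass rate. The sketch below uses the Wilson score interval, which is our choice for illustration and not something the study reports. It treats the 30 trials as independent, which is an approximation because each task contributes three trials.

```python
import math
from typing import Tuple


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = passes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    )
    return center - half_width, center + half_width


for label, passes in [("28/30", 28), ("26/30", 26)]:
    low, high = wilson_interval(passes, 30)
    print(f"{label}: {low:.1%} to {high:.1%}")
```

The intervals for 28/30 and 26/30 overlap substantially, so a two-trial difference at this sample size does not by itself establish a stable ranking.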

The AIBench harness and task suites are developed internally at AtomicoLabs. For the full analysis, download the whitepaper (PDF). Questions about methodology or requests for specific model evaluations: c@atomicolabs.com.
