AI Coding Agents Leaderboard

A benchmark of coding LLMs across five task categories.

What each category measures:

  • Issue Resolution — Fixing bugs in real GitHub issues (SWE-Bench)
  • Frontend — UI development with visual context (SWE-Bench Multimodal)
  • Greenfield — Building new applications from scratch (Commit0)
  • Testing — Test generation and quality (SWT-Bench)
  • Information Gathering — Research and information retrieval (GAIA)

Leaderboard

Category cells show success rate (avg cost per run). Rows are sorted by Avg Score, descending. Context is the model's context window in tokens; "—" means not reported.

| Model | Avg Score | Avg Cost / run | Avg Runtime | Context | Issue Resolution | Frontend | Greenfield | Testing | Info Gathering |
|---|---|---|---|---|---|---|---|---|---|
| Claude-Opus-4-6 | 66.7 | $1.14 | 410s | 200K | 76.8% ($0.77) | 41.8% ($2.37) | 56.2% ($1.70) | 78.8% ($0.43) | 80.0% ($0.44) |
| GPT-5.4 | 63.8 | $1.82 | 372s | — | 75.6% ($1.36) | 36.8% ($4.24) | 56.2% ($2.19) | 71.4% ($0.55) | 78.8% ($0.74) |
| Claude-Opus-4-5 | 60.6 | $1.77 | 378s | 200K | 76.6% ($1.82) | 41.2% ($2.54) | 37.5% ($2.54) | 78.5% ($1.38) | 69.1% ($0.55) |
| GPT-5.2-Codex | 59.5 | $1.43 | 819s | 400K | 73.8% ($0.94) | 35.9% ($2.97) | 50.0% ($2.02) | 67.0% ($0.66) | 70.9% ($0.55) |
| Qwen3.6-Plus | 57.9 | $2.12 | 819s | — | 74.2% ($1.52) | 30.9% ($2.27) | 50.0% ($4.40) | 62.1% ($2.04) | 72.1% ($0.34) |
| GPT-5.2 | 56.3 | $1.20 | 596s | 400K | 74.6% ($0.86) | 30.9% ($2.77) | 37.5% ($1.34) | 73.2% ($0.56) | 65.5% ($0.48) |
| Gemini-3.1-Pro | 55.7 | $0.80 | 883s | 10M | 75.4% ($0.63) | 44.1% ($1.24) | 18.8% ($1.52) | 64.0% ($0.50) | 76.4% ($0.12) |
| Claude-Sonnet-4-5 | 53.0 | $1.57 | 583s | 200K | 74.2% ($1.19) | 36.8% ($1.89) | 12.5% ($2.90) | 68.8% ($0.98) | 72.7% ($0.87) |
| GLM-5 | 49.4 | $0.97 | 1323s | — | 73.4% ($1.06) | 35.3% ($0.58) | 31.2% ($1.96) | 47.3% ($0.91) | 60.0% ($0.36) |
| Kimi-K2.5 | 49.2 | $1.09 | 854s | 256K | 68.8% ($0.48) | 32.8% ($1.58) | 18.8% ($2.86) | 61.9% ($0.42) | 63.6% ($0.13) |
| Gemini-3-Pro | 49.0 | $1.42 | 1091s | 10M | 70.6% ($0.95) | 36.8% ($1.46) | 25.0% ($3.18) | 68.6% ($1.01) | 44.2% ($0.50) |
| Gemini-3-Flash | 49.0 | $0.64 | 726s | 1M | 74.6% ($0.42) | 22.1% ($0.80) | 18.8% ($1.28) | 70.7% ($0.30) | 58.8% ($0.38) |
| MiniMax-M2.5 | 46.5 | $0.13 | 715s | 1M | 72.6% ($0.10) | 25.0% ($0.15) | 18.8% ($0.29) | 68.1% ($0.07) | 47.9% ($0.02) |
| MiniMax-2.7 | 44.6 | $0.36 | 1032s | — | 75.6% ($0.17) | 27.9% ($0.33) | 25.0% ($0.94) | 69.1% ($0.13) | 25.5% ($0.21) |
| DeepSeek-V3.2 | 44.4 | $0.13 | 1124s | 128K | 71.6% ($0.16) | 27.9% ($0.19) | 18.8% ($0.12) | 53.6% ($0.12) | 50.3% ($0.06) |
| Claude-Sonnet-4-6 | 43.3 | $1.29 | 501s | 1M | 74.4% ($1.03) | 30.9% ($2.24) | 43.8% ($1.88) | 54.0% ($0.87) | 13.3% ($0.41) |
| GLM-4.7 | 41.0 | $0.44 | 968s | 203K | 73.4% ($0.56) | 22.1% ($0.66) | 6.2% ($0.47) | 49.4% ($0.37) | 53.9% ($0.15) |
| MiniMax-M2.1 | 39.9 | $0.22 | 1211s | 197K | 68.8% ($0.14) | 16.2% ($0.21) | 12.5% ($0.61) | 61.4% ($0.11) | 40.6% ($0.02) |
| Kimi-K2-Thinking | 39.7 | $1.76 | 1484s | 256K | 69.2% ($2.00) | 32.4% ($2.31) | 6.2% ($2.47) | 47.3% ($1.39) | 43.6% ($0.65) |
| Qwen3-Coder-480B | 30.9 | $0.92 | 502s | 262K | 62.4% ($1.26) | 23.5% ($2.09) | 0.0% ($0.01) | 34.9% ($0.97) | 33.9% ($0.28) |
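The Avg Score and Avg Cost columns are consistent with unweighted means of the five per-category values (e.g., for Claude-Opus-4-6: (76.8 + 41.8 + 56.2 + 78.8 + 80.0) / 5 = 66.7). A minimal sketch of that aggregation, assuming equal category weighting (the category names and values are taken from the table; the weighting itself is an inference, not stated by the source):

```python
# Recompute the aggregate columns from the five per-category cells.
# Values come from the Claude-Opus-4-6 row; each category maps to
# (success_rate_percent, avg_cost_usd_per_run).
categories = {
    "Issue Resolution": (76.8, 0.77),
    "Frontend":         (41.8, 2.37),
    "Greenfield":       (56.2, 1.70),
    "Testing":          (78.8, 0.43),
    "Info Gathering":   (80.0, 0.44),
}

# Unweighted mean over categories (assumed aggregation scheme).
avg_score = sum(score for score, _ in categories.values()) / len(categories)
avg_cost = sum(cost for _, cost in categories.values()) / len(categories)

print(f"Avg Score: {avg_score:.1f}")  # 66.7, matching the table
print(f"Avg Cost: ${avg_cost:.2f}")   # $1.14, matching the table
```

The same check holds for other rows (e.g., GPT-5.4: 318.8 / 5 = 63.8), which supports the equal-weighting reading.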

OpenHands Team (2025). OpenHands Index: A Comprehensive Leaderboard for AI Coding Agents. index.openhands.dev