Coding LLM Benchmark

eXalt Value presents

Business Benchmarks

Compare LLMs on real-world business tasks: translation quality, long document analysis, cognitive reasoning, and spreadsheet intelligence.

🌐

Translation

Multilingual translation quality, useful for teams localizing content, translating business documents or communicating with international clients.

FLORES-200 sacreBLEU COMET
# Model Price (In/Out) BLEU COMET Context
🥇 Gemini 3.1 Pro $2.00/$12.00 62.0 91.5 10M
🥈 Gemini 3 Pro $2.00/$12.00 61.0 91.2 10M
🥉 GPT-5.2 $1.75/$14.00 61.2 91.0 400K
4 GPT-5 $1.25/$10.00 60.5 90.2 400K
5 Gemini 2.5 Pro $1.25/$10.00 59.5 90.0 1M
6 Claude Opus 4.6 $5.00/$25.00 58.3 89.0 200K
7 Grok 4 $3.00/$15.00 58.0 88.8 256K
8 Claude Opus 4.5 $5.00/$25.00 57.8 88.5 200K
9 Qwen3 Coder 480B $0.90/$0.90 57.0 88.0 262K
10 Claude Sonnet 4.6 $3.00/$15.00 56.5 87.8 1M
11 DeepSeek V3.2 Thinking $0.27/$1.10 56.0 87.5 128K
12 OpenAI o3 $2.00/$8.00 56.8 87.5 200K
13 Gemini 3 Flash $0.50/$3.00 56.0 87.0 1M
14 DeepSeek V3.2 $0.27/$0.41 55.5 87.0 128K
15 Claude Sonnet 4.5 $3.00/$15.00 55.0 86.9 200K
16 GPT-4.1 $2.00/$8.00 55.0 86.5 1M
17 Mistral Large 3 $2.00/$6.00 54.5 86.0 131K
18 Kimi K2.5 $0.40/$1.75 54.0 85.5 256K
19 GPT-5 Mini $0.25/$2.00 53.0 85.5 200K
20 Kimi K2 Thinking $0.40/$1.75 53.5 85.0 256K
21 Gemini 2.5 Flash $0.15/$0.60 52.0 84.5 1M
22 Claude Haiku 4.5 $1.00/$5.00 51.2 84.0 200K
23 GLM 4.7 $0.38/$1.75 52.0 84.0 203K
24 MiniMax M2.5 $0.30/$1.20 50.0 83.0 1M
25 MiniMax M2.1 $0.23/$0.90 48.0 81.5 197K
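
To relate the Price (In/Out) column to an actual translation job, a rough cost estimate can be sketched as follows. This assumes the listed prices are USD per one million input and output tokens, which the page does not state explicitly. The token counts are a hypothetical workload, and `Pricing` and `job_cost` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Prices copied from the table; unit ASSUMED to be USD per 1M tokens."""
    model: str
    usd_in: float
    usd_out: float


def job_cost(p: Pricing, tokens_in: int, tokens_out: int) -> float:
    """Estimated cost in USD for one job under the per-million-token assumption."""
    return (tokens_in * p.usd_in + tokens_out * p.usd_out) / 1_000_000


PRICES = [
    Pricing("Gemini 3.1 Pro", 2.00, 12.00),
    Pricing("GPT-5.2", 1.75, 14.00),
    Pricing("DeepSeek V3.2", 0.27, 0.41),
]

# Hypothetical workload: 500K tokens of source text, 550K tokens of translation.
for p in PRICES:
    print(f"{p.model}: ${job_cost(p, 500_000, 550_000):.2f}")
```

Because output tokens are priced several times higher than input tokens for most models listed, the output side usually dominates the cost of translation work, where output length is close to input length.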
📖

Long Context / Large Documents

Ability to analyze long PDFs, large specifications, contracts, and multi-document sets, and to summarize lengthy material. Critical for legal, RFP, and audit teams.

LongBench v2 RULER
# Model Price (In/Out) LongBench v2 RULER Context
🥇 Gemini 3.1 Pro $2.00/$12.00 60.5% 95.8% 10M
🥈 GPT-5.2 $1.75/$14.00 58.5% 98.0% 400K
🥉 Gemini 3 Pro $2.00/$12.00 58.0% 95.0% 10M
4 GPT-5.2 Codex $1.75/$14.00 57.0% 96.0% 400K
5 GPT-5 $1.25/$10.00 56.8% 92.0% 400K
6 Gemini 2.5 Pro $1.25/$10.00 55.0% 95.8% 1M
7 Claude Opus 4.6 $5.00/$25.00 54.2% 76.0% 200K
8 Gemini 3 Flash $0.50/$3.00 52.5% 88.0% 1M
9 Claude Opus 4.5 $5.00/$25.00 52.1% 72.0% 200K
10 OpenAI o3 $2.00/$8.00 51.0% 80.0% 200K
11 Claude Sonnet 4.6 $3.00/$15.00 50.8% 82.0% 1M
12 Grok 4 $3.00/$15.00 50.5% 78.0% 256K
13 DeepSeek V3.2 Thinking $0.27/$1.10 50.0% 72.0% 128K
14 Qwen3 Coder 480B $0.90/$0.90 49.5% 76.0% 262K
15 GPT-5.1 Codex $1.25/$10.00 49.0% 85.0% 200K
16 DeepSeek V3.2 $0.27/$0.41 48.7% 70.0% 128K
17 Claude Sonnet 4.5 $3.00/$15.00 48.5% 68.0% 200K
18 Kimi K2.5 $0.40/$1.75 48.0% 76.0% 256K
19 Gemini 2.5 Flash $0.15/$0.60 48.0% 82.0% 1M
20 Kimi K2 Thinking $0.40/$1.75 47.0% 74.0% 256K
21 GPT-4.1 $2.00/$8.00 46.5% 80.0% 1M
22 MiniMax M2.5 $0.30/$1.20 46.0% 72.0% 1M
23 GPT-5 Mini $0.25/$2.00 45.2% 78.0% 200K
24 GLM 4.7 $0.38/$1.75 44.5% 68.0% 203K
25 Claude Haiku 4.5 $1.00/$5.00 44.0% 60.0% 200K
26 Mistral Large 3 $2.00/$6.00 43.0% 62.0% 131K
27 MiniMax M2.1 $0.23/$0.90 42.0% 65.0% 197K
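
The Context column also works as a first filter before comparing scores: a model cannot analyze a document that does not fit in its window. The sketch below assumes that "K" means 1,000 tokens and "M" means 1,000,000 tokens, which the page does not define. The `reserve` parameter (room for the prompt and the answer) is a hypothetical value, and only a few table rows are copied in for brevity.

```python
def parse_context(label: str) -> int:
    """Convert a Context label such as '400K' or '10M' to a token count.

    ASSUMPTION: K = 1,000 tokens and M = 1,000,000 tokens.
    """
    units = {"K": 1_000, "M": 1_000_000}
    return int(float(label[:-1]) * units[label[-1].upper()])


# (model, LongBench v2 %, Context) copied from the table above.
ROWS = [
    ("Gemini 3.1 Pro", 60.5, "10M"),
    ("GPT-5.2", 58.5, "400K"),
    ("Claude Opus 4.6", 54.2, "200K"),
    ("Claude Sonnet 4.6", 50.8, "1M"),
]


def models_that_fit(doc_tokens: int, reserve: int = 8_000) -> list[str]:
    """Models whose window holds the document plus a reserve for prompt and answer."""
    needed = doc_tokens + reserve
    return [model for model, _, ctx in ROWS if parse_context(ctx) >= needed]


# Hypothetical 300K-token contract bundle.
print(models_that_fit(300_000))
```

A fitting window is a necessary condition, not a quality guarantee: the RULER column shows that retrieval accuracy differs widely between models with similar context sizes.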
🧠

Reasoning / Problem Solving

Overall cognitive ability: logical reasoning, expert knowledge, and mathematics. Use it to judge structured analysis and problem-solving capabilities.

MMLU-Pro GPQA Diamond
# Model Price (In/Out) MMLU-Pro GPQA Diamond Context
🥇 Gemini 3.1 Pro $2.00/$12.00 90.2% 94.3% 10M
🥈 Gemini 3 Pro $2.00/$12.00 89.8% 91.9% 10M
🥉 Claude Opus 4.5 $5.00/$25.00 89.5% 86.5% 200K
4 Claude Opus 4.6 $5.00/$25.00 88.2% 89.0% 200K
5 Claude Sonnet 4.6 $3.00/$15.00 87.5% 89.9% 1M
6 Claude Sonnet 4.5 $3.00/$15.00 87.5% 84.2% 200K
7 GPT-5.2 $1.75/$14.00 87.4% 92.4% 400K
8 Gemini 2.5 Pro $1.25/$10.00 87.2% 85.0% 1M
9 GPT-5 $1.25/$10.00 87.1% 87.0% 400K
10 GPT-5.1 Codex $1.25/$10.00 87.0% 88.1% 200K
11 GPT-4.1 $2.00/$8.00 86.5% 82.0% 1M
12 Grok 4 $3.00/$15.00 86.4% 88.9% 256K
13 OpenAI o3 $2.00/$8.00 86.0% 85.0% 200K
14 DeepSeek V3.2 Thinking $0.27/$1.10 85.9% 85.3% 128K
15 Kimi K2 Thinking $0.40/$1.75 84.6% 80.0% 256K
16 Gemini 3 Flash $0.50/$3.00 84.5% 90.4% 1M
17 Qwen3 Coder 480B $0.90/$0.90 83.0% 80.0% 262K
18 DeepSeek V3.2 $0.27/$0.41 82.5% 78.0% 128K
19 Mistral Large 3 $2.00/$6.00 81.0% 75.0% 131K
20 GPT-5 Mini $0.25/$2.00 79.5% 74.0% 200K
21 Kimi K2.5 $0.40/$1.75 78.5% 75.0% 256K
22 Claude Haiku 4.5 $1.00/$5.00 78.0% 72.5% 200K
23 GLM 4.7 $0.38/$1.75 78.0% 72.0% 203K
24 MiniMax M2.5 $0.30/$1.20 76.0% 70.0% 1M
25 Gemini 2.5 Flash $0.15/$0.60 76.0% 70.0% 1M
26 MiniMax M2.1 $0.23/$0.90 72.0% 65.0% 197K
📊

Spreadsheet / Data Analysis

Analyze and manipulate large spreadsheets, create charts, and find insights in data. For data, finance and business analyst teams.

SpreadsheetBench FinSheet-Bench
# Model Price (In/Out) SpreadsheetBench FinSheet-Bench Context
🥇 Gemini 3.1 Pro $2.00/$12.00 52.0% 82.4% 10M
🥈 GPT-5.2 $1.75/$14.00 48.2% 80.4% 400K
🥉 Claude Opus 4.6 $5.00/$25.00 42.9% 80.2% 200K
4 Gemini 3 Pro $2.00/$12.00 50.5% 80.2% 10M
5 GPT-5 $1.25/$10.00 45.5% 78.5% 400K
6 Claude Opus 4.5 $5.00/$25.00 41.5% 78.0% 200K
7 Claude Sonnet 4.6 $3.00/$15.00 40.2% 76.5% 1M
8 Gemini 2.5 Pro $1.25/$10.00 46.0% 75.5% 1M
9 Grok 4 $3.00/$15.00 41.0% 74.5% 256K
10 OpenAI o3 $2.00/$8.00 42.0% 74.0% 200K
11 Claude Sonnet 4.5 $3.00/$15.00 38.5% 73.0% 200K
12 Gemini 3 Flash $0.50/$3.00 44.0% 72.0% 1M
13 DeepSeek V3.2 Thinking $0.27/$1.10 38.0% 70.0% 128K
14 GPT-4.1 $2.00/$8.00 40.0% 70.0% 1M
15 DeepSeek V3.2 $0.27/$0.41 36.0% 68.0% 128K
16 Qwen3 Coder 480B $0.90/$0.90 37.0% 66.0% 262K
17 Kimi K2 Thinking $0.40/$1.75 34.0% 66.0% 256K
18 GPT-5 Mini $0.25/$2.00 35.0% 65.0% 200K
19 Kimi K2.5 $0.40/$1.75 35.0% 64.0% 256K
20 Claude Haiku 4.5 $1.00/$5.00 32.0% 62.0% 200K
21 Mistral Large 3 $2.00/$6.00 34.0% 62.0% 131K
22 Gemini 2.5 Flash $0.15/$0.60 38.0% 60.0% 1M
23 MiniMax M2.5 $0.30/$1.20 33.0% 58.0% 1M
24 GLM 4.7 $0.38/$1.75 32.0% 58.0% 203K
25 MiniMax M2.1 $0.23/$0.90 30.0% 52.0% 197K
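
The row order above appears to follow the FinSheet-Bench column, so a team that cares more about SpreadsheetBench may want to re-rank. The sketch below re-sorts the first four rows by either column. `rank_by` is an illustrative helper, not part of any benchmark tooling.

```python
# (model, SpreadsheetBench %, FinSheet-Bench %) copied from the first four table rows.
ROWS = [
    ("Gemini 3.1 Pro", 52.0, 82.4),
    ("GPT-5.2", 48.2, 80.4),
    ("Claude Opus 4.6", 42.9, 80.2),
    ("Gemini 3 Pro", 50.5, 80.2),
]


def rank_by(rows, column):
    """Return model names sorted by the given score column, highest first.

    column is 1 for SpreadsheetBench, 2 for FinSheet-Bench. sorted() is stable,
    so rows with tied scores keep their original relative order.
    """
    return [row[0] for row in sorted(rows, key=lambda r: r[column], reverse=True)]


by_finsheet = rank_by(ROWS, 2)
by_spreadsheet = rank_by(ROWS, 1)
print(by_finsheet)
print(by_spreadsheet)
```

Sorting by SpreadsheetBench moves Gemini 3 Pro from fourth to second among these four rows, which shows how much a single-column ranking depends on the metric chosen.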