Dataset Preview
The full dataset viewer is not available; only a preview of the rows is shown.
Dataset generation failed because of a cast error.
Error code:   DatasetGenerationCastError
Message:      An error occurred while generating the dataset

All the data files must have the same columns, but at some point there are 12 new columns ({'month', 'milestone_name', 'is_reasoning', 'is_research', 'date', 'is_product_launch', 'is_open_source', 'description', 'event_id', 'significance_score', 'year', 'milestone_type'}) and 8 missing columns ({'model_id', 'model_name', 'benchmark', 'score_pct', 'score', 'release_date', 'max_score', 'benchmark_type'}).

This happened while the csv dataset builder was generating data using

hf://datasets/hmnshudhmn24/llm-benchmarks-capabilities-2020-2026/capability_milestones.csv (at revision f04fa5f30fb56ff2fb35af483221bce9e8dc9a67), ['hf://datasets/hmnshudhmn24/llm-benchmarks-capabilities-2020-2026@f04fa5f30fb56ff2fb35af483221bce9e8dc9a67/benchmark_scores.csv', 'hf://datasets/hmnshudhmn24/llm-benchmarks-capabilities-2020-2026@f04fa5f30fb56ff2fb35af483221bce9e8dc9a67/capability_milestones.csv', 'hf://datasets/hmnshudhmn24/llm-benchmarks-capabilities-2020-2026@f04fa5f30fb56ff2fb35af483221bce9e8dc9a67/compute_estimates.csv', 'hf://datasets/hmnshudhmn24/llm-benchmarks-capabilities-2020-2026@f04fa5f30fb56ff2fb35af483221bce9e8dc9a67/models_catalog.csv', 'hf://datasets/hmnshudhmn24/llm-benchmarks-capabilities-2020-2026@f04fa5f30fb56ff2fb35af483221bce9e8dc9a67/pricing_history.csv']

Please either edit the data files to have matching columns, or separate them into different configurations (see docs at https://hf.co/docs/hub/datasets-manual-configuration#multiple-configurations)
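The mismatch reported in this message is a set difference between the column names of two files. A minimal sketch that reproduces the counts, assuming the column names the viewer reported for `benchmark_scores.csv` and `capability_milestones.csv` (`column_diff` is a hypothetical helper, not part of any library):

```python
# Column names as reported by the dataset viewer for two of the five files.
benchmark_scores_cols = {
    "model_id", "model_name", "organization", "release_date",
    "benchmark", "benchmark_type", "score", "max_score", "score_pct",
}
capability_milestones_cols = {
    "event_id", "date", "year", "month", "milestone_name", "organization",
    "milestone_type", "significance_score", "description",
    "is_product_launch", "is_research", "is_open_source", "is_reasoning",
}


def column_diff(reference, other):
    """Return (new, missing) column names of `other` relative to `reference`."""
    new = sorted(other - reference)
    missing = sorted(reference - other)
    return new, missing


new_cols, missing_cols = column_diff(benchmark_scores_cols, capability_milestones_cols)
print(len(new_cols), "new columns:", new_cols)
print(len(missing_cols), "missing columns:", missing_cols)
```

Only `organization` is shared between the two files, which is why the builder cannot cast one file to the schema of the other and each file has to be read as its own table.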

Column types: model_id, model_name, organization, release_date, benchmark, and benchmark_type are string; score and score_pct are float64; max_score is int64.

| model_id | model_name | organization | release_date | benchmark | benchmark_type | score | max_score | score_pct |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M001 | GPT-3 (175B) | OpenAI | 2020-06-11 | MMLU | knowledge | 32.83 | 100 | 32.83 |
| M001 | GPT-3 (175B) | OpenAI | 2020-06-11 | HellaSwag | commonsense | 55.08 | 100 | 55.08 |
| M001 | GPT-3 (175B) | OpenAI | 2020-06-11 | ARC-Challenge | reasoning | 49.43 | 100 | 49.43 |
| M002 | T5-XXL | Google | 2020-02-24 | MMLU | knowledge | 24.38 | 100 | 24.38 |
| M002 | T5-XXL | Google | 2020-02-24 | HellaSwag | commonsense | 49.3 | 100 | 49.3 |
| M002 | T5-XXL | Google | 2020-02-24 | ARC-Challenge | reasoning | 47.24 | 100 | 47.24 |
| M003 | Turing-NLG | Microsoft | 2020-02-13 | MMLU | knowledge | 30.64 | 100 | 30.64 |
| M003 | Turing-NLG | Microsoft | 2020-02-13 | HellaSwag | commonsense | 52.13 | 100 | 52.13 |
| M003 | Turing-NLG | Microsoft | 2020-02-13 | ARC-Challenge | reasoning | 52.07 | 100 | 52.07 |
| M004 | GShard | Google | 2020-06-30 | MMLU | knowledge | 35.61 | 100 | 35.61 |
| M004 | GShard | Google | 2020-06-30 | HellaSwag | commonsense | 54.48 | 100 | 54.48 |
| M004 | GShard | Google | 2020-06-30 | ARC-Challenge | reasoning | 51.88 | 100 | 51.88 |
| M005 | CTRL | Salesforce | 2020-09-11 | MMLU | knowledge | 38.09 | 100 | 38.09 |
| M005 | CTRL | Salesforce | 2020-09-11 | HellaSwag | commonsense | 76.22 | 100 | 76.22 |
| M005 | CTRL | Salesforce | 2020-09-11 | ARC-Challenge | reasoning | 63.01 | 100 | 63.01 |
| M006 | BlenderBot | Meta | 2020-04-30 | MMLU | knowledge | 24.07 | 100 | 24.07 |
| M006 | BlenderBot | Meta | 2020-04-30 | HellaSwag | commonsense | 49.88 | 100 | 49.88 |
| M006 | BlenderBot | Meta | 2020-04-30 | ARC-Challenge | reasoning | 46.41 | 100 | 46.41 |
| M007 | Switch Transformer | Google | 2021-01-11 | MMLU | knowledge | 64.46 | 100 | 64.46 |
| M007 | Switch Transformer | Google | 2021-01-11 | HumanEval | coding | 28.12 | 100 | 28.12 |
| M007 | Switch Transformer | Google | 2021-01-11 | GSM8K | math_grade | 20.04 | 100 | 20.04 |
| M007 | Switch Transformer | Google | 2021-01-11 | MATH | math_competition | 7.92 | 100 | 7.92 |
| M007 | Switch Transformer | Google | 2021-01-11 | HellaSwag | commonsense | 87.7 | 100 | 87.7 |
| M007 | Switch Transformer | Google | 2021-01-11 | ARC-Challenge | reasoning | 67 | 100 | 67 |
| M008 | Codex | OpenAI | 2021-08-10 | MMLU | knowledge | 53.1 | 100 | 53.1 |
| M008 | Codex | OpenAI | 2021-08-10 | HumanEval | coding | 46.31 | 100 | 46.31 |
| M008 | Codex | OpenAI | 2021-08-10 | GSM8K | math_grade | 34.76 | 100 | 34.76 |
| M008 | Codex | OpenAI | 2021-08-10 | MATH | math_competition | 3.25 | 100 | 3.25 |
| M008 | Codex | OpenAI | 2021-08-10 | HellaSwag | commonsense | 80.72 | 100 | 80.72 |
| M008 | Codex | OpenAI | 2021-08-10 | ARC-Challenge | reasoning | 61.03 | 100 | 61.03 |
| M009 | Jurassic-1 Jumbo | AI21 | 2021-08-11 | MMLU | knowledge | 56.26 | 100 | 56.26 |
| M009 | Jurassic-1 Jumbo | AI21 | 2021-08-11 | HumanEval | coding | 41.6 | 100 | 41.6 |
| M009 | Jurassic-1 Jumbo | AI21 | 2021-08-11 | GSM8K | math_grade | 32.28 | 100 | 32.28 |
| M009 | Jurassic-1 Jumbo | AI21 | 2021-08-11 | MATH | math_competition | 8.49 | 100 | 8.49 |
| M009 | Jurassic-1 Jumbo | AI21 | 2021-08-11 | HellaSwag | commonsense | 87.97 | 100 | 87.97 |
| M009 | Jurassic-1 Jumbo | AI21 | 2021-08-11 | ARC-Challenge | reasoning | 63.93 | 100 | 63.93 |
| M010 | Megatron-Turing NLG | Microsoft+NVIDIA | 2021-10-11 | MMLU | knowledge | 63.83 | 100 | 63.83 |
| M010 | Megatron-Turing NLG | Microsoft+NVIDIA | 2021-10-11 | HumanEval | coding | 36.27 | 100 | 36.27 |
| M010 | Megatron-Turing NLG | Microsoft+NVIDIA | 2021-10-11 | GSM8K | math_grade | 34.38 | 100 | 34.38 |
| M010 | Megatron-Turing NLG | Microsoft+NVIDIA | 2021-10-11 | MATH | math_competition | 6.2 | 100 | 6.2 |
| M010 | Megatron-Turing NLG | Microsoft+NVIDIA | 2021-10-11 | HellaSwag | commonsense | 85.41 | 100 | 85.41 |
| M010 | Megatron-Turing NLG | Microsoft+NVIDIA | 2021-10-11 | ARC-Challenge | reasoning | 67.52 | 100 | 67.52 |
| M011 | Gopher | DeepMind | 2021-12-08 | MMLU | knowledge | 55.05 | 100 | 55.05 |
| M011 | Gopher | DeepMind | 2021-12-08 | HumanEval | coding | 41.74 | 100 | 41.74 |
| M011 | Gopher | DeepMind | 2021-12-08 | GSM8K | math_grade | 35.71 | 100 | 35.71 |
| M011 | Gopher | DeepMind | 2021-12-08 | MATH | math_competition | 7.62 | 100 | 7.62 |
| M011 | Gopher | DeepMind | 2021-12-08 | HellaSwag | commonsense | 82.78 | 100 | 82.78 |
| M011 | Gopher | DeepMind | 2021-12-08 | ARC-Challenge | reasoning | 65.22 | 100 | 65.22 |
| M012 | GLaM | Google | 2021-12-09 | MMLU | knowledge | 61.41 | 100 | 61.41 |
| M012 | GLaM | Google | 2021-12-09 | HumanEval | coding | 40.93 | 100 | 40.93 |
| M012 | GLaM | Google | 2021-12-09 | GSM8K | math_grade | 35.07 | 100 | 35.07 |
| M012 | GLaM | Google | 2021-12-09 | MATH | math_competition | 16.41 | 100 | 16.41 |
| M012 | GLaM | Google | 2021-12-09 | HellaSwag | commonsense | 86.14 | 100 | 86.14 |
| M012 | GLaM | Google | 2021-12-09 | ARC-Challenge | reasoning | 61.93 | 100 | 61.93 |
| M013 | WuDao 2.0 | BAAI | 2021-06-01 | MMLU | knowledge | 63.54 | 100 | 63.54 |
| M013 | WuDao 2.0 | BAAI | 2021-06-01 | HumanEval | coding | 30.88 | 100 | 30.88 |
| M013 | WuDao 2.0 | BAAI | 2021-06-01 | GSM8K | math_grade | 24.08 | 100 | 24.08 |
| M013 | WuDao 2.0 | BAAI | 2021-06-01 | MATH | math_competition | 9.34 | 100 | 9.34 |
| M013 | WuDao 2.0 | BAAI | 2021-06-01 | HellaSwag | commonsense | 87.58 | 100 | 87.58 |
| M013 | WuDao 2.0 | BAAI | 2021-06-01 | ARC-Challenge | reasoning | 69.39 | 100 | 69.39 |
| M014 | ERNIE 3.0 | Baidu | 2021-07-05 | MMLU | knowledge | 49.43 | 100 | 49.43 |
| M014 | ERNIE 3.0 | Baidu | 2021-07-05 | HumanEval | coding | 37.44 | 100 | 37.44 |
| M014 | ERNIE 3.0 | Baidu | 2021-07-05 | GSM8K | math_grade | 32.22 | 100 | 32.22 |
| M014 | ERNIE 3.0 | Baidu | 2021-07-05 | MATH | math_competition | 8.44 | 100 | 8.44 |
| M014 | ERNIE 3.0 | Baidu | 2021-07-05 | HellaSwag | commonsense | 80.75 | 100 | 80.75 |
| M014 | ERNIE 3.0 | Baidu | 2021-07-05 | ARC-Challenge | reasoning | 66.91 | 100 | 66.91 |
| M015 | HyperCLOVA | Naver | 2021-05-25 | MMLU | knowledge | 58.88 | 100 | 58.88 |
| M015 | HyperCLOVA | Naver | 2021-05-25 | HumanEval | coding | 24.12 | 100 | 24.12 |
| M015 | HyperCLOVA | Naver | 2021-05-25 | GSM8K | math_grade | 16.95 | 100 | 16.95 |
| M015 | HyperCLOVA | Naver | 2021-05-25 | MATH | math_competition | 7.25 | 100 | 7.25 |
| M015 | HyperCLOVA | Naver | 2021-05-25 | HellaSwag | commonsense | 87.14 | 100 | 87.14 |
| M015 | HyperCLOVA | Naver | 2021-05-25 | ARC-Challenge | reasoning | 70.99 | 100 | 70.99 |
| M016 | GPT-3.5 (text-davinci-002) | OpenAI | 2022-03-15 | MMLU | knowledge | 77.96 | 100 | 77.96 |
| M016 | GPT-3.5 (text-davinci-002) | OpenAI | 2022-03-15 | HumanEval | coding | 61.66 | 100 | 61.66 |
| M016 | GPT-3.5 (text-davinci-002) | OpenAI | 2022-03-15 | GSM8K | math_grade | 52.03 | 100 | 52.03 |
| M016 | GPT-3.5 (text-davinci-002) | OpenAI | 2022-03-15 | MATH | math_competition | 13.47 | 100 | 13.47 |
| M016 | GPT-3.5 (text-davinci-002) | OpenAI | 2022-03-15 | HellaSwag | commonsense | 89.9 | 100 | 89.9 |
| M016 | GPT-3.5 (text-davinci-002) | OpenAI | 2022-03-15 | ARC-Challenge | reasoning | 77.56 | 100 | 77.56 |
| M016 | GPT-3.5 (text-davinci-002) | OpenAI | 2022-03-15 | TruthfulQA | truthfulness | 27.92 | 100 | 27.92 |
| M017 | InstructGPT | OpenAI | 2022-01-27 | MMLU | knowledge | 67.38 | 100 | 67.38 |
| M017 | InstructGPT | OpenAI | 2022-01-27 | HumanEval | coding | 59.44 | 100 | 59.44 |
| M017 | InstructGPT | OpenAI | 2022-01-27 | GSM8K | math_grade | 58.2 | 100 | 58.2 |
| M017 | InstructGPT | OpenAI | 2022-01-27 | MATH | math_competition | 22.63 | 100 | 22.63 |
| M017 | InstructGPT | OpenAI | 2022-01-27 | HellaSwag | commonsense | 89.05 | 100 | 89.05 |
| M017 | InstructGPT | OpenAI | 2022-01-27 | ARC-Challenge | reasoning | 76.89 | 100 | 76.89 |
| M017 | InstructGPT | OpenAI | 2022-01-27 | TruthfulQA | truthfulness | 21.63 | 100 | 21.63 |
| M018 | Chinchilla | DeepMind | 2022-03-29 | MMLU | knowledge | 65.77 | 100 | 65.77 |
| M018 | Chinchilla | DeepMind | 2022-03-29 | HumanEval | coding | 59.78 | 100 | 59.78 |
| M018 | Chinchilla | DeepMind | 2022-03-29 | GSM8K | math_grade | 60.3 | 100 | 60.3 |
| M018 | Chinchilla | DeepMind | 2022-03-29 | MATH | math_competition | 22.52 | 100 | 22.52 |
| M018 | Chinchilla | DeepMind | 2022-03-29 | HellaSwag | commonsense | 90.73 | 100 | 90.73 |
| M018 | Chinchilla | DeepMind | 2022-03-29 | ARC-Challenge | reasoning | 78.5 | 100 | 78.5 |
| M018 | Chinchilla | DeepMind | 2022-03-29 | TruthfulQA | truthfulness | 24.08 | 100 | 24.08 |
| M019 | PaLM | Google | 2022-04-04 | MMLU | knowledge | 68.77 | 100 | 68.77 |
| M019 | PaLM | Google | 2022-04-04 | HumanEval | coding | 57.1 | 100 | 57.1 |
| M019 | PaLM | Google | 2022-04-04 | GSM8K | math_grade | 61.39 | 100 | 61.39 |
| M019 | PaLM | Google | 2022-04-04 | MATH | math_competition | 18.71 | 100 | 18.71 |
| M019 | PaLM | Google | 2022-04-04 | HellaSwag | commonsense | 91.78 | 100 | 91.78 |
| M019 | PaLM | Google | 2022-04-04 | ARC-Challenge | reasoning | 77.29 | 100 | 77.29 |
| M019 | PaLM | Google | 2022-04-04 | TruthfulQA | truthfulness | 34.66 | 100 | 34.66 |
End of preview.

📊 LLM Benchmarks & Capabilities 2020–2026

An open dataset tracking the evolution of large language models, from GPT-3 to GPT-5.5, Claude Opus 4.7, Gemini 3.5, and beyond.


🧭 Overview

This dataset captures the complete LLM landscape from 2020 to 2026 across five dimensions:

  • 🤖 113 models from 25+ organizations
  • 📈 17 benchmarks tracking capability growth over time
  • 💰 Monthly API pricing showing 100x+ cost reductions
  • ⚙️ Training compute estimates validating scaling laws
  • 🏁 57 capability milestones marking key inflection points

Designed for ML researchers, AI practitioners, policy analysts, and data scientists working on LLM-related problems: trend analysis, capability forecasting, cost-performance modeling, and competitive intelligence.


πŸ“ Dataset Files

File Rows Description
models_catalog.csv 113 Model metadata: org, release date, params, type, access
benchmark_scores.csv 1,276 Long format: model Γ— benchmark Γ— score
pricing_history.csv 1,187 Monthly API pricing per model (USD per 1M tokens)
compute_estimates.csv 113 Training FLOPs, GPU hours, cost, energy, CO2
capability_milestones.csv 57 Major AI events with significance scores

Total: ~2,750 rows Β· ~250 KB
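Because `benchmark_scores.csv` is stored in long format (one row per model and benchmark), a per-model comparison usually starts with a pivot to wide format. A minimal sketch using four rows copied from the dataset preview; the variable names are illustrative only:

```python
import pandas as pd

# Four rows copied from the preview of benchmark_scores.csv (long format).
scores = pd.DataFrame(
    {
        "model_name": ["GPT-3 (175B)", "GPT-3 (175B)", "Codex", "Codex"],
        "benchmark": ["MMLU", "HellaSwag", "MMLU", "HumanEval"],
        "score_pct": [32.83, 55.08, 53.1, 46.31],
    }
)

# One row per model, one column per benchmark.
# NaN marks a benchmark with no score for that model in this sample.
wide = scores.pivot(index="model_name", columns="benchmark", values="score_pct")
print(wide)
```

On the full file the same call should apply as long as each model and benchmark pair appears only once; if duplicates exist, `pivot_table` with an explicit aggregation would be the safer choice.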


🏒 Organizations Covered

Closed Frontier: OpenAI · Anthropic · Google DeepMind · xAI · Microsoft

Open Weights: Meta · DeepSeek · Mistral · Alibaba Qwen · 01.AI · TII

Early Era: Chinchilla · PaLM · BLOOM · OPT · GLaM · Switch Transformer

Chinese Frontier: WuDao 2.0 · ERNIE 3.0 · HyperCLOVA · YaLM 100B


πŸ“ Benchmarks Tracked

Benchmark Type Max Score
MMLU Knowledge 100
MMLU-Pro Knowledge Hard 100
HumanEval / HumanEval+ Coding 100
MBPP Coding 100
GSM8K Math Grade 100
MATH Math Competition 100
AIME 2024 Math Olympiad 100
GPQA Diamond Science PhD 100
HellaSwag Commonsense 100
ARC-Challenge Reasoning 100
TruthfulQA Truthfulness 100
BBH (BIG-Bench Hard) Reasoning Hard 100
SWE-Bench Verified Agentic Coding 100
LiveCodeBench Live Coding 100
MMMU Multimodal 100
Chatbot Arena ELO Human Eval 1500
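Since Chatbot Arena ELO uses a different maximum than the other benchmarks, comparing across benchmarks needs a common scale. The sketch below assumes `score_pct` is computed as `100 * score / max_score`; this matches the preview rows, where `max_score` is 100 and `score_pct` equals `score`, but the source does not confirm the formula for Arena ratings. The Arena rating of 1200 is a made-up value for illustration.

```python
import pandas as pd

rows = pd.DataFrame(
    {
        "benchmark": ["MMLU", "Chatbot Arena ELO"],
        # 32.83 comes from the preview; 1200.0 is a hypothetical Arena rating.
        "score": [32.83, 1200.0],
        "max_score": [100, 1500],
    }
)

# Assumed normalization: percentage of the benchmark's maximum score.
rows["score_pct"] = (100 * rows["score"] / rows["max_score"]).round(2)
print(rows)
```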

⚑ Quick Start

import pandas as pd  # reading hf:// paths also requires the huggingface_hub package

# Load models catalog
models = pd.read_csv("hf://datasets/hmnshudhmn24/llm-benchmarks-capabilities-2020-2026/models_catalog.csv")

# Load benchmark scores
scores = pd.read_csv("hf://datasets/hmnshudhmn24/llm-benchmarks-capabilities-2020-2026/benchmark_scores.csv")

# Load pricing history
pricing = pd.read_csv("hf://datasets/hmnshudhmn24/llm-benchmarks-capabilities-2020-2026/pricing_history.csv")

# Load compute estimates
compute = pd.read_csv("hf://datasets/hmnshudhmn24/llm-benchmarks-capabilities-2020-2026/compute_estimates.csv")

# Load capability milestones
milestones = pd.read_csv("hf://datasets/hmnshudhmn24/llm-benchmarks-capabilities-2020-2026/capability_milestones.csv")

print(models.head())
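The five files have different columns, so each one is best kept as its own table. A minimal sketch that reads them into a dictionary keyed by file name; `table_url` and `load_tables` are hypothetical helpers, and the `reader` parameter exists only so that the function can be exercised without network access:

```python
import pandas as pd

REPO = "hf://datasets/hmnshudhmn24/llm-benchmarks-capabilities-2020-2026"
TABLES = [
    "models_catalog",
    "benchmark_scores",
    "pricing_history",
    "compute_estimates",
    "capability_milestones",
]


def table_url(name):
    """Build the hf:// path of one CSV file in the dataset repository."""
    return f"{REPO}/{name}.csv"


def load_tables(names=TABLES, reader=pd.read_csv):
    """Read each CSV into its own DataFrame, keyed by table name."""
    return {name: reader(table_url(name)) for name in names}


# Usage (needs network access and the huggingface_hub package):
# tables = load_tables()
# print({name: df.shape for name, df in tables.items()})
```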

💡 Suggested Use Cases

  1. Benchmark Saturation Forecasting: MMLU went from 32% (2020) to 95% (2025). Predict saturation for GPQA Diamond and SWE-Bench.
  2. Price vs Capability Analysis: Plot Arena ELO vs blended API price. Track Pareto-optimal models per quarter.
  3. Open vs Closed Model Gap: Measure the capability gap between open weights and closed models across time.
  4. Scaling Law Validation: Plot training FLOPs vs benchmark scores. Test empirical scaling exponents.
  5. Reasoning Model Premium: Measure score lift of reasoning models vs chat models on math and coding benchmarks.
  6. Cost Reduction Trajectory: Track price per Arena ELO point over time. How fast is intelligence-per-dollar growing?
  7. Competitive Organization Analysis: Head-to-head benchmark matrix per organization per quarter.
  8. Chinese vs Western Capability Race: DeepSeek, Qwen, ERNIE vs OpenAI, Anthropic, Google.
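As one possible starting point for the saturation use case, the best score per year and its running maximum give a simple frontier curve. This is a sketch on six MMLU rows copied from the dataset preview, not a forecasting method; the `frontier` and `headroom` names are illustrative.

```python
import pandas as pd

# MMLU rows copied from the dataset preview (one model per row).
mmlu = pd.DataFrame(
    {
        "model_name": [
            "GPT-3 (175B)", "CTRL", "Switch Transformer",
            "Codex", "GPT-3.5 (text-davinci-002)", "PaLM",
        ],
        "release_date": [
            "2020-06-11", "2020-09-11", "2021-01-11",
            "2021-08-10", "2022-03-15", "2022-04-04",
        ],
        "score_pct": [32.83, 38.09, 64.46, 53.1, 77.96, 68.77],
    }
)
mmlu["year"] = pd.to_datetime(mmlu["release_date"]).dt.year

# Best score released in each year, then the running best so far.
best_per_year = mmlu.groupby("year")["score_pct"].max()
frontier = best_per_year.cummax()

# Remaining distance to the benchmark maximum of 100.
headroom = 100 - frontier
print(pd.DataFrame({"best": best_per_year, "frontier": frontier, "headroom": headroom}))
```

Fitting a curve to `headroom` over time (for example, an exponential decay) would be the next step toward a saturation estimate; which functional form fits best is left open here.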