
model_id (string)	model_name (string)	organization (string)	release_date (string)	benchmark (string)	benchmark_type (string)	score (float64)	max_score (int64)	score_pct (float64)
M001	GPT-3 (175B)	OpenAI	2020-06-11	MMLU	knowledge	32.83	100	32.83
M001	GPT-3 (175B)	OpenAI	2020-06-11	HellaSwag	commonsense	55.08	100	55.08
M001	GPT-3 (175B)	OpenAI	2020-06-11	ARC-Challenge	reasoning	49.43	100	49.43
M002	T5-XXL	Google	2020-02-24	MMLU	knowledge	24.38	100	24.38
M002	T5-XXL	Google	2020-02-24	HellaSwag	commonsense	49.3	100	49.3
M002	T5-XXL	Google	2020-02-24	ARC-Challenge	reasoning	47.24	100	47.24
M003	Turing-NLG	Microsoft	2020-02-13	MMLU	knowledge	30.64	100	30.64
M003	Turing-NLG	Microsoft	2020-02-13	HellaSwag	commonsense	52.13	100	52.13
M003	Turing-NLG	Microsoft	2020-02-13	ARC-Challenge	reasoning	52.07	100	52.07
M004	GShard	Google	2020-06-30	MMLU	knowledge	35.61	100	35.61
M004	GShard	Google	2020-06-30	HellaSwag	commonsense	54.48	100	54.48
M004	GShard	Google	2020-06-30	ARC-Challenge	reasoning	51.88	100	51.88
M005	CTRL	Salesforce	2020-09-11	MMLU	knowledge	38.09	100	38.09
M005	CTRL	Salesforce	2020-09-11	HellaSwag	commonsense	76.22	100	76.22
M005	CTRL	Salesforce	2020-09-11	ARC-Challenge	reasoning	63.01	100	63.01
M006	BlenderBot	Meta	2020-04-30	MMLU	knowledge	24.07	100	24.07
M006	BlenderBot	Meta	2020-04-30	HellaSwag	commonsense	49.88	100	49.88
M006	BlenderBot	Meta	2020-04-30	ARC-Challenge	reasoning	46.41	100	46.41
M007	Switch Transformer	Google	2021-01-11	MMLU	knowledge	64.46	100	64.46
M007	Switch Transformer	Google	2021-01-11	HumanEval	coding	28.12	100	28.12
M007	Switch Transformer	Google	2021-01-11	GSM8K	math_grade	20.04	100	20.04
M007	Switch Transformer	Google	2021-01-11	MATH	math_competition	7.92	100	7.92
M007	Switch Transformer	Google	2021-01-11	HellaSwag	commonsense	87.7	100	87.7
M007	Switch Transformer	Google	2021-01-11	ARC-Challenge	reasoning	67	100	67
M008	Codex	OpenAI	2021-08-10	MMLU	knowledge	53.1	100	53.1
M008	Codex	OpenAI	2021-08-10	HumanEval	coding	46.31	100	46.31
M008	Codex	OpenAI	2021-08-10	GSM8K	math_grade	34.76	100	34.76
M008	Codex	OpenAI	2021-08-10	MATH	math_competition	3.25	100	3.25
M008	Codex	OpenAI	2021-08-10	HellaSwag	commonsense	80.72	100	80.72
M008	Codex	OpenAI	2021-08-10	ARC-Challenge	reasoning	61.03	100	61.03
M009	Jurassic-1 Jumbo	AI21	2021-08-11	MMLU	knowledge	56.26	100	56.26
M009	Jurassic-1 Jumbo	AI21	2021-08-11	HumanEval	coding	41.6	100	41.6
M009	Jurassic-1 Jumbo	AI21	2021-08-11	GSM8K	math_grade	32.28	100	32.28
M009	Jurassic-1 Jumbo	AI21	2021-08-11	MATH	math_competition	8.49	100	8.49
M009	Jurassic-1 Jumbo	AI21	2021-08-11	HellaSwag	commonsense	87.97	100	87.97
M009	Jurassic-1 Jumbo	AI21	2021-08-11	ARC-Challenge	reasoning	63.93	100	63.93
M010	Megatron-Turing NLG	Microsoft+NVIDIA	2021-10-11	MMLU	knowledge	63.83	100	63.83
M010	Megatron-Turing NLG	Microsoft+NVIDIA	2021-10-11	HumanEval	coding	36.27	100	36.27
M010	Megatron-Turing NLG	Microsoft+NVIDIA	2021-10-11	GSM8K	math_grade	34.38	100	34.38
M010	Megatron-Turing NLG	Microsoft+NVIDIA	2021-10-11	MATH	math_competition	6.2	100	6.2
M010	Megatron-Turing NLG	Microsoft+NVIDIA	2021-10-11	HellaSwag	commonsense	85.41	100	85.41
M010	Megatron-Turing NLG	Microsoft+NVIDIA	2021-10-11	ARC-Challenge	reasoning	67.52	100	67.52
M011	Gopher	DeepMind	2021-12-08	MMLU	knowledge	55.05	100	55.05
M011	Gopher	DeepMind	2021-12-08	HumanEval	coding	41.74	100	41.74
M011	Gopher	DeepMind	2021-12-08	GSM8K	math_grade	35.71	100	35.71
M011	Gopher	DeepMind	2021-12-08	MATH	math_competition	7.62	100	7.62
M011	Gopher	DeepMind	2021-12-08	HellaSwag	commonsense	82.78	100	82.78
M011	Gopher	DeepMind	2021-12-08	ARC-Challenge	reasoning	65.22	100	65.22
M012	GLaM	Google	2021-12-09	MMLU	knowledge	61.41	100	61.41
M012	GLaM	Google	2021-12-09	HumanEval	coding	40.93	100	40.93
M012	GLaM	Google	2021-12-09	GSM8K	math_grade	35.07	100	35.07
M012	GLaM	Google	2021-12-09	MATH	math_competition	16.41	100	16.41
M012	GLaM	Google	2021-12-09	HellaSwag	commonsense	86.14	100	86.14
M012	GLaM	Google	2021-12-09	ARC-Challenge	reasoning	61.93	100	61.93
M013	WuDao 2.0	BAAI	2021-06-01	MMLU	knowledge	63.54	100	63.54
M013	WuDao 2.0	BAAI	2021-06-01	HumanEval	coding	30.88	100	30.88
M013	WuDao 2.0	BAAI	2021-06-01	GSM8K	math_grade	24.08	100	24.08
M013	WuDao 2.0	BAAI	2021-06-01	MATH	math_competition	9.34	100	9.34
M013	WuDao 2.0	BAAI	2021-06-01	HellaSwag	commonsense	87.58	100	87.58
M013	WuDao 2.0	BAAI	2021-06-01	ARC-Challenge	reasoning	69.39	100	69.39
M014	ERNIE 3.0	Baidu	2021-07-05	MMLU	knowledge	49.43	100	49.43
M014	ERNIE 3.0	Baidu	2021-07-05	HumanEval	coding	37.44	100	37.44
M014	ERNIE 3.0	Baidu	2021-07-05	GSM8K	math_grade	32.22	100	32.22
M014	ERNIE 3.0	Baidu	2021-07-05	MATH	math_competition	8.44	100	8.44
M014	ERNIE 3.0	Baidu	2021-07-05	HellaSwag	commonsense	80.75	100	80.75
M014	ERNIE 3.0	Baidu	2021-07-05	ARC-Challenge	reasoning	66.91	100	66.91
M015	HyperCLOVA	Naver	2021-05-25	MMLU	knowledge	58.88	100	58.88
M015	HyperCLOVA	Naver	2021-05-25	HumanEval	coding	24.12	100	24.12
M015	HyperCLOVA	Naver	2021-05-25	GSM8K	math_grade	16.95	100	16.95
M015	HyperCLOVA	Naver	2021-05-25	MATH	math_competition	7.25	100	7.25
M015	HyperCLOVA	Naver	2021-05-25	HellaSwag	commonsense	87.14	100	87.14
M015	HyperCLOVA	Naver	2021-05-25	ARC-Challenge	reasoning	70.99	100	70.99
M016	GPT-3.5 (text-davinci-002)	OpenAI	2022-03-15	MMLU	knowledge	77.96	100	77.96
M016	GPT-3.5 (text-davinci-002)	OpenAI	2022-03-15	HumanEval	coding	61.66	100	61.66
M016	GPT-3.5 (text-davinci-002)	OpenAI	2022-03-15	GSM8K	math_grade	52.03	100	52.03
M016	GPT-3.5 (text-davinci-002)	OpenAI	2022-03-15	MATH	math_competition	13.47	100	13.47
M016	GPT-3.5 (text-davinci-002)	OpenAI	2022-03-15	HellaSwag	commonsense	89.9	100	89.9
M016	GPT-3.5 (text-davinci-002)	OpenAI	2022-03-15	ARC-Challenge	reasoning	77.56	100	77.56
M016	GPT-3.5 (text-davinci-002)	OpenAI	2022-03-15	TruthfulQA	truthfulness	27.92	100	27.92
M017	InstructGPT	OpenAI	2022-01-27	MMLU	knowledge	67.38	100	67.38
M017	InstructGPT	OpenAI	2022-01-27	HumanEval	coding	59.44	100	59.44
M017	InstructGPT	OpenAI	2022-01-27	GSM8K	math_grade	58.2	100	58.2
M017	InstructGPT	OpenAI	2022-01-27	MATH	math_competition	22.63	100	22.63
M017	InstructGPT	OpenAI	2022-01-27	HellaSwag	commonsense	89.05	100	89.05
M017	InstructGPT	OpenAI	2022-01-27	ARC-Challenge	reasoning	76.89	100	76.89
M017	InstructGPT	OpenAI	2022-01-27	TruthfulQA	truthfulness	21.63	100	21.63
M018	Chinchilla	DeepMind	2022-03-29	MMLU	knowledge	65.77	100	65.77
M018	Chinchilla	DeepMind	2022-03-29	HumanEval	coding	59.78	100	59.78
M018	Chinchilla	DeepMind	2022-03-29	GSM8K	math_grade	60.3	100	60.3
M018	Chinchilla	DeepMind	2022-03-29	MATH	math_competition	22.52	100	22.52
M018	Chinchilla	DeepMind	2022-03-29	HellaSwag	commonsense	90.73	100	90.73
M018	Chinchilla	DeepMind	2022-03-29	ARC-Challenge	reasoning	78.5	100	78.5
M018	Chinchilla	DeepMind	2022-03-29	TruthfulQA	truthfulness	24.08	100	24.08
M019	PaLM	Google	2022-04-04	MMLU	knowledge	68.77	100	68.77
M019	PaLM	Google	2022-04-04	HumanEval	coding	57.1	100	57.1
M019	PaLM	Google	2022-04-04	GSM8K	math_grade	61.39	100	61.39
M019	PaLM	Google	2022-04-04	MATH	math_competition	18.71	100	18.71
M019	PaLM	Google	2022-04-04	HellaSwag	commonsense	91.78	100	91.78
M019	PaLM	Google	2022-04-04	ARC-Challenge	reasoning	77.29	100	77.29
M019	PaLM	Google	2022-04-04	TruthfulQA	truthfulness	34.66	100	34.66


📊 LLM Benchmarks & Capabilities 2020–2026

An open dataset tracking the evolution of large language models from 2020 to 2026, from GPT-3 to GPT-5.5, Claude Opus 4.7, Gemini 3.5, and beyond.

🧭 Overview

This dataset captures the complete LLM landscape from 2020 to 2026 across five dimensions:

🤖 113 models from 25+ organizations
📈 17 benchmarks tracking capability growth over time
💰 Monthly API pricing showing 100x+ cost reductions
⚙️ Training compute estimates validating scaling laws
🏁 57 capability milestones marking key inflection points

Designed for ML researchers, AI practitioners, policy analysts, and data scientists working on LLM-related problems — trend analysis, capability forecasting, cost-performance modeling, and competitive intelligence.

📁 Dataset Files

File	Rows	Description
`models_catalog.csv`	113	Model metadata: org, release date, params, type, access
`benchmark_scores.csv`	1,276	Long format: model × benchmark × score
`pricing_history.csv`	1,187	Monthly API pricing per model (USD per 1M tokens)
`compute_estimates.csv`	113	Training FLOPs, GPU hours, cost, energy, CO2
`capability_milestones.csv`	57	Major AI events with significance scores

Total: ~2,750 rows · ~250 KB
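
Because `benchmark_scores.csv` is stored in long format, a side-by-side comparison usually starts with a pivot to one row per model. The sketch below uses a handful of rows copied from the preview of that file instead of downloading it; the column names come from the preview, and the selection of rows is only illustrative.

```python
import pandas as pd

# A few rows copied from the benchmark_scores.csv preview (long format).
rows = [
    ("M001", "GPT-3 (175B)", "MMLU", 32.83),
    ("M001", "GPT-3 (175B)", "HellaSwag", 55.08),
    ("M001", "GPT-3 (175B)", "ARC-Challenge", 49.43),
    ("M008", "Codex", "MMLU", 53.10),
    ("M008", "Codex", "HumanEval", 46.31),
    ("M008", "Codex", "HellaSwag", 80.72),
]
scores = pd.DataFrame(
    rows, columns=["model_id", "model_name", "benchmark", "score_pct"]
)

# Pivot to wide format: one row per model, one column per benchmark.
# Benchmarks a model has no row for show up as NaN.
wide = scores.pivot(index="model_name", columns="benchmark", values="score_pct")
print(wide)
```

Missing cells are expected here: not every model has a row for every benchmark, so any downstream aggregation should decide explicitly how to treat `NaN`.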

🏢 Organizations Covered

Closed Frontier: OpenAI · Anthropic · Google DeepMind · xAI · Microsoft

Open Weights: Meta · DeepSeek · Mistral · Alibaba Qwen · 01.AI · TII

Early Era: Chinchilla · PaLM · BLOOM · OPT · GLaM · Switch Transformer

Chinese Frontier: WuDao 2.0 · ERNIE 3.0 · HyperCLOVA · YaLM 100B

📐 Benchmarks Tracked

Benchmark	Type	Max Score
MMLU	Knowledge	100
MMLU-Pro	Knowledge Hard	100
HumanEval / HumanEval+	Coding	100
MBPP	Coding	100
GSM8K	Math Grade	100
MATH	Math Competition	100
AIME 2024	Math Olympiad	100
GPQA Diamond	Science PhD	100
HellaSwag	Commonsense	100
ARC-Challenge	Reasoning	100
TruthfulQA	Truthfulness	100
BBH (BIG-Bench Hard)	Reasoning Hard	100
SWE-Bench Verified	Agentic Coding	100
LiveCodeBench	Live Coding	100
MMMU	Multimodal	100
Chatbot Arena ELO	Human Eval	1500
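
Since Chatbot Arena ELO uses a different maximum than the other benchmarks, comparisons across benchmarks are easier on a common 0 to 100 scale. In the preview rows, `score_pct` equals `score` whenever `max_score` is 100, which is consistent with `score_pct = 100 * score / max_score`; that formula is an assumption here, not something the card states. The Arena rating of 1200 below is a made-up value for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "benchmark": ["MMLU", "Chatbot Arena ELO"],
        "score": [32.83, 1200.0],  # 1200.0 is a hypothetical rating, not dataset data
        "max_score": [100, 1500],  # maxima taken from the benchmarks table
    }
)

# Assumed normalization: express each score as a percentage of its maximum.
df["score_pct"] = (100 * df["score"] / df["max_score"]).round(2)
print(df)
```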

⚡ Quick Start

```python
# Reading hf:// paths with pandas requires the huggingface_hub package to be installed.
import pandas as pd

# Load models catalog
models = pd.read_csv("hf://datasets/hmnshudhmn24/llm-benchmarks-capabilities-2020-2026/models_catalog.csv")

# Load benchmark scores
scores = pd.read_csv("hf://datasets/hmnshudhmn24/llm-benchmarks-capabilities-2020-2026/benchmark_scores.csv")

# Load pricing history
pricing = pd.read_csv("hf://datasets/hmnshudhmn24/llm-benchmarks-capabilities-2020-2026/pricing_history.csv")

# Load compute estimates
compute = pd.read_csv("hf://datasets/hmnshudhmn24/llm-benchmarks-capabilities-2020-2026/compute_estimates.csv")

# Load capability milestones
milestones = pd.read_csv("hf://datasets/hmnshudhmn24/llm-benchmarks-capabilities-2020-2026/capability_milestones.csv")

print(models.head())
```

💡 Suggested Use Cases

Benchmark Saturation Forecasting — MMLU went from 32% (2020) to 95% (2025). Predict saturation for GPQA Diamond and SWE-Bench.
Price vs Capability Analysis — Plot Arena ELO vs blended API price. Track Pareto-optimal models per quarter.
Open vs Closed Model Gap — Measure the capability gap between open weights and closed models across time.
Scaling Law Validation — Plot training FLOPs vs benchmark scores. Test empirical scaling exponents.
Reasoning Model Premium — Measure score lift of reasoning models vs chat models on math and coding benchmarks.
Cost Reduction Trajectory — Track price per Arena ELO point over time. How fast is intelligence-per-dollar growing?
Competitive Organization Analysis — Head-to-head benchmark matrix per organization per quarter.
Chinese vs Western Capability Race — DeepSeek, Qwen, ERNIE vs OpenAI, Anthropic, Google.
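
As a starting point for the benchmark saturation use case, one possible approach is to track the running best score on a benchmark by release date and the remaining headroom to the maximum. The sketch below does this for six MMLU rows copied from the `benchmark_scores.csv` preview; the `frontier` and `headroom` column names are introduced here for illustration and are not part of the dataset.

```python
import pandas as pd

# Six MMLU rows copied from the benchmark_scores.csv preview.
mmlu = pd.DataFrame(
    {
        "model_name": [
            "GPT-3 (175B)", "CTRL", "Switch Transformer",
            "Codex", "GPT-3.5 (text-davinci-002)", "PaLM",
        ],
        "release_date": [
            "2020-06-11", "2020-09-11", "2021-01-11",
            "2021-08-10", "2022-03-15", "2022-04-04",
        ],
        "score_pct": [32.83, 38.09, 64.46, 53.10, 77.96, 68.77],
    }
)

# Order by release date so the cumulative maximum follows calendar time.
mmlu["release_date"] = pd.to_datetime(mmlu["release_date"])
mmlu = mmlu.sort_values("release_date").reset_index(drop=True)

# Running best score (the "frontier") and remaining headroom to 100.
mmlu["frontier"] = mmlu["score_pct"].cummax()
mmlu["headroom"] = (100 - mmlu["frontier"]).round(2)

# Keep only the releases that set a new best score.
records = mmlu[mmlu["score_pct"] == mmlu["frontier"]]
print(records[["release_date", "model_name", "frontier", "headroom"]])
```

On the full file, the same steps could be applied per benchmark with a `groupby("benchmark")`, and a curve fitted to `frontier` over time would give a rough saturation estimate; the choice of curve is left open here.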

Total file size: 171 kB