Surya OCR 2 on 1901 US newspapers (Chronicling America)

48 pages from The Commoner (William Jennings Bryan's weekly, published in Lincoln, Nebraska; Jan–Mar 1901), OCR'd with Surya OCR 2 (datalab-to/surya-ocr-2, 650M) run as an offline vLLM batch job on Hugging Face Jobs.

Dense, multi-column historic newspaper pages have long been a failure mode for VLM-based OCR, which tends to hallucinate or fall into repetition loops on the columns. On first inspection Surya OCR 2 handles them very well: all 48 pages were processed with 0 failures, 0 empty outputs, and 0 repetition loops, yielding ~460k characters of clean text with per-block structure and reading order.
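For readers who want to repeat the empty / looping check on their own pages, here is a minimal sketch of one way to do it. It is not the check used for the numbers above: the function names, the `min_chars` cutoff, the n-gram size, and the `0.5` threshold are all assumptions chosen for illustration.

```python
def repetition_score(text: str, n: int = 4) -> float:
    """Share of n-word windows that duplicate another window (0.0 = none)."""
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)


def classify(text: str, min_chars: int = 20, loop_threshold: float = 0.5) -> str:
    """Label OCR output as 'empty', 'looping', or 'ok'.

    Both thresholds are hypothetical defaults, not values from the dataset card.
    """
    if len(text.strip()) < min_chars:
        return "empty"
    if repetition_score(text) > loop_threshold:
        return "looping"
    return "ok"


if __name__ == "__main__":
    print(classify(""))
    print(classify("the same line again " * 50))
    print(classify("Webster defines a commoner as one of the common people."))
```

A heuristic like this only flags gross failures. It says nothing about character-level accuracy, and legitimately repetitive content such as price lists can push the score up.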

Columns

  • image — the page scan (downscaled from the source JP2).
  • markdown — flattened, reading-order text (Surya's per-block HTML, stripped + joined).
  • surya_blocks — the full structured result as JSON: per block label / bbox (pixel space) / reading_order / confidence / html (equations as <math>), one entry per page.
  • inference_info — run metadata.
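As a rough illustration of how `surya_blocks` relates to `markdown`, the sketch below parses the JSON, orders blocks by `reading_order`, and strips the per-block HTML. The field names (`blocks`, `reading_order`, `html`, `skipped`, `error`, `label`, `confidence`, `bbox`) come from the stored JSON. The helper names, the choice to turn `<br/>` into a space, the newline join, and the second block in `SAMPLE` are assumptions made for the example, so the output may not match the `markdown` column exactly.

```python
import json
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes; treats <br> as a space."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(fragment: str) -> str:
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any buffered trailing text
    return " ".join("".join(parser.parts).split())


def page_text(surya_blocks: str) -> str:
    """Flatten one surya_blocks JSON string into reading-order text."""
    lines = []
    for page in json.loads(surya_blocks):
        for block in sorted(page["blocks"], key=lambda b: b["reading_order"]):
            if block.get("skipped") or block.get("error"):
                continue
            text = strip_html(block["html"])
            if text:
                lines.append(text)
    return "\n".join(lines)


# Illustrative input: blocks deliberately listed out of reading order.
# The second block's bbox and confidence are made up for the example.
SAMPLE = json.dumps([{"blocks": [
    {"label": "SectionHeader", "reading_order": 1, "confidence": 0.99,
     "html": "<h1>The Commoner.</h1>", "skipped": False, "error": False,
     "bbox": [300.0, 300.0, 1600.0, 500.0]},
    {"label": "Text", "reading_order": 0, "confidence": 0.9931035062365156,
     "html": "<p><i>To Chas W. Bryan<br/>with Compliments<br/>of W.J.</i></p>",
     "skipped": False, "error": False,
     "bbox": [274.383, 63.0, 1673.949, 291.0]},
]}])

if __name__ == "__main__":
    print(page_text(SAMPLE))
```

The same loop is a convenient place to filter on `label` (for example, dropping `PageHeader` blocks) or to collect blocks whose `confidence` falls below a cutoff for manual review.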

How it was made

The dataset was produced by surya-ocr.py, a self-contained UV script, run as a single command on HF Jobs (a100-large, vllm/vllm-openai:v0.20.1). Recipe: https://github.com/davanstrien/uv-scripts-for-ai/pull/54

Source pages mirror the Library of Congress Chronicling America collection (public domain).

Caveats (honest)

This is a first-pass result and has not been verified at the character level. The Commoner is relatively cleanly printed; badly degraded microfilm is the harder test. Surya weights are released under a modified OpenRAIL-M license (research / personal / startups <$5M).
