CHILDES-Aligned is derived from the CHILDES database (TalkBank) and is distributed under TalkBank's terms of use (https://talkbank.org/share/rules.html). Access is reviewed manually. By requesting access you agree to:

Use the data for non-commercial research only.
Cite the BEACON paper and every source CHILDES corpus you use (see corpora.tsv).
Follow the TalkBank Ground Rules, including respecting participant privacy and not redistributing the audio.

CHILDES-Aligned: Curated Child-Speech Dataset (BEACON)

English child-speech audio clips with corrected utterance-level timestamps, curated from CHILDES with BEACON (Boundary Estimation via Alignment CONsensus): multiple ASR systems' word timestamps are aligned to the human CHAT transcripts and fused by consensus to recover reliable utterance boundaries, without altering the original transcripts.

Speakers: target child (CHI) only.
Audio: 16 kHz mono WAV, one clip per (merged) child turn.
License/terms: governed by TalkBank/CHILDES terms of use (not Apache). Cite the source corpora you use — see corpora.tsv.
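
As a quick sanity check on the stated audio format, a downloaded clip can be inspected with Python's standard `wave` module. This is a minimal sketch that assumes the clips are PCM-encoded WAV files; the function names are illustrative and not part of the release.

```python
import wave


def is_16k_mono(path: str) -> bool:
    """Return True if the WAV file at `path` is 16 kHz, single-channel."""
    with wave.open(path, "rb") as w:
        return w.getframerate() == 16000 and w.getnchannels() == 1


def clip_seconds(path: str) -> float:
    """Clip length in seconds, computed from the WAV header."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()
```

The header-derived length should be close to the `duration` field in the manifest for the same clip.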

Two released versions (paper Table 1)

Statistic	General (raw CHAT)	ASR (verbatim)
CHILDES corpora	49	49
Recordings (sessions)	5,533	5,344
Child age range	0;6–14;3	0;6–14;3
Utterances (clips)	501,880	310,983
Audio (hours)	413.3	282.6
Words / vocabulary	2.18 M / 20,581	1.63 M / 18,168
Transcript	raw CHAT	verbatim-normalized

Ages use CHILDES years;months notation (0;6 = 0 years 6 months).
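
To make the notation concrete, here is a small sketch that converts a years;months string to whole months. It also accepts an optional `.days` suffix (as in `2;06.15`), which is an assumption about the input rather than something this card specifies, and it ignores the days. The manifest already provides `child_age_months`, so this is only needed when working from raw age strings.

```python
import re

# years ; optional months ; optional ".days" suffix (days are ignored here)
_AGE_RE = re.compile(r"^(\d+);(\d{1,2})?(?:\.(\d{1,2}))?$")


def age_to_months(age: str) -> int:
    """Convert a CHILDES-style age such as '0;6' to whole months."""
    match = _AGE_RE.match(age.strip())
    if match is None:
        raise ValueError(f"unrecognized age string: {age!r}")
    years = int(match.group(1))
    months = int(match.group(2) or 0)
    return years * 12 + months


print(age_to_months("0;6"))   # lower end of the age range in the table
print(age_to_months("14;3"))  # upper end of the age range in the table
```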

general/: lossless and general-purpose. Every clip keeps its raw CHAT text (unintelligible-speech markers such as xxx, pauses, &=events, terminators, etc.), so the release is not tied to any single downstream task: distribution, child-language analysis, custom pipelines.
asr/: a stricter subset of General for ASR training. Transcripts are verbatim-normalized, and each clip must pass an independent ASR agreement check that drops clips whose audio and transcript disagree. The ASR version is a labeled subset of the same audio; there is no separate audio store.

The ASR subset is derived from General in two passes: (1) verbatim normalization of the CHAT text (text_normalization.md), dropping empty / unintelligible (xxx/yyy/www) references; (2) an ASR-agreement filter — each clip is transcribed by a fixed out-of-ensemble ASR and dropped if it disagrees with the normalized reference beyond a threshold.
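
The second pass can be illustrated with a simple agreement check. The sketch below uses word error rate (WER) as the disagreement measure and a cutoff of 0.5; both the metric and the threshold value are assumptions made for illustration, since this card does not state which measure or cutoff the release uses.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, substitution))
        prev = cur
    return prev[-1] / len(ref)


def keep_clip(reference: str, hypothesis: str, max_wer: float = 0.5) -> bool:
    """Keep a clip when the judge ASR output is close enough to the reference.

    `max_wer` is a hypothetical threshold, not the value used for the release.
    """
    return word_error_rate(reference, hypothesis) <= max_wer
```

Here `reference` would be the normalized transcript and `hypothesis` the output of the out-of-ensemble ASR for the same clip.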

Label schema (`{version}/manifest.jsonl`, one JSON object per clip)

field	meaning
`audio`	clip path relative to the clip store: `audio/{kk}/{clip_key}.wav`
`duration`	clip length (seconds)
`text`	transcript — raw CHAT (General) or verbatim-normalized (ASR)
`text_original`	(ASR only) the raw CHAT text before normalization
`clip_key`, `recording_id`	clip + source recording identifiers
`corpus`, `pid`, `source_url`	CHILDES corpus, TalkBank PID, source media URL
`start_sec`, `end_sec`	recovered span on the original recording timeline
`recovery_profile`	BEACON ensemble + merge configuration for the span
`child_age`, `child_age_months`, `child_sex`, `child_group`	participant metadata from the source `.cha` `@ID`

Labels carry the transcript plus provenance/participant metadata only — no per-clip quality/confidence fields.
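
A minimal sketch of reading the manifest with the standard library, using the field names from the schema above. The sample records are invented for illustration (they are not real clips), and the sketch assumes `child_age_months` is numeric or null.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List


def iter_manifest(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def filter_by_age(records: Iterable[dict], lo: float, hi: float) -> List[dict]:
    """Keep records whose child_age_months lies in [lo, hi]."""
    kept = []
    for rec in records:
        age = rec.get("child_age_months")
        if age is not None and lo <= age <= hi:
            kept.append(rec)
    return kept


def seconds_by_corpus(records: Iterable[dict]) -> Dict[str, float]:
    """Sum clip durations (seconds) per source corpus."""
    totals: Dict[str, float] = defaultdict(float)
    for rec in records:
        totals[rec["corpus"]] += rec["duration"]
    return dict(totals)


# Hypothetical records; with real data, pass an open file handle instead, e.g.
# open("general/manifest.jsonl", encoding="utf-8").
SAMPLE_LINES = [
    '{"clip_key": "demo_0001", "corpus": "CorpusA", "duration": 1.5, '
    '"text": "more juice .", "child_age_months": 30}',
    "",
    '{"clip_key": "demo_0002", "corpus": "CorpusB", "duration": 2.5, '
    '"text": "xxx doggie !", "child_age_months": 52}',
]

records = list(iter_manifest(SAMPLE_LINES))
toddlers = filter_by_age(records, 24, 36)
```

Grouping by `corpus` in this way also makes it easy to work out which source corpora need to be cited for a given subset.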

Method / provenance

English CHILDES recordings → recording-level QC (drop mislabeled non-English, silent, all-model-untranscribable) → per-recording multi-ASR word timestamps (WhisperX, Parakeet, Canary, Qwen3) → BEACON chunk-DP alignment to the CHAT reference utterances → IoU-consensus fusion → adjacent CHI turns with ≤1 s gaps merged (+0.5 s padding) → clip cutting. The ASR version adds normalization + the ASR-agreement filter. Text normalization spec: text_normalization.md.
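
Two of the interval operations in this pipeline can be sketched directly: the IoU between two time spans, and merging of child turns separated by gaps of at most 1 s followed by 0.5 s of padding. This is an illustrative sketch, not the BEACON implementation. It assumes padding is applied on both sides after merging and clamped to the recording, and it treats every input span as a CHI turn without checking for intervening adult turns.

```python
from typing import List, Optional, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)


def interval_iou(a: Span, b: Span) -> float:
    """Intersection-over-union of two time spans."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def merge_turns(
    spans: List[Span],
    max_gap: float = 1.0,
    pad: float = 0.5,
    recording_end: Optional[float] = None,
) -> List[Span]:
    """Merge spans whose gap is <= max_gap, then pad each merged span."""
    merged: List[List[float]] = []
    for start, end in sorted(spans):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    padded = []
    for start, end in merged:
        new_start = max(0.0, start - pad)
        new_end = end + pad
        if recording_end is not None:
            new_end = min(recording_end, new_end)
        padded.append((new_start, new_end))
    return padded


clips = merge_turns([(1.0, 2.0), (2.8, 3.5), (6.0, 7.0)])
```

In this example the first two turns are 0.8 s apart and become one clip, while the third stays separate because its gap exceeds 1 s.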

Limitations

Child-speaker only; adult turns are not released as clips.
Timestamps are model-recovered, not hand-aligned; the hardest child utterances are disproportionately likely to be skipped (no clip is produced) or filtered out.
The ASR subset is biased toward speech that the out-of-ensemble judge ASR can transcribe; do not use it as the sole source of hard child speech.
English CHILDES only.

Citation

Cite the BEACON paper (see CITATION in the code release) and the individual source CHILDES corpora you use, listed in corpora.tsv (TalkBank requires citing the corpora).


Total file size: 80.3 GB
