Access CHILDES-Aligned
CHILDES-Aligned is derived from the CHILDES database (TalkBank) and is distributed under TalkBank's terms of use (https://talkbank.org/share/rules.html). Access is reviewed manually. By requesting access you agree to:
- Use the data for non-commercial research only.
- Cite the BEACON paper and every source CHILDES corpus you use (see corpora.tsv).
- Follow the TalkBank Ground Rules, including respecting participant privacy and not redistributing the audio.
CHILDES-Aligned: Curated Child-Speech Dataset (BEACON)
English child-speech audio clips with corrected utterance-level timestamps, curated from CHILDES with BEACON (Boundary Estimation via Alignment CONsensus): multiple ASR systems' word timestamps are aligned to the human CHAT transcripts and fused by consensus to recover reliable utterance boundaries, without altering the original transcripts.
- Speakers: target child (CHI) only.
- Audio: 16 kHz mono WAV, one clip per (merged) child turn.
- License/terms: governed by TalkBank/CHILDES terms of use (not Apache).
Cite the source corpora you use (see corpora.tsv).
Two released versions (paper Table 1)
| | General (raw CHAT) | ASR (verbatim) |
|---|---|---|
| CHILDES corpora | 49 | 49 |
| Recordings (sessions) | 5,533 | 5,344 |
| Child age range | 0;6–14;3 | 0;6–14;3 |
| Utterances (clips) | 501,880 | 310,983 |
| Audio (hours) | 413.3 | 282.6 |
| Words / vocabulary | 2.18 M / 20,581 | 1.63 M / 18,168 |
| Transcript | raw CHAT | verbatim-normalized |
Ages use CHILDES years;months notation (0;6 = 0 years 6 months).
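To make the age notation concrete, here is a minimal sketch that converts a CHILDES age string to months. It assumes the usual CHAT form `years;months.days`, where the days part is optional, and it approximates a day as 1/30 of a month. The function name `childes_age_to_months` is hypothetical, and nothing here claims to be how the `child_age_months` manifest field was computed.

```python
import re


def childes_age_to_months(age: str) -> float:
    """Convert a CHILDES age such as '0;6' or '2;06.15' to months.

    Assumption of this sketch: days count as 1/30 of a month.
    """
    match = re.fullmatch(r"(\d+);(\d{1,2})?(?:\.(\d{1,2}))?", age.strip())
    if match is None:
        raise ValueError(f"not a CHILDES age string: {age!r}")
    years = int(match.group(1))
    months = int(match.group(2) or 0)  # months may be omitted, e.g. '3;'
    days = int(match.group(3) or 0)    # days are optional
    return years * 12 + months + days / 30.0
```

With this helper, the age range in the table, 0;6 to 14;3, corresponds to 6 to 171 months.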
- general/: lossless, general-purpose. All clips keep their raw CHAT text (unintelligibility xxx, pauses, &=events, terminators, etc.), so it is not tied to any single downstream task: distribution, child-language analysis, custom pipelines.
- asr/: a stricter subset of General for ASR training. Transcripts are verbatim-normalized and each clip must pass an independent ASR agreement check (drops clips whose audio and transcript disagree). The ASR version is a labeled subset of the same audio; there is no separate audio store.
The ASR subset is derived from General in two passes: (1) verbatim normalization
of the CHAT text (text_normalization.md), dropping empty / unintelligible
(xxx/yyy/www) references; (2) an ASR-agreement filter: each clip is
transcribed by a fixed out-of-ensemble ASR and dropped if it disagrees with the
normalized reference beyond a threshold.
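The second pass can be pictured with the sketch below. It is illustrative only: the card does not say which disagreement metric or threshold is used, so word error rate (WER) and the `max_wer=0.5` default are assumptions, as are the function names. The sketch also treats a reference as unusable when nothing is left after removing xxx/yyy/www tokens; the actual rule is the one in text_normalization.md.

```python
UNINTELLIGIBLE = {"xxx", "yyy", "www"}


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def keep_clip(reference: str, judge_hypothesis: str, max_wer: float = 0.5) -> bool:
    """Return True if the clip would survive this hypothetical agreement filter."""
    words = [w for w in reference.lower().split() if w not in UNINTELLIGIBLE]
    if not words:
        return False  # empty or fully unintelligible reference
    wer = word_error_rate(" ".join(words), judge_hypothesis.lower())
    return wer <= max_wer  # max_wer is an assumed value, not the released one
```

A clip whose judge transcript matches the reference is kept, while one with an unrelated judge transcript, or with a reference of only `xxx`, is dropped.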
Label schema ({version}/manifest.jsonl, one JSON object per clip)
| field | meaning |
|---|---|
| audio | clip path relative to the clip store: audio/{kk}/{clip_key}.wav |
| duration | clip length (seconds) |
| text | transcript: raw CHAT (General) or verbatim-normalized (ASR) |
| text_original | (ASR only) the raw CHAT text before normalization |
| clip_key, recording_id | clip + source recording identifiers |
| corpus, pid, source_url | CHILDES corpus, TalkBank PID, source media URL |
| start_sec, end_sec | recovered span on the original recording timeline |
| recovery_profile | BEACON ensemble + merge configuration for the span |
| child_age, child_age_months, child_sex, child_group | participant metadata from the source .cha @ID |
Labels carry the transcript plus provenance/participant metadata only; there are no per-clip quality/confidence fields.
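As a rough illustration of working with the schema above, the following sketch parses manifest lines and uses a few of the listed fields. Only the field names (`audio`, `duration`, `corpus`) come from the table; the helper names and the `clip_store_root` argument are hypothetical, and where the clip store lives on disk depends on how you downloaded the data.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def total_hours(records: Iterable[dict]) -> float:
    """Sum the `duration` field (seconds) and convert to hours."""
    return sum(record["duration"] for record in records) / 3600.0


def resolve_audio(record: dict, clip_store_root: str) -> Path:
    """Join the relative `audio` path onto an assumed local clip store root."""
    return Path(clip_store_root) / record["audio"]


def clips_for_corpus(records: Iterable[dict], corpus: str) -> list:
    """Select records from one source corpus, e.g. to build its citation list."""
    return [record for record in records if record["corpus"] == corpus]
```

In practice you would pass an open file handle for `general/manifest.jsonl` or `asr/manifest.jsonl` to `iter_records`; because the ASR version is a labeled subset of the same audio, both manifests resolve against the same clip store.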
Method / provenance
English CHILDES recordings → recording-level QC (drop mislabeled non-English,
silent, all-model-untranscribable) → per-recording multi-ASR word timestamps
(WhisperX, Parakeet, Canary, Qwen3) → BEACON chunk-DP alignment to the CHAT
reference utterances → IoU-consensus fusion → adjacent CHI turns with ≤1 s gaps
merged (+0.5 s padding) → clip cutting. The ASR version adds normalization + the
ASR-agreement filter. Text normalization spec: text_normalization.md.
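Two steps of this pipeline are easy to illustrate in isolation: the temporal IoU that a consensus over candidate spans would rely on, and the merging of nearby child turns before cutting. The sketch below is not the BEACON implementation. It assumes padding is applied on both sides after merging and clamped to the recording, it ignores whether another speaker's turn falls between two child turns, and it does not attempt the fusion rule itself, which the card does not specify. `Span`, `temporal_iou`, and `merge_turns` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    start: float  # seconds on the recording timeline
    end: float


def temporal_iou(a: Span, b: Span) -> float:
    """Intersection over union of two time spans (0.0 when disjoint)."""
    intersection = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - intersection
    return intersection / union if union > 0 else 0.0


def merge_turns(
    spans: List[Span],
    max_gap: float = 1.0,   # merge turns separated by at most 1 s
    padding: float = 0.5,   # pad each merged clip by 0.5 s
    recording_end: Optional[float] = None,
) -> List[Span]:
    """Merge spans whose gap is <= max_gap, then pad each merged span."""
    merged: List[Span] = []
    for span in sorted(spans, key=lambda s: s.start):
        if merged and span.start - merged[-1].end <= max_gap:
            merged[-1].end = max(merged[-1].end, span.end)
        else:
            merged.append(Span(span.start, span.end))
    padded = []
    for span in merged:
        start = max(0.0, span.start - padding)  # clamp at recording start
        end = span.end + padding
        if recording_end is not None:
            end = min(end, recording_end)       # clamp at recording end
        padded.append(Span(start, end))
    return padded
```

Because only gaps longer than 1 s survive merging and each side is padded by 0.5 s, the padded clips produced by this sketch never overlap.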
Limitations
- Child-speaker only; adult turns are not released as clips.
- Timestamps are model-recovered, not hand-aligned; the hardest child utterances are disproportionately abstained on (no clip) or filtered.
- The ASR subset is biased toward speech that the out-of-ensemble judge ASR can transcribe; do not use it as the sole source of hard child speech.
- English CHILDES only.
Citation
Cite the BEACON paper (see CITATION in the code release) and the individual
source CHILDES corpora you use, listed in corpora.tsv (TalkBank requires citing
the corpora).