Access CHILDES-Aligned


CHILDES-Aligned is derived from the CHILDES database (TalkBank) and is distributed under TalkBank's terms of use (https://talkbank.org/share/rules.html). Access is reviewed manually. By requesting access you agree to:

  1. Use the data for non-commercial research only.
  2. Cite the BEACON paper and every source CHILDES corpus you use (see corpora.tsv).
  3. Follow the TalkBank Ground Rules, including respecting participant privacy and not redistributing the audio.


CHILDES-Aligned: Curated Child-Speech Dataset (BEACON)

English child-speech audio clips with corrected utterance-level timestamps, curated from CHILDES with BEACON (Boundary Estimation via Alignment CONsensus). BEACON aligns word timestamps from multiple ASR systems to the human CHAT transcripts and fuses them by consensus to recover reliable utterance boundaries, without altering the original transcripts.

  • Speakers: target child (CHI) only.
  • Audio: 16 kHz mono WAV, one clip per (merged) child turn.
  • License/terms: governed by TalkBank/CHILDES terms of use (not Apache). Cite the source corpora you use; see corpora.tsv.
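
The stated audio format can be checked from a clip's WAV header. The following is a minimal sketch using only Python's standard `wave` module; the function name `check_clip_format` and the example path are hypothetical and not part of the release.

```python
import wave


def check_clip_format(path, expected_rate=16000, expected_channels=1):
    """Return True if the WAV header at `path` reports 16 kHz mono audio."""
    with wave.open(path, "rb") as wav:
        return (wav.getframerate() == expected_rate
                and wav.getnchannels() == expected_channels)


# Example call (hypothetical path, shown as a comment because it needs a real file):
# check_clip_format("audio/some_clip.wav")
```

Only the header is read, so the check is cheap enough to run over a whole download.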

Two released versions (paper Table 1)

| | General (raw CHAT) | ASR (verbatim) |
|---|---|---|
| CHILDES corpora | 49 | 49 |
| Recordings (sessions) | 5,533 | 5,344 |
| Child age range | 0;6–14;3 | 0;6–14;3 |
| Utterances (clips) | 501,880 | 310,983 |
| Audio (hours) | 413.3 | 282.6 |
| Words / vocabulary | 2.18 M / 20,581 | 1.63 M / 18,168 |
| Transcript | raw CHAT | verbatim-normalized |

Ages use CHILDES years;months notation (0;6 = 0 years 6 months).
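
To make the notation concrete, here is a small sketch that converts a years;months string to whole months. The helper name `age_to_months` is hypothetical. The manifest already ships a `child_age_months` field, so this is only needed when working from the raw age string.

```python
def age_to_months(age):
    """Convert a CHILDES 'years;months' age string to whole months.

    CHAT also allows a trailing '.days' part (e.g. '2;06.15'); it is ignored here.
    """
    years, _, rest = age.partition(";")
    months = rest.split(".")[0]
    return int(years) * 12 + int(months or 0)


print(age_to_months("0;6"))   # 6
print(age_to_months("14;3"))  # 171
```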

  • general/: lossless and general-purpose. All clips keep their raw CHAT text (the unintelligibility marker xxx, pauses, &=events, terminators, etc.), so this version is not tied to any single downstream task and suits distribution, child-language analysis, and custom pipelines.
  • asr/: a stricter subset of General for ASR training. Transcripts are verbatim-normalized, and each clip must pass an independent ASR agreement check, which drops clips whose audio and transcript disagree. The ASR version is a labeled subset of the same audio; there is no separate audio store.

The ASR subset is derived from General in two passes: (1) verbatim normalization of the CHAT text (text_normalization.md), dropping empty / unintelligible (xxx/yyy/www) references; (2) an ASR-agreement filter, in which each clip is transcribed by a fixed out-of-ensemble ASR system and dropped if that transcript disagrees with the normalized reference beyond a threshold.
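
The second pass can be illustrated with a small sketch. The card does not say which disagreement metric or threshold is used, so the word error rate (WER) metric, the `max_wer=0.3` default, and the function names below are all assumptions for illustration. The judge transcript is passed in as a plain string and no real ASR system is called.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


def keep_clip(reference, judge_transcript, max_wer=0.3):
    """Assumed filter rule: drop empty references and clips above the WER threshold."""
    if not reference.split():
        return False
    return word_error_rate(reference, judge_transcript) <= max_wer


print(keep_clip("i want juice", "i want juice"))  # True
print(keep_clip("i want juice", "no"))            # False
```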

Label schema ({version}/manifest.jsonl, one JSON object per clip)

| field | meaning |
|---|---|
| audio | clip path relative to the clip store: audio/{kk}/{clip_key}.wav |
| duration | clip length (seconds) |
| text | transcript: raw CHAT (General) or verbatim-normalized (ASR) |
| text_original | (ASR only) the raw CHAT text before normalization |
| clip_key, recording_id | clip + source recording identifiers |
| corpus, pid, source_url | CHILDES corpus, TalkBank PID, source media URL |
| start_sec, end_sec | recovered span on the original recording timeline |
| recovery_profile | BEACON ensemble + merge configuration for the span |
| child_age, child_age_months, child_sex, child_group | participant metadata from the source .cha @ID |

Labels carry the transcript plus provenance/participant metadata only; there are no per-clip quality/confidence fields.
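
Because the manifest is JSON Lines, it can be read with the standard library alone. The sketch below assumes that `duration` is numeric and `corpus` is a string, as the schema suggests; the helper names are hypothetical.

```python
import json


def load_manifest(path):
    """Read a JSON Lines manifest: one JSON object (clip) per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def clips_in_duration_range(clips, min_sec, max_sec):
    """Keep clips whose `duration` (seconds) lies in [min_sec, max_sec]."""
    return [c for c in clips if min_sec <= c["duration"] <= max_sec]


def corpora_used(clips):
    """Sorted distinct `corpus` values, e.g. to look up citations in corpora.tsv."""
    return sorted({c["corpus"] for c in clips})


# Example usage (shown as comments because it needs the downloaded files):
# clips = load_manifest("asr/manifest.jsonl")
# short = clips_in_duration_range(clips, 1.0, 10.0)
# print(corpora_used(short))
```

Listing the distinct corpora in the subset you actually train or evaluate on gives the set of source corpora to cite.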

Method / provenance

English CHILDES recordings → recording-level QC (drop mislabeled non-English, silent, all-model-untranscribable) → per-recording multi-ASR word timestamps (WhisperX, Parakeet, Canary, Qwen3) → BEACON chunk-DP alignment to the CHAT reference utterances → IoU-consensus fusion → adjacent CHI turns with ≤1 s gaps merged (+0.5 s padding) → clip cutting. The ASR version adds normalization + the ASR-agreement filter. Text normalization spec: text_normalization.md.
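
Two of these steps lend themselves to a short illustration: the interval IoU that consensus fusion relies on, and the turn-merging rule. This is a sketch under stated assumptions, not the BEACON implementation. It assumes padding is applied to both ends of a merged span, that adjacency depends only on the time gap, and that spans are (start_sec, end_sec) tuples. The function names are hypothetical, and the actual fusion rule is not described in this card.

```python
def interval_iou(a, b):
    """Intersection over union of two (start_sec, end_sec) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def merge_turns(spans, max_gap=1.0, pad=0.5):
    """Merge spans separated by at most `max_gap` seconds, then pad both ends.

    The start is clamped at 0; the end is not clamped because the recording
    length is not known here.
    """
    merged = []
    for start, end in sorted(spans):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)  # extend the current span
        else:
            merged.append([start, end])              # start a new span
    return [(max(0.0, s - pad), e + pad) for s, e in merged]


print(merge_turns([(1.0, 2.0), (2.5, 3.5), (6.0, 7.0)]))
# [(0.5, 4.0), (5.5, 7.5)]
```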

Limitations

  • Child-speaker only; adult turns are not released as clips.
  • Timestamps are model-recovered, not hand-aligned; the hardest child utterances are disproportionately likely to be abstained on (yielding no clip) or filtered out.
  • The ASR subset is biased toward speech that the judge ASR (the out-of-ensemble system used in the agreement filter) can transcribe; do not use it as the sole source of hard child speech.
  • English CHILDES only.

Citation

Cite the BEACON paper (see CITATION in the code release) and the individual source CHILDES corpora you use, listed in corpora.tsv (TalkBank requires citing the corpora).
