Welcome to the VoiceAfrica dataset. VoiceAfrica is a Lanfrica–Meta collaboration (code-named VoiceAfrica 1) that set out to create authentic, conversational speech and expert-curated transcriptions for 11 under-represented African languages. The dataset contains ~125 hours of speech (roughly 10 to 13 hours per language) across 16,383 audio samples from 118 speakers in four countries.

Unlike read-speech corpora, VoiceAfrica uses a natural speech generation approach: native speakers were given prompts (think of them as questions across diverse, modern thematic areas) and spoke freely for 1–2 minutes per prompt. Long recordings were split at natural points into small, model-compatible chunks (≤30 s), so no forced alignment is required downstream. The result is conversational, real-world, culturally rich data, including the natural fillers, repetitions, and code-mixing that occur in everyday speech.

For more about Lanfrica's data farming approach, visit https://lanfrica.com. By using this dataset, you acknowledge that you have read the data usage terms and conditions and agree to use the dataset in accordance with them.

Languages

The dataset covers the following languages. Each language is a separate config (named by its ISO 639-3 code).

Config	Language	Country	Samples	Hours	Speakers
`bcz`	Bainouk-Gunyaamolo	Senegal	1,543	10.8	13
`ble`	Balanta-Kentohe	Senegal	1,755	12.9	12
`bvb`	Bube	Equatorial Guinea	1,404	11.3	10
`fan`	Fang	Cameroun	1,473	11.5	10
`igl`	Igala	Nigeria	1,570	12.1	10
`knc`	Central Kanuri	Nigeria	1,505	11.0	10
`krx`	Karon	Senegal	1,279	10.2	10
`nup`	Nupe-Nupe-Tako	Nigeria	1,604	12.6	10
`pov`	Upper Guinea Crioulo	Senegal	1,430	10.2	12
`srr`	Serer	Senegal	1,309	10.4	11
`urh`	Urhobo	Nigeria	1,511	11.9	10
Total	11 languages	4 countries	16,383	~124.9	118
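
As a quick sanity check on the table above, the per-language figures can be summed to reproduce the totals row. This is a minimal sketch that hard-codes the numbers from the table; `STATS` and `totals` are illustrative names and not part of the dataset or any library.

```python
# Per-language figures copied from the table above: (samples, hours, speakers).
STATS = {
    "bcz": (1543, 10.8, 13),
    "ble": (1755, 12.9, 12),
    "bvb": (1404, 11.3, 10),
    "fan": (1473, 11.5, 10),
    "igl": (1570, 12.1, 10),
    "knc": (1505, 11.0, 10),
    "krx": (1279, 10.2, 10),
    "nup": (1604, 12.6, 10),
    "pov": (1430, 10.2, 12),
    "srr": (1309, 10.4, 11),
    "urh": (1511, 11.9, 10),
}


def totals(stats):
    """Sum samples, hours (rounded to 0.1), and speakers over all configs."""
    samples = sum(s for s, _, _ in stats.values())
    hours = round(sum(h for _, h, _ in stats.values()), 1)
    speakers = sum(p for _, _, p in stats.values())
    return samples, hours, speakers


samples, hours, speakers = totals(STATS)
print(samples, hours, speakers)
# Mean hours per language, useful when budgeting per-config experiments.
print(round(hours / len(STATS), 2))
```

The sums agree with the totals row (16,383 samples, 124.9 hours, 118 speakers), and the mean works out to a little over 11 hours per language.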

There was a deliberate effort to balance gender representation. However, in some Muslim-dominated localities (e.g. for Kanuri, Serer, Balanta-Kentohe, Bainouk, Karon) female participation was lower; the dataset reflects this reality.

Metadata

Each row contains the audio together with rich speaker, prompt, and transcript metadata. Use the Dataset Viewer to explore it. The columns are:

Column	Type	Description
`audio`	audio	The speech recording — a natural ≤30s chunk of a speaker's spoken response.
`speaker_id`	string	Anonymized speaker identifier (unique per language, e.g. `spk01`).
`gender`	string	Speaker gender (`Female` / `Male`).
`age_range`	string	Speaker age range (`18-29` or `30-over`).
`country`	string	Country where the recording was made.
`language_name`	string	Full human-readable language name (e.g. `Bainouk-Gunyaamolo`).
`language_code`	string	ISO 639-3 language code — also the config name.
`prompt`	string	The prompt (question) as presented to the speaker, in the lingua franca of their region (English, French, or Spanish).
`prompt_english`	string	The English version of the prompt, provided consistently for every row regardless of the language it was presented in.
`prompt_language`	string	The language the prompt was presented in (`english`, `french`, or `spanish`).
`prompt_id`	string	Identifier of the prompt (e.g. `e001`, `f001`, `s001`).
`domain`	string	Thematic area of the prompt (e.g. `food`, `family`, `government`, `music`).
`transcript`	string	Expert human transcription of the audio chunk.
`audio_filename`	string	Original filename of the audio chunk.
`audio_sequence`	int	Order of this chunk within the speaker's full response to the prompt.
`duration`	float	Duration of the audio chunk in seconds.
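
Because each row is one chunk of a longer spoken response, `audio_sequence` can be used to put the chunks back in order. The sketch below works on plain row dictionaries and assumes that a speaker answers a given prompt only once, so that (`speaker_id`, `prompt_id`) identifies a full response within one language config. The helper name `reassemble_responses` and the sample rows are hypothetical and for illustration only.

```python
from collections import defaultdict


def reassemble_responses(rows):
    """Group chunk rows by (speaker_id, prompt_id) and join them in sequence order."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["speaker_id"], row["prompt_id"])].append(row)

    responses = {}
    for key, chunks in groups.items():
        # audio_sequence gives the position of the chunk within the response.
        chunks.sort(key=lambda r: r["audio_sequence"])
        responses[key] = {
            "transcript": " ".join(c["transcript"] for c in chunks),
            "duration": sum(c["duration"] for c in chunks),
            "num_chunks": len(chunks),
        }
    return responses


# Hypothetical rows for illustration only (not real dataset content).
rows = [
    {"speaker_id": "spk01", "prompt_id": "e001", "audio_sequence": 2,
     "transcript": "second part", "duration": 12.5},
    {"speaker_id": "spk01", "prompt_id": "e001", "audio_sequence": 1,
     "transcript": "first part", "duration": 28.0},
    {"speaker_id": "spk02", "prompt_id": "e001", "audio_sequence": 1,
     "transcript": "another speaker", "duration": 20.0},
]
full = reassemble_responses(rows)
print(full[("spk01", "e001")]["transcript"])
```

Since `speaker_id` is unique only per language, add `language_code` to the grouping key when combining several configs.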

A note on prompts and languages

The same 600 prompts were used across all 11 languages, spanning 19 thematic areas: hobbies, animals, food, farming, music, clothing, conversations, folklore, government, social media, education, family, social, art and crafts, places, greetings, festivals, occupation, and medical. Prompts were presented to speakers in the most widely-understood lingua franca of their region (English for Nigeria, French for Senegal/Cameroun, Spanish for Equatorial Guinea). The `prompt` field preserves what the speaker actually saw, while `prompt_english` gives a consistent English reference across the whole dataset.
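
The country-to-lingua-franca mapping described above can be written down as a small lookup and used to spot rows that deviate from it. This is only a sketch: the exact strings stored in the `country` column are an assumption (spelled here as in the Languages table), and `EXPECTED_PROMPT_LANGUAGE` and `prompt_language_mismatches` are hypothetical names.

```python
# Country -> prompt language, following the paragraph above.
# Assumption: `country` values are spelled as in the Languages table.
EXPECTED_PROMPT_LANGUAGE = {
    "Nigeria": "english",
    "Senegal": "french",
    "Cameroun": "french",
    "Equatorial Guinea": "spanish",
}


def prompt_language_mismatches(rows):
    """Return rows whose prompt_language differs from the expected lingua franca.

    Rows with a country missing from the lookup are also returned.
    """
    return [
        row for row in rows
        if EXPECTED_PROMPT_LANGUAGE.get(row["country"]) != row["prompt_language"]
    ]


# Hypothetical rows for illustration only.
sample = [
    {"country": "Nigeria", "prompt_language": "english"},
    {"country": "Senegal", "prompt_language": "english"},
]
print(prompt_language_mismatches(sample))
```

A mismatch is not necessarily an error in the data; it may simply mean the stored strings differ from the ones assumed here.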

Usage

By using this dataset, you agree to our Terms and Conditions of usage, which outline acceptable and prohibited uses, and acknowledge that any misuse may result in enforcement actions.

The dataset is organized into one config per language (named by its ISO 639-3 code). To load a specific language (e.g. Igala, `igl`):

```python
from datasets import load_dataset

dataset = load_dataset("naijavoices/voiceafrica-datasets", "igl")
```

Because the audio is large, it is advisable to load in streaming mode:

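```python
from datasets import load_dataset

# Streaming mode: rows are fetched lazily instead of downloading the full config
dataset = load_dataset("naijavoices/voiceafrica-datasets", "igl", streaming=True)
```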
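
When streaming, it is often enough to look at the first few rows rather than iterate over a whole config. The helper below is a minimal sketch that accepts any iterable of row dictionaries, so it can be tried without network access; `summarize_rows` is a hypothetical name, and the `"train"` split in the commented usage is an assumption about how the configs are laid out.

```python
from itertools import islice


def summarize_rows(rows, limit=100):
    """Consume at most `limit` rows and return (row count, total seconds)."""
    count, seconds = 0, 0.0
    # islice stops after `limit` rows, so a streamed dataset is never exhausted.
    for row in islice(rows, limit):
        count += 1
        seconds += float(row["duration"])
    return count, seconds


# Usage with a streamed config (requires network access).
# Assumption: the config exposes a split named "train".
#
#   from datasets import load_dataset
#   stream = load_dataset("naijavoices/voiceafrica-datasets", "igl",
#                         split="train", streaming=True)
#   print(summarize_rows(stream, limit=50))
```

Keeping the helper independent of the `datasets` library makes it easy to test on plain dictionaries before pointing it at the real stream.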

Data Collection & Methodology

The data was created through Lanfrica's data farming approach, working directly with indigenous speakers across Africa. A large share of spoken African languages live outside the internet; to ensure authenticity, recording and transcription were carried out by community members — ranging from the educated to the non-literate, villagers to city-dwellers — giving balanced views across the thematic areas.

Several roles ensured quality through multiple tiers of checks and balances:

Recorders — native speakers who recorded their spoken responses (≥1 hour each; ~10 recorders per language).
Transcribers — community members who transcribed the audio into text.
Reviewers (audio and transcription) — a first pass of cross-checking for accuracy.
Validators — a second tier of quality checks.
Quality Assurance Providers — a third tier of final quality assurance.

At each tier, samples could be flagged, corrected, or dropped to maintain high quality.

Notes on the Data

In the absence of formal external guidelines, the language communities established their own standards. A few nuances are worth understanding when using the data:

Square-bracket annotation. Conversational speech contains fillers (e.g. "em", "er"), stammering, and repetitions. Where possible, transcribers captured these and enclosed them in square brackets [ ], using their language expertise to decide what counts as a nuance.
Plosives / mic "blowing". Due to the phonetics of some languages, plosive sounds occasionally produced a faint blowing effect into the microphone. Where intelligibility was unaffected, the sample was kept as valid (~1.23% of valid samples).
Audio cut-offs. In a small share of recordings (~2.46%), particularly from Senegal, the final word was clipped — a natural phenomenon of how speakers ended their turns. Transcribers either marked the interruption with brackets or inferred the final word.
Faint background noise. Some recordings include faint background noise that does not affect the speaker's clarity; these were retained as valid.
Orthography & dialects. Many African languages lack a single standardized orthography and span multiple mutually-intelligible dialects. The dataset settles on common, widely-intelligible writing styles, while preserving some dialectal and spelling variety to enrich coverage. For example, Bainouk is ~80% Bainouk-Gunyaamolo + 20% other varieties; Bube is ~70% Bube Reebola + 30% other varieties.
Borrowed / foreign words. New or non-indigenous concepts (e.g. "website", "internet") are rendered the way each community commonly refers to them, so some audio and transcripts contain nativized foreign words.
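
For tasks that need clean text, such as computing word error rate, the square-bracket annotations described above may need to be normalized first. The sketch below shows two options under stated assumptions: dropping bracketed spans entirely, or keeping their content and removing only the brackets. The function name `normalize_transcript` and the English example are hypothetical, and nested brackets are not handled.

```python
import re

# Matches one bracketed span without nested brackets, e.g. "[em]".
_BRACKETED = re.compile(r"\[[^\[\]]*\]")


def normalize_transcript(text, keep_annotated=False):
    """Remove square-bracket annotations from a transcript.

    keep_annotated=False drops the bracketed spans (fillers, repetitions).
    keep_annotated=True keeps their content and removes only the brackets.
    """
    if keep_annotated:
        cleaned = text.replace("[", " ").replace("]", " ")
    else:
        cleaned = _BRACKETED.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(cleaned.split())


# Hypothetical example, not taken from the dataset.
example = "[em] I went [er] home"
print(normalize_transcript(example))
print(normalize_transcript(example, keep_annotated=True))
```

Which option is appropriate depends on the task: the brackets mark speech that was actually uttered, and in some rows they also mark a clipped final word, so dropping them makes the text diverge from the audio.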

License

The VoiceAfrica dataset is licensed under CC BY-NC-SA 4.0. It is free for research (non-commercial) use, with proper credit to the community. For commercial use, see the Lanfrica / NaijaVoices membership tiers, which offer special commercial waivers.

Acknowledgements

VoiceAfrica is a collaboration between Lanfrica and Meta, made possible by the data farmers — the recorders, transcribers, reviewers, validators, and quality assurance providers — and the language communities who contributed their voices.

Citation 📖

@misc{voiceafrica2025,
  title={The VoiceAfrica Dataset: Conversational, Culturally-Rich Speech Data for Under-Represented African Languages},
  author={Emezue, Chris and {Lanfrica Data Farmers} and {VoiceAfrica Community}},
  year={2025},
  publisher={Lanfrica},
  howpublished={\url{https://huggingface.co/datasets/naijavoices/voiceafrica-datasets}}
}
