VoiceAfrica Dataset
Introduction
Welcome to the VoiceAfrica dataset. VoiceAfrica is a Lanfrica–Meta collaboration (code-named VoiceAfrica 1) that set out to create authentic, conversational speech and expert-curated transcriptions for 11 under-represented African languages. The dataset contains ~125 hours of speech (roughly 10 to 13 hours per language) across 16,383 audio samples from 118 speakers in four countries.
Unlike read-speech corpora, VoiceAfrica uses a natural speech generation approach: native speakers were given prompts (questions spanning diverse, modern thematic areas) and spoke freely for 1–2 minutes per prompt. Long recordings were naturally split into short, model-compatible chunks (≤30 s), so no forced alignment is required downstream. The result is conversational, real-world, culturally rich data that includes the natural fillers, repetitions, and code-mixing of everyday speech.
For more about Lanfrica's data farming approach, visit https://lanfrica.com. By using this dataset, you acknowledge that you have read the data usage terms and conditions and agree to use the dataset in accordance with them.
Languages
The dataset covers the following languages. Each language is a separate config (named by its ISO 639-3 code).
| Config | Language | Country | Samples | Hours | Speakers |
|---|---|---|---|---|---|
| bcz | Bainouk-Gunyaamolo | Senegal | 1,543 | 10.8 | 13 |
| ble | Balanta-Kentohe | Senegal | 1,755 | 12.9 | 12 |
| bvb | Bube | Equatorial Guinea | 1,404 | 11.3 | 10 |
| fan | Fang | Cameroun | 1,473 | 11.5 | 10 |
| igl | Igala | Nigeria | 1,570 | 12.1 | 10 |
| knc | Central Kanuri | Nigeria | 1,505 | 11.0 | 10 |
| krx | Karon | Senegal | 1,279 | 10.2 | 10 |
| nup | Nupe-Nupe-Tako | Nigeria | 1,604 | 12.6 | 10 |
| pov | Upper Guinea Crioulo | Senegal | 1,430 | 10.2 | 12 |
| srr | Serer | Senegal | 1,309 | 10.4 | 11 |
| urh | Urhobo | Nigeria | 1,511 | 11.9 | 10 |
| Total | 11 languages | 4 countries | 16,383 | ~124.9 | 118 |
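The totals row can be re-derived from the per-language figures. The sketch below copies the numbers from the table; the `STATS` name and tuple layout are illustrative only and are not part of the dataset. Summing speakers across languages assumes no speaker appears in more than one language, which is consistent with the stated total of 118.

```python
# Per-language figures copied from the table: (samples, hours, speakers).
STATS = {
    "bcz": (1543, 10.8, 13),
    "ble": (1755, 12.9, 12),
    "bvb": (1404, 11.3, 10),
    "fan": (1473, 11.5, 10),
    "igl": (1570, 12.1, 10),
    "knc": (1505, 11.0, 10),
    "krx": (1279, 10.2, 10),
    "nup": (1604, 12.6, 10),
    "pov": (1430, 10.2, 12),
    "srr": (1309, 10.4, 11),
    "urh": (1511, 11.9, 10),
}

total_samples = sum(samples for samples, _, _ in STATS.values())
total_hours = round(sum(hours for _, hours, _ in STATS.values()), 1)
total_speakers = sum(speakers for _, _, speakers in STATS.values())

# Average chunk length implied by the totals, in seconds.
mean_chunk_seconds = total_hours * 3600 / total_samples

print(total_samples, total_hours, total_speakers)
print(f"mean chunk length: {mean_chunk_seconds:.1f} s")
```

The implied mean chunk length sits just under the 30-second cap, which suggests that most chunks are close to the maximum length.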
There was a deliberate effort to balance gender representation. However, in some Muslim-dominated localities (e.g. for Kanuri, Serer, Balanta-Kentohe, Bainouk, Karon) female participation was lower; the dataset reflects this reality.
Metadata
Each row contains the audio together with rich speaker, prompt, and transcript metadata. Use the Dataset Viewer to explore it. The columns are:
| Column | Type | Description |
|---|---|---|
| audio | audio | The speech recording: a natural ≤30s chunk of a speaker's spoken response. |
| speaker_id | string | Anonymized speaker identifier (unique per language, e.g. spk01). |
| gender | string | Speaker gender (Female / Male). |
| age_range | string | Speaker age range (18-29 or 30-over). |
| country | string | Country where the recording was made. |
| language_name | string | Full human-readable language name (e.g. Bainouk-Gunyaamolo). |
| language_code | string | ISO 639-3 language code; also the config name. |
| prompt | string | The prompt (question) as presented to the speaker, in the lingua franca of their region (English, French, or Spanish). |
| prompt_english | string | The English version of the prompt, provided consistently for every row regardless of the language it was presented in. |
| prompt_language | string | The language the prompt was presented in (english, french, or spanish). |
| prompt_id | string | Identifier of the prompt (e.g. e001, f001, s001). |
| domain | string | Thematic area of the prompt (e.g. food, family, government, music). |
| transcript | string | Expert human transcription of the audio chunk. |
| audio_filename | string | Original filename of the audio chunk. |
| audio_sequence | int | Order of this chunk within the speaker's full response to the prompt. |
| duration | float | Duration of the audio chunk in seconds. |
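Because each row is one chunk of a longer response, the `audio_sequence` column can be used to put a full response back together. The sketch below works on plain dictionaries with the column names listed above. It assumes that `(language_code, speaker_id, prompt_id)` identifies one response, which the card does not state explicitly. The sample rows are made up for illustration, and `rebuild_responses` is a hypothetical helper.

```python
from collections import defaultdict


def rebuild_responses(rows):
    """Group chunk rows into full responses, ordered by audio_sequence."""
    groups = defaultdict(list)
    for row in rows:
        # Assumption: this triple identifies one spoken response.
        key = (row["language_code"], row["speaker_id"], row["prompt_id"])
        groups[key].append(row)

    responses = {}
    for key, chunks in groups.items():
        chunks.sort(key=lambda r: r["audio_sequence"])
        responses[key] = {
            "transcript": " ".join(c["transcript"] for c in chunks),
            "duration": sum(c["duration"] for c in chunks),
            "n_chunks": len(chunks),
        }
    return responses


# Made-up rows for illustration only (not real dataset content).
rows = [
    {"language_code": "igl", "speaker_id": "spk01", "prompt_id": "e001",
     "audio_sequence": 2, "transcript": "second part", "duration": 12.5},
    {"language_code": "igl", "speaker_id": "spk01", "prompt_id": "e001",
     "audio_sequence": 1, "transcript": "first part", "duration": 27.5},
    {"language_code": "igl", "speaker_id": "spk02", "prompt_id": "e001",
     "audio_sequence": 1, "transcript": "another speaker", "duration": 20.0},
]

responses = rebuild_responses(rows)
for key, info in responses.items():
    print(key, info["n_chunks"], info["duration"], info["transcript"])
```

Grouping on the speaker and prompt rather than on `audio_filename` avoids relying on a filename pattern, which the card does not document.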
A note on prompts and languages
The same 600 prompts were used across all 11 languages, spanning 19 thematic areas — hobbies, animals, food, farming, music, clothing, conversations, folklore, government, social media, education, family, social, art and crafts, places, greetings, festivals, occupation, and medical. Prompts were presented to speakers in the most widely-understood lingua franca of their region (English for Nigeria, French for Senegal/Cameroun, Spanish for Equatorial Guinea). The prompt field preserves what the speaker actually saw, while prompt_english gives a consistent English reference across the whole dataset.
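The country-to-lingua-franca rule described above can be written down as a small lookup and used as a sanity check on loaded rows. This is a sketch under two assumptions: that the `country` column uses the same spellings as the Languages table, and that `prompt_language` holds the lowercase values listed in the Metadata table. The helper name and sample rows are hypothetical.

```python
# Lingua franca per country, as described in the text.
# Assumption: the `country` column uses these exact spellings.
PROMPT_LANGUAGE_BY_COUNTRY = {
    "Nigeria": "english",
    "Senegal": "french",
    "Cameroun": "french",
    "Equatorial Guinea": "spanish",
}


def prompt_language_mismatches(rows):
    """Return rows whose prompt_language differs from the regional lingua franca.

    Rows with a country missing from the lookup are reported as mismatches too.
    """
    return [
        row for row in rows
        if PROMPT_LANGUAGE_BY_COUNTRY.get(row["country"]) != row["prompt_language"]
    ]


# Made-up rows for illustration only.
sample_rows = [
    {"country": "Nigeria", "prompt_language": "english"},
    {"country": "Senegal", "prompt_language": "french"},
    {"country": "Equatorial Guinea", "prompt_language": "french"},
]

for row in prompt_language_mismatches(sample_rows):
    print("unexpected prompt language:", row)
```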
Usage
By using this dataset, you agree to our Terms and Conditions of usage outlining acceptable and prohibited uses, and acknowledge that any misuse may result in enforcement actions.
The dataset is organized into one config per language (named by its ISO 639-3 code). To load a specific language (e.g. Igala, igl):
```python
from datasets import load_dataset

dataset = load_dataset("naijavoices/voiceafrica-datasets", "igl")
```
Because the audio is large, it is advisable to load in streaming mode:
```python
from datasets import load_dataset

# Streaming mode
dataset = load_dataset("naijavoices/voiceafrica-datasets", "igl", streaming=True)
```
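A streamed dataset is consumed by iterating over it, so metadata-only passes can be written as ordinary loops. The sketch below is one possible way to do this. `total_hours` and `stream_language` are hypothetical helpers, and because this card does not name the splits, the code simply takes the first split it finds. `stream_language` needs the `datasets` package and network access, so its call is left commented out.

```python
from itertools import islice


def total_hours(rows, limit=None):
    """Sum the `duration` column (seconds) over an iterable of rows, in hours."""
    if limit is not None:
        rows = islice(rows, limit)
    return sum(row["duration"] for row in rows) / 3600


def stream_language(config):
    """Stream one language config with the audio column dropped from each row."""
    from datasets import load_dataset  # imported lazily: only needed here

    splits = load_dataset(
        "naijavoices/voiceafrica-datasets", config, streaming=True
    )
    # Assumption: the first split is the one of interest (split names are
    # not stated in this card).
    first_split = next(iter(splits.values()))
    return first_split.remove_columns(["audio"])


# Example (requires network access):
# print(total_hours(stream_language("igl"), limit=100))
```

Keeping `total_hours` independent of the loader means it can be tried on any iterable of dictionaries that have a `duration` key before touching the real data.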
Data Collection & Methodology
The data was created through Lanfrica's data farming approach, working directly with indigenous speakers across Africa. A large share of spoken African languages live outside the internet; to ensure authenticity, recording and transcription were carried out by community members — ranging from the educated to the non-literate, villagers to city-dwellers — giving balanced views across the thematic areas.
Several roles ensured quality through multiple tiers of checks-and-balances:
- Recorders — native speakers who recorded their spoken responses (≥1 hour each; ~10 recorders per language).
- Transcribers — community members who transcribed the audio into text.
- Reviewers (audio and transcription) — a first pass of cross-checking for accuracy.
- Validators — a second tier of quality checks.
- Quality Assurance Providers — a third tier of final quality assurance.
At each tier, samples could be flagged, corrected, or dropped to maintain high quality.
Notes on the Data
In the absence of formal external guidelines, the language communities established their own standards. A few nuances are worth understanding when using the data:
- Square-bracket annotation. Conversational speech contains fillers (e.g. "em", "er"), stammering, and repetitions. Where possible, transcribers captured these and enclosed them in square brackets [ ], using their language expertise to decide what counts as a nuance.
- Plosives / mic "blowing". Due to the phonetics of some languages, plosive sounds occasionally produced a faint blowing effect into the microphone. Where intelligibility was unaffected, the sample was kept as valid (~1.23% of valid samples).
- Audio cut-offs. In a small share of recordings (~2.46%), particularly from Senegal, the final word was clipped — a natural phenomenon of how speakers ended their turns. Transcribers either marked the interruption with brackets or inferred the final word.
- Faint background noise. Some recordings include faint background noise that does not affect the speaker's clarity; these were retained as valid.
- Orthography & dialects. Many African languages lack a single standardized orthography and span multiple mutually-intelligible dialects. The dataset settles on common, widely-intelligible writing styles, while preserving some dialectal and spelling variety to enrich coverage. For example, Bainouk is ~80% Bainouk-Gunyaamolo + 20% other varieties; Bube is ~70% Bube Reebola + 30% other varieties.
- Borrowed / foreign words. New or non-indigenous concepts (e.g. "website", "internet") are rendered the way each community commonly refers to them, so some audio and transcripts contain nativized foreign words.
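Depending on the task, the square-bracket annotations described above may need to be removed or unwrapped before training or scoring. The sketch below shows both options. It assumes brackets are never nested, which the card does not guarantee, and the example sentence is made up. Since brackets are also used to mark clipped endings in some transcripts, stripping them removes those markers as well.

```python
import re

# Matches one [bracketed] span; assumes brackets are not nested.
BRACKETED = re.compile(r"\[[^\[\]]*\]")


def strip_annotations(transcript):
    """Remove [bracketed] spans and collapse the leftover whitespace."""
    without = BRACKETED.sub(" ", transcript)
    return " ".join(without.split())


def unwrap_annotations(transcript):
    """Keep the bracketed words but drop the brackets themselves."""
    return " ".join(transcript.replace("[", " ").replace("]", " ").split())


# Made-up example, not taken from the dataset.
example = "[em] I went [er] to the market"
print(strip_annotations(example))
print(unwrap_annotations(example))
```

Stripping suits a cleaned-up reference text, while unwrapping keeps the fillers and repetitions for models that should transcribe speech verbatim.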
License
The VoiceAfrica dataset is licensed under CC BY-NC-SA 4.0. It is free for research (non-commercial) use, with proper credit to the community. For commercial interests, please check out the Lanfrica / NaijaVoices membership tiers which offer special commercial waivers.
Acknowledgements
VoiceAfrica is a collaboration between Lanfrica and Meta, made possible by the data farmers — the recorders, transcribers, reviewers, validators, and quality assurance providers — and the language communities who contributed their voices.
Citation 📖
```bibtex
@misc{voiceafrica2025,
  title={The VoiceAfrica Dataset: Conversational, Culturally-Rich Speech Data for Under-Represented African Languages},
  author={Emezue, Chris and {Lanfrica Data Farmers} and {VoiceAfrica Community}},
  year={2025},
  publisher={Lanfrica},
  howpublished={\url{https://huggingface.co/datasets/naijavoices/voiceafrica-datasets}}
}
```