CAFE: Algerian Arabic, French, and English Code-Switched Conversational Speech Dataset
Description
CAFE: A Spontaneous Code-Switching Speech Dataset in Algerian Dialect, French, and English (Version 2)
The CAFE dataset (Code-switching Algerian French English) addresses the critical scarcity of publicly available resources capturing spontaneous, real-world code-switching among Algerian Arabic (Darja), French, and English. Comprising approximately 37 hours of in vivo human–human dialogue from over 100 speakers, CAFE reflects authentic conversational dynamics across diverse topics such as science, technology, sports, and social issues.
All audio is provided as 16 kHz, 16-bit PCM, mono WAV files, which keeps the format consistent for ASR and other speech processing tasks.
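The stated format can be verified on any downloaded file before it is fed to an ASR pipeline. The following is a minimal sketch using only Python's standard `wave` module; the function name `check_wav_format` and the constant names are illustrative and not part of the dataset.

```python
import wave

# Expected format as stated in the dataset description.
EXPECTED_RATE_HZ = 16_000
EXPECTED_SAMPLE_WIDTH_BYTES = 2   # 16-bit PCM
EXPECTED_CHANNELS = 1             # mono


def check_wav_format(path):
    """Return a list of human-readable mismatches (empty if the file conforms).

    `path` may be a file path or an open binary file object.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE_HZ:
            problems.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH_BYTES:
            problems.append(f"sample width is {8 * wav.getsampwidth()} bits")
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"channel count is {wav.getnchannels()}")
    return problems
```

An empty list means the file matches the 16 kHz, 16-bit, mono specification; any other result lists what differs.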
Dataset Structure
CAFE is organized into two primary tiers, CAFE_small and CAFE_large, to support research needs ranging from high-precision evaluation to large-scale pretraining.
CAFE_small/
audio/: 170 mono-channel WAV files.
transcripts_raw/: CSV file containing raw manual transcriptions.
transcripts_ZAEBUC/: JSON files with linguistically enriched annotations, including:
- Code-switching boundaries,
- Dialect intensity levels (L0–L4),
- Non-lexical event tagging,
cafe-small-clean/ (2h 18m):
Audio and aligned transcriptions without overlapping speech, suitable for clean ASR benchmarking.
cafe-small-overlap/ (17m):
Contains 23 files with time-stamped overlapping speech regions.
audio_processed/: Overlap segments excised.
processed_transcripts.json: Includes the original transcription, the processed version, and metadata on removed segments (text + timestamps).
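To illustrate what excising time-stamped overlap regions involves, here is a minimal sketch that removes a set of time intervals from a sequence of audio samples. It is not the authors' processing code: the function `excise_segments` and the `(start_s, end_s)` tuple layout are assumptions made for this example, and the actual fields in processed_transcripts.json may differ.

```python
def excise_segments(samples, sample_rate, segments):
    """Return `samples` as a list with every (start_s, end_s) interval removed.

    `segments` holds time spans in seconds; they may be unsorted or overlapping.
    Spans are clamped to the length of the recording.
    """
    kept = []
    cursor = 0  # index of the first sample not yet handled
    for start_s, end_s in sorted(segments):
        # Convert seconds to sample indices, clamped to [cursor, len(samples)].
        start = max(cursor, min(len(samples), round(start_s * sample_rate)))
        end = max(start, min(len(samples), round(end_s * sample_rate)))
        kept.extend(samples[cursor:start])  # keep audio before the span
        cursor = end                        # skip the span itself
    kept.extend(samples[cursor:])           # keep the tail after the last span
    return kept
```

At the dataset's 16 kHz sample rate, a span from 1.0 s to 1.5 s corresponds to sample indices 16000 through 23999, so 8000 samples would be dropped.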
CAFE_large/
This tier is found in the version 1.0.0 zip file. The current version includes the new cafe-small subsets: https://zenodo.org/records/15642786
A larger subset (~34h 35m, 3,588 files) with pseudo-labeled transcriptions, generated using a Whisper-based pipeline enhanced with pyannote speaker diarization. Suitable for semi-supervised learning, pretraining, and domain adaptation.
audio_large/: Mono-channel WAV files.
pseudo_labels/: CSV files mapping each audio file to its pseudo-transcription.
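A pseudo-label CSV can be loaded into a file-to-transcription mapping along the following lines. This is a sketch under assumptions: the description does not specify the CSV header, so the column names `file` and `text` are placeholders to be replaced with the actual header, and `load_pseudo_labels` is a hypothetical helper.

```python
import csv


def load_pseudo_labels(csv_file, file_column="file", text_column="text"):
    """Map audio file name -> pseudo-transcription from an open CSV file.

    The default column names are placeholders (assumed, not documented);
    adjust them to match the actual CSV header.
    """
    labels = {}
    for row in csv.DictReader(csv_file):
        labels[row[file_column]] = row[text_column]
    return labels
```

When calling it on a real file, open the CSV with `newline=""` and `encoding="utf-8"` so that Arabic and French text is read correctly. Because the transcriptions are pseudo-labels, they are better treated as noisy supervision than as ground truth.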