Published September 1, 2025 | Version v2.0.0
Dataset Open

CAFE: Algerian Arabic, French, and English Code-Switched Conversational Speech Dataset

Description

CAFE: A Spontaneous Code-Switching Speech Dataset in Algerian Dialect, French, and English (Version 2)

 

The CAFE dataset (Code-switching Algerian French English) addresses the critical scarcity of publicly available resources capturing spontaneous, real-world code-switching among Algerian Arabic (Darja), French, and English. Comprising approximately 37 hours of in vivo human–human dialogue from over 100 speakers, CAFE reflects authentic conversational dynamics across diverse topics such as science, technology, sports, and social issues.

All audio is preserved at 16 kHz, 16-bit PCM, mono WAV format, ensuring consistency for ASR and speech processing tasks.
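A quick way to confirm that a downloaded file matches the stated format is to read its WAV header. The sketch below uses only Python's standard `wave` module; the function name `matches_cafe_format` is illustrative and not part of the dataset.

```python
import wave


def matches_cafe_format(path, rate=16000, sample_width_bytes=2, channels=1):
    """Return True if the WAV header at `path` reports 16 kHz, 16-bit, mono.

    The defaults mirror the format stated in the description; the function
    name and parameters are illustrative assumptions.
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == rate
            and wav.getsampwidth() == sample_width_bytes  # 2 bytes = 16 bits
            and wav.getnchannels() == channels
        )
```

Calling `matches_cafe_format("some_file.wav")` on each file before training can catch a resampled or stereo file early; the path here is a placeholder.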

 
 
 

Dataset Structure

CAFE is organized into two primary tiers to support diverse research needs, from high-precision evaluation to large-scale pretraining.

CAFE_small/

  • audio/: 170 mono-channel WAV files.
  • transcripts_raw/: CSV file containing raw manual transcriptions.
  • transcripts_ZAEBUC/: JSON files with linguistically enriched annotations, including:
    • Code-switching boundaries
    • Dialect intensity levels (L0–L4)
    • Non-lexical event tagging

cafe-small-clean/ (2h 18m):
Audio and aligned transcriptions without overlapping speech, suitable for clean ASR benchmarking.

cafe-small-overlap/ (17m):
Contains 23 files with time-stamped overlapping speech regions.

  • audio_processed/: Audio files with the overlapping segments excised.
  • processed_transcripts.json: Includes original transcription, processed version, and metadata on removed segments (text + timestamps).
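To make the idea of excising time-stamped overlap regions concrete, here is a minimal sketch that removes a list of `(start, end)` intervals, given in seconds, from a sequence of audio samples. It is not the authors' pipeline: the function name, the interval representation, and the merging of intersecting intervals are all assumptions made for illustration.

```python
def excise_intervals(samples, intervals, rate=16000):
    """Return a list of `samples` with the (start_s, end_s) intervals removed.

    `intervals` holds times in seconds; `rate` is the sampling rate in Hz.
    Intersecting intervals are handled by tracking how far we have cut.
    """
    n = len(samples)
    # Convert seconds to sample indices and clamp them to the signal length.
    spans = sorted(
        (max(0, int(round(start * rate))), min(n, int(round(end * rate))))
        for start, end in intervals
    )
    kept = []
    cursor = 0  # index of the first sample not yet kept or removed
    for start, end in spans:
        if start > cursor:
            kept.extend(samples[cursor:start])
        cursor = max(cursor, end)
    kept.extend(samples[cursor:])
    return kept
```

For example, with `rate=16000`, removing `[(1.0, 1.5)]` from a 3-second signal leaves 2.5 seconds of audio; the removed text and timestamps would then be recorded separately, as the metadata in `processed_transcripts.json` does.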
 

CAFE_large/

This tier is available in the version 1.0.0 zip file. The current version includes the new cafe-small subsets: https://zenodo.org/records/15642786

A larger subset (~34h 35m, 3,588 files) with pseudo-labeled transcriptions, generated using a Whisper-based pipeline enhanced with pyannote speaker diarization. Suitable for semi-supervised learning, pretraining, and domain adaptation.

  • audio_large/: Mono-channel WAV files.
  • pseudo_labels/: CSV files mapping each audio file to its pseudo-transcription.
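As an illustration of how the pseudo-label CSV files might be consumed, the sketch below builds a dictionary from audio file name to pseudo-transcription. The column names `audio_file` and `transcription` are hypothetical, since the description does not state the actual header; check the CSV files and adjust the arguments accordingly.

```python
import csv


def load_pseudo_labels(csv_file, audio_col="audio_file", text_col="transcription"):
    """Map each audio file name to its pseudo-transcription.

    `csv_file` is an open text file (or any iterable of CSV lines).
    The default column names are assumptions, not the dataset's documented schema.
    """
    reader = csv.DictReader(csv_file)
    return {row[audio_col]: row[text_col] for row in reader}
```

When reading from disk, opening the file with `open(path, newline="", encoding="utf-8")` is the usual choice for CSV data that mixes Arabic and Latin script.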
 

 

 

Files

cafe-small-clean.zip

Files (691.2 MB)

Checksum Size
md5:3aa5127394f76c87665880f59c12f214 292.3 MB
md5:44f23bbbf5af71cb39f3f9a4e9d08c25 77.9 MB
md5:c03ab09435ab8cd95e62c38c9c72dbd1 320.9 MB
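The MD5 checksums listed above can be used to verify a download. A minimal sketch with Python's standard `hashlib` follows; the function name `md5_of_file` is illustrative, and the file is read in chunks so that archives of several hundred megabytes do not need to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Compare the returned string with the value after the `md5:` prefix for the corresponding file; a mismatch indicates an incomplete or corrupted download.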