CAFE: Algerian Arabic, French, and English Code-Switched Conversational Speech Dataset

Lachemat, Houssam Eddine-Othman; Abbas, Akli; Oukas, Nourredine; El Kheir, Yassine; Haboussi, Samia; Chowdhury, Shammur Absar

doi:10.5281/zenodo.16964503

Published September 1, 2025 | Version v2.0.0

Dataset Open


1. University of Bouira
2. Qatar Computing Research Institute

CAFE: A Spontaneous Code-Switching Speech Dataset in Algerian Dialect, French, and English (Version 2)

The CAFE dataset (Code-switching Algerian French English) addresses the critical scarcity of publicly available resources capturing spontaneous, real-world code-switching among Algerian Arabic (Darja), French, and English. Comprising approximately 37 hours of in vivo human–human dialogue from over 100 speakers, CAFE reflects authentic conversational dynamics across diverse topics such as science, technology, sports, and social issues.

All audio is stored as 16 kHz, 16-bit PCM, mono WAV files, which keeps the format consistent for ASR and other speech processing tasks.
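The stated format can be checked on any downloaded file with Python's standard `wave` module. The following is a minimal sketch; the function names `wav_format` and `matches_cafe_format` are illustrative and not part of the dataset.

```python
import wave

# Format stated in the dataset description: 16 kHz, 16-bit PCM, mono.
EXPECTED = {"sample_rate": 16000, "sample_width_bytes": 2, "channels": 1}


def wav_format(path):
    """Return sample rate, sample width in bytes, and channel count of a WAV file."""
    with wave.open(str(path), "rb") as wav:
        return {
            "sample_rate": wav.getframerate(),
            "sample_width_bytes": wav.getsampwidth(),
            "channels": wav.getnchannels(),
        }


def matches_cafe_format(path):
    """True if the file header matches the format stated in the description."""
    return wav_format(path) == EXPECTED
```

Calling `matches_cafe_format("some_file.wav")` on each extracted file (the path here is a placeholder) flags any file whose header deviates from the stated format.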

Dataset Structure

CAFE is organized into two primary tiers to support diverse research needs, from high-precision evaluation to large-scale pretraining.

`CAFE_small/`

audio/: 170 mono-channel WAV files.
transcripts_raw/: CSV file containing raw manual transcriptions.
transcripts_ZAEBUC/: JSON files with linguistically enriched annotations, including:
- Code-switching boundaries
- Dialect intensity levels (L0–L4)
- Non-lexical event tagging

cafe-small-clean/ (2h 18m):
Audio and aligned transcriptions without overlapping speech, suitable for clean ASR benchmarking.

cafe-small-overlap/ (17m):
Contains 23 files with time-stamped overlapping speech regions.

audio_processed/: Audio files with the overlapping segments excised.
processed_transcripts.json: Includes original transcription, processed version, and metadata on removed segments (text + timestamps).

`CAFE_large/`

This tier is found in the version 1.0.0 zip file. The current version includes the new cafe-small subsets: https://zenodo.org/records/15642786

A larger subset (~34h 35m, 3,588 files) with pseudo-labeled transcriptions, generated using a Whisper-based pipeline enhanced with pyannote speaker diarization. Suitable for semi-supervised learning, pretraining, and domain adaptation.

audio_large/: Mono-channel WAV files.
pseudo_labels/: CSV files mapping each audio file to its pseudo-transcription.
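The total of approximately 37 hours quoted in the description can be cross-checked against the subset durations listed above. The sketch below simply adds them up, under the assumption (not stated in the source) that cafe-small-clean and cafe-small-overlap together account for all of CAFE_small.

```python
from datetime import timedelta

# Durations as quoted in this description.
# Assumption: clean + overlap together cover CAFE_small.
durations = {
    "cafe-small-clean": timedelta(hours=2, minutes=18),
    "cafe-small-overlap": timedelta(minutes=17),
    "CAFE_large": timedelta(hours=34, minutes=35),
}

total = sum(durations.values(), timedelta())
total_minutes = int(total.total_seconds() // 60)
print(f"{total_minutes // 60}h {total_minutes % 60:02d}m")  # prints "37h 10m"
```

Under that assumption the subsets sum to 37h 10m, which is consistent with the "approximately 37 hours" figure.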

Files


Files (691.2 MB)

Name	Size	MD5
cafe-small-clean.zip	292.3 MB	3aa5127394f76c87665880f59c12f214
CAFE-small-overlap.zip	77.9 MB	44f23bbbf5af71cb39f3f9a4e9d08c25
cafe-small.zip	320.9 MB	c03ab09435ab8cd95e62c38c9c72dbd1
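The MD5 checksums in the file listing can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify_download` are illustrative, and the local paths are whatever location the archives were saved to.

```python
import hashlib

# Checksums copied from the file listing above.
EXPECTED_MD5 = {
    "cafe-small-clean.zip": "3aa5127394f76c87665880f59c12f214",
    "CAFE-small-overlap.zip": "44f23bbbf5af71cb39f3f9a4e9d08c25",
    "cafe-small.zip": "c03ab09435ab8cd95e62c38c9c72dbd1",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, name):
    """True if the file at `path` matches the listed checksum for archive `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, `verify_download("cafe-small.zip", "cafe-small.zip")` returns `False` if the archive was truncated or corrupted in transit.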

