Building MoulSot: How We Curated 1,500 Hours of Moroccan Darija Speech, Selected the Best 80 Hours for Transcription, and Fine-Tuned Qwen3-ASR on Top of It
A technical deep-dive into the data pipeline and model training behind the MoulSot Moroccan Darija ASR system.

Introduction
Moroccan Darija is spoken by over 30 million people, yet it remains severely under-resourced in the speech technology landscape. Unlike Modern Standard Arabic, Darija is not a codified written language — it lives in spoken form, blending Arabic roots with Amazigh, French, and Spanish influences in ways that vary significantly between regions and speakers. Building an ASR system for Darija from scratch requires not just data, but high-quality data that reflects this linguistic diversity.
The MoulSot project set out to answer a deceptively simple question: can we build a large, clean, diverse Moroccan Darija speech dataset entirely from publicly available YouTube content, and use it to fine-tune a state-of-the-art ASR model? The answer turned out to be yes — but the path from raw web video to a running ASR system involved a multi-stage pipeline integrating six distinct AI models, human annotation, and careful engineering at every step.
This post walks through every stage of that pipeline: how we collected ~1,500 hours of raw Moroccan Darija speech from YouTube, scored and filtered it using state-of-the-art audio quality models, enhanced the worst segments with deep noise suppression, transcribed the best 80 hours, and finally fine-tuned Qwen3-ASR-1.7B on the resulting corpus.
The Challenge: Why Raw YouTube Audio Is Not Enough
YouTube hosts an enormous amount of Moroccan Darija speech, but raw video audio is almost never directly usable for ASR training. The problems are numerous:
- Background music and sound effects are pervasive in vlogs and entertainment content
- Multiple overlapping speakers appear in debates, panel discussions, and some podcasts
- Variable recording quality spans professional studio microphones to phone cameras in noisy environments
- Long silences, intros, and outros pad most videos with non-speech content
- Content diversity is uneven: some channels are extremely high-quality but narrow in topic; others are diverse but low-quality
A naive approach (download everything and transcribe it) would waste enormous human annotation effort on audio of unusable or marginal quality. Our pipeline inverts this logic: automate the quality assessment first, then direct human effort exclusively at the segments that matter most.

Stage 1: Channel Curation — Choosing the Right Sources
Before a single byte of audio was downloaded, we curated a list of 51 YouTube channels. This was not a mechanical step — it required judgment about what kinds of speech would be most valuable for ASR.
Selection criteria:
- The channel must be primarily in Moroccan Darija (not Modern Standard Arabic, not French)
- Code-switching with French is acceptable and desirable (it reflects real Darija usage)
- Preference for channels with clear, unscripted conversational speech: podcasts, interviews, talk shows, vlogs, and commentary
- Avoid channels dominated by music, dubbed content, or non-Moroccan dialects
- Seek speaker diversity: men and women, different ages, different regions
The final list (channels/list.txt) contains 51 channels across categories such as:
| Category | Examples |
|---|---|
| Lifestyle / Vlogs | @FaroukLife, @la3zawifamily, @walo_maydi3_family |
| Podcasts / Talks | @TouilTalks, @GeeksBlaBla01, @Thusspokeayman2 |
| Finance / Education | @Financebdarija, @AlmofidBdarija, @Drtabaliyoussef |
| Cooking / Beauty | @Choumicha, @asmaabeauty7208, @withnassima |
| News / Commentary | @Darkhabar, @MOTALATE, @naoufalchaara |
| Sports / Outdoors | @MarokiRider, @cyclingaroundafrica, @PILOTA.11 |
| Comedy / Entertainment | @BASSOU, @MrL3ami9, @IAMYAS |
We also maintained a separate regional subset (channels/chamali.txt) targeting Tangier-region (Chamali dialect) speech — a dialect notably distinct from Casablanca Darija.
This diversity was intentional. An ASR model trained on a single content domain or accent would generalize poorly. By spanning topics, ages, genders, and regions from the start, we built in robustness that no amount of post-hoc augmentation can substitute for.
Stage 2: Automated YouTube Crawling
Tool: pytubefix | Script: data/crawl.py
With the channel list in hand, we used pytubefix to download audio-only streams from each channel. For each channel, we fetched up to 70 videos, choosing audio-only streams to save storage and skip unnecessary video decoding.
videos = channel_ytb.videos[:min(70, len(channel_ytb.videos))]
for i, video in enumerate(videos):
    video.streams.get_audio_only().download(
        output_path=channel_dir,
        filename=f'moul_sout_{i}_audio'
    )
The crawl script deduplicates the channel list with set() (not shown in the excerpt above), so accidentally duplicated handles don't result in downloading the same channel twice.
Output: ~3,570 audio files across 51 channels.
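The deduplication and per-channel cap can be sketched without any network access. This is a minimal illustration, not the contents of data/crawl.py: the helper names `unique_handles` and `cap_videos` are hypothetical, and the sketch uses `dict.fromkeys` instead of `set()` so that the original order of the list is preserved.

```python
from typing import Iterable, List, Sequence, TypeVar

T = TypeVar("T")

MAX_VIDEOS_PER_CHANNEL = 70  # cap quoted in the text


def unique_handles(lines: Iterable[str]) -> List[str]:
    """Strip whitespace, drop blank lines, and remove duplicate handles (order kept)."""
    cleaned = (line.strip() for line in lines)
    return list(dict.fromkeys(handle for handle in cleaned if handle))


def cap_videos(videos: Sequence[T], limit: int = MAX_VIDEOS_PER_CHANNEL) -> List[T]:
    """Keep at most `limit` items from a plain Python sequence of videos."""
    return list(videos[:limit])


handles = unique_handles(["@Choumicha", "@BASSOU", "", "@Choumicha \n"])
print(handles)                              # ['@Choumicha', '@BASSOU']
print(len(cap_videos(list(range(200)))))    # 70
```

With 51 unique channels and a cap of 70 videos each, the upper bound is 51 × 70 = 3,570 files, which matches the reported output.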
Stage 3: Voice Activity Detection — Silero VAD
Tool: Silero VAD | Script: pipeline/vad/segment_silero.py
Raw audio files are rarely pure speech from start to finish. Intros, outros, music beds, long pauses, and off-mic conversations all mix in. The first step in making this audio useful is segmentation: slicing each file into short clips that contain only voiced speech.
What is Silero VAD?
Silero VAD is a lightweight neural voice activity detection model. Unlike classical energy-based VAD methods that simply threshold audio loudness, Silero VAD is a deep learning model trained to distinguish human speech from all other audio events — noise, music, ambient sound, background voices. It runs at real-time speed and is robust to noisy conditions.
Our Parameters
speech_timestamps = get_speech_timestamps(
    wav, model,
    return_seconds=True,
    max_speech_duration_s=30,   # no segment longer than 30s
    min_speech_duration_ms=1,   # keep even very brief voiced frames
)
Output
For each input file, the script produces:
- Short .wav segments under <channel>/silero_vad/<name>/{0.wav, 1.wav, ...}
- A JSON timestamp manifest
After running VAD over all downloaded audio: 338,227 speech segments in master.parquet — the raw pool from which everything downstream flows.
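With return_seconds=True, get_speech_timestamps returns a list of dictionaries with start and end keys expressed in seconds. The sketch below shows how such a list could be turned into a JSON manifest with a total speech duration. The function name `build_manifest` and the manifest field names are assumptions for illustration, not the layout written by segment_silero.py.

```python
import json
from typing import Dict, List


def build_manifest(source: str, timestamps: List[Dict[str, float]]) -> dict:
    """Map VAD timestamps to numbered clip names and sum their durations."""
    segments = []
    for index, ts in enumerate(timestamps):
        duration = round(ts["end"] - ts["start"], 3)
        segments.append({
            "file": f"{index}.wav",   # mirrors the 0.wav, 1.wav, ... naming
            "start": ts["start"],
            "end": ts["end"],
            "duration_s": duration,
        })
    total = round(sum(seg["duration_s"] for seg in segments), 3)
    return {
        "source": source,
        "num_segments": len(segments),
        "total_speech_s": total,
        "segments": segments,
    }


example = [{"start": 0.5, "end": 4.0}, {"start": 6.2, "end": 18.7}]
manifest = build_manifest("moul_sout_0_audio", example)
print(json.dumps(manifest, indent=2))
```

Summing total_speech_s across all manifests is one way to arrive at a pool-level figure such as the ~1,500 hours quoted in this post.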
Stage 4: Multi-Dimensional Audio Quality Scoring
Tools: SQUIM (torchaudio), Audiobox Aesthetics (Meta) | Script: pipeline/audio_aesthetics/audio_metadata.py
With 338,227 segments in hand, the critical question is: which ones are worth transcribing? We answer this by running every segment through two complementary quality models — one focused on signal-level speech quality, the other on subjective content and production quality.
4a. SQUIM — Non-Intrusive Speech Quality Estimation
SQUIM predicts three established perceptual speech quality metrics without needing a clean reference signal. This "blind" property is essential for our use case: we have no clean reference to compare against.
SQUIM is trained on the DNS 2020 dataset and learns to predict the same quality scores that would be obtained by comparing to a clean reference, even when no reference is available.
Metrics produced:
| Metric | Range | What it measures |
|---|---|---|
| PESQ (pesq_hyp) | 1.0 – 4.5 | Perceptual speech quality; how clean the speech sounds |
| STOI (stoi_hyp) | 0.0 – 1.0 | Short-time objective intelligibility; can a listener understand it? |
| SI-SDR (si_sdr_hyp) | dB | Scale-invariant signal-to-distortion ratio; positive = speech dominates over noise and distortion |
All audio is resampled to 16 kHz mono before scoring. torch.autocast is used for memory efficiency.
Key filtering threshold: pesq_hyp < 1.75 → routed to the speech enhancement module (Stage 5). pesq_hyp > 2.5 → highest-priority for transcription.
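The two thresholds translate into a simple three-way routing rule. The sketch below is illustrative only: the function name `route_segment` and the label for the middle band ("standard") are assumptions, while the numeric thresholds come from the text.

```python
ENHANCE_BELOW = 1.75    # pesq_hyp below this goes to speech enhancement
PRIORITY_ABOVE = 2.5    # pesq_hyp above this is highest priority for transcription


def route_segment(pesq_hyp: float) -> str:
    """Return the routing decision for one segment based on its predicted PESQ."""
    if pesq_hyp < ENHANCE_BELOW:
        return "enhance"
    if pesq_hyp > PRIORITY_ABOVE:
        return "priority"
    return "standard"   # assumed label for the band in between


for value in (1.2, 2.0, 3.1):
    print(value, route_segment(value))
```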
4b. Audiobox Aesthetics — Content and Production Quality
Audiobox Aesthetics (Meta Research) is built on a WavLM backbone — a 12-layer transformer with 768-dimensional embeddings, 12 attention heads, and 3,072-dimensional feed-forward layers — fine-tuned with human aesthetic ratings to predict four subjective quality dimensions.
Metrics produced (all scored 0–10):
| Metric | Key | What it captures |
|---|---|---|
| Content Enjoyment | CE | How engaging/enjoyable the audio content is |
| Content Usefulness | CU | How informative or practically useful the content is |
| Production Complexity | PC | Recording/production sophistication |
| Production Quality | PQ | Overall recording and production quality |
Observed score distributions (from our 338k-segment processing log):
| Metric | Low | Typical | High |
|---|---|---|---|
| CE | 2.6 | 4.0 – 4.5 | 5.8 |
| CU | 2.5 | 4.5 – 5.5 | 6.6 |
| PC | 1.5 | 1.8 – 2.5 | 6.3 |
| PQ | 4.3 | 5.3 – 6.5 | 7.6 |
Inference runs on GPU in batches of 8. Processing all 338k segments took ~60 minutes on a single GPU (observed in the processing log: 12:40 start → 13:40 aesthetics complete).
Stage 5: Speech Enhancement — Rescuing Low-Quality Segments
Tool: Facebook DNS64 | Script: pipeline/se_module/speech_enhancer.py
Segments with pesq_hyp < 1.75 (poor perceptual quality) are passed through Facebook's DNS64 deep noise suppression model.
DNS64 is a real-time speech enhancement model trained on the DNS Challenge dataset. It uses an encoder-decoder convolutional architecture with skip connections, processing audio in the time domain.
Process:
- Filter metadata to pesq_hyp < 1.75
- Load DNS64 model (denoiser.pretrained.dns64())
- For each low-quality segment: load → convert audio → denoise → save to silero_vad_enhanced/
- New metadata rows appended with _Enh suffix in the id column
- Output parquet contains both original rows and enhanced rows
Result: master.filter_metadata.with_enhanced.parquet — 116,825 rows
By recovering degraded segments we preserve speaker diversity (some speakers always record in non-ideal conditions) and increase dataset size without re-crawling.
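The metadata bookkeeping for enhanced segments can be sketched with plain dictionaries. This is an assumption-laden illustration rather than the code in speech_enhancer.py: the `path` column, the function name `add_enhanced_rows`, and resetting pesq_hyp to None until the enhanced clip is scored again are all hypothetical choices. The denoising call itself is omitted.

```python
from typing import Dict, List

PESQ_ENHANCE_THRESHOLD = 1.75   # threshold from the text


def add_enhanced_rows(rows: List[Dict]) -> List[Dict]:
    """Return the original rows plus one `_Enh` row per low-PESQ segment."""
    enhanced = []
    for row in rows:
        if row["pesq_hyp"] < PESQ_ENHANCE_THRESHOLD:
            new_row = dict(row)                      # copy, keep the original untouched
            new_row["id"] = f'{row["id"]}_Enh'       # provenance marker
            new_row["path"] = row["path"].replace(
                "silero_vad/", "silero_vad_enhanced/", 1
            )
            new_row["pesq_hyp"] = None               # assumed: re-scored after denoising
            enhanced.append(new_row)
    return rows + enhanced


rows = [
    {"id": "seg_0", "pesq_hyp": 1.2, "path": "chan/silero_vad/vid/0.wav"},
    {"id": "seg_1", "pesq_hyp": 3.0, "path": "chan/silero_vad/vid/1.wav"},
]
for row in add_enhanced_rows(rows):
    print(row["id"], row["path"])
```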
Stage 6: Speaker Diarization — Counting Voices
Tool: pyannote/speaker-diarization-3.1 | Script: pipeline/se_module/speaker_separator.py
pyannote/speaker-diarization-3.1 estimates the number of distinct speakers in each segment. This enables downstream filtering to keep single-speaker (interview/monologue) segments or flag multi-speaker clips.
The system uses a segmentation model + speaker embedding model + agglomerative clustering. For our purposes, we use it simply to count speakers: len(diarization.labels()) → num_speakers column.
Smart caching: Enhanced versions (_Enh) are processed first. When the non-enhanced original is encountered later, we reuse the speaker count — denoising doesn't change who is speaking.
Downstream use:
- num_speakers == 1: Clean single-speaker segment, ideal for attributed transcription
- num_speakers == 2: Dialogue/interview, valuable for conversational ASR
- num_speakers > 2: Panel/overlap, lower priority for initial transcription
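The caching idea can be sketched independently of pyannote by injecting the counting function. Everything here is a hypothetical illustration: `count_speakers_cached` and `count_fn` are made-up names, and in the real pipeline `count_fn` would wrap the diarization call that yields len(diarization.labels()).

```python
from typing import Callable, Dict, Iterable

ENH_SUFFIX = "_Enh"


def count_speakers_cached(
    ids: Iterable[str], count_fn: Callable[[str], int]
) -> Dict[str, int]:
    """Count speakers per segment, reusing the enhanced clip's count for its original."""
    # Enhanced ids sort first so their counts are available when originals are reached.
    ordered = sorted(ids, key=lambda seg_id: not seg_id.endswith(ENH_SUFFIX))
    counts: Dict[str, int] = {}
    for seg_id in ordered:
        enhanced_id = seg_id + ENH_SUFFIX
        if enhanced_id in counts:
            counts[seg_id] = counts[enhanced_id]   # cache hit: skip diarization
        else:
            counts[seg_id] = count_fn(seg_id)
    return counts


def fake_count(seg_id: str) -> int:
    """Stand-in for the diarization model, used only for this demonstration."""
    return 2


print(count_speakers_cached(["seg_0", "seg_0_Enh", "seg_1"], fake_count))
```

In this example the counting function runs twice for three segments, because seg_0 inherits the count of seg_0_Enh.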
Stage 7: Human-in-the-Loop Feedback — Validating with Real Annotators
Tool: Argilla on HuggingFace Spaces | Scripts: data/argilla_data_feedback.py, data/Data Feedback Analysis.ipynb
Automated quality metrics, however sophisticated, measure proxies for what we actually want: audio that a human annotator can comfortably transcribe. We ran a human validation study on a preview subset to ground-truth our automated scores against human perception.
Annotation Setup
A sample of ~2,300 clips was uploaded to HuggingFace (atlasia/moulsot-preview) and annotators reviewed them through an Argilla interface. Three questions per clip:
rg.RatingQuestion(name="quality", values=[1, 2, 3, 4, 5]) # overall quality
rg.LabelQuestion(name="useful", labels=["Useful", "Not Useful"]) # ASR value
rg.LabelQuestion(name="duration", labels=["shorter better", "good", "longer better"])
Key Findings
After analysis of 2,318 completed annotations:
- Duration sweet spot: Annotators strongly preferred segments in the 3–15 second range. Clips shorter than ~2s were too short to be meaningful; clips approaching 30s often contained wandering speech that was hard to transcribe.
- Duration quality encoding:
  - −1: "shorter better" (clip is too long)
  - 0: "good" (duration is appropriate)
  - +1: "longer better" (clip is too short)
- The human feedback validated our SQUIM/Audiobox scores: clips annotators rated ≥ 3.5 out of 5 consistently had pesq_hyp > 2.0 and PQ > 5.0.
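A per-clip summary of the three Argilla questions could be computed as follows. This is a sketch under assumptions: the record layout (clip_id, quality, useful, duration) and the function name `summarise` are hypothetical, while the −1/0/+1 duration encoding and the quality_mean name follow the text.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List

# Duration encoding described above
DURATION_CODE = {"shorter better": -1, "good": 0, "longer better": 1}


def summarise(annotations: List[Dict]) -> Dict[str, Dict[str, float]]:
    """Aggregate annotator responses into per-clip means."""
    grouped = defaultdict(list)
    for annotation in annotations:
        grouped[annotation["clip_id"]].append(annotation)

    summary = {}
    for clip_id, items in grouped.items():
        summary[clip_id] = {
            "quality_mean": mean(i["quality"] for i in items),
            "duration_mean": mean(DURATION_CODE[i["duration"]] for i in items),
            "useful_share": mean(1 if i["useful"] == "Useful" else 0 for i in items),
        }
    return summary


demo = [
    {"clip_id": "c1", "quality": 4, "useful": "Useful", "duration": "good"},
    {"clip_id": "c1", "quality": 3, "useful": "Not Useful", "duration": "shorter better"},
]
print(summarise(demo))
```

A negative duration_mean signals that annotators found the clip too long, which is the pattern reported for clips approaching 30 seconds.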
Selecting the Best 80 Hours
The final selection synthesizes all quality signals — PESQ, STOI, SI-SDR, CE, CU, PQ, num_speakers, human ratings, and duration — into a ranked selection for transcription.
Multi-Criteria Ranking
- Hard exclusions: pesq_hyp < 1.0 even after enhancement, stoi_hyp < 0.6, si_sdr_hyp < -5 dB
- Quality promotion: production_quality > 5.0 and content_usefulness > 4.0
- Speaker preference: Single-speaker segments (num_speakers == 1) prioritized
- Duration filter: 3–25 second range preferred
- Channel diversity cap: No single channel exceeds a fixed share of the 80-hour budget
- Human rating boost: Clips with quality_mean ≥ 3.5 and useful == "Useful" receive strong selection weight
Result: ~80 hours of curated, diverse, high-quality Moroccan Darija speech — representing a 5.3% selection rate from the raw 1,500-hour pool — now published as atlasia/MoulSot-Full on HuggingFace, partitioned into a training config 100-gt-2.5 (all samples with pesq_hyp > 2.5) and a held-out test split.
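One way to combine these criteria is a filter, score, and greedy fill under a time budget. The sketch below is not the project's selection code: the additive weights, the 10% channel share, and the `channel` and `duration_s` field names are assumptions chosen for illustration. Only the hard-exclusion thresholds and the 80-hour budget come from the text.

```python
from collections import defaultdict
from typing import Dict, List

BUDGET_S = 80 * 3600    # 80-hour budget from the text
CHANNEL_SHARE = 0.10    # assumed value; the text only says "a fixed share"


def passes_hard_filters(seg: Dict) -> bool:
    """Apply the hard exclusions listed above."""
    return (seg["pesq_hyp"] >= 1.0 and seg["stoi_hyp"] >= 0.6
            and seg["si_sdr_hyp"] >= -5.0)


def score(seg: Dict) -> float:
    """Illustrative additive score; every weight here is an assumption."""
    value = seg["pesq_hyp"]
    if seg["production_quality"] > 5.0 and seg["content_usefulness"] > 4.0:
        value += 1.0
    if seg["num_speakers"] == 1:
        value += 0.5
    if 3.0 <= seg["duration_s"] <= 25.0:
        value += 0.5
    if seg.get("quality_mean", 0) >= 3.5 and seg.get("useful") == "Useful":
        value += 2.0
    return value


def select(segments: List[Dict], budget_s: float = BUDGET_S,
           channel_share: float = CHANNEL_SHARE) -> List[Dict]:
    """Greedily take the best-scoring segments within the budget and channel cap."""
    per_channel = defaultdict(float)
    chosen, used = [], 0.0
    candidates = sorted(filter(passes_hard_filters, segments), key=score, reverse=True)
    for seg in candidates:
        duration = seg["duration_s"]
        if used + duration > budget_s:
            continue   # would overflow the total budget
        if per_channel[seg["channel"]] + duration > channel_share * budget_s:
            continue   # would overflow this channel's share
        chosen.append(seg)
        used += duration
        per_channel[seg["channel"]] += duration
    return chosen
```

The channel cap is what prevents a handful of studio-quality podcasts from absorbing the whole budget once segments are ranked by quality.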
Stage 8: Transcription — Data Annotation with Gemini
Tool: Gemini 2.5 Pro (Google GenAI) | Script: data/transcriber.py | Benchmark: benchmarking/transcriber_bench.ipynb
With 80 hours of curated audio selected, the next challenge is transcription: converting every segment into accurate Darija text. Manual transcription of 80 hours would take hundreds of annotator-hours, so we used an LLM-based transcription approach — but first, we needed to choose the right model.
Benchmarking Transcription Models
We ran a head-to-head benchmark on 101 samples from an existing Darija speech dataset (DODa_audio) with human-verified ground truth transcriptions. Four models were evaluated:
| Model | Provider | Type |
|---|---|---|
| Gemini Flash (gemini-flash-latest) | Google | Multimodal LLM |
| GPT-4o Transcribe (gpt-4o-transcribe) | OpenAI | Dedicated ASR endpoint |
| ElevenLabs Scribe (scribe_v1) | ElevenLabs | Dedicated ASR model |
| Gemini 2.5 Flash (gemini-2.5-flash) | Google | Multimodal LLM |
Each model transcribed the same 101 audio clips, and outputs were compared against the ground truth darija_ar column.
Key Findings
1. Script consistency: Both GPT-4o Transcribe and ElevenLabs Scribe occasionally transliterated Darija into Latin script instead of Arabic. For example, the phrase "اش بان ليك" was rendered as Isbellik? by GPT-4o and Jbal-lik? by ElevenLabs — completely unusable for Arabic-script ASR training. The Gemini models consistently output Arabic script.
2. Darija-specific vocabulary: GPT-4o struggled with colloquial Darija expressions. For instance, "طبعا راه مكتئب" became "طبعا رهمك تائب" — a severe distortion. "صافي دابا نحاول ناقشها بلا ما نعيق" was mangled into "سافيدة بنحاول نقشها بلا من عيق". The Gemini models and ElevenLabs handled these phrases significantly better.
3. Robustness on short utterances: The older Gemini Flash sometimes failed on very short clips (returning "مممم" or empty strings), but Gemini 2.5 Flash showed marked improvement on these edge cases.
4. Overall ranking: Gemini 2.5 Flash and ElevenLabs Scribe produced the closest transcriptions to the ground truth across the 101-sample benchmark, with Gemini having the critical advantage of consistent Arabic-script output and native support for prompt-guided transcription behavior.
Based on these results, we selected the Gemini family as our transcription backbone — specifically upgrading to Gemini 2.5 Pro for production transcription, leveraging its stronger reasoning capabilities for the full 80-hour corpus.
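The script-consistency finding lends itself to an automatic check. The sketch below measures the share of alphabetic characters that fall in the basic Arabic Unicode block (U+0600 to U+06FF). It is an illustrative heuristic, not part of the benchmark notebook, and the function name `arabic_ratio` is hypothetical.

```python
def arabic_ratio(text: str) -> float:
    """Share of alphabetic characters that belong to the basic Arabic block."""
    letters = [char for char in text if char.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for char in letters if "\u0600" <= char <= "\u06FF")
    return arabic / len(letters)


print(arabic_ratio("اش بان ليك"))    # 1.0
print(arabic_ratio("Isbellik?"))     # 0.0
```

Because legitimate code-switching keeps French words in Latin script, any threshold on this ratio would need to tolerate mixed output. A ratio near zero on a clip whose reference is entirely Arabic is the clear failure signal.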
The Production Transcriber
The production transcription system (data/transcriber.py) is built around the Google GenAI SDK with several design choices informed by the benchmarking:
Prompt engineering for Darija:
prompt = """\
You are an audio transcriber. Your job is to transcribe the input audios
to text exactly as it was said by the user.
Rules:
* Audio is Moroccan Darija Arabic.
* Do not transliterate: if words from another language are uttered
(e.g. French, English), transcribe them in their original script.
* If the audio is not clear, respond with "<filename.wav>:" (empty).
"""
The prompt explicitly instructs the model to preserve code-switching, a critical Darija feature where speakers seamlessly mix Arabic with French. Words like "normalement", "en fait", or "c'est bon" should remain in Latin script, not be transliterated into Arabic. Script handling was a key failure mode in competing models during benchmarking, where Darija itself was sometimes rendered in Latin script.
Batch transcription for scale:
Transcribing 80 hours one clip at a time would be slow and expensive. The transcriber supports two modes:
- Interactive mode: sends multiple audio files in a single prompt with interleaved file labels, returning all transcriptions in one response. Used for small batches and testing.
- Batch API mode: packages groups of audio files (default: 10 per request) into JSONL batch files and submits them via Gemini's Batch API. Each request gets a unique custom_id (SHA-256 hash of the file paths) for deterministic deduplication. Batch jobs are logged to batch_requests/batch_jobs.jsonl for monitoring and recovery.
# Batch mode groups files and submits asynchronously
batch_job = transcriber.batch_transcribe(file_paths, files_per_request=10)
# Poll for completion
completed_job = transcriber.wait_for_batch(batch_job.name, sleep_interval=30)
# Parse results
results = transcriber.get_batch_results(completed_job)
Thinking mode: The transcriber uses Gemini's thinking feature at LOW level — enough to improve accuracy on ambiguous audio without significantly increasing latency or cost.
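The grouping and deterministic request ID used by batch mode can be sketched with the standard library alone. This is an illustration under assumptions: the helper names `chunk` and `make_custom_id` are hypothetical, and joining the paths with a newline before hashing is one possible convention, not necessarily the one used in data/transcriber.py.

```python
import hashlib
from typing import Iterator, List


def chunk(paths: List[str], size: int = 10) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` file paths."""
    for start in range(0, len(paths), size):
        yield paths[start:start + size]


def make_custom_id(paths: List[str]) -> str:
    """SHA-256 over the group's file paths; the same group always gets the same ID."""
    joined = "\n".join(paths)   # assumed separator
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


files = [f"clip_{i}.wav" for i in range(25)]
for group in chunk(files):
    print(len(group), make_custom_id(group)[:12])
```

Because the ID depends only on the file paths, resubmitting the same group after a failure yields the same custom_id, which makes duplicate requests easy to detect.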
Stage 9: Fine-Tuning Qwen3-ASR on MoulSot-Full
Model: Qwen/Qwen3-ASR-1.7B | Scripts: model/tune.py, model/prepare_moulsot.py, model/evaluate.py, model/push_model.py
With the 80-hour transcribed corpus published, the next step is fine-tuning a state-of-the-art ASR model on it. We chose Qwen3-ASR (Qwen Language Models, Alibaba Cloud) — a Whisper-style encoder-decoder architecture pre-trained on massively multilingual speech data including Arabic.
Why Qwen3-ASR?
Qwen3-ASR comes in two sizes — 0.6B and 1.7B parameters — and was pre-trained on a large multilingual corpus that includes standard Arabic. However, Moroccan Darija is substantially different from Modern Standard Arabic phonetically, lexically, and prosodically. Fine-tuning on MoulSot-Full adapts the model to:
- Darija-specific phonology (pharyngeal consonants with Amazigh influence, French loanwords)
- Code-switching patterns (seamless Darija/French mixing)
- Diverse recording conditions (studio to phone camera)
- Regional accent variation (Casablanca, Tangier, Marrakech, etc.)
Data Format
Qwen3-ASR expects a specific JSONL format:
{"audio": "/abs/path/to/clip.wav", "text": "language Arabic<asr_text>transcript"}
The language Arabic prefix is the language hint; <asr_text> is the separator tag. All audio is normalized to 16 kHz mono WAV.
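Building one line of this format is straightforward. The helper below is a hypothetical sketch (the name `to_sft_record` is not from model/prepare_moulsot.py). It writes JSON with ensure_ascii=False so that Arabic text stays readable in the JSONL file.

```python
import json


def to_sft_record(wav_path: str, transcript: str, language: str = "Arabic") -> str:
    """Serialise one training sample in the JSONL format described above."""
    record = {
        "audio": wav_path,
        "text": f"language {language}<asr_text>{transcript.strip()}",
    }
    return json.dumps(record, ensure_ascii=False)


print(to_sft_record("/abs/path/to/clip.wav", "صافي دابا نحاول ناقشها"))
```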
The dataset preparation script (model/prepare_moulsot.py) handles downloading from HuggingFace and converting to this format:
python model/prepare_moulsot.py --output_dir model/data --hf_token $HF_TOKEN
This produces:
model/data/
    train.jsonl    # training samples (MoulSot-Full, config: 100-gt-2.5)
    test.jsonl     # held-out test samples (MoulSot-Full, config: default)
    wavs/train/    # 16 kHz WAV files for training
    wavs/test/     # 16 kHz WAV files for evaluation
Training Architecture and Hyperparameters
Fine-tuning uses the official qwen3_asr_sft.py script from the QwenLM/Qwen3-ASR repository (auto-cloned into model/Qwen3-ASR/ on first run), launched via torchrun for multi-GPU training.
| Hyperparameter | Value | Notes |
|---|---|---|
| Base model | Qwen/Qwen3-ASR-1.7B | 1.7 billion parameters |
| Epochs | 5 | With best-checkpoint saving per epoch |
| Batch size (per GPU) | 8 | |
| Gradient accumulation | 2 steps | Effective batch = 16 × GPUs |
| Learning rate | 5 × 10⁻⁵ | |
| LR scheduler | Cosine decay | |
| Warmup ratio | 0.05 | 5% of total steps as linear warmup |
| Audio sample rate | 16,000 Hz | |
| Max audio duration | 30 seconds | Longer clips skipped |
| Optimizer | AdamW | (qwen3_asr_sft.py default) |
| Precision | bf16 | GPU-native bfloat16 |
| Save strategy | Per epoch | Keeps best 3 checkpoints |
| Workers | 4 | DataLoader workers |
Multi-GPU launch (direct):
torchrun --nproc_per_node=<NUM_GPUS> model/Qwen3-ASR/finetuning/qwen3_asr_sft.py \
    --model_path Qwen/Qwen3-ASR-1.7B \
    --train_file model/data/train.jsonl \
    --eval_file model/data/test.jsonl \
    --output_dir model/output_moulasot_l40/1.7B \
    --epochs 5 \
    --batch_size 8 \
    --grad_acc 2 \
    --lr 5e-5 \
    --warmup_ratio 0.05 \
    --lr_scheduler_type cosine
Or using the all-in-one Python launcher (handles repo clone, data prep, and multi-GPU automatically):
python model/tune.py --model 1.7B --num_gpus 4
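The effective batch size and warmup length follow directly from the hyperparameters. The sketch below works through the arithmetic. The sample count of 10,000 is a made-up placeholder, and the exact rounding of warmup steps depends on the training framework, so treat the warmup figure as approximate.

```python
import math


def effective_batch(per_gpu: int, grad_acc: int, num_gpus: int) -> int:
    """Samples contributing to one optimizer step."""
    return per_gpu * grad_acc * num_gpus


def warmup_steps(num_samples: int, epochs: int, eff_batch: int,
                 warmup_ratio: float) -> int:
    """Approximate number of linear-warmup steps for a given dataset size."""
    steps_per_epoch = math.ceil(num_samples / eff_batch)
    total_steps = steps_per_epoch * epochs
    return math.ceil(total_steps * warmup_ratio)


batch = effective_batch(per_gpu=8, grad_acc=2, num_gpus=4)
print(batch)                                   # 64
print(warmup_steps(10_000, 5, batch, 0.05))    # 40, for a hypothetical 10,000 samples
```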
Evaluation Metrics
After training, we evaluate with three metrics:
| Metric | What it measures |
|---|---|
| WER (Word Error Rate) | (substitutions + deletions + insertions) / total words — the standard ASR accuracy metric |
| CER (Character Error Rate) | Same as WER but at the character level — better suited for Arabic script where word boundaries can be ambiguous |
| RTF (Real-Time Factor) | decode_time / audio_duration — values < 1.0 mean faster than real-time |
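For reference, all three metrics can be computed with a short edit-distance routine. This is a self-contained sketch, not the implementation in model/evaluate.py, and it assumes non-empty references. The example pair is the GPT-4o error quoted in the benchmarking section.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, deletions, insertions) via two DP rows."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate of one sample (reference must be non-empty)."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref: str, hyp: str) -> float:
    """Character error rate of one sample (reference must be non-empty)."""
    return edit_distance(ref, hyp) / len(ref)


def rtf(decode_time_s: float, audio_duration_s: float) -> float:
    """Real-time factor; below 1.0 means faster than real time."""
    return decode_time_s / audio_duration_s


reference = "طبعا راه مكتئب"
hypothesis = "طبعا رهمك تائب"
print(round(wer(reference, hypothesis), 3))   # 0.667: two of three words are wrong
print(rtf(2.0, 10.0))                         # 0.2
```

A corpus-level WER is normally the total edit count divided by the total reference word count, which is not the same as averaging per-sample WER values.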
The evaluation script (model/evaluate.py) uses qwen_asr.Qwen3ASRModel.transcribe() for inference and computes both corpus-level and per-sample statistics:
python model/evaluate.py \
    --model_path model/output_moulasot_l40/1.7B \
    --test_jsonl model/data/test.jsonl \
    --batch_size 8 \
    --dtype bf16
Outputs saved under model/output_moulasot_l40/1.7B/eval_results/:
- results.json: aggregate WER, CER, RTF
- details.tsv: per-sample reference / hypothesis / WER / CER
- worst10.tsv: 10 worst-WER samples for error analysis
Text Normalization
During evaluation, the model output is stripped of training format tags before WER/CER calculation:
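import re

def normalise(text: str) -> str:
    # Drop everything up to and including the <asr_text> separator tag
    if "<asr_text>" in text:
        text = text.split("<asr_text>", 1)[-1]
    # Remove a leftover language hint when the separator tag was missing
    if text.startswith("language Arabic"):
        text = text[len("language Arabic"):]
    # Collapse whitespace runs and trim both ends
    return re.sub(r"\s+", " ", text).strip()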
Pushing to HuggingFace Hub
Once trained, the checkpoint is packaged and pushed to HuggingFace with model/push_model.py. This script:
- Copies all checkpoint files to a staging directory
- Re-syncs processor/tokenizer files from the base model (fixing a known issue where the SFT script corrupts tokenizer_config.json)
- Re-saves weights in safetensors format
- Writes a model card with metadata
- Uploads everything to HuggingFace Hub
python model/push_model.py \
    --checkpoint model/output_moulasot_l40/1.7B/checkpoint-14934 \
    --repo_id atlasia/moulsot.v0.3 \
    --hf_token $HF_TOKEN
Key Numbers at a Glance
| Stage | Count / Value |
|---|---|
| Source YouTube channels | 51 |
| Videos crawled | ~3,570 |
| Raw speech segments (post-VAD) | 338,227 |
| Total raw speech duration | ~1,500 hours |
| After quality filtering | 338,176 |
| After DNS64 enhancement pass | 116,825 rows |
| Human-reviewed preview clips (Argilla) | ~2,300 |
| Selected for transcription | ~80 hours |
| Selection ratio | ~5.3% of raw pool |
| ASR base model | Qwen3-ASR-1.7B |
| Training epochs | 5 |
| Effective batch size (4× GPU) | 64 |
| Learning rate | 5 × 10⁻⁵ (cosine) |
Tools and Models Summary
| Tool | Role | Why we chose it |
|---|---|---|
| pytubefix | YouTube downloading | Actively maintained; stable audio-only API |
| Silero VAD | Speech segmentation | Fast, neural, robust on noisy audio |
| SQUIM (torchaudio) | Signal quality (PESQ/STOI/SI-SDR) | Non-intrusive; no clean reference required |
| Audiobox Aesthetics | Content/production quality (CE/CU/PC/PQ) | Captures subjective quality beyond signal metrics |
| DNS64 (denoiser) | Speech enhancement | Real-time; trained on diverse DNS noise types |
| pyannote 3.1 | Speaker diarization | State-of-the-art open-source; easy labels() API |
| Gemini 2.5 Pro | Batch transcription / data annotation | Best Darija accuracy in benchmark; consistent Arabic script; prompt-guided; Batch API for scale |
| Argilla | Human annotation | Open-source; seamless HuggingFace Spaces deployment |
| Qwen3-ASR-1.7B | ASR fine-tuning | Pre-trained on multilingual speech including Arabic |
Lessons Learned
1. VAD is the most impactful preprocessing step. The difference between "1,500 hours of YouTube audio" and "1,500 hours of speech" is enormous. Silero VAD removes music, silence, and non-speech content that would otherwise contaminate training data and waste human transcription time.
2. Signal quality and content quality are orthogonal. SQUIM and Audiobox Aesthetics capture fundamentally different things. A segment can score well on PESQ (clean signal) while being low-value content (CE=2.5, CU=2.5 — someone humming or mumbling). Combining both signal-level and content-level quality scoring is essential.
3. Human feedback validates and calibrates automated metrics. The Argilla annotation round revealed that annotators cared about duration as much as audio quality. The technical metrics didn't capture the "too long, wandering speech" problem — only humans noticed it. Building a human-in-the-loop feedback loop early in the project changed how we designed the selection criteria.
4. Speech enhancement should augment, not replace. DNS64 is powerful but not magic. We don't enhance everything, only segments below a PESQ threshold. Applying enhancement universally would alter the acoustic characteristics of clean recordings unnecessarily. The _Enh suffix convention in our metadata lets us track the provenance of every segment.
5. Diversity requires deliberate enforcement. Left unconstrained, quality-ranked selection tends to over-represent a few high-quality channels. The channel diversity cap ensures our 80 hours genuinely spans 51 channels rather than being dominated by 5 studio-quality podcasts.
6. Fine-tuning format matters as much as the model. Qwen3-ASR uses a non-standard text format (language Arabic<asr_text>{text}) and its SFT script corrupts tokenizer files. The evaluation and push scripts include logic to re-sync clean processor files from the base model, a detail that is easy to miss and leads to cryptic inference failures.
Try MoulSot Live
Link to the space: https://atlasia-moulsot-v0-3.hf.space
What's Next
The 80-hour transcribed corpus is published as atlasia/MoulSot-Full on HuggingFace. The fine-tuned Qwen3-ASR model is available at atlasia/moulsot.v0.3. Both the data pipeline (this repo) and the fine-tuning code are fully open-source.
Next steps include:
- Scaling to the full 116,825-segment enhanced corpus for a second training pass
- Exploring Qwen3-ASR-0.6B for edge deployment
- Adding diarization-aware transcription for multi-speaker segments
- Expanding the channel list to cover more regional dialects (Souss, Oujda, Sahara)
The pipeline described here can be adapted for any under-resourced language with YouTube presence by swapping the channel list and adjusting quality thresholds to match the language's phonetic characteristics.
Citations
If you use MoulSot in your research, please cite:
@misc{moulsot2026,
title = {MoulSot: A Curated Moroccan Darija Speech Dataset and Fine-Tuned ASR Model},
author = {Atlasia},
year = {2026},
url = {https://huggingface.co/atlasia/moulsot.v0.3}
}
Tools, Models & Datasets
- Silero VAD: Silero Team. Silero VAD: pre-trained enterprise-grade Voice Activity Detector. GitHub, 2021. https://github.com/snakers4/silero-vad
- SQUIM: Kumar, A., et al. TorchAudio-Squim: Reference-less Speech Quality and Intelligibility measures in TorchAudio. ICASSP, 2023. https://pytorch.org/audio/stable/pipelines.html#squim
- Audiobox Aesthetics: Meta Research. Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound. 2024. https://github.com/facebookresearch/audiobox-aesthetics
- DNS64 (Denoiser): Defossez, A., Synnaeve, G., & Adi, Y. Real Time Speech Enhancement in the Waveform Domain. Interspeech, 2020. https://github.com/facebookresearch/denoiser
- pyannote.audio: Bredin, H., et al. pyannote.audio 2.1: speaker diarization pipeline. Interspeech, 2023. https://github.com/pyannote/pyannote-audio
- Argilla: Argilla, Inc. Argilla: Open-source data curation platform for LLMs. https://github.com/argilla-io/argilla
- Gemini: Google DeepMind. Gemini: A Family of Highly Capable Multimodal Models. 2024. https://deepmind.google/technologies/gemini/
- Qwen3-ASR: Qwen Team, Alibaba Cloud. Qwen3-ASR: Multilingual Automatic Speech Recognition Model. 2025. https://huggingface.co/Qwen/Qwen3-ASR-1.7B
- DODa (Darija Open Dataset): Outchakoucht, A., & Es-Samaali, H. DODa: Data-centric Open Darija Dataset. 2024. https://huggingface.co/datasets/abdeljalilELmajjodi/Atlaset-audio
- pytubefix: JuanBindez. pytubefix: A maintained fork of pytube. GitHub. https://github.com/JuanBindez/pytubefix
MoulSot is a project by Atlasia, building open AI infrastructure for Moroccan and North African languages.