Building MoulSot: How We Curated 1,500 Hours of Moroccan Darija Speech, Selected the Best 80 Hours for Transcription, and Fine-Tuned Qwen3-ASR on Top of It

A technical deep-dive into the data pipeline and model training behind the MoulSot Moroccan Darija ASR system.


Introduction

Moroccan Darija is spoken by over 30 million people, yet it remains severely under-resourced in the speech technology landscape. Unlike Modern Standard Arabic, Darija is not a codified written language — it lives in spoken form, blending Arabic roots with Amazigh, French, and Spanish influences in ways that vary significantly between regions and speakers. Building an ASR system for Darija from scratch requires not just data, but high-quality data that reflects this linguistic diversity.

The MoulSot project set out to answer a deceptively simple question: can we build a large, clean, diverse Moroccan Darija speech dataset entirely from publicly available YouTube content, and use it to fine-tune a state-of-the-art ASR model? The answer turned out to be yes — but the path from raw web video to a running ASR system involved a multi-stage pipeline integrating six distinct AI models, human annotation, and careful engineering at every step.

This post walks through every stage of that pipeline: how we collected ~1,500 hours of raw Moroccan Darija speech from YouTube, scored and filtered it using state-of-the-art audio quality models, enhanced the worst segments with deep noise suppression, transcribed the best 80 hours, and finally fine-tuned Qwen3-ASR-1.7B on the resulting corpus.


The Challenge: Why Raw YouTube Audio Is Not Enough

YouTube hosts an enormous amount of Moroccan Darija speech, but raw video audio is almost never directly usable for ASR training. The problems are numerous:

  • Background music and sound effects are pervasive in vlogs and entertainment content
  • Multiple overlapping speakers appear in debates, panel discussions, and some podcasts
  • Variable recording quality spans professional studio microphones to phone cameras in noisy environments
  • Long silences, intros, and outros pad most videos with non-speech content
  • Content diversity is uneven: some channels are extremely high-quality but narrow in topic; others are diverse but low-quality

A naive approach — download everything and transcribe it — would waste enormous human annotation effort on audio that is unusable or marginal quality. Our pipeline inverts this logic: automate the quality assessment first, then direct human effort exclusively at the segments that matter most.


MoulSot audio pipeline

Stage 1: Channel Curation — Choosing the Right Sources

Before a single byte of audio was downloaded, we curated a list of 51 YouTube channels. This was not a mechanical step — it required judgment about what kinds of speech would be most valuable for ASR.

Selection criteria:

  • The channel must be primarily in Moroccan Darija (not Modern Standard Arabic, not French)
  • Code-switching with French is acceptable and desirable (it reflects real Darija usage)
  • Preference for channels with clear, unscripted conversational speech: podcasts, interviews, talk shows, vlogs, and commentary
  • Avoid channels dominated by music, dubbed content, or non-Moroccan dialects
  • Seek speaker diversity: men and women, different ages, different regions

The final list (channels/list.txt) contains 51 channels spanning categories such as:

| Category | Examples |
|---|---|
| Lifestyle / Vlogs | @FaroukLife, @la3zawifamily, @walo_maydi3_family |
| Podcasts / Talks | @TouilTalks, @GeeksBlaBla01, @Thusspokeayman2 |
| Finance / Education | @Financebdarija, @AlmofidBdarija, @Drtabaliyoussef |
| Cooking / Beauty | @Choumicha, @asmaabeauty7208, @withnassima |
| News / Commentary | @Darkhabar, @MOTALATE, @naoufalchaara |
| Sports / Outdoors | @MarokiRider, @cyclingaroundafrica, @PILOTA.11 |
| Comedy / Entertainment | @BASSOU, @MrL3ami9, @IAMYAS |

We also maintained a separate regional subset (channels/chamali.txt) targeting Tangier-region (Chamali dialect) speech — a dialect notably distinct from Casablanca Darija.

This diversity was intentional. An ASR model trained on a single content domain or accent would generalize poorly. By spanning topics, ages, genders, and regions from the start, we built in robustness that no amount of post-hoc augmentation can substitute for.


Stage 2: Automated YouTube Crawling

Tool: pytubefix | Script: data/crawl.py

With the channel list in hand, we used pytubefix to fetch up to 70 videos per channel, downloading only the audio stream to save storage and skip unnecessary video decoding.

videos = channel_ytb.videos[:min(70, len(channel_ytb.videos))]
for i, video in enumerate(videos):
    video.streams.get_audio_only().download(
        output_path=channel_dir,
        filename=f'moul_sout_{i}_audio'
    )

In data/crawl.py the channel list is first passed through set(), so an accidentally duplicated handle does not cause the same channel to be downloaded twice.
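
A minimal sketch of that deduplication step, assuming the channel list is read as plain lines with one handle per line. The helper name load_channels is hypothetical and not taken from crawl.py; sorting the result is an addition for reproducibility, since a bare set() has no stable order.

```python
def load_channels(lines):
    """Return unique channel handles, ignoring blank lines and stray whitespace."""
    handles = {line.strip() for line in lines if line.strip()}
    # Sorting is an addition for this sketch: a set has no stable iteration order.
    return sorted(handles)


# A duplicated handle and an empty line collapse to two unique entries.
channels = load_channels(["@Choumicha", "@BASSOU", "", "@Choumicha\n"])
print(channels)
```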

Output: ~3,570 audio files across 51 channels.


Stage 3: Voice Activity Detection — Silero VAD

Tool: Silero VAD | Script: pipeline/vad/segment_silero.py

Raw audio files are rarely pure speech from start to finish. Intros, outros, music beds, long pauses, and off-mic conversations all mix in. The first step in making this audio useful is segmentation: slicing each file into short clips that contain only voiced speech.

What is Silero VAD?

Silero VAD is a lightweight neural voice activity detection model. Unlike classical energy-based VAD methods that simply threshold audio loudness, Silero VAD is a deep learning model trained to distinguish human speech from all other audio events — noise, music, ambient sound, background voices. It runs at real-time speed and is robust to noisy conditions.

Our Parameters

speech_timestamps = get_speech_timestamps(
    wav, model,
    return_seconds=True,
    max_speech_duration_s=30,    # no segment longer than 30s
    min_speech_duration_ms=1,    # keep even very brief voiced frames
)

Output

For each input file, the script produces:

  • Short .wav segments under <channel>/silero_vad/<name>/{0.wav, 1.wav, ...}
  • A JSON timestamp manifest

After running VAD over all downloaded audio: 338,227 speech segments in master.parquet — the raw pool from which everything downstream flows.
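
To make the segmentation step concrete, here is a sketch of how second-based timestamps (the shape returned when return_seconds=True) could be mapped to sample ranges and output paths. The 16 kHz rate matches the rest of the pipeline; the helper name plan_segments and the exact way paths are assembled are illustrative assumptions, not the code in segment_silero.py.

```python
from pathlib import PurePosixPath

SAMPLE_RATE = 16_000  # assumed: audio already resampled to 16 kHz


def plan_segments(timestamps, channel, name):
    """Map [{'start': s, 'end': s}, ...] in seconds to sample ranges and file paths."""
    out_dir = PurePosixPath(channel) / "silero_vad" / name
    plan = []
    for i, ts in enumerate(timestamps):
        start = int(round(ts["start"] * SAMPLE_RATE))
        end = int(round(ts["end"] * SAMPLE_RATE))
        plan.append({"path": str(out_dir / f"{i}.wav"), "start": start, "end": end})
    return plan


plan = plan_segments(
    [{"start": 0.5, "end": 3.0}, {"start": 4.2, "end": 9.9}],
    "some_channel",
    "moul_sout_0_audio",
)
```

Each entry can then be used to slice the waveform tensor (wav[start:end]) before writing the clip to disk.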


Stage 4: Multi-Dimensional Audio Quality Scoring

Tools: SQUIM (torchaudio), Audiobox Aesthetics (Meta) | Script: pipeline/audio_aesthetics/audio_metadata.py

With 338,227 segments in hand, the critical question is: which ones are worth transcribing? We answer this by running every segment through two complementary quality models — one focused on signal-level speech quality, the other on subjective content and production quality.

4a. SQUIM — Non-Intrusive Speech Quality Estimation

SQUIM predicts three established perceptual speech quality metrics without needing a clean reference signal. This "blind" property is essential for our use case: we have no clean reference to compare against.

SQUIM is trained on the DNS 2020 dataset and learns to predict the same quality scores that would be obtained by comparing to a clean reference, even when no reference is available.

Metrics produced:

| Metric | Range | What it measures |
|---|---|---|
| PESQ (pesq_hyp) | 1.0 – 4.5 | Perceptual speech quality; how clean the speech sounds |
| STOI (stoi_hyp) | 0.0 – 1.0 | Short-time objective intelligibility; can a listener understand it? |
| SI-SDR (si_sdr_hyp) | dB (unbounded) | Scale-invariant signal-to-distortion ratio; positive = speech dominates |

All audio is resampled to 16 kHz mono before scoring. torch.autocast is used for memory efficiency.

Key filtering threshold: pesq_hyp < 1.75 → routed to the speech enhancement module (Stage 5). pesq_hyp > 2.5 → highest-priority for transcription.
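
The routing rule can be sketched as a small function. The two thresholds come from the text above; the label for the middle band ("standard") is an assumption made for illustration.

```python
ENHANCE_BELOW = 1.75   # from the text: below this, send to speech enhancement
PRIORITY_ABOVE = 2.5   # from the text: above this, transcribe first


def route(pesq_hyp: float) -> str:
    """Decide what happens to a segment based on its predicted PESQ."""
    if pesq_hyp < ENHANCE_BELOW:
        return "enhance"
    if pesq_hyp > PRIORITY_ABOVE:
        return "priority"
    return "standard"  # middle band; this label is an assumption
```

Both comparisons are strict, so a segment scoring exactly 1.75 or exactly 2.5 falls in the middle band.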


4b. Audiobox Aesthetics — Content and Production Quality

Audiobox Aesthetics (Meta Research) is built on a WavLM backbone — a 12-layer transformer with 768-dimensional embeddings, 12 attention heads, and 3,072-dimensional feed-forward layers — fine-tuned with human aesthetic ratings to predict four subjective quality dimensions.

Metrics produced (all scored 0–10):

| Metric | Key | What it captures |
|---|---|---|
| Content Enjoyment | CE | How engaging/enjoyable the audio content is |
| Content Usefulness | CU | How informative or practically useful the content is |
| Production Complexity | PC | Recording/production sophistication |
| Production Quality | PQ | Overall recording and production quality |

Observed score distributions (from our 338k-segment processing log):

| Metric | Low | Typical | High |
|---|---|---|---|
| CE | 2.6 | 4.0 – 4.5 | 5.8 |
| CU | 2.5 | 4.5 – 5.5 | 6.6 |
| PC | 1.5 | 1.8 – 2.5 | 6.3 |
| PQ | 4.3 | 5.3 – 6.5 | 7.6 |

Inference runs on GPU in batches of 8. Processing all 338k segments took ~60 minutes on a single GPU (observed in the processing log: 12:40 start → 13:40 aesthetics complete).


Stage 5: Speech Enhancement — Rescuing Low-Quality Segments

Tool: Facebook DNS64 | Script: pipeline/se_module/speech_enhancer.py

Segments with pesq_hyp < 1.75 (poor perceptual quality) are passed through Facebook's DNS64 deep noise suppression model.

DNS64 is a real-time speech enhancement model trained on the DNS Challenge dataset. It uses an encoder-decoder convolutional architecture with skip connections, processing audio in the time domain.

Process:

  1. Filter metadata to pesq_hyp < 1.75
  2. Load DNS64 model (denoiser.pretrained.dns64())
  3. For each low-quality segment: load → convert audio → denoise → save to silero_vad_enhanced/
  4. New metadata rows appended with _Enh suffix in the id column
  5. Output parquet contains both original rows and enhanced rows

Result: master.filter_metadata.with_enhanced.parquet → 116,825 rows
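
The metadata bookkeeping in steps 4 and 5 can be sketched as follows. This is an illustration only: the "path" column, the way the enhanced path is derived, and the choice to blank the score on the new row until it is scored again are assumptions, and the denoising itself is left out.

```python
def add_enhanced_rows(rows, threshold=1.75):
    """Return the original rows plus one '_Enh' row per segment below the PESQ threshold."""
    enhanced = []
    for row in rows:
        if row["pesq_hyp"] < threshold:
            new_row = dict(row)
            new_row["id"] = row["id"] + "_Enh"
            # Assumed layout: enhanced audio mirrors the original path.
            new_row["path"] = row["path"].replace("silero_vad/", "silero_vad_enhanced/", 1)
            # Placeholder until the enhanced audio is scored again (assumption).
            new_row["pesq_hyp"] = None
            enhanced.append(new_row)
    return rows + enhanced
```

Because the originals are kept alongside the enhanced copies, later stages can always trace an _Enh row back to its source segment.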

By recovering degraded segments we preserve speaker diversity (some speakers always record in non-ideal conditions) and increase dataset size without re-crawling.


Stage 6: Speaker Diarization — Counting Voices

Tool: pyannote/speaker-diarization-3.1 | Script: pipeline/se_module/speaker_separator.py

pyannote/speaker-diarization-3.1 estimates the number of distinct speakers in each segment. This enables downstream filtering to keep single-speaker (interview/monologue) segments or flag multi-speaker clips.

The system uses a segmentation model + speaker embedding model + agglomerative clustering. For our purposes, we use it simply to count speakers: len(diarization.labels()) → num_speakers column.

Smart caching: Enhanced versions (_Enh) are processed first. When the non-enhanced original is encountered later, we reuse the speaker count — denoising doesn't change who is speaking.
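
A sketch of that caching idea, with the diarization call abstracted behind a count_fn argument so the example runs without a model. The function name and the dictionary-based cache are assumptions for illustration; in the real script count_fn would wrap the pyannote pipeline.

```python
def count_speakers_cached(segment_ids, count_fn):
    """Count speakers per segment, calling count_fn once per underlying clip."""
    cache = {}
    result = {}
    # Enhanced ids first, so their counts are ready when the originals come up.
    ordered = sorted(segment_ids, key=lambda s: not s.endswith("_Enh"))
    for seg_id in ordered:
        base = seg_id[: -len("_Enh")] if seg_id.endswith("_Enh") else seg_id
        if base not in cache:
            cache[base] = count_fn(seg_id)
        result[seg_id] = cache[base]
    return result
```

With ids ["a", "a_Enh", "b"], count_fn is called for "a_Enh" and "b" only; "a" reuses the cached count.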

Downstream use:

  • num_speakers == 1: Clean single-speaker segment — ideal for attributed transcription
  • num_speakers == 2: Dialogue/interview — valuable for conversational ASR
  • num_speakers > 2: Panel/overlap — lower priority for initial transcription

Stage 7: Human-in-the-Loop Feedback — Validating with Real Annotators

Tool: Argilla on HuggingFace Spaces | Scripts: data/argilla_data_feedback.py, data/Data Feedback Analysis.ipynb

Automated quality metrics, however sophisticated, measure proxies for what we actually want: audio that a human annotator can comfortably transcribe. We ran a human validation study on a preview subset to ground-truth our automated scores against human perception.

Annotation Setup

A sample of ~2,300 clips was uploaded to HuggingFace (atlasia/moulsot-preview) and annotators reviewed them through an Argilla interface. Three questions per clip:

rg.RatingQuestion(name="quality",  values=[1, 2, 3, 4, 5])           # overall quality
rg.LabelQuestion(name="useful",    labels=["Useful", "Not Useful"])   # ASR value
rg.LabelQuestion(name="duration",  labels=["shorter better", "good", "longer better"])

Key Findings

After analysis of 2,318 completed annotations:

  • Duration sweet spot: Annotators strongly preferred segments in the 3–15 second range. Clips shorter than ~2s were too short to be meaningful; clips approaching 30s often contained wandering speech that was hard to transcribe.
  • Duration quality encoding:
    • −1: "shorter better" — clip is too long
    • 0: "good" — duration is appropriate
    • +1: "longer better" — clip is too short
  • The human feedback validated our SQUIM/Audiobox scores: clips annotators rated ≥ 3.5 out of 5 consistently had pesq_hyp > 2.0 and PQ > 5.0.
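
The duration encoding above lends itself to a simple aggregate. The sketch below averages the encoded votes for one clip; the function name and the choice of a plain mean are assumptions rather than the notebook's actual analysis.

```python
from statistics import mean

# Encoding taken from the list above.
DURATION_CODE = {"shorter better": -1, "good": 0, "longer better": 1}


def duration_score(labels):
    """Mean encoded duration vote; negative means annotators found the clip too long."""
    return mean(DURATION_CODE[label] for label in labels)
```

A score near 0 means the duration was judged appropriate, while values toward −1 or +1 indicate a clip that should have been shorter or longer.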

Selecting the Best 80 Hours

The final selection synthesizes all quality signals — PESQ, STOI, SI-SDR, CE, CU, PQ, num_speakers, human ratings, and duration — into a ranked selection for transcription.

Multi-Criteria Ranking

  1. Hard exclusions: pesq_hyp < 1.0 even after enhancement, stoi_hyp < 0.6, si_sdr_hyp < -5 dB
  2. Quality promotion: production_quality > 5.0 and content_usefulness > 4.0
  3. Speaker preference: Single-speaker segments (num_speakers == 1) prioritized
  4. Duration filter: 3–25 second range preferred
  5. Channel diversity cap: No single channel exceeds a fixed share of the 80-hour budget
  6. Human rating boost: Clips with quality_mean ≥ 3.5 and useful == "Useful" receive strong selection weight
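
One way to combine these criteria is a greedy pass over score-ranked segments, sketched below. The hard-exclusion thresholds are the ones listed above; the score weights, the field names (pq, cu, human_ok, and so on), and the greedy strategy itself are illustrative assumptions, since the post does not state the actual weighting.

```python
from collections import defaultdict


def select(segments, budget_s, channel_cap_s):
    """Greedy selection: drop hard failures, rank by score, enforce a per-channel cap."""

    def keep(s):
        # Hard exclusions from the list above.
        return s["pesq_hyp"] >= 1.0 and s["stoi_hyp"] >= 0.6 and s["si_sdr_hyp"] >= -5

    def score(s):
        # Illustrative weights only; the real weighting is not stated in the post.
        pts = s["pesq_hyp"]
        pts += 1.0 if s["pq"] > 5.0 and s["cu"] > 4.0 else 0.0
        pts += 0.5 if s["num_speakers"] == 1 else 0.0
        pts += 0.5 if 3 <= s["duration"] <= 25 else 0.0
        pts += 1.0 if s.get("human_ok") else 0.0
        return pts

    used = defaultdict(float)  # seconds already taken per channel
    total, chosen = 0.0, []
    for s in sorted(filter(keep, segments), key=score, reverse=True):
        d = s["duration"]
        if total + d > budget_s or used[s["channel"]] + d > channel_cap_s:
            continue
        used[s["channel"]] += d
        total += d
        chosen.append(s["id"])
    return chosen
```

The per-channel cap is what stops a handful of studio-quality channels from absorbing the whole budget, at the cost of admitting some lower-scoring clips from other channels.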

Result: ~80 hours of curated, diverse, high-quality Moroccan Darija speech — representing a 5.3% selection rate from the raw 1,500-hour pool — now published as atlasia/MoulSot-Full on HuggingFace, partitioned into a training config 100-gt-2.5 (all samples with pesq_hyp > 2.5) and a held-out test split.


Stage 8: Transcription — Data Annotation with Gemini

Tool: Gemini 2.5 Pro (Google GenAI) | Script: data/transcriber.py | Benchmark: benchmarking/transcriber_bench.ipynb

With 80 hours of curated audio selected, the next challenge is transcription: converting every segment into accurate Darija text. Manual transcription of 80 hours would take hundreds of annotator-hours, so we used an LLM-based transcription approach — but first, we needed to choose the right model.

Benchmarking Transcription Models

We ran a head-to-head benchmark on 101 samples from an existing Darija speech dataset (DODa_audio) with human-verified ground truth transcriptions. Four models were evaluated:

| Model | Provider | Type |
|---|---|---|
| Gemini Flash (gemini-flash-latest) | Google | Multimodal LLM |
| GPT-4o Transcribe (gpt-4o-transcribe) | OpenAI | Dedicated ASR endpoint |
| ElevenLabs Scribe (scribe_v1) | ElevenLabs | Dedicated ASR model |
| Gemini 2.5 Flash (gemini-2.5-flash) | Google | Multimodal LLM |

Each model transcribed the same 101 audio clips, and outputs were compared against the ground truth darija_ar column.

Key Findings

1. Script consistency: Both GPT-4o Transcribe and ElevenLabs Scribe occasionally transliterated Darija into Latin script instead of Arabic. For example, the phrase "اش بان ليك" was rendered as Isbellik? by GPT-4o and Jbal-lik? by ElevenLabs — completely unusable for Arabic-script ASR training. The Gemini models consistently output Arabic script.

2. Darija-specific vocabulary: GPT-4o struggled with colloquial Darija expressions. For instance, "طبعا راه مكتئب" became "طبعا رهمك تائب" — a severe distortion. "صافي دابا نحاول ناقشها بلا ما نعيق" was mangled into "سافيدة بنحاول نقشها بلا من عيق". The Gemini models and ElevenLabs handled these phrases significantly better.

3. Robustness on short utterances: The older Gemini Flash sometimes failed on very short clips (returning "مممم" or empty strings), but Gemini 2.5 Flash showed marked improvement on these edge cases.

4. Overall ranking: Gemini 2.5 Flash and ElevenLabs Scribe produced the closest transcriptions to the ground truth across the 101-sample benchmark, with Gemini having the critical advantage of consistent Arabic-script output and native support for prompt-guided transcription behavior.
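
The script-consistency problem in finding 1 is easy to flag automatically. The sketch below measures what fraction of a transcript's letters are Arabic, using only the standard library; the function name is hypothetical and this check is not claimed to be part of the benchmark notebook.

```python
import unicodedata


def arabic_ratio(text: str) -> float:
    """Fraction of alphabetic characters whose Unicode name starts with 'ARABIC'."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(unicodedata.name(ch, "").startswith("ARABIC") for ch in letters)
    return arabic / len(letters)
```

A low ratio is best treated as a flag for review rather than an automatic rejection, because legitimately code-switched French words are expected to appear in Latin script.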

Based on these results, we selected the Gemini family as our transcription backbone — specifically upgrading to Gemini 2.5 Pro for production transcription, leveraging its stronger reasoning capabilities for the full 80-hour corpus.

The Production Transcriber

The production transcription system (data/transcriber.py) is built around the Google GenAI SDK with several design choices informed by the benchmarking:

Prompt engineering for Darija:

prompt = """\
You are an audio transcriber. Your job is to transcribe the input audios
to text exactly as it was said by the user.

Rules:
* Audio is Moroccan Darija Arabic.
* Do not transliterate: if words from another language are uttered
  (e.g. French, English), transcribe them in their original script.
* If the audio is not clear, respond with "<filename.wav>:" (empty).
"""

The prompt explicitly instructs the model to preserve code-switching — a critical Darija feature where speakers seamlessly mix Arabic with French. Words like "normalement", "en fait", or "c'est bon" should remain in Latin script, not be transliterated into Arabic. This was a key failure mode we observed in competing models during benchmarking.

Batch transcription for scale:

Transcribing 80 hours one clip at a time would be slow and expensive. The transcriber supports two modes:

  1. Interactive mode — sends multiple audio files in a single prompt with interleaved file labels, returning all transcriptions in one response. Used for small batches and testing.

  2. Batch API mode — packages groups of audio files (default: 10 per request) into JSONL batch files and submits them via Gemini's Batch API. Each request gets a unique custom_id (SHA-256 hash of the file paths) for deterministic deduplication. Batch jobs are logged to batch_requests/batch_jobs.jsonl for monitoring and recovery.

# Batch mode groups files and submits asynchronously
batch_job = transcriber.batch_transcribe(file_paths, files_per_request=10)

# Poll for completion
completed_job = transcriber.wait_for_batch(batch_job.name, sleep_interval=30)

# Parse results
results = transcriber.get_batch_results(completed_job)
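
The grouping and deterministic custom_id described for Batch API mode can be sketched with the standard library. How the paths are joined before hashing is an assumption; the post only states that the id is a SHA-256 hash of the file paths.

```python
import hashlib


def chunk(paths, size=10):
    """Split a list of file paths into consecutive groups of at most `size`."""
    return [paths[i:i + size] for i in range(0, len(paths), size)]


def custom_id(paths):
    """Deterministic id for a group: SHA-256 over the newline-joined paths."""
    joined = "\n".join(paths)  # the joining scheme is an assumption
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


groups = chunk([f"{i}.wav" for i in range(23)])
ids = [custom_id(g) for g in groups]
```

Because the id depends only on the file paths, resubmitting the same group yields the same id, which is what makes deduplication and recovery from a partially failed run straightforward.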

Thinking mode: The transcriber uses Gemini's thinking feature at LOW level — enough to improve accuracy on ambiguous audio without significantly increasing latency or cost.


Stage 9: Fine-Tuning Qwen3-ASR on MoulSot-Full

Model: Qwen/Qwen3-ASR-1.7B | Scripts: model/tune.py, model/prepare_moulsot.py, model/evaluate.py, model/push_model.py

With the 80-hour transcribed corpus published, the next step is fine-tuning a state-of-the-art ASR model on it. We chose Qwen3-ASR (Qwen Team, Alibaba Cloud), which pairs an audio encoder with a language-model decoder and was pre-trained on massively multilingual speech data including Arabic.

Why Qwen3-ASR?

Qwen3-ASR comes in two sizes — 0.6B and 1.7B parameters — and was pre-trained on a large multilingual corpus that includes standard Arabic. However, Moroccan Darija is substantially different from Modern Standard Arabic phonetically, lexically, and prosodically. Fine-tuning on MoulSot-Full adapts the model to:

  1. Darija-specific phonology (pharyngeal consonants with Amazigh influence, French loanwords)
  2. Code-switching patterns (seamless Darija/French mixing)
  3. Diverse recording conditions (studio to phone camera)
  4. Regional accent variation (Casablanca, Tangier, Marrakech, etc.)

Data Format

Qwen3-ASR expects a specific JSONL format:

{"audio": "/abs/path/to/clip.wav", "text": "language Arabic<asr_text>transcript"}

The language Arabic prefix is the language hint; <asr_text> is the separator tag. All audio is normalized to 16 kHz mono WAV.
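
A small sketch of producing one such line, assuming the transcript is already clean text. The helper name to_jsonl_line is hypothetical; prepare_moulsot.py may build the records differently.

```python
import json


def to_jsonl_line(audio_path: str, transcript: str, language: str = "Arabic") -> str:
    """Serialize one training sample in the 'language X<asr_text>transcript' format."""
    text = f"language {language}<asr_text>{transcript}"
    # ensure_ascii=False keeps Arabic readable in the file instead of \u escapes.
    return json.dumps({"audio": audio_path, "text": text}, ensure_ascii=False)


line = to_jsonl_line("/abs/path/to/clip.wav", "صافي")
```

Writing one such line per sample, each terminated by a newline, gives a train.jsonl in the expected shape.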

The dataset preparation script (model/prepare_moulsot.py) handles downloading from HuggingFace and converting to this format:

python model/prepare_moulsot.py --output_dir model/data --hf_token $HF_TOKEN

This produces:

model/data/
  train.jsonl          # training samples (MoulSot-Full, config: 100-gt-2.5)
  test.jsonl           # held-out test samples (MoulSot-Full, config: default)
  wavs/train/          # 16 kHz WAV files for training
  wavs/test/           # 16 kHz WAV files for evaluation

Training Architecture and Hyperparameters

Fine-tuning uses the official qwen3_asr_sft.py script from the QwenLM/Qwen3-ASR repository (auto-cloned into model/Qwen3-ASR/ on first run), launched via torchrun for multi-GPU training.

| Hyperparameter | Value | Notes |
|---|---|---|
| Base model | Qwen/Qwen3-ASR-1.7B | 1.7 billion parameters |
| Epochs | 5 | With best-checkpoint saving per epoch |
| Batch size (per GPU) | 8 | |
| Gradient accumulation | 2 steps | Effective batch = 16 × GPUs |
| Learning rate | 5 × 10⁻⁵ | |
| LR scheduler | Cosine decay | |
| Warmup ratio | 0.05 | 5% of total steps as linear warmup |
| Audio sample rate | 16,000 Hz | |
| Max audio duration | 30 seconds | Longer clips skipped |
| Optimizer | AdamW | (qwen3_asr_sft.py default) |
| Precision | bf16 | GPU-native bfloat16 |
| Save strategy | Per epoch | Keeps best 3 checkpoints |
| Workers | 4 | DataLoader workers |
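
For intuition, the effective batch size and the shape of the learning-rate schedule can be worked out in a few lines. The schedule below is a common formulation of linear warmup followed by cosine decay to zero; it is a sketch under that assumption and not necessarily the exact curve qwen3_asr_sft.py produces.

```python
import math


def effective_batch(per_gpu=8, grad_acc=2, num_gpus=4):
    """Samples contributing to each optimizer step."""
    return per_gpu * grad_acc * num_gpus


def lr_at(step, total_steps, peak=5e-5, warmup_ratio=0.05):
    """Linear warmup to `peak`, then cosine decay to zero (a common formulation)."""
    warmup = int(total_steps * warmup_ratio)
    if step < warmup:
        return peak * step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With 4 GPUs the effective batch is 8 × 2 × 4 = 64; the learning rate peaks at 5 × 10⁻⁵ once the first 5% of steps have elapsed and then falls smoothly toward zero.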

Multi-GPU launch (direct):

torchrun --nproc_per_node=<NUM_GPUS> model/Qwen3-ASR/finetuning/qwen3_asr_sft.py \
    --model_path Qwen/Qwen3-ASR-1.7B \
    --train_file model/data/train.jsonl \
    --eval_file  model/data/test.jsonl \
    --output_dir model/output_moulasot_l40/1.7B \
    --epochs 5 \
    --batch_size 8 \
    --grad_acc 2 \
    --lr 5e-5 \
    --warmup_ratio 0.05 \
    --lr_scheduler_type cosine

Or using the all-in-one Python launcher (handles repo clone, data prep, and multi-GPU automatically):

python model/tune.py --model 1.7B --num_gpus 4

Evaluation Metrics

After training, we evaluate with three metrics:

| Metric | What it measures |
|---|---|
| WER (Word Error Rate) | (substitutions + deletions + insertions) / total words; the standard ASR accuracy metric |
| CER (Character Error Rate) | Same as WER but at the character level; better suited for Arabic script where word boundaries can be ambiguous |
| RTF (Real-Time Factor) | decode_time / audio_duration; values < 1.0 mean faster than real-time |
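
These three metrics are simple enough to sketch from scratch. The version below uses a plain Levenshtein distance and per-sample scoring; it is a reference illustration, not the code in evaluate.py, which may rely on a library and also reports corpus-level figures.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using a single rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate for one sample (reference must be non-empty)."""
    words = ref.split()
    return edit_distance(words, hyp.split()) / len(words)


def cer(ref: str, hyp: str) -> float:
    """Character error rate for one sample (reference must be non-empty)."""
    return edit_distance(list(ref), list(hyp)) / len(ref)


def rtf(decode_time_s: float, audio_duration_s: float) -> float:
    """Real-time factor: below 1.0 means faster than real time."""
    return decode_time_s / audio_duration_s
```

Applied to the benchmark example from earlier, the reference "طبعا راه مكتئب" against the hypothesis "طبعا رهمك تائب" has two substituted words out of three, a WER of 2/3.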

The evaluation script (model/evaluate.py) uses qwen_asr.Qwen3ASRModel.transcribe() for inference and computes both corpus-level and per-sample statistics:

python model/evaluate.py \
    --model_path model/output_moulasot_l40/1.7B \
    --test_jsonl  model/data/test.jsonl \
    --batch_size  8 \
    --dtype       bf16

Outputs saved under model/output_moulasot_l40/1.7B/eval_results/:

  • results.json — aggregate WER, CER, RTF
  • details.tsv — per-sample reference / hypothesis / WER / CER
  • worst10.tsv — 10 worst-WER samples for error analysis

Text Normalization

During evaluation, the model output is stripped of training format tags before WER/CER calculation:

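import re

def normalise(text: str) -> str:
    # Drop everything up to and including the <asr_text> separator tag
    if "<asr_text>" in text:
        text = text.split("<asr_text>", 1)[-1]
    # Drop a leftover language hint if the separator tag was missing
    if text.startswith("language Arabic"):
        text = text[len("language Arabic"):]
    # Collapse runs of whitespace and trim the ends
    return re.sub(r"\s+", " ", text).strip()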

Pushing to HuggingFace Hub

Once trained, the checkpoint is packaged and pushed to HuggingFace with model/push_model.py. This script:

  1. Copies all checkpoint files to a staging directory
  2. Re-syncs processor/tokenizer files from the base model (fixing a known issue where the SFT script corrupts tokenizer_config.json)
  3. Re-saves weights in safetensors format
  4. Writes a model card with metadata
  5. Uploads everything to HuggingFace Hub

python model/push_model.py \
    --checkpoint model/output_moulasot_l40/1.7B/checkpoint-14934 \
    --repo_id    atlasia/moulsot.v0.3 \
    --hf_token   $HF_TOKEN

Key Numbers at a Glance

| Stage | Count / Value |
|---|---|
| Source YouTube channels | 51 |
| Videos crawled | ~3,570 |
| Raw speech segments (post-VAD) | 338,227 |
| Total raw speech duration | ~1,500 hours |
| After quality filtering | 338,176 |
| After DNS64 enhancement pass | 116,825 rows |
| Human-reviewed preview clips (Argilla) | ~2,300 |
| Selected for transcription | ~80 hours |
| Selection ratio | ~5.3% of raw pool |
| ASR base model | Qwen3-ASR-1.7B |
| Training epochs | 5 |
| Effective batch size (4× GPU) | 64 |
| Learning rate | 5 × 10⁻⁵ (cosine) |

Tools and Models Summary

| Tool | Role | Why we chose it |
|---|---|---|
| pytubefix | YouTube downloading | Actively maintained; stable audio-only API |
| Silero VAD | Speech segmentation | Fast, neural, robust on noisy audio |
| SQUIM (torchaudio) | Signal quality (PESQ/STOI/SI-SDR) | Non-intrusive; no clean reference required |
| Audiobox Aesthetics | Content/production quality (CE/CU/PC/PQ) | Captures subjective quality beyond signal metrics |
| DNS64 (denoiser) | Speech enhancement | Real-time; trained on diverse DNS noise types |
| pyannote 3.1 | Speaker diarization | State-of-the-art open-source; easy labels() API |
| Gemini 2.5 Pro | Batch transcription / data annotation | Gemini family led our Darija benchmark; consistent Arabic script; prompt-guided; Batch API for scale |
| Argilla | Human annotation | Open-source; seamless HuggingFace Spaces deployment |
| Qwen3-ASR-1.7B | ASR fine-tuning | Pre-trained on multilingual speech including Arabic |

Lessons Learned

1. VAD is the most impactful preprocessing step. The difference between "1,500 hours of YouTube audio" and "1,500 hours of speech" is enormous. Silero VAD removes music, silence, and non-speech content that would otherwise contaminate training data and waste human transcription time.

2. Signal quality and content quality are orthogonal. SQUIM and Audiobox Aesthetics capture fundamentally different things. A segment can score well on PESQ (clean signal) while being low-value content (CE=2.5, CU=2.5 — someone humming or mumbling). Combining both signal-level and content-level quality scoring is essential.

3. Human feedback validates and calibrates automated metrics. The Argilla annotation round revealed that annotators cared about duration as much as audio quality. The automated metrics did not capture the "too long, wandering speech" problem; only human annotators surfaced it. Bringing human feedback in early changed how we designed the selection criteria.

4. Speech enhancement should augment, not replace. DNS64 is powerful but not magic. We don't enhance everything — only segments below a PESQ threshold. Applying enhancement universally would alter the acoustic characteristics of clean recordings unnecessarily. The _Enh suffix convention in our metadata lets us track the provenance of every segment.

5. Diversity requires deliberate enforcement. Left unconstrained, quality-ranked selection tends to over-represent a few high-quality channels. The channel diversity cap ensures our 80 hours genuinely spans 51 channels rather than being dominated by 5 studio-quality podcasts.

6. Fine-tuning format matters as much as the model. Qwen3-ASR uses a non-standard text format (language Arabic<asr_text>{text}) and its SFT script corrupts tokenizer files. The evaluation and push scripts include logic to re-sync clean processor files from the base model — a detail that is easy to miss and leads to cryptic inference failures.

Try MoulSot Live

Link to the space: https://atlasia-moulsot-v0-3.hf.space


What's Next

The 80-hour transcribed corpus is published as atlasia/MoulSot-Full on HuggingFace. The fine-tuned Qwen3-ASR model is available at atlasia/moulsot.v0.3. Both the data pipeline (this repo) and the fine-tuning code are fully open-source.

Next steps include:

  • Scaling to the full 116,825-segment enhanced corpus for a second training pass
  • Exploring Qwen3-ASR-0.6B for edge deployment
  • Adding diarization-aware transcription for multi-speaker segments
  • Expanding the channel list to cover more regional dialects (Souss, Oujda, Sahara)

The pipeline described here can be adapted for any under-resourced language with YouTube presence by swapping the channel list and adjusting quality thresholds to match the language's phonetic characteristics.


Citations

If you use MoulSot in your research, please cite:

@misc{moulsot2026,
  title   = {MoulSot: A Curated Moroccan Darija Speech Dataset and Fine-Tuned ASR Model},
  author  = {Atlasia},
  year    = {2026},
  url     = {https://huggingface.co/atlasia/moulsot.v0.3}
}

Tools, Models & Datasets

  1. Silero VAD — Silero Team. Silero VAD: pre-trained enterprise-grade Voice Activity Detector. GitHub, 2021. https://github.com/snakers4/silero-vad

  2. SQUIM — Kumar, A., et al. TorchAudio-Squim: Reference-less Speech Quality and Intelligibility measures in TorchAudio. ICASSP, 2023. https://pytorch.org/audio/stable/pipelines.html#squim

  3. Audiobox Aesthetics — Meta Research. Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound. 2024. https://github.com/facebookresearch/audiobox-aesthetics

  4. DNS64 (Denoiser) — Defossez, A., Synnaeve, G., & Adi, Y. Real Time Speech Enhancement in the Waveform Domain. Interspeech, 2020. https://github.com/facebookresearch/denoiser

  5. pyannote.audio — Bredin, H., et al. pyannote.audio 2.1: speaker diarization pipeline. Interspeech, 2023. https://github.com/pyannote/pyannote-audio

  6. Argilla — Argilla, Inc. Argilla: Open-source data curation platform for LLMs. https://github.com/argilla-io/argilla

  7. Gemini — Google DeepMind. Gemini: A Family of Highly Capable Multimodal Models. 2024. https://deepmind.google/technologies/gemini/

  8. Qwen3-ASR — Qwen Team, Alibaba Cloud. Qwen3-ASR: Multilingual Automatic Speech Recognition Model. 2025. https://huggingface.co/Qwen/Qwen3-ASR-1.7B

  9. DODa (Darija Open Dataset) — Outchakoucht, A., & Es-Samaali, H. DODa: Data-centric Open Darija Dataset. 2024. https://huggingface.co/datasets/abdeljalilELmajjodi/Atlaset-audio

  10. pytubefix — JuanBindez. pytubefix: A maintained fork of pytube. GitHub. https://github.com/JuanBindez/pytubefix


MoulSot is a project by Atlasia, building open AI infrastructure for Moroccan and North African languages.