Wav2vec 2.0 Models — Dutch Archival Broadcast
Self-supervised speech foundation models pre-trained on 55,000 hours of Dutch archival television broadcast data from the NISV collection.
Speech models · Dutch · Available on request
Data source: NISV archival television broadcast data — 55,000 hours
Summary
This work explores the use of Dutch archival television broadcast data for self-supervised learning of speech foundation models, specifically wav2vec 2.0.
Key findings:
- Data quality matters for SSL. Music, noise, and speaker overlap all affect convergence and downstream fine-tuning performance — understanding these effects is essential when working with broadcast archives.
- Pre-processing makes the difference. Using Whisper and WhisperX to filter and clean the noisy broadcast dataset substantially improves pre-training quality (a filtering sketch follows this list).
- Monolingual pre-training is more robust. Compared with multilingual pre-training on equivalent amounts of data, monolingual pre-training generalises better to out-of-domain speech.
- State-of-the-art Dutch ASR. Continued pre-training of a wav2vec 2.0 XLS-R checkpoint on the 55k-hour NISV dataset yields a state-of-the-art large model for the Dutch language.
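
As a concrete illustration of the pre-processing finding above, the sketch below uses openai-whisper to keep only segments the model is confident contain speech. This is not the published pipeline: the model size, the Dutch language hint, the `speech_segments` helper, and the confidence thresholds are illustrative assumptions, and the WhisperX alignment step is omitted.

```python
import whisper

# Illustrative filtering pass (not the published pipeline): transcribe a broadcast
# clip with Whisper and keep only segments whose confidence signals suggest clean
# speech rather than music, noise, or silence. Thresholds are assumptions.
model = whisper.load_model("large-v2")

def speech_segments(path, min_avg_logprob=-1.0, max_no_speech_prob=0.6):
    """Return (start, end) pairs for segments Whisper is confident contain speech."""
    result = model.transcribe(path, language="nl")
    kept = []
    for seg in result["segments"]:
        # Each Whisper segment carries avg_logprob and no_speech_prob, which can
        # be thresholded to reject music beds, jingles, and other non-speech audio.
        if seg["avg_logprob"] >= min_avg_logprob and seg["no_speech_prob"] <= max_no_speech_prob:
            kept.append((seg["start"], seg["end"]))
    return kept

print(speech_segments("broadcast_clip.wav"))
```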
Available models
| Model | Description | Availability |
|---|---|---|
| base / large | Base and large wav2vec 2.0 models pre-trained from scratch on Dutch archival television broadcast data. | on request |
| xls-r-300m | Large model: continued pre-training of a wav2vec 2.0 XLS-R checkpoint on the 55k-hour archival dataset. Achieves state-of-the-art performance for Dutch speech recognition. | on request |
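
Once a checkpoint has been obtained, a minimal way to extract speech representations is sketched below. It assumes the models are shipped in Hugging Face transformers format, which this page does not state; the checkpoint path is a placeholder and 16 kHz mono input is assumed, as is standard for wav2vec 2.0.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Placeholder path: the actual checkpoints are only available on request, and the
# transformers-compatible layout is an assumption rather than a documented fact.
ckpt = "path/to/nisv-wav2vec2-large"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(ckpt)
model = Wav2Vec2Model.from_pretrained(ckpt)
model.eval()

# wav2vec 2.0 expects 16 kHz mono audio; one second of silence stands in for real input.
waveform = torch.zeros(16000)
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape: (batch, frames, hidden_dim)
print(hidden.shape)
```

For ASR fine-tuning, the same checkpoint would typically be wrapped in a CTC head (e.g. Wav2Vec2ForCTC) with a Dutch character vocabulary; that step is omitted here.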
Publication
Vaessen et al. (2025). Self-supervised learning of speech foundation models from Dutch archival broadcast data. Interspeech 2025.
https://www.isca-archive.org/interspeech_2025/vaessen25_interspeech.html
Interested in access?
Contact us with a short description of your research, your affiliation, and a rough timeline.
mediastudies@clariah.nl