MinSpeech
A Corpus of Southern Min Dialect for Automatic Speech Recognition
2000+/1000+ Hours
MinSpeech provides 2000+ hours of unlabeled audio and 1000+ hours of labeled audio, drawn from a wide range of sources and covering a variety of contexts.
36 TV Programs
The corpus contains audio from 36 TV programs on topics such as history, mythology, family, culture, and education, giving it high diversity and coverage.
Cross-linguistic Labels
We use Chinese text as labels to facilitate alignment and conversion with corpora in other languages.
DOWNLOAD
The ./label folder contains the labeled data, including:
The ./list folder stores the video list for downloading audio.
The ./data folder contains the training set, development set, and test set.
Each data set has a corresponding text file and audio file list.
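As a rough illustration of how a text file and an audio file list might be joined into training examples, here is a minimal sketch. The file names `text` and `wav.scp`, the `<utt_id> <value>` line format, and the helper names `parse_id_table`, `pair_split`, and `load_split` are all assumptions made for this example; the corpus itself may use a different layout, so check the downloaded files before relying on it.

```python
from pathlib import Path


def parse_id_table(lines):
    """Parse lines of the ASSUMED form '<utt_id> <value>' into a dict."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(maxsplit=1)
        # An ID with nothing after it maps to an empty string.
        table[parts[0]] = parts[1] if len(parts) > 1 else ""
    return table


def pair_split(text_lines, wav_lines):
    """Return (utt_id, audio_path, transcript) for IDs found in both tables."""
    transcripts = parse_id_table(text_lines)
    audio_paths = parse_id_table(wav_lines)
    return [
        (utt_id, audio_paths[utt_id], transcripts[utt_id])
        for utt_id in sorted(transcripts)
        if utt_id in audio_paths
    ]


def load_split(split_dir):
    """Load one split from disk; 'text' and 'wav.scp' are hypothetical names."""
    split_dir = Path(split_dir)
    with open(split_dir / "text", encoding="utf-8") as text_file, \
            open(split_dir / "wav.scp", encoding="utf-8") as wav_file:
        return pair_split(text_file, wav_file)
```

Under these assumptions, a call such as `load_split("./label/data/train")` (a hypothetical path) would return one tuple per utterance that has both a Chinese-text label and an audio entry. Utterances missing from either file are dropped silently, which keeps the sketch short but may hide gaps in the data.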
Acknowledgement