MinSpeech

A Corpus of Southern Min Dialect for Automatic Speech Recognition

About

Important Note: Our dataset is divided into two parts: an unlabeled dataset (including YouTube links) and a labeled dataset (including YouTube links and transcription labels). We do not provide audio or video files. Users are responsible for deciding whether and how to download videos and for ensuring that downloading and using these data complies with the laws of their country.

2000/1000+ Hours

MinSpeech provides 2000+ hours of unlabeled audio and 1000+ hours of labeled audio, drawn from a wide range of sources and contexts.

36 TV Programs

The corpus contains audio from 36 TV programs covering a variety of topics such as history, mythology, family, culture, education, etc., with high diversity and coverage.

Cross-linguistic Labels

We use Chinese text as labels, which facilitates alignment and conversion with corpora in other languages.

Publications

Jiayan Lin, Shenghui Lu, Hukai Huang, Wenhao Guan, Binbin Xu, Hui Bu, Qingyang Hong∗, Lin Li∗

MinSpeech: A Corpus of Southern Min Dialect for Automatic Speech Recognition, INTERSPEECH 2024


DOWNLOAD

Resource

         The annotation files can be downloaded here.
         1. The ./label folder contains the labeled data in two subfolders. The ./list folder stores the list of videos from which audio is downloaded. The ./data folder contains the training, development, and test sets, each with a corresponding text file and audio file list.
         2. The ./unlabel folder contains the unlabeled data. It has only a list folder, which stores the list of videos from which audio is downloaded.
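
As a quick sanity check after extracting the annotation files, the folder layout described above can be inspected with a short script. This is only a minimal sketch: the function name `summarize_annotations` and the local path `MinSpeech` are hypothetical, and because the file names inside each folder are not documented here, the script simply lists whatever it finds.

```python
from pathlib import Path

# Sub-folders named in the description above, relative to the extracted
# annotation root. The file names inside them are not documented here,
# so this sketch only reports whatever files are present.
EXPECTED_DIRS = ("label/list", "label/data", "unlabel/list")


def summarize_annotations(root):
    """Map each expected sub-folder to the sorted relative paths of its files.

    A missing folder maps to None, so an incomplete download is easy to spot.
    """
    root = Path(root)
    summary = {}
    for rel in EXPECTED_DIRS:
        folder = root / rel
        if folder.is_dir():
            # Search recursively, since sets may live in nested folders.
            summary[rel] = sorted(
                p.relative_to(folder).as_posix()
                for p in folder.rglob("*")
                if p.is_file()
            )
        else:
            summary[rel] = None
    return summary


if __name__ == "__main__":
    # "MinSpeech" is a hypothetical local path to the extracted files.
    for rel, files in summarize_annotations("MinSpeech").items():
        status = "missing" if files is None else f"{len(files)} file(s)"
        print(f"{rel}: {status}")
```

A folder reported as missing most likely means the archive was extracted to a different location or only partially downloaded.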

Execute

         After downloading the annotation files, follow the guidance in the Repo.

License

         The dataset is licensed under the CC BY-NC-SA 4.0 license. MinSpeech does not own the audio copyrights; they remain with the original content creators. Only annotated data, including YouTube links and transcriptions, is provided. Users must ensure the legality of downloading audio or video data in their country. Metadata is accurate as of June 2024, but we cannot guarantee future availability of the content on YouTube.

Acknowledgement

         This work was supported in part by the National Natural Science Foundation of China under Grants 62276220 and 62371407.