MinSpeech: A Corpus of Southern Min Dialect for Automatic Speech Recognition

About

Important Note: Our dataset is divided into two parts: an unlabeled dataset (including YouTube links) and a labeled dataset (including YouTube links and transcription labels). We do not provide audio or video files. Users are responsible for deciding whether and how to download videos and for ensuring that downloading and using these data complies with the laws of their country.

2000+/1000+ Hours

MinSpeech provides 2000+ hours of unlabeled audio and 1000+ hours of labeled audio, drawn from a wide range of sources and contexts.

36 TV Programs

The corpus contains audio from 36 TV programs covering topics such as history, mythology, family, culture, and education, which gives it broad topical coverage.

Cross-linguistic Labels

We use Chinese text as transcription labels, which facilitates alignment and conversion with corpora in other languages.

Publications

Jiayan Lin, Shenghui Lu, Hukai Huang, Wenhao Guan, Binbin Xu, Hui Bu, Qingyang Hong∗, Lin Li∗

MinSpeech: A Corpus of Southern Min Dialect for Automatic Speech Recognition, INTERSPEECH 2024


DOWNLOAD

Resource

The annotation files can be downloaded here.
1. The ./label folder contains the labeled data. Its ./list folder stores the video list used for downloading audio, and its ./data folder contains the training, development, and test sets. Each set has a corresponding text file and audio file list.
2. The ./unlabel folder contains the unlabeled data. It has only a ./list folder, which stores the video list used for downloading audio.

Execute

After downloading the annotation files, follow the guidance in the Repo.

License

The dataset is licensed under the CC BY-NC-SA 4.0 license. MinSpeech does not own the audio copyrights; they remain with the original content creators. Only annotated data, including YouTube links and transcriptions, is provided. Users must ensure the legality of downloading audio or video data in their country. Metadata is accurate as of June 2024, but we cannot guarantee future availability of the content on YouTube.

Example

The documents are saved per speaker in './video_tags', and the fields on each line are separated by '\t'.

                  
  video_id	user_name	upload_date	Theme	Tags
  u66ruMnQQhU	Kesha Barrett	20111009	['Entertainment']	['JoJo', 'ATA', 'Power', 'Kick']
  gP1iNTd4uwk	Kesha Barrett	20110516	['Entertainment']	['Jo', 'Barrett', 'Michael', 'Jackson', 'year', 'olds', 'dancing', 'Preschool']
  9j6L7O8ztFM	Kesha Barrett	20110221	['Entertainment']	['Jo-Jo', 'Joseph', 'Breaking', 'boards', 'kids', 'in', 'karate', 'Lancaster', 'SC']
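
A minimal Python sketch of reading one line in the format shown above. It assumes exactly five tab-separated fields per line and that the Theme and Tags fields are written as Python-style list literals, as they appear in the sample rows. The `VideoTag` type and `parse_video_tag_line` function are hypothetical names used for illustration and are not part of the released tooling.

```python
import ast
from typing import List, NamedTuple


class VideoTag(NamedTuple):
    """One row of a './video_tags' document (hypothetical container)."""
    video_id: str
    user_name: str
    upload_date: str  # kept as the raw YYYYMMDD string
    themes: List[str]
    tags: List[str]


def parse_video_tag_line(line: str) -> VideoTag:
    """Split a tab-separated line and decode the two list-valued fields."""
    fields = [field.strip() for field in line.rstrip("\n").split("\t")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    video_id, user_name, upload_date, theme, tags = fields
    # Assumption: Theme and Tags are Python list literals such as ['a', 'b'].
    # ast.literal_eval parses literals only and does not execute code.
    return VideoTag(
        video_id,
        user_name,
        upload_date,
        ast.literal_eval(theme),
        ast.literal_eval(tags),
    )


sample = (
    "  u66ruMnQQhU\tKesha Barrett\t20111009\t"
    "['Entertainment']\t['JoJo', 'ATA', 'Power', 'Kick']"
)
record = parse_video_tag_line(sample)
print(record.video_id, record.themes, record.tags)
```

The header row would need to be skipped before calling the parser, since its Theme and Tags columns are plain words rather than list literals.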

Example

The documents are saved per utterance in './time_stamp', and the fields on each line are separated by '\t'.

                  
    Identity:       id00000
    Reference:      DwgYRqnQZHM
    # 'X,Y' denote the top-left corner coordinates 
    # 'W,H' denote the width and height of the face bounding box.

    FRAME   X       Y       W       H

    000014  84      164     164     225
    000015  86      166     163     225
    000016  87      164     163     226
    000017  87      164     162     225
    000018  85      167     164     223
    000019  83      168     166     223
    000020  84      163     165     224
    000021  89      155     167     226
    000022  94      157     165     222
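
A minimal Python sketch of reading a document in the layout shown above. It assumes that metadata lines take the form `Key: value`, that lines starting with `#` are comments, and that each data row holds five integers (FRAME, X, Y, W, H). It splits on any whitespace so that tab- and space-separated rows are both accepted. The function name `parse_time_stamp` is hypothetical and used only for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int, int]  # (frame, x, y, width, height)


def parse_time_stamp(text: str) -> Tuple[Dict[str, str], List[Box]]:
    """Return (metadata, boxes) parsed from one './time_stamp' document."""
    meta: Dict[str, str] = {}
    boxes: List[Box] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        parts = line.split()  # any run of tabs or spaces separates fields
        if parts[0].endswith(":") and len(parts) >= 2:
            meta[parts[0][:-1]] = parts[1]  # e.g. "Identity:" -> "Identity"
        elif parts[0] == "FRAME":
            continue  # column header row
        elif len(parts) == 5 and all(p.isdigit() for p in parts):
            frame, x, y, w, h = (int(p) for p in parts)
            boxes.append((frame, x, y, w, h))
    return meta, boxes


sample = """\
    Identity:       id00000
    Reference:      DwgYRqnQZHM
    # 'X,Y' denote the top-left corner coordinates
    # 'W,H' denote the width and height of the face bounding box.

    FRAME   X       Y       W       H

    000014  84      164     164     225
    000015  86      166     163     225
"""
meta, boxes = parse_time_stamp(sample)
print(meta)
print(boxes)
```

Frame numbers are converted to integers here, which drops the zero padding; keep them as strings instead if the padded form is needed to match file names.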

Acknowledgement

This work was supported in part by the National Natural Science Foundation of China under Grants 62276220 and 62371407.