tinyML Talks on December 14, 2021 “The Multilingual Spoken Words Corpus, a Massive Keyword Spotting Dataset” by Mark Mazumder

We held our next tinyML Talks webcast. Mark Mazumder from Harvard University presented “The Multilingual Spoken Words Corpus, a Massive Keyword Spotting Dataset” on December 14, 2021.

This talk will present the Multilingual Spoken Words Corpus (MSWC), a speech dataset of over 340,000 spoken words in 50 languages, with over 23 million audio examples. MSWC has many use cases, ranging from voice-enabled consumer devices to call center automation. The dataset is CC-BY licensed and free for academic research and commercial use. We will introduce applications of MSWC for few-shot keyword spotting and spoken term search tasks in low-resource languages, and share a brief tutorial on getting started with the dataset. We will also discuss how we automated the construction of our dataset and our self-supervised approach for detecting outlier samples.

Mark Mazumder is a PhD student in Vijay Janapa Reddi’s group at Harvard University. His research interests are in efficient machine learning techniques for small datasets. Prior to joining Harvard, Mark was an Associate Staff member at MIT Lincoln Laboratory, where he performed research in computer vision and robotics.


Watch on YouTube:
Mark Mazumder

Download presentation slides:
Mark Mazumder

Feel free to ask your questions on this thread and keep the conversation going!

Thanks for helping with this!