tinyML Talks on December 14, 2021 “The Multilingual Spoken Words Corpus, a Massive Keyword Spotting Dataset” by Mark Mazumder

Olga · December 9, 2021, 6:29am

We held our next tinyML Talks webcast. Mark Mazumder from Harvard University presented “The Multilingual Spoken Words Corpus, a Massive Keyword Spotting Dataset” on December 14, 2021.

This talk will present the Multilingual Spoken Words Corpus (MSWC), a speech dataset of over 340,000 spoken words in 50 languages, with over 23 million audio examples. MSWC has many use cases, ranging from voice-enabled consumer devices to call center automation. The dataset is CC-BY licensed and free for academic research and commercial use. We will introduce applications of MSWC for few-shot keyword spotting and spoken term search tasks in low-resource languages, and share a brief tutorial on getting started with the dataset. We will also discuss how we automated the construction of our dataset and our self-supervised approach for detecting outlier samples.

Mark Mazumder is a PhD student in Vijay Janapa Reddi’s group at Harvard University. His research interests are in efficient machine learning techniques for small datasets. Prior to joining Harvard, Mark was an Associate Staff member at MIT Lincoln Laboratory, where he performed research in computer vision and robotics.

=========================

Watch on YouTube:
Mark Mazumder

Download presentation slides:
Mark Mazumder

Feel free to ask your questions on this thread and keep the conversation going!

nickm · December 15, 2021, 11:51am

Thanks for helping with this!
