Child vs Adult Speaker Diarization of naturalistic audio recordings in preschool environment using Deep Neural Networks

Prasanna Vasant Kothalkar; John H. L. Hansen; Jay Buzhardt; Dwight Irvin; Beth S Rous

Conference: ASEE 2021 Gulf-Southwest Annual Conference
Location: Waco, Texas
Publication Date: March 24, 2021
Start Date: March 24, 2021
End Date: March 26, 2021
Page Count: 11
DOI: 10.18260/1-2--36365
Permanent URL: https://peer.asee.org/36365
Download Count: 345

Paper Authors

Prasanna Vasant Kothalkar Center for Robust Speech Systems (CRSS), University of Texas at Dallas, TX, USA

Prasanna Kothalkar received the B.S. degree in Computer Engineering from Mumbai University, Mumbai, India, in 2010 and the M.S. degree in Computer Science from the University of Texas at Dallas, Dallas, United States, in 2014. He has held research internships at technology companies in the areas of Speech Processing and Machine Learning. He is currently pursuing his Ph.D. degree as a Research Assistant in the Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas (UTD), Richardson, United States, under the supervision of Dr. John H. L. Hansen. His research interests focus on Child Speech Pronunciation Modeling, Speech Recognition and Diarization, Machine Learning, and Deep Learning.

Jay Buzhardt

As an Associate Research Professor at the University of Kansas, Dr. Buzhardt’s research interests focus on developing and testing technology solutions to support data-driven intervention decision making in early childhood education. At Juniper Gardens Children’s Project (JGCP), he leads the Technology Innovation Development & Research (TIDR) Lab, a hybrid team of onsite full-time application developers and externally contracted developers that designs, develops, tests, and maintains online and mobile applications for nearly all JGCP interventions that use technology. Through grants funded by OSEP and IES and led by Dr. Buzhardt, the TIDR Lab developed and currently maintains the MOD and the IGDI platform on which it is hosted. Additionally, Dr. Buzhardt has led or co-led 10 federal grants from the Department of Education (five from the Office of Special Education Programs, five from the Institute of Education Sciences) and four from the National Institute on Disability, Independent Living, and Rehabilitation Research. He currently directs a project funded by the Institute of Education Sciences to develop a web application that guides educators' data-driven intervention decision making. He also leads a $2.5M project funded by the Office of Special Education Programs to develop and test strategies and applications grounded in Implementation Science to scale up sustained use of data-driven decision-making practices by infant-toddler service providers. He recently completed a second successful RCT of the MOD across four states to test web-based decision-making support vs. self-guided decision making in Early Head Start home visiting settings.
Other relevant projects include investigations of the construct and predictive validity of infant-toddler IGDI assessments, development of web-based professional development for elementary educators, and a current NSF-funded project to develop technology to automatically measure child and adult language in preschool and informal learning contexts.

John H. L. Hansen University of Texas at Dallas orcid.org/0000-0003-1382-9929

John H. L. Hansen received Ph.D. & M.S. degrees from Georgia Institute of Technology, and the B.S.E.E. degree from Rutgers Univ. He joined Univ. of Texas at Dallas (UTDallas) in 2005, where he is Associate Dean for Research, Prof. of Electrical & Computer Engineering, and holds a joint appointment in the School of Behavioral & Brain Sciences (Speech & Hearing). At UTDallas, he established the Center for Robust Speech Systems (CRSS). He is an ISCA Fellow, IEEE Fellow, past TC-Chair of IEEE Signal Proc. Society, Speech & Language Proc. Tech. Comm. (SLTC), and Technical Advisor to U.S. Delegate for NATO (IST/TG-01). He currently serves as President of ISCA (Inter. Speech Comm. Assoc.). He has supervised 92 PhD/MS thesis candidates, was the recipient of the 2020 UT-Dallas Provost’s Award for Grad. Research Mentoring and the 2005 Univ. Colorado Teacher Recognition Award, and is author/co-author of more than 750 journal/conference papers in the field of speech/language/hearing processing & technology.

Dwight Irvin Juniper Gardens Children's Project

Beth S Rous University of Kentucky

Dr. Beth Rous is a professor and researcher who works with students and organizations to apply research to generate new knowledge and solve real-world problems. Beth believes all children have a right to high-quality educational experiences and has generated over $98 million in grants and contracts to help realize this vision. She has worked at the state and national levels to help build, implement, and scale programs and services for children from vulnerable populations. Beth has conducted and provided consultation on numerous national research studies funded through the U.S. Department of Education and Administration for Children and Families. She has been trained in special education, early childhood, and leadership and holds a doctorate in educational administration from the University of Kentucky.

Abstract

Speech and language development in children is crucial to their long-term learning ability and lifelong educational journey. A child’s vocabulary size at kindergarten entry is an early indicator of learning to read and of potential long-term success in school. The preschool classroom is thus a promising venue for monitoring growth in young children by measuring their interactions with teachers and classmates. Automatic Speech Recognition (ASR) technologies give early childhood researchers the ability to analyze naturalistic recordings in these settings automatically. For this purpose, data was collected in a high-quality childcare learning center in the United States using Language Environment Analysis (LENA) devices worn by the preschool children. A preliminary task for ASR of daylong audio recordings is diarization, i.e., segmenting speech into smaller parts to identify ‘who spoke when’. This study investigates different Deep Learning-based diarization systems for classroom interactions of 3-5 year old children. The focus, however, is on ‘speaker group’ diarization, which classifies speech segments as coming from adults or from children across multiple classrooms. SincNet-based diarization systems achieve an utterance-level Diarization Error Rate of 21.6%. Utterance-level speaker group confusion matrices also show promising, balanced results. These diarization systems have potential applications in developing metrics for rapid adult-to-child or child-to-child conversational turns in a naturalistic, noisy early childhood setting. Such technical advancements will also help teachers quantify and understand their interactions with children better and more efficiently, make changes as needed, and monitor the impact of those changes.
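
As a rough illustration of the two utterance-level measures named in the abstract, the sketch below builds an adult/child confusion matrix and a duration-weighted error rate from per-utterance labels. It is not the paper's scoring procedure: it assumes utterance boundaries are already given, so it covers only the speaker-group confusion term, whereas a full Diarization Error Rate also counts missed speech and false alarms. The function names, label strings, and toy data are all hypothetical.

```python
from collections import Counter

LABELS = ("adult", "child")  # speaker groups named in the abstract


def confusion_matrix(reference, hypothesis, labels=LABELS):
    """Count (reference, hypothesis) label pairs, one pair per utterance.

    Rows follow the reference label, columns the hypothesis label.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    counts = Counter(zip(reference, hypothesis))
    return [[counts[(r, h)] for h in labels] for r in labels]


def utterance_error_rate(reference, hypothesis, durations):
    """Duration-weighted share of utterances assigned to the wrong group."""
    total = sum(durations)
    if total <= 0:
        raise ValueError("total duration must be positive")
    wrong = sum(d for r, h, d in zip(reference, hypothesis, durations) if r != h)
    return wrong / total


# Made-up toy data for illustration only, not taken from the paper.
ref = ["adult", "child", "child", "adult", "child"]
hyp = ["adult", "child", "adult", "adult", "child"]
dur = [2.0, 1.0, 1.0, 3.0, 3.0]  # utterance durations in seconds

matrix = confusion_matrix(ref, hyp)
error = utterance_error_rate(ref, hyp, dur)
```

Weighting by duration keeps a long misclassified utterance from counting the same as a short one; an unweighted variant would pass a duration of 1 for every utterance.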

Citation

Kothalkar, P. V., Buzhardt, J., Hansen, J. H. L., Irvin, D., & Rous, B. S. (2021, March). Child vs Adult Speaker Diarization of naturalistic audio recordings in preschool environment using Deep Neural Networks. Paper presented at ASEE 2021 Gulf-Southwest Annual Conference, Waco, Texas. 10.18260/1-2--36365