A Multimodal Framework for Speech Emotion Recognition in Low-Resource Languages
DOI: https://doi.org/10.37965/jait.2025.0781
Keywords: deep learning, Kazakh language, KEMO, low-resource languages, multimodal learning, speech emotion recognition
Abstract
Speech emotion recognition (SER) plays a crucial role in enhancing human–computer interaction by identifying emotional states in speech. However, low-resource languages such as Kazakh face challenges due to limited annotated datasets and linguistic tools. To address this problem, we propose a novel multimodal framework, KEMO (Kazakh Emotion Multimodal Optimizer), which combines text-based semantic analysis with audio-based emotion recognition to leverage the complementary linguistic and paralinguistic features of speech. Using a Kazakh-translated version of the DAIR-AI emotion dataset (built on CARER: Contextualized Affect Representations for Emotion Recognition) for text and the RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) dataset for audio, we developed a system that classifies six emotions from text and eight emotions from audio. By adaptively weighting the outputs of the speech-to-text and audio-based recognition models, KEMO significantly improves the accuracy and robustness of emotion classification, providing an effective solution for SER in low-resource language scenarios.
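
As a concrete illustration of the adaptive-weighting fusion step the abstract describes, the following Python sketch combines the probability outputs of a text model and an audio model by weighting each modality with its own confidence score. This is a minimal sketch under stated assumptions, not the authors' implementation: the shared label set, the confidence inputs, and all names (fuse_predictions, text_probs, audio_probs) are illustrative, and KEMO's actual weighting scheme is not specified in the abstract.

    import numpy as np

    # Hypothetical shared label set: the overlap between the six text
    # emotions and the eight RAVDESS audio emotions. The paper's actual
    # label mapping is not specified in the abstract.
    SHARED_LABELS = ["anger", "fear", "joy", "sadness", "surprise"]

    def fuse_predictions(text_probs, audio_probs, text_conf, audio_conf):
        """Late fusion with adaptive weighting: each modality's probability
        vector over SHARED_LABELS is scaled by that modality's confidence,
        and the fused distribution is renormalized before the argmax."""
        w_text = text_conf / (text_conf + audio_conf)    # adaptive weight, text
        w_audio = audio_conf / (text_conf + audio_conf)  # adaptive weight, audio
        fused = w_text * text_probs + w_audio * audio_probs
        fused /= fused.sum()  # renormalize to a valid probability distribution
        return SHARED_LABELS[int(np.argmax(fused))]

    # Example: the audio branch is more confident (clear prosody), so it
    # dominates the fused decision even though the text branch leans "joy".
    text_probs = np.array([0.10, 0.05, 0.60, 0.15, 0.10])   # leans "joy"
    audio_probs = np.array([0.55, 0.10, 0.15, 0.10, 0.10])  # leans "anger"
    print(fuse_predictions(text_probs, audio_probs, text_conf=0.4, audio_conf=0.8))
    # -> "anger"

One plausible benefit of weighting by per-modality confidence, as in this sketch, is graceful degradation: when speech-to-text transcription quality drops, the fusion leans on the audio branch, and vice versa.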
License
Copyright (c) 2025 Authors

This work is licensed under a Creative Commons Attribution 4.0 International License.