A Multimodal Framework for Speech Emotion Recognition in Low-Resource Languages

Authors

Mamyr Altaibek, A. Zulkhazhav, B. Yergesh, Gulmira Bekmanova, and Tileukhan Aibol

DOI:

https://doi.org/10.37965/jait.2025.0781

Keywords:

deep learning, Kazakh language, KEMO, low-resource languages, multimodal learning, speech emotion recognition

Abstract

Speech emotion recognition (SER) plays a crucial role in enhancing human–computer interaction by identifying emotional states in speech. Low-resource languages such as Kazakh, however, face challenges due to limited datasets and linguistic tools. To address this problem, we propose a novel multimodal framework, KEMO (Kazakh Emotion Multimodal Optimizer), which combines text-based semantic analysis with audio emotion recognition to exploit the complementary linguistic and paralinguistic features of speech. Using a Kazakh-translated version of the dair-ai Emotion dataset (built with CARER, Contextualized Affect Representations for Emotion Recognition) for text and the RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) dataset for audio, we develop a system that classifies six emotions from text and eight emotions from audio. By integrating the outputs of speech-to-text and audio-based recognition models through adaptive weighting, KEMO significantly improves the accuracy and robustness of emotion classification, providing an effective solution for SER in low-resource language scenarios.
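The abstract does not specify how the adaptive weighting between the two modalities is computed. The following is a minimal sketch of one plausible late-fusion scheme, in which each modality's class distribution is weighted by its confidence (inverse entropy) and the eight RAVDESS audio labels are projected onto the six text labels before fusing. All names here (TEXT_LABELS, AUDIO_TO_TEXT, fuse) and the entropy-based weighting itself are illustrative assumptions, not the published KEMO method.

```python
import numpy as np

# Label sets from the abstract: six text emotions (dair-ai Emotion) and
# eight audio emotions (RAVDESS). The mapping between them is an
# illustrative assumption; the paper does not specify KEMO's mapping.
TEXT_LABELS = ["sadness", "joy", "love", "anger", "fear", "surprise"]
AUDIO_LABELS = ["neutral", "calm", "happy", "sad", "angry",
                "fearful", "disgust", "surprised"]
AUDIO_TO_TEXT = {"happy": "joy", "sad": "sadness", "angry": "anger",
                 "fearful": "fear", "surprised": "surprise"}


def project_audio_probs(p_audio: np.ndarray) -> np.ndarray:
    """Project an 8-way audio distribution onto the 6 text labels, renormalized."""
    p = np.zeros(len(TEXT_LABELS))
    for i, label in enumerate(AUDIO_LABELS):
        if label in AUDIO_TO_TEXT:
            p[TEXT_LABELS.index(AUDIO_TO_TEXT[label])] += p_audio[i]
    total = p.sum()
    return p / total if total > 0 else np.full(len(TEXT_LABELS), 1 / len(TEXT_LABELS))


def fuse(p_text: np.ndarray, p_audio: np.ndarray) -> tuple[str, np.ndarray]:
    """Adaptively weighted late fusion: the lower-entropy (more confident)
    modality receives the larger weight."""
    p_a = project_audio_probs(p_audio)
    entropy = lambda p: -np.sum(p * np.log(p + 1e-12))
    conf_t = 1.0 / (1.0 + entropy(p_text))
    conf_a = 1.0 / (1.0 + entropy(p_a))
    w = conf_t / (conf_t + conf_a)        # adaptive text weight in [0, 1]
    p = w * p_text + (1.0 - w) * p_a      # fused distribution over text labels
    return TEXT_LABELS[int(np.argmax(p))], p


if __name__ == "__main__":
    p_text = np.array([0.05, 0.70, 0.05, 0.10, 0.05, 0.05])   # confident "joy"
    p_audio = np.array([0.10, 0.10, 0.30, 0.10, 0.15, 0.10, 0.05, 0.10])
    label, p = fuse(p_text, p_audio)
    print(label, np.round(p, 3))
```

On these example inputs the text distribution has lower entropy, so it receives the larger weight (w ≈ 0.55) and the fused prediction is "joy"; a real system would likely learn or tune the weighting on validation data rather than hard-coding an entropy heuristic.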

Published

2025-09-03

How to Cite

Altaibek, M., Zulkhazhav, A., Yergesh, B., Bekmanova, G., & Tileukhan, A. (2025). A Multimodal Framework for Speech Emotion Recognition in Low-Resource Languages. Journal of Artificial Intelligence and Technology. https://doi.org/10.37965/jait.2025.0781

Section

Research Articles