An ASR Transformer-Based Model for Kannada Speech-to-Text Transcription

Authors

  • Chandrika Prasad Department of Computer Science and Engineering, Ramaiah Institute of Technology (Affiliated to VTU, Belagavi), Karnataka, India https://orcid.org/0000-0002-1798-9551
  • veenags Department of Computer Science and Engineering, Ramaiah Institute of Technology (Affiliated to VTU, Belagavi), Karnataka, India https://orcid.org/0000-0002-3933-6113
  • Geetha Department of Computer Science and Engineering, Ramaiah Institute of Technology (Affiliated to VTU, Belagavi), Karnataka, India https://orcid.org/0000-0003-4707-9257
  • R China Appala Naidu Department of Computer Science and Engineering, Ramaiah Institute of Technology (Affiliated to VTU, Belagavi), Karnataka, India https://orcid.org/0000-0002-2142-1694

DOI:

https://doi.org/10.37965/jait.2026.0935

Keywords:

automatic speech recognition, fast Fourier transform, FLEURS, Mel spectrograms, transformer model, Whisper OpenAI

Abstract

This work presents a dialect-aware and noise-robust Kannada automatic speech recognition (ASR) system that bridges the gap between low-resource linguistic contexts and state-of-the-art deep learning models. We design a two-stage approach: (i) training a convolutional neural network (CNN)–Transformer hybrid from scratch on curated Kannada speech data with fast Fourier transform (FFT)-based noise reduction and (ii) fine-tuning OpenAI’s Whisper-small model on a dialect-diverse corpus. The proposed pipeline integrates adaptive noise suppression, subword tokenization, and beam-search decoding to handle agglutinative morphology, speaker variation, and environmental noise.
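To make the beam-search decoding step mentioned above concrete, the following is a minimal sketch in plain Python. It is not the authors' decoder: the `step_fn` interface, the `toy_step` model, its token names, and its probabilities are all hypothetical and chosen only for illustration, and no length normalization is applied.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[Tuple[str, ...], float]  # (token sequence, summed log-probability)


def beam_search(
    step_fn: Callable[[Tuple[str, ...]], Dict[str, float]],
    beam_width: int,
    max_len: int,
    eos: str = "</s>",
) -> Hypothesis:
    """Return the highest-scoring hypothesis found with a fixed-width beam.

    step_fn(prefix) is assumed to return a dict mapping each candidate next
    token to its log-probability given the prefix (hypothetical interface).
    """
    beams: List[Hypothesis] = [((), 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Expand every live hypothesis by every candidate next token.
        candidates: List[Hypothesis] = []
        for prefix, score in beams:
            for token, logp in step_fn(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        # Keep only the beam_width best expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq[-1] == eos:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    # Hypotheses that hit max_len without emitting eos still compete.
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])


def toy_step(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Hypothetical toy model in which the greedy first choice is suboptimal."""
    if prefix == ():
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if prefix == ("a",):
        return {"x": math.log(0.5), "y": math.log(0.5)}
    if prefix == ("b",):
        return {"z": math.log(0.9), "y": math.log(0.1)}
    return {"</s>": 0.0}  # always end after two tokens


if __name__ == "__main__":
    print(beam_search(toy_step, beam_width=1, max_len=5))  # greedy decoding
    print(beam_search(toy_step, beam_width=2, max_len=5))  # wider beam
```

With a beam width of 1 the search reduces to greedy decoding and commits to the locally best first token; with a width of 2 it keeps the second-best first token alive long enough to find a sequence with higher overall probability.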

Extensive experiments were conducted on two datasets: the Kannada subset of the multilingual Few-shot Learning Evaluation of Universal Representations of Speech (FLEURS) benchmark and a curated, custom-collected Kannada speech dataset of 150 samples covering both formal and conversational speech. On the FLEURS test set (≈ 2.5 hours), our fine-tuned Whisper model achieves a word error rate (WER) of 0.15, a character error rate (CER) of 0.255, and a BLEU score of 0.912, representing a 53% relative reduction in WER and a 36% reduction in CER compared to the scratch CNN–Transformer baseline.

For the custom dataset, the fine-tuned Whisper model achieves a WER of 0.2311 and a CER of 0.0453, outperforming the Google Speech-to-Text application programming interface (API) by 16.8% (relative WER reduction) and surpassing the scratch transformer by over 70% in WER. We further evaluate robustness under dialectal variation and noisy recordings, providing detailed error analysis and computational efficiency metrics. To our knowledge, this is the first comprehensive evaluation of Whisper fine-tuning for Kannada, demonstrating its viability for real-time, edge-deployable applications in education, accessibility, and public administration.
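For readers unfamiliar with the metrics reported in the abstract, the sketch below shows the standard edit-distance definitions of WER and CER and how a relative reduction between two systems is computed. It is an illustration, not the authors' evaluation code. Two assumptions are made here: words are split on whitespace, and CER is computed over Unicode code points with spaces included, whereas an evaluation of Kannada text may instead count grapheme clusters or apply text normalization first.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref_words = reference.split()  # assumption: whitespace tokenization
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate over Unicode code points (assumption, see lead-in)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers the error rate of `baseline`."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    print("WER:", wer(ref, hyp))
    print("CER:", cer(ref, hyp))
    # Illustrative numbers only, not results from the paper.
    print("Relative reduction:", relative_reduction(0.40, 0.20))
```

Because both metrics are normalized by the reference length, they can exceed 1.0 when the hypothesis contains many insertions, and a relative reduction is always expressed with respect to the baseline system's error rate.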

Published

2026-02-03

How to Cite

Prasad, C., Swamy Rao, V. G., J, G., & Naidu, R. C. A. (2026). An ASR Transformer-Based Model for Kannada Speech-to-Text Transcription. Journal of Artificial Intelligence and Technology. https://doi.org/10.37965/jait.2026.0935

Section

Research Articles