An ASR Transformer-Based Model for Kannada Speech-to-Text Transcription
DOI: https://doi.org/10.37965/jait.2026.0935
Keywords: automatic speech recognition, fast Fourier transform, FLEURS, Mel spectrograms, transformer model, Whisper OpenAI
Abstract
This work presents a dialect-aware and noise-robust Kannada automatic speech recognition (ASR) system that bridges the gap between low-resource linguistic contexts and state-of-the-art deep learning models. We design a two-stage approach: (i) a convolutional neural network (CNN)–Transformer hybrid built from scratch and trained on curated Kannada speech data with fast Fourier transform (FFT)-based noise reduction and (ii) OpenAI’s Whisper-small model fine-tuned on a dialect-diverse corpus. The proposed pipeline integrates adaptive noise suppression, subword tokenization, and beam-search decoding to handle agglutinative morphology, speaker variation, and environmental noise.
Extensive experiments were conducted on two datasets: the Kannada subset of the multilingual Few-shot Learning Evaluation of Universal Representations of Speech (FLEURS) benchmark and a curated, custom-collected Kannada speech dataset of 150 samples covering both formal and conversational speech. On the FLEURS test set (approximately 2.5 hours), our fine-tuned Whisper model achieves a word error rate (WER) of 0.15, a character error rate (CER) of 0.255, and a BLEU score of 0.912, representing a 53% relative reduction in WER and a 36% reduction in CER compared to the scratch CNN–Transformer baseline.
On the custom dataset, the fine-tuned Whisper model achieves a WER of 0.2311 and a CER of 0.0453, outperforming the Google Speech-to-Text application programming interface (API) by 16.8% (relative WER reduction) and surpassing the scratch transformer by over 70% in WER. We further evaluate robustness under dialectal variation and noisy recordings, providing detailed error analysis and computational efficiency metrics. To our knowledge, this is the first comprehensive evaluation of Whisper fine-tuning for Kannada, demonstrating its viability for real-time, edge-deployable applications in education, accessibility, and public administration.
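The abstract reports WER, CER, and relative reductions but does not show how they are scored. As a minimal sketch, and not the authors' evaluation code, the metrics could be computed as below. It assumes whitespace word tokenization and treats each Unicode code point as one character; for Kannada, a grapheme-cluster segmentation may be more appropriate. The function names (`edit_distance`, `wer`, `cer`, `relative_reduction`) are hypothetical, and the sample strings are placeholders rather than data from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()  # assumption: whitespace tokenization
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must contain at least one character")
    # assumption: one Unicode code point counts as one character
    return edit_distance(reference, hypothesis) / len(reference)


def relative_reduction(baseline, improved):
    """Relative error reduction of `improved` with respect to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline - improved) / baseline


# Placeholder inputs (not taken from the paper):
print(wer("a b c d", "a x c d"))       # one substitution in four words -> 0.25
print(cer("abcd", "abxd"))             # one substitution in four chars -> 0.25
print(relative_reduction(0.5, 0.25))   # error halved -> 0.5
```

Under this definition, a "16.8% relative WER reduction" means the difference between the two systems' WERs divided by the baseline system's WER, not a difference of 16.8 absolute points.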
License
Copyright (c) 2026 Authors

This work is licensed under a Creative Commons Attribution 4.0 International License.
