Deep Learning Multimodal Sarcasm Detection in Social Media Comments: The Role of Memes and Emojis

Eka Dyar Wahyuni; Tri Lathif Mardi Suryanto; Heidy Arviani

doi:10.37965/jait.2025.0699

Authors

Eka Dyar Wahyuni, Department of Information Systems, Universitas Pembangunan Nasional “Veteran” Jawa Timur, Surabaya, Indonesia. ORCID: https://orcid.org/0000-0003-2541-1474
Tri Lathif Mardi Suryanto, Department of Information Systems, Universitas Pembangunan Nasional “Veteran” Jawa Timur, Surabaya, Indonesia. ORCID: https://orcid.org/0000-0001-7532-2440
Heidy Arviani, Department of Communication Sciences, Universitas Pembangunan Nasional “Veteran” Jawa Timur, Surabaya, Indonesia. ORCID: https://orcid.org/0000-0001-5908-8797

Keywords:

emoji, deep learning, meme, sarcasm detection

Abstract

Social media has become a crucial platform for interaction, information exchange, and market analysis. Businesses and researchers rely on it for sentiment and emotion analysis, yet sarcasm detection remains a major challenge because sarcasm can alter the sentiment polarity of a statement. Traditional text-based analysis struggles with sarcasm because written text lacks tone of voice and facial expressions. Additionally, crucial indicators of sarcasm (repeated emojis, punctuation, and characters) are often discarded during preprocessing. To address this issue, we proposed a multimodal deep-learning approach that integrated text, emojis, and images to improve sarcasm detection. This approach preserved repeated emojis, punctuation, and characters and transformed them into structured features rather than removing them. Images were processed using Optical Character Recognition (OCR) to extract their embedded text; non-textual visual elements were excluded to keep computation efficient. Word representations were then generated using Word2Vec embeddings, which were fed into LSTM, GRU, and BiLSTM models. The study highlighted the importance of scenario-specific preprocessing and feature selection in sarcasm detection. Among the 15 models tested, LSTM–composite demonstrated stable accuracy and strong generalization (76% accuracy, 73% precision, and 82% recall), but its high computational cost made it unsuitable for large-scale deployment. In contrast, Model 9 (BiLSTM–isRepeatedChar) balanced efficiency and predictive performance (76% accuracy, 74% precision, and 79% recall), which made it well suited to resource-limited environments.
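
As an illustration of how repeated characters, punctuation, and emojis might be kept as structured features instead of being stripped during preprocessing, the following is a minimal sketch and not the implementation used in the study. Only the name isRepeatedChar comes from the abstract; the other feature names, the repetition thresholds, the punctuation set, and the rough emoji code-point ranges are assumptions made for this example.

```python
import re

# Assumption: a "repeated character" is the same letter three or more times in a row.
REPEATED_CHAR = re.compile(r"([A-Za-z])\1{2,}")

# Assumption: "repeated punctuation" is the same mark (!, ?, or .) twice or more in a row.
REPEATED_PUNCT = re.compile(r"([!?.])\1+")

# Assumption: a rough approximation of emoji code points; real emoji coverage is wider.
EMOJI_CLASS = "[\U0001F300-\U0001FAFF\u2600-\u27BF]"
REPEATED_EMOJI = re.compile(f"({EMOJI_CLASS})\\1+")


def extract_repetition_features(text):
    """Return binary flags marking repetition cues found in a comment.

    The flags are meant to be appended to the model input so that the
    cues survive later text normalization.
    """
    return {
        "isRepeatedChar": int(bool(REPEATED_CHAR.search(text))),
        "isRepeatedPunct": int(bool(REPEATED_PUNCT.search(text))),   # hypothetical name
        "isRepeatedEmoji": int(bool(REPEATED_EMOJI.search(text))),   # hypothetical name
    }


if __name__ == "__main__":
    print(extract_repetition_features("Greaaaat job!!! 🙄🙄"))
    print(extract_repetition_features("Nice work."))
```

In a pipeline like the one described, flags of this kind would be computed on the raw comment before any normalization and then combined with the Word2Vec-based representation; how the study actually combined them is not stated in the abstract.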