Trusting ChatGPT? When a Subtle Variation in the Prompt Can Significantly Alter the Results
DOI:
https://doi.org/10.37965/jait.2026.0860

Keywords:
ChatGPT, large language models (LLMs), trust, robustness, sentiment analysis, Spanish

Abstract
How much can we trust highly complex predictive models like ChatGPT? This study tests whether subtle changes in prompt structure produce significant variations in the sentiment-polarity classifications generated by the LLM GPT-4o mini. The model classified 100,000 Spanish-language comments about four Latin American presidents as positive, negative, or neutral on ten occasions, with the prompt varied each time. The experimental methodology combined exploratory and confirmatory analyses to identify significant discrepancies among the classifications.
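As an illustration of this kind of setup, the sketch below shows how a single classification run could be issued with the OpenAI Python SDK. The prompt wording, label set, and helper name are hypothetical placeholders; the study's actual prompts are not reproduced here.

```python
# Hypothetical sketch of one classification run; the study's exact prompts differ.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# One of several prompt variants; the experiment repeats the full run with
# lexically, syntactically, and modally altered instructions.
PROMPT_VARIANT = (
    "Classify the following comment as positive, negative, or neutral. "
    "Answer with a single word in Spanish."
)

def classify(comment: str) -> str:
    """Ask GPT-4o mini for the sentiment polarity of one Spanish comment."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": PROMPT_VARIANT},
            {"role": "user", "content": comment},
        ],
        temperature=0,  # reduce sampling noise so prompt wording is the main variable
    )
    return response.choices[0].message.content.strip().lower()

print(classify("El presidente cumplió su promesa."))
```

Setting the temperature to zero is one way to isolate the prompt itself as the experimental variable, since any remaining variation across runs is then attributable to the instructions rather than to sampling.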
The results reveal that minor modifications to prompts, whether lexical, syntactic, or modal, and even the absence of grammatical structure, affect the classifications. At times, the model produced undecided responses, mixed categories, provided unsolicited explanations, or answered in languages other than Spanish. Statistical analysis using chi-square tests confirmed significant differences in most pairwise comparisons between prompts, except in one case in which the prompts' linguistic structures were similar.
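For the confirmatory step, a pairwise comparison of label distributions can be framed as a chi-square test of homogeneity, for instance with scipy as sketched below. The counts are invented placeholders, not the study's data.

```python
# Minimal sketch of a pairwise chi-square comparison; counts are placeholders.
from scipy.stats import chi2_contingency

# Rows: two prompt variants; columns: positive / negative / neutral label counts.
contingency = [
    [41000, 38000, 21000],  # prompt variant A (invented numbers)
    [39500, 40200, 20300],  # prompt variant B (invented numbers)
]

chi2, p_value, dof, _ = chi2_contingency(contingency)
print(f"chi2={chi2:.1f}, dof={dof}, p={p_value:.3g}")
# A small p-value indicates the two prompts yield significantly different
# label distributions, i.e., the classifications are not prompt-invariant.
```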
These findings challenge the robustness and trustworthiness of large language models (LLMs) for classification tasks, highlighting their vulnerability to variations in instructions. Moreover, prompts lacking structured grammar increased the frequency of hallucinations. The discussion underscores that trust in LLMs rests not only on technical performance but also on the social and institutional relationships underpinning their use.
License
Copyright (c) 2026 Authors

This work is licensed under a Creative Commons Attribution 4.0 International License.
