Three Large Language Models for Solidity Smart Contract Vulnerability Detection

Authors

  • Amjad Almaghthawi Faculty of Computer Science Department, College of Computer Science and Engineering, Taibah University https://orcid.org/0009-0003-4571-3311
  • Wael M.S. Yafooz Faculty of Computer Science Department, College of Computer Science and Engineering, Taibah University
  • Nasser S. Albalawi Department of Computer Science, Faculty of Computing and Information Technology, Northern Border University

DOI:

https://doi.org/10.37965/jait.2025.0811

Keywords:

Code Llama, DeepSeek, GPT-3.5, large language models, prompt engineering, smart contract, Solidity, vulnerability detection

Abstract

In decentralized applications, smart contracts are used to conduct trusted transactions on the Blockchain (BC). While smart contracts are highly effective, they are also highly susceptible to security flaws that can lead to serious financial losses. The combination of BC technology and artificial intelligence offers a foundation for powerful, secure, and decentralized applications across various sectors. In particular, large language models (LLMs), advanced machine learning frameworks, are now used in a wide range of applications, including customer service, chatbots, code generation, vulnerability detection, and language translation.

This study investigates the use of LLMs for automated vulnerability detection in Solidity-based smart contracts. Specifically, three models are evaluated and compared: GPT-3.5-turbo, DeepSeek R1, and LLaMA-3. Using a labeled, multi-class dataset covering four vulnerability types, the models are assessed under three reasoning strategies: zero-shot, few-shot, and chain-of-thought (CoT). A prompt-based evaluation and performance comparison is conducted using standard metrics: accuracy, precision, recall, F1-score, and average detection time.
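The three reasoning strategies differ only in how the classification prompt is framed. The templates below are a minimal sketch of that distinction; the wording and the four vulnerability names are illustrative assumptions, not the authors' actual prompts.

```python
# Illustrative prompt templates for the three reasoning strategies.
# VULN_TYPES is an assumed label set, used here only for demonstration.
VULN_TYPES = "reentrancy, integer overflow, timestamp dependence, unchecked external call"

# Zero-shot: the task description alone, no examples.
ZERO_SHOT = (
    "You are a Solidity security auditor. Classify the vulnerability in the "
    f"contract below as one of: {VULN_TYPES}.\n\nContract:\n{{code}}\n\nAnswer:"
)

# Few-shot: one or more labeled examples precede the query contract.
FEW_SHOT = (
    "You are a Solidity security auditor.\n"
    "Example contract:\n{example_code}\nLabel: {example_label}\n\n"
    f"Now classify the next contract as one of: {VULN_TYPES}.\n"
    "Contract:\n{code}\n\nAnswer:"
)

# Chain-of-thought: the model is told to reason step by step before answering.
CHAIN_OF_THOUGHT = (
    "You are a Solidity security auditor. Think step by step: trace external "
    "calls, state updates, and arithmetic before deciding. Then classify the "
    f"contract below as one of: {VULN_TYPES}.\n\nContract:\n{{code}}\n\nAnswer:"
)

# Filling a template with a contract source produces the final prompt.
prompt = ZERO_SHOT.format(code="contract Bank { /* ... */ }")
```

Each filled prompt would then be sent to the model under test, with the returned label compared against the dataset's ground truth.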

Results show that in the zero-shot setting, GPT-3.5-turbo achieves the highest accuracy at 94.59%, followed closely by LLaMA-3 at 92%, while DeepSeek R1 reaches 78.95%. In the few-shot setting, LLaMA-3 outperforms the other models. In the CoT setting, LLaMA-3 again demonstrates the strongest overall performance, with 96% accuracy and an F1-score of 0.82, surpassing DeepSeek R1's average of 78.95% and GPT-3.5's notably lower CoT performance. This study thus contributes an evaluation framework for LLM-based vulnerability detection and demonstrates that prompt engineering can enhance the security of smart contracts.
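Accuracy and F1 figures like those above can be derived from per-contract predictions. Below is a minimal sketch of accuracy and macro-averaged precision, recall, and F1 for a multi-class task; the label names in the usage line are illustrative, not the dataset's actual classes.

```python
def evaluate(y_true, y_pred):
    """Accuracy and macro-averaged precision/recall/F1 for multi-class labels."""
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precisions, recalls, f1s = [], [], []
    for c in labels:
        # Per-class counts: true positives, false positives, false negatives.
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(p == c and t != c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

# Illustrative usage: one of four predictions is wrong, so accuracy is 0.75.
metrics = evaluate(
    ["reentrancy", "overflow", "reentrancy", "tx-origin"],
    ["reentrancy", "overflow", "overflow", "tx-origin"],
)
```

Macro averaging weights every vulnerability class equally, which matters when, as is typical, some classes are rarer than others in the dataset.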


Published

2025-08-20

How to Cite

Almaghthawi, A., Yafooz, W. M. S., & Albalawi, N. S. (2025). Three Large Language Models for Solidity Smart Contract Vulnerability Detection. Journal of Artificial Intelligence and Technology, 5, 314–322. https://doi.org/10.37965/jait.2025.0811

Section

Research Articles