Diabetes Prediction Using Hybrid Supervised and Unsupervised Techniques Based on PIMA Dataset

Ahmad Adel Abu-Shareha; Mosleh Abualhaj; Abdelrahman H. Hussein; Amal Amer; Anusha Achuthan; Alfian Abdul Halin

doi:10.37965/jait.2025.0899

Authors

Ahmad Adel Abu-Shareha, Department of Data Science and Artificial Intelligence, Al-Ahliyya Amman University, Amman, Jordan, https://orcid.org/0000-0002-2374-3152
Mosleh Abualhaj, Department of Networks and Information Security, Al-Ahliyya Amman University, Amman, Jordan
Abdelrahman H. Hussein, Department of Networks and Information Security, Al-Ahliyya Amman University, Amman, Jordan
Amal Amer, Department of Data Science and Artificial Intelligence, Al-Ahliyya Amman University, Amman, Jordan, https://orcid.org/0009-0005-4023-8691
Anusha Achuthan, School of Computer Sciences, Universiti Sains Malaysia, Gelugor, Penang, Malaysia, https://orcid.org/0000-0002-2015-2269
Alfian Abdul Halin, School of Computer Science and Information Technology, Universiti Putra Malaysia, Serdang, Selangor, Malaysia

Keywords:

classification, clustering, diabetes prediction

Abstract

Diabetes prediction using machine learning remains challenging because available medical datasets are small and inherently imbalanced. This paper presents a hybrid framework that combines supervised and unsupervised machine learning techniques to improve the accuracy and robustness of early diabetes prediction on small-scale medical datasets, specifically the PIMA Indian Diabetes Dataset. The framework integrates clustering, feature selection, and classification. Feature selection using Mutual Information reduces computational complexity while preserving discriminative power. The unsupervised clustering component groups similar patient records to reduce intra-class variability, which improves class separability for the subsequent supervised learning stage. Thirteen classifiers are evaluated under clustered and non-clustered settings: Support Vector Machine, K-Nearest Neighbors, Decision Tree, Random Forest (RF), Neural Networks, Adaptive Boosting, Gaussian Naïve Bayes, Quadratic Discriminant Analysis, Skope Rules, eXtreme Gradient Boosting (XGB), Gradient Boosting, Deep Neural Network, and Logistic Regression. Experimental results show that ensemble-based classifiers, particularly RF and XGB, achieve the highest accuracy, precision, recall, and area under the curve (AUC) scores across two optimized clusters, indicating that integrating clustering and feature selection substantially improves the robustness of diabetes prediction models. The proposed framework achieved 88.5% accuracy, 0.836 precision, 0.836 recall, 0.836 F-measure, and 0.874 AUC with the RF classifier, and 88.5% accuracy, 0.838 precision, 0.832 recall, 0.835 F-measure, and 0.873 AUC with the XGB classifier.
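
The following is a minimal, dependency-free sketch of the kind of hybrid pipeline the abstract describes, not the authors' implementation. It scores features by Mutual Information, groups records into two clusters, and then classifies each record using only the training records in its own cluster. Several choices are assumptions made for illustration: the order of the stages (feature selection before clustering), equal-width binning for the Mutual Information estimate, k-means as the clustering method, K-Nearest Neighbors as the per-cluster classifier (one of the thirteen evaluated, chosen here for brevity rather than RF or XGB), no feature scaling, and synthetic data in place of the PIMA dataset. All function names are hypothetical.

```python
import math
import random
from collections import Counter

def mutual_information(xs, ys, bins=4):
    """MI in nats between a feature (equal-width binned) and a class label."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant features
    binned = [min(int((x - lo) / width), bins - 1) for x in xs]
    n = len(xs)
    joint, px, py = Counter(zip(binned, ys)), Counter(binned), Counter(ys)
    # I(X;Y) = sum over (x, y) of p(x,y) * log(p(x,y) / (p(x) * p(y)))
    return sum(c / n * math.log(c * n / (px[b] * py[y]))
               for (b, y), c in joint.items())

def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def nearest(x, centers):
    """Index of the center closest to x."""
    return min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))

def kmeans(rows, k=2, iters=20, seed=0):
    """Plain k-means; returns the final cluster centers."""
    centers = random.Random(seed).sample(rows, k)
    for _ in range(iters):
        labels = [nearest(r, centers) for r in rows]
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers

def fit_hybrid(rows, ys, top_k=2, k=2):
    """Select top_k features by MI, cluster, and store training rows per cluster."""
    scores = [mutual_information(col, ys) for col in zip(*rows)]
    keep = sorted(sorted(range(len(scores)), key=lambda j: -scores[j])[:top_k])
    reduced = [[r[j] for j in keep] for r in rows]
    centers = kmeans(reduced, k)
    pools = [[] for _ in range(k)]
    for x, y in zip(reduced, ys):
        pools[nearest(x, centers)].append((x, y))
    return keep, centers, pools

def predict(model, row, n_neighbors=5):
    """Route the row to its cluster, then vote among its nearest neighbors there.
    Assumes every cluster received at least one training row."""
    keep, centers, pools = model
    x = [row[j] for j in keep]
    pool = sorted(pools[nearest(x, centers)], key=lambda m: sq_dist(x, m[0]))
    return Counter(y for _, y in pool[:n_neighbors]).most_common(1)[0][0]

# Synthetic stand-in data (NOT the PIMA dataset): two informative features, one noise.
rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(200)]
data = [[rng.gauss(3.0 * y, 1.0), rng.gauss(2.0 * y, 1.0), rng.gauss(0.0, 1.0)]
        for y in labels]

model = fit_hybrid(data[:150], labels[:150])
hits = sum(predict(model, r) == y for r, y in zip(data[150:], labels[150:]))
accuracy = hits / 50
print("kept features:", model[0], "held-out accuracy:", accuracy)
```

Routing each record to a cluster before classification means that the classifier only has to separate the classes within a group of similar records, which is the intuition behind the reduced intra-class variability mentioned above. The accuracy printed by this sketch says nothing about the reported results, since both the data and the classifier differ from those used in the paper.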