Diabetes Prediction Using Hybrid Supervised and Unsupervised Techniques Based on PIMA Dataset

Authors

  • Ahmad Adel Abu-Shareha, Department of Data Science and Artificial Intelligence, Al-Ahliyya Amman University, Amman, Jordan (https://orcid.org/0000-0002-2374-3152)
  • Mosleh Abualhaj, Department of Networks and Information Security, Al-Ahliyya Amman University, Amman, Jordan
  • Abdelrahman H. Hussein, Department of Networks and Information Security, Al-Ahliyya Amman University, Amman, Jordan
  • Amal Amer, Department of Data Science and Artificial Intelligence, Al-Ahliyya Amman University, Amman, Jordan (https://orcid.org/0009-0005-4023-8691)
  • Anusha Achuthan, School of Computer Sciences, Universiti Sains Malaysia, Gelugor, Penang, Malaysia (https://orcid.org/0000-0002-2015-2269)
  • Alfian Abdul Halin, School of Computer Science and Information Technology, Universiti Putra Malaysia, Serdang, Selangor, Malaysia

DOI:

https://doi.org/10.37965/jait.2025.0899

Keywords:

classification, clustering, diabetes prediction

Abstract

Diabetes prediction using machine learning remains challenging because available medical datasets are small and inherently imbalanced. This paper presents a hybrid framework that combines supervised and unsupervised machine learning techniques to improve the accuracy and robustness of early diabetes prediction. The framework integrates clustering, feature selection, and classification to enhance predictive performance on small-scale medical datasets, specifically the PIMA Indian Diabetes Dataset. Feature selection using Mutual Information reduces computational complexity while maintaining discriminative power. The unsupervised clustering component groups similar patient records to reduce intra-class variability, improving class separability for the subsequent supervised learning stage. Thirteen classifiers are evaluated to compare model performance under clustered and non-clustered settings: Support Vector Machine, K-Nearest Neighbors, Decision Tree, Random Forest (RF), Neural Networks, Adaptive Boosting, Gaussian Naïve Bayes, Quadratic Discriminant Analysis, Skope Rules, eXtreme Gradient Boosting (XGB), Gradient Boosting, Deep Neural Network, and Logistic Regression. Experimental results show that ensemble-based classifiers, particularly RF and XGB, achieve the highest accuracy, precision, recall, and area under the curve (AUC) scores across two optimized clusters, confirming that integrating clustering and feature selection substantially improves the robustness of diabetes prediction models. The proposed framework achieved 88.5% accuracy, 0.836 precision, 0.836 recall, 0.836 F-measure, and 0.874 AUC with the RF classifier, and 88.5% accuracy, 0.838 precision, 0.832 recall, 0.835 F-measure, and 0.873 AUC with the XGB classifier.
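
The Mutual Information feature-selection step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes features that have already been discretized into bins (the paper's estimator and preprocessing are not described on this page), and the feature names `glucose_high` and `random_flag`, the toy rows, and the helper functions are all hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y), in nats, of two discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * log(c * n / (count_x[x] * count_y[y]))
    return mi


def rank_features(rows, labels, names):
    """Score each feature column against the labels; return (name, score) pairs, best first."""
    scores = []
    for i, name in enumerate(names):
        column = [row[i] for row in rows]
        scores.append((name, mutual_information(column, labels)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: each row is (glucose_high, random_flag), already binned to 0/1.
names = ["glucose_high", "random_flag"]
rows = [(1, 0), (1, 1), (0, 0), (0, 1)]
labels = [1, 1, 0, 0]  # 1 = diabetic, 0 = non-diabetic (illustrative only)

ranking = rank_features(rows, labels, names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy example the first feature determines the label exactly, so its score equals the label entropy, while the second feature is independent of the label and scores zero. Features with the highest scores would be kept and passed on to the clustering and classification stages.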

Published

2025-11-23

How to Cite

Abu-Shareha, A. A., Abualhaj, M., Hussein, A. H., Amer, A., Achuthan, A., & Halin, A. A. (2025). Diabetes Prediction Using Hybrid Supervised and Unsupervised Techniques Based on PIMA Dataset. Journal of Artificial Intelligence and Technology. https://doi.org/10.37965/jait.2025.0899

Section

Research Articles