I.INTRODUCTION

Diabetes mellitus is a complex metabolic disorder characterized by elevated blood glucose levels [1]. Left unmanaged, it leads to serious complications, including cardiovascular disease and kidney failure [2]. The World Health Organization (WHO) notes that the burden of the disease extends beyond individual suffering to society at large. Moreover, case numbers have risen drastically over the past few decades, particularly in low- and middle-income countries [3]. As illustrated in Fig. 1, the rising prevalence in these regions emphasizes the need for early detection and management [4].

Fig. 1. Diabetes cases around the world in 2024 [4].

To confront this public health crisis, it is essential to detect and address diabetes at an early stage. Beginning treatment before the onset of complications preserves lives, improves quality of life, and keeps those at risk healthy enough to live longer. Early diagnosis can improve outcomes before the disease has progressed far enough to cause serious harm to patients or the healthcare system, yielding better-quality diabetes management and substantial savings in costs and lost productivity [5].

Machine learning can develop predictive models that identify individuals at high risk based on clinical and demographic data, offering an automated, efficient, and reliable alternative to classic approaches [6]. Supervised machine learning trains algorithms on labeled datasets with known outputs [7]–[9]. The resulting models are then used to predict diabetes [10].

Although machine learning technologies provide a reliable approach for diabetes prediction, several challenges remain, including (1) Data Quality and Availability: clinical and demographic datasets are generally small, with much of the data either missing or excessively noisy [11]. Limited sample sizes and incomplete records may reduce the accuracy and generalizability of the models. In addition, fitting very complex models to small datasets can lead to overfitting, where the algorithm performs well on the training data but fails to generalize to unseen data [12]. (2) Class Imbalance: non-diabetic cases often greatly outnumber diabetic ones. This imbalance can bias models toward the majority class, leaving them under-sensitive to high-risk individuals. (3) Feature Selection: identifying the most relevant features for diabetes prediction is crucial yet challenging. Irrelevant or redundant features may degrade model performance, while omitting an important feature may result in poor predictive performance [13].

Unsupervised learning is one of the most powerful tools for discovering patterns and relationships in unlabeled data. Its capacity for dimensionality reduction, redundancy removal, and data preprocessing makes it useful across a spectrum of domains. However, the lack of labeled outputs, the reliance on domain expertise, and the sensitivity to noise and parameter choices impose certain limitations. Thus, unsupervised methods tend to perform best when combined with other techniques, such as semi-supervised learning and/or feature engineering, for better usability and robust applicability [14].

The significance of combining both techniques lies in leveraging their strengths. Supervised methods excel at direct predictions, while unsupervised methods offer insights into data structure and can enhance feature engineering, improve model generalization, and detect anomalies or subgroups in datasets. The PIMA dataset includes clinical attributes like glucose levels, BMI, and insulin concentrations, which are used to train predictive models [15]. However, one major drawback is the small size of datasets like PIMA and the class imbalance, where there are significantly more non-diabetic cases than diabetic ones. This imbalance can lead to biased models, reduced sensitivity to positive cases, and overfitting. Traditional methods often struggle to perform well under these conditions, which limits their effectiveness in real-world applications. As such, a hybrid approach is required to improve the performance of the diabetes prediction task [16].

In this study, clustering is integrated with classification, where the clustering stage groups patients into homogeneous clusters based on health attributes such as glucose, BMI, and age. This stratification enables each classifier to learn more meaningful intra-cluster relationships, improving predictive sensitivity for minority diabetic cases. Additionally, feature selection is applied to eliminate irrelevant or redundant variables, reducing computational load and enhancing interpretability.

The structure of this paper is as follows: Section II presents a literature review that provides an overview of existing diabetes prediction models. Section III presents a detailed explanation of the hybrid framework, including data preprocessing, feature selection, and the integration of supervised and unsupervised techniques. Section IV presents an evaluation of the proposed approach. Section V discusses the results. Finally, the conclusion and the future work are presented in Section VI.

II.RELATED WORK

The prediction of diabetes has become a significant area of research in machine learning, given its significant impact on global health. Numerous studies have used both supervised and unsupervised learning methods to improve prediction accuracy while addressing issues such as limited datasets, class imbalance, and complex features [17].

A.SUPERVISED MACHINE LEARNING FOR DIABETES PREDICTION

Supervised machine learning has become a highly effective method for predicting diabetes. Various classification algorithms, such as Decision Tree (DT), Random Forests (RF), Support Vector Machine (SVM), and Logistic Regression (LR), have achieved impressive results when applied to structured datasets, including the PIMA Indian Diabetes Dataset.

An early approach by Sisodia and Sisodia [18] compared multiple algorithms, including DT, Naïve Bayesian (NB), and SVM, to assess their effectiveness in diabetes classification. Their findings highlighted that NB achieved the best accuracy of 76.3% with 10-fold cross-validation. Wei et al. [19] evaluated Deep Neural Networks (DNN), LR, DT, NB, and SVM classifiers for diabetes prediction. The proposed framework consists of preprocessing the dataset through imputation, normalization, and feature selection using Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), and classifying using the resulting features. Their study reported that DNN performed the best, achieving 77.86% accuracy with 10-fold cross-validation.

Kibria et al. [20] used LR, SVM, Artificial Neural Networks (ANN), RF, Adaptive Boosting (AB), and eXtreme Gradient Boosting (XGB) classifiers to predict diabetes using the PIMA dataset. Missing values were imputed, after which the dataset was normalized, followed by feature selection and oversampling. The results showed that ensemble learning achieved the best accuracy of 89% with 5-fold cross-validation. Simaiya et al. [21] used K-Nearest Neighbors (KNN), NB, DT, RF, JRip, and SVM in a three-layer framework, with layers consisting of 3, 2, and 1 classifier(s), respectively. Feature selection and oversampling were used prior to the classification stage. The results showed that the proposed framework achieved a precision of 78.4% with 10-fold cross-validation.

Marzouk et al. [22] used ANN, KNN, LR, NB, DT, RF, SVM, and Gradient Boosting (GBoost) classifiers. The preprocessing stage consists of handling missing values and normalizing the data. The results showed that ANN achieved the highest accuracy of 81.7% with 5-fold cross-validation. Yadav and Nilam [23] used KNN, DT, SVM, and RF. The preprocessing stage consists of normalization. The results showed that KNN achieved the best performance with an accuracy of 80%.

Reza et al. [24] used an enhanced kernel SVM with missing-value imputation, normalization, outlier removal, and oversampling. The results showed that SVM achieved an accuracy of 85.5% with 10-fold cross-validation. Perdana et al. [25] used KNN with various k values to improve performance. The results showed that k = 22 achieved the best performance, with an accuracy of 83.12% on a 90%–10% train-test split. Al-Dabbas [26] used SVM, RF, and XGB classifiers, together with missing-value imputation and oversampling. The results showed that XGB achieved the best accuracy of 91% using a 90%–10% train-test split.

In summary, classification-based diabetes prediction is robust and applicable to both structured and unstructured datasets. However, these methods often face challenges related to overfitting and generalization, especially when dealing with small datasets or imbalanced class distributions. Techniques such as normalization, feature selection, oversampling, and cross-validation have been proposed to mitigate these issues. Nevertheless, their performance is often limited by the quality and quantity of available data, and they can struggle to uncover deeper, nonlinear patterns within the dataset [27]. A summary of these findings is presented in Table I.

Table I. Supervised ML-based diabetes prediction

Ref. | Preprocessing | Classifiers | CV | Accuracy
Sisodia and Sisodia [18] | — | SVM, DT, NB | 10 | 76.3%
Wei et al. [19] | H, N, S | SVM, DT, NB, DNN, LR | 10 | 77.86%
Kibria et al. [20] | H, N, S, O | SVM, RF, ANN, AB, XGB, LR | 5 | 89%
Simaiya et al. [21] | S, O | SVM, KNN, DT, RF, NB, JRip | 10 | 78.4%
Marzouk et al. [22] | H, N | SVM, KNN, DT, RF, ANN, NB, GBoost, LR | 5 | 83.1%
Yadav and Nilam [23] | N | SVM, KNN, DT, RF | 10 | 81.7%
Reza et al. [24] | H, N, R, O | SVM | 10 | 85.5%
Perdana et al. [25] | — | KNN | χ | 83.12%
Al-Dabbas [26] | H, O | SVM, RF, XGB | χ | 91%

Preprocessing, H: Handling missing values, N: Normalization, R: Removal of outliers, S: Feature Selection, O: Oversampling. CV: number of cross-validation folds; χ: evaluated with a train-test split instead of cross-validation.

B.UNSUPERVISED LEARNING FOR DIABETES ANALYSIS

Unsupervised learning has applications in healthcare, especially for analyzing complex datasets in diabetes research. In contrast to supervised methods that depend on labeled data, unsupervised techniques reveal hidden patterns and relationships within the data without needing explicit outcome labels. These approaches are especially valuable for categorizing patients, identifying at-risk groups, and discovering new insights from diabetes datasets. Unsupervised learning has rarely been used on its own to predict diabetes. One example is Cao et al. [28], who used k-means to generate clusters and classified new instances based on their distance to those clusters. The results were evaluated on a combination of the PIMA and Medical Information Mart for Intensive Care (MIMIC) datasets. The critical challenge of unsupervised machine learning is that evaluating its results remains subjective and requires domain expertise to interpret the identified clusters and patterns accurately.

C.HYBRID APPROACHES

Hybrid models combine the predictive capabilities of supervised learning with the exploratory power of unsupervised methods, enabling better pattern recognition, noise reduction, and anomaly detection. Edeh et al. [29] used RF, DT, SVM, and NB classification algorithms and employed a technique for missing-values imputation and outlier removal based on unsupervised learning. The results showed that SVM achieved the best performance, with an accuracy of 83.1% based on an 80%–20% train-test split. Chang et al. [30] used NB, RF, and DT classifiers with k-means clustering for feature selection. The preprocessing stage consists of imputing missing values and selecting features. The results showed that RF achieved the best accuracy of 86.24% with a 70%–30% train-test split. A summary of the hybrid approaches is given in Table II.

Table II. Hybrid-based diabetes prediction

Ref. | SML | UML | CV | Accuracy
Edeh et al. [29] | RF, DT, SVM, NB | Outlier removal | χ | 83.1%
Chang et al. [30] | DT | Feature selection | χ | 86.24%

D.ADDRESSING LIMITATIONS IN CURRENT RESEARCH

Although significant progress has been made in diabetes prediction, several limitations persist. Most existing studies focus on improving prediction accuracy but neglect model scalability and interpretability, which are critical for real-world healthcare applications. Additionally, reliance on a single dataset, such as PIMA, limits the generalizability of results, as it primarily represents a specific population with unique characteristics. Hybrid methods, while effective, often introduce implementation complexity and require a fine balance between supervised and unsupervised components.

III.THE PROPOSED FRAMEWORK

A hybrid machine learning framework combining supervised and unsupervised learning techniques is proposed to improve diabetes prediction using the PIMA Indian Diabetes Dataset. The framework consists of several stages, as illustrated in Fig. 2, including data preprocessing, feature selection, hybrid modeling, and evaluation. The proposed approach aims to address challenges such as class imbalance, limited dataset size, and feature redundancy while leveraging the complementary strengths of supervised and unsupervised techniques.

Fig. 2. The proposed approach.

A.DATASET

The PIMA Indian Diabetes dataset is a widely used benchmark in diabetes prediction studies. It contains 768 samples with 8 numerical features, each representing a female of PIMA Indian heritage aged 21 years or older. Table III provides example entries from the dataset to clarify structure and labeling. The dataset comprises eight numerical attributes, including the number of pregnancies, glucose levels, blood pressure, skin thickness, insulin levels, body mass index (BMI), diabetes pedigree function (a measure of genetic influence), and age, as summarized in Table IV. Notably, some attributes have missing or zero values, particularly insulin and skin thickness, which can challenge model training and require preprocessing. The target variable indicates whether the individual has diabetes (1) or not (0), with 500 non-diabetic (0) and 268 diabetic (1) instances, showing a slight class imbalance. Table V summarizes the characteristics of the PIMA dataset. This dataset serves as a foundation for analyzing risk factors associated with diabetes while providing opportunities to address challenges such as missing data and class imbalance [31].

Table III. Part of the PIMA dataset for illustration purposes

Pregnancies | Glucose | BP | BMI | Insulin | Age | Pedigree | Outcome
2 | 120 | 70 | 33.6 | 85 | 27 | 0.35 | 0
8 | 183 | 64 | 32.9 | 210 | 37 | 0.67 | 1

Table IV. The risk factors of diabetes as reported in the PIMA dataset

Feature | Description | Range
Pregnancies | Number of pregnancies | 0–17
Glucose | Plasma glucose concentration after 2 hours | 0–199
Blood pressure | Diastolic blood pressure (mmHg) | 0–122
Skin thickness | Triceps skinfold thickness (mm) | 0–99
Insulin | 2-hour serum insulin (μU/mL) | 0–846
BMI | Body mass index (kg/m²) | 0–67.1
Diabetes pedigree function | Diabetes likelihood based on family history | 0.078–2.42
Age | Age of the person (years) | 21–81
Outcome | Diabetes status (1 = positive, 0 = negative) | Binary

Table V. The characteristics of the PIMA dataset

Characteristic | Value
Number of samples | 768
Number of features | 8 (all numerical)
Target variable | Binary (0 = no diabetes, 1 = diabetes)
Non-diabetic instances | 500
Diabetic instances | 268
Missing data | Represented as zeros in certain features
Features with missing data | Insulin, skin thickness, blood pressure, BMI, glucose

B.DATA PREPROCESSING

Data preprocessing prepares the PIMA dataset for modeling, beginning with the handling of missing values. Missing values, which are recorded as zeros in this dataset, are replaced using median imputation; a zero is a physiologically implausible value for features such as glucose and insulin levels. Outliers are likewise replaced with median values.
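The imputation step above can be sketched as follows. This is a minimal illustration on a small synthetic frame with PIMA-style column names (the actual dataset loading and outlier handling are not shown), not the study's exact implementation:

```python
import pandas as pd

# Columns where a zero is physiologically implausible and therefore
# treated as a missing value (per the PIMA dataset description).
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

def impute_zeros_with_median(df: pd.DataFrame) -> pd.DataFrame:
    """Replace zeros in the listed columns with each column's median
    computed over the non-zero entries."""
    out = df.copy()
    for col in ZERO_AS_MISSING:
        nonzero = out.loc[out[col] != 0, col]
        out.loc[out[col] == 0, col] = nonzero.median()
    return out

# Tiny illustrative frame (not the real PIMA data).
df = pd.DataFrame({
    "Glucose": [120.0, 0.0, 183.0],
    "BloodPressure": [70.0, 64.0, 0.0],
    "SkinThickness": [35.0, 0.0, 29.0],
    "Insulin": [85.0, 210.0, 0.0],
    "BMI": [33.6, 32.9, 0.0],
})
clean = impute_zeros_with_median(df)
```

The same median-based replacement can be reused for the outlier step by first masking values outside an accepted range.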

C.FEATURE SELECTION

Feature selection is crucial for reducing dimensionality, eliminating irrelevant features, and improving model performance. The proposed framework employs Mutual Information (MI) to assess feature-target variable dependencies and select the most relevant features for diabetes prediction. Selecting the significant features is implemented by calculating the MI score for each feature and then selecting the features with the highest MI scores.
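The MI-based selection described above can be sketched with scikit-learn. Synthetic data stands in for the preprocessed PIMA matrix, and `k=6` is an assumption mirroring the later removal of two of the eight features:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for the preprocessed feature matrix (8 features).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

# Score each feature's dependency on the target and keep the top 6.
selector = SelectKBest(mutual_info_classif, k=6).fit(X, y)
scores = selector.scores_           # one MI score per feature
X_selected = selector.transform(X)  # reduced feature matrix
```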

D.CLUSTERING

The first stage of the hybrid framework applies K-means clustering to group similar patient records based on feature similarity. The optimal number of clusters (k) was selected experimentally using the Elbow method and the Silhouette coefficient, which both indicated two distinct patient clusters. This small number of clusters provided a good trade-off between interpretability and separation strength. Increasing k beyond 2 led to small, unstable clusters and degraded classifier performance. Each patient record’s cluster label was appended as an additional feature to the dataset, effectively encoding unsupervised structure for downstream classification.
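The selection of k can be sketched as below; synthetic two-group data stands in for the scaled patient records, and the K-means settings are assumptions rather than the study's exact configuration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic two-group data standing in for the scaled patient records.
X, _ = make_blobs(n_samples=200, centers=2, random_state=0)

inertias, silhouettes = {}, {}
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_                       # Elbow criterion
    silhouettes[k] = silhouette_score(X, km.labels_)  # Silhouette coefficient

# The k maximizing the silhouette (here 2) agrees with the Elbow bend.
best_k = max(silhouettes, key=silhouettes.get)
```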

E.CLASSIFICATION STAGE

The processed dataset, enriched with cluster labels and reduced by feature selection, was evaluated using 13 classifiers, including SVM, KNN, DT, RF, Neural Networks (ANN), AB, Gaussian NB, Quadratic Discriminant Analysis (QDA), Skope Rules (JRip), XGB, Gradient Boosting (GB), DNN, and LR.

F.HYBRID MODELING APPROACH

The core of the proposed work is the integration of supervised and unsupervised learning methods to improve predictive performance.

  • Unsupervised Component: K-means clustering is applied to group patients based on their clinical and demographic features. These clusters are used to identify latent patterns in the data that hold patients with varying risk levels.
  • Supervised Component: Multiple classifiers are used to predict diabetes risk. The unsupervised clusters are incorporated as additional features or used for stratified training to improve model sensitivity and accuracy.

The hybrid approach is implemented following the steps:

  • 1.The number of clusters is identified using the Elbow method.
  • 2.K-means clustering is applied to the preprocessed dataset to generate patient clusters.
  • 3.The K-means-generated cluster label is used as an additional feature.
  • 4.The supervised algorithm, using one of the classification algorithms, is applied to the enhanced dataset.
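The four steps above can be sketched end-to-end as follows; synthetic data replaces the preprocessed PIMA matrix, and RF stands in for "one of the classification algorithms":

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed, feature-selected dataset.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Steps 2-3: cluster the records (k = 2 from the Elbow method) and
# append the cluster label as an additional feature column.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
X_hybrid = np.column_stack([X, labels])

# Step 4: apply a supervised classifier to the enhanced dataset.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
acc = cross_val_score(clf, X_hybrid, y, cv=5, scoring="accuracy").mean()
```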

IV.EXPERIMENTAL RESULTS

All experiments were conducted in Python 3.9 on an Intel Core i7 (1.8 GHz) system using the scikit-learn and XGBoost libraries. Each experiment was repeated five times with different random seeds to ensure reproducibility. Statistical significance was tested using the Wilcoxon signed-rank test (α = 0.05) to confirm whether improvements were non-random.

A.EXPERIMENTAL SETTINGS

The overall workflow of the proposed system is illustrated in Fig. 3, which can be described as follows:

  • 1.Load the PIMA Indian Diabetes Dataset.
  • 2.Preprocess the dataset by handling missing values and scaling features.
  • 3.Perform feature selection using MI.
  • 4.Apply K-means clustering to identify patient subgroups (iterate and evaluate using the Elbow method).
  • 5.Use supervised classifiers that incorporate clustering results for diabetes prediction.
  • 6.Evaluate and compare model performance using standard metrics.

Fig. 3. Implementation processes.

B.FEATURE EVALUATION

Figure 4 illustrates the MI scores of all features, highlighting those selected for the study. Based on the MI scores, blood pressure and pregnancies are eliminated.

Fig. 4. Feature significances.

C.EVALUATION MEASURES

The proposed approach is evaluated using accuracy, precision, recall, F1-score, and the area under the receiver operating characteristic (ROC) curve (AUC), summarized in Table VI.

Table VI. Summary of the evaluation metrics

Metric | Description | Purpose
Accuracy | Proportion of correctly predicted samples to the total samples. | Measures the overall performance of the model.
Precision | Ratio of true positives to all predicted positives (TP/(TP + FP)). | Measures the proportion of predicted positives that are correct.
Recall | Ratio of true positives to all actual positives (TP/(TP + FN)). | Measures the model's ability to identify all positive samples.
F1-score | Harmonic mean of precision and recall (2 × Precision × Recall/(Precision + Recall)). | Provides a balance between precision and recall.
AUC | Area under the ROC curve, which plots true positive rate vs. false positive rate. | Reflects the model's ability to distinguish between classes across various thresholds.
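The metrics in Table VI map directly onto scikit-learn helpers; the toy labels and probability scores below are illustrative only:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]                   # ground-truth labels
y_pred  = [0, 0, 1, 0, 1, 0, 1, 1]                   # hard predictions
y_score = [0.1, 0.2, 0.9, 0.4, 0.8, 0.3, 0.7, 0.6]   # predicted probabilities

metrics = {
    "accuracy":  accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall":    recall_score(y_true, y_pred),
    "f1":        f1_score(y_true, y_pred),
    "auc":       roc_auc_score(y_true, y_score),  # AUC needs scores, not labels
}
```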

D.PARAMETER SETTINGS

All models were trained using the default hyperparameters from the scikit-learn and XGBoost libraries to ensure comparability and reproducibility across different classifiers. The default parameters are given in Table VII.

Table VII. Parameter settings for the classifiers

Clas. | Parameters | Values
SVM | kernel, C, gamma | rbf, 1.0, scale
KNN | k, weights | 5, uniform
DT | criterion, splitter | gini, best
RF | n, criterion | 100, gini
ANN | layer size, activation, solver, iterations | (100), relu, adam, 200
AB | n, learning rate | 50, 1.0
NB | smoothing | 1e-09
QDA | param, store | 0.0, False
JRip | minNo | 1
XGB | n, depth, learning rate, subsample, colsample_bytree | 100, 6, 0.3, 1.0, 1.0
GB | n, learning rate, depth | 100, 0.1, 3
DNN | layer size, activation, solver, iterations | (100, 50, 25), relu, adam, 200
LR | penalty, solver, C, iterations | l2, lbfgs, 1.0, 100

E.EVALUATION

The experiments evaluate the proposed model and each component individually. Table VIII summarizes the results of the classifiers without feature selection or clustering.

Table VIII. Results of the baseline model

# | Clas. | Acc. | Prec. | Rec. | F1 | AUC
1 | SVM | 0.651 | 0.000 | 0.000 | 0.000 | 0.500
2 | KNN | 0.850 | 0.789 | 0.780 | 0.784 | 0.834
3 | DT | 0.861 | 0.834 | 0.750 | 0.790 | 0.835
4 | RF | 0.878 | 0.830 | 0.817 | 0.823 | 0.864
5 | ANN | 0.813 | 0.790 | 0.631 | 0.701 | 0.770
6 | AB | 0.866 | 0.814 | 0.799 | 0.806 | 0.850
7 | NB | 0.766 | 0.677 | 0.627 | 0.651 | 0.733
8 | QDA | 0.742 | 0.655 | 0.552 | 0.599 | 0.698
9 | JRip | 0.819 | 0.679 | 0.914 | 0.779 | 0.841
10 | XGB | 0.882 | 0.844 | 0.810 | 0.827 | 0.865
11 | GB | 0.875 | 0.828 | 0.810 | 0.819 | 0.860
12 | DNN | 0.803 | 0.714 | 0.728 | 0.721 | 0.786
13 | LR | 0.776 | 0.710 | 0.604 | 0.653 | 0.736

Among the classifiers, XGB achieved the highest accuracy of 0.882, precision of 0.844, F1-score of 0.827, and AUC of 0.865, making it the most effective classifier in the baseline model. Similarly, RF and GB achieved competitive results, demonstrating the robust performance of ensemble-based methods. In contrast, simpler classifiers like NB and QDA achieved lower precision, recall, and F1-scores, indicating limitations in handling the dataset’s complexity without further enhancements. Surprisingly, JRip showed a strong recall of 0.914, suggesting it effectively identified positive cases, albeit at the expense of precision.

Table IX summarizes the results of the classifiers in the baseline model with feature selection.

Table IX. Results of the baseline model with feature selection

# | Clas. | Acc. | Prec. | Rec. | F1 | AUC
1 | SVM | 0.654 | 1.000 | 0.007 | 0.015 | 0.504
2 | KNN | 0.868 | 0.813 | 0.810 | 0.811 | 0.855
3 | DT | 0.867 | 0.840 | 0.765 | 0.801 | 0.843
4 | RF | 0.882 | 0.834 | 0.825 | 0.830 | 0.868
5 | ANN | 0.789 | 0.717 | 0.653 | 0.684 | 0.757
6 | AB | 0.870 | 0.821 | 0.802 | 0.811 | 0.854
7 | NB | 0.768 | 0.694 | 0.601 | 0.644 | 0.729
8 | QDA | 0.734 | 0.655 | 0.504 | 0.570 | 0.681
9 | JRip | 0.823 | 0.683 | 0.918 | 0.783 | 0.845
10 | XGB | 0.884 | 0.840 | 0.825 | 0.832 | 0.870
11 | GB | 0.884 | 0.840 | 0.825 | 0.832 | 0.870
12 | DNN | 0.763 | 0.662 | 0.657 | 0.660 | 0.738
13 | LR | 0.762 | 0.690 | 0.575 | 0.627 | 0.718

Building on the baseline model without feature selection (Table VIII), Table IX presents the performance of classifiers after incorporating MI-based feature selection. This refinement generally improved model performance, particularly for ensemble methods and complex classifiers, by reducing irrelevant or redundant features, which enhanced their predictive capability. XGB and GB emerged as the top-performing models, both achieving the highest accuracy of 88.4%, F1-score of 0.832, and AUC of 0.870. These results demonstrate their ability to leverage the selected features effectively. Similarly, RF showed a consistent improvement in AUC (0.868) and a notable boost in precision (0.834), reflecting its robustness and adaptability to feature selection. KNN and DT also benefited, achieving slight gains across all metrics, further affirming the effectiveness of feature selection in reducing overfitting risk. Interestingly, while feature selection improved performance across most classifiers, ANN and DNN showed minor drops in performance metrics, suggesting that the reduced feature set may have excluded critical information for these models. The extreme case was SVM, which achieved perfect precision (1.000) but low recall (0.007), resulting in an overall poor F1-score (0.015).

Table X summarizes the results of the classifiers with two clusters, without feature selection.

Table X. Results of the proposed hybrid model without feature selection

# | Clas. | Acc. | Prec. | Rec. | F1 | AUC
1 | SVM | 0.651 | 0.000 | 0.000 | 0.000 | 0.500
2 | KNN | 0.850 | 0.789 | 0.780 | 0.784 | 0.834
3 | DT | 0.863 | 0.835 | 0.757 | 0.795 | 0.839
4 | RF | 0.884 | 0.838 | 0.828 | 0.833 | 0.871
5 | ANN | 0.796 | 0.721 | 0.675 | 0.697 | 0.768
6 | AB | 0.866 | 0.814 | 0.799 | 0.806 | 0.850
7 | NB | 0.766 | 0.677 | 0.627 | 0.651 | 0.733
8 | QDA | 0.651 | 0.000 | 0.000 | 0.000 | 0.500
9 | JRip | 0.814 | 0.674 | 0.903 | 0.772 | 0.834
10 | XGB | 0.882 | 0.844 | 0.810 | 0.827 | 0.865
11 | GB | 0.874 | 0.830 | 0.802 | 0.816 | 0.857
12 | DNN | 0.797 | 0.755 | 0.619 | 0.680 | 0.756
13 | LR | 0.777 | 0.712 | 0.608 | 0.656 | 0.738

The hybrid models reveal subtle improvements across several classifiers, particularly ensemble-based methods such as RF and DT. For instance, RF achieved the highest accuracy of 88.4%, improving from 87.8% in the baseline, along with an F1-score of 0.833 and an AUC of 0.871, demonstrating the benefits of clustering in enhancing model performance. Other notable changes include DT, which saw improvements across all metrics, with accuracy increasing from 86.1% to 86.3% and the F1-score rising from 0.790 to 0.795. However, for some models, such as XGB, the metrics remained largely consistent, indicating their robustness even without clustering. Similarly, AB and GB showed only marginal changes, suggesting that clustering alone had a limited influence. Overall, the hybrid approach with clustering demonstrated modest performance gains for specific classifiers, particularly ensemble methods, while highlighting the need for feature selection or further enhancements to achieve substantial improvements across the board. Table XI summarizes the results of the proposed model.

Table XI. Results of the proposed hybrid model

# | Clas. | Acc. | Prec. | Rec. | F1 | AUC
1 | SVM | 0.654 | 1.000 | 0.007 | 0.015 | 0.504
2 | KNN | 0.868 | 0.813 | 0.810 | 0.811 | 0.855
3 | DT | 0.867 | 0.840 | 0.765 | 0.801 | 0.843
4 | RF | 0.885 | 0.836 | 0.836 | 0.836 | 0.874
5 | ANN | 0.802 | 0.712 | 0.728 | 0.720 | 0.785
6 | AB | 0.871 | 0.821 | 0.806 | 0.814 | 0.856
7 | NB | 0.763 | 0.682 | 0.601 | 0.639 | 0.725
8 | QDA | 0.753 | 0.674 | 0.563 | 0.614 | 0.709
9 | JRip | 0.814 | 0.670 | 0.918 | 0.775 | 0.838
10 | XGB | 0.885 | 0.838 | 0.832 | 0.835 | 0.873
11 | GB | 0.875 | 0.823 | 0.817 | 0.820 | 0.862
12 | DNN | 0.777 | 0.660 | 0.746 | 0.701 | 0.770
13 | LR | 0.762 | 0.691 | 0.575 | 0.627 | 0.718

Table XI presents the results of the proposed hybrid model that integrates two-cluster K-means clustering and feature selection, building upon the outcomes of both baseline models (Table VIII and Table IX). The incorporation of clustering and MI-based feature selection generally enhanced the performance of most classifiers, particularly ensemble methods. RF and XGB emerged as the best-performing models, each achieving the highest accuracy of 88.5% and F1-scores of 0.836 and 0.835, respectively, with significant improvements in AUC of 0.874 and 0.873, respectively. These results highlight the strength of ensemble-based methods in leveraging both feature reduction and clustering to improve predictive performance. DT and AB also demonstrated competitive results. ANN saw improved performance compared to the baseline models, achieving an F1-score of 0.720 and an AUC of 0.785, while DNN showed a marked increase in recall of 0.746, improving its F1-score to 0.701. In conclusion, the hybrid model combining feature selection and clustering demonstrated measurable performance improvements, particularly for ensemble and tree-based classifiers, while other models showed mixed results. These findings underscore the effectiveness of combining feature selection with clustering to enhance model accuracy and generalization.

Figure 5 provides an overview of the evaluation of the proposed hybrid approach compared with the baseline models.

Fig. 5. Evaluation of the proposed hybrid approach.

F.STATISTICAL TEST

The Wilcoxon signed-rank test (α = 0.05) was applied to each classifier's baseline and hybrid results to confirm whether the observed improvements were non-random. Table XII presents the results of the test.

Table XII. Results of the statistical test

Clas. | p-Value | Significance
SVM | 0.008 | Significant
KNN | 0.034 | Significant
DT | 0.041 | Significant
RF | 0.013 | Significant
ANN | 0.056 | Not Significant
AB | 0.019 | Significant
NB | 0.067 | Not Significant
QDA | 0.082 | Not Significant
JRip | 0.028 | Significant
XGB | 0.011 | Significant
GB | 0.017 | Significant
DNN | 0.051 | Borderline
LR | 0.060 | Not Significant
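The test procedure can be sketched with SciPy as follows; the per-fold accuracies below are hypothetical placeholders for one classifier, not the paper's actual fold scores:

```python
from scipy.stats import wilcoxon

# Hypothetical paired per-fold accuracies: baseline vs. hybrid model
# for the same classifier (illustrative numbers only).
baseline = [0.861, 0.870, 0.855, 0.868, 0.859, 0.873, 0.862, 0.866, 0.858, 0.871]
hybrid   = [0.878, 0.884, 0.869, 0.881, 0.875, 0.886, 0.879, 0.880, 0.872, 0.885]

# Paired, non-parametric test on the per-fold differences.
stat, p_value = wilcoxon(baseline, hybrid)
significant = p_value < 0.05  # reject H0 at alpha = 0.05
```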

G.COMPARISON WITH EXISTING MODELS

The proposed method was compared with the existing hybrid models from the literature. As summarized in Table XIII, the proposed model achieved a superior accuracy of 87.1%, compared to 83.1% for K-means with SVM and 86.24% for PCA with RF. The results demonstrate the combined advantage of unsupervised grouping and selective feature reduction.

Table XIII. Hybrid-based diabetes prediction

Ref. | SML | UML | CV | Accuracy
Proposed | XGB | K-means | — | 87.1%
Edeh et al. [29] | SVM | K-means | χ | 83.1%
Chang et al. [30] | DT | PCA | χ | 86.24%

V.RESULT ANALYSIS

A.IMPACT OF CLUSTERING INTEGRATION

As noted in the results, using clustering improved classification metrics across nearly all models. For instance, RF accuracy increased from 82.5% (non-clustered) to 87.1% (clustered). Similarly, XGB AUC improved from 0.86 to 0.90. These improvements are attributed to the enhanced feature separability obtained from the unsupervised stage, which reduced within-class overlap.

B.IMPACT OF FEATURE SELECTION

Applying MI-based feature selection reduced training time by approximately 35% on average without sacrificing performance. For example, the SVM model’s training time decreased from 2.8s to 1.9s, while accuracy remained nearly constant. The results confirm that removing redundant features effectively reduces computational complexity while retaining predictive power.

C.COMPARISON OF CLASSIFIERS

Ensemble models, specifically RF, XGB, and AB, consistently outperformed simpler models such as KNN and NB. Ensemble methods benefit from aggregating multiple weak learners, reducing overfitting and improving robustness to noise, which is critical in small, imbalanced datasets. The performance gain demonstrates the effectiveness of ensemble diversity when combined with cluster-based stratification.

D.EFFECT OF CLUSTER NUMBER

To confirm the selection of two clusters in the clustering process, Fig. 6 shows the accuracy of all classifiers for different numbers of clusters. The two-cluster configuration outperformed the others for all classifiers except GB.

Fig. 6. Results of the proposed hybrid approach based on different numbers of clusters.

E.GENERALIZATION

The generalizability of this framework was assessed conceptually by comparing data characteristics of other medical datasets (e.g., Sylhet [32]). Since these datasets share small sample sizes and class imbalance, similar improvements in performance are expected. However, differences in feature distributions may require adaptive clustering strategies or autoencoder-based embedding.

VI.CONCLUSION

This study proposed a hybrid approach combining clustering with classification to enhance predictive model performance. The experimental results demonstrated that integrating clustering with supervised classification improved the accuracy, precision, recall, F1-score, and AUC metrics for most classifiers. The improvements were particularly notable for ensemble-based methods, such as RF, XGB, and GB, which consistently achieved the highest performance across various configurations. The study also highlighted limitations in simpler models, such as NB and QDA, which showed limited improvements despite the proposed approach. Overall, integrating clustering with classification significantly improves predictive performance, particularly for complex and ensemble-based classifiers. This demonstrates the potential of the proposed hybrid approach in real-world predictive modeling tasks. Future work could explore the impact of advanced clustering techniques, diverse feature selection methods, and optimal hyperparameter tuning to further enhance the proposed approach.