Exploring the Effects of High Dimensionality and Imbalanced Data on Predictive Models for Colon Cancer Detection Using Tree, Rule, and Lazy Learning Techniques
DOI: https://doi.org/10.37965/jait.2024.0581
Keywords: preprocessing, imbalanced, feature selection, classification, colon cancer
Abstract
Analyzing colon cancer data is essential for improving early detection, treatment outcomes, public health initiatives, research efforts, and overall patient care, ultimately leading to better outcomes and a reduced burden associated with this disease. The quality of any disease prediction depends on the quality of the available dataset, so its characteristics should be analyzed before a prediction algorithm is applied. This research presents a comprehensive framework for addressing two characteristics of colon cancer datasets that have posed significant challenges in previous prediction studies: class imbalance and high dimensionality. Both are central concerns of preprocessing. Class balancing adjusts the data points so that each class label is represented in proper proportion, while feature selection identifies the strongest features in the available data space. This study aims to improve the performance of popular tree, rule, and lazy (k-nearest neighbor, KNN) classifiers and the support vector machine (SVM) algorithm after correcting the class imbalance and applying various feature selection methods: chi-square, symmetrical uncertainty, correlation-based feature selection (CFS) subset, and classifier subset evaluators. The proposed framework shows that, after balancing the dataset, all algorithms performed better under every applied feature selection method. Across all methods, JRip records 85.71% accuracy with the classifier subset evaluator, Ridor reaches 84.52% with CFS, J48 produces 83.33% with both CFS and the classifier subset evaluator, Simple CART achieves 84.52% with the classifier subset evaluator, KNN records 91.66% with chi-square and CFS, and SVM produces 92.85% with symmetrical uncertainty.
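The classifier names (J48, JRip, Ridor, Simple CART, CFS) suggest the experiments were run in Weka, and the abstract does not state which balancing technique was used. As a minimal sketch of the described pipeline, assuming SMOTE oversampling, chi-square feature selection, a placeholder dataset, and scikit-learn/imbalanced-learn stand-ins for the KNN and SVM classifiers (not the authors' exact setup):

```python
# Hedged sketch of the balance -> select -> classify pipeline the abstract
# describes. SMOTE, k=50 features, and the placeholder data are assumptions,
# not details taken from the paper.
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # sampler-aware pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Placeholder for a real colon cancer expression matrix:
# rows = samples, columns = gene features, y = binary tumour/normal labels.
rng = np.random.default_rng(0)
X = rng.random((62, 2000))
y = rng.integers(0, 2, 62)

for name, clf in [("KNN", KNeighborsClassifier(n_neighbors=5)),
                  ("SVM", SVC(kernel="linear"))]:
    pipe = Pipeline([
        # 1) Balance class proportions; done inside each CV training fold
        #    (the imblearn Pipeline applies samplers only at fit time),
        #    which avoids leaking synthetic samples into the test folds.
        ("balance", SMOTE(random_state=0)),
        # 2) Chi-square scores require non-negative inputs, hence the scaler.
        ("scale", MinMaxScaler()),
        # 3) Keep only the strongest features from the high-dimensional space.
        ("select", SelectKBest(chi2, k=50)),
        ("clf", clf),
    ])
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```

With this arrangement, swapping in a different feature selection scorer or classifier changes only one pipeline step, mirroring how the study compares several evaluators against the same balanced data.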
License
Copyright (c) 2024 Authors
This work is licensed under a Creative Commons Attribution 4.0 International License.