Exploring the Effects of High Dimensionality and Imbalanced Data on Predictive Models for Colon Cancer Detection Using Tree, Rule, and Lazy Learning Techniques

Swapnali N Tambe; Saiprasad Potharaju; Shanmuk Srinivas Amiripalli; Ravi Kumar Tirandasu; Yogita Algat

doi:10.37965/jait.2024.0581

Authors

Swapnali N Tambe, Department of Information Technology, K. K. Wagh Institute of Engineering Education & Research, Nashik, MH, India
Saiprasad Potharaju, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune 412115, India, https://orcid.org/0000-0002-7511-6855
Shanmuk Srinivas Amiripalli, Department of CSE, GST, GITAM University, Visakhapatnam, AP, India, https://orcid.org/0000-0003-0810-1810
Ravi Kumar Tirandasu, Department of CSE, Koneru Lakshmaiah Education Foundation, Vaddeswaram, AP, India
Yogita Algat, Department of Information Technology, K. K. Wagh Institute of Engineering Education & Research, Nashik, MH, India

Keywords:

preprocessing, imbalanced, feature selection, classification, colon cancer

Abstract

Analyzing colon cancer data is essential for improving early detection, treatment outcomes, public health initiatives, research efforts, and overall patient care, ultimately leading to better outcomes and a reduced disease burden. The prediction of any disease depends on the quality of the available dataset, so it is important to analyze the dataset's characteristics before applying a prediction algorithm. This research presents a comprehensive framework for addressing class imbalance and high dimensionality in colon cancer datasets, both of which have been significant challenges in previous studies on colon cancer prediction. Both issues are handled during preprocessing. Balancing refers to adjusting the number of data points so that each class label is represented in proper proportion. Feature selection is the process of selecting the strongest features from the available feature space. This study aims to improve the performance of popular tree, rule, and lazy (K-nearest neighbor (KNN)) classifiers and the support vector machine (SVM) algorithm by addressing the imbalance issue and applying various feature selection methods: chi-square, symmetrical uncertainty, the correlation-based feature selection (CFS) subset evaluator, and the classifier subset evaluator. The proposed framework shows that, after balancing the dataset, all the algorithms performed better with all applied feature selection methods. Among the reported results, JRip records 85.71% accuracy with the classifier subset evaluator, Ridor 84.52% with CFS, J48 83.33% with both CFS and the classifier subset evaluator, SimpleCart 84.52% with the classifier subset evaluator, KNN 91.66% with chi-square and CFS, and SVM 92.85% with symmetrical uncertainty.
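
The two preprocessing steps named in the abstract, balancing followed by chi-square feature ranking, can be illustrated with a minimal Python sketch. The abstract does not state which balancing technique was used, so random oversampling is an assumption here. The function names and the toy data are hypothetical, and the features are assumed to be discrete already; real gene-expression values would need discretization first.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until every class
    has as many rows as the largest class (assumed balancing method)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def chi_square_score(feature, y):
    """Chi-square statistic of one discrete feature against the class label."""
    n = len(y)
    f_counts = Counter(feature)
    c_counts = Counter(y)
    observed = Counter(zip(feature, y))
    score = 0.0
    for f, fc in f_counts.items():
        for c, cc in c_counts.items():
            expected = fc * cc / n
            score += (observed[(f, c)] - expected) ** 2 / expected
    return score


def top_k_features(X, y, k):
    """Return the indices of the k features with the highest chi-square score."""
    scores = [chi_square_score([row[j] for row in X], y)
              for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


# Hypothetical toy data: 6 samples of class 0, 2 of class 1, 3 binary features.
# Feature 0 matches the label exactly; features 1 and 2 are uninformative.
X = [[0, 1, 0], [0, 0, 1], [0, 1, 1], [0, 0, 0],
     [0, 1, 0], [0, 0, 1], [1, 1, 0], [1, 0, 1]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_bal, y_bal = random_oversample(X, y)
selected = top_k_features(X_bal, y_bal, k=1)
print(Counter(y_bal), selected)
```

In this sketch the balanced data would then be reduced to the selected columns before training a classifier. The other evaluators listed in the abstract (symmetrical uncertainty, CFS, classifier subset) would replace `chi_square_score` with a different scoring or search procedure.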