Published at: Algorithms

A Hybrid Dimensionality Reduction Procedure Integrating Clustering with KNN-Based Feature Selection for Unsupervised Data

Abstract

This paper proposes a novel hybrid approach to data reduction for high-dimensional data that combines unsupervised feature extraction through clustering with unsupervised feature selection. The method first applies K-means clustering for feature extraction: each sample's cluster membership serves as a new feature that captures the inherent characteristics of the data. The K-Nearest Neighbors (KNN) and Random Forest algorithms are then used for supervised feature selection, identifying the most relevant features to enhance model performance. The hybrid approach thereby leverages the strengths of both unsupervised and supervised learning techniques. The new algorithm was applied to 13 different tabular datasets. On 9 of them, both the KNN and Random Forest models showed significant improvements across several performance metrics (accuracy, precision, recall, and F1-score) despite substantial feature reduction. On the remaining 4 datasets, we achieved substantial dimensionality reduction with only negligible decreases in performance. Improving performance while reducing dimensionality highlights the potential of the proposed method within a procedure that treats datasets without prior knowledge or assumptions. The proposed method offers a promising solution for handling high-dimensional data: it enhances model performance while maintaining interpretability and ease of integration within the proposed framework. It can be applied regardless of whether a dataset is designated as supervised or unsupervised, which reduces the dependency on a target or label feature.
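The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how KNN is used to rank features, so scoring each feature by how well a leave-one-out KNN recovers the K-means cluster labels is an assumption made here for illustration. The toy data, the function names `kmeans` and `knn_loo_accuracy`, and all parameter values are hypothetical, and the Random Forest stage is omitted. Only the Python standard library is used; in practice a library such as scikit-learn would be the more natural choice.

```python
import math
import random


def kmeans(rows, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per row."""
    rng = random.Random(seed)
    centers = [list(r) for r in rng.sample(rows, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each row goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(r, centers[c])) for r in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def knn_loo_accuracy(values, labels, n_neighbors=3):
    """Leave-one-out KNN accuracy when predicting labels from one feature."""
    hits = 0
    for i, v in enumerate(values):
        neighbors = sorted(
            (j for j in range(len(values)) if j != i),
            key=lambda j: abs(values[j] - v),
        )[:n_neighbors]
        votes = [labels[j] for j in neighbors]
        predicted = max(set(votes), key=votes.count)  # majority vote
        hits += predicted == labels[i]
    return hits / len(values)


# Hypothetical toy data: feature 0 separates two groups,
# features 1 and 2 carry no group structure.
rows = [
    [1.0, 0.2, 0.5],
    [1.2, 0.1, 0.1],
    [0.8, 0.3, 0.9],
    [1.1, 0.2, 0.3],
    [9.0, 0.2, 0.4],
    [9.2, 0.3, 0.8],
    [8.8, 0.1, 0.2],
    [9.1, 0.2, 0.6],
]

# Stage 1 (feature extraction): cluster membership becomes a new feature.
labels = kmeans(rows, k=2)
extended = [r + [lab] for r, lab in zip(rows, labels)]

# Stage 2 (feature selection, ASSUMED scoring rule): rank each original
# feature by how well KNN recovers the cluster labels from it alone.
scores = [
    knn_loo_accuracy([r[f] for r in rows], labels) for f in range(len(rows[0]))
]
ranking = sorted(range(len(scores)), key=lambda f: scores[f], reverse=True)
selected = ranking[:1]  # keep the top-ranked feature (hypothetical cutoff)

print("cluster labels:", labels)
print("per-feature scores:", scores)
print("selected features:", selected)
```

Because the cluster labels stand in for a target, this sketch needs no label column, which matches the abstract's stated aim of reducing the dependency on a target feature. The number of clusters, the number of neighbors, and the cutoff for how many features to keep are all choices the abstract leaves open.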
