
Published at: SOCO – Soft Computing
The impact of synthetic data rates in imbalanced datasets on convergence characteristics of deep learning networks
Abstract
Applying machine learning to imbalanced datasets can produce undesirably biased results. Preprocessing the dataset can reduce the bias against the minority class, and a common way to implement such preprocessing is data synthesis. In this study, we compare how varying rates of synthetic data, assimilated into the original dataset, affect the training performance of a neural network (NN). The synthetic data was generated by two different algorithms: Conditional Tabular Generative Adversarial Network (CTGAN) and Triple Based Variational Autoencoder (TVAE), an encoder-based data generator. Varying rates of synthetic data were assimilated into two different medical datasets in which the patient’s age attribute is imbalanced. On the first, a birth deliveries dataset, the NN was used to solve a regression problem; on the second, which contains information about Covid-19 patients, it was used to solve a classification problem. From these two original datasets, an additional 26 datasets were derived, with varying rates of synthetic records generated by the two algorithms. The training performance of an NN on these datasets was compared in terms of accuracy, convergence speed, and two novel metrics designed to quantify internal oscillations of the network’s weights during training. The results demonstrate that both the data synthesis method and the proportion of synthetic data significantly affect model accuracy and training dynamics. This study contributes a novel framework for assessing convergence stability under imbalanced conditions using deep generative synthetic data.
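As a rough illustration of what assimilating synthetic records at a given rate could look like, the following Python sketch mixes rows from a pre-generated synthetic pool into an original dataset. It is not the authors' procedure: the function name `mix_synthetic`, the `source` field, and the reading of "rate" as the fraction of synthetic rows in the final dataset are all assumptions made for this example, and the synthetic pool is assumed to come from a generator such as CTGAN or TVAE.

```python
import random


def mix_synthetic(original, synthetic_pool, rate, seed=0):
    """Return the original rows plus enough synthetic rows to reach `rate`.

    Assumption: `rate` is the fraction of synthetic rows in the mixed
    dataset (0 <= rate < 1). The abstract does not define it precisely.
    """
    if not 0 <= rate < 1:
        raise ValueError("rate must be in [0, 1)")
    # Solve n_synth / (n_orig + n_synth) = rate for n_synth.
    n_synth = round(rate * len(original) / (1 - rate))
    if n_synth > len(synthetic_pool):
        raise ValueError("synthetic pool is too small for this rate")
    rng = random.Random(seed)
    sampled = rng.sample(synthetic_pool, n_synth)  # sample without replacement
    mixed = list(original) + sampled
    rng.shuffle(mixed)  # avoid ordering effects during training
    return mixed


# Hypothetical toy rows; real records would carry the dataset's attributes.
original = [{"age": 20 + i % 30, "source": "original"} for i in range(80)]
pool = [{"age": 45 + i % 10, "source": "synthetic"} for i in range(200)]

mixed = mix_synthetic(original, pool, rate=0.2)
n_synthetic = sum(row["source"] == "synthetic" for row in mixed)
```

Calling the function once per rate and per generator would yield a family of derived datasets, each of which can then be used to train the same network for comparison.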