The Impact of Synthetic Data Rates in Imbalanced Datasets on Convergence Characteristics of Deep Learning Networks

A new study by Dr. Sharon Yalov-Handzel, head of Afeka’s M.Sc. program in Intelligent Systems Engineering, and intelligent systems M.Sc. graduate Keren Glickman, recently published in the international Springer journal Soft Computing, examines how the addition of synthetic data affects the accuracy and stability of AI systems that handle imbalanced datasets. This phenomenon is especially typical of the healthcare field: datasets feature numerous examples of common cases but very few examples of rare ones, limiting the ability of AI systems to learn accurately.

The study examined two medical datasets: birth data and COVID-19 testing data. Each dataset was supplemented with varying amounts of synthetic data generated using two common methods in the field. The goal was to see whether this addition improves the system’s learning or degrades it.
The findings pointed to a clear insight: a moderate addition of synthetic data (about 10%–20%) can improve learning quality, but beyond that amount it degrades performance and even destabilizes the results. An additional finding was that the data-synthesis method must be matched to the type of data: continuous values versus categorical values.
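To make the idea of a controlled synthetic-data rate concrete, here is a minimal illustrative sketch in Python. The paper's actual generation methods are not named in this article, so the interpolation-based generator below (a SMOTE-like approach for continuous features) and all names in the code are assumptions for illustration only; the key point is that the amount of synthetic data added is an explicit parameter, kept in the 10%–20% range the study found helpful.

```python
import random

def augment_minority(minority, rate, rng=random.Random(0)):
    # Sketch of a SMOTE-like generator for CONTINUOUS features:
    # each synthetic sample is a linear interpolation between two
    # randomly chosen real minority samples. Categorical features
    # would need a different generator, per the study's finding.
    # `rate` is the fraction of synthetic samples added relative to
    # the existing minority class, e.g. 0.1-0.2 per the findings.
    n_new = int(len(minority) * rate)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic

# Toy minority class with two continuous features (hypothetical data).
minority = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.2], [1.1, 2.1], [1.3, 1.9],
            [1.0, 2.3], [0.8, 2.0], [1.2, 2.2], [1.1, 1.7], [0.9, 1.9]]

# 20% synthetic rate: 2 new samples for 10 originals.
extra = augment_minority(minority, rate=0.2)
print(len(extra))
```

Because the synthetic samples are convex combinations of real ones, they stay inside the observed feature ranges; the study's caution is that pushing `rate` well past this level starts to degrade and destabilize learning rather than improve it.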

The implication: In the world of healthcare and AI, smart and accurate use of synthetic data can improve the ability to handle small cohorts and rare cases, as long as it is used in the right amount and with the appropriate method.

