Information
Paper link: https://arxiv.org/pdf/2202.03579.pdf
Published date: June 8th, 2022
Authors: Ahmad B. Hassanat, Ahmad S. Tarawneh, Ghada A. Altarawneh, Abdullah Almuhaimeed
Purpose
The paper analyzes a large number of oversampling techniques and devises a new oversampling evaluation system based on hiding a number of majority examples and comparing the examples generated by the oversampling process against them. The authors evaluated more than 70 different oversampling methods.
Given the data and methods at hand, the authors argue that oversampling in its current forms and methodologies is unreliable for learning from class-imbalanced data and should be avoided in real-world applications.
The major goal of the study is as follows:
Oversampling in its current forms and methodologies is a misleading approach that should be avoided, since it feeds the learning process with falsified instances that are pushed to be members of the minority class when they are most likely members of the majority class.
Content notes
The minority class usually carries very high value in terms of information, since it represents favorable examples that are rare in nature or expensive to obtain. This holds in Big Data analytics, biometrics, gene profiling, credit card fraud detection, face image retrieval, disease detection, IoT, NLP, and many other domains.
There are several approaches to solving the class imbalance problem before starting classification (a small sketch contrasting three of them follows this list), such as:
- Obtaining more samples from the minority class(es)
- Changing the loss function to assign a higher cost to errors on the minority class
- Oversampling the minority class
- Undersampling the majority class
- Different combinations of the previous approaches
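To make these options concrete, here is a minimal sketch (not from the paper) contrasting cost-sensitive learning, oversampling, and undersampling, assuming the scikit-learn and imbalanced-learn libraries are available:

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Toy imbalanced dataset: roughly 5% minority class.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=42)
print("original:", Counter(y))

# (1) Cost-sensitive learning: weigh minority-class errors more heavily.
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# (2) Oversampling: replicate minority examples until the classes balance.
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)
print("oversampled:", Counter(y_over))

# (3) Undersampling: drop majority examples until the classes balance.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print("undersampled:", Counter(y_under))
```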
Oversampling is the most widely used of these approaches, judging by works published over the last two decades. However, this does not necessarily imply that oversampling is beneficial.
The authors raised different problems when using oversampling methods:
- Results reported in papers often differ from results obtained in practice.
- Synthetic data points generated for the minority class can exist in real life as members of a different class, regardless of how similar they are to the minority's examples, since there are always examples from class A that are the closest to examples from a different class B.
The authors support their counterclaim by applying a number of typical oversampling methods to several benchmark datasets, concealing some of the majority examples, and then comparing the created examples to the hidden majority examples to determine whether they approximately match. Finding such counterexamples substantiates their counterclaim.
Literature review
One of the earliest and most extensively used approaches for class imbalance is the Synthetic Minority Oversampling Technique (SMOTE). It interpolates synthetic examples between nearest neighbors within the training set's collection of minority class instances. That is, a synthetic sample is generated by combining the properties of a seed instance with those of one of its randomly picked k-nearest neighbors.
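Here is a minimal sketch of SMOTE's core interpolation step, assuming purely numeric features; real implementations (e.g. imbalanced-learn's `SMOTE`) add bookkeeping around this same idea:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, n_synthetic, k=5, rng=None):
    """Generate n_synthetic points by interpolating between minority
    examples and their k nearest minority neighbors."""
    if rng is None:
        rng = np.random.default_rng(0)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    # Drop column 0: each point's nearest neighbor is itself.
    neighbors = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))      # random seed example
        j = rng.choice(neighbors[i])      # random minority neighbor of it
        gap = rng.random()                # random position on the segment
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```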
There are several variations of the SMOTE algorithm, as can be seen in one of my published papers here.
Besides SMOTE and its variants, there are also several other oversampling methods:
- Constrained Oversample (CO): this method first extracts the overlapping regions in a dataset. Ant Colony Optimization is then used to define the boundaries of the minority regions. Most significantly, in order to create a balanced dataset, fresh samples are synthesized via oversampling under constraints. This method differs from others in that it incorporates noise-reduction constraints into the oversampling process. [1]
- Majority Weighted Minority Oversampling Technique (MWMOTE): addresses class imbalance by identifying difficult-to-learn informative minority class samples, weighting them based on their distance from majority class samples, and generating synthetic samples from them using clustering. It ensures all generated samples belong to minority class clusters and outperforms or matches other methods in evaluations. [2]
- Adaptive Synthetic sampling (ADASYN): uses a weighted distribution over the minority class examples according to their learning difficulty, so that more synthetic data is created for harder minority class examples than for easier ones (a sketch of this weighting step follows the list). The efficacy of this method is demonstrated by experiments on a variety of datasets using five different evaluation measures. [3]
- Synthetic Minority Over-Sampling Technique Based on Furthest Neighbor Algorithm (SMOTEFUNA): this method employs the furthest neighbor examples. SMOTEFUNA has a number of advantages over other approaches, one of which is the absence of tuning parameters, which makes it easier to use in real-world scenarios. [4]
- Sampling WIth the Majority (SWIM): robust to severe class imbalance. It uses the density of the well-sampled majority class to direct the generation process. SWIM's model was built using both the radial basis function and the Mahalanobis distance. SWIM was tested on 25 benchmark datasets, and the findings show that it beats some of the most common oversampling methods. [5]
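The ADASYN weighting step mentioned above can be sketched as follows. This is a rough illustration of the allocation rule described in [3] (minority points with more majority-class neighbors receive more synthetic samples); the function name and parameters are mine, not from a library:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_allocation(X, y, minority_label, k=5, beta=1.0):
    """Return, per minority example, how many synthetic points to create."""
    X_min = X[y == minority_label]
    n_maj = np.sum(y != minority_label)
    G = int((n_maj - len(X_min)) * beta)   # total synthetic points wanted
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    idx = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    # r_i: fraction of each minority point's neighbors that are majority.
    r = np.array([(y[idx[i]] != minority_label).mean()
                  for i in range(len(X_min))])
    if r.sum() > 0:
        r = r / r.sum()
    else:
        r = np.full(len(X_min), 1.0 / len(X_min))
    return np.round(r * G).astype(int)     # g_i per minority example
```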
Different studies show that accuracy measures can look good on paper when the model is overfitted, which is common when oversampling methods are used.
Method
Proposed validation system:
Similarity was measured with the Hassanat distance. [6]
The error is calculated as follows: Error = CM / SS, where CM is the number of synthetic examples that are close to majority examples and SS is the total number of examples in the synthetic subset.
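To make the metric concrete, below is a minimal Python sketch. The per-dimension Hassanat distance follows the original definition in [6]; the rule deciding when a synthetic example counts toward CM (nearer to a hidden majority example than to any minority example) is my illustrative assumption, since the paper's exact matching criterion is only loosely summarized here:

```python
import numpy as np

def hassanat(a, b):
    """Hassanat distance: per-dimension bounded terms, summed over dims."""
    lo, hi = np.minimum(a, b), np.maximum(a, b)
    shift = np.where(lo < 0, -lo, 0.0)   # shift negative dims to >= 0
    return np.sum(1.0 - (1.0 + lo + shift) / (1.0 + hi + shift))

def validation_error(synthetic, hidden_majority, minority):
    """Error = CM / SS, per the formula above."""
    cm = 0
    for s in synthetic:
        d_maj = min(hassanat(s, m) for m in hidden_majority)
        d_min = min(hassanat(s, m) for m in minority)
        if d_maj < d_min:   # assumed criterion: looks like a majority point
            cm += 1
    return cm / len(synthetic)
```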
Datasets
Three real-life datasets were used to put the authors' validation system to the test, namely Yeast4, Yeast5, and Yeast6, which are routinely used to evaluate oversampling methods. [7]
| ID | Name | No. Attributes | No. Classes | No. Minor | No. Major |
|---|---|---|---|---|---|
| 1 | Yeast 4 | 10 | 2 | 51 | 1433 |
| 2 | Yeast 5 | 10 | 2 | 44 | 1440 |
| 3 | Yeast 6 | 10 | 2 | 35 | 1449 |
Results
The authors hid varied numbers of majority examples, namely 10%, 25%, and 50% of the majority examples of each dataset, to see the effect of the number of hidden examples on the validation process. Furthermore, each experiment was repeated five times, and the average result for each hidden ratio, for each oversampling method, on each dataset is reported.
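A rough sketch of this protocol, reusing `validation_error()` from the Method sketch above; the `oversample` argument is a hypothetical stand-in for any of the 70+ methods under test:

```python
import numpy as np

def run_protocol(X_maj, X_min, oversample, ratios=(0.10, 0.25, 0.50),
                 repeats=5, seed=0):
    """Average validation error per hidden ratio for one oversampler."""
    rng = np.random.default_rng(seed)
    results = {}
    for ratio in ratios:
        errors = []
        for _ in range(repeats):
            # Hide a random fraction of the majority class.
            n_hide = int(len(X_maj) * ratio)
            hide = rng.choice(len(X_maj), size=n_hide, replace=False)
            hidden = X_maj[hide]
            visible = np.delete(X_maj, hide, axis=0)
            # Oversample the minority to balance the visible majority.
            synthetic = oversample(X_min, n_new=len(visible) - len(X_min))
            errors.append(validation_error(synthetic, hidden, X_min))
        results[ratio] = float(np.mean(errors))
    return results
```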
An examination of the obtained results shows that oversampling methods result in errors in the synthesized examples. That is, they generate examples that are meant to be minority, yet are similar to the majority or fall within the majority class’s decision boundary.
The paper reports the detailed results for 10%, 25%, and 50% hidden examples, along with a ranking of the methods based on their average error rate.
As a result:
- All of the validated oversampling methods produce misleading examples, regardless of the hidden percentage or dataset used.
- The common mistake all oversampling methods share is feeding such data to a classifier under the assumption that every example is realistic and labeled according to reality. The classifier has no other knowledge and learns from this false assumption, which produces excellent results in the lab but unexpected behavior in real-world applications.
- This makes training on these examples deceptive, and it can lead to the classifier overfitting on incorrect data if robust generalization techniques are not used.
When applied to real-world tasks, the entire machine learning system may fail spectacularly, particularly in critical applications such as security, autonomous driving, aviation safety, and medical applications, where even one unrealistic synthesized example could do catastrophic harm.
Conclusions
- Oversampling methods have been extensively researched for handling class imbalance learning.
- Current oversampling methods may be deceptive and lead to failures in real-world applications.
- A new validation system for oversampling methods was proposed and used to evaluate over 70 methods on three real-world datasets.
- Results show that all investigated oversampling methods generate false examples, potentially causing classifiers to fail in practice.
- Some oversampling methods rank as less harmful than others, but they ultimately prove useless when the datasets are changed.
- It is recommended to avoid oversampling methods in sensitive applications and to instead use ensemble approaches such as Easy Ensemble and Random Data Partitioning (a brief usage sketch follows this list).
- Further research is needed to confirm the validity of oversampling methods and explore more methods and real-world applications.
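For completeness, a minimal usage sketch of the recommended ensemble alternative, using imbalanced-learn's `EasyEnsembleClassifier` on a toy imbalanced dataset; the dataset and parameters here are illustrative:

```python
from imblearn.ensemble import EasyEnsembleClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset: roughly 5% minority class.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Each estimator sees all minority examples plus a random, equally sized
# subset of the majority class -- no synthetic data is generated.
clf = EasyEnsembleClassifier(n_estimators=10, random_state=42)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```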
References
[1]: https://ieeexplore.ieee.org/document/9179814
[2]: https://ieeexplore.ieee.org/document/6361394
[3]: https://ieeexplore.ieee.org/document/4633969
[4]: https://ieeexplore.ieee.org/document/9045990
[5]: https://ieeexplore.ieee.org/document/8594869
[6]: https://ieeexplore.ieee.org/document/10009844
[7]: https://ngdc.cncb.ac.cn/databasecommons/database/id/3269