Abstract:
This study investigates the impact of oversampling techniques, specifically SVM-SMOTE and BorderlineSMOTE, on machine learning models for heart disease and
diabetes risk prediction. Using Gradient Boosting Machine (GBM) and K-Nearest
Neighbors (KNN) algorithms, we assess changes in accuracy, precision, recall, F1 score,
Positive Predictive Value (PPV), Equal Opportunity Difference (EOD), Disparate Impact
(DI), and Impact Ratio (IR) across diverse biomedical datasets.
In both the heart disease and diabetes risk prediction tasks, SVM-SMOTE and BorderlineSMOTE
improved machine learning model performance. For heart disease prediction, both techniques
raised GBM accuracy from 0.74 to 0.85; SVM-SMOTE and BorderlineSMOTE raised precision from
0.69 to 0.79 and 0.77, and recall from 0.75 to 0.88 and 0.92, respectively. KNN models also
improved in accuracy (from 0.68 to 0.71), precision (from 0.59 to 0.70), and recall
(from 0.62 to 0.72). In diabetes risk prediction, both
techniques consistently boosted accuracy, precision, and F1 score metrics across GBM and
KNN models. Notably, DI rose from an initial 0.43 to 1.11 with both SVM-SMOTE and
BorderlineSMOTE, moving closer to the parity value of 1 and indicating improved fairness
in model predictions across demographic groups.
Overall, the strategic application of SVM-SMOTE and BorderlineSMOTE effectively
addresses class imbalance challenges in biomedical datasets, enhancing both predictive
accuracy and fairness in machine learning models. These results underscore the importance of
tailored oversampling techniques in achieving robust and equitable healthcare predictions
across diverse demographic groups.
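
The fairness metrics named above can be illustrated with a minimal sketch. It assumes the commonly used definitions: DI as the ratio of positive-prediction rates between an unprivileged and a privileged group, and EOD as the difference in true positive rates between the same groups. The study's exact formulations may differ, and the function names, group encoding (0 = unprivileged, 1 = privileged), and toy data are hypothetical.

```python
from typing import Sequence


def selection_rate(y_pred: Sequence[int]) -> float:
    """Fraction of samples predicted positive (label 1)."""
    return sum(y_pred) / len(y_pred)


def true_positive_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of actual positives that were predicted positive."""
    preds_on_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(preds_on_positives) / len(preds_on_positives)


def disparate_impact(y_pred: Sequence[int], group: Sequence[int]) -> float:
    """Selection rate of the unprivileged group (0) divided by that of
    the privileged group (1). A value of 1 means equal selection rates."""
    unpriv = [p for p, g in zip(y_pred, group) if g == 0]
    priv = [p for p, g in zip(y_pred, group) if g == 1]
    return selection_rate(unpriv) / selection_rate(priv)


def equal_opportunity_difference(
    y_true: Sequence[int], y_pred: Sequence[int], group: Sequence[int]
) -> float:
    """TPR of the unprivileged group minus TPR of the privileged group.
    A value of 0 means equal true positive rates."""
    unpriv = [(t, p) for t, p, g in zip(y_true, y_pred, group) if g == 0]
    priv = [(t, p) for t, p, g in zip(y_true, y_pred, group) if g == 1]
    tpr_unpriv = true_positive_rate(*zip(*unpriv))
    tpr_priv = true_positive_rate(*zip(*priv))
    return tpr_unpriv - tpr_priv


# Hypothetical toy data: eight patients, four per demographic group.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]

di = disparate_impact(y_pred, group)
eod = equal_opportunity_difference(y_true, y_pred, group)
print(f"DI = {di:.2f}, EOD = {eod:.2f}")
```

The sketch does not guard against groups with no samples or no actual positives, where the rates are undefined; a real evaluation would need to handle those cases explicitly.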