Abstract:
Feature selection, as a dimensionality reduction technique, aims at selecting a small subset of the relevant features from the original set by removing the irrelevant, redundant or noisy ones. Feature selection generally leads to better learning performance, i.e. higher learning accuracy, lower computational cost and better model interpretation. Feature selection methods such as Information Gain (IG), Mutual Information (MI) and Chi-square (Chi2) are statistical methods based on document frequency, but they do not take into account the frequency of terms within documents, nor do they consider their semantics.
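These scores are estimated from document-level statistics, i.e. how many documents of each class do or do not contain a term. As a point of reference (the standard formulation, not specific to this work), Information Gain can be written as

\[
IG(t) = -\sum_{i=1}^{m} P(c_i)\log P(c_i)
        + P(t)\sum_{i=1}^{m} P(c_i \mid t)\log P(c_i \mid t)
        + P(\bar{t})\sum_{i=1}^{m} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t}),
\]

where $P(t)$, $P(\bar{t})$ and $P(c_i \mid t)$ are estimated from document counts only; Mutual Information and Chi-square are computed from the same document-level contingency counts, which is why within-document term frequency and term semantics play no role in these scores.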
Based on the idea that terms that frequently co-occur may share a common semantics, and thus have a higher discrimination capacity than isolated terms, we propose a feature selection method for text classification based on two measures: term co-occurrence frequency and term entropy. A term that frequently co-occurs with other terms and minimizes the uncertainty (entropy) of the class variable is considered relevant.
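As a rough illustration of this idea, the following Python sketch scores a term by combining a simple co-occurrence measure with the reduction in class entropy obtained when conditioning on the term's presence; the document representation (sets of terms), the helper functions and the weighting parameter `alpha` are illustrative assumptions, not the exact formulation proposed in this work.

```python
from collections import Counter
import math

def class_entropy(labels):
    """Shannon entropy (in bits) of the class distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def conditional_entropy(docs, labels, term):
    """Entropy of the class variable given the presence/absence of `term`."""
    groups = {True: [], False: []}
    for doc, label in zip(docs, labels):
        groups[term in doc].append(label)
    total = len(labels)
    return sum((len(g) / total) * class_entropy(g) for g in groups.values() if g)

def cooccurrence_frequency(docs, term):
    """Average number of distinct terms appearing alongside `term`
    in the documents that contain it (a simple co-occurrence measure)."""
    containing = [doc for doc in docs if term in doc]
    if not containing:
        return 0.0
    return sum(len(doc - {term}) for doc in containing) / len(containing)

def term_score(docs, labels, term, alpha=0.5):
    """Illustrative relevance score: a term ranks high when it co-occurs
    frequently with other terms AND reduces the class entropy (uncertainty)."""
    entropy_reduction = class_entropy(labels) - conditional_entropy(docs, labels, term)
    return alpha * cooccurrence_frequency(docs, term) + (1 - alpha) * entropy_reduction

# Toy usage: documents as sets of terms, with their class labels.
docs = [{"svm", "kernel", "margin"}, {"goal", "match"}, {"svm", "margin"}]
labels = ["tech", "sport", "tech"]
print(term_score(docs, labels, "svm"))
```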
The performance of our method is compared to the four most commonly used selection metrics: Information Gain (IG), Mutual Information (MI), Chi-square (Chi2) and Document Frequency (DF), using two classifiers, Naïve Bayes (NB) and Support Vector Machine (SVM), and three datasets.