Abstract:
Feature selection plays an important role in text categorization. It has proven
to be an effective and efficient way to prepare high-dimensional data for data mining and text classification. Among the most popular selection metrics are
Information Gain (IG), Mutual Information (MI), Chi-square (Chi2), and Document
Frequency (DF), which uses the document frequency distribution to estimate the relevance of words to the class variable without considering the intra-document word
frequency distribution.
Our main contribution is a new feature selection approach, called TFDF, based on
term frequency and document frequency at the class level. In the experiments, the proposed method is compared with existing metrics, namely IG, MI, Chi2,
and DF. The classifiers used to evaluate the selection metrics are Support Vector Machine (SVM) and Naive Bayes (NB), two of the best-performing
classifiers for text categorization.
Experimental results show that the proposed method outperforms the existing
metrics reported in the literature.
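As a rough illustration of the idea, the sketch below computes a class-level score that combines a term's frequency and its document frequency within each class and keeps the top-scoring terms. The combination tf_c(t) * df_c(t) and the max-over-classes aggregation are assumptions for illustration only; the paper defines the exact TFDF formulation.

```python
from collections import Counter, defaultdict

def tfdf_scores(docs, labels):
    """Illustrative class-level TF*DF relevance score for each term.

    docs   : list of tokenized documents (lists of words)
    labels : list of class labels, one per document
    NOTE: tf_c(t) * df_c(t), aggregated by max over classes, is an
    assumed formulation, not necessarily the paper's exact definition.
    """
    tf = defaultdict(Counter)   # term frequency per class
    df = defaultdict(Counter)   # document frequency per class
    n_docs = Counter(labels)    # number of documents per class

    for tokens, c in zip(docs, labels):
        tf[c].update(tokens)        # count every occurrence
        df[c].update(set(tokens))   # count each document once

    terms = {t for counter in tf.values() for t in counter}
    scores = {}
    for t in terms:
        # relevance of the term = its best class-level score
        scores[t] = max(
            (tf[c][t] / sum(tf[c].values())) * (df[c][t] / n_docs[c])
            for c in n_docs
        )
    return scores

# Usage: keep the top-k terms as the selected feature set
docs = [["good", "film"], ["bad", "film"], ["great", "plot"]]
labels = ["pos", "neg", "pos"]
scores = tfdf_scores(docs, labels)
selected = sorted(scores, key=scores.get, reverse=True)[:2]
```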