Abstract:
Detecting outlier documents is an important task in many domains, including fraud
detection, information retrieval, and anomaly detection. This project uses the
Word2Vec framework and the Word Mover’s Distance (WMD) to identify outlier
documents in a corpus. Word2Vec generates dense vector representations of words
that capture semantic similarity and contextual relationships. The WMD measures
the dissimilarity between two documents as the minimum cumulative distance that
the embedded words of one document must travel to match the embedded words of
the other; applied to the Word2Vec vectors, it yields a pairwise measure of
document dissimilarity. By analyzing the distribution of WMD scores across the
corpus, we identify documents that deviate significantly from the norm and
classify them as outliers. The approach is advantageous because it compares
documents by the meaning of their words rather than by exact word overlap, which
gives a more nuanced measure of document similarity. Experiments on benchmark
datasets validate the method and demonstrate its potential to identify outlier
documents accurately.
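
The scoring step described above can be illustrated with a minimal sketch. It assumes the pairwise WMD values have already been computed (for instance with gensim's `KeyedVectors.wmdistance` on a trained Word2Vec model) and stored in a symmetric matrix. The abstract does not state the decision rule, so the mean-distance score, the median/MAD-based modified z-score, and the 3.5 threshold are all assumptions made for illustration, as are the function names and the toy matrix.

```python
from statistics import mean, median


def outlier_scores(dist):
    """Score each document by its mean distance to every other document."""
    n = len(dist)
    return [mean(dist[i][j] for j in range(n) if j != i) for i in range(n)]


def flag_outliers(scores, threshold=3.5):
    """Flag documents whose score lies far above the median score.

    Uses a one-sided modified z-score based on the median absolute
    deviation (MAD). The rule and the threshold are assumptions, not
    the rule used in the project.
    """
    med = median(scores)
    mad = median(abs(s - med) for s in scores)
    if mad == 0:
        # All scores are (nearly) identical, so nothing stands out.
        return [False] * len(scores)
    return [0.6745 * (s - med) / mad > threshold for s in scores]


# Hypothetical symmetric WMD matrix for five documents (made-up values):
# document 4 is far from all the others.
wmd = [
    [0.0, 0.9, 1.0, 1.1, 3.0],
    [0.9, 0.0, 1.1, 1.0, 3.1],
    [1.0, 1.1, 0.0, 0.9, 2.9],
    [1.1, 1.0, 0.9, 0.0, 3.2],
    [3.0, 3.1, 2.9, 3.2, 0.0],
]

scores = outlier_scores(wmd)
flags = flag_outliers(scores)
outliers = [i for i, is_outlier in enumerate(flags) if is_outlier]
print(outliers)
```

A median-based rule is used here instead of a mean and standard deviation because a single extreme document inflates the standard deviation and can mask itself; other rules (a percentile cutoff, k-nearest-neighbor distance) would fit the same pipeline.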