Abstract:
Large numbers of documents exist in fields such as public administration, industry,
scientific research, and education. The rapid growth in the volume of digital documents
has made their management and exploitation increasingly complex.
Given this abundance of textual information, quick and efficient access to the knowledge
contained in these documents has become essential. Word spotting addresses this need by
locating and identifying words of interest within large document collections. It plays a
crucial role in applications such as information retrieval, document classification, and
machine translation.
This thesis is set in this context. Our work contributes to the task of locating words in
images of digitized documents, with a focus on Arabic documents. The proposed approach
belongs to the family of analytical methods, which require the documents to be segmented
into words before identification can be carried out. It comprises several processing
steps.
The first step of our approach is to pre-process the document images in order to improve
their quality and reduce artefacts. Next, we segment the documents into lines of text and
then into individual words. Once the words have been segmented, we extract a set of
features from each word. This stage plays a key role in representing the words and in
distinguishing them from one another. We explored different families of descriptors in
order to obtain a rich and discriminative representation of the words. The words are then
grouped into classes based on their similarity. Finally, the last module of our approach
is the search module: the user submits a query in the form of a word image, and the system
compares it with the previously extracted and classified words to find the most relevant
ones.
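The search step described above can be illustrated with a minimal sketch. It assumes that each word image has already been reduced to a numeric feature vector and that the words have already been grouped into classes; the abstract does not specify the descriptors, the similarity measure, or the grouping method, so the Euclidean distance, the centroid-based class selection, and all names used here (`distance`, `centroid`, `search`, `classes`) are illustrative assumptions rather than the method of the thesis.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(vectors):
    """Component-wise mean of a non-empty list of feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def search(query, classes, top_k=3):
    """Return the ids of the top_k words most similar to the query vector.

    classes maps a class label to a list of (word_id, feature_vector) pairs.
    """
    # Step 1: select the class whose centroid is closest to the query.
    best = min(
        classes,
        key=lambda label: distance(query, centroid([v for _, v in classes[label]])),
    )
    # Step 2: rank the words of that class by their distance to the query.
    ranked = sorted(classes[best], key=lambda item: distance(query, item[1]))
    return [word_id for word_id, _ in ranked[:top_k]]


# Toy data: hypothetical 2-D features for four segmented word images.
classes = {
    "A": [("w1", [0.0, 0.0]), ("w2", [0.1, 0.2])],
    "B": [("w3", [5.0, 5.0]), ("w4", [5.2, 4.9])],
}
result = search([5.1, 5.0], classes, top_k=2)
```

Comparing the query with class centroids first restricts the ranking to a single class, which is one plausible reason for grouping words before the search; other designs would also fit the description given here.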
The experiments showed promising performance, opening up new perspectives in the field of
document analysis and recognition.