Spotting Words in Document Images

Maatoug, Yaaqob

URI: http://dspace.univ-guelma.dz/jspui/handle/123456789/15010

Date: 2023

Abstract:

Large numbers of documents exist in fields such as public administration, industry, scientific research, and education. The exponential growth of digital documents has made managing and exploiting them increasingly complex, so quick and efficient access to the knowledge they contain has become essential. Word spotting addresses this need by locating and identifying words of interest within large document collections. It plays an important role in information retrieval, document classification, machine translation, and many other applications.

This thesis is situated in that context. Our work contributes to the task of finding words in images of digitized documents, with a focus on Arabic documents. The proposed approach belongs to the family of analytical methods, which require the documents to be segmented into words before identification can be carried out. It comprises several processing steps. First, the document images are pre-processed to improve their quality and reduce artefacts. Next, the documents are segmented into lines of text and then into individual words. A set of features is then extracted from each segmented word. This stage is key to representing the words and to distinguishing them from one another, and we explored different families of descriptors to obtain a rich and discriminative representation. The words are then grouped into classes based on their similarity. The last module is the search module: the user expresses a query as a word image, and the system compares it with the previously extracted and classified words to return the most relevant ones.

The experiments demonstrated promising performance, opening up new perspectives in the field of document analysis and recognition.
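
The segmentation, feature extraction, and search steps described in the abstract can be illustrated with a minimal sketch. It is not the implementation from the thesis: it assumes an already binarized page given as rows of 0/1 ink pixels, uses projection profiles for line and word segmentation, and substitutes a toy column-density descriptor for the descriptor families the author explored. Pre-processing and the grouping of words into classes are omitted. All function names, the `word_gap` threshold, and the `bins` parameter are hypothetical.

```python
from math import dist

def runs(profile, min_gap=1):
    """Return (start, end) spans of non-zero entries in a projection profile.
    Spans separated by fewer than min_gap zero entries are merged."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    merged = []
    for s, e in spans:
        if merged and s - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], e)  # gap too small: same word
        else:
            merged.append((s, e))
    return merged

def segment_words(page, word_gap=2):
    """Split a binary page into lines (row profile), then words (column profile)."""
    words = []
    row_profile = [sum(row) for row in page]
    for top, bottom in runs(row_profile):
        line = page[top:bottom]
        col_profile = [sum(col) for col in zip(*line)]
        for left, right in runs(col_profile, min_gap=word_gap):
            words.append([row[left:right] for row in line])
    return words

def describe(word, bins=8):
    """Toy descriptor (an assumption): column ink density resampled to `bins` values."""
    cols = [sum(col) / len(word) for col in zip(*word)]
    n = len(cols)
    feats = []
    for b in range(bins):
        lo = b * n // bins
        hi = max(lo + 1, (b + 1) * n // bins)
        chunk = cols[lo:hi]
        feats.append(sum(chunk) / len(chunk))
    return feats

def search(query, words, top_k=3):
    """Query by example: rank word images by Euclidean distance to the query."""
    q = describe(query)
    ranked = sorted(range(len(words)), key=lambda i: dist(q, describe(words[i])))
    return ranked[:top_k]

# Tiny synthetic page: two text lines separated by one blank row.
page = [
    [1, 1, 0, 0, 0, 1, 0, 1],
    [1, 1, 0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 1, 1, 0],
]
words = segment_words(page)
matches = search(words[2], words, top_k=2)
```

Here `search` takes a word image as the query, mirroring the query-by-example interaction described above. On real scans, the projection-profile segmentation would likely need skew correction and noise removal first, which is the role of the pre-processing step.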
