Abstract:
Bioinformatics plays a central role in extracting as much information as possible from biological data. Although established methods remain useful, they cannot keep pace with the ever-increasing volume of data produced by high-throughput sequencing projects. Sequence clustering is one of the most important areas of bioinformatics. In this paper, we focus on sequence clustering as a way to help multiple sequence alignment algorithms cope with the growing number of large-scale biological sequences in computational biology. We present a clustering method based on the K-means algorithm and guided by the k-mers of the sequences to be aligned. We also integrate this method into a multiple alignment strategy to reduce execution time without losing alignment quality. We tested the approach on a multi-core processor using a set of benchmarks from the literature, and compared our results with those produced by the UClust clustering algorithm. The results show that our approach outperforms UClust in computation time while maintaining accuracy on all tested benchmarks.
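To make the idea of K-means clustering guided by k-mers concrete, the following is a minimal sketch and not the implementation evaluated in this paper. It assumes DNA sequences over the alphabet ACGT, represents each sequence by its normalized k-mer frequency vector, and runs a plain Lloyd-style K-means with squared Euclidean distance and a deterministic farthest-point initialization. The function names (kmer_profile, kmeans, cluster_sequences), the choice k = 2, the distance, and the initialization are all illustrative assumptions.

```python
from itertools import product


def kmer_profile(seq, k=2, alphabet="ACGT"):
    """Return the normalized k-mer frequency vector of a sequence.

    Assumption: k-mers containing symbols outside `alphabet` are skipped.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0.0] * len(kmers)
    total = 0
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            counts[j] += 1
            total += 1
    return [c / total for c in counts] if total else counts


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, n_clusters, n_iter=50):
    """Plain Lloyd-style K-means; returns one cluster label per point.

    Assumption: centroids are seeded with the first point, then repeatedly
    with the point farthest from all centroids chosen so far.
    """
    centroids = [points[0]]
    while len(centroids) < n_clusters:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(far)

    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(n_clusters), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_sequences(seqs, n_clusters, k=2):
    """Cluster sequences by K-means on their k-mer profiles."""
    return kmeans([kmer_profile(s, k) for s in seqs], n_clusters)


if __name__ == "__main__":
    toy = ["AAAAAAAAAA", "AAAAAAAAAT", "GCGCGCGCGC", "GCGCGCGCGG"]
    print(cluster_sequences(toy, n_clusters=2))
```

In a pipeline of the kind described here, each resulting cluster could then be passed to a multiple sequence alignment tool independently, which is where a multi-core processor would help; that step is outside this sketch.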