Whole Genome Alignment via Alternating Lyndon Tree Factorization
Mahmud Sami Aydın
(Supervisor: Assoc. Prof. Can Alkan)
Computer Engineering Department
Abstract: Whole genome alignment (WGA) is an important challenge in genomics, especially in the context of pangenome construction. Here we propose a novel indexing structure called the Alternating Lyndon Factorization Tree (ALFTree), whose nodes incorporate both spatial and lexicographical information, allowing it to efficiently store and retrieve information about large DNA sequences. We present an algorithm, Idoneous, designed to construct the ALFTree from a given DNA sequence. The algorithm generates intervals of specific sizes, identifies matches within these intervals, and verifies the matches through alignment as a sanity check. Idoneous is efficient and scalable, which makes it well suited to WGA. Key features of the ALFTree include: 1) a compact data structure for storing large DNA sequences; 2) efficient retrieval of information about specific regions of a DNA sequence; 3) the ability to handle both spatial and lexicographical information; and 4) scalability to large DNA sequences. Our experimental results on different genomes highlight the effects of parameter selection on coverage and identity. Idoneous demonstrates competitive performance in terms of coverage and provides flexibility in adjusting sensitivity and specificity for different alignment scenarios. We believe that the ALFTree has the potential to significantly improve the performance of WGA algorithms and is a valuable contribution to the field of genomics, and we hope that researchers will use it to accelerate the pace of discovery.
DATE: Thursday, July 6 @ 10:30
PLACE: EA 409