Efficiently merging $r$-indexes


Large sequencing projects, such as GenomeTrakr and MetaSub, are updated frequently (sometimes daily, in the case of GenomeTrakr) with new data. Therefore, it is imperative that any data structure indexing such data supports efficient updates. Toward this goal, Bannai et al. (TCS, 2020) proposed a data structure named dynamic $r$-index which is suitable for large genome collections and supports incremental construction; however, it is still not powerful enough to support substantial updates. Here, we develop a novel algorithm for updating the $r$-index, which we refer to as RIMERGE. Fundamental to our algorithm is the combination of the basics of the dynamic $r$-index with a known algorithm for merging Burrows-Wheeler Transforms (BWTs). As a result, RIMERGE is capable of performing batch updates in a manner that exploits parallelism while keeping the memory overhead small. We compare our method to the dynamic $r$-index of Bannai et al. using two different datasets, and show that RIMERGE is between 1.88 to 5.34 times faster on reasonably large inputs.

Proceedings of the 31th Data Compression Conference (DCC)
Giovanni Manzini
Full professor

Professor of Computer Science at the University of Eastern Piedmont