PFP Compressed Suffix Trees

Christina Boucher, Ondřej Cvacho, Travis Gagie, Jan Holub, Giovanni Manzini, Gonzalo Navarro, Massimiliano Rossi

January 2021


Abstract

Prefix-free parsing (PFP) was introduced by Boucher et al. (2019) as a preprocessing step to ease the computation of Burrows-Wheeler Transforms (BWTs) of genomic databases. Given a string $S$, it produces a dictionary $D$ and a parse $P$ of overlapping phrases such that $\mathrm{BWT}(S)$ can be computed from $D$ and $P$ in time and workspace bounded in terms of their combined size $|\mathrm{PFP}(S)|$. In practice, $D$ and $P$ are significantly smaller than $S$, and computing $\mathrm{BWT}(S)$ from them is more efficient than computing it from $S$ directly, at least when $S$ is the concatenation of many genomes. In this paper, we consider $\mathrm{PFP}(S)$ as a data structure and show how it can be augmented to support full suffix tree functionality, still built and fitting within $O(|\mathrm{PFP}(S)|)$ space. This entails the efficient computation of various primitives to simulate the suffix tree: computing a longest common extension (LCE) of two positions in $S$; reading any cell of its suffix array (SA), of its inverse (ISA), of its BWT, and of its longest common prefix array (LCP); and computing minima over ranges and next/previous smaller value queries over the LCP. Our experimental results show that the PFP suffix tree can be efficiently constructed for very large repetitive datasets and that its operations perform competitively with other compressed suffix trees that can only handle much smaller datasets.
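To make the roles of $D$ and $P$ concrete, the following is a minimal illustrative sketch of a prefix-free-style parse, not the authors' implementation. It rests on several assumptions that do not come from the abstract: the function names `prefix_free_parse` and `reconstruct`, the window length `w`, the modulus `p`, the `$` terminator (assumed absent from the input), and a toy hash (the sum of character codes) standing in for the rolling hash a practical tool would likely use. A window of length `w` is treated as a trigger when its hash is divisible by `p`; each phrase runs from one trigger to the end of the next, so consecutive phrases overlap by `w` characters.

```python
def prefix_free_parse(s, w=2, p=3, hash_fn=None):
    """Toy prefix-free-style parse of s into (dictionary, parse).

    Assumptions (illustrative only): '$' does not occur in s, and the
    default hash is a plain sum of character codes, not a rolling hash.
    """
    if hash_fn is None:
        hash_fn = lambda window: sum(map(ord, window))

    # Append w terminators so the final window always closes a phrase.
    text = s + "$" * w
    last = len(text) - w

    phrases = []
    start = 0
    for i in range(1, last + 1):
        window = text[i:i + w]
        if i == last or hash_fn(window) % p == 0:
            # The phrase includes the trigger window that ends it, so the
            # next phrase (starting at i) overlaps it by w characters.
            phrases.append(text[start:i + w])
            start = i

    # D: distinct phrases in lexicographic order; P: ranks into D.
    dictionary = sorted(set(phrases))
    rank = {phrase: r for r, phrase in enumerate(dictionary)}
    parse = [rank[phrase] for phrase in phrases]
    return dictionary, parse


def reconstruct(dictionary, parse, w=2):
    """Rebuild the original string from a non-empty (dictionary, parse)."""
    out = dictionary[parse[0]]
    for r in parse[1:]:
        # Skip the w characters shared with the previous phrase.
        out += dictionary[r][w:]
    return out[:-w]  # drop the appended terminators


if __name__ == "__main__":
    genomes = "GATTACAGATTACAGATTACA"
    D, P = prefix_free_parse(genomes, w=2, p=3)
    print("dictionary:", D)
    print("parse:", P)
    print("round trip ok:", reconstruct(D, P, w=2) == genomes)
```

On repetitive input, repeated regions tend to be cut at the same trigger windows, so the same phrases recur and the dictionary can stay small relative to the text. That is the intuition behind $|\mathrm{PFP}(S)|$ being much smaller than $|S|$ for collections of similar genomes.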

Type

Conference paper

Publication

Proceedings of the SIAM Symposium on Algorithm Engineering and Experiments (ALENEX)
