A learned approach to design compressed rank/select data structures

Abstract

We address the problem of designing, implementing and experimenting with compressed data structures that support $\textsf{rank}$ and $\textsf{select}$ queries over a dictionary of integers. We shine a new light on this classical problem by showing a connection between the input integers and the geometry of a set of points in a Cartesian plane suitably derived from them. We then build upon some results in computational geometry to introduce the first compressed rank/select dictionary based on the idea of “learning” the distribution of such points via proper linear approximations (LA). We therefore call this novel data structure the la_vector. We prove time and space complexities of the la_vector in several scenarios: in the worst case, in the case of input distributions with finite mean and variance, and taking into account the $k$th order entropy of some of its building blocks. We also discuss improved hybrid data structures, namely ones that suitably orchestrate known compressed rank/select dictionaries with the la_vector. We corroborate our theoretical results with a large set of experiments over datasets originating from a variety of applications (Web search, DNA sequencing, information retrieval and natural language processing) and show that our approach provides new interesting space-time trade-offs with respect to many well-established compressed rank/select dictionary implementations. In particular, we show that our $\textsf{select}$ is the fastest, and our $\textsf{rank}$ is on the space-time Pareto frontier.

Publication
ACM Transactions on Algorithms
Avatar
Antonio Boffa
PhD student

PhD student in Computer Science

Avatar
Paolo Ferragina
Full professor

Professor of Algorithms and PI of the project

Avatar
Giorgio Vinciguerra
Postdoc

Postdoc in Computer Science