Learning Based Compressed Data Structures (Poster)

Abstract

The deluge of data in applications such as bioinformatics, information retrieval, and databases has made the use of compressed data structures, sometimes called succinct or compact data structures, indispensable. The specialty of these structures is to represent data in compressed form while also supporting efficient access and queries on it. Traditionally, this is achieved by exploiting two particular sources of compressibility: statistical properties and repetitiveness of the data.
In this poster, we will discuss a different kind of data regularity based on geometric considerations: approximate linearity. Several recent results have shown that this geometric regularity is surprisingly frequent in real datasets, such as time series. We will expand the horizon of compressed data structures by presenting solutions that discover, or “learn”, in a principled algorithmic way, these approximate linearities in the data to solve some fundamental and ubiquitous problems in computer science, such as predecessor search and rank/select primitives. We will provide a walkthrough of these new theoretical achievements, also with a focus on open-source libraries and their experimental improvements in space of orders of magnitude compared to classical solutions (such as B-trees). We conclude by discussing the plethora of research opportunities that these new learning-based approaches to data structure design open up.

Publication
Stanford Compression Workshop 2021
Avatar
Paolo Ferragina
Full professor

Professor of Algorithms and PI of the project

Avatar
Giorgio Vinciguerra
PhD student

PhD student in Computer Science