The ever-growing need to efficiently store, retrieve, and analyze massive datasets originating from very different sources is made even more challenging by the diverse requirements posed by users and applications. This new level of complexity cannot be handled properly by current data structures for big-data problems.

To successfully meet these challenges, we launched a project, funded by the Italian Ministry of University and Research (PRIN no. 2017WR7SHH), that will lay down the theoretical and algorithmic-engineering foundations of a new generation of Multicriteria Data Structures and Algorithms. The multicriteria feature refers to our goal of seamlessly integrating, via a principled optimisation approach, modern compressed data structures with new, revolutionary data structures learned from the input data through suitable machine-learning tools. The goal of the optimisation is to select, from a family of properly designed data structures, the one that “best fits” the multiple constraints imposed by its context of use, thus eventually dominating the multitude of trade-offs currently offered by known solutions, especially in the realm of Big Data applications.

What is a multicriteria data structure?

A multicriteria data structure for a given problem $P$ is defined by a pair $\langle \mathcal F, \mathcal A \rangle_P$, where $\mathcal F$ is a family of data structures, each solving $P$ with a proper trade-off in the use of some resources (e.g., time, space, energy), and $\mathcal A$ is an optimisation algorithm that selects from $\mathcal F$ the data structure that best fits an instance of $P$.

[Figure: a multicriteria data structure combines a family of data structures, their computational resources, and an optimisation algorithm.]
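
To make the definition above concrete, here is a minimal Python sketch (purely illustrative, with hypothetical structure names and cost figures) of the role of $\mathcal A$: among the members of a toy family $\mathcal F$, each described by estimated space and query-time costs on the given instance, it selects the fastest one that respects a space budget.

```python
# Illustrative sketch only: hypothetical structure names and cost figures.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    space_bytes: int       # estimated space usage on the given instance
    query_time_ns: float   # estimated average query time on the given instance

def select_best(family, space_budget_bytes):
    """The optimisation algorithm A: among the members of F that respect the
    space budget, return the one with the smallest estimated query time."""
    feasible = [c for c in family if c.space_bytes <= space_budget_bytes]
    if not feasible:
        raise ValueError("no data structure in F fits the space budget")
    return min(feasible, key=lambda c: c.query_time_ns)

# A toy family F of alternative indexes for the same problem instance.
F = [
    Candidate("uncompressed index", 8_000_000, 120.0),
    Candidate("compressed trie",    2_500_000, 260.0),
    Candidate("learned index",      1_200_000, 180.0),
]
print(select_best(F, space_budget_bytes=3_000_000).name)  # -> learned index
```

In the project, the candidates and their cost models are far richer than these toy tuples, but the selection principle is the same: feasibility with respect to the context's constraints first, then optimality among the feasible trade-offs.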

For more details on the project, have a look at its full description here.

Publications

Learned data structures. In: Oneto L., Navarin N., Sperduti A., Anguita D. (eds) Recent Trends in Learning From Data. Studies in Computational Intelligence, vol. 896. Springer, 2020.


Awards

  • Paolo Ferragina and Giovanni Manzini, together with Michael Burrows (Google), received the 2022 ACM Paris Kanellakis Theory and Practice Award for inventing the BW-transform and the FM-index, which opened and influenced the field of Compressed Data Structures, with fundamental impact on Data Compression and Computational Biology.

  • Giorgio Vinciguerra won the 2022 PhD thesis award from the EATCS Italian Chapter for his research on “Learning-based compressed data structures”.

  • The paper by Domenico Amato, Giosuè Lo Bosco, and Raffaele Giancarlo, titled “On the Suitability of Neural Networks as Building Blocks for the Design of Efficient Learned Indexes”, received the Best Paper Award at EANN 2022.

Software

A Benchmarking Platform for Atomic Learned Indexes

A benchmarking platform to evaluate how feed-forward neural networks can be effectively used as index data structures.

A Learned Sorted Table Search Library

A collection of methods for performing element search in ordered tables, ranging from textbook implementations to more sophisticated algorithms.
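
As a rough sketch of the idea behind learned search procedures of this kind (an illustration of the general technique, not the library's code or API): fit a simple model of the key-to-position mapping of the sorted table, predict where a query key should lie, and complete the lookup with a binary search confined to a small window around the prediction.

```python
# Illustrative sketch of a learned sorted-table search (not the library's code).
import bisect

def fit_linear(keys):
    """Least-squares line mapping key -> position in the sorted table
    (a toy stand-in for the rank/CDF models used by learned indexes)."""
    n = len(keys)
    mean_x = sum(keys) / n
    mean_y = (n - 1) / 2
    var_x = sum((x - mean_x) ** 2 for x in keys)
    cov = sum((x - mean_x) * (y - mean_y) for y, x in enumerate(keys))
    slope = cov / var_x if var_x else 0.0
    return slope, mean_y - slope * mean_x

def learned_search(keys, query, slope, intercept, err=64):
    """Predict the position of `query`, then binary-search only a window of
    radius `err` around the prediction; fall back to a full binary search
    if the model's error exceeds the window."""
    pred = int(slope * query + intercept)
    lo, hi = max(0, pred - err), min(len(keys), pred + err + 1)
    i = bisect.bisect_left(keys, query, lo, hi)
    if i < hi and keys[i] == query:
        return i
    i = bisect.bisect_left(keys, query)                    # fallback
    return i if i < len(keys) and keys[i] == query else -1

table = list(range(0, 300_000, 3))                         # a sorted table
slope, intercept = fit_linear(table)
print(learned_search(table, 150_000, slope, intercept))    # -> 50000
print(learned_search(table, 150_001, slope, intercept))    # -> -1 (absent)
```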

BigRePair

A grammar compressor for huge files with many repetitions.

Block-epsilon tree

A compressed rank/select dictionary exploiting approximate linearity and repetitiveness.

CoCo-trie

A data-aware trie for indexing and compressing a set of strings.

CORENup

A Combination of Convolutional and Recurrent Deep Neural Networks for Nucleosome Positioning Identification.

Deep neural networks compression

This package provides compression strategies for fully-connected and convolutional layers of deep neural networks, including pruning, quantization, and low-rank factorization.
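
For illustration only, the following NumPy sketch shows two of these strategies in their simplest form, magnitude pruning and uniform quantization, applied to a random weight matrix; it does not reproduce the package's actual interface.

```python
# Illustrative NumPy sketch of magnitude pruning + uniform quantization;
# it does not reproduce the package's actual interface.
import numpy as np

def magnitude_prune(W, sparsity=0.8):
    """Zero out the fraction `sparsity` of weights with smallest magnitude."""
    threshold = np.quantile(np.abs(W), sparsity)
    return np.where(np.abs(W) >= threshold, W, 0.0)

def uniform_quantize(W, bits=4):
    """Map weights onto 2**bits evenly spaced levels between min and max."""
    levels = 2 ** bits
    w_min, w_max = W.min(), W.max()
    step = (w_max - w_min) / (levels - 1)
    return np.round((W - w_min) / step) * step + w_min

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 128))               # a toy fully-connected layer
W_pruned = magnitude_prune(W, sparsity=0.8)
mask = W_pruned != 0                          # keep pruned weights at zero
W_small = uniform_quantize(W_pruned, bits=4) * mask
print(f"weights kept: {np.count_nonzero(W_small) / W.size:.0%}")   # ~20%
print(f"distinct values: {len(np.unique(W_small))}")               # <= 17
```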

DIAMIN

A software library for the distributed analysis of large-scale molecular interaction networks.

Discriminative Subgraph Discovery

Software to analyze gene expression data from samples belonging to two different populations, typically one of healthy individuals and one of unhealthy ones.

FADE

An extensible framework to efficiently compute alignment-free functions on a set of large genomic sequences.

FASTdoopC

A general software framework for the efficient acquisition of FASTA/Q genomic files in a MapReduce environment.

FastKmer

A Spark-based software system for collecting $k$-mer statistics.

LA-vector

A compressed bitvector/container supporting efficient random access and rank queries.

Learned sorted table search

A software library to speed-up sorted table search procedures via learning from data.

LeMonHash

A monotone minimal perfect hash function that learns and leverages the data smoothness.

LZ-Epsilon

Compressed rank/select dictionary based on Lempel-Ziv and LA-vector compression.

MM RePair

Massive matrix multiplication on RePair-compressed matrices.

PFP Data Structures

Data structures supporting Longest Common Extensions and Suffix Array queries, built on the prefix-free parsing of the text.

PGM-index

A data structure enabling fast searches in arrays of billions of items using orders of magnitude less space than traditional indexes.

PyGM

Python library of sorted containers with state-of-the-art query performance and compressed memory usage.
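
A hypothetical usage sketch is shown below; it assumes PyGM exposes a sortedcontainers-style SortedList built on the PGM-index, and the method names and semantics indicated in the comments are assumptions rather than a definitive reference to the library's API.

```python
# Hypothetical usage sketch of PyGM: the method names and their semantics
# below (count, rank) are assumptions based on the library's
# sortedcontainers-like interface, not a definitive reference.
from pygm import SortedList

sl = SortedList([13, 1, 3, 5, 8, 1, 2, 21])  # builds a learned index over the data
print(8 in sl)        # True -- membership query
print(len(sl))        # 8
print(sl.count(1))    # 2    -- occurrences of a key (assumed semantics)
print(sl.rank(6))     # 5    -- elements less than or equal to 6 (assumed semantics)
```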

sHAM

Compression strategies and space-conscious representations for deep neural networks.

Invited talks

Data compression: once upon a time, now and then
Advances in data-aware compressed-indexing schemes for integer and string keys
Data-aware (learned) design of data structures
Learning-based compressed data structures
Learning-based approaches to compressed data structures design
Learning-based approaches to compressed data structures design
Extraction of Functional Knowledge from Large Biological Graphs and Applications in Precision Medicine
FM-index - the story... with a moral for (young) researchers
Introduction to compressed data structures
Learned Indexes
A rigorous approach to design learned data structures
On the compressed indexing of dictionaries of keys for massive key-value stores
A tutorial on learning-based compressed data structures
Introduction to Wheeler Graphs
Theory and practice of learning-based compressed data structures
Learned and Compressed Data Structures, challenges on Storage Systems
Learned and Compressed Data Structures
The future of data structures: data-aware and self-designing
Life sciences and algorithmic design: speed and accuracy in small space
Hybrid Data Structures and beyond
The evolution of searching data structures

The numerous conference presentations of papers given by project members are not included in the list above.

Events


The future of compressed data structures, 20 years after the FM-index

July 19–20, 2022, Lipari, Italy
Organisers: Paolo Ferragina and Giovanni Manzini

PhD Course on Data Compression and Compressed Data Structures

May 16–31, 2022, Pisa, Italy
Lecturers: Paolo Ferragina and Giovanni Manzini

PhD Course on Searching Big Tables in Small Space

February 15–19, 2021, Palermo, Italy
Lecturer: Raffaele Giancarlo

2020 Student Challenge @ Algorithm Engineering UniPi course

November 4, 2020 – February 10, 2021, Pisa, Italy
Organisers: Paolo Ferragina and Giorgio Vinciguerra

Meetings & Reports


Project end

August 28, 2022 (postponed to August 28, 2023 because of COVID-19)

Report of the second year

September 1, 2021

Report of the first year

September 1, 2020

Kickoff meeting

October 14, 2019, Video conference

Minutes of the meeting


Project start

September 1, 2019

People

Principal Investigators

  • Marco Frasca, Assistant professor
  • Raffaele Giancarlo, Full professor

Università di Pisa

  • Andrea Guerra, PhD student
  • Davide Bacciu, Assistant professor

Università degli Studi di Milano

  • Marco Frasca, Assistant professor
  • Dario Malchiodi, Associate professor
  • Marco Mesiti, Associate professor
  • Paolo Perlasca, Assistant professor
  • Alessandro Petrini, Research fellow
  • Jessica Gliozzo, PhD student
  • Giosuè Cataldo Marinò, Research contractor
  • Flavio Furia, Research contractor

Università degli Studi di Palermo

  • Raffaele Giancarlo, Full professor
  • Giosuè Lo Bosco, Associate professor
  • Simona E. Rombo, Associate professor

Università degli Studi del Piemonte Orientale “Amedeo Avogadro”

  • Lavinia Egidi, Associate professor
  • Alessandro Poggiali, Research fellow

Students and postdocs

List of students, postdocs, and collaborators involved in the project:

  • Domenico Amato, Università degli Studi di Palermo
    PhD student (Project start–April 2022) → Postdoc (April 2022–September 2022)
  • Antonio Boffa, Università di Pisa
    PhD student (November 2020–Project end)
  • Mariella Bonomo, Università degli Studi di Palermo
    PhD student (May 2021–October 2022)
  • Flavio Furia, Università degli Studi di Milano
    Contract employment for research activity support (July 2022–October 2022)
  • Jessica Gliozzo, Università degli Studi di Milano
    PhD student (November 2020–Project end)*
  • Gennaro Grimaudo, Università degli Studi di Palermo
    PhD student (March 2021–Project end)
  • Andrea Guerra, Università di Pisa
    PhD student (November 2021–Project end)
  • Giosuè Cataldo Marinò, Università degli Studi di Milano
    Contract employment for research activity support (July 2020–October 2020)
  • Alessandro Petrini, Università degli Studi di Milano
    Postdoc (December 2020–November 2021)
  • Alessandro Poggiali, Università degli Studi del Piemonte Orientale
    Research fellow (June 2022–November 2022)
  • Manuel Striani, Università degli Studi del Piemonte Orientale
    Postdoc (September 2020–August 2021)
  • Francesco Tosoni, Università di Pisa
    PhD student (November 2020–Project end)
  • Giorgio Vinciguerra, Università di Pisa
    PhD student (Project start–October 2021) → Postdoc (November 2021–December 2022)

List of PhD theses related to the project:

  • Domenico Amato. A Tour of Learned Static Sorted Sets Dictionaries: From Specific to Generic with an Experimental Performance Analysis. Successfully defended on July 10, 2022.
  • Giorgio Vinciguerra. Learning-based compressed data structures. Successfully defended on February 23, 2022.

*Joint PhD programme in “Genomics and Bioinformatics” between Università degli Studi di Milano and the Joint Research Centre (JRC) in Ispra. The first year of the programme was funded by the project.