The ever-growing need to efficiently store, retrieve, and analyze massive datasets, generated by very different sources, is made even more complex by the diverse requirements posed by users and applications. This new level of complexity cannot be handled properly by current data structures for big-data problems.
To successfully meet these challenges, we launched a project, funded by the Italian Ministry of University and Research (PRIN no. 2017WR7SHH), that will lay down the theoretical and algorithmic-engineering foundations of a new generation of Multicriteria Data Structures and Algorithms. The term multicriteria refers to our aim of seamlessly integrating, via a principled optimization approach, modern compressed data structures with new, revolutionary data structures learned from the input data by means of suitable machine-learning tools. The goal of the optimization is to select, from a family of properly designed data structures, the one that “best fits” the multiple constraints imposed by its context of use, thus eventually dominating the multitude of trade-offs currently offered by known solutions, especially in the realm of Big Data applications.
A multicriteria data structure, for a given problem $P$, is defined by a pair $\langle \mathcal F, \mathcal A \rangle_P$, where $\mathcal F$ is a family of data structures, each solving $P$ with a different trade-off in the use of some resources (e.g., time, space, energy), and $\mathcal A$ is an optimization algorithm that selects from $\mathcal F$ the data structure that best fits a given instance of $P$.
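As an illustration, the selection performed by $\mathcal A$ can be sketched in a few lines of Python. Everything below is hypothetical: the candidate structures, their resource figures, and the constraint are invented for the example and are not taken from the project's code.

```python
from dataclasses import dataclass

# Each candidate structure in the family F is summarized by its estimated
# resource usage on a given instance of P (figures are made up).
@dataclass
class Candidate:
    name: str
    space_bytes: int       # estimated space occupancy
    query_time_ns: float   # estimated average query time

def select_best(family, max_query_time_ns):
    """A toy optimization algorithm A: among the candidates in F that
    meet the query-time constraint, pick the most space-efficient one."""
    feasible = [c for c in family if c.query_time_ns <= max_query_time_ns]
    if not feasible:
        raise ValueError("no structure in F satisfies the time constraint")
    return min(feasible, key=lambda c: c.space_bytes)

family = [
    Candidate("hash table",      space_bytes=8_000_000, query_time_ns=50),
    Candidate("B-tree",          space_bytes=4_000_000, query_time_ns=300),
    Candidate("compressed trie", space_bytes=1_500_000, query_time_ns=900),
]
best = select_best(family, max_query_time_ns=500)
print(best.name)  # → B-tree: cheapest in space among the time-feasible ones
```

In practice the project's optimization works over far richer families and cost models, but the shape of the problem is the same: a constrained search over a space of data-structure designs.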
For more details on the project, have a look at its full description here.
Paolo Ferragina and Giovanni Manzini, together with Michael Burrows (Google), received the 2022 ACM Paris Kanellakis Theory and Practice Award for inventing the BW-transform and the FM-index, which opened and influenced the field of Compressed Data Structures with a fundamental impact on Data Compression and Computational Biology.
Giorgio Vinciguerra won the 2022 PhD thesis award from the EATCS Italian Chapter for his research on “Learning-based compressed data structures”.
The paper by Domenico Amato, Giosuè Lo Bosco, and Raffaele Giancarlo, titled “On the Suitability of Neural Networks as Building Blocks for the Design of Efficient Learned Indexes”, received the Best Paper Award at EANN 2022.
A benchmarking platform to evaluate how feed-forward neural networks can be effectively used as index data structures.
A collection of methods for performing element search in ordered tables, starting from textbook implementations to more complex algorithms.
A grammar compressor for huge files with many repetitions.
A compressed rank/select dictionary exploiting approximate linearity and repetitiveness.
A data-aware trie for indexing and compressing a set of strings.
A Combination of Convolutional and Recurrent Deep Neural Networks for Nucleosome Positioning Identification.
This package proposes compression strategies for fully-connected and convolutional layers of deep neural networks, including pruning, quantization and low-rank factorization.
A software library for the distributed analysis of large-scale molecular interaction networks.
Software to analyze gene expression data from samples belonging to two different populations, typically one of healthy individuals and the other of unhealthy ones.
An extensible framework to efficiently compute alignment-free functions on a set of large genomic sequences.
A general software framework for the efficient acquisition of FASTA/Q genomic files in a MapReduce environment.
A SPARK software system for the collection of $k$-mer statistics.
A compressed bitvector/container supporting efficient random access and rank queries.
A software library to speed-up sorted table search procedures via learning from data.
A monotone minimal perfect hash function that learns and leverages the data smoothness.
Compressed rank/select dictionary based on Lempel-Ziv and LA-vector compression.
Massive matrix multiplication on RePair-compressed matrices.
Data structures supporting Longest Common Extensions and Suffix Array queries, built on the prefix-free parsing of the text.
A data structure enabling fast searches in arrays of billions of items using orders of magnitude less space than traditional indexes.
Python library of sorted containers with state-of-the-art query performance and compressed memory usage.
Compression strategies and space-conscious representations for deep neural networks.
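Several of the packages above (learned sorted-table search, learned rank/select dictionaries, learned indexes) build on the same core idea: approximate the key-to-position mapping of a sorted array with a simple model, then correct the model's prediction with a search bounded by its maximum error. A minimal, purely illustrative Python sketch of that idea (not the project's actual code, which uses far more refined models):

```python
import bisect

def fit_linear(keys):
    """Least-squares line through the (key, position) pairs of a sorted
    array, plus the maximum prediction error so searches can be bounded."""
    n = len(keys)
    mean_x = sum(keys) / n
    mean_y = (n - 1) / 2
    cov = sum((x - mean_x) * (y - mean_y) for y, x in enumerate(keys))
    var = sum((x - mean_x) ** 2 for x in keys)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    eps = max(abs(y - (slope * x + intercept)) for y, x in enumerate(keys))
    return slope, intercept, int(eps) + 1

def search(keys, model, q):
    """Predict the position of q, then binary-search only the error window."""
    slope, intercept, eps = model
    guess = int(slope * q + intercept)
    lo = max(0, guess - eps)
    hi = min(len(keys), guess + eps + 1)
    i = lo + bisect.bisect_left(keys[lo:hi], q)
    return i if i < len(keys) and keys[i] == q else -1

keys = list(range(0, 1000, 3))     # sorted keys with a roughly linear shape
model = fit_linear(keys)
print(search(keys, model, 42))     # → 14 (position of key 42)
print(search(keys, model, 43))     # → -1 (key not present)
```

When the keys are close to linear, the error window `eps` is tiny and each lookup touches only a handful of elements, which is what lets such structures use orders of magnitude less space than traditional indexes.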
The numerous paper presentations given at conferences by the project members are not included in the above list.
List of students, postdocs, and collaborators involved in the project:
List of PhD theses related to the project:
*Joint Ph.D. course in “Genomics and Bioinformatics” between Università degli Studi di Milano and the Joint Research Center (JRC) in Ispra. The first year of the course was funded by the project.