On nonlinear learned string indexing

Abstract

We investigate the potential of several artificial neural network architectures to be used as an index on a sorted set of strings, namely, as a mapping from a query string to (an estimate of) its lexicographic rank in the set, which allows solving some interesting string-search operations such as range and prefix searches. Our evaluation on a variety of real and synthetic datasets shows that learned models can beat the space vs error trade-off of the classic (possibly compressed) trie-based solutions for relatively dense datasets only, while being slower to be trained and queried. This leads us to conclude that learned models are not yet competitive with classic trie-based solutions, and thus cannot completely replace them, but possibly only integrate them. Although our study does not settle the question conclusively, it highlights appropriate methods, provides a baseline for comparison, and introduces several open problems, thereby serving as a starting point for future research.

Publication
IEEE Access
Avatar
Paolo Ferragina
Full professor

Professor of Algorithms and PI of the project

Avatar
Marco Frasca
Assistant professor

Researcher in Machine Learning and AI of the UNIMI unit

Avatar
Giosuè Cataldo Marinò
Research contractor
Avatar
Giorgio Vinciguerra
Postdoc

Postdoc in Computer Science