On nonlinear learned string indexing


We investigate the potential of several artificial neural network architectures to be used as an index on a sorted set of strings, namely, as a mapping from a query string to (an estimate of) its lexicographic rank in the set, which allows solving some interesting string-search operations such as range and prefix searches. Our evaluation on a variety of real and synthetic datasets shows that learned models can beat the space vs error trade-off of the classic (possibly compressed) trie-based solutions for relatively dense datasets only, while being slower to be trained and queried. This leads us to conclude that learned models are not yet competitive with classic trie-based solutions, and thus cannot completely replace them, but possibly only integrate them. Although our study does not settle the question conclusively, it highlights appropriate methods, provides a baseline for comparison, and introduces several open problems, thereby serving as a starting point for future research.

IEEE Access
Paolo Ferragina
Full professor

Professor of Algorithms and PI of the project

Marco Frasca
Assistant professor

Researcher in Machine Learning and AI of the UNIMI unit

Giosuè Cataldo Marinò
Research contractor
Giorgio Vinciguerra

Postdoc in Computer Science