Multicriteria Data Structures and Algorithms: from compressed to learned indexes, and beyond
PRIN no. 2017WR7SHH
Detailed description of the project: targets that the project aims to achieve and their significance in terms of advancement of knowledge, state of the art and proposed methodology
The last thirty years have witnessed an upsurging interest towards the design of algorithms and data structures that could scale with the deluge of data produced from an ever growing number of sources: from the Web and DNA sequencing of the 90s, to multimedia platforms and Social Networks of the 00s, eventually getting to Internet of Things and personal datastores of the last decade. To cope with these massive datasets, software engineers and algorithm designers have tried many lines of software development and research: from taking advantage of faster I/O-subsystems, to distributed platforms and novel programming models (e.g. Map/Reduce), etc.. Overall, there was a consensus that a proper arrangement of data and properly structured computations could surpass the improvements coming from technology advancements. This explains the remarkable progresses made in the last years by the algorithmic literature devoted to massive datasets, which concentrated mainly on the following four main themes (for each one, we give a brief description, mention a seminal paper and a recent review of the state-of-the-art).
External Memory Algorithms. At the end of the 80s, the design of I/O-efficient algorithms and data structures has been placed on solid theoretic grounds. In the seminal paper [AV88], the authors study the I/O-complexity of sorting and permuting in a model representing computers as consisting of two-level memories: one small and fast, coinciding with the internal RAM, and the other large and slow, coinciding with (mechanical) disks. The following years witnessed, from the one hand, the design of I/O-efficient (often optimal) algorithms and data structures managing many kinds of data and, from the other hand, the identification of the computational limits of performing I/O-efficient processing over large datasets. A review of the most significant achievements in this area is [V01].
Cache-oblivious Algorithms. At the end of the 90s, the algorithmic community recognized the limitations of the expressiveness of the two-level memory model with respect to the hierarchical memories available in computers. A new stream of research started with [FLPR99], where the authors introduced the so called cache-oblivious model, in which algorithms are designed analogously as in the two-level memory model, but without knowing the sizes of the internal memory and disk page. The resulting algorithms are able to efficiently exploit the presence of a hierarchy of memory levels of different speed and size: from caches, to RAM, to disks, etc. As in the I/O-model, the initial results were about sorting and some basic data structures, but they have been later extended to many other data types [A05], thus making cache-oblivious most of the proposals in [V01].
Compressed Data Structures. Another line of research for dealing with large datasets borrowed techniques from Information Theory and Data Compression, with the goal of designing algorithms and data structures with a footprint close to their Information Theoretic bound. The goal was to make them fit in internal memory. In the seminal paper [FM00], the authors showed that it is possible to compress data within a space matching the theoretical lower bound given by their entropy, and still efficiently support substring searches. Over the last years, this field grew significantly and now we have efficient compressed data structures for trees, graphs, permutations, and other combinatorial objects. A comprehensive description of these achievements is available e.g. in [N16].
Streaming Algorithms. Sometimes the amount of information may be too large to store in an impoverished setting (say, an embedded device) or to keep conveniently in fast storage. In response to this challenge, some researchers proposed the model of stream data processing, whose aim was no longer to store and index the whole available data but rather to create a summary of it and process each new datum quickly, even if approximately. The seminal paper [AMS99], and the many results that followed it, have designed many interesting “sketches” of data structures which are nowadays used in many Web, Social and Networking applications. These sketches, which can be seen as a sort of lossy-compressed data structures, are well surveyed in [C17]. In recent years, these four research lines have been theoretically improved, mixed together (for example, we now have compressed and cache-oblivious data structures [FV16]), and turned into sophisticated software libraries widely used by software engineers: SDSL-lite, WebGraph, Data Sketches Core library, etc. We point out that the proponents of this project have contributed significantly to the progress of those areas with many results published in top journals and conferences. We limit ourselves to mention the 7 papers published in J.ACM [FG99,FFM00,FM00,M01,BFG03,FGMS05,FLMM09].
Multicriteria Data Structures: A novel algorithmic approach introduced here
In (classic) data structure design, we usually trade space usage for speed or for some other critical computational resources. This often creates, for a given problem, a collection of different data structures offering various trade-offs, among which software engineers have to choose the one that best fits their application needs. To make things more complicated, these needs are often not exactly the ones satisfied by theory-efficient data structures, and/or they may change with time, data distribution and users. We believe it is time to introduce a new concept, transversal to the above four research lines, that copes with these novel challenges. We call it: “Multicriteria Data Structures” and give its first formalization next. A multicriteria data structure, for a given problem P, consists of a set F of data structures, each one solving P with a proper trade-off in the use of some computational resources, and a properly designed optimization algorithm that is able to efficiently select in F the data structure that “best fits” an instance of P given in input. This concept has been touched in a handful of papers, incidentally published by the proponents [BFG03,FGM09,FFFV14], dealing with few specific problems in Data Compression and Indexing. These preliminary results have shown the potential of this approach in the design of data structures which offer performance guarantees not available before. As an example, the bi-criteria data compressor in [FFFV14] provides the most compressed input data, for any given bound on the decompression time, among the class of compressed data obtainable with the famous gzip-like compression scheme. Experiments in [FFV14] have shown that the performance of the bi-criteria compressor improves the well-known Snappy, LZ4, LZMA, etc. thus constituting a “universal” replacement for all of them. In the present project, we are interested in theoretically and experimentally investigating this novel concept of “multicriteria algorithms and data structures” by (1) formulating coherently for the first time its theoretical foundations, (2) by deploying in an innovative way some results from the apparently distant fields of Artificial Intelligence (AI) and Machine Learning (ML), and (3) by designing and systematically experimenting some novel multicriteria data structures over some application settings coming from BioInformatics and Web/Text search, as detailed below.
Multicriteria Data Structures: An unexpected help from AI
Although AI and ML have been perceived as distant from classic discrete algorithmics, a recent report by some MIT and Google researchers [K17] shed a new and surprising light on their interplay, thus opening an unexpected research direction in both fields. The key points of this highly innovative paper are the following ones:
A) Sorting and searching data structures can be seen as learning models. The paper offers a few examples based on B-trees and Hash Tables.
B) The mix of CPU/GPU/TPU [W17] makes it possible to use ML models to efficiently “cut down” the search space in a way that is more effective than, say, binary search. In very loose terms, a decision is reduced to a numeric evaluation of a function suitably determined from the data. This function can be learned using ML techniques and the resulting index is aptly called Learned Index.
C) Experimentally, these ideas seem indeed very effective. In [K17], the authors showed that using neural nets in B-tree nodes, they were able to outperform cache-optimized B-Trees by up to 70% in speed, while saving an order-of-magnitude in memory usage. For lookups, the authors reported up to 80% reduction of hash-table memory overhead, while maintaining similar query times. These results stimulated a great interest in the communities of AI, DB and software engineers, which however highlighted some important limitations of the proposal: (1) Some known data structures outperform the ones proposed by [K17], e.g., Cuckoo hashing [B18]. (2) If “binary search decisions” are seen as numeric evaluations of functions learned from the data, numerical analysis comes into play: using spline approximation rather than ML some researchers [N17] have designed “better’’ indexes than the ones in [K17]. (3) Kraska et al. suggest that, for some datasets, the best results could be obtained by mixing traditional and learned indexes, or different types of learned indexes. However, their findings were heuristic and estimated only via experiments over a few datasets. Even more importantly, they did not consider compressed data structures, and they did not provide a principled evaluation of their framework and of its overall performance.
These results stimulated a great interest in the communities of AI, DB and software engineers, which however highlighted some important limitations of the proposal: (1) Some known data structures outperform the ones proposed by [K17], e.g., Cuckoo hashing [B18]. (2) If “binary search decisions” are seen as numeric evaluations of functions learned from the data, numerical analysis comes into play: using spline approximation rather than ML some researchers [N17] have designed “better’’ indexes than the ones in [K17]. (3) Kraska et al. suggest that, for some datasets, the best results could be obtained by mixing traditional and learned indexes, or different types of learned indexes. However, their findings were heuristic and estimated only via experiments over a few datasets. Even more importantly, they did not consider compressed data structures, and they did not provide a principled evaluation of their framework and of its overall performance.
Target of the project
From all of the above, it is clear that the results by Kraska et al., together with others obtained by researchers in this team, are only the starting point for further research on the new generation of multicriteria algorithms and data structures. These are designed to carefully orchestrate different techniques borrowed from Learned Indexes, Cache Oblivious and/or Compressed data structures, with the goal of resulting in a “principled way” adaptable to data distributions, user performance needs, and the relative power of modern computing devices (CPU-SIMD/GPU/TPU [W17]). We plan to study how to optimally combine powerful techniques coming from Algorithmics (Compressed and Cache Oblivious data structures), Information Theory (Data Compression) and Artificial Intelligence (Machine Learning and especially Neural Networks) in a sound mathematical way, in order to design a “family of data structures” from which an application will draw, via proper optimization algorithms, the one which satisfies the multicriteria constraints imposed by the specific context of use. Following this approach, the performance of the selected data structure will be bounded above by the worst-case complexity of the known (classic) data structures and have the potential of significantly outperforming it, thanks to the optimization approach and the integration of novel techniques from ML. As it has happened in the development of other related areas, we will concentrate on a paradigmatic field, in our case termed “Multicriteria Indexing and Compression”.
Task outlines and methodology
Here we give further details about our target, by presenting a plan broken down into four Tasks. We plan to focus on a few fundamental problems coming from Data Indexing, Compression and Machine Learning. Thus the methodological tools involved in this project come from Information Theory (more specifically, Data Compression), Machine Learning (i.e. Neural Nets, but not only that) and Algorithms (indexing data structures, mainly for atomic items and strings). In Section 3, we will describe the deliverables of each task, their timeline and how the different research units will cooperate to successfully achieve them. At the end of this section, we will comment on the recognized scientific leadership on the project topics of all four research units.
[T1] Classic Data Structures vs Purely Learned Indexes. In this task, we will address the question of identifying under which conditions learned indexes can outperform classical indexes. To this end, we will formally characterize the time efficiency versus space occupancy of ML models with respects to the distribution of the input data, and we will compare them against the results known for classic and compressed indexes, topic on which the proponents have significantly contributed in the last decade [e.g. FM00,FLMM09,FGMS05,FLMM09,FSV13,FV16]. Methodologically, we will study this question from both a theoretical and an empirical standpoint, thus extending and providing a solid theoretical ground to the preliminary and empirical results of [K17].
[T2] Compressed ML models. Our research, directly or indirectly, would benefit from the availability of space-succinct ML tools that preserve the accuracy guarantees in their inferences. So we will address the problem of how to eliminate the redundancy present in ML models to obtain a space usage reduction, and mathematically estimate the obtained gains. We will study this issue within three different model categories, where the proponents have contributed in last decade [e.g. CBRV12, FBRG13, FBG16]: Language Classifiers, Tree-based classifiers, and Neural Networks. Results in these contexts would have a more general valence in AI and a potentially vaste applicability in real-world contexts where the physical resources (e.g. memory capacity, battery life) are limited. Applications would include, for instance, the prediction of protein families [BY01], the deployment of applications (such as Speech Recognition) on mobiles [KAU12], the offloading of heavy mobile computations to the Cloud in order to increase battery duration [KP17]. To the best of our knowledge, there is only a handful of papers that deal with space-conscious ML tools [e.g. CWZZ17]. So, the theoretic methodology we plan to adopt here, including the Information, Automata and Neural Nets Theory, will be an advancement with respect to the state of the art.
[T3] Multicriteria Data Indexing. In [K17], Kraska et al. proposed an “hybrid” approach, called Recursive Model Index, that mixes the classic B-tree structure with the learned indexes in a way that every B-tree node can choose whether it routes the search via an ML model or a sorted array. The hybrid index was first trained in the top-node, and then recursively trained in the lower nodes, and so on, eventually adopting a grid-search for choosing the best combination of models in B-tree nodes and their parameters. Methodologically, this optimization approach needs to be formalized, analyzed and generalized in the framework of Multicriteria Data Indexing by deriving data structures able to trade, in a principled way, the time efficiency of the query operations with the space occupancy of the index according to the time or space bounds imposed by the application at hand. Clearly, the studies of the previous tasks T1 and T2 are instrumental to this one, as well as the significant experience in the design of compressed indexes by the research team [e.g. FG99,FM00,FLMM09,FSV13,FV16].
[T4] Multicriteria Data Compression. In this task, we study how ML models can be used to design better lossless compressors. The intuition is simple: lossless compressors are structured as a pipeline of transformations of the input data into a form which can be easily modeled and eventually encoded. Efficiency and effectiveness of the overall (de)compression process strongly depends on the modeling stage. Since Recurrent Neural Networks (RNN) are good at capturing long term dependencies [LeCun15] to predict very well the next char/word, it is natural to use them as a modeling tool in lossless compression, possibly in combination with other classic modeling approaches. Recently, a few authors have empirically studied this approach [C16,T17] with promising results on the simple pipeline: RNN + Arithmetic coding. Leveraging on the emerging methodology of Bicriteria data compression and on more sophisticated data transformations based on Lempel-Ziv or BWT paradigms, we plan to investigate this topic by posing the theoretical basis for what we can term Multicriteria Data Compression, eventually taking advantage of the results achieved in tasks T1-T2 and of some of our previous results in the (classic) data compression setting [M01,BFG03,FGMS05,FNV13,FFFV14].
Significance and impact
As stated, the area is new, with a very few contributions scattered in diverse specialties. The development of a new area of Algorithmics is certainly highly significant, even more so when the practical impact of the outcome is expected to be pervasive for several applications, mentioned above: web search, data compression, bioinformatics, cloud, energy efficiency, etc.. This point, within the H2020 pillars, is further elaborated in Section 4. We conclude this section by stating that the proposed research project, because of its high novelty, does not overlap any other project currently pursued by the Research Units.
The team proposing this project has the capabilities to investigate the “principled”, i.e. based on solid theoretical ground, design of the novel framework of Multicriteria Indexes and Compressors. The team members offer a variety of skills and scientific expertises that are suitable to attack the ambitious goals described above and that stand at the crossing point of three fundamental CS fields: Information Theory (Data Compression), Machine Learning, and Algorithms (indexing data structures). Some of the team members (in the units of Piedmont, Pisa and Palermo) have already collaborated on some of those topics successfully, as witnessed by their publications. The presence of the Unit in Milan will bring into this long-time collaboration the skills on AI/ML that are needed to make significant advancements into this new multi-disciplinary research scenario we are foreseeing.
The project asks also for few contracts for young researchers (whose costs adhere to the Italian legislation standards): the extension of the RTD-A of dott. Frasca, who is the coordinator of the Unit in Milan, and 3 two-years Research Grants (assegno di ricerca, in Italian), one for each of the other three Units. The former does not need any comment, given his key role in leading and pursuing his research on the ML-part within Milan’s unit. For the latter three contracts, we specify that the young researchers we will recruit should have a strong background in ML, Information Theory (Data Compression) and Algorithms, and a good experience of SW development. We know that these competences are not easy to be found in current PhDs, but we believe that the ones who will accept the challenges posed by this project will get the opportunity to learn and experience a “toolbox of techniques, competencies and skills” that, at the end of the project, will be successfully spent either in Academia or in Industry, the latter because the large and growing interest of prestigious companies in this subject (such as Google, where the Hybrid Indexes were introduced [K17]).
Finally, we notice that the time dedicated to the project by its members with permanent positions will be superior to the one indicated in the economic part, which relates only to a small co-funding based on the salary of the participants. Moreover, we notice that some PhDs will contribute to the project, and some others will possibly be added to the team in the following three years, given the attracting PhD programs available in the participating sites.
Project development, with identification of the role of each research unit with regards to expected targets, and related modalities of integration and collaboration
In this section we describe the project development by detailing (within the given available space) the four tasks T1-T4, mentioned in Section 2. In particular, we will illustrate their content, deliverables, and timeline. Moreover, we will outline role and modalities of collaboration among the partner units, hereafter abbreviated as UniMI (Milan), UniPA (Palermo), UniPI (Pisa), and UniPO (Piedmont).
The timeline and structure of the deliverables of the four tasks naturally lend themselves to the following organization:
- Each task has one methodological&design deliverable that will span the three years of the project, whose goal is to develop the theoretical grounds of our Multicriteria framework. More precisely, tasks T1 and T2 will be carried out mainly in the first 1.5 years since we need to study and design the “building blocks” that tasks T3 and T4 will use in designing Multicriteria Indexes and Compressors, during the last 1.5 years of the project.
- Each task has also one algorithmic&software-engineering deliverable, whose goal is to implement and experiment software prototypes of some Multicriteria Indexes and Compressors. The deliverables of tasks T1 and T2 will span the first 1.5 years because they will implement the modules that the software deliverables of tasks T3 and T4 will use to implement some Multicriteria Indexes and Compressors in the last 1.5 years of the project.
Modalities of integration and collaboration among the Research Units. Given the interplay among Information Theory, Machine Learning Theory and Algorithmics required by this research, and given the different expertises of the four Research Units, they will be involved synergically in all the deliverables: this collaboration will be “total” for the methodological&design deliverables, and it will be “specific” for the algorithmic&software-engineering deliverables according to the expertises of each Unit.
We will set up research meetings, especially in the first year of the project, to share our own results, expertise and foresights on the tasks/background of the project. We will also have one Workshop per year where we will share the achievements, and invite colleagues and PhD students of the interested Algorithmic/AI/ML areas for feedbacks and for establishing fruitful collaborations.
It is worth pointing out that three of the research units (i.e., Palermo, Piedmont and Pisa) have collaborated successfully for many years, producing top scientific publications. The units in Palermo and Milan have collaborated in the past through workshops and visits. Therefore, there is an excellent starting ground for the very quick establishment of an homogeneous research team.
[T1] Classic (Compressed) Data Structures vs Purely Learned Indexes
In this task, we will theoretically and experimentally address the question of identifying under which conditions Learned Indexes outperform classic and compressed indexes. There are several efficient data structures in addition to the ones considered in [K17] and they should be taken into account for a robust and fair comparison. These premises lead to devise the following two deliverables:
- [D1.1] Characterization of the space-time-accuracy performance of ML models in terms of distribution of the input data, and their comparison against the known (compressed) data structures. We will adopt all the theoretical machinery that is proper of Information Theory, Machine Learning Theory and Algorithmics, looking for the first time at the interplay among those scientific areas within a data-structure perspective. Moreover, we will leverage on the vast amount of knowledge that has been developed in Compressed Data Structures, also by the proponents (as mentioned in Section 2).
- [D1.2] A collection of known and new implementations of ML-based and compressed data structures, to be used in the next tasks T3 and T4 as “building blocks” for our multicriteria framework. The selection will be based on the theoretical characterization offered by D1.1 and on the experiments that we will perform in this deliverable on the well-tuned and robust libraries of compressed data structures, such as SDSL-lite, and of ML models, e.g. WEKA, Keras and OpenNN. Datasets will be synthetic and real, those latter borrowed from Web/text collections (UniPI and UniPO) and BioInformatics (UniMI and UniPA).
[T2] Compressed ML models.
As we mentioned in Section 2, there is only a handful of papers dealing on solid theoretic basis with the goal of obtaining space-conscious ML models. The (simple) existing approaches [BCNM06, LW17] tend to a deterioration in the model accuracy, and/or to an increase of the training and execution time. In this task, we aim at overcoming such limitations.
- [D2.1] Development of compressed ML models. We will supply novel succinct ML models by investigating the trade-off between compressed space, (de)compressed time and the prediction accuracy of the three ML model categories described in Section 2. Namely, we will study (1) how to compress the Language and the Tree-based classifiers by realizing succinct representations of the statistics involved in Probabilistic Suffix Trees [AB04], for instance by exploiting techniques for weighted automata minimization [BGW98] and q-gram compression [P17]; (2) how to compress classification and regression trees by adopting tree-compression schemes, e.g. like those used for Random Forests ensembles [PAIN16]; (3) how to improve known compression techniques for (Deep) Neural Networks (NN), by focusing on the relation among the problem setting (classification, regression, etc.), the NN model and the compression strategy(-ies) adopted. Indeed, current solutions – e.g. network quantization, weight pruning and sharing, knowledge distillation – usually show a loss in the prediction accuracy, and/or an increase of the learning/execution time, while being typically limited to specific settings or NN models [LW17].
- [D2.2] A collection of software prototypes for succinct ML-models, by implementing the ones studied and designed in deliverable D2.1 and experimenting them on few synthetic and real datasets, in order to draw theoretical and practical conclusions to be deployed in the next two tasks. To this end, UniPA, UniPO and UniMI will contribute to implement compressed Language and Tree-based classifiers, experimenting them on datasets in Bioinformatics; UniMI and UniPA will also contribute to implement compressed Neural Network models and, with the help also of UniPI, to their empirical validation on Bioinformatics as well as Textual data.
[T3] Multicriteria Data Indexing
As we mentioned in Section 2, the authors of [K17] proposed a first rough example of multicriteria index based on an “hybrid and heuristic” approach, which combined classic arrays with ML models. Their experiments showed the potentialities of the approach, but clearly missed the comparison against the best known compressed data structures, any principled approach in the index design and, consequently, any mathematical guarantee in its final performance. We wish to benefit of the results in tasks T1 and T2, and of some of our previous results in the context of compressed data structures (mentioned in section 2), to achieve the following two deliverables:
- [D3.1] Methodology and framework for the design, analysis and experimentation of multicriteria indexes. We wish to deliver Multicriteria Indexes that are optimized to choose, among a set of well-studied and tuned “building blocks” (either ML models or compressed data structures), which one to adopt in the index according to the underlying data distribution and to the time or space bounds imposed by the application. It is evident that the studies of the previous tasks T1-T2 are beneficial to define the collection of “building blocks” to select from and for speeding up to the optimization process. We plan to concentrate (at least) on the design of Multicriteria versions of Wavelet Trees and (String) B-trees.
- [D3.2] Implementation of a collection of some multicriteria indexes, which adopt the methodology and results of D3.1 for their design/orchestration, and the software developed in D1.2 and D2.2 as their “building blocks”. In this deliverable, UniPA will mainly contribute to the implementation and experimental test of indexes for BioInformatics; UniPI-UniPO mainly to indexes for Web/Text data collections; and UniMI will support all other Units for the use of ML models. We plan to provide (at least) the implementation of Multicriteria versions of Wavelet Trees and (String) B-trees.
[T4] Multicriteria Data Compression
The theoretical and experimental scenario characterizing the design of Multicriteria compressors, sketched in section 2, leads to devise the following two deliverables:
- [D4.1] Study and design a novel class of ML-based data compressors. The goal is to study first whether Lempel-Ziv or BWT transformations, or newly devised transformations, produce character distributions that could be better approximated via properly designed and space-conscious ML-models, resulting from tasks T1 and T2. And then, design a bi/multi-criteria optimization scheme that effectively (i.e. in de/compression time and space occupancy) combines those modules to achieve better lossless compression.
- [D4.2] Implementation of a collection of some multi-criteria compressors, which adopt the methodology and results of D4.1 for their design, and the software developed in D1.2 and D2.2 as their “building blocks”. In this deliverable, UniMI and UniPA will deal with datasets coming from BioInformatics applications, UniPI and UniPO with Web/text collections. We plan to provide the implementation of Multi-criteria versions of compressors based on LZ-parsing, BWT, Arithmetic or ASN [D13] encoders.
Possibile application potentialities and scientific and/or technological and/or social and/or economic impact of the project
Here we describe the potential applications, scientific/technological advances and economic impact of the project in reference to the EU H2020 main pillars. To this end, it is useful to recap that the target of the project is to design and orchestrate novel compression methods, compressed indexes and space-conscious ML models, within the new framework of multicriteria data structures and algorithms that we foresee to design in this three-years PRIN project.
Scientific Impact [Horizon 2020 Part I: Excellence in the Science Base - Frontier research (ERC) - Future and Emerging Technologies (FET)]
The achievement of our goals will build on the development of new theories on which to base the design and implementation of the foreseen new class of data compressors, indexes, and space-conscious ML models. All our methodological and design deliverables address a few fundamental problems coming from Data Indexing and Compression, that will be experimentally tested on synthetic and real datasets coming from BioInformatics and Web/Text collections. Since the motivation and main field of application is “Big Data”, it is worth recalling that the methodologies for this scenario already developed by our research team are worldwide recognized as excellence in research and have led to rule-changing results: we limit ourselves to mention compressed indexing, textual and multi-dimensional data compressors, and multi-level learning/classification algorithms [FM00,M01,BFG03,FGMS05,DL09,FLMM09]. If past performance is indication of future successful results, our contributions will reinforce European Leadership in the fundamentally important area of the foundations and algorithms for Big Data technologies. Therefore, we believe that placing our contribution in the framework of H2020 Excellence in Science is appropriate. Likewise, as mentioned in previous sections of this proposal, there are only a handful of researchers that have produced results that would belong to this novel area (three of them are in this team) and the proposal is quite innovative, high risk/high benefits and therefore it fully belongs to the H2020 FET Scenario too.
Technical Impact [Horizon 2020 Part II: Creating Industrial Leadership - Leadership in enabling and industrial technologies - ICT]
In terms of algorithm engineering and software development contributions of the project, we aim at implementing a collection of multicriteria indexes, compressors and space-conscious ML-models, that will efficiently cope with the storage, processing and indexing of massive datasets on different types of computing devices – from the Cloud to PCs, from mobiles to wearables – and different kinds of deploying applications and users, thus showing their potentialities in some applicative scenarios coming from e.g. BioInformatics, Data Compression, Search engines and Energy efficiency. Here it suffices to mention that the FM compressed index [FM00], which would be an immediate ancestor of what we envision delivering, is one of the enabling technologies in Life Sciences, with deep impact on sequencing technologies: as witnessed by the fact that it is at the core of the BOWTIE and BWA tools which are nowadays used in many Bio-Med Labs worldwide and whose publications got more than 13k citations. We also believe that “personalization” should be an explicit feature not only of modern ICT applications as a whole, but also of the individual data structures that underlie their design and implementation, given the variety of devices, contexts, user needs and data that those applications have to process. Now, more than ever before, it is crucial to design algorithms and data structures that occupy less space by adapting their structure to input data distributions, satisfy the computational (i.e. space, time, energy, etc.) constraints imposed by their context of use or by the computing devices that execute them, and take eventually into account the needs of users and applications that make use of them. Our multi-criteria indexes and compressors aim at offering an efficient and effective answer to this novel “personalization” need in data structure design.
Cost and fundings
|n°||Sede dell’Unità||Responsabile scientifico||Contributo MIUR per ricerca||Cofinanziamento Ateneo / Ente||Quota premiale (per l’Ateneo del coordinatore)||Costo totale|
|1||Università di PISA||FERRAGINA Paolo||93.127||31.600||15.090||139.817|
|2||Università degli Studi di PALERMO||GIANCARLO Raffaele||93.167||30.000||–||123.167|
|3||Università degli Studi del PIEMONTE ORIENTALE||MANZINI Giovanni||90.666||30.000||–||120.666|
|4||Università degli Studi di MILANO||FRASCA Marco||101.040||33.400||–||134.440|
- [AMS99] N. Alon, et al.. The space complexity of approximating the frequency moments, J. Comput. Syst. Sci, 58, 1999.
- [AV88] A. Aggarwal, J. Vitter. The Input/Output complexity of sorting and related problems. CACM, 31, 1988.
- [A17] G. Anthes. Artificial intelligence Poised to Ride a New Wave. CACM, 7, 2017.
- [AB04] A. Apostolico, G. Bejerano. Optimal amnesic probabilistic automata or how to learn and classify proteins in linear time and space. J. Comput. Biol, 7, 2000.
- [A05] L. Arge, et al.. Cache-oblivious and cache-aware algorithms. Proc Dagstuhl Seminar, 04301, 2005.
- [BCNM06] C. Bucilua, et al. Model compression. Proc ACM KDD 2006.
- [B18] P. Bailis, et al.. Don’t throw out your algorithms book just yet: classical data structures that can outperform learned indexes. Stanford Down, 2018. On http://dawn.cs.stanford.edu/
- [BFG03] A. Buchsbaum, et al.. Improving table compression with combinatorial optimization, J.ACM, 50, 2003.
- [BGW98] A. Buchsbaum, et al.. Shrinking language models by robust approximation. Proc ICASSP, 1998.
- [BY01] G. Bejerano, G. Yona. Variations on Probabilistic Suffix Tree: Statistical modeling and prediction of protein families. Bioinformatics, 17, 2001.
- [CBRV12] N. Cesa Bianchi, et al. Synergy of multi-label hierarchical ensembles, data fusion, and cost-sensitive methods for gene functional inference. Mach. Learn, 88, 2012.
- [CWZZ17] Y. Cheng, et al.. A survey of model compression and acceleration for deep neural networks. arXiv:1710.09282, 2017.
- [C17] G. Cormode. Data sketching. CACM 60, 2017.
- [C16] D. Cox. Syntactically informed text compression with RNN. arXiv:1608:02893, 2016.
- [D13] J. Duda. Asymmetric numeral systems. arXiv:1311.2540, 2013.
- [DL09] G. Lo Bosco, et al.. A multi-layer method to study genome-scale positions of nucleosomes. Genomics, 93, 2009.
- [FBG16] M. Frasca, et al.. Learning node labels in multi-category Hopfield networks. Neural Comput. Appl, 27(6), 2016.
- [FBRG13] M.Frasca, et al.. A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Netw, 43, 2013.
- [FG99] P. Ferragina, et al.. The String B-tree: A new data structure for string search in external memory and its applications. J.ACM, 46, 1999.
- [FFM00] M. Farach-Colton, P. Ferragina, et al.. On the sorting-complexity of suffix tree construction. J. ACM 47, 2000.
- [FFV14] A. Farruggia, P. Ferragina, et al.. Bicriteria Data Compression: Efficient and Usable. Proc ESA 2014.
- [FFFV14] A. Farruggia, P. Ferragina, et al.. Bicriteria data compression. Proc ACM-SIAM SODA 2014.
- [FGM09] P. Ferragina, et al.. The myriad virtues of Wavelet Trees. Inf. Comput, 207(8), 2009.
- [FGMS05] P. Ferragina, et al.. Boosting textual compression in optimal linear time. J. ACM 52, 2005.
- [FLMM09] P. Ferragina, et al.. Compressing and indexing labeled trees, with applications. J. ACM 57, 2009.
- [FM00] P. Ferragina, G. Manzini. Opportunistic Data Structures with Applications. Proc IEEE FOCS 2000. Then appeared as J. ACM 52, 2005. [FNV13] P. Ferragina, et al. On the Bit-Complexity of Lempel-Ziv Compression. SIAM J. Comp., 42, 2013.
- [FSV13] P. Ferragina, et al. Distribution-aware compressed full-text indexes. Algorithmica, 67, 2013.
- [FV16] P. Ferragina, R. Venturini. Compressed Cache-Oblivious String B-Tree. ACM Trans. Algorithms, 12, 2016.
- [FLPR99] M. Frigo, et al.. Cache-oblivious algorithms. Proc IEEE FOCS 1999.
- [KAU12] H. Kim, et al.. Revisiting storage for smartphones. Trans. Storage 8(4), 2012.
- [K17] T. Kraska, et al.. The case of learned index structures. arXiv:1712.01208v2, 2017.
- [LeCun15] Y. LeCun, et al.. Deep learning. Nature, 521, 2015.
- [LW17] W. Liu, et al.. A survey of deep neural network architectures and their applications. Neurocomputing, 234, 2017.
- [M01] G. Manzini. An analysis of the Burrows-Wheeler transform. J.ACM 48, 2001.
- [N16] G. Navarro. Compact data structures: A practical Approach. Cambridge University Press, 2016.
- [N17] T. Neumann. The case for B-Tree index structures. A blog by and for database architects, dec 2017. On http://databasearchitects.blogspot.it/ [OV14] G. Ottaviano, R. Venturini. Partitioned Elias-Fano indexes. Proc ACM SIGIR 2014.
- [PAIN16] A. Painsky, S. Rosset. Compressing Random Forests, Proc IEEE ICDM 2016.
- [P17] G.E. Pibiri, et al. Efficient data structures for massive N-gram datasets. Proc ACM SIGIR 2017.
- [R96] R. Rojas. Neural networks: A systematic introduction. Springer, 1996.
- [T17] K. Tatwawadi. DeepZip: Lossless compression using recurrent networks. https://web.stanford.edu/class/cs224n/reports/2761006.pdf
- [V01] J. Vitter. External memory algorithms and data structures. ACM Comput. Surv, 33, 2001.
- [W17] B. Wang. Moore Law is dead but GPU will get 1000X faster by 2025. June 2017. On https://www.nextbigfuture.com/
Curriculum vitae PI: Ferragina Paolo
Paolo Ferragina is Professor of Computer Science (Algorithmics) at the University of Pisa, where he is also the Director of the PhD program in Computer Science, which is a joint regional effort with Universities of Sienna and Florence. He is currently the delegate for “Research and Innovation” at the Department of Computer Science and he is the leading the Acube Lab, where they design and implement algorithms for Big Data, with several collaborations with companies worldwide: Google, Bloomberg, Yahoo!, ST Microelectronics, Tiscali, ENEL, CERVED. Currently the Acube lab is a member of the EU H2020 Research Infrastructure on Social Big Data (SoBigData.eu, INFRAIA-1-2014-2015, grant agreement #654024).
He is also a member of the Scientific Committee of the Foundation for Innovation and Entrepreneurial Development of “Camera di Commercio” in Pisa, and a technical consultant of the Italian Parliamentary Inquiry Committee on Digitalization and Innovation of the Italian Public Administration.
His promotion to full professor was sponsored by Yahoo! Research, from 2007 to 2011. He taught at the Scuola Normale Superiore (2009-11) and he was one of the scientific coordinators of its research center Signum (2004-07).
From 2010 to 2016 Ferragina was Vice-Rector on “Applied Research and Innovation” at the Univresity of Pisa, in the same period he was the President of the IT Center, which is a competence center about Cloud and HPC for Dell and Intel, Xeonphi Centre for Intel, and recently Transform Data Center immersion for Microsoft. Among the other institutional duties: vice-chairman of the Department of Computer Science; member of the Patent Committee of the University of Pisa, member of the Scientific Committee of the Fondazione Toscana Life Sciences, member of the Advisory Board of Consorzio Pisa Ricerche.
Paolo Ferragina got his Laurea degree (summa cum laude, 1992) and his PhD (1996) in Computer Science from the University of Pisa, and his Post-doc from the Max-Planck Institut fur Informatik (Saarbrucken, 1997-98). From 1998 to 2000, he has been Assistant Professor at the University of Pisa; and from 2000 to 2007, he has been Associate Professor at the same University. He has also spent various periods of research at IBM Research (Rome), AT&T Shannon Lab (NJ), Yahoo! Research (Barcelona), Google (Zurich and NY), University of North Texas, Max Planck Institut fuer Informatik (DE), and Courant Institute at New York University (USA).
His research is mainly devoted to the design, analysis and experimentation of algorithms and data structures for storing, compressing, mining and retrieving information from Big Data. His research results received three US Patents (owned by Lucent, University of Pisa and Rutgers, Yahoo!) and some international awards: “Best Land Transportation Paper Award” from IEEE Vehicular Technology Society (1995); “EATCS Doctoral Dissertation Thesis Award” (1997); “Philip Morris Award on Science and Technology” (1997); “Research Capital award” from the University of Pisa (2002); Yahoo! faculty award (2007-2011); Working Capital Award (2010); three Google research awards (2010, 2012 and 2016) and one Bloomberg Research Grant (2017). Currently, he has three more patents pending in the USA, two owned by Yahoo! and one by NYU. The software system SMAPH won the ERD Challange @SIGIR2014 in the Short Track (query disambiguation).
He has been invited speaker of many international conferences and workshops on Algorithmics; in particular, he was a keynote speaker of CPM ‘04, SPIRE ‘05, ESA/ALGO 2010, SISAP ‘11, and Industral Track of ECIR 2012, DFG 2014 Priority Program 1307 “Algorithm Engineering” (Germany) and the DIITET 2015 National Conference of CNR (Pisa).
He’s serving in the Editor Board of the Journal of Graph Algorithms and Applications (JGAA), and he was in the Steering Committee of the European Symposium on Algorithms (ESA, 2012-2014) and he was one of the Area Editors of the Encyclopedia of Algorithms (Springer, Editor Ming-Yang Kao) for the topics “Data compression, String Algorithms and Data Structures”. He served as (co)editor of special issues in the international journals: Theory of Computing Systems (June 2006), Theoretical Computer Science (November 2007), Information Retrieval (August 2008) and Theoretical Computer Science (November 2009), Algorithmica (2014).
He has also served as member of the Scientific Review Committee for the national project “Algorithms for Big Data (SPP1736)” (Germany), and a reviewer of the Israeli Science Foundation for many years.
He has served as PC member of many International Conferences on Theoretical Computer Science, specifically in the field of Algorithmics. He has been co-chair of International Conference on FUN with Algorithms (2004), DIMACS Workshop on the Burrows-Wheeler Transform (2004), Symposium on String Processing and Information Retrieval (2006), Symposium on Combinatorial Pattern Matching (2008), European Symposium on Algorithms– Algorithm Engineering Track (2012), ACM Conference on Web Search and Data Mining (2013).
He has (co-)authored more than 150 publications on Theoretical Computer Science and Algorithmics. He has also co-authored one Italian book on Cryptography (Bollati Boringhieri, 2001 and 2007; now UniPI Press, 2015), and a book on “Il Pensiero Computazionale: dagli algoritmi al coding”, published by Il Mulino (2017), and this year this book will be published in English by Springer. He has also authored some chapters in books: just to mention the international ones: a chapter on “String search in external memory: Algorithms and data structures” in the Handbook of Computational Molecular Biology (CRC Press, Editor Srinivas Aluru, 2005), and a chapter on “Web Search” in the book On the power of algorithms (Springer, Editors Ausiello-Petreschi, 2013). For an updated list of his publications look at the CS Bibliographic Database, or via Google Scholar.
He has coordinated or participated to many national (MIUR PRIN and FIRB), international (EU and H2020) and company-based projects (like Yahoo, Google, ST Microelectronics, Bloomberg, etc.). For example, he has been the local coordinator of the Unit of University of Pisa participating to two national projects MIUR PRIN 2008 and 2010-11, one MIUR Research Grant (FFO 2016), and one MIUR FIRB Internazionalizzazione (Italy-Israel) 2005-2008. Moreover, he is leading the A3Lab which participates within the University of Pisa unit to the Research Infrastructure “SoBigData: Social Mining & Big Data Ecosystem”, H2020 Program (INFRAIA-1-2014-2015, grant agreement #654024, http://www.sobigdata.eu). He is the PI of a eight years long collaboration with Google Research (Zurich), first on the topic of semantic search engines and tools, and currently on the data compression topic (thanks to three Google Faculty awards, see above). Moreover he is currently leading a Research grant sponsored by Bloomberg.
For an updated CV please have a look at his home page and DBLP/Scopus.
National and international grants as PI
Paolo Ferragina has been PI of:
- Research Project on “Entity salience via sophisticated syntactic and semantic features”, Bloomberg Research Grant 2017 [Budget 65k euro].
- Research Project on “An algorithmic analysis of Brotli, towards personalized data compression”, Google Faculty Research Award 2016 [Budget 90k euro];
- MIUR Research Grant (ex art. 11, FFO 2016) on “Semantic search engine for the research valorization and technology transfer at the University of Pisa” [Budget 30k euro];
- Research Project on “Machine-Learning techniques and models for Big data to improve quality and yield in STM”, ST-Microelectronics, 2015-17 [Budget 32K euro];
- Research Project on “A novel graph for social-network analysis and search built by entity-annotators, and its applications”, Google Faculty Research Award 2012 [Budget 50k euro];
- Research Project on “On-the-fly annotation of short texts (by Wikipedia pages), with applications”, Google Faculty Research Award 2010 [Budget 65k euro];
- UniPI for MIUR PRIN 2010-11 on “ARS TechnoMedia (Algoritmica per le Reti Sociali Tecno-mediate)” [Budget 75k euro];
- Research Project on “Compressed Algorithms and Data Structures”, Yahoo Faculty Award, 2006-2011 [Budget 250k euro];
- UniPI for MIUR PRIN 2008 on “The Mad Web: Models, Algorithms and Data structures for the Web and other behavioral networks” [Budget 11k euro];
- UniPI for MIUR FIRB Intl. Project on “Pattern Discovery in Discrete Structures with Applications to Bioinformatics”, 2005-2008 [Budget 115k euro].
We conclude by stating that he has been:
- Leader of WorkPackages for the EU Research Infrastructure “SoBigData: Social Mining & Big Data Ecosystem” (H2020 agreement #654024), 2015-19 [Total budget 5.9 mln euro, UniPI budget 360k euro]; - PI of several (inter-)national projects related to Technology Transfer and Research Valorization when he was Vice-Rector on “Applied Research and Innovation” of University of Pisa.
National and international acknowledgments
Paolo Ferragina has got the following (inter-)national acknowledgments:
- Bloomberg Research award within the Data Science Research Grant Program 2017 for the project “Entity Salience with sophisticated syntactic and semantic features”; - Google Faculty Research Award 2016 for the project “An algorithmic analysis of Brotli, towards Personalized Data Compression”;
- Software System Award. The software system SMAPH (published in WWW ‘15) won the query-understanding track of the Entity Recognition and Disambiguation Challenge within the ACM SIGIR 2014 Conference;
- Google Faculty Research Award 2012 for the project “A novel graph for social-network analysis and search built by entity-annotators, and its applications”; - Google Faculty Research Award 2010 for the project “On-the-fly annotation of short texts (by Wikipedia pages), with applications”;
- Yahoo! Faculty Research Award in the five years 2007-2011 for the project “Compressed algorithms and data structures”;
- Working Capital Award from Telecom Italia in 2010;
- “Research Capital award” from the University of Pisa in 2002 for the design of the FM-index (IEEE FOCS 2000 and J.ACM 2005); - EATCS Doctoral Dissertation Thesis Award in 1997;
- “Philip Morris Award on Science and Technology” for his PhD Thesis, in 1997;
- “Best Land Transportation Paper Award” from IEEE Vehicular Technology Society in 1995;
- Keynote speaker of several International Conferences: CPM ‘04, SPIRE ‘05, ESA/ALGO 2010, SISAP ‘11, ECIR 2012, DFG Priority Program 1307 “Algorithm Engineering” (Germany); - Invited to join the Editorial board of the Journal on Graph Algorithms and Applications, since 2011;
- Invited to join the Editorial Board of two Springer Encyclopedias on Algorithms (since 2008) and on Big Data Technologies (since 2017);
Principal scientific publications of PI
- Ferragina Paolo, Venturini Rossano (2016). Compressed cache-oblivious string B-tree. ACM TRANSACTIONS ON ALGORITHMS, vol. 12, p. 1-17, ISSN: 1549-6325, doi: 10.1145/2903141 - Articolo in rivista
- Ferragina Paolo, Nitto Igor, VENTURINI R (2013). On the bit-complexity of Lempel-Ziv compression. SIAM JOURNAL ON COMPUTING, vol. 42, p. 1521-1541, ISSN: 0097-5397, doi: 10.1137/120869511 - Articolo in rivista
- P. Ferragina, J. Siren, R. Venturini (2013). Distribution-aware compressed full-text indexes. ALGORITHMICA, vol. 67, p. 529-546, ISSN: 0178-4617, doi: 10.1007/s00453-013-9782-3 - Articolo in rivista
- Paolo Ferragina (2013). On the weak prefix-search problem. THEORETICAL COMPUTER SCIENCE, vol. 483, p. 75-84, ISSN: 0304-3975, doi: 10.1016/j.tcs.2012.06.011 - Articolo in rivista
- Ferragina P, Scaiella U (2012). Fast and accurate annotation of short texts with Wikipedia pages. IEEE SOFTWARE, vol. 29, p. 70-75, ISSN: 0740-7459, doi: 10.1109/MS.2011.122 - Articolo in rivista
- Paolo Ferragina, Travis Gagie, Giovanni Manzini (2012). Lightweight Data Indexing and Compression in External Memory. ALGORITHMICA, vol. 63, p. 707-730, ISSN: 0178-4617, doi: 10.1007/s00453-011-9535-0 - Articolo in rivista
- Ferragina P, Nitto I, Venturini R (2011). On optimally partitioning a text to improve its compression. ALGORITHMICA, vol. 61, p. 51-74, ISSN: 0178-4617, doi: 10.1007/s00453-010-9437-6 - Articolo in rivista
- FERRAGINA P, LUCCIO F, MANZINI G, MUTHUKRISHNAN S (2010). Compressing and Indexing Labeled Trees, with Applications. JOURNAL OF THE ASSOCIATION FOR COMPUTING MACHINERY, vol. 57 (1), p. 4:1-4:33, ISSN: 0004-5411 - Articolo in rivista
- FERRAGINA P, NITTO I, VENTURINI R (2010). On compact representations of All-Pairs-Shortest-Path-Distance matrices. THEORETICAL COMPUTER SCIENCE, vol. 411, p. 3293-3300, ISSN: 0304-3975, doi: 10.1016/j.tcs.2010.05.021 - Articolo in rivista
- FERRAGINA P, VENTURINI R (2010). The compressed permuterm index. ACM TRANSACTIONS ON ALGORITHMS, vol. 7, p. 10:1-10:21, ISSN: 1549-6325, doi: 10.1145/1868237.1868248 - Articolo in rivista
- FERRAGINA P, GIANCARLO R, MANZINI G (2009). The myriad virtues of wavelet trees. INFORMATION AND COMPUTATION, vol. 207, p. 849-866, ISSN: 0890-5401, doi: 10.1016/j.ic.2008.12.010 - Articolo in rivista
- FERRAGINA P, LUCCIO F, MANZINI G, MUTHUKRISHNAN S (2009). Compressing and indexing labeled trees, with applications. JOURNAL OF THE ASSOCIATION FOR COMPUTING MACHINERY, vol. 57, p. 4:1-4:33, ISSN: 0004-5411, doi: 10.1145/1613676.1613680 - Articolo in rivista
- FERRAGINA P, R. GONZALEZ, G. NAVARRO, R. VENTURINI (2008). Compressed Text Indexes: From Theory to Practice. ACM JOURNAL OF EXPERIMENTAL ALGORITHMICS, vol. 13, p. 1-31, ISSN: 1084-6654, doi: 10.1145/1412228.1455268 - Articolo in rivista
- Cornolti Marco, Ferragina Paolo, Ciaramita Massimiliano, Rüd Stefan, Schütze Hinrich (2016). A piggyback system for joint entity mention detection and linking in web queries. In: Proceedings of the 25th International Conference on World Wide Web (WWW). vol. 25, p. 567-578, ACM, ISBN: 978-1-4503-4143-1, Montreal (Canada), April 11-15, 2016, doi: 10.1145/2872427.2883061 - Contributo in Atti di convegno
- Ferragina Paolo, Piccinno Francesco, Venturini Rossano (2015). Compressed Indexes for String Searching in Labeled Graphs. In: Proceedings of the 24th International Conference on World Wide Web. p. 322-332, Geneva:International World Wide Web Conferences Steering Committee, ISBN: 978-1-4503-3469-3, Firenze, 18-22 May 2015, doi: 10.1145/2736277.2741140 - Contributo in Atti di convegno
- Andrea Farruggia, Paolo Ferragina, Rossano Venturini (2014). Bicriteria Data Compression: Efficient and Usable. In: Algorithms - ESA 2014 22th Annual European Symposium, Wroclaw, Poland, September 8-10, 2014. Proceedings. LECTURE NOTES IN COMPUTER SCIENCE, vol. 8737, p. 406-417, Dordrecht:Springer Verlag, ISBN: 9783662447765, ISSN: 1611-3349, Wroclaw, Poland, 08/09/2014-10/09/2014, doi: 10.1007/978-3-662-44777-2_34 - Contributo in Atti di convegno
- Farruggia A., Ferragina P., Frangioni A., Venturini R. (2014). Bicriteria data compression. In: Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2014, Portland, Oregon, USA, January 5-7, 2014. p. 1582-1595, Philadelphia, PA, USA:Society for Industrial and Applied Mathematics (SIAM), ISBN: 9781611973389, Portland, Oregon, USA, 05/01/2014-07/01/2014, doi: 10.1137/1.9781611973402.115 - Contributo in Atti di convegno
- FERRAGINA P (2010). Data Structures: Time, I/Os, Entropy, Joules!. In: Proceedings of the 18th Annual European Symposium on Algorithms. p. 1-16, Springer, ISBN: 9783642157806, Liverpool (UK), doi: 10.1007/978-3-642-15781-3_1 - Contributo in Atti di convegno
- FERRAGINA P, MANZINI G (2010). On compressing the textual web. In: Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM). p. 391-400, ACM PRESS, ISBN: 9781605588896, New York (USA)., doi: 10.1145/1718487.1718536 - Contributo in Atti di convegno
- FERRAGINA P, GROSSI R, GUPTA A, SHAH R, VITTER J.S (2008). On Searching Compressed String Collections Cache-Obliviously. In: ACM Conference on Principles of DataBase Systems (PODS). vol. 27, p. 181-190, ACM Press, ISBN: 9781605581088, Vancouver, Canada., doi: 10.1145/1376916.1376943 - Contributo in Atti di convegno
Associated investigators’ Curriculum Vitae
Current Position: Full Professor of Computer Science, Faculty of Science, Universita di Palermo. He obtained a Ph.D. in Computer Science from Columbia University in 1990 defending one of the first Ph.D. Thesis about Algorithms and Computational Biology. He was awarded several fellowships, among which the AT&T Bell Labs Post-Doctoral Fellowship in Mathematical and Information Sciences and a CNRS Visiting Scientist Grant. He has also been Invited Keynote Speaker to several conferences and summer schools, including SIAM International Conference in Applied Mathematics and has held several visiting scientist positions with many research labs and Universities both in USA and Europe such as AT&T Shannon Laboratories and Max Plank Institure for Molecular Genetics- Bioinformatics Section, Berlin, Computer Science Haifa University. Moreover, he has served either as Chairman or as Member of Scientific Commettees in several conferences directly related to the theme of the project, such as Workshop on Algorithms in Bioinformatics, Combinatorial Pattern Matching and String Processingf and Information Retrieval. He is currently on the Editorial Board of Theoreticl Computer Science, Algorithms for Molecular Biology, BMC Bioinformatics an BMC Research Notes. He has also been the principal investigator of several Italian Ministry of Education research projects in Bioinformatics and one CNRS Grant. Morebver, he has been a reviewer for most of the best established Journals and Conferences in Computational Biology and Theoretical Computer Science. In addition, he has also been reviewer for severan National Granting Agencies, including US NSF and the Royal Society. He is also regularly consulted by Universities nationally and internationally in orer to assess Faculty promotions. As for involvement in National and Local Higher Edication Structures, he has been President of the Computer Science Curricula, University of Palermo and Member of the Italian Computer Science Curricula Commission of the Italian Association of Computer Science Reserachers (GRIN). He is currently on the Board of Directors of the CINI Consortiujm, that represents all of the academic competences in Computer Science and Engineering present in Italy.
Scientific Results: Specialist of Design and Analysis of combinatorial algorithms, with particular enphasis on string algorithms, ranging from Bioinformatics to Data Compression. Great interest and many contributions to Data Structures and Automata Theory, with applications to Bioinformatics. The scientific production consists of more than 60 papers appeared in established journals and conferences. Moreover, he is coauthor of many patents, granted by the US Patent Office, in information retrieval.
Giovanni Manzini received his PhD in Mathematics from the Scuola Normale Superiore of Pisa (Italy). He has been Visiting Scientist at the Laboratory for Information and Decisions Systems of the Massachusetts Institute of Technology, and Visiting Professor at the Department of Computer Science of the Johns Hopkins University and at the Department of Computing and Information Systems of the University of Melbourne. Currently he is Professor of Computer Science with the University of Piemonte Orientale.
The research area of Giovanni Manzini is the design, analysis and engineering of algorithms. In the last fifteen years he has been mainly working on Data Compression and Indexing Data Structures for textual and biological massive data sets. This study has produced profound theoretical results that have originated a new family of tools, called compressed indices. Compressed indices support the efficient retrieval of information using a representation of the input data that is compressed in the sense that it uses space proportional to the entropy of the data.
Giovanni Manzini has been the co-author of the historically first among such tools, which is now called the FM-index. The original paper on the FM-index has more than 900 citations on Google Scholar; because of its simplicity and efficiency this data structure is still the most widely used compressed index and is at the core of many bioinformatics tools, such as BOWTIE, BWA, SOAP, the String Graph Assembler, etc.
On the topic of compressed indices Giovanni Manzini has co-authored more than 40 refereed publications and has supervised several software projects such as: the Lightweight Suffix Array and BWT Construction Algorithm ( http://people.unipmn.it/manzini/lightweight/), the Compression Boosting Library ( http://people.unipmn.it/manzini/boosting/), the BWTDISK library ( http://people.unipmn.it/manzini/bwtdisk), and the Gap library for building indices of documents collections ( http://people.unipmn.it/manzini/gap/). This research has received a “Research Capital” award from the University of Pisa in 2002, and has been granted an US patent (#8,156,156).
Giovanni Manzini has participated to many national research projects in Mathematics and Computer Science, and he has been principal investigator for an international FIRB project. He has been a member in several PhD committees and reviwer for National Granting Agencies.
Giovanni Manzini has been invited speaker for the 28th Combinatorial Pattern Matching Conference (CPM ‘17), the 2nd Workshop on Compression Text and Algorithms (WCTA ‘11), the 20th International Workshop on Combinatorial Algorithms (IWOCA ‘09) and the 24th International Symposium on Mathematical Foundations of Computer Science (MFCS ‘99). He has co-chaired the Program Committee of the conference CPM ‘11 and has served as member of the Program Committee of the conferences SPIRE ‘17, CPM ‘16, IWOCA ‘15, WABI ‘14, AlCoB ‘14, WABI ‘13, LATIN ‘12, SPIRE ‘10, SPIRE ‘08, SPIRE ‘07, ICTCS ‘07, SPIRE ‘06, CPM ‘05, FUN ‘04. He has been co-organizer of the DIMACS Workshop: The Burrows-Wheeler Transform: Ten Years Later and Guest Editor for two Special Issues of the journal Theoretical Computer Science devoted to the Burrows-Wheeler Transform, and to Combinatorial Pattern Matching.
As of February 2018 the Scopus bibliographic database contains 85 documents authored by Giovanni Manzini. These documents have received 2566 total citations by 1631 distinct documents. Giovanni Manzini’s H-index computed using the Scopus database is currently 25.
Assistant professor (RTD A, three years contract expiring on April 2020), Department of Computer Science, University of Milan, Italy
- May 2017 - Today. 3 years position as Assistant Professor at Department of Computer Science, University of Milan.
- April 2013 - March 2017. 4 years post-doc research grant at Department of Computer Science, University of Milan, project title “Graph based neural network algorithms for the analysis of biological networks”
- June 2012 - March 2013. 1 year post-doc research grant at Department of Life Sciences of University of Milan, project title “Development of algorithms and software tools for the analysis of chromatin role in eukaryotic genome”
- January 2009 - December 2011. 3 years Ph.D. in Computer Science with grant at Department of Computer Science, University of Milan
- January 2006 - September 2006. Employer of Texa S.P.A. company, Parma, Italy, with endless contract position as software programmer
- January 2009 - December 2011. Ph.D. student in the Ph.D. course in Computer Science, with 3 years grant, at Department of Computer Science, University of Milan. Title of Ph.D. thesis: “Graph-based approaches for imbalanced data in functional genomics” Supervisor: Prof. Alberto Bertoni Co-supervisor: Prof. Giorgio Valentini
- October 2006 - May 2008. Qualification course, with 1 year grant, for teaching mathematics at high school, gained at University of Salerno, Italy. Final evaluation: 80/80
- May 2005. Master degree (five years course) in Computer Science at University of Salerno, Italy. Thesis title: “L’operazione di composizione nella famiglia dei codici massimali prefissi” Supervisor: Prof.ssa Clelia De Felice Final evaluation: 110/110 cum laude
Research visits abroad
- September 2017. Research visit at the Department of Medicine, Division of Neurology, Tanz Centre for Research in Neurodegenerative Diseases, University of Toronto, to work on the disease-gene prioritization for the Alzheimer’s disease (AD) with Prof. Ming Zhang, exploiting the novel approach to embed in the computational core information coming from diseases sharing genetic characteristics with AD
- September - October 2016. Research internships at the Institute of Molecular Biology (IMB) of Johannes Gutenberg University of Mainz, for a collaboration with the Computational Biology and Data Mining Group, directed by Prof. Miguel Andrade Navarro, to work on the design and development of computational methodologies to find the most informative data sources for specific human genetic disorders, with the end of determining their causal genes
- September - October 2014. Visiting Researcher at the Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Canada, to collaborate with Prof. Quaid Morris, head of laboratory, on the analysis of computational methodologies to eliminate hierarchical inconsistencies of label biases in label propagation algorithms when classes are structured as directed acyclic graphs, in the context of Life Sciences
- “Machine Learning for Bioinformatics and Personalized Medicine: a survey of my research activity at the Computer Science Dept UNIMI”, Department of Computer Science, University of Milan, 20 April 2017
- “Neural algorithms based on graphs for the analysis of biological networks”, Department of Computer Science, University of Milan, 09 April 2015
Awards and Acknowledgments
- From April 2013. 4 years research grant as post-doc research fellow at the Department of Computer Science, University of Milan, Italy
- June 2012 - March 2013. 1 year research grant as post-doc research fellow at the Department of Life Sciences of University of Milan
- January 2009 - December 2011. 3 years research grant from the Italian national funding for Ph.D. courses at the Department of Computer Science, University of Milan
- January 2007- December 2007. 1 year grant for promising teachers of mathematics at high school for the attendance of the two years course for teaching qualification at University of Salerno, Italy
Reviewer activity for international projects/organizations
- The Natural Sciences and Engineering Research Council of Canada (NSERC), in the area Discovery Grant proposal
- The Innovational Research Incentives Scheme (VIDI programme), for projects presented by young researchers (3-8 years from Ph.D.) of the Netherlands Organisation for Scientific Research (NWO, the Dutch Research Council)
Participation in Research Projects
He is responsible of the following project:
Title: Machine learning algorithms to handle label imbalance in biomedical taxonomies, Piano di Sostegno alla Ricerca 2015-2017 Starting date: 01.01.18
Duration: 1 year
He participated/participates in the following research projects:
- Title: “A predictive model or response to FINGOLIMOD: integration of clinics, neuroradiology and genetics”. Financing body: Fondazione Centro San Raffaele, Duration: May-September 2016
- TItle: “Discovering Patterns in Multi-Dimensional Data”. Year 2016. Coordinator: Prof. Goffredo Haus. Financing body: University of Milan
- Title: “Graph-based methodologies for the automated inference in bio-medical ontologies”. Year 2016. Coordinator: Prof. Simone Bassis. Financing body: University of Milan
- Flagship project EPIGEN, May 2012, March 2013. Research project title: “Sviluppo di algoritmi e software per l’analisi dello stato della cromatina in genomi di eucarioti”. Financing body: Consiglio Nazionale delle Ricerche (CNR)
Design and analysis of new machine learning methods mainly based on neural networks, with applications in bioinformatics and computational biology and medicine. His Computer Science background, and the interest for Life Sciences branches fostered searching novel research lines as bridge between these two main areas.
He contributed to consolidate the application of neural networks to classification and ranking problems with the development of new models of parametric Hopfield networks, whose implementations are available at public software repositories. In particular, the candidate developed methods suitable for classification problems characterized by unbalanced data, that is where one class is the large minority, like it happens in many real world problems in computational biology and medicine, like the gene function prediction, drug repositioning, gene-disease prioritization, human abnormal phenotypes prediction problems.
Furthermore, his research in classification problems where classes are structured as hierarchies with parent-child constraints, lead to the introduction of a novel paradigm for multi-task label propagation, where multiple tasks are predicted at the same time by by exploiting the dissimilarities (rather than similarities, as usually done) with the other tasks, allowing learning faster and more accurately than the case task are learned independently.
Within the candidate’s research activity it is possible to distinguish two main areas, machine learning and bioinformatics, of which just a sketch is given for lack of room.
1. Machine Learning
1.1 Development of neural network-based binary classifiers for highly unbalanced data
In semi-supervised label prediction in partially labeled graphs, nodes represent the entities to be classified and edges the similarities among nodes, whereas a labeling is known just for a subset of nodes. The goal is to extend the labeling to the whole graph. A particular focus is moved on problems where node labels are unbalanced, that is one class (usually the positive one) is highly under-represented that the other one, since several real-world problems posses this characteristic. The labeling imbalance must be handle to avoid a noticeable decay in the classification performance. The problem was solved by introducing a novel family of parametric Hopfield neural networks, whose parameters are the neuron activation values and thresholds, and which, by conceptually separating the neuron labels and the neurons activation values, allows to handle the label imbalance assigning activation values so as to counterbalance the predominance of negative neurons. Results in real-world classification problems showed the model is competitive with the state-of-the-art. Moreover, a more general model of Hopfield neural networks was designed, in which the input instances are assumed being partitioned into multiple categories determined according to intrinsic properties the instances possess, independently from their membership to the class being predicted, as it happens in many real-world problems, such as the multi-species gene function prediction, in which nodes in the graph can be partitioned according to the species they belong to. The model showed strong predictive capability in unbalanced problems, outperforming the state-of-the-art methodologies.
1.2 Automated integration of heterogeneous data sources
1.3 Selection of negative examples in Positive-Unlabeled classification problems
1.4 Multitask algorithms for hierarchical classification
1.5 Development of software machine learning libraries
The candidate’s research activity in bioinformatics involved the study and application of machine learning algorithms for knowledge retrieval from bio-molecular data generated by high-throughput technologies. It is possible to distinguish the following lines:
2.1 Functional classification of genes and proteins
2.2 Integration of multiple and heterogeneous data sources for the gene/protein function prediction 2.3 Gene-disease prioritization for genetic human disorders
2.4 Multi-species protein function prediction
2.5 Epigenetic analysis of gene expression levels
2.6 Biological motivated modelling of gene expression profiles
2.7 Drug repositioning methods to determine novel therapeutic applications of existing drugs
Principal scientific publications of associated investigators
- Cattaneo, Giuseppe, Petrillo, Umberto Ferraro, Giancarlo, Raffaele, Roscigno, Gianluca (2017). An effective extension of the applicability of alignment-free biological sequence comparison algorithms with Hadoop. THE JOURNAL OF SUPERCOMPUTING, vol. 73, p. 1467-1483, ISSN: 0920-8542, doi: 10.1007/s11227-016-1835-3 - Articolo in rivista
- Petrillo, Umberto Ferraro, Roscigno, Gianluca, Cattaneo, Giuseppe, Giancarlo, Raffaele (2017). FASTdoop: A versatile and efficient library for the input of FASTA and FASTQ files for MapReduce Hadoop bioinformatics applications. BIOINFORMATICS, vol. 33, p. 1575-1577, ISSN: 1367-4803, doi: 10.1093/bioinformatics/btx010 - Articolo in rivista
- Utro, F., Di Benedetto, V., Corona, D., GIANCARLO, Raffaele (2016). The intrinsic combinatorial organization and information theoretic content of a sequence are correlated to the DNA encoded nucleosome organization of eukaryotic genomes. BIOINFORMATICS, vol. 32, p. 835-842, ISSN: 1367-4803, doi: 10.1093/bioinformatics/btv679 - Articolo in rivista
- GIANCARLO, Raffaele, D. Scaturro, F. Utro (2015). ValWorkBench: An open source Java library for cluster validation, with applications to microarray data analysis.. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE, vol. 118, p. 207-217, ISSN: 0169-2607, doi: 10.1016/j.cmpb.2014.12.004 - Articolo in rivista
- Giancarlo R, Lo Bosco G, Utro F (2015). Bayesian versus data driven model selection for microarray data. NATURAL COMPUTING, vol. 14(3), p. 393-402, ISSN: 1567-7818, doi: 10.1007/s11047-014-9446-5 - Articolo in rivista
- Giancarlo Raffaele, Rombo Simona Ester, Utro Filippo (2015). Epigenomic k-mer dictionaries: shedding light on how sequence composition influences in vivo nucleosome positioning.. BIOINFORMATICS, vol. 31, p. 2939-2946, ISSN: 1367-4803, doi: 10.1093/bioinformatics/btv295 - Articolo in rivista
- Giancarlo R, Rombo SE, Utro F (2014). Compressive biological sequence analysis and archival in the era of high-throughput sequencing technologies. BRIEFINGS IN BIOINFORMATICS, vol. 3, p. 390-406, ISSN: 1467-5463, doi: 10.1093/bib/bbt088 - Articolo in rivista
- GIANCARLO R, LO BOSCO G, PINELLO L, UTRO F (2013). A methodology to assess the intrinsic discriminative ability of a distance function and its interplay with clustering algorithms for microarray data analysis. BMC BIOINFORMATICS, vol. 14(Suppl 1):S6, ISSN: 1471-2105, doi: 10.1186/1471-2105-14-S1-S6 - Articolo in rivista
- Giancarlo R, Scaturro D, Utro F (2012). Textual data compression in computational biology: Algorithmic techniques.. COMPUTER SCIENCE REVIEW, vol. 6, p. 1-25, ISSN: 1574-0137, doi: 10.1016/j.cosrev.2011.11.001 - Articolo in rivista
- Giancarlo R, Utro F (2012). Algorithmic paradigms for stability-based cluster validity and model selection statistical methods, with applications to microarray data analysis.. THEORETICAL COMPUTER SCIENCE, vol. 428, p. 58-79, ISSN: 0304-3975, doi: 10.1016/j.tcs.2012.01.024 - Articolo in rivista
- Giancarlo R, Utro F (2011). Speeding up the Consensus Clustering methodology for microarray data analysis. ALGORITHMS FOR MOLECULAR BIOLOGY, vol. 6, ISSN: 1748-7188, doi: 10.1186/1748-7188-6-1 - Articolo in rivista
- Ferragina, P, GIANCARLO, Raffaele, Manzini, G. (2009). The Myriad Virtes of Wavelet Trees. INFORMATION AND COMPUTATION, vol. 207, p. 849-866, ISSN: 0890-5401 - Articolo in rivista
- GIANCARLO, Raffaele, Scaturro, D, UTRO, Filippo (2009). Textual Data Compression In Computational Biology: A synopsis. BIOINFORMATICS, vol. 2009, p. 1575-1586, ISSN: 1367-4803, doi: 10.1093/bioinformatics/btp117 - Articolo in rivista
- AL BUCHSBAUM, GIANCARLO R, B RAKZ (2008). New Results in Finding Common Neighborhoods in Massive Graphs in the Data Streaming Model. THEORETICAL COMPUTER SCIENCE, vol. 407, p. 302-309, ISSN: 0304-3975, doi: 10.1016/j.tcs.2008.06.056 - Articolo in rivista
- Apostolico A, Giancarlo R (2008). Periodicity and repetitions in parameterized strings. DISCRETE APPLIED MATHEMATICS, vol. 156, p. 1389-1398, ISSN: 0166-218X, doi: 10.1016/j.dam.2006.11.017 - Articolo in rivista
- GIANCARLO R, SCATURRO D, UTRO F (2008). A Tutorial on Computational Cluster Analysis with Applications to Pattern Discovery in Microarray Data. MATHEMATICS IN COMPUTER SCIENCE, ISSN: 1661-8270, doi: 10.1007/s11786-007-0025-3 - Articolo in rivista
- GIANCARLO, Raffaele, Scaturro, D, UTRO, Filippo (2008). Computational cluster validation for microarray data analysis: experimental assessment of Clest, Consensus Clustering, Figure of Merit, Gap Statistics and Model Explorer. BMC BIOINFORMATICS, vol. 2008-10-29, ISSN: 1471-2105, doi: 10.1186/1471-2105-9-462 - Articolo in rivista
- Giancarlo R, Lo Bosco G, Pinello L, Utro F (2013). The Three Steps of Clustering in the Post-Genomic Era. In: Biological Knowledge Discovery Handbook. p. 521-556, John Wiley and Sons, ISBN: 978-1-118-13273-9, doi: 10.1002/9781118617151.ch22 - Contributo in volume (Capitolo o Saggio)
- EPIFANIO C, GABRIELE A, GIANCARLO R, Sciortino M (2011). Novel Combinatorial and Information Theoretic Alignment Free Distances for Biological Data Mining. In: (a cura di): Elloumi M;Zomaya A.Y., Algorithms in Computational Molecular Biology: Techniques, Approaches and Applications. WILEY SERIES IN BIOINFORMATICS, vol. 2011, Wiley Interscience, ISBN: 0470505192, doi: 10.1002/9780470892107.ch16 - Contributo in volume (Capitolo o Saggio)
- Giancarlo R, Scaturro D, Utro F (2009). Statistical Indexes for Computational and Data Driven Class Discovery in Microarray Data. In: (a cura di): Jake Y. Chen Stefano Lonardi, Biological Data Mining. p. 295-336, Taylor and Francis-CRC - Contributo in volume (Capitolo o Saggio)
- Gagie, Travis, MANZINI, Giovanni, Sirén, Jouni (2017). Wheeler graphs: A framework for BWT-based data structures. THEORETICAL COMPUTER SCIENCE, vol. 698, p. 67-78, ISSN: 0304-3975, doi: 10.1016/j.tcs.2017.06.016 - Articolo in rivista
- Gagie, Travis, MANZINI, Giovanni, Valenzuela, Daniel (2017). Compressed Spaced Suffix Arrays. MATHEMATICS IN COMPUTER SCIENCE, vol. 11, p. 151-157, ISSN: 1661-8270, doi: 10.1007/s11786-016-0283-z - Articolo in rivista
- Egidi Lavinia, Manzini Giovanni (2015). Multiple seeds sensitivity using a single seed with threshold. JOURNAL OF BIOINFORMATICS AND COMPUTATIONAL BIOLOGY, vol. 13, ISSN: 0219-7200, doi: 10.1142/S0219720015500110 - Articolo in rivista
- Egidi Lavinia, Manzini Giovanni (2014). Design and analysis of periodic multiple seeds. THEORETICAL COMPUTER SCIENCE, vol. 522, ISSN: 0304-3975, doi: 10.1016/j.tcs.2013.12.007 - Articolo in rivista
- Egidi Lavinia, Manzini Giovanni (2014). Spaced seed design using perfect rulers. FUNDAMENTA INFORMATICAE, vol. 131, ISSN: 0169-2968, doi: 10.3233/FI-2014-1009 - Articolo in rivista
- Egidi L, Manzini G (2013). Better Spaced Seeds Using Quadratic Residues. JOURNAL OF COMPUTER AND SYSTEM SCIENCES, vol. 79, p. 1144-1155, ISSN: 0022-0000, doi: 10.1016/j.jcss.2013.03.002 - Articolo in rivista
- Ferragina Paolo, Gagie Travis, Manzini Giovanni (2012). Lightweight Data Indexing and Compression in External Memory. ALGORITHMICA, vol. 63, p. 707-730, ISSN: 0178-4617, doi: 10.1007/s00453-011-9535-0 - Articolo in rivista
- Gagie Travis, Manzini Giovanni (2010). Move-to-Front, Distance Coding, and Inversion Frequencies Revisited. THEORETICAL COMPUTER SCIENCE, vol. 411, p. 2925-2944, ISSN: 0304-3975, doi: 10.1016/j.tcs.2010.04.024 - Articolo in rivista
- Ferragina P, Luccio F, Manzini G, Muthukrishnan S. (2009). Compressing and Indexing Labeled Trees, with Applications. JOURNAL OF THE ASSOCIATION FOR COMPUTING MACHINERY, vol. 57, ISSN: 0004-5411, doi: 10.1145/1613676.1613680 - Articolo in rivista
- Ferragina Paolo, Giancarlo Raffaele, Manzini Giovanni (2009). The Myriad Virtues of Wavelet Trees. INFORMATION AND COMPUTATION, vol. 207, p. 849-866, ISSN: 0890-5401, doi: 10.1016/j.ic.2008.12.010 - Articolo in rivista
- Decaroli Gianni, Gagie Travis, Manzini Giovanni (2017). A Compact Index for Order-Preserving Pattern Matching. In: Proc. IEEE 2017 Data Compression Conference. DATA COMPRESSION CONFERENCE, p. 72-81, LOS ALAMITOS:IEEE Computer Society Press, ISBN: 978-150906721-3, ISSN: 1068-0314, Snowbird, USA, 4-7 April 2017, doi: 10.1109/DCC.2017.35 - Contributo in Atti di convegno
- EGIDI, Lavinia, MANZINI, Giovanni (2017). Lightweight BWT and LCP merging via the gap algorithm. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). vol. 10508, p. 176-190, Springer Verlag, ISBN: 9783319674278, Palermo (IT), 2017, doi: 10.1007/978-3-319-67428-5_15 - Contributo in Atti di convegno
- Gagie, Travis, MANZINI, Giovanni, Venturini, Rossano (2017). An encoding for order-preserving matching. In: 25th Annual European Symposium on Algorithms (ESA 2017). vol. 87, p. 1-15, Dagstuhl:Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing, ISBN: 978-3-95977-049-1, Vienna (AT), 2017, doi: 10.4230/LIPIcs.ESA.2017.38 - Contributo in Atti di convegno
- Fariña Antonio, Gagie Travis, Manzini Giovanni, Navarro Gonzalo, Ordóñez Alberto (2016). Efficient and Compact Representations of Some Non-Canonical Prefix-Free Codes. In: Proc. 23rd International Symposium on String Processing and Information Retrieval. LECTURE NOTES IN COMPUTER SCIENCE, vol. 9954, p. 50-60, NEW YORK:Springer, ISBN: 978-3-319-46048-2, ISSN: 0302-9743, Beppu, Japan, 2016, doi: 10.1007/978-3-319-46049-9_5
- Manzini Giovanni (2016). XBWT Tricks. In: Proc. 23rd International Symposium on String Processing and Information Retrieval. LECTURE NOTES IN COMPUTER SCIENCE, vol. 9954, p. 80-92, NEW YORK:Springer, ISBN: 978-3-319-46048-2, ISSN: 0302-9743, Beppu, Japan, 2016, doi: 10.1007/978-3-319-46049-9_8 - Contributo in Atti di convegno
- Boucher Christina, Bowe Alexander, Gagie Travis, Manzini Giovanni, Sirén Jouni (2015). Relative Select. In: String Processing and Information Retrieval. LECTURE NOTES IN COMPUTER SCIENCE, vol. 9309, p. 149-155, ISBN: 978-3-319-23825-8, ISSN: 0302-9743, London, September 2015, doi: 10.1007/978-3-319-23826-5_15 - Contributo in Atti di convegno
- Belazzougui Djamal, Gagie Travis, Gog Simon, Manzini Giovanni, Sirén Jouni (2014). Relative FM-Indexes. In: Proc. 21st International Symposium on String Processing and Information Retrieval. LECTURE NOTES IN COMPUTER SCIENCE, vol. 8799, p. 52-64, Springer-Verlag, ISBN: 9783319119175, ISSN: 0302-9743, doi: 10.1007/978-3-319-11918-2_6 - Contributo in Atti di convegno
- Ferragina Paolo, Manzini Giovanni (2010). On Compressing the Textual Web. In: Proc. 3rd ACM International Conference on Web Search and Data Mining (WSDM ‘10). p. 391-400, New York:ACM, ISBN: 9781605588896, New York, 2010, doi: 10.1145/1718487.1718536 - Contributo in Atti di convegno
- Kärkkäinen Juha, Manzini Giovanni, Puglisi Simon J. (2009). Permuted Longest-Common-Prefix Array. In: Proc. 20th Symposium on Combinatorial Pattern Matching (CPM ‘09). LECTURE NOTES IN COMPUTER SCIENCE, vol. 5577, p. 181-192, ISBN: 3642024408, ISSN: 0302-9743, Lille, doi: 10.1007/978-3-642-02441-2_17 - Contributo in Atti di convegno
- Manzini Giovanni (2009). Succinct representations of trees. In: Proc. 20th International Workshop on Combinatorial Algorithms (IWOCA). LECTURE NOTES IN COMPUTER SCIENCE, vol. 5874, p. 11-18, NEW YORK:Springer, ISBN: 3642102166, ISSN: 0302-9743, Hradec nad Moravici, cze, 2009, doi: 10.1007/978-3-642-10217-2_3 - Contributo in Atti di convegno
- M.Frasca, A. Bertoni, M. Re, G. Valentini (2013). A neural network algorithm for semi-supervised node label learning from unbalanced data. NEURAL NETWORKS, vol. 43, p. 84-98, ISSN: 0893-6080, doi: 10.1016/j.neunet.2013.01.021 - Articolo in rivista
- M. Frasca, S. Bassis, G. Valentini (2015). Learning node labels with multi-category Hopfield networks. NEURAL COMPUTING & APPLICATIONS, ISSN: 0941-0643, doi: 10.1007/s00521-015-1965-1 - Articolo in rivista
- FRASCA, MARCO (2017). Gene2DisCo : gene to disease using disease commonalities. ARTIFICIAL INTELLIGENCE IN MEDICINE, vol. 82, p. 34-46, ISSN: 0933-3657, doi: 10.1016/j.artmed.2017.08.001 - Articolo in rivista
- M. Frasca, D. Malchiodi (2017). Exploiting Negative Sample Selection for Prioritizing Candidate Disease Genes. GENOMICS AND COMPUTATIONAL BIOLOGY, vol. 3, e47, ISSN: 2365-7154, doi: 10.18547/gcb.2017.vol3.iss3.e47 - Articolo in rivista
- M. Frasca, N. A. Cesa Bianchi (2017). Multitask Protein Function Prediction Through Task Dissimilarity. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, p. 1-12, ISSN: 1545-5963, doi: 10.1109/TCBB.2017.2684127 - Articolo in rivista
- M. Frasca (2015). Automated gene function prediction through gene multifunctionality in biological networks. NEUROCOMPUTING, vol. 162, p. 48-56, ISSN: 0925-2312, doi: 10.1016/j.neucom.2015.04.007 - Articolo in rivista
- M. Frasca, A. Bertoni, G. Valentini (2015). UNIPred : Unbalance-aware network integration and prediction of protein functions. JOURNAL OF COMPUTATIONAL BIOLOGY, vol. 22, p. 1057-1074, ISSN: 1066-5277, doi: 10.1089/cmb.2014.0110 - Articolo in rivista
- M. Frasca, G. Valentini (2016). COSNet: An R package for label prediction in unbalanced biological networks. NEUROCOMPUTING, ISSN: 0925-2312, doi: 10.1016/j.neucom.2015.11.096 - Articolo in rivista
- G. Valentini, G. Armano, M. Frasca, J. Lin, M. Mesiti, M. Re (2016). RANKS : a flexible tool for node label ranking and classification in biological networks. BIOINFORMATICS, vol. 32, p. 2872-2874, ISSN: 1367-4803, doi: 10.1093/bioinformatics/btw235 - Articolo in rivista
- H. Caniza, A.E. Romero, S. Heron, H. Yang, A. Devoto, M. Frasca, M. Mesiti, G. Valentini, A. Paccanaro (2014). GOssTo : a stand-alone application and a web tool for calculating semantic similarities on the Gene Ontology. BIOINFORMATICS, vol. 30, p. 2235-2236, ISSN: 1367-4803, doi: 10.1093/bioinformatics/btu144 - Articolo in rivista
- M. Muselli, A. Bertoni, M. Frasca, A. Beghini, F. Ruffino, G. Valentini (2011). A mathematical model for the validation of gene selection methods. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, vol. 8, p. 1385-1392, ISSN: 1545-5963, doi: 10.1109/TCBB.2010.83 - Articolo in rivista
- A. Bertoni, M. Frasca, G. Valentini (2011). COSNet : a cost sensitive neural network for semi-supervised learning in graphs. In: Machine learning and knowledge discovery in databases : European conference, ECML PKDD 2010 : Athens, Greece, september 5-9, 2011 : proceedings. Part 1. vol. 6911, p. 219-234, Berlin:Springer, ISBN: 9783642237799, Athens, 2011, doi: 10.1007/978-3-642-23780-5_24 - Contributo in Atti di convegno
- M. Frasca, A. Bertoni, A. Sion (2013). A Neural Procedure for Gene Function Prediction. In: Neural Nets and Surroundings. vol. 19, p. 179-188, Springer Berlin Heidelberg, ISBN: 978-3-642-35466-3, Vietri sul Mare, 2012, doi: 10.1007/978-3-642-35467-0_19 - Contributo in Atti di convegno
- M. Frasca, G. Pavesi (2013). A neural network based algorithm for gene expression prediction from chromatin structure. In: Proceedings of the International Joint Conference on Neural Networks. p. 1-8, IEEE, ISBN: 9781467361293, Dallas, TX, usa, 2013, doi: 10.1109/IJCNN.2013.6706954 - Contributo in Atti di convegno
- M. Frasca, F. Lipreri, D. Malchiodi (2017). Analysis of Informative Features for Negative Selection in Protein Function Prediction. In: Bioinformatics and Biomedical Engineering. LECTURE NOTES IN COMPUTER SCIENCE, vol. 10209, p. 267-276, Switzerland:Springer, ISBN: 9783319561530, ISSN: 0302-9743, Granada, 2017, doi: 10.1007/978-3-319-56154-7_25 - Contributo in Atti di convegno
- M. Frasca, D. Malchiodi (2016). Selection of negative examples for node label prediction through fuzzy clustering techniques. In: Advances in Neural Networks : Computational Intelligence for ICT. SMART INNOVATION, SYSTEMS AND TECHNOLOGIES, vol. 54, p. 67-76, Springer International, ISBN: 9783319337463, ISSN: 2190-3018, Vietri sul mare, 2015, doi: 10.1007/978-3-319-33747-0_7 - Contributo in Atti di convegno
- M. Frasca, S. Bassis (2016). Gene-disease prioritization through cost-sensitive graph-based methodologies. In: Bioinformatics and biomedical engineering. LECTURE NOTES IN COMPUTER SCIENCE, vol. 9656, p. 739-751, Springer International, ISBN: 9783319317434, ISSN: 0302-9743, Granada, 2016, doi: 10.1007/978-3-319-31744-1_64 - Contributo in Atti di convegno
- P. Perlasca, G. Valentini, M. Frasca, M. Mesiti (2016). Multi-species protein function prediction: towards Web-based visual analytics. In: Proceedings of iiWAS2016. p. 1-5, ACM, ISBN: 9781450348072 - Contributo in Atti di convegno
- P.N. Robinson, M. Frasca, S. Köhler, M. Notaro, M. Re, G. Valentini (2015). A Hierarchical Ensemble Method for DAG-Structured Taxonomies. In: (a cura di): F. Schwenker;F. Roli;J.Kittler, Multiple Classifier Systems : 12th International Workshop, MCS 2015, Günzburg, Germany, June 29 - July 1, 2015, Proceedings. LECTURE NOTES IN COMPUTER SCIENCE, vol. 9132, p. 15-26, Berlin:Springer, ISBN: 9783319202471, ISSN: 0302-9743, doi: 10.1007/978-3-319-20248-8_2 - Contributo in volume (Capitolo o Saggio)
- A. Bertoni, M. Frasca, G. Grossi, G. Valentini (2010). Learning functional linkage networks with a cost-sensitive approach. In: Neural Nets WIRN10 : proceedings of the 20th italian workshop on neural nets. p. 52-61, Amsterdam:IOS Press, ISBN: 9781607506911, doi: 10.3233/978-1-60750-692-8-52 - Contributo in Atti di convegno