2021 Preprint. DOI: 10.18452/23209
Scaling Up Set Similarity Joins Using a Cost-Based Distributed-Parallel Framework
The set similarity join (SSJ) is an important operation in data science, used, for example, to relate data from different sources or to detect plagiarism. Most SSJ approaches are based on the filter-and-verification framework and are either sequential (single-core), multi-threaded, or MapReduce-based distributed implementations. The amount of data to be processed today is large and keeps growing, while the SSJ is a compute-intensive operation; none of the existing SSJ methods scales to large datasets. Single- and multi-core methods are limited by the hardware of a single machine, and MapReduce-based methods do not scale due to excessive and/or skewed data replication. We propose a novel, highly scalable distributed SSJ approach that overcomes the limits and bottlenecks of existing parallel SSJ approaches. A cost-based heuristic and a data-independent scaling mechanism avoid intra-node data replication and recomputation: the heuristic assigns a similar share of the compute cost to each node, and a RAM usage estimate prevents swapping, which is critical for runtime. Our approach significantly scales up SSJ execution and processes much larger datasets than all parallel approaches designed so far.
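To make the filter-and-verification framework mentioned in the abstract concrete, the following is a minimal single-core sketch of a self-join with a Jaccard threshold, using standard prefix filtering over a frequency-ordered token universe. All names and the simple candidate index are illustrative assumptions, not the paper's implementation.

```python
import math

def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b)

def ssj_prefix_filter(records, threshold):
    """Filter-and-verification SSJ self-join (illustrative sketch).

    Filter step: two records can only reach the Jaccard threshold if
    their prefixes (rarest tokens first) share at least one token.
    Verification step: exact Jaccard computation on the candidates.
    """
    # Global token frequencies; ordering tokens rare-first makes
    # prefixes selective.
    freq = {}
    for r in records:
        for t in r:
            freq[t] = freq.get(t, 0) + 1
    sorted_recs = [sorted(r, key=lambda t: (freq[t], t)) for r in records]

    index = {}    # inverted index: prefix token -> ids of seen records
    results = []
    for i, tokens in enumerate(sorted_recs):
        # A record of size n can miss at most n - ceil(t * n) tokens of
        # a matching partner, so this prefix length loses no results.
        prefix_len = len(tokens) - math.ceil(threshold * len(tokens)) + 1
        candidates = set()
        for t in tokens[:prefix_len]:
            candidates |= index.get(t, set())
        for j in sorted(candidates):  # verification
            if jaccard(set(records[i]), set(records[j])) >= threshold:
                results.append((j, i))
        for t in tokens[:prefix_len]:
            index.setdefault(t, set()).add(i)
    return results
```

The distributed approaches discussed in the abstract must additionally decide how to partition this work across nodes; replicating records whose prefix tokens are frequent is exactly the source of the skewed data replication that the proposed framework avoids.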
This is an extended version of our paper accepted at SISAP 2021; it additionally includes descriptions of the experimental datasets and further experimental results.