2021 Preprint DOI: 10.18452/23209
Scaling Up Set Similarity Joins Using A Cost-Based Distributed-Parallel Framework
dc.contributor.author | Fier, Fabian
dc.contributor.author | Freytag, Johann-Christoph
dc.date.accessioned | 2021-08-18T07:00:32Z
dc.date.available | 2021-08-18T07:00:32Z
dc.date.issued | 2021 | none
dc.identifier.uri | http://edoc.hu-berlin.de/18452/23851
dc.description | This is an extended version of our paper accepted for SISAP 2021. It additionally includes descriptions of experimental datasets and experimental results. | none
dc.description.abstract | The set similarity join (SSJ) is an important operation in data science. For example, the SSJ operation relates data from different sources or finds plagiarism. Common SSJ approaches are based on the filter-and-verification framework. Existing approaches are sequential (single-core), use multi-threading, or use MapReduce-based distributed parallelization. The amount of data to be processed today is large and keeps growing, and the SSJ is a compute-intensive operation. None of the existing SSJ methods scales to large datasets. Single- and multi-core-based methods are limited by the hardware of a single machine. MapReduce-based methods do not scale because their data replication is too high and/or skewed. We propose a novel, highly scalable distributed SSJ approach. It overcomes the limits and bottlenecks of existing parallel SSJ approaches. With a cost-based heuristic and a data-independent scaling mechanism we avoid intra-node data replication and recomputation. A heuristic assigns similar shares of compute costs to each node. A RAM usage estimation prevents swapping, which is critical for the runtime. Our approach significantly scales up the SSJ execution and processes much larger datasets than all parallel approaches designed so far. | eng
dc.language.iso | eng | none
dc.publisher | Humboldt-Universität zu Berlin
dc.rights.uri | http://rightsstatements.org/vocab/InC/1.0/
dc.subject | Set Similarity Join | eng
dc.subject | Cost-based Optimization | eng
dc.subject | Distributed Parallelization | eng
dc.subject | RAM usage estimation | eng
dc.subject.ddc | 005 Computer programming, programs, data | none
dc.subject.ddc | 000 Computer science, information science, general works | none
dc.title | Scaling Up Set Similarity Joins Using A Cost-Based Distributed-Parallel Framework | none
dc.type | preprint
dc.identifier.urn | urn:nbn:de:kobv:11-110-18452/23851-6
dc.identifier.doi | http://dx.doi.org/10.18452/23209
local.edoc.type-name | Preprint
local.edoc.container-type | conference
local.edoc.container-type-name | Conference
dc.title.subtitle | extended paper | none
dc.description.event | SISAP, Dortmund, 2021 | none
dcterms.bibliographicCitation.booktitle | Scaling Up Set Similarity Joins Using A Cost-Based Distributed-Parallel Framework | none
dcterms.bibliographicCitation.originalpublishername | Springer | none
dcterms.bibliographicCitation.editor | Fabian Fier and Johann-Christoph Freytag | none
bua.department | Mathematisch-Naturwissenschaftliche Fakultät | none
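
The abstract refers to the filter-and-verification framework without illustrating it. The following is a minimal single-core sketch of that general idea, assuming Jaccard similarity and a standard prefix filter; it is not the distributed method proposed in the paper. The function names `jaccard` and `ssj_self_join` and the small epsilon guard are hypothetical choices made for this illustration.

```python
from collections import defaultdict
from math import ceil


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two non-empty sets."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)


def ssj_self_join(records, threshold):
    """Return all index pairs (i, j), i < j, with Jaccard >= threshold.

    Illustrative sketch only: sequential, in-memory, prefix filter + verification.
    """
    # Count in how many records each token occurs.
    freq = defaultdict(int)
    for rec in records:
        for tok in set(rec):
            freq[tok] += 1

    # Sort tokens by ascending frequency so that prefixes contain rare tokens.
    ordered = [sorted(set(rec), key=lambda t: (freq[t], t)) for rec in records]
    sets = [frozenset(rec) for rec in ordered]

    index = defaultdict(list)  # token -> ids of earlier records with it in their prefix
    result = []
    for i, rec in enumerate(ordered):
        if not rec:
            continue  # empty records cannot reach a positive threshold
        # Two records with Jaccard >= t must share a token within their
        # prefixes of length |r| - ceil(t * |r|) + 1. The epsilon only guards
        # against float rounding; it can lengthen a prefix, which is safe.
        min_overlap = ceil(threshold * len(rec) - 1e-9)
        prefix_len = len(rec) - min_overlap + 1

        # Filter step: collect candidates that share a prefix token.
        candidates = set()
        for tok in rec[:prefix_len]:
            candidates.update(index[tok])
            index[tok].append(i)

        # Verification step: compute the exact similarity for each candidate.
        for j in candidates:
            if jaccard(sets[i], sets[j]) >= threshold:
                result.append((j, i))
    return sorted(result)
```

The filter step only reduces the number of pairs that reach verification; the verification step still decides every reported pair, so the output matches a brute-force comparison of all pairs.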