Efficient partitioning strategies for distributed web crawling
Conference Paper
Overview
Abstract
This paper presents a multi-objective approach to Web space partitioning,
aimed at improving distributed crawling efficiency. The investigation is
supported by the construction of two different weighted graphs. The first
models the topological communication infrastructure between crawlers and
Web servers, and the second represents the number of link connections
between servers' pages. The edge weights represent, respectively, computed
RTTs and page links between nodes.
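As a rough illustration of the two weighted graphs described above, the following Python sketch builds both from raw observations. The function names, the input tuple layouts, and the choice to average repeated RTT samples are assumptions made for this example; the abstract does not say how the RTTs were computed or how the graphs were stored.

```python
from collections import defaultdict


def build_rtt_graph(rtt_samples):
    """Build the crawler-to-server graph.

    rtt_samples: iterable of (crawler, server, rtt_ms) measurements.
    Returns {(crawler, server): mean RTT in ms}. Averaging repeated
    samples is an assumption of this sketch.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for crawler, server, rtt_ms in rtt_samples:
        sums[(crawler, server)] += rtt_ms
        counts[(crawler, server)] += 1
    return {edge: sums[edge] / counts[edge] for edge in sums}


def build_link_graph(page_links):
    """Build the server-to-server link graph.

    page_links: iterable of (src_host, dst_host), one entry per hyperlink.
    Returns {frozenset({host_a, host_b}): number of links}, treating the
    graph as undirected and ignoring links inside a single host.
    """
    weights = defaultdict(int)
    for src, dst in page_links:
        if src != dst:
            weights[frozenset((src, dst))] += 1
    return dict(weights)
```

Keying the link graph by an unordered pair means links in both directions accumulate on the same edge, which is convenient when the weight is later used as a cut cost between partitions.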
The two graphs are then combined, using a multi-objective partitioning
algorithm, to support Web space partitioning and load allocation for an
adaptable number of geographically distributed crawlers.
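To make the idea of combining the two objectives concrete, here is a minimal sketch of a greedy, weighted-sum heuristic. It is not the partitioning algorithm used in the paper, which the abstract does not specify; the function name, the alpha trade-off parameter, and the normalization are all assumptions, and load balancing is left out for brevity. The RTT graph is assumed to be a dict keyed by (crawler, server) and the link graph a dict keyed by frozenset({host_a, host_b}).

```python
def greedy_partition(servers, crawlers, rtt, links, alpha=0.5):
    """Assign each server to one crawler, trading off RTT against cut links.

    alpha = 1.0 considers only RTT; alpha = 0.0 considers only the number
    of links that would cross partition boundaries (hypothetical parameter).
    """
    max_rtt = max(rtt.values()) or 1.0
    max_links = max(links.values(), default=0) or 1
    assignment = {}
    for server in servers:
        best_crawler, best_cost = None, float("inf")
        for crawler in crawlers:
            # Objective 1: normalized round-trip time to this server.
            rtt_cost = rtt[(crawler, server)] / max_rtt
            # Objective 2: normalized links to servers already owned by
            # a different crawler (these links would have to be exchanged).
            cut = sum(
                links.get(frozenset((server, other)), 0)
                for other, owner in assignment.items()
                if owner != crawler
            ) / max_links
            cost = alpha * rtt_cost + (1 - alpha) * cut
            if cost < best_cost:
                best_crawler, best_cost = crawler, cost
        assignment[server] = best_crawler
    return assignment
```

The result depends on the order in which servers are visited, which is one reason a dedicated multi-objective graph partitioner is preferable in practice; the sketch only shows how both edge weights can enter a single decision.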
Partitioning strategies were evaluated by varying the number of partitions
(crawlers) to obtain merit figures for: i) download time, ii) exchange
time and iii) relocation time. Evaluation has shown that our partitioning
schemes outperform traditional hostname hash-based counterparts in all
evaluated metrics, achieving on average an 18% reduction in download time,
a 78% reduction in exchange time and a 46% reduction in relocation time.
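For context, a hostname hash-based baseline of the kind mentioned above can be sketched in a few lines, together with the arithmetic behind a "percent reduction" figure. The choice of SHA-256 and both function names are assumptions for illustration; the abstract does not state which hash function the baseline used.

```python
import hashlib


def hash_partition(hostname, num_crawlers):
    """Map a hostname to a crawler index in [0, num_crawlers).

    Uses a stable hash (SHA-256 here, as an assumption) so that every
    crawler computes the same owner for a given hostname.
    """
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers


def percent_reduction(baseline, candidate):
    """Relative improvement of candidate over baseline, in percent."""
    return 100.0 * (baseline - candidate) / baseline
```

Because the hash ignores both network distance and link structure, neighbouring hosts are spread across crawlers essentially at random, which is the behaviour the graph-based schemes are compared against.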