Exploiting Hybrid Parallelism in Web Search Engines

Carolina Bonacic¹, Carlos Garcia¹, Mauricio Marin², Manuel Prieto¹, and Francisco Tirado¹

¹ Depto. Arquitectura de Computadores y Automática, Universidad Complutense de Madrid
Contact e-mail: cbonacic@fis.ucm.es, garsanca@dacya.ucm.es, mpmatias@dacya.ucm.es, ptirado@dacya.ucm.es
² Yahoo! Research, Santiago, Chile
Contact e-mail: mmarin@yahoo-inc.com

Abstract. With the emergence of multi-core CPUs (Chip-level Multi-Processors, CMPs), it is essential to develop techniques that capitalize on the CMP's advantages to speed up very demanding parallel applications such as Web search engines. For this application in particular, and given the huge amount of computational resources deployed at data centers, it is of paramount importance to devise strategies that get the best performance from the hardware. This is especially critical when we consider how hardware is organized to cope with sustained periods of very high user query traffic. In this paper, we propose a hybrid technique based on MPI and OpenMP devised to take advantage of the multithreading facilities provided by CMP nodes for search engines under high query traffic.

1 Introduction

Search engines must cope efficiently with dynamic variations in the query traffic generated by users. The most frequent queries are answered quickly by keeping them stored in cache machines. However, queries not found in the cache must be solved directly by a set of processors (cluster nodes) forming a cluster. The aim is to determine, as fast as possible, the top-R results per query and, from these results, build up the answer web pages presented to the users. For high query traffic, and given the huge volume of data associated with the web samples kept at each node, this can involve a significant amount of resources: processor utilization, disk and network bandwidth. Current search engines deal with peaks in traffic by including enough hardware redundancy so that at normal traffic processor utilization stays below 30% or 40%.

Hardware redundancy can be reduced by using query processing strategies that take advantage of the economy of scale present in situations in which a large number of queries are solved concurrently. Recently, we have found [6,7] that in those high-query-traffic scenarios, performing what we call round-robin query processing, implemented on top of bulk-synchronous parallel (BSP) processing [10], can significantly outperform the standard multi-threaded asynchronous message-passing parallel processing employed by current search engines. For low query traffic the opposite holds true, which makes perfect sense since in this case individual queries can be granted as many threads as they need, and those threads are kept alive consuming all necessary resources for the time it takes to get the top-R results. This is not harmful, since the hardware is being under-utilized anyway because of the small number of queries present in the system. In [7] we actually propose switching between both modes of operation depending on the observed query traffic.

In this paper we propose a hybrid parallelization based on a mixed MPI(BSP)-OpenMP programming model to take advantage of the hierarchical structure offered by today's clusters based on CMP processors. On every processor, the document ranking task that selects the local top-R results of a query is parallelized using OpenMP threads.
This is the most costly part of query processing, and it is certainly convenient to reduce its total running time under high-query-traffic scenarios. Our aim here is to put T threads to work on the document ranking phase of a group of Q queries being processed together at the same node in a given period of time, with Q ≥ T (this is done in parallel across the P nodes available in the system).

Current technology trends underline the emergence of CMP processors. Indeed, most systems now incorporate these chips, discarding the old idea of a multiprocessor system as a collection of single-processor nodes. The roadmaps of the major manufacturers indicate that the number of cores on a chip will continue to grow. Nowadays, AMD offers chips with four cores (Native Quad technology) and Intel has begun to incorporate the Intel Core™ Extreme quad-core processor in its server systems. However, few studies have evaluated the parallel programming paradigms employed in the context of Web servers on this technology and whether they are the most appropriate. Taking this trend into account, the main aim of this paper is to study the most efficient way to exploit this new technology.

As baseline code we have employed a parallel Web search engine based on a bulk-synchronous, MPI-based, multiple-masters/multiple-slaves scheme previously developed by us, which has been demonstrated to achieve scalable performance on conventional clusters of computers of differing architectures [5-7]. One might think that the best way to exploit the extra thread-level parallelism available in today's processors is to extend the number of MPI processes to the cores available (i.e., instead of one MPI process per cluster node, one MPI process per core). However, this would involve an additional partitioning of the inverted file across the cores, resulting in a large number of messages (the query receptionist machine would have to broadcast the queries to all cores and then collect the answers from all of them). Furthermore, since each core would have its own private inverted index, all cores would compete with each other for the shared cache space and memory bandwidth. This paper proposes a better method to exploit the additional parallelism provided by the cores at each node.

The remainder of this paper is organized as follows. Section 2 describes our general method of parallel query processing for high query traffic and the particular arrangement we make to profit from multicore architectures. Section 3 describes the hardware and databases we used to test the efficiency of our proposal and shows performance results. Section 4 presents concluding remarks.

2 Speeding up round-robin query processing

2.1 Distributed inverted file

Web search engines use the inverted file data structure to index the text collection and speed up query processing. A number of papers have been published reporting experiments and proposals for efficient parallel query processing upon inverted files distributed on a set of P processor-memory pairs [1-3,5,8,6,7,11].

An inverted file is composed of a vocabulary table and a set of posting lists. The vocabulary table contains the set of relevant terms found in the collection. Each of these terms is associated with a posting list, which contains the identifiers of the documents in which the term appears, along with additional data used for ranking purposes.
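As an illustration, the following C++ sketch shows a minimal in-memory version of these two components. The names (Posting, InvertedFile) are ours and purely illustrative; a production index would be compressed and largely disk-resident.

  // Minimal in-memory inverted file: a vocabulary table mapping each
  // term to its posting list of (doc_id, frequency) pairs.
  #include <cstdint>
  #include <string>
  #include <unordered_map>
  #include <vector>

  struct Posting {
      uint32_t doc_id;     // identifier of a document containing the term
      uint32_t frequency;  // within-document term frequency, used for ranking
  };

  // Each posting list is kept sorted by frequency in descending order,
  // as required by the filtering technique discussed in Sect. 2.3.
  using InvertedFile = std::unordered_map<std::string, std::vector<Posting>>;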
To solve a query, it is necessary to get the set of document ids associated with the query terms and then perform a ranking of these documents so as to select the top-R documents as the query answer.

Current search engines use the document-partitioned approach to distribute the inverted file on a set of P processors. In this case, the document collection is evenly distributed at random over the processors, and an inverted file is constructed in each processor considering only the documents stored there. Solving a query involves (a) placing a copy of it in each processor, (b) letting each processor calculate its local top-R results, and (c) merging all results to select the global top-R results.

2.2 Query processing

At the parallel server side, queries arrive from a receptionist machine that we call the broker. The broker machine is in charge of routing the queries to the cluster's processors (where, for the scope of this paper, each processor is a chip-multiprocessor node of the cluster) and receiving the respective answers. It decides to which processor to route a given query by using a load-balancing heuristic. The particular heuristic depends on the approach used to partition the inverted file. Overall, the broker tends to distribute the queries evenly over all processors.

In more detail, the parallel processing of queries is basically composed of a phase in which it is necessary to fetch parts of all the posting lists associated with each term present in the query, and to perform a ranking of documents in order to produce the results. After this, additional processing is required to produce the answer presented to the user. This paper is concerned with the fetching+ranking part. We are interested in situations where it is relevant to optimize query throughput.

A relevant issue for this paper is the way we organize query processing upon the piece of the inverted file stored in each processor. We basically let queries use a fixed quantum of computation, communication, and disk access before granting the resources to another query in a round-robin fashion.

2.3 Iterative ranking and round-robin query processing

The processor at which a given query arrives from the broker is called the ranker for that query, since it is in this processor that the associated document ranking is performed. In fact, all processors are rankers for a subset of the queries and, as explained below, they are also fetchers of posting lists in order to let the rankers solve their queries. Thus every query is processed iteratively using two major steps:

– Fetching. The first step consists of fetching a K-sized piece of every posting list involved in the query and sending them to the ranker processor (K = 2R). In essence, the ranker sends a copy of the query to all other nodes. Next, all nodes send K/P pairs (doc_id, frequency) from their posting lists to the ranker, which performs the first iteration of the document ranking process.

– Ranking. In the second step, the ranker performs the actual ranking of documents and, if necessary, asks for additional K-sized pieces of the posting lists in order to produce the K best-ranked documents that are passed to the broker as the query results. We use the vectorial method for ranking documents, along with the filtering technique proposed in [9]. Consequently, the posting lists are kept sorted by frequency in descending order.

Once the ranker for a query receives all the required pieces of posting lists, they are merged into a single list and passed through the filters. If the document with the lowest frequency in one of the arrived pieces of posting lists passes the filter, then it is necessary to perform a new iteration for this term and for all others in the same situation.

Thus the ranking process can take one or more iterations to finish. In every iteration, a new piece of K pairs (doc_id, frequency) from the posting lists is sent to the ranker for each term involved in the query. This concept of iteration is essential to distribute and allocate system resources to the queries in a round-robin fashion: the quantum comes from the fact that we let queries work on chunks of posting lists of size K and organize document ranking in iterations.
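As an illustration, the following C++ sketch captures this per-query iteration as seen from the ranker. All helper routines (fetch_chunk, merge_into_ranking, passes_filter) and the TopR type are hypothetical stand-ins for the engine's actual machinery, stubbed out here so the sketch is self-contained.

  #include <string>
  #include <utility>
  #include <vector>

  struct Posting { unsigned doc_id; unsigned frequency; };
  struct TopR { /* best-R candidates so far, plus filter state */ };

  // Hypothetical helpers: fetch_chunk() gathers one K-sized piece of a
  // term's posting list (K/P pairs from each of the P fetchers),
  // merge_into_ranking() feeds pairs into the vectorial ranking, and
  // passes_filter() applies the filtering technique of [9].
  std::vector<Posting> fetch_chunk(const std::string&, int) { return {}; }
  void merge_into_ranking(const Posting&, TopR&) {}
  bool passes_filter(const Posting&, const TopR&) { return false; }

  // Iterative ranking of one query at its ranker. Because chunks arrive
  // sorted by descending frequency, a term needs another iteration only
  // if the last (lowest-frequency) pair of its current chunk still
  // passes the filter.
  void solve_query(const std::vector<std::string>& terms, TopR& top) {
      std::vector<std::string> pending = terms;
      for (int iter = 0; !pending.empty(); ++iter) {
          std::vector<std::string> again;
          for (const std::string& term : pending) {
              std::vector<Posting> chunk = fetch_chunk(term, iter);
              for (const Posting& p : chunk) merge_into_ranking(p, top);
              if (!chunk.empty() && passes_filter(chunk.back(), top))
                  again.push_back(term);  // one more K-sized piece needed
          }
          pending = std::move(again);
      }
  }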
2.4 Hybrid Parallelization

With the advent of CMPs, it is appropriate to review the most suitable parallelization strategies in our context. The simplest strategy would be to extend the rankers-fetchers scheme [6] across the CMP. However, we can anticipate that this approach, which maps one MPI process onto each core of the CMP, is not the most suitable one, because it involves overhead caused by the duplication of the inverted file and by competition for shared resources such as cache memory, main-memory bandwidth, the communication interface, etc.

The way we organize the overall computation and communication aims to be more CMP-friendly. It is based on the idea that the broker (master) continues distributing the queries across the cluster's nodes and receiving their respective answers [6], but each node, which has to solve a group of Q queries (a query batch), performs the ranking process (the most time-consuming phase) in parallel by means of OpenMP pragmas.

The idea behind the round-robin query processing approach is that queries are processed in such a way that the steps involved in solving each query are properly divided and interleaved while batches of queries are processed all together. Strict interleaving of steps is made possible by bulk-synchronous parallel (BSP) processing. In BSP, the parallel computation is divided into supersteps, and in each superstep the P processors are allowed to work on local data and to buffer messages to be sent to other processors in the same cluster. The end of each superstep is marked by the sending of all buffered messages and a barrier synchronization of the processors.

In order to make efficient use of the CMP, we arrange the processing of queries into the following supersteps, which are executed by all processors in parallel. We assume that query traffic is high enough to let the broker place Q queries of t terms each in every processor. Each processor is represented by a single MPI process containing T OpenMP threads created at the start of the search engine. The algorithm is as follows; a code sketch of these supersteps is given at the end of the section.

Superstep i (Broadcast). Each processor gets Q − m queries from the broker, plus m queries already in process that require a new iteration, and broadcasts them to all processors.

Superstep i + 1 (Fetching). Each processor fetches from disk t·Q posting lists of K/P items each and sends them to the requesting processors (rankers).

Superstep i + 2 (Ranking). Part 1. Each processor receives the arriving t·Q·P posting lists of size K/P and merges them per term and query, storing them in contiguous memory data structures, one per query, each of size t·K.
Part 2. Each processor uses its T ≤ Q OpenMP threads to calculate the ranking of each query, using the contiguous memory to keep the on-going document ranking. Each query is processed sequentially by a single OpenMP thread. These threads keep ownership of both the memory and their on-going queries during the complete process.

Part 3. Determine the outcome of the on-going queries, which can be either a request to go back to the logical superstep i (namely, superstep i + 3) or the sending of the top-R results to the broker.

These steps show the process associated with the solution of queries whose ranking does not require all documents to contain all query terms. In this case, all
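To make the scheme concrete, here is a minimal C++ sketch of one round of supersteps i through i + 2, assuming one MPI process per CMP node. All types and helper routines are hypothetical stand-ins for the engine's machinery (stubbed so the sketch is self-contained), and MPI_Barrier is used as a simplification of the BSP superstep boundary (message delivery plus synchronization).

  #include <mpi.h>
  #include <vector>

  struct Query { /* terms plus the contiguous t*K ranking buffer */ };
  struct PostingChunk { /* K/P (doc_id, frequency) pairs for one term */ };

  // Hypothetical stand-ins for the engine's communication and ranking code.
  std::vector<Query> get_batch_and_broadcast() { return {}; }              // superstep i
  std::vector<PostingChunk> fetch_and_exchange(const std::vector<Query>&)  // superstep i+1
  { return {}; }
  void merge_per_query(std::vector<PostingChunk>&, std::vector<Query>&) {} // ranking, part 1
  void rank_query(Query&) {}                       // sequential per-query ranking
  void report_or_reschedule(std::vector<Query>&) {} // ranking, part 3

  // One round of supersteps i .. i+2, executed by every MPI process
  // (one process per CMP node).
  void superstep_round() {
      std::vector<Query> batch = get_batch_and_broadcast();   // superstep i
      MPI_Barrier(MPI_COMM_WORLD);

      std::vector<PostingChunk> chunks = fetch_and_exchange(batch); // superstep i+1
      MPI_Barrier(MPI_COMM_WORLD);

      merge_per_query(chunks, batch);  // part 1: contiguous per-query buffers
      // Part 2: T <= Q OpenMP threads; static scheduling keeps each query
      // (and its buffer) owned by a single thread for the whole phase.
      #pragma omp parallel for schedule(static)
      for (int q = 0; q < (int)batch.size(); ++q)
          rank_query(batch[q]);
      report_or_reschedule(batch);     // part 3: top-R results or a new iteration
      MPI_Barrier(MPI_COMM_WORLD);     // end of superstep i+2
  }

Note that communication stays at one MPI process per node, so the inverted file is not partitioned any further across cores, avoiding the message blow-up and cache competition discussed in Section 1.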