Improving Search Engines Performance on Multithreading Processors

Carolina Bonacic¹, Carlos Garcia¹, Mauricio Marin², Manuel Prieto¹, Francisco Tirado¹, and Cesar Vicente¹ ⋆

¹ Depto. Arquitectura de Computadores y Automática, Universidad Complutense de Madrid
Contact e-mail: {garsanca, mpmatias, ptirado}
² Yahoo! Research Santiago, University of Chile

Abstract. In this paper we present strategies and experiments that show how to take advantage of the multi-threading parallelism available in Chip Multithreading (CMP) processors in the context of efficient query processing for search engines. We show that scalable performance can be achieved by letting the search engine go synchronous, so that batches of queries can be processed concurrently in a simple but very efficient manner. Furthermore, our results indicate that the multithreading capabilities of modern CMP systems are not fully exploited when the search engine operates in a conventional asynchronous mode, due to the moderate thread-level parallelism that can be extracted from a single query.

1 Introduction

The algorithmic design and implementation of current Web search engines is based on the asynchronous message-passing approach to parallel computing, in which each newly arriving query is serviced by an independent thread in a classical multiple masters/slaves scheme. Typical facilities for parallel query processing at data centers are composed of a few thousand Linux boxes forming clusters of computers.

On the other hand, the amount of work demanded by the solution of queries follows the so-called Zipf's law, which in practice means that some queries, in particular the ones composed of the most popular terms, can demand large amounts of processing time, whereas others containing less frequent terms can require comparatively much smaller processing time.

⋆ This work has been partially supported by the research contracts CICYT-TIN2005/5619, CYTED-506PI0293, FONDECYT 1060776 and Ingenio 2010 Consolider CSD2007-20811.
We also thank Sun Microsystems and the AulaSun of the UCM for their support.

Thus, under this asynchronous approach and hardware latencies, a given query can easily restrain smaller queries by consuming comparatively larger amounts of resources in processor cycles, disk, and inter-processor network bandwidth. However, we have found that a careful design of the major steps involved in the processing of queries allows their decomposition in such a way that we can let every query share the cluster resources evenly in a round-robin manner [8,9]. We have observed that this scheme can be particularly useful in preventing unstable behavior under unpredictable variations in the query traffic arriving at the search engine.

In particular, we have observed for the standard asynchronous method of query processing that sudden peaks in the query traffic can be very detrimental to overall performance, due to the Zipf's-law distribution of the workload per query. We have also observed that the round-robin method of query processing solves this problem efficiently. We have validated this claim through extensive experimentation by running actual query logs upon actual 1TB samples of the Web. Nevertheless, our experiments have been performed on standard machines with coarse-grain threads implemented with POSIX software, running on clusters supporting the distributed-memory model and the MPI message-passing communication library.

Having said that, it is clear that this discussion is only valid at a macroscopic level, in terms of "heavy" threads and operations for query processing in a shared-nothing model of data distribution. However, state-of-the-art computer architectures integrate facilities for light threads and shared memory, which are available to the programmer in the form of efficient realizations of the OpenMP model of parallel computing.
In fact, future improvements in processor performance will predominantly come from thread-level parallelism, rather than from increasing clock frequency or processor complexity [2]. In this regard, we think it is an interesting research problem to validate the above claims in the context of these new architectures and, if they do not hold, to explore new optimizations that achieve efficient performance in this new setting.

In this paper, we provide a first step in this direction by studying different realizations of standard and round-robin search engines implemented upon a state-of-the-art Chip Multithreading system [11] and its respective OpenMP realization. As experimental platform we have chosen a Sun Microsystems UltraSPARC T1 processor, code-named Niagara [6] and marketed by Sun as CoolThreads technology, since it symbolizes the recent shift to CMP in the server market and presents a radically new approach to enabling throughput computing and scalability with low power consumption.

For programming purposes the T1 can be seen as a set of logical processors that share some resources. Consequently, one may think that parallelization schemes targeted at other shared-memory multiprocessors, such as SMP systems, are also good candidates for this processor. However, the sharing of resources introduced on the T1 to increase utilization may cause serious bottlenecks, and hence strategies that are appropriate for those machines may be inappropriate or less effective on the T1. One of the goals motivating this study is to revisit the implementation of parallel search engines in this light.

The rest of this paper is organized as follows: Sections 2 and 3 describe our search engine and the experimental framework, respectively. Section 4 presents our parallel proposals and Section 5 shows some performance results. Finally, the paper ends with some conclusions and hints for future research.
2 Search engine overall description

2.1 Distributed inverted file

Web search engines use the inverted file data structure to index the text collection and speed up query processing. A number of papers have been published reporting experiments and proposals for efficient parallel query processing upon inverted files that are distributed on a set of P processor-memory pairs [1,3,4,7-10,14]. It is clear that efficiency on clusters of computers is only achieved by using strategies devised to reduce communication among processors and to maintain a reasonable balance of the amount of computation and communication performed by the processors to solve the search queries.

An inverted file is composed of a vocabulary table and a set of posting lists. The vocabulary table contains the set of relevant terms found in the collection. Each of these terms is associated with a posting list, which contains the identifiers of the documents where the term appears in the collection, along with additional data used for ranking purposes. To solve a query, it is necessary to get the set of document ids associated with the query terms and then perform a ranking of these documents so as to select the top K documents as the query answer.

Current search engines use the document-partitioned approach to distribute the inverted file on a set of P processors. In this case, the document collection is evenly distributed at random over the processors, and an inverted file is constructed in each processor considering only the documents stored in that processor. Solving a query involves (a) placing a copy of it in each processor, (b) letting each processor calculate its local top K results, and (c) merging all results to select the global top K results.

Query operations over parallel search engines are usually read-only requests upon the distributed inverted file. This means that one is not concerned with multiple users attempting to write information on the same text collection.
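As a hypothetical illustration (not the paper's actual C/MPI implementation), the document-partitioned layout and the query-solving steps (a)-(c) can be sketched in Python; the frequency-sum scoring below is a simplified stand-in for the vectorial ranking used by the authors:

```python
from collections import defaultdict
import heapq

def build_inverted_file(docs):
    """Build one processor's local inverted file: term -> posting list
    of (doc_id, freq) pairs, considering only local documents."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        freqs = defaultdict(int)
        for term in text.split():
            freqs[term] += 1
        for term, f in freqs.items():
            index[term].append((doc_id, f))
    # Keep posting lists sorted by frequency in descending order,
    # as required by the filtering technique described in Sec. 2.3.
    for term in index:
        index[term].sort(key=lambda p: -p[1])
    return index

def local_top_k(index, query_terms, k):
    """Step (b): each processor ranks only its local documents."""
    scores = defaultdict(int)
    for term in query_terms:
        for doc_id, f in index.get(term, []):
            scores[doc_id] += f
    return heapq.nlargest(k, scores.items(), key=lambda x: x[1])

def solve_query(partitions, query_terms, k):
    """Steps (a)-(c): broadcast the query to every partition, gather
    the local top-K lists, and merge them into the global top-K."""
    partial = []
    for index in partitions:                               # (a) broadcast
        partial += local_top_k(index, query_terms, k)      # (b) local top-K
    return heapq.nlargest(k, partial, key=lambda x: x[1])  # (c) merge
```

In a real deployment steps (a) and (b) run on different cluster nodes and step (c) on the ranker; here they are sequential function calls for clarity.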
All of them are serviced with no regard for consistency problems, since no concurrent updates are performed over the data structure. Insertion of new documents is effected off-line.

2.2 Organizing query processing

At the parallel server side, queries arrive from a receptionist machine that we call the broker. The broker machine is in charge of routing the queries to the cluster's processors (where, for the scope of this paper, each processor is a chip-multiprocessor node of the cluster) and receiving the respective answers. It decides to which processor to route a given query by using a load-balancing heuristic. The particular heuristic depends on the approach used to partition the inverted file. Overall, the broker tends to distribute the queries evenly over all processors.

In more detail, the parallel processing of queries is basically composed of a phase in which it is necessary to fetch parts of all of the posting lists associated with each term present in the query, and to perform a ranking of documents in order to produce the results. After this, additional processing is required to produce the answer to the user. This paper is concerned with the fetching+ranking part. We are interested in situations where it is relevant to optimize the query throughput.

A relevant issue for this paper is the way we organize query processing upon the piece of the inverted file stored in each processor. We basically apply a combination of two strategies we have devised to efficiently cope with hardware resource contention among queries and dynamic variations in the query traffic:

– Round-robin query processing. We let queries use a fixed quantum of computation, communication and disk access before granting the resources to another query in a round-robin fashion.

– Operation mode.
We dynamically switch the mode of operation of the search engine between the asynchronous and synchronous message-passing modes of parallel computation, in accordance with the observed query traffic.

In the following subsections we describe both strategies in detail.

2.3 Iterative ranking and round-robin query processing

The processor at which a given query arrives is called the ranker for that query, since it is in this processor where the associated document ranking is performed. Every query is processed iteratively using two major steps:

– Fetching. The first step consists of fetching a K-sized piece of every posting list involved in the query and sending them to the ranker processor. In essence, the ranker sends a copy of every query to all other P nodes. Next, all nodes send K/P pairs (doc_id, frequency) from their posting lists to the ranker, which performs the first iteration of the document ranking process.

– Ranking. In the second step, the ranker performs the actual ranking of documents and, if necessary, asks for additional K-sized pieces of the posting lists in order to produce the K best-ranked documents, which are passed to the broker as the query results. We use the vectorial method for performing the ranking of documents, along with the filtering technique proposed in [12]. Consequently, the posting lists are kept sorted by frequency in descending order. Once the ranker for a query receives all the required pieces of posting lists, they are merged into a single list and passed through the filters. If it happens that the document with the lowest frequency in one of the arrived pieces of posting lists passes the filter, then it is necessary to perform a new iteration for this term and for all others in the same situation.

Thus the ranking process can take one or more iterations to finish. In every iteration a new piece of K pairs (doc_id, frequency) from the posting lists is sent to the ranker for each term involved in the query.
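The fetch-and-rank loop above can be sketched as a minimal Python illustration; `passes_filter` is a hypothetical stand-in for the filtering technique of [12], and the frequency-sum scoring again simplifies the vectorial method:

```python
def fetch_chunk(posting_list, iteration, k):
    """Return the iteration-th K-sized piece of a posting list
    (the list is already sorted by frequency, descending)."""
    return posting_list[iteration * k:(iteration + 1) * k]

def iterative_rank(posting_lists, k, passes_filter):
    """Sketch of the ranker's loop: in each iteration a new K-sized
    chunk of (doc_id, freq) pairs is fetched for every term that may
    still contribute, until no chunk's lowest frequency passes the
    filter. Returns the top-K documents and the iteration count."""
    scores = {}
    active = list(posting_lists)   # terms that still need more chunks
    iteration = 0
    while active:
        still_active = []
        for plist in active:
            chunk = fetch_chunk(plist, iteration, k)
            if not chunk:
                continue
            for doc_id, f in chunk:
                scores[doc_id] = scores.get(doc_id, 0) + f
            # If even the smallest frequency in this chunk passes the
            # filter, deeper pairs of this list may still matter, so a
            # new iteration is needed for this term.
            if passes_filter(chunk[-1][1]):
                still_active.append(plist)
        active = still_active
        iteration += 1
    top = sorted(scores.items(), key=lambda x: -x[1])[:k]
    return top, iteration
```

Each pass of the `while` loop corresponds to one round-robin quantum: a query consumes exactly one K-sized chunk per still-active term before yielding resources to the next query.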
This concept of iteration is essential to distribute and allocate system resources to the queries in a round-robin fashion: the quantum comes from the fact that we let queries work on chunks of posting lists of size K and organize document ranking in iterations.

2.4 Operation mode

As mentioned above, we dynamically switch the mode of operation in accordance with the observed query traffic.

– Asynchronous mode. Low query traffic triggers an asynchronous mode in which each query is serviced by a unique master thread in charge of processing the query. This master thread can communicate with P other slave threads, each located in one of the P cluster nodes.

– Synchronous mode. High query traffic triggers a mode in which all active threads are blocked and a single thread takes control of query processing by grouping queries in batches and processing them sequentially. In this case, messages are buffered in all cluster nodes and sent out at the end of the current batch being processed, at which point all processors are barrier-synchronized. The better utilization of system resources in this mode comes from the fact that overheads such as thread scheduling and synchronization costs are reduced significantly, and communication is performed in bulk.

3 Experimental framework: computing platform and data set

As experimental platform, we have chosen a Sun Microsystems UltraSPARC T1 processor, whose main features are summarized in Table 1. Initially code-named Niagara, the T1 is a special-purpose CMP designed by Sun for the server market. It is available with four, six or eight CPU cores, and each core allows for the execution of four threads concurrently. Essentially, T1 cores are fine-grain multithreading (FGM) processors [5] that switch between threads of execution on every cycle to hide the inefficiencies caused by long operational latencies such as memory accesses [13].
Single-threaded applications will perform better on traditional processors, but multithreaded workloads may benefit from this architecture: each thread is slower, but the architecture yields better use of the processor's resources and potentially a higher overall throughput.

In our implementations, thread-level parallelism has been exploited by means of the OpenMP standard, which is supported by Sun's native compilers.

3.1 Fixed-point ranking

The UltraSPARC T1 processor has limited floating-point capability, since it only provides one floating-point unit to support all 8 cores on the chip, i.e.
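The general idea behind fixed-point ranking can be illustrated with a short hypothetical sketch (the constant FIX_BITS and the helpers below are illustrative choices, not the paper's implementation): floating-point term weights are scaled by 2^FIX_BITS and stored as integers, so that score accumulation needs only integer arithmetic and avoids contending for the T1's single shared floating-point unit:

```python
FIX_BITS = 16                 # assumed number of fractional bits
SCALE = 1 << FIX_BITS

def to_fixed(x):
    """Encode a float weight as a fixed-point integer."""
    return int(round(x * SCALE))

def to_float(x):
    """Decode a fixed-point integer back to a float."""
    return x / SCALE

def fixed_point_score(weights):
    """Accumulate a document score using integer-only arithmetic;
    only the final result is converted back to floating point."""
    acc = 0
    for w in weights:
        acc += to_fixed(w)
    return to_float(acc)
```

With 16 fractional bits the quantization error per weight is at most 2^-17, which is typically negligible relative to the precision needed to order documents by score.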