Optimization of Distributed Association Rule Mining Based Partial Vertical Partitioning
Monika, International Journal of Computer Science and Mobile Computing, Vol. 3, Issue 5, May 2014, pg. 1254-1261. © 2014, IJCSMC All Rights Reserved. ISSN 2320-088X. Available online at www.ijcsmc.com.

RESEARCH ARTICLE

MONIKA
monika.choudhary56@gmail.com

Abstract: Association rule mining is one of the most important techniques in data mining. Data mining is the process of analyzing data from different angles and extracting useful information from it. Modern organizations are geographically distributed, and using traditional centralized association rule mining to discover useful patterns in such distributed systems is not always feasible, because merging data sets from different sites into a centralized site incurs huge network communication and time costs. This paper presents an optimized Distributed Association Rule Mining (D-ARM) approach based on vertical partitioning. Existing D-ARM algorithms carry considerable communication overhead, which is a major concern. The proposed approach, based on partial counts, minimizes this overhead. The paper then discusses the Partial Count on Vertical Database (PCVD) structure, which offers significant advantages over existing D-ARM techniques.

Keywords: Data Mining; Distributed Association Rule Mining; Vertical Partitioning

I.
Introduction

Data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. The overall goal of the data mining process is to extract information from a dataset and transform it into an understandable structure for future use.

Let I = {i1, …, in} be a set of items and let D be a database of transactions. Each transaction t ∊ D is an itemset such that t ⊆ I. A transaction t supports A, a set of items in I, if A ⊆ t. An association rule is an implication of the form A → B, where A and B are subsets of I and A ∩ B = Ø. The support of the rule A → B is computed as Support(A → B) = |A ∪ B| / |D|, where |A ∪ B| denotes the number of transactions in the database that contain the itemset A ∪ B, and |D| denotes the number of transactions in the database D. The confidence of the rule is Confidence(A → B) = |A ∪ B| / |A|, where |A| is the number of transactions in D that contain the itemset A. The rule A → B is strong if support(A → B) ≥ min_support and confidence(A → B) ≥ min_confidence, where min_support and min_confidence are two given minimum thresholds [1]. Association rule mining algorithms scan the database of transactions, calculate the support and confidence of candidate rules, and retrieve only those rules whose support and confidence exceed the user-specified minimum thresholds [2]. Association rule mining consists of two stages: 1.
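The support and confidence definitions above can be computed directly on a toy transaction database. This is a minimal sketch; the items and transactions are invented for illustration:

```python
# Toy transaction database; items "a".."d" and the transactions are illustrative.
D = [
    {"a", "b", "c"},
    {"a", "c"},
    {"a", "d"},
    {"b", "c"},
]

def support(itemset, D):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in D if itemset <= t) / len(D)

def confidence(A, B, D):
    """support(A ∪ B) / support(A), per the definitions above."""
    return support(A | B, D) / support(A, D)

A, B = {"a"}, {"c"}
print(support(A | B, D))    # 2 of 4 transactions contain {a, c} -> 0.5
print(confidence(A, B, D))  # 2 of the 3 transactions containing {a} -> 2/3
```

With min_support = 0.4 and min_confidence = 0.6, the rule {a} → {c} in this toy database would be strong under the definition above.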
The discovery of frequent itemsets. 2. The generation of association rules. In the vast majority of cases, the discovery of the frequent sets dominates the performance of the whole process; therefore, this paper focuses explicitly on the discovery of such sets [3].

A distributed system for mining association rules is needed because of the following properties:

1. Databases or data warehouses may store a huge amount of data. Mining association rules in such databases may require substantial processing power, and a distributed system is a possible solution.
2. Many large databases are distributed in nature. For example, the huge number of transaction records of hundreds of Sears department stores are likely to be stored at different sites.

This observation motivates the study of efficient distributed algorithms for mining association rules in databases, and may also shed new light on parallel data mining. Furthermore, a distributed mining algorithm can also be used to mine association rules in a single large database by partitioning the database among a set of sites and processing the task in a distributed manner. The high flexibility, scalability, low cost/performance ratio, and easy connectivity of a distributed system make it an ideal platform for mining association rules [4].

Two types of database layouts are employed in association rule mining: horizontal and vertical. In the traditional horizontal database layout, each transaction consists of a set of items and the database contains a set of transactions; most Apriori-like algorithms use this layout. In the vertical database layout, each item maintains a set of transaction ids (denoted by tidset) of the transactions in which it is contained; Eclat uses the vertical data layout.
It has been shown that the vertical layout generally performs better than the horizontal format. Table 1 and Table 2 show examples of the two layouts [5].

Table 1: Horizontal Layout. Table 2: Vertical Layout.

II. RELATED WORK

Since finding interesting association rules in databases may disclose useful patterns for decision support, selective marketing, financial forecasting, medical diagnosis, and many other applications, it has attracted a lot of attention in recent data mining research. Mining association rules may require iterative scanning of large transaction or relational databases, which is quite costly in processing. Therefore, efficient mining of association rules in transaction or relational databases has been studied substantially [4]. Since its introduction in 1993, Association Rule Mining (ARM) has been studied intensively, and many algorithms representing several different approaches have been suggested. Some algorithms, such as Apriori [16], DHP [15], and FP-growth [6], are bottom-up, while others, like Pincer-Search, use a hybrid approach, trying to guess large itemsets at an early stage. Algorithms for the D-ARM problem can usually be seen as parallelizations of sequential ARM algorithms: the CD, FDM, and DDM algorithms parallelize Apriori, and PDM [14] parallelizes DHP [15]. Two basic parallel algorithms, Count Distribution (CD) and Data Distribution (DD), were proposed in [5]. The CD algorithm scales linearly and has excellent speedup and sizeup behavior with respect to the number of transactions. However, the CD algorithm, like its sequential counterpart Apriori, does not scale with respect to the increasing size of the candidate set. The DD algorithm addresses the memory problem of the CD algorithm by partitioning the candidate set and assigning a partition to each processor in the system [11].
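The vertical layout described above can be built from a horizontal database in a single pass. A minimal sketch with invented data (the item names and transaction ids are hypothetical):

```python
# Horizontal layout: each transaction id maps to its itemset (illustrative data).
horizontal = {
    1: ["a", "b", "c"],
    2: ["a", "c"],
    3: ["b", "d"],
}

def to_vertical(horizontal):
    """Build the vertical layout: item -> tidset of transactions containing it."""
    vertical = {}
    for tid, items in horizontal.items():
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    return vertical

print(to_vertical(horizontal))
# {'a': {1, 2}, 'b': {1, 3}, 'c': {1, 2}, 'd': {3}}
```

Once the tidsets exist, the support of any item is simply the size of its tidset, which is what makes the vertical layout attractive for the algorithms discussed next.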
FDM [4] was a further improvement on the CD algorithm and gives better performance than CD: in FDM, the number of candidate sets generated can be substantially reduced, to about 10-25% of that generated in CD [4]. In most of the above algorithms, the database is divided horizontally between nodes, called segmentation. There are also many algorithms that use a vertical database. Apriori-based [16] and Apriori-inspired algorithms work well on sparse datasets, where frequent patterns are very short. For dense datasets, such as telecommunications and census data, which have many long frequent patterns, the performance of these algorithms degrades dramatically. To overcome these problems, a number of vertical mining algorithms have been proposed, e.g. Eclat and dEclat. The Eclat [9] algorithm improves on previous algorithms, but it still needs a lot of communication. dEclat [8] is an improvement on Eclat that uses the concept of diffsets for generating frequent itemsets. There are also many D-ARM algorithms that follow a tree structure. The FP-growth algorithm is a new generation of frequent pattern mining that uses a compressed FP-tree structure for mining a complete set of frequent itemsets without candidate generation. It works well if the FP-tree is small and all items are ordered from highest to lowest support count; however, for very large databases, a lot of time is required to first sort the supports of the 1-itemsets. To avoid this overhead, the frequent item tree algorithm FI-growth was proposed; it constructs an FI-tree by ordering the items by their sequence in transactions.

III. PROPOSED WORK

Our proposed algorithm is based on a central P-tree structure. In this method, a single pass over the database is made to perform a partial summation of the support counts. These partial counts are stored in a tree structure that we call the P-tree, which enumerates itemsets in lexicographic order.
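The core operations of the vertical algorithms mentioned above can be sketched in a few lines: Eclat obtains the support of a k-itemset by intersecting tidsets, and dEclat keeps only the difference of tidsets (the diffset) so the intermediate sets shrink on dense data. The tidsets below are invented for illustration:

```python
# Vertical layout: item -> tidset (illustrative data).
tidsets = {
    "a": {1, 2, 3, 5},
    "b": {1, 2, 4},
    "c": {1, 3, 5},
}

# Eclat-style: the tidset of {a, b} is the intersection of the tidsets of
# a and b, and its support count is the size of that set -- no database rescan.
tid_ab = tidsets["a"] & tidsets["b"]
print(len(tid_ab))  # support count of {a, b} = 2

# dEclat-style: the diffset of {a, b} w.r.t. {a} holds tids in a but not in b;
# support({a, b}) = support({a}) - |diffset|, so only the difference is stored.
diff_ab = tidsets["a"] - tidsets["b"]
print(len(tidsets["a"]) - len(diff_ab))  # also 2
```

On dense data the diffsets are typically much smaller than the tidsets themselves, which is the saving dEclat exploits.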
The P-tree contains all the sets of items present as distinct records in the database, plus some additional sets that are leading subsets of these. A distributed version of the P-tree, the PP-tree, was also proposed; it is based on vertical partitioning of itemsets. This method divides the ordered set of items into subsequences and then defines a PP-tree for each subsequence.

This paper presents an approach with greater efficiency in terms of communication overhead. The approach is based on (vertical) partitioning and partitions the database in a different way. The proposed Partial Count on Vertical Database (PCVD) removes the problem of redundancy of transactions that sometimes exists in Total Count on Vertical Database (TCVD) [1] by partitioning the database differently: transactions are distributed according to the 1st item of each transaction. The approach is based on partial support counts but gives approximately the same result as the TCVD approach, with minimum communication overhead. The algorithm for the partial count is shown in Fig. 1.

Fig. 1: Algorithm for Partial Count
Input: Database D
Output: Lk // k = 1 to n
1) Convert the given dataset into a vertical dataset by allocating each transaction according to the 1st item of that transaction.
2) Distribute these items with their tidsets to distinct nodes.
3) At each node, compute the candidate itemsets Ck, generating only those candidate sets that start with the item assigned to that particular node: Ck = Lk-1 ⋈ Lk-1.
4) Compute the frequent k-itemsets at the individual nodes from their corresponding Ck.
// End of Algorithm for Partial Count on Vertical Dataset

Example: We are given a horizontal dataset in Table 3. Each transaction t has its itemset in lexicographic order; the items stand for distinct real-world values. Table 4 shows the vertical dataset produced by the algorithm for partial count.
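The steps of the partial count algorithm can be sketched as follows. This is one possible reading of the algorithm, with invented transactions; for brevity only candidate pairs are counted, and the helper names are hypothetical. Note that the counts are partial: a transaction such as {a, b, c} lives only at node a, so node b does not see its occurrence of {b, c} - this is the approximation the approach trades for low communication.

```python
from collections import defaultdict

# Transactions with items in lexicographic order (illustrative data).
transactions = {
    1: ["a", "b", "c"],
    2: ["a", "c"],
    3: ["b", "c"],
    4: ["b", "d"],
    5: ["a", "b"],
}

# Steps 1-2: allocate each transaction to the partition (node) of its 1st item.
partitions = defaultdict(dict)
for tid, items in transactions.items():
    partitions[items[0]][tid] = items

# Steps 3-4 (per node): count only itemsets that start with the node's own
# item, so no two nodes ever count the same candidate.
def local_counts(node_item, node_transactions):
    counts = defaultdict(int)
    for items in node_transactions.values():
        for other in items[1:]:
            counts[(node_item, other)] += 1
    return dict(counts)

for item in sorted(partitions):
    print(item, local_counts(item, partitions[item]))
# a {('a', 'b'): 2, ('a', 'c'): 2}
# b {('b', 'c'): 1, ('b', 'd'): 1}
```

Each node then keeps only the candidates whose local count meets the minimum support threshold, and no per-candidate exchange between nodes is required.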