Review for Paper: 21-Fast Algorithms for Mining Association Rules

Review 1

Retailers use association rules to determine which items should be placed near each other in a store, based on which items are often bought together. An association rule states that when an itemset X is purchased, a disjoint itemset Y is also purchased, with some level of confidence (conditional probability of Y, given X) and support (probability of X and Y). Given a database of transactions, or sets of items bought together, retailers may wish to find all association rules with a given level of confidence and support.

In “Fast Algorithms for Mining Association Rules,” the authors present the Apriori algorithm, a novel method for finding large itemsets (with sufficient support) in a database of transactions. The Apriori algorithm is faster than earlier algorithms such as SETM by orders of magnitude, for problems with a small minimum support size. Apriori is efficient because it does not waste steps considering itemsets that are already known to have insufficient support.

The main contribution of the paper is a new, efficient algorithm for association rule mining. This algorithm led to many follow-up papers on association rule learning and has made it practical for businesses to find all true association rules (meeting user-defined constraints) in large, real-world data sets. Each iteration of the algorithm works in two stages, a growth stage and a pruning stage. In the growth stage, Apriori uses the large itemsets of size k-1 to build candidate k-itemsets. A superset of the large k-itemsets is formed by taking the union of all pairs of large (k-1)-itemsets that are identical except in their last element. Each of these candidate k-itemsets is then dropped if any of its (k-1)-itemset subsets is not large. Finally, a pass is made over the transaction data to check which of the candidates have sufficient support. This process is iterated for increasing k, until no large k-itemsets are found.
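
To make the growth-and-pruning stages concrete, here is a minimal sketch in Python (my own illustration, not the authors' code; itemsets are represented as sorted tuples):

```python
from itertools import combinations

def apriori_gen(large_prev):
    """Build candidate k-itemsets from large (k-1)-itemsets: join pairs that
    agree on everything but their last item, then prune any candidate with a
    (k-1)-subset that is not large."""
    large_prev = set(large_prev)
    candidates = set()
    for a in large_prev:
        for b in large_prev:
            # Join step: identical prefixes, a's last item before b's.
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                c = a + (b[-1],)
                # Prune step: every (k-1)-subset must itself be large.
                if all(s in large_prev for s in combinations(c, len(c) - 1)):
                    candidates.add(c)
    return candidates

# Example: apriori_gen({(1, 2), (1, 3), (2, 3)}) yields {(1, 2, 3)}.
```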

A limitation of the paper is the incomplete analysis of the Apriori algorithm's performance. There is no formal derivation of the order of growth of the algorithm's runtime, and the description of the hash table implementation may be insufficient for such a result to be found. In addition, the algorithm is tested only on synthetic data, generated from a model designed by the authors. It would be helpful to know more about the algorithm's performance on real data.


Review 2

The paper sheds light on two algorithms for discovering association rules between items in a large database of sales transactions. These two algorithms were developed to make such mining much faster. For cross-marketing, organizations need to sift through very large databases of items to suggest related items; to speed up this process, fast algorithms are imperative. The existing algorithms, AIS and SETM, suffer from large overheads resulting from the unnecessary generation and counting of candidate itemsets that turn out to be small. These problems are addressed by the new algorithms.

The paper describes two algorithms in detail, namely the Apriori algorithm and the AprioriTid algorithm. In Apriori, candidate itemsets are generated using the large itemsets of the previous pass, without considering the transactions in the database. A generated itemset having any small subset is deleted, leaving only the true candidates. This takes advantage of the fact that any subset of a frequent itemset must also be frequent. In AprioriTid, the candidate itemsets are generated the same way, but another set C' is maintained, containing for each transaction its TID and the candidate large itemsets present in that transaction. It makes use of the fact that the number of entries in C' may be smaller than the number of transactions in the database, especially in the later passes. Apriori does better than AprioriTid in the earlier passes, but AprioriTid does better in the later passes.

The paper discusses the two algorithms in detail and provides a good comparison with the existing algorithms dealing with association rules between items as well. The relative performance graphs at the end corroborate the claim of better performance with the new algorithms, especially as the size of the problem increases.

The paper is mathematical in nature and doesn't explain some of the calculations it uses. The authors also didn't consider the quantities of the items bought in a transaction, which is required for many of today's transactional applications.



Review 3

What is the problem addressed?
Two new algorithms for discovering association rules between items in large databases of sales transactions are presented; empirical evidence shows that these algorithms outperform other known algorithms by factors of 3 to 10, and that they scale linearly with database size. Formally, the task is to mine, out of a database D, all association rules that have support and confidence greater than a user-specified minimum support and minimum confidence.
Why important?
Association rules are "if-then rules" with two measures, support and confidence, which quantify the strength of a rule for a given data set. Having their origin in market basket analysis, association rules are now one of the most popular tools in data mining. This algorithm speeds up the process of discovering association rules; in case you want to know how many people who buy toothpaste also buy suntan lotion at the same time, this is the algorithm to use.

1-2 main technical contributions? Describe.
Algorithms for discovering large itemsets make multiple passes over the data. In the first pass, they count the support of individual items and determine which of them are large, i.e., have minimum support. Each subsequent pass starts with a seed set of itemsets found to be large in the previous pass. The seed set is used to generate new potentially large itemsets, called candidate itemsets, and the actual support for these candidate itemsets is counted during the pass over the data. At the end of the pass, the candidate itemsets that are actually large are determined, and they become the seed for the next pass. This process continues until no new large itemsets are found.
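
As a rough illustration of this pass structure (my own sketch with a pluggable candidate-generation step, not the paper's implementation):

```python
from collections import Counter

def find_large_itemsets(transactions, minsup, gen_candidates):
    """Level-wise search: count 1-itemsets on the first pass, then repeatedly
    generate candidates from the previous pass's large itemsets and count
    them in one pass over the data."""
    counts = Counter((item,) for t in transactions for item in t)
    large = {s for s, n in counts.items() if n >= minsup}
    all_large = set(large)
    while large:
        candidates = gen_candidates(large)  # e.g. a join-and-prune step
        counts = Counter()
        for t in transactions:
            tset = set(t)
            counts.update(c for c in candidates if set(c) <= tset)
        large = {c for c in candidates if counts[c] >= minsup}
        all_large |= large
    return all_large
```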

The magic of Apriori is in how candidates are generated and counted. In Apriori, the candidate itemsets to be counted are generated using only the itemsets found large in the previous pass, without considering the transactions in the database. The two algorithms differ in their method of generating all large itemsets.
1-2 weaknesses or open questions? Describe and discuss
The algorithm itself is hard to criticize. However, there is some criticism of support and confidence as measures: correlation and causality are different things, and an association rule can sometimes be nonsense. An alternative is to consider constraint-based mining.



Review 4

This paper presents new fast algorithms (Apriori, AprioriTid, AprioriHybrid) for mining association rules. To be more specific, the algorithms are used to generate all itemsets that have transaction support larger than a user-specified minimum support. The paper presents a detailed performance evaluation in which the new algorithms outperform the known algorithms (AIS, SETM) by factors ranging from 3 to more than an order of magnitude. Also, the combined algorithm AprioriHybrid scales up well in terms of the number of transactions, transaction size, and the number of items in the database. The paper first discusses the background of mining association rules. Then it introduces the Apriori and AprioriTid algorithms, along with a performance comparison against the known algorithms. Then it introduces the combined algorithm AprioriHybrid, which merges Apriori and AprioriTid, together with its performance. Finally, it presents scale-up experiments on the AprioriHybrid algorithm.

The problem here comes from mining association rules. Given a set of transactions, the problem of mining association rules requires generating all association rules that satisfy user-specified constraints, and the first step of the solution is to find all possible large itemsets. There are several algorithms for finding all possible large itemsets (AIS, SETM), but their performance is not good and they don't scale up well. Therefore, this paper presents the new algorithms Apriori and AprioriTid for generating all large itemsets, as well as the combined algorithm AprioriHybrid.

The major contribution of the paper is that the new algorithms provide better performance than previously known algorithms, by an order of magnitude for large problems. The AprioriHybrid algorithm also has a great ability to scale up. The paper provides many detailed examples to illustrate how the algorithms work, and a detailed performance analysis in comparison with other known algorithms. I will be presenting this paper, and I found it good at presenting the new algorithms in a way that makes them easy to understand.

One interesting observation: this paper presents several fast algorithms which seem to work well in their experiments. However, I'm not sure the method they used to generate the experimental datasets actually stands in for real problems. I think it would be better if they applied the algorithms to databases from real problems and compared the results with the known algorithms. Also, they didn't mention how the weaknesses of the combined algorithm might affect real problems.


Review 5

The paper presents two algorithms, Apriori and AprioriTid, for discovering all significant association rules between items in a large database. For the problem of discovering association rules in a large database of sales transactions, previous algorithms such as AIS and SETM both generate candidate sets on the fly; however, AIS generates too many candidates that turn out to be small, while SETM is limited by the size of its candidate sets. A more efficient algorithm was needed, and thus the paper proposes Apriori and AprioriTid, and later AprioriHybrid, which is the combination of the two.

Apriori and AprioriTid generate the candidate itemsets to be counted in a pass by using only the itemsets found large in the previous pass, with the intuition that any subset of a large itemset must be large. This ensures that a much smaller number of candidate itemsets is generated. Moreover, for AprioriTid, the database is not used at all for counting the support of candidate itemsets after the first pass; instead, an encoding of the candidate itemsets from the previous pass is used. This saves time when the size of the encoding becomes much smaller than the database. Both algorithms are shown to outperform the traditional ones and to have great scale-up properties. The AprioriHybrid algorithm is then introduced, with the idea that not all passes need to use the same algorithm, so there is a choice between Apriori for the earlier passes and AprioriTid for the later ones. Experimental results show that AprioriHybrid scales linearly with the number of transactions.

The largest contribution of the paper is introducing two algorithms for discovering association rules in a large database, which perform well compared to traditional algorithms. The combination, AprioriHybrid, performs even better and scales up well with both the number of transactions and the number of items in the database.

One drawback of AprioriHybrid is that the transition from Apriori to AprioriTid involves a cost, and this may affect its performance compared to running Apriori alone. However, on large datasets it generally does better than the other two. Another limitation is that quantities of the items are not considered.



Review 6

This paper addresses the issue of finding patterns within a database. The authors focus on finding association rules, for example, rules such as: if a customer buys diapers, they are more likely to buy baby food. By being able to accurately and quickly predict which other items customers will buy based on their past purchases, businesses can implement custom information-based marketing strategies. The authors propose two algorithms, Apriori and AprioriTID, along with a hybrid of the two, which they show to be several times faster than existing algorithms at itemset counting, while offering linear scalability.

The goal of the algorithms is to find all sets of items that have more than the minimum support. Both algorithms make multiple passes over the data to compute these itemsets. The AprioriTID algorithm differs from the Apriori algorithm in that it uses data encoded from the previous pass in the current round, while the plain algorithm must scan the database to determine the support for each candidate. This can lead to a dramatic speedup in later rounds. Because of this, the authors realized that they could combine the two algorithms: use Apriori at first, and then switch to AprioriTID in later rounds, once the candidate itemsets are expected to fit in memory.

This paper does a good job of presenting its algorithms and making a strong case for the hybrid algorithm. These rules require no supervision, so they can be mined offline and then presented to humans to verify and extend. The authors also did a great job providing evidence to back up their findings; more so than any other paper we've read so far, this one had plenty of data and graphs showing the results.

One noticeable weakness of the techniques presented in this paper is that they will not work for numerical data. Apriori and its variants work by looking for the existence of items, not at the values of the items. One would have to segment the data into bins, which wouldn't be a great solution.


Review 7

The problem put forward in the paper is that, with the development of bar-code technology, finding the association rules between items in a large database of sales transactions is needed for cross-marketing and attached-mailing applications, so a fast algorithm is needed. In this paper, two algorithms called Apriori and AprioriTid are proposed to solve the problem.

The Apriori algorithm first counts item occurrences to determine the large 1-itemsets. Then, for every subsequent pass k, it generates the candidate itemsets from the large itemsets of pass k-1. The generation function joins the previous large itemsets, deletes the candidates having any (k-1)-subset that is not large, and keeps the rest as candidates. The algorithm must then efficiently determine which candidates are contained in each given transaction. In the experimental results, it has better performance than AIS, SETM, and AprioriTid most of the time.

The AprioriTid algorithm also uses the generation function, but it does not count item occurrences by reading the database. A set C̄_k is used as an alternative, and C̄_k does not have an entry for a transaction that contains no candidate k-itemset. In AprioriTid, every candidate itemset is assigned a unique ID.
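
A minimal sketch of one such pass, assuming C̄_{k-1} is represented as a dict from TID to the set of (k-1)-candidate tuples contained in that transaction (my own representation, not the paper's):

```python
from collections import Counter

def apriori_tid_pass(cbar_prev, candidates, minsup):
    """One AprioriTid counting pass. cbar_prev maps each TID to the set of
    (k-1)-candidates in that transaction; the database itself is never
    touched. Returns the large k-itemsets and the new encoding."""
    counts = Counter()
    cbar_k = {}
    for tid, prev_sets in cbar_prev.items():
        present = set()
        for c in candidates:
            # c is in the transaction iff both (k-1)-itemsets whose join
            # produced c were present in the previous pass's entry.
            if c[:-1] in prev_sets and (c[:-2] + (c[-1],)) in prev_sets:
                counts[c] += 1
                present.add(c)
        if present:                # transactions with no candidates drop
            cbar_k[tid] = present  # out, so the encoding shrinks each pass
    large = {c for c in candidates if counts[c] >= minsup}
    return large, cbar_k
```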

The strength of the paper is that it provides detailed pseudocode and several examples to help in understanding the algorithms, which makes it very accessible to the reader. As for technical strength, according to the experimental results the algorithms proposed in the paper improve performance greatly compared to SETM and AIS, and the improvement becomes larger as the minimum support decreases.

The weakness is that the performance experiments use only one machine with particular settings. It is hard to say whether this is the best circumstance for the AIS and SETM algorithms; the experiments should be performed on different kinds of machines, too.


Review 8

The paper introduces the AprioriHybrid algorithm, a fast algorithm for mining association rules that is a combination of the authors' Apriori and AprioriTid algorithms. The paper presents the motivation behind a good, efficient algorithm for mining association rules, details the Apriori and AprioriTid algorithms, and evaluates the performance and benefit of the combined AprioriHybrid algorithm.

Mining associations is an important part of data analysis, especially in the marketing industry, and an efficient algorithm that scales well with data size is crucial. The paper decomposes the problem of finding all association rules into two parts:
1) Finding all sets of items (large itemsets) that have transaction support above minimum support, where support is the number of transactions that contain a particular itemset.
2) Given the large itemsets, generate the desired rules based on this set.
The algorithms presented in this paper focus on the first part, finding all large itemsets, and the basic idea behind the two algorithms is that any subset of a large itemset must also be large, meaning an itemset of size k can be constructed by joining itemsets of size k-1. The algorithm that applies this principle in a straightforward manner to obtain the possible candidate itemsets is the Apriori algorithm, which goes as follows:
1) Get the large itemset of k=1
2) Generate the possible itemsets for k based on k-1, then count the supports
3) Filter the candidates in 2) to get large itemsets for k.
4) Repeat steps 2-3 until no new large itemsets are found.
AprioriTid is an enhanced version of Apriori that eliminates the need to refer to the database when counting supports in the passes beyond the first pass of k=1.

The AprioriHybrid algorithm is a combination of Apriori and AprioriTid, where Apriori is used in the early passes and AprioriTid in the later passes. AprioriTid performs well at high pass numbers because once the candidate itemsets get small enough, they can fit inside main memory.

The algorithms presented in this paper improve performance by reducing the required candidate space through the closure properties of itemsets. Furthermore, the linear scalability with respect to database size is extremely promising and shows good results. One point of concern is how well the findings and performance results scale to modern systems, as the workstation used to evaluate the algorithms is severely outdated by today's standards.



Review 9

This paper proposes the Apriori family of algorithms for mining association rules in a database of transactions. In the Apriori algorithm, the sets containing one frequent individual item are built in the initial run, and sets with more items are later extended from the initial sets if they are sufficiently frequent among the transactions. The paper also introduces the AprioriTid and AprioriHybrid algorithms, which have the overhead of caching the transaction information from the database but may not need to scan the whole database in the following runs.

The Apriori algorithm is simple but effective. The approach is similar to bottom-up dynamic programming. It uses the fact that frequent itemsets contain only frequent subsets. In each run, it extends the size of the candidate itemsets by one by joining the previous candidate itemsets, and the resulting infrequent itemsets are pruned by counting itemset occurrences over the transactions in the database.

The Apriori approach is effective but still has some drawbacks in the search for association rules:
1. It generates a large candidate set in each run and then prunes it to get the frequent itemsets. The candidate sets (suppose many large transactions have exactly the same items) can be very large, which is time-consuming and costly.
2. In each run, it needs to scan the whole database. Though AprioriTid and AprioriHybrid help reduce the number of database scans, the whole transaction data still needs to be scanned in the worst case.


Review 10

Problem/Summary:

Often a company will want to discover associations in their data. For example, a retail store may want to know which items are commonly bought together. However, since the data warehouses used to store purchase records may be very large, there is a need for efficient algorithms to perform this analysis. This paper presents a fast algorithm for discovering association rules called the Apriori algorithm.

In a database of purchase records, each record represents a transaction of one or more items. An interesting association would be a set of items that is a strong predictor for another, disjoint set of items. The paper describes two steps in the general method to find these associations: 1) find sets of items (called large itemsets) that are frequently bought together (i.e., found together in a certain number of transactions), and 2) partition these sets so that one partition is a strong predictor of the other. This paper focuses on the first step only.

The Apriori algorithm finds large itemsets by building up from sets of size 1. To build large itemsets of size k, it uses the fact that any large itemset of size k must be the union of two large itemsets of size k-1. By taking all possible unions, it forms a candidate set of itemsets, and tests each itemset's support in the database of transactions. Other algorithms do not use this method of generating candidates, and so are much less efficient.
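
A toy end-to-end run of this union-and-test idea, on made-up basket data with a minimum support of 2 transactions (illustrative only):

```python
from itertools import combinations
from collections import Counter

transactions = [{"apple", "orange"},
                {"apple", "orange", "desk"},
                {"desk", "chair"},
                {"apple", "desk", "chair"}]
minsup = 2

# Pass 1: large single items.
item_counts = Counter(i for t in transactions for i in t)
large1 = {i for i, n in item_counts.items() if n >= minsup}

# Pass 2: pairs built from large singles, tested against each transaction.
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(large1 & t), 2):
        pair_counts[pair] += 1
large2 = {p for p, n in pair_counts.items() if n >= minsup}
print(large2)  # {('apple', 'desk'), ('apple', 'orange'), ('chair', 'desk')}
```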

Strengths:

The examples and pseudocode are surprisingly clear. This paper shows that its algorithm is obviously better than AIS and SETM.

Weaknesses/open questions:

Some sections seem strangely placed. For example, why is the AIS/SETM algorithm described in the performance section and not in the introduction?

The support metric for a set A is expressed as a plain number, without reference to the popularity of individual items, which seems a little strange. For example, two pairs of items commonly bought together are: Apples with oranges, and office chairs with desks. These are both strong associations. However, many more people will buy both apples and oranges vs office chairs and desks. This means that even though these are both strong associations, the support for the apples-oranges group will be much higher because they are more popular.



Review 11

Due to new advances in bar code technology, retailers are able to collect large amounts of data concerning customer transactions. With this data, retailers want to discover association rules between items (i.e., if a customer purchases X, a set of items, then they are likely to also purchase Y, a different set of items). The problem of mining association rules is to discover all the association rules in which the confidence and support of each rule are greater than user-specified minimums. This paper presents two new algorithms for mining association rules, Apriori and AprioriTid, as well as AprioriHybrid, which is a combination of both.

Both algorithms make an initial pass over the data to determine the items that have minimum support. Unlike previous work in the field, Apriori and AprioriTid generate candidates without consulting the data: the key observation is that any subset of a large itemset is also large, so the candidate itemsets can be generated by combining the large itemsets and then removing candidates that cannot be large. Apriori counts candidates by rescanning the database, whereas AprioriTid uses an encoding of the results of the previous pass, which may improve performance once the encoding becomes smaller than the database.

Based on the test results, both algorithms outperform previous work, especially as the size of the database increases. However, the paper notes that AprioriTid only beats Apriori when the candidate sets generated fit in memory; otherwise, Apriori is better. To take advantage of this observation, the paper introduces AprioriHybrid. The basic idea is that the Apriori algorithm is used until it is predicted that the resulting candidate sets will fit in main memory; afterwards, AprioriTid is used.
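
A sketch of what such a switching test could look like (the size estimate loosely follows the paper's description; the byte constant and function shape are my assumptions):

```python
def should_switch_to_apriori_tid(candidate_counts, num_transactions,
                                 memory_budget_bytes, entry_bytes=8):
    """Estimate the size the encoded set C-bar_k would have had in this
    pass (one entry per candidate occurrence, plus one TID slot per
    transaction) and switch once that estimate fits in memory. The paper
    additionally checks that the number of candidates is shrinking; that
    test is omitted here."""
    estimated_entries = sum(candidate_counts.values()) + num_transactions
    return estimated_entries * entry_bytes < memory_budget_bytes
```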

The algorithms in this paper were created in order to more efficiently discover associations between itemsets within a large amount of transactional data. At the end of the paper, two possible extensions of the work are discussed: providing support for multiple taxonomies over items, and having the algorithms also consider item quantity in their association discovery. I think the paper should also have looked into similar applications of discovering association rules. Some examples I can think of are movie preferences, group memberships, and travel patterns. If you had moviegoer data, you could try to discover association rules for questions like: What types of movies do users like? If a user sees movies X, Y, and Z, what other movies are they likely to see? Will movie X be successful based on its genre or cast? I feel like the algorithms used in the paper could be used in more than just a retail setting.


Review 12

This paper introduces Apriori and AprioriTid, two algorithms for mining association rules from "basket data" or any other general data; a combined algorithm AprioriHybrid is also presented. The experiment results show that they all outperform previous algorithms.

The idea of the algorithms is probabilistic. Two thresholds decide whether to keep an association rule: minimum support and minimum confidence. The Apriori algorithm constructs the "large" (above minimum support) itemsets starting from one item and incrementally adding other items. The AprioriTid algorithm does the same thing; however, it does not use the entire dataset when counting support after the first pass.

In the experiments, these algorithms outperform conventional algorithms by more than an order of magnitude. Furthermore, the AprioriHybrid algorithm combining Apriori and AprioriTid scales linearly with the number of transactions, which is probably the main strength of this algorithm.



Review 13

This paper discusses the discovery of association rules in large databases. This is important in applications such as sales transactions. The paper proposes AprioriHybrid, an algorithm with the ability to scale up with respect to the number of transactions and the number of items in the database, as the best solution to finding associations.

When given a set of transactions, mining association rules is concerned with generating the rules that have greater support and confidence than user-provided minimum support (minsup) and minimum confidence (minconf). Apriori and AprioriTid build candidate itemsets using only the itemsets that were large in the previous pass. Apriori derives from the intuition that if an itemset is large, its subsets are likewise large. Thus, candidate itemsets with k items are generated by joining large itemsets with k-1 items and deleting those with subsets that are not large; the result is a narrowing of the candidate pool. AprioriTid builds upon what Apriori does, and then adds the property that the database is excluded from the counting of candidate itemsets after the first pass. Reading effort is saved because the encoding can be much smaller than the database in later passes. Apriori performs better in earlier passes and AprioriTid performs better in later passes (because the number of candidate itemsets shrinks, but Apriori still scans every transaction in the database). Thus, AprioriHybrid is born, using Apriori in the beginning passes and then AprioriTid in the later passes, once the candidate set can fit in memory at the end of a pass. There is a cost when switching from Apriori to AprioriTid. In the scale-up experiments, AprioriHybrid was found to scale linearly with the number of transactions. The execution time even decreases slightly as the number of items in the database increases. Future work includes finding rules that consider quantities of items involved in transactions, and rules that use hierarchies of multiple taxonomies.


The drawbacks of this paper are that the paragraphs describing the algorithm are very dense and hard to follow. Perhaps some graphic visualization would’ve been helpful in understanding what the blocks of text are saying. The pseudo-code is also very hard to read and follow.


Review 14

This paper is about fast algorithms for mining association rules. It is from 1994 and was published in the Proceedings of the 20th VLDB Conference. The paper discusses applications to marketing, such as catalog design, cross-marketing, and add-on sales. It covers previous work using the AIS and SETM algorithms for rule mining, then develops two new algorithms named Apriori and AprioriTid and describes their speed improvements over the previous methods. The authors then explain that these two methods can be combined to create an AprioriHybrid method that outperforms both.

The strengths of this paper lie in its description of the algorithms and its empirical evaluation of them. The algorithms are clearly defined with good examples; in particular, Figure 3 was very helpful in understanding AprioriTid. The paper varies the number of transactions, average size of transactions, average size of maximal large itemsets, number of large itemsets, and number of total items, for a thorough evaluation of how these algorithms perform. The paper successfully establishes the algorithms as superior to the previous AIS and SETM algorithms.

You could consider the lack of evaluation on real data sets a weakness of the paper, since the purpose of the algorithms described is to market products; for this reason it would be very important to see how the algorithms compare on real data sets. The paper does, however, have one big weakness that is especially relevant today. The authors discuss the differences between Apriori and AprioriTid and the reason for combining them in section 3.6. They state that AprioriTid does better when the set of candidates being considered fits in main memory. We've seen this assumption in other papers, and in today's applications it is not always the case that we need to write things to disk; if we don't, then the AprioriHybrid algorithm is not needed and we could just use AprioriTid. This paper is over 20 years old, but it still could have provided some analysis of memory usage over time. A graph or two about this in section 3.7 would have strengthened the paper.


Review 15

Part 1: Overview

This paper proposes two algorithms for mining association rules in large databases of sales transactions and combines them into a new algorithm, called AprioriHybrid, whose cost grows linearly in terms of relative execution time. The foundation of this research is the tracing of sales transactions via bar-codes in supermarkets and stores. In previous work, the AIS and SETM algorithms were proposed to find association rules in sales data. This paper presents the Apriori and AprioriTid algorithms and a third algorithm that is the combination of the two. They discover large itemsets in a novel way: unlike SETM and AIS, they shrink the candidate pool by using the fact that a large itemset can only come from large itemsets of the previous pass, i.e., that any subset of a large itemset must also be large.

The Apriori algorithm uses the large sets L from the last pass to generate the next round of candidate sets C. Then it counts the supports of the candidates and selects those that exceed the threshold. Apriori and AprioriTid actually share the same generation function for the candidate sets; however, AprioriTid uses a more elaborate method to count candidates in each round. The authors implement hash trees to store candidate sets and use a subset function that traverses down the tree and generates the desired set of references. Key-value pairs are used to associate transaction IDs with itemset IDs, which leads to good scalability of the proposed algorithms.
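
For intuition, here is a much-simplified sketch of such a tree (my own construction, not the paper's exact data structure: real hash trees hash items into buckets, while this version branches on the item itself):

```python
LEAF_CAPACITY = 2

class Node:
    def __init__(self):
        self.children = {}   # item -> child Node (interior nodes)
        self.itemsets = []   # candidate itemsets (leaf nodes)
        self.is_leaf = True

def insert(node, itemset, depth=0):
    """Insert a sorted candidate tuple; split a leaf that grows too big."""
    if node.is_leaf:
        node.itemsets.append(itemset)
        if len(node.itemsets) > LEAF_CAPACITY and depth < len(itemset):
            node.is_leaf = False
            for s in node.itemsets:
                _push(node, s, depth)
            node.itemsets = []
    else:
        _push(node, itemset, depth)

def _push(node, itemset, depth):
    child = node.children.setdefault(itemset[depth], Node())
    insert(child, itemset, depth + 1)

def subset(node, transaction, start=0, found=None):
    """Collect every stored candidate contained in the sorted transaction."""
    if found is None:
        found = []
    if node.is_leaf:
        tset = set(transaction)
        found.extend(s for s in node.itemsets if set(s) <= tset)
    else:
        for i in range(start, len(transaction)):
            child = node.children.get(transaction[i])
            if child is not None:
                subset(child, transaction, i + 1, found)
    return found

root = Node()
for c in [(1, 2), (1, 3), (2, 4)]:
    insert(root, c)
print(subset(root, (1, 2, 3)))   # [(1, 2), (1, 3)]
```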

Part 2: Contributions

This paper contributes two algorithms for finding association rules, specifically large itemsets, in large volumes of sales transaction data. The algorithms outperform the existing algorithms in time cost and scalability. To ensure scalability, a key-value data structure stores the transaction and itemset IDs. The new algorithms save much time compared to the previous ones; the authors even had to abort some SETM runs out of time considerations.

They also combine their two algorithms into one, which beats the Apriori algorithm in almost all aspects of the experimental evaluation.

Part 3: Possible Drawbacks

(1) They mimic the transaction data of a retail environment; using real-world data might be better. (2) The scale of large itemsets may vary, and this may affect the selection of large itemsets.



Review 16

The paper discusses different algorithms for finding important association rules in a database. Finding all association rules in a very large database is time-consuming, so fast algorithms are important. The first algorithm proposed in the paper is called Apriori. It iteratively finds candidate itemsets (sets of items) until no new large itemsets (those with at least minimum support) are found, generating candidate itemsets using only the itemsets found large in the previous pass. An itemset is considered large if it has a support value greater than or equal to the minimum support value. The basic idea of the algorithm is that any subset of a large itemset must be large. In this way, the candidate itemsets having m items can be generated by joining large itemsets having m-1 items and deleting those that contain any subset that is not large. The second algorithm, called AprioriTid, has the same approach as Apriori with one additional property: the database is not used at all for counting the support of candidate itemsets after the first pass. Instead, an encoding of the candidate itemsets from the previous pass is used for this purpose. This avoids reading from disk, since the size of the encoding can be smaller than the set of transactions in the database that would otherwise be scanned.

The paper points out that while Apriori does better in the earlier passes, AprioriTid beats Apriori in the later passes. The reason is that while both use the same candidate generation procedure, in the later passes the number of candidate itemsets shrinks. In that case, while Apriori still examines every transaction in the database, AprioriTid scans the encoded set, which becomes smaller than the set of transactions in the database and fits in memory; AprioriTid thus avoids the cost of reading from disk. Based on this observation, the paper suggests a hybrid algorithm called AprioriHybrid. This algorithm uses Apriori in the initial passes and switches to AprioriTid when it expects the encoded set to fit in memory.

The main strength of this paper is its technique for pruning candidate itemsets with infrequent subsets. I like the generalization that any subset of a large itemset must be large, so itemsets that contain small itemsets cannot be large. In addition, I like how the better features of the two algorithms were combined to further speed up the process, and the authors give detailed explanations of their algorithms, including an experimental evaluation against other algorithms.

The main drawback of the paper is that the evaluation was done a very long time ago. The size of the data and the computing infrastructure don't reflect today's reality: the authors considered data sets smaller than 10 MB and ran their experiments on a 33 MHz processor with 64 MB of main memory and a 2 GB hard disk. It could therefore be a good idea to re-evaluate these algorithms on today's server platforms and data sizes. Nonetheless, these algorithms laid the groundwork for newer association rule mining algorithms.


Review 17

A new algorithm, AprioriHybrid, is introduced in this paper for mining association rules in large databases of sales transactions. It is developed as the combination of two new algorithms and maintains linear growth of cost in terms of relative time. The test sections show that the new algorithms always outperformed the old ones.

Databases for bar-code retail item lookup are an important piece of marketing infrastructure. Successful organizations are interested in instituting information-driven marketing processes, managed by database technology, that enable marketers to develop and implement customized marketing programs and strategies. As previous work, we have the AIS and SETM algorithms for finding association rules in sales data. For comparison, the paper presents the Apriori and AprioriTid algorithms.

The overview of the algorithm is as follows:
1. discover the large itemsets
2. shrink the candidate pool
(by assuming a large itemset can only come from large itemsets of the previous pass)

The problem of discovering all association rules is decomposed into two subproblems:
1. find all sets of items that have transaction support above minimum support
a. make multiple passes over the data
b. use the seed from the previous pass to generate new potentially large itemsets (candidate itemsets)
c. at the end of the pass, determine which of the candidate itemsets are actually large
2. use the large itemsets to generate the desired rules

===strength===
The paper provides a novel algorithm to handle the mining problem. Detailed and solid examples are also provided. The test results reasonably show that the algorithm outperforms the old ones.

===weakness===
Testing using synthetic data is less convincing. It would be better if they could apply real-world data to the tests.


Review 18

This paper describes several new algorithms for mining association rules. Two of the algorithms are called Apriori and AprioriTid. They differ in that AprioriTid does not need to scan the database after the first scan. These two algorithms can also be combined into another algorithm, called AprioriHybrid.

The paper first begins by defining what an association rule is. An association rule says that two sets of items X and Y often appear in the same transactions. It must satisfy: 1) X ∪ Y appears often enough (support is high), and 2) if X appears, then Y appears with high probability (confidence is high).
Then the two algorithms are described. The first one is Apriori. It computes large itemsets in a bottom-up fashion: first itemsets with a single item are generated, then itemsets with two items, and so on. This requires multiple passes over the database. So the authors developed another version of their algorithm called AprioriTid, which scans the database only once and uses information recorded during that scan to compute the other itemsets. I think this is the key part of this paper.
After describing the two algorithms, Apriori and AprioriTid are compared to the previous algorithms AIS and SETM. The authors found that their algorithms outperform AIS and SETM, and that the gap becomes larger as the problem scales.

As a paper presenting novel algorithms, its main contribution is providing several new, fast algorithms for mining association rules. I think the technique described in AprioriTid is also very interesting.


What confused me about this paper at the beginning is the definition of confidence. If all sets that satisfy minimum support are generated, then after dividing a set into X and Y, it seems that the rule will always have a confidence of 100%. I spent a long time before figuring out that this is not the case, so I think it would help if the paper pointed this out directly.
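
A tiny made-up example of this point: {beer} and {beer, chips} can both meet minimum support, yet the rule beer => chips has confidence well under 100% because beer also appears without chips:

```python
# Made-up data: beer is in all 4 baskets, {beer, chips} in only 2 of them.
transactions = [{"beer", "chips"}, {"beer", "chips"}, {"beer"}, {"beer"}]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

print(support({"beer", "chips"}) / support({"beer"}))   # 0.5, not 1.0
```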



Review 19

This paper describes the famous Apriori algorithms that help mine databases to discover association rules between different objects in the database.

Two terms specific to association mining algorithms are confidence and support. The rule X => Y holds in a database D with confidence c if c% of the transactions in D that contain X also contain Y. The rule has support s in the transaction set D if s% of the transactions in D contain X ∪ Y. The aim of the algorithm is to generate all association rules that have confidence and support greater than the minimum values specified by the user.
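
A brute-force sketch of the rule-generation step these definitions imply (my own illustration; `support` is assumed to be a dict of counts keyed by frozenset, and since every subset of a large itemset is also large, each needed entry exists):

```python
from itertools import combinations

def gen_rules(itemset, support, minconf):
    """Emit every rule X => (itemset - X) whose confidence
    support(itemset) / support(X) meets minconf."""
    l = frozenset(itemset)
    rules = []
    for r in range(1, len(l)):
        for xs in combinations(sorted(l), r):
            x = frozenset(xs)
            conf = support[l] / support[x]
            if conf >= minconf:
                rules.append((set(x), set(l - x), conf))
    return rules
```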

The Apriori algorithms are better than existing algorithms such as SETM because they reduce the number of generated candidate itemsets that turn out to be small. The Apriori algorithms operate on the idea that every subset of a large itemset must itself be large. The candidate itemsets having k items can be generated by joining large itemsets having k-1 items, and deleting those that contain any subset that is not large. AprioriTid additionally does not use the database to count the support of candidate itemsets after the first pass, which is a great move for efficiency.

Since Apriori is better for the first few passes and AprioriTid is better for the later ones, the algorithm for the experiments is implemented in such a way that if the candidate itemset C_k can fit in memory at the end of the first few passes, Apriori switches to AprioriTid for more efficiency; the authors call this AprioriHybrid. The results are promising in terms of the scalability of the algorithm to bigger databases.

A few of the disadvantages I see are that with a fairly large database with many candidate itemsets, the intermediate results will not fit in memory, so the algorithm will take longer to return a valid result. Since Apriori operates on large itemsets, scanning is an overhead on each pass, and it requires a full database scan on every pass, which is definitely an inefficiency. Considering it was one of the first important data-mining algorithms, leading to many later improvements, this paper definitely seems significant in terms of the ideas proposed.



Review 20

Fast Algorithms for Mining Association Rules

This paper introduces three fast algorithms, i.e., Apriori, AprioriTid, and AprioriHybrid, for computing the large itemsets that satisfy a user-specified minimum support, and conducts performance tests against the traditional algorithms as well as scale-up experiments.

The problem of mining associations rule can be decomposed into two parts:

1. Find all sets of items (itemsets) that have transaction support above minimum support
2. Use the large itemsets to generate the desired rules

This paper mainly discusses algorithms for solving the first problem. The new algorithm Apriori generates large itemsets from size 1 (L1) up to the largest possible (Lk), and combines them to get the final answer. Apriori works as follows:

1. get the large 1-itemsets L1
2. iterate until no larger itemsets can be found: generate new candidate itemsets Ck by joining Lk-1, then obtain the count for each Ck[i] by comparing against each transaction, producing Lk

AprioriTid uses the same algorithm to generate the candidate itemsets Ck, but instead of naively scanning each transaction, it exploits C̄_k to get the counts for verification. C̄_k is an inverted index from each TID to all the candidates of Ck contained in that transaction. In this way, each Ck[i] is only compared with the transactions it actually appears in; in other words, AprioriTid avoids invalid comparisons in the phase of getting the count for each candidate.

The last algorithm, AprioriHybrid, exploits the benefits of both AprioriTid and Apriori. Since C̄_k can be very large in the first few passes, the time spent writing it to disk would be a bottleneck. AprioriHybrid therefore uses Apriori in the first few passes and switches to AprioriTid when the size of C̄_k is small enough to be stored in memory, boosting the later computation at the one-time overhead of generating the first C̄_k at the switch.

According to the performance tests, the three new algorithms outperform the old ones, AIS and SETM, by a factor of three for small problems up to more than an order of magnitude for larger problems.

However, there are still some drawbacks in this paper:

1. In the AprioriHybrid algorithm, the switching overhead is quite significant, and when there are no larger itemsets left in the solution, the overhead can slow the overall running time. No heuristic is put forward to predict whether more large itemsets remain.

2. In the AprioriTid algorithm, the use of C̄_{k-1} can also produce unnecessary computation. Since the algorithm uses C̄_{k-1} to vouch for the existence of a new candidate itemset in Ct, chances are that the two (k-1)-itemsets from which Ct[i] decomposes have just been pruned, and hence that specific Ct[i] should not be validated again. A suggestion is to create an inverted index from TID to all itemsets in Lk, call it L̄_k, to replace the currently used C̄_{k-1}; the chances of revalidating a small itemset would thereby be reduced.

And of course there are also some great strengths to these three algorithms:

1. The AprioriHybrid algorithm scales linearly with the growth in the number of transactions and grows gradually with the increase in average transaction size.
2. The Apriori algorithm uses a very generative way of calculating the candidate itemsets, which largely overcomes the bottleneck of AIS's enumerative method of calculating new candidate itemsets and hence keeps the search space relatively small.





Review 21

Association rules are very important commercial information, and mining association rules over basket data is a very interesting topic. As basket databases are very large, calculating the association rules should be fast. This paper describes algorithms for discovering association rules between items in a large database of sales.
An association rule has two parameters. One is support, which guarantees that the associated sets appear frequently in the database, because it is not sound to conclude rules from a small sample. The other is confidence, which measures the degree to which the presence of X implies the presence of Y.
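
Stated formally (the standard definitions, written out here for reference, with D the set of transactions):

```latex
\mathrm{support}(X \Rightarrow Y) = \frac{|\{\, t \in D : X \cup Y \subseteq t \,\}|}{|D|},
\qquad
\mathrm{confidence}(X \Rightarrow Y) = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}
```
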
The old algorithms, AIS and SETM, have a disadvantage in generating itemsets: they generate the itemsets on the fly while reading the whole database, and keep many itemsets that turn out to be small. The algorithm introduced by the paper only considers itemsets generated from the previous pass's large itemsets, so the candidate sets get smaller and smaller.
The AprioriTid algorithm optimizes this further by not reading the database after the first pass; instead, it uses an encoding of the candidate itemsets from the previous pass. This method is better when the encoding can fit in memory.
AprioriHybrid combines the two, using the encoding table only once it fits in memory. This algorithm scales linearly with the number of transactions.
In the end, the paper uses a suite of experiments to show that these algorithms perform better than previous algorithms.

Strength:
This paper introduces algorithms that deal with association rule mining. It first points out the weaknesses of previous algorithms and then introduces the new algorithms in detail. The algorithms use a dynamic-programming-like idea and have better scalability than previous ones, and the authors also introduce AprioriHybrid, which combines the two strategies to provide better performance.

Weakness:
These algorithms provide better performance than AIS or SETM. But the algorithms the paper introduces will read the whole database more than once, or read the encoding table more than once, whereas AIS and SETM need to do so only once; this gives the database a heavier read workload.


Review 22

This paper looks at the problem of identifying associations between sales records in relational databases. The motivation is primarily to identify cross-market relationships useful for marketing purposes. The problem of mining association rules can be broken into two distinct steps. The first is to find the itemsets that are contained in a user-specified number of transactions (i.e., having minimum support), which are called large itemsets. The second step generates the association rules from these itemsets. The first step is the focus of the paper. Unlike previous algorithms, the two algorithms introduced in the paper (Apriori and AprioriTid) only consider itemsets built from those deemed large in the previous pass; the key intuition here is that a subset of a large itemset is itself large. Both algorithms outperform prior work. Interestingly, the explanation for why the SETM algorithm (prior work) performed so badly was its memory usage: the size of its candidate sets (maintaining TID associations) was large, forcing the data to spill over to disk, which performs orders of magnitude slower. The authors observed that Apriori outperforms AprioriTid in the initial passes because it needs to maintain less information, but in later passes the amount of information becomes small enough that AprioriTid performs better. They devised AprioriHybrid, an algorithm that switches from Apriori to AprioriTid when appropriate to maximize performance. This is supported by their experiments, where they find that AprioriHybrid performs roughly on par with Apriori or better.

Though this paper was theoretical, I appreciated that a thorough empirical evaluation was done. In addition to running experiments, the authors also provided explanations as to why the results came out as they did (namely, why certain algorithms performed poorly). One issue I had with the paper was that I thought some of the figures were cluttered; reorganizing these would make them easier to read.


Review 23

This paper proposes new, fast algorithms that can find associations between items in large sets of data. Suppose we have data on who buys auto parts, who buys bike parts, and who buys TVs. If we wanted to find the percentage of people who buy all three items with plain SQL, we would have to do full scans and matching for each item, but the AprioriHybrid algorithm scales linearly and performs much faster.

To find these associations, two main algorithms are proposed: Apriori and AprioriTid. In both algorithms, we make multiple passes over the data. In the first pass, we determine which items have minimum support. Then, in each subsequent pass, we use the sets found in the previous pass as a seed and generate more possible large sets. We continue through the data until no more sets are found. The difference between Apriori and AprioriTid is that AprioriTid does not use the database after the first pass.

When measuring performance, we find that Apriori and AprioriTid perform differently in different circumstances. Apriori beats AprioriTid when the encoded sets C̄_k cannot fit into memory; otherwise, AprioriTid performs better. Therefore, we can use the AprioriHybrid algorithm, which switches from Apriori to AprioriTid to gain the best of both worlds. Because switching from one algorithm to the other is tricky, two passes are used to make the switch: the first pass in the switch adds the IDs of the itemsets found to C̄_{k+1}.

I think that the paper does do a good job of creating new ways to find associations between items in a database, but there are some weaknesses with the paper:

1. The paper is very hard to follow, and few examples were given. I would have liked to see real-world examples of when these algorithms would be used and what the results are when run on real-world data.
2. In the AprioriHybrid section, it explains a heuristic that decides if the next pass will fit into memory so that we know if we can switch from Apriori to AprioriTid. Is there a way to know if the set size will decrease so that we can just run Apriori without the overhead of checking after each pass?



Review 24

This paper discusses the Apriori and AprioriTid algorithms as improvements over existing algorithms for mining association rules from large databases of transactions. Database mining is an important field of study within database systems; for example, correlations between purchased products influence the way retailers market products and the demographics they target. An association rule X => Y between disjoint sets of items X and Y holds with confidence c if c% of the transactions that contain X also contain Y, and with support s if s% of all transactions in the database contain both sets. Finding statistically significant correlations between sets of items is an exponentially large problem.

While previous algorithms for finding association rules generated candidate itemsets during numerous passes over the entire database of transactions, the Apriori algorithm makes a counting pass over the database and then uses logical inferences about which supersets of the "large" itemsets discovered so far can also be large itemsets. In order to generate the candidates for the k-itemsets, the generation function takes the union of each pair of large (k-1)-itemsets whose first k-2 elements are the same, then prunes candidates containing any subset of size k-1 that is not itself large. This allows the algorithm to generate only candidates that can possibly be large itemsets; these candidates are then checked against the database to see whether they are, in fact, large. This function was found to outperform SETM and AIS by up to an order of magnitude in large cases, since it generates a much more selective set of candidates than the other two methods, which generate candidates on the fly as they read through the database repeatedly.

AprioriTid is a variation of the Apriori algorithm that reads through the database only once. It stores a set of tuples consisting of a transaction ID and the potentially large itemsets present in that transaction. The algorithm generates each successive level of candidates by passing over this set of tuples rather than over the database; therefore, it outperforms even the Apriori algorithm once the set of candidates becomes smaller than the database. The AprioriHybrid algorithm runs the Apriori algorithm up to the point where the next set of candidates is expected to fit in memory, at which point it switches over to AprioriTid.

My chief complaint about this paper was that they didn’t further discuss the necessity of user guidance in defining which associations are interesting or useful. Not every itemset that has the necessary support and confidence is necessarily interesting. I would have liked to see more explanation of where the user is involved in the process. For example, performance might be affected differently if a user only gives input about which associations are interesting before the mining algorithm runs than if a user is able to pick and choose interesting associations from sets of candidates at intermediate stages of the algorithm.



Review 25

This paper is an overview and explanation of the Apriori and AprioriTid algorithms (combined to make AprioriHybrid). These algorithms are used to find association rules in large datasets more quickly than was previously possible with AIS and SETM.

The thing that sets Apriori apart is that it is significantly better at discovering itemsets than AIS and SETM. Discovering itemsets requires making multiple passes over the data, which, as you can imagine, is an expensive process. Apriori is better than AIS and SETM because AIS and SETM generate candidate itemsets on the fly, whereas Apriori generates them from the large itemsets of the previous pass. This allows it to perform less computation and increase speed.

This paper goes into the nitty-gritty of the algorithms and is very mathematical in its derivations. One of the strengths of this paper, though, was the numeric examples it gave to show concretely how Apriori works. They were very helpful for understanding the algorithm and necessary for people like me who don't do well with pseudocode and math derivations.

In the results section, as you would expect, Apriori performed much better than SETM and AIS, particularly as minimum support decreased. It performed better at all levels, but a significant difference wasn't noticeable until minimum support fell below 1% in most cases. I would have liked a little more discussion of how common a minimum support below 1% is in practice and whether this upside will ever be useful. Also, I think the graphs were hard to read due to the legends being very similar for all 3-4 algorithms; it would have been helpful to distinguish between the algorithms with a larger symbol than a small diamond, +, x, or o (this is a minor thing, however).

The final part of the paper discusses the combination of AprioriTid and Apriori into the AprioriHybrid algorithm, which, as you might guess, outperforms both AprioriTid and Apriori. It also scales up better than all the other algorithms.

In general, this was a solid paper, though algorithm papers never really captivate me. I hope Apriori was as beneficial as the paper makes it seem and that it was adopted in practice.


Review 26

How are items in a database associated? Many organizations nowadays keep track of their transactional history, but how can we take this one step further and derive insights from the collected information? By modeling dependencies between attributes through frequent patterns/associations, we can learn rules that correlate the existence of one set of items with other, potentially unknown sets of items. For example, a learned rule might read: "90% of people who do X also do Y."
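A tiny worked example (invented data) makes the two underlying measures concrete; for the rule {bread} -> {butter}, support is the fraction of all transactions containing both items, and confidence is the fraction of bread-containing transactions that also contain butter:

    transactions = [
        {"bread", "butter", "milk"},
        {"bread", "butter"},
        {"bread", "jam"},
        {"butter", "milk"},
        {"bread", "butter", "jam"},
    ]

    both = sum(1 for t in transactions if {"bread", "butter"} <= t)
    bread = sum(1 for t in transactions if "bread" in t)

    support = both / len(transactions)  # 3/5 = 0.60
    confidence = both / bread           # 3/4 = 0.75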

The fundamental idea behind the algorithm relies on an a priori closure property: if a set of items occurs frequently, i.e., has sufficient support (support being the frequency with which the items co-occur in the same transaction), then all of its subsets must occur frequently as well. By making multiple "passes" over the data to obtain the large itemsets (those whose support exceeds a user-specified minimum), we can incrementally build up all large itemsets. The Apriori algorithm scans the transactions to count only the candidate sets that could possibly be large given the previous pass.
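A compact sketch of that level-wise loop (a simplified illustration, not the paper's pseudocode; gen_candidates stands in for the join-and-prune generation step, and minsup_count is the user's minimum support expressed as a transaction count):

    def apriori(transactions, minsup_count, gen_candidates):
        """Return all large itemsets, represented as sorted tuples."""
        transactions = [frozenset(t) for t in transactions]
        # Pass 1: count individual items directly.
        counts = {}
        for t in transactions:
            for item in t:
                counts[(item,)] = counts.get((item,), 0) + 1
        large = {c for c, n in counts.items() if n >= minsup_count}
        all_large = set(large)
        while large:
            candidates = gen_candidates(large)
            counts = {c: 0 for c in candidates}
            for t in transactions:  # one pass over the data per level
                for c in candidates:
                    if t.issuperset(c):
                        counts[c] += 1
            large = {c for c, n in counts.items() if n >= minsup_count}
            all_large |= large
        return all_large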

It seems that Apriori and AprioriTid each have performance bottlenecks at different "passes" of the algorithm, and the AprioriHybrid combination accounts for that. However, I am curious about a more fundamental question: are the minimum-support and minimum-confidence thresholds limiting in the rules they find? Perhaps there are converse relationships or more complex associations that are ignored when determining rules, and would this approach carry over to other methods such as clustering, or more advanced methods such as classification?


Review 27

This paper presents two new algorithms for discovering association rules between objects in databases of sales transactions. The algorithms presented are very efficient for answering questions such as how many people purchase sweaters and snow boots at the same time.

The algorithms in the paper solve the following problem: mine from the system all association rules whose confidence and support exceed a user-defined minimum. The Apriori algorithm presented is novel because of the way candidates are both generated and counted. The candidate itemsets to be counted are generated using the itemsets found large in the previous pass; the transactions in the database are not consulted during candidate generation.

The largest flaw in this paper is its lack of detail regarding resource requirements. Nowhere in the paper are specifics given for things like disk, CPU, or memory requirements. Without such specifications, it is hard to know in what kind of system these algorithms will actually be performant. Another issue is with the synthetic workload that was tested: the results seem very good, but it is hard to estimate how realistic they are.


Review 28

This paper presents new algorithms for mining association rules and shows that they are faster than previous algorithms by comparing their performance to the SETM and AIS algorithms. Along the way, the paper also proposes a hybrid version of the new algorithms.

In order to efficiently scan the "basket data" and discover the association rules, a fast algorithm is needed. The problem of finding association rules can be divided into two sub-problems: (1) find all sets of items that have transaction support above the minimum support (the large itemsets); and (2) use the large itemsets to generate the desired rules. The proposed algorithms make multiple passes over the data. The first pass counts individual items to find the large 1-itemsets, and each subsequent pass starts with a seed of itemsets found to be large in the previous pass. At the end of each pass, the algorithms determine which of the candidate sets are actually large, and these become the seed for the next pass, continuing until no new large itemsets are found. The paper then explains the details of the Apriori and AprioriTid algorithms; AprioriTid uses the database only for the first pass, counting later passes against a transformed candidate set instead. The paper goes on with a performance comparison (SETM vs. AIS vs. Apriori vs. AprioriTid), in which the new algorithms are faster. Apriori is generally faster than AprioriTid, but the paper also introduces a hybrid algorithm: AprioriHybrid exploits the advantages of both Apriori and AprioriTid. It first runs Apriori, and once the transformed candidate sets fit into memory it switches to AprioriTid, although the switch incurs an extra cost. Last, the paper runs scale-up experiments comparing the three algorithms.
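Sub-problem (2) is comparatively simple; a naive sketch (the paper gives a faster rule-generation procedure, so this is only an illustration) might look as follows, assuming counts maps every large itemset, stored as a sorted tuple, to its support count:

    from itertools import combinations

    def gen_rules(large_itemsets, counts, minconf):
        """Emit rules X -> Y from each large itemset l = X union Y
        whose confidence support(l) / support(X) meets minconf."""
        rules = []
        for l in large_itemsets:
            for r in range(1, len(l)):
                for x in combinations(l, r):
                    y = tuple(i for i in l if i not in x)
                    # Every subset of a large itemset is large, so
                    # counts[x] is guaranteed to exist.
                    conf = counts[l] / counts[x]
                    if conf >= minconf:
                        rules.append((x, y, conf))
        return rules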

One of the contributions of this paper is the presentation of new algorithms for mining association rules: Apriori and AprioriTid (and the hybrid version, AprioriHybrid). Compared to the two previous algorithms, SETM and AIS, the new algorithms make candidate generation more efficient by exploiting the fact that any subset of a large itemset must itself be large, thereby avoiding the scanning of far too many candidates, as happens in SETM and AIS. AprioriTid even scans the transaction database only on the first pass, leaving the remaining passes to operate on the candidate sets alone. These algorithms are therefore more efficient.

However, even then, the algorithms ultimately still rely on the multiple-pass technique. Since they scan the whole database, the first pass still yields a large number of candidate sets, and subsequent passes are still required (either over the transaction database or over the candidate sets). Also, AprioriTid seems to rely heavily on memory, which is why its performance degrades on larger datasets (where the whole transformed candidate set does not fit in memory).



Review 29

The purpose of this paper is to provide faster algorithms for computing all of the association rules present in a large database. The paper states that, with the rise of barcode systems and databases that store transactions based on these barcodes, association rules will become increasingly important to monitor in order to determine sales patterns. Keep in mind that this paper was published in 1994.

The main technical contribution of this paper is its algorithms. The paper presents two novel algorithms, Apriori and AprioriTid, that offer orders-of-magnitude performance improvements on large datasets as well as increased scalability. Another main technical contribution is the development of these algorithms together with the formal arguments for their correctness and the formal analysis behind them. After viewing the experimental results, which are (admittedly painfully) detailed, the authors produce a hybrid algorithm that combines the two methods to exploit their strengths, resulting in an even more efficient final algorithm. I think the formal parts of this paper are a main contribution, because some of the results could be used toward further refinement of the algorithms in future work. I also think a solid technical contribution comes from the detailed discussion of the synthetic data generation techniques; I can see methods like this being very useful for future work.

I think the strength of this paper is its thoroughness. The paper is VERY detailed. Every minute detail is presented, and though I appreciate the attention to detail, and am very, very convinced of the validity and correctness of the work because of it, I do think this translates into the one weakness I found with the paper.

I find the discussion in the results section a little lacking. The reader is blasted with a lot of different graphs, and while the general trends in the graphs are discussed, I'm not sure the discussion is sufficiently detailed given the extreme level of detail in the rest of the paper. Because everything else is justified (even the existing algorithms are walked through, with their downfalls and disadvantages explicitly mentioned), it is hard to find many weaknesses in the technical contributions of this paper.



Review 30

Review: Fast Algorithms for Mining Association Rules

Paper Summary:
This paper proposes new algorithms for finding association rules efficiently. Association rules are valuable in many profitable business activities such as marketing. However, given the large amount of basket data, finding all desired association rules can be time-consuming; that is the motivation of this paper. The proposed algorithms are claimed to be orders of magnitude faster on large problems and about three times faster on small ones.

Paper Review:
The paper is strong in the sense that the proposed algorithms show a significant improvement in execution time over the baseline algorithms. One concern is that the experiments in this paper are conducted mostly on synthetic data, so one may argue that the algorithms' performance on real-world data is not clearly revealed. Another issue is that, for a paper proposing a new algorithm, especially one claimed to be faster than existing methods, it would be ideal to provide a comparison in big-O notation, which would be a strong theoretical justification for the advantages of the proposed method.



Review 31

This paper introduces two main algorithms: the Apriori algorithm and the AprioriTid algorithm. Both can be used to mine significant association rules among items in large databases. Moreover, a combination of the two, the AprioriHybrid algorithm, can perform better than either, since it absorbs the best parts of both.
The paper splits the problem of finding association rules into two sub-problems: the first is to find all sets of items with transaction support above the minimum support, and the second is to generate the rules from the itemsets found in sub-problem 1. Compared to their predecessors, the AIS and SETM algorithms (the first of which creates many useless small itemsets, while the second is very sensitive to growing transaction sizes), the Apriori and AprioriTid algorithms perform better because they prune out any itemset that contains a small subset. Moreover, the AprioriTid algorithm maintains a set in memory and keeps updating it, which improves performance by saving disk I/O.

Strengths:
1. This paper proposes novel algorithms for mining association rules on large-scale datasets. It gives a detailed introduction as well as a thorough comparison with previously existing algorithms. Moreover, it takes scalability and efficiency into consideration and demonstrates convincingly why the proposed algorithms are more efficient and scalable.
2. This paper gives a thorough and detailed introduction to the background and applications of the proposed algorithms, which helps readers fully understand the related concepts.


Weaknesses:
1. The paper compares its proposed algorithms with previously existing algorithms in terms of performance; however, it would be clearer and more convincing if it also compared the correctness of the results, or, in machine-learning terminology, their recall and precision.
2. Although the paper conducts solid experiments in its evaluation section, it would be more interesting if it also experimented on real-world datasets to show the full potential of the algorithms.



Review 32

To tackle the problem of discovering all significant association rules between items in a large database, the authors present two algorithms, Apriori and AprioriTid. Compared to the earlier AIS and SETM algorithms, the proposed algorithms greatly improve performance, and the performance gap grows with problem size, ranging from 3x to an order of magnitude. Consider a list of items I and a list of transactions D, where each transaction contains a set of items T. An association rule is an implication X -> Y, where X and Y are subsets of I with no common items. To find the association rules, two steps are taken: the first is to discover the large itemsets, those contained in a number of transactions exceeding a threshold s; the second is to use the large itemsets to generate the desired rules. The paper focuses on the first part.

The main advantage of this paper is that it gives a fast way of generating large itemsets. The two proposed algorithms differ from AIS and SETM in which candidate itemsets are counted and how those candidates are generated. In the AIS and SETM algorithms, candidate itemsets are generated on the fly during a pass, as the data is being read: after reading a transaction, the algorithm determines which large itemsets it contains and generates new candidate itemsets by extending those large itemsets with other items in the transaction, filtering out sets that fail the minimum support at the end of the pass. The Apriori and AprioriTid algorithms instead generate the candidate itemsets to be counted in a pass using only the itemsets found large in the previous pass, without considering the transactions in the database, based on the intuition that any subset of a large itemset must be large. This procedure generates a much smaller number of candidate itemsets.
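For contrast, a loose sketch of the on-the-fly style of generation described here (my own simplification of the idea, not the actual AIS or SETM pseudocode):

    def extend_on_the_fly(transaction, large_prev):
        """AIS-style step: extend each large (k-1)-itemset found in the
        transaction with every later item of that transaction, with no
        check that the result could possibly be large."""
        items = sorted(transaction)
        candidates = set()
        for l in large_prev:
            if set(l) <= set(items):
                for item in items:
                    if item > l[-1]:
                        candidates.add(l + (item,))
        return candidates

Because the extension is driven by whatever items happen to co-occur in each transaction, many of these candidates turn out to be small, which is exactly the overhead that Apriori's generate-then-prune approach avoids.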

Another advantage of this paper is the combination of Apriori and AprioriTid into AprioriHybrid. Based on the fact that AprioriTid only outperforms Apriori once the candidate sets have been pruned enough to fit in memory, it is desirable to switch between the two algorithms when there is enough incentive to do so.

One drawback of this paper is its use of a synthetic dataset: the whole evaluation is built on it, and the authors do not provide enough argument for how the results generalize to real-world data.