|
HyperLogLog is an algorithm for the count-distinct problem, approximating the number of distinct elements in a multiset. Calculating the ''exact'' cardinality of a multiset requires an amount of memory proportional to the cardinality, which is impractical for very large data sets. Probabilistic cardinality estimators, such as the HyperLogLog algorithm, use significantly less memory than this, at the cost of obtaining only an approximation of the cardinality. The HyperLogLog algorithm is able to estimate cardinalities of with a typical accuracy of 2%, using 1.5kB of memory.〔 HyperLogLog is an extension of the earlier LogLog algorithm. == Algorithm == The basis of the HyperLogLog algorithm is the observation that the cardinality of a multiset of uniformly-distributed random numbers can be estimated by calculating the maximum number of leading zeros in the binary representation of each number in the set. If the maximum number of leading zeros observed is , an estimate for the number of distinct elements in the set is .〔 In the HyperLogLog algorithm, a hash function is applied to each element in the original multiset, to obtain a multiset of uniformly-distributed random numbers with the same cardinality as the original multiset. The cardinality of this randomly-distributed set can then be estimated using the algorithm above. The simple estimate of cardinality obtained using the algorithm above has the disadvantage of a large variance. In the HyperLogLog algorithm, the variance is minimised by splitting the multiset into numerous subsets, calculating the maximum number of leading zeros in the numbers in each of these subsets, and using a harmonic mean to combine these estimates for each subset into an estimate of the cardinality of the whole set. 抄文引用元・出典: フリー百科事典『 ウィキペディア(Wikipedia)』 ■ウィキペディアで「HyperLogLog」の詳細全文を読む スポンサード リンク
|