HyperLogLog: A Probabilistic Algorithm for Cardinality Estimation

Updated on March 3, 2025, by Kishore Sahoo

In the world of big data, one of the most common challenges is efficiently counting unique elements in a large dataset. This problem arises in use cases like counting distinct visitors to a website, estimating the number of unique IP addresses accessing a server, or calculating the number of unique words in a massive text corpus.

The Challenge

Traditional approaches to counting unique elements often involve storing every element in memory, which becomes infeasible when working with large datasets. The need for efficient memory usage and fast computation leads us to probabilistic algorithms, and one of the most famous among them is HyperLogLog.

In this blog post, we will take a deep dive into the HyperLogLog algorithm, explore its mechanics, and walk through an example implementation to estimate cardinality.

What is HyperLogLog?

HyperLogLog is a probabilistic algorithm for estimating the cardinality of a multiset—essentially, the number of distinct elements in a dataset. It does this by using a small, fixed amount of memory and yielding an approximate count with high accuracy. HyperLogLog is especially useful when the dataset is large or when the elements are streaming in, making it difficult or impractical to keep track of every distinct element.

The primary advantage of HyperLogLog is its ability to provide accurate estimates with very low memory consumption, making it ideal for big data applications.

How HyperLogLog Works

Key Concepts

Hashing: HyperLogLog hashes each input element into a uniformly distributed hash value. Even very similar inputs end up spread evenly across the space of possible hash values, which makes the estimate independent of how the raw inputs are distributed.
Registers (Buckets): The algorithm uses an array of registers (buckets). Each register holds the largest rank (number of leading zeros plus one) seen among the hash values assigned to it. These registers are the core of the algorithm.
Leading Zeros: For each hash value, HyperLogLog counts the number of leading zeros in the binary representation; that count plus one is the rank of the hash value. A uniformly random bit string starts with at least k zeros with probability 2^-k, so a long run of leading zeros is rare, and seeing one suggests that many distinct items have been hashed.
Estimation: The algorithm combines the information from all registers to compute an estimate of the cardinality. A correction factor is applied to reduce bias and improve accuracy.
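The concepts above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes a 32-bit hash taken from the first four bytes of an MD5 digest and b = 4 index bits, and the helper name `index_and_rank` is hypothetical.

```python
import hashlib

HASH_BITS = 32  # assumed hash width for this sketch


def index_and_rank(item, b=4):
    """Split a 32-bit hash into a register index and a rank.

    The low b bits pick the register; the rank is the number of leading
    zeros in the remaining (32 - b) bits, plus one.
    """
    digest = hashlib.md5(item.encode("utf8")).digest()
    x = int.from_bytes(digest[:4], "big")        # 32-bit hash value
    index = x & ((1 << b) - 1)                   # low b bits -> register index
    w = x >> b                                   # remaining 32 - b bits
    rank = (HASH_BITS - b) - w.bit_length() + 1  # leading zeros + 1
    return index, rank


if __name__ == "__main__":
    for ip in ["192.168.1.1", "192.168.1.2", "192.168.1.3"]:
        print(ip, index_and_rank(ip))
```

Each register then only needs to remember the largest rank it has been handed, which is why the memory footprint stays fixed no matter how many items arrive.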

Steps of the Algorithm

Hashing: Each element is hashed into a large space of values using a hash function.
Register Assignment: Part of the hash value (for example, its lowest b bits) selects one of the m = 2^b registers. The remaining bits determine the rank, and the register keeps the largest rank it has seen so far.
Cardinality Estimation: The final estimate is computed from all registers as the harmonic mean of 2^(register value), scaled by the number of registers and a bias-correction constant.
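For reference, the combination step can be written as a formula. Here m is the number of registers, M_j is the value stored in register j, and alpha_m is the bias-correction constant; the closed-form approximation for alpha_m shown here is the one usually quoted for m >= 128.

```latex
E = \alpha_m \, m^2 \left( \sum_{j=1}^{m} 2^{-M_j} \right)^{-1},
\qquad
\alpha_m \approx \frac{0.7213}{1 + 1.079 / m}
```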

Why Use HyperLogLog?

Memory Efficiency: It provides significant memory savings compared to traditional methods. The memory usage is independent of the size of the dataset and only depends on the number of registers.
Accuracy: HyperLogLog estimates the cardinality with high accuracy when the number of registers is chosen appropriately. The relative standard error is about 1.04 / sqrt(m) for m registers: roughly 3.25% with 1,024 registers and about 0.81% with 16,384.
Speed: The algorithm is fast and works well with streaming data.
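To make the accuracy trade-off concrete, the short sketch below evaluates the usual 1.04 / sqrt(m) approximation for a few register counts. The function name `standard_error` is only illustrative.

```python
import math


def standard_error(b):
    """Approximate relative standard error for m = 2**b registers."""
    m = 2 ** b
    return 1.04 / math.sqrt(m)


if __name__ == "__main__":
    for b in (10, 12, 14, 16):
        m = 2 ** b
        print(f"b={b:2d}  m={m:6d}  standard error ~ {standard_error(b):.2%}")
```

Each extra two bits of b quadruples the number of registers and halves the expected error.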

Use Case: Estimating Unique Visitors to a Website

Let’s consider a real-world use case: estimating the number of unique visitors to a website over a period of time.

Problem

We want to know how many distinct users have visited the website, but we can’t store the information about every single user because there are millions of users each day. Storing every visitor’s information (like IP addresses) would be too expensive in terms of memory and storage.

Solution: HyperLogLog

We can apply the HyperLogLog algorithm to estimate the number of unique visitors. Each time a user visits the site, we hash their IP address and update the HyperLogLog data structure. After collecting data over a day, we can use HyperLogLog to estimate how many distinct users visited the site.

HyperLogLog: Code Example in Python

Step 1: Implementing HyperLogLog

We will now implement the HyperLogLog algorithm in Python. This code provides a simple yet efficient way to estimate cardinality using the HyperLogLog technique.

import hashlib
import math

class HyperLogLog:
    HASH_BITS = 32  # Width of the hash; the range corrections in count() assume 32 bits

    def __init__(self, b):
        """
        Initialize the HyperLogLog data structure.

        Parameters:
            b: The number of index bits. The structure uses m = 2**b registers,
               so a larger b gives a more accurate estimate but uses more memory.
               The alpha approximation below is intended for m >= 128 (b >= 7).
        """
        self.b = b
        self.m = 2 ** b  # Number of registers
        self.data = [0] * self.m  # Registers (array of zeros)
        self.alphaMM = (0.7213 / (1 + 1.079 / self.m)) * self.m * self.m  # alpha_m * m^2 (bias correction)

    def _hash(self, item):
        """ Hash the input to a 32-bit integer (the first 4 bytes of its MD5 digest) """
        digest = hashlib.md5(item.encode('utf8')).digest()
        return int.from_bytes(digest[:4], 'big')

    def _rho(self, w):
        """ Rank of w: the number of leading zeros in its (32 - b)-bit form, plus one """
        return (self.HASH_BITS - self.b) - w.bit_length() + 1

    def add(self, item):
        """ Add an element to the HyperLogLog structure """
        # Hash the item to a 32-bit integer
        x = self._hash(item)
        # Determine the register index by taking the lower b bits of the hash
        j = x & (self.m - 1)
        # Keep the largest rank seen so far among the remaining (32 - b) bits
        self.data[j] = max(self.data[j], self._rho(x >> self.b))

    def count(self):
        """ Estimate the number of unique elements """
        # Raw estimate: alpha_m * m^2 divided by the sum of 2^(-register)
        Z = 1.0 / sum(2.0 ** (-reg) for reg in self.data)
        E = self.alphaMM * Z

        # Small range: fall back to linear counting while some registers are still empty
        if E <= 2.5 * self.m:
            V = self.data.count(0)
            if V > 0:
                E = self.m * math.log(self.m / V)
        # Large range: compensate for collisions in the 32-bit hash space
        elif E > (1 / 30.0) * (2 ** 32):
            E = -(2 ** 32) * math.log(1 - E / (2 ** 32))

        return E

# Example Usage:
if __name__ == "__main__":
    hll = HyperLogLog(b=15)  # b=15 is a reasonable trade-off between accuracy and memory usage

    # Simulate adding data (e.g., unique IP addresses)
    data = ["192.168.1.1", "192.168.1.2", "192.168.1.1", "192.168.1.3", "192.168.1.2"]

    for ip in data:
        hll.add(ip)

    # Estimate the cardinality (number of unique IPs)
    estimated_cardinality = hll.count()

    print(f"Estimated cardinality: {estimated_cardinality}")

Step 2: Explanation

Initialization (`__init__`):

We initialize the number of registers (m = 2^b), where b is a configurable parameter that influences the accuracy and memory usage.

Hashing (`_hash`):

We use the MD5 hash function to convert input elements (e.g., IP addresses) into hash values.

Rank Calculation (`_rho`):

We calculate the rank of each hash value with the rho function: the number of leading zeros plus one. The rank decides whether the corresponding register needs to be updated.
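One common way to compute such a rank in Python is with `int.bit_length`. The sketch below is illustrative only: the standalone function `rho` and its `width` parameter are assumptions made for the example, and the inputs are 8-bit values chosen so the results can be checked by hand.

```python
def rho(w, width):
    """Leading zeros of w in a width-bit field, plus one.

    An all-zero field gets rank width + 1.
    """
    return width - w.bit_length() + 1


if __name__ == "__main__":
    # 8-bit examples, written in binary for readability
    print(rho(0b10000000, 8))  # leftmost bit set, no leading zeros -> 1
    print(rho(0b00010110, 8))  # three leading zeros -> 4
    print(rho(0b00000001, 8))  # seven leading zeros -> 8
    print(rho(0b00000000, 8))  # all zeros -> 9
```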

Adding Items (`add`):

The add method hashes the item, calculates the appropriate register, and updates the register with the maximum leading zeros observed.

Estimating Cardinality (`count`):

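The count method calculates the raw cardinality estimate using the HyperLogLog formula and its bias correction factor. When the raw estimate is at most 2.5 times the number of registers and some registers are still zero, it switches to linear counting; when the raw estimate exceeds 2^32 / 30, it applies a large-range correction for hash collisions.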
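Written out, the corrected estimate E* relates to the raw estimate E as follows, where m is the number of registers and V is the number of registers that are still zero. The constants assume a 32-bit hash space, as in the code.

```latex
E^{*} =
\begin{cases}
m \ln\!\left(\dfrac{m}{V}\right) & \text{if } E \le \tfrac{5}{2}\, m \text{ and } V > 0, \\[6pt]
-2^{32} \ln\!\left(1 - \dfrac{E}{2^{32}}\right) & \text{if } E > \tfrac{1}{30}\, 2^{32}, \\[6pt]
E & \text{otherwise.}
\end{cases}
```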

Step 3: Running the Code

When you run the code with the input data:

data = ["192.168.1.1", "192.168.1.2", "192.168.1.1", "192.168.1.3", "192.168.1.2"]

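The output should be very close to 3, along these lines (trailing digits omitted):

Estimated cardinality: 3.0001

This figure assumes a correct rank function and that each of the three addresses lands in its own register. With only three distinct items spread over 32,768 registers, count() uses the small-range correction m * ln(m / V) with V = 32,765 empty registers, which gives about 3.0001. That matches the actual number of distinct IP addresses in the dataset, which is 3.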
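Three addresses do not exercise the estimator much, so it can help to compare it against an exact count on a larger input. The following is a compact, self-contained sketch under stated assumptions: a 32-bit hash taken from an MD5 digest, b = 14, and synthetic visitor IDs. The function name `estimate_cardinality` is hypothetical.

```python
import hashlib
import math


def estimate_cardinality(items, b=14):
    """Compact HyperLogLog sketch: 32-bit hash, m = 2**b registers."""
    m = 1 << b
    registers = [0] * m
    for item in items:
        digest = hashlib.md5(item.encode("utf8")).digest()
        x = int.from_bytes(digest[:4], "big")        # 32-bit hash
        j = x & (m - 1)                              # register index
        rank = (32 - b) - (x >> b).bit_length() + 1  # leading zeros + 1
        registers[j] = max(registers[j], rank)

    alpha = 0.7213 / (1 + 1.079 / m)                 # intended for m >= 128
    estimate = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if estimate <= 2.5 * m and zeros > 0:            # small-range correction
        estimate = m * math.log(m / zeros)
    elif estimate > (2 ** 32) / 30:                  # large-range correction
        estimate = -(2 ** 32) * math.log(1 - estimate / 2 ** 32)
    return estimate


if __name__ == "__main__":
    # Synthetic visitor IDs: 200,000 distinct values, each seen twice
    visitors = [f"visitor-{i}" for i in range(200_000)] * 2
    exact = len(set(visitors))
    approx = estimate_cardinality(visitors)
    error = abs(approx - exact) / exact
    print(f"exact={exact}  estimate={approx:.0f}  error={error:.2%}")
```

With 16,384 registers the expected relative error is under 1%, and repeated visitors do not change the result because a register only ever keeps its maximum rank.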

Conclusion

HyperLogLog is a powerful probabilistic algorithm that provides an efficient and accurate way to estimate the cardinality of a dataset. By using a small, fixed amount of memory, it can handle massive datasets and streaming data, making it invaluable in real-time data processing and big data analytics.

Key Takeaways:

Efficiency: HyperLogLog uses very little memory compared to traditional methods.
Accuracy: With the right number of registers, HyperLogLog can provide very accurate estimates.
Speed: It works well for high-speed data streams or large datasets where exact counting is impractical.

In this post, we showed how HyperLogLog can be used to estimate the number of unique visitors to a website, but its applications extend to many areas, including network monitoring, database management, and data analytics.

Let us know if you have any questions or if you’d like to explore other data processing algorithms!

Kishore Sahoo

Kishore is based in Bangalore, the IT capital of India, and specialises in PHP frameworks and CMS platforms. He provides handcrafted websites and applications with WordPress or Laravel as the backend technology, and builds and deploys applications on the cloud. Specialties: Problem-Solving, Solution Architectures, Startup Initiative, MVP, WordPress, PHP
