blooms module

Lightweight Bloom filter data structure derived from the built-in bytearray type.

class blooms(*args, **kwargs)

Bases: bytearray

Bloom filter data structure with support for common operations such as insertion (using __imatmul__), membership (using __rmatmul__), union (using __or__), and containment (using issubset).

>>> b = blooms(4)

It is the responsibility of the user of the library to hash and truncate the bytes-like object being inserted. Only those bytes that remain after truncation contribute to the object’s membership within the instance.

>>> from hashlib import sha256
>>> x = 'abc' # Value to insert.
>>> h = sha256(x.encode()).digest() # Hash of value.
>>> t = h[:2] # Truncated hash.
>>> b @= t # Insert the value into the Bloom filter.
>>> b.hex()
'00000004'

When testing whether a bytes-like object is a member of an instance, the same hashing and truncation operations should be applied.

>>> sha256('abc'.encode()).digest()[:2] @ b
True
>>> sha256('xyz'.encode()).digest()[:2] @ b
False

A particular sequence of a hashing operation followed by a truncation operation can be encapsulated within a user-defined class derived from the blooms class, wherein the default insertion method __imatmul__ and membership method __rmatmul__ are overloaded. The static method specialize makes it possible to define such a derived class concisely (without resorting to Python’s class definition syntax).

For a given blooms instance, the saturation method returns a float value between 0.0 and 1.0 that is influenced by the number of bytes-like objects that have been inserted so far into that instance. This value represents an upper bound on the rate at which false positives will occur when testing bytes-like objects (of the specified length) for membership within the instance.

>>> b = blooms(32)
>>> from secrets import token_bytes
>>> for _ in range(8):
...     b @= token_bytes(4)
>>> b.saturation(4) < 0.1
True

It is also possible to use the capacity method to obtain an approximate maximum capacity of a blooms instance for a given saturation limit. For example, the output below indicates that a saturation of 0.05 will likely be reached after more than 28 insertions of bytes-like objects of length 8.

>>> b = blooms(32)
>>> b.capacity(8, 0.05)
28

LENGTH_MAX: int = 4294967296

Maximum permitted length for an instance.

__init__(*args, **kwargs)

Create and initialize a new blooms instance.

>>> b = blooms(1)
>>> b @= bytes([0])
>>> bytes([0]) @ b
True
>>> bytes([1]) @ b
False

Any approach for creating an instance of the built-in bytearray class can also be used to create an instance of this class.

>>> b = blooms(range(256))
>>> bytes([1]) @ b
False
>>> b = blooms(b'abc')
>>> bytes([0]) @ b
True

An instance can have any non-zero length up to LENGTH_MAX. This method checks that the instance has a valid length.

>>> b = blooms()
Traceback (most recent call last):
  ...
ValueError: instance must have an integer length greater than zero
>>> b = blooms(0)
Traceback (most recent call last):
  ...
ValueError: instance must have an integer length greater than zero
>>> b = blooms(256 ** 4 + 1)
Traceback (most recent call last):
  ...
ValueError: instance length cannot exceed 4294967296

__imatmul__(argument)

Insert a bytes-like object (or an iterable of bytes-like objects) into this instance.

Parameters:

argument (Union[bytes, bytearray, Iterable[Union[bytes, bytearray]]]) – Object or objects to insert into this instance.

Return type:

blooms

This method provides a concise way to insert objects into an instance. It modifies the instance on which it is invoked.

>>> b = blooms(100)
>>> b @= bytes([1, 2, 3])
>>> b = blooms(100)
>>> b @= (bytes([i, i + 1, i + 2]) for i in range(10))
>>> b = blooms(100)

Any attempt to insert an object that has an unsupported type raises an exception.

>>> b @= 123
Traceback (most recent call last):
  ...
TypeError: supplied argument must be a bytes-like object or an iterable
>>> b @= [bytes([4, 5, 6]), 123]
Traceback (most recent call last):
  ...
TypeError: item in supplied iterable must be a bytes-like object

Note that when an iterable is supplied, the effects of all successful insertions (that occurred before the exception) remain.

>>> bytes([4, 5, 6]) @ b
True
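
Because a failed insertion from an iterable leaves the earlier items in place, a caller that needs all-or-nothing behavior can validate the items before inserting them. The helper below is a hypothetical sketch (the name checked is not part of the library). It uses only the standard library and returns a list that could then be supplied on the right-hand side of @=.

```python
from typing import Iterable, List, Union

BytesLike = Union[bytes, bytearray]

def checked(items: Iterable[BytesLike]) -> List[BytesLike]:
    """Return the items as a list, raising TypeError if any is not bytes-like."""
    result = list(items)  # Consume generators once so the result can be reused.
    for item in result:
        if not isinstance(item, (bytes, bytearray)):
            raise TypeError('item in supplied iterable must be a bytes-like object')
    return result

# Hypothetical usage with an instance b: b @= checked(candidates)
good = checked(bytes([i, i + 1, i + 2]) for i in range(3))
print(len(good))  # 3
```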

__rmatmul__(argument)

Check whether a bytes-like object appears in this instance.

Parameters:

argument (Union[bytes, bytearray]) – Object to be used in querying this instance.

Return type:

bool

A blooms instance never returns a false negative when queried using this method, but may return a false positive.

>>> b = blooms(100)
>>> b @= bytes([1, 2, 3])
>>> bytes([1, 2, 3]) @ b
True
>>> bytes([4, 5, 6]) @ b
False
>>> b = blooms(1)
>>> b @= bytes([0])
>>> bytes([8]) @ b
True

The bytes-like object of length zero is a member of every blooms instance.

>>> b = blooms(1)
>>> bytes() @ b
True
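
The outputs above are consistent with a scheme in which each 4-byte chunk of the object is read as a little-endian integer that selects one bit of the underlying bytearray. The sketch below models that scheme using only the standard library. The function name bit_positions and the chunking details are assumptions inferred from the documented examples, not a statement of the library's internals.

```python
from typing import List, Tuple

def bit_positions(data: bytes, filter_length: int) -> List[Tuple[int, int]]:
    """Map each 4-byte chunk of data to a (byte index, bit index) pair.

    Assumed model: the chunk is a little-endian integer; its low three bits
    pick the bit, and the remaining bits (modulo the length) pick the byte.
    """
    positions = []
    for i in range(0, len(data), 4):
        index = int.from_bytes(data[i:i + 4], 'little')
        positions.append(((index // 8) % filter_length, index % 8))
    return positions

print(bit_positions(bytes([0]), 1))  # [(0, 0)]
print(bit_positions(bytes([8]), 1))  # [(0, 0)]
print(bit_positions(bytes([1]), 1))  # [(0, 1)]
print(bit_positions(bytes(), 1))     # []
```

Under this model, bytes([0]) and bytes([8]) select the same bit of a one-byte instance, which is why the latter is reported as a member. The empty object selects no bits at all, so its membership check succeeds vacuously.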

If the supplied argument is not a bytes-like object, an exception is raised.

>>> 123 @ b
Traceback (most recent call last):
  ...
TypeError: supplied argument must be a bytes-like object

__or__(other)

Return the union of this instance and another instance.

Parameters:

other (blooms) – Instance to use for the union operation.

Return type:

blooms

This method creates a new blooms instance based on two existing instances.

>>> b0 = blooms(100)
>>> b0 @= bytes([1, 2, 3])
>>> b1 = blooms(100)
>>> b1 @= bytes([4, 5, 6])
>>> bytes([1, 2, 3]) @ (b0 | b1)
True
>>> bytes([4, 5, 6]) @ (b0 | b1)
True
>>> b0 = blooms(100)
>>> b1 = blooms(200)

This operation is only defined on instances that have equivalent lengths.

>>> b0 | b1
Traceback (most recent call last):
  ...
ValueError: instances must have equivalent lengths
>>> b0 | 123
Traceback (most recent call last):
  ...
TypeError: supplied argument must be a blooms instance
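
Conceptually, the union of two Bloom filters of the same length is the bytewise OR of their bit arrays, which is also why the lengths must match. The following sketch shows that idea on plain bytearray objects. The helper name union_bits is hypothetical, and the code illustrates the general technique, not the library's implementation.

```python
def union_bits(left: bytearray, right: bytearray) -> bytearray:
    """Return the bytewise OR of two equal-length bit arrays."""
    if len(left) != len(right):
        raise ValueError('instances must have equivalent lengths')
    return bytearray(x | y for x, y in zip(left, right))

merged = union_bits(bytearray([0b0001, 0]), bytearray([0b0100, 0b1000]))
print(merged.hex())  # 0508
```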

issubset(other)

Determine whether this instance represents a subset of another instance.

Parameters:

other (blooms) – Instance for which to check the subset relationship.

Return type:

bool

Note that the subset relationship being checked is between the sets of all bytes-like objects that are accepted by each instance, regardless of whether they were explicitly inserted into an instance or not (i.e., all bytes-like objects that are false positives are considered to be members).

>>> b0 = blooms([0, 0, 1])
>>> b1 = blooms([0, 0, 3])
>>> b0.issubset(b1)
True
>>> b1.issubset(b0)
False
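
In bit-array terms, a natural way to check this relationship is to test whether every bit set in the first array is also set in the second. A minimal sketch of that check on plain bytearray objects follows. The name is_bit_subset is hypothetical, and this is an illustration under that assumption, not the library's code.

```python
def is_bit_subset(left: bytearray, right: bytearray) -> bool:
    """Return True if every bit set in left is also set in right."""
    if len(left) != len(right):
        raise ValueError('instances must have equivalent lengths')
    return all((x & y) == x for x, y in zip(left, right))

print(is_bit_subset(bytearray([0, 0, 1]), bytearray([0, 0, 3])))  # True
print(is_bit_subset(bytearray([0, 0, 3]), bytearray([0, 0, 1])))  # False
```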

This operation is only defined on instances that have equivalent lengths.

>>> b0 = blooms(100)
>>> b1 = blooms(200)
>>> b0.issubset(b1)
Traceback (most recent call last):
  ...
ValueError: instances must have equivalent lengths
>>> b0.issubset(123)
Traceback (most recent call last):
  ...
TypeError: supplied argument must be a blooms instance

classmethod from_base64(s)

Convert a Base64 UTF-8 string representation into an instance.

Parameters:

s (str) – Base64 UTF-8 string representation of an instance.

Return type:

blooms

This method creates a new instance based on the supplied string.

>>> b = blooms(100)
>>> b @= bytes([1, 2, 3])
>>> b = blooms.from_base64(b.to_base64())
>>> bytes([1, 2, 3]) @ b
True
>>> bytes([4, 5, 6]) @ b
False
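
Since an instance is a bytearray, its Base64 form can be pictured with the standard base64 module. The sketch below round-trips a plain bytearray. Whether the library uses exactly these calls, or the standard Base64 alphabet, is an assumption.

```python
import base64

raw = bytearray(4)
raw[3] |= 0b100  # Same contents as the '00000004' example.

# Encode the raw bytes as a Base64 string, then decode them again.
text = base64.standard_b64encode(raw).decode('utf-8')
restored = bytearray(base64.standard_b64decode(text))

print(text)             # AAAABA==
print(restored == raw)  # True
```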

If a non-string input is supplied, an exception is raised.

>>> blooms.from_base64(123)
Traceback (most recent call last):
  ...
TypeError: supplied argument must be a string

to_base64()

Convert this instance to a Base64 UTF-8 string representation.

Return type:

str

>>> isinstance(blooms(100).to_base64(), str)
True

saturation(length)

Return the approximate saturation of this instance as a value between 0.0 and 1.0 (assuming that all bytes-like objects that have been or will be inserted have the specified length).

Parameters:

length (int) – Length of bytes-like objects in queries.

Return type:

float

The approximation is an upper bound on the true saturation, and its accuracy degrades as the number of insertions approaches the value len(self) // 8.

>>> b = blooms(32)
>>> b.saturation(4)
0.0
>>> from secrets import token_bytes
>>> for _ in range(8):
...     b @= token_bytes(4)
>>> b.saturation(4) < 0.1
True
>>> b.saturation(-1)
Traceback (most recent call last):
  ...
ValueError: length must be nonnegative
>>> b.saturation('abc')
Traceback (most recent call last):
  ...
TypeError: length must be an integer

The saturation of an instance can be interpreted as an upper bound on the rate at which false positives can be expected when querying the instance with bytes-like objects that have the specified length.
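
One simple model that is consistent with the examples above treats saturation as the fraction of set bits, raised to the number of 4-byte chunks in a query of the given length (every chunk must hit a set bit for a false positive to occur). The sketch below implements that model for a plain bytearray. The function name and the 4-byte chunk size are assumptions, and the library's actual estimate may differ.

```python
from math import ceil

def estimated_saturation(bits: bytearray, length: int) -> float:
    """Estimate the false-positive rate for queries of the given length."""
    set_bits = sum(bin(byte).count('1') for byte in bits)
    fraction = set_bits / (8 * len(bits))
    return fraction ** ceil(length / 4)  # One bit lookup per 4-byte chunk.

bits = bytearray(32)
print(estimated_saturation(bits, 4))  # 0.0
for i in range(8):
    bits[i] |= 1  # Set eight distinct bits, as eight 4-byte insertions might.
print(estimated_saturation(bits, 4))  # 0.03125
```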

capacity(length, saturation)

Return this instance’s approximate capacity: the number of bytes-like objects of the specified length that can be inserted into an empty version of this instance before the specified saturation is likely to be reached.

Parameters:
  • length (int) – Length of bytes-like objects in queries.

  • saturation (float) – Saturation with respect to which to estimate capacity.

Return type:

Union[int, float]

This method is defined for nonnegative length and saturation values.

>>> b = blooms(32)
>>> b.capacity(8, 0.05)
28
>>> b.capacity(12, 0.05)
31
>>> b.capacity(-1, 0)
Traceback (most recent call last):
  ...
ValueError: length must be nonnegative
>>> b.capacity('abc', 0)
Traceback (most recent call last):
  ...
TypeError: length must be an integer
>>> b.capacity(0, -1)
Traceback (most recent call last):
  ...
ValueError: saturation must be nonnegative
>>> b.capacity(0, 'abc')
Traceback (most recent call last):
  ...
TypeError: saturation must be an integer or a floating-point number

The capacity of an instance is not bounded for a saturation of 1.0 or for bytes-like objects of length zero.

>>> b.capacity(0, 0.1)
inf
>>> b.capacity(4, 1.0)
inf

Note that capacity is independent of the number of insertions into this instance that have occurred. It is the responsibility of the user to keep track of the number of bytes-like objects that have been inserted into an instance.
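
The values 28 and 31 shown above can be reproduced by a simple estimate. Take the k-th root of the saturation limit, where k is the number of 4-byte chunks per object, to obtain the tolerated fraction of set bits, and then divide the corresponding number of bits by k (each insertion sets at most k bits). The sketch below is an assumption inferred from the documented outputs. The name estimated_capacity is hypothetical, and the library may compute this value differently.

```python
from math import ceil, floor, inf
from typing import Union

def estimated_capacity(
    filter_length: int, length: int, saturation: float
) -> Union[int, float]:
    """Estimate how many objects fit before the saturation limit is reached."""
    if length == 0 or saturation >= 1.0:
        return inf  # No finite bound in these cases.
    chunks = ceil(length / 4)              # Bits set per insertion (at most).
    fraction = saturation ** (1 / chunks)  # Tolerated fraction of set bits.
    return floor(fraction * 8 * filter_length / chunks)

print(estimated_capacity(32, 8, 0.05))   # 28
print(estimated_capacity(32, 12, 0.05))  # 31
print(estimated_capacity(32, 4, 1.0))    # inf
```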

static specialize(name, encode)

Return a class derived from blooms that uses the supplied encoding for members.

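Parameters:
  • name (str) – Name of the derived class.

  • encode (Callable) – Function that encodes a bytes-like object before it is inserted or queried.

Return type: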

type

The supplied encoding function must accept one bytes-like object as an input and must return a bytes-like object as an output.

>>> from hashlib import sha256
>>> encode = lambda x: sha256(x).digest()[:2]
>>> blooms_custom = blooms.specialize(name='blooms_custom', encode=encode)
>>> b = blooms_custom(4)
>>> b @= bytes([1, 2, 3])
>>> bytes([1, 2, 3]) @ b
True
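
Because the encoding function receives and returns bytes-like objects, values of other types (such as str) must be converted to bytes before they are inserted or queried. The helper below is a hypothetical sketch of an encoding function written with def that could be supplied as the encode argument. The name encode_sha256 and the default 2-byte truncation mirror the example above and are not prescribed by the library.

```python
from hashlib import sha256
from typing import Union

def encode_sha256(value: Union[bytes, bytearray], length: int = 2) -> bytes:
    """Hash a bytes-like value with SHA-256 and keep the first `length` bytes."""
    return sha256(value).digest()[:length]

print(encode_sha256(b'abc').hex())           # ba78
print(encode_sha256('abc'.encode()).hex())   # ba78
print(encode_sha256(b'abc', 4).hex())        # ba7816bf
```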