blooms module
Lightweight Bloom filter data structure derived from the built-in
bytearray
type.
- class blooms.blooms.blooms(*args, **kwargs)[source]
Bases:
bytearray
Bloom filter data structure with support for common operations such as insertion (using
__imatmul__
), membership (using__rmatmul__
), union (using__or__
), and containment (usingissubset
).>>> b = blooms(4)
It is the responsibility of the user of the library to hash and truncate the bytes-like object being inserted. Only those bytes that remain after truncation contribute to the object’s membership within the instance.
>>> from hashlib import sha256 >>> x = 'abc' # Value to insert. >>> h = sha256(x.encode()).digest() # Hash of value. >>> t = h[:2] # Truncated hash. >>> b @= t # Insert the value into the Bloom filter. >>> b.hex() '00000004'
When testing whether a bytes-like object is a member of an instance, the same hashing and truncation operations should be applied.
>>> sha256('abc'.encode()).digest()[:2] @ b True >>> sha256('xyz'.encode()).digest()[:2] @ b False
A particular sequence of a hashing operation followed by a truncation operation can be encapsulated within a user-defined class derived from the
blooms
class, wherein the default insertion method__imatmul__
and membership method__rmatmul__
are overloaded. The static methodspecialize
makes it possible to define such a derived class concisely (without resorting to Python’s class definition syntax).For a given
blooms
instance, thesaturation
method returns afloat
value between0.0
and1.0
that is influenced by the number of bytes-like objects that have been inserted so far into that instance. This value represents an upper bound on the rate with which false positives will occur when testing bytes-like objects (of the specified length) for membership within the instance.>>> b = blooms(32) >>> from secrets import token_bytes >>> for _ in range(8): ... b @= token_bytes(4) >>> b.saturation(4) < 0.1 True
It is also possible to use the
capacity
method to obtain an approximate maximum capacity of ablooms
instance for a given saturation limit. For example, the output below indicates that a saturation of0.05
will likely be reached after more than28
insertions of bytes-like objects of length8
:>>> b = blooms(32) >>> b.capacity(8, 0.05) 28
- __init__(*args, **kwargs)[source]
Create and initialize a new
blooms
instance. An instance can be of any non-zero length.>>> b = blooms(1) >>> b @= bytes([0]) >>> bytes([0]) @ b True >>> bytes([1]) @ b False
This method checks that the instance has a valid size before permitting its creation.
>>> b = blooms() Traceback (most recent call last): ... ValueError: instance must have an integer length greater than zero >>> b = blooms(0) Traceback (most recent call last): ... ValueError: instance must have an integer length greater than zero >>> b = blooms(256**4 + 1) Traceback (most recent call last): ... ValueError: instance length cannot exceed 4294967296
- __imatmul__(argument: Union[bytes, bytearray, Iterable]) blooms.blooms.blooms [source]
Insert a bytes-like object (or an iterable of bytes-like objects) into this instance.
>>> b = blooms(100) >>> b @= bytes([1, 2, 3]) >>> b = blooms(100) >>> b @= (bytes([i, i + 1, i + 2]) for i in range(10)) >>> b = blooms(100) >>> b @= 123 Traceback (most recent call last): ... TypeError: supplied argument is not a bytes-like object and not iterable
A
blooms
instance never returns a false negative when queried, but may return a false positive.>>> b = blooms(1) >>> b @= bytes([0]) >>> bytes([8]) @ b True
The bytes-like object of length zero is a member of every
blooms
instance.>>> b = blooms(1) >>> bytes() @ b True
- __rmatmul__(argument: Union[bytes, bytearray, Iterable]) bool [source]
Check whether a bytes-like object appears in this instance.
>>> b = blooms(100) >>> b @= bytes([1, 2, 3]) >>> bytes([1, 2, 3]) @ b True >>> bytes([4, 5, 6]) @ b False
- __or__(other: blooms.blooms.blooms) blooms.blooms.blooms [source]
Return the union of this instance and another instance.
>>> b0 = blooms(100) >>> b0 @= bytes([1, 2, 3]) >>> b1 = blooms(100) >>> b1 @= bytes([4, 5, 6]) >>> bytes([1, 2, 3]) @ (b0 | b1) True >>> bytes([4, 5, 6]) @ (b0 | b1) True >>> b0 = blooms(100) >>> b1 = blooms(200) >>> b0 | b1 Traceback (most recent call last): ... ValueError: instances do not have equivalent lengths
- issubset(other: blooms.blooms.blooms) bool [source]
Determine whether this instance represents a subset of another instance. Note that the subset relationship being checked is between the sets of all bytes-like objects that are accepted by each instance, regardless of whether they were explicitly inserted into an instance or not (i.e., all bytes-like objects that are false positives are considered to be members).
>>> b0 = blooms([0, 0, 1]) >>> b1 = blooms([0, 0, 3]) >>> b0.issubset(b1) True >>> b1.issubset(b0) False >>> b0 = blooms(100) >>> b1 = blooms(200) >>> b0.issubset(b1) Traceback (most recent call last): ... ValueError: instances do not have equivalent lengths
- classmethod from_base64(s: str) blooms.blooms.blooms [source]
Convert a Base64 UTF-8 string representation into an instance.
>>> b = blooms(100) >>> b @= bytes([1, 2, 3]) >>> b = blooms.from_base64(b.to_base64()) >>> bytes([1, 2, 3]) @ b True >>> bytes([4, 5, 6]) @ b False
- to_base64() str [source]
Convert this instance to a Base64 UTF-8 string representation.
>>> b = blooms(100) >>> isinstance(b.to_base64(), str) True
- saturation(length: int) float [source]
Return the approximate saturation of this instance as a value between
0.0
and1.0
(assuming that all bytes-like objects that have been or will be inserted have the specified length). The approximation is an upper bound on the true saturation, and its accuracy degrades as the number of insertions approaches the valuelen(self) // 8
.>>> b = blooms(32) >>> b.saturation(4) 0.0 >>> from secrets import token_bytes >>> for _ in range(8): ... b @= token_bytes(4) >>> b.saturation(4) < 0.1 True
The saturation of an instance can be interpreted as an upper bound on the rate at which false positives can be expected when querying the instance with bytes-like objects that have the specified length.
- capacity(length: int, saturation: float) Union[int, float] [source]
Return this instance’s approximate capacity: the number of bytes-like objects of the specified length that can be inserted into an empty version of this instance before the specified saturation is likely to be reached.
>>> b = blooms(32) >>> b.capacity(8, 0.05) 28 >>> b.capacity(12, 0.05) 31
The capacity of an instance is not bounded for a saturation of
1.0
or for bytes-like objects of length zero.>>> b.capacity(0, 0.1) inf >>> b.capacity(4, 1.0) inf
Note that capacity is independent of the number of insertions into this instance that have occurred. It is the responsibility of the user to keep track of the number of bytes-like objects that have been inserted into an instance.
- static specialize(name: str, encode: Callable) type [source]
Return a class derived from
blooms
that uses the supplied encoding for members.>>> from hashlib import sha256 >>> encode = lambda x: sha256(x).digest()[:2] >>> blooms_custom = blooms.specialize(name='blooms_custom', encode=encode) >>> b = blooms_custom(4) >>> b @= bytes([1, 2, 3]) >>> bytes([1, 2, 3]) @ b True