I saw an online resource that claimed “The collections.Counter
class in the Python standard library implements a multiset (or bag) type that allows elements in the set to have more than one occurrence.”
However, (and despite the fact that he is quoting comments in the CPython source code,) this is misleading. While the collections.Counter
class does allow you to have more than 1 occurrence of elements in a set, it has one huge difference with "real" multisets: it allows non-positive quantities for elements!
This leads to two disturbing situations if you just believe the claims of whatever the most SEO'd post to your query was:
from collections import Counter
# 1. Counter is NOT a multiset because it counts quantity-0 members as "present"
c = Counter(['cat', 'dog'])
c.subtract(['cat'])
assert 'cat' not in c, "True: 'cat' in c and c['cat'] == {}".format(c['cat'])
# 2. Counter is NOT a multiset because it supports negative-quantity membership
c = Counter(['cat', 'dog'])
c.subtract(['cat'])
c.subtract(['cat'])
c.subtract(['cat'])
assert 'cat' not in c, "True: c['cat'] == {}".format(c['cat'])
I think Python does not currently contain a multiset type. Reportedly, the idea was pondered in 2004 around the Python 2.4 release, but as of Python 3.11.3 / Summer 2023, the documentation only contains 2 instances of the word "multiset", both of which refer to the Counter class.
To be clear, you could simply decide to use a Counter as a multiset. It would have a few downsides:
- EAFP is utterly broken, because you'll be put in an invalid state with no raised error when trying to
subtract
something that's already at 0 quantity. - Since
Counter
doesn't prune elements whose quantity has dropped to zero, using it "as" a multiset could produce memory leaks in certain situations.- You can work around this by calling
c._keep_positive
after every mutation which might remove an element.
- You can work around this by calling
- Unless you implement the above workaround for memory-leakage, the
in
operator won't work properly; you'll have to uselambda x: c[x] > 0
to check membership, which is cringe.
But if all you need from a multiset is __contains__
, count
, append
, and remove
, why not just use a boring old list
instead?
(*Rhetorical question; I'll just tell you.)
- Comparisons won't work properly, since
list
is arbitrarily ordered.- If all elements are from the same well-ordered set, you can work around this by calling
l.sort
on both operands before comparison!
- If all elements are from the same well-ordered set, you can work around this by calling
list
won't throw aTypeError
if you accidentally try to stick something mutable into it. (Obviously, this shouldn't be a concern if your type hygiene is good elsewhere or your code certainly doesn't have typing issues, but it still would have been nice to have.)- The name
append
is a bit weird when you're semanticallyadd
-ing an element. - There's no
subtract
function to remove multiple elements at the same time. - Memory use won't be optimal when your multisets contain many repeated elements (Python needs to allocate 1 word to store a pointer for each slot in a
list
.) - Performance won't be optimal (e.g.
count
is $O(k)$—that is, not proportional to the number of unique elements in the set, but to the total number of elements in it—with no optimization available when counting multiple elements at once.)
The observant reader will notice that the only “objective” problem there is comparisons—you really can, mathematically, use “sorted list
s” as a multiset!
Alternatively, just step up and make a constrained subclass for Counter
:
from collections import Counter
from collections.abc import Mapping as _Mapping
from itertools import chain as _chain, repeat as _repeat, starmap as _starmap
class multiset(Counter):
def _keep_positive(self, /):
nonpositive_items = [(elem, count) for elem, count in self.items() if not count > 0]
for elem, count in nonpositive_items:
if count == 0: # https://en.wikipedia.org/wiki/Happy_path
del self[elem]
continue
raise RuntimeError("invalid quantity in multiset")
return self
def update(self, iterable=None, /, **k):
super().update(iterable, **k)
self._keep_positive()
def elements(self, /, _most_common=False):
# https://github.com/python/cpython/blob/v3.11.3/Lib/collections/__init__.py#L624
items = self.items() if not _most_common else self.most_common()
return _chain.from_iterable(_starmap(_repeat, items))
def __repr__(self, /):
# https://github.com/python/cpython/blob/v3.11.3/Lib/collections/__init__.py#L731
if not self:
return f'{self.__class__.__name__}()'
try:
l = list(self.elements(_most_common=True))
except TypeError:
l = list(self.elements())
return f'{self.__class__.__name__}({l!r})'
# This isn't "a dict"; it HAPPENS to SUBCLASS CPython's dict for performance reasons.
# We're not barbarians, so let's get some set-like methods, eh?
def __iter__(self, /):
yield from self.elements()
def __len__(self, /):
"Return the number of elements in multiset *self* (cardinality of *self*)."
# https://docs.python.org/3/library/stdtypes.html#set.__len__
# https://github.com/python/cpython/blob/v3.11.3/Lib/collections/__init__.py#L604
return self.total()
def add(self, elem, /):
"Add a single element *elem* to the multiset."
self.update([elem])
def subtract(self, mapping_or_iterable=None, /, **k):
# arguably a footgun method
# if the user wants to self -= other after asserting that self > other, they can do that
raise NotImplementedError("method not available for multiset. Did you mean multiset.difference_update?")
def difference_update(self, mapping_or_iterable=None, /, **k):
"Update the multiset, removing elements found in others."
# https://docs.python.org/3/library/stdtypes.html#set.difference_update
# https://github.com/python/cpython/blob/v3.11.3/Lib/collections/__init__.py#L692
# https://github.com/python/cpython/blob/v3.11.3/Lib/collections/__init__.py#L926
if iterable is not None:
if isinstance(mapping_or_iterable, _Mapping):
items = mapping_or_iterable.items()
else:
items = zip(mapping_or_iterable, _repeat(1))
for elem, count in mapping_or_iterable:
self[elem] = max(0, self[elem] - count)
if k:
self.difference_update(k)
else:
# No need to run this twice
self._keep_positive()
def remove(self, elem, /):
"Remove a single element *elem* from the multiset. Raises `KeyError` if *elem* is not present in the multiset."
# https://docs.python.org/3/library/stdtypes.html#set.remove
if elem not in self:
raise KeyError(f"{self.__class__.__name__}.remove(elem): elem not in multiset.")
self.difference_update([elem])
def discard(self, elem, /):
"Remove all copies of element *elem* from the multiset if it is present."
# https://docs.python.org/3/library/stdtypes.html#set.discard
del self[elem]
def pop(self, /):
"Remove and return a single arbitrary element from the multiset. Raises `KeyError` if the multiset is empty."
# https://docs.python.org/3/library/stdtypes.html#set.pop
if not self:
raise KeyError("pop from an empty multiset")
elem = next(iter(self))
self.remove([elem])
return elem
class frozenmultiset(frozenset):
def __init__(self, *a, **k):
raise NotImplementedError("TODO")