Python has no(?) multiset

I saw an online resource that claimed “The collections.Counter class in the Python standard library implements a multiset (or bag) type that allows elements in the set to have more than one occurrence.”

However, (and despite the fact that he is quoting comments in the CPython source code,) this is misleading. While the collections.Counter class does allow you to have more than 1 occurrence of elements in a set, it has one huge difference with "real" multisets: it allows non-positive quantities for elements!

This leads to two disturbing situations if you just believe the claims of whatever the most SEO'd post to your query was:

from collections import Counter

# 1. Counter is NOT a multiset because it counts quantity-0 members as "present"
c = Counter(['cat', 'dog'])
c.subtract(['cat'])
assert 'cat' not in c, "True: 'cat' in c and c['cat'] == {}".format(c['cat'])

# 2. Counter is NOT a multiset because it supports negative-quantity membership
c = Counter(['cat', 'dog'])
c.subtract(['cat'])
c.subtract(['cat'])
c.subtract(['cat'])
assert 'cat' not in c, "True: c['cat'] == {}".format(c['cat'])

I think Python does not currently contain a multiset type. Reportedly, the idea was pondered in 2004 around the Python 2.4 release, but as of Python 3.11.3 / Summer 2023, the documentation only contains 2 instances of the word "multiset", both of which refer to the Counter class.

To be clear, you could simply decide to use a Counter as a multiset. It would have a few downsides:

EAFP is utterly broken, because you'll be put in an invalid state with no raised error when trying to subtract something that's already at 0 quantity.
Since Counter doesn't prune elements whose quantity has dropped to zero, using it "as" a multiset could produce memory leaks in certain situations.
- You can work around this by calling c._keep_positive after every mutation which might remove an element.
Unless you implement the above workaround for memory-leakage, the in operator won't work properly; you'll have to use lambda x: c[x] > 0 to check membership, which is cringe.

But if all you need from a multiset is __contains__, count, append, and remove, why not just use a boring old list instead?
(*Rhetorical question; I'll just tell you.)

Comparisons won't work properly, since list is arbitrarily ordered.
- If all elements are from the same well-ordered set, you can work around this by calling l.sort on both operands before comparison!
list won't throw a TypeError if you accidentally try to stick something mutable into it. (Obviously, this shouldn't be a concern if your type hygiene is good elsewhere or your code certainly doesn't have typing issues, but it still would have been nice to have.)
The name append is a bit weird when you're semantically add-ing an element.
There's no subtract function to remove multiple elements at the same time.
Memory use won't be optimal when your multisets contain many repeated elements (Python needs to allocate 1 word to store a pointer for each slot in a list.)
Performance won't be optimal (e.g. count is $O(k)$—that is, not proportional to the number of unique elements in the set, but to the total number of elements in it—with no optimization available when counting multiple elements at once.)

The observant reader will notice that the only “objective” problem there is comparisons—you really can, mathematically, use “sorted lists” as a multiset!

Alternatively, just step up and make a constrained subclass for Counter:

from collections import Counter
from collections.abc import Mapping as _Mapping
from itertools import chain as _chain, repeat as _repeat, starmap as _starmap

class multiset(Counter):
	def _keep_positive(self, /):
		nonpositive_items = [(elem, count) for elem, count in self.items() if not count > 0]
		for elem, count in nonpositive_items:
			if count == 0:  # https://en.wikipedia.org/wiki/Happy_path
				del self[elem]
				continue
			raise RuntimeError("invalid quantity in multiset")
		return self
	def update(self, iterable=None, /, **k):
		super().update(iterable, **k)
		self._keep_positive()
	def elements(self, /, _most_common=False):
		# https://github.com/python/cpython/blob/v3.11.3/Lib/collections/__init__.py#L624
		items = self.items() if not _most_common else self.most_common()
		return _chain.from_iterable(_starmap(_repeat, items))
	def __repr__(self, /):
		# https://github.com/python/cpython/blob/v3.11.3/Lib/collections/__init__.py#L731
		if not self:
			return f'{self.__class__.__name__}()'
		try:
			l = list(self.elements(_most_common=True))
		except TypeError:
			l = list(self.elements())
		return f'{self.__class__.__name__}({l!r})'
	# This isn't "a dict"; it HAPPENS to SUBCLASS CPython's dict for performance reasons.
	# We're not barbarians, so let's get some set-like methods, eh?
	def __iter__(self, /):
		yield from self.elements()
	def __len__(self, /):
		"Return the number of elements in multiset *self* (cardinality of *self*)."
		# https://docs.python.org/3/library/stdtypes.html#set.__len__
		# https://github.com/python/cpython/blob/v3.11.3/Lib/collections/__init__.py#L604
		return self.total()
	def add(self, elem, /):
		"Add a single element *elem* to the multiset."
		self.update([elem])
	def subtract(self, mapping_or_iterable=None, /, **k):
		# arguably a footgun method
		# if the user wants to self -= other after asserting that self > other, they can do that
		raise NotImplementedError("method not available for multiset. Did you mean multiset.difference_update?")
	def difference_update(self, mapping_or_iterable=None, /, **k):
		"Update the multiset, removing elements found in others."
		# https://docs.python.org/3/library/stdtypes.html#set.difference_update
		# https://github.com/python/cpython/blob/v3.11.3/Lib/collections/__init__.py#L692
		# https://github.com/python/cpython/blob/v3.11.3/Lib/collections/__init__.py#L926
		if iterable is not None:
			if isinstance(mapping_or_iterable, _Mapping):
				items = mapping_or_iterable.items()
			else:
				items = zip(mapping_or_iterable, _repeat(1))
			for elem, count in mapping_or_iterable:
				self[elem] = max(0, self[elem] - count)
		if k:
			self.difference_update(k)
		else:
			# No need to run this twice
			self._keep_positive()
	def remove(self, elem, /):
		"Remove a single element *elem* from the multiset. Raises `KeyError` if *elem* is not present in the multiset."
		# https://docs.python.org/3/library/stdtypes.html#set.remove
		if elem not in self:
			raise KeyError(f"{self.__class__.__name__}.remove(elem): elem not in multiset.")
		self.difference_update([elem])
	def discard(self, elem, /):
		"Remove all copies of element *elem* from the multiset if it is present."
		# https://docs.python.org/3/library/stdtypes.html#set.discard
		del self[elem]
	def pop(self, /):
		"Remove and return a single arbitrary element from the multiset. Raises `KeyError` if the multiset is empty."
		# https://docs.python.org/3/library/stdtypes.html#set.pop
		if not self:
			raise KeyError("pop from an empty multiset")
		elem = next(iter(self))
		self.remove([elem])
		return elem

class frozenmultiset(frozenset):
	def __init__(self, *a, **k):
		raise NotImplementedError("TODO")

Leave a Reply Cancel reply