Infinitesque

Subjects:

Eliminating some duplicates with the RecentSet class

Principal author:
John L. Clark

Abstract

In some situations, it may make sense to try to significantly reduce the number of duplicates in a collection, while ensuring that memory usage does not increase without bound. To help with this, I created a Python utility called a RecentSet, which serves as a sort of first-pass filter for significantly reducing the number of duplicates in a collection.

Sets, both mathematically and in practice as data structures, are useful for identifying and removing duplicate items from any collection. Python makes this particularly easy for the developer, but as you might expect, its implementation utilizes only memory for storage, so a large enough number of unique items will cause the program to terminate for lack of memory. Somewhat smaller numbers will simply make the program, and likely the system, run very slowly, particularly if it is forced into frequent swapping. Neither situation is particularly pleasant.

When I ran into this problem, I looked at using a database system to eliminate the duplicates, since databases use disk-based data structures, and disks are generally far larger than available memory. This works, but it can still be expensive, because we must feed the entire collection of items to the database system and allow it to eliminate the duplicates. If the entire collection of items contains a large number of duplicates, then the size of the total collection of items will be significantly larger than that of the corresponding set of items. If we can somehow first reduce the size of the total collection by eliminating some of the duplicates, then we can significantly improve performance.

To do this, I implemented something that I called a RecentSet, which provides a mechanism for approximately determining whether an item is already a member of some abstract set. It works by keeping track of a fixed number of items in the order in which they were last checked for membership. This means that a RecentSet works best on collections that have clusters of identical items. In fact, the membership test will be completely accurate if and only if the distance between identical members within the collection is always less than or equal to the size of the RecentSet (and the accuracy generally diminishes as this distance increases). Thus, a RecentSet serves as a sort of first-pass filter for significantly reducing the number of duplicates in a collection.

I needed to be able to efficiently order items and move items from their current position in the order to the head of the order. These operations are best suited to a doubly-linked list, but it doesn't seem that an implementation of this data structure exists in the core Python library, so I implemented my own (or, more precisely, a class for nodes that combine to form a doubly-linked list, called DllNode). I've provided this implementation below:

class DllNode(object):
  def __init__(self, stuff):
    self.prev = None
    self.next = None
    self.stuff = stuff

  def insert_before(self, following):
    prev = following.prev

    if prev is not None:
      prev.next = self

    self.prev = prev
    self.next = following

    following.prev = self

  def insert_after(self, preceding):
    next = preceding.next

    if next is not None:
      next.prev = self

    self.prev = preceding
    self.next = next

    preceding.next = self

  def remove(self):
    if self.prev is not None:
      self.prev.next = self.next
    if self.next is not None:
      self.next.prev = self.prev
    self.prev = None
    self.next = None

And the following is the implementation of RecentSet, which depends on the previous doubly-linked list implementation. It's currently quite simple. The name is perhaps misleading, because it does not actually represent a set, but rather a sort of approximate oracle about whether new items were present previously in an abstract set.

class RecentSet(object):
  def __init__(self, size):
    self.size = size
    self.store = {}
    '''map from objects to their node in the doubly-linked list'''
    self.head = None
    '''first item in the doubly-linked list'''
    self.tail = None
    '''last item in the doubly-linked list'''

  def check(self, item):
    if item in self.store:
      node = self.store[item]
      if self.head is not node:
        if self.tail is node:
          self.tail = self.tail.prev
        node.remove()
        node.insert_before(self.head)
        self.head = node

      return True

    else:
      node = DllNode(item)

      if self.head is None:
        assert self.tail is None
        self.head = node
        self.tail = node
      else:
        node.insert_before(self.head)
        self.head = node

      self.store[item] = node

      if len(self.store) > self.size:
        drop = self.tail
        self.tail = drop.prev
        drop.remove()
        del self.store[drop.stuff]

      return False

Upon request, I would be happy to provide some examples of how you might utilize a RecentSet. Until then, that will just have to remain an exercise for the reader. I would also be happy to flesh these components out further, to make them more generally useful, if anyone is interested in using them.

This page was last modified on 2010-04-12 16:02:00-04:00.

This page was first published on .

Creative Commons LicenseThis work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

See the version of this page with comments enabled to read or add comments.