Learnitweb

How HashSet Works Internally in Java

HashSet is a part of the Java Collection Framework and implements the Set interface, which is a collection that does not allow duplicate elements. It is backed by a HashMap internally, where each element in the HashSet is stored as a key, and the corresponding value is a constant dummy value.

HashSet leverages hashing to provide constant-time performance (O(1)) for basic operations like add(), remove(), and contains(), on average.

Core Internal Concepts Behind HashSet

1. Hashing Mechanism

  • HashSet relies on hashing to store and retrieve elements quickly.
  • Each element’s hash code is computed using the hashCode() method (defined in the Object class, but can be overridden in custom objects).
  • The hash code is processed to determine an index in the internal array (which is essentially backed by a HashMap).

The process of hashing is:

static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

This function ensures that the hash values are evenly distributed across the buckets, reducing the chance of collisions.

2. Buckets (Array) Structure

Internally, a HashSet is backed by a HashMap. The HashMap stores the elements as keys and uses a dummy constant value (like PRESENT) as the associated value for each key.

  • Buckets are the array elements in the HashMap that store Entry<K, V> objects (where K is the element and V is the dummy value).
  • Bucket index is determined by the hash code of the element, and this index points to a particular position in the array.
static final Object PRESENT = new Object();

static class Node<K> implements Map.Entry<K, Object> {
    final int hash;
    final K key;
    Node<K> next;
}

3. Element Uniqueness

  • HashSet ensures unique elements. If an element is already present (based on its hashCode() and equals()), the new element will not be added.
  • This is achieved by checking for duplicate keys in the underlying HashMap:
    • The hash code of the element is computed.
    • The corresponding bucket is checked.
    • If an element with the same hash code and value (using equals()) is found, the element is not added.

4. Collision Handling

  • When multiple elements hash to the same index, a collision occurs.
  • HashSet handles collisions using chaining:
    • A linked list or tree structure is used to store colliding elements at the same index.
    • If a large number of collisions occur, the linked list can be converted into a Red-Black Tree (like in HashMap).

5. Resizing and Load Factor

  • Load Factor: The default load factor of a HashSet is 0.75, meaning when the number of elements exceeds 75% of the capacity, the HashSet will resize.
  • Resizing involves:
    • Creating a new, larger array (usually twice the size).
    • Rehashing and redistributing the elements to new indices based on the new array size.
  • This resizing process ensures that the HashSet maintains efficient time complexity (O(1)) for operations like add(), remove(), and contains().

How the add(), remove(), and contains() Methods Work Internally

add(E element)

  1. Compute hash code: The hash code of the element is computed using hashCode().
  2. Calculate bucket index: The bucket index is calculated using (n - 1) & hash where n is the array length.
  3. Check if bucket is empty: If the bucket is empty, the element is inserted directly.
  4. Handle collisions: If an element already exists in the bucket (same hash code), the equals() method is called to check for equality.
    • If equal, the element is not added.
    • If not equal, the element is added to the linked list/tree.
  5. Resize if necessary: If the load factor exceeds 0.75, the HashSet is resized.

remove(Object element)

  1. Compute hash code: The hash code of the element is computed.
  2. Calculate bucket index: The bucket index is determined.
  3. Search bucket: The corresponding bucket is checked to find the element.
  4. Remove element: If found, it is removed, and the subsequent elements in the list/tree are adjusted.

contains(Object element)

  1. Compute hash code: The hash code of the element is computed.
  2. Calculate bucket index: The bucket index is determined.
  3. Search bucket: The corresponding bucket is checked for the element.
  4. Check for equality: The equals() method is called to confirm if the element is present.

6. Resizing and Rehashing

When the load factor threshold is exceeded, HashSet triggers rehashing:

  • Resize the array: The size is doubled.
  • Rehash elements: All existing elements are rehashed to fit into the new array.

Rehashing is a costly operation, and this is why choosing a suitable initial capacity for the HashSet can improve performance by minimizing the number of resizes.

7. Thread Safety

  • HashSet is not thread-safe.
  • In a multi-threaded environment, concurrent modifications may lead to unpredictable behavior.
  • If thread safety is required, you can use:
    • Collections.synchronizedSet() to synchronize access to the HashSet.
    • ConcurrentHashMap for a more thread-safe alternative in concurrent scenarios.