HashSet
is a part of the Java Collection Framework and implements the Set interface, which is a collection that does not allow duplicate elements. It is backed by a HashMap internally, where each element in the HashSet
is stored as a key, and the corresponding value is a constant dummy value.
HashSet
leverages hashing to provide constant-time performance (O(1)) for basic operations like add()
, remove()
, and contains()
, on average.
Core Internal Concepts Behind HashSet
1. Hashing Mechanism
HashSet
relies on hashing to store and retrieve elements quickly.- Each element’s hash code is computed using the
hashCode()
method (defined in theObject
class, but can be overridden in custom objects). - The hash code is processed to determine an index in the internal array (which is essentially backed by a
HashMap
).
The process of hashing is:
static final int hash(Object key) { int h; return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16); }
This function ensures that the hash values are evenly distributed across the buckets, reducing the chance of collisions.
2. Buckets (Array) Structure
Internally, a HashSet
is backed by a HashMap. The HashMap
stores the elements as keys and uses a dummy constant value (like PRESENT
) as the associated value for each key.
- Buckets are the array elements in the
HashMap
that storeEntry<K, V>
objects (where K is the element and V is the dummy value). - Bucket index is determined by the hash code of the element, and this index points to a particular position in the array.
static final Object PRESENT = new Object(); static class Node<K> implements Map.Entry<K, Object> { final int hash; final K key; Node<K> next; }
3. Element Uniqueness
HashSet
ensures unique elements. If an element is already present (based on itshashCode()
andequals()
), the new element will not be added.- This is achieved by checking for duplicate keys in the underlying
HashMap
:- The hash code of the element is computed.
- The corresponding bucket is checked.
- If an element with the same hash code and value (using
equals()
) is found, the element is not added.
4. Collision Handling
- When multiple elements hash to the same index, a collision occurs.
HashSet
handles collisions using chaining:- A linked list or tree structure is used to store colliding elements at the same index.
- If a large number of collisions occur, the linked list can be converted into a Red-Black Tree (like in
HashMap
).
5. Resizing and Load Factor
- Load Factor: The default load factor of a
HashSet
is0.75
, meaning when the number of elements exceeds 75% of the capacity, theHashSet
will resize. - Resizing involves:
- Creating a new, larger array (usually twice the size).
- Rehashing and redistributing the elements to new indices based on the new array size.
- This resizing process ensures that the
HashSet
maintains efficient time complexity (O(1)) for operations likeadd()
,remove()
, andcontains()
.
How the add()
, remove()
, and contains()
Methods Work Internally
add(E element)
- Compute hash code: The hash code of the element is computed using
hashCode()
. - Calculate bucket index: The bucket index is calculated using
(n - 1) & hash
wheren
is the array length. - Check if bucket is empty: If the bucket is empty, the element is inserted directly.
- Handle collisions: If an element already exists in the bucket (same hash code), the
equals()
method is called to check for equality.- If equal, the element is not added.
- If not equal, the element is added to the linked list/tree.
- Resize if necessary: If the load factor exceeds 0.75, the
HashSet
is resized.
remove(Object element)
- Compute hash code: The hash code of the element is computed.
- Calculate bucket index: The bucket index is determined.
- Search bucket: The corresponding bucket is checked to find the element.
- Remove element: If found, it is removed, and the subsequent elements in the list/tree are adjusted.
contains(Object element)
- Compute hash code: The hash code of the element is computed.
- Calculate bucket index: The bucket index is determined.
- Search bucket: The corresponding bucket is checked for the element.
- Check for equality: The
equals()
method is called to confirm if the element is present.
6. Resizing and Rehashing
When the load factor threshold is exceeded, HashSet
triggers rehashing:
- Resize the array: The size is doubled.
- Rehash elements: All existing elements are rehashed to fit into the new array.
Rehashing is a costly operation, and this is why choosing a suitable initial capacity for the HashSet
can improve performance by minimizing the number of resizes.
7. Thread Safety
HashSet
is not thread-safe.- In a multi-threaded environment, concurrent modifications may lead to unpredictable behavior.
- If thread safety is required, you can use:
Collections.synchronizedSet()
to synchronize access to theHashSet
.ConcurrentHashMap
for a more thread-safe alternative in concurrent scenarios.