Learnitweb

False Sharing in Multithreading – Explanation and How to Avoid It

1. Introduction to False Sharing

In multi-threaded programming, false sharing is a performance-degrading phenomenon where multiple threads access different variables that happen to reside on the same CPU cache line. Even though the threads are accessing different variables, they cause cache invalidation in each other’s cores, leading to unnecessary cache coherence traffic.

The resulting coherence traffic can cause significant slowdowns, especially in high-performance, low-latency applications running on multi-core machines.

2. What is a CPU Cache Line?

To understand false sharing, you need to know what a cache line is.

  • A cache line is the smallest unit of memory transfer between main memory and the CPU caches.
  • Typically 64 bytes on modern processors (Intel/AMD).
  • When a CPU accesses a variable, it loads the entire cache line containing that variable into its local L1/L2/L3 cache.

So if two variables lie on the same cache line, accessing one of them brings the other into the cache as well — even if it’s not needed.
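The arithmetic behind this is simple. As an illustrative sketch (assuming a 64-byte cache line and a contiguously laid-out `long[]`, and ignoring the array's object header), the following hypothetical helper computes which line, relative to the first element, a given element falls in:

```java
class CacheLineMath {
    static final int CACHE_LINE_BYTES = 64; // assumed; typical on modern x86 CPUs
    static final int LONG_BYTES = 8;

    // Cache-line index (relative to element 0) that element i of a long[] falls in,
    // assuming the array body is contiguous and element 0 is line-aligned.
    static int lineOf(int i) {
        return (i * LONG_BYTES) / CACHE_LINE_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(lineOf(0)); // 0
        System.out.println(lineOf(7)); // 0 -- shares a line with element 0
        System.out.println(lineOf(8)); // 1 -- first element on the next line
    }
}
```

So eight adjacent `long`s fit on one 64-byte line; two threads writing to any two of them will contend on that line.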

3. How False Sharing Happens

Let’s say we have two threads, each writing to separate variables x and y, but both x and y are located in the same cache line in memory.

class SharedData {
    volatile long x = 0; // Thread 1 writes here
    volatile long y = 0; // Thread 2 writes here
}

Even though each thread modifies a different variable:

  • They reside in the same 64-byte cache line.
  • Each write to x by Thread 1 invalidates the cache line in Thread 2’s CPU core, even if Thread 2 only cares about y.
  • This triggers a cache coherence protocol (like MESI), causing:
    • Performance penalties
    • Cache line bouncing between cores
    • Decreased throughput

This is false sharing – they don’t actually share variables, but the cache thinks they do.
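The same effect shows up with adjacent array elements. `AtomicLongArray` stores its longs contiguously, so neighboring indices normally share a cache line. The sketch below (a rough demo, not a rigorous benchmark; timings vary by machine) times two threads writing to adjacent indices versus indices far apart:

```java
import java.util.concurrent.atomic.AtomicLongArray;

public class ArraySharingDemo {
    // Time two threads each writing 5M values to their own array slot.
    static long timeWriters(AtomicLongArray arr, int i0, int i1) throws InterruptedException {
        Thread a = new Thread(() -> { for (long i = 0; i < 5_000_000L; i++) arr.set(i0, i); });
        Thread b = new Thread(() -> { for (long i = 0; i < 5_000_000L; i++) arr.set(i1, i); });
        long start = System.nanoTime();
        a.start(); b.start(); a.join(); b.join();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicLongArray arr = new AtomicLongArray(1024);
        // Indices 0 and 1 are 8 bytes apart: almost certainly the same 64-byte line.
        System.out.println("adjacent:  " + timeWriters(arr, 0, 1) + " ms");
        // Indices 0 and 512 are 4 KB apart: different lines, no false sharing.
        System.out.println("far apart: " + timeWriters(arr, 0, 512) + " ms");
    }
}
```

On a multi-core machine the adjacent case is typically noticeably slower, even though the threads never touch the same element.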

4. Demonstration of False Sharing in Java

Let’s create a microbenchmark to observe false sharing.

Without Padding (Causing False Sharing)

class FalseSharingTest implements Runnable {
    public static int NUM_THREADS = 4;
    public static long ITERATIONS = 10_000_000L;

    static class VolatileLong {
        public volatile long value = 0L; // All values are packed together
    }

    static VolatileLong[] data = new VolatileLong[NUM_THREADS];

    static {
        for (int i = 0; i < NUM_THREADS; i++) {
            data[i] = new VolatileLong();
        }
    }

    private final int index;

    public FalseSharingTest(int index) {
        this.index = index;
    }

    public void run() {
        long i = ITERATIONS + 1;
        while (--i != 0) {
            data[index].value = i;
        }
    }

    public static void main(String[] args) throws Exception {
        Thread[] threads = new Thread[NUM_THREADS];

        for (int i = 0; i < NUM_THREADS; i++) {
            threads[i] = new Thread(new FalseSharingTest(i));
        }

        long start = System.nanoTime(); // time the runs only, not thread construction
        for (Thread t : threads) t.start();
        for (Thread t : threads) t.join();

        long duration = System.nanoTime() - start;
        System.out.println("Duration: " + duration / 1_000_000 + " ms");
    }
}

With Padding (Avoiding False Sharing)

class FalseSharingPaddedTest implements Runnable {
    public static int NUM_THREADS = 4;
    public static long ITERATIONS = 10_000_000L;

    static class PaddedLong {
        // Padding to avoid false sharing (assume 64-byte cache line)
        public volatile long value = 0L;
        public long p1, p2, p3, p4, p5, p6, p7; // padding
    }

    static PaddedLong[] data = new PaddedLong[NUM_THREADS];

    static {
        for (int i = 0; i < NUM_THREADS; i++) {
            data[i] = new PaddedLong();
        }
    }

    private final int index;

    public FalseSharingPaddedTest(int index) {
        this.index = index;
    }

    public void run() {
        long i = ITERATIONS + 1;
        while (--i != 0) {
            data[index].value = i;
        }
    }

    public static void main(String[] args) throws Exception {
        Thread[] threads = new Thread[NUM_THREADS];

        for (int i = 0; i < NUM_THREADS; i++) {
            threads[i] = new Thread(new FalseSharingPaddedTest(i));
        }

        long start = System.nanoTime(); // time the runs only, not thread construction
        for (Thread t : threads) t.start();
        for (Thread t : threads) t.join();

        long duration = System.nanoTime() - start;
        System.out.println("Duration: " + duration / 1_000_000 + " ms");
    }
}

On typical multi-core hardware, the padded version is often several times faster (a 2x–10x improvement is commonly reported), though the exact factor depends on the CPU, core count, and how the threads are scheduled. For rigorous numbers, prefer a benchmarking harness such as JMH over raw System.nanoTime() timing, which does not account for JIT warmup.

5. How to Detect False Sharing

You can detect false sharing using:

Manual Code Review

  • Look for hot variables updated by multiple threads
  • Check if these variables are close together in memory

Profiling Tools

  • Intel VTune Profiler
  • Linux perf with cache-related counters (cache-misses, cache-references); perf c2c is designed specifically to find cache-line contention such as false sharing
  • Java Flight Recorder (JFR), to correlate high CPU time with hot write sites

6. How to Avoid False Sharing in Java

1. Use Padding

Manually insert dummy fields between frequently written variables.

class PaddedCounter {
    public volatile long counter;
    public long p1, p2, p3, p4, p5, p6, p7; // prevent false sharing
}

But this is error-prone and architecture-dependent.
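One concern with single-class padding is that the JVM is free to reorder fields within a class, so the padding fields are not guaranteed to stay where you declared them. A workaround used in practice (for example by the LMAX Disruptor) is padding via a class hierarchy, since superclass fields are laid out before subclass fields. A minimal sketch (class names are illustrative):

```java
// Padding via inheritance: HotSpot may reorder fields within one class, but
// superclass fields are laid out before subclass fields, so 'value' ends up
// sandwiched between two runs of padding longs.
class LhsPadding { protected long p1, p2, p3, p4, p5, p6, p7; }
class ValueHolder extends LhsPadding { protected volatile long value; }
class RhsPadding extends ValueHolder { protected long p9, p10, p11, p12, p13, p14, p15; }

public class PaddedValue extends RhsPadding {
    public void set(long v) { value = v; }
    public long get() { return value; }
}
```

This pads both sides of the hot field, so it cannot share a 64-byte line with a neighboring object's fields either.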

2. Use @Contended Annotation (JDK 8+)

JDK 8 introduced the @sun.misc.Contended annotation (moved to jdk.internal.vm.annotation.Contended in JDK 9).

@sun.misc.Contended
public class Counter {
    public volatile long value;
}

This instructs the JVM to pad the annotated class or field so that it occupies its own cache line(s).

To enable @Contended for application (non-JDK) code, you must run with:

-XX:-RestrictContended

Note: @Contended is an internal JVM annotation and not part of the public API, but the JDK itself relies on it (for example in java.lang.Thread and in LongAdder's internal cells), and it is honored by production HotSpot JVMs.

3. Avoid Sharing Mutable State Across Threads

Prefer thread-local or immutable data whenever possible.
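If no cache line is ever written by more than one thread, false sharing cannot occur by construction. A minimal sketch using ThreadLocal (names are illustrative):

```java
public class ThreadLocalCounters {
    // Each thread gets its own counter object, so no cache line holding a
    // counter is ever written by more than one thread.
    private static final ThreadLocal<long[]> COUNTER =
            ThreadLocal.withInitial(() -> new long[1]);

    // Increment this thread's private counter n times and return its value.
    static long countTo(long n) {
        long[] c = COUNTER.get();
        for (long i = 0; i < n; i++) c[0]++;
        return c[0];
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> System.out.println(
                Thread.currentThread().getName() + ": " + countTo(1_000_000L));
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start(); t1.join(); t2.join();
    }
}
```

The trade-off is that per-thread results must be combined explicitly if you need a global total.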

4. Use Proper Thread-Safe Data Structures

  • Use concurrent utilities that manage padding internally.
  • Examples: LongAdder, LongAccumulator, and DoubleAdder keep their per-thread cells padded (via @Contended) precisely to avoid false sharing. Note that AtomicLongArray does not pad between elements, so adjacent indices can still falsely share a line.

7. Real-World Example: Java LongAdder

The class java.util.concurrent.atomic.LongAdder avoids false sharing by:

  • Internally maintaining striped counters
  • Each thread updates a separate cell
  • Cells are padded to avoid cache contention

Hence, LongAdder is faster than AtomicLong under high contention.
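A short usage sketch: each thread increments the same LongAdder, which internally spreads the updates across padded cells, and sum() combines the cells afterwards.

```java
import java.util.concurrent.atomic.LongAdder;

public class LongAdderDemo {
    // Run 'threads' threads, each incrementing the adder 'perThread' times,
    // then return the combined total.
    static long countWith(LongAdder adder, int threads, long perThread) throws InterruptedException {
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (long j = 0; j < perThread; j++) adder.increment(); // contended updates spread over padded cells
            });
        }
        for (Thread t : ts) t.start();
        for (Thread t : ts) t.join();
        return adder.sum(); // sums all internal cells; exact once the threads have joined
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(countWith(new LongAdder(), 4, 1_000_000L)); // prints 4000000
    }
}
```

Under low contention, a plain AtomicLong is just as good and uses less memory; LongAdder's advantage appears when many threads update the same counter concurrently.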