Learnitweb

Longest Common Substring Coding Pattern

1. Introduction

The Longest Common Substring (LCS) problem is a classic string problem in computer science. It involves finding the longest contiguous substring that appears in both strings.

Example:

String A: "abcdxyz"
String B: "xyzabcd"
Longest common substring: "abcd" (length 4)

Note:

  • Substring is contiguous.
  • This is different from the Longest Common Subsequence, where characters need not be contiguous.

Problem Statement (Typical Coding Problem):

Given two strings s1 and s2, find the length of the longest common substring. Optionally, return the substring itself.


2. When to Use Longest Common Substring Pattern

Use this pattern when:

  1. You need to find common sequences between two strings that are contiguous.
  2. You need maximum length or the substring itself.
  3. Problems involve text comparison, DNA sequences, data deduplication, or pattern matching.

Problem Indicators:

  • “Longest contiguous substring common in two strings”
  • “Maximum overlap between two sequences”
  • “Find matching blocks of text”

3. Core Idea

We can define a dynamic programming recurrence:

For strings s1 and s2 of lengths m and n:

dp[i][j] = length of longest common substring ending at s1[i-1] and s2[j-1]

Recurrence relation:

if s1[i-1] == s2[j-1]:
    dp[i][j] = dp[i-1][j-1] + 1
else:
    dp[i][j] = 0
  • dp[i][j] only increases when characters match and substring is contiguous.
  • Track the maximum value in the DP table → length of LCS.

Base Case:

dp[0][j] = 0 for all j
dp[i][0] = 0 for all i

4. Example

s1 = "abcdxyz"
s2 = "xyzabcd"

DP Table (partial view):

       ''  x  y  z  a  b  c  d
    ''  0  0  0  0  0  0  0  0
    a   0  0  0  0  1  0  0  0
    b   0  0  0  0  0  2  0  0
    c   0  0  0  0  0  0  3  0
    d   0  0  0  0  0  0  0  4
    x   0  1  0  0  0  0  0  0
    y   0  0  2  0  0  0  0  0
    z   0  0  0  3  0  0  0  0

Maximum value = 4 → LCS length = 4 (“abcd”)


5. Bottom-Up DP (Tabulation)

public int longestCommonSubstring(String s1, String s2) {
    int m = s1.length();
    int n = s2.length();
    int[][] dp = new int[m+1][n+1];
    int maxLen = 0;

    for (int i = 1; i <= m; i++) {
        for (int j = 1; j <= n; j++) {
            if (s1.charAt(i-1) == s2.charAt(j-1)) {
                dp[i][j] = dp[i-1][j-1] + 1;
                maxLen = Math.max(maxLen, dp[i][j]);
            } else {
                dp[i][j] = 0;
            }
        }
    }
    return maxLen;
}
  • Time Complexity: O(m * n)
  • Space Complexity: O(m * n)

6. Space Optimization

We only need the previous row, so we can optimize space to O(n):

public int longestCommonSubstringOptimized(String s1, String s2) {
    int m = s1.length();
    int n = s2.length();
    int[] prev = new int[n+1];
    int[] curr = new int[n+1];
    int maxLen = 0;

    for (int i = 1; i <= m; i++) {
        for (int j = 1; j <= n; j++) {
            if (s1.charAt(i-1) == s2.charAt(j-1)) {
                curr[j] = prev[j-1] + 1;
                maxLen = Math.max(maxLen, curr[j]);
            } else {
                curr[j] = 0;
            }
        }
        int[] temp = prev;
        prev = curr;
        curr = temp;  // reset current row
    }
    return maxLen;
}
  • Time Complexity: O(m * n)
  • Space Complexity: O(n)

7. Finding the Substring Itself

To get the substring, track the ending index when updating maxLen:

int endIndex = 0;
if (dp[i][j] > maxLen) {
    maxLen = dp[i][j];
    endIndex = i;  // end of substring in s1
}
String lcs = s1.substring(endIndex - maxLen, endIndex);

8. Variants of the Pattern

  1. Longest Repeated Substring – substring that occurs at least twice in a string.
  2. Longest Common Subsequence (LCS) – characters do not need to be contiguous.
  3. Longest Palindromic Substring – substring is a palindrome.

Observation:

  • LCS problems often rely on DP tables, subproblem overlapping, and tracking indices.
  • Recognize contiguous vs non-contiguous requirements.

9. When to Use This Pattern

  • When you need maximum-length contiguous matching segments between two strings.
  • When tracking common patterns in sequences is required.
  • Problems in text comparison, bioinformatics, plagiarism detection, and pattern matching.

Indicators:

  • “Longest common substring”
  • “Maximum overlap between strings”
  • “Find common blocks in sequences”

10. Key Takeaways

  1. Longest common substring requires contiguous characters.
  2. DP relation:
if s1[i-1] == s2[j-1] → dp[i][j] = dp[i-1][j-1] + 1
else → dp[i][j] = 0
  1. Base case: first row/column = 0.
  2. Track maximum length and optionally the ending index.
  3. Optimize space using only previous row for large strings.