Learnitweb

Detecting gaps in sequential data

1. Introduction

In many business applications, you may have tables containing sequential values, such as:

  • Transaction IDs that should be continuous (e.g., invoice numbers: 1001, 1002, 1003, …)
  • Dates that should appear daily without missing days (e.g., sales date series)
  • Employee sequence numbers, order numbers, or batch IDs

However, due to missing records, data corruption, or load errors, some IDs or dates may be missing. Detecting these gaps is a very common data quality check.

In Oracle, there are several ways to identify these gaps using SQL.


2. Example Setup

Let’s create a simple example table representing a list of transactions.

CREATE TABLE transactions (
    txn_id NUMBER PRIMARY KEY,
    txn_date DATE
);

INSERT INTO transactions VALUES (1, TO_DATE('01-JAN-2025','DD-MON-YYYY'));
INSERT INTO transactions VALUES (2, TO_DATE('02-JAN-2025','DD-MON-YYYY'));
INSERT INTO transactions VALUES (3, TO_DATE('03-JAN-2025','DD-MON-YYYY'));
INSERT INTO transactions VALUES (5, TO_DATE('05-JAN-2025','DD-MON-YYYY'));
INSERT INTO transactions VALUES (6, TO_DATE('06-JAN-2025','DD-MON-YYYY'));
INSERT INTO transactions VALUES (8, TO_DATE('08-JAN-2025','DD-MON-YYYY'));
COMMIT;

Here:

  • We are missing IDs 4 and 7.
  • We are missing dates 04-JAN-2025 and 07-JAN-2025.

3. Detecting Gaps in Sequential IDs

Let’s start with ID gaps.

Method 1: Using LAG() Analytic Function

Oracle’s analytic functions can compare each row with its previous row, making them perfect for gap detection.

SELECT
    txn_id,
    LAG(txn_id) OVER (ORDER BY txn_id) AS prev_txn,
    txn_id - LAG(txn_id) OVER (ORDER BY txn_id) AS diff
FROM transactions;

Output:

TXN_IDPREV_TXNDIFF
1NULLNULL
211
321
532
651
862

Explanation:

  • The LAG() function retrieves the previous transaction ID.
  • DIFF = txn_id - prev_txn shows the gap size between consecutive IDs.
  • Wherever DIFF > 1, it indicates a missing sequence.

Now we can find only the missing gaps:

SELECT
    LAG(txn_id) OVER (ORDER BY txn_id) + 1 AS missing_start,
    txn_id - 1 AS missing_end
FROM transactions
WHERE txn_id - LAG(txn_id) OVER (ORDER BY txn_id) > 1;

Output:

MISSING_STARTMISSING_END
44
77

Interpretation:

  • Between 3 and 5 → Missing 4
  • Between 6 and 8 → Missing 7

This is the most efficient and commonly used approach.


Method 2: Using Recursive CTE to Generate the Full Range

If you want to list all missing IDs explicitly, not just the ranges, use a recursive CTE.

WITH RECURSIVE all_ids (id) AS (
    SELECT MIN(txn_id) FROM transactions
    UNION ALL
    SELECT id + 1 FROM all_ids
    WHERE id < (SELECT MAX(txn_id) FROM transactions)
)
SELECT id AS missing_id
FROM all_ids
WHERE id NOT IN (SELECT txn_id FROM transactions);

Output:

MISSING_ID
4
7

Explanation:

  1. The recursive CTE generates a list of all numbers between the minimum and maximum ID.
  2. Then we subtract the IDs that actually exist.
  3. The remaining ones are the gaps.

Method 3: Using CONNECT BY (Non-Recursive)

Older versions of Oracle (before 11gR2) didn’t support recursive CTEs. You can use the CONNECT BY hierarchical query instead:

SELECT level + (SELECT MIN(txn_id) FROM transactions) - 1 AS id
FROM dual
CONNECT BY level <= (SELECT MAX(txn_id) - MIN(txn_id) + 1 FROM transactions)
MINUS
SELECT txn_id FROM transactions
ORDER BY id;

Output:

ID
4
7

Explanation:

  • The CONNECT BY clause generates a continuous sequence using Oracle’s pseudo-column LEVEL.
  • The MINUS operator filters out existing IDs, leaving only the missing ones.

4. Detecting Gaps in Sequential Dates

Now let’s look at missing dates — another common case, especially for time-series data like sales or attendance.

We’ll reuse the same data where some dates are missing.


Method 1: Generate Continuous Dates Using CONNECT BY

SELECT
    start_date + LEVEL - 1 AS expected_date
FROM (
    SELECT MIN(txn_date) AS start_date, MAX(txn_date) AS end_date FROM transactions
)
CONNECT BY LEVEL <= end_date - start_date + 1
MINUS
SELECT txn_date FROM transactions
ORDER BY expected_date;

Output:

EXPECTED_DATE
04-JAN-2025
07-JAN-2025

Explanation:

  1. CONNECT BY generates all dates between the minimum and maximum.
  2. We then subtract the dates that actually exist.
  3. The remaining rows are missing dates.

Method 2: Using Recursive CTE

If you prefer standard SQL instead of Oracle’s hierarchical syntax, use a recursive CTE:

WITH dates (d) AS (
    SELECT MIN(txn_date) FROM transactions
    UNION ALL
    SELECT d + 1 FROM dates
    WHERE d < (SELECT MAX(txn_date) FROM transactions)
)
SELECT d AS missing_date
FROM dates
WHERE d NOT IN (SELECT txn_date FROM transactions);

Output:

MISSING_DATE
04-JAN-2025
07-JAN-2025

Explanation:

  • The recursive CTE builds every date in the range.
  • We then exclude the ones already present.

This is clean, ANSI-standard SQL and works in modern Oracle versions (12c+).


Method 3: Using LAG() for Date Gaps

SELECT
    txn_date,
    LAG(txn_date) OVER (ORDER BY txn_date) AS prev_date,
    txn_date - LAG(txn_date) OVER (ORDER BY txn_date) AS diff
FROM transactions;

Output:

TXN_DATEPREV_DATEDIFF
01-JAN-25NULLNULL
02-JAN-2501-JAN-251
03-JAN-2502-JAN-251
05-JAN-2503-JAN-252
06-JAN-2505-JAN-251
08-JAN-2506-JAN-252

To display missing date ranges:

SELECT
    prev_date + 1 AS missing_start,
    txn_date - 1 AS missing_end
FROM (
    SELECT
        txn_date,
        LAG(txn_date) OVER (ORDER BY txn_date) AS prev_date
    FROM transactions
)
WHERE txn_date - prev_date > 1;

Output:

MISSING_STARTMISSING_END
04-JAN-202504-JAN-2025
07-JAN-202507-JAN-2025

5. Detecting Gaps in Real Data Scenarios

Example: Missing Invoices

WITH invoices AS (
    SELECT 1001 AS inv_no FROM dual UNION ALL
    SELECT 1002 FROM dual UNION ALL
    SELECT 1004 FROM dual UNION ALL
    SELECT 1005 FROM dual
)
SELECT
    prev_inv + 1 AS missing_start,
    inv_no - 1 AS missing_end
FROM (
    SELECT
        inv_no,
        LAG(inv_no) OVER (ORDER BY inv_no) AS prev_inv
    FROM invoices
)
WHERE inv_no - prev_inv > 1;

Output:

MISSING_STARTMISSING_END
10031003

6. Performance and Best Practices

  1. For large datasets, prefer LAG() or LEAD() analytic queries over recursive CTEs—they are faster and more memory efficient.
  2. Always index your sequential column (like ID or DATE) to improve join and filter performance.
  3. When using recursive CTE or CONNECT BY, always limit recursion to avoid infinite loops.
  4. MINUS performs a set-based subtraction and can be slower on large tables; prefer NOT EXISTS if performance matters.
  5. For date-based gap detection, store data in DATE or TIMESTAMP columns, not VARCHAR.

7. Summary Table

PurposeBest ApproachExample Function
Detect gaps in numeric IDsLAG() analytic functiontxn_id - LAG(txn_id)
Detect gaps in datesLAG() or recursive CTEtxn_date - LAG(txn_date)
Generate complete sequenceCONNECT BY or recursive CTESELECT LEVEL
Show missing rangesLAG() + WHERE diff > 1Yes
Show each missing valueRecursive or hierarchical queryYes

8. Conclusion

Detecting gaps in sequential data (IDs or dates) is crucial for ensuring data completeness and correctness.
Oracle offers multiple techniques to handle this:

  • Analytic functions (LAG, LEAD) for efficient difference detection
  • Recursive CTEs for generating and comparing sequences (standard SQL)
  • CONNECT BY LEVEL for fast number/date generation (Oracle-specific)