Find the Duplicate Number in a List: From Simple to Smart

As a seasoned software engineer with hands-on experience in building scalable systems and optimizing performance-critical code, I often work on algorithm optimization and space/time tradeoffs.

In this article, we tackle a common algorithm problem: how to find a duplicate number in a list.

It is a frequent question in technical interviews and high-performance applications. We'll start with the most straightforward solution and work toward the most efficient one, explaining each step clearly.

The Problem

You are given a list of numbers. Some numbers might be repeated. Your job is to find any one of the duplicate numbers.

Let's look at two types of approaches:

The simple approach uses extra memory.
The optimized approach uses constant memory.
Simple but Effective: Using a Set to Track Seen Numbers

from typing import List

def find_duplicate_number(nums: List[int]) -> int:
    """
    Find a duplicate number using a set to track previously seen values.

    Args:
        nums (List[int]): List of integers.

    Returns:
        int: A duplicate number found in the list.
    """
    seen = set()
    for num in nums:
        if num in seen:
            return num
        seen.add(num)

print("Expected: 2, Actual:", find_duplicate_number([1,3,4,2,2] ))
print("Expected: 3, Actual:", find_duplicate_number([3,1,3,4,2]))
print("Expected: 9999, Actual:", find_duplicate_number([i for i in range(1, 10**5)] + [9999]))
print("Expected: 10**6, Actual:", find_duplicate_number([10**6] + list(range(1,10**6))))
print("Expected: 1, Actual:", find_duplicate_number([1]*10**6 ))

How it Works

We create an empty set called seen.
We check every number in the list if it's already in seen.
If it is, we return it immediately as a duplicate.
Otherwise, we add the number to the set.

Performance

Time complexity: O(n)
Space complexity: O(n)

This solution is simple and works for any input. However, it uses extra memory.

More innovative Approach: Floyd's Tortoise and Hare (Cycle Detection)

We can do better if the input list follows a specific rule — that the numbers are from 1 to n and only one duplicate exists. We can use a famous algorithm: Floyd's Tortoise and Hare.

Assumptions:

The list has n + 1 integers.
Each integer is between 1 and n.
Exactly one duplicate exists.


from typing import List

def find_duplicate_number(nums: List[int]) -> int:
    """
    Find the duplicate number using Floyd's Tortoise and Hare cycle detection algorithm.
    
    This algorithm assumes that the list has exactly one duplicate and numbers range from 1 to n.

    Args:
        nums (List[int]): List of integers.

    Returns:
        int: The duplicate number.
    """
    # Phase 1: Detect the cycle (intersection point)
    slow = fast = nums[0]
    while True:
        slow = nums[slow]  # move one step
        fast = nums[nums[fast]]  # move two steps
        if slow == fast:
            break

    # Phase 2: Find entrance to the cycle (duplicate)
    slow = nums[0]
    while slow != fast:
        slow = nums[slow]
        fast = nums[fast]

    return slow

print("Expected: 2, Actual:", find_duplicate_number([1,3,4,2,2] ))
print("Expected: 3, Actual:", find_duplicate_number([3,1,3,4,2]))
print("Expected: 9999, Actual:", find_duplicate_number([i for i in range(1, 10**5)] + [9999]))
print("Expected: 1, Actual:", find_duplicate_number([1]*10**6 ))

How it Works

Imagine the numbers as links in a linked list.

A duplicate creates a cycle.

We use two pointers moving at different speeds to find the start of the cycle.

Performance

Time complexity: O(n)
Space complexity: O(1) — no extra memory!

Visual Aid

Imagine a race track where one racer runs at 1x speed and another at 2x speed. They will eventually meet inside a cycle; from there, we can trace back to the duplicate.

Which Should You Use?

Approach Time Space Works For

Set-Based O(n) O(n) Any list with possible multiple dups
Floyd’s Cycle Detection O(n) O(1) One duplicate, values from 1 to n only

If you don't have constraints on value ranges and possible duplicates, stick to the set-based approach. If memory is tight and you meet the conditions, use Floyd's algorithm.

When Performance Matters: Apply the Right Tool

Choosing the right solution depends on your context. Floyd's algorithm is your friend if you're working on data-heavy applications where space matters (like embedded systems or mobile apps).

The set-based approach is safer and easier to implement for general-purpose and simpler scenarios.

If you're preparing for coding interviews or optimizing real-world systems, practice both methods and understand when to use each. Save this reference and revisit whenever you need a refresh.

Join the Discussion

No comments yet. Be the first to share your thoughts!

Find the Duplicate Number in a List: From Simple to Smart

The Problem

How it Works

Performance

More innovative Approach: Floyd's Tortoise and Hare (Cycle Detection)

How it Works

Performance

Which Should You Use?

When Performance Matters: Apply the Right Tool

Become a PythonHaven Member

Join the Discussion

Read Next