Find the Duplicate Number in a List: From Simple to Smart
As a seasoned software engineer with hands-on experience in building scalable systems and optimizing performance-critical code, I often work on algorithm optimization and space/time tradeoffs.
In this article, we tackle a common algorithm problem: how to find a duplicate number in a list.
It is a frequent question in technical interviews and high-performance applications. We'll start with the most straightforward solution and work toward the most efficient one, explaining each step clearly.
The Problem
You are given a list of numbers. Some numbers might be repeated. Your job is to find any one of the duplicate numbers.
Let's look at two types of approaches:
- The simple approach uses extra memory.
- The optimized approach uses constant memory.
- Simple but Effective: Using a Set to Track Seen Numbers
from typing import List
def find_duplicate_number(nums: List[int]) -> int:
"""
Find a duplicate number using a set to track previously seen values.
Args:
nums (List[int]): List of integers.
Returns:
int: A duplicate number found in the list.
"""
seen = set()
for num in nums:
if num in seen:
return num
seen.add(num)
print("Expected: 2, Actual:", find_duplicate_number([1,3,4,2,2] ))
print("Expected: 3, Actual:", find_duplicate_number([3,1,3,4,2]))
print("Expected: 9999, Actual:", find_duplicate_number([i for i in range(1, 10**5)] + [9999]))
print("Expected: 10**6, Actual:", find_duplicate_number([10**6] + list(range(1,10**6))))
print("Expected: 1, Actual:", find_duplicate_number([1]*10**6 ))
How it Works
- We create an empty set called seen.
- We check every number in the list if it's already in seen.
- If it is, we return it immediately as a duplicate.
- Otherwise, we add the number to the set.
Performance
- Time complexity: O(n)
- Space complexity: O(n)
This solution is simple and works for any input. However, it uses extra memory.
More innovative Approach: Floyd's Tortoise and Hare (Cycle Detection)
We can do better if the input list follows a specific rule — that the numbers are from 1 to n and only one duplicate exists. We can use a famous algorithm: Floyd's Tortoise and Hare.
Assumptions:
- The list has n + 1 integers.
- Each integer is between 1 and n.
- Exactly one duplicate exists.
from typing import List
def find_duplicate_number(nums: List[int]) -> int:
"""
Find the duplicate number using Floyd's Tortoise and Hare cycle detection algorithm.
This algorithm assumes that the list has exactly one duplicate and numbers range from 1 to n.
Args:
nums (List[int]): List of integers.
Returns:
int: The duplicate number.
"""
# Phase 1: Detect the cycle (intersection point)
slow = fast = nums[0]
while True:
slow = nums[slow] # move one step
fast = nums[nums[fast]] # move two steps
if slow == fast:
break
# Phase 2: Find entrance to the cycle (duplicate)
slow = nums[0]
while slow != fast:
slow = nums[slow]
fast = nums[fast]
return slow
print("Expected: 2, Actual:", find_duplicate_number([1,3,4,2,2] ))
print("Expected: 3, Actual:", find_duplicate_number([3,1,3,4,2]))
print("Expected: 9999, Actual:", find_duplicate_number([i for i in range(1, 10**5)] + [9999]))
print("Expected: 1, Actual:", find_duplicate_number([1]*10**6 ))
How it Works
Imagine the numbers as links in a linked list.
A duplicate creates a cycle.
We use two pointers moving at different speeds to find the start of the cycle.
Performance
- Time complexity: O(n)
- Space complexity: O(1) — no extra memory!
Visual Aid
Imagine a race track where one racer runs at 1x speed and another at 2x speed. They will eventually meet inside a cycle; from there, we can trace back to the duplicate.
Which Should You Use?
Approach Time Space Works For
Set-Based O(n) O(n) Any list with possible multiple dups
Floyd’s Cycle Detection O(n) O(1) One duplicate, values from 1 to n only
If you don't have constraints on value ranges and possible duplicates, stick to the set-based approach. If memory is tight and you meet the conditions, use Floyd's algorithm.
When Performance Matters: Apply the Right Tool
Choosing the right solution depends on your context. Floyd's algorithm is your friend if you're working on data-heavy applications where space matters (like embedded systems or mobile apps).
The set-based approach is safer and easier to implement for general-purpose and simpler scenarios.
If you're preparing for coding interviews or optimizing real-world systems, practice both methods and understand when to use each. Save this reference and revisit whenever you need a refresh.