Grouping Anagrams in Python: From Naive to Optimal

 

If you've ever built a spellchecker, a search suggestion engine, or even just prepared for coding interviews, you've likely encountered the Group Anagrams problem.

It’s one of the most popular string manipulation challenges because it tests not just your coding skills but your ability to optimize for scale.

In this guide, you’ll learn:

  • How to group anagrams in Python

  • The difference between brute-force and optimal solutions

  • How to handle real-world, large-scale datasets

  • Common mistakes and related problems to strengthen your understanding

Whether you're prepping for a LeetCode interview or building backend systems at scale, this guide is for you.

What Is the Group Anagrams Problem?

Problem Definition

Given a list of strings, group the strings that are anagrams of each other.

An anagram is a word formed by rearranging the letters of another word.

Example Input

["eat", "tea", "tan", "ate", "nat", "bat"]

Example Output

[["eat", "tea", "ate"], ["tan", "nat"], ["bat"]]

 

Real-World Applications

This isn't just a coding exercise. Grouping anagrams has real uses:

  • Spell-check suggestions (e.g., "tea" → suggest "eat" or "ate")

  • Fraud detection (identifying scrambled text patterns)

  • Natural Language Processing (NLP) preprocessing

  • Plagiarism detection where sentences are reworded or scrambled

 

Solution 1: Brute Force (Sorting-Based Approach)

Idea

If two words are anagrams, their sorted forms will match.

Example:

Word Sorted Form
eat aet
tea aet
ate aet
tan ant
nat ant
bat abt

Group words that share the same sorted form.

The simplest way to group anagrams is to sort the characters in each word. Since anagrams share the same letters, sorting reveals a common structure.

from typing import List

def group_anagrams(s: List[str]) -> List[List[str]]:
    dp = {}
    for i in s:
        # Sort characters of the string to form the key
        sorted_i = ''.join(sorted(i))
        if sorted_i not in dp:
            dp[sorted_i] = []
        dp[sorted_i].append(i)
    return list(dp.values())


Time Complexity

Step Complexity
Sorting each word O(k log k)
Total time O(n × k log k)
Space O(n × k)
  • n: Number of words

  • k: Average word length

Why This Works?

  • Sorting turns "eat" and "tea" into the same key: "aet".
  • We use a dictionary to group all words with the same sorted key.

Why Use This?

  • Simple to implement

  • Works well for small datasets

  • Quickly understandable in interviews

What's the Problem?:

  • Sorting takes O(k log k) time for each word (k = word length).
  • Becomes inefficient for large datasets.

 

Solution 2: Optimized Approach (Frequency Counter as Key)

Idea

Instead of sorting, count the frequency of each character.
Anagrams will have the same frequency distribution.

Use a 26-element tuple as the key (since we’re dealing with lowercase English letters).

This avoids expensive sorting operations and creates a unique signature for each anagram group.
 

from collections import defaultdict
from typing import List

def group_anagrams(strs: List[str]) -> List[List[str]]:
    """
    Groups anagrams using frequency count as key.
    Each word is mapped to a 26-length character frequency tuple.
    """
    anagram_groups = defaultdict(list)

    for word in strs:
        count = [0] * 26  # 26 lowercase English letters
        for char in word:
            count[ord(char) - ord('a')] += 1
        key = tuple(count)  # Tuples are hashable and usable as dict keys
        anagram_groups[key].append(word)

    return list(anagram_groups.values())

Time Complexity

Step Complexity
Counting characters O(k)
Total time O(n × k)
Space O(n × k)
  • n = number of words
  • k = max length of any word

Why This Is Better

  • No need to sort → Faster for large datasets

  • Handles 100,000+ strings easily

  • Scales well for production systems

 

Test Cases: Manual

print(group_anagrams(["eat", "tea", "tan", "ate", "nat", "bat"]))
print(group_anagrams([]))
print(group_anagrams([""]))
print(group_anagrams(["a"*10000, "a"*10000]))


Even on massive datasets:

import random

print(group_anagrams(["".join(random.sample("abcde", 5)) for _ in range(100000)]))

Test Cases: Robust Verification

Here’s a copy-paste ready test suite to validate both solutions:

def run_tests():
    test_cases = [
        (["eat", "tea", "tan", "ate", "nat", "bat"], 3),
        ([""], 1),
        (["a"], 1),
        (["".join(chr(97 + i % 26) for i in range(1000))] * 1000, 1),
        ([chr(i % 26 + 97) * 5 for i in range(10**5)], 26),
    ]

    for i, (words, expected_groups) in enumerate(test_cases, 1):
        result = group_anagrams(words)
        actual_groups = len(result)
        print(f"Test {i}: {'Pass' if actual_groups == expected_groups else 'Fail'}")

 

Common Mistakes to Avoid

  • Sorting strings in every iteration when a frequency counter is faster.

  • Using lists as dictionary keys (not hashable)—use tuples instead.

  • Ignoring edge cases like [""] or ["a"].

 

For full mastery, practice these too:

Problem Link
Valid Anagram Check Checks if two strings are anagrams
Anagram Substring Search Find anagrams of a pattern in a larger text
Minimum Window Substring Find smallest substring containing all characters
Find All Anagrams in a String Classic sliding window

Ready to Scale with Python

The brute-force method of grouping anagrams may work for a handful of words, but it quickly becomes a bottleneck in production-scale systems. By switching to a frequency-based key strategy, we not only make our solution faster, but also more scalable and elegant.

After reading this, you should:

* Be able to clearly distinguish between naive and optimized anagram grouping methods.
* Know how to write scalable, clean, and efficient Python code.
* Understand how to tackle related issues in real-world scenarios

Real-World Scaling Tips

When moving from interviews to production:

  • Use defaultdict for clean code

  • If working on huge data pipelines:

    • Use parallelization (multiprocessing or threads)

    • Consider Redis, Memcached, or database sharding for distributed keys

 

Final Thoughts: When to Use Each Method

Scenario Method
Small Inputs / Interviews Sorting-based
Large-scale Systems

Frequency Counter

 

Download Resources

 

Conclusion

Now you can:

  • Confidently solve Group Anagrams in Python

  • Optimize your solution for scale and production-readiness

  • Tackle related problems to deepen your algorithmic thinking

Join the Discussion

No comments yet. Be the first to share your thoughts!