Algorithms

String Distance

9 months, 1 week ago ; 106 views
Share this

String Distance

This algorithm goes by many names. You could for example have heard of edit distance and Levenshtein distance which refers to the same algorithm.

It expresses the way we define how similar two strings are by counting the number of steps we need to perform to convert one string into another one.

I am seeking to calculate the edit distance (Levenshtein distance) between two strings using various approaches starting with a recursive approach and incrementally building an optimized solution. The edit distance is the minimum number of operations (insertion, deletion, or substitution) required to transform one string into another.

I will explore an example by employing the three main operations that include Delete(D), Insert (I), and Replace (R). In this example, I wish to convert Saturday to Sundays

Saturday --------------------------------------------------------> Sundays

Suturday (R)

Sunurday (R)

Sundryday (R)

Sundayday (R)

Sundayay (R)

Sundaysy (D)

Sundays

 

I have performed 6 Replaces and 1 Deletion

The question I ask is this the optimum number of operations I can achieve for this problem?

It turns out, we can do better. Let's explore this alternative approach.

Saturday --------------------------------------------------------> Sundays

Sturday (D)

Surday (R)

Sunday (I)

Sundays

 

Here, we have only done 4 operations & it turns out to be the optimum number of operations to convert Saturday to Sundays.

 

What are the applications of the Levenshtein edit distance algorithm?

The Levenshtein edit distance algorithm, which calculates the minimum number of single-character edits required to transform one string into another, finds applications in various real-life scenarios. Here are some practical applications:

 Spell Checking and Autocorrection

Scenario: Spell checkers in word processors, web browsers, and other applications.
Application: Identifying and suggesting corrections for misspelled words based on their similarity to correctly spelled words.

DNA Sequencing

Scenario: Biological research and genomics.
Application: Comparing DNA sequences to identify genetic mutations or evolutionary relationships.

Plagiarism Detection

Scenario: Academic institutions, content publishing platforms.
Application: Detecting similarities between documents to identify potential instances of plagiarism.

Data Deduplication
Scenario: Database management, data cleaning.
Application: Identifying and removing duplicate records in databases by measuring the similarity of strings.

Fuzzy String Matching

Scenario: Information retrieval, data integration.
Application: Matching similar strings in databases, facilitating search operations, and merging data from different sources.

OCR (Optical Character Recognition) Correction

Scenario: Document digitization and OCR applications.
Application: Correcting errors in recognized text by comparing it with a reference text.

Speech Recognition

Scenario: Voice-based virtual assistants, transcription services.
Application: Correcting misrecognized words by finding the closest match based on Levenshtein distance.

Address Matching

Scenario: Logistics, postal services.
Application: Matching and validating addresses by comparing user-entered addresses with a standardized database.

Product Matching in E-Commerce

Scenario: Online retail platforms.
Application: Matching similar product names or descriptions to streamline catalog management and improve search relevance.

Error Detection and Correction in Communication

Scenario: Data transmission, networking.
Application: Identifying and correcting errors in transmitted data by comparing received data with the original.

Autocomplete and Predictive Text

Scenario: Search engines, mobile keyboards.
Application: Predicting and suggesting possible word completions as user types based on Levenshtein distance.

These applications highlight the versatility of the Levenshtein edit distance algorithm in solving problems related to string similarity and matching, making it a valuable tool in various domains.

Recurrence Relation

The aim is to recognize that this is a problem that can be solved using a dynamic programming approach.

I hope to achieve this by picking up an example, building up the tree of subproblems for it if there are any, and then finding out if these subproblems re-occur.

Set ------------------------------------ > Be

I will try & illustrate the steps I take to convert the word Set to Be.

The first step is to perform a delete operation giving the following results.

Delete   <------- 1 + Se ----> Be

The second step is performing an insert operation as follows.

Since our target word has the last character letter e, we also need to insert the letter e at the end of the initial word.

Insert ----------> 1 +         Sete ------> Be

 

Then I applied a simple shortcut because both last strings are the same by removing the last character from our target string.

1+        Set ----> B

 

Essentially, when two characters are the same, there's no need to do anything. It is equivalent to removing those two characters.

If we chose replacement instead of insertion then here's the outcome

Replace <------ 1 +    See --------> Be

The replace operation replaces the last character in the initial string with an equivalent of the last character in the target string. However, the same concept for insert applies here. We can drop both e's in the initial and target strings because they are the same.

1+      Se ----> B

 

So, to convert Set to Be, we drop the letter t in Set and replace the S with B.

Out of the discussion above, here's the recurrence relation.

dist(a,b)          a if b ==0

                       b if a==0

                      min(1 + dist(a-1, b), 1 + dist(a, b-1), 0/1 + dist(a-1, b-1))  # for(<---(D), --->(I), <--->(R))

the 0/1 case

1 if char@(a) != char@(b) else 0

 

Solution: Recursive Approach

class StringDistanceRec:
    def __init__(self, str_A: str, str_B: str):
        """
        Constructor initializes the class with two input strings.

        Parameters:
             - str_A (str): The first input string.
             - str_B (str): The second input string.
        """
        self.str_A = str_A
        self.str_B = str_B

    def distance_r(self, a: int, b: int) -> int:
        """
        Recursive method to calculate the edit distance between substrings.

        Parameters:
             - a (int): Index in the first string.
             - b (int): Index in the second string.

        Returns:
             - int: The minimum edit distance between substrings.
        """
        if a == 0:
            return b
        if b == 0:
            return a

        # Check if the current characters are equal for replacement cost
        replace_cost = 0 if self.str_A[a - 1] == self.str_B[b - 1] else 1

        # Calculate costs for deletion, insertion, and replacement
        cost_delete = self.distance_r(a - 1, b) + 1
        cost_insert = self.distance_r(a, b - 1) + 1
        cost_replace = self.distance_r(a - 1, b - 1) + replace_cost

        # Return the minimum cost among deletion, insertion, and replacement
        return min(cost_delete, cost_insert, cost_replace)

    def distance(self) -> int:
        """
        Public method to initiate the recursive calculation.

        Returns:
             - int: The edit distance between the two strings.
        """
        return self.distance_r(len(self.str_A), len(self.str_B))

	

print("recursive solution 1::", StringDistanceRec("Saturday", "Sundays").distance())
print("recursive solution 2::", StringDistanceRec("Set", "Be").distance())

Code walkthrough of the recursive solution

  • The StringDistanceRec class is initialized with two strings, str_A and str_B.

  • The distance_r method is a recursive function that calculates the edit distance between substrings of str_A and str_B. It takes two indices, a and b, representing the current position in each string.

  • The base cases check if either of the indices reaches 0. If one of them does, the function returns the length of the other string, representing the cost of inserting or deleting the remaining characters.

  • The variable replace_cost is set to 0 if the characters at the current positions are equal, and 1 otherwise.

  • The function calculates three costs: cost_delete, cost_insert, and cost_replace. These represent the costs of deletion, insertion, and replacement, respectively.

  • The function returns the minimum of these three costs.

  • The distance method initializes the recursive calculation with the lengths of the two input strings.

 

Time and space Complexity
 

Time Complexity

The distance_r method is a recursive function that explores all possible combinations of edits between the two strings. For each character position in the first string, it recursively explores deletion, insertion, and replacement operations. The total number of recursive calls grows exponentially with the length of the strings.

Let m be the length of the first string (str_A) and n be the length of the second string (str_B). The time complexity can be expressed as O(3^(m+n)), where 3 represents the three possible operations: deletion, insertion, and replacement.

Space Complexity

The space complexity is determined by the depth of the recursion, which is equal to the length of the shorter string between str_A and str_B. In the worst case, it is O(min(m, n)).

This exponential time and space complexity makes the recursive solution impractical for large strings due to its inefficiency. The reason for this inefficiency is the repeated calculations of the same subproblems, which could be avoided using dynamic programming techniques such as memoization or tabulation.

Solution: Top-Down

class StringDistanceTopDown:

	def __init__(self, str_A, str_B):

        """
        Constructor initializes the class with two input strings.

        Parameters:
             - str_A (str): The first input string.
             - str_B (str): The second input string.
        """

		self.str_A = str_A
		self.str_B = str_B
		self.dp ={}

	def distance_r(self, a, b):

        """
        Recursive method to calculate the edit distance between substrings.

        Parameters:
             - a (int): Index in the first string.
             - b (int): Index in the second string.

        Returns:
             - int: The minimum edit distance between substrings.
        """

		if (a, b) in self.dp:
			return self.dp[(a, b)]
		if a == 0:
			return b
		if b == 0:
			return a

        # Check if the current characters are equal for replacement cost
		replace_cost = 0 if self.str_A[a-1] == self.str_B[b-1] else 1

        # Calculate costs for deletion, insertion, and replacement
		cost_delete = self.distance_r(a-1, b) + 1
		cost_insert = self.distance_r(a, b-1) + 1
		cost_replace = self.distance_r(a-1, b-1) + replace_cost

        # calculate the minimum cost for deletion, insertion, and replacement and store it in the dp
		self.dp[(a, b)]=min(cost_delete, cost_insert, cost_replace)
		return self.dp[(a, b)]
	
	def distance(self):
        """
        Public method to initiate the recursive calculation.

        Returns:
             - int: The edit distance between the two strings.
        """
		return self.distance_r(len(self.str_A), len(self.str_B))
	

print("Top Down 1::", StringDistanceTopDown("Saturday", "Sundays").distance())
print("Top Down 2::", StringDistanceTopDown("Set", "Be").distance())
print("Top Down 3::", StringDistanceTopDown("TodyIsSaturday", "TomorrowIsSunday").distance())

 

Time and space Complexity

Time Complexity

The use of memoization helps in avoiding redundant calculations. The time complexity is improved compared to the naive recursive solution. However, the worst-case time complexity is still O(m * n), where m is the length of the first string (str_A) and n is the length of the second string (str_B).


Space Complexity

The space complexity is determined by the space required to store the memoization table (self.dp). In the worst case, where all possible combinations of a and b are memoized, the space complexity is O(m * n). Each entry in the memoization table takes constant space.

 

Solution: Bottom-Up

class StringDistanceBottomUp:

    def __init__(self, str_A, str_B):

        """
        Constructor initializes the class with two input strings.

        Parameters:
             - str_A (str): The first input string.
             - str_B (str): The second input string.
        """

        self.str_A = str_A
        self.str_B = str_B

        self.dp = [[-1 for _ in range(len(self.str_B) + 1)] for _ in range(len(self.str_A) + 1)]

        for i in range(len(self.str_A) + 1):
            print(self.dp[i])

        for a in range(len(self.str_A) + 1):
            for b in range(len(self.str_B) + 1):
                if a == 0:
                    self.dp[a][b] = b
                elif b == 0:
                    self.dp[a][b] = a
                else:
                    replace_cost = 0 if self.str_A[a - 1] == self.str_B[b - 1] else 1

                    cost_delete = self.dp[a - 1][b] + 1
                    cost_insert = self.dp[a][b - 1] + 1
                    cost_replace = self.dp[a - 1][b - 1] + replace_cost
                    self.dp[a][b] = min(cost_delete, cost_insert, cost_replace)

    def distance(self):
         """
        Recursive method to calculate the edit distance between substrings.

        Parameters:
             - a (int): Index in the first string.
             - b (int): Index in the second string.

        Returns:
             - int: The minimum edit distance between substrings.
        """
        return self.dp[len(self.str_A)][len(self.str_B)]


print("Bottom Up Solution 1::", StringDistanceBottomUp("Saturday", "Sundays").distance())
print("Bottom Up solution 2::", StringDistanceBottomUp("Set", "Be").distance())
print("Bottom Up solution 3::", StringDistanceBottomUp("TodyIsSaturday", "TomorrowIsSunday").distance())	

Time Complexity

The time complexity of the bottom-up dynamic programming solution is O(m * n), where m is the length of the first string (str_A) and n is the length of the second string (str_B). This is because the solution iterates over all possible combinations of indices a and b in nested loops, and at each step, it performs constant-time operations.


Space Complexity

The space complexity of the bottom-up dynamic programming solution is also O(m * n). The self.dp matrix is used to store intermediate results for each combination of indices, and its size is proportional to the product of the lengths of the input strings. Each entry in the matrix takes constant space.

 

If you want a varied initialization of the self.dp matrix while still getting the same time & space complexity, then feel free to explore the following option.

class StringDistanceBottomUp2:

	def __init__(self, str_A, str_B):
		self.str_A = str_A
		self.str_B = str_B
		self.dist =[[]] * (len(str_A) + 1)

		for a in range(len(str_A) + 1):
			self.dist[a] =[-1] * (len(str_B)+1)

			for b in range(len(str_B) +1):

				if a == 0:
					self.dist[a][b] = b
				elif b == 0:
					self.dist[a][b] = a
				else:
					replace_cost = 0 if self.str_A[a - 1] == self.str_B[b - 1] else 1
					cost_delete = self.dist[a-1][b] + 1
					cost_insert = self.dist[a][b-1] +1
					cost_replace = self.dist[a-1][b-1] + replace_cost
					min_cost=min(cost_delete, cost_insert, cost_replace)
					self.dist[a][b] = min_cost
	def distance(self):
		return self.dist[len(self.str_A)][len(self.str_B)]
	

print("Bottom Up 1::", StringDistanceBottomUp2("Saturday", "Sundays").distance())
print("Bottom Up 2::", StringDistanceBottomUp2("Set", "Be").distance())
print("Bottom Up 3::", StringDistanceBottomUp2("TodyIsSaturday", "TomorrowIsSunday").distance())

 

Solution: Bottom Up Optimized

The optimized version of the bottom-up solution is here.

class StringDistanceBottomUpOptimized:

	def __init__(self, str_A, str_B):

        """
        The constructor initializes the class with two input strings.

        Parameters:
             - str_A (str): The first input string.
             - str_B (str): The second input string.
        """

		self.str_A = str_A
		self.str_B = str_B
		self.dist_read  =[-1] * (len(str_B)+1)
		self.dist_write =[-1] * (len(str_B)+1)

		for a in range(len(str_A) + 1):
			
			for b in range(len(str_B) +1):

				if a == 0:
					self.dist_write[b] = b
				elif b == 0:
					self.dist_write[b] = a
				else:
					replace_cost = 0 if self.str_A[a - 1] == self.str_B[b - 1] else 1

					cost_delete = self.dist_read[b] + 1
					cost_insert = self.dist_write[b-1] +1
					cost_replace = self.dist_read[b-1] + replace_cost

					min_cost=min(cost_delete, cost_insert, cost_replace)
					self.dist_write[b] = min_cost
			(self.dist_write, self.dist_read) =(self.dist_read, self.dist_write)
	def distance(self):
         """
        Public method to initiate the recursive calculation.

        Returns:
             - int: The edit distance between the two strings.
        """

		return self.dist_read[len(self.str_B)]
	

print("Bottom Up Optimized 1::", StringDistanceBottomUpOptimized("Saturday", "Sundays").distance())
print("Bottom Up Optimized 2::", StringDistanceBottomUpOptimized("Set", "Be").distance())
print("Bottom Up Optimized 3::", StringDistanceBottomUpOptimized("TodyIsSaturday", "TomorrowIsSunday").distance())	

Time Complexity

The time complexity of the bottom-up optimized dynamic programming solution is O(m * n), where m is the length of the first string (str_A) and n is the length of the second string (str_B). This is because the solution iterates over all possible combinations of indices a and b in nested loops, and at each step, it performs constant-time operations.


Space Complexity

The space complexity of the bottom-up optimized dynamic programming solution is O(min(m, n)). The optimization involves using two 1D arrays (dist_read and dist_write) instead of a 2D matrix. These arrays are swapped in each iteration, and only two rows of the matrix are maintained at any given time. Therefore, the space required is proportional to the minimum of the lengths of the input strings.

 

 

 

 

 

Become a member
Get the latest news right in your inbox. We never spam!

Read next

Island Perimeter

&nbsp;This solution seeks to find the perimeter of an island in a grid.&nbsp; The problem considers a grid where each cell represents land (1) or … Read More

Kibsoft 3 months, 3 weeks ago . 76 views

Pacific Atlantic Waterflow

3 months, 3 weeks ago . 82 views