6

Construct a String from another String using Suffix Trie

 1 year ago
source link: https://www.geeksforgeeks.org/construct-a-string-from-another-string-using-suffix-trie/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

A suffix tree is a data structure based on trie compression that stores all the suffixes of a given string. Apart from detecting patterns in strings, it has a range of applications in fields like bioinformatics, string algorithms, and data compression.

Features of Suffix Trie:

  • A Suffix Trie, commonly referred to as a suffix tree, is a data structure that resembles a tree and is used to store and look for patterns in strings. 
  • Each route in a suffix trie represents a particular suffix, and it keeps all the suffixes of a given string as pathways in a tree.
  • We commence with a blank tree and add each suffix of the string to a tree to generate a suffix trie for a text sequence. 
  • The empty string would serve as the root node of the output tree, and then each leaf node will symbolize a suffix of the input string. 
  • A frequent substring that appears in at least two of the suffixes is represented by each internal node.
  • The ability to quickly find substrings inside a text is one of the key benefits of employing a suffix trie.
  • By moving down the tree along the route that the pattern specifies, we may search for a pattern in the suffix trie given a pattern. 
  • We shall arrive at a leaf node that represents the suffix that begins with the pattern if the pattern is found in the string.

Examples:

Input: str1 = “programming” str2 = “gaming”
Output: [(3, 3), (5, 6), (8, 10)]

trie-2-copy.webp

Explanation: In this solution, we construct the suffix trie for the string “str1”. Then, for each substring of “str2”, we check if it exists in the suffix trie of “str1”. If it exists, we record the starting and ending indices of the substring in “str1” that form the given substring of “str2”.

In the example given, the first substring of “str2” is “g”. We search for this substring in the suffix trie of “str1” and find it at index 3. Therefore, we record the starting and ending indices of this substring in “str1” as (3, 3). Similarly, we find that the substring “am” in “str2” can be constructed from the suffix trie of “str1” using indices (5, 6) in “str1”. Finally, we find that the substring “ing” in “str2” can be constructed from the suffix trie of “str1” using indices (8, 10) in “str1”.

Therefore, the output of the program is [(3, 3), (5, 6), (8, 10)], which represents the starting and ending indices of each substring of “str1” that can be used to construct the corresponding substring in “str2”.

Input: str1 = “banana” str2 = “ana”
Output: [(1, 3)]

trie-1-copy.webp

Explanation: A suffix trie is a data structure that stores all the suffixes of a given string in a tree-like structure. To construct str2 from str1 using a suffix trie, we first build a suffix trie for str1. Then, we search for str2 in the suffix trie by traversing down the tree, following the edges labeled with the characters of str2.

To construct “ana” from “banana”, we start at the root of the suffix trie and follow the edges labeled “a”, “n”, and “a”, respectively, until we reach the end of the string. The indices of the characters we traverse are (1, 3), which correspond to the substring “ana” in str1.

Approach: This can be solved with the following idea:

Step 1: Create a Suffix Trie for the Original String

  • The first step is to construct a trie data structure that represents all the suffixes of the original string. This data structure is called a suffix trie and can be constructed using any standard algorithm.

Step 2: Identify Suffixes Beginning with the Initial Substring

  • After constructing the suffix trie, the next step is to locate all the suffixes that start with the initial substring of interest. This can be achieved by traversing the trie from the root to the leaf node that corresponds to the initial substring. By following the edges that match the characters of the initial substring, we can identify all the suffixes that begin with it.

Step 3: Determine the Longest Common Prefix (LCP) of the Suffixes

  • Once we have identified all the suffixes that begin with the initial substring, we need to determine their LCP. To accomplish this, we must identify the lowest common ancestor of the leaf nodes that correspond to the suffixes. The LCA represents the longest common prefix of the suffixes.

Step 4: Add the LCP to the Output String

  • After determining the LCP of the suffixes, we can add it to the output string.

Step 5: Repeat for Additional Substrings

  • To find the LCP for every additional substring, we repeat steps 2-4, beginning at the end of the previous substring. We identify all the suffixes that begin with the additional substring, determine their LCP, and add it to the output string.

Below is the code for the above approach:

  • Python3
# Implementing Trie using Trie and
# TrieNode classes
""" The Trie_Node class is defined with a 
__init__ method that creates a list of None
 values with length 26 to represent children nodes 
 and a Boolean variable isEndOfWord to mark the 
 end of a word in the trie."""


class Trie_Node:

    # Trie node class
    def __init__(self):
        self.children = [None]*26

        # Property to represent end
        # of a word in trie
        self.isEndOfWord = False


"""The Trie class is defined with a __init__ method 
that creates a root node using the getNode method 
and a _charToIndex private helper method to convert 
a character to its index in the children list."""


class Trie(Trie_Node):

    # Trie data structure class
    def __init__(self):
        self.root = self.getNode()

    def getNode(self):

      # Returns new trie node with
      # Null values
        return Trie_Node()

    def _charToIndex(self, ch):

        # Private helper function
        return ord(ch)-ord('a')

    """The insert method is defined to insert a 
    new key (a string) into the trie. It 
    iterates over each character of
    the key and checks if the character is 
    already present in the trie. If it's
    not present, it creates a new
    node and adds it to the children list 
    of the current node. The method marks 
    the last node as the end of the word."""

    def insert(self, key):

        # When word is already
        # present in trie
        word = self.root
        length = len(key)
        for level in range(length):
            index = self._charToIndex(key[level])

            # If character is not present
            # in trie
            if not word.children[index]:
                word.children[index] = self.getNode()
            word = word.children[index]
        word.isEndOfWord = True

    """The search method is defined to search for a 
    key (a string) in the trie. It iterates over 
    each character of the key and checks if the 
    character is present in the trie. If it's 
    not present, the method returns False.
    if the method reaches the end of the key 
    and the last node is marked as the end 
    of the word, the method returns True."""

    def search(self, key):

        # Search substring in the trie
        word = self.root
        length = len(key)
        level = 0
        while level < length:
            index = self._charToIndex(key[level])
            if not word.children[index]:
                return False
            word = word.children[index]
            level += 1

        if level == length:
            return True
        else:
            return False

    """The build_via_substrings method is 
    defined to build a suffix trie for a given
    input string P and search for all 
    substrings of another input string Q 
    in the trie."""
    def build_via_substrings(P, Q):

        # handling when length of S is 1
        if len(P) == 1:
            for i in range(len(Q)):
                if Q[i] != P:
                    return False
            return [(0, 0)]*len(Q)
        else:

            # creating suffix trie
            x = Trie()
            for i in range(len(P)):
                x.insert(P[i:])
            start_pos = 0
            substrings = []
            y = True
            k = 1

            # Search substrings in trie
            while k <= len(Q):
                y = x.search(Q[start_pos:k])
                if y == False:

                    # Unsuccessful search
                    # for a single lettered
                    # substring.
                    if k == start_pos + 1:
                        return False

                    elif k != start_pos + 1:

                        # When search fails
                        # for a substring
                        # greater than
                        # length = 1
                        sub = Q[start_pos:k-1]
                        lt = len(sub)
                        m = P.find(sub)
                        substrings.append((m, m + lt-1))
                        start_pos = k-1
                        k = k-1
                        y = True
                elif y == True and k == len(Q):

                    # We check whether we
                    # have reached the
                    # last letter
                    sub = Q[start_pos:]
                    lt = len(sub)
                    m = P.find(sub)
                    substrings.append((m, m + lt-1))
                k = k + 1
            if y == True and substrings == []:
                return [(P.find(Q), len(Q)-1)]
            else:
                return substrings


# Driver code
if __name__ == "__main__":
    str1 = "ssrtssr"
    str2 = "rsstsr"

    # Function call
    ans = Trie.build_via_substrings(str1, str2)
    print(ans)
Output
[(2, 2), (0, 1), (3, 4), (2, 2)]

Time Complexity: O(n2 + m)
Auxiliary Space: O(n*26)

Applications of Suffix Trie:

  • Suffix trie is used to find all occurrences of a pattern in a given text by searching for all substrings of the pattern in the text in pattern matching algorithms.
  • It is also used to assemble genome sequences from short DNA sequences by matching and aligning the short reads to the reference genome in bioinformatics.
  • Widely used to check whether a word is spelled correctly by searching for all possible substrings of the input word in spell-checking software.
  • It is preferably used to identify and optimize frequently used code patterns in compilers and code optimization tools.
  • Suffix trie is also used in natural language processing applications to properly match and categorize words and phrases based on their morphological and syntactical properties.

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK