Week 4 - Tries & Substring Search

Week 4 - Tries & Substring Search #

Tries #

  • [from retrieval, but pronounced “try”]
  • Store characters in nodes (not keys).
  • Each node has R children, one for each possible character.
    • int R = 256; // extended ASCII

Java implementation #

public class TrieST<Value> {

    private static final int R = 256; // extended ASCII
    private Node root = new Node();
    
    private static class Node { 
        private Object value; // use Object instead of Value since no generic array creation in Java
        private Node[] next = new Node[R]; // each node has an array of links and a value
    }
    
    public void put(String key, Value val) { root = put(root, key, val, 0); }
    
    private Node put(Node x, String key, Value val, int d) { 
        if (x == null) x = new Node(); 
        if (d == key.length()) { x.val = val; return x; } 
        char c = key.charAt(d); 
        x.next[c] = put(x.next[c], key, val, d+1); 
        return x; 
    }
    
    public boolean contains(String key) { return get(key) != null; }
    
    public Value get(String key) {
        Node x = get(root, key, 0);
        if (x == null) return null;
        return (Value) x.val;  // cast needed
    }
    
    private Node get(Node x, String key, int d) { 
        if (x == null) return null; 
        if (d == key.length()) return x; 
        char c = key.charAt(d); 
        return get(x.next[c], key, d+1); 
    }
}
  • Performance: Fast search hit and even Trie representation faster search miss, but wastes space.

  • Deletion in an R-way trie

    • Find the node corresponding to key and set value to null.
    • If node has null value and all null links, remove that node (and recur).

Ternary search tries(TST) #

  • To solve the problem that R-way tries took too much space.

  • Store characters and values in nodes (not keys).

  • Each node has 3 children: smaller (left), equal (middle), larger (right).

  • Java implementation

    public class TST<Value> { private Node root;
        private class Node {
            private Value val;
            private char c;
            private Node left, mid, right; 
        }
    
        public void put(String key, Value val) { root = put(root, key, val, 0); }
    
        private Node put(Node x, String key, Value val, int d) {
            char c = key.charAt(d); 
            if (x == null) { x = new Node(); x.c = c; } 
            if (c < x.c) x.left = put(x.left, key, val, d); 
            else if (c > x.c) x.right = put(x.right, key, val, d); 
            else if (d < key.length() - 1) x.mid = put(x.mid, key, val, d+1); 
            else x.val = val;
            return x;
        }
    
        public boolean contains(String key) { return get(key) != null; }
    
        public Value get(String key) {
            Node x = get(root, key, 0);
            if (x == null) return null;
            return x.val; 
        }
    
        private Node get(Node x, String key, int d) {
            if (x == null) return null; 
            char c = key.charAt(d); 
            if (c < x.c) return get(x.left, key, d); 
            else if (c > x.c) return get(x.right, key, d); 
            else if (d < key.length() - 1) return get(x.mid, key, d+1); 
            else return x;
        }
    
    }
    

TST with R^2 branching at root #

  • Hybrid of R-way trie and TST.
    • Do R^2 -way branching at root.
    • Each of R^2 root nodes points to a TST.

String symbol table implementation cost summary #

  • TST vs. hashing
    • Hashing.
      • Need to examine entire key.
      • Search hits and misses cost about the same.
      • Performance relies on hash function.
      • Does not support ordered symbol table operations.
    • TSTs.
      • Works only for strings (or digital keys).
      • Only examines just enough key characters.
      • Search miss may involve only a few characters.
      • Supports ordered symbol table operations (plus others!).
    • Bottom line. TSTs are:
      • Faster than hashing (especially for search misses).
      • More flexible than red-black BSTs.

Character-based Operations #

  • Prefix match. Find all keys in a symbol table starting with a given prefix.

  • Longest prefix. Find longest key in symbol table that is a prefix of query string.

  • Wildcard match. Keys that match .he: she and the.

  • Goal. Find pattern of length M in a text of length N.
    • typically N » M
  • Check for pattern starting at each text position.

    • worst case: ~M N char compares.
  • Brute-force algorithm needs backup for every mismatch.

  • alternate implementaion for backup.

Knuth-Morris-Pratt (KMP) #

Use DFA to match the pattern in string in linear time.

Deterministic finite state automaton (DFA) #

  • DFA is abstract string-searching machine.
    • State = number of characters in pattern that have been matched.

      • e.g. 3 means 3 chars have been matcheed, aka A B A.
    • EXACTLY one state transition for each char to another in alphabet from left to right.

    • When mismatch, the state will go back to the state that has the maximum matches so far.

      • e.g. state 5. if follow by B, the state will go to 4, because ABABAB still has the ABAB matches which is state 4.
    • Java code for search the pattern:

      • In the loop, we only augment i, and keep updating j until either i = N or j = M.

How to build DFA from pattern? #

    • State X is used for simulating when state j goes mismatch. Why?
      • For example, when j=5, we have matched pattern: ABABA. If the next char is not C, we have a mismatch. Then we need to match BABA (without the start A) in the existing previous pattern.
        • Why omit the first A? Think about how we do brute-force. after a mismatch, we restart the match with i+1. the same idea here.
      • The straightforward way is to run a loop to rematch BABA then deciding where the new mismatch char should go. But a genious way is to mark a state X in the previous path so every time we only need to check how the last char(j) goes in the X state for mismatching.
        • Take 5 as an example. we can simply tell where a mismatch goes by simulating 3 because BABA ends with matching 3 elements(As showed in the graph). We only need to check where the state will go for a mismatch and copy over the mismatch to state 5.
      • The next question is, how to find the state X without loop the whole [1,j-1] in the pattern?
        • Think about how to get X for state 5? We simulate BABA then check the mismatch of that state column. So what about X for state 4? we need to simulate BAB, right?
        • You can see to simulate BABA, we need to check the state for BAB and check where A goes on that state. Same as BAB, we need to find the state for BA at first, then check where the B goes.
        • So if we reverse the process, simulating B can be used for checking mismatch of AB_, simulating BA can be used for checking mismatch of ABA_, simulating BAB can be used for checking mismatch of ABAB_.
        • We don’t actually need to loop every element for searching state X but record the state of the previous match, and check the new element in the state which will be our state X for the next state’s mismatch. e.g.
          • [B][0] = 0, then the mismatch of AB_ check the state 0.
          • [A][0] = 1, then the mismatch of ABA_ check the state 1.
        • Why not calculate state for A_? because [i,j-1] is empty which means X = 0.
      • Let’s loop from the beginning.
        • when j = 0, nothing need to be done, the only match is A, and the others are 0. So we initialize X = 0.
        • When j = 1, we copy over the mismatches from state 0(X=0) to state 1. And mark X = dfa[B][0] = 0 meaning we finished the matching of [B], the next state can use the result as X.
        • When j = 2, we copy over the state 0 to state 2 as well. Notice that the mismatch is base on state 2, not 0. So for state 2, we only copy B, C, etc.., then set X = dfa[A][0] = 1 because after matching [BA] the state reached 1
        • When j = 3, we need to know what a match of [1, 2] looks like. [1, 2] = [BA] which already been set up in the last loop which X = 1. So we only need to copy over state 1 to 3, and then find the new X. X = state of [BAB] = dfa[B][1] = 2
        • When j = 4, same as last loop. Copy over, and set up X = state of [BABA] = dfa[A][2] = 3
        • Whey j = 5, copy over, and set up X = state of [BABAC] = dfa[C][3] = 0
  • Java implementation

    public class KMP {
    
        private String pat; 
        private int[][] dfa;
    
        public KMP(String pat) {
            // Build DFA from pattern.
            this.pat = pat;
            int M = pat.length();
            int R = 256; // ASCII, the graph only shows A, B and C, but it's actually 256 characters
            dfa = new int[R][M]; // initialized with zeros as default
            dfa[pat.charAt(0)][0] = 1; // only update the match case to 1
    
            // then we start j at 1 since j = 0 already been initialized above
            for (int X = 0, j = 1; j < M; j++) {
                // Compute dfa[][j].
                // Copy over all cases including mismatch cases.
                // when j = 1, the state X is dfa[1][0] = 0. so if there is a mismatch, it will return to state 0 which is A->1, B->0*, C->0.
                // notice we set up B to 0 in the loop, 
                // it's only for convenience, and will overwrite it later on.
                for (int c = 0; c < R; c++)
                    dfa[c][j] = dfa[c][X]; 
                dfa[pat.charAt(j)][j] = j+1; // Overwrite the match case.
                // j = 1, pat.charAt(j) = B, X = 0, dfa[pat.charAt(j)][X] = 0;
                // j = 2, pat.charAt(j) = A, X = 0, dfa[pat.charAt(j)][X] = 1;
                // j = 3, pat.charAt(j) = B, X = 1, dfa[pat.charAt(j)][X] = 2;
                // j = 4, pat.charAt(j) = A, X = 2, dfa[pat.charAt(j)][X] = 3;
                // j = 5, pat.charAt(j) = C, X = 3, dfa[pat.charAt(j)][X] = 0;
                // j = 6, pat.charAt(j) = _, X = 0, dfa[pat.charAt(j)][X] = _;
                // The goal is to mark the match state of `[1, j]` in the current loop, 
                // so we can fill out the mismatch for j+1 in the next loop,
                // because the mismatch of j+1 = the matching result of the state for [1-j].
                X = dfa[pat.charAt(j)][X]; // Update restart(X) state.
            }
        }
    
        public int search(String txt) { 
            // Simulate operation of DFA on txt.
            int i, j, N = txt.length(), M = pat.length();
            for (i = 0, j = 0; i < N && j < M; i++)
                j = dfa[txt.charAt(i)][j];
            if (j == M) return i - M; // found (hit end of pattern)
            else return N; // not found (hit end of text)
        }
    }
    
  • Cost analysis

    • KMP substring search accesses no more than M + N chars to search for a pattern of length M in a text of length N.

Boyer-Moore #

  • Intuition.
    • Scan characters in pattern from right to left.
    • Can skip as many as M text chars when finding one not in the pattern.
  • How much to skip?
    • Case 1. Mismatch character not in pattern.
      • mismatch character ‘T’ not in pattern: increment i one character beyond ‘T’ which measn i = j+1, j = length of the pattern.
    • Case 2a. Mismatch character in pattern.
      • mismatch character ‘N’ in pattern: align text ‘N’ with rightmost pattern ‘N’.
      • In the example above, i += 3. The comparing loop start from the right where j = 5. When mismatch, j = 3 and the mismatched element N’s index = 0. So i += (3 - 0).
      • So how to get 3?
        • First of all, we match text from right to left, as j is the index in the loop from pattern.length-1 to 0.
        • When the rightmost element didn’t match text[i], but text[i] is still in the pattern, we will augment i by j - (index of the matched element in the pattern).
        • If there are some repeated elements in the pattern, we will choose the rightmost one to avoid missing matches.
          • The refined equation becomes: i += j - (index of the rightmost matched element in the pattern).
        • There is another special case which will be demonstrated below:
    • Case 2b. Mismatch character in pattern.
      • You can see, if we align ‘E’ to the rightmost pattern ‘E’, the pattern will skip ‘-2’ since j = 3 and the index of the rightmost E = 5.
      • So instead of skiping a nagative number, we increment i by 1.
      • Then the i-update becomes i += Max(1, j - (index of the rightmost matched element in the pattern));
    • To make the computation easier, we create a right[] to record the (index of the rightmost matched element in the pattern):
      • Precompute index of rightmost occurrence of character c in pattern (-1 if character not in pattern, meaning we skip j-(-1) = j+1).
      • You can see E occured three times, but we choose the rightmost one’s index as the one for calculation.
        • i.e. every time an E mismatch occurrs, backup Max(1, 5-5) = 1 positions.
  • Java implementation
  • Worst-case. Can be as bad as ~ M N.

Rabin-Karp #

Basic idea = modular hashing #

  • Compute a hash of pattern characters 0 to M - 1.
  • For each i, compute a hash of text characters i to M + i - 1.
  • If pattern hash = text substring hash, check for a match.

Modular hash function #

  • Using the notation t_i for txt.charAt(i), we wish to compute

  • Horner’s method. Linear-time method to evaluate degree-M polynomial.

  • Efficiently computing the hash function

    • Can update hash function in constant time!
  • Java implementation

public class RabinKarp {
    private long patHash; // pattern hash value
    private int M; // pattern length
    private long Q; // modulus
    private int R; // radix
    private long RM; // R^(M-1) % Q

    public RabinKarp(String pat) { 
        M = pat.length(); 
        R = 256; 
        Q = longRandomPrime(); // a large prime (but avoid overflow)

        RM = 1; 
        for (int i = 1; i <= M-1; i++) // precompute R M – 1 (mod Q)
            RM = (R * RM) % Q; 
        patHash = hash(pat, M);
    }
    
    // Compute hash for M-digit key
    private long hash(String key, int M) { 
        long h = 0;
        for (int j = 0; j < M; j++)
        h = (R * h + key.charAt(j)) % Q;
        return h;
    }

    // check for hash collision using rolling hash function
    public int search(String txt) {
        int N = txt.length(); 
        int txtHash = hash(txt, M); 
        if (patHash == txtHash) return 0; 
        for (int i = M; i < N; i++) {
            txtHash = (txtHash + Q - RM*txt.charAt(i-M) % Q) % Q;
            txtHash = (txtHash*R + txt.charAt(i)) % Q;
            //  handle hash collision
            // 1. **Monte Carlo version**. Return match if hash match. Runs in linear time.
            //    * It is possible return the wrong answer.
            // 2. **Las Vegas version**. Check for substring match if hash match; continue search if false collision.
            //    * Confidence about the answer, but worst case is **MN**.
            if (patHash == txtHash) return i - M + 1; 
        } 
        return N;
    }
}

Summary #

  • Cost of searching for an M-character pattern in an N-character text.