Tuesday 19 June 2012

String and Pattern Matching Algorithms

Hey everyone in this article we would discuss 4 main techniques of string and pattern matching. They are as follows:
  1. Naive algorithm
  2. Rabin Karp Algorithm 
  3. Finite State automata
  4. Knuth Morris & Pratt algorithm
along with that we would also learn about the various application, advantages and disadvantages of these algorithms.


first of all lets understand what String matching is.


In string matching we have three main things to which we deal and they are:
  1. Text in the form of array of length n (say) we name it T
  2. A pattern of length smaller than n , obviously since it would be contained in the array and we have to search for it, lets assume its length is m and hence m is smaller n. we name it P
  3. The shift in text unless we find our pattern.
all we have to do is search for the pattern contained in the text and until we find it have to shift by a particular length in the text.


so let's get started with it :)




  1. Naive algorithm:
Applications:

Used for text editing and locating files.

By text editing we mean searching for a particular string in a text. The recursive call to the procedure would do this job but it might be problem when we are dealing will huge data .

Brute Force algorithm:Naive Algorithm

Shift in text T by just one step if the pattern does not matches.

n<- Length of text
m<-Length of pattern
for s=0 to n-m 
if [T[s+1,s+2,....,s+m]=P[1...m]
then print the pattern occurs at shift s
end loop

Analysis:

The time for which the loop executes  O((n-m+m)m).  This efficiency is O(n^2) in worst case when we have the value of m=n/2.

A good example of this would be a situation when we have the following:

T: abcabaabcabac
P:abaa

Drawbacks:
  • Information gained about the text for one value of s is totally ignored in the next value of it so it very inefficient.
  • can not be used for huge data  like in case of database etc. 
Advantage :

It is used by the next algorithm we are going to discuss , Rabin Karp Algorithm.

2. Rabin Karp Algorithm:

This algorithm generalizes to other algorithms for related problem, such as 2-D pattern matching.
It was an improvement over the Naive algorithm .

The basic idea: was to before spending a lot of time comparing characters for a match we do some pre-processing of the text and the pattern in order to eliminate some locations in the text. Then we could quickly run our Naive algorithm on what is left . 

We use a hash function to match the value of the patter and the text for one value of shift s.

we go for Naive algorithm after matching the hash value the patter . 
The reason for applying Naive after matching the hash value is to explicitly verify every valid shift. Also because in some cases it is possible that two different strings may give same hash value ,this dependent upon the hash function we used.


Analysis :
  • The running time of this algorithm is Θ((n-m+1)m) in the worst case but it improves on the average running time.
  • Its aim is to speed up the search process using by using hash function .
  • The algorithm says that two strings are equal if their hash values are equal.
Drawbacks:
  • Calculating the hash value again and again for every value of shift s is very tedious especially in case we are dealing with huge data.
  • In some cases it is possible that two different strings may give same hash value ,although this dependent upon the hash function we used.
The remedy for the above given drawback is the use of a mathematical rule called as the Horner's rule.

By this rule we compute p in time O(m).


Advantage:

  • By the use of the Horner Rule we try to make the hash of the next string based on the previous string . That means we do not ignore the information gained for one value of shift s in the next value for it. 

3.Finite state automata

In this method of string matching we create an automata that scans the text to check for all the occurrences of the pattern .


An automaton represented M has 5-tuple (Q,Σ,δ,q0,F), where:
  • Q is a finite set of states.
  • Σ is a finite set of symbols, called the alphabet of the automaton.
  • δ is the transition function, that is, δ: Q × Σ → Q.
  • q0 is the start state, that is, the state of the automaton before any input has been processed, where q0∈ Q.
  • F is a set of states of Q (i.e. F⊆Q) called accept states.

We begin with the state and read the characters of its input string one at a time and its makes transitions from one state to another.
It accepts a string if the current q0 read state is a member of F. otherwise it is said to be rejected. 

Advantage:
It is very efficient as it examines each text character exactly one, and takes contant time per text character.

Efficiency:
The time used  after the automata is built- is just  Θ(n)

4. Knuth-Morris-Pratt Algorithm :


we do string matching in linear time .


Analysis :


The running time of this algorithm is (n+m) . It avoid the the computation of the transition function as in case finite state automata and does the pattern matching by using the auxiliary function pie[1...m]. Which is pre computed .
from the pattern in time O(m).


It uses 2 functions:


  • Prefix function / failure function: It is used to store the knowledge about portion of the patter that has matched and the portion that has unmatched against shift of itself.
  • Transition Function: calculated from the auxiliary function.

5 comments:

  1. A Computer Science portal for geeks. It contains well written, well thought and well explained computer science and programming articles, quizzes and practice/competitive programming/company interview Questions.
    website: geeksforgeeks.org

    ReplyDelete
  2. A Computer Science portal for geeks. It contains well written, well thought and well explained computer science and programming articles, quizzes and practice/competitive programming/company interview Questions.
    website: geeksforgeeks.org

    ReplyDelete
  3. A Computer Science portal for geeks. It contains well written, well thought and well
    explained computer science and programming articles, quizzes and practice/competitive
    programming/company interview Questions.
    website: geeksforgeeks.org

    ReplyDelete
  4. A Computer Science portal for geeks. It contains well written, well thought and well
    explained computer science and programming articles, quizzes and practice/competitive
    programming/company interview Questions.
    website: geeksforgeeks.org

    ReplyDelete
  5. A Computer Science portal for geeks. It contains well written, well thought and well
    explained computer science and programming articles, quizzes and practice/competitive
    programming/company interview Questions.
    website: geeksforgeeks.org

    ReplyDelete