Thursday, December 22, 2011

Markov Chains, or random text generation


and then at the end of every line: 'Speak roughly to your tea; it's getting late.' So Alice began to tremble. Alice looked round, eager to see if she were looking over their shoulders, that all the jurymen on to himself as he said in a solemn tone, only changing the order of the jurors were all crowded round it, panting, and asking, 'But who is to do this, so she began again: 'Ou est ma chatte?' which was immediately suppressed by the English, who wanted leaders,

The above lines appear to be lifted from Alice in Wonderland. However, on closer inspection, they are random gibberish, produced by an implementation of Markov chains (which we were introduced to while working on our mandatory project, required in partial fulfillment of our degree). Markov chains work on the principle that the future depends only on the present, not on the past. Consider the example "What is the last word at the end of this ______ ?". You must have guessed the word is "sentence" or "line". Markov chains work in a similar way: there is a set of possible succeeding states, and the probability of choosing a state depends on the data the chain is trained on. In other words, the two words "line" and "sentence" represent two different states. Now, if a person usually uses the word "sentence" more than "line", he is more likely to choose "sentence" as his next state.
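
To make this concrete, here is a tiny sketch (the transitions table below is made up for illustration, not taken from any real text). random.choice picks uniformly from a list, so storing repeated followers makes frequent words proportionally more likely:

import random

# Hypothetical counts: after "this", "sentence" was seen 3 times, "line" once.
transitions = {'this': ['sentence', 'sentence', 'sentence', 'line']}

# "sentence" is chosen with probability 3/4, "line" with probability 1/4.
print(random.choice(transitions['this']))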

I will introduce another term, "n-grams". Here, "n" refers to the number of words in the window: the previous n-1 words form the current state used to determine the future state.
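
As a quick illustration (a hypothetical sketch, separate from the program below), here is what the 3-gram contexts of a short sentence look like: each pair of consecutive words is a state, and the word after it is a possible future.

words = "the quick brown fox".split()
n = 3
for i in range(len(words) - n + 1):
    context = tuple(words[i:i + n - 1])  # the previous n-1 words
    print(context, '->', words[i + n - 1])  # the word that followed
# ('the', 'quick') -> brown
# ('quick', 'brown') -> fox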

In the following program, the training filename is given as a command-line argument. The n-gram model then produces an approximation of the book's prose. Books can be downloaded from http://www.gutenberg.org/

import sys
import random

N = 2  # context size; N is equal to (n-1) from the n-gram
dic = {}  # maps an N-word context tuple to the words observed after it
eoS = []  # contexts whose last word ends with punctuation (sentence boundaries)
W = ('',) * N  # sliding window holding the last N words seen

with open(sys.argv[1]) as f:
    for line in f:
        for word in line.split():
            if '' not in W:  # ignore contexts until the window is full
                dic.setdefault(W, []).append(word)
                if W[-1][-1] in (',', '.', '?', '!'):
                    eoS.append(W)
            W = W[1:] + (word,)

count = 10  # number of sentences to be generated
key = ()    # empty key is never in dic, forcing a random start below
while True:
    if key in dic:
        word = random.choice(dic[key])
        print(word, end=' ')
        key = key[1:] + (word,)
        if key in eoS:  # the generated text just reached a sentence boundary
            count -= 1
            if count <= 0:
                break
    else:
        key = random.choice(eoS)  # start (or restart) at a sentence boundary
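
Assuming the script is saved as markov.py and a book has been downloaded as, say, alice.txt (both filenames here are just examples), it can be run with:

python markov.py alice.txt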


Popular implementations include Mark V. Shaney and Dissociated Press.
