Project 2: Language modeling

[Image: markov — photo from the Neural Information Processing Systems Conference, NIPS 2014]

Design and implement functions that build an order K Markov model from a piece of input text. Markov models are popular tools in speech recognition, handwriting recognition, information retrieval, and data compression. For this assignment, you will consider a different application: generating stylized pseudo-random text.

Markov models. An order 0 Markov model predicts that each character in the alphabet occurs with a fixed probability, independent of previous characters. Given an input text, you compute the Markov model of order 0 by counting the number of occurrences of each letter in the input and using these counts as the frequencies. For example, if the input text is “agggcagcgggcg”, then the order 0 Markov model predicts that each character is a with probability 2/13, c with probability 3/13, and g with probability 8/13.

Characters in English text are not independent. If the previous 5 characters in a string are orith, then the next letter is almost surely m. An order K Markov model predicts that each character occurs with a fixed probability, but that probability can depend on the previous K characters. Given an input text, you compute a Markov model of order K by counting the number of occurrences of each letter that follows each sequence of K letters. For example, if the text has 100 occurrences of th, with 50 occurrences of the, 25 occurrences of thi, 20 occurrences of tha, and 5 occurrences of tho, then the order 2 Markov model predicts that the character following th is e with probability 1/2, i with probability 1/4, a with probability 1/5, and o with probability 1/20.
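
To make this concrete, the raw counts behind an order K model can be tallied in one pass over the text. The sketch below uses java.util.HashMap purely for illustration (the assignment asks you to build your own data types in Parts 0–2 instead); note that with K = 0 it reduces to the order 0 letter counts above.

import java.util.HashMap;
import java.util.Map;

// Illustration only: tally, for each K-character context, how often
// each character follows it.
public class CountSketch {
    public static Map<String, Map<Character, Integer>> counts(String text, int K) {
        Map<String, Map<Character, Integer>> model = new HashMap<>();
        for (int i = 0; i + K < text.length(); i++) {
            String context = text.substring(i, i + K);   // the K previous characters
            char next = text.charAt(i + K);              // the character that follows
            model.computeIfAbsent(context, c -> new HashMap<>())
                 .merge(next, 1, Integer::sum);          // bump the count for 'next'
        }
        return model;
    }
}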

Part 0: Markov model. Create a data type Markov to represent a K-character substring. Ultimately, it will have a method random that returns a random character according to the Markov model. For now, just make it store the substring and an integer that counts the number of times the substring appears. You will need a constructor, a method to increment the frequency count, and the usual toString method for output.

public Markov(String substring) 
public void add()
public String toString()
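
A minimal skeleton along these lines, with illustrative field names, might be:

public class Markov {
    private final String substring;   // the K-character key
    private int count;                // number of times the key appears

    public Markov(String substring) {
        this.substring = substring;
        this.count = 0;
    }

    // Increment the frequency count of this key.
    public void add() {
        count++;
    }

    // e.g. "2 ag" for a key seen twice
    public String toString() {
        return count + " " + substring;
    }
}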

Part 1: frequency counts. Implement a client program FrequencyCounter.java that reads the order parameter K of the Markov model from the command line and a text string from standard input, then uses a symbol table to insert each K-character substring (key) from the text. For example, if K is 2 and the input string is “agggcagcgggcg”, then your client program should create Markov objects for each of the 5 distinct keys and call the add method 12 times in total: ag gg gg gc ca ag gc cg gg gg gc cg. Maintain an integer count of the number of occurrences of each key. Use your symbol table’s methods to print out the number of distinct keys and the number of times each key appears in the text; a sketch of one possible implementation follows the sample output. For the example above, your program should output:

5 distinct keys
2 ag
1 ca
2 cg
3 gc
4 gg
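
One possible shape for the client is sketched below. A java.util.TreeMap stands in for whatever symbol table your course provides; printing its entries in key order happens to match the sample output above.

import java.util.Scanner;
import java.util.TreeMap;

// Sketch of the Part 1 client; TreeMap is a stand-in for your symbol table.
public class FrequencyCounter {
    public static void main(String[] args) {
        int K = Integer.parseInt(args[0]);
        String text = new Scanner(System.in).useDelimiter("\\Z").next();

        TreeMap<String, Markov> st = new TreeMap<>();
        for (int i = 0; i + K <= text.length(); i++) {
            String key = text.substring(i, i + K);   // next K-character substring
            st.putIfAbsent(key, new Markov(key));
            st.get(key).add();                       // 12 calls for the example
        }

        System.out.println(st.size() + " distinct keys");
        for (Markov m : st.values())
            System.out.println(m);
    }
}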

Part 2: language model. To generate random text, your data type Markov must know, for a given K-character key, all of the letters that follow it. This operation is at the crux of the matter, as you will need it to generate random characters in accordance with the Markov model. Modify your Markov data type so that, in addition to the frequency count, it records the breakdown by next letter. Create your own linked list to keep track of the suffix characters along with their frequencies. Modify the toString method so that it prints out the list of suffixes, along with the substring and the frequency count. Include the following method to insert a suffix character (one possible implementation is sketched below it).

public void add(char c)
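
Inside Markov, a hand-rolled singly linked list is enough. The node fields below are illustrative; note that in this sketch add(char c) also bumps the overall count, so the key's total equals the number of recorded suffixes.

// Inside Markov: a singly linked list of suffix characters.
private static class Node {
    char c;        // a character observed after this key
    int freq;      // how many times c followed the key
    Node next;
    Node(char c, Node next) { this.c = c; this.freq = 1; this.next = next; }
}
private Node suffixes;   // head of the suffix list

// Record one occurrence of suffix character c after this key.
public void add(char c) {
    count++;                                   // total occurrences of the key
    for (Node n = suffixes; n != null; n = n.next)
        if (n.c == c) { n.freq++; return; }    // seen before: bump its frequency
    suffixes = new Node(c, suffixes);          // first occurrence: prepend a node
}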

Modify FrequencyCounter.java to create a program LanguageModeler.java that inserts each key into the symbol table (if it is not already there) and calls add(char c) to record the appropriate suffix characters in the Markov model. It should produce the following output on the example input; one possible shape for the driver is sketched after the note below.

5 distinct keys
2 ag: 1 c 1 g
1 ca: 1 g
1 cg: 1 g
3 gc: 1 a 2 g
4 gg: 2 c 2 g

To make things look pretty, convert all whitespace characters into spaces or underscores. Note that since the last cg substring doesn’t have a “next” character, we don’t include it in the model.
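
As referenced above, the driver differs from FrequencyCounter only in stopping K characters early and passing the suffix along. TreeMap again stands in for your symbol table, and this sketch normalizes all whitespace to underscores; substitute spaces if that is the convention your output follows.

import java.util.Scanner;
import java.util.TreeMap;

// Sketch of the Part 2 driver.
public class LanguageModeler {
    public static void main(String[] args) {
        int K = Integer.parseInt(args[0]);
        String text = new Scanner(System.in).useDelimiter("\\Z").next()
                          .replaceAll("\\s", "_");       // normalize whitespace

        TreeMap<String, Markov> st = new TreeMap<>();
        // Stop K characters early: the final key has no "next" character.
        for (int i = 0; i + K < text.length(); i++) {
            String key = text.substring(i, i + K);
            st.putIfAbsent(key, new Markov(key));
            st.get(key).add(text.charAt(i + K));         // record key + suffix
        }

        System.out.println(st.size() + " distinct keys");
        for (Markov m : st.values())
            System.out.println(m);
    }
}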

Part 3: pseudo-random text generation. Add a method random to Markov that returns a pseudo-random character according to the language model. Be sure to get the probabilities right, as we will be checking this.

Now, write a program TextGenerator.java that takes two command-line arguments K and M, reads in a text string from standard input, and prints out M characters according to the order K Markov model. Start by printing the first K characters of the original text. Then, repeatedly generate successive pseudo-random characters. Using the example above, if the Markov object m represents the substring “gg”, then m.random() should return c or g, each with probability 1/2. After you generate a character, move over one character position, always using the last K characters generated to determine the probabilities for the next. For example, if your program chooses c in the example above, then the next Markov object would represent the substring “gc”, and according to the Markov model, the next character should be a with probability 1/3 and g with probability 2/3. Continue the process until you have output M characters. If the language model contains fewer than 100 K-tuples (prefixes), then print the language model (as in LanguageModeler.java) before you output the M randomly generated characters.
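
A generation loop consistent with this description might look like the sketch below; model building is the same loop as in LanguageModeler, random() is the Part 3 method, and TreeMap is again a stand-in symbol table.

import java.util.Scanner;
import java.util.TreeMap;

// Sketch of the Part 3 driver.
public class TextGenerator {
    public static void main(String[] args) {
        int K = Integer.parseInt(args[0]);
        int M = Integer.parseInt(args[1]);
        String text = new Scanner(System.in).useDelimiter("\\Z").next();

        TreeMap<String, Markov> st = new TreeMap<>();
        for (int i = 0; i + K < text.length(); i++) {     // build the model (Part 2)
            String key = text.substring(i, i + K);
            st.putIfAbsent(key, new Markov(key));
            st.get(key).add(text.charAt(i + K));
        }

        if (st.size() < 100) {                            // small model: print it first
            System.out.println(st.size() + " distinct keys");
            for (Markov m : st.values())
                System.out.println(m);
        }

        StringBuilder out = new StringBuilder(text.substring(0, K));
        while (out.length() < M) {
            String key = out.substring(out.length() - K); // last K characters generated
            Markov m = st.get(key);
            if (m == null) break;                         // key never seen with a suffix
            out.append(m.random());
        }
        System.out.println(out);
    }
}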

To select a random character, sum the frequencies of all the suffix characters and generate a random integer between 1 and that total. Then loop through the frequencies again, subtracting each one from the random number; as soon as the result is 0 or less, select that character.
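
Inside Markov, that scheme might read as follows, assuming the suffix list from Part 2 and that count equals the total number of recorded suffixes:

private static final java.util.Random RNG = new java.util.Random();

// Return a pseudo-random suffix character, each with probability
// proportional to its frequency, via the running subtraction above.
public char random() {
    int r = RNG.nextInt(count) + 1;            // uniform in 1..count
    for (Node n = suffixes; n != null; n = n.next) {
        r -= n.freq;                           // subtract this suffix's frequency
        if (r <= 0) return n.c;                // crossed zero: select this character
    }
    throw new IllegalStateException("no suffixes recorded for " + substring);
}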

Data

Find sample text here: https://github.com/ieee8023/cs310-summer2015

Deliverables.

Submit your code (TextGenerator.java, LanguageModeler.java, Markov.java), along with one or two of the most amusing language-modeling examples that you come up with. In the memo.txt file, describe your data structures and algorithms in detail and analyze the most important performance characteristics of your program. Discuss how your choice of K affects the resulting text.

Due: 7/7 @ 5pm.

Grading (total 25 points):

4 points: Part 0: Markov model.
5 points: Part 1: frequency counts.
6 points: Part 2: language model.
5 points: Part 3: pseudo-random text generation.
5 points: memo.txt and how easy the assignment is to grade

Example Output

$java TextGenerator 2 100 < data/obama-inaugural-address-small.txt
1 ll: 1 o
1 df: 1 u
1 lo: 1 w
1 w : 1 c
1 hu: 1 m
1 mb: 1 l
1 k : 1 b
1 yo: 1 u
2 ul: 2 _
1 um: 1 b
1 s,: 1 _
1 ic: 1 e
1 mi: 1 n
1 if: 1 i
2 us: 1 t 1 ,
2 ed: 1 _ 1 ,
2 ef: 1 u 1 o
1 ac: 1 r
1 in: 1 d
1 : : 1 I
1 el: 1 l
1 s:: 1 _
1 en: 1 s
1  I: 1 _
1 it: 1 i
1 ze: 1 n
1 t : 1 y
1 an: 1 d
1 er: 1 e
2 es: 1 t 1 .
1 ra: 1 t
1 ve: 1 _
1 iz: 1 e
1 as: 1 k
2 re: 2 _
1 at: 1 e
2 l : 1 f 1 o
1 av: 1 e
2 nd: 1 _ 1 f
1 ri: 1 f
1 ay: 1 _
2 d : 1 b 1 h
1 fe: 1 l
3  b: 2 e 1 y
1  c: 1 i
2 be: 1 s 1 f
1 ru: 1 s
1 fi: 1 c
1 ns: 1 :
2  f: 1 e 1 o
3 y : 1 t 1 f 1 h
1  g: 1 r
3  h: 1 a 1 e 1 u
1 d,: 1 _
2 fo: 2 r
1 bl: 1 e
1 u : 1 h
1 My: 1 _
1  m: 1 i
1 we: 1 d
1 sa: 1 c
2 fu: 2 l
1  o: 1 f
2  s: 1 a 1 t
6  t: 1 a 1 r 3 h 1 o
1 od: 1 a
1  u: 1 s
1 by: 1 _
1 of: 1 _
1 sk: 1 _
1  y: 1 o
6 e : 1 b 1 s 3 t 1 u
3 st: 1 _ 1 a 1 o
1 ce: 1 s
2 or: 1 _ 1 e
1 ci: 1 t
1 ou: 1 _
2 ow: 1 _ 1 e
1 gr: 1 a
2 ta: 1 s 1 n
1 r : 1 t
1 cr: 1 i
1 te: 1 f
3 th: 3 e
2 , : 1 g 1 m
1 ti: 1 z
1 I : 1 s
1 le: 1 d
1 ha: 1 v
1 f : 1 t
2 to: 1 d 1 w
4 he: 3 _ 1 r
1 da: 1 y
1 tr: 1 u
My fellow citizens: I sacrificestask bestask best you he the best you have best you he trust you hav
$java TextGenerator 10 1000 < data/obama-inaugural-address.txt
My fellow citizens:  I stand here today humbled by the blood of generations faced down fascism and communism not just with missiles and tanks, but with the sturdy alliances and endure what storms may come. Let it be said by our children's children that when we were tested we refused to let this journey end, that we did not turn back nor did we falter; and with eyes fixed on the horizon and God's grace upon us, we carried us up the long, rugged path towards prosperity and the silencing of dissent, know that you are on the wrong side of history; but that we will extend a hand if you are willing to unclench your fist.  To the people of poor nations, we pledge to work alongside you to make your farms flourish and let clean waters flow; to nourish starved bodies and feed hungry minds. And to those nations like ours that enjoy relative plenty, we say we can no longer apply. The question before us whether the market can spin out of control - and that a nation cannot prosper long when it favor
$java TextGenerator 10 1000 < data/model-speech.txt
Welcome to Princeton.  But Bush does see that we could absorb the facts that we enjoy today with personal attention from paper to electronic journals is also a fascinating issue to consider the second article that I asked you to look at: "Electronics and the Dim Future of the University" by Eli Noam.  Again, the bottleneck is all in the human brain works. Each person would link together cases related to one of interconnected knowledge when searching for a new compound, or a historian could draw new connections is huge, so the connections that do everything for you is misguided.  Some people expect to have tools that will depend upon our accumulated knowledge, and that even though computers each year....