HW7 : Word Count


Due: Monday 7/14/14 @5:30pm via users.cs.umb.edu:cs210/hw7

Gain experience using a Map (aka Symbol Table aka Association List)

1. Write a program WordCount that takes a file as an argument (as args[0])

2. Read the file and gather the statistics to print, using Java's HashMap or TreeMap.

3. Find the total number of unique words. Add each word to a Map; the number of unique words is the size of the map's key set.

4. Store a count for each word and increment it every time you see that word. Hint: get() the current count and then put() the updated count. Use these counts to find the most and least common words.

5. Keep track of the lengths of words using another Map. Use the length as the key and increment the number of words of that length. Sort the key set of the map, then iterate over the sorted keys and get the value for each one. You can use new ArrayList<Integer>(map.keySet()) to get an ArrayList of the keys, and Collections.sort to sort that ArrayList.

6. Write a memo.txt describing any problems you had during this assignment and what you learned.
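
Steps 3 and 4 above can be sketched roughly as follows. This is only an illustration on a small hard-coded list of words, not a full solution: the class name CountSketch and the helper methods countWords, mostCommon, and leastCommon are made up for this example, and ties between equally common words are broken arbitrarily.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CountSketch {
    // Count how many times each word appears in the list.
    static Map<String, Integer> countWords(List<String> words) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String word : words) {
            Integer count = counts.get(word); // null if the word is new
            if (count == null) {
                count = 0;
            }
            counts.put(word, count + 1);
        }
        return counts;
    }

    // Return a word with the largest count (arbitrary choice on ties).
    static String mostCommon(Map<String, Integer> counts) {
        String best = null;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (best == null || entry.getValue() > counts.get(best)) {
                best = entry.getKey();
            }
        }
        return best;
    }

    // Return a word with the smallest count (arbitrary choice on ties).
    static String leastCommon(Map<String, Integer> counts) {
        String best = null;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (best == null || entry.getValue() < counts.get(best)) {
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical input; the real program reads words from args[0].
        List<String> words = Arrays.asList("the", "art", "war", "the", "art", "the");
        Map<String, Integer> counts = countWords(words);
        String most = mostCommon(counts);
        String least = leastCommon(counts);
        System.out.println("Total unique words: " + counts.keySet().size());
        System.out.println("Most common word \"" + most + "\" used " + counts.get(most) + " times");
        System.out.println("Least common word \"" + least + "\" used " + counts.get(least) + " time(s)");
    }
}
```

Because many words are usually tied for the lowest count, the least common word your program reports may differ from someone else's and still be correct.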

Make sure you ignore case (Hello and hello are the same word).
Make sure you only count words. Words are composed of letters and apostrophes ('). (Edited: a hyphen (-) is no longer a valid character in a word.)
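
One possible way to apply these two rules is sketched below. It assumes that a valid word is a token made up entirely of the letters a-z and apostrophes after lower-casing; that reading of the rule, the class name NormalizeSketch, and the helpers normalize and isWord are assumptions made for this example.

```java
import java.util.Locale;

class NormalizeSketch {
    // Lower-case a token so that "Hello" and "hello" become the same key.
    static String normalize(String token) {
        return token.toLowerCase(Locale.ROOT);
    }

    // Assumption: a word is one or more characters, each of which is
    // a lowercase letter a-z or an apostrophe.
    static boolean isWord(String token) {
        return token.matches("[a-z']+");
    }

    public static void main(String[] args) {
        String[] samples = { "Hello", "hello", "don't", "well-known", "HTTP1" };
        for (String sample : samples) {
            String word = normalize(sample);
            System.out.println(sample + " -> " + word + " (counted: " + isWord(word) + ")");
        }
    }
}
```

Normalizing before the map lookup means the map only ever holds lower-case keys, so no extra case handling is needed when printing the results.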

Sample input files:
The War in the Air by H. G. Wells – warair.txt
The Art of War by Sun Tzu – artofwar.txt
RFC-2616 Hypertext Transfer Protocol (HTTP/1.1) – rfc2616.txt

The files needed are here: https://github.com/ieee8023/cs210-summer2014

Sample HashMap Usage:

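import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// make a new map from word to count
Map<String, Integer> map = new HashMap<String, Integer>();
// sample string to count
String word = "hello";
// get the existing count, or null if the word has not been seen yet
Integer count = map.get(word);

// if null then initialize
if (count == null)
	count = 0;
// add one to count
count = count + 1;
// replace or add value
map.put(word, count);

// get all the keys
Set<String> keys = map.keySet();

// make the set into a list
List<String> keylist = new ArrayList<String>(keys);

// get all the values
Collection<Integer> values = map.values();

// make the collection into a list
List<Integer> valuelist = new ArrayList<Integer>(values);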

To configure your Scanner to produce tokens that contain only letters, use the following code:

Scanner in = new Scanner(file);
in.useDelimiter("[ \t\n\r\f,.()/_?!-;:&%@\"]+");
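
To see what a delimiter pattern does, it can help to run a Scanner over a short String instead of a file. The sketch below uses a simpler pattern, "[^A-Za-z']+", which treats every run of characters that are not letters or apostrophes as a separator. That pattern is an assumption made for this example and is not equivalent to the delimiter given above, so counts produced with it may not match the sample runs. The class name TokenSketch and the helper tokens are also made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class TokenSketch {
    // Split text into tokens made only of letters and apostrophes.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<String>();
        Scanner in = new Scanner(text);
        // Assumed delimiter: one or more characters that are
        // neither a letter nor an apostrophe.
        in.useDelimiter("[^A-Za-z']+");
        while (in.hasNext()) {
            result.add(in.next());
        }
        in.close();
        return result;
    }

    public static void main(String[] args) {
        // prints [Hello, world, It's, a, well, known, fact]
        System.out.println(tokens("Hello, world! It's a well-known fact."));
    }
}
```

Note that the hyphen splits "well-known" into two tokens, which is consistent with the rule that a hyphen is not part of a word.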

A sample run of the program:

$ java WordCount artofwar.txt 
Total unique words: 2183
Most common word "the" used 697 times
Least common word "omens" used 1 time(s)
Length 1: 288
Length 2: 2161
Length 3: 2243
Length 4: 1736
Length 5: 1135
Length 6: 944
Length 7: 784
Length 8: 538
Length 9: 357
Length 10: 222
Length 11: 118
Length 12: 61
Length 13: 29
Length 14: 8
Length 15: 5
Length 19: 1


$ java WordCount warair.txt 
Total unique words: 10072
Most common word "the" used 6503 times
Least common word "encased" used 1 time(s)
Length 1: 4507
Length 2: 15753
Length 3: 23732
Length 4: 17943
Length 5: 10903
Length 6: 8211
Length 7: 6768
Length 8: 4752
Length 9: 3238
Length 10: 2157
Length 11: 914
Length 12: 507
Length 13: 232
Length 14: 106
Length 15: 38
Length 16: 5
Length 17: 3
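
The "Length N: count" lines in the runs above correspond to step 5. A minimal sketch of that step is shown below on a hard-coded list of words. It assumes every occurrence of a word is counted, not just each unique word once, which is what the sample totals suggest. The class name LengthSketch and the helpers lengthCounts and report are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LengthSketch {
    // Map each word length to the number of words seen with that length.
    static Map<Integer, Integer> lengthCounts(List<String> words) {
        Map<Integer, Integer> lengths = new HashMap<Integer, Integer>();
        for (String word : words) {
            Integer count = lengths.get(word.length());
            if (count == null) {
                count = 0;
            }
            lengths.put(word.length(), count + 1);
        }
        return lengths;
    }

    // Build one "Length N: count" line per length, in increasing order of N.
    static List<String> report(Map<Integer, Integer> lengths) {
        List<Integer> keys = new ArrayList<Integer>(lengths.keySet());
        Collections.sort(keys);
        List<String> lines = new ArrayList<String>();
        for (Integer length : keys) {
            lines.add("Length " + length + ": " + lengths.get(length));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical input; the real program reads words from args[0].
        List<String> words = Arrays.asList("the", "art", "of", "war", "a");
        for (String line : report(lengthCounts(words))) {
            System.out.println(line);
        }
    }
}
```

Using a TreeMap instead of a HashMap would keep the keys in sorted order automatically, so the explicit sort would not be needed.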

Grading (total 10 points):

Turn in the following files: WordCount.java, memo.txt

2 points: Total unique words
2 points: Most common word
2 points: Least common word
2 points: Distribution of word lengths
2 points: memo.txt, easy to grade.