Due: Monday 7/14/14 @5:30pm via users.cs.umb.edu:cs210/hw7
Purpose:
Gain experience using a Map (aka Symbol Table aka Association List)
1. Write a program WordCount that takes a file as an argument (as args[0])
2. Read the file and gather statistics to be printed (using Java's HashMap or TreeMap)
3. Find the total unique words. Add them to a Map and then count the size of the keyset.
4. Store a count for each word and increment it every time you see that word. Hint: get() the count and then put() an updated count. Use these counts to find the most and least common words.
5. Keep track of the length of words using another Map. Use the length as a key and increment the number of words of that length. Sort the keyset for the map and then get the values when iterating over the sorted keyset. You can use new ArrayList(map.keySet()) to get an ArrayList of the keys. You can also use Collections.sort to sort the ArrayList.
6. Write a memo.txt describing any problems you had during this assignment and what you learned.
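Steps 3 through 5 can be sketched as follows. This is only an illustration under stated assumptions, not a reference solution: the class name WordStatsSketch and its helper methods are hypothetical, and a small in-memory list stands in for the words read from the file. Ties for most or least common word are broken by whatever order the map happens to iterate in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names are illustrative, not part of the assignment.
public class WordStatsSketch {

    // Steps 3 and 4: map each word to the number of times it occurs.
    static Map<String, Integer> countWords(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            Integer count = counts.get(word);   // null if not seen yet
            if (count == null) count = 0;
            counts.put(word, count + 1);
        }
        return counts;
    }

    // Key with the largest count (first one found wins a tie).
    static String mostCommon(Map<String, Integer> counts) {
        String best = null;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (best == null || e.getValue() > counts.get(best)) best = e.getKey();
        }
        return best;
    }

    // Key with the smallest count (first one found wins a tie).
    static String leastCommon(Map<String, Integer> counts) {
        String best = null;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (best == null || e.getValue() < counts.get(best)) best = e.getKey();
        }
        return best;
    }

    // Step 5: map each word length to the number of words of that length.
    static Map<Integer, Integer> countLengths(List<String> words) {
        Map<Integer, Integer> lengths = new HashMap<>();
        for (String word : words) {
            Integer count = lengths.get(word.length());
            if (count == null) count = 0;
            lengths.put(word.length(), count + 1);
        }
        return lengths;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("the", "art", "of", "war", "the", "war", "the");

        Map<String, Integer> counts = countWords(words);
        System.out.println("Total unique words: " + counts.keySet().size());
        String most = mostCommon(counts);
        System.out.println("Most common word \"" + most + "\" used " + counts.get(most) + " times");

        // Sort the lengths before printing, as step 5 suggests.
        Map<Integer, Integer> lengths = countLengths(words);
        List<Integer> sortedLengths = new ArrayList<>(lengths.keySet());
        Collections.sort(sortedLengths);
        for (Integer length : sortedLengths) {
            System.out.println("Length " + length + ": " + lengths.get(length));
        }
    }
}
```

Using a TreeMap for the length map would keep the keys sorted and make the explicit Collections.sort step unnecessary; the HashMap version is shown because it follows the hint in step 5 more literally.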
Notes:
Make sure you ignore case (Hello and hello are the same word).
Make sure you only count words. Words are composed of letters and the apostrophe ('). (Edited: the hyphen (-) is no longer a valid character in a word.)
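One way to apply both notes is to lowercase each token and then check its characters before counting it. The sketch below is an assumption about how that could look: WordFilterSketch, isWord and normalize are hypothetical names, and the apostrophe is taken to be the plain ASCII ' character.

```java
import java.util.Locale;

// Hypothetical sketch; names are illustrative, not part of the assignment.
public class WordFilterSketch {

    // True if the token is non-empty and contains only letters or apostrophes.
    static boolean isWord(String token) {
        if (token.isEmpty()) return false;
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (!Character.isLetter(c) && c != '\'') return false;
        }
        return true;
    }

    // Fold case so that "Hello" and "hello" become the same map key.
    static String normalize(String token) {
        return token.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String[] tokens = {"Hello", "hello", "don't", "HTTP/1.1", "2616"};
        for (String token : tokens) {
            String word = normalize(token);
            if (isWord(word)) {
                System.out.println(word);   // prints hello, hello, don't
            }
        }
    }
}
```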
Sample input files:
The War in the Air by H. G. Wells – warair.txt
The Art of War by Sun Tzu – artofwar.txt
RFC-2616 Hypertext Transfer Protocol (HTTP/1.1) – rfc2616.txt
The files needed are here: https://github.com/ieee8023/cs210-summer2014
Sample HashMap Usage:
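import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// make a new map from words to counts
Map<String, Integer> map = new HashMap<>();
// sample string to count
String word = "hello";
// get the existing count, or null if the word is not in the map yet
Integer count = map.get(word);
// if null then initialize
if (count == null) count = 0;
// add one to count
count++;
// replace or add value
map.put(word, count);
// get all the keys
Set<String> keys = map.keySet();
// make the set into a list
List<String> keylist = new ArrayList<>(keys);
// get all the values
Collection<Integer> values = map.values();
// make the collection into a list
List<Integer> valuelist = new ArrayList<>(values);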
To configure your Scanner to produce tokens that contain only letters, use the following code:
Scanner in = new Scanner(file);
in.useDelimiter("[ \t\n\r\f,.()/_?!-;:&%@\"]+");
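To see what tokens this delimiter produces, it can help to run it on a short string before pointing it at a file. The sketch below assumes the whitespace characters in the pattern are the escapes \t, \n, \r and \f and that the double quote is escaped as \"; the class name ScannerDelimiterSketch and the tokenize helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical sketch; names are illustrative, not part of the assignment.
public class ScannerDelimiterSketch {

    // Delimiter pattern from the assignment, with escapes assumed as described above.
    static final String DELIMITERS = "[ \t\n\r\f,.()/_?!-;:&%@\"]+";

    // Split the text into tokens using the delimiter pattern.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Scanner in = new Scanner(text);
        in.useDelimiter(DELIMITERS);
        while (in.hasNext()) {
            tokens.add(in.next());
        }
        in.close();
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Hello, world, HTTP]
        System.out.println(tokenize("Hello, world! (HTTP/1.1)"));
    }
}
```

Inside a character class, !-; is read as a range from ! to ;, which also covers the digits and the apostrophe. With the pattern as written, "don't" is therefore split into "don" and "t"; whether that is intended is not stated here, so check your tokens against the Notes above.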
A sample run of the program:
$ java WordCount artofwar.txt
==============
Total unique words: 2183
Most common word "the" used 697 times
Least common word "omens" used 1 time(s)
==============
Length 1: 288
Length 2: 2161
Length 3: 2243
Length 4: 1736
Length 5: 1135
Length 6: 944
Length 7: 784
Length 8: 538
Length 9: 357
Length 10: 222
Length 11: 118
Length 12: 61
Length 13: 29
Length 14: 8
Length 15: 5
Length 19: 1
==============
and
$ java WordCount warair.txt
==============
Total unique words: 10072
Most common word "the" used 6503 times
Least common word "encased" used 1 time(s)
==============
Length 1: 4507
Length 2: 15753
Length 3: 23732
Length 4: 17943
Length 5: 10903
Length 6: 8211
Length 7: 6768
Length 8: 4752
Length 9: 3238
Length 10: 2157
Length 11: 914
Length 12: 507
Length 13: 232
Length 14: 106
Length 15: 38
Length 16: 5
Length 17: 3
==============
Grading (total 10 points):
Turn in the following files: WordCount.java, memo.txt
2 points: Total unique words
2 points: Most common word
2 points: Least common word
2 points: Distribution of word lengths
2 points: memo.txt, easy to grade.