Assignment 3 (100 Points) - Due Friday, November 04, 11:59PM

Announcements and Clarifications

October 27: The submission date for Assignment 2 has been revised to Friday, November 04, 11:59PM.

Alphabetized Duplicate Word Counter

Create a dupes program capable of efficiently determining which words within an input file appear more than once. The program should output an alphabetized list of those words, each with a count of how many times it appears. The program should also output statistics of the binary search tree utilized within the program.

Commandline Arguments

Your program must be capable of utilizing commandline arguments to specify the input file and the output file.

dupes inputFile outputFile


Your program must ensure the user has correctly provided the required commandline arguments and display a usage statement if the provided arguments are incorrect.
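As a minimal sketch of the argument check described above, the helper below accepts exactly two arguments after the program name and otherwise prints a usage statement. The function name check_args and the choice to print to stderr are assumptions made for illustration; the assignment does not prescribe either.

```c
#include <stdio.h>

/* Hypothetical helper (name not from the assignment).
 * Returns 1 when exactly two arguments (input file and output file)
 * follow the program name. Otherwise prints a usage statement to
 * stderr and returns 0. */
int check_args(int argc, char *argv[])
{
    if (argc != 3) {
        /* Fall back to "dupes" if the program name is unavailable. */
        const char *prog = (argc > 0 && argv[0] != NULL) ? argv[0] : "dupes";
        fprintf(stderr, "Usage: %s inputFile outputFile\n", prog);
        return 0;
    }
    return 1;
}
```

A main function could call check_args(argc, argv) first and return EXIT_FAILURE when it reports 0, before attempting to open either file.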

Input Text File

For this assignment, the input text file will consist only of uppercase and lowercase letters ('a' to 'z', 'A' to 'Z'), commas (','), and periods ('.'). Each word within the text file will be separated by at least one whitespace character (' ', '\t', '\n'). One comma or period may appear at the end of each word, but is not considered part of the word.

Note: Text files on Windows based computers use a carriage return ('\r') and newline ('\n') at the end of each line. On Unix machines (such as the ece3 server), only the newline is used. For testing your program, you should utilize Unix formatted text files without the carriage return. If you edit your files using a Windows based program (such as Notepad), you may want to familiarize yourself with the dos2unix command available on the ece3 server.
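The word rules in this section could be applied roughly as follows: read one whitespace-delimited token at a time, then drop at most one trailing comma or period. This is only a sketch. The names MAX_WORD, strip_trailing_punct, and read_word are hypothetical, and the 255-character limit on word length is an assumption that the assignment does not state.

```c
#include <stdio.h>
#include <string.h>

/* Assumed upper bound on word length (including the terminator);
 * this limit is NOT given in the assignment. */
#define MAX_WORD 256

/* Remove at most one trailing ',' or '.' from word, in place. */
void strip_trailing_punct(char *word)
{
    size_t len = strlen(word);
    if (len > 0 && (word[len - 1] == ',' || word[len - 1] == '.')) {
        word[len - 1] = '\0';
    }
}

/* Read the next whitespace-delimited word from fp into buf, which
 * must hold at least MAX_WORD bytes. Returns 1 on success and 0 when
 * no further word can be read (end of file or read error). */
int read_word(FILE *fp, char *buf)
{
    /* %255s skips leading whitespace and stops at the next whitespace
     * character; the width keeps the read within MAX_WORD bytes. */
    if (fscanf(fp, "%255s", buf) != 1) {
        return 0;
    }
    strip_trailing_punct(buf);
    return 1;
}
```

A reading loop could then call read_word repeatedly until it returns 0, passing each cleaned word on to the tree.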

Binary Search Tree Data Structure

For keeping track of the individual words and how many times they are used within the input text file, you must implement a binary search tree. A template for the struct and typedef definitions for a minimal binary search tree is given below. Note that you may need to modify these data structures for the assignment.

typedef struct BiTreeNode_ {
    char *word;
    int word_count;
    struct BiTreeNode_ *right;
    struct BiTreeNode_ *left;
} BiTreeNode;

typedef struct BisTree_ {
    BiTreeNode *head;
} BisTree;

The functionality specific to each of these structures should be implemented within their own set of C source and header files.

Note 1: If your program does not use a Binary Search Tree as indicated in the program assignment, you will receive 0 points for the assignment.

Note 2: A small portion of your grade will be based on your ability to keep your binary tree balanced, for example by implementing your tree as an AVL tree.
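For readers unfamiliar with AVL trees, the sketch below shows the core rebalancing step: two rotations plus a function that restores balance at a single node. It assumes a height field has been added to the node template (the field is hypothetical and not part of the template), and the function names are illustrative. A full AVL insert would call rebalance on each node along the path back up from the insertion point.

```c
#include <stddef.h>

/* Node template plus an ASSUMED height field (not in the original
 * template): height of the subtree rooted here, with a leaf = 1. */
typedef struct BiTreeNode_ {
    char *word;
    int word_count;
    int height;
    struct BiTreeNode_ *right;
    struct BiTreeNode_ *left;
} BiTreeNode;

static int node_height(const BiTreeNode *n)
{
    return n != NULL ? n->height : 0;
}

static void update_height(BiTreeNode *n)
{
    int lh = node_height(n->left);
    int rh = node_height(n->right);
    n->height = 1 + (lh > rh ? lh : rh);
}

/* Rotate right around y; y's left child becomes the subtree root. */
static BiTreeNode *rotate_right(BiTreeNode *y)
{
    BiTreeNode *x = y->left;
    y->left = x->right;
    x->right = y;
    update_height(y);   /* y is now below x, so update it first */
    update_height(x);
    return x;
}

/* Rotate left around x; x's right child becomes the subtree root. */
static BiTreeNode *rotate_left(BiTreeNode *x)
{
    BiTreeNode *y = x->right;
    x->right = y->left;
    y->left = x;
    update_height(x);
    update_height(y);
    return y;
}

/* Restore the AVL property at n, assuming both subtrees are already
 * balanced. Returns the (possibly new) root of this subtree. */
BiTreeNode *rebalance(BiTreeNode *n)
{
    int balance;

    update_height(n);
    balance = node_height(n->left) - node_height(n->right);
    if (balance > 1) {
        if (node_height(n->left->left) < node_height(n->left->right)) {
            n->left = rotate_left(n->left);     /* left-right case */
        }
        return rotate_right(n);
    }
    if (balance < -1) {
        if (node_height(n->right->right) < node_height(n->right->left)) {
            n->right = rotate_right(n->right);  /* right-left case */
        }
        return rotate_left(n);
    }
    return n;
}
```

Storing the height in each node keeps the balance check constant-time per node, at the cost of one extra int per word.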

Word Counting

As each word is read from the input text file, your program (specifically your binary search tree) should keep track of the number of times each word is utilized. While the input file may contain both uppercase and lowercase characters, the identification of unique words is case insensitive. For example, "Party", "party", and "PARty" are all considered the same word.
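One way to get case-insensitive counting is to lowercase each word before it reaches the tree, so that a plain strcmp decides both ordering and equality. The sketch below takes that approach. The names to_lowercase and bistree_insert are hypothetical, and for brevity the insert does no rebalancing.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Template structures, repeated so the sketch compiles on its own. */
typedef struct BiTreeNode_ {
    char *word;
    int word_count;
    struct BiTreeNode_ *right;
    struct BiTreeNode_ *left;
} BiTreeNode;

typedef struct BisTree_ {
    BiTreeNode *head;
} BisTree;

/* Lowercase word in place so "Party", "party", and "PARty" match. */
void to_lowercase(char *word)
{
    for (; *word != '\0'; word++) {
        *word = (char)tolower((unsigned char)*word);
    }
}

/* Hypothetical insert: if the (already lowercased) word is present,
 * increment its count; otherwise attach a new leaf with count 1.
 * No rebalancing is performed in this sketch.
 * Returns 0 on success and -1 if memory allocation fails. */
int bistree_insert(BisTree *tree, const char *word)
{
    BiTreeNode **link = &tree->head;
    BiTreeNode *node;

    while (*link != NULL) {
        int cmp = strcmp(word, (*link)->word);
        if (cmp == 0) {
            (*link)->word_count++;
            return 0;
        }
        link = (cmp < 0) ? &(*link)->left : &(*link)->right;
    }

    node = malloc(sizeof *node);
    if (node == NULL) {
        return -1;
    }
    node->word = malloc(strlen(word) + 1);
    if (node->word == NULL) {
        free(node);
        return -1;
    }
    strcpy(node->word, word);
    node->word_count = 1;
    node->left = NULL;
    node->right = NULL;
    *link = node;
    return 0;
}
```

Normalizing case once at read time means the tree never has to compare mixed-case strings, and the stored word is already in the lowercase form used for output.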

Alphabetized Duplicate Word Reporting

Once the input file has been completely read, your program should write an alphabetized list of all words that appeared more than once to the specified output file. The output should include one word per line. Each line should include the word followed by a space, followed by the number of times the word appeared in the input file in parentheses. The following provides a sample of the output file format that should be generated:

believe (2)
choose (3)
i (2)
programmed (5)
to (2)
was (2)
what (2)
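Because a binary search tree keeps smaller keys to the left, an in-order traversal visits the words in alphabetical order, which makes the report above straightforward to produce. The following is a sketch under the assumption that words were stored in lowercase and ordered with strcmp; the function name write_duplicates is hypothetical.

```c
#include <stdio.h>

/* Node template, repeated so the sketch compiles on its own. */
typedef struct BiTreeNode_ {
    char *word;
    int word_count;
    struct BiTreeNode_ *right;
    struct BiTreeNode_ *left;
} BiTreeNode;

/* In-order traversal: left subtree, then this node, then right
 * subtree. Only words seen more than once are written, one per line,
 * in the form "word (count)". */
void write_duplicates(const BiTreeNode *node, FILE *out)
{
    if (node == NULL) {
        return;
    }
    write_duplicates(node->left, out);
    if (node->word_count > 1) {
        fprintf(out, "%s (%d)\n", node->word, node->word_count);
    }
    write_duplicates(node->right, out);
}
```

The recursion depth equals the height of the tree, so a balanced tree keeps the call stack shallow even for large input files.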

Binary Search Tree Statistics

Finally, your program should output the following statistics for your binary search tree structure.

Number of Words within Tree: 10054
Minimum Depth of Tree: 30
Maximum Depth of Tree: 35
Average Depth of Tree: 32
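The assignment does not define how these depths are measured, so the sketch below makes its assumptions explicit: the root is at depth 1, the minimum and maximum depths are those of the shallowest and deepest leaves, and the average is the integer mean of all leaf depths. If the course uses different definitions, the same single traversal can be adapted. The TreeStats type and both function names are hypothetical.

```c
#include <stdio.h>

/* Node template, repeated so the sketch compiles on its own. */
typedef struct BiTreeNode_ {
    char *word;
    int word_count;
    struct BiTreeNode_ *right;
    struct BiTreeNode_ *left;
} BiTreeNode;

/* Hypothetical accumulator for one traversal of the tree. */
typedef struct TreeStats_ {
    long node_count;      /* number of distinct words (nodes) */
    long leaf_count;      /* number of leaves */
    long leaf_depth_sum;  /* sum of the depths of all leaves */
    int min_depth;        /* shallowest leaf; 0 until a leaf is seen */
    int max_depth;        /* deepest leaf */
} TreeStats;

/* Visit every node once. ASSUMPTION: the caller passes depth 1 for
 * the root, and depth statistics are taken over leaves only. */
void collect_stats(const BiTreeNode *node, int depth, TreeStats *s)
{
    if (node == NULL) {
        return;
    }
    s->node_count++;
    if (node->left == NULL && node->right == NULL) {
        s->leaf_count++;
        s->leaf_depth_sum += depth;
        if (s->min_depth == 0 || depth < s->min_depth) {
            s->min_depth = depth;
        }
        if (depth > s->max_depth) {
            s->max_depth = depth;
        }
    }
    collect_stats(node->left, depth + 1, s);
    collect_stats(node->right, depth + 1, s);
}

/* Print the four statistics in the format shown in the sample. */
void print_stats(const BiTreeNode *root, FILE *out)
{
    TreeStats s = {0, 0, 0, 0, 0};
    long avg;

    collect_stats(root, 1, &s);
    avg = (s.leaf_count > 0) ? s.leaf_depth_sum / s.leaf_count : 0;
    fprintf(out, "Number of Words within Tree: %ld\n", s.node_count);
    fprintf(out, "Minimum Depth of Tree: %d\n", s.min_depth);
    fprintf(out, "Maximum Depth of Tree: %d\n", s.max_depth);
    fprintf(out, "Average Depth of Tree: %ld\n", avg);
}
```

Gathering all four values in one pass avoids walking the tree separately for each statistic.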

Extra Credit Challenge File (2%)

An Extra Credit Input File (also provided as a Compressed Extra Credit Input File) will be used in the grading of your assignment. The five correctly functioning submissions with the fastest execution times for the file will each receive 2% extra credit. Execution time will be measured as the user time reported on the ece3 server by the following command:

time ./src/dupes remembrance_of_things_past.txt ec_output.txt

The top five assignments will be posted to this webpage after grading is complete. If you would prefer not to be recognized publicly, please indicate so in your README file.

And the Top Five Fastest Submissions Are:
Travis D - 3.4s
Marcos Z - 5.6s
Christopher S - 5.9s
David S - 6.1s
Casey M - 6.1s