Assignment 4 (150 Points) - Due Tuesday, November 13, 11:59PM

Announcements

Dec 10: Assignment 4 test file and output file: Assignment4_Testfile_Outputfile.zip

Oct 30: A draft Assignment 4 Rubric is available. Please note that minor changes to the grading rubric may be made during the grading process.

Nov 08: Sample output file: A4_sample_output (updated 12:44 PM 11/8/2012; startfile: "testfile1", endfile: "testfile2").

Nov 07: The difference in rank should be calculated as the word's start rank minus the word's end rank. For example, if the rank of a word in the end file is 20, and the rank of the word in the start file is 10, then the difference in rank is -10 (10 - 20 = -10). In other words, the word dropped (or fell) ten ranks.

Nov 01: You may use C++ for Assignment 4, but you are NOT allowed to use the standard template library (STL).

Ranking Trends

In this assignment, you will create an application that analyzes how frequently words occur within two input files, a start file and an end file. The application will further analyze the change in rank of each word from the start file to the end file, where the highest ranked word (i.e. the word ranked in position 1) appears most frequently. Finally, the program will output a ranked list of all words appearing in the end file along with the relative change in rank for each word from the start file to the end file.

Note: This program is motivated by http://whatstrending.com, which monitors Twitter feeds to see what topics are increasing and decreasing in popularity.

Commandline Arguments

Your program must accept commandline arguments that specify the start file, the end file, and the rank output file, in that order:

trending startFile endFile rankOutputFile


Your program must ensure the user has correctly provided the required commandline arguments and display a usage statement if the provided arguments are incorrect.
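The argument check described above could be sketched as follows. This is only a minimal illustration: the helper name check_args, the EXPECTED_ARGC constant, and the exact wording of the usage statement are assumptions, not requirements of the assignment.

```c
#include <stdio.h>

/* Expected number of argv entries: the program name plus three file names.
   (EXPECTED_ARGC is a hypothetical name used only in this sketch.) */
#define EXPECTED_ARGC 4

/* Returns 1 when the argument count matches
       trending startFile endFile rankOutputFile
   and 0 otherwise. On a mismatch, a usage statement is written to err.
   The helper name and message wording are assumptions. */
int check_args(int argc, const char *progname, FILE *err)
{
    if (argc == EXPECTED_ARGC)
        return 1;

    fprintf(err, "Usage: %s startFile endFile rankOutputFile\n", progname);
    return 0;
}
```

A main() function could call check_args(argc, argv[0], stderr) first and exit with a non-zero status when it returns 0. Checking that each file can actually be opened would be a separate step.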

Input Text Files

For this assignment, the input text files will consist only of uppercase and lowercase letters ('a' to 'z', 'A' to 'Z') and whitespace. Each word within a text file will be separated from the next by at least one whitespace character (' ', '\t', '\n').

Note: Text files on Windows-based computers use a carriage return ('\r') and newline ('\n') at the end of each line. On Unix machines (such as the ece3 server), only the newline is used. For testing your program, you should utilize Unix-formatted text files without the carriage return. If you edit your files using a Windows-based program (such as Notepad), you may want to familiarize yourself with the dos2unix command available on the ece3 server.
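One possible way to pull whitespace-separated words out of an input file is sketched below. The function name read_word and its return convention are assumptions made for illustration. Because isspace() also treats '\r' as whitespace, this particular sketch happens to tolerate Windows line endings, although Unix-formatted files are still the recommended test input.

```c
#include <ctype.h>
#include <stdio.h>

/* Reads the next whitespace-delimited word from fp into buf.
   At most cap - 1 characters are stored (cap must be at least 1);
   any extra characters of an overlong word are read and discarded.
   Returns 1 if a word was read, 0 if end of file was reached first.
   (read_word is a hypothetical helper, not part of the assignment.) */
int read_word(FILE *fp, char *buf, size_t cap)
{
    int c;
    size_t len = 0;

    /* Skip leading whitespace: ' ', '\t', '\n', and also '\r'. */
    while ((c = fgetc(fp)) != EOF && isspace(c))
        ;
    if (c == EOF)
        return 0;

    /* Copy characters until the next whitespace character or EOF. */
    do {
        if (len + 1 < cap)
            buf[len++] = (char)c;
    } while ((c = fgetc(fp)) != EOF && !isspace(c));

    buf[len] = '\0';
    return 1;
}
```

Calling read_word() in a loop until it returns 0 visits every word in the file. A fixed-size buffer is used here only to keep the sketch short.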

Binary Search Tree Data Structure

For keeping track of the individual words and how many times they are used within the input text files, you must implement a binary search tree. A template for the struct and typedef definitions of a minimal binary search tree is given below. Note that you may need to modify these data structures for the assignment.

typedef struct BiTreeData_ {
    char *word;
    int start_word_count;
    int end_word_count;
} BiTreeData;

typedef struct BiTreeNode_ {
    BiTreeData  *data;
    struct BiTreeNode_ *right;
    struct BiTreeNode_ *left;
} BiTreeNode;

typedef struct BiTree_ {
    BiTreeNode *root;
    int size; // optional
} BiTree;

typedef BiTree BisTree;
typedef BiTreeNode BisTreeNode;
typedef BiTreeData BisTreeData;

The functionality specific to each of these structures should be implemented within their own set of C source and header files.

Note 1: If your program does not use a Binary Search Tree as indicated in the program assignment, you will receive 0 points for the assignment.

Note 2: You must implement the bistree_insert(), bistree_remove(), bistree_lookup(), and supporting binary tree operations for the assignment. Using the lazy removal method discussed in lecture and in the textbook is acceptable.
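To make the insert and lookup operations concrete, here is a minimal sketch built on the template structs above. The function signatures are assumptions: the assignment names bistree_insert() and bistree_lookup() but does not fix their parameters, and the versions from lecture or the textbook may differ. The tree here is unbalanced, is keyed on the word with strcmp(), and the sketch leaves out bistree_remove() and cleanup entirely.

```c
#include <stdlib.h>
#include <string.h>

typedef struct BiTreeData_ {
    char *word;
    int start_word_count;
    int end_word_count;
} BiTreeData;

typedef struct BiTreeNode_ {
    BiTreeData *data;
    struct BiTreeNode_ *right;
    struct BiTreeNode_ *left;
} BiTreeNode;

typedef struct BiTree_ {
    BiTreeNode *root;
    int size;
} BiTree;

/* Returns the data stored for word, or NULL if the word is not in the tree.
   (Assumed signature; words are compared with strcmp.) */
BiTreeData *bistree_lookup(const BiTree *tree, const char *word)
{
    const BiTreeNode *node = tree->root;

    while (node != NULL) {
        int cmp = strcmp(word, node->data->word);
        if (cmp == 0)
            return node->data;
        node = (cmp < 0) ? node->left : node->right;
    }
    return NULL;
}

/* Inserts word with both counts set to zero if it is not already present.
   Returns the (new or existing) data, or NULL if allocation fails.
   (Assumed signature and return convention.) */
BiTreeData *bistree_insert(BiTree *tree, const char *word)
{
    BiTreeNode **link = &tree->root;

    /* Walk down to the NULL link where the new node belongs. */
    while (*link != NULL) {
        int cmp = strcmp(word, (*link)->data->word);
        if (cmp == 0)
            return (*link)->data;
        link = (cmp < 0) ? &(*link)->left : &(*link)->right;
    }

    BiTreeNode *node = malloc(sizeof *node);
    BiTreeData *data = malloc(sizeof *data);
    char *copy = malloc(strlen(word) + 1);
    if (node == NULL || data == NULL || copy == NULL) {
        free(node);
        free(data);
        free(copy);
        return NULL;
    }

    strcpy(copy, word);
    data->word = copy;
    data->start_word_count = 0;
    data->end_word_count = 0;
    node->data = data;
    node->left = NULL;
    node->right = NULL;
    *link = node;
    tree->size++;
    return data;
}
```

Returning the existing entry when the word is already present lets the caller increment a count without a separate lookup. Whether to treat a duplicate insert as an error instead is a design choice left to your implementation.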

Word Counting

As each word is read from the input text file, your program (specifically your binary search tree) should keep track of the number of times each word occurs. While the input file may contain both uppercase and lowercase characters, the identification of unique words is case insensitive. For example, "Party", "party", and "PARty" are all considered the same word.

Hint: Look up tolower()

Your program will need to keep track of the number of times each word is found within both the start file and the end file. While these two counts are separate values, it is suggested they be maintained within the same binary tree structure.
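A short sketch of the two ideas in this section, case-insensitive matching and separate start/end counts, is shown below. The names lowercase_word, count_occurrence, and WordCounts are hypothetical; WordCounts simply stands in for the two counters already present in BiTreeData.

```c
#include <ctype.h>

/* Hypothetical stand-in for the two counters kept in BiTreeData. */
typedef struct {
    int start_word_count;
    int end_word_count;
} WordCounts;

/* Converts word to lowercase in place so that "Party" and "PARty"
   both become "party" before being looked up in the tree.
   The cast to unsigned char avoids undefined behavior in tolower()
   for negative char values. */
void lowercase_word(char *word)
{
    for (; *word != '\0'; word++)
        *word = (char)tolower((unsigned char)*word);
}

/* Records one occurrence of a word. in_end_file is 0 while reading
   the start file and non-zero while reading the end file. */
void count_occurrence(WordCounts *counts, int in_end_file)
{
    if (in_end_file)
        counts->end_word_count++;
    else
        counts->start_word_count++;
}
```

Normalizing each word once, before insertion, means the tree only ever stores lowercase keys, which also matches the lowercase output format required later.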

Word Ranking

Once all words have been processed from the start and end files, your program must determine the rank of each word within both the start file and the end file. The rank of a word is its position within a list of words sorted by decreasing frequency, with 1 being the highest rank (corresponding to the most frequent word). For example, the most frequently occurring word within the start file would have a rank of 1. If two words occur the same number of times, they should both be assigned the same rank, and the next word in sorted order should have the following rank. For example, if the words "bite" and "metal" are the most frequently occurring words and "my" is the next most frequently occurring word, then both "bite" and "metal" would have a rank of 1, and "my" would have a rank of 2.
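The tie rule above (tied words share a rank and the next distinct count gets the following rank) can be illustrated with the sketch below. It assumes the words and counts for one file have already been copied out of the tree into an array; the RankEntry struct and assign_ranks function are hypothetical names. Sorting uses qsort() from the C standard library, which is not part of the STL. Words with a count of zero (absent from the file) are assumed to have been left out of the array.

```c
#include <stdlib.h>

/* Hypothetical flat record for one word in one file. */
typedef struct {
    const char *word;
    int count;   /* occurrences in the file being ranked */
    int rank;    /* filled in by assign_ranks() */
} RankEntry;

/* qsort comparator: larger counts sort first. */
static int by_count_desc(const void *a, const void *b)
{
    const RankEntry *x = a;
    const RankEntry *y = b;
    return (x->count < y->count) - (x->count > y->count);
}

/* Sorts entries by decreasing count and assigns ranks starting at 1.
   Entries with equal counts receive the same rank, and the next
   distinct count receives the previous rank plus one. */
void assign_ranks(RankEntry *entries, size_t n)
{
    qsort(entries, n, sizeof *entries, by_count_desc);

    for (size_t i = 0; i < n; i++) {
        if (i == 0)
            entries[i].rank = 1;
        else if (entries[i].count == entries[i - 1].count)
            entries[i].rank = entries[i - 1].rank;
        else
            entries[i].rank = entries[i - 1].rank + 1;
    }
}
```

With counts of 5 for "bite", 5 for "metal", and 3 for "my", this gives "bite" and "metal" rank 1 and "my" rank 2, matching the example in the text. The order of tied entries after qsort() is unspecified, which is acceptable because the order of words with the same rank does not matter.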

Ranking Trends

For all words that appear in the end file, your program should output a ranked list of words, the rank position, and the change in rank between the start file and end file. Each word should be output to the specified rank output file with one word per line using the following format:

R: W (RD)

R is the rank of the word within the end file. If a tie exists for that position, the rank should be preceded by the character 'T'. The order of words with the same rank does not matter.

W is the word itself displayed using all lowercase characters.

RD is the difference in rank between the start file and the end file, calculated as the start rank minus the end rank. If a word did not appear in the start file, the rank difference should be reported as "(new)". If the rank did not change between the start file and the end file, the rank difference should be reported as "(+0)".

The following provides an example of the output file format:

T1: bite (+0)
T1: metal (+3)
2: my (+1)
3: shiny (-1)
4: please (new)
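A line in this format could be produced with a helper along the following lines. The function name format_rank_line and the convention that a start rank of 0 means "not present in the start file" are assumptions made for this sketch. The "%+d" conversion always prints a sign, so an unchanged rank is printed as +0.

```c
#include <stdio.h>

/* Formats one output line of the form "R: W (RD)" into buf.
   tied is non-zero when another word shares end_rank, in which
   case the rank is prefixed with 'T'.
   start_rank == 0 is an assumed convention meaning the word did
   not appear in the start file, reported as "(new)".
   Returns the value returned by snprintf(). */
int format_rank_line(char *buf, size_t cap, const char *word,
                     int end_rank, int tied, int start_rank)
{
    const char *prefix = tied ? "T" : "";

    if (start_rank == 0)
        return snprintf(buf, cap, "%s%d: %s (new)", prefix, end_rank, word);

    /* Rank difference is start rank minus end rank. */
    return snprintf(buf, cap, "%s%d: %s (%+d)",
                    prefix, end_rank, word, start_rank - end_rank);
}
```

For instance, a word with start rank 2 and end rank 3 yields a difference of -1, as in the "shiny" line of the example. The resulting string can then be written to the rank output file followed by a newline.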