Assignment 3 (100 Points) - *EXTENDED* Due Saturday, October 27, 11:59AM

Announcements and Clarification

Nov 13: The top five (or six) students with the fastest (and correct) programs for this assignment are the following. Congrats!

1. Rumsey Christopher Stephen 0.468
2. Beeston Kymberly Diane 0.500
3. Smith Brian 0.554
T4. Ringle James Michael 0.635
T4. Martinez Albert 0.635
5. Sessions Alaric 0.655

Oct 25: CLARIFICATION: As defined in the assignment, each of the top five Trending Up words must have an increase in the number of times the word appears in the end file compared to the start file. Conversely, each of the top five Trending Down words must have a decrease. This implies that a word cannot appear in both lists, and a word that has the same count in the start and end files (i.e. a difference of 0) is neither trending up nor trending down.

Oct 25: Due to the course website being down briefly, the assignment submission due date has been extended to Saturday, October 27 at 11:59 AM. Late submissions will not be extended, and no further extensions will be given.

Oct 17: Sample output file: sample_output. (startfile: "extra_testfile1", endfile: "extra_testfile2")

Oct 17: A draft Assignment 3 Rubric is available. Please note that minor changes to the grading rubric may be made during the grading process.

Oct 16: Text file input description has been revised. You do not have to worry about commas or periods within the input files.

Oct 15: The due date for this assignment has been extended slightly to Friday, October 26.

Oct 15: When inserting a new node within your binary search tree, you will need to dynamically allocate both a new BiTreeData and a C string large enough to store the word. In other words, each BiTreeData will need to point to a dynamically allocated string.
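
As an illustration of the two allocations described above, here is a minimal sketch. The helper names bitreedata_create() and bitreedata_destroy() are hypothetical and not part of the assignment; the struct repeats the BiTreeData template from the Binary Search Tree Data Structure section so that the example compiles on its own.

```c
#include <stdlib.h>
#include <string.h>

/* Same layout as the BiTreeData template given in this assignment. */
typedef struct BiTreeData_ {
    char *word;
    int start_word_count;
    int end_word_count;
} BiTreeData;

/* Hypothetical helper: returns a new BiTreeData that owns a private copy
 * of word, or NULL if either allocation fails. */
BiTreeData *bitreedata_create(const char *word)
{
    BiTreeData *data = malloc(sizeof *data);
    if (data == NULL)
        return NULL;

    /* +1 leaves room for the terminating '\0'. */
    data->word = malloc(strlen(word) + 1);
    if (data->word == NULL) {
        free(data);
        return NULL;
    }
    strcpy(data->word, word);
    data->start_word_count = 0;
    data->end_word_count = 0;
    return data;
}

/* Hypothetical counterpart: releases the string first, then the struct. */
void bitreedata_destroy(BiTreeData *data)
{
    if (data != NULL) {
        free(data->word);
        free(data);
    }
}
```

Copying the word matters because the buffer used to read the input is typically reused for the next word; a BiTreeData that only stored a pointer to that buffer would see its word change.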

Oct 15: The following are the recommended declarations for the top-level insert, lookup, and remove functions for the binary search tree:

BisTreeNode* bistree_lookup(BisTree *tree, char *key);
int bistree_insert(BisTree *tree, BiTreeData* data);
int bistree_remove(BisTree *tree, char *key);


Oct 2: The following test files can be used for testing the functionality of your program: Assignment3_testfiles.tgz.

What's Trending

In this assignment, you will create an application that analyzes the frequency of the occurrence of words within two input files, a start file and an end file. The application will further analyze the change in frequency of specific words from the start file to the end file, to determine the increase or decrease in the word frequencies. Finally, the program will output the five words that have increased the most and the five words that have decreased the most from the first input file to the second input file.

Note: This program is motivated by http://whatstrending.com, which monitors Twitter feeds to see what topics are increasing and decreasing in popularity.

Commandline Arguments

Your program must accept two commandline arguments that specify the input files: the start file, followed by the end file.

trending startFile endFile


Your program must ensure the user has correctly provided the required commandline arguments and display a usage statement if the provided arguments are incorrect.
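
One possible way to perform this check is sketched below. The function name check_arguments() and the exact wording of the usage statement are assumptions made for illustration; the assignment only requires that some usage statement be displayed.

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 when exactly two file arguments follow the
 * program name; otherwise prints a usage statement to stderr and returns 0. */
int check_arguments(int argc, char *argv[])
{
    if (argc != 3) {
        /* argv[0] is normally the program name; fall back to "trending". */
        fprintf(stderr, "Usage: %s startFile endFile\n",
                argc > 0 ? argv[0] : "trending");
        return 0;
    }
    return 1;
}
```

A main() function could call check_arguments(argc, argv) first and exit with a non-zero status when it returns 0.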

Input Text Files

For this assignment, the input text file will only consist of the uppercase and lowercase characters ('a' to 'z', 'A' to 'Z'). Each word within the text file will be separated by at least one whitespace character (' ', '\t', '\n').

Note: Text files on Windows based computers use a carriage return ('\r') and newline ('\n') at the end of each line. On Unix machines (such as the ece3 server), only the newline is used. For testing your program, you should utilize Unix formatted text files without the carriage return. If you edit your files using a Windows based program (such as Notepad), you may want to familiarize yourself with the dos2unix command available on the ece3 server.
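
Because words are separated only by whitespace, one simple way to read them is fscanf() with a bounded %s conversion, sketched below. The helper name read_next_word() and the 255-character limit are assumptions; the assignment does not state a maximum word length.

```c
#include <stdio.h>

/* Assumed buffer size; the assignment does not give a maximum word length. */
#define WORD_BUF_SIZE 256

/* Hypothetical helper: reads the next whitespace-separated word from fp into
 * word. Returns 1 if a word was read and 0 at end of file.
 * The width 255 in the format must stay at WORD_BUF_SIZE - 1 so that the
 * terminating '\0' always fits. */
int read_next_word(FILE *fp, char word[WORD_BUF_SIZE])
{
    /* %s skips leading whitespace (' ', '\t', '\n') before each word. */
    return fscanf(fp, "%255s", word) == 1;
}
```

In the default C locale, %s also treats '\r' as whitespace, so this particular approach happens to tolerate Windows line endings, although testing with Unix-formatted files is still recommended.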

Binary Search Tree Data Structure

For keeping track of the individual words and how many times they are used within the input text files, you must implement a binary search tree. A template for the struct and typedef definitions for a minimal binary search tree is given below. Note that you may need to modify these data structures for the assignment.

typedef struct BiTreeData_ {
    char *word;
    int start_word_count;
    int end_word_count;
} BiTreeData;

typedef struct BiTreeNode_ {
    BiTreeData  *data;
    struct BiTreeNode_ *right;
    struct BiTreeNode_ *left;
} BiTreeNode;

typedef struct BiTree_ {
    BiTreeNode *root;
    int size; // optional
} BiTree;

typedef BiTree BisTree;
typedef BiTreeNode BisTreeNode;
typedef BiTreeData BisTreeData;

The functionality specific to each of these structures should be implemented within their own set of C source and header files.

Note 1: If your program does not use a Binary Search Tree as indicated in the program assignment, you will receive 0 points for the assignment.

Note 2: You must implement the bistree_insert(), bistree_remove(), bistree_lookup(), and supporting binary tree operations for the assignment. Using the lazy removal method discussed in lecture and in the textbook is acceptable.
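
To show how the recommended bistree_lookup() declaration fits the template structures, here is a minimal iterative sketch. It assumes that the tree is ordered by strcmp() on the word, that every node has a non-NULL data pointer, and that lazy removal is not in use (a lazily removed node would need an extra check). The typedefs are repeated so the example compiles on its own.

```c
#include <stddef.h>
#include <string.h>

typedef struct BiTreeData_ {
    char *word;
    int start_word_count;
    int end_word_count;
} BiTreeData;

typedef struct BiTreeNode_ {
    BiTreeData *data;
    struct BiTreeNode_ *right;
    struct BiTreeNode_ *left;
} BiTreeNode;

typedef struct BiTree_ {
    BiTreeNode *root;
    int size;
} BiTree;

typedef BiTree BisTree;
typedef BiTreeNode BisTreeNode;

/* Returns the node whose word equals key, or NULL if no such node exists. */
BisTreeNode *bistree_lookup(BisTree *tree, char *key)
{
    BisTreeNode *node = tree->root;

    while (node != NULL) {
        int cmp = strcmp(key, node->data->word);
        if (cmp == 0)
            return node;
        /* Smaller keys live in the left subtree, larger keys in the right. */
        node = (cmp < 0) ? node->left : node->right;
    }
    return NULL;
}
```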

Word Counting

As each word is read from the input text file, your program (specifically your binary search tree) should keep track of the number of times each word is utilized. While the input file may contain both uppercase and lowercase characters, the identification of unique words is case insensitive. For example, "Party", "party", and "PARty" are all considered the same word.

Hint: Look up tolower().
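
One way to apply the hint is to convert each word to lowercase before it is looked up or inserted, so that every spelling maps to the same key. The helper name word_to_lower() below is hypothetical.

```c
#include <ctype.h>

/* Hypothetical helper: converts word to lowercase in place. */
void word_to_lower(char *word)
{
    for (; *word != '\0'; word++) {
        /* Cast to unsigned char first: passing a negative char value to
         * tolower() is undefined behavior. */
        *word = (char)tolower((unsigned char)*word);
    }
}
```

With this normalization, "Party", "party", and "PARty" all become "party".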

Your program will need to keep track of the number of times each word is found within both the start file and the end file. While these two counts are separate values, it is suggested they be maintained within the same binary tree structure.

What's Trending

The program should determine the top five words that are Trending Up and the top five words that are Trending Down. The five words trending up are those words that have increased the most in frequency from the start file to the end file. The five words trending down are those words that have decreased the most in frequency from the start file to the end file.

The trending up and trending down words should be reported using the following format. If ties occur within these lists, the ordering does not matter.

Trending Up:
believe (+100)
choose (+90)
i (+80)
programmed (+40)
to (+2)

Trending Down:
programming (-1024)
in (-800)
C (-456)
is (-7)
hard (-1)
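
The sketch below shows one way to compute the change for a word and to produce a line in the format above. The helper names trend_delta() and format_trend() are hypothetical. The printf-family flag "+" always emits a sign, which yields "+100" for increases and "-1" for decreases.

```c
#include <stdio.h>

/* Hypothetical helper: change in count from the start file to the end file.
 * Positive means trending up, negative means trending down, and 0 means the
 * word belongs in neither list. */
int trend_delta(int start_word_count, int end_word_count)
{
    return end_word_count - start_word_count;
}

/* Hypothetical helper: writes "word (+N)" or "word (-N)" into out and
 * returns the value of snprintf(), i.e. the length of the full line. */
int format_trend(char *out, size_t size, const char *word, int delta)
{
    return snprintf(out, size, "%s (%+d)", word, delta);
}
```

For example, a word seen 20 times in the start file and 120 times in the end file has a delta of +100 and would be reported as trending up.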

Extra Credit Challenge File (2%)

A set of Extra Credit Input Files will be used within the grading of your assignment. The five correctly functioning submissions with the fastest execution time for these files will receive 2% extra credit. Execution time will be measured using the user time on the ece3 server reported using the following command:

time ./src/trending EC_StartFile EC_EndFile

The top five assignments will be posted to this webpage after grading is complete. If you would prefer not to be recognized publicly, please indicate so in your README file.