辅导C/C++、Matlab语言讲解留学生、解析R、C/C++编程辅导、讲解回归

Programming Assignment 4 (PA4) - Spam Filter

Example Input Milestone Functions Unit Testing Extra Credit
Detailed Overview Post-Milestone Functions README Questions Turnin Summary

Assignment Overview

In this assignment you are writing a two part interactive program meant to check if an email is in a list of spam
addresses or not. The first part will read in email data files and populate hash tables with these email
addresses, then write out these populated tables into an output file. The second part will allow for user
interaction with the data. It will first read in the tables that were written to the output file from the first part and
will then allow the user to enter email addresses and check if they are spam. If the email address is found in all
of the tables then there is a high chance that the email is a spam email. In our program, if the email address is
found in all of the tables, then it is definitely a spam. This program is inspired by Bloom Filtering, which you
may want to read about to fully understand what you are trying to accomplish.

The educational purpose of this programming assignment is to provide you with more practice with dynamic
memory allocation, more complex file I/O, hashtables and separate-chaining hashtables.

We will try not to do too much hand-holding in this PA. You have some experience now and you need to learn
how to look up information yourself. You will not be given nearly as much help in the upper-division classes,
so now is the time to start doing more on your own.

Grading

● README: 10 points - See README Requirements here and questions below
○ http://cseweb.ucsd.edu/~ricko/CSE30READMEGuidelines.pdf
● Compiling: 5 points - Using our Makefile; no warnings. If what you turn in does not compile with the
given Makefile, you will receive 0 points for this assignment. NO EXCEPTIONS!
● Style. 10 points - See Style. Requirements here
○ http://cseweb.ucsd.edu/~ricko/CSE30StyleGuidelines.pdf
● Correctness: 75 points
○ Milestone (15 points) - To be distributed across the Milestone functions (see below)
○ Make sure you have all files tracked in Git.
● Extra Credit: 5 points - View Extra Credit section for more information.
● Wrong Language: You will lose 10 points for each module in the wrong language, C vs. Assembly or
vice versa.
NOTE: If what you turn in does not compile with given Makefile, you will receive 0 points for this assignment.

Back to Top
Getting Started

Follow these steps to acquire the starter files and prepare your Git repository.

Gathering Starter Files:
The first step is to gather all the appropriate files for this assignment. Connect to pi-cluster via ssh.
$ ssh

Create and enter the pa4 working directory.
$ mkdir ~/pa4
$ cd ~/pa4

Copy the starter files from the public directory.
$ cp -r ~/../public/pa4StarterFiles/* ~/pa4/

Copy over the following files from PA3. You are responsible for making sure these functions have good style.
and work correctly as they are required for this assignment.
$ cp ~/pa3/isInRange.s ~/pa4
$ cp ~/pa3/hash.s ~/pa4

Starter files provided:
pa4.h
test.h

Spam emails provided:

/emails/emails_
pa4Strings.h
testrevHash.c

/emails/emails_all
pa4Globals.c
Makefile

*num means the number of spam email addresses in the file
** You can assume no duplicates in the email list

Preparing Git Repository:
You are required to use Git with this and all future programming assignments. Refer to the PA0 writeup for how
to set up your local git repository.

Example Input

Two sample stripped executables provided for you to try and compare your output against are available in the
public directory. Note that you cannot copy them to your own directory; you can only run them using the
following commands (where you will also pass in the command line arguments):
$ ~/../public/pa4testCreate
$ ~/../public/pa4testSearch

NOTE:
1. The output of your program MUST match exactly as it appears in the pa4testCreate and
pa4testSearch output. You need to pay attention to everything in the output, from the order of the
error messages to the small things like extra newlines at the end (or beginning, or middle, or
everywhere)!
Back to Top
2. We are not giving you any sample outputs, instead you are provided some example inputs.
You are responsible for trying out all functionality of the program; the list of example inputs is
not exhaustive or complete. It is important that you fully understand how the program works
and you test your final solution thoroughly against the executable.

Usage for create:
Usage: ./create -i dataFile -o outFile [-s size] [-h]
-i/--infile dataFile -- The file containing the data
-o/--output outFile -- The file to output hash tables to
-s/--size size -- The size of the hash tables.
If not specified defaults to 5000.
-h/--help -- Displays this long usage message

Example input that has error output:
cs30xyz@pi-cluster-001:pa4$ ./create
cs30xyz@pi-cluster-001:pa4$ ./create -o -s
cs30xyz@pi-cluster-001:pa4$ ./create -s 1000000
cs30xyz@pi-cluster-001:pa4$ ./create -i emails/emails_1000 -s 10000
cs30xyz@pi-cluster-001:pa4$ ./search
cs30xyz@pi-cluster-001:pa4$ ./search -x
cs30xyz@pi-cluster-001:pa4$ ./search -i outfile -x -o

Usage for search:
Usage: ./search -i inFile [-x] [-h]
-i inFile -- The file containing the hash tables to search
-x -- Prints statistics for search, including linked list
and table collisions.
-h -- Displays this long usage message

Example input that has normal output:
cs30xyz@pi-cluster-001:pa4$ ./create -i emails/emails_10000 -s 10000 -o outfile
cs30xyz@pi-cluster-001:pa4$ ./create -i emails/emails_10000 -o outfile
cs30xyz@pi-cluster-001:pa4$ ./search -i outfile -x -h
cs30xyz@pi-cluster-001:pa4$ ./search -i outfile -x

Note: An easy way to check your output against the sample executable is by using diff:
cs30xyz@pi-cluster-001:pa4$ ./create -i emails/emails_10 -o myOutfile
cs30xyz@pi-cluster-001:pa4$ ~/../public/pa4testCreate -i emails/emails_10 -o refOutfile
cs30xyz@pi-cluster-001:pa4$ diff -s myOutfile refOutfile

Back to Top
Detailed Overview

The function prototypes for the various C and Assembly functions are as follows.

C routines:
int main( int argc, char * argv[] ); (in the file create.c)
int main( int argc, char * argv[] ); (in the file search.c)
void beginUserQuery(int counts[], table_t * htbl, table_t * rtbl,
table_t * eotbl, int statsOn);
int checkTables( char * str, table_t * htbl, table_t * rtbl, table_t * eotbl );
int evenOddHash( char * str );
void freeLinkedList( linkedList_t * head );
void * getFromTable( int hashVal, table_t * table );
void populateTables( table_t * htbl, table_t * rtbl, table_t * eotbl,
FILE * dataFile );
void pushToList( void ** head, char * str );
void readTables( FILE * outFile, table_t * htbl, table_t * rtbl,
table_t * eotbl );
void writeTables( FILE * outFile, table_t * htbl, table_t * rtbl,
table_t * eotbl );

Assembly routines:
int hash( char * str );
long isInRange( long value, long minRange, long maxRange );
int revHash( char * str );

For the Milestone, you will need to complete:
writeTables.c
pushToList.c
revHash.s
populateTables.c
evenOddHash.c

Process Overview:
The following is an explanation of the main tasks of the assignment, and how the individual functions work
together to form. the whole program. There are two diagrams for this program since there are two separate
main functions.

The first portion of this assignment is create, which will be used to read data and produce hash tables that get
written to a file.

Create workflow:
1. Use getopt_long() to parse all the command line arguments passed into the program. Do all the
required error checking.
2. Allocate space for the three different hash tables and begin populating the tables with data inside the
file specified by the user. *Any time during memory allocation, if an error occurs, free all allocated
memory and exit program
3. Write the populated tables into an output file which could be used to run search later.

The second portion of this assignment is search, which allows the user to check if an email is likely to be spam
or not.

Search workflow:
1. Use getopt() to parse all the command line arguments passed into the program. Do all the required
error checking.
2. Read in the different hash tables stored in the specified input file and recreate those tables in memory.
3. Begin the interactive user query mode that allows user to search the tables. If user specified -x for stats
mode, at the end of the query the total stats and collisions found will be printed. [Three collisions means
that there is a higher chance of the email being spam]
4. When the user ends the search with ctrl-d, free all allocated memory and exit the program.
Explanation about the hashtables
There are three hashtables:
● table 1(htbl) - Use hash() as the hash function. The void * at index i should be set to 1 if some
item hashed to index i
● table 2(rtbl) - Use revhash() as the hash function. The void * at index i should be set to 1 if
some item hashed to index i
● table 3(eotbl) - Use evenOddHash() as the hash function. This is a separate-chaining hash
table. When an item is hashed to index i, it should be pushed to the linked list pointed by the pointer
at index i

Memory Diagram of tables 1 and 2:
Milestone Functions to be Written
Listed below are the modules to be written for the milestone.

hash.s
int hash( char * str );

This function will be used to create the hash key of a string. It is the same hash.s you wrote in PA3, so you can
just reuse that file for this PA.

revHash.s
int revHash( char * str );

This function is similar to hash.s, the only difference is that to compute the hash key, it iterates through the
string in reverse order. So, instead of starting at the beginning of the string, you will start at the end and work
your way forward.

Return Value: The hash key of str.

evenOddHash.c
int evenOddHash( char * str );
This function is also similar to hash.s, the only difference is that to compute the hash key, it iterates through the
string first by by even and then by odd indices. So, loop through the even indices in the string first (i.e. str[0]
then str[2], etc.), then iterate over all odd indices (i.e. str[1] then str[3], etc.).

Return Value: The hash key of str.
Back to Top

pushToList.c
void pushToList( void ** head, char * str );

This function takes in the head of a linked list and an element to insert (a string) and then pushes the element
onto the front of the linked list. To implement this function, you need to allocate new memory for the string, so
that the list does not need to rely on the memory already allocated for the string passed in as an argument.

In order to insert a new element in the list first allocate space for the new head of the list (a struct of type
linkedList_t), then allocate space for a copy of the string that was passed in. After that, put the allocated
string with a copy of str into the newly allocated head, and finally attach the new head to the original list.

Reasons for error:
● Error with dynamically allocating memory. Free previously allocated memory, then call perror(),
passing in the macro MEM_ERR defined in pa4.h as the parameter. (perror() will print the correct
error message for you) Then return immediately.

populateTables.c
void populateTables( table_t * htbl, table_t * rtbl, table_t * eotbl,
FILE * dataFile );