CISC 5835300—Algorithms for Big Data
Fall, 2017
Programming Project: Sublinear Streaming
Due Date: Tuesday 12 December 2017
We have looked at sublinear-space streaming algorithms for two problems:
Periodically counting the number of events in a stream.
Counting the number of distinct items in a stream.
Of course, any algorithm running in sublinear space must necessarily produce an approximate answer, rather
than an exact answer.
Your task is to choose one of these problems and implement the appropriate approximation algorithm
for same.
Feel free to use whatever programming language you like. You may also work in teams of size three (or
less). I will let you self-organize. People with little programming experience should join a team containing
a person (or persons) with more experience; perhaps those with less experience can be more involved in
testing, rather than coding.
We discuss each of these programming tasks in turn.
Periodicallycountingthenumberofeventsinastream: You are to write a program named count-events
that uses the Morris++ algorithm to give an approximate count of the number of items in an input stream.
For the purposes of this assignment, you can assume that the input stream is a text file; each entry in the file
represents the amount of time (in seconds) since the previous event.
The command line to invoke this program should look like
count-events filename " [how often]
where filename is the name of the input file, " is the allowable error, is the failure probability, and
how often to report (the latter defaulting to one second).
To help you get started, the directory ~agw/class/big-data-alg/projects/count-events con-
tains the following:
events: a file of events
gen-events.cc: the C++ program that generated events
gen-events: the compiled version of gen-events.cc
count-events.cc: a C++ program that gives the exact count of an events file. It is invoked from
the command line as
count-events filename [how often]
My suggestion for how to tackle this would be to write functions named morris(), morris-plus(), and
morris-plus-plus() that implement the Morris algorithms. You should code and test them one at a time,
in that order. The amount of time each takes should be reported; if you’re using C++11 (or later), you can
follow what was done in count-events.cc. You’ll also need to generate random (floating-point) numbers;
if you’re using C++11 (or later), take a look at what was done in gen-events.cc.
Counting the number of distinct items in a stream: You are to write a program named count-items
that uses the Flagolet-Martin++ algorithm to give an approximate count of the number of distinct items in
an input stream. For the purposes of this assignment, you can assume that the input stream is a text file of
integers.
The command line to invoke this program should look like
count-distinct filename max value "
where filename is the name of the input file, max value is the maximum value appearing in the input file,
" is the allowable error, and is the failure probability.
To help you get started, the directory ~agw/class/big-data-alg/projects/count-distinct con-
tains the following:
items: a file of items
gen-items.cc: the C++ program that generated items
gen-items: the compiled version of gen-items.cc
count-distinct.cc: a C++ program that gives the exact number of distinct items in a file of items.
It is invoked from the command line as
count-distinct filename [max value]
where max value is the maximum value appearing in filename.
My suggestion for how to tackle this would be to write functions named flagolet-martin(),
flagolet-martin-plus(), and flagolet-martin-plus-plus() that implement the Flagolet-Martin
algorithms. You should code and test them one at a time, in that order. The amount of time each takes should
be reported; if you’re using C++11 (or later), you can follow what was done in count-distinct.cc. You’ll
also need to generate random (floating-point) numbers; if you’re using C++11 (or later), take a look at what
was done in the file gen-events.cc for the other assignment.
Deliverables: Submit the following:
A listing of the final version of the program that you wrote. You should follow the advice in my guide
to getting better grades, see http://www.dsm.fordham.edu/ agw/cs1/better-grades.pdf. Al-
though this guide is tailored to C++ programmers, the ideas contained therein apply to any program-
ming language you’re likely to use in this assignment. Note that the introductory comment block
should list all the team members as authors.
2
If you’re using a compiled language such as C++ (rather than an interpreted language such as Python),
evidence of correct compilation. So for a C++ program, this means that a command such as
g++ -o count-distinct count-distinct.cc -std=c++17
should produce no error messages; better yet would be for
g++ -o count-distinct count-distinct.cc -Wall -std=c++17
to produce no error messages or warnings.
Evidence that the program works. In other words, run the program on the test data provided, with
some value(s) of " and , and see what the program reports.
This means that you need to submit a clean typescript, as produced by the photo program (together with
photo-clean). If you don’t know about this, visit the page http://www.fordham.edu/cishelp and
read the entry on how to submit programming assignmens.
Good luck!