COMP 5434: Assignment 04
Due Date: 02 Dec 2020
Due Time: 23:59
Instructor: George Baciu
Title: BDML Finale
Total Marks: 25
Note: Max 2 people
Please note that a project team may have a maximum of two people.
Description
This will be your final machine learning (ML) deployment on your big data cluster,
built on virtual Ubuntu 20.04 servers and clients. The minimum resources necessary
for the deployment of a working client on the Hadoop cluster are (1) 16 GB of memory
(RAM); (2) 128 GB of hard disk storage; (3) a quad-core processor; and (4) a network
connection. Your project can be built on one or more client machines (udesk00, udesk01,
etc.) connected to the Hadoop cluster of three or more virtual servers (VMs) running on
the VirtualBox 6.1.16 hypervisor. The client machine(s) will act as the application
machine, which you will use to submit PySpark 3.0.1 jobs to the cluster and to operate on
files that you have stored in your HDFS (Hadoop 3.3.0) cluster. The client machines and
the servers will each have a minimum of 2 GB of RAM. The minimum disk storage is 64 GB
for the client, 8 GB for the master node, and 32 GB for each data node. The machines can
be udesk00, udesk01, …, hname10, sdata11, sdata12, sdata13, …, sdata19.
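As an illustration of the client-to-cluster workflow described above, here is a minimal sketch of a PySpark job that reads a file from HDFS. It is not part of the assignment specification. It assumes that hname10 is the NameNode, that the NameNode listens on port 8020, and that a CSV file exists at the placeholder path you pass in; the names hdfs_uri and run_job are hypothetical helpers. Adjust all of these to match your own cluster.

```python
# Minimal sketch only. Assumptions (not stated in the assignment):
#   - hname10 is the HDFS NameNode
#   - the NameNode RPC port is 8020 (check fs.defaultFS in your core-site.xml)
NAMENODE_HOST = "hname10"
NAMENODE_PORT = 8020


def hdfs_uri(path, host=NAMENODE_HOST, port=NAMENODE_PORT):
    """Build a fully qualified hdfs:// URI from an absolute HDFS path."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute, e.g. /user/you/data.csv")
    return f"hdfs://{host}:{port}{path}"


def run_job(input_path):
    """Read a CSV file from HDFS, then print its schema and row count."""
    # Imported here so hdfs_uri() can be used without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("comp5434-asg04-sketch").getOrCreate()
    try:
        df = spark.read.csv(hdfs_uri(input_path), header=True, inferSchema=True)
        df.printSchema()
        print("rows:", df.count())
    finally:
        # Release cluster resources even if the job fails.
        spark.stop()
```

Calling run_job("/user/student/data.csv") from a script launched with spark-submit on a client machine would run the job; the path shown is a placeholder, not a file provided by the course.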
The Problem
This assignment is your final part in completing your project in Big Data Machine
Learning. In this assignment you will implement, test, and verify the specifications of your
project according to what you have proposed in assignment 3. You will submit all your
runs and analysis as a project report. The analysis should have significant meaning and
relevance to Big Data Computing and Machine Learning. You may get your inspiration and
data from some of the most popular sites on the internet, such as the Kaggle big data
repository: https://www.kaggle.com/. You may also use your own data if you
wish, or move your platform to a cloud service such as AWS.
In this assignment, you will implement the solution to the problem that you proposed to
work on, and describe your analysis and findings with illustrations and graphs in your final
report. You can also describe how you could improve the results or discover new concepts
if you had more time. The implementation and analysis should follow your methodology.
The template of the report is found at:
http://www4.comp.polyu.edu.hk/~csgeorge/comp5434/asg/04/doc/comp5434-asg-04-report.docx
What to hand in
Assuming that your student ID is 12345678D, the submission should consist of four or
more files, where -nn is the two-digit number of your job run (-01, -02, -03, …):
1. 12345678d-pyspark-nn.py - your script for the job submitted, -01, -02, -03, …
2. 12345678d-output-nn.txt - output results; use -01, -02, -03 (one for each run)
3. 12345678d-logs-nn.txt - execution logs of your job runs, -01, -02, -03, ...
4. 12345678d-report.docx - report with graphs and charts of your analysis
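To check your file names before submitting, the naming scheme above can be sketched as a small helper. This is only an illustration: the function name submission_files is hypothetical, and it assumes one script, one output file, and one log file per run plus a single report.

```python
def submission_files(student_id, runs):
    """List the expected hand-in file names for the given number of job runs."""
    sid = student_id.lower()  # the file names above use a lower-case ID
    names = []
    for n in range(1, runs + 1):
        nn = f"{n:02d}"  # two-digit run number: 01, 02, 03, ...
        names.append(f"{sid}-pyspark-{nn}.py")
        names.append(f"{sid}-output-{nn}.txt")
        names.append(f"{sid}-logs-{nn}.txt")
    names.append(f"{sid}-report.docx")  # one report for the whole project
    return names
```

For example, submission_files("12345678D", 1) lists the four files of a single-run submission, and each additional run adds three more names.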
Please hand in your files via the MS Teams Assignment system. If your team consists of
two members, then both members need to submit a file with both their names on it.
That’s it. Good Luck!