COMP 5434: Assignment 04

Due Date: 02 Dec 2020

Due Time: 23:59

Instructor: George Baciu

Title: BDML Finale

Total Marks: 25

Note: Max 2 people

Please note that a project team may have at most two people.

Description
This will be your final machine learning (ML) deployment on top of your big data cluster, 
which is built on virtual Ubuntu 20.04 servers and clients. The minimum resources necessary 
for the deployment of a working client on the Hadoop cluster are (1) 16 GB of memory 
(RAM); (2) 128 GB of hard disk storage; (3) a quad-core processor; and (4) a network 
connection. Your project can be built on one or more client machines (udesk00, udesk01, 
etc.) connected to the Hadoop cluster, which consists of three or more virtual servers (VMs) 
running on the VirtualBox 6.1.16 hypervisor. The client machine(s) will act as the 
application machine: you will use it to submit PySpark 3.0.1 jobs to the cluster and to 
operate on the files you have stored in your HDFS (Hadoop 3.3.0) cluster. The client 
machines and the servers will each have a minimum of 2 GB of RAM, and a minimum disk 
storage of 64 GB for the client, 8 GB for the master node, and 32 GB for each data node. 
The machines can be named udesk00, udesk01, …, hname10, sdata11, sdata12, sdata13, …, sdata19.
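As a rough illustration of the client-side workflow described above, the following minimal sketch shows how a PySpark job might read a file stored in HDFS. Everything specific in it is an assumption, not part of the assignment: that hname10 is the NameNode, that it listens on port 9000, that jobs run through a `yarn` master, and that the input is a CSV file with a header row. The function names `hdfs_uri` and `run_job` are hypothetical.

```python
def hdfs_uri(path: str, namenode: str = "hname10", port: int = 9000) -> str:
    """Build an hdfs:// URI for a file on the cluster.

    Assumption: hname10 is the NameNode and 9000 is its RPC port;
    check fs.defaultFS in your own core-site.xml.
    """
    return f"hdfs://{namenode}:{port}/{path.lstrip('/')}"


def run_job(input_path: str, master: str = "yarn") -> int:
    """Read a CSV file from HDFS and return its row count.

    pyspark is imported inside the function so that this module can
    still be loaded on a machine without Spark installed.
    """
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("comp5434-asg04-sketch")  # hypothetical application name
        .master(master)                    # assumption: YARN-managed cluster
        .getOrCreate()
    )
    try:
        # Assumption: the data set is a CSV file with a header row.
        df = spark.read.csv(hdfs_uri(input_path), header=True, inferSchema=True)
        df.printSchema()
        return df.count()
    finally:
        spark.stop()
```

On a client such as udesk00, calling `run_job("user/student/data.csv")` (a hypothetical path) would print the inferred schema and return the number of rows, provided the cluster is reachable and the file exists.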
The Problem
This assignment is the final part of your project in Big Data Machine 
Learning. In this assignment you will implement, test, and verify the specifications of your 
project according to what you proposed in Assignment 3. You will submit all your 
runs and analysis as a project report. The analysis should be meaningful and 
relevant to Big Data Computing and Machine Learning. You may get your inspiration and 
data from some of the most popular sites on the internet, such as the Kaggle big data 
repository: https://www.kaggle.com/. You may also come up with your own data if you 
wish, or move your platform to a cloud such as AWS.
In this assignment, you will implement the solution to the problem that you proposed to 
work on, and describe your analysis and findings with illustrations and graphs in your final 
report. You can also describe how you could improve the results or discover new concepts 
if you had more time. The implementation and analysis should follow your methodology. 
The template of the report is found at:
http://www4.comp.polyu.edu.hk/~csgeorge/comp5434/asg/04/doc/comp5434-asg-04-report.docx
What to hand in
Assuming that your student ID is 12345678D, the submission should consist of four or 
more files, where -nn indicates the number of your job run:
1. 12345678d-pyspark-nn.py - your script for the job submitted, -01, -02, -03, …
2. 12345678d-output-nn.txt - output results; use -01, -02, -03 (one for each run)
3. 12345678d-logs-nn.txt - execution logs of your job runs, -01, -02, -03, ...
4. 12345678d-report.docx - report with graphs and charts of your analysis
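The naming scheme above can be checked with a small helper. This is only a sketch: the function name `submission_files` is hypothetical, and lower-casing the student ID is inferred from the example file names rather than stated as a rule.

```python
from typing import List


def submission_files(student_id: str, runs: int) -> List[str]:
    """List the expected submission file names for a given number of job runs."""
    if runs < 1:
        raise ValueError("at least one job run is required")
    # Assumption: the ID is lower-cased, as in the example names (12345678d-...).
    sid = student_id.lower()
    files: List[str] = []
    for n in range(1, runs + 1):
        nn = f"{n:02d}"  # two-digit run number: 01, 02, 03, ...
        files.append(f"{sid}-pyspark-{nn}.py")
        files.append(f"{sid}-output-{nn}.txt")
        files.append(f"{sid}-logs-{nn}.txt")
    # A single report covers all runs.
    files.append(f"{sid}-report.docx")
    return files
```

For a single run, `submission_files("12345678D", 1)` yields the four file names listed above; each extra run adds one script, one output file, and one log file.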
Please hand in your files via the MS TEAMS Assignment system. If your team consists of 
two members, then both members need to submit a file with both their names on it.
That’s it. Good Luck!