
School of Computer Science

Uwe Roehm

DATA3404: Data Science Platforms 1.Sem./2020

Big Data Analysis Assignment

Group Assignment (15%) 06.05.2020

Introduction

This is the practical assignment of DATA3404, in which you have to write a series of Apache Spark
programs to analyze an air traffic data set and then optimise your programs for scalability on
increasing data volumes. We provide you with the schema and dataset. Your task is to implement the
three given data analysis tasks, to evaluate their performance, and to decide which optimisations
are best suited to improve each task’s performance.

You will find links to online documentation, data, and hints on the tools and schema needed for
this assignment in the ‘Assignments’ section in Canvas.

Data Set Description and Preparation

This assignment is based on an Aviation On-time data set which includes information about airports,
airlines, aircraft, and flights. This data set has the following structure:

Airports(airport_code, airport_name, city, state, country)

Aircrafts(tail_number, manufacturer, model, aircraft_type, year)

Airlines(carrier_code, name, country)

Flights(flight_id, carrier_code, flight_number, flight_date, origin, destination, tail_number,
        scheduled_departure_time, scheduled_arrival_time, actual_departure_time,
        actual_arrival_time, distance)

You will find a set of corresponding data files (as zip archives) on our course website in Canvas
in the “Assignment” module.

1. Download the linked air traffic data archives from the course website and unpack them.

2. Load the contained CSV files into the storage of your AWS Educate account (cf. tutorial
Week 9), typically S3 containers. Important: Only do this data load for the two smallest data
sets. We will also provide you with a larger data set for the performance evaluation. Due to
its size, however, this one will only be available as a shared resource later in this unit of study.
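Once the CSV files are loaded, each record has to be mapped onto the schema given above. The following is a minimal sketch in plain Python of such a mapping for the Flights file, for example as the function passed to an RDD map step. It assumes that the columns appear in the order listed in the schema and that header rows are filtered out beforehand; neither is confirmed by this handout, so check the actual files. The names FLIGHT_FIELDS and parse_flight are hypothetical.

```python
import csv

# Column order as listed in the schema description (assumed to match the CSV files).
FLIGHT_FIELDS = [
    "flight_id", "carrier_code", "flight_number", "flight_date",
    "origin", "destination", "tail_number",
    "scheduled_departure_time", "scheduled_arrival_time",
    "actual_departure_time", "actual_arrival_time", "distance",
]


def parse_flight(line):
    """Parse one CSV line into a dict keyed by the Flights column names.

    Returns None for lines that do not have exactly one value per column,
    so that malformed records can be filtered out afterwards.
    """
    values = next(csv.reader([line]), [])
    if len(values) != len(FLIGHT_FIELDS):
        return None
    return dict(zip(FLIGHT_FIELDS, (value.strip() for value in values)))
```

Using the csv module rather than a plain split(",") keeps quoted fields that contain commas intact.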

Question 1: Data Analysis with Apache Spark

You shall implement three different analysis tasks on the given data set using plain Apache Spark
(using Apache Spark’s RDD API or DataFrame API, either with Java or Python):

1. Task 1: Top-3 Cessna Models

Write an Apache Spark program that determines the top-3 Cessna aircraft models with regard
to the number of flights. Output the Cessna models in the form “Cessna 123” as one string with
only the initial ‘C’ capitalised and the model number having just its three digits. The output
file should have the following tab-delimited format, ordered by number of flights in descending
order:

Cessna XYZ \t numberOfDepartingFlights

2. Task 2: Average Departure Delay

In the second task, write an Apache Spark program that determines the average, min and max
departure delay (in minutes) of flights by US airlines in a given year (user-specified year).
Only consider delayed flights, i.e. flights whose actual departure time is after their scheduled
departure time, and ignore any cancelled flights. The output file should have the following
tab-delimited format (ordered alphabetically by airline name):

airline_name \t num_delays \t average_delay \t min_delay \t max_delay

3. Task 3: Most Popular Aircraft Types

In the third task, you shall write an Apache Spark program that lists, per airline of a given
country (user-specified), the five most-used aircraft types (manufacturer, model). List the
airlines in alphabetical order, and show the five most-used aircraft types in descending order of
the number of flights as a single, comma-separated string that is enclosed in ‘[’ and ‘]’
(indicating a list). Format the name of an aircraft type as follows: MANUFACTURER ‘ ’ MODEL (for
example, “Boeing 787” or “Airbus A350”).

The output should have the following tab-delimited format (alphabetically by airline name):

airline_name \t [aircraft_type1, aircraft_type2, ... , aircraft_type5]
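The output formats of the three tasks can be illustrated with a few plain-Python helpers, which could be called from Spark map steps or wrapped as user-defined functions. This is only a sketch of the formatting and filtering rules, not of the Spark jobs themselves. It rests on several assumptions that the handout does not confirm: that the three-digit model number is the first run of three digits in the model field, that times are stored as HH:MM:SS strings, and that cancelled flights have an empty actual departure time. Flights departing after midnight relative to their schedule are not handled. All function names are hypothetical.

```python
import re
from datetime import datetime


def format_cessna_model(manufacturer, model):
    """Task 1: return e.g. 'Cessna 172', or None if this is not a Cessna."""
    if not manufacturer or not model:
        return None
    if manufacturer.strip().upper() != "CESSNA":
        return None
    digits = re.search(r"\d{3}", model)  # assumption: first three digits
    return "Cessna " + digits.group(0) if digits else None


def departure_delay_minutes(scheduled, actual, fmt="%H:%M:%S"):
    """Task 2: delay in whole minutes, or None if not delayed or cancelled."""
    if not scheduled or not actual:  # assumption: cancelled = no actual time
        return None
    difference = datetime.strptime(actual, fmt) - datetime.strptime(scheduled, fmt)
    minutes = int(difference.total_seconds() // 60)
    return minutes if minutes > 0 else None


def delay_statistics(delays):
    """Task 2: (num_delays, average_delay, min_delay, max_delay) for one airline."""
    return (len(delays), sum(delays) / len(delays), min(delays), max(delays))


def format_top_aircraft(airline_name, flights_per_type, limit=5):
    """Task 3: 'airline_name <TAB> [type1, type2, ...]' with the most-used types first."""
    ranked = sorted(flights_per_type.items(), key=lambda item: (-item[1], item[0]))
    names = [name for name, _ in ranked[:limit]]
    return airline_name + "\t[" + ", ".join(names) + "]"
```

Ties in format_top_aircraft are broken alphabetically so that repeated runs produce the same output; the handout does not specify a tie-breaking rule.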

General Coding Requirements

1. You should solve this assignment with Apache Spark version 2.4 as installed in AWS
EMR. You will need an AWS Educate account for this.

2. If you use any code fragments or code snippets from third-party sources (which you should
not need for these tasks...), you must reference those properly. Include a statement on which
parts of your submission are your own work.

3. Always test your code using a small data set before running it on any larger data set.

Question 2: Performance Evaluation and Tuning

a) Conduct a performance evaluation of your implementations for each task on varying dataset

sizes. We will provide you with five different data sizes, the two largest ones to be shared

among all groups. You should execute your code on each data size and record the execution

times and the sizes of the intermediate results (communication efforts).

b) Suggest some optimisations to the analysis task implementations such that the performance
of your task(s) improves. Show that they work.
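Execution times vary between runs, so it helps to repeat each measurement and record all runs. A minimal sketch of such a timing harness in plain Python is shown here; the name time_runs is hypothetical, and job stands for any callable that triggers the complete Spark job (for example a function that ends in an action such as writing the output file). It measures wall-clock time on the driver only; the sizes of intermediate results still have to be read from the Spark web UI.

```python
import statistics
import time


def time_runs(job, repeats=3):
    """Run job() several times and collect wall-clock times in seconds."""
    runs = []
    for _ in range(repeats):
        start = time.perf_counter()
        job()
        runs.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "mean": statistics.mean(runs),
        "min": min(runs),
        "max": max(runs),
    }
```

Because Spark evaluates transformations lazily, the timed callable has to include an action; otherwise only the construction of the execution plan is measured.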

Question 3: Documentation of Implementation and Tuning Decisions

Write a text document (plain text or Word document or PDF file, no more than 5 pages plus

optional Appendix) in which you document your implementation and your performance evaluation.

Your document should contain the following:

1. Job Design Documentation

In your document, describe the Apache Spark jobs you use to implement Tasks 1 to 3. For

each job, briefly describe the different transformation functions. If you use any user-defined

functions, classes or operators, please describe those too.

2. Justification of any tuning decisions or optimisations; document the changes in the
execution plans and the estimated execution costs for each individual analysis task before and
after your optimisations using the DAG Visualizations of Apache Spark.

3. Briefly justify each tuning decision.

4. Performance Evaluation: Include a chart and a table with the average execution times of

your tasks for different data sets.

5. Include as an appendix the S3 storage location of your final output files from the various executions.
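The table of average execution times can be generated directly from the recorded measurements. The following plain-Python sketch assumes that the run times are kept in a dictionary keyed by (task, dataset) pairs; this layout and the name average_time_table are hypothetical. It produces a tab-delimited table with one row per task and one column per data set.

```python
from statistics import mean


def average_time_table(timings):
    """Build a tab-delimited table of mean run times in seconds.

    timings maps (task, dataset) to a list of measured run times.
    Combinations without measurements are shown as 'n/a'.
    """
    datasets = sorted({dataset for _, dataset in timings})
    tasks = sorted({task for task, _ in timings})
    lines = ["task\t" + "\t".join(datasets)]
    for task in tasks:
        cells = []
        for dataset in datasets:
            runs = timings.get((task, dataset))
            cells.append("%.1f" % mean(runs) if runs else "n/a")
        lines.append(task + "\t" + "\t".join(cells))
    return "\n".join(lines)
```

The resulting text can be pasted into a spreadsheet to draw the accompanying chart.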

Milestones

Have the first task ready in the Week 11 tutorials for the tutors to review and to give feedback.

Deliverables and Submission Details

There are three deliverables: source code, a brief program design and performance documentation
(up to 5 pages, as per the content description above), and a demo in Week 12 via Zoom.

All deliverables are due in Week 12, no later than 8 pm, Friday 22 May 2020. Late submission

penalty: -20% of the awarded marks per day late. We will make available a marking rubric in

Canvas.

Please submit the source code and a soft copy of the design documentation as a zip or tar file
electronically in Canvas, one per group. Name your zip archive after your UniKey: abcd1234.zip

Demo: A few points of the marking scheme will be given to any submission which can be

demoed successfully on our own cluster.

Students must retain electronic copies of their submitted assignment files and databases, as the

unit coordinator may request to inspect these files before marking of an assignment is completed. If

these assignment files are not made available to the unit coordinator when requested, the marking

of this assignment may not proceed.

All the best!

Group member participation

This is a group assignment. The mark awarded for your assignment is conditional on you being

able to explain any of your answers to your tutor or the subject coordinator if asked.

If members of your group do not contribute sufficiently you should alert your tutor as soon as

possible. The tutor has the discretion to scale the group’s mark for each member as follows, based

on the outcome of the group’s demo in Week 12:

Level of contribution                                            Proportion of final grade received
No participation.                                                0%
Passive member, but full understanding of the submitted work.    50%
Minor contributor to the group’s submission.                     75%
Major contributor to the group’s submission.                     100%
