
CSC 578 "Class Project", Part (A) Kaggle competition

2021/6/4
https://reed.cs.depaul.edu/peterh/courses/csc578/Assignments/Final/class-project-kaggle.html
Versions
0.3: 5/28/2021: Added info about downloading data (under Dataset), fixed the
number of predictions (4993 instead of 4992), and added a link to the Kaggle
competition
0.2: 5/27/2021: Added lots of details
0.1: Basic idea
Overview
The goal of the project is to apply deep learning to do time series forecasting. In
particular, you will create deep learning models to predict future traffic volume at a
location in Minnesota, between Minneapolis and St Paul.
Tensorflow has published a tutorial on developing a variety of models for
predicting time series data. Your project will constitute a form of transfer learning:
Apply the procedures and techniques from that tutorial to the traffic volume
data,
while indicating your understanding of the processes:
via your comments in your notebook code,
by showing your results with visualizations in your notebook, and
via the results that you submit to the kaggle competition
In this project, the task of model building is set up as a Kaggle competition so
that you can compare the performance of your best model with that of others in
the class.
Requirements
The general project requirements are described here.
A. Participation in the competition [required]. Note: it will close on Thursday,
June 10th, 6:59 pm Central time (11:59 pm UTC).
B. Make your Jupyter Notebook code file presentable, like a report.
Include Exploratory Data Analysis (EDA) of the data in the introduction
part. Minimally, there should be a description of the variables, plots and
histograms, along with your analysis/comments on the distributions.
Good organization. Section headers and descriptions.
Visualization of the performance for each model.
C. Write documentation (in pdf).
Minimum 700 words (at least 2.5 pages), including figures, tables and
references.
Be sure to add your name, course/section number and the assignment
name at the top of the file.
Also at the top of the file, write, in bold, your Kaggle user name (as
displayed in the leaderboard) and ranking (public or private, or both).
Models and development. Describe at least three different models:
One must be a baseline model. You decide/define the model, and
write a brief explanation/justification on why you chose it as the
baseline.
You pick two other models to put forward (e.g. type and
configuration of architecture, different hyperparameter
configurations).
Be sure to include your Kaggle-best model. Write your
thoughts/speculation on why the model performed the best.
Your analysis must be insightful. Comments should also include
your expectation for the model and the reasoning behind the
expectation.
Your reaction and reflection on your result and the competition.
Your reaction and reflection on the project as well as the course overall.
Deliverables:
A. Source Jupyter notebook file and its html or pdf version.
B. Documentation.
NOTE: Higher grades will be given to submissions (including both source and
documentation files) that are nicely organized and well written, with a sufficient
amount of comments and presentable graphs/charts. Ones with terse, minimal
content will be considered insufficient and receive a lower grade.
More Details about the Task
Dataset
The dataset we will use is Metro Interstate traffic volume from the UCI repository.
Here are:
the data file csv in zip format, and in gzip format,
the data set description
You can either download the data from one of the links above, upload it to
Google Drive, and mount your Google Drive, or you can download it directly by
modifying the instructions in the tutorial like this:
```python
zip_path = tf.keras.utils.get_file(
    origin='https://archive.ics.uci.edu/ml/machine-learning-databases/00492/Metro_Interstate_Traffic_Volume.csv.gz',
    fname='Metro_Interstate_Traffic_Volume.csv.gz',
    cache_dir='/content', cache_subdir='sample_data')
# colab doesn't know how to automatically gunzip a file
!gunzip sample_data/Metro_Interstate_Traffic_Volume.csv.gz
csv_path, _ = os.path.splitext(zip_path)  # just strip off the .gz
csv_path  # should be '/content/sample_data/Metro_Interstate_Traffic_Volume.csv'
```
You can also download and extract the data directly from the data folder in the
above repository, following the example in the tutorial, but you'll need to make a
few changes.
There are 48,000+ total instances, with 9 features:
holiday: string (None or name of holiday)
temp: in kelvin
rain_1h: in mm for the last hour
snow_1h: in mm for the last hour
clouds: percent
weather_main: short descriptive text
weather_description: longer descriptive text
date_time: in M/D/Y H:m:s AM/PM format
traffic_volume: # of cars in the last hour
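Since date_time arrives as a string in the stated format, a quick first EDA step is to parse it into real timestamps; a minimal sketch on a two-row toy frame (not the real file):

```python
import pandas as pd

# Toy rows in the stated M/D/Y H:m:s AM/PM format (made-up values)
sample = pd.DataFrame({
    'date_time': ['10/2/2012 9:00:00 AM', '10/2/2012 10:00:00 AM'],
    'traffic_volume': [5545, 4516],
})
# Parse the string timestamps so you can extract hour, weekday, etc. for EDA
sample['date_time'] = pd.to_datetime(sample['date_time'],
                                     format='%m/%d/%Y %I:%M:%S %p')
print(sample['date_time'].dt.hour.tolist())  # [9, 10]
```

Once parsed, `df['date_time'].dt.hour` and `.dt.dayofweek` make the daily and weekly traffic cycles easy to plot.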
How to start
Work through the tensorflow time series tutorial, but apply it to the traffic data
instead of the weather data. Work your way toward the multi-step RNN with LSTM
model. That will be the goal, but you should also look at the other models to be
able to compare them. You should also create a baseline model as shown in the
tutorial.
Specific goal
The specific goal will be to predict from a 6-hour input window, only the traffic
volume for 2 hours past the end of the window. That is a little different from the
multi-step (and multi-output) RNN with LSTM example, but I think it will be a lot
less confusing when you create your submission file.
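The windowing above can be sketched in plain NumPy (a toy illustration, not the tutorial's WindowGenerator; the name make_windows is made up here). A window covering hours t..t+5 predicts the value at t+7, i.e. 2 hours past the window's end:

```python
import numpy as np

def make_windows(series, input_width=6, gap=2):
    """Pair each 6-hour input window with the value `gap` hours after it ends."""
    X, y = [], []
    for t in range(len(series) - input_width - gap + 1):
        X.append(series[t:t + input_width])          # hours t .. t+5
        y.append(series[t + input_width + gap - 1])  # hour t+7
    return np.array(X), np.array(y)

vol = np.arange(20)        # toy stand-in for traffic_volume
X, y = make_windows(vol)
print(X.shape, y[0])       # first window 0..5 -> target at index 7
```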
Initial things to watch out for:
As with the weather data, some of the features will need to be transformed in
some way.
For those which are strings, you can decide whether to just ignore them,
or replace them by a more useful representation of the same
information.
You can also decide what to do with periodic data, but you probably
don't want to ignore them.
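For the periodic date_time signal, one common option (used in the TF tutorial) is a sin/cos encoding of the time of day, and string features can be one-hot encoded; a minimal sketch on toy data:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (48 hourly rows, made-up strings)
df = pd.DataFrame({
    'date_time': pd.date_range('2012-10-02', periods=48, freq='h'),
    'weather_main': ['Clouds', 'Rain'] * 24,
})
seconds = df['date_time'].astype('int64') // 10**9   # Unix seconds
day = 24 * 60 * 60
df['day_sin'] = np.sin(seconds * (2 * np.pi / day))  # periodic time-of-day
df['day_cos'] = np.cos(seconds * (2 * np.pi / day))
df = pd.get_dummies(df, columns=['weather_main'])    # one-hot the string feature
print(df.columns.tolist())
```

The sin/cos pair lets the model see that hour 23 and hour 0 are adjacent, which a raw hour number does not.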
Training / Test split
Follow the TF tutorial's example for splitting the data into training, validation,
and test sets, except use the last 5,000 records of the data as the test set. You can
split the rest of the data as you want. (For example, putting 10,000 records into the
validation set is approximately 20%, as shown in the tutorial.)
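On a toy frame of the same length, the split above might look like this (the 10,000-row validation slice is just one choice, not a requirement):

```python
import pandas as pd

n = 48204                                   # total samples in the dataset
df = pd.DataFrame({'traffic_volume': range(n)})  # toy stand-in

test_df = df.iloc[-5000:]        # fixed test set: rows 43204..48203
val_df = df.iloc[-15000:-5000]   # e.g. 10,000 validation rows
train_df = df.iloc[:-15000]      # the rest for training
print(len(train_df), len(val_df), len(test_df))
```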
In particular:
There are 48204 samples in the entire dataset. When you read them in, they
will appear in a pandas dataframe. You can use head() and tail() to see the
start and end of the different datasets and their subsets.
For predictions:
You will use the first 6 of those values (with indices 43,204 to 43,209,
inclusive) to predict the value for two hours later, i.e., the one in row
43,211.
And so on, hour-by-hour, until you predict the last row/hour (48203),
based on rows 48,196 to 48,201 (inclusive).
For a total of 4993 lines in your .csv submission file after the header line,
which should be "id,prediction".
For the id value of each prediction, you should use integers from 1 to
4993.
Don't forget to denormalize your predictions.
Then save them to the aforementioned .csv file with the appropriate
indices, e.g., by putting the predictions back into a DataFrame.
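A minimal sketch of the denormalize-and-save step (the mean/std and predictions below are made-up numbers for illustration; use the statistics you actually normalized with):

```python
import numpy as np
import pandas as pd

train_mean, train_std = 3000.0, 2000.0      # assumed normalization stats
preds_norm = np.array([0.24, -0.06, 0.21])  # toy normalized model outputs

preds = preds_norm * train_std + train_mean  # denormalize back to car counts
sub = pd.DataFrame({'id': np.arange(1, len(preds) + 1),
                    'prediction': preds})
sub.to_csv('submission.csv', index=False)    # header line: id,prediction
print(open('submission.csv').readline().strip())
```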
Here is a sample submission file, sample_sub.csv. Don't submit it to
Kaggle, but you can use it to check the format of your submission file.
Details about the kaggle competition
Your submissions to the kaggle competition should be formatted as described
above, with 4993 rows plus a header row. Here are the beginning and end of the
sample_submission.csv file (again, no extra spaces or other junk):
id,prediction
1,3481.280499629291
2,2876.958684404887
3,3417.846569017333
4,3293.5700851714946
5,1296.7354053428837
...
4991,3947.3713909650096
4992,3542.831172174818
4993,2781.3480226893225
NOTE: A maximum of 4 submissions per day is allowed, to encourage you to
optimize intelligently based on your validation results.
The submissions are evaluated with kaggle's Mean Absolute Error metric.
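You can compute the same metric locally on your validation predictions before spending one of your daily submissions; a minimal sketch with made-up numbers:

```python
import numpy as np

y_true = np.array([3481.3, 2877.0, 3417.8])  # toy validation targets
y_pred = np.array([3400.0, 2900.0, 3500.0])  # toy model outputs
mae = np.mean(np.abs(y_true - y_pred))       # Mean Absolute Error
print(round(mae, 2))  # 62.17
```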
Other notes and hints
Start simple! Try something basic, make an initial kaggle submission (to
make sure it works), then tweak it.
After building a baseline model (e.g. an LSTM model), you may try exploring
advanced features and architectures such as:
Batch size
Number of recurrent units
Stacking recurrent layers
Statefulness
Recurrent dropout
Bidirectional RNNs
Combining CNNs with recurrent networks
1D convnet + RNN
1D convnet + GRU
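As one example from the list above, stacked LSTM layers with recurrent dropout might look like this in Keras (the input shape and unit counts are assumptions for illustration, not a recommendation):

```python
import tensorflow as tf

# 6 timesteps in, one traffic_volume value out (feature count of 5 is assumed)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(6, 5)),
    tf.keras.layers.LSTM(32, return_sequences=True,  # stacked recurrent layers
                         recurrent_dropout=0.2),
    tf.keras.layers.LSTM(32, recurrent_dropout=0.2),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mae')  # matches the competition metric
print(model.output_shape)  # (None, 1)
```

Note that `return_sequences=True` is what lets the second LSTM consume the full hidden-state sequence of the first.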
FYI, if you are interested in performance optimization on GPUs, see the
section "Performance optimization and CuDNN kernels" in the Keras Guide
on RNNs.
References:
This site, How to Develop LSTM Models for Time Series Forecasting (with
the same code at Multivariate Time Series Forecasting with LSTMs in
Keras), has a code example for creating rollout data.
Ditto this Multivariate Time Series using RNN with Keras, although the
example is univariate.
This site, Prediction and Analysis of Time Series Data using Tensorflow, is
a very useful reference on time-series data, in particular on generating
predictions (and using Keras), although again the example is univariate.
This site A comprehensive beginner’s guide to create a Time Series
Forecast (with Codes in Python and R) has quite intuitive and concise
explanations on some of the fundamental concepts in Time-series in
Statistics.
Finally, the general Keras reference on Working with RNNs. This is an
EXCELLENT page on Keras RNNs (for TF2).
Assessment (15 points max total for this part, and 15 for the paper)
This project will be graded according to the following criteria:
Code in Jupyter notebook (8 points max total)
Includes appropriate visualizations, training, and evaluations showing the
exploratory data analysis. (3 pts max)
Good comments in code indicating your understanding of the processes.
Clearly distinguish between the tutorial code and code you added. (And
do not include the entire tutorial code.) (2 pts max)
For your comments, put your initials in front, so I can distinguish them
from what was in the tutorial.
Baseline and Best models clearly labelled. (1 pt max)
Clearly organized overall. (2 pts max)
It is important that you have a submission. Your overall ranking at the end of
the competition is not important, but I hope to see that you at least tried to
improve on the benchmark performance of my baseline.
Documentation (7 points max total)
Criteria for your documentation:
At least 600 words (plus relevant images), kaggle submission completed,
includes your real name and kaggle names (2 pts max)
Includes a good description of your baseline and kaggle-best models (1.5 pts
max)
Includes a good description of your "journey" to prepare data and tune
hyperparameters: (3.5 pts max)
What exploratory data analysis did you do?
How did you prepare the data? What modifications did you make to the
data (e.g., with derived features)?
What were your expectations of different approaches? Were they met?
Also includes description of a "runner-up" model that was interesting.
This reflection part will be weighted more than the other components of
the documentation.
Thoughtful conclusions on your best model and on the assignment.
Spring 2021, 2021-05-28 Fri 23:53