
CSC 578 "Class Project", Part (A) Kaggle competition

2021/6/4
https://reed.cs.depaul.edu/peterh/courses/csc578/Assignments/Final/class-project-kaggle.html
Versions
0.3: 5/28/2021: Added info about downloading data (under Dataset), fixed the
number of predictions (4993 instead of 4992), and added a link to the Kaggle
competition
0.2: 5/27/2021: Added lots of details
0.1: Basic idea
Overview
The goal of the project is to apply deep learning to do time series forecasting. In
particular, you will create deep learning models to predict future traffic volume at a
location in Minnesota, between Minneapolis and St Paul.
Tensorflow has published a tutorial on developing a variety of models for
predicting time series data. Your project will constitute a form of transfer learning:
Apply the procedures and techniques from that tutorial to the traffic volume
data,
while indicating your understanding of the processes:
via your comments in your notebook code,
by showing your results with visualizations in your notebook, and
via the results that you submit to the kaggle competition
In this project, the task of model building is set up as a Kaggle competition so
that you can compare the performance of your best model with that of others in
the class.
Requirements
The general project requirements are described here.
A. Participation in the competition [required]. Note: it will close on Thursday,
June 10th, 6:59 pm Central time (11:59 pm UTC).
B. Make your Jupyter Notebook code file presentable, like a report.
Include Exploratory Data Analysis (EDA) of the data in the introduction
part. Minimally, there should be a description of the variables, plots and
histograms, along with your analysis/comments on the distributions.
Good organization. Section headers and descriptions.
Visualization of the performance for each model.
C. Write documentation (in pdf).
Minimum 700 words (at least 2.5 pages), including figures, tables and
references.
Be sure to add your name, course/section number and the assignment
name at the top of the file.
Also at the top of the file, write, in bold, your Kaggle user name (as
displayed in the leaderboard) and ranking (public or private, or both).
Models and development. Describe at least three different models:
One must be a baseline model. You decide/define the model, and
write a brief explanation/justification on why you chose it as the
baseline.
You pick two other models to put forward (e.g. type and
configuration of architecture, different hyperparameter
configurations).
Be sure to include your Kaggle-best model. Write your
thoughts/speculation on why the model performed the best.
Your analysis must be insightful. Comments should also include
your expectation for the model and the reasoning behind the
expectation.
Your reaction and reflection on your result and the competition.
Your reaction and reflection on the project as well as the course overall.
Deliverables:
A. Source Jupyter notebook file and its html or pdf version.
B. Documentation.
NOTE: Higher grades will be given to submissions (including both source and
documentation files) that are nicely organized and well written, with a sufficient
amount of comments and presentable graphs/charts. Ones with terse, minimal
content will be considered insufficient and receive a lower grade.
More Details about the Task
Dataset
The dataset we will use is Metro Interstate traffic volume from the UCI repository.
Here are:
the data file csv in zip format, and in gzip format,
the data set description
You can either download the data from one of the links above, upload it to
Google Drive, and mount your Google Drive, or you can download it directly by
modifying the instructions in the tutorial like this:
```python
zip_path = tf.keras.utils.get_file(
    origin='https://archive.ics.uci.edu/ml/machine-learning-databases/00492/Metro_Interstate_Traffic_Volume.csv.gz',
    fname='Metro_Interstate_Traffic_Volume.csv.gz',
    cache_dir='/content', cache_subdir='sample_data')
# colab doesn't know how to automatically gunzip a file
!gunzip sample_data/Metro_Interstate_Traffic_Volume.csv.gz
csv_path, _ = os.path.splitext(zip_path)  # just strip off the .gz
csv_path  # should be '/content/sample_data/Metro_Interstate_Traffic_Volume.csv'
```
You can also download and extract the data directly from the data folder in the
above repository, following the example in the tutorial, but you'll need to make a
few changes.
There are 48,000+ total instances, with 9 features:
holiday: string (None or name of holiday)
temp: in kelvin
rain_1h: in mm for the last hour
snow_1h: in mm for the last hour
clouds: percent
weather_main: short descriptive text
weather_description: longer descriptive text
date_time: in M/D/Y H:m:s AM/PM format
traffic_volume: # of cars in the last hour
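Since date_time arrives as a string in the stated format, a quick first EDA step is to parse it into real timestamps; a minimal sketch on a two-row toy frame (not the real file):

```python
import pandas as pd

# Toy rows in the stated M/D/Y H:m:s AM/PM format (made-up values)
sample = pd.DataFrame({
    'date_time': ['10/2/2012 9:00:00 AM', '10/2/2012 10:00:00 AM'],
    'traffic_volume': [5545, 4516],
})
# Parse the string timestamps so you can extract hour, weekday, etc. for EDA
sample['date_time'] = pd.to_datetime(sample['date_time'],
                                     format='%m/%d/%Y %I:%M:%S %p')
print(sample['date_time'].dt.hour.tolist())  # [9, 10]
```

Once parsed, `df['date_time'].dt.hour` and `.dt.dayofweek` make the daily and weekly traffic cycles easy to plot.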
How to start
Work through the tensorflow time series tutorial, but apply it to the traffic data
instead of the weather data. Work your way toward the multi-step RNN with LSTM
model. That will be the goal, but you should also look at the other models to be
able to compare them. You should also create a baseline model as shown in the
tutorial.
Specific goal
The specific goal will be to predict from a 6-hour input window, only the traffic
volume for 2 hours past the end of the window. That is a little different from the
multi-step (and multi-output) RNN with LSTM example, but I think it will be a lot
less confusing when you create your submission file.
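The windowing above can be sketched in plain NumPy (a toy illustration, not the tutorial's WindowGenerator; the name make_windows is made up here). A window covering hours t..t+5 predicts the value at t+7, i.e. 2 hours past the window's end:

```python
import numpy as np

def make_windows(series, input_width=6, gap=2):
    """Pair each 6-hour input window with the value `gap` hours after it ends."""
    X, y = [], []
    for t in range(len(series) - input_width - gap + 1):
        X.append(series[t:t + input_width])          # hours t .. t+5
        y.append(series[t + input_width + gap - 1])  # hour t+7
    return np.array(X), np.array(y)

vol = np.arange(20)        # toy stand-in for traffic_volume
X, y = make_windows(vol)
print(X.shape, y[0])       # first window 0..5 -> target at index 7
```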
Initial things to watch out for:
As with the weather data, some of the features will need to be transformed in
some way.
For those which are strings, you can decide whether to just ignore them,
or replace them by a more useful representation of the same
information.
You can also decide what to do with periodic data, but you probably
don't want to ignore them.
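For the periodic date_time signal, one common option (used in the TF tutorial) is a sin/cos encoding of the time of day, and string features can be one-hot encoded; a minimal sketch on toy data:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (48 hourly rows, made-up strings)
df = pd.DataFrame({
    'date_time': pd.date_range('2012-10-02', periods=48, freq='h'),
    'weather_main': ['Clouds', 'Rain'] * 24,
})
seconds = df['date_time'].astype('int64') // 10**9   # Unix seconds
day = 24 * 60 * 60
df['day_sin'] = np.sin(seconds * (2 * np.pi / day))  # periodic time-of-day
df['day_cos'] = np.cos(seconds * (2 * np.pi / day))
df = pd.get_dummies(df, columns=['weather_main'])    # one-hot the string feature
print(df.columns.tolist())
```

The sin/cos pair lets the model see that hour 23 and hour 0 are adjacent, which a raw hour number does not.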
Training / Test split
Follow the TF tutorial's example for splitting the data into training, validation,
and test sets, except use the last 5,000 records of the data as the test set. You can
split the rest of the data as you want. (For example, putting 10,000 records into the
validation set is approximately 20%, as shown in the tutorial.)
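On a toy frame of the same length, the split above might look like this (the 10,000-row validation slice is just one choice, not a requirement):

```python
import pandas as pd

n = 48204                                   # total samples in the dataset
df = pd.DataFrame({'traffic_volume': range(n)})  # toy stand-in

test_df = df.iloc[-5000:]        # fixed test set: rows 43204..48203
val_df = df.iloc[-15000:-5000]   # e.g. 10,000 validation rows
train_df = df.iloc[:-15000]      # the rest for training
print(len(train_df), len(val_df), len(test_df))
```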
In particular:
There are 48204 samples in the entire dataset. When you read them in, they
will appear in a pandas dataframe. You can use head() and tail() to see the
start and end of the different datasets and their subsets.
For predictions:
You will use the first 6 of those values (with indices 43,204 to 43,209,
inclusive) to predict the value for two hours later, i.e., the one in row
43,211.
And so on, hour-by-hour, until you predict the last row/hour (48203),
based on rows 48,196 to 48,201 (inclusive).
For a total of 4993 lines in your .csv submission file after the header line,
which should be "id,prediction".
For the id value of each prediction, you should use integers from 1 to
4993.
Don't forget to denormalize your predictions.
Then save them to the aforementioned .csv file with the appropriate
indices, e.g., by putting the predictions back into a DataFrame.
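A minimal sketch of the denormalize-and-save step (the mean/std and predictions below are made-up numbers for illustration; use the statistics you actually normalized with):

```python
import numpy as np
import pandas as pd

train_mean, train_std = 3000.0, 2000.0      # assumed normalization stats
preds_norm = np.array([0.24, -0.06, 0.21])  # toy normalized model outputs

preds = preds_norm * train_std + train_mean  # denormalize back to car counts
sub = pd.DataFrame({'id': np.arange(1, len(preds) + 1),
                    'prediction': preds})
sub.to_csv('submission.csv', index=False)    # header line: id,prediction
print(open('submission.csv').readline().strip())
```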
Here is a sample submission file, sample_sub.csv. Don't submit it to
Kaggle, but you can use it to check the format of your submission file.
Details about the kaggle competition
Your submissions to the kaggle competition should be formatted as described
above, with 4993 rows plus a header row. Here are the beginning and end of the
sample_submission.csv file (again, no extra spaces or other junk):
id,prediction
1,3481.280499629291
2,2876.958684404887
3,3417.846569017333
4,3293.5700851714946
5,1296.7354053428837
...
4991,3947.3713909650096
4992,3542.831172174818
4993,2781.3480226893225
NOTE: A maximum of 4 submissions per day is allowed, to encourage you to
optimize intelligently based on your validation results.
The submissions are evaluated with kaggle's Mean Absolute Error metric.
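You can compute the same metric locally on your validation predictions before spending one of your daily submissions; a minimal sketch with made-up numbers:

```python
import numpy as np

y_true = np.array([3481.3, 2877.0, 3417.8])  # toy validation targets
y_pred = np.array([3400.0, 2900.0, 3500.0])  # toy model outputs
mae = np.mean(np.abs(y_true - y_pred))       # Mean Absolute Error
print(round(mae, 2))  # 62.17
```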
Other notes and hints
Start simple! Try something basic, make an initial kaggle submission (to
make sure it works), then tweak it.
After building a baseline model (e.g. an LSTM model), you may try exploring
advanced features and architectures such as:
Batch size
Number of recurrent units
Stacking recurrent layers
Statefulness
Recurrent dropout
Bidirectional RNNs
Combining CNNs with recurrent networks
1D convnet + RNN
1D convnet + GRU
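As one example from the list above, stacked LSTM layers with recurrent dropout might look like this in Keras (the input shape and unit counts are assumptions for illustration, not a recommendation):

```python
import tensorflow as tf

# 6 timesteps in, one traffic_volume value out (feature count of 5 is assumed)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(6, 5)),
    tf.keras.layers.LSTM(32, return_sequences=True,  # stacked recurrent layers
                         recurrent_dropout=0.2),
    tf.keras.layers.LSTM(32, recurrent_dropout=0.2),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mae')  # matches the competition metric
print(model.output_shape)  # (None, 1)
```

Note that `return_sequences=True` is what lets the second LSTM consume the full hidden-state sequence of the first.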
FYI, if you are interested in performance optimization on GPUs, see the
section "Performance optimization and CuDNN kernels" in the Keras Guide
on RNNs.
References:
This site, How to Develop LSTM Models for Time Series Forecasting (with
the same code at Multivariate Time Series Forecasting with LSTMs in
Keras), has a code example for creating rollout data.
Ditto this Multivariate Time Series using RNN with Keras, although the
example is univariate.
This site, Prediction and Analysis of Time Series Data using Tensorflow, is
a very useful reference on time-series data, in particular on generating
predictions (and using Keras), although again the example is univariate.
This site A comprehensive beginner’s guide to create a Time Series
Forecast (with Codes in Python and R) has quite intuitive and concise
explanations on some of the fundamental concepts in Time-series in
Statistics.
Finally, the general Keras reference on Working with RNNs. This is an
EXCELLENT page on Keras RNNs (for TF2).
Assessment (15 points max total for this part, and 15 for the paper)
This project will be graded according to the following criteria:
Code in Jupyter notebook (8 points max total)
Includes appropriate visualizations, training, and evaluations showing the
exploratory data analysis. (3 pts max)
Good comments in code indicating your understanding of the processes.
Clearly distinguish between the tutorial code and code you added. (And
do not include the entire tutorial code.) (2 pts max)
For your comments, put your initials in front, so I can distinguish them
from what was in the tutorial.
Baseline and Best models clearly labelled. (1 pt max)
Clearly organized overall. (2 pts max)
It is important that you have a submission. Your overall ranking at the end of
the competition is not important, but I hope to see that you at least tried to
improve on the benchmark performance of my baseline.
Documentation (7 points max total)
Criteria for your documentation:
At least 600 words (plus relevant images), kaggle submission completed,
includes your real name and kaggle names (2 pts max)
Includes a good description of your baseline and kaggle-best models (1.5 pts
max)
Includes a good description of your "journey" to prepare data and tune
hyperparameters: (3.5 pts max)
What exploratory data analysis did you do?
How did you prepare the data? What modifications did you make to the
data (e.g., with derived features)?
What were your expectations of different approaches? Were they met?
Also includes description of a "runner-up" model that was interesting.
This reflection part will be weighted more than the other components of
the documentation.
Thoughtful conclusions on your best model and on the assignment.
Spring 2021, 2021-05-28 Fri 23:53