Assignment 1
Due: Monday 2nd March at 10.00pm
The aims of this assignment are to put into practice the concepts covered in lectures, to apply
these to a real dataset, and to demonstrate your ability to use Python to carry out machine
learning tasks.
Problems and Data
The data you are going to be working on comes from Spotify and is based on this set:
https://www.kaggle.com/cnic92/spotify-past-decades-songs-50s10s which captures various
attributes about songs and includes a popularity score. You will work on this as a group of four
and you are going to address two problems:
⚫ A regression problem which aims to predict the popularity score of a song
⚫ A classification problem which aims to predict the top genre that a song belongs to
A description of each problem along with a detailed overview of the data is available at:
⚫ Regression Problem: https://www.kaggle.com/t/fcb56e49f46d4bfb999148579d857fbc
⚫ Classification Problem: https://www.kaggle.com/t/38bfabc24c8942d1802d2214522a3249
Instructions on how to work with Kaggle Data in Google Colab are provided here:
https://colab.research.google.com/drive/1EXYGOLT_uoYm9dHM8U51qCropg1MhXFI
Instructions
Think carefully about the problem you are working on and the main question you are trying to
answer. Take your time to make sure you understand the data. It is also not necessary to use
every attribute - you may find yourself working with many or just a few. The emphasis in this
assignment is very much on the process: if you find that the techniques you have chosen don't
work very well or fail to produce particularly interesting results, then this is not a problem
provided you followed the appropriate steps to understand and prepare the data and select
appropriate models, and can provide some insights or explanations into why your model failed
(or performed brilliantly!).
For both the classification and regression tasks you should aim to apply a range (around 2-4) of
techniques including the basic ones (which might be useful as baseline comparisons) and also
the more sophisticated ones covered in this class, but the emphasis should be on using the
techniques appropriately and interpreting the results, not on a scatter-gun of algorithms.
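As an illustration of what a baseline comparison can look like, the sketch below scores a "predict the mean" baseline against a one-feature least-squares fit, reporting error on both the training split and a held-out split. It is a minimal standard-library example under stated assumptions: the data is synthetic (it is not the Spotify set), and the function names `fit_mean_baseline`, `fit_simple_linear` and `compare_models` are hypothetical. In a notebook the same comparison would normally use library models (for example, scikit-learn's `DummyRegressor` as the baseline).

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def fit_mean_baseline(x_train, y_train):
    """Baseline: ignore the feature and always predict the training mean."""
    mean_y = sum(y_train) / len(y_train)
    return lambda x: mean_y


def fit_simple_linear(x_train, y_train):
    """Ordinary least squares on a single feature: y = slope * x + intercept."""
    n = len(x_train)
    mean_x = sum(x_train) / n
    mean_y = sum(y_train) / n
    sxx = sum((x - mean_x) ** 2 for x in x_train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_train, y_train))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def compare_models(seed=0, n=200):
    """Fit each model on 80% of the data and score it on both splits."""
    rng = random.Random(seed)
    # Synthetic stand-in data (an assumption, NOT the Spotify set):
    # one feature with a noisy linear relationship to the target.
    xs = [rng.uniform(0, 100) for _ in range(n)]
    ys = [0.5 * x + 20 + rng.gauss(0, 5) for x in xs]
    split = int(0.8 * n)
    x_tr, y_tr, x_va, y_va = xs[:split], ys[:split], xs[split:], ys[split:]

    results = {}
    candidates = [("mean_baseline", fit_mean_baseline), ("simple_linear", fit_simple_linear)]
    for name, fit in candidates:
        model = fit(x_tr, y_tr)
        results[name] = {
            "train_rmse": rmse(y_tr, [model(x) for x in x_tr]),
            "valid_rmse": rmse(y_va, [model(x) for x in x_va]),
        }
    return results


if __name__ == "__main__":
    for name, scores in compare_models().items():
        print(f"{name}: train RMSE={scores['train_rmse']:.2f}, "
              f"validation RMSE={scores['valid_rmse']:.2f}")
```

Reporting the baseline alongside each candidate makes it easier to judge whether a more sophisticated technique is actually earning its complexity, and the gap between training and held-out error gives a first indication of over-fitting.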
When you have developed a solution to the problem you should evaluate it on the test data and
upload the results file to the competition page for scoring. Each page provides detailed
instructions on the file format required. You will then be able to evaluate your solution in relation
to previous submissions and also see how you are performing in relation to other teams in the
class. Your position in the rankings will make a (small) contribution to your final mark for the
assignment.
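A results file is typically a small CSV with one row per test example. The sketch below shows one way to write such a file with Python's standard `csv` module. The column names `Id` and `pop`, the example identifiers and predictions, and the helper name `write_submission` are all assumptions made for illustration; the required header and format are whatever the competition page specifies.

```python
import csv
import io


def write_submission(ids, predictions, out, id_col="Id", target_col="pop"):
    """Write one row per test example: an identifier and the predicted value.

    id_col and target_col are placeholders (assumptions); replace them with
    the header names given on the competition page.
    """
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([id_col, target_col])
    for row_id, pred in zip(ids, predictions):
        writer.writerow([row_id, pred])


if __name__ == "__main__":
    # Made-up identifiers and predictions, written to an in-memory buffer.
    buffer = io.StringIO()
    write_submission([454, 455, 456], [61, 48, 73], buffer)
    print(buffer.getvalue())
    # For a real upload, write to a file instead, e.g.:
    # with open("submission.csv", "w", newline="") as f:
    #     write_submission(test_ids, test_predictions, f)
```

Checking that the number of rows matches the number of test examples before uploading avoids a rejected or mis-scored submission.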
Teams
You should work as a team of four. You are permitted to select your own groups and when you
have done so you should add your team into myplace. How you choose to split the work
amongst your group is up to yourselves but remember that this is a learning opportunity and you
should aim to contribute evenly and understand all aspects of the work. Everyone in the group is
responsible for the final submission and should be prepared to answer questions about it.
Submission
For this assignment, your team will need to submit the following for each task:
- Your final predictions to Kaggle via the InClass Competitions
- Your exported Jupyter/Colab Notebook (submitted as an .ipynb file)
- Your exported Jupyter/Colab Notebook (submitted as a PDF file)
In addition, your team should also submit a PDF of the assignment cover sheet, with the
contribution percentages from each student.
Each Jupyter/Colab Notebook should include the following:
- Your team’s name
- Your team members’ names and student numbers
- A description of the final architecture and solution that you employed for the final set of
predictions.
- A justification for why you chose this architecture and solution including: how you came
up with the approach, why you selected or modified input variables, explaining what
worked and did not work, and what other models were tried.
- All code to reproduce the final predictions must be included, along with any code that
justifies your choices.
- The report should conclude by reporting your performance in the Kaggle InClass
competition.
The Python code used and the explanations of the steps should be interleaved within the
notebook, and provided in a logical manner, to show your working and justify your
interpretations and analysis of the outcomes. Explanations should be succinct and clear, with
the emphasis on justifying the choices made and on critical interpretation of the results.
Each task is worth 25 marks, so the assignment is out of a total of 50 marks and is
worth 25% of your overall mark.
Marking Scheme
This assignment is out of 50 and each problem will be assessed according to the following
marking scheme:
Solution (based on comments and code) (10 marks per task):
⚫ Explanation of your solution and setup (packages, algorithms etc. used, data analysis
and preparation)
⚫ Justification for the choices (rationale for the models used which may also be based on
your analysis of the data)
⚫ Explanation of the various models tried (what worked, what didn't, and why)
⚫ No more than 1500 words.
Code Quality (8 marks per task)
⚫ Readability, configurability (how easily it can be adapted to other models, problems etc.),
structure
⚫ Correspondence to solution
⚫ How cleanly it runs
Performance (7 marks per task)
⚫ Performance on the Training Data should be reported in the text (and the code should
output the values reported).
⚫ Performance on the Kaggle Test Data should be reported in the text.
⚫ Explanation of the difference.
⚫ Relative performance in comparison to other solutions
More details on the assessment criteria are given in the table below.
Assessment bands: Poor | Fair | Good | Excellent

Solution (marks: 0-3 | 4-5 | 6-8 | 9-10)
⚫ Model and Data Engineering
   Towards Poor: A simple naive baseline, with limited processing or engineering of features
   and configuration (unless justified).
   Towards Excellent: A sophisticated and appropriate configuration given the data, processes
   features appropriately, and ensures over-fitting is limited.
⚫ Justification
   Towards Poor: No or little justification for the choices made to produce the solution.
   Towards Excellent: A well-motivated justification based on theory/course-work/previous
   experience and/or other attempts.
⚫ Additional Models
   Towards Poor: No other models tried, or other features/parameters experimented with.
   Towards Excellent: A range of models, parameters and configurations explored, to provide a
   strong justification for the final choice.

Code (marks: 0-1 | 2-3 | 4-6 | 7-8)
⚫ Readability
   Towards Poor: Unclear or unclear in parts, poorly structured.
   Towards Excellent: Functions used and added appropriately, code is clear and readable, and
   well-structured.
⚫ Correspondence
   Towards Poor: Does not or only partly corresponds to the actual solution described.
   Towards Excellent: Mostly or completely corresponds to the actual solution described.
⚫ Works
   Towards Poor: Does not work, or only parts of it work.
   Towards Excellent: Works, runs efficiently, and obtains the reported outputs.

Performance (marks: 0-1 | 2-3 | 4-6 | 7)
⚫ Relative Performance
   Towards Poor: Bottom 25/50% of scores.
   Towards Excellent: Top 50/75% of scores.
⚫ Explanation
   Towards Poor: No or little justification for poor performance, or difference with training
   performance. No explanation as to what was impacting performance.
   Towards Excellent: Provides justification for differences with training performance, or what
   gave the submission the edge - i.e. what made the biggest impact to the performance.
Example Notebook Reports
Below are some illustrations of reports which combine the code and text together.
⚫ Using Python to see how the Times writes about men and women
⚫ An open science approach to a recent false-positive between solar activity and the
Indian monsoon
⚫ Kaggle Competition | Titanic Machine Learning from Disaster
⚫ An example machine learning notebook
⚫ An exploratory statistical analysis of the 2014 World Cup Final
CS98X - Assignment Cover Page
Team Name:
Contributions
Fill in your name, student number, and contribution (typed). Please also sign or mark that you have
agreed. If you can’t agree, fill in the percentage contribution out of 100 that you think you
deserve, with a short justification below.
Student Name | Student No | Percentage Contribution | Signature / Check
Notes on Contribution (if required):