Information School
 
 
INF6028 Coursework 2019-20 
 
Mining and Evaluating a Structured Dataset 
 
1. Introduction 
 
The assessment for INF6028 Data Mining consists of a piece of individual coursework that assesses your
ability to understand key data mining, analysis and evaluation concepts. You will be assigned a single
dataset and an associated complete KNIME workflow. Each workflow applies appropriate data mining
methods to the dataset in order to solve a supervised prediction problem (either regression or
classification) and to evaluate the relative performance of the different approaches/algorithms. You
will interpret and critically discuss the various techniques and best practices employed in the workflow
and will evaluate the performance of the algorithms.
 
Note: a video taking you through the workflow step-by-step will also be provided. 
 
You should write a 2,000-word structured report (see Section 3) that includes the following headings
(more details on how the report will be assessed are provided below):
• Introduction - introduce the prediction problem.
• Data mining theory - provide a theoretical description of the two supervised data mining
methods used in the workflow (for example, the classification or regression techniques that have
been used) and explain why they are appropriate to the prediction task.
• Data exploration and preparation - describe the approaches used in the workflow for feature
selection, transformation and normalisation, where appropriate.
• Experimental setup - describe the experimental setup and the evaluation measures used in the
workflow and how the data has been handled to ensure that the models were not over-fitted.
You should explain which nodes were used in KNIME and provide a rationale for the various
parameter settings that were used.
• Results - present the results for each data mining method and compare the performance of the
different methods using graphical and tabular methods. What insights can you gain from the
models? For example, which are the most important features? Are there any outliers in the
predictions?
• Conclusion and reflections - summarise the main findings of your report and reflect on the
methods used.
Charts, tables, references and appendices are not included in the word count. 
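
As an illustration of the over-fitting point under the Experimental setup heading, the following is a minimal sketch of k-fold cross-validation index splitting in plain Python. It is not taken from the assigned workflow: the function name k_fold_indices and its parameters are hypothetical, and KNIME provides its own nodes for this purpose (for example X-Partitioner and X-Aggregator), which the workflow may or may not use.

```python
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Hypothetical helper for illustration only; not part of the workflow.
    """
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and n_rows")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Striding by k gives k folds whose sizes differ by at most one row.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(n_rows=10, k=5))
for train_idx, test_idx in splits:
    # Within a split, no row is used for both fitting and scoring.
    assert not set(train_idx) & set(test_idx)
```

Each row appears in exactly one test fold, so every prediction that is scored comes from a model that did not see that row during training. This is what allows the averaged score to indicate over-fitting.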
 
Remember: your report should be a critical evaluation of the workflow in the context of the data mining
problem posed; it should not be merely a description of what was done.
 
This assessment is worth 100% of the overall module mark for INF6028. A pass mark of 50 is required to 
pass the module. Submission deadline: June 8 via Turnitin. See Section 4 for more general information 
about Coursework Submission Requirements within the Information School. 
 
 
 
2. The Datasets and KNIME Workflows 
 
You will be assigned a single dataset and KNIME workflow to base your report on. Please ensure before 
you start working on the assessment that you are using the correct dataset and workflow. 
 
Note: You should try to open the workflow in KNIME and work from there. However, should you be
unable to open the workflow or install KNIME on your machine, you will also be provided with a video
that takes you through the workflow step by step.
 
The datasets have been derived from Kaggle competitions and are downloadable from MOLE in the 
Coursework Brief Information section. A brief description of the attributes in each dataset is given at 
the end of this document. Note that in both cases the data are different to the standard Kaggle 
datasets. 
 
 
Titanic-derived dataset 
The data is split across two files, each of which contains 1204 entries representing 1204 passengers,
although it should be noted that the passengers are not necessarily the same in the two files. The two
files are titanic_ticket_data.csv and titanic_personal_data.csv.
The aim of this challenge is to build a model that is able to predict whether or not a passenger will survive
the sinking of the Titanic.
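
Because this is a binary classification task, its predictions are commonly scored with accuracy, precision, recall and F1. The sketch below shows how these follow from the confusion-matrix counts. It assumes the label 1 means "survived"; the function name and the toy labels are invented for illustration and are not results from the workflow.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, invented for illustration (assumption: 1 = survived).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

Accuracy alone can be misleading when the two classes are unbalanced, which is why precision and recall are usually reported alongside it.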
 
Australian Weather-derived Dataset 
The Australian weather dataset consists of weather data for 16 cities and towns in Australia over a
period of nearly 10 years.
The aim of this challenge is to predict the total daily rainfall based on other features of the weather. 
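
Predicting a rainfall amount in mm is a regression task, for which typical evaluation measures are the mean absolute error (MAE), the root mean squared error (RMSE) and the coefficient of determination (R²). The following is a minimal sketch of these measures in plain Python. The function name and the toy values are assumptions made for illustration, and the measures actually used in the workflow may differ.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE and R^2 for paired true/predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    # R^2 is undefined when the true values are all identical.
    r2 = 1 - ss_res / ss_tot if ss_tot else float("nan")
    return {"mae": mae, "rmse": rmse, "r2": r2}


# Toy rainfall values in mm, invented for illustration.
rain_true = [0.0, 2.0, 4.0, 6.0]
rain_pred = [1.0, 2.0, 3.0, 6.0]
scores = regression_metrics(rain_true, rain_pred)
```

RMSE penalises large errors more heavily than MAE, so comparing the two can hint at whether a few outlying predictions dominate the error.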
 
 
3. Report Structure 
 
You are required to produce a structured report that includes all the sections detailed in Table 1. You
must state the word count somewhere in the report. As there is a word count limit, you should aim to
make your writing as concise and informative as possible. The emphasis of the report should be on
clarity, accuracy and quality in communicating your findings.
 
Table 1: Required content of the structured report. 
 
Section: Structured abstract
Maximum allocated marks: Required, but 0 marks
Description: This should provide a summary of your report in a structured manner. This is not
included in the word count.

Section: Introduction
Maximum allocated marks: 10 marks
Description: This section should introduce the data mining task that is addressed in the report. You
should indicate the property/data value that is predicted and give a brief overview of the dataset and
methods used.

Section: Data Mining Theory
Maximum allocated marks: 25 marks
Description: This section should provide an overview of the algorithms for predictive data mining
used in the workflow from a theoretical aspect. Explain why they are relevant to the prediction
problem. Support your rationale by providing references to the literature where the techniques have
been applied to similar problems. Include a short discussion of the most appropriate methods for
evaluating the performance of these data mining methods.

Section: Data Exploration and Preparation
Maximum allocated marks: 10 marks
Description: This section should provide a brief description of the data and of the approaches used to
pre-process the data. You should present an investigation of the attributes (including the data value
to be predicted) and describe any data cleaning employed, including handling of missing data, data
transformations and data aggregations.

Section: Experimental Setup
Maximum allocated marks: 20 marks
Description: This section should describe the experimental design in the workflow. You should
describe the process followed in order to find the best performing model for each method and how
this was validated. For example, which KNIME nodes were used? How were they configured? Was any
cross-validation or a separate validation set used and why?

Section: Results and Discussion
Maximum allocated marks: 20 marks
Description: Present the results of the data mining process including the results of experiments to
find the best model for each data mining method. Compare the best performance of the different
methods and, if appropriate, consider which attribute contributes most to each model. Discuss the
advantages and disadvantages of the data mining methods. Which of the chosen methods produced
the best model and why?

Section: Conclusion and reflections
Maximum allocated marks: 15 marks
Description: Summarise the main findings of the analysis and reflect on the choice of methods for the
problem, for example, how might the models be improved with hindsight? Use evidence from the
literature to support your arguments.
 
 
 
 
 
 
4. Information School Coursework Submission Requirements 
 
It is the student's responsibility to ensure no aspect of their work is plagiarised or the result of other 
unfair means. The University’s and Information School’s Advice on unfair means can be found in your 
Student Handbook, available via http://www.sheffield.ac.uk/is/current 
 
Your assignment has a word count limit. A deduction of 3 marks will be applied for coursework that is 
5% or more above or below the word count as specified above or that does not state the word count. 
 
It is your responsibility to ensure your coursework is correctly submitted before the deadline. It is 
highly recommended that you submit well before the deadline. Coursework submitted after 10am on
the stated submission date will incur a deduction of 5% of the mark awarded for each working day
after the submission date/time, up to a maximum of 5 working days, where ‘working day’ includes
Monday to Friday (excluding public holidays) and runs from 10am to 10am. Coursework submitted
after the maximum period will receive zero marks.
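
The late-submission rule above can be read as simple arithmetic. The sketch below is one interpretation, not an official calculator: it assumes that "5% of the mark awarded" means a proportion of the awarded mark (not 5 percentage points) and that the number of working days late has already been counted using the 10am-to-10am definition. The function name is hypothetical.

```python
def apply_late_penalty(mark, working_days_late):
    """Return the mark after the late deduction (illustrative reading only)."""
    if working_days_late < 0:
        raise ValueError("working_days_late cannot be negative")
    if working_days_late == 0:
        return mark          # submitted on time: no deduction
    if working_days_late > 5:
        return 0.0           # beyond the maximum period: zero marks
    # Assumption: 5% of the awarded mark is deducted per working day late.
    return mark * (1 - 0.05 * working_days_late)


on_time = apply_late_penalty(60, 0)
two_days = apply_late_penalty(60, 2)   # 10% of 60 deducted under this reading
too_late = apply_late_penalty(60, 6)
```

Under this reading, a report awarded 60 and submitted two working days late would receive approximately 54.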
 
Work submitted electronically, including through Turnitin, should be reviewed to ensure it appears as 
you intended. 
 
Before the submission deadline, you can submit coursework to Turnitin numerous times. Each 
submission will overwrite the previous submission. Only your most recent submission will be assessed. 
However, after the submission deadline, the coursework can only be submitted once. 
 
Details about the submission of work via Turnitin can be found at http://youtu.be/C_wO9vHHheo 
 
If you encounter any problems during the electronic submission of your coursework, you should
immediately contact the module coordinator and a member of the Information School Teaching Support
Team (Julie Priestley 0114 2222839). This does not negate your
responsibility to submit your coursework on time and correctly.
 
 
Titanic Dataset 
The Titanic data consist of two files that need to be merged.
 
The ticket data file, titanic_ticket_data.csv, consists of the following variables:
PassengerId: the identifier
Survived: the value to predict
Ticket: the ticket number
Fare: the passenger fare
Cabin: the cabin number
Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
 
The personal data file, titanic_personal_data.csv, consists of the following variables:
PassengerId: the identifier
Name: the name of the passenger
Sex: male or female
Age: fractional if less than 1. If the age is estimated, it is in the form xx.5
SibSp: number of siblings/spouses where family relations are defined as follows: 
Sibling = brother, sister, stepbrother, stepsister 
Spouse = husband, wife 
Parch: number of parents/children, where family relations are defined as follows:
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore Parch = 0 for them
Salary: in dollars 
Job: job title 
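
Since the two files share the PassengerId column but do not necessarily contain the same passengers, merging them amounts to a join on that key. The following is a minimal standard-library sketch of an inner join, which keeps only passengers present in both files. Whether an inner join is the right choice is an assumption here, the function name is hypothetical, and the toy rows are invented, not taken from the real data.

```python
import csv
import io


def inner_join_on_id(ticket_file, personal_file, key="PassengerId"):
    """Inner-join two open CSV files on `key`; returns a list of dict rows."""
    personal_by_id = {row[key]: row for row in csv.DictReader(personal_file)}
    merged = []
    for ticket_row in csv.DictReader(ticket_file):
        personal_row = personal_by_id.get(ticket_row[key])
        if personal_row is not None:  # drop ids missing from the personal file
            merged.append({**ticket_row, **personal_row})
    return merged


# Toy stand-ins for the two files (values invented for illustration).
ticket_csv = io.StringIO(
    "PassengerId,Survived,Fare\n"
    "1,0,7.25\n"
    "2,1,71.28\n"
    "3,1,7.92\n"
)
personal_csv = io.StringIO(
    "PassengerId,Sex,Age\n"
    "2,female,38\n"
    "3,female,26\n"
    "4,male,35\n"
)
merged_rows = inner_join_on_id(ticket_csv, personal_csv)
```

With the real data, the two arguments would be file objects such as open("titanic_ticket_data.csv", newline=""). Comparing the number of merged rows with the 1204 rows in each file shows how many passengers appear in only one of them.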
 
Australian Weather Dataset 
The Australian weather dataset consists of a single CSV file, which contains weather data for 16 cities
and towns in Australia over a period of nearly 10 years. The file consists of the following variables:
 
Date: date of observation 
Location: name of town/city where observation was made 
MinTemp: minimum temperature recorded (Celsius) 
MaxTemp: maximum temperature recorded (Celsius) 
Rainfall: total daily rainfall (mm) 
Sunshine: total daily sunshine (hours) 
WindDir9am: wind direction at 9am 
WindDir3pm: wind direction at 3pm 
WindSpeed9am: wind speed at 9am (kph) 
WindSpeed3pm: wind speed at 3pm (kph) 
Humidity9am: humidity at 9am (%) 
Humidity3pm: humidity at 3pm (%) 
Pressure9am: atmospheric pressure at 9am (hPa)
Pressure3pm: atmospheric pressure at 3pm (hPa)
Temp9am: temperature at 9am (Celsius) 
Temp3pm: temperature at 3pm (Celsius) 
RainToday: did it rain? (Boolean) 
RISK_MM: total daily rainfall the following day (mm) 
RainTomorrow: did it rain the following day? (Boolean) 
 
Note: this dataset is different from the “Rain in Australia” dataset on Kaggle. 