
Information School

INF6028 Coursework 2019-20

Mining and Evaluating a Structured Dataset

1. Introduction

The assessment for INF6028 Data Mining consists of a piece of individual coursework to assess your

ability to understand key data mining, analysis and evaluation concepts. You will be assigned a single

dataset and an associated complete KNIME workflow. Each workflow applies appropriate data mining

methods to the dataset in order to solve a supervised prediction problem (either regression or

classification) and to evaluate the relative performance of these different approaches/algorithms. You

will interpret and critically discuss the various techniques and best practices employed in the workflow

and will evaluate the performance of the algorithms.

Note: a video taking you through the workflow step-by-step will also be provided.

You should write a 2,000-word structured report (see Section 3) that includes the following headings

(more details on how the report will be assessed are provided below):

• Introduction - introduce the prediction problem.

• Data mining theory - provide a theoretical description of the two supervised data mining

methods used in the workflow (for example, the classification or regression techniques that have

been used) and why they are appropriate to the prediction task.

• Data exploration and preparation – describe the approaches used in the workflow for feature

selection, transformation and normalisation, where appropriate.

• Experimental setup - describe the experimental setup and the evaluation measures used in the

workflow and how the data has been handled to ensure that the models were not over-fitted.

You should explain which nodes were used in KNIME and provide a rationale for the various

parameter settings that were used.

• Results – present the results for each data mining method and compare the performance of the

different methods using graphical and tabular methods. What insights can you gain from the

models? For example, which are the most important features, are there any outliers in the

predictions?

• Conclusion and reflections – summarise the main findings of your report and reflect on the

methods used.
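As an illustration of the kind of data handling the Experimental setup heading refers to, the sketch below shows k-fold cross-validation, one common way of checking that a model has not been over-fitted. It is written in plain Python purely for explanation; the function name `k_fold_indices` and its parameters are hypothetical, and it is not a description of the nodes or settings used in your assigned KNIME workflow.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Hypothetical helper for illustration only.
    """
    indices = list(range(n_rows))
    # Shuffle once with a fixed seed so the folds are random but reproducible.
    random.Random(seed).shuffle(indices)
    # Slice the shuffled indices into k disjoint folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        # Every row outside the held-out fold is used for training.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(n_rows=10, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Each row is held out exactly once, so every prediction used for evaluation comes from a model that never saw that row during training.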

Charts, tables, references and appendices are not included in the word count.

Remember: your report should be a critical evaluation of the workflow in the context of the data mining

problem posed; it should not be merely a description of what was done.

This assessment is worth 100% of the overall module mark for INF6028. A pass mark of 50 is required to

pass the module. Submission deadline: June 8 via Turnitin. See Section 4 for more general information

about Coursework Submission Requirements within the Information School.

2. The Datasets and KNIME Workflows

You will be assigned a single dataset and KNIME workflow to base your report on. Please ensure before

you start working on the assessment that you are using the correct dataset and workflow.

Note: You should try to open the workflow in KNIME and work from there. However, should you be

unable to open the workflow or install KNIME on your machine, you will also be provided with a video,

which will take you through the workflow step-by-step.

The datasets have been derived from Kaggle competitions and are downloadable from MOLE in the

Coursework Brief Information section. A brief description of the attributes in each dataset is given at

the end of this document. Note that in both cases the data are different to the standard Kaggle

datasets.

Titanic-derived dataset

The data are split across two files, each of which contains 1204 entries representing 1204 passengers,

although it should be noted that the passengers are not necessarily the same in the two files. The two

files are titanic_ticket_data.csv and titanic_personal_data.csv.

The aim of this challenge is to build a model that is able to predict whether or not a passenger will survive

the sinking of the Titanic.
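Because this is a binary classification task, its evaluation measures are usually derived from a confusion matrix. The sketch below computes accuracy, precision and recall by hand. The label lists are invented toy values, not results from the Titanic workflow, and the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(actual, predicted, positive=1):
    """Return accuracy, precision and recall for a binary prediction task."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when a class is never predicted or present.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy labels (1 = survived, 0 = did not survive), invented for illustration.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(actual, predicted)
print(metrics)
```

Accuracy alone can be misleading when the two classes are unbalanced, which is one reason to report precision and recall alongside it.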

Australian Weather-derived Dataset

The Australian weather dataset consists of weather data for 16 cities and towns in Australia over a

period of nearly 10 years.

The aim of this challenge is to predict the total daily rainfall based on other features of the weather.
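Because rainfall is a continuous value, this is a regression task, and its evaluation measures differ from those used for classification. The sketch below computes three widely used ones: mean absolute error (MAE), root mean squared error (RMSE) and the coefficient of determination (R²). The numbers are invented toy values in mm, and the function name `regression_metrics` is hypothetical; whether your workflow reports these particular measures is something to confirm from the workflow itself.

```python
import math


def regression_metrics(actual, predicted):
    """Return (MAE, RMSE, R^2) for a set of numeric predictions."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2


# Toy daily rainfall values in mm, invented for illustration.
actual = [0.0, 2.0, 4.0, 6.0]
predicted = [1.0, 2.0, 3.0, 6.0]
mae, rmse, r2 = regression_metrics(actual, predicted)
print(mae, rmse, r2)
```

RMSE penalises large errors more heavily than MAE, so comparing the two can hint at whether a few badly predicted days dominate the error.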

3. Report Structure

You are required to produce a structured report that includes all the sections detailed in Table 1. You

must state the word count somewhere in the report. As there is a word count limit, you should aim to

make your writing as concise and informative as possible. The emphasis of the report should be on

clarity, accuracy and quality in communicating your findings.

Table 1: Required content of the structured report.

| Section | Description | Maximum allocated marks |
|---|---|---|
| Structured abstract | This should provide a summary of your report in a structured manner. This is not included in the word count. | Required, but 0 marks |
| Introduction | This section should introduce the data mining task that is addressed in the report. You should indicate the property/data value that is predicted and give a brief overview of the dataset and methods used. | 10 marks |
| Data Mining Theory | This section should provide an overview of the algorithms for predictive data mining used in the workflow from a theoretical aspect. Explain why they are relevant to the prediction problem. Support your rationale by providing references to the literature where the techniques have been applied to similar problems. Include a short discussion of the most appropriate methods for evaluating the performance of these data mining methods. | 25 marks |
| Data Exploration and Preparation | This section should provide a brief description of the data and of the approaches used to pre-process the data. You should present an investigation of the attributes (including the data value to be predicted) and describe any data cleaning employed, including handling of missing data, data transformations and data aggregations. | 10 marks |
| Experimental Setup | This section should describe the experimental design in the workflow. You should describe the process followed in order to find the best performing model for each method and how this was validated. For example, which KNIME nodes were used? How were they configured? Was any cross-validation or a separate validation set used and why? | 20 marks |
| Results and Discussion | Present the results of the data mining process including the results of experiments to find the best model for each data mining method. Compare the best performance of the different methods and, if appropriate, consider which attribute contributes most to each model. Discuss the advantages and disadvantages of the data mining methods. Which of the chosen methods produced the best model and why? | 20 marks |
| Conclusion and reflections | Summarise the main findings of the analysis and reflect on the choice of methods for the problem, for example, how might the models be improved with hindsight? Use evidence from the literature to support your arguments. | 15 marks |

4. Information School Coursework Submission Requirements

It is the student's responsibility to ensure no aspect of their work is plagiarised or the result of other

unfair means. The University’s and Information School’s advice on unfair means can be found in your

Student Handbook, available via http://www.sheffield.ac.uk/is/current

Your assignment has a word count limit. A deduction of 3 marks will be applied for coursework that is

5% or more above or below the word count as specified above or that does not state the word count.
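The word count rule can be read as the following check. This is a sketch of one interpretation, not an official calculator: it assumes the limit is the 2,000 words stated in Section 1, and the function name `word_count_penalty` is hypothetical.

```python
def word_count_penalty(stated_words, limit=2000, tolerance_percent=5, penalty=3):
    """Return the marks deducted under the word count rule (one interpretation)."""
    if stated_words is None:
        # The report does not state its word count.
        return penalty
    # Integer arithmetic avoids floating-point issues exactly at the boundary.
    if abs(stated_words - limit) * 100 >= tolerance_percent * limit:
        return penalty
    return 0


for words in (1900, 1901, 2000, 2099, 2100, None):
    print(words, word_count_penalty(words))
```

Under this reading, a report of exactly 1,900 or 2,100 words is penalised, because the rule says "5% or more".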

It is your responsibility to ensure your coursework is correctly submitted before the deadline. It is

highly recommended that you submit well before the deadline. Coursework submitted after 10am on

the stated submission date will result in a deduction of 5% of the mark awarded for each working day

after the submission date/time up to a maximum of 5 working days, where ‘working day’ includes

Monday to Friday (excluding public holidays) and runs from 10am to 10am. Coursework submitted

after the maximum period will receive zero marks.
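The late penalty can be worked through as simple arithmetic. The sketch below assumes that the deduction is 5% of the mark awarded (not 5 percentage points) per working day, and that any part of a working day counts as a whole day. Both are interpretations of the wording above rather than confirmed policy, and the function name `late_penalised_mark` is hypothetical.

```python
def late_penalised_mark(mark, working_days_late, rate_percent=5, max_days=5):
    """Return the mark after the late-submission deduction (one interpretation)."""
    if working_days_late <= 0:
        return float(mark)  # submitted on time, no deduction
    if working_days_late > max_days:
        return 0.0  # beyond the maximum late period
    # Deduct rate_percent of the awarded mark for each working day late.
    return mark * (100 - rate_percent * working_days_late) / 100


for days in (0, 2, 5, 6):
    print(days, late_penalised_mark(60, days))
```

For example, under these assumptions a piece of work awarded 60 and submitted two working days late would receive 54.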

Work submitted electronically, including through Turnitin, should be reviewed to ensure it appears as

you intended.

Before the submission deadline, you can submit coursework to Turnitin numerous times. Each

submission will overwrite the previous submission. Only your most recent submission will be assessed.

However, after the submission deadline, the coursework can only be submitted once.

Details about the submission of work via Turnitin can be found at http://youtu.be/C_wO9vHHheo

If you encounter any problems during the electronic submission of your coursework, you should

immediately contact the module coordinator and a member of the Information School Teaching Support

Team (Julie Priestley 0114 2222839). This does not negate your

responsibility to submit your coursework on time and correctly.

Titanic Dataset

The Titanic data consist of two files that need to be merged.

The titanic_ticket_data.csv file consists of the following variables:

PassengerId: the identifier

Survived: the value to predict

Ticket: the Ticket Number

Fare: the passenger fare

Cabin: Cabin number

Embarked: Port of embarkation. C = Cherbourg, Q = Queenstown, S = Southampton

The personal data file titanic_personal_data.csv consists of the following variables:

PassengerId: the identifier

Name: the name of the passenger

Sex: male or female

Age: fractional if less than 1. If the age is estimated, it is in the form xx.5

SibSp: number of siblings/spouses where family relations are defined as follows:

Sibling = brother, sister, stepbrother, stepsister

Spouse = husband, wife

Parch: number of parents/children where family relations are defined as follows:

Parent = mother, father

Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore Parch = 0 for them

Salary: in dollars

Job: job title
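Since both files share the PassengerId variable, the merge can be pictured as a join on that key. The sketch below uses only the Python standard library and a few invented rows (they are not taken from the real files) to show an inner join, which keeps only passengers present in both files, and to list the identifiers that have no match. In KNIME a join of this kind is typically done with a Joiner node; check the workflow to see which join type it actually uses.

```python
import csv
import io

# Toy stand-ins for the two files; all rows are invented for illustration.
ticket_csv = """PassengerId,Survived,Fare,Embarked
1,0,7.25,S
2,1,71.28,C
4,1,53.10,S
"""
personal_csv = """PassengerId,Name,Sex,Age
1,Passenger A,male,22
2,Passenger B,female,38
3,Passenger C,female,26
"""


def read_by_id(text):
    """Parse CSV text into a dict keyed by PassengerId."""
    return {row["PassengerId"]: row for row in csv.DictReader(io.StringIO(text))}


tickets = read_by_id(ticket_csv)
personal = read_by_id(personal_csv)

# Inner join: keep only identifiers that appear in both files.
shared_ids = sorted(tickets.keys() & personal.keys(), key=int)
merged = [{**personal[pid], **tickets[pid]} for pid in shared_ids]

# Identifiers that appear in exactly one file and are dropped by the join.
unmatched = sorted(tickets.keys() ^ personal.keys(), key=int)

print(len(merged), unmatched)
```

Reporting how many passengers are lost in the join is worthwhile, because the passengers are not necessarily the same in the two files.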

Australian Weather Dataset

The Australian weather dataset consists of a single CSV file, which contains weather data for 16 cities

and towns in Australia over a period of nearly 10 years. The file consists of the following variables:

Date: date of observation

Location: name of town/city where observation was made

MinTemp: minimum temperature recorded (Celsius)

MaxTemp: maximum temperature recorded (Celsius)

Rainfall: total daily rainfall (mm)

Sunshine: total daily sunshine (hours)

WindDir9am: wind direction at 9am

WindDir3pm: wind direction at 3pm

WindSpeed9am: wind speed at 9am (km/h)

WindSpeed3pm: wind speed at 3pm (km/h)

Humidity9am: humidity at 9am (%)

Humidity3pm: humidity at 3pm (%)

Pressure9am: atmospheric pressure at 9am (hPa)

Pressure3pm: atmospheric pressure at 3pm (hPa)

Temp9am: temperature at 9am (Celsius)

Temp3pm: temperature at 3pm (Celsius)

RainToday: did it rain? (Boolean)

RISK_MM: total daily rainfall the following day (mm)

RainTomorrow: did it rain the following day? (Boolean)

Note: this dataset is different from the “Rain in Australia” dataset on Kaggle.
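The numeric variables above are measured in different units and on very different scales (for example, pressure in hPa and humidity in %), which is one reason normalisation may matter for some algorithms. The sketch below shows two common approaches, min-max scaling and z-score standardisation, on invented toy values. The function names are hypothetical and this is not a description of how the assigned workflow normalises its data.

```python
import math


def min_max_normalise(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def z_score_normalise(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Toy readings, invented for illustration.
pressure_9am = [1000.0, 1010.0, 1020.0]  # hPa
humidity_9am = [20.0, 60.0, 100.0]       # %

print(min_max_normalise(pressure_9am))
print(z_score_normalise(humidity_9am))
```

To avoid leaking information from the test data, the scaling parameters (minimum, maximum, mean, standard deviation) are normally estimated on the training partition only and then applied to the test partition.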
