Final project
This project is intended to be more open-ended because it works with real-world data.
Each project should include the following steps:
1. Choose a project (CRISP-DM steps 1 and 2)
o identify the problem you want to solve and what kind of analysis (e.g.
classification, regression, association, or clustering) is involved.
o identify what data (and source) is required to perform such an analysis
o prepare a plan for which stages will involve the most work (acquisition,
preparation, building, evaluation, deployment)
o be sure to document your data sources and any other details involved in
finding the data relevant to the project
o try to gather some interesting and powerful information from publicly
available sources or primary research to support your data mining
project. It is best if you can identify areas of interest to pursue and
retrieve the data yourself, but data from the repositories below is also
acceptable, although some of them are outdated.
o a repository of data that has been used to test the performance of many
data mining algorithms is available at http://archive.ics.uci.edu/ml/ .
Some of the data sets are meant to test the limits of current
machine-learning algorithms and to compare their performance with
new approaches to learning. However, some of the smaller data sets
can be useful for exploring the functionality of any data mining
software, such as R or Python.
o Large and feature-rich data sets are made available by the U.S.
government or its subsidiaries on the internet. For instance, see the
Centers for Disease Control and Prevention data sets
(www.cdc.gov/DataStatistics), Cancer.org’s Surveillance,
Epidemiology, and End Results data sets (http://seer.cancer.gov/data),
and the Department of Transportation’s Fatality Analysis Reporting
System crash data sets (www.nhtsa.gov/FARS). These data sets are not
preprocessed for data mining, which makes them a great resource to
experience the complete data mining process. Another rich source for a
collection of analytics data sets is listed on KDNuggets.com
( http://www.kdnuggets.com/datasets/index.html ).
2. Prepare the data (CRISP-DM steps 2 and 3)
o acquire the data (this could involve some web mining)
o clean and integrate the data (consider using R or Python functions
for this)
o split the data for building and testing (both R and Python have
functions that can do this)
o be sure to document all the work involved in getting from original data
sources to data ready for building a model
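The cleaning and splitting steps above can be sketched as follows. This is a minimal illustration using only the Python standard library. The records, field names, and the helper names clean_rows and split_rows are hypothetical and chosen for this example only. In practice, libraries such as pandas and scikit-learn provide equivalent functionality.

```python
import random

def clean_rows(rows, required):
    """Drop exact duplicate rows and rows with a missing required field."""
    seen = set()
    cleaned = []
    for row in rows:
        # Treat None and blank strings as missing values.
        if any(row.get(k) is None or str(row.get(k)).strip() == "" for k in required):
            continue
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        cleaned.append(row)
    return cleaned

def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and return (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical toy records, standing in for data acquired from a real source.
raw = [
    {"age": "34", "income": "52000", "label": "yes"},
    {"age": "34", "income": "52000", "label": "yes"},  # duplicate
    {"age": "", "income": "41000", "label": "no"},     # missing age
    {"age": "51", "income": "78000", "label": "no"},
    {"age": "29", "income": "39000", "label": "yes"},
    {"age": "45", "income": "61000", "label": "no"},
]

clean = clean_rows(raw, required=["age", "income", "label"])
train, test = split_rows(clean, test_fraction=0.25, seed=42)
print(len(raw), len(clean), len(train), len(test))
```

Fixing the seed makes the split reproducible, which helps when documenting how the original data became the data used for building a model.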
3. Build the model (CRISP-DM step 4)
o either build one model and invest heavily in improving it, or build
multiple models, spend less time improving each, and compare them.
You are required to use R or Python programming.
o if multiple models are involved, consider adding one more to the mix,
which is a combination of the others
o be sure to document all the steps involved in improving a model and/or
building different models
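One way to build several models and then add a combination of them is majority voting. The sketch below is an illustration under stated assumptions, not a required approach: the three toy classifiers, the helper names, and the data points are all hypothetical, and only the Python standard library is used. scikit-learn's VotingClassifier offers a comparable combination for real models.

```python
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def fit_majority(X, y):
    """Baseline: always predict the most common training label."""
    top = Counter(y).most_common(1)[0][0]
    return lambda x: top

def fit_nearest_centroid(X, y):
    """Predict the label whose class mean is closest to the point."""
    centroids = {}
    for label in set(y):
        pts = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(pts) for col in zip(*pts)]
    return lambda x: min(centroids, key=lambda c: sq_dist(x, centroids[c]))

def fit_1nn(X, y):
    """Predict the label of the single closest training point."""
    def predict(x):
        dists = [sq_dist(x, p) for p in X]
        return y[dists.index(min(dists))]
    return predict

def combine(models):
    """Combined model: majority vote over the individual predictions."""
    def predict(x):
        votes = [m(x) for m in models]
        return Counter(votes).most_common(1)[0][0]
    return predict

def accuracy(model, X, y):
    """Fraction of points the model labels correctly."""
    return sum(model(x) == t for x, t in zip(X, y)) / len(y)

# Hypothetical toy data: two well-separated groups of 2-D points.
X_train = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
           [6.0, 6.0], [7.0, 6.5], [6.5, 7.0], [7.0, 7.0]]
y_train = [0, 0, 0, 1, 1, 1, 1]
X_test = [[1.2, 1.4], [6.8, 6.6], [2.2, 1.8], [5.9, 6.4]]
y_test = [0, 1, 0, 1]

models = {
    "majority": fit_majority(X_train, y_train),
    "centroid": fit_nearest_centroid(X_train, y_train),
    "1nn": fit_1nn(X_train, y_train),
}
models["vote"] = combine([models["majority"], models["centroid"], models["1nn"]])

for name, model in models.items():
    print(name, accuracy(model, X_test, y_test))
```

An odd number of voters avoids ties when there are only two class labels.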
4. Test the model (CRISP-DM step 5)
o design and implement a way to test the result of your model(s)
o if multiple models are involved, consider using the ROC comparison
functions available in R or Python
o one scenario to consider testing is "noisy" or "messy" data that has
missing/wrong/duplicate values
o be sure to document the plan for testing, the steps in performing the test,
and the results of the test
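The ROC comparison and the noisy-data scenario above can be sketched as follows. This is an illustration only: the labels and the scores of the two models are hypothetical, the helper names roc_auc and flip_labels are made up for this example, and label flipping is just one simple way to simulate wrong values. The area under the ROC curve is computed here from its rank interpretation (the probability that a random positive example scores higher than a random negative one). scikit-learn's roc_auc_score computes the same quantity for real projects.

```python
import random

def roc_auc(y_true, scores):
    """AUC as the chance a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def flip_labels(y, fraction, seed=0):
    """Return a copy of binary labels with a fraction of them flipped."""
    rng = random.Random(seed)
    noisy = list(y)
    for i in rng.sample(range(len(y)), int(len(y) * fraction)):
        noisy[i] = 1 - noisy[i]
    return noisy

# Hypothetical test labels and the scores two models assigned to them.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores_a = [0.1, 0.3, 0.35, 0.6, 0.4, 0.7, 0.8, 0.9]
scores_b = [0.2, 0.5, 0.6, 0.7, 0.3, 0.55, 0.65, 0.9]

auc_a = roc_auc(y_true, scores_a)
auc_b = roc_auc(y_true, scores_b)
print("model A AUC:", auc_a)
print("model B AUC:", auc_b)

# Simulate "wrong" values by flipping 25% of the labels, then re-evaluate.
y_noisy = flip_labels(y_true, fraction=0.25, seed=1)
print("model A AUC on noisy labels:", roc_auc(y_noisy, scores_a))
```

Comparing each model's score on clean and corrupted data gives a rough sense of how sensitive the evaluation is to data quality. Missing and duplicate values can be simulated in a similar way.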
5. Deploy the model (CRISP-DM step 6)
o create a summary of the entire project to present to the class
Grading. Depending on your project choice, some of these areas may be more
involved than others. For example, some may choose to tackle fairly raw data
sources that involve a fair amount of work to acquire and prepare, while other projects
may have the data more or less prepared in advance. The expectation is that the latter
project would do quite a bit more work on the tail end, while the former project may do
a minimal amount of work after preparation. After completing the process, some
students with better work will be chosen to present their journey to the class, detailing
the tasks, tools, and lessons learned at each stage in the data mining project.
Every student needs to write a report in English, prepare a corresponding presentation (ppt) file, and upload
both together with the R or Python source code via FTP. The due date is 1 Jan 2018. Late
submissions will receive 0 points. Each selected student needs to give a 6-minute oral
presentation in English, followed by 3 minutes of Q&A, in class on 3 Jan 2018.
Honor Code: Students are allowed to help one another with
getting systems to work, but students are not allowed to directly copy one another's work.
(Note: Anyone found to have copied others' work will be given 0 points for his/her final
course score. The contributor will also be given 0 points.)