
Challenge
Classification Trees
Background
Computer scientists often refer to the principle of divide and conquer: by recursively splitting
a problem into smaller problems which are easier to solve, we can solve a large difficult
problem by putting together solutions to many small easy problems. Common applications
are tasks like sorting, where mergesort recursively splits arrays in half and sorts each half,
or data storage and retrieval, like k-d trees and R-trees for efficiently storing and querying
high-dimensional data.
Divide and conquer can work for statistical tasks as well. Its appeal is its simplicity:
a conceptually simple algorithm can, by putting together many simple pieces, model quite
complex patterns in data.
One such algorithm is called Classification and Regression Trees (CART). Suppose we
have a vector Y of n observed outcomes—either categorical outcomes, for classification, or
continuous outcomes for regression—and an n×p matrix X of observed covariates for each
observation. We would like to make a prediction hatwidey for a given set of newly observed predictors
x. CART does this first by dividing: splitting up the space of possible values of X into many
regions. It then conquers by producing a single estimate hatwidey for each of these regions, usually
just a constant value in each region. The trick, of course, is all in how we divide the space.
For simplicity, we’ll focus on binary classification here, and save regression for later. In
binary classification, Y is always either 0 or 1 (e.g. “survived” vs. “didn’t survive”, “spam”
vs. “not spam”, and so on). Given a new observed x, our classifier should predict 0 or 1.
CART can, as its name suggests, also be used for regression problems where Y can take on a
continuous range of values, and it can also handle classification when there are more than
two possible classes, but we will ignore these possibilities for now.
For CART to divide the space of X values into regions, we need to define what makes a
“good” region and what makes a “bad” region. We define something called the impurity of
each region, and pick splits that try to reduce the impurity the most. In classification, a pure
region would be one where Y is always the same value—always 0 or 1—and the most impure
region would be one where half the observations are 0 and half are 1, since we would make
the worst predictions in this case. Let p(Y = 1 | A) be the probability that the observed Y in
region A is 1. The impurity I(A) will be some function φ of this probability:
I(A) = φ(p(Y = 1 | A)).
There are several common choices of φ:
φ(p) = min(p, 1 − p) (Bayes error)
φ(p) = −p log(p) − (1 − p) log(1 − p) (cross-entropy)
φ(p) = p(1 − p) (Gini index)
Each of these functions has its maximum at p = 1/2 and its minimum at p = 0 and p = 1, so
pure regions have all their data points in the same class and impure regions have half and
half.
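These three functions are simple to compute directly. A minimal sketch in Python (the function names are our own, not part of the assignment):

```python
import math

def bayes_error(p):
    # Bayes error: min(p, 1 - p)
    return min(p, 1 - p)

def cross_entropy(p):
    # Cross-entropy: -p log p - (1 - p) log(1 - p), taken to be 0 at p = 0 or 1
    if p in (0, 1):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def gini(p):
    # Gini index: p(1 - p)
    return p * (1 - p)
```

All three peak at p = 1/2 and vanish at p = 0 and p = 1, as described above.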
CART picks regions by recursively splitting regions, forming a tree. We start with a tree
with only one node, the root node: we predict ŷ to be the most common value of Y, regardless
of the x we are given. This is obviously not a very good model, so we want to split the region
up. We do this as follows. For a region A, consider a possible covariate Xj, for 1≤j≤p.
Then,
1. Consider splitting A into two regions, AL and AR, where AL contains all observations
in A with Xj below some split point s, and AR contains all observations with Xj at or
above s.

In cost-complexity pruning, a subtree T is scored by Rα(T) = R(T) + α|T|, where R(T) is
the misclassification cost of T, |T| is its number of leaves, and α ≥ 0 penalizes tree size.
For each internal node t, let g(t) = (R(t) − R(Tt))/(|Tt| − 1), where Tt is the subtree
rooted at t, and let α∗ be the smallest value of g(t) in the tree. If α∗ > α, it is not possible
to lower Rα(T) by pruning T, so T is optimally pruned. Otherwise, we prune all nodes for
which g(t) = α∗. We then repeat the process, calculating g(t) on the nodes of the freshly
pruned tree, until we cannot find a node to prune for which α∗ ≤ α.
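As a concrete sketch of the weakest-link computation, assume each node is a plain dict with left, right, and its misclassification cost R (this representation is our own, chosen only for illustration):

```python
def leaves(node):
    # Collect the leaves of the subtree rooted at `node`.
    if node["left"] is None:        # leaves have no children
        return [node]
    return leaves(node["left"]) + leaves(node["right"])

def all_nodes(node):
    # Collect every node in the subtree rooted at `node`.
    if node["left"] is None:
        return [node]
    return [node] + all_nodes(node["left"]) + all_nodes(node["right"])

def g(node):
    # Increase in error per leaf removed if this subtree were
    # collapsed to a single leaf.
    subtree = leaves(node)
    return (node["R"] - sum(l["R"] for l in subtree)) / (len(subtree) - 1)

def weakest_link(root):
    # The smallest g(t) over internal nodes (alpha*), and that node.
    internal = [n for n in all_nodes(root) if n["left"] is not None]
    best = min(internal, key=g)
    return g(best), best
```

Pruning then replaces the returned node's subtree with a single leaf and repeats.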
For a derivation demonstrating why this procedure works, consult David Austin’s essay
“How to Grow and Prune a Classification Tree”: http://www.ams.org/publicoutreach/
feature-column/fc-2014-12
Choosing the Best Prune
Question: how do we pick the best penalty α? What is the “best”, anyway?
One approach is to pick the α that results in the smallest generalization error—the best
accuracy when predicting Y for new observations that we didn’t use when building the tree.
To estimate the generalization error, then, we need to set aside some data and not use it to
build the tree. One way to do this is cross-validation.
In cross validation, we build the tree using only a subset of the data, and use the left-out
data to evaluate the classification accuracy of our tree. In k-fold cross-validation, the data
is split into k folds, and k−1 are used to fit the tree and the remaining used to test it.
Typically, one rotates through the folds in turn, using each fold as the test set for a tree built
with the other k−1, and calculates the average classification error over all k test sets.
We can use the cross-validation concept to choose α. Choose a reasonable value of k, such
as 5 or 10. Fix a value of α. Build and prune trees using this value of α and calculate the
average error on the test set across the k test folds. Repeat this procedure for a range of
possible values of α, and choose the value of α that minimizes the test set error.
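The selection loop itself is straightforward to sketch. Here error_for is a hypothetical callback standing in for your own build-prune-evaluate code; only the fold bookkeeping is shown:

```python
def k_fold_indices(n, k):
    # Partition the indices 0..n-1 into k roughly equal folds.
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds

def choose_alpha(alphas, n, k, error_for):
    # error_for(alpha, train_idx, test_idx) should build and prune a tree
    # on the training rows with penalty alpha, then return its error on
    # the test rows.  That part is supplied by the caller.
    folds = k_fold_indices(n, k)
    best_alpha, best_err = None, float("inf")
    for alpha in alphas:
        errs = []
        for i, test_idx in enumerate(folds):
            train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
            errs.append(error_for(alpha, train_idx, test_idx))
        avg = sum(errs) / k
        if avg < best_err:
            best_alpha, best_err = alpha, avg
    return best_alpha
```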
Forests of Trees
Random forests use the idea of ensemble classifiers: instead of building a single classifier,
like a single tree, we build many classifiers in slightly different ways. The ensemble then
votes on the classification of each data point. In random forests, we build ensembles of many
classification trees, each with slightly different data. When we get a new x and want to
predict ŷ, we ask each tree to make its prediction, then pick the most common answer as our
prediction.
What do I mean by “slightly different data”? To be specific, a random forest is built
following this procedure:
1. Take a random sample of size n, with replacement, from the observations X. Also take
the corresponding Y values. (This is a bootstrap sample.)
2. Build a classification tree using this sample, except that at each split, instead of choosing
the best covariate xj to split out of all covariates, choose from a random sample of k of
the covariates. Use a new random sample for every split in the tree. Do not prune these
trees.
3. Repeat steps 1 and 2 many times (say, 500 times). Store the collection of classification
trees as your random forest.
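The bootstrap-and-vote skeleton can be sketched independently of the tree code. Here build_tree and tree_predict are hypothetical stand-ins for your own tree builder (with random feature selection at each split) and tree query function:

```python
import random
from collections import Counter

def build_forest(X, Y, n_trees, build_tree):
    # Grow n_trees trees, each on its own bootstrap sample of the data.
    forest = []
    n = len(X)
    for _ in range(n_trees):
        idx = [random.randrange(n) for _ in range(n)]   # sample with replacement
        forest.append(build_tree([X[i] for i in idx], [Y[i] for i in idx]))
    return forest

def forest_predict(forest, x, tree_predict):
    # Majority vote over the trees' individual predictions.
    votes = Counter(tree_predict(tree, x) for tree in forest)
    return votes.most_common(1)[0][0]
```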
A random forest, then, requires two tuning parameters: the number of trees in the forest
and the number of randomly selected features k to consider at each split. No pruning is done,
so no α is necessary.
You may want to consult the paper Resources/classification-tree/breiman.pdf,
which introduces random forests, though rather abstractly.
Task
Part 1: Plans and Tests
First, you must plan out your implementation of classification trees. There are a lot of moving
parts here—the tree, the splitting method, the pruning method, cross-validation—so careful
design is important to prevent your code from becoming complicated and hard to work with.
In this part, you should
Determine how you will represent your tree data structure. A tree consists of many nodes,
each of which represents a split along a particular variable at a particular split point; all
of this has to be stored in the tree.
You might use a Node class containing data fields like split_var and split_point, or
a Tree class containing many nodes. How will you represent this in your programming
language?
Write definitions of the classes/data structures you will use. Be specific about what data
will be stored where.
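For instance, one possible Python representation, using the split_var and split_point fields suggested above (one sketch among many, not a required design):

```python
class Node:
    # A single node of a classification tree.  A leaf has split_var None
    # and stores a predicted class; an internal node stores the variable
    # index and split point, plus its two children.
    def __init__(self, split_var=None, split_point=None,
                 prediction=None, left=None, right=None):
        self.split_var = split_var      # column index j, or None for a leaf
        self.split_point = split_point  # threshold for x_j
        self.prediction = prediction    # predicted class, if a leaf
        self.left = left                # subtree where x_j is below the split
        self.right = right              # subtree where x_j is at or above it

    def is_leaf(self):
        return self.split_var is None
```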
Design functions for operating on your tree. To implement classification trees, you’ll
probably need functions for adding nodes to a tree, removing nodes (to prune them),
searching the tree, and so on. Design these functions.
It should be possible, for a given node, to obtain all the data points that fall into that
node of the tree. For example, for the root node, this would be all the data; for a node
midway through the tree, it should be the subset of the data obtained by splitting the
dataset to get to this node. Design a function get_data that can do this. (It may have to
traverse the tree to do this.)
(By “design”, we mean you should write out the function names, arguments, and return
values, with comments explaining their purpose, but do not write the actual code inside
the functions. That will come in Part 2.)
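In Python, this signatures-plus-comments style might look like the following (the names and arguments are illustrative, not prescribed):

```python
def add_node(tree, parent, side, node):
    """Attach node as the given child ('left' or 'right') of parent in
    tree.  Modifies the tree in place."""

def prune_node(tree, node):
    """Remove node's subtree, turning node into a leaf that predicts
    the majority class of the data reaching it."""

def get_data(tree, node, X, Y):
    """Return the subset of (X, Y) falling into node, found by replaying
    the splits on the path from the root down to node."""
```

A function whose body is only a docstring is valid Python, so stubs like these can be committed now and filled in during Part 2.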
Design functions for building and pruning a classification tree. Consider how the data
should be provided—a data frame? a matrix? how is the response variable specified?—and
make sure that it’s possible for the user to pick different values of φ, so they can use
different impurity definitions. It should be possible for the user to specify a new φ that
isn’t built in to your code.
Design functions for querying a classification tree—given a new X, they must determine
the Y predicted by the tree.
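The query function is a simple walk from the root to a leaf. A sketch, assuming nodes are dicts with split_var, split_point, left, and right keys, and leaves are dicts holding only a prediction (an illustrative representation, not a required one):

```python
def predict(node, x):
    # Follow the splits down from the root until a leaf is reached.
    while "prediction" not in node:
        if x[node["split_var"]] < node["split_point"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["prediction"]
```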
Design functions to build an ensemble of classification trees—a forest of many trees,
using resampled data. These functions should reuse your classification tree functions in
straightforward ways, rather than re-implementing the tree algorithms from scratch.
Write an is_valid function that checks that a classification tree is valid. (Write the
actual implementation, not just a design; your implementation will call all the functions
you designed above.)
What does “valid” mean? Several things:
– No nodes are empty (contain no data)
– If a node is split on xj, all data points in the left child should have xj below the split
point, and all data points in the right child should have xj at or above it

Part 3: Classification Trees in SQL

A SQL database can answer a query like

SELECT …
FROM characters
WHERE … > 1964
GROUP BY character_class
ORDER BY evilness DESC;

and get the results without actually looking at every row in the characters
table. Instead, databases usually use various indexes, often based on tree data structures, so
they can search only the rows meeting the requirements.
In this part, you’ll use SQL to build a classification tree. Your tree-building code should
take a connection to a SQL database (however that’s represented in your programming
language), a table name, and the names of the table columns to use as predictors and outcome
variables. It will then query each of the predictor columns to find optimal split points, choose
one to split on, and record the split; then, for each child, it will again use queries to pick the
optimal split, and so on.
Crucially, your tree structure will not store the data, just the variable splits. To make a
prediction for a specific x, you can find the matching leaf node of the tree, then query the
database to find the data in that node.
In this part, you should
Build support for creating classification trees from SQL data and from data frames. Don’t
hardcode the methods, like this:
if data_source == "data_frame":
    # get data from data frame
    data[column, :] = # ...
elif data_source == "sql":
    cur.query("SELECT data FROM ...")
Instead, it should be possible to add new data access methods without rewriting all the
code. This could mean
– having a Tree class with a SQLTree subclass that overrides some methods
– extracting all data access out to functions that can be easily replaced
– having a DataFrame class and a SQLTable class with a common interface for accessing
their data, which can be given to your tree code to use to access data
– or some other suitable and flexible design.
Once the tree is built, it should be able to make predictions as well, and support everything
that was possible in Part 2.
Write tests for any new functions you’ve made, as needed. (It can be tricky to test
SQL code, since it’s annoying to write unit tests that access databases; extract out code
into separate functions as much as possible, so the database access is limited to specific
functions.)
Make some test data tables in the SQL database you’re using. Use this to test your SQL
code, to ensure that the correct rows are returned when requested and the queries it
uses are valid. (For example, if you had a DataFrame class and a SQLTable class, each
with get_data_in_region functions that retrieve all rows with x values in a certain
range, you should make sure the SQL version returns the correct results and matches the
DataFrame version.)
Load identical data into a SQL table and a data frame. (You can generate some test
data for this purpose; be sure to include the script that generated the test data in your
submission.) Build a randomized test that compares the results of your classifier on the
SQL version and the data frame version of the data, making sure they give matching
answers.
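A seeded generator keeps such a test reproducible. A minimal sketch of the setup, comparing the two back ends on a simple region query rather than the full classifier (table and column names are our own):

```python
import random
import sqlite3

def make_test_data(n, seed=0):
    # Identical pseudo-random data for both back ends.
    rng = random.Random(seed)
    return [(rng.uniform(0.0, 10.0), rng.randint(0, 1)) for _ in range(n)]

def load_into_sqlite(rows):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE points (x REAL, y INTEGER)")
    conn.executemany("INSERT INTO points VALUES (?, ?)", rows)
    return conn

def rows_below(conn, cutoff):
    # The SQL side of the comparison; the data frame side is a plain
    # comprehension over the same rows.
    return sorted(conn.execute("SELECT x, y FROM points WHERE x < ?",
                               (cutoff,)).fetchall())
```

In the real test, both sides would run the full build-and-predict pipeline and the assertions would compare predictions rather than raw rows.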
Part 4: Scrape and Classify
Write a scraper, in another language, for some website, capable of both loading data into
SQL and updating data that's already been ingested.
[TODO: More to come...]
Requirements
A Mastered submission will meet the requirements described in each Part above. A
Sophisticated submission will additionally:
Use excellent variable names, code style, organization, and comments, so the code is clear
and readable throughout.
Be efficient, making the best use of its data structures with minimal copying, wasteful
searching, or other overhead.
Have a comprehensive suite of tests to ensure that the individual components function
correctly on corner cases, error cases, and unexpected inputs.
Give particularly thorough and detailed answers to the questions above.
Supply a command-line driver that runs all your tests.
You should also remember that peer review is an essential part of this assignment.
You will be asked to review another student’s submission, and another student will review
yours. You will then revise your work based on their comments. You should provide a clear,
comprehensive, and helpful review to your classmate.
