DATA7202 Statistical Methods for Data Science

Semester 1, 2021

Assignment 3 (Weight: 25%)

Assignment 3 is due on 21 May 2021 (17:00).

Please answer the questions below. For theoretical questions, you should present rigorous proofs and appropriate explanations. Your report should be visually appealing, and all questions should be answered in the order of their appearance. For programming questions, you should present your analysis of the data using Python, MATLAB, or R as a short report, clearly answering the objectives and justifying the modeling (and hence statistical analysis) choices you make, as well as discussing your conclusions. Do not include excessive amounts of output in your reports. All the code should be copied into the appendix, and the sources should be packaged separately and submitted on Blackboard in a zipped folder with the name:

"student_last_name.student_first_name.student_id.zip".

For example, suppose that the student name is John Smith and the student ID is 123456789. Then, the zipped file name will be John.Smith.123456789.zip.

1. [10 Marks] Show that any training set $\tau = \{(x_i, y_i),\ i = 1, \ldots, n\}$ (with unique $x_i$ values) can be fitted via a tree with zero training loss.
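
As a numerical illustration only (not a proof), one can check that a fully grown regression tree reproduces its training responses. The sketch below assumes scikit-learn's `DecisionTreeRegressor` with default settings, which keeps splitting until the leaves are pure; the synthetic data, the seed, and the names `x_demo`, `y_demo`, and `train_mse` are made up for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)        # hypothetical seed, for reproducibility
n = 50
x_demo = rng.uniform(size=(n, 2))     # continuous features, so the rows are unique
y_demo = rng.normal(size=n)           # arbitrary responses

# Defaults (max_depth=None, min_samples_leaf=1) let the tree grow until
# each leaf contains training points with a single response value.
tree = DecisionTreeRegressor(random_state=0).fit(x_demo, y_demo)

train_mse = float(np.mean((tree.predict(x_demo) - y_demo) ** 2))
print(f"leaves: {tree.get_n_leaves()}, training MSE: {train_mse:.3g}")
```

The written answer still needs the general argument: with unique $x_i$, splitting can continue until every region contains exactly one training point.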

2. [10 Marks] Suppose during the construction of a decision tree we wish to specify a constant regional prediction function $h^w$ on the region $\mathcal{R}_w$, based on the training data in $\mathcal{R}_w$, say $\{(x_1, y_1), \ldots, (x_k, y_k)\}$. Show that $h^w(x) := k^{-1} \sum_{i=1}^{k} y_i$ minimizes the squared-error loss.
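
A quick numerical sanity check of this claim, which does not replace the proof, is to scan candidate constants and confirm that the sample mean gives the smallest sum of squared errors. The responses in `ys_demo`, the grid spacing, and the helper `squared_error_loss` are hypothetical choices for this sketch.

```python
from statistics import fmean

def squared_error_loss(c, ys):
    """Sum of squared errors when every point in the region is predicted as c."""
    return sum((y - c) ** 2 for y in ys)

ys_demo = [2.0, 3.5, 1.0, 4.5, 3.0]   # made-up responses in one region
mean_y = fmean(ys_demo)

# Candidate constants on a grid of width 0.01 centred at the sample mean.
grid = [mean_y + step / 100 for step in range(-200, 201)]
best_c = min(grid, key=lambda c: squared_error_loss(c, ys_demo))

print(mean_y, best_c)
```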

3. [5 Marks] Suppose that in a certain leaf node of a decision tree that was applied to a classification problem, there are 3 blue and 2 red data points in a certain tree region. Calculate the misclassification impurity, the Gini impurity, and the entropy impurity. Repeat these calculations for 2 blue and 3 red data points.
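
The hand calculation can be cross-checked with a few lines of Python. This sketch assumes the common textbook definitions $1 - \max_z p_z$, $1 - \sum_z p_z^2$, and $-\sum_z p_z \log_2 p_z$; the course notes may scale these differently (for example a factor of 1/2 on the Gini impurity, or a natural logarithm in the entropy), so treat the definitions and the helper name `impurities` as assumptions.

```python
from math import log2

def impurities(counts):
    """Return (misclassification, Gini, entropy) impurity for class counts."""
    total = sum(counts)
    probs = [c / total for c in counts]
    misclassification = 1 - max(probs)
    gini = 1 - sum(p * p for p in probs)          # unscaled Gini (assumption)
    entropy = -sum(p * log2(p) for p in probs if p > 0)  # base 2 (assumption)
    return misclassification, gini, entropy

print(impurities([3, 2]))  # 3 blue, 2 red
print(impurities([2, 3]))  # 2 blue, 3 red
```

All three measures depend only on the class proportions, not on which class carries which label.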

4. [15 Marks] Suppose $\tau$ is a training set with $n$ elements and $\tau^*$, also of size $n$, is obtained from $\tau$ by bootstrapping; that is, resampling with replacement. Show that for large $n$, $\tau^*$ does not contain a fraction of about $e^{-1} \approx 0.37$ of the points from $\tau$.
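
The limit can be explored empirically before proving it. The sketch below compares $(1 - 1/n)^n$, the probability that a fixed point is never drawn in $n$ draws with replacement, against the left-out fraction in one simulated bootstrap sample; the function name `left_out_fraction`, the seed, and the chosen values of $n$ are hypothetical.

```python
import math
import random

def left_out_fraction(n, seed=0):
    """Fraction of the n original points that never appear in one bootstrap sample."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}   # indices sampled with replacement
    return 1 - len(drawn) / n

for n in (10, 100, 10_000):
    exact = (1 - 1 / n) ** n       # P(a fixed point is never drawn)
    print(n, round(exact, 4), round(left_out_fraction(n), 4))

print("limit e^{-1}:", round(math.exp(-1), 4))
```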

5. [30 Marks] Consider the following train/test split of the data.

import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# create regression problem
n_points = 1000  # points
x, y = make_friedman1(n_samples=n_points, n_features=15,
                      noise=1.0, random_state=100)

# split to train/test set
x_train, x_test, y_train, y_test = \
    train_test_split(x, y, test_size=0.33, random_state=100)

Construct a random forest regressor with 1000 trees and identify the optimal parameter $m$ in the sense of the $R^2$ score. Here, $m$ is the size of the subset of predictors considered at each split.
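
One possible way to organize the search over $m$ is sketched below. It assumes that $m$ corresponds to scikit-learn's `max_features` argument and that "optimal" is judged by $R^2$ on the held-out test split; both are interpretations, and the helper name `r2_by_m` and the forest's `random_state` are hypothetical. The demo call uses far fewer trees and only three values of $m$ so that it runs quickly.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def r2_by_m(n_trees=1000, m_values=range(1, 16)):
    """Test-set R^2 for each candidate m (passed as max_features)."""
    x, y = make_friedman1(n_samples=1000, n_features=15,
                          noise=1.0, random_state=100)
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.33, random_state=100)
    scores = {}
    for m in m_values:
        forest = RandomForestRegressor(
            n_estimators=n_trees, max_features=m,
            random_state=100)              # forest seed is an assumption
        forest.fit(x_train, y_train)
        scores[m] = r2_score(y_test, forest.predict(x_test))
    return scores

# Reduced demo run; call r2_by_m() with its defaults for the full search.
demo_scores = r2_by_m(n_trees=20, m_values=(1, 5, 15))
best_m = max(demo_scores, key=demo_scores.get)
print(demo_scores, best_m)
```

Selecting $m$ on the test split is only one option; out-of-bag scores (`oob_score=True`) or cross-validation on the training split would avoid reusing the test data for tuning.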

6. [30 Marks] Consider the following classification data and module imports:

from sklearn.datasets import make_blobs
from sklearn.metrics import zero_one_loss
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier

X_train, y_train = make_blobs(n_samples=1000, n_features=10, centers=3,
                              random_state=10, cluster_std=5)

Using the gradient boosting algorithm with $B = 150$ rounds, plot the training loss as a function of $\gamma$, for $\gamma = 0.1, 0.3, 0.5, 0.7, 1$. What is your conclusion regarding the relation between $B$ and $\gamma$?
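
A minimal sketch of the experiment loop follows. It assumes that $\gamma$ maps to the `learning_rate` argument and $B$ to `n_estimators` of `GradientBoostingClassifier`, and that the training loss is the zero-one loss (suggested by the `zero_one_loss` import); the helper name `training_loss_by_gamma` and the model's `random_state` are hypothetical. The demo call is deliberately reduced so that it runs quickly.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import zero_one_loss

def training_loss_by_gamma(gammas=(0.1, 0.3, 0.5, 0.7, 1.0), n_rounds=150):
    """Zero-one training loss after n_rounds boosting rounds, per step size."""
    X_train, y_train = make_blobs(n_samples=1000, n_features=10, centers=3,
                                  random_state=10, cluster_std=5)
    losses = {}
    for gamma in gammas:
        model = GradientBoostingClassifier(
            learning_rate=gamma, n_estimators=n_rounds,
            random_state=0)                 # model seed is an assumption
        model.fit(X_train, y_train)
        losses[gamma] = zero_one_loss(y_train, model.predict(X_train))
    return losses

# Reduced demo; call training_loss_by_gamma() with its defaults for B = 150.
demo_losses = training_loss_by_gamma(gammas=(0.1, 1.0), n_rounds=20)
print(demo_losses)
```

The returned dictionary can be passed to `plt.plot(list(losses), list(losses.values()))` for the requested figure. To see how the loss evolves with the number of rounds for each $\gamma$, `model.staged_predict(X_train)` yields the predictions after every boosting round.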
