Introduction to Computational Statistics, Winter 2018
1 Preface
Statistics, data mining and machine learning are all concerned with collecting and an-
alyzing data. For some time, statistical research was conducted in statistics depart-
ments while data mining and machine learning research was conducted in computer
science departments. Statisticians thought that computer scientists were reinventing
the wheel; while computer scientists thought that statistical theory didn’t apply to
their problems.
Things are changing. Statisticians now recognize that computer scientists are
making novel contributions, while computer scientists now recognize the generality of
statistical theory and methodology. Clever data mining algorithms are more scalable
than statisticians ever thought possible. Formal statistical theory is more pervasive
than computer scientists had realized.
Students who analyze data, or who aspire to develop new methods for analyz-
ing data, should be well grounded in basic probability and mathematical statistics.
Using fancy tools like neural nets, boosting and support vector machines without
understanding basic statistics is like doing brain surgery before knowing how to use
a band-aid. This course is designed to provide fundamental ideas and methods for
statistical inference, and their implementation in software. Here is a summary of
the main features of this course.
1. The course moves very quickly and covers a wide range of ideas and methods.
The course is demanding but I have tried to make the material as intuitive as
possible so that the material is very understandable despite the fast pace.
2. Whenever possible, I avoid tedious calculations in favor of emphasizing concepts.
3. Rigor and clarity are not synonymous. I will try to strike a good balance. To
avoid getting bogged down in uninteresting technical details, many results are
stated without proof. The bibliographic references at the end of each lecture
point students to appropriate sources.
Probability theory is the formal language of uncertainty and serves as the basis
of statistical inference. The basic problem that we study in probability is:
Given a data generating process, what are the properties of the outcomes?
The first part of this course will focus on basic methodologies and their implementations for statistical inference. The basic problem of statistical inference is the inverse of probability:
Given the outcomes, what can we say about the process that generated the data?
The second part of the course will apply the ideas from Part I to specific problems such as regression, density estimation, smoothing, classification and simulation.
These ideas are illustrated in Figure 1. Prediction, classification, clustering, and estimation are all special cases of statistical inference. Data analysis, machine learning and data mining are various names given to the practice of statistical inference,
depending on the context.
Although this course is called "Computational Statistics," I will start by introducing or reviewing the statistical methods (with a little, but not much, theory involved) and then show how these methods can be put into practice to deal with both simulated and real data sets. Let's keep in mind the following four points.
1. Statistical simulation (Monte Carlo) is an important part of statistical method research.
2. Statistical theories/methods are all based on assumptions, so most theorems state something like "if the data follow these models/assumptions, then ...".
3. The theories can hardly be verified on real-world data because (1) real data never satisfy the assumptions perfectly; and (2) the underlying truth is unknown (there is no "gold standard").
4. In simulation, data are "created" in a well-controlled environment (the model assumptions) and all the truth is known, so the claims in a theorem can be verified.
2 Models, statistical inference and learning
Statistical inference, or "learning" as it is called in computer science, is the process of using data to infer the distribution that generated the data. A typical statistical inference question is:
Given a sample $X_1, \ldots, X_n \sim F$, how do we infer $F$?
In some cases, we may want to infer only some feature/aspect of $F$, such as its mean, variance, median, etc.
2.1 Parametric and nonparametric models
A statistical model $\mathcal{F}$ is a set of distributions (e.g. densities or regression functions). A parametric model is a set $\mathcal{F}$ that can be parameterized by a finite number of parameters. For example, if we assume that the data come from a normal distribution, then the model is
$$\mathcal{F}_G = \left\{ f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)} : \mu \in \mathbb{R},\ \sigma > 0 \right\}. \quad (2.1)$$
This is a two-parameter model. We have written the density as $f(x; \mu, \sigma)$ to show that $x$ is a value of the random variable whereas $\mu$ and $\sigma$ are parameters. In general, a parametric model takes the form
$$\mathcal{F} = \{ f(x; \theta) : \theta \in \Theta \}, \quad (2.2)$$
where $\theta$ is an unknown parameter (a univariate parameter or a vector of parameters) that takes values in some parameter space $\Theta$. If $\theta$ is a vector but we are only interested in one of its components, the remaining parameters are then called nuisance parameters.
A nonparametric model is a set $\mathcal{F}$ that cannot be parameterized by a finite number of parameters. For example, $\mathcal{F}_{\mathrm{CDF}} = \{\text{all cumulative distribution functions}\}$ is nonparametric. The distinction between parametric and nonparametric is more subtle than this, but we don't need a rigorous definition in this course.
Example 2.1. (One-dimensional parametric estimation). Let $X_1, \ldots, X_n$ be independent Bernoulli($p$) observations. The problem is to estimate the parameter $p$.
Example 2.2. (Two-dimensional parametric estimation). Assume that $X_1, \ldots, X_n \sim F$ with probability density function (PDF) $f \in \mathcal{F}_G$, where $\mathcal{F}_G$ is given in (2.1). In this case, there are two unknown parameters, $\mu$ and $\sigma$, to infer. The goal is to estimate them from the data. If we are only interested in estimating $\mu$, then $\mu$ is the parameter of interest and $\sigma$ becomes a nuisance parameter.
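Since the Gaussian model (2.1) recurs throughout the course, it helps to see its density as code. The following minimal Python sketch (added for illustration, not part of the original notes; the course's own software may differ) evaluates $f(x; \mu, \sigma)$ and checks it against the standard library:

```python
import math
from statistics import NormalDist

def f(x, mu, sigma):
    """Normal density f(x; mu, sigma) from model (2.1)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Sanity check against Python's built-in normal distribution at a few points.
for x in (-1.0, 0.0, 2.5):
    assert abs(f(x, 0.0, 1.0) - NormalDist(0.0, 1.0).pdf(x)) < 1e-12
```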
Example 2.3. (Nonparametric estimation of the CDF). Let $X_1, \ldots, X_n$ be independent observations from a CDF $F$. The problem is to estimate $F$ under the only assumption that $F \in \mathcal{F}_{\mathrm{CDF}}$. Recall the basic properties of a CDF $F$: $0 \le F(x) \le 1$ for all $x \in \mathbb{R}$, $\lim_{x \to -\infty} F(x) = 0$, $\lim_{x \to \infty} F(x) = 1$, and $F$ is non-decreasing.
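A natural estimator of $F$ is the empirical CDF $\hat{F}_n(x) = n^{-1} \sum_{i=1}^n \mathbf{1}\{X_i \le x\}$, developed later in the course. As a quick Python sketch (an illustration added here, with hypothetical helper names), one can verify that it satisfies the CDF properties just listed:

```python
import random

def ecdf(data):
    """Empirical CDF: F_hat(x) = (1/n) * #{i : X_i <= x}."""
    xs = sorted(data)
    n = len(xs)
    def F_hat(x):
        # fraction of observations less than or equal to x
        return sum(1 for v in xs if v <= x) / n
    return F_hat

random.seed(0)
sample = [random.gauss(0, 1) for _ in range(200)]
F = ecdf(sample)

assert 0.0 <= F(0.0) <= 1.0               # bounded in [0, 1]
assert F(-1e9) == 0.0 and F(1e9) == 1.0   # limits at -infinity / +infinity
assert F(-0.5) <= F(0.5)                  # non-decreasing
```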
Example 2.4. (Nonparametric density estimation). Let $X_1, \ldots, X_n$ be independent observations from a CDF $F$. Assume that the PDF $f = F'$ exists; the goal is to estimate $f$ with as few assumptions about $f$ as possible. For example, we might assume that $f \in \mathcal{F}_{\mathrm{PDF}} \cap \mathcal{F}_{\mathrm{Sob}}$, where $\mathcal{F}_{\mathrm{PDF}}$ is the set of all probability density functions and
$$\mathcal{F}_{\mathrm{Sob}} = \Big\{ f : \int \big(f''(x)\big)^2 \, dx < \infty \Big\}$$
is the class of densities whose second derivative is square-integrable (i.e., sufficiently smooth densities).
Example 2.5. (Nonparametric quantile estimation). Let $X_1, \ldots, X_n$ be independent observations from a CDF $F$. The goal is to estimate the $q$-th quantile $F^{-1}(q)$ for $q \in [0, 1]$. If $F$ is strictly increasing and continuous, then $F^{-1}(q)$ is the unique real number $x$ such that $F(x) = q$.
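For a distribution with a closed-form CDF, the defining property $F(F^{-1}(q)) = q$ can be checked directly. Here is a small Python sketch (illustrative, not from the notes) using the Exponential(1) distribution, where $F(x) = 1 - e^{-x}$ and hence $F^{-1}(q) = -\log(1 - q)$:

```python
import math

def F(x):
    """CDF of the Exponential(1) distribution (strictly increasing and continuous on x > 0)."""
    return 1 - math.exp(-x) if x > 0 else 0.0

def F_inv(q):
    """Quantile function: the unique x with F(x) = q, for q in (0, 1)."""
    return -math.log(1 - q)

# Verify F(F^{-1}(q)) = q at a few quantile levels.
for q in (0.1, 0.5, 0.9):
    assert abs(F(F_inv(q)) - q) < 1e-12
```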
Example 2.6. (Regression, prediction and classification). Suppose we observe pairs of data $(X_1, Y_1), \ldots, (X_n, Y_n)$ from $(X, Y)$. For example, $X_i$ may denote the blood pressure of subject $i$ (e.g. the $i$th patient) and $Y_i$ is how long subject $i$ lives. In statistics, we call $X$ a predictor, regressor, feature or independent variable, and $Y$ is called the outcome, response variable or dependent variable. We call $m(x) = E(Y \mid X = x)$ the regression function.
If we assume that $m \in \mathcal{F}$, where $\mathcal{F}$ is finite-dimensional (e.g. the set of straight lines), then we have a parametric regression model;
If we assume that $m \in \mathcal{F}$, where $\mathcal{F}$ is infinite-dimensional, then we have a nonparametric regression model.
The goal of predicting Y for a new patient based on their X value is called prediction.
If $Y$ is discrete (e.g. survives or dies, positive or negative), then prediction is instead called classification. If the goal is to estimate the function $m$, then we call this regression or curve estimation. Equivalently, regression models can be written as
$$Y = m(X) + \varepsilon,$$
where $\varepsilon$ is a noise variable satisfying $E(\varepsilon \mid X) = 0$. Then, it is easy to see that
$$E(Y \mid X) = E\{m(X) + \varepsilon \mid X\} = E\{m(X) \mid X\} + E(\varepsilon \mid X) = m(X).$$
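The identity $E(Y \mid X) = m(X)$ can be seen in simulation: averaging the $Y$-values whose $X$ falls near a point $x_0$ approximates $m(x_0)$. Below is a Python sketch under an assumed linear $m$ (the function, constants and window width are all hypothetical choices for illustration):

```python
import random

random.seed(1)
m = lambda x: 2.0 + 3.0 * x           # assumed true regression function (a straight line)

# Simulate pairs (X_i, Y_i) with Y = m(X) + eps and E(eps | X) = 0.
data = []
for _ in range(100_000):
    x = random.uniform(0, 1)
    y = m(x) + random.gauss(0, 1)     # noise with mean zero, independent of X here
    data.append((x, y))

# Average Y over a narrow window around x0: this approximates E(Y | X = x0) = m(x0).
x0 = 0.5
ys = [y for (x, y) in data if abs(x - x0) < 0.02]
est = sum(ys) / len(ys)
assert abs(est - m(x0)) < 0.1         # local average is close to m(x0) = 3.5
```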
Notation. If $\mathcal{F} = \{f(x; \theta) : \theta \in \Theta\}$ is a parametric model, we write
$$P_\theta(X \in A) = \int_A f(x; \theta) \, dx \quad \text{and} \quad E_\theta\{m(X)\} = \int m(x) f(x; \theta) \, dx$$
for any set $A$ and function $m$. The subscript $\theta$ indicates that the probability or expectation is with respect to $f(x; \theta)$. Similarly, we write $V_\theta$ for the variance.
3 Fundamental concepts in inference
Most inferential problems can be identified as belonging to one of three categories: estimation, confidence sets and hypothesis testing. In this section, we give a brief introduction to the ideas underlying these problems.
3.1 Point estimation
Point estimation refers to providing a single "best guess" of some quantity of interest, which could be a parameter in a parametric model, a CDF $F$, a probability density function $f$, a regression function $m$, or a prediction for a future value $Y$ of some random variable.
By convention, we denote a point estimate of $\theta$ by $\hat{\theta}$ or $\hat{\theta}_n$, which is a function of the observed data. Remember that $\theta$ is a fixed unknown quantity. The estimate $\hat{\theta}$ depends on the data and therefore is a random variable.
Let $X_1, \ldots, X_n$ be independent observations from some distribution $F$. A point estimator of a parameter $\theta$ is some function of $X_1, \ldots, X_n$:
$$\hat{\theta}_n = g(X_1, \ldots, X_n).$$
The bias of this estimator is defined by
$$\mathrm{bias}(\hat{\theta}_n) = E_\theta(\hat{\theta}_n) - \theta. \quad (3.1)$$
We say $\hat{\theta}_n$ is unbiased if $E_\theta(\hat{\theta}_n) = \theta$. Unbiasedness used to receive much attention but is considered less important nowadays; in fact, many of the estimators we will use are biased. A reasonable requirement for an estimator is that it converge to the true parameter value as we collect more and more data. This requirement is quantified by the following definition:
Definition 3.1. A point estimator $\hat{\theta}_n$ of a parameter $\theta$ is consistent if $\hat{\theta}_n \to \theta$ in probability, written $\hat{\theta}_n \xrightarrow{P} \theta$ (check the probability lecture notes!).
The distribution of $\hat{\theta}_n$ is called the sampling distribution. The standard deviation of $\hat{\theta}_n$ is called the standard error, denoted by $\mathrm{se}$:
$$\mathrm{se} = \mathrm{se}(\hat{\theta}_n) = \sqrt{V_\theta(\hat{\theta}_n)}. \quad (3.2)$$
Note that the standard error depends on the unknown distribution $F$, and typically is an unknown quantity that needs to be estimated. The estimated standard error is denoted by $\widehat{\mathrm{se}}$.
Example 3.1. (Bernoulli distribution). Let $X_1, \ldots, X_n \sim \mathrm{Bernoulli}(p)$ and let $\hat{p}_n = n^{-1} \sum_{i=1}^n X_i$. We calculate
$$E(\hat{p}_n) = \frac{1}{n} \sum_{i=1}^n E(X_i) = p.$$
Hence, $\hat{p}_n$ is an unbiased estimator of $p$, with standard error $\mathrm{se} = \sqrt{V(\hat{p}_n)} = \sqrt{p(1-p)/n}$. Based on this formula, we can estimate the standard error by $\widehat{\mathrm{se}} = \sqrt{\hat{p}_n(1 - \hat{p}_n)/n}$.
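The formulas $E(\hat{p}_n) = p$ and $\mathrm{se} = \sqrt{p(1-p)/n}$ are easy to confirm by Monte Carlo, which is exactly the kind of simulation check emphasized in the Preface. A Python sketch (the parameter values are arbitrary choices for illustration):

```python
import random, math

random.seed(2)
p, n, reps = 0.3, 50, 20_000

# Draw many samples, compute p_hat for each, and compare the empirical
# mean and standard deviation with E(p_hat) = p and se = sqrt(p(1-p)/n).
p_hats = []
for _ in range(reps):
    xs = [1 if random.random() < p else 0 for _ in range(n)]
    p_hats.append(sum(xs) / n)

mean = sum(p_hats) / reps
var = sum((v - mean) ** 2 for v in p_hats) / (reps - 1)
se_theory = math.sqrt(p * (1 - p) / n)

assert abs(mean - p) < 0.005                    # unbiasedness: E(p_hat) = p
assert abs(math.sqrt(var) - se_theory) < 0.005  # empirical sd matches theoretical se
```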
The quality of a point estimate is often assessed by the mean squared error, or MSE, defined by
$$\mathrm{MSE}(\hat{\theta}_n) = E_\theta(\hat{\theta}_n - \theta)^2. \quad (3.3)$$
Keep in mind that $E_\theta(\cdot)$ refers to expectation with respect to the distribution
$$f(x_1, \ldots, x_n; \theta) = \prod_{i=1}^n f(x_i; \theta) \quad \text{(the joint density of } (X_1, \ldots, X_n)\text{)}$$
that generated the data $X_1, \ldots, X_n$.
Theorem 3.1. The MSE can be written as
$$\mathrm{MSE} = \mathrm{bias}^2(\hat{\theta}_n) + V_\theta(\hat{\theta}_n). \quad (3.4)$$
If $\mathrm{bias} \to 0$ and $\mathrm{se} \to 0$ as $n \to \infty$, then $\hat{\theta}_n$ is consistent, that is, $\hat{\theta}_n \xrightarrow{P} \theta$.
Example 3.2. In the coin flipping example, we have $E(\hat{p}_n) = p$ so that $\mathrm{bias} = 0$, and $\mathrm{se} = \sqrt{p(1-p)/n} \to 0$. Therefore, $\hat{p}_n \xrightarrow{P} p$, so $\hat{p}_n$ is a consistent estimator.
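The decomposition (3.4) can likewise be checked empirically. The sketch below uses a deliberately biased estimator $\tilde{p} = (\sum_i X_i + 1)/(n + 2)$ (a shrinkage estimator chosen here purely for illustration) and verifies that the empirical MSE equals squared bias plus variance:

```python
import random

random.seed(3)
p, n, reps = 0.3, 20, 50_000

# A deliberately biased estimator (add-one shrinkage), chosen for illustration.
est = lambda xs: (sum(xs) + 1) / (n + 2)

vals = []
for _ in range(reps):
    xs = [1 if random.random() < p else 0 for _ in range(n)]
    vals.append(est(xs))

mean = sum(vals) / reps
mse = sum((v - p) ** 2 for v in vals) / reps
bias = mean - p
var = sum((v - mean) ** 2 for v in vals) / reps

# Theorem 3.1: MSE = bias^2 + variance (an exact identity for these empirical moments).
assert abs(mse - (bias ** 2 + var)) < 1e-9
```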
The notion of consistency shows that, with more and more observations at hand, the estimator gets closer and closer to the true parameter of interest. Given an estimator, we might guess that the true parameter lies in a small neighborhood of the estimator. Quantitative characterization of such a neighborhood is a more subtle issue than consistency, and for this purpose we need to understand more about the estimator. Many of the estimators we will encounter turn out to have, approximately, a normal/Gaussian distribution. This is a fundamental element of the foundation of statistical inference.
Definition 3.2. An estimator $\hat{\theta}_n$ is asymptotically normal if
$$\frac{\hat{\theta}_n - \theta}{\mathrm{se}(\hat{\theta}_n)} \xrightarrow{D} N(0, 1) \quad \text{(convergence in distribution)}. \quad (3.5)$$
(Check the probability lecture notes!)
3.2 Confidence sets
For a given $\alpha \in (0, 1)$ (e.g. $\alpha = 1\%$ or $5\%$), a $1 - \alpha$ confidence interval for a parameter $\theta$ is an interval $C_n = (a, b)$ satisfying
$$P_\theta(\theta \in C_n) \ge 1 - \alpha \quad \text{for all } \theta \in \Theta, \quad (3.6)$$
where $a = a(X_1, \ldots, X_n)$ and $b = b(X_1, \ldots, X_n)$ are functions of the data. In other words, $(a, b)$ covers $\theta$ with probability at least $1 - \alpha$. We call $1 - \alpha$ the coverage of the confidence interval.
Commonly, people use 95% confidence intervals, which correspond to choosing $\alpha = 0.05$. If $\theta$ is a vector, then we use a confidence set (such as a sphere or an ellipse) instead of an interval.
Theorem 3.2. (Hoeffding's Inequality). Let $Y_1, \ldots, Y_n$ be independent random variables satisfying $E(Y_i) = 0$ and $a_i \le Y_i \le b_i$ for $i = 1, \ldots, n$. Then, for any $y > 0$,
$$P\Big(\Big|\sum_{i=1}^n Y_i\Big| \ge y\Big) \le 2 \exp\Big\{-2y^2 \Big/ \sum_{i=1}^n (b_i - a_i)^2\Big\}. \quad (3.7)$$
Corollary 3.1. Let $X_1, \ldots, X_n \sim \mathrm{Bernoulli}(p)$. Then, for any $\varepsilon > 0$,
$$P(|\bar{X}_n - p| \ge \varepsilon) \le 2 e^{-2n\varepsilon^2}, \quad (3.8)$$
where $\bar{X}_n = n^{-1} \sum_{i=1}^n X_i$.
Example 3.3. In the coin flipping setting, let $C_n = (\hat{p}_n - \varepsilon_n, \hat{p}_n + \varepsilon_n)$, where $\varepsilon_n^2 = \log(2/\alpha)/(2n)$. From Hoeffding's inequality (3.7), it follows that $P(p \in C_n) \ge 1 - \alpha$ for every $p$. Hence, $C_n$ is a $1 - \alpha$ confidence interval.
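The Hoeffding interval from Example 3.3 is simple to compute, and its guaranteed coverage can be checked by simulation. A Python sketch (the `set.seed` call in the exercises suggests the course uses R; Python with arbitrary parameter values is used here only as a sketch):

```python
import random, math

def hoeffding_ci(xs, alpha=0.05):
    """Hoeffding interval (p_hat - eps_n, p_hat + eps_n) with eps_n^2 = log(2/alpha)/(2n)."""
    n = len(xs)
    p_hat = sum(xs) / n
    eps = math.sqrt(math.log(2 / alpha) / (2 * n))
    return p_hat - eps, p_hat + eps

random.seed(4)
p, n, reps = 0.3, 100, 5_000
cover = 0
for _ in range(reps):
    xs = [1 if random.random() < p else 0 for _ in range(n)]
    lo, hi = hoeffding_ci(xs)
    cover += (lo < p < hi)

# Guaranteed coverage: at least 1 - alpha = 0.95 (in practice much higher,
# because the Hoeffding bound is conservative).
assert cover / reps >= 0.95
```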
Theorem 3.3. Let $X$ be a random variable with mean $\mu = E(X)$.
1. (Markov's inequality). If $X$ is non-negative, then for any $t > 0$,
$$P(X \ge t) \le \frac{E(X)}{t}.$$
2. (Chebyshev's inequality). If the variance $\sigma^2 = V(X)$ exists, then for any $t > 0$,
$$P(|X - \mu| \ge t) \le \frac{\sigma^2}{t^2}.$$
Example 3.4. Let $X_1, \ldots, X_n \sim \mathrm{Bernoulli}(p)$. Let $n = 100$ and $\varepsilon = 0.2$. By Chebyshev's inequality,
$$P(|\bar{X}_n - p| \ge \varepsilon) \le \frac{p(1-p)}{n\varepsilon^2} \le \frac{1}{4n\varepsilon^2} = 0.0625.$$
On the other hand, according to Hoeffding's inequality,
$$P(|\bar{X}_n - p| \ge \varepsilon) \le 2e^{-2 \times 100 \times 0.2^2} = 2e^{-8} \approx 6.7 \times 10^{-4},$$
which is much smaller than 0.0625.
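The two bounds in Example 3.4 can be reproduced with a quick numerical check (a Python sketch added for illustration):

```python
import math

n, eps = 100, 0.2

# Chebyshev: P(|Xbar - p| >= eps) <= p(1-p)/(n eps^2) <= 1/(4 n eps^2).
chebyshev = 1 / (4 * n * eps ** 2)

# Hoeffding: P(|Xbar - p| >= eps) <= 2 exp(-2 n eps^2).
hoeffding = 2 * math.exp(-2 * n * eps ** 2)

assert abs(chebyshev - 0.0625) < 1e-12     # 1/16
assert abs(hoeffding - 6.7e-4) < 1e-5      # 2 e^{-8}
assert hoeffding < chebyshev               # Hoeffding's bound is far tighter here
```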
As mentioned earlier, point estimators often have an asymptotically normal distribution, meaning that equation (3.5) holds, that is, $\hat{\theta}_n \approx N(\theta, \widehat{\mathrm{se}}^2)$. In this case, we can construct (approximate) confidence intervals as follows.
Theorem 3.4. (Normal-based Confidence Interval). Suppose that $\hat{\theta}_n \approx N(\theta, \widehat{\mathrm{se}}^2)$. Let $\Phi$ be the CDF of the standard normal distribution and let $z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2)$, that is, $P(Z > z_{\alpha/2}) = \alpha/2$ and $P(-z_{\alpha/2} < Z < z_{\alpha/2}) = 1 - \alpha$. Let
$$C_n = (\hat{\theta}_n - z_{\alpha/2}\, \widehat{\mathrm{se}},\ \hat{\theta}_n + z_{\alpha/2}\, \widehat{\mathrm{se}}). \quad (3.9)$$
Then we have
$$P_\theta(\theta \in C_n) \to 1 - \alpha \quad \text{as } n \to \infty. \quad (3.10)$$
For 95% confidence intervals, $\alpha = 0.05$ and $z_{\alpha/2} = 1.96$, so the approximate 95% confidence interval is $(\hat{\theta}_n - 1.96\, \widehat{\mathrm{se}},\ \hat{\theta}_n + 1.96\, \widehat{\mathrm{se}})$.
Example 3.5. Let $X_1, \ldots, X_n \sim \mathrm{Bernoulli}(p)$ and let $\hat{p}_n = n^{-1} \sum_{i=1}^n X_i$. Then $V(\hat{p}_n) = n^{-2} \sum_{i=1}^n V(X_i) = n^{-2} \sum_{i=1}^n p(1-p) = p(1-p)/n$. Hence, with $\widehat{\mathrm{se}} = \sqrt{\hat{p}_n(1 - \hat{p}_n)/n}$, an approximate $1 - \alpha$ confidence interval is $(\hat{p}_n - z_{\alpha/2}\, \widehat{\mathrm{se}},\ \hat{p}_n + z_{\alpha/2}\, \widehat{\mathrm{se}})$. Compare this with the confidence interval in Example 3.3: the normal-based interval is shorter, but it only has approximately (large-sample) correct coverage.
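The claim that the normal-based interval is shorter can be verified directly: its half-width $1.96\sqrt{\hat{p}_n(1-\hat{p}_n)/n}$ is at most $1.96 \times 0.5/\sqrt{n}$, which for $n = 100$ is about $0.098$, below the Hoeffding half-width $\sqrt{\log(40)/200} \approx 0.136$. A Python sketch (sample parameters are arbitrary):

```python
import random, math

random.seed(5)
n, p = 100, 0.3
xs = [1 if random.random() < p else 0 for _ in range(n)]
p_hat = sum(xs) / n

# Normal-based 95% interval: p_hat +/- 1.96 * se_hat (Theorem 3.4).
se_hat = math.sqrt(p_hat * (1 - p_hat) / n)
normal_halfwidth = 1.96 * se_hat

# Hoeffding interval from Example 3.3 with alpha = 0.05.
hoeffding_halfwidth = math.sqrt(math.log(2 / 0.05) / (2 * n))

# The normal-based interval is shorter, at the price of only approximate coverage.
assert normal_halfwidth < hoeffding_halfwidth
```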
3.3 Hypothesis testing
In hypothesis testing, we start with some default theory, called a null hypothesis, and we ask whether the data provide sufficient evidence to reject the theory/null hypothesis. If not, we fail to reject (or accept, or retain) the null hypothesis.
Example 3.6. (Testing if a coin is fair). Let $X_1, \ldots, X_n \sim \mathrm{Bernoulli}(p)$ be $n$ independent coin flips. The goal is to test whether the coin is fair. Let $H_0$ denote the hypothesis that the coin is fair and let $H_1$ denote the hypothesis that the coin is not fair. $H_0$ is called the null hypothesis and $H_1$ is called the alternative hypothesis. We write the hypotheses as
$$H_0 : p = 1/2 \quad \text{versus} \quad H_1 : p \ne 1/2.$$
It seems reasonable to reject $H_0$ if $T = |\hat{p}_n - 1/2|$ is large. When we discuss hypothesis testing in detail, we will be more precise about how large $T$ should be to reject $H_0$.
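One informal way to gauge "how large is large" before the formal treatment is to simulate $T$ under $H_0$ and look at its upper quantile. A Python sketch (a preview under assumed settings, not the course's formal procedure):

```python
import random

random.seed(6)
n, reps = 100, 10_000

# Simulate T = |p_hat - 1/2| under H0: p = 1/2, to see its typical size.
ts = []
for _ in range(reps):
    xs = [1 if random.random() < 0.5 else 0 for _ in range(n)]
    ts.append(abs(sum(xs) / n - 0.5))

# A cutoff exceeded by only ~5% of simulated T values under H0.
ts.sort()
cutoff = ts[int(0.95 * reps)]
assert 0.05 < cutoff < 0.15   # roughly 1.96 * sqrt(0.25 / 100), i.e. about 0.1
```

Values of $T$ above such a cutoff would be surprising if the coin were fair, which is the intuition the formal test will make precise.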
3.4 Appendix
Our definition of a confidence interval requires that $P_\theta(\theta \in C_n) \ge 1 - \alpha$ for all $\theta \in \Theta$. A pointwise asymptotic confidence interval requires that $\liminf_{n \to \infty} P_\theta(\theta \in C_n) \ge 1 - \alpha$ for all $\theta \in \Theta$. A uniform asymptotic confidence interval requires that $\liminf_{n \to \infty} \inf_{\theta \in \Theta} P_\theta(\theta \in C_n) \ge 1 - \alpha$. The approximate normal-based interval is a pointwise asymptotic confidence interval.
3.5 Exercises
1. Let $X_1, \ldots, X_n \sim \mathrm{Poisson}(\lambda)$ and let $\hat{\lambda} = n^{-1} \sum_{i=1}^n X_i$. Find the bias, se and MSE of this estimator (the sample mean).
2. Let $X_1, \ldots, X_n \sim \mathrm{Uniform}(0, \theta)$ and let $\hat{\theta} = \max_{1 \le i \le n} X_i$. Find the bias, se and MSE of this estimator.
3. Let $X_1, \ldots, X_n \sim \mathrm{Uniform}(0, \theta)$ and let $\hat{\theta} = 2\bar{X}_n$. Find the bias, se and MSE of this estimator.
4. (Computer experiment). Generate $n \in \{20, 50, 100, 200\}$ data points from the Bernoulli(1/2) distribution, and use the normal-based method to construct 95% confidence intervals. Compute the empirical coverage probability based on 1000 simulations (set.seed(185)).
