
DEPARTMENT OF STATISTICS AND ACTUARIAL SCIENCE

STAT3600 Linear Statistical Analysis

Chapter 1 Introduction

1 Introduction

1.1 A Motivating Example (Cholesterol Data)

The following dataset records the plasma levels of total cholesterol (in mg/ml) of 24 patients with hyperlipoproteinaemia admitted to a hospital:

3.5 1.9 4.0 2.6 4.5 3.0 2.9 3.8 2.1 3.8 4.1 3.0

2.5 4.6 3.2 4.2 2.3 4.0 4.3 3.9 3.3 3.2 2.5 3.3

Figure 1(a) gives a scatterplot of the data.

Figure 1: Plasma levels of total cholesterol (in mg/ml)

Question: Predict the cholesterol level of the next patient to be admitted to the hospital with hyperlipoproteinaemia.

Intuitive answer: Use the average of the 24 observations, 3.354 (the horizontal reference line in Figure 1(a)). The observations scatter around this average but are subject to considerable fluctuations.

[This is justifiable if, for instance, the observations are i.i.d. In the absence of further information, it seems to be the best we can do.]
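The intuitive answer can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names `cholesterol` and `prediction` are illustrative and do not come from the notes.

```python
from statistics import mean

# Plasma total cholesterol (mg/ml) of the 24 patients, as listed in the notes.
cholesterol = [3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0,
               2.5, 4.6, 3.2, 4.2, 2.3, 4.0, 4.3, 3.9, 3.3, 3.2, 2.5, 3.3]

# With no further information, predict the next patient's level
# by the sample average of the observed levels.
prediction = mean(cholesterol)
print(f"n = {len(cholesterol)}, predicted level = {prediction:.3f} mg/ml")
```

The printed value reproduces the average of 3.354 quoted in the intuitive answer.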

Suppose the hospital has also collected data on the ages of the 24 patients:

46 20 52 30 57 25 28 36 22 43 57 33

22 63 40 48 28 49 52 58 29 34 24 50

Each observation (corresponding to each patient) consists of values of two variables:

(X, Y) = (age, cholesterol level).

Figure 1(b) plots the cholesterol levels (Y) against ages (X) for the dataset. The plot shows a strong linear relationship between X and Y. It therefore seems more reliable to assume a linear function relating age and cholesterol level, and to predict the next patient’s cholesterol level based on his/her age.

Figure 1(b) fits a sloped straight line to the scatterplot by the least squares method (to be discussed later). This straight line summarises the relationship between cholesterol level and age and can be used for predicting future patients’ cholesterol levels. Compared to Figure 1(a), fluctuations of the 24 observations around the sloped straight line are much smaller. A function linear in age (the sloped straight line) can better account for the observed variation in cholesterol level than a simple constant function (the horizontal line).
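As a preview of the least squares method, the sloped line can be computed directly from the data. The sketch below assumes the standard closed-form estimates for a straight line (slope = S_xy / S_xx, intercept = mean of Y minus slope times mean of X); the helper name `fit_line` is hypothetical and the derivation is left to the later chapters.

```python
from statistics import mean

# Ages (years) and plasma total cholesterol (mg/ml) of the 24 patients.
ages = [46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33,
        22, 63, 40, 48, 28, 49, 52, 58, 29, 34, 24, 50]
cholesterol = [3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0,
               2.5, 4.6, 3.2, 4.2, 2.3, 4.0, 4.3, 3.9, 3.3, 3.2, 2.5, 3.3]


def fit_line(x, y):
    """Return (intercept, slope) of the least squares line y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-products and sum of squares about the means.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = fit_line(ages, cholesterol)
print(f"fitted line: cholesterol = {intercept:.3f} + {slope:.4f} * age")
```

A positive slope here corresponds to the upward-sloping line described for Figure 1(b).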

A crucial question of interest to statisticians:

how much “better” is the “sloped straight line” model than the “horizontal line” model?
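One way to put a number on this comparison, offered as a sketch rather than as the method the course will adopt, is to compare the sums of squared deviations of the 24 observations around each line. The names `rss_flat`, `rss_line` and `explained` are illustrative, and the straight line is fitted with the usual closed-form least squares formulas.

```python
from statistics import mean

# Ages (years) and plasma total cholesterol (mg/ml) of the 24 patients.
ages = [46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33,
        22, 63, 40, 48, 28, 49, 52, 58, 29, 34, 24, 50]
cholesterol = [3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0,
               2.5, 4.6, 3.2, 4.2, 2.3, 4.0, 4.3, 3.9, 3.3, 3.2, 2.5, 3.3]

x_bar, y_bar = mean(ages), mean(cholesterol)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, cholesterol))
         / sum((x - x_bar) ** 2 for x in ages))
intercept = y_bar - slope * x_bar

# Squared deviations around the horizontal line y = y_bar.
rss_flat = sum((y - y_bar) ** 2 for y in cholesterol)
# Squared deviations around the fitted sloped line.
rss_line = sum((y - (intercept + slope * x)) ** 2
               for x, y in zip(ages, cholesterol))

# Share of the variation around the mean that the sloped line accounts for.
explained = 1 - rss_line / rss_flat
print(f"horizontal line: {rss_flat:.2f}, sloped line: {rss_line:.2f}")
print(f"proportion of variation accounted for by age: {explained:.2f}")
```

The smaller the second sum is relative to the first, the more of the observed variation in cholesterol level is accounted for by age.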

The example above shows why, in data analysis, it is important to collect data on other variables (e.g. age) relevant to the main variable of interest (e.g. cholesterol level): doing so can yield a model that better explains the observed variation in the main variable.

1.2 General Problem and Terminology

Typical observational or experimental studies involve the drawing of a sample of n observations from a population about which inference is to be made. In general, each observation consists of measurements on a number of variables related to an individual experimental/observational unit sampled from the population.

Variable of primary interest — response or dependent variable

Remaining variables — explanatory or independent variables, also known as regressors or covariates.

Example

1. In an opinion poll, information is collected on n members sampled from a community. Each observation can be represented in the form (sex, age, educational level, ... , opinion).

2. In a study of the property market, n recent transactions are sampled. Each observation may have the form (area, building age, facilities, location, price).

3. A clinical trial is conducted on a sample of n patients, some receiving a new medical treatment and the rest an old treatment. Each observation may have the form (age, sex, past medical record, smoking behaviour, type of medical treatment applied, response to treatment).

4. To study gravitational force, a physicist varies the length of a pendulum and measures its period on n separate occasions, giving n observations of the form (pendulum length, period).

5. An electrician wants to determine the resistance of an electrical circuit. He passes several pre-specified currents through the circuit and measures the corresponding voltages. Each observation may have the form (current, voltage).

Example	Response	Explanatory
1	opinion	sex, age, educational level, etc.
2	price	area, building age, facilities, location
3	response to treatment	age, sex, past medical record, smoking behaviour, type of medical treatment applied
4	period	length of pendulum
5	voltage	current

Our objective in linear modelling is to study the relationship between explanatory and response variables based on the sample collected. Typical questions to ask include:

• can a simple statistical model explain the relation between the response and the explanatory variables?

• does a certain explanatory variable affect the response significantly?

• can we predict a future response based on the values of the explanatory variables?

A variable may be quantitative or qualitative (or categorical).

Quantitative variables can be measured numerically: e.g. age, income, time, temperature, etc. Qualitative variables are not numerical in nature: e.g. sex, categorized age, education level, type of crime committed, style of cuisine served in a restaurant, etc.

[Earlier chapters will focus only on quantitative variables. Later chapters will also consider qualitative explanatory variables.]

1.3 General Procedure

Practical linear modelling consists of the following phases:

1. graphical display of observed data

• scatterplot, scatterplot matrix, etc.

2. formulation of model

• are we fitting a straight line, a parabola or anything else?

• no useful model can fit the observed data perfectly; how do we formulate a model to account for any residual discrepancies?

3. model fitting

• calculate the best estimates of the parameters in the model

• “best” in what sense? how are the estimates calculated? by some kind of optimisation?

4. model adequacy checking

• ascertain the quality of fit

• is the model adequate? should we modify our model to obtain a better fit?

• we may need to iterate the above phases many times before coming up with a satisfactory model.

5. making inference

• try to answer questions of our interest (depending on the context of the problem)

• quantify confidence in our answers using statistical arguments

• make suggestions, conclusions etc.
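Phases 2 to 5 can be sketched on the cholesterol data from Section 1.1 as follows. This is a minimal illustration that assumes a straight-line model fitted by the closed-form least squares formulas; phase 1 (plotting) is omitted because it needs a plotting library, and the new patient's age of 45 is a hypothetical input chosen for illustration.

```python
from statistics import mean

# Ages (years) and plasma total cholesterol (mg/ml) of the 24 patients.
ages = [46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33,
        22, 63, 40, 48, 28, 49, 52, 58, 29, 34, 24, 50]
cholesterol = [3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0,
               2.5, 4.6, 3.2, 4.2, 2.3, 4.0, 4.3, 3.9, 3.3, 3.2, 2.5, 3.3]

# Phase 2 (formulation): straight line plus a residual term,
#   cholesterol = a + b * age + residual.
# Phase 3 (fitting): closed-form least squares estimates of a and b.
x_bar, y_bar = mean(ages), mean(cholesterol)
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, cholesterol))
     / sum((x - x_bar) ** 2 for x in ages))
a = y_bar - b * x_bar

# Phase 4 (adequacy checking): inspect the residuals (observed minus fitted).
residuals = [y - (a + b * x) for x, y in zip(ages, cholesterol)]
worst = max(residuals, key=abs)
print(f"sum of residuals: {sum(residuals):.6f}")  # near zero for a line with intercept
print(f"largest residual in absolute size: {worst:.2f} mg/ml")

# Phase 5 (inference): point prediction for a hypothetical patient aged 45.
new_age = 45
predicted = a + b * new_age
print(f"predicted level at age {new_age}: {predicted:.2f} mg/ml")
```

In practice the residuals would also be plotted against age, and the model revised if they showed a systematic pattern.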

1.4 Uses and Limitations

Uses of linear models include:

• data description —

e.g. model equations are effective mathematical devices for summarising the observed data

• parameter estimation —

e.g. unknown resistance of electrical circuit is estimated by fitting a straight line to a number of current–voltage coordinate pairs and using the formula “voltage = current × resistance”

• prediction and estimation —

e.g. predicting future observations for planning and decision-making purposes

• control —

e.g. adjust the setting of the explanatory variables to yield the desired response level (but this usage is possible only if causality has been confirmed between the explanatory variable and the response).
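The parameter estimation use can be sketched for the resistance example. The current and voltage readings below are made-up illustrative values, not measurements from the notes, and the sketch assumes a straight line through the origin, for which the least squares slope is sum(I * V) / sum(I^2).

```python
# Hypothetical readings, invented purely for illustration:
# pre-specified currents (amperes) and the voltages (volts) measured for each.
currents = [0.5, 1.0, 1.5, 2.0]
voltages = [1.1, 1.9, 3.1, 4.0]

# Model: voltage = resistance * current (a straight line through the origin).
# The least squares estimate of the slope is sum(I * V) / sum(I ** 2).
resistance = (sum(i * v for i, v in zip(currents, voltages))
              / sum(i ** 2 for i in currents))
print(f"estimated resistance: {resistance:.3f} ohms")
```

The estimated slope plays the role of the unknown resistance in “voltage = current × resistance”.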

Linear modelling identifies relationships between variables, but does NOT imply causality.

Causality must be established by further theoretical considerations. Linear modelling analysis only assists in confirming, but not proving, causality. The linear model is only an "approximation" to the true relationship between variables, if there is one. It is useful because it provides a simple and effective "portrait" of the relationship.

Model equations are valid only over the range of the observed data. Extrapolating our inference beyond this range risks committing serious errors.
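The warning about extrapolation can be made concrete with the cholesterol data from Section 1.1. The sketch below assumes the straight-line least squares fit and flags any prediction requested outside the observed age range; the helper name `predict_with_range_check` and the query ages 40 and 90 are hypothetical.

```python
from statistics import mean

# Ages (years) and plasma total cholesterol (mg/ml) of the 24 patients.
ages = [46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33,
        22, 63, 40, 48, 28, 49, 52, 58, 29, 34, 24, 50]
cholesterol = [3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0,
               2.5, 4.6, 3.2, 4.2, 2.3, 4.0, 4.3, 3.9, 3.3, 3.2, 2.5, 3.3]

x_bar, y_bar = mean(ages), mean(cholesterol)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, cholesterol))
         / sum((x - x_bar) ** 2 for x in ages))
intercept = y_bar - slope * x_bar


def predict_with_range_check(age):
    """Return (predicted level, True if age lies within the observed age range)."""
    inside = min(ages) <= age <= max(ages)
    return intercept + slope * age, inside


for age in (40, 90):  # 90 is a hypothetical age outside the observed range
    level, inside = predict_with_range_check(age)
    note = "within observed ages" if inside else "extrapolation, treat with caution"
    print(f"age {age}: predicted {level:.2f} mg/ml ({note})")
```

The fitted equation still returns a number for an age of 90, but nothing in the data supports the straight-line relationship that far beyond the oldest observed patient.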
