DEPARTMENT OF STATISTICS AND ACTUARIAL SCIENCE
STAT3600 Linear Statistical Analysis
Chapter 1 Introduction
1 Introduction
1.1 A Motivating Example (Cholesterol Data)
The following dataset records the plasma levels of total cholesterol (in mg/ml) of 24 patients with hyperlipoproteinaemia admitted to a hospital:
3.5 1.9 4.0 2.6 4.5 3.0 2.9 3.8 2.1 3.8 4.1 3.0
2.5 4.6 3.2 4.2 2.3 4.0 4.3 3.9 3.3 3.2 2.5 3.3
Figure 1(a) gives a scatterplot of the data.
Figure 1: Plasma levels of total cholesterol (in mg/ml)
Question: Predict the cholesterol level of the next patient to be admitted to the hospital with hyperlipoproteinaemia.
Intuitive answer: Use the average of the 24 observations: 3.354 (horizontal reference line in Figure 1(a)). Observations scatter around the average but are subject to considerable fluctuations.
[The above is justifiable if, for instance, the observations are i.i.d. In the absence of further information, this seems to be the best we can do.]
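As a quick check of the figure quoted above, the sample average can be computed directly from the 24 values. This is only a minimal sketch in Python; the variable names are ours and are not part of the course material.

```python
# Plasma total cholesterol levels of the 24 patients (Section 1.1)
cholesterol = [
    3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0,
    2.5, 4.6, 3.2, 4.2, 2.3, 4.0, 4.3, 3.9, 3.3, 3.2, 2.5, 3.3,
]

# The "intuitive" prediction for the next patient: the sample average
average = sum(cholesterol) / len(cholesterol)
print(round(average, 3))  # 3.354
```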
Suppose the hospital has also collected data on the ages of the 24 patients:
46 20 52 30 57 25 28 36 22 43 57 33
22 63 40 48 28 49 52 58 29 34 24 50
Each observation (corresponding to each patient) consists of values of two variables:
(X, Y) = (age, cholesterol level).
Figure 1(b) plots the cholesterol levels (Y) against ages (X) for the dataset. The plot shows a strong linear relationship between X and Y. It therefore seems more reliable to assume a linear function relating age and cholesterol level, and predict the next patient’s cholesterol level based on his/her age.
Figure 1(b) fits a sloped straight line to the scatterplot by the least squares method (to be discussed later). This straight line summarises the relationship between cholesterol level and age and can be used for predicting future patients’ cholesterol levels. Compared to Figure 1(a), fluctuations of the 24 observations around the sloped straight line are much smaller. A function linear in age (the sloped straight line) can better account for the observed variation in cholesterol level than a simple constant function (the horizontal line).
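The least squares method is treated later in the course. As a preview, the sketch below fits a line y = a + b x to the 24 (age, cholesterol level) pairs using the standard closed-form formulas for a single explanatory variable, b = Sxy / Sxx and a = ȳ − b x̄. The variable names, the helper `predict` and the age of 45 used for illustration are our own assumptions, not part of the notes.

```python
# Ages (X) and cholesterol levels (Y) of the 24 patients, in the order given above
ages = [
    46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33,
    22, 63, 40, 48, 28, 49, 52, 58, 29, 34, 24, 50,
]
cholesterol = [
    3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0,
    2.5, 4.6, 3.2, 4.2, 2.3, 4.0, 4.3, 3.9, 3.3, 3.2, 2.5, 3.3,
]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(cholesterol) / n

# Corrected sums of squares and cross-products
s_xx = sum((x - x_bar) ** 2 for x in ages)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, cholesterol))

# Least squares estimates of slope and intercept
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar


def predict(age):
    """Predicted cholesterol level for a patient of the given age."""
    return intercept + slope * age


print(f"fitted line: y = {intercept:.3f} + {slope:.4f} x")
print(f"predicted level at age 45: {predict(45):.2f}")
```

Note that the prediction now depends on the new patient's age, whereas the horizontal line gives the same value for every patient.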
A crucial question of interest to statisticians:
How much “better” is the “sloped straight line” model than the “horizontal line” model?
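One way to make this question concrete, offered here only as an informal preview and not as the formal treatment developed later, is to compare the sum of squared deviations of the observations from each line. The names `rss_horizontal`, `rss_sloped` and `reduction` are our own.

```python
ages = [
    46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33,
    22, 63, 40, 48, 28, 49, 52, 58, 29, 34, 24, 50,
]
cholesterol = [
    3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0,
    2.5, 4.6, 3.2, 4.2, 2.3, 4.0, 4.3, 3.9, 3.3, 3.2, 2.5, 3.3,
]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(cholesterol) / n

# Least squares slope and intercept for the sloped straight line
slope = (
    sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, cholesterol))
    / sum((x - x_bar) ** 2 for x in ages)
)
intercept = y_bar - slope * x_bar

# Squared deviations around the horizontal line (the sample average)
rss_horizontal = sum((y - y_bar) ** 2 for y in cholesterol)

# Squared deviations around the sloped straight line
rss_sloped = sum(
    (y - (intercept + slope * x)) ** 2 for x, y in zip(ages, cholesterol)
)

# Proportion of the variation around the average removed by using age
reduction = 1 - rss_sloped / rss_horizontal

print(f"horizontal line: {rss_horizontal:.2f}")
print(f"sloped line:     {rss_sloped:.2f}")
print(f"reduction:       {reduction:.1%}")
```

A value of `reduction` close to 1 indicates that most of the fluctuation around the average is accounted for by age.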
The above example highlights how important it is, in data analysis, to collect data on other variables (e.g. age) relevant to the main variable of interest (e.g. cholesterol level), in order to obtain a model that better explains the observed variation in the main variable.
1.2 General Problem and Terminology
Typical observational or experimental studies involve the drawing of a sample of n observations from a population about which inference is to be made. In general, each observation consists of measurements on a number of variables related to an individual experimental/observational unit sampled from the population.
Variable of primary interest — response or dependent variable
Remaining variables — explanatory or independent variables, also known as regressors or covariates.
Example
1. In an opinion poll, information is collected on n members sampled from a community. Each observation can be represented in the form (sex, age, educational level, ... , opinion).
2. In a study of the property market, n recent transactions are sampled. Each observation may have the form (area, building age, facilities, location, price).
3. A clinical trial is conducted on a sample of n patients, some receiving a new medical treatment and the rest an old treatment. Each observation may have the form (age, sex, past medical record, smoking behaviour, type of medical treatment applied, response to treatment).
4. To study gravitational force, a physicist varies the length of a pendulum and measures its period on n separate occasions, giving n observations of the form (pendulum length, period).
5. An electrician wants to determine the resistance of an electrical circuit. He passes several pre-specified currents through the circuit and measures the corresponding voltages. Each observation may have the form (current, voltage).
Example | Response | Explanatory
1 | opinion | sex, age, educational level, etc.
2 | price | area, building age, facilities, location
3 | response to treatment | age, sex, past medical record, smoking behaviour, type of medical treatment applied
4 | period | length of pendulum
5 | voltage | current
Our objective in linear modelling is to study the relationship between explanatory and response variables based on the sample collected. Typical questions to ask include:
• can a simple statistical model explain the relation between the response and the explanatory variables?
• does a certain explanatory variable affect the response significantly?
• can we predict a future response based on the values of the explanatory variables?
A variable may be quantitative or qualitative (or categorical).
Quantitative variables can be measured numerically: e.g. age, income, time, temperature etc. Qualitative variables are not numerical in nature: e.g. sex, categorized age, education level, type of crime committed, style of cuisine served in a restaurant etc.
[Earlier chapters will focus only on quantitative variables. Later chapters will also consider qualitative explanatory variables.]
1.3 General Procedure
Practical linear modelling consists of the following phases:
1. graphical display of observed data
• scatterplot, scatterplot matrix, etc.
2. formulation of model
• are we fitting a straight line, a parabola or anything else?
• no useful model can fit the observed data perfectly; how do we formulate a model to account for any residual discrepancies?
3. model fitting
• calculate the best estimates of the parameters in the model
• “best” in the sense of what? how to calculate? by some kind of optimisation?
4. model adequacy checking
• ascertain the quality of fit
• is the model adequate? should we modify our model to obtain a better fit?
• we may need to iterate the above phases many times before coming up with a satisfactory model.
5. making inference
• try to answer questions of our interest (depending on the context of the problem)
• quantify confidence in our answers using statistical arguments
• make suggestions, conclusions etc.
1.4 Uses and Limitations
Uses of linear models include:
• data description —
e.g. model equations are effective mathematical devices for summarising the observed data
• parameter estimation —
e.g. unknown resistance of electrical circuit is estimated by fitting a straight line to a number of current–voltage coordinate pairs and using the formula “voltage = current × resistance”
• prediction and estimation —
e.g. predicting future observations for planning and decision-making purposes
• control —
e.g. adjusting the settings of the explanatory variables to yield the desired response level (but this usage is possible only if causality has been confirmed between the explanatory variable and the response).
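To illustrate the parameter estimation use listed above, the sketch below estimates a resistance from current–voltage pairs. The readings are made up purely for illustration and are not data from the notes. We also assume, as one possible choice, that the fitted line is forced through the origin so that its slope plays the role of the resistance in “voltage = current × resistance”.

```python
# Made-up (current in amperes, voltage in volts) readings, for illustration only
currents = [0.5, 1.0, 1.5, 2.0, 2.5]
voltages = [5.1, 9.8, 15.15, 19.95, 25.1]

# Least squares slope of a straight line through the origin,
# voltage = resistance * current, which minimises
# sum((v - resistance * i) ** 2) over all readings
resistance_hat = (
    sum(i * v for i, v in zip(currents, voltages))
    / sum(i * i for i in currents)
)

print(f"estimated resistance: {resistance_hat:.2f} ohms")
```

If a non-zero intercept were plausible (for instance because of a measurement offset), a line with both intercept and slope could be fitted instead.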
Linear modelling identifies relationships between variables, but does NOT imply causality.
Causality must be established by further theoretical considerations. Linear modelling analysis only assists in confirming, but not proving, causality. The linear model is only an "approximation" to the true relationship between variables, if there is one. It is useful because it provides a simple and effective "portrait" of the relationship.
Model equations are valid only over the range of the observed data. Extrapolating our inference beyond this range risks committing serious errors.
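The warning about extrapolation can be built into a prediction routine. The sketch below refuses to predict outside the range of the observed explanatory values. The helper `predict_within_range` is hypothetical, and the line used to demonstrate it has rounded, illustrative coefficients (roughly what a least squares fit to the data of Section 1.1 gives), not values quoted in the notes.

```python
# Ages of the 24 patients from Section 1.1
ages = [
    46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33,
    22, 63, 40, 48, 28, 49, 52, 58, 29, 34, 24, 50,
]


def predict_within_range(model, x_new, x_observed):
    """Apply `model` to x_new only if x_new lies within the observed range."""
    lowest, highest = min(x_observed), max(x_observed)
    if not lowest <= x_new <= highest:
        raise ValueError(
            f"x = {x_new} lies outside the observed range [{lowest}, {highest}]"
        )
    return model(x_new)


def line(age):
    """Illustrative straight line with rounded, hypothetical coefficients."""
    return 1.28 + 0.0526 * age


# Age 40 lies within the observed ages, so a prediction is returned
inside = predict_within_range(line, 40, ages)

# Age 90 lies beyond the oldest observed patient, so the request is refused
try:
    predict_within_range(line, 90, ages)
    extrapolation_refused = False
except ValueError:
    extrapolation_refused = True
```

Raising an error is one design choice; issuing a warning and still returning the value would be a milder alternative.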