Assignment 3 – FRE 528 - Team Assignment
Objective of Assignment
Your assignment is to develop a linear regression model that predicts or explains selling price using as many as 6
other variables. Your team will collect real estate listing information and then apply backward stepwise
regression to fit and validate a linear regression model (as best as possible) to your real estate data.
Data Collection
In your first team meeting, take a look at https://www.realtor.ca/ and decide upon a city that you would like to
focus on (e.g., Vancouver). Either click “Search Map” or type in the city or neighborhood you are interested
in to carry out this real estate study. The default, “Residential Properties”, should be used in this assignment.
The city should have a population of at least 200,000 people. You can modify the building type by
selecting “Options”. Zoom into your suburb/area until there are at least 250 property listings and the property
listings are shown on the right side of the webpage. You can refine your search by selecting (1) Building Type =
House, and (2) Style = Detached. Write down the boundaries of your data collection (e.g., Kitsilano: West of
Burrard St. and North of 29th). Save your search by clicking the “Save Search” button at the top right-hand
corner of the webpage. This will allow you to retrieve your search listings at a later time.
Randomly select (as best as possible) 100 listings from the total number of property listings. The 100 listings
will constitute your sample. MLS search listings are displayed in order of price, from lowest to highest, on the right side
of the webpage. Try to randomly select homes across the spectrum of prices that are displayed in your search
listing. For example, suppose you have a total of 250 listings of homes in your selected area. On the right
side of the webpage, the cheapest 12 of the 250 listings are shown first, and with each subsequent page the prices
increase. Thus, for this example, there are 250/12 ≈ 21 pages of listings for the 250 houses.
Therefore, try to select approximately 5 listings from each of the 21 pages (rather than selecting all listings from one page)
so that the data collected is a better representation of the prices in the area. Collect the following information
(8 variables) from each of the 100 real estate property listings:
Y = listing price
MLS Listing # (Located under the price of every listing)
X1 = interior floor space (square footage of the house)
X2 = land size (square footage). This is usually given as the length of the front of the lot (in ft) x the depth
of the lot (in ft). Sizes vary, but a common lot size in Vancouver is 33 x 120. You will need to convert this to square
feet (area) before entering it into your spreadsheet (e.g., 33 x 120 = 3,960).
X3 = number of bedrooms
X4 = number of bathrooms
X5 = age of building (often listed as a “Built in” year, so you will need to calculate the age!)
X6 = a variable of your team’s choice! You can use anything here. Perhaps a binary (indicator) variable?
Key all data into an Excel spreadsheet.
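The two derived variables above (X2 and X5) can be computed before keying the data in. The following is a minimal Python sketch, not part of the required Stata/Excel workflow; the function names `lot_area_sqft` and `building_age` are hypothetical, and it assumes lot dimensions are written as "frontage x depth" in feet.

```python
from datetime import date
from typing import Optional


def lot_area_sqft(dimensions: str) -> float:
    """Convert a 'frontage x depth' string in feet (e.g. '33 x 120') to square feet."""
    # Accept either a plain 'x' or a multiplication sign between the two numbers.
    frontage, depth = dimensions.lower().replace("×", "x").split("x")
    return float(frontage) * float(depth)


def building_age(built_year: int, current_year: Optional[int] = None) -> int:
    """Age of the building in years, from its 'Built in' year."""
    if current_year is None:
        current_year = date.today().year  # assumption: age is measured as of today
    return current_year - built_year


print(lot_area_sqft("33 x 120"))              # 3960.0
print(building_age(1990, current_year=2020))  # 30
```

Fixing `current_year` for the whole team keeps X5 consistent if the data is collected over several sessions.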
Note: some listings will not have all the data. If you are collecting data in a geographical area that does not
provide the information above, please move on to another area. You will find that suburban areas around the
Lower Mainland provide the above information quite readily. Also note: you can save your MLS search and reuse
it later by emailing yourself the browser URL. Simply copy and paste the URL from the MLS webpage into an email
and send it to yourself so you can continue your data collection at another time.
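The page-by-page sampling described in this section can be planned in advance. The sketch below is one possible way to do it in Python, not a required method: it spreads the sample as evenly as possible across the result pages and picks random positions on each page. The function name `sampling_plan` and its parameters are hypothetical, and it assumes a fixed number of listings per page (12 in the example above).

```python
import math
import random


def sampling_plan(total_listings, per_page, sample_size, seed=None):
    """Return {page number: sorted positions to pick on that page}."""
    rng = random.Random(seed)
    pages = math.ceil(total_listings / per_page)
    # Every page gets `base` picks; `extra` randomly chosen pages get one more.
    base, extra = divmod(sample_size, pages)
    bonus_pages = set(rng.sample(range(pages), extra))
    plan = {}
    for page in range(pages):
        # The last page may hold fewer listings than the others.
        on_page = min(per_page, total_listings - page * per_page)
        picks = min(on_page, base + (1 if page in bonus_pages else 0))
        plan[page + 1] = sorted(rng.sample(range(1, on_page + 1), picks))
    return plan


plan = sampling_plan(total_listings=250, per_page=12, sample_size=100, seed=1)
print(len(plan), sum(len(p) for p in plan.values()))  # 21 100
```

If a chosen listing is missing some of the 8 variables, pick a replacement from the same page so that the price spread is preserved. Note that when the last page is very short, the plan can come out slightly below the requested sample size.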
Use Stata or Excel to complete the following steps:
1. Check that the assumptions required by a linear regression model are valid by creating 6 scatter plots
and providing appropriate comments. The regression assumptions are 1) linearity, 2) constant variance,
3) normality, and 4) independence. For an initial check of the assumptions, please create 6 scatter plots: Y
versus each Xi for i = 1, 2, …, 6 and comment on whether an “approximate”
linear relationship is present. Are any of your variables potentially problematic from this perspective? (Answer
in, at most, one bullet point per scatter plot). Note: You are expected to provide at least a scatter, residual
and normal probability plot for Y versus each Xi for i = 1, 2, …, 6 in an effort to reflect on the regression
assumptions.
2. Check for potential multicollinearity by creating a correlation matrix. A correlation with absolute value
greater than 0.6 between any two X-variables indicates that they are somewhat redundant and MAY cause problems in your
analysis. Do you have any such potential problems? (Answer in, at most, three bullet points identifying
variables with potential collinearity issues.)
3. Now, determine the regression model that best predicts selling price. (Proceed regardless of any potential
problems you see in steps 1 or 2.)
(a) You will be performing Backward Step-Wise Regression. Run a multiple regression of Y on all six X-
variables together and review the output.
(b) Is the overall model significant? Carry out an overall F-test to determine this. Be certain to state the
hypotheses, your decision rule and provide a concluding statement in the context of the problem.
(c) Are some of the X-variables clearly insignificant and others apparently significant? Choose one X-
variable that seems to be the most insignificant and eliminate it from your analysis. Run another
regression of Y on the remaining X-variables. Review the output.
(d) If the reduced model is still inadequate (i.e., insignificant X-variables are still present), repeat step (c)
and try to reduce it further. For each reduction, provide a clear statement why a particular
independent variable was eliminated.
(e) Continue this iterative process of running another regression until you have discarded all variables
that are not significant in predicting the selling price. This is your final reduced model. Your final
reduced model should be checked for multicollinearity by comparing the signs of the
coefficients to the correlation matrix. Remember that multicollinearity MAY surface in more
ways than just high p-values: incorrect “signs” in front of coefficients need to be assessed in
relationship to the correlation matrix!
(f) What is the form of your final TRUE REGRESSION model? State your final model using the population
parameters. (It should look something like Y = β1 + β2X2 + β3X3 + … + ε with the Y and all Xi clearly
defined.)
(g) What is your estimate of the final TRUE REGRESSION model? (It should look something like ŷ = b1 + b2X2 +
… but you should have the coefficient estimates from the regression output in place of the bs.)
(h) Clearly state the meaning of all regression coefficients in part g) in the context of this problem.
(i) Provide a proper Hypothesis Test at the 5% significance level on your final estimated regression
model to demonstrate whether there is a significant linear relationship between each independent variable and
the dependent variable.
(j) How good is the fit of your model? Quote a measure from the regression output. Provide a clear
statement of its meaning.
(k) Generate a residual plot on the final reduced model (residuals versus fitted values) and provide
commentary on what it shows. Also, compute the Variance Inflation Factors (VIF) for your final reduced model and
comment on the output.
(l) Use your estimated regression model to predict the selling price of a house in your selected city of
choice. Select any reasonable values for your independent variables in your calculation. Create a 95%
prediction interval of your predicted selling price as conducted in class.
(m) Does this model seem to be a good predictor of real estate market value? Draw a conclusion with your
team member. State why you believe the model is or is not a good predictor. Also suggest
ways that the model could be improved.
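The overall F-test in step 3(b) and the fit measure in step 3(j) both follow from the sums of squares that Stata and Excel report. As a rough illustration only, the Python sketch below recomputes them from observed and fitted values; the function name `fit_summary` and the eight data points are made up, and the decomposition SSR = SST − SSE assumes a least-squares fit that includes an intercept.

```python
import math


def fit_summary(y, fitted, k):
    """R-squared, adjusted R-squared, overall F and root MSE for k predictors."""
    n = len(y)
    mean_y = sum(y) / n
    sst = sum((v - mean_y) ** 2 for v in y)             # total variation in Y
    sse = sum((v - f) ** 2 for v, f in zip(y, fitted))  # unexplained variation
    ssr = sst - sse                                     # explained variation
    df_error = n - k - 1
    return {
        "r2": ssr / sst,
        "adj_r2": 1 - (sse / df_error) / (sst / (n - 1)),
        "f": (ssr / k) / (sse / df_error),
        "root_mse": math.sqrt(sse / df_error),
    }


# Made-up data: 8 observations and the fitted values of a one-predictor model.
y = [14, 15, 18, 23, 26, 27, 30, 35]
fitted = [13, 16, 19, 22, 25, 28, 31, 34]
summary = fit_summary(y, fitted, k=1)
print(round(summary["r2"], 4), round(summary["f"], 1))  # 0.9793 283.5
```

The F statistic is compared against the critical value of the F distribution with k and n − k − 1 degrees of freedom (or, equivalently, its p-value is compared against 0.05).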
Deliverable in Hard Copy: ONLY 1 HARDCOPY PER TEAM
Please see Page 4 of this assignment for suggestions on how to work between Stata/Excel and Word so that
you can properly format this assignment using the following guidelines:
First 6 pages (approx.): State the community where (approximate is fine) you conducted your data collection
(e.g., Kitsilano area west of Burrard St. and north of 29th Ave.). Include six scatter, residual and normal probability plots
and comments about what they tell you about the validity of the assumptions implicit in your use of a linear
regression model. Make sure to properly label your scatter plots with appropriate labels and a title.
7th page: A correlation matrix and a comment or two on whether it indicates any potential problems.
8th page: The regression output using all independent variables and the overall F-test (formally stated with
hypotheses, decision rule and conclusion).
9 – 10th page: Each regression output that was reduced along with one to two sentences stating why a
particular X-variable was eliminated from the regression analysis.
Last 2-3 pages: Your regression output from your final model and your written answers to Steps 3(f) through 3(m).
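The correlation check from step 2 (reported on the 7th page) amounts to screening every pair of X-variables against the 0.6 cutoff. The following Python sketch shows one way to do that outside Stata or Excel; the function names `pearson` and `flag_collinear` are hypothetical and the five-row data set is invented purely for illustration.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


def flag_collinear(predictors, threshold=0.6):
    """List (name, name, r) for every pair of predictors with |r| above the threshold."""
    flagged = []
    for a, b in combinations(predictors, 2):
        r = pearson(predictors[a], predictors[b])
        if abs(r) > threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged


# Invented example values, not real listings.
data = {
    "sqft": [1000, 1500, 2000, 2500, 3000],
    "bed": [2, 3, 3, 4, 5],
    "age": [10, 40, 5, 30, 20],
}
print(flag_collinear(data))  # [('sqft', 'bed', 0.971)]
```

A flagged pair is only a warning sign; the assignment still asks you to proceed with the regression and revisit the issue when checking coefficient signs in the final model.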
General Tips and Instructions:

Assume a level of significance of 0.05 throughout this analysis. It is strongly recommended that you collect the data
using Excel and then import your data file into Stata. The analysis will be quicker using Stata and it will facilitate
the assumption tests on your final model.
Make sure to use an identifiable “label” when inputting your data into Stata or Excel. Marks will be deducted if
we cannot understand the slopes in your regression output. For example, all regression outputs should have an
understandable variable name as the label for each independent variable, as shown in the example below:

       price |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
        sqft |   231.7052   67.28634     3.44    0.001     98.12497    365.2855
        land |   145.8664   19.88146     7.34    0.000     106.3967    185.3361
         bed |  -110.6739   43.11002    -2.57    0.012    -196.2581   -25.08965
         age |  -16.80214   2.760266    -6.09    0.000    -22.28197   -11.32232
       _cons |   634.6526   223.1848     2.84    0.005     191.5748     1077.73

All commentary (although there should be very little) should be in bullet points, that is, in concise phrases.
The process that you should be using to find the final, reduced model is to always eliminate just one variable in
each iteration (choosing the least significant variable each time). Your final regression model will be a simple or
multiple regression model with statistically significant independent variable(s). Once you have reduced the model
so that it has only significant independent variables, interpret the quality of the fit of the model.
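The one-variable-at-a-time elimination described above can be written as a short loop. The Python sketch below is only an illustration of the logic, not a replacement for running each regression in Stata or Excel and reporting its output. The function names, the tiny data set and the critical value `t_crit` are assumptions; within one regression the variable with the smallest |t| is also the one with the largest p-value, so the sketch compares |t| against a critical value that you would look up for your own degrees of freedom.

```python
import math


def invert(A):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [v - f * w for v, w in zip(M[r], M[c])]
    return [row[n:] for row in M]


def ols_t_stats(y, predictors):
    """Regress y on an intercept plus the named predictors; return {name: (coef, t)}."""
    names = list(predictors)
    n, k = len(y), len(names) + 1
    rows = [[1.0] + [predictors[name][i] for name in names] for i in range(n)]
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(k)) for a in range(k)]
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(rows, y)]
    s2 = sum(e * e for e in resid) / (n - k)  # residual variance
    return {name: (beta[j + 1], beta[j + 1] / math.sqrt(s2 * inv[j + 1][j + 1]))
            for j, name in enumerate(names)}


def backward_eliminate(y, predictors, t_crit):
    """Drop the least significant predictor, one per iteration, until all have |t| >= t_crit."""
    kept, dropped = dict(predictors), []
    while kept:
        stats = ols_t_stats(y, kept)
        weakest = min(stats, key=lambda name: abs(stats[name][1]))
        if abs(stats[weakest][1]) >= t_crit:
            break
        dropped.append(weakest)
        del kept[weakest]
    return list(kept), dropped


# Made-up data: y depends on x1 only; x2 is unrelated to y.
y = [14, 15, 18, 23, 26, 27, 30, 35]
X = {"x1": [1, 2, 3, 4, 5, 6, 7, 8], "x2": [1, -1, 1, -1, -1, 1, -1, 1]}
print(backward_eliminate(y, X, t_crit=2.571))  # (['x1'], ['x2'])
```

The critical value 2.571 is the two-sided 5% value for 5 degrees of freedom, matching the toy data; it changes slightly each time a variable is dropped, which is why reading p-values from the regression output is the more practical rule for the assignment.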
FRE 528 Students: E-mail all your electronic files (Stata Do-file, the collected Excel data, *.dta files and Word
files), including the regression outputs, etc. that have been pasted into Word, to us (by the submission deadline) at:
Please place the last names of the 2 students in your team in the “Subject” line of the
email when emailing us your work.
Formatting for Assignment #3
Please copy and paste your scatter plots and regression outputs from Stata to Word using the following method so
that they are properly formatted and legible.
1. Highlight your regression output (or scatter plot) with your mouse in Stata and select “copy as picture”:
2. Open Word and select Paste
4. The item will be pasted as a “Picture (Enhanced Metafile)”. You will now be able to select the corner of
the pasted picture (the regression output or other Excel picture) and modify its size so that it is legible
and clear to the reader.