# 2.3.4.4E: The Regression Equation (Exercise)

Use the following information to answer the next five exercises. A random sample of ten professional athletes produced the following data where \(x\) is the number of endorsements the player has and \(y\) is the amount of money made (in millions of dollars).

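| \(x\) | \(y\) | \(x\) | \(y\) |
|---|---|---|---|
| 0 | 2 | 5 | 12 |
| 3 | 8 | 4 | 9 |
| 2 | 7 | 3 | 9 |
| 1 | 3 | 0 | 3 |
| 5 | 13 | 4 | 10 |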

Exercise 12.4.2

Draw a scatter plot of the data.

Exercise 12.4.3

Use regression to find the equation for the line of best fit.

\(\hat{y} = 2.23 + 1.99x\)
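As a quick check, the fitted equation can be reproduced in R. This is only a minimal sketch: the object names `endorsements`, `pay`, and `fit` are illustrative and not part of the exercise, and the values are typed in from the data table for this exercise set.

```r
# Data from the exercise: number of endorsements (x) and pay in millions of dollars (y)
endorsements <- c(0, 3, 2, 1, 5, 5, 4, 3, 0, 4)
pay          <- c(2, 8, 7, 3, 13, 12, 9, 9, 3, 10)

# Ordinary least squares fit of pay on endorsements
fit <- lm(pay ~ endorsements)

# Intercept and slope, rounded to two decimals for comparison with the answer
round(coef(fit), 2)
```

The rounded intercept and slope should agree with the coefficients 2.23 and 1.99 in the stated answer.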

Exercise 12.4.4

Draw the line of best fit on the scatter plot.

Exercise 12.4.5

What is the slope of the line of best fit? What does it represent?

The slope is 1.99 (\(b = 1.99\)). It means that for every endorsement deal a professional player gets, he gets an average of another $1.99 million in pay each year.

Exercise 12.4.6

What is the \(y\)-intercept of the line of best fit? What does it represent?

Exercise 12.4.7

What does an \(r\) value of zero mean?

Answer: It means that there is no linear correlation between the two variables.

Exercise 12.4.8

When \(n = 2\) and \(r = 1\), are the data significant? Explain.

Exercise 12.4.9

When \(n = 100\) and \(r = -0.89\), is there a significant correlation? Explain.
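One common way to approach the last two exercises is the t test for a correlation coefficient, which uses \(t = r\sqrt{n-2}/\sqrt{1-r^2}\) with \(n - 2\) degrees of freedom. The sketch below assumes that test is the one intended; the helper name `r_p_value` is hypothetical.

```r
# Two-sided p-value for H0: population correlation = 0
# (usual t test; assumes at least one degree of freedom)
r_p_value <- function(r, n) {
  df <- n - 2
  if (df < 1) return(NA_real_)          # n = 2 leaves no degrees of freedom
  t_stat <- r * sqrt(df) / sqrt(1 - r^2)
  2 * pt(-abs(t_stat), df)
}

r_p_value(1, 2)        # NA: two points always lie exactly on a line
r_p_value(-0.89, 100)  # far below 0.05
```

With only two points the test cannot be carried out at all, which is the point of Exercise 12.4.8: \(r = 1\) is guaranteed and carries no evidence.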

## The Least Squares Regression Line

Given any collection of pairs of numbers (except when all the x-values are the same) and the corresponding scatter diagram, there always exists exactly one straight line that fits the data better than any other, in the sense of minimizing the sum of the squared errors. It is called the least squares regression line. Moreover there are formulas for its slope and y-intercept.

### Definition

Given a collection of pairs \((x, y)\) of numbers (in which not all the x-values are the same), there is a line \(\hat{y} = \hat{\beta}_1 x + \hat{\beta}_0\) that best fits the data in the sense of minimizing the sum of the squared errors. It is called the least squares regression line. Its slope \(\hat{\beta}_1\) and y-intercept \(\hat{\beta}_0\) are computed using the formulas

\[ \hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]

where

\[ SS_{xx} = \sum x^2 - \frac{1}{n}\left(\sum x\right)^2, \qquad SS_{xy} = \sum xy - \frac{1}{n}\left(\sum x\right)\left(\sum y\right) \]

\(\bar{x}\) is the mean of all the x-values, \(\bar{y}\) is the mean of all the y-values, and \(n\) is the number of pairs in the data set.

The equation \(\hat{y} = \hat{\beta}_1 x + \hat{\beta}_0\) specifying the least squares regression line is called the least squares regression equation.

Remember from Section 10.3 "Modelling Linear Relationships with Randomness Present" that the line with the equation \(y = \beta_1 x + \beta_0\) is called the population regression line. The numbers \(\hat{\beta}_1\) and \(\hat{\beta}_0\) are statistics that estimate the population parameters \(\beta_1\) and \(\beta_0\).
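The slope and intercept formulas translate almost line for line into R. The sketch below is illustrative only: the function name `ls_line` and the four-point toy data are made up for the demonstration and do not come from the text.

```r
# Least squares slope and intercept from the SS_xx and SS_xy formulas
ls_line <- function(x, y) {
  n     <- length(x)
  ss_xx <- sum(x^2) - sum(x)^2 / n
  ss_xy <- sum(x * y) - sum(x) * sum(y) / n
  slope     <- ss_xy / ss_xx
  intercept <- mean(y) - slope * mean(x)
  c(intercept = intercept, slope = slope)
}

# Hypothetical toy data lying exactly on y = 2x + 1
ls_line(c(1, 2, 3, 4), c(3, 5, 7, 9))
```

Because the toy points lie exactly on a line, the function should return an intercept of 1 and a slope of 2, matching what `lm()` reports for the same data.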

We will compute the least squares regression line for the five-point data set, then for a more practical example that will be another running example for the introduction of new concepts in this and the next three sections.

### Example 2

Find the least squares regression line for the five-point data set

| \(x\) | 2 | 2 | 6 | 8 | 10 |
|---|---|---|---|---|---|
| \(y\) | 0 | 1 | 2 | 3 | 3 |

and verify that it fits the data better than the line \(\hat{y} = \frac{1}{2}x - 1\) considered in Section 10.4.1 "Goodness of Fit of a Straight Line to Data".

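In actual practice computation of the regression line is done using a statistical computation package. In order to clarify the meaning of the formulas we display the computations in tabular form.

| | \(x\) | \(y\) | \(x^2\) | \(xy\) |
|---|---|---|---|---|
| | 2 | 0 | 4 | 0 |
| | 2 | 1 | 4 | 2 |
| | 6 | 2 | 36 | 12 |
| | 8 | 3 | 64 | 24 |
| | 10 | 3 | 100 | 30 |
| \(\Sigma\) | 28 | 9 | 208 | 68 |

In the last line of the table we have the sum of the numbers in each column. Using them we compute: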

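\[ \begin{aligned} SS_{xx} &= \sum x^2 - \frac{1}{n}\left(\sum x\right)^2 = 208 - \frac{1}{5}(28)^2 = 51.2 \\ SS_{xy} &= \sum xy - \frac{1}{n}\left(\sum x\right)\left(\sum y\right) = 68 - \frac{1}{5}(28)(9) = 17.6 \\ \bar{x} &= \frac{\sum x}{n} = \frac{28}{5} = 5.6 \\ \bar{y} &= \frac{\sum y}{n} = \frac{9}{5} = 1.8 \end{aligned} \]

\[ \hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{17.6}{51.2} = 0.34375 \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 1.8 - (0.34375)(5.6) = -0.125 \]

The least squares regression line for these data is

\[ \hat{y} = 0.34375x - 0.125 \]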

The computations for measuring how well it fits the sample data are given in Table 10.2 "The Errors in Fitting Data with the Least Squares Regression Line". The sum of the squared errors is the sum of the numbers in the last column, which is 0.75. It is less than 2, the sum of the squared errors for the fit of the line \(\hat{y} = \frac{1}{2}x - 1\) to this data set.

Table 10.2 The Errors in Fitting Data with the Least Squares Regression Line

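| \(x\) | \(y\) | \(\hat{y} = 0.34375x - 0.125\) | \(y - \hat{y}\) | \((y - \hat{y})^2\) |
|---|---|---|---|---|
| 2 | 0 | 0.5625 | −0.5625 | 0.31640625 |
| 2 | 1 | 0.5625 | 0.4375 | 0.19140625 |
| 6 | 2 | 1.9375 | 0.0625 | 0.00390625 |
| 8 | 3 | 2.6250 | 0.3750 | 0.14062500 |
| 10 | 3 | 3.3125 | −0.3125 | 0.09765625 |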
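The comparison of the two candidate lines can be checked numerically. The following is a small sketch in R; the helper name `sse` is illustrative, and the data are the five points listed in Table 10.2.

```r
# Five-point data set (x and y columns of Table 10.2)
x <- c(2, 2, 6, 8, 10)
y <- c(0, 1, 2, 3, 3)

# Sum of squared errors for the line y-hat = slope * x + intercept
sse <- function(slope, intercept) sum((y - (slope * x + intercept))^2)

sse(0.34375, -0.125)  # least squares regression line
sse(0.5, -1)          # the line y-hat = x/2 - 1
```

The first call should reproduce the 0.75 obtained by summing the last column of the table, and the second the value 2 quoted for the competing line.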

### Example 3

Table 10.3 "Data on Age and Value of Used Automobiles of a Specific Make and Model" shows the age in years and the retail value in thousands of dollars of a random sample of ten automobiles of the same make and model.

1. Construct the scatter diagram.
2. Compute the linear correlation coefficient r. Interpret its value in the context of the problem.
3. Compute the least squares regression line. Plot it on the scatter diagram.
4. Interpret the meaning of the slope of the least squares regression line in the context of the problem.
5. Suppose a four-year-old automobile of this make and model is selected at random. Use the regression equation to predict its retail value.
6. Suppose a 20-year-old automobile of this make and model is selected at random. Use the regression equation to predict its retail value. Interpret the result.
7. Comment on the validity of using the regression equation to predict the price of a brand new automobile of this make and model.

Table 10.3 Data on Age and Value of Used Automobiles of a Specific Make and Model

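| \(x\) | 2 | 3 | 3 | 3 | 4 | 4 | 5 | 5 | 5 | 6 |
|---|---|---|---|---|---|---|---|---|---|---|
| \(y\) | 28.7 | 24.8 | 26.0 | 30.5 | 23.8 | 24.6 | 23.8 | 20.4 | 21.6 | 22.1 |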

Figure 10.7 Scatter Diagram for Age and Value of Used Automobiles

We must first compute \(SS_{xx}\), \(SS_{xy}\), \(SS_{yy}\), which means computing \(\sum x\), \(\sum y\), \(\sum x^2\), \(\sum y^2\), and \(\sum xy\). Using a computing device we obtain

\[ \sum x = 40 \qquad \sum y = 246.3 \qquad \sum x^2 = 174 \qquad \sum y^2 = 6154.15 \qquad \sum xy = 956.5 \]

\[ \begin{aligned} SS_{xx} &= \sum x^2 - \frac{1}{n}\left(\sum x\right)^2 = 174 - \frac{1}{10}(40)^2 = 14 \\ SS_{xy} &= \sum xy - \frac{1}{n}\left(\sum x\right)\left(\sum y\right) = 956.5 - \frac{1}{10}(40)(246.3) = -28.7 \\ SS_{yy} &= \sum y^2 - \frac{1}{n}\left(\sum y\right)^2 = 6154.15 - \frac{1}{10}(246.3)^2 = 87.781 \end{aligned} \]

\[ r = \frac{SS_{xy}}{\sqrt{SS_{xx} \cdot SS_{yy}}} = \frac{-28.7}{\sqrt{(14)(87.781)}} = -0.819 \]

The age and value of this make and model automobile are moderately strongly negatively correlated. As the age increases, the value of the automobile tends to decrease.

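Using the values of \(\sum x\) and \(\sum y\) computed in part (b),

\[ \bar{x} = \frac{\sum x}{n} = \frac{40}{10} = 4 \quad \text{and} \quad \bar{y} = \frac{\sum y}{n} = \frac{246.3}{10} = 24.63 \]

Thus using the values of \(SS_{xx}\) and \(SS_{xy}\) from part (b),

\[ \hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{-28.7}{14} = -2.05 \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 24.63 - (-2.05)(4) = 32.83 \]

The equation \(\hat{y} = \hat{\beta}_1 x + \hat{\beta}_0\) of the least squares regression line for these sample data is

\[ \hat{y} = -2.05x + 32.83 \]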

Figure 10.8 "Scatter Diagram and Regression Line for Age and Value of Used Automobiles" shows the scatter diagram with the graph of the least squares regression line superimposed.

Figure 10.8 Scatter Diagram and Regression Line for Age and Value of Used Automobiles
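The hand computations for this example can be cross-checked with a statistical package, as the text suggests is done in practice. A minimal sketch in R follows; the object names `age`, `value`, and `fit` are illustrative, and the data are those of Table 10.3.

```r
# Table 10.3: age in years and retail value in thousands of dollars
age   <- c(2, 3, 3, 3, 4, 4, 5, 5, 5, 6)
value <- c(28.7, 24.8, 26.0, 30.5, 23.8, 24.6, 23.8, 20.4, 21.6, 22.1)

# Linear correlation coefficient r
cor(age, value)

# Least squares regression of value on age: intercept and slope
fit <- lm(value ~ age)
coef(fit)

# Scatter diagram with the fitted line superimposed
plot(age, value, pch = 19, xlab = "Age (years)", ylab = "Value ($1000s)")
abline(fit)
```

The correlation should come out near −0.819, and the coefficients should match the intercept 32.83 and slope −2.05 computed by hand.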

The slope −2.05 means that for each unit increase in x (additional year of age) the average value of this make and model vehicle decreases by about 2.05 units (about $2,050).

Since we know nothing about the automobile other than its age, we assume that it is of about average value and use the average value of all four-year-old vehicles of this make and model as our estimate. The average value is simply the value of \(\hat{y}\) obtained when the number 4 is inserted for x in the least squares regression equation:

\[ \hat{y} = -2.05(4) + 32.83 = 24.63 \]

which corresponds to $24,630.
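In software, the same prediction is usually obtained from the fitted model rather than by substituting by hand. The sketch below assumes the Table 10.3 data; the helper `in_range` is a hypothetical convenience for checking that a new x-value falls within the ages actually observed before predicting.

```r
# Table 10.3 data and fitted regression of value on age
age   <- c(2, 3, 3, 3, 4, 4, 5, 5, 5, 6)
value <- c(28.7, 24.8, 26.0, 30.5, 23.8, 24.6, 23.8, 20.4, 21.6, 22.1)
fit   <- lm(value ~ age)

# Hypothetical helper: is a new age inside the observed range of ages?
in_range <- function(x0) x0 >= min(age) & x0 <= max(age)

in_range(4)                                  # TRUE: ages run from 2 to 6
predict(fit, newdata = data.frame(age = 4))  # predicted value in $1000s
```

The prediction should agree with the 24.63 found above.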

Now we insert x = 20 into the least squares regression equation, to obtain

\[ \hat{y} = -2.05(20) + 32.83 = -8.17 \]

which corresponds to −$8,170. Something is wrong here, since a negative value makes no sense. The error arose from applying the regression equation to a value of x not in the range of x-values in the original data, from two to six years.

Applying the regression equation \(\hat{y} = \hat{\beta}_1 x + \hat{\beta}_0\) to a value of x outside the range of x-values in the data set is called extrapolation. It is an invalid use of the regression equation and should be avoided.

For emphasis we highlight the points raised by parts (f) and (g) of the example.

### Definition

The process of using the least squares regression equation to estimate the value of y at a value of x that does not lie in the range of the x-values in the data set that was used to form the regression line is called extrapolation. It is an invalid use of the regression equation that can lead to errors, hence should be avoided.

## Assignment: Linear Regression Exercises

Research Question: Does the number of hours worked per week (workweek) predict family income (income)?

Using Polit2SetA data set, run a simple regression using Family Income (income) as the outcome variable (Y) and Number of Hours Worked per Week (workweek) as the independent variable (X). When conducting any regression analysis, the dependent (outcome) variable is always (Y) and is placed on the y-axis, and the independent (predictor) variable is always (X) and is placed on the x-axis.

Follow these steps when using SPSS:

1. Open Polit2SetA data set.
2. Click on Analyze, then click on Regression, then Linear.
3. Move the dependent variable (income) into the box labeled “Dependent” by clicking the arrow button. The dependent variable is a continuous variable.
4. Move the independent variable (workweek) into the box labeled “Independent.”
5. Click on the Statistics button (right side of box) and click on Descriptives, Estimates, Confidence Interval (should be 95%), and Model Fit, then click on Continue.
6. Click on OK.

Through analysis of the SPSS output, answer the following questions. Answer questions 1 – 10 individually, not in paragraph form.

1. What is the total sample size?
2. What is the mean income and mean number of hours worked?
3. What is the correlation coefficient between the outcome and predictor variables? Is it significant? How would you describe the strength and direction of the relationship?
4. What is the value of R squared (coefficient of determination)? Interpret the value.
5. Interpret the standard error of the estimate. What information does this value provide to the researcher?
6. The model fit is determined by the ANOVA table results (F statistic = 37.226, with 1 and 376 degrees of freedom, and the p value is .001). Based on these results, does the model fit the data? Briefly explain. (Hint: A significant finding indicates good model fit.)
7. Based on the coefficients, what is the value of the y-intercept (point at which the line of best fit crosses the y-axis)?
8. Based on the output, write out the regression equation for predicting family income.
9. Using the regression equation, what is the predicted monthly family income for women working 35 hours per week?
10. Using the regression equation, what is the predicted monthly family income for women working 20 hours per week?

In this assignment we are trying to predict CES-D score (depression) in women. The research question is: How well do age, educational attainment, employment, abuse, and poor health predict depression?

Using Polit2SetC data set, run a multiple regression using CES-D Score (cesd) as the outcome variable (Y) and respondent’s age (age), educational attainment (educatn), currently employed (worknow), number of types of abuse (nabuse), and poor health (poorhlth) as the independent variables (X).
When conducting any regression analysis, the dependent (outcome) variable is always (Y) and is placed on the y-axis, and the independent (predictor) variable is always (X) and is placed on the x-axis.

Follow these steps when using SPSS:

1. Open Polit2SetC data set.
2. Click on Analyze, then click on Regression, then Linear.
3. Move the dependent variable, CES-D Score (cesd), into the box labeled “Dependent” by clicking on the arrow button. The dependent variable is a continuous variable.
4. Move the independent variables (age, educatn, worknow, and poorhlth) into the box labeled “Independent.” This is the first block of variables to be entered into the analysis (block 1 of 1). Click on the button (top right of independent box) marked “Next”; this will give you another box to enter the next block of independent variables (block 2 of 2). Here you are to enter (nabuse). Note: Be sure the Method box states “Enter”.
5. Click on the Statistics button (right side of box) and click on Descriptives, Estimates, Confidence Interval (should be 95%), R square change, and Model Fit, and then click on Continue.
6. Click on OK.

(When answering all questions, use the data on the coefficients panel from Model 2.) Answer questions 1 – 5 individually, not in paragraph form.

1. Analyze the data from the SPSS output and write a paragraph summarizing the findings. (Use the example in the SPSS output file as a guide for your write-up.)
2. Which of the predictors were significant predictors in the model?
3. Which of the predictors was the most relevant predictor in the model?
4. Interpret the unstandardized coefficients for educational attainment and poor health.
5. If you wanted to predict a woman’s current CES-D score based on the analysis, what would the unstandardized regression equation be? Include unstandardized coefficients in the equation.

Gray, J.R., Grove, S.K., & Sutherland, S. (2017). Burns and Grove’s the practice of nursing research: Appraisal, synthesis, and generation of evidence (8th ed.). St. Louis, MO: Saunders Elsevier. This chapter asserts that predictive analyses are based on probability theory instead of decision theory. It also analyzes how variation plays a critical role in simple linear regression and multiple regression.

Statistics and Data Analysis for Nursing Research. This section of Chapter 9 discusses the simple regression equation and outlines major components of regression, including errors of prediction, residuals, OLS regression, and ordinary least-square regression. Chapter 10 focuses on multiple regression as a statistical procedure and explains multivariate statistics and their relationship to multiple regression concepts, equations, and tests. This chapter provides an overview of logistic regression, which is a form of statistical analysis frequently used in nursing research.

## Chapter 12: Simple Linear Regression

The following exercises are intended to (1) provide practice analyzing data using simple linear regression and (2) review and reinforce our ability to subset data. The reason we emphasize these two skills together is that, in many instances, we want to analyze data that include only certain observations (and variables) while excluding the others. To this end, we make use of the Cars93 data from the R package named MASS. If necessary, make sure you consult Chapter 2 for specific instructions on how to access Cars93, with particular attention to the brief discussion immediately preceding Exercise 1 at the chapter's end. And because regression is the final analytic methodology of the book, we revisit several of the most useful R functions covered earlier in the book, just for practice.

1. Import the Cars93 data into the object named E12_1. What are the variable names? How many observations are included?
Find (1) the minimum and maximum values, (2) the median and mean, (3) the first and third quartiles, and (4) the standard deviation of the two variables, MPG.city and EngineSize. Comment on your initial findings.

```r
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:introstats':
##
##     housing

#Comment1. Import Cars93 into the object named E12_1.
E12_1 <- Cars93

#Comment2. Use the nrow() function to find the number
#of observations.
nrow(E12_1)
## [1] 93

#Comment3. Use the names() function to list variable names.
names(E12_1)
##  [1] "Manufacturer"       "Model"              "Type"
##  [4] "Min.Price"          "Price"              "Max.Price"
##  [7] "MPG.city"           "MPG.highway"        "AirBags"
## [10] "DriveTrain"         "Cylinders"          "EngineSize"
## [13] "Horsepower"         "RPM"                "Rev.per.mile"
## [16] "Man.trans.avail"    "Fuel.tank.capacity" "Passengers"
## [19] "Length"             "Wheelbase"          "Width"
## [22] "Turn.circle"        "Rear.seat.room"     "Luggage.room"
## [25] "Weight"             "Origin"             "Make"

#Comment4. Use the summary() function to find the basic
#descriptive statistics for MPG.city and EngineSize.
summary(E12_1$MPG.city)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   15.00   18.00   21.00   22.37   25.00   46.00
summary(E12_1$EngineSize)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   1.800   2.400   2.668   3.300   5.700

#Comment5. Use the sd() function to find the standard deviation
#of each variable.
sd(E12_1$MPG.city)
## [1] 5.619812
sd(E12_1$EngineSize)
## [1] 1.037363
```

Answer: The Cars93 data include 93 observations across 27 variables. The descriptive statistics for MPG.city and EngineSize are provided in Comments 4 and 5.

2. Do MPG.city and EngineSize appear related in any systematic way? Comment.

```r
plot(E12_1$EngineSize, E12_1$MPG.city,
     pch = 19,
     xlab = 'Engine Size (liters)',
     ylab = 'City Miles per Gallon',
     main = 'Relationship Between City MPG and Engine Size (liters)')
```

Answer: The pattern of points revealed by the scatterplot suggests that the two variables are negatively related. The important question is whether the relationship is linear; the scatterplot suggests that it is probably more curvilinear than linear. To sort out that issue, we next move on to the residual plot.

3. Make and inspect a residual plot. Does the pattern of points reveal anything that might cause us to question the assumptions underlying the appropriate usage of regression analysis to explore the relationship between MPG.city and EngineSize?

```r
#Comment1. Use the lm() function to create the model object
#named slr1 (the first simple linear regression).
slr1 <- lm(MPG.city ~ EngineSize, data = E12_1)

#Comment2. Use the plot() function to create a residual plot.
#Note that resid(slr1) must be included as an argument.
plot(E12_1$EngineSize, resid(slr1),
     pch = 19,
     xlab = 'Engine Size (liters)',
     ylab = 'Residuals',
     main = 'Residuals Against the Independent Variable')
#Add a horizontal reference line at zero.
abline(h = 0)
```

Answer: When the engine size is between (roughly) 1.5 and 4.75 liters, the residuals reveal a reasonably linear relationship between MPG.city and EngineSize. However, this pattern tends to break down for both the upper and lower values of engine size: for vehicles having the smallest engine size (below 1.5 liters) and the largest engine size (above 4.75 liters) the pattern of the residuals reveals that the assumptions underlying the correct application of regression are not very well met.

4. There are several possible methods for managing the problem of nonlinear relationships among variables, such as what we have encountered in this case. One of the approaches involves transforming the variables (by way of logarithms, exponents, etc.) in such a way that they are forced to be more linearly related. (This class of methods, sometimes referred to as GLM or general linear model, is not covered in this book.) Another procedure requires including additional variables in the multiple regression model (the focus of Chapter 13). Instead, the approach we employ here involves subsetting the data according to some specification, such as subsetting the data in a way that includes, for example, only vehicles manufactured in the US or all vehicles that have smaller engines (i.e., fewer liters of displacement). The expectation (or hope) is that, by subsetting, the resulting data may meet the assumptions behind the appropriate application of regression analysis. As a first step, subset the data stored in object E12_1 in a way that excludes all vehicles with EngineSize greater than the median. (See Comment 4 under the first exercise above.) Name this new object E12_2. Check to make sure that E12_2 conforms to this requirement. How many observations remain in the new object? List the first 3 and the last 3 observations. If necessary, review the third point of the Chapter 2 Appendix, “Can We Extract A Data Subset From A Larger Data Set?”

```r
#Comment1. Use indexing [ ] with the median EngineSize value of 2.4.
#The column names must be quoted.
E12_2 <- E12_1[E12_1$EngineSize <= 2.40, c('MPG.city', 'EngineSize')]
```

```r
#Comment2. Use the max() and min() functions to find the maximum
#and minimum values of EngineSize in E12_2.
max(E12_2$EngineSize)
min(E12_2$EngineSize)

#Comment3. List the first 3 and last 3 observations in E12_2.
head(E12_2, 3)
##    MPG.city EngineSize
## 1        25        1.8
## 6        22        2.2
## 12       25        2.2
tail(E12_2, 3)
##    MPG.city EngineSize
## 90       21        2.0
## 92       21        2.3
## 93       20        2.4

#Comment4. Use the nrow() function to find the number of
#observations stored in E12_2.
nrow(E12_2)
```

Answer: The variable EngineSize now runs from a low of 1 to a high of 2.4 liters; n = 49 observations are stored in the E12_2 object.

5. For E12_2, do MPG.city and EngineSize appear related in a systematic way?

```r
plot(E12_2$EngineSize, E12_2$MPG.city, pch = 19,
     xlab = 'Engine Size (liters)',
     ylab = 'City Miles per Gallon',
     main = 'Relationship Between City MPG and Engine Size (liters)')
```

Answer: Yes, the pattern of points revealed in the scatterplot of the E12_2 data suggests that the two variables, MPG.city and EngineSize, may be negatively and linearly related. This is not a surprising finding, of course, since it confirms what

6. Make and inspect a residual plot. Does the pattern of points appear more linearly arranged than they did in the earlier exercise when the data included vehicles with large engines as well as those with a smaller one?

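```r
#Comment1. Use the lm() function to create the model object named
#slr2 (the second simple linear regression).
slr2 <- lm(MPG.city ~ EngineSize, data = E12_2)

#Comment2. Use the plot() function to create a residual plot.
#Note that resid(slr2) must be included as an argument.
plot(E12_2$EngineSize, resid(slr2), pch = 19,
     xlab = 'Engine Size (liters)', ylab = 'Residuals',
     main = 'Residuals Against the Independent Variable')
abline(h = 0)
```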

Answer: Yes, apart from 3 or 4 outliers for values of engine size below 1.5 liters, the residual plot does not reveal serious violations of the assumptions.

7. As part of making the residual plot in the preceding exercise, we used the lm() function to create the model object slr2. This is an important step in residual analysis because the model object (slr2) includes all the important information associated with the particular regression problem at hand, including the estimated regression equation itself. What is the estimated regression equation?

```r
slr2
##
## Call:
## lm(formula = MPG.city ~ EngineSize, data = E12_2)
##
## Coefficients:
## (Intercept)   EngineSize
##       46.15       -10.87
```

Answer: The estimated regression equation is ŷ = b0 + b1x = 46.15 – 10.87x, where ŷ is the predicted dependent variable, MPG.city, and x is the independent variable, EngineSize.

8. Find the 95 and 99 percent confidence interval estimates of the regression coefficient b1. Describe what these confidence intervals mean.

```r
#Comment. Use the confint(, level = ) function to find the
#confidence interval estimates of the regression coefficient.
confint(slr2, level = 0.95)
##                 2.5 %    97.5 %
## (Intercept)  40.09671 52.207573
## EngineSize  -14.03045 -7.714777
confint(slr2, level = 0.99)
##                 0.5 %    99.5 %
## (Intercept)  38.07151 54.232777
## EngineSize  -15.08657 -6.658656
```

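Answer: We can be 95% confident that the population regression coefficient lies in the interval from -14.03045 to -7.714777, and 99% confident that it lies in the interval from -15.08657 to -6.658656. In other words, intervals constructed this way capture the true coefficient in about 95% (or 99%) of repeated samples.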
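To see where these interval endpoints come from, the 95% interval for the slope can be rebuilt by hand as estimate ± (critical t value) × (standard error). The sketch below recreates the subset and model so that it runs on its own; the object names `est`, `t_crit`, and `ci_manual` are illustrative.

```r
library(MASS)

# Recreate the subset (EngineSize at or below 2.4 liters) and the model
E12_2 <- Cars93[Cars93$EngineSize <= 2.4, c("MPG.city", "EngineSize")]
slr2  <- lm(MPG.city ~ EngineSize, data = E12_2)

# Estimate and standard error of the slope from the coefficient table
est <- coef(summary(slr2))["EngineSize", ]

# Critical t value for a 95% interval with n - 2 degrees of freedom
t_crit <- qt(0.975, df = df.residual(slr2))

# Lower and upper limits of the interval
ci_manual <- unname(est["Estimate"] + c(-1, 1) * t_crit * est["Std. Error"])
ci_manual
```

The two numbers should coincide with the EngineSize row returned by `confint(slr2, level = 0.95)`. Replacing 0.975 with 0.995 gives the 99% interval.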

9. What does the estimated regression equation tell us?

Answer: We can interpret the estimated regression equation ŷ = 46.15–10.87x this way: for the class of vehicles with engine sizes of 2.4 liters or less, a change of 1 liter in engine size is associated with a change of 10.87 miles per gallon when the vehicle is driving in a city. Moreover, the negative sign tells us that MPG.city and EngineSize are inversely related: as EngineSize increases (decreases), MPG.city decreases (increases). We know this because the regression coefficient b1 = –10.87. The intercept term b0 = 46.15 means less to us, except when we make predictions, because it implies that a vehicle with 0 liters of displacement should get 46.15 miles per gallon in city driving.

10. What is the strength of association between the two variables, MPG.city and EngineSize? Find the coefficient of determination r² from the total and residual sums of squares, r² = 1 – ss_res / ss_y (do not use the summary() function to unpack the regression statistics; we will use it later). This exercise provides another opportunity to hone your coding skills.

```r
#Comment1. Find the total sum of squares, ss_y.
ss_y <- sum((E12_2$MPG.city - mean(E12_2$MPG.city)) ^ 2)

#Comment2. Find the residual sum of squares, ss_res.
ss_res <- sum(resid(slr2) ^ 2)

#Comment3. Find the coefficient of determination.
1 - ss_res / ss_y
```

Answer: The coefficient of determination, r² = 0.505143.

11. What does the coefficient of determination r² reveal about the regression model?

Answer: The r² indicates the proportion of variation in the dependent variable MPG.city that is explained (or accounted for) by variation in the independent variable EngineSize. In this case, that proportion is 0.505143, or roughly 51%. Moreover, because r² = 0.505143, we also know that almost 49% of the variation in MPG.city remains unaccounted for, even once the association with EngineSize has been considered.

12. What is the t value of the coefficient b1 on the independent variable EngineSize? Do not use the summary() function but rather write out the R code (great practice).

Because finding the answer to this question requires a slightly more complicated bit of code, we break up the solution into several pieces.

(a) The expression for the t value is found by taking the ratio of the coefficient itself to the standard error: t = b1 / sb1.

(b) Finding the denominator (i.e., the standard error sb1) of the above expression requires calculating another ratio, sb1 = sy/x / √Σ(x – x̄)², where the numerator of this ratio sy/x is

```r
s_xy <- sqrt(sum((resid(slr2) ^ 2)) / (nrow(E12_2) - 2))
```

and where the denominator of this ratio is

```r
ssx <- sqrt(sum((E12_2$EngineSize - mean(E12_2$EngineSize)) ^ 2))
```

The ratio can now be found by dividing the first value (above) by the second.

```r
s_b1 <- s_xy / ssx
```

(c) The numerator of the t statistic requires the regression coefficient b1

```r
b1 <- sum((E12_2$EngineSize - mean(E12_2$EngineSize)) *
          (E12_2$MPG.city - mean(E12_2$MPG.city))) /
  sum((E12_2$EngineSize - mean(E12_2$EngineSize)) ^ 2)
```

(d) Finally, the t statistic is found by dividing the regression coefficient b1 by the standard error sb1.

```r
t <- b1 / s_b1
t
```

13. What is the p-value of t = –6.926538?

Note: For convenience and accuracy, we use t from the preceding exercise as the first argument of the pt() function.

```r
#Comment1. Since the p-value statistic has a very small value, we
#can override the default of reporting it in scientific notation.
options(scipen = 999)

#Comment2. Use the pt() function with (n-2)=47 degrees of freedom.
#Remember that since this is a two-tail test, we need to multiply
#by 2. Because t is negative, pt(t, 47) is the lower-tail area.
2 * pt(t, 47)
```

14. Use the summary() extractor function to check our work. Remember to use the model object slr2 as the argument.

```r
#Comment. Use summary() function to extract the desired statistics.
summary(slr2)
##
## Call:
## lm(formula = MPG.city ~ EngineSize, data = E12_2)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -15.0177  -1.5814  -0.1451   1.7676  12.1568
##
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)
## (Intercept)    46.15       3.01  15.333 < 0.0000000000000002 ***
## EngineSize    -10.87       1.57  -6.927         0.0000000106 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.061 on 47 degrees of freedom
## Multiple R-squared:  0.5051, Adjusted R-squared:  0.4946
## F-statistic: 47.98 on 1 and 47 DF,  p-value: 0.00000001056
```

All the findings arrived at using the summary() function confirm what has been found in the preceding exercises. That is, the estimated regression equation is ŷ = 46.15 – 10.87x; the coefficient of determination is r² = 0.505143; the t statistic is t = –6.926538; and the p-value is 0.00000001056443.

15. Use the regression equation to find the predicted values of MPG.city for the following values of EngineSize (liters of displacement): 1.25, 1.50, 1.75, 2.00, 2.25.

Answer: The predicted values of MPG.city for EngineSize of 1.25, 1.50, 1.75, 2.00, and 2.25 liters are (in order) 32.56138, 29.84322, 27.12507, 24.40692, and 21.68876 miles per gallon.

```r
#Comment1. Use data.frame() to create a new object containing 1.25,
#1.50, 1.75, 2.00, and 2.25. Name the new object size_new.
#Use = (not <-) so that the column is named EngineSize.
size_new <- data.frame(EngineSize = c(1.25, 1.50, 1.75, 2.00, 2.25))

#Comment2. Use the predict() function to provide predicted values
#of miles per gallon for vehicles having 1.25, 1.50, 1.75, 2.00, and
#2.25 liters EngineSize.
predict(slr2, size_new)
##        1        2        3        4        5
## 32.56138 29.84322 27.12507 24.40692 21.68876
```

16. What are the predicted values of MPG.city that were used to calibrate the estimated regression equation ŷ = 46.15 – 10.87x? Import those predicted values into an object named mileage_predicted and list the first and last three elements.

```r
#Comment1. Use fitted(slr2) function to create the predicted
#values of the dependent variable. Import those values into
#the object named mileage_predicted.
mileage_predicted <- fitted(slr2)

#Comment2. Use the head(,3) and tail(,3) functions to list the
#first and final three values of the predicted value.
head(mileage_predicted, 3)
##        1        6       12
## 26.58144 22.23239 22.23239
tail(mileage_predicted, 3)
##       90       92       93
## 24.40692 21.14513 20.05787
```

17. Add the mileage_predicted object (created in the preceding exercise) to E12_2, and name the resulting object E12_3. List the first and last four elements. Find the correlation of the actual and predicted variables; that is, the correlation of MPG.city and mileage_predicted. Once you have calculated the correlation, square it (i.e., raise it to the second power). Does the squared correlation coefficient look familiar?

```r
#Comment1. Use the cbind() function to bind the column
#mileage_predicted to E12_2. Name the new object E12_3.
E12_3 <- cbind(E12_2, mileage_predicted)

#Comment2. List the first and last four elements of E12_3.
head(E12_3, 4)
##    MPG.city EngineSize mileage_predicted
## 1        25        1.8          26.58144
## 6        22        2.2          22.23239
## 12       25        2.2          22.23239
## 13       25        2.2          22.23239
tail(E12_3, 4)
##    MPG.city EngineSize mileage_predicted
## 88       25        1.8          26.58144
## 90       21        2.0          24.40692
## 92       21        2.3          21.14513
## 93       20        2.4          20.05787

#Comment3. Find the correlation of the actual and predicted
#dependent variables. Store the value in an object named r.
r <- cor(E12_3$MPG.city, E12_3$mileage_predicted)

#Comment4. Square the value of r.
r ^ 2
```

The square of the correlation of the actual dependent variable and predicted dependent variable equals the coefficient of determination, r².

18. Create a scatterplot with MPG.city on the vertical axis and EngineSize on the horizontal axis. Add labels to both axes as well as a main title. Finally, using the abline() function, add a regression line to the scatterplot.

```r
plot(E12_2$EngineSize, E12_2$MPG.city,
     xlab = 'Engine Size (liters)',
     ylab = 'City Miles per Gallon',
     main = 'The Best Line Through the Scatterplot',
     pch = 19,
     col = 'blue')
abline(slr2)
```

19. For additional practice structuring our data before analyzing it, we now subset E12_1 by Origin. For this exercise: (1) create a new object from the original data set, E12 _1, that includes only vehicles of non-USA origin (thus excluding all vehicles of USA origin) and name it E12_4 (2) find the median EngineSize of non-USA vehicles, and (3) create a new object, named E12_5, that includes only (a) those vehicles having MPG.city less than or equal to the median and only (b) the two variables, MPG.city and EngineSize. In other words, subset the original data set, E12_1, to include only the two variables, MPG.city and EngineSize, and only those vehicles that are of non-USA origin and that feature engines with displacement (in liters) at or below the median for the relevant category. Just to make sure E12_5 “looks” as it should, run a few of the same functions that were used in Exercise 1.

#Comment1. Subset data so that it includes only vehicles of non-USA
#origin. Name new object E12_4.

E12_4 <- E12_1[which(E12_1$Origin == 'non-USA'), ]

#Comment2. Find the median EngineSize for the sample including only
#non-USA vehicles.

#Comment3. Since the median is 2.2 liters, subset the data once
#again to include only vehicles with 2.2 liters (or less) engine
#displacement. Name the new object E12_5.

E12_5 <- E12_4[E12_4$EngineSize <= 2.20, c('MPG.city', 'EngineSize')]

#Comment4. Use the summary() function to find the basic descriptive
#statistics for MPG.city and EngineSize

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.00 22.75 25.50 27.54 29.25 46.00

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.500 1.700 1.708 2.000 2.200

#Comment5. Use nrow() function to find the number of observations.

#Comment6. Use the names() function to confirm that E12_5 includes
#only two variables, MPG.city and EngineSize.

Answer: The new data set, E12_5, includes 24 observations across the two variables, MPG.city and EngineSize. The descriptive statistics for both variables are reported under Comment 4. Note that the maximum value of EngineSize is 2.2, thus confirming that the data include only what we want, in terms of observations, variables, and restrictions. This practice of “looking under the hood (or bonnet)” is a sound one. It allows us to confirm that the data really do look like they should.

20. For the category of vehicles of non-USA origin, do the two variables, MPG.city and
EngineSize, seem to be related in a systematic way? If so, how?

pch = 19,
xlab = 'Engine Size (liters)',
ylab = 'City Miles per Gallon',
main = 'Relationship Between City MPG and Engine Size (liters) ')

Answer: Yes, the pattern of points appears to run (approximately) from the upper-lefthand to lower-righthand corners of the scatterplot, implying a negative association: larger engine sizes are associated with lower miles per gallon (in city driving). But because of a few outlier cases, we do not expect the r² to be very high, perhaps not even r² = 0.50.

21. Make and inspect a residual plot. Does the pattern of points reveal any reason why we should not use regression to analyze these data? Are there any radical departures from the assumptions underlying the appropriate usage of this methodology?

#Comment1. Use the lm() function to create the model object named
#slr3 (slr3 stands for the third simple linear regression).

slr3 <- lm(MPG.city ~ EngineSize, data = E12_5)

#Comment2. Use the plot() function to create a residual plot.
#Note that resid(slr3) must be included as an argument.

resid(slr3),
abline(h = 0),
pch = 19,
xlab = 'Engine Size (liters) ',
ylab = 'Residuals',
main = 'Residuals Against the Independent Variable')

Answer: The only possible area of trouble revealed by the residual plot might be in the range of the smaller engines, particularly at and below 1.5 liters. For vehicles with larger engines, however, the assumption of constant variation seems satisfied. Therefore, although the data are far from what we might characterize as “well behaved,” the violations do not seem serious enough to cause us to drop regression as a potentially promising analytical methodology.
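A residual plot of this kind can be sketched with R's built-in mtcars data, since E12_5 is not bundled with R. The model name slr_demo and the choice of qsec and hp are assumptions made only for illustration; the zero line is added with a separate abline() call once the plot exists.

```r
# Illustrative sketch: mtcars stands in for E12_5 (an assumption).
slr_demo <- lm(qsec ~ hp, data = mtcars)

# Plot the residuals against the independent variable.
plot(mtcars$hp, resid(slr_demo),
     pch = 19,
     xlab = 'Gross Horsepower',
     ylab = 'Residuals',
     main = 'Residuals Against the Independent Variable')

# Draw the horizontal reference line at zero on top of the plot.
abline(h = 0)

# Least squares residuals sum to (numerically) zero when the model
# includes an intercept.
resid_sum <- sum(resid(slr_demo))
```

Points scattered evenly above and below the zero line, with roughly constant spread, are what the constant-variation assumption would lead us to expect.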

22. What is the estimated regression equation?

slr3
##
## Call:
## lm(formula = MPG.city ~ EngineSize, data = E12_5)
##
## Coefficients:
## (Intercept) EngineSize
## 51.27 -13.89

23. Find the 75 and 90 percent confidence interval estimates of the regression coefficient b1. How should we interpret the meaning of these confidence interval estimates?

## 12.5 % 87.5 %
## (Intercept) 44.60803 57.94082
## EngineSize -17.72209 -10.06260

## 5 % 95 %
## (Intercept) 41.58615 60.962698
## EngineSize -19.45811 -8.326578

Answer: We can be 75% confident that the population regression coefficient lies in the interval from -17.72209 to -10.06260; we can be 90% confident that it lies in the interval from -19.45811 to -8.326578.
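Interval estimates at several confidence levels can be obtained with confint() and its level argument. The sketch below uses the built-in mtcars data as a stand-in for E12_5 (an assumption, as is the name slr_demo) and shows that the 90 percent interval is wider than the 75 percent interval.

```r
# Illustrative sketch: mtcars stands in for E12_5 (an assumption).
slr_demo <- lm(qsec ~ hp, data = mtcars)

# Confidence interval estimates at the 75 and 90 percent levels.
ci_75 <- confint(slr_demo, level = 0.75)
ci_90 <- confint(slr_demo, level = 0.90)

# Width of each interval for the coefficient on hp.
width_75 <- ci_75['hp', 2] - ci_75['hp', 1]
width_90 <- ci_90['hp', 2] - ci_90['hp', 1]
```

Asking for more confidence always costs precision: the higher the level, the wider the interval around the same point estimate.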

24. What does the estimated regression equation tell us?

Answer: The estimated regression equation ŷ = 51.27 – 13.89x can be interpreted in this manner: for the category of vehicles of non-USA origin (and with engine sizes of 2.2 or fewer liters), a change of 1 liter in engine size is associated with a change of 13.89 miles per gallon when the vehicle is driving in a city. The negative sign, b1 = –13.89, also tells us that as EngineSize increases (decreases), MPG.city decreases (increases). Although the intercept term, b0 = 51.27, is not meaningfully interpretable, we will find it necessary to include it in the regression equation whenever we want to forecast or make predictions.

25. What is the strength of association between the two variables, MPG.city and EngineSize? Find the coefficient of determination r² using the following expression for r² (do not use the summary() function to unpack the regression statistics; we will use it later). This exercise provides another opportunity to hone your coding skills.

#Comment1. Find the total sum of squares, ss_y.

ss_y <- sum((E12_5$MPG.city - mean(E12_5$MPG.city)) ^ 2)

#Comment2. Find the residual sum of squares, ss_res.

ss_res <- sum(resid(slr3) ^ 2)

#Comment3. Find the coefficient of determination.

1 - ss_res / ss_y

Answer: The coefficient of determination, r² = 0.455044.
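The same three steps can be tried on data that ship with R. This is a minimal sketch, assuming the intended expression is r² = 1 − ss_res / ss_y; it uses mtcars (qsec regressed on hp) rather than E12_5, and slr_demo is a name introduced here for illustration.

```r
# Illustrative sketch: mtcars stands in for E12_5 (an assumption).
slr_demo <- lm(qsec ~ hp, data = mtcars)

# Total sum of squares of the dependent variable.
ss_y <- sum((mtcars$qsec - mean(mtcars$qsec)) ^ 2)

# Residual sum of squares from the fitted model.
ss_res <- sum(resid(slr_demo) ^ 2)

# Coefficient of determination.
r_squared <- 1 - ss_res / ss_y
```

The result agrees with the Multiple R-squared that summary(slr_demo) reports, which is a convenient check on the hand calculation.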

26. What does the coefficient of determination r² tell us about the regression model?

Answer: The r² is the proportion of variation in the dependent variable MPG.city that is accounted for (or explained) by variation in EngineSize, the independent variable. When r² = 0.455044, we understand that that proportion is about 45%. We also know that approximately 55% of variation in MPG.city remains unexplained, even after taking EngineSize into account.

27. What is the t value of the coefficient b1 on the independent variable EngineSize? Do not use the summary() function but rather write out the code (more practice).

Because finding the answer to this question requires a slightly more complicated bit of code, we break up the solution into several pieces.

(a) The expression for the t value is found by taking the ratio of the coefficient
itself to the standard error: t = b1 / sb1.

(b) Finding the denominator (i.e., the standard error sb1) of the above expression
requires calculating another ratio, sb1 = sy|x / sqrt(Σ(x – x̄)²),

where the numerator of this ratio sy|x is

s_xy <- sqrt(sum((resid(slr3) ^ 2)) / (nrow(E12_5) - 2))

and where the denominator of this ratio is

ssx <- sqrt(sum((E12_5$EngineSize - mean(E12_5$EngineSize)) ^ 2))

The ratio can now be found by dividing the first value (above) by the second.
This is the value for sb1.

(c) The numerator of the t statistic requires the regression coefficient b1

b1 <- sum((E12_5$EngineSize - mean(E12_5$EngineSize)) *
          (E12_5$MPG.city - mean(E12_5$MPG.city))) /
  sum((E12_5$EngineSize - mean(E12_5$EngineSize)) ^ 2)

(d) Finally, the t statistic is found by dividing the regression coefficient b1 by the
standard error sb1 .
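Steps (a) through (d) can be run end to end on the built-in mtcars data. This is only a sketch: mtcars replaces E12_5, and the object names slr_demo, s_yx, root_ssx, s_b1, and t_b1 are assumptions chosen for illustration.

```r
# Illustrative sketch: mtcars stands in for E12_5 (an assumption).
slr_demo <- lm(qsec ~ hp, data = mtcars)
n <- nrow(mtcars)

# Standard error of the estimate, s_y|x, with n - 2 degrees of freedom.
s_yx <- sqrt(sum(resid(slr_demo) ^ 2) / (n - 2))

# Square root of the sum of squared deviations of x.
root_ssx <- sqrt(sum((mtcars$hp - mean(mtcars$hp)) ^ 2))

# Standard error of the slope.
s_b1 <- s_yx / root_ssx

# Slope coefficient from the sums of cross-products and squares.
b1 <- sum((mtcars$hp - mean(mtcars$hp)) *
          (mtcars$qsec - mean(mtcars$qsec))) /
  sum((mtcars$hp - mean(mtcars$hp)) ^ 2)

# t statistic: the coefficient divided by its standard error.
t_b1 <- b1 / s_b1
```

The value of t_b1 matches the t value that summary(slr_demo) reports for hp.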

28. What is the p-value of t = –4.286051?

Answer: p-value = (2)(p(t ≤ –4.286051, df = 22)) = 0.0003000032.
Note: For convenience and accuracy, we use the t from the preceding exercise as the
first argument of the pt() function.

#Comment. Use the pt() function with (n-2)=22 degrees of freedom.
#Remember that since this is a two-tail, we need to multiply by 2.

29. Use the summary() extractor function to check our work. Remember to use the
model object slr3 as the argument.

##
## Call:
## lm(formula = MPG.city ~ EngineSize, data = E12_5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.2144 -1.7278 -0.6574 1.8710 11.5641
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51.274 5.642 9.088 0.00000000668 ***
## EngineSize -13.892 3.241 -4.286 0.0003 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.305 on 22 degrees of freedom
## Multiple R-squared: 0.455, Adjusted R-squared: 0.4303
## F-statistic: 18.37 on 1 and 22 DF, p-value: 0.0003

All findings arrived at using the summary() function confirm what has been found in the preceding exercises. The estimated regression equation is ŷ = 51.274 – 13.892x; the coefficient of determination is r² = 0.455; the t statistic is t = –4.286051; and the p-value = 0.0003000032.

30. Use the estimated regression equation to find the predicted values of MPG.city for the following values of EngineSize (liters of displacement): 1.25, 1.50, 1.75, 2.00, 2.25.

Answer: The predicted values of MPG.city for EngineSize of 1.25, 1.50, 1.75, 2.00, and 2.25 liters are (in order) 33.90899, 30.43591, 26.96282, 23.48973, and 20.01665 miles per gallon.

#Comment1. Use data.frame() to create a new object containing 1.25,
#1.50, 1.75, 2.00, and 2.25. Name the new object size_new.

size_new <- data.frame(EngineSize = c(1.25, 1.50, 1.75, 2.00, 2.25))

#Comment2. Use the predict() function to provide the predicted values
#of miles per gallon for vehicles having 1.25, 1.50, 1.75, 2.00, and
#2.25 liters EngineSize.

predict(slr3, size_new)
## 1 2 3 4 5
## 33.90899 30.43591 26.96282 23.48973 20.01665

31. What are the predicted values of MPG.city that were used to calibrate the estimated regression equation ŷ = 51.274 – 13.892x? Import those predicted values into an object named mileage_predicted and list the first and last three elements.

#Comment1. Use the fitted(slr3) function to create the predicted
#values of the dependent variable. Import those values into
#the object named mileage_predicted.

mileage_predicted <- fitted(slr3)

#Comment2. Use the head(,3) and tail(,3) functions to list the
#first and final three values of the predicted value.

head(mileage_predicted, 3)
## 1 9 40
## 26.26820 37.38208 29.04667

tail(mileage_predicted, 3)
## 86 88 90
## 20.71126 26.26820 23.48973

32. Add the mileage_predicted object (created in the preceding exercise) to E12_5, and name the resulting object E12_6. List the first and last four elements. Find the correlation of the actual and predicted variables; that is, the correlation of MPG.city and mileage_predicted. Once you have the correlation, square it (i.e., raise it to the second power). Comment on the square of the correlation. What is it?

#Comment1. Use the cbind() function to bind the column
#mileage_predicted to E12_5. Name the new object E12_6.

E12_6 <- cbind(E12_5, mileage_predicted)

#Comment2. List the first and last four elements of E12_6.

## MPG.city EngineSize mileage_predicted
## 1 25 1.8 26.26820
## 39 46 1.0 37.38208
## 40 30 1.6 29.04667
## 42 42 1.5 30.43591

## MPG.city EngineSize mileage_predicted
## 85 25 2.2 20.71126
## 86 22 2.2 20.71126
## 88 25 1.8 26.26820
## 90 21 2.0 23.48973

#Comment3. Find the correlation of the actual and predicted
#dependent variables. Store the value in an object named r.

r <- cor(E12_6$MPG.city, E12_6$mileage_predicted)

#Comment4. Square the value of r.

r ^ 2
## [1] 0.455044

The square of the correlation of the actual dependent variable and predicted dependent variable equals the coefficient of determination, r².
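This identity is easy to confirm on any simple linear regression. The sketch below uses the high and low temperature readings listed in Exercise 34, entered directly as vectors; the object names fit and r_sq are assumptions for illustration.

```r
# Temperature readings from Exercise 34 (degrees Fahrenheit).
high <- c(71, 45, 65, 91, 46, 67, 88, 44, 92, 88, 57, 20, 42, 40)
low  <- c(56, 23, 48, 76, 37, 45, 71, 35, 73, 65, 39, 15, 39, 29)

# Regress the high temperature on the low temperature.
fit <- lm(high ~ low)

# Correlation of the actual and predicted dependent variable.
r <- cor(high, fitted(fit))

# Its square equals the coefficient of determination.
r_sq <- r ^ 2
```

The value of r_sq agrees with the Multiple R-squared from summary(fit), about 0.9376 for these data.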

33. Create a scatterplot with MPG.city on the vertical axis, Engine Size on the horizontal axis. Add labels to both axes as well as a main title; set blue as the color of the points. Finally, using the abline() function, add a regression line to the scatterplot.

xlab = 'Engine Size (liters)',
ylab = 'City Miles per Gallon',
main = 'The Best Line Through the Scatterplot',
pch = 19,
col = 'blue' )

34. This exercise provides further opportunity to (1) find a set of data from an online source (pick a source, any source), (2) create a data frame (see the Chapter 1 Appendix, if necessary), and (3) analyze it using some of the methods associated with simple linear regression. Look up and record the high and low intraday temperatures (in degrees Fahrenheit) for the following 14 cities from around the world: Auckland, Beijing, Cairo, Lagos, London, Mexico City, Mumbai, Paris, Rio de Janeiro, Sydney, Tokyo, Toronto, Vancouver, and Zurich. This information is easily found after a brief search.

(a) Use c() to create 3 objects, one for each city name, one for the high temperature, one for the low temperature. Data are recorded for December 19, 2016.

city <- c('Auckland', 'Beijing', 'Cairo', 'Lagos', 'London', 'Mexico City', 'Mumbai',
          'Paris', 'Rio de Janeiro', 'Sydney', 'Tokyo', 'Toronto', 'Vancouver', 'Zurich')

high <- c(71, 45, 65, 91, 46, 67, 88, 44, 92, 88, 57, 20, 42, 40)

low <- c(56, 23, 48, 76, 37, 45, 71, 35, 73, 65, 39, 15, 39, 29)

(b) Use data.frame() to create a data frame consisting of each city name and high and low temperatures. Display the result to check your work.

WorldTemps <- data.frame(City = city, High = high, Low = low)
WorldTemps
## City High Low
## 1 Auckland 71 56
## 2 Beijing 45 23
## 3 Cairo 65 48
## 4 Lagos 91 76
## 5 London 46 37
## 6 Mexico City 67 45
## 7 Mumbai 88 71
## 8 Paris 44 35
## 9 Rio de Janeiro 92 73
## 10 Sydney 88 65
## 11 Tokyo 57 39
## 12 Toronto 20 15
## 13 Vancouver 42 39
## 14 Zurich 40 29

(c) Make a scatterplot of high against low temperatures. Create a main title, label each axis appropriately, and use pch= to specify how the points should appear. Does the pattern of points appear to confirm that the relationship between high and low temperatures is linear?

pch = 19,
xlab = "Low",
ylab = "High",
main = "High and Low Intraday Temperatures")

Answer: The scatterplot makes clear that the relationship between high and low intraday temperatures is both positive and linear.

(d) Estimate and write out the regression equation: ŷ = b0 + b1x. Let the high temp be the dependent variable; the low temp, the independent variable.

reg_eq_temps <- lm(High ~ Low, data = WorldTemps)
reg_eq_temps
##
## Call:
## lm(formula = High ~ Low, data = WorldTemps)
##
## Coefficients:
## (Intercept) Low
## 7.763 1.148

The estimated regression equation is: ŷ = 7.763 + 1.148x

summary(reg_eq_temps)
##
## Call:
## lm(formula = High ~ Low, data = WorldTemps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.533 -3.991 -1.051 3.884 10.834
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.76313 4.27689 1.815 0.0946 .
## Low 1.14795 0.08546 13.433 0.0000000136 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.918 on 12 degrees of freedom
## Multiple R-squared: 0.9376, Adjusted R-squared: 0.9325
## F-statistic: 180.5 on 1 and 12 DF, p-value: 0.00000001363

The r² is 0.9376, indicating that approximately 93.76% of variation in the dependent variable is explained by variation in the independent variable.

(f) What is the p-value? Is the estimated regression equation significant? Why or why not?

Since p-value=0.0000000136 is far less than the usual values we set for α—e.g., 0.05, 0.01, etc.—we say that the estimated regression equation is significant.

35. A dependent variable y is regressed on an independent variable x; the sample size is n = 32.

(b) If b1 = –0.041215 and sb1 = 0.004712, what is the value of the test statistic t?

Answer: t = b1 / sb1 = –0.041215 / 0.004712 = –8.75. The corresponding two-tail p-value = 2(p(t < –8.75, df = n–k–1)) = 2(p(t < –8.75, df = 30)) = 0.00000000093

2 * pt(-8.75, 30)
## [1] 0.0000000009313949
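The test statistic and its p-value can be computed together from the quantities given in the exercise. In this sketch the object names b1, s_b1, t_stat, and p_value are assumptions for illustration; the degrees of freedom are n − 2 because there is one independent variable.

```r
# Values given in the exercise.
b1   <- -0.041215
s_b1 <- 0.004712
n    <- 32

# Test statistic: the coefficient divided by its standard error.
t_stat <- b1 / s_b1

# Two-tail p-value with n - 2 = 30 degrees of freedom.
p_value <- 2 * pt(-abs(t_stat), df = n - 2)
```

Because p_value uses the unrounded t_stat, it differs slightly in the later digits from the figure obtained with t rounded to –8.75.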

(d) Is the estimated regression equation significant at the α = 0.01 level?

Answer: Yes, since p-value = 0.00000000093 <α = 0.01, we conclude that the estimated
regression equation is significant.

(e) If b0 = 29.599855, write out the regression equation, ŷ = b0 + b1x.

Answer: ŷ = 29.599855 – 0.041215x

36. This exercise uses the mtcars data set that is installed in R. (Remember that to see all the installed data sets, simply enter data() at the R prompt in the Console; to view the mtcars data set itself, enter mtcars at the R prompt; to learn more about the data set, including the variables and observations, enter ?mtcars at the prompt and wait for the R help page to open.) In this case, we are interested in the relationship between an automobile's quarter mile time and gross horsepower.

(a) Create a scatterplot of the 2 variables. What does the pattern of points suggest about the relationship (if any) between the variables?

Answer: The scatterplot makes clear that the relationship between gross horse power and quarter mile time (seconds) is both negative and (approximately) linear.

pch = 19,
xlab = "Gross Horse Power",
ylab = "Quarter Mile Time (seconds)")

(b) Letting the quarter mile time be the dependent variable, estimate the regression equation. Write out the regression equation, ŷ = b0 + b1x.

reg_eq_mtcars <- lm(qsec ~ hp, data = mtcars)
reg_eq_mtcars
##
## Call:
## lm(formula = qsec ~ hp, data = mtcars)
##
## Coefficients:
## (Intercept) hp
## 20.55635 -0.01846

The estimated regression equation is: ŷ = 20.55635 – 0.01846x.

summary(reg_eq_mtcars)
##
## Call:
## lm(formula = qsec ~ hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1766 -0.6975 0.0348 0.6520 4.0972
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.556354 0.542424 37.897 < 0.0000000000000002 ***
## hp -0.018458 0.003359 -5.495 0.00000577 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.282 on 30 degrees of freedom
## Multiple R-squared: 0.5016, Adjusted R-squared: 0.485
## F-statistic: 30.19 on 1 and 30 DF, p-value: 0.000005766

The r² is 0.5016, indicating that approximately 50.16% of variation in the dependent variable is explained by variation in the independent variable.

(e) Is the estimated regression equation significant at the α = 0.05 level?

Answer: Since p-value=0.000005766 is far less than the usual values for α, we say that the estimated regression equation is significant.

37. Please use the mtcars data set to answer the following questions.

(a) Find the predicted quarter mile time for the values of gross horsepower from the data set used in the original analysis. Report the predicted values for the last four observations.

## Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E

## 15.68336 17.32615 14.37282 18.54440

(b) Find the predicted values of quarter mile time for the following values of gross horsepower: 100, 125, 160, 225, and 250.

new_values <- data.frame(hp = c(100, 125, 160, 225, 250))
predict(reg_eq_mtcars, new_values)

## 1 2 3 4 5
## 18.71052 18.24906 17.60302 16.40323 15.94178

(c) Can we use the estimated regression equation to make predictions of quarter mile time when gross horsepower is 40 or 350? Why or why not?

Answer: Because the minimum and maximum values of the variable hp are 52 and 335, respectively, we should not use the estimated regression equation to make predictions based on values that fall outside that range, such as 40 or 350. Some analysts will do so anyway, but one should proceed very carefully when making any predictive claims based on this estimated regression equation.
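A quick range check makes the point concrete. The sketch below finds the observed range of hp in mtcars and flags which candidate values would require extrapolation; the names hp_range, candidates, and within_range are assumptions for illustration.

```r
# Observed range of the independent variable in the sample.
hp_range <- range(mtcars$hp)

# Values of gross horsepower we are asked about.
candidates <- c(40, 350)

# TRUE where a candidate lies inside the observed range.
within_range <- candidates >= hp_range[1] & candidates <= hp_range[2]
```

Both candidates return FALSE, so any prediction made for them relies on the fitted line holding outside the data used to estimate it.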

38. This exercise explores the relationship (if any) between 2 of the 5 variables included in the polling data (found on the companion website): x1 = Age, measured in years, and x3 = Same Sex, which is measured on a 1-to-7 Likert scale as a response to the statement, "approve of the right of same-sex couples to marry." A respondent registers strong disapproval with a 1, strong approval with a 7, and relative indifference with a response in the middle of the range from 1 to 7.

(a) Make a scatterplot of x3 against x1. Do you see any possible violations of the assumptions underlying the correct application of simple linear regression to these data? What does the nature of the pattern tell you? Do you think regression can be used to explore the relationship between the 2 variables?

Answer: The scatterplot reveals the negative and (relatively) linear relationship between a person's age and the extent to which he approves of the right of same-sex couples to marry. That is, in general, resistance to the idea that same-sex couples should have the right to marry seems to increase with one's age. However, as with most plots, the relationship is not a perfect one. Even so, regression analysis would seem to be a promising means by which to explore the relationship between these 2 variables.

xlab = "Age",
ylab = "Views of Same-Sex Marriage",
pch = 19)

(b) Write out the regression equation. In this case, does it make more sense to specify x1 or x3 as the dependent variable? That is, should you define the model as x3 = b0 + b1x1? Or as x1 = b0 + b3x3? Why?

Answer: We would most likely specify x3, Same-Sex Marriage, as the dependent variable and x1, Age, as the independent variable. Although we do not use regression analysis to demonstrate causality, it makes more sense to say that approval of the right of same-sex couples to marry falls with age than the reverse.

The estimated regression equation is ŷ = 11.757 – 0.168x.

reg_eq_polling <- lm(x3 ~ x1, data = polling)
reg_eq_polling
##
## Call:
## lm(formula = x3 ~ x1, data = polling)
##
## Coefficients:
## (Intercept) x1
## 11.757 -0.168

summary(reg_eq_polling)
##
## Call:
## lm(formula = x3 ~ x1, data = polling)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8535 -1.0235 -0.2935 0.6705 2.8025
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.7573 1.2276 9.577 0.000000159 ***
## x1 -0.1680 0.0265 -6.340 0.000018289 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.296 on 14 degrees of freedom
## Multiple R-squared: 0.7417, Adjusted R-squared: 0.7232
## F-statistic: 40.19 on 1 and 14 DF, p-value: 0.00001829

(e) Is the regression equation significant at the α = 0.05 level?

Answer: Since p-value=0.00001829 is less than the usual values for α, we say that the
estimated regression equation is significant.

(f) Find the 95% confidence interval estimate of the regression coefficient.

confint(reg_eq_polling, level = 0.95)

## 2.5 % 97.5 %
## (Intercept) 9.1242484 14.3903218
## x1 -0.2248278 -0.1111626

(g) State in words the meaning of the confidence interval estimate.

Answer: We can be 95% confident that the population regression coefficient lies in the interval
from -0.2248278 to -0.1111626.

Answer: The estimated regression equation ŷ = 11.757 – 0.168x allows us to conclude that a change of 1 year is associated with a change of –0.168 in approval. (We know this because the regression coefficient is b1 = –0.168.) In this case, the meaning of the intercept term, b0 = 11.757, is less clear because it represents the predicted value of approval for a person whose age is 0. Even so, it is important to retain the intercept term in the equation because it must be included when we want to make predictions.


## Coefficient of Determination and Correlation Coefficient

### Look at the Scattergram of Your Data First

You can use the equations shown above to find the regression line for any set of numbers. For example, given the pairs of numbers, (16,42), (1,54), (14,60), (4,70), (0,48), (6,41), (2,59), (10,64), (0,45), (8,69), using the formulas shown above, the equation of the regression line is y = 0.115653 x + 54.4945. However, look at the scattergram of these numbers.

It appears that there isn't a linear relationship between the x and y variables. So, before trying to compute a regression line, which should only be used when a linear relationship exists between the variables, make a scatterplot. If the scatterplot makes it clear that there is not a linear relationship between variables, don't use linear regression.
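The ten pairs above can be checked in R. This is a minimal sketch, with the vector names x and y and the object name fit chosen here for illustration: it draws the scatterplot first, then fits the line and computes the correlation.

```r
# The ten (x, y) pairs from the example.
x <- c(16, 1, 14, 4, 0, 6, 2, 10, 0, 8)
y <- c(42, 54, 60, 70, 48, 41, 59, 64, 45, 69)

# Look at the scatterplot before fitting anything.
plot(x, y, pch = 19, main = 'Scattergram of the Example Data')

# Least squares regression line.
fit <- lm(y ~ x)
slope <- unname(coef(fit)['x'])
intercept <- unname(coef(fit)['(Intercept)'])

# Correlation between x and y.
r <- cor(x, y)
```

The fitted coefficients reproduce the equation quoted above, while the correlation is close to zero, which is the numerical counterpart of the shapeless scatter.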

### What is the Coefficient of Determination

Look at the next two scattergrams.

In both cases, it appears that there is a linear relationship between the x and y variables. However, if you imagine regression lines atop the scatterplots, you can see that in the left graph the points will lie closer to the regression line than in the right graph. The coefficient of determination is a number that measures the degree of closeness of points to the regression line. This degree of closeness is called 'goodness of fit' by statisticians.

Variation in y-values is measured by the standard deviation of the y-values. Standard deviation is measured by

In defining the coefficient of determination, only the inside top of this formula is used, and the x's are replaced by y's. It is called the total sum of squares or SST, and is given by (y with a bar over it is the mean or average of the y-values):

\[ SST = \sum (y - \bar{y})^2 \]

It can be shown that SST can be expressed as the sum of two terms named SSR and SSE. SSR, or the sum of squares due to regression, is given by the formula (y with a 'hat' over it signifies the y-value found by using the regression equation on an x-value):

\[ SSR = \sum (\hat{y} - \bar{y})^2 \]

SSR measures the amount of total variation in y-values explained by the regression line. The amount of total variation not explained by the regression line is called the sum of squares for error and denoted by SSE. The formula for it is:

\[ SSE = \sum (y - \hat{y})^2 \]

These three quantities are related by the formula (which can be shown algebraically):

\[ SST = SSR + SSE \]

Dividing both sides of this formula by SST results in the equation

\[ 1 = \frac{SSR}{SST} + \frac{SSE}{SST} \]

In this equation SSR/SST is called the coefficient of determination; it measures the proportion of variation in the y values explained by the regression line. Multiplying by 100 gives the percentage of the variation in y-values explained by the regression. The coefficient of determination is denoted by r². For the data shown in the last three graphs above, the coefficient of determination is 3046.68/3187.67 = 0.96. So about 96% of the variation in y-values is explained by the regression. The other 4% is unexplained or error variation.
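The decomposition can be verified numerically. The sketch below uses R's built-in mtcars data (mpg regressed on wt), which is an assumption made for illustration and not the data behind the graphs; the object names are likewise illustrative.

```r
# Illustrative data: miles per gallon regressed on weight.
fit <- lm(mpg ~ wt, data = mtcars)
y <- mtcars$mpg
y_hat <- fitted(fit)

# Total, regression, and error sums of squares.
sst <- sum((y - mean(y)) ^ 2)
ssr <- sum((y_hat - mean(y)) ^ 2)
sse <- sum((y - y_hat) ^ 2)

# Coefficient of determination as the explained share of SST.
r_squared <- ssr / sst
```

Here sst equals ssr + sse up to rounding error, and r_squared matches the R-squared reported by summary(fit).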

Formulas for computing SST and SSR are:

### How is the Correlation Coefficient Computed

Correlation measures the degree of linear relationship between two variables. It is denoted by r, and its magnitude is the square root of the coefficient of determination, r²; its sign is the sign of the slope of the regression line. Since r² can only lie between 0 and 1, r must lie between -1 and 1. Also, since values of r² near 1 indicate that the regression line lies close to the data points, i.e. the regression line explains most of the variation in y-values, values of r near -1 or +1 also indicate a regression in which most of the variability in y's is explained by the regression line. Values of r near +1 indicate a regression line with positive slope, which implies that there is a direct linear relationship between the x and y-variables, while values of r near -1 indicate a regression line with negative slope, implying an indirect or inverse linear relationship between the variables.

Another formula for computing the correlation coefficient is:

where the symbols in the formula have been defined above.

### Relationship between the Correlation Coefficient, the Scattergram, and the Regression Line

This link takes you to an interactive demonstration that shows the relationship between the correlation coefficient and the regression line. When the page opens click on Interactive Scatterplot. After the simulation for the scatterplot opens, you can place points on the display by clicking the mouse button. After the 2nd point has been placed the regression line will be drawn. In addition, the correlation coefficient and other statistics will be shown.

Here is a link to another demonstration of the relationship between the scattergram or scatterplot and the correlation coefficient. Once the page opens click on the + symbol next to Statistical Application--the display will change, then click on the + next to correlation, then click on the + next to applets, and finally click on correlation movie. When the movie plays you will see the relationship between points in the plane and the correlation coefficient.

### Computation of Regression Lines and Coefficients of Determination

Make the following regression calculations on the FOCUS: database using Webstat2. You can open Webstat2 by pushing the orange button above.

1. Make a scatterplot of the points where SAT Math is the y-variable and SAT Verbal is the x-variable. Then find and plot the regression line on the scatterplot of points, find the coefficient of determination and correlation coefficient. Relate these coefficients to the regression line plotted on the graph.

2. Do the same as in number 1 but make HS GPA the x-variable and Cumulative GPA the y-variable.

3. Finally, answer the same questions as in 1 but use hours as the x-variable and hsgpa as the y-variable.

## Content Preview

Regression uses one or more explanatory variables (\(x\)) to predict one response variable (\(y\)). In this course, we will be learning specifically about simple linear regression. The "simple" part is that we will be using only one explanatory variable. If there are two or more explanatory variables, then multiple linear regression is necessary. The "linear" part is that we will be using a straight line to predict the response variable using the explanatory variable. Unlike in correlation, in regression it does matter which variable is called \(x\) and which is called \(y\). In regression, the explanatory variable is always \(x\) and the response variable is always \(y\). Both \(x\) and \(y\) must be quantitative variables.

You may recall from an algebra class that the formula for a straight line is \(y = mx + b\), where \(m\) is the slope and \(b\) is the y-intercept. The slope is a measure of how steep the line is; in algebra, this is sometimes described as "change in y over change in x," \(\frac{\Delta y}{\Delta x}\), or "rise over run." A positive slope indicates a line moving from the bottom left to top right. A negative slope indicates a line moving from the top left to bottom right. The y-intercept is the location on the y-axis where the line passes through. In other words, when \(x = 0\), \(y\) equals the y-intercept.

In statistics, we use similar formulas:

\(\widehat{y} = a + bx\)

\(\widehat{y}\) = predicted value of \(y\) for a given value of \(x\)
\(a\) = \(y\)-intercept
\(b\) = slope

In a population, the y-intercept is denoted as \(\beta_0\) ("beta sub 0") or \(\alpha\) ("alpha"). The slope is denoted as \(\beta_1\) ("beta sub 1") or just \(\beta\) ("beta").

Simple linear regression uses data from a sample to construct the line of best fit. But what makes a line “best fit”? The most common method of constructing a simple linear regression line, and the only method that we will be using in this course, is the least squares method. The least squares method finds the values of the y-intercept and slope that make the sum of the squared residuals (also known as the sum of squared errors or SSE) as small as possible.

\(y\) = actual value of \(y\)
\(\widehat{y}\) = predicted value of \(y\)
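The least squares idea can be sketched in a few lines of R. The example below assumes the built-in mtcars data (mpg as the response, wt as the explanatory variable) and a helper function named sse, both introduced here for illustration. It computes the slope and intercept from the usual least squares formulas and then shows that nudging either one makes the sum of squared residuals larger.

```r
# Illustrative data: x is the explanatory variable, y the response.
x <- mtcars$wt
y <- mtcars$mpg

# Least squares slope and intercept.
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x)) ^ 2)
a <- mean(y) - b * mean(x)

# Sum of squared residuals for any candidate line.
sse <- function(intercept, slope) {
  sum((y - (intercept + slope * x)) ^ 2)
}

# SSE at the least squares line and at two nearby lines.
sse_best    <- sse(a, b)
sse_shifted <- sse(a + 0.5, b)
sse_tilted  <- sse(a, b + 0.1)
```

Both perturbed lines give a larger SSE than sse_best, and a and b agree with the coefficients returned by lm(y ~ x).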


## Results

A total of 80 healthy, college-aged males (n = 43) and females (n = 37) completed the study. The majority of participants were between the ages of 18 and 20 years (76%) and White (84%). The derivation subgroup (n = 50) was gender-balanced and had a proportional representation of individuals from five fitness levels (ten participants per group: five males, five females) as designated by the American College of Sports Medicine (Thompson, Gordon, & Pescatello, 2009). The validation subgroup (n = 30) was comprised of the remaining males and females ranging in fitness level from very poor to excellent/superior. Those in the derivation subgroup were younger than the validation subgroup (p = 0.027). Participant characteristics are summarized in Table 1.

### Table 1

Participant characteristics by subgroup

| Characteristics | All (n = 80) M ± SD | All Mdn | Derivation (n = 50) M ± SD | Derivation Mdn | Validation (n = 30) M ± SD | Validation Mdn |
|---|---|---|---|---|---|---|
| Male, n (%) | 44 (54%) | | 25 (50%) | | 19 (61%) | |
| Age * | 19.6 ± 1.6 | | 19.2 ± 1.6 | | 20.7 ± 1.5 | |
| Body fat (%) | 19.5 ± 8.7 | | 18.9 ± 8.8 | | 20.5 ± 8.6 | |
| Weight (kg) | 69.9 ± 12.8 | | 67.6 ± 10.8 | | 73.6 ± 15.0 | |
| Height (m) | 1.7 ± 0.1 | | 1.7 ± 0.1 | | 1.7 ± 0.1 | |
| BMI (kg·m⁻²) | 23.8 ± 3.3 | | 23.2 ± 2.4 | | 24.8 ± 4.2 | |
| Physical activity a | | | | | | |
| Walking | 1433 ± 2452 | 924 | 1293 ± 1274 | 850 | 1648 ± 3620 | 939 |
| Moderate Activity | 633 ± 929 | 340 | 702 ± 988 | 360 | 516 ± 810 | 360 |
| Vigorous Activity | 1058 ± 1341 | 720 | 1055 ± 1167 | 720 | 1059 ± 1585 | 600 |
| Cardiorespiratory fitness b | | | | | | |
| VO2max (mL·kg⁻¹·min⁻¹) | 41.6 ± 7.1 | | 41.6 ± 7.1 | | 41.7 ± 7.2 | |
| Fitness Level, n (%) | | | | | | |
| Very Poor | 15 (18.8%) | | 10 (20.0%) | | 5 (16.1%) | |
| Poor | 16 (20.0%) | | 10 (20.0%) | | 6 (19.4%) | |
| Fair | 18 (22.5%) | | 10 (20.0%) | | 8 (25.8%) | |
| Good | 18 (22.5%) | | 10 (20.0%) | | 8 (25.8%) | |
| Excellent | 13 (16.2%) | | 10 (20.0%) | | 4 (12.9%) | |

Chi-square difference testing used for categorical data and independent t-tests for continuous data.

Differences significant at

Pearson correlations among VO2max, PA, and participant characteristics indicated that gender, height, and Vigorous Activity were significant univariate correlates of measured VO2max (Table 2).

### Table 2

Correlations between VO2max and independent variables (n = 50)

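| Derivation subgroup | VO2max | Gender a | Age | Weight | Height | BMI | Walking | Moderate Activity |
|---|---|---|---|---|---|---|---|---|
| VO2max (mL·kg⁻¹·min⁻¹) | | | | | | | | |
| Gender | -0.556 *** | | | | | | | |
| Age (yr) | 0.216 | 0.159 | | | | | | |
| Weight (kg) | 0.171 | -0.080 | 0.328 * | | | | | |
| Height (m) | 0.281 * | -0.312 * | 0.119 | 0.764 *** | | | | |
| BMI (kg·m⁻²) | 0.011 | 0.763 *** | 0.338 ** | 0.796 *** | 0.221 | | | |
| Walking b | -0.196 | -0.247 | -0.300 * | -0.150 | -0.085 | -0.147 | | |
| Moderate Activity b | 0.123 | -0.603 ** | 0.286 | 0.226 | 0.252 | 0.107 | 0.046 | |
| Vigorous Activity b | 0.505 *** | -0.615 ** | 0.082 | 0.130 | 0.311 * | -0.083 | 0.190 | 0.292 * |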
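Each entry in the correlation table is a Pearson product-moment coefficient. The following is a minimal sketch of that calculation in plain Python; the function name `pearson_r` and the sample values are hypothetical illustrations and are not data from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either variable is constant.
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values, for illustration only (not study data).
vigorous_activity = [0.0, 480.0, 720.0, 1440.0, 2880.0]
measured_vo2max = [33.0, 38.5, 41.0, 44.5, 52.0]
r = pearson_r(vigorous_activity, measured_vo2max)
```

A positive `r` here corresponds to the direction of the Vigorous Activity and VO2max association reported in the table; significance testing is not part of this sketch.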

Two models were considered in the derivation of the VO2max estimation equation. The first model (Model 1) included gender, age, weight, height, Walking, Moderate Activity, and Vigorous Activity. In Model 2, BMI was included rather than weight and height. In both models, standard multiple regressions indicated that only gender and Vigorous Activity had significant independent multivariate associations with VO2max. To confirm these associations, the regression analyses were repeated using stepwise regression, which resulted in the same final model (Model 3). As expected, gender and Vigorous Activity remained in the model. Above and beyond the contribution of gender, Vigorous Activity explained a substantial additional portion of the variance in measured VO2max, as shown by a significant change in R² (ΔR² = 0.147, p = 0.001). Model 3 explained 43% of the variance in measured VO2max and was selected as the final model for the derivation of the VO2max estimation equation based on parsimony. Using the resulting unstandardized coefficients, VO2max is estimated by the following equation, where males = 1 and females = 2 and Vigorous Activity is calculated as MET-mins·week⁻¹:

Estimated VO2max = 47.749 − [6.493 × Gender] + [0.140 × (Vigorous Activity)⁻²].
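The estimation equation is a linear combination of an intercept, the gender code, and a Vigorous Activity term. A minimal sketch follows, assuming the caller supplies the Vigorous Activity term already transformed as it enters the printed equation; the function and constant names are hypothetical, and only the three coefficients come from the text.

```python
# Unstandardized coefficients as printed in the estimation equation.
INTERCEPT = 47.749
GENDER_COEF = -6.493
VIGOROUS_COEF = 0.140


def estimate_vo2max(gender_code, vigorous_term):
    """Estimated VO2max in mL/kg/min.

    gender_code:   1 for males, 2 for females (coding stated in the text).
    vigorous_term: the Vigorous Activity term exactly as it enters the
                   equation. Assumption: any transformation of the raw
                   MET-min/week value is applied by the caller.
    """
    if gender_code not in (1, 2):
        raise ValueError("gender_code must be 1 (male) or 2 (female)")
    return INTERCEPT + GENDER_COEF * gender_code + VIGOROUS_COEF * vigorous_term


# With a Vigorous Activity term of zero, the estimate depends on gender alone.
male_baseline = estimate_vo2max(1, 0.0)
female_baseline = estimate_vo2max(2, 0.0)
```

Because gender is coded 1 or 2 rather than 0 or 1, the intercept alone is not the estimate for either group; the two baselines differ by exactly the gender coefficient.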

In the derivation subgroup, there was a strong paired-sample correlation coefficient (p < 0.001) and no significant difference between measured and estimated VO2max (p = 0.991) (Table 4). The standard error of estimate (SEE) was within the range of SEE reported for other non-exercise VO2max estimation equations (Table 5) and well within the range of error (10-20%) observed with submaximal exercise testing. The Bland-Altman plot (Figure 1A) of the differences and averages of estimated and actual VO2max indicates that 64% of the total sample fell within 1 SD (5.45 mL·kg⁻¹·min⁻¹) of actual VO2max, and 100% fell within 2 SD (10.90 mL·kg⁻¹·min⁻¹). There was no evidence of systematic error, as the mean of the differences between measured and estimated VO2max equaled zero. However, the equation appears to underestimate (denoted by negative values) VO2max in fit individuals (categorized as good, excellent, and superior) by an average of −9.2% (range = −31.7% to +9.2%) and overestimate (denoted by positive values) VO2max in less fit individuals (categorized as very poor, poor, and fair) by +9.7% (range = −7.5% to +19.8%). Two individuals (4%) had a total error (%) greater than the 20% acceptable error estimated from submaximal exercise testing.
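The Bland-Altman summary described above can be sketched as follows. This assumes differences are taken as estimated minus measured (so underestimation is negative, matching the sign convention in the text) and that the 1 SD and 2 SD bands are centered on zero with the SEE of 5.45 as the band width. The function name and the five data pairs are hypothetical.

```python
from statistics import fmean


def bland_altman_summary(measured, estimated, sd):
    """Mean difference and share of differences within 1 and 2 SD of zero."""
    if len(measured) != len(estimated) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    # Difference = estimated - measured, so underestimation is negative.
    diffs = [e - m for m, e in zip(measured, estimated)]
    averages = [(m + e) / 2 for m, e in zip(measured, estimated)]
    n = len(diffs)
    return {
        "mean_difference": fmean(diffs),
        "within_1_sd": sum(abs(d) <= sd for d in diffs) / n,
        "within_2_sd": sum(abs(d) <= 2 * sd for d in diffs) / n,
        "points": list(zip(averages, diffs)),  # (x, y) pairs for the plot
    }


# Hypothetical pairs (not study data): low values are overestimated and
# high values are underestimated, while the mean difference is zero.
measured = [30.0, 35.0, 40.0, 45.0, 50.0]
estimated = [36.0, 38.0, 40.0, 42.0, 44.0]
summary = bland_altman_summary(measured, estimated, sd=5.45)
```

The hypothetical data illustrate how a mean difference of zero can coexist with errors that depend on fitness level, which is the pattern the paragraph reports.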

Figure 1. Difference between objectively measured VO2max and Estimated VO2max plotted against the average of objectively measured VO2max and Estimated VO2max. Low fit individuals represent the very poor, poor, and fair fitness categories. High fit individuals represent the good, excellent, and superior fitness categories. Estimated VO2max = 47.749 − [6.493 × Gender (males = 1, females = 2)] + [0.140 × (Vigorous Activity)⁻²]. SEE = 5.45 mL·kg⁻¹·min⁻¹. Mean difference: 0.0 mL·kg⁻¹·min⁻¹. Acceptable individual error: ±10 mL·kg⁻¹·min⁻¹. A) Derivation group: n = 32 (64%) within 1 SD; n = 50 (100%) within 2 SD. The difference between measurements was significantly correlated with the average of the two measurements (r = 0.478, p = 0.001). B) Validation group: n = 20 (67%) within 1 SD; n = 49 (97%) within 2 SD. The difference between measurements was significantly correlated with the average of the two measurements (r = 0.567, p = 0.001).

### Table 4

Accuracy and validity statistics for the VO2max estimation equation

| Group (n) | Measured VO2max | Estimated VO2max | t | r | SEE | SEE% |
|---|---|---|---|---|---|---|
| Derivation (n = 50) | 41.6 ± 7.1 | 41.6 ± 4.8 | -0.012 | 0.675 | 5.29 | 12.7 |
| Validation (n = 30) | 41.7 ± 7.2 | 42.1 ± 4.3 | -0.479 | 0.606 | 5.86 | 14.0 |
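The SEE% column appears consistent with the SEE expressed as a percentage of mean measured VO2max. The text does not define SEE%, so that definition is an assumption here, and the function name is hypothetical. A minimal sketch using the values in the table:

```python
def see_percent(see, mean_measured):
    """SEE as a percentage of the mean measured value (assumed definition)."""
    if mean_measured <= 0:
        raise ValueError("mean_measured must be positive")
    return 100.0 * see / mean_measured


# SEE and mean measured VO2max taken from the table rows above.
derivation_pct = see_percent(5.29, 41.6)
validation_pct = see_percent(5.86, 41.7)
```

Under this assumed definition the results agree with the tabulated percentages to within rounding of the published inputs.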

### Table 5

Summary of recent (1997-2009) physical activity-based non-exercise VO2max estimation equations

| Study | Sample size | Age (years) | VO2max (mL·kg⁻¹·min⁻¹) | Included Variables | r | R² | SEE (mL·kg⁻¹·min⁻¹) | SEE (%) |
|---|---|---|---|---|---|---|---|---|
| Current study | 80 | 18-25 | 41.6 ± 7.1 | Gender, Vigorous Activity (IPAQ-S) | 0.65 | 0.42 | 5.45 | 13.1 |
| Bradshaw, et al. (2005) | 100 | 18-65 | 40.0 ± 9.5 | Gender, Age, BMI, PFA, PASS | 0.93 | NR | 3.45 | 8.6 |
| Duque, et al. (2009) | 70 | 30.8 ± 7.7 | 30.8 ± 7.7 | Gender, BMI, leisure time activity (Baecke’s) | NR | 0.35 | 6.08 | 19.7 |
| George, et al. (1997) | 100 | 18-29 | 44.1 ± 6.6 | Gender, PFA, PASS, BMI | 0.85 | NR | 3.44 | 7.8 |
| Malek, et al. (2004) | Females n = 80 | 38.5 ± 9.4 | 2594 ± 431 (mL·min⁻¹) | Gender, Age, Weight, Height, Training (hr/wk), Intensity of training (Borg Scale), Years of training | 0.83 | 0.67 | 259 (mL·min⁻¹) | 10.0 |
| Malek, et al. (2005) | Males n = 112 | 40.2 ± 11.7 | 4207 ± 636.8 (mL·min⁻¹) | Gender, Age, Weight, Height, Training (hr/wk), Intensity of training (Borg Scale), Years of training | 0.82 | 0.65 | 387 (mL·min⁻¹) | 9.2 |
| Matthews, et al. (1999) | 799 | 19-79 | 37.2 ± 11.0 | Gender, Age, Age², PASS, weight, height | NR | 0.74 | 5.64 | 15.2 |
| | | | | Gender, Age, Age², PASS, BMI | NR | 0.73 | 5.76 | 15.5 |
| Wier, et al. (2006) | Males n = 2417 | Males 21-82 | Males 36.5 ± 8.1 | Gender, Age, PASS, waist girth | 0.81 | NR | 4.80 | 13.4 |
| | | | | Gender, Age, PASS, BMI | 0.80 | NR | 4.90 | 13.4 |
| | Females n = 384 | Females 19-67 | Females 31.9 ± 7.5 | Gender, Age, PASS, % body fat | 0.82 | NR | 4.72 | 13.2 |

PFA = perceived functional ability

PASS = NASA Physical Activity Status Scale

In the validation subgroup, there again was a strong paired-sample correlation coefficient (r = 0.601, p < 0.001) with no significant difference between measured and estimated VO2max (p = 0.636) (Table 4). Similarly, the standard error of estimate (SEE and SEE%) was within the range reported for other non-exercise VO2max estimation equations (Table 5) and within the range of error (10-20%) observed with submaximal exercise testing. The Bland-Altman plot (Figure 1B) of the differences and averages of estimated and actual VO2max indicates that 67% of the total sample fell within 1 SD of actual VO2max and 97% fell within 2 SD. The average underestimation in the fit group was −11.2% (range = −30.4% to +8.7%) and the average overestimation in the unfit group was +11.6% (range = +5.4% to +20.5%). Four individuals (13%) had a total error (%) greater than the 20% acceptable error estimated from submaximal exercise testing.
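The individual error percentages and the count of participants exceeding the 20% limit can be sketched as below. This assumes percent error is the difference (estimated minus measured) relative to the measured value, which the text does not state explicitly; the function names and the four data pairs are hypothetical.

```python
def percent_error(measured, estimated):
    """Signed error as a percentage of the measured value (assumed definition).

    Negative values indicate underestimation, positive values overestimation.
    """
    return 100.0 * (estimated - measured) / measured


def count_exceeding(measured_values, estimated_values, limit=20.0):
    """Number of individuals whose absolute percent error exceeds the limit."""
    return sum(
        abs(percent_error(m, e)) > limit
        for m, e in zip(measured_values, estimated_values)
    )


# Hypothetical pairs (not study data).
measured = [30.0, 40.0, 50.0, 60.0]
estimated = [36.6, 41.0, 44.0, 45.0]
n_over_limit = count_exceeding(measured, estimated)
```

In the hypothetical data the first pair is overestimated by about 22% and the last is underestimated by 25%, so two of the four exceed the 20% limit.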

## Abbreviations and Acronyms

Statement of Conflict of Interest: see page 28.

The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Support: Partial support for this project was provided by TKC Global (Grant No. GS04T11BFP0001; L. Kaminsky, Ball State University), and the National Center for Advancing Translational Sciences, National Institutes of Health, through Grant UL1TR000050 (R. Arena, University of Chicago).