Regression Analysis for Social Sciences


By: Alexander von Eye, Christof Schuster

Elsevier Trade Monographs, 1998

ISBN: 9780080550824, 386 pages




 

Chapter 1

Introduction


Regression analysis is one of the most widely used statistical techniques. Today, regression analysis is applied in the social sciences, medical research, economics, agriculture, biology, meteorology, and many other areas of academic and applied science. Reasons for the outstanding role that regression analysis plays include that its concepts are easily understood, that it is implemented in virtually every all-purpose statistical computing package, and that it can therefore be readily applied to the data at hand. Moreover, regression analysis lies at the heart of a wide range of more recently developed statistical techniques such as the class of generalized linear models (McCullagh & Nelder, 1989; Dobson, 1990). Hence a sound understanding of regression analysis is fundamental to developing one's understanding of modern applied statistics.

Regression analysis is designed for situations where there is one continuously varying variable, for example, sales profit, yield in a field experiment, or IQ. This continuous variable is commonly denoted by Y and termed the dependent variable, that is, the variable that we would like to explain or predict. For this purpose, we use one or more other variables, usually denoted by X1, X2, …, the independent variables, that are related to the variable of interest.

To simplify matters, we first consider the situation where we are only interested in a single independent variable. To exploit the information that the independent variable carries about the dependent variable, we try to find a mathematical function that is a good description of the assumed relation. Of course, we do not expect the function to describe the dependent variable perfectly, as in statistics we always allow for randomness in the data, that is, some sort of variability, sometimes referred to as error, that on the one hand is too large to be neglected but, on the other hand, is only a nuisance inherent in the phenomenon under study.

To exemplify the ideas we present, in Figure 1.1, a scatterplot of data that was collected in a study by Finkelstein, von Eye, and Preece (1994). One goal of the study was to relate the self-reported number of aggressive impulses to the number of self-reported incidences of physical aggression in adolescents. The sample included n = 106 respondents, each providing the pair of values X, that is, Aggressive Impulses, and Y, that is, open Physical Aggression against Peers. In shorthand notation, (Xi, Yi), i = 1, …, 106.

Figure 1.1 Scatterplot of aggressive impulses against incidences of physical aggression.

While it might be reasonable to assume a relation between Aggressive Impulses and Physical Aggression against Peers, scientific practice involves demonstrating this assumed link between the two variables using data from experiments or observational studies. Regression analysis is one important tool for this task.

However, regression analysis is not only suited to suggesting decisions as to whether or not a relationship between two variables exists. Regression analysis goes beyond this decision making and provides a different type of precise statement. As we already mentioned above, regression analysis specifies a functional form for the relationship between the variables under study that allows one to estimate the degree of change in the dependent variable that goes hand in hand with changes in the independent variable. At the same time, regression analysis allows one to make statements about how certain one can be about the predicted change in Y that is associated with the observed change in X.

To see how the technique works we look at the data presented in the scatterplot of Figure 1.1. On purely intuitive grounds, simply by looking at the data, we can try to make statements similar to the ones that are addressed by regression analysis.

First of all, we can ask whether there is a relationship at all between the number of aggressive impulses and the number of incidences of physical aggression against peers. The scatterplot shows a very wide scatter of the points in the plot. This could be caused by imprecise measurement or a naturally high variability of responses concerning aggression. Nevertheless, there seems to be a slight trend in the data, confirming the obvious hypothesis that more aggressive impulses lead to more physical aggression. Since the scatter of the points is so wide, it is quite hard to make very elaborate statements about the supposed functional form of this relation. The assumption of a linear relation between the variables under study, indicated by the straight line, and a positive trend in the data seems, for the time being, sufficiently elaborate to characterize the data.

Every linear relationship can be written in the form Y = βX + α. Therefore, specifying this linear relation is equivalent to finding reasonable estimates for β and α. Every straight line or, equivalently, every linear function is determined by two points in a plane through which the line passes. Therefore, we expect to obtain estimates of β and α if we can only find these two points in the plane. This could be done in the following way. We select a value on the scale of the independent variable, X, Aggressive Impulses in the example, and select all pairs of values that have a score on the independent variable that is close to this value. Now, a natural predictor for the value of the dependent variable, Y, Physical Aggression against Peers, that is representative for these observations is the mean of the dependent variable of these values. For example, when looking up in the scatterplot those points that have a value close to 10 on the Aggressive Impulse scale, the mean of the associated values on the physical aggression scale is near 15. Similarly, if we look at the points with a value close to 20 on the Aggressive Impulse scale, we find that the mean of the values of the associated Physical Aggression scale is located slightly above 20. So let us take 22 as our guess.
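The local-averaging idea described in this paragraph can be sketched in a few lines of Python. The data below are made-up stand-ins chosen so the local means land near the values quoted in the text; they are not the Finkelstein et al. sample:

```python
# Sketch of the "eyeballing" procedure from the text: pick a value x0 on the
# independent-variable scale, collect all observations whose X score lies
# close to x0, and take the mean of their Y values as a natural predictor.
# The pairs below are illustrative, NOT the original study data.

def local_mean(pairs, x0, window=2.0):
    """Mean of Y over all (x, y) pairs with |x - x0| <= window."""
    ys = [y for x, y in pairs if abs(x - x0) <= window]
    return sum(ys) / len(ys)

pairs = [(9, 14), (10, 16), (11, 15), (19, 21), (20, 23), (21, 22)]
print(local_mean(pairs, 10))  # mean of Y near X = 10 -> 15.0
print(local_mean(pairs, 20))  # mean of Y near X = 20 -> 22.0
```

The `window` width is an arbitrary choice here; in practice, how "close" counts as close is exactly the kind of subjectivity that motivates a formal estimation method.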

Now, we are ready to obtain estimates of β and α. It is a simple exercise to transform the coordinates of our hypothetical regression line, that is, (10, 15) and (20, 22), into estimates of β and α. One obtains as the estimate for β a value of 0.7 and as an estimate for α a value of 8. If we insert these values into the equation, Y = βX + α, and set X = 10 we obtain for Y a value of 15, which is just the corresponding value of Y from which we started. This can be done for the second point, (20, 22), as well.
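The arithmetic for turning the two hypothetical points (10, 15) and (20, 22) into estimates of β and α can be checked directly:

```python
# Slope and intercept of the line through the two points (10, 15) and
# (20, 22) used in the text: Y = beta * X + alpha.
x1, y1 = 10, 15
x2, y2 = 20, 22

beta = (y2 - y1) / (x2 - x1)   # slope: (22 - 15) / (20 - 10) = 0.7
alpha = y1 - beta * x1         # intercept: 15 - 0.7 * 10 = 8

# Plugging X = 10 and X = 20 back in recovers the original Y values.
print(beta * x1 + alpha, beta * x2 + alpha)
```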

As we have already mentioned, the scatter of the points is very wide and if we use our estimates for β and α to predict physical aggression for, say, a value of 15 or 30 on the Aggressive Impulse scale, we do not expect it to be very accurate. It should be noted that this lack of accuracy is not caused by our admittedly very imprecise eyeballing method.

Of course, we do not advocate using this method in general. Perhaps the most obvious point that can be criticized about this procedure is that if another person is asked to specify a regression line from eyeballing, he or she will probably come to a slightly different set of estimates for α and β. Hence, the conclusion drawn from the line would be slightly different as well. So it is natural to ask whether there is a generally agreed-upon procedure for obtaining the parameters of the regression line, or simply the regression parameters. This is the case. We shall see that the regression parameters can be estimated optimally by the method of ordinary least squares given that some assumptions are met about the population the data were drawn from. This procedure will be formally introduced in the next chapters. If this method is applied to the data in Figure 1.1, the parameter estimates turn out to be 0.6 for β and 11 for α. When we compare these estimates to the ones above, we see that our intuitive method yields estimates that are not too different from the least squares estimates calculated by the computer.
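A minimal sketch of the ordinary least squares estimators for simple regression is given below, using the standard formulas β̂ = Sxy/Sxx and α̂ = ȳ − β̂x̄. The data are synthetic, since the original sample is not reproduced here:

```python
# Minimal ordinary least squares fit for a single independent variable,
# using beta = S_xy / S_xx and alpha = mean(y) - beta * mean(x).
# The data below are synthetic illustrations, not the Finkelstein et al.
# (1994) sample.

def ols(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    beta = sxy / sxx
    alpha = my - beta * mx
    return beta, alpha

xs = [5, 10, 15, 20, 25, 30]
ys = [12, 16, 17, 23, 24, 30]
beta, alpha = ols(xs, ys)
print(beta, alpha)  # roughly 0.69 and 8.3 for these made-up numbers
```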

Regardless of the assumed functional form, obtaining parameter estimates is one of the important steps in regression analysis. But as estimates are obtained from data that are to a certain extent random, these estimates are random as well. If we imagine a replication of the study, we would certainly not expect to obtain exactly the same parameter estimates again. They will differ more or less from the estimates of the first study. Therefore, a decision is needed as to whether the results are merely due to chance. In other words, we have to deal with the question of how likely it would be that we will not get the present positive trend in a replication study. It will be seen that the variability of parameter estimates depends not on a single factor, but on several factors. Therefore, it is much harder to find an intuitively reasonable guess of this variability than a guess of the point estimates for β and α.
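The sampling variability of the estimates can be illustrated by simulating replications of a study. The true line here is taken as Y = 0.6X + 11, echoing the least squares estimates quoted for Figure 1.1, while the sample size matches the study (n = 106); the noise level and the range of X values are purely assumed for the illustration:

```python
# Simulating five replications of a study whose true relation is assumed
# to be Y = 0.6 * X + 11 plus random noise; each replication yields a
# slightly different slope estimate. Noise level and X range are
# illustrative assumptions.
import random

random.seed(1)  # fixed seed for a reproducible illustration

def fit_slope(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

betas = []
for _ in range(5):
    xs = [random.uniform(5, 30) for _ in range(106)]      # n = 106, as in the study
    ys = [0.6 * x + 11 + random.gauss(0, 5) for x in xs]  # assumed noise level
    betas.append(fit_slope(xs, ys))
print(betas)  # five slightly different slope estimates near 0.6
```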

With regression analysis we have a...