Tuesday, May 27, 2008

Path Analysis

Overview
Path analysis is an extension of the regression model, used to test the fit of the correlation matrix against two or more causal models which are being compared by the researcher. The model is usually depicted in a circle-and-arrow figure in which single arrows indicate causation. A regression is done for each variable in the model as a dependent on others which the model indicates are causes. The regression weights predicted by the model are compared with the observed correlation matrix for the variables, and a goodness-of-fit statistic is calculated. The best-fitting of two or more models is selected by the researcher as the best model for advancement of theory.
Path analysis requires the usual assumptions of regression. It is particularly sensitive to model specification because failure to include relevant causal variables or inclusion of extraneous variables often substantially affects the path coefficients, which are used to assess the relative importance of various direct and indirect causal paths to the dependent variable. Such interpretations should be undertaken in the context of comparing alternative models, after assessing their goodness of fit discussed in the section on structural equation modeling (SEM packages are commonly used today for path analysis in lieu of stand-alone path analysis programs). When the variables in the model are latent variables measured by multiple observed indicators, path analysis is termed structural equation modeling, treated separately. We follow the conventional terminology by which path analysis refers to single-indicator variables.
Key Concepts and Terms
Note that path estimates may be calculated by OLS regression or by maximum likelihood estimation (MLE), depending on the computer package. Two-Stage Least Squares (2SLS), discussed separately, is another path estimation procedure designed to extend the OLS regression model to situations where non-recursivity is introduced because the researcher must assume the covariances of some disturbance terms are not 0 (this assumption is discussed below).
• Path model. A path model is a diagram relating independent, intermediary, and dependent variables. Single arrows indicate causation between exogenous or intermediary variables and the dependent(s). Arrows also connect the error terms with their respective endogenous variables. Double arrows indicate correlation between pairs of exogenous variables. Sometimes the arrows in the path model are drawn with a width proportional to the absolute magnitude of the corresponding path coefficients (see below).
• Causal paths to a given variable include (1) the direct paths from arrows leading to it, and (2) correlated paths from exogenous variables correlated with others which have arrows leading to the given variable. Consider this model:

This model has correlated exogenous variables A, B, and C, and endogenous variables D and E. Error terms are not shown. The causal paths relevant to variable D are the paths from A to D, from B to D, and the paths reflecting common anteceding causes -- the paths from B to A to D, from C to A to D, and from C to B to D. Paths involving two correlations (C to B to A to D) are not relevant. Likewise, paths that go backward (E to B to D, or E to B to A to D) reflect common effects and are not relevant.
• Exogenous and endogenous variables. Exogenous variables in a path model are those with no explicit causes (no arrows going to them, other than the measurement error term). If exogenous variables are correlated, this is indicated by a double-headed arrow connecting them. Endogenous variables, then, are those which do have incoming arrows. Endogenous variables include intervening causal variables and dependents. Intervening endogenous variables have both incoming and outgoing causal arrows in the path diagram. The dependent variable(s) have only incoming arrows.
• Path coefficient/path weight. A path coefficient is a standardized regression coefficient (beta) showing the direct effect of an independent variable on a dependent variable in the path model. Thus when the model has two or more causal variables, path coefficients are partial regression coefficients which measure the extent of effect of one variable on another in the path model controlling for other prior variables, using standardized data or a correlation matrix as input. Recall that for bivariate regression, the beta weight (the b coefficient for standardized data) is the same as the correlation coefficient, so for the case of a path model with a variable as a dependent of a single exogenous variable (and an error residual term), the path coefficient in this special case is a zero-order correlation coefficient.
Consider this model, based on Bryman, A. and D. Cramer (1990). Quantitative data analysis for social scientists, pp. 246-251.

This model is specified by the following path equations:
Equation 1. satisfaction = b11*age + b12*autonomy + b13*income + e1
Equation 2. income = b21*age + b22*autonomy + e2
Equation 3. autonomy = b31*age + e3
where the b's are the regression coefficients and their subscripts give the equation number and the variable number (thus b21 is the coefficient in Equation 2 for variable 1, which is age).
Note: In each equation, only (and all of) the direct priors of the endogenous variable being used as the dependent are considered. The path coefficients, which are the betas in these equations, are thus the standardized partial regression coefficients of each endogenous variable on its priors. That is, the beta for any path (that is, the path coefficient) is a partial weight controlling for other priors for the given dependent variable.
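The three path equations can be estimated from a correlation matrix alone, since standardized partial regression weights solve the normal equations for each dependent variable. The following is a minimal sketch under stated assumptions: the correlation matrix R is hypothetical (it is not Bryman and Cramer's data), and the helper names solve and path_coefficients are introduced here only for illustration.

```python
# Hypothetical correlation matrix (illustrative numbers, not Bryman and Cramer's data).
VARS = ["age", "autonomy", "income", "satisfaction"]
R = [
    [1.0, 0.3, 0.6, 0.4],
    [0.3, 1.0, 0.4, 0.7],
    [0.6, 0.4, 1.0, 0.6],
    [0.4, 0.7, 0.6, 1.0],
]

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def path_coefficients(dependent, priors):
    """Standardized partial regression weights of `dependent` on its direct priors."""
    d = VARS.index(dependent)
    idx = [VARS.index(p) for p in priors]
    r_xx = [[R[i][j] for j in idx] for i in idx]  # correlations among the priors
    r_xy = [R[i][d] for i in idx]                 # correlations of priors with the dependent
    return dict(zip(priors, solve(r_xx, r_xy)))

# One regression per endogenous variable, using only its direct priors.
eq1 = path_coefficients("satisfaction", ["age", "autonomy", "income"])
eq2 = path_coefficients("income", ["age", "autonomy"])
eq3 = path_coefficients("autonomy", ["age"])

for name, eq in [("Equation 1", eq1), ("Equation 2", eq2), ("Equation 3", eq3)]:
    print(name, {k: round(v, 3) for k, v in eq.items()})
```

In Equation 3 there is a single prior, so the path coefficient returned is simply the zero-order correlation of age and autonomy, as noted above for the bivariate case.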
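Path coefficients were formerly called p coefficients; they are now usually called simply beta weights, following usage in multiple regression models. Bryman and Cramer computed the path coefficients (equivalently, the standardized regression coefficients or beta weights); the values used in the effect decomposition below are: age -> autonomy = .28, age -> income = .57, autonomy -> income = .22, age -> satisfaction = -.08, autonomy -> satisfaction = .58, and income -> satisfaction = .47.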

Correlated Exogenous Variables. If exogenous variables are correlated, it is common to label the corresponding double-headed arrow between them with its correlation coefficient.
Disturbance terms. The residual error terms, also called disturbance terms, reflect unexplained variance (the effect of unmeasured variables) plus measurement error. Note that the dependent in each equation is an endogenous variable (in this case, all variables except age, which is exogenous). Note also that the independents in each equation are all the variables with arrows to the dependent.
The effect size of the disturbance term for a given endogenous variable, which reflects unmeasured variables, is (1 - R^2), and its variance is (1 - R^2) times the variance of that endogenous variable, where R^2 is based on the regression in which it is the dependent and those variables with arrows to it are independents. The path coefficient from the disturbance term to its endogenous variable is SQRT(1 - R^2).
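The three disturbance quantities just described follow directly from the R-squared of the relevant regression. A small sketch, assuming a hypothetical R-squared of .36 and a hypothetical variance of 4.0 for the endogenous variable; the function name disturbance_summary is illustrative only.

```python
import math

def disturbance_summary(r_squared, endogenous_variance=1.0):
    """Return the unexplained share, disturbance variance, and disturbance path coefficient."""
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("R^2 must lie between 0 and 1")
    unexplained = 1.0 - r_squared
    return {
        "unexplained_share": unexplained,                            # 1 - R^2
        "disturbance_variance": unexplained * endogenous_variance,   # (1 - R^2) * variance
        "disturbance_path": math.sqrt(unexplained),                  # SQRT(1 - R^2)
    }

# Hypothetical inputs: R^2 = .36 and a raw-score variance of 4.0.
summary = disturbance_summary(r_squared=0.36, endogenous_variance=4.0)
print({k: round(v, 3) for k, v in summary.items()})
```

For standardized data the variance of the endogenous variable is 1, so the disturbance variance and the unexplained share coincide.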
The correlation between two disturbance terms is the partial correlation of the two endogenous variables, using as controls all their common causes (all variables with arrows to both). The covariance estimate is the partial covariance: the partial correlation times the product of the standard deviations of the two endogenous variables.
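As an illustration of the partial correlation just described, the sketch below handles the simplest case, in which the two endogenous variables x and y share exactly one common cause z. It uses the standard first-order partial correlation formula; the input correlations and standard deviations are hypothetical, and with several common causes one would instead correlate the residuals from regressing each variable on all of them.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for one common cause z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def partial_covariance(partial_r, sd_x, sd_y):
    """Partial correlation times the product of the two standard deviations, as in the text."""
    return partial_r * sd_x * sd_y

# Hypothetical correlations among two endogenous variables (x, y) and their common cause z.
r_dist = partial_correlation(r_xy=0.5, r_xz=0.4, r_yz=0.3)
cov_dist = partial_covariance(r_dist, sd_x=2.0, sd_y=3.0)
print(round(r_dist, 3), round(cov_dist, 3))
```

When the observed correlation of x and y equals the product of their correlations with z, the partial correlation is zero, which is what the assumption of uncorrelated disturbances implies.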
• Path multiplication rule: The value of any compound path is the product of its path coefficients. Imagine a simple three-variable compound path where education causes income causes conservatism. Let the regression coefficient of income on education be 1000: for each year of education, income goes up $1,000. Let the regression coefficient of conservatism on income be .0002: for every dollar income goes up, conservatism goes up .0002 points on a 5-point scale. Thus if education goes up 1 year, income goes up $1,000, which means conservatism goes up .2 points. This is the same as multiplying the coefficients: 1000*.0002 = .2. The same principle would apply if there were more links in the path. If standardized path coefficients (beta weights) were used, the path multiplication rule would still apply, but the interpretation would be in standardized terms. Either way, the product of the coefficients along the path reflects the weight of that path.
• Effect decomposition. Path coefficients may be used to decompose correlations in the model into direct and indirect effects, corresponding, of course, to direct and indirect paths reflected in the arrows in the model. This is based on the rule that in a linear system, the total causal effect of variable i on variable j is the sum of the values of all the paths from i to j. Considering "satisfaction" as the dependent in the model above, and considering "age" as the independent, the indirect effects are calculated by multiplying the path coefficients for each path from age to satisfaction:
age -> income -> satisfaction is .57*.47 = .27
age -> autonomy -> satisfaction is .28*.58 = .16
age -> autonomy -> income -> satisfaction is .28*.22*.47 = .03
total indirect effect = .46
That is, the total indirect effect of age on satisfaction is plus .46. In comparison, the direct effect is only minus .08. The total causal effect of age on satisfaction is (-.08 + .46) = .38.
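The decomposition can be reproduced mechanically by enumerating every directed path from age to satisfaction and applying the path multiplication rule to each. This is a sketch that uses the six path coefficients quoted in this example; the function names compound_paths and path_value are illustrative.

```python
# Path coefficients quoted in the text: (cause, effect) -> beta weight.
PATHS = {
    ("age", "autonomy"): 0.28,
    ("age", "income"): 0.57,
    ("age", "satisfaction"): -0.08,
    ("autonomy", "income"): 0.22,
    ("autonomy", "satisfaction"): 0.58,
    ("income", "satisfaction"): 0.47,
}

def compound_paths(source, target, prefix=None):
    """Yield every directed path from source to target as a list of variable names."""
    prefix = (prefix or []) + [source]
    if source == target:
        yield prefix
        return
    for cause, effect in PATHS:
        if cause == source:
            yield from compound_paths(effect, target, prefix)

def path_value(path):
    """Path multiplication rule: the product of the coefficients along the path."""
    value = 1.0
    for cause, effect in zip(path, path[1:]):
        value *= PATHS[(cause, effect)]
    return value

direct = 0.0
indirect = 0.0
for path in compound_paths("age", "satisfaction"):
    value = path_value(path)
    if len(path) == 2:      # a single arrow is the direct effect
        direct += value
    else:                   # anything longer passes through an intervening variable
        indirect += value
    print(" -> ".join(path), round(value, 3))

print("direct", round(direct, 2), "indirect", round(indirect, 2),
      "total", round(direct + indirect, 2))
```

Because the products are summed before rounding, the last digit can differ slightly from a hand calculation that rounds each path first.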
Effect decomposition is equivalent to effects analysis in regression with one dependent variable. Path analysis, however, can also handle effect decomposition for the case of two or more dependent variables.
In general, any bivariate correlation may be decomposed into spurious and total causal effects, and the total causal effect can be decomposed into a direct and an indirect effect. The total causal effect is the coefficient in a regression with all of the model's prior but not intervening variables for x and y controlled (the beta coefficient for the usual standardized solution, the partial b coefficient for the unstandardized or raw solution). The spurious effect is the total effect minus the total causal effect. The direct effect is the partial coefficient (beta for standardized, b for unstandardized) for y on x controlling for all prior variables and all intervening variables in the model. The indirect effect is the total causal effect minus the direct effect, and measures the effect of the intervening variables. Where effects analysis in regression may use a variety of coefficients (partial correlation or regression, for instance), effect decomposition in path analysis is restricted to use of regression.
For instance, imagine a five-variable model in which the exogenous variable Education is correlated with the exogenous variable Skill Level, and both Education and Skill Level are correlated with the exogenous variable Job Status. Further imagine that Education and each of the other two exogenous variables are modeled to be direct causes of Income and also of Median House Value, which are the two dependent variables. We might then decompose the correlation of Education and Income:
1. Direct effect of Education on Income, indicated by the path coefficient of the single-headed arrow from Education to Income.
2. Indirect effect due to Education's correlation with Skill Level, and Skill Level's direct effect on Income, indicated by multiplying the correlation of Education and Skill Level by the path coefficient from Skill Level to Income.
3. Indirect effect due to Education's correlation with Job Status, and Job Status's direct effect on Income, indicated by multiplying the correlation of Education and Job Status by the path coefficient from Job Status to Income.
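The three components listed above add up to the model-implied correlation of Education and Income. A minimal numeric sketch follows; all correlations and path coefficients in it are hypothetical standardized values chosen only for illustration.

```python
# Hypothetical standardized values, chosen only for illustration.
EXOG_CORR = {
    frozenset({"education", "skill_level"}): 0.50,
    frozenset({"education", "job_status"}): 0.40,
    frozenset({"skill_level", "job_status"}): 0.30,
}
PATH_TO_INCOME = {"education": 0.30, "skill_level": 0.20, "job_status": 0.25}

def decompose_correlation(source, paths, correlations):
    """Split the implied correlation of `source` with the dependent into its components."""
    direct = paths[source]  # component 1: the single-headed arrow
    via_correlates = {      # components 2 and 3: correlation times the other variable's path
        other: correlations[frozenset({source, other})] * beta
        for other, beta in paths.items()
        if other != source
    }
    return direct, via_correlates

direct, via = decompose_correlation("education", PATH_TO_INCOME, EXOG_CORR)
implied_r = direct + sum(via.values())

print("direct:", round(direct, 3))
for other, value in via.items():
    print("via correlation with", other + ":", round(value, 3))
print("implied correlation of education and income:", round(implied_r, 3))
```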
As a second example, decomposition for the same five-variable model is a bit more complex if we wish to break down the correlation of the two dependent variables, Income and Median House Value. Since here, somewhat implausibly, the two dependents are modeled not to have a direct effect from Income to House Value, the true correlation is hypothesized to be zero and all correlations are spurious.
4. The spurious direct effect of Education as a common anteceding variable directly causing both dependents, indicated by multiplying the path coefficient from Education to Income by the path coefficient of Education to House Value.
5. The spurious direct effect of Skill Level as a common anteceding variable directly causing both dependents, indicated by multiplying the path coefficient from Skill Level to Income by the path coefficient of Skill Level to House Value.
6. The spurious direct effect of Job Status as a common anteceding variable directly causing both dependents, indicated by multiplying the path coefficient from Job Status to Income by the path coefficient of Job Status to House Value.
7. The spurious indirect effect of Education and Skill Level as common anteceding variables directly causing both dependents, indicated by multiplying the path coefficient from Education to Income by the correlation of Education and Skill Level by the path from Skill Level to House Value, and adding the product of the path from Skill Level to Income by the correlation of Education and Skill Level by the path from Education to Median House Value.
8. The spurious indirect effect of Education and Job Status as common anteceding variables directly causing both dependents, indicated by multiplying the path coefficient from Education to Income by the correlation of Education and Job Status by the path from Job Status to House Value, and adding the product of the path from Job Status to Income by the correlation of Education and Job Status by the path from Education to Median House Value.
9. The spurious indirect effect of Skill Level and Job Status as common anteceding variables directly causing both dependents, indicated by multiplying the path coefficient from Skill Level to Income by the correlation of Skill Level and Job Status by the path from Job Status to House Value, and adding the product of the path from Job Status to Income by the correlation of Skill Level and Job Status by the path from Skill Level to Median House Value.
10. The residual effect is the difference between the correlation of Income and Median House Value and the sum of the spurious direct and indirect effects.
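The seven components of this second decomposition can be tallied in a few lines. The sketch below assumes hypothetical standardized path coefficients, hypothetical correlations among the three exogenous variables, and a hypothetical observed correlation of .40 between Income and Median House Value.

```python
from itertools import combinations

# All numbers below are hypothetical and serve only to illustrate the bookkeeping.
EXOG = ["education", "skill_level", "job_status"]
EXOG_CORR = {
    frozenset({"education", "skill_level"}): 0.50,
    frozenset({"education", "job_status"}): 0.40,
    frozenset({"skill_level", "job_status"}): 0.30,
}
TO_INCOME = {"education": 0.30, "skill_level": 0.20, "job_status": 0.25}
TO_HOUSE_VALUE = {"education": 0.20, "skill_level": 0.10, "job_status": 0.30}
OBSERVED_R = 0.40  # hypothetical observed correlation of the two dependents

# Spurious direct effects: one common cause with arrows to both dependents.
spurious_direct = {v: TO_INCOME[v] * TO_HOUSE_VALUE[v] for v in EXOG}

# Spurious indirect effects: each pair of correlated causes, traced in both directions.
spurious_indirect = {}
for a, b in combinations(EXOG, 2):
    r_ab = EXOG_CORR[frozenset({a, b})]
    spurious_indirect[(a, b)] = r_ab * (
        TO_INCOME[a] * TO_HOUSE_VALUE[b] + TO_INCOME[b] * TO_HOUSE_VALUE[a]
    )

total_spurious = sum(spurious_direct.values()) + sum(spurious_indirect.values())
residual = OBSERVED_R - total_spurious  # what the model leaves unaccounted for

print("spurious direct:", {k: round(v, 4) for k, v in spurious_direct.items()})
print("spurious indirect:", {k: round(v, 4) for k, v in spurious_indirect.items()})
print("total spurious:", round(total_spurious, 4), "residual:", round(residual, 4))
```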
Correlated exogenous variables. The path weights connecting correlated exogenous variables are equal to the Pearson correlations. When calculating indirect paths, not only direct arrows but also the double-headed arrows connecting correlated exogenous variables are used in tracing possible indirect paths, except:
Tracing rule: An indirect path cannot enter and exit on an arrowhead. This means that a single path cannot pass through two double-headed arrows in succession (that is, through two correlations between exogenous variables).
• Significance and Goodness of Fit in Path Models
o To test individual path coefficients one uses the standard t or F test from regression output.
o To test the model with all its paths one uses a goodness of fit test from a structural equation modeling program. If a model is correctly specified, including all relevant and excluding all irrelevant variables, with arrows correctly indicated, then the sum of path values from i to j will equal the regression coefficient for j predicted on the basis of i. That is, for standardized data, where the bivariate regression coefficient equals the correlation coefficient, the sum of path coefficients (standardized) will equal the correlation coefficient. This means one can compare the path-estimated correlation matrix with the observed correlation matrix to assess the goodness-of-fit of path models. As a practical matter, goodness-of-fit is calculated by entering the model and its data into a structural equation modeling program such as LISREL or AMOS, which compute a variety of alternative goodness-of-fit coefficients, discussed separately.
o To modify the path model one uses modification indexes (MI) to add arrows and nonsignificance of path coefficients to drop arrows, in a model-building and model-trimming process discussed in the section on structural equation modeling.
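The comparison of path-estimated and observed correlations can be sketched for a recursive model in standardized form. The code below assumes uncorrelated disturbances, uses the path coefficients from the satisfaction example, and compares the implied correlations with a hypothetical observed matrix (the OBSERVED values are made up for illustration). It reports raw residuals only; it is not a substitute for the goodness-of-fit statistics produced by an SEM package.

```python
ORDER = ["age", "autonomy", "income", "satisfaction"]  # causal order, exogenous first
PATHS = {
    ("age", "autonomy"): 0.28,
    ("age", "income"): 0.57,
    ("age", "satisfaction"): -0.08,
    ("autonomy", "income"): 0.22,
    ("autonomy", "satisfaction"): 0.58,
    ("income", "satisfaction"): 0.47,
}

def implied_correlations(order, paths):
    """Correlations implied by a recursive standardized path model."""
    r = {(v, v): 1.0 for v in order}
    for pos, effect in enumerate(order):
        for earlier in order[:pos]:
            # r(earlier, effect) = sum over direct causes of beta * r(earlier, cause)
            value = sum(
                beta * r[(earlier, cause)]
                for (cause, eff), beta in paths.items()
                if eff == effect
            )
            r[(earlier, effect)] = r[(effect, earlier)] = value
    return r

# Hypothetical observed correlations, for illustration only.
OBSERVED = {
    ("age", "autonomy"): 0.28,
    ("age", "income"): 0.63,
    ("autonomy", "income"): 0.38,
    ("age", "satisfaction"): 0.35,
    ("autonomy", "satisfaction"): 0.74,
    ("income", "satisfaction"): 0.64,
}

implied = implied_correlations(ORDER, PATHS)
residuals = {pair: OBSERVED[pair] - implied[pair] for pair in OBSERVED}
worst_pair = max(residuals, key=lambda pair: abs(residuals[pair]))

for pair, diff in residuals.items():
    print(pair, "implied", round(implied[pair], 3), "residual", round(diff, 3))
print("largest discrepancy:", worst_pair)
```

Since age is the only exogenous variable here, the implied correlation of age and satisfaction equals the sum of the path values from age to satisfaction, which is the property described above.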

Assumptions
• Linearity: relationships among variables are linear (though, of course, variables may be nonlinear transforms).
• Additivity: there are no interaction effects (though, of course, variables may be interaction crossproduct terms)
• Interval level data for all variables, if regression is being used to estimate path parameters. As in other forms of regression modeling, it is common to use dichotomies and ordinal data in practice. If dummy variables are used to code a categorical variable, one must be careful that they are represented as a block in the path diagram (e.g., if an arrow is drawn to one dummy it must be drawn to all others in the set). If an arrow were to be drawn from one dummy variable to another dummy variable in the same set, this would violate the recursivity assumption discussed below.
• Residual (unmeasured) variables are uncorrelated with any of the variables in the model other than the one they cause.
• Disturbance terms are uncorrelated with endogenous variables. As a corollary of the previous assumption, path analysis assumes that for any endogenous variable, its disturbance term is uncorrelated with any other endogenous variable in the model. This is a critical assumption, violation of which may make regression inappropriate as a method of estimating path parameters. This assumption may be violated due to measurement error in measuring an endogenous variable; when an endogenous variable is actually a direct or indirect cause of a variable which the model states is the cause of that endogenous variable (reverse causation); or when a variable not in the model is a cause of an endogenous variable and a variable the model specifies as a cause of that endogenous variable (spurious causation).
• Low multicollinearity (otherwise one will have large standard errors of the b coefficients used in removing the common variance in partial correlation analysis).
• The model must not be underidentified (underdetermined). For underidentified models there are too few structural equations to solve for the unknowns. Overidentification usually provides better estimates of the underlying true values than does just identification.
o Recursivity: all arrows flow one way, with no feedback looping. Also, it is assumed that disturbance (residual error) terms for the endogenous variables are uncorrelated. Recursive models are never underidentified.
• Proper specification of the model is required for interpretation of path coefficients. Specification error occurs when a significant causal variable is left out of the model. The path coefficients will reflect the shared covariance with such unmeasured variables and will not be accurately interpretable in terms of direct and indirect effects. In particular, if a variable specified as prior to a given variable is really consequent to it, "we can do ourselves considerable damage" (Davis, 1985: 64) because if a variable is consequent it would be estimated to have no path effect, whereas when it is included as a prior variable in the model, this erroneously changes the coefficients for other variables in the model. Note, however, that while interpretation of path coefficients is inaccurate under specification error, it is still possible to compare the relative fit of two models, perhaps both with specification error.
• Appropriate correlation input. When using a correlation matrix as input, it is appropriate to use Pearsonian correlation for two interval variables, polychoric correlation for two ordinals, tetrachoric for two dichotomies, polyserial for an interval and an ordinal, and biserial for an interval and a dichotomy.
• Adequate sample size is needed to assess significance. Kline (1998) recommends 10 times as many cases as parameters (or ideally 20 times). He states that 5 times or less is insufficient for significance testing of model effects.
• The same sample is required for all regressions used to calculate the path model. This may require reducing the data set down so that there are no missing values for any of the variables included in the model. This might be achieved by listwise dropping of cases or by data imputation.
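The last two assumptions lend themselves to a quick check before any regressions are run. The sketch below drops cases listwise so that every regression uses the same sample, and then compares the number of cases with the number of parameters against the ratios attributed to Kline (1998) above. The data rows, the parameter count of 6, and the function names are hypothetical; the text gives no verdict for ratios between 5 and 10, so the wording for that band is an assumption.

```python
def listwise_delete(rows, variables):
    """Keep only cases that have a value for every variable in the model."""
    return [row for row in rows if all(row.get(v) is not None for v in variables)]

def sample_size_verdict(n_cases, n_parameters):
    """Classify the cases-per-parameter ratio using the thresholds quoted in the text."""
    ratio = n_cases / n_parameters
    if ratio <= 5:
        return "insufficient for significance testing"
    if ratio < 10:
        return "below the recommended 10 cases per parameter"  # band not addressed in the text
    if ratio < 20:
        return "meets the recommended 10 cases per parameter"
    return "meets the ideal of 20 cases per parameter"

MODEL_VARS = ["age", "autonomy", "income", "satisfaction"]

# Hypothetical cases; None marks a missing value.
rows = [
    {"age": 34, "autonomy": 3.5, "income": 41000, "satisfaction": 4.0},
    {"age": 51, "autonomy": None, "income": 58000, "satisfaction": 3.0},
    {"age": 28, "autonomy": 2.0, "income": 30000, "satisfaction": 2.5},
    {"age": 45, "autonomy": 4.0, "income": None, "satisfaction": 4.5},
]

complete = listwise_delete(rows, MODEL_VARS)
print(len(rows), "cases before listwise deletion,", len(complete), "after")
print(sample_size_verdict(n_cases=120, n_parameters=6))
```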

Taken from: http://www2.chass.ncsu.edu/garson/pa765/path.htm
