/****DATA SOURCE: https://onlinecourses.science.psu.edu/stat501/node/299***/ /*** Peruvian Blood Pressure Data: This dataset consists of variables possibly relating to blood pressures of n = 39 Peruvians who have moved from rural high altitude areas to urban lower altitude areas **/ clear all input ID Age Years Weight Height Chin Forearm Calf Pulse Systol Diastol 1 21 1 71.0 1629 8.0 7.0 12.7 88 170 76 2 22 6 56.5 1569 3.3 5.0 8.0 64 120 60 3 24 5 56.0 1561 3.3 1.3 4.3 68 125 75 4 24 1 61.0 1619 3.7 3.0 4.3 52 148 120 5 25 1 65.0 1566 9.0 12.7 20.7 72 140 78 6 27 19 62.0 1639 3.0 3.3 5.7 72 106 72 7 28 5 53.0 1494 7.3 4.7 8.0 64 120 76 8 28 25 53.0 1568 3.7 4.3 0.0 80 108 62 9 31 6 65.0 1540 10.3 9.0 10.0 76 124 70 10 32 13 57.0 1530 5.7 4.0 6.0 60 134 64 11 33 13 66.5 1622 6.0 5.7 8.3 68 116 76 12 33 10 59.1 1486 6.7 5.3 10.3 72 114 74 13 34 15 64.0 1578 3.3 5.3 7.0 88 130 80 14 35 18 69.5 1645 9.3 5.0 7.0 60 118 68 15 35 2 64.0 1648 3.0 3.7 6.7 60 138 78 16 36 12 56.5 1521 3.3 5.0 11.7 72 134 86 17 36 15 57.0 1547 3.0 3.0 6.0 84 120 70 18 37 16 55.0 1505 4.3 5.0 7.0 64 120 76 19 37 17 57.0 1473 6.0 5.3 11.7 72 114 80 20 38 10 58.0 1538 8.7 6.0 13.0 64 124 64 21 38 18 59.5 1513 5.3 4.0 7.7 80 114 66 22 38 11 61.0 1653 4.0 3.3 4.0 76 136 78 23 38 11 57.0 1566 3.0 3.0 3.0 60 126 72 24 39 21 57.5 1580 4.0 3.0 5.0 64 124 62 25 39 24 74.0 1647 7.3 6.3 15.7 64 128 84 26 39 14 72.0 1620 6.3 7.7 13.3 68 134 92 27 41 25 62.5 1637 6.0 5.3 8.0 76 112 80 28 41 32 68.0 1528 10.0 5.0 11.3 60 128 82 29 41 5 63.4 1647 5.3 4.3 13.7 76 134 92 30 42 12 68.0 1605 11.0 7.0 10.7 88 128 90 31 43 25 69.0 1625 5.0 3.0 6.0 72 140 72 32 43 26 73.0 1615 12.0 4.0 5.7 68 138 74 33 43 10 64.0 1640 5.7 3.0 7.0 60 118 66 34 44 19 65.0 1610 8.0 6.7 7.7 74 110 70 35 44 18 71.0 1572 3.0 4.7 4.3 72 142 84 36 45 10 60.2 1534 3.0 3.0 3.3 56 134 70 37 47 1 55.0 1536 3.0 3.0 4.0 64 116 54 38 50 43 70.0 1630 4.0 6.0 11.7 72 132 90 39 54 40 87.0 1542 11.3 11.7 11.3 92 152 88 end gen FracLife=Years/Age graph matrix Systol FracLife Age Years Weight Height Chin Forearm Calf Pulse Diastol /* Checking for Multicollinearity: variance inflation factor (VIF) */ reg Systol FracLife Age Years Weight Height Chin Forearm Calf Pulse Diastol vif reg Systol FracLife Age Weight Height Chin Forearm Calf Pulse Diastol vif /******Partial R-squares****/ net install pcorr2, from(http://fmwww.bc.edu/RePEc/bocode/p) pcorr2 Systol FracLife Age Weight Height Chin Forearm Calf Pulse Diastol /*Backward selection: pr(probability for removal)*/ stepwise, pr(.1): regress Systol FracLife Age Weight Height Chin Forearm Calf Pulse Diastol /*Forward selection: pe (probability for entry) */ stepwise, pe(.1): regress Systol FracLife Age Weight Height Chin Forearm Calf Pulse Diastol /*Stepwise selection: pe (probability for entry) pr(probability for removal)*/ stepwise, pe(0.15) pr(0.10): regress Systol FracLife Age Weight Height Chin Forearm Calf Pulse Diastol stepwise, pe(0.10) pr(0.15): regress Systol FracLife Age Weight Height Chin Forearm Calf Pulse Diastol reg Systol FracLife Weight Diastol graph matrix Systol FracLife Weight Diastol vif /******check if Diastolic is a confounding variable*******/ /*** Confounding provides an alternative explanation for an association between an exposure (X1) and an outcome (Y). It occurs when an observed association is in fact distorted because the exposure (X1) is also correlated with another risk factor (X2). This risk factor (X2) is also associated with the outcome (Y), but independently of the exposure under investigation (X1). As a consequence, the estimated association is not that same as the true effect of exposure (X1) on the outcome (Y). In order for a variable to be considered as a confounder: 1. The variable must be independently associated with the outcome (i.e. be a risk factor). 2. The variable must be also associated with the exposure under study. 3. It should not lie on the causal pathway between exposure and disease. (i.e. Confounder cannot be caused by the exposure). (source: http://www.healthknowledge.org.uk/public-health-textbook/research-methods/1a-epidemiology/confounding-interactions-methods) There are two ways (statistically) one could use to test for confounding: ***/ /***a. if the suspected confounder is significantly associated with both the exposure and outcome then it's a confounder***/ reg Systol Diastol reg Weight Diastol /***b. Alternatively, a potential confounder is included in the model if it changes the coefficient of the primary exposure variable by 10 percent. This approach is consistent with the definition of confounding, where confounding is said to be present if the unadjusted effect differs from the effect adjusted for putative confounders (Rothman and Greenland, 1998).***/ reg Systol Weight reg Systol Weight Diastol di 100*(.9628622-.7303133)/.9628622 /******final model******/ reg Systol FracLife Weight Diastol predict y_hat , xb /***GET RESIDUALS***/ predict ri, res /*raw residuals*/ predict si, rsta /*standardized residuals*/ predict ti, rstu /*studentized (or jackknifed) residuals*/ /***GET LEVERAGE AND COOK'S DISTANCE**/ predict hii, hat /*diagonal elements of the hat matrix*/ predict di, cook /*Cook's distance*/ /***Linear Regression Diagnostics Plots**/ hist ti, normal title("Histogram of Residuals") name(fig1) qnorm ti, title("Q-Q Plot for Residuals") name(fig2) twoway scatter ti y_hat [aweight=di], msymbol(oh) yline(0) title("Residual Plot against Fitted values") name(fig3) /**or rvfplot, yline(0)**/ lvr2plot, ml(ID) name(fig4) graph combine fig1 fig2 fig3 fig4, rows(2) /***Test the Assumptions of normality and constant variance****/ /* a. Checking Normality of Residuals*/ swilk ri /* b. Checking Homoscedasticity of Residuals**/ estat hettest /******Check for outliers and influential points***/ /***a. identify outlier points (in the dependent variable) with standardized or studentized residuals greater than 2 in magnitude****/ list Systol FracLife Weight Diastol y_hat ri si ti hii di if abs(si) > 2 | abs(ti) > 2, clean /***b. identify outliers (in the independent variables)***/ /***maximum acceptable leverage is 2p/n or (2k+2)/n : p=# of parameters, k=# of predictors***/ scalar hiimax = 2*4/39 di hiimax list Systol FracLife Weight Diastol y_hat ri si ti hii di if hii > hiimax, clean /***c. identify influential points***/ /****highly influential points are points with di>1 or (to be conservative) points with di>4/n*****/ list Systol FracLife Weight Diastol y_hat ri si ti hii di if di > 1, clean scalar dimax = 4/39 di dimax list Systol FracLife Weight Diastol y_hat ri si ti hii di if di > dimax, clean /*****INCORPORATE CATEGORICAL VARIABLES IN MULTIPLE REGRESSION***/ gen bp_cat="Normal" if (Systol<120 & Diastol<80) replace bp_cat="Prehypertension" if ((Systol>=120 & Systol<=139) | (Diastol>=80 & Diastol<=89)) replace bp_cat="Stage 1" if ((Systol>=140 & Systol<=159) | (Diastol>=90 & Diastol<=99)) replace bp_cat="Stage 2" if (Systol>=160 | Diastol>=100) tab bp_cat reg Pulse bp_cat /**how to create dummy variables****/ tab bp_cat, gen(dummy) xi i.bp_cat, noomit gen bp_num=0 if bp_cat=="Normal" replace bp_num=1 if bp_cat=="Prehypertension" replace bp_num=2 if bp_cat=="Stage 1" replace bp_num=3 if bp_cat=="Stage 2" tab bp_num reg Pulse Weight i.bp_num reg Pulse Weight dummy2 dummy3 dummy4 /***interpret beta hat for a dumm variable****/ reg Pulse i.bp_num univar Pulse, by(bp_num)