/******logistic regression****/


/****answering questions for pages 25-28. Taken from:
http://personalpages.manchester.ac.uk/staff/mark.lunt/stats/7_Binary/text.pdf
****/

/****in this example we took a random sample of 1100 observations from the original data set**/

clear all
use "C:\Users\fqeadan\Documents\PH539\hippain.dta"
tab sex hip_p, row

/***one could get the same result by using:**/
tab hip_p sex, co
* i Prevalence is 9.62% in men, 15.79% in women


tab sex hip_p, row chi2
* ii The difference in prevalence between men and women is very significant
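/***cs expects the outcome (case) variable first and the exposure variable second***/
cs hip_p sex, or
* iii Confidence interval for the odds ratio is (1.22, 2.54)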
* iv The odds ratio and the relative risk are very similar
* v Yes, the confidence interval for the risk difference does not contain 0, the value expected under the null hypothesis


codebook sex   /**in our case F=1 and M=0**/
logistic hip_p sex
glm hip_p sex, family(binomial) link(logit) eform


/***without eform one gets the betas instead of the ORs***/
glm hip_p sex, family(binomial) link(logit)
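/***A minimal sketch, assuming the glm fit above is still the active model:
exponentiating the coefficient for sex should reproduce the odds ratio reported
by logistic and by glm with eform. The scalar name or_sex is only illustrative.***/
scalar or_sex = exp(_b[sex])
display "OR for sex = " or_sex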


/****to change the reference category one could use ib#.var***/
logistic hip_p ib1.sex

* vi The odds ratio is exactly the same as that produced by cs
* vii The confidence intervals are the same to 3 decimal places (the methods used to calculate them differ, but generally give very similar results)
egen agegp = cut(age), at(0 30(10)100)
label define age 0 "<30" 30 "30-39" 40 "40-49" 50 "50-59"
label define age 60 "60-69" 70 "70-79" 80 "80-89" 90 "90+", modify
label values agegp age
tab hip_p agegp, chi2
* viii Yes: chi2 is very significant
logistic hip_p age sex 
estimates store mr
* ix Yes: p < 0.001 (Stata displays this as 0.000)
* x Odds of hip pain are multiplied by 1.03 (an increase of about 3%) for each year increase in age
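/***A sketch of how the per-year odds ratio scales, assuming the model with age and sex
fitted above is still active: the odds ratio for a 10-year increase in age is
exp(10*b_age), i.e. the per-year odds ratio raised to the 10th power.
The 10-year step is only an illustration.***/
lincom 10*age, or
display "OR per 10 years of age = " exp(10*_b[age])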
logistic hip_p i.sex##c.age
estimates store mf

lrtest mr mf
* xi No: the interaction term i.sex#c.age is not significant (p=0.749)
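/***An alternative check, sketched under the assumption that mf holds the interaction model
stored above: a Wald test of the interaction term. It would usually lead to the same
conclusion as the likelihood-ratio test, although the p-values need not match exactly.***/
estimates restore mf
testparm i.sex#c.age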
logistic hip_p sex i.agegp
* xii Odds for a man aged 50-59 are 12.63 times the odds for a man aged less than 30

/*****Let's change the reference category***/

logistic hip_p sex ib90.agegp


logistic hip_p age sex 
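/***goodness-of-fit tests: estat gof alone gives the Pearson test over covariate patterns,***/
/***estat gof, group(10) gives the Hosmer-Lemeshow test. H0 in both cases: the model fits the data***/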
estat gof
* 4.1 Yes. However, this is not really appropriate, since there are so many covariate patterns. It would be better to use only 10 groups
estat gof, group(10)
* 4.2 In this case, there is evidence that the predicted and observed values differ by more than can be explained by random variation
lroc

logistic hip_p i.agegp sex 
estat gof
estat gof, group(10)
* 4.3 Yes, this model is adequate
lroc

gen age2 = age*age
logistic hip_p age age2 sex 
estat gof, group(10)
* 4.5 Yes, the coefficient for age2 is highly significant, and there is
* no longer evidence of lack of fit.
lroc

* 4.6 The area under the curve with this model is similar to that of the model
*     using age as a categorical predictor.
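/***A sketch of a formal comparison of the two areas, assuming hip_p is coded 0/1.
The variable names p_cat and p_quad are hypothetical. roccomp tests whether the
ROC areas of the two sets of predictions are equal. The quadratic model is refitted
last so that it remains the active model for the commands that follow.***/
quietly logistic hip_p i.agegp sex
predict p_cat
quietly logistic hip_p age age2 sex
predict p_quad
roccomp hip_p p_cat p_quad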
predict p
predict db, dbeta  /**dbeta is very similar to Cook's D in ordinary linear regression**/
scatter db p

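* 5.1 No, there are no points that are obvious outliers

/***collinearity check: age and age2 are expected to be strongly correlated, so large values for them are not surprising***/
vif, uncentered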

predict d, ddeviance
/***3.84 is the 5% critical value of a chi-squared distribution with 1 degree of freedom***/
scatter d p, yline(3.84)

* 5.2 There are 4 outliers
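/***A sketch for inspecting the points that lie above the 3.84 reference line,
assuming d and p were created by the predict commands above.***/
count if d > 3.84 & !missing(d)
list age sex hip_p p d if d > 3.84 & !missing(d)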
scatter p age

* 5.3 The two lines are the predicted prevalences in men and women
graph twoway scatter p age || lowess hip_p age if sex == 1 || lowess hip_p age if sex == 0

* 5.4 The fit is good for men, but poor for women over 80
* The quadratic model is reasonable for men, but not for women