Context of Finding G x E Interaction as Given by Caspi et al.
Multiple Regression Computing Project
Introduction
Each student is assigned an individual dataset, delivered as a single file. Each file contains one dependent variable and twenty independent variables. The values of the dependent variable are in the DV column. The values of the twenty independent variables are in the columns named E1 to E5 and G1 to G15. There are no missing values; the data file is complete and needs no further processing. This project is worth up to 150 points. Failure to use the correct dataset will lead to a grade of zero. Each dataset file is named with the student's course number and a .csv extension. The datasets will be posted in zip format on the class blackboard.
Background
The class blackboard has a pdf file of a paper by Caspi et al. that reports a finding of a gene-environment interaction. This paper used multiple regression techniques as the methodology for its findings. You should read it for background, as it is the genesis of the models that you will be given. The data that you are analyzing is synthetic. That is, the TA used a model to generate the data. Your task is to find the model that the TA used for your data. For example, one possible model is
The class blackboard also contains a paper by Risch et al. that uses a larger collection of data to assess the findings in Caspi et al. These researchers confirmed that Caspi et al. calculated their results correctly but that no other dataset had the relation reported in Caspi et al. That is, Caspi et al. seem to have reported a false positive (Type I error).
Report
The report that you submit should be no more than 2500 words with no more than 3 tables and 2 figures. It should include references (which do not count in the 2500 words). The report may have a technical appendix. The appendix could include your computer programs or describe your procedures for computation. You should include whatever additional material you feel is necessary to report your results in the technical appendix. There are no length restrictions on the appendix. A submission of only computer output without a report is not sufficient and will receive a grade of zero. Analyses that report an incorrect number of observations will also receive a grade of zero.
Your report should be in standard scientific report format. It should contain an introduction, a methods section, a results section, and a section with conclusions and discussion. You may add whatever other material you wish in a technical appendix. The introduction should contain the statement of your problem (namely, estimating the function that the TA used to generate your data). It should discuss the context of finding GxE interactions, as given by Caspi et al. and others. The methods section should discuss how you performed your statistical calculations, which independent variables you considered, and other methodological issues, such as how you dealt with interaction variables. The results section should contain an objective statement of your findings. That is, it should contain the statement of the model that your group proposes for the data, the analysis of variance table for this model, and other key summary results. The discussion and conclusion section should include the limitations of your procedures. The class blackboard has an editorial (by Cummings) that discusses reporting statistical information.
Guidelines for analysis
The first task for this problem is to use the statistical package of your choice to find the correlations between the independent variables and the dependent variable. Transformations of variables may be necessary. The Box-Cox transformation may find potentially nonlinear transformations of a dependent variable. After selecting the transformations of the dependent variable, you may use stepwise regression methods to select the important independent variables. The Lasso technique was helpful to many groups in past semesters. The TA will usually use at most two-way interactions of the independent variables (that is, terms like or ) in generating your data. There may also be non-linear environmental variables, such as or . The TA may well have used three factor interactions in the models for a few of the groups.
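As a rough illustration of the Box-Cox step described above, the following Python sketch picks the transformation parameter from a grid by maximizing the profile log-likelihood. It is a simplification under stated assumptions: the values must be strictly positive, and the likelihood is written for an intercept-only model, whereas in the project the residual variance from the regression on the chosen predictors would take the place of the plain variance. The function names and the toy data are hypothetical and are not part of the assignment.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform of strictly positive values y for a given lambda."""
    if any(v <= 0 for v in y):
        raise ValueError("Box-Cox requires strictly positive values")
    if abs(lam) < 1e-12:
        # The lambda -> 0 limit of (y**lam - 1) / lam is log(y).
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]


def profile_loglik(y, lam):
    """Profile log-likelihood of lambda (intercept-only normal model)."""
    n = len(y)
    z = box_cox(y, lam)
    mean_z = sum(z) / n
    var_z = sum((v - mean_z) ** 2 for v in z) / n
    # The second term is the log-Jacobian of the transformation.
    return -0.5 * n * math.log(var_z) + (lam - 1.0) * sum(math.log(v) for v in y)


def best_lambda(y, grid):
    """Return the grid value of lambda with the largest profile log-likelihood."""
    return max(grid, key=lambda lam: profile_loglik(y, lam))


# Toy data (assumption): values whose logarithms are symmetric about zero.
toy_y = [math.exp(v) for v in (-2, -1, 0, 1, 2)]
print(best_lambda(toy_y, [-1.0, -0.5, 0.0, 0.5, 1.0]))
```

A value of lambda near 1 suggests leaving the dependent variable untransformed, and a value near 0 suggests a log transformation.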
Hints
Chapter 12 and Chapter 13 in your text contain important information, especially Chapter 12. Also remember to consider multiple testing issues (as described in Chapter 9). The p-value for the variables that you select should be much smaller than 0.01. Remember that you have 5 environmental variables, 15 genes, 75 gene-environment variables, 105 gene-gene interaction variables, and a very large number of three gene interaction variables.
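The counts quoted above can be reproduced with a few lines of Python. The sketch also shows one common multiple-testing adjustment, the Bonferroni threshold. Bonferroni is used here as an assumption for illustration and may differ from the procedure described in Chapter 9 of the text.

```python
from math import comb

N_ENV = 5    # environmental variables E1 to E5
N_GENE = 15  # gene indicators G1 to G15

counts = {
    "main effects": N_ENV + N_GENE,
    "gene-environment": N_ENV * N_GENE,   # every (G, E) pair
    "gene-gene": comb(N_GENE, 2),         # unordered pairs of genes
    "three-gene": comb(N_GENE, 3),        # unordered triples of genes
}


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise level at alpha."""
    return alpha / n_tests


for name, value in counts.items():
    print(f"{name}: {value}")

# Candidate terms when screening main effects plus the two kinds of
# two-way interaction listed above.
n_two_way_screen = (counts["main effects"] + counts["gene-environment"]
                    + counts["gene-gene"])
print(bonferroni_threshold(0.01, n_two_way_screen))
```

With this many candidate terms, a nominal level of 0.01 applied to each test separately would be expected to admit several spurious terms, which is why the selected variables should have p-values far below 0.01.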
Your technical appendix may include:
(a) Your SAS or R script (If you are using SAS or R)
(b) Additional information that you want to report
(c) Any comments or suggestions
End of Project Assignment
https://blackboard(dot)stonybrook(dot)edu/bbcswebdav/pid-4636724-dt-content-rid-32930951_1/courses/1188-AMS-315-SEC01-89886/Project%202%20Handout%20Fall%202018.html
Multiple Regression Computing Project
Name of Student
Institution Affiliation
Introduction
The problem addressed in this report is to estimate the function that the TA used to generate the synthetic data provided. The context is the search for gene-environment (G x E) interactions, as reported by Caspi et al. (2003) and later reassessed by Risch et al. (2009).
Methods
The data file provided contains one dependent variable and twenty independent variables. The values of the dependent variable are in the DV_Y column. The values of the twenty independent variables are in the columns named E1 to E5 and G1 to G15. The variables E1 to E5 are continuous and positive and simulate “environmental” variables, while the variables G1 to G15 are indicator variables and simulate genes (Caspi et al., 2003). The data were loaded into IBM SPSS 24 for analysis. The environmental variables (E1 to E5) were converted from string type to numeric type and appear as Er1 to Er5 in the output. A summary of the variables is presented below:
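Descriptive Statistics

| Variable | N | Minimum | Maximum | Mean | Std. Deviation | Variance |
|---|---|---|---|---|---|---|
| DV_Y | 2000 | 832576.424481107 | 1342575.595692000 | 1001228.76874633670 | 78350.472416747610 | 6138796527.928 |
| G1 | 2000 | 0 | 1 | .52 | .500 | .250 |
| G2 | 2000 | 0 | 1 | .49 | .500 | .250 |
| G3 | 2000 | 0 | 1 | .49 | .500 | .250 |
| G4 | 2000 | 0 | 1 | .50 | .500 | .250 |
| G5 | 2000 | 0 | 1 | .49 | .500 | .250 |
| G6 | 2000 | 0 | 1 | .52 | .500 | .250 |
| G7 | 2000 | 0 | 1 | .47 | .499 | .249 |
| G8 | 2000 | 0 | 1 | .49 | .500 | .250 |
| G9 | 2000 | 0 | 1 | .51 | .500 | .250 |
| G10 | 2000 | 0 | 1 | .50 | .500 | .250 |
| G11 | 2000 | 0 | 1 | .50 | .500 | .250 |
| G12 | 2000 | 0 | 1 | .48 | .500 | .250 |
| G13 | 2000 | 0 | 1 | .50 | .500 | .250 |
| G14 | 2000 | 0 | 1 | .49 | .500 | .250 |
| G15 | 2000 | 0 | 1 | .48 | .500 | .250 |
| Er1 | 2000 | 1 | 2000 | 1000.50 | 577.495 | 333500.000 |
| Er2 | 2000 | 1 | 2000 | 1000.50 | 577.495 | 333500.000 |
| Er3 | 2000 | 1 | 2000 | 1000.50 | 577.495 | 333500.000 |
| Er4 | 2000 | 1 | 2000 | 1000.50 | 577.495 | 333500.000 |
| Er5 | 2000 | 1 | 2000 | 1000.50 | 577.495 | 333500.000 |
| Valid N (listwise) | 2000 | | | | | |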
Histograms of Y and each environmental variable
Due to the large size of the output, the histograms are in the output file.
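Correlation matrix of all variables
The correlation matrix (Appendix) is used to identify which independent variables are correlated with the dependent variable. The following variables are significantly correlated with Y at the 0.01 level: G11 (r = .310), G12 (r = .312), G14 (r = .307), Er1 (r = .203), Er3 (r = .102) and Er5 (r = .174). These correlations are weak to moderate in size.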
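The screening step described above rests on the Pearson correlation between each independent variable and the dependent variable. The following Python sketch shows that calculation and a simple ranking by absolute correlation. The analysis in this report was run in SPSS, so this code is only an illustration; the function names and the toy columns are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rank_by_abs_correlation(columns, dv):
    """Sort predictor names by |r| with the dependent variable, largest first."""
    scores = {name: pearson_r(values, dv) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy data (assumption): two made-up predictors and a made-up response.
toy_columns = {"A": [1, 2, 3, 4, 5], "B": [2, 1, 2, 1, 2]}
toy_dv = [1.1, 1.9, 3.2, 3.9, 5.1]
for name, r in rank_by_abs_correlation(toy_columns, toy_dv):
    print(name, round(r, 3))
```

Ranking by absolute value keeps strongly negative correlations from being overlooked.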
Scatterplot of Y versus each environmental variable
E1
There is a positive relationship between E1 and the dependent variable; the line of best fit indicates that the relationship is positive and linear.
E2
The line of best fit for E2 has a slightly negative slope, but the relationship with the dependent variable is negligible (r = -.001 in the correlation matrix).
E3
There is a positive relationship between E3 and the dependent variable; the line of best fit indicates that the relationship is positive and linear.
E4
The line of best fit for E4 has a slightly negative slope, but the relationship with the dependent variable is negligible (r = -.006 in the correlation matrix).
E5
There is a positive relationship between E5 and the dependent variable; the line of best fit indicates that the relationship is positive and linear.
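The line of best fit referred to in the scatterplot descriptions is the ordinary least squares line. As a minimal sketch (the plots in this report came from SPSS, and the function name and toy data here are hypothetical), its slope and intercept can be computed as follows:

```python
def fit_line(x, y):
    """Ordinary least squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data (assumption): points lying exactly on y = 1 + 2x.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```

The sign of the slope always matches the sign of the Pearson correlation, so a slope near zero goes with a correlation near zero.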
Stepwise regression analysis of Y using environmental and genetic variables
The regression of Y on all environmental and genetic variables, summarized in the coefficient table below, shows no evidence of a multicollinearity problem: every tolerance is at least .980 and every VIF is at most 1.021.
Coefficients (a)

| Model 1 | B (unstandardized) | Std. Error | Beta (standardized) | t | Sig. | 99.0% CI for B, Lower Bound | 99.0% CI for B, Upper Bound | Tolerance | VIF |
|---|---|---|---|---|---|---|---|---|---|
| (Constant) | 871953.996 | 7812.707 | | 111.607 | .000 | 851810.369 | 892097.622 | | |
| G1 | -1700.104 | 2827.172 | -.011 | -.601 | .548 | -8989.447 | 5589.240 | .987 | 1.013 |
| G2 | 2102.262 | 2820.115 | .013 | .745 | .456 | -5168.886 | 9373.410 | .991 | 1.009 |
| G3 | 128.705 | 2816.797 | .001 | .046 | .964 | -7133.888 | 7391.298 | .993 | 1.007 |
| G4 | 559.907 | 2827.943 | .004 | .198 | .843 | -6731.424 | 7851.238 | .985 | 1.015 |
| G5 | -4859.954 | 2820.414 | -.031 | -1.723 | .085 | -12131.873 | 2411.965 | .991 | 1.009 |
| G6 | 1598.390 | 2825.926 | .010 | .566 | .572 | -5687.740 | 8884.520 | .987 | 1.013 |
| G7 | 3590.690 | 2827.636 | .023 | 1.270 | .204 | -3699.849 | 10881.228 | .988 | 1.012 |
| G8 | -2968.610 | 2822.922 | -.019 | -1.052 | .293 | -10246.995 | 4309.774 | .989 | 1.011 |
| G9 | 890.087 | 2820.078 | .006 | .316 | .752 | -6380.965 | 8161.139 | .991 | 1.009 |
| G10 | 129.471 | 2819.672 | .001 | .046 | .963 | -7140.535 | 7399.477 | .991 | 1.009 |
| G11 | 47984.299 | 2825.403 | .306 | 16.983 | .000 | 40699.519 | 55269.080 | .987 | 1.013 |
| G12 | 46647.919 | 2816.717 | .298 | 16.561 | .000 | 39385.531 | 53910.306 | .994 | 1.006 |
| G13 | -4124.674 | 2816.837 | -.026 | -1.464 | .143 | -11387.369 | 3138.021 | .993 | 1.007 |
| G14 | 47814.504 | 2813.117 | .305 | 16.997 | .000 | 40561.400 | 55067.608 | .996 | 1.004 |
| G15 | -2586.750 | 2827.797 | -.017 | -.915 | .360 | -9877.703 | 4704.204 | .986 | 1.014 |
| Er1 | 27.987 | 2.441 | .206 | 11.465 | .000 | 21.693 | 34.281 | .991 | 1.009 |
| Er2 | .524 | 2.456 | .004 | .213 | .831 | -5.808 | 6.856 | .980 | 1.021 |
| Er3 | 11.917 | 2.441 | .088 | 4.882 | .000 | 5.623 | 18.211 | .991 | 1.009 |
| Er4 | .503 | 2.454 | .004 | .205 | .838 | -5.824 | 6.831 | .981 | 1.019 |
| Er5 | 21.925 | 2.453 | .162 | 8.939 | .000 | 15.601 | 28.249 | .982 | 1.018 |

a. Dependent Variable: DV_Y
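The two collinearity columns of the coefficient table carry the same information, because the variance inflation factor is the reciprocal of the tolerance. The short Python check below illustrates this with three tolerance values taken from the table. The function name and the cutoff of 10 are assumptions for illustration; the cutoff is a widely used rule of thumb and not a value from the assignment.

```python
def vif_from_tolerance(tolerance):
    """Variance inflation factor, defined as 1 / tolerance."""
    if not 0.0 < tolerance <= 1.0:
        raise ValueError("tolerance must lie in (0, 1]")
    return 1.0 / tolerance


def flag_collinear(tolerances, vif_cutoff=10.0):
    """Return the predictors whose VIF exceeds the (assumed) cutoff."""
    return [name for name, tol in tolerances.items()
            if vif_from_tolerance(tol) > vif_cutoff]


# Tolerance values copied from the coefficient table for three predictors.
table_tolerances = {"G1": 0.987, "G12": 0.994, "G14": 0.996}
for name, tol in table_tolerances.items():
    print(name, round(vif_from_tolerance(tol), 3))

print(flag_collinear(table_tolerances))
```

Because the table prints tolerances rounded to three decimals, a VIF recomputed this way can differ from the printed VIF in the last digit.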
Results
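Stepwise selection retains six predictors, entered in the order G12, G11, G14, Er1, Er5 and Er3. The analysis of variance table for each step is as follows: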
ANOVA (a)

| Model | Source | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 1190974179621.814 | 1 | 1190974179621.814 | 214.753 | .000 (b) |
| 1 | Residual | 11080480079705.305 | 1998 | 5545785825.678 | | |
| 1 | Total | 12271454259327.120 | 1999 | | | |
| 2 | Regression | 2325401149199.383 | 2 | 1162700574599.691 | 233.451 | .000 (c) |
| 2 | Residual | 9946053110127.736 | 1997 | 4980497301.015 | | |
| 2 | Total | 12271454259327.120 | 1999 | | | |
| 3 | Regression | 3452587723000.034 | 3 | 1150862574333.345 | 260.478 | .000 (d) |
| 3 | Residual | 8818866536327.084 | 1996 | 4418269807.779 | | |
| 3 | Total | 12271454259327.117 | 1999 | | | |
| 4 | Regression | 4006687166913.757 | 4 | 1001671791728.439 | 241.790 | .000 (e) |
| 4 | Residual | 8264767092413.361 | 1995 | 4142740397.200 | | |
| 4 | Total | 12271454259327.120 | 1999 | | | |
| 5 | Regression | 4338979782662.180 | 5 | 867795956532.436 | 218.139 | .000 (f) |
| 5 | Residual | 7932474476664.939 | 1994 | 3978171753.593 | | |
| 5 | Total | 12271454259327.120 | 1999 | | | |
| 6 | Regression | 4437608558489.150 | 6 | 739601426414.858 | 188.161 | .000 (g) |
| 6 | Residual | 7833845700837.969 | 1993 | 3930680231.228 | | |
| 6 | Total | 12271454259327.120 | 1999 | | | |

a. Dependent Variable: DV_Y
b. Predictors: (Constant), G12
c. Predictors: (Constant), G12, G11
d. Predictors: (Constant), G12, G11, G14
e. Predictors: (Constant), G12, G11, G14, Er1
f. Predictors: (Constant), G12, G11, G14, Er1, Er5
g. Predictors: (Constant), G12, G11, G14, Er1, Er5, Er3
The resulting model is described by the variables in the coefficient table (Appendix). The final step (model 6) retains G12, G11, G14, Er1, Er5 and Er3.
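The key summary statistics of the final model can be recomputed from the sums of squares in the ANOVA table. The Python sketch below does this for model 6, using the regression and residual sums of squares and degrees of freedom exactly as printed in the table. The function name is hypothetical.

```python
def anova_summary(ss_regression, ss_residual, df_regression, df_residual):
    """Return (R squared, F statistic) from an ANOVA decomposition."""
    ss_total = ss_regression + ss_residual
    r_squared = ss_regression / ss_total
    mean_square_regression = ss_regression / df_regression
    mean_square_residual = ss_residual / df_residual
    f_statistic = mean_square_regression / mean_square_residual
    return r_squared, f_statistic


# Model 6 values copied from the ANOVA table.
r_squared, f_statistic = anova_summary(
    ss_regression=4437608558489.150,
    ss_residual=7833845700837.969,
    df_regression=6,
    df_residual=1993,
)
print(f"R^2 = {r_squared:.3f}, F = {f_statistic:.3f}")
```

The recomputed F statistic agrees with the value printed for model 6, and the ratio of the regression to the total sum of squares gives the proportion of variance in DV_Y that the six predictors explain.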
Conclusions and Discussion
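The stepwise method excludes fourteen of the twenty candidate variables and, as applied here, considers main effects only, so any interaction terms in the generating model are not captured. The final model accounts for about 36% of the variance in DV_Y (regression sum of squares divided by total sum of squares in the ANOVA table), so much of the relationship between the independent variables and the dependent variable remains unexplained.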
References
Caspi, A., Sugden, K., Moffitt, T. E., et al. (2003). Influence of Life Stress on Depression: Moderation by a Polymorphism in the 5-HTT Gene. Science, 301(5631), pp.386-389.
Risch, N., Herrell, R., Lehner, T., Liang, K., Eaves, L., Hoh, J., Griem, A., Kovacs, M., Ott, J. and Merikangas, K. (2009). Interaction Between the Serotonin Transporter Gene (5-HTTLPR), Stressful Life Events, and Risk of Depression. JAMA, 301(23), p.2462.
Appendix
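Correlations

| Variable | Statistic | DV_Y | G1 | G2 | G3 | G4 | G5 | G6 | G7 | G8 | G9 | G10 | G11 | G12 | G13 | G14 | G15 | Er1 | Er2 | Er3 | Er4 | Er5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DV_Y | Pearson Correlation | 1 | -.025 | .015 | -.010 | -.001 | -.049* | .029 | .030 | -.025 | .005 | -.001 | .310** | .312** | -.039 | .307** | -.006 | .203** | -.001 | .102** | -.006 | .174** |
| DV_Y | Sig. (2-tailed) | | .264 | .503 | .643 | .966 | .030 | .193 | .175 | .263 | .810 | .956 | .000 | .000 | .082 | .000 | .782 | .000 | .972 | .000 | .784 | .000 |
| DV_Y | N | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 |
| G1 | Pearson Correlation | -.025 | 1 | .003 | .045* | -.018 | .020 | .036 | .016 | .018 | .019 | .000 | -.050* | -.008 | .013 | -.021 | .020 | .039 | .021 | .002 | -.039 | .016 |
| G1 | Sig. (2-tailed) | .264 | | .901 | .045 | .412 | .377 | .108 | .477 | .408 | .407 | .988 | .025 | .731 | .556 | .342 | .365 | .083 | .353 | .920 | .080 | .461 |
| G1 | N | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 |
| G2 | Pearson Correlation | .015 | .003 | 1 | .006 | -.064** | .035 | .016 | .013 | .014 | .009 | .012 | -.014 | .014 | .009 | -.008 | -.032 | .002 | .028 | .003 | .005 | .026 |
| G2 | Sig. (2-tailed) | .503 | .901 | | .802 | .004 | .122 | .485 | .563 | .539 | .679 | .586 | .534 | .524 | .690 | .706 | .156 | .938 | .209 | .897 | .823 | .245 |
| G2 | N | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 |
| G3 | Pearson Correlation | -.010 | .045* | .006 | 1 | -.036 | .013 | .002 | -.013 | -.026 | .005 | -.034 | -.024 | -.014 | .007 | -.008 | .014 | .013 | .006 | .007 | .008 | .003 |
| G3 | Sig. (2-tailed) | .643 | .045 | .802 | | .109 | .573 | .944 | .562 | .241 | .815 | .130 | .285 | .540 | .757 | .707 | .521 | .567 | .792 | .738 | .709 | .880 |
| G3 | N | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 |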
G4
Pearson Correlation
-.001
-.018
-.064**
-.036
1
-.009
.019
-.012
-.010
.001
-.040
.010
.003
.027
-.006
-.025
.008
-.012
.026
.027
-.063**
Sig. (2-tailed)
.966
.412
.004
.109
.693
.402
.605
.659
.968
.073
.656
.882
.227
.795
.269
.709
.600
.245
.221
<...