Context of Finding G x E Interaction as Given by Caspi et al.

Statistics Project Instructions:

Multiple Regression Computing Project
Introduction
Each student is assigned to an individual database, with a single file containing the data. Each file contains one dependent variable and twenty independent variables. The values of the dependent variable are in the DV column. The values of the twenty independent variables are in the columns with names of E1 to E5 and G1 to G15. There are no missing values, and the data file is complete and needs no further processing. This project is worth up to 150 points. Failure to use the correct dataset will lead to a grade of zero. The data sets are named by the student course number.csv. The datasets will be posted in a zip format on the class blackboard.
Background
The class blackboard has a pdf file of a paper by Caspi et al. that reports a finding of a gene-environment interaction. This paper used multiple regression techniques as the methodology for its findings. You should read it for background, as it is the genesis of the models that you will be given. The data that you are analyzing is synthetic. That is, the TA used a model to generate the data. Your task is to find the model that the TA used for your data. For example, one possible model is
The class blackboard also contains a paper by Risch et al. that uses a larger collection of data to assess the findings in Caspi et al. These researchers confirmed that Caspi et al. calculated their results correctly but that no other dataset had the relation reported in Caspi et al. That is, Caspi et al. seem to have reported a false positive (Type I error).
Report
The report that you submit should be no more than 2500 words with no more than 3 tables and 2 figures. It should include references (which do not count in the 2500 words). The report may have a technical appendix. The appendix could include your computer programs or describe your procedures for computation. You should include whatever additional material you feel is necessary to report your results in the technical appendix. There are no length restrictions on the appendix. A submission of only computer output without a report is not sufficient and will receive a grade of zero. Analyses that report an incorrect number of observations will also receive a grade of zero.
Your report should be in standard scientific report format. It should contain an introduction, methods section, results section, and a section with conclusions and discussion. You may add whatever other material you wish in a technical appendix. The introduction should contain the statement of your problem (namely estimating the function that the TA used to generate your data). It should discuss the context of finding GxE interactions, as given by Caspi et al. and others. The methods section should discuss how you performed your statistical calculations, what independent variables that you considered, and other methodological issues, such as how you dealt with interaction variables. The results section should contain an objective statement of your findings. That is, it should contain the statement of the model that your group proposes for the data, the analysis of variance table for this model, and other key summary results. The discussion and conclusion section should include the limitations of your procedures. The class blackboard has an editorial (by Cummings) that discusses reporting statistical information.
Guidelines for analysis
The first task for this problem is to use the statistical package of your choice to find the correlations between the independent variables and the dependent variable. Transformations of variables may be necessary. The Box-Cox transformation may find potentially nonlinear transformations of a dependent variable. After selecting the transformations of the dependent variable, you may use stepwise regression methods to select the important independent variables. The Lasso technique was helpful to many groups in past semesters. The TA will usually use at most two-way interactions of the independent variables (that is, terms like or ) in generating your data. There may also be non-linear environmental variables, such as or . The TA may well have used three factor interactions in the models for a few of the groups.
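As a minimal sketch of the Box-Cox step mentioned above, the following Python code scans a grid of lambda values and keeps the one with the largest profile log-likelihood. The function names (`boxcox`, `boxcox_loglik`, `best_lambda`) and the synthetic response are assumptions made for illustration; they are not part of the assignment, and real values would come from the DV column.

```python
import math
from statistics import NormalDist

def boxcox(y, lam):
    """Box-Cox transform of strictly positive values y for a given lambda."""
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]

def boxcox_loglik(y, lam):
    """Profile log-likelihood of lambda, up to an additive constant."""
    n = len(y)
    z = boxcox(y, lam)
    mean_z = sum(z) / n
    var_z = sum((v - mean_z) ** 2 for v in z) / n  # maximum-likelihood variance
    return -0.5 * n * math.log(var_z) + (lam - 1.0) * sum(math.log(v) for v in y)

def best_lambda(y, grid):
    """Return the grid value of lambda with the largest profile log-likelihood."""
    return max(grid, key=lambda lam: boxcox_loglik(y, lam))

# Hypothetical response: exact log-normal quantiles, so the log transform
# (lambda = 0) is expected to be selected.
n = 200
y = [math.exp(NormalDist().inv_cdf((i + 0.5) / n)) for i in range(n)]
grid = [k / 10 for k in range(-20, 21)]  # -2.0, -1.9, ..., 2.0
lam_hat = best_lambda(y, grid)
print("selected lambda:", lam_hat)
```

In practice the selected lambda is usually rounded to a convenient value (for example 0 for a log or 0.5 for a square root) before the regression is refitted on the transformed response.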
Hints
Chapter 12 and Chapter 13 in your text contain important information, especially Chapter 12. Also remember to consider multiple testing issues (as described in Chapter 9). The p-value for the variables that you select should be much smaller than 0.01. Remember that you have 5 environmental variables, 15 genes, 75 gene-environment variables, 105 gene-gene interaction variables, and a very large number of three gene interaction variables.
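The counts in the hint above, and their consequence for multiple testing, can be checked with a short Python sketch. The family-wise error rate of 0.05 and the Bonferroni correction are assumptions chosen for illustration; the handout only says that p-values should be much smaller than 0.01. Environment-environment and non-linear terms are left out of the count for simplicity.

```python
from math import comb

n_env, n_gene = 5, 15  # E1 to E5 and G1 to G15

counts = {
    "environment main effects": n_env,
    "gene main effects": n_gene,
    "gene-environment interactions": n_gene * n_env,
    "gene-gene interactions": comb(n_gene, 2),
    "three-gene interactions": comb(n_gene, 3),
}
total = sum(counts.values())

alpha = 0.05                 # assumed family-wise error rate
bonferroni = alpha / total   # per-test threshold under a Bonferroni correction

for name, k in counts.items():
    print(f"{name}: {k}")
print(f"candidate terms: {total}, per-test threshold: {bonferroni:.2e}")
```

Under these assumptions there are several hundred candidate terms, so the per-test threshold falls well below 0.01, which is consistent with the hint.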
Your technical appendix may include:
(a) Your SAS or R script (If you are using SAS or R)
(b) Additional information that you want to report
(c) Any comments or suggestions
End of Project Assignment
https://blackboard(dot)stonybrook(dot)edu/bbcswebdav/pid-4636724-dt-content-rid-32930951_1/courses/1188-AMS-315-SEC01-89886/Project%202%20Handout%20Fall%202018.html

Statistics Project Sample Content Preview:

Multiple Regression Computing Project
Name of Student
Institution Affiliation
Multiple Regression Computing Project
Introduction
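The problem is to estimate the function that the TA used to generate the data provided, allowing for gene-environment (G x E) interaction terms. The context for finding G x E interactions is given by Caspi et al. (2003), who reported a gene-environment interaction using multiple regression techniques, and by Risch et al. (2009), who assessed that finding on a larger collection of data and did not find the reported relation in any other dataset.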
Methods
The data file provided contains one dependent variable and twenty independent variables. The values of the dependent variable are in the DV_Y column. The values of the twenty independent variables are in the columns named E1 to E5 and G1 to G15. The variables E1 to E5 are continuous and positive and simulate "environmental" variables, while G1 to G15 are indicator variables and simulate genes (Caspi et al., 2003). The data were loaded into IBM SPSS 24 for analysis. The environmental variables (E1 to E5) were converted from string type to numeric type and appear as Er1 to Er5 in the output tables. A summary of the variables is presented below:
Descriptive Statistics

Variable | N | Minimum | Maximum | Mean | Std. Deviation | Variance
DV_Y | 2000 | 832576.424481107 | 1342575.595692000 | 1001228.76874633670 | 78350.472416747610 | 6138796527.928
G1 | 2000 | 0 | 1 | .52 | .500 | .250
G2 | 2000 | 0 | 1 | .49 | .500 | .250
G3 | 2000 | 0 | 1 | .49 | .500 | .250
G4 | 2000 | 0 | 1 | .50 | .500 | .250
G5 | 2000 | 0 | 1 | .49 | .500 | .250
G6 | 2000 | 0 | 1 | .52 | .500 | .250
G7 | 2000 | 0 | 1 | .47 | .499 | .249
G8 | 2000 | 0 | 1 | .49 | .500 | .250
G9 | 2000 | 0 | 1 | .51 | .500 | .250
G10 | 2000 | 0 | 1 | .50 | .500 | .250
G11 | 2000 | 0 | 1 | .50 | .500 | .250
G12 | 2000 | 0 | 1 | .48 | .500 | .250
G13 | 2000 | 0 | 1 | .50 | .500 | .250
G14 | 2000 | 0 | 1 | .49 | .500 | .250
G15 | 2000 | 0 | 1 | .48 | .500 | .250
Er1 | 2000 | 1 | 2000 | 1000.50 | 577.495 | 333500.000
Er2 | 2000 | 1 | 2000 | 1000.50 | 577.495 | 333500.000
Er3 | 2000 | 1 | 2000 | 1000.50 | 577.495 | 333500.000
Er4 | 2000 | 1 | 2000 | 1000.50 | 577.495 | 333500.000
Er5 | 2000 | 1 | 2000 | 1000.50 | 577.495 | 333500.000
Valid N (listwise) | 2000 | | | | |
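The summary statistics for Er1 to Er5 are identical: minimum 1, maximum 2000, mean 1000.50, and variance 333500. One possible explanation, offered here as an assumption rather than something the report states, is that the string-to-numeric conversion assigned consecutive integer codes 1 to 2000 instead of keeping the original measurements. The Python sketch below checks that such codes would reproduce the tabled values.

```python
from statistics import mean, stdev, variance

# Hypothetical recoded values: the consecutive integers 1, 2, ..., 2000
codes = list(range(1, 2001))

code_mean = mean(codes)           # compare with Mean = 1000.50
code_var = variance(codes)        # sample variance; compare with 333500.000
code_sd = round(stdev(codes), 3)  # compare with Std. Deviation = 577.495

print(min(codes), max(codes), code_mean, code_sd, code_var)
```

If this is what happened, Er1 to Er5 carry only the rank order of the environmental variables, and any analysis that needs the original scale (for example a test for non-linear environmental terms) would have to use a direct numeric conversion of E1 to E5 instead.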

Histograms of Y and each environmental variable
Due to the large size of the output, the histograms are in the output file.
Correlation matrix of all variables
The correlation matrix is used to identify the correlation between the independent variables and the dependent variable. The following variables show weak to moderate but statistically significant correlations with Y (Sig. = .000 in each case): G11 (r = .310), G12 (r = .312), G14 (r = .307), E1 (r = .203), E3 (r = .102), and E5 (r = .174). The matrix is given in the Appendix.
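To make the correlation screening concrete, here is a minimal Python sketch of the Pearson correlation and of the t statistic that tests whether a correlation differs from zero. The helper names and the toy data are assumptions for illustration; only the values r = .312 and N = 2000 are taken from the report.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for H0: correlation = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Toy data: an indicator "gene" and a continuous response (hypothetical values)
gene = [0, 0, 1, 1, 0, 1]
dv = [1.0, 2.0, 3.0, 4.0, 2.5, 3.5]
r_toy = pearson_r(gene, dv)

# Reported correlation between G12 and DV_Y with N = 2000
t_g12 = t_statistic(0.312, 2000)
print(round(r_toy, 3), round(t_g12, 2))
```

With 2000 observations even a correlation near .31 gives a t statistic of about 14.7, which is why the reported significance values print as .000.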
Scatterplot of Y versus each environmental variable
E1
There is a positive relationship between E1 and the dependent variable; the line of best fit indicates that the relationship is positive and linear.
E2
There is no clear relationship between E2 and the dependent variable; the line of best fit has a slightly negative slope, and the correlation is negligible (r = -.001, Sig. = .972).
E3
There is a positive relationship between E3 and the dependent variable; the line of best fit indicates that the relationship is positive and linear.
E4
There is no clear relationship between E4 and the dependent variable; the line of best fit has a slightly negative slope, and the correlation is negligible (r = -.006, Sig. = .784).
E5
There is a positive relationship between E5 and the dependent variable; the line of best fit indicates that the relationship is positive and linear.
Stepwise regression analysis of Y using environmental and genetic variables
The regression of Y on all environmental and genetic variables was checked for multicollinearity. The collinearity statistics do not detect a problem: tolerance ranges from .980 to .996 and VIF from 1.004 to 1.021, all close to 1.
Coefficients (a)

Model 1 | Unstandardized B | Std. Error | Standardized Beta | t | Sig. | 99.0% CI for B, Lower Bound | 99.0% CI for B, Upper Bound | Tolerance | VIF
(Constant) | 871953.996 | 7812.707 | | 111.607 | .000 | 851810.369 | 892097.622 | |
G1 | -1700.104 | 2827.172 | -.011 | -.601 | .548 | -8989.447 | 5589.240 | .987 | 1.013
G2 | 2102.262 | 2820.115 | .013 | .745 | .456 | -5168.886 | 9373.410 | .991 | 1.009
G3 | 128.705 | 2816.797 | .001 | .046 | .964 | -7133.888 | 7391.298 | .993 | 1.007
G4 | 559.907 | 2827.943 | .004 | .198 | .843 | -6731.424 | 7851.238 | .985 | 1.015
G5 | -4859.954 | 2820.414 | -.031 | -1.723 | .085 | -12131.873 | 2411.965 | .991 | 1.009
G6 | 1598.390 | 2825.926 | .010 | .566 | .572 | -5687.740 | 8884.520 | .987 | 1.013
G7 | 3590.690 | 2827.636 | .023 | 1.270 | .204 | -3699.849 | 10881.228 | .988 | 1.012
G8 | -2968.610 | 2822.922 | -.019 | -1.052 | .293 | -10246.995 | 4309.774 | .989 | 1.011
G9 | 890.087 | 2820.078 | .006 | .316 | .752 | -6380.965 | 8161.139 | .991 | 1.009
G10 | 129.471 | 2819.672 | .001 | .046 | .963 | -7140.535 | 7399.477 | .991 | 1.009
G11 | 47984.299 | 2825.403 | .306 | 16.983 | .000 | 40699.519 | 55269.080 | .987 | 1.013
G12 | 46647.919 | 2816.717 | .298 | 16.561 | .000 | 39385.531 | 53910.306 | .994 | 1.006
G13 | -4124.674 | 2816.837 | -.026 | -1.464 | .143 | -11387.369 | 3138.021 | .993 | 1.007
G14 | 47814.504 | 2813.117 | .305 | 16.997 | .000 | 40561.400 | 55067.608 | .996 | 1.004
G15 | -2586.750 | 2827.797 | -.017 | -.915 | .360 | -9877.703 | 4704.204 | .986 | 1.014
Er1 | 27.987 | 2.441 | .206 | 11.465 | .000 | 21.693 | 34.281 | .991 | 1.009
Er2 | .524 | 2.456 | .004 | .213 | .831 | -5.808 | 6.856 | .980 | 1.021
Er3 | 11.917 | 2.441 | .088 | 4.882 | .000 | 5.623 | 18.211 | .991 | 1.009
Er4 | .503 | 2.454 | .004 | .205 | .838 | -5.824 | 6.831 | .981 | 1.019
Er5 | 21.925 | 2.453 | .162 | 8.939 | .000 | 15.601 | 28.249 | .982 | 1.018

a. Dependent Variable: DV_Y
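Two columns of the coefficient table follow directly from others: the t statistic is the coefficient divided by its standard error, and the VIF is the reciprocal of the tolerance. The Python sketch below recomputes both for a few rows copied from the table. The choice of rows and the variable names are illustrative assumptions, and small differences from the printed values are expected because the table entries are rounded.

```python
# (B, Std. Error, reported t, Tolerance, reported VIF) copied from the table
rows = {
    "G11": (47984.299, 2825.403, 16.983, 0.987, 1.013),
    "G12": (46647.919, 2816.717, 16.561, 0.994, 1.006),
    "G14": (47814.504, 2813.117, 16.997, 0.996, 1.004),
    "Er1": (27.987, 2.441, 11.465, 0.991, 1.009),
    "Er2": (0.524, 2.456, 0.213, 0.980, 1.021),
}

recomputed = {}
for name, (b, se, t_reported, tol, vif_reported) in rows.items():
    t = b / se          # t statistic = coefficient / standard error
    vif = 1.0 / tol     # variance inflation factor = 1 / tolerance
    recomputed[name] = (t, vif)
    print(f"{name}: t = {t:.3f} (reported {t_reported}), "
          f"VIF = {vif:.3f} (reported {vif_reported})")
```

A VIF near 1 means a predictor is almost uncorrelated with the other predictors, so its standard error is barely inflated by collinearity.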

Results
Stepwise regression selects the following sequence of models:
ANOVA (a)

Model | Source | Sum of Squares | df | Mean Square | F | Sig.
1 | Regression | 1190974179621.814 | 1 | 1190974179621.814 | 214.753 | .000 (b)
1 | Residual | 11080480079705.305 | 1998 | 5545785825.678 | |
1 | Total | 12271454259327.120 | 1999 | | |
2 | Regression | 2325401149199.383 | 2 | 1162700574599.691 | 233.451 | .000 (c)
2 | Residual | 9946053110127.736 | 1997 | 4980497301.015 | |
2 | Total | 12271454259327.120 | 1999 | | |
3 | Regression | 3452587723000.034 | 3 | 1150862574333.345 | 260.478 | .000 (d)
3 | Residual | 8818866536327.084 | 1996 | 4418269807.779 | |
3 | Total | 12271454259327.117 | 1999 | | |
4 | Regression | 4006687166913.757 | 4 | 1001671791728.439 | 241.790 | .000 (e)
4 | Residual | 8264767092413.361 | 1995 | 4142740397.200 | |
4 | Total | 12271454259327.120 | 1999 | | |
5 | Regression | 4338979782662.180 | 5 | 867795956532.436 | 218.139 | .000 (f)
5 | Residual | 7932474476664.939 | 1994 | 3978171753.593 | |
5 | Total | 12271454259327.120 | 1999 | | |
6 | Regression | 4437608558489.150 | 6 | 739601426414.858 | 188.161 | .000 (g)
6 | Residual | 7833845700837.969 | 1993 | 3930680231.228 | |
6 | Total | 12271454259327.120 | 1999 | | |

a. Dependent Variable: DV_Y
b. Predictors: (Constant), G12
c. Predictors: (Constant), G12, G11
d. Predictors: (Constant), G12, G11, G14
e. Predictors: (Constant), G12, G11, G14, Er1
f. Predictors: (Constant), G12, G11, G14, Er1, Er5
g. Predictors: (Constant), G12, G11, G14, Er1, Er5, Er3
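The ANOVA table does not print R-squared, but it can be recovered from the sums of squares, and the F statistics can be re-derived as a check. The Python sketch below does this for the six stepwise models. The numbers are copied from the table; the dictionary layout and variable names are assumptions for illustration.

```python
# model: (regression SS, residual SS, regression df, residual df, reported F)
models = {
    1: (1190974179621.814, 11080480079705.305, 1, 1998, 214.753),
    2: (2325401149199.383, 9946053110127.736, 2, 1997, 233.451),
    3: (3452587723000.034, 8818866536327.084, 3, 1996, 260.478),
    4: (4006687166913.757, 8264767092413.361, 4, 1995, 241.790),
    5: (4338979782662.180, 7932474476664.939, 5, 1994, 218.139),
    6: (4437608558489.150, 7833845700837.969, 6, 1993, 188.161),
}

r_squared = {}
f_stat = {}
for m, (ssr, sse, df_reg, df_res, f_reported) in models.items():
    r_squared[m] = ssr / (ssr + sse)              # share of variance explained
    f_stat[m] = (ssr / df_reg) / (sse / df_res)   # mean square ratio
    print(f"model {m}: R^2 = {r_squared[m]:.4f}, "
          f"F = {f_stat[m]:.3f} (reported {f_reported})")
```

R-squared rises with each added predictor and reaches roughly 0.36 for Model 6, so the six main effects leave most of the variance in DV_Y unexplained.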

The resulting model (Model 6) includes the predictors G12, G11, G14, Er1, Er5, and Er3; the coefficients are reported in the coefficient table (Appendix).
Conclusions and Discussion
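A limitation of the stepwise method used to generate the model is that it excludes some variables that may still carry substantial information, and the analysis reported here considers main effects only. The relationship between the independent variables and the dependent variable therefore remains unclear.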
References
Caspi, A., et al. (2003). Influence of Life Stress on Depression: Moderation by a Polymorphism in the 5-HTT Gene. Science, 301(5631), 386-389.
Risch, N., Herrell, R., Lehner, T., Liang, K., Eaves, L., Hoh, J., Griem, A., Kovacs, M., Ott, J., & Merikangas, K. (2009). Interaction Between the Serotonin Transporter Gene (5-HTTLPR), Stressful Life Events, and Risk of Depression. JAMA, 301(23), 2462.
Appendix
Correlations

N = 2000 for every pair of variables shown.

Variable | Statistic | DV_Y | G1 | G2 | G3 | G4 | G5 | G6 | G7 | G8 | G9 | G10 | G11 | G12 | G13 | G14 | G15 | Er1 | Er2 | Er3 | Er4 | Er5
DV_Y | Pearson Correlation | 1 | -.025 | .015 | -.010 | -.001 | -.049* | .029 | .030 | -.025 | .005 | -.001 | .310** | .312** | -.039 | .307** | -.006 | .203** | -.001 | .102** | -.006 | .174**
DV_Y | Sig. (2-tailed) | | .264 | .503 | .643 | .966 | .030 | .193 | .175 | .263 | .810 | .956 | .000 | .000 | .082 | .000 | .782 | .000 | .972 | .000 | .784 | .000
G1 | Pearson Correlation | -.025 | 1 | .003 | .045* | -.018 | .020 | .036 | .016 | .018 | .019 | .000 | -.050* | -.008 | .013 | -.021 | .020 | .039 | .021 | .002 | -.039 | .016
G1 | Sig. (2-tailed) | .264 | | .901 | .045 | .412 | .377 | .108 | .477 | .408 | .407 | .988 | .025 | .731 | .556 | .342 | .365 | .083 | .353 | .920 | .080 | .461
G2 | Pearson Correlation | .015 | .003 | 1 | .006 | -.064** | .035 | .016 | .013 | .014 | .009 | .012 | -.014 | .014 | .009 | -.008 | -.032 | .002 | .028 | .003 | .005 | .026
G2 | Sig. (2-tailed) | .503 | .901 | | .802 | .004 | .122 | .485 | .563 | .539 | .679 | .586 | .534 | .524 | .690 | .706 | .156 | .938 | .209 | .897 | .823 | .245
G3 | Pearson Correlation | -.010 | .045* | .006 | 1 | -.036 | .013 | .002 | -.013 | -.026 | .005 | -.034 | -.024 | -.014 | .007 | -.008 | .014 | .013 | .006 | .007 | .008 | .003
G3 | Sig. (2-tailed) | .643 | .045 | .802 | | .109 | .573 | .944 | .562 | .241 | .815 | .130 | .285 | .540 | .757 | .707 | .521 | .567 | .792 | .738 | .709 | .880
G4 | Pearson Correlation | -.001 | -.018 | -.064** | -.036 | 1 | -.009 | .019 | -.012 | -.010 | .001 | -.040 | .010 | .003 | .027 | -.006 | -.025 | .008 | -.012 | .026 | .027 | -.063**
G4 | Sig. (2-tailed) | .966 | .412 | .004 | .109 | | .693 | .402 | .605 | .659 | .968 | .073 | .656 | .882 | .227 | .795 | .269 | .709 | .600 | .245 | .221 | <...

Updated on January 26, 2024
