Differential Analysis Using R’s edgeR: Economics Statistics Project

Statistics Project Instructions:

Follow the instructions. Use R code to carry out the analysis and summarize the results in words. Statistical output from the R code must be attached to support the conclusions in the paper. All of the material, including the progress made so far, is in the uploaded file.



This is group work. The full report may be at most 20 pages, but I only need you to write a 7-page report that follows the instructions.



The data is in the other PDF, which is a public online resource, so you can access it.

Let me know if you cannot find the right one.

 

1 Project Detail

Your final project report should be at most 20 pages. Your write-up should contain the following sections:
• Introduction: It must contain the following:
  – Describe the dataset.
  – Identify the problem of interest: choose a data set, describe the data set and identify the problem you are interested in.
• Methodology:
  – Describe the software you intend to use.
  – Describe in detail the methods you have chosen.
  – Does your data have missing values? Describe how you treat missing data and why.
• Results and Discussion:
  – All the results (figures, tables, etc.) go in this section.
  – Discuss your findings and what the results mean.
  – All your tables and figures need to be labelled properly.
• Conclusion:
  – What conclusion do you draw from your analysis?
• References:
  – List of references that you have cited in your work.
• Appendix:
  – Any additional information you want to add.
• Individual Contribution:
  – In addition to the final report, each student must submit a page summarizing what their contribution to the project was.

A few things to consider in your report, as they will be used for evaluation:

Criteria
• Information is presented in a logical sequence.
• Complexity and appropriateness of the analysis for the class.
• Provides introduction to dataset and problem.
• Provides introduction to statistical methods and software packages.
• Technical terms well-defined.
• The figures and tables are well labelled.
• There is an obvious conclusion from the study.
• References are cited appropriately.
• Report is well prepared and readable.

VERY IMPORTANT: You will be penalized for grammatical errors.

Statistics Project Sample Content Preview:
Student’s Name
Professor’s Name
Course
Date
Differential Analysis Using R’s edgeR
Introduction
The datasets
Two datasets were used in this analysis. Each was analyzed in depth independently, and the results were then compared. The first dataset was obtained from The Cancer Genome Atlas (TCGA) and consisted of lung adenocarcinoma gene expression data. The mRNAseq preprocessor picked the “scaled estimate” value from the Illumina HiSeq/GA2 mRNAseq level_3 (v2) dataset and built a log2-transformed mRNAseq matrix for downstream analysis. Preprocessing had already been done, but the raw data were available if necessary. The second dataset, from a study of lung cancer gene expression, was obtained from Bioconductor. The underlying study was published in 2004 (Scharpf R, Zhong S, Parmigiani G (2019). lungExpression: ExpressionSets for Parmigiani et al., 2004 Clinical Cancer Research paper., R package version 0.24.0). The dataset, called “lungExpression,” is represented as an ExpressionSet and was already preprocessed.
After the two datasets had been examined, a search was undertaken for the best way to integrate the two platforms usefully. The challenge lay in finding the most effective means to this end, which remains a current research goal. The aim was to unify the two studies and report on how their respective findings compare after integration, in the hope of strengthening the findings made on each platform. Such findings should ultimately help in understanding the relationship between the human genome and lung cancer. This required finding the best way to get the dataset from the Broad Institute into a usable format in R and translating both datasets into a form that could be integrated for further analysis. edgeR would then be used to complete a differential expression analysis of the integrated dataset for comparison with the prior results.
However, integration might prove impossible for lack of the information needed to carry it out. In that case, an attempt would be made to obtain the missing information from the publishers of the data or from another publication. If that failed, an in-depth analysis would be performed on each dataset independently and the results compared against each other. RNA sequencing (RNA-seq) has become a very widely used technology for profiling gene expression. One of the most common aims of RNA-seq profiling is to identify genes or molecular pathways that are differentially expressed (DE) between two or more biological conditions. In summary, the goal of this analysis was to find out which genetic features are related to lung cancer (FireBrowse, firebrowse.org/?cohort=LUSC#).
Methodology
This paper demonstrates a computational workflow for detecting differentially expressed genes and pathways from RNA-seq data by providing a complete analysis of an RNA-seq experiment that profiles genetic features related to lung cancer. The workflow used R software packages from the open-source Bioconductor project, and it covered all steps of the analysis pipeline, including alignment of read sequences, data exploration, differential expression analysis, visualization, and pathway analysis (Yunshun et al. n.p). The statistical analyses were performed using the edgeR package. The differential expression analysis used the quasi-likelihood functionality of edgeR. This statistical methodology uses negative binomial generalized linear models, but with F-tests instead of likelihood ratio tests.
This method provides stricter error rate control than other negative binomial based pipelines, including the traditional edgeR pipelines. The edgeR-quasi pipeline is based on a similar statistical methodology to that of the QuasiSeq package, which has performed well in third-party comparisons. Compared to QuasiSeq, the edgeR functions offer speed improvements and some additional statistical refinements. The RNA-seq pipelines of the limma package also offer excellent error rate control. While the limma pipelines are recommended for large-scale datasets because of their speed and flexibility, the edgeR-quasi pipeline gives better performance in low-count situations. For the data analyzed here, the edgeR-quasi, limma-voom, and limma-trend pipelines are all equally suitable and give similar results.
The analysis approach illustrated in this article can be applied to any RNA-seq study that includes some replication. It is particularly appropriate for designed experiments with multiple treatment factors and with small numbers of biological replicates. The approach assumes that RNA samples have been extracted from cells of interest under two or more treatment conditions, that RNA-seq profiling has been applied to each RNA sample, and that there are independent biological replicates for at least one of the treatment conditions. The edgeR part of the pipeline takes a matrix of gene-wise read counts as input.
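For reference, the quasi-likelihood negative binomial model mentioned above can be written out as a short sketch. The notation is an assumption introduced here for illustration and does not come from the report: y_{gi} is the read count for gene g in sample i, N_i the effective library size, x_i the row of the design matrix for sample i, \beta_g the gene-wise coefficients, \phi the negative binomial dispersion, and \sigma_g^2 the gene-specific quasi-likelihood dispersion.

```latex
\begin{align*}
  \mathrm{E}(y_{gi}) &= \mu_{gi}, &
  \log \mu_{gi} &= x_i^{\top}\beta_g + \log N_i, \\
  \operatorname{var}(y_{gi}) &= \sigma_g^{2}\left(\mu_{gi} + \phi\,\mu_{gi}^{2}\right).
\end{align*}
```

Under this formulation, a hypothesis about a contrast of the coefficients \beta_g is assessed with a quasi-likelihood F-test rather than a likelihood ratio test, so that the uncertainty in estimating \sigma_g^2 is taken into account.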
Missing data were checked using the is.na() function in R, and none were found.
> colSums(is.na(data))
     Cohort         BCR    Clinical          CN        LowP Methylation        mRNA
          0           0           0           0           0           0           0
    mRNASeq         miR      miRSeq        RPPA         MAF      rawMAF
          0           0           0           0           0           0
Had any missing values been present, imputation would have been the preferred treatment, so as not to lose useful information from the datasets.
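The check and the imputation fallback described above can be sketched in base R as follows. The data frame `toy`, its column names, and the helper `impute_median` are hypothetical and are not part of the project data; median imputation is only one of several possible imputation strategies and is assumed here for simplicity.

```r
# Hypothetical toy data frame with a few missing values (not the project data).
toy <- data.frame(
  geneA = c(10, NA, 14, 12),
  geneB = c(5, 7, NA, 9),
  geneC = c(1, 2, 3, 4)
)

# Count the missing values in each column, as in the check above.
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Median imputation: replace each NA with the median of the observed
# values in the same column.
impute_median <- function(column) {
  column[is.na(column)] <- median(column, na.rm = TRUE)
  column
}
imputed <- as.data.frame(lapply(toy, impute_median))

# After imputation no column should contain missing values.
print(colSums(is.na(imputed)))
```

For count data that will be passed to edgeR, imputed values would also need to remain non-negative; that constraint is not enforced in this sketch.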
Results and Discussion
For the first dataset, it was assumed that there were 12 RNA-seq libraries in two groups (1 and 2). The counts were stored in a tab-delimited text file, with gene symbols in a column called Cohort. The counts were then entered into a DGEList object using the DGEList() function in edgeR. The second dataset was already preprocessed, and its groups were taken from the differentiation feature.
> library(edgeR) # provides DGEList(), filterByExpr() and calcNormFactors()
> #For dataset 1
> # Read the tab-delimited counts; the Cohort column supplies the row names
> data <- read.delim("data.tsv", row.names="Cohort")
> x <- data
> # Two groups of six libraries each
> group <- factor(c(1,1,1,1,1,1,2,2,2,2,2,2))
> y <- DGEList(counts=x, group=group)
> #For dataset 2
> # Assumption: the michigan ExpressionSet is loaded from the lungExpression data package
> library("lungExpression")
> data(michigan)
> x <- michigan
> group <- factor(x$differentiation)
Linear modeling and differential expression analysis in edgeR require a design matrix to be specified. Therefore, a design matrix was created to record which treatment conditions were applied to each sample (Yunshun et al. 11). The matrix connects each group to its samples: each row corresponds to a sample, whereas each column represents a coefficient corresponding to one of the groups (two in dataset 1 and four in dataset 2).
> design <- model.matrix(~0+group)
> colnames(design) <- levels(group)
> design #Dataset 1
   1 2
1  1 0
2  1 0
3  1 0
------
9  0 1
10 0 1
11 0 1
12 0 1
> design #Dataset 2
   1 2 3 4
1  1 0 0 0
2  0 1 0 0
3  1 0 0 0
4  0 0 1 0
----------
83 0 1 0 0
84 1 0 0 0
85 0 0 1 0
86 0 0 1 0
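To make the structure of such a design matrix concrete, the following minimal sketch builds one in base R for a hypothetical set of six samples in two groups. The grouping and the contrast vector are illustrative assumptions, not the project data.

```r
# Hypothetical grouping: six samples in two groups (not the project data).
group <- factor(c(1, 1, 1, 2, 2, 2))

# "~0 + group" drops the intercept, so each column is an indicator
# (0/1) for membership in one group.
design <- model.matrix(~0 + group)
colnames(design) <- levels(group)
print(design)

# A hand-written contrast comparing group 2 against group 1.
contrast <- c("1" = -1, "2" = 1)
print(contrast)
```

Because the intercept is dropped, every row contains exactly one 1, and the number of columns equals the number of groups.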
Genes that had low counts across all the libraries were removed prior to downstream analysis.
> keep <- filterByExpr(y, design)
> table(keep)#Dataset 1
keep
TRUE
94
> table(keep)#Dataset 2
keep
FALSE TRUE
10 2186
Note that a gene must be expressed at some minimal level before it is likely to be translated into a protein or to be considered biologically important. In addition, genes with low counts are very unlikely to be assessed as significantly DE, because low counts provide insufficient statistical evidence for a reliable judgment. Such genes can therefore be removed from the analysis without any loss of information. Filtering was accomplished with filterByExpr(), a function that takes into account the library sizes and the experimental design. A simpler filter rule would be to compute the average log-CPM for each gene and to choose a cutoff value heuristically.
> y <- y[keep, , keep.lib.sizes=FALSE]
> AveLogCPM <- aveLogCPM(y)
> hist(AveLogCPM, col="GRAY")#Dataset 1
> hist(AveLogCPM, col="YELLOW")#Dataset 2
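The idea behind count-per-million (CPM) filtering can be sketched in base R. This is a simplified illustration that is only loosely modeled on the logic of filterByExpr(); it is not that function's implementation. The count matrix, the minimum count of 10, and the requirement of two samples are all assumptions chosen for the example.

```r
# Hypothetical count matrix: 4 genes (rows) by 4 samples (columns).
counts <- matrix(
  c(500, 450, 520, 480,
      0,   1,   0,   2,
    300, 310, 290, 305,
      3,   0,   1,   0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("s", 1:4))
)

# Library size of each sample and counts per million (CPM).
lib_size <- colSums(counts)
cpm <- t(t(counts) / lib_size) * 1e6

# Assumed rule: convert a minimum count of 10 into a CPM cutoff using the
# median library size, and require at least 2 samples above the cutoff.
min_count <- 10
min_samples <- 2
cpm_cutoff <- min_count / median(lib_size) * 1e6

keep <- rowSums(cpm > cpm_cutoff) >= min_samples
print(keep)

# Retain only the genes that pass the filter.
filtered <- counts[keep, , drop = FALSE]
```

Expressing the cutoff in CPM rather than raw counts puts libraries of different sizes on a comparable scale.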
Normalization by trimmed mean of M values (TMM) was performed using the calcNormFactors() function, which returned the DGEList object with only the norm.factors changed. The function calculated a set of normalization factors, one for each sample, to eliminate composition biases between libraries.
> y <- calcNormFactors(y)
> y$samples #Dataset 1
            group lib.size norm.factors
BCR             1    43545        0.680
Clinical        1    42867        0.677
CN              1    41996        0.666
LowP            1     4526        1.554
Methylation     1    40688        0.687
mRNA            1     8505        1.357
mRNASeq         2    37289        0.684
miR             2     4533        6.100
miRSeq          2    36340        0.739
RPPA            2    27132        0.771
MAF             2    25125        0.783
rawMAF          2    23951        1.209
> y$samples #Dataset 2
     group lib.size norm.factors
AD10     1    22655        1.003
AD2      2    22784        1.001
AD3      1    23067        0.996
AD5      3    23000        0.996
AD6      3    22850        0.997
AD7      1    23127        0.997
AD8      1    22529        1.004
L01      2    22765        0.999
L02      2    22911        0.996
L04      2    22778        1.000
--------------------------------
L94      1    22908        0.995
L95      2    22785        0.999
L96      1    22444        1.005
L97      3    22964        0.997
L99      3    22094        1.009
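The normalization factors are applied by scaling each library size: the effective library size of a sample is its lib.size multiplied by its norm.factors. The short sketch below shows this arithmetic for three rows copied from the dataset 1 table above; the data frame name `samples` and the added column name are choices made for this illustration.

```r
# Three rows copied from the dataset 1 table above.
samples <- data.frame(
  lib.size = c(43545, 4526, 4533),
  norm.factors = c(0.680, 1.554, 6.100),
  row.names = c("BCR", "LowP", "miR")
)

# Effective library size = raw library size * normalization factor.
samples$effective.lib.size <- samples$lib.size * samples$norm.factors
print(samples)
```

A factor below 1 scales the library size down and a factor above 1 scales it up, which is how the composition bias between libraries is compensated in the downstream model.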
Then, the RNA samples were clustered in two dimensions using multi-dimensional scaling (MDS) plots. This was an analysis and quality control step to explore the overall differences between the expression profiles of the different samples. In this case, the MDS plot was drawn to indicate the cell groups.
> pch <- c(0,1,2,15,16,17)
> colors <- rep(c("darkgreen", "red", "blue"), 2)
> plotMDS(y, col=colors[group], pch=pch[group])#Dataset 1
> plotMDS(y, col=colors[group], pch=pch[group])#Dataset 2
> legend("toplef...
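The idea behind an MDS plot can be illustrated with base R alone. The sketch below is not the plotMDS() computation used above: it substitutes classical MDS (cmdscale) on Euclidean distances for the distance measure that plotMDS() uses, and it runs on simulated log-expression values. The matrix `logexpr`, the group labels, and the simulated shift are all hypothetical.

```r
set.seed(1)

# Hypothetical log-expression matrix: 50 genes by 6 samples in two groups.
group <- factor(c(1, 1, 1, 2, 2, 2))
logexpr <- matrix(rnorm(50 * 6), nrow = 50,
                  dimnames = list(NULL, paste0("s", 1:6)))

# Shift the first 20 genes upward in group 2 so that the groups differ.
logexpr[1:20, group == 2] <- logexpr[1:20, group == 2] + 5

# Euclidean distances between samples (columns), then classical MDS
# projecting the samples onto two dimensions.
d <- dist(t(logexpr))
coords <- cmdscale(d, k = 2)

# Plot the samples, colored by group.
colors <- c("darkgreen", "red")
plot(coords, col = colors[group], pch = 16,
     xlab = "Dimension 1", ylab = "Dimension 2")
legend("topright", legend = levels(group), col = colors, pch = 16)
```

Samples with similar expression profiles land close together, so a clear separation of the groups along the first dimension suggests that the group effect dominates the variation between samples.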