NYCHA Resident Data Book Summary
Project instruction: Choose a dataset from the NYC Open Data website (https://opendata(dot)cityofnewyork(dot)us/data/) and write a report on a hypothesis test using the dataset. Your report should introduce the dataset and state your objective. The hypothesis test should be complete (with all four steps). The report should be written and submitted as a Word document. You may use Excel for your work (include a screenshot of the Excel output). The project is due Dec 4, 11:59 pm EST on Blackboard. Attached is a template project you can refer to when writing your own report (Note: you may need to clean and filter the selected dataset before it can be used for analysis).
Project template
Title: Comparison of the average number of employees between the finance and retail businesses in New York City
The dataset I chose is "NYC Business Acceleration Businesses Served and Jobs Created". The dataset lists the number of businesses that NYC Business Acceleration has assisted in opening and how many jobs were created by those businesses. Each row in the original dataset represents one business. My study goal is to analyze how two different business sectors compare in their average number of jobs created.
First, I selected the business sectors "Finance and Insurance" and "Retail Trade". I then deleted the businesses that did not report the "number of employees". I rearranged the selected data into two columns: one column (named "Finance and Insurance") listing the number of employees that are in the finance and insurance businesses (sample size is 17); and the other column (named "Retail Trade") listing the number of employees that are in the retail trade businesses (sample size is 125). Below is part of the filtered dataset.
Finance and Insurance | Retail Trade
12 | 25
12 | 25
14 | 3
14 | 5
7 | 1
14 | 1
14 | 350
15 | 125
14 | 5
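The filtering steps described above (keep two sectors, drop businesses with no reported employee count, split into two columns) could be sketched in Python roughly as follows. This is only an illustration: the field names `sector` and `employees` and the sample records are hypothetical and are not the dataset's actual column headers or rows.

```python
# Hypothetical records; "sector" and "employees" are assumed field names,
# not the real column headers of the NYC Open Data file.
rows = [
    {"sector": "Finance and Insurance", "employees": "12"},
    {"sector": "Retail Trade", "employees": "25"},
    {"sector": "Retail Trade", "employees": ""},      # not reported, dropped
    {"sector": "Construction", "employees": "40"},    # other sector, dropped
    {"sector": "Finance and Insurance", "employees": "14"},
]


def employees_by_sector(rows, sectors):
    """Group reported employee counts by sector, skipping blank entries."""
    groups = {sector: [] for sector in sectors}
    for row in rows:
        value = row["employees"].strip()
        if row["sector"] in groups and value:
            groups[row["sector"]].append(int(value))
    return groups


groups = employees_by_sector(rows, ["Finance and Insurance", "Retail Trade"])
print(groups)
```

Each list in `groups` plays the role of one column in the filtered table.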
Since the two columns list different companies from different sectors, I assume that they are independent samples. For businesses in Finance and Insurance, the sample size is relatively small (<30), so I assume the number of employees in all Finance and Insurance businesses in NYC is normally distributed. Finally, since I only have sample data, the population standard deviation of the number of employees in each sector is unknown. Under these assumptions, I set up the hypotheses as $H_0: \mu_1 = \mu_2$ versus $H_a: \mu_1 \neq \mu_2$, where $\mu_1$ denotes the average number of employees in the Finance and Insurance businesses while $\mu_2$ denotes the average number of employees in the Retail Trade businesses.
Next, I set the significance level (Type I error limit) at $\alpha = 0.05$.
Then, using Excel (Data > Data Analysis > t-Test: Two-Sample Assuming Unequal Variances) and keeping alpha at 0.05, Excel shows the following output:
t-Test: Two-Sample Assuming Unequal Variances
 | Variable 1 | Variable 2
Mean | 13.88235 | 64.296
Variance | 4.110294 | 10715.24
Observations | 17 | 125
Hypothesized Mean Difference | 0 |
df | 125 |
t Stat | -5.43739 |
P(T<=t) one-tail | 1.36E-07 |
t Critical one-tail | 1.657135 |
P(T<=t) two-tail | 2.73E-07 |
t Critical two-tail | 1.979124 |
The t statistic is -5.44. Since this is a two-tailed test, the p-value equals 0.000000273, which is much less than $\alpha = 0.05$. Therefore, we reject the null hypothesis and conclude that the Finance and Insurance and Retail Trade businesses have different average numbers of employees in NYC. The difference is probably driven by a few Retail Trade businesses with extremely large numbers of employees, which pull the Retail Trade mean upward. It would be worth investigating what these businesses are.
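As a cross-check on the Excel output, the t statistic and degrees of freedom of the unequal-variances (Welch) test can be recomputed from the reported means, variances, and sample sizes alone. The sketch below uses only the Python standard library; the function name `welch_t` is my own, and the critical value is copied from the Excel table rather than computed.

```python
import math


def welch_t(mean1, var1, n1, mean2, var2, n2):
    """Return the Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    se1 = var1 / n1  # squared standard error of sample 1's mean
    se2 = var2 / n2  # squared standard error of sample 2's mean
    t = (mean1 - mean2) / math.sqrt(se1 + se2)
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df


# Summary statistics taken from the Excel output above.
t_stat, df = welch_t(13.88235, 4.110294, 17, 64.296, 10715.24, 125)

# Two-tailed critical value as reported by Excel (not recomputed here).
T_CRITICAL_TWO_TAIL = 1.979124
reject_null = abs(t_stat) > T_CRITICAL_TWO_TAIL

print(round(t_stat, 4), round(df), reject_null)
```

The recomputed degrees of freedom are fractional; Excel reports them rounded to the nearest whole number.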
Title: Relationship between average total gross income and the percentage of heads of household (HOH) aged 62 years and over
The dataset I chose is "NYCHA Resident Data Book Summary", which contains resident demographic data, including housing and development data, reported by the New York City Housing Authority (NYCHA) and published on NYC Open Data. The variables chosen are "All Average Total Gross Income" and "Total HOH 62 Years and Over as Percent of Families" (sample size is 33) (NYC Open Data).
I found the dataset by browsing NYC Open Data by agency, selecting the New York City Housing Authority (NYCHA), and opening the NYCHA Resident Data Book Summary. Then, I filtered the dataset to include two columns: "All Average Total Gross Income" and "Total HOH 62 Years and Over as Percent of Families". Below is the dataset.
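Hypotheses
* H0: There is no linear relationship between average total gross income and the percentage of heads of household (HOH) aged 62 years and over (regression slope = 0)
* Ha: There is a linear relationship between average total gross income and the percentage of heads of household (HOH) aged 62 years and over (regression slope ≠ 0)
The level of significance is set at 5%.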
Then, using Excel (ANOVA and regression) and keeping alpha at 0.05, Excel shows the following output:
ANOVA
 | df | SS | MS | F | Significance F
Regression | 1 | 272126583.1 | 2.72E+8 | 271.013627 | 7.08723E-17
Residual | 31 | 31127305.4 | 1004107 | |
Total | 32 | 303253889 | | |
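The entries of the ANOVA table are tied together by simple arithmetic, so the F statistic can be checked by hand. The sketch below is a consistency check, not part of the Excel workflow: it takes the regression sum of squares as Total SS minus Residual SS, and it also computes R squared, which is not shown in the ANOVA table itself.

```python
# Values copied from the ANOVA output above.
ss_total = 303253889.0
ss_residual = 31127305.4
df_regression = 1
df_residual = 31

# Regression SS is derived here as Total SS minus Residual SS.
ss_regression = ss_total - ss_residual

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F is the ratio of the two mean squares.
f_stat = ms_regression / ms_residual

# Share of the total variation explained by the regression.
r_squared = ss_regression / ss_total

print(round(f_stat, 2), round(r_squared, 3))
```

An F value this large relative to its degrees of freedom (1 and 31) is in line with the very small Significance F reported by Excel.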
Coefficients
Standard Error
t Stat
P-value
Lower 95%
Intercept
15822.77416