Probability and Sampling Distributions, and Their Applications to Statistical Inference
For your second Data Discussion, we will focus on probability and sampling distributions, and their applications to statistical inference. The specific prompts for Data Discussion #2 can be found in the "Data Discussions" module. Please follow these formatting constraints:
-2 to 4 pages (just make sure to answer all of the questions completely!)
-12 point font, any legible typeface
-1.5 spaced or double spaced
-1 inch margins
-Should be either a PDF or WORD document
-Please include your name on the document!
-Please note where your data is coming from, and include a citation for any statements or ideas that are not your own. You don't need to choose a specific type of citation style (MLA, APA, etc.), but please be consistent in your style throughout.
Instructor
Course
Due Date
Data Discussion
PART I:
You are planning a team lunch for work, and need to order accordingly so that you have enough vegetarian options. Your boss tells you that around 30% of people at your company are vegetarian, based on a recent survey. You need to order 8 total meals for your team.
o You decide that one way to approach this problem is to run a simulation. Explain how you would construct a simulation to help you find the number of vegetarian members of your team.
Since the probability of a person being vegetarian is about 30%, a spinner is a convenient way to simulate this. The spinner is divided into 10 equal parts, three of which are colored to represent vegetarians. One trial consists of 8 spins, one for each team member, and every spin that stops on a colored part is counted as a vegetarian. The count of colored spins in a trial is the simulated number of vegetarians on the team, and the estimated probability of a person being vegetarian is the total number of colored spins divided by the total number of times I spun it.
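The spinner procedure described above can be sketched in R. This is only an illustration: the seed and the variable names (`n_trials`, `team_size`, `veg_counts`) are assumptions, and a different seed will give different results from the hand-run simulation.

```r
set.seed(1)        # arbitrary seed, used only so the run can be repeated
n_trials  <- 25    # number of simulated team lunches
team_size <- 8     # one spin per team member

# Each spin lands on one of 10 equal sectors; sectors 1 to 3 are the colored
# "vegetarian" sectors, so each spin is a vegetarian with probability 0.3.
veg_counts <- replicate(n_trials, {
  spins <- sample(1:10, team_size, replace = TRUE)
  sum(spins <= 3)
})

mean(veg_counts)       # estimated number of vegetarians per lunch
mean(veg_counts < 4)   # share of trials in which 4 vegetarian meals were more than needed
```

Averaging the counts answers the "how many to expect" question, and the share of trials with fewer than 4 vegetarians estimates the chance of over-ordering when 4 meals are bought.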
o Run at least 25 trials of this simulation.
o Based on this simulation, how many vegetarians do you expect to have to order for?
After 25 trials, I expect 3 (rounded) vegetarian meals.
o Let's say that you end up ordering 4 vegetarian meals. What is the probability, according to your simulation, that you buy more vegetarian meals than are needed?
75%
o You want to be more accurate about these values, so you look into using a probability model instead to help with your situation.
o What probability model would work best here? Binomial Distribution Model
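What are the parameters of that model? The parameters are the number of trials, n = 8 (the team members), and the probability of success, p = 0.3 (the probability that a person is vegetarian). Each person is either vegetarian or non-vegetarian.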
o Create a probability distribution for this probability model.
$P(X = x;\, n, p) = \binom{n}{x}\, p^{x} (1-p)^{n-x}$
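Evaluating this formula for every possible count gives the full probability distribution. The sketch below assumes n = 8 and p = 0.3 from the prompt and uses R's built-in `dbinom`; the data frame name `dist` is only illustrative.

```r
n <- 8      # team size
p <- 0.3    # probability that a person is vegetarian
x <- 0:n    # every possible number of vegetarians

# Binomial probability of each count, rounded to four decimals for display
dist <- data.frame(x = x, probability = round(dbinom(x, size = n, prob = p), 4))
print(dist)

sum(dbinom(x, size = n, prob = p))   # the unrounded probabilities sum to 1
```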
o How many vegetarians should you expect? The expected number is np = 8 × 0.3 = 2.4, so about 2 vegetarians should be expected.
How does this compare to the results found from your simulation above?
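The model's expected value, np = 8 × 0.3 = 2.4, is slightly lower than the simulation estimate of about 3. A difference of this size is not surprising with only 25 trials, since a small simulation varies from run to run.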
o As mentioned above, you ordered 4 vegetarian options. What is the probability that you have ordered exactly enough?
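The probability of ordering exactly enough is the probability that exactly 4 team members are vegetarian: P(X = 4) = 8C4 × 0.3^4 × 0.7^4 = 70 × 0.0081 × 0.2401 ≈ 0.136.
o What is the probability that you have not ordered enough vegetarian meals?
There are not enough meals when more than 4 team members are vegetarian: P(X > 4) = 1 - P(X ≤ 4) ≈ 1 - 0.942 = 0.058.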
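As a cross-check, the exact probabilities for an order of 4 vegetarian meals can be computed directly. The sketch assumes the Binomial(n = 8, p = 0.3) model named earlier; the variable names are illustrative.

```r
n <- 8         # team size
p <- 0.3       # probability that a person is vegetarian
ordered <- 4   # vegetarian meals ordered

p_exact  <- dbinom(ordered, size = n, prob = p)       # P(X = 4), about 0.136
p_short  <- 1 - pbinom(ordered, size = n, prob = p)   # P(X > 4), about 0.058
p_excess <- pbinom(ordered - 1, size = n, prob = p)   # P(X < 4), about 0.806

c(exact = p_exact, short = p_short, excess = p_excess)
```

The three cases (exactly enough, too few, too many) cover every outcome, so the three values sum to 1.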
o You did so well with your first lunch that you are now put in charge of ordering for an all-company meeting coming up soon. Your company has 300 people total.
o Can this now be approximated using a Normal model? Explain why or why not.
Yes. With n = 300 and p = 0.3, the expected number of vegetarians (np = 90) and the expected number of non-vegetarians (n(1 - p) = 210) are both at least 10, so the binomial distribution is close to a Normal distribution. It is the larger sample size that makes the approximation reasonable; the count of vegetarians itself is still discrete.
o What are the mean and standard deviation of this Normal model approximation?
The mean is np = 300 × 0.3 = 90, and the standard deviation is sqrt(np(1 - p)) = sqrt(300 × 0.3 × 0.7) = sqrt(63) ≈ 7.94.
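A minimal sketch of how the parameters of the Normal approximation follow from the binomial model, assuming n = 300 and p = 0.3 from the prompt. The "at least 10" check is the usual rule of thumb for when the approximation is reasonable, not a strict requirement.

```r
n <- 300   # people in the company
p <- 0.3   # probability that a person is vegetarian

# Rule-of-thumb check: expected vegetarians and non-vegetarians should both be at least 10
expected_counts <- c(vegetarian = n * p, non_vegetarian = n * (1 - p))
all(expected_counts >= 10)

mu    <- n * p                    # mean of the approximation: 90
sigma <- sqrt(n * p * (1 - p))    # standard deviation: sqrt(63), about 7.94
c(mean = mu, sd = sigma)
```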
o You decide to order 150 vegetarian meals. What is the probability that you have more than enough vegetarian orders?
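Approximately 100%. Under the Normal model with mean np = 90 and standard deviation sqrt(63) ≈ 7.94, 150 meals is about 7.6 standard deviations above the mean, so running short is practically impossible.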
o You call the catering company to order, and they tell you they can only provide you with 100 vegetarian meals. What is the likelihood that you will not have enough meals?
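About 10%. Under the Normal model with mean 90 and standard deviation sqrt(63) ≈ 7.94, z = (100 - 90) / 7.94 ≈ 1.26, and P(Z > 1.26) ≈ 0.10.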
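The two ordering scenarios can be checked with R's `pnorm`, assuming the Normal model with mean np = 90 and standard deviation sqrt(63). The exact binomial tail is included only for comparison, and the variable names are illustrative.

```r
mu    <- 300 * 0.3                # mean of the Normal approximation: 90
sigma <- sqrt(300 * 0.3 * 0.7)    # standard deviation: sqrt(63)

p_surplus_150 <- pnorm(150, mean = mu, sd = sigma)       # P(X < 150), essentially 1
p_short_100   <- 1 - pnorm(100, mean = mu, sd = sigma)   # P(X > 100), about 0.10

# Exact binomial tail for the same question, as a sanity check on the approximation
p_short_exact <- 1 - pbinom(100, size = 300, prob = 0.3)

c(surplus_150 = p_surplus_150, short_100 = p_short_100, short_exact = p_short_exact)
```

The Normal and exact values should be close but not identical, because the Normal curve is a continuous approximation to a discrete count.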
PART II:
Find a dataset of at least 1000 individuals (please reach out if you need help finding a dataset). Focus on either a categorical or quantitative variable from this dataset.
o Construct an appropriate chart or graph to display the distribution of your variable. If the dataset is too large, feel free to use only 1000 of the individuals.
Describe this distribution of this variable numerically and verbally.
The distribution above shows that most billionaires are between 50 and 80 years old. In the past it was uncommon to see many billionaires under 40; in recent years, with the evolution of technology, there are numerous billionaires in their 20s to 40s. Apart from inheritance, wealth generally accumulates over a long period of time, so it is not surprising that the mean age is close to 60, where most of the observations fall.
o Now, take 100 random samples of 50 individuals from your dataset.
o If your variable is categorical, calculate the proportion p in which you see a "success" for each sample.
o If your variable is quantitative, calculate the average for each sample.
My variable is quantitative, and therefore I calculate the average of each sample.
The average of each sample is calculated in R software by using the following code.
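# `data` is assumed to be a numeric vector of billionaire ages
n <- 100             # number of samples requested in the prompt
S50 <- numeric(n)    # preallocate the vector of sample means
for (i in 1:n) {
  # mean age of one random sample of 50 individuals, drawn with replacement
  S50[i] <- mean(sample(data, 50, replace = TRUE))
}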
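The same resampling idea can be written more compactly with `replicate` and then summarized. Because the billionaire dataset is not reproduced here, the sketch uses a hypothetical stand-in vector `ages`; its mean and spread are made-up placeholder values, not figures from the real data.

```r
set.seed(42)   # arbitrary seed, used only so the run can be repeated

# HYPOTHETICAL stand-in for the real age column; replace with the actual data
ages <- round(rnorm(1000, mean = 64, sd = 13))

# 100 samples of 50 individuals each, keeping the mean of every sample
sample_means <- replicate(100, mean(sample(ages, 50)))

mean(sample_means)     # center of the sampling distribution
sd(sample_means)       # spread of the sample means
sd(ages) / sqrt(50)    # spread suggested by theory, for comparison

hist(sample_means, main = "Sampling distribution of the mean age",
     xlab = "Sample mean age")
```

The histogram of sample means should look roughly bell-shaped and much narrower than the histogram of individual ages, which is the pattern the sampling distribution is meant to show.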
o Use these sample statistics to create a sampling distribution. Describe the distribution.
[Figure: frequency distribution of the sample means]
The frequency distribution above shows that, across the different samples, the sample means follow an approximately normal distribution. The sample means are concentrated around 65, and very few samples have an average age below 40, indicating that the average age of billionaires usually falls between 60 and 70...