library(tidyverse)
library(patchwork)
Estimation and statistical hypothesis testing
Estimation is the process of finding an estimate, or approximation: a value that is usable for some purpose even if the input data are incomplete.
A statistical hypothesis test is a method of statistical inference.
The number of samples and sample size
In statistics, a population is a set of similar items or events that is of interest for some question or experiment.
In social science, the population is the group we want to know about: for example, all Japanese people when we ask about average height, or all Americans when we ask about average income. Surveying an entire population is usually infeasible because of the enormous cost.
In statistics, a sample is a set of individuals or objects collected or selected from a statistical population by a defined procedure.
Sample size (also known as N in statistics) is the number of observations in a sample.
The number of samples is the number of sample sets taken from the population.
An example:
Suppose two surveys, Survey A and Survey B, are drawn from the same population. The number of samples is 2. The sample size for Survey A is 1,000 and the sample size for Survey B is 2,000.
Parameter and Statistic:
A parameter is a number describing a whole population (e.g., population mean),
while a statistic is a number describing a sample (e.g., sample mean).
Parameter
・population mean
・population ratio
・population variance
・population standard deviation
Statistic
・sample mean
・sample ratio
・sample variance
・unbiased variance
・sample standard deviation
・unbiased standard deviation
Random Sampling
A simple random sample is a randomly selected subset of a population.
In this sampling method, each member of the population has an exactly equal chance of being selected.
This method is the most straightforward of all the probability sampling methods, since it only involves a single random selection and requires little advance knowledge about the population.
Because it uses randomization, any research performed on this sample should have high internal and external validity.
Internal validity is the extent to which you can be confident that a cause-and-effect relationship established in a study cannot be explained by other factors.
External validity is the extent to which you can generalize the findings of a study to other situations, people, settings and measures.
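As a minimal sketch of simple random sampling, the snippet below builds a hypothetical population (the vector `population` and its 10,000 made-up values are assumptions for illustration only) and draws a simple random sample with base R's `sample()`. The population mean plays the role of a parameter, and the sample mean is a statistic.

```r
set.seed(1)  # fix the seed so the draw is reproducible

# Hypothetical population: 10,000 made-up values (an assumption for illustration)
population <- rnorm(10000, mean = 500, sd = 100)

# Simple random sample: every member has the same chance of being selected
srs <- sample(population, size = 100, replace = FALSE)

parameter <- mean(population)  # describes the whole population
statistic <- mean(srs)         # describes only the sample

c(parameter = parameter, statistic = statistic)
```

The two numbers should be close but not identical; the gap between them is what the later sections call sampling error.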
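Draw the density of an artificial population (a normal distribution with mean 10 and standard deviation 5) with the dnorm() function.
curve(dnorm(x, 10, 5), from = -10, to = 30)
Using the rnorm() function, let’s randomly draw a sample from this artificial population.
x1 <- rnorm(20, mean = 10, sd = 5)
x1
[1] 7.7298953 17.0493278 7.2736714 5.9837875 9.7302424 5.0476241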
[7] 9.3211174 13.7525222 11.7170875 7.2907782 10.5752665 8.3700592
[13] 15.5648316 10.8888663 -0.4720088 -1.2653309 11.6189093 6.6641503
[19] 20.8662006 5.5781984
round(x1, digits = 0)
[1] 8 17 7 6 10 5 9 14 12 7 11 8 16 11 0 -1 12 7 21 6
hist(x1)
The sample mean is defined as
\[\bar{x} = \frac{\sum_{i=1}^n x_i}{n}\]
・Using the mean() function, we can calculate the sample mean
mean(x1)
[1] 9.16426
Estimator and Estimate
\[\bar{x} = \frac{\sum_{i=1}^n x_i}{n}\]
In statistics, an estimator is a rule for calculating an estimate of a given quantity based on observed data.
→ Here, the equation above is the estimator.
An estimate is the value an estimator produces when it is applied to a particular sample.
→ Here, the value calculated by mean(x1) is the estimate.
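The distinction can be sketched in R by treating the estimator as a function and the estimate as the number it returns. The function name `sample_mean` and the vector `obs` are hypothetical, introduced only for this illustration.

```r
# Estimator: a rule (here, a function) that turns any sample into a number
sample_mean <- function(x) sum(x) / length(x)

# A small hypothetical sample (values chosen only for illustration)
obs <- c(8, 17, 7, 6, 10)

# Estimate: the number the rule returns for this particular sample
estimate <- sample_mean(obs)
estimate                        # 9.6
all.equal(estimate, mean(obs))  # TRUE: the same rule as the built-in mean()
```

Feeding a different sample to the same estimator gives a different estimate, while the rule itself stays unchanged.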
x2 <- rnorm(20, mean = 10, sd = 5)
x2
[1] 7.590461 10.008695 18.989587 10.852653 5.691300 5.600953 14.905523
[8] 13.230365 12.339585 17.830952 2.628222 14.871700 10.861399 8.014057
[15] 7.113683 4.100466 13.501053 22.508313 6.805313 2.252608
round(x2, digits = 0)
[1] 8 10 19 11 6 6 15 13 12 18 3 15 11 8 7 4 14 23 7 2
hist(x2)
mean(x2)
[1] 10.48484
Because R draws the sample at random from the population, the sample mean is unbiased, but it is still subject to sampling error.
We do not always get exactly the population mean of 10 → this gap is the sampling error.
However, if we draw samples many times, the average of the sample means gets close to the population mean.
This is the theoretical basis for statistically estimating the population mean from a sample statistic.
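The claim about repeated draws can be checked with a short simulation. This is only a sketch: the number of repetitions (`n_draws`) and the seed are arbitrary assumptions, while the sample size of 20 and the population (mean 10, standard deviation 5) follow the examples above.

```r
set.seed(123)  # arbitrary seed, for reproducibility only

n_draws <- 1000  # hypothetical number of repeated samples

# Draw 1,000 samples of size 20 and keep each sample mean
sample_means <- replicate(n_draws, mean(rnorm(20, mean = 10, sd = 5)))

range(sample_means)  # individual sample means scatter around 10 (sampling error)
mean(sample_means)   # their average lies close to the population mean of 10
```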
We can get other sample statistics with the following commands:
min(x2)
[1] 2.252608
median(x2)
[1] 10.43067
max(x2)
[1] 22.50831
quantile(x2, c(0, .25, .5, .75, 1))
       0%       25%       50%       75%      100%
 2.252608  6.526809 10.430674 13.843715 22.508313
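The list of statistics earlier distinguishes the sample variance from the unbiased variance. As a hedged sketch of the difference, the snippet below computes both by hand for a hypothetical vector `v` and compares them with R's built-in `var()` and `sd()`, which divide by n - 1.

```r
v <- c(2, 4, 4, 4, 5, 5, 7, 9)  # hypothetical data, chosen for easy arithmetic
n <- length(v)

sample_variance   <- sum((v - mean(v))^2) / n        # divides by n
unbiased_variance <- sum((v - mean(v))^2) / (n - 1)  # divides by n - 1

sample_variance                            # 4
all.equal(unbiased_variance, var(v))       # TRUE: var() uses n - 1
all.equal(sqrt(unbiased_variance), sd(v))  # TRUE: sd() is the square root of var()
```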
Normal Distribution
As the sample size increases, does the distribution of the sample mean converge to a normal distribution?
A simulation drawing 10 cards in a bag
Make 10 cards, each printed with one number from 0 to 9.
Put them in a bag.
Treat this bag as the population, that is, the thing we want to know about.
Here we generate the population artificially; in real research the full population is rarely available like this.
Shuffle the bag, and draw a card from the bag.
Record the number you drew, and return the card to the bag.
This is called random sampling (with replacement).
For instance, the probability that you draw the card numbered 9 is 1/10.
Draw another card, record its number, and return the card to the bag.
Repeat this n times.
The 10 numbers follow a discrete uniform distribution: 0, 1, …, 9.
The population mean is 4.5 (= (0 + 1 + 2 + … + 9) / 10).
Is the sample mean also 4.5?
bag <- 0:9                      # Generate an artificial population
exp_1 <- sample(bag,            # Use the bag containing 10 cards
                size = 2,       # Draw 2 cards (sample size = 2)
                replace = TRUE) # Return the card you drew before drawing another one
mean(exp_1)                     # Calculate the sample mean
[1] 5
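The sample mean of two cards here is 5, not 4.5. For comparison, the population values can be computed directly, since the whole bag is known. This is a small sketch; the names `pop_mean`, `pop_var`, and `pop_sd` are introduced only for illustration.

```r
bag <- 0:9  # the same ten cards as above

pop_mean <- mean(bag)                 # 4.5
pop_var  <- mean((bag - pop_mean)^2)  # 8.25; divide by 10 because this is the whole population
pop_sd   <- sqrt(pop_var)

c(mean = pop_mean, variance = pop_var, sd = pop_sd)
```

Note that `var(bag)` would divide by 9 instead of 10, because R's `var()` is designed for samples rather than full populations.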
We use a for loop in this simulation.
How to use a for loop:
Make an object A and assign the value 0 to A
A <- 0
Make a container (result) so that you can save the results in it
At this point, result contains nothing (NA)
result <- rep(NA,              # NA means `result` contains nothing
              length.out = 5)
result
[1] NA NA NA NA NA
Using a for loop, let R repeat the task of adding 10 to A 5 times
for (i in 1:5) {     # i repeats from 1 to 5
  A <- A + 10        # Add 10 to A
  result[i] <- A     # Put the result of adding into `result[i]`
}
result
[1] 10 20 30 40 50
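For comparison, the same task can be written in two other common ways. This is a sketch of alternatives, not a replacement for the loop above: the first pre-allocates a numeric vector and iterates with `seq_along()`, the second avoids the loop entirely with `cumsum()`.

```r
# 1. Pre-allocate with numeric() and iterate with seq_along()
A <- 0
result <- numeric(5)
for (i in seq_along(result)) {
  A <- A + 10      # Add 10 to A
  result[i] <- A   # Store the running total
}
result  # 10 20 30 40 50

# 2. Vectorized: running total of five 10s
cumsum(rep(10, times = 5))  # 10 20 30 40 50
```

The explicit loop is kept in the simulations that follow because each repetition there involves a fresh random draw, which makes the step-by-step structure easier to read.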
A simulation of the central limit theorem using a for loop
sim1: sample size = 2, repeat the task 10 times
N <- 2          # Draw 2 cards
trials <- 10    # Repeat it 10 times
sim1 <- rep(NA, length.out = trials)
for (i in 1:trials) {                  # What R should do on each repetition
  experiment <- sample(bag,
                       size = N,
                       replace = TRUE) # Return the card R drew before drawing another card
  sim1[i] <- mean(experiment)          # Save the mean for i
}
df_sim1 <- tibble(avg = sim1)
h_sim1 <- ggplot(df_sim1, aes(x = avg)) +
  geom_histogram(binwidth = 1,
                 boundary = 0.5,
                 color = "black") +
  labs(x = "The Average of the 2 cards",
       y = "Frequency") +
  ggtitle("S1[N:2, repeat:10]") +
  scale_x_continuous(breaks = 0:9) +
  geom_vline(xintercept = mean(df_sim1$avg), # Draw a vertical line for the average
             col = "lightgreen")
plot(h_sim1)
df_sim1
# A tibble: 10 x 1
avg
<dbl>
1 4.5
2 1
3 3.5
4 5.5
5 9
6 4
7 6
8 4
9 4.5
10 4
mean(df_sim1$avg)
[1] 4.6
sim2: sample size = 5, repeat = 100
N <- 5          # Draw 5 cards
trials <- 100   # Repeat it 100 times
sim2 <- rep(NA, length.out = trials)
for (i in 1:trials) {                  # What R should do on each repetition
  experiment <- sample(bag,
                       size = N,
                       replace = TRUE) # Return the card R drew before drawing another card
  sim2[i] <- mean(experiment)          # Save the mean for i
}
df_sim2 <- tibble(avg = sim2)
h_sim2 <- ggplot(df_sim2, aes(x = avg)) +
  geom_histogram(binwidth = 1,
                 boundary = 0.5,
                 color = "black") +
  labs(x = "The Average of the 5 cards",
       y = "Frequency") +
  ggtitle("S2[N:5, repeat:100]") +
  scale_x_continuous(breaks = 0:9) +
  geom_vline(xintercept = mean(df_sim2$avg), # Draw a vertical line for the average of sim2
             col = "lightgreen")
plot(h_sim2)
df_sim2
# A tibble: 100 x 1
avg
<dbl>
1 4.2
2 4.2
3 5.4
4 4.2
5 2
6 5
7 3.6
8 4
9 2.6
10 6
# … with 90 more rows
mean(df_sim2$avg)
[1] 4.308
sim3: Sample size = 100, repeat = 1000
N <- 100        # Draw 100 cards
trials <- 1000  # Repeat it 1000 times
sim3 <- rep(NA, length.out = trials)
for (i in 1:trials) {                  # What R should do on each repetition
  experiment <- sample(bag, size = N, replace = TRUE) # Return the card R drew before drawing another card
  sim3[i] <- mean(experiment)          # Save the mean for i
}
df_sim3 <- tibble(avg = sim3)
h_sim3 <- ggplot(df_sim3, aes(x = avg)) +
  geom_histogram(binwidth = 0.125,
                 color = "black") +
  labs(x = "The Average of the 100 cards",
       y = "Frequency") +
  ggtitle("S3[N:100, repeat:1000]") +
  scale_x_continuous(breaks = 0:9) +
  geom_vline(xintercept = mean(df_sim3$avg), # Draw a vertical line for the average of sim3
             col = "lightgreen")
plot(h_sim3)
df_sim3
# A tibble: 1,000 x 1
avg
<dbl>
1 4.97
2 4.62
3 4.87
4 4.08
5 4.19
6 4.76
7 4.52
8 4.35
9 4.47
10 4.65
# … with 990 more rows
mean(df_sim3$avg)
[1] 4.50132
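The three simulations repeat the same steps with different values of N and trials. One possible way to avoid the repetition is a small helper built on `replicate()`. The function name `simulate_means` and the seed are hypothetical, and because the draws are random the numbers will not match the output shown above.

```r
# Hypothetical helper: draw `N` cards `trials` times and return the sample means
simulate_means <- function(N, trials, bag = 0:9) {
  replicate(trials, mean(sample(bag, size = N, replace = TRUE)))
}

set.seed(2024)  # arbitrary seed, for reproducibility only
means_2   <- simulate_means(N = 2,   trials = 10)
means_5   <- simulate_means(N = 5,   trials = 100)
means_100 <- simulate_means(N = 100, trials = 1000)

# Average of the sample means in each simulation
sapply(list(sim1 = means_2, sim2 = means_5, sim3 = means_100), mean)
```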
library(patchwork)
h_sim1 + h_sim2 + h_sim3
When the sample size is large enough (N > 100), we can conduct statistical inference and hypothesis tests using a normal distribution.
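A rough numerical check of the normal approximation is sketched below. It compares the simulated spread of the sample means with the standard error (population standard deviation divided by the square root of N) and counts how many sample means fall within 1.96 standard errors of 4.5. The 5,000 repetitions and the seed are arbitrary assumptions.

```r
set.seed(42)  # arbitrary seed, for reproducibility only
bag <- 0:9
pop_sd <- sqrt(mean((bag - mean(bag))^2))  # population SD of the cards, about 2.87

N <- 100
means <- replicate(5000, mean(sample(bag, size = N, replace = TRUE)))

sd(means)         # simulated spread of the sample means
pop_sd / sqrt(N)  # theoretical standard error, about 0.287

# Share of sample means within 1.96 standard errors of 4.5;
# roughly 0.95 if the normal approximation holds
coverage <- mean(abs(means - 4.5) <= 1.96 * pop_sd / sqrt(N))
coverage
```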