library(tidyverse)
library(patchwork)
Estimation and Statistical Hypothesis Testing
This chapter covers estimation and statistical hypothesis testing. Estimation is the process of finding an estimate, or approximation: a value that is usable for some purpose even if the input data are incomplete. A statistical hypothesis test is a method of statistical inference.
The Number of Samples & Sample Size
In statistics, a population
is a set of similar items or events which is of interest for some question or experiment.
In social science, a population is usually what we are interested in and want to know about, such as the average height of Japanese people or the average income of Americans. It is usually not directly accessible because of the huge cost of measuring everyone.
In statistics, a sample
is a set of individuals or objects collected or selected from a statistical population by a defined procedure.
Sample size (also known as N in statistics) is the number of observations in a sample. The number of samples is the number of sample sets taken from the population. An example: if we conduct two surveys, the number of samples is 2; the sample size for Survey A is 1,000 and the sample size for Survey B is 2,000.
Parameter and Statistic:
A parameter is a number describing a whole population (e.g., population mean), while a statistic is a number describing a sample (e.g., sample mean).
A parameter
is a number describing a whole population (e.g., population mean),
while a statistic
is a number describing a sample (e.g., sample mean).
| Parameter                     | Statistic                                              |
|-------------------------------|--------------------------------------------------------|
| population mean               | sample mean                                            |
| population ratio              | sample ratio                                           |
| population variance           | sample variance / unbiased variance                    |
| population standard deviation | sample standard deviation / unbiased standard deviation |
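A quick way to see the difference between the two variance statistics: R's built-in var() and sd() already divide by n − 1 (the unbiased versions), so the n-divisor sample variance must be computed by hand. A minimal sketch (the variable names are illustrative):

```r
x <- c(2, 4, 6, 8)                        # a small sample
n <- length(x)

unbiased_var <- var(x)                    # var() divides by (n - 1)
sample_var   <- sum((x - mean(x))^2) / n  # dividing by n instead

unbiased_var              # 6.666667
sample_var                # 5
sample_var * n / (n - 1)  # equals var(x)
```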
Random Sampling
A simple random sample
is a randomly selected subset of a population.
In this sampling method, each member of the population has an exactly equal chance of being selected.
This method is the most straightforward of all the probability sampling methods, since it only involves a single random selection and requires little advance knowledge about the population.
Because it uses randomization, any research performed on this sample should have high internal and external validity.
Internal validity
is the extent to which you can be confident that a cause-and-effect relationship established in a study cannot be explained by other factors.
External validity
is the extent to which you can generalize the findings of a study to other situations, people, settings and measures.
Using the dnorm() function, we can draw the density curve of a normal distribution with mean 10 and standard deviation 5, which we treat as an artificial population:
curve(dnorm(x, 10, 5), from = -10, to = 30)
Using the rnorm() function, let's randomly draw a sample from this artificial population.
x1 <- rnorm(20, mean = 10, sd = 5)
x1
[1] 7.7298953 17.0493278 7.2736714 5.9837875 9.7302424 5.0476241
[7] 9.3211174 13.7525222 11.7170875 7.2907782 10.5752665 8.3700592
[13] 15.5648316 10.8888663 -0.4720088 -1.2653309 11.6189093 6.6641503
[19] 20.8662006 5.5781984
round(x1, digits = 0)
[1] 8 17 7 6 10 5 9 14 12 7 11 8 16 11 0 -1 12 7 21 6
hist(x1)
\[\bar{x} = \frac{\sum_{i=1}^n x_i}{n}\]
Using the mean() function, we can calculate the sample mean:
mean(x1)
[1] 9.16426
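mean() implements exactly the formula above; we can confirm this by computing the sum divided by n ourselves (any numeric vector works):

```r
x <- c(8, 17, 7, 6, 10)  # an illustrative sample
sum(x) / length(x)       # 9.6, the formula by hand
mean(x)                  # 9.6, identical
```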
Estimator and Estimate
\[\bar{x} = \frac{\sum_{i=1}^n x_i}{n}\]
In statistics, an estimator
is a rule for calculating an estimate of a given quantity based on observed data.
→ Here, the equation above is the estimator.
An estimate is the result of applying an estimator to observed data.
→ Here, the calculated figure is the estimate.
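The estimator/estimate distinction maps naturally onto functions in R: the function is the rule (the estimator), and the number it returns for a particular data set is the estimate. A sketch (the function and variable names are made up for illustration):

```r
estimator_mean <- function(x) sum(x) / length(x)  # the rule: an estimator

observed <- c(7, 10, 13)              # observed data
estimate <- estimator_mean(observed)  # applying the rule yields an estimate
estimate                              # 10
```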
x2 <- rnorm(20, mean = 10, sd = 5)
x2
[1] 7.590461 10.008695 18.989587 10.852653 5.691300 5.600953 14.905523
[8] 13.230365 12.339585 17.830952 2.628222 14.871700 10.861399 8.014057
[15] 7.113683 4.100466 13.501053 22.508313 6.805313 2.252608
round(x2, digits = 0)
[1] 8 10 19 11 6 6 15 13 12 18 3 15 11 8 7 4 14 23 7 2
hist(x2)
mean(x2)
[1] 10.48484
Since R samples randomly from the population, there is no bias, but there is sampling error.
We do not always get the population mean of 10 → this is sampling error.
However, if we draw samples many times, the average of the sample means converges to the population mean.
This is the theoretical foundation that allows us to statistically estimate the population mean using a sample statistic.
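This claim can be checked by simulation: draw many samples of size 20 from the same N(10, 5) population and average the sample means. A sketch (the seed and the number of repetitions are arbitrary choices):

```r
set.seed(1)
many_means <- replicate(10000, mean(rnorm(20, mean = 10, sd = 5)))
mean(many_means)  # very close to the population mean of 10
```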
We can get sampling statistics with the following command:
min(x2)
[1] 2.252608
median(x2)
[1] 10.43067
max(x2)
[1] 22.50831
quantile(x2, c(0, .25, .5, .75, 1))
0% 25% 50% 75% 100%
2.252608 6.526809 10.430674 13.843715 22.508313
Normal Distribution
As the number of samples increases, does the distribution of a random variable (here, the sample mean) converge to a normal distribution?
A simulation: drawing cards from a bag of 10 cards
Make 10 cards with the numbers 0 to 9 printed on them.
Put them in a bag.
Let’s say, this bag is a population, which we are interested in (= we want to know about the population).
This means that we artificially generate a population, which is not the case in daily life.
Shuffle the bag, and draw a card from the bag.
Record the number you drew, and return the card to the bag.
This is called random sampling.
For instance, the probability that you draw a card number 9 is 1/10.
You draw another card, record it, and return the card to the bag.
You repeat this n times.
The 10 numbers are a discrete uniform distribution: 0, 1, …, 9.
The population mean is 4.5 (= (0 + 1 + 2 + … + 9) / 10 = 4.5).
Is the sample mean also 4.5?
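The population mean and variance of the bag can be computed directly, since we generated this population ourselves:

```r
bag <- 0:9                     # the artificial population
mean(bag)                      # population mean: 4.5
sum((bag - mean(bag))^2) / 10  # population variance: 8.25
```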
bag <- 0:9                       # Generate an artificial population
exp_1 <- sample(bag,             # Use the bag containing 10 cards
                size = 2,        # Draw 2 cards (sample size = 2)
                replace = TRUE)  # Return the card you drew before drawing another one
mean(exp_1)                      # Calculate the sample mean
[1] 5
How to use a for loop in this simulation:
First, make an object A and assign the value 0 to it:
A <- 0
Next, make a container (result) so that you can save the results in it. At this point, result contains nothing (NA):
result <- rep(NA,             # NA means `result` contains nothing
              length.out = 5)
result
[1] NA NA NA NA NA
Using a for loop, let R repeat the task of adding 10 to A five times:
for (i in 1:5) {    # i repeats from 1 to 5
  A <- A + 10       # Add 10 to A
  result[i] <- A    # Put the result of the addition into `result[i]`
}
result
[1] 10 20 30 40 50
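Because each step adds a constant 10, the same five results can also be produced without an explicit loop, for example with cumsum(). This is just an alternative sketch, not part of the original material:

```r
result <- cumsum(rep(10, 5))  # cumulative sums: 10 20 30 40 50
result
```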
A simulation of the Central Limit Theorem using a for loop
sim1: sample size = 2, repeat the task 10 times
N <- 2        # Draw 2 cards
trials <- 10  # Repeat it 10 times
sim1 <- rep(NA, length.out = trials)
for (i in 1:trials) {                   # Assign what R should do
  experiment <- sample(bag,
                       size = N,
                       replace = TRUE)  # Return the card R drew before drawing another card
  sim1[i] <- mean(experiment)           # Save the mean for i
}
df_sim1 <- tibble(avg = sim1)
h_sim1 <- ggplot(df_sim1, aes(x = avg)) +
  geom_histogram(binwidth = 1,
                 boundary = 0.5,
                 color = "black") +
  labs(x = "The Average of the 2 cards",
       y = "Frequency") +
  ggtitle("S1[N:2, repeat:10]") +
  scale_x_continuous(breaks = 0:9) +
  geom_vline(xintercept = mean(df_sim1$avg),  # Draw a vertical line for the average
             col = "lightgreen")
plot(h_sim1)
df_sim1
# A tibble: 10 x 1
avg
<dbl>
1 4.5
2 1
3 3.5
4 5.5
5 9
6 4
7 6
8 4
9 4.5
10 4
mean(df_sim1$avg)
[1] 4.6
sim2: sample size = 5, repeat = 100
N <- 5         # Draw 5 cards
trials <- 100  # Repeat it 100 times
sim2 <- rep(NA, length.out = trials)
for (i in 1:trials) {                   # Assign what R should do
  experiment <- sample(bag,
                       size = N,
                       replace = TRUE)  # Return the card R drew before drawing another card
  sim2[i] <- mean(experiment)           # Save the mean for i
}
df_sim2 <- tibble(avg = sim2)
h_sim2 <- ggplot(df_sim2, aes(x = avg)) +
  geom_histogram(binwidth = 1,
                 boundary = 0.5,
                 color = "black") +
  labs(x = "The Average of the 5 cards",
       y = "Frequency") +
  ggtitle("S2[N:5, repeat:100]") +
  scale_x_continuous(breaks = 0:9) +
  geom_vline(xintercept = mean(df_sim2$avg),  # Draw a vertical line for the average
             col = "lightgreen")
plot(h_sim2)
df_sim2
# A tibble: 100 x 1
avg
<dbl>
1 4.2
2 4.2
3 5.4
4 4.2
5 2
6 5
7 3.6
8 4
9 2.6
10 6
# … with 90 more rows
mean(df_sim2$avg)
[1] 4.308
sim3: Sample size = 100, repeat = 1000
N <- 100        # Draw 100 cards
trials <- 1000  # Repeat it 1000 times
sim3 <- rep(NA, length.out = trials)
for (i in 1:trials) {                                  # Assign what R should do
  experiment <- sample(bag, size = N, replace = TRUE)  # Return the card R drew before drawing another card
  sim3[i] <- mean(experiment)                          # Save the mean for i
}
df_sim3 <- tibble(avg = sim3)
h_sim3 <- ggplot(df_sim3, aes(x = avg)) +
  geom_histogram(binwidth = 0.125,
                 color = "black") +
  labs(x = "The Average of the 100 cards",
       y = "Frequency") +
  ggtitle("S3[N:100, repeat:1000]") +
  scale_x_continuous(breaks = 0:9) +
  geom_vline(xintercept = mean(df_sim3$avg),  # Draw a vertical line for the average
             col = "lightgreen")
plot(h_sim3)
df_sim3
# A tibble: 1,000 x 1
avg
<dbl>
1 4.97
2 4.62
3 4.87
4 4.08
5 4.19
6 4.76
7 4.52
8 4.35
9 4.47
10 4.65
# … with 990 more rows
mean(df_sim3$avg)
[1] 4.50132
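The Central Limit Theorem also predicts the spread of these sample means: their standard deviation should be close to the population standard deviation divided by the square root of the sample size, i.e. sqrt(8.25) / sqrt(100) ≈ 0.287 for sim3. A quick check of this prediction (the seed is arbitrary):

```r
set.seed(42)
bag <- 0:9
sim <- replicate(1000, mean(sample(bag, size = 100, replace = TRUE)))
mean(sim)  # close to the population mean 4.5
sd(sim)    # close to sqrt(8.25) / sqrt(100), about 0.287
```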
library(patchwork)
h_sim1 + h_sim2 + h_sim3
As the sample size increases, the distribution of the sample mean approaches a normal distribution. When the sample size is large enough (e.g., N > 100), we can conduct statistical inference and hypothesis tests using a normal distribution.