• Load the packages
library(tidyverse)
library(patchwork)

Inferential statistics

  • Inferential statistical analysis infers properties of a population, for example by testing hypotheses and deriving estimates.
  • It is assumed that the observed data set is sampled from a larger population.

1. Estimation and statistical hypothesis testing

  • Inferential statistics consist of estimation and statistical hypothesis testing.
  • Estimation is the process of finding an estimate, or approximation, which is a value that is usable for some purpose even if input data may be incomplete.
  • A statistical hypothesis test is a method of statistical inference.
  • A statistical hypothesis is a hypothesis that is testable on the basis of observed data modelled as the realised values taken by a collection of random variables.
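As an illustrative sketch of a hypothesis test in R (the numbers below are made up for the example), base R's t.test() tests whether a sample could plausibly come from a population with a hypothesised mean:

```r
set.seed(123)                      # Arbitrary seed, for reproducibility
x <- rnorm(30, mean = 10, sd = 5)  # A sample from a population whose mean really is 10

# H0: the population mean is 10; H1: it is not
res <- t.test(x, mu = 10)
res$p.value  # A large p-value: the data give no evidence against H0
```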

2. Statistical population and Sample

2.1 The number of samples & Sample size

  • In statistics, a population is a set of similar items or events which is of interest for some question or experiment.

  • In social science, a population is usually what we are interested in and what we want to know about, such as the average height of Japanese people or the average income of Americans; it is usually not directly accessible because measuring everyone would be enormously costly.

  • In statistics, a sample is a set of individuals or objects collected or selected from a statistical population by a defined procedure.

  • Sample size (also known as N in statistics) is the number of observations in a sample.
  • The number of samples is the number of sample sets taken from the population.

An example:

  • Dennis went to Fukushima and took two surveys.
  • Survey A asked a set of questions to 1,000 people, and survey B asked a different set of questions to 2,000 people.
  • In this case, the number of samples is 2.
  • The sample size for survey A is 1,000 and the sample size for Survey B is 2,000.

Parameter and Statistic:
A parameter is a number describing a whole population (e.g., population mean),
while a statistic is a number describing a sample (e.g., sample mean).

Parameter

  • \(\mu\): population mean
  • \(\pi\): population ratio
  • \(\sigma^2\): population variance
  • \(\sigma\): population standard deviation

Statistic

  • \(\bar{x}\): sample mean
  • \(p\): sample ratio
  • \(s^2\): sample variance
  • \(s\): sample standard deviation
  • \(u_x^2\): unbiased variance
  • \(u_x\): unbiased standard deviation
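A small sketch of these statistics in R, using a made-up sample. Note that base R's var() and sd() return the unbiased versions \(u_x^2\) and \(u_x\) (dividing by \(n - 1\)), not the sample variance \(s^2\):

```r
x <- c(2, 4, 6, 8, 10)  # A small illustrative sample
n <- length(x)

mean(x)               # Sample mean, x-bar: 6
var(x)                # Unbiased variance u_x^2 (divides by n - 1): 10
var(x) * (n - 1) / n  # Sample variance s^2 (divides by n): 8
sd(x)                 # Unbiased standard deviation u_x: sqrt(10)
```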

2.2 Random Sampling

  • A simple random sample is a randomly selected subset of a population.

  • In this sampling method, each member of the population has an exactly equal chance of being selected.

  • This method is the most straightforward of all the probability sampling methods, since it only involves a single random selection and requires little advance knowledge about the population.

  • Because it uses randomization, any research performed on this sample should have high internal and external validity.

  • Internal validity is the extent to which you can be confident that a cause-and-effect relationship established in a study cannot be explained by other factors.

  • External validity is the extent to which you can generalize the findings of a study to other situations, people, settings and measures.

Population mean \((\mu)\) and sample mean \((\bar{x})\)

  • Usually, parameters are unknown, but here let’s assume that we KNOW a parameter: we artificially generate a population in R using the dnorm() function.
  • We artificially generate a population (mean = 10, variance = 25, standard deviation = 5) and plot it over the range -10 to 30.
  • The population looks like this:
curve(dnorm(x, 10, 5), from = -10, to = 30) 

  • Using the rnorm() function, let’s randomly draw a sample from this artificial population.
  • The sample size is N = 20; name the sample x1.
x1 <- rnorm(20, mean = 10, sd =5) 
x1
 [1]  7.7298953 17.0493278  7.2736714  5.9837875  9.7302424  5.0476241
 [7]  9.3211174 13.7525222 11.7170875  7.2907782 10.5752665  8.3700592
[13] 15.5648316 10.8888663 -0.4720088 -1.2653309 11.6189093  6.6641503
[19] 20.8662006  5.5781984
  • Round to the nearest integer and draw a histogram
round(x1, digits = 0) 
 [1]  8 17  7  6 10  5  9 14 12  7 11  8 16 11  0 -1 12  7 21  6
hist(x1)      

  • We can calculate the sample mean of the sample \(x\) as follows:

\[\bar{x} = \frac{\sum_{i=1}^n x_i}{n}\]

  • Using the mean() function, we can calculate the sample mean

mean(x1)            
[1] 9.16426
  • Note that the population mean of \(x\) is 10 (because we initially set it so)

2.3 Estimator and Estimate

\[\bar{x} = \frac{\sum_{i=1}^n x_i}{n}\]

  • In statistics, an estimator is a rule for calculating an estimate of a given quantity based on observed data.
    → Here, the equation above is the estimator.

  • An estimate is the result of applying an estimator to observed data.
    → Here, the calculated value is the estimate.
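The distinction can be made concrete in R: an estimator is a rule (a function), while an estimate is the number that the rule returns for a particular sample. The sample values below are made up for illustration:

```r
# The estimator: a rule for computing the sample mean
x_bar <- function(x) sum(x) / length(x)

sample_data <- c(8, 12, 9, 11)  # An observed sample (hypothetical)
x_bar(sample_data)              # The estimate: 10
```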

Randomly draw a sample (sample size = 20) from the artificially generated population

  • The population mean = 10
  • The standard deviation = 5
x2 <- rnorm(20, mean = 10, sd =5) 
x2
 [1]  7.590461 10.008695 18.989587 10.852653  5.691300  5.600953 14.905523
 [8] 13.230365 12.339585 17.830952  2.628222 14.871700 10.861399  8.014057
[15]  7.113683  4.100466 13.501053 22.508313  6.805313  2.252608
round(x2, digits = 0)      
 [1]  8 10 19 11  6  6 15 13 12 18  3 15 11  8  7  4 14 23  7  2
hist(x2)                  

mean(x2)                  
[1] 10.48484
  • Since R samples randomly from the population, there is no bias, but there is sampling error.

  • The sample mean does not always equal the population mean of 10 → this is sampling error

  • However, if we draw many samples, the average of the sample means converges to the population mean.

  • This is the theoretical foundation for statistically estimating the population mean from sample statistics.
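A quick check of this claim, assuming the same artificial population (mean = 10, sd = 5): if we draw many samples of size 20 and average their sample means, the result is very close to the population mean.

```r
set.seed(2023)  # Arbitrary seed, for reproducibility
means <- replicate(10000, mean(rnorm(20, mean = 10, sd = 5)))
mean(means)     # Very close to the population mean of 10
```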

  • We can get sampling statistics with the following command:

min(x2)  
[1] 2.252608
median(x2) 
[1] 10.43067
max(x2)    
[1] 22.50831
quantile(x2, c(0, .25, .5, .75, 1))
       0%       25%       50%       75%      100% 
 2.252608  6.526809 10.430674 13.843715 22.508313 

3. Central Limit Theorem

  • The central limit theorem (CLT) establishes that, in many situations, when independent random variables are summed up, their properly normalized sum tends toward a normal distribution (= a bell curve) even if the original variables themselves are not normally distributed.

  • The central limit theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions.

Normal Distribution

  • A normal distribution is a type of continuous probability distribution for a real-valued random variable.
  • Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known.
  • Their importance is partly due to the central limit theorem.
  • The central limit theorem states that, under some conditions, the average of many samples (observations) of a random variable with finite mean and variance is itself a random variable, whose distribution converges to a normal distribution as the number of samples increases.
  • Thus, physical quantities that are expected to be the sum of many independent processes, such as measurement errors, often have distributions that are nearly normal.
  • The aim of this section is to “feel” what the central limit theorem means by a simple simulation on R.

3.2 A simulation of the Central Limit Theorem in R

As the number of samples increases, does the distribution of a random variable converge to a normal distribution?

A simulation: drawing cards from a bag of 10

  • Make 10 cards with the numbers 0 to 9 printed on them.

  • Put them in a bag.

  • Let’s say this bag is the population we are interested in (= what we want to know about).

  • This means that we artificially generate a population, which is not the case in daily life.

  • Shuffle the bag, and draw a card from the bag.

  • Record the number you drew, and return the card to the bag.

  • This is called random sampling.

  • For instance, the probability that you draw the card numbered 9 is 1/10.

  • You draw another card, record it, and return the card to the bag.

  • You repeat this n times.

  • The 10 numbers are a discrete uniform distribution: 0, 1, …, 9.

  • The population mean is 4.5 (= (0 + 1 + 2 + … + 9)/10 = 4.5)

  • Is the sample mean also 4.5?

  • The number of possible outcomes for the 1st draw: 10
  • The number of possible outcomes for the 2nd draw: 10
  • The number of possible ordered outcomes for the 2 draws: 100
  • The number of possible totals for the 2 draws: 19 (0 to 18)
  • If you draw two 0’s, the total of the 2 draws is 0 (the minimum).
  • If you draw two 9’s, the total of the 2 draws is 18 (the maximum).
  • We can show the probability distribution of the mean of the 2 draws in a figure.
  • For instance, the probability that the sample mean of the 2 draws equals 4 (e.g., drawing 0 first and 8 second) is 9/100 = 9%, since 9 of the 100 ordered pairs sum to 8.
  • Let R randomly draw a card from the bag.
bag <- 0:9                    # Generate an artificial population
exp_1 <- sample(bag,          # Use the bag containing 10 cards   
              size = 2,       # Draw 2 cards (sample size = 2)
              replace = TRUE) # Return the card you drew before drawing another one
mean(exp_1)                   # Calculate the sample mean 
[1] 5
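The probability distribution mentioned above can also be computed exactly, without simulation: with 100 equally likely ordered pairs, the probability of each possible two-card mean is simply a count divided by 100 (a sketch, not part of the lecture code):

```r
bag <- 0:9
pairs <- expand.grid(first = bag, second = bag)  # All 100 ordered pairs
avg <- (pairs$first + pairs$second) / 2          # Mean of each pair
table(avg) / 100                                 # Exact probability of each mean
# For example, P(mean = 4) = 9/100, since 9 ordered pairs sum to 8
```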
  • Every time you draw, you (probably) get a different result
  • You are most likely to get a different sample mean
  • Let R draw 1,000 cards.
  • Check the distribution of the sample mean
  • We use a for loop in this simulation

How to use for loop:

  • For example, you want R to repeat the following task:
  • Start with 0 and add 10 to it 5 times
  • Define the variable A and assign the value 0 to A
A <- 0 
  • Prepare the variable (result) so that you can save the result in it
  • result contains nothing (NA)
result <- rep(NA,              # NA means `result` contains nothing 
              length.out = 5)  
  • Check the result
result
[1] NA NA NA NA NA
  • We only made a variable that contains nothing.
  • Using a for loop, let R repeat the task of adding 10 to A 5 times
  • The loop counter and its range go in the parentheses after for
  • What R should do is stated within \(\{ \}\)
for(i in 1:5){   # i repeats from 1 to 5  
  A <- A + 10    # Add 10 to A 
  result[i] <- A # Put the result of adding into `result[i]`  
}
  • Check the result
result
[1] 10 20 30 40 50
  • We see what R did

A simulation of Central Limit theorem using for loop
sim1: sample size = 2, repeat the task 10 times

  • Let R draw 2 cards 10 times
N <- 2                                # Draw 2 cards
trials <- 10                          # Repeat it 10 times
sim1 <- rep(NA, length.out = trials)  
for (i in 1:trials) {                 # Assign what R should do
  experiment <- sample(bag, 
                       size = N, 
                       replace = TRUE) # Return the card R drew before drawing another card  
  sim1[i] <- mean(experiment)          # Save the mean for i
}
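The same simulation can also be written more compactly with replicate(), which is idiomatic R for repeating an expression; this is an alternative sketch, not the lecture’s code:

```r
bag <- 0:9
N <- 2        # Cards per draw (sample size)
trials <- 10  # Number of repetitions

sim1_alt <- replicate(trials, mean(sample(bag, size = N, replace = TRUE)))
sim1_alt  # 10 sample means, one per repetition
```

The for loop version makes each step explicit, which is useful for learning; replicate() does the same preallocation and looping internally.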
  • Draw a histogram of the simulation
df_sim1 <- tibble(avg = sim1)
h_sim1 <- ggplot(df_sim1, aes(x = avg)) +
  geom_histogram(binwidth = 1, 
                 boundary = 0.5,
                 color = "black") +
  labs(x = "The Average of the 2 cards", 
       y = "Frequency") +
  ggtitle("S1[N:2, repeat:10]") +
  scale_x_continuous(breaks = 0:9) +
  geom_vline(xintercept = mean(df_sim1$avg),  # Draw a vertical line for the average
             col = "lightgreen")

plot(h_sim1)

  • It does not look like a normal distribution
df_sim1
# A tibble: 10 x 1
     avg
   <dbl>
 1   4.5
 2   1  
 3   3.5
 4   5.5
 5   9  
 6   4  
 7   6  
 8   4  
 9   4.5
10   4  
  • If you want to know the sample mean, you type
mean(df_sim1$avg)
[1] 4.6

sim2: sample size = 5, repeat = 100

  • Let R draw 5 cards 100 times
N <- 5                                 # Draw 5 cards
trials <- 100                          # Repeat it 100 times
sim2 <- rep(NA, length.out = trials)   
for (i in 1:trials) {                  # Assign what R should do
  experiment <- sample(bag, 
                       size = N, 
                       replace = TRUE)  # Return the card R drew before drawing another card  
  sim2[i] <- mean(experiment)           # Save the mean for i
}
  • Draw a histogram of the simulation
df_sim2 <- tibble(avg = sim2)
h_sim2 <- ggplot(df_sim2, aes(x = avg)) +
  geom_histogram(binwidth = 1, 
                 boundary = 0.5,
                 color = "black") +
  labs(x = "The Average of the 5 cards", 
       y = "Frequency") +
  ggtitle("S2[N:5, repeat:100]") +
  scale_x_continuous(breaks = 0:9) +
  geom_vline(xintercept = mean(df_sim2$avg),  # Draw a vertical line for the average
             col = "lightgreen")

plot(h_sim2)

  • It kind of looks like a normal distribution compared to the previous one
df_sim2
# A tibble: 100 x 1
     avg
   <dbl>
 1   4.2
 2   4.2
 3   5.4
 4   4.2
 5   2  
 6   5  
 7   3.6
 8   4  
 9   2.6
10   6  
# … with 90 more rows
  • If you want to know the sample mean, you type
mean(df_sim2$avg)
[1] 4.308

sim3: Sample size = 100, repeat = 1000

  • Let R draw 100 cards 1000 times
N <- 100                             # Draw 100 cards
trials <- 1000                       # Repeat it 1000 times
sim3 <- rep(NA, length.out = trials) 
for (i in 1:trials) {                # Assign what R should do
  experiment <- sample(bag, size = N, replace = TRUE)  # Return the card R drew before drawing another card  
  sim3[i] <- mean(experiment)        # Save the mean for i
}
  • Draw a histogram of the simulation
df_sim3 <- tibble(avg = sim3)
h_sim3 <- ggplot(df_sim3, aes(x = avg)) +
  geom_histogram(binwidth = 0.125, 
                 color = "black") +
  labs(x = "The Average of the 100 cards", 
       y = "Frequency") +
  ggtitle("S3[N:100, repeat:1000]") +
  scale_x_continuous(breaks = 0:9) +
  geom_vline(xintercept = mean(df_sim3$avg),  # Draw a vertical line for the average
             col = "lightgreen")

plot(h_sim3)

df_sim3
# A tibble: 1,000 x 1
     avg
   <dbl>
 1  4.97
 2  4.62
 3  4.87
 4  4.08
 5  4.19
 6  4.76
 7  4.52
 8  4.35
 9  4.47
10  4.65
# … with 990 more rows
  • If you want to know the sample mean, you type
mean(df_sim3$avg)
[1] 4.50132
  • Compared to the previous two simulations, it looks more like a normal distribution
library(patchwork)
h_sim1 + h_sim2 + h_sim3

Summary

  • Even if the original variables themselves are not normally distributed, the distribution of the sample means converges to a normal distribution as the number of samples increases.
→ If we have a sizable sample size (say, N > 100), we can conduct statistical inference and hypothesis tests using a normal distribution.
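A final numeric check (an additional sketch, not from the lecture): for this bag the population standard deviation is \(\sqrt{(10^2 - 1)/12} \approx 2.87\), and the spread of the sample means shrinks like \(\sigma/\sqrt{N}\) as the sample size grows; this is the “proper normalization” referred to in the statement of the CLT.

```r
bag <- 0:9
sigma <- sqrt(mean((bag - mean(bag))^2))  # Population sd: sqrt(8.25), about 2.87

set.seed(42)  # Arbitrary seed, for reproducibility
for (N in c(2, 5, 100)) {
  means <- replicate(5000, mean(sample(bag, size = N, replace = TRUE)))
  cat("N =", N, " sd of sample means =", round(sd(means), 3),
      " sigma/sqrt(N) =", round(sigma / sqrt(N), 3), "\n")
}
```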