• R packages we use in this section
library(corrr)
library(DT)
library(psych)
library(qgraph)
library(tidyverse)

1. Relationship between Categorical Variables

1.1 Cross tables

  • A cross table is a two-way table consisting of columns and rows

  • Its greatest strength is its ability to structure, summarize and display large amounts of data.

  • Cross tables can also be used to determine whether there is a relationship between the row variable and the column variable.

  • Hypothetical data on the cabinet support rate in Japan

Q: Does the cabinet support rate differ between men and women?

1.2 Why Chi-squared test?

When there is no difference between men and women

  • Observed values

  • Half of the men (25) and half of the women (25) support the cabinet
  • Gender and cabinet support are not related to each other
  • Gender and cabinet support are statistically independent

When there is a difference between men and women

  • 30 men out of 50 support the cabinet
  • 20 women out of 50 support the cabinet
  • Gender and cabinet support may be related to each other
  • We need to confirm this by conducting a Chi-squared test.

Null Hypothesis: In the population, there is no difference in cabinet support between men and women

Alternative Hypothesis: In the population, there is a difference in cabinet support between men and women

  • When the null hypothesis is rejected
    → Accept the alternative hypothesis
    → Statistically significant
    → We can conclude that there is a difference in cabinet support between men and women in the population

  • When we fail to reject the null hypothesis
    → We cannot draw a conclusion either way
    → Not statistically significant
    → We cannot conclude that there is a difference in cabinet support between men and women in the population

1.3 Chi-squared test

  • A chi-squared test is a statistical hypothesis test that is valid when the test statistic follows a chi-squared distribution under the null hypothesis. The term usually refers to Pearson’s chi-squared test.
  • Pearson’s chi-squared test is used to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table.

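How to calculate Chi-squared values

Step 1:

  • Calculate the expected values from the observed values

Step 2:

  • For each cell, calculate \((O - E)^2 / E\), where \(O\) is the observed value and \(E\) is the expected value

Step 3:

Add them all up

  • Chi-squared value = 1+1+1+1 = 4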
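
The three steps can be sketched in R as follows. This is only a minimal sketch: it assumes the observed table is the 30/20 example described above, typed in by hand, and the object names (observed, expected, cell_values, chi_squared) are illustrative.

```r
# Observed counts from the example (rows: gender, columns: support)
observed <- matrix(c(30, 20,
                     20, 30),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(gender  = c("female", "male"),
                                   support = c("not_support", "support")))

# Step 1: expected count = row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Step 2: contribution of each cell, (O - E)^2 / E
cell_values <- (observed - expected)^2 / expected

# Step 3: add them all up
chi_squared <- sum(cell_values)
chi_squared   # 4
```

Every expected count is 25 here, so each cell contributes (5^2)/25 = 1 to the total.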

1.3 Chi-squared test (by hand)

  • If the null hypothesis (that there is no difference between men and women in the population) is true, the test statistic computed from the observations follows a \(χ^2\) distribution.
  • The Chi-square distribution table looks like this:

  • Degrees of freedom (df): the number of values in the final calculation of a statistic that are free to vary
  • We use two variables: Cabinet support and Gender
  • We calculate the degrees of freedom from these two variables
  • Cabinet support is free to vary in 2 ways: “not support”, “support”
  • Gender is free to vary in 2 ways: “female”, “male”
    → In this case, the degrees of freedom are calculated as follows:
    degrees of freedom = (2-1)*(2-1) = 1
    → We use the Chi-square distribution with df = 1
  • If you want to test a hypothesis at the 5% significance level (the conventional standard in statistical hypothesis testing), refer to the column \(χ^2_{.050}\)
  • The value (3.841) is the cutting point (critical value) we use here.
  • The purpose of the test is to evaluate how likely the observed frequencies would be if the null hypothesis were true.
  • Test statistics that follow a \(χ^2\) distribution occur when the observations are independent.
  • The shape of the Chi-square distribution differs depending on the degrees of freedom (df).

  • The cutting point = 3.84
  • The \(χ^2\) calculated from our sample = 4

  • The \(χ^2\) calculated from our sample (4) is larger than the cutting point (3.84)
    → We reject the null hypothesis

Null Hypothesis: In the population, there is no difference in cabinet support between men and women

  • We conclude that men support the cabinet more than women in the population
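
The values read off the distribution table can also be obtained in R. The sketch below uses the base functions qchisq() and pchisq(); the object names critical_value and p_value are illustrative.

```r
# Critical value for a 5% significance level with df = 1
critical_value <- qchisq(0.95, df = 1)
round(critical_value, 3)   # 3.841

# The critical value changes with the degrees of freedom
round(qchisq(0.95, df = 1:5), 3)

# p-value for the chi-squared value obtained by hand (4, with df = 1)
p_value <- pchisq(4, df = 1, lower.tail = FALSE)
round(p_value, 4)          # 0.0455

# Decision rule: reject the null hypothesis if the statistic exceeds the critical value
4 > critical_value         # TRUE
```

Comparing the statistic with the critical value and comparing the p-value with 0.05 lead to the same decision.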

1.4 Chi-squared test (on R)

  • Download chi2_data2E.csv
  • Put the file (chi2_data2E.csv) into the data folder in your R Project folder
  • Load the data and name it cab2
cab2 <- read_csv("data/chi2_data2E.csv") 
  • Check cab2
DT::datatable(cab2)
  • Make a cross table
table_cab2 <- table(cab2$gender, cab2$support)
addmargins(table_cab2) 
        
         not_support support Sum
  female          30      20  50
  male            20      30  50
  Sum             50      50 100
  • Conduct a \(χ^2\) test
chisq.test(cab2$gender, cab2$support, 
           correct = FALSE)

    Pearson's Chi-squared test

data:  cab2$gender and cab2$support
X-squared = 4, df = 1, p-value = 0.0455
  • You don’t have to check the Chi-square distribution table
  • All you need to do is check whether the p-value is smaller than 0.05
  • Since the p-value (0.0455) is smaller than 0.05, we can reject the null hypothesis
    We CAN conclude that men support the cabinet more than women in the population
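
The call above passes correct = FALSE, which switches off the continuity correction (Yates' correction) that chisq.test() applies to 2 x 2 tables by default. The sketch below compares the two settings. It rebuilds the cross table by hand instead of reading the CSV file, and the object names tab, no_correction and with_correction are illustrative.

```r
# The same cross table as above, typed in by hand
tab <- matrix(c(30, 20, 20, 30), nrow = 2,
              dimnames = list(gender  = c("female", "male"),
                              support = c("not_support", "support")))

no_correction   <- chisq.test(tab, correct = FALSE)
with_correction <- chisq.test(tab)   # correct = TRUE is the default

no_correction$statistic     # X-squared = 4, as computed by hand
with_correction$statistic   # X-squared = 3.24

# p-values under the two settings
c(no_correction   = no_correction$p.value,
  with_correction = with_correction$p.value)
```

With the correction the statistic is smaller and the p-value rises above 0.05, so for results close to the threshold the choice of correct can change the conclusion.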

1.4 Statistical Significance and Sample Size

  • What if you have only 50 respondents instead of 100?

  • Download chi2_data3E.csv

  • Put the file (chi2_data3E.csv) into the data folder in your R Project folder

  • Load the data and name it cab3

cab3 <- read_csv("data/chi2_data3E.csv") 
  • Check cab3
DT::datatable(cab3)
  • Make a cross table
table_cab3 <- table(cab3$gender, cab3$support)
addmargins(table_cab3) 
        
         not_support support Sum
  female          15      10  25
  male            10      15  25
  Sum             25      25  50
  • Conduct a \(χ^2\) test
chisq.test(cab3$gender, cab3$support, 
           correct = FALSE)

    Pearson's Chi-squared test

data:  cab3$gender and cab3$support
X-squared = 2, df = 1, p-value = 0.1573
  • You don’t have to check the Chi-square distribution table
  • All you need to do is check whether the p-value is smaller than 0.05
  • Since the p-value (0.1573) is larger than 0.05, we cannot reject the null hypothesis
    We CANNOT conclude that men support the cabinet more than women in the population

Summary: With the same proportions, the smaller the sample, the less likely a Chi-squared test is to yield a statistically significant result.
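
The effect of sample size can be checked directly. The sketch below keeps the 60/40 split fixed and only changes the number of respondents. The sample sizes 200 and 400 are hypothetical additions, and the object names are illustrative.

```r
# Cell counts for 100 respondents (column-wise: not_support, support)
base_tab <- matrix(c(30, 20, 20, 30), nrow = 2)

# Same proportions, different numbers of respondents
n_respondents <- c(50, 100, 200, 400)

results <- t(sapply(n_respondents, function(n) {
  tab  <- base_tab * n / 100
  test <- chisq.test(tab, correct = FALSE)
  c(n = n, chi_squared = unname(test$statistic), p_value = test$p.value)
}))

round(results, 4)
```

The chi-squared value grows in proportion to the number of respondents (2, 4, 8, 16), so the same difference in proportions becomes significant once the sample is large enough.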

1.5 Fisher’s exact test

  • If the expected frequency in any cell is small (say, smaller than 5), the \(χ^2\) test is not reliable
  • In this case, use Fisher’s exact test instead.
  • Fisher’s exact test is a statistical significance test used in the analysis of contingency tables.
  • Although in practice it is employed when sample sizes are small, it is valid for all sample sizes.
  • It is named after its inventor, Ronald Fisher, and is one of a class of exact tests, so called because the significance of the deviation from a null hypothesis (e.g., P-value) can be calculated exactly, rather than relying on an approximation that becomes exact in the limit as the sample size grows to infinity, as with many statistical tests.
  • Let’s suppose we get the following sample:

  • Download chi2_data4E.csv
  • Put the file (chi2_data4E.csv) into the data folder in your R Project folder
  • Load the data and name it cab4
cab4 <- read_csv("data/chi2_data4E.csv") 
  • Check cab4
DT::datatable(cab4)
  • Make a cross table
table_cab4 <- table(cab4$gender, cab4$support)
addmargins(table_cab4) 
        
         not_support support Sum
  female           3       2   5
  male             2       1   3
  Sum              5       3   8
  • Conduct a Fisher’s exact test
fisher.test(table_cab4, alternative = "less")

    Fisher's Exact Test for Count Data

data:  table_cab4
p-value = 0.7143
alternative hypothesis: true odds ratio is less than 1
95 percent confidence interval:
  0.00000 17.73797
sample estimates:
odds ratio 
 0.7772203 
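  • We use a one-tailed test here, not a two-tailed test
    → Add alternative = "less"
    → With this table layout, "less" tests whether the odds ratio is below 1, that is, whether the odds of not supporting the cabinet are lower for women than for men
  • All you need to do is check whether the p-value is smaller than 0.05
  • Since the p-value (0.7143) is larger than 0.05, we cannot reject the null hypothesis
    We CANNOT conclude that cabinet support differs between men and women in the population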
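
To see why the test is called exact, the one-tailed p-value can be reproduced from the hypergeometric distribution. This is a minimal sketch: the table is typed in by hand instead of being read from the CSV file, and the object names are illustrative.

```r
# The same cross table as above, typed in by hand
tab <- matrix(c(3, 2, 2, 1), nrow = 2,
              dimnames = list(gender  = c("female", "male"),
                              support = c("not_support", "support")))

# Expected counts under independence: every cell is below 5
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
expected

# Probability that at most 3 of the 5 non-supporters are women,
# given 5 women and 3 men in total
p_by_hand <- phyper(3, m = 5, n = 3, k = 5)

p_fisher <- fisher.test(tab, alternative = "less")$p.value

c(by_hand = p_by_hand, fisher = p_fisher)   # both 0.7143
```

The two values agree because, with the margins fixed, the count in the top-left cell follows a hypergeometric distribution under the null hypothesis.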

2. Correlation coefficients

  • In statistics, correlation is any statistical relationship, whether causal or not, between two random variables or bivariate data.
  • The correlation coefficient quantifies the linear relationship between two variables.
  • An upwards trend in the data cloud in a scatter plot implies a positive correlation, whereas a downwards trend in the data cloud represents a negative correlation.
  • The correlation coefficient ranges from -1 (perfect negative linear relationship) to 1 (perfect positive linear relationship).
  • Correlation is often not suitable for representing a nonlinear relationship.

Pearson correlation coefficient


Source: https://en.wikipedia.org/wiki/Correlation
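
As a rough illustration, the Pearson correlation coefficient can be computed from its definition: the sum of the products of the deviations from the means, divided by the square root of the product of the two sums of squared deviations. The vectors a and b below are hypothetical example data, and the object names are illustrative.

```r
# Hypothetical example data
a <- c(2, 4, 6, 8, 10)
b <- c(1, 3, 4, 8, 9)

# Deviations from the means
dev_a <- a - mean(a)
dev_b <- b - mean(b)

# Pearson correlation coefficient from its definition
r_by_hand <- sum(dev_a * dev_b) / sqrt(sum(dev_a^2) * sum(dev_b^2))

# Compare with the built-in function
c(by_hand = r_by_hand, cor = cor(a, b))
```

The hand computation and cor() should give the same value.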

2.2 Statistical test on correlation coefficients

  • Suppose there are two variables, x and y
x <- c(1, 5, 10)
y <- c(1, 2, 10)
  • Make a data frame (xy) containing these two variables
xy <- data.frame(x, y)
  • Draw a scatter plot using these two variables
plot(y ~ x)

  • You can have a fancier scatter plot using ggplot2 and add a regression line
library("tidyverse")
xy %>% 
  ggplot(aes(x, y)) + 
  geom_point() +
  stat_smooth(method = lm, se = FALSE) 

  • Calculate correlation coefficient between x and y
cor(x, y)
[1] 0.936599
  • Calculate the correlation coefficient between x and y and test whether the correlation also holds in the population
cor.test(x, y)

    Pearson's product-moment correlation

data:  x and y
t = 2.6729, df = 1, p-value = 0.2279
alternative hypothesis: true correlation is not equal to 0
sample estimates:
     cor 
0.936599 
  • Correlation coefficient = 0.936599
  • Null hypothesis: the true correlation is equal to 0 in the population
  • p-value = 0.2279
    → We cannot reject this null hypothesis
    → We cannot rule out that the true correlation is equal to 0 in the population
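
The t value and p-value reported by cor.test() can be reproduced by hand. The sketch below uses the standard t statistic for a Pearson correlation with n - 2 degrees of freedom; the object names r, n, t_value and p_value are illustrative.

```r
x <- c(1, 5, 10)
y <- c(1, 2, 10)

r <- cor(x, y)
n <- length(x)

# t statistic with n - 2 degrees of freedom
t_value <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value
p_value <- 2 * pt(abs(t_value), df = n - 2, lower.tail = FALSE)

round(c(t = t_value, df = n - 2, p = p_value), 4)   # t = 2.6729, df = 1, p = 0.2279
```

With only three observations there is just one degree of freedom, so even a correlation of 0.94 is not statistically significant.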

3. Correlations and Causality

Correlation does not imply causation

  • Let’s check this in R
  • Let’s make a hypothetical dataset

comp

  • Make a variable, comp:
    → how competitive each electoral district is (0 = not competitive, 1 = competitive)
  • Generate a dataset (N = 100) in which each value is 1 or 0 with a 50% chance
set.seed(12345)  
comp <- rbinom(100, 1, .5)
  • We set the seed with set.seed(12345) so that we get the same result every time.
  • The value 12345 can be any number you prefer.
  • If we do not set the seed, we get different results every time we run the code.
  • Try it several times without set.seed(12345) and see your results!
  • Make a histogram of comp
hist(comp)

table(comp)
comp
 0  1 
48 52 
  • We see that we generated the 48 “not competitive” districts and the 52 “competitive” districts.
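
The role of set.seed() can be checked with a short sketch. The object names first_draw, second_draw and third_draw are illustrative.

```r
set.seed(12345)
first_draw <- rbinom(100, 1, .5)

set.seed(12345)
second_draw <- rbinom(100, 1, .5)

# Same seed, same sequence of random numbers
identical(first_draw, second_draw)   # TRUE

# Without resetting the seed, the next call continues the random sequence
third_draw <- rbinom(100, 1, .5)
identical(first_draw, third_draw)
```

Resetting the seed immediately before the random draw is what makes the result reproducible.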

money

  • Generate campaign money data (N = 100) and name it money
  • We draw a sample of N = 100 from a normal distribution with mean = 0.4 + 0.5*comp and standard deviation sd = 0.2.
money <- rnorm(100, mean = 0.4 + 0.5*comp, sd = 0.2) 

turnout

  • Generate turnout data (N = 100) and name it turnout
  • We draw a sample of N = 100 from a normal distribution with mean = 0.4 + 0.3*comp and standard deviation sd = 0.1.
turnout <- rnorm(100, mean = 0.4 + 0.3*comp, sd = 0.1)

Merge the 3 variables

  • Merge the three variables into a data frame, df
df <- data.frame(money = money,
                 turnout = turnout,
                 comp = as.factor(comp))
  • Check df
DT::datatable(df)
  • Draw a scatter plot between money (x axis) and turnout (y axis)
df %>% 
  ggplot(aes(x = money, y = turnout)) +
  geom_point() +
  geom_smooth(se = FALSE, method = 'lm') +
  labs(x = "Campaign money (million yen)", y = "turnout (%)")

  • Check the correlation coefficient
cor.test(money, turnout)

    Pearson's product-moment correlation

data:  money and turnout
t = 7.5656, df = 98, p-value = 2.118e-11
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4664265 0.7179987
sample estimates:
      cor 
0.6072149 
  • Correlation coefficient = 0.6072149
  • Null hypothesis: true correlation is equal to 0 in population
  • p-value= 2.118e-11 = 0.00000000002118
    → We can reject the null hypothesis
    → There is a positive correlation between money and turnout in population

Is there a causal relationship between money and turnout?

  • Draw a scatter plot between money and turnout in terms of comp (electoral competitiveness in each electoral district)
df %>% 
  ggplot(aes(money, turnout))  + 
  geom_point(aes(color = comp)) + 
  geom_smooth(method = lm, 
              se = FALSE, 
              aes(color = comp)) +
  labs(x = "Campaign money (million yen)", y = "turnout (%)") + 
  scale_color_discrete(name = "Electoral Competitiveness",
                       labels = c("Not_competitive","Competitive")) 

Interpretation: If you take electoral competitiveness into consideration,
→ There is no correlation between campaign money and turnout within each group of districts
→ The overall correlation does not reflect a causal relationship between campaign money and turnout: both depend on competitiveness
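
The interpretation can be checked numerically by computing the correlation separately for each level of comp. This is a minimal sketch that regenerates the simulated data with the same seed so that it runs on its own; the object names overall and within_group are illustrative.

```r
# Regenerate the simulated data with the same seed and the same order of draws
set.seed(12345)
comp    <- rbinom(100, 1, .5)
money   <- rnorm(100, mean = 0.4 + 0.5 * comp, sd = 0.2)
turnout <- rnorm(100, mean = 0.4 + 0.3 * comp, sd = 0.1)
df <- data.frame(money, turnout, comp = as.factor(comp))

# Correlation ignoring competitiveness
overall <- cor(df$money, df$turnout)

# Correlation within each level of comp (0 = not competitive, 1 = competitive)
within_group <- sapply(split(df, df$comp),
                       function(d) cor(d$money, d$turnout))

overall
within_group
```

If the within-group correlations are close to 0 while the overall correlation is clearly positive, the overall correlation is produced by competitiveness, not by an effect of money on turnout.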

3.1 Visualize correlations

  • Using the qgraph package, you can visualize correlations
  • Download hr96-21.csv
  • Japanese lower house election results (1996-2017)
  • Put the file (hr96-21.csv) into the data folder in your R Project folder
  • Load the data and name it hr
hr <- read_csv("data/hr96-21.csv", 
               na = ".")
  • Check the variable names in hr
names(hr)
 [1] "year"          "pref"          "ku"            "kun"          
 [5] "wl"            "rank"          "nocand"        "seito"        
 [9] "j_name"        "gender"        "name"          "previous"     
[13] "age"           "exp"           "status"        "vote"         
[17] "voteshare"     "eligible"      "turnout"       "seshu_dummy"  
[21] "jiban_seshu"   "nojiban_seshu"
  • The data set includes the following variables
variable        detail
year            Election year (1996-2017)
pref            Prefecture
ku              Electoral district name
kun             Electoral district number
mag             District magnitude (number of candidates elected)
rank            Rank by votes in each district (1 = most votes)
wl              0 = loser / 1 = single-member district (smd) winner / 2 = zombie winner
nocand          Number of candidates in each district
seito           Candidate’s affiliated party
j_name          Candidate’s name (Japanese)
name            Candidate’s name (English)
previous        Previous wins
gender          Candidate’s gender: “male”, “female”
age             Candidate’s age
exp             Election expenditure (yen) spent by each candidate
status          0 = challenger / 1 = incumbent / 2 = former incumbent
vote            Votes each candidate garnered
voteshare       Vote share (%)
eligible        Eligible voters in each district
turnout         Turnout in each district (%)
castvote        Total votes cast in each district
seshu_dummy     0 = non-hereditary candidate, 1 = hereditary candidate
jiban_seshu     Relationship between candidate and his predecessor
nojiban_seshu   Relationship between candidate and his predecessor

Make new variables you need

exppv

  • Campaign expenditure per eligible voter for each candidate (yen)

  • Check the class of exp and eligible

str(hr$exp)
 num [1:9660] 9828097 9311555 9231284 2177203 NA ...
str(hr$eligible)
 num [1:9660] 346774 346774 346774 346774 346774 ...
  • Both variables are numeric
    → No problem
  • Using exp and eligible, make a new variable, exppv
hr <- hr %>% 
  dplyr::mutate(exppv = exp/eligible) 
  • Check the descriptive statistics of exppv
summary(hr$exppv)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
  0.0013   8.1762  18.7646  23.0907  33.3863 120.8519     2831 
  • Select 10 variables you need
  • We only use the election data in 2009 here
hr2009 <- hr %>%
  dplyr::filter(year == 2009) %>% 
  dplyr::select(age, nocand, rank, wl, previous, vote, voteshare, eligible, exp, exppv)
  • Check hr2009
DT::datatable(hr2009)
summary(hr2009)
      age           nocand           rank             wl        
 Min.   :25.0   Min.   :2.000   Min.   :1.000   Min.   :0.0000  
 1st Qu.:41.0   1st Qu.:3.000   1st Qu.:1.000   1st Qu.:0.0000  
 Median :50.0   Median :4.000   Median :2.000   Median :0.0000  
 Mean   :50.1   Mean   :4.005   Mean   :2.496   Mean   :0.4337  
 3rd Qu.:59.0   3rd Qu.:4.000   3rd Qu.:3.000   3rd Qu.:1.0000  
 Max.   :85.0   Max.   :9.000   Max.   :9.000   Max.   :2.0000  
 NA's   :4                                                      
    previous           vote          voteshare        eligible     
 Min.   : 0.000   Min.   :   177   Min.   : 0.10   Min.   :211750  
 1st Qu.: 0.000   1st Qu.:  5992   1st Qu.: 2.40   1st Qu.:298265  
 Median : 0.000   Median : 62034   Median :30.00   Median :352167  
 Mean   : 1.331   Mean   : 61940   Mean   :26.34   Mean   :349973  
 3rd Qu.: 2.000   3rd Qu.:107292   3rd Qu.:47.30   3rd Qu.:405333  
 Max.   :15.000   Max.   :201461   Max.   :95.30   Max.   :487837  
                                                                   
      exp               exppv         
 Min.   :   10024   Min.   :  0.0258  
 1st Qu.: 1794542   1st Qu.:  5.3290  
 Median : 4809437   Median : 13.9190  
 Mean   : 6118181   Mean   : 18.4032  
 3rd Qu.: 9109114   3rd Qu.: 27.3219  
 Max.   :25354069   Max.   :100.8919  
 NA's   :15         NA's   :15        

Caution!

  • Three variables (age, exp, exppv) include NA's (missing values)
  • When the data contain NA, you need to add use = "complete.obs"
  • Using the cor() function, we compute the correlations for the 2009 lower house election results
corHR <- cor(hr2009, use = "complete.obs") 
  • Show the correlation coefficients
corHR
                  age      nocand        rank          wl    previous
age        1.00000000 -0.01949051 -0.18875565  0.06449578  0.48783691
nocand    -0.01949051  1.00000000  0.35882406 -0.19173305 -0.12843147
rank      -0.18875565  0.35882406  1.00000000 -0.58585349 -0.48212211
wl         0.06449578 -0.19173305 -0.58585349  1.00000000  0.39104752
previous   0.48783691 -0.12843147 -0.48212211  0.39104752  1.00000000
vote       0.19315400 -0.20289537 -0.86695872  0.63917654  0.53338630
voteshare  0.20543012 -0.23982316 -0.89626914  0.68207240  0.55385390
eligible  -0.02675139  0.19908259  0.06906283 -0.11248525 -0.03812071
exp        0.30790965 -0.15181589 -0.61620242  0.40505780  0.57268149
exppv      0.28775237 -0.18464384 -0.59240696  0.42901652  0.54665625
                vote   voteshare    eligible         exp      exppv
age        0.1931540  0.20543012 -0.02675139  0.30790965  0.2877524
nocand    -0.2028954 -0.23982316  0.19908259 -0.15181589 -0.1846438
rank      -0.8669587 -0.89626914  0.06906283 -0.61620242 -0.5924070
wl         0.6391765  0.68207240 -0.11248525  0.40505780  0.4290165
previous   0.5333863  0.55385390 -0.03812071  0.57268149  0.5466562
vote       1.0000000  0.96092467  0.14578029  0.66201463  0.5686694
voteshare  0.9609247  1.00000000 -0.05370451  0.69213847  0.6667191
eligible   0.1457803 -0.05370451  1.00000000 -0.06058169 -0.2921183
exp        0.6620146  0.69213847 -0.06058169  1.00000000  0.9498970
exppv      0.5686694  0.66671908 -0.29211834  0.94989701  1.0000000
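
The effect of use = "complete.obs" mentioned in the caution above can be seen on a small example. The data frame toy below is hypothetical and exists only for illustration.

```r
# Hypothetical toy data with missing values in columns a and c
toy <- data.frame(a = c(1, 2, 3, 4, NA),
                  b = c(2, 4, 5, 4, 5),
                  c = c(5, 3, NA, 1, 2))

# Default: any pair involving a column with NA gives NA
cor(toy)

# Rows containing an NA in any column are dropped before computing
cor_complete <- cor(toy, use = "complete.obs")
cor_complete

# Number of rows actually used
sum(complete.cases(toy))   # 3
```

Because whole rows are dropped, every coefficient in the matrix is based on the same set of complete observations.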
  • Visualize the correlation coefficients
cor1 <- qgraph(
  corHR,
  graph = "glasso",
  sampleSize = nrow(hr2009),
  tuning = 0,
  layout = "spring",
  title = "Correlations among variables of HR elections",
  details = TRUE
)

  • A green line means a positive correlation
  • A red line means a negative correlation
  • The thicker the line, the stronger the correlation.
  • You can save the figure as “cor_hr2009.pdf”
qgraph(cor1,
       filetype = 'pdf',
       filename = "cor_hr2009",
       height = 5,
       width = 10)

4. Exercise

  • Download JGSS-2008E.csv
  • JGSS-2008E.csv contains survey data collected one year before the 2009 lower house election in Japan.
  • The 2009 lower house election is one of the most important national elections: the long-ruling Liberal Democratic Party (LDP) lost a substantial number of seats, and the Democratic Party of Japan (DPJ) took power for the first time.
  • The survey was conducted in 2008, one year before this change of government from the LDP to the DPJ.
  • The data set includes the following three variables.
serial : Serial number
gender : male, female
eval : 1 = evaluates the LDP’s performance positively, 0 = does not

Q1: Using R, make a cross table between gender and eval

Q2: Can you conclude that men evaluate the LDP’s performance more positively than women in the population? If so, show the evidence that supports your argument.

5. Exercise

  • women is a dataset built into R
  • women: Average Heights and Weights for American Women
  • women contains the average heights and weights for American women aged 30–39.
  • If you want detailed information on women, type ?women in the Console in RStudio
  • Load women
data(women)
women <- data.frame(women)
women
   height weight
1      58    115
2      59    117
3      60    120
4      61    123
5      62    126
6      63    129
7      64    132
8      65    135
9      66    139
10     67    142
11     68    146
12     69    150
13     70    154
14     71    159
15     72    164

Q1: Change the unit of height and weight as follows:
・inch → cm
・pound → kg
Note: 1 inch = 2.54 cm, 1 pound = 0.4536 kg

Q2: Draw a scatter plot using height (x-axis) and weight (y-axis), and also add a regression line

Q3: Calculate the correlation coefficient between height and weight.
・Suppose this dataset is a sample drawn from population.
・Is this relationship also true in population?

Q4: Is the relationship between height and weight a correlation or a causation? Explain why.

6. Exercise

  • cars is a dataset built into R
  • cars: Speed and Stopping Distances of Cars
  • cars contains the speed of cars and the distances taken to stop. Note that the data were recorded in the 1920s.
  • If you want detailed information on cars, type ?cars in the Console in RStudio
  • Load cars
data(cars)
cars
   speed dist
1      4    2
2      4   10
3      7    4
4      7   22
5      8   16
6      9   10
7     10   18
8     10   26
9     10   34
10    11   17
11    11   28
12    12   14
13    12   20
14    12   24
15    12   28
16    13   26
17    13   34
18    13   34
19    13   46
20    14   26
21    14   36
22    14   60
23    14   80
24    15   20
25    15   26
26    15   54
27    16   32
28    16   40
29    17   32
30    17   40
31    17   50
32    18   42
33    18   56
34    18   76
35    18   84
36    19   36
37    19   46
38    19   68
39    20   32
40    20   48
41    20   52
42    20   56
43    20   64
44    22   66
45    23   54
46    24   70
47    24   92
48    24   93
49    24  120
50    25   85

Q1: Change the unit of speed and dist as follows:

・miles per hour (mph) → km per hour (km/h)
・foot → m
Note: 1 mile = 1.6 km, 1 foot = 0.3048 m

Q2: Draw a scatter plot using speed (x-axis) and dist (y-axis), and also add a regression line

Q3: Calculate the correlation coefficient between speed and dist.

・Suppose this dataset is a sample drawn from population.
・Is this relationship also true in population?

Q4: Is the relationship between speed and dist a correlation or a causation? Explain why.
