R packages we use in this section

library(corrr)
library(DT)
library(psych)
library(qgraph)
library(tidyverse)

1. Relationship between Categorical Variables

1.1 Cross tables

A cross table is a two-way table consisting of columns and rows.
Its greatest strength is its ability to structure, summarize and display large amounts of data.
Cross tables can also be used to determine whether there is a relation between the row variable and the column variable.
Hypothetical data on the cabinet support rate in Japan

Q: Does the cabinet support rate differ between men and women?

1.2 Why Chi-squared test?

When there is no difference between men and women

Observed values

Half of the males (25) and half of the females (25) support the cabinet
Gender and cabinet support are not related to each other
Gender and cabinet support are statistically independent

When there is a difference between men and women

30 males out of 50 support the cabinet
20 females out of 50 support the cabinet
Gender and cabinet support may be related to each other
We need to confirm this by conducting a Chi-squared test.

Null Hypothesis: In the population, there is no difference in cabinet support between men and women

Alternative Hypothesis: In the population, there is a difference in cabinet support between men and women

When the null hypothesis is rejected
→ Accept the alternative hypothesis
→ Statistically significant
→ We can conclude that there is a difference in cabinet support between men and women in the population
When the null hypothesis is not rejected
→ We cannot say anything
→ Not statistically significant
→ We cannot conclude that there is a difference in cabinet support between men and women in the population

1.3 Chi-squared test

A chi-squared test is a statistical hypothesis test that is valid to perform when the test statistic is chi-squared distributed under the null hypothesis. The term most often refers to Pearson’s chi-squared test.
Pearson’s chi-squared test is used to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table.

How to calculate Chi-squared values

Step 1:

Calculate expected values from the observed values

Step 2:

Step 3:

Add them all

Chi-squared value = 1+1+1+1 = 4
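The calculation can be sketched in R. This is a minimal sketch, assuming the hypothetical 2 × 2 table of this section (30 of 50 males and 20 of 50 females support the cabinet); the object names observed, expected and contribution are illustrative only. Each cell contributes \((O - E)^2 / E\), where \(O\) is the observed value and \(E\) is the expected value under independence (row total × column total / grand total).

```r
# Observed values (hypothetical data of this section)
observed <- matrix(c(30, 20,
                     20, 30),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(gender  = c("female", "male"),
                                   support = c("not_support", "support")))

# Expected values under independence:
# row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Contribution of each cell: (O - E)^2 / E
contribution <- (observed - expected)^2 / expected

# Add them all
chi2 <- sum(contribution)
chi2
```

Every expected value is 25, every cell contributes 1, and the sum is 4, which matches the value above.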

1.3 Chi-squared test (by hand)

If the null hypothesis that there is no difference between men and women in the population is true, the test statistic computed from the observations follows a \(χ^2\) distribution.
The Chi-square distribution table looks like this:

degrees of freedom (df): the number of values in the final calculation of a statistic that are free to vary
We use two variables: Cabinet support and Gender
We calculate the degrees of freedom from these two variables
Cabinet support is free to vary in 2 ways: “not support”, “support”
Gender is free to vary in 2 ways: “female”, “male”
→ In this case, the degrees of freedom are calculated as follows:
degrees of freedom = (2-1)*(2-1) = 1
→ We use the Chi-square distribution (df = 1)
If you want to test a hypothesis at the 5% significance level (the conventional standard in statistical hypothesis testing), refer to the column \(χ^2_{.050}\)
The value (3.841) is the cutting point (critical value) we use here.
The purpose of the test is to evaluate how likely the observed frequencies would be assuming the null hypothesis is true.
Test statistics that follow a \(χ^2\) distribution occur when the observations are independent.
The shape of the Chi-square distribution differs depending on the degrees of freedom (df).

The cutting point = 3.84
The \(χ^2\) calculated with the sample we get = 4

The \(χ^2\) calculated with the sample we get = 4 is larger than the cutting point (3.84)
→ We reject the null hypothesis
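The cutting point does not have to be read from a printed table. The following is a minimal sketch using qchisq() and pchisq() from base R (stats package); the object names dof, critical and p_value are illustrative only.

```r
# Degrees of freedom for a 2 x 2 table
dof <- (2 - 1) * (2 - 1)

# Cutting point: the value that cuts off the upper 5% of the
# chi-square distribution with df = 1
critical <- qchisq(0.95, df = dof)

# Chi-squared value calculated from the sample
chi2 <- 4

# Probability of a value of 4 or larger if the null hypothesis is true
p_value <- pchisq(chi2, df = dof, lower.tail = FALSE)

round(critical, 3)   # 3.841
round(p_value, 4)    # 0.0455
chi2 > critical      # TRUE: reject the null hypothesis
```

Comparing the chi-squared value with the cutting point and comparing the p-value with 0.05 always lead to the same decision.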

Null Hypothesis: In the population, there is no difference in cabinet support between men and women

We reject this null hypothesis and conclude that men support the cabinet more than women in the population

1.4 Chi-squared test (on R)

Download chi2_data2E.csv
Put the file (chi2_data2E.csv) into the data folder in your RProject folder
Load the data and name it cab2

cab2 <- read_csv("data/chi2_data2E.csv")

Check cab2

DT::datatable(cab2)

Make a cross table

table_cab2 <- table(cab2$gender, cab2$support)
addmargins(table_cab2)

        
         not_support support Sum
  female          30      20  50
  male            20      30  50
  Sum             50      50 100

Conduct a \(χ^2\) test

chisq.test(cab2$gender, cab2$support, 
           correct = FALSE)


    Pearson's Chi-squared test

data:  cab2$gender and cab2$support
X-squared = 4, df = 1, p-value = 0.0455

You don’t have to check the Chi-square distribution table
All you need to do is check whether the p-value is smaller than 0.05
Since the p-value (0.0455) is smaller than 0.05, we can reject the null hypothesis
→ We CAN conclude that men support the cabinet more than women in the population
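If you do not have the CSV file at hand, the same test can be reproduced from the cross table alone. This is a minimal sketch, assuming the counts shown in the cross table above; the object names tab and res are illustrative only. The argument correct = FALSE switches off Yates' continuity correction, so the result equals the hand calculation.

```r
# Cross table typed in directly (counts from the table above)
tab <- matrix(c(30, 20,
                20, 30),
              nrow = 2, byrow = TRUE,
              dimnames = list(gender  = c("female", "male"),
                              support = c("not_support", "support")))

# Pearson's chi-squared test without continuity correction
res <- chisq.test(tab, correct = FALSE)

res$expected        # expected values under the null hypothesis
res$statistic       # chi-squared value
res$parameter       # degrees of freedom
res$p.value         # p-value
res$p.value < 0.05  # TRUE: statistically significant at the 5% level
```

Storing the result in an object lets you pull out single values such as res$p.value instead of reading them off the printed output.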

1.4 Statistical Significance and # of samples

What if you have only 50 respondents instead of 100?
Download chi2_data3E.csv
Put the file (chi2_data3E.csv) into the data folder in your RProject folder
Load the data and name it cab3

cab3 <- read_csv("data/chi2_data3E.csv")

Check cab3

DT::datatable(cab3)

Make a cross table

table_cab3 <- table(cab3$gender, cab3$support)
addmargins(table_cab3)

        
         not_support support Sum
  female          15      10  25
  male            10      15  25
  Sum             25      25  50

Conduct a \(χ^2\) test

chisq.test(cab3$gender, cab3$support, 
           correct = FALSE)


    Pearson's Chi-squared test

data:  cab3$gender and cab3$support
X-squared = 2, df = 1, p-value = 0.1573

You don’t have to check the Chi-square distribution table
All you need to do is check whether the p-value is smaller than 0.05
Since the p-value (0.1573) is larger than 0.05, we cannot reject the null hypothesis
→ We CANNOT conclude that men support the cabinet more than women in the population

Summary: With the same support rates in the sample, the smaller the sample, the less likely a Chi-squared test is to yield a statistically significant result.
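The effect of the sample size can be checked directly. The following is a minimal sketch in which the support rates are held fixed (60% vs. 40%) and only the number of respondents changes; the function name chi2_by_size and the group sizes are assumptions made for illustration.

```r
# Chi-squared test for a 2 x 2 table with fixed rates (60% vs. 40%)
# and n_per_group respondents in each gender group
chi2_by_size <- function(n_per_group) {
  tab <- matrix(n_per_group * c(3, 2,
                                2, 3) / 5,
                nrow = 2, byrow = TRUE)
  res <- chisq.test(tab, correct = FALSE)
  c(n       = 2 * n_per_group,
    chi2    = unname(res$statistic),
    p_value = res$p.value)
}

# 50, 100 and 200 respondents in total
results <- t(sapply(c(25, 50, 100), chi2_by_size))
results
```

The chi-squared value doubles whenever the sample size doubles (2, 4, 8), so the p-value shrinks even though the support rates stay the same.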

1.5 Fisher’s exact test

If the expected frequency in any cell is small (let’s say it is smaller than 5), the \(χ^2\) test is not reliable
In this case, you use Fisher’s exact test.
Fisher’s exact test is a statistical significance test used in the analysis of contingency tables.
Although in practice it is employed when sample sizes are small, it is valid for all sample sizes.
It is named after its inventor, Ronald Fisher, and is one of a class of exact tests, so called because the significance of the deviation from a null hypothesis (e.g., P-value) can be calculated exactly, rather than relying on an approximation that becomes exact in the limit as the sample size grows to infinity, as with many statistical tests.
Let’s suppose we get the following sample:

Download chi2_data4E.csv
Put the file (chi2_data4E.csv) into the data folder in your RProject folder
Load the data and name it cab4

cab4 <- read_csv("data/chi2_data4E.csv")

Check cab4

DT::datatable(cab4)

Make a cross table

table_cab4 <- table(cab4$gender, cab4$support)
addmargins(table_cab4)

        
         not_support support Sum
  female           3       2   5
  male             2       1   3
  Sum              5       3   8

Conduct a Fisher’s exact test

fisher.test(table_cab4, alternative = "less")


    Fisher's Exact Test for Count Data

data:  table_cab4
p-value = 0.7143
alternative hypothesis: true odds ratio is less than 1
95 percent confidence interval:
  0.00000 17.73797
sample estimates:
odds ratio 
 0.7772203

We use a one-tailed test, not a two-tailed test
→ Add alternative = "less" (the alternative hypothesis is that the true odds ratio is less than 1)
All you need to do is check whether the p-value is smaller than 0.05
Since the p-value (0.7143) is larger than 0.05, we cannot reject the null hypothesis
→ We CANNOT conclude that there is a difference in cabinet support between men and women in the population
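To see why the test is called exact, the p-value can be reproduced from the hypergeometric distribution. This is a minimal sketch, assuming the counts of the cross table above; the object names tab, res and p_by_hand are illustrative only. With the row and column totals fixed, the one-tailed p-value for alternative = "less" is the probability that the top-left cell (female, not_support) is 3 or smaller.

```r
# Cross table typed in directly (counts from the table above)
tab <- matrix(c(3, 2,
                2, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(gender  = c("female", "male"),
                              support = c("not_support", "support")))

res <- fisher.test(tab, alternative = "less")

# P(top-left cell <= 3) with all margins fixed
p_by_hand <- phyper(q = tab[1, 1],       # observed top-left cell (3)
                    m = sum(tab[, 1]),   # total "not_support" (5)
                    n = sum(tab[, 2]),   # total "support" (3)
                    k = sum(tab[1, ]))   # total "female" (5)

res$p.value
p_by_hand   # 40/56 = 0.7143
```

No approximation by a \(χ^2\) distribution is involved, which is why the test remains valid for very small tables.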

2. Correlation coefficients

In statistics, correlation is any statistical relationship, whether causal or not, between two random variables or bivariate data.
The correlation coefficient quantifies the linear relationship between two variables.
An upwards trend in the data cloud in a scatter plot implies a positive correlation, whereas a downwards trend in the data cloud represents a negative correlation.
The correlation coefficient ranges from -1 (perfect negative linear relationship) to 1 (perfect positive linear relationship).
Correlation is often not suitable for representing a nonlinear relationship.

Pearson correlation coefficient

Source: https://en.wikipedia.org/wiki/Correlation

Definition of Pearson Correlation Coefficient

Source: https://www.analyticsvidhya.com/blog/2021/01/beginners-guide-to-pearsons-correlation-coefficient/

2.2 Statistical test on correlation coefficients

Suppose there are two variables, x and y

x <- c(1, 5, 10)
y <- c(1, 2, 10)

Make a data frame (xy) containing these two variables

xy <- data.frame(x, y)

Draw a scatter plot using these two variables

plot(y ~ x)  # y on the vertical axis, x on the horizontal axis

You can have a fancier scatter plot using ggplot2 and add a regression line

library("tidyverse")

xy %>% 
  ggplot(aes(x, y)) + 
  geom_point() +
  stat_smooth(method = lm, se = FALSE)

Calculate correlation coefficient between x and y

cor(x, y)

[1] 0.936599
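The value returned by cor() can be checked by hand. The following is a minimal sketch of the standard definition of the Pearson correlation coefficient, \(r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}\); the object names dx, dy and r_by_hand are illustrative only.

```r
x <- c(1, 5, 10)
y <- c(1, 2, 10)

# Deviations from the means
dx <- x - mean(x)
dy <- y - mean(y)

# Sum of cross-products divided by the square root of
# the product of the sums of squares
r_by_hand <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

r_by_hand
cor(x, y)   # same value
```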

Calculate the correlation coefficient between x and y and test whether this correlation also holds in the population

cor.test(x, y)


    Pearson's product-moment correlation

data:  x and y
t = 2.6729, df = 1, p-value = 0.2279
alternative hypothesis: true correlation is not equal to 0
sample estimates:
     cor 
0.936599

Correlation coefficient = 0.936599
Null hypothesis: true correlation is equal to 0 in the population
p-value = 0.2279
→ We cannot reject this null hypothesis
→ We cannot rule out that the true correlation is equal to 0 in the population, because the sample contains only three observations
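The t value and the p-value reported by cor.test() follow from the correlation coefficient and the sample size. This is a minimal sketch of that calculation, using \(t = r\sqrt{n-2}/\sqrt{1-r^2}\) with \(n - 2\) degrees of freedom; the object names t_value and p_value are illustrative only.

```r
x <- c(1, 5, 10)
y <- c(1, 2, 10)

r <- cor(x, y)
n <- length(x)

# Test statistic for the null hypothesis "true correlation = 0"
t_value <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-tailed p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_value), df = n - 2, lower.tail = FALSE)

t_value   # about 2.67
p_value   # about 0.23
```

Because \(n - 2 = 1\) here, even a correlation as high as 0.94 is not statistically significant.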

3. Correlations and Causality

Correlation does not imply causation

Let’s check whether this is so in R
Let’s make a hypothetical dataset

`comp`

Make a variable, comp:
→ how competitive each electoral district is (0 = not competitive, 1 = competitive)
Generate a dataset (N = 100) in which each value is either 1 or 0 with a 50% chance

set.seed(12345)  
comp <- rbinom(100, 1, .5)

We set the seed with set.seed(12345) so that we get the same result every time.
The value, 12345, can be any number you prefer.
If we do not set the seed, we get different results every time we run the code.
Try it several times without set.seed(12345) and see your results!
Make a histogram of comp

hist(comp)

table(comp)

comp
 0  1 
48 52

We see that we generated 48 “not competitive” districts and 52 “competitive” districts.

`money`

Generate campaign money data (N = 100) and name it money
We draw a sample of N = 100 from a normal distribution with mean = 0.4 + 0.5*comp and standard deviation sd = 0.2.

money <- rnorm(100, mean = 0.4 + 0.5*comp, sd = 0.2)

`turnout`

Generate turnout data (N = 100) and name it turnout
We draw a sample of N = 100 from a normal distribution with mean = 0.4 + 0.3*comp and standard deviation sd = 0.1.

turnout <- rnorm(100, mean = 0.4 + 0.3*comp, sd = 0.1)

Merge the 3 variables

Merge the three variables into dataframe, df

df <- data.frame(money = money,
                 turnout = turnout,
                 comp = as.factor(comp))

Check df

DT::datatable(df)

Draw a scatter plot between money (x axis) and turnout (y axis)

df %>% 
  ggplot(aes(x = money, y = turnout)) +
  geom_point() +
  geom_smooth(se = FALSE, method = 'lm') +
  labs(x = "Campaign money (million yen)", y = "turnout (%)")

Check the correlation coefficient

cor.test(money, turnout)


    Pearson's product-moment correlation

data:  money and turnout
t = 7.5656, df = 98, p-value = 2.118e-11
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4664265 0.7179987
sample estimates:
      cor 
0.6072149

Correlation coefficient = 0.6072149
Null hypothesis: true correlation is equal to 0 in population
p-value= 2.118e-11 = 0.00000000002118
→ We can reject the null hypothesis
→ There is a positive correlation between money and turnout in population

Is there a causal relationship between `money` and `turnout`?

Draw a scatter plot between money and turnout in terms of comp (electoral competitiveness in each electoral district)

df %>% 
  ggplot(aes(money, turnout))  + 
  geom_point(aes(color = comp)) + 
  geom_smooth(method = lm, 
              se = FALSE, 
              aes(color = comp)) +
  labs(x = "Campaign money (million yen)", y = "turnout (%)") + 
  scale_color_discrete(name = "Electoral Competitiveness",
                       labels = c("Not_competitive","Competitive"))

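Interpretation
・If you take electoral competitiveness into consideration,…
→ Within each group of districts, there is almost no correlation between campaign money and turnout rates
→ The overall correlation does not reflect a causal relationship between campaign money and turnout rates: both variables were generated from electoral competitiveness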
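What the plot shows can also be expressed in numbers by calculating the correlation separately for each value of comp. This is a minimal sketch that regenerates the hypothetical data with the same seed; the object names sim, cor_all and cor_within are illustrative only.

```r
# Regenerate the hypothetical data (same seed and same order of draws)
set.seed(12345)
comp    <- rbinom(100, 1, .5)
money   <- rnorm(100, mean = 0.4 + 0.5 * comp, sd = 0.2)
turnout <- rnorm(100, mean = 0.4 + 0.3 * comp, sd = 0.1)
sim <- data.frame(money, turnout, comp = as.factor(comp))

# Correlation in the whole sample
cor_all <- cor(sim$money, sim$turnout)

# Correlation within each level of comp (0 and 1)
cor_within <- sapply(split(sim, sim$comp),
                     function(d) cor(d$money, d$turnout))

cor_all
cor_within
```

Within each group, money and turnout are drawn independently of each other, so the within-group correlations differ from 0 only by sampling error, while the correlation in the whole sample is large because comp shifts both variables.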

3.1 Visualize correlations

Using the qgraph package, you can visualize correlations
Download hr96-21.csv
Japanese lower house election results (1996-2017)
Put the file (hr96-21.csv) into the data folder in your RProject folder
Load the data and name it hr

hr <- read_csv("data/hr96-21.csv", 
               na = ".")

Check the variable names in hr

names(hr)

 [1] "year"          "pref"          "ku"            "kun"          
 [5] "wl"            "rank"          "nocand"        "seito"        
 [9] "j_name"        "gender"        "name"          "previous"     
[13] "age"           "exp"           "status"        "vote"         
[17] "voteshare"     "eligible"      "turnout"       "seshu_dummy"  
[21] "jiban_seshu"   "nojiban_seshu"

hr contains the following variables

variable	detail
year	Election year (1996-2017)
pref	Prefecture
ku	Electoral district name
kun	Number of electoral district
mag	District magnitude (Number of candidates elected)
rank	Ascending order of votes
wl	0 = loser / 1 = single-member district (smd) winner / 2 = zombie winner
nocand	Number of candidates in each district
seito	Candidate’s affiliated party
j_name	Candidate’s name (Japanese)
name	Candidate’s name (English)
previous	Previous wins
gender	Candidate’s gender:“male”, “female”
age	Candidate’s age
exp	Election expenditure (yen) spent by each candidate
status	0 = challenger / 1 = incumbent / 2 = former incumbent
vote	votes each candidate garnered
voteshare	Voteshare (%)
eligible	Eligible voters in each district
turnout	Turnout in each district (%)
castvote	Total votes cast in each district
seshu_dummy	0 = Not-hereditary candidates, 1 = hereditary candidate
jiban_seshu	Relationship between candidate and his predecessor
nojiban_seshu	Relationship between candidate and his predecessor

Make new variables you need

exppv

Campaign expenditure spent by each candidate per eligible voter (yen)
Check the class of exp and eligible

str(hr$exp)

 num [1:9660] 9828097 9311555 9231284 2177203 NA ...

str(hr$eligible)

 num [1:9660] 346774 346774 346774 346774 346774 ...

Both variables are numeric
→ No problem
Using exp and eligible, make a new variable, exppv

hr <- hr %>% 
  dplyr::mutate(exppv = exp/eligible)

Check the descriptive statistics of exppv

summary(hr$exppv)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
  0.0013   8.1762  18.7646  23.0907  33.3863 120.8519     2831

Select the 10 variables you need
We use only the 2009 election data here

hr2009 <- hr %>%
  dplyr::filter(year == 2009) %>% 
  dplyr::select(age, nocand, rank, wl, previous, vote, voteshare, eligible, exp, exppv)

Check hr2009

DT::datatable(hr2009)

summary(hr2009)

      age           nocand           rank             wl        
 Min.   :25.0   Min.   :2.000   Min.   :1.000   Min.   :0.0000  
 1st Qu.:41.0   1st Qu.:3.000   1st Qu.:1.000   1st Qu.:0.0000  
 Median :50.0   Median :4.000   Median :2.000   Median :0.0000  
 Mean   :50.1   Mean   :4.005   Mean   :2.496   Mean   :0.4337  
 3rd Qu.:59.0   3rd Qu.:4.000   3rd Qu.:3.000   3rd Qu.:1.0000  
 Max.   :85.0   Max.   :9.000   Max.   :9.000   Max.   :2.0000  
 NA's   :4                                                      
    previous           vote          voteshare        eligible     
 Min.   : 0.000   Min.   :   177   Min.   : 0.10   Min.   :211750  
 1st Qu.: 0.000   1st Qu.:  5992   1st Qu.: 2.40   1st Qu.:298265  
 Median : 0.000   Median : 62034   Median :30.00   Median :352167  
 Mean   : 1.331   Mean   : 61940   Mean   :26.34   Mean   :349973  
 3rd Qu.: 2.000   3rd Qu.:107292   3rd Qu.:47.30   3rd Qu.:405333  
 Max.   :15.000   Max.   :201461   Max.   :95.30   Max.   :487837  
                                                                   
      exp               exppv         
 Min.   :   10024   Min.   :  0.0258  
 1st Qu.: 1794542   1st Qu.:  5.3290  
 Median : 4809437   Median : 13.9190  
 Mean   : 6118181   Mean   : 18.4032  
 3rd Qu.: 9109114   3rd Qu.: 27.3219  
 Max.   :25354069   Max.   :100.8919  
 NA's   :15         NA's   :15

Caution!

Three variables (age, exp, exppv) include NA's (missing values)
When the data contain NA, you need to add use = "complete.obs"
Using the cor() function, we calculate the correlations in the 2009 lower house election results

corHR <- cor(hr2009, use = "complete.obs")

Show the correlation coefficients

corHR

                  age      nocand        rank          wl    previous
age        1.00000000 -0.01949051 -0.18875565  0.06449578  0.48783691
nocand    -0.01949051  1.00000000  0.35882406 -0.19173305 -0.12843147
rank      -0.18875565  0.35882406  1.00000000 -0.58585349 -0.48212211
wl         0.06449578 -0.19173305 -0.58585349  1.00000000  0.39104752
previous   0.48783691 -0.12843147 -0.48212211  0.39104752  1.00000000
vote       0.19315400 -0.20289537 -0.86695872  0.63917654  0.53338630
voteshare  0.20543012 -0.23982316 -0.89626914  0.68207240  0.55385390
eligible  -0.02675139  0.19908259  0.06906283 -0.11248525 -0.03812071
exp        0.30790965 -0.15181589 -0.61620242  0.40505780  0.57268149
exppv      0.28775237 -0.18464384 -0.59240696  0.42901652  0.54665625
                vote   voteshare    eligible         exp      exppv
age        0.1931540  0.20543012 -0.02675139  0.30790965  0.2877524
nocand    -0.2028954 -0.23982316  0.19908259 -0.15181589 -0.1846438
rank      -0.8669587 -0.89626914  0.06906283 -0.61620242 -0.5924070
wl         0.6391765  0.68207240 -0.11248525  0.40505780  0.4290165
previous   0.5333863  0.55385390 -0.03812071  0.57268149  0.5466562
vote       1.0000000  0.96092467  0.14578029  0.66201463  0.5686694
voteshare  0.9609247  1.00000000 -0.05370451  0.69213847  0.6667191
eligible   0.1457803 -0.05370451  1.00000000 -0.06058169 -0.2921183
exp        0.6620146  0.69213847 -0.06058169  1.00000000  0.9498970
exppv      0.5686694  0.66671908 -0.29211834  0.94989701  1.0000000
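The effect of the use argument is easier to see on a small example. This is a minimal sketch with a made-up data frame (toy and its values are assumptions for illustration, not part of the election data). "complete.obs" drops every row that has a missing value in any column, whereas "pairwise.complete.obs", another option of cor(), drops rows separately for each pair of variables.

```r
# A made-up data frame with missing values
toy <- data.frame(a = c(1, 2, 3, 4, NA),
                  b = c(2, 4, 5, 8, 10),
                  c = c(5, 3, NA, 2, 1))

# Default: pairs of columns that involve missing values give NA
cor_default <- cor(toy)

# Use only the rows without any NA (rows 1, 2 and 4)
cor_complete <- cor(toy, use = "complete.obs")

# Use, for each pair, all rows where both variables are observed
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

cor_default
cor_complete
cor_pairwise
```

With "complete.obs" all coefficients are based on the same rows, which is why it is used before the matrix is passed on to other functions.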

Visualize the correlation coefficients

cor1 <- qgraph(
  corHR,
  graph = "glasso",
  sampleSize = nrow(hr2009),
  tuning = 0,
  layout = "spring",
  title = "Correlations among variables of HR elections",
  details = TRUE
)

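A green line means a positive correlation
A red line means a negative correlation
The thicker the line, the stronger the correlation.
Note that graph = "glasso" draws a sparse estimate of the partial correlations, so each line shows the relationship between two variables after the other variables are taken into account.
You can save the figure as “cor_hr2009.pdf”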

qgraph(cor1,
       filetype = 'pdf',
       filename = "cor_hr2009",
       height = 5,
       width = 10)

4. Exercise

Download JGSS-2008E.csv
This file contains survey data collected one year prior to the 2009 lower house election in Japan.
The 2009 lower house election is one of the most important national elections because the long-ruling Liberal Democratic Party (LDP) lost a substantial number of seats and the Democratic Party of Japan (DPJ) took power for the first time in this election.
This survey was conducted in 2008, one year prior to the regime change from the LDP to the DPJ.
The following three variables are included in the data set.

`serial`	: Serial number
`gender`	: male, female
`eval`	: 1 = rates the performance of the LDP favorably, 0 = does not

Q1: Using R, make a cross table between gender and eval

Q2: Can you conclude that men rate the LDP’s performance more favorably than women in the population? If so, show evidence to support your argument.

5. Exercise

women is a dataset built into R
women: Average Heights and Weights for American Women
women contains the average heights and weights for American women aged 30–39.
If you want to know the detailed information on women, type ?women at the Console in RStudio
Load women

data(women)
women <- data.frame(women)
women

   height weight
1      58    115
2      59    117
3      60    120
4      61    123
5      62    126
6      63    129
7      64    132
8      65    135
9      66    139
10     67    142
11     68    146
12     69    150
13     70    154
14     71    159
15     72    164

Q1: Convert the units of height and weight as follows:
・inch → cm
・pound → kg
Note: 1 inch = 2.54 cm, 1 pound = 0.4536 kg

Q2: Draw a scatter plot using height (x-axis) and weight (y-axis), and also add a regression line

Q3: Calculate the correlation coefficient between height and weight.
・Suppose this dataset is a sample drawn from population.
・Is this relationship also true in population?

Q4: Is the relationship between height and weight a correlation or a causation? Explain why.

6. Exercise

cars is a dataset built into R
cars: Speed and Stopping Distances of Cars
cars contains the speed of cars and the distances taken to stop. Note that the data were recorded in the 1920s.
If you want to know the detailed information on cars, type ?cars at the Console in RStudio
Load cars

data(cars)
cars

   speed dist
1      4    2
2      4   10
3      7    4
4      7   22
5      8   16
6      9   10
7     10   18
8     10   26
9     10   34
10    11   17
11    11   28
12    12   14
13    12   20
14    12   24
15    12   28
16    13   26
17    13   34
18    13   34
19    13   46
20    14   26
21    14   36
22    14   60
23    14   80
24    15   20
25    15   26
26    15   54
27    16   32
28    16   40
29    17   32
30    17   40
31    17   50
32    18   42
33    18   56
34    18   76
35    18   84
36    19   36
37    19   46
38    19   68
39    20   32
40    20   48
41    20   52
42    20   56
43    20   64
44    22   66
45    23   54
46    24   70
47    24   92
48    24   93
49    24  120
50    25   85

Q1: Convert the units of speed and dist as follows:

・miles per hour → km per hour
・foot → m
Note: 1 mile = approximately 1.6 km, 1 foot = 0.3048 m

Q2: Draw a scatter plot using speed (x-axis) and dist (y-axis), and also add a regression line

Q3: Calculate the correlation coefficient between speed and dist.

・Suppose this dataset is a sample drawn from population.
・Is this relationship also true in population?

Q4: Is the relationship between speed and dist a correlation or a causation? Explain why.


13. Chi-squared test & Correlations

Masahiko Asano

2021-11-30
