R packages we use in this section
library(corrr)
library(DT)
library(psych)
library(qgraph)
library(tidyverse)
A cross table is a two-way table consisting of columns and rows.
Its greatest strength is its ability to structure, summarize and display large amounts of data.
Cross tables can also be used to determine whether or not there is a relation between the row variable and the column variable.
A hypothetical dataset on the cabinet support rate in Japan
Null Hypothesis: In the population, there is no difference in cabinet support between men and women
Alternative Hypothesis: In the population, there is a difference in cabinet support between men and women
When the null hypothesis is rejected
→ Accept the alternative hypothesis
→ Statistically significant
→ We can conclude that there is a difference in cabinet support between men and women in the population
When the null hypothesis fails to be rejected
→ We cannot say anything
→ Not statistically significant
→ We cannot conclude that there is a difference in cabinet support between men and women in the population
Degrees of freedom (df): the number of values in the final calculation of a statistic that are free to vary
Degrees of freedom with these two variables, Cabinet support and Gender:
Cabinet support is free to vary in 2 ways: “not support”, “support”
Gender is free to vary in 2 ways: “female”, “male”
Degrees of freedom are calculated as follows:
degrees of freedom = (2-1)*(2-1) = 1
The test in this section therefore has 1 degree of freedom (df).
Null Hypothesis: In the population, there is no difference in cabinet support between men and women
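The degrees-of-freedom calculation above can be sketched in R as follows. This is only an illustration: the names n_rows, n_cols, df_tab and crit are hypothetical, and the 5% significance level is assumed because it is the threshold used throughout this section. qchisq() returns the value that the chi-squared statistic must exceed for the null hypothesis to be rejected.

```r
# Number of categories of each variable (both are 2 in the text)
n_rows <- 2  # gender: "female", "male"
n_cols <- 2  # support: "not_support", "support"

# Degrees of freedom for a cross table: (rows - 1) * (columns - 1)
df_tab <- (n_rows - 1) * (n_cols - 1)

# Critical value of the chi-squared distribution at the 5% level (assumed level)
crit <- qchisq(0.95, df = df_tab)

df_tab
round(crit, 3)
```

With 1 degree of freedom the critical value is about 3.84, so a chi-squared statistic larger than that leads to rejection at the 5% level.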
Download chi2_data2E.csv
Put the file (chi2_data2E.csv) into the data folder in your RProject folder
Load the data and name it cab2
cab2 <- read_csv("data/chi2_data2E.csv")
cab2
DT::datatable(cab2)
table_cab2 <- table(cab2$gender, cab2$support)
addmargins(table_cab2)
not_support support Sum
female 30 20 50
male 20 30 50
Sum 50 50 100
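If you do not have the CSV file at hand, the same cross table can be rebuilt from its counts. The sketch below is an assumption-laden stand-in: the vectors gender and support are hypothetical replacements for the columns of cab2, filled in from the counts printed above. prop.table() is added to show the share of supporters within each gender.

```r
# Rebuild respondent-level data from the counts in the table above
gender  <- rep(c("female", "male"), each = 50)
support <- c(rep(c("not_support", "support"), times = c(30, 20)),  # female
             rep(c("not_support", "support"), times = c(20, 30)))  # male

# Cross table with row and column totals
tab <- table(gender, support)
addmargins(tab)

# Row proportions: share of each answer within each gender
row_share <- prop.table(tab, margin = 1)
row_share
```

In this hypothetical sample, 40% of the female respondents and 60% of the male respondents support the cabinet.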
chisq.test(cab2$gender, cab2$support,
correct = FALSE)
Pearson's Chi-squared test
data: cab2$gender and cab2$support
X-squared = 4, df = 1, p-value = 0.0455
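The statistic reported above can be reproduced by hand. The following is a minimal sketch, assuming the counts in the cross table shown earlier; the names observed, expected, chi2, df_tab and p_value are hypothetical. It computes the expected counts under the null hypothesis, sums the squared deviations, and converts the result to a p-value without the continuity correction (matching correct = FALSE).

```r
# Observed counts from the cross table (rows: female, male)
observed <- matrix(c(30, 20,
                     20, 30),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(gender = c("female", "male"),
                                   support = c("not_support", "support")))

# Expected counts if gender and support were unrelated:
# (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson's chi-squared statistic
chi2 <- sum((observed - expected)^2 / expected)

# Degrees of freedom and upper-tail p-value
df_tab <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi2, df = df_tab, lower.tail = FALSE)

chi2
round(p_value, 4)
```

Every expected count is 25 here, so each of the four cells contributes (5^2)/25 = 1 to the statistic.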
Check whether the p-value is larger than 0.05 or not
Since the p-value (0.0455) is smaller than 0.05, we can reject the null hypothesis
What if you have only 50 respondents instead of 100?
Download chi2_data3E.csv
Put the file (chi2_data3E.csv) into the data folder in your RProject folder
Load the data and name it cab3
cab3 <- read_csv("data/chi2_data3E.csv")
cab3
DT::datatable(cab3)
table_cab3 <- table(cab3$gender, cab3$support)
addmargins(table_cab3)
not_support support Sum
female 15 10 25
male 10 15 25
Sum 25 25 50
chisq.test(cab3$gender, cab3$support,
correct = FALSE)
Pearson's Chi-squared test
data: cab3$gender and cab3$support
X-squared = 2, df = 1, p-value = 0.1573
Check whether the p-value is larger than 0.05 or not
Since the p-value (0.1573) is larger than 0.05, we cannot reject the null hypothesis
Summary: With the same proportions in the table, the smaller the sample, the less likely the chi-squared test is to yield a statistically significant result.
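The effect of sample size can be checked directly. The sketch below is an illustration under stated assumptions: base is a hypothetical 2 x 2 pattern with the same 60/40 split as the tables above, and it is scaled to 50, 100 and 200 respondents (the 200 case does not appear in the text and is added only for comparison).

```r
# Same proportions as in the text (60/40 vs 40/60), smallest whole-number form
base <- matrix(c(3, 2,
                 2, 3), nrow = 2, byrow = TRUE)

# Multipliers that give n = 50, 100 and 200 respondents
sizes <- c(5, 10, 20)

res <- sapply(sizes, function(k) {
  test <- chisq.test(base * k, correct = FALSE)
  c(n = sum(base * k),
    X2 = unname(test$statistic),
    p = test$p.value)
})

# One row per sample size
round(t(res), 4)
```

The chi-squared statistic grows in proportion to the sample size while the proportions stay fixed, which is why the same pattern is not significant with 50 respondents but is significant with 100.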
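When the sample is small (let’s say an expected cell count is smaller than 5), you cannot use the \(χ^2\) test
Fisher's exact test is typically used when the sample is small, but it is valid for all sample sizes.
Download chi2_data4E.csv
Put the file (chi2_data4E.csv) into the data folder in your RProject folder
Load the data and name it cab4
cab4 <- read_csv("data/chi2_data4E.csv")
cab4
DT::datatable(cab4)
table_cab4 <- table(cab4$gender, cab4$support)
addmargins(table_cab4)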
not_support support Sum
female 3 2 5
male 2 1 3
Sum 5 3 8
fisher.test(table_cab4, alternative = "less")
Fisher's Exact Test for Count Data
data: table_cab4
p-value = 0.7143
alternative hypothesis: true odds ratio is less than 1
95 percent confidence interval:
0.00000 17.73797
sample estimates:
odds ratio
0.7772203
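alternative = "less" requests a one-sided test: the alternative hypothesis is that the true odds ratio is less than 1
Check whether the p-value is larger than 0.05 or not
Since the p-value (0.7143) is larger than 0.05, we cannot reject the null hypothesis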
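To see where the p-value of Fisher's exact test comes from, the following sketch rebuilds the 2 x 2 table from the counts shown above and computes the one-sided p-value from the hypergeometric distribution. The names tab, expected, p_manual and p_fisher are hypothetical, and the table is typed in by hand rather than read from the CSV file.

```r
# Counts from the cross table above (rows: female, male)
tab <- matrix(c(3, 2,
                2, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(gender = c("female", "male"),
                              support = c("not_support", "support")))

# Expected counts: all are below 5, so the chi-squared test is not suitable
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
expected

# One-sided ("less") p-value: probability that the top-left cell is
# 3 or smaller, given the row and column totals
p_manual <- phyper(tab[1, 1],
                   m = sum(tab[, 1]),  # column 1 total
                   n = sum(tab[, 2]),  # column 2 total
                   k = sum(tab[1, ]))  # row 1 total

# Same p-value from fisher.test()
p_fisher <- fisher.test(tab, alternative = "less")$p.value

round(c(manual = p_manual, fisher = p_fisher), 4)
```

With the margins fixed, the top-left cell can only take the values 2 to 5, and the probability of observing 2 or 3 is 40/56, which is the 0.7143 reported above.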
Source: https://www.analyticsvidhya.com/blog/2021/01/beginners-guide-to-pearsons-correlation-coefficient/
x <- c(1, 5, 10)
y <- c(1, 2, 10)
Make a data frame (xy) containing these two variables
xy <- data.frame(x, y)
plot(x ~ y)
Draw a scatter plot using ggplot2 and add a regression line
library("tidyverse")
xy %>%
  ggplot(aes(x, y)) +
  geom_point() +
  stat_smooth(method = lm, se = FALSE)
Calculate the correlation coefficient between x and y
cor(x, y)
[1] 0.936599
Test the correlation between x and y and see if this correlation can be seen in the population
cor.test(x, y)
Pearson's product-moment correlation
data: x and y
t = 2.6729, df = 1, p-value = 0.2279
alternative hypothesis: true correlation is not equal to 0
sample estimates:
cor
0.936599
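The numbers in this output can be reproduced step by step. The sketch below is a hand calculation under the usual textbook formulas: Pearson's r from deviations about the means, then the test statistic t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom. The names r, t_stat and p_value are hypothetical.

```r
x <- c(1, 5, 10)
y <- c(1, 2, 10)
n <- length(x)

# Pearson's correlation coefficient from deviations about the means
dx <- x - mean(x)
dy <- y - mean(y)
r <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# t statistic for the null hypothesis "true correlation = 0"
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

round(c(r = r, t = t_stat, p = p_value), 4)
```

With only three observations there is a single degree of freedom, so even a very high sample correlation gives a large p-value.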
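Null hypothesis: the true correlation is equal to 0 in the population
p-value = 0.2279
Since the p-value is larger than 0.05, we cannot reject the null hypothesis
comp: electoral competitiveness in each electoral district (0 = not competitive, 1 = competitive)
Generate comp:
set.seed(12345)
comp <- rbinom(100, 1, .5)
set.seed(12345) fixes the seed of the random number generator, so that we get the same result.
Run the code without set.seed(12345) and see your results!
Check comp
hist(comp)
table(comp)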
comp
0 1
48 52
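The role of set.seed() can be illustrated with a short sketch. The object names a, b and c_unseeded are hypothetical. Resetting the same seed before each draw gives identical results, while a draw made without resetting the seed continues the random number stream and will usually differ.

```r
# Same seed, same draws
set.seed(12345)
a <- rbinom(100, 1, .5)

set.seed(12345)
b <- rbinom(100, 1, .5)

identical(a, b)  # TRUE: the two vectors match element by element

# No seed reset here, so this draw continues the random number stream
c_unseeded <- rbinom(100, 1, .5)

# Share of 1s in each draw
c(seeded = mean(a), unseeded = mean(c_unseeded))
```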
money: campaign money
Draw money from a normal distribution whose mean depends on comp (mean = 0.4 + 0.5*comp) with standard deviation (sd = 0.2).
money <- rnorm(100, mean = 0.4 + 0.5*comp, sd = 0.2)
turnout: voter turnout
Draw turnout from a normal distribution whose mean depends on comp (mean = 0.4 + 0.3*comp) with standard deviation (sd = 0.1).
turnout <- rnorm(100, mean = 0.4 + 0.3*comp, sd = 0.1)
Make a data frame named df
df <- data.frame(money = money,
                 turnout = turnout,
                 comp = as.factor(comp))
df
DT::datatable(df)
Draw a scatter plot with money (x axis) and turnout (y axis)
df %>%
  ggplot(aes(x = money, y = turnout)) +
  geom_point() +
  geom_smooth(se = FALSE, method = 'lm') +
  labs(x = "Campaign money (million yen)", y = "turnout (%)")
cor.test(money, turnout)
Pearson's product-moment correlation
data: money and turnout
t = 7.5656, df = 98, p-value = 2.118e-11
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.4664265 0.7179987
sample estimates:
cor
0.6072149
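The t statistic and the 95 percent confidence interval in this output can be recovered from the correlation coefficient and the sample size alone. The sketch below assumes the standard approach based on Fisher's z transformation (atanh), with standard error 1 / sqrt(n - 3); r and n are copied from the output above, and the names t_stat, z, se_z and ci are hypothetical.

```r
r <- 0.6072149  # correlation coefficient from the output above
n <- 100        # number of observations (df = 98 means n = 100)

# t statistic for the null hypothesis "true correlation = 0"
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# 95% confidence interval via Fisher's z transformation
z <- atanh(r)             # transform r to the z scale
se_z <- 1 / sqrt(n - 3)   # standard error on the z scale
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se_z)  # back-transform to r

round(t_stat, 4)
round(ci, 4)
```

The interval does not contain 0, which agrees with the very small p-value.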
Correlation coefficient = 0.6072149
Null hypothesis: the true correlation is equal to 0 in the population
p-value = 2.118e-11 = 0.00000000002118
Since the p-value is smaller than 0.05, we can reject the null hypothesis and conclude that there is a correlation between money and turnout in the population
But does this mean that money itself raises turnout?
Let's look at the relation between money and turnout in terms of comp (electoral competitiveness in each electoral district)
df %>%
  ggplot(aes(money, turnout)) +
  geom_point(aes(color = comp)) +
  geom_smooth(method = lm,
              se = FALSE,
              aes(color = comp)) +
  labs(x = "Campaign money (million yen)", y = "turnout (%)") +
  scale_color_discrete(name = "Electoral Competitiveness",
                       labels = c("Not_competitive","Competitive"))
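What the plot suggests can also be checked numerically by computing the correlation separately within each level of comp. The sketch below is not the data used in the text: it follows the same recipe for comp, money and turnout, but assumes a different seed (2024) and a much larger number of districts (10,000) so that the pattern is stable. The names r_all and r_by_comp are hypothetical.

```r
set.seed(2024)  # arbitrary seed, not the one used in the text
n <- 10000      # assumed: far more districts than the 100 in the text

# Same data-generating recipe as in the text
comp <- rbinom(n, 1, 0.5)
money <- rnorm(n, mean = 0.4 + 0.5 * comp, sd = 0.2)
turnout <- rnorm(n, mean = 0.4 + 0.3 * comp, sd = 0.1)

# Correlation in the pooled data
r_all <- cor(money, turnout)

# Correlation within each level of comp
r_by_comp <- sapply(split(data.frame(money, turnout), comp),
                    function(d) cor(d$money, d$turnout))

round(r_all, 3)
round(r_by_comp, 3)
```

In this recipe money and turnout both depend on comp but not on each other, so the pooled correlation is sizeable while the correlation within each group is close to zero. The pooled correlation reflects the common cause, not an effect of money on turnout.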
Using the qgraph package, you can visualize correlations
Download hr96-21.csv and put it into the data folder in your RProject folder
Load the data and name it hr
hr <- read_csv("data/hr96-21.csv",
               na = ".")
hr
names(hr)
[1] "year" "pref" "ku" "kun"
[5] "wl" "rank" "nocand" "seito"
[9] "j_name" "gender" "name" "previous"
[13] "age" "exp" "status" "vote"
[17] "voteshare" "eligible" "turnout" "seshu_dummy"
[21] "jiban_seshu" "nojiban_seshu"
hr contains the following variables (the table also lists mag and castvote, which do not appear in the names(hr) output above)
variable | detail |
---|---|
year | Election year (1996-2017) |
pref | Prefecture |
ku | Electoral district name |
kun | Number of electoral district |
mag | District magnitude (Number of candidate elected) |
rank | Ascending order of votes |
wl | 0 = loser / 1 = single-member district (smd) winner / 2 = zombie winner |
nocand | Number of candidates in each district |
seito | Candidate’s affiliated party |
j_name | Candidate’s name (Japanese) |
name | Candidate’s name (English) |
previous | Previous wins |
gender | Candidate’s gender:“male”, “female” |
age | Candidate’s age |
exp | Election expenditure (yen) spent by each candidate |
status | 0 = challenger / 1 = incumbent / 2 = former incumbent |
vote | votes each candidate garnered |
voteshare | Voteshare (%) |
eligible | Eligible voters in each district |
turnout | Turnout in each district (%) |
castvote | Total votes cast in each district |
seshu_dummy | 0 = Not-hereditary candidates, 1 = hereditary candidate |
jiban_seshu | Relationship between candidate and his predecessor |
nojiban_seshu | Relationship between candidate and his predecessor |
Campaign expenditure per voter (yen) spent by each candidate
Check the class of exp and eligible
str(hr$exp)
num [1:9660] 9828097 9311555 9231284 2177203 NA ...
str(hr$eligible)
num [1:9660] 346774 346774 346774 346774 346774 ...
Both variables are numeric
Using exp and eligible, make a new variable, exppv
hr <- hr %>%
  dplyr::mutate(exppv = exp/eligible)
Check the summary of exppv
summary(hr$exppv)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0013 8.1762 18.7646 23.0907 33.3863 120.8519 2831
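The NA's column in this summary arises because dividing a missing value gives a missing value. A minimal sketch with a toy data frame follows; toy is hypothetical, and its values are the first five entries of exp and eligible shown by str() above.

```r
# Toy data: first five values of exp and eligible from the str() output
toy <- data.frame(exp = c(9828097, 9311555, 9231284, 2177203, NA),
                  eligible = 346774)

# Expenditure per eligible voter; NA in exp carries over to exppv
toy$exppv <- toy$exp / toy$eligible
toy

# summary() reports the NA separately
summary(toy$exppv)

# mean() returns NA unless missing values are dropped explicitly
mean(toy$exppv)
mean(toy$exppv, na.rm = TRUE)
```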
hr2009 <- hr %>%
  dplyr::filter(year == 2009) %>%
  dplyr::select(age, nocand, rank, wl, previous, vote, voteshare, eligible, exp, exppv)
hr
DT::datatable(hr2009)
summary(hr2009)
age nocand rank wl
Min. :25.0 Min. :2.000 Min. :1.000 Min. :0.0000
1st Qu.:41.0 1st Qu.:3.000 1st Qu.:1.000 1st Qu.:0.0000
Median :50.0 Median :4.000 Median :2.000 Median :0.0000
Mean :50.1 Mean :4.005 Mean :2.496 Mean :0.4337
3rd Qu.:59.0 3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.:1.0000
Max. :85.0 Max. :9.000 Max. :9.000 Max. :2.0000
NA's :4
previous vote voteshare eligible
Min. : 0.000 Min. : 177 Min. : 0.10 Min. :211750
1st Qu.: 0.000 1st Qu.: 5992 1st Qu.: 2.40 1st Qu.:298265
Median : 0.000 Median : 62034 Median :30.00 Median :352167
Mean : 1.331 Mean : 61940 Mean :26.34 Mean :349973
3rd Qu.: 2.000 3rd Qu.:107292 3rd Qu.:47.30 3rd Qu.:405333
Max. :15.000 Max. :201461 Max. :95.30 Max. :487837
exp exppv
Min. : 10024 Min. : 0.0258
1st Qu.: 1794542 1st Qu.: 5.3290
Median : 4809437 Median : 13.9190
Mean : 6118181 Mean : 18.4032
3rd Qu.: 9109114 3rd Qu.: 27.3219
Max. :25354069 Max. :100.8919
NA's :15 NA's :15
Some variables contain NA's (missing values)
When the data contain NA, you need to add use = "complete.obs"
Using the output of the cor() function, we visualize the correlations on the 2009 lower house election results
corHR <- cor(hr2009, use = "complete.obs")
corHR
age nocand rank wl previous
age 1.00000000 -0.01949051 -0.18875565 0.06449578 0.48783691
nocand -0.01949051 1.00000000 0.35882406 -0.19173305 -0.12843147
rank -0.18875565 0.35882406 1.00000000 -0.58585349 -0.48212211
wl 0.06449578 -0.19173305 -0.58585349 1.00000000 0.39104752
previous 0.48783691 -0.12843147 -0.48212211 0.39104752 1.00000000
vote 0.19315400 -0.20289537 -0.86695872 0.63917654 0.53338630
voteshare 0.20543012 -0.23982316 -0.89626914 0.68207240 0.55385390
eligible -0.02675139 0.19908259 0.06906283 -0.11248525 -0.03812071
exp 0.30790965 -0.15181589 -0.61620242 0.40505780 0.57268149
exppv 0.28775237 -0.18464384 -0.59240696 0.42901652 0.54665625
vote voteshare eligible exp exppv
age 0.1931540 0.20543012 -0.02675139 0.30790965 0.2877524
nocand -0.2028954 -0.23982316 0.19908259 -0.15181589 -0.1846438
rank -0.8669587 -0.89626914 0.06906283 -0.61620242 -0.5924070
wl 0.6391765 0.68207240 -0.11248525 0.40505780 0.4290165
previous 0.5333863 0.55385390 -0.03812071 0.57268149 0.5466562
vote 1.0000000 0.96092467 0.14578029 0.66201463 0.5686694
voteshare 0.9609247 1.00000000 -0.05370451 0.69213847 0.6667191
eligible 0.1457803 -0.05370451 1.00000000 -0.06058169 -0.2921183
exp 0.6620146 0.69213847 -0.06058169 1.00000000 0.9498970
exppv 0.5686694 0.66671908 -0.29211834 0.94989701 1.0000000
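The effect of the use argument can be seen on a small example. The data frame toy below is hypothetical and has nothing to do with the election data; it only has one missing value in each of two columns. "pairwise.complete.obs" is another option of cor(), shown here for comparison.

```r
# Toy data with missing values in columns a and c
toy <- data.frame(a = c(1, 2, 3, 4, NA),
                  b = c(2, 4, 5, 4, 5),
                  c = c(5, NA, 3, 2, 1))

# Default: any pair involving a column with NA gives NA
r_default <- cor(toy)
r_default

# Only rows with no missing value in any column are used (rows 1, 3, 4)
r_complete <- cor(toy, use = "complete.obs")
r_complete

# For each pair, rows complete for that pair are used
r_pairwise <- cor(toy, use = "pairwise.complete.obs")
r_pairwise
```

With "complete.obs" every coefficient is based on the same rows, which is why a row with a single missing value is dropped from all pairs.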
cor1 <- qgraph(
corHR,
graph = "glasso",
sampleSize = nrow(hr2009),
tuning = 0,
layout = "spring",
title = "Correlations among variables of HR elections",
details = TRUE
)
qgraph(cor1,
filetype = 'pdf',
filename = "cor_hr2009",
height = 5,
width = 10)
Download JGSS-2008E.csv
JGSS-2008.csv is survey data collected one year prior to the 2009 lower house election in Japan.
variable | detail |
---|---|
serial | Serial number |
gender | male, female |
eval | 1 = evaluates the performance of the LDP positively, 0 = does not |
Q1: Using R, make a cross table between gender and eval
Q2: Can you conclude that men support the LDP’s performance more than women in the population? If so, show your evidence to support your argument.
women is a dataset that comes built into R (datasets package)
women: Average Heights and Weights for American Women
women contains the average heights and weights for American women aged 30–39.
For details, type ?women at Console in RStudio
Load women
data(women)
women <- data.frame(women)
women
height weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
7 64 132
8 65 135
9 66 139
10 67 142
11 68 146
12 69 150
13 70 154
14 71 159
15 72 164
Q1: Change the unit of height and weight as follows:
・inch → cm
・pound → kg
Note: 1 inch = 2.54 cm, 1 pound = 0.4536 kg
Q2: Draw a scatter plot using height (x-axis) and weight (y-axis), and also add a regression line
Q3: Calculate the correlation coefficient between height and weight.
・Suppose this dataset is a sample drawn from a population.
・Is this relationship also true in the population?
Q4: Is the relationship between height and weight a correlation or a causation? Explain why.
cars is a dataset that comes built into R (datasets package)
cars: Speed and Stopping Distances of Cars
cars contains the speed of cars and the distances taken to stop. Note that the data were recorded in the 1920s.
For details, type ?cars at Console in RStudio
Load cars
data(cars)
cars
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
7 10 18
8 10 26
9 10 34
10 11 17
11 11 28
12 12 14
13 12 20
14 12 24
15 12 28
16 13 26
17 13 34
18 13 34
19 13 46
20 14 26
21 14 36
22 14 60
23 14 80
24 15 20
25 15 26
26 15 54
27 16 32
28 16 40
29 17 32
30 17 40
31 17 50
32 18 42
33 18 56
34 18 76
35 18 84
36 19 36
37 19 46
38 19 68
39 20 32
40 20 48
41 20 52
42 20 56
43 20 64
44 22 66
45 23 54
46 24 70
47 24 92
48 24 93
49 24 120
50 25 85
Q1: Change the unit of speed and dist as follows:
・mph (miles per hour) → km/h
・foot → m
Note: 1 mile ≈ 1.6 km, 1 foot = 0.3048 m
Q2: Draw a scatter plot using speed (x-axis) and dist (y-axis), and also add a regression line
Q3: Calculate the correlation coefficient between speed and dist.
・Suppose this dataset is a sample drawn from a population.
・Is this relationship also true in the population?
Q4: Is the relationship between speed and dist a correlation or a causation? Explain why.