R packages we use in this sectionlibrary(broom)
library(car)
library(jtools)
library(patchwork)
library(QuantPsyc)
library(stargazer)
library(summarytools)
library(tidyverse)| X (Cause) | Y (Effect) |
|---|---|
explanatory variable |
response variable |
independent variable |
dependent variable |
predictor |
outcome |
explanatory variable |
explained variable |
regressor |
regressand |
\[\hat{Y} = \hat{α} + \hat{β}x\]
\[\hat{ε} = Y - \hat{Y}\]
\[\hat{Y} = \hat{α} + \hat{β}x \]
\[\hat{Y} = \hat{α} + \hat{β}\bar{X}\]
\[\hat{Y} = (\bar{Y} - \hat{β}\bar{X}) + \hat{β}\bar{X} = \bar{Y}\]
\[\bar{Y} - \hat{β}\bar{X} = \hat{α}\]
When the value of \(x\) is equal to its mean (\(\bar{X}\)), the value of \(Y\) is also equal to its mean (\(\bar{Y}\)).
In the above plot, we see that this is indeed the case.
The regression line runs through the intersection of the vertical and horizontal dotted lines, which represent the means of X and Y, \((\bar{X}, \bar{Y})\), respectively.
A linear regression model always has zero average prediction error across all data points in the sample, but this does not necessarily mean that the linear regression model accurately represents the actual data-generating process.
Source: Imai, Kosuke. Quantitative Social Science: An Introduction (p.141-147). Princeton University Press. Kindle.
\[\hat{Y} = \hat{α} + \hat{β}x\]
\[\hat{β} = r \frac{s_y}{s_x}\]
\(r\) → ピアソンの相関係数
\(s_y\) → y の標準偏差
\(s_x\) → x の標準偏差
\[\hat{α} = \bar{Y} - β\bar{x}\]
\[r = \frac{\sum((x-\bar{x})(y-\bar{y}))}{\sqrt{Σ(x-\bar{x})^2Σ(y-\bar{y})^2}}= \frac{422.95}{\sqrt{(745.55)(297.55)}}= \frac{422.95}{{471}}= 0.89\]
\(s_y\) → y’s standard deviation
\[s_y= \sqrt{\frac{Σ(y-\bar{y})^2}{{n-1}}}= \sqrt{\frac{297.55}{{19}}}=3.96\]
\(s_x\) → x’s standard deviasion
\[s_x= \sqrt{\frac{Σ(x-\bar{x})^2}{{n-1}}}= \sqrt{\frac{745.55}{{19}}}=6.26\]
\[\hat{β} = r \frac{s_y}{s_x}= 0.89*\frac{3.96}{6.26}=0.56\]
\[\hat{α} = \bar{Y} - β\bar{x}= 9.15 - 0.55*11.35 =2.9\]
① Find a puzzle
② Present your theory to explain the puzzle
③ Present a testable hypothesis drawn from your theory
④ Test your hypothesis with data
Research Question
Why do we witness the differences on local government’s performance in Italy?
Data:
region: Abbreviation of Italian local governmentsgov_p: Performance of Italian local governments出典:飯田健『計量政治分析』p.28
Theory: Social capital enhances local government’s performance.
Source: Robert Putnam, (1994) Making Democracy Work: Civic Traditions in Modern Italy,
Princeton, NJ: Princeton University Press)
Theory
The differences on local government’s performance can be explained by the degree of social capital in each local government.
Social capital can be defined as “the networks of relationships among people who live and work in a particular society, enabling that society to function effectively”.
It involves the effective functioning of social groups through interpersonal relationships, a shared sense of identity, a shared understanding, shared norms, shared values, trust, cooperation, and reciprocity.
Social capital help people build cooperation one another.
→ In the area with more social capital, the more people trust and cooperate one another, which leads to high quality government performance.
Hypothesis:
putnam.csv)data folder in your RProject folderputnamputnam <- read_csv("data/putnam.csv") names(putnam)[1] "region" "gov_p" "cc" "econ" "location"
Data:
| Types of variables | Variables | Details |
|---|---|---|
| Outcome | gov_p |
Performance of Italian local governments |
| Predictor | region |
Abbreviation of Italian local governments |
| Predictor | cc |
Civic Community Index |
| Predictor | econ |
Economy Index (the larger, the better) |
| Predictor | location |
Area dummy (north,south) |
putnumDT::datatable(putnam)summary()summary(putnam) region gov_p cc econ
Length:20 Min. : 1.50 Min. : 1.000 Min. : 2.50
Class :character 1st Qu.: 6.25 1st Qu.: 3.875 1st Qu.: 6.25
Mode :character Median :10.00 Median :15.000 Median :11.75
Mean : 9.15 Mean :11.350 Mean :10.43
3rd Qu.:11.25 3rd Qu.:16.250 3rd Qu.:14.50
Max. :16.00 Max. :18.000 Max. :19.00
location
Length:20
Class :character
Mode :character
stargazer()library("stargazer")stargazer(as.data.frame(putnam),
type = "text",
title = "Descriptve Statistics on Local Government's Performance",
digits = 2)
Descriptve Statistics on Local Government's Performance
========================================================
Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
--------------------------------------------------------
gov_p 20 9.15 3.96 1.50 6.25 11.25 16.00
cc 20 11.35 6.26 1 3.9 16.2 18
econ 20 10.43 5.08 2 6.2 14.5 19
--------------------------------------------------------
cc and gov_pggplot(putnam, aes(cc, gov_p)) +
geom_point() +
labs(x = "Civic Community Index (cc)", y = "Performance of Italian local governments (gov_p)",
title = "erformance of Italian local governments & Civic Community Index") +
stat_smooth(method = lm) +
geom_text(aes(y = gov_p + 0.2, label = region), size = 4, vjust = 0)gov_p and cccor.test(putnam$gov_p, putnam$cc)
Pearson's product-moment correlation
data: putnam$gov_p and putnam$cc
t = 8.6583, df = 18, p-value = 7.806e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.7558102 0.9593028
sample estimates:
cor
0.8979883
gov_p and cc is 0.8979883Model_1<- lm(gov_p ~ cc, data = putnam)Summary()summary(Model_1)
Call:
lm(formula = gov_p ~ cc, data = putnam)
Residuals:
Min 1Q Median 3Q Max
-2.5043 -1.3481 -0.2087 0.9764 3.4957
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.71115 0.84443 3.211 0.00485 **
cc 0.56730 0.06552 8.658 7.81e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.789 on 18 degrees of freedom
Multiple R-squared: 0.8064, Adjusted R-squared: 0.7956
F-statistic: 74.97 on 1 and 18 DF, p-value: 7.806e-08
tidy()tidy(Model_1)# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 2.71 0.844 3.21 0.00485
2 cc 0.567 0.0655 8.66 0.0000000781
stargazer(){r} with {r, results = "asis"} as the chunk optionstargazer(Model_1, type = "html")| Dependent variable: | |
| gov_p | |
| cc | 0.567*** |
| (0.066) | |
| Constant | 2.711*** |
| (0.844) | |
| Observations | 20 |
| R2 | 0.806 |
| Adjusted R2 | 0.796 |
| Residual Std. Error | 1.789 (df = 18) |
| F Statistic | 74.967*** (df = 1; 18) |
| Note: | p<0.1; p<0.05; p<0.01 |
\[\hat{gov_p}\ = 2.7 + .567cc\]
tidy(Model_1)# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 2.71 0.844 3.21 0.00485
2 cc 0.567 0.0655 8.66 0.0000000781
cc (.567):cc increases one unit, then gov_p increases by .567 pointsp-value is 0.0000000781cc = 0 in population) with 95% confidence intervalp-value (0.0000000781) is way smaller than 0.05.cc = 0 in population) with 99% confidence intervalcc is not zero in populationNull Hypothesis \(H_0: β_1 = 0\)
Alternative Hypothesis \(H_a: β_1 ≠ 0\)
cc is positively related to gov_p in populationgov_p is explained by cc in Model_1R-squared, we need to add another variable which affects the ourcome, gov_pcc increases one unit, then gov_p increases by .567 pointsA Solution to avoid selection bias → Multiple Regression Analysis
Multiple Regression analysis enables us to simultaneously control the effect of other variables
→ We can avoid selection bias problem to a certain degree
→ Multiple regression analysis is a most basic way of avoiding selection bias
econ as the control variable to our modelData:
| Types of variables | Variables | Details |
|---|---|---|
| Outcome | gov_p |
Performance of Italian local governments |
| Predictor | region |
Abbreviation of Italian local governments |
| Predictor | cc |
Civic Community Index |
| Predictor | econ |
Economy Index (the larger, the better) |
| Predictor | location |
Area dummy (north,south) |
F testDT::datatable(putnam)stargazer(as.data.frame(putnam),
type = "text",
title = "Descriptive Statistics on putnam",
digits = 2)
Descriptive Statistics on putnam
========================================================
Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
--------------------------------------------------------
gov_p 20 9.15 3.96 1.50 6.25 11.25 16.00
cc 20 11.35 6.26 1 3.9 16.2 18
econ 20 10.43 5.08 2 6.2 14.5 19
--------------------------------------------------------
gov_p(Y-axis) and cc (X-axis)plt_1 <- ggplot(putnam, aes(cc, gov_p)) +
geom_point() +
labs(x = "Civic Community Index (cc)", y = "Performance of Italian local governments (gov_p)",
title = "Performance of Italian local governments & Civic Community Index") +
stat_smooth(method = lm) +
geom_text(aes(y = gov_p + 0.2, label = region), size = 4, vjust = 0)plt_2 <- ggplot(putnam, aes(econ, gov_p)) +
geom_point() +
labs(x = "Economy Index (econ)", y = "Performance of Italian local governments (gov_p)",
title = "Performance of Italian local governments & Economy Index") +
stat_smooth(method = lm) +
geom_text(aes(y = gov_p + 0.2, label = region), size = 4, vjust = 0)plt_1 + plt_2Model_1 <- lm(gov_p ~ cc, data = putnam)Model_2 <- lm(gov_p ~ econ, data = putnam)stargazer(Model_1, Model_2,
digits = 3,
style = "ajps",
dep.var.caption = "Outcome",
dep.var.labels = "Gov.Performance",
title = "Results of Model_1 & Model_2",
type ="html")| Gov.Performance | ||
| Model 1 | Model 2 | |
| cc | 0.567*** | |
| (0.066) | ||
| econ | 0.589*** | |
| (0.120) | ||
| Constant | 2.711*** | 3.011** |
| (0.844) | (1.385) | |
| N | 20 | 20 |
| R-squared | 0.806 | 0.572 |
| Adj. R-squared | 0.796 | 0.549 |
| Residual Std. Error (df = 18) | 1.789 | 2.659 |
| F Statistic (df = 1; 18) | 74.967*** | 24.097*** |
| p < .01; p < .05; p < .1 | ||
Stargazer() automatically add asterisks to show the level of statistical significant levels (1%, 5%, and 10%)
p < .01(1%)- - - 「\(***\)」
p < .05(5%)- - - 「\(**\)」
p < .1 (10%)- - - 「\(*\)」
Results (Model_1)
cc (.567):cc increases one unit, then gov_p increases by .567 pointscccc = 0 in population) with 99% confidence intervalResults (Model_2)
econ (.589):econ increases one unit, then gov_p increases by .589 pointseconcc = 0 in population) with 99% confidence interval
cc and econ are statistically significantNote
t-value and p-value, you should use summary() or tidy() as follows:tidy(Model_1, conf.int = TRUE)# A tibble: 2 x 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 2.71 0.844 3.21 0.00485 0.937 4.49
2 cc 0.567 0.0655 8.66 0.0000000781 0.430 0.705
tidy(Model_2, conf.int = TRUE)# A tibble: 2 x 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 3.01 1.38 2.17 0.0433 0.102 5.92
2 econ 0.589 0.120 4.91 0.000113 0.337 0.841
Model_3 <- lm(gov_p ~ cc + econ, data = putnam)stargazer(Model_1, Model_2, Model_3,
digits = 3,
style = "ajps",
dep.var.caption = "Outcome",
dep.var.labels = "Gov.Performance",
title = "Results of Model_1 & Model_2 & Model_3",
type ="html")| Gov.Performance | |||
| Model 1 | Model 2 | Model 3 | |
| cc | 0.567*** | 0.754*** | |
| (0.066) | (0.152) | ||
| econ | 0.589*** | -0.253 | |
| (0.120) | (0.187) | ||
| Constant | 2.711*** | 3.011** | 3.236*** |
| (0.844) | (1.385) | (0.912) | |
| N | 20 | 20 | 20 |
| R-squared | 0.806 | 0.572 | 0.825 |
| Adj. R-squared | 0.796 | 0.549 | 0.805 |
| Residual Std. Error | 1.789 (df = 18) | 2.659 (df = 18) | 1.749 (df = 17) |
| F Statistic | 74.967*** (df = 1; 18) | 24.097*** (df = 1; 18) | 40.126*** (df = 2; 17) |
| p < .01; p < .05; p < .1 | |||
Sample Regression Function formula on Model_3
\[\hat{gov_p}\ = 3.236 + 0.754cc - 0.253econ\]
Interpretation :Model_2, Model_2, and Model_3
cc is statistically significant.econ is statistically significant.cc is statistically significant, but econ is not.gov_p and econ is spurious correlationcc (.754):cc increases one unit, then gov_p increases by .754 pointsAn F-test is any statistical test in which the test statistic has an F-distribution under the null hypothesis.
It is most often used when comparing statistical models that have been fitted to a data set, in order to identify the model that best fits the population from which the data were sampled.
Check the values on F-statistics on Model_3
F-statistic: 40.13 on 2 and 17 DF, p-value: 3.645e-07
F-statistic enables us to test whether the coefficient of cc and econ are simultaneously equal to 0cc and econ are simultaneously equal to 0F-statistic = 40.13, p-value = 3.645e-07Model_3Adjusted R-squared: 0.805
gov_p is explained by cc and econ in Model_3cm, kg, US dollors, yen, %, etc.Standardized Beta Coefficient), we can compare the relative impact of each variable on the outcome. library("QuantPsyc") Model_3 <- lm(gov_p ~ cc + econ, data = putnam)
lm.beta(Model_3) cc econ
1.1932019 -0.3255233
Interpretation of β (Model_3)
The coefficient (1.1932019) → when cc increases one standard deviation, gov_p increases its standard deviation by 1.19
→ Statistically significant with the 1% significant level
The coefficient (-0.3255233) → when econ increases one standard deviation, gov_p decreases its standard deviation by 0.33
→ Not statistically significant
Tables
stargazer(Model_1, Model_2, Model_3,
digits = 3,
style = "ajps",
dep.var.caption = "Outcome",
dep.var.labels = "Gov.Performance",
title = "Results of Model_1 & Model_2 & Model_3",
type ="html")| Gov.Performance | |||
| Model 1 | Model 2 | Model 3 | |
| cc | 0.567*** | 0.754*** | |
| (0.066) | (0.152) | ||
| econ | 0.589*** | -0.253 | |
| (0.120) | (0.187) | ||
| Constant | 2.711*** | 3.011** | 3.236*** |
| (0.844) | (1.385) | (0.912) | |
| N | 20 | 20 | 20 |
| R-squared | 0.806 | 0.572 | 0.825 |
| Adj. R-squared | 0.796 | 0.549 | 0.805 |
| Residual Std. Error | 1.789 (df = 18) | 2.659 (df = 18) | 1.749 (df = 17) |
| F Statistic | 74.967*** (df = 1; 18) | 24.097*** (df = 1; 18) | 40.126*** (df = 2; 17) |
| p < .01; p < .05; p < .1 | |||
Caterpillar plots
(1) sjPlot
coef_sjplot <- sjPlot::plot_model(Model_3,
show.values = T,
show.p = T,
vline.color = "black",
order.terms = c(1, 2),
axis.labels = c("Economy Index (econ)", "Civic Community Index (cc)"),
axis.title = "Estimates", title = "Performance of Italian local governments, Civic Community, and Economy",
digits = 3)
coef_sjplot【How to interpret sjPlot】:
dot (●) is the estimate (= coefficient) for each explanatory variable
Holizontal lines (blue and red) show the 95% confidence intervals with α = 0.05 (= 5%)
Blue line・・・The blue line does not cross the black vertical line
→ Statistically significant with 5% significant level
Red line・・・The blue line does cross the black vertical line
→ Not statistically significant with 5% significant level
(2) jtools
jtools::plot_summs(Model_3)data folder in your RProject folderhrhr <- read_csv("data/hr96-21.csv",
na = ".")hrnames(hr) [1] "year" "pref" "ku" "kun"
[5] "wl" "rank" "nocand" "seito"
[9] "j_name" "gender" "name" "previous"
[13] "age" "exp" "status" "vote"
[17] "voteshare" "eligible" "turnout" "seshu_dummy"
[21] "jiban_seshu" "nojiban_seshu"
df1 contains the following 23 variables| variable | detail |
|---|---|
| year | Election year (1996-2017) |
| pref | Prefecture |
| ku | Electoral district name |
| kun | Number of electoral district |
| rank | Ascending order of votes |
| wl | 0 = loser / 1 = single-member district (smd) winner / 2 = zombie winner |
| nocand | Number of candidates in each district |
| seito | Candidate’s affiliated party (in Japanese) |
| j_name | Candidate’s name (Japanese) |
| name | Candidate’s name (English) |
| previous | Previous wins |
| gender | Candidate’s gender:“male”, “female” |
| age | Candidate’s age |
| exp | Election expenditure (yen) spent by each candidate |
| status | 0 = challenger / 1 = incumbent / 2 = former incumbent |
| vote | votes each candidate garnered |
| voteshare | Voteshare (%) |
| eligible | Eligible voters in each district |
| turnout | Turnout in each district (%) |
| castvote | Total votes cast in each district |
| seshu_dummy | 0 = Not-hereditary candidates, 1 = hereditary candidate |
| jiban_seshu | Relationship between candidate and his predecessor |
| nojiban_seshu | Relationship between candidate and his predecessor |
hr96-21.csv, and answer the following questions.Q1: Using seito variable, make a party variable (DPJ dummy: dpj) where 民主党 = 1, others = 0. Add dpj to the dataframe, hr, as a new variable.
Q2: Using exp variable, make a new variable (expm) where the unit is not “Yen” but “million Yen”. Add expm to the dataframe, hr, as a new variable.
Q3: Using status variable, make a new variable, inc, where incumbent = 1, challenger and former incumbent = 0. Add inc to the dataframe, hr, as a new variable.
Q4: Show the descriptive statistics of hr using stargazer package.
Q5: Draw a scatter plot between expm (x-axis) and voteshare (y-axis)
Q6: Briely explain the relationship between expm (x-axis) and voteshare (y-axis)
Q7: Suppose you want to know whether campaign money (expm) affects votes (voteshare).
○ Explanatory Variable — expm
○ Response Variable — voteshare
summary(), tidy(), and stargazer package.voteshare.| variable | detail |
|---|---|
| 1. mag | District magnitude (Number of candidate elected) |
| 2. rank | Ascending order of votes |
| 3. nocand | Number of candidates in each district |
| 4. j_name | Candidate’s name (Japanese) |
| 5. previous | Previous wins |
| 6. gender | Candidate’s gender:“male”, “female” |
| 7. age | Candidate’s age |
| 8. wl | 0 = loser / 1 = single-member district (smd) winner / 2 = zombie winner |
| 9. vote | votes each candidate garnered |
| 10. eligible | Eligible voters in each district |
| 11.turnout | Turnout in each district (%) |
| 12.castvote | Total votes cast in each district |
| 13. seshu_dummy | 0 = Not-hereditary candidates, 1 = hereditary candidate |
| 14. dpj_dummy | 0 = Not-dpj candidate, 1 = dpj candidate |
| 15. inc | challenger and former incumbent = 0, incumbent = 1 |
Q8: Present your statistical results using sjPlot or jtools.
Q9: Interpret the results of multiple regression analysis.
References