R packages we use in this section

library(broom)
library(car)
library(jtools)
library(patchwork)
library(QuantPsyc)
library(stargazer)
library(summarytools)
library(tidyverse)

1. Least Squares

1.1. Simple linear regression

The method of least squares is a standard approach in regression analysis to approximate the solution of overdetermined systems by minimizing the sum of the squares of the residuals made in the results of every single equation.
As shown in Section 13, correlation describes a linear relationship between two variables, x and y.

Common names for X and Y

X (Cause)	Y (Effect)
`explanatory variable`	`response variable`
`independent variable`	`dependent variable`
`predictor`	`outcome`
`explanatory variable`	`explained variable`
`regressor`	`regressand`

This relationship is best characterized using the following linear model:

\[Y = α + βX + ε\]

Y is the outcome or response variable and X is the predictor or independent (explanatory) variable.
Any line can be defined by the intercept \(α\) and the slope parameter \(β\).
The intercept \(α\) represents the average value of Y when X is zero.
The slope \(β\) measures the average increase in Y when X increases by one unit.
The intercept and slope parameters are together called coefficients.
The error (or disturbance) term, \(ε\), allows an observation to deviate from a perfect linear relationship.
Since the values of \(α\) and \(β\) in this equation are unknown to researchers, they must be estimated from the data.
The estimates of parameters are indicated by “hats,” where \(\hat{α}\) and \(\hat{β}\) represent the estimates of \(α\) and \(β\), respectively.

\[\hat{Y} = \hat{α} + \hat{β}x\]

Most likely, the predicted value will not equal the observed value.
The difference between the observed outcome and its predicted value is called the residual.
Formally, we can write the residual as:

\[\hat{ε} = Y - \hat{Y}\]
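As a minimal sketch of this definition, the base R code below fits a line to a small made-up dataset and checks that the residuals returned by `residuals()` are simply the observed values minus the fitted values. The vectors `x` and `y` are hypothetical and are not the Putnam data.

```r
# Hypothetical toy data (not from putnam.csv)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

fit <- lm(y ~ x)

y_hat   <- fitted(fit)   # predicted values: alpha_hat + beta_hat * x
eps_hat <- y - y_hat     # residuals computed by hand: observed minus predicted

# The hand-computed residuals agree with R's own residuals()
all.equal(unname(eps_hat), unname(residuals(fit)))

round(eps_hat, 2)
```

Positive residuals mark observations above the fitted line, negative residuals mark observations below it.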

The least squares estimates of the intercept (\(\hat{α}\)) and slope (\(\hat{β}\)) parameters are given as follows:

\[\hat{α} = \bar{Y} - \hat{β}\bar{X}\]

\[\hat{β} = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^{n}(X_i-\bar{X})^2}\]

The predicted outcome, \(\hat{Y}\), is given as follows:

\[\hat{Y} = \hat{α} + \hat{β}x \]

Let’s check the predicted value of \(Y\) when \(x\) equals its mean, \(\bar{X}\):

\[\hat{Y} = \hat{α} + \hat{β}\bar{X}\]

Substituting \(\hat{α} = \bar{Y} - \hat{β}\bar{X}\) (equation (4.6) in the source cited below) into this equation gives:

\[\hat{Y} = (\bar{Y} - \hat{β}\bar{X}) + \hat{β}\bar{X} = \bar{Y}\]

When the value of \(x\) is equal to its mean (\(\bar{X}\)), the predicted value of \(Y\) is also equal to its mean (\(\bar{Y}\)).
In the above plot, we see that this is indeed the case.
The regression line runs through the intersection of the vertical and horizontal dotted lines, which represent the means of X and Y, \((\bar{X}, \bar{Y})\), respectively.
A linear regression model always has zero average prediction error across all data points in the sample, but this does not necessarily mean that the linear regression model accurately represents the actual data-generating process.
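The two properties just stated can be checked numerically. The sketch below uses R's built-in `mtcars` data purely as a stand-in (car weight as x and miles per gallon as y are assumptions for illustration, unrelated to the Putnam data). It computes the least squares estimates from their closed-form expressions and compares them with `lm()`.

```r
# Built-in data used only for illustration
x <- mtcars$wt
y <- mtcars$mpg

# Closed-form least squares estimates
beta_hat  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
alpha_hat <- mean(y) - beta_hat * mean(x)

fit <- lm(y ~ x)

# 1. The formulas reproduce lm()'s coefficients
all.equal(c(alpha_hat, beta_hat), unname(coef(fit)))

# 2. The fitted line passes through (mean of x, mean of y)
all.equal(alpha_hat + beta_hat * mean(x), mean(y))

# 3. The residuals average to zero in the sample
mean_resid <- mean(residuals(fit))
abs(mean_resid) < 1e-10
```

A zero mean residual is a mechanical consequence of including an intercept, which is why it says nothing about whether the model is correct.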

Source: Imai, Kosuke. Quantitative Social Science: An Introduction (p.141-147). Princeton University Press. Kindle.

1.2 Manual calculation for \(α\) and \(β\)

Let’s calculate \(α\) and \(β\) by hand using the following dataset.
This video helps you understand how to calculate a simple regression function by hand.
Although the video uses a different dataset, it gives you a better understanding of simple linear regression.

Regression equation

\[\hat{Y} = \hat{α} + \hat{β}x\]

Slope (\(β\)) is defined as follows:

\[\hat{β} = r \frac{s_y}{s_x}\]

\(r\) → Pearson's correlation coefficient
\(s_y\) → standard deviation of y
\(s_x\) → standard deviation of x

Intercept (\(α\)) is defined as follows:

\[\hat{α} = \bar{Y} - \hat{β}\bar{x}\]

We can calculate the Pearson correlation coefficient (\(r\)) as follows:

\[r = \frac{\sum(x-\bar{x})(y-\bar{y})}{\sqrt{\sum(x-\bar{x})^2\sum(y-\bar{y})^2}}= \frac{422.95}{\sqrt{(745.55)(297.55)}}= \frac{422.95}{471.00}= 0.898\]

\(s_y\) → standard deviation of y

\[s_y= \sqrt{\frac{\sum(y-\bar{y})^2}{n-1}}= \sqrt{\frac{297.55}{19}}=3.957\]

\(s_x\) → standard deviation of x

\[s_x= \sqrt{\frac{\sum(x-\bar{x})^2}{n-1}}= \sqrt{\frac{745.55}{19}}=6.264\]

\[\hat{β} = r \frac{s_y}{s_x}= 0.898 \times \frac{3.957}{6.264}=0.567\]

\[\hat{α} = \bar{Y} - \hat{β}\bar{x}= 9.15 - 0.567 \times 11.35 =2.71\]

\[\hat{Y}=2.71+0.567x\]
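The hand calculation can be replayed in R without rounding at intermediate steps. The sketch below takes the sums of squares and cross-products (422.95, 745.55, 297.55), the sample size (n = 20), and the two means (11.35 and 9.15) from the worked example. The variable names are my own.

```r
# Quantities taken from the worked example
s_xy  <- 422.95   # sum of (x - mean(x)) * (y - mean(y))
s_xx  <- 745.55   # sum of (x - mean(x))^2
s_yy  <- 297.55   # sum of (y - mean(y))^2
n     <- 20
x_bar <- 11.35
y_bar <- 9.15

r   <- s_xy / sqrt(s_xx * s_yy)   # Pearson correlation coefficient
s_y <- sqrt(s_yy / (n - 1))       # standard deviation of y
s_x <- sqrt(s_xx / (n - 1))       # standard deviation of x

beta_hat  <- r * s_y / s_x              # slope
alpha_hat <- y_bar - beta_hat * x_bar   # intercept

round(c(r = r, s_y = s_y, s_x = s_x,
        beta_hat = beta_hat, alpha_hat = alpha_hat), 3)
```

Carried at full precision, the slope is about 0.567 and the intercept about 2.711, which are the values `lm()` reports for `Model_1` later in this section. Rounding each intermediate quantity to two decimals is what moves the hand-computed intercept away from that value.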

2. Four steps in empirical analysis

① Find a puzzle
② Present your theory to explain the puzzle
③ Present a testable hypothesis drawn from your theory
④ Test your hypothesis with data

2.1 Find a puzzle

Research Question
Why does the performance of local governments differ across Italy?

Data:

region: Abbreviation of Italian local governments
gov_p: Performance of Italian local governments

Source: Takeshi Iida, 計量政治分析 (Quantitative Political Analysis), p. 28

2.2 Present your theory to explain the puzzle

Theory: Social capital enhances local government’s performance.

Source: Robert Putnam (1994), Making Democracy Work: Civic Traditions in Modern Italy, Princeton, NJ: Princeton University Press.

Theory

Differences in local government performance can be explained by the degree of social capital in each region.
Social capital can be defined as “the networks of relationships among people who live and work in a particular society, enabling that society to function effectively”.
It involves the effective functioning of social groups through interpersonal relationships, a shared sense of identity, a shared understanding, shared norms, shared values, trust, cooperation, and reciprocity.
Social capital helps people cooperate with one another.
→ In areas with more social capital, people trust and cooperate with one another more, which leads to higher-quality government performance.

2.4 Present a testable hypothesis drawn from your theory

Hypothesis:

If the theory is correct, then we should see that a larger civic community index (cc) leads to higher government performance (gov_p).

3. Test your hypothesis with data

3.1 Data preparation (`putnam.csv`)

Download putnam.csv
Put the file (putnam.csv) into the data folder in your RProject folder
Load the data and name it putnam

putnam <- read_csv("data/putnam.csv")

Show the list of variables the data contains

names(putnam)

[1] "region"   "gov_p"    "cc"       "econ"     "location"

Data:

Type of variable	Variable	Details
Outcome	`gov_p`	Performance of Italian local governments
Identifier	`region`	Abbreviation of Italian local governments
Predictor	`cc`	Civic Community Index
Predictor	`econ`	Economy Index (the larger, the better)
Predictor	`location`	Area dummy (`north`, `south`)

Check putnam

DT::datatable(putnam)

3.2 Descriptive statistics

Two ways of showing descriptive statistics

summary()

summary(putnam)

    region              gov_p             cc              econ      
 Length:20          Min.   : 1.50   Min.   : 1.000   Min.   : 2.50  
 Class :character   1st Qu.: 6.25   1st Qu.: 3.875   1st Qu.: 6.25  
 Mode  :character   Median :10.00   Median :15.000   Median :11.75  
                    Mean   : 9.15   Mean   :11.350   Mean   :10.43  
                    3rd Qu.:11.25   3rd Qu.:16.250   3rd Qu.:14.50  
                    Max.   :16.00   Max.   :18.000   Max.   :19.00  
   location        
 Length:20         
 Class :character  
 Mode  :character

stargazer()

library("stargazer")

stargazer(as.data.frame(putnam), 
          type = "text", 
          title = "Descriptive Statistics on Local Government's Performance", 
          digits = 2)


Descriptive Statistics on Local Government's Performance
========================================================
Statistic N  Mean  St. Dev. Min  Pctl(25) Pctl(75)  Max 
--------------------------------------------------------
gov_p     20 9.15    3.96   1.50   6.25    11.25   16.00
cc        20 11.35   6.26    1     3.9      16.2    18  
econ      20 10.43   5.08    2     6.2      14.5    19  
--------------------------------------------------------
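To see what goes into a table like this, here is a rough base R equivalent. It is only a sketch: the helper `describe()` is hypothetical, and the built-in `mtcars` data stands in for `putnam`. Like the table above, it keeps numeric columns only, which is why character columns such as `region` and `location` do not appear.

```r
# Stand-in data; replace with your own data frame (e.g. putnam)
df <- mtcars[, c("mpg", "wt", "hp")]

# Hypothetical helper: one row of summary statistics per numeric column
describe <- function(data) {
  num <- data[sapply(data, is.numeric)]   # drop non-numeric columns
  out <- t(sapply(num, function(v) {
    c(N      = sum(!is.na(v)),
      Mean   = mean(v, na.rm = TRUE),
      St.Dev = sd(v, na.rm = TRUE),
      Min    = min(v, na.rm = TRUE),
      Pctl25 = unname(quantile(v, 0.25, na.rm = TRUE)),
      Pctl75 = unname(quantile(v, 0.75, na.rm = TRUE)),
      Max    = max(v, na.rm = TRUE))
  }))
  round(out, 2)
}

res <- describe(df)
res
```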

3.3 Draw a scatter plot

Draw a scatter plot between cc and gov_p

ggplot(putnam, aes(cc, gov_p)) +
  geom_point() +
  labs(x = "Civic Community Index (cc)", y = "Performance of Italian local governments (gov_p)",
         title = "Performance of Italian local governments & Civic Community Index") + 
  stat_smooth(method = lm) +  
  geom_text(aes(y = gov_p + 0.2, label = region), size = 4, vjust = 0)

Calculate the correlation coefficient between gov_p and cc

cor.test(putnam$gov_p, putnam$cc)


    Pearson's product-moment correlation

data:  putnam$gov_p and putnam$cc
t = 8.6583, df = 18, p-value = 7.806e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7558102 0.9593028
sample estimates:
      cor 
0.8979883

The correlation coefficient between gov_p and cc is 0.8979883
→ A strong positive correlation
However, correlation alone does not establish causality; we use regression analysis to examine whether cc affects gov_p
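The numbers in the `cor.test()` output can be reproduced from the correlation and the sample size alone. The sketch below assumes only r = 0.8979883 and n = 20 from the output above. The t statistic uses n - 2 degrees of freedom and the confidence interval uses the Fisher z transformation.

```r
r <- 0.8979883   # correlation reported by cor.test()
n <- 20          # number of regions

# t statistic and two-sided p-value for H0: true correlation = 0
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# 95% confidence interval via the Fisher z transformation
z    <- atanh(r)
se_z <- 1 / sqrt(n - 3)
ci   <- tanh(z + c(-1, 1) * qnorm(0.975) * se_z)

round(t_stat, 4)
signif(p_value, 4)
round(ci, 4)
```

The interval is asymmetric around r because it is built on the z scale and then transformed back.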

3.4 Simple Regression

Model_1<- lm(gov_p ~ cc, data = putnam)

3.4.1 How to show regression results

(1) `summary()`

summary(Model_1)


Call:
lm(formula = gov_p ~ cc, data = putnam)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5043 -1.3481 -0.2087  0.9764  3.4957 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.71115    0.84443   3.211  0.00485 ** 
cc           0.56730    0.06552   8.658 7.81e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.789 on 18 degrees of freedom
Multiple R-squared:  0.8064,    Adjusted R-squared:  0.7956 
F-statistic: 74.97 on 1 and 18 DF,  p-value: 7.806e-08

(2) `tidy()`

tidy(Model_1)

# A tibble: 2 x 5
  term        estimate std.error statistic      p.value
  <chr>          <dbl>     <dbl>     <dbl>        <dbl>
1 (Intercept)    2.71     0.844       3.21 0.00485     
2 cc             0.567    0.0655      8.66 0.0000000781

(3) `stargazer()`

Note: replace {r} with {r, results = "asis"} as the chunk option

stargazer(Model_1, type = "html")


	Dependent variable:

	gov_p

cc	0.567^***
	(0.066)

Constant	2.711^***
	(0.844)


Observations	20
R²	0.806
Adjusted R²	0.796
Residual Std. Error	1.789 (df = 18)
F Statistic	74.967^*** (df = 1; 18)

Note:	*p<0.1; **p<0.05; ***p<0.01

From this result, we get the Sample Regression Function for Model_1 as follows:

\[\widehat{gov\_p} = 2.711 + 0.567\,cc\]
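The sample regression function can be used directly for prediction. As a small sketch, the hypothetical helper `predict_gov_p()` below plugs in the coefficients printed by `summary(Model_1)` and evaluates them at a few values of cc within the observed range (1 to 18).

```r
# Coefficients copied from summary(Model_1)
alpha_hat <- 2.71115
beta_hat  <- 0.56730

# Hypothetical helper: predicted gov_p for a given civic community index
predict_gov_p <- function(cc) alpha_hat + beta_hat * cc

predict_gov_p(c(1, 10, 18))

# Raising cc by one unit raises the prediction by exactly beta_hat
predict_gov_p(11) - predict_gov_p(10)
```

With the fitted model object available, `predict(Model_1, newdata = data.frame(cc = c(1, 10, 18)))` should give the same predictions up to rounding of the copied coefficients.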

3.4.2 Interpretation of Regression Results

tidy(Model_1)

# A tibble: 2 x 5
  term        estimate std.error statistic      p.value
  <chr>          <dbl>     <dbl>     <dbl>        <dbl>
1 (Intercept)    2.71     0.844       3.21 0.00485     
2 cc             0.567    0.0655      8.66 0.0000000781

Coefficient on cc (.567):
When cc increases by one unit, gov_p increases by .567 points on average
What we want to know is whether this also holds in the population.
The p-value is 0.0000000781
This p-value is smaller than 0.05
→ We can reject the null hypothesis (coefficient on cc = 0 in the population) at the 5% significance level
→ Statistically significant at the 5% significance level
In general, “statistically significant” means statistically significant at the 5% significance level.
However, in this case, the p-value (0.0000000781) is far smaller than 0.05.
It is also smaller than 0.01
→ We can also reject the null hypothesis (coefficient on cc = 0 in the population) at the 1% significance level
→ Statistically significant at the 1% significance level
In the social sciences, we often report statistical significance visually using caterpillar plots (I will talk about this later).
→ We conclude that the coefficient on cc is not zero in the population

Null Hypothesis \(H_0: β_1 = 0\)
Alternative Hypothesis \(H_a: β_1 ≠ 0\)

We reject the null hypothesis and accept the alternative hypothesis
→ cc is positively related to gov_p in the population

Conclusion: The data support the hypothesis that a larger civic community index (cc) leads to higher government performance (gov_p)
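The test behind this conclusion is a t test on each coefficient. The sketch below rebuilds the `t value` and `Pr(>|t|)` columns from the estimates and standard errors printed by `summary(Model_1)`, using 18 residual degrees of freedom. The vector names are my own.

```r
# Estimates and standard errors copied from summary(Model_1)
estimate  <- c(intercept = 2.71115, cc = 0.56730)
std_error <- c(intercept = 0.84443, cc = 0.06552)
df_resid  <- 18   # n - 2 for a simple regression with n = 20

# t statistic for H0: coefficient = 0, and its two-sided p-value
t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df = df_resid)

data.frame(estimate, std_error,
           t_value     = round(t_value, 3),
           p_value     = signif(p_value, 3),
           reject_5pct = p_value < 0.05,
           reject_1pct = p_value < 0.01)
```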

Other stuff to consider:

\(R^2\) = 0.8064

81% of the variance of gov_p is explained by cc in Model_1
→ The other 19% remains unexplained
To increase the value of R-squared, we need to add another variable that affects the outcome, gov_p
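In a simple regression, \(R^2\) can be recovered in two ways, and the sketch below tries both with numbers already shown in this section: the correlation (0.8979883), the total sum of squares of gov_p (297.55), and the residual standard error (1.789 on 18 degrees of freedom). Because the residual standard error is rounded, the second route is approximate.

```r
r        <- 0.8979883   # correlation between gov_p and cc
tss      <- 297.55      # total sum of squares of gov_p
rse      <- 1.789       # residual standard error from summary(Model_1)
df_resid <- 18

# Route 1: in simple regression, R-squared is the squared correlation
r2_from_r <- r^2

# Route 2: one minus the share of variation left in the residuals
rss        <- rse^2 * df_resid   # residual sum of squares
r2_from_ss <- 1 - rss / tss

round(c(r2_from_r = r2_from_r, r2_from_ss = r2_from_ss), 4)
```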

4. Multiple Regression

In 3.4.2, we saw that when cc increases by one unit, gov_p increases by .567 points
However, it is not realistic to assume that an outcome is caused by ONE variable.
It is natural to think that most outcomes are caused by MULTIPLE variables!
If we fail to include a variable that affects the outcome (and is correlated with the predictor), then we face a selection bias problem (also known as omitted variable bias).
→ We need to include other independent variable(s) to avoid selection bias.

A solution to avoid selection bias → Multiple Regression Analysis
Multiple regression analysis enables us to control for the effects of other variables simultaneously
→ We can avoid the selection bias problem to a certain degree
→ Multiple regression analysis is the most basic way of reducing selection bias
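A small simulation can make the problem concrete. Everything below is made up for illustration: `z` plays the role of an omitted variable that drives both the predictor `x` and the outcome `y`, while the true effect of `x` on `y` is set to zero.

```r
set.seed(1)
n <- 2000

z <- rnorm(n)           # variable that affects both x and y
x <- z + rnorm(n)       # predictor, partly driven by z
y <- 2 * z + rnorm(n)   # outcome depends on z only; x has no effect

short <- lm(y ~ x)      # z omitted
long  <- lm(y ~ x + z)  # z controlled for

coef(short)["x"]   # far from zero: x picks up the effect of z
coef(long)["x"]    # close to zero once z is in the model
```

The short regression attributes the influence of `z` to `x`. Adding `z` as a control removes that bias, but only for variables that are actually observed and included.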

We add another variable, econ as the control variable to our model

Data:

Type of variable	Variable	Details
Outcome	`gov_p`	Performance of Italian local governments
Identifier	`region`	Abbreviation of Italian local governments
Predictor	`cc`	Civic Community Index
Predictor	`econ`	Economy Index (the larger, the better)
Predictor	`location`	Area dummy (`north`, `south`)

4.1 Six steps in Multiple Regression

Draw a scatter plot between X and Y
Sample Regression Function formula
Partial regression coefficient (\(α, β_1, β_2\))
F test
Adjusted Coefficient of Determination (\(Adjusted R^2\))
Standardized partial regression coefficient (\(β\))

(1) Draw a scatter plot between X and Y

Check the data we use

DT::datatable(putnam)

stargazer(as.data.frame(putnam), 
          type = "text", 
          title = "Descriptive Statistics on putnam", 
          digits = 2)


Descriptive Statistics on putnam
========================================================
Statistic N  Mean  St. Dev. Min  Pctl(25) Pctl(75)  Max 
--------------------------------------------------------
gov_p     20 9.15    3.96   1.50   6.25    11.25   16.00
cc        20 11.35   6.26    1     3.9      16.2    18  
econ      20 10.43   5.08    2     6.2      14.5    19  
--------------------------------------------------------

Draw a scatter plot between gov_p(Y-axis) and cc (X-axis)

plt_1 <- ggplot(putnam, aes(cc, gov_p)) +
  geom_point() +
  labs(x = "Civic Community Index (cc)", y = "Performance of Italian local governments (gov_p)",
         title = "Performance of Italian local governments & Civic Community Index") + 
  stat_smooth(method = lm) +  
  geom_text(aes(y = gov_p + 0.2, label = region), size = 4, vjust = 0)

plt_2 <-  ggplot(putnam, aes(econ, gov_p)) +
  geom_point() +
  labs(x = "Economy Index (econ)", y = "Performance of Italian local governments (gov_p)",
         title = "Performance of Italian local governments & Economy Index") + 
  stat_smooth(method = lm) +  
  geom_text(aes(y = gov_p + 0.2, label = region), size = 4, vjust = 0)

plt_1 + plt_2

We see that both cc and econ are positively correlated with gov_p

(2) Sample Regression Function formula

Model_1 <- lm(gov_p ~ cc, data = putnam)

Model_2 <- lm(gov_p ~ econ, data = putnam)

stargazer(Model_1, Model_2,     
          digits = 3,        
          style = "ajps", 
          dep.var.caption = "Outcome", 
          dep.var.labels = "Gov.Performance", 
          title = "Results of Model_1 & Model_2", 
          type ="html")

**Results of Model_1 & Model_2**

	Gov.Performance
	Model 1	Model 2

cc	0.567^***
	(0.066)
econ		0.589^***
		(0.120)
Constant	2.711^***	3.011^**
	(0.844)	(1.385)
N	20	20
R-squared	0.806	0.572
Adj. R-squared	0.796	0.549
Residual Std. Error (df = 18)	1.789	2.659
F Statistic (df = 1; 18)	74.967^***	24.097^***

***p < .01; **p < .05; *p < .1

stargazer() automatically adds asterisks to show the level of statistical significance (1%, 5%, and 10%)
p < .01 (1%) - - - \(***\)
p < .05 (5%) - - - \(**\)
p < .1 (10%) - - - \(*\)
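The star rule above is easy to express as a function. The helper `add_stars()` below is hypothetical and is not part of stargazer. It just applies the three thresholds listed here to a vector of p-values.

```r
# Hypothetical helper applying the 1% / 5% / 10% thresholds listed above
add_stars <- function(p) {
  ifelse(p < 0.01, "***",
         ifelse(p < 0.05, "**",
                ifelse(p < 0.1, "*", "")))
}

# p-values taken from the tidy() output for Model_1 and Model_2
p_values <- c(cc = 7.81e-08, econ = 0.000113, const_model2 = 0.0433)
add_stars(p_values)
```

Note that `summary()` uses different cutoffs (`***` below 0.001, `**` below 0.01, `*` below 0.05), so the same coefficient can carry a different number of stars depending on which function printed it.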

Results (Model_1)

Coefficient on cc (.567):
When cc increases by one unit, gov_p increases by .567 points on average
What we want to know is whether this also holds in the population.
Three asterisks on the coefficient of cc
→ We can reject the null hypothesis (coefficient on cc = 0 in the population) at the 1% significance level
→ Statistically significant at the 1% significance level

Results (Model_2)

Coefficient on econ (.589):
When econ increases by one unit, gov_p increases by .589 points on average
What we want to know is whether this also holds in the population.
Three asterisks on the coefficient of econ
→ We can reject the null hypothesis (coefficient on econ = 0 in the population) at the 1% significance level
→ Statistically significant at the 1% significance level

Both `cc` and `econ` are statistically significant

Note

If you want to check t-value and p-value, you should use summary() or tidy() as follows:

tidy(Model_1, conf.int = TRUE)

# A tibble: 2 x 7
  term        estimate std.error statistic      p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>        <dbl>    <dbl>     <dbl>
1 (Intercept)    2.71     0.844       3.21 0.00485         0.937     4.49 
2 cc             0.567    0.0655      8.66 0.0000000781    0.430     0.705

tidy(Model_2, conf.int = TRUE)

# A tibble: 2 x 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)    3.01      1.38       2.17 0.0433      0.102     5.92 
2 econ           0.589     0.120      4.91 0.000113    0.337     0.841
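The `conf.low` and `conf.high` columns are the estimate plus or minus a critical t value times the standard error. The helper `conf_int()` below is a hypothetical sketch of that calculation, fed with the estimates and standard errors shown in this section (18 residual degrees of freedom in both models).

```r
# Hypothetical helper: t-based confidence interval for one coefficient
conf_int <- function(estimate, std_error, df, level = 0.95) {
  t_crit <- qt(1 - (1 - level) / 2, df)   # critical value of the t distribution
  c(conf.low  = estimate - t_crit * std_error,
    conf.high = estimate + t_crit * std_error)
}

ci_cc   <- conf_int(0.56730, 0.06552, df = 18)   # cc in Model_1
ci_econ <- conf_int(0.589, 0.120, df = 18)       # econ in Model_2

round(ci_cc, 3)
round(ci_econ, 3)
```

Neither interval contains zero, which matches the p-values below 0.05 in the same rows.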

(3) Partial regression coefficient (α, β1, β2)

Model_3 <- lm(gov_p ~ cc + econ, data = putnam)

stargazer(Model_1, Model_2, Model_3,    
          digits = 3,        
          style = "ajps", 
          dep.var.caption = "Outcome", 
          dep.var.labels = "Gov.Performance", 
          title = "Results of Model_1 & Model_2 & Model_3", 
          type ="html")

**Results of Model_1 & Model_2 & Model_3**

	Gov.Performance
	Model 1	Model 2	Model 3

cc	0.567^***		0.754^***
	(0.066)		(0.152)
econ		0.589^***	-0.253
		(0.120)	(0.187)
Constant	2.711^***	3.011^**	3.236^***
	(0.844)	(1.385)	(0.912)
N	20	20	20
R-squared	0.806	0.572	0.825
Adj. R-squared	0.796	0.549	0.805
Residual Std. Error	1.789 (df = 18)	2.659 (df = 18)	1.749 (df = 17)
F Statistic	74.967^*** (df = 1; 18)	24.097^*** (df = 1; 18)	40.126^*** (df = 2; 17)

***p < .01; **p < .05; *p < .1

Sample Regression Function formula on Model_3

\[\widehat{gov\_p} = 3.236 + 0.754\,cc - 0.253\,econ\]

Interpretation: Model_1, Model_2, and Model_3

In Model_1, cc is statistically significant.
In Model_2, econ is statistically significant.
However, in Model_3, cc is statistically significant, but econ is not.
Why?
→ This suggests that the relationship between gov_p and econ in Model_2 is a spurious correlation: once cc is controlled for, econ no longer has a statistically significant relationship with gov_p
Coefficient on cc (.754):
When cc increases by one unit, gov_p increases by .754 points, holding econ constant
→ Statistically significant at the 1% significance level
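What "controlling for" another variable means can be shown with a residual-on-residual regression (the Frisch-Waugh-Lovell result). The sketch below uses the built-in `mtcars` data as a stand-in, with `wt` in the role of the predictor of interest and `hp` in the role of the control. These choices are assumptions for illustration only.

```r
# Multiple regression: coefficient on wt, controlling for hp
full <- lm(mpg ~ wt + hp, data = mtcars)

# Step 1: remove the part of the outcome explained by the control
mpg_res <- residuals(lm(mpg ~ hp, data = mtcars))

# Step 2: remove the part of the predictor explained by the control
wt_res <- residuals(lm(wt ~ hp, data = mtcars))

# Step 3: regress what is left of the outcome on what is left of the predictor
partial <- lm(mpg_res ~ wt_res)

# The two slopes are identical
all.equal(unname(coef(full)["wt"]), unname(coef(partial)["wt_res"]))
```

A partial regression coefficient therefore describes the relationship between the outcome and the part of a predictor that the other predictors cannot account for.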

(4) F test

An F-test is any statistical test in which the test statistic has an F-distribution under the null hypothesis.
It is most often used when comparing statistical models that have been fitted to a data set, in order to identify the model that best fits the population from which the data were sampled.
Check the F-statistic for Model_3

 F-statistic: 40.13 on 2 and 17 DF,  p-value: 3.645e-07

The F-statistic enables us to test whether the coefficients on cc and econ are simultaneously equal to 0
Null hypothesis: the coefficients on cc and econ are both equal to 0
F-statistic = 40.13, p-value = 3.645e-07
→ We can reject the null hypothesis.
Statistically significant at the 1% significance level → We can conclude that at least one of the two predictors is related to gov_p, so the model as a whole has explanatory power
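Two quick checks may help connect the F-statistic to numbers seen earlier. The hypothetical helper `f_from_r2()` computes the overall F-statistic from \(R^2\), the sample size n, and the number of predictors k. It is tried on Model_1 (\(R^2\) = 0.8064, k = 1). The p-value for Model_3 is then recovered from F = 40.13 with 2 and 17 degrees of freedom.

```r
# Hypothetical helper: overall F-statistic from R-squared
f_from_r2 <- function(r2, n, k) {
  (r2 / k) / ((1 - r2) / (n - k - 1))
}

# Model_1: summary() reports F = 74.97 on 1 and 18 DF
f_model1 <- f_from_r2(0.8064, n = 20, k = 1)
round(f_model1, 2)

# Model_3: upper-tail probability of F = 40.13 on 2 and 17 DF
p_model3 <- pf(40.13, df1 = 2, df2 = 17, lower.tail = FALSE)
signif(p_model3, 3)
```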

(5) Adjusted Coefficient of Determination (\(adjR^2\))

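In multiple regression, you use \(adjR^2\) instead of \(R^2\), because \(R^2\) never decreases when a predictor is added, whereas \(adjR^2\) penalizes additional predictors
Check the value of the adjusted R-squared for Model_3

 Adjusted R-squared:  0.805

About 80.5% of the variance of gov_p is explained by cc and econ in Model_3, after adjusting for the number of predictors
→ The other 19.5% remains unexplained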
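The adjustment itself is a one-line formula. The helper `adj_r2()` below is a hypothetical sketch that takes \(R^2\), the sample size n, and the number of predictors k, and is applied to the values reported in this section.

```r
# Hypothetical helper: adjusted R-squared from R-squared, n, and k predictors
adj_r2 <- function(r2, n, k) {
  1 - (1 - r2) * (n - 1) / (n - k - 1)
}

adj_model1 <- adj_r2(0.8064, n = 20, k = 1)   # Model_1
adj_model3 <- adj_r2(0.825,  n = 20, k = 2)   # Model_3, from the rounded R-squared

round(c(Model_1 = adj_model1, Model_3 = adj_model3), 4)
```

The Model_3 value comes out slightly below the reported 0.805 because the input \(R^2\) (0.825) is itself rounded to three decimals.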

(6) Standardized partial regression coefficient (\(β\))

Usually, the unit of each variable differs, such as cm, kg, US dollars, yen, %, etc.
→ It is not possible to judge the relative impact of each variable on the outcome by simply comparing the size of the coefficients.
By comparing the size of \(β\) (Standardized Beta Coefficient), we can compare the relative impact of each variable on the outcome.

library("QuantPsyc")

Model_3 <- lm(gov_p ~ cc + econ, data = putnam)
lm.beta(Model_3)

        cc       econ 
 1.1932019 -0.3255233

Interpretation of β (Model_3)

The coefficient (1.1932019) → when cc increases by one standard deviation, gov_p increases by 1.19 standard deviations
→ Statistically significant at the 1% significance level
The coefficient (-0.3255233) → when econ increases by one standard deviation, gov_p decreases by 0.33 standard deviations
→ Not statistically significant
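A standardized coefficient is the raw coefficient rescaled by the ratio of standard deviations. As a rough check, the hypothetical helper `standardize_beta()` below applies that rescaling to the rounded Model_3 coefficients and the standard deviations from the descriptive statistics table.

```r
# Hypothetical helper: standardized coefficient from a raw coefficient
standardize_beta <- function(b, sd_x, sd_y) b * sd_x / sd_y

# Standard deviations from the descriptive statistics table
sd_gov_p <- 3.96
sd_cc    <- 6.26
sd_econ  <- 5.08

# Raw coefficients from Model_3
beta_cc   <- standardize_beta(0.754,  sd_cc,   sd_gov_p)
beta_econ <- standardize_beta(-0.253, sd_econ, sd_gov_p)

round(c(cc = beta_cc, econ = beta_econ), 3)
```

The results agree with `lm.beta(Model_3)` to about two decimals. The small gap comes from using rounded inputs.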

4.2.8 How to present regression results

Tables

stargazer(Model_1, Model_2, Model_3,    
          digits = 3,        
          style = "ajps", 
          dep.var.caption = "Outcome", 
          dep.var.labels = "Gov.Performance", 
          title = "Results of Model_1 & Model_2 & Model_3", 
          type ="html")

**Results of Model_1 & Model_2 & Model_3**

	Gov.Performance
	Model 1	Model 2	Model 3

cc	0.567^***		0.754^***
	(0.066)		(0.152)
econ		0.589^***	-0.253
		(0.120)	(0.187)
Constant	2.711^***	3.011^**	3.236^***
	(0.844)	(1.385)	(0.912)
N	20	20	20
R-squared	0.806	0.572	0.825
Adj. R-squared	0.796	0.549	0.805
Residual Std. Error	1.789 (df = 18)	2.659 (df = 18)	1.749 (df = 17)
F Statistic	74.967^*** (df = 1; 18)	24.097^*** (df = 1; 18)	40.126^*** (df = 2; 17)

***p < .01; **p < .05; *p < .1

Caterpillar plots
(1) sjPlot

coef_sjplot <- sjPlot::plot_model(Model_3, 
                          show.values = TRUE,      # print the estimates on the plot
                          show.p = TRUE,           # add significance stars to the values
                          vline.color = "black",   # reference line at zero
                          order.terms = c(1, 2),
                          axis.labels = c("Economy Index (econ)", "Civic Community Index (cc)"),
                          axis.title = "Estimates", title = "Performance of Italian local governments, Civic Community, and Economy",
                          digits = 3) 

coef_sjplot

How to interpret the sjPlot output:

dot (●) is the estimate (= coefficient) for each explanatory variable
Horizontal lines (blue and red) show the 95% confidence intervals (α = 0.05)
Blue line: the blue line does not cross the black vertical line at zero
→ Statistically significant at the 5% significance level
Red line: the red line does cross the black vertical line at zero
→ Not statistically significant at the 5% significance level
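The reading rule for the plot can also be checked numerically. The sketch below builds 95% confidence intervals for the two Model_3 coefficients from the estimates and standard errors in the results table (17 residual degrees of freedom) and flags whether each interval crosses zero. The data frame and column names are my own.

```r
# Estimates and standard errors for Model_3, copied from the results table
coefs <- data.frame(
  term      = c("cc", "econ"),
  estimate  = c(0.754, -0.253),
  std_error = c(0.152, 0.187)
)

t_crit <- qt(0.975, df = 17)   # 95% interval, 17 residual degrees of freedom

coefs$conf_low     <- coefs$estimate - t_crit * coefs$std_error
coefs$conf_high    <- coefs$estimate + t_crit * coefs$std_error
coefs$crosses_zero <- coefs$conf_low < 0 & coefs$conf_high > 0

coefs
```

An interval that crosses zero corresponds to a line that crosses the vertical reference line, that is, a coefficient that is not statistically significant at the 5% level.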

(2) jtools

jtools::plot_summs(Model_3)

6. Exercise

Exercise 6.1

Download hr96-21.csv
Japanese lower house election results (1996-2021)
Put the file (hr96-21.csv) into the data folder in your RProject folder
Load the data and name it hr

hr <- read_csv("data/hr96-21.csv", 
               na = ".")

Check the variable names in hr

names(hr)

 [1] "year"          "pref"          "ku"            "kun"          
 [5] "wl"            "rank"          "nocand"        "seito"        
 [9] "j_name"        "gender"        "name"          "previous"     
[13] "age"           "exp"           "status"        "vote"         
[17] "voteshare"     "eligible"      "turnout"       "seshu_dummy"  
[21] "jiban_seshu"   "nojiban_seshu"

hr contains the following variables

variable	detail
year	Election year (1996-2021)
pref	Prefecture
ku	Electoral district name
kun	Number of electoral district
rank	Ascending order of votes
wl	0 = loser / 1 = single-member district (smd) winner / 2 = zombie winner
nocand	Number of candidates in each district
seito	Candidate’s affiliated party (in Japanese)
j_name	Candidate’s name (Japanese)
name	Candidate’s name (English)
previous	Previous wins
gender	Candidate’s gender:“male”, “female”
age	Candidate’s age
exp	Election expenditure (yen) spent by each candidate
status	0 = challenger / 1 = incumbent / 2 = former incumbent
vote	votes each candidate garnered
voteshare	Voteshare (%)
eligible	Eligible voters in each district
turnout	Turnout in each district (%)
castvote	Total votes cast in each district
seshu_dummy	0 = Not-hereditary candidates, 1 = hereditary candidate
jiban_seshu	Relationship between candidate and his predecessor
nojiban_seshu	Relationship between candidate and his predecessor

Select the 2009 election data from hr96-21.csv, and answer the following questions.

Q1: Using seito variable, make a party variable (DPJ dummy: dpj) where 民主党 = 1, others = 0. Add dpj to the dataframe, hr, as a new variable.

Q2: Using exp variable, make a new variable (expm) where the unit is not “Yen” but “million Yen”. Add expm to the dataframe, hr, as a new variable.

Q3: Using status variable, make a new variable, inc, where incumbent = 1, challenger and former incumbent = 0. Add inc to the dataframe, hr, as a new variable.

Q4: Show the descriptive statistics of hr using stargazer package.

Q5: Draw a scatter plot between expm (x-axis) and voteshare (y-axis)

Q6: Briefly explain the relationship between expm (x-axis) and voteshare (y-axis)

Q7: Suppose you want to know whether campaign money (expm) affects votes (voteshare).

○ Explanatory Variable — expm
○ Response Variable — voteshare

Select eight control variables from the list below, run a multiple regression, and show the results using summary(), tidy(), and stargazer package.
You need to briefly explain why you think the eight control variables you select affect voteshare.

variable	detail
1. mag	District magnitude (Number of candidate elected)
2. rank	Ascending order of votes
3. nocand	Number of candidates in each district
4. j_name	Candidate’s name (Japanese)
5. previous	Previous wins
6. gender	Candidate’s gender:“male”, “female”
7. age	Candidate’s age
8. wl	0 = loser / 1 = single-member district (smd) winner / 2 = zombie winner
9. vote	votes each candidate garnered
10. eligible	Eligible voters in each district
11. turnout	Turnout in each district (%)
12. castvote	Total votes cast in each district
13. seshu_dummy	0 = Not-hereditary candidates, 1 = hereditary candidate
14. dpj_dummy	0 = Not-dpj candidate, 1 = dpj candidate
15. inc	challenger and former incumbent = 0, incumbent = 1

Q8: Present your statistical results using sjPlot or jtools.

Q9: Interpret the results of multiple regression analysis.

References

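Takeshi Iida, 計量政治分析 (Quantitative Political Analysis), Kyoritsu Shuppan, 2013.

Robert Putnam (translated by Junichi Kawata), 哲学する民主主義 (Japanese edition of Making Democracy Work), NTT Publishing, 2001.

Hatsuru Morita, 実証分析入門 (Introduction to Empirical Analysis), Nippon Hyoron-sha, 2014.

Jaehyun Song and Yuki Yanai, 私たちのR: ベストプラクティスの探究 (Our R: In Search of Best Practices).

Shohei Doi (Hokkaido University Graduate School of Public Policy), Rで計量政治学入門 (Introduction to Quantitative Political Science with R).

Yuki Yanai (Kochi University of Technology), list of courses.

Masahiko Asano and Yuki Yanai, Rによる計量政治学 (Quantitative Political Science with R), Ohmsha, 2018.

Masahiko Asano and Kosuke Nakamura, 初めてのRStudio (First Steps with RStudio), Ohmsha, 2018.

Winston Chang, R Graphics Cookbook, O’Reilly Media, 2012.

Kieran Healy, Data Visualization, Princeton University Press, 2019.

Kosuke Imai, Quantitative Social Science: An Introduction, Princeton University Press, 2017.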

18. Linear Regression 1

Masahiko Asano

2021-11-08
