R packages we use in this section

library(tidyverse)
library(stargazer)

1. What can we do with a dummy?

1.1 What is a dummy variable?

A dummy variable takes only the values 0 and 1, indicating the absence (0) or presence (1) of a categorical attribute that may be expected to shift the outcome.
Dummy variables can be thought of as numeric stand-ins for qualitative facts in a regression model: they sort observations into mutually exclusive categories (for example, winner = 1 and loser = 0 in an election).
Other examples of dummy variables:

male = 1, female = 0
war = 1, no war = 0
north = 1, south = 0
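
A minimal sketch of how such a coding might be created in R. The vector `area` and its values are made up for illustration and are not part of the course data.

```r
# Hypothetical categorical vector (illustrative values only)
area <- c("north", "south", "south", "north", "south")

# The comparison returns TRUE/FALSE; as.numeric() turns it into 1/0
north_dummy <- as.numeric(area == "north")

north_dummy

# Cross-tabulate to confirm each category maps to exactly one value
table(area, north_dummy)
```

Because the comparison is either true or false for every observation, the two categories are mutually exclusive by construction.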

1.2 What we can do with a dummy variable

Research Question

Does the location of a local government (north or south) matter in predicting local government performance in Italy?

Without controlling for the location dummy (left)

The better the economic situation, the higher government performance in Italy appears to be.

Controlling for the location dummy (right)

The location of a local government (north or south) matters in predicting its performance.
Local government performance is higher in the north than in the south.
However, once location is controlled for, the economic situation does not matter in predicting government performance in Italy.

2. Economic Situation and Location

Theory: Social capital enhances local government performance.

Source: Robert Putnam (1994), Making Democracy Work: Civic Traditions in Modern Italy, Princeton, NJ: Princeton University Press.

Theory

Differences in local government performance can be explained by the degree of social capital in each local government’s area.
Social capital can be defined as “the networks of relationships among people who live and work in a particular society, enabling that society to function effectively”.
It involves the effective functioning of social groups through interpersonal relationships, a shared sense of identity, a shared understanding, shared norms, shared values, trust, cooperation, and reciprocity.
Social capital helps people cooperate with one another.
→ In areas with more social capital, people trust and cooperate with one another more, which leads to higher-quality government performance.

3. Testing Goldberg’s Argument

✔ Goldberg’s argument (1996)

The north and the south of Italy have very different histories, traditions, and cultures.
These differences between the north and the south explain differences in society, such as in politics and the economy.
So you need to take the north-south difference into consideration when analyzing what is related to government performance.
North → more social capital → higher government performance
South → less social capital → lower government performance
Let’s check whether Goldberg’s claim is correct.

3.1 Does `gov_p` differ by location?

Download (putnam.csv)
Put the file (putnam.csv) into the data folder inside your RProject folder
Load the data and name it putnam

putnam <- read_csv("data/putnam.csv")

Show the list of variables the data contains

names(putnam)

[1] "region"   "gov_p"    "cc"       "econ"     "location"

Data:

Type of variable	Variable	Details
Outcome	`gov_p`	Performance of Italian local governments
Identifier	`region`	Abbreviated name of the local government
Predictor	`cc`	Civic Community Index
Predictor	`econ`	Economy Index (the larger, the better)
Predictor	`location`	Area dummy (`north`, `south`)

Check putnam

DT::datatable(putnam)

Let’s check if the government performance differs by location

Draw a box plot

# location is categorical, so a box plot per group is used;
# a regression line (stat_smooth) is not meaningful here and is omitted
putnam %>% 
  ggplot(aes(x = location, y = gov_p, fill = location)) +
  geom_boxplot() +
  labs(x = "Location Dummy", y = "gov_p",
       title = "Government Performance in Italy by Location")

It looks like there is a clear difference between the north and the south.
Conduct an unpaired (Welch) t-test

t.test(putnam$gov_p[putnam$location == "north"],
       putnam$gov_p[putnam$location == "south"])


    Welch Two Sample t-test

data:  putnam$gov_p[putnam$location == "north"] and putnam$gov_p[putnam$location == "south"]
t = 6.8253, df = 14.552, p-value = 6.737e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 4.607777 8.808890
sample estimates:
mean of x mean of y 
 11.83333   5.12500

Result

・Average gov_p (North) = 11.833
・Average gov_p (South) = 5.125
・The difference (11.833 - 5.125 = 6.708) is statistically significant at the 1% significance level (p-value = 6.737e-06)
→ As Goldberg (1996) argues, there is a clear difference in government performance between the north and the south.
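
The same difference in means can also be read off a regression of the outcome on a 0/1 dummy alone. The sketch below uses a small made-up data frame (`toy`, with hypothetical columns `score` and `north`), not the putnam data, to show that the intercept is the mean of the 0 group and the slope is the difference in means.

```r
# Made-up scores for two groups (NOT the putnam data)
toy <- data.frame(
  score = c(12, 10, 11, 13, 5, 6, 4, 5),
  north = c(1, 1, 1, 1, 0, 0, 0, 0)   # 1 = north, 0 = south
)

# Group means and their difference
mean_north <- mean(toy$score[toy$north == 1])
mean_south <- mean(toy$score[toy$north == 0])
diff_means <- mean_north - mean_south

# Regress the outcome on the dummy alone:
# intercept = mean of the 0 group, slope = difference in means
fit <- lm(score ~ north, data = toy)

coef(fit)
diff_means
```

The point estimates agree, but the standard errors need not: the Welch test does not assume equal variances in the two groups, whereas the default `lm()` standard errors do.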

The next question is whether the economic situation is related to government performance in both the north and the south.

3.2 Does `econ` explain `gov_p`?

It seems that the economy (econ) is related to government performance (gov_p) in Italy.
However, it is not yet clear whether this holds in both the north and the south.
Draw a scatter plot of gov_p against econ

putnam %>% 
  ggplot(aes(econ, gov_p)) +
  geom_point() +
  theme_bw() +
  labs(x = "econ", y = "gov_p",
         title = "Economic situation and Government Performance in Italy") + 
  stat_smooth(method = lm, se = FALSE)

We see a positive correlation between econ and gov_p.
→ The better the economic situation, the higher the local government performance.
Let’s get the Sample Regression Function (SRF) for model_1.

model_1 <- lm(gov_p ~ econ, data = putnam)

summary(model_1)


Call:
lm(formula = gov_p ~ econ, data = putnam)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.3386 -1.7733  0.0086  0.8336  5.5114 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.0108     1.3847   2.174 0.043264 *  
econ          0.5889     0.1200   4.909 0.000113 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.659 on 18 degrees of freedom
Multiple R-squared:  0.5724,    Adjusted R-squared:  0.5487 
F-statistic:  24.1 on 1 and 18 DF,  p-value: 0.0001131

\[\widehat{gov\_p} = 3.011 + 0.589\,econ\]
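
To see what the SRF implies, the sketch below plugs a few econ values into it. The coefficients are copied from the `summary(model_1)` output above; the econ values in `econ_values` are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(model_1) output above
b0 <- 3.0108   # intercept
b1 <- 0.5889   # slope on econ

# Hypothetical econ values chosen for illustration
econ_values <- c(5, 10, 15)

# Fitted government performance: b0 + b1 * econ
predicted <- b0 + b1 * econ_values
round(predicted, 2)
```

With the fitted model object available, `predict(model_1, newdata = data.frame(econ = c(5, 10, 15)))` should give the same values up to rounding.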

Check the class of variables contained in putnam

str(putnam)

spec_tbl_df [20 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ region  : chr [1:20] "Ab" "Ba" "Cl" "Cm" ...
 $ gov_p   : num [1:20] 7.5 7.5 1.5 2.5 16 12 10 11 11 9 ...
 $ cc      : num [1:20] 8 4 1 2 18 17 13 16 17 15.5 ...
 $ econ    : num [1:20] 7 3 3 6.5 13 14.5 12.5 15.5 19 10.5 ...
 $ location: chr [1:20] "south" "south" "south" "south" ...
 - attr(*, "spec")=
  .. cols(
  ..   region = col_character(),
  ..   gov_p = col_double(),
  ..   cc = col_double(),
  ..   econ = col_double(),
  ..   location = col_character()
  .. )

→ Change the class of location from character to numeric
→ Save the result as a new data frame named df2

df2 <- mutate(putnam, 
              location = as.numeric(location == "north" )) # north = 1, south = 0

DT::datatable(df2)
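
As an aside, explicit 0/1 coding is not the only option: `lm()` also builds the dummy column itself when given a factor. The sketch below compares the two approaches on a made-up data frame (`toy`, not the putnam data); the column names `north` and `location_f` are hypothetical.

```r
# Made-up data frame (NOT the putnam data)
toy <- data.frame(
  y        = c(3, 4, 9, 10),
  location = c("south", "south", "north", "north")
)

# Option 1: explicit 0/1 coding, as in the text
toy$north <- as.numeric(toy$location == "north")

# Option 2: a factor with "south" as the reference level;
# lm() then creates the 0/1 column internally
toy$location_f <- factor(toy$location, levels = c("south", "north"))

fit_num <- lm(y ~ north, data = toy)
fit_fac <- lm(y ~ location_f, data = toy)

coef(fit_num)
coef(fit_fac)
```

The estimates are the same; only the coefficient name differs. Explicit numeric coding makes the meaning of 1 and 0 visible in the data, which is presumably why it is used here.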

To see whether the economic situation (econ) is related to government performance (gov_p) in both the north and the south, we need to include econ and location in the regression model simultaneously.

model_2 <- lm(gov_p ~ econ + location, data = df2)

Show the results
Note: set the chunk option to {r, results = "asis"} instead of {r} so that the HTML table is rendered

stargazer(model_2, type = "html")


	Dependent variable:

	gov_p

econ	-0.019
	(0.220)

location	6.884^***
	(2.229)

Constant	5.222^***
	(1.347)


Observations	20
R²	0.726
Adjusted R²	0.694
Residual Std. Error	2.190 (df = 17)
F Statistic	22.531^*** (df = 2; 17)

Note:	*p<0.1; **p<0.05; ***p<0.01

We get the following SRF for model_2:

\[\widehat{gov\_p} = 5.222 - 0.019\,econ + 6.884\,location\]

We see that econ is not related to gov_p once location is held constant (its coefficient, -0.019, is not statistically significant).
We see that location is related to gov_p.
→ When location = 1 (that is, when the local government is located in the north), government performance is higher by 6.884 points, holding econ constant.
By substituting location = 0 and location = 1, we get the following two regression functions:
Note: The two slopes are identical!

location = 0

\[\widehat{gov\_p} = 5.222 - 0.019\,econ\]

location = 1

\[\widehat{gov\_p} = 12.106 - 0.019\,econ\]
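
The arithmetic behind the two functions can be sketched as follows. The coefficients are copied from the stargazer table for model_2; the econ value of 10 is a hypothetical choice for illustration.

```r
# Coefficients copied from the stargazer table for model_2
b_const    <- 5.222
b_econ     <- -0.019
b_location <- 6.884

# Intercept for each group: the dummy only shifts the constant
intercept_south <- b_const + b_location * 0
intercept_north <- b_const + b_location * 1

# Predicted gov_p at a hypothetical econ value of 10
econ <- 10
pred_south <- intercept_south + b_econ * econ
pred_north <- intercept_north + b_econ * econ

c(south = intercept_south, north = intercept_north)

# The gap between the two lines equals b_location at every econ value
pred_north - pred_south
```

Since the slope on econ is shared, the two fitted lines are parallel and differ only by the coefficient on location.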

Let’s visualize these two results by drawing scatter plots

Without controlling for the location dummy (left)

The better the economic situation, the higher government performance in Italy appears to be.

Controlling for the location dummy (right)

The location of a local government (north or south) matters in predicting its performance.
Local government performance is higher in the north than in the south.
Once location is controlled for, the economic situation does not matter in predicting government performance in Italy.

Result

・The relationship between the economic situation (econ) and government performance (gov_p) is a spurious correlation: it disappears once location is controlled for.
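
How a group dummy can produce this kind of spurious correlation can be illustrated with simulated data. In the sketch below everything is hypothetical (the variables `x`, `y`, `north`, and all numeric values): both `x` and `y` depend on the group, but `y` does not depend on `x` at all.

```r
set.seed(1)  # for reproducibility

n     <- 200
north <- rep(c(0, 1), each = n / 2)   # group dummy

# Both variables depend on the group, but not on each other
x <- 5 + 8 * north + rnorm(n)
y <- 5 + 7 * north + rnorm(n)

slope_without <- coef(lm(y ~ x))["x"]          # dummy omitted
slope_with    <- coef(lm(y ~ x + north))["x"]  # dummy included

round(c(without = unname(slope_without), with = unname(slope_with)), 3)
```

Without the dummy, the slope on `x` picks up the group difference and is clearly positive; with the dummy included, it shrinks toward zero. This mirrors the pattern seen for econ above, although the simulation says nothing about the actual data.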

4. Testing Putnam’s Argument

4.1 Does `cc` explain `gov_p`?

It seems that the Civic Community Index (cc) is related to government performance (gov_p) in Italy.
However, it is not yet clear whether this holds in both the north and the south.
Draw a scatter plot of gov_p against cc

putnam %>% 
  ggplot(aes(cc, gov_p)) +
  geom_point() +
  theme_bw() +
  labs(x = "cc", y = "gov_p",
         title = "Civic Community Index and Government Performance in Italy") + 
  stat_smooth(method = lm, se = FALSE)

We see a positive correlation between cc and gov_p.
→ The higher the Civic Community Index, the higher the local government performance.
Let’s get the Sample Regression Function (SRF) for model_3.

model_3 <- lm(gov_p ~ cc, data = putnam)

summary(model_3)


Call:
lm(formula = gov_p ~ cc, data = putnam)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5043 -1.3481 -0.2087  0.9764  3.4957 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.71115    0.84443   3.211  0.00485 ** 
cc           0.56730    0.06552   8.658 7.81e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.789 on 18 degrees of freedom
Multiple R-squared:  0.8064,    Adjusted R-squared:  0.7956 
F-statistic: 74.97 on 1 and 18 DF,  p-value: 7.806e-08

\[\widehat{gov\_p} = 2.711 + 0.567\,cc\]

Check the class of variables contained in putnam

str(putnam)

spec_tbl_df [20 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ region  : chr [1:20] "Ab" "Ba" "Cl" "Cm" ...
 $ gov_p   : num [1:20] 7.5 7.5 1.5 2.5 16 12 10 11 11 9 ...
 $ cc      : num [1:20] 8 4 1 2 18 17 13 16 17 15.5 ...
 $ econ    : num [1:20] 7 3 3 6.5 13 14.5 12.5 15.5 19 10.5 ...
 $ location: chr [1:20] "south" "south" "south" "south" ...
 - attr(*, "spec")=
  .. cols(
  ..   region = col_character(),
  ..   gov_p = col_double(),
  ..   cc = col_double(),
  ..   econ = col_double(),
  ..   location = col_character()
  .. )

→ Change the class of location from character to numeric
→ Save the result as a new data frame named df2

df2 <- mutate(putnam, 
              location = as.numeric(location == "north" )) # north = 1, south = 0

DT::datatable(df2)

To see whether the Civic Community Index (cc) is related to government performance (gov_p) in both the north and the south, we need to include cc and location in the regression model simultaneously.

model_3 <- lm(gov_p ~ cc + location, data = df2) # note: overwrites the cc-only model_3 fitted above

Show the results
Note: set the chunk option to {r, results = "asis"} instead of {r} so that the HTML table is rendered

stargazer(model_3, type = "html")


	Dependent variable:

	gov_p

cc	0.571^**
	(0.215)

location	-0.048
	(2.678)

Constant	2.698^**
	(1.121)


Observations	20
R²	0.806
Adjusted R²	0.784
Residual Std. Error	1.841 (df = 17)
F Statistic	35.402^*** (df = 2; 17)

Note:	*p<0.1; **p<0.05; ***p<0.01

We get the following SRF for model_3:

\[\widehat{gov\_p} = 2.698 + 0.571\,cc - 0.048\,location\]

We see that cc is related to gov_p.
We see that location is not related to gov_p once cc is held constant (its coefficient, -0.048, is not statistically significant).
→ Regardless of the value of location (that is, whether the local government is in the north or in the south), government performance does not differ significantly once cc is held constant.
By substituting location = 0 and location = 1, we get the following two regression functions:
Note: The two slopes are identical!

location = 0

\[\widehat{gov\_p} = 2.698 + 0.571\,cc\]

location = 1

\[\widehat{gov\_p} = 2.650 + 0.571\,cc\]

Let’s visualize these two results by drawing scatter plots

Without controlling for the location dummy (left)

The higher the Civic Community Index, the higher government performance in Italy.

Controlling for the location dummy (right)

The location of a local government (north or south) is not related to its performance (gov_p) once cc is controlled for.

Result

・The Civic Community Index (cc) matters in predicting government performance in both the north and the south of Italy

4.2 What explains `gov_p`?

model_4 <- lm(gov_p ~ cc + econ + location, data = df2)

stargazer(model_1, model_2, model_3, model_4,
          type = "html")


	Dependent variable:

	gov_p
	(1)	(2)	(3)	(4)

econ	0.589^***	-0.019		-0.269
	(0.120)	(0.220)		(0.199)

cc			0.571^**	0.700^***
			(0.215)	(0.230)

location		6.884^***	-0.048	0.858
		(2.229)	(2.678)	(2.698)

Constant	3.011^**	5.222^***	2.698^**	3.495^**
	(1.385)	(1.347)	(1.121)	(1.243)


Observations	20	20	20	20
R²	0.572	0.726	0.806	0.826
Adjusted R²	0.549	0.694	0.784	0.794
Residual Std. Error	2.659 (df = 18)	2.190 (df = 17)	1.841 (df = 17)	1.797 (df = 16)
F Statistic	24.097^*** (df = 1; 18)	22.531^*** (df = 2; 17)	35.402^*** (df = 2; 17)	25.370^*** (df = 3; 16)

Note:	*p<0.1; **p<0.05; ***p<0.01

Conclusions

・The Civic Community Index (cc) matters in predicting government performance

・The economic situation (econ) does not matter in predicting government performance once cc and location are controlled for

・Location (location) does not matter in predicting government performance once cc and econ are controlled for
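
For reading a comparison like the one above without a formatted table, fit statistics and coefficient tables can be pulled directly from the fitted objects. The sketch below uses `mtcars`, a dataset that ships with R, purely as a stand-in for the course data; the model names `m_a`, `m_b`, and `m_c` are hypothetical.

```r
# mtcars ships with R; it stands in for the course data here
m_a <- lm(mpg ~ wt, data = mtcars)
m_b <- lm(mpg ~ wt + am, data = mtcars)        # am is a 0/1 dummy
m_c <- lm(mpg ~ wt + hp + am, data = mtcars)

models <- list(a = m_a, b = m_b, c = m_c)

# Adjusted R-squared for each model
adj_r2 <- sapply(models, function(m) summary(m)$adj.r.squared)
round(adj_r2, 3)

# Coefficient table (estimate, std. error, t value, p-value) for the full model
summary(m_c)$coefficients
```

The same pattern applied to model_1 through model_4 should reproduce the adjusted R² row and the significance pattern of the table above.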

References

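Takeshi Iida (飯田健), 計量政治分析 [Quantitative Political Analysis], Kyoritsu Shuppan, 2013.

Ellis Goldberg (1996), Thinking about How Democracy Works, Politics & Society, Vol. 24, pp. 7-18.

Jaehyun Song (宋財泫) and Yuki Yanai (矢内勇生), 私たちのR: ベストプラクティスの探究 [Our R: Exploring Best Practices].

Shohei Doi (土井翔平, Graduate School of Public Policy, Hokkaido University), Rで計量政治学入門 [Introduction to Quantitative Political Science with R].

Yuki Yanai (矢内勇生, Kochi University of Technology), list of courses.

Masahiko Asano (浅野正彦) and Yuki Yanai (矢内勇生), Rによる計量政治学 [Quantitative Political Science with R], Ohmsha, 2018.

Masahiko Asano (浅野正彦) and Kosuke Nakamura (中村公亮), 初めてのRStudio [RStudio for Beginners], Ohmsha, 2018.

Winston Chang, R Graphics Cookbook, O’Reilly Media, 2012.

Kieran Healy, Data Visualization, Princeton University Press, 2019.

Kosuke Imai, Quantitative Social Science: An Introduction, Princeton University Press, 2017.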

19. Linear Regression 2 (Dummy)

Masahiko Asano

2021-10-19
