R packages used in this section
library(DT)
library(gapminder)
library(gghighlight)
library(ggrepel)
library(stargazer)
library(tidyverse)
Variable type | Type of visualization | Variables needed |
---|---|---|
Discrete | 2. bar chart | more than 1 variable |
Continuous | 3. histogram | more than 1 variable |
Continuous | 4. box plot | more than 1 variable |
Continuous | 5. lollipop chart | more than 1 variable |
Continuous | 6. scatterplot | more than 2 variables |
Continuous | 7. line graph | more than 2 variables |
Github and CRAN
The jpndistrict package is not on CRAN
You can install the jpndistrict package via Github by typing the following commands in Console:
install.packages("remotes")
remotes::install_github("uribo/jpndistrict")
For a package on CRAN, type the following command in Console:
install.packages("rnaturalearth", dependencies = TRUE)
When you load the tidyverse package, you will see the following message:
library(tidyverse)
What this message means:
- When you load the tidyverse package, you automatically load 8 packages: ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr, forcats
- The tidyverse package conflicts with two functions: filter() and lag()
- filter() → dplyr::filter()
- lag() → dplyr::lag()
Japanese text in ggplot
If you get garbled characters in ggplot, you can avoid them by including either of the following two commands:
theme_bw(base_family = "HiraKakuProN-W3")
theme_set(theme_classic(base_size = 10,
                        base_family = "HiraginoSans-W3"))
To use the datatable() function in the DT package, you can do either of the following two ways:
library(DT)
datatable(df1)
DT::datatable(df1)
In this section, I use them interchangeably
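In RStudio, make a folder named data in your project folder
Put hr96-17.csv in the data folder
To read hr96-17.csv, we load the tidyverse package, which includes the readr package
library(tidyverse)
df <- read.csv("data/hr96-17.csv",
               na.strings = ".") # treat "." as missing
Check the variable names in df
names(df)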
[1] "year" "pref" "ku" "kun"
[5] "mag" "rank" "wl" "nocand"
[9] "seito" "j_name" "name" "term"
[13] "gender" "age" "exp" "status"
[17] "vote" "voteshare" "eligible" "turnout"
[21] "castvotes" "seshu_dummy" "jiban_seshu" "nojiban_seshu"
Make a dummy variable (wlsmd) using a variable, wl
variable name | detail |
---|---|
wl | 0 = loser / 1 = single-member district (smd) winner / 2 = zombie winner |
wlsmd | 0 = loser or zombie winner / 1 = single-member district (smd) winner |
table(df$wl)
0 1 2
5563 2387 853
df1 <- mutate(df, wlsmd = as.numeric(wl == 1))
table(df1$wlsmd)
0 1
6416 2387
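As a small illustration of the same recoding idea, the sketch below turns a three-category wl into a 0/1 dummy. The data frame toy and its values are made up for this example and are not taken from hr96-17.csv.

```r
library(dplyr)

# Hypothetical toy data: wl coded as in the table above (0, 1, 2)
toy <- data.frame(wl = c(0, 1, 2, 1, 0))

# wl == 1 returns TRUE/FALSE; as.numeric() turns it into 1/0
toy <- mutate(toy, wlsmd = as.numeric(wl == 1))

# Cross-tabulate to confirm that only wl == 1 maps to wlsmd == 1
check <- table(toy$wl, toy$wlsmd)
print(check)
```

Note that zombie winners (wl == 2) fall into wlsmd == 0 under this coding, together with the losers.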
exp is election expenditure (yen) spent by each candidate
Make a new variable, exppv, which shows election expenditure (yen) per eligible voter spent by each candidate
df1 <- mutate(df1, exppv = exp / eligible)
summary(df1$exppv)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0013 8.1762 18.7646 23.0907 33.3863 120.8519 1974
If df1 <- mutate(df1, exppv = exp / eligible) returns an error, you need to take the following procedure:
Steps | Command | Detail |
---|---|---|
1 | str(df1$exp) | Check the class of exp |
2 | If the class is num (or int), go to Step 4. Otherwise, go to Step 3 | |
3 | df1$exp <- as.numeric(df1$exp) | Change the class of exp to num |
4 | str(df1$eligible) | Check the class of eligible |
5 | If the class is num (or int), go to Step 7. Otherwise, go to Step 6 | |
6 | df1$eligible <- as.numeric(df1$eligible) | Change the class of eligible to num |
7 | str(df1$eligible) | Check the class of eligible again |
8 | If the class is num, then it is OK | |
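The procedure in the table can be sketched on a toy data frame. This is only an illustration under stated assumptions: toy and the helper to_numeric() are hypothetical names introduced here, and the values are made up.

```r
# Hypothetical toy data in which both columns were read as character
toy <- data.frame(exp      = c("9828097", "12940178", NA),
                  eligible = c("346774", "338310", "331808"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: convert a column only when it is not already numeric
# (Steps 1-6 of the table)
to_numeric <- function(x) {
  if (is.numeric(x)) x else as.numeric(x)
}

toy$exp      <- to_numeric(toy$exp)
toy$eligible <- to_numeric(toy$eligible)

# Steps 7-8: confirm the classes, then compute spending per eligible voter
str(toy)
toy$exppv <- toy$exp / toy$eligible
```

If a column turns out to be a factor rather than character, as.numeric() returns the level codes, so as.numeric(as.character(x)) is the safer conversion in that case.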
If R reads exp and eligible (which are supposed to be numeric) not as numeric but as character, then we need to change the class of each variable to numeric by using the as.numeric() function
Make a dummy variable (inc) using a variable, status
variable name | detail |
---|---|
status | 0 = challenger / 1 = incumbent / 2 = former incumbent |
inc | 0 = non-incumbent / 1 = incumbent |
table(df1$status)
0 1 2
5106 3129 568
df1 <- mutate(df1, inc = as.numeric(status == 1))
table(df1$inc)
0 1
5674 3129
names(df1)
[1] "year" "pref" "ku" "kun"
[5] "mag" "rank" "wl" "nocand"
[9] "seito" "j_name" "name" "term"
[13] "gender" "age" "exp" "status"
[17] "vote" "voteshare" "eligible" "turnout"
[21] "castvotes" "seshu_dummy" "jiban_seshu" "nojiban_seshu"
[25] "wlsmd" "exppv" "inc"
table(df1$seito)
アイヌ民族党 さわやか神戸・市民の会 ニューディールの会
1 2 1
みんな 安楽死党 維新
79 1 77
沖縄社会大衆党 改革 改革クラブ
1 1 4
希望の党 共産 公明
198 2123 70
幸福 国民新党 国民党
312 21 11
市民新党にいがた 次世 自民
1 39 2266
自由党 自由連合 社民
61 212 307
緒派 諸派 新社会党
44 9 38
新進党 新党さきがけ 新党尊命
235 13 1
新党大地 新党日本 世界経済共同体党
8 9 2
政事公団太平会 政治団体代表 生活
1 2 13
青年自由党 当たり前党 日本維新の会
1 1 198
日本新進党 日本未来の党 文化フォーラム
1 111 10
保守新党 保守党 民主
11 16 1654
民主改革連合 無所属 無所属の会
2 562 9
立憲民主 緑の党
63 1
Using seito, we make an LDP dummy variable, ldp
ldp = 1: LDP candidates, ldp = 0: non-LDP candidates
df1 <- mutate(df1, ldp = as.numeric(seito == "自民"))
table(df1$ldp)
0 1
6537 2266
names(df1)
[1] "year" "pref" "ku" "kun"
[5] "mag" "rank" "wl" "nocand"
[9] "seito" "j_name" "name" "term"
[13] "gender" "age" "exp" "status"
[17] "vote" "voteshare" "eligible" "turnout"
[21] "castvotes" "seshu_dummy" "jiban_seshu" "nojiban_seshu"
[25] "wlsmd" "exppv" "inc" "ldp"
df1 contains the following 28 variables
variable | detail |
---|---|
year | Election year (1996-2017) |
pref | Prefecture |
ku | Electoral district name |
kun | Number of electoral district |
mag | District magnitude (Number of candidates elected) |
rank | Rank by number of votes (1 = most votes) |
nocand | Number of candidates in each district |
seito | Candidate’s affiliated party |
j_name | Candidate’s name (Japanese) |
name | Candidate’s name (English) |
term | Previous wins |
gender | Candidate’s gender:“male”, “female” |
age | Candidate’s age |
wl | 0 = loser / 1 = single-member district (smd) winner / 2 = zombie winner |
wlsmd | 0 = loser or zombie winner / 1 = single-member district (smd) winner |
exp | Election expenditure (yen) spent by each candidate |
status | 0 = challenger / 1 = incumbent / 2 = former incumbent |
vote | votes each candidate garnered |
voteshare | Voteshare (%) |
eligible | Eligible voters in each district |
turnout | Turnout in each district (%) |
castvotes | Total votes cast in each district |
seshu_dummy | 0 = non-hereditary candidate, 1 = hereditary candidate |
jiban_seshu | Relationship between candidate and his predecessor |
nojiban_seshu | Relationship between candidate and his predecessor |
exppv | Election expenditure (yen) per eligible voter spent by each candidate |
inc | 0 = non-incumbent / 1 = incumbent |
ldp | 0 = non-LDP candidates, 1 = LDP candidates |
Descriptive statistics (df1)
library(stargazer)
Set {r, results = "asis"} in the chunk options
stargazer(as.data.frame(df1),
          type = "html",
          digits = 2)
Statistic | N | Mean | St. Dev. | Min | Pctl(25) | Pctl(75) | Max |
---|---|---|---|---|---|---|---|
year | 8,803 | 2,006.60 | 6.81 | 1,996 | 2,000 | 2,012 | 2,017 |
kun | 8,803 | 5.74 | 5.06 | 1 | 2 | 8 | 25 |
mag | 8,803 | 1.00 | 0.00 | 1 | 1 | 1 | 1 |
rank | 8,803 | 2.70 | 21.36 | 1 | 1 | 3 | 2,003 |
wl | 8,803 | 0.46 | 0.67 | 0 | 0 | 1 | 2 |
nocand | 8,803 | 3.96 | 1.08 | 2 | 3 | 5 | 9 |
term | 8,803 | 1.86 | 2.68 | 0 | 0 | 3 | 20 |
age | 8,799 | 50.90 | 11.08 | 25.00 | 43.00 | 59.00 | 94.00 |
exp | 6,829 | 7,551,393.00 | 5,482,684.00 | 535.00 | 2,803,567.00 | 11,044,412.00 | 27,462,362.00 |
status | 8,803 | 0.48 | 0.62 | 0 | 0 | 1 | 2 |
vote | 8,803 | 54,911.15 | 40,467.97 | 177 | 18,239.5 | 86,494.5 | 201,461 |
voteshare | 8,803 | 27.08 | 19.19 | 0 | 8.9 | 42.9 | 95 |
eligible | 7,928 | 326,092.00 | 79,708.01 | 115,013.00 | 269,945.80 | 390,965.00 | 495,212.00 |
turnout | 6,992 | 62.84 | 6.39 | 44.71 | 57.74 | 67.50 | 83.80 |
castvotes | 6,992 | 210,416.40 | 41,101.89 | 104,398.00 | 181,016.20 | 237,484.00 | 339,780.00 |
seshu_dummy | 8,803 | 0.12 | 0.32 | 0 | 0 | 0 | 1 |
wlsmd | 8,803 | 0.27 | 0.44 | 0 | 0 | 1 | 1 |
exppv | 6,829 | 23.09 | 18.13 | 0.001 | 8.18 | 33.39 | 120.85 |
inc | 8,803 | 0.36 | 0.48 | 0 | 0 | 1 | 1 |
ldp | 8,803 | 0.26 | 0.44 | 0 | 0 | 1 | 1 |
df1 contains 28 variables, but the table shows only 20 of them
Reason:
Descriptive statistics only show the variables whose class is numeric: numeric, integer, double
Neither a character variable nor a factor variable is a numeric variable
It is very important to check the class of each variable in data visualization
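One way to see in advance which columns a numeric-only summary would cover is to test each column with is.numeric(). The sketch below uses a made-up data frame toy (a hypothetical name, not part of the tutorial data).

```r
# Hypothetical toy data with mixed classes
toy <- data.frame(year   = c(1996L, 2000L),
                  seito  = c("A", "B"),
                  vote   = c(66876, 56101),
                  gender = factor(c("male", "female")),
                  stringsAsFactors = FALSE)

# TRUE for integer and double columns, FALSE for character and factor
is_num <- sapply(toy, is.numeric)
print(is_num)

# Number of columns a numeric-only summary table would show
sum(is_num)
```

Applied to df1, the same check should flag the eight character variables (pref, ku, seito, j_name, name, gender, jiban_seshu, nojiban_seshu) as non-numeric.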
Check the class of each variable in df1 using the str() function
str(df1)
'data.frame': 8803 obs. of 28 variables:
$ year : int 1996 1996 1996 1996 1996 1996 1996 1996 1996 1996 ...
$ pref : chr "愛知" "愛知" "愛知" "愛知" ...
$ ku : chr "aichi" "aichi" "aichi" "aichi" ...
$ kun : int 1 2 3 4 5 6 7 8 9 10 ...
$ mag : int 1 1 1 1 1 1 1 1 1 1 ...
$ rank : int 1 1 1 1 1 1 1 1 1 1 ...
$ wl : int 1 1 1 1 1 1 1 1 1 1 ...
$ nocand : int 7 8 7 6 7 8 7 5 7 7 ...
$ seito : chr "新進党" "新進党" "新進党" "新進党" ...
$ j_name : chr "河村たかし" "青木宏之" "吉田幸弘" "三沢淳" ...
$ name : chr "KAWAMURA, TAKASHI" "AOKI, HIROYUKI" "YOSHIDA, YUKIHIRO" "MISAWA, JUN" ...
$ term : int 2 2 1 1 3 8 7 3 13 2 ...
$ gender : chr "male" "male" "male" "male" ...
$ age : int 47 51 35 44 48 68 55 59 65 53 ...
$ exp : int 9828097 12940178 11245219 12134215 11894801 11252336 13493050 6368857 19731389 18863794 ...
$ status : int 1 1 0 0 1 1 1 1 1 1 ...
$ vote : int 66876 56101 52478 57361 48648 90812 91439 93053 111578 110820 ...
$ voteshare : num 40 32.9 32.3 35.7 30.9 39.7 47.5 44.4 47.7 46.4 ...
$ eligible : int 346774 338310 331808 315704 319846 433930 357984 377152 393953 437148 ...
$ turnout : num 49.2 51.8 50.4 52 50.3 54.2 55.5 57.1 60.6 56 ...
$ castvotes : int 167051 170317 162679 160548 157404 228631 192362 209450 234001 238646 ...
$ seshu_dummy : int 0 0 0 0 1 0 0 1 0 1 ...
$ jiban_seshu : chr NA NA NA NA ...
$ nojiban_seshu: chr NA NA NA NA ...
$ wlsmd : num 1 1 1 1 1 1 1 1 1 1 ...
$ exppv : num 28.3 38.2 33.9 38.4 37.2 ...
$ inc : num 1 1 0 0 1 1 1 1 1 1 ...
$ ldp : num 0 0 0 0 0 0 0 1 0 0 ...
Draw a bar chart representing the number of candidates per Lower House election between 1996 and 2017
Check the number of candidates by using the table() function
table(df1$year)
1996 2000 2003 2005 2009 2012 2014 2017
1261 1199 1026 989 1139 1294 959 936
Draw the bar chart using df1
Caution: Windows users should type either of the following two commands to avoid garbled characters
windowsFonts(YuGothic = windowsFont("Yu Gothic"))
windowsFonts(Noto = windowsFont("Noto Sans CJK JP"))
df1 %>%
  ggplot() +
  geom_bar(aes(x = year)) +
  labs(x = "Election Year", y = "The Number of Candidates") +
  theme_bw(base_family = "HiraKakuProN-W3")
str(df1$year)
int [1:8803] 1996 1996 1996 1996 1996 1996 1996 1996 1996 1996 ...
The class of year is numeric (int), so ggplot treats the x-axis as continuous
Solution
Change the class of year from numeric to factor
df1$year <- factor(df1$year)
str(df1$year)
Factor w/ 8 levels "1996","2000",..: 1 1 1 1 1 1 1 1 1 1 ...
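A short sketch of what the conversion does, using a made-up vector (year_int and year_fct are hypothetical names). It also shows a common pitfall when converting a factor back to numbers.

```r
# Hypothetical toy vector of election years stored as integers
year_int <- c(1996L, 2000L, 2003L, 1996L)

# As a factor, only the observed years become categories (levels)
year_fct <- factor(year_int)
levels(year_fct)

# Caution: as.numeric() on a factor returns the level codes, not the years
as.numeric(year_fct)

# Go through character to recover the original years
as.numeric(as.character(year_fct))
```

Because the factor has one level per observed election year, the x-axis shows exactly those years as discrete categories.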
df1 %>%
  ggplot() +
  geom_bar(aes(x = year)) +
  labs(x = "Election Year", y = "The Number of Candidates") +
  theme_bw(base_family = "HiraKakuProN-W3")
What we can see from the bar chart
・The number of candidates in Japan’s lower house elections has been decreasing since 1996 (except in 2009 and 2012)
df2 <- df1 %>%
  group_by(year, ldp) %>%
  summarise(N = n(),
            .groups = "drop")
df2
# A tibble: 16 x 3
year ldp N
<fct> <dbl> <int>
1 1996 0 973
2 1996 1 288
3 2000 0 928
4 2000 1 271
5 2003 0 749
6 2003 1 277
7 2005 0 699
8 2005 1 290
9 2009 0 848
10 2009 1 291
11 2012 0 1005
12 2012 1 289
13 2014 0 676
14 2014 1 283
15 2017 0 659
16 2017 1 277
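The group-and-count pattern above can be tried on a tiny made-up data frame (toy, by_group and by_count are hypothetical names). The sketch also shows dplyr's count(), which is a shorthand for the same operation.

```r
library(dplyr)

# Hypothetical toy data: one row per candidate
toy <- data.frame(year = c(1996, 1996, 1996, 2000, 2000),
                  ldp  = c(0, 1, 0, 1, 1))

# Same pattern as above: one row per (year, ldp) group with its size
by_group <- toy %>%
  group_by(year, ldp) %>%
  summarise(N = n(), .groups = "drop")

# count() is a shorthand for the same grouping and counting
by_count <- toy %>%
  count(year, ldp, name = "N")

print(by_group)
```

Groups that do not occur in the data (here year 2000 with ldp 0) simply do not appear in the result.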
df2 %>%
  group_by(ldp) %>%
  summarize(N = mean(N, na.rm = TRUE),
            .groups = "drop")
# A tibble: 2 x 2
ldp N
<dbl> <dbl>
1 0 817.
2 1 283.
Check the class of the variable, ldp
class(df2$ldp)
[1] "numeric"
Change the class of ldp from dbl to factor
df2$ldp <- factor(df2$ldp)
class(df2$ldp)
[1] "factor"
df2 %>%
  ggplot() +
  geom_bar(aes(x = year, y = N, fill = ldp),
           stat = "identity", position = "stack") +
  labs(x = "Election Year", y = "The Number of Candidates") +
  theme_minimal(base_family = "HiraKakuProN-W3")
What we can see from the bar chart
・The number of candidates in Japan’s lower house elections has been decreasing since 1996 (except in 2009 and 2012)
・The number of LDP candidates does not change much over time
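The stacked chart above takes bar heights from a pre-computed column with geom_bar(stat = "identity"). As a side note, ggplot2 also provides geom_col(), which uses the y values as heights by default. The sketch below compares the two on made-up counts (toy, p_bar and p_col are hypothetical names).

```r
library(ggplot2)

# Hypothetical toy data: pre-computed counts per year and group
toy <- data.frame(year = factor(c(1996, 1996, 2000, 2000)),
                  ldp  = factor(c(0, 1, 0, 1)),
                  N    = c(5, 2, 4, 3))

# Heights taken from N, as in the chart above
p_bar <- ggplot(toy) +
  geom_bar(aes(x = year, y = N, fill = ldp),
           stat = "identity", position = "stack")

# geom_col() uses the y values as bar heights by default
p_col <- ggplot(toy) +
  geom_col(aes(x = year, y = N, fill = ldp))

# Both plots place the stacked bars at the same heights
all.equal(layer_data(p_bar)$ymax, layer_data(p_col)$ymax)
```

Either form works; this section keeps geom_bar(stat = "identity") throughout.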
position = "dodge" enables us to draw parallel bar charts
df2 %>%
  ggplot() +
  geom_bar(aes(x = year, y = N, fill = ldp),
           stat = "identity", position = "dodge") +
  labs(x = "Election Year", y = "The Number of Candidates") +
  theme_minimal(base_family = "HiraKakuProN-W3")
The difference between parallel graph and stacked graph
Types of bar chart | What you can do |
---|---|
Stacked | You can compare the average age of winners by Election Year |
Parallel | You can compare the average age of winners by Parties |
Which one you should use depends on you!
You can choose the bar colors with the scale_fill_manual() function
df2 %>%
  ggplot() +
  geom_bar(aes(x = year, y = N, fill = ldp),
           stat = "identity", position = "dodge") +
  labs(x = "Election Year", y = "The Number of Candidates") +
  theme_minimal(base_family = "HiraKakuProN-W3") +
  scale_fill_manual(values = c("springgreen2", "deeppink2"))
Using dplyr functions, calculate the average age of SMD winners and save it as df3
df3 <- df1 %>%
  dplyr::filter(wlsmd == 1) %>%              # choose only smd winners
  group_by(year) %>%                         # calculate by election year
  summarize(age = mean(age, na.rm = TRUE),   # calculate the mean of age
            .groups = "drop")
df3
# A tibble: 8 x 2
year age
<fct> <dbl>
1 1996 54.2
2 2000 53.8
3 2003 52.9
4 2005 53.1
5 2009 51.0
6 2012 52.5
7 2014 54.7
8 2017 55.9
Draw a bar chart using df3 (stat = "identity")
df3 %>%
  ggplot() +
  geom_bar(aes(x = year, y = age), stat = "identity") +
  labs(x = "Election Year", y = "") +
  theme_minimal(base_family = "HiraKakuProN-W3")
summary(df3$age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
51.03 52.81 53.47 53.52 54.30 55.90
Summary
・The SMD winners’ average age is about 54 years and it does not vary much across elections
library(tidyverse)
df4 <- df1 %>%
  group_by(year, ldp) %>%
  summarize(age = mean(age, na.rm = TRUE),
            .groups = "drop")
df4
# A tibble: 16 x 3
year ldp age
<fct> <dbl> <dbl>
1 1996 0 49.4
2 1996 1 54.3
3 2000 0 49.5
4 2000 1 55.7
5 2003 0 49.5
6 2003 1 54.3
7 2005 0 49.3
8 2005 1 52.6
9 2009 0 48.3
10 2009 1 55.5
11 2012 0 49.8
12 2012 1 51.9
13 2014 0 51.5
14 2014 1 53.3
15 2017 0 51.7
16 2017 1 55.3
df4 %>%
  group_by(ldp) %>%
  summarize(age = mean(age, na.rm = TRUE),
            .groups = "drop")
# A tibble: 2 x 2
ldp age
<dbl> <dbl>
1 0 49.9
2 1 54.1
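Note that the two-step calculation above averages the election-year means, so each election counts equally regardless of how many candidates ran. The sketch below, on made-up ages (toy, by_year, mean_of_means and pooled_mean are hypothetical names), shows how this can differ from the mean over all candidates.

```r
library(dplyr)

# Hypothetical toy data: candidate ages in two election years
toy <- data.frame(year = c(1996, 1996, 1996, 2000),
                  age  = c(40, 50, 60, 70))

# Step 1: mean age per election year
by_year <- toy %>%
  group_by(year) %>%
  summarize(age = mean(age, na.rm = TRUE), .groups = "drop")

# Step 2: mean of the yearly means (each election counts equally)
mean_of_means <- mean(by_year$age)

# Pooled mean over all candidates (each candidate counts equally)
pooled_mean <- mean(toy$age, na.rm = TRUE)

c(mean_of_means = mean_of_means, pooled_mean = pooled_mean)
```

When the number of candidates is similar across elections, the two figures are close; which one to report depends on the question being asked.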
The class of ldp is <dbl> (numeric)
Change the class to factor
df4$ldp <- factor(df4$ldp)
Check that the class is changed
class(df4$ldp)
[1] "factor"
df4 %>%
  ggplot() +
  geom_bar(aes(x = year, y = age, fill = ldp),
           stat = "identity", position = "dodge") +
  labs(x = "Election Year", y = "Candidates' Average Age")
Summary
・The average age of LDP candidates (54.1) is higher than that of non-LDP candidates (49.9). Note that df4 is not filtered by wlsmd, so it covers all candidates, not only winners.
Type of graph | x-axis | Geometric object | Gap between bars |
---|---|---|---|
Bar Chart | Discrete variable | geom_bar() | Yes |
Histogram | Continuous variable | geom_histogram() | No |
Q: Referring to 6.2 Show Vote Share by Party (2017HR), generate boxplots showing the vote shares by party in the 2009 HR election.
By adding geom_point(), you can present dots “●” on the box plot
df8 %>%
  dplyr::filter(!is.na(voteshare)) %>%
  ggplot(aes(x = seito, y = voteshare)) +
  geom_point(aes(color = seito), alpha = 0.5,
             show.legend = FALSE) +
  geom_boxplot(aes(fill = seito),
               alpha = 0.5, show.legend = FALSE) +
  labs(x = "Party", y = "Vote Share (2017 HR Election)")
- You can scatter the dots by adding geom_jitter() so that you can clearly see them
df8 %>%
  filter(!is.na(voteshare)) %>%
  ggplot(aes(x = seito,
             y = voteshare)) +
  geom_jitter(aes(color = seito),
              show.legend = FALSE) +
  geom_boxplot(aes(fill = seito),
               alpha = 0.5,
               show.legend = FALSE) +
  labs(x = "Party", y = "Vote Share (2017 HR Election)")
- You can adjust the dispersion of the dots with width = 0.15, height = 0
df8 %>%
  dplyr::filter(!is.na(voteshare)) %>%
  ggplot(aes(x = seito,
             y = voteshare)) +
  geom_jitter(aes(color = seito),
              width = 0.15, height = 0, # Adjust the dispersion
              show.legend = FALSE) +
  geom_boxplot(aes(fill = seito),
               alpha = 0.5,
               show.legend = FALSE) +
  labs(x = "Party", y = "Vote Share (2017 HR Election)")
- You can draw the box plots separately by gender with facet_wrap()
df8 %>%
  dplyr::filter(!is.na(voteshare)) %>%
  ggplot(aes(x = seito, y = voteshare)) +
  geom_jitter(aes(color = seito), alpha = 0.5,
              width = 0.15, height = 0,
              show.legend = FALSE) +
  geom_boxplot(aes(fill = seito),
               alpha = 0.5, show.legend = FALSE) +
  labs(x = "Party",
       y = "Vote Share (2017 HR Election)",
       caption = "Male and Female Candidates") +
  facet_wrap(~ gender) +
  theme_bw(base_family = "HiraKakuProN-W3") +
  coord_flip()
df8 %>%
  dplyr::filter(!is.na(voteshare)) %>%
  ggplot(aes(x = seito,
             y = voteshare)) +
  geom_jitter(aes(color = gender),
              alpha = 0.5,
              position = position_jitterdodge(jitter.width = 0.2,
                                              jitter.height = 0),
              show.legend = FALSE) +
  geom_boxplot(aes(fill = gender),
               alpha = 0.5) +
  labs(x = "Party",
       y = "Vote Share (2017 HR Election)",
       fill = "",
       caption = "Male and Female Candidates")
- You can limit the number of parties you like to look into
Put hr09_14_ldp_seatshare.csv in the data folder
To read hr09_14_ldp_seatshare.csv, we need to load the readr package, which is included in the tidyverse package
library(tidyverse)
df_seat <- read_csv("data/hr09_14_ldp_seatshare.csv")
Using the datatable() function, check the data frame
DT::datatable(df_seat)
Variables | Details |
---|---|
year | Election Year |
pref | Prefectures (in Japanese) |
id | Prefecture ID (1-47) |
nosmd | The total number of Single-Member-Districts (SMD) in each Prefecture (1-25) |
ldp | The total number of LDP winners in the SMDs of each prefecture |
ldp_ratio | Ratio of LDP winners in the SMDs of each prefecture (%) |
dpj | The total number of DPJ winners in each SMD |
Check the class of each variable
str(df_seat$year)
num [1:141] 2012 2014 2009 2012 2014 ...
The class of the variable year is numeric
Change it to factor
df_seat$year <- factor(df_seat$year)
df_seat %>%
  arrange(year, ldp_ratio) %>%
  mutate(order_seq = c(1:47, rep(0, 47 * 2))) %>% # rank prefectures by the 2009 ratio
  ggplot(aes(x = ldp_ratio,
             y = reorder(pref, order_seq))) +
  geom_segment(aes(yend = pref),
               xend = 0, colour = "grey50") +
  geom_point(size = 2,
             aes(colour = year)) +
  scale_colour_brewer(palette = "Set1",
                      limits = c("2009", "2012", "2014"),
                      guide = "none") +
  facet_grid(~ year,
             scales = "free_y", space = "free_y") +
  theme_bw(base_family = "HiraKakuProN-W3") + # Show Japanese in chart
  theme(panel.grid.major.y = element_blank()) + # after theme_bw() so it is not overridden
  labs(x = "LDP Seat Share (%)",
       y = "Prefecture")
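The order_seq step in the lollipop chart code fixes the order of the prefectures on the axis. The sketch below reproduces the idea on a made-up data frame with three prefectures and two elections. The names toy, n_pref and pref_ordered are hypothetical, and the sketch assumes, like the original code, that every prefecture appears once in the first election year.

```r
library(dplyr)

# Hypothetical toy data: 3 prefectures observed in 2 election years
toy <- data.frame(
  year      = factor(rep(c(2009, 2012), each = 3)),
  pref      = rep(c("A", "B", "C"), times = 2),
  ldp_ratio = c(50, 10, 30, 90, 100, 80)
)

n_pref <- 3

toy <- toy %>%
  arrange(year, ldp_ratio) %>%
  # Rank the first year's rows 1..n_pref; rows of later years get 0
  mutate(order_seq = c(seq_len(n_pref), rep(0, n() - n_pref)))

# reorder() sorts the levels of pref by the mean of order_seq per prefecture,
# so the ranking in the first year fixes the axis order for every panel
pref_ordered <- reorder(toy$pref, toy$order_seq)
levels(pref_ordered)
```

Because later years contribute 0 to every prefecture's mean, the resulting order depends only on the first year's ranking, which is why all facets share the same prefecture order.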
df_seat %>%
  arrange(year, ldp_ratio) %>%
  mutate(order_seq = c(1:47, rep(0, 47 * 2))) %>%
  ggplot(aes(x = reorder(pref, order_seq),
             y = ldp_ratio,
             fill = year)) +
  geom_bar(stat = "identity") +
  facet_grid(~ year, scales = "free_x") +
  coord_flip() +
  theme_bw(base_family = "HiraKakuProN-W3") +
  theme(legend.position = "none") + # after theme_bw() so it is not overridden
  labs(x = "Prefecture",
       y = "LDP's Seat Share (%)")