R packages used in this section
library(DT)
library(gapminder)
library(gghighlight)
library(ggrepel)
library(stargazer)
library(tidyverse)

| Variable type | Type of visualization | Variables needed |
|---|---|---|
| Discrete | 2. bar chart | more than 1 variable |
| Continuous | 3. histogram | more than 1 variable |
| Continuous | 4. box plot | more than 1 variable |
| Continuous | 5. lollipop chart | more than 1 variable |
| Continuous | 6. scatterplot | more than 2 variables |
| Continuous | 7. line graph | more than 2 variables |
The jpndistrict package is not on CRAN
Install the jpndistrict package via GitHub by typing the following commands in the Console:
install.packages("remotes")
remotes::install_github("uribo/jpndistrict")
Install the rnaturalearth package by typing the following command in the Console:
install.packages("rnaturalearth", dependencies = TRUE)
When you load the tidyverse package, you will see the following message:
library(tidyverse)
What this message means:
When you load the tidyverse package, you automatically attach 8 packages: ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr, forcats
The tidyverse package (through dplyr) masks two functions from the stats package: filter() and lag()
To make explicit which one you mean, add the package name:
filter() → dplyr::filter()
lag() → dplyr::lag()
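As a minimal sketch of what this conflict means in practice, the `package::function()` form makes explicit which `filter()` is called. The data frame `toy` and its columns are hypothetical and are not part of the election data.

```r
library(dplyr)

# Hypothetical toy data, not part of the election data set
toy <- data.frame(id = 1:5, score = c(10, 20, 30, 40, 50))

# dplyr::filter() keeps the rows that satisfy a condition
high <- dplyr::filter(toy, score > 30)
print(high)

# stats::filter() is a different function: it applies a linear filter
# (here a 3-point moving average) to a numeric series
smooth <- stats::filter(toy$score, rep(1 / 3, 3))
print(smooth)
```

Once the tidyverse is loaded, a bare `filter()` refers to the dplyr version, so the prefix is mainly useful for readability or when another package masks it again.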
If Japanese characters are garbled in ggplot, you can avoid them by including either of the following two commands:
theme_bw(base_family = "HiraKakuProN-W3")
theme_set(theme_classic(base_size = 10,
base_family = "HiraginoSans-W3"))
To use the datatable() function in the DT package, you can do it in either of the following two ways:
library(DT)
datatable(df1)
DT::datatable(df1)
In this section, I use them interchangeably
In your RStudio project, make a folder named data
Put hr96-17.csv in the data folder
To read hr96-17.csv, we load the tidyverse package, which includes the readr package
library(tidyverse)
df <- read.csv("data/hr96-17.csv",
na.strings = ".")
Check the variable names in df
names(df)
 [1] "year" "pref" "ku" "kun"
[5] "mag" "rank" "wl" "nocand"
[9] "seito" "j_name" "name" "term"
[13] "gender" "age" "exp" "status"
[17] "vote" "voteshare" "eligible" "turnout"
[21] "castvotes" "seshu_dummy" "jiban_seshu" "nojiban_seshu"
Make a dummy variable (wlsmd) using the variable wl

| variable name | detail |
|---|---|
| wl | 0 = loser / 1 = single-member district (smd) winner / 2 = zombie winner |
| wlsmd | 0 = loser / 1 = winner |
table(df$wl)
0 1 2
5563 2387 853
df1 <- mutate(df, wlsmd = as.numeric(wl == 1))
table(df1$wlsmd)
0 1
6416 2387
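The dummy-variable step above can be sketched on a small data frame. This is only an illustration: `toy` and its values are hypothetical, and the coding of `wl` follows the table above.

```r
library(dplyr)

# Hypothetical toy data: wl coded as 0 = loser, 1 = SMD winner, 2 = zombie winner
toy <- data.frame(name = c("A", "B", "C", "D", "E"),
                  wl   = c(0, 1, 2, 1, 0))

# (wl == 1) is TRUE/FALSE; as.numeric() turns it into 1/0
toy <- mutate(toy, wlsmd = as.numeric(wl == 1))

# Cross-check: only wl == 1 should map to wlsmd == 1
print(table(toy$wl, toy$wlsmd))
```

Cross-tabulating the old and the new variable is a quick way to confirm that zombie winners (wl = 2) were coded as 0.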
exp is the election expenditure (yen) spent by each candidate
Make a new variable, exppv, which shows the election expenditure (yen) per eligible voter spent by each candidate
df1 <- mutate(df1, exppv = exp / eligible)
summary(df1$exppv)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's
  0.0013   8.1762  18.7646  23.0907  33.3863 120.8519     1974
If you get an error when running the following command, take the steps in the table below and then run the command again:
df1 <- mutate(df1, exppv = exp / eligible)

| Steps | Command | Detail |
|---|---|---|
| 1 | str(df1$exp) | Check the class of exp |
| 2 | | If the class is num (or int), go to Step 4; otherwise, go to Step 3 |
| 3 | df1$exp <- as.numeric(df1$exp) | Change the class of exp to num |
| 4 | str(df1$eligible) | Check the class of eligible |
| 5 | | If the class is num (or int), go to Step 7; otherwise, go to Step 6 |
| 6 | df1$eligible <- as.numeric(df1$eligible) | Change the class of eligible to num |
| 7 | str(df1$eligible) | Check the class of eligible again |
| 8 | | If the class is num, then it is OK |
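The steps in the table above can be sketched as follows. This is a minimal illustration on a hypothetical data frame `toy` in which both columns were read as character. It uses `is.numeric()`, which is TRUE for both num and int, instead of reading the `str()` output by eye.

```r
# Hypothetical toy data in which both columns were read as character
toy <- data.frame(exp      = c("1000", "2500", NA),
                  eligible = c("100", "250", "400"),
                  stringsAsFactors = FALSE)

str(toy$exp)  # chr: division would fail on this column

# Convert only when the column is not already numeric (num or int)
if (!is.numeric(toy$exp))      toy$exp      <- as.numeric(toy$exp)
if (!is.numeric(toy$eligible)) toy$eligible <- as.numeric(toy$eligible)

# Now the division works; a missing exp stays NA
toy$exppv <- toy$exp / toy$eligible
print(toy)
```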
If R reads exp and eligible (which are supposed to be numeric) not as numeric but as character, we need to change the class of each variable to numeric by using the as.numeric() function
Make a dummy variable (inc) using the variable status

| variable name | detail |
|---|---|
| status | 0 = challenger / 1 = incumbent / 2 = former incumbent |
| inc | 0 = non-incumbent / 1 = incumbent |
table(df1$status)
0 1 2
5106 3129 568
df1 <- mutate(df1, inc = as.numeric(status == 1))
table(df1$inc)
0 1
5674 3129
names(df1)
 [1] "year" "pref" "ku" "kun"
[5] "mag" "rank" "wl" "nocand"
[9] "seito" "j_name" "name" "term"
[13] "gender" "age" "exp" "status"
[17] "vote" "voteshare" "eligible" "turnout"
[21] "castvotes" "seshu_dummy" "jiban_seshu" "nojiban_seshu"
[25] "wlsmd" "exppv" "inc"
table(df1$seito)
アイヌ民族党 さわやか神戸・市民の会 ニューディールの会
1 2 1
みんな 安楽死党 維新
79 1 77
沖縄社会大衆党 改革 改革クラブ
1 1 4
希望の党 共産 公明
198 2123 70
幸福 国民新党 国民党
312 21 11
市民新党にいがた 次世 自民
1 39 2266
自由党 自由連合 社民
61 212 307
緒派 諸派 新社会党
44 9 38
新進党 新党さきがけ 新党尊命
235 13 1
新党大地 新党日本 世界経済共同体党
8 9 2
政事公団太平会 政治団体代表 生活
1 2 13
青年自由党 当たり前党 日本維新の会
1 1 198
日本新進党 日本未来の党 文化フォーラム
1 111 10
保守新党 保守党 民主
11 16 1654
民主改革連合 無所属 無所属の会
2 562 9
立憲民主 緑の党
63 1
Using seito, we make an LDP dummy variable, ldp
ldp = 1: LDP candidates, ldp = 0: non-LDP candidates
df1 <- mutate(df1, ldp = as.numeric(seito == "自民"))
table(df1$ldp)
0 1
6537 2266
names(df1)
 [1] "year" "pref" "ku" "kun"
[5] "mag" "rank" "wl" "nocand"
[9] "seito" "j_name" "name" "term"
[13] "gender" "age" "exp" "status"
[17] "vote" "voteshare" "eligible" "turnout"
[21] "castvotes" "seshu_dummy" "jiban_seshu" "nojiban_seshu"
[25] "wlsmd" "exppv" "inc" "ldp"
df1 contains the following 28 variables

| variable | detail |
|---|---|
| year | Election year (1996-2017) |
| pref | Prefecture |
| ku | Electoral district name |
| kun | Number of electoral district |
| mag | District magnitude (Number of candidates elected) |
| rank | Ascending order of votes |
| nocand | Number of candidates in each district |
| seito | Candidate’s affiliated party |
| j_name | Candidate’s name (Japanese) |
| name | Candidate’s name (English) |
| term | Previous wins |
| gender | Candidate’s gender:“male”, “female” |
| age | Candidate’s age |
| wl | 0 = loser / 1 = single-member district (smd) winner / 2 = zombie winner |
| wlsmd | 0 = loser / 1 = winner |
| exp | Election expenditure (yen) spent by each candidate |
| status | 0 = challenger / 1 = incumbent / 2 = former incumbent |
| vote | votes each candidate garnered |
| voteshare | Voteshare (%) |
| eligible | Eligible voters in each district |
| turnout | Turnout in each district (%) |
| castvotes | Total votes cast in each district |
| seshu_dummy | 0 = Not-hereditary candidates, 1 = hereditary candidate |
| jiban_seshu | Relationship between candidate and his predecessor |
| nojiban_seshu | Relationship between candidate and his predecessor |
| exppv | Election expenditure (yen) per eligible voter spent by each candidate |
| inc | 0 = non-incumbent / 1 = incumbent |
| ldp | 0 = non-LDP candidates, 1 = LDP candidates |
Descriptive statistics (df1)
library(stargazer)
Set {r, results = "asis"} in the chunk options
stargazer(as.data.frame(df1),
type = "html",
digits = 2)

| Statistic | N | Mean | St. Dev. | Min | Pctl(25) | Pctl(75) | Max |
|---|---|---|---|---|---|---|---|
| year | 8,803 | 2,006.60 | 6.81 | 1,996 | 2,000 | 2,012 | 2,017 |
| kun | 8,803 | 5.74 | 5.06 | 1 | 2 | 8 | 25 |
| mag | 8,803 | 1.00 | 0.00 | 1 | 1 | 1 | 1 |
| rank | 8,803 | 2.70 | 21.36 | 1 | 1 | 3 | 2,003 |
| wl | 8,803 | 0.46 | 0.67 | 0 | 0 | 1 | 2 |
| nocand | 8,803 | 3.96 | 1.08 | 2 | 3 | 5 | 9 |
| term | 8,803 | 1.86 | 2.68 | 0 | 0 | 3 | 20 |
| age | 8,799 | 50.90 | 11.08 | 25.00 | 43.00 | 59.00 | 94.00 |
| exp | 6,829 | 7,551,393.00 | 5,482,684.00 | 535.00 | 2,803,567.00 | 11,044,412.00 | 27,462,362.00 |
| status | 8,803 | 0.48 | 0.62 | 0 | 0 | 1 | 2 |
| vote | 8,803 | 54,911.15 | 40,467.97 | 177 | 18,239.5 | 86,494.5 | 201,461 |
| voteshare | 8,803 | 27.08 | 19.19 | 0 | 8.9 | 42.9 | 95 |
| eligible | 7,928 | 326,092.00 | 79,708.01 | 115,013.00 | 269,945.80 | 390,965.00 | 495,212.00 |
| turnout | 6,992 | 62.84 | 6.39 | 44.71 | 57.74 | 67.50 | 83.80 |
| castvotes | 6,992 | 210,416.40 | 41,101.89 | 104,398.00 | 181,016.20 | 237,484.00 | 339,780.00 |
| seshu_dummy | 8,803 | 0.12 | 0.32 | 0 | 0 | 0 | 1 |
| wlsmd | 8,803 | 0.27 | 0.44 | 0 | 0 | 1 | 1 |
| exppv | 6,829 | 23.09 | 18.13 | 0.001 | 8.18 | 33.39 | 120.85 |
| inc | 8,803 | 0.36 | 0.48 | 0 | 0 | 1 | 1 |
| ldp | 8,803 | 0.26 | 0.44 | 0 | 0 | 1 | 1 |
df1 contains 28 variables, but the table above shows only 20 of them
Reason:
The descriptive statistics show only the variables whose class is numeric (numeric, integer, double)
Neither a character variable nor a factor variable is a numeric variable
It is very important to check the class of each variable in data visualization
Check the class of each variable in df1 using the str() function
str(df1)
'data.frame': 8803 obs. of 28 variables:
$ year : int 1996 1996 1996 1996 1996 1996 1996 1996 1996 1996 ...
$ pref : chr "愛知" "愛知" "愛知" "愛知" ...
$ ku : chr "aichi" "aichi" "aichi" "aichi" ...
$ kun : int 1 2 3 4 5 6 7 8 9 10 ...
$ mag : int 1 1 1 1 1 1 1 1 1 1 ...
$ rank : int 1 1 1 1 1 1 1 1 1 1 ...
$ wl : int 1 1 1 1 1 1 1 1 1 1 ...
$ nocand : int 7 8 7 6 7 8 7 5 7 7 ...
$ seito : chr "新進党" "新進党" "新進党" "新進党" ...
$ j_name : chr "河村たかし" "青木宏之" "吉田幸弘" "三沢淳" ...
$ name : chr "KAWAMURA, TAKASHI" "AOKI, HIROYUKI" "YOSHIDA, YUKIHIRO" "MISAWA, JUN" ...
$ term : int 2 2 1 1 3 8 7 3 13 2 ...
$ gender : chr "male" "male" "male" "male" ...
$ age : int 47 51 35 44 48 68 55 59 65 53 ...
$ exp : int 9828097 12940178 11245219 12134215 11894801 11252336 13493050 6368857 19731389 18863794 ...
$ status : int 1 1 0 0 1 1 1 1 1 1 ...
$ vote : int 66876 56101 52478 57361 48648 90812 91439 93053 111578 110820 ...
$ voteshare : num 40 32.9 32.3 35.7 30.9 39.7 47.5 44.4 47.7 46.4 ...
$ eligible : int 346774 338310 331808 315704 319846 433930 357984 377152 393953 437148 ...
$ turnout : num 49.2 51.8 50.4 52 50.3 54.2 55.5 57.1 60.6 56 ...
$ castvotes : int 167051 170317 162679 160548 157404 228631 192362 209450 234001 238646 ...
$ seshu_dummy : int 0 0 0 0 1 0 0 1 0 1 ...
$ jiban_seshu : chr NA NA NA NA ...
$ nojiban_seshu: chr NA NA NA NA ...
$ wlsmd : num 1 1 1 1 1 1 1 1 1 1 ...
$ exppv : num 28.3 38.2 33.9 38.4 37.2 ...
$ inc : num 1 1 0 0 1 1 1 1 1 1 ...
$ ldp : num 0 0 0 0 0 0 0 1 0 0 ...
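As a rough sketch of why a numeric summary table drops some columns, the classes can be listed and the numeric columns selected programmatically. The data frame `toy` is hypothetical; `sapply()`, `Filter()` and `is.numeric()` are base R.

```r
# Hypothetical toy data mixing numeric and non-numeric columns
toy <- data.frame(year      = c(1996L, 2000L, 2003L),
                  seito     = c("A", "B", "A"),
                  voteshare = c(40.0, 32.9, 32.3),
                  gender    = factor(c("male", "female", "male")),
                  stringsAsFactors = FALSE)

# Class of every column at a glance
classes <- sapply(toy, class)
print(classes)

# Keep only the columns a numeric summary table can describe
numeric_only <- Filter(is.numeric, toy)
print(names(numeric_only))
```

Here the character column and the factor column are dropped, which mirrors the 28 versus 20 variables seen above.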
Draw a bar chart representing the number of candidates in each Lower House election between 1996 and 2017
Check the number of candidates by using the table() function
table(df1$year)
1996 2000 2003 2005 2009 2012 2014 2017
1261 1199 1026 989 1139 1294 959 936
Draw the bar chart using df1
Caution: Windows users should type either of the following two commands to avoid garbled characters
windowsFonts(YuGothic = windowsFont("Yu Gothic"))
windowsFonts(Noto = windowsFont("Noto Sans CJK JP"))
df1 %>%
ggplot() +
geom_bar(aes(x = year)) +
labs(x = "Election Year", y = "The number of candidates") +
theme_bw(base_family = "HiraKakuProN-W3")
Check the class of year
str(df1$year)
 int [1:8803] 1996 1996 1996 1996 1996 1996 1996 1996 1996 1996 ...
The class of year is numeric (int), so the x-axis is treated as continuous
Solution
Change the class of year from numeric to factor
df1$year <- factor(df1$year)
str(df1$year)
 Factor w/ 8 levels "1996","2000",..: 1 1 1 1 1 1 1 1 1 1 ...
df1 %>%
ggplot() +
geom_bar(aes(x = year)) +
labs(x = "Election Year", y = "The number of candidates") +
theme_bw(base_family = "HiraKakuProN-W3")
What we can see from the bar chart
・The number of candidates in Japan’s lower house elections has been decreasing since 1996 (except in 2009 and 2012)
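The effect of turning year into a factor can be sketched on a small example. The data frame `toy` and the column `year_f` are hypothetical; the point is only that a numeric x gives a continuous axis while a factor x gives one slot per level.

```r
library(ggplot2)

# Hypothetical toy data: one row per candidate, three election years
toy <- data.frame(year = c(1996, 1996, 1996, 2000, 2000, 2003))

# With a numeric year the x-axis is continuous, so ggplot also leaves
# room for years without elections (1997, 1998, ...)
p_numeric <- ggplot(toy) + geom_bar(aes(x = year))

# With a factor year the x-axis has exactly one slot per election
toy$year_f <- factor(toy$year)
p_factor <- ggplot(toy) + geom_bar(aes(x = year_f)) +
  labs(x = "Election Year", y = "Number of candidates")

print(levels(toy$year_f))
```

Printing `p_numeric` and `p_factor` side by side shows the difference in the axis.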
df2 <- df1 %>%
group_by(year, ldp) %>%
summarise(N = n(),
.groups = "drop")
df2
# A tibble: 16 x 3
year ldp N
<fct> <dbl> <int>
1 1996 0 973
2 1996 1 288
3 2000 0 928
4 2000 1 271
5 2003 0 749
6 2003 1 277
7 2005 0 699
8 2005 1 290
9 2009 0 848
10 2009 1 291
11 2012 0 1005
12 2012 1 289
13 2014 0 676
14 2014 1 283
15 2017 0 659
16 2017 1 277
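The counting step above (rows per year and ldp group) can be sketched on a hypothetical data frame `toy`. The sketch also shows `dplyr::count()`, which should give the same counts with less typing.

```r
library(dplyr)

# Hypothetical toy data: one row per candidate
toy <- data.frame(year = factor(c(1996, 1996, 1996, 2000, 2000)),
                  ldp  = c(1, 0, 0, 1, 0))

# n() counts the rows in each (year, ldp) group
counts <- toy %>%
  group_by(year, ldp) %>%
  summarise(N = n(), .groups = "drop")
print(counts)

# count() is a shorthand for the same grouping and counting
counts2 <- count(toy, year, ldp, name = "N")
print(counts2)
```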
df2 %>%
group_by(ldp) %>%
summarize(N = mean(N, na.rm = TRUE),
.groups = "drop")
# A tibble: 2 x 2
ldp N
<dbl> <dbl>
1 0 817.
2 1 283.
Check the class of the variable ldp
class(df2$ldp)
[1] "numeric"
Change the class of ldp from dbl to factor
df2$ldp <- factor(df2$ldp)
class(df2$ldp)
[1] "factor"
df2 %>%
ggplot() +
geom_bar(aes(x = year, y = N, fill = ldp),
stat = "identity", position = "stack") +
labs(x = "Election Year", y = "The Number of Candidates") +
theme_minimal(base_family = "HiraKakuProN-W3")
What we can see from the bar chart
・The number of candidates in Japan’s lower house elections has been decreasing since 1996 (except in 2009 and 2012)
・The number of LDP candidates does not change much over time
position = "dodge" enables us to draw parallel bar charts
df2 %>%
ggplot() +
geom_bar(aes(x = year, y = N, fill = ldp),
stat = "identity", position = "dodge") +
labs(x = "Election Year", y = "The Number of Candidates") +
theme_minimal(base_family = "HiraKakuProN-W3")
The difference between parallel graph and stacked graph

| Types of bar chart | What you can do |
|---|---|
| Stacked | You can compare the total number of candidates across election years |
| Parallel | You can compare the number of candidates between parties within each election year |
Which one to use is up to you!
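The two layouts can be compared on a small example. This is a sketch with hypothetical counts in `toy`; it uses `geom_col()`, which is equivalent to `geom_bar(stat = "identity")`.

```r
library(ggplot2)

# Hypothetical toy counts (not the election data)
toy <- data.frame(year = factor(rep(c(2012, 2014, 2017), each = 2)),
                  ldp  = factor(rep(c(0, 1), times = 3)),
                  N    = c(10, 3, 7, 3, 6, 3))

base <- ggplot(toy, aes(x = year, y = N, fill = ldp))

# Stacked: the bar height is the yearly total
p_stack <- base + geom_col(position = "stack")

# Dodged: one bar per group, side by side within each year
p_dodge <- base + geom_col(position = "dodge")

# Yearly totals that the stacked bars display
totals <- tapply(toy$N, toy$year, sum)
print(totals)
```

In the stacked version the eye reads the totals first; in the dodged version it reads the group values first.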
You can change the colors of the bars with the scale_fill_manual() function
df2 %>%
ggplot() +
geom_bar(aes(x = year, y = N, fill = ldp),
stat = "identity", position = "dodge") +
labs(x = "Election Year", y = "The Number of Candidates") +
theme_minimal(base_family = "HiraKakuProN-W3") +
scale_fill_manual(values = c("springgreen2", "deeppink2"))
Using dplyr functions, calculate the average age of the SMD winners in each election and save it as df3
df3 <- df1 %>%
dplyr::filter(wlsmd == 1) %>% # choose only smd winners
group_by(year) %>% # calculate by election year
summarize(age = mean(age, na.rm = TRUE), # calculate the mean of age
.groups = "drop")
df3
# A tibble: 8 x 2
year age
<fct> <dbl>
1 1996 54.2
2 2000 53.8
3 2003 52.9
4 2005 53.1
5 2009 51.0
6 2012 52.5
7 2014 54.7
8 2017 55.9
Draw a bar chart using df3 (stat = "identity")
df3 %>%
ggplot() +
geom_bar(aes(x = year, y = age), stat = "identity") +
labs(x = "Election Year", y = "") +
theme_minimal(base_family = "HiraKakuProN-W3")
summary(df3$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  51.03   52.81   53.47   53.52   54.30   55.90
Summary
・The average age of SMD winners is about 54 (mean 53.5) and it does not vary much across elections (51.0 to 55.9)
library(tidyverse)
df4 <- df1 %>%
group_by(year, ldp) %>%
summarize(age = mean(age, na.rm = TRUE),
.groups = "drop")
df4
# A tibble: 16 x 3
year ldp age
<fct> <dbl> <dbl>
1 1996 0 49.4
2 1996 1 54.3
3 2000 0 49.5
4 2000 1 55.7
5 2003 0 49.5
6 2003 1 54.3
7 2005 0 49.3
8 2005 1 52.6
9 2009 0 48.3
10 2009 1 55.5
11 2012 0 49.8
12 2012 1 51.9
13 2014 0 51.5
14 2014 1 53.3
15 2017 0 51.7
16 2017 1 55.3
df4 %>%
group_by(ldp) %>%
summarize(age = mean(age, na.rm = TRUE),
.groups = "drop")
# A tibble: 2 x 2
ldp age
<dbl> <dbl>
1 0 49.9
2 1 54.1
The class of ldp is <dbl> (numeric)
Change the class to factor
df4$ldp <- factor(df4$ldp)
Check that the class is changed
class(df4$ldp)
[1] "factor"
df4 %>%
ggplot() +
geom_bar(aes(x = year, y = age, fill = ldp),
stat = "identity", position = "dodge") +
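labs(x = "Election Year", y = "Candidates' Average Age")
Summary
・df4 is calculated from all candidates in df1 (it is not filtered by wlsmd == 1), so the chart shows candidates, not only winners
・The average age of LDP candidates (54.1) is higher than that of non-LDP candidates (49.9)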
| Type of graph | x-axis | Geometric object | Gap between bars |
|---|---|---|---|
| Bar Chart | Discrete variable | geom_bar() | Yes |
| Histogram | Continuous variable | geom_histogram() | No |
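The contrast in the table above can be sketched with a hypothetical data frame `toy` that has one discrete column and one continuous column. The `binwidth = 10` value is an arbitrary choice for the illustration.

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(party = c("A", "A", "B", "C", "C", "C"),
                  age   = c(34, 41, 47, 52, 58, 66))

# Discrete x: geom_bar() counts rows per category, bars are separated
p_bar <- ggplot(toy, aes(x = party)) + geom_bar()

# Continuous x: geom_histogram() counts rows per bin, bars touch
p_hist <- ggplot(toy, aes(x = age)) + geom_histogram(binwidth = 10)

# The counts behind the bar chart
print(table(toy$party))
```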
Q: Referring to 6.2 Show Vote Share by Party (2017HR), generate boxplots showing the vote shares by party in the 2009 HR election.
- By adding geom_point(), you can show each observation as a dot “●” on the box plot
df8 %>%
dplyr::filter(!is.na(voteshare)) %>%
ggplot(aes(x = seito, y = voteshare)) +
geom_point(aes(color = seito), alpha = 0.5,
show.legend = FALSE) +
geom_boxplot(aes(fill = seito),
alpha = 0.5, show.legend = FALSE) +
labs(x = "Party", y = "Vote Share (2017 HR Election)")
- You can scatter the dots by adding geom_jitter() so that you can clearly see them
df8 %>%
filter(!is.na(voteshare)) %>%
ggplot(aes(x = seito,
y = voteshare)) +
geom_jitter(aes(color = seito),
show.legend = FALSE) +
geom_boxplot(aes(fill = seito),
alpha = 0.5,
show.legend = FALSE) +
labs(x = "Party", y = "Vote Share (2017 HR Election)")
- You can adjust the dispersion of the dots with width = 0.15, height = 0
df8 %>%
dplyr::filter(!is.na(voteshare)) %>%
ggplot(aes(x = seito,
y = voteshare)) +
geom_jitter(aes(color = seito),
width = 0.15, height = 0, # Adjust the dispersion
show.legend = FALSE) +
geom_boxplot(aes(fill = seito),
alpha = 0.5,
show.legend = FALSE) +
labs(x = "Party", y = "Vote Share (2017 HR Election)")
- You can show the box plots separately by gender with facet_wrap()
df8 %>%
dplyr::filter(!is.na(voteshare)) %>%
ggplot(aes(x = seito, y = voteshare)) +
geom_jitter(aes(color = seito), alpha = 0.5,
width = 0.15, height = 0,
show.legend = FALSE) +
geom_boxplot(aes(fill = seito),
alpha = 0.5, show.legend = FALSE) +
labs(x = "Party",
y = "Vote Share (2017 HR Election)",
caption = "Male and Female Candidates") +
facet_wrap(~ gender) +
theme_bw(base_family = "HiraKakuProN-W3") +
coord_flip()
df8 %>%
dplyr::filter(!is.na(voteshare)) %>%
ggplot(aes(x = seito,
y = voteshare)) +
geom_jitter(aes(color = gender),
alpha = 0.5,
position = position_jitterdodge(jitter.width = 0.2,
jitter.height = 0),
show.legend = FALSE) +
geom_boxplot(aes(fill = gender),
alpha = 0.5) +
labs(x = "Party",
y = "Vote Share (2017 HR Election)",
fill = "",
caption = "Male and Female Candidates")
- You can limit the number of parties you would like to look into
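One possible way to limit the parties is to filter the data with `%in%` before plotting. The sketch below uses a hypothetical data frame `toy` with ASCII party labels; in the real data, seito holds Japanese labels such as 自民, so the vector `parties` would contain those labels instead.

```r
library(dplyr)

# Hypothetical toy data; the labels stand in for the values of seito
toy <- data.frame(seito     = c("LDP", "CDP", "JCP", "LDP", "Kibo"),
                  voteshare = c(48.2, 35.1, 9.4, 52.7, 21.3))

# Keep only the parties of interest before plotting
parties <- c("LDP", "CDP")
toy_sub <- dplyr::filter(toy, seito %in% parties)
print(toy_sub)
```

The filtered data frame can then be piped into the same ggplot() code as above.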
Put hr09_14_ldp_seatshare.csv in the data folder
To read hr09_14_ldp_seatshare.csv, we need to load the readr package, which is included in the tidyverse package
library(tidyverse)
df_seat <- read_csv("data/hr09_14_ldp_seatshare.csv")
Using the datatable() function, check the data frame
DT::datatable(df_seat)

| Variables | Details |
|---|---|
| year | Election Year |
| pref | Prefectures (in Japanese) |
| id | Prefecture ID (1-47) |
| nosmd | The total number of Single-Member-Districts (SMD) in each Prefecture (1-25) |
| ldp | The total number of LDP winners in the SMDs of each prefecture |
| ldp_ratio | Ratio of LDP winners among the SMDs in each prefecture (%) |
| dpj | The total number of DPJ winners in the SMDs of each prefecture |
Check the class of each variable
str(df_seat$year)
 num [1:141] 2012 2014 2009 2012 2014 ...
The class of the variable year is numeric
Change it to factor
df_seat$year <- factor(df_seat$year)
df_seat %>%
arrange(year, ldp_ratio) %>%
mutate(order_seq = c(1:47, rep(0, 47*2))) %>%
ggplot(aes(x = ldp_ratio,
y = reorder(pref, order_seq))) +
geom_segment(aes(yend = pref),
xend = 0, colour = "grey50") +
geom_point(size = 2,
aes(colour = year)) +
scale_colour_brewer(palette = "Set1",
limits = c("2009", "2012", "2014"),
guide = "none") +
facet_grid(~ year,
scales = "free_y", space = "free_y") +
theme_bw(base_family = "HiraKakuProN-W3") + # Show Japanese in chart
theme(panel.grid.major.y =
element_blank()) + # placed after theme_bw() so that it is not overridden
labs(x = "LDP Seat Share (%)",
y = "Prefecture")
df_seat %>%
arrange(year, ldp_ratio) %>%
mutate(order_seq = c(1:47, rep(0, 47*2))) %>%
ggplot(aes(x = reorder(pref, order_seq),
y = ldp_ratio,
fill = year)) +
geom_bar(stat = "identity") +
facet_grid(~ year, scales = "free_x") +
coord_flip() +
theme_bw(base_family = "HiraKakuProN-W3") +
theme(legend.position = "none") + # placed after theme_bw() so that it is not overridden
labs(x = "Prefecture",
y = "LDP's Seat Share (%)")
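The lollipop construction used above (a segment from 0 to the value, plus a point at the value) can be sketched in isolation. The data frame `toy`, its regions and its shares are hypothetical. The sketch also places `theme()` after `theme_bw()`, because a complete theme added later replaces earlier `theme()` settings.

```r
library(ggplot2)

# Hypothetical toy data: a share (%) for four regions
toy <- data.frame(region = c("North", "South", "East", "West"),
                  share  = c(75, 40, 100, 55))

# Order the regions by share so the lollipops are sorted
toy$region <- reorder(toy$region, toy$share)

p <- ggplot(toy, aes(x = share, y = region)) +
  # the stick: a segment from 0 to the value
  geom_segment(aes(xend = 0, yend = region), colour = "grey50") +
  # the head: a point at the value
  geom_point(size = 2) +
  theme_bw() +
  # theme() tweaks go after the complete theme, or they are overwritten
  theme(panel.grid.major.y = element_blank()) +
  labs(x = "Share (%)", y = "Region")

print(levels(toy$region))
```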