R packages used in this section

library(DT)
library(gapminder)
library(gghighlight)
library(ggrepel)
library(stargazer)
library(tidyverse)

1. Visualization by types of variables

variable types	types of visualization	variables needed
Discrete	2. bar chart	at least 1 variable
Continuous	3. histogram	at least 1 variable
Continuous	4. box plot	at least 1 variable
Continuous	5. lollipop chart	at least 1 variable
Continuous	6. scatterplot	at least 2 variables
Continuous	7. line graph	at least 2 variables

2. Tips on installing packages

2.1 Install packages via `GitHub`

Some packages are not on CRAN
For instance, the jpndistrict package is not on CRAN
If you try to install it with install.packages(), you will see the following error message:
package ‘jpndistrict’ is not available for this version of R
→ In that case, install the jpndistrict package from GitHub by typing the following commands in the Console:

install.packages("remotes")
remotes::install_github("uribo/jpndistrict")
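The following is a minimal sketch of a guard that installs a package only when it is missing. The helper name needs_install is hypothetical (it is not part of any package); requireNamespace() is a base R function. The install commands are commented out because they need network access.

```r
# Hypothetical helper: TRUE when a package is not yet installed
needs_install <- function(pkg) {
  !requireNamespace(pkg, quietly = TRUE)
}

# stats ships with R, so nothing needs to be installed
needs_install("stats")

# Typical use (commented out because it downloads packages):
# if (needs_install("remotes")) install.packages("remotes")
# if (needs_install("jpndistrict")) remotes::install_github("uribo/jpndistrict")
```

With a guard like this, rerunning a script does not reinstall packages that are already available.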

2.2 Install multiple packages at the same time

You can install a package together with the packages it depends on by typing the following command in the Console:

install.packages("rnaturalearth", dependencies = TRUE)

2.3 Avoid conflict among packages

For instance, if you load the tidyverse package, you will see the following message:

library(tidyverse)

What this message means:

When you load the tidyverse package, 8 packages are attached automatically: ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr, forcats
The message says that two dplyr functions, filter() and lag(), have the same names as functions in the stats package and mask them
→ You can avoid this conflict by writing the package name explicitly:

filter() → dplyr::filter()
lag() → dplyr::lag()
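As a small illustration of why the package prefix matters, the sketch below calls both functions named filter() on made-up data (the data frame scores is invented for this example). It assumes the dplyr package is installed.

```r
library(dplyr)

# Made-up data for illustration
scores <- data.frame(id = 1:5, x = c(10, 20, 30, 40, 50))

# dplyr::filter() keeps the rows that satisfy a condition
kept <- dplyr::filter(scores, x > 30)
kept

# stats::filter() is a different function: it applies a linear filter
# (here a 3-point moving average) to a numeric series
smoothed <- stats::filter(scores$x, rep(1 / 3, 3))
smoothed
```

Writing dplyr::filter() makes it clear which of the two functions is meant, regardless of the order in which packages were loaded.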

2.4 How to avoid Garbled characters in `ggplot`

If Japanese characters are garbled in graphs or figures drawn with ggplot, you can avoid the problem by including either of the following two commands (both specify Hiragino fonts, which are available on macOS):

theme_bw(base_family = "HiraKakuProN-W3")

theme_set(theme_classic(base_size = 10,
                        base_family = "HiraginoSans-W3"))

2.5 How to load a package

There are two ways to use a function from a package in RStudio
For instance, if you want to use the datatable() function in the DT package, you can do it in either of the following two ways:

library(DT)
datatable(df1)

DT::datatable(df1)

In this section, I use them interchangeably

2.6 How to make an R Project

It is very useful and efficient to make an R Project when you work in RStudio
Making an R Project = making an R Project folder
A folder = a directory
Working inside an R Project folder dramatically increases the efficiency of your analysis
A working directory is the directory in which you are currently working
Working in RStudio lets you keep a precise record of what you are doing and what you have done (people easily forget what they have done)

3. Data Preparation

3.1 Data Cleaning on Election Data (1996-2017)

Download hr96-17.csv
Make a folder within your R Project folder and name it data
Put the hr96-17.csv in the data folder
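We read hr96-17.csv with read.csv(), which is part of base R; we also load the tidyverse package because its functions (mutate(), ggplot(), and so on) are used below

library(tidyverse)

df <- read.csv("data/hr96-17.csv", 
               na.strings = ".")  # treat "." as a missing value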

Show the list of variables in df

names(df)

 [1] "year"          "pref"          "ku"            "kun"          
 [5] "mag"           "rank"          "wl"            "nocand"       
 [9] "seito"         "j_name"        "name"          "term"         
[13] "gender"        "age"           "exp"           "status"       
[17] "vote"          "voteshare"     "eligible"      "turnout"      
[21] "castvotes"     "seshu_dummy"   "jiban_seshu"   "nojiban_seshu"

Make new variables and modify variables

Make a dummy variable (wlsmd)

We make a dummy variable (wlsmd) using a variable, wl

variable name	detail
wl	0 = loser / 1 = single-member district (smd) winner / 2 = zombie winner
wlsmd	0 = loser / 1 = winner

table(df$wl)


   0    1    2 
5563 2387  853

df1 <- mutate(df, wlsmd = as.numeric(wl == 1))

table(df1$wlsmd)


   0    1 
6416 2387
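The recoding step can be checked on a small example. The sketch below uses a made-up data frame (toy) and cross-tabulates the original and the new variable to confirm that only wl == 1 is mapped to 1.

```r
# Made-up values of wl: 0 = loser, 1 = SMD winner, 2 = zombie winner
toy <- data.frame(wl = c(0, 1, 2, 1, 0, 2))

# The comparison gives TRUE/FALSE; as.numeric() turns it into 1/0
toy$wlsmd <- as.numeric(toy$wl == 1)

# Cross-tabulate old and new variables:
# only the row wl = 1 should have counts in the column wlsmd = 1
check <- table(wl = toy$wl, wlsmd = toy$wlsmd)
check
```

The same kind of cross-tabulation, for example table(df1$wl, df1$wlsmd), is a quick way to confirm that a dummy variable was built as intended.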

Make a variable (exppv)

exp is election expenditure (yen) spent by each candidate
We want to make exppv, the election expenditure (yen) per eligible voter spent by each candidate
eligible is the number of eligible voters in each single-member district

df1 <- mutate(df1, exppv = exp / eligible)

summary(df1$exppv)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
  0.0013   8.1762  18.7646  23.0907  33.3863 120.8519     1974

How to deal with an error when you make exppv
If df1 <- mutate(df1, exppv = exp / eligible) returns an error, check the class of exp and eligible with the following procedure, and then run the command again:

Steps	Command	Detail
1	`str(df1$exp)`	Check the class of `exp`
2		If the class is `num` or `int`, go to Step 4; otherwise, go to Step 3
3	`df1$exp <- as.numeric(df1$exp)`	Change the class of `exp` to `num`
4	`str(df1$eligible)`	Check the class of `eligible`
5		If the class is `num` or `int`, go to Step 7; otherwise, go to Step 6
6	`df1$eligible <- as.numeric(df1$eligible)`	Change the class of `eligible` to `num`
7	`str(df1$eligible)`	Check the class of `eligible` again
8		If the class is `num` or `int`, then it is OK

If R reads exp and eligible (which are supposed to be numeric) as character rather than numeric, we need to change the class of each variable to numeric by using the as.numeric() function
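A minimal sketch of the conversion, assuming the expenditure column was read as text. The first two values of each vector are taken from the data shown in this section; the "." entry and the third number of eligible voters are included only to show how a missing value behaves.

```r
# Expenditure stored as text; "." marks a missing value
exp_chr <- c("9828097", "12940178", ".")
str(exp_chr)

# as.numeric() converts the text; entries that are not numbers become NA
# (R warns about this, so the warning is silenced here)
exp_num <- suppressWarnings(as.numeric(exp_chr))
str(exp_num)

# Division now works, and NA is carried through to the result
eligible <- c(346774, 338310, 331808)
exp_num / eligible
```

This is also why exppv has missing values: whenever exp is NA, exp / eligible is NA as well.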

Make a dummy variable (inc)

We make a dummy variable (inc) using a variable, status

variable name	detail
status	0 = challenger / 1 = incumbent / 2 = former incumbent
inc	0 = non-incumbent / 1 = incumbent

table(df1$status)


   0    1    2 
5106 3129  568

df1 <- mutate(df1, inc = as.numeric(status == 1 ))

table(df1$inc)


   0    1 
5674 3129

names(df1)

 [1] "year"          "pref"          "ku"            "kun"          
 [5] "mag"           "rank"          "wl"            "nocand"       
 [9] "seito"         "j_name"        "name"          "term"         
[13] "gender"        "age"           "exp"           "status"       
[17] "vote"          "voteshare"     "eligible"      "turnout"      
[21] "castvotes"     "seshu_dummy"   "jiban_seshu"   "nojiban_seshu"
[25] "wlsmd"         "exppv"         "inc"

Make a dummy variable (ldp)

seito is a variable that records each candidate’s affiliated party (in Japanese)

table(df1$seito)


          アイヌ民族党 さわやか神戸・市民の会     ニューディールの会 
                     1                      2                      1 
                みんな               安楽死党                   維新 
                    79                      1                     77 
        沖縄社会大衆党                   改革             改革クラブ 
                     1                      1                      4 
              希望の党                   共産                   公明 
                   198                   2123                     70 
                  幸福               国民新党                 国民党 
                   312                     21                     11 
      市民新党にいがた                   次世                   自民 
                     1                     39                   2266 
                自由党               自由連合                   社民 
                    61                    212                    307 
                  緒派                   諸派               新社会党 
                    44                      9                     38 
                新進党           新党さきがけ               新党尊命 
                   235                     13                      1 
              新党大地               新党日本       世界経済共同体党 
                     8                      9                      2 
        政事公団太平会           政治団体代表                   生活 
                     1                      2                     13 
            青年自由党             当たり前党           日本維新の会 
                     1                      1                    198 
            日本新進党           日本未来の党         文化フォーラム 
                     1                    111                     10 
              保守新党                 保守党                   民主 
                    11                     16                   1654 
          民主改革連合                 無所属             無所属の会 
                     2                    562                      9 
              立憲民主                 緑の党 
                    63                      1

Using seito, we make a dummy variable, ldp
ldp = 1: LDP candidates, ldp = 0: non-LDP candidates
「自民」 means LDP in Japanese

df1 <- mutate(df1, ldp = as.numeric(seito == "自民" ))

table(df1$ldp)


   0    1 
6537 2266

names(df1)

 [1] "year"          "pref"          "ku"            "kun"          
 [5] "mag"           "rank"          "wl"            "nocand"       
 [9] "seito"         "j_name"        "name"          "term"         
[13] "gender"        "age"           "exp"           "status"       
[17] "vote"          "voteshare"     "eligible"      "turnout"      
[21] "castvotes"     "seshu_dummy"   "jiban_seshu"   "nojiban_seshu"
[25] "wlsmd"         "exppv"         "inc"           "ldp"

df1 contains the following 28 variables

variable	detail
year	Election year (1996-2017)
pref	Prefecture
ku	Electoral district name
kun	Number of electoral district
mag	District magnitude (Number of candidates elected)
rank	Candidate’s rank by votes in each district
nocand	Number of candidates in each district
seito	Candidate’s affiliated party
j_name	Candidate’s name (Japanese)
name	Candidate’s name (English)
term	Previous wins
gender	Candidate’s gender: “male”, “female”
age	Candidate’s age
wl	0 = loser / 1 = single-member district (smd) winner / 2 = zombie winner
wlsmd	0 = loser / 1 = winner
exp	Election expenditure (yen) spent by each candidate
status	0 = challenger / 1 = incumbent / 2 = former incumbent
vote	votes each candidate garnered
voteshare	Voteshare (%)
eligible	Eligible voters in each district
turnout	Turnout in each district (%)
castvotes	Total votes cast in each district
seshu_dummy	0 = non-hereditary candidate, 1 = hereditary candidate
jiban_seshu	Relationship between candidate and his predecessor
nojiban_seshu	Relationship between candidate and his predecessor
exppv	Election expenditure (yen) per eligible voter spent by each candidate
inc	0 = non-incumbent / 1 = incumbent
ldp	0 = non-LDP candidates, 1 = LDP candidates

3.2 Descriptive Statistics of Japan’s National Elections

Show descriptive statistics of Japan’s national election data (df1)

library(stargazer)

Set the chunk option to {r, results = "asis"}

stargazer(as.data.frame(df1), 
          type ="html",
          digits = 2)


Statistic	N	Mean	St. Dev.	Min	Pctl(25)	Pctl(75)	Max

year	8,803	2,006.60	6.81	1,996	2,000	2,012	2,017
kun	8,803	5.74	5.06	1	2	8	25
mag	8,803	1.00	0.00	1	1	1	1
rank	8,803	2.70	21.36	1	1	3	2,003
wl	8,803	0.46	0.67	0	0	1	2
nocand	8,803	3.96	1.08	2	3	5	9
term	8,803	1.86	2.68	0	0	3	20
age	8,799	50.90	11.08	25.00	43.00	59.00	94.00
exp	6,829	7,551,393.00	5,482,684.00	535.00	2,803,567.00	11,044,412.00	27,462,362.00
status	8,803	0.48	0.62	0	0	1	2
vote	8,803	54,911.15	40,467.97	177	18,239.5	86,494.5	201,461
voteshare	8,803	27.08	19.19	0	8.9	42.9	95
eligible	7,928	326,092.00	79,708.01	115,013.00	269,945.80	390,965.00	495,212.00
turnout	6,992	62.84	6.39	44.71	57.74	67.50	83.80
castvotes	6,992	210,416.40	41,101.89	104,398.00	181,016.20	237,484.00	339,780.00
seshu_dummy	8,803	0.12	0.32	0	0	0	1
wlsmd	8,803	0.27	0.44	0	0	1	1
exppv	6,829	23.09	18.13	0.001	8.18	33.39	120.85
inc	8,803	0.36	0.48	0	0	1	1
ldp	8,803	0.26	0.44	0	0	1	1

df1 contains 28 variables
But we can see only 20 variables here

Reason:

stargazer() shows descriptive statistics only for variables whose class is numeric (integer or double)
Character variables and factor variables are not numeric, so the 8 character variables in df1 are left out
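To see which columns count as numeric, is.numeric() can be applied to every column. The sketch below uses a small made-up data frame (toy); the same call on df1 would be sapply(df1, is.numeric).

```r
# Made-up data frame with four classes of columns
toy <- data.frame(
  year   = c(1996L, 2000L),                # integer
  share  = c(40.0, 32.9),                  # double
  party  = c("a", "b"),                    # character
  gender = factor(c("male", "female"))     # factor
)

# TRUE for integer and double columns, FALSE for character and factor columns
is_num <- sapply(toy, is.numeric)
is_num

# Number of columns a numeric summary would cover
sum(is_num)
```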

It is very important to check the class of variable in data visualization

Check the class of variables in df1 using str() function

str(df1)

'data.frame':   8803 obs. of  28 variables:
 $ year         : int  1996 1996 1996 1996 1996 1996 1996 1996 1996 1996 ...
 $ pref         : chr  "愛知" "愛知" "愛知" "愛知" ...
 $ ku           : chr  "aichi" "aichi" "aichi" "aichi" ...
 $ kun          : int  1 2 3 4 5 6 7 8 9 10 ...
 $ mag          : int  1 1 1 1 1 1 1 1 1 1 ...
 $ rank         : int  1 1 1 1 1 1 1 1 1 1 ...
 $ wl           : int  1 1 1 1 1 1 1 1 1 1 ...
 $ nocand       : int  7 8 7 6 7 8 7 5 7 7 ...
 $ seito        : chr  "新進党" "新進党" "新進党" "新進党" ...
 $ j_name       : chr  "河村たかし" "青木宏之" "吉田幸弘" "三沢淳" ...
 $ name         : chr  "KAWAMURA, TAKASHI" "AOKI, HIROYUKI" "YOSHIDA, YUKIHIRO" "MISAWA, JUN" ...
 $ term         : int  2 2 1 1 3 8 7 3 13 2 ...
 $ gender       : chr  "male" "male" "male" "male" ...
 $ age          : int  47 51 35 44 48 68 55 59 65 53 ...
 $ exp          : int  9828097 12940178 11245219 12134215 11894801 11252336 13493050 6368857 19731389 18863794 ...
 $ status       : int  1 1 0 0 1 1 1 1 1 1 ...
 $ vote         : int  66876 56101 52478 57361 48648 90812 91439 93053 111578 110820 ...
 $ voteshare    : num  40 32.9 32.3 35.7 30.9 39.7 47.5 44.4 47.7 46.4 ...
 $ eligible     : int  346774 338310 331808 315704 319846 433930 357984 377152 393953 437148 ...
 $ turnout      : num  49.2 51.8 50.4 52 50.3 54.2 55.5 57.1 60.6 56 ...
 $ castvotes    : int  167051 170317 162679 160548 157404 228631 192362 209450 234001 238646 ...
 $ seshu_dummy  : int  0 0 0 0 1 0 0 1 0 1 ...
 $ jiban_seshu  : chr  NA NA NA NA ...
 $ nojiban_seshu: chr  NA NA NA NA ...
 $ wlsmd        : num  1 1 1 1 1 1 1 1 1 1 ...
 $ exppv        : num  28.3 38.2 33.9 38.4 37.2 ...
 $ inc          : num  1 1 0 0 1 1 1 1 1 1 ...
 $ ldp          : num  0 0 0 0 0 0 0 1 0 0 ...

4. Bar Chart

4.1 The Number of Candidates in HR Elections (1996-2017)

Draw a bar chart representing the number of candidates per Lower House election between 1996 and 2017
Check the number of candidates by using the table() function

table(df1$year)


1996 2000 2003 2005 2009 2012 2014 2017 
1261 1199 1026  989 1139 1294  959  936

The x-axis shows the election year
The y-axis shows the number of candidates running for election
We use the election data df1

Caution: Windows users should type either of the following two commands to avoid garbled characters

windowsFonts(YuGothic = windowsFont("Yu Gothic"))
windowsFonts(Noto = windowsFont("Noto Sans CJK JP"))

df1 %>% 
  ggplot() +
  geom_bar(aes(x = year)) +
  labs(x = "Election Year", y = "The Number of Candidates") + 
  theme_bw(base_family = "HiraKakuProN-W3")

Something is wrong with the x-axis: the bars are placed on a continuous scale instead of one bar per election year

str(df1$year)

 int [1:8803] 1996 1996 1996 1996 1996 1996 1996 1996 1996 1996 ...

Reason => The class of year is integer (a numeric class), so ggplot() treats it as a continuous variable

Solution

Change the class of year from integer to factor

df1$year <- factor(df1$year)

str(df1$year)

 Factor w/ 8 levels "1996","2000",..: 1 1 1 1 1 1 1 1 1 1 ...

df1 %>% 
  ggplot() +
  geom_bar(aes(x = year)) +
  labs(x = "Election Year", y = "The Number of Candidates") + 
  theme_bw(base_family = "HiraKakuProN-W3")
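The effect of the conversion can be seen on a short made-up vector of election years. This is only a sketch of what factor() does to the values; it does not draw a plot.

```r
# Made-up vector of election years stored as integers
year_int <- c(1996L, 1996L, 2000L, 2003L, 2003L)

# As integers the values sit on a continuous scale (note the uneven gaps)
diff(sort(unique(year_int)))

# As a factor each election year becomes one category
year_fct <- factor(year_int)
levels(year_fct)
table(year_fct)

# To get the numbers back later, go through character first
as.numeric(as.character(year_fct))
```

Calling as.numeric() directly on a factor returns the internal level codes (1, 2, 3, ...), not the years, which is why the last line converts to character first.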

What we can see from the bar chart
・The number of candidates in Japan’s lower house elections has been decreasing since 1996 (except in 2009 and 2012)

4.2 The Number of Candidates (LDP and Non-LDP)

Visualize the number of LDP and Non-LDP candidates per election
Calculate the number of LDP and Non-LDP candidates in each election

df2 <- df1 %>%
  group_by(year, ldp) %>%
  summarise(N = n(),
            .groups   = "drop")

df2

# A tibble: 16 x 3
   year    ldp     N
   <fct> <dbl> <int>
 1 1996      0   973
 2 1996      1   288
 3 2000      0   928
 4 2000      1   271
 5 2003      0   749
 6 2003      1   277
 7 2005      0   699
 8 2005      1   290
 9 2009      0   848
10 2009      1   291
11 2012      0  1005
12 2012      1   289
13 2014      0   676
14 2014      1   283
15 2017      0   659
16 2017      1   277

Using the group_by() and summarize() functions, calculate the average number of LDP and Non-LDP candidates per election

df2 %>%
  group_by(ldp) %>% 
  summarize(N = mean(N, na.rm = TRUE),
            .groups = "drop")

# A tibble: 2 x 2
    ldp     N
  <dbl> <dbl>
1     0  817.
2     1  283.

The average number of Non-LDP candidates per election: 817
The average number of LDP candidates per election: 283
Check the class of the variable ldp

class(df2$ldp)

[1] "numeric"

Change the class of ldp from dbl to factor

df2$ldp <- factor(df2$ldp)

class(df2$ldp)

[1] "factor"

Draw a stacked bar chart

df2 %>%
  ggplot() +
  geom_bar(aes(x = year, y = N, fill = ldp), 
           stat = "identity", position = "stack") +
  labs(x = "Election Year", y = "The Number of Candidates") +
  theme_minimal(base_family = "HiraKakuProN-W3")

What we can see from the bar chart
・The number of candidates in Japan’s lower house elections has been decreasing since 1996 (except in 2009 and 2012)
・The number of LDP candidates does not change much over time

Draw a parallel bar chart

Assigning position = "dodge" enables us to draw parallel bar charts

df2 %>%
  ggplot() +
  geom_bar(aes(x = year, y = N, fill = ldp), 
           stat = "identity", position = "dodge") + 
  labs(x = "Election Year", y = "The Number of Candidates") +
  theme_minimal(base_family = "HiraKakuProN-W3")

The difference between a parallel bar chart and a stacked bar chart

Types of bar chart	What you can do
Stacked	You can compare the total number of candidates by Election Year
Parallel	You can compare the number of candidates by Party within each Election Year

Which one you should use is up to you!
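As a side note, ggplot2 also provides geom_col(), which uses the y values as bar heights in the same way as geom_bar(stat = "identity"). The sketch below redraws both versions with geom_col() on the 1996 and 2000 counts shown earlier in this section; the object names counts, p_stack and p_dodge are chosen only for this example.

```r
library(ggplot2)

# Counts for 1996 and 2000, copied from the table of df2 above
counts <- data.frame(
  year = factor(c(1996, 1996, 2000, 2000)),
  ldp  = factor(c(0, 1, 0, 1)),
  N    = c(973, 288, 928, 271)
)

# Stacked: bars of the same year are placed on top of each other
p_stack <- ggplot(counts) +
  geom_col(aes(x = year, y = N, fill = ldp), position = "stack")

# Parallel: bars of the same year are placed side by side
p_dodge <- ggplot(counts) +
  geom_col(aes(x = year, y = N, fill = ldp), position = "dodge")

p_stack
p_dodge
```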

Change colors in bar chart

You can assign any color you like by using scale_fill_manual() function

df2 %>%
  ggplot() +
  geom_bar(aes(x = year, y = N, fill = ldp), 
           stat = "identity", position = "dodge") +
  labs(x = "Election Year", y = "The Number of Candidates") +
  theme_minimal(base_family = "HiraKakuProN-W3") +
  scale_fill_manual(values = c("springgreen2", "deeppink2"))

Choose any color you like from the ggplot2 Quick Reference: colour (and fill)
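Color names can also be looked up from within R. The sketch below relies on colors(), a base R function that returns the built-in color names, and checks that the two names used above are among them.

```r
# colors() returns every color name that R recognizes
all_names <- colors()
length(all_names)

# Check that the names used above are valid
c("springgreen2", "deeppink2") %in% all_names

# Search for candidates by pattern, e.g. every name containing "pink"
grep("pink", all_names, value = TRUE)
```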

4.3 SMD Winners' Average Age in Lower House Elections

Visualize the average age of SMD winners
Using dplyr functions, calculate the average age by election year and save it as df3

df3 <- df1 %>%
  dplyr::filter(wlsmd == 1) %>% # choose only smd winners
  group_by(year) %>%         # calculate by election year
  summarize(age = mean(age, na.rm = TRUE),  # calculate the mean of age 
            .groups = "drop")
df3

# A tibble: 8 x 2
  year    age
  <fct> <dbl>
1 1996   54.2
2 2000   53.8
3 2003   52.9
4 2005   53.1
5 2009   51.0
6 2012   52.5
7 2014   54.7
8 2017   55.9

Draw a graph with df3
The x-axis is the election year
The y-axis is the winners’ average age
Use the values in age as bar heights => (stat = "identity")

df3 %>%
  ggplot() +
  geom_bar(aes(x = year, y = age), stat = "identity") + 
  labs(x = "Election Year", y = "") +
  theme_minimal(base_family = "HiraKakuProN-W3")

summary(df3$age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  51.03   52.81   53.47   53.52   54.30   55.90

Summary
・The SMD winners’ average age is about 54 years old and it does not vary much across elections

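4.4 The SMD Winners’ Average Age (LDP & Non-LDP)

Visualize the average age of LDP and non-LDP candidates by election year
The code below averages over all candidates in df1; to restrict it to SMD winners (so that losers and zombie winners are not included), add dplyr::filter(wlsmd == 1) before group_by()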

library(tidyverse)

df4 <- df1 %>% 
  group_by(year, ldp) %>% 
  summarize(age = mean(age, na.rm = TRUE),
            .groups = "drop")
df4

# A tibble: 16 x 3
   year    ldp   age
   <fct> <dbl> <dbl>
 1 1996      0  49.4
 2 1996      1  54.3
 3 2000      0  49.5
 4 2000      1  55.7
 5 2003      0  49.5
 6 2003      1  54.3
 7 2005      0  49.3
 8 2005      1  52.6
 9 2009      0  48.3
10 2009      1  55.5
11 2012      0  49.8
12 2012      1  51.9
13 2014      0  51.5
14 2014      1  53.3
15 2017      0  51.7
16 2017      1  55.3

Calculate the average age of LDP and non-LDP candidates over all elections

df4 %>%
  group_by(ldp) %>% 
  summarize(age = mean(age, na.rm = TRUE),
            .groups = "drop")

# A tibble: 2 x 2
    ldp   age
  <dbl> <dbl>
1     0  49.9
2     1  54.1

Non-LDP candidates' average age: 49.9
LDP candidates' average age: 54.1
The class of ldp is <dbl> (numeric)
=> we need to change its class to factor

df4$ldp <- factor(df4$ldp)

Check if the class is changed

class(df4$ldp)

[1] "factor"

df4 %>%
  ggplot() +
  geom_bar(aes(x = year, y = age, fill = ldp), 
           stat = "identity", position = "dodge") + 
  labs(x = "Election Year", y = "Candidates' Average Age")

Summary
・The average age of LDP candidates (54.1) is higher than that of non-LDP candidates (49.9)

5. Histogram

A histogram is the most commonly used graph to show frequency distributions.
It looks very much like a bar chart, but there are important differences between them.
A histogram is a visual representation of frequency table.
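To make the link between a frequency table and a histogram concrete, the sketch below bins ten made-up vote shares with cut() and counts them with table(). The bin width of 20 is an arbitrary choice for illustration.

```r
# Made-up vote shares (%)
share <- c(3.2, 8.9, 12.5, 25.8, 27.1, 42.9, 44.0, 50.1, 56.0, 95.3)

# Cut the range 0-100 into bins of width 20, closed on the left
bins <- cut(share, breaks = seq(0, 100, by = 20), right = FALSE)

# The frequency table: one count per bin,
# which a histogram draws as the height of each bar
freq <- table(bins)
freq
```

A histogram of share with the same breaks would have bars of height 3, 2, 4, 0 and 1.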

Differences between a bar chart and a histogram

Type of graph	x-axis	Geometric object	Gap between bars
Bar Chart	Discrete variable	`geom_bar()`	Yes
Histogram	Continuous variable	`geom_histogram()`	No

5.1 Histogram 1 (vote share in HR elections)

Candidate’s Vote Share
- Candidate’s vote share in the lower house elections (1996-2017)

summary(df1$voteshare)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.10    8.90   25.76   27.08   42.90   95.30

minimum = 0.1%, maximum = 95.3%, average = 27.08%
Draw a histogram

df1 %>% 
  ggplot() +
  geom_histogram(aes(x = voteshare)) +
  labs(x = "Candidate's Vote Share (1996-2017)", y = "Frequency") +
  geom_vline(xintercept = mean(df1$voteshare),  # Draw a line at its mean
             col = "magenta3")

Check wl

table(df1$wl)


   0    1    2 
5563 2387  853

We want to make a data frame (df5) for SMD winners (wl == 1)

df5 <- df1 %>%
  dplyr::filter(wl == 1)

Show the descriptive statistics of the SMD winner’s vote share between 1996 and 2017 in the lower house elections

summary(df5$voteshare)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  21.50   44.20   50.10   50.65   56.00   95.30

minimum = 21.50%, maximum = 95.30%, average = 50.65%
Draw a histogram

df5 %>% 
  ggplot() +
  geom_histogram(aes(x = voteshare)) +
  labs(x = "SMD Winner's Vote Share (1996-2017)", y = "Frequency") +
  geom_vline(xintercept = mean(df5$voteshare),   # Draw a line at its mean
             col = "magenta3")

5.2 Histogram 2 (vote share in HR elections)

Draw a white line between bins

df5 %>%
  dplyr::filter(!is.na(voteshare)) %>%
  ggplot() +
  geom_histogram(aes(x = voteshare), color = "white") +
  labs(x = "SMD Winner's vote share (1996-2017)", y = "Number of Candidates")

Change the number of Bins

Change the number of bins to 10

df5 %>%
  ggplot() +
  geom_histogram(aes(x = voteshare), 
                 color = "white", 
                 bins = 10) +
  labs(x = "SMD Winner's vote share (1996-2017)", y = "Number of Candidates")

Add another dimension

We want to draw a histogram with a new dimension: whether or not a candidate is an LDP politician (自民 means LDP in Japanese)

Stacked Histogram

df5 %>%
  mutate(ldp2 = ifelse(ldp == 1, "LDP", "Non_LDP")) %>%
  ggplot() +
  geom_histogram(aes(x = voteshare, 
                     fill = ldp2), 
                 color = "white",
                 bins = 30) +
  labs(x = "SMD Winner's vote share (1996-2017)", y = "Candidate's number", fill = "Candidate's party")

Overlapping Histogram

df5 %>%
  mutate(ldp2 = ifelse(ldp == 1, "LDP", "Non_LDP")) %>%
  dplyr::filter(!is.na(voteshare)) %>%                
  ggplot() +
  geom_histogram(aes(x = voteshare,
                     fill = ldp2),           
                 color = "white",              
                 alpha = 0.5,                  
                 position = "identity",         
                 boundary = 0) +               
  labs(x = "SMD Winner's vote share (1996-2017)", 
       y = "Candidate's number", fill = "Candidate's party")

6. Box plot

6.1 Show Vote Share using Box Plot (2017HR)

df8 <- df1 %>% 
  dplyr::filter(year == 2017)

df8 %>%
  filter(!is.na(voteshare)) %>%
  ggplot() +
  geom_boxplot(aes(y = voteshare)) +
  labs(y = "Vote Share (2017 HR Election)") +
  coord_flip()  # flip the plot

How to interpret the box plot (figure)

Source: Asano and Yanai, 『Rによる計量政治学』, p. 107.
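The parts of a box plot can also be computed by hand. The sketch below uses ten made-up vote shares and the 1.5 × IQR rule, which is the rule geom_boxplot() uses by default to decide which points are drawn as outliers.

```r
# Made-up vote shares (%) with one very large value
share <- c(5, 8, 12, 20, 25, 30, 34, 41, 47, 90)

q1  <- quantile(share, 0.25)   # lower edge of the box
med <- median(share)           # line inside the box
q3  <- quantile(share, 0.75)   # upper edge of the box
iqr <- q3 - q1                 # length of the box (interquartile range)

# Points more than 1.5 * IQR away from the box are drawn as outliers;
# the whiskers reach the most extreme points inside these fences
upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr
share[share > upper_fence | share < lower_fence]
```

Here only the value 90 lies outside the fences, so it is the only point that would appear as a dot.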

6.2 Show Vote Share by Party (2017HR)

df8 %>%
  dplyr::filter(!is.na(voteshare)) %>%
  ggplot() +
  geom_boxplot(aes(x = seito, 
                   y = voteshare)) +
  labs(x = "Party", 
       y = "Vote Share (%) ") +
  theme_gray(base_family = "HiraKakuProN-W3")  +
  coord_flip()

You can nicely compare the vote shares by party
On average, 公明党 (CGP) candidates received the largest vote share
共産党 (JCP) candidates received the smallest vote share
The variance of 無所属 (independent) candidates is the largest
The variance of 公明党 (CGP) and 自民党 (LDP) candidates is relatively small
A dot 「●」 marks an outlier
自民党 (LDP) has four outliers
They are the LDP winners with the highest vote shares
Check who they are

df8 %>% 
  dplyr::filter(voteshare > 75) %>% 
  select(pref, kun, seito, age, voteshare, vote, j_name) %>% 
  head()

    pref kun seito age voteshare   vote     j_name
1   岐阜   2  自民  54     75.40 117278   棚橋泰文
2   広島   1  自民  60     77.96 113239   岸田文雄
3 神奈川  11  自民  36     78.02 154761 小泉進次郎
4   宮城   6  自民  57     85.72 123871 小野寺五典
5   鳥取   1  自民  60     83.63 106425     石破茂

You can color the boxes by party by adding fill = seito

df8  %>%
  dplyr::filter(!is.na(voteshare)) %>%
  ggplot() +
  geom_boxplot(aes(x = seito, 
                   y = voteshare,
                   fill = seito),    
               show.legend = FALSE) + 
  labs(x = "Party", y = "Vote Share (2017 HR Election)") +
  coord_flip()

6.3 Exercise

Q: Referring to 6.2 Show Vote Share by Party (2017HR), generate box plots showing the vote shares by party in the 2009 HR election.

6.4 Present dots on box plot

Using geom_point(), you can show each observation as a dot “●” on the box plot

df8 %>%
  dplyr::filter(!is.na(voteshare)) %>%
  ggplot(aes(x = seito, y = voteshare)) +
  geom_point(aes(color = seito), alpha = 0.5,
             show.legend = FALSE) +
  geom_boxplot(aes(fill = seito),
               alpha = 0.5, show.legend = FALSE) +
  labs(x = "Party", y = "Vote Share (2017 HR Election)")

- You can scatter the dots by using geom_jitter() instead of geom_point() so that you can see them clearly

df8 %>%
  filter(!is.na(voteshare)) %>%
  ggplot(aes(x = seito, 
             y = voteshare)) +
  geom_jitter(aes(color = seito),
              show.legend = FALSE) +
  geom_boxplot(aes(fill = seito),
               alpha = 0.5, 
               show.legend = FALSE) +
  labs(x = "Party", y = "Vote Share (2017 HR Election)")

You can adjust the degree of dispersion using the width and height arguments (here, width = 0.15, height = 0)
The larger the number, the larger the dispersion

df8 %>%
  dplyr::filter(!is.na(voteshare)) %>%
  ggplot(aes(x = seito, 
             y = voteshare)) +
  geom_jitter(aes(color = seito),
              width = 0.15, height = 0, # Adjust the dispersion  
              show.legend = FALSE) +
  geom_boxplot(aes(fill = seito),
               alpha = 0.5, 
               show.legend = FALSE) +
  labs(x = "Party", y = "Vote Share (2017 HR Election)")

6.6 Add Another Dimension to Box Plot

facet_wrap()

df8 %>%
  dplyr::filter(!is.na(voteshare)) %>%
  ggplot(aes(x = seito, y = voteshare)) +
  geom_jitter(aes(color = seito), alpha = 0.5,
              width = 0.15, height = 0,
              show.legend = FALSE) +
  geom_boxplot(aes(fill = seito),
               alpha = 0.5, show.legend = FALSE) +
  labs(x = "Party", 
       y = "Vote Share (2017 HR Election)",
       caption = "Male and Female Candidates") +
  facet_wrap(~ gender) +
  theme_bw(base_family = "HiraKakuProN-W3") +
  coord_flip()

You can compare vote share between male and female candidates by party

df8 %>%
  dplyr::filter(!is.na(voteshare)) %>%
  ggplot(aes(x = seito, 
             y = voteshare)) +
  geom_jitter(aes(color = gender), 
              alpha = 0.5,
              position = position_jitterdodge(jitter.width = 0.2,
                                              jitter.height = 0),
              show.legend = FALSE) +
  geom_boxplot(aes(fill = gender),
               alpha = 0.5) +
  labs(x = "Party", 
       y = "Vote Share (2017 HR Election)", 
       fill = "",
       caption = "Male and Female Candidates")

- You can also limit the plot to the parties you want to look into
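One way to keep only selected parties is to filter on seito with %in% before plotting. The sketch below uses a made-up data frame (toy) with invented vote shares; the party labels are ones that appear in seito, and the choice of three parties is arbitrary. It assumes the dplyr package is installed.

```r
library(dplyr)

# Made-up data; the vote shares are invented for illustration
toy <- data.frame(
  seito     = c("自民", "公明", "共産", "無所属", "自民", "希望の党"),
  voteshare = c(52.1, 48.3, 9.7, 30.2, 61.5, 25.4)
)

# Parties to keep (an arbitrary choice for illustration)
keep <- c("自民", "公明", "共産")

# %in% is TRUE for rows whose party is one of the listed parties
toy_small <- toy %>%
  dplyr::filter(seito %in% keep)
toy_small
```

The same filter line can be placed before ggplot() in the pipelines above, so that the box plot shows only the selected parties.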

7. Lollipop Chart

Download hr09_14_ldp_seatshare.csv
It contains the LDP’s results in Japan’s lower house elections between 2009 and 2014
Make a folder within your R Project folder and name it data
Put hr09_14_ldp_seatshare.csv in the data folder
To read hr09_14_ldp_seatshare.csv, we need to load the readr package, which is included in the tidyverse package

library(tidyverse)

df_seat <- read_csv("data/hr09_14_ldp_seatshare.csv")

Using the datatable() function, check the data frame

DT::datatable(df_seat)

df_seat contains the following 7 variables

Variables	Details
year	Election Year
pref	Prefectures (in Japanese)
id	Prefecture ID (1-47)
nosmd	The total number of Single-Member Districts (SMDs) in each prefecture (1-25)
ldp	The total number of LDP winners in the SMDs of each prefecture
ldp_ratio	Ratio of LDP winners in the SMDs of each prefecture (%)
dpj	The total number of DPJ winners in the SMDs of each prefecture

Check the class of each variable

str(df_seat$year)

 num [1:141] 2012 2014 2009 2012 2014 ...

The class of the variable year is numeric
Convert it into factor

df_seat$year <- factor(df_seat$year)

df_seat %>% 
  arrange(year, ldp_ratio) %>%
  mutate(order_seq = c(1:47, rep(0, 47*2))) %>%  # rank prefectures by their 2009 ratio
  ggplot(aes(x = ldp_ratio, 
             y = reorder(pref, order_seq))) + 
  geom_segment(aes(yend = pref),
               xend = 0, colour = "grey50") +
  geom_point(size = 2,
             aes(colour = year)) +
  scale_colour_brewer(palette = "Set1", 
                      limits = c("2009", "2012", "2014"),
                      guide = "none") +
  facet_grid(~ year,
             scales = "free_y", space = "free_y") + 
  theme_bw(base_family = "HiraKakuProN-W3") + # Show Japanese in chart
  theme(panel.grid.major.y = element_blank()) + # must come after theme_bw()
  labs(x = "Ratio of LDP Winners in SMDs (%)", 
       y = "Prefecture")

The ratio of LDP winners in SMDs by prefecture (2009-2014)
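The ordering of prefectures on the axis comes from reorder(), which sorts the levels of a categorical variable by a second variable. A minimal sketch on made-up values (three prefectures labelled A, B and C):

```r
# Made-up data: three prefectures and a value to sort by
pref  <- c("A", "B", "C")
ratio <- c(50, 10, 30)

# reorder() returns a factor whose levels are sorted by the second argument
pref_ordered <- reorder(pref, ratio)
levels(pref_ordered)
```

The levels come out as B, C, A, that is, from the smallest ratio to the largest, and ggplot() places the categories on the axis in the order of the levels.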

You can show it with a bar chart as follows:

df_seat %>% 
  arrange(year, ldp_ratio) %>%
  mutate(order_seq = c(1:47, rep(0, 47*2))) %>% 
  ggplot(aes(x = reorder(pref, order_seq), 
             y = ldp_ratio,  
             fill = year)) +
  geom_bar(stat = "identity") +
  facet_grid(~ year, scales = "free_x") +
  coord_flip() + 
  theme_bw(base_family = "HiraKakuProN-W3") +
  theme(legend.position = "none") +  # must come after theme_bw()
  labs(x = "Prefecture", 
       y = "Ratio of LDP Winners in SMDs (%)")

The ratio of LDP winners in SMDs by prefecture (2009-2014)

Reference

Kieran Healy, Data Visualization: A Practical Introduction, Princeton University Press, 2019.
Yuki Yanai, course materials, Kochi University of Technology.
Jaehyun Song and Yuki Yanai, 「私たちのR: ベストプラクティスの探究」.
Masahiko Asano and Yuki Yanai, 『Rによる計量政治学』, Ohmsha, 2018.
Masahiko Asano and Kosuke Nakamura, 『初めてのRStudio』, Ohmsha, 2018.
Winston Chang, R Graphics Cookbook, O’Reilly Media, 2012.
Kosuke Imai, Quantitative Social Science: An Introduction, Princeton University Press, 2017.

3. Data Visualization (Basic)

Masahiko Asano

2021-09-13
