✔ What we do here
・Introduce how to make a variable
・Introduce how to merge variables and make a dataframe
・Introduce how to merge multiple dataframes
・Introduce how to read data with various file name extensions
・Introduce how to clean the data you read into RStudio
・Explain technical terms we need in analyzing data

Technical terms explained here: text data, binary data, file extension, path, file, folder, R project, R project folder, working directory, missing value, class, data cleaning, converting data between wide and long format

  • Load the haven, readxl, and tidyverse packages
library(haven)
library(readxl)
library(tidyverse)
─ Attaching packages ──────────────────── tidyverse 1.3.1 ─
✓ ggplot2 3.3.3     ✓ purrr   0.3.4
✓ tibble  3.1.2     ✓ dplyr   1.0.6
✓ tidyr   1.1.3     ✓ stringr 1.4.0
✓ readr   1.4.0     ✓ forcats 0.5.1
─ Conflicts ───────────────────── tidyverse_conflicts() ─
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
  • tidyverse contains 8 useful packages
  • We need readr to read the data

1. Data frame

1.1 How to make variables

  • A variable is also called a vector
  • Make a variable containing 8 numbers (= values) and name it id
id <- c(1,2,3,4,5,6,7,8)
  • Make a variable containing 8 names and name it name
name <- c("Thies", "Cox", "McCubbins", "Schwartz", "DeNardo", "Bawn", "Patterson", "Geddes")
  • Make a variable containing test scores and name it score
score <- c(43, 74, 80, 37, 20, 83, 64, 35)
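As a quick check on the three vectors made above, the following sketch inspects them with base R functions. The calls to length(), class(), and mean() are added here for illustration only and are not part of the original steps.

```r
# The three vectors from this section
id <- c(1, 2, 3, 4, 5, 6, 7, 8)
name <- c("Thies", "Cox", "McCubbins", "Schwartz",
          "DeNardo", "Bawn", "Patterson", "Geddes")
score <- c(43, 74, 80, 37, 20, 83, 64, 35)

length(id)    # number of values in the vector
class(id)     # numbers are stored as "numeric"
class(name)   # text is stored as "character"
mean(score)   # average test score
```

All three vectors have the same length, which is what allows them to be combined into one data frame in the next step.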

1.2 Add variables to dataframe

  • Load tidyverse packages to use tibble() function
library(tidyverse)
  • Make a data frame with three variables (id, name, score) and name it df1
df1 <- tibble(id, name, score)
df1
# A tibble: 8 x 3
     id name      score
  <dbl> <chr>     <dbl>
1     1 Thies        43
2     2 Cox          74
3     3 McCubbins    80
4     4 Schwartz     37
5     5 DeNardo      20
6     6 Bawn         83
7     7 Patterson    64
8     8 Geddes       35

- You can also use data.frame() instead of using tibble()

df1 <- data.frame(id, name, score)
df1

tibble() shows the size of the data frame (such as 8 x 3) and the class of each variable (such as <dbl> and <chr>)
→ We recommend tibble()
○ tibble() is available once you load the tidyverse package
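To make the difference between the two functions concrete, here is a minimal sketch that builds the same small data both ways and compares the result. The object names df_tbl and df_base are hypothetical, and only three rows are used to keep the example short.

```r
library(tibble)

id <- c(1, 2, 3)
name <- c("Thies", "Cox", "McCubbins")
score <- c(43, 74, 80)

df_tbl <- tibble(id, name, score)        # tidyverse data frame
df_base <- data.frame(id, name, score)   # base R data frame

class(df_tbl)    # a tibble is still a data.frame, with extra classes
class(df_base)
dim(df_tbl)      # number of rows and columns
```

Because a tibble is also a data.frame, functions written for ordinary data frames generally accept it.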

1.3 Add a new variable to your data frame

  • We want to add another variable, department
  • Add $ after the data frame name (df1), then put the name of the new variable (department)
df1$department <- c("poli-sci", "econ", "poli-sci", "econ", "art", "music", "communication", "history")

df1
# A tibble: 8 x 4
     id name      score department   
  <dbl> <chr>     <dbl> <chr>        
1     1 Thies        43 poli-sci     
2     2 Cox          74 econ         
3     3 McCubbins    80 poli-sci     
4     4 Schwartz     37 econ         
5     5 DeNardo      20 art          
6     6 Bawn         83 music        
7     7 Patterson    64 communication
8     8 Geddes       35 history      
  • Add gender to df1
df1$gender <- c("male", "male", "male", "male", "male", "female", "male", "female")

df1
# A tibble: 8 x 5
     id name      score department    gender
  <dbl> <chr>     <dbl> <chr>         <chr> 
1     1 Thies        43 poli-sci      male  
2     2 Cox          74 econ          male  
3     3 McCubbins    80 poli-sci      male  
4     4 Schwartz     37 econ          male  
5     5 DeNardo      20 art           male  
6     6 Bawn         83 music         female
7     7 Patterson    64 communication male  
8     8 Geddes       35 history       female
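The $ approach above can also be written with dplyr::mutate(), which is a common alternative in the tidyverse. This is a sketch on a three-row example; the variable passed and its cutoff of 60 are hypothetical and are not part of the original data.

```r
library(dplyr)
library(tibble)

df <- tibble(id = c(1, 2, 3),
             name = c("Thies", "Cox", "McCubbins"),
             score = c(43, 74, 80))

# Same idea as df$department <- ..., written with mutate()
df <- df %>%
  mutate(department = c("poli-sci", "econ", "poli-sci"),
         passed = score >= 60)   # hypothetical variable: TRUE if score is 60 or more

df
```

mutate() can add several variables in one call and can compute a new variable from existing ones.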

1.4 Merging data frames

  • Make a data frame, df2
  • df2 includes the following two variables:

id
state

  • Make id
id <- c(1,2,3,4,5,6,7,8)
  • Make state, a variable recording the state each person comes from
state <- c("California", "Oregon", "NY", "Washington", "Florida", "Wisconsin", "Alabama", "South Carolina")
  • Make a data frame with the two variables and name it df2
df2 <- tibble(id, state)
df2
# A tibble: 8 x 2
     id state         
  <dbl> <chr>         
1     1 California    
2     2 Oregon        
3     3 NY            
4     4 Washington    
5     5 Florida       
6     6 Wisconsin     
7     7 Alabama       
8     8 South Carolina
  • Merge the two data frames (df1 and df2) by the variable they share (id) and name the new data frame M
M <- merge(df1, df2, by = "id")
M
  id      name score    department gender          state
1  1     Thies    43      poli-sci   male     California
2  2       Cox    74          econ   male         Oregon
3  3 McCubbins    80      poli-sci   male             NY
4  4  Schwartz    37          econ   male     Washington
5  5   DeNardo    20           art   male        Florida
6  6      Bawn    83         music female      Wisconsin
7  7 Patterson    64 communication   male        Alabama
8  8    Geddes    35       history female South Carolina
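In the example above every id appears in both data frames, so nothing is lost. The sketch below, using small made-up data frames named people and places, shows what happens when the ids do not match completely: merge() keeps only matching rows by default, while dplyr::left_join() (an alternative not used in the original) keeps every row of the first data frame.

```r
library(dplyr)
library(tibble)

people <- tibble(id = c(1, 2, 3),
                 name = c("Thies", "Cox", "McCubbins"))
places <- tibble(id = c(1, 2, 4),
                 state = c("California", "Oregon", "Washington"))

merged_inner <- merge(people, places, by = "id")      # keeps ids 1 and 2 only
merged_left <- left_join(people, places, by = "id")   # keeps ids 1, 2, 3

merged_inner
merged_left   # state is NA for id 3, which has no match in places
```

Checking the number of rows before and after a merge is a simple way to notice unmatched ids.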

1.5 Exercise

Question 1: Make the list of your family or friends (df1) containing the following variables:

① id: (1…..5)
② name
③ age
④ relationship

Question 2: Make the list of your family or friends (df2) containing the following variables:

① id: (1…..5)
② gender
③ height

Question 3: Merge the two data frames you made (df1 and df2) with the shared variable (id) and name it M1

2. Basics on data

2.1 File and Folder

  • A computer file is a computer resource for recording data in a computer storage device.
  • A folder (also called a directory) is a space used to store files.
  • For instance, take a look at the following:

  • You see 4 folders on the left side (backdoor, maps, R, RDD)
  • The folder R has 4 files
  • Files have various extensions (.html, .Rmd, .csv, .doc, .png, .jpg)
  • A folder does not have an extension
  • A folder is also called a directory. Examples: R Project folder, working directory

2.2 Path

  • A path is a string of characters used to uniquely identify a location in a directory structure.

  • It is composed by following the directory tree hierarchy in which components, separated by a delimiting character, represent each directory.

  • The delimiting character is most commonly the slash ("/").

  • getwd() = get working directory
    → You can see which directory you are working in

  • For example, let me type getwd() on my computer and hit the return key

getwd()
[1] "/Users/asanomasahiko/Dropbox/statistics/class_materials/R"

  • If you are a Mac user, then you will see something like this
  • If you are a Windows user, then the path will start with a drive letter such as C: instead of /Users
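The idea of a path made of components separated by "/" can be tried out with base R functions. This is a small sketch: file.path(), dirname(), and basename() only manipulate the text of a path, so the folders do not need to exist on your computer.

```r
# Build a path from its components; file.path() joins them with "/"
csv_path <- file.path("data", "hr96-17.csv")
csv_path

# Split a path into its directory part and its last component
example_path <- "/Users/asanomasahiko/Dropbox/statistics/class_materials/R"
dirname(example_path)    # everything before the last "/"
basename(example_path)   # the last component
```

Building paths with file.path() avoids typing the separator by hand.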

2.3 Working Directory

  • What does the R at the end of the path shown above mean?
  • R is the name of the R Project folder (= working directory) where you are currently working.
    → You are working in the R Project folder named R.
  • You could set an appropriate path yourself, but it is more efficient to make an R Project.

Reasons:
- If you are in your R Project folder, you don’t have to specify the full path whenever you need data.
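The following sketch illustrates why the working directory matters. It creates a hypothetical project folder (demo_project) in a temporary location, moves into it with setwd() to imitate opening an R Project, and then reads a file using only a short relative path. In a real R Project you would not need setwd(), because RStudio sets the working directory for you.

```r
# Hypothetical project folder, created in a temporary location for this demo
project <- file.path(tempdir(), "demo_project")
dir.create(file.path(project, "data"), recursive = TRUE, showWarnings = FALSE)
write.csv(data.frame(id = 1:3, score = c(43, 74, 80)),
          file.path(project, "data", "scores.csv"), row.names = FALSE)

old_wd <- getwd()   # remember the current working directory
setwd(project)      # imitate opening the R Project in this folder

scores <- read.csv("data/scores.csv")   # relative path, resolved from the working directory

setwd(old_wd)       # go back to the original working directory
scores
```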

2.4 What to avoid in file names

  • You should not use numbers at the beginning of your file name
  • Example) × "2021_grades" => ○ "grades_2021"
  • You should not put spaces in your file name
  • Example) × "2021 grades" => ○ "grades_2021"
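The two rules above can be checked mechanically. The helper below, is_safe_file_name(), is a hypothetical function written for this note (it is not part of R or any package); it returns TRUE only when a name neither starts with a digit nor contains a space.

```r
# Hypothetical helper: TRUE when a file name follows both rules
is_safe_file_name <- function(x) {
  starts_with_digit <- grepl("^[0-9]", x)      # rule 1: no leading number
  has_space <- grepl(" ", x, fixed = TRUE)     # rule 2: no spaces
  !(starts_with_digit | has_space)
}

is_safe_file_name(c("2021_grades", "2021 grades", "grades_2021"))
```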

3. How to use data built into R

  • R (and some R packages) comes with built-in data sets
  • If you type data(), then you can see the list of these built-in data sets (part of the list is shown here).
data()

  • For instance, let me show you the first 6 rows of the 7th data set, state.x77
head(state.x77)
           Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
California      21198   5114        1.1    71.71   10.3    62.6    20 156361
Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766
  • Let me show you the last 6 rows of state.x77
tail(state.x77)
              Population Income Illiteracy Life Exp Murder HS Grad Frost  Area
Vermont              472   3907        0.6    71.64    5.5    57.1   168  9267
Virginia            4981   4701        1.4    70.08    9.5    47.8    85 39780
Washington          3559   4864        0.6    71.72    4.3    63.5    32 66570
West Virginia       1799   3617        1.4    69.48    6.7    41.6   100 24070
Wisconsin           4589   4468        0.7    72.48    3.0    54.5   149 54464
Wyoming              376   4566        0.6    70.29    6.9    62.9   173 97203
  • You can also see the data on Titanic
head(Titanic)
, , Age = Child, Survived = No

      Sex
Class  Male Female
  1st     0      0
  2nd     0      0
  3rd    35     17
  Crew    0      0

, , Age = Adult, Survived = No

      Sex
Class  Male Female
  1st   118      4
  2nd   154     13
  3rd   387     89
  Crew  670      3

, , Age = Child, Survived = Yes

      Sex
Class  Male Female
  1st     5      1
  2nd    11     13
  3rd    13     14
  Crew    0      0

, , Age = Adult, Survived = Yes

      Sex
Class  Male Female
  1st    57    140
  2nd    14     80
  3rd    75     76
  Crew  192     20
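The built-in data sets shown above are not data frames: state.x77 is a matrix and Titanic is a four-way table. As a sketch of one way to bring them into the data frame form used elsewhere in these notes, base R's as.data.frame() can convert both. The object names states and titanic_df are hypothetical.

```r
is.matrix(state.x77)                  # state.x77 is a matrix, not a data frame

states <- as.data.frame(state.x77)
states$state <- rownames(state.x77)   # keep the state names as a regular variable
head(states[, c("state", "Population", "Income")], 3)

titanic_df <- as.data.frame(Titanic)  # one row per combination, counts in Freq
head(titanic_df)
```

After conversion, each combination of Class, Sex, Age, and Survived is one row, with the number of people in the Freq variable.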

4. How to use data not built into R

Read the data according to its format

  • Data on a PC comes in two types: text data and binary data
  • The data format is identified by the file extension

Text data

  • Data that humans can read and understand
    .txt file: you use this when you do text analysis
    .html file: you use this when you do web scraping
    .csv file: comma-separated values
    → We recommend reading csv files in RStudio

Binary data

  • Data that humans cannot read and understand, but a computer can
    .xls file
    .xlsx file: newer than the .xls file
    .dta file: the format used by Stata
    .rds file: you can use this only in R

  • We recommend LibreOffice rather than MS Office Excel
    → Free software
    → You can specify the character encoding
    → You can avoid unnecessary errors

4.1 How to read .csv file

  • Download the Japanese Lower House Election Data hr96-17.csv and read it in RStudio
  • Make a new folder named data within your R Project folder
  • Put hr96-17.csv into data
  • You need the tidyverse package to read a csv file
    → Load tidyverse
library(tidyverse)
  • Read the csv file and name it hr
hr <- read_csv("data/hr96-17.csv", 
               na = ".")  # treat "." as a missing value (NA)
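To see what the na argument does without downloading anything, the sketch below writes a tiny made-up csv file to a temporary location and reads it back. The file contents and the names tmp and d are hypothetical; col_types = cols() is used only to keep read_csv() from printing the column specification.

```r
library(readr)

# A tiny made-up csv file in which "." marks a missing value
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,votes",
             "A,100",
             "B,.",
             "C,250"), tmp)

d <- read_csv(tmp, na = ".", col_types = cols())
d               # the "." in row 2 is shown as NA
class(d$votes)  # votes is read as a number because "." is treated as missing
```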

If you fail to read the csv file

  • When a csv file contains values in Japanese, you are likely to see an error message

If you use Excel

  • Suppose the csv file looks as follows when you open it in Excel

  • You need to save it in the CSV UTF-8 (.csv) format

If you use LibreOffice

  • Select Unicode (UTF-8) and save it

  • If you still have problems, then try this command
hr <- read_csv("data/hr96-17.csv", 
               na = ".",
               locale = locale(encoding = "cp932"))
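The effect of the encoding argument can be reproduced on a small scale. This sketch assumes that your system's iconv supports the CP932 encoding (this is usual, but not guaranteed on every platform). It writes a two-line file in CP932 containing the word for Tokyo, given as Unicode escapes so the script itself stays plain ASCII, and reads it back with the same locale setting as above. All names and values are made up.

```r
library(readr)

# Two lines of text; "\u6771\u4eac" is the word Tokyo written in kanji
utf8_lines <- c("city,seats", "\u6771\u4eac,25")

# Convert each line to CP932 bytes and write them to a temporary file
cp932_bytes <- iconv(utf8_lines, from = "UTF-8", to = "CP932", toRaw = TRUE)
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
for (line_bytes in cp932_bytes) {
  writeBin(c(line_bytes, as.raw(0x0a)), con)   # 0x0a is the newline character
}
close(con)

# Tell read_csv() that the file is encoded in CP932
d <- read_csv(tmp, locale = locale(encoding = "CP932"), col_types = cols())
d
```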

4.2 How to read .xls[x] file

  • Load the readxl package to read an .xls[x] file
library(readxl)
fh <- read_excel("data/FH_Country.xls")
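An Excel file can hold several sheets, so it helps to look at the sheet names before reading. The sketch below uses datasets.xlsx, an example workbook that ships with the readxl package (not the Freedom House file), so it runs without any download.

```r
library(readxl)

# Example workbook that ships with the readxl package
path <- readxl_example("datasets.xlsx")

excel_sheets(path)                          # names of the sheets (tabs) in the workbook
cars <- read_excel(path, sheet = "mtcars")  # read one sheet by its name
dim(cars)
```

The sheet argument accepts either a sheet name, as here, or a sheet number.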

4.3 How to read .dta file

  • A .dta file is binary data
  • Load the haven package to read a .dta file
library(haven)
  • We download and read the replication data offered by Bruce Russett and John R. Oneal (2001) “Triangulating Peace” TRIANGLE.DTA
triangle <- read_dta("data/TRIANGLE.DTA")
head(triangle)
# A tibble: 6 x 19
  statea stateb  year dependa dependb demauta demautb allies dispute1 logdstab
   <dbl>  <dbl> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>    <dbl>    <dbl>
1      2     20  1920  0.0157   0.280      10       9      0        0     5.82
2      2     20  1921  0.0115   0.224      10      10      0        0     5.82
3      2     20  1922  0.0113   0.201      10      10      0        0     5.82
4      2     20  1923  0.0112   0.213      10      10      0        0     5.82
5      2     20  1924  0.0110   0.213      10      10      0        0     5.82
6      2     20  1925  0.0108   0.191      10      10      0        0     5.82
# … with 9 more variables: lcaprat2 <dbl>, smigoabi <dbl>, opena <dbl>,
#   openb <dbl>, minrpwrs <dbl>, noncontg <dbl>, smldmat <dbl>, smldep <dbl>,
#   dyadid <dbl>
  • Show the list of variables triangle contains
names(triangle)
 [1] "statea"   "stateb"   "year"     "dependa"  "dependb"  "demauta" 
 [7] "demautb"  "allies"   "dispute1" "logdstab" "lcaprat2" "smigoabi"
[13] "opena"    "openb"    "minrpwrs" "noncontg" "smldmat"  "smldep"  
[19] "dyadid"  
  • We can save TRIANGLE.DTA in csv format, which is more widely used, with the following command
write_excel_csv(triangle, "data/triangle.csv")
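The same read-then-save workflow can be tried without the replication data. In this sketch a tiny made-up data frame (three of the variable names are borrowed from triangle, the values are invented) is written to a temporary .dta file with haven::write_dta(), read back with read_dta(), and then saved as csv.

```r
library(haven)
library(readr)

# Made-up data using three variable names from the triangle data
demo <- data.frame(statea = c(2, 2),
                   stateb = c(20, 20),
                   year = c(1920, 1921))

dta_file <- tempfile(fileext = ".dta")
write_dta(demo, dta_file)          # save in Stata format
from_dta <- read_dta(dta_file)     # read the Stata file back into R

csv_file <- tempfile(fileext = ".csv")
write_excel_csv(from_dta, csv_file)               # save as csv
from_csv <- read_csv(csv_file, col_types = cols())
from_csv
```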

5. Data Cleaning of GDP data

5.1 Read the GDP data

  • We download wb_gdp_pc.csv
  • Make a data folder and put wb_gdp_pc.csv into the folder
wb_gdp <- read_csv("data/wb_gdp_pc.csv")
Warning: Missing column names filled in: 'X3' [3]

─ Column specification ────────────────────────────
cols(
  `Data Source` = col_character(),
  `World Development Indicators` = col_character(),
  X3 = col_character()
)
Warning: 265 parsing failures.
row col  expected     actual                 file
  2  -- 3 columns 64 columns 'data/wb_gdp_pc.csv'
  3  -- 3 columns 64 columns 'data/wb_gdp_pc.csv'
  4  -- 3 columns 64 columns 'data/wb_gdp_pc.csv'
  5  -- 3 columns 64 columns 'data/wb_gdp_pc.csv'
  6  -- 3 columns 64 columns 'data/wb_gdp_pc.csv'
... ... ......... .......... ....................
See problems(...) for more details.

Pay attention to the Warning: Missing column names filled in: 'X3' [3]

  • A Warning is not as serious as an Error, but we need to be cautious about it
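The difference between a Warning and an Error can be seen in a small base R sketch that is unrelated to the GDP file. A warning lets the code finish and return a result, possibly a wrong or incomplete one, while an error stops the code. suppressWarnings() and tryCatch() are used here only so that the example runs quietly from start to finish.

```r
# A warning: as.numeric("abc") warns about NAs introduced by coercion,
# but the code still runs and returns a result (NA)
x <- suppressWarnings(as.numeric("abc"))
x

# An error: the code stops, so there is no result unless we catch the error
result <- tryCatch(
  stop("this is an error"),
  error = function(e) "stopped"
)
result
```

This is why a warning deserves attention: the code finished, but the result (here NA) may not be what you wanted.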

How to deal with the Warning

  • Show the first 6 rows of the data
head(wb_gdp)
# A tibble: 6 x 3
  `Data Source`     `World Development Indicators` X3                          
  <chr>             <chr>                          <chr>                       
1 Last Updated Date 2019-03-21                     <NA>                        
2 Country Name      Country Code                   Indicator Name              
3 Aruba             ABW                            GDP per capita (current US$)
4 Afghanistan       AFG                            GDP per capita (current US$)
5 Angola            AGO                            GDP per capita (current US$)
6 Albania           ALB                            GDP per capita (current US$)
  • We do not need the 1st row (Last Updated Date 2019-03-21…)
  • Open the file (wb_gdp_pc.csv) in the data folder directly with LibreOffice (or Excel) and check the data

The first 4 lines are highlighted in yellow so that you can easily recognize them

  • read_csv() automatically treats the first row of the csv file as variable names
  • We want it to treat the 5th row as variable names
    → We skip the first 4 lines (lines 1 to 4)
wb_gdp <- read_csv("data/wb_gdp_pc.csv", skip = 4)
Warning: Missing column names filled in: 'X64' [64]

─ Column specification ────────────────────────────
cols(
  .default = col_double(),
  `Country Name` = col_character(),
  `Country Code` = col_character(),
  `Indicator Name` = col_character(),
  `Indicator Code` = col_character(),
  `2018` = col_logical(),
  X64 = col_logical()
)
ℹ Use `spec()` for the full column specifications.
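The skip argument can be tried on a small made-up file that imitates the layout described above: four lines of notes, then the variable names on the 5th line. The country names and numbers are invented for illustration.

```r
library(readr)

tmp <- tempfile(fileext = ".csv")
writeLines(c("Data Source,Example file",
             "Last Updated Date,2019-03-21",
             "Note,made-up numbers",
             "Unit,US$",
             "Country Name,1960,1961",    # the 5th line holds the variable names
             "Aland,100,110",
             "Borduria,200,220"), tmp)

gdp_demo <- read_csv(tmp, skip = 4, col_types = cols())
names(gdp_demo)
gdp_demo
```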
  • str() enables us to check the variable class
str(wb_gdp)
spec_tbl_df [264 × 64] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Country Name  : chr [1:264] "Aruba" "Afghanistan" "Angola" "Albania" ...
 $ Country Code  : chr [1:264] "ABW" "AFG" "AGO" "ALB" ...
 $ Indicator Name: chr [1:264] "GDP per capita (current US$)" "GDP per capita (current US$)" "GDP per capita (current US$)" "GDP per capita (current US$)" ...
 $ Indicator Code: chr [1:264] "NY.GDP.PCAP.CD" "NY.GDP.PCAP.CD" "NY.GDP.PCAP.CD" "NY.GDP.PCAP.CD" ...
 $ 1960          : num [1:264] NA 59.8 NA NA NA ...
 $ 1961          : num [1:264] NA 59.9 NA NA NA ...
 $ 1962          : num [1:264] NA 58.5 NA NA NA ...
 $ 1963          : num [1:264] NA 78.8 NA NA NA ...
 $ 1964          : num [1:264] NA 82.2 NA NA NA ...
 $ 1965          : num [1:264] NA 101 NA NA NA ...
 $ 1966          : num [1:264] NA 138 NA NA NA ...
 $ 1967          : num [1:264] NA 161 NA NA NA ...
 $ 1968          : num [1:264] NA 130 NA NA NA ...
 $ 1969          : num [1:264] NA 130 NA NA NA ...
 $ 1970          : num [1:264] NA 157 NA NA 3239 ...
 $ 1971          : num [1:264] NA 160 NA NA 3498 ...
 $ 1972          : num [1:264] NA 136 NA NA 4217 ...
 $ 1973          : num [1:264] NA 144 NA NA 5342 ...
 $ 1974          : num [1:264] NA 175 NA NA 6320 ...
 $ 1975          : num [1:264] NA 188 NA NA 7169 ...
 $ 1976          : num [1:264] NA 199 NA NA 7152 ...
 $ 1977          : num [1:264] NA 226 NA NA 7751 ...
 $ 1978          : num [1:264] NA 249 NA NA 9130 ...
 $ 1979          : num [1:264] NA 278 NA NA 11821 ...
 $ 1980          : num [1:264] NA 275 664 NA 12377 ...
 $ 1981          : num [1:264] NA 266 600 NA 10372 ...
 $ 1982          : num [1:264] NA NA 579 NA 9610 ...
 $ 1983          : num [1:264] NA NA 582 NA 8023 ...
 $ 1984          : num [1:264] NA NA 597 639 7729 ...
 $ 1985          : num [1:264] NA NA 712 640 7774 ...
 $ 1986          : num [1:264] 6473 NA 648 694 10362 ...
 $ 1987          : num [1:264] 7886 NA 721 675 12616 ...
 $ 1988          : num [1:264] 9765 NA 762 653 14304 ...
 $ 1989          : num [1:264] 11392 NA 863 698 15166 ...
 $ 1990          : num [1:264] 12307 NA 923 617 18879 ...
 $ 1991          : num [1:264] 13496 NA 845 337 19533 ...
 $ 1992          : num [1:264] 14047 NA 641 201 20548 ...
 $ 1993          : num [1:264] 14937 NA 430 367 16516 ...
 $ 1994          : num [1:264] 16241 NA 321 586 16235 ...
 $ 1995          : num [1:264] 16439 NA 388 751 18461 ...
 $ 1996          : num [1:264] 16586 NA 513 1010 19017 ...
 $ 1997          : num [1:264] 17928 NA 507 717 18353 ...
 $ 1998          : num [1:264] 19078 NA 420 814 18895 ...
 $ 1999          : num [1:264] 19356 NA 386 1033 19262 ...
 $ 2000          : num [1:264] 20621 NA 555 1127 21937 ...
 $ 2001          : num [1:264] 20669 NA 526 1282 22229 ...
 $ 2002          : num [1:264] 20437 184 870 1425 24741 ...
 $ 2003          : num [1:264] 20834 196 979 1846 32776 ...
 $ 2004          : num [1:264] 22570 217 1248 2374 38503 ...
 $ 2005          : num [1:264] 23300 248 1891 2674 41282 ...
 $ 2006          : num [1:264] 24046 269 2585 2973 43749 ...
 $ 2007          : num [1:264] 25836 366 3108 3595 48583 ...
 $ 2008          : num [1:264] 27086 370 4069 4371 47786 ...
 $ 2009          : num [1:264] 24631 444 3118 4114 43339 ...
 $ 2010          : num [1:264] 23513 551 3586 4094 39736 ...
 $ 2011          : num [1:264] 24984 599 4616 4437 41099 ...
 $ 2012          : num [1:264] 24710 649 5102 4248 38391 ...
 $ 2013          : num [1:264] 25018 648 5258 4413 40620 ...
 $ 2014          : num [1:264] 25528 625 5413 4579 42295 ...
 $ 2015          : num [1:264] 25796 590 4171 3953 36038 ...
 $ 2016          : num [1:264] 25252 550 3510 4132 37232 ...
 $ 2017          : num [1:264] 25655 550 4100 4538 39147 ...
 $ 2018          : logi [1:264] NA NA NA NA NA NA ...
 $ X64           : logi [1:264] NA NA NA NA NA NA ...
 - attr(*, "spec")=
  .. cols(
  ..   `Country Name` = col_character(),
  ..   `Country Code` = col_character(),
  ..   `Indicator Name` = col_character(),
  ..   `Indicator Code` = col_character(),
  ..   `1960` = col_double(),
  ..   `1961` = col_double(),
  ..   `1962` = col_double(),
  ..   `1963` = col_double(),
  ..   `1964` = col_double(),
  ..   `1965` = col_double(),
  ..   `1966` = col_double(),
  ..   `1967` = col_double(),
  ..   `1968` = col_double(),
  ..   `1969` = col_double(),
  ..   `1970` = col_double(),
  ..   `1971` = col_double(),
  ..   `1972` = col_double(),
  ..   `1973` = col_double(),
  ..   `1974` = col_double(),
  ..   `1975` = col_double(),
  ..   `1976` = col_double(),
  ..   `1977` = col_double(),
  ..   `1978` = col_double(),
  ..   `1979` = col_double(),
  ..   `1980` = col_double(),
  ..   `1981` = col_double(),
  ..   `1982` = col_double(),
  ..   `1983` = col_double(),
  ..   `1984` = col_double(),
  ..   `1985` = col_double(),
  ..   `1986` = col_double(),
  ..   `1987` = col_double(),
  ..   `1988` = col_double(),
  ..   `1989` = col_double(),
  ..   `1990` = col_double(),
  ..   `1991` = col_double(),
  ..   `1992` = col_double(),
  ..   `1993` = col_double(),
  ..   `1994` = col_double(),
  ..   `1995` = col_double(),
  ..   `1996` = col_double(),
  ..   `1997` = col_double(),
  ..   `1998` = col_double(),
  ..   `1999` = col_double(),
  ..   `2000` = col_double(),
  ..   `2001` = col_double(),
  ..   `2002` = col_double(),
  ..   `2003` = col_double(),
  ..   `2004` = col_double(),
  ..   `2005` = col_double(),
  ..   `2006` = col_double(),
  ..   `2007` = col_double(),
  ..   `2008` = col_double(),
  ..   `2009` = col_double(),
  ..   `2010` = col_double(),
  ..   `2011` = col_double(),
  ..   `2012` = col_double(),
  ..   `2013` = col_double(),
  ..   `2014` = col_double(),
  ..   `2015` = col_double(),
  ..   `2016` = col_double(),
  ..   `2017` = col_double(),
  ..   `2018` = col_logical(),
  ..   X64 = col_logical()
  .. )
  • R recognizes Country Name and Country Code as character (chr)
  • R recognizes 1960 as numeric (num), that is, double <dbl>
  • NA means a missing value
  • Check the variable names
names(wb_gdp)
 [1] "Country Name"   "Country Code"   "Indicator Name" "Indicator Code"
 [5] "1960"           "1961"           "1962"           "1963"          
 [9] "1964"           "1965"           "1966"           "1967"          
[13] "1968"           "1969"           "1970"           "1971"          
[17] "1972"           "1973"           "1974"           "1975"          
[21] "1976"           "1977"           "1978"           "1979"          
[25] "1980"           "1981"           "1982"           "1983"          
[29] "1984"           "1985"           "1986"           "1987"          
[33] "1988"           "1989"           "1990"           "1991"          
[37] "1992"           "1993"           "1994"           "1995"          
[41] "1996"           "1997"           "1998"           "1999"          
[45] "2000"           "2001"           "2002"           "2003"          
[49] "2004"           "2005"           "2006"           "2007"          
[53] "2008"           "2009"           "2010"           "2011"          
[57] "2012"           "2013"           "2014"           "2015"          
[61] "2016"           "2017"           "2018"           "X64"           
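Since the GDP data contain many NA values, it is worth seeing how missing values behave in calculations. The sketch below uses a short made-up vector, gdp_1960, rather than the real data.

```r
gdp_1960 <- c(NA, 59.8, NA, NA, 3239)   # made-up vector with missing values

is.na(gdp_1960)                # TRUE where the value is missing
sum(is.na(gdp_1960))           # number of missing values
mean(gdp_1960)                 # NA, because missing values propagate
mean(gdp_1960, na.rm = TRUE)   # average of the non-missing values only
class(gdp_1960)                # still numeric, even though it contains NA
```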

Select the variables we use
- We need to solve, one by one, the problems that prevent us from conducting quantitative analysis
- Since we do not know what X64 is, check it

wb_gdp$X64
  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
 [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
 [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
 [76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[101] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[126] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[151] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[176] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[201] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[226] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[251] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
  • We need to delete it because it contains nothing but missing values (NA)

→ The following are the variables we need

gdp <- wb_gdp %>% 
  select("Country Name", 
         "1960":"2018")  
names(gdp)
 [1] "Country Name" "1960"         "1961"         "1962"         "1963"        
 [6] "1964"         "1965"         "1966"         "1967"         "1968"        
[11] "1969"         "1970"         "1971"         "1972"         "1973"        
[16] "1974"         "1975"         "1976"         "1977"         "1978"        
[21] "1979"         "1980"         "1981"         "1982"         "1983"        
[26] "1984"         "1985"         "1986"         "1987"         "1988"        
[31] "1989"         "1990"         "1991"         "1992"         "1993"        
[36] "1994"         "1995"         "1996"         "1997"         "1998"        
[41] "1999"         "2000"         "2001"         "2002"         "2003"        
[46] "2004"         "2005"         "2006"         "2007"         "2008"        
[51] "2009"         "2010"         "2011"         "2012"         "2013"        
[56] "2014"         "2015"         "2016"         "2017"         "2018"        
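The same selection logic can be checked on a small made-up tibble. Here wb_demo imitates the structure of wb_gdp with two year columns and one empty column (X5); all names and values are invented. The sketch first confirms that X5 is entirely missing, then keeps only the needed variables and renames the country column.

```r
library(dplyr)
library(tibble)

wb_demo <- tibble(`Country Name` = c("Aland", "Borduria"),
                  `Country Code` = c("ALD", "BOR"),
                  `1960` = c(100, 200),
                  `1961` = c(110, 220),
                  X5 = c(NA, NA))

all(is.na(wb_demo$X5))   # TRUE: the column holds no information

gdp_demo <- wb_demo %>%
  select("Country Name", "1960":"1961") %>%   # keep the country and the year columns
  rename(country = "Country Name")            # new name = old name
names(gdp_demo)
```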

Fix the name of variables

  • Country Name => country
gdp <- gdp %>% 
  rename(country = "Country Name")
names(gdp)
 [1] "country" "1960"    "1961"    "1962"    "1963"    "1964"    "1965"   
 [8] "1966"    "1967"    "1968"    "1969"    "1970"    "1971"    "1972"   
[15] "1973"    "1974"    "1975"    "1976"    "1977"    "1978"    "1979"   
[22] "1980"    "1981"    "1982"    "1983"    "1984"    "1985"    "1986"   
[29] "1987"    "1988"    "1989"    "1990"    "1991"    "1992"    "1993"   
[36] "1994"    "1995"    "1996"    "1997"    "1998"    "1999"    "2000"   
[43] "2001"    "2002"    "2003"    "2004"    "2005"    "2006"    "2007"   
[50] "2008"    "2009"    "2010"    "2011"    "2012"    "2013"    "2014"   
[57] "2015"    "2016"    "2017"    "2018"   
  • Check the sample size and the number of variables of gdp
dim(gdp)
[1] 264  60
  • The number of observations (N) in gdp is 264

  • The number of variables is 60

  • Using the DT::datatable() function, we can see what the entire data set looks like

DT::datatable(gdp)

5.2 Converting data (Wide → Long format): gdp

  • We need to change the data from wide format to long format

Converting data from wide format to long format

  • Using the tidyr::pivot_longer() function, we convert the data from wide to long format
    → Name the data frame gdp_long
gdp_long <- gdp %>% 
  tidyr::pivot_longer("1960":"2018", # Range of variables you want to convert
                      names_to = "year", # Put the wide-format variable names (1960-2018) into year
                      values_to = "GDP") %>% # Put the wide-format values into GDP
  drop_na()                   # Drop rows with missing values
  • Check gdp_long
DT::datatable(gdp_long)
  • Check the class of the variables in gdp_long
str(gdp_long)
tibble [11,824 × 3] (S3: tbl_df/tbl/data.frame)
 $ country: chr [1:11824] "Aruba" "Aruba" "Aruba" "Aruba" ...
 $ year   : chr [1:11824] "1986" "1987" "1988" "1989" ...
 $ GDP    : num [1:11824] 6473 7886 9765 11392 12307 ...
  • Convert the class of year from character to numeric
gdp_long$year <- as.numeric(gdp_long$year)
str(gdp_long)
tibble [11,824 × 3] (S3: tbl_df/tbl/data.frame)
 $ country: chr [1:11824] "Aruba" "Aruba" "Aruba" "Aruba" ...
 $ year   : num [1:11824] 1986 1987 1988 1989 1990 ...
 $ GDP    : num [1:11824] 6473 7886 9765 11392 12307 ...
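The whole conversion can be followed on a data set small enough to read at a glance. This sketch uses a made-up wide data frame with two countries and two years, then applies the same three steps as above: pivot_longer(), drop_na(), and as.numeric() on year.

```r
library(tidyr)

# Made-up wide data: one row per country, one column per year
wide <- data.frame(country = c("A", "B"),
                   `1960` = c(NA, 59.8),
                   `1961` = c(100, 59.9),
                   check.names = FALSE)   # keep "1960" and "1961" as column names

long <- pivot_longer(wide,
                     cols = c("1960", "1961"),
                     names_to = "year",    # column names go into year
                     values_to = "GDP")    # cell values go into GDP
long <- drop_na(long)                      # country A has no value for 1960

long$year <- as.numeric(long$year)         # year arrives as character
long
```

The two-row wide data become three rows in long format, one per country and year with a non-missing value.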

5.3 Data Visualization: GDP

  • A long-format data set enables us to conduct a variety of analyses
  • For instance, let us visualize the change in GDP per capita (1980-2017) for Japan and China
  • Using the filter() function, extract the data needed and name it jpn.chi
jpn.chi <- gdp_long %>% 
  filter(country == "Japan" | country == "China")
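When more countries are involved, chaining conditions with | becomes long. A common alternative, shown here as a sketch on made-up data, is the %in% operator, which keeps the rows whose country appears in a given vector. For two countries it gives the same result as the code above.

```r
library(dplyr)
library(tibble)

gdp_demo <- tibble(country = c("Japan", "China", "France", "Japan"),
                   year = c(1980, 1980, 1980, 1981),
                   GDP = c(1, 2, 3, 4))   # made-up values

with_or <- gdp_demo %>%
  filter(country == "Japan" | country == "China")

with_in <- gdp_demo %>%
  filter(country %in% c("Japan", "China"))

all.equal(with_or, with_in)   # both approaches keep the same rows
```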

・Add the following command to avoid garbled text when you use Japanese in figures drawn with the ggplot() function

theme_set(theme_classic(base_size = 10,
                        base_family = "HiraginoSans-W3"))
jpn.chi %>% 
  ggplot(aes(x = year, y = GDP,
             color = country, 
             linetype = country, 
             shape = country)) +
  geom_point() +
  geom_line() +
  ggtitle("Transition of GDP Per Capita (1980-2017) between Japan and China") +
  labs(x = "Year", y = "GDP per capita (US$)") +
  theme(legend.position = c(0.1, 0.8)) +
  xlim(1980, 2017) # Show 1980-2017 only
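One detail about xlim() is worth knowing: it removes the observations outside the given range before drawing, which is why ggplot2 may print a warning about removed rows. If you only want to zoom in without discarding data, coord_cartesian() is an alternative. The sketch below uses made-up data and is not the figure from the original notes.

```r
library(ggplot2)

demo <- data.frame(year = 1975:1985,
                   GDP = seq(100, 200, by = 10))   # made-up values

p <- ggplot(demo, aes(x = year, y = GDP)) +
  geom_point() +
  geom_line()

p_xlim <- p + xlim(1980, 1985)                       # rows before 1980 are removed
p_zoom <- p + coord_cartesian(xlim = c(1980, 1985))  # only zooms; all rows are kept

p_xlim
p_zoom
```

For a simple line chart the two figures look almost the same; the difference matters when ggplot2 computes something from the data, such as a fitted line.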

6. Data Cleaning of Freedom House

6.1 Read the Freedom House data

  • Freedom House data (1972-2016)
  • Data on democracy by country
  • PR: political rights
  • CL: civil liberties
  • Status: F (free), PF (partly free), NF (not free)

Variables   Variable Class   Details
PR          numeric          political rights (Best = 1, Worst = 7)
CL          numeric          civil liberties (Best = 1, Worst = 7)
status      categorical      F: free, PF: partly free, NF: not free
year        categorical      1972-2016
  • PR and CL are measured on a one-to-seven scale, with one representing the highest degree of Freedom and seven the lowest.

  • Load the readxl package to read an Excel file

library(readxl)
  • Download the Freedom House data and read it

  • Prior to reading the data, open the original data file (FH_Country.xls) in either LibreOffice or Excel

  • You can see three tabs at the bottom of the screen
  • We want to read the second tab (Country Ratings, Statuses)
  • Set sheet = 2

Specify the sheet number and the rows to skip

  • Check the 2nd tab

  • We do not need the 1st row and the 2nd row (yellow parts)
    → Set skip = 2
fh <- read_excel("data/FH_Country.xls", 
                 sheet = 2,
                 skip = 2)
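The combination of sheet and skip can be practiced on deaths.xlsx, an example workbook that ships with the readxl package and, like the Freedom House file, has note rows above the real header. This is a sketch under the assumption, based on the readxl documentation, that the header of that file sits on the 5th row; it is not the Freedom House data.

```r
library(readxl)

path <- readxl_example("deaths.xlsx")   # example file shipped with readxl
excel_sheets(path)                      # check which sheets exist

raw <- read_excel(path, sheet = 1)                # note rows are mistaken for the header
deaths <- read_excel(path, sheet = 1, skip = 4)   # skip the note rows above the header

names(raw)      # not meaningful variable names
names(deaths)   # the real variable names
```

Comparing names() with and without skip is a quick way to confirm that the right row is being used as the header.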

Check the class of each variable

str(fh)
tibble [205 × 133] (S3: tbl_df/tbl/data.frame)
 $ ...1        : chr [1:205] "Afghanistan" "Albania" "Algeria" "Andorra" ...
 $ PR...2      : chr [1:205] "4" "7" "6" "4" ...
 $ CL...3      : chr [1:205] "5" "7" "6" "3" ...
 $ Status...4  : chr [1:205] "PF" "NF" "NF" "PF" ...
 $ PR...5      : chr [1:205] "7" "7" "6" "4" ...
 $ CL...6      : chr [1:205] "6" "7" "6" "4" ...
 $ Status...7  : chr [1:205] "NF" "NF" "NF" "PF" ...
 $ PR...8      : chr [1:205] "7" "7" "6" "4" ...
 $ CL...9      : chr [1:205] "6" "7" "6" "4" ...
 $ Status...10 : chr [1:205] "NF" "NF" "NF" "PF" ...
 $ PR...11     : chr [1:205] "7" "7" "7" "4" ...
 $ CL...12     : chr [1:205] "6" "7" "6" "4" ...
 $ Status...13 : chr [1:205] "NF" "NF" "NF" "PF" ...
 $ PR...14     : chr [1:205] "7" "7" "6" "4" ...
 $ CL...15     : chr [1:205] "6" "7" "6" "4" ...
 $ Status...16 : chr [1:205] "NF" "NF" "NF" "PF" ...
 $ PR...17     : chr [1:205] "6" "7" "6" "-" ...
 $ CL...18     : chr [1:205] "6" "7" "6" "-" ...
 $ Status...19 : chr [1:205] "NF" "NF" "NF" "-" ...
 $ PR...20     : chr [1:205] "7" "7" "6" "-" ...
 $ CL...21     : chr [1:205] "7" "7" "6" "-" ...
 $ Status...22 : chr [1:205] "NF" "NF" "NF" "-" ...
 $ PR...23     : chr [1:205] "7" "7" "6" "-" ...
 $ CL...24     : chr [1:205] "7" "7" "6" "-" ...
 $ Status...25 : chr [1:205] "NF" "NF" "NF" "-" ...
 $ PR...26     : chr [1:205] "7" "7" "6" "-" ...
 $ CL...27     : chr [1:205] "7" "7" "6" "-" ...
 $ Status...28 : chr [1:205] "NF" "NF" "NF" "-" ...
 $ PR...29     : chr [1:205] "7" "7" "6" "-" ...
 $ CL...30     : chr [1:205] "7" "7" "6" "-" ...
 $ Status...31 : chr [1:205] "NF" "NF" "NF" "-" ...
 $ PR...32     : chr [1:205] "7" "7" "6" "-" ...
 $ CL...33     : chr [1:205] "7" "7" "6" "-" ...
 $ Status...34 : chr [1:205] "NF" "NF" "NF" "-" ...
 $ PR...35     : chr [1:205] "7" "7" "6" "-" ...
 $ CL...36     : chr [1:205] "7" "7" "6" "-" ...
 $ Status...37 : chr [1:205] "NF" "NF" "NF" "-" ...
 $ PR...38     : chr [1:205] "7" "7" "6" "-" ...
 $ CL...39     : chr [1:205] "7" "7" "6" "-" ...
 $ Status...40 : chr [1:205] "NF" "NF" "NF" "-" ...
 $ PR...41     : chr [1:205] "7" "7" "6" "-" ...
 $ CL...42     : chr [1:205] "7" "7" "6" "-" ...
 $ Status...43 : chr [1:205] "NF" "NF" "NF" "-" ...
 $ PR...44     : chr [1:205] "7" "7" "6" "-" ...
 $ CL...45     : chr [1:205] "7" "7" "6" "-" ...
 $ Status...46 : chr [1:205] "NF" "NF" "NF" "-" ...
 $ PR...47     : chr [1:205] "6" "7" "5" "-" ...
 $ CL...48     : chr [1:205] "6" "7" "6" "-" ...
 $ Status...49 : chr [1:205] "NF" "NF" "NF" "-" ...
 $ PR...50     : chr [1:205] "7" "7" "6" "-" ...
 $ CL...51     : chr [1:205] "7" "7" "4" "-" ...
 $ Status...52 : chr [1:205] "NF" "NF" "PF" "-" ...
 $ PR...53     : chr [1:205] "7" "7" "4" "-" ...
 $ CL...54     : chr [1:205] "7" "6" "4" "-" ...
 $ Status...55 : chr [1:205] "NF" "NF" "PF" "-" ...
 $ PR...56     : chr [1:205] "7" "4" "4" "-" ...
 $ CL...57     : chr [1:205] "7" "4" "4" "-" ...
 $ Status...58 : chr [1:205] "NF" "PF" "PF" "-" ...
 $ PR...59     : chr [1:205] "6" "4" "7" "-" ...
 $ CL...60     : chr [1:205] "6" "3" "6" "-" ...
 $ Status...61 : chr [1:205] "NF" "PF" "NF" "-" ...
 $ PR...62     : chr [1:205] "7" "2" "7" "2" ...
 $ CL...63     : chr [1:205] "7" "4" "6" "1" ...
 $ Status...64 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...65     : chr [1:205] "7" "3" "7" "1" ...
 $ CL...66     : chr [1:205] "7" "4" "7" "1" ...
 $ Status...67 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...68     : chr [1:205] "7" "3" "6" "1" ...
 $ CL...69     : chr [1:205] "7" "4" "6" "1" ...
 $ Status...70 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...71     : chr [1:205] "7" "4" "6" "1" ...
 $ CL...72     : chr [1:205] "7" "4" "6" "1" ...
 $ Status...73 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...74     : chr [1:205] "7" "4" "6" "1" ...
 $ CL...75     : chr [1:205] "7" "4" "6" "1" ...
 $ Status...76 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...77     : chr [1:205] "7" "4" "6" "1" ...
 $ CL...78     : chr [1:205] "7" "5" "5" "1" ...
 $ Status...79 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...80     : chr [1:205] "7" "4" "6" "1" ...
 $ CL...81     : chr [1:205] "7" "5" "5" "1" ...
 $ Status...82 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...83     : chr [1:205] "7" "4" "6" "1" ...
 $ CL...84     : chr [1:205] "7" "5" "5" "1" ...
 $ Status...85 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...86     : chr [1:205] "7" "3" "6" "1" ...
 $ CL...87     : chr [1:205] "7" "4" "5" "1" ...
 $ Status...88 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...89     : chr [1:205] "6" "3" "6" "1" ...
 $ CL...90     : chr [1:205] "6" "3" "5" "1" ...
 $ Status...91 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...92     : chr [1:205] "6" "3" "6" "1" ...
 $ CL...93     : chr [1:205] "6" "3" "5" "1" ...
 $ Status...94 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...95     : chr [1:205] "5" "3" "6" "1" ...
 $ CL...96     : chr [1:205] "6" "3" "5" "1" ...
 $ Status...97 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...98     : chr [1:205] "5" "3" "6" "1" ...
 $ CL...99     : chr [1:205] "5" "3" "5" "1" ...
  [list output truncated]
  • We see that the class of all the variables is chr (= character)
  • It makes sense for the country name and Status to be chr (= character)
  • But it does not make sense for PR (political rights) and CL (civil liberties) to be chr (= character)
    → They should be numeric → This should be fixed

Solution:
- You can see "-" in the spreadsheet
- "-" means a missing value in the Freedom House data set
- read_excel() recognizes a blank cell as a missing value, but it reads "-" as text, so every column that contains "-" becomes chr
→ We need to tell read_excel() that "-" means a missing value
→ Add the following argument: na = "-"
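
The effect of a placeholder such as "-" can be reproduced without the Excel file. The following is a minimal sketch on made-up data. It uses base R's read.csv() and its na.strings argument as a stand-in for read_excel() and na; the toy column names and values are assumptions for illustration only.

```r
# Toy CSV text: "-" marks a missing score (made-up data, not Freedom House)
csv_text <- "country,pr\nA,7\nB,-\nC,6\n"

# Without telling R about "-", the whole pr column is read as character
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(raw$pr)

# Declaring "-" as the missing-value marker lets pr be read as a number
fixed <- read.csv(text = csv_text, na.strings = "-", stringsAsFactors = FALSE)
class(fixed$pr)
is.na(fixed$pr)
```

The same idea applies to the Freedom House sheet: once the marker is declared, a column that holds only numbers and the marker can be read as numeric.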

fh <- read_excel("data/FH_Country.xls", 
                 sheet = 2, 
                 skip = 2,
                 na = "-") # treat "-" as a missing value
str(fh)
tibble [205 × 133] (S3: tbl_df/tbl/data.frame)
 $ ...1        : chr [1:205] "Afghanistan" "Albania" "Algeria" "Andorra" ...
 $ PR...2      : chr [1:205] "4" "7" "6" "4" ...
 $ CL...3      : chr [1:205] "5" "7" "6" "3" ...
 $ Status...4  : chr [1:205] "PF" "NF" "NF" "PF" ...
 $ PR...5      : num [1:205] 7 7 6 4 NA NA 2 NA 1 1 ...
 $ CL...6      : num [1:205] 6 7 6 4 NA NA 2 NA 1 1 ...
 $ Status...7  : chr [1:205] "NF" "NF" "NF" "PF" ...
 $ PR...8      : num [1:205] 7 7 6 4 NA NA 2 NA 1 1 ...
 $ CL...9      : num [1:205] 6 7 6 4 NA NA 4 NA 1 1 ...
 $ Status...10 : chr [1:205] "NF" "NF" "NF" "PF" ...
 $ PR...11     : num [1:205] 7 7 7 4 6 NA 2 NA 1 1 ...
 $ CL...12     : num [1:205] 6 7 6 4 6 NA 4 NA 1 1 ...
 $ Status...13 : chr [1:205] "NF" "NF" "NF" "PF" ...
 $ PR...14     : num [1:205] 7 7 6 4 6 NA 6 NA 1 1 ...
 $ CL...15     : num [1:205] 6 7 6 4 6 NA 5 NA 1 1 ...
 $ Status...16 : chr [1:205] "NF" "NF" "NF" "PF" ...
 $ PR...17     : num [1:205] 6 7 6 NA 7 NA 6 NA 1 1 ...
 $ CL...18     : num [1:205] 6 7 6 NA 7 NA 6 NA 1 1 ...
 $ Status...19 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...20     : num [1:205] 7 7 6 NA 7 NA 6 NA 1 1 ...
 $ CL...21     : num [1:205] 7 7 6 NA 7 NA 5 NA 1 1 ...
 $ Status...22 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...23     : num [1:205] 7 7 6 NA 7 NA 6 NA 1 1 ...
 $ CL...24     : num [1:205] 7 7 6 NA 7 NA 5 NA 1 1 ...
 $ Status...25 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...26     : num [1:205] 7 7 6 NA 7 NA 6 NA 1 1 ...
 $ CL...27     : num [1:205] 7 7 6 NA 7 NA 5 NA 1 1 ...
 $ Status...28 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...29     : num [1:205] 7 7 6 NA 7 2 6 NA 1 1 ...
 $ CL...30     : num [1:205] 7 7 6 NA 7 2 5 NA 1 1 ...
 $ Status...31 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...32     : num [1:205] 7 7 6 NA 7 2 3 NA 1 1 ...
 $ CL...33     : num [1:205] 7 7 6 NA 7 3 3 NA 1 1 ...
 $ Status...34 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...35     : num [1:205] 7 7 6 NA 7 2 2 NA 1 1 ...
 $ CL...36     : num [1:205] 7 7 6 NA 7 3 2 NA 1 1 ...
 $ Status...37 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...38     : num [1:205] 7 7 6 NA 7 2 2 NA 1 1 ...
 $ CL...39     : num [1:205] 7 7 6 NA 7 3 2 NA 1 1 ...
 $ Status...40 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...41     : num [1:205] 7 7 6 NA 7 2 2 NA 1 1 ...
 $ CL...42     : num [1:205] 7 7 6 NA 7 3 1 NA 1 1 ...
 $ Status...43 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...44     : num [1:205] 7 7 6 NA 7 2 2 NA 1 1 ...
 $ CL...45     : num [1:205] 7 7 6 NA 7 3 1 NA 1 1 ...
 $ Status...46 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...47     : num [1:205] 6 7 5 NA 7 2 2 NA 1 1 ...
 $ CL...48     : num [1:205] 6 7 6 NA 7 3 1 NA 1 1 ...
 $ Status...49 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...50     : num [1:205] 7 7 6 NA 7 2 1 NA 1 1 ...
 $ CL...51     : num [1:205] 7 7 4 NA 7 3 2 NA 1 1 ...
 $ Status...52 : chr [1:205] "NF" "NF" "PF" NA ...
 $ PR...53     : num [1:205] 7 7 4 NA 7 3 1 NA 1 1 ...
 $ CL...54     : num [1:205] 7 6 4 NA 7 2 3 NA 1 1 ...
 $ Status...55 : chr [1:205] "NF" "NF" "PF" NA ...
 $ PR...56     : num [1:205] 7 4 4 NA 6 3 1 5 1 1 ...
 $ CL...57     : num [1:205] 7 4 4 NA 4 3 3 5 1 1 ...
 $ Status...58 : chr [1:205] "NF" "PF" "PF" NA ...
 $ PR...59     : num [1:205] 6 4 7 NA 6 3 2 4 1 1 ...
 $ CL...60     : num [1:205] 6 3 6 NA 6 3 3 3 1 1 ...
 $ Status...61 : chr [1:205] "NF" "PF" "NF" NA ...
 $ PR...62     : num [1:205] 7 2 7 2 7 4 2 3 1 1 ...
 $ CL...63     : num [1:205] 7 4 6 1 7 3 3 4 1 1 ...
 $ Status...64 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...65     : num [1:205] 7 3 7 1 7 4 2 3 1 1 ...
 $ CL...66     : num [1:205] 7 4 7 1 7 3 3 4 1 1 ...
 $ Status...67 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...68     : num [1:205] 7 3 6 1 6 4 2 4 1 1 ...
 $ CL...69     : num [1:205] 7 4 6 1 6 3 3 4 1 1 ...
 $ Status...70 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...71     : num [1:205] 7 4 6 1 6 4 2 5 1 1 ...
 $ CL...72     : num [1:205] 7 4 6 1 6 3 3 4 1 1 ...
 $ Status...73 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...74     : num [1:205] 7 4 6 1 6 4 2 5 1 1 ...
 $ CL...75     : num [1:205] 7 4 6 1 6 3 3 4 1 1 ...
 $ Status...76 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...77     : num [1:205] 7 4 6 1 6 4 3 4 1 1 ...
 $ CL...78     : num [1:205] 7 5 5 1 6 3 3 4 1 1 ...
 $ Status...79 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...80     : num [1:205] 7 4 6 1 6 4 2 4 1 1 ...
 $ CL...81     : num [1:205] 7 5 5 1 6 3 3 4 1 1 ...
 $ Status...82 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...83     : num [1:205] 7 4 6 1 6 4 1 4 1 1 ...
 $ CL...84     : num [1:205] 7 5 5 1 6 2 2 4 1 1 ...
 $ Status...85 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...86     : num [1:205] 7 3 6 1 6 4 3 4 1 1 ...
 $ CL...87     : num [1:205] 7 4 5 1 6 2 3 4 1 1 ...
 $ Status...88 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...89     : num [1:205] 6 3 6 1 6 4 3 4 1 1 ...
 $ CL...90     : num [1:205] 6 3 5 1 5 2 3 4 1 1 ...
 $ Status...91 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...92     : num [1:205] 6 3 6 1 6 4 2 4 1 1 ...
 $ CL...93     : num [1:205] 6 3 5 1 5 2 2 4 1 1 ...
 $ Status...94 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...95     : num [1:205] 5 3 6 1 6 2 2 5 1 1 ...
 $ CL...96     : num [1:205] 6 3 5 1 5 2 2 4 1 1 ...
 $ Status...97 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...98     : num [1:205] 5 3 6 1 6 2 2 5 1 1 ...
 $ CL...99     : num [1:205] 5 3 5 1 5 2 2 4 1 1 ...
  [list output truncated]
  • The PR... and CL... variables are now recognized as numeric, with two exceptions

  • PR...2 and CL...3 are still recognized as character
    → This should be fixed

  • We need to know why these two variables (PR...2 and CL...3) were not changed to numeric
    → Then we can change the class of these two variables from character to numeric

  • Using unique() function, check the values of PR...2

unique(fh$PR...2)
[1] "4"    "7"    "6"    NA     "1"    "2"    "5"    "3"    "2(5)"
  • 2(5) is included!

  • 2(5) is not a number but a character string

  • The values of a character variable are shown with ""

  • Since PR...2 contains 2(5), all of its values are shown with ""
    NA is an exception in R
    NA is a missing value, not a character string, so it is shown without ""

  • Because PR...2 contains 2(5), PR...2 is recognized as a character variable
    → This is the reason!

  • Using unique() function, check the values of CL...3

unique(fh$CL...3)
[1] "5"    "7"    "6"    "3"    NA     "1"    "4"    "2"    "3(6)"
  • 3(6) is included!
  • 3(6) is not a number but a character string
  • The values of a character variable are shown with ""
  • Since CL...3 contains 3(6), all of its values are shown with ""
    NA is an exception in R
    NA is a missing value, not a character string, so it is shown without ""
  • Because CL...3 contains 3(6), CL...3 is recognized as a character variable
    → This is the reason!
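
When a column has many distinct values, scanning the unique() output by eye gets tedious. One possible way to list only the entries that block the conversion is sketched below in base R. The object names x, converted, and bad are illustrative, and the values are copied from the unique() output above.

```r
# Values copied from the unique() output of CL...3
x <- c("5", "7", "6", "3", NA, "1", "4", "2", "3(6)")

# Try the conversion; entries that are not numbers become NA (with a warning)
converted <- suppressWarnings(as.numeric(x))

# Offending entries: not missing before the conversion, but missing after it
bad <- x[!is.na(x) & is.na(converted)]
bad
```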

Solution:

  • Using if_else() function, replace 2(5) and 3(6) with NA
  • Use NA_character_ (a missing value of class character), not the string "NA"
  • Name the new data frame fh_na
fh_na <- fh %>% 
  dplyr::mutate(
    PR...2 = if_else(PR...2 == "2(5)", NA_character_, PR...2),
    CL...3 = if_else(CL...3 == "3(6)", NA_character_, CL...3)) %>% 
  mutate(across(c(PR...2, CL...3), as.numeric)) 
  • Using unique() function, check the values of PR...2
unique(fh_na$PR...2)
[1]  4  7  6 NA  1  2  5  3
  • NA is not shown with ""
    NA is recognized as a missing value

  • Using unique() function, check the value of CL...3

unique(fh_na$CL...3)
[1]  5  7  6  3 NA  1  4  2
  • NA is not shown with ""
    NA is recognized as a missing value

  • Using str() function, check the class of PR...2 and CL...3

str(fh_na$PR...2)
 num [1:205] 4 7 6 4 NA NA 6 NA 1 1 ...
str(fh_na$CL...3)
 num [1:205] 5 7 6 3 NA NA 3 NA 1 1 ...
  • Both PR...2 and CL...3 are recognized as numeric
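
A note on the replacement value: the two-letter string "NA" and a missing value are different things in R. The short base R sketch below illustrates the difference on a made-up vector x; it is not part of the Freedom House workflow.

```r
is.na("NA")            # the two-letter string "NA" is ordinary text
is.na(NA_character_)   # a real missing value of class character

x <- c("4", "2(5)", NA)
x[x %in% "2(5)"] <- NA # base R replacement; %in% never returns NA
as.numeric(x)          # converts without a coercion warning
```

Converting the string "NA" with as.numeric() also ends up as a missing value, but only through a coercion warning, so writing a real missing value keeps the intent explicit.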

6.2 Converting data (Wide → Long form): fh_na

  • Check the data we use
str(fh_na)
tibble [205 × 133] (S3: tbl_df/tbl/data.frame)
 $ ...1        : chr [1:205] "Afghanistan" "Albania" "Algeria" "Andorra" ...
 $ PR...2      : num [1:205] 4 7 6 4 NA NA 6 NA 1 1 ...
 $ CL...3      : num [1:205] 5 7 6 3 NA NA 3 NA 1 1 ...
 $ Status...4  : chr [1:205] "PF" "NF" "NF" "PF" ...
 $ PR...5      : num [1:205] 7 7 6 4 NA NA 2 NA 1 1 ...
 $ CL...6      : num [1:205] 6 7 6 4 NA NA 2 NA 1 1 ...
 $ Status...7  : chr [1:205] "NF" "NF" "NF" "PF" ...
 $ PR...8      : num [1:205] 7 7 6 4 NA NA 2 NA 1 1 ...
 $ CL...9      : num [1:205] 6 7 6 4 NA NA 4 NA 1 1 ...
 $ Status...10 : chr [1:205] "NF" "NF" "NF" "PF" ...
 $ PR...11     : num [1:205] 7 7 7 4 6 NA 2 NA 1 1 ...
 $ CL...12     : num [1:205] 6 7 6 4 6 NA 4 NA 1 1 ...
 $ Status...13 : chr [1:205] "NF" "NF" "NF" "PF" ...
 $ PR...14     : num [1:205] 7 7 6 4 6 NA 6 NA 1 1 ...
 $ CL...15     : num [1:205] 6 7 6 4 6 NA 5 NA 1 1 ...
 $ Status...16 : chr [1:205] "NF" "NF" "NF" "PF" ...
 $ PR...17     : num [1:205] 6 7 6 NA 7 NA 6 NA 1 1 ...
 $ CL...18     : num [1:205] 6 7 6 NA 7 NA 6 NA 1 1 ...
 $ Status...19 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...20     : num [1:205] 7 7 6 NA 7 NA 6 NA 1 1 ...
 $ CL...21     : num [1:205] 7 7 6 NA 7 NA 5 NA 1 1 ...
 $ Status...22 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...23     : num [1:205] 7 7 6 NA 7 NA 6 NA 1 1 ...
 $ CL...24     : num [1:205] 7 7 6 NA 7 NA 5 NA 1 1 ...
 $ Status...25 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...26     : num [1:205] 7 7 6 NA 7 NA 6 NA 1 1 ...
 $ CL...27     : num [1:205] 7 7 6 NA 7 NA 5 NA 1 1 ...
 $ Status...28 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...29     : num [1:205] 7 7 6 NA 7 2 6 NA 1 1 ...
 $ CL...30     : num [1:205] 7 7 6 NA 7 2 5 NA 1 1 ...
 $ Status...31 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...32     : num [1:205] 7 7 6 NA 7 2 3 NA 1 1 ...
 $ CL...33     : num [1:205] 7 7 6 NA 7 3 3 NA 1 1 ...
 $ Status...34 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...35     : num [1:205] 7 7 6 NA 7 2 2 NA 1 1 ...
 $ CL...36     : num [1:205] 7 7 6 NA 7 3 2 NA 1 1 ...
 $ Status...37 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...38     : num [1:205] 7 7 6 NA 7 2 2 NA 1 1 ...
 $ CL...39     : num [1:205] 7 7 6 NA 7 3 2 NA 1 1 ...
 $ Status...40 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...41     : num [1:205] 7 7 6 NA 7 2 2 NA 1 1 ...
 $ CL...42     : num [1:205] 7 7 6 NA 7 3 1 NA 1 1 ...
 $ Status...43 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...44     : num [1:205] 7 7 6 NA 7 2 2 NA 1 1 ...
 $ CL...45     : num [1:205] 7 7 6 NA 7 3 1 NA 1 1 ...
 $ Status...46 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...47     : num [1:205] 6 7 5 NA 7 2 2 NA 1 1 ...
 $ CL...48     : num [1:205] 6 7 6 NA 7 3 1 NA 1 1 ...
 $ Status...49 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...50     : num [1:205] 7 7 6 NA 7 2 1 NA 1 1 ...
 $ CL...51     : num [1:205] 7 7 4 NA 7 3 2 NA 1 1 ...
 $ Status...52 : chr [1:205] "NF" "NF" "PF" NA ...
 $ PR...53     : num [1:205] 7 7 4 NA 7 3 1 NA 1 1 ...
 $ CL...54     : num [1:205] 7 6 4 NA 7 2 3 NA 1 1 ...
 $ Status...55 : chr [1:205] "NF" "NF" "PF" NA ...
 $ PR...56     : num [1:205] 7 4 4 NA 6 3 1 5 1 1 ...
 $ CL...57     : num [1:205] 7 4 4 NA 4 3 3 5 1 1 ...
 $ Status...58 : chr [1:205] "NF" "PF" "PF" NA ...
 $ PR...59     : num [1:205] 6 4 7 NA 6 3 2 4 1 1 ...
 $ CL...60     : num [1:205] 6 3 6 NA 6 3 3 3 1 1 ...
 $ Status...61 : chr [1:205] "NF" "PF" "NF" NA ...
 $ PR...62     : num [1:205] 7 2 7 2 7 4 2 3 1 1 ...
 $ CL...63     : num [1:205] 7 4 6 1 7 3 3 4 1 1 ...
 $ Status...64 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...65     : num [1:205] 7 3 7 1 7 4 2 3 1 1 ...
 $ CL...66     : num [1:205] 7 4 7 1 7 3 3 4 1 1 ...
 $ Status...67 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...68     : num [1:205] 7 3 6 1 6 4 2 4 1 1 ...
 $ CL...69     : num [1:205] 7 4 6 1 6 3 3 4 1 1 ...
 $ Status...70 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...71     : num [1:205] 7 4 6 1 6 4 2 5 1 1 ...
 $ CL...72     : num [1:205] 7 4 6 1 6 3 3 4 1 1 ...
 $ Status...73 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...74     : num [1:205] 7 4 6 1 6 4 2 5 1 1 ...
 $ CL...75     : num [1:205] 7 4 6 1 6 3 3 4 1 1 ...
 $ Status...76 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...77     : num [1:205] 7 4 6 1 6 4 3 4 1 1 ...
 $ CL...78     : num [1:205] 7 5 5 1 6 3 3 4 1 1 ...
 $ Status...79 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...80     : num [1:205] 7 4 6 1 6 4 2 4 1 1 ...
 $ CL...81     : num [1:205] 7 5 5 1 6 3 3 4 1 1 ...
 $ Status...82 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...83     : num [1:205] 7 4 6 1 6 4 1 4 1 1 ...
 $ CL...84     : num [1:205] 7 5 5 1 6 2 2 4 1 1 ...
 $ Status...85 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...86     : num [1:205] 7 3 6 1 6 4 3 4 1 1 ...
 $ CL...87     : num [1:205] 7 4 5 1 6 2 3 4 1 1 ...
 $ Status...88 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...89     : num [1:205] 6 3 6 1 6 4 3 4 1 1 ...
 $ CL...90     : num [1:205] 6 3 5 1 5 2 3 4 1 1 ...
 $ Status...91 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...92     : num [1:205] 6 3 6 1 6 4 2 4 1 1 ...
 $ CL...93     : num [1:205] 6 3 5 1 5 2 2 4 1 1 ...
 $ Status...94 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...95     : num [1:205] 5 3 6 1 6 2 2 5 1 1 ...
 $ CL...96     : num [1:205] 6 3 5 1 5 2 2 4 1 1 ...
 $ Status...97 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...98     : num [1:205] 5 3 6 1 6 2 2 5 1 1 ...
 $ CL...99     : num [1:205] 5 3 5 1 5 2 2 4 1 1 ...
  [list output truncated]
dim(fh_na) 
[1] 205 133
  • We have 205 countries and 133 variables in fh_na
  • Since this data is in wide form, we need to change it to long form
  • Compared to the gdp data we cleaned in the previous section, this data needs a bit more work

A bit more work to do

  1. We have three variables per country and per year (PR, CL, Status)
    PR...2, CL...3, Status...4 are the data for 1972
    PR...5, CL...6, Status...7 are the data for 1973
    ・・・・・・・・・・・・・・・・・・・・・・・・
    PR...128, CL...129, Status...130 are the data for 2015
    PR...131, CL...132, Status...133 are the data for 2016

  2. Two different classes of variables: numeric and categorical

  • PR, CL — numeric data (min 1 to max 7)
  • Status — categorical data (F, PF, NF)

Solution: Make two variables (value, status)
→ The values of PR and CL are put into value
→ The values of Status are put into status
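
The reason PR/CL and Status cannot share one column is that a vector in R holds a single class. A minimal base R illustration (the object name mixed is made up for this example):

```r
mixed <- c(7, "NF")   # a number and a status code in one vector
class(mixed)          # the number is silently turned into text
mixed
```

Keeping the scores in value and the labels in status lets value stay numeric.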

  1. Change the name of the 1st variable: ...1
  • The variable ...1 shows country name
  • We rename ...1 as country
fh_na <- fh_na %>% 
    rename(country = 1) 

Converting Freedom House data from wide form to long form

  • We split fh_na into two data frames: fh_country (country names) and fh_na (all the other variables)
fh_country <- fh_na %>% 
  select(country)

fh_na <- fh_na %>% 
  select(-country)
  • Check the variables in each data frame
names(fh_na)
  [1] "PR...2"       "CL...3"       "Status...4"   "PR...5"       "CL...6"      
  [6] "Status...7"   "PR...8"       "CL...9"       "Status...10"  "PR...11"     
 [11] "CL...12"      "Status...13"  "PR...14"      "CL...15"      "Status...16" 
 [16] "PR...17"      "CL...18"      "Status...19"  "PR...20"      "CL...21"     
 [21] "Status...22"  "PR...23"      "CL...24"      "Status...25"  "PR...26"     
 [26] "CL...27"      "Status...28"  "PR...29"      "CL...30"      "Status...31" 
 [31] "PR...32"      "CL...33"      "Status...34"  "PR...35"      "CL...36"     
 [36] "Status...37"  "PR...38"      "CL...39"      "Status...40"  "PR...41"     
 [41] "CL...42"      "Status...43"  "PR...44"      "CL...45"      "Status...46" 
 [46] "PR...47"      "CL...48"      "Status...49"  "PR...50"      "CL...51"     
 [51] "Status...52"  "PR...53"      "CL...54"      "Status...55"  "PR...56"     
 [56] "CL...57"      "Status...58"  "PR...59"      "CL...60"      "Status...61" 
 [61] "PR...62"      "CL...63"      "Status...64"  "PR...65"      "CL...66"     
 [66] "Status...67"  "PR...68"      "CL...69"      "Status...70"  "PR...71"     
 [71] "CL...72"      "Status...73"  "PR...74"      "CL...75"      "Status...76" 
 [76] "PR...77"      "CL...78"      "Status...79"  "PR...80"      "CL...81"     
 [81] "Status...82"  "PR...83"      "CL...84"      "Status...85"  "PR...86"     
 [86] "CL...87"      "Status...88"  "PR...89"      "CL...90"      "Status...91" 
 [91] "PR...92"      "CL...93"      "Status...94"  "PR...95"      "CL...96"     
 [96] "Status...97"  "PR...98"      "CL...99"      "Status...100" "PR...101"    
[101] "CL...102"     "Status...103" "PR...104"     "CL...105"     "Status...106"
[106] "PR...107"     "CL...108"     "Status...109" "PR...110"     "CL...111"    
[111] "Status...112" "PR...113"     "CL...114"     "Status...115" "PR...116"    
[116] "CL...117"     "Status...118" "PR...119"     "CL...120"     "Status...121"
[121] "PR...122"     "CL...123"     "Status...124" "PR...125"     "CL...126"    
[126] "Status...127" "PR...128"     "CL...129"     "Status...130" "PR...131"    
[131] "CL...132"     "Status...133"
names(fh_country)
[1] "country"

Data cleaning: fh_na

colnames(fh_na) <- 
  str_replace_all(colnames(fh_na), 
                  c("\\.\\.\\." = "-")) %>% # replace "..." with "-"    
  str_subset("PR|CL|Status") %>%  # keep the names containing PR, CL, or Status    
  str_c(., "_") %>%               # add "_" to the end, e.g. "PR-2_"
  str_replace_all(c("-" =  "",    # remove "-"
                    "[0-9]" = "", # remove the column numbers
                    "PR" = "pr",  # PR => pr
                    "CL" = "cl",  # CL => cl 
                    "Status" = "st")) %>% # Status => st 
  str_c(.,  rep(setdiff(1972:2016, 1981), # all years except 1981
                each = 3))   # 3 variables per year, giving names like "pr_1972"
  • Check the variable names of fh_na
names(fh_na)
  [1] "pr_1972" "cl_1972" "st_1972" "pr_1973" "cl_1973" "st_1973" "pr_1974"
  [8] "cl_1974" "st_1974" "pr_1975" "cl_1975" "st_1975" "pr_1976" "cl_1976"
 [15] "st_1976" "pr_1977" "cl_1977" "st_1977" "pr_1978" "cl_1978" "st_1978"
 [22] "pr_1979" "cl_1979" "st_1979" "pr_1980" "cl_1980" "st_1980" "pr_1982"
 [29] "cl_1982" "st_1982" "pr_1983" "cl_1983" "st_1983" "pr_1984" "cl_1984"
 [36] "st_1984" "pr_1985" "cl_1985" "st_1985" "pr_1986" "cl_1986" "st_1986"
 [43] "pr_1987" "cl_1987" "st_1987" "pr_1988" "cl_1988" "st_1988" "pr_1989"
 [50] "cl_1989" "st_1989" "pr_1990" "cl_1990" "st_1990" "pr_1991" "cl_1991"
 [57] "st_1991" "pr_1992" "cl_1992" "st_1992" "pr_1993" "cl_1993" "st_1993"
 [64] "pr_1994" "cl_1994" "st_1994" "pr_1995" "cl_1995" "st_1995" "pr_1996"
 [71] "cl_1996" "st_1996" "pr_1997" "cl_1997" "st_1997" "pr_1998" "cl_1998"
 [78] "st_1998" "pr_1999" "cl_1999" "st_1999" "pr_2000" "cl_2000" "st_2000"
 [85] "pr_2001" "cl_2001" "st_2001" "pr_2002" "cl_2002" "st_2002" "pr_2003"
 [92] "cl_2003" "st_2003" "pr_2004" "cl_2004" "st_2004" "pr_2005" "cl_2005"
 [99] "st_2005" "pr_2006" "cl_2006" "st_2006" "pr_2007" "cl_2007" "st_2007"
[106] "pr_2008" "cl_2008" "st_2008" "pr_2009" "cl_2009" "st_2009" "pr_2010"
[113] "cl_2010" "st_2010" "pr_2011" "cl_2011" "st_2011" "pr_2012" "cl_2012"
[120] "st_2012" "pr_2013" "cl_2013" "st_2013" "pr_2014" "cl_2014" "st_2014"
[127] "pr_2015" "cl_2015" "st_2015" "pr_2016" "cl_2016" "st_2016"
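
The same set of names can also be built directly, without editing the old names. The following base R sketch is an alternative, not the method used above, and it assumes the columns always come in the repeating order PR, CL, Status. The object names years, prefix, and new_names are illustrative.

```r
years  <- setdiff(1972:2016, 1981)                  # 44 years, 1981 left out
prefix <- rep(c("pr", "cl", "st"), times = length(years))
new_names <- paste0(prefix, "_", rep(years, each = 3))

length(new_names)
head(new_names, 6)
```

Comparing length(new_names) with ncol(fh_na) before assigning is a cheap check that no column is mislabeled.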
  • Using bind_cols() function, merge fh_country and fh_na
fh_na <- fh_country %>% 
  bind_cols(fh_na)
  • Check fh_na
rmarkdown::paged_table(fh_na)
  • Make two long-form data frames: PR_CL_long (with type, year, value) and ST_long (with year, status)
PR_CL_long <- fh_na %>% 
  select(country,                        
         starts_with(c("pr", "cl"))) %>%  # select the variables starting with `pr` or `cl`   
  pivot_longer(pr_1972:cl_2016,           # the range of variables to convert
               names_to = "type",         # put variable names, such as "pr_1972", into `type`  
               values_to = "value") %>%   # put the scores (1 to 7) into `value`  
  separate(type,              
           into = c("type", "year"),      # divide the values of type into 2: `type` and `year`
           sep = "_") %>%                 # split at "_" 
  drop_na()                               # drop the rows with missing values 
ST_long <- fh_na %>% 
  select(country,                        # select country
         starts_with("st")) %>%          # select the variables starting with `st`   
  pivot_longer(st_1972:st_2016,          # the range of variables to convert
               names_to = "name",        # put variable names, such as "st_1972", into `name`  
               values_to = "status") %>% # put the statuses (F, PF, NF) into `status`  
  separate(name, 
           into = c("name", "year"),     # divide the values of name into 2: `name` and `year`
           sep = "_") %>%                # split at "_" 
  select(-name) %>%                      # drop name because we do not need it  
  drop_na()                              # drop the rows with missing values 
  • Check PR_CL_long  
names(PR_CL_long)
[1] "country" "type"    "year"    "value"  
  • Check ST_long 
names(ST_long)
[1] "country" "year"    "status" 
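
The pivot_longer() and separate() steps can also be combined: pivot_longer() accepts a names_sep argument that splits each column name while reshaping. The sketch below shows this on a made-up tibble called toy; it is an alternative shown for comparison, not the code used above.

```r
library(tidyr)
library(tibble)

# Made-up wide data: 2 countries, 2 years, pr and cl only
toy <- tibble(
  country = c("A", "B"),
  pr_1972 = c(7, 1), cl_1972 = c(6, 2),
  pr_1973 = c(7, NA), cl_1973 = c(6, NA)
)

toy_long <- toy %>%
  pivot_longer(-country,                      # every column except country
               names_to  = c("type", "year"), # split each name into two columns
               names_sep = "_",               # at the underscore
               values_to = "value",
               values_drop_na = TRUE)         # drop the rows whose value is missing

toy_long
```

As with separate(), year comes out as a character variable here.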
  • Using left_join() function, merge PR_CL_long and ST_long by the two shared variables: country and year
fh_all_long <- PR_CL_long %>% 
  left_join(ST_long, 
            by = c("country", "year"))
DT::datatable(fh_all_long)
  • Now, we have converted wide form data into long form data
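
A quick way to convince yourself that the join behaves as intended is to try it on a tiny example. In the sketch below, scores and statuses are made-up data frames in the same shape as PR_CL_long and ST_long. Each country-year pair appears twice in scores (pr and cl) and once in statuses, so the join should keep the row count of scores.

```r
library(dplyr)

# Made-up long data in the same shape as PR_CL_long and ST_long
scores <- data.frame(
  country = c("A", "A", "B", "B"),
  type    = c("pr", "cl", "pr", "cl"),
  year    = c("1972", "1972", "1972", "1972"),
  value   = c(7, 6, 1, 2),
  stringsAsFactors = FALSE
)
statuses <- data.frame(
  country = c("A", "B"),
  year    = c("1972", "1972"),
  status  = c("NF", "F"),
  stringsAsFactors = FALSE
)

joined <- scores %>%
  left_join(statuses, by = c("country", "year"))

nrow(joined) == nrow(scores)   # the join should not add or drop rows
joined
```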

6.3 Data Visualization: Freedom House

Transition of Political Rights in North Korea and South Korea (1972-2016)

korea_PR <- fh_all_long %>% 
  filter(country == "North Korea" | country == "South Korea") %>% 
  filter(type == "pr")
korea_PR %>% 
  ggplot(aes(x = value, y = year, 
             color = country,
             shape = country)) +
  geom_point() +
  ggtitle("Political Rights between N.Korea and S.Korea: 1972-2016") +
  labs(x = "Political Rights", y = "Year") +
  theme(legend.position = c(0.5, 0.8)) 

  • North Korea's political rights score has stayed at the worst level (7) since 1972
  • South Korea's political rights score was around 5 in 1972, but it has improved since then (to 1 or 2)

Transition of Political Rights between Japan and China (1972-2016)

jpn.chi_PR <- fh_all_long %>% 
  filter(country == "Japan" | country == "China") %>% 
  filter(type == "pr")
jpn.chi_PR %>% 
  ggplot(aes(x = value, y = year, 
             color = country, 
             shape = country)) +
  geom_point() +
  ggtitle("Political Rights between Japan and China: 1972-2016") +
  labs(x = "Political Rights", y = "Year") +
  theme(legend.position = c(0.5, 0.8)) 

  • China's political rights score has been consistently bad (7 or 6) since 1972
  • Japan's political rights score has been consistently good (1 or 2)
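
In fh_all_long, year is a character variable, because separate() returns text by default, so ggplot2 treats it as a set of categories rather than a timeline. One possible refinement, sketched on made-up data (not the plots above), is to convert year to integer and put it on the x axis. The data frame toy and the plot object p are illustrative names.

```r
library(ggplot2)

# Made-up long data; year is text, as it is after separate()
toy <- data.frame(
  country = rep(c("A", "B"), each = 3),
  year    = rep(c("1972", "1973", "1974"), times = 2),
  value   = c(7, 7, 7, 5, 4, 2),
  stringsAsFactors = FALSE
)

toy$year <- as.integer(toy$year)   # text -> whole numbers

p <- ggplot(toy, aes(x = year, y = value, color = country)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Political Rights")
p
```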
Reference