✔　What we do here
・Introduce how to make a variable
・Introduce how to merge variables and make a dataframe
・Introduce how to merge multiple dataframes
・Introduce how to read data with various file name extensions
・Introduce how to clean the data you read into RStudio
・Explain technical terms we need in analyzing data

Technical terms explained here: text data, binary data, file extension, path, file, folder, R Project, R Project folder, working directory, missing value, class, data cleaning, converting data between wide and long format

Load the haven, readxl, and tidyverse packages

library(haven)
library(readxl)
library(tidyverse)

─ Attaching packages ──────────────────── tidyverse 1.3.1 ─

✓ ggplot2 3.3.3     ✓ purrr   0.3.4
✓ tibble  3.1.2     ✓ dplyr   1.0.6
✓ tidyr   1.1.3     ✓ stringr 1.4.0
✓ readr   1.4.0     ✓ forcats 0.5.1

─ Conflicts ───────────────────── tidyverse_conflicts() ─
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

Loading tidyverse attaches its 8 core packages (listed above)
We need readr to read csv files

1. Data frame

1.1 How to make variables

A variable is also called a vector
Make a variable containing 8 numbers (= values) and name it id

id <- c(1,2,3,4,5,6,7,8)

Make a variable containing 8 names and name it name

name <- c("Thies", "Cox", "McCubbins", "Schwartz", "DeNardo", "Bawn", "Patterson", "Geddes")

Make a variable containing test scores and name it score

score <- c(43, 74, 80, 37, 20, 83, 64, 35)
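
As a quick check on the three variables above, the following sketch inspects their class and length with base R functions. It only reuses the vectors defined in this section.

```r
# The three vectors defined above
id <- c(1, 2, 3, 4, 5, 6, 7, 8)
name <- c("Thies", "Cox", "McCubbins", "Schwartz", "DeNardo", "Bawn", "Patterson", "Geddes")
score <- c(43, 74, 80, 37, 20, 83, 64, 35)

class(id)      # "numeric": numbers typed with c() are stored as doubles
class(name)    # "character"
length(score)  # 8

# Variables that go into one data frame should have the same length (here 8)
same_length <- length(id) == length(name) && length(name) == length(score)
same_length
```

Checking the lengths first helps you avoid an error when the variables are combined into a data frame in the next step.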

1.2 Add variables to dataframe

Load tidyverse packages to use tibble() function

library(tidyverse)

Make a data frame with three variables (id, name, score) and name it df1

df1 <- tibble(id, name, score)
df1

# A tibble: 8 x 3
     id name      score
  <dbl> <chr>     <dbl>
1     1 Thies        43
2     2 Cox          74
3     3 McCubbins    80
4     4 Schwartz     37
5     5 DeNardo      20
6     6 Bawn         83
7     7 Patterson    64
8     8 Geddes       35

- You can also use data.frame() instead of using tibble()

df1 <- data.frame(id, name, score)
df1

○ tibble() shows the size of the data frame (such as 8 x 3) and the class of each variable (such as <dbl> and <chr>)
→ We recommend using tibble()
○ Once you load the tidyverse package, you can use tibble()

1.3 Add a new variable to your data frame

We want to add another variable, department
Add $ after the data frame name (df1), then put the name of the new variable (department)

df1$department <- c("poli-sci", "econ", "poli-sci", "econ", "art", "music", "communication", "history")

df1

# A tibble: 8 x 4
     id name      score department   
  <dbl> <chr>     <dbl> <chr>        
1     1 Thies        43 poli-sci     
2     2 Cox          74 econ         
3     3 McCubbins    80 poli-sci     
4     4 Schwartz     37 econ         
5     5 DeNardo      20 art          
6     6 Bawn         83 music        
7     7 Patterson    64 communication
8     8 Geddes       35 history

Add gender to df1

df1$gender <- c("male", "male", "male", "male", "male", "female", "male", "female")

df1

# A tibble: 8 x 5
     id name      score department    gender
  <dbl> <chr>     <dbl> <chr>         <chr> 
1     1 Thies        43 poli-sci      male  
2     2 Cox          74 econ          male  
3     3 McCubbins    80 poli-sci      male  
4     4 Schwartz     37 econ          male  
5     5 DeNardo      20 art           male  
6     6 Bawn         83 music         female
7     7 Patterson    64 communication male  
8     8 Geddes       35 history       female
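
The same kind of step can also be written with dplyr::mutate(), which is attached by library(tidyverse). This is only a sketch on a small hypothetical data frame (toy and its values are made up for illustration and are not part of the lesson data).

```r
library(dplyr)

# Hypothetical toy data frame (values invented for illustration)
toy <- data.frame(id = c(1, 2, 3),
                  score = c(43, 74, 80))

# mutate() returns a copy of toy with the new variable appended
toy <- toy %>%
  mutate(department = c("poli-sci", "econ", "poli-sci"))

names(toy)
```

As with $, the new variable needs one value per row of the data frame.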

1.4 Merging data frames

Make a data frame, df2
df2 includes the following two variables:

①id
②state

Make id

id <- c(1,2,3,4,5,6,7,8)

Make state, which records the state each person comes from

state <- c("California", "Oregon", "NY", "Washington", "Florida", "Wisconsin", "Alabama", "South Carolina")

Make a data frame with the two variables and name it df2

df2 <- tibble(id, state)
df2

# A tibble: 8 x 2
     id state         
  <dbl> <chr>         
1     1 California    
2     2 Oregon        
3     3 NY            
4     4 Washington    
5     5 Florida       
6     6 Wisconsin     
7     7 Alabama       
8     8 South Carolina

Merge the two data frames (df1 and df2) by the variable they share (id) and name the new data frame M

M <- merge(df1, df2, by = "id")
M

  id      name score    department gender          state
1  1     Thies    43      poli-sci   male     California
2  2       Cox    74          econ   male         Oregon
3  3 McCubbins    80      poli-sci   male             NY
4  4  Schwartz    37          econ   male     Washington
5  5   DeNardo    20           art   male        Florida
6  6      Bawn    83         music female      Wisconsin
7  7 Patterson    64 communication   male        Alabama
8  8    Geddes    35       history female South Carolina
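
In the example above every id appears in both data frames. The sketch below, on hypothetical toy data, shows what merge() does when some ids do not match. Note also that merge() returns a plain data frame, which is why M prints without the tibble header.

```r
# Hypothetical toy data: id 3 appears only in a, id 4 only in b
a <- data.frame(id = c(1, 2, 3), score = c(43, 74, 80))
b <- data.frame(id = c(1, 2, 4), state = c("California", "Oregon", "NY"))

inner <- merge(a, b, by = "id")              # keeps ids found in both: 1, 2
full  <- merge(a, b, by = "id", all = TRUE)  # keeps every id: 1, 2, 3, 4

nrow(inner)
nrow(full)
full   # unmatched cells are filled with NA
```

By default merge() silently drops rows without a match, so it is worth comparing the number of rows before and after merging.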

1.5 Exercise

Question 1: Make a list of your family or friends (df1) containing the following variables:

① id: (1…..5)
② name
③ age
④ relationship

Question 2: Make a list of your family or friends (df2) containing the following variables:

① id: (1…..5)
② gender
③ height

Question 3: Merge the two data frames you made (df1 and df2) by the shared variable (id) and name the result M1

2. Basics on data

2.1 File and Folder

A computer file is a computer resource for recording data in a computer storage device.
A folder (also called a directory) is a space used to store files.
For instance, take a look at the following:

You see 4 folders on the left side (backdoor, maps, R, RDD)
The folder R has 4 files
Files have various extensions (.html, .Rmd, .csv, .doc, .png, .jpg)
Folders do not have extensions
A folder is also called a directory. Examples: R Project folder, working directory

2.2 Path

A path is a string of characters used to uniquely identify a location in a directory structure.
It is composed by following the directory tree hierarchy in which components, separated by a delimiting character, represent each directory.
The delimiting character is most commonly the slash (/).
getwd() = get working directory
→ You can see which directory you are currently working in
For example, let me type getwd() on my computer and hit the return key

getwd()

[1] "/Users/asanomasahiko/Dropbox/statistics/class_materials/R"

If you are a Mac user, you will see something like this
If you are a Windows user, the path will typically start with a drive letter such as C: instead of /Users
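
To make the idea of a path more concrete, here is a minimal sketch using base R functions that take a path apart and put one together. The path p and the file name are hypothetical and used only for illustration.

```r
# Hypothetical path used only for illustration
p <- "/Users/someone/projects/R"

basename(p)   # last component of the path
dirname(p)    # everything before the last component

# file.path() joins components with "/"
csv_path <- file.path(p, "data", "grades_2021.csv")
csv_path
```

Building paths with file.path() avoids typing the delimiter by hand.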

2.3 Working Directory

What does R at the end of the path shown above mean?
R is the name of the R Project folder (= working directory) where you are currently working.
→ You are working in the R Project folder named R.
You could set an appropriate path yourself, but it is more efficient to create an R Project.

Reasons:
- If you are in your R Project folder, you don’t have to specify the full path whenever you need data.

2.4 What to avoid in file names

You should not use numbers at the beginning of your file name
Example) × 「2021_grades」=> ○「grades_2021」
You should not put spaces in your file name
Example) × 「2021 grades」=> ○「grades_2021」
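
The two rules above can be checked mechanically. The helper below, is_good_name(), is a hypothetical function written for this illustration (it is not part of R or any package).

```r
# Hypothetical helper: TRUE when a file name follows both rules above
is_good_name <- function(x) {
  starts_with_number <- grepl("^[0-9]", x)  # rule 1: no leading number
  has_space <- grepl(" ", x)                # rule 2: no spaces
  !starts_with_number & !has_space
}

is_good_name(c("2021_grades", "2021 grades", "grades_2021"))
```

Only the last name, grades_2021, passes both checks.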

3. How to use data built into R

R (and some R packages) comes with built-in datasets
If you type data(), you can see the list of these built-in datasets (part of the list is shown here).

data()

For instance, let me show you the first 6 rows of the 7th dataset, state.x77

head(state.x77)

           Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
California      21198   5114        1.1    71.71   10.3    62.6    20 156361
Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766

Let me show you the last 6 rows of state.x77

tail(state.x77)

              Population Income Illiteracy Life Exp Murder HS Grad Frost  Area
Vermont              472   3907        0.6    71.64    5.5    57.1   168  9267
Virginia            4981   4701        1.4    70.08    9.5    47.8    85 39780
Washington          3559   4864        0.6    71.72    4.3    63.5    32 66570
West Virginia       1799   3617        1.4    69.48    6.7    41.6   100 24070
Wisconsin           4589   4468        0.7    72.48    3.0    54.5   149 54464
Wyoming              376   4566        0.6    70.29    6.9    62.9   173 97203
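
Besides head() and tail(), a few other base R functions help when exploring a built-in dataset. The following is a small sketch using state.x77 itself; state_df is a name chosen here for illustration.

```r
dim(state.x77)           # 50 rows (states) and 8 columns
head(state.x77, n = 3)   # first 3 rows instead of the default 6
is.matrix(state.x77)     # TRUE: state.x77 is a matrix, not a data frame

# Convert to a data frame when you want to use $ or tidyverse functions
state_df <- as.data.frame(state.x77)
state_df$Income[1]       # income of the first state, Alabama
```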

You can also look at the Titanic data

head(Titanic)

, , Age = Child, Survived = No

      Sex
Class  Male Female
  1st     0      0
  2nd     0      0
  3rd    35     17
  Crew    0      0

, , Age = Adult, Survived = No

      Sex
Class  Male Female
  1st   118      4
  2nd   154     13
  3rd   387     89
  Crew  670      3

, , Age = Child, Survived = Yes

      Sex
Class  Male Female
  1st     5      1
  2nd    11     13
  3rd    13     14
  Crew    0      0

, , Age = Adult, Survived = Yes

      Sex
Class  Male Female
  1st    57    140
  2nd    14     80
  3rd    75     76
  Crew  192     20

4. How to use data not built into R

Read the data according to its format

Data on a computer comes in two types: text data and binary data
The data format is indicated by the file extension

Text data

Data that humans can read and understand
.txt file: you use this when you do text analysis
.html file: you use this when you do web scraping
.csv file: comma-separated values
→ We recommend reading csv files in RStudio

Binary data

Data that computers can read but humans cannot read directly
.xls file
.xlsx file: newer than the .xls file
.dta file: you can use this in Stata
.rds file: you can use this only in R
We recommend using LibreOffice rather than MS Office Excel
→ It is free software
→ You can specify the character encoding
→ You can avoid unnecessary errors
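
The difference between text data and binary data can be seen by saving the same data frame both ways. This is a minimal sketch with base R functions; toy is hypothetical data and the files are written to a temporary location.

```r
# Hypothetical toy data (values invented for illustration)
toy <- data.frame(id = 1:2, name = c("A", "B"))

csv_file <- tempfile(fileext = ".csv")
rds_file <- tempfile(fileext = ".rds")

write.csv(toy, csv_file, row.names = FALSE)  # text data
saveRDS(toy, rds_file)                       # binary data, R only

readLines(csv_file)          # the csv file is plain, readable lines of text
toy_back <- readRDS(rds_file)
identical(toy, toy_back)     # the .rds file restores the object exactly
```

The csv file can be opened in any text editor, while the .rds file can only be read back with readRDS().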

4.1 How to read `.csv` file

Download the Japanese Lower House Election Data hr96-17.csv and read it into RStudio
Make a new folder named data within your R Project folder
Put hr96-17.csv into data
You need the tidyverse package to read a csv file
→ Load tidyverse

library(tidyverse)

Read the csv file and name it hr

hr <- read_csv("data/hr96-17.csv", 
               na = ".")  # treat "." in the file as a missing value (NA)
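
To see what the na argument does, here is a minimal sketch on a small temporary csv file. The file contents are hypothetical and only imitate a file in which "." marks a missing value.

```r
library(readr)

# Hypothetical toy file in which "." marks a missing score
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,43", "2,.", "3,80"), tmp)

toy <- read_csv(tmp, na = ".")
toy$score          # the "." has become NA, and the column is numeric
is.na(toy$score)
```

Without na = ".", the "." would be read as text and the whole score column would become character.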

If you fail to read the csv file

When the csv file contains values in Japanese, you are likely to see an error message

If you use Excel

Suppose the csv file looks as follows when you open it in Excel

You need to save it in the CSV UTF-8 (.csv) format

If you use LibreOffice

Select Unicode (UTF-8) and save it

If you still have problems, try this command

hr <- read_csv("data/hr96-17.csv", 
               na = ".",
               locale = locale(encoding = "cp932"))

4.2 How to read `.xls[x]` file

Load the readxl package to read .xls[x] files

library(readxl)

Here, we download the Freedom House dataset and read it

fh <- read_excel("data/FH_Country.xls")
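
An Excel file can hold several sheets, so it helps to list them before reading. The sketch below uses datasets.xlsx, an example workbook that ships with readxl (it is not the Freedom House file), located with readxl_example().

```r
library(readxl)

# Example workbook shipped with readxl (not the Freedom House file)
path <- readxl_example("datasets.xlsx")

sheets <- excel_sheets(path)   # names of all sheets in the workbook
sheets

# A sheet can be chosen by position or by name
by_position <- read_excel(path, sheet = 2)
by_name     <- read_excel(path, sheet = sheets[2])
identical(by_position, by_name)
```

When sheet is omitted, read_excel() reads the first sheet.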

4.3 How to read `.dta` file

.dta file is binary data
Load the haven package to read .dta files

library(haven)

We download and read TRIANGLE.DTA, the replication data offered by Bruce Russett and John R. Oneal (2001), “Triangulating Peace”

triangle <- read_dta("data/TRIANGLE.DTA")
head(triangle)

# A tibble: 6 x 19
  statea stateb  year dependa dependb demauta demautb allies dispute1 logdstab
   <dbl>  <dbl> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>    <dbl>    <dbl>
1      2     20  1920  0.0157   0.280      10       9      0        0     5.82
2      2     20  1921  0.0115   0.224      10      10      0        0     5.82
3      2     20  1922  0.0113   0.201      10      10      0        0     5.82
4      2     20  1923  0.0112   0.213      10      10      0        0     5.82
5      2     20  1924  0.0110   0.213      10      10      0        0     5.82
6      2     20  1925  0.0108   0.191      10      10      0        0     5.82
# … with 9 more variables: lcaprat2 <dbl>, smigoabi <dbl>, opena <dbl>,
#   openb <dbl>, minrpwrs <dbl>, noncontg <dbl>, smldmat <dbl>, smldep <dbl>,
#   dyadid <dbl>

Show the list of variables that triangle contains

names(triangle)

 [1] "statea"   "stateb"   "year"     "dependa"  "dependb"  "demauta" 
 [7] "demautb"  "allies"   "dispute1" "logdstab" "lcaprat2" "smigoabi"
[13] "opena"    "openb"    "minrpwrs" "noncontg" "smldmat"  "smldep"  
[19] "dyadid"

We can save TRIANGLE.DTA in csv format, which is more widely used, with the following command

write_excel_csv(triangle, "data/triangle.csv")

5. Data Cleaning of `GDP` data

5.1 Read the GDP data

We download wb_gdp_pc.csv
Make a data folder and put wb_gdp_pc.csv into the folder

wb_gdp <- read_csv("data/wb_gdp_pc.csv")

Warning: Missing column names filled in: 'X3' [3]


─ Column specification ────────────────────────────
cols(
  `Data Source` = col_character(),
  `World Development Indicators` = col_character(),
  X3 = col_character()
)

Warning: 265 parsing failures.
row col  expected     actual                 file
  2  -- 3 columns 64 columns 'data/wb_gdp_pc.csv'
  3  -- 3 columns 64 columns 'data/wb_gdp_pc.csv'
  4  -- 3 columns 64 columns 'data/wb_gdp_pc.csv'
  5  -- 3 columns 64 columns 'data/wb_gdp_pc.csv'
  6  -- 3 columns 64 columns 'data/wb_gdp_pc.csv'
... ... ......... .......... ....................
See problems(...) for more details.

Pay attention to the Warning: Missing column names filled in: 'X3' [3]

A Warning is not as serious as an Error, but we need to be cautious about it

How to deal with the Warning

Show the first 6 rows of the data

head(wb_gdp)

# A tibble: 6 x 3
  `Data Source`     `World Development Indicators` X3                          
  <chr>             <chr>                          <chr>                       
1 Last Updated Date 2019-03-21                     <NA>                        
2 Country Name      Country Code                   Indicator Name              
3 Aruba             ABW                            GDP per capita (current US$)
4 Afghanistan       AFG                            GDP per capita (current US$)
5 Angola            AGO                            GDP per capita (current US$)
6 Albania           ALB                            GDP per capita (current US$)

We do not need the 1st row (Last Updated Date 2019-03-21…)
Open the file (wb_gdp_pc.csv) in the data folder directly with LibreOffice (or Excel) and check the data

The first 4 lines are highlighted in yellow so that you can easily recognize them

read_csv() automatically treats the first row of the csv file as variable names
We want it to treat the 5th row as variable names
→ We skip the first 4 lines (lines 1 to 4)

wb_gdp <- read_csv("data/wb_gdp_pc.csv", skip = 4)

Warning: Missing column names filled in: 'X64' [64]


─ Column specification ────────────────────────────
cols(
  .default = col_double(),
  `Country Name` = col_character(),
  `Country Code` = col_character(),
  `Indicator Name` = col_character(),
  `Indicator Code` = col_character(),
  `2018` = col_logical(),
  X64 = col_logical()
)
ℹ Use `spec()` for the full column specifications.
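
The effect of skip can be reproduced on a small file. This sketch writes a hypothetical csv file with two lines of notes above the header row (a simplified imitation of the World Bank layout, not the real data) and reads it with skip = 2.

```r
library(readr)

# Hypothetical toy file: two lines of notes, then the header row
tmp <- tempfile(fileext = ".csv")
writeLines(c("Data Source,Toy example",
             "Last Updated Date,2019-03-21",
             "Country Name,1960",
             "Aruba,10",
             "Japan,20"), tmp)

toy <- read_csv(tmp, skip = 2)   # ignore the first 2 lines
names(toy)                       # the 3rd line is now used as variable names
```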

str() enables us to check the variable class

str(wb_gdp)

spec_tbl_df [264 × 64] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Country Name  : chr [1:264] "Aruba" "Afghanistan" "Angola" "Albania" ...
 $ Country Code  : chr [1:264] "ABW" "AFG" "AGO" "ALB" ...
 $ Indicator Name: chr [1:264] "GDP per capita (current US$)" "GDP per capita (current US$)" "GDP per capita (current US$)" "GDP per capita (current US$)" ...
 $ Indicator Code: chr [1:264] "NY.GDP.PCAP.CD" "NY.GDP.PCAP.CD" "NY.GDP.PCAP.CD" "NY.GDP.PCAP.CD" ...
 $ 1960          : num [1:264] NA 59.8 NA NA NA ...
 $ 1961          : num [1:264] NA 59.9 NA NA NA ...
 $ 1962          : num [1:264] NA 58.5 NA NA NA ...
 $ 1963          : num [1:264] NA 78.8 NA NA NA ...
 $ 1964          : num [1:264] NA 82.2 NA NA NA ...
 $ 1965          : num [1:264] NA 101 NA NA NA ...
 $ 1966          : num [1:264] NA 138 NA NA NA ...
 $ 1967          : num [1:264] NA 161 NA NA NA ...
 $ 1968          : num [1:264] NA 130 NA NA NA ...
 $ 1969          : num [1:264] NA 130 NA NA NA ...
 $ 1970          : num [1:264] NA 157 NA NA 3239 ...
 $ 1971          : num [1:264] NA 160 NA NA 3498 ...
 $ 1972          : num [1:264] NA 136 NA NA 4217 ...
 $ 1973          : num [1:264] NA 144 NA NA 5342 ...
 $ 1974          : num [1:264] NA 175 NA NA 6320 ...
 $ 1975          : num [1:264] NA 188 NA NA 7169 ...
 $ 1976          : num [1:264] NA 199 NA NA 7152 ...
 $ 1977          : num [1:264] NA 226 NA NA 7751 ...
 $ 1978          : num [1:264] NA 249 NA NA 9130 ...
 $ 1979          : num [1:264] NA 278 NA NA 11821 ...
 $ 1980          : num [1:264] NA 275 664 NA 12377 ...
 $ 1981          : num [1:264] NA 266 600 NA 10372 ...
 $ 1982          : num [1:264] NA NA 579 NA 9610 ...
 $ 1983          : num [1:264] NA NA 582 NA 8023 ...
 $ 1984          : num [1:264] NA NA 597 639 7729 ...
 $ 1985          : num [1:264] NA NA 712 640 7774 ...
 $ 1986          : num [1:264] 6473 NA 648 694 10362 ...
 $ 1987          : num [1:264] 7886 NA 721 675 12616 ...
 $ 1988          : num [1:264] 9765 NA 762 653 14304 ...
 $ 1989          : num [1:264] 11392 NA 863 698 15166 ...
 $ 1990          : num [1:264] 12307 NA 923 617 18879 ...
 $ 1991          : num [1:264] 13496 NA 845 337 19533 ...
 $ 1992          : num [1:264] 14047 NA 641 201 20548 ...
 $ 1993          : num [1:264] 14937 NA 430 367 16516 ...
 $ 1994          : num [1:264] 16241 NA 321 586 16235 ...
 $ 1995          : num [1:264] 16439 NA 388 751 18461 ...
 $ 1996          : num [1:264] 16586 NA 513 1010 19017 ...
 $ 1997          : num [1:264] 17928 NA 507 717 18353 ...
 $ 1998          : num [1:264] 19078 NA 420 814 18895 ...
 $ 1999          : num [1:264] 19356 NA 386 1033 19262 ...
 $ 2000          : num [1:264] 20621 NA 555 1127 21937 ...
 $ 2001          : num [1:264] 20669 NA 526 1282 22229 ...
 $ 2002          : num [1:264] 20437 184 870 1425 24741 ...
 $ 2003          : num [1:264] 20834 196 979 1846 32776 ...
 $ 2004          : num [1:264] 22570 217 1248 2374 38503 ...
 $ 2005          : num [1:264] 23300 248 1891 2674 41282 ...
 $ 2006          : num [1:264] 24046 269 2585 2973 43749 ...
 $ 2007          : num [1:264] 25836 366 3108 3595 48583 ...
 $ 2008          : num [1:264] 27086 370 4069 4371 47786 ...
 $ 2009          : num [1:264] 24631 444 3118 4114 43339 ...
 $ 2010          : num [1:264] 23513 551 3586 4094 39736 ...
 $ 2011          : num [1:264] 24984 599 4616 4437 41099 ...
 $ 2012          : num [1:264] 24710 649 5102 4248 38391 ...
 $ 2013          : num [1:264] 25018 648 5258 4413 40620 ...
 $ 2014          : num [1:264] 25528 625 5413 4579 42295 ...
 $ 2015          : num [1:264] 25796 590 4171 3953 36038 ...
 $ 2016          : num [1:264] 25252 550 3510 4132 37232 ...
 $ 2017          : num [1:264] 25655 550 4100 4538 39147 ...
 $ 2018          : logi [1:264] NA NA NA NA NA NA ...
 $ X64           : logi [1:264] NA NA NA NA NA NA ...
 - attr(*, "spec")=
  .. cols(
  ..   `Country Name` = col_character(),
  ..   `Country Code` = col_character(),
  ..   `Indicator Name` = col_character(),
  ..   `Indicator Code` = col_character(),
  ..   `1960` = col_double(),
  ..   `1961` = col_double(),
  ..   `1962` = col_double(),
  ..   `1963` = col_double(),
  ..   `1964` = col_double(),
  ..   `1965` = col_double(),
  ..   `1966` = col_double(),
  ..   `1967` = col_double(),
  ..   `1968` = col_double(),
  ..   `1969` = col_double(),
  ..   `1970` = col_double(),
  ..   `1971` = col_double(),
  ..   `1972` = col_double(),
  ..   `1973` = col_double(),
  ..   `1974` = col_double(),
  ..   `1975` = col_double(),
  ..   `1976` = col_double(),
  ..   `1977` = col_double(),
  ..   `1978` = col_double(),
  ..   `1979` = col_double(),
  ..   `1980` = col_double(),
  ..   `1981` = col_double(),
  ..   `1982` = col_double(),
  ..   `1983` = col_double(),
  ..   `1984` = col_double(),
  ..   `1985` = col_double(),
  ..   `1986` = col_double(),
  ..   `1987` = col_double(),
  ..   `1988` = col_double(),
  ..   `1989` = col_double(),
  ..   `1990` = col_double(),
  ..   `1991` = col_double(),
  ..   `1992` = col_double(),
  ..   `1993` = col_double(),
  ..   `1994` = col_double(),
  ..   `1995` = col_double(),
  ..   `1996` = col_double(),
  ..   `1997` = col_double(),
  ..   `1998` = col_double(),
  ..   `1999` = col_double(),
  ..   `2000` = col_double(),
  ..   `2001` = col_double(),
  ..   `2002` = col_double(),
  ..   `2003` = col_double(),
  ..   `2004` = col_double(),
  ..   `2005` = col_double(),
  ..   `2006` = col_double(),
  ..   `2007` = col_double(),
  ..   `2008` = col_double(),
  ..   `2009` = col_double(),
  ..   `2010` = col_double(),
  ..   `2011` = col_double(),
  ..   `2012` = col_double(),
  ..   `2013` = col_double(),
  ..   `2014` = col_double(),
  ..   `2015` = col_double(),
  ..   `2016` = col_double(),
  ..   `2017` = col_double(),
  ..   `2018` = col_logical(),
  ..   X64 = col_logical()
  .. )

RStudio recognizes Country Name and Country Code as character <chr>
RStudio recognizes 1960 as double <dbl>
NA means a missing value
Check the variable names

names(wb_gdp)

 [1] "Country Name"   "Country Code"   "Indicator Name" "Indicator Code"
 [5] "1960"           "1961"           "1962"           "1963"          
 [9] "1964"           "1965"           "1966"           "1967"          
[13] "1968"           "1969"           "1970"           "1971"          
[17] "1972"           "1973"           "1974"           "1975"          
[21] "1976"           "1977"           "1978"           "1979"          
[25] "1980"           "1981"           "1982"           "1983"          
[29] "1984"           "1985"           "1986"           "1987"          
[33] "1988"           "1989"           "1990"           "1991"          
[37] "1992"           "1993"           "1994"           "1995"          
[41] "1996"           "1997"           "1998"           "1999"          
[45] "2000"           "2001"           "2002"           "2003"          
[49] "2004"           "2005"           "2006"           "2007"          
[53] "2008"           "2009"           "2010"           "2011"          
[57] "2012"           "2013"           "2014"           "2015"          
[61] "2016"           "2017"           "2018"           "X64"

Select the variables we use
- We need to solve, one by one, the problems that prevent us from conducting quantitative analysis
- Since we do not know what X64 is, check it

wb_gdp$X64

  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
 [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
 [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
 [76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[101] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[126] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[151] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[176] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[201] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[226] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[251] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
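
Instead of reading through the printed values, you can ask R directly whether a column holds nothing but missing values. The sketch below uses hypothetical toy data with base R functions only.

```r
# Hypothetical toy data: column X3 is entirely missing
toy <- data.frame(country = c("A", "B", "C"),
                  gdp = c(1.5, NA, 3.2),
                  X3 = c(NA, NA, NA))

all(is.na(toy$X3))    # TRUE: nothing but NA
all(is.na(toy$gdp))   # FALSE: only some values are missing

# Number of missing values in every column
colSums(is.na(toy))
```

The same check could be applied to a column such as wb_gdp$X64.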

We need to delete it because it contains nothing but missing values (= NA)

→ The following are the variables we need

gdp <- wb_gdp %>% 
  select("Country Name", 
         "1960":"2018")

names(gdp)

 [1] "Country Name" "1960"         "1961"         "1962"         "1963"        
 [6] "1964"         "1965"         "1966"         "1967"         "1968"        
[11] "1969"         "1970"         "1971"         "1972"         "1973"        
[16] "1974"         "1975"         "1976"         "1977"         "1978"        
[21] "1979"         "1980"         "1981"         "1982"         "1983"        
[26] "1984"         "1985"         "1986"         "1987"         "1988"        
[31] "1989"         "1990"         "1991"         "1992"         "1993"        
[36] "1994"         "1995"         "1996"         "1997"         "1998"        
[41] "1999"         "2000"         "2001"         "2002"         "2003"        
[46] "2004"         "2005"         "2006"         "2007"         "2008"        
[51] "2009"         "2010"         "2011"         "2012"         "2013"        
[56] "2014"         "2015"         "2016"         "2017"         "2018"

Fix the name of variables

Country Name => country

gdp <- gdp %>% 
  rename(country = "Country Name")

names(gdp)

 [1] "country" "1960"    "1961"    "1962"    "1963"    "1964"    "1965"   
 [8] "1966"    "1967"    "1968"    "1969"    "1970"    "1971"    "1972"   
[15] "1973"    "1974"    "1975"    "1976"    "1977"    "1978"    "1979"   
[22] "1980"    "1981"    "1982"    "1983"    "1984"    "1985"    "1986"   
[29] "1987"    "1988"    "1989"    "1990"    "1991"    "1992"    "1993"   
[36] "1994"    "1995"    "1996"    "1997"    "1998"    "1999"    "2000"   
[43] "2001"    "2002"    "2003"    "2004"    "2005"    "2006"    "2007"   
[50] "2008"    "2009"    "2010"    "2011"    "2012"    "2013"    "2014"   
[57] "2015"    "2016"    "2017"    "2018"

Check the sample size and the number of variables of gdp

dim(gdp)

[1] 264  60

The sample size (N) of gdp is 264
The number of variables is 60
Using the DT::datatable() function, we can see what the entire data set looks like

DT::datatable(gdp)

5.2 Converting data (Wide → Long form) : `gdp`

We need to convert the data from wide format to long format

Converting data from wide form to long form

Using the tidyr::pivot_longer() function, we convert the data from wide to long format
→ Name the data frame gdp_long

gdp_long <- gdp %>% 
  tidyr::pivot_longer("1960":"2018", # Range of variables you want to convert
                      names_to = "year", # Put the wide-format variable names (1960 to 2018) into year
                      values_to = "GDP") %>% # Put the wide-format values into GDP
  drop_na()                   # Drop rows with missing values
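
Because the real data is large, the same conversion is easier to follow on a tiny example. The sketch below uses hypothetical toy data (two countries, two years, one missing value) and the same tidyr functions.

```r
library(tidyr)

# Hypothetical wide-format toy data (values are made up for illustration)
toy_wide <- data.frame(country = c("A", "B"),
                       `2000` = c(1.0, NA),
                       `2001` = c(1.5, 2.5),
                       check.names = FALSE)

toy_long <- pivot_longer(toy_wide,
                         cols = c("2000", "2001"),  # columns to stack
                         names_to = "year",         # old column names go here
                         values_to = "GDP")         # old cell values go here
toy_long <- drop_na(toy_long)   # removes the row for country B in 2000

toy_long
```

Two rows and two year columns give four country-year rows, and drop_na() leaves three of them.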

Check gdp_long

DT::datatable(gdp_long)

Check the class of the variables in gdp_long

str(gdp_long)

tibble [11,824 × 3] (S3: tbl_df/tbl/data.frame)
 $ country: chr [1:11824] "Aruba" "Aruba" "Aruba" "Aruba" ...
 $ year   : chr [1:11824] "1986" "1987" "1988" "1989" ...
 $ GDP    : num [1:11824] 6473 7886 9765 11392 12307 ...

Convert the class of year from character to numeric

gdp_long$year <- as.numeric(gdp_long$year)

str(gdp_long)

tibble [11,824 × 3] (S3: tbl_df/tbl/data.frame)
 $ country: chr [1:11824] "Aruba" "Aruba" "Aruba" "Aruba" ...
 $ year   : num [1:11824] 1986 1987 1988 1989 1990 ...
 $ GDP    : num [1:11824] 6473 7886 9765 11392 12307 ...
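
As an alternative to converting year afterwards, pivot_longer() can convert the names while reshaping through its names_transform argument. This is a sketch on hypothetical toy data, assuming tidyr 1.1.0 or later.

```r
library(tidyr)

# Hypothetical wide-format toy data (values are made up for illustration)
toy_wide <- data.frame(country = c("A", "B"),
                       `2000` = c(1.0, 2.0),
                       `2001` = c(1.5, 2.5),
                       check.names = FALSE)

toy_long <- pivot_longer(toy_wide,
                         cols = c("2000", "2001"),
                         names_to = "year",
                         names_transform = list(year = as.integer),  # convert while reshaping
                         values_to = "GDP")
class(toy_long$year)
```

Either approach gives a numeric year, which is what plotting functions expect on a continuous axis.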

5.3 Data Visualization: `GDP`

A long-format dataset enables us to conduct a variety of analyses
For instance, let us visualize the transition of GDP per capita (1980-2017) in Japan and China
Using the filter() function, extract the data needed and name it jpn.chi

jpn.chi <- gdp_long %>% 
  filter(country == "Japan" | country == "China")
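
When more countries are involved, chaining conditions with | becomes long. A common alternative is the %in% operator, sketched here on hypothetical toy data.

```r
library(dplyr)

# Hypothetical toy data (values invented for illustration)
toy <- data.frame(country = c("Japan", "China", "France", "Japan"),
                  year = c(1980, 1980, 1980, 1981))

with_or <- toy %>% filter(country == "Japan" | country == "China")
with_in <- toy %>% filter(country %in% c("Japan", "China"))

nrow(with_or)
nrow(with_in)   # same rows as with_or
```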

・Add the following command to avoid garbled text when drawing figures that contain Japanese with the ggplot() function (HiraginoSans-W3 is a font bundled with macOS, so choose another font on other systems)

theme_set(theme_classic(base_size = 10,
                        base_family = "HiraginoSans-W3"))

jpn.chi %>% 
  ggplot(aes(x = year, y = GDP,
             color = country, 
             linetype = country, 
             shape = country)) +
  geom_point() +
  geom_line() +
  ggtitle("Transition of GDP Per Capita (1980-2017) between Japan and China") +
  labs(x = "Year", y = "GDP per capita (US$)") +
  theme(legend.position = c(0.1, 0.8)) +
  xlim(1980, 2017) # Show 1980-2017 only (drops the data outside this range, including 2018)

6. Data Cleaning of `Freedom House`

6.1 Read the `Freedom House` data

Freedom House data (1972-2016)
Data on democracy by country
PR: political rights
CL: civil liberties
Status

Variables	Variable Class	Details
`PR`	numeric	political rights (Best = 1, Worst = 7)
`CL`	numeric	civil liberties (Best = 1, Worst = 7)
`status`	categorical	F: free, PF: partly free, NF: not free
`year`	categorical	1972-2016

PR and CL are measured on a one-to-seven scale, with one representing the highest degree of freedom and seven the lowest.
Load the readxl package to read Excel files

library(readxl)

Download the Freedom House data and read it
Before reading the data, open the original file (FH_Country.xls) in either LibreOffice or Excel

You can see three tabs at the bottom of the screen
We want to read the second tab (Country Ratings, Statuses)
Set sheet = 2

Specify the sheet number and the rows to skip

Check the 2nd tab

We do not need the 1st row and the 2nd row (yellow parts)
→ Set skip = 2

fh <- read_excel("data/FH_Country.xls", 
                 sheet = 2,
                 skip = 2)

Check the class of each variable

str(fh)

tibble [205 × 133] (S3: tbl_df/tbl/data.frame)
 $ ...1        : chr [1:205] "Afghanistan" "Albania" "Algeria" "Andorra" ...
 $ PR...2      : chr [1:205] "4" "7" "6" "4" ...
 $ CL...3      : chr [1:205] "5" "7" "6" "3" ...
 $ Status...4  : chr [1:205] "PF" "NF" "NF" "PF" ...
 $ PR...5      : chr [1:205] "7" "7" "6" "4" ...
 $ CL...6      : chr [1:205] "6" "7" "6" "4" ...
 $ Status...7  : chr [1:205] "NF" "NF" "NF" "PF" ...
 $ PR...8      : chr [1:205] "7" "7" "6" "4" ...
 $ CL...9      : chr [1:205] "6" "7" "6" "4" ...
 $ Status...10 : chr [1:205] "NF" "NF" "NF" "PF" ...
 $ PR...11     : chr [1:205] "7" "7" "7" "4" ...
 $ CL...12     : chr [1:205] "6" "7" "6" "4" ...
 $ Status...13 : chr [1:205] "NF" "NF" "NF" "PF" ...
 $ PR...14     : chr [1:205] "7" "7" "6" "4" ...
 $ CL...15     : chr [1:205] "6" "7" "6" "4" ...
 $ Status...16 : chr [1:205] "NF" "NF" "NF" "PF" ...
 $ PR...17     : chr [1:205] "6" "7" "6" "-" ...
 $ CL...18     : chr [1:205] "6" "7" "6" "-" ...
 $ Status...19 : chr [1:205] "NF" "NF" "NF" "-" ...
 $ PR...20     : chr [1:205] "7" "7" "6" "-" ...
 $ CL...21     : chr [1:205] "7" "7" "6" "-" ...
 $ Status...22 : chr [1:205] "NF" "NF" "NF" "-" ...
 $ PR...23     : chr [1:205] "7" "7" "6" "-" ...
 $ CL...24     : chr [1:205] "7" "7" "6" "-" ...
 $ Status...25 : chr [1:205] "NF" "NF" "NF" "-" ...
 $ PR...26     : chr [1:205] "7" "7" "6" "-" ...
 $ CL...27     : chr [1:205] "7" "7" "6" "-" ...
 $ Status...28 : chr [1:205] "NF" "NF" "NF" "-" ...
 $ PR...29     : chr [1:205] "7" "7" "6" "-" ...
 $ CL...30     : chr [1:205] "7" "7" "6" "-" ...
 $ Status...31 : chr [1:205] "NF" "NF" "NF" "-" ...
 $ PR...32     : chr [1:205] "7" "7" "6" "-" ...
 $ CL...33     : chr [1:205] "7" "7" "6" "-" ...
 $ Status...34 : chr [1:205] "NF" "NF" "NF" "-" ...
 $ PR...35     : chr [1:205] "7" "7" "6" "-" ...
 $ CL...36     : chr [1:205] "7" "7" "6" "-" ...
 $ Status...37 : chr [1:205] "NF" "NF" "NF" "-" ...
 $ PR...38     : chr [1:205] "7" "7" "6" "-" ...
 $ CL...39     : chr [1:205] "7" "7" "6" "-" ...
 $ Status...40 : chr [1:205] "NF" "NF" "NF" "-" ...
 $ PR...41     : chr [1:205] "7" "7" "6" "-" ...
 $ CL...42     : chr [1:205] "7" "7" "6" "-" ...
 $ Status...43 : chr [1:205] "NF" "NF" "NF" "-" ...
 $ PR...44     : chr [1:205] "7" "7" "6" "-" ...
 $ CL...45     : chr [1:205] "7" "7" "6" "-" ...
 $ Status...46 : chr [1:205] "NF" "NF" "NF" "-" ...
 $ PR...47     : chr [1:205] "6" "7" "5" "-" ...
 $ CL...48     : chr [1:205] "6" "7" "6" "-" ...
 $ Status...49 : chr [1:205] "NF" "NF" "NF" "-" ...
 $ PR...50     : chr [1:205] "7" "7" "6" "-" ...
 $ CL...51     : chr [1:205] "7" "7" "4" "-" ...
 $ Status...52 : chr [1:205] "NF" "NF" "PF" "-" ...
 $ PR...53     : chr [1:205] "7" "7" "4" "-" ...
 $ CL...54     : chr [1:205] "7" "6" "4" "-" ...
 $ Status...55 : chr [1:205] "NF" "NF" "PF" "-" ...
 $ PR...56     : chr [1:205] "7" "4" "4" "-" ...
 $ CL...57     : chr [1:205] "7" "4" "4" "-" ...
 $ Status...58 : chr [1:205] "NF" "PF" "PF" "-" ...
 $ PR...59     : chr [1:205] "6" "4" "7" "-" ...
 $ CL...60     : chr [1:205] "6" "3" "6" "-" ...
 $ Status...61 : chr [1:205] "NF" "PF" "NF" "-" ...
 $ PR...62     : chr [1:205] "7" "2" "7" "2" ...
 $ CL...63     : chr [1:205] "7" "4" "6" "1" ...
 $ Status...64 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...65     : chr [1:205] "7" "3" "7" "1" ...
 $ CL...66     : chr [1:205] "7" "4" "7" "1" ...
 $ Status...67 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...68     : chr [1:205] "7" "3" "6" "1" ...
 $ CL...69     : chr [1:205] "7" "4" "6" "1" ...
 $ Status...70 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...71     : chr [1:205] "7" "4" "6" "1" ...
 $ CL...72     : chr [1:205] "7" "4" "6" "1" ...
 $ Status...73 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...74     : chr [1:205] "7" "4" "6" "1" ...
 $ CL...75     : chr [1:205] "7" "4" "6" "1" ...
 $ Status...76 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...77     : chr [1:205] "7" "4" "6" "1" ...
 $ CL...78     : chr [1:205] "7" "5" "5" "1" ...
 $ Status...79 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...80     : chr [1:205] "7" "4" "6" "1" ...
 $ CL...81     : chr [1:205] "7" "5" "5" "1" ...
 $ Status...82 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...83     : chr [1:205] "7" "4" "6" "1" ...
 $ CL...84     : chr [1:205] "7" "5" "5" "1" ...
 $ Status...85 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...86     : chr [1:205] "7" "3" "6" "1" ...
 $ CL...87     : chr [1:205] "7" "4" "5" "1" ...
 $ Status...88 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...89     : chr [1:205] "6" "3" "6" "1" ...
 $ CL...90     : chr [1:205] "6" "3" "5" "1" ...
 $ Status...91 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...92     : chr [1:205] "6" "3" "6" "1" ...
 $ CL...93     : chr [1:205] "6" "3" "5" "1" ...
 $ Status...94 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...95     : chr [1:205] "5" "3" "6" "1" ...
 $ CL...96     : chr [1:205] "6" "3" "5" "1" ...
 $ Status...97 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...98     : chr [1:205] "5" "3" "6" "1" ...
 $ CL...99     : chr [1:205] "5" "3" "5" "1" ...
  [list output truncated]

We see that all the variables are of class chr (= character)
It is understandable for the country name and Status to be chr (= character)
But it does not make sense for PR (political rights) and CL (civil liberties) to be chr (= character)
→ They should be numeric → This should be fixed

Solution:
- You can see - in some cells of the spreadsheet
- This means a missing value in the Freedom House data set
- By default, read_excel() treats only a blank cell as a missing value, so - is read as a character
→ We need to tell R that "-" means a missing value
→ Add the following argument: na = "-"

fh <- read_excel("data/FH_Country.xls", 
                 sheet = 2, 
                 skip = 2,
                 na = "-")  # treat "-" as a missing value

str(fh)

tibble [205 × 133] (S3: tbl_df/tbl/data.frame)
 $ ...1        : chr [1:205] "Afghanistan" "Albania" "Algeria" "Andorra" ...
 $ PR...2      : chr [1:205] "4" "7" "6" "4" ...
 $ CL...3      : chr [1:205] "5" "7" "6" "3" ...
 $ Status...4  : chr [1:205] "PF" "NF" "NF" "PF" ...
 $ PR...5      : num [1:205] 7 7 6 4 NA NA 2 NA 1 1 ...
 $ CL...6      : num [1:205] 6 7 6 4 NA NA 2 NA 1 1 ...
 $ Status...7  : chr [1:205] "NF" "NF" "NF" "PF" ...
 $ PR...8      : num [1:205] 7 7 6 4 NA NA 2 NA 1 1 ...
 $ CL...9      : num [1:205] 6 7 6 4 NA NA 4 NA 1 1 ...
 $ Status...10 : chr [1:205] "NF" "NF" "NF" "PF" ...
 $ PR...11     : num [1:205] 7 7 7 4 6 NA 2 NA 1 1 ...
 $ CL...12     : num [1:205] 6 7 6 4 6 NA 4 NA 1 1 ...
 $ Status...13 : chr [1:205] "NF" "NF" "NF" "PF" ...
 $ PR...14     : num [1:205] 7 7 6 4 6 NA 6 NA 1 1 ...
 $ CL...15     : num [1:205] 6 7 6 4 6 NA 5 NA 1 1 ...
 $ Status...16 : chr [1:205] "NF" "NF" "NF" "PF" ...
 $ PR...17     : num [1:205] 6 7 6 NA 7 NA 6 NA 1 1 ...
 $ CL...18     : num [1:205] 6 7 6 NA 7 NA 6 NA 1 1 ...
 $ Status...19 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...20     : num [1:205] 7 7 6 NA 7 NA 6 NA 1 1 ...
 $ CL...21     : num [1:205] 7 7 6 NA 7 NA 5 NA 1 1 ...
 $ Status...22 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...23     : num [1:205] 7 7 6 NA 7 NA 6 NA 1 1 ...
 $ CL...24     : num [1:205] 7 7 6 NA 7 NA 5 NA 1 1 ...
 $ Status...25 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...26     : num [1:205] 7 7 6 NA 7 NA 6 NA 1 1 ...
 $ CL...27     : num [1:205] 7 7 6 NA 7 NA 5 NA 1 1 ...
 $ Status...28 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...29     : num [1:205] 7 7 6 NA 7 2 6 NA 1 1 ...
 $ CL...30     : num [1:205] 7 7 6 NA 7 2 5 NA 1 1 ...
 $ Status...31 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...32     : num [1:205] 7 7 6 NA 7 2 3 NA 1 1 ...
 $ CL...33     : num [1:205] 7 7 6 NA 7 3 3 NA 1 1 ...
 $ Status...34 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...35     : num [1:205] 7 7 6 NA 7 2 2 NA 1 1 ...
 $ CL...36     : num [1:205] 7 7 6 NA 7 3 2 NA 1 1 ...
 $ Status...37 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...38     : num [1:205] 7 7 6 NA 7 2 2 NA 1 1 ...
 $ CL...39     : num [1:205] 7 7 6 NA 7 3 2 NA 1 1 ...
 $ Status...40 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...41     : num [1:205] 7 7 6 NA 7 2 2 NA 1 1 ...
 $ CL...42     : num [1:205] 7 7 6 NA 7 3 1 NA 1 1 ...
 $ Status...43 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...44     : num [1:205] 7 7 6 NA 7 2 2 NA 1 1 ...
 $ CL...45     : num [1:205] 7 7 6 NA 7 3 1 NA 1 1 ...
 $ Status...46 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...47     : num [1:205] 6 7 5 NA 7 2 2 NA 1 1 ...
 $ CL...48     : num [1:205] 6 7 6 NA 7 3 1 NA 1 1 ...
 $ Status...49 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...50     : num [1:205] 7 7 6 NA 7 2 1 NA 1 1 ...
 $ CL...51     : num [1:205] 7 7 4 NA 7 3 2 NA 1 1 ...
 $ Status...52 : chr [1:205] "NF" "NF" "PF" NA ...
 $ PR...53     : num [1:205] 7 7 4 NA 7 3 1 NA 1 1 ...
 $ CL...54     : num [1:205] 7 6 4 NA 7 2 3 NA 1 1 ...
 $ Status...55 : chr [1:205] "NF" "NF" "PF" NA ...
 $ PR...56     : num [1:205] 7 4 4 NA 6 3 1 5 1 1 ...
 $ CL...57     : num [1:205] 7 4 4 NA 4 3 3 5 1 1 ...
 $ Status...58 : chr [1:205] "NF" "PF" "PF" NA ...
 $ PR...59     : num [1:205] 6 4 7 NA 6 3 2 4 1 1 ...
 $ CL...60     : num [1:205] 6 3 6 NA 6 3 3 3 1 1 ...
 $ Status...61 : chr [1:205] "NF" "PF" "NF" NA ...
 $ PR...62     : num [1:205] 7 2 7 2 7 4 2 3 1 1 ...
 $ CL...63     : num [1:205] 7 4 6 1 7 3 3 4 1 1 ...
 $ Status...64 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...65     : num [1:205] 7 3 7 1 7 4 2 3 1 1 ...
 $ CL...66     : num [1:205] 7 4 7 1 7 3 3 4 1 1 ...
 $ Status...67 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...68     : num [1:205] 7 3 6 1 6 4 2 4 1 1 ...
 $ CL...69     : num [1:205] 7 4 6 1 6 3 3 4 1 1 ...
 $ Status...70 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...71     : num [1:205] 7 4 6 1 6 4 2 5 1 1 ...
 $ CL...72     : num [1:205] 7 4 6 1 6 3 3 4 1 1 ...
 $ Status...73 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...74     : num [1:205] 7 4 6 1 6 4 2 5 1 1 ...
 $ CL...75     : num [1:205] 7 4 6 1 6 3 3 4 1 1 ...
 $ Status...76 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...77     : num [1:205] 7 4 6 1 6 4 3 4 1 1 ...
 $ CL...78     : num [1:205] 7 5 5 1 6 3 3 4 1 1 ...
 $ Status...79 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...80     : num [1:205] 7 4 6 1 6 4 2 4 1 1 ...
 $ CL...81     : num [1:205] 7 5 5 1 6 3 3 4 1 1 ...
 $ Status...82 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...83     : num [1:205] 7 4 6 1 6 4 1 4 1 1 ...
 $ CL...84     : num [1:205] 7 5 5 1 6 2 2 4 1 1 ...
 $ Status...85 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...86     : num [1:205] 7 3 6 1 6 4 3 4 1 1 ...
 $ CL...87     : num [1:205] 7 4 5 1 6 2 3 4 1 1 ...
 $ Status...88 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...89     : num [1:205] 6 3 6 1 6 4 3 4 1 1 ...
 $ CL...90     : num [1:205] 6 3 5 1 5 2 3 4 1 1 ...
 $ Status...91 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...92     : num [1:205] 6 3 6 1 6 4 2 4 1 1 ...
 $ CL...93     : num [1:205] 6 3 5 1 5 2 2 4 1 1 ...
 $ Status...94 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...95     : num [1:205] 5 3 6 1 6 2 2 5 1 1 ...
 $ CL...96     : num [1:205] 6 3 5 1 5 2 2 4 1 1 ...
 $ Status...97 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...98     : num [1:205] 5 3 6 1 6 2 2 5 1 1 ...
 $ CL...99     : num [1:205] 5 3 5 1 5 2 2 4 1 1 ...
  [list output truncated]
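
The effect of declaring a missing-value marker can be reproduced on a tiny made-up table. The sketch below uses base R read.csv() and its na.strings argument as a stand-in for read_excel() and na; the three rows are invented for illustration and are not taken from the Freedom House file.

```r
# Made-up CSV text: "-" marks a missing score
csv_text <- "country,pr
Afghanistan,7
Andorra,-
Australia,1"

# Without a missing-value marker, "-" is ordinary text,
# so the whole pr column is read as character
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(raw$pr)

# Declaring "-" as missing lets R read the column as numbers
clean <- read.csv(text = csv_text, na.strings = "-", stringsAsFactors = FALSE)
class(clean$pr)
is.na(clean$pr)
```

The same idea applies to read_excel(): once "-" is declared as missing, the remaining values in a column decide its class.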

In fh, all PR and CL variables except PR...2 and CL...3 are now recognized as numeric
PR...2 and CL...3 are still recognized as character
→ This should be fixed
We need to find out why these two variables (PR...2 and CL...3) were not changed to numeric
→ Then we need to change the class of these two variables from character to numeric
Use the unique() function to check the values of PR...2

unique(fh$PR...2)

[1] "4"    "7"    "6"    NA     "1"    "2"    "5"    "3"    "2(5)"

2(5) is included!
2(5) is not a number but a character string
The values of a character variable are shown with ""
Since 2(5) is not a number, it is shown with ""
→ NA is an exception in R
→ NA is shown without "" because it is a missing value, not a character string
A variable can hold only one class, so a single value like 2(5) makes the whole of PR...2 a character variable
→ This is the reason!
Use the unique() function to check the values of CL...3

unique(fh$CL...3)

[1] "5"    "7"    "6"    "3"    NA     "1"    "4"    "2"    "3(6)"

3(6) is included!
3(6) is not a number but a character string
The values of a character variable are shown with ""
Since 3(6) is not a number, it is shown with ""
→ NA is an exception in R
→ NA is shown without "" because it is a missing value, not a character string
A variable can hold only one class, so a single value like 3(6) makes the whole of CL...3 a character variable
→ This is the reason!
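
The reasoning above can be checked on a small made-up vector. This is only a sketch in base R: it shows that one non-numeric entry turns the whole vector into character, and one possible way to list the entries that block the conversion (the names mixed, x, as_num and bad are chosen here for illustration).

```r
# One text entry is enough to make the whole vector character
mixed <- c(4, 7, "2(5)")
class(mixed)

# A character vector similar to PR...2 (made-up values)
x <- c("4", "7", NA, "2(5)")

# Try the conversion; entries that are not numbers become NA
as_num <- suppressWarnings(as.numeric(x))

# Entries that were NOT missing before but ARE missing after the conversion
# are the ones that need cleaning
bad <- unique(x[is.na(as_num) & !is.na(x)])
bad
```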

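Solution:

Using the if_else() function, replace 2(5) and 3(6) with NA, then convert both variables to numeric
Name the new data frame fh_na

fh_na <- fh %>% 
  dplyr::mutate(
    PR...2 = if_else(PR...2 == "2(5)", NA_character_, PR...2),  # NA_character_ is a real missing value, unlike the string "NA"
    CL...3 = if_else(CL...3 == "3(6)", NA_character_, CL...3)) %>% 
  mutate(across(c(PR...2, CL...3), as.numeric))                 # character → numeric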

Use the unique() function to check the values of PR...2

unique(fh_na$PR...2)

[1]  4  7  6 NA  1  2  5  3

NA is not shown with ""
→ NA is recognized as a missing value
Use the unique() function to check the values of CL...3

unique(fh_na$CL...3)

[1]  5  7  6  3 NA  1  4  2

NA is not shown with ""
→ NA is recognized as a missing value
Use the str() function to check the class of PR...2 and CL...3

str(fh_na$PR...2)

 num [1:205] 4 7 6 4 NA NA 6 NA 1 1 ...

str(fh_na$CL...3)

 num [1:205] 5 7 6 3 NA NA 3 NA 1 1 ...

Both PR...2 and CL...3 are recognized as numeric
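
A note on the string "NA" versus a real missing value, with a shorter alternative. The sketch below uses a made-up data frame (toy) and dplyr's na_if(), which replaces a given value with NA; it is one possible way to do the same cleaning, not the only one.

```r
library(dplyr)

# The text "NA" is an ordinary string; NA_character_ is a real missing value
is.na("NA")
is.na(NA_character_)

# Made-up column that mimics PR...2
toy <- data.frame(pr = c("4", "2(5)", NA, "1"), stringsAsFactors = FALSE)

# na_if() turns "2(5)" into a missing value, then as.numeric() converts the rest
toy_clean <- toy %>%
  mutate(pr = as.numeric(na_if(pr, "2(5)")))

toy_clean$pr
```

Because the unwanted value becomes a real missing value before the conversion, as.numeric() has nothing left that it cannot convert.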

6.2 Converting data (Wide → Long form) :`fh_na`

Check the data we use

str(fh_na)

tibble [205 × 133] (S3: tbl_df/tbl/data.frame)
 $ ...1        : chr [1:205] "Afghanistan" "Albania" "Algeria" "Andorra" ...
 $ PR...2      : num [1:205] 4 7 6 4 NA NA 6 NA 1 1 ...
 $ CL...3      : num [1:205] 5 7 6 3 NA NA 3 NA 1 1 ...
 $ Status...4  : chr [1:205] "PF" "NF" "NF" "PF" ...
 $ PR...5      : num [1:205] 7 7 6 4 NA NA 2 NA 1 1 ...
 $ CL...6      : num [1:205] 6 7 6 4 NA NA 2 NA 1 1 ...
 $ Status...7  : chr [1:205] "NF" "NF" "NF" "PF" ...
 $ PR...8      : num [1:205] 7 7 6 4 NA NA 2 NA 1 1 ...
 $ CL...9      : num [1:205] 6 7 6 4 NA NA 4 NA 1 1 ...
 $ Status...10 : chr [1:205] "NF" "NF" "NF" "PF" ...
 $ PR...11     : num [1:205] 7 7 7 4 6 NA 2 NA 1 1 ...
 $ CL...12     : num [1:205] 6 7 6 4 6 NA 4 NA 1 1 ...
 $ Status...13 : chr [1:205] "NF" "NF" "NF" "PF" ...
 $ PR...14     : num [1:205] 7 7 6 4 6 NA 6 NA 1 1 ...
 $ CL...15     : num [1:205] 6 7 6 4 6 NA 5 NA 1 1 ...
 $ Status...16 : chr [1:205] "NF" "NF" "NF" "PF" ...
 $ PR...17     : num [1:205] 6 7 6 NA 7 NA 6 NA 1 1 ...
 $ CL...18     : num [1:205] 6 7 6 NA 7 NA 6 NA 1 1 ...
 $ Status...19 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...20     : num [1:205] 7 7 6 NA 7 NA 6 NA 1 1 ...
 $ CL...21     : num [1:205] 7 7 6 NA 7 NA 5 NA 1 1 ...
 $ Status...22 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...23     : num [1:205] 7 7 6 NA 7 NA 6 NA 1 1 ...
 $ CL...24     : num [1:205] 7 7 6 NA 7 NA 5 NA 1 1 ...
 $ Status...25 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...26     : num [1:205] 7 7 6 NA 7 NA 6 NA 1 1 ...
 $ CL...27     : num [1:205] 7 7 6 NA 7 NA 5 NA 1 1 ...
 $ Status...28 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...29     : num [1:205] 7 7 6 NA 7 2 6 NA 1 1 ...
 $ CL...30     : num [1:205] 7 7 6 NA 7 2 5 NA 1 1 ...
 $ Status...31 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...32     : num [1:205] 7 7 6 NA 7 2 3 NA 1 1 ...
 $ CL...33     : num [1:205] 7 7 6 NA 7 3 3 NA 1 1 ...
 $ Status...34 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...35     : num [1:205] 7 7 6 NA 7 2 2 NA 1 1 ...
 $ CL...36     : num [1:205] 7 7 6 NA 7 3 2 NA 1 1 ...
 $ Status...37 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...38     : num [1:205] 7 7 6 NA 7 2 2 NA 1 1 ...
 $ CL...39     : num [1:205] 7 7 6 NA 7 3 2 NA 1 1 ...
 $ Status...40 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...41     : num [1:205] 7 7 6 NA 7 2 2 NA 1 1 ...
 $ CL...42     : num [1:205] 7 7 6 NA 7 3 1 NA 1 1 ...
 $ Status...43 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...44     : num [1:205] 7 7 6 NA 7 2 2 NA 1 1 ...
 $ CL...45     : num [1:205] 7 7 6 NA 7 3 1 NA 1 1 ...
 $ Status...46 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...47     : num [1:205] 6 7 5 NA 7 2 2 NA 1 1 ...
 $ CL...48     : num [1:205] 6 7 6 NA 7 3 1 NA 1 1 ...
 $ Status...49 : chr [1:205] "NF" "NF" "NF" NA ...
 $ PR...50     : num [1:205] 7 7 6 NA 7 2 1 NA 1 1 ...
 $ CL...51     : num [1:205] 7 7 4 NA 7 3 2 NA 1 1 ...
 $ Status...52 : chr [1:205] "NF" "NF" "PF" NA ...
 $ PR...53     : num [1:205] 7 7 4 NA 7 3 1 NA 1 1 ...
 $ CL...54     : num [1:205] 7 6 4 NA 7 2 3 NA 1 1 ...
 $ Status...55 : chr [1:205] "NF" "NF" "PF" NA ...
 $ PR...56     : num [1:205] 7 4 4 NA 6 3 1 5 1 1 ...
 $ CL...57     : num [1:205] 7 4 4 NA 4 3 3 5 1 1 ...
 $ Status...58 : chr [1:205] "NF" "PF" "PF" NA ...
 $ PR...59     : num [1:205] 6 4 7 NA 6 3 2 4 1 1 ...
 $ CL...60     : num [1:205] 6 3 6 NA 6 3 3 3 1 1 ...
 $ Status...61 : chr [1:205] "NF" "PF" "NF" NA ...
 $ PR...62     : num [1:205] 7 2 7 2 7 4 2 3 1 1 ...
 $ CL...63     : num [1:205] 7 4 6 1 7 3 3 4 1 1 ...
 $ Status...64 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...65     : num [1:205] 7 3 7 1 7 4 2 3 1 1 ...
 $ CL...66     : num [1:205] 7 4 7 1 7 3 3 4 1 1 ...
 $ Status...67 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...68     : num [1:205] 7 3 6 1 6 4 2 4 1 1 ...
 $ CL...69     : num [1:205] 7 4 6 1 6 3 3 4 1 1 ...
 $ Status...70 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...71     : num [1:205] 7 4 6 1 6 4 2 5 1 1 ...
 $ CL...72     : num [1:205] 7 4 6 1 6 3 3 4 1 1 ...
 $ Status...73 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...74     : num [1:205] 7 4 6 1 6 4 2 5 1 1 ...
 $ CL...75     : num [1:205] 7 4 6 1 6 3 3 4 1 1 ...
 $ Status...76 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...77     : num [1:205] 7 4 6 1 6 4 3 4 1 1 ...
 $ CL...78     : num [1:205] 7 5 5 1 6 3 3 4 1 1 ...
 $ Status...79 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...80     : num [1:205] 7 4 6 1 6 4 2 4 1 1 ...
 $ CL...81     : num [1:205] 7 5 5 1 6 3 3 4 1 1 ...
 $ Status...82 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...83     : num [1:205] 7 4 6 1 6 4 1 4 1 1 ...
 $ CL...84     : num [1:205] 7 5 5 1 6 2 2 4 1 1 ...
 $ Status...85 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...86     : num [1:205] 7 3 6 1 6 4 3 4 1 1 ...
 $ CL...87     : num [1:205] 7 4 5 1 6 2 3 4 1 1 ...
 $ Status...88 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...89     : num [1:205] 6 3 6 1 6 4 3 4 1 1 ...
 $ CL...90     : num [1:205] 6 3 5 1 5 2 3 4 1 1 ...
 $ Status...91 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...92     : num [1:205] 6 3 6 1 6 4 2 4 1 1 ...
 $ CL...93     : num [1:205] 6 3 5 1 5 2 2 4 1 1 ...
 $ Status...94 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...95     : num [1:205] 5 3 6 1 6 2 2 5 1 1 ...
 $ CL...96     : num [1:205] 6 3 5 1 5 2 2 4 1 1 ...
 $ Status...97 : chr [1:205] "NF" "PF" "NF" "F" ...
 $ PR...98     : num [1:205] 5 3 6 1 6 2 2 5 1 1 ...
 $ CL...99     : num [1:205] 5 3 5 1 5 2 2 4 1 1 ...
  [list output truncated]

dim(fh_na)

[1] 205 133

We have 205 rows (countries) and 133 variables in fh_na
Since this data is in wide form, we need to change it to long form
Compared to the gdp data we cleaned in the previous section, this data needs a bit more work

A bit more work to do

We have three variables per country and per year (PR, CL, Status)
■PR...2, CL...3, Status...4 are the data for 1972
■PR...5, CL...6, Status...7 are the data for 1973
・・・・・・・・・・・・・・・・・・・・・・・・
■PR...128, CL...129, Status...130 are the data for 2015
■PR...131, CL...132, Status...133 are the data for 2016
Two different classes of variables: numeric and categorical

PR, CL: numeric data (min 1 to max 7)
Status: categorical data (F, PF, NF)

Solution：Make two variables (value, status)
→　The values of PR and CL are put into value
→　The values of Status are put into status
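
Why two separate variables are needed can be seen on a made-up example. The sketch below assumes the tidyr package; the data frame toy and its values are invented. Stacking a numeric column and a character column into one column with pivot_longer() stops with an error, while pivoting each class separately works.

```r
library(tidyr)

# Made-up toy data: one numeric score and one character status per country
toy <- data.frame(country = c("A", "B"),
                  pr_1972 = c(7, 1),
                  st_1972 = c("NF", "F"),
                  stringsAsFactors = FALSE)

# Stacking a numeric and a character column into one column fails;
# tryCatch() turns the error into a short message so the script keeps running
mixed <- tryCatch(
  pivot_longer(toy, cols = -country),
  error = function(e) "error: numeric and character cannot share one column"
)
mixed

# Pivoting each class separately works
value_long  <- pivot_longer(toy[c("country", "pr_1972")], cols = -country,
                            names_to = "type", values_to = "value")
status_long <- pivot_longer(toy[c("country", "st_1972")], cols = -country,
                            names_to = "name", values_to = "status")
```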

Change the name of the 1st variable: ...1

The variable ...1 shows country name
We rename ...1 as country

fh_na <- fh_na %>% 
    rename(country = 1)

Converting Freedom House data from wide form to long form

We split fh_na into two data frames: fh_country (the country names) and fh_na (all the other variables)

fh_country <- fh_na %>% 
  select(country)

fh_na <- fh_na %>% 
  select(-country)

Check the variables in each data frame

names(fh_na)

  [1] "PR...2"       "CL...3"       "Status...4"   "PR...5"       "CL...6"      
  [6] "Status...7"   "PR...8"       "CL...9"       "Status...10"  "PR...11"     
 [11] "CL...12"      "Status...13"  "PR...14"      "CL...15"      "Status...16" 
 [16] "PR...17"      "CL...18"      "Status...19"  "PR...20"      "CL...21"     
 [21] "Status...22"  "PR...23"      "CL...24"      "Status...25"  "PR...26"     
 [26] "CL...27"      "Status...28"  "PR...29"      "CL...30"      "Status...31" 
 [31] "PR...32"      "CL...33"      "Status...34"  "PR...35"      "CL...36"     
 [36] "Status...37"  "PR...38"      "CL...39"      "Status...40"  "PR...41"     
 [41] "CL...42"      "Status...43"  "PR...44"      "CL...45"      "Status...46" 
 [46] "PR...47"      "CL...48"      "Status...49"  "PR...50"      "CL...51"     
 [51] "Status...52"  "PR...53"      "CL...54"      "Status...55"  "PR...56"     
 [56] "CL...57"      "Status...58"  "PR...59"      "CL...60"      "Status...61" 
 [61] "PR...62"      "CL...63"      "Status...64"  "PR...65"      "CL...66"     
 [66] "Status...67"  "PR...68"      "CL...69"      "Status...70"  "PR...71"     
 [71] "CL...72"      "Status...73"  "PR...74"      "CL...75"      "Status...76" 
 [76] "PR...77"      "CL...78"      "Status...79"  "PR...80"      "CL...81"     
 [81] "Status...82"  "PR...83"      "CL...84"      "Status...85"  "PR...86"     
 [86] "CL...87"      "Status...88"  "PR...89"      "CL...90"      "Status...91" 
 [91] "PR...92"      "CL...93"      "Status...94"  "PR...95"      "CL...96"     
 [96] "Status...97"  "PR...98"      "CL...99"      "Status...100" "PR...101"    
[101] "CL...102"     "Status...103" "PR...104"     "CL...105"     "Status...106"
[106] "PR...107"     "CL...108"     "Status...109" "PR...110"     "CL...111"    
[111] "Status...112" "PR...113"     "CL...114"     "Status...115" "PR...116"    
[116] "CL...117"     "Status...118" "PR...119"     "CL...120"     "Status...121"
[121] "PR...122"     "CL...123"     "Status...124" "PR...125"     "CL...126"    
[126] "Status...127" "PR...128"     "CL...129"     "Status...130" "PR...131"    
[131] "CL...132"     "Status...133"

names(fh_country)

[1] "country"

Data cleaning: `fh_na`

colnames(fh_na) <- 
  str_replace_all(colnames(fh_na), 
                  c("\\.\\.\\." = "-")) %>%  # replace "..." with "-"
  str_subset("PR|CL|Status") %>%             # keep the names containing PR, CL or Status
  str_c(., "_") %>%                          # append "_" (e.g. "PR-2_")
  str_replace_all(c("-" =  "",               # remove "-"
                    "[0-9]" = "",            # remove the column numbers
                    "PR" = "pr",             # PR => pr
                    "CL" = "cl",             # CL => cl
                    "Status" = "st")) %>%    # Status => st
  str_c(.,  rep(setdiff(1972:2016, 1981),    # append the year (exclude 1981)
                each = 3))                   # 3 variables (pr, cl, st) per year

Check fh_na

names(fh_na)

  [1] "pr_1972" "cl_1972" "st_1972" "pr_1973" "cl_1973" "st_1973" "pr_1974"
  [8] "cl_1974" "st_1974" "pr_1975" "cl_1975" "st_1975" "pr_1976" "cl_1976"
 [15] "st_1976" "pr_1977" "cl_1977" "st_1977" "pr_1978" "cl_1978" "st_1978"
 [22] "pr_1979" "cl_1979" "st_1979" "pr_1980" "cl_1980" "st_1980" "pr_1982"
 [29] "cl_1982" "st_1982" "pr_1983" "cl_1983" "st_1983" "pr_1984" "cl_1984"
 [36] "st_1984" "pr_1985" "cl_1985" "st_1985" "pr_1986" "cl_1986" "st_1986"
 [43] "pr_1987" "cl_1987" "st_1987" "pr_1988" "cl_1988" "st_1988" "pr_1989"
 [50] "cl_1989" "st_1989" "pr_1990" "cl_1990" "st_1990" "pr_1991" "cl_1991"
 [57] "st_1991" "pr_1992" "cl_1992" "st_1992" "pr_1993" "cl_1993" "st_1993"
 [64] "pr_1994" "cl_1994" "st_1994" "pr_1995" "cl_1995" "st_1995" "pr_1996"
 [71] "cl_1996" "st_1996" "pr_1997" "cl_1997" "st_1997" "pr_1998" "cl_1998"
 [78] "st_1998" "pr_1999" "cl_1999" "st_1999" "pr_2000" "cl_2000" "st_2000"
 [85] "pr_2001" "cl_2001" "st_2001" "pr_2002" "cl_2002" "st_2002" "pr_2003"
 [92] "cl_2003" "st_2003" "pr_2004" "cl_2004" "st_2004" "pr_2005" "cl_2005"
 [99] "st_2005" "pr_2006" "cl_2006" "st_2006" "pr_2007" "cl_2007" "st_2007"
[106] "pr_2008" "cl_2008" "st_2008" "pr_2009" "cl_2009" "st_2009" "pr_2010"
[113] "cl_2010" "st_2010" "pr_2011" "cl_2011" "st_2011" "pr_2012" "cl_2012"
[120] "st_2012" "pr_2013" "cl_2013" "st_2013" "pr_2014" "cl_2014" "st_2014"
[127] "pr_2015" "cl_2015" "st_2015" "pr_2016" "cl_2016" "st_2016"
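
The renaming can be followed step by step on six made-up column names. This is a simplified sketch of the same idea, assuming the stringr package; the intermediate names (old, prefix, years, new_names) are chosen here for illustration and only two years are used.

```r
library(stringr)

# Six names in the same pattern as the Freedom House columns
old <- c("PR...2", "CL...3", "Status...4", "PR...5", "CL...6", "Status...7")

# 1. Drop the "..." and the column number, keeping only the letters
prefix <- str_remove(old, "\\.\\.\\.[0-9]+$")

# 2. Switch to short lowercase labels
prefix <- str_replace_all(prefix, c("PR" = "pr", "CL" = "cl", "Status" = "st"))

# 3. Repeat each year three times (pr, cl, st) and attach it with "_"
years <- rep(1972:1973, each = 3)
new_names <- str_c(prefix, "_", years)

new_names
```

The length of the year vector must match the number of columns exactly, otherwise the years are attached to the wrong columns.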

Using bind_cols() function, merge fh_na and fh_country

fh_na <- fh_country %>%  # put country back as the first variable
  bind_cols(fh_na)

Check fh_na

rmarkdown::paged_table(fh_na)

Make two variables: value and type

PR_CL_long <- fh_na %>% 
  select(country,                        
         starts_with(c("pr", "cl"))) %>%  # select the variables starting with `pr` and `cl`
  pivot_longer(pr_1972:cl_2016,           # the range of variables to convert
               names_to = "type",         # put the variable names, such as "pr_1972", into `type`
               values_to = "value") %>%   # put the values of the variables, such as 7, into `value`
  separate(type, 
           into = c("type", "year"),      # split `type` into two variables: `type` and `year`
           sep = "_") %>%                 # split at "_"
  drop_na()                               # drop the rows with missing values

ST_long <- fh_na %>% 
  select(country,                        # select country
         starts_with("st")) %>%          # select the variables starting with `st`
  pivot_longer(st_1972:st_2016,          # the range of variables to convert
               names_to = "name",        # put the variable names, such as "st_1972", into `name`
               values_to = "status") %>% # put the values of the variables, such as "NF", into `status`
  separate(name, 
           into = c("name", "year"),     # split `name` into two variables: `name` and `year`
           sep = "_") %>%                # split at "_"
  select(-name) %>%                      # drop `name` because it is no longer needed
  drop_na()                              # drop the rows with missing values

Check PR_CL_long

names(PR_CL_long)

[1] "country" "type"    "year"    "value"

Check ST_long

names(ST_long)

[1] "country" "year"    "status"
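
pivot_longer() can also split the variable names by itself. The sketch below is an alternative under the assumption that tidyr 1.1.0 or later is installed; the toy data is made up. names_sep splits names such as "pr_1972" at "_", names_transform turns year into an integer, and values_drop_na drops the missing rows, so separate() and drop_na() are not needed.

```r
library(tidyr)

# Made-up wide data: two countries, two years, one missing score
toy <- data.frame(country = c("A", "B"),
                  pr_1972 = c(7, 1),
                  cl_1972 = c(6, 1),
                  pr_1973 = c(7, NA),
                  cl_1973 = c(6, 2),
                  stringsAsFactors = FALSE)

toy_long <- pivot_longer(
  toy,
  cols = -country,                              # every variable except country
  names_to = c("type", "year"),                 # "pr_1972" becomes type = "pr", year = 1972
  names_sep = "_",                              # split the names at "_"
  names_transform = list(year = as.integer),    # store year as a number, not text
  values_to = "value",
  values_drop_na = TRUE                         # drop the rows with missing values
)

toy_long
```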

Using the left_join() function, merge PR_CL_long and ST_long by the two shared variables: country and year

fh_all_long <- PR_CL_long %>% 
  left_join(ST_long, 
            by = c("country", "year"))

DT::datatable(fh_all_long)

Now, we have converted wide form data into long form data
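
What the join does can be checked on a made-up example. The sketch below assumes the dplyr package; the data frames scores and status are invented. Each country-year has two rows in scores (pr and cl) but only one row in status, so the same status is attached to both rows.

```r
library(dplyr)

# Two rows per country-year (pr and cl)
scores <- data.frame(country = c("A", "A", "B", "B"),
                     year    = c("1972", "1972", "1972", "1972"),
                     type    = c("pr", "cl", "pr", "cl"),
                     value   = c(7, 6, 1, 1),
                     stringsAsFactors = FALSE)

# One row per country-year
status <- data.frame(country = c("A", "B"),
                     year    = c("1972", "1972"),
                     status  = c("NF", "F"),
                     stringsAsFactors = FALSE)

# Keep every row of scores and look up the matching status
joined <- left_join(scores, status, by = c("country", "year"))

joined
```

The number of rows stays the same as in scores, which is a quick check that the keys match one to one on the status side.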

6.3 Data Visualization: `Freedom House`

Transition of Political Rights in North Korea and South Korea (1972-2016)

korea_PR <- fh_all_long %>% 
  filter(country == "North Korea" | country == "South Korea") %>% 
  filter(type == "pr")

korea_PR %>% 
  ggplot(aes(x = value, y = year, 
             color = country,
             shape = country)) +
  geom_point() +
  ggtitle("Political Rights between N.Korea and S.Korea: 1972-2016") +
  labs(x = "Political Rights", y = "Year") +
  theme(legend.position = c(0.5, 0.8))

North Korea's political rights score has stayed at 7 (the worst rating) since 1972
South Korea's political rights score was around 5 in 1972, but it has improved since then (to 1 or 2)

Transition of Political Rights in Japan and China (1972-2016)

jpn.chi_PR <- fh_all_long %>% 
  filter(country == "Japan" | country == "China") %>% 
  filter(type == "pr")

jpn.chi_PR %>% 
  ggplot(aes(x = value, y = year, 
             color = country, 
             shape = country)) +
  geom_point() +
  ggtitle("Political Rights between Japan and China: 1972-2016") +
  labs(x = "Political Rights", y = "Year") +
  theme(legend.position = c(0.5, 0.8))

China's political rights score has been consistently bad (7 or 6) since 1972
Japan's political rights score has been consistently good (1 or 2)
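
In fh_all_long, year is a character variable because it was cut out of the variable names. A possible variation of the plots above, shown here only as a sketch on made-up values (Country A and Country B are hypothetical), converts year to an integer and puts it on the x axis so that the change over time reads from left to right. scale_y_reverse() puts the better score (1) at the top.

```r
library(ggplot2)

# Made-up scores for two hypothetical countries
toy <- data.frame(country = rep(c("Country A", "Country B"), each = 3),
                  year    = rep(c("1972", "1990", "2016"), times = 2),
                  value   = c(7, 7, 7, 5, 2, 2),
                  stringsAsFactors = FALSE)

# Convert year from character to integer so the axis is continuous
toy$year <- as.integer(toy$year)

p <- ggplot(toy, aes(x = year, y = value,
                     color = country,
                     shape = country)) +
  geom_line() +
  geom_point() +
  scale_y_reverse(breaks = 1:7) +   # 1 (best) at the top, 7 (worst) at the bottom
  labs(x = "Year", y = "Political Rights")

p
```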


2. Data Cleaning

Masahiko Asano

2021-09-13

