1. Two types of “statistics”

Descriptive statistics:

  • The process of using and analysing a summary statistic that quantitatively describes or summarizes features from a collection of information.

Inferential statistics:

  • Inferential statistical analysis infers properties of a population, for example by testing hypotheses and deriving estimates.
  • It is assumed that the observed data set is sampled from a larger population.
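
The difference between the two types can be sketched in R as follows. This is a minimal illustration, not part of the original material: the vector sample_x holds made-up values, and using t.test() to obtain a 95% confidence interval is just one assumed example of inference.

```r
# Hypothetical sample of 8 measurements (made-up values for illustration)
sample_x <- c(52, 48, 55, 60, 47, 53, 58, 51)

# Descriptive statistics: summarize the observed sample itself
sample_mean <- mean(sample_x)
sample_sd   <- sd(sample_x)

# Inferential statistics: treat the sample as drawn from a larger
# population and estimate the population mean with a 95% confidence interval
ci <- t.test(sample_x)$conf.int

sample_mean
sample_sd
ci
```

The mean and standard deviation describe only these eight values, while the confidence interval is a statement about the unobserved population they are assumed to come from.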

2. Descriptive statistics

2.1 Data (income.csv)

  • Type the following command to check which directory (= working directory) you are currently working in.
getwd()
[1] "/Users/asanomasahiko/Dropbox/statistics/class_materials"
  • It is strongly recommended that you create an R Project, which lets you organize your research efficiently in RStudio.

How to make an R Project

  • File => Select New Project

How to use RMarkdown

  • File => Select New File => R Markdown
  • Enter the title you like => OK
  • Delete lines 12 to 30 of the template
    => Click the Knit button
    => Type a name for your .Rmd file when prompted
  • Make a new folder in your RProject folder and name it data
  • Download income.csv and put it into data
  • Load the tidyverse package to read the csv file
library("tidyverse")                 # provides read_csv()
df1 <- read_csv("data/income.csv")   # path is relative to the project folder
  • Check df1
DT::datatable(df1)
  • Number of observations = 100, number of variables = 7
  • Using the str() function, check the structure of df1
str(df1)
spec_tbl_df [100 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ id        : chr [1:100] "AU" "AY" "AB" "AM" ...
 $ sex       : chr [1:100] "male" "female" "male" "male" ...
 $ age       : num [1:100] 70 70 69 67 66 66 65 65 65 64 ...
 $ height    : num [1:100] 160 156 173 166 171 ...
 $ weight    : num [1:100] 58.3 44 75.7 69.3 76.5 67.3 41.5 53.5 46.8 52.7 ...
 $ income    : num [1:100] 201 487 424 1735 929 ...
 $ generation: chr [1:100] "elder" "elder" "elder" "elder" ...
 - attr(*, "spec")=
  .. cols(
  ..   id = col_character(),
  ..   sex = col_character(),
  ..   age = col_double(),
  ..   height = col_double(),
  ..   weight = col_double(),
  ..   income = col_number(),
  ..   generation = col_character()
  .. )
  • Descriptive statistics such as the mean can only be computed for variables whose class is numeric; in df1 these are age, height, weight, and income.
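
A quick way to check which variables are numeric is sketched below. The data frame df_demo is a hypothetical stand-in built from the first three rows shown in the str() output, so the example runs without income.csv.

```r
# Hypothetical stand-in for df1 (first three rows of id, sex, age)
df_demo <- data.frame(
  id  = c("AU", "AY", "AB"),
  sex = c("male", "female", "male"),
  age = c(70, 70, 69),
  stringsAsFactors = FALSE
)

# TRUE for numeric columns, FALSE otherwise
is_num <- sapply(df_demo, is.numeric)
is_num

# mean() works on a numeric column
mean(df_demo$age)

# On a character column, mean() returns NA (with a warning, suppressed here)
suppressWarnings(mean(df_demo$sex))
```

The same sapply(df1, is.numeric) call should work on the real data once it is loaded.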

2.2 Summary Statistics   

  • Show summary statistics of df1
summary(df1)
      id                sex                 age            height     
 Length:100         Length:100         Min.   :20.00   Min.   :148.0  
 Class :character   Class :character   1st Qu.:36.00   1st Qu.:158.1  
 Mode  :character   Mode  :character   Median :45.00   Median :162.9  
                                       Mean   :45.96   Mean   :163.7  
                                       3rd Qu.:57.25   3rd Qu.:170.2  
                                       Max.   :70.00   Max.   :180.5  
     weight          income        generation       
 Min.   :28.30   Min.   :  24.0   Length:100        
 1st Qu.:48.95   1st Qu.: 134.8   Class :character  
 Median :59.95   Median : 298.5   Mode  :character  
 Mean   :59.18   Mean   : 434.4                     
 3rd Qu.:67.33   3rd Qu.: 607.2                     
 Max.   :85.60   Max.   :2351.0                     
  • If you use stargazer() with type = "text", then you can have a nicer table
library(stargazer)
stargazer(as.data.frame(df1), 
          type ="text",
          digits = 2)

=============================================================
Statistic  N   Mean  St. Dev.  Min   Pctl(25) Pctl(75)  Max  
-------------------------------------------------------------
age       100 45.96   13.33     20      36      57.2     70  
height    100 163.75   7.69   148.00  158.10   170.17  180.50
weight    100 59.18   12.65   28.30   48.95    67.32   85.60 
income    100 434.40  445.78    24    134.8    607.2   2,351 
-------------------------------------------------------------
  • If you use stargazer() with type = "html", you get a nicely formatted HTML table
  • For the table to render, set the chunk header to ```{r, results = "asis"}
stargazer(as.data.frame(df1), 
          type ="html",
          digits = 2)
Statistic   N     Mean     St. Dev.   Min      Pctl(25)   Pctl(75)   Max
age         100   45.96    13.33      20       36         57.2       70
height      100   163.75   7.69       148.00   158.10     170.17     180.50
weight      100   59.18    12.65      28.30    48.95      67.32      85.60
income      100   434.40   445.78     24       134.8      607.2      2,351

Descriptive Statistics Details
N:         Number of observations
Mean:      Average value
St. Dev.:  Standard deviation
Min:       Minimum value
Pctl(25):  1st quartile (25th percentile)
Pctl(75):  3rd quartile (75th percentile)
Max:       Maximum value
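
Each of these statistics can also be computed directly with base R functions. The sketch below uses only the first ten weight values printed in the str() output (the vector name weight10 is an assumption for illustration), so its results will differ from the full 100-row table.

```r
# First ten weight values as printed by str(df1); not the full column
weight10 <- c(58.3, 44, 75.7, 69.3, 76.5, 67.3, 41.5, 53.5, 46.8, 52.7)

stats <- c(
  N      = length(weight10),                    # number of observations
  Mean   = mean(weight10),                      # average value
  St.Dev = sd(weight10),                        # standard deviation
  Min    = min(weight10),                       # minimum value
  Pctl25 = unname(quantile(weight10, 0.25)),    # 1st quartile
  Pctl75 = unname(quantile(weight10, 0.75)),    # 3rd quartile
  Max    = max(weight10)                        # maximum value
)

round(stats, 2)
```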

2.2.1 Mean

  • The arithmetic mean, also known as average or arithmetic average, is a central value of a finite set of numbers: specifically, the sum of the values divided by the number of values.
  • The mean of variable x (\(=\bar{x}\)) is calculated with the following equation:

\[\bar{x} = \frac{\sum_{i=1}^n x_i}{n}\]

  • Let’s suppose the TOEFL scores of 10 Waseda students are as follows:
toefl <- c(60, 80, 90, 80, 85, 60, 80, 90, 85, 100)
  • Display the data we just created
toefl
 [1]  60  80  90  80  85  60  80  90  85 100
  • How to calculate the mean of toefl with R (1): type out the formula
(60+80+90+80+85+60+80+90+85+100)/10
[1] 81
  • How to calculate the mean of toefl with R (2): use sum( )
sum(toefl)
[1] 810
sum(toefl)/10
[1] 81
  • How to calculate the mean of toefl with R (3): use mean( )
mean(toefl)
[1] 81
  • How to calculate the mean of toefl with R (4): use summary( )
summary(toefl)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  60.00   80.00   82.50   81.00   88.75  100.00 
  • Show the distribution of toefl using hist( )
hist(toefl)

2.2.2 Median

  • The median is the value separating the higher half from the lower half of a data sample.

  • For a data set, it may be thought of as “the middle” value.

  • For example, for the data set 1, 2, 3, the median is 2.

  • Using table( ), we can make a table of toefl

table(toefl)
toefl
 60  80  85  90 100 
  2   3   2   2   1 
  • Since the total number of observations is even (10), there is no single middle value.
  • In such a case, we define the median as the average of the two middle values.
  • Here, the 5th and 6th values in sorted order are 80 and 85, so the median = (80 + 85)/2 = 82.5
  • In R, we can calculate the median as follows using median( )
median(toefl)
[1] 82.5
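
The rule for odd and even sample sizes can be written out by hand as a check on median( ). This is a minimal sketch; the names sorted, n, and med are chosen here for illustration.

```r
toefl <- c(60, 80, 90, 80, 85, 60, 80, 90, 85, 100)

sorted <- sort(toefl)     # 60 60 80 80 80 85 85 90 90 100
n      <- length(sorted)  # 10

if (n %% 2 == 0) {
  # Even n: average the two middle values
  med <- (sorted[n / 2] + sorted[n / 2 + 1]) / 2
} else {
  # Odd n: take the single middle value
  med <- sorted[(n + 1) / 2]
}

med  # 82.5, the same as median(toefl)
```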

2.2.3 Mode

  • The mode is the value that appears most often in a set of data values.
  • For example, if we have the data set: 1, 2, 3, 3, 3, 4
  • The mode is 3.
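  • Base R does not have a built-in function to calculate the statistical mode (the mode( ) function returns the storage mode of an object, not the most frequent value).
    → We find the mode using table( )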
table(toefl)
toefl
 60  80  85  90 100 
  2   3   2   2   1 
  • The mode of toefl is 80.
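
Reading the mode off the table can also be automated. The following is a minimal sketch, assuming we want every value tied for the highest count; the names counts and mode_value are illustrative.

```r
toefl <- c(60, 80, 90, 80, 85, 60, 80, 90, 85, 100)

counts <- table(toefl)   # frequency of each distinct score

# Keep the value(s) whose count equals the maximum count.
# table() stores the values as names, so convert them back to numbers.
mode_value <- as.numeric(names(counts)[counts == max(counts)])

mode_value  # 80
```

If several values share the highest count, mode_value contains all of them.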

2.2.4 Variance

  • In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its mean.

  • Variance is a measure of dispersion, meaning it is a measure of how far a set of numbers is spread out from their average value.

  • Variance is calculated with the following equation:

\[Variance = \frac{\sum_{i=1}^N (individual.value - Average)^2}{N}\]

  • Calculate the variance of toefl
toefl
 [1]  60  80  90  80  85  60  80  90  85 100
  • Calculate the mean of toefl and name it toefl_mean
toefl_mean <- mean(toefl)
toefl_mean
[1] 81
  • Calculate (individual.value - Average) and name it x
x <- toefl - toefl_mean
x
 [1] -21  -1   9  -1   4 -21  -1   9   4  19
  • Square x and name it x2
x2 <- x^2
x2
 [1] 441   1  81   1  16 441   1  81  16 361
  • Sum the values of x2 and name the result sum_x2
sum_x2 <- sum(x2)
sum_x2
[1] 1440
  • Define the number of observations: N
N <- length(toefl)  # Number of observations
N
[1] 10
  • Thus, we get the variance of toefl:

\[Variance = \frac{\sum_{i=1}^N (individual.value - Average)^2}{N}\]

\[= \frac{1440}{10} = 144\]

  • This is the variance of toefl.
  • We can also calculate the variance of toefl with R. Because var( ) divides by N - 1 rather than N, we multiply its result by (N - 1)/N to match the formula above:

variance_toefl <- var(toefl) * (length(toefl) - 1) / length(toefl)
variance_toefl
[1] 144
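
The two divisors can be compared side by side. This is a minimal sketch; pop_var is a hypothetical helper name, not a base R function.

```r
toefl <- c(60, 80, 90, 80, 85, 60, 80, 90, 85, 100)

# Variance with divisor N: the mean of the squared deviations
pop_var <- function(v) {
  mean((v - mean(v))^2)
}

pop_var(toefl)  # 1440 / 10 = 144
var(toefl)      # 1440 / 9  = 160 (divisor N - 1)
```

The gap between the two shrinks as N grows, since (N - 1)/N approaches 1.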

2.2.5 Standard Deviation

  • The standard deviation is a measure of the amount of variation or dispersion of a set of values.

\[Standard Deviation = \sqrt{Variance}\]

  • Thus, the standard deviation of toefl is calculated from variance_toefl:

sqrt(variance_toefl)
[1] 12
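
Note that R's built-in sd( ) is the square root of var( ), so it uses the divisor N - 1 and gives a slightly larger value than the result computed with divisor N. A minimal sketch of the comparison, with pop_sd and sample_sd as illustrative names:

```r
toefl <- c(60, 80, 90, 80, 85, 60, 80, 90, 85, 100)

# Standard deviation with divisor N (matches the formula above)
pop_sd <- sqrt(mean((toefl - mean(toefl))^2))

# Built-in sd(): square root of var(), which uses divisor N - 1
sample_sd <- sd(toefl)

pop_sd     # sqrt(144) = 12
sample_sd  # sqrt(160), about 12.65
```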