1. Two types of “statistics”

Descriptive statistics:

  • The process of using and analysing a summary statistic that quantitatively describes or summarizes features from a collection of information.

Inferential statistics:

  • Inferential statistical analysis infers properties of a population, for example by testing hypotheses and deriving estimates.
  • It is assumed that the observed data set is sampled from a larger population.

2. Descriptive statistics

2.1 Data (income.csv)

  • You can check which directory (= working directory) you are currently working in by calling getwd()
  • It is strongly recommended that you make an R Project, which enables you to conduct your research efficiently in RStudio
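
The command for checking the working directory is not reproduced in these notes. A minimal sketch, assuming base R's getwd() is what is meant (the variable name wd is only for illustration):

```r
# Store and print the current working directory (base R).
wd <- getwd()
print(wd)
```

When RStudio is opened through an R Project, the working directory is normally the project folder, so a relative path such as "data/income.csv" is resolved from there.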

How to make an R Project

  • File => Select New Project

How to use RMarkdown

  • File => Select New File => R Markdown
  • Enter the title you like => OK
  • Delete lines 12 to 30
    => Click Knit button
    => Type the name of your .Rmd file after Name as:
  • Make a new folder in your R Project folder and name it data
  • Download income.csv and put it into data
  • Load the tidyverse package to read the CSV file
library(tidyverse)
df1 <- read_csv("data/income.csv")
  • Check df1
  • Number of observations = 100, number of variables = 7
  • Use the str() function to check the structure of df1
  • In order to show descriptive statistics, the class of a variable must be numeric
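
The point about variable classes can be sketched on a small stand-in data frame. Everything in the block below is hypothetical: toy and its values are invented for illustration and are not rows from income.csv. It shows how to inspect classes and how a numeric column stored as text can be converted before computing statistics.

```r
# Hypothetical toy data standing in for df1 (values invented for illustration).
toy <- data.frame(
  id     = c("A", "B", "C"),
  sex    = c("male", "female", "male"),
  age    = c("70", "67", "66"),   # stored as text on purpose
  income = c(24, 24, 25),
  stringsAsFactors = FALSE
)

str(toy)                           # structure: one line per variable
col_classes <- sapply(toy, class)  # class of each variable
print(col_classes)

# mean() on a character column returns NA with a warning,
# so convert the column to numeric first.
toy$age  <- as.numeric(toy$age)
age_mean <- mean(toy$age)
print(age_mean)
```

The same checks, str(df1) and sapply(df1, class), apply to the real data once it has been loaded.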

2.2 Summary Statistics   

  • Show summary statistics of df1 using summary()
      id                sex                 age            height     
 Length:100         Length:100         Min.   :20.00   Min.   :148.0  
 Class :character   Class :character   1st Qu.:36.00   1st Qu.:158.1  
 Mode  :character   Mode  :character   Median :45.00   Median :162.9  
                                       Mean   :45.96   Mean   :163.7  
                                       3rd Qu.:57.25   3rd Qu.:170.2  
                                       Max.   :70.00   Max.   :180.5  
     weight          income        generation       
 Min.   :28.30   Min.   :  24.0   Length:100        
 1st Qu.:48.95   1st Qu.: 134.8   Class :character  
 Median :59.95   Median : 298.5   Mode  :character  
 Mean   :59.18   Mean   : 434.4                     
 3rd Qu.:67.33   3rd Qu.: 607.2                     
 Max.   :85.60   Max.   :2351.0                     
  • If you use stargazer() with type = "text", then you can have a nicer table
library(stargazer)
stargazer(as.data.frame(df1),  # stargazer expects a data.frame, not a tibble
          type = "text",
          digits = 2)

Descriptive Statistics Details
N: Number of observations
Mean: Average value
St. Dev.: Standard deviation
Min: Minimum value
Pctl(25): 1st quartile (25%)
Pctl(75): 3rd quartile (75%)
Max: Maximum value

2.2.1 Mean

  • The arithmetic mean, also known as average or arithmetic average, is a central value of a finite set of numbers: specifically, the sum of the values divided by the number of values.
  • The mean of variable x (\(=\bar{x}\)) is calculated with the following equation:

\[\bar{x} = \frac{\sum_{i=1}^n x_i}{n}\]

  • Let’s suppose the TOEFL scores of 10 Waseda students are as follows:
toefl <- c(60, 80, 90, 80, 85, 60, 80, 90, 85, 100)
  • Show the data we made just now
 [1]  60  80  90  80  85  60  80  90  85 100
  • How to calculate the mean of toefl with R (1)
[1] 81
  • How to calculate the mean of toefl with R (2)
[1] 810
[1] 81
  • How to calculate the mean of toefl with R (3)
[1] 81
  • How to calculate the mean of toefl with R (4)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  60.00   80.00   82.50   81.00   88.75  100.00 
  • Show the distribution of toefl using hist( )
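
The commands behind the outputs above are not shown in these notes. The sketch below is one plausible reconstruction, not necessarily the commands originally used: mean(), sum() divided by length(), and summary() all reproduce the printed values 81, 810 and 81. The names m1, total, m2 and s are introduced here only for illustration.

```r
toefl <- c(60, 80, 90, 80, 85, 60, 80, 90, 85, 100)

m1    <- mean(toefl)             # built-in arithmetic mean
total <- sum(toefl)              # sum of the values ...
m2    <- total / length(toefl)   # ... divided by the number of values
s     <- summary(toefl)          # summary() reports the mean as well

print(m1)
print(total)
print(m2)
print(s)

hist(toefl)                      # distribution of the scores
```

The second approach mirrors the formula for the mean directly: the numerator is sum(toefl) and the denominator is length(toefl).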

2.2.2 Median

  • The median is the value separating the higher half from the lower half of a data sample.

  • For a data set, it may be thought of as “the middle” value.

  • For example, if we have the data set 1, 2, 3, the median is 2.

  • Using table( ), we can make a table of toefl

 60  80  85  90 100 
  2   3   2   2   1 
  • Since the total number of observations is even (10), there is no single value in the middle.
  • In such a case, we define the median as the average of the two values in the middle:
  • In this case, the median = (80 + 85)/2 = 82.5
  • In R, we can calculate the median as follows using median( )
[1] 82.5
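
The even-sample rule above can be checked by hand. A minimal sketch, in which sorted, n and median_by_hand are names introduced here for illustration:

```r
toefl <- c(60, 80, 90, 80, 85, 60, 80, 90, 85, 100)

sorted <- sort(toefl)   # values in ascending order
n      <- length(sorted)

# With an even n, average the two middle values (positions n/2 and n/2 + 1).
median_by_hand <- (sorted[n / 2] + sorted[n / 2 + 1]) / 2
median_builtin <- median(toefl)

print(median_by_hand)
print(median_builtin)
```

Both results should agree, since median() applies the same averaging rule when the number of observations is even.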

2.2.3 Mode

  • The mode is the value that appears most often in a set of data values.
  • For example, if we have the data set: 1, 2, 3, 3, 3, 4
  • The mode is 3.
  • Base R does not have a function to calculate the statistical mode (mode( ) returns the storage mode of an object instead).
    → We calculate the mode using table( )
 60  80  85  90 100 
  2   3   2   2   1 
  • The mode of toefl is 80.
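
Reading the mode off the table can also be automated. A minimal sketch, assuming we want every value that attains the highest count (so ties would return more than one value); counts and toefl_mode are names introduced here:

```r
toefl <- c(60, 80, 90, 80, 85, 60, 80, 90, 85, 100)

counts <- table(toefl)   # frequency of each distinct score

# Keep the value(s) whose count equals the maximum count.
toefl_mode <- as.numeric(names(counts)[counts == max(counts)])
print(toefl_mode)
```

The names of a table are character strings, which is why as.numeric() is applied before using the result as a number.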

2.2.4 Variance

  • In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its mean.

  • Variance is a measure of dispersion, meaning it is a measure of how far a set of numbers is spread out from their average value.

  • Variance is calculated with the following equation:

\[\text{Variance} = \frac{\sum_{i=1}^N (\text{individual value} - \text{Average})^2}{N}\]

  • Calculate the variance of toefl
 [1]  60  80  90  80  85  60  80  90  85 100
  • Calculate the mean of toefl and name it toefl_mean
toefl_mean <- mean(toefl)
[1] 81
  • Calculate (individual.value - Average) and name it x
x <- toefl - toefl_mean
 [1] -21  -1   9  -1   4 -21  -1   9   4  19
  • Square x and name it x2
x2 <- x^2
 [1] 441   1  81   1  16 441   1  81  16 361
  • Sum the squared values in x2 and name it sum_x2
sum_x2 <- sum(x2)
[1] 1440
  • Define the number of observations: N
N <- length(toefl)  # Number of observations
[1] 10
  • Thus, we get the variance of toefl:

\[\text{Variance} = \frac{\sum_{i=1}^N (\text{individual value} - \text{Average})^2}{N} = \frac{1440}{10} = 144\]

  • This is the variance of toefl.
  • We can also calculate the variance of toefl with R. Note that var( ) divides by N - 1 (the sample variance), so we multiply by (N - 1)/N to obtain the variance defined above:

variance_toefl <- var(toefl) * (length(toefl) - 1) / length(toefl)
[1] 144

2.2.5 Standard Deviation

  • The standard deviation is a measure of the amount of variation or dispersion of a set of values.

\[\text{Standard Deviation} = \sqrt{\text{Variance}}\]

  • Thus, the standard deviation of toefl is calculated from variance_toefl

[1] 12
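
The command that produced the value above is not shown. A minimal sketch, assuming it is the square root of the variance computed with N in the denominator; sd_population and sd_sample are names introduced here. R's built-in sd() is included for comparison because it divides by N - 1 and therefore gives a slightly larger value.

```r
toefl <- c(60, 80, 90, 80, 85, 60, 80, 90, 85, 100)
n     <- length(toefl)

# Variance with N in the denominator, as defined in this section.
variance_toefl <- sum((toefl - mean(toefl))^2) / n
sd_population  <- sqrt(variance_toefl)

# Built-in sd() uses N - 1 in the denominator (sample standard deviation).
sd_sample <- sd(toefl)

print(sd_population)
print(sd_sample)
```

For large samples the two versions are nearly identical; with only 10 observations the difference is visible.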
