1. Two types of “statistics”

Descriptive statistics:

The process of using and analysing a summary statistic that quantitatively describes or summarizes features from a collection of information.

Inferential statistics:

Inferential statistical analysis infers properties of a population, for example by testing hypotheses and deriving estimates.
It is assumed that the observed data set is sampled from a larger population.
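
The difference between the two can be illustrated with a minimal sketch. The data below are simulated (a hypothetical sample of 30 heights, not taken from income.csv), and the one-sample t-test is only one of many possible inferential tools; it is used here as an assumption for illustration.

```r
# Hypothetical data: 30 simulated heights in cm (not from income.csv).
set.seed(1)
sample_height <- rnorm(30, mean = 165, sd = 8)

# Descriptive statistics: summarize the observed sample itself.
sample_mean <- mean(sample_height)
sample_sd   <- sd(sample_height)

# Inferential statistics: use the sample to say something about the
# population, here a 95% confidence interval for the population mean.
ci <- t.test(sample_height)$conf.int

sample_mean
sample_sd
ci
```

The first two values only describe the 30 observations, while the interval is a statement about the larger population the sample is assumed to come from.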

2. Descriptive statistics

2.1 Data (`income.csv`)

By typing the following command, you can check which directory (the working directory) you are currently working in.

getwd()

[1] "/Users/asanomasahiko/Dropbox/statistics/class_materials"

It is strongly recommended that you create an R Project, which lets you organize and conduct your research efficiently in RStudio.

How to make an `R Project`

File => Select New Project

How to use RMarkdown

File => Select New File => R Markdown
Enter the title you like => OK
Delete lines 12 through 30 of the template
=> Click Knit button
=> Type the name of your .Rmd file after Name as:
Make a new folder named data in your R Project folder
Download income.csv and put it into the data folder
Load the tidyverse package to read the CSV file

library("tidyverse")                 # provides read_csv()
df1 <- read_csv("data/income.csv")   # path is relative to the working directory
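
Because the path "data/income.csv" is relative to the working directory, a common source of errors is running the code from the wrong directory. The following is a minimal sketch of a check you could run first; check_data_file() is a hypothetical helper introduced here for illustration, not part of tidyverse or of the original notes.

```r
# Hypothetical helper: report whether a file can be found from the
# current working directory, and return TRUE/FALSE accordingly.
check_data_file <- function(path) {
  if (file.exists(path)) {
    message("Found: ", normalizePath(path))
    TRUE
  } else {
    message("Not found: ", path, " (working directory: ", getwd(), ")")
    FALSE
  }
}

# Assumes the folder layout described above (a data folder in the project).
check_data_file(file.path("data", "income.csv"))
```

If the function reports that the file is not found, compare the printed working directory with the location of your R Project folder.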

Check df1

DT::datatable(df1)

Number of observations = 100, number of variables = 7
Using the str() function, check the structure of df1

str(df1)

spec_tbl_df [100 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ id        : chr [1:100] "AU" "AY" "AB" "AM" ...
 $ sex       : chr [1:100] "male" "female" "male" "male" ...
 $ age       : num [1:100] 70 70 69 67 66 66 65 65 65 64 ...
 $ height    : num [1:100] 160 156 173 166 171 ...
 $ weight    : num [1:100] 58.3 44 75.7 69.3 76.5 67.3 41.5 53.5 46.8 52.7 ...
 $ income    : num [1:100] 201 487 424 1735 929 ...
 $ generation: chr [1:100] "elder" "elder" "elder" "elder" ...
 - attr(*, "spec")=
  .. cols(
  ..   id = col_character(),
  ..   sex = col_character(),
  ..   age = col_double(),
  ..   height = col_double(),
  ..   weight = col_double(),
  ..   income = col_number(),
  ..   generation = col_character()
  .. )

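To compute descriptive statistics for a variable, its class must be numeric. Here age, height, weight, and income are numeric (num), while id, sex, and generation are character (chr).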
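
One way to check which variables are numeric, sketched below, is to apply is.numeric() to every column. The data frame toy is hypothetical: its column names mimic df1, but the values are made up for illustration.

```r
# Hypothetical toy data; the values are made up, not taken from income.csv.
toy <- data.frame(
  id     = c("A", "B", "C"),
  age    = c(20, 35, 50),
  income = c(100, 250, 400),
  stringsAsFactors = FALSE
)

# Class of each column.
sapply(toy, class)

# Logical vector: TRUE for columns that are numeric.
is_num <- sapply(toy, is.numeric)

# Summarize only the numeric columns.
summary(toy[, is_num, drop = FALSE])
```

The same two calls, sapply(df1, is.numeric) and the subsetting step, should work on df1 once it has been loaded.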

2.2 Summary Statistics

Show summary statistics of df1

summary(df1)

      id                sex                 age            height     
 Length:100         Length:100         Min.   :20.00   Min.   :148.0  
 Class :character   Class :character   1st Qu.:36.00   1st Qu.:158.1  
 Mode  :character   Mode  :character   Median :45.00   Median :162.9  
                                       Mean   :45.96   Mean   :163.7  
                                       3rd Qu.:57.25   3rd Qu.:170.2  
                                       Max.   :70.00   Max.   :180.5  
     weight          income        generation       
 Min.   :28.30   Min.   :  24.0   Length:100        
 1st Qu.:48.95   1st Qu.: 134.8   Class :character  
 Median :59.95   Median : 298.5   Mode  :character  
 Mean   :59.18   Mean   : 434.4                     
 3rd Qu.:67.33   3rd Qu.: 607.2                     
 Max.   :85.60   Max.   :2351.0

If you use stargazer() with type = "text", you get a nicer table:

library(stargazer)

stargazer(as.data.frame(df1), 
          type ="text",
          digits = 2)


=============================================================
Statistic  N   Mean  St. Dev.  Min   Pctl(25) Pctl(75)  Max  
-------------------------------------------------------------
age       100 45.96   13.33     20      36      57.2     70  
height    100 163.75   7.69   148.00  158.10   170.17  180.50
weight    100 59.18   12.65   28.30   48.95    67.32   85.60 
income    100 434.40  445.78    24    134.8    607.2   2,351 
-------------------------------------------------------------

If you use stargazer() with type = "html", you get a fancier table.
For the HTML to render, set results = "asis" in the chunk header, i.e. start the chunk with ```{r, results = "asis"}

stargazer(as.data.frame(df1), 
          type ="html",
          digits = 2)


Statistic  N    Mean    St. Dev.  Min     Pctl(25)  Pctl(75)  Max
age        100  45.96   13.33     20      36        57.2      70
height     100  163.75  7.69      148.00  158.10    170.17    180.50
weight     100  59.18   12.65     28.30   48.95     67.32     85.60
income     100  434.40  445.78    24      134.8     607.2     2,351

Descriptive statistic  Details
N                      Number of observations
Mean                   Average value
St. Dev.               Standard deviation
Min                    Minimum value
Pctl(25)               1st quartile (25th percentile)
Pctl(75)               3rd quartile (75th percentile)
Max                    Maximum value
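
As a rough guide, each of these statistics can also be obtained with a base R function. The sketch below uses a small hypothetical vector v (not a column of income.csv). The names on the left mirror the table's column labels; pairing them with these particular base R calls is an assumption made for illustration.

```r
# Hypothetical example vector; not taken from income.csv.
v <- c(2, 4, 4, 5, 7, 9, 10, 12)

stats <- c(
  N      = length(v),                         # number of observations
  Mean   = mean(v),                           # average value
  StDev  = sd(v),                             # standard deviation (N - 1 denominator)
  Min    = min(v),                            # minimum value
  Pctl25 = quantile(v, 0.25, names = FALSE),  # 1st quartile
  Pctl75 = quantile(v, 0.75, names = FALSE),  # 3rd quartile
  Max    = max(v)                             # maximum value
)

round(stats, 2)
```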

2.2.1 Mean

The arithmetic mean, also known as average or arithmetic average, is a central value of a finite set of numbers: specifically, the sum of the values divided by the number of values.
The mean of variable x (\(=\bar{x}\)) is calculated with the following equation:

\[\bar{x} = \frac{\sum_{i=1}^n x_i}{n}\]

Let’s suppose the TOEFL scores of 10 Waseda students are as follows:

toefl <- c(60, 80, 90, 80, 85, 60, 80, 90, 85, 100)

Show the data we made just now

toefl

 [1]  60  80  90  80  85  60  80  90  85 100

How to calculate the mean of toefl with R (1)

(60+80+90+80+85+60+80+90+85+100)/10

[1] 81

How to calculate the mean of toefl with R (2)

sum(toefl)

[1] 810

sum(toefl)/10

[1] 81

How to calculate the mean of toefl with R (3)

mean(toefl)

[1] 81

How to calculate the mean of toefl with R (4)

summary(toefl)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  60.00   80.00   82.50   81.00   88.75  100.00

Show the distribution of toefl using hist( )

hist(toefl)

2.2.2 Median

The median is the value separating the higher half from the lower half of a data sample.
For a data set, it may be thought of as “the middle” value.
For example, if we have the data set 1, 2, 3, the median is 2.
Using table( ), we can make a table of toefl

table(toefl)

toefl
 60  80  85  90 100 
  2   3   2   2   1

Since the total number of observations is even (10), there is no single middle value.
In such a case, we define the median as the average of the two values in the middle.
Sorted, the 5th value is 80 and the 6th value is 85, so the median = (80 + 85)/2 = 82.5
In R, we can calculate the median as follows using median( )

median(toefl)

[1] 82.5
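
The rule above can be written out step by step. This is a minimal sketch of the definition, not how median( ) is implemented internally; the variable names sorted, n, and manual_median are introduced here for illustration.

```r
toefl <- c(60, 80, 90, 80, 85, 60, 80, 90, 85, 100)

sorted <- sort(toefl)   # order the values from smallest to largest
n <- length(sorted)     # number of observations

if (n %% 2 == 0) {
  # Even n: average the two middle values.
  manual_median <- (sorted[n / 2] + sorted[n / 2 + 1]) / 2
} else {
  # Odd n: take the single middle value.
  manual_median <- sorted[(n + 1) / 2]
}

manual_median
manual_median == median(toefl)
```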

2.2.3 Mode

The mode is the value that appears most often in a set of data values.
For example, if we have the data set: 1, 2, 3, 3, 3, 4
The mode is 3.
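Base R does not have a built-in function to calculate the statistical mode (the function mode( ) returns the storage mode of an object, not the most frequent value).
→ We calculate the mode using table( )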

table(toefl)

toefl
 60  80  85  90 100 
  2   3   2   2   1

The mode of toefl is 80.
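
Reading the mode off the table can also be automated. The helper stat_mode() below is hypothetical (it is not a base R function) and assumes a numeric vector; it returns every value that shares the highest count, so ties produce more than one mode.

```r
toefl <- c(60, 80, 90, 80, 85, 60, 80, 90, 85, 100)

# Hypothetical helper: most frequent value(s) of a numeric vector.
stat_mode <- function(x) {
  counts <- table(x)                              # frequency of each value
  top <- names(counts)[counts == max(counts)]     # value(s) with the highest count
  as.numeric(top)                                 # table names are character
}

stat_mode(toefl)
stat_mode(c(1, 1, 2, 2))   # a tie: both values are returned
```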

2.2.4 Variance

In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its mean.
Variance is a measure of dispersion, meaning it is a measure of how far a set of numbers is spread out from their average value.
Variance is calculated with the following equation:

\[Variance = \frac{\sum_{i=1}^N (individual.value - Average)^2}{N}\]

Calculate the variance of toefl

toefl

 [1]  60  80  90  80  85  60  80  90  85 100

Calculate the mean of toefl and name it toefl_mean

toefl_mean <- mean(toefl)
toefl_mean

[1] 81

Calculate (individual.value - Average) and name it x

x <- toefl - toefl_mean
x

 [1] -21  -1   9  -1   4 -21  -1   9   4  19

Square x and name it x2

x2 <- x^2
x2

 [1] 441   1  81   1  16 441   1  81  16 361

Sum the squared deviations in x2 and name the result sum_x2

sum_x2 <- sum(x2)
sum_x2

[1] 1440

Define the number of observations: N

N <- length(toefl)  # Number of observations
N

[1] 10

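Thus, we get the variance of toefl:
\[Variance = \frac{\sum_{i=1}^N (individual.value - Average)^2}{N} = \frac{1440}{10} = 144\]

This is the variance of toefl.
We can also calculate the variance of toefl with R. Note that var( ) divides by N - 1 rather than N (it returns the sample variance), so we multiply by (N - 1)/N to obtain the variance defined above: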

variance_toefl <- var(toefl) * (length(toefl) - 1) / length(toefl)
variance_toefl

[1] 144
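
The step-by-step calculation can also be wrapped in a small function. pop_var() is a hypothetical helper written for this illustration; comparing it with var( ) makes the effect of the two denominators visible.

```r
toefl <- c(60, 80, 90, 80, 85, 60, 80, 90, 85, 100)

# Hypothetical helper: variance with N in the denominator,
# i.e. the mean of the squared deviations from the mean.
pop_var <- function(x) {
  mean((x - mean(x))^2)
}

pop_var(toefl)   # sum of squared deviations divided by N
var(toefl)       # sum of squared deviations divided by N - 1
```

With only 10 observations the two results differ noticeably; the gap shrinks as N grows.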

2.2.5 Standard Deviation

The standard deviation is a measure of the amount of variation or dispersion of a set of values.

\[Standard Deviation = \sqrt{Variance}\]

Thus, the standard deviation of toefl is calculated from variance_toefl:

sqrt(variance_toefl)

[1] 12
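
For comparison, R's built-in sd( ) uses N - 1 in the denominator, so it does not return exactly the same number. The sketch below computes both versions directly from toefl; the names pop_sd and sample_sd are introduced here for illustration.

```r
toefl <- c(60, 80, 90, 80, 85, 60, 80, 90, 85, 100)
n <- length(toefl)

# Standard deviation with N in the denominator (as defined above).
pop_sd <- sqrt(sum((toefl - mean(toefl))^2) / n)

# Built-in sd(): N - 1 in the denominator.
sample_sd <- sd(toefl)

pop_sd
sample_sd

# The two differ by a factor of sqrt((n - 1) / n).
all.equal(pop_sd, sample_sd * sqrt((n - 1) / n))
```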

9. Descriptive Statistics

Masahiko Asano

2021-09-13
