
(1)

Econometrics

Data Classification and Descriptive Statistics

Keisuke Kawata

Hiroshima University

(2)

Starting points of empirical analysis

• What is the best strategy to answer your research question? ⇒ It depends on the characteristics of your data.

• The starting point of empirical analysis ⇒

⇒ We should first study methods for understanding the characteristics of our data.

(3)

Plan of talks

1. Classification of data

2. Descriptive statistics for a variable

   1. Histogram

   2. Mean and median

   3. Variance and standard deviation

3. Descriptive statistics for two variables

   1. Scatter graph

   2. Conditional mean and variance

   3. Covariance and correlation

(4)

e.g.,) Poorest countries

Country Name        GDP per capita   Unemployment rate
Congo, Dem. Rep.    246              6.8
Burundi             247              7.2
Ethiopia            335              2.8
Malawi              364              6.3
Liberia             377              3.7
Niger               388              5.6
Eritrea             440              7.7
Uganda              441              3.5
Guinea              454              3.5

• A data set consists of a set of observations (samples) and observable variables.

• Unit of observation: Countries (Congo, Burundi, Ethiopia, Malawi, etc.)

• Variables: GDP per capita and unemployment rate

(5)

Fundamental characteristics of data

• You must first check the following characteristics:

–

–

–

(6)

Fundamental structure of Data

• By the unit of observation, our data can be classified as follows:

– Cross-section data: consists of observations taken at a given point in time.

– Time series data: consists of observations over time.

– Panel data: consists of a time series for each cross-sectional member.

• The best empirical strategy depends crucially on this fundamental structure.

(7)

e.g.,) Cross section data:

Bangladesh Household Survey (1991)

nh year villid thanaid agehead sexhead educhead famsize hhland hhasset expfd expnfd exptot

11054 0 1 1 72 1 0 3 36 33295 3055.856 902.2549 3958.112

11061 0 1 1 35 1 5 10 116 180325 3031.017 421.2493 3452.266

11081 0 1 1 54 1 3 6 91 80735 3347.578 343.3877 3690.967

11101 0 1 1 44 1 5 7 8 16755 2106.833 265.8344 2372.667

12021 0 2 1 28 1 8 6 10 18795 3401.24 789.907 4191.147

12035 0 2 1 25 1 8 4 184 81105 3760.61 1482.338 5242.948

12051 0 2 1 63 1 6 8 16 20800 3097.738 807.8052 3905.543

12054 0 2 1 27 1 5 3 63 32717 2703.48 369.9149 3073.395

12081 0 2 1 26 1 0 6 10 26160 3188.72 487.4715 3676.191

12121 0 2 1 43 1 3 8 5 17080 2179.248 245.2691 2424.517

13014 0 3 1 38 1 7 5 26 30320 3115.934 731.7919 3847.726

13015 0 3 1 21 1 0 3 138 84103 3194.497 715.0913 3909.589

13021 0 3 1 45 1 10 5 73 61220 2634.16 302.0899 2936.25

13025 0 3 1 41 1 4 7 229 133391 3161.487 581.0589 3742.546

13035 0 3 1 37 1 0 14 960 636530 2972.095 864.5654 3836.66

13041 0 3 1 47 1 0 4 55 39100 2950.432 458.5735 3409.006

13054 0 3 1 60 1 5 10 15 22490 2471.259 413.6964 2884.955

13061 0 3 1 60 1 0 4 10 25310 2595.167 266.9247 2862.092

13091 0 3 1 63 1 0 3 19 34010 3055.856 413.3333 3469.19

13101 0 3 1 44 1 4 6 19 32460 2111.372 289.9497 2401.321

13111 0 3 1 34 1 0 3 3 4242 3130.953 648.6937 3779.647

13121 0 3 1 27 1 0 3 8 11615 2888.333 364.364 3252.696

(8)

e.g.,) Time series data: Japanese labor market

(Ten thousand persons)

Year   Employed persons   Unemployed persons   Not in labor force
2000   6446               320                  4057
2001   6412               340                  4125
2002   6330               359                  4229
2003   6316               350                  4285
2004   6329               313                  4336
2005   6356               294                  4346
2006   6389               275                  4358
2007   6427               257                  4375
2008   6409               265                  4407
2009   6314               336                  4446
2010   6298               334                  4473
2011   6289               302                  4517
2012   6270               285                  4540
2013   6311               265                  4506

(9)

e.g.,) Panel data: Average Monthly wage in each prefecture

Prefecture   Year   Monthly wage
Tokyo        2003   481,163
Tokyo        2004   491,189
Tokyo        2005   485,455
Osaka        2003   423,950
Osaka        2004   415,649
Osaka        2005   416,202
Hiroshima    2003   372,708
Hiroshima    2004   369,635
Hiroshima    2005   367,461

(10)

Repeated cross section data

• To increase the sample size, we often merge two or more cross-section data sets from different years.

e.g.,) Wages of faculty staff

Name     Year   Wage
Prof K   2013   600
Prof I   2013   1100
Prof U   2014   400
Prof T   2014   700

• What should we call this? ⇒ Repeated cross-section data.

• Repeated cross-section data has the same properties as cross-section data.
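The four data structures introduced so far can be told apart by looking only at which units are observed in which periods. The following is a minimal sketch under simplifying assumptions: `classify_structure` is a hypothetical helper, not a standard function, and it treats only balanced panels (every unit observed in every period) as panel data. The unit label "Japan" is added here for illustration.

```python
from collections import defaultdict


def classify_structure(records):
    """Classify a data set from its (unit, period) pairs.

    Hypothetical helper for illustration. The rules are a simplification:
    only balanced panels are labelled "panel".
    """
    pairs = set(records)
    units = {unit for unit, _ in pairs}
    periods = {period for _, period in pairs}

    if len(periods) == 1:
        return "cross-section"
    if len(units) == 1:
        return "time series"

    # Collect the periods in which each unit is observed.
    periods_by_unit = defaultdict(set)
    for unit, period in pairs:
        periods_by_unit[unit].add(period)

    if all(seen == periods for seen in periods_by_unit.values()):
        return "panel"
    if all(len(seen) == 1 for seen in periods_by_unit.values()):
        return "repeated cross-section"
    return "unbalanced panel or mixed structure"


faculty = [("Prof K", 2013), ("Prof I", 2013), ("Prof U", 2014), ("Prof T", 2014)]
prefectures = [(p, y) for p in ("Tokyo", "Osaka", "Hiroshima") for y in (2003, 2004, 2005)]
japan = [("Japan", y) for y in range(2000, 2014)]

print(classify_structure(faculty))      # repeated cross-section
print(classify_structure(prefectures))  # panel
print(classify_structure(japan))        # time series
```

In the faculty example no professor appears in both years, which is why it is not a panel even though it covers two periods.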

(11)

Distribution

• The values of the variables in your data vary from observation to observation.

⇒ We say that the data has a distribution.

At the starting point of an analysis, we should grasp the distribution of our own data.

• Can we grasp the characteristics of the data just by looking at the raw data?

⇒ If the sample size is not very small, this is difficult.

⇒ We should use descriptive statistics.

(12)

Example

Data

ID   Gender
1    Male
2    Female
3    Female
4    Male
5    Male
6    Female
7    Male

• If the number of possible values is small ⇒ We can make a distribution table.

Distribution table

Gender   Frequency of observations
Female   3
Male     4
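A distribution table like the one in this example can be built by counting how often each value occurs. The sketch below uses Python's standard `collections.Counter` on the seven gender observations; the variable names are chosen here for illustration.

```python
from collections import Counter

# The seven observations from the example, in ID order.
genders = ["Male", "Female", "Female", "Male", "Male", "Female", "Male"]

# Count the frequency of each possible value.
table = Counter(genders)

for value in sorted(table):
    print(f"{value}: {table[value]}")
```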

(13)

Histogram

• If the number of possible values is large, we often use a histogram.

• We can make a histogram with the following steps.

1. Determine the range of each category.

2. Calculate the relative frequency (the fraction of the number of observations) in each category.

3. Draw bars: the relative frequency is on the vertical axis and the data range is on the horizontal axis.
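The three steps can be sketched in a few lines of Python. This is only an illustration: the expenditure values are made up, `relative_frequencies` is a hypothetical helper, and the categories are assumed to have equal width. The bars are drawn as text rather than as a graph.

```python
def relative_frequencies(values, lower, width, n_bins):
    """Return the fraction of observations falling in each category.

    Category k covers [lower + k * width, lower + (k + 1) * width).
    Values outside all categories are not counted in any bar but still
    count toward the total number of observations.
    """
    counts = [0] * n_bins
    for v in values:
        k = int((v - lower) // width)
        if 0 <= k < n_bins:
            counts[k] += 1
    return [c / len(values) for c in counts]


# Made-up expenditure values, for illustration only.
expenditure = [1200, 2500, 3100, 4800, 5200, 7400, 9900, 15000]
lower, width, n_bins = 0, 5000, 4  # step 1: the range of each category

shares = relative_frequencies(expenditure, lower, width, n_bins)  # step 2

for k, share in enumerate(shares):  # step 3: one text bar per category
    left = lower + k * width
    print(f"[{left}, {left + width}): {'#' * round(share * 40)} {share:.3f}")
```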

(14)

e.g.,) Household's expenditure in Bangladesh

[Figure: histogram. Horizontal axis: HH per capita total expenditure (Tk/year).]

(15)

Mean and Median

• To capture the representative value of your data, we often use descriptive statistics such as the mean (average) and the median.

Mean: The value obtained by dividing the sum of the data by the number of observations.

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

where $n$ is the sample size, $x_i$ is the $i$-th observation, and $\sum_{i=1}^{n} x_i$ means the sum of $x_i$ for $i$ running from 1 to $n$.

Median: The value of the central observation (the 50th percentile).
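The two definitions translate directly into code. The sketch below computes both for three wage values (600, 1100, 400) used for illustration; for an even number of observations it assumes the common convention of averaging the two central values, which the slide itself does not specify.

```python
def mean(xs):
    """Sum of the data divided by the number of observations."""
    return sum(xs) / len(xs)


def median(xs):
    """Value of the central observation after sorting.

    For an even number of observations, the two central values are
    averaged (an assumed convention).
    """
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2


wages = [600, 1100, 400]
print(mean(wages))    # 700.0
print(median(wages))  # 600
```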

(16)

e.g., Wages of faculty staff

• What is the mean of data 1 and data 2?

• What is the median?

Data 1

Name     Wage
Prof K   600
Prof I   1100
Prof Y   400

Data 2

(17)

Mean VS Median

• Which is the better statistic to represent the "representative value"?

⇒ We often use the mean, but it is more sensitive to abnormal values than the median.

• If the values of the mean and the median are very different, you should report both in your paper.

e.g., What value is "representative"?

Name      Wages
Prof. I   ¥
Prof. Y   ¥
Prof. K   ¥
Prof. U   ¥
Prof. F
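The sensitivity of the mean to an abnormal value can be seen with a small experiment. The wage figures below are made up for illustration; the functions come from Python's standard `statistics` module.

```python
import statistics

# Made-up wages, for illustration only.
typical = [500, 600, 700, 800, 900]
with_outlier = [500, 600, 700, 800, 9000]  # one abnormal value

print(statistics.mean(typical), statistics.median(typical))            # 700 700
print(statistics.mean(with_outlier), statistics.median(with_outlier))  # 2320 700
```

A single abnormal wage more than triples the mean, while the median does not move.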

(18)

Variance

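• To measure the dispersion, or the spread, of a probability distribution, we often use the variance and the standard deviation.

Variance ($\sigma^2$)

⇒ The mean of the squared deviations from the mean:

$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

where $n$ is the sample size, $x_i$ is the $i$-th observation, and $\bar{x}$ is the mean of the $x_i$.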

(19)

Standard Deviation

Standard deviation ($\sigma$)

⇒ The square root of the variance: $\sigma = \sqrt{\sigma^2}$.
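As a small worked example, the sketch below computes the variance as the mean of the squared deviations (dividing by n, as in the definition above, rather than by n - 1) and the standard deviation as its square root. The three wage values are used for illustration.

```python
import math


def variance(xs):
    """Mean of the squared deviations from the mean (divides by n)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def standard_deviation(xs):
    """Square root of the variance."""
    return math.sqrt(variance(xs))


wages = [60, 110, 40]
# Mean is 70; squared deviations are 100, 1600 and 900; their mean is 2600 / 3.
print(variance(wages))
print(standard_deviation(wages))
```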

(20)

e.g., Wages of faculty staff

• What is the variance of the wage in data 1 and data 2?

Data 1

Name     Wage
Prof K   60
Prof I   110
Prof Y   40

Data 2

Name     Wage
Prof I   160
Prof Y   0

(21)

e.g., Household's expenditure in Bangladesh

Mean: 5,473.268

Median: 4,432.264

Variance: 17,100,000

Standard Deviation: 4,140.221

(22)

Question

Question 1: What is the mean of the above data?

Question 2: What is the median?

Question 3: What is the variance?

(23)

Descriptive statistics for two variables

• Most of the interesting questions in social science involve two or more variables.

– Can a microfinance program improve households' welfare?

– Can teachers' training improve the educational attainment of their students?

⇒ These questions concern the distribution of two variables, considered together.

⇒ We should study descriptive statistics to understand the relationship between variables.

(24)

Scatter graph

• Each observation is displayed as a point.

• The value of one variable determines the position on the horizontal axis, and the value of the other variable determines the position on the vertical axis.

[Figure: scatter graph. Horizontal axis: Education of HH head (years).]

(25)

Conditional mean

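Conditional mean ($E[x \mid m = A]$): Mean of x among observations with m = A

⇔ Mean of x in the sub-sample with m = A

Example

Name   Income   Gender
1      100      Male
2      300      Female
3      500      Female
4      300      Male

$E[\text{Income} \mid \text{Gender} = \text{Male}] = (100 + 300)/2 = 200$

$E[\text{Income} \mid \text{Gender} = \text{Female}] = (300 + 500)/2 = 400$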

(26)

Conditional variance

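Conditional variance ($\mathrm{var}[x \mid m = A]$): Variance of x among observations with m = A

⇔ Variance of x in the sub-sample with m = A

Example

Name   Income   Gender
1      100      Male
2      300      Female
3      500      Female
4      300      Male

$\mathrm{var}[\text{Income} \mid \text{Gender} = \text{Male}] = \left((100 - 200)^2 + (300 - 200)^2\right)/2 = 10{,}000$

$\mathrm{var}[\text{Income} \mid \text{Gender} = \text{Female}] = \left((300 - 400)^2 + (500 - 400)^2\right)/2 = 10{,}000$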
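Conditional means and variances are ordinary means and variances computed within each sub-sample. The sketch below splits the four-person income example by gender and applies the same formulas to each group; the variance divides by the number of observations in the group, and the helper names are chosen here for illustration.

```python
from collections import defaultdict

# (income, gender) pairs from the example table.
data = [(100, "Male"), (300, "Female"), (500, "Female"), (300, "Male")]


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


# Build one sub-sample per value of the conditioning variable.
groups = defaultdict(list)
for income, gender in data:
    groups[gender].append(income)

cond_mean = {g: mean(xs) for g, xs in groups.items()}
cond_var = {g: variance(xs) for g, xs in groups.items()}

print(cond_mean)  # {'Male': 200.0, 'Female': 400.0}
print(cond_var)   # {'Male': 10000.0, 'Female': 10000.0}
```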

(27)

e.g., Household expenditure in Bangladesh

• e: household's expenditure

• m: =1 if a household has female members who use microfinance, and =0 if not.

� � � = = , . 9

�� = = , ,

� � � = = , 9.

�� = = , ,

(28)

Covariance

Covariance: a measure of the extent to which two variables move together.

• The covariance between x and y can be defined as

$$\mathrm{cov}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

where $\bar{x}$ and $\bar{y}$ are the means of x and y, and $n$ is the sample size.

⇒ Covariance is positive: when x is greater than its mean, y tends to be greater than its mean (and when x is less than its mean, y tends to be less than its mean).

⇒ Covariance is negative: when x is greater than its mean, y tends to be less than its mean (and when x is less than its mean, y tends to be greater than its mean).
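The covariance can be computed by averaging the products of the two deviations. The sketch below does this for three illustrative observations of wage and height, dividing by n as in the variance.

```python
def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    """Mean of the products of deviations from the respective means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


wage = [600, 1100, 400]
height = [170, 180, 160]

# Deviations: (-100, 0), (400, 10), (-300, -10); products: 0, 4000, 3000.
cov = covariance(wage, height)
print(cov)      # 7000 / 3, about 2333.3
print(cov > 0)  # True: wage and height move together in this sample
```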

(29)

Example: Covariance between wage and height

What is the covariance between wage and height?

Name     Wage   Height
Prof Y   600    170
Prof I   1100   180
Prof K   400    160

(30)

Question

What are the means of wage and height?

What is the covariance between wage and height?

Name     Wage   Height
Prof Y   600    170
Prof I   900    160
Prof K   600    180

(31)

Covariance: Unit problem

• The size of the covariance depends crucially on the units of the variables.

e.g.,) The covariance between GDP per capita ($) and unemployment rates in 2010.

• If we measure GDP per capita in $, the covariance is about -10258.

• If we measure GDP per capita in ¥ (1$ = 100¥), the covariance is about -1025778.

⇒ The values of the covariance are totally different.

⇒ This is called the unit problem.

⇒ To grasp the relationship between two variables, we should use a measure that is free of the unit problem.
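The unit problem is easy to reproduce. The sketch below uses only the nine countries from the earlier table, so its covariance will not match the figures quoted above, which come from a different sample. It assumes the same exchange rate of 1$ = 100¥.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


# GDP per capita and unemployment rate for the nine countries in the table.
gdp_usd = [246, 247, 335, 364, 377, 388, 440, 441, 454]
unemployment = [6.8, 7.2, 2.8, 6.3, 3.7, 5.6, 7.7, 3.5, 3.5]

# Same data, different unit: 1$ = 100 yen.
gdp_yen = [100 * g for g in gdp_usd]

cov_usd = covariance(gdp_usd, unemployment)
cov_yen = covariance(gdp_yen, unemployment)

print(cov_usd)
print(cov_yen)  # 100 times larger, although the relationship is unchanged
```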

(32)

Correlation

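Correlation: a measure of the extent to which two variables move together.

• The correlation between x and y can be defined as

$$\mathrm{corr}(x, y) = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y}$$

where $\sigma_x$ and $\sigma_y$ are the standard deviations of x and y.

• $-1 \le \mathrm{corr}(x, y) \le 1$

⇒ Correlation is close to 1: The relationship between x and y is positive and very strong.

⇒ Correlation is close to -1: The relationship between x and y is negative and very strong.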
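Because the correlation divides the covariance by both standard deviations, a change of units cancels out. The sketch below checks this with three illustrative observations of wage and height, measuring height first in centimeters and then in meters.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    sd_x = math.sqrt(covariance(xs, xs))  # cov(x, x) is the variance of x
    sd_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sd_x * sd_y)


wage = [600, 1100, 400]
height_cm = [170, 180, 160]
height_m = [h / 100 for h in height_cm]

r_cm = correlation(wage, height_cm)
r_m = correlation(wage, height_m)

print(r_cm)  # about 0.97
print(r_m)   # the same value up to rounding: no unit problem
```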

(33)

e.g., Household expenditure and years of education

Covariance: 4,011.88

Correlation: 0.2788

(34)

Rule of mean

• Suppose there are variables $y, x_1, x_2, x_3, \dots, x_k$. If $y = x_1 + x_2 + x_3 + \dots + x_k$ ⇒ $\bar{y} = \bar{x}_1 + \bar{x}_2 + \bar{x}_3 + \dots + \bar{x}_k$

and

(35)

Example: test score

Student ID Test score: Reading Test score: Math

1 40 50

2 40 100

3 70 30

What is the mean of the total test score (reading score + math score)?
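The rule that the mean of a sum equals the sum of the means can be checked directly on this example. The sketch below computes the mean of the total score in two ways; the variable names are chosen here for illustration.

```python
def mean(xs):
    return sum(xs) / len(xs)


reading = [40, 40, 70]
math_score = [50, 100, 30]

# Total score of each student: reading + math.
total = [r + m for r, m in zip(reading, math_score)]

print(mean(total))                      # 110.0
print(mean(reading) + mean(math_score)) # 110.0, i.e. 50.0 + 60.0
```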

(36)

Conclusion

• We studied

– Key classifications of data: cross-section, time series, and panel structures.

– Some key descriptive statistics: mean, median, variance, standard deviation, covariance, and correlation.