Econometrics
Data Classification and Descriptive Statistics
Keisuke Kawata
Hiroshima University
Starting points of empirical analysis
• What is the best strategy to answer your research questions? ⇒ It depends on the characteristics of your data.
• The starting point of empirical analysis ⇒
⇒ We should first study methods for understanding the characteristics of our data.
Plan of talks
1. Classification of data
2. Descriptive statistics for a variable
   1. Histogram
   2. Mean and median
   3. Variance and standard deviation
3. Descriptive statistics for two variables
   1. Scatter graph
   2. Conditional mean and variance
   3. Covariance and correlation
e.g.,) Poorest countries
Country Name GDP per capita Unemployment rate
Congo, Dem. Rep. 246 6.8
Burundi 247 7.2
Ethiopia 335 2.8
Malawi 364 6.3
Liberia 377 3.7
Niger 388 5.6
Eritrea 440 7.7
Uganda 441 3.5
Guinea 454 3.5
• A dataset consists of a set of observations (samples) and observable variables.
• Unit of observation: Countries (Congo, Burundi, Ethiopia, Malawi, etc.)
• Variables: GDP per capita and Unemployment rate
Fundamental characteristics of data
• You must first check the following characteristics:
–
–
–
Fundamental structure of Data
• By the unit of observation, our data can be classified as follows:
– Cross-section data: consists of observations taken at a given point in time.
– Time series data: consists of observations over time.
– Panel data: consists of a time series for each cross-sectional member.
• The best empirical strategy depends crucially on the fundamental structure.
e.g.,) Cross section data:
Bangladesh Household Survey (1991)
nh year villid thanaid agehead sexhead educhead famsize hhland hhasset expfd expnfd exptot
11054 0 1 1 72 1 0 3 36 33295 3055.856 902.2549 3958.112
11061 0 1 1 35 1 5 10 116 180325 3031.017 421.2493 3452.266
11081 0 1 1 54 1 3 6 91 80735 3347.578 343.3877 3690.967
11101 0 1 1 44 1 5 7 8 16755 2106.833 265.8344 2372.667
12021 0 2 1 28 1 8 6 10 18795 3401.24 789.907 4191.147
12035 0 2 1 25 1 8 4 184 81105 3760.61 1482.338 5242.948
12051 0 2 1 63 1 6 8 16 20800 3097.738 807.8052 3905.543
12054 0 2 1 27 1 5 3 63 32717 2703.48 369.9149 3073.395
12081 0 2 1 26 1 0 6 10 26160 3188.72 487.4715 3676.191
12121 0 2 1 43 1 3 8 5 17080 2179.248 245.2691 2424.517
13014 0 3 1 38 1 7 5 26 30320 3115.934 731.7919 3847.726
13015 0 3 1 21 1 0 3 138 84103 3194.497 715.0913 3909.589
13021 0 3 1 45 1 10 5 73 61220 2634.16 302.0899 2936.25
13025 0 3 1 41 1 4 7 229 133391 3161.487 581.0589 3742.546
13035 0 3 1 37 1 0 14 960 636530 2972.095 864.5654 3836.66
13041 0 3 1 47 1 0 4 55 39100 2950.432 458.5735 3409.006
13054 0 3 1 60 1 5 10 15 22490 2471.259 413.6964 2884.955
13061 0 3 1 60 1 0 4 10 25310 2595.167 266.9247 2862.092
13091 0 3 1 63 1 0 3 19 34010 3055.856 413.3333 3469.19
13101 0 3 1 44 1 4 6 19 32460 2111.372 289.9497 2401.321
13111 0 3 1 34 1 0 3 3 4242 3130.953 648.6937 3779.647
13121 0 3 1 27 1 0 3 8 11615 2888.333 364.364 3252.696
e.g.,) Time series data: Japanese labor market
(Unit: ten thousand persons)
Year Employed persons Unemployed persons Not in labor force
2000 6446 320 4057
2001 6412 340 4125
2002 6330 359 4229
2003 6316 350 4285
2004 6329 313 4336
2005 6356 294 4346
2006 6389 275 4358
2007 6427 257 4375
2008 6409 265 4407
2009 6314 336 4446
2010 6298 334 4473
2011 6289 302 4517
2012 6270 285 4540
2013 6311 265 4506
e.g.,) Panel data: Average monthly wage in each prefecture
Prefecture Year Monthly wage
Tokyo 2003 481,163
Tokyo 2004 491,189
Tokyo 2005 485,455
Osaka 2003 423,950
Osaka 2004 415,649
Osaka 2005 416,202
Hiroshima 2003 372,708
Hiroshima 2004 369,635
Hiroshima 2005 367,461
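A minimal Python sketch of how the three structures relate, using the wage figures from the panel table above. The names `panel`, `cross_section_2004` and `tokyo_series` are illustrative and not part of the original material.

```python
# Panel data: one observation per (prefecture, year) pair.
panel = {
    ("Tokyo", 2003): 481163,
    ("Tokyo", 2004): 491189,
    ("Tokyo", 2005): 485455,
    ("Osaka", 2003): 423950,
    ("Osaka", 2004): 415649,
    ("Osaka", 2005): 416202,
    ("Hiroshima", 2003): 372708,
    ("Hiroshima", 2004): 369635,
    ("Hiroshima", 2005): 367461,
}

# Fixing the year gives a cross section: many units at one point in time.
cross_section_2004 = {
    pref: wage for (pref, year), wage in panel.items() if year == 2004
}

# Fixing the unit gives a time series: one unit observed over time.
tokyo_series = {
    year: wage for (pref, year), wage in panel.items() if pref == "Tokyo"
}

print(cross_section_2004)
print(tokyo_series)
```

Keying each observation by both unit and time is what distinguishes a panel: dropping either part of the key collapses it to one of the two simpler structures.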
Repeated cross section data
• To increase the sample size, we often merge two or more cross-section datasets from different years.
• What should we call such data? ⇒ Repeated cross-section data
• Repeated cross-section data have the same properties as cross-section data.
e.g.,) Wages of faculty staff
Name Year Wage
Prof K 2013 600
Prof I 2013 1100
Prof U 2014 400
Prof T 2014 700
Distribution
• The values of the variables in your data vary from observation to observation.
⇒ We say that the data have a distribution.
At the starting point of the analysis, we should grasp the distribution of our own data.
• Can we grasp the characteristics of the data just by looking at the raw data?
⇒ If the sample size is not quite small,
⇒ We should use .
Example
Data
ID Gender
1 Male
2 Female
3 Female
4 Male
5 Male
6 Female
7 Male
Distribution table
Gender Frequency of observations
Female 3
Male 4
• If the number of possible values is small ⇒ We can make a distribution table.
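The distribution table for the example can be reproduced in a few lines. This is only a sketch in Python using the standard library; the variable names `gender` and `freq` are chosen here for illustration.

```python
from collections import Counter

# Gender of the seven observations (ID 1 to 7) in the example data.
gender = ["Male", "Female", "Female", "Male", "Male", "Female", "Male"]

# Count how many observations take each possible value.
freq = Counter(gender)

for value in sorted(freq):
    print(value, freq[value])
```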
Histogram
• If the number of possible values is large, we often use a histogram.
• We can make the histogram by the following steps.
1. Determine the range of each category.
2. Calculate the relative frequency (the fraction of the number of observations) in each category.
3. Draw bars: the relative frequency on the vertical axis and the data range on the horizontal axis.
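The three steps can be sketched in Python as follows. The data are the first ten `exptot` values from the Bangladesh cross-section example; the function name `relative_frequencies` and the bin choices (width 1000, starting at 2000) are assumptions made for this illustration only.

```python
def relative_frequencies(data, lower, width, n_bins):
    """Return the fraction of observations that fall in each bin.

    Bin k covers the range [lower + k*width, lower + (k+1)*width).
    Observations outside all bins are not counted in any bin.
    """
    counts = [0] * n_bins
    for x in data:
        k = int((x - lower) // width)
        if 0 <= k < n_bins:
            counts[k] += 1
    return [c / len(data) for c in counts]


# First ten values of total expenditure (exptot) from the cross-section table.
exptot = [3958.112, 3452.266, 3690.967, 2372.667, 4191.147,
          5242.948, 3905.543, 3073.395, 3676.191, 2424.517]

# Step 1: ranges [2000, 3000), [3000, 4000), [4000, 5000), [5000, 6000).
# Step 2: relative frequency in each range.
freqs = relative_frequencies(exptot, lower=2000, width=1000, n_bins=4)

# Step 3: draw one text bar per range.
for k, f in enumerate(freqs):
    left = 2000 + 1000 * k
    print(f"[{left}, {left + 1000}): {'#' * round(f * 20)} {f:.2f}")
```

A plotting library would normally draw the bars; the text bars here only keep the sketch free of external dependencies.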
e.g.,) Household expenditure in Bangladesh
[Figure: histogram. Horizontal axis: HH per capita total expenditure (Tk/year), 0 to 50,000.]
Mean and Median
• To capture the representative value of your data, we often use descriptive statistics such as the mean (average) and the median.
Mean: the value obtained by dividing the sum of the data by the number of observations,
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
where $n$ is the sample size, $x_i$ is the $i$-th observation, and $\sum_{i=1}^{n} x_i$ means the sum of $x_i$ for $i$ running from 1 to $n$.
Median: the value of the middle observation (the 50th percentile).
e.g., Wages of faculty staff
• What is the mean of data 1 and of data 2?
• What is the median?
Data 1
Name Wage
Prof K 600
Prof I 1100
Prof Y 400
Data 2
Name Wage
Prof K 800
Prof I 1500
Prof Y 0
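One way to check the answers is a short Python sketch using the standard `statistics` module. The dictionary names `data1` and `data2` are illustrative.

```python
from statistics import mean, median

# Wages from the two example datasets.
data1 = {"Prof K": 600, "Prof I": 1100, "Prof Y": 400}
data2 = {"Prof K": 800, "Prof I": 1500, "Prof Y": 0}

for label, data in (("Data 1", data1), ("Data 2", data2)):
    wages = list(data.values())
    # mean: sum of the wages divided by the number of observations
    # median: the middle value after sorting
    print(label, "mean:", mean(wages), "median:", median(wages))
```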
Mean VS Median
• Which is the better statistic to represent the "representative value"?
⇒ We often use the mean, but it is more sensitive to abnormal values than the median.
• If the values of the mean and the median are totally different, you should report both in your paper.
e.g., What value is "representative"?
Name Wages
Prof. I ¥
Prof. Y ¥
Prof. K ¥
Prof. U ¥
Prof. F
Variance
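• To measure the dispersion or the spread of a probability distribution, we often use the variance and the standard deviation.
Variance ($\mathrm{var}(x)$)
⇒ The mean of the squared deviations from the mean: $\mathrm{var}(x) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$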
Standard Deviation
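Standard deviation ($\mathrm{sd}(x)$): the square root of the variance, $\mathrm{sd}(x) = \sqrt{\mathrm{var}(x)}$
e.g., Wages of faculty staff
• What is the variance of the wage in data 1 and in data 2?
Data 1
Name Wage
Prof K 60
Prof I 110
Prof Y 40
Data 2
Name Wage
Prof K 80
Prof I 160
Prof Y 0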
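A sketch of the calculation in Python, assuming the variance is taken literally as the mean of the squared deviations (dividing by n rather than n − 1; that convention is an assumption here). The function names `variance` and `std_dev` are illustrative.

```python
from math import sqrt


def variance(xs):
    """Mean of the squared deviations from the mean (divides by n)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def std_dev(xs):
    """Square root of the variance."""
    return sqrt(variance(xs))


data1 = [60, 110, 40]   # Prof K, Prof I, Prof Y
data2 = [80, 160, 0]

for label, wages in (("data 1", data1), ("data 2", data2)):
    print(label, "variance:", variance(wages), "sd:", std_dev(wages))
```

Under this convention the functions agree with `statistics.pvariance` and `statistics.pstdev`; `statistics.variance` divides by n − 1 instead and gives a larger value.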
e.g., Household expenditure in Bangladesh
Mean: 5,473.268
Median: 4,432.264
Variance: 17,100,000
Standard deviation: 4,140.221
Question
Question 1: What is the mean of the data below?
Question 2: What is the median?
Question 3: What is the variance?
Name Wage
Prof K 200
Prof I 1100
Prof Y 500
Descriptive statistics for two variables
• Most of the interesting questions in social science involve two or more variables.
– Can microfinance programs improve households' welfare?
– Can teachers' training improve the educational attainment of their students?
⇒These questions concern the distribution of two variables, considered together.
⇒ We should study the descriptive statistics to understand the relationship between variables.
Scatter graph
• Each observation is displayed as a point.
• The value of one variable determines the position on the horizontal axis, and the value of the other variable determines the position on the vertical axis.
[Figure: scatter graph. Horizontal axis: Education of HH head (years), 0 to 15. Vertical axis: 0 to 50,000.]
Conditional mean
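Conditional mean ($E[x \mid m = A]$): Mean of x among observations with m = A
⇔ Mean of x in the sub-sample with m = A
Example
$E[\text{Income} \mid \text{Gender} = \text{Male}] = {?}$
$E[\text{Income} \mid \text{Gender} = \text{Female}] = {?}$
Name Income Gender
1 100 Male
2 300 Female
3 500 Female
4 300 Male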
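The sub-sample reading of the conditional mean translates directly into code. The following Python sketch uses the four observations of the example; the function name `conditional_mean` is illustrative.

```python
# (Name, Income, Gender) for the four observations in the example.
rows = [
    (1, 100, "Male"),
    (2, 300, "Female"),
    (3, 500, "Female"),
    (4, 300, "Male"),
]


def conditional_mean(rows, group):
    """Mean of income in the sub-sample whose gender equals `group`."""
    sub = [income for _, income, gender in rows if gender == group]
    return sum(sub) / len(sub)


for group in ("Male", "Female"):
    print(group, conditional_mean(rows, group))
```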
Conditional variance
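Conditional variance ($\mathrm{var}(x \mid m = A)$): Variance of x among observations with m = A
⇔ Variance of x in the sub-sample with m = A
Example
$\mathrm{var}(\text{Income} \mid \text{Gender} = \text{Male}) = {?}$
$\mathrm{var}(\text{Income} \mid \text{Gender} = \text{Female}) = {?}$
Name Income Gender
1 100 Male
2 300 Female
3 500 Female
4 300 Male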
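The conditional variance can be sketched the same way: restrict the data to the sub-sample, then compute the variance there. The code below assumes the divide-by-n convention for the variance (an assumption, not confirmed by the slides); `conditional_variance` is an illustrative name.

```python
# (Name, Income, Gender) for the four observations in the example.
rows = [
    (1, 100, "Male"),
    (2, 300, "Female"),
    (3, 500, "Female"),
    (4, 300, "Male"),
]


def conditional_variance(rows, group):
    """Variance of income (dividing by n) in the sub-sample with gender == group."""
    sub = [income for _, income, gender in rows if gender == group]
    m = sum(sub) / len(sub)
    return sum((x - m) ** 2 for x in sub) / len(sub)


for group in ("Male", "Female"):
    print(group, conditional_variance(rows, group))
```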
e.g., Household expenditure in Bangladesh
• e: household's expenditure
• m: =1 if a household has female members who use microfinance, and =0 if not.
$E[e \mid m = \ldots] = \ldots$
$\mathrm{var}(e \mid m = \ldots) = \ldots$
$E[e \mid m = \ldots] = \ldots$
$\mathrm{var}(e \mid m = \ldots) = \ldots$
Covariance
Measure of the extent to which two variables move together.
• The covariance between x and y can be defined as
$\mathrm{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
⇒ The covariance is positive: when x is greater than its mean, y tends to be greater than its mean (and when x is less than its mean, y tends to be less than its mean).
⇒ The covariance is negative: when x is greater than its mean, y tends to be less than its mean (and when x is less than its mean, y tends to be greater than its mean).
Example: Covariance between wage and height
What is the covariance between wage and height?
Name Wage Height
Prof Y 600 170
Prof I 1100 180
Prof K 400 160
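A Python sketch of the covariance calculation for this example. It assumes the covariance is the mean of the products of deviations (dividing by n; the convention is an assumption), and `covariance` is an illustrative name.

```python
def covariance(xs, ys):
    """Mean of the products of deviations from the means (divides by n)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


wage = [600, 1100, 400]     # Prof Y, Prof I, Prof K
height = [170, 180, 160]

print(covariance(wage, height))
```

The result is positive: the professor with the above-average wage is also the taller one, and the one with the lowest wage is the shortest.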
Question
What are the means of wage and height?
What is the covariance between wage and height?
Name Wage Height
Prof Y 600 170
Prof I 900 160
Prof K 600 180
Covariance: Unit problem
• The size of the covariance depends crucially on the units of the variables.
e.g.,) The covariance between GDP per capita ($) and unemployment rates in 2010.
• If we measure GDP per capita in $, the covariance is about -10258.
• If we measure GDP per capita in ¥ (1$=100¥), the covariance is about -1025778.
⇒ The values of the covariance are totally different.
⇒ This problem is called the unit problem.
⇒ To grasp the relationship between two variables, we should use a measure that is free of the unit problem.
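The change of units can be worked out directly. As a sketch, assume the covariance is the mean of the products of deviations (the 1/n convention is an assumption) and let c be the conversion factor, c = 100 in the example. Since the mean of c·x is c·x̄:

```latex
\begin{aligned}
\mathrm{cov}(c\,x,\, y)
  &= \frac{1}{n}\sum_{i=1}^{n} (c\,x_i - c\,\bar{x})(y_i - \bar{y}) \\
  &= c \cdot \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \\
  &= c \cdot \mathrm{cov}(x,\, y)
\end{aligned}
```

With c = 100 this gives about -10258 × 100 ≈ -1,025,800, in line with the yen figure above up to rounding. The same factoring goes through if the sum is divided by n − 1 instead of n.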
Correlation
Correlation: a measure of the extent to which two variables move together.
• The correlation between x and y can be defined as
$\mathrm{corr}(x, y) = \frac{\mathrm{cov}(x, y)}{\mathrm{sd}(x)\,\mathrm{sd}(y)}$
• $-1 \le \mathrm{corr}(x, y) \le 1$
⇒ The correlation is close to 1: the relationship between x and y is positive and very strong.
⇒ The correlation is close to -1: the relationship between x and y is negative and very strong.
e.g., Household expenditure and years of education
Covariance: 4,011.88
Correlation: 0.2788
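The following Python sketch computes the correlation as the covariance divided by the product of the two standard deviations, and checks that rescaling one variable leaves it unchanged. It reuses the wage and height example; the factor 100 and the name `wage_x100` are hypothetical, and the divide-by-n convention is an assumption (it cancels in the ratio).

```python
from math import sqrt


def correlation(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)


wage = [600, 1100, 400]
height = [170, 180, 160]

# Hypothetical change of units: the same wages expressed in a unit 100 times smaller.
wage_x100 = [100 * w for w in wage]

print(correlation(wage, height))
print(correlation(wage_x100, height))
```

Rescaling wage multiplies both the covariance and sd(wage) by 100, so the factor cancels and the two printed values coincide up to floating-point rounding.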
Rule of mean
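• Suppose there are variables $y, x_1, x_2, x_3, \dots, x_k$. If $y = x_1 + x_2 + x_3 + \dots + x_k$, ⇒ $\bar{y} = \bar{x}_1 + \bar{x}_2 + \bar{x}_3 + \dots + \bar{x}_k$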
and
Example: test score
Student ID Test score: Reading Test score: Math
1 40 50
2 40 100
3 70 30
What is the mean of the total test score (reading score + math score)?
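A short Python check of the example: the mean of the total score can be computed either from the per-student totals or as the sum of the two means. The variable names are illustrative.

```python
def mean(xs):
    """Sum of the values divided by the number of observations."""
    return sum(xs) / len(xs)


reading = [40, 40, 70]        # students 1, 2, 3
math_score = [50, 100, 30]

# Total score of each student.
total = [r + m for r, m in zip(reading, math_score)]

# Two routes to the same number.
print(mean(total))
print(mean(reading) + mean(math_score))
```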
Conclusion
• We studied
– Key classification of data: cross-section, time series, and panel data.
– Some key descriptive statistics: mean, median, variance, standard deviation, covariance, and correlation.