Rubin's Model for Causal Inference: a review
Jinfang Wang (江 金芳)
Graduate School of Science, Chiba University
1 Introduction
Rubin's model for causal inference, or simply the Rubin causal model (RCM), sometimes referred to as the Neyman-Rubin causal model or the Neyman-Rubin-Holland model for causal inference, was developed in a series of papers by Rubin ([31, 32, 33]), though the RCM may be traced back to the work of Neyman ([24]), while Holland ([18]) and Holland and Rubin ([19]) provide penetrating reviews of this model. A more complete picture of the RCM may be found in a collection of papers by Rubin ([36]). This model has found applications in a diversity of areas including statistics, medicine, economics, political science, sociology and law, among others; see [39] for references on some of the recent applications. Recently some rigorous results on the RCM have been established in the econometric literature; see, e.g., [13], [17], [1].
In this paper we give a review of the RCM with emphasis on the more recent developments.
2 Rubin Causal Model
Borrowing from the language of design of experiments, suppose that we have a population from which we draw a random sample of $n$ units. Each unit can be exposed to either a treatment or a control. Let $Z_{i}$ represent a random variable of treatment assignment so that $Z_{i}=1$ if the $i$th unit is assigned to the treatment group and $Z_{i}=0$ if the $i$th unit is assigned to the control group. Thus, the $i$th unit has two potential outcomes, $Y_{i}(1)$ if it is exposed to the treatment when $Z_{i}=1$, or $Y_{i}(0)$ if it is exposed to the control when $Z_{i}=0$. The observed data on the $i$th unit consist of the pair $(Z_{i}, Y_{i})$, where
$Y_{i}=Z_{i}Y_{i}(1)+(1-Z_{i})Y_{i}(0)$.
The effect caused by the treatment for the $i$th unit (relative to the control), or simply the treatment effect for the $i$th unit, is defined as the difference $Y_{i}(1)-Y_{i}(0)$. This quantity measures the gain in the outcome variable under the assignment to the treatment relative to the control. We suppose that each unit can be exposed to only the treatment or the control; therefore we can observe either $Y_{i}(1)$ or $Y_{i}(0)$, but never both. That is, either $Y_{i}(1)$ or $Y_{i}(0)$ is missing for the $i$th unit, implying that the treatment effect for the $i$th unit is not observable. This fact is called by Holland ([18]) the fundamental problem of causal inference.
To statistically overcome the fundamental problem of causal inference, the first thing we do is to replace the inferential goal of estimating the treatment effect for an individual unit by the problem of estimating the average treatment effect:
$\theta=E\{Y_{i}(1)\}-E\{Y_{i}(0)\}$ (1)
where the expectation is assumed to be independent of $i$. We note that since the operational meanings of the two random variables $Y_{i}(0)$ and $Y_{i}(1)$ involve the random variable $Z_{i}$, the expectations $E\{Y_{i}(1)\}$ and $E\{Y_{i}(0)\}$ almost always depend on the distribution of $Z_{i}$, that is, the mechanism of the treatment assignment. More explicitly, we can write $E\{Y_{i}(1)\}=E[E\{Y_{i}(1)|Z_{i}\}]$ and similarly for $E\{Y_{i}(0)\}$. The average treatment effect $\theta$ has the potential to be estimated because potential outcomes $Y_{i}(1)$ and $Y_{i}(0)$ on different units may now be used to estimate the expectations $E\{Y_{i}(1)\}$ and $E\{Y_{i}(0)\}$.
To achieve this goal, further fundamental assumptions on the treatment assignment mechanism are however required, since the observed data $(Z_{i}, Y_{i})$ only provide information on the expectations $E\{Y_{i}|Z_{i}=1\}=E\{Y_{i}(1)|Z_{i}=1\}$ and $E\{Y_{i}|Z_{i}=0\}=E\{Y_{i}(0)|Z_{i}=0\}$.
The fundamental problem of causal inference can be overcome by considering two such assumptions, namely the independence assumption ([18]) and the assumption of strong ignorability ([29]). Both conditions are natural in the sense that they can be derived when one considers the relations between the expectations $E\{Y_{i}(1)\}$, $E\{Y_{i}(0)\}$ and the conditional expectations $E\{Y_{i}(1)|Z_{i}=1\}$, $E\{Y_{i}(0)|Z_{i}=0\}$.
The independence assumption concerns the classical case of a randomized experiment, where we assume that the treatment assignment $Z_{i}$ is independent of the potential outcomes $(Y_{i}(1), Y_{i}(0))$ and all other potential confounding variables. Causal inference for a randomized experiment is straightforward because under this independence assumption we have the basic identities
$E\{Y_{i}(1)\}=E\{Y_{i}(1)|Z_{i}=1\}$
$E\{Y_{i}(0)\}=E\{Y_{i}(0)|Z_{i}=0\}$.
Thus the independence assumption ensures that
$\theta=E\{Y_{i}(1)|Z_{i}=1\}-E\{Y_{i}(0)|Z_{i}=0\}$ (2)
So the sample difference between the two groups in this case will give an unbiased estimate for $\theta$. The large body of the literature on causal inference however concerns the second case, when the experiment is not randomized, that is, the independence assumption does not hold true. These cases are known as nonrandomized experiments or observational studies ([28]), and will be the topic of the rest of this paper.
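As a rough illustration of identity (2), the following Python sketch simulates a hypothetical randomized experiment and computes the sample treatment-control difference. The data-generating process (standard normal $Y_{i}(0)$, a constant treatment effect of 2, a fair-coin assignment) and the function names are assumptions made for this example only; they are not taken from the literature reviewed here.

```python
import random

def difference_in_means(z, y):
    """Sample treatment-control difference: mean(y | z=1) - mean(y | z=0)."""
    treated = [yi for zi, yi in zip(z, y) if zi == 1]
    control = [yi for zi, yi in zip(z, y) if zi == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

def simulate_randomized(n, seed=0):
    """Hypothetical data: Y(0) ~ N(0, 1), Y(1) = Y(0) + 2, and Z a fair coin
    drawn independently of the potential outcomes (the independence assumption)."""
    rng = random.Random(seed)
    z, y = [], []
    for _ in range(n):
        y0 = rng.gauss(0.0, 1.0)
        y1 = y0 + 2.0
        zi = 1 if rng.random() < 0.5 else 0
        z.append(zi)
        y.append(zi * y1 + (1 - zi) * y0)  # only one potential outcome is observed
    return z, y

z, y = simulate_randomized(20000)
theta_hat = difference_in_means(z, y)  # should be close to the true effect of 2
```

Because the assignment in this sketch ignores the potential outcomes, the two group means estimate $E\{Y_{i}(1)\}$ and $E\{Y_{i}(0)\}$ directly; the same function applied to confounded data would in general be biased.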
In order to estimate the average treatment effect in observational studies, we assume, as is usually the case, that in addition to $(Z_{i}, Y_{i})$, we also observe for each unit $i$ the value of a pretreatment variable $X_{i}$, a vector of length $p$. The value of the pretreatment variable $X_{i}$ usually measures the characteristics of the $i$th unit (e.g., gender, parents' educational level, etc.) before the treatment assignment, and thus is not affected by the treatment. We now relax the independence assumption of a randomized experiment by the following assumption of strong ignorability ([29]):
(i) (Unconfoundedness) $(Y_{i}(1), Y_{i}(0))\perp Z_{i}|X_{i}$
(ii) (Overlap) $0<Pr(Z_{i}=1|X_{i}=x)<1$
where $A\perp B|C$ is Dawid's ([8]) notation denoting the conditional independence of $A$ and $B$ given $C$.
The conditional probability of assignment to treatment given the pretreatment variable is known as the propensity score ([29]):
$e(x)=Pr(Z_{i}=1|X_{i}=x)=E\{Z_{i}|X_{i}=x\}$ (3)
To see why strongly ignorable treatment assignment should lead to an estimation procedure for the average treatment effect, we note the following basic identity under the unconfoundedness assumption:
$E\{Y_{i}(z)|X_{i}=x\}=E\{Y_{i}(z)|Z_{i}=z, X_{i}=x\}=E\{Y_{i}|Z_{i}=z, X_{i}=x\}$ (4)
where $z$ takes values $0$ or $1$. Thus, estimation of the average treatment effect $\theta$ can be done by first estimating the average treatment effect for a subpopulation at $X=x$:
$\theta(x)=E\{Y_{i}|Z_{i}=1, X_{i}=x\}-E\{Y_{i}|Z_{i}=0, X_{i}=x\}$,
by using the averaged sample treatment-control difference within the subpopulation at $X=x$. We then average this difference over all possible values of $x$ to give an unbiased estimator for $\theta$ because we have
$\theta=E[E\{Y_{i}(1)|X_{i}\}-E\{Y_{i}(0)|X_{i}\}]=E\{\theta(X_{i})\}$. (5)
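For a covariate with finitely many values, identities (4) and (5) suggest a simple stratified estimator. The sketch below is only one possible reading of that recipe: the function name `stratified_ate` and the toy data are hypothetical, and the code assumes that every observed value of $x$ contains both treated and control units.

```python
from collections import defaultdict

def stratified_ate(x, z, y):
    """Average the within-stratum treatment-control differences theta(x)
    over the empirical distribution of x, following (4) and (5)."""
    groups = defaultdict(lambda: {0: [], 1: []})
    for xi, zi, yi in zip(x, z, y):
        groups[xi][zi].append(yi)
    n = len(x)
    total = 0.0
    for xv, g in groups.items():
        if not g[0] or not g[1]:
            # Overlap fails at this value: theta(x) cannot be estimated.
            raise ValueError(f"only treated or only control units at x={xv!r}")
        theta_x = sum(g[1]) / len(g[1]) - sum(g[0]) / len(g[0])
        total += (len(g[0]) + len(g[1])) / n * theta_x
    return total

# Hypothetical toy data with two strata, "a" and "b".
x = ["a", "a", "a", "b", "b", "b", "b", "b"]
z = [1, 1, 0, 1, 0, 0, 0, 0]
y = [5.0, 7.0, 2.0, 10.0, 8.0, 9.0, 10.0, 9.0]
theta_hat = stratified_ate(x, z, y)  # (3/8) * 4 + (5/8) * 1 = 2.125
```

In this toy data the unadjusted treated-minus-control difference is negative, while both within-stratum differences are positive, which is the kind of imbalance the conditioning on $X_{i}$ is meant to remove.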
Thus, in observational studies the fundamental problem of causal inference is now overcome by the additional knowledge of pretreatment variables and the unconfoundedness assumption. Note that the overlap assumption is crucial in estimating $\theta(x)$, for violation of this assumption at $x$ will mean that there are only treated or only control units at $x$, thus making the estimation of either $E\{Y_{i}(1)|X_{i}=x\}$ or $E\{Y_{i}(0)|X_{i}=x\}$ an impossibility. It is also worth noting that the basic equation (4) itself may be used as a weaker assumption instead of unconfoundedness in order to estimate the average treatment effect ([15]). The assumption (4) is however almost as difficult to verify as the unconfoundedness condition in practice.
To conclude this section we note that although we shall focus on estimation of the average treatment effect $\theta$, there is also considerable interest in the literature in estimation of the treatment effect for the treated (e.g., [14], [15], [16]):
$\theta_{T}=E\{Y_{i}(1)-Y_{i}(0)|Z_{i}=1\}$
We need the strong ignorability assumption here as well for estimating $\theta_{T}$ because $Y_{i}(0)$ are never observed for the treated units.
3 Estimating the Average Treatment Effect
3.1 Regression Estimators
Regression adjustment for estimating the average treatment effect in observational studies has a long history (e.g., [25], [3], [6], [7], [32], etc.). The idea is to use regression techniques to find estimates $\hat{\mu}_{1}(x)$ and $\hat{\mu}_{0}(x)$ of the two regression functions in (4), namely, $\mu_{1}(x)=E\{Y_{i}(1)|X_{i}=x\}$ and $\mu_{0}(x)=E\{Y_{i}(0)|X_{i}=x\}$. By (5), we then average the difference $\hat{\mu}_{1}(x)-\hat{\mu}_{0}(x)$ over the empirical distribution of $x$ to get an unbiased estimate of $\theta$:
$\hat{\theta}=\frac{1}{n}\sum_{i}\{\hat{\mu}_{1}(x_{i})-\hat{\mu}_{0}(x_{i})\}$ (6)
where $\hat{\mu}_{1}(x_{i})$ and $\hat{\mu}_{0}(x_{i})$ are estimated, due to unconfoundedness, using the treatment group samples and the control group samples respectively. For instance, suppose that we may assume, as in [32], that the regression functions are linear in $x$:
$\mu_{1}(x)=\alpha_{1}+\beta_{1}x$
$\mu_{0}(x)=\alpha_{0}+\beta_{0}x$
where $x$ is a univariate continuous covariate. Let $\hat{\beta}_{1},\hat{\beta}_{0}$ denote the respective within-sample least squares estimators and $\overline{y}_{1},\overline{x}_{1},\overline{y}_{0},\overline{x}_{0}$ the respective within-sample means. The predicted values of the regression functions are then given by
$\hat{\mu}_{1}(x)=\overline{y}_{1}+\hat{\beta}_{1}(x-\overline{x}_{1})$
$\hat{\mu}_{0}(x)=\overline{y}_{0}+\hat{\beta}_{0}(x-\overline{x}_{0})$
Note that the predictors $\hat{\mu}_{1}(x)$ and $\hat{\mu}_{0}(x)$ rely on extrapolation, and thus the estimator (6) under this linear model can have poor properties when the distributions of the covariate differ significantly in the treated and control groups.
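The linear-model version of estimator (6) can be sketched in a few lines of Python. The helper names `fit_line` and `regression_ate` and the toy data are hypothetical and serve only to make the formulas concrete for a univariate covariate; they are not taken from [32].

```python
def fit_line(x, y):
    """Within-group least squares; returns (x mean, y mean, slope)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return xbar, ybar, sxy / sxx

def regression_ate(x, z, y):
    """Estimator (6) with separate linear fits in the treated and control groups."""
    fits = {}
    for g in (0, 1):
        xg = [xi for xi, zi in zip(x, z) if zi == g]
        yg = [yi for yi, zi in zip(y, z) if zi == g]
        fits[g] = fit_line(xg, yg)

    def mu_hat(g, t):
        # Predicted value ybar_g + beta_hat_g * (t - xbar_g).
        xbar, ybar, slope = fits[g]
        return ybar + slope * (t - xbar)

    # Average the predicted difference over all n units, treated and control.
    return sum(mu_hat(1, xi) - mu_hat(0, xi) for xi in x) / len(x)

# Hypothetical toy data: treated lie on y = 3 + 2x, controls on y = 1 + x.
x = [1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 5.0]
z = [1, 1, 1, 0, 0, 0, 0]
y = [5.0, 7.0, 9.0, 1.0, 2.0, 3.0, 6.0]
theta_hat = regression_ate(x, z, y)  # mean of 2 + x_i over all units, i.e. 4 up to rounding
```

Note that the control unit at $x=5$ lies outside the range of the treated covariates, so $\hat{\mu}_{1}(5)$ in this sketch is a pure extrapolation of the treated-group line.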
Recently attempts have been made which focus on nonparametric estimation of the regression functions $\mu_{1}(x)$ and $\mu_{0}(x)$. One such method is to use the method of sieves ([11]); see, e.g., [20] and [4]. The resulting estimator is efficient in the sense defined in the next section. Another approach to estimating $\mu_{1}(x)$ and $\mu_{0}(x)$ is to use kernel methods (see, e.g., [14], [15], [16]).
3.2 Matching Estimators
Estimators constructed by matching the covariates $X$ are among the most popular estimators due to their algorithmic simplicity. These estimators closely resemble the nonparametric kernel regression estimators, where the number of matched samples plays the role of the bandwidth in kernel regression. Large sample properties of simple matching estimators were however established only recently ([1]). When the covariate $X$ is used in matching, the Mahalanobis distance between two multivariate observations is usually employed (e.g., [7], [34], [35]). The unidimensional propensity score can also be used in matching. We will discuss propensity score matching in the next subsection.
Matching estimators are usually used to estimate the average treatment effect for the treated $\theta_{T}$. This is because in many observational studies there are more controls than treated units, so that it is easier to impute the missing values $Y_{i}(0)$ for units with $Z_{i}=1$. To find an imputed value for $Y_{i}(0)$ we compute the distance $d(X_{j}, X_{i})$ for all $X_{j}$ in the control group, then retain the $k$ units closest to $X_{i}$. Then we use the average value
$\hat{Y}_{i}(0)=\frac{1}{k}\sum_{j}Y_{j}(0)$
as a predicted value for $Y_{i}(0)$, where the sum is over the $k$ retained control units. Here $k$ is an arbitrary integer, usually small, such as two or one. The estimator takes the form
$\hat{\theta}_{T}=\frac{\sum_{i}\{Z_{i}Y_{i}(1)-Z_{i}\hat{Y}_{i}(0)\}}{\sum_{i}Z_{i}}$ (7)
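The matching recipe leading to (7) can be sketched as follows. This is a minimal illustration under stated assumptions: controls are reused across treated units (matching with replacement), the default distance is the absolute difference of a scalar covariate, and the name `att_matching` and the toy data are hypothetical.

```python
def att_matching(x, z, y, k=1, dist=lambda a, b: abs(a - b)):
    """k-nearest-neighbour matching estimator (7) of the effect on the treated.
    Each treated unit is matched, with replacement, to its k closest controls."""
    controls = [(xj, yj) for xj, zj, yj in zip(x, z, y) if zj == 0]
    diffs = []
    for xi, zi, yi in zip(x, z, y):
        if zi != 1:
            continue
        nearest = sorted(controls, key=lambda c: dist(c[0], xi))[:k]
        y0_hat = sum(yj for _, yj in nearest) / len(nearest)  # imputed Y_i(0)
        diffs.append(yi - y0_hat)
    return sum(diffs) / len(diffs)

# Hypothetical toy data: two treated units and four controls.
x = [1.0, 3.0, 0.9, 2.0, 3.2, 10.0]
z = [1, 1, 0, 0, 0, 0]
y = [5.0, 9.0, 2.0, 4.0, 6.0, 0.0]
att_k1 = att_matching(x, z, y, k=1)
att_k2 = att_matching(x, z, y, k=2)
```

A different `dist` argument, for instance a Mahalanobis distance on vector-valued covariates, can be passed without changing the rest of the sketch.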
If both groups are of relatively large size then we can impute either $Y_{i}(1)$ or $Y_{i}(0)$ for all $i=1,\ldots,n$. Let $\hat{Y}_{i}(1)$ or $\hat{Y}_{i}(0)$ be the imputed values (the observed outcome being used for the potential outcome that is not missing); then the resulting matching estimator for the average treatment effect $\theta$ is simply taken as the averaged difference
$\hat{\theta}=\frac{1}{n}\sum_{i}(\hat{Y}_{i}(1)-\hat{Y}_{i}(0))$ (8)
In [1] it is shown that the estimator (8) has a bias of order $O(n^{-1/k})$, where $k$ now denotes the number of continuous components of $X$. So if $k\geq 2$, when enlarged by a factor $\sqrt{n}$, the bias of this estimator will not vanish as $n\rightarrow\infty$, although this bias may not be so large in practice as to concern the practitioner.
In the above discussion one usually uses the Mahalanobis metric to measure the distance between $X_{i}$ and $X_{j}$,
$d(X_{i}, X_{j})=\sqrt{(X_{i}-X_{j})'V^{-1}(X_{i}-X_{j})}$ (9)
where $V$ is the estimated covariance matrix of $X$. In [34], $V$ is taken to be the pooled within-sample covariance matrix
$V=\frac{(X_{1}'X_{1}-n_{1}\overline{X}_{1}'\overline{X}_{1})+(X_{2}'X_{2}-n_{2}\overline{X}_{2}'\overline{X}_{2})}{n-2}$
where $X_{i}$ is the $n_{i}\times p$ data matrix for the $i$th group.
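The Mahalanobis metric (9) with the pooled within-sample covariance matrix can be computed without any external library, as in the sketch below. The function names are hypothetical, and the small Gaussian-elimination solver is an implementation choice made here to avoid forming $V^{-1}$ explicitly; it assumes that $V$ is nonsingular.

```python
import math

def pooled_cov(X1, X2):
    """Pooled within-sample covariance of two groups (lists of p-vectors),
    with divisor n - 2 as in the formula for V."""
    p = len(X1[0])
    S = [[0.0] * p for _ in range(p)]
    for X in (X1, X2):
        mean = [sum(row[a] for row in X) / len(X) for a in range(p)]
        for row in X:
            for a in range(p):
                for b in range(p):
                    S[a][b] += (row[a] - mean[a]) * (row[b] - mean[b])
    n = len(X1) + len(X2)
    return [[s / (n - 2) for s in r] for r in S]

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][k] * w[k] for k in range(r + 1, n))) / M[r][r]
    return w

def mahalanobis(xi, xj, V):
    """Distance (9): sqrt((xi - xj)' V^{-1} (xi - xj))."""
    d = [a - b for a, b in zip(xi, xj)]
    return math.sqrt(sum(di * wi for di, wi in zip(d, solve(V, d))))

# Hypothetical two-dimensional covariates for a treated and a control group.
V = pooled_cov([[0.0, 0.0], [2.0, 0.0]], [[0.0, 1.0], [0.0, 3.0], [0.0, 2.0]])
d = mahalanobis([2.0, 0.0], [0.0, 0.0], V)
```

The resulting function can be passed as the distance in a nearest-neighbour matching routine, for example via `lambda a, b: mahalanobis(a, b, V)`.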
To achieve even better balance in the covariate between the treated and control groups, in [10] and [38] the Mahalanobis distance (9) is generalized to
$d_{G}(X_{i}, X_{j})=\sqrt{(X_{i}-X_{j})'(V^{-1/2})'WV^{-1/2}(X_{i}-X_{j})}$ (10)
where $V^{-1/2}$ is the Cholesky decomposition of $V$ and $W$ is a $p\times p$ positive definite weight matrix ... has a reliable model for $e(X_{i})$.
This method is called genetic matching because a genetic algorithm ([22], [40]) is used to estimate the components of the weight matrix $W$. By giving specific weights to $W$ in (10), the genetic matching estimator using metric (10) reduces to the Mahalanobis matching estimator or the propensity score matching estimator. The genetic matching method may have merits relative to Mahalanobis matching, especially when the covariate has a large dimension and is nonellipsoidally distributed ([36, p. 462]). For some applications of this method see [26], [23] and [12].
When matching is applied to the covariate $X$, the metric used plays an important role. See also [41] for an alternative metric which takes into account the correlation of $X_{i}$, $Z_{i}$ and $(Y_{i}(1), Y_{i}(0))$.
3.3 Propensity Score Methods
Significant progress has been made on estimating the average treatment effect under the RCM by the discovery of a property of the propensity score ([29]). This property says that if treatment assignment $Z_{i}$ is unconfounded given the pretreatment variable $X_{i}$, then $Z_{i}$ is also unconfounded given the one-dimensional propensity score $e(X_{i})$. That is, under unconfoundedness, it holds that
$(Y_{i}(1), Y_{i}(0))\perp Z_{i}|e(X_{i})$ (11)
This property may be proved by showing that
$Pr\{Z_{i}=1|Y_{i}(1), Y_{i}(0), e(X_{i})\}=Pr\{Z_{i}=1|e(X_{i})\}$
which is equal to $e(X_{i})$. To show this, we express the probabilities as expectations and condition on the covariate $X_{i}$. Thus due to (11), the fundamental problem of causal inference can now be overcome by conditioning on the propensity score because of ([29])
$\theta=E[E\{Y_{i}(1)|Z_{i}=1, e(X_{i})\}-E\{Y_{i}(0)|Z_{i}=0, e(X_{i})\}]$ (12)
This is an important result because bias due to the imbalance of the covariate can now be corrected by conditioning on the univariate propensity score, not the covariate vector $X_{i}$. Now we discuss several methods for estimating $\theta$ which use the propensity score.
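The argument behind property (11) can be written out step by step. The following derivation is a sketch of the standard reasoning: the first equality in each chain uses iterated expectations (since $e(X_{i})$ is a function of $X_{i}$), and the second line uses unconfoundedness given $X_{i}$.

```latex
\begin{align*}
\Pr\{Z_i=1 \mid Y_i(1),Y_i(0),e(X_i)\}
 &= E\bigl[\,E\{Z_i \mid Y_i(1),Y_i(0),X_i\} \bigm| Y_i(1),Y_i(0),e(X_i)\bigr] \\
 &= E\bigl[\,E\{Z_i \mid X_i\} \bigm| Y_i(1),Y_i(0),e(X_i)\bigr] \\
 &= E\bigl[\,e(X_i) \bigm| Y_i(1),Y_i(0),e(X_i)\bigr] = e(X_i), \\
\Pr\{Z_i=1 \mid e(X_i)\}
 &= E\bigl[\,E\{Z_i \mid X_i\} \bigm| e(X_i)\bigr] = e(X_i).
\end{align*}
```

Both conditional probabilities therefore equal $e(X_{i})$, so $Z_{i}$ carries no further information about the potential outcomes once the propensity score is given.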
3.3.1 Matching
In Section 3.2 we discussed how to construct an estimator of $\theta$ by matching the covariate $X$. Due to (12) we can alternatively match on the propensity score $e(X)$ instead of the full covariate $X$. When the propensity score $e(X)$ is unknown we have to first estimate it, usually using a logistic regression model:
$e(X_{i})=\frac{e^{\beta'X_{i}}}{1+e^{\beta'X_{i}}}$
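A logistic propensity model of this form can be fitted by maximum likelihood. The sketch below uses Newton-Raphson for a single scalar covariate plus an intercept; the intercept, the function names and the toy data are assumptions of this illustration, and the code presumes the data are not perfectly separated (otherwise the maximum likelihood estimate does not exist).

```python
import math

def fit_logistic(x, z, iters=25):
    """Newton-Raphson for Pr(Z=1 | x) = 1 / (1 + exp(-(b0 + b1 * x)))."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, zi in zip(x, z):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += zi - p               # score with respect to b0
            g1 += (zi - p) * xi        # score with respect to b1
            h00 += w                   # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def propensity(xi, b0, b1):
    """Fitted propensity score at covariate value xi."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))

# Hypothetical toy data: 1 of 4 units treated at x = -1, 3 of 4 treated at x = 1.
x = [-1.0] * 4 + [1.0] * 4
z = [1, 0, 0, 0, 1, 1, 1, 0]
b0, b1 = fit_logistic(x, z)
scores = [propensity(xi, b0, b1) for xi in x]
```

With only two covariate values the model is saturated, so the fitted scores in this toy example reproduce the observed treatment fractions 0.25 and 0.75.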
To avoid side effects near zero and one, it is preferable to match on the linear predictor $\hat{\beta}'X_{i}$ instead of the estimated propensity score itself. Abadie and Imbens ([1]) then show that the matching estimator using the scalar propensity score produces a $\sqrt{n}$-consistent estimator.
3.3.2 Blocking
Blocking, or subclassification ([29]), is a method which divides the range of the unidimensional propensity score into $B$ blocks, usually of equal length. Within each block we treat the data as if they come from a randomized experiment, and therefore use the averaged treatment-control difference $\hat{\theta}_{b}$ to estimate the average treatment effect for the $b$th block. The blocking estimator for the average treatment effect is taken as the weighted mean
$\hat{\theta}=\sum_{b=1}^{B}\frac{n_{1b}+n_{0b}}{n}\hat{\theta}_{b}$ (13)
where $n_{1b}$ and $n_{0b}$ are the respective numbers of treated and controls in the $b$th block. Estimation of the variance of $\hat{\theta}$ in (13) is discussed in [21].
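The blocking estimator (13) is straightforward to compute once propensity scores are available. The following sketch assumes equal-length blocks on $[0,1]$ and takes the (estimated) propensity scores as given; the function name `blocking_ate` and the toy data are hypothetical.

```python
def blocking_ate(e, z, y, B=5):
    """Blocking estimator (13): subclassify on the propensity score into
    B equal-length blocks of [0, 1] and weight the within-block differences."""
    blocks = [{0: [], 1: []} for _ in range(B)]
    for ei, zi, yi in zip(e, z, y):
        b = min(int(ei * B), B - 1)  # block index; a score of 1.0 goes to the last block
        blocks[b][zi].append(yi)
    n = len(y)
    total = 0.0
    for g in blocks:
        nb = len(g[0]) + len(g[1])
        if nb == 0:
            continue  # empty blocks carry zero weight
        if not g[0] or not g[1]:
            raise ValueError("a block contains only treated or only control units")
        theta_b = sum(g[1]) / len(g[1]) - sum(g[0]) / len(g[0])
        total += nb / n * theta_b
    return total

# Hypothetical toy data with B = 2 blocks.
e = [0.2, 0.2, 0.3, 0.7, 0.8, 0.9]
z = [1, 0, 0, 1, 1, 0]
y = [3.0, 1.0, 2.0, 10.0, 12.0, 8.0]
theta_hat = blocking_ate(e, z, y, B=2)  # 0.5 * 1.5 + 0.5 * 3.0 = 2.25
```

Raising an error for a block without both groups is a design choice of this sketch; in practice one might instead merge neighbouring blocks or trim the sample.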
For a one-dimensional covariate, with equal-sized blocks and assuming normality, it is shown ([5]) that $B=5$ is adequate for removing more than 95% of the bias associated with the simple treatment-control difference. This is the reason that $B=5$ is usually employed in defining the block estimator ([30], [9], [2]).
3.3.3 Regression
In Section 3.1 we discussed the idea of estimating $\theta$ by using regression techniques to estimate the two conditional means $\mu_{1}(x)=E\{Y_{i}(1)|X_{i}=x\}$ and $\mu_{0}(x)=E\{Y_{i}(0)|X_{i}=x\}$. Due to (12), under unconfoundedness, we can alternatively estimate the regression functions
$\eta_{1}(p)=E\{Y_{i}(1)|e(X_{i})=p\}$ and $\eta_{0}(p)=E\{Y_{i}(0)|e(X_{i})=p\}$
Using estimates $\hat{\eta}_{1}(p)$ and $\hat{\eta}_{0}(p)$, we can estimate $\theta$ by
$\hat{\theta}=\frac{1}{n}\sum_{i}\{\hat{\eta}_{1}(e(x_{i}))-\hat{\eta}_{0}(e(x_{i}))\}$ (14)
Note that to use this estimator we have to specify a model for the propensity score $e(X_{i})$ in order to estimate $\hat{\eta}_{1}(p)$ and $\hat{\eta}_{0}(p)$. It is of interest to investigate conditions under which the estimator using $\hat{\eta}_{z}(p)$ may perform better than that using $\hat{\mu}_{z}(x)$. For estimator (14) to have a chance of success one needs a reasonably good model for the regression functions $\eta_{z}(p)$ ([21]).
3.3.4 Weighting
A weighting estimator for the average treatment effect takes the form
$\hat{\hat{\theta}}=\frac{1}{n}\sum_{i}\{\frac{Z_{i}Y_{i}}{\hat{e}(X_{i})}-\frac{(1-Z_{i})Y_{i}}{1-\hat{e}(X_{i})}\}$
where $\hat{e}(X_{i})$ is a nonparametric sieve estimator of the propensity score. This estimator is semiparametrically efficient. We will discuss this estimator in more detail in Section 4.2.
4 Semiparametric Efficiency Bounds and Efficient Estimation
4.1 Efficiency Bounds
In estimating the average treatment effect, it is the bias of the estimators rather than the variance of these estimators that should be of primary concern to the researcher ([36]). However, when an estimator is known to be unbiased or asymptotically unbiased, it is then of interest to consider the variance of such estimators. For instance, for a randomized experiment, it is known that the unbiased simple averaged treatment-control difference is not an efficient estimator for the average treatment effect ([17]). To construct an efficient estimator in this case with known constant propensity score, one can inversely weight the observations using the nonparametrically estimated propensity scores. We will discuss this estimator in detail in the next subsection.
Under unconfoundedness and other regularity conditions, Hahn ([13]) established the efficiency bound of a regular estimator $\hat{\theta}$ for the average treatment effect $\theta$. He showed that $\hat{\theta}$ is asymptotically normally distributed
$\sqrt{n}(\hat{\theta}-\theta)\rightarrow_{d}\mathcal{N}(0, V)$
with variance bounded by
$V\geq E\{\frac{\sigma_{1}^{2}(X_{i})}{e(X_{i})}+\frac{\sigma_{0}^{2}(X_{i})}{1-e(X_{i})}+(\theta(X_{i})-\theta)^{2}\}$ (15)
In (15), $\theta(X_{i})$ is the average treatment effect for the subpopulation at $X_{i}$, and $\sigma_{1}^{2}(X_{i})=\mathrm{var}(Y_{i}(1)|X_{i})$, $\sigma_{0}^{2}(X_{i})=\mathrm{var}(Y_{i}(0)|X_{i})$ are the conditional variances. The r.h.s. of (15) gives the semiparametric efficiency bound for a regular estimator of the average treatment effect $\theta$. This efficiency bound plays a role analogous to the Cramér-Rao lower bound for parametric estimation. Hahn showed that the efficiency bound in (15) remains unchanged even if the propensity score is known in advance. In the special case when the propensity score equals an unknown constant, $e(X_{i})=p$, that is, the treatment assignment is randomized, we have $\theta=\theta_{T}$ and the efficiency bound for the common parameter becomes
$E\{\frac{\sigma_{1}^{2}(X_{i})}{p}+\frac{\sigma_{0}^{2}(X_{i})}{1-p}+(\theta(X_{i})-\theta)^{2}\}$.
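To make the bound (15) concrete, it can be evaluated numerically when the covariate takes finitely many values. The sketch below is purely illustrative: the dictionary layout, the function name `ate_efficiency_bound` and the numbers are hypothetical, and the population quantities $e(x)$, $\sigma_{z}^{2}(x)$ and $\theta(x)$ are treated as known.

```python
def ate_efficiency_bound(cells):
    """Right-hand side of (15) for a discrete covariate.
    Each cell is a dict with keys: prob (Pr(X = x)), e (propensity score),
    var1 and var0 (conditional variances), theta_x (effect at x)."""
    theta = sum(c["prob"] * c["theta_x"] for c in cells)
    return sum(
        c["prob"] * (c["var1"] / c["e"]
                     + c["var0"] / (1.0 - c["e"])
                     + (c["theta_x"] - theta) ** 2)
        for c in cells
    )

# Hypothetical population: two equally likely covariate values, randomized
# assignment with e(x) = 0.5, unit conditional variances, effects 1 and 3.
cells = [
    {"prob": 0.5, "e": 0.5, "var1": 1.0, "var0": 1.0, "theta_x": 1.0},
    {"prob": 0.5, "e": 0.5, "var1": 1.0, "var0": 1.0, "theta_x": 3.0},
]
bound = ate_efficiency_bound(cells)  # 1/0.5 + 1/0.5 + 1 = 5
```

Moving the propensity score in a cell towards zero or one inflates the corresponding variance term, which is one way to see why limited overlap makes the average treatment effect hard to estimate precisely.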
When the propensity score is not known, a similar bound exists for the average treatment effect on the treated $\theta_{T}$:
$E\{\frac{e(X_{i})\sigma_{1}^{2}(X_{i})}{p^{2}}+\frac{e(X_{i})^{2}\sigma_{0}^{2}(X_{i})}{p^{2}(1-e(X_{i}))}+\frac{(\theta(X_{i})-\theta_{T})^{2}e(X_{i})}{p^{2}}\}$
where $p=E\{e(X_{i})\}$. When the propensity score is known, the corresponding efficiency bound decreases by an amount
$E\{\frac{(\theta(X_{i})-\theta_{T})^{2}e(X_{i})(1-e(X_{i}))}{p^{2}}\}$,
which may be considered as the gain in efficiency by the knowledge of the propensity score.
4.2 Efficient Estimators
Hahn also proposed estimators for both the average treatment effect $\theta$ and the average treatment effect on the treated $\theta_{T}$, which achieve the respective efficiency bounds described above. To motivate these estimators, observe that, under unconfoundedness, we have
$E\{Z_{i}Y_{i}|X_{i}\}=E\{Z_{i}Y_{i}(1)|X_{i}\}=E\{Z_{i}|X_{i}\}E\{Y_{i}(1)|X_{i}\}$
implying
$E\{Y_{i}(1)|X_{i}\}=\frac{E\{Z_{i}Y_{i}|X_{i}\}}{E\{Z_{i}|X_{i}\}}=\frac{E\{Z_{i}Y_{i}|X_{i}\}}{e(X_{i})}$ (16)
Similarly we also have
$E\{Y_{i}(0)|X_{i}\}=\frac{E\{(1-Z_{i})Y_{i}|X_{i}\}}{1-e(X_{i})}$ (17)
These two expressions relate the conditional expectations $\mu_{1}(X_{i})=E\{Y_{i}(1)|X_{i}\}$ and $\mu_{0}(X_{i})=E\{Y_{i}(0)|X_{i}\}$ to the conditional expectations $E\{Z_{i}Y_{i}|X_{i}\}$, $E\{(1-Z_{i})Y_{i}|X_{i}\}$ and $e(X_{i})=E\{Z_{i}|X_{i}\}$.
The idea is to use nonparametric regression techniques to estimate the quantities $E\{Z_{i}Y_{i}|X_{i}\}$, $E\{(1-Z_{i})Y_{i}|X_{i}\}$ and $e(X_{i})$ to give estimates $\hat{\mu}_{1}(X_{i})$ and $\hat{\mu}_{0}(X_{i})$ for $E\{Y_{i}(1)|X_{i}\}$ and $E\{Y_{i}(0)|X_{i}\}$ respectively. These estimates $\hat{\mu}_{1}(X_{i})$ and $\hat{\mu}_{0}(X_{i})$ may be used as imputed values for $Y_{i}(1)$ and $Y_{i}(0)$ when they are missing. With the imputed values we now have a 'complete' data situation:
$\hat{Y}_{i}(1)=Z_{i}Y_{i}(1)+(1-Z_{i})\hat{\mu}_{1}(X_{i})$ under 'treatment'
$\hat{Y}_{i}(0)=(1-Z_{i})Y_{i}(0)+Z_{i}\hat{\mu}_{0}(X_{i})$ under 'control'
Hahn proved that the efficient estimators for $\theta$ and $\theta_{T}$ are given respectively by
$\hat{\theta}=\frac{1}{n}\sum_{i}(\hat{Y}_{i}(1)-\hat{Y}_{i}(0))$ (18)
and
$\hat{\theta}_{T}=\frac{\sum_{i}Z_{i}(Y_{i}(1)-\hat{Y}_{i}(0))}{\sum_{i}Z_{i}}$ (19)
Alternatively, we note that
$\theta=E\{\theta(X_{i})\}=E\{E[Y_{i}(1)|X_{i}]-E[Y_{i}(0)|X_{i}]\}$
This motivates the following estimator
$\overline{\theta}=\frac{1}{n}\sum_{i}(\hat{\mu}_{1}(X_{i})-\hat{\mu}_{0}(X_{i}))$ (20)
which is again shown by Hahn to be efficient for estimating $\theta$. Similarly, the efficient estimator for $\theta_{T}$ is
$\overline{\theta}_{T}=\frac{\sum_{i}Z_{i}(\hat{\mu}_{1}(X_{i})-\hat{\mu}_{0}(X_{i}))}{\sum_{i}Z_{i}}$ (21)
So far we have left unspecified the estimates for $E\{Z_{i}Y_{i}|X_{i}\}$, $E\{(1-Z_{i})Y_{i}|X_{i}\}$ and $e(X_{i})=E\{Z_{i}|X_{i}\}$, which are used to form the estimates $\hat{\mu}_{1}(X_{i})$ and $\hat{\mu}_{0}(X_{i})$. When $X_{i}$ has finite support, we can use the following estimates
$\hat{E}\{Z_{i}Y_{i}|X_{i}=x\}=\frac{\sum_{j}Z_{j}Y_{j}\cdot 1(X_{j}=x)}{\sum_{j}1(X_{j}=x)}$,
$\hat{E}\{(1-Z_{i})Y_{i}|X_{i}=x\}=\frac{\sum_{j}(1-Z_{j})Y_{j}\cdot 1(X_{j}=x)}{\sum_{j}1(X_{j}=x)}$,
$\hat{E}\{Z_{i}|X_{i}=x\}=\frac{\sum_{j}Z_{j}\cdot 1(X_{j}=x)}{\sum_{j}1(X_{j}=x)}$,
where $1(X_{j}=x)$ is the indicator function.
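For a covariate with finite support, the cell-frequency estimates above together with (16) and (17) give a fully explicit version of the imputation estimator (18) and of the regression-type estimator (20). The sketch below is one possible implementation under the assumption that every covariate cell contains both treated and control units; the function names and the toy data are hypothetical.

```python
from collections import defaultdict

def cell_regressions(x, z, y):
    """mu1_hat(x) and mu0_hat(x) from (16)-(17) using cell-frequency estimates."""
    s_zy = defaultdict(float)    # sum of Z_j * Y_j in each cell
    s_1zy = defaultdict(float)   # sum of (1 - Z_j) * Y_j in each cell
    s_z = defaultdict(float)     # number of treated units in each cell
    cnt = defaultdict(int)       # cell size
    for xi, zi, yi in zip(x, z, y):
        s_zy[xi] += zi * yi
        s_1zy[xi] += (1 - zi) * yi
        s_z[xi] += zi
        cnt[xi] += 1
    mu1, mu0 = {}, {}
    for xv in cnt:
        e_hat = s_z[xv] / cnt[xv]  # estimated propensity score in the cell
        mu1[xv] = (s_zy[xv] / cnt[xv]) / e_hat
        mu0[xv] = (s_1zy[xv] / cnt[xv]) / (1.0 - e_hat)
    return mu1, mu0

def imputation_ate(x, z, y):
    """Estimator (18): observed outcomes completed by imputed values."""
    mu1, mu0 = cell_regressions(x, z, y)
    total = 0.0
    for xi, zi, yi in zip(x, z, y):
        y1_hat = zi * yi + (1 - zi) * mu1[xi]
        y0_hat = (1 - zi) * yi + zi * mu0[xi]
        total += y1_hat - y0_hat
    return total / len(y)

def regression_type_ate(x, z, y):
    """Estimator (20): average of mu1_hat(X_i) - mu0_hat(X_i)."""
    mu1, mu0 = cell_regressions(x, z, y)
    return sum(mu1[xi] - mu0[xi] for xi in x) / len(x)

# Hypothetical toy data with two covariate cells.
x = ["a", "a", "a", "b", "b", "b", "b", "b"]
z = [1, 1, 0, 1, 0, 0, 0, 0]
y = [5.0, 7.0, 2.0, 10.0, 8.0, 9.0, 10.0, 9.0]
theta_18 = imputation_ate(x, z, y)
theta_20 = regression_type_ate(x, z, y)
```

In this discrete setting the cell estimate $\hat{\mu}_{1}(x)$ is just the treated mean in the cell, so the two estimators coincide on the toy data; with continuous covariates and series estimators they need not.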
When $X_{i}$ has a continuous distribution, Hahn suggests using series estimators for these conditional expectations. One difficulty with the series estimators is that one has to choose a somewhat arbitrary number of terms in the series. Hirano et al. ([17]) considered another type of efficient estimator for $\theta$ so that the series estimators are required only for estimating the propensity score. The merits of using the estimated propensity score in gaining efficiency even when the propensity score is known have been pointed out by a number of researchers (e.g., [27], [37], [13], [15]). To motivate their estimator, we notice that, by (16) and (17), the average treatment effect $\theta$ can also be expressed as
$\theta=E\{E[Y_{i}(1)|X_{i}]-E[Y_{i}(0)|X_{i}]\}$
$=E\{\frac{E[Z_{i}Y_{i}|X_{i}]}{e(X_{i})}-\frac{E[(1-Z_{i})Y_{i}|X_{i}]}{1-e(X_{i})}\}$
$=E\{E[\frac{Z_{i}Y_{i}}{e(X_{i})}|X_{i}]-E[\frac{(1-Z_{i})Y_{i}}{1-e(X_{i})}|X_{i}]\}$
$=E\{\frac{Z_{i}Y_{i}}{e(X_{i})}-\frac{(1-Z_{i})Y_{i}}{1-e(X_{i})}\}$
The sample version of the last expectation, with the propensity score estimated, gives an estimator for $\theta$:
$\hat{\hat{\theta}}=\frac{1}{n}\sum_{i}\{\frac{Z_{i}Y_{i}}{\hat{e}(X_{i})}-\frac{(1-Z_{i})Y_{i}}{1-\hat{e}(X_{i})}\}$ (22)
where $\hat{e}(X_{i})$ in (22) is the nonparametric sieve estimator for the propensity score. Hirano et al. ([17]) showed that $\hat{\hat{\theta}}$ attains the semiparametric efficiency bound (15), and thus is an efficient estimator for $\theta$. The advantage of $\hat{\hat{\theta}}$ over $\hat{\theta}$ or $\overline{\theta}$ is that to compute $\hat{\hat{\theta}}$ we only need an estimate of the propensity score.
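The inverse-probability weighting idea, the sample analogue of the last expectation in the chain above, can be sketched as follows. This illustration replaces the sieve estimator of [17] by simple cell frequencies for a discrete covariate, which is an assumption of the example rather than the method of that paper; the function names and the toy data are hypothetical.

```python
from collections import Counter

def cell_propensity(x, z):
    """Cell-frequency estimate of e(x) = Pr(Z = 1 | X = x) for a discrete covariate."""
    n_x = Counter(x)
    n_treated = Counter(xi for xi, zi in zip(x, z) if zi == 1)
    return [n_treated[xi] / n_x[xi] for xi in x]

def ipw_ate(z, y, e_hat):
    """Weighting estimator: mean of Z*Y/e_hat - (1 - Z)*Y/(1 - e_hat).
    Requires 0 < e_hat < 1 for every unit (overlap)."""
    n = len(y)
    return sum(zi * yi / ei - (1 - zi) * yi / (1.0 - ei)
               for zi, yi, ei in zip(z, y, e_hat)) / n

# Hypothetical toy data with two covariate cells.
x = ["a", "a", "a", "b", "b", "b", "b", "b"]
z = [1, 1, 0, 1, 0, 0, 0, 0]
y = [5.0, 7.0, 2.0, 10.0, 8.0, 9.0, 10.0, 9.0]
e_hat = cell_propensity(x, z)
theta_ipw = ipw_ate(z, y, e_hat)
```

Only the propensity score has to be estimated here; no outcome regression is fitted, which mirrors the practical advantage noted in the text.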
References
[1] Abadie, A. and Imbens, G.: Large sample properties of matching estimators for average treatment effects, Econometrica, Vol. 74, pp. 235-267 (2006).
[2] Becker, S. and Ichino, A.: Estimation of average treatment effects based on propensity scores, The Stata Journal, Vol. 2, No. 4, pp. 358-377 (2002).
[3] Belson, W. A.: A technique for studying the effects of a television broadcast, Applied Statistics, Vol. 5, pp. 195-202 (1956).
[4] Chen, X., Hong, H. and Tarozzi, A.: Semiparametric efficiency in GMM models of nonclassical measurement errors, Annals of Statistics, Vol. 36, No. 2, pp. 808-843 (2008).
[5] Cochran, W. G.: The effectiveness of adjustment by subclassification in removing bias in observational studies, Biometrics, Vol. 24, pp. 295-314 (1968).
[6] Cochran, W. G.: The use of covariance in observational studies, Applied Statistics, Vol. 18, pp. 270-275 (1969).
[7] Cochran, W. G. and Rubin, D.: Controlling bias in observational studies, Sankhya A, Vol. 35, pp. 417-446 (1973).
[8] Dawid, A. P.: Conditional independence in statistical theory, Journal of the Royal Statistical Society, Series B, Vol. 41, No. 1, pp. 1-31 (1979).
[9] Dehejia, R. and Wahba, S.: Causal effects in nonexperimental studies: reevaluating the evaluation of training programs, Journal of the American Statistical Association, Vol. 94, pp. 1053-1062 (1999).
[10] Diamond, A. and Sekhon, J.: Genetic matching for estimating causal effects: a general multivariate matching method for achieving balance in observational studies, http://sekhon.berkeley.edu/papers/GenMatch.pdf (2005).
[11] Geman, S. and Hwang, C.: Nonparametric maximum likelihood estimation by the method of sieves, Annals of Statistics, Vol. 10, pp. 401-414 (1982).
[12] Gordon, S. and Huber, G.: The Effect of Electoral Competitiveness on Incumbent Behavior,
[13] Hahn, J.: On the role of the propensity score in efficient semiparametric estimation of average treatment effects, Econometrica, Vol. 66, No. 2, pp. 315-331 (1998).
[14] Heckman, J., Ichimura, H. and Todd, P.: Matching as an econometric evaluation estimator: evidence from evaluating a job training program, Review of Economic Studies, Vol. 64, pp. 605-654 (1997).
[15] Heckman, J., Ichimura, H. and Todd, P.: Matching as an econometric evaluation estimator, Review of Economic Studies, Vol. 65, pp. 261-294 (1998).
[16] Heckman, J., Ichimura, H., Smith, J. and Todd, P.: Characterizing selection bias using experimental data, Econometrica, Vol. 66, No. 5, pp. 1017-1098 (1998).
[17] Hirano, K., Imbens, G. and Ridder, G.: Efficient estimation of average treatment effects using the estimated propensity score, Econometrica, Vol. 71, No. 4, pp. 1161-1189 (2003).
[18] Holland, P. W.: Statistics and causal inference, Journal of the American Statistical Association, Vol. 81, No. 396, pp. 945-960 (1986).
[19] Holland, P. W. and Rubin, D.: Causal inference in retrospective studies, Evaluation Review, Vol. 12, No. 3, pp. 203-231 (1988).
[20] Imbens, G., Newey, W. and Ridder, G.: Mean-squared-error calculations for average treatment effects, Working Paper, Department of Economics, UC Berkeley.
[21] Imbens, G. and Wooldridge, J.: Recent developments in the econometrics of program evaluation, NBER Working Paper No. 14251 (2008).
[22] Mebane, W. and Sekhon, J.: GENetic optimization using derivatives (GENOUD), http://sekhon.berkeley.edu/rgenoud/ (1998).
[23] Morgan, S. and Harding, D.: Matching estimators of causal effects: prospects and pitfalls in theory and practice, Sociological Methods & Research, Vol. 35, No. 1, pp. 3-60 (2006).
[24] Neyman, J.: On the application of probability theory to agricultural experiments. Essay on principles. Section 9 (1923), translated in Statistical Science (with discussion), Vol. 5, No. 4, pp. 465-480 (1990).
[25] Peters, C. C.: A method of matching groups for experiment with no loss of population, Journal of Educational Research, Vol. 34, pp. 606-612 (1941).
[26] Raessler, S. and Rubin, D.: Complications when using nonrandomized job training data to draw causal inferences, Proceedings of the International Statistical Institute (2005).
[27] Rosenbaum, P.: Model-based direct adjustment, Journal of the American Statistical Association, Vol. 82, pp. 387-394 (1987).
[29] Rosenbaum, P. and Rubin, D.: The central role of the propensity score in observational studies for causal effects, Biometrika, Vol. 70, pp. 41-55 (1983a).
[30] Rosenbaum, P. and Rubin, D.: Assessing the sensitivity to an unobserved binary covariate in an observational study with binary outcome, Journal of the Royal Statistical Society, Series B, Vol. 45, No. 2, pp. 212-218 (1983b).
[31] Rubin, D.: Estimating causal effects of treatments in randomized and nonrandomized studies, Journal of Educational Psychology, Vol. 66, pp. 688-701 (1974).
[32] Rubin, D.: Assignment to treatment group on the basis of a covariate, Journal of Educational Statistics, Vol. 2, No. 1, pp. 1-26 (1977).
[33] Rubin, D.: Bayesian inference for causal effects: the role of randomization, Annals of Statistics, Vol. 6, pp. 34-58 (1978).
[34] Rubin, D.: Using multivariate sampling and regression adjustment to control bias in observational studies, Journal of the American Statistical Association, Vol. 74, pp. 318-328 (1979).
[35] Rubin, D.: Bias reduction using Mahalanobis-metric matching, Biometrics, Vol. 36, No. 2, pp. 293-298 (1980).
[36] Rubin, D.: Matched Sampling for Causal Effects, Cambridge, England: Cambridge University Press (2006).
[37] Rubin, D. and Thomas, N.: Matching using estimated propensity scores: relating theory to practice, Biometrics, Vol. 52, pp. 249-264 (1996).
[38] Sekhon, J.: Matching: algorithms and software for multivariate and propensity score matching with balance optimization via genetic search, Journal of Statistical Software (2007).
[39] Sekhon, J.: Causal inference in quantitative and qualitative research, in The Oxford Handbook of Political Methodology (eds. Box-Steffensmeier, J., Brady, H. and Collier, D.), Oxford University Press (2008).
[40] Sekhon, J. and Mebane, W.: Genetic optimization using derivatives: theory and application to nonlinear models, Political Analysis, Vol. 7, pp. 189-203 (1998).
[41] Zhao, Z.: Using matching to estimate treatment effects: data requirements, matching metrics