Policy Research Working Paper 6587
Evaluation of Development Programs
Randomized Controlled Trials or Regressions?
Chris Elbers
Jan Willem Gunning
The World Bank
Development Economics Vice Presidency
Partnerships, Capacity Building Unit
September 2013
Abstract
Can project evaluation methods be used to evaluate programs: complex interventions involving multiple activities? A program evaluation cannot be based simply on separate evaluations of its components if interactions between the activities are important. In this paper a measure is proposed, the total program effect (TPE), which is an extension of the average treatment effect on the treated (ATET). It explicitly takes into account that in the real world (with heterogeneous treatment effects) individual treatment effects and program assignment are often correlated. The TPE can also deal with the common situation in which such a correlation is the result of decisions on (intended) program participation not being taken centrally. In this context RCTs are less suitable even for the simplest interventions.
The TPE can be estimated by applying regression techniques to observational data from a representative sample from the targeted population. The approach is illustrated with an evaluation of a health insurance program in Vietnam.
This paper is a product of the Partnerships, Capacity Building Unit, Development Economics Vice Presidency. It is part
of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy
discussions around the world. Policy Research Working Papers are also posted on the Web at http://econ.worldbank.org.
The authors may be contacted at c.t.m.elbers@vu.nl and j.w.gunning@vu.nl.
The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development
issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the
names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those
of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and
its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.
Produced by the Research Support Team
Evaluation of Development Programs:
Randomized Controlled Trials or Regressions?
Chris Elbers and Jan Willem Gunning*
JEL Classification Codes: C21, C33, O22
Keywords: program evaluation; randomized controlled trials; policy evaluation;
treatment heterogeneity; budget support; sector-wide programs; aid effectiveness
Sector Board: Economic Policy (EPOL)
* The authors are professors at the VU University Amsterdam and fellows of the Tinbergen
Institute. Their addresses are c.t.m.elbers@vu.nl and j.w.gunning@vu.nl (corresponding author).
They are grateful to Remco Oostendorp, Menno Pradhan, Martin Ravallion, Elisabeth Sadoulet,
Finn Tarp, the editors and two anonymous referees of the World Bank Economic Review and to
seminar participants in Amsterdam, Namur, Oxford and Paris for very valuable comments on
previous versions.
Experimental methods for impact evaluation presuppose that the intervention is well-defined: the "project" is limited in space and scope (e.g. Duflo et al., 2008). However, governments, NGOs and donor agencies are often interested in evaluating the effect of a program consisting of various interventions, e.g. sector-wide health or education programs (De Kemp et al., 2011).
Program evaluation faces two complications. First, a sharp distinction between treatment and
control groups is usually impossible. For example, a program in the education sector may
involve activities such as school building, teacher training and supply of textbooks. Typically all
communities are affected in some way by the program, but they may differ dramatically in what
interventions they are exposed to and the extent of that exposure. Secondly, in a program the
interventions are typically implemented at various administrative levels so that the policy maker
has only imperfect control over actual treatment.
The impact of such a program cannot simply be calculated on the basis of the results of
randomized controlled trials (RCTs). This would run into well known problems of external
validity (Bracht and Glass, 1968, Rodrik, 2008, Ravallion, 2009, Banerjee and Duflo, 2009,
Deaton, 2010, Imbens, 2010) even if the program involved only a single intervention. In
addition, if the program involves multiple interventions and interactions are important then it is
not clear how RCT evaluations of individual components of the program should be combined into an overall assessment of the program. However, regression techniques can be used for program
evaluation. This involves drawing a representative sample of beneficiaries (e.g. households,
schools, communities) and collecting data on the combination of interventions experienced by
each beneficiary, together with other possible determinants of the outcome variables of interest.
Regression techniques can then be used to estimate the impact of the various interventions. 1 In
this paper this approach is generalized by allowing for treatment heterogeneity and a way of
estimating aggregate program impact is proposed.
Obviously, the intervention variables are likely to be endogenous in a regression analysis. For
example, an unobserved variable such as the political preferences of the community may affect
both the impact variable of interest and the intervention. Also, the impact of the intervention will
differ between beneficiaries and the allocation of interventions across beneficiaries may be
based on such treatment heterogeneity, either through self-selection or through the allocation
decisions of program officers. Heckman (1997) and Heckman et al. (2008) call this "selection on the gain". The first complication is usually dealt with by using panel data or by randomized
assignment of treatment. The second complication is much more serious. It may be particularly
hard for RCTs when program assignment in practice cannot be mimicked by assignment to the
treatment arm in an RCT since this would not capture the way program officers take their
decisions. However, it will be shown that regression techniques can be adapted so as to produce
an appropriate estimate of the program effect.
The paper is organized as follows. In the first section the total program effect (TPE) is
introduced. This measure extends the average treatment effect on the treated (ATET). The TPE
is suitable for complex interventions and can deal with selection on the gain (treatment
heterogeneity). Then two complications are considered: correlation between program variables
and the controls in section 2 and spillover effects in section 3. Section 4 investigates whether
estimating the TPE using RCTs is an alternative. The approach is illustrated in section 5 by
estimating the TPE for a health insurance intervention in Vietnam. Section 6 concludes.
I. The Total Program Effect (TPE)
Consider the following model:

y_it = α X_it + β_i P_it + γ_t + η_i + ε_it    (1)

where y measures an outcome of interest, in this paper taken to be a scalar; t = 0, 1 is the time of measurement; and i = 1, ..., n denotes the unit of observation, e.g. households or locations. P denotes a vector of the interventions to be evaluated and X a vector of observed controls. 2 The P-variables can either be binary variables or multi-valued (discrete or continuous) variables. α and β_i are vectors of parameters, γ_t denotes a time effect, η_i represents time-invariant unobserved characteristics and ε_it is the error term, assumed to be independent over time. It is also assumed that the interventions and control variables are uncorrelated with the error process:

X_i1, X_i0, P_i1, P_i0 ⊥ ε_i1, ε_i0.

At this stage P and X are assumed to be independent:

X_i1, X_i0 ⊥ P_i1, P_i0.
This will be relaxed in section 2. Note that equation (1) excludes spillover effects of the type where y_it depends on P_jt (i ≠ j) and j is not necessarily included in the sample. This point will be discussed in section 3. In many applications (1) will represent a reduced form or "black box" regression, but it can also represent a structural model.
The evaluator is interested in the expectation (in the population) of the effect of interventions on the outcome variable, the total program effect (TPE): 3

TPE = E β_i (P_i1 − P_i0).

Note that the impact parameters β_i need not be the same for all i: heterogeneity of program impact is allowed.
As an example consider a very simple special case:

y_it = β_i P_it + γ_t + η_i + ε_it,   t = 0, 1    (2)

where P_it is now a binary variable rather than a vector, P_i0 = 0 for all i and ΔP_i = P_i1 − P_i0. Taking first differences gives:

Δy_i = β_i ΔP_i + γ + Δε_i

where γ = γ_1 − γ_0. This is analogous to the equation for a standard project evaluation, but written in differences. 4 The TPE for this case equals E β_i ΔP_i, which is related to the familiar average treatment effect on the treated (ATET):

ATET = TPE / E ΔP_i.
In another special case of equation (1) the TPE can be identified as follows. Assume that data are available from a random sample and that for a subsample (the "control group") there is no change in the interventions: P_i1 = P_i0. (At this stage it is not assumed that the assignment to intended "treatment" and "control" groups is random.) Taking first differences in (1) for this group gives:

Δy_i = α ΔX_i + γ + Δε_i   if ΔP_i = 0.

This allows estimation of γ and α, so that the TPE can be estimated as

TPE^ = Δy‾ − γ̂ − α̂ ΔX‾

where bars denote sample averages.
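As a numerical illustration, this identification strategy can be sketched in a few lines of Python. The data-generating process, variable names and coefficient values below are all hypothetical, chosen only to mimic a setting with a binary intervention and a control group:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000

# Hypothetical first-differenced data: about 40% of units form a control group.
dX = rng.normal(size=n)                            # change in a single control variable
dP = rng.binomial(1, 0.6, size=n).astype(float)    # change in the (binary) intervention
beta = rng.normal(1.0, 0.5, size=n)                # heterogeneous impacts; here E beta = 1
dy = 0.2 + 1.5 * dX + beta * dP + rng.normal(scale=0.5, size=n)

# Estimate the trend gamma and alpha on the control group (dP = 0) only.
ctrl = dP == 0
Zc = np.column_stack([np.ones(ctrl.sum()), dX[ctrl]])
(gamma_hat, alpha_hat), *_ = np.linalg.lstsq(Zc, dy[ctrl], rcond=None)

# TPE estimate: mean outcome change purged of the trend and the controls.
tpe = dy.mean() - gamma_hat - alpha_hat * dX.mean()
print(round(tpe, 2))
```

Since in this simulation β_i is independent of ΔP_i, the estimate should be close to E β_i × E ΔP_i = 0.6.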
However, in a program consisting of multiple interventions, the context of this paper, there will usually not be a sufficiently large control group to make this identification strategy realistic. Indeed, typically the control group will be empty: all i will have experienced a change in at least some components of the vector ΔP_i.
For this more general case

Δy_i = α ΔX_i + β_i ΔP_i + γ + Δε_i.    (3)
Allowing for "selection on the gain", i.e. correlation between the impact parameters β_i and the program variables ΔP_i, and also for correlation between β_i and X_i, equation (3) can be rewritten as

Δy_i = α ΔX_i + E(β_i | ΔX_i, ΔP_i) ΔP_i + γ + ω_i,    (4)

where ω_i = Δε_i + (β_i − E(β_i | ΔX_i, ΔP_i)) ΔP_i and this is uncorrelated with ΔX_i and ΔP_i.
The term E(β_i | ΔX_i, ΔP_i) can be approximated linearly: 5

E(β_i | ΔX_i, ΔP_i) ≈ δ_0 + δ_1 ΔX_i + δ_2 ΔP_i.

Substitution in (4) and collecting terms gives

Δy_i = γ + θ_1 ΔX_i + θ_2 ΔP_i + θ_3 ΔX_i ⊗ ΔP_i + θ_4 ΔP_i ⊗ ΔP_i + ω_i    (5)

where

θ_2 ΔP_i + θ_3 ΔX_i ⊗ ΔP_i + θ_4 ΔP_i ⊗ ΔP_i

is the approximation of T_i = E(β_i ΔP_i | ΔX_i, ΔP_i).
Equation (5) can be estimated using the sample data. The estimated coefficients can then be used to estimate T_i as

T̂_i = θ̂_2 ΔP_i + θ̂_3 ΔX_i ⊗ ΔP_i + θ̂_4 ΔP_i ⊗ ΔP_i.

The TPE can now be estimated as the average of the T̂_i in the sample:

TPE^ = (1/n) Σ_i T̂_i = θ̂_2 ΔP‾ + θ̂_3 (ΔX ⊗ ΔP)‾ + θ̂_4 (ΔP ⊗ ΔP)‾    (6)

where bars denote sample averages. 6
In practice this means that one regresses Δy_i on ΔX_i, ΔP_i and their interactions with ΔP_i and collects all terms involving ΔP_i to calculate the total program effect. Since the estimated TPE is linear in the θ̂ parameters its standard error can be obtained from the covariance matrix of the OLS coefficients.
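A minimal Python sketch of this procedure, using simulated data with a single control and a single continuous program variable (all distributions and coefficient values are hypothetical, including "selection on the gain" through the dependence of beta on dX and dP):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical first-differenced data with heterogeneous, selected-on impacts.
dX = rng.normal(size=n)
dP = rng.uniform(0, 1, size=n)
beta = 1.0 + 0.5 * dX + 0.8 * dP + rng.normal(scale=0.3, size=n)
dy = 0.1 + 2.0 * dX + beta * dP + rng.normal(size=n)

# Regressors of equation (5): constant, dX, dP and the interactions dX*dP, dP*dP.
Z = np.column_stack([np.ones(n), dX, dP, dX * dP, dP**2])
theta, *_ = np.linalg.lstsq(Z, dy, rcond=None)

# Individual effects T_i and the TPE as their sample average, equation (6).
T_hat = theta[2] * dP + theta[3] * dX * dP + theta[4] * dP**2
tpe = T_hat.mean()

# Standard error from the OLS covariance matrix: TPE = a'theta with
# a = (0, 0, mean(dP), mean(dX*dP), mean(dP^2)).
resid = dy - Z @ theta
sigma2 = resid @ resid / (n - Z.shape[1])
cov = sigma2 * np.linalg.inv(Z.T @ Z)
a = np.array([0.0, 0.0, dP.mean(), (dX * dP).mean(), (dP**2).mean()])
se = float(np.sqrt(a @ cov @ a))
print(round(tpe, 2), round(se, 2))
```

With these parameter values the population TPE is E ΔP + 0.5 E(ΔX ΔP) + 0.8 E ΔP² ≈ 0.77, which the estimate should approximate.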
It is instructive to consider the special case of equation (5) where D_i = ΔP_i is a binary variable taking the value 1 for the treatment group and 0 for the control group, i.e. the case of a difference-in-differences analysis. Equation (5) now reduces to

Δy_i = γ + θ_1 ΔX_i + θ_2 D_i + θ_3 D_i ΔX_i + ω_i

since in this case D_i² = D_i. Compared to a standard diff-in-diff regression this equation contains the interaction term D_i ΔX_i.
The total program effect will in this case be estimated as

TPE^ = θ̂_2 D‾ + θ̂_3 (D ΔX)‾.    (7)

This shows that when the sample is representative, sample means can be used to construct the total program effect. The interaction term in (7) avoids the bias resulting from correlations between treatment effects and either program participation or controls.
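In a simulation the difference between a standard diff-in-diff regression and equation (7) is easy to see. The numbers below are again hypothetical, with treatment effects and participation that both depend on the control variable:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# Hypothetical data: participation D and the treatment effect both depend on dX.
dX = rng.normal(size=n)
D = (dX + rng.normal(size=n) > 0).astype(float)   # treatment indicator
beta = 1.0 + 1.0 * dX                             # heterogeneous treatment effect
dy = 0.5 + 2.0 * dX + beta * D + rng.normal(size=n)

# Standard diff-in-diff: no interaction term.
Z0 = np.column_stack([np.ones(n), dX, D])
b0, *_ = np.linalg.lstsq(Z0, dy, rcond=None)
naive = b0[2] * D.mean()

# Equation (7): add D*dX and average all terms involving D.
Z1 = np.column_stack([np.ones(n), dX, D, D * dX])
b1, *_ = np.linalg.lstsq(Z1, dy, rcond=None)
tpe = b1[2] * D.mean() + b1[3] * (D * dX).mean()
print(round(naive, 2), round(tpe, 2))
```

Here the population TPE is E D + E(ΔX D) ≈ 0.78, while the regression without the interaction term misses the part of the effect that operates through the correlation between β_i and ΔX_i.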
Many diff-in-diff studies do not include the interaction terms (e.g., Khandker et al., 2009 or
Almeida and Galasso, 2010). Studies that do often report estimates of impact for different values of the controls X, which makes it difficult to assess the aggregate impact of a program.
Equation (1) allows for two types of selection effects: P_it may be correlated with β_i or with the unobserved characteristics η_i. A correlation of P_it and η_i is dealt with by differencing, as in (3). 7 However, the TPE measures the effect of the program inclusive of selectivity in the assignment of program interventions resulting in a correlation of β_i and ΔP_i. This is appropriate since the way the program was assigned (in an ex post evaluation) or will be assigned (in an ex ante evaluation) is one of its characteristics. If the program was successful in part because program officers made sure the program interventions were assigned to households or locations where they expected a high impact, then obviously the evaluation should reflect this. In fact the evaluation would be misleading if it tried to "correct" for such selection effects by presenting (if this were feasible) an estimate (E β_i) of the program's impact if it had been assigned randomly.
Recall that in the special, binary case of a 'project' evaluation TPE = E β_i ΔP_i = ATET × E ΔP_i. If administrative data can be used to estimate E ΔP_i, the question arises whether the ATET is identified in an RCT. Obviously this is the case if β_i = β for all i. More generally, if ΔP_i and β_i are independent the TPE can be estimated on the basis of an RCT: the trial would give an estimate of E β_i, which in this case is also the ATET. A special case of independence is that of universal treatment (P_i = 1 for all i). 8 In the most general case, when ΔP_i and β_i are not independent, the ATET as established by an RCT may differ from the ATET in the population and estimating the TPE on the basis of RCTs can become problematic. This issue will be considered in section 4.
II. Correlation between P and X
In the previous section P and X were assumed to be independent. (P, X) correlations are often important in evaluations. For example, changes in teacher training may induce changes in parental input. 9, 10 Not all such inputs will be observed (e.g. additional parental help with homework will probably not be recorded); P_it will then be correlated with β_i and this was already considered in the previous section. Conversely, if the parental input is observed then P_it will be correlated with X_it. In that case the TPE identifies the direct effect of P, but not its total effect (including the indirect effect through induced changes in X). If the induced effect is to be included then the affected components of ΔX_i should be omitted from the regression (5).
If causality is in the reverse direction, from ΔX_i to ΔP_i, then there is no need to amend the section 1 estimate of the TPE since there is no induced change in ΔX_i. (The asymmetry arises because in either case the interest is in the impact of changes in ΔP_i, rather than in the impact of changes in ΔX_i.)
In the general case where the direction of causality is not known it will usually not be possible to estimate the indirect effect of the program. Occasionally, however, appropriate instruments can be found so that the impact of ΔP_i on ΔX_i can be identified.
III. Spillover Effects
Recall that in section 1 spillover effects were excluded: in equation (1) y_i of case i does not depend on P_j of case j. In evaluations there are two important situations where this assumption is untenable. First, Chen et al. (2009) and Deaton (2010) discuss the possibility that policy in control villages is partly determined by policies in treatment villages so that the SUTVA (stable unit treatment value assumption) is violated. Indeed, if policies thus affected are not represented in the policy vector P_i this creates a classical case of omitted variable bias. In Chen et al. the problem arises because the data record participation in a particular program as a binary P_i variable, while other programs which may affect the outcome are initially ignored. In the approach advocated in the present paper all potentially relevant programs would in principle be included in P_i so that the problem of SUTVA violation is avoided. 11 Secondly, policies in village j may affect outcomes in village i. For example, a program aimed at an infectious disease in village j may affect health outcomes in the "untreated" village i. 12 If the external effects of policy are general equilibrium effects such as regional wage increases, it will be hard to identify the full impact of a policy. But often more structure can be imposed, e.g. by including a proxy for relevant policies in neighboring villages in the outcome regression, so that equation (3) is extended to

Δy_i = β_i ΔP_i + α ΔX_i + γ + δ ΔK_i + Δε_i

where ΔK_i is the proxy for policy changes in the neighborhood. If there is sufficient variation in K_i then δ is identified in this regression. The TPE would then be E β_i ΔP_i + δ E ΔK_i.
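Under the (strong) assumption that ΔK_i is observed for every sampled unit, the extended regression and the resulting TPE can be sketched as follows; the data-generating process is hypothetical and, for simplicity, ignores selection on the gain:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000

# Hypothetical data: own program intensity dP and a proxy dK for neighbors' policies.
dX = rng.normal(size=n)
dP = rng.uniform(0, 1, size=n)
dK = rng.uniform(0, 1, size=n)
beta = rng.normal(1.0, 0.3, size=n)   # impacts independent of dP in this sketch
dy = 0.2 + beta * dP + 1.0 * dX + 0.5 * dK + rng.normal(scale=0.5, size=n)

# Outcome regression extended with the spillover proxy dK.
Z = np.column_stack([np.ones(n), dX, dP, dK])
coef, *_ = np.linalg.lstsq(Z, dy, rcond=None)

# TPE including the spillover term: E beta_i dP_i + delta * E dK_i.
tpe = coef[2] * dP.mean() + coef[3] * dK.mean()
print(round(tpe, 2))
```

With these values the population TPE is 1 × 0.5 + 0.5 × 0.5 = 0.75, which the estimate should approximate.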
IV. Regression Methods and RCTs Compared
In section 1 it was shown how the TPE can be estimated using regression methods. A natural
question is whether the TPE can also be estimated using RCTs. Using RCTs may be difficult,
e.g. because in programs the distinction between treatment and control groups may break down.
However, there may be problems even in the case of binary treatments, namely under treatment
heterogeneity when the probability of treatment is correlated with the individual impact parameters β_i and unknown to the evaluator. If this correlation arises through self-selection then
the usual response is to consider the average treatment effect on the treated rather than the
average treatment effect in the population. If, however, the correlation arises at a higher level,
e.g. because the policy maker targets on observables, then an RCT would have to mimic this
assignment, possibly by stratifying the sample on the basis of the targeting variables.
But in many government and NGO programs the "policy maker" does not directly control the P variables: assignment is decided by lower level staff ("program officers") on the basis of private
information, variables that cannot be observed by the policy maker or the evaluator. In this case
an RCT can still identify the TPE, but at the cost of having to randomize at a higher level than
the treatment under consideration: randomization would apply to program officers rather than
beneficiaries. This implies that the power of the statistical analysis may be reduced. It also
involves losing the direct link with the intervention.
This may be illustrated with an example. Consider the following model

y_i = β_i P_i + γ + ε_i

where β_i and ε_i are independent, P_i is binary and E ε_i = 0. For simplicity β_i will be considered as the intention-to-treat impact, so that a subject i's refusal to undergo offered treatment P_i is reflected in β_i, rather than in P_i. Program implementation involves program officers who have imperfect knowledge of β_i: they perceive ω_i = β_i + η_i and will assign treatment if and only if ω_i > 0. Assume that η_i has mean zero and is independent of β_i and ε_i. Crucially, this knowledge of the program officers is unknown to the evaluator. Denote the CDF of η_i by F. With this assignment rule P_i is exogenous (i.e. independent of ε_i).
An RCT evaluation might involve drawing a random sample from the population and assigning treatment randomly within this sample. The researcher would then estimate the program's intention-to-treat effect (ITE) as E β_i. The TPE would be estimated as E β_i × E P_i. This would be incorrect since, under the assumptions made above,

TPE = E β_i P_i = E(β_i | β_i + η_i > 0) P(β_i + η_i > 0) = E[(1 − F(−β_i)) β_i] ≠ E P_i E β_i.

(Note that E(1 − F(−β_i)) = E P_i. As before, ATET = TPE / E P_i.) The problem arises because in this case the RCT design does not mimic the actual assignment process. To obtain an unbiased estimate of the TPE randomization would have to take place at a higher level, that of the program officers. 13 The control group would then consist of program officers who never "treat" and the treatment group of program officers who sometimes (but not always) treat.
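A small simulation illustrates the wedge between the TPE and the RCT-based estimate; the normal distributions for β_i and η_i below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

beta = rng.normal(1.0, 1.0, size=n)    # true individual impacts
eta = rng.normal(0.0, 1.0, size=n)     # program officers' perception error
P = (beta + eta > 0).astype(float)     # treat iff perceived impact is positive

tpe = (beta * P).mean()                # E beta_i P_i: the total program effect
rct = beta.mean() * P.mean()           # E beta_i x E P_i: the RCT-based estimate
print(round(tpe, 2), round(rct, 2))
```

With these distributions the TPE (approximately 0.98) clearly exceeds the RCT-based estimate (approximately 0.76), because officers disproportionately treat units with high β_i.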
The proposed regression method gives an unbiased estimator of the TPE using observational data for (y_i, P_i) from a random sample of the population. The difference is that while the RCT approach compares average outcomes at the level of program officers, the regression approach does so at the level of beneficiaries. The RCT approach therefore has lower statistical power. 14
Moving beyond the example there is a more fundamental objection to the RCT approach if
outcomes depend not only on P but also on X, as in (1). If the RCT involved randomization over
actual program officers then it is unlikely that randomization can also be achieved in terms of all
the confounding X variables since program officers will not have been posted randomly across
space. This introduces a correlation between X and characteristics of the program officers and
hence a correlation between P and X. The two groups of program officers ("treatment" and "control") will therefore differ systematically so that internal validity is lost. 15 The proposed
approach, by contrast, collects data at the level of beneficiaries and can therefore control for
differences in X.
In summary, estimating the TPE on the basis of group averages from RCTs becomes
problematic when Î² and P are correlated as a result of targeting on the basis of unobservables. If
one randomizes at the level of beneficiaries the TPE estimator will be biased because the
correlation is not taken into account. If one randomizes at the level of program officers the
estimator is inefficient and, if confounders are important, may become inconsistent.
V. An Empirical Example: Estimating the Total Program Effect for a Health Insurance
Program in Vietnam
To illustrate how the total program effect can deviate from a naïve approach to calculating the effect of a program, a study of the impact of a health insurance program in Vietnam (Wagstaff and Pradhan, 2005) is reconsidered. Health insurance was introduced between the 1992-93 and
the 1997-98 rounds of the Vietnam Household Living Standards Survey (General Statistics
Office of Vietnam, 1993 and 1998). To account for possible treatment heterogeneity Wagstaff
and Pradhan match households on propensity scores and then compare changes in health
outcomes (as well as some non-health outcomes) between insured households or individuals and
(matching) uninsured households or individuals. They find modest favorable effects on
children's nutritional status, a mild effect on health expenditure and a sizeable effect on non-health spending.
A propensity score based approach is not suitable for calculation of a total program effect since
the common support requirement in a PSM approach will exclude part of the population in a
systematic way. Therefore the Vietnam data are used to estimate the effect of the program using a standard diff-in-diff approach, i.e. without allowing for heterogeneity (labeled 'naïve'). The results are compared with an estimate of the TPE. In this case the 'program' is a simple intervention. 16 This makes a comparison with a standard approach clearer.
The data are summarized in Table 1. A difficulty is that some of the outcome variables are
individual anthropometric measurements while only households can be matched between survey
rounds. Therefore the individual measurements have been averaged per household - a crude
procedure only suitable for the current purpose of illustrating the TPE. Lacking information on
1992-3 the sampling weights from 1997-8 are used; clustering is also based on the 1997-8
survey round.
The outcome variables considered are changes in arm circumference, height, body weight, health expenditure and total expenditure. The explanatory variables are the other variables shown in Table 1 (insurance status and the controls: school attended, currently attending school, gender, age, a farm dummy, and household size) and their interactions with the intervention variable. When total expenditure is not the dependent variable it is also used as a control variable.
Table 2 summarizes the results. First a naïve regression is run (without interaction terms) and the implied program effect is calculated as the regression coefficient of insurance times mean insurance. This naïve program effect is then compared with a TPE calculated as in equation (6).
The results show striking differences between the two methods. In the case of arm
circumference the standard method would have led to the conclusion that insurance had no
(significant) effect. Once treatment heterogeneity is allowed for the effect is in fact highly
significant albeit very small. For height neither method finds a significant effect. For body
weight both methods show a significant increase but the effect is more than twice as large when
heterogeneity is allowed for.
Insurance appears to have no significant effect on health expenditure irrespective of the method
used. Both methods do find a substantial (and significant) effect of insurance on total
consumption. Again, the effect is stronger once one takes heterogeneity into account.
Obviously, there is no reason why these results should generalize. However, they do suggest that treatment heterogeneity can have a substantial effect on the estimates of a program's impact. A
simple way to investigate this possibility is to test for the joint significance of the coefficients on
the variables which would not normally be included in the regression: the interactions of
treatment variables with themselves and with the controls. When this test indicates that
heterogeneity may be an issue it is advisable to calculate the TPE.
VI. Conclusion
Policy makers in developing countries, NGOs and donor agencies are under increasing pressure
to demonstrate the effectiveness of their program activities. At the same time there is a growing
interest in using randomized controlled trials (RCTs) for impact evaluation of projects. This
raises the question to what extent RCTs can be used to evaluate programs, for instance by
aggregating the impact of the components of the program. This question is particularly relevant
for the evaluation of budget support or of NGOs which typically involve a wide variety of
activities.
The strength of RCTs is in establishing proof of principle. Going further and using RCTs to
estimate the impact of programs is possible in special cases but becomes problematic if the
probability of assignment is correlated with the effectiveness of the intervention. For example,
teachers may give more attention to children who they think can benefit more from it. An RCT
which randomizes at the level of beneficiaries (children) would produce a biased estimate of the
program effect by ignoring this correlation between assignment and treatment effects.
Alternatively, randomization at the appropriate level (teachers) would require a larger sample
for the same precision. If confounders are important and correlated with characteristics of the program officers, the RCT-based estimate of the program's impact would even be inconsistent.
The approach proposed in this paper requires observational panel data for a representative
sample of beneficiaries rather than experimental data for randomly selected treatment and
control groups. If treatment is exogenous this will correctly reflect the assignment process even
under treatment heterogeneity. Instead of estimating average impact coefficients for each of the
various interventions of the program, the expected value (across beneficiaries) of the total
impact of the combined interventions is estimated. This gives the total program effect (TPE).
The paper has shown how and under what conditions regression techniques can be used to
estimate the TPE in the presence of selection effects. As an example, TPE estimates were presented for a simple intervention: a health insurance program in Vietnam. The example shows that allowing for heterogeneity can lead to very different estimates of a program's effect. The proposed method offers a simple way of dealing with such heterogeneity.
The approach has three advantages. First, by using observational data for a random sample from
the population of intended beneficiaries external validity is ensured. While the disadvantages of
observational data are well known, this is an important advantage. Secondly, by focusing on the
combined effect of program components they are automatically correctly weighted. Finally, it
avoids the problems which RCTs encounter when assignment is imperfectly controlled and
correlated with unobservables, as is plausible in development programs.
Notes
1
This approach is discussed in White (2006) and Elbers et al. (2009).
2
Here P reflects "actual" treatment. In principle it could reflect "intended" treatment if intended treatment can be observed, e.g. because intended beneficiaries were offered vouchers.
3
Strictly speaking this is the total effect of changes in the program. The symbol E is used for population averages
and a bar over a variable for sample averages. Note that the total program effect does not include general
equilibrium effects of the program.
4
This assumes that the autonomous trend γ = γ_1 − γ_0 is the same for all subjects (or, alternatively, that the difference Δγ_it is exogenous and can be treated as part of the residual). In the terminology of double differencing this is the assumption of parallel trends. If this assumption is questionable then data for more periods are needed to estimate how trends depend on P. This paper abstracts from this complication and limits the analysis to two periods. The extension to more periods is non-trivial but conceptually straightforward.
5
Higher order approximations would not change the argument but it should be noted that the number of regressors
expands very rapidly. De Janvry et al. (2012) account for treatment heterogeneity in a similar way in the context of
a schooling program.
6
Obviously, to identify θ_4 a restriction on the parameters such as θ_4,kℓ = θ_4,ℓk is required.
7
Differencing is sufficient because of the assumption of parallel trends (cf. footnote 4).
8
Imbens (2010) describes a reduction in class size in all California schools. This is an example of universal
treatment.
9
Deaton (2010) gives the example where random assignments made by the central government (e.g. the Ministry of
Education) are partly offset by induced changes in allocations by local or provincial governments. Ravallion (2012)
gives a similar example and Chen et al. (2009) quantify such a spillover effect in China. Similarly, the political
economy may be such that the central government is unable to prevent allocations being diverted to favored ethnic
or political groups. In either case P_i might be correlated with β_i.
10. This is similar to the case considered by Das et al. (2004, 2007), where teacher absenteeism as a result of HIV/AIDS induces greater parental input.
11. Recall that the approach does not involve a distinction between treatment and control groups: most if not all subjects receive some treatment.
12. This has implications for sampling: since data on policies in neighboring villages are required, one must sample groups (possibly pairs) of adjacent villages.
13. Duflo et al. (2008, pp. 3935-37) make this point in a similar context (partial compliance), concluding that "One must compare all those initially allocated to the treatment group to all those initially randomized to the comparison group".
14. This is shown in the supplemental appendix.
15. This is shown in the supplemental appendix.
16. It should be noted that the intervention variable is not binary (as it would be in a 'project') since insurance enrollment is measured as an average at the household level.
References

Almeida, Rita K., and Emanuela Galasso (2010), 'Jump-starting Self-employment? Evidence for Welfare Participants in Argentina', World Development 38(5): 742-55.

Banerjee, Abhijit V., and Esther Duflo (2009), 'The Experimental Approach to Development Economics', Annual Review of Economics 1: 151-78.

Bracht, Glenn H., and Gene V. Glass (1968), 'The External Validity of Experiments', American Educational Research Journal 5(4): 437-74.

Chen, Shaohua, Ren Mu, and Martin Ravallion (2009), 'Are There Lasting Impacts of Aid to Poor Areas?', Journal of Public Economics 93(3): 512-28.

Das, Jishnu, Stefan Dercon, James Habyarimana, and Pramila Krishnan (2004), 'When Can School Inputs Improve Test Scores?', World Bank Policy Research Working Paper 3217, Washington, DC: The World Bank.

Das, Jishnu, Stefan Dercon, James Habyarimana, and Pramila Krishnan (2007), 'Teacher Shocks and Student Learning: Evidence from Zambia', Journal of Human Resources 42(4): 820-62.

Deaton, Angus (2010), 'Instruments, Randomization, and Learning about Development', Journal of Economic Literature 48(2): 424-55.

De Janvry, Alain, Frederico Finan, and Elisabeth Sadoulet (2012), 'Local Electoral Incentives and Decentralized Program Performance', Review of Economics and Statistics 94(3): 672-85.
De Kemp, Anthonie, Jörg Faust, and Stefan Leiderer (2011), Between High Expectations and Reality: An Evaluation of Budget Support in Zambia, Bonn/The Hague/Stockholm: BMZ/Ministry of Foreign Affairs/Sida.

Duflo, Esther, Rachel Glennerster, and Michael Kremer (2008), 'Using Randomization in Development Economics Research: A Toolkit', in T. Paul Schultz and John Strauss (eds.), Handbook of Development Economics, Amsterdam: North-Holland, pp. 3895-3962.

Elbers, Chris, and Jan Willem Gunning (2009), 'Evaluation of Development Policy: Treatment versus Program Effects', Tinbergen Institute Discussion Paper 2009-073/2.

Elbers, Chris, Jan Willem Gunning, and Kobus de Hoop (2009), 'Assessing Sector-Wide Programs with Statistical Impact Evaluation: A Methodological Proposal', World Development 37(2): 513-20.

General Statistics Office of Vietnam (1993), Living Standards Survey 1992-93, http://go.worldbank.org/JZFNBLXM80.

General Statistics Office of Vietnam (1998), Living Standards Survey 1997-98, http://go.worldbank.org/4QR0OSXMD0.

Heckman, James J. (1997), 'Instrumental Variables: A Study of Implicit Behavioral Assumptions Used in Making Program Evaluations', Journal of Human Resources 32(3): 441-62.
Heckman, James J., Sergio Urzua, and Edward J. Vytlacil (2008), 'Understanding Instrumental Variables with Essential Heterogeneity', Review of Economics and Statistics 88(3): 389-432.

Imbens, Guido W. (2010), 'Better LATE Than Nothing: Some Comments on Deaton (2009) and Heckman and Urzua (2009)', Journal of Economic Literature 48(2): 399-423.

Imbens, Guido W., and Joshua D. Angrist (1994), 'Identification and Estimation of Local Average Treatment Effects', Econometrica 62(2): 467-76.

Khandker, Shahidur R., Zaid Bakht, and Gayatri B. Koolwal (2009), 'The Poverty Impact of Rural Roads: Evidence from Bangladesh', Economic Development and Cultural Change 57(4): 685-722.

Ravallion, Martin (2009), 'Evaluation in the Practice of Development', World Bank Research Observer 24(1): 29-53.

Ravallion, Martin (2012), 'Fighting Poverty One Experiment at a Time: A Review of Abhijit Banerjee and Esther Duflo's Poor Economics: A Radical Rethinking of the Way to Fight Global Poverty', Journal of Economic Literature 50(1): 103-14.

Rodrik, Dani (2008), 'The New Development Economics: We Shall Experiment, but How Shall We Learn?', John F. Kennedy School of Government, Harvard University, HKS Working Paper RWP08-055.

Wagstaff, Adam, and Menno Pradhan (2005), 'Health Insurance Impacts on Health and Nonmedical Consumption in a Developing Country', World Bank Policy Research Working Paper 3563, Washington, DC: The World Bank.
White, Howard (2006), Impact Evaluation: The Experience of the Independent Evaluation Group of the World Bank, Washington, DC: World Bank.
Table 1: Data for the Vietnam Insurance Example

| Variable: change in (average) | Mean | Std. Dev. | Min | Max |
|---|---|---|---|---|
| Arm circumference (cm) | 1.154 | 2.013 | -7.3 | 9.4 |
| Height (cm) | 5.175 | 11.35 | -49.57 | 39.84 |
| Body weight (kg) | 2.983 | 6.544 | -27.75 | 26.25 |
| Health expenditure ('000 Dong) | 1,081 | 5,519 | -8,808 | 233,965 |
| Total consumption expenditure ('000 Dong) | 6,513 | 8,009 | -22,988 | 116,826 |
| Insurance (binary at individual level) | 0.170 | 0.268 | 0 | 1 |
| School attended¹⁶ | -0.017 | 0.683 | -3.5 | 3 |
| Currently attending school (binary at individual level) | 0.082 | 0.388 | -2 | 2 |
| Gender | 0.002 | 0.138 | -0.75 | 1 |
| Age | 3.522 | 8.299 | -48.43 | 48.6 |
| Farm dummy | -0.079 | 0.421 | -1 | 1 |
| Household size | -0.267 | 1.696 | -18 | 11 |

The number of observations varies between 4299 and 4305.
Source: authors' calculations using the Vietnam Living Standards Surveys 1992-93 and 1997-98.
Table 2: Total Program Effects

| Dependent variable | Naïve program effect† (I) (s.e.) | Total program effect†† (II) (s.e.) | R² (I) | R² (II) | Remarks |
|---|---|---|---|---|---|
| Arm circumference | 0.022 (0.029) | 0.090*** (0.027) | 0.22 | 0.23 | |
| Height | -0.190 (0.154) | 0.095 (0.139) | 0.34 | 0.36 | |
| Body weight | 0.167* (0.083) | 0.384*** (0.074) | 0.31 | 0.33 | |
| Health expenditure | -28.08 (60.59) | -52.79 (51.01) | 0.03 | 0.04 | Total consumption included in controls |
| Health expenditure | 55.41 (66.42) | 64.32 (52.87) | 0.00 | 0.00 | Total consumption expenditure not included |
| Total consumption expenditure | 626.7*** (110.9) | 888.8*** (105.7) | 0.10 | 0.12 | Total consumption expenditure not included |

Robust clustered standard errors in parentheses. In all but the health expenditure regressions the squared intervention and the interactions of controls with the intervention are jointly significant.
Significance: * indicates the 5% threshold, *** the 0.1% threshold.
† The naïve program effect is calculated as the regression coefficient on the insurance variable times the estimated population mean of that variable.
†† The total program effect is calculated according to equation (6). The sampling errors on the estimated population means are not taken into account.
Source: authors' calculations using the Vietnam Living Standards Surveys 1992-93 and 1997-98.
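As an aside on mechanics: equation (6) is not reproduced in this excerpt, so the sketch below only illustrates the kind of calculation the two columns involve, assuming a quadratic specification with an intervention-control interaction. The data are synthetic stand-ins for the Vietnam surveys; all variable names and coefficient values are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-in for the survey data (all values illustrative)
n = 4300
P = np.clip(rng.normal(0.17, 0.27, n), 0.0, 1.0)  # household insurance share
X = rng.normal(0, 1, n)                           # a single control variable
dy = 0.3 * P + 0.4 * P**2 + 0.2 * X * P + 0.1 * X + rng.normal(0, 0.2, n)

# Regression of the differenced outcome on P, P^2, X and the interaction X*P
Z = np.column_stack([np.ones(n), P, P**2, X, X * P])
theta, *_ = np.linalg.lstsq(Z, dy, rcond=None)

# Naive program effect: coefficient on P times the sample mean of P
naive = theta[1] * P.mean()
# Total program effect: sample average of the full estimated effect of P,
# including the squared and interaction terms
tpe = np.mean(theta[1] * P + theta[2] * P**2 + theta[4] * X * P)
print(naive, tpe)
```

When the squared and interaction terms matter (as they do in most rows of Table 2), the total program effect differs from the naïve one, which ignores them by construction.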
Elbers and Gunning: Evaluation of Development Programs

Supplemental Appendix

Precision of TPE estimators when treatment is exogenous but not fully controlled¹
Using RCTs

"Program Officers" (POs) are divided into treatment- and control-POs. All subjects within the catchment area of a treatment-PO are considered as treated (i.e., we want to estimate the intention-to-treat effect).

Consider the following model linking the outcome $y_{ij}$ to (actual) treatment $P_{ij}$:
$$y_{ij} = \alpha_i + \beta_{ij} P_{ij} + \varepsilon_{ij},$$
where $i$ refers to the program officer responsible for administering treatment to subject $j$, who falls within the catchment area of $i$. The disturbance $\varepsilon_{ij}$ is assumed to be homoscedastic and independent of $\alpha_i$, $\beta_{ij}$ and $P_{ij}$. To model clustering by POs, an officer random effect $\alpha_i$ is included in the model. The random effects are assumed to be i.i.d. and independent of $\beta_{ij}$ and $P_{ij}$. We further assume that the number of subjects per PO is constant, to avoid trivial complications of weighting.
The evaluator wants to estimate $TPE = E\,\beta_{ij} P_{ij}$, and in order to capture any selectivity in the application of treatment by the program officers a random sample of POs has been drawn and subsequently been randomly divided into a group $T$ of treatment-POs, who are supposed to apply treatment to the ultimate beneficiaries $j$, and a group $C$ of control-POs, who are asked not to give treatment to subjects.

1. The context is that of section 4 in the main text of the paper.

Within the catchment area of the sampled POs a random sample of subjects is drawn, for whom we observe (at least) $y_{ij}$. This allows estimation of the TPE as the difference in average outcomes between the group $T$ and group $C$ subjects:
$$\widehat{TPE} = \bar{y}_T - \bar{y}_C = \bar{\alpha}_T - \bar{\alpha}_C + \left[\overline{\beta_{ij} P_{ij}}\right]_T + \bar{\varepsilon}_T - \bar{\varepsilon}_C, \qquad (A.1)$$
where the bars denote sample averages over the two groups of subjects. Since this estimator is unbiased, its precision can be determined by the variance:
$$MSE(\widehat{TPE}) = \left(\frac{1}{n_T} + \frac{1}{n_C}\right)\sigma_\alpha^2 + \frac{1}{N_T}\left[\operatorname{var}(\beta_{ij} P_{ij})\right]_T + \left(\frac{1}{N_T} + \frac{1}{N_C}\right)\sigma_\varepsilon^2,$$
where $n_T$ and $n_C$ denote the numbers of sampled treatment-POs and control-POs, $N_T$ the total number of sampled subjects associated with treatment-POs, and $N_C$ the number of sampled subjects falling under control-POs.
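This MSE expression can be checked by simulation. The sketch below draws repeated samples from the model with hypothetical parameter values (normal officer effects, heterogeneous treatment effects, and full treatment within treatment-PO catchment areas, so the estimator targets $E\beta_{ij}$) and compares the sampling variance of the difference-in-means estimator (A.1) with the formula; all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design: equal numbers of treatment- and control-POs,
# a constant number m of subjects per PO, as assumed above
n_T, n_C, m = 50, 50, 20
sigma_alpha, sigma_eps = 1.0, 1.0
beta_mean, beta_sd = 0.5, 0.3          # heterogeneous treatment effects

def rct_estimate():
    """One draw of the difference-in-means estimator (A.1)."""
    alpha_T = rng.normal(0, sigma_alpha, n_T)   # officer random effects
    alpha_C = rng.normal(0, sigma_alpha, n_C)
    eps_T = rng.normal(0, sigma_eps, (n_T, m))
    eps_C = rng.normal(0, sigma_eps, (n_C, m))
    beta = rng.normal(beta_mean, beta_sd, (n_T, m))
    y_T = alpha_T[:, None] + beta + eps_T       # everyone in T is treated
    y_C = alpha_C[:, None] + eps_C
    return y_T.mean() - y_C.mean()

draws = np.array([rct_estimate() for _ in range(2000)])
N_T, N_C = n_T * m, n_C * m
mse_theory = ((1 / n_T + 1 / n_C) * sigma_alpha**2
              + beta_sd**2 / N_T
              + (1 / N_T + 1 / N_C) * sigma_eps**2)
print(draws.mean())              # close to beta_mean
print(draws.var(), mse_theory)   # the two should agree
```

Note how the officer-level term $(1/n_T + 1/n_C)\sigma_\alpha^2$ dominates: clustering by PO, not the number of subjects, limits precision in this design.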
Regression using observational data

Now consider sampling directly at the level of subjects. Typically such a sample will also be clustered, albeit not necessarily by PO. To create a 'level playing field' we will assume that the sample has $n = n_T + n_C$ clusters with a total of $N = N_T + N_C$ subjects. For each sampled subject $j$ from cluster $i$ we observe $P_{ij}$ (actual treatment) and $y_{ij}$. The estimator for the TPE reduces to
$$\widehat{TPE} = \bar{y} - \frac{\overline{y_{ij}(1 - P_{ij})}}{\overline{1 - P_{ij}}} = \overline{\beta_{ij} P_{ij}} + \bar{\alpha} - \frac{\overline{\alpha_i (1 - P_{ij})}}{\overline{1 - P_{ij}}} + \bar{\varepsilon} - \frac{\overline{\varepsilon_{ij}(1 - P_{ij})}}{\overline{1 - P_{ij}}}. \qquad (A.2)$$
Assuming, as in the RCT setup, that $\alpha_i$ is independent of $P_{ij}$ and $\beta_{ij}$, this estimator is again unbiased² and

2. Correlation of $\alpha_i$ and $\varepsilon_{ij}$, $P_{ij}$ would reflect level effects which, as explained in section 2, should be neutralized by using differenced data.
$$MSE(\widehat{TPE}) = \frac{1}{N}\operatorname{var}(\beta_{ij} P_{ij}) + \operatorname{var}\!\left(\bar{\alpha} - \frac{\overline{\alpha_i(1 - P_{ij})}}{\overline{1 - P_{ij}}}\right) + \operatorname{var}\!\left(\bar{\varepsilon} - \frac{\overline{\varepsilon_{ij}(1 - P_{ij})}}{\overline{1 - P_{ij}}}\right).$$
Using the delta method and the equality $E\,\overline{(P_{ij} - \bar{P}_{ij})^2} = \frac{N-1}{N}\,EP(1 - EP)$ it can be verified that³
$$\operatorname{var}\!\left(\bar{\varepsilon} - \frac{\overline{\varepsilon_{ij}(1 - P_{ij})}}{\overline{1 - P_{ij}}}\right) = \operatorname{var}\!\left(\frac{\frac{1}{N}\sum \varepsilon_{ij}(P_{ij} - \bar{P}_{ij})}{\overline{1 - P_{ij}}}\right) \approx \frac{\bar{P}_{ij}}{1 - \bar{P}_{ij}}\,\frac{\sigma_\varepsilon^2}{N - 1},$$
and likewise that
$$\operatorname{var}\!\left(\bar{\alpha} - \frac{\overline{\alpha_i(1 - P_{ij})}}{\overline{1 - P_{ij}}}\right) = \operatorname{var}\!\left(\frac{\frac{1}{N}\sum \alpha_i(P_{ij} - \bar{P}_{ij})}{\overline{1 - P_{ij}}}\right) \approx \frac{\bar{P}_{ij}}{1 - \bar{P}_{ij}}\,\frac{\sigma_\alpha^2}{N - 1}.$$
It follows that in the regression setup precision is of order $N$, while in the RCT setup precision is at best of order $N_T/2$ and, if clustering of the data is an issue, of order $n_T/2$. (Note that if the two groups are of equal size, $N_T = N/2$, then the regression setup is twice as precise as the RCT setup.)
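The delta-method approximation for the observational estimator (A.2) can be verified numerically. The sketch below takes the simplest case consistent with the derivation above: a homogeneous treatment effect, binary subject-level treatment, and no cluster effects, so the relevant variance is $\operatorname{var}(\beta_{ij}P_{ij})/N$ plus the $\varepsilon$ term. All parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

N, reps = 2000, 4000
beta, p, sigma_eps = 0.5, 0.3, 1.0   # homogeneous effect, treatment share p

def tpe_hat():
    """One draw of the observational estimator (A.2), no cluster effects."""
    P = (rng.random(N) < p).astype(float)        # observed treatment status
    y = beta * P + rng.normal(0, sigma_eps, N)   # outcome in differences
    return y.mean() - (y * (1 - P)).mean() / (1 - P).mean()

draws = np.array([tpe_hat() for _ in range(reps)])

tpe_true = beta * p                   # TPE = E[beta_ij * P_ij]
var_approx = (beta**2 * p * (1 - p) / N                  # var(beta P) / N
              + (p / (1 - p)) * sigma_eps**2 / (N - 1))  # delta-method term
print(draws.mean(), tpe_true)
print(draws.var(), var_approx)
```

With $N$ in the denominator rather than $N_T$ or $n_T$, the simulated variance shrinks at the rate the text claims for the regression setup.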
Covariates

Both methods fail if $\varepsilon$ and $\beta P$ are correlated. What if there are observables $X_{ij}$ determining both $P$ and $y$? This could be the result of program targeting. In that case formulas (A.1) and (A.2) can no longer be used. To account for the confounding effect of covariates a regression approach is required, also with an RCT setup. For RCTs using intention to treat by PO for estimating the TPE, efficient estimation would amount to a regression equation like
$$y_{ij} = \alpha_i + TPE \cdot I_{\{i \in T\}} + \gamma X_{ij} + \varepsilon_{ij}.$$
The reason formula (A.1) can no longer be used is that randomization over POs does not guarantee randomization over the observables $X_{ij}$. Applying formula (A.1) we would find
$$\widehat{TPE} = \bar{y}_T - \bar{y}_C = \bar{\alpha}_T - \bar{\alpha}_C + \gamma(\bar{X}_T - \bar{X}_C) + \left[\overline{\beta_{ij} P_{ij}}\right]_T + \bar{\varepsilon}_T - \bar{\varepsilon}_C.$$
The bias $\gamma(\bar{X}_T - \bar{X}_C)$ would vanish if $\bar{X}_T \approx \bar{X}_C$, i.e., when $X_{ij}$ and $I_{\{i \in T\}}$ are uncorrelated.

3. In this case $E$ denotes an average over all possible samples.
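A small simulation illustrates the bias term $\gamma(\bar{X}_T - \bar{X}_C)$ and its removal by the covariate regression. To make the bias visible in a single draw, the covariate is deliberately constructed to be imbalanced across the two groups; PO random effects are omitted and all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

n_po, m = 100, 20                        # POs and subjects per PO
tpe, gamma, sigma_eps = 0.5, 1.0, 1.0

in_T = np.repeat(np.arange(n_po) < n_po // 2, m).astype(float)  # I{i in T}
# Covariate deliberately imbalanced across groups to make the bias visible
X = rng.normal(0, 1, n_po * m) + 0.5 * in_T
y = tpe * in_T + gamma * X + rng.normal(0, sigma_eps, n_po * m)

# Naive difference in means carries the bias gamma * (Xbar_T - Xbar_C)
naive = y[in_T == 1].mean() - y[in_T == 0].mean()

# OLS of y on [1, I{i in T}, X] accounts for the confounding covariate
Z = np.column_stack([np.ones_like(in_T), in_T, X])
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
print(naive)       # biased upward by roughly 0.5 * gamma
print(coef[1])     # close to the true TPE of 0.5
```

In practice the standard errors would of course have to be clustered by PO; the sketch only demonstrates the bias-removal logic of the regression equation above.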