Thoughts on the use of propensity score matching in the evaluation of the Educational Maintenance Allowance (EMA)

By Alex Bryson, Principal Research Fellow, Policy Studies Institute

This note discusses the use of propensity score matching (PSM) in the first two years of the EMA evaluation. Comments relate to two reports (Ashworth et al., 2001; Ashworth et al., 2002 - referred to as R1 and R2 respectively below) where PSM is used to estimate EMA effects on participation in school post-compulsion (Year 12) and retention by Year 13. The reports are engaging, well-written and competent pieces of evaluation research. Thanks to the honesty and endeavour of the evaluators in seeking continual improvements in their analyses, reading the reports back-to-back offers real insights into the sensitivity of the evaluation results to the implementation of matching. I therefore commend the reports to policy evaluators and policy makers. Below I make observations under eight broad headings.

1. Identifying the causal impact of EMA

Matching occurs at two levels: at the area level, to find matches for the LEA pilot areas among the remaining LEAs in England, and at the level of the individual across pilot and control areas. The latter may be required, even if area-level matching seems reasonable, to account for differences in demographics across pilot/control areas that may affect outcomes.

Good area-level matching is always critical in area-based evaluation because the approach is predicated on the assumption that differences in outcomes can be attributed to the programme (in this case EMA) and not to other area-based differences. The evaluators note some 'problems' (R1, page 13) with the initial area matching, which relied on Youth Cohort Study (YCS) data. The YCS lacked some key data items - notably household income, which is a determinant of EMA eligibility as well as an influence on young people's education choices - and contained relatively few observations per LEA.

The evaluators report using EMA survey data to improve the allocation of pilot areas but, of course, by that point, the control areas used to administer the survey had already been chosen. The evaluators readily admit that the area matching was by no means perfect. They were particularly concerned about systematic differences between pilot and control areas in pre-EMA staying-on rates and deprivation rates. Their concerns were borne out. By R2, the evaluators had ward-level indices of deprivation, ward-level staying-on rates pre-EMA plus information on local school quality. They were able to use these controls in individual-level matching, so overcoming some of the drawbacks in the area-level matching. The effect of conditioning on these extra variables is dramatic. In R1 EMA effects in urban areas get smaller post-matching whereas effects in rural areas get larger post-matching. Conditioning on the wider set of Xs in R2, the opposite happens: EMA effects in urban areas are bigger post-matching whereas rural effects become smaller post-matching and are no longer statistically significant. Of course, we can be more confident in the results presented in R2 but we can never know whether we have conditioned on all relevant Xs. PSM, like all other non-experimental evaluation techniques, relies on rich data to identify unbiased programme estimates.

2. Detail on matching procedures and diagnostics

In judging whether PSM is likely to capture the causal impact of a programme and, if not, the likely biases in the estimated effects, one needs to know:

The reports do not offer enough detail to make the judgement. For instance, the probit estimates for the area-level matching and individual-level matching are never presented. Although we are informed that determinants of treatment and outcomes differ systematically across the sexes and by rural and urban location, prompting sub-group analyses throughout, we are never shown the probit estimates for these sub-groups which would allow the reader to see how the groups differ. There is only cursory mention of the shift from nearest neighbour matching in R1 to kernel density regression in R2 and, in spite of some support problems in estimating effects for rural areas, these are not discussed. In defence of the evaluators, they have already produced two lengthy reports and it is unlikely that any reader would have the stamina for more. Instead, it might be advisable to publish a separate technical report that presents all the information above. (Perhaps this is a lesson for all government evaluation that relies on the use of complex techniques.) Only then can the reader come to a considered opinion about whether the study has isolated an unbiased EMA effect.
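
To make concrete what a technical report would need to document, the following minimal sketch contrasts nearest-neighbour matching with kernel matching and flags treated units that fall outside common support. It assumes propensity scores have already been estimated (for example from a probit). The function names, the Gaussian kernel, the bandwidth and the toy data are illustrative assumptions, not the evaluators' implementation.

```python
import math

def nearest_neighbour_att(treated, controls):
    """Effect on the treated by one-to-one nearest-neighbour matching
    with replacement. Inputs are lists of (propensity_score, outcome)."""
    diffs = []
    for p_t, y_t in treated:
        # Pick the control whose score is closest to this treated unit.
        _, y_c = min(controls, key=lambda c: abs(c[0] - p_t))
        diffs.append(y_t - y_c)
    return sum(diffs) / len(diffs)

def kernel_att(treated, controls, bandwidth=0.05):
    """Effect on the treated by kernel matching: each treated unit is
    compared with a weighted average of ALL controls, with weights that
    fall as the distance between scores grows (Gaussian kernel assumed)."""
    diffs = []
    for p_t, y_t in treated:
        weights = [math.exp(-0.5 * ((p_c - p_t) / bandwidth) ** 2)
                   for p_c, _ in controls]
        y_hat = sum(w * y_c for w, (_, y_c) in zip(weights, controls)) / sum(weights)
        diffs.append(y_t - y_hat)
    return sum(diffs) / len(diffs)

def off_support(treated, controls):
    """Treated units whose score lies outside the range of control scores."""
    lo = min(p for p, _ in controls)
    hi = max(p for p, _ in controls)
    return [(p, y) for p, y in treated if p < lo or p > hi]

# Hypothetical toy data: (propensity score, stayed on in education: 1/0).
treated = [(0.28, 1), (0.62, 1), (0.90, 1)]
controls = [(0.25, 0), (0.35, 1), (0.55, 0), (0.65, 1)]

nn_effect = nearest_neighbour_att(treated, controls)
kernel_effect = kernel_att(treated, controls)
unsupported = off_support(treated, controls)
```

In this toy example the treated unit with a score of 0.90 has no comparable control, which is the kind of support problem a reader would want reported alongside the estimates.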

3. Which treatment effect: eligibility and take-up

The evaluators make it clear that they are estimating the impact of eligibility for EMA, as opposed to actual take-up of EMA. Two issues arise. First, the evaluators face severe difficulties in accurately estimating eligibility using the survey data. EMA eligibility depends on the age of the child and, because EMA offers means-tested assistance, household income. The latter is likely to be measured with substantial error due to the number of income items comprising household income. The reports make this clear through the full reporting of imputation methods to overcome item non-response and discussion of the misalignment between eligibility and take-up in the pilot areas. It is not clear what bias this measurement error might imply. Second, the effect of eligibility combines the impact of EMA receipt and the probability of take-up. The latter varies a great deal across areas in a way that may be explained, in part, by differences in administrative efficiency across LEAs. The problem here is that these differences in LEA efficiency cannot be easily separated from EMA effects: using eligibility as the measure of treatment conflates the two.

Given these two concerns - measurement error in estimating eligibility and conflation of entitlement and take-up - it would have been valuable to supplement estimates of EMA eligibility with estimates of actual EMA take-up. This could be done by taking the current estimates of the impact on all eligibles and dividing through by the take-up rate (provided one is prepared to assume that the impact on those who do not take it up is zero). Alternatively, one could redo the matching to match those who take up their entitlements in pilot areas with 'like' eligibles in control areas. In any event, it would be interesting to know more about the nature of EMA take-up by modelling it in its own right. We know whether or not the young person applied and also the amount they actually received, permitting joint modelling of take-up and measurement error.
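
The first suggestion, dividing the effect on all eligibles by the take-up rate, amounts to a one-line rescaling. The sketch below illustrates the arithmetic under the stated assumption that EMA has zero impact on eligibles who do not take it up. The function name and the numbers are hypothetical and are not taken from R1 or R2.

```python
def effect_on_takers(effect_on_eligibles, take_up_rate):
    """Rescale an effect estimated for all eligibles into an effect on
    those who actually take up the programme.

    Assumes the effect on eligible non-takers is zero, so the whole of
    the average effect on eligibles is generated by the takers.
    """
    if not 0 < take_up_rate <= 1:
        raise ValueError("take_up_rate must be in (0, 1]")
    return effect_on_eligibles / take_up_rate

# Hypothetical figures: a 5 percentage point effect on all eligibles
# and 80 per cent take-up imply an effect of about 6.25 points on takers.
example = effect_on_takers(5.0, 0.8)
```

The lower the take-up rate, the larger the gap between the two effects, which is why variation in take-up across LEAs matters for interpretation.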

4. Problems in re-weighting data to overcome sample attrition

Like many survey-based evaluations, EMA suffers from substantial attrition resulting from opt-outs, non-contacts, movers, refusals, item non-response and non-response at follow-up interviews. Where attrition is non-random and is associated with treatment and outcomes, as in this case, it is important to try to re-weight the data so that it reflects the original pilot population. The efforts made to re-weight the achieved samples back to the pilot populations using weights generated from the Family Resources Survey (FRS) are not convincing. This is clear from Table 2.9, which shows lower EMA effects on Y12 participation using wave 2 cohort 1 relative to wave 1 cohort 1, despite the use of weights. This suggests the weighting scheme is insufficient. It might be better to generate attrition weights with probits estimating various reasons for attrition. These weights - which are the inverse of the probability of remaining in the sample (that is, of not being lost through non-response, moving etc.) - can be combined with the match weights in estimating the EMA effect. In fact, the evaluators' efforts to re-estimate using attrition rates ran into difficulties because the bootstrapping procedure needed to obtain correct standard errors for PSM estimates slowed the process down considerably.
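
As a rough illustration of how attrition weights and match weights might be combined, the sketch below multiplies each matched comparator's match weight by the inverse of its estimated probability of remaining in the sample, then takes a weighted mean of outcomes. The response probabilities would in practice come from a model such as a probit. Here they, the match weights, the outcomes and the function names are all hypothetical.

```python
def combined_weight(p_response, match_weight):
    """Match weight scaled by an inverse-probability attrition weight.

    p_response is the estimated probability that the individual remained
    in the sample; respondents who resemble those lost to attrition have
    a low p_response and so count for more.
    """
    if not 0 < p_response <= 1:
        raise ValueError("p_response must be in (0, 1]")
    return match_weight / p_response

def weighted_mean(values, weights):
    """Weighted average of outcomes."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical matched controls: outcome (1 = stayed on), estimated
# probability of remaining in the sample, and weight from the matching.
outcomes = [1, 0, 1]
p_response = [0.5, 0.8, 0.4]
match_weights = [1.0, 2.0, 1.0]

weights = [combined_weight(p, m) for p, m in zip(p_response, match_weights)]
match_only = weighted_mean(outcomes, match_weights)   # ignores attrition
attrition_adjusted = weighted_mean(outcomes, weights)  # corrects for it
```

Comparing the two means shows how far the counterfactual outcome moves once attrition is taken into account. Standard errors would still need to be obtained separately, for example by bootstrapping.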

5. The relative impact of full versus partial eligibility

The evaluators find the EMA effect is only statistically significant for full eligibility: there is no significant effect of partial eligibility. The distinction between full and partial eligibility is not that useful from a policy perspective because partial eligibility lumps together individuals who are eligible for anything between £5 and just under £30 (the full amount) per week.

6. How to interpret the diminishing EMA effect

The EMA effect for urban men diminishes over time such that the effect is not statistically significant for cohort 2 (Table 2.3 of R2). It is not clear why this is so but the evaluators point to a 5 percentage point increase in post-compulsory education participation by young men in the urban control areas between cohorts 1 and 2 (R2, p. 25). This change in the control areas seems large over such a small period of time: if it had occurred in the pilot areas it may well have been attributed erroneously to EMA! Among urban women, on the other hand, the EMA effect increases. Clearly we need to know more about what has been happening in the control areas since they are supposed to be 'like' the pilot areas, other than that the pilot areas are running EMA.

7. Difficulties in extrapolating from these results to a wider population

The evaluators seek to extrapolate from their results to the potential impact of EMA if extended more widely. This extrapolation is not convincing. The problem the evaluators face is that EMA pilot areas were chosen because they were not representative of England as a whole. Indeed, they were chosen as pilot areas because they were thought to have particularly low levels of participation in post-compulsory education. It seems unlikely, therefore, that the use of FRS-generated weights to weight the results to the wider population will suffice. If policy-makers were interested in designing an evaluation capable of estimating the EMA effect in other areas it would have been preferable to have chosen the pilot areas at random.

A second reason why efforts to extrapolate these results to the wider population are unconvincing is the difficulty the evaluators have in identifying the impact EMA may have on staying-on and retention among those ineligible for it. These effects may be large in the case of programmes affecting a sizeable proportion of a population: this is the case with EMA which, according to the evaluators, may be available to over half of each cohort considering whether to stay on in post-compulsory education. The concern is that EMA may affect the schooling decisions of non-eligibles either positively or negatively, thus boosting or diminishing the EMA effect arising from its direct impact on eligibles. To get at these 'spillover' effects, the evaluators undertake additional matching, this time using ineligibles as well as eligibles in the matching. They find no evidence of spillovers on participation (see R2, Table 2.8). However, the problem with this analysis is that ineligibles are taken from control areas as well as pilot areas. By definition, one would never expect to encounter spillover in the control areas because the programme is absent in those areas. It might have been interesting to see what spillover effects may emerge from an analysis confined to eligibles and ineligibles in the pilot areas.

There is an additional problem in extrapolating the rural area results to rural areas more generally since the rural EMA estimates are based on a single rural pilot area - Cornwall - which is not particularly well-matched to its two control area counterparts. It is always hazardous to draw policy inferences on the basis of what is effectively a single observation. In the event, the R2 results find EMA does not significantly improve participation in rural areas. Although this result is due in part to the inflation of standard errors arising from the introduction of ward-level controls into the matching estimator, one does not know a priori what effect might emerge with a larger sample size.

8. General Lessons for Evaluation

There are two general lessons for evaluation arising from the EMA evaluation. The first relates to the use of PSM, the second to collaboration between policy makers and evaluators.

PSM has become an established tool in the evaluator's tool box. But it is not an all-purpose tool: it performs certain tasks well, others not so well. The matched area design is usually required in evaluation where a survey is indispensable in collecting outcome information or the conditioning Xs needed to isolate the causal effect of a programme. The cost of doing so means it is important to collect this information from individuals likely to be close comparators to the treated. Identifying matching areas is one way of doing this, provided matching is also undertaken at the individual level. But matching will not be appropriate where there is uncertainty around the determinants of the treatment or outcomes, or where evaluators do not observe key Xs. Even if these conditions are met, areas and individuals matched at one point in time do not remain matched forever, raising issues about the use of the matched comparison area design for the evaluation of long-term outcomes. Even if matching is conducted well, it will only give estimates of programme effects at a particular point in time and for the surveyed cohort. One cannot rely on single point-in-time evaluations to extrapolate to impacts in other times or places. The advent of richer over-time administrative data - some of it collected explicitly for evaluation, some of it now available through the Census and other sources - is beginning to offer evaluators real alternatives to survey-generated data. It is likely that, in the near future, evaluations, including those using PSM, will come to rely increasingly on administrative data.

The second lesson that emerges from the EMA evaluation is the value of close collaboration between policy makers and evaluators. In the case of EMA, for example, the evaluators were able to change the control areas to effect a better match than was possible with YCS data alone. In the future, it is likely that more evaluations will adopt the 'demonstration' model prevalent in the United States where the treatment is devised alongside the evaluation strategy.