One of the most pressing problems facing evaluators and policy makers is to determine whether a policy or programme has caused change to occur in the outcomes it was designed to influence, and whether any such change has occurred in the desired direction.
For example, have policies that aim to reduce unemployment led to a fall in the numbers out of work, or are some other factors responsible? Social experiments help policy makers answer these important questions.
Social experiments essentially test whether a programme or policy has led to change in the outcomes that the programme or policy was designed to affect, over and above that which would have occurred in the absence of the programme or policy. Social experiments do this by providing potentially unbiased estimates of the programme or policy's impact, that is, estimates of impact that are entirely attributable to the programme or policy itself, rather than to some other factor(s). For example, a new policy might seek to reduce re-conviction rates among offenders. A social experiment aims to show how much of any observed drop in re-convictions is attributable to the policy alone, rather than to some other factor(s).
More detailed guidance is contained in the Background Document.
Central to a social experiment is the concept of random allocation, alternatively referred to as random assignment or randomisation. To understand how this works, consider Figure 7.1, which depicts a simple random allocation design:

Randomisation is very important because, provided that the sample is large enough, it ensures that there are no systematic differences, on average, between units in the programme and control groups at the point when they are allocated or assigned. In other words, there is no systematic relationship between membership of the programme or control groups and the observed and/or unobserved characteristics of the units in the study. This means that any statistically significant difference between the average value of outcomes for the programme group and the average value of those same outcomes for the control group (represented as Δ = Y_p - Y_c above), measured after the new policy or programme has been introduced, results from the impact of the programme or policy alone.
Very often the control group is said to be an estimate of the counterfactual. The counterfactual represents what would have happened to the programme group had the units allocated to it not been subject to the new policy or programme, or subject to an alternative policy or programme.
Many evaluators consider a control group formed at random to be the best representation of the counterfactual (Boruch 1997; Shadish, Cook and Campbell 2002; Orr 1999). This is because, on average, units in the programme group are statistically equivalent to units in the control group, except for the fact that the former are exposed to the new programme or policy being tested.
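The logic of random allocation and the impact estimate Δ = Y_p - Y_c can be illustrated with a small simulation. This is a minimal sketch rather than a prescribed procedure: the function names, the simulated baseline outcomes and the assumed true effect of 2.0 are all hypothetical and chosen only for illustration.

```python
import random
from statistics import mean


def randomise(units, rng):
    """Randomly split units into a programme group and a control group."""
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def impact_estimate(programme_outcomes, control_outcomes):
    """Delta = Y_p - Y_c: difference between the two groups' mean outcomes."""
    return mean(programme_outcomes) - mean(control_outcomes)


def simulate_experiment(n_units=10_000, true_effect=2.0, seed=0):
    """Simulate one experiment and return the estimated impact."""
    rng = random.Random(seed)
    # Hypothetical outcome each unit would have WITHOUT the programme.
    baseline = {unit: rng.gauss(10.0, 3.0) for unit in range(n_units)}
    programme, control = randomise(baseline.keys(), rng)
    # Only the programme group receives the (assumed) true effect.
    y_p = [baseline[u] + true_effect for u in programme]
    y_c = [baseline[u] for u in control]
    return impact_estimate(y_p, y_c)
```

Because allocation is random, the control group's mean outcome stands in for the counterfactual, so with a large sample the value returned by `simulate_experiment()` should lie close to the assumed true effect.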
For social experiments to provide unbiased estimates of programme impacts a number of conditions should hold. Estimates must possess both internal and external validity. Furthermore, the statistical power of the experiment must be sufficient to be able to detect programme impacts should they exist.
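To make the idea of statistical power concrete, the sketch below uses the standard normal-approximation formula for comparing two group means to approximate how many units each group would need. It is an illustration under stated assumptions (equal group sizes, a common outcome standard deviation, a two-sided test); the function name and default values are hypothetical and do not come from the guidance itself.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(effect, sd, alpha=0.05, power=0.80):
    """Approximate units needed per group to detect a difference in means.

    Uses n = 2 * ((z_(1 - alpha/2) + z_power) * sd / effect) ** 2,
    the usual normal approximation for a two-sided, two-group comparison.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)


# Smaller programme impacts need much larger experiments to be detected.
n_medium = sample_size_per_group(effect=0.5, sd=1.0)  # 63 per group
n_small = sample_size_per_group(effect=0.2, sd=1.0)   # 393 per group
```

The calculation shows why an under-sized experiment may fail to detect a genuine programme impact: halving the expected effect roughly quadruples the required sample.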
Randomisation in theory provides a single impact measure or a series of impact measures which, if the experiment is properly designed and its control requirements are adhered to, provide strong evidence of programme impacts net of confounding factors. The additional advantages of randomisation can be summarised as follows:
The randomised controlled trial (RCT) has impressive advantages as a method of evaluation, but it has some disadvantages. The criticisms and difficulties associated with RCTs can be grouped under three headings (see Burtless and Orr 1986): policy utility, methodological, ethical/practical and cost.
Commentators have criticised RCTs on the grounds that they do not address many of the questions of interest to policymakers (see Heckman and Smith 1995, and Pawson and Tilley 1997, for examples of critiques of RCTs addressing this issue from different perspectives). In many of their guises, critiques of this nature refer to what is known as the 'black box' problem. An RCT 'produces a description of outcomes, rather than explanations of why programmes work' (Pawson and Tilley 1997: 30). Clearly policymakers are interested in what it is about their policy or intervention that leads to change. Questions of this nature are important for those responsible for implementation and delivery. It is worth noting that these arguments are not unique to RCTs and could also be applied to quasi-experiments and other methods of policy evaluation.
Heckman and Smith (1995: 95) suggest that in most practical cases, RCTs will not be able to answer important questions such as:
'Some of these questions might in principle be evaluated using random assignment designs, but practical difficulties would make it impossible in most cases' (Heckman and Smith 1995: 95).
Whilst RCTs usually cannot inform many black box problems and implementation issues, they do provide valid information on the likely impact of a policy or programme and the variation in effect sizes across different research sites. The reasons for these variations in effect size, and other implementation issues, are usually best addressed using qualitative, consultative and other formative methods of evaluation. However, it is sometimes possible to test implementation issues and some black box processes by designing a structured experiment that evaluates a number of delivery mechanisms.
There are a number of methodological criticisms of RCTs. However, these issues are not unique to RCTs and apply equally to quasi-experimental approaches.
A charge commonly levelled at social experiments by administrators, policy makers and even some evaluators and analysts is that they are unethical. This is said to be because individuals allocated to the control group are barred from receiving the services available to the programme group, and thus members of the control group are being discriminated against. Such attitudes may also stem from a sense of unease concerning the ethics of 'experimenting' on human subjects.
Such a charge is countered by the assertion that, before the results of the social experiment become available, the scale and direction of any impact on units in the study are unknown, a state of uncertainty known as equipoise. In other words, policy makers and evaluators have no way of knowing in advance whether those assigned to the control group will be worse off than they would have been had they been assigned to the programme group instead, or whether those assigned to the programme group will be better off. Furthermore, even if there is good reason to suspect, a priori, that a programme will produce benefits for those units receiving programme services, there is no evidence as to whether these benefits accrue at an unreasonable cost to society. The justification for launching a social experiment is that policy makers are unsure of whether the policy or programme generates the benefits it was designed to achieve. In some cases, once a social experiment has shown that some programme or policy produces positive effects for those in the programme, the experiment can be stopped and members of the control group given access to the services that have been proven effective.
There are, however, some circumstances in which it is unethical to mount a social experiment. These are where (Rossi, Freeman and Lipsey 1999, Orr 1999, Cook and Campbell 1979):
In order to address concerns regarding the ethics of social experimentation, participants are frequently asked to provide informed consent to randomisation (Boruch 1997, Orr 1999). Individuals have the experiment described to them in detail and are asked to provide written consent to be randomly allocated. What constitutes 'informed consent', however, is often contentious or uncertain.
In most cases mounting an RCT is not an insignificant undertaking. Put simply, social policy experiments are complex and expensive. Results from RCTs also take time to become available, in most cases at least two to three years. These points, though, apply to most forms of evaluation and should not be taken as arguments against RCTs. However, there are some questions that evaluators should consider in deciding whether an RCT is the appropriate evaluation methodology for a particular policy or programme:
Evaluators should note the advice of Cook and Campbell:
'The case for random assignment has to be made on the grounds that it is better than the available alternative for inferring cause and not on the grounds that it is perfect for inferring cause' (Cook and Campbell 1979: 342)
It should also be noted that quasi-experiments and other types of research and evaluation can also be very expensive because these studies usually require a significant amount of data collection. These other methods also cannot usually deliver the same degree of validity and reliability of findings, free of the biasing effects of confounding factors.
Cook and Campbell's (1979) classic text on quasi-experimental design defines a quasi-experiment as:
'Experiments that have treatments, outcome measures, and experimental units, but do not use random assignment to create the comparisons from which treatment-caused change is inferred. Instead, the comparisons depend on non-equivalent groups that differ from each other in many ways other than the presence of the treatment whose effects are being tested' (Cook and Campbell 1979: 6).
Quasi-experimental methods are basically applied in situations where the degree of control over the policy or intervention required to apply random assignment is not available to the evaluator, or the application of random assignment is felt to be unethical. For example, a policy or intervention may be introduced universally to the entire eligible target population, leaving no scope for randomising-out a proportion of those eligible to form a control group.
In such circumstances, it is the convention that the 'policy off' group, where one exists, is referred to as a comparison rather than control group. The use of the term control group is generally reserved for cases where access to the new programme or policy is determined through random assignment (with the exception of some case control studies). As is the case with a control group in an experimental design, outcomes for a comparison group within the context of a quasi-experimental design represent an estimate of the counterfactual.
The single group pre- and post-test design is a common means of estimating the impacts of policies or programmes. Generally such designs are considered to be weak because they are unable to account satisfactorily for a wide variety of alternative explanations for any observed programme impacts. That is, this type of design does not really provide any valid and reliable information about an independent counterfactual (i.e. what would have happened to a comparison group if the policy or programme had not been offered or if some other intervention had been provided). Indeed Campbell and Stanley (1963) use it for heuristic purposes as a means of illustrating the full range of factors that can undermine internal validity in quasi-experimental evaluation. It should be noted that this design can be extended in a number of ways to improve the validity of programme impact estimates.
Indeed, a single group pre- and post-test design is seldom implemented without some additional refinements (or design controls) in order to improve the capacity to draw valid causal inferences. For the purpose of explaining the design and some of its limitations, however, a simplistic variant is described in Figure 7.2:

Figure 7.2 illustrates a simple single group design. The programme or policy under investigation is directed at a target population or a subset of this population in the form of a group or cohort. Prior to the introduction of the new policy or programme, data are collected on the outcomes (or dependent variables) that the policy or programme seeks to influence, 'Y_{t-1}'. This stage in the design is referred to as the baseline data collection stage or pre-test.
Once baseline data have been collected, the new policy or programme, or policy change, can be introduced. At some point following the introduction of the programme or policy, follow-up or post-test data are collected on outcomes, 'Y_t'. The impact of the new policy or programme, or 'Δ', is simply computed as 'Y_t minus Y_{t-1}'.
This computation can be adjusted (using regression analysis) to account for measured factors known to affect outcomes other than the programme (statistical controls - in contrast to the design controls mentioned above). Such an adjustment attempts to control for changes in background variables that might have influenced 'Δ' independently of the effect of the programme or policy under investigation.
The problem with this design is that many rival events and factors could be responsible for 'Δ' other than the new policy or programme and changes in measurable background characteristics. These rival events or factors are often referred to as 'confounds' or 'threats' to internal validity. In order for 'Δ' to be an unbiased estimate of the impact of the programme or policy being evaluated, the assumption that no unmeasured change (that is, change we cannot control for in a regression model) would occur (i.e. that Δ = 0) in the absence of the policy or programme must hold. As we will see, this is a very strong assumption (Campbell and Kenny 1999) that is highly unlikely to be plausible in most contexts. It is for this reason that a single group pre- and post-test design is considered weak in terms of internal validity.
Very often, however, evaluators resort to using such designs where random allocation is not possible for design, political or administrative reasons, where a comparison group is unavailable, or where the evaluator might wish to concentrate on achieving external validity - such as accounting for scale bias. In the latter case, the evaluator may make a conscious decision to trade off internal validity for external validity.
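A small worked example may help to show why the single group estimate is fragile. The sketch below computes Δ as the post-test mean minus the pre-test mean for invented data in which outcomes would have risen anyway. All numbers, including the assumed true programme effect and the assumed background change, are hypothetical.

```python
from statistics import mean


def pre_post_delta(pre, post):
    """Delta = Y_t - Y_{t-1}: post-test mean minus pre-test mean."""
    return mean(post) - mean(pre)


# Hypothetical pre-test outcomes for three units.
pre = [10.0, 12.0, 14.0]

TRUE_EFFECT = 2.0        # assumed impact of the programme
BACKGROUND_CHANGE = 1.5  # assumed change that would have occurred anyway

# Post-test outcomes reflect both the programme and the background change.
post = [y + TRUE_EFFECT + BACKGROUND_CHANGE for y in pre]

delta = pre_post_delta(pre, post)  # 3.5
bias = delta - TRUE_EFFECT         # 1.5, the unmeasured background change
```

Here Δ overstates the assumed programme effect by exactly the amount of change that would have occurred in the absence of the programme, which is the quantity the design has no means of estimating.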
Shadish, Cook and Campbell (2002) report a typology of confounds or threats to internal validity. The interested reader should refer to this source for a fuller discussion of the threats to internal validity in experimental and quasi-experimental research, but briefly the main threats can be summarised under nine headings.
1. Ambiguous temporal precedence
When it is not clear which variable occurred first, confusion arises over which variable is the cause and which is the effect.
2. Selection
Systematic differences in characteristics between the treatment and the comparison groups (Note: this does not apply to the one-group pre- and post-test design but does apply to the two-group pre- and post-test design; see below).
3. History
Events that occur concurrently with treatment could produce the treatment effect.
4. Maturation
Naturally occurring changes over time that could account for a treatment effect.
5. Regression/ regression towards the mean
When units are selected because of their extreme scores these scores can regress towards their average on re-measurement.
6. Attrition
Loss of respondents to treatment can produce biased results if the drop-outs are different from those who remain within the group.
7. Testing
Exposure to a test can result in changes to scores on re-testing that are independent of treatment.
8. Instrumentation
The measure may not be reliable or it may change over time and these effects could be confused with a treatment effect.
9. Additive and interactive effects of threats to internal validity
The impact of a threat can be added to that of another threat or may depend on the level of another threat.
The point to note is that a well-designed social experiment (using random allocation methods) deals effectively with each of these threats to internal validity. Three of the most important threats to internal validity are discussed below: history threats, maturation threats and regression to the mean.
A 'history' threat to causal inference occurs where some event or events, between baseline (the pre-test) and follow-up (the post-test), lead to changes in outcomes independently of the programme or policy under investigation. For example, in an evaluation to measure the effect of new classroom teaching materials, a change in teaching personnel might also occur between baseline and follow-up, thereby making it difficult to ascertain whether changes in the follow-up or post-test result from the effect of the materials or change in personnel. Where such 'history' threats exist, unless such a threat is identified, measured and controlled for in any analysis, change in outcomes might erroneously be attributed solely to the policy or programme. In most policy or programme evaluations there are numerous history events which complicate, or make extremely difficult, the assessment of programme or policy effectiveness. Introducing a comparison group of units with similar characteristics to those exposed to the policy or programme under investigation can help to control for history effects. The extent of design control for history threats gained from adding a comparison group depends on the degree to which the comparison group is exposed to the same magnitude of history effects as the programme group.
A 'maturation' threat occurs where some fraction of the estimated impact stems not from the influence of the programme or policy under investigation, but simply from the passage of time between baseline and follow-up. Differences between baseline and follow-up values may capture the effects of some underlying secular trend causing the outcome variables being measured to change independently of the programme or policy being evaluated.
Regression to the mean is a common but not particularly well-understood phenomenon, which can occur for a number of reasons. In the case of a pre-test/post-test evaluation design, regression to the mean usually arises because the programme or policy is directed at individuals possessing some extreme characteristic (for example, low income, low attainment in school tests, high rates of criminal recidivism and so on). The general pattern is that units with high pre-test scores tend to score lower at post-test, and units with low pre-test scores tend to score higher at post-test - in other words, extreme scores tend on re-test to move or regress toward the mean test value.
It is important to note that regression to the mean is a potential threat to internal validity in many evaluation designs, with the exception of random allocation and regression discontinuity. In the case of the former, the effects of regression to the mean are randomly distributed between programme and control groups and do not therefore affect mean comparisons of outcomes between programme and control groups. In regression discontinuity designs, the regression line, the centre-piece of the approach, effectively controls for mean reversion (see Shadish, Cook and Campbell 2002 for further details). The interested reader should consult Campbell and Kenny (1999) for a fuller discussion of regression artefacts.
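Regression to the mean can be demonstrated with a short simulation in which no programme is delivered at all. The sketch below is illustrative only: it assumes each unit has a stable underlying score measured with random error at pre-test and post-test, and it selects the lowest-scoring tenth of units at pre-test. The distributions and the selection rule are hypothetical.

```python
import random
from statistics import mean


def regression_to_mean_demo(n_units=20_000, seed=0):
    """Return (pre-test mean, post-test mean) for units selected as extreme.

    No programme is applied, so any change between the two means is an
    artefact of selecting on an error-prone pre-test score.
    """
    rng = random.Random(seed)
    # Stable underlying score for each unit (assumed distribution).
    true_scores = [rng.gauss(50.0, 10.0) for _ in range(n_units)]
    # Each measurement adds independent random error.
    pre = [t + rng.gauss(0.0, 10.0) for t in true_scores]
    post = [t + rng.gauss(0.0, 10.0) for t in true_scores]
    # Select the lowest-scoring tenth of units on the pre-test.
    cutoff = sorted(pre)[n_units // 10]
    selected = [i for i in range(n_units) if pre[i] < cutoff]
    pre_mean = mean([pre[i] for i in selected])
    post_mean = mean([post[i] for i in selected])
    return pre_mean, post_mean


pre_mean, post_mean = regression_to_mean_demo()
apparent_gain = post_mean - pre_mean  # positive despite no programme
```

The selected group appears to improve even though nothing was done to it, which is the pattern a single group pre- and post-test design could mistake for a programme impact.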
The non-equivalent comparison group (NECG) design involves the evaluator selecting a group of units similar to those receiving the new policy or programme that is being tested. Such a group is called a comparison group (similar to a control group in a social experiment) and acts as a counterfactual. As we have seen, estimation of a counterfactual is essential to the process of causal inference or attribution. The concept underlying such selection is to obtain a comparison group that is as similar as possible to the programme group in all respects. It is a stronger form of design than the one-group pre- and post-test design because it includes a comparison group.

Figure 7.3 illustrates a simple NECG design. Pre-test data and post-test data are collected for a group of units that receive or are exposed to the new policy or programme being evaluated. For simplicity, it is assumed here that pre-test data are pre-programme measures of outcome or dependent variables of interest. Post-test data are measures of the dependent variables or outcomes after the programme or policy has been introduced. Pre- and post-test data are collected for a comparison group and programme group at the same points in time.
The object of the comparison group is to help the evaluator interpret estimated programme impacts by providing a more convincing estimate of the counterfactual than that obtained through a single group design. The designation 'non-equivalent' means that the comparison group can be selected in any number of ways, with the exception that access to the programme cannot be determined through random allocation, which as we have seen ensures statistical equivalence between programme and control groups.
Collecting pre-test data for both programme and comparison group allows the evaluator to examine whether the two research groups differ prior to the introduction of the policy or programme being tested in terms of pre-test values. Random assignment ensures statistical equivalence at baseline in terms of both measured and unmeasured factors. With an NECG design, no such assurance exists and thus it is important to explore the extent to which programme and comparison groups might differ. Such differences might not have been measured, and may indeed not be measurable.
Where differences between units in the programme and comparison groups do exist, be they observed or unobserved, and these differences are statistically related to the outcomes of interest, a selection problem is said to exist. A selection problem can interact with other threats to internal validity - for example a selection-history threat, or a selection-maturation threat, or selection-regression to the mean.
In the main, NECG designs suffer from threats to statistical validity, construct validity and external validity similar to those faced by social experiments.
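One common way of analysing pre-test and post-test data from a programme group and a comparison group is to subtract the comparison group's change from the programme group's change, an approach often called difference-in-differences. The text above does not prescribe this estimator, so the sketch below is offered only as an illustration, with hypothetical data in which the two groups start at different levels but share the same background change.

```python
from statistics import mean


def change(pre, post):
    """Post-test mean minus pre-test mean for one group."""
    return mean(post) - mean(pre)


def necg_estimate(prog_pre, prog_post, comp_pre, comp_post):
    """Programme group's change net of the comparison group's change."""
    return change(prog_pre, prog_post) - change(comp_pre, comp_post)


TRUE_EFFECT = 2.0        # assumed programme impact
BACKGROUND_CHANGE = 1.5  # assumed change affecting BOTH groups equally

# The groups are non-equivalent: they start at different levels.
prog_pre = [10.0, 12.0, 14.0]
comp_pre = [20.0, 22.0, 24.0]
prog_post = [y + BACKGROUND_CHANGE + TRUE_EFFECT for y in prog_pre]
comp_post = [y + BACKGROUND_CHANGE for y in comp_pre]

estimate = necg_estimate(prog_pre, prog_post, comp_pre, comp_post)  # 2.0
```

The estimate recovers the assumed effect only because both groups were given the same background change. If the groups were changing at different rates (a selection-maturation threat), the same calculation would be biased.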
Interrupted time series designs investigate repeated observations of a constant variable over time and look for 'interruptions' to the series or sequence of observations (see Figure 7.4). Such interruptions might be attributable to an intervention, though they could also be a random blip in the series of observations (not likely in the example in Figure 7.4 where the interruption was continuous). There are different types of change to a time series sequence of observations: changes in the level and in the slope of the curve, the degree of permanency of the effect (as in Figure 7.4), and the type of impact (immediate or delayed).
In order to attribute the interruption in the time series of observations to a particular intervention it is important to know the specific point (e.g. the date) when the intervention was introduced. This must, of course, have been before the interruption in the observations if causality is to be inferred. If this is the case it is then necessary to consider, and rule out, any other reasonable explanations for why the interruption occurred. In the case of the clear interruption to the time series data on traffic accident fatalities in Figure 7.4 this would include ruling out, for instance, that there had been a fuel shortage after year 6; or that a major new tax had been introduced on road usage (or petrol); or that there had been a national alcoholic beverages strike; or that the price of alcoholic beverages had increased significantly, and so on. Assuming that none of these alternative explanations are accepted, we can infer that the noticeable reduction in road traffic accident fatalities was almost certainly attributable to the introduction of the Road Traffic Act.
It is also important when working with interrupted time series designs to establish that the variable(s) being measured are defined and counted consistently over time. Where definitions and counting practices change frequently over time (e.g. unemployment statistics) it is much more difficult, and sometimes impossible, to use such data as valid measurements, or to attach any causal significance to an interruption in the time series.
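A change in the level of a time series can be estimated in a simple way by fitting a trend line to the observations before the intervention, projecting it forward, and comparing the projection with what was actually observed. The sketch below uses invented annual figures, not the data behind Figure 7.4, and assumes the intervention takes effect after year 6. It is one simple approach among several and does not deal with autocorrelation or seasonality.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope


def level_change(times, values, intervention_time):
    """Mean gap between observed values and the projected pre-trend."""
    pre = [(t, v) for t, v in zip(times, values) if t <= intervention_time]
    post = [(t, v) for t, v in zip(times, values) if t > intervention_time]
    intercept, slope = fit_line([t for t, _ in pre], [v for _, v in pre])
    gaps = [v - (intercept + slope * t) for t, v in post]
    return sum(gaps) / len(gaps)


# Hypothetical annual fatalities: a steady decline of 10 a year, then a
# sustained drop of 100 following an intervention after year 6.
years = list(range(1, 11))
fatalities = [1000 - 10 * y - (100 if y > 6 else 0) for y in years]

estimated_change = level_change(years, fatalities, intervention_time=6)  # -100.0
```

Projecting the pre-intervention trend matters: a simple before and after comparison of means would attribute the existing downward trend to the intervention as well.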
Statistical matching designs are similar to the before and after non-equivalent control group method outlined above. Instead of finding a group of units whom we assume by virtue of some aggregate measure are a good match for the group receiving a treatment or intervention, we construct a comparison group through one-to-one matching. In other words, we attempt to find a control that matches a treated individual on the basis of what we observe about them. In short, matching attempts to 're-establish the conditions of an experiment [RCT/random assignment] when no such data are available' (Blundell and Costa Dias 2000: 444).
The basic idea with all forms of statistical matching is that the comparison group is so closely matched to the treatment group that the only difference between the two groups is the impact of the programme. Following on from this, the impact of the programme or intervention can be deduced from simple comparisons of means or proportions on the outcome variable (post-test) given samples of an adequate size.
It is worth pointing out that in almost all cases impact estimates from statistical matching methods are likely to contain bias when compared to results from an RCT applied in the same context (LaLonde 1986; Lipsey and Wilson 1993; Heckman, Ichimura and Todd 1997; Blundell and Costa Dias 2000). Some methods of matching (e.g. Propensity Score Matching), however, have proven to be better than others in replicating results from evaluations where random assignment has been adopted (Heckman, Ichimura and Todd 1997; Blundell and Costa Dias 2000).
Generally, for statistical matching to be successful, observations are required on a wide range of variables that are known from the research literature to be statistically related both to the likelihood that a unit will choose treatment and to the outcome measure. The main limitation is that the evaluator can only control for those variables that he/she has measured in advance.
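The idea of one-to-one matching can be sketched with a very simple nearest-neighbour rule on a single observed characteristic. This is not Propensity Score Matching, which would first model the probability of treatment from many variables. The records, the field names `x` (observed characteristic) and `y` (post-test outcome), and the decision to match with replacement are all assumptions made for illustration.

```python
from statistics import mean


def nearest_match(treated_x, pool):
    """Return the comparison unit whose characteristic is closest."""
    return min(pool, key=lambda unit: abs(unit["x"] - treated_x))


def matched_impact(treated, pool):
    """Mean outcome difference between treated units and their matches."""
    diffs = [t["y"] - nearest_match(t["x"], pool)["y"] for t in treated]
    return mean(diffs)


# Hypothetical units: x is an observed characteristic, y the outcome.
treated = [{"x": 20, "y": 15.0}, {"x": 40, "y": 25.0}]
pool = [{"x": 21, "y": 12.0}, {"x": 39, "y": 22.0}, {"x": 60, "y": 40.0}]

matched = matched_impact(treated, pool)  # 3.0
# Unmatched comparison of means, distorted by the dissimilar x=60 unit.
naive = mean(t["y"] for t in treated) - mean(u["y"] for u in pool)
```

In this invented example the unmatched comparison even has the wrong sign, because the comparison pool contains a unit unlike any treated unit. Matching removes that distortion, but only for characteristics that have actually been observed.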
Regression Discontinuity Design (RD design) is a method that has been developed relatively recently but it is already established as the quasi-experiment that comes closest to an experimental design in eliminating selection bias (Mosteller, 1990). It has been described as 'one of the strongest methodological alternatives to randomized experiments when one is interested in studying social programs' (Trochim, 1984).
The RD design is a type of before-and-after two group design. In other words all persons in the study are assessed before they receive treatment and are then re-assessed after receiving treatment. There are always two groups: an intervention group and a comparison group. In these respects the RD design is not unique. There are plenty of other designs which could be called before-and-after-two-group designs. However, the unique feature of the RD design is the allocation process, whereby study participants are allocated to the control and treatment conditions solely on the basis of a cut-off score on a pre-programme measure. In other forms of quasi-experiment the allocation process is not controlled, and the treatment and comparison groups are self-selected. It is this feature that makes the RD design so much more robust than other forms of quasi-experiment.
In the RD design the only bias between the treatment and comparison groups is the difference in scores on the pre-programme measure. No other variable influences the selection process. This is not the case in other forms of quasi-experiment where only a limited number of variables are controlled and where an infinite number of unknown variables could influence the results of the evaluation. Moreover, in the RD design the source of the selection bias is not only known but it has been quantified by the pre-programme measure. This therefore allows for it to be controlled for by the use of regression analysis.
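The way regression analysis controls for the known selection rule can be sketched as follows. This minimal example fits a separate straight line on each side of the cut-off and compares the two predictions at the cut-off itself, which is one common way of estimating the effect in an RD design. The data are invented, the relationship is assumed to be linear, and units scoring below the cut-off are assumed to receive the programme.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope


def rd_estimate(pre_scores, post_scores, cutoff):
    """Gap between the two fitted lines, evaluated at the cut-off score."""
    below = [(x, y) for x, y in zip(pre_scores, post_scores) if x < cutoff]
    above = [(x, y) for x, y in zip(pre_scores, post_scores) if x >= cutoff]
    a_below, b_below = fit_line([x for x, _ in below], [y for _, y in below])
    a_above, b_above = fit_line([x for x, _ in above], [y for _, y in above])
    return (a_below + b_below * cutoff) - (a_above + b_above * cutoff)


CUTOFF = 10        # hypothetical cut-off on the pre-programme measure
TRUE_EFFECT = 4.0  # assumed programme impact

pre_scores = list(range(20))
# Units scoring below the cut-off receive the programme (assumption).
post_scores = [5.0 + 0.8 * x + (TRUE_EFFECT if x < CUTOFF else 0.0)
               for x in pre_scores]

estimate = rd_estimate(pre_scores, post_scores, CUTOFF)  # close to 4.0

# A raw comparison of group means ignores the selection rule entirely.
treated_mean = sum(y for x, y in zip(pre_scores, post_scores) if x < CUTOFF) / 10
comparison_mean = sum(y for x, y in zip(pre_scores, post_scores) if x >= CUTOFF) / 10
naive = treated_mean - comparison_mean  # close to -4.0
```

The raw comparison of means is badly misleading because the treated group was selected for having low pre-programme scores, whereas modelling the pre-programme measure recovers the assumed effect.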
Social experiments (including quasi-experiments) are often criticised from an epistemological perspective, in that as a method, experimentation is commonly associated with 'positivism'. Positivism holds that the purpose of scientific enquiry is to test and predict the phenomena experienced in the social world, and that science should only be concerned with what can be measured. In a 'positivist' world, one cannot observe the process of cause but simply measure the consequences or outcomes of causal processes. According to the positivist perspective, the objective of the scientific process is to be able to predict and control the material world, and by extension in the social sciences, the social/political world.
Positivism is seen by some critics as a rather crude and naïve account of the scientific process, and it has been superseded, most prominently by interpretivism, phenomenology, critical realism and post-modernism in the social sciences. A common argument of these critics is that science can never completely account for the nature of reality and that all scientific measurement is subject to various forms of error. Consequently all scientific theory, and all knowledge, is revisable and can be deconstructed and reconstituted. The idea that an evaluator can be 'objective' about the social world is rejected. Instead, individual evaluators must compare and contrast multiple accounts and attempt to triangulate their findings with those of others.
One of the more recent and influential critiques of random allocation comes from the 'realist' perspective outlined by Pawson and Tilley (1997). They characterise the experimental method as being based on a 'successionist' conceptualisation of cause (similar to what Shadish, Cook and Campbell call causal description), whereby evaluators simply implement random allocation methods and neglect to study the processes or causal mechanism yielding the measured outcomes and impacts.
Pawson and Tilley contrast the crude simplicity of experimentation with the subtler, more nuanced conceptualisation of cause expounded in 'generative' theories of causation (similar to Shadish, Cook and Campbell's concept of causal explanation).
In the view of Pawson and Tilley experimentation neglects the vital task of studying transformation directly and accounting for all the processes that bring about change. Pawson and Tilley are concerned with elucidating the context, mechanisms and outcomes (regularities) of policies and programmes.
Part of the Pawson and Tilley critique also highlights the practical difficulties associated with implementing social experiments and the claimed inconsistencies and variability of findings from them (Heckman and Smith, 1995, also make this point).
Although Pawson and Tilley (1997) and many others have extreme misgivings concerning the experimental method, such approaches retain currency and usefulness. The reason for this is that the experimental approach, although inappropriate in some circumstances and never sufficient on its own, continues to provide policy makers, particularly those who have to make decisions regarding the allocation of resources, with the types of information they find useful. Few if any policy makers or practitioners would doubt that knowledge is contingent, ephemeral, and revisable in light of interpretation and further analysis. Nonetheless, most would be content to work with reasonably stable understandings of the social world in order to plan and predict within certain parameters of risk.
Furthermore, there are international examples in which the experimental method of evaluation, used within a multi-method approach, has been shown to provide cumulative, reliable findings that have been influential in the development of policy (Greenberg and Mandell, 1990).
Social experiments are not, and should never be, used as the sole source of evidence in evaluating a programme or policy. Experiments need to be conducted alongside a thorough process study, which explicitly seeks to understand the context and causal processes being evaluated.
This chapter has considered experimental and quasi-experimental methods for evaluating the effect of policies and programmes, and has detailed the advantages and disadvantages associated with these methods.
Social experiments essentially test whether a programme or policy has led to change in the outcomes that the programme or policy was designed to affect, over and above that which would have occurred in the absence of the programme or policy.
Social experiments use random allocation and so, when properly implemented with a sufficiently large sample, yield unbiased estimates of the programme or policy's impact. In other words, they provide an estimate of impact that is entirely attributable to the programme or policy itself, rather than some other factor(s).
In contrast, quasi-experiments are unable to eliminate the possibility that some extraneous variable may account for the apparent impact of the programme. This is not to say that quasi-experimental methods should be ignored. Indeed, there are circumstances in which, for good practical and ethical reasons, it is not possible to use random allocation. In these instances a well-implemented quasi-experiment, using a regression discontinuity design or a matched comparison design with propensity score matching (PSM), provides a robust alternative.
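The contrast drawn above between random allocation and non-random comparison can be illustrated with a small simulation. The sketch below is illustrative only and is not drawn from any real evaluation: the outcome model, the unobserved 'motivation' characteristic, the true effect of 2.0 and the function name simulate are all assumptions made for the example. It compares the simple difference in mean outcomes between programme and control groups when units are allocated by a coin flip with the same difference when more motivated units select themselves into the programme.

```python
import random
import statistics


def simulate(n=20000, true_effect=2.0, seed=0):
    """Return (randomised estimate, self-selected estimate) of programme impact.

    Hypothetical data-generating process, assumed purely for illustration.
    """
    rng = random.Random(seed)

    # Unobserved characteristic (e.g. motivation) that raises outcomes
    # regardless of whether the unit receives the programme.
    motivation = [rng.gauss(0, 1) for _ in range(n)]

    def outcome(m, treated):
        # Outcome = baseline + effect of motivation + programme effect + noise.
        return 10 + 3 * m + (true_effect if treated else 0.0) + rng.gauss(0, 1)

    # Random allocation: membership is independent of motivation.
    randomised = [rng.random() < 0.5 for _ in motivation]

    # Self-selection: more motivated units are more likely to take part.
    self_selected = [m + rng.gauss(0, 1) > 0 for m in motivation]

    def difference_in_means(assignment):
        # Mean outcome in the programme group minus mean outcome in the
        # control group (the quantity written as Yp - Yc in the text).
        y = [outcome(m, t) for m, t in zip(motivation, assignment)]
        y_p = [v for v, t in zip(y, assignment) if t]
        y_c = [v for v, t in zip(y, assignment) if not t]
        return statistics.mean(y_p) - statistics.mean(y_c)

    return difference_in_means(randomised), difference_in_means(self_selected)


randomised_estimate, self_selected_estimate = simulate()
print(f"Random allocation estimate: {randomised_estimate:.2f}")
print(f"Self-selection estimate:    {self_selected_estimate:.2f}")
```

Under these assumptions the randomised estimate should fall close to the true effect of 2.0, while the self-selected comparison overstates it because the difference in motivation between the two groups is wrongly attributed to the programme. Quasi-experimental designs such as matching or regression discontinuity aim to reduce this kind of bias, but the sketch does not implement them.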
The experimental method is not without its critics (see Pawson and Tilley, 1997). However, a structured and properly implemented design, when combined with a thorough process study, can overcome many of these criticisms and provide robust data on the effect of a policy or programme.
Baker, J. L. (2000). Evaluating the Impact of Development Projects on Poverty: A Handbook for Practitioners. Directions in Development. The World Bank. Washington, D.C.
Berk, R. A. and Rauma, D. (1983) Capitalizing on Nonrandom Assignment to Treatments: A Regression-Discontinuity Evaluation of a Crime-Control Program, Journal of the American Statistical Association, 78(381), 21-27.
Bloom, H. S. (1995) Minimum Detectable Effects: A Simple Way to Report the Statistical Power of Experimental Designs. Evaluation Review, 19, 547-556.
Blundell, R. and Costa Dias, M. (2000) Evaluation Methods for Non-experimental Data, Fiscal Studies, 21(4).
Bonate, P. L. (2000) Analysis of Pretest-Posttest Designs (Boca Raton: Chapman & Hall/CRC).
Boruch, R. F. (1997) Randomized Experiments for Planning and Evaluation: A Practical Guide (Thousand Oaks: Sage Publications).
Bryson, A., Dorsett, R. and Purdon, S. (2002) The Use of Propensity Score Matching in the Evaluation of Active Labour Market Policies, Department for Work and Pensions Working Paper No. 4 (London: HMSO).
Burtless, G. (1995) The Case for Randomized Field Trials in Economic and Policy Research, Journal of Economic Perspectives, 9, 63-84.
Burtless, G. and Orr, L. L. (1985) Are Classical Experiments Needed for Manpower Policy? The Journal of Human Resources, 21, 607-639.
Campbell, D. T. and Kenny, D. A. (1999) A Primer on Regression Artifacts (New York: The Guilford Press).
Campbell, D. T. and Stanley, J. C. (1963) Experimental and Quasi-experimental Designs for Research (Boston: Houghton Mifflin Company).
Clarke, M. and Oxman, A. D. (eds.) (1999) Cochrane Reviewer's Handbook 4.0 (Oxford: The Cochrane Collaboration).
Cochran, W. and Rubin, D. (1973) Controlling Bias in Observational Studies: A Review, Sankhya, Series A, 35, 417-46.
Cook, T. D. (2002) Randomized Experiments in Educational Policy Research: A Critical Examination of the Reasons the Educational Evaluation Community has Offered for not Doing Them, Educational Evaluation and Policy Analysis, 24, 175-199
Cook, T. D. and Campbell, D. T. (1979) Quasi-experimental Design & Analysis Issues for Field Settings, (Boston: Houghton Mifflin Company).
Greenberg, D. H. and Mandell, M. B. (1990) Research Utilization in Policymaking: A Tale of Two Series (of Social Experiments), Discussion Paper No. 925-990 (Institute for Research on Poverty: Madison).
Greene, W. H. (1997) Econometric Analysis (Hemel Hempstead, New Jersey: Prentice Hall Press).
Greenberg, D. H. and Shroder M. (1997) The Digest of Social Experiments (Washington: Urban Institute).
Heckman, J. J. (1979) Sample Selection Bias as a Specification Error, Econometrica, 47, 153-161
Heckman, J. J. and Smith, J. A. (1995) Assessing the Case for Social Experiments, Journal of Economic Perspectives, 9, 85-110.
Heckman, J., Ichimura, H. and Todd, P. (1997) 'Matching as an econometric evaluation estimator: evidence from evaluating a job training programme' Review of Economic Studies 64: 605-654.
LaLonde, R. (1986) 'Evaluating the Econometric Evaluations of Training Programs with Experimental Data', American Economic Review, 76,(4), 604-620
Morris, S., Greenberg, D., Riccio, J., Mittra, B., Green, H., Lissenburgh, S. and Blundell, R. (2003) Designing a Demonstration Project - An Employment, Retention and Advancement Demonstration for Great Britain, Occasional Paper No.1 (London: Strategy Unit, Cabinet Office).
Oakes, M. J. and Feldman, H. A. (2001) Statistical Power for Nonequivalent Pretest-Posttest Designs, Evaluation Review, 25, 3-28.
Orr, L. L. (1999) Social Experiments: Evaluating Public Programs with Experimental Methods (Thousand Oaks: Sage Publications).
Pawson, R. and Tilley, N. (1997) Realistic Evaluation (London: Sage Publications).
Purdon, S (2002) Estimating the Impact of Labour Market Programmes, Department for Work and Pensions Working Paper No. 3 (London: HMSO).
Reichardt, C. S. (1979) The Statistical Analysis of Data from Nonequivalent Group Designs, in T. D. Cook and D. T. Campbell (eds.) Quasi-experimental Design & Analysis Issues for Field Settings, (Boston: Houghton Mifflin Company) 147-205.
Rosenbaum, P. R. and Rubin, D. B. (1983) The Central Role of the Propensity Score in Observational Studies for Causal Effects, Biometrika, 70, 41-55.
Rossi, P. H., Freeman, H. E. and Lipsey, M. W. (1999) Evaluation: A Systematic Approach (Thousand Oaks: Sage Publications).
Scriven, M. (1991) Evaluation Thesaurus: Fourth Edition (Newbury Park: Sage Publications).
Shadish, W. R., Cook, T. D. and Campbell, D. T. (2002) Experimental and Quasi-experimental Designs for Generalized Causal Inference, (Boston: Houghton Mifflin Company)
Smith, A. S., Youngs, R., Ashworth, K., McKay, S., Walker, R., Elias, P. and McKnight, A. (2000) Understanding the Impact of Jobseeker's Allowance, Department of Social Security Research Report No. 111 (Leeds: Corporate Documents Services).
Trochim, W. M. (2000) The Research Methods Knowledge Base, 2nd Edition. Internet WWW page, at URL: <http://trochim.human.cornell.edu/kb/index.htm> (version current as of August 02, 2000).
Weiss, C. H. (1998) Evaluation (Upper Saddle River: Prentice-Hall).
Winship, C. and Morgan, S. L. (1999) The Estimation of Causal Effects from Observational Data, Annual Review of Sociology, 25, 659-707.