Experimental vs Quasi-Experimental Design: Which to Choose?
Here’s a table that summarizes the similarities and differences between an experimental and a quasi-experimental study design:

| | Experimental design | Quasi-experimental design |
|---|---|---|
| Goal | Evaluate the effect of an intervention | Evaluate the effect of an intervention |
| Assignment to the intervention | Randomized | Not randomized (participants’ or researcher’s choice, or another non-random method) |
| Control group | Required | Optional (but provides a higher level of evidence if present) |
| Confounding | Removed by randomization | Must be addressed through careful design and statistical adjustment |
What is a quasi-experimental design?
A quasi-experimental design is a non-randomized study design used to evaluate the effect of an intervention. The intervention can be a training program, a policy change or a medical treatment.
Unlike a true experiment, in a quasi-experimental study the choice of who gets the intervention and who doesn’t is not randomized. Instead, the intervention may be assigned according to the participants’ own choice, the researcher’s judgment, or any other non-random method.
Having a control group is not required, but if present, it provides a higher level of evidence for the relationship between the intervention and the outcome.
(For more information, I recommend my other article: Understand Quasi-Experimental Design Through an Example.)
Examples of quasi-experimental designs include:
- One-Group Posttest Only Design
- Static-Group Comparison Design
- One-Group Pretest-Posttest Design
- Separate-Sample Pretest-Posttest Design
What is an experimental design?
An experimental design is a randomized study design used to evaluate the effect of an intervention. In its simplest form, the participants will be randomly divided into 2 groups:
- A treatment group: where participants receive the new intervention whose effect we want to study.
- A control or comparison group: where participants do not receive any intervention at all (or receive some standard intervention).
Randomization ensures that each participant has the same chance of receiving the intervention. Its objective is to make the 2 groups comparable, so that any difference in the study outcome observed afterwards can be attributed to the intervention alone – i.e. it removes confounding.
(For more information, I recommend my other article: Purpose and Limitations of Random Assignment.)
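To make random assignment concrete, here is a minimal Python sketch; the participant count, seed and group sizes are arbitrary choices for illustration:

```python
# Simple random assignment of participants to 2 groups (illustrative only).
import numpy as np

rng = np.random.default_rng(seed=42)          # fixed seed so the allocation is reproducible

participant_ids = np.arange(1, 21)            # 20 hypothetical participants
shuffled = rng.permutation(participant_ids)   # every ordering is equally likely

treatment_group = np.sort(shuffled[:10])      # first half receives the intervention
control_group = np.sort(shuffled[10:])        # second half receives the standard (or no) intervention

print("Treatment group:", treatment_group)
print("Control group:  ", control_group)
```

Because every participant is equally likely to end up in either group, measured and unmeasured characteristics are expected to balance out on average across the 2 groups.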
Examples of experimental designs include:
- Posttest-Only Control Group Design
- Pretest-Posttest Control Group Design
- Solomon Four-Group Design
- Matched Pairs Design
- Randomized Block Design
When to choose an experimental design over a quasi-experimental design?
Although many statistical techniques can be used to deal with confounding in a quasi-experimental study, in practice, randomization is still the best tool we have to study causal relationships.
Another problem with quasi-experiments is the natural progression of the disease or condition under study: when studying the effect of an intervention over time, natural changes must be taken into account because they can be mistaken for changes in the outcome caused by the intervention. A well-chosen control group helps deal with this issue.
So, if losing the element of randomness seems like an unwise step down in the hierarchy of evidence, why would we ever want to do it?
This is what we’re going to discuss next.
When to choose a quasi-experimental design over a true experiment?
The issue with randomization is that it is not always achievable.
So here are some cases where using a quasi-experimental design makes more sense than using an experimental one:
- If being in one group is believed to be harmful to the participants, either because the intervention itself is harmful (e.g. randomizing people to smoking) or has questionable efficacy, or, on the contrary, because it is believed to be so beneficial that it would be unethical to withhold it from the control group (e.g. randomizing people to receive, or not receive, an operation).
- In cases where the intervention acts on a group of people in a given location, it becomes difficult to adequately randomize subjects (e.g. an intervention that reduces pollution in a given area).
- When working with small sample sizes, as randomized controlled trials require a large sample size to account for heterogeneity among subjects (i.e. to evenly distribute confounding variables between the intervention and control groups).
Further reading
- Statistical Software Popularity in 40,582 Research Papers
- Checking the Popularity of 125 Statistical Tests and Models
- Objectives of Epidemiology (With Examples)
- 12 Famous Epidemiologists and Why
Chapter 24: Including non-randomized studies on intervention effects
Barnaby C Reeves, Jonathan J Deeks, Julian PT Higgins, Beverley Shea, Peter Tugwell, George A Wells; on behalf of the Cochrane Non-Randomized Studies of Interventions Methods Group
Key Points:
- For some Cochrane Reviews, the question of interest cannot be answered by randomized trials, and review authors may be justified in including non-randomized studies.
- Potential biases are likely to be greater for non-randomized studies compared with randomized trials when evaluating the effects of interventions, so results should always be interpreted with caution when they are included in reviews and meta-analyses.
- Non-randomized studies of interventions vary in their ability to estimate a causal effect; key design features of studies can distinguish ‘strong’ from ‘weak’ studies.
- Biases affecting non-randomized studies of interventions vary depending on the features of the studies.
- We recommend that eligibility criteria, data collection and assessment of included studies place an emphasis on specific features of study design (e.g. which parts of the study were prospectively designed) rather than ‘labels’ for study designs (such as case-control versus cohort).
- Review authors should consider how potential confounders, and how the likelihood of increased heterogeneity resulting from residual confounding and from other biases that vary across studies, are addressed in meta-analyses of non-randomized studies.
Cite this chapter as: Reeves BC, Deeks JJ, Higgins JPT, Shea B, Tugwell P, Wells GA. Chapter 24: Including non-randomized studies on intervention effects [last updated October 2019]. In: Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch VA (editors). Cochrane Handbook for Systematic Reviews of Interventions version 6.5. Cochrane, 2024. Available from www.training.cochrane.org/handbook .
24.1 Introduction
This chapter aims to support review authors who are considering including non-randomized studies of interventions (NRSI) in a Cochrane Review. NRSI are defined here as any quantitative study estimating the effectiveness of an intervention (harm or benefit) that does not use randomization to allocate units (individuals or clusters of individuals) to intervention groups. Such studies include those in which allocation occurs in the course of usual treatment decisions or according to peoples’ choices (i.e. studies often called observational ). (The term observational is used in various ways and, therefore, we discourage its use with respect to NRSI studies; see Box 24.2.a and Section 24.2.1.3 .) Review authors have a duty to patients, practitioners and policy makers to do their best to provide these groups with a summary of available evidence balancing harms against benefits, albeit qualified with a certainty assessment. Some of this evidence, especially about harms of interventions, will often need to come from NRSI.
NRSI are used by researchers to evaluate numerous types of interventions, ranging from drugs and hospital procedures, through diverse community health interventions, to health systems implemented at a national level. There are many types of NRSI. Common labels attached to them include cohort studies, case-control studies, controlled before-and-after studies and interrupted-time-series studies (see Section 24.5.1 for a discussion of why these labels are not always clear and can be problematic). We also consider controlled trials that use inappropriate strategies of allocating interventions (sometimes called quasi-randomized studies), and specific types of analysis of non-randomized data, such as instrumental variable analysis and regression discontinuity analysis, to be NRSI. We prefer to characterize NRSI with respect to specific study design features (see Section 24.2.2 and Box 24.2.a ) rather than study design labels. A mapping of features to some commonly used study design labels can be found in Reeves and colleagues (Reeves et al 2017).
Including NRSI in a Cochrane Review allows, in principle, the inclusion of non-randomized studies in which the use of an intervention occurs in the course of usual health care or daily life. These include interventions that a study participant chooses to take (e.g. an over-the-counter preparation or a health education session). Such studies also allow exposures to be studied that are not obviously ‘interventions’, such as nutritional choices, and other behaviours that may affect health. This introduces a grey area between evidence about effectiveness and aetiology.
An intervention review needs to distinguish carefully between aetiological and effectiveness research questions related to a particular exposure. For example, nutritionists may be interested in the health-related effects of a diet that includes a minimum of five portions of fruit or vegetables per day (‘five-a-day’), an aetiological question. On the other hand, public health professionals may be interested in the health-related effects of interventions to promote a change in diet to include ‘five-a-day’, an effectiveness question. NRSI addressing the former type of question are often perceived as being more direct than randomized trials because of other differences between studies addressing these two kinds of question (e.g. compared with the randomized trials, NRSI of health behaviours may be able to investigate longer durations of follow-up and outcomes than become apparent in the short term). However, it is important to appreciate that they are addressing fundamentally different research questions. Cochrane Reviews target effects of interventions, and interventions have a defined start time.
This chapter has been prepared by the Cochrane Non-Randomized Studies of Interventions Methods Group (NRSMG). It aims to describe the particular challenges that arise if NRSI are included in a Cochrane Review. Where evidence or established theory indicates a suitable strategy, we propose this strategy; where it does not, we sometimes offer our recommendations about what to do. Where we do not make any recommendations, we aim to set out the pros and cons of alternative actions and to identify questions for further methodological research.
Review authors who are considering including NRSI in a Cochrane Review should not start with this chapter unless they are already familiar with the process of preparing a systematic review of randomized trials. The format and basic steps of a Cochrane Review should be the same irrespective of the types of study included. The reader is referred to Chapters 1 to 15 of the Handbook for a detailed description of these steps. Every step in carrying out a systematic review is more difficult when NRSI are included and the review team should include one or more people with expert knowledge of the subject and of NRSI methods.
24.1.1 Why consider non-randomized studies of interventions?
Cochrane Reviews of interventions have traditionally focused mainly on systematic reviews of randomized trials because they are more likely to provide unbiased information about the differential effects of alternative health interventions than NRSI. Reviews of NRSI are generally undertaken when the question of interest cannot be answered by a review of randomized trials. Broadly, we consider that there are two main justifications for including NRSI in a systematic review, covered by the flow diagram shown in Figure 24.1.a :
- To provide evidence of the effects (benefit or harm) of interventions that can feasibly be studied in randomized trials, but for which available randomized trials address the review question indirectly or incompletely (an element of the GRADE approach to assessing the certainty of the evidence, see Chapter 14, Section 14.2 ) (Schünemann et al 2013). Such non-randomized evidence might address, for example, long-term or rare outcomes, different populations or settings, or ways of delivering interventions that better match the review question.
- To provide evidence of the effects (benefit or harm) of interventions that cannot be randomized, or that are extremely unlikely to be studied in randomized trials. Such non-randomized evidence might address, for example, population-level interventions (e.g. the effects of legislation (Macpherson and Spinks 2008)) or interventions about which prospective study participants are likely to have strong preferences, preventing randomization (Li et al 2016).
A third justification for including NRSI in a systematic review is reasonable, but is unlikely to be a strong reason in the context of a Cochrane Review:
- To examine the case for undertaking a randomized trial by providing an explicit evaluation of the weaknesses of available NRSI. The findings of a review of NRSI may also be useful to inform the design of a subsequent randomized trial (e.g. through the identification of relevant subgroups).
Two other reasons sometimes described for including NRSI in systematic reviews are:
- When an intervention effect is very large.
- To provide evidence of the effects (benefit or harm) of interventions that can feasibly be studied in randomized trials, but for which only a small number of randomized trials is available (or likely to be available).
We urge caution in invoking either of these justifications. Reason 4, that an effect is large, is implicitly a result-driven or post-hoc argument, since some evidence or opinion would need to be available to inform the judgement about the likely size of the effect. Whilst it can be argued that large effects are less likely to be completely explained by bias than small effects (Glasziou et al 2007), clinical and economic decisions still need to be informed by unbiased estimates of the magnitude of these large effects (Reeves 2006). Randomized trials are the appropriate design to quantify large effects (and the trials need not be large if the effects are truly large). Of course, there may be ethical opposition to randomized trials of interventions already suspected to be associated with a large benefit, making it difficult to randomize participants, and interventions postulated to have large effects may also be difficult to randomize for other reasons (e.g. surgery versus no surgery). However, the justification for a systematic review including NRSI in these circumstances can be classified as reason 2 above (i.e. interventions that are unlikely to be randomized).
The appropriateness of reason 5 depends to a large extent on expectations of how the review will be used in practice. Most Cochrane Reviews seek to identify highly trustworthy evidence (typically only randomized trials) and if none is found then the review can be published as an ‘empty review’. However, as Cochrane Reviews also seek to inform clinical and policy decisions, it can be necessary to draw on the ‘best available’ evidence rather than the ‘highest tier’ of evidence for questions that have a high priority. While acknowledging the priority to inform decisions, it remains important that the challenges associated with appraising, synthesizing and interpreting evidence from NRSI, as discussed in the remainder of this chapter, are well-appreciated and addressed in this situation. See also Section 24.2.1.3 for further discussion of these issues. Reason 5 is a less appropriate justification in a review that is not a priority topic where there is a paucity of evidence from randomized trials alone; in such instances, the potential of NRSI to inform the review question directly and without a critical risk of bias is paramount.
Review authors may need to apply different eligibility criteria in order to answer different review questions about harms as well as benefits ( Chapter 19, Section 19.2.2 ). In some reviews the situation may be still more complex, since NRSI specified to answer questions about benefits may have different design features from NRSI specified to answer questions about harms (see Section 24.2 ). A further complexity arises in relation to the specification of eligible NRSI in the protocol and the desire to avoid an empty review (depending on the justification for including NRSI).
Whenever review authors decide that NRSI are required to answer one or more review questions, the review protocol must specify appropriate methods for reviewing NRSI. If a review aims to include both randomized trials and NRSI, the protocol must specify methods appropriate for both. Since methods for reviewing NRSI can be complex, we recommend that review authors scope the available NRSI evidence , after registering a title but in advance of writing a protocol, allowing review authors to check that relevant NRSI exist and to specify NRSI with the most appropriate study design features in the protocol (Reeves et al 2013). If the registered title is broadly conceived, this may require detailed review questions to be formulated in advance of scoping: these are the PICOs for each synthesis as discussed in Chapter 3, Section 3.2 . Scoping also allows the directness of the available evidence to be assessed against specific review questions (see Figure 24.1.a ). Basing protocol decisions on scoping creates a small risk that different kinds of studies are found to be necessary at a later stage to answer the review questions. In such instances, we recommend completing the review as specified and including other studies in a planned update, to allow timelines for the completion of a review to be set.
An alternative approach is to write a protocol that describes the review methods to be used for both randomized trials and NRSI (and all types of NRSI) and to specify the study design features of eligible NRSI after carrying out searches for both types of study. We recommend against this approach in a Cochrane Review, largely to minimize the work required to write the protocol, carry out searches and examine study reports, and to allow timelines for the completion of a review to be set.
Figure 24.1.a Algorithm to decide whether a review should include non-randomized studies of an intervention or not
24.1.2 Key issues about the inclusion of non-randomized studies of interventions in a Cochrane Review
Randomized trials are the preferred design for studying the effects of healthcare interventions because, in most circumstances, a high-quality randomized trial is the study design that is least likely to be biased. All Cochrane Reviews must consider the risk of bias in individual primary studies, whether randomized trials or NRSI (see Chapter 7 , Chapter 8 and Chapter 25 ). Some biases apply to both randomized trials and NRSI. However, some biases are specific (or particularly important) to NRSI, such as biases due to confounding or selection of participants into the study (see Chapter 25 ). The key advantage of a high-quality randomized trial is its ability to estimate the causal relationship between an experimental intervention (relative to a comparator) and outcome. Review authors will need to consider (i) the strengths of the design features of the NRSI that have been used (such as noting their potential to estimate causality, in particular by inspecting the assumptions that underpin such estimation); and (ii) the execution of the studies through a careful assessment of their risk of bias. The review team should be constituted so that it can judge suitability of the design features of included studies and implement a careful assessment of risk of bias.
Potential biases are likely to be greater for NRSI compared with randomized trials because some of the protections against bias that are available for randomized trials are not established for NRSI. Randomization is an obvious example. Randomization aims to balance prognostic factors across intervention groups, thus preventing confounding (which occurs when there are common causes of intervention group assignment and outcome). Other protections include a detailed protocol and a pre-specified statistical analysis plan which, for example, should define the primary and secondary outcomes to be studied, their derivation from measured variables, methods for managing protocol deviations and missing data, planned subgroup and sensitivity analyses and their interpretation.
24.1.3 The importance of a protocol for a Cochrane Review that includes non-randomized studies of interventions
Chapter 1 (Section 1.5) establishes the importance of writing a protocol before carrying out the review. Because the methodological choices made during a review including NRSI are complex and may affect the review findings, a protocol is even more important for such a review. The rationale for including NRSI (see Section 24.1.1 ) should be documented in the protocol. The protocol should include much more detail than for a review of randomized trials, pre-specifying key methodological decisions about the methods to be used and the analyses that are planned. The protocol needs to specify details that are not as relevant for randomized trials (e.g. potential confounding domains, important co-interventions, details of the risk-of-bias assessment and analysis of the NRSI), as well as providing more detail about standard steps in the review process that are more difficult when including NRSI (e.g. specification of eligibility criteria and the search strategy for identifying eligible studies).
We recognize that it may not be possible to pre-specify all decisions about the methods used in a review. Nevertheless, review authors should aim to make all decisions about the methods for the review without reference to the findings of primary studies, and report methodological decisions that had to be made or modified after collecting data about the study findings.
24.2 Developing criteria for including non-randomized studies of interventions
24.2.1 What is different when including non-randomized studies of interventions?
24.2.1.1 Evaluating benefits and harms
Cochrane Reviews aim to quantify the effects of healthcare interventions, both beneficial and harmful, and both expected and unexpected. The expected benefits of an intervention can often be assessed in randomized trials. Randomized trials may also report some of the harms of an intervention, either those that were expected and which a trial was designed to assess, or those that were not expected but which were collected in a trial as part of standard monitoring of safety. However, many serious harms of an intervention are rare or do not arise during the follow-up period of randomized trials, preventing randomized trials from providing high-quality evidence about these effects, even when combined in a meta-analysis (see Chapter 19 for further discussion of adverse events). Therefore, one of the most important reasons to include NRSI in a review is to assess potential unexpected or rare harms of interventions (reason 1 in Section 24.1.1 ).
Although widely accepted criteria for selecting appropriate studies for evaluating rare or long-term adverse and unexpected effects have not been established, some design features are preferred to reduce the risk of bias. In cohort studies, a preferred design feature is the ascertainment of outcomes of interest (e.g. an adverse event) from the onset of an exposure (i.e. the start of intervention); these are sometimes referred to as inception cohorts. The relative strengths and weaknesses of different study design features do not differ in principle between beneficial and harmful outcomes, but the choice of study designs to include may depend on both the frequency of an outcome and its importance. For example, for some rare or delayed adverse outcomes only case series or case-control studies may be available. NRSI with some study design features that are more susceptible to bias may be acceptable for evaluation of serious adverse events in the absence of better evidence, but the risk of bias must still be assessed and reported.
Confounding (see Chapter 25, Section 25.2.1 ) may be less of a threat to the validity of a review when researching rare harms or unexpected effects of interventions than when researching expected effects, since it may be argued that ‘confounding by indication’ mainly influences treatment decisions with respect to outcomes about which the clinicians are primarily concerned. However, confounding can never be ruled out because the same factors that are confounders for the expected effects may also be direct confounders for the unexpected effects, or be correlated with factors that are confounders.
A related issue is the need to distinguish between quantifying and detecting an effect of an intervention. Quantifying the intended benefits of an intervention – maximizing the precision of the estimate and minimizing susceptibility to bias – is critical when weighing up the relative merits of alternative interventions for the same condition. A review should also try to quantify the harms of an intervention, minimizing susceptibility to bias as far as possible. However, if a review can establish beyond reasonable doubt that an intervention causes a particular harm, the precision and susceptibility to bias of the estimated effect may not be essential. In other words, the seriousness of the harm may outweigh any benefit from the intervention. This situation is more likely to occur when there are competing interventions for a condition.
24.2.1.2 Including both randomized trials and non-randomized studies of interventions
When both randomized trials and NRSI are identified that appear to address the same underlying research question, it is important to check carefully that this is indeed the case. There are often systematic differences between randomized trials and NRSI in the PICO elements (MacLehose et al 2000), which may become apparent when considering the directness (e.g. applicability or generalizability) of the primary studies (see Chapter 14, Section 14.2.2 ).
A NRSI can be viewed as an attempt to emulate a hypothetical randomized trial answering the same question. Hernán and Robins have referred to this as a ‘target’ trial; the target trial is usually a hypothetical pragmatic randomized trial comparing the health effects of the same interventions, conducted on the same participant group and without features putting it at risk of bias (Hernán and Robins 2016). Importantly, a target randomized trial need not be feasible or ethical. This concept is the foundation of the risk-of-bias assessment for NRSI, and helps a review author to distinguish between the risk of bias in a NRSI (see Chapter 25 ) and a lack of directness of a NRSI with respect to the review question (see Chapter 14, Section 14.2.2 ). A lack of directness among randomized trials may be a motivation for including NRSI that address the review question more directly. In this situation, review authors need to recognize that discrepancies in intervention effects between randomized trials and NRSI (and, potentially, between NRSI with different study design features) may arise either from differential risk of bias or from differences in the specific PICO questions evaluated by the primary studies.
A single review may include different types of study to address different outcomes, for example, randomized trials for evaluating benefits and NRSI to evaluate harms; see Section 24.2.1.1 and Chapter 19, Section 19.2. Scoping in advance of writing a protocol should allow review authors to identify whether NRSI are required to address directly one or more of the PICO questions for a review comparison. In time, as a review is updated, the NRSI may be dropped if randomized trials addressing these questions become available.
24.2.1.3 Determining which non-randomized studies of interventions to include
A randomized trial is a prospective, experimental study design specifically involving random allocation of participants to interventions. Although there are variations in randomized trial design (see Chapter 23 ), they constitute a distinctive study category. By contrast, NRSI embrace a number of fundamentally different design principles, several of which were originally conceived in the context of aetiological epidemiology; some studies combine different principles. As we discuss in Section 24.2.2 , study design labels such as ‘cohort’ or ‘prospective study’ are not consistently applied. The diversity of NRSI designs raises two related questions. First, should all NRSI relevant to a PICO question for a planned synthesis be included in a review, irrespective of their study design features? Second, if review authors do not include all NRSI, what study design features should be used as criteria to decide which NRSI to include and which to exclude?
NRSI vary with respect to their intrinsic ability to estimate the causal effect of an intervention (Reeves et al 2017, Tugwell et al 2017). Therefore, to reach reliable conclusions, review authors should include only ‘strong’ NRSI that can estimate causality with minimal risk of bias. It is not helpful to include primary studies in a review when the results of the studies are highly likely to be biased even if there is no better evidence (except for justification 3, i.e. to examine the case for performing a randomized trial by describing the weakness of the NRSI evidence; see Section 24.1.1 ). This is because a misleading effect estimate from a systematic review may be more harmful to future patients than no estimate at all, particularly if the people using the evidence to make decisions are unaware of its limitations (Doll 1993, Peto et al 1995). Systematic reviews have a privileged status in the evidence base (Reeves et al 2013), typically sitting between primary research studies and guidelines (which frequently cite them). There may be long-term undesirable consequences of reviewing evidence when it is inadequate: an evidence synthesis may make it less likely that less biased research will be carried out in the future, increasing the risk that more poorly informed decisions will be made than would otherwise have been the case (Stampfer and Colditz 1991, Siegfried et al 2005).
There is not currently a general framework for deciding which kinds of NRSI will be used to answer a specific PICO question. One possible strategy is to limit included NRSI to those that have used a strong design (NRSI with specified design features) (Reeves et al 2017, Tugwell et al 2017). This should give reasonably valid effect estimates, subject to assessment of risk of bias. An alternative strategy is to include the best available NRSI (i.e. those with the strongest design features among those that have been carried out) to answer the PICO question. In this situation, we recommend scoping available NRSI in advance of finalizing study eligibility for a specific review question and defining eligibility with respect to study design features (Reeves et al 2017). Widespread adoption of the first strategy might result in reviews that consistently include NRSI with the same design features, but some reviews would include no studies at all. The second strategy would lead to different reviews including NRSI with different study design features according to what is available. Whichever strategy is adopted, it is important to explain the choice of included studies in the protocol. For example, review authors might be justified in using different eligibility criteria when reviewing the harms, compared with the benefits, of an intervention (see Chapter 19, Section 19.2).
We advise caution in assessing NRSI according to existing ‘evidence hierarchies’ for studies of effectiveness (Eccles et al 1996, National Health and Medical Research Council 1999, Oxford Centre for Evidence-based Medicine 2001). These appear to have arisen largely by applying hierarchies for aetiological research questions to effectiveness questions and refer to study design labels. NRSI used for studying the effects of interventions are very diverse and complex (Shadish et al 2002) and may not be easily assimilated into existing evidence hierarchies. NRSI with different study design features are susceptible to different biases, and it is often unclear which biases have the greatest impact and how they vary between healthcare contexts. We recommend including at least one expert with knowledge of the subject and NRSI methods (with previous experience of estimating an intervention effect from NRSI similar to the ones of interest) on a review team to help to address these complexities.
24.2.2 Guidance and resources available to support review authors
Review authors should scope the available NRSI evidence between deciding on the specific synthesis PICOs that the review will address and finalizing the review protocol (see Section 24.1.1 ). Review authors may need to consult with stakeholders about the specific PICO questions of interest to ensure that scoping is informative. With this information, review authors can then use the algorithm ( Figure 24.1.a ) to decide whether the review needs to include NRSI and for which questions, enabling review authors to justify their decision(s) to include or exclude NRSI in their protocol. It will be important to ensure that the review team includes informed methodologists. Review authors intending to review the adverse effects (harms) of an intervention should consult Chapter 19 .
We recommend that review authors use explicit study design features (NB: not study design labels) when deciding which types of NRSI to include in a review. A checklist of study design features was first drawn up for the designs most frequently used to evaluate healthcare interventions (Higgins et al 2013). This checklist has since been revised to include designs often used to evaluate health systems (Reeves et al 2017) and combines the previous two checklists (for studies with individual and cluster-level allocation, respectively). Thirty-two items are grouped under seven headings, characterizing key features of strong and weak study designs ( Box 24.2.a ). The paper also sets out which features are associated with NRSI study design labels (acknowledging that these labels can be used inconsistently). We propose that the checklist be used in the processes of data collection and as part of the assessment of the studies (Sections 24.4.2 and 24.6.2 ).
Some Cochrane Reviews have limited inclusion of NRSI by study design labels, sometimes in combination with considerations of methodological quality. For example, Cochrane Effective Practice and Organisation of Care accepts protocols that include interrupted time series (ITS) and controlled before-and-after (CBA) studies, and specifies some minimum criteria for these types of studies. The risks of using design labels are highlighted by a recent review that showed that Cochrane Reviews inconsistently labelled CBA and ITS studies, and included studies that used these labels in highly inconsistent ways (Polus et al 2017). We believe that these issues will be addressed by applying the study feature checklist.
Our proposal is that:
- the review team decides which study design features are desirable in a NRSI to address a specific PICO question;
- scoping will indicate the study design features of the NRSI that are available; and
- the review team sets eligibility criteria based on study design features that represent an appropriate balance between the priority of the question and the likely strength of the available evidence.
When both randomized trials and NRSI of an intervention exist in relation to a specific PICO question and, for one or more of the reasons given in Section 24.1.1 , both are defined as eligible, the results for randomized trials and for NRSI should be presented and analysed separately. Alternatively, if there is an adequate number of randomized trials to inform the main analysis for a review question, comments about relevant NRSI can be included in the Discussion section of a review although the reader needs to be reassured that NRSI studies are not selectively cited.
Box 24.2.a Checklist of study features. Responses to each item should be recorded as: yes, no, or can’t tell (Reeves et al 2017). Reproduced with permission of Elsevier
24.3 Searching for non-randomized studies of interventions
24.3.1 What is different when including non-randomized studies of interventions?
24.3.1.1 Identifying non-randomized studies in searches
Searching for NRSI is less straightforward than searching for randomized trials. A broad search strategy – with search strings for the population and disease characteristics, the intervention and possibly the comparator – can potentially identify all evidence about an intervention. When a review aims to include randomized trials only, various approaches are available to focus the search strategy towards randomized trials (see Chapter 4, Section 4.4 ):
- implement the search within resources, such as the Cochrane Central Register of Controlled Trials (CENTRAL), that are ‘rich’ in randomized trials;
- use methodological filters and indexing fields, such as publication type in MEDLINE, to limit searches to studies that are likely to be randomized trials; and
- search trials registers.
Restricting the search to NRSI with specific study design features is more difficult. Of the above approaches, only the first is likely to be helpful. Some Cochrane Review Groups maintain specialized trials registers that also include NRSI, only some of which will also be found in CENTRAL, and authors of Cochrane Reviews can search these registers where they are likely to be relevant (e.g. the register of Cochrane Effective Practice and Organisation of Care). There are no databases of NRSI similar to CENTRAL.
Some review authors have tried to develop and validate methodological filters for NRSI (strategy 2) but with limited success because NRSI design labels are not reliably indexed by bibliographic databases and are used inconsistently by authors of primary studies (Wieland and Dickersin 2005, Fraser et al 2006, Furlan et al 2006). Furthermore, study design features, which are the preferred approach to determining eligibility of NRSI for a review, suffer from the same problems. Review authors have also sought to optimize search strategies for adverse effects (see Chapter 19, Section 19.3 ) (Golder et al 2006c, Golder et al 2006b). Because of the time-consuming nature of systematic reviews that include NRSI, attempts to develop search strategies for NRSI have not investigated large numbers of review questions. Therefore, review authors should be cautious about assuming that previous strategies can be applied to new topics.
Finally, although trials registers such as ClinicalTrials.gov do include some NRSI, their coverage is very low so strategy 3 is unlikely to be very fruitful.
Searching using ‘snowballing’ methods may be helpful, if one or more publications of relevance or importance are known (Wohlin 2014), although it is likely to identify other evidence about the research question in general rather than studies with similar design features.
24.3.1.2 Non-reporting biases for non-randomized studies
We are not aware of evidence that risk of bias due to missing evidence affects randomized trials and NRSI differentially. However, it is difficult to believe that publication bias could affect NRSI less than randomized trials, given the increasing number of safeguards associated with carrying out and reporting randomized trials that act to prevent reporting biases (e.g. pre-specified protocols, ethical approval including progress and final reports, the CONSORT statement (Moher et al 2001), trials registers and indexing of publication type in bibliographic databases). These safeguards are much less applicable to NRSI, which may not have been executed according to a pre-specified protocol, may not require explicit ethical approval, are unlikely to be registered, and do not always have a research sponsor or funder. The likely magnitude and determinants of publication bias for NRSI are not known.
24.3.1.3 Practical issues in selecting non-randomized studies for inclusion
Section 24.2.1.3 points out that NRSI include diverse study design features, and that there is difficulty in categorizing them. Assuming that review authors set specific criteria against which potential NRSI should be assessed for eligibility (e.g. study features), many of the potentially eligible NRSI will report insufficient information to allow them to be classified.
There is a further problem in defining exactly when a NRSI comes into existence. For example, is a cohort study that has collected data on the interventions and outcome of interest, but that has not examined their association, an eligible NRSI? Is computer output in a filing cabinet that includes a calculated odds ratio for the relevant association an eligible NRSI? Consequently, it is difficult to define a ‘finite population of NRSI’ for a particular review question. Many NRSI that have been done may not be traceable at all, that is, they are not to be found even in the proverbial ‘bottom drawer’.
Given these limitations of NRSI evidence, it is tempting to question the benefits of comprehensive searching for NRSI. It is possible that the studies that are the hardest to find are the most biased – if being hard to find is associated with design features that are susceptible to bias – to a greater extent than has been shown for randomized trials for some topics. It is likely that search strategies can be developed that identify eligible studies with reasonable precision (see Chapter 4, Section 4.4.3 ) and are replicable, but which are not comprehensive (i.e. lack sensitivity). Unfortunately, the risk of bias to review findings with such strategies has not been researched and their acceptability would depend on pre-specifying the strategy without knowledge of influential results, which would be difficult to achieve.
24.3.2 Guidance and resources available to support review authors
We do not recommend limiting search strategies by index terms relating to study design labels. However, review authors may wish to contact information specialists with expertise in searching for NRSI, researchers who have reported some success in developing efficient search strategies for NRSI (see Section 24.3.1 ) and other review authors who have carried out Cochrane Reviews (or other systematic reviews) of NRSI for review questions similar to their own.
When searching for NRSI, review authors are advised to search for studies investigating all effects of an intervention and not to limit search strategies to specific outcomes ( Chapter 4, Section 4.4.2 ). When searching for NRSI of specific rare or long-term (usually adverse or unintended) outcomes of an intervention, including free text and MeSH terms for specific outcomes in the search strategy may be justified (see Chapter 19, Section 19.3 ).
Review authors should check with their Cochrane Review Group editors whether the Group-specific register includes NRSI with particular study design features and should seek the advice of information retrieval experts within the Group and in the Information Retrieval Methods Group (see also Chapter 4 ).
24.4 Selecting studies and collecting data
24.4.1 What is different when including non-randomized studies?
Search results obtained using search strategies without study design filters are often much more numerous than those for reviews restricted to randomized trials, and contain large numbers of irrelevant records. Also, abstracts of NRSI reports often do not provide adequate detail about study design features (which are likely to be required to judge eligibility) or about some secondary outcomes measured (such as adverse effects). Therefore, more so than when reviewing randomized trials, a large number of full study reports may need to be obtained and read in order to identify eligible studies.
Review authors need to collect the same types of data required for a systematic review of randomized trials (see Chapter 5, Section 5.3 ) and will also need to collect data specific to the NRSI. For a NRSI, review authors should extract the estimate of intervention effect together with a measure of precision (e.g. a confidence interval) and information about how the estimate was derived (e.g. the confounders controlled for). Relevant results can then be meta-analysed using standard software.
If both unadjusted and adjusted intervention effects are reported, then adjusted effects should be preferred. It is straightforward to extract an adjusted effect estimate and its standard error for a meta-analysis if a single adjusted estimate is reported for a particular outcome in a primary NRSI. However, some NRSI report multiple adjusted estimates from analyses including different sets of covariates. If multiple adjusted estimates of intervention effect are reported, the one that is judged to minimize the risk of bias due to confounding should be chosen (see Chapter 25, Section 25.2.1 ). (Simple numerators and denominators, or means and standard errors, for intervention and control groups cannot control for confounding unless the groups have been matched on all important confounding domains at the design stage.)
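As an illustration only (this sketch is not taken from the Handbook, and the numbers are invented), a reported adjusted odds ratio and its 95% confidence interval can be converted to the log effect estimate and standard error needed for a generic inverse-variance analysis as follows, assuming a symmetric Wald-type interval on the log scale:

```python
# Convert an adjusted OR and its 95% CI into a log OR and standard error.
import math

def log_or_and_se(adjusted_or, ci_lower, ci_upper, z=1.96):
    log_or = math.log(adjusted_or)
    # The CI spans 2*z standard errors on the log scale.
    se = (math.log(ci_upper) - math.log(ci_lower)) / (2 * z)
    return log_or, se

# e.g. a study reporting an adjusted OR of 1.45 (95% CI 1.10 to 1.91)
print(log_or_and_se(1.45, 1.10, 1.91))
```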
Anecdotally, the experience of review authors is that NRSI are poorly reported so that the required information is difficult to find, and different review authors may extract different information from the same paper. Data collection forms may need to be customized to the research question being investigated. Restricting included studies to those that share specific features can help to reduce their diversity and facilitate the design of customized data collection forms.
As with randomized trials, results of NRSI may be presented using different measures of effect and uncertainty or statistical significance. Before concluding that information required to describe an intervention effect has not been reported, review authors should seek statistical advice about whether reported information can be transformed or used in other ways to provide a consistent effect measure across studies so that this can be analysed using standard software (see Chapter 6 ). Data collection sheets need to be able to handle the different kinds of information about study findings that review authors may encounter.
24.4.2 Guidance and resources available to support review authors
Data collection for each study needs to cover the following.
- Data about study design features to demonstrate the eligibility of included studies against criteria specified in the review protocol. The study design feature checklist can help to do this (see Section 24.2.2 ). When using this checklist, whether to decide on eligibility or for data extraction, the intention should be to document what researchers did in the primary studies, rather than what researchers called their studies or think they did. Further guidance on using the checklist is included with the description of the tool (Reeves et al 2017).
- Variables measured in a study that characterize confounding domains of interest; the ROBINS-I tool provides a template for collecting this information (see Chapter 25, Section 25.3 ) (Sterne et al 2016).
- The availability of data for experimental and comparator intervention groups, and about the co-interventions; the ROBINS-I tool provides a template for collecting information about co-interventions (see Chapter 25 ).
- Data to characterize the directness with which the study addresses the review question (i.e. the PICO elements of the study). We recommend that review authors record this information, then apply a simple template that has been published for doing this (Schünemann et al 2013, Wells et al 2013), judging the directness of each element as ‘sufficient’ on a 4-point categorical scale. (This tool could be used for scoping and can be applied to randomized trials as well as NRSI.)
- Data describing the study results (see Section 24.6.1 ). Capturing these data is likely to be challenging and data collection will almost certainly need to be customized to the research question being investigated. Review authors are strongly advised to pilot the methods they plan to use with studies that cover the expected diversity; developing the data collection form may require several iterations. It is almost impossible to finalize these forms in advance. Methods developed at the outset (e.g. forms or database) may need to be amended to record additional important information identified when appraising NRSI but overlooked at the outset. Review authors should record when required data are not available due to poor reporting, as well as data that are available. Data should be captured describing both unadjusted and adjusted intervention effects.
24.5 Assessing risk of bias in non-randomized studies
24.5.1 What is different when including non-randomized studies?
Biases in non-randomized studies are a major threat to the validity of findings from a review that includes NRSI. Key challenges affecting NRSI include the appropriate consideration of confounding in the absence of randomization, less consistent development of a comprehensive study protocol in advance of the study, and issues in the analysis of routinely collected data.
Assessing the risk of bias in a NRSI has long been a challenge and has not always been performed or performed well. Indeed, two studies of systematic reviews that included NRSI have commented that only a minority of reviews assessed the methodological quality of included studies (Audigé et al 2004, Golder et al 2006a).
The process of assessing risk of bias in NRSI is hampered in practice by the quality of reporting of many NRSI, and – in most cases – by the lack of availability of a protocol. A protocol is a tool to protect against bias; when registered in advance of a study starting, it proves that aspects of study design and analysis were considered in advance of starting to recruit (or acquiring historical data), and that data definitions and methods for standardizing data collection were defined. Primary NRSI rarely report whether the methods are based on a protocol and, therefore, these protections often do not apply to NRSI. An important consequence of not having a protocol is the lack of constraint on researchers with respect to ‘cherry-picking’ outcomes, subgroups and analyses to report; this can be a source of bias even in randomized trials where protocols exist (Chan et al 2004).
24.5.2 Guidance and resources available to support review authors
The recommended tool for assessing risk of bias in NRSI included in Cochrane Reviews is the ROBINS-I tool, described in detail in Chapter 25 (Sterne et al 2016). If review authors choose not to use ROBINS-I, they should demonstrate that their chosen method of assessment covers the range of biases assessed by ROBINS-I.
The ROBINS-I tool involves some preliminary work when writing the protocol. Notably, review authors will need to specify important confounding domains and co-interventions. There is no established method for identifying a pre-specified set of important confounding domains. The list of potential confounding domains should not be generated solely on the basis of factors considered in primary studies included in the review (at least, not without some form of independent validation), since the number of suspected confounders is likely to increase over time (hence, older studies may be out of date) and researchers themselves may simply choose to measure confounders considered in previous studies. Rather, the list should be based on evidence (although undertaking a systematic review to identify all potential prognostic factors is extreme) and expert opinion from members of the review team and advisors with content expertise.
The ROBINS-I assessment involves consideration of several bias domains. Each domain is judged as low, moderate, serious or critical risk of bias. A judgement of low risk of bias for a NRSI using ROBINS-I equates to a low risk-of-bias judgement for a high-quality randomized trial. Few circumstances around a NRSI are likely to give a similar level of protection against confounding as randomization, and few NRSI have detailed statistical analysis plans in advance of carrying out analyses. We therefore consider it very unlikely that any NRSI will be judged to be at low risk of bias overall.
Although the bias domains are common to all types of NRSI, specific issues can arise for certain types of study, such as analyses of routinely collected data or pharmaco-epidemiological studies. Review authors are advised to consider carefully whether a methodologist with knowledge of the kinds of study to be included should be recruited to the review team to help to identify key areas of weakness.
24.6 Synthesis of results from non-randomized studies
24.6.1 What is different when including non-randomized studies?
Review authors should expect greater heterogeneity in a systematic review of NRSI than a systematic review of randomized trials. This is partly due to the diverse ways in which non-randomized studies may be designed to investigate the effects of interventions, and partly due to the increased potential for methodological variation between primary studies and the resulting variation in their risk of bias. It is very difficult to interpret the implications of this diversity in the analysis of primary studies. Some methodological diversity may give rise to bias, for example different methods for measuring exposure and outcome, or adjustment for more versus fewer important confounding domains. There is no established method for assessing how, or the extent to which, these biases affect primary studies (but see Chapter 7 and Chapter 25 ).
Unlike for randomized trials, it will usually be appropriate to analyse adjusted, rather than unadjusted, effect estimates (i.e. analyses should be selected that attempt to control for confounding). Review authors may have to choose between alternative adjusted estimates reported for one study and should choose the one that minimizes the risk of bias due to confounding (see Chapter 25, Section 25.2.1 ). In principle, any effect measure used in meta-analysis of randomized trials can also be used in meta-analysis of non-randomized studies (see Chapter 6 ). The odds ratio will commonly be used as it is the only effect measure for dichotomous outcomes that can be estimated from case-control studies, and is estimated when logistic regression is used to adjust for confounders.
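For illustration (a hypothetical sketch with simulated data; the variable names and confounders are assumptions, not drawn from the chapter), a confounder-adjusted odds ratio of the kind extracted from a primary NRSI might be produced by logistic regression as follows:

```python
# Confounder-adjusted odds ratio from a (simulated) non-randomized study.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),      # non-randomized exposure to the intervention
    "age": rng.normal(60, 10, n),          # potential confounder
    "severity": rng.integers(0, 3, n),     # potential confounder
})
# Simulate an outcome that depends on the exposure and both confounders.
linpred = -2 + 0.5 * df["treated"] + 0.03 * (df["age"] - 60) + 0.4 * df["severity"]
df["outcome"] = rng.binomial(1, 1 / (1 + np.exp(-linpred)))

model = smf.logit("outcome ~ treated + age + severity", data=df).fit(disp=False)
adjusted_or = np.exp(model.params["treated"])        # adjusted OR for the intervention
ci = np.exp(model.conf_int().loc["treated"])         # 95% CI on the OR scale
print(f"Adjusted OR: {adjusted_or:.2f} (95% CI {ci[0]:.2f} to {ci[1]:.2f})")
```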
One danger is that a very large NRSI of poor methodological quality (e.g. based on routinely collected data) may dominate the findings of other smaller studies at less risk of bias (perhaps carried out using customized data collection). Review authors need to remember that the confidence intervals for effect estimates from larger NRSI are less likely to represent the true uncertainty of the observed effect than are the confidence intervals for smaller NRSI (Deeks et al 2003), although there is no way of estimating or correcting for this. Review authors should exclude from analysis any NRSI judged to be at critical risk of bias and may choose to include only studies that are at moderate or low risk of bias, specifying this choice a priori in the review protocol.
24.6.2 Guidance and resources available to support review authors
24.6.2.1 Combining studies
If review authors judge that included NRSI are at low to moderate overall risk of biases and relatively homogeneous in other respects, then they may combine results across studies using meta-analysis (Taggart et al 2001). Decisions about combining results at serious risk of bias are more difficult to make, and any such syntheses will need to be presented with very clear warnings about the likelihood of bias in the findings. As stated earlier, results considered to be at critical risk of bias using the ROBINS-I tool should be excluded from analyses.
Estimated intervention effects for NRSI with different study design features can be expected to be influenced to varying degrees by different sources of bias (see Section 24.6 ). Results from NRSI with different combinations of study design features should be expected to differ systematically, resulting in increased heterogeneity. Therefore, we recommend that NRSI that have very different design features should be analysed separately. This recommendation implies that, for example, randomized trials and NRSI should not be combined in a meta-analysis , and that cohort studies and case-control studies should not be combined in a meta-analysis if they address different research questions.
An illustration of many of these points is provided by a review of the effects of some childhood vaccines on overall mortality. The authors analysed randomized trials separately from NRSI. However, they decided that the cohort studies and case-control studies were asking sufficiently similar questions to be combined in meta-analyses, while results from any NRSI that were judged to be at a very high risk of bias were excluded from the syntheses (Higgins et al 2016). In many other situations, it may not be reasonable to combine results from cohort studies and case-control studies.
Meta-analysis methods based on estimates and standard errors, and in particular the generic inverse-variance method, will be suitable for NRSI (see Chapter 10, Section 10.3 ). Given that heterogeneity between NRSI is expected to be high because of their diversity, the random-effects meta-analysis approach should be the default choice; a clear rationale should be provided for any decision to use the fixed-effect method.
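As an illustration of the generic inverse-variance approach with a random-effects model, the sketch below pools a handful of invented log odds ratios using the DerSimonian-Laird estimate of between-study variance. The numbers are hypothetical and the code is a sketch, not a substitute for dedicated meta-analysis software; setting `tau2` to zero recovers the fixed-effect inverse-variance pool.

```python
# Minimal sketch of a generic inverse-variance random-effects (DerSimonian-Laird) pool.
# The effect estimates below are hypothetical log odds ratios with standard errors.
import numpy as np

yi = np.array([-0.35, -0.10, -0.55, 0.05])   # log odds ratios from four NRSI (invented)
sei = np.array([0.15, 0.20, 0.25, 0.30])     # their standard errors (invented)

w_fixed = 1 / sei**2
fixed_mean = np.sum(w_fixed * yi) / np.sum(w_fixed)
q = np.sum(w_fixed * (yi - fixed_mean)**2)   # Cochran's Q
df = len(yi) - 1
c = np.sum(w_fixed) - np.sum(w_fixed**2) / np.sum(w_fixed)
tau2 = max(0.0, (q - df) / c)                # between-study variance estimate

w_random = 1 / (sei**2 + tau2)
pooled = np.sum(w_random * yi) / np.sum(w_random)
se_pooled = np.sqrt(1 / np.sum(w_random))
print("pooled OR:", np.exp(pooled),
      "95% CI:", np.exp(pooled - 1.96 * se_pooled), np.exp(pooled + 1.96 * se_pooled))
```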
24.6.2.2 Analysis of heterogeneity
The exploration of possible sources of heterogeneity between studies should be part of any Cochrane Review, and is discussed in detail in Chapter 10 (Section 10.11 ). Non-randomized studies may be expected to be more heterogeneous than randomized trials, given the extra sources of methodological diversity and bias. Researchers do not always make the same decisions concerning confounding factors, so the extent of residual confounding is an important source of heterogeneity between studies. There may be differences in the confounding factors considered, the method used to control for confounding and the precise way in which confounding factors were measured and included in analyses.
The simplest way to display the variation in results of studies is by drawing a forest plot (see Chapter 10, Section 10.2.1 ). Providing that sufficient intervention effect estimates are available, it may be valuable to undertake meta-regression analyses to identify important determinants of heterogeneity, even in reviews when studies are considered too heterogeneous to combine. Such analyses could include study design features believed to be influential, to help to identify methodological features that systematically relate to observed intervention effects, and help to identify the subgroups of studies most likely to yield valid estimates of intervention effects. Investigation of key study design features should preferably be pre-specified in the protocol, based on scoping.
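A meta-regression along these lines can be sketched as an inverse-variance weighted regression of study effect estimates on a study-level design feature. The effect sizes and the "prospective data collection" covariate below are invented for illustration; a full analysis would also allow for residual between-study heterogeneity (e.g. random-effects meta-regression).

```python
# Minimal sketch of a meta-regression: regress study effect estimates on a
# study-level design feature, weighting by inverse variance.
# Effect sizes, standard errors and the covariate are hypothetical.
import numpy as np
import statsmodels.api as sm

yi = np.array([-0.40, -0.30, -0.05, 0.10, -0.20])   # log odds ratios (invented)
sei = np.array([0.15, 0.18, 0.22, 0.25, 0.20])      # standard errors (invented)
prospective = np.array([1, 1, 0, 0, 1])             # 1 = prospective data collection

X = sm.add_constant(prospective)
model = sm.WLS(yi, X, weights=1 / sei**2).fit()
print(model.params)   # second coefficient: how the design feature relates to effect size
```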
24.6.2.3 When combining results is judged not to be appropriate
Before undertaking a meta-analysis, review authors should ask themselves the standard question about whether primary studies are ‘similar enough’ to justify combining results (see Chapter 9, Section 9.3.2 ). Forest plots allow the presentation of estimates and standard errors for each study, and in most software (including RevMan) it is possible to omit summary estimates from the plots, or include them only for subgroups of studies. Providing that effect estimates from the included studies can be expressed using consistent effect measures, we recommend that review authors display individual study results for NRSI with similar study design features using forest plots, as a standard feature. If consistent effect measures are not available or calculable, then additional tables should be used to present results in a systematic format (see also Chapter 12, Section 12.3 ).
If the features of studies are not sufficiently similar to combine in a meta-analysis (which is expected to be the norm for reviews that include NRSI), we recommend displaying the results of included studies in a forest plot but suppressing the summary estimate (see Chapter 12, Section 12.3.2 ). For example, in a review of the effects of circumcision on risk of HIV infection, a forest plot illustrated the result from each study without synthesizing them (Siegfried et al 2005). Studies may be sorted in the forest plot (or shown in separate forest plots) by study design feature, or their risk of bias. For example, the circumcision studies were separated into cohort studies, cross-sectional studies and case-control studies. Heterogeneity diagnostics and investigations (e.g. testing and quantifying heterogeneity, the I² statistic and meta-regression analyses) are worthwhile even when a judgement has been made that calculating a pooled estimate of effect is not (Higgins et al 2003, Siegfried et al 2003).
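Heterogeneity statistics such as Cochran's Q and I² can be reported alongside an unpooled forest plot. A minimal sketch with invented study estimates:

```python
# Minimal sketch: Cochran's Q and the I-squared statistic for a set of studies.
# The log odds ratios and standard errors are hypothetical.
import numpy as np
from scipy import stats

yi = np.array([-0.50, -0.10, 0.20, -0.35])   # study effect estimates (invented)
sei = np.array([0.12, 0.18, 0.25, 0.20])     # their standard errors (invented)

w = 1 / sei**2
fixed = np.sum(w * yi) / np.sum(w)
q = np.sum(w * (yi - fixed)**2)
df = len(yi) - 1
i2 = max(0.0, (q - df) / q) * 100            # percentage of variability beyond chance
p_value = stats.chi2.sf(q, df)
print(f"Q = {q:.2f}, p = {p_value:.3f}, I^2 = {i2:.0f}%")
```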
Non-statistical syntheses of quantitative intervention effects (see Chapter 12 ) are challenging, however, because it is difficult to set out or describe results without being selective or emphasizing some findings over others. Ideally, authors should set out in the review protocol how they plan to use narrative synthesis to report the findings of primary studies.
24.7 Interpretation and discussion
24.7.1 What is different when including non-randomized studies?
As highlighted at the outset, review authors have a duty to summarize available evidence about interventions, balancing harms against benefits and qualified with a certainty assessment. Some of this evidence, especially about harms of interventions, will often need to come from NRSI. Nevertheless, obtaining definitive results about the likely effects of an intervention based on NRSI alone can be difficult (Deeks et al 2003). Many reviews of NRSI conclude that an ‘average’ effect is not an appropriate summary (Siegfried et al 2003), that evidence from NRSI does not provide enough certainty to demonstrate effectiveness or harm (Kwan and Sandercock 2004) and that randomized trials should be undertaken (Taggart et al 2001). Inspection of the risk-of-bias judgements for the individual domains addressed by the ROBINS-I tool should help interpretation, and may highlight the main ways in which NRSI are limited (Sterne et al 2016).
Challenges arise at all stages of conducting a review of NRSI: deciding which study design features should be specified as eligibility criteria, searching for studies, assessing studies for potential bias, and deciding how to synthesize results. A review author needs to satisfy the reader of the review that these challenges have been adequately addressed, or should discuss how and why they cannot be met. In this section, the challenges are illustrated with reference to issues raised in the different sections of this chapter. The Discussion section of the review should address the extent to which the challenges have been met.
24.7.1.1 Have important and relevant studies been included?
Even if the choice of eligible study design features can be justified, it may be difficult to show that all relevant studies have been identified because of poor indexing and inconsistent use of study design labels or poor reporting of design features by researchers. Comprehensive search strategies that focus only on the health condition and intervention of interest are likely to result in a very long list of bibliographic records including relatively few eligible studies; conversely, restrictive strategies will inevitably miss some eligible studies. In practice, available resources may make it impossible to process the results from a comprehensive search, especially since review authors will often have to read full papers rather than abstracts to determine eligibility. The implications of using a more or less comprehensive search strategy are not known.
24.7.1.2 Has the risk of bias in included studies been adequately assessed?
Interpretation of the results of a review of NRSI should include consideration of the likely direction and magnitude of bias, although this can be challenging to do. Some of the biases that affect randomized trials also affect NRSI but typically to a greater extent. For example, attrition in NRSI is often worse (and poorly reported), intervention and outcome assessment are rarely conducted according to standardized protocols, outcomes are rarely assessed blind to the allocation to intervention and comparator, and there is typically little protection against selection of the reported result. Too often these limitations of NRSI are seen as part of doing a NRSI, and their implications for risk of bias are not properly considered. For example, some users of evidence may consider NRSI that investigate long-term outcomes to have ‘better quality’ than randomized trials of short-term outcomes, simply on the basis of their directness without appraising their risk of bias; long-term outcomes may address the review question(s) more directly, but may do so with a considerable risk of bias.
We recommend using the ROBINS-I tool to assess the risk of bias because of the consensus among a large team of developers that it covers all important bias domains. This is not true of any other tool to assess the risk of bias in NRSI. The importance of individual bias domains may vary according to the review question; for example, confounding may be less likely to arise in NRSI of long-term or adverse effects, or of some public health primary prevention interventions.
As with randomized trials, one clue to the presence of bias is notable between-study heterogeneity. Although heterogeneity can arise through differences in participants, interventions and outcome assessments, the possibility that bias is the cause of heterogeneity in reviews of NRSI must be seriously considered. However, lack of heterogeneity does not indicate lack of bias, since it is possible that a consistent bias applies in all studies.
Predicting the direction of bias (within each bias domain) is an optional element of the ROBINS-I tool. This is a subject of ongoing research which is attempting to gather empirical evidence on factors (such as study design features and intervention type) that determine the size and direction of the biases. The ability to predict both the likely magnitude of bias and the likely direction of bias would greatly improve the usefulness of evidence from systematic reviews of NRSI. There is currently some evidence that in limited circumstances the direction, at least, can be predicted (Henry et al 2001).
24.7.2 Evaluating the strength of evidence provided by reviews that include non-randomized studies
Assembling the evidence from NRSI on a particular health question enables informed debate about its meaning and importance, and the certainty that can be attributed to it. Critically, there needs to be a debate about whether the findings could be misleading. Formal hierarchies of evidence all place NRSI lower than randomized trials, but above those of clinical opinion (Eccles et al 1996, National Health and Medical Research Council 1999, Oxford Centre for Evidence-based Medicine 2001). This emphasizes the general concern about biases in NRSI, and the difficulties of attributing causality to the observed associations between intervention and outcome.
In preference to these traditional hierarchies, the GRADE approach is recommended for assessing the certainty of a body of evidence in Cochrane Reviews, and is summarized in Chapter 14 (Section 14.2 ). There are four levels of certainty: ‘high’, ‘moderate’, ‘low’ and ‘very low’. A collection of studies begins with an assumption of ‘high’ certainty (with the introduction of ROBINS-I, this includes collections of NRSI) (Schünemann et al 2018). The certainty is then rated down in the presence of serious concerns about study limitations (risk of bias), indirectness of evidence, heterogeneity, imprecision or publication bias. In practice, the final rating for a body of evidence based on NRSI is typically rated as ‘low’ or ‘very low’.
Application of the GRADE approach to systematic reviews of NRSI requires expertise about the design of NRSI due to the nature of the biases that may arise. For example, the strength of evidence for an association may be enhanced by a subset of primary studies that have tested considerations about causality not usually applied to randomized trial evidence (Bradford Hill 1965), or use of negative controls (Jackson et al 2006). In some contexts, little prognostic information may be known, limiting identification of possible confounding (Jefferson et al 2005).
Whether the debate concludes that the evidence from NRSI is adequate for informed decision making or that there is a need for randomized trials will depend on the value placed on the uncertainty arising through use of potentially biased NRSI, and the collective value of the observed effects. The GRADE approach interprets certainty as the certainty that the effect of the intervention is large enough to reach a threshold for action. This value may depend on the wider healthcare context. It may not be possible to include assessments of the value within the review itself, and it may become evident only as part of the wider debate following publication.
For example, is evidence from NRSI of a rare serious adverse effect adequate to decide that an intervention should not be used? The evidence has low certainty (due to a lack of randomized trials) but the value of knowing that there is the possibility of a potentially serious harm is considerable, and may be judged sufficient to withdraw the intervention. (It is worth noting that the judgement about withdrawing an intervention may depend on whether equivalent benefits can be obtained from elsewhere without such a risk; if not, the intervention may still be offered but with full disclosure of the potential harm.) Where evidence of benefit is also uncertain, the value attached to a systematic review of NRSI of harm may be even greater.
In contrast, evidence of a small benefit of a novel intervention from a systematic review of NRSI may not be sufficient for decision makers to recommend widespread implementation in the face of the uncertainty of the evidence and the costs arising from provision of the intervention. In these circumstances, decision makers may conclude that randomized trials should be undertaken to improve the certainty of the evidence if practicable and if the investment in the trial is likely to be repaid in the future.
24.7.3 Guidance for potential review authors
Carrying out a systematic review of NRSI is likely to require complex decisions, often necessitating members of the review team with content knowledge and methodological expertise about NRSI at each stage of the review. Potential review authors should therefore seek to collaborate with methodologists, irrespective of whether a review aims to investigate harms or benefits, short-term or long-term outcomes, frequent or rare events.
Review teams may be keen to include NRSI in systematic reviews in areas where there are few or no randomized trials because they have the ambition to improve the evidence-base in their specialty areas (a key motivation for many Cochrane Reviews). However, for reviews of NRSI to estimate the effects of an intervention on short-term and expected outcomes, review authors should also recognize that the resources required to do a systematic review of NRSI are likely to be much greater than for a systematic review of randomized trials. Inclusion of NRSI to address some review questions will be invaluable in addressing the broad aims of a review; however, the conclusions in relation to some review questions are likely to be much weaker and may make a relatively small contribution to the topic. Therefore, review authors and Cochrane Review Group editors need to decide at an early stage whether the investment of resources is likely to be justified by the priority of the research question.
Bringing together the required team of healthcare professionals and methodologists may be easier for systematic reviews of NRSI to estimate the effects of an intervention on long-term and rare adverse outcomes, for example when considering the side effects of drugs. A review of this kind is likely to provide important missing evidence about the effects of an intervention in a priority area (i.e. adverse effects). However, these reviews may require the input of additional specialist authors, for example with relevant pharmacological content expertise. There is a pressing need in many health conditions to supplement traditional systematic reviews of randomized trials of effectiveness with systematic reviews of adverse (unintended) effects. It is likely that these systematic reviews will usually need to include NRSI.
24.8 Chapter information
Authors: Barnaby C Reeves, Jonathan J Deeks, Julian PT Higgins, Beverley Shea, Peter Tugwell, George A Wells; on behalf of the Cochrane Non-Randomized Studies of Interventions Methods Group
Acknowledgements: We gratefully acknowledge Ole Olsen, Peter Gøtzsche, Angela Harden, Mustafa Soomro and Guido Schwarzer for their early drafts of different sections. We also thank Laurent Audigé, Duncan Saunders, Alex Sutton, Helen Thomas and Gro Jamtved for comments on previous drafts.
Funding: BCR is supported by the UK National Institute for Health Research Biomedical Research Centre at University Hospitals Bristol NHS Foundation Trust and the University of Bristol. JJD receives support from the National Institute for Health Research (NIHR) Birmingham Biomedical Research Centre at the University Hospitals Birmingham NHS Foundation Trust and the University of Birmingham. JPTH is a member of the NIHR Biomedical Research Centre at University Hospitals Bristol NHS Foundation Trust and the University of Bristol. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health.
24.9 References
Audigé L, Bhandari M, Griffin D, Middleton P, Reeves BC. Systematic reviews of nonrandomized clinical studies in the orthopaedic literature. Clinical Orthopaedics and Related Research 2004: 249-257.
Bradford Hill A. The environment and disease: association or causation? Proceedings of the Royal Society of Medicine 1965; 58 : 295-300.
Chan AW, Hróbjartsson A, Haahr MT, Gøtzsche PC, Altman DG. Empirical evidence for selective reporting of outcomes in randomized trials: comparison of protocols to published articles. JAMA 2004; 291 : 2457-2465.
Deeks JJ, Dinnes J, D'Amico R, Sowden AJ, Sakarovitch C, Song F, Petticrew M, Altman DG. Evaluating non-randomised intervention studies. Health Technology Assessment 2003; 7 : 27.
Doll R. Doing more good than harm: The evaluation of health care interventions: Summation of the conference. Annals of the New York Academy of Sciences 1993; 703 : 310-313.
Eccles M, Clapp Z, Grimshaw J, Adams PC, Higgins B, Purves I, Russel I. North of England evidence based guidelines development project: methods of guideline development. BMJ 1996; 312 : 760-762.
Fraser C, Murray A, Burr J. Identifying observational studies of surgical interventions in MEDLINE and EMBASE. BMC Medical Research Methodology 2006; 6 : 41.
Furlan AD, Irvin E, Bombardier C. Limited search strategies were effective in finding relevant nonrandomized studies. Journal of Clinical Epidemiology 2006; 59 : 1303-1311.
Glasziou P, Chalmers I, Rawlins M, McCulloch P. When are randomised trials unnecessary? Picking signal from noise. BMJ 2007; 334 : 349-351.
Golder S, Loke Y, McIntosh HM. Room for improvement? A survey of the methods used in systematic reviews of adverse effects. BMC Medical Research Methodology 2006a; 6 : 3.
Golder S, McIntosh HM, Duffy S, Glanville J, Centre for Reviews and Dissemination and UK Cochrane Centre Search Filters Design Group. Developing efficient search strategies to identify reports of adverse effects in MEDLINE and EMBASE. Health Information and Libraries Journal 2006b; 23 : 3-12.
Golder S, McIntosh HM, Loke Y. Identifying systematic reviews of the adverse effects of health care interventions. BMC Medical Research Methodology 2006c; 6 : 22.
Henry D, Moxey A, O'Connell D. Agreement between randomized and non-randomized studies: the effects of bias and confounding. 9th Cochrane Colloquium; 2001; Lyon (France).
Hernán MA, Robins JM. Using big data to emulate a target trial when a randomized trial is not available. American Journal of Epidemiology 2016; 183 : 758-764.
Higgins JPT, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-analyses. BMJ 2003; 327 : 557-560.
Higgins JPT, Ramsay C, Reeves BC, Deeks JJ, Shea B, Valentine JC, Tugwell P, Wells G. Issues relating to study design and risk of bias when including non-randomized studies in systematic reviews on the effects of interventions. Research Synthesis Methods 2013; 4 : 12-25.
Higgins JPT, Soares-Weiser K, López-López JA, Kakourou A, Chaplin K, Christensen H, Martin NK, Sterne JA, Reingold AL. Association of BCG, DTP, and measles containing vaccines with childhood mortality: systematic review. BMJ 2016; 355 : i5170.
Jackson LA, Jackson ML, Nelson JC, Neuzil KM, Weiss NS. Evidence of bias in estimates of influenza vaccine effectiveness in seniors. International Journal of Epidemiology 2006; 35 : 337-344.
Jefferson T, Smith S, Demicheli V, Harnden A, Rivetti A, Di Pietrantonj C. Assessment of the efficacy and effectiveness of influenza vaccines in healthy children: systematic review. The Lancet 2005; 365 : 773-780.
Kwan J, Sandercock P. In-hospital care pathways for stroke. Cochrane Database of Systematic Reviews 2004; 4 : CD002924.
Li X, You R, Wang X, Liu C, Xu Z, Zhou J, Yu B, Xu T, Cai H, Zou Q. Effectiveness of prophylactic surgeries in BRCA1 or BRCA2 mutation carriers: a meta-analysis and systematic review. Clinical Cancer Research 2016; 22 : 3971-3981.
MacLehose RR, Reeves BC, Harvey IM, Sheldon TA, Russell IT, Black AM. A systematic review of comparisons of effect sizes derived from randomised and non-randomised studies. Health Technology Assessment 2000; 4 : 1-154.
Macpherson A, Spinks A. Bicycle helmet legislation for the uptake of helmet use and prevention of head injuries. Cochrane Database of Systematic Reviews 2008; 3 : CD005401.
Moher D, Schulz KF, Altman DG. The CONSORT Statement: revised recommendations for improving the quality of reports of parallel-group randomised trials. Lancet 2001; 357 : 1191-1194.
National Health and Medical Research Council. A guide to the development, implementation and evaluation of clinical practice guidelines [Endorsed 16 November 1998] . Canberra (Australia): Commonwealth of Australia; 1999.
Oxford Centre for Evidence-based Medicine. Levels of Evidence. 2001. www.cebm.net .
Peto R, Collins R, Gray R. Large-scale randomized evidence: large, simple trials and overviews of trials. Journal of Clinical Epidemiology 1995; 48 : 23-40.
Polus S, Pieper D, Burns J, Fretheim A, Ramsay C, Higgins JPT, Mathes T, Pfadenhauer LM, Rehfuess EA. Heterogeneity in application, design, and analysis characteristics was found for controlled before-after and interrupted time series studies included in Cochrane reviews. Journal of Clinical Epidemiology 2017; 91 : 56-69.
Reeves BC. Parachute approach to evidence based medicine: as obvious as ABC. BMJ 2006; 333 : 807-808.
Reeves BC, Higgins JPT, Ramsay C, Shea B, Tugwell P, Wells GA. An introduction to methodological issues when including non-randomised studies in systematic reviews on the effects of interventions. Research Synthesis Methods 2013; 4 : 1-11.
Reeves BC, Wells GA, Waddington H. Quasi-experimental study designs series-paper 5: a checklist for classifying studies evaluating the effects on health interventions-a taxonomy without labels. Journal of Clinical Epidemiology 2017; 89 : 30-42.
Schünemann HJ, Tugwell P, Reeves BC, Akl EA, Santesso N, Spencer FA, Shea B, Wells G, Helfand M. Non-randomized studies as a source of complementary, sequential or replacement evidence for randomized controlled trials in systematic reviews on the effects of interventions. Research Synthesis Methods 2013; 4 : 49-62.
Schünemann HJ, Cuello C, Akl EA, Mustafa RA, Meerpohl JJ, Thayer K, Morgan RL, Gartlehner G, Kunz R, Katikireddi SV, Sterne J, Higgins JPT, Guyatt G, GRADE Working Group. GRADE guidelines: 18. How ROBINS-I and other tools to assess risk of bias in nonrandomized studies should be used to rate the certainty of a body of evidence. Journal of Clinical Epidemiology 2018.
Shadish WR, Cook TD, Campbell DT. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Boston (MA): Houghton Mifflin; 2002.
Siegfried N, Muller M, Volmink J, Deeks J, Egger M, Low N, Weiss H, Walker S, Williamson P. Male circumcision for prevention of heterosexual acquisition of HIV in men. Cochrane Database of Systematic Reviews 2003; 3 : CD003362.
Siegfried N, Muller M, Deeks J, Volmink J, Egger M, Low N, Walker S, Williamson P. HIV and male circumcision--a systematic review with assessment of the quality of studies. Lancet Infectious Diseases 2005; 5 : 165-173.
Stampfer MJ, Colditz GA. Estrogen replacement therapy and coronary heart disease: a quantitative assessment of the epidemiologic evidence. Preventive Medicine 1991; 20 : 47-63.
Sterne JAC, Hernán MA, Reeves BC, Savović J, Berkman ND, Viswanathan M, Henry D, Altman DG, Ansari MT, Boutron I, Carpenter JR, Chan AW, Churchill R, Deeks JJ, Hróbjartsson A, Kirkham J, Jüni P, Loke YK, Pigott TD, Ramsay CR, Regidor D, Rothstein HR, Sandhu L, Santaguida PL, Schünemann HJ, Shea B, Shrier I, Tugwell P, Turner L, Valentine JC, Waddington H, Waters E, Wells GA, Whiting PF, Higgins JPT. ROBINS-I: a tool for assessing risk of bias in non-randomized studies of interventions. BMJ 2016; 355 : i4919.
Taggart DP, D'Amico R, Altman DG. Effect of arterial revascularisation on survival: a systematic review of studies comparing bilateral and single internal mammary arteries. The Lancet 2001; 358 : 870-875.
Tugwell P, Knottnerus JA, McGowan J, Tricco A. Big-5 Quasi-Experimental designs. Journal of Clinical Epidemiology 2017; 89 : 1-3.
Wells GA, Shea B, Higgins JPT, Sterne J, Tugwell P, Reeves BC. Checklists of methodological issues for review authors to consider when including non-randomized studies in systematic reviews. Research Synthesis Methods 2013; 4 : 63-77.
Wieland S, Dickersin K. Selective exposure reporting and Medline indexing limited the search sensitivity for observational studies of the adverse effects of oral contraceptives. Journal of Clinical Epidemiology 2005; 58 : 560-567.
Wohlin C. Guidelines for snowballing in systematic literature studies and a replication in software engineering. EASE '14 Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering; London, UK 2014.
The Use and Interpretation of Quasi-Experimental Studies in Medical Informatics
Anthony D Harris, MD, MPH, Jessina C McGregor, PhD, Eli N Perencevich, MD, MS, Jon P Furuno, PhD, Jingkun Zhu, MS, Dan E Peterson, MD, MPH, Joseph Finkelstein, MD
Correspondence and reprints: Anthony D. Harris, MD, MPH, Division of Healthcare Outcomes Research, Department of Epidemiology and Preventive Medicine, University of Maryland School of Medicine, 100 N. Greene Street, Lower Level, Baltimore, MD; e-mail: < [email protected] >.
Received 2004 Nov 19; Accepted 2005 Aug 12.
Quasi-experimental study designs, often described as nonrandomized, pre-post intervention studies, are common in the medical informatics literature. Yet little has been written about the benefits and limitations of the quasi-experimental approach as applied to informatics studies. This paper outlines a relative hierarchy and nomenclature of quasi-experimental study designs that is applicable to medical informatics intervention studies. In addition, the authors performed a systematic review of two medical informatics journals, the Journal of the American Medical Informatics Association (JAMIA) and the International Journal of Medical Informatics (IJMI), to determine the number of quasi-experimental studies published and how the studies are classified on the above-mentioned relative hierarchy. They hope that future medical informatics studies will implement higher level quasi-experimental study designs that yield more convincing evidence for causal links between medical informatics interventions and outcomes.
Quasi-experimental studies encompass a broad range of nonrandomized intervention studies. These designs are frequently used when it is not logistically feasible or ethical to conduct a randomized controlled trial. Examples of quasi-experimental studies follow. As one example of a quasi-experimental study, a hospital introduces a new order-entry system and wishes to study the impact of this intervention on the number of medication-related adverse events before and after the intervention. As another example, an informatics technology group is introducing a pharmacy order-entry system aimed at decreasing pharmacy costs. The intervention is implemented and pharmacy costs before and after the intervention are measured.
In medical informatics, the quasi-experimental, sometimes called the pre-post intervention, design often is used to evaluate the benefits of specific interventions. The increasing capacity of health care institutions to collect routine clinical data has led to the growing use of quasi-experimental study designs in the field of medical informatics as well as in other medical disciplines. However, little is written about these study designs in the medical literature or in traditional epidemiology textbooks. 1 , 2 , 3 In contrast, the social sciences literature is replete with examples of ways to implement and improve quasi-experimental studies. 4 , 5 , 6
In this paper, we review the different pretest-posttest quasi-experimental study designs, their nomenclature, and the relative hierarchy of these designs with respect to their ability to establish causal associations between an intervention and an outcome. The example of a pharmacy order-entry system aimed at decreasing pharmacy costs will be used throughout this article to illustrate the different quasi-experimental designs. We discuss limitations of quasi-experimental designs and offer methods to improve them. We also perform a systematic review of four years of publications from two informatics journals to determine the number of quasi-experimental studies, classify these studies into their application domains, determine whether the potential limitations of quasi-experimental studies were acknowledged by the authors, and place these studies into the above-mentioned relative hierarchy.
The authors reviewed articles and book chapters on the design of quasi-experimental studies. 4 , 5 , 6 , 7 , 8 , 9 , 10 Most of the reviewed articles referenced two textbooks that were then reviewed in depth. 4 , 6
Key advantages and disadvantages of quasi-experimental studies, as they pertain to the study of medical informatics, were identified. The potential methodological flaws of quasi-experimental medical informatics studies, which have the potential to introduce bias, were also identified. In addition, a summary table outlining a relative hierarchy and nomenclature of quasi-experimental study designs is described. In general, the higher the design is in the hierarchy, the greater the internal validity that the study traditionally possesses because the evidence of the potential causation between the intervention and the outcome is strengthened. 4
We then performed a systematic review of four years of publications from two informatics journals. First, we determined the number of quasi-experimental studies. We then classified these studies on the above-mentioned hierarchy. We also classified the quasi-experimental studies according to their application domain. The categories of application domains employed were based on categorization used by Yearbooks of Medical Informatics 1992–2005 and were similar to the categories of application domains employed by Annual Symposiums of the American Medical Informatics Association. 11 The categories were (1) health and clinical management; (2) patient records; (3) health information systems; (4) medical signal processing and biomedical imaging; (5) decision support, knowledge representation, and management; (6) education and consumer informatics; and (7) bioinformatics. Because the quasi-experimental study design has recognized limitations, we sought to determine whether authors acknowledged the potential limitations of this design. Examples of acknowledgment included mention of lack of randomization, the potential for regression to the mean, the presence of temporal confounders and the mention of another design that would have more internal validity.
All original scientific manuscripts published between January 2000 and December 2003 in the Journal of the American Medical Informatics Association (JAMIA) and the International Journal of Medical Informatics (IJMI) were reviewed. One author (ADH) reviewed all the papers to identify the number of quasi-experimental studies. Other authors (ADH, JCM, JF) then independently reviewed all the studies identified as quasi-experimental. The three authors then convened as a group to resolve any disagreements in study classification, application domain, and acknowledgment of limitations.
Results and Discussion
What Is a Quasi-experiment?
Quasi-experiments are studies that aim to evaluate interventions but that do not use randomization. Similar to randomized trials, quasi-experiments aim to demonstrate causality between an intervention and an outcome. Quasi-experimental studies can use both preintervention and postintervention measurements as well as nonrandomly selected control groups.
Using this basic definition, it is evident that many published studies in medical informatics utilize the quasi-experimental design. Although the randomized controlled trial is generally considered to have the highest level of credibility with regard to assessing causality, in medical informatics, researchers often choose not to randomize the intervention for one or more reasons: (1) ethical considerations, (2) difficulty of randomizing subjects, (3) difficulty of randomizing by location (e.g., by ward), and (4) small available sample size. Each of these reasons is discussed below.
Ethical considerations typically will not allow random withholding of an intervention with known efficacy. Thus, if the efficacy of an intervention has not been established, a randomized controlled trial is the design of choice to determine efficacy. But if the intervention under study incorporates an accepted, well-established therapeutic intervention, or if the intervention has either questionable efficacy or safety based on previously conducted studies, then the ethical issues of randomizing patients are sometimes raised. In the area of medical informatics, it is often believed prior to an implementation that an informatics intervention will likely be beneficial and thus medical informaticians and hospital administrators are often reluctant to randomize medical informatics interventions. In addition, there is often pressure to implement the intervention quickly because of its believed efficacy, thus not allowing researchers sufficient time to plan a randomized trial.
For medical informatics interventions, it is often difficult to randomize the intervention to individual patients or to individual informatics users. So while this randomization is technically possible, it is underused and thus compromises the eventual strength of concluding that an informatics intervention resulted in an outcome. For example, randomly allowing only half of medical residents to use pharmacy order-entry software at a tertiary care hospital is a scenario that hospital administrators and informatics users may not agree to for numerous reasons.
Similarly, informatics interventions often cannot be randomized to individual locations. Using the pharmacy order-entry system example, it may be difficult to randomize use of the system to only certain locations in a hospital or portions of certain locations. For example, if the pharmacy order-entry system involves an educational component, then people may apply the knowledge learned to nonintervention wards, thereby potentially masking the true effect of the intervention. When a design using randomized locations is employed successfully, the locations may be different in other respects (confounding variables), and this further complicates the analysis and interpretation.
In situations where it is known that only a small sample size will be available to test the efficacy of an intervention, randomization may not be a viable option. Randomization is beneficial because on average it tends to evenly distribute both known and unknown confounding variables between the intervention and control group. However, when the sample size is small, randomization may not adequately accomplish this balance. Thus, alternative design and analytical methods are often used in place of randomization when only small sample sizes are available.
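To illustrate this point, the short simulation below (with purely hypothetical numbers) estimates how far apart two randomized arms typically end up on a standardized baseline covariate at different sample sizes; the imbalance shrinks as the sample grows but can be substantial in small studies.

```python
# Minimal sketch: how well randomization balances a baseline covariate
# at different sample sizes. Purely illustrative simulation.
import numpy as np

rng = np.random.default_rng(1)

def mean_imbalance(n, reps=2000):
    """Average absolute difference in a standardized covariate between the two arms."""
    diffs = []
    for _ in range(reps):
        covariate = rng.normal(size=n)
        arm = rng.permutation(np.repeat([0, 1], n // 2))   # 1:1 randomization
        diffs.append(abs(covariate[arm == 1].mean() - covariate[arm == 0].mean()))
    return np.mean(diffs)

for n in (20, 100, 1000):
    print(f"n = {n:5d}: typical between-arm imbalance = {mean_imbalance(n):.2f} SD")
```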
What Are the Threats to Establishing Causality When Using Quasi-experimental Designs in Medical Informatics?
The lack of random assignment is the major weakness of the quasi-experimental study design. Associations identified in quasi-experiments meet one important requirement of causality since the intervention precedes the measurement of the outcome. Another requirement is that the outcome can be demonstrated to vary statistically with the intervention. Unfortunately, statistical association does not imply causality, especially if the study is poorly designed. Thus, in many quasi-experiments, one is most often left with the question: “Are there alternative explanations for the apparent causal association?” If these alternative explanations are credible, then the evidence of causation is less convincing. These rival hypotheses, or alternative explanations, arise from principles of epidemiologic study design.
Shadish et al. 4 describe nine threats to internal validity, summarized in the table below. Internal validity is defined as the degree to which observed changes in outcomes can be correctly inferred to be caused by an exposure or an intervention. In quasi-experimental studies of medical informatics, we believe that the methodological principles that most often result in alternative explanations for the apparent causal effect include (a) difficulty in measuring or controlling for important confounding variables, particularly unmeasured confounding variables, which can be viewed as a subset of the selection threat in the table; and (b) results being explained by the statistical principle of regression to the mean. Each of these latter two principles is discussed in turn.
Table: Threats to Internal Validity (adapted from Shadish et al. 4 ).
An inability to sufficiently control for important confounding variables arises from the lack of randomization. A variable is a confounding variable if it is associated with the exposure of interest and is also associated with the outcome of interest; the confounding variable leads to a situation where a causal association between a given exposure and an outcome is observed as a result of the influence of the confounding variable. For example, in a study aiming to demonstrate that the introduction of a pharmacy order-entry system led to lower pharmacy costs, there are a number of important potential confounding variables (e.g., severity of illness of the patients, knowledge and experience of the software users, other changes in hospital policy) that may have differed in the preintervention and postintervention time periods (see the figure below). In a multivariable regression, the first confounding variable could be addressed with severity of illness measures, but the second confounding variable would be difficult if not nearly impossible to measure and control. In addition, potential confounding variables that are unmeasured or immeasurable cannot be controlled for in nonrandomized quasi-experimental study designs and can only be properly controlled by the randomization process in randomized controlled trials.
Figure: Example of confounding. To get the true effect of the intervention of interest, we need to control for the confounding variable.
Another important threat to establishing causality is regression to the mean. 12 , 13 , 14 This widespread statistical phenomenon can result in wrongly concluding that an effect is due to the intervention when in reality it is due to chance. The phenomenon was first described in 1886 by Francis Galton who measured the adult height of children and their parents. He noted that when the average height of the parents was greater than the mean of the population, the children tended to be shorter than their parents, and conversely, when the average height of the parents was shorter than the population mean, the children tended to be taller than their parents.
In medical informatics, what often triggers the development and implementation of an intervention is a rise in the rate above the mean or norm. For example, increasing pharmacy costs and adverse events may prompt hospital informatics personnel to design and implement pharmacy order-entry systems. If this rise in costs or adverse events is really just an extreme observation that is still within the normal range of the hospital's pharmaceutical costs (i.e., the mean pharmaceutical cost for the hospital has not shifted), then the statistical principle of regression to the mean predicts that these elevated rates will tend to decline even without intervention. However, often informatics personnel and hospital administrators cannot wait passively for this decline to occur. Therefore, hospital personnel often implement one or more interventions, and if a decline in the rate occurs, they may mistakenly conclude that the decline is causally related to the intervention. In fact, an alternative explanation for the finding could be regression to the mean.
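The following simulation (with arbitrary, invented cost figures) illustrates the mechanism: when an "intervention" is triggered only in months with unusually high observed costs, the subsequent months look better even though nothing about the underlying process has changed.

```python
# Minimal sketch: regression to the mean when an intervention is triggered
# by an unusually high observation. No intervention effect is simulated.
import numpy as np

rng = np.random.default_rng(2)
true_mean = 100.0                       # e.g. stable monthly pharmacy cost (arbitrary units)
noise_sd = 10.0

before = true_mean + noise_sd * rng.normal(size=10000)
after = true_mean + noise_sd * rng.normal(size=10000)   # same process, nothing changed

trigger = before > true_mean + noise_sd                 # "costs look high, intervene"
print("mean before, in triggered months:", before[trigger].mean())
print("mean after, in those same months:", after[trigger].mean())
# The 'after' mean falls back toward 100 even though no intervention was applied.
```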
What Are the Different Quasi-experimental Study Designs?
In the social sciences literature, quasi-experimental studies are divided into four study design groups 4 , 6 :
Quasi-experimental designs without control groups
Quasi-experimental designs that use control groups but no pretest
Quasi-experimental designs that use control groups and pretests
Interrupted time-series designs
There is a relative hierarchy within these categories of study designs, with category D studies being sounder than categories C, B, or A in terms of establishing causality. Thus, if feasible from a design and implementation point of view, investigators should aim to design studies that fall into the higher-rated categories. Shadish et al. 4 discuss 17 possible designs, with seven designs in category A, three in category B, six in category C, and one major design in category D. In our review, we determined that most medical informatics quasi-experiments could be characterized by 11 of these 17 designs (six in category A, one in category B, three in category C, and one in category D), because the other study designs were not used or were not feasible in the medical informatics literature. Thus, for simplicity, we have summarized the 11 study designs most relevant to medical informatics research in the table below.
Table: Relative Hierarchy of Quasi-experimental Designs (O = observational measurement; X = intervention under study; time moves from left to right).
In general, studies in category D are of higher study design quality than studies in category C, which are higher than those in category B, which are higher than those in category A. Also, as one moves down within each category, the studies become of higher quality, e.g., study 5 in category A is of higher study design quality than study 4, etc.
The nomenclature and relative hierarchy were used in the systematic review of four years of JAMIA and the IJMI. As with the hierarchy used in the evidence-based literature to rank randomized controlled trials, cohort studies, case-control studies, and case series, the hierarchy in the table is not absolute: in some cases, it may be infeasible to perform a higher level study. For example, there may be instances where an A6 design established stronger causality than a B1 design. 15 , 16 , 17
Quasi-experimental Designs without Control Groups
Here, X is the intervention and O is the outcome variable (this notation is continued throughout the article). In this study design, an intervention (X) is implemented and a posttest observation (O1) is taken. For example, X could be the introduction of a pharmacy order-entry intervention and O1 could be the pharmacy costs following the intervention. This design is the weakest of the quasi-experimental designs that are discussed in this article. Without any pretest observations or a control group, there are multiple threats to internal validity. Unfortunately, this study design is often used in medical informatics when new software is introduced since it may be difficult to have pretest measurements due to time, technical, or cost constraints.
This is a commonly used study design. A single pretest measurement is taken (O1), an intervention (X) is implemented, and a posttest measurement is taken (O2). In this instance, period O1 frequently serves as the “control” period. For example, O1 could be pharmacy costs prior to the intervention, X could be the introduction of a pharmacy order-entry system, and O2 could be the pharmacy costs following the intervention. Including a pretest provides some information about what the pharmacy costs would have been had the intervention not occurred.
The advantage of this study design over A2 is that adding a second pretest prior to the intervention helps provide evidence that can be used to refute the phenomenon of regression to the mean and confounding as alternative explanations for any observed association between the intervention and the posttest outcome. For example, in a study where a pharmacy order-entry system led to lower pharmacy costs (O3 < O2 and O1), if one had two preintervention measurements of pharmacy costs (O1 and O2) and they were both elevated, this would suggest that there was a decreased likelihood that O3 is lower due to confounding and regression to the mean. Similarly, extending this study design by increasing the number of measurements postintervention could also help to provide evidence against confounding and regression to the mean as alternate explanations for observed associations.
This design involves the inclusion of a nonequivalent dependent variable ( b ) in addition to the primary dependent variable ( a ). Variables a and b should assess similar constructs; that is, the two measures should be affected by similar factors and confounding variables except for the effect of the intervention. Variable a is expected to change because of the intervention X, whereas variable b is not. Taking our example, variable a could be pharmacy costs and variable b could be the length of stay of patients. If our informatics intervention is aimed at decreasing pharmacy costs, we would expect to observe a decrease in pharmacy costs but not in the average length of stay of patients. However, a number of important confounding variables, such as severity of illness and knowledge of software users, might affect both outcome measures. Thus, if the average length of stay did not change following the intervention but pharmacy costs did, then the data are more convincing than if just pharmacy costs were measured.
The Removed-Treatment Design
This design adds a third posttest measurement (O3) to the one-group pretest-posttest design and then removes the intervention before a final measure (O4) is made. The advantage of this design is that it allows one to test hypotheses about the outcome in the presence of the intervention and in the absence of the intervention. Thus, if one predicts a decrease in the outcome between O1 and O2 (after implementation of the intervention), then one would predict an increase in the outcome between O3 and O4 (after removal of the intervention). One caveat is that if the intervention is thought to have persistent effects, then O4 needs to be measured after these effects are likely to have disappeared. For example, a study would be more convincing if it demonstrated that pharmacy costs decreased after pharmacy order-entry system introduction (O2 and O3 less than O1) and that when the order-entry system was removed or disabled, the costs increased (O4 greater than O2 and O3 and closer to O1). In addition, there are often ethical issues in this design in terms of removing an intervention that may be providing benefit.
The Repeated-Treatment Design
The advantage of this design is that it demonstrates reproducibility of the association between the intervention and the outcome. For example, the association is more likely to be causal if one demonstrates that a pharmacy order-entry system results in decreased pharmacy costs when it is first introduced and again when it is reintroduced following an interruption of the intervention. As for design A5, the assumption must be made that the effect of the intervention is transient, which is most often applicable to medical informatics interventions. Because subjects may serve as their own controls in this design, it may yield greater statistical efficiency with fewer subjects.
Quasi-experimental Designs That Use a Control Group but No Pretest
An intervention X is implemented for one group and compared to a second group. The use of a comparison group helps guard against certain threats to validity and allows statistical adjustment for confounding variables. Because the two groups in this study design may not be equivalent (assignment to the groups is not by randomization), confounding may exist. For example, suppose that a pharmacy order-entry intervention was instituted in the medical intensive care unit (MICU) and not the surgical intensive care unit (SICU). O1 would be pharmacy costs in the MICU after the intervention and O2 would be pharmacy costs in the SICU after the intervention. The absence of a pretest makes it difficult to know whether a change has occurred in the MICU. Also, the absence of pretest measurements comparing the SICU to the MICU makes it difficult to know whether differences in O1 and O2 are due to the intervention or due to other differences in the two units (confounding variables).
Quasi-experimental Designs That Use Control Groups and Pretests
The reader should note that with all the studies in this category, the intervention is not randomized. The control groups chosen are comparison groups. Obtaining pretest measurements on both the intervention and control groups allows one to assess the initial comparability of the groups. The assumption is that the more similar the intervention and control groups are at pretest, the smaller the likelihood that important confounding variables differ between the two groups.
The use of both a pretest and a comparison group makes it easier to avoid certain threats to validity. However, because the two groups are nonequivalent (assignment to the groups is not by randomization), selection bias may exist. Selection bias exists when selection results in differences in unit characteristics between conditions that may be related to outcome differences. For example, suppose that a pharmacy order-entry intervention was instituted in the MICU and not the SICU. If preintervention pharmacy costs in the MICU (O1a) and SICU (O1b) are similar, it suggests that it is less likely that there are differences in the important confounding variables between the two units. If MICU postintervention costs (O2a) are less than preintervention MICU costs (O1a), but SICU costs (O1b) and (O2b) are similar, this suggests that the observed outcome may be causally related to the intervention.
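The reasoning in this design can be expressed as a simple difference-in-differences contrast between the change in the intervention unit and the change in the comparison unit. The MICU/SICU cost figures in the sketch below are invented purely to show the arithmetic; they are not taken from any study.

```python
# Minimal sketch: a difference-in-differences style contrast for a
# pretest-posttest design with a nonequivalent control group.
# All cost figures are hypothetical.
micu_pre, micu_post = 120.0, 95.0    # intervention unit (O1a, O2a)
sicu_pre, sicu_post = 118.0, 117.0   # comparison unit (O1b, O2b)

change_intervention = micu_post - micu_pre
change_control = sicu_post - sicu_pre
did_estimate = change_intervention - change_control
print("estimated intervention effect on costs:", did_estimate)
# A similar pre-period in both units makes confounding a less plausible explanation.
```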
In this design, the pretests are administered at two different times. The main advantage of this design is that it controls for potentially different time-varying confounding effects in the intervention group and the comparison group. In our example, measuring at points O1 and O2 would allow assessment of preintervention time-dependent changes in pharmacy costs (e.g., due to differences in the experience of residents) in both the intervention and control groups, and of whether these changes were similar or different.
With this study design, the researcher administers an intervention at a later time to a group that initially served as a nonintervention control. The advantage of this design over design C2 is that it demonstrates reproducibility in two different settings. This study design is not limited to two groups; in fact, the study results have greater validity if the intervention effect is replicated in different groups at multiple times. In the example of a pharmacy order-entry system, one could intervene in the MICU first and then, at a later time, in the SICU. This design is well suited to medical informatics, where new technology and software are often introduced or made available gradually.
Interrupted Time-Series Designs
An interrupted time-series design is one in which a string of consecutive observations equally spaced in time is interrupted by the imposition of a treatment or intervention. The advantage of this design is that with multiple measurements both pre- and postintervention, it is easier to address and control for confounding and regression to the mean. In addition, the analysis is statistically more robust, with the ability to detect changes in the slope or intercept as a result of the intervention in addition to a change in the mean values. 18 A change in intercept could represent an immediate effect while a change in slope could represent a gradual effect of the intervention on the outcome. In the example of a pharmacy order-entry system, O1 through O5 could represent monthly pharmacy costs preintervention and O6 through O10 monthly pharmacy costs after the introduction of the pharmacy order-entry system. Interrupted time-series designs also can be further strengthened by incorporating many of the design features previously mentioned in other categories (such as removal of the treatment, inclusion of a nonequivalent dependent variable, or the addition of a control group).
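A common analysis for this design is segmented regression (Wagner et al., reference 18), which estimates a baseline trend, an immediate level change and a change in slope at the point of intervention. The sketch below uses an invented ten-month cost series and ignores autocorrelation, which a full analysis would need to address.

```python
# Minimal sketch: segmented regression for an interrupted time series.
# The monthly pharmacy cost series is invented for illustration.
import numpy as np
import statsmodels.api as sm

costs = np.array([102, 104, 101, 105, 103,    # five months pre-intervention
                   95,  93,  90,  88,  85])   # five months post-intervention
months = np.arange(1, 11)
post = (months > 5).astype(int)                # 1 after the intervention starts
months_since = np.where(post == 1, months - 5, 0)

X = sm.add_constant(np.column_stack([months, post, months_since]))
fit = sm.OLS(costs, X).fit()
# fit.params: intercept, pre-intervention slope, immediate level change, change in slope.
print(fit.params)
```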
Systematic Review Results
The results of the systematic review are summarized in the table below. In the four-year period of JAMIA publications that the authors reviewed, 25 quasi-experimental studies among 22 articles were published. Of these 25, 15 studies were of category A, five studies were of category B, two studies were of category C, and no studies were of category D. Although there were no studies of category D (interrupted time-series analyses), three of the studies classified as category A had data collected that could have been analyzed as an interrupted time-series analysis. Nine of the 25 studies (36%) mentioned at least one of the potential limitations of the quasi-experimental study design. In the four-year period of IJMI publications reviewed by the authors, nine quasi-experimental studies among eight manuscripts were published. Of these nine, five studies were of category A, one of category B, one of category C, and two of category D. Two of the nine studies (22%) mentioned at least one of the potential limitations of the quasi-experimental study design.
Table: Systematic Review of Four Years of Quasi-designs in JAMIA and IJMI (JAMIA = Journal of the American Medical Informatics Association; IJMI = International Journal of Medical Informatics). Table footnote: marked studies could have been analyzed as an interrupted time-series design.
In addition, three studies from JAMIA were based on a counterbalanced design. A counterbalanced design is a higher order study design than other studies in category A. The counterbalanced design is sometimes referred to as a Latin-square arrangement. In this design, all subjects receive all the different interventions but the order of intervention assignment is not random. 19 This design can only be used when the intervention is compared against some existing standard, for example, if a new PDA-based order entry system is to be compared to a computer terminal–based order entry system. In this design, all subjects receive the new PDA-based order entry system and the old computer terminal-based order entry system. The counterbalanced design is a within-participants design, where the order of the intervention is varied (e.g., one group is given software A followed by software B and another group is given software B followed by software A). The counterbalanced design is typically used when the available sample size is small, thus preventing the use of randomization. This design also allows investigators to study the potential effect of ordering of the informatics intervention.
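The order assignments in a counterbalanced design can be generated from a simple cyclic Latin square, as in the sketch below; the two system labels are hypothetical stand-ins for the PDA-based and terminal-based order entry systems in the example above.

```python
# Minimal sketch: a cyclic Latin square giving counterbalanced intervention orders.
# The system labels are hypothetical stand-ins for the example in the text.
systems = ["PDA order entry", "terminal order entry"]

def latin_square(items):
    """Each row gives the order of interventions for one group of participants."""
    k = len(items)
    return [[items[(row + col) % k] for col in range(k)] for row in range(k)]

for group, order in enumerate(latin_square(systems), start=1):
    print(f"group {group}: {' -> '.join(order)}")
```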
Although quasi-experimental study designs are ubiquitous in the medical informatics literature, as evidenced by 34 studies in the past four years of the two informatics journals, little has been written about the benefits and limitations of the quasi-experimental approach. As we have outlined in this paper, a relative hierarchy and nomenclature of quasi-experimental study designs exist, with some designs being more likely than others to permit causal interpretations of observed associations. Strengths and limitations of a particular study design should be discussed when presenting data collected in the setting of a quasi-experimental study. Future medical informatics investigators should choose the strongest design that is feasible given the particular circumstances.
Acknowledgments
Dr. Harris was supported by NIH grants K23 AI01752-01A1 and R01 AI60859-01A1. Dr. Perencevich was supported by a VA Health Services Research and Development Service (HSR&D) Research Career Development Award (RCD-02026-1). Dr. Finkelstein was supported by NIH grant RO1 HL71690.
- 1. Rothman KJ, Greenland S. Modern epidemiology. Philadelphia: Lippincott–Raven Publishers, 1998.
- 2. Hennekens CH, Buring JE. Epidemiology in medicine. Boston: Little, Brown, 1987.
- 3. Szklo M, Nieto FJ. Epidemiology: beyond the basics. Gaithersburg, MD: Aspen Publishers, 2000.
- 4. Shadish WR, Cook TD, Campbell DT. Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin, 2002.
- 5. Trochim WMK. The research methods knowledge base. Cincinnati: Atomic Dog Publishing, 2001.
- 6. Cook TD, Campbell DT. Quasi-experimentation: design and analysis issues for field settings. Chicago: Rand McNally Publishing Company, 1979.
- 7. MacLehose RR, Reeves BC, Harvey IM, Sheldon TA, Russell IT, Black AM. A systematic review of comparisons of effect sizes derived from randomised and non-randomised studies. Health Technol Assess. 2000;4:1–154.
- 8. Shadish WR, Heinsman DT. Experiments versus quasi-experiments: do they yield the same answer? NIDA Res Monogr. 1997;170:147–64.
- 9. Grimshaw J, Campbell M, Eccles M, Steen N. Experimental and quasi-experimental designs for evaluating guideline implementation strategies. Fam Pract. 2000;17(Suppl 1):S11–6.
- 10. Zwerling C, Daltroy LH, Fine LJ, Johnston JJ, Melius J, Silverstein BA. Design and conduct of occupational injury intervention studies: a review of evaluation strategies. Am J Ind Med. 1997;32:164–79.
- 11. Haux RKC, editor. Yearbook of medical informatics 2005. Stuttgart: Schattauer Verlagsgesellschaft, 2005, 563.
- 12. Morton V, Torgerson DJ. Effect of regression to the mean on decision making in health care. BMJ. 2003;326:1083–4.
- 13. Bland JM, Altman DG. Regression towards the mean. BMJ. 1994;308:1499.
- 14. Bland JM, Altman DG. Some examples of regression towards the mean. BMJ. 1994;309:780.
- 15. Guyatt GH, Haynes RB, Jaeschke RZ, Cook DJ, Green L, Naylor CD, et al. Users' guides to the medical literature: XXV. Evidence-based medicine: principles for applying the users' guides to patient care. Evidence-Based Medicine Working Group. JAMA. 2000;284:1290–6.
- 16. Harris RP, Helfand M, Woolf SH, Lohr KN, Mulrow CD, Teutsch SM, et al. Current methods of the US Preventive Services Task Force: a review of the process. Am J Prev Med. 2001;20:21–35.
- 17. Harbour R, Miller J. A new system for grading recommendations in evidence based guidelines. BMJ. 2001;323:334–6.
- 18. Wagner AK, Soumerai SB, Zhang F, Ross-Degnan D. Segmented regression analysis of interrupted time series studies in medication use research. J Clin Pharm Ther. 2002;27:299–309.
- 19. Campbell DT. Counterbalanced design. In: Experimental and quasiexperimental designs for research. Chicago: Rand McNally College Publishing Company, 1963:50–5.
- 20. Staggers N, Kobus D. Comparing response time, errors, and satisfaction between text-based and graphical user interfaces during nursing order tasks. J Am Med Inform Assoc. 2000;7:164–76.
- 21. Schriger DL, Baraff LJ, Buller K, Shendrikar MA, Nagda S, Lin EJ, et al. Implementation of clinical guidelines via a computer charting system: effect on the care of febrile children less than three years of age. J Am Med Inform Assoc. 2000;7:186–95.
- 22. Patel VL, Kushniruk AW, Yang S, Yale JF. Impact of a computer-based patient record system on data collection, knowledge organization, and reasoning. J Am Med Inform Assoc. 2000;7:569–85.
- 23. Borowitz SM. Computer-based speech recognition as an alternative to medical transcription. J Am Med Inform Assoc. 2001;8:101–2.
- 24. Patterson R, Harasym P. Educational instruction on a hospital information system for medical students during their surgical rotations. J Am Med Inform Assoc. 2001;8:111–6.
- 25. Rocha BH, Christenson JC, Evans RS, Gardner RM. Clinicians' response to computerized detection of infections. J Am Med Inform Assoc. 2001;8:117–25.
- 26. Lovis C, Chapko MK, Martin DP, Payne TH, Baud RH, Hoey PJ, et al. Evaluation of a command-line parser-based order entry pathway for the Department of Veterans Affairs electronic patient record. J Am Med Inform Assoc. 2001;8:486–98.
- 27. Hersh WR, Junium K, Mailhot M, Tidmarsh P. Implementation and evaluation of a medical informatics distance education program. J Am Med Inform Assoc. 2001;8:570–84.
- 28. Makoul G, Curry RH, Tang PC. The use of electronic medical records: communication patterns in outpatient encounters. J Am Med Inform Assoc. 2001;8:610–5.
- 29. Ruland CM. Handheld technology to improve patient care: evaluating a support system for preference-based care planning at the bedside. J Am Med Inform Assoc. 2002;9:192–201.
- 30. De Lusignan S, Stephens PN, Adal N, Majeed A. Does feedback improve the quality of computerized medical records in primary care? J Am Med Inform Assoc. 2002;9:395–401.
- 31. Mekhjian HS, Kumar RR, Kuehn L, Bentley TD, Teater P, Thomas A, et al. Immediate benefits realized following implementation of physician order entry at an academic medical center. J Am Med Inform Assoc. 2002;9:529–39.
- 32. Ammenwerth E, Mansmann U, Iller C, Eichstadter R. Factors affecting and affected by user acceptance of computer-based nursing documentation: results of a two-year study. J Am Med Inform Assoc. 2003;10:69–84.
- 33. Oniki TA, Clemmer TP, Pryor TA. The effect of computer-generated reminders on charting deficiencies in the ICU. J Am Med Inform Assoc. 2003;10:177–87.
- 34. Liederman EM, Morefield CS. Web messaging: a new tool for patient-physician communication. J Am Med Inform Assoc. 2003;10:260–70.
- 35. Rotich JK, Hannan TJ, Smith FE, Bii J, Odero WW, Vu N, Mamlin BW, et al. Installing and implementing a computer-based patient record system in sub-Saharan Africa: the Mosoriot Medical Record System. J Am Med Inform Assoc. 2003;10:295–303.
- 36. Payne TH, Hoey PJ, Nichol P, Lovis C. Preparation and use of preconstructed orders, order sets, and order menus in a computerized provider order entry system. J Am Med Inform Assoc. 2003;10:322–9.
- 37. Hoch I, Heymann AD, Kurman I, Valinsky LJ, Chodick G, Shalev V. Countrywide computer alerts to community physicians improve potassium testing in patients receiving diuretics. J Am Med Inform Assoc. 2003;10:541–6.
- 38. Laerum H, Karlsen TH, Faxvaag A. Effects of scanning and eliminating paper-based medical records on hospital physicians' clinical work practice. J Am Med Inform Assoc. 2003;10:588–95.
- 39. Devine EG, Gaehde SA, Curtis AC. Comparative evaluation of three continuous speech recognition software packages in the generation of medical reports. J Am Med Inform Assoc. 2000;7:462–8.
- 40. Dunbar PJ, Madigan D, Grohskopf LA, Revere D, Woodward J, Minstrell J, et al. A two-way messaging system to enhance antiretroviral adherence. J Am Med Inform Assoc. 2003;10:11–5.
- 41. Lenert L, Munoz RF, Stoddard J, Delucchi K, Bansod A, Skoczen S, et al. Design and pilot evaluation of an Internet smoking cessation program. J Am Med Inform Assoc. 2003;10:16–20.
- 42. Koide D, Ohe K, Ross-Degnan D, Kaihara S. Computerized reminders to monitor liver function to improve the use of etretinate. Int J Med Inf. 2000;57:11–9.
- 43. Gonzalez-Heydrich J, DeMaso DR, Irwin C, Steingard RJ, Kohane IS, Beardslee WR. Implementation of an electronic medical record system in a pediatric psychopharmacology program. Int J Med Inf. 2000;57:109–16.
- 44. Anantharaman V, Swee Han L. Hospital and emergency ambulance link: using IT to enhance emergency pre-hospital care. Int J Med Inf. 2001;61:147–61.
- 45. Chae YM, Heon Lee J, Hee Ho S, Ja Kim H, Hong Jun K, Uk Won J. Patient satisfaction with telemedicine in home health services for the elderly. Int J Med Inf. 2001;61:167–73.
- 46. Lin CC, Chen HS, Chen CY, Hou SM. Implementation and evaluation of a multifunctional telemedicine system in NTUH. Int J Med Inf. 2001;61:175–87.
- 47. Mikulich VJ, Liu YC, Steinfeldt J, Schriger DL. Implementation of clinical guidelines through an electronic medical record: physician usage, satisfaction and assessment. Int J Med Inf. 2001;63:169–78.
- 48. Hwang JI, Park HA, Bakken S. Impact of a physician's order entry (POE) system on physicians' ordering patterns and patient length of stay. Int J Med Inf. 2002;65:213–23.
- 49. Park WS, Kim JS, Chae YM, Yu SH, Kim CY, Kim SA, et al. Does the physician order-entry system increase the revenue of a general hospital? Int J Med Inf. 2003;71:25–32.
13.1.2 Why consider non-randomized studies?
The Cochrane Collaboration focuses particularly on systematic reviews of randomized trials because they are more likely than other study designs to provide unbiased information about the differential effects of alternative forms of health care. Reviews of non-randomized studies (NRS) are only likely to be undertaken when the question of interest cannot be answered by a review of randomized trials. The Non-Randomised Studies Methods Group (NRSMG) believes that review authors may be justified in including NRS that are moderately susceptible to bias. Broadly, the NRSMG considers that there are three main reasons for including NRS in a Cochrane review:
a) To examine the case for undertaking a randomized trial by providing an explicit evaluation of the weaknesses of available NRS. The findings of a review of NRS may also be useful to inform the design of a subsequent randomized trial, e.g. through the identification of relevant subgroups.
b) To provide evidence of the effects (benefit or harm) of interventions that cannot be randomized, or which are extremely unlikely to be studied in randomized trials. In these contexts, a disinterested (free from bias and partiality) review that systematically reports the findings and limitations of available NRS can be useful.
c) To provide evidence of effects (benefit or harm) that cannot be adequately studied in randomized trials, such as long-term and rare outcomes, or outcomes that were not known to be important when existing, major randomized trials were conducted.
Three other reasons are often cited in support of systematic reviews of NRS but are poor justifications:
d) Studying effects in patient groups not recruited to randomized trials (such as children, pregnant women, the elderly). Although it is important to consider whether the results of trials can be generalized to people who are excluded from them, it is not clear that this can be achieved by consideration of non-randomized studies. Regardless of whether estimates from NRS agree or disagree with those of randomized trials, there is always potential for bias in the results of the NRS, such that misleading conclusions are drawn.
e) To supplement existing randomized trial evidence. Adding non-randomized to randomized evidence may change an imprecise but unbiased estimate into a precise but biased estimate, i.e. an exchange of undesirable uncertainty for unacceptable error.
f) When an intervention effect is really large. Implicitly, this is a result-driven or post hoc justification, since the review (or some other synthesis of the evidence) needs to be undertaken before the likely size of the effects can be observed. Whilst it is easier to argue that large effects are less likely than small effects to be completely explained by bias (Glasziou 2007), for the practice of health care it is still important to obtain unbiased estimates of the magnitude of large effects in order to make clinical and economic decisions (Reeves 2006). Thus randomized trials are still needed for large effects (and such trials need not be large if the effects are truly large). There may be ethical opposition to randomized trials of interventions already suspected, on the basis of a systematic review of NRS, to be associated with a large benefit, making it difficult to randomize participants; interventions postulated to have large effects may also be difficult to randomize for other reasons (e.g., surgery vs. no surgery). However, the justification for a systematic review of NRS in these circumstances should be classified as (b), i.e. interventions that are unlikely to be randomized, rather than as (f).
Can Non-Randomised Studies of Interventions Provide Unbiased Effect Estimates? A Systematic Review of Internal Replication Studies
Hugh Sharma Waddington, PhD, MA, BSc; Paul Fenton Villar, PhD, MSc, BSc; Jeffrey C. Valentine, PhD, MA, BA
Hugh Sharma Waddington, London School of Hygiene and Tropical Medicine, London International Development Centre, 20 Bloomsbury square, London WC1A 2NS, UK. Email: [email protected]
Issue date 2023 Jun.
This article is distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 License ( https://creativecommons.org/licenses/by-nc/4.0/ ) which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access page ( https://us.sagepub.com/en-us/nam/open-access-at-sage ).
Non-randomized studies of intervention effects (NRS), also called quasi-experiments, provide useful decision support about development impacts. However, the assumptions underpinning them are usually untestable, their verification resting on empirical replication. The internal replication study aims to do this by comparing results from a causal benchmark study, usually a randomized controlled trial (RCT), with those from an NRS conducted at the same time in the sampled population. We aimed to determine the credibility and generalizability of findings in internal replication studies in development economics, through a systematic review and meta-analysis. We systematically searched for internal replication studies of RCTs conducted on socioeconomic interventions in low- and middle-income countries. We critically appraised the benchmark randomized studies, using an adapted tool. We extracted and statistically synthesized empirical measures of bias. We included 600 estimates of correspondence between NRS and benchmark RCTs. All internal replication studies were found to have at least “some concerns” about bias and some had high risk of bias. We found that study designs with selection on unobservables, in particular regression discontinuity, on average produced absolute standardized bias estimates that were approximately zero, that is, equivalent to the estimates produced by RCTs. But study conduct also mattered. For example, matching using pre-tests and nearest neighbor algorithms corresponded more closely to the benchmarks. The findings from this systematic review confirm that NRS can produce unbiased estimates. Authors of internal replication studies should publish pre-analysis protocols to enhance their credibility.
Keywords: design replication, internal replication, meta-analysis, non-randomized study of interventions, quasi-experimental design, systematic review, within-study comparison
Introduction
In the past few decades there has been an explosion in the number of randomized controlled trials (RCTs) of development interventions, overall ( Sabet & Brown, 2018 ) and in specific sectors like water, sanitation and hygiene ( Chirgwin et al., 2021 ), and governance ( Phillips et al., 2017 ). However, some types of relationships are not amenable to randomized assignment, for example, where program eligibility is universal or implementation has already begun, or where the primary measure of interest is an exposure, like use, rather than assignment to an intervention. In addition, some types of outcomes are difficult to measure in prospective studies for ethical reasons (e.g., death in childhood). Contamination of controls threatens internal validity in trials where measurement requires long follow-up periods. When effect sizes are small, it may be difficult to design studies prospectively to detect them (e.g., Bloom et al., 2008 ). There is interest in estimating causal effect magnitudes in all of these cases.
Theory is clear that under the right conditions—specifically that the selection process is completely known and has been perfectly measured 1 —non-randomized studies of intervention effects, also called quasi-experiments, can produce unbiased treatment effect estimates. It follows that if the selection process is reasonably well understood and measured, NRS should produce results that are reasonably close to those that would have been produced in a randomized experiment. The question is the extent to which this actually happens. To assess this, empirical studies of bias compare non-randomized study findings with those of a benchmark study, usually a randomized controlled trial (RCT) that is assumed to provide unbiased estimates. One type of benchmark study involves within-study comparison, or internal replication, in which the randomized and non-randomized estimates are drawn from the same population.
Internal replication studies on social science topics abound: we estimated there to be 133 such studies at the time our searches were completed. However, one needs to be careful when evaluating this literature because the studies may contain inherent biases. Researchers are not usually blinded to findings from the benchmark study and may therefore be influenced by those findings in specification searches. Measures of bias are also confounded where different treatment effect estimands, representing different population samples, are used in the benchmark and the NRS. Systematic review and meta-analysis can help alleviate these concerns through systematic searches and screening of all relevant studies to avoid cherry-picking of findings, critical appraisal to assess risk of bias, and statistical synthesis to increase precision around estimates of bias which, when studies are well designed and conducted, should be close to zero.
This paper presents the results of a systematic review of internal replication studies of economic and social programs in low- and middle-income countries (L&MICs). To our knowledge, it is the first review of these studies to use methods to identify, critically appraise, and synthesize evidence to systematic review standards ( Campbell Collaboration, 2021 ). Section 2 presents the background and the systematic review approach. Section 3 presents the results of risk-of-bias assessment and quantitative estimates of bias using meta-analysis. Section 4 concludes.
Replication Study Design
Empirical approaches assess bias by comparing a given NRS estimator with an unbiased, causal benchmark estimator, usually an estimate produced by a well-conducted RCT ( Bloom et al., 2002 ). One approach uses “cross-study” comparison (or external replication) of effect sizes from studies that are selected using systematic search methods and pooled using meta-analysis (e.g., Sterne et al., 2002 ; Vivalt, 2020 ). Cross-study comparisons are indirect as they use different underlying sample populations. They may therefore be subject to confounding due to context, intervention, comparator, participant group, and so on. Another approach is the “internal replication study” ( Cook et al., 2008 ) or “design replication study” ( Wong & Steiner, 2018 ). Like cross-study comparisons, these compare a particular estimator, usually a non-randomized comparison group, with a causal benchmark, usually an RCT, which is assumed to provide an unbiased estimate. However, the comparison arm used in the NRS may come from the same study, or data collection at the same time among the target population, hence they are also called “within-study comparisons” ( Bloom et al., 2002 ; Glazerman et al., 2003 ). They have been undertaken in the social sciences since Lalonde (1986) . A number of literature reviews of these studies exist ( Glazerman et al., 2003 ; Cook et al., 2008 ; Hansen et al., 2013 ; Wong et al., 2017 ; Chaplin et al., 2018 ; Villar & Waddington, 2019 ).
There are different ways of doing internal replication studies ( Wong & Steiner, 2018 ), the most commonly used—including all of the examples from development economics—being “simultaneous design.” In these studies, a non-equivalent comparison group is created, the mean of which is compared to the mean of the control group in the RCT. 2 In a standard simultaneous design, the NRS uses administrative data or an observational study from a sample of the population that did not participate in the RCT (e.g., Diaz & Handa, 2006 ). However, inference requires measurement of the same outcome at the same time, under the same study conditions, factors which are often difficult to satisfy ( Smith & Todd, 2005 ) unless the experiment and NRS survey instruments are designed together (e.g., McKenzie et al., 2010 ). 3
Two types of simultaneous design are used to evaluate regression discontinuity designs (RDDs). The “tie-breaker” design ( Chaplin et al., 2018 ) initially assigns clusters into the benchmark using an eligibility criterion, after which random assignment is done. Where the eligibility criterion is a threshold score, the design is used to compare observations within clusters immediately around the eligibility threshold in RDD—control observations from the RCT are compared to observations on the other side of the threshold which were ineligible for treatment (e.g., Buddelmeyer and Skoufias, 2004 ).
In “synthetic design,” the researcher simulates the RDD from existing RCT data by removing observations from the treatment and/or control arm to create non-equivalent groups. 4 For example, in cluster-RCTs in education, where schools are already using pre-test scores to assign students to remedial education, participants in remedial education from treated clusters of the RCT (which has been done to estimate the impact of a completely different intervention) are compared to those not assigned to remedial classes from control clusters ( Barrera-Osorio et al., 2014 ). In this way, the RDD is constructed by researchers, and it may be applied to any threshold assignment variable measured at pre-test ( Wong & Steiner, 2018 ). 5
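A rough sketch of this "synthetic design" construction is shown below. The column names, cutoff value, and toy data are assumptions made for illustration; they are not taken from Barrera-Osorio et al. (2014) or any other included study.

```python
# Build a synthetic RDD sample from cluster-RCT data: keep below-cutoff students
# from treated clusters (assigned to remedial classes) and above-cutoff students
# from control clusters (not assigned), with the pre-test score as running variable.
import pandas as pd

def synthetic_rdd_sample(df, cutoff):
    """df needs columns 'cluster_treated' (0/1) and 'pretest' plus outcomes (assumed names)."""
    treated = df[(df.cluster_treated == 1) & (df.pretest < cutoff)].copy()
    comparison = df[(df.cluster_treated == 0) & (df.pretest >= cutoff)].copy()
    treated["rdd_treat"] = 1
    comparison["rdd_treat"] = 0
    return pd.concat([treated, comparison], ignore_index=True)

demo = pd.DataFrame({"cluster_treated": [1, 1, 0, 0],
                     "pretest": [35, 55, 38, 62],
                     "outcome": [0.60, 0.70, 0.40, 0.80]})
print(synthetic_rdd_sample(demo, cutoff=40))
```

The resulting sample would then be analyzed with a local regression around the cutoff, as in any regression discontinuity design.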
Measuring Bias in Replication Studies
Bias in a particular estimate may arise from sampling error, study design and conduct (internal validity), and sampling bias (external validity) ( Greenland, 2000 ). The extent to which evidence of statistical correspondence with RCT estimates adequately represents bias in NRS findings therefore depends only partly on internal validity of the RCT and NRS. Other factors affecting correspondence, which are sometimes inappropriately assumed to represent bias, include differences in the sampled population and specification searches. 6
Regarding internal validity, Cook et al. (2008) showed that NRS in which the method of treatment assignment is known, or carefully modeled using baseline data, produced very similar findings in direct comparisons with RCTs. Glazerman et al. (2003) found that the data source, the breadth of control variables, and evidence of statistical robustness tests were related to the magnitude of estimator bias in labor economics. In education, Wong et al. (2017) found that use of baseline outcomes, the geographical proximity of treatment and comparison, and breadth of control variables were associated with less bias. They also noted that NRS, which simply relied on a set of demographic variables or prioritized local matching when local comparisons were not comparable to treated cases, rarely replicated RCT estimates. One NRS approach that produces an internally valid estimator in expectation is the regression discontinuity design ( Rubin, 1977 ). Chaplin et al. (2018) assessed the statistical correspondence of 15 internal replications comparing RDDs with RCTs at the cut-off, finding the average difference was 0.01 standard deviations. However, they warned larger samples and the choice of bandwidth may prove important in determining the degree of bias in individual study estimates. Hansen et al. (2013) noted that the difference between NRS estimates and RCTs was smaller where selection into treatment was done at the group level (hence individual participant self-selection into treatment was not the main source of variation). This finding is intuitively appealing, as group selection (by sex, age, geography, and so on) by implementers, also called “program placement bias,” may be easier to model than self-selection bias (which may be a function of individual aptitudes, capabilities and desires).
The second potential source of discrepancy between the findings of RCTs and NRS is in the effect size quantity, or estimand, due to differences in the target population in each study (external validity). For example, the correspondence between NRS and RCT may not represent bias when comparing an average treatment effect (ATE) estimate from an RCT with the average treatment effect on the treated (ATET) from a double-difference or matching study, or the local average treatment effect (LATE) from an RDD ( Cook et al., 2008 ). The intention-to-treat (ITT) estimator, on which the ATE is based in RCTs, becomes smaller as non-adherence increases, making raw comparison of the two estimators inappropriate even if both are unbiased. Similarly, when RDD is used to estimate the unbiased effect of an intervention among the population immediately around the treatment threshold, this may still differ from the RCT estimate due to heterogeneity in effects across the population receiving treatment. In other words, the interpretation of correspondence as bias may be confounded. An early review that found that NRS rarely replicated experimental estimates did not take this source of confounding into account ( Glazerman et al., 2003 ).
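A small numerical illustration of the estimand point, with made-up numbers: under one-sided non-adherence with no effect on non-adherers (the exclusion restriction), the ITT is roughly the adherence rate times the ATET, so the two estimators diverge as adherence falls even when both are unbiased for their own estimands.

```python
# ITT shrinks with non-adherence even when ATET is constant (illustrative only).
atet = 0.50                      # hypothetical effect among those actually treated
for adherence in (1.0, 0.8, 0.5):
    itt = adherence * atet       # holds under one-sided non-adherence + exclusion restriction
    print(f"adherence={adherence:.0%}  ATET={atet:.2f}  ITT={itt:.2f}")
```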
A final factor is specification searches. Cook et al. (2008) argued that, due to the potential for results-based choices in the covariates and methods used, NRS analysts should be blinded to the results of the RCT they are replicating. These biases may serve to accentuate or diminish the differences between RCT and NRS depending on the replication study authors’ priors. Thus, Fretheim et al. (2016) “concealed the results and discussion sections in the retrieved articles using 3M Post-it notes and attempted to remain blinded to the original results until after our analyses had been completed” (p.326). Where it is not possible to blind replication researchers to the RCT findings, which would usually be the case, a reasonable expectation is that the internal replication report should contain sensitivity analysis documenting differences in effects due to changes in the specification ( Hansen et al., 2013 ). An advantage of the latter approach, whether done openly or blinded, is to enable sensitivity analysis to different methods of conduct in the particular NRS.
Systematic Review Approach
Most existing reviews of internal replication studies have not been done systematically—that is, based on systematic approaches to identify and critically appraise studies and statistically synthesize effect size findings. Exceptions include a review by Wong et al. (2017) , which reported a systematic search strategy, and Glazerman et al. (2003) and Chaplin et al. (2018) , which used statistical meta-analysis of effect sizes. This systematic review was registered ( Waddington et al., 2018 ).
The eligibility criteria for inclusion in the review, alongside examples of excluded studies, are in Table 1 . Eligible benchmark studies needed to use randomized assignment, whether controlled by researchers or administrators. Eligible within-study comparisons included any non-randomized approach to estimate the effect, including approaches with selection on unobservables and those using selection on observables only. These included methods with adjustment for unobservable confounding, such as difference-in-differences, also called double-differences (DD), instrumental variables (IV), RDD, and methods adjusting for observables such as statistical matching and adjusted regression estimation of the parametric model applied to cross-section data.
Systematic Review Inclusion Criteria.
The NRS and benchmark needed to use the same treatment estimand. Where the bias estimator used the benchmark control and NRS comparison means only, data needed to be from the same sampled population. As discussed, this is important to avoid confounding. Evidence suggests that the assumption of constant treatment effects (treatment effect homogeneity) across sub-samples, which would be necessary to validate the comparison of different treatment estimands, should not be relied on. For example, Oosterbeek et al. (2008) showed positive impacts on school enrollment for the poorest quintile receiving benefits under the Bono de Desarrollo Humano (BDH) CCT program in Ecuador, but no impacts for the second poorest quintile.
Previous reviews noted several issues in systematically identifying internal replication studies due to a lack of common language used to index this evidence. Glazerman et al. (2003) indicated electronic searches failed to comprehensively identify many known studies, while Chaplin et al. (2018) stated that, despite attempting to search broadly, “we cannot even be sure of having found all past relevant studies” (p.424). Hence, a combination of search methods was used, including electronic searches of the Research Papers in Economics (RePEc) database via EBSCO, where search terms were identified using “pearl harvesting” (using keywords from known eligible studies) ( Sandieson, 2006 ) and 3ie's Impact Evaluation Repository ( Sabet & Brown, 2018 ); back-referencing of the bibliographies of included studies and of reviews of internal replication studies; forward citation tracing of reviews of internal replication studies using three electronic tracking systems (Google Scholar, Web of Science, and Scopus); hand searches of the repository of a known institutional provider of internal replication studies (Manpower Demonstration Research Corporation, MDRC); and contacting authors. Full details of the search strategy and results are in Villar and Waddington (2019) .
Existing reviews of internal replication studies do not provide comprehensive assessments of the risk of bias to the effect estimate in the benchmark study using formal risk-of-bias tools. Partial exceptions are Glazerman et al. (2003) , who commented on the likely validity of the benchmark RCTs (randomization oversight, performance bias, and attrition), and Chaplin et al. (2018) who coded information on use of covariates to control for pre-existing differences across groups and use of balance tests in estimation.
Modified applications of Cochrane’s tools for assessing risk of bias in RCTs were used to assess biases in benchmark cluster-randomized studies ( Eldridge et al., 2016 ; Higgins et al., 2016 ). 7 For the individually randomized benchmark, which was analyzed using instrumental variables due to non-adherence, the risk-of-bias assessment drew on Hombrados and Waddington (2012) , as well as relevant questions about selection bias into the study from Eldridge et al. (2016) . 8 In addition, the appraisal of the benchmark took into account the relevance of the bias domains in determining internal validity of RCT estimate, as well as other factors that may have caused differences between the benchmark and NRS replication estimates. We also evaluated bias from specification searches using publication bias analysis at the review level.
Data collected from included papers included outcome means in control and comparison groups, outcome variances, sample sizes, and significance test values (e.g., t-statistics, confidence intervals, and p-values). These were used to calculate the distance metric measure of bias and its standard error. D is defined as the primary distance metric measuring the difference between the non-experimental and experimental means, interpreted as the size of the bias, calculated as

$$D = \left(\bar{Y}^{t}_{RCT} - \bar{Y}^{c}_{NRS}\right) - \left(\bar{Y}^{t}_{RCT} - \bar{Y}^{c}_{RCT}\right) = \bar{Y}^{c}_{RCT} - \bar{Y}^{c}_{NRS} \qquad (1)$$

where $\bar{Y}^{c}_{NRS}$ and $\bar{Y}^{c}_{RCT}$ are the mean outcomes of the non-randomized comparison and randomized control groups and $\bar{Y}^{t}_{RCT}$ is the mean outcome of the randomized treatment group. Both numerical and absolute differences in D were calculated. Taking the absolute difference in D ensured that a measure of the overall deviation of randomized and non-randomized estimators was estimated, and not a measure based on the numerical difference that, on average, “cancelled out” positive and negative deviations, potentially obscuring differences of interest. 9 Distance estimates were standardized by the standard deviation of the outcome, denoted $D_S$, as well as being compared as percentages of the treatment effect estimate and of the control and prima facie means to aid comparison. In total, six relative distance metrics were used to compare the difference between NRS and benchmark means, interpreted as the magnitude of bias in the NRS estimator: the standardised numerical difference; the standardised absolute difference; the percentage difference; the absolute difference as a percentage of the control mean; the percentage reduction in bias; and the mean-squared error. These are presented in Table 2 .
Distance Metrics Used in Analysis.
Sources: Greenland (2000) ; Glazerman et al. (2003) ; Hansen et al. (2013) ; Steiner and Wong (2018) .
The standard error of $D_S$ is given by the generic formula for the standard error of the difference between two independent estimates

$$se_{D_S} = \sqrt{se_{NRS}^{2} + se_{RCT}^{2}} \qquad (2)$$

where $se_{NRS}$ and $se_{RCT}$ are the standard errors of the non-randomized and randomized mean outcomes, respectively, which from equation (1) can be assumed independent.
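A minimal sketch of equations (1)–(2) with invented group means and standard errors: it contrasts the NRS comparison mean with the randomized control mean and standardizes by an assumed outcome standard deviation. None of the numbers come from the included studies.

```python
# Distance metric D, its standardized version, and its standard error (hypothetical inputs).
import math

y_t_rct = 14.0                 # randomized treatment mean (made up)
y_c_rct, se_rct = 12.0, 0.6    # randomized control mean and its standard error (made up)
y_c_nrs, se_nrs = 11.2, 0.8    # non-randomized comparison mean and its standard error (made up)
sd_outcome = 4.0               # outcome standard deviation used for standardization (made up)

D = (y_t_rct - y_c_nrs) - (y_t_rct - y_c_rct)   # equation (1); the treatment mean cancels
D_s = D / sd_outcome                            # standardized distance
se_D = math.sqrt(se_nrs ** 2 + se_rct ** 2)     # equation (2), on the scale of the group means
print(D, D_s, se_D)
```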
In order to account for differences in precision across estimates, pooled means were calculated using fixed-effect inverse variance-weighted meta-analysis. The fixed effect model may be justified under the assumption that the estimates are from the same target populations, with the remaining bias being due to sampling error. However, each internal replication study reported multiple bias estimates using different methods of analysis and/or specifications. The weights w for each estimate therefore needed to account for the different numbers of bias estimates each study contributed, using the following approach 10

$$w_{ik} = \frac{1}{s_{i}^{2}\, m_{k}} \qquad (3)$$

where $s_{i}^{2}$ is the variance of distance estimate $i$ and $m_{k}$ is the number of distance estimates provided by study $k$. The pooled weighted average of D was calculated as

$$\bar{D} = \frac{\sum_{k}\sum_{i} w_{ik} D_{ik}}{\sum_{k}\sum_{i} w_{ik}} \qquad (4)$$
Noting that the weight for a single study is equal to the inverse of the variance of each estimate adjusted for the total number of estimates, following Borenstein et al. (2009), it follows that the variance of the weighted average is the inverse of the sum of the weights across the k included studies

$$\operatorname{var}\!\left(\bar{D}\right) = \frac{1}{\sum_{k}\sum_{i} w_{ik}} \qquad (5)$$
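The weighting scheme in equations (3)–(5) can be sketched as follows, with made-up distance estimates and variances: each estimate's inverse-variance weight is divided by the number of estimates its study contributes before the fixed-effect average is taken.

```python
# Fixed-effect pooling with study-level adjustment of weights (illustrative data).
import numpy as np

# (standardized distance D_s, its variance, study id) -- all invented
estimates = [(0.05, 0.010, "A"), (0.12, 0.020, "A"),
             (-0.02, 0.015, "B"), (0.30, 0.050, "C")]

counts = {}
for _, _, study in estimates:
    counts[study] = counts.get(study, 0) + 1

w = np.array([1.0 / (var * counts[study]) for _, var, study in estimates])  # equation (3)
d = np.array([dist for dist, _, _ in estimates])

pooled = np.sum(w * d) / np.sum(w)    # equation (4)
var_pooled = 1.0 / np.sum(w)          # equation (5)
print(pooled, var_pooled ** 0.5)
```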
We also tested the sensitivity of fixed effect meta-analysis estimates to different weighting schemes including simple averages and weighted averages using the inverse of the variance and the sample size.
Information About the Sample
Eight eligible internal replications of randomized studies of social and economic programs were included ( Table 3 ). All but one included study used a cluster-randomized controlled field trial as the benchmark. McKenzie et al. (2010) used administratively randomized data, where program assignment was done individually by a lottery implemented by administrators, and the data were collected by the authors specifically to estimate the treatment effect of the lottery. Clusters were randomly assigned to the program in Galiani and McEwan (2013) and Galiani et al. (2017) as part of a field trial, and these studies used census data to evaluate outcomes.
Eligible Within-Study Comparisons of Development Programs.
Note. PROGRESA = Programa de Educación, Salud y Alimentación; RPS = Red de Protección Social; RCT = randomized controlled trial; RDD = regression discontinuity design; DD = double-differences; GDD = geographical discontinuity design; PRAF = Programa de Asignación Familiar.
Four of the studies featured in a literature review of internal replication studies in development economics ( Hansen et al., 2013 ). An additional four studies were located through the searches, including two of the Programa de Asignación Familiar (PRAF) in Honduras ( Galiani & McEwan, 2013 ; Galiani et al., 2017 ) and one of a scholarship program in Cambodia ( Barrera-Osorio et al., 2014 ), all of which examined discontinuity designs. A final study of electricity subsidies in Tanzania evaluated matching estimators ( Chaplin et al., 2017 ).
The studies tested a range of non-randomized replication methods including cross-sectional and panel data regression, geographical discontinuity design (GDD), 12 IV, propensity score matching (PSM) and RDD.
Data were collected on treatment effects for the benchmark study, as well as each corresponding non-randomized replication. We calculated distance estimates from 586 specifications, of which 151 were estimated from test statistics due to incomplete information reported about standard deviations of the outcome in the benchmark ( McKenzie et al., 2010 ; Galiani & McEwan, 2013 ; Barrera-Osorio et al., 2014 ; Galiani et al., 2017 ). The largest number of estimates was from matching and discontinuity designs, each totaling over 170 across four studies. The fewest estimates were from DD and IV estimation, with only 5 in total from a single study. The studies explored design and conduct, thus a range of matching estimators were tested, such as kernel matching (70 estimates) and nearest neighbor matching (59 estimates), or prospective RDD (92 estimates) and retrospectively designed RDD (81 estimates). The estimate of effect which most closely corresponded with the population for the non-randomized arm was taken from the RCT—the bandwidth around the treatment threshold for the replications using RDD ( Buddelmeyer & Skoufias, 2004 ; Galiani & McEwan, 2013 ; Barrera-Osorio et al., 2014 ; Galiani et al., 2017 ) and the instrumental variables analysis of the administratively randomized study ( McKenzie et al., 2010 ).
Bias in the Within-Study Comparisons
This section presents a summary of findings from the risk-of-bias assessment; the complete assessment is given in the Supplemental Appendix . Only one benchmark was estimated to have “low risk of bias” ( Galiani & McEwan, 2013 ; Galiani et al., 2017 ). However, due to problems in implementing the NRS in those studies, there remained “some concerns” about confounding of the NRS-RCT distance estimate with respect to its interpretation as bias. The benchmark for PROGRESA was estimated to have “high risk of bias” due to attrition. 13 The remaining benchmark studies had “some concerns.” 14 Hence, all the within-study comparison estimates of bias in our sample may be confounded ( Table 4 ).
Risk-of-Bias Assessment for Within Study Comparisons.
Notes: * assessment draws on ( Diaz & Handa, 2005 ), Behrman and Todd (1999) , Skoufias et al. (2001) , Angelucci and de Giorgi (2006) and Rubalcava et al. (2009) ; ** assessment draws on Maluccio and Flores (2004 , 2005) ; *** assessment is of the instrumental variables estimate for the randomised sample; **** assessment draws on Barrera-Osorio and Filmer (2016) ; ***** assessment draws on Glewwe and Olinto (2004) ; ^ assessment takes into account relevance of the domain for relative bias regarding within-study comparison.
Source: authors using Higgins et al. (2016) , Eldridge et al. (2016) and ( Hombrados & Waddington, 2012 ).
Concerns about the benchmarks often arose from a lack of information, such as in the case of attrition in the PROGRESA benchmark experiment, or in assessing imbalance of baseline characteristics using distance metrics. In other instances, concerns were more difficult to address. For example, none of the studies was able to blind participants to the intervention, and outcomes were mainly collected through self-report, a possible source of bias in open (unblinded) studies ( Savović et al., 2012 ). For benchmark studies using cluster-randomization, where informed consent does not necessarily alert participants to the intervention, this source of bias may be less problematic ( Schmidt, 2017 ). Also, where participants are identified after cluster assignment, it is not clear that evaluations can sufficiently capture data on non-adherence due to participant migration into, out of, or between study clusters.
However, it was not always clear whether the risk of bias arising in the benchmark estimate would cause bias in the difference estimate. For example, a threat to validity due to incomplete treatment implementation (“departures from intended treatment” domain) is not a threat to validity in the distance estimate for within-study comparisons that compare the randomised control and NRS comparison means only, which do not depend on treatment fidelity, as in the cases reviewed here. Similarly, biases arising due to the collection of reported outcomes data (“bias in measurement of the outcome” domain) in open trials may not cause bias in the internal replication estimate if the NRS uses the same data collection methods, and the potential sources of bias in benchmark and observational study are considered to be equivalent (e.g., there are no additional threats to validity due to motivational biases from participating in, or repeated measurement as part of, a trial). Multiple specifications, outcomes, and sub-groups were included to provide diversity in the estimates in all studies, hence selective reporting that may have affected benchmark trials (under “selection of the reported result” domain) was not judged problematic in the context of within-study comparisons.
Finally, bias in the difference estimate may be caused by bias in the NRS, confounding of the relationship and specification searches. Bias in the NRS is captured in the meta-analysis of different specifications. Confounding of the relationship may occur due to differences in the survey (e.g., outcome measurement) and target population. 15 All discontinuity design replications included were able to restrict the RCT samples to create localized randomized estimates in the vicinity of the discontinuity and compared the distance between the two treatment effect estimates ( Buddelmeyer & Skoufias, 2004 ; Galiani & McEwan, 2013 ; Barrera-Osorio et al., 2014 ; Galiani et al., 2017 ). In the case of Galiani and McEwan (2013) , where program eligibility was set for localities below a threshold on mean height-for-age z-score (HAZ), the RDD comparison was generated for untreated localities just above the threshold, where HAZ was predicted due to limited data. In Buddelmeyer and Skoufias (2004) , there were four groups of households that enabled the RDD estimator to be compared to the RCT. The groups were differentiated by treatment status of the cluster, determined by randomization across those clusters below a maximum discriminant score (poverty index); and eligibility of households within clusters for treatment, determined by the household’s discriminant score. The RCT treatment estimand was calculated over households within the same bandwidth as the RDDs to ensure comparability of the target population. Other studies used statistical methods to compare NRS comparison groups with randomized control group means ( Diaz & Handa, 2006 ; Handa & Maluccio, 2010 ; McKenzie et al., 2010 ; Chaplin et al., 2017 ).
Quantitative Estimates of Bias
NRS With Selection on Observables
Table 5 compares the distance estimates obtained from the different methods of calculating the pooled effect. Two within-study comparisons reported distance using regression-based estimators ( Diaz & Handa, 2006 ; McKenzie et al., 2010 ). The cross-section regression specifications may perhaps be one benchmark against which other estimators may be compared. As expected, these distance estimators tended to be larger than those using other methods, including double differences, credible instrumental variables, and statistical matching. 16
Pooled Standardised Bias Estimates.
Note. RDD = regression discontinuity design; DD = double-differences.
Notes: * simple average used to calculate pooled estimate; ** weighted average calculated using the inverse of the variance multiplied by the inverse of the number of estimates in the study; *** weighted average calculated using the benchmark sample size multiplied by the inverse of the number of estimates in the study; ^ indicates RDD estimate compared with RCT average treatment effect (ATE comparisons also incorporated Urquieta et al., 2009 and Lamadrid-Figueroa et al., 2010 ); $ sample size is for calculations in (1-7), calculations in (8) use a smaller number of studies owing to more limited availability of a prima facie estimate.
Matching produced small to medium sized estimates on average—between 0.10 and 0.30 in simple weighting ( Table 5 columns 1–2), <0.10 for some specifications in more complex weighting ( Table 5 columns 3–4)—but conduct mattered. Using more parsimonious matching by reducing the covariates in the matching equation to social and demographic characteristics that would be available in a typical household survey, usually led to bigger distance estimates than matching using rich control variables in the data available ( Diaz & Handa, 2006 ; Chaplin et al., 2017 ). Matching on pre-test outcomes provided smaller distance metrics on average ( Chaplin et al., 2017 ; McKenzie et al., 2010 ). Finally, smaller distance metrics were estimated when matching on local comparisons ( Chaplin et al., 2017 ; Handa & Maluccio, 2010 ; McKenzie et al., 2010 ). 17
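To make these "conduct" choices concrete, here is a small simulated sketch of 1-nearest-neighbor matching on a pre-test outcome. The data-generating process and the true effect of 0.4 are assumptions for illustration, not results from the included studies.

```python
# 1-nearest-neighbor matching of comparison units to treated units on the pre-test.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n = 2000
pretest = rng.normal(size=n)
treated = (pretest + rng.normal(size=n) > 0)             # selection on an observable (pre-test)
posttest = pretest + 0.4 * treated + rng.normal(scale=0.5, size=n)

naive = posttest[treated].mean() - posttest[~treated].mean()   # inflated by selection

X_t = pretest[treated].reshape(-1, 1)
X_c = pretest[~treated].reshape(-1, 1)
nn = NearestNeighbors(n_neighbors=1).fit(X_c)
_, idx = nn.kneighbors(X_t)                              # closest comparison unit per treated unit

att = np.mean(posttest[treated] - posttest[~treated][idx.ravel()])
print(f"naive: {naive:.2f}   matched: {att:.2f}")        # matched estimate should sit near 0.4
```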
The remaining columns of Table 5 attempt to translate the findings into metrics that better indicate the substantive importance of the bias. Column 5 gives the mean-squared error, column 6 presents the bias as a percentage of the benchmark treatment effect, and column 7 gives bias as a percentage of the benchmark control mean.
Matching tended to produce estimates that differed from the RCT treatment effect by large percentages, on average 200% bigger than the RCT estimate ( Table 5 column 6). However, matching would be expected to present a larger treatment estimate where it estimates ATET, which is bigger than the intent-to-treat estimate under non-adherence. Presenting bias as a percentage of the control mean (column 7), the estimates were smaller. In addition, as noted above, where the control mean was close to zero, or small relative to the treatment estimate, the percentage difference estimator was large, as was the case in many of the matching estimators presented by Handa and Maluccio (2010) . The most important aspect of study conduct in matching was the use of “rich controls,” leading to 83% bias reduction on average across 116 estimates from four studies, although with a relatively high expected MSE of 0.15 ( Table 5 column 5). Nearest neighbor matching also outperformed other matching methods, accounting for 52% of bias with expected MSE of 0.07, based on 59 estimates from four studies. Matching on the baseline measure, which is similar to DD estimation, on average removed 56% of bias with expected MSE less than 0.001, across 15 estimates.
In contrast, across 10 estimates from two studies, regression analysis applied to cross-section data removed 34% of bias with an expected MSE of 0.18.
NRS With Selection on Unobservables
The studies examining discontinuity designs produced distance metrics that were typically less than 0.1 standard deviations. These relatively small distance metrics, compared with the other NRS estimators, varied by the bandwidth used ( Buddelmeyer & Skoufias, 2004 ), as shown in the comparison of LATE and ATE estimators. 18 It is notable that the sample includes RDDs designed both prospectively ( Buddelmeyer & Skoufias, 2004 ; Barrera-Osorio et al., 2014 ) and retrospectively ( Galiani & McEwan, 2013 ; Galiani et al., 2017 ), providing tests of both types of RDD implemented in practice. These findings are useful, given that potential sources of bias in prospective and retrospective RDDs are different—for example, retrospective studies are potentially more susceptible to biased selection into the study (due to missing data), whereas prospectively designed studies may be more susceptible to motivation bias (e.g., Hawthorne effects).
Regression discontinuity design estimation produced bias estimates that on average differed from the RCT treatment effect by 7%, and by 8% of the control mean. However, when RDD was compared to ATE estimates, it produced distance estimates that were on average 20% different from the RCT estimate, providing evidence of heterogeneous impacts. These findings were strengthened by the inclusion of distance estimates from two studies that were excluded from the previous analysis ( Urquieta et al., 2009 ; Lamadrid-Figueroa et al., 2010 ), which compared RDD estimates to RCT ATEs. Regarding statistical significance, RDDs also usually have less power because they are estimated for a sub-sample around the cut-off. 19 However, the strongest evidence of accuracy was for RDD, which across 173 separate estimates from four studies removed 94% of bias on average, with expected MSE less than 0.001.
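A minimal, simulated sketch of local linear RDD estimation around the cut-off is shown below; the running variable, bandwidth, and true effect of 0.25 are all invented, and treatment is assumed to be assigned below the cut-off, as in the poverty-score examples above.

```python
# Local linear RDD: keep observations within a bandwidth of the cut-off and
# allow separate slopes on each side; the coefficient on `treat` is the jump.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
running = rng.uniform(-1, 1, 2000)           # running variable centered at the cut-off (0)
treat = (running < 0).astype(float)          # eligible (treated) below the cut-off
y = 0.25 * treat + 0.8 * running + rng.normal(scale=0.3, size=running.size)

h = 0.25                                     # bandwidth: a key tuning choice
keep = np.abs(running) <= h
X = sm.add_constant(np.column_stack([treat[keep], running[keep], treat[keep] * running[keep]]))
fit = sm.OLS(y[keep], X).fit()
print(f"estimated jump at the cut-off: {fit.params[1]:.2f}")   # simulated truth is 0.25
```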
McKenzie et al. (2010) examined the correspondence of two DD regression estimates, which removed an estimated 56% of bias with expected MSE less than 0.02 compared to the RCT. In two-stage least squares (2SLS) instrumental variables estimation, one instrument was the migrant’s network (indicated by number of relatives in the country of immigration). This was shown to be correlated with migration (albeit with F-statistic = 6, which is below the satisfactory threshold of F = 10; Bound et al., 1995 ), but produced a treatment effect distance metric exceeding that for single differences, based on pre-test post-test or cross-section adjustment. This supports the theoretical prediction that inappropriate instruments produce 2SLS findings that are more biased than OLS. The authors argued it was unlikely to satisfy the exclusion restriction since it was very likely correlated with income after immigration, despite being commonly used in the field of migration. Another instrument, distance to the application center, produced the smallest distance metric of any within-study comparison, effectively equal to zero. The instrument was highly correlated with migration (F-statistic = 40) and, it was argued, satisfied the exclusion restriction as it was unlikely to determine income for participants on the main island where “there is only a single labor market… where all villages are within 1 hour of the capital city” (p.939). While distance may provide a plausible source of exogenous variation in some IV studies, it is also not possible to rule out the possibility that the arguments being made for the success of the instrument were based on results. Distance would not provide an appropriate instrument where program participants move to obtain access to services.
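The two diagnostics discussed in that study, first-stage strength and the 2SLS estimate itself, can be sketched with simulated data as follows. The instrument, effect size, and sample size are invented, and the two-step OLS shown here reproduces the 2SLS point estimate but not its correct standard errors.

```python
# First-stage F-statistic and a manual two-stage least squares estimate (simulation).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 5000
z = rng.normal(size=n)                               # instrument (e.g., distance to application center)
u = rng.normal(size=n)                               # unobserved confounder
treat = (0.8 * z + u + rng.normal(size=n) > 0).astype(float)
income = 1.0 * treat + u + rng.normal(size=n)        # simulated true effect = 1.0

first = sm.OLS(treat, sm.add_constant(z)).fit()      # first stage
f_stat = first.tvalues[1] ** 2                       # with one instrument, F = t^2
print(f"first-stage F: {f_stat:.1f}")                # weak-instrument concern if well below 10

second = sm.OLS(income, sm.add_constant(first.fittedvalues)).fit()
print(f"2SLS point estimate: {second.params[1]:.2f}")  # OLS on actual treatment would be biased upward
```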
Sensitivity Analysis
The matching estimates were sensitive to the choice of weighting scheme. It can be seen that the simple unweighted average of the numerical bias tended to produce the smallest distance metric where the individual underlying difference estimates were distributed above and below the null effect, on average “cancelling out” each other ( Table 5 , column 1). Using absolute mean differences accentuates the difference between the RCT and NRS mean, by definition exceeding zero. The corollary is that taking the simple (unweighted) average of the absolute difference produced distance estimates that tended to be bigger ( Table 5 , column 2). This explains why the findings from this review are different from those found in other within-study comparison papers, which, sometimes implicitly, used unweighted averages of the numerical difference when discussing their findings (e.g., Hansen et al., 2013 ).
On the other hand, using the adjusted inverse-variance weighted average produced distance metrics between these two extremes ( Table 5 , column 3). Even the metrics for matching are below 0.1 in these cases, although this is due to the large number of small distance metrics produced by Chaplin et al. (2017) . When the studies were instead weighted by RCT sample size, 20 rather than inverse of the variance, the matching distance metrics reverted to magnitudes presented above, although remaining small for baseline measurement, local comparison, and nearest neighbor algorithm ( Table 5 , column 4). RDD estimates did not appear sensitive to the choice of study weights.
Meta-regressions were estimated to explore differences across findings by NRS design and conduct simultaneously, alongside factors that might affect correspondence between NRS and benchmark, including type of outcome measure and risk of bias. 21 Use of income measures of the outcome and high risk of bias in the estimate were significantly associated with greater differences between NRS and benchmark estimates ( Table 6 ). Variables associated substantively with smaller differences were use of RDD, matching, and binary outcomes. 22 The estimated between-study variance is also equal to zero (Tau-sq = 0.000); hence, we are unable to reject the null hypothesis of homogeneity in the true effect sizes. While we chose the fixed effects model on conceptual grounds, this finding provides some empirical reassurance that our model choice was reasonable.
Meta-Regression of Standardized Absolute Bias.
Notes: * base category. Standard errors use cluster adjustment (equation ( 5 )).
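A sketch of the kind of meta-regression reported in Table 6, using invented data: absolute standardized bias is regressed on a design indicator and a risk-of-bias indicator with inverse-variance weights and standard errors clustered by replication study. Every variable name and value here is hypothetical.

```python
# Weighted meta-regression with cluster-robust standard errors (illustrative data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
m = 120
df = pd.DataFrame({
    "abs_bias": rng.exponential(0.1, m),                      # absolute standardized distance
    "design": rng.choice(["rdd", "matching", "regression"], m),
    "high_rob": rng.integers(0, 2, m),                        # high risk-of-bias indicator
    "study": rng.integers(0, 8, m),                           # 8 replication studies
    "inv_var_weight": rng.uniform(5, 50, m),
})

fit = smf.wls("abs_bias ~ C(design) + high_rob", data=df,
              weights=df["inv_var_weight"]).fit(cov_type="cluster",
                                                cov_kwds={"groups": df["study"]})
print(fit.summary())
```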
A funnel graph was plotted of the standardized numerical difference against the standard error to evaluate bias from specification searches ( Figure 1 ). There was symmetry in the plot and regression line intercept passed through the origin, indicating no statistical evidence for specification searches. Since studies were designed to conduct and report results from multiple tests, regardless of findings, this evidence supports the validity of internal replication studies, even when authors are unblinded to the benchmark effect.
Funnel graph with confidence intervals and regression line.
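The funnel diagnostic can be reproduced in outline as follows, with simulated estimates that are symmetric by construction; an Egger-type regression intercept near zero is the numerical counterpart of the visual symmetry described above.

```python
# Funnel plot of standardized differences against standard errors, plus Egger-type test.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(4)
se = rng.uniform(0.02, 0.20, 80)
d = rng.normal(0.0, se)                      # symmetric around zero by construction

plt.scatter(d, se)
plt.gca().invert_yaxis()                     # most precise estimates at the top
plt.xlabel("standardized numerical difference")
plt.ylabel("standard error")
plt.savefig("funnel.png")

egger = sm.OLS(d / se, sm.add_constant(1.0 / se)).fit()
print(f"Egger intercept: {egger.params[0]:.3f}")   # near zero suggests no small-study asymmetry
```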
Conclusions
In this article, we aimed to provide empirical evidence on bias in non-randomized studies of intervention effects, or quasi-experiments, in development economics. We conducted a systematic review of evidence from internal replication studies on the correspondence between NRS and benchmark RCTs, and critically appraised the design and conduct of the studies. We conclude that NRS can provide unbiased effect estimates, supporting the findings of other researchers, notably Cook et al. (2008). This is a useful finding for instances where causal inference is needed but a randomized design is infeasible for answering the evaluation questions. The analysis suggests that study design is probably the most important factor in determining bias. The most accurate findings, with samples large enough to generalize from, came from RDDs, which were examined in four studies. Both prospectively and retrospectively designed RDDs provided credible estimates. As predicted by theory, the bias properties of some estimators depended on effective study conduct, such as the choice of instrument or the incorporation of baseline measures, geographically local matches, or appropriate matching algorithms. The strong performance of nearest neighbor matching algorithms on MSE is consistent with Busso et al.’s (2014) findings from Monte Carlo analysis, although these authors also found a trade-off between bias and variance.
The findings have implications for critical appraisal tools commonly used to assess risk of bias (e.g., Sterne et al., 2016), such as the value placed on particular designs like regression discontinuity, the use of baseline covariates, or the methods of selecting matches. With regard to selection on observables more generally, matching sometimes produced almost identical bias coefficients to cross-section regression, but at other times did not. Where matching used baseline adjustment, local matches, and nearest neighbor algorithms, the biases were smaller. The cause of selection bias is also likely to be important. As noted by Hansen et al. (2013), the estimates from McKenzie et al. (2010) are a case where the main source is participant self-selection, which is thought to be more difficult to control for directly than program placement bias at the geographic level. It is possible, therefore, that program placement modeled using selection on observables may provide more accurate findings, although the one study in our sample of a group-targeted program did not suggest that findings using cross-section regression or matching were particularly accurate (Diaz & Handa, 2006).
Indeed, Smith and Todd (2005) warned against “searching for ‘the’ nonexperimental estimator that will always solve the selection bias problem inherent in nonexperimental evaluations” (p.306). Instead, they argued research should seek to map and understand the contexts that may influence studies’ degrees of bias. For instance, Hansen et al. (2013) noted the potential importance of the type of dependent variable examined in studies, suggesting that simple variables (such as binary indicators of school attendance) may be easier to model than more complex outcome variables (such as consumption expenditure or earnings). Our meta-regressions support this finding. Additionally, complexity may not be a problem in and of itself, but may simply magnify other problems, in particular missingness (outcomes that are easier to measure probably have less potential for missing data) and lower reliability.
The standardized distance metrics were negligible for discontinuity designs and, by conventional standards, small to medium for matching. However, recent attempts to examine effect sizes observed in empirical research suggest that these conventional values may provide a poor benchmark for the magnitudes of effects seen in applied research. Reviews of education interventions in high-income countries and in L&MICs have shown that very few have effects that would be classified as anything but small according to Cohen’s approximations (Coe, 2002; McEwan, 2015). Averaging the effects of 76 meta-analyses of past education interventions, Hill et al. (2008) found that mean effect sizes ranged between 0.2 and 0.3 standard deviations. There are also concerns about percentage distance metrics, which depend on the magnitude of the baseline value, as seen here. Hence, it may be useful to compare distances based on both standardization and percentages, as we have done.
A comment is warranted about generalizability, given the relatively small number of internal replication studies that exist in development economics and the small numbers of estimates for particular estimators. First, the interventions are restricted largely to conditional cash transfers, an approach that has been extensively tested using cluster-randomization. With the exception of the studies in Cambodia, Tanzania, and Tonga, the evidence from internal replications comes from Latin America. There may therefore be legitimate concerns about the transferability of the evidence to other contexts and sectors. Furthermore, risk-of-bias assessments found that all of the studies were rated as having at least “some concerns” about bias, and those rated at “high risk of bias” demonstrated less correspondence between NRS and RCT, confirming that the conduct of the internal replication study itself is important in estimating bias (Cook et al., 2008).
A final comment concerns the conduct of further internal replication studies. We noted that it may be difficult to blind NRS replication researchers convincingly to the benchmark study findings, but that multiple model specifications, outcomes, and subgroups may help to provide sufficient variation in hypothesis testing. However, even though we were not able to find evidence of publication bias, reporting bias clearly remains a potential problem in these studies. The final implication, therefore, is that analysis protocols for future internal replication studies should be published. The protocol should specify the findings from the benchmark study for which replication is sought, together with the proposed NRS designs and methods of analysis. It need not artificially constrain the NRS in an area where pre-specifying all possible analyses may be difficult. As in evidence synthesis research, sensible deviations from the protocol are acceptable provided the reasons for them are clearly articulated.
Supplemental Material
Supplemental material for Can Non-Randomised Studies of Interventions Provide Unbiased Effect Estimates? A Systematic Review of Internal Replication Studies by Hugh Sharma Waddington, Paul Fenton Villar, and Jeffrey C. Valentine in Evaluation Review
In regression discontinuity design, random error in measurement of the assignment variable can be incorporated to produce strong causal identification at the assignment threshold (Lee & Lemieux, 2010).
Simultaneous designs are dependent designs in which the RCT treatment arm is common across study arms, and hence “differenced out” in distance estimator calculations (see equations (1) and (2) below).
“Multi-site simultaneous design” attempts to account for this by using data from an RCT based on multiple selected sites, within each of which participants are randomly assigned to treatment and control. Bias is inferred by comparing average outcomes from the treatment group in one site to the control observations from another site (Wong & Steiner, 2018).
For example, Fretheim et al. (2013) discarded control group data from a cluster-RCT with 12 months of outcome data points available from health administrative records before and after the intervention, in order to compare the RCT findings with interrupted time series analysis.
This approach was also used in the group A (eligible households in treated clusters) versus group D (ineligible households in control clusters) comparisons in Buddelmeyer and Skoufias (2004), and in the “pure control” group comparisons in Barrera-Osorio et al. (2014). The key difference between simultaneous tie-breaker and synthetic design is that, in the latter, the researcher removes observations to generate a “synthetic RDD,” whereas the former requires knowledge about the threshold decision rule used to assign groups into the RCT.
Where there is non-compliance due to no-shows in the treatment group, it is possible that the intervention “target population” could be included in the control group in the NRS in some within-study comparison designs. Furthermore, the sample used in the NRS may differ from the RCT sample depending on whether observations are dropped non-randomly (e.g., to satisfy common support in PSM), which is sometimes referred to as sampling bias (Greenland, 2000), and is similar to the problem of comparing population average treatment effects with local average treatment effects from discontinuity designs and instrumental variables estimation.
It was not considered necessary to blind coders to results following Cook et al. (2008), for example, by removing the numeric results and the descriptions of results (including relevant text from the abstract and conclusion), as well as any identifying items such as authors’ names, study titles, year of study, and details of publication. All studies reported multiple within-study comparisons and all data were extracted and analyzed by the authors.
Cochrane’s risk of bias tool for RCTs does not enable the reviewer to discern the validity of the application of IV to correct for non-compliance.
In practice, the standardized difference, calculated as the subtraction of the RCT numerical estimate from that of the NRS, was frequently either side of zero, which did tend to “cancel out” across specifications, as shown in the results for the simple subtracted standardized bias (Table 5, column 1).
The generalized approach presented in Tanner-Smith and Tipton (2014) simplifies to equation (3) as follows
$$ w_{ij} = \frac{1}{\left(s_i^2 + \tau^2\right)\left[1 + (m_{jk} - 1)\rho\right]} = \frac{1}{\left(s_i^2 + 0\right)\left[1 + (m_{jk} - 1)\cdot 1\right]} = \frac{1}{s_i^2\, m_{jk}} $$
where the weighting incorporates the between-studies variance in a random effects model, $\tau^2$ (equal to zero in the fixed effect case), and the estimated correlation between effects, $\rho$ (equal to 1 where all NRS comparisons draw on the same sample and the benchmark control is the same across all distance estimates).
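A quick numerical check of this simplification, using illustrative values only:

```python
# Numerical check that the generalized weight reduces to 1 / (s_i^2 * m_jk)
# when tau^2 = 0 and rho = 1 (illustrative values only).
s2, tau2, rho, m = 0.02, 0.0, 1.0, 4

w_general = 1 / ((s2 + tau2) * (1 + (m - 1) * rho))
w_simplified = 1 / (s2 * m)
print(w_general, w_simplified)   # both 12.5
```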
The visas enabled Tongans to take permanent residency in New Zealand under New Zealand’s immigration policy which allows an annual quota of Tongans to migrate.
Galiani et al. (2017) stated that it was unlikely that households from the indigenous Lenca group migrated to obtain benefits under the CCT program, suggesting validity of the benchmark control group. However, there remained differences in shares of Lenca populations across the geographical discontinuity in cash transfer treatment and control communities, potentially invalidating the GDD comparison. Therefore, in this study the potential outcomes are assumed independent of treatment assignment, conditional on observed covariates.
The studies of PROGRESA were rated as having “high risk of bias” due to high overall attrition and the limited information about differential attrition available in published reports. For example, Rubalcava et al. (2009) noted that “one-third of households left the sample during the study period” and that “no attempt was made to follow movers” (p.515). Differential attrition in PROGRESA is discussed in Faulkner (2014).
There were two instances of “high risk of bias” in the NRS replications due to differences in the definition of outcomes relative to the benchmark survey questions (Diaz & Handa, 2006; Handa & Maluccio, 2010); see Appendix.
In two studies there was a risk of bias in the distance estimate due to differences in the survey questionnaire for the expenditure and child labor outcomes (Diaz & Handa, 2006) and preventive health check-ups (Handa & Maluccio, 2010).
McKenzie et al. (2010) also reported the single difference estimator, taken from the difference between pre-test and post-test. This was found to be a less accurate predictor of the counterfactual outcome than matching on baseline outcome, double-differences and credible instrumental variables, but more accurate than cross-section regression and statistical matching which excluded the baseline measure.
McKenzie et al. (2010) implicitly used local matches, by choosing NRS comparisons from geographically proximate households in the same villages as treated households. Due to the reduced risk of contamination, as the treated households had emigrated already, matches in McKenzie et al. (2010) could be from the same villages, unlike in other matched studies (for an intervention where there is a risk of contamination or spillover effects), where matches would need to be geographically separate.
In Barrera-Osorio et al. (2014), the bias in the test scores estimate was substantially smaller than the bias in grade completion, which the authors noted was estimated by enumerators and may therefore have been measured with error.
For example, Goldberger (1972) originally estimated that the sampling variance for an early conception of RDD would be 2.75 times larger than an RCT of equivalent sample size. See also Schochet (2008).
Sample size weighting uses the formula $w_{ij} = n_i / m_{jk}$, where $n_i$ is the sample size for difference estimate $i$ and $m_{jk}$ is the number of estimates contributed by study $k$.
We used three regression specifications: meta-regression, OLS regression and Tobit regression (to account for censoring of the standardised absolute difference below 0). All models produced the same coefficient estimates. Results available on request from authors.
The meta-regression coefficient on a dummy variable equal to 1 when the study measured a binary outcome was −0.10 (95% CI = −0.2, 0.0). When both the binary outcome and income outcome dummies were included simultaneously, neither coefficient was statistically significant.
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: we gratefully received funding from the American Institutes for Research under Campbell Methods Grant CMG1.11.
Supplemental Material: Supplemental material for this article is available online.
- Angelucci M., de Giorgi G. (2006). Indirect effects of an aid program: the case of PROGRESA and consumption. IZA Discussion Paper No. 1955, January 2006.
- Barrera-Osorio F., Filmer D. (2016). Incentivizing schooling for learning: Evidence on the impact of alternative targeting approaches. Journal of Human Resources, 51(2), 461–499. 10.3368/jhr.51.2.0114-6118r1
- Barrera-Osorio F., Filmer D., McIntyre J. (2014). An empirical comparison of randomized control trials and regression discontinuity estimations. SREE Conference Abstract, Society for Research on Educational Effectiveness.
- Behrman J., Todd P. (1999). Randomness in the experimental samples of PROGRESA. International Food Policy Research Institute.
- Bloom H., Hill C., Black A., Lipsey M. (2008). Performance trajectories and performance gaps as achievement effect-size benchmarks for educational interventions. Journal of Research on Educational Effectiveness, 1(4), 289–328. 10.1080/19345740802400072
- Bloom H. S., Michalopoulos C., Hill C. J., Lei Y. (2002). Can nonexperimental comparison group methods match the findings from a random assignment evaluation of mandatory welfare-to-work programs? MDRC Working Papers on Research Methodology. Manpower Demonstration Research Corporation.
- Borenstein M., Hedges L. V., Higgins J. P. T., Rothstein H. (2009). Introduction to meta-analysis. John Wiley and Sons.
- Bound J., Jaeger D. A., Baker R. (1995). Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. Journal of the American Statistical Association, 90(430), 443–450. 10.1080/01621459.1995.10476536
- Buddelmeyer H., Skoufias E. (2004). An evaluation of the performance of regression discontinuity design on PROGRESA. World Bank Policy Research Working Paper 3386. The World Bank.
- Busso M., DiNardo J., McCrary J. (2014). New evidence on the finite sample properties of propensity score reweighting and matching estimators. The Review of Economics and Statistics, 96(5), 885–897. 10.1162/rest_a_00431
- Campbell Collaboration (2021). Campbell systematic reviews: Policies and guidelines. Version 1.8. Campbell Policies and Guidelines Series 1. Campbell Collaboration. 10.4073/cpg.2016.1
- Chaplin D., Cook T., Zurovac J., Coopersmith J., Finucane M., Vollmer L., Morris R. (2018). The internal and external validity of the regression discontinuity design: A meta-analysis of 15 within-study comparisons. Journal of Policy Analysis and Management, 37(2), 403–429. 10.1002/pam.22051
- Chaplin D., Mamun A., Protik A., Schurrer J., Vohra D., Bos K., Burak H., Meyer L., Dumitrescu A., Ksoll A., Cook T. (2017). Grid electricity expansion in Tanzania by MCC: Findings from a rigorous impact evaluation. MPR Report, Mathematica Policy Research.
- Chirgwin H., Cairncross S., Zehra D., Sharma Waddington H. (2021). Interventions promoting uptake of water, sanitation and hygiene (WASH) technologies in low- and middle-income countries: An evidence and gap map of effectiveness studies. Campbell Systematic Reviews, 17(4), Article e1194. 10.1002/cl2.1194
- Coe R. (2002). It's the effect size, stupid: What effect size is and why it is important. Paper presented at the annual conference of the British Educational Research Association, University of Exeter, England, 12–14 September 2002.
- Cook T. D., Shadish W., Wong V. (2008). Three conditions under which experiments and observational studies produce comparable causal estimates: New findings from within-study comparisons. Journal of Policy Analysis and Management, 27(4), 724–750. 10.1002/pam.20375
- Diaz J. J., Handa S. (2005). An assessment of propensity score matching as a nonexperimental impact estimator: Estimates from Mexico’s PROGRESA program. Working Paper OVE/WP-04/05, July 22, 2005, Office of Evaluation and Oversight, Inter-American Development Bank, Washington, D.C.
- Diaz J. J., Handa S. (2006). An assessment of propensity score matching as a nonexperimental impact estimator: Estimates from Mexico’s PROGRESA program. The Journal of Human Resources, 41(2), 319–345. 10.3368/jhr.xli.2.319
- Eldridge S., Campbell M., Campbell M., Drahota A., Giraudeau B., Higgins J., Reeves B., Siegfried N. (2016). Revised Cochrane risk of bias tool for randomized trials (RoB 2.0): Additional considerations for cluster-randomized trials. Available at: https://www.riskofbias.info/welcome/rob-2-0-tool/archive-rob-2-0-cluster-randomized-trials-2016 (accessed 28 October 2020).
- Faulkner W. (2014). A critical analysis of a randomized controlled trial evaluation in Mexico: Norm, mistake or exemplar? Evaluation, 20(2), 230–243. 10.1177/1356389014528602
- Fretheim A., Soumerai S. B., Zhang F., Oxman A. D., Ross-Degnan D. (2013). Interrupted time-series analysis yielded an effect estimate concordant with the cluster-randomized controlled trial result. Journal of Clinical Epidemiology, 66(8), 883–887. 10.1016/j.jclinepi.2013.03.016
- Friedman W., Kremer M., Miguel E., Thornton R. (2016). Education as liberation? Economica, 83(329), 1–30. 10.1111/ecca.12168
- Galiani S., McEwan P. (2013). The heterogeneous impact of conditional cash transfers. Journal of Public Economics, 103(C), 85–96. 10.1016/j.jpubeco.2013.04.004
- Galiani S., McEwan P., Quistorff B. (2017). External and internal validity of a geographic quasi-experiment embedded in a cluster-randomized experiment. In Cattaneo M. D., Escanciano J. C. (Eds.), Regression discontinuity designs: Theory and applications. Advances in Econometrics (Vol. 38, pp. 195–236). Emerald Publishing Limited.
- Glazerman S., Levy D. M., Myers D. (2003). Nonexperimental versus experimental estimates of earnings impacts. The Annals of the American Academy of Political and Social Science, 589(1), 63–93. 10.1177/0002716203254879
- Glewwe P., Kremer M., Moulin S., Zitzewitz E. (2004). Retrospective vs. prospective analyses of school inputs: The case of flip charts in Kenya. Journal of Development Economics, 74(1), 251–268. 10.1016/j.jdeveco.2003.12.010
- Glewwe P., Olinto P. (2004). Evaluating the impact of conditional cash transfers on schooling: An experimental analysis of Honduras’ PRAF program. Final report for USAID, January 2004, Washington, D.C.
- Goldberger A. (1972). Selection bias in evaluation of treatment effects: The case of interaction. Unpublished manuscript.
- Greenland S. (2000). Principles of multilevel modelling. International Journal of Epidemiology, 29(1), 158–167. 10.1093/ije/29.1.158
- Handa S., Maluccio J. A. (2010). Matching the gold standard: Comparing experimental and nonexperimental evaluation techniques for a geographically targeted program. Economic Development and Cultural Change, 58(3), 415–447. 10.1086/650421
- Hansen H., Klejnstrup N. R., Andersen O. W. (2013). A comparison of model-based and design-based impact evaluations of interventions in developing countries. American Journal of Evaluation, 34(3), 320–338. 10.1177/1098214013476915
- Higgins J. P. T., Sterne J. A. C., Savović J., Page M. J., Hróbjartsson A., Boutron I., Reeves B., Eldridge S. (2016). A revised tool for assessing risk of bias in randomized trials. In Chandler J., McKenzie J., Boutron I., Welch V. (Eds.), Cochrane methods. Cochrane Database of Systematic Reviews, 2016, Issue 10 (Suppl 1). 10.1002/14651858.CD201601
- Hill C., Bloom H., Black A., Lipsey M. (2008). Empirical benchmarks for interpreting effect sizes in research. Child Development Perspectives, 2(3), 172–177. 10.1111/j.1750-8606.2008.00061.x
- Hombrados J. G., Waddington H. (2012). A tool to assess risk of bias for experiments and quasi-experiments in development research. Mimeo. International Initiative for Impact Evaluation.
- LaLonde R. (1986). Evaluating the econometric evaluations of training programs with experimental data. American Economic Review, 76(4), 604–620.
- Lamadrid-Figueroa H., Ángeles G., Mroz T., Urquieta-Salomón J., Hernández-Prado B., Cruz-Valdez A., Téllez-Rojo M. M. (2010). Heterogeneous impact of the social programme Oportunidades on use of contraceptive methods by young adult women living in rural areas. Journal of Development Effectiveness, 2(1), 74–86. 10.1080/19439341003599726
- Lee D. S., Lemieux T. (2010). Regression discontinuity designs in economics. Journal of Economic Literature, 48(2), 281–355. 10.1257/jel.48.2.281
- Maluccio J., Flores R. (2004). Impact evaluation of a conditional cash transfer program: The Nicaraguan Red de Protección Social. FCND Discussion Paper No. 184, Food Consumption and Nutrition Division, International Food Policy Research Institute, Washington, D.C.
- Maluccio J., Flores R. (2005). Impact evaluation of a conditional cash transfer program: The Nicaraguan Red de Protección Social. Research Report No. 141. International Food Policy Research Institute.
- McEwan P. J. (2015). Improving learning in primary schools of developing countries: A meta-analysis of randomized experiments. Review of Educational Research, 85(3), 353–394. 10.3102/0034654314553127
- McKenzie D., Stillman S., Gibson J. (2010). How important is selection? Experimental vs nonexperimental measures of the income gains from migration. Journal of the European Economic Association, 8(4), 913–945. 10.1111/j.1542-4774.2010.tb00544.x
- Oosterbeek H., Ponce J., Schady N. (2008). The impact of cash transfers on school enrolment: Evidence from Ecuador. Policy Research Working Paper No. 4645. World Bank.
- Phillips D., Coffey C., Gallagher E., Villar P. F., Stevenson J., Tsoli S., Dhanasekar S., Eyers J. (2017). State-society relations in low- and middle-income countries: An evidence gap map. 3ie Evidence Gap Map 7. The International Initiative for Impact Evaluation.
- Rubalcava L., Teruel G., Thomas D. (2009). Investments, time preferences and public transfers paid to women. Economic Development and Cultural Change, 57(3), 507–538. 10.1086/596617
- Rubin D. B. (1977). Assignment to treatment on the basis of a covariate. Journal of Educational Statistics, 2, 1–26. 10.2307/1164933
- Sabet S. M., Brown A. (2018). Is impact evaluation still on the rise? The new trends in 2010–2015. Journal of Development Effectiveness, 10(3), 291–304. 10.1080/19439342.2018.1483414
- Sandieson R. (2006). Pathfinding in the research forest: The pearl harvesting method for effective information retrieval. Education and Training in Developmental Disabilities, 41(4), 401–409. http://www.jstor.org/stable/23879666
- Savović J., Jones H., Altman D., Harris R., Jüni P., Pildal J., Als-Nielsen B., Balk E., Gluud C., Gluud L., Ioannidis J., Schulz K., Beynon R., Welton N., Wood L., Moher D., Deeks J., Sterne J. (2012). Influence of reported study design characteristics on intervention effect estimates from randomised controlled trials: Combined analysis of meta-epidemiological studies. Health Technology Assessment, 16(35), 1–82. 10.3310/hta16350
- Schmidt W.-P. (2017). Randomised and non-randomised studies to estimate the effect of community-level public health interventions: Definitions and methodological considerations. Emerging Themes in Epidemiology, 14(9), 1–11. 10.1186/s12982-017-0063-5
- Schochet P. Z. (2008). Technical methods report: Statistical power for regression discontinuity designs in education evaluations (NCEE 2008-4026). National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education.
- Skoufias E., Davis B., de la Vega S. (2001). Targeting the poor in Mexico: An evaluation of the selection of households into PROGRESA. World Development, 29(10), 1769–1784. 10.1016/s0305-750x(01)00060-2
- Smith J. C., Todd P. (2005). Does matching overcome LaLonde’s critique of nonexperimental estimators? Journal of Econometrics, 125(1–2), 303–353. 10.1016/j.jeconom.2004.04.011
- Steiner P. M., Wong V. (2018). Assessing correspondence between experimental and non-experimental results in within-study comparisons. Evaluation Review, 42(2), 214–247. 10.1177/0193841x18773807
- Sterne J. A. C., Hernán M., Reeves B. C., Savović J., Berkman N. D., Viswanathan M., Henry D., Altman D. G., Ansari M. T., Boutron I., Carpenter J. R., Chan A. W., Churchill R., Deeks J. J., Hróbjartsson A., Kirkham J., Jüni P., Loke Y. K., Pigott T. D., Higgins J. P. (2016). ROBINS-I: A tool for assessing risk of bias in non-randomised studies of interventions. British Medical Journal, 355, i4919. 10.1136/bmj.i4919
- Sterne J. A. C., Jüni P., Schulz K. F., Altman D. G., Bartlett C., Egger M. (2002). Statistical methods for assessing the influence of study characteristics on treatment effects in ‘meta-epidemiological’ research. Statistics in Medicine, 21(11), 1513–1524. 10.1002/sim.1184
- Tanner-Smith E., Tipton E. (2014). Robust variance estimation with dependent effect sizes: Practical considerations including a software tutorial in Stata and SPSS. Research Synthesis Methods, 5(1), 13–30. 10.1002/jrsm.1091
- Urquieta J., Angeles G., Mroz T., Lamadrid-Figueroa H., Hernández B. (2009). Impact of Oportunidades on skilled attendance at delivery in rural areas. Economic Development and Cultural Change, 57(3), 539–558. 10.1086/596598
- Villar P. F., Waddington H. (2019). Within-study comparison and risk of bias in international development: Systematic review and critical appraisal. Methods research paper. Campbell Systematic Reviews, 15(1–2), Article e1027. 10.1002/cl2.1027
- Vivalt E. (2020). How much can we generalize from impact evaluations? Journal of the European Economic Association, 18(6), 3045–3089.
- Waddington H., Villar P. F., Valentine J. (2018). Within-study design replications of social and economic interventions: Map and systematic review (title registration). The Campbell Collaboration.
- Wong V., Steiner P. (2018). Designs of empirical evaluations of nonexperimental methods in field settings. Evaluation Review, 42(2), 176–213. 10.1177/0193841X18778918
- Wong V., Valentine J., Miller-Bains K. (2017). Empirical performance of covariates in education observational studies. Journal of Research on Educational Effectiveness, 10(1), 207–236. 10.1080/19345747.2016.1164781
Non-Randomised and Quasi-Experimental Designs
Often, random allocation of the intervention under study is not possible and in such cases the primary challenge for investigators is to control confounding.
Members of the Centre for Evaluation organised a multi-disciplinary symposium in London in 2006 to discuss barriers to randomisation, review the issues, and identify practical solutions. The following two papers summarise the arguments presented, drawing on examples from high- and low-income countries:
Alternatives to randomisation in the evaluation of public health interventions: design challenges and solutions Bonell CP, Hargreaves J, Cousens S, Ross D, Hayes R, Petticrew M, Kirkwood BR. Alternatives to randomisation in the evaluation of public health interventions: design challenges and solutions. Journal of Epidemiology and Community Health. 2011 Jul 1;65(7):582-7.
Alternatives to randomisation in the evaluation of public-health interventions: statistical analysis and causal inference Cousens S, Hargreaves J, Bonell C, Armstrong B, Thomas J, Kirkwood BR, Hayes R. Alternatives to randomisation in the evaluation of public-health interventions: statistical analysis and causal inference. Journal of epidemiology and community health. 2009 Aug 6:jech-2008.
Some methodological approaches beyond stratification and regression for addressing confounding in quasi-experimental or non-randomised designs are highlighted below.
Within the Centre for Evaluation at LSHTM, this type of work is carried out in close collaboration with the LSHTM Centre for Statistical Methodology, in particular the Causal Inference, Missing Data, and Time Series Regression Analysis groups, as well as LSHTM’s MARCH (Maternal Adolescent Reproductive and Child Health) Centre.
Craig and colleagues from the UK Medical Research Council have introduced new guidance on the use of some of these methods, and others, under the umbrella term of natural experiments:
Using natural experiments to evaluate population health interventions: new Medical Research Council guidance Craig P, Cooper C, Gunnell D, Haw S, Lawson K, Macintyre S, Ogilvie D, Petticrew M, Reeves B, Sutton M, Thompson S. Using natural experiments to evaluate population health interventions: new Medical Research Council guidance. Journal of epidemiology and community health. 2012 May 10:jech-2011.
Difference in Differences
This method is used to evaluate the impact of interventions that are non-randomly allocated to a sub-set of potentially eligible places. The change in the outcomes in places that got the intervention (the ‘difference’) is compared to the change in the outcomes in the places that did not get the intervention: hence the difference-in-the-differences.
This approach requires data from before and after the intervention is delivered, in places that do and do not get the intervention, and the effect is often estimated as the interaction between the change over time and the allocation group (i.e. whether or not a place got the intervention) in a regression model, as sketched below.
It is possible that the places that receive the intervention differ at baseline from the places that do not, in terms of the outcome of interest, and this method accounts for that possibility. However, the method assumes that, in the absence of the intervention, the outcome would have changed at the same rate in the intervention and comparison places; this is often referred to as the ‘parallel trends’ (or ‘parallel lines’) assumption. Therefore, while the method can account for differences at baseline, it cannot account for a differing rate of change over time that is not due to the intervention. The assumption cannot be tested directly, since it concerns the counterfactual state: what would have happened without the intervention, which is not observed. Researchers can examine trends in other related outcomes, or trends in the outcome of interest before the intervention started, to look for evidence supporting the assumption about the trends they cannot actually see.
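As a minimal sketch of that interaction specification, assuming simulated two-period data and invented variable names (not drawn from any of the studies discussed here):

```python
# Minimal difference-in-differences sketch on simulated two-period data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
places, periods = 40, 2
df = pd.DataFrame({
    "place": np.repeat(np.arange(places), periods),
    "post": np.tile([0, 1], places),                      # before/after the intervention
})
df["treated"] = (df["place"] < places // 2).astype(int)   # half the places get the intervention
# Outcome: baseline gap of 1.5, common time trend of 1.0, true intervention effect of 2.0.
df["y"] = (
    5 + 1.5 * df["treated"] + 1.0 * df["post"]
    + 2.0 * df["treated"] * df["post"]
    + rng.normal(0, 1, len(df))
)

# The DiD estimate is the coefficient on the treated-by-post interaction.
did = smf.ols("y ~ treated * post", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["place"]}
)
print(did.params["treated:post"])   # should be close to 2.0
```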
In the following paper, Timothy Powell-Jackson and colleagues investigated the effect of a demand-side financial incentive intervention to increase the uptake of maternity services in India using difference-in-differences, together with a number of diagnostics to assess the assumptions underlying the method:
Financial incentives in health: New evidence from India’s Janani Suraksha Yojana Powell-Jackson T, Mazumdar S, Mills A. Financial incentives in health: New evidence from India’s Janani Suraksha Yojana. Journal of health economics. 2015 Sep 30;43:154-69.
Regression Discontinuity
Regression discontinuity is used to evaluate the impact of interventions when allocation is determined by a cut-off value on a numerical scale. For example, if counties with a population of over one million are allocated to receive an intervention, while those with a lower population are not, then regression discontinuity could be used.
Regression discontinuity compares outcomes in places that fall within a narrow range on either side of the cut-off value. For example, any place with a population short of or over one million by, say, 50,000 people could be included in the comparison. This method assumes that places on either side of the cut-off value are very similar, and therefore, the allocation of an intervention based solely on an arbitrary cut-off value may be as good as a random allocation. The method requires few additional assumptions and has been shown to be valid.
It is important to bear in mind that the effect is estimated only for places that fall within a range around the cut-off value, and therefore cannot be generalised to places that are markedly different, such as those with much smaller or much larger populations.
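A minimal sketch of this local comparison, assuming a simulated assignment variable, a cut-off of one million and a bandwidth of 50,000 (all values illustrative, not from the Cameroon study below):

```python
# Minimal regression discontinuity sketch: local linear fit around an assumed cut-off.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 2000
population = rng.uniform(0.5e6, 1.5e6, n)          # assignment variable
treated = (population >= 1e6).astype(float)        # allocation by the cut-off rule
y = 10 + 3e-6 * population + 1.5 * treated + rng.normal(0, 1, n)  # true jump at cut-off = 1.5

# Keep observations within +/- 50,000 of the cut-off and allow different slopes on each side.
bandwidth = 50_000
centred = population - 1e6
keep = np.abs(centred) <= bandwidth
X = np.column_stack([treated, centred, treated * centred])[keep]
rdd = sm.OLS(y[keep], sm.add_constant(X)).fit()
print(f"estimated discontinuity at the cut-off: {rdd.params[1]:.2f}")   # should be close to 1.5
```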
In the paper below, Arcand and colleagues investigated the effect of an HIV education intervention in Cameroon that was allocated according to the number of schools in the town.
Teacher training and HIV/AIDS prevention in West Africa: regression discontinuity design evidence from the Cameroon Arcand JL, Wouabe ED. Teacher training and HIV/AIDS prevention in West Africa: regression discontinuity design evidence from the Cameroon. Health Economics. 2010 Sep 1;19(S1):36-54.
The Centre for Statistical Methodology at LSHTM provides a guide to conducting Time Series Regression Analysis, including methodological challenges, researchers with relevant expertise, and references on methods and publications.
Interrupted Time Series
The interrupted time series method is used to estimate the effect of an intervention by examining the change in the trend of an outcome after the intervention is introduced. It can be used when comparison places are not available because all eligible places receive the intervention.
This method requires a large amount of data, collected at many time points before and after the intervention is introduced, to allow modelling of what the trend in the outcome would have been had the intervention not been introduced. The modelled trend is then compared with what actually occurred. Any change in the level of the outcome, or in its rate of change over time, relative to the modelled trend can be interpreted as the effect of the intervention.
It is possible that changes in the trend in the outcome are due to factors other than the intervention. This can be addressed by investigating events or policy changes that took place at the same time. Alternatively, similar to the approach used in the difference-in-differences method to assess the counterfactual rate of change over time, researchers may investigate ‘control trends’. This is done by examining other related outcomes that would be affected by most of the plausible alternative explanations for the observed change in trend, but not by the intervention itself.
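The basic segmented-regression form of such a model can be sketched as follows, using simulated monthly data and invented variable names; a real analysis would also address autocorrelation and the control trends described above.

```python
# Segmented-regression sketch of an interrupted time series on simulated monthly data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
months = np.arange(60)
intervention_month = 36
df = pd.DataFrame({"time": months, "post": (months >= intervention_month).astype(int)})
df["time_since"] = np.where(df["post"] == 1, df["time"] - intervention_month, 0)
# Outcome: pre-intervention trend, then a level drop of 8 and a flattening of the slope.
df["y"] = 100 + 0.5 * df["time"] - 8 * df["post"] - 0.3 * df["time_since"] + rng.normal(0, 2, 60)

# 'post' captures the immediate level change; 'time_since' the change in trend.
# HAC standard errors allow for autocorrelation in the series.
its = smf.ols("y ~ time + post + time_since", data=df).fit(
    cov_type="HAC", cov_kwds={"maxlags": 2}
)
print(its.params[["post", "time_since"]])
```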
In the paper below, the authors investigate the effect of a pneumococcal vaccine on pneumonia admissions. They considered that changes in the wider healthcare system might also have affected pneumonia admissions, so they investigated trends in another related outcome: admissions for dehydration. The assumptions were that most of the plausible alternative explanations, such as policy changes or changes to the delivery of healthcare, would have affected dehydration admissions to the same extent as pneumonia admissions; that dehydration admissions would not be affected by the vaccine; and that pneumonia did not itself cause substantial numbers of dehydration admissions. Using this approach, they were able to show more convincingly that the vaccine brought about the change in trend.
Decline in pneumonia admissions after routine childhood immunisation with pneumococcal conjugate vaccine in the USA: a time-series analysis Grijalva CG, Nuorti JP, Arbogast PG, Martin SW, Edwards KM, Griffin MR. Decline in pneumonia admissions after routine childhood immunisation with pneumococcal conjugate vaccine in the USA: a time-series analysis. The Lancet. 2007 Apr 13;369(9568):1179-86.
In another paper, below, Lopez Bernal and colleagues used interrupted time series analysis to investigate the effect of the late-2000s financial crisis on suicides in Spain.
The effect of the late 2000s financial crisis on suicides in Spain: an interrupted time-series analysis Bernal JA, Gasparrini A, Artundo CM, McKee M. The effect of the late 2000s financial crisis on suicides in Spain: an interrupted time-series analysis. The European Journal of Public Health. 2013 Jun 25:ckt083.
Synthetic Controls
The synthetic control method is a relatively new approach for evaluating the impact of interventions using data, collected over time, from places that did not get the intervention. The method works by first looking at trends in the outcome of interest before the intervention was introduced. The data from the various places that do not ultimately get the intervention are each given a weight so that their weighted average looks as much as possible like the trend in the place that will get the intervention. This weighted average is the ‘synthetic control’. The same weights, unchanged, are then applied to the untreated places’ data after the intervention has been introduced, and the resulting weighted average is compared with the actual trend in the place with the intervention. This comparison can be used to estimate the impact. As with the other methods discussed earlier, researchers must assume that no other intervention or policy change is happening in the treated place at the same time. The method requires a lot of data, both from many places and over time. It does not rely on parameterised models, so inferential statistics are usually calculated using permutation methods rather than more traditional approaches.
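A simplified sketch of the weighting step is shown below, assuming simulated outcome paths and a generic optimiser; this is not the original authors' implementation, which also matches on covariates and uses more elaborate optimisation and inference.

```python
# Simplified synthetic control: choose donor weights on pre-intervention fit, then carry them forward.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
pre_T, post_T, donors = 20, 10, 15

# Simulated outcome paths for a donor pool; the treated unit resembles a mix of the first three donors.
donor_pre = rng.normal(0, 1, (pre_T, donors)).cumsum(axis=0) + rng.uniform(5, 15, donors)
treated_pre = donor_pre[:, :3].mean(axis=1) + rng.normal(0, 0.2, pre_T)
donor_post = donor_pre[-1] + rng.normal(0, 1, (post_T, donors)).cumsum(axis=0)
treated_post = donor_post[:, :3].mean(axis=1) - 2.0          # true post-intervention effect of -2

# Non-negative weights summing to one, chosen so the weighted donor average tracks
# the treated unit before the intervention.
def pre_fit_loss(w):
    return np.sum((treated_pre - donor_pre @ w) ** 2)

w0 = np.full(donors, 1 / donors)
res = minimize(pre_fit_loss, w0, method="SLSQP",
               bounds=[(0, 1)] * donors,
               constraints={"type": "eq", "fun": lambda w: w.sum() - 1})
weights = res.x

# The synthetic control is the same weighted average of the donors after the intervention.
synthetic_post = donor_post @ weights
print(f"estimated average effect: {(treated_post - synthetic_post).mean():.2f}")  # roughly -2
```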
In the paper below, Abadie and colleagues introduced the method and applied it to investigate the impact of a tobacco control policy change on cigarette consumption in California, by comparing the trend in California with a weighted-average of the trends in the other states in the USA.
Synthetic control methods for comparative case studies: Estimating the effect of California’s tobacco control program Abadie A, Diamond A, Hainmueller J. Synthetic control methods for comparative case studies: Estimating the effect of California’s tobacco control program. Journal of the American statistical Association. 2012.