National Information Center on Health Services Research and Health Care Technology (NICHSR)
HTA 101: III. PRIMARY DATA METHODS
- A. Primary Data Studies: Diverse Attributes
- B. Assessing the Quality of Primary Data Studies
- C. Instruments for Assessing Quality of Individual Studies
- D. Strengths and Limitations of RCTs
- E. Different Study Designs for Different Questions
- F. Complementary Methods for Internal and External Validity
- G. Evidence Hierarchies
- H. Alternative and Emerging Study Designs Relevant to HTA
- I. Collecting New Primary Data
- References for Chapter III
A. Primary Data Studies: Diverse Attributes
Primary data methods involve collection of original data, ranging from more scientifically rigorous approaches for determining the causal effect of health technologies, such as randomized controlled trials (RCTs), to less rigorous ones, such as case series. These study designs can be described and categorized based on multiple attributes or dimensions, e.g.:
- Comparative vs. non-comparative
- Separate (i.e., external) control group vs. no separate (i.e., internal) control group
- Participants (study populations/groups) defined by a health outcome vs. by exposure to, receipt of, or assignment to an intervention
- Prospective vs. retrospective
- Interventional vs. observational
- Experimental vs. non-experimental
- Random assignment vs. non-random assignment of patients to treatment and control groups
All experimental studies are, by definition, interventional studies. Some non-experimental studies can be interventional, e.g., if investigators assign a technology to a patient population without a control group or with a non-randomized control group, and then assess outcomes. An interventional cross-sectional design can be used to assess the accuracy of a diagnostic test. Some study designs, such as the RCT, are better at rigorous demonstration of causality in well-defined circumstances. Other study designs may better reflect real-world practice, such as pragmatic clinical trials and certain observational studies (e.g., cohort, cross-sectional, or case-control studies using data from registries, surveillance, electronic health (or medical) records, and payment claims).
Box III-1. Examples of Experimental and Non-Experimental Study Designs
|Experimental Studies|Non-experimental Studies|
|---|---|
|Randomized controlled trial|Prospective cohort|
|Randomized cross-over trial|Retrospective cohort|
|Group randomized trial|Cross-sectional|
|Non-randomized controlled trial*|Interrupted time series with comparison|
|Pragmatic trials (randomized or non-randomized)|Non-concurrent cohort|
| |Interrupted time series without comparison|
*A controlled trial in which participants are assigned to treatment and control groups using a method other than randomization, yet intended to form similar groups. Sometimes known as a “quasi-experimental” design.
Box III-1 categorizes various types of primary data studies as experimental and non-experimental. Researchers have developed various frameworks, schemes, and other tools for classifying study designs, such as for the purpose of conducting systematic reviews (Hartling 2010). Box III-2 and Box III-3 show algorithms for identifying study designs. Some of these study designs have alternative names, and some studies use diverse combinations of design attributes.
Box III-2. Study Design Algorithm, Guide to Community Preventive Services
Source: Briss PA, Zaza S, Pappaioanou M, Fielding J, et al. Developing an evidence-based Guide to Community Preventive Services. Am J Prev Med 2000;18(1S):35-43. Copyright © 2000, with permission from Elsevier.
Box III-3. Design Algorithm for Studies of Health Care Interventions*
*Developed by, but no longer advocated by, the Cochrane Non-Randomised Studies Methods Group.
Source: Hartling L, et al. Developing and Testing a Tool for the Classification of Study Designs in Systematic Reviews of Interventions and Exposures. Agency for Healthcare Research and Quality; December 2010. Methods Research Report. AHRQ Publication No. 11-EHC-007.
Although the general type of a study design (e.g., RCT, prospective cohort study, case series) conveys certain attributes about the quality of a study (e.g., control group, random assignment), study design type alone is not a good proxy for study quality. More important are the attributes of study design and conduct that diminish sources of bias and random error, as described below.
New types of observational study designs are emerging in the form of patient-centered online registries and related research platforms. For example, PatientsLikeMe, a patient network, is set up for entry of member patient demographic information, treatment history, symptoms, outcome data, and evaluations of treatments, as well as production of individual longitudinal health profiles and aggregated reports. Such patient-centered registries can supplement clinical trials and provide useful postmarket data across heterogeneous patients and circumstances (Frost 2011, Nakamura 2012).
Most HTA programs rely on integrative methods (especially systematic reviews), particularly to formulate findings based on available evidence from primary data studies that are identified through systematic literature searches. Some HTA programs collect primary data, or are part of larger organizations that collect primary data. It is not always possible to conduct, or base an assessment on, the most rigorously designed studies. Indeed, policies and decisions often must be made in the absence, or before completion, of definitive studies. Given their varying assessment purposes, resource constraints, and other factors, HTA programs use evidence from various study designs, although they usually emphasize evidence based on the more rigorous and systematic methods of data collection.
The following sections describe concepts that affect the quality of primary data studies, particularly their ability to yield unbiased and precise estimates of treatment effects and other findings.
B. Assessing the Quality of Primary Data Studies
Our confidence that a study's estimate of a treatment effect, of the accuracy of a screening or diagnostic test, or of another impact of a health care technology is correct reflects our understanding of the quality of that study. For various types of interventions, we examine certain attributes of the design and conduct of a study to assess its quality. For example, some of the attributes or criteria commonly used to assess the quality of studies for demonstrating the internal validity of the impact of therapies on health outcomes are the following:
- Prospective, i.e., following a study population over time as it receives an intervention or exposure and experiences outcomes, rather than retrospective design
- Experimental rather than observational
- Controlled, i.e., with one or more comparison groups, rather than uncontrolled
- Contemporaneous control groups rather than historical ones
- Internal (i.e., managed within the study) control groups rather than external ones
- Allocation concealment of patients to intervention and control groups
- Randomized assignment of patients to intervention and control groups
- Blinding of patients, clinicians, and investigators as to patient assignment to intervention and control groups
- Large enough sample size (number of patients/participants) to detect true treatment effects with statistical significance
- Minimal patient drop-out or loss to follow-up (or differences in these between intervention and control groups) for the duration of the study
- Consistency of pre-specified study protocol (patient populations, assignment to intervention and control groups, regimens, etc.) and outcome measures with the reported (post-study) protocol and outcome measures
Similarly, some attributes that are commonly used for assessing the external validity of the impact of therapies and other technologies on health outcomes include:
- Flexible entry criteria to identify/enroll patient population that is representative of patient diversity likely to be offered the intervention in practice, including demographic characteristics, risk factors, disease stage/severity, comorbidities
- Large enough patient population to conduct meaningful subgroup analyses (especially for pre-specified subgroups)
- Dosing, regimen, technique, delivery of the intervention consistent with anticipated practice
- Comparator is standard of care or other relevant, clinically acceptable (not substandard) intervention
- Dosing, regimen, or other forms of delivering the comparator consistent with standard care
- Patient monitoring and efforts to maintain patient adherence comparable to those in practice
- Accompanying/concurrent/ancillary care similar to what will be provided in practice
- Training, expertise, skills of clinicians and other health care providers similar to those available or feasible for providers anticipated to deliver the intervention
- Selection of outcome measures relevant to those experienced by and important to intended patient groups
- Systematic effort to follow up on all patients to minimize attrition
- Intention-to-treat analysis used to account for all study patients
- Study duration consistent with the course/episode of disease/condition in practice in order to detect outcomes of importance to patients and clinicians
- Multiple study sites representative of type/level of health care settings and patient and clinician experience anticipated in practice
RCTs are designed to maximize internal validity, and are generally regarded as the “gold standard” study design for demonstrating the causal impact of a technology on health care outcomes. However, some attributes that strengthen the internal validity of RCTs tend to diminish RCTs’ external validity. Probing the strengths and limitations of RCTs with respect to internal and external validity is also instructive for understanding the utility of other studies. A variety of design aspects intended to improve the external validity of RCTs and related experimental designs are described briefly later in this chapter.
The commonly recognized attributes of study quality noted above that strengthen internal and external validity of primary data studies are derived from an extensive body of methodological concepts and principles, including those summarized below: confounding and the need for controls, prospective vs. retrospective design, sources of bias, random error, and selected other factors.
1. Types of Validity in Methods and Measurement
Whether they are experimental or non-experimental in design, studies vary in their ability to produce valid findings. Validity refers to how well a study or data collection instrument measures what it is intended to measure. Understanding different aspects of validity helps in comparing strengths and weaknesses of alternative study designs and our confidence in the findings generated by those studies. Although these concepts are often addressed in reference to primary data methods, they generally apply as well to integrative methods.
Internal validity refers to the extent to which the results of a study accurately represent the causal relationship between an intervention and an outcome in the particular circumstances of that study. This includes the extent to which the design and conduct of a study minimize the risk of any systematic (non-random) error (i.e., bias) in the study results. Internal validity can be suspect when biases in the design or conduct of a clinical trial or other study could have affected outcomes, thereby causing the study results to deviate from the true magnitude of the treatment effect. True experiments such as RCTs generally have high internal validity.
External validity refers to the extent to which the results of a study conducted under particular circumstances can be generalized (or are applicable) to other circumstances. When the circumstances of a particular study (e.g., patient characteristics, the technique of delivering a treatment, or the setting of care) differ from the circumstances of interest (e.g., patients with different characteristics, variations in the technique of delivering a treatment, or alternative settings of care), the external validity of the results of that study may be limited.
Construct validity refers to how well a measure is correlated with other accepted measures of the construct of interest (e.g., pain, anxiety, mobility, quality of life), and discriminates between groups known to differ according to the construct. Face validity is the ability of a measure to represent reasonably (i.e., to be acceptable “on its face”) a construct (i.e., a concept, trait, or domain of interest) as judged by someone with knowledge or expertise in the construct.
Content validity refers to the degree to which the set of items of a data collection instrument is known to represent the range or universe of meanings or dimensions of a construct of interest. For example, how well do the domains of a health-related quality of life index for rheumatoid arthritis represent the aspects of quality of life or daily functioning that are important to patients with rheumatoid arthritis?
Criterion validity refers to how well a measure, including its various domains or dimensions, is correlated with a known gold standard or definitive measurement, if one exists. The similar concept of concurrent validity refers to how well a measure correlates with a previously validated one, and the ability of a measure to accurately differentiate between different groups at the time the measure is applied. Predictive validity refers to the ability to use differences in a measure of a construct to predict future events or outcomes. It may be considered a subtype of criterion validity.
Convergent validity refers to the extent to which different measures that are intended to measure the same construct actually yield similar results, such as two measures of quality of life. Discriminant validity concerns whether different measures that are intended to measure different constructs actually fail to be positively associated with each other. Convergent validity and discriminant validity contribute to, or can be considered subtypes of, construct validity.
2. Confounding and the Need for Controls
Confounding occurs when any factor that is associated with an intervention has an impact on an outcome that is independent of the impact of the intervention. As such, confounding can “mask” or muddle the true impact of an intervention. In order to diminish any impact of confounding factors, it is necessary to provide a basis for comparing what happens to patients who receive an intervention to those who do not.
The main purpose of control groups is to enable isolating the impact of an intervention of interest on patient outcomes from the impact of any extraneous factors. The composition of the control group is intended to be as close as possible to that of the intervention group, and both groups are managed as similarly as possible, so that the only difference between the groups is that one receives the intervention of interest and the other does not. In controlled clinical trials, the control groups may receive a current standard of care, no intervention, or a placebo.
For a factor to be a confounder in a controlled trial, it must differ for the intervention and control groups and be predictive of the treatment effect, i.e., it must have an impact on the treatment effect that is independent of the intervention of interest. Confounding can arise due to differences between the intervention and control groups, such as differences in baseline risk factors at the start of a trial or different exposures during the trial that could affect outcomes. Investigators may not be aware of all potentially confounding factors in a trial. Examples of potentially confounding factors are age, prevalence of comorbidities at baseline, and different levels of ancillary care. To the extent that potentially confounding factors are present at different rates between comparison groups, a study is subject to selection bias (described below).
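A small numerical sketch (hypothetical counts, not from the original text) shows how a confounder such as disease severity can mask a true treatment effect: if treated patients are sicker at baseline, the crude comparison can make an effective treatment look worse, while stratifying by severity reveals the benefit.

```python
# Hypothetical counts: (number improved, number of patients) per severity
# stratum. The treated group contains many more severe cases, so severity
# confounds the crude comparison.
treated = {"mild": (18, 20), "severe": (32, 80)}
control = {"mild": (56, 80), "severe": (6, 20)}

def crude_rate(group):
    """Improvement rate ignoring the confounder (pooled over strata)."""
    improved = sum(i for i, _ in group.values())
    total = sum(t for _, t in group.values())
    return improved / total

def stratum_rates(group):
    """Improvement rate within each severity stratum."""
    return {s: i / t for s, (i, t) in group.items()}

print(crude_rate(treated), crude_rate(control))  # 0.50 vs 0.62: treatment looks worse
print(stratum_rates(treated))                    # mild 0.90, severe 0.40
print(stratum_rates(control))                    # mild 0.70, severe 0.30
```

Within every stratum the treated patients do better, yet the crude rates point the other way; randomization avoids this by tending to balance severity (and unrecognized confounders) across groups.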
Most controlled studies use contemporaneous controls alongside (i.e., constituted and followed simultaneously with) intervention groups. Investigators sometimes rely on historical control groups. However, a historical control group is subject to known or unknown inherent differences (e.g., risk factors or other prognostic factors) from a current intervention group, and environmental or other contextual differences arising due to the passage of time that may confound outcomes. In some instances, including those noted below, historical controls have sufficed to demonstrate definitive treatment effects. In a crossover design study, patients start in one group (intervention or control) and then are switched to the other (sometimes multiple times), thereby acting as their own controls, although such designs are subject to certain forms of bias.
Various approaches are used to ensure that intervention and control groups comprise patients with similar characteristics, diminishing the likelihood that baseline differences between them will confound observed treatment effects. The best of these approaches is randomization of patients to intervention and control groups. Random allocation diminishes the impact of any potentially known or unrecognized confounding factors by tending to distribute those factors evenly across the groups to be compared. “Pseudo-randomization” approaches such as alternate assignment or using birthdays or identification numbers to assign patients to intervention and control groups can be vulnerable to confounding.
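As an illustration of true random allocation (a sketch, not part of the original text), permuted-block randomization is one widely used scheme: each block contains equal numbers of treatment and control assignments in a random order, keeping group sizes balanced while leaving individual assignments unpredictable, unlike alternate assignment or birthday-based schemes.

```python
import random

def permuted_block_randomization(n_patients, block_size=4, seed=2024):
    """Generate a treatment ('T') / control ('C') allocation sequence
    in permuted blocks of equal composition.

    Unlike pseudo-randomization (e.g., alternate assignment), the order
    within each block cannot be anticipated by enrolling staff.
    """
    rng = random.Random(seed)
    sequence = []
    while len(sequence) < n_patients:
        block = ["T"] * (block_size // 2) + ["C"] * (block_size // 2)
        rng.shuffle(block)  # random order within the block
        sequence.extend(block)
    return sequence[:n_patients]

allocation = permuted_block_randomization(12)
print(allocation)             # e.g., a shuffled sequence of 'T' and 'C'
print(allocation.count("T"))  # 6 -- groups stay balanced
```

In an actual trial the sequence would be generated and held centrally (see allocation concealment below) rather than computed at the enrollment site.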
Among the ongoing areas of methodological controversy in clinical trials is the appropriate use of placebo controls. Issues include: (1) the appropriateness of using a placebo in a trial of a new therapy when a therapy judged to be effective already exists; (2) statistical requirements for discerning what may be smaller differences in outcomes between a new therapy and an existing one compared to differences in outcomes between a new therapy and a placebo; (3) concerns about comparing a new treatment to an existing therapy that, except during the trial itself, may be unavailable in a given setting (e.g., a developing country) because of its cost or other economic or social constraints (Rothman 1994; Varmus 1997); and (4) when and how to use the placebo effect to patient advantage. As with other health technologies, surgical procedures can be subject to the placebo effect. Following previous missteps that raised profound ethical concerns, guidance was developed for using “sham” procedures as placebos in RCTs of surgical procedures (Horng 2003). Some instances of patient blinding have been most revealing about the placebo effect in surgery, including arthroscopic knee surgery (Moseley 2002), percutaneous myocardial laser revascularization (Stone 2002), and neurotransplantation surgery (Boer 2002). Even so, the circumstances in which placebo surgery is ethically and scientifically acceptable as well as practically feasible and acceptable to enrolled patients may be very limited (Campbell 2011).
In recent years there has been considerable scientific progress in understanding the physiological and psychological basis of the placebo response, prompting efforts to put it to use in improving outcomes. It remains important to control for the placebo effect in order to minimize its confounding effect on evaluating the treatment effect of an intervention. However, once a new drug or other technology is in clinical use, the patient expectations and learning mechanisms contributing to the placebo effect may be incorporated into medication regimens to improve patient satisfaction and outcomes. Indeed, this approach may be personalized based on patient genomics, medical history, and other individual characteristics (Enck 2013).
3. Prospective vs. Retrospective Design
Prospective studies are planned and implemented by investigators using real-time data collection. These typically involve identification of one or more patient groups according to specified risk factors or exposures, followed by collection of baseline (i.e., initial, prior to intervention) data, delivery of interventions of interest and controls, collecting follow-up data, and comparing baseline to follow-up data for the patient groups. In retrospective studies, investigators collect samples of data from past interventions and outcomes involving one or more patient groups.
Prospective studies are usually subject to fewer types of confounding and bias than retrospective studies. In particular, retrospective studies are more subject to selection bias than prospective studies. In retrospective studies, patients’ interventions and outcomes have already transpired and been recorded, raising opportunities for intentional or unintentional selection bias on the part of investigators. In prospective studies, patient enrollment and data collection can be designed to reduce bias (e.g., selection bias and detection bias), which is an advantage over most retrospective studies. Even so, the logistical challenges of maintaining blinding of patients and investigators are considerable, and unblinding can introduce performance and detection bias.
Prospective and retrospective studies have certain other relative advantages and disadvantages that render them more or less useful for certain types of research questions. Both are subject to certain types of missing or otherwise limited data. As retrospective studies primarily involve selection and analyses of existing data, they tend to be far less expensive than prospective studies. However, their dependence on existing data makes it difficult to fill data gaps or add data fields to data collection instruments, although they can rely in part on importing and adjusting data from other existing sources. Given the costs of enrolling enough patients and collecting sufficient data to achieve statistical significance, prospective studies tend to be more suited to investigating health problems that are prevalent and yield health outcomes or other events that occur relatively frequently and within short follow-up periods. The typically shorter follow-up periods of prospective studies may subject them to seasonal or other time-dependent biases, whereas retrospective studies can be designed to extract data from longer time spans. Retrospective studies offer the advantage of being able to canvass large volumes of data over extended time periods (e.g., from registries, insurance claims, and electronic health records) to identify patients with specific sets of risk factors or rare or delayed health outcomes, including certain adverse events.
4. Sources of Bias
The quality of a primary data study determines our confidence that its estimated treatment effect is correct. Bias refers to any systematic (i.e., not due to random error) deviation of an observation from the true nature of an event. In clinical trials, bias may arise from any factor that systematically distorts (increases or decreases) the observed magnitude of an outcome (e.g., treatment effect or harm) relative to the true magnitude of the outcome. As such, bias diminishes the accuracy (though not necessarily the precision; see discussion below) of an observation. Biases may arise from inadequacies in the design, conduct, analysis, or reporting of a study.
Major types of bias in comparative primary data studies are described below, including selection bias, performance bias, detection bias, attrition bias, and reporting bias (Higgins, Altman, Gøtzsche 2011; Higgins, Altman, Sterne 2011; Viswanathan 2014). Also noted are techniques and other study attributes that tend to diminish each type of bias. These attributes for diminishing bias also serve as criteria for assessing the quality of individual studies.
Selection bias refers to systematic differences between baseline characteristics of the groups that are compared, which can arise from, e.g., physician assignment of patients to treatments, patient self-selection of treatments, or association of treatment assignment with patient clinical characteristics or demographic factors. Among the means for diminishing selection bias are random sequence generation (random allocation of patients to treatment and control groups) and allocation concealment for RCTs, control groups to diminish confounders in cohort studies, and case matching in case-control studies.
Allocation concealment refers to the process of ensuring that the persons assessing patients for potential entry into a trial, as well as the patients, do not know whether any particular patient will be allocated to an intervention group or control group. This helps to prevent the persons who manage the allocation, or the patients, from influencing (intentionally or not) the assignment of a patient to one group or another. Patient allocation based on personal identification numbers, birthdates, or medical record numbers may not ensure concealment. Better methods include centralized randomization (i.e., managed at one site rather than at each enrollment site); sequentially numbered, opaque, sealed envelopes; and coded medication bottles or containers.
Performance bias refers to systematic differences between comparison groups in the care that is provided, or in exposure to factors other than the interventions of interest. This includes, e.g., deviating from the study protocol or assigned treatment regimens so that patients in control groups receive the intervention of interest, providing additional or co-interventions unevenly to the intervention and control groups, and inadequately blinding providers and patients to assignment to intervention and control groups, thereby potentially affecting whether or how assigned interventions or exposures are delivered. Techniques for diminishing performance bias include blinding of patients and providers (in RCTs and other controlled trials in particular), adhering to the study protocol, and sustaining patients’ group assignments.
Detection (or ascertainment) bias refers to systematic differences between groups in how outcomes are assessed. These differences may arise due to, e.g., inadequate blinding of outcome assessors regarding patient treatment assignment, reliance on patient or provider recall of events (also known as recall bias), inadequate outcome measurement instruments, or faulty statistical analysis. Whereas allocation concealment is intended to ensure that persons who manage patient allocation, as well as the patients themselves, do not influence patient assignment to one group or another, blinding refers to preventing anyone who could influence assessment of outcomes from knowing which patients have been assigned to one group or another. Knowledge of patient assignment itself can affect outcomes as experienced by patients or assessed by investigators. The techniques for diminishing detection bias include blinding of outcome assessors including patients, clinicians, investigators, and/or data analysts, especially for subjective outcome measures; and validated and reliable outcome measurement instruments and techniques.
Attrition bias refers to systematic differences between comparison groups in withdrawals (drop-outs) from a study, loss to follow-up, or other exclusion of patients/participants and how these losses are analyzed. Ignoring these losses or accounting for them differently between groups can skew study findings, as patients who withdraw or are lost to follow-up may differ systematically from those patients who remain for the duration of the study. Indeed, patients’ awareness of whether they have been assigned to a particular treatment or control group may differentially affect their likelihood of dropping out of a trial. Techniques for diminishing attrition bias include blinding of patients as to treatment assignment, completeness of follow-up data for all patients, and intention-to-treat analysis (with imputations for missing data as appropriate).
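A minimal sketch (using hypothetical patient records, not data from the original text) contrasts intention-to-treat analysis, which keeps every patient in the arm they were assigned to, with a per-protocol analysis that drops non-completers and is therefore vulnerable to attrition bias.

```python
# Hypothetical trial records: each patient has an assigned arm, whether
# they completed the protocol, and an outcome (1 = improved, 0 = not).
patients = [
    {"arm": "treatment", "completed": True,  "outcome": 1},
    {"arm": "treatment", "completed": True,  "outcome": 1},
    {"arm": "treatment", "completed": False, "outcome": 0},  # dropped out
    {"arm": "treatment", "completed": True,  "outcome": 0},
    {"arm": "control",   "completed": True,  "outcome": 0},
    {"arm": "control",   "completed": True,  "outcome": 1},
    {"arm": "control",   "completed": True,  "outcome": 0},
    {"arm": "control",   "completed": False, "outcome": 0},  # dropped out
]

def response_rate(records):
    return sum(p["outcome"] for p in records) / len(records)

# Intention-to-treat: analyze everyone in the arm they were assigned to.
itt_treatment = response_rate([p for p in patients if p["arm"] == "treatment"])

# Per-protocol: analyze only completers -- biased if drop-outs differ
# systematically from patients who remain.
pp_treatment = response_rate([p for p in patients
                              if p["arm"] == "treatment" and p["completed"]])

print(f"ITT treatment response:          {itt_treatment:.2f}")  # 0.50
print(f"Per-protocol treatment response: {pp_treatment:.2f}")   # 0.67
```

Dropping the non-completer inflates the apparent response rate, which is why intention-to-treat analysis (with appropriate handling of missing outcome data) is the standard primary analysis.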
Reporting bias refers to systematic differences between reported and unreported findings, including, e.g., differential reporting of outcomes between comparison groups and incomplete reporting of study findings (such as reporting statistically significant results only). Also, narrative and systematic reviews that do not report search strategies or disclose potential conflicts of interest raise concerns about reporting bias as well as selection bias (Roundtree 2009). Techniques for diminishing reporting bias include thorough reporting of outcomes consistent with outcome measures specified in the study protocol, including attention to documentation and rationale for any post-hoc (after the completion of data collection) analyses not specified prior to the study, and reporting of literature search protocols and results for review articles. Reporting bias, which concerns differential or incomplete reporting of findings in individual studies, is not the same as publication bias, which concerns the extent to which all relevant studies on a given topic proceed to publication.
Registration of Clinical Trials and Results
Two related sets of requirements have improved clinical trial reporting for many health technologies. These requirements help to diminish reporting bias and publication bias, thereby improving the quality of the evidence available for HTA. Further, they increase the value of clinical trials more broadly to trial participants, patients, clinicians, and other decision makers, and society (Huser 2013).
In the US, the Food and Drug Administration Amendments Act of 2007 (FDAAA) mandates that certain clinical trials of drugs, biologics, and medical devices that are subject to FDA regulation for any disease or condition be registered with ClinicalTrials.gov. A service of the US National Library of Medicine, ClinicalTrials.gov is a global registry and results database of publicly and privately supported clinical studies. Further, FDAAA requires investigators to register the results of these trials, generally no more than 12 months after trial completion. Applicable trials include those that have one or more sites in the US, are conducted under an FDA investigational new drug application (IND) or investigational device exemption (IDE), or involve a drug, biologic, or device that is manufactured in the US and its territories and is exported for research (ClinicalTrials.gov 2012; Zarin 2011).
The International Committee of Medical Journal Editors (ICMJE) requires clinical trial registration as a condition for publication of research results generated by a clinical trial. Although the ICMJE does not advocate any particular registry, a registry must meet certain criteria for investigators to satisfy the condition for publication. (ClinicalTrials.gov meets these criteria.) ICMJE requires registration of trial methodology but not trial results (ICMJE 2013).
As noted above, study attributes that affect bias can be used as criteria for assessing the quality of individual studies. For example, the use of randomization to reduce selection bias and blinding of outcomes assessors to reduce detection bias are among the criteria used for assessing the quality of clinical trials. Even within an individual study, the extent of certain types of bias may vary for different outcomes. For example, in a study of the impact of a technology on mortality and quality of life, the presence of detection bias and reporting bias may vary for those two outcomes.
Box III-4 shows a set of criteria, used by the US Agency for Healthcare Research and Quality (AHRQ) Evidence-based Practice Centers (EPCs), for assessing risk of bias for benefits for several types of study design, based on the main types of risk of bias cited above.
In contrast to the systematic effects of various types of bias, random error is a source of non-systematic deviation of an observed treatment effect or other outcome from a true one. Random error results from chance variation in the sample of data collected in a study (i.e., sampling error). The extent to which an observed outcome is free from random error is precision. As such, precision is inversely related to random error.
Random error can be reduced, but it cannot be eliminated. P-values and confidence intervals account for the extent of random error, but they do not account for systematic error (bias). The main approach to reducing random error is to establish large enough sample sizes (i.e., numbers of patients in the intervention and control groups of a study) to detect a true treatment effect (if one exists) at acceptable levels of statistical significance. The smaller the true treatment effect, the more patients may be required to detect it. Therefore, investigators who are planning an RCT or other study consider the estimated magnitude of the treatment effect that they are trying to detect at an acceptable level of statistical significance, and then “power” (i.e., determine the necessary sample size of) the study accordingly. Depending on the type of treatment effect or other outcome being assessed, another approach to reducing random error is to reduce variation in an outcome for each patient by increasing the number of observations made for each patient. Random error also may be reduced by improving the precision of the measurement instrument used to take the observations (e.g., a more precise diagnostic test or instrument for assessing patient mobility).
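The arithmetic behind “powering” a study can be sketched for the common case of comparing two event proportions. The following is an illustrative normal-approximation formula only, not the sole method used in trial design, and the proportions chosen are hypothetical:

```python
from math import sqrt, ceil
from statistics import NormalDist

def n_per_group(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size per arm to detect a difference between two
    event proportions (two-sided test, normal approximation)."""
    z = NormalDist()
    z_a = z.inv_cdf(1 - alpha / 2)   # critical value for the significance level
    z_b = z.inv_cdf(power)           # quantile corresponding to desired power
    p_bar = (p1 + p2) / 2            # pooled proportion under the null hypothesis
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p1 - p2) ** 2)

# A smaller anticipated treatment effect demands far more patients:
print(n_per_group(0.10, 0.05))   # detecting a 5-percentage-point difference
print(n_per_group(0.10, 0.08))   # detecting a 2-percentage-point difference
```

Shrinking the anticipated difference from 5 to 2 percentage points multiplies the required enrollment severalfold, which is why trials aimed at small effects are so large and costly.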
6. Role of Selected Other Factors
Some researchers contend that if individual studies are to be assembled into a body of evidence for a systematic review, precision should be evaluated not at the level of individual studies, but when assessing the quality of the body of evidence. This is intended to avoid double-counting limitations in precision from the same source (Viswanathan 2014).
In addition to evaluating internal validity of studies, some instruments for assessing the quality of individual studies evaluate external validity. However, by definition, the external validity of a study depends not only on its inherent attributes, but on the nature of an evidence question for which the study is more or less relevant. An individual study may have high external validity for some evidence questions and low external validity for others. Certainly, some aspects of bias for internal validity noted above may also be relevant to external validity, such as whether the patient populations compared in a treatment and control group represent the same or different populations, and whether the analyses account for attrition in a way that represents the population of interest, including any patient attributes that differ between patients who were followed to study completion and those who were lost to follow-up. Some researchers suggest that if individual studies are to be assembled into a body of evidence for a systematic review, then external validity should be evaluated when assessing the quality of the body of evidence, but not at the level of individual studies (Atkins 2004; Viswanathan 2014).
Box III-4. Design-Specific Criteria to Assess Risk of Bias for Benefits
Source: Viswanathan M, Ansari MT, Berkman ND, Chang S, et al. Chapter 9. Assessing the risk of bias of individual studies in systematic reviews of health care interventions. In: Methods Guide for Effectiveness and Comparative Effectiveness Reviews. AHRQ Publication No. 10(14)-EHC063-EF. Rockville, MD: Agency for Healthcare Research and Quality. January 2014.
Some quality assessment tools for individual studies account for funding source (or sponsor) of a study and disclosed conflicts of interest (e.g., on the part of sponsors or investigators) as potential sources of bias. Rather than being direct sources of bias themselves, a funding source or a person with a disclosed conflict of interest may induce bias indirectly, e.g., in the form of certain types of reporting bias or detection bias. Also, whether the funding source of research is a government agency, non-profit organization, or health technology company does not necessarily determine whether it induces bias. Of course, all of these potential sources of bias should be systematically documented (Viswanathan 2014).
C. Instruments for Assessing Quality of Individual Studies
A variety of assessment instruments are available to assess the quality of individual studies. Many of these are for assessing internal validity or risk of bias for benefits and harms; others focus on assessing external validity. These include instruments for assessing particular types of studies (e.g., RCTs, observational studies) and certain types of interventions (e.g., screening, diagnosis, and treatment).
A systematic review identified more than 20 scales (and their modifications) for assessing the quality of RCTs (Olivo 2008). Although most of these had not been rigorously developed or tested for validity and reliability, the systematic review found that one of the original scales, the Jadad Scale (Jadad 1996), shown in Box III-5, was the strongest.
The Cochrane Risk of Bias Tool for RCTs, shown in Box III-6, accounts for the domains of bias noted above (i.e., selection, performance, detection, attrition, and reporting bias), providing criteria for assessing whether there is low, unclear, or high risk of bias for each domain for individual RCTs as well as across a set of RCTs for a particular evidence question (Higgins, Altman, Sterne 2011).
Criteria and ratings for assessing internal validity of RCTs and cohort studies and of diagnostic accuracy studies used by the US Preventive Services Task Force (USPSTF) are shown in Box III-7 and Box III-8, respectively. Box III-9 shows a framework used by the USPSTF to rate the external validity of individual studies. QUADAS-2 is a quality assessment tool for diagnostic accuracy studies (Whiting 2011).
Among their numerous instruments for assessing the quality of individual studies, the AHRQ EPCs use a PICOS framework to organize characteristics that can affect the external validity (applicability) of individual studies, which are used as criteria for evaluating study quality, as shown in Box III-10.
D. Strengths and Limitations of RCTs
For demonstrating the internal validity of a causal relationship between an intervention and one or more outcomes of interest, the well-designed, blinded (where feasible), appropriately powered, well-conducted, and properly reported RCT has dominant advantages over other study designs. Among these, the RCT minimizes selection bias in that any enrolled patient has the same probability, due to randomization, of being assigned to an intervention group or control group. This also minimizes the potential impact of any known or unknown confounding factors (e.g., risk factors present at baseline), because randomization tends to distribute such confounders evenly across the groups to be compared.
When the sample size of an RCT is calculated to achieve sufficient statistical power, it minimizes the probability that the observed treatment effect will be subject to random error. Further, especially with larger groups, randomization enables patient subgroup comparisons between intervention and control groups. The primacy of the RCT remains even in an era of genomic testing and expanding use of biomarkers to better target selection of patients for adaptive clinical trials of new drugs and biologics, and advances in computer-based modeling that may replicate certain aspects of RCTs (Ioannidis 2013).
Box III-5. Jadad Instrument to Assess the Quality of RCT Reports
This is not the same as being asked to review a paper. It should not take more than 10 minutes to score a report and there are no right or wrong answers.
Please read the article and try to answer the following questions:
- Was the study described as randomized (this includes the use of words such as randomly, random, and randomization)?
- Was the study described as double blind?
- Was there a description of withdrawals and dropouts?
Scoring the items:
Either give a score of 1 point for each “yes” or 0 points for each “no.” There are no in-between marks.
Give 1 additional point if: For question 1, the method to generate the sequence of randomization was described and it was appropriate (table of random numbers, computer generated, etc.)
and/or: If for question 2, the method of double blinding was described and it was appropriate (identical placebo, active placebo, dummy, etc.)
Deduct 1 point if: For question 1, the method to generate the sequence of randomization was described and it was inappropriate (patients were allocated alternately, or according to date of birth, hospital number, etc.)
and/or: for question 2, the study was described as double blind but the method of blinding was inappropriate (e.g., comparison of tablet vs. injection with no double dummy)
Guidelines for Assessment
1. Randomization: A method to generate the sequence of randomization will be regarded as appropriate if it allowed each study participant to have the same chance of receiving each intervention and the investigators could not predict which treatment was next. Methods of allocation using date of birth, date of admission, hospital numbers, or alternation should not be regarded as appropriate.
2. Double blinding: A study must be regarded as double blind if the word “double blind” is used. The method will be regarded as appropriate if it is stated that neither the person doing the assessments nor the study participant could identify the intervention being assessed, or if in the absence of such a statement the use of active placebos, identical placebos, or dummies is mentioned.
3. Withdrawals and dropouts: Participants who were included in the study but did not complete the observation period or who were not included in the analysis must be described. The number and the reasons for withdrawal in each group must be stated. If there were no withdrawals, it should be stated in the article. If there is no statement on withdrawals, this item must be given no points.
Reprinted from: Jadad AR, et al. Assessing the quality of reports of randomized clinical trials: Is blinding necessary? Control Clin Trials. 1996;17:1-12, Copyright © (1996) with permission from Elsevier.
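The scoring rules above reduce to a simple 0-to-5 tally, which can be sketched as follows (the function and argument names are illustrative, not part of the instrument):

```python
def jadad_score(randomized: bool, double_blind: bool, withdrawals_described: bool,
                randomization_method: str = "not described",
                blinding_method: str = "not described") -> int:
    """Tally the Jadad 0-5 score from answers about a trial report.
    Method arguments take 'appropriate', 'inappropriate', or 'not described'."""
    score = sum([randomized, double_blind, withdrawals_described])  # 1 point per "yes"
    if randomized:
        if randomization_method == "appropriate":
            score += 1   # e.g., table of random numbers, computer generated
        elif randomization_method == "inappropriate":
            score -= 1   # e.g., alternation, date of birth, hospital number
    if double_blind:
        if blinding_method == "appropriate":
            score += 1   # e.g., identical placebo, active placebo, dummy
        elif blinding_method == "inappropriate":
            score -= 1   # e.g., tablet vs. injection with no double dummy
    return score

# A computer-generated sequence, identical placebos, and described dropouts:
print(jadad_score(True, True, True, "appropriate", "appropriate"))  # → 5
```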
Box III-6. The Cochrane Collaboration’s Tool for Assessing Risk of Bias
| Domain | Support for Judgment | Review authors’ judgement |
| --- | --- | --- |
| Random sequence generation. | Describe the method used to generate the allocation sequence in sufficient detail to allow an assessment of whether it should produce comparable groups. | Selection bias (biased allocation to interventions) due to inadequate generation of a randomised sequence. |
| Allocation concealment. | Describe the method used to conceal the allocation sequence in sufficient detail to determine whether intervention allocations could have been foreseen in advance of, or during, enrolment. | Selection bias (biased allocation to interventions) due to inadequate concealment of allocations prior to assignment. |
| Blinding of participants and personnel. Assessments should be made for each main outcome (or class of outcomes). | Describe all measures used, if any, to blind study participants and personnel from knowledge of which intervention a participant received. Provide any information relating to whether the intended blinding was effective. | Performance bias due to knowledge of the allocated interventions by participants and personnel during the study. |
| Blinding of outcome assessment. Assessments should be made for each main outcome (or class of outcomes). | Describe all measures used, if any, to blind outcome assessors from knowledge of which intervention a participant received. Provide any information relating to whether the intended blinding was effective. | Detection bias due to knowledge of the allocated interventions by outcome assessors. |
| Incomplete outcome data. Assessments should be made for each main outcome (or class of outcomes). | Describe the completeness of outcome data for each main outcome, including attrition and exclusions from the analysis. State whether attrition and exclusions were reported, the numbers in each intervention group (compared with total randomized participants), reasons for attrition/exclusions where reported, and any re-inclusions in analyses performed by the review authors. | Attrition bias due to amount, nature or handling of incomplete outcome data. |
| Selective reporting. | State how the possibility of selective outcome reporting was examined by the review authors, and what was found. | Reporting bias due to selective outcome reporting. |
| Other sources of bias. | State any important concerns about bias not addressed in the other domains in the tool. If particular questions/entries were pre-specified in the review’s protocol, responses should be provided for each question/entry. | Bias due to problems not covered elsewhere in the table. |
Reprinted with permission: Higgins JPT, Altman DG, Sterne JAC, eds. Chapter 8: Assessing risk of bias in included studies. In: Higgins JPT, Green S, eds. Cochrane Handbook for Systematic Reviews of Interventions Version 5.1.0 [updated March 2011]. The Cochrane Collaboration, 2011.
Box III-7. Criteria for Assessing Internal Validity of Individual Studies:
Randomized Controlled Trials and Cohort Studies, USPSTF
- Initial assembly of comparable groups:
- For RCTs: adequate randomization, including allocation concealment and whether potential confounders were distributed equally among groups.
- For cohort studies: consideration of potential confounders with either restriction or measurement for adjustment in the analysis; consideration of inception cohorts.
- Maintenance of comparable groups (includes attrition, cross-overs, adherence, contamination).
- Important differential loss to follow-up or overall high loss to follow-up.
- Measurements: equal, reliable, and valid (includes masking of outcome assessment).
- Clear definition of interventions.
- All important outcomes considered.
- Analysis: adjustment for potential confounders for cohort studies, or intention to treat analysis for RCTs.
Definitions of ratings based on above criteria:
Good: Meets all criteria: Comparable groups are assembled initially and maintained throughout the study (follow-up at least 80 percent); reliable and valid measurement instruments are used and applied equally to the groups; interventions are spelled out clearly; all important outcomes are considered; and appropriate attention to confounders in analysis. In addition, for RCTs, intention to treat analysis is used.
Fair: Studies will be graded “fair” if any or all of the following problems occur, without the fatal flaws noted in the "poor" category below: Generally comparable groups are assembled initially but some question remains whether some (although not major) differences occurred with follow-up; measurement instruments are acceptable (although not the best) and generally applied equally; some but not all important outcomes are considered; and some but not all potential confounders are accounted for. Intention to treat analysis is done for RCTs.
Poor: Studies will be graded “poor” if any of the following fatal flaws exists: Groups assembled initially are not close to being comparable or maintained throughout the study; unreliable or invalid measurement instruments are used or not applied equally among groups (including not masking outcome assessment); and key confounders are given little or no attention. For RCTs, intention to treat analysis is lacking.
Source: US Preventive Services Task Force Procedure Manual. AHRQ Pub. No. 08-05118-EF, July 2008.
Box III-8. Criteria for Assessing Internal Validity of Individual Studies:
Diagnostic Accuracy Studies, USPSTF
- Screening test relevant, available for primary care, adequately described.
- Study uses a credible reference standard, performed regardless of test results.
- Reference standard interpreted independently of screening test.
- Handles indeterminate results in a reasonable manner.
- Spectrum of patients included in study.
- Sample size.
- Administration of reliable screening test.
Definitions of ratings based on above criteria:
Good: Evaluates relevant available screening test; uses a credible reference standard; interprets reference standard independently of screening test; reliability of test assessed; has few or handles indeterminate results in a reasonable manner; includes large number (more than 100) broad-spectrum patients with and without disease.
Fair: Evaluates relevant available screening test; uses reasonable although not best standard; interprets reference standard independent of screening test; moderate sample size (50 to 100 subjects) and a "medium" spectrum of patients.
Poor: Has fatal flaw such as: Uses inappropriate reference standard; screening test improperly administered; biased ascertainment of reference standard; very small sample size or very narrow selected spectrum of patients.
Source: US Preventive Services Task Force Procedure Manual. AHRQ Pub. No. 08-05118-EF, July 2008.
Box III-9. Global Rating of External Validity (Generalizability) of Individual Studies,
US Preventive Services Task Force
External validity is rated "good" if:
- The study differs minimally from the US primary care population/situation/providers and only in ways that are unlikely to affect the outcome; it is highly probable (>90%) that the clinical experience with the intervention observed in the study will be attained in the US primary care setting.
External validity is rated "fair" if:
- The study differs from the US primary care population/situation/providers in a few ways that have the potential to affect the outcome in a clinically important way; it is only moderately probable (50%-89%) that the clinical experience with the intervention in the study will be attained in the US primary care setting.
External validity is rated "poor" if:
- The study differs from the US primary care population/situation/providers in many ways that have a high likelihood of affecting the clinical outcomes; the probability is low (<50%) that the clinical experience with the intervention observed in the study will be attained in the US primary care setting.
Source: US Preventive Services Task Force Procedure Manual. AHRQ Pub. No. 08-05118-EF, July 2008.
Box III-10. Characteristics of Individual Studies That May Affect Applicability (AHRQ)
- Narrow eligibility criteria and exclusion of those with comorbidities
- Large differences between demographics of study population and community patients
- Narrow or unrepresentative severity, stage of illness, or comorbidities
- Run-in period with high exclusion rate for non-adherence or side effects
- Event rates much higher or lower than observed in population-based studies
- Doses or schedules not reflected in current practice
- Intensity and delivery of behavioral interventions that may not be feasible for routine use
- Monitoring practices or visit frequency not used in typical practice
- Older versions of an intervention no longer in common use
- Co-interventions that are likely to modify effectiveness of therapy
- Highly selected intervention team or level of training/proficiency not widely available
- Inadequate dose of comparison therapy
- Use of substandard alternative therapy
- Composite outcomes that mix outcomes of different significance
- Short-term or surrogate outcomes
- Standards of care differ markedly from setting of interest
- Specialty population or level of care differs from that seen in community
Source: Atkins D, et al. Chapter 5. Assessing the Applicability of Studies When Comparing Medical Interventions. In: Methods Guide for Effectiveness and Comparative Effectiveness Reviews. AHRQ Publication No. 10(12)-EHC063-EF. Rockville, MD: Agency for Healthcare Research and Quality. April 2012.
As described below, despite its advantages for demonstrating internal validity of causal relationships, the RCT is not the best study design for all evidence questions. Like all methods, RCTs have limitations. RCTs can have particular limitations regarding external validity. The relevance or impact of these limitations varies according to the purposes and circumstances of study. In order to help inform health care decisions in real-world practice, evidence from RCTs and other experimental study designs should be augmented by evidence from other types of studies. These and related issues are described below.
RCTs can cost tens or hundreds of millions of dollars, and can exceed $1 billion in some instances. Costs can be particularly high for phase III trials of drugs and biologics conducted to gain market approval by regulatory agencies. Included are costs of usual care and the additional costs of conducting research. Usual care costs include those for, e.g., physician visits, hospital stays, laboratory tests, radiology procedures, and standard medications, which are typically covered by third-party payers. Research-only costs (which would not otherwise occur for usual care) include patient enrollment and related management; investigational technologies; additional tests and procedures done for research purposes; additional time by clinical investigators; data infrastructure, management, collection, analysis, and reporting; and regulatory compliance and reporting (DiMasi 2003; Morgan 2011; Roy 2012). Costs are higher for trials with large numbers of enrollees, large numbers of primary and secondary endpoints (requiring more data collection and analysis), and longer duration. Costs are generally high for trials that are designed to detect treatment effects that are anticipated to be small (therefore requiring large sample sizes to achieve statistical significance) or that require extended follow-up to detect differences in, e.g., survival and certain health events.
A clinical trial is the best way to assess whether an intervention works, but it is arguably the worst way to assess who will benefit from it (Mant 1999).
Most RCTs are designed to investigate the effects of a uniformly delivered intervention in a specific type of patient in specific circumstances. This helps to ensure that any observed difference in outcomes between the investigational treatment and comparator is less likely to be confounded by variations in the patient groups compared, the mode of delivering the intervention, other previous and current treatments, health care settings, and other factors. However, while this approach strengthens internal validity, it can weaken external validity.
Patients who enroll in an RCT are typically subject to inclusion and exclusion criteria pertaining to, e.g., age, comorbidities, other risk factors, and previous and current treatments. These criteria tend to yield homogeneous patient groups that may not represent the diversity of patients that would receive the interventions in real practice. RCTs often involve special protocols of care and testing that may not be characteristic of general care, and are often conducted in university medical centers or other special settings. Findings from these RCTs may not be applicable to other practice settings, e.g., due to variations in the technique of delivering the intervention.
When RCTs are conducted to generate sufficient evidence for gaining market approval or clearance, they are sometimes known as “efficacy trials” in that they may establish only short-term efficacy (rather than effectiveness) and safety in a narrowly selected group of patients. Given the patient composition and the choice of comparator, results from these RCTs can overstate how well a technology works as well as under-represent the diversity of the population that will ultimately use the technology.
Given the high costs of RCTs and sponsors’ incentives to generate findings, such as to gain market approval for regulated technologies, these trials may be too small (i.e., have insufficient statistical power) or too short in duration to detect rare or delayed outcomes, including adverse events, and other unintended impacts. On the other hand, even in large, long-term RCTs (as well as other large studies), an observed statistically significant difference in adverse events may arise from random error, or these events may simply happen to co-occur with the intervention rather than being caused by it (Rawlins 2008). As such, the results from RCTs may be misleading or insufficiently informative for clinicians, patients, and payers who make decisions pertaining to more heterogeneous patients and care settings.
Given their resource constraints and use to gain market approval for regulated technologies, RCTs may be designed to focus on a small number of outcomes, especially shorter-term intermediate endpoints or surrogate endpoints rather than ultimate endpoints such as mortality, morbidity, or quality of life. As such, findings from these RCTs may be of limited use to clinicians and patients. Of course, the use of validated surrogate endpoints is appropriate in many instances, including when the health impact of interventions for some health care conditions will not be realized for years or decades, e.g., screening for certain cancers, prevention of risky health behaviors, and management of hypertension and dyslipidemia to prevent strokes and myocardial infarction in certain patient groups.
RCTs are traditionally designed to test a null hypothesis, i.e., the assumption by investigators that there is no difference between intervention and control groups. This assumption often does not pertain for several reasons. Among these, the assumption may be unrealistic when findings of other trials (including phase II trials for drugs and biologics) of the same technology have detected a treatment effect. Further, it is relevant only when the trial is designed to determine if one intervention is better than another, in contrast to whether they can be considered equivalent or one is inferior to the other (Rawlins 2008). Testing of an “honest” null hypothesis in an RCT is consistent with the principle of equipoise, which refers to a presumed state of uncertainty regarding whether any one of alternative health care interventions will confer more favorable outcomes, including balance of benefits and harms (Freedman 1987). However, there is controversy regarding whether this principle is realistic and even whether it is always ethical (Djulbegovic 2009; Fries 2004; Veatch 2007).
RCTs depend on principles of probability theory whose validity may be diminished in health care research, including certain aspects of the use of p-values and multiplicity. Multiplicity refers to analyses of numerous endpoints in the same data set, stopping rules for RCTs that involve “multiple looks” at data emerging from the trial, and analysis of numerous subgroups. Each of these types of multiplicity involves iterative (repeated) tests of statistical significance based on conventional p-value thresholds (e.g., <0.05). Such iterative tests are increasingly likely to result in at least one false-positive finding, whether for an endpoint, a decision to stop a trial, or a patient subgroup in which there appears to be a statistically significant treatment effect (Rawlins 2008; Wang 2007).
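The inflation of false-positive risk under multiplicity is easy to illustrate. If each of k significance tests uses the conventional threshold α = 0.05, and the tests are assumed independent (a simplification, since endpoints and subgroups within one trial are usually correlated), the chance of at least one false-positive finding is 1 − (1 − α)^k:

```python
# Family-wise false-positive probability under k independent significance
# tests, each at threshold alpha: P(>=1 false positive) = 1 - (1 - alpha)**k.
def familywise_error(k: int, alpha: float = 0.05) -> float:
    return 1 - (1 - alpha) ** k

for k in (1, 5, 20):
    print(f"{k:2d} tests -> {familywise_error(k):.0%} chance of at least one false positive")
```

With 20 endpoint, interim, or subgroup tests, the chance of at least one spurious “significant” result approaches two in three, which is why pre-specification of analyses and adjusted significance thresholds matter.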
Using a p-value threshold (e.g., p<0.01 or p<0.05) as the basis for accepting a treatment effect can be misleading. There is still a chance (e.g., 1% or 5%) that the difference is due to random error. Also, a statistically significant difference detected with a large sample size may have no clinical significance. On the other hand, a finding of no statistical significance (e.g., p>0.01 or p>0.05) does not prove the absence of a treatment effect, including because the sample size of the RCT may have been too small to detect a true treatment effect. The reliance of most RCTs on p-values, particularly that the probability that a conclusion is in error can be determined from the data in a single trial, ignores evidence from other sources or the plausibility of the underlying cause-and-effect mechanism (Goodman 2008).
As noted below, other study designs are preferred for many types of evidence questions, even in some instances when the purpose is to determine the causal effect of a technology. For investigating technologies for treating rare diseases, the RCT may be impractical for enrolling and randomizing sufficient numbers of patients to achieve the statistical power to detect treatment effects. On the other hand, RCTs may be unnecessary for detecting very large treatment effects, especially where patient prognosis is well established and historical controls suffice.
To conduct an RCT may be judged unethical in some circumstances, such as when patients have a largely fatal condition for which no effective therapy exists. Use of a placebo control alone can be unethical when an effective standard of care exists and withholding it poses great health risk to patients, such as for HIV/AIDS prevention and therapy and certain cancer treatments. RCTs that are underpowered (i.e., with sample sizes too small to detect a true treatment effect or that yield statistically significant effects that are unreliable) can yield overestimated treatment effects and low reproducibility of results, thereby raising ethical concerns about wasted resources and patients’ commitments (Button 2013).
E. Different Study Designs for Different Questions
RCTs are not the best study design for answering all evidence questions of potential relevance to an HTA. As noted in Box III-11, other study designs may be preferable for different questions. For example, the prognosis for a given disease or condition may be based on follow-up studies of patient cohorts at uniform points in the clinical course of a disease. Case control studies, which are usually
Box III-11. RCTs Are Not the Best Study Design for All Evidence Questions
Other study designs may include the following (not a comprehensive list):
- Prevalence of a disease/disorder/trait? Random sample survey of relevant population
- Identification of risk factors for a disease/disorder/adverse event? Case control study (for rare outcome) or cohort study (for more common outcome)
- Prognosis? Patient cohort studies with follow-up at uniform points in clinical course of disease/disorder
- Accuracy and reliability of a diagnostic test? Cross-sectional study of index test (new test) vs. reference method (“gold standard”) in cohort of patients at risk of having disease/disorder
- Accuracy and reliability of a screening test? Cross-sectional study of index test vs. reference method (“gold standard”) in representative cross-section of asymptomatic population at-risk for trait/disorder/preclinical disease
- Efficacy/effectiveness (for health outcomes) of screening or diagnostic tests? RCT if time and resources allow; observational studies and RCTs rigorously linked for analytic validity, clinical validity, and clinical utility
- Efficacy/effectiveness (for health outcomes) of most therapies and preventive interventions? RCT
- Efficacy/effectiveness of interventions for otherwise fatal conditions? Non-randomized trials, case series
- Safety, effectiveness of incrementally modified technologies posing no known additional risk? Registries
- Safety, effectiveness of interventions in diverse populations in real-world settings? Registries, especially to complement findings of available RCTs, PCTs
- Rates of recall or procedures precipitated by false positive screening results? Cohort studies
- Complication rates from surgery, other procedures? Registries, case series
- Identification of a cause of a suspected iatrogenic (caused by a physician or therapy) disorder? Case-control studies
- Incidence of common adverse events potentially due to an intervention? RCTs, nested case-control studies, n-of-1 trial for particular patients, surveillance, registries
- Incidence of rare or delayed adverse events potentially due to an intervention? Surveillance; registries; n-of-1 trial for particular patients; large, long-term RCT if feasible
Case control studies and cohort studies, whether prospective or retrospective, are often used to identify risk factors for diseases, disorders, and adverse events. The accuracy of a new diagnostic test (though not its ultimate effect on health outcomes) may be determined by a cross-sectional study in which patients suspected of having a disease or disorder receive both the new (“index”) test and the “gold standard” test. Non-randomized trials or case series may be preferred for determining the effectiveness of interventions for otherwise fatal conditions, i.e., where little or nothing is to be gained by comparison to placebos or known ineffective treatments. Surveillance and registries are used to determine the incidence of rare or delayed adverse events that may be associated with an intervention. For incrementally modified technologies posing no known additional risk, registries may be appropriate for determining safety and effectiveness.
Although experimentation in the form of RCTs is regarded as the gold standard for deriving unbiased estimates of the causal effect of an intervention on health care outcomes, RCTs are not always necessary to reach the same convincing finding. Such instances arise when the size of the treatment effect is very large relative to the expected (well-established and predictable) prognosis of the disease and when this effect occurs quickly relative to the natural course of the disease, as may be discerned using historical controlled trials and certain well-designed case series and non-randomized cohort studies. Some examples include ether for anesthesia, insulin for diabetes, blood transfusion for severe hemorrhagic shock, penicillin for lobar pneumonia, ganciclovir for cytomegalovirus, imiglucerase for Gaucher’s disease, phototherapy for skin tuberculosis, and laser therapy for port wine stains (Glasziou 2007; Rawlins 2008).
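The arithmetic behind such judgments can be made concrete. Assuming purely hypothetical numbers (a condition with a well-established 10% survival rate, and 8 of 10 treated patients surviving), the Python sketch below computes how improbable such a result would be if the historical prognosis still applied:

```python
from math import comb

def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the chance of seeing at least
    k favorable outcomes if the historical rate p still applied."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical numbers: historical survival 10%; 8 of 10 treated survive.
p_value = binomial_tail(8, 10, 0.10)
print(f"P(>= 8/10 survivors under a 10% historical rate) = {p_value:.2e}")
```

A probability this small is the kind of dramatic signal that historical controls can detect without randomization; for modest effect sizes, bias and confounding would swamp any such calculation, which is why RCTs remain the default.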
F. Complementary Methods for Internal and External Validity
Those who conduct technology assessments should be as innovative in their evaluations as the technologies themselves ... The randomized trial is unlikely to be replaced, but it should be complemented by other designs that address questions about technology from different perspectives (Eisenberg 1999).
Given the range of impacts evaluated in HTA and its role in serving decision makers and policymakers with diverse responsibilities, HTA must consider the methodological validity and other attributes of various primary data methods. There is increasing recognition of the need for evidence generated by primary data methods with complementary attributes.
Although primary study investigators and assessors would prefer to have methods that achieve both internal and external validity, they often find that study design attributes that increase one type of validity jeopardize the other. As described above, a well-designed and conducted RCT is widely considered to be the best approach for ensuring internal validity. However, for the reasons that an RCT may have high internal validity, its external validity may be limited.
Findings of some large observational studies (e.g., from large cohort studies or registries) have external validity to the extent that they can provide insights into the types of outcomes that are experienced by different patient groups in different circumstances. However, these less rigorous designs are more subject to certain forms of bias and confounding that threaten internal validity of any observed relationship between an intervention (or other exposure) and outcomes. These studies are subject, for example, to selection bias on the part of patients, who have self-selected or otherwise influenced choice of an intervention, and investigators, who select which populations to study and compare. They are also subject to investigator detection bias. Interesting or promising findings from observational studies can generate hypotheses that can be tested using study designs with greater internal validity.
It is often not practical to conduct RCTs in all of the patient populations that might benefit from a particular intervention. Combinations of studies that, as a group, address internal validity and external validity may suffice. For example, RCTs demonstrating the safety and efficacy in a narrowly defined patient population can be complemented with continued follow-up of the original patient groups in those trials and by observational studies following more diverse groups of patients over time. These observational studies might include registries of larger numbers of more diverse patients who receive the intervention in various health care settings, studies of insurance claims data for patients with the relevant disease and intervention codes, studies using medical records, and postmarketing surveillance for adverse events in patients who received the intervention. Further, the RCT and observational data can provide inputs to computer-based simulations of the safety, effectiveness, and costs of using the intervention in various patient populations.
The methodological literature often contends that, due to their inherent lack of rigor, observational studies tend to report larger treatment effects than RCTs. However, certain well-designed observational studies can yield results that are similar to RCTs. An analysis published in 2000 that compared treatment effects reported from RCTs to those reported from observational studies for 19 treatments between 1985 and 1998 found that the estimates of treatment effects were similar for a large majority of the treatments (Benson 2000). Similarly, a comparison of the results of meta-analyses of RCTs and meta-analyses of observational studies (cohort or case control designs) for the same five clinical topics published between 1991 and 1995 found that the reported treatment effects (including point estimates and 95% confidence intervals) were similar (Concato 2000).
Similar to quality assessment tools for various types of studies, the GRACE (Good ReseArch for Comparative Effectiveness) principles were developed to evaluate the methodological quality of observational research studies of comparative effectiveness. The GRACE principles comprise a series of questions to guide the evaluation, including what belongs in a study plan, key elements for good conduct and reporting, and ways to assess the accuracy of comparative effectiveness inferences for a population of interest. Given the range of types of potentially relevant evidence and the need to weigh applicability for particular circumstances of routine care, GRACE has no scoring system (Dreyer 2010). The accompanying GRACE checklist is used to assess the quality and usefulness for decision making of observational studies of comparative effectiveness (Dreyer 2014).
G. Evidence Hierarchies
So should we assess evidence the way Michelin guides assess hotels and restaurants? (Glasziou 2004).
Researchers often use evidence hierarchies or other frameworks to portray the relative quality of various study designs for the purposes of evaluating single studies as well as a body of evidence comprising multiple studies. An example of a basic evidence hierarchy is:
- Systematic reviews and meta-analyses of RCTs
- Randomized controlled trials (RCTs)
- Non-randomized controlled trials
- Prospective observational studies
- Retrospective observational studies
- Expert opinion
In this instance, as is common in such hierarchies, the top item is a systematic review of RCTs, an integrative method that pools data or results from multiple single studies. (Hierarchies for single primary data studies typically have RCTs at the top.) Also, the bottom item, expert opinion, does not comprise evidence as such, though it may reflect the judgment of one or more people drawing on their perceptions of scientific evidence, personal experience, and other subjective input. There are many versions of such hierarchies, including some with more extensive levels/breakdowns.
Hierarchies cannot, moreover, accommodate evidence that relies on combining the results from RCTs and observational studies (Rawlins 2008).
As noted earlier in this chapter, although the general type or name of a study design (e.g., RCT, prospective cohort study, case series) conveys certain attributes about the quality of a study, the study design name itself is not a good proxy for study quality. One of the weaknesses of these conventional one-dimensional evidence hierarchies is that, while they tend to reflect internal validity, they do not generally reflect external validity of the evidence to more diverse patients and care settings. Depending on the intended use of the findings of a single study or of a body of evidence, an assessment of internal validity may be insufficient. Such hierarchies do not lend themselves to characterizing the quality of a body of diverse, complementary evidence that may yield fuller understanding about how well an intervention works across a heterogeneous population in different real-world circumstances. Box III-12 lists these and other limitations of conventional evidence hierarchies.
Box III-12. Limitations of Conventional Evidence Hierarchies
- Originally developed for pharmacological models of therapy
- Poor design and implementation of high-ranking study designs may yield less valid findings than lower-ranking, though better designed and implemented, study types
- Emphasis on experimental control, while enhancing internal validity, can jeopardize external validity
- Cannot accommodate evidence that relies on considering or combining results from multiple study designs
- Though intended to address internal validity of causal effect of an intervention on outcomes, they have been misapplied to questions about diagnostic accuracy, prognosis, or adverse events
- Number and inconsistencies among (60+) existing hierarchies suggest shortcomings, e.g.,
  - ranking of meta-analyses relative to RCTs
  - ranking of different observational studies
  - terminology (“cohort studies,” “quasi-experimental,” etc.)
Sources: See, e.g.:
Glasziou P, et al. Assessing the quality of research. BMJ. 2004;328:39-41.
Rawlins MD. On the evidence for decisions about the use of therapeutic interventions. The Harveian Oration of 2008. London: Royal College of Physicians, 2008.
Walach H, et al. Circular instead of hierarchical: methodological principles for the evaluation of complex interventions. BMC Med Res Methodol. 2006;24;6:29.
Box III-13 shows an evidence framework from the Oxford Centre for Evidence-Based Medicine that defines five levels of evidence for each of several types of evidence questions pertaining to disease prevalence, screening tests, diagnostic accuracy, therapeutic benefits, and therapeutic harms. The lowest level of evidence for several of these evidence questions, “Mechanism-based reasoning,” refers to some plausible scientific basis, e.g., biological, chemical, or mechanical, for the impact of an intervention. Although the framework is still one-dimensional for each type of evidence question, it does allow for moving up or down a level based on study attributes other than the basic study design.
While retaining the importance of weighing the respective methodological strengths and limitations of various study designs, extending beyond rigid one-dimensional evidence hierarchies to more useful evidence appraisal (Glasziou 2004; Howick 2009; Walach 2006) recognizes that:
- Appraising evidence quality must extend beyond categorizing study designs
- Different types of research questions call for different study designs
- It is more important for ‘direct’ evidence to demonstrate that the effect size is greater than the combined influence of plausible confounders than it is for the study to be experimental
- Best scientific evidence, for a pragmatic estimate of effectiveness and safety, may derive from a complementary set of methods
  - They can offset respective weaknesses/vulnerabilities
  - “Triangulating” findings achieved with one method by replicating them with other methods may provide a more powerful and comprehensive approach than reliance on any single prevailing design
- Systematic reviews are necessary, no matter the research type
Box III-13. Oxford Centre for Evidence-Based Medicine 2011 Levels of Evidence
Table 1 of 2 (Levels 1–3).
| Question | Level 1 | Level 2 | Level 3 |
| --- | --- | --- | --- |
| How common is the problem? | Local and current random sample surveys (or censuses) | Systematic review of surveys that allow matching to local circumstances** | Local non-random sample** |
| Is this diagnostic or monitoring test accurate? (Diagnosis) | Systematic review of cross sectional studies with consistently applied reference standard and blinding | Individual cross sectional studies with consistently applied reference standard and blinding | Non-consecutive studies, or studies without consistently applied reference standards** |
| What will happen if we do not add a therapy? (Prognosis) | Systematic review of inception cohort studies | Inception cohort studies | Cohort study or control arm of randomized trial* |
| Does this intervention help? (Treatment Benefits) | Systematic review of randomized trials or n-of-1 trials | Randomized trial or observational study with dramatic effect | Non-randomized controlled cohort/follow-up study** |
| What are the COMMON harms? (Treatment Harms) | Systematic review of randomized trials, systematic review of nested case-control studies, n-of-1 trial with the patient you are raising the question about, or observational study with dramatic effect | Individual randomized trial or (exceptionally) observational study with dramatic effect | Non-randomized controlled cohort/follow-up study (post-marketing surveillance) provided there are sufficient numbers to rule out a common harm. (For long-term harms the duration of follow-up must be sufficient.)** |
| What are the RARE harms? (Treatment Harms) | Systematic review of randomized trials or n-of-1 trial | Randomized trial or (exceptionally) observational study with dramatic effect | |
| Is this (early detection) test worthwhile? (Screening) | Systematic review of randomized trials | Randomized trial | Non-randomized controlled cohort/follow-up study** |
Table 2 of 2 (Levels 4–5).
| Question | Level 4 | Level 5 |
| --- | --- | --- |
| How common is the problem? | Case-series** | n/a |
| Is this diagnostic or monitoring test accurate? (Diagnosis) | Case-control studies, or “poor or non-independent reference standard”** | Mechanism-based reasoning |
| What will happen if we do not add a therapy? (Prognosis) | Case-series or case-control studies, or poor quality prognostic cohort study** | n/a |
| Does this intervention help? (Treatment Benefits) | Case-series, case-control studies, or historically controlled studies** | Mechanism-based reasoning |
| What are the COMMON harms? (Treatment Harms) | Case-series, case-control, or historically controlled studies** | Mechanism-based reasoning |
| What are the RARE harms? (Treatment Harms) | | |
| Is this (early detection) test worthwhile? (Screening) | Case-series, case-control, or historically controlled studies** | Mechanism-based reasoning |
*Level may be graded down on the basis of study quality, imprecision, indirectness (study PICO does not match question’s PICO), because of inconsistency between studies, or because the absolute effect size is very small; level may be graded up if there is a large or very large effect size.
**As always, a systematic review is generally better than an individual study.
Source: OCEBM Levels of Evidence Working Group. The Oxford 2011 Levels of Evidence. Oxford Centre for Evidence-Based Medicine. http://www.cebm.net/index.aspx?o=5653
H. Alternative and Emerging Study Designs Relevant to HTA
Primary data collection methods are evolving in ways that affect the body of evidence used in HTA. Of great significance is the recognition that clinical trials conducted for biomedical research or to gain market approval or clearance by regulatory agencies do not necessarily address the needs of decision makers or policymakers.
Comparative effectiveness research (CER) reflects the demand for real-world evidence to support practical decisions. It emphasizes evidence from direct (“head-to-head”) comparisons, effectiveness in real-world health care settings, health care outcomes (as opposed to surrogate or other intermediate endpoints), and ability to identify different treatment effects in patient subgroups. As traditional RCTs typically do not address this set of attributes, CER can draw on a variety of complementary study designs and analytical methods. Other important trends in support of CER are the gradual increase in use of electronic health records and more powerful computing and related health information technology, which enable more rapid and sophisticated analyses, especially of observational data. The demand for evidence on potentially different treatment effects in patient subgroups calls for study designs, whether in clinical trials or observational studies, that can efficiently discern such differences. Another powerful factor influencing primary data collection is the steeply increasing costs of conducting clinical trials, particularly of RCTs for new drugs, biologics, and medical devices; this focuses attention on study designs that require fewer patients, streamline data collection, and are of shorter duration.
Investigators continue to make progress in combining some of the desirable attributes of RCTs and observational studies. Some of the newer or still evolving clinical trial designs include: large simple trials, pragmatic clinical trials, cluster trials, adaptive trials, Bayesian trials, enrichment trials, and clinical registry trials (Lauer 2012), as described below.
Large simple trials (LSTs) retain the methodological strengths of prospective, randomized design, but use large numbers of patients, more flexible patient entry criteria, and multiple study sites to generate effectiveness data and improve external validity. Fewer types of data may be collected for each patient in an LST, easing participation by patients and clinicians (Buring 1994; Ellenberg 1992; Peto 1995; Yusuf 1990). Prominent examples of LSTs include the GISSI trials of thrombolytic treatment of acute myocardial infarction (AMI) (Maggioni 1990), the ISIS trials of alternative therapies for suspected AMI (Fourth International Study of Infarct Survival 1991), and the CATIE trial of therapies for schizophrenia (Stroup 2003).
Pragmatic (or practical) clinical trials (PCTs) are a related group of trial designs whose main attributes include: comparison of clinically relevant alternative interventions, a diverse population of study participants, participants recruited from heterogeneous practice settings, and data collection on a broad range of health outcomes. PCTs require that clinical and health policy decision makers become more involved in priority setting, research design, funding, and other aspects of clinical research (Tunis 2003). Some LSTs are also PCTs.
Cluster randomized trials involve randomized assignment of interventions at the level of natural groups or organizations rather than at the level of patients or other individuals. That is, sets of clinics, hospitals, nursing homes, schools, communities, or geographic regions are randomized to receive interventions or comparators. Such designs are used when it is not feasible to randomize individuals or when an intervention is designed to be delivered at a group or social level, such as a workplace-based smoking cessation campaign or a health care financing mechanism. These are also known as “group,” “place,” or “community” randomized trials (Eldridge 2008).
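As a minimal sketch of the allocation step, the hypothetical Python below randomizes whole clusters rather than individual patients; the clinic names, two-arm structure, and balancing scheme are illustrative assumptions, not from any actual trial:

```python
import random

def randomize_clusters(clusters, arms=("intervention", "control"), seed=42):
    """Assign whole clusters (e.g., clinics) to trial arms in balanced
    fashion; every individual in a cluster inherits the cluster's arm."""
    shuffled = list(clusters)
    random.Random(seed).shuffle(shuffled)
    return {cluster: arms[i % len(arms)] for i, cluster in enumerate(shuffled)}

# Hypothetical clinic names; 8 clinics are split 4 and 4 across the arms.
clinics = [f"clinic_{i:02d}" for i in range(8)]
assignment = randomize_clusters(clinics)
print(assignment)
```

Because outcomes of individuals within a cluster tend to be correlated, the analysis of such a trial must account for clustering (e.g., via the intraclass correlation) rather than treating patients as independent observations.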
Adaptive clinical trials use accumulating data to determine how to modify the design of ongoing trials according to a pre-specified plan. Intended to increase the quality, speed, and efficiency of trials, adaptive trials typically involve interim analyses, changes to sample size, changes in randomization to treatment arms and control groups, and changes in dosage or regimen of a drug or other technology (FDA Adaptive Design 2010; van der Graaf 2012).
A current example of an adaptive clinical trial is I-SPY 2 (Investigation of Serial Studies to Predict Your Therapeutic Response with Imaging and Molecular Analysis 2), which is investigating multiple drug combinations and accompanying biomarkers for treating locally advanced breast cancer. In this adaptive trial, investigators calculate the probability that each newly enrolled patient will respond to a particular investigational drug combination based on how previous patients in the trial with similar genetic “signatures” (i.e., sets of genetic markers) in their tumors have responded. Each new patient is then assigned accordingly, with an 80% chance of receiving standard chemotherapy plus an investigational drug and a 20% chance of receiving standard chemotherapy alone (Barker 2009; Printz 2013).
Bayesian clinical trials are a form of adaptive trials that rely on principles of Bayesian statistics. Rather than waiting until full enrollment and completion of follow-up for all enrolled patients, a Bayesian trial allows for assessment of results during the course of the trial and modifying its design to arrive at results more efficiently. Such midcourse modifications may include, e.g., changing the ratio of randomization to treatment arms (e.g., two patients randomized to the investigational group for every one patient randomized to the control group) to favor what appear to be more effective therapies, adding or eliminating treatment arms, changing enrollee characteristics to focus on patient subgroups that appear to be better responders, changing hypotheses from non-inferiority to superiority or vice-versa, and slowing or stopping patient accrual as certainty increases about treatment effects. These trial modifications also can accumulate and make use of information about relationships between biomarkers and patient outcomes (e.g., for enrichment, as described below). These designs enable more efficient allocation of patients to treatment arms, with the potential for smaller trials and for patients to receive better treatment (Berry 2006). Recent advances in computational algorithms and high-speed computing enable the calculations required for the complex design and simulations involved in planning and conducting Bayesian trials (FDA Guidance for the Use of Bayesian 2010; Lee 2012).
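One common Bayesian allocation rule is Thompson sampling with a beta-binomial model. The sketch below is purely illustrative (the arm names, true response rates, sample size, and seed are assumptions, not from any actual trial); it shows how randomization drifts toward the arm that appears more effective as outcomes accumulate:

```python
import random

def thompson_assign(successes, failures, rng):
    """Draw each arm's response rate from its Beta(1+s, 1+f) posterior and
    assign the next patient to the arm with the highest draw, so allocation
    gradually favors arms that appear more effective."""
    draws = {arm: rng.betavariate(1 + successes[arm], 1 + failures[arm])
             for arm in successes}
    return max(draws, key=draws.get)

# Hypothetical two-arm trial with true (unknown to the trial) response
# rates of 0.6 and 0.3; outcomes accumulate as patients enroll.
rng = random.Random(7)
true_rate = {"A": 0.6, "B": 0.3}
successes = {"A": 0, "B": 0}
failures = {"A": 0, "B": 0}
for _ in range(500):
    arm = thompson_assign(successes, failures, rng)
    if rng.random() < true_rate[arm]:
        successes[arm] += 1
    else:
        failures[arm] += 1

enrolled = {arm: successes[arm] + failures[arm] for arm in true_rate}
print(enrolled)  # allocation drifts toward the better-performing arm
```

Real Bayesian trial designs are far more elaborate (pre-specified interim analyses, stopping rules, and simulation of operating characteristics), but this captures the core idea of updating allocation from accumulating data.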
“Enrichment” refers to techniques of identifying patients for enrollment in clinical trials based on prospective use of patient attributes that are intended to increase the likelihood of detecting a treatment effect (if one truly exists) compared to an unselected population. Such techniques can decrease the number of patients needed to enroll in a trial; further, they can decrease patient heterogeneity of response, select for patients more likely to experience a disease-related trial endpoint, or select for patients (based on a known predictive biomarker) more likely to respond to a treatment (intended to result in a larger effect size). In adaptive enrichment of clinical trials, investigators seek to discern predictive markers during the course of a trial and apply these to enrich subsequent patient enrollment in the trial (FDA 2012). While these techniques improve the likelihood of discerning treatment effects in highly-selected patient groups, the findings of such trials may lack external validity to more heterogeneous patients. In one form of enrichment, the randomized-withdrawal trial, patients who respond favorably to an investigational intervention are then randomized to continue receiving that intervention or placebo. The study endpoints are return of symptoms or the ability to continue participation in the trial. The patients receiving the investigational intervention continue to do so only if they respond favorably, while those receiving placebo continue to do so only until their symptoms return. This trial design is intended to minimize the time that patients receive placebo (IOM Committee on Strategies for Small-Number-Participant Clinical Research Trials 2001; Temple 1996).
Clinical registry trials are a type of multicenter trial design using existing online registries as an efficient platform to conduct patient assignment to treatment and control groups, maintain case records, and conduct follow-up. Trials of this type that also randomize patient assignment to treatment and control groups are randomized clinical registry trials (Ferguson 2003; Fröbert 2010).
N-of-1 trials are clinical trials in which a single patient is the total population for the trial and in which a sequence of experimental and control interventions is allocated to the patient (i.e., a multiple crossover study conducted in a single patient). A trial in which the sequence of interventions given to the patient is determined by random allocation is an N-of-1 RCT. N-of-1 trials are used to determine treatment effects in individuals, and sets of these trials can be used to estimate heterogeneity of treatment effects across a population (Gabler 2011).
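The allocation step of an N-of-1 RCT can be sketched as follows, assuming paired treatment/control periods grouped into cycles (the treatment labels, cycle count, and seed are illustrative assumptions):

```python
import random

def n_of_1_schedule(n_cycles=4, treatments=("active", "placebo"), seed=3):
    """Randomized N-of-1 schedule: each cycle is a pair of periods in which
    the order of active treatment vs. control is randomized, so the single
    patient serves as his or her own control across repeated crossovers."""
    rng = random.Random(seed)
    schedule = []
    for _ in range(n_cycles):
        pair = list(treatments)
        rng.shuffle(pair)
        schedule.extend(pair)
    return schedule

print(n_of_1_schedule())
```

In practice such schedules also include washout periods between treatments and blinding of both patient and clinician, and outcomes are compared within each cycle.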
Patient preference trials are intended to account for patient preferences in the design of RCTs, including their ability to discern the impact of patient preference on health outcomes. Among the challenges to patient enrollment and participation in traditional RCTs are that some patients who have a strong preference for a particular treatment may decline to proceed with the trial or drop out early if they are not assigned to their preferred treatment. Also, these patients may experience or report worse or better outcomes due to their expectations or perceptions of the effects of assignment to their non-preferred or preferred treatment groups. Any of these actions may bias the results of the trial. Patient preference trials enable patients to express their preferred treatment prior to enrolling in an RCT. In some of these trials, the patients with a strong preference, e.g., for a new treatment or usual care, are assigned to a parallel group receiving their preferred intervention. The patients who are indifferent to receiving the new treatment or usual care are randomized into one group or another. Outcomes for the parallel, non-randomized groups (new intervention and usual care) are analyzed apart from the outcomes for the randomized groups.
In addition to enabling patients with strong preferences to receive their preferred treatment and providing for comparison of randomized groups of patients who expressed no strong preference, these trials may provide some insights about the relative impact on outcomes of receiving one’s preferred treatment. However, this design is subject to selection bias, as there may be systematic differences in prognostic factors and other attributes between patients with a strong preference for the new treatment and patients with a strong preference for usual care. Selection bias can also affect the indifferent patients who are randomized, as there may be systematic differences in prognostic factors and other attributes between indifferent patients and the general population, thereby diminishing the external validity of the findings. To the extent that patients with preferences are not randomized, the time and cost required to enroll a sufficient number of patients for the RCT to achieve statistical power will be greater. Patient preference trials have alternative designs, e.g., partially randomized preference trials and fully randomized preference trials. In the fully randomized preference design, patient preferences are recorded prior to the RCT, but all patients are then randomized regardless of their preference. In that design, subgroup analyses make it possible to determine whether receiving one’s preferred treatment has any impact on treatment adherence, drop-outs, and outcomes (Howard 2006; Mills 2011; Preference Collaborative Review Group 2008; Silverman 1996; Torgerson 1998).
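The allocation logic of a partially randomized preference design can be sketched as below; the patient identifiers, two-arm structure, and seed are illustrative assumptions:

```python
import random

def preference_allocate(patients, rng=None):
    """Partially randomized preference design: patients with a strong
    preference receive it in a non-randomized parallel group; indifferent
    patients are randomized between the new treatment and usual care."""
    rng = rng or random.Random(11)
    groups = {"pref_new": [], "pref_usual": [], "rand_new": [], "rand_usual": []}
    for patient_id, preference in patients:
        if preference == "new":
            groups["pref_new"].append(patient_id)
        elif preference == "usual":
            groups["pref_usual"].append(patient_id)
        else:  # no strong preference -> randomize 1:1
            arm = "rand_new" if rng.random() < 0.5 else "rand_usual"
            groups[arm].append(patient_id)
    return groups

# Hypothetical enrollees: two prefer the new treatment, one prefers usual
# care, and two are indifferent (None) and therefore randomized.
cohort = [("p1", "new"), ("p2", "usual"), ("p3", None),
          ("p4", None), ("p5", "new")]
print(preference_allocate(cohort))
```

Only the two randomized groups support an unbiased treatment comparison; the parallel preference groups are analyzed separately, as the text notes.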
I. Collecting New Primary Data
It is beyond the scope of this document to describe the planning, design, and conduct of clinical trials, observational studies, and other investigations for collecting new primary data. There is a large and evolving literature on these subjects (Friedman 2010; Piantadosi 2005; Spilker 1991). Also, there is a literature on priority setting and efficient resource allocation for clinical trials, and cost-effective design of clinical trials (Antman 2012; Chilcott 2003; Claxton 1996; Detsky 1990; FDA Adaptive Design 2010).
As noted above, the process of compiling evidence for an assessment may call attention to the need for new primary data. An assessment program may determine that existing evidence is insufficient for informing the desired policy needs, and that new studies are needed to generate data for particular aspects of the assessment. Once available, the new data can be interpreted and incorporated into the existing body of evidence.
In the US, major units of the National Institutes of Health (NIH) such as the National Cancer Institute (NCI); the National Heart, Lung, and Blood Institute (NHLBI); and the National Institute of Allergy and Infectious Diseases (NIAID) sponsor and conduct biomedical research, including clinical trials. The Department of Veterans Affairs (VA) Cooperative Studies Program is responsible for the planning and conduct of large multicenter clinical trials and epidemiological studies within the VA. This program also works with the VA Health Economics Resource Center to perform economic analyses as part of its clinical trials. The Food and Drug Administration (FDA) does not typically conduct primary studies related to the marketing of new drugs and devices; rather, it reviews primary data from studies sponsored or conducted by the companies that make these technologies. The FDA also maintains postmarketing surveillance programs, including the FDA Adverse Event Reporting System on adverse events and medication error reports for drug and therapeutic biologic products, and the MedWatch program, in which physicians and other health professionals and the public voluntarily report serious reactions and other problems with drugs, devices, and other medical products.
In the US, the Patient-Centered Outcomes Research Institute (PCORI) was established as an independent research institute by Congress in the Patient Protection and Affordable Care Act of 2010. PCORI conducts CER and related research that is guided by patients, caregivers and the broader health care community. PCORI’s five national research priorities are: assessment of prevention, diagnosis, and treatment options; improving health care systems; enhancing communication and dissemination of evidence; addressing disparities in health and health care; and improving CER methods and data infrastructure. PCORI devotes more than 60% of its research budget to CER, including for pragmatic clinical trials, large simple trials, and large observational studies, with the balance allocated to infrastructure, methods, and communication and dissemination research (Selby 2014).
Third-party payers generally do not sponsor clinical trials. However, they have long supported clinical trials of new technologies indirectly by paying for care associated with trials of those technologies, or by paying unintentionally for non-covered new procedures that were coded as covered procedures. As noted above, payers provide various forms of conditional coverage, such as coverage with evidence development (CED), for certain investigational technologies in selected settings to compile evidence that can be used to make more informed coverage decisions. Two main types of CED are “only in research,” in which coverage of a technology is provided only for patients with specified clinical indications in the payer’s beneficiary population who are enrolled in a clinical trial of that technology, and “only with research,” in which coverage of a technology is provided for all of the patients with specified clinical indications if a subset of those patients is enrolled in a clinical trial of that technology.
An early example of CED was the multicenter RCT of lung-volume reduction surgery, the National Emphysema Treatment Trial (NETT) conducted in the US, funded by the NHLBI and the Centers for Medicare and Medicaid Services (CMS, which administers the US Medicare program) (Fishman 2003; Ramsey 2003). In another form of conditional coverage known as conditional treatment continuation, payment is provided only as long as patients meet short-term treatment goals such as lowered blood cholesterol or cancer tumor response. In performance-linked reimbursement (or “pay-for-performance”), payment for a technology is linked to data demonstrating achievement of pre-specified clinical outcomes in practice; this includes schemes in which a manufacturer must provide rebates, refunds, or price adjustments to payers if their products do not achieve certain patient outcomes (Carlson 2010). Findings about the impact of conditional coverage, performance-linked reimbursement, and related efforts on coverage policies, patient outcomes, and costs are still emerging (de Bruin 2011).
Payers and researchers often analyze data from claims, electronic health records, registries, and surveys to determine comparative effectiveness of interventions, develop coverage policies, or determine provider compliance with coverage policies. These analyses increasingly involve efforts to link claims and other administrative sources to electronic health records and other clinical sources (Croghan 2010; de Souza 2012).
The ability of most assessment programs to undertake new primary data collection, particularly clinical trials, is limited by such factors as programs’ remit (which may not include sponsoring primary data collection), financial constraints, time constraints, and other aspects of the roles or missions of the programs. An HTA program may decide not to undertake an assessment if insufficient data are available. Whether or not an assessment involves collection of new primary data, the assessment report should note what new primary studies should be undertaken to address gaps in the current body of evidence, or to meet anticipated assessment needs.
References for Chapter III
Antman EM, Harrington RA. Transforming clinical trials in cardiovascular disease: mission critical for health and economic well-being. JAMA. 2012;308(17):1743-4.
Atkins D, Best D, Briss PA, Eccles M, et al., GRADE Working Group. Grading quality of evidence and strength of recommendations. BMJ. 2004;328(7454):1490. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC428525.
Atkins D, Chang S, Gartlehner G, Buckley DI, et al. Chapter 6. Assessing the Applicability of Studies When Comparing Medical Interventions. In: Methods Guide for Effectiveness and Comparative Effectiveness Reviews. AHRQ Publication No. 10(13)-EHC063-EF. Rockville, MD: Agency for Healthcare Research and Quality. September 2013. Accessed Nov. 1, 2013 at: http://effectivehealthcare.ahrq.gov/ehc/products/60/318/CER-methods-guide-130916.pdf.
Barker AD, Sigman CC, Kelloff GJ, Hylton NM, et al. I-SPY 2: an adaptive breast cancer trial design in the setting of neoadjuvant chemotherapy. Clin Pharmacol Ther. 2009;86(1):97-100.
Benson K, Hartz AJ. A comparison of observational studies and randomized, controlled trials. N Engl J Med. 2000;342(25):1878-86. http://www.nejm.org/doi/full/10.1056/NEJM200006223422506.
Berry DA. Bayesian clinical trials. Nat Rev Drug Discov. 2006;5(1):27-36.
Boer GJ, Widner H. Clinical neurotransplantation: core assessment protocol rather than sham surgery as control. Brain Res Bull. 2002;58(6):547-53.
Briss PA, Zaza S, Pappaioanou M, Fielding J, et al. Developing an evidence-based Guide to Community Preventive Services. Am J Prev Med. 2000;18(1S):35-43.
Buring JE, Jonas MA, Hennekens CH. Large and simple randomized trials. In Tools for Evaluating Health Technologies: Five Background Papers. US Congress, Office of Technology Assessment, 1995;67-91. BP-H-142. Washington, DC: US Government Printing Office; 1994. Accessed Nov. 1, 2013 at: http://ota-cdn.fas.org/reports/9440.pdf.
Button KS, Ioannidis JP, Mokrysz C, Nosek BA, et al. Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci. 2013;14(5):365-76.
Carlson JJ, Sullivan SD, Garrison LP, Neumann PJ, Veenstra DL. Linking payment to health outcomes: a taxonomy and examination of performance-based reimbursement schemes between healthcare payers and manufacturers. Health Policy. 2010;96(3):179-90.
Campbell MK, Entwistle VA, Cuthbertson BH, Skea ZC, et al.; KORAL study group. Developing a placebo-controlled trial in surgery: issues of design, acceptability and feasibility. Trials. 2011;12:50. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3052178.
Chilcott J, Brennan A, Booth A, Karnon J, Tappenden P. The role of modelling in prioritising and planning clinical trials. Health Technol Assess. 2003;7(23):iii,1-125. http://www.journalslibrary.nihr.ac.uk/__data/assets/pdf_file/0006/64950/FullReport-hta7230.pdf.
Claxton K, Posnett J. An economic approach to clinical trial design and research priority-setting. Health Econ. 1996;5(6):513-24.
ClinicalTrials.gov. FDAAA 801 Requirements. December 2012. Accessed Aug. 1, 2013 at: http://clinicaltrials.gov/ct2/manage-recs/fdaaa#WhichTrialsMustBeRegistered.
Concato J, Shah N, Horwitz RI. Randomized, controlled trials, observational studies, and the hierarchy of research designs. N Engl J Med. 2000;342:1887-92. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1557642.
Croghan TW, Esposito D, Daniel G, Wahl P, Stoto MA. Using medical records to supplement a claims-based comparative effectiveness analysis of antidepressants. Pharmacoepidemiol Drug Saf. 2010;19(8):814-8.
de Bruin SR, Baan CA, Struijs JN. Pay-for-performance in disease management: a systematic review of the literature. BMC Health Serv Res. 2011;11:272. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3218039.
de Souza JA, Polite B, Perkins M, Meropol NJ, et al. Unsupported off-label chemotherapy in metastatic colon cancer. BMC Health Serv Res. 2012;12:481. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3544564.
Detsky AS. Using cost-effectiveness analysis to improve the efficiency of allocating funds to clinical trials. Stat Med. 1990;9(1-2):173-84.
DiMasi JA, Hansen RW, Grabowski HG. The price of innovation: new estimates of drug development costs. J Health Econ. 2003;22(2):151–85.
Djulbegovic B. The paradox of equipoise: the principle that drives and limits therapeutic discoveries in clinical research. Cancer Control. 2009;16(4):342-7. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2782889.
Dreyer NA, Velentgas P, Westrich K, Dubois R. The GRACE Checklist for Rating the Quality of Observational Studies of Comparative Effectiveness: A Tale of Hope and Caution. J Manag Care Pharm. 2014;20(3):301-8.
Eisenberg JM. Ten lessons for evidence-based technology assessment. JAMA. 1999; 282(19):1865-9.
Eldridge S, Ashby D, Bennett C, et al. Internal and external validity of cluster randomised trials: systematic review of recent trials. BMJ. 2008;336(7649):876-80. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2323095.
Ellenberg SS. Do large, simple trials have a place in the evaluation of AIDS therapies? Oncology. 1992;6(4):55-9,63.
Enck P, Bingel U, Schedlowski M, Rief W. The placebo response in medicine: minimize, maximize or personalize? Nat Rev Drug Discov. 2013;12(3):191-204.
Ferguson TB Jr, Peterson ED, Coombs LP, Eiken MC, et al. Use of continuous quality improvement to increase use of process measures in patients undergoing coronary artery bypass graft surgery: A randomized controlled trial. JAMA. 2003;290(1):49-56.
Fishman A, Martinez F, Naunheim K, et al; National Emphysema Treatment Trial Research Group. A randomized trial comparing lung-volume-reduction surgery with medical therapy for severe emphysema. N Engl J Med. 2003 May 22;348(21):2059-73. http://www.nejm.org/doi/full/10.1056/NEJMoa030287.
Food and Drug Administration. Adaptive Design Clinical Trials for Drugs and Biologics. Draft Guidance. Center for Drug Evaluation and Research, Center for Biologics Evaluation and Research. Rockville, MD, February 2010. Accessed Nov. 1, 2013 at: http://www.fda.gov/downloads/Drugs/.../Guidances/ucm201790.pdf.
Food and Drug Administration. Guidance for Industry. Enrichment Strategies for Clinical Trials to Support Approval of Human Drugs and Biological Products. Draft Guidance. Center for Drug Evaluation and Research, Center for Biologics Evaluation and Research, Center for Devices and Radiological Health. Rockville, MD, December 2012. Accessed Nov. 1, 2013 at: http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/UCM332181.pdf.
Food and Drug Administration. Guidance for the Use of Bayesian Statistics in Medical Device Clinical Trials. Center for Devices and Radiological Health, Center for Biologics Evaluation and Research. Rockville, MD, February 5, 2010. Accessed Nov. 1, 2013 at: http://www.fda.gov/downloads/MedicalDevices/DeviceRegulationandGuidance/GuidanceDocuments/ucm071121.pdf.
Fourth International Study of Infarct Survival: protocol for a large simple study of the effects of oral mononitrate, of oral captopril, and of intravenous magnesium. ISIS-4 collaborative group. Am J Cardiol. 1991;68(14):87D-100D.
Freedman B. Equipoise and the ethics of clinical research. N Engl J Med. 1987;317(3):141-5.
Friedman LM, Furberg CD, DeMets DL. Fundamentals of Clinical Trials. (4th edition). New York: Springer, 2010.
Fries JF, Krishnan E. Equipoise, design bias, and randomized controlled trials: the elusive ethics of new drug development. Arthritis Res Ther. 2004;6(3):R250-5. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC416446.
Fröbert O, Lagerqvist B, Gudnason T, Thuesen L, et al. Thrombus Aspiration in ST-Elevation myocardial infarction in Scandinavia (TASTE trial). A multicenter, prospective, randomized, controlled clinical registry trial based on the Swedish angiography and angioplasty registry (SCAAR) platform. Study design and rationale. Am Heart J. 2010;160(6):1042-8.
Frost J, Okun S, Vaughan T, Heywood J, Wicks P. Patient-reported outcomes as a source of evidence in off-label prescribing: analysis of data from PatientsLikeMe. J Med Internet Res. 2011 Jan 21;13(1):e6. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3221356.
Gabler NB, Duan N, Vohra S, Kravitz RL. N-of-1 trials in the medical literature: a systematic review. Med Care. 2011;49(8):761-8.
Glasziou P, Vandenbroucke J, Chalmers I. Assessing the quality of research. BMJ. 2004;328(7430):39-41. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC313908.
Goodman S. A dirty dozen: twelve p-value misconceptions. Semin Hematol. 2008;45(3):135-40.
Hartling L, Bond K, Harvey K, Santaguida PL, et al. Developing and Testing a Tool for the Classification of Study Designs in Systematic Reviews of Interventions and Exposures. Agency for Healthcare Research and Quality; December 2010. Methods Research Report. AHRQ Publication No. 11-EHC-007. http://www.ncbi.nlm.nih.gov/books/NBK52670/pdf/TOC.pdf.
Higgins JP, Altman DG, Gøtzsche PC, Jüni P, et al.; Cochrane Bias Methods Group; Cochrane Statistical Methods Group. The Cochrane Collaboration's tool for assessing risk of bias in randomized trials. BMJ. 2011;343:d5928. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3196245.
Higgins JPT, Altman DG, Sterne JAC, eds. Chapter 8: Assessing risk of bias in included studies. In: Higgins JPT, Green S, eds. Cochrane Handbook for Systematic Reviews of Interventions. Version 5.1.0 [updated March 2011]. The Cochrane Collaboration, 2011.
Horng S, Miller FG. Ethical framework for the use of sham procedures in clinical trials. Crit Care Med. 2003;31(suppl. 3):S126-30.
Howard L, Thornicroft G. Patient preference randomised controlled trials in mental health research. Br J Psychiatry. 2006;188:303-4. http://bjp.rcpsych.org/content/188/4/303.long.
Howick J, Glasziou P, Aronson JK. The evolution of evidence hierarchies: what can Bradford Hill's 'guidelines for causation' contribute? J R Soc Med. 2009;102(5):186-94. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2677430.
Huser V, Cimino JJ. Evaluating adherence to the International Committee of Medical Journal Editors' policy of mandatory, timely clinical trial registration. J Am Med Inform Assoc. 2013;20(e1):e169-74.
ICMJE (International Committee of Medical Journal Editors). Uniform Requirements for Manuscripts Submitted to Biomedical Journals: Publishing and Editorial Issues Related to Publication in Biomedical Journals: Obligation to Register Clinical Trials. 2013. Accessed Aug. 1, 2013 at: http://www.icmje.org/publishing_10register.html.
Institute of Medicine. Committee on Strategies for Small-Number-Participant Clinical Research Trials. Small Clinical Trials: Issues and Challenges. Washington, DC: National Academies Press; 2001. http://www.nap.edu/openbook.php?record_id=10078&page=1.
Ioannidis JP, Khoury MJ. Are randomized trials obsolete or more important than ever in the genomic era? Genome Med. 2013;5(4):32. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3707036.
Jadad AR, Moore RA, Carroll D, et al. Assessing the quality of reports of randomized clinical trials: Is blinding necessary? Control Clin Trials. 1996;17:1-12.
Lauer MS. Commentary: How the debate about comparative effectiveness research should impact the future of clinical trials. Stat Med. 2012;31(25):3051-3.
Lee JJ, Chu CT. Bayesian clinical trials in action. Stat Med. 2012;31(25):2955-72.
Maggioni AP, Franzosi MG, Fresco C, et al. GISSI trials in acute myocardial infarction. Rationale, design, and results. Chest. 1990;97(4 Suppl):146S-150S.
Mant D. Evidence and primary care: can randomized trials inform clinical decisions about individual patients? Lancet. 1999;353:743–6.
Mills N, Donovan JL, Wade J, Hamdy FC, et al. Exploring treatment preferences facilitated recruitment to randomized controlled trials. J Clin Epidemiol. 2011;64(10):1127-36. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3167372.
Morgan S, Grootendorst P, Lexchin J, Cunningham C, Greyson D. The cost of drug development: a systematic review. Health Policy. 2011;100(1):4-17.
Moseley JB, O’Malley K, Petersen NJ, et al. A controlled trial of arthroscopic surgery for osteoarthritis of the knee. N Engl J Med. 2002;347(2):81-8. http://www.nejm.org/doi/full/10.1056/NEJMoa013259.
Nakamura C, Bromberg M, Bhargava S, Wicks P, Zeng-Treitler Q. Mining online social network data for biomedical research: a comparison of clinicians' and patients' perceptions about amyotrophic lateral sclerosis treatments. J Med Internet Res. 2012;14(3):e90. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3414854.
OCEBM Levels of Evidence Working Group. The Oxford 2011 Levels of Evidence. Oxford Centre for Evidence-Based Medicine. http://www.cebm.net/index.aspx?o=5653.
Olivo SA, Macedo LG, Gadotti IC, Fuentes J, et al. Scales to assess the quality of randomized controlled trials: a systematic review. Phys Ther. 2008;88(2):156-75. http://ptjournal.apta.org/content/88/2/156.long.
Peto R, Collins R, Gray R. Large-scale randomized evidence: large, simple trials and overviews of trials. J Clin Epidemiol. 1995;48(1):23-40.
Piantadosi S. Clinical Trials: A Methodological Perspective (2nd edition). New York: Wiley, 2005.
Preference Collaborative Review Group. Patients' preferences within randomised trials: systematic review and patient level meta-analysis. BMJ. 2008;337:a1864. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2659956.
Printz C. I-SPY 2 may change how clinical trials are conducted: Researchers aim to accelerate approvals of cancer drugs. Cancer. 2013;119(11):1925-7.
Rawlins MD. De Testimonio: On the evidence for decisions about the use of therapeutic interventions. The Harveian Oration of 2008. London: Royal College of Physicians, 2008.
Rothman KJ, Michels KB. The continuing unethical use of placebo controls. N Engl J Med. 1994;331(6):394-7.
Roundtree AK, Kallen MA, Lopez-Olivo MA, Kimmel B, et al. Poor reporting of search strategy and conflict of interest in over 250 narrative and systematic reviews of two biologic agents in arthritis: a systematic review. J Clin Epidemiol. 2009;62(2):128-37.
Roy ASA. Stifling New Cures: The True Cost of Lengthy Clinical Drug Trials. Project FDA Report 5. New York: Manhattan Institute for Policy Research; April 2012. http://www.manhattan-institute.org/html/fda_05.htm.
Selby JV, Lipstein SH. PCORI at 3 years--progress, lessons, and plans. N Engl J Med. 2014;370(7):592-5. http://www.nejm.org/doi/full/10.1056/NEJMp1313061.
Silverman WA, Altman DG. Patients' preferences and randomised trials. Lancet. 1996;347(8995):171-4.
Spilker B. Guide to Clinical Trials. New York, NY: Raven Press, 1991.
Stone GW, Teirstein PS, Rubenstein R, et al. A prospective, multicenter, randomized trial of percutaneous transmyocardial laser revascularization in patients with nonrecanalizable chronic total occlusions. J Am Coll Cardiol. 2002;39(10):1581-7.
Stroup TS, McEvoy JP, Swartz MS, et al. The National Institute of Mental Health Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE) project: schizophrenia trial design and protocol development. Schizophr Bull. 2003;29(1):15-31. http://schizophreniabulletin.oxfordjournals.org/content/29/1/15.long.
Temple R. Problems in interpreting active control equivalence trials. Account Res. 1996;4(3-4):267-75.
Torgerson DJ, Sibbald B. Understanding controlled trials. What is a patient preference trial? BMJ. 1998;316(7128):360. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2665528.
Tunis SR, Stryer DB, Clancy CM. Practical clinical trials: increasing the value of clinical research for decision making in clinical and health policy. JAMA. 2003;290(12):1624-32.
US Preventive Services Task Force Procedure Manual. AHRQ Publication No. 08-05118-EF, July 2008. Accessed Aug.1, 2013 at: http://www.uspreventiveservicestaskforce.org/uspstf08/methods/procmanual.htm.
van der Graaf R, Roes KC, van Delden JJ. Adaptive trials in clinical research: scientific and ethical issues to consider. JAMA. 2012;307(22):2379-80.
Varmus H, Satcher D. Ethical complexities of conducting research in developing countries. N Engl J Med. 1997;337(14):1003-5.
Veatch RM. The irrelevance of equipoise. J Med Philos. 2007;32(2):167-83.
Viswanathan M, Ansari MT, Berkman ND, Chang S, et al. Chapter 9. Assessing the risk of bias of individual studies in systematic reviews of health care interventions. In: Methods Guide for Effectiveness and Comparative Effectiveness Reviews. AHRQ Publication No. 10(14)-EHC063-EF. Rockville, MD: Agency for Healthcare Research and Quality. January 2014. Accessed Feb. 1, 2014 at: http://www.effectivehealthcare.ahrq.gov/ehc/products/60/318/CER-Methods-Guide-140109.pdf.
Walach H, Falkenberg T, Fønnebø V, Lewith G, Jonas WB. Circular instead of hierarchical: methodological principles for the evaluation of complex interventions. BMC Med Res Methodol. 2006;6:29. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1540434.
Wang R, Lagakos SW, Ware JH, Hunter DJ, Drazen JM. Statistics in medicine—reporting of subgroup analyses in clinical trials. N Engl J Med. 2007;357(21):2189-94.
Whiting PF, Rutjes AW, Westwood ME, Mallett S, et al.; QUADAS-2 Group. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med. 2011;155(8):529-36.
Yusuf S, Held P, Teo KK, Toretsky ER. Selection of patients for randomized controlled trials: implications of wide or narrow eligibility criteria. Stat Med. 1990;9(1-2):73-83.
Zarin DA, Tse T, Williams RJ, Califf RM, Ide NC. The ClinicalTrials.gov results database--update and key issues. N Engl J Med. 2011;364(9):852-60. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3066456.