United States National Library of Medicine National Institutes of Health

HTA 101: V. APPRAISING THE EVIDENCE

A challenge for any HTA is to derive substantial findings from scientific evidence drawn from different types of studies of varying quality. Assessors should use a systematic approach to critically appraise the quality of the available studies.

Interpreting evidence requires knowledge of investigative methods and statistics. Assessment groups should include members who are knowledgeable in these areas. Some assessment programs assign content experts and evidence evaluation experts to prepare background papers that present and appraise the available evidence for use by assessment groups. Notwithstanding the expertise required to thoroughly and accurately assess evidence, even a basic understanding of fundamental evidence principles can help decision makers to appreciate the importance to health practice and policy of distinguishing between stronger and weaker evidence.

As suggested by the causal pathway in Box 23, assessors can interpret evidence at multiple levels. Evidence can be interpreted at the level of an individual study, e.g., an RCT pertaining to a particular intervention and outcome. It also can be interpreted at the level of a body of evidence (e.g., set of clinical studies) pertaining to the intervention and outcome. In some instances, evidence can be interpreted for a broader body of evidence for a linked set of interventions as a whole, such as for a screening test linked to results that are linked to one or more treatments with intermediate and long-term outcomes (Harris 2001). For example, the main criteria for judging evidence quality at each of these levels by the US Preventive Services Task Force are shown in Box 24.

 

Box 23
A General Causal Pathway: Screening Procedure and Alternative Treatments

Source: Adapted from Harris 2001.

Box 24
Evaluating Evidence Quality at Three Levels

 

Level of Evidence: Criteria for Judging Quality

Individual study
  • Internal validity (a)
  • External validity (b)

Linkage in analytic framework
  • Aggregate internal validity (a)
  • Aggregate external validity (b)
  • Coherence/consistency

Entire preventive service
  • Quality of the evidence from Stratum 2 for each linkage in the analytic framework
  • Degree to which there is a complete chain of linkages supported by adequate evidence to connect the preventive service to health outcomes
  • Degree to which the complete chain of linkages "fit" together (c)
  • Degree to which the evidence connecting the preventive service and health outcomes is "direct" (d)

(a) Internal validity is the degree to which the study(ies) provides valid evidence for the population and setting in which it was conducted.
(b) External validity is the extent to which the evidence is relevant and generalizable to the population and conditions of typical primary care practice.
(c) "Fit" refers to the degree to which the linkages refer to the same population and conditions. For example, if studies of a screening linkage identify people who are different from those involved in studies of the treatment linkage, the linkages are not supported by evidence that "fits" together.
(d) "Directness" of evidence is inversely proportional to the number of bodies of evidence required to make the connection between the preventive service and health outcomes. Evidence is direct when a single body of evidence makes the connection, and more indirect if two or more bodies of evidence are required.

Source: Harris 2001.

Appraising Individual Studies

Certain attributes of primary studies produce better evidence than others. In general, the following attributes of primary studies can be used to distinguish between stronger and weaker evidence for internal validity (i.e., for accurately representing the causal relationship between an intervention and an outcome in the particular circumstances of a study).

  • Prospective studies are superior to retrospective studies.
  • Experimental study designs are superior to observational study designs.
  • Controlled studies are superior to uncontrolled ones.
  • Contemporaneous (occurring at the same time) control groups are superior to historical control groups.
  • Internal control groups (i.e., managed within the study) are superior to studies with external control groups.
  • Randomized studies are superior to nonrandomized ones.
  • Large studies (i.e., involving enough patients to detect with acceptable confidence levels any true treatment effects) are superior to small studies.
  • Blinded studies (in which patients, and clinicians and data analysts where possible, do not know which intervention is being used) are superior to unblinded studies.
  • Studies that clearly define patient populations, interventions, and outcome measures are superior to those that do not clearly define these parameters.

Basic types of methods for generating new data on the effects of health care technology in humans include the following.

  • Large randomized controlled trial (RCT)
  • Small RCT
  • Nonrandomized trial with contemporaneous controls
  • Nonrandomized trial with historical controls
  • Cohort study
  • Case-control study
  • Cross-sectional study
  • Surveillance (e.g., using databases, registers, or surveys)
  • Series of consecutive cases
  • Single case report (anecdote)

Consistent with the attributes of stronger evidence noted above, these methods are listed in rough order of most to least scientifically rigorous for internal validity. This ordering of methods assumes that each study is properly designed and conducted. This list is representative; there are other variations of these study designs and some investigators use different terminology for certain methods. The demand for studies of higher methodological rigor is increasing among health care technology regulators, payers, providers and other policymakers.

It is not only the basic type of a study design (e.g., RCT or case-control study) that affects the quality of the evidence, but also the way in which the study was designed and conducted. There are systematic ways to evaluate the quality of individual studies, and numerous instruments exist for assessing studies of health care interventions, particularly RCTs (Schulz 1995, Jadad 1996). These instruments typically take one of three main forms: component, checklist, and scale assessment (Moher, Jadad 1996), for example, as shown in Box 25 and Box 26. Available research indicates that the more complex scales do not seem to produce more reliable assessments of the validity or "quality" of a study (Juni 1999).

Box 25
Basic Checklist for Reviewing Reports of Randomized Controlled Trials

Did the trial (mark Yes or No for each item):

  1. Specify outcome measures (endpoints) prior to the trial?  Yes __  No __
  2. Provide patient inclusion/exclusion criteria?  Yes __  No __
  3. Specify α-level for defining statistical significance?  Yes __  No __
  4. Specify β-level (power) to detect a treatment effect of a given meaningful magnitude?  Yes __  No __
  5. Make a prior estimate of required sample size (to satisfy levels of α and β)?  Yes __  No __
  6. Use a proper method for random allocation of patients to treatment and control groups?  Yes __  No __
  7. Use blinding (where possible):  Yes __  No __
     a. in the randomization process?  Yes __  No __
     b. for patients regarding their treatment?  Yes __  No __
     c. for observers/care givers regarding treatment?  Yes __  No __
     d. in collecting outcome data?  Yes __  No __
  8. State the numbers of patients assigned to the respective treatment and control groups?  Yes __  No __
  9. Clearly describe treatment and control (including placebo where applicable)?  Yes __  No __
  10. Account for patient compliance with treatments/regimens?  Yes __  No __
  11. Account for all events used as primary outcomes?  Yes __  No __
  12. Account for patient withdrawals/losses to follow-up?  Yes __  No __
  13. Analyze patient withdrawals/losses to follow-up:  Yes __  No __
     a. by intention-to-treat?  Yes __  No __
     b. by treatment actually received?  Yes __  No __
  14. Account for treatment complications/side effects?  Yes __  No __
  15. Provide test statistics (e.g., F, t, Z, chi-square) and P values for endpoints?  Yes __  No __
  16. Provide confidence intervals or confidence distributions?  Yes __  No __
  17. Discuss whether power was sufficient for negative trials?  Yes __  No __
  18. Interpret retrospective analyses (post hoc examination of subgroups and additional endpoints not identified prior to trial) appropriately?  Yes __  No __

Source: Goodman 1993.
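
The checklist items on the α-level, the β-level, and the prior estimate of sample size are linked: once the error levels and a meaningful treatment effect are chosen, an approximate sample size follows. The sketch below illustrates this with a common normal-approximation formula for comparing two proportions. The formula, the function name `sample_size_per_group`, and the event rates are assumptions chosen for illustration; they are not taken from Goodman 1993.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_control, p_treatment, alpha=0.05, beta=0.20):
    """Approximate number of patients needed per arm to compare two proportions.

    Assumes the common normal-approximation formula
        n = (z_{1-alpha/2} + z_{1-beta})^2 * (p1*q1 + p2*q2) / (p1 - p2)^2
    with a two-tailed alpha and power equal to 1 - beta.
    """
    if p_control == p_treatment:
        raise ValueError("the two proportions must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed significance level
    z_beta = NormalDist().inv_cdf(1 - beta)        # power = 1 - beta
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    n = (z_alpha + z_beta) ** 2 * variance / (p_control - p_treatment) ** 2
    return math.ceil(n)


# Hypothetical event rates: a larger and a smaller treatment effect.
n_large_effect = sample_size_per_group(0.10, 0.05)
n_small_effect = sample_size_per_group(0.10, 0.08)
print(n_large_effect, n_small_effect)
```

Under these assumptions, the smaller the difference a trial is meant to detect, the more patients each arm requires, which is why a trial's report should state the effect size used in its sample size estimate.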

Box 26
Jadad Instrument to Assess the Quality of RCT Reports

This is not the same as being asked to review a paper. It should not take more than 10 minutes to score a report and there are no right or wrong answers.

Please read the article and try to answer the following questions (see attached instructions):

  1. Was the study described as randomized (this includes the use of words such as randomly, random, and randomization)?
  2. Was the study described as double blind?
  3. Was there a description of withdrawals and dropouts?

Scoring the items:

Either give a score of 1 point for each "yes" or 0 points for each "no." There are no in-between marks.

Give 1 additional point if: For question 1, the method to generate the sequence of randomization was described and it was appropriate (table of random numbers, computer generated, etc.)

and/or: If for question 2, the method of double blinding was described and it was appropriate (identical placebo, active placebo, dummy, etc.)

Deduct 1 point if: For question 1, the method to generate the sequence of randomization was described and it was inappropriate (patients were allocated alternately, or according to date of birth, hospital number, etc.)

and/or: for question 2, the study was described as double blind but the method of blinding was inappropriate (e.g., comparison of tablet vs. injection with no double dummy)

Guidelines for Assessment

  1. Randomization: A method to generate the sequence of randomization will be regarded as appropriate if it allowed each study participant to have the same chance of receiving each intervention and the investigators could not predict which treatment was next. Methods of allocation using date of birth, date of admission, hospital numbers, or alternation should not be regarded as appropriate.
  2. Double blinding: A study must be regarded as double blind if the word "double blind" is used. The method will be regarded as appropriate if it is stated that neither the person doing the assessments nor the study participant could identify the intervention being assessed, or if in the absence of such a statement the use of active placebos, identical placebos, or dummies is mentioned.
  3. Withdrawals and dropouts: Participants who were included in the study but did not complete the observation period or who were not included in the analysis must be described. The number and the reasons for withdrawal in each group must be stated. If there were no withdrawals, it should be stated in the article. If there is no statement on withdrawals, this item must be given no points.

Source: Jadad 1996.
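
The scoring rules in Box 26 can be expressed as a short function. This is a minimal sketch rather than an official implementation of the instrument: the function name, the parameter names, and the use of the strings "appropriate" and "inappropriate" to record the appraiser's judgment are assumptions made for illustration.

```python
def jadad_score(randomized, double_blind, withdrawals_described,
                randomization_method=None, blinding_method=None):
    """Score an RCT report using the rules listed in Box 26.

    randomization_method and blinding_method record the appraiser's judgment:
    "appropriate", "inappropriate", or None when the method is not described.
    """
    allowed = {None, "appropriate", "inappropriate"}
    if randomization_method not in allowed or blinding_method not in allowed:
        raise ValueError("method must be 'appropriate', 'inappropriate', or None")

    # One point for each "yes" to the three questions; no in-between marks.
    score = int(randomized) + int(double_blind) + int(withdrawals_described)

    # Add or deduct a point for the described method of randomization.
    if randomized and randomization_method == "appropriate":
        score += 1
    elif randomized and randomization_method == "inappropriate":
        score -= 1

    # Add or deduct a point for the described method of double blinding.
    if double_blind and blinding_method == "appropriate":
        score += 1
    elif double_blind and blinding_method == "inappropriate":
        score -= 1

    return score


# Hypothetical report: randomized by computer-generated sequence, described as
# double blind without describing the method, and withdrawals are reported.
example = jadad_score(True, True, True, randomization_method="appropriate")
print(example)
```

With these rules the total falls between 0 and 5. The judgments of "appropriate" and "inappropriate" still have to be made by the reader following the Guidelines for Assessment; the function only does the arithmetic.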

The criteria used for assessing quality of studies vary by type of design. For example, the internal validity of an RCT depends on such methodological criteria as: method of randomization, accounting for withdrawals and dropouts, and blinding/masking of outcomes assessment. The internal validity of systematic reviews (discussed below) depends on such methodological criteria as: time period covered by the review, comprehensiveness of the sources and search strategy used, relevance of included studies to the review topic, and application of a standard appraisal of included studies.  

The ability of analysts to determine the internal and external validity of a published study, and to otherwise interpret its quality, depends on how thoroughly and clearly the study's design, conduct, statistical analysis, and other aspects are reported. The inadequate quality of a high proportion of published reports of RCTs, even in leading journals, has been well documented (Freiman 1978; Moher 1994). Several national and international groups of researchers and medical journal editors have developed standards for reporting of RCTs and other studies (Moher 2001; International Committee of Medical Journal Editors 1997). The trend of more journals to require structured abstracts has assisted analysts in identifying and screening reports of RCTs and other studies.

Many primary studies of health care technologies involve small, non-randomized series of consecutive cases or single case reports, and therefore have methodological limitations that make it difficult to establish the efficacy (or other attributes) of the technologies with sound scientific validity. To some extent, these methodological shortcomings are unavoidable given the nature of the technologies being evaluated, or are otherwise beyond the control of the investigators. In the instance of determining the efficacy of a new drug, the methodological standard is a large, prospective, double-blind, placebo-controlled RCT. These methodological attributes increase the chances of detecting any real treatment effect of the new drug, control for patient characteristics that might influence any treatment effect, and reduce opportunities for investigator or patient bias to affect results.

Although their contributions to methodological validity are generally well recognized, it is not possible to apply all of these attributes for trials of certain types of technologies or for certain clinical indications or settings. Further, these attributes are controversial in certain instances. Patient and/or investigator blinding is impractical or impossible for many medical devices and most surgical procedures. For clinical trials of technologies for rare diseases (e.g., "orphan drugs" and devices), it may be difficult to recruit numbers of patients large enough to detect convincing treatment effects.  

Among the various areas of methodological controversy in clinical trials is the appropriate use of placebo controls. Issues include: (1) appropriateness of using a placebo in a trial of a new therapy when a therapy judged to be effective already exists, (2) statistical requirements for discerning what may be smaller differences in outcomes between a new therapy and an existing one compared to differences in outcomes between a new therapy and a placebo, and (3) concerns about comparing a new treatment to an existing therapy that, except during the trial itself, may be unavailable in a given setting (e.g., a developing country) because of its cost or other economic or social constraints (Rothman 1994; Varmus 1997). As with other health technologies, surgical procedures can be subject to the placebo effect. In recent years, following previous missteps that raised profound ethical concerns, guidance has emerged for using "sham" procedures as placebos in RCTs of surgical procedures (Horng 2003). Some instances of patient blinding have been most revealing about the placebo effect in surgery, including arthroscopic knee surgery (Moseley 2002), percutaneous myocardial laser revascularization (Stone 2002), and neurotransplantation surgery (Boer 2002).

Notwithstanding the limitations inherent in clinical study of many technologies, the methodological rigor used in many primary studies falls short of what it could be. Clinicians, patients, payers, hospital managers, national policymakers, and others who make technology-related decisions and policies are becoming more sophisticated in demanding and interpreting the strength of scientifically-based findings.

Decide How to Use Studies

Most assessment groups have decided that it is not appropriate to consider all studies equally important, and that studies of higher quality should influence their findings more than studies of lesser quality. Experts in evidence interpretation do not agree on the proper approach for deciding how to use studies of differing quality. According to some experts, the results of studies that do not have randomized controls are subject to such great bias that they should not be included for determining the effects of an intervention. Others say that results from nonrandomized prospective studies, observational studies, and other weaker designs should be used, but given less weight or adjusted for their biases.

There are several basic approaches to deciding how to use the individual studies in an assessment. These are: use all studies as reported; decide whether to include or exclude each study as reported; weight studies according to their relative quality; and make adjustments to the results of studies to compensate for their biases. Each approach has advantages and disadvantages, as well as differing technical requirements. As noted below with regard to establishing search strategies, the types of studies to be used in an assessment should be determined prospectively as much as possible, so as to avoid introducing bias into study selection. Therefore, to the extent that assessors decide to use only certain types of studies (e.g., RCTs and systematic reviews) or not to use certain types of studies (e.g., case studies, case series, and other weaker designs), they should set their inclusion and exclusion criteria prospectively and design their literature search strategies accordingly. Assessment reports should document the criteria or procedures by which study results were selected for use in the assessment.
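
The first three approaches named above can be contrasted in a few lines of code. The sketch below is illustrative only: the `Study` record, the effect estimates, and the quality weights are hypothetical, and a real assessment would pool results with formal meta-analytic methods rather than simple averages. The fourth approach, adjusting results for bias, needs study-specific modeling and is not shown.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Study:
    name: str
    design: str     # e.g., "RCT", "cohort", "case series"
    effect: float   # reported effect estimate (hypothetical units)
    quality: float  # appraiser-assigned weight between 0 and 1 (hypothetical)


def use_all(studies):
    """Approach 1: use all studies as reported, treating them equally."""
    return mean(s.effect for s in studies)


def include_only(studies, allowed_designs):
    """Approach 2: include or exclude each study by prospectively set criteria."""
    kept = [s for s in studies if s.design in allowed_designs]
    if not kept:
        raise ValueError("no studies meet the inclusion criteria")
    return mean(s.effect for s in kept)


def quality_weighted(studies):
    """Approach 3: weight each study's result by its relative quality."""
    total_weight = sum(s.quality for s in studies)
    if total_weight == 0:
        raise ValueError("at least one study needs a nonzero quality weight")
    return sum(s.effect * s.quality for s in studies) / total_weight


# Hypothetical body of evidence.
studies = [
    Study("Trial A", "RCT", 0.20, 1.0),
    Study("Study B", "cohort", 0.40, 0.5),
    Study("Series C", "case series", 0.60, 0.25),
]
print(use_all(studies), include_only(studies, {"RCT"}), quality_weighted(studies))
```

In this made-up example the weaker designs report larger effects, so the three approaches give different summary estimates. That is the practical reason for fixing the approach and the inclusion criteria before looking at the results.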

Appraising a Body of Evidence

As described above, certain attributes of primary study designs produce better evidence than others. A useful step in appraising evidence is to classify it by basic design type and other study characteristics.

Evidence tables provide a useful way to summarize and display important qualities about multiple individual studies pertaining to a given question. The information summarized in evidence tables may include attributes of study design (e.g., randomization, control, blinding), patient characteristics (e.g., number, age, gender), patient outcomes (e.g., mortality, morbidity, HRQL) and derived summary statistics (e.g., P values, confidence intervals). The tabular format enables reviewers to compare systematically the key attributes of studies and to provide an overall picture of the amount and quality of the available evidence. Box 27 is an evidence table of selected study characteristics and outcomes of double-blind placebo-controlled RCTs of aspirin for patients after myocardial infarction.

"Grading" a body of evidence according to its methodological rigor is a standard part of HTA. It can take various forms, each of which involves structured, critical appraisal of the evidence against formal criteria (RTI International-University of North Carolina 2002). Box 28 shows an evidence hierarchy that ranks study types from "well-designed randomized controlled trials" at the top through "opinions of respected authorities based on clinical experience" and similar types of expert views at the bottom. Box 29 shows a basic evidence-grading scheme that has been used by the US Preventive Services Task Force. This scheme grades evidence in a manner that favors certain attributes of stronger studies for primary data, beginning with properly-designed RCTs. In order to better address how well studies are conducted, the task force augmented this hierarchy with a three-category rating of the internal validity of each study, shown in Box 30.

Box 27
Evidence Table: Double-Blind Placebo-Controlled RCTs of Aspirin in Patients After Myocardial Infarction

AMIS, 1980
  • No. patients randomized: ASA 2,267; plac 2,257
  • Age range (mean): 30-69 (54.8)
  • Male: 89%
  • Months from qualifying event to trial entry: 2 - 60
  • Daily dose ASA [1] (mg): 1,000
  • Average follow-up (years): 3.2
  • Mortality (%) [2]: ASA 10.8; plac 9.7; sum. stat. [3] Z=1.27
  • Cardiac death (%): ASA 8.7; plac 8.0; sum. stat. Z=0.82
  • Nonfatal MI (%): ASA 7.7; plac 9.5; sum. stat. Z=-2.11

Breddin, 1980
  • No. patients randomized: ASA 317; plac 309
  • Age range (mean): 45-70
  • Male: 78%
  • Months from qualifying event to trial entry: 1 - 1.4
  • Daily dose ASA (mg): 1,500
  • Average follow-up (years): 2.0
  • Mortality (%): ASA 8.5; plac 10.4; sum. stat. Z=-0.79
  • Cardiac death (%): ASA 1.6; plac 3.2
  • Nonfatal MI (%): ASA 3.5; plac 4.9

CDPR, 1976
  • No. patients randomized: ASA 758; plac 771
  • Age range (mean): ASA 62% > 55 yrs; plac 61% > 55 yrs
  • Male: 100%
  • Months from qualifying event to trial entry: ASA 74% > 60; plac 77% > 60
  • Daily dose ASA (mg): 972
  • Average follow-up (years): 1.8
  • Mortality (%): ASA 5.8; plac 8.3; sum. stat. Z=-1.90
  • Cardiac death (%): ASA 5.4; plac 7.8; sum. stat. Z=-1.87
  • Nonfatal MI (%): ASA 3.7; plac 4.2; sum. stat. Z=-0.46

Elwood, 1974
  • No. patients randomized: ASA 615; plac 624
  • Age range (mean): ASA 57% > 55 yrs; plac 54% > 55 yrs (55.0)
  • Male: 100%
  • Months from qualifying event to trial entry: 76% < 3
  • Daily dose ASA (mg): 300
  • Average follow-up (years): 1.0
  • Mortality (%): ASA 7.6; plac 9.8; sum. stat. not sig.
  • Cardiac death (%): -
  • Nonfatal MI (%): -

Elwood, 1979
  • No. patients randomized: ASA 832; plac 850
  • Age range (mean): (56.0)
  • Male: 85%
  • Months from qualifying event to trial entry: 50% < 0.25
  • Daily dose ASA (mg): 900
  • Average follow-up (years): 1.0
  • Mortality (%): ASA 12.8; plac 14.8; sum. stat. not sig. at P<0.05
  • Cardiac death (%): -
  • Nonfatal MI (%): -

PARIS, 1980
  • No. patients randomized: ASA 810; plac 406
  • Age range (mean): 30-74
  • Male: 87%
  • Months from qualifying event to trial entry: 2 - 60
  • Daily dose ASA (mg): 972
  • Average follow-up (years): 3.4
  • Mortality (%): ASA 10.5; plac 12.8; sum. stat. Z=-1.21
  • Cardiac death (%): ASA 8.0; plac 10.1; sum. stat. Z=-1.24
  • Nonfatal MI (%): ASA 6.9; plac 9.9; sum. stat. not sig.

[1] ASA: aspirin (acetylsalicylic acid); plac: placebo
[2] Percent of mortality, cardiac death, and nonfatal myocardial infarction based on number of patients randomized.
[3] Sum. stat.: summary statistic. Z is a test statistic that can be used to determine whether the difference in proportions or means between a treatment group and a control group is statistically significant. For a two-tailed test, Z values of ±1.96 and ±2.58 are approximately equivalent to P values of 0.05 and 0.01.

Sources: Aspirin Myocardial Infarction Study Research Group 1980; Breddin et al. 1980; The Coronary Drug Project Research Group 1976; Elwood and Sweetnam 1979; Elwood et al. 1974; Elwood 1983; The Persantine-Aspirin Reinfarction Study Research Group 1980.

Box 28
UK NHS Centre for Reviews and Dissemination: Hierarchy of Evidence

Level: Description

I: Well-designed randomized controlled trials
II-1a: Well-designed controlled trial with pseudo-randomization
II-1b: Well-designed controlled trials with no randomization
II-2a: Well-designed cohort (prospective) study with concurrent controls
II-2b: Well-designed cohort (prospective) study with historical controls
II-2c: Well-designed cohort (retrospective) study with concurrent controls
II-3: Well-designed case-control (retrospective) study
III: Large differences from comparisons between times and/or places with and without intervention (in some circumstances these may be equivalent to level II or I)
IV: Opinions of respected authorities based on clinical experience; descriptive studies; reports of expert committees

Source: NHS Centre for Reviews and Dissemination 1996.

Another type of evidence table, shown in Box 31, has a count of articles published during a given time period, arranged by type of study, about the use of percutaneous transluminal coronary angioplasty. Rather than showing details about individual studies, this evidence table shows that the distribution of types of studies in an apparently large body of evidence included a relatively small number of RCTs, and a large number of less rigorous observational studies.
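
For evidence tables that report summary statistics, such as the Z values in Box 27, it can help to see how such a statistic relates to the reported percentages and to P values. The sketch below uses a pooled two-proportion Z test, which is an assumption of this example; the original trial reports may have used other methods (and the percentages in Box 27 are rounded), so the result is not expected to reproduce the tabulated Z values exactly. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_z(pct_treatment, n_treatment, pct_control, n_control):
    """Z statistic for the difference between two proportions (pooled variance).

    Percentages are event rates based on the number of patients randomized.
    A negative Z means fewer events in the treatment group.
    """
    p1 = pct_treatment / 100
    p2 = pct_control / 100
    pooled = (p1 * n_treatment + p2 * n_control) / (n_treatment + n_control)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_treatment + 1 / n_control))
    return (p1 - p2) / se


def two_tailed_p(z):
    """Two-tailed P value corresponding to a Z statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


# AMIS mortality figures as listed in Box 27: ASA 10.8% of 2,267; placebo 9.7% of 2,257.
z_amis = two_proportion_z(10.8, 2267, 9.7, 2257)
print(round(z_amis, 2), round(two_tailed_p(z_amis), 2))
```

The `two_tailed_p` helper also illustrates the correspondence given in the Box 27 footnote: Z values of about ±1.96 and ±2.58 map to two-tailed P values of about 0.05 and 0.01.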

Assessment groups can classify studies in evidence tables to gain an understanding of the distribution of evidence by type, and apply evidence hierarchies such as those shown above to summarize a body of evidence. However, more information may be needed to characterize the evidence in a useful way. For example, more detailed grading schemes can be used to account for instances where two or more well-designed studies have conflicting (heterogeneous) results. Box 32 distinguishes between groups of studies with homogeneous and heterogeneous results. This hierarchy also recognizes as stronger evidence studies with low probabilities of false positive error (α) and false negative error (β). This hierarchy also distinguishes between bodies of evidence depending on whether high-quality overviews (i.e., systematic reviews or meta-analyses) are available. 

Box 29
US Preventive Services Task Force: Hierarchy of Research Design

 

I: Evidence obtained from at least one properly-designed randomized controlled trial.

II-1: Evidence obtained from well designed controlled trials without randomization.

II-2: Evidence obtained from well designed cohort or case-controlled analytic studies, preferably from more than one center or research group.

II-3: Evidence obtained from multiple time series with or without the intervention. Dramatic results in uncontrolled experiments (such as the results of the introduction of penicillin treatment in the 1940s) could also be regarded as this type of evidence.

III: Opinions of respected authorities, based on clinical experience, descriptive studies, or reports of expert committees.

Source: Harris 2001.

Box 30
US Preventive Services Task Force: Grades for Strength of Overall Evidence

 

Grade: Definition

Good: Evidence includes consistent results from well-designed, well-conducted studies in representative populations that directly assess effects on health outcomes.

Fair: Evidence is sufficient to determine effects on health outcomes, but the strength of the evidence is limited by the number, quality, or consistency of the individual studies; generalizability to routine practices; or indirect nature of the evidence on health outcomes.

Poor: Evidence is insufficient to assess the effects on health outcomes because of limited number or power of studies, important flaws in their design or conduct, gaps in the chain of evidence, or lack of information on important health outcomes.

Source: U.S. Preventive Services Task Force 2002.

Box 31
Distribution of Research Articles on PTCA by Year of Publication and Method Used to Collect or Review Data

 

Article Class                    1980  '81  '82  '83  '84  '85  '86  '87  '88  '89  '90  Total
Prospective RCT                     0    0    0    0    1    1    2    4    2    1    2     13
Prospective non-RCT                 0    0    1    3    4    5    5    6   11    8    3     46
Prospective registry                0    0    2    4   13    2    1    1    3    4    7     37
Case-control & adjusted cohort      0    0    1    2    0    0    2    2    4    5    2     18
Observational                       1    1    1    3   12   12   12   27   25   29    8    131
Survey                              0    0    0    0    0    0    0    0    1    0    1      2
Editorial                           0    0    0    1    2    3    1    4    2    4    5     22
Review                              0    0    0    2    3    4    4    5   16   14   11     59
Cross-sectional                     0    0    0    0    0    0    0    2    1    0    0      3
Decision analysis                   0    0    0    0    0    0    0    0    0    0    1      1
Total                               1    1    5   15   35   27   27   51   65   65   40    332

Articles were retrieved using MEDLINE searches.

Source: Hilborne 1991.
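
The point made with Box 31, that RCTs are a small share of an apparently large literature, can be checked with simple arithmetic on the row totals. The sketch below copies the totals from Box 31; the variable names are illustrative.

```python
# Row totals from Box 31 (articles on PTCA, 1980-1990, by article class).
article_totals = {
    "Prospective RCT": 13,
    "Prospective non-RCT": 46,
    "Prospective registry": 37,
    "Case-control & adjusted cohort": 18,
    "Observational": 131,
    "Survey": 2,
    "Editorial": 22,
    "Review": 59,
    "Cross-sectional": 3,
    "Decision analysis": 1,
}

total_articles = sum(article_totals.values())

# Share of each article class as a percentage of all retrieved articles.
shares = {name: 100 * count / total_articles
          for name, count in article_totals.items()}

# List the classes from most to least common.
for name, pct in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {pct:.1f}%")
```

The row totals sum to the 332 articles shown in the table, of which the 13 prospective RCTs are about 4% and the 131 observational articles about 39%.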

Box 32
Levels of Evidence and Grades of Recommendations

 

Grade of Recommendation A
  • If no overview available, level of evidence I: Randomized trials with low false-positive (α) and low false-negative (β) errors.
  • If high-quality overview available, level of evidence: Lower limit of CI for treatment effect exceeds clinically significant benefit and:
      I+: Individual study results homogeneous
      I-: Individual study results heterogeneous

Grade of Recommendation B
  • If no overview available, level of evidence II: Randomized trials with high false-positive (α) and high false-negative (β) errors.
  • If high-quality overview available, level of evidence: Lower limit of CI for treatment effect falls below clinically significant benefit and:
      II+: Individual study results homogeneous
      II-: Individual study results heterogeneous

Grade of Recommendation C
  • If no overview available, level of evidence:
      III: Nonrandomized concurrent cohort studies
      IV: Nonrandomized historical cohort studies
      V: Case series

Source: Cook 1992.
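
When a high-quality overview is available, the level in Box 32 depends on two inputs: where the lower limit of the confidence interval falls relative to the clinically significant benefit, and whether the individual study results are homogeneous. A minimal sketch of that rule follows. The function name and arguments are hypothetical, the effect is assumed to be scaled so that larger values mean more benefit, and treating a lower limit exactly equal to the threshold as level II is an assumption, since Box 32 does not address that case.

```python
def overview_level(ci_lower_limit, clinically_significant_benefit, homogeneous):
    """Return the Box 32 level (I+, I-, II+, II-) for a high-quality overview.

    Assumes larger effect values mean greater benefit. A lower limit exactly
    equal to the threshold is treated as level II (an assumption).
    """
    base = "I" if ci_lower_limit > clinically_significant_benefit else "II"
    suffix = "+" if homogeneous else "-"
    return base + suffix


def recommendation_grade(level):
    """Map a level from overview_level to its grade of recommendation."""
    return "A" if level.startswith("II") is False else "B"


# Hypothetical overview: lower CI limit of 0.12 against a threshold of 0.10,
# with heterogeneous results across the individual studies.
level = overview_level(0.12, 0.10, homogeneous=False)
print(level, recommendation_grade(level))
```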

The more comprehensive evidence hierarchy from the UK NHS Centre for Evidence Based Medicine, shown in Box 33, provides levels of evidence (1a-c, 2a-c, etc.) to accompany findings based on evidence derived from various study designs and applications in prevention, therapy, diagnosis, economic analysis, etc.

Of course, HTAs may involve multiple questions about the use of a technology, e.g., pertaining to particular patient populations or health care settings. Therefore, the evidence and recommendations applying to each question may be evaluated separately or at different levels, as suggested in the causal pathway shown in Box 23.

Link Recommendations to Evidence

Findings and recommendations should be linked explicitly to the quality of the evidence. The process of interpreting and integrating the evidence helps assessment groups to determine the adequacy of the evidence for addressing aspects of their assessment problems (Hayward 1995).

An example of linking recommendations to evidence is incorporated into the evidence appraisal scheme cited above in Box 32, which assigns three grade levels to recommendations based on the evidence. Accompanying the grades for evidence (as shown in Box 30), the US Preventive Services Task Force provides grades for recommendations based on the evidence. This approach, shown in Box 34, reflects two dimensions: the direction of the recommendation (e.g., for or against providing a preventive service) and the strength of the recommendation, tied to the grade of evidence (e.g., a strong recommendation if there is good evidence). Finally, the comprehensive evidence hierarchy shown in Box 33 also includes grades of recommendation that are linked to levels of evidence, including levels that account for evidence homogeneity and heterogeneity.

Even for those aspects of an assessment problem for which there is little useful evidence, an assessment group may have to provide some type of findings or recommendations. This may involve making inferences from the limited evidence, extrapolations of evidence from one circumstance to another, theory, or other subjective judgments. Whether a recommendation about using a technology in particular circumstances is positive, negative, or equivocal (neutral), users of the assessment should understand the basis of that recommendation and with what level of confidence it was made. Unfortunately, the recommendations made in many assessment reports do not reflect the relative strength of the evidence upon which they are based. In these instances, readers may have the mistaken impression that all of the recommendations in an assessment report are equally valid or authoritative.

Approaches for linking the quality of available evidence to the strength and direction of findings and recommendations are being improved and new ones are being developed (Harbour 2001). Using evidence this way enables readers to better understand the reasoning behind the assessment findings and recommendations. It also provides readers with a more substantive basis upon which to challenge the assessment as appropriate. Further, it helps assessment programs and policymakers to determine if a reassessment is needed as relevant new evidence becomes available.


Box 33
Oxford Centre for Evidence-based Medicine Levels of Evidence (May 2001)

 

Each level below lists its entry for the five columns of the original table: Therapy/Prevention, Aetiology/Harm; Prognosis; Diagnosis; Differential diagnosis/symptom prevalence study; Economic and decision analyses.

Level 1a
  • Therapy/Prevention, Aetiology/Harm: SR (with homogeneity*) of RCTs
  • Prognosis: SR (with homogeneity*) of inception cohort studies; CDR† validated in different populations
  • Diagnosis: SR (with homogeneity*) of Level 1 diagnostic studies; CDR† with 1b studies from different clinical centres
  • Differential diagnosis/symptom prevalence study: SR (with homogeneity*) of prospective cohort studies
  • Economic and decision analyses: SR (with homogeneity*) of Level 1 economic studies

Level 1b
  • Therapy/Prevention, Aetiology/Harm: Individual RCT (with narrow Confidence Interval‡)
  • Prognosis: Individual inception cohort study with > 80% follow-up; CDR† validated in a single population
  • Diagnosis: Validating** cohort study with good††† reference standards; or CDR† tested within one clinical centre
  • Differential diagnosis/symptom prevalence study: Prospective cohort study with good follow-up****
  • Economic and decision analyses: Analysis based on clinically sensible costs or alternatives; systematic review(s) of the evidence; and including multi-way sensitivity analyses

Level 1c
  • Therapy/Prevention, Aetiology/Harm: All or none§
  • Prognosis: All or none case-series
  • Diagnosis: Absolute SpPins and SnNouts††
  • Differential diagnosis/symptom prevalence study: All or none case-series
  • Economic and decision analyses: Absolute better-value or worse-value analyses††††

Level 2a
  • Therapy/Prevention, Aetiology/Harm: SR (with homogeneity*) of cohort studies
  • Prognosis: SR (with homogeneity*) of either retrospective cohort studies or untreated control groups in RCTs
  • Diagnosis: SR (with homogeneity*) of Level >2 diagnostic studies
  • Differential diagnosis/symptom prevalence study: SR (with homogeneity*) of 2b and better studies
  • Economic and decision analyses: SR (with homogeneity*) of Level >2 economic studies

Level 2b
  • Therapy/Prevention, Aetiology/Harm: Individual cohort study (including low quality RCT; e.g., <80% follow-up)
  • Prognosis: Retrospective cohort study or follow-up of untreated control patients in an RCT; Derivation of CDR† or validated on split-sample§§§ only
  • Diagnosis: Exploratory** cohort study with good††† reference standards; CDR† after derivation, or validated only on split-sample§§§ or databases
  • Differential diagnosis/symptom prevalence study: Retrospective cohort study, or poor follow-up
  • Economic and decision analyses: Analysis based on clinically sensible costs or alternatives; limited review(s) of the evidence, or single studies; and including multi-way sensitivity analyses

Level 2c
  • Therapy/Prevention, Aetiology/Harm: "Outcomes" Research; Ecological studies
  • Prognosis: "Outcomes" Research
  • Diagnosis: (no entry)
  • Differential diagnosis/symptom prevalence study: Ecological studies
  • Economic and decision analyses: Audit or outcomes research

Level 3a
  • Therapy/Prevention, Aetiology/Harm: SR (with homogeneity*) of case-control studies
  • Prognosis: (no entry)
  • Diagnosis: SR (with homogeneity*) of 3b and better studies
  • Differential diagnosis/symptom prevalence study: SR (with homogeneity*) of 3b and better studies
  • Economic and decision analyses: SR (with homogeneity*) of 3b and better studies

Level 3b
  • Therapy/Prevention, Aetiology/Harm: Individual Case-Control Study
  • Prognosis: (no entry)
  • Diagnosis: Non-consecutive study; or without consistently applied reference standards
  • Differential diagnosis/symptom prevalence study: Non-consecutive cohort study, or very limited population
  • Economic and decision analyses: Analysis based on limited alternatives or costs, poor quality estimates of data, but including sensitivity analyses incorporating clinically sensible variations.

Level 4
  • Therapy/Prevention, Aetiology/Harm: Case-series (and poor quality cohort and case-control studies§§)
  • Prognosis: Case-series (and poor quality prognostic cohort studies***)
  • Diagnosis: Case-control study, poor or non-independent reference standard
  • Differential diagnosis/symptom prevalence study: Case-series or superseded reference standards
  • Economic and decision analyses: Analysis with no sensitivity analysis

Level 5
  • Therapy/Prevention, Aetiology/Harm: Expert opinion without explicit critical appraisal, or based on physiology, bench research or "first principles"
  • Prognosis: Expert opinion without explicit critical appraisal, or based on physiology, bench research or "first principles"
  • Diagnosis: Expert opinion without explicit critical appraisal, or based on physiology, bench research or "first principles"
  • Differential diagnosis/symptom prevalence study: Expert opinion without explicit critical appraisal, or based on physiology, bench research or "first principles"
  • Economic and decision analyses: Expert opinion without explicit critical appraisal, or based on economic theory or "first principles"


Notes

Users can add a minus-sign "-" to denote a level of evidence that fails to provide a conclusive answer because of:

  • EITHER a single result with a wide Confidence Interval (such that, for example, an ARR in an RCT is not statistically significant but its confidence interval fails to exclude clinically important benefit or harm)
  • OR a Systematic Review with troublesome (and statistically significant) heterogeneity.

Such evidence is inconclusive, and therefore can only generate Grade D recommendations.

* By homogeneity we mean a systematic review that is free of worrisome variations (heterogeneity) in the directions and degrees of results between individual studies. Not all systematic reviews with statistically significant heterogeneity need be worrisome, and not all worrisome heterogeneity need be statistically significant. As noted above, studies displaying worrisome heterogeneity should be tagged with a "-" at the end of their designated level.

† Clinical Decision Rule. (These are algorithms or scoring systems which lead to a prognostic estimation or a diagnostic category.)

‡ See note #2 for advice on how to understand, rate and use trials or other studies with wide confidence intervals.

§ Met when all patients died before the Rx became available, but some now survive on it; or when some patients died before the Rx became available, but none now die on it.

§§ By poor quality cohort study we mean one that failed to clearly define comparison groups and/or failed to measure exposures and outcomes in the same (preferably blinded), objective way in both exposed and non-exposed individuals and/or failed to identify or appropriately control known confounders and/or failed to carry out a sufficiently long and complete follow-up of patients. By poor quality case-control study we mean one that failed to clearly define comparison groups and/or failed to measure exposures and outcomes in the same (preferably blinded), objective way in both cases and controls and/or failed to identify or appropriately control known confounders.

§§§ Split-sample validation is achieved by collecting all the information in a single tranche, then artificially dividing this into "derivation" and "validation" samples.

†† An "Absolute SpPin" is a diagnostic finding whose Specificity is so high that a Positive result rules-in the diagnosis. An "Absolute SnNout" is a diagnostic finding whose Sensitivity is so high that a Negative result rules-out the diagnosis.

‡‡ Good, better, bad and worse refer to the comparisons between treatments in terms of their clinical risks and benefits.

††† Good reference standards are independent of the test, and applied blindly or objectively to all patients. Poor reference standards are haphazardly applied, but still independent of the test. Use of a non-independent reference standard (where the 'test' is included in the 'reference', or where the 'testing' affects the 'reference') implies a level 4 study.

†††† Better-value treatments are clearly as good but cheaper, or better at the same or reduced cost. Worse-value treatments are as good and more expensive, or worse and equally or more expensive.

** Validating studies test the quality of a specific diagnostic test, based on prior evidence. An exploratory study collects information and trawls the data (e.g., using a regression analysis) to find which factors are 'significant'.

*** By poor quality prognostic cohort study we mean one in which sampling was biased in favour of patients who already had the target outcome, or the measurement of outcomes was accomplished in <80% of study patients, or outcomes were determined in an unblinded, non-objective way, or there was no correction for confounding factors.

**** Good follow-up in a differential diagnosis study is >80%, with adequate time for alternative diagnoses to emerge (e.g., 1-6 months acute, 1-5 years chronic).

Grades of Recommendation

  • A: consistent level 1 studies
  • B: consistent level 2 or 3 studies or extrapolations from level 1 studies
  • C: level 4 studies or extrapolations from level 2 or 3 studies
  • D: level 5 evidence or troublingly inconsistent or inconclusive studies of any level

Source: Center for Evidence-Based Medicine 2003.
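The grades of recommendation above can be read as a small decision procedure. The following Python sketch is one possible reading and is not part of the Oxford scheme itself: the function name `grade_recommendation` and the flags `consistent`, `extrapolated`, and `inconclusive` are assumptions introduced for illustration. The source does not say how to grade extrapolations from level 4 studies, so the sketch leaves them at Grade C.

```python
def grade_recommendation(level, consistent=True, extrapolated=False, inconclusive=False):
    """Map an evidence level (e.g., "1a", "2b", "4", "5") to a grade A-D.

    A trailing "-" on the level marks evidence that fails to provide a
    conclusive answer, as described in the notes to Box 33.
    """
    # Inconsistent or inconclusive studies of any level yield Grade D.
    if level.endswith("-") or inconclusive or not consistent:
        return "D"
    major = int(level[0])  # "1a" -> 1, "2b" -> 2, "5" -> 5
    if major not in (1, 2, 3, 4, 5):
        raise ValueError(f"unknown evidence level: {level!r}")
    if major == 5:
        return "D"
    if major == 4:
        return "C"  # assumption: extrapolations from level 4 are not downgraded here
    if major in (2, 3):
        return "C" if extrapolated else "B"
    return "B" if extrapolated else "A"  # level 1


print(grade_recommendation("1a"))                     # consistent level 1 studies
print(grade_recommendation("1b", extrapolated=True))  # extrapolation from level 1
print(grade_recommendation("2a-"))                    # minus-sign evidence
```

Judging whether studies are "consistent" or a finding is an "extrapolation" remains a matter for the assessors; the function only records the consequence of that judgment.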

Box 34
US Preventive Services Task Force: Grades for Strength of Recommendations

 

  • Grade A: The USPSTF strongly recommends that clinicians routinely provide [the service] to eligible patients. The USPSTF found good evidence that [the service] improves important health outcomes and concludes that benefits substantially outweigh harms.
  • Grade B: The USPSTF recommends that clinicians routinely provide [the service] to eligible patients. The USPSTF found at least fair evidence that [the service] improves important health outcomes and concludes that benefits outweigh harms.
  • Grade C: The USPSTF makes no recommendation for or against routine provision of [the service]. The USPSTF found at least fair evidence that [the service] can improve health outcomes but concludes that the balance of benefits and harms is too close to justify a general recommendation.
  • Grade D: The USPSTF recommends against routinely providing [the service] to asymptomatic patients. The USPSTF found at least fair evidence that [the service] is ineffective or that harms outweigh benefits.
  • Grade I: The USPSTF concludes that the evidence is insufficient to recommend for or against routinely providing [the service]. Evidence that [the service] is effective is lacking, of poor quality, or conflicting, and the balance of benefits and harms cannot be determined.

Source: U.S. Preventive Services Task Force 2002.

Assessment organizations and others that review evidence are increasingly providing guidance to technology sponsors and other stakeholders for preparing dossiers and other submissions of clinical and economic evidence. For example, the UK National Institute for Clinical Excellence (NICE) provides guidance to technology manufacturers and sponsors for preparing submissions of evidence to inform NICE technology appraisals (National Institute for Clinical Excellence 2001). The Academy of Managed Care Pharmacy (AMCP) provides a recommended format for submission of clinical and economic data in support of formulary consideration by pharmacy and therapeutics committees of health plans in the US (Academy of Managed Care Pharmacy 2002).  

VI. DETERMINING TOPICS FOR HTA

Organizations that conduct or sponsor HTAs have only limited resources for this activity. With the great variety of potential assessment topics, HTA organizations need some practical means of determining what to assess. This section considers how assessment programs identify candidate assessment topics and set priorities among these.

Identify Candidate Topics

To a large extent, assessment topics are determined, or bounded, by the mission or purpose of an organization. For example, the US FDA [http://www.fda.gov/] is systematically required to assess all new drugs and to assess health devices according to specific provisions made for particular classes of devices. For a new drug, a company normally files an Investigational New Drug Application (IND) with the FDA for permission to begin testing the drug in people; later, following successful completion of necessary clinical trials, the company files a New Drug Application (NDA) to seek FDA approval to market the drug. For certain medical devices (i.e., new "Class III" devices that sustain or support life, are implanted in the body, or present a potential risk of illness or injury), the Investigational Device Exemption (IDE) and Premarket Approval (PMA) Application are analogous to the IND and NDA, respectively. The FDA is notified about many other devices when a company files a "510(k)" application seeking market approval based on a device's "substantial equivalence" to another device that has already received FDA marketing approval.

Third-party payers generally assess technologies on a reactive basis; a new medical or surgical procedure that is not recognized by payers as being standard or established may become a candidate for assessment. For the US Centers for Medicare and Medicaid Services (CMS), assessment topics arise in the form of requests for national coverage policy determinations that cannot be resolved at the local level or that are recognized to be of national interest. These requests typically originate with Medicare contractors that administer the program in their respective regions, Medicare beneficiaries (patients), physicians, health product companies, health professional associations, and government entities. CMS may request assistance in the form of evidence reports or other assessments by a sister agency, AHRQ.  

For the Evidence-based Practice Centers program, also administered by AHRQ, the agency solicits topic nominations for evidence reports and technology assessments in a public notice in the US Federal Register. Topics have been nominated by a variety of other government agencies, payers, health systems and networks, health professions associations, employer and consumer groups, disease-based organizations, and others. In selecting topics, AHRQ considers not only the information about the topic itself, but the plans of the nominating organization to make use of the findings of the assessment. Information required in these nominations is shown in Box 35.

The American College of Physicians (ACP) Clinical Efficacy Assessment Program (CEAP), which develops clinical practice guidelines, determines its guideline topics based upon evidence reports developed by the AHRQ Evidence-based Practice Centers (EPC) program. (Topics of the EPC program are nominated by outside groups, including ACP.) The topics undertaken by ECRI's technology assessment service are identified by request of the service's subscribers, including payers, providers, and others. For the Cochrane Collaboration, potential topics generally arise from members of the review groups, who are encouraged to investigate topics of interest to them, subject to the agreement of their review groups (Clarke 2003).

Box 35
Evidence-based Practice Centers Topic Nominations

Topic nominations for the AHRQ EPC program should include:

  • Defined condition and target population
  • Three to five very focused questions to be answered
  • Incidence or prevalence, and indication of disease burden (e.g., mortality, morbidity, functional impairment) in the US general population or in subpopulations (e.g., Medicare and Medicaid populations)
  • Costs associated with the conditions, including average reimbursed amounts for diagnostic and therapeutic interventions
  • Impact potential of the evidence report or technology assessment to decrease health care costs or to improve health status or clinical outcomes
  • Availability of scientific data and bibliographies of studies on the topic
  • References to significant differences in practice patterns and/or results; alternative therapies or controversies
  • Plans of the nominating organization to incorporate the report into its managerial or policy decision making (e.g., practice guidelines, coverage policies)
  • Plans of the nominating organization for dissemination of these derivative products to its membership
  • Process by which the nominating organization will measure members' use of the derivative products
  • Process by which the nominating organization will measure the impact of such use on clinical practice

Source: Agency for Healthcare Research and Quality 2003.

Horizon Scanning

The demand for scanning of multiple types of sources for information about new health care interventions has prompted the development of "early warning" or "horizon scanning" functions in the US, Europe, and elsewhere (Douw 2003). Horizon scanning functions are intended to serve multiple purposes, including to:

  • Identify potential topics for HTA and information for setting priorities among these
  • Clarify expectations for the uses or indications of a technology
  • Increase public awareness about new technologies
  • Estimate the expected health and economic impacts
  • Identify critical thresholds of effectiveness improvements in relation to additional costs, e.g., to demonstrate the cost-effectiveness of a new intervention
  • Anticipate potential social, ethical, or legal implications of a technology (Harper 1998; Stevens 1998; Carlsson 1998).

 

Among the organizations with horizon scanning functions are:  

For example, CETAP draws its information from the Internet, published literature, CCOHTA committee members, and other experts. The products of CETAP include short Alerts that address very early technologies, and as more evidence becomes available, CCOHTA publishes more in-depth, peer-reviewed Issues in Emerging Health Technologies bulletins. The purposes of EuroScan (European Information Network on New and Changing Health Technologies), a collaborative network of more than a dozen HTA agencies, are to: evaluate and exchange information on new and changing technologies, develop information sources, develop applied methods for early assessment, and disseminate information on early identification and assessment activities.  

As shown in Box 36, a considerable variety of online databases, newsletters, and other sources provide streams of information pertaining to new and emerging health care interventions. Certainly, an important set of sources for identifying new topics are bibliographic databases such as MEDLINE (accessible, e.g., via PubMed) and EMBASE. The Cochrane Collaboration protocols are publicly available, detailed descriptions of systematic reviews currently underway by Cochrane, which include detailed descriptions of the rationale for the review, information sources, and search strategies.

Although the major thrust of horizon scanning has been to identify "rising" technologies that eventually may merit assessment, horizon scanning may turn to the other direction to identify "setting" technologies that may be outmoded or superseded by newer ones. In either case, horizon scanning provides an important input into setting assessment priorities.  

Setting Assessment Priorities

Some assessment programs have explicit procedures for setting priorities; others set priorities only in an informal or vague way. Given very limited resources for assessment and increasing accountability of assessment programs to their parent organizations and others who use or are affected by their assessments, it is important to articulate how assessment topics are chosen.

Box 36
Information Sources for New and Emerging Health Care Interventions
  • Trade journals (e.g., F-D-C Reports: The Pink Sheet, NDA Pipeline, The Gray Sheet; In Vivo; Adis International; Biomedical Instrumentation and Technology; R&Directions)
  • General news (PR Newswire, Reuters Health, New York Times)
  • Health professions and industry newsletters (e.g., Medscape, Medicine & Health, American Health Line, CCH Health & Medicine)
  • Conferences (and proceedings) of medical specialty societies and health industry groups
  • General medical journals and specialty medical journals
  • Technology company web sites
  • Publicly available market research reports (IHS Health Group)
  • FDA announcements of market approvals of new pharmaceuticals (e.g., NDAs, NDA supplements), biotechnologies (e.g., BLAs), and devices (e.g., PMAs, PMA supplements, and 510[k]s)*
  • Adverse event/alert announcements (from FDA, USP, NIH Clinical Alerts and Advisories, etc.)
  • New Medicines in Development (disease- and population-specific series from PhRMA, including clinical trial status)
  • Databases of ongoing research, e.g., Clinicaltrials.gov and HSRProj (Health Services Research Projects in Progress) from NLM
  • Reports and other sources of information on significant variations in practice, utilization, or payment policies (e.g., The Dartmouth Atlas, LMRP.NET)
  • Special reports on health care trends and futures (e.g., Health and Health Care 2010 (Institute for the Future 2000); Health Technology Forecast (ECRI 2002))
  • Priority lists and forthcoming assessments from public and non-profit evaluation/assessment organizations (e.g., INAHTA member organizations)
  • Cochrane Collaboration protocols

*NDA: New Drug Application approvals; BLA: Biologics License Application approvals; PMA: Premarket Approval Application approvals; 510(k): substantially equivalent device application approvals.

Most assessment programs have criteria for topic selection, although these criteria are not always explicit. Is it most important to focus on costly health problems and technologies? What about health problems that affect large numbers of people, or health problems that are life-threatening? What about technologies that cause great public controversy? Should an assessment be undertaken if it is unlikely that its findings will change current practice? Examples of selection criteria that are used in setting assessment priorities are:  

  • High individual burden of morbidity, mortality, or disability
  • High population burden of morbidity, mortality, or disability
  • High unit cost of a technology or health problem
  • High aggregate cost of a technology or health problem
  • Substantial variations in practice
  • Available findings not well disseminated or adopted by practitioners
  • Need to make regulatory decision
  • Need to make a health program implementation decision (e.g., for initiating a major immunization program)
  • Need to make payment decision (e.g., provide coverage or include in health benefits)
  • Scientific controversy or great interest among health professionals
  • Public or political demand
  • Sufficient research findings available upon which to base assessment
  • Timing of assessment relative to available evidence (e.g., recent or anticipated pivotal scientific findings)
  • Potential for the findings of an assessment to be adopted in practice
  • Potential for change in practice to affect patient outcomes or costs
  • Feasibility given resource constraints (funding, time, etc.) of the assessment program

The timing for undertaking an assessment may be sensitive to the availability of evidence. For example, the results of a recently completed RCT or meta-analysis may challenge standard practice, and prompt an HTA to consolidate these results with other available evidence for informing clinical or payment decisions. Or, an assessment may be delayed pending the results of an ongoing study that has the potential to shift the weight of the body of evidence on that topic.

A systematic priority-setting process could include the following steps (Donaldson and Sox 1992; Lara and Goodman 1990).

  1. Select criteria to be used in priority setting.
  2. Assign relative weights to the criteria.
  3. Identify candidate topics for assessment (e.g., as described above).
  4. If the list of candidate topics is large, reduce it by eliminating those topics that would clearly not rank highly according to the priority setting criteria.
  5. Obtain data for rating the topics according to the criteria.
  6. For each topic, assign a score for each criterion.
  7. Calculate a priority score for each topic.
  8. Rank the topics according to their priority scores.
  9. Review the priority topics to ensure that assessment of these would be consistent with the organizational purpose.
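Steps 6 through 8 above amount to scoring and sorting. The Python sketch below illustrates them under stated assumptions: the criterion names, weights, scores, and topic labels are hypothetical, and the plain weighted sum in `weighted_sum` is only one of many possible scoring rules, not one prescribed by the cited authors. Step 9 (checking the ranked topics against the organizational purpose) remains a matter of judgment and is not automated here.

```python
def weighted_sum(weights, scores):
    """One possible scoring rule: sum of weight x score over all criteria."""
    return sum(w * scores[criterion] for criterion, w in weights.items())


def rank_topics(topics, weights, score_fn=weighted_sum):
    """Score each candidate topic (steps 6-7) and rank by priority score (step 8)."""
    scored = []
    for name, scores in topics.items():
        missing = set(weights) - set(scores)
        if missing:
            raise ValueError(f"{name} lacks scores for: {sorted(missing)}")
        scored.append((name, score_fn(weights, scores)))
    # Highest priority score first.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical weights (step 2) and 1-5 criterion scores (step 6).
weights = {"burden": 3, "cost": 2, "practice_variation": 1}
topics = {
    "Topic A": {"burden": 5, "cost": 2, "practice_variation": 3},
    "Topic B": {"burden": 2, "cost": 5, "practice_variation": 1},
    "Topic C": {"burden": 1, "cost": 1, "practice_variation": 1},
}

ranking = rank_topics(topics, weights)
for name, score in ranking:
    print(name, score)
```

Passing the scoring rule as a parameter keeps the ranking step independent of whichever formula a program adopts.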

Processes for ranking assessment priorities range from being highly subjective (e.g., informal opinion of a small group of experts) to quantitative (e.g., using a mathematical formula) (Donaldson 1992; Eddy 1989; Phelps 1992). Box 37 shows a quantitative model for priority setting. The Cochrane Collaboration uses a more decentralized approach. Starting with topics suggested by their review group members, many Cochrane Collaboration review groups set priorities by considering burden of disease and other criteria, as well as input from discussions with key stakeholders and suggestions from consumers. These priorities are then offered to potential reviewers who might be interested in preparing and maintaining relevant reviews in these areas (Clarke 2003). 

Of course, there is no single correct way to set priorities. The great diversity of potential assessment topics, the urgency of some policymaking needs, and other factors may diminish the practical benefits of using highly systematic and quantitative approaches. On the other hand, ad hoc, inconsistent, or nontransparent processes are subject to challenges and skepticism of policymakers and other observers who are affected by HTA findings. Certainly, there is a gap between theory and application of priority setting. Many of the priority setting models are designed to support resource allocation that maximizes health gains, i.e., identify health interventions which, if properly assessed and appropriately used, could result in substantial health improvements at reasonable costs. However, some potential weaknesses of these approaches are that they tend to set priorities among interventions rather than the assessments that should be conducted, they do not address priority setting in the context of a research portfolio, and they do not adopt an incremental perspective (i.e., consideration of the net difference that conducting an assessment might accomplish) (Sassi 2003).

Reviewing the process by which an assessment program sets its priorities, including the implicit and explicit criteria it uses in determining whether or not to undertake an assessment, can help to ensure that the HTA program is fulfilling its purposes effectively and efficiently.


Specify the Assessment Problem

One of the most important aspects of an HTA is to specify clearly the problem(s) or question(s) to be addressed; this will affect all subsequent aspects of the assessment. An assessment group should have an explicit understanding of the purpose of the assessment and who the intended users of the assessment are to be. This understanding might not be established at the outset of the assessment; it may take more probing, discussion and clarification.

Box 37
A Quantitative Model for Priority Setting

A 1992 report by the Institute of Medicine provided recommendations for priority setting to the Agency for Health Care Policy and Research (now AHRQ). Seven criteria were identified:

  • Prevalence of a health condition
  • Burden of illness
  • Cost
  • Variation in rates of use
  • Potential of results to change health outcomes
  • Potential of results to change costs
  • Potential of results to inform ethical, legal, or social issues

The report offered the following formula for calculating a priority score for each candidate topic.

$$\text{Priority Score} = W_1 \ln S_1 + W_2 \ln S_2 + \cdots + W_7 \ln S_7$$

where:

W_i is the relative weight of the i-th of the seven priority-setting criteria

S_i is the score of a given candidate topic for the i-th criterion

ln is the natural logarithm, applied to each criterion score.

Candidate topics would then be ranked according to their priority score.

Source: Donaldson 1992.
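As a rough illustration of the Box 37 formula, the Python sketch below computes a priority score as the weighted sum of the natural logarithms of the seven criterion scores. The criterion keys, weights, and scores are hypothetical placeholders; the scoring scales actually recommended in the 1992 report are not reproduced here.

```python
import math

# Hypothetical keys standing in for the seven criteria listed in Box 37.
CRITERIA = [
    "prevalence",
    "burden_of_illness",
    "cost",
    "variation_in_use",
    "change_health_outcomes",
    "change_costs",
    "inform_ethical_legal_social",
]


def priority_score(weights, scores):
    """Priority score = sum over the seven criteria of W_i * ln(S_i)."""
    total = 0.0
    for criterion in CRITERIA:
        s = scores[criterion]
        if s <= 0:
            raise ValueError(f"score for {criterion} must be positive to take ln")
        total += weights[criterion] * math.log(s)
    return total


# Hypothetical example: equal weights except for cost, all scores equal to 2.
example_weights = {c: 1.0 for c in CRITERIA}
example_weights["cost"] = 3.0
example_scores = {c: 2.0 for c in CRITERIA}

base = priority_score(example_weights, example_scores)
doubled = priority_score(example_weights, {**example_scores, "cost": 4.0})
print(base, doubled, doubled - base)
```

Because the scores enter through a logarithm, doubling any one criterion score adds the same amount (its weight times ln 2) to the priority score, whatever the starting value. A score of 1 on a criterion contributes nothing.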

The intended users or target groups of an assessment should affect its content, presentation, and dissemination of results. Clinicians, patients, politicians, researchers, hospital managers, company executives, and others have different interests and levels of expertise. They tend to have different concerns about the effects or impacts of health technologies (health outcomes, costs, social and political effects, etc.). They also have different needs regarding the scientific or technical level of reports, the presentation of evidence and findings, and the format (e.g., length and appearance) of reports.  

When the assessment problem and intended users have been specified, they should be reviewed by the requesting agency or sponsors of the HTA. The review of the problem by the assessment program may have clarified or focused the problem in a way that differs from the original request. This clarification may prompt a reconsideration or restatement of the problem before the assessment proceeds.

Problem Elements

There is no single correct way to state an assessment problem. In general, an assessment problem could entail specifying at least the following elements: health care problem(s); patient population(s); technology(ies); practitioners or users; setting(s) of care; and properties (or impacts or health outcomes) to be assessed.

For example, a basic specification of one assessment problem would be:

  • Health care problem: management of moderate hypertension
  • Patient population: males and females, age >60 years, diastolic blood pressure 90-114 mm Hg, systolic blood pressure <240 mm Hg, no other serious health problems
  • Technologies:  specific types/classes of pharmacologic and nonpharmacologic treatments
  • Practitioners:  primary care providers
  • Setting of care: outpatient care, self care
  • Properties, impacts, or outcomes: safety (including side-effects), efficacy, effectiveness and cost-effectiveness (especially cost-utility)
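
The problem elements listed above map naturally onto a simple record. The Python sketch below is only an illustration: the class name `AssessmentProblem`, its field names, and the helper `in_population` are assumptions made here, while the values are taken from the hypertension example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AssessmentProblem:
    """One record per assessment problem, with one field per problem element."""
    health_care_problem: str
    patient_population: str
    technologies: List[str]
    practitioners: List[str]
    settings_of_care: List[str]
    properties: List[str]


hypertension = AssessmentProblem(
    health_care_problem="management of moderate hypertension",
    patient_population=(
        "males and females, age >60 years, diastolic blood pressure "
        "90-114 mm Hg, systolic blood pressure <240 mm Hg, "
        "no other serious health problems"
    ),
    technologies=["pharmacologic treatments", "nonpharmacologic treatments"],
    practitioners=["primary care providers"],
    settings_of_care=["outpatient care", "self care"],
    properties=["safety", "efficacy", "effectiveness", "cost-effectiveness"],
)


def in_population(age, diastolic, systolic, other_serious_problems=False):
    """Check the example's population criteria (blood pressures in mm Hg)."""
    return (
        age > 60
        and 90 <= diastolic <= 114
        and systolic < 240
        and not other_serious_problems
    )


print(in_population(age=67, diastolic=98, systolic=170))
```

Writing the population criteria as an explicit check makes boundary cases visible, e.g., whether a patient aged exactly 60 falls inside the stated population (here, no).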

Causal Pathways

A useful means of presenting an assessment problem is a "causal pathway," sometimes known as an "analytical framework." Causal pathways depict direct and indirect linkages between interventions and outcomes. Although often used to present clinical problems, they can be used as well for organizational, financing, and other types of interventions or programs in health care.  

Causal pathways provide clarity and explicitness in defining the questions to be addressed in an HTA, and draw attention to pivotal linkages for which evidence may be lacking. They can be useful working tools to formulate or narrow the focus of an assessment problem. For a clinical problem, a causal pathway typically includes a patient population, one or more alternative interventions, intermediate outcomes (e.g., biological markers), health outcomes, and other elements as appropriate. In instances where a topic concerns a single intervention for narrowly defined indications and outcomes, these pathways can be relatively straightforward. However, given the considerable breadth and complexity of some HTA topics, which may cover multiple interventions for a broadly defined health problem (e.g., screening, diagnosis, and treatment of osteoporosis in various population groups), causal pathways can become detailed. While the development of a perfectly representative causal pathway is not the objective of an HTA, these can be specified to a level of detail that is sufficient for the sponsor of an HTA and the group that will conduct the HTA to concur on the assessment problem. In short, it helps to draw a picture.

An example of a general causal pathway for a screening procedure with alternative treatments is shown in Box 23. As suggested in this example, the evidence that is assembled and interpreted for an HTA may be organized according to an indirect relationship (e.g., between a screening test and an ultimate health outcome) as well as various intervening direct causal relationships (e.g., between a treatment indicated by the screening test and a biological marker, such as blood pressure or cholesterol level).
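
A causal pathway can also be written down as a small directed graph, which makes it easy to list the direct linkages along each route from intervention to health outcome and to flag those lacking evidence. The Python sketch below is illustrative only: the node names are a simplified, hypothetical pathway rather than a transcription of Box 23, and the `has_evidence` set is invented for the example.

```python
# Hypothetical pathway: each node maps to the nodes it links to directly.
pathway = {
    "screening test": ["early detection"],
    "early detection": ["treatment A", "treatment B"],
    "treatment A": ["intermediate outcome"],
    "treatment B": ["intermediate outcome"],
    "intermediate outcome": ["health outcome"],
    "health outcome": [],
}

# Hypothetical: direct linkages for which assessors have found evidence.
has_evidence = {
    ("screening test", "early detection"),
    ("early detection", "treatment A"),
    ("treatment A", "intermediate outcome"),
    ("intermediate outcome", "health outcome"),
}


def all_paths(graph, start, end, path=None):
    """Enumerate every route from start to end by depth-first search."""
    path = (path or []) + [start]
    if start == end:
        return [path]
    found = []
    for nxt in graph.get(start, []):
        if nxt not in path:  # guard against cycles
            found.extend(all_paths(graph, nxt, end, path))
    return found


def missing_links(path, evidence):
    """Return the direct linkages along a path that lack evidence."""
    return [(a, b) for a, b in zip(path, path[1:]) if (a, b) not in evidence]


for p in all_paths(pathway, "screening test", "health outcome"):
    print(" -> ".join(p), "| missing:", missing_links(p, has_evidence))
```

In this example the indirect relationship between the screening test and the health outcome is fully supported through treatment A, while the route through treatment B has two unsupported linkages.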

Reassessment and the Moving Target Problem

Health technologies are "moving targets" for assessment (Goodman 1996). As a technology matures, changes occur in the technology itself or other factors that can diminish the currency of an HTA report and its utility for health care policies. As such, HTA can be more of an iterative process than a one-time analysis. Some of the factors that would trigger a reassessment might include changes in the:

  • Evidence pertaining to the safety, effectiveness, and other outcomes or impacts of using the technology (e.g., publication of significant new results of a major clinical trial or a new meta-analysis)
  • Technology itself (modified techniques, models, formulations, delivery modes, etc.)
  • Indications for use (different health problems, degree of severity, etc.)
  • Populations in which it is used (different age groups, comorbidities, etc.)
  • Protocols or care pathways of which the technology is a part that may alter the role or utility of the technology
  • Care setting in which the technology is applied (inpatient, outpatient, physician office, home, long-term care)
  • Provider of the technology (type of clinician, other caregiver, patient, etc.)
  • Practice patterns (e.g., large practice variations)
  • Alternative technology or standard of care to which the technology is compared
  • Outcomes or impacts considered to be important (e.g., types of costs or quality of life)
  • Resources available for health care or the use of a particular technology (i.e., raising or lowering the threshold for decisions to use the technology)
  • Adoption or use of guidelines, payment policies, or other decisions that are based on the HTA report
  • Interpretation of existing research findings (e.g., based on corrections or re-analyses).

There are numerous instances of moving targets that have prompted reassessments. For example, since the inception of percutaneous transluminal coronary angioplasty (PTCA, approved by the US FDA in 1980), its clinical role vis-à-vis coronary artery bypass graft surgery (CABG) has changed as the techniques and instrumentation for both technologies have evolved, their indications have expanded, and as competing, complementary, and derivative technologies have emerged (e.g., laser angioplasty, coronary artery stents, minimally-invasive and "beating-heart" CABG). The emergence of viable pharmacological therapy for osteoporosis (e.g., with bisphosphonates and selective estrogen receptor modulators) has increased the clinical utility of bone densitometry. Long rejected for its devastating teratogenic effects, thalidomide has reemerged for carefully managed use in a variety of approved and investigational uses in leprosy and other skin diseases, certain cancers, chronic graft-vs.-host disease, and other conditions (Combe 2001; Richardson 2002).  

While HTA programs cannot avoid the moving target problem, they can manage and be responsive to it. Box 38 lists approaches for managing the moving target problem.  

Box 38
Managing the Moving Target Problem
  • Recognize that HTA must have the capacity to revisit topics as needed, whether periodically or as prompted by important changes that have transpired since preparation of the original HTA report.
  • Document in HTA reports the information sources, assumptions, and processes used. This information baseline will better enable HTA programs and other interested groups to recognize when it is time for reassessment.
  • In the manner of a sensitivity analysis, indicate in HTA reports what magnitudes of change in key variables (e.g., accuracy of a diagnostic test, effectiveness of a drug, patient compliance, costs) would result in a significant change in the report findings.
  • Note in HTA reports any known ongoing research, work on next-generation technologies, population trends, and other developments that might prompt the need for reassessment.
  • Have or subscribe to a scanning or monitoring function to help detect significant changes in technologies and other developments that might trigger a reassessment.
  • Recognize that, as the number of technology decision makers increases and evidence-based methods diffuse, multiple assessments are generated at different times from different perspectives. This may diminish the need for clinicians, payers, and other decision makers to rely on a single, definitive assessment on a particular topic.
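
The sensitivity-analysis item in Box 38 can be sketched in code. The following is a minimal illustration, not a method prescribed by this text. It assumes a hypothetical decision rule (the finding is favorable when incremental cost per unit of effect is at or below a threshold) and hypothetical input values, and it reports how far one key variable, effectiveness, would have to fall before the finding reverses.

```python
def cost_per_unit_effect(incremental_cost, incremental_effect):
    """Incremental cost divided by incremental effect (hypothetical decision metric)."""
    if incremental_effect <= 0:
        return float("inf")
    return incremental_cost / incremental_effect


def finding_is_favorable(incremental_cost, incremental_effect, threshold):
    """True when the cost per unit of effect does not exceed the threshold."""
    return cost_per_unit_effect(incremental_cost, incremental_effect) <= threshold


def break_even_effect(incremental_cost, threshold):
    """Smallest incremental effect at which the finding is still favorable."""
    return incremental_cost / threshold


# Hypothetical inputs for illustration only; not taken from any HTA report.
base_cost = 10_000.0   # incremental cost per patient
base_effect = 0.5      # incremental effect per patient (e.g., in QALYs)
threshold = 50_000.0   # maximum acceptable cost per unit of effect

critical_effect = break_even_effect(base_cost, threshold)
relative_drop = (base_effect - critical_effect) / base_effect

print("Base-case finding favorable:",
      finding_is_favorable(base_cost, base_effect, threshold))
print(f"Finding reverses if effect falls below {critical_effect:.2f} "
      f"({relative_drop:.0%} below the base case)")
```

Reporting the break-even value alongside the base case gives readers a concrete trigger: if later evidence puts the variable beyond that value, the report is a candidate for reassessment.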

Aside from changes in technologies and their applications, even new interpretations of, or corrections in, existing evidence can prompt a new assessment. This was highlighted by a 2001 report of a Cochrane Center that prompted the widespread re-examination of screening mammography guidelines by government and clinical groups. The report challenged the validity of evidence indicating that screening for breast cancer reduces mortality, and suggested that breast cancer mortality is a misleading outcome measure (Olsen 2001).

Some research has been conducted on the need to reassess a particular application of HTA findings, i.e., clinical practice guidelines. For example, in a study of the validity of 17 guidelines developed in the 1990s by AHCPR (now AHRQ), investigators developed criteria defining when a guideline needs to be updated, surveyed members of the panels that prepared the respective guidelines, and searched the literature for relevant new evidence published since the appearance of the guidelines. Using a "survival analysis," the investigators determined that about half of the guidelines were outdated in 5.8 years, and that at least 10% of the guidelines were no longer valid by 3.6 years. They recommended that, as a general rule, guidelines should be reexamined for validity every three years (Shekelle, Ortiz 2001). Others counter that the factors that might prompt a reassessment do not arise predictably or at regular intervals (Browman 2001). Some investigators have proposed models for determining whether a guideline or other evidence-based report should be reassessed (Shekelle, Eccles 2001).
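
To make the survival-analysis idea concrete, here is a minimal Kaplan-Meier sketch. The data are invented for illustration and are not the 17 AHCPR guidelines from the study. Each hypothetical guideline has a follow-up time in years and a flag for whether it was judged outdated at that time; guidelines still valid at last review are treated as censored. The function and variable names are assumptions of this sketch, and the investigators' actual method may have differed.

```python
def kaplan_meier(times, events):
    """Return [(time, estimated_valid_fraction)] at each time with an event.

    times  -- follow-up time for each guideline (years)
    events -- True if the guideline was judged outdated at that time,
              False if it was still valid at last review (censored)
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        n_outdated = sum(1 for ti, e in zip(times, events) if ti == t and e)
        n_leaving = sum(1 for ti in times if ti == t)
        if n_outdated:
            survival *= 1 - n_outdated / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving  # events and censored cases both leave the risk set
    return curve


def time_when_fraction_reaches(curve, level):
    """First time at which the estimated valid fraction is at or below `level`."""
    for t, s in curve:
        if s <= level:
            return t
    return None  # level not reached within follow-up


# Hypothetical follow-up data for eight guidelines (illustration only).
years = [2.0, 3.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
judged_outdated = [True, True, False, True, True, False, True, False]

curve = kaplan_meier(years, judged_outdated)
half_outdated_by = time_when_fraction_reaches(curve, 0.5)
print(curve)
print("Estimated time by which half are outdated:", half_outdated_by)
```

A statement such as "about half of the guidelines were outdated in 5.8 years" corresponds to the time at which a curve of this kind crosses 0.5, and "at least 10% no longer valid" to the time at which it crosses 0.9.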

Changes in the volume or nature of publications may trigger the need for an initial assessment or reassessment. A "spike" (sharp increase) in publications on a topic, such as in the number of research reports or commentaries, may signal trends that merit attention for assessment. However, further bibliometric research is needed to determine whether such publication events are reliable indicators of technology emergence or of moving targets requiring assessment, i.e., whether they have been correlated with the actual emergence of new technologies or with substantial changes in them or their use (Mowatt 1997).
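
One simple way to operationalize a publication "spike" is to compare each year's count with the average of the preceding years. The sketch below is only an illustration under assumed parameters (a three-year trailing window and a doubling rule). The yearly counts are invented, consecutive years are assumed, and, as the paragraph above notes, whether such signals reliably indicate technology emergence is itself an open question.

```python
def find_spikes(counts_by_year, window=3, ratio=2.0):
    """Return years whose count is at least `ratio` times the mean of the
    preceding `window` years. Both parameters are arbitrary assumptions."""
    years = sorted(counts_by_year)
    spikes = []
    for i in range(window, len(years)):
        previous = [counts_by_year[y] for y in years[i - window:i]]
        baseline = sum(previous) / window
        if baseline > 0 and counts_by_year[years[i]] >= ratio * baseline:
            spikes.append(years[i])
    return spikes


# Hypothetical annual publication counts on one topic (illustration only).
counts = {1995: 10, 1996: 12, 1997: 11, 1998: 13, 1999: 30, 2000: 34}
print(find_spikes(counts))
```

A flagged year would be a prompt for a person to look at the topic, not a decision to assess it.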

Not all changes require conducting a reassessment, nor must a reassessment entail a full HTA. A reassessment may require updating only certain aspects of an original report. In some instances, current clinical practices or policies may be recognized as being optimal relative to available evidence, so that a new assessment would have little potential for impact; or the set of clinical alternatives and questions may have evolved so much since the original assessment that it would not be relevant to update it.

In some instances, an HTA program may recognize that it should withdraw an existing assessment because to maintain it could be misleading to users and perhaps even have adverse health consequences. This may arise, for example, when an important flaw is identified in a pivotal study in the evidence base underlying the assessment, when new research findings appear to refute or contradict the original research base, or when the assumptions used in the assessment are determined to be flawed. The determination to maintain or withdraw the existing assessment while a reassessment is conducted, to withdraw the existing assessment and not conduct a reassessment, or to take other actions, depends on the risks and benefits of these alternative actions for patient health, and any relevant legal implications for the assessment program or users of its assessment reports.

Once an HTA program determines that a report topic is a candidate for being updated, the program should determine the need to undertake a reassessment in light of its other priorities. Assessment programs may consider that candidates for reassessment should be entered into the topic priority-setting process, subject to the same or similar criteria for selecting HTA topics.

VII. RETRIEVING EVIDENCE FOR HTA

One of the great challenges in HTA is to assemble the evidence (the data, literature and other information) that is relevant to a particular assessment. For very new technologies, this information may be sparse and difficult to find; for many technologies, it can be profuse, scattered and of widely varying quality. Literature searching and related evidence retrieval are integral to successful HTA, and the time and resources required for these activities should be carefully considered in planning any HTA (Auston 1994; Goodman 1993).

Types of Sources

Available information sources cover different, though often overlapping, sectors of health care information. Although some are devoted to health care topics, others cover the sciences more broadly. Multiple sources should be searched to increase the likelihood of retrieving relevant reports. The types of sources that may be useful for HTA include:

  • Computer databases of published literature
  • Computer databases of clinical and administrative data
  • Printed indexes and directories
  • Government reports and monographs
  • Policy and research institute reports
  • Professional association reports and guidelines
  • Market research reports
  • Company reports and press releases
  • Reference lists in available studies and reviews
  • Special inventories/registers of reports
  • Health newsletters and newspapers
  • Colleagues and investigators

Of course, the Internet is an extraordinarily broad and readily accessible medium that provides access to many of these information sources.  

There are hundreds of publicly available computer databases for health care and biomedical literature. Among these are various general types. For example, bibliographic databases have indexed citations for journal articles and other publications. Factual databases provide information in the form of guidelines for diagnosis and treatment, patient indications and contraindications, and other authoritative information. Referral databases provide information about organizations, services and other information sources.

The National Information Center on Health Services Research & Health Care Technology (NICHSR) [http://www.nlm.nih.gov/nichsr/nichsr.html] of the US National Library of Medicine (NLM) provides an extensive, organized set of the many, evolving databases, publications, outreach and training, and other information resources for HTA. One online source, Etext on Health Technology Assessment (HTA) Information Resources [http://www.nlm.nih.gov/nichsr/ehta/], is a comprehensive textbook on sources of HTA information and searching approaches compiled by information specialists and researchers from around the world (National Library of Medicine 2003). Various other useful compendia of HTA information resources have been prepared (Busse 2002; Glanville 2003; Chan 2003). Some of the main bibliographic and factual databases useful in HTA are listed in Box 39.

The most widely used of these resources for HTA are the large bibliographic databases, particularly MEDLINE, produced by NLM, and EMBASE, produced by Elsevier. MEDLINE can be accessed at the NLM website using PubMed, which also includes new in-process citations (with basic citation information and abstracts before being indexed with MeSH terms and added to MEDLINE), citations from various life science journals, and certain other entries. In addition, there are many specialized or more focused databases in such areas as AIDS, bioethics, cancer treatment, pharmaceutical research and development, ongoing clinical trials (e.g., ClinicalTrials.gov of NLM), and practice guidelines (e.g., National Guideline Clearinghouse of AHRQ).
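
As an illustration of programmatic access to PubMed, the sketch below builds a search request for the `esearch` endpoint of NCBI's E-utilities. It only constructs the URL and does not contact the server. The search term is a hypothetical example, the helper name is an assumption of this sketch, and NCBI's current E-utilities documentation should be consulted for usage policies and parameters before relying on it.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(term, max_results=20):
    """Build (but do not send) an E-utilities esearch request for PubMed."""
    params = {
        "db": "pubmed",         # search the PubMed database
        "term": term,           # query string; may include field tags
        "retmax": max_results,  # maximum number of record IDs to return
        "retmode": "json",      # ask for a JSON response
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Hypothetical query: trials indexed under the MeSH heading "Osteoporosis".
query = "osteoporosis[MeSH Terms] AND randomized controlled trial[Publication Type]"
url = build_pubmed_search_url(query, max_results=5)
print(url)
```

Restricting by MeSH heading and publication type, as in this example, relies on indexing, so it will miss the in-process citations described above that have not yet been indexed with MeSH terms.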

The Cochrane Collaboration [http://www.cochrane.org/] is an international organization that prepares, maintains and disseminates systematic reviews of RCTs (and other evidence when appropriate) of treatments for many clinical conditions. More than 1,500 systematic reviews have been produced by nearly 50 Cochrane review groups in such areas as acute respiratory infections, breast cancer, diabetes, hypertension, infectious diseases, and pregnancy and childbirth. The Cochrane Collaboration produces the Cochrane Library, which includes databases and registers produced by the Cochrane Collaboration as well as some produced by other organizations. The Database of Abstracts of Reviews of Effectiveness (DARE) [http://www.york.ac.uk/inst/crd/crddatabases.htm#DARE] and the NHS Economic Evaluation Database are produced by the NHS Centre for Reviews and Dissemination (NHSCRD). The HTA Database is produced by the International Network of Agencies for Health Technology Assessment (INAHTA) [http://www.inahta.org/], in collaboration with the NHSCRD.

The selection of sources for literature searches should depend on the purpose of the HTA inquiry and pertinent time and resource constraints. Most searches are likely to involve MEDLINE or another large database of biomedical literature (Suarez-Almazor 2000; Topfer 1999). However, the selection of other databases may differ by purpose, e.g., horizon scanning, ascertaining regulatory or payment status of technologies, comprehensive systematic review, or identifying literature in particular clinical areas.

Gray Literature

Much valuable information is available beyond the traditional published sources. This "gray" or "fugitive" literature is found in industry and government monographs, regulatory documents, professional association reports and guidelines, market research reports, policy and research institute studies, spot publications of special panels and commissions, conference proceedings, and other sources. Many of these can be found via the Internet. Although the gray literature can be timely and cover aspects of technologies that are not addressed in mainstream sources, it is usually not subject to peer review, and must be scrutinized accordingly.  

Box 39
Selected Bibliographic and Factual Databases for HTA

Some Core Sources

  • MEDLINE: citations for biomedical journal articles
  • EMBASE: citations for biomedical journal articles (Elsevier)
  • Cochrane Database of Systematic Reviews: systematic reviews of controlled trials on hundreds of clinical topics
  • Cochrane Controlled Trials Register: bibliography of controlled trials including sources outside peer-reviewed journal literature
  • Database of Abstracts of Reviews of Effectiveness (DARE): structured abstracts of systematic reviews from around the world, critically appraised by NHS Centre for Reviews and Dissemination
  • NHS Economic Evaluation Database: abstracts and other information about published economic evaluations of health care interventions
  • Health Technology Assessment Database: records of ongoing projects of members of INAHTA and completed HTAs by INAHTA members and other organizations
  • National Guideline Clearinghouse: evidence-based clinical practice guidelines (AHRQ)

Additional Sources

  • Other NLM/NIH sources:
    • ClinicalTrials.gov: current information about ongoing clinical research studies
    • DIRLINE: directory of organizations
    • HSRProj: ongoing health services research projects
    • HSRR (Health Services/Sciences Research Resources): research datasets and instruments/indices
    • HSTAT: full text of US clinical practice guidelines, consensus development reports, technology assessment reports, etc.
    • PDQ: cancer treatment, supportive care, screening, prevention, clinical trials
    • Other specialized databases such as AIDSLINE, Bioethics, and HealthSTAR have been incorporated into MEDLINE, which can be accessed, e.g., via PubMed
  • ACP Journal Club: selected studies and systematic reviews for immediate attention of clinicians, with "value added" abstracts and commentary
  • AltHealthWatch: information resources on alternative medicine
  • Bandolier: journal of evidence summaries
  • Best Evidence (ACP Journal Club plus Evidence Based Medicine)
  • BIOSIS Previews: citations of life sciences literature (BIOSIS)
  • CEA Registry: database of standardized cost-utility analyses (Harvard School of Public Health)
  • CINAHL: citations for nursing and allied health literature (Cinahl Information Systems)
  • CDC Wonder: gateway to reports and data of the US Centers for Disease Control and Prevention (CDC)
  • Cochrane Methodology Register: bibliography of articles and books on the science of research synthesis
  • Cochrane Database of Methodology Reviews: full text of systematic reviews of empirical methodological studies
  • HDA Evidence Base: summaries of systematic reviews of effectiveness, literature reviews, meta-analyses, expert group reports, and other review-level information (NHS Health Development Agency, UK)
  • MANTIS: bibliographic database on manual, alternative, and natural therapies
  • Netting the Evidence (ScHARR, University of Sheffield, UK)