First published online February 28, 2007
British Journal of Radiology (2007) 80, 152-160
© 2007 British Institute of Radiology
doi: 10.1259/bjr/64096611


Full paper

Analysis of diagnostic confidence and diagnostic accuracy: a unified framework

CS Ng, MRCP, FRCR 1 and CR Palmer, MA, PhD 2

1 Department of Radiology, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA, 2 Centre for Applied Medical Statistics, University of Cambridge, Cambridge CB2 2SR, UK

Correspondence: C S Ng, Department of Radiology, Unit 57, The University of Texas MD Anderson Cancer Center, 1515 Holcombe Boulevard, Houston, TX 77030-4009, USA.


    Abstract
Diagnostic confidence has been used as a measure of diagnostic efficacy, but this measure in isolation fails to take into account incorrect diagnoses. Conventional analytical approaches to diagnostic confidence ignore associated diagnostic accuracy. To address this limitation, we introduce a unifying framework which incorporates diagnostic confidence, changes in diagnoses and ultimate accuracy. The framework is illustrated using data from a study in which 62 patients with acute abdominal pain prospectively underwent CT. Admitting surgeons documented their diagnoses and graded their diagnostic confidences (on a 5-point scale) on admission and again after CT. Our approach, unlike conventional analyses, incorporates knowledge of final diagnoses, obtained from surgery or 6 months' follow-up, in assessing the impact of the test (on a 9-point scale). Changes in pre- and post-CT confidence scores were assessed by the one-sample t-test comparing against zero change, with the test statistic acting as a standardized quantity allowing comparison between our and conventional methodological approaches. Overall, 52% (32/62) of patients were misdiagnosed on admission and 19% (12/62) had incorrect post-CT diagnoses. Diagnostic confidence following CT increased significantly compared with pre-CT confidence on applying both analytical methods, although the level of statistical significance was less marked using our approach. Mean (95% confidence interval) increase in confidence under conventional analysis was 1.32 (1.03, 1.62), with standardized score t = 8.90 [p<0.0001], whereas our method yielded 0.69 (0.25, 1.13), with standardized score t = 3.12 [p = 0.003]. Although both analytical methods led to the same inference regarding the efficacy of CT in the illustrative case study presented, they differed somewhat in degree. It is conceivable that disparate conclusions may emerge in other studies and circumstances.
Failure to take adequate account of incorrect diagnoses is potentially misleading. We suggest that a comprehensive analysis of diagnostic confidence requires the incorporation of diagnostic accuracy.


    Introduction
Diagnostic confidence has a pivotal impact on clinical management. In particular, if confidence in a given diagnosis is high, specific therapeutic measures are frequently instituted on the assumption that the diagnosis is correct. Such confidence, if misplaced, can lead to serious medical errors [1]. On the other hand, if diagnostic confidence is low, specific therapy is frequently withheld and further investigations may be pursued in the hope of increasing confidence in the diagnosis or in identifying an alternative diagnosis in which there may be higher confidence.

The majority of prior studies that have attempted to assess the efficacy of a diagnostic test in terms of its impact on diagnostic confidence have used analyses based simply on changes between pre- and post-test confidences (or probabilities of diagnoses). Such analyses fail to take into account the possibility of incorrect diagnoses and their potential consequences. Confidently wrong diagnoses may be harmful to patients: for instance, patients may undergo unnecessary surgery, or else fail to have necessary surgery, with potentially serious consequences.

We present a framework for the analysis of diagnostic confidence that takes into account ultimate diagnostic accuracy. We should emphasize that we illustrate the methodology using a case study and data from a prospective trial assessing the value of abdomino-pelvic CT in patients admitted to hospital with acute abdominal pain of uncertain aetiology. We compare the new approach with more conventional evaluations of diagnostic confidence, which limit themselves to simple pre- and post-test scores of confidence.


    Methods and materials
Case study: patients
Patients with acute abdominal pain of uncertain aetiology who were admitted to the surgical service, and who were deemed on admission not to require abdomino-pelvic CT on clinical grounds, were eligible for inclusion in this illustrative case study. As part of the assessment of the value of CT, patients were randomized to undergo abdomino-pelvic CT within 24 h of admission or not. The subset of patients who underwent CT within 24 h of admission (n = 62) provided the data for this case study. Further details of the patient inclusion criteria, randomization, CT technique and evaluation have been presented previously [2]. The study was approved by the Institutional Review Board and written informed consent was obtained from all patients.

Diagnostic confidence and accuracy
The admitting surgeons documented their diagnoses and graded their diagnostic confidence at the time of admission (0 h) and again after the CT examination. Grading of diagnostic confidence by the surgeons was on a 5-point scale: 1 (10% confidence, i.e. very unsure), 2 (30%), 3 (50%), 4 (70%) and 5 (90% confidence, i.e. highly confident). These pre- and post-test diagnostic confidences formed the bases for the conventional analysis of efficacy, to which our approach was compared.

The "final" (gold standard) diagnosis was defined as that made at surgery or at 6 month follow-up, whichever was sooner. In this case study, "diagnostic accuracy" was the accuracy of the diagnosis in relation to the final diagnosis. A fundamentally important consideration is that post-test diagnoses may be wrong. Furthermore, such incorrect post-test diagnoses may either under-estimate ("undercall") or overestimate ("overcall") the severity of the actual diagnosis. In this illustrative case study, a more severe diagnosis was taken as one that would typically have needed surgical intervention (rather than medical or conservative therapy), or one that would usually be associated with a poorer prognosis. For example, acute appendicitis in most circumstances would be considered a more severe diagnosis than diverticulitis; in contrast, gastroenteritis would in most circumstances be considered less severe than diverticulitis. Similarly, the test itself may or may not change the pre-test diagnosis; and likewise, a post-test diagnosis may be more severe than the pre-test diagnosis (i.e. the test may "upgrade" the severity of the pre-test diagnosis), or it may be less severe than the pre-test diagnosis (i.e. the test may "downgrade" the severity of the initial diagnosis). Changes between the admitting and post-CT diagnoses, and between the post-CT and final diagnoses were assessed by two observers (a surgeon and a radiologist). If there was a change in diagnosis, a consensus as to whether the change was towards a more or less severe diagnosis was reached; if the diagnosis did not change, no change was recorded.

Scoring of diagnostic confidence in relation to diagnostic accuracy in proposed approach
The surgeons' diagnostic confidence grades were dichotomized such that "high" confidence level consisted of surgical confidence grades 4 or 5 (i.e. >50% confidence); "low" confidence levels consisted of surgical confidence grades 1, 2, or 3 (i.e. ≤50% confidence). The premise was that a high level of confidence in a diagnosis usually implies that definitive management proceeds on that basis, whereas a low level of confidence does not usually bring about specific management.
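The dichotomization described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `confidence_level` and the lookup table `GRADE_TO_PERCENT` are hypothetical names introduced here, not part of the original study.

```python
# Percentage confidence attached to each of the five surgical grades,
# as listed in the text (1 = very unsure, 5 = highly confident).
GRADE_TO_PERCENT = {1: 10, 2: 30, 3: 50, 4: 70, 5: 90}


def confidence_level(grade: int) -> str:
    """Dichotomize a 5-point confidence grade into 'high' or 'low'.

    Grades 4 and 5 (>50% confidence) count as high;
    grades 1, 2 and 3 (<=50% confidence) count as low.
    """
    if grade not in GRADE_TO_PERCENT:
        raise ValueError(f"grade must be 1-5, got {grade!r}")
    return "high" if GRADE_TO_PERCENT[grade] > 50 else "low"


if __name__ == "__main__":
    for g in sorted(GRADE_TO_PERCENT):
        print(g, GRADE_TO_PERCENT[g], confidence_level(g))
```

Keying the threshold to the percentage rather than to the grade number keeps the rule aligned with the stated ">50% confidence" criterion.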

A scoring system was used to quantify the impact of the test (i.e. the CT examination) with particular attention to the patients' perspective, taking into account the changes in diagnostic confidence level (high or low), change in diagnoses (more or less severe, or no change), and ultimate accuracy of the diagnosis. To introduce the scoring system, we illustrate in Figure 1 the nine basic "routes" for those situations in which initial and post-test levels of confidence are both high. These routes consist of combinations of diagnostic changes and ultimate diagnostic accuracy and are labelled A1, A3 etc. These same labels feature in Figure 2, which depicts in flow diagram format the 36 total possible combinations arising from all 4 permutations of low and high initial and post-test levels of confidence as applied to the above 9 basic routes.


Figure 1. Routes schematically showing changes in diagnoses on admission, following CT and at final diagnosis. The upper diagram schematically represents scenarios where the post-CT diagnosis is of high confidence and is correct. The middle diagram schematically represents scenarios where the post-CT diagnosis is of high confidence and is incorrect, and the final diagnosis is more severe than the post-CT diagnosis. The lower diagram schematically represents scenarios where the post-CT diagnosis is of high confidence and is incorrect, and the final diagnosis is less severe than the post-CT diagnosis. Left-hand labels: Route labels. For consistency the same nomenclature is used in Figure 2. x (y, z), Confidence score assigned to route travelled by patient from admission to final diagnosis: x=best estimate (y=pessimistic, z=optimistic) score.

 

Figure 2. Flow chart demonstrating the 36 possible combinations of confidence levels, diagnostic changes and ultimate diagnostic accuracy. Route Label: the italicized labels (such as A1 and A3) are the same route labels as those in Figure 1. *Route score: this score was derived from a weighted average of the best estimate, pessimistic, and optimistic score (see Methods and materials section). For clarity, the underlying component scores have been presented, denoted as x (y, z) = best estimate (pessimistic, optimistic) scores. **Last row of flow chart = number of patients in each route in the case study.

 
In Figure 1, route A3 represents an initial diagnosis that was high in confidence, which was "downgraded" in severity following CT (i.e. whose severity was over-called on admission) with high confidence, followed by no change in diagnosis (indicating that the post-CT diagnosis proved to be correct). Route F5 represents an initial diagnosis that was high in confidence, which was downgraded in severity with high confidence after CT, but that proved to be incorrect with the final diagnosis being more severe. An example of an A3 route might be a patient who had an admitting diagnosis of cholecystitis with high confidence and who was correctly diagnosed post-CT, with high confidence, as having a right basal pneumonia. An example of route F5 might be a patient who had an admitting diagnosis of appendicitis with high confidence, who was then considered, with high confidence after CT, to have non-specific abdominal pain, but who actually did have appendicitis on follow-up. When a downgrade and upgrade in diagnostic severity occur either side of CT (as in routes E5 and F5), the final diagnosis need not coincide with the initial diagnosis.

A score for each route, the "route score", was derived by application of the principles in the following six paragraphs. Each route was assigned a score on a 9-point scale, ranging from –4 to +4, defined as follows. The best possible impact for a test from the patients' perspective (diagnostic score of +4) was one in which the test correctly diagnosed, with high confidence, a condition that was initially confidently considered a less severe diagnosis (i.e. route A1). The least favourable scenario (diagnostic score of –4) was considered to be one in which an intervention incorrectly diagnosed, with high confidence, a condition that was initially confidently and correctly considered to be a more serious condition (i.e. route F5).

There are two special routes in which the test changes neither an admitting diagnosis that ultimately proves to be correct nor the confidence level of that diagnosis (routes A5 and B6). As such, these routes have no significant impact on the patient, being neither beneficial nor detrimental, and have been assigned defining neutral scores of 0.

Scores for the other intermediate routes were derived by considering hypothetical combinations of "severe" diagnoses, for example acute appendicitis or ruptured abdominal aortic aneurysm, and "less severe" diagnoses, for example gastroenteritis or non-specific abdominal pain, together with combinations of high and low diagnostic confidences. Scoring was undertaken prospectively by two observers in consensus without knowledge of the specific diagnoses in each route. In arriving at a score, termed the "best estimate" score, it was helpful to consider first if the route was generally favourable or unfavourable from the patients' point of view, and then to consider its relative benefit, or otherwise, in relation to the fixed points (of –4, 0 and 4) defined above.

To address the potentially broad range of diagnoses and circumstances encapsulated by each of the 36 routes, the best estimate scores were supplemented by a pair of values: a lower score and an upper score to reflect hypothetical "pessimistic" and "optimistic" scenarios, respectively. This was in recognition that routes could attract less or more favourable scores on either side of their best estimate score, depending on the circumstances. Individual disease entities can have a spectrum of potential severity, for example diseases themselves as diverse as gastroenteritis and diverticulitis can individually range in severity from self-limiting to life-threatening. Furthermore, some individual routes can arguably inherently be more or less favourable for the patient depending on the setting. For example, route F1 could generally be considered unfavourable in that the test missed a more serious final diagnosis. However, it could be considered that the test was of some benefit to the patient by guiding clinicians away from complacency towards a more severe diagnosis (hence an optimistic score of +1). Likewise, the two routes with defining scores of 0 were both assigned a pessimistic to optimistic score range of –1 to +1. This is in recognition that it could be considered of some reassurance to the patient that a (ultimately correct) diagnosis was confirmed by the test but, conversely, it could be considered that the test was of no diagnostic value and in fact expended cost and time (and in the context of the case study, also radiation exposure).

For internal consistency, and in recognition that higher levels of confidence tend to influence management decisions more strongly, assigned scores were one less in routes B1–B4 than in routes A1–A4, respectively; one less in routes F1–F6 than in routes D1–D6, respectively; and one less in routes E1–E6 than in routes C1–C6, respectively (see Figure 2). The above sets of groups reflect differences between low vs high post-test (post-CT) confidences. The same principle was applied to scores between low vs high pre-test (admission) confidence routes, for example, routes A1 vs A2, A3 vs A4, B1 vs B2, to F5 vs F6, where differences in best estimate scores were assigned to be one. Routes A5, A6, B5 and B6 were exceptions because of the defining 0 score status of routes A5 and B6. Note that routes A1–A6 and B1–B6 have positive scores in general, reflecting the favourable contribution of the test in achieving a correct diagnosis in these routes; this is except for route B5, in which the test (inappropriately) reduced the confidence in a correct initial diagnosis. Similarly, scores in routes within groups C, D, E and F have mostly negative scores, reflecting the unfavourable impact of the test in yielding incorrect diagnoses in these routes.

The above best estimate, pessimistic and optimistic scores (presented as individual components for clarity in Figure 2) were combined into a single weighted average score, which summarizes and quantifies the change between pre- and post-CT confidences for each route, the "route score", which was used in subsequent statistical analyses. This weighted average score was derived by placing one-sixth weight on each of the pessimistic and optimistic scores, and four-sixths weight on the best estimate score. This particular 1:4:1 weighting is standard within "Project Evaluation and Review Technique" (PERT), a well-established method applied in network analysis when coping with uncertainty [3].
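As a minimal sketch of the 1:4:1 weighting, the route score can be computed as follows. The function name `pert_route_score` is a hypothetical label introduced here. The first example uses the component scores stated in the text for the two neutral routes (best estimate 0, pessimistic –1, optimistic +1); the second uses made-up component values purely to show the effect of an asymmetric range, and does not correspond to any route in Figure 2.

```python
def pert_route_score(pessimistic: float, best_estimate: float,
                     optimistic: float) -> float:
    """Weighted average with weights 1/6, 4/6 and 1/6 (PERT-style 1:4:1)."""
    return (pessimistic + 4.0 * best_estimate + optimistic) / 6.0


if __name__ == "__main__":
    # Neutral routes (A5 and B6 in the text): 0 (-1, +1).
    print(pert_route_score(-1, 0, 1))

    # Hypothetical asymmetric range, for illustration only.
    print(pert_route_score(0, 2, 3))
```

When the pessimistic and optimistic scores are symmetric about the best estimate, the route score equals the best estimate; an asymmetric range pulls the score only modestly, because the best estimate carries four-sixths of the weight.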


    Conventional analysis of diagnostic confidence
Conventional analysis was based on changes between pre- and post-test diagnostic confidences (i.e. surgical grades of confidence), which does not take into account the accuracy of the post-test diagnoses. Comparisons were undertaken both with and without corrections in diagnostic confidence scores as resulting from changes in diagnosis that may occur with the test, as previously described (for convenience referred to as the "Omary correction" [4, 5]). In brief, this correction applies if the pre-test confidence is high (in the cited analyses >50%) and there is a change in diagnosis following the test. Diagnostic improvement is then judged only after inferring a pre-test confidence in the updated diagnosis (see Figure 3).


Figure 3. Conventional analyses of confidence without and with Omary correction. Note: the analyses are independent of the correctness, or otherwise, of the post-test diagnosis.

 
Statistical assessment and comparison of the analytical approaches
The score-based approach employs a direct measure on a [–4, 4] scale derived from the route travelled by each patient. Conventional analytical approaches, with or without the Omary correction, also yield a mean score difference on a [–4, 4] scale, although the derivation is quite different (with extremes achievable only by making the transition between diagnostic confidence grades of 1 and 5 at the two time-points). For estimating within-person changes in confidence, a sample of size 62 enables each point estimate to be within 0.5 units of its true value with 95% confidence, provided such within-person changes have standard deviation of no more than 2 units. Thus, the 95% confidence interval estimates for the various changes considered were designed to have a width of no more than one unit on a common scale when assessing each analytical approach separately.
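The precision statement above can be checked with a short calculation. This sketch assumes a normal-approximation 95% interval (critical value 1.96), which the text does not state explicitly; the function name `ci_half_width` is hypothetical.

```python
import math


def ci_half_width(sd: float, n: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% CI for a mean: z * sd / sqrt(n).

    z = 1.96 is the normal-approximation critical value (an assumption here).
    """
    return z * sd / math.sqrt(n)


if __name__ == "__main__":
    # Worst case considered in the text: within-person SD of 2 units, n = 62.
    hw = ci_half_width(sd=2.0, n=62)
    print(f"half-width = {hw:.3f}, full width = {2 * hw:.3f}")
```

Under these assumptions the half-width falls just under 0.5 units, consistent with an interval no wider than one unit on the common scale.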

The primary analysis was to compare change in confidence within each method by constructing 95% confidence interval estimates and testing whether such changes differed significantly from zero. Paired t-test statistics (or their corresponding p-values) can be seen as standardized scores for assessing changes in confidence in terms of numbers of standard errors away from zero.

All analyses were carried out using the software SPSS for Windows (version 12.0; SPSS Inc. Chicago, IL).


    Results
Patients and diagnoses in the case study
The illustrative case study consisted of 62 patients with median age (range) of 62 (18–92) years, of whom 33 (53%) were men. Eleven (18%) patients underwent surgery. The numbers of patients traversing each diagnostic route in our proposed schema are presented in the last row of Figure 2. The pattern of overall changes in diagnoses (between admission and CT and from CT to ultimate diagnosis) is presented in Figure 4. Of the admitting diagnoses, 52% (32/62) proved to be incorrect, as did 19% (12/62) of the post-CT diagnoses. The sensitivity, specificity, and accuracy of CT in identifying the correct acute abdominal disorder were 98% (39/40), 50% (11/22) and 81% (50/62), respectively.
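The three summary measures quoted above follow directly from the reported counts. The sketch below recomputes them; the cell labels (`tp`, `fn`, `tn`, `fp`) are inferred from the fractions 39/40, 11/22 and 50/62 and are an assumption about how the underlying 2×2 table was laid out.

```python
def diagnostic_summary(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Sensitivity, specificity and accuracy from 2x2 table counts."""
    total = tp + fn + tn + fp
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / total,
    }


if __name__ == "__main__":
    # Counts inferred from the reported fractions (assumed table layout).
    summary = diagnostic_summary(tp=39, fn=1, tn=11, fp=11)
    for name, value in summary.items():
        print(f"{name}: {value:.3f}")
```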


Figure 4. Flow of patients through the study. *Represents patients in routes F4, F6 in Figure 2. **Represents patients in routes C4, C5, E2, E3, E4, E6 in Figure 2. ***Number of post-CT diagnoses correct at follow up. †Represents patients in routes A5, A6, B5, B6 in Figure 2 and is the number of admitting diagnoses correct at follow up.

 
Comparison of proposed score-based method and conventional analysis of diagnostic confidence
Figure 5 illustrates the distributions of changes in confidence according to: (a) the score-based method; (b) uncorrected (or unadjusted) change in diagnostic confidence scores; and (c) Omary corrected (adjusted) changes in diagnostic confidence.


Figure 5. Frequency distributions, and superimposed normal curves, of within-person changes in diagnostic confidence on common [–4, 4] scale, for n = 62 study subjects, according to: (a) score-based weighted average method; (b) conventional analysis, uncorrected for post-CT change in diagnosis; and (c) conventional analysis, with Omary correction for post-CT change in diagnosis.

 
Mean changes in confidence and their associated 95% confidence intervals are reported in Table 1 for each analytical approach. The first row of Table 1 shows a significant increase in diagnostic confidence score following CT when utilizing the new methodology: mean (standard deviation, SD) of 0.69 (1.73) (t = 3.12, p = 0.003). The second and third rows give the corresponding results for the conventional analyses, unadjusted and when employing the Omary correction: mean (SD) increase in confidence grade following CT, 1.13 (1.24) (t = 7.2, p<0.0001), and 1.32 (1.17) (t = 8.9, p<0.0001), respectively.
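The standardized scores above can be approximately reproduced from the published means and SDs, since the one-sample t statistic against zero change is the mean divided by its standard error. The sketch below is illustrative only: the function names are hypothetical, the critical value of 2.00 is an approximation to the 97.5th percentile of the t distribution with 61 degrees of freedom, and small discrepancies from the reported values (for example about 3.14 rather than 3.12 for the score-based method) are to be expected because the inputs are rounded.

```python
import math


def t_statistic(mean: float, sd: float, n: int) -> float:
    """One-sample t statistic for H0: mean change = 0."""
    return mean / (sd / math.sqrt(n))


def confidence_interval(mean: float, sd: float, n: int,
                        t_crit: float = 2.00) -> tuple:
    """Approximate 95% CI; t_crit = 2.00 approximates t(0.975, df = 61)."""
    half_width = t_crit * sd / math.sqrt(n)
    return (mean - half_width, mean + half_width)


if __name__ == "__main__":
    n = 62
    published = {
        "score-based": (0.69, 1.73),
        "conventional, unadjusted": (1.13, 1.24),
        "conventional, Omary corrected": (1.32, 1.17),
    }
    for label, (mean, sd) in published.items():
        t = t_statistic(mean, sd, n)
        lo, hi = confidence_interval(mean, sd, n)
        print(f"{label}: t = {t:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

The comparison makes the mechanism visible: the score-based method has both a smaller mean change and a larger SD, and each of these reduces the t statistic.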


Table 1. Differences in diagnostic confidence using the proposed method (first row), and differences in surgical confidence grade using conventional analyses without and with the "Omary" correction for changes in diagnoses (second and third rows, respectively). SD = standard deviation; CI = confidence interval

 

    Discussion
Our analytical method for assessing diagnostic confidence differs fundamentally from the majority of previous approaches, which do not consider whether post-test diagnoses are correct or not. The results from our illustrative case study indicate that CT significantly improves diagnostic confidence. However, they also suggest that more conventional analytical approaches may over-estimate its efficacy: what appear to be highly significant conclusions in favour of CT using a conventional analysis (p<0.0001) are less striking when utilizing our approach (p = 0.003). Some of this change in order of magnitude of p-value can be attributed to Omary's correction enlarging the change in confidence and simultaneously diminishing its variability (as shown by reducing the SD) compared with the naïve or unadjusted approach. The score-based method proposed makes a correction in a more thorough sense by taking into account a patient's ultimate diagnosis and, in the case study presented, diminishes confidence change and increases its SD compared with the conventional approaches.

Assessment of the efficacy of a test is an important component in determining the utility of that test and plays an important role in evidence-based evaluations. There are various levels at which efficacy can be assessed. The six-tier hierarchical model proposed by Thornbury and Fryback [4, 6] provides a basis for such evaluations. That model consists of the following: technical efficacy (level 1), diagnostic-accuracy efficacy (level 2), diagnostic-thinking efficacy (level 3), therapeutic efficacy (level 4), patient outcome efficacy (level 5) and societal efficacy (level 6). Most assessments of efficacy in diagnostic imaging to date have focused on levels 1 and 2, and comparatively few have undertaken evaluations at level 3. The majority of the prior studies of diagnostic confidence (or diagnostic probabilities), a level 3 evaluation, have examined simply pre- and post-test changes in confidence [7–13]. Some have recognized the additional need to adjust for attendant changes in diagnoses [14–16].

To our knowledge, only one previous study has attempted to adjust for the impact of the possibility of incorrect post-test diagnoses on the analysis of confidence, but their approach did not consider that diagnostic errors may occur in one of two directions (i.e. misdiagnosing either more or less severe conditions) and the resultant impact of such possible differences [17]. A test that increases one's confidence in an incorrect diagnosis, i.e. a diagnosis that is confidently wrong, can have potentially undesirable consequences. Such an example from the illustrative case study was a 30-year-old woman admitted with right lower quadrant pain, who on admission was considered to have appendicitis with an initial confidence of 30%. Following CT she was thought to have non-specific abdominal pain with a confidence of 90%. She underwent laparoscopic evaluation because of persisting symptoms. Pathological evaluation of the appendectomy specimen showed signs of an acute appendicitis, although the appendix appeared normal on gross surgical evaluation (retrospective review of the CT also confirmed no detectable abnormality).

Conventional analyses of diagnostic confidence would suffice if post-test diagnoses could be assumed to be correct. However, this is not a sound assumption as no test is infallible. Of note, in the illustrative case study presented, as many as 19% (95% CI, 10%, 29%) of the post-CT diagnoses proved to be ultimately incorrect. That this represents about one-fifth of all cases, some with potentially severe consequences, serves to underscore the desirability and need for a method that takes into account inaccurate diagnoses.
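The interval quoted above for the proportion of incorrect post-CT diagnoses can be approximately reproduced with a simple normal-approximation (Wald) interval. The text does not state which interval method was used, so the choice of the Wald formula here is an assumption, and the function name `wald_interval` is hypothetical.

```python
import math


def wald_interval(events: int, n: int, z: float = 1.96) -> tuple:
    """Normal-approximation (Wald) 95% CI for a binomial proportion."""
    p = events / n
    se = math.sqrt(p * (1.0 - p) / n)
    return (p - z * se, p + z * se)


if __name__ == "__main__":
    # 12 of 62 post-CT diagnoses proved incorrect.
    lo, hi = wald_interval(12, 62)
    print(f"proportion = {12 / 62:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

Other interval methods (for example Wilson or exact intervals) would give slightly different limits for a sample of this size.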

Our analysis draws attention to the critical interaction of diagnostic accuracy and diagnostic confidence in evaluating the efficacy of a test. In our view, a full assessment of diagnostic confidence efficacy requires knowledge of the eventual "truth" of the test in question. Such information is often hard to establish; in the case study presented, findings at surgery and rigorous follow up provided the necessary gold standard.

Creating a scoring system that reflects the benefits, or otherwise, to the patient of the test in question is necessarily a subjective process. The appropriate increments in the scale can be debated. For example, one could propose that there should be as many as 36 different scores to reflect the relative ranks of all combinations of diagnostic changes, diagnostic outcomes and confidences in our framework (Figures 1 and 2); in our view, however, it is neither necessary nor desirable to rank the routes precisely, given the wide spectrum of possible circumstances and scenarios. We consider that a relatively narrow scale, such as a 9-point scale, is simple to use, with the main focus being placed on whether routes are generally favourable or unfavourable for the patient.

It could be argued that the need to derive 36 scores is burdensome and complex. However, it should be noted that if one were to consider just 20 possible individual diagnoses, each with three levels of severity, and the three time-points under consideration (pre-CT, post-CT, final diagnosis), this would generate over 200000 potential paths to consider. Furthermore, not 36 but only 9 independent routes need to be scored in detail since the other 27 route scores readily follow by application of our previously described simple rules for assuring internal consistency.
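One way to arrive at the figure of "over 200000" is to treat each time-point as taking any one of the diagnosis-severity combinations independently. That reading is an assumption made here for illustration, as the text does not spell out the counting, and the function name `count_paths` is hypothetical.

```python
def count_paths(n_diagnoses: int, n_severities: int, n_timepoints: int) -> int:
    """Number of paths if each time-point can independently take any
    diagnosis-severity combination (an assumed counting scheme)."""
    states_per_timepoint = n_diagnoses * n_severities
    return states_per_timepoint ** n_timepoints


if __name__ == "__main__":
    # 20 diagnoses x 3 severity levels at each of 3 time-points.
    print(count_paths(20, 3, 3))
```

Under this reading there are 60 possible states at each time-point, so the count is 60 cubed, against which 36 routes (of which only 9 need detailed scoring) is a substantial simplification.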

It should be noted that, in our case example, assignment of the best estimate scores in these nine basic routes was fairly limited, since one route was defined to be neutral (score of "0") and two were constrained to be extreme values (best possible "+4", and worst possible "–4"); for the remaining six routes, the main issue was whether they were generally beneficial (possible scores "1, 2, 3") or detrimental to the patient (possible scores, "–1, –2, –3"). There was greater potential for variability in assignment of the "optimistic" and "pessimistic" score range. However, we found that the results were not materially affected by leaving aside these values and basing the analysis on just the best estimate scores, probably because of the 4:1 weighting that was applied. We did not formally assess observer variability in scoring because of the relatively narrow range of possible scores in the nine basic routes and because of the relative insensitivity to the assignment of optimistic and pessimistic scores.

Part of the derivation of the 36 routes in our analysis was based on the recognition that not only may diagnoses be wrong, but also that such incorrect diagnoses may be either more or less severe than the correct diagnoses. The only previous analytical methodology that recognized the importance of the former aspect did not include the more rigorous evaluation required by the latter aspect [17]. Judgement related to the direction of errors (more severe vs less severe diagnoses) potentially introduces an element of subjectivity, but this determination is often clear-cut, for example, in the case study, gastroenteritis is generally considered less severe than appendicitis. In practice, depending on the particular outcome measure of interest, this determination could be assisted by drawing on data of perhaps operative rates, mortality, prognosis, or length of hospital stay for specific conditions, to assess the relative severity of diagnoses encountered. The ability to pre-specify a range for pessimistic and optimistic scores enhances the robustness of our method, allowing scope for recognition of subjectivity about the benefit or otherwise of the test within each route.

Clinical grading of diagnostic confidence (in the case study, by surgeons) is another arguably subjective process. Nevertheless, clinical diagnostic confidence undoubtedly has an important impact on medical decision-making in the face of uncertainty, because actions are often based on levels of confidence. We believe that stratification of confidence scores into high and low levels captures the essence of the impact of diagnostic confidence, namely, to proceed on the basis of one's belief in the diagnosis or not.

Many factors influence a clinician's confidence in a particular diagnosis, and these are necessarily weighed before coming to a final assessment of confidence. One such factor is "experience", which in turn is influenced by the complex interaction between past experience of the performance of the test in question and the individual clinician's perception of its accuracy in those encounters [1, 18]. This is a form of feedback loop. Thus, in coming to a judgement, it is possible to be inappropriately influenced by past experience of a particular test. In addition, individual clinicians' diagnoses may be systematically biased towards underdiagnosis or overdiagnosis, influenced perhaps by perceived relative risks of misdiagnosis and/or medicolegal considerations.

The framework presented is generalizable to other settings, being applicable whenever a diagnostic test is under evaluation, for example the evaluation of imaging studies in the assessment of appendicitis or pulmonary embolism, or the evaluation of tests to determine the presence of acute myocardial infarction. It could also be used when comparing more than two tests or groups, such as CT pulmonary angiography, ventilation–perfusion scintigraphy, catheter pulmonary angiography and D-dimer assays in the pulmonary embolism example given above. The scoring scale need not have nine discrete values as in the illustrative case example, since any ordinal or continuous scale could be employed. However, having a fixed central point of 0 defining neutrality is helpful.

Deciding which diagnostic test to adopt can be complex, and assessments at higher levels of the Thornbury and Fryback six-tier model may be appropriate. Many studies assessing diagnostic tests focus initially on their technical and diagnostic accuracy (levels 1 and 2 of the Thornbury and Fryback model). Those studies which explore issues at level 3 (for example, diagnostic confidence) have generally done so independently of level 1 and 2 considerations and have typically not incorporated issues of diagnostic accuracy in their evaluations of diagnostic confidence. This would appear to be limiting, since there is a close inter-relationship between diagnostic accuracy and diagnostic confidence when considering their impact on the patient. Careful assessment of the interaction and impact of diagnostic errors and diagnostic confidence would seem of fundamental importance.

Assessment of diagnostic confidence alone may suffice in comparing tests if the tests are 100% accurate; however, in these circumstances, there would be no reason to suspect differences in confidence between tests. Further considerations, apart from diagnostic accuracy, in comparing diagnostic tests include the costs, potential risks and availability of the test. Although conventional analyses of confidence are simple to apply, their limitations when used in isolation may lead to false impressions of efficacy. It is appropriate in an evidence-based environment that higher levels of the Thornbury and Fryback six-tier hierarchy be explored, and that such evaluations should be rigorous. We do not propose dispensing with the assessment of diagnostic accuracy alone, since this is clearly important to consider.

We do suggest, however, that judging the clinical efficacy of a test by changes in diagnostic confidence alone has limitations and has the potential to yield misleading conclusions. We advocate that a robust analysis of diagnostic confidence requires the incorporation of diagnostic accuracy.

The development of the methodological approach described in this article was unfunded.


    Acknowledgments
 
We wish to thank our colleagues who contributed to the randomized CT study, which enabled the illustrative case study, and the NHS Executive for supporting the Centre for Applied Medical Statistics. We would also like to thank Mariann Crapanzano and Kathleen Wagner for their help in preparing the manuscript.

Received for publication June 21, 2005. Revision received March 26, 2006. Accepted for publication July 13, 2006.


    References

  1. Friedman C, Gatti G, Elstein A, Franz T, Murphy G, Wolf F. Are clinicians correct when they believe they are correct? Implications for medical decision support. Medinfo 2001;10:454–8.
  2. Ng CS, Watson CJ, Palmer CR, et al. Evaluation of early abdominopelvic computed tomography in patients with acute abdominal pain of unknown cause: prospective randomised study. BMJ 2002;325:14.
  3. Lockyer KG. Introduction to critical path analysis. Critical path analysis and other project network techniques. London: Pitman, 1991.
  4. Thornbury JR, Fryback DG, Edwards W. Likelihood ratios as a measure of the diagnostic usefulness of excretory urogram information. Radiology 1975;114:561–5.
  5. Omary RA, Kaplan PA, Dussault RG, et al. The impact of ankle radiographs on the diagnosis and management of acute ankle injuries. Acad Radiol 1996;3:758–65.
  6. Fryback DG, Thornbury JR. The efficacy of diagnostic imaging. Med Decision Making 1991;11:88–94.
  7. Dixon AK, Southern JP, Teale A, et al. Magnetic resonance imaging of the head and spine: effective for the clinician or the patient? BMJ 1991;302:79–82.
  8. Mustard CA, McClarty B, MacEwan D. Influence of magnetic resonance imaging on diagnosis and therapeutic intention. Acad Radiol 1996;3:586–9.
  9. Anzilotti K, Jr, Schweitzer ME, Hecht P, Wapner K, Kahn M, Ross M. Effect of foot and ankle MR imaging on clinical decision making. Radiology 1996;201:515–17.
  10. Hollingworth W, Todd CJ, Bell MI, et al. The diagnostic and therapeutic impact of MRI: an observational multi-centre study. Clin Radiol 2000;55:825–31.
  11. Dhillon S, Halligan S, Goh V, Matravers P, Chambers A, Remedios D. The therapeutic impact of abdominal ultrasound in patients with acute abdominal symptoms. Clin Radiol 2002;57:268–71.
  12. Chambers A, Halligan S, Goh V, Dhillon S, Hassan A. Therapeutic impact of abdominopelvic computed tomography in patients with acute abdominal symptoms. Acta Radiol 2004;45:248–53.
  13. Warren RM, Bobrow LG, Earl HM, et al. Can breast MRI help in the management of women with breast cancer treated by neoadjuvant chemotherapy? Br J Cancer 2004;90:1349–60.
  14. Neish AS, Taylor GA, Lund DP, Atkinson CC. Effect of CT information on the diagnosis and management of acute abdominal injury in children. Radiology 1998;206:327–31.
  15. Carrico CW, Fenton LZ, Taylor GA, DiFiore JW, Soprano JV. Impact of sonography on the diagnosis and treatment of acute lower abdominal pain in children and young adults. AJR Am J Roentgenol 1999;172:513–16.
  16. Abramson S, Walders N, Applegate KE, Gilkeson RC, Robbin MR. Impact in the emergency department of unenhanced CT on diagnostic confidence and therapeutic efficacy in patients with suspected renal colic: a prospective survey. 2000 ARRS President's Award. American Roentgen Ray Society. AJR Am J Roentgenol 2000;175:1689–95.
  17. Tsushima Y, Aoki J, Endo K. Contribution of the diagnostic test to the physician's diagnostic thinking: new method to evaluate the effect. Acad Radiol 2003;10:751–5.
  18. Tversky A, Kahneman D. Judgement under uncertainty: heuristics and biases. Science 1974;185:1124–31.


