1 Department of Radiology, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA, 2 Centre for Applied Medical Statistics, University of Cambridge, Cambridge CB2 2SR, UK
Correspondence: C S Ng, Department of Radiology, Unit 57, The University of Texas MD Anderson Cancer Center, 1515 Holcombe Boulevard, Houston, TX 77030-4009, USA.
| Abstract |
|---|
| Introduction |
|---|
The majority of prior studies that have attempted to assess the efficacy of a diagnostic test in terms of its impact on diagnostic confidence have used analyses based simply on changes between pre- and post-test confidences (or probabilities of diagnoses). Such analyses fail to take into account the possibility of incorrect diagnoses and their potential consequences. Confidently wrong diagnoses may be harmful to patients: for instance, patients may undergo unnecessary surgery, or else fail to have necessary surgery, with potentially serious consequences.
We present a framework for the analysis of diagnostic confidence that takes into account ultimate diagnostic accuracy. We illustrate the methodology using a case study with data from a prospective trial assessing the value of abdomino-pelvic CT in patients admitted to hospital with acute abdominal pain of uncertain aetiology. We compare the new approach with more conventional evaluations of diagnostic confidence, which limit themselves to simple pre- and post-test scores of confidence.
| Methods and materials |
|---|
Diagnostic confidence and accuracy
The admitting surgeons documented their diagnoses and graded their diagnostic confidence at the time of admission (0 h) and again after the CT examination. Grading of diagnostic confidence by the surgeons was on a 5-point scale: 1 (10% confidence, i.e. very unsure), 2 (30%), 3 (50%), 4 (70%) and 5 (90% confidence, i.e. highly confident). These pre- and post-test diagnostic confidences formed the bases for the conventional analysis of efficacy, to which our approach was compared.
The "final" (gold standard) diagnosis was defined as that made at surgery or at 6-month follow-up, whichever was sooner. In this case study, "diagnostic accuracy" was the accuracy of the diagnosis in relation to the final diagnosis. A fundamentally important consideration is that post-test diagnoses may be wrong. Furthermore, such incorrect post-test diagnoses may either underestimate ("undercall") or overestimate ("overcall") the severity of the actual diagnosis. In this illustrative case study, a more severe diagnosis was taken as one that would typically have needed surgical intervention (rather than medical or conservative therapy), or one that would usually be associated with a poorer prognosis. For example, acute appendicitis in most circumstances would be considered a more severe diagnosis than diverticulitis; in contrast, gastroenteritis would in most circumstances be considered less severe than diverticulitis.
Similarly, the test itself may or may not change the pre-test diagnosis; and likewise, a post-test diagnosis may be more severe than the pre-test diagnosis (i.e. the test may "upgrade" the severity of the pre-test diagnosis), or it may be less severe than the pre-test diagnosis (i.e. the test may "downgrade" the severity of the initial diagnosis). Changes between the admitting and post-CT diagnoses, and between the post-CT and final diagnoses, were assessed by two observers (a surgeon and a radiologist). If there was a change in diagnosis, a consensus as to whether the change was towards a more or less severe diagnosis was reached; if the diagnosis did not change, no change was recorded.
Scoring of diagnostic confidence in relation to diagnostic accuracy in proposed approach
The surgeons' diagnostic confidence grades were dichotomized such that "high" confidence level consisted of surgical confidence grades 4 or 5 (i.e. >50% confidence); "low" confidence levels consisted of surgical confidence grades 1, 2, or 3 (i.e. ≤50% confidence). The premise was that a high level of confidence in a diagnosis usually implies that definitive management proceeds on that basis, whereas a low level of confidence does not usually bring about specific management.
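The grading and dichotomization described above can be sketched in a few lines. This is a minimal illustration only: the names `GRADE_TO_PERCENT` and `confidence_level` are hypothetical and do not come from the study.

```python
# Percent confidence attached to each surgical grade on the 5-point scale.
GRADE_TO_PERCENT = {1: 10, 2: 30, 3: 50, 4: 70, 5: 90}


def confidence_level(grade: int) -> str:
    """Return 'high' for grades 4-5 (>50%) and 'low' for grades 1-3 (<=50%)."""
    if grade not in GRADE_TO_PERCENT:
        raise ValueError(f"grade must be an integer from 1 to 5, got {grade!r}")
    return "high" if GRADE_TO_PERCENT[grade] > 50 else "low"


# Dichotomized level for every grade on the scale.
levels = {grade: confidence_level(grade) for grade in GRADE_TO_PERCENT}
```

Note that grade 3 (50%) falls in the "low" group, because the threshold for "high" is strictly greater than 50%.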
A scoring system was used to quantify the impact of the test (i.e. the CT examination) with particular attention to the patients' perspective, taking into account the changes in diagnostic confidence level (high or low), change in diagnoses (more or less severe, or no change), and ultimate accuracy of the diagnosis. To introduce the scoring system, we illustrate in Figure 1 the nine basic "routes" for those situations in which initial and post-test levels of confidence are both high. These routes consist of combinations of diagnostic changes and ultimate diagnostic accuracy and are labelled A1, A3, etc. These same labels feature in Figure 2, which depicts in flow diagram format the 36 total possible combinations arising from all 4 permutations of low and high initial and post-test levels of confidence as applied to the above 9 basic routes.
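The count of 36 can be checked by enumeration. The sketch below assumes that the nine basic routes are the 3 × 3 combinations of diagnostic change (upgraded, downgraded, unchanged) and final accuracy (correct, undercall, overcall); this decomposition and the label strings are assumptions for illustration, and the mapping to the figure labels (A1, A3, etc.) is not reproduced here.

```python
from itertools import product

# Assumed category labels, for illustration only.
CONFIDENCE = ("high", "low")
DIAGNOSIS_CHANGE = ("upgraded", "downgraded", "unchanged")
FINAL_ACCURACY = ("correct", "undercall", "overcall")

# Nine basic routes: one per (diagnostic change, final accuracy) pair.
basic_routes = list(product(DIAGNOSIS_CHANGE, FINAL_ACCURACY))

# Crossing with pre-test and post-test confidence levels gives 4 x 9 routes.
all_routes = list(product(CONFIDENCE, CONFIDENCE, DIAGNOSIS_CHANGE, FINAL_ACCURACY))

n_basic = len(basic_routes)
n_total = len(all_routes)
```

Here `n_basic` is 9 and `n_total` is 36, matching the counts given in the text.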
A score for each route, the "route score", was derived by application of the principles in the following six paragraphs. Each route was assigned a score on a 9-point scale, ranging from −4 to +4, defined as follows. The best possible impact for a test from the patients' perspective (diagnostic score of +4) was one in which the test correctly diagnosed, with high confidence, a condition that was initially confidently considered a less severe diagnosis (i.e. route A1). The least favourable scenario (diagnostic score of −4) was considered to be one in which an intervention incorrectly diagnosed, with high confidence, a condition that was initially confidently and correctly considered to be a more serious condition (i.e. route F5).
There are two special routes in which the test changes neither an admitting diagnosis that ultimately proves to be correct nor the confidence level of that diagnosis (routes A5 and B6). As such, these routes have no significant impact on the patient, being neither beneficial nor detrimental, and have been assigned defining neutral scores of 0.
Scores for the other intermediate routes were derived by considering hypothetical combinations of "severe" diagnoses, for example acute appendicitis or ruptured abdominal aortic aneurysm, and "less severe" diagnoses, for example gastroenteritis or non-specific abdominal pain, together with combinations of high and low diagnostic confidences. Scoring was undertaken prospectively by two observers in consensus without knowledge of the specific diagnoses in each route. In arriving at a score, termed the "best estimate" score, it was helpful to consider first if the route was generally favourable or unfavourable from the patients' point of view, and then to consider its relative benefit, or otherwise, in relation to the fixed points (of −4, 0 and +4) defined above.
To address the potentially broad range of diagnoses and circumstances encapsulated by each of the 36 routes, the best estimate scores were supplemented by a pair of values: a lower score and an upper score to reflect hypothetical "pessimistic" and "optimistic" scenarios, respectively. This was in recognition that routes could attract less or more favourable scores on either side of their best estimate score, depending on the circumstances. Individual disease entities can have a spectrum of potential severity; for example, diseases as diverse as gastroenteritis and diverticulitis can each range in severity from self-limiting to life-threatening. Furthermore, some individual routes can arguably be inherently more or less favourable for the patient depending on the setting. For example, route F1 could generally be considered unfavourable in that the test missed a more serious final diagnosis. However, it could be considered that the test was of some benefit to the patient by guiding clinicians away from complacency towards a more severe diagnosis (hence an optimistic score of +1). Likewise, the two routes with defining scores of 0 were both assigned a pessimistic to optimistic score range of −1 to +1. This is in recognition that it could be considered of some reassurance to the patient that an (ultimately correct) diagnosis was confirmed by the test but, conversely, it could be considered that the test was of no diagnostic value and in fact expended cost and time (and in the context of the case study, also radiation exposure).
For internal consistency, and in recognition that higher levels of confidence tend to influence management decisions more strongly, assigned scores were one less in routes B1–B4 than in routes A1–A4, respectively; one less in routes F1–F6 than in routes D1–D6, respectively; and one less in routes E1–E6 than in routes C1–C6, respectively (see Figure 2). The above sets of groups reflect differences between low vs high post-test (post-CT) confidences. The same principle was applied to scores between low vs high pre-test (admission) confidence routes, for example, routes A1 vs A2, A3 vs A4, B1 vs B2, to F5 vs F6, where differences in best estimate scores were assigned to be one. Routes A5, A6, B5 and B6 were exceptions because of the defining 0 score status of routes A5 and B6. Note that routes A1–A6 and B1–B6 have positive scores in general, reflecting the favourable contribution of the test in achieving a correct diagnosis in these routes; this is except for route B5, in which the test (inappropriately) reduced the confidence in a correct initial diagnosis. Similarly, routes within groups C, D, E and F have mostly negative scores, reflecting the unfavourable impact of the test in yielding incorrect diagnoses in these routes.
The above best estimate, pessimistic and optimistic scores (presented as individual components for clarity in Figure 2) were combined into a single weighted average score, which summarizes and quantifies the change between pre- and post-CT confidences for each route, the "route score", which was used in subsequent statistical analyses. This weighted average score was derived by placing one-sixth weight on each of the pessimistic and optimistic scores, and four-sixths weight on the best estimate score. This particular 1:4:1 weighting is standard within "Project Evaluation and Review Technique" (PERT), a well-established method applied in network analysis when coping with uncertainty [3].
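The 1:4:1 weighting can be written out as a one-line calculation. This is a minimal sketch; the function name `route_score` is hypothetical. The first example uses the range stated above for the two defining neutral routes (pessimistic −1, best estimate 0, optimistic +1); the second uses made-up values purely to show the effect of an asymmetric range.

```python
def route_score(pessimistic: float, best_estimate: float, optimistic: float) -> float:
    """Weighted average with weights 1/6, 4/6 and 1/6 (the 1:4:1 PERT weighting)."""
    return (pessimistic + 4 * best_estimate + optimistic) / 6


# Defining neutral route: a symmetric range leaves the best estimate unchanged.
neutral = route_score(-1, 0, 1)

# Hypothetical asymmetric range: the score shifts towards the wider side.
skewed = route_score(1, 2, 4)
```

When the pessimistic and optimistic scores are equally far from the best estimate, the weighted average equals the best estimate; only an asymmetric range moves the route score.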
| Conventional analysis of diagnostic confidence |
|---|
The primary analysis was to compare change in confidence within each method by constructing 95% confidence interval estimates and testing whether such changes differed significantly from zero. Paired t-test statistics (or their corresponding p-values) can be seen as standardized scores for assessing changes in confidence in terms of numbers of standard errors away from zero.
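As an illustration of this kind of paired analysis, the sketch below computes the mean change, its standard error, the paired t statistic (the number of standard errors from zero) and an approximate 95% interval. It is not the SPSS procedure used in the study: the function name and the example data are hypothetical, and the interval uses a normal critical value as a simplifying assumption, which is slightly narrower than a t-based interval for small samples.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def paired_change_summary(pre, post, level=0.95):
    """Summarize paired changes (post minus pre) in confidence or route score."""
    if len(pre) != len(post) or len(pre) < 2:
        raise ValueError("pre and post must have equal length of at least 2")
    diffs = [after - before for before, after in zip(pre, post)]
    mean_diff = mean(diffs)
    std_err = stdev(diffs) / sqrt(len(diffs))
    if std_err == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    # Normal approximation to the critical value (an assumption of this sketch).
    critical = NormalDist().inv_cdf(0.5 + level / 2)
    return {
        "mean_diff": mean_diff,
        "std_err": std_err,
        "t": mean_diff / std_err,
        "ci": (mean_diff - critical * std_err, mean_diff + critical * std_err),
    }


# Hypothetical pre- and post-test confidences (percent) for six patients.
pre_confidence = [30, 50, 30, 70, 50, 10]
post_confidence = [70, 90, 50, 90, 70, 30]
summary = paired_change_summary(pre_confidence, post_confidence)
```

An interval that excludes zero corresponds to a change that differs significantly from zero at the chosen level, under the stated approximation.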
All analyses were carried out using the software SPSS for Windows (version 12.0; SPSS Inc. Chicago, IL).
| Results |
|---|
| Discussion |
|---|
Assessment of the efficacy of a test is an important component in determining the utility of that test and plays an important role in evidence-based evaluations. There are various levels at which efficacy can be assessed. The six-tier hierarchical model proposed by Thornbury and Fryback [4, 6] provides a basis for such evaluations. That model consists of the following: technical efficacy (level 1), diagnostic-accuracy efficacy (level 2), diagnostic-thinking efficacy (level 3), therapeutic efficacy (level 4), patient outcome efficacy (level 5) and societal efficacy (level 6). Most assessments of efficacy in diagnostic imaging to date have focused on levels 1 and 2, and comparatively few have undertaken evaluations at level 3. The majority of the prior studies of diagnostic confidence (or diagnostic probabilities), a level 3 evaluation, have examined simply pre- and post-test changes in confidence [7–13]. Some have recognized the additional need to adjust for attendant changes in diagnoses [14–16].
To our knowledge, only one previous study has attempted to adjust for the impact of the possibility of incorrect post-test diagnoses on the analysis of confidence, but their approach did not consider that diagnostic errors may occur in one of two directions (i.e. misdiagnosing either more or less severe conditions) and the resultant impact of such possible differences [17]. A test that increases one's confidence in an incorrect diagnosis, i.e. a diagnosis that is confidently wrong, can have potentially undesirable consequences. Such an example from the illustrative case study was a 30-year-old woman admitted with right lower quadrant pain, who on admission was considered to have appendicitis with an initial confidence of 30%. Following CT she was thought to have non-specific abdominal pain with a confidence of 90%. She underwent laparoscopic evaluation because of persisting symptoms. Pathological evaluation of the appendectomy specimen showed signs of an acute appendicitis, although the appendix appeared normal on gross surgical evaluation (retrospective review of the CT also confirmed no detectable abnormality).
Conventional analyses of diagnostic confidence would suffice if post-test diagnoses could be assumed to be correct. However, this is not a sound assumption as no test is infallible. Of note, in the illustrative case study presented, as many as 19% (95% CI, 10%, 29%) of the post-CT diagnoses proved to be ultimately incorrect. That this represents about one-fifth of all cases, some with potentially severe consequences, serves to underscore the desirability and need for a method that takes into account inaccurate diagnoses.
Our analysis draws attention to the critical interaction of diagnostic accuracy and diagnostic confidence in evaluating the efficacy of a test. In our view, a full assessment of diagnostic confidence efficacy requires knowledge of the eventual "truth" of the test in question. Such information is often hard to establish; in the case study presented, findings at surgery and rigorous follow up provided the necessary gold standard.
Creating a scoring system that reflects the benefits, or otherwise, to the patient of the test in question is necessarily a subjective process. The appropriate increments in the scale can be debated. For example, one could propose that there should be as many as 36 different scores to reflect the relative ranks of all combinations of diagnostic changes, diagnostic outcomes and confidences in our framework (Figures 1 and 2); in our view, however, it is neither necessary nor desirable to rank the routes precisely, given the wide spectrum of possible circumstances and scenarios. We consider that a relatively narrow scale, such as a 9-point scale, is simple to use, with the main focus being placed on whether routes are generally favourable or unfavourable for the patient.
It could be argued that the need to derive 36 scores is burdensome and complex. However, it should be noted that if one were to consider just 20 possible individual diagnoses, each with three levels of severity, and the three time-points under consideration (pre-CT, post-CT, final diagnosis), this would generate over 200000 potential paths to consider. Furthermore, not 36 but only 9 independent routes need to be scored in detail since the other 27 route scores readily follow by application of our previously described simple rules for assuring internal consistency.
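The arithmetic behind these counts can be reproduced directly. The sketch assumes that a "path" is one choice of diagnosis and severity level at each of the three time-points; that reading is an interpretation of the text, and the variable names are illustrative.

```python
n_diagnoses = 20
n_severity_levels = 3
n_time_points = 3  # pre-CT, post-CT, final diagnosis

# Each time-point can take any (diagnosis, severity) combination.
states_per_time_point = n_diagnoses * n_severity_levels

# One state is chosen independently at each time-point.
potential_paths = states_per_time_point ** n_time_points

# Routes in the proposed framework: 9 scored in detail, the rest derived by rule.
total_routes = 36
independent_routes = 9
derived_routes = total_routes - independent_routes
```

Under this reading there are 60 states per time-point and 60³ = 216 000 potential paths, compared with 9 routes scored in detail and 27 derived from them.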
It should be noted that, in our case example, assignment of the best estimate scores in these nine basic routes was fairly limited, since one route was defined to be neutral (score of "0") and two were constrained to be extreme values (best possible "+4", and worst possible "−4"); for the remaining six routes, the main issue was whether they were generally beneficial (possible scores "1, 2, 3") or detrimental to the patient (possible scores "−1, −2, −3"). There was greater potential for variability in assignment of the "optimistic" and "pessimistic" score range. However, we found that the results were not materially affected by leaving aside these values and basing the analysis on just the best estimate scores, probably because of the 4:1 weighting that was applied. We did not formally assess observer variability in scoring because of the relatively narrow range of possible scores in the nine basic routes and because of the relative insensitivity to the assignment of optimistic and pessimistic scores.
Part of the derivation of the 36 routes in our analysis was based on the recognition that not only may diagnoses be wrong, but also that such incorrect diagnoses may be either more or less severe than the correct diagnoses. The only previous analytical methodology that recognized the importance of the former aspect did not include the more rigorous evaluation required by the latter aspect [17]. Judgement related to the direction of errors (more severe vs less severe diagnoses) potentially introduces an element of subjectivity, but this determination is often clear-cut, for example, in the case study, gastroenteritis is generally considered less severe than appendicitis. In practice, depending on the particular outcome measure of interest, this determination could be assisted by drawing on data of perhaps operative rates, mortality, prognosis, or length of hospital stay for specific conditions, to assess the relative severity of diagnoses encountered. The ability to pre-specify a range for pessimistic and optimistic scores enhances the robustness of our method, allowing scope for recognition of subjectivity about the benefit or otherwise of the test within each route.
Clinical grading of diagnostic confidence (in the case study, by surgeons) is another arguably subjective process. Nevertheless, clinical diagnostic confidence undoubtedly has an important impact on medical decision-making in the face of uncertainty, because actions are often based on levels of confidence. We believe that stratification of confidence scores into high and low levels captures the essence of the impact of diagnostic confidence, namely, to proceed on the basis of one's belief in the diagnosis or not.
Many factors influence a clinician's confidence in a particular diagnosis, and these are necessarily weighed before coming to a final assessment of confidence. One such factor is "experience", which in turn is influenced by the complex interaction between past experience of the performance of the test in question and the individual clinician's perception of its accuracy in those encounters [1, 18]. This is a form of feedback loop. Thus, in coming to a judgement, it is possible to be inappropriately influenced by past experience of a particular test. In addition, individual clinicians' diagnoses may be systematically biased towards underdiagnosis or overdiagnosis, influenced perhaps by perceived relative risks of misdiagnosis and/or medicolegal considerations.
The framework presented is generalizable to other settings, being applicable whenever a diagnostic test is under evaluation, for example the evaluation of imaging studies in the assessment of appendicitis or pulmonary embolism, or the evaluation of tests to determine the presence of acute myocardial infarction. It could also be used when comparing more than two sets of tests or groups, such as CT pulmonary angiography, ventilation–perfusion scintigraphy, catheter pulmonary angiography and d-dimers in the example given above. The scoring scale need not have nine discrete values as in the illustrative case example, since any ordinal or continuous scale could be employed. However, having a fixed central point of 0 defining neutrality is helpful.
Deciding which diagnostic test to adopt can be complex, and assessments at higher levels of the Thornbury and Fryback six-tier model may be appropriate. Many studies assessing diagnostic tests focus initially on their technical and diagnostic accuracy (levels 1 and 2 of the Thornbury and Fryback model). Those studies which explore issues at level 3 (for example, diagnostic confidence) have generally done so independently of level 1 and 2 considerations and have typically not incorporated issues of diagnostic accuracy in their evaluations of diagnostic confidence. This would appear to be limiting since there is close inter-relationship between diagnostic accuracy and diagnostic confidence when considering their impact on the patient. Careful assessment of the interaction and impact of diagnostic errors and diagnostic confidence would seem of fundamental importance.
Assessment of diagnostic confidence alone may suffice in comparing tests if the tests are 100% accurate; however, in these circumstances, there would be no reason to suspect differences in confidence between tests. Further considerations, apart from diagnostic accuracy, in comparing diagnostic tests include the costs, potential risks and availability of the test. Although conventional analyses of confidence are simple to apply, their limitations when used in isolation may lead to false impressions of efficacy. It is appropriate in an evidence-based environment that higher levels of the Thornbury and Fryback six-tier hierarchy be explored, and that such evaluations should be rigorous. We do not propose dispensing with the assessment of diagnostic accuracy alone, since this is clearly important to consider.
We do suggest, however, that judging the clinical efficacy of a test by changes in diagnostic confidence alone has limitations and has the potential to yield misleading conclusions. We advocate that a robust analysis of diagnostic confidence requires the incorporation of diagnostic accuracy.
The development of the methodological approach described in this article was unfunded.
| Acknowledgments |
|---|
Received for publication June 21, 2005. Revision received March 26, 2006. Accepted for publication July 13, 2006.