BJR

British Journal of Radiology (2007) 80, 674-677
© 2007 British Institute of Radiology
doi: 10.1259/bjr/83042364

SHORT COMMUNICATION

Are you reading what we are reading? The effect of who interprets medical images on estimates of diagnostic test accuracy in systematic reviews

S Brealey, BSc, PhD 1 and M Westwood, BSc, PhD 2

1 York Trials Unit, Department of Health Sciences, Second floor, Area 4, Seebohm Rowntree Building, 2 Centre for Reviews and Dissemination, University of York, Heslington, York YO10 5DD, UK

Correspondence: Dr S Brealey, York Trials Unit, Department of Health Sciences, Second floor, Area 4, Seebohm Rowntree Building, University of York, Heslington, York YO10 5DD, UK. E-mail: sb143{at}york.ac.uk


    Abstract
 Top
 Abstract
 Introduction
 Observer error and variation
 Reporting of information
 Assessing quality
 Summary findings and...
 Implications beyond accuracy
 Conclusions
 References
 
Observer variation and error in the interpretation of medical images are substantial and have been described as radiology's Achilles' heel. The enormous development in imaging technologies has brought with it an increase in the complexity and volume of images produced. There is also increased diversity as to who interprets medical images. Whilst the influence of the observer on diagnostic test performance is frequently ignored, there is evidence that the observer affects estimates of accuracy. Characteristics of observers that should be considered when designing systematic reviews of diagnostic test accuracy are: allocation of images to be read by observers; number, experience and training of observers; profession of observers; and assessment of observer variability and examination of its effect on test accuracy. This information could be used to inform study appraisal, data synthesis, and the investigation of sources of heterogeneity. Establishing the effect of the role of the observer on estimates of accuracy and explaining heterogeneity is important for informing the delivery of these potentially expensive and resource-intensive imaging technologies and the continuing debate about who should read the images.


    Introduction
 
Observer variation in the interpretation of medical images is substantial and has been described as radiology's Achilles' heel [1]. Only a few years after the discovery of X-rays by Röntgen in 1895 it became apparent that reading images could result in error and disputes [2]. Studies carried out over 50 years ago [3, 4] showed variation in the interpretation of chest radiographs in patients with suspected tuberculosis. In recent decades there has been enormous development in imaging technologies with an increase in the complexity and volume of images produced. Nearly one million MRI examinations are performed in England each year [5] and further investment is being made in diagnostics with the National Health Service set to spend £2.4 billion ($4.19 billion; €3.47 billion) in the next 5 years [6]. Government policy to promote allied health-care professional skills has increased diversity as to who interprets medical images with significant implications for the delivery of diagnostic services [7]. The STARD (standards for the reporting of diagnostic accuracy studies) statement is a checklist used to guide the reporting of studies of accuracy and recognizes the need to describe the persons (or observers) involved in reading medical images [8]. This has not been included as a criterion in QUADAS (quality assessment for diagnostic accuracy studies), which is a generic tool used to appraise the quality of primary studies in systematic reviews of diagnostic accuracy [9]. We argue here for the need to assess who interprets images when designing systematic reviews on the accuracy of these potentially expensive and resource-intensive imaging technologies and suggest how to make such assessments.


    Observer error and variation
 
Diagnostic accuracy studies are integral to the evaluation of new and existing technologies and to the measurement of their ability to distinguish patients with and without the target disorder [10]. Many studies of diagnostic accuracy have major methodological shortcomings and lack information on key elements of design and conduct [11]. Meta-analysis of diagnostic test accuracy can be affected by these limitations [12]. Evaluation of the diagnostic accuracy of an imaging technique such as MRI of the knee requires the subjective, qualitative interpretation of the images as a written text report. Interpretation of the images can be ambiguous and can provide an arbitrary distinction as to the presence of disease that may result from different observers having different thresholds for calling an image "positive" [13]. In reality, the report provides the referring clinician with varying degrees of certainty as to whether, for example, a meniscal lesion is present or not, and, if so, how small or large the tear is. Errors arise when the correct interpretation of an image is not in dispute by experts representing a gold standard, but the observer fails to reach the same conclusion. Differences of opinion about the correct interpretation of an image can result in observer variation [1]. Error concerns how accurate or valid a report is, and observer variation how reliable or reproducible it is: a report that is not reproducible cannot be valid. Sometimes, however, it could be argued that a report is valid, e.g. when two independent radiologists observe the presence of pneumothorax on a chest radiograph, but not reproducible, because one radiologist reports a clinically insignificant finding such as a healed fracture. The clinical importance of what is or is not reported should be considered when judging whether reports are reproducible.
Standardization of the reporting of images and the use of classification criteria could help to increase reproducibility between observers and lead to more valid reports, but such systems are not generally used in normal practice. Moreover, in clinical practice, it is not always possible to provide a clear-cut interpretation of an image because of factors such as technical defects and artefacts, patient restrictions or administrative limitations [14].
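
The effect of different thresholds for calling an image "positive" can be illustrated with a minimal sketch. The 1 to 5 confidence ratings, the disease labels and the two thresholds below are hypothetical values chosen for illustration only; they are not data from any study cited here.

```python
from typing import List, Tuple


def sensitivity_specificity(
    ratings: List[int], diseased: List[bool], threshold: int
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) when an image is called
    positive if its confidence rating is at or above the threshold."""
    tp = fn = tn = fp = 0
    for rating, has_disease in zip(ratings, diseased):
        called_positive = rating >= threshold
        if has_disease and called_positive:
            tp += 1
        elif has_disease:
            fn += 1
        elif called_positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical confidence ratings (1 = definitely normal, 5 = definitely
# abnormal) for ten images; the first five have the target disorder.
ratings = [5, 4, 3, 2, 4, 1, 2, 3, 1, 2]
diseased = [True] * 5 + [False] * 5

# The same ratings read with a lenient and a strict threshold.
lenient = sensitivity_specificity(ratings, diseased, threshold=2)
strict = sensitivity_specificity(ratings, diseased, threshold=4)
print("lenient observer:", lenient)
print("strict observer:", strict)
```

With identical underlying ratings, the lenient threshold gives higher sensitivity and lower specificity than the strict one, which is one way that who reads the images can shift estimates of accuracy.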

A recent systematic review of 55 published studies that focused on bias or variation in diagnostic accuracy studies found that only eight (15%) considered observer variation and, in seven of these, it affected estimates of accuracy [15]. Although the influence of the observer on test performance is frequently ignored in both primary and secondary research, the evidence suggests that the observer is an important effect modifier. We propose that the characteristics of the observers who interpret medical images should be considered when designing, conducting and reporting systematic reviews of imaging technologies.


    Reporting of information
 
Without complete and accurate reporting of observer characteristics in primary studies it is difficult to assess and explore their potential effect on review findings. We asked seven key questions on this subject and applied them to two systematic reviews recently completed by one of us (MW). The reviews were about the accuracy of imaging modalities to detect renal scarring subsequent to urinary tract infection in children and to assess lower limb peripheral arterial disease. These are very different reviews that address a wide variety of imaging modalities and are used to illustrate the need for complete and accurate reporting of information rather than to be representative of all reviews of imaging technologies. The example text describes what could have been reported in a primary study assessing the accuracy of ultrasound in the detection of renal scarring.

Example text
Two consultant radiologists with at least 5 years' experience and a special interest in uroradiology independently reported on the presence of renal scarring in a randomly allocated sample of 200 sonograms. No additional training was provided. Each radiologist also reported a random sample of 10% of the sonograms reported by the other radiologist to test for interobserver agreement. Using the Kappa statistic this was found to be 0.5.
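
As a hedged illustration of how a Kappa value such as the one in the example text could be obtained, the sketch below computes Cohen's kappa for two readers. The 20 double-read sonograms and their ratings are invented and were constructed so that the result is 0.5; the function name `cohens_kappa` is our own and does not come from any particular library.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(reader_a: Sequence[str], reader_b: Sequence[str]) -> float:
    """Cohen's kappa: agreement between two readers beyond chance."""
    if len(reader_a) != len(reader_b) or len(reader_a) == 0:
        raise ValueError("both readers must rate the same non-empty set")
    n = len(reader_a)
    # Proportion of images on which the two readers gave the same report.
    observed = sum(a == b for a, b in zip(reader_a, reader_b)) / n
    # Agreement expected by chance from each reader's marginal totals.
    counts_a = Counter(reader_a)
    counts_b = Counter(reader_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical double-read sample: 7 images both call "scar", 3 only
# reader A, 2 only reader B, and 8 both call "no scar".
reader_a = ["scar"] * 10 + ["no scar"] * 10
reader_b = ["scar"] * 7 + ["no scar"] * 3 + ["scar"] * 2 + ["no scar"] * 8

kappa = cohens_kappa(reader_a, reader_b)
print(f"kappa = {kappa:.2f}")
```

Here the readers agree on 15 of 20 images (75%), but half of that agreement would be expected by chance, so kappa is (0.75 - 0.50) / (1 - 0.50) = 0.5.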

Table 1 presents the answers to the seven questions. Four of the 136 studies (3%) described the allocation of images to the observers as "random" with no further details provided. Over half of the studies, i.e. 72 (53%), referred to the number of observers, which was generally one or two. Only 23 (17%) studies presented information on the experience of the observers and most of these lacked clarity, simply describing them as "experienced". Details about prior training of observers and their professional background were reported in 51 (38%) and 70 (51%) studies, respectively. Finally, 10 (7%) of the studies estimated observer variability and six (4%) explored the potential effect on test accuracy. The limited and variable reporting (ranging from 3% to 53% of the questions being addressed) of this information in primary studies is not surprising [11] but should improve with the progressive implementation of the STARD initiative [8]. We now discuss the potential to incorporate characteristics of observers into the assessment of the quality of diagnostic accuracy studies in systematic reviews.
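
The percentages quoted above follow directly from the study counts and the total of 136 studies. A short sketch of the arithmetic is given below; the counts are those stated in the text, while the dictionary keys are our own shorthand labels rather than the wording of the seven questions in Table 1.

```python
TOTAL_STUDIES = 136

# Number of studies reporting each item, as stated in the text.
reported = {
    "allocation of images": 4,
    "number of observers": 72,
    "experience of observers": 23,
    "prior training": 51,
    "professional background": 70,
    "observer variability estimated": 10,
    "effect on accuracy explored": 6,
}

# Percentage of studies, rounded half up to the nearest whole number.
percentages = {
    item: int(100 * count / TOTAL_STUDIES + 0.5)
    for item, count in reported.items()
}

for item, pct in percentages.items():
    print(f"{item}: {reported[item]}/{TOTAL_STUDIES} ({pct}%)")
```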



 
Table 1. Key questions to ask about the observer reading the images applied to two systematic reviews of diagnostic test accuracy with the response rates

 

    Assessing quality
 
The quality of primary studies is integral to ascertaining the best evidence to guide policies, to influence good practice and to direct future research. The assessment of quality can be used as a criterion for including studies in a review or its primary analyses, or to investigate the effect of different biases and sources of variation on results and transferability. In the context of who interprets medical images we can define the characteristics of the observer in terms of number, profession, experience (e.g. grade, years in profession, number of images read each year) and prior training. These characteristics, along with estimates of interobserver agreement, can modify estimates of diagnostic test performance and so their adequate reporting is important when assessing the quality of primary studies.

The number of observers interpreting images contributes to the generalizability of study findings. Evidence from a single, highly specialist observer is likely to have low external validity. In contrast, such an observer could help to produce the best estimates of test accuracy and so increase internal validity [16]. Professional background is also important. For the reporting of MRI images in 15 patients with suspected lumbar spine stenosis the average interobserver Kappa score for seven observers was 0.26; it was highest among radiologists (0.40), followed by neurosurgeons (0.21) and then orthopaedic surgeons (0.15) [17]. Variation in image interpretation by different professions could affect estimates of accuracy and their precision. This also applies to the experience of the observers in terms of years in their profession and grade, as well as prior training. In a recent study of 101 observers interpreting 50 digital chest radiographs for tuberculosis, the more experienced observers (years of work in a speciality) showed significantly greater agreement on the presence of abnormalities and cavities [18]. For primary studies, it is desirable to quantify interobserver agreement and to assess its effect on test accuracy estimates. This is particularly important when two or more tests are being compared as the difference between tests can be outweighed by observer variation [1]. In a study of 14 observers interpreting lung abnormalities in 80 patients the differences in receiver operating characteristic (ROC) curves for three image types (conventional, full size and minified storage phosphor) were insignificant. When the observers were stratified into two groups, seven chest radiologists and seven chest residents, significant differences were observed in the ROC curves for the three image types [19]. Differences between technologies were obscured by who read the images.


    Summary findings and heterogeneity
 
The quality of studies is important for data synthesis and for exploring sources of heterogeneity; if studies are not robust then conclusions from systematic reviews will be equally weak. Judging the quality of a study and incorporating this into data synthesis can be problematic [20]. The evidence-based QUADAS tool has simplified this for systematic reviews of diagnostic accuracy [9], but unlike earlier guidelines for the appraisal of diagnostic test studies it ignores the role of the observer [20, 21]. Knowledge of the subject area and findings of other research can guide how reviewers address this. Depending on the information presented in primary studies, quality could be assessed as (i) a single criterion as to whether or not sufficient detail about the observers was reported; (ii) several individual quality criteria about the characteristics of observers such as number, profession or training; or (iii) a composite score of quality. How could we use this information in the conduct of systematic reviews about the accuracy of diagnostic tests?

First, narrative synthesis could be performed including tabulation of study characteristics (both clinical, e.g. type of test, body area and setting, and methodological, e.g. validity of reference standard and masking) and results. This can be the most appropriate method for presenting and discussing findings in a clear and logical order. When it is valid to pool estimates of accuracy the robustness of results could be tested in a sensitivity analysis that excludes studies that do not provide details of the observers [22]. When information is available meta-analyses could be stratified with regard to whether single or multiple observers were used or whether observers received prior training or not. Covariates, or effect modifiers, such as the profession of observers, could be included in a meta-regression model to explore heterogeneity and to test its effect on accuracy [23]. Although the choice of covariates in a model should be specified a priori by reviewing the literature and consulting specialists, this does not eliminate the possibility of spurious findings. Meta-regression analyses are more exploratory than definitive and are discouraged unless a large number of studies are available [24].
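
The sensitivity analysis described above can be sketched in a few lines. This is a deliberately simplified illustration: it pools sensitivity only, using fixed-effect inverse-variance weighting on the logit scale, whereas reviews of diagnostic accuracy commonly use more elaborate models that treat sensitivity and specificity jointly. The study names, counts and the `observers_described` flag are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Study:
    name: str
    true_positives: int
    false_negatives: int
    observers_described: bool  # hypothetical quality criterion


def pooled_sensitivity(studies: Iterable[Study]) -> float:
    """Fixed-effect inverse-variance pooling of logit sensitivity."""
    weighted_sum = 0.0
    total_weight = 0.0
    for study in studies:
        # A 0.5 continuity correction avoids division by zero.
        tp = study.true_positives + 0.5
        fn = study.false_negatives + 0.5
        logit = math.log(tp / fn)
        weight = 1.0 / (1.0 / tp + 1.0 / fn)  # inverse of the variance
        weighted_sum += weight * logit
        total_weight += weight
    if total_weight == 0.0:
        raise ValueError("at least one study is required")
    pooled_logit = weighted_sum / total_weight
    return 1.0 / (1.0 + math.exp(-pooled_logit))


# Hypothetical primary studies (illustrative counts only).
studies = [
    Study("A", 45, 5, observers_described=True),
    Study("B", 40, 10, observers_described=True),
    Study("C", 30, 20, observers_described=False),
    Study("D", 28, 22, observers_described=False),
]

all_studies = pooled_sensitivity(studies)
described_only = pooled_sensitivity(
    [s for s in studies if s.observers_described]
)
print(f"all studies: {all_studies:.2f}")
print(f"excluding studies without observer details: {described_only:.2f}")
```

A marked difference between the two pooled estimates would suggest that the main result is not robust to the reporting of observer details. The same function could be applied within strata, for example single versus multiple observers, to sketch a stratified analysis.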


    Implications beyond accuracy
 
Establishing diagnostic accuracy is important in the evaluation of an imaging technology but is only an intermediate outcome in a complex chain of events [7]. Technologies can have similar estimates of accuracy, but differences in the type of false-negative and false-positive outcomes between technologies can have different implications for patient management and outcome. Even when one technology is more accurate than another this does not necessarily translate into benefits to clinicians' decision making or patients' health. The observer who interprets the images can contribute to the validity of the estimates of accuracy, the confidence with which the referring clinician diagnoses and manages a patient, and the effect on patient outcome [7]. Thus when systematic reviews of randomised trials of a diagnostic test are conducted it is prudent to consider observer characteristics when synthesizing and interpreting study results. The potential cost implications of ignoring variability between observers from different professions could be considerable and could be explored in diagnostic decision-analytic models.


    Conclusions
 
Differences between observers in image interpretation can obscure true differences in performance between technologies [1]. Despite this, few primary studies report information on observer characteristics or assess observer variability let alone consider the effect on test accuracy [13]. This lack of detail makes it difficult to assess and explore the potential effect on review findings. We recommend when designing and conducting systematic reviews of imaging technologies that, in addition to the quality criteria listed in QUADAS [9], the characteristics of observers and formal assessment of interobserver variation are considered and incorporated into study appraisal and data synthesis. Establishing the effect of the role of the observer on estimates of accuracy and explaining heterogeneity is important for informing the delivery of expensive imaging technologies and the continuing debate about who should read the images [25].


    Acknowledgments
 
We are grateful for comments on a draft manuscript from Professor Martin Bland, Professor David Manning, Dr David King and the anonymous reviewers.

Received for publication November 7, 2006. Revision received December 15, 2006. Accepted for publication January 12, 2007.


    References
 

  1. Robinson PJA. Radiology's Achilles' heel: error and variation in the interpretation of the Röntgen image. Br J Radiol 1997;70:1085–98.
  2. Anon. Lancet (26 Jan) 1901;i:251.
  3. Garland LH. On the scientific evaluation of diagnostic procedures. Radiology 1949;52:309–28.
  4. Yerushalmy J, Harkness JT, Kennedy BR. The role of dual reading in mass radiography. Am Rev Tuberculosis 1950;61:443–64.
  5. Department of Health. Hospital activity statistics: imaging and radiodiagnostics files. Available from: http://www.performance.doh.gov.uk/hospitalactivity/data_requests/imaging_and_radiodiagnostics.htm (accessed 7 February 2006).
  6. Barrett A. Waiting times for scans to decrease, vows Department of Health. Br Med J 2005;331:256.
  7. Brealey S. Measuring the effects of image interpretation: an evaluative framework. Clin Radiol 2001;56:341–7.
  8. Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig LM, et al. The STARD statement for reporting studies of diagnostic accuracy: explanation and elaboration. Clin Chem 2003;49:7–18.
  9. Whiting P, Westwood M, Rutjes AWS, Reitsma JB, Bossuyt PMM, Kleijnen J. Evaluation of QUADAS, a tool for the quality assessment of diagnostic accuracy studies. BMC Med Res Methodol 2006;6:9.
  10. Sackett DL, Haynes RB. Evidence base of clinical diagnosis: the architecture of diagnostic research. Br Med J 2002;324:539–41.
  11. Reid MC, Lachs MS, Feinstein AR. Use of methodological standards in diagnostic test research. JAMA 1995;274:645–51.
  12. Lijmer JG, Mol BW, Heisterkamp S, Bonsel GJ, Prins MH, van der Meulen JH, et al. Empirical evidence of design-related bias in studies of diagnostic tests. JAMA 1999;282:1061–6.
  13. Irwig L, Bossuyt P, Glasziou P, Gatsonis C, Lijmer J. Evidence base of clinical diagnosis: designing studies to ensure that estimates of test accuracy are transferable. Br Med J 2002;324:669–71.
  14. Robinson PJA, Fletcher JM. Clinical coding in radiology. Imaging 1994;6:133–42.
  15. Whiting P, Rutjes AWS, Reitsma JB, Glas AS, Bossuyt PM, Kleijnen J. Sources of variation and bias in studies of diagnostic accuracy: a systematic review. Ann Intern Med 2004;140:189–202.
  16. Brealey S, Scally AJ. Bias in plain film reading performance studies. Br J Radiol 2001;74:307–16.
  17. Speciale AC, Pietrobon R, Urban CW, Richardson WJ, Helms CA, Major N, et al. Observer variability in assessing lumbar spinal stenosis severity on magnetic resonance imaging and its relation to cross-sectional spinal canal area. Spine 2002;27:1082–6.
  18. Balabanova Y, Coker R, Fedorin I, Zakharova S, Plavinskij S, Krukov N, et al. Variability in interpretation of chest radiographs among Russian clinicians and implications for screening programmes: observational study. Br Med J 2005;331:379–82.
  19. Kido S, Ikezoe J, Takeuchi N, Kondoh H, Tomiyama M, Jokoh T, et al. Interpretation of subtle interstitial lung abnormalities: conventional versus storage phosphor radiography. Radiology 1993;187:527–33.
  20. Greenhalgh T. How to read a paper: papers that report diagnostic or screening tests. Br Med J 1997;315:540–3.
  21. Sackett DL, Haynes RB, Guyatt GH, Tugwell P. Clinical epidemiology: a basic science for clinical medicine. London: Little, Brown, 1991: 51–68.
  22. Juni P, Witschi A, Bloch R, Egger M. The hazards of scoring the quality of clinical trials for meta-analysis. JAMA 1999;282:1054–60.
  23. Deeks J. Systematic reviews of evaluations of diagnostic and screening tests. In: Egger M, Smith GD, Altman DG, editors. Systematic reviews in health care: meta-analysis in context. London, UK: BMJ Publishing Group, 2001.
  24. Higgins J, Thompson S, Deeks J, Altman D. Statistical heterogeneity in systematic reviews of clinical trials: a critical appraisal of guidelines and practice. J Health Serv Res Policy 2002;7:51–61.
  25. Brealey S, Scally A, Hahn S, Thomas N, Godfrey C, Crane S. Accuracy of radiographers red dot or triage of accident and emergency radiographs in clinical practice: a systematic review. Clin Radiol 2006;61:604–15.



