Review article |
1 Department of Health Sciences & Clinical Evaluation, University of York, York YO1 5DD 2 Division of Radiography, University of Bradford, Bradford BD5 0BB, UK
| Abstract |
|---|
| Introduction |
|---|
Conducting a systematic review involves locating, appraising and synthesizing evidence from scientific studies to provide empirical answers to well defined research questions [10]. This requires adherence to strict scientific design to ensure the review is both comprehensive and minimizes bias, thus providing reliable results from which decisions about the delivery of healthcare are made [11, 12]. It is most straightforward to synthesize results from well planned and well executed randomized controlled trials (RCTs), as this study design is least subject to bias and there are statistical models available for pooling estimates of effect [13]. In some areas of medicine and healthcare there are very few, if any, RCTs despite an extensive literature of research data [14]. A large scale RCT that compares the film reading performance of two different professional groups would protect against some of the biases. However, such studies are expensive and may not be amenable to a rapidly evolving political climate. It is therefore important to be aware of the weaknesses and strengths of alternative study designs.
There is also a conceptual hurdle to overcome when evaluating film reading performance: how does one relate patient outcome to the reports made by different professionals when other factors such as therapy are involved? This difficulty was resolved in the context of imaging technologies by applying a hierarchical framework first suggested by Fineberg et al [15] and subsequently extended by the Institute of Medicine [16]. The categories in the framework are: technical capability; diagnostic accuracy; diagnostic impact; therapeutic impact; and impact on health [17]. Film reading performance studies are comparable with the "diagnostic accuracy" category. Studies could also measure changes in referring clinicians' diagnosis (diagnostic impact), management plans (therapeutic impact) or patient health status (impact on health) following reports made by different professionals. However, the subject of this article is those studies that assess only the plain film reading performance of healthcare professionals (i.e. observers). Such studies involve selecting a sample of observers from the same or different professions to interpret a sample of films. Then, a healthcare professional (i.e. arbiter) judges whether the reports made by the observers are concordant with a reference standard (consultant radiologist). This allows the investigator to assess how accurately the observers can interpret the films. Table 1 illustrates the classification of such studies, including clinical examples [18].
| Biases due to patient (or film) selection |
|---|
Film cohort bias
Following referral, the criteria used to establish which films are eligible for inclusion will also affect the characteristics of the sample selected. Whether spectrum bias or population bias arises depends on whether the purpose of the study is to assess the potential for radiographers to report in clinical practice or is more pragmatic in design.
Spectrum bias
If the focus is to assess the potential of radiographers to report, for instance students on an "image interpretation" course reporting a validated bank of radiographic examinations [20], a limited range of disease type, severity, duration or clinical demographics in the film sample can considerably bias performance [21-23]. Prevention of this bias is more concerned with internal validity, as the aim is to assess radiographers' performance when reporting an unrepresentative but more difficult batch of films. Therefore, stratifying the sample in favour of a higher prevalence and variety of pathology is desirable.
Population bias
In contrast, the emphasis may be on external validity if the aim is to assess radiographers' A&E film reading performance in clinical practice [24, 25]. A representative case mix is required if the results are to be generalized to how well radiographers can interpret all A&E films. Ideally, we wish to describe radiographers' ability to diagnose the disease status of the patient as positive or negative. However, in the absence of an incontrovertible standard, radiographer performance is often compared with a reference standard such as the diagnosis of a consultant radiologist who would initially have a greater clinical knowledge and experience. We are therefore evaluating radiographers' ability to predict the radiologist's diagnosis, which may be wrong, rather than the patient's true disease status. Subsequently, when sensitivity and specificity are calculated, they are related to the prevalence of the abnormality [26]. As the reference standard is not always correct, it is prudent, if possible, to evaluate radiographers' performance using a random sample of films from clinical practice to reflect the same prevalence of disease [27].
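The dependence on prevalence described above can be illustrated with a small calculation. The sketch below is not taken from the article: the function name and all numerical values are hypothetical, and it assumes that the observer's and the reference standard's errors are independent once the patient's true disease status is fixed. Under those assumptions it computes the sensitivity and specificity that would be measured against an imperfect reference standard.

```python
def apparent_accuracy(prevalence, obs_sens, obs_spec, ref_sens, ref_spec):
    """Apparent (sensitivity, specificity) of an observer judged against
    an imperfect reference standard.

    Assumption (not from the article): observer and reference errors are
    independent given the patient's true disease status.
    """
    p, q = prevalence, 1.0 - prevalence
    # Probability that observer and reference both report abnormal.
    both_pos = p * obs_sens * ref_sens + q * (1 - obs_spec) * (1 - ref_spec)
    # Probability that the reference reports abnormal.
    ref_pos = p * ref_sens + q * (1 - ref_spec)
    # Probability that observer and reference both report normal.
    both_neg = p * (1 - obs_sens) * (1 - ref_sens) + q * obs_spec * ref_spec
    # Probability that the reference reports normal.
    ref_neg = p * (1 - ref_sens) + q * ref_spec
    return both_pos / ref_pos, both_neg / ref_neg


if __name__ == "__main__":
    # Hypothetical true accuracies: observer 0.90/0.95, reference 0.95/0.98.
    for prev in (0.05, 0.20, 0.50):
        sens, spec = apparent_accuracy(prev, 0.90, 0.95, 0.95, 0.98)
        print(f"prevalence={prev:.2f} apparent sens={sens:.3f} spec={spec:.3f}")
```

With a perfect reference standard the apparent values equal the observer's true values at any prevalence; with an imperfect one they shift as prevalence changes, which is one way of seeing why a film sample with the same prevalence as clinical practice is advisable.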
Film selection bias
This bias occurs if the radiographers do not interpret all the films eligible for inclusion in the study and/or they have the opportunity to choose which eligible films they want to interpret [28, 29]. Thus, if radiographers' A&E film reading performance was assessed and they were asked to report films when time permitted, they may only interpret films they are confident to report. By not reporting the more difficult cases, this could inflate their indices of performance.
| Biases due to observer selection |
|---|
Observer cohort bias
This would occur if an inappropriate selection of radiographers was included in a study; what counts as appropriate depends heavily on the research question being asked. To estimate the influence on external validity, it is important to know the number of radiographers and the level of their experience or training.
Observer cohort comparator bias
This bias can occur when two or more groups of observers are compared without the appropriate use of matching; again, this is dependent on the research question. For example, a study might apply the principles of a controlled trial to demonstrate the effectiveness of a training programme (intervention) by comparing a study group (radiographers who have received training) and a control group (radiographers without training) [20]. The two groups of radiographers should be matched for certain characteristics such as number of years' experience in the profession and/or in a relevant specialty. It is important to ensure comparability between the two groups so that differences in performance can be attributed to the training programme rather than differences between radiographers.
| Biases associated with the application of the reference standard |
|---|
Work-up bias
There are contradicting opinions as to whether there is any difference between verification bias and work-up bias [22, 26]. Work-up bias is defined here as a specific type of verification bias. It occurs when not all films receive confirmation with the reference standard owing to the report of the observer under evaluation.
Work-up bias would occur if an investigator compared the A&E film reading performance of radiographers and casualty officers when reporting the same films and assumed that if the pair of reports agree there was no need to apply the reference standard [24, 25]. If the standard was applied, it may be discordant with the pairs of reports. The statistical consequence is an overestimation of the two professions' film reading performance, as omission of the standard denies the potential for identifying other false negatives and/or false positives. A further example would be if only radiographers' A&E film reading performance was evaluated and the investigator assumed that if a report was normal it was unnecessary to apply the reference standard. If only abnormal reports receive verification with a standard, this will artificially inflate sensitivity by underestimating the number of false negatives [30]. This bias is exacerbated if the reference standard knows that the reason they are reporting the films is because there was discordance between a radiographer and casualty officer, or in the case of radiographers alone that it was reported as being abnormal.
The problem of verification and work-up bias may be avoided if the observers' reports are not known before the reference standard is applied. It is also possible in certain circumstances to correct the results obtained if data are available on a stratified random sample of negative films as well as the clinical details of each patient [30-32].
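A worked example may help to show how partial verification distorts the indices of performance. The sketch below uses hypothetical counts and expected values rather than real data. It assumes that every abnormal report is verified but only a random fraction of normal reports is, and it then applies a simple inverse-weighting correction. This correction is an illustrative technique chosen here and is not necessarily the method described in the cited references.

```python
def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


# Hypothetical counts if every film had received the reference standard.
FULL = {"tp": 80, "fn": 20, "fp": 30, "tn": 870}


def verified_counts(full, negative_fraction):
    """Expected counts when all abnormal reports, but only a random
    fraction of normal reports, are checked against the reference."""
    return {
        "tp": full["tp"],
        "fp": full["fp"],
        "fn": full["fn"] * negative_fraction,
        "tn": full["tn"] * negative_fraction,
    }


def corrected_counts(verified, negative_fraction):
    """Scale the verified normal reports back up by 1 / sampling fraction
    (a simple inverse-weighting correction, assumed here for illustration)."""
    weight = 1.0 / negative_fraction
    return {
        "tp": verified["tp"],
        "fp": verified["fp"],
        "fn": verified["fn"] * weight,
        "tn": verified["tn"] * weight,
    }


if __name__ == "__main__":
    seen = verified_counts(FULL, 0.1)
    fixed = corrected_counts(seen, 0.1)
    for label, c in (("full", FULL), ("verified only", seen), ("corrected", fixed)):
        print(label,
              round(sensitivity(c["tp"], c["fn"]), 3),
              round(specificity(c["tn"], c["fp"]), 3))
```

In this example the uncorrected sensitivity rises well above the value obtained with full verification because most false negatives are never found, and weighting the verified sample of normal reports recovers the full-verification figures.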
Incorporation bias
This occurs if the observer under evaluation is incorporated into the process of generating the reference standard or is used as the reference standard. For example, a study may assess the film reading performance of a group of radiographers vs radiologists of varying seniority. Incorporation bias exists if a radiologist's report within a cohort was used to generate the reference standard, for example a double blind radiologist's report. This will artificially inflate the performance of the radiologists, as there will be confounding of the radiologist's report within the cohort and the reference standard report.
| Biases due to measurement of results |
|---|
Withdrawal bias
Any non-random exclusion of films that have been judged eligible for inclusion in a study will bias the results. Furthermore, if films are excluded prior to receiving the reference standard it will introduce work-up bias or verification bias depending on the reason for exclusion. The following describes two kinds of withdrawal bias.
Indeterminate results
Failure to include indeterminate (equivocal) film interpretations in the analysis may result in a biased assessment of radiographers' performance [26]. Their inclusion is valuable both for generalization of results and for economic assessment. For example, if radiographers ask for repeat examinations but radiologists interpret films correctly without the need for repeats, reporting by radiologists will save direct healthcare costs. If indeterminate results are included but regarded as negative, specificity is artificially increased and sensitivity is decreased, and the reverse is true if they are classified as positive [23]. For these reasons, it is important that studies record the frequency of equivocal reports and the way these results are used in the calculation of radiographers' performance.
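The effect of the handling of equivocal reports can be shown with a short calculation. The counts below are hypothetical and the function name is introduced only for illustration; the sketch recomputes sensitivity and specificity with equivocal reports treated as negative, treated as positive, or excluded.

```python
def accuracy_with_equivocals(counts, treat_equivocal_as):
    """Return (sensitivity, specificity).

    counts maps (report, reference) to a number of films, where report is
    'pos', 'neg' or 'equiv' and reference is 'pos' or 'neg'.
    treat_equivocal_as is 'pos', 'neg' or 'exclude'.
    """
    if treat_equivocal_as not in ("pos", "neg", "exclude"):
        raise ValueError("treat_equivocal_as must be 'pos', 'neg' or 'exclude'")
    tally = {("pos", "pos"): 0, ("neg", "pos"): 0,
             ("pos", "neg"): 0, ("neg", "neg"): 0}
    for (report, reference), n in counts.items():
        if report == "equiv":
            if treat_equivocal_as == "exclude":
                continue
            report = treat_equivocal_as  # reclassify the equivocal report
        tally[(report, reference)] += n
    sens = tally[("pos", "pos")] / (tally[("pos", "pos")] + tally[("neg", "pos")])
    spec = tally[("neg", "neg")] / (tally[("neg", "neg")] + tally[("pos", "neg")])
    return sens, spec


# Hypothetical film counts, keyed by (radiographer report, reference standard).
COUNTS = {
    ("pos", "pos"): 70, ("equiv", "pos"): 10, ("neg", "pos"): 20,
    ("pos", "neg"): 15, ("equiv", "neg"): 25, ("neg", "neg"): 160,
}

if __name__ == "__main__":
    for rule in ("neg", "pos", "exclude"):
        sens, spec = accuracy_with_equivocals(COUNTS, rule)
        print(f"equivocal as {rule}: sens={sens:.3f} spec={spec:.3f}")
```

The same set of reports yields noticeably different indices depending on the rule applied, which is why the rule and the number of equivocal reports need to be stated.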
Loss to follow-up
The films reported by radiographers may be lost and the reference standard cannot therefore be applied. If this is systematic, it may distort the performance results of radiographers.
Observer variability
Reliability is a major consideration in studies involving judgement [33], not only in image interpretation but also when comparing the report under evaluation with the standard. The ability of observers to produce reliable reports is also reflected in their ability to accurately interpret films. For example, a low level of reproducibility is incompatible with a high level of accuracy [34]. Observer variation in plain film reporting is pervasive. A recent study examined the interobserver variation between three experienced radiologists with the three major types of plain film examination: abdominal, chest and skeletal. Concordance between all three was found in only 51%, 61% and 74% of radiographs, respectively [35]. Interobserver variation is usually greater than intraobserver variation and is measured using the Kappa statistic [36]. The following variabilities can be estimated.
Arbiter variability
Even when explicit and objective decision-making criteria are available for comparing observers' reports with the reference standard, it is important to assess whether they can be applied consistently by different people (arbiters) on the same occasion or by the same person on different occasions. Variation in the decisions made by an arbiter can affect the indices of performance calculated. The application and interpretation of the criteria used to measure radiographers' performance therefore influence the reliability of study results. The following variabilities can also be estimated using the Kappa statistic.
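As an illustration of the Kappa statistic referred to in this section, the sketch below computes the unweighted (Cohen's) form for two sets of judgements on the same films. The choice of the unweighted two-rater form, the function name and the example judgements are all assumptions made for illustration; a study with more than two raters or with ordered categories may call for a different variant.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters judging the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must judge the same non-empty set of items")
    n = len(ratings_a)
    # Proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance from each rater's marginal frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical judgements on 20 films ("A" = abnormal, "N" = normal).
RATER_1 = ["A"] * 10 + ["N"] * 10
RATER_2 = ["A"] * 8 + ["N"] * 2 + ["A"] * 1 + ["N"] * 9

if __name__ == "__main__":
    print(round(cohen_kappa(RATER_1, RATER_2), 3))
```

The same function can be applied to two arbiters, or to one arbiter on two occasions, by passing their concordant or discordant decisions as the two lists.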
| Independence of interpretation biases |
|---|
Assessments that involve clinical judgement are also susceptible to bias owing to prior expectation [38]. The arbiter should therefore be blind to who is responsible for the reports because preconceived ideas can affect their decision as to whether two reports are concordant. The following terms have been coined specifically for plain film reading performance studies.
Observer review bias
This occurs if the radiographers are aware of the reference standard report when interpreting films and can be avoided by blinding the radiographers to this report. If the reference standard used is clinical follow-up, as long as the study is prospective the results of the definitive diagnosis must be unknown at the time of interpretation by the radiographers. This source of bias can lead to falsely elevated indices of performance.
Reference standard review bias
This is the opposite of observer review bias. It occurs if radiographers' reports are known when the films are interpreted by the reference standard [24]. If the reference standard was a consultant radiologist, they must be blind to the radiographers' reports. This source of bias could falsely elevate or even deflate radiographers' indices of performance, depending on how this knowledge affects the reference standard report.
Observer bias
This bias is present if individual radiographers in a study do not report films independent of each other. They should therefore be blind to other reports. If the study is performed during clinical practice and it is normal for radiographers to communicate with colleagues, then this bias is not applicable.
Observer comparator bias
If an explicit attempt is made to compare the performance of individual radiographers, each should report on the same films or the films should be randomly allocated so the radiographers report on a comparable sample. Differences between radiographers can then be attributed to differences in individual performances rather than the case mix of films.
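Random allocation of films to radiographers can be sketched as follows. The function name, the film identifiers and the use of a fixed seed for reproducibility are hypothetical choices made for illustration; the sketch shuffles the films and deals them out in turn so that each radiographer receives a sample of near-equal size.

```python
import random


def allocate_films(film_ids, radiographers, seed=None):
    """Randomly allocate each film to exactly one radiographer.

    Films are shuffled and then dealt out in turn, so group sizes differ
    by at most one. A seed can be supplied to make the allocation
    reproducible for audit.
    """
    if not radiographers:
        raise ValueError("at least one radiographer is required")
    rng = random.Random(seed)
    shuffled = list(film_ids)
    rng.shuffle(shuffled)
    allocation = {name: [] for name in radiographers}
    for index, film in enumerate(shuffled):
        allocation[radiographers[index % len(radiographers)]].append(film)
    return allocation


if __name__ == "__main__":
    films = [f"film-{i:03d}" for i in range(1, 11)]
    for name, batch in allocate_films(films, ["R1", "R2", "R3"], seed=1).items():
        print(name, batch)
```

Simple randomization of this kind makes the case mix comparable on average only; with small samples, stratifying the films by relevant characteristics before allocation may be preferable.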
Co-image bias
This occurs if additional images are available to a cohort of observers other than the images they are being asked to interpret. If radiographers' plain film reading performance is being assessed, they should not have access to images from other modalities that could assist their interpretation of the plain films. However, the availability of previous plain films would be permissible if the aim was to simulate clinical practice, particularly as there is evidence that this improves accuracy [39].
Arbiter review bias
The severity of this bias depends on whether (a) the arbiter is also an observer under evaluation or (b) the arbiter was the reference standard, with the former having a greater potential effect on study results. If radiographers' performance was being evaluated, the same radiographers should not be involved in the process of deciding whether their reports concord with the reference standard. They may be too critical about their own reports or alternatively they may not be critical enough. Neither should the reference standard be the arbiter, as they are responsible for one of the reports. This could bias their decision as to whether pairs of reports agree.
Arbiter bias
This occurs if the arbiter was aware of whether the report was made by a radiographer or the reference standard. The arbiter does not need to know which report is the reference standard or the radiographer's and therefore should be blind to which report was made by whom.
Film access bias
This bias is present if the arbiter has access to the sample of films whilst judging whether the reports being compared are concordant. The arbiter's interpretation of the films can incorrectly influence the decision as to whether the reports agree. Furthermore, the arbiter's judgement could be affected by an incorrect report when viewing the films.
Clinical review bias
Several authors have demonstrated improved performance when clinical data are available [40-42]. Others have found clinical details to be unhelpful in lesion detection [43, 44]. In routine clinical practice, knowledge of patients' age, sex and symptoms is required to ensure the most appropriate procedure is carried out and to avoid time and effort searching for findings that would be irrelevant in the clinical context. These requirements heavily outweigh any potential advantage of eliminating bias by withholding relevant clinical data [34]. However, if an unblinded study is undertaken, it is prudent to account for the possible influence of other factors and covariates [14].
Cohort comparator bias
This bias is present when a study assesses the performance of two groups who do not interpret films independently. If radiographers' A&E film reading performance was compared with casualty officers, both groups should be blind to each other's reports. Furthermore, the radiographers and casualty officers should report on the same or a comparable batch of films so that any difference in performance is attributed to differences between cohorts rather than differences in the case mix of films.
Co-image comparator bias
This bias would occur if the plain film reading performance of radiographers is compared with radiologists and the latter have access to images from other modalities such as CT.
Arbiter comparator bias
This bias is present if two or more groups are compared and the arbiter is aware of which group made the reports. If the arbiter has a preconception that radiologists should perform better than radiographers, it may systematically influence their judgement and subsequently distort the indices of performance.
| Further biases in film reading studies |
|---|
If an assessment of interpreting, for example, CT head examinations was performed, then additional biases, possibly in the selection of images, may be pertinent. A neurological centre can refer rare or problem cases with a different prevalence and type of disease to that of a district hospital (centripetal bias). Experts may preferentially include and keep track of challenging or interesting cases (popularity bias) and differences in financial and geographical access to CT may affect the study group (diagnostic access bias) [45].
When interpreting barium enema examinations, observers' performance is influenced largely by what is seen fluoroscopically and reports would be biased by the images presented for reporting. In a study where the influence of two factors is being controlled, i.e. who performs the examination and who interprets the images, a factorial design would be appropriate. Random allocation, stratified perhaps by certain clinical details, would ensure that a comparable case mix of patients was included in each arm of the trial.
| Conclusion |
|---|
It is also important to note that even if all biases are eliminated, inadequate attention to other methodological factors will limit the value of a study. If too small a sample of films is selected, this will produce imprecise estimates of indices of performance. Therefore, an RCT designed to compare radiographers' and radiologists' plain film reading performance should calculate an appropriate sample size to detect as statistically significant a real difference of a given magnitude between the two groups [27]. Confidence intervals should also be constructed to illustrate the range of values that one can be confident includes the true film reading performance of radiographers and radiologists [27]. A detailed analysis of the specific methodological factors that influence the quality of a plain film reading performance study will be the subject of a further paper.
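The two calculations mentioned above can be sketched briefly. The article does not specify which methods should be used, so the choices here are assumptions: a Wilson score interval for a single proportion such as sensitivity, and the usual normal-approximation formula for comparing two independent proportions. The function names and example figures are hypothetical, and a paired design in which both groups read the same films would need a different calculation.

```python
from math import ceil, sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score confidence interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def films_per_group(p1, p2, alpha=0.05, power=0.80):
    """Films needed in each of two independent groups to detect a
    difference between proportions p1 and p2 (two-sided test,
    normal approximation)."""
    if p1 == p2:
        raise ValueError("p1 and p2 must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


if __name__ == "__main__":
    # Hypothetical: 90 of 100 abnormal films correctly reported.
    low, high = wilson_interval(90, 100)
    print(f"sensitivity 0.90, 95% CI {low:.3f} to {high:.3f}")
    # Hypothetical: accuracy of 0.90 in one group versus 0.95 in the other.
    print(films_per_group(0.90, 0.95), "films per group")
```

The smaller the difference to be detected, the larger the number of films required, which is why the size of a clinically important difference should be decided before the study begins.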
Observer variation is substantial and image interpretation, whether of plain films or not, is considered the weakest area of clinical imaging [34]. It is therefore imperative that these biases are understood and consideration is given to avoiding or minimizing them when designing studies to assess different observers' competence to interpret plain films or any other image. It is also important to be aware of these biases when systematically appraising such studies, as their presence will compromise the quality of research. Improving awareness of these biases should underpin the validity of the evidence base used to guide policies, to influence good practice or to direct research.
| Acknowledgments |
|---|
Received for publication August 2, 2000. Revision received November 29, 2000. Accepted for publication January 24, 2001.
| References |
|---|