Review article |
1Department of Health Sciences, University of York, York YO1 5DD, 2Division of Radiography, University of Bradford, Bradford BD5 0BB and 3X-ray Department A, North Manchester General Hospital, Manchester M8 5RB, UK
| Abstract |
|---|
| Introduction |
|---|
When assessing radiographers' film reading performance it is not always possible, or necessary, to conduct a randomized controlled trial. It is therefore important to be aware of the main threats to study validity from the alternative designs encountered. Furthermore, there is a conceptual hurdle to overcome: how does one relate patient outcome to the reports made by different professionals when other factors, such as therapy, are involved? In the context of imaging technologies, this difficulty was resolved by applying a hierarchical framework first suggested by Fineberg et al [4] and subsequently extended by the Institute of Medicine [5]. The categories involved are technical capability, diagnostic accuracy, diagnostic impact, therapeutic impact and impact on health [6].
Film reading performance studies are comparable with the "diagnostic accuracy" category [7]. These studies involve observers, e.g. radiographers, interpreting a sample of films. An arbiter, i.e. a health care professional, then judges whether the reports made by the radiographers are concordant with a reference standard, e.g. the report of a consultant radiologist. The resulting data are then used to calculate performance statistics such as sensitivity and specificity. However, the environment in which a radiographer is assessed, i.e. controlled conditions or clinical practice, can affect, for example, film selection, choice of reference standard and method of analysis [7].
Table 1 describes three different types of plain-film reading performance studies. Those that assess observers reporting in controlled conditions, such as radiographers under examination conditions, are called diagnostic accuracy studies. In such studies a mix of normal and abnormal films is carefully selected, with abnormalities covering a range of pathology, body areas and degrees of conspicuity. A robust reference standard, such as a double/triple blind consultant radiological report, is developed against which the observers' reports are compared to ensure they interpret films to a high level of accuracy before reporting in clinical practice. Diagnostic performance studies monitor the progress of one group of observers interpreting a consecutive series of films during clinical practice compared, perhaps, with a single consultant radiologist as the reference standard. Studies that compare radiographers and other professional groups, such as casualty officers, against a reference standard are called diagnostic outcome studies because we assume that the group with the highest reporting accuracy will contribute more to improving patient outcome.
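As an illustration of how sensitivity and specificity are derived once an arbiter has compared each report with the reference standard, the following minimal sketch may help. The function name and the counts are hypothetical and are not taken from any study in this review.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from the four cells of a 2x2 table.

    tp: abnormal films correctly reported as abnormal
    fn: abnormal films reported as normal
    tn: normal films correctly reported as normal
    fp: normal films reported as abnormal
    """
    sensitivity = tp / (tp + fn)  # proportion of abnormal films detected
    specificity = tn / (tn + fp)  # proportion of normal films cleared
    return sensitivity, specificity


# Hypothetical counts: 50 abnormal and 100 normal films.
sens, spec = sensitivity_specificity(tp=45, fn=5, tn=90, fp=10)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Both statistics are conditional on the reference standard, so any error in that standard carries through directly to the estimates.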
| Data sources |
|---|
The journals and supplements that were hand-searched included the British Journal of Radiology, Clinical Radiology, Radiography Today/Synergy and Radiography. With the exception of Radiography, for which issues from 1995 onwards were searched, journal searches covered issues between 1990 and the end of June 1999, as this was when the debate accelerated with the introduction of the NHS and Community Care Act [1]. The Royal College of Radiologists, the Society of Radiographers and the College of Radiographers were also contacted to identify studies, as were members of the Special Interest Group in Radiographic Reporting and university centres that provide postgraduate courses to train radiographers in film interpretation. The reference lists of all articles identified were scanned for further studies.
| Study selection |
|---|
Searching electronic sources yielded 695 studies, of which 20 assessed radiographers' film reading performance. However, only seven of these studies were eligible. Of the 13 studies excluded, 1 was a visual search strategy study and the other 12 were the same studies identified in different databases. A total of 30 studies were judged eligible from all data sources.
To minimize "reviewer bias", studies from the electronic databases were selected independently, i.e. blindly, by two reviewers (SB and AS). Perfect agreement was found in the application of the selection criteria. It was therefore judged acceptable for only SB to apply these criteria to studies located from other data sources.
| Data extraction |
|---|
| Data synthesis |
|---|
The standard criteria used were (a) a study that measured the performance of a single group of radiographers had to calculate a sample size according to how precise an estimate of sensitivity and specificity was needed and (b) a study comparing groups had to use a power calculation to determine the sample size required to detect clinically important effects as statistically significant.
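Criterion (a) can be sketched with the usual normal-approximation formula for the precision of a proportion, n = z²p(1 − p)/d², scaled by the expected proportion of abnormal (or normal) films in the sample. This is one common approach and is an assumption here, not necessarily the method used by any study in the review; the function name and the input values are hypothetical.

```python
import math
from statistics import NormalDist


def films_needed(expected, half_width, proportion, confidence=0.95):
    """Total films needed to estimate a proportion to within +/- half_width.

    expected:   anticipated sensitivity (or specificity)
    half_width: desired half-width of the confidence interval
    proportion: expected fraction of films that are abnormal (for
                sensitivity) or normal (for specificity)
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Number of abnormal (or normal) films required for the stated precision.
    n_relevant = z ** 2 * expected * (1 - expected) / half_width ** 2
    # Scale up to the total sample, since only a fraction of films is relevant.
    return math.ceil(n_relevant / proportion)


# Hypothetical inputs: 90% expected accuracy, +/- 5% precision, 20% abnormal films.
films_for_sensitivity = films_needed(0.90, 0.05, proportion=0.20)
films_for_specificity = films_needed(0.90, 0.05, proportion=0.80)
print(films_for_sensitivity, films_for_specificity)
```

Under these assumed inputs the sensitivity requirement dominates, because abnormal films are the minority of the sample. Criterion (b), a power calculation for comparing groups, would need a separate formula and is not shown.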
Table 3 shows that none of the 11 diagnostic accuracy studies calculated the necessary sample size. Six studies used a sample of fewer than 100 films. Only one diagnostic performance study calculated the necessary sample size. However, the number of films included in these studies ranged from several hundred to several thousand. No diagnostic outcome study performed a sample size calculation, although each study used a sample of several hundred films. Thus, Table 4 shows that only 1 (3%) of the radiographer film reading performance studies satisfied this standard.
Study design
Standard 2: was a normal/abnormal report adequately defined?
The definition of what is "normal" or "abnormal" influences both sensitivity and specificity. These individual measures are therefore subjective in that an investigator can define what they are, although a pair of sensitivity/specificity values is not subjective [8]. If studies use different criteria to define positive and negative reports, this will affect the validity of comparisons between them, because differences in performance statistics may be entirely due to variation in these definitions. Consequently, any published set of estimates is of limited value unless what constituted a normal or abnormal report is adequately defined.
Only one standard criterion was used: whether the definition of normal or abnormal was acceptable depended on the context of the study.
Apart from one diagnostic performance study, all studies adequately described what was normal and abnormal. Indeed, Table 4 shows that this standard was met in 28 (97%) studies.
Standard 3: was the performance of the observers placed in the context of the diagnostic sequence?
For studies conducted in clinical practice it is important to describe the process, or diagnostic sequence, through which films pass. The point at which radiographers interpret the films will affect the sample in terms of prevalence and severity of disease. This will affect the predictive values and possibly sensitivity and specificity. Without this information readers will not be able to apply the results of the study to the system employed in their hospital.
Standard criterion: the process through which the films passed before interpretation must be explicitly described. This standard was not applicable to studies performed outside of clinical practice, i.e. diagnostic accuracy studies.
Again, except for one diagnostic performance study, the applicable studies adequately described the context in which observers reported; the standard was met in 17 (94%) of them.
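The dependence of predictive values on prevalence follows from Bayes' theorem, as the following minimal sketch suggests. The sensitivity, specificity and prevalence values are hypothetical and are chosen only to show the direction of the effect.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a given prevalence of abnormal films."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)  # P(abnormal | reported abnormal)
    npv = true_neg / (true_neg + false_neg)  # P(normal | reported normal)
    return ppv, npv


# Same hypothetical observer (90% sensitivity and specificity) at two points
# in the diagnostic sequence with different proportions of abnormal films.
ppv_high, npv_high = predictive_values(0.90, 0.90, prevalence=0.50)
ppv_low, npv_low = predictive_values(0.90, 0.90, prevalence=0.10)
print(f"prevalence 50%: PPV = {ppv_high:.2f}, NPV = {npv_high:.2f}")
print(f"prevalence 10%: PPV = {ppv_low:.2f}, NPV = {npv_low:.2f}")
```

With identical sensitivity and specificity, the positive predictive value falls as abnormal films become rarer, which is why the point in the sequence at which films are read needs to be reported.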
Standard 4: was the contribution of individual groups determined if the combined performance of two (or more) different groups of observers was assessed?
If the aim of a study is to assess the combined performance of two groups, e.g. radiographers and emergency nurse practitioners, it is desirable to assess the individual contribution of each group. This will demonstrate the marginal utility of the combined groups compared with each individual group.
Standard criterion: this standard was met if each group within a combination of groups was assessed independently.
Table 3 demonstrates that this standard was not applicable to diagnostic accuracy or performance studies, as no studies assessed the combined film reading performance of two or more groups of observers. 1 (50%) of 2 diagnostic outcome studies met this standard.
Standard 5: was an appropriate (valid) reference standard used?
The validity of radiographers' film reading performance is dependent on the veracity of the reference standard [16]. Considerable variation has been found between experienced radiologists when interpreting plain films [17]. Therefore, convergence of multiple radiologists' opinions should provide a better reference standard than the opinion of one radiologist [16]. Searching for further examinations of the same body area over a follow-up period of 1 year could be used to validate the accuracy of a single radiologist's report. Findings at follow-up could indicate whether the reference standard report was erroneous [18].
The criteria for this standard involved using the following hierarchy to judge the validity of the reference standard, with A1 being the most valid. (A1) A double/triple blind consultant radiological report; (A2) a single blind consultant radiological report validated using, for example, clinical follow-up; and (A3) a single blind consultant radiological report. The standard was not fulfilled if an inappropriate reference standard was used, e.g. a combination of radiologists at different grades, or if the professionals under evaluation were used as the reference standard or included in the process of generating the reference standard.
All 11 (100%) diagnostic accuracy studies used a valid reference standard; 6 used a double/triple blind consultant radiological report (A1). In contrast, 5 (45%) diagnostic performance studies failed to meet this standard. Four of these failed because they included the reports of radiographers under evaluation in the process of developing the reference standard, and the fifth study failed because it used junior radiologists' reports. Furthermore, the most frequently used valid reference standard was only a single consultant radiological report (A3). For diagnostic outcome studies all reference standards chosen were valid with the exception of one study that used a consultant radiologist or specialist registrar. Table 4 shows that the appropriate use of a valid standard was met in 24 (80%) of all studies.
Standard 6: was an appropriate (valid) arbiter used to compare radiographers' reports with the reference standard?
Critical elements of film interpretation are knowing what to look for in images and why [19]. The professional, or arbiter, responsible for comparing reports should likewise possess this knowledge. The primary criterion used to judge the validity of the arbiter is whether they were external to the institution under evaluation, i.e. where the arbiter was based. An external arbiter should be more objective than an internal one, as their institution is not being evaluated. Furthermore, even if an internal arbiter is blind to which report was made by whom, they might recognize who made a report, which could consciously or unconsciously affect their judgement. The second criterion is whether the study used a panel of arbiters rather than an individual, i.e. the number of arbiters used. Even when explicit criteria are available for comparing reports there can be variation in how they are applied by different arbiters. This is analogous to the discussion of single vs multiple radiologists producing a reference standard report. Therefore, when a panel of arbiters was involved in the process of comparing reports this was judged as being more valid than a single arbiter. The final criterion focused on whether the arbiter was appropriately skilled to perform this task, i.e. who the arbiter was.
Standard criteria: the following hierarchy was used to judge the validity of the arbiters with A1 being the most valid. (A1) External panel; (A2) external single consultant radiologist; (A3) internal panel; (A4) internal single consultant radiologist; (A5) radiographer trained to report supported by an independent consultant radiologist; and (A6) untrained radiographer(s) supported by an independent consultant radiologist. Examples of inappropriate arbiters include untrained radiographer(s) with no referral to a radiologist in equivocal cases, and a professional under evaluation.
9 (82%) of the 11 diagnostic accuracy studies used a valid arbiter, with 3 (27%) studies using an internal panel (A3). However, radiographers not trained to report were used in one study with no option to refer to a radiologist, and in one other study it was not possible to discern who the arbiter was. In contrast, the use of a valid arbiter was satisfied in only 2 (18%) of the 11 diagnostic performance studies. 7 (64%) of these studies used the radiographer under evaluation as the arbiter, or the radiographer was included in the process of arbitration. It was unclear who the arbiter was in two studies. Finally, 6 (75%) of the 8 diagnostic outcome studies used a valid arbiter. Of the two studies that did not, one used a radiographer not trained to report and the other did not state who the arbiter was. Table 4 shows that the appropriate use of a valid arbiter was met in 17 (57%) of studies.
Standard 7: was an appropriate control used?
Controls help rule out potential threats to study validity and eliminate the possibility of alternative explanations [20]. To assess the effectiveness of a training programme (intervention) on radiographers' film reading performance there should be a group who received the intervention (experimental) and a group who did not (control). The two groups should be matched for appropriate characteristics, such as number of years' experience in the profession or in a relevant speciality [21]. Improved performance in the experimental group could then be attributed to the intervention rather than to differences in the sample of films or radiographers.
Standard criterion: this standard was met if an appropriate control was used within the context of the study.
When applicable, 2 (50%) of 4 diagnostic accuracy studies did not use a control group when assessing the effectiveness of a training course on radiographers' ability to interpret films. None of the three diagnostic performance studies used a control. Two assessed the effectiveness of a training programme and another the effect of introducing a radiographer abnormality detection system (red dot system) on casualty officers' error rates. Similarly, neither of the outcome studies met this standard. Table 4 shows that only 2 (22%) studies used an appropriate control.
Presentation of results
Standard 8: were films appropriately analysed for pertinent subgroups?
Even if the case mix of films has been adequately described, the performance statistics represent average values for the entire sample that may mask low levels of accuracy for a particular subgroup [11].
Standard criterion: the standard was met if radiographers' performance was presented for pertinent medical subgroups, e.g. body areas, patient type.
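The masking effect of an average can be illustrated with a short sketch. The subgroup names and counts below are hypothetical and serve only to show how a pooled sensitivity can conceal a weaker subgroup.

```python
# Hypothetical abnormal films by body area: (correctly reported, missed).
abnormal_films = {
    "appendicular skeleton": (90, 10),
    "chest": (12, 8),
}


def sensitivity(detected, missed):
    """Proportion of abnormal films correctly reported as abnormal."""
    return detected / (detected + missed)


# Pooled estimate across all body areas.
total_detected = sum(d for d, _ in abnormal_films.values())
total_missed = sum(m for _, m in abnormal_films.values())
overall = sensitivity(total_detected, total_missed)

# Estimate for each subgroup separately.
by_subgroup = {area: sensitivity(d, m) for area, (d, m) in abnormal_films.items()}

print(f"overall sensitivity = {overall:.2f}")
for area, value in by_subgroup.items():
    print(f"{area}: sensitivity = {value:.2f}")
```

In this constructed example the larger subgroup dominates the pooled figure, so the lower accuracy for the smaller subgroup would go unnoticed unless results were presented by subgroup.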
Table 3 shows that in general, for diagnostic accuracy, performance and outcome studies, around 50% met this standard. Table 4 shows that it was met in 12 (44%) of all studies.
Standard 9: were the data presented in enough detail to permit re-calculation of performance statistics and confidence intervals?
The presentation of raw data is necessary so that readers or reviewers can calculate relevant performance statistics and confidence intervals. Confidence intervals are important to illustrate the range of values we can be confident includes a radiographer's true film reading performance [15]. The width of the interval indicates whether the sample size is too small to draw a valid conclusion from.
Standard criterion: data needed to be presented in enough detail to permit the calculation of performance statistics and confidence intervals.
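When the raw counts are reported, a reader can recompute a confidence interval directly. The sketch below uses the Wilson score interval, which is one of several accepted methods and is chosen here as an assumption; the review does not prescribe a particular method. The counts are hypothetical.

```python
import math
from statistics import NormalDist


def wilson_interval(successes, total, confidence=0.95):
    """Wilson score confidence interval for a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = successes / total
    denominator = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denominator
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denominator
    )
    return centre - half_width, centre + half_width


# Hypothetical: the same observed sensitivity (90%) from a small and a large sample.
small_low, small_high = wilson_interval(45, 50)
large_low, large_high = wilson_interval(450, 500)
print(f"n = 50:  {small_low:.3f} to {small_high:.3f}")
print(f"n = 500: {large_low:.3f} to {large_high:.3f}")
```

The interval from the smaller sample is visibly wider, which is the sense in which interval width indicates whether a sample is too small to support a firm conclusion.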
Table 3 shows that 8 (73%) diagnostic performance studies and 5 (63%) diagnostic outcome studies achieved this standard. Overall, this standard was met in 16 (59%) studies.
Standard 10: were indeterminate results, i.e. equivocal or missing data, appropriately presented?
In clinical practice it is not always possible to provide a clear-cut interpretation of a film owing to factors such as technical defects and artefacts, patient restrictions or administrative limitations [22]. Performance statistics are therefore distorted if there is no option to classify a report as being equivocal [23]. If indeterminate results are included but regarded as positive, sensitivity is artificially increased and specificity decreased. The reverse effects occur if the indeterminate results are counted as negative. Corresponding distortions occur in the calculation of likelihood ratios [11].
Two standard criteria had to be met before this standard was fulfilled. First, the study must present all of the appropriate positive, negative and indeterminate interpretations. Second, the study must describe whether indeterminate interpretations had been included or excluded when performance statistics were calculated.
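The distortion described for this standard can be reproduced with a small sketch that recalculates sensitivity and specificity under each way of handling indeterminate reports. The function name, the policy labels and the counts are hypothetical.

```python
def performance(tp, fn, tn, fp, indet_abnormal, indet_normal, policy):
    """Sensitivity and specificity under a given handling of indeterminate reports.

    indet_abnormal / indet_normal: indeterminate reports on films that the
    reference standard classed as abnormal / normal.
    policy: "exclude", "positive" or "negative".
    """
    if policy == "positive":
        tp += indet_abnormal
        fp += indet_normal
    elif policy == "negative":
        fn += indet_abnormal
        tn += indet_normal
    elif policy != "exclude":
        raise ValueError(f"unknown policy: {policy}")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical counts: 50 abnormal and 100 normal films, 5 indeterminate in each.
counts = dict(tp=40, fn=5, tn=85, fp=10, indet_abnormal=5, indet_normal=5)
results = {p: performance(**counts, policy=p) for p in ("exclude", "positive", "negative")}
for policy, (sens, spec) in results.items():
    print(f"{policy:>8}: sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```

Relative to excluding them, counting indeterminate reports as positive raises sensitivity and lowers specificity, and counting them as negative does the reverse, so a study needs to state which handling was used.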
6 (75%) diagnostic performance studies and 3 (75%) diagnostic outcome studies met this standard, compared with 2 (50%) diagnostic accuracy studies. A possible explanation is that diagnostic accuracy studies use a carefully selected sample of films to assess radiographers' film reading performance, so mostly unequivocal cases were used. Table 4 shows this standard was met in 11 (69%) studies overall.
| Conclusions |
|---|
Ideally, all standards should be met. However, possibly owing to resource constraints and study objectives, there are trade-offs between standards. As an example, when a robust reference standard is required, as in diagnostic accuracy studies, fewer films are selected. Conversely, when a large sample of films is required, as in diagnostic performance studies, a less robust reference standard is used. The challenge to those designing and executing such studies is to achieve the right balance between the methodological standards. Failure to adhere to these standards increases the chance of erroneous conclusions being drawn about radiographers' film reading performance. This in turn can affect radiographic reporting policy and, ultimately, patient care and service efficiency.
| Acknowledgments |
|---|
Received for publication April 17, 2001. Accepted for publication November 8, 2001.
| References |
|---|