Full Paper |
1 Centre for Health Informatics and Multiprofessional Education, University College London, Archway Campus, Highgate Hill, London, 2 Duchess of Kent Screening Unit, Blackshaw Road, London, 3 School of Public Policy, University College London, 29/30 Tavistock Square, London and 4 Health Economics Research Centre, Institute of Health Sciences, University of Oxford, Oxford, UK
| Abstract |
|---|
| Introduction |
|---|
Computer aids for mammography
The computer analysis of mammograms has been a field of research since the 1960s, and the recent advent of direct digital mammography (X-ray machines that generate a digital mammogram without the need for an image to be produced on film) has generated increased activity [4]. The most important work in the computer analysis of mammograms has been in the detection of two important classes of mammographic abnormality: microcalcifications and stellate lesions. A great deal of effort has been put into the development of these techniques, and they can be made extremely sensitive. However, this sensitivity is achieved only at the cost of specificity, with high numbers of false prompts. The interpretation of mammograms cannot, at the moment, be completely automated; this remains a distant prospect. Some authors, however, have suggested that the combination of computer plus human reader can produce better results than a human reader alone and have concluded that computer aided diagnosis (CAD) systems can be used to prompt human readers [5, 6].
A CAD system is used to alert a human film reader to regions of a mammogram where computerized analysis suggests that abnormalities might be found. The system used here is the R2 ImageChecker 1000 (R2 Technology Inc., Sunnyvale, CA) [7]. It has two components. The processing unit scans conventional films to create digital images, which are then analysed by the software that places the prompts. The display unit is used to display the films, in the conventional manner, for review. The display unit combines a computer, which is used to present the prompts, with a conventional lightbox for the display of the original films. The radiologist (or other film reader) views the image, checks the CAD display for prompts and, if appropriate, reassesses the image. CAD systems have existed in research settings for over a decade, and in clinical use for a little less. They have been the subject of a number of investigations aimed at evaluating their potential as aids to breast cancer screening.
Many studies agree that CAD systems can achieve high sensitivity and that they are better for detecting calcifications (sensitivities of 98% have been reported) than soft tissue lesions (86%) [8]. High sensitivities, of up to 77%, have also been achieved on prior mammograms judged on review to be false negatives in women subsequently diagnosed with cancer [9]. This high sensitivity is not matched by a high specificity, with approximately two false prompts on average per case [8]. The evidence on reader behaviour under test conditions is conflicting, with some studies showing improvement in sensitivity and specificity with prompting and some not [10–12]. A single prospective study demonstrates increased cancer detection in real-life screening, with a 19.5% increase in the number of cancers detected, at a cost of an 18% increase in the number of women recalled for assessment [13].
Non-radiologist film readers
Two groups of non-radiologists are currently employed as film readers in the breast screening programme: radiographers and breast clinicians. Radiographers are being trained as film readers in a number of centres in the UK. Trained radiographers are currently working as film readers in at least six screening centres. The Association of Breast Clinicians was founded in 1996 and currently has 53 members, all of whom are registered medical practitioners working in breast diagnostics.
The available data seem to suggest that both breast clinicians and radiographers can be trained to perform as well as radiologists [13–15]. The samples studied are, however, relatively small and likely to be highly selected (nine trained radiographers and 10 breast clinicians participated). It may be that only a limited number of radiographers feel comfortable with the additional responsibility associated with film reading. Attempts to involve large numbers of radiographers in the interpretation of films might fail to maintain the performance achieved in the studies carried out to date.
It seems likely that an increased reliance on radiographers will be an essential component of any solution to the manpower crisis. There are, however, outstanding questions concerning the use of non-radiologist film readers and it would be wrong to assume that using non-radiologist film readers can solve the crisis by itself. Attempts to encourage large numbers of radiographers to move into film reading could generate a shortage of radiographers to take the films. It is therefore sensible to consider the extent to which the reading of screening films can be automated, but also to study the impact of any proposed form of automation on a screening programme in which different professional groups read the films.
Looking at the available evidence on the use of computer aids and on film reading radiographers, we decided that there was a need to examine the potential for the use of computer aids as part of a screening programme employing radiographers as film readers. We therefore carried out a study to ascertain (1) whether computer aids could be used in the role of second readers, allowing a single radiologist to achieve the accuracy currently achieved by double reading, and (2) whether a radiographer using a computer aid performed as well as a radiologist. We wanted to include a large sample of different film readers in our study and therefore elected to carry out a test of the impact of prompts on decision-making using archive films with a high proportion of cancers. This study was carried out as part of a larger programme of evaluations that included assessments of the repeatability of prompts [17], the sensitivity and specificity of CAD tools on interval cancers from the NHSBSP [18] and assessments of film readers' attitudes to and opinions of prompting systems [19].
| Methods and materials |
|---|
The mammograms were divided into three sets of 60 cases, each to be interpreted at a single sitting. The ratio of cancer to non-cancer cases was varied slightly across the three sets. An additional set of 60 cases was used as a training set to accustom film readers to using the CAD system.
Readers
50 film readers participated: 30 consultant radiologists, 5 breast clinicians and 15 trained radiographers. All had undergone a rigorous training programme to meet the requirements of NHSBSP. All were currently working in the screening programme and reading at least 5000 screening cases per annum.
Procedure
All film readers were given training including an explanation of prompting and of the behaviour of the system. They were told that they should look at the films in the normal way, before looking at the prompts. They were advised on the typical frequency of prompts and given examples of cases where CAD often generates inappropriate prompts. All film readers then read a training roller, using the prompts. This roller contained a similar mix of cancers and non-cancers to the subsequent test rollers.
The order in which each reader viewed the test sets was separately randomized, as was the order in which they viewed the conditions (prompted and unprompted). A minimum period of 1 week was left between reading a test set in the two conditions.
Readers viewed the films on a standard roller viewer and were asked to complete a report form for each case. Low resolution images of the films were included on the form. In the prompted condition these were shown with prompts, in the unprompted condition they were shown without prompts. The reader was asked to circle any area of abnormality and state their degree of suspicion for each abnormality in a table below the images. Readers were then asked to give an overall decision on recall for each case. The time taken to read each test set was recorded.
All the cancer cases were reviewed by a consultant radiologist (RGW) who indicated the extent of visible signs of cancer on each image. Two members of the team (PT & JC) compared these annotations with the prompt sheets generated by the ImageChecker to ascertain whether or not the cancer was correctly prompted. A case was considered to have been correctly prompted if a prompt appeared in either view at a location indicated by the radiologist. Three borderline cases were reviewed by RGW.
| Results |
|---|
The transformed data were then entered into a generalized linear model (repeated measures analysis of variance) with two within-subject factors: for the prompt (two levels: with prompt, without prompt) and for the roller (three levels for three rollers). One analysis was done for sensitivity and another for specificity. Testing for a difference in sensitivity, we found a significant effect for the roller (F(2,76)=26, p<0.001), but not for the prompt (F(1,38)=0.003, p=1.0) or for the interaction between roller and prompt (F(2,76)=2.2, p=0.1). The data presented in Table 3 show the effect of taking the model-estimated means and 95% confidence intervals on the transformed scale and back-transforming them on to the original sensitivity scale of 0–1.
Impact of CAD prompts on radiologists' and radiographers' sensitivities and specificities
Taking the above data for sensitivity and analysing the results for radiologists and radiographers separately (there were too few breast clinicians in the final sample for comparisons with that group to be significant) we obtained the results presented in Table 5. There is a significant effect of the roller (F(2,70)=21, p<0.001), but not for prompt, reader type or any of the interactions (p>0.2). Repeating the exercise for specificity we again found a significant effect due to the roller (F(2,70)=5.1, p=0.009), but not for prompt, reader type or any of the interactions (p>0.1).
For sensitivity there was a significant effect of the roller (F(2,74)=27, p<0.001) and of reader skill (F(1,37)=62, p<0.001), but not for prompt or any of the interactions (p>0.07). For the key interaction between prompt and reader skill: F(1,37)=0.6, p=0.4. For specificity there was a significant effect of the roller (F(2,74)=5.3, p=0.007) and of reader skill (F(1,37)=14, p=0.001), but not for prompt or any of the interactions (p>0.1). For the key interaction between prompt and reader skill: F(1,37)=0.8, p=0.4. There was no evidence that use of R2 affects able and less able readers' sensitivities or specificities differently.
| Discussion |
|---|
There is, therefore, clear evidence that CAD algorithms are extremely sensitive even for the detection of relatively subtle microcalcifications. The evidence that they are effective in improving radiologists' decision-making is however less strong. Thurfjell et al carried out a study with three film readers: an expert screener, a screening radiologist and a clinical radiologist [10]. The expert's sensitivity of 86% was unchanged, but use of the ImageChecker improved that of the screening radiologist from 80% to 84% and that of the clinical radiologist from 67% to 75%. The specificities of the expert and of the clinical radiologist were unchanged, while that of the screening radiologist fell from 83% to 80%. Funovics et al tested the system using a test set including 40 proven spiculated lesions and three radiologists. They found an average improvement in sensitivity of 9%, with some cost in specificity [11]. The largest reported study is that of Brem and Schoonjans who used a sample of 106 cases including 42 malignant microcalcifications, 40 with benign microcalcifications and 24 normals [12]. Five radiologists participated. 41 out of 42 (98%) malignant microcalcifications and 32 of 40 (80%) benign microcalcifications were prompted at a prompt rate of 1.2 markers per image. The radiologists' sensitivity without and with the system ranged from 81% to 98% and 88% to 98%, respectively. No statistically significant changes in sensitivity were found and no significant compromise in specificity.
Freer et al conducted a 12 month prospective evaluation of the impact of CAD using the R2 ImageChecker [13]. The trial involved 12 860 screening mammograms. Each was initially interpreted without CAD. Areas marked by the CAD system were then re-evaluated. Data were recorded both before and after the CAD prompts were consulted. The authors report an increase in recall rate from 6.5% to 7.7% with the use of CAD and an increase from 3.2 to 3.8/1000 in the cancer detection rate.
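The trade-off reported by Freer et al can be made concrete with a little arithmetic. The following Python sketch uses only the rounded rates quoted above; the function and variable names are illustrative and not taken from the study. Because it starts from rounded rates, the relative increase in detection it gives differs slightly from the 19.5% figure quoted in the Introduction.

```python
def relative_increase(before, after):
    """Fractional change from `before` to `after`."""
    return (after - before) / before


# Rounded rates quoted from Freer et al [13].
recall_before_pct, recall_after_pct = 6.5, 7.7            # % of women recalled
detect_before_per_1000, detect_after_per_1000 = 3.2, 3.8  # cancers per 1000 screened

recall_rel = relative_increase(recall_before_pct, recall_after_pct)
detect_rel = relative_increase(detect_before_per_1000, detect_after_per_1000)

# Convert the recall rates from percentages to a per-1000 basis (x10)
# so that they share a denominator with the detection rates.
extra_recalls_per_1000 = (recall_after_pct - recall_before_pct) * 10
extra_cancers_per_1000 = detect_after_per_1000 - detect_before_per_1000
recalls_per_extra_cancer = extra_recalls_per_1000 / extra_cancers_per_1000

if __name__ == "__main__":
    print(f"relative increase in recall rate:    {recall_rel:.1%}")
    print(f"relative increase in detection rate: {detect_rel:.1%}")
    print(f"extra recalls per extra cancer:      {recalls_per_extra_cancer:.0f}")
```

On these rounded figures, each additional cancer detected corresponds to roughly 20 additional women recalled for assessment.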
The key issue in evaluating the above evidence on decision-making is the contrast between the somewhat equivocal results of studies using test sets drawn from archive data and the single prospective study, that of Freer et al [13]. The latter study found an increase in the number of cancers detected, although at a non-negligible cost in terms of increased recall rate. It should also be noted that 75% of the additional cancers were ductal carcinoma in situ (DCIS) rather than invasive cancers. The study could also be questioned because it compares a judgement made on the radiologist's first look with a judgement made on the basis of the first and second looks, where CAD is available on second look. One might expect that the performance of the radiologists on the first look would be slightly compromised because they would have been anticipating the second look. And of course, the increment due to the second look cannot be entirely attributed to the availability of CAD.
We were interested in assessing the potential of CAD as an aid for the screening programme. We hoped to use our data to simulate two comparisons: one between double reading by radiologists and double reading by a radiologist and a CAD supported radiographer and another between double reading by radiologists and a radiologist with CAD. If there was a significant impact due to the prompts such comparisons could provide valuable information that could form part of an economic assessment of the likely value of CAD. However, if there was no significant impact due to CAD it would be clear that the comparisons were unwarranted, since if we failed to detect an impact due to CAD, we could not make any further assessment of its impact. The study could therefore be viewed as having two underlying questions: the central question being whether or not the CAD prompts improved the film readers' sensitivity with a secondary question being whether or not it had a differential impact on radiologists and radiographers. It therefore seemed sensible to attempt to answer these questions first, since this would maximize the study power.
We found no evidence that the prompts provided by the R2 ImageChecker affected readers' sensitivities or specificities. This is consistent with other results such as that of Brem and Schoonjans [12] but runs against the general tenor of findings in evaluations of prompting systems. Studies such as those of Funovics et al or Thurfjell et al [10, 11] have found in favour of the ImageChecker. These studies were, however, much less powerful than ours and should not be regarded as demonstrations of an effect due to prompting. We believe strongly that conclusions should not be drawn from studies involving only two or three radiologists. Our negative results also contrast with some of the earlier studies conducted in university research labs which used receiver operating characteristic (ROC) analysis to investigate the impact of prototype prompting systems, for example Chan et al [5]. This may reflect a difference in the analysis. An increase in the area under the ROC curve may reflect a slight impact of the prompts on the confidence of the decision-maker over a range of operating points, rather than a significant change in their behaviour with respect to a real clinical decision.
We found no significant difference in recall rate between different professional groups of readers (p=0.2). This is consistent with other studies. Haiart and Henderson compared the sensitivities and specificities of a radiologist, a radiographer and a clinician and found similar sensitivities but greatly inferior specificity for the non-radiologists [14]. Pauli et al studied a group of seven trained radiographers working as second readers. They found that the increase in sensitivity due to the second reader was 6.4% [15]. This was reported as being equivalent to the increase found for second reading by radiologists. Cowley and Gale provide data on the comparative performance of the different professional groups on PERFORMS, a voluntary self-assessment exercise [16]. The results show that untrained radiographers have some interpretative skills and that trained radiographers perform as well as radiologists. The performance of breast clinicians was slightly better than that of radiologists in early rounds of assessment and similar in the most recent one.
We divided the readers into two groups based on their average sensitivity scores, to see if CAD had a greater effect on readers with lower sensitivities. We found no evidence that use of the ImageChecker affected more and less able readers differently.
The paradox of CAD is that prompts seem not to affect decision-making despite their high sensitivity. We believe that this is partly due to their low specificity. Figures for the number of cancers missed by the NHSBSP are not published in its annual review but can be estimated by looking at cancers that presented as interval cancers or at subsequent screening rounds and determining if they were visible at prior rounds of screening. We believe that around 25% of radiologically visible cancers are missed at screening [23]. This would seem to indicate a clear role for CAD. If the number of cancers detected per thousand cases is 6.5, the number of visible cancers would be 8.9 with the number missed being 2.4. However, there is evidence that at least 40% of false negative interval cancers correspond to lesions that were seen by radiologists and misclassified as benign [24]. Our data suggest that the sensitivity of CAD on false negative interval cancers is as low as 60% [18]. Taking these two facts into consideration, the number of prompts for cancers that would otherwise not be detected is probably less than 1 prompt per 1000 cases.
A radiologist reading 5000 cases per annum will, in a year, see 10 000 prompts, of which maybe only five will be prompts for cancers that he or she had not detected. Of course if every radiologist detected five additional cancers per annum, this would be a significant intervention, but the likelihood has to be that these prompts will be so swamped by false positives that they will fail to have the impact one would wish for.
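The estimate in the two paragraphs above can be reproduced step by step. The following Python sketch is a minimal worked version of that arithmetic, using only the figures quoted in the text (8.9 visible and 6.5 detected cancers per 1000 cases, 40% of missed cancers seen but misclassified, 60% CAD sensitivity on missed cancers, about two false prompts per case, 5000 cases per reader per year). The function and parameter names are illustrative assumptions, not part of the study.

```python
def useful_prompts_per_1000(visible_per_1000=8.9,
                            detected_per_1000=6.5,
                            frac_seen_but_misclassified=0.40,
                            cad_sensitivity_on_missed=0.60):
    """Prompts per 1000 cases that mark a cancer the reader never saw."""
    missed = visible_per_1000 - detected_per_1000
    # Only cancers that were overlooked (rather than seen and dismissed)
    # can be rescued by a prompt that merely draws attention to a region.
    overlooked = missed * (1.0 - frac_seen_but_misclassified)
    return overlooked * cad_sensitivity_on_missed


def yearly_prompt_load(cases_per_year=5000, false_prompts_per_case=2.0):
    """Return (total prompts seen, useful prompts) for one reader-year."""
    total_prompts = cases_per_year * false_prompts_per_case
    useful = cases_per_year / 1000.0 * useful_prompts_per_1000()
    return total_prompts, useful


if __name__ == "__main__":
    per_1000 = useful_prompts_per_1000()
    total, useful = yearly_prompt_load()
    print(f"useful prompts per 1000 cases: {per_1000:.2f}")
    print(f"prompts seen per year:         {total:.0f}")
    print(f"useful prompts per year:       {useful:.1f}")
```

With the defaults this gives just under one useful prompt per 1000 cases and between four and five per reader-year against 10 000 prompts in total, in line with the rounded figures in the text.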
Cowley and Gale reported that both breast clinicians and radiographers are slower than radiologists at reporting films. However, we found no significant difference in time taken to read films in either the prompted or the unprompted conditions (p=0.6) or between the different professional groups. Under test conditions both groups are likely to be slower than in real life and this may mask differences.
There are a number of questions that need to be asked about the validity of studies such as ours. One concern is the extent to which the test set is adequately challenging. In order to ensure that the set was challenging we included a number of false negative interval cancers. Our set of 60 cancers was made up of 20 false negative interval cancers and 40 screen-detected cancers. Cancers of each category were taken at random from those available in a participating screening centre. The resulting set was felt to be an appropriate test, including both obvious and very subtle cancers.
However, if we now analyse this set of 60 cancers and the data from this study, we can divide the 60 cases into three groups. There are 15 "too difficult" cases, where there is no possibility of detecting an improvement due to prompting because the ImageChecker failed to prompt the cancer. There are 29 "too easy" cases, where there is no real scope for detecting an improvement because more than 90% of readers detected the cancer in the unprompted condition. There are, therefore, only 16 cases on which there is any real scope for detecting improvements. If we restrict attention to just these cases, we find that the power of the study is greatly reduced (our simulations suggest a power of around 50% to detect a 5% improvement in sensitivity).
We are currently carrying out a further study using a data set that contains 40 cancers which were missed by at least one radiologist at screening but which are prompted by the R2 ImageChecker. We hope that this will provide further evidence of the impact of CAD; however, some analytical work will then be required to determine how the mix of cases in the old and new data sets maps to the casemix found in screening.
There are further highly significant questions that cannot realistically be answered using archive data:
Questions such as these can only, realistically, be answered by an evaluation of the impact of a CAD system in routine use. We hope in the future to carry out a replication of the Freer et al prospective study using a larger group of radiologists and radiographers using double reading in the context of a UK screening programme. We believe that this is important because of the differences between the UK and American screening populations and processes and also because we regard the findings of the Freer et al study as inconclusive given the high proportion of DCIS in the additional cancers. The initial cancer detection rate in the Freer study is very low compared with that in the NHSBSP, perhaps reflecting the lower incidence of cancer in their younger population, and the increase in recall rate found by Freer and colleagues would have serious consequences for the UK screening workload.
It is worth noting that the algorithms used in the ImageChecker are periodically revised and the machines upgraded. Improvements in the sensitivity and specificity of the detection software are likely to have a significant impact on the usefulness of the device. We are particularly keen to see improvements in the overall specificity of the prompts and in the sensitivity of the prompts to asymmetries and distortions. Work now ongoing to raise specificity is looking at the effect of integrating information from both views of the same breast and comparing with previous films, both strategies which radiologists use to raise specificity. Comments made by participants in our study suggest that the current low specificity of the prompts means that they are not valuable as aids to decision making. This is significant because some authorities suggest that as many as half the cancers missed at screening are missed not because the radiologist failed to spot the abnormality but because he or she made the wrong decision in assessing it [24]. Seven out of the 15 cancers missed by the ImageChecker in this study were asymmetries or subtle distortions.
Finally, we should acknowledge that full field digital imaging will change the use of R2 dramatically. There will be no constraints due to digitizing time or sorting films. In addition the resolution of full field digital might require CAD for the detection of fine calcifications.
| Appendix 1 |
|---|
A value for bD was taken from a survey by Beam et al which suggested that the sensitivity of film-readers can differ from the mean by up to 20% [25]. There is little data on the varying difficulty of images, so the figure for aD was also set, somewhat arbitrarily, at 20%. The mean sensitivity of film readers in experimental conditions obviously depends on the difficulty of the test set, and on the reading protocol. The model was run with SpreD set to 70% and 80%. The difference between intervention and control conditions was set at 10%, which roughly corresponds to the effect of CAD found by Funovics et al [11]. A realistic maximum number of images for a film reader to interpret in a study is 300 and it was felt that at least two normals were required for every cancer in the set, therefore the model was run with values of 60 and 90 for ID. The particular difficulty for this study is obtaining a sufficient number of non-radiologist film-readers and the model was run using values of 7 and 10 for R. Where one of the conditions being considered is double reading, the number of readers required will be double R. 400 simulations were obtained with each setting of parameter values. The value given for power is the proportion of simulations where the result was significant at the 0.05 level.
Given the above calculations we assumed that 10 readers were required for each group to be compared, but that 60 cancers would be adequate. With 10 readers per group (20 for double reading) and 60 cancers this gave a power of 0.76 or 0.83 depending on whether the initial sensitivity is 70% or 80%. In the above analysis, the difference to be detected between the two conditions is 10%. The other parameters have, however, been set rather cautiously and one might reasonably hope to detect a smaller difference. A recent experimental study of one versus two view mammography found a mean sensitivity of 88%, with a variation of plus or minus 10%. Using these estimates our study would have had a 71% chance of detecting a 5% change in sensitivity.
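The simulation approach described in this appendix can be illustrated with a short Monte Carlo sketch in Python. This is not the authors' model, whose exact form is not given here. It rests on stated assumptions: reader and image effects are drawn uniformly within plus or minus 20% of the baseline sensitivity, detection probabilities are clipped to the range 0 to 1, each reader reads every cancer in both conditions, and significance is judged with a paired t-test across readers using tabulated two-sided 5% critical values. All names are hypothetical, and the sketch is not expected to reproduce the power figures quoted above.

```python
import random
import statistics

# Two-sided 5% critical values of Student's t for df = readers - 1
# (df = 6 for 7 readers, df = 9 for 10 readers).
T_CRIT = {6: 2.447, 9: 2.262}


def clip(p):
    """Keep a probability inside [0, 1]."""
    return min(1.0, max(0.0, p))


def simulate_once(rng, readers, cancers, base, effect,
                  reader_spread, image_spread):
    """Run one simulated study; return True if the paired t-test is significant."""
    # Assumed: each cancer has its own difficulty offset, shared by all readers.
    image_offsets = [rng.uniform(-image_spread, image_spread)
                     for _ in range(cancers)]
    diffs = []
    for _ in range(readers):
        # Assumed: each reader has a personal offset from the mean sensitivity.
        reader_offset = rng.uniform(-reader_spread, reader_spread)
        hits_control = hits_prompted = 0
        for offset in image_offsets:
            p = base + reader_offset + offset
            hits_control += rng.random() < clip(p)
            hits_prompted += rng.random() < clip(p + effect)
        diffs.append((hits_prompted - hits_control) / cancers)
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0:
        return mean != 0
    t = mean / (sd / readers ** 0.5)
    return abs(t) > T_CRIT[readers - 1]


def estimate_power(readers=10, cancers=60, base=0.70, effect=0.10,
                   reader_spread=0.20, image_spread=0.20,
                   n_sims=400, seed=0):
    """Proportion of simulated studies significant at the 0.05 level."""
    rng = random.Random(seed)
    significant = sum(
        simulate_once(rng, readers, cancers, base, effect,
                      reader_spread, image_spread)
        for _ in range(n_sims)
    )
    return significant / n_sims


if __name__ == "__main__":
    for base in (0.70, 0.80):
        print(base, estimate_power(base=base))
```

Setting `effect=0.0` gives a check on the procedure, since the proportion of significant results should then fall close to the nominal 0.05 level. The `T_CRIT` table covers only the two reader counts mentioned in the text (7 and 10) and would need extending for other values.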
| Acknowledgments |
|---|
| Footnotes |
|---|
Received for publication March 13, 2003. Revision received July 14, 2003. Accepted for publication August 7, 2003.
| References |
|---|