British Journal of Radiology (2005) 78, S26-S30
© 2005 British Institute of Radiology
doi: 10.1259/bjr/84545410

Paper

Evaluation of computer-aided detection (CAD) devices

P Taylor, PhD 1 and R M Given-Wilson, FRCR 2

1 Centre for Health Informatics and Multiprofessional Education, University College London, London and 2 Department of Radiology, St George's Hospital NHS Trust, London, UK


    Abstract
We present a review of three major UK studies of computer-aided detection (CAD) for mammography. A short account of the motivation, methods and results is given for each of the three. A number of conclusions are drawn, particularly about the merits and difficulties of research in the field. The first two studies measured the impact of CAD on the sensitivity and specificity of film readers interpreting cases with known outcomes displayed on rollers with an artificially high frequency of cancers. In the first study 50 film readers each read 180 cases, including 60 cancers (40 screen-detected and 20 interval). In the second study 35 film readers viewed 120 cases including 44 cancers, of which 40 were selected to be difficult cases that CAD prompted correctly. The third study was carried out prospectively. 6111 films were independently double read by film readers who recorded a judgement before and after viewing CAD prompts. In addition to this intraobserver measure of the impact of CAD, we compared the cancer detection rate in these cases with that in 1339 cases read over the same period without the benefit of CAD. None of the three studies showed a statistically significant effect attributable to CAD. There is evidence that a high proportion of missed cancers are prompted and that "emphasised" prompts, which have a greater positive predictive value, have a stronger impact on decision-making than other prompts.


    Introduction
Computer-aided detection (CAD) systems have been around in research settings for over a decade, and in clinical use for a little less. They have been the subject of a number of investigations aimed at evaluating their potential as aids to breast cancer screening. Broadly speaking, these fall into three categories. The simplest form of evaluation is to run the detection software on a test set of mammograms and see how it performs: how many cancers are detected for a given level of false prompts. The manufacturers of commercial systems carry out such studies both with unselected cancer cases and, more interestingly, with sets of cancers that were previously missed at screening. It is worth replicating these studies with missed cancers from different screening programmes, since the subtlety of missed cancers will vary. More, however, can be learnt from evaluations that run the software on a set of cases and see how the generated prompts affect the decisions of a set of radiologists. The third category of evaluation is potentially the most revealing, a prospective trial of the impact of the device when placed in routine use.

Over the last 5 years we have led a series of evaluations of a commercial CAD system, the R2 ImageChecker, with the aim of assessing its potential value as an aid in the UK National Health Service Breast Screening Programme (NHSBSP). We carried out three main evaluations, all of which have been published with additional detail elsewhere [1–3]. In this paper we present an overview of the three studies, focusing on ethical and methodological issues, and make some observations regarding the value and difficulty of these evaluations.


    Study One
The work was commissioned by the NHS Health Technology Assessment Board. The aim was to assess the potential cost effectiveness of CAD. It was felt that a full randomised controlled trial of CAD would be too expensive and take too long. Specifically, we were concerned that, to meet the ethical requirements of such a trial, women attending the participating screening centres would have to be asked to give their consent, which would have delayed the screening process. Comparing the detection rates of CAD and non-CAD groups on samples as small as 60 cancers each might have required that we approach 24 000 women for their consent. True data on specificity and sensitivity could only have been obtained after 3 years of follow-up. We therefore elected to carry out an experimental study on test sets constructed from archive cases with known outcomes.
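The figure of 24 000 women can be related to the sample size by simple arithmetic. The sketch below only back-calculates the cancer detection rate implied by the numbers quoted above; the rate itself is not stated in the text, and the calculation ignores refusals of consent, so it should be read as an illustration rather than as the authors' own calculation.

```python
# Figures quoted in the text
cancers_per_arm = 60
arms = 2                   # CAD and non-CAD groups
women_approached = 24_000

# Implied detection rate (an inference, not a figure given in the text)
implied_rate = cancers_per_arm * arms / women_approached
print(f"{implied_rate * 1000:.1f} cancers per 1000 women screened")
```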

Method and materials
50 film readers took part: 30 consultant radiologists, 5 breast clinicians and 15 trained radiographers. All met the requirements of the NHSBSP. All were currently reading at least 5000 screening cases per annum. Film readers were given an explanation of prompting and of the behaviour of the system. They were advised on the typical frequency of prompts and given examples of cases where CAD often generates inappropriate prompts. All film readers then read a training roller, using the prompts.

The test set contained 180 cases of women undergoing routine screening at a centre using two views and double reading with arbitration. The sample included 60 confirmed cancers: 40 consecutive cancers detected through routine screening and 20 false negative interval cancers. Controls were selected from the same time period as the screen-detected cancers and had all been confirmed as normal at a subsequent screen. All were processed using Version 2.2 of the R2 ImageChecker®.

The sample cases were assigned to three rollers, having 15, 20 and 25 cancers. The order in which each reader viewed them was randomised, as was the order in which they viewed the conditions (prompted and unprompted). A minimum period of 1 week was left between reading a roller in the two conditions. Readers viewed the films on a standard roller viewer. Low resolution images of the films were provided on a paper form. In the prompted condition these were shown with prompts, in the unprompted condition they were shown without prompts. Readers were asked to give an overall decision on recall for each case.

Results
The ImageChecker® was judged to have correctly prompted 44 of the 60 cancer cases. 36 of 40 screen-detected cases were prompted, a sensitivity of 90%; only 8 of 20 interval cases were prompted, a sensitivity of 40%.

The sensitivity and specificity of each reader on each roller under each condition were calculated. Sensitivities and specificities are on a 0–1 scale and tend not to be normally distributed. A modified logit transform was therefore applied to the data. The transformed data were then analysed using repeated measures analysis of variance with two within-subject factors: the prompt and the roller. One analysis was performed for sensitivity and another for specificity. Testing for a difference in sensitivity, we found a significant effect for the roller (F(2, 76) = 26, p < 0.001), but not for the prompt (F(1, 38) = 0.003, p = 1.0) or for the interaction between roller and prompt (F(2, 76) = 2.2, p = 0.1). A comparable analysis for specificity found a significant effect for the roller (F(2, 76) = 5.0, p = 0.009), but not for the prompt (F(1, 38) = 0.13, p = 0.7) or for the interaction between roller and prompt (F(2, 76) = 0.9, p = 0.4). There was, therefore, no evidence that the prompts affected readers' sensitivities or specificities.
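The exact form of the "modified logit transform" is not given here. As an illustration only, the sketch below uses the empirical logit, one common modification that adds 0.5 to each count so that the result stays finite when a reader scores 0 or 1 on a roller (where the plain logit is undefined). The function name and the example counts are hypothetical.

```python
import math

def empirical_logit(hits, total):
    """Empirical logit of hits/total.

    Assumption: the 0.5 offsets are one common 'modified logit'; the
    modification actually used in the study is not specified in the text.
    """
    return math.log((hits + 0.5) / (total - hits + 0.5))

# Hypothetical reader who detects 13 of the 15 cancers on one roller
print(empirical_logit(13, 15))
# A perfect score still gives a finite value
print(empirical_logit(15, 15))
```

The transformed values, one per reader, roller and condition, would then be the input to the repeated measures analysis of variance.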


    Study Two
Having found no evidence of an effect due to CAD in Study One, we decided to assemble a test set of cases that would be more likely to show an impact. A second study, very similar in design to the first, was then carried out using this new test set and a subset of the readers from the first study.

Method and materials
The cancer cases were selected using two criteria: the case must have been missed by at least one film reader in the past, and the cancer must be correctly prompted. Three different categories of case meeting the first criterion were used: false negative interval cancers (n=7), cases used in a previous experiment for which data sheets were still available (n=2), and cases missed by the first reader in normal double reading (n=31). 40 cases that met the criteria were included, with 4 further cancer cases not correctly prompted. Control films were unselected normal cases from the same period as the cancer cases, all of which had undergone a subsequent normal screen.

18 consultant radiologists, 15 trained radiographers and 2 breast clinicians took part, all of whom had participated in Study One. The same procedure was used for reading the cases as had been used in Study One.

Results
The sensitivity and specificity of each reader was calculated for each condition and each roller, as for Study One. In addition, data from the performance of individual readers working without CAD were used to simulate the effect of double reading with arbitration. For each pair of readers, the result was taken to be recall if both agreed on recall, no recall if both agreed on no recall, and, if they disagreed, the result was determined using the judgement of a third reader, selected at random.
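The arbitration rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the data layout (a dictionary of recall decisions per reader), the function names and the example decisions are all hypothetical, and the third reader is drawn at random from the readers outside the pair.

```python
import random
from itertools import combinations

def double_read(decisions, case, pair, rng):
    """Recall decision for one case from a pair of readers, with arbitration.

    decisions[reader][case] is True for recall. If the pair disagree, a third
    reader chosen at random from the remaining readers decides.
    """
    a, b = pair
    if decisions[a][case] == decisions[b][case]:
        return decisions[a][case]
    others = [r for r in decisions if r not in pair]
    return decisions[rng.choice(others)][case]

def double_reading_sensitivity(decisions, cancer_cases, rng):
    """Sensitivity of every simulated pair of readers over the cancer cases."""
    result = {}
    for pair in combinations(sorted(decisions), 2):
        hits = sum(double_read(decisions, c, pair, rng) for c in cancer_cases)
        result[pair] = hits / len(cancer_cases)
    return result

# Hypothetical unprompted decisions of three readers on four cancer cases
decisions = {
    "r1": {"c1": True, "c2": True, "c3": False, "c4": False},
    "r2": {"c1": True, "c2": False, "c3": False, "c4": True},
    "r3": {"c1": True, "c2": True, "c3": False, "c4": True},
}
rng = random.Random(0)
sens = double_reading_sensitivity(decisions, ["c1", "c2", "c3", "c4"], rng)
print(sens)
```

In this toy example each single reader has a sensitivity of 0.5 or 0.75, while every simulated pair reaches 0.75, because a cancer is lost only when both readers, or one reader and the arbiter, miss it.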

As in Study One, the computed sensitivities and specificities were transformed using a logit function before taking a mean value across all readers (or, for double reading, all combinations of two readers). After calculating the means, an inverse logit transform was used so that the results could be presented on the familiar 0–1 range.

Confidence intervals on the data were generated using a bootstrapping technique in which means were calculated in each of the three conditions for 999 random simulated samples generated from the sets of scores. The 95% confidence interval (CI) was taken to be the range between the 25th and 975th means. The same bootstrapping technique was used to calculate 95% CIs for two comparisons, one between single reading and single reading with CAD, and one between single reading and double reading.
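The averaging and bootstrapping steps can be sketched together. The code below is an assumption-laden illustration: it uses a plain logit (so scores must lie strictly between 0 and 1), resamples the per-reader scores with replacement, and takes the 25th and 975th of the 999 sorted means as the interval. How the study's samples were actually generated is not described in more detail than above, and the scores shown are hypothetical.

```python
import math
import random

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

def logit_mean(scores):
    """Mean taken on the logit scale, then mapped back to the 0-1 range."""
    return inv_logit(sum(logit(s) for s in scores) / len(scores))

def bootstrap_ci(scores, n_boot=999, seed=0):
    """Percentile 95% CI from 999 resampled means (25th and 975th when sorted)."""
    rng = random.Random(seed)
    means = sorted(
        logit_mean([rng.choice(scores) for _ in scores]) for _ in range(n_boot)
    )
    return means[24], means[974]  # 1-based positions 25 and 975

# Hypothetical per-reader sensitivities, strictly between 0 and 1
scores = [0.70, 0.75, 0.80, 0.77, 0.82, 0.73, 0.79, 0.76]
low, high = bootstrap_ci(scores)
print(logit_mean(scores), (low, high))
```

A confidence interval for a difference between two conditions can be obtained in the same way, by computing the difference of the two means within each resample before sorting.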

The data show that double reading increases sensitivity compared with single reading (0.81 compared with 0.77; 95% CI for the difference 0.014 to 0.077). The sensitivity for single reading with CAD is also greater than that for single reading (0.80 compared with 0.77; 95% CI for the difference −0.0027 to 0.064); however, since the lower end of the 95% CI for the difference is below zero, the difference is not statistically significant. Mean specificity is also improved both by CAD and by double reading; again, the difference due to double reading is statistically significant, but that due to CAD is not.


    Study Three
Neither Study One nor Study Two showed any evidence of an impact due to CAD. Both studies were criticised by colleagues who felt that users of CAD would not be able to take advantage of the information that the prompts provide until they had had more experience of using the system [4]. We were also aware of the risk that participants in our studies were reading under rather unrealistic conditions and were perhaps able to sustain a degree of vigilance that meant they avoided the kinds of observational lapses for which CAD is of most obvious benefit. We designed an evaluation that allowed us to introduce CAD into prolonged routine use in a way that avoided some of the difficulties associated with a full randomised controlled trial. This prospective study avoided many of the problems of the first two studies, although it had difficulties of its own.

Prospective studies of CAD have used different designs. Some researchers have asked radiologists to record assessments before and after looking at prompts and then reported the proportion of cancers detected only after the prompts were consulted. Freer and Ulissey [5], who used this design, reported a 19.5% increase in cancer detection (it should be noted that the number of cancers involved in these studies is necessarily small; a 19.5% increase here was provided by eight additional cancers). Other researchers have compared the overall cancer detection rate in two distinct sets of cases, where CAD was used only in one set. Gur et al [6] found no difference in cancer detection rate for those cases viewed with and those viewed without CAD. Our study can be viewed as a replication of both the Freer and Ulissey and the Gur et al studies. Data were recorded both regarding the impact of prompts on individual readers' judgements and regarding the cancer detection rate in two sets of cases, with CAD only being available in one set.

Method and materials
We used a much improved version of the R2 ImageChecker® in this study, Version 5.0. From 21 March 2003 to 9 January 2004, as many cases as possible were processed using the ImageChecker®. However, not all the available roller viewers could be equipped with screens for the display of CAD prompts, so not all the films handled by the unit over the period were processed using the ImageChecker®. Although films were not randomly assigned to be read either with or without the ImageChecker®, there was no plausible bias in the allocation.

Films processed using the ImageChecker® were then displayed on a roller viewer equipped with the R2 CheckMate Ultra screen for the display of prompts. Films were read independently by film readers who recorded a first assessment before inspecting the prompts and a second assessment after inspecting the prompts. All of the films, whether processed by the ImageChecker® or not, were double read by two independent trained film readers. Cases considered normal by both readers were not recalled. Cases marked by either reader were allocated to "arbitration", a process involving discussion with two further consultant radiologists with a view to recall for further clinical assessment. Readers could also indicate that the case should be recalled for technical reasons or that the case would have to be re-read because prior films had not been retrieved. Such cases are excluded from our analyses.

Results
During the period of the study, 19 502 women attended for screening and 6111 cases were read with CAD. A total of 62 cancers were detected in 61 women. The ImageChecker® correctly prompted the cancer in 84% of cases (51/61). The false prompt rate, measured in a randomly chosen group of 613 non-cancer cases, was 0.40 prompts per film or 1.59 per case.

Table 1 shows the differences between the CAD and non-CAD groups for the percentage of cases sent to arbitration and recalled to assessment clinics, as well as the cancer detection rate. It can be seen that there was a significant difference between the two groups on recall to arbitration and assessment. It therefore seems that CAD does have an impact on the readers' decisions to recall patients. The difference in cancer detection rates between the two groups is not statistically significant. This might suggest that CAD increases recall rates without increasing detection rates, implying a negative impact on specificity. Of course, the sample size for the comparison of cancer detection rates is smaller than that for the comparison of recall rates. The difference in the cancer detection rates appears broadly in line with the differences in recall rates, and it might be thought that there is an effect here that would be revealed in a larger study.


Table 1. Comparison of proportion of cases recalled to arbitration or assessment, and cancer detection rates in the computer-aided detection (CAD) and non-CAD groups. 95% confidence intervals for the proportions are given in parentheses. The recall to arbitration rate is not recorded in the clinic and, for the non-CAD group, it was estimated from a random sample of 1688 cases

 
Table 2 shows the proportion of the cases attributed to CAD at each point in the process, that is to say the proportion of the total number of observations where the reader only recalled the case after viewing the CAD prompts. The figures here suggest that the impact of CAD is much weaker than would be suggested by the between-groups comparison. Of the 118 occasions when a cancer was detected by a reader, there were only 2 cases where the prompts affected the reader's decision.


Table 2. The number of cases attributed to computer-aided detection (CAD) at each point in the process

 

    Discussion
Each of the above studies has strengths and weaknesses. A robust conclusion about the likely value of CAD should be based not on the findings of any single study, nor indeed on the results of a series of studies from a single team of investigators. There is clear evidence from many studies on the sensitivity and specificity of CAD machines that they are able to detect cancers that are otherwise missed by radiologists [7].

The evidence from studies such as Study One and Study Two above, that CAD has an impact on radiologists' decision-making in experimental conditions, is less strong. Some studies, such as that of Chan et al [8] have shown an impact. Interestingly, Chan et al used receiver operating characteristic (ROC) analysis to explore the difference between CAD and non-CAD conditions. This is a more sensitive technique than that used in our studies, which simply tested for a significant difference between the sensitivities recorded in the two conditions. ROC analysis measures the difference between two conditions across a range of decision thresholds. It is therefore possible that an improvement in sensitivity detected using ROC analysis would not translate into a significant change at the actual threshold used by readers in making clinical decisions.

Many studies have failed to show an impact but these clearly lacked adequate power [9, 10]. Our first study was much larger than most previous studies and was expected to be more powerful. However, the extra statistical power was anticipated to come from the use of a larger number of readers than had been involved in earlier studies. Increasing the number of readers only augments the power of a study if readers behave differently. There is evidence that radiologists do vary substantially in terms of their overall sensitivity and specificity [11–13]; many readers also report, anecdotally, that they believe they vary in terms of their performance on particular signs. This variability is one reason why double reading is better than single reading. However, if the data from Study One are used to simulate double reading, it seems no better than single reading. This is because the readers in Study One achieved broadly similar levels of performance. This fact somewhat undermines the study's claim to statistical power and also suggests that the behaviour of the readers is different in some important way from real life.

A case-by-case analysis of the cancers used in Study One revealed that there were 15 "too difficult cases", where there was no possibility of detecting an improvement due to prompting because the ImageChecker® failed to prompt the cancer. There were 29 "too easy cases", where there was no real scope for detecting an improvement because more than 90% of readers detected the cancer in the unprompted condition. There were, therefore, only 16 cases on which there was any real scope for detecting improvements. If we restrict attention to just these cases, the power of the study is greatly reduced (simulations suggest a power of approximately 50% to detect a 5% improvement in sensitivity). It was this analysis that moved us to attempt Study Two.
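The case-by-case sorting described above can be sketched as a simple rule. The function, the threshold argument and the example data are hypothetical, and the order of the two checks is an assumption: the text does not say how a case would be labelled if it was both unprompted and detected by more than 90% of readers.

```python
from collections import Counter

def classify_case(prompted, unprompted_detection_rate, easy_threshold=0.90):
    """Label a cancer case by how much scope it leaves for CAD to help.

    Assumption: an unprompted cancer counts as "too difficult" even if most
    readers found it; the source does not say how such overlaps were handled.
    """
    if not prompted:
        return "too difficult"  # CAD gave no correct prompt, so it cannot help
    if unprompted_detection_rate > easy_threshold:
        return "too easy"       # nearly every reader finds it without prompts
    return "informative"        # prompted, and missed often enough to improve

# Hypothetical cases: (correctly prompted?, fraction of readers detecting unprompted)
cases = [(False, 0.40), (True, 0.96), (True, 0.62), (True, 0.90), (False, 0.95)]
counts = Counter(classify_case(p, r) for p, r in cases)
print(counts)
```

Applied to Study One, the counts reported in the text are 15 "too difficult", 29 "too easy" and 16 remaining cases, which together account for all 60 cancers.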

One further observation from Study One, which has been neglected in earlier publications, is that amongst the negative results there was one statistically significant effect. Although performance was similar in the intervention and control conditions, it did vary from roller to roller. One explanation for this would be that readers were expecting to find roughly similar numbers of cancers on all three rollers and so missed cancers at the end of the roller having the most cancers. It should be noted that although the order in which readers attempted the three test rollers was randomised, all readers first completed a training roller that may have served to calibrate their expectations. One of the features of this research is that relatively minor features of the experimental design appear to have a much larger impact than one might expect.

Conclusions drawn from Study Two are necessarily rather nuanced. The simulation of double reading shows a statistically significant improvement over single reading, confirming that the study avoided at least one of the problems of Study One. The effect due to CAD is not statistically significant but is very close to significance. This might be thought encouraging for CAD but it should be remembered that, unlike Study One, this study uses a set of cases specifically selected to maximise the chances of detecting an improvement due to CAD. The cases were selected both for difficulty (all missed by at least one reader previously) and correct prompting. The majority of the 40 cancer cases had been missed by the first reader at screening. They made up 14% of screen-detected cancers and 61% were prompted by CAD. Our test set is thus representative of approximately 8% of screen-detected cancers, weighted to be the most difficult cases that CAD will detect.
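The "approximately 8%" figure appears to follow from multiplying the two percentages quoted above. The sketch below makes that reading explicit; it assumes that the 61% refers to the share of first-reader misses that CAD prompted, which is an interpretation of the text rather than something it states outright.

```python
# Percentages quoted in the text
missed_by_first_reader = 0.14   # share of screen-detected cancers
prompted_by_cad = 0.61          # share of those misses that CAD prompted (assumed reading)

# Share of screen-detected cancers that are both missed by the first reader and prompted
share = missed_by_first_reader * prompted_by_cad
print(f"{share:.1%}")
```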

One of the interesting facts about Study Two is that a high proportion of correctly prompted cancers were not recalled. Some people in the field seem to have anticipated that all prompted cancers will be recalled [14]. This is clearly not the case. Our evidence suggests that a high proportion of prompted cancers will still be missed. We attempted to analyse why prompted cancers were not recalled, and the strongest factor seemed to be the film reader's prior confidence in his or her interpretation of the prompted findings as being benign. The version of the ImageChecker® used in this study was able to indicate that certain prompts carried a higher confidence; this information seemed valuable, as fewer of the markers emphasised in this way were overruled by film readers.

A very striking methodological point emerges from Study Three. The designs replicated in this study, namely the Freer and Ulissey [5] and the Gur et al [6] designs, produce sharply different estimates of the impact of CAD. The point is dramatically illustrated in Table 3. The figures in the first column of data are taken from Table 2. The figures in the second column are the differences between the first two columns of Table 1 expressed as a proportion of the figure in the first column of Table 1. The comparison shows the extent to which the data are shaped by the methodology used. The data from the Freer-style analysis are probably the more secure, since that is the more controlled form of experiment. It is possible that the Gur-style analysis is contaminated in some way: the presence or absence of CAD may not be the only difference in the way cases were read in the CAD and non-CAD groups. For example, there may have been differences in how many cases certain readers read in the two groups, or readers in the CAD group may have been operating at higher levels of vigilance because they were aware that they were participating in a trial.


Table 3. Percentage of cases attributed to computer-aided detection (CAD) by the two study designs

 
This series of studies does not provide any evidence that CAD is effective. It may, however, be that CAD can have an effect in other settings. The cancer detection rate in the CAD group in Study Three was extremely high, which suggests that very few cancers were being missed by these readers pre CAD. It may be that some feature of the screening protocol used, double reading with arbitration, prevented CAD from having the kind of impact that we anticipated. It may be that CAD could have had an effect if it had been introduced differently. Our readers were trained in the use of the device and acquired substantial experience in using it, but there are other forms of training that might have improved their capacity to take advantage of the prompts.

Interestingly, of the 12 cancers that were missed by a reader in Study Three, 9 were prompted by CAD but only 2 were recalled on the strength of the prompts. Although readers recalled significant numbers of additional cases on the strength of false prompts, the low specificity of prompts means that a number of true prompts are, inevitably, dismissed.

These studies should not be taken as the final word on CAD. Any evaluation is simply a snapshot that measures the performance of a rapidly evolving technology at a single moment in time. The results of future trials may make for an interesting comparison with our work.


    Acknowledgments
 
Funding for these studies was provided by the NHS HTA board and by the NHS Breast Screening Programme. The views and opinions expressed herein are, however, those of the authors and do not necessarily reflect those of the Department of Health or the NHS Breast Screening Programme. This paper draws on work carried out in collaboration with Ms Jo Champness, Dr Kathy Johnston and Dr Lisanne Khoo. The participation of staff at the Duchess of Kent Screening Unit, St George's Hospital, London, is gratefully acknowledged, as is the contribution of Dr Henry Potts who advised on statistical analyses.


    References

  1. Taylor PM, Champness J, Given-Wilson RM, Potts HWW, Johnston K. An evaluation of the impact of computer-based prompts on screen readers' interpretation of mammograms. Br J Radiol 2004;77:21–7.
  2. Taylor P, Given-Wilson R, Champness J, Potts H, Johnston K. Assessing the impact of CAD on the sensitivity and specificity of film readers. Clin Radiol 2004;59:1099–105.
  3. Taylor P, Khoo L, Given-Wilson R. Prospective study of the new release of the R2 ImageChecker in the UK screening setting. In: Pisano E, editor. Proceedings of the Seventh International Workshop on Digital Mammography; 2004 June 18–21; Chapel Hill, NC. [in press.]
  4. Astley SM, Gilbert FJ. Computer-aided detection in mammography. Clin Radiol 2004;59:390–9.
  5. Freer TW, Ulissey MJ. Screening mammography with computer-aided detection: prospective study of 12,860 patients in a community breast center. Radiology 2001;220:781–6.
  6. Gur D, Sumkin JH, Rockette HE, Ganott M, Hakim C, Hardesty L, et al. Changes in breast cancer detection and mammography recall rates after the introduction of a computer-aided detection system. J Natl Cancer Inst 2004;96:185–90.
  7. Warren Burhenne LJ, Wood SA, D'Orsi CJ, Feig SA, Kopans DB, O'Shaughnessy KF, et al. Potential contribution of computer-aided detection to the sensitivity of screening mammography. Radiology 2000;215:554–62.
  8. Chan H, Doi K, Vyborny C, et al. Improvements in radiologists' detection of clustered microcalcifications in mammograms. The potential of computer-aided diagnosis. Invest Radiol 1990;25:1102–10.
  9. Brem RF, Schoonjans JM. Radiologist detection of microcalcifications with and without computer-aided detection: a comparative study. Clin Radiol 2001;56:150–4.
  10. Ciatto S, Del Turco MR, Risso G, Catarzi S, Bonardi R, Viterbo V, et al. Comparison of standard reading and computer aided detection (CAD) on a national proficiency test of screening mammography. Eur J Radiol 2003;45:135–8.
  11. Blanks RG, Wallis MG, Given-Wilson RM. Observer variability in cancer detection during routine repeat (incident) mammographic screening in a study of two versus one-view mammography. J Med Screen 1999;6:152–8.
  12. Beam CA, Conant EF, Sickles EA. Factors affecting radiologist inconsistency in screening mammography. Acad Radiol 2002;9:531–40.
  13. Gur D, Sumkin JH, Hardesty LA, Clearfield RJ, Cohen CS, Ganott MA, et al. Recall and detection rates in screening mammography. Cancer 2004;100:1590–4.
  14. Brem RF, Baum J, Lechner M, et al. Improvement in sensitivity of screening mammography with computer-aided detection: a multi-institutional trial. AJR 2003;181:687–93.



