BJR

British Journal of Radiology (2005) 78, S41-S45
© 2005 British Institute of Radiology
doi: 10.1259/bjr/25058162


Paper

The manufacturer's perspective

J Roehrig, PhD

R2 Technology, Inc, 1195 W. Fremont Ave, Sunnyvale, CA 94087, USA


    Abstract
A review of the evidence for the efficacy of a commercially available computer-aided detection (CAD) system is presented. Retrospective studies point to a potential capability of detecting approximately 20% of cancers at least 1 year earlier than they are normally detected by single reading of mammograms. Several prospective studies have now been published, showing actual increases in cancer detection ranging from 1.7% to 18%. Suggestions are given for closing the gap between the potential and actual performance with CAD, including: improving the specificity of the CAD algorithm; improving the training of end users; presenting more or better information to the radiologist; and the possible need for decision support tools.


    Introduction
The documentation that R2 provided in support of its application for acceptance by the US Food and Drug Administration (FDA) of its ImageChecker® computer-aided detection (CAD) system incorporated results from the first large statistical study to demonstrate the potential for CAD to significantly reduce radiologists' false negative rate. This study was conducted at 13 institutions throughout the USA over the 2-year period 1996–1998 [1]. Before this, no published study had reported the false negative rate of radiologists reading mammograms based on a sufficient number of patients. All available screening mammograms that led to the detection of biopsy-proven cancer (n=1083) were collected and analysed, together with the most recent corresponding prior mammograms that could be found (n=427). To establish safety and efficacy, the three issues of most concern to the FDA panel were:

  1. What is the false negative rate of mammographers?
  2. Of the false negative cases, did the CAD system have the capability of detecting a significant number?
  3. In doing so, would clinical use of the CAD system result in a significant increase in work-up rate?

To determine the false negative rate, a number of issues need clarification. To determine sensitivity, one needs to divide the number of cancers detected by the number that "should have been detected". The latter quantity cannot be determined without first deciding on a strategy for measuring "should have" that is both reasonable and objective. It is well known that a large fraction of prior cases show some sign of cancer when viewed retrospectively by someone who knows the true cancer location [2]. In the data set referred to above, 67% (286/427) of the cases have a visible sign in the prior mammogram. However, it would be unrealistic to conclude that the false negative rate is 67%, because many of these cases would not be followed up in an actual clinical environment. Many of the mammograms have minimal signs that are essentially indistinguishable from normal breast tissue without prior knowledge from a source other than the mammogram. It was therefore necessary to devise a means of determining what could "reasonably" be detected in these 286 visible priors. For this reason, Burhenne et al [1] developed the concept of "consensus of actionability". The 286 visible prior cancer cases were mixed with normal cases and then read by panels of five radiologists who were blinded to the truth. The actionability of each case was simply the fraction of the five radiologists who would have decided to follow up the case. The set of cases also contained a number of current cancer cases. Sensitivity on the current cases and specificity on the normal cases were monitored to ensure that the panel radiologists were operating within normal parameters during the test. Table 1 shows the distribution of cases from this portion of the study.


Table 1. Distribution of visible priors in "actionability"

 
To understand these data, consider the number of cases in each consensus category. The 83 cases in the 0/5 group had signs that were visible retrospectively to a radiologist who knew where the cancer was eventually detected. However, these signs were so subtle that none of the five radiologists without this knowledge would have recalled the patient. These 83 cases were therefore removed entirely from the "should have been detected" category. At the opposite extreme, all five radiologists would have recalled 36 cases, which can therefore "reasonably" be placed in the "should have been detected" category. These have an actionability of 5/5, or 1.0, and therefore contribute their entire number, 36, to the fourth column. The cases in the intermediate groups arguably have some degree of ambiguity as to whether or not they should have been recalled. Clearly, though, the higher the consensus (the number out of five), the more a case belongs in the "recall" category. The authors therefore weighted each case by its consensus and entered the weighted counts in the fourth column, which contains the number that, in some sense, "should have been detected". The 115 cases in the fourth column therefore represent the total number of the 286 prior cases with visible signs of cancer that can reasonably be placed in the false negative category. There are other ways of deriving such numbers. For example, counting only the cases that a majority of panel radiologists thought should be recalled, namely the 3/5, 4/5 and 5/5 categories, gives 38+38+36=112 false negatives, an almost identical number. In most other studies of false negative rates, a radiologist familiar with the cases and with knowledge of the truth made a subjective judgment. Recall that these 115 cases are the number, out of 427 total prior cases, that should have been recalled 1 year earlier. Therefore, of the 1083 current cancers, one would expect 1083/427 × 115 ≈ 292 to have been detectable in the prior year. The sensitivity was therefore 1083/(1083+292) = 79%, for a false negative rate of 21%.
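The actionability weighting and the extrapolation to all 1083 cancers can be checked with a short Python sketch. The function names are hypothetical. Only the consensus groups quoted above (3/5, 4/5 and 5/5) are used, because the 1/5 and 2/5 counts appear only in Table 1. The full weighted total of 115 is therefore taken as given.

```python
PANEL_SIZE = 5  # each visible prior was read by a panel of five radiologists


def weighted_actionable(counts_by_votes: dict[int, int]) -> float:
    """Sum of cases weighted by actionability (votes / panel size)."""
    return sum(n * votes / PANEL_SIZE for votes, n in counts_by_votes.items())


def majority_actionable(counts_by_votes: dict[int, int]) -> int:
    """Alternative rule: count only cases recalled by a majority (3 or more of 5)."""
    return sum(n for votes, n in counts_by_votes.items() if votes > PANEL_SIZE // 2)


def sensitivity(current_cancers: int, prior_cases: int,
                actionable_priors: float) -> tuple[float, float]:
    """Scale actionable misses up to the full cancer set; return (misses, sensitivity)."""
    missed = current_cancers / prior_cases * actionable_priors
    return missed, current_cancers / (current_cancers + missed)


# Consensus groups quoted in the text (votes -> number of cases).
high_consensus = {3: 38, 4: 38, 5: 36}
print(majority_actionable(high_consensus))             # 112
print(round(weighted_actionable(high_consensus), 1))   # 89.2 (1/5 and 2/5 groups omitted)

missed, sens = sensitivity(1083, 427, 115)
print(round(missed), f"{sens:.0%}")                    # 292 79%
```

Adding the weighted 1/5 and 2/5 groups from Table 1 to the 89.2 above brings the total to the reported 115. The 0/5 group contributes nothing by construction.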

The second question was whether CAD is able to detect these false negatives. The prior cases were analysed by the CAD system, and the data are summarised in Table 2, which shows the number of cases in each consensus category that the CAD system detected.


Table 2. Computer-aided detection (CAD) performance on prior cases

 
Although CAD detected 181 of the 286 visible priors, we conservatively assume that, except in the high-consensus groups, radiologists would be unlikely to recall a case simply because of a CAD prompt. We therefore weight the benefit by the actionability. In this way, CAD prompted a weighted total of 96 of the 115 false negative cases that should have been detected in a prior year, or 84%. We conclude that 96 of the 115 cancers that the panel radiologists considered actionable, but that the original screening radiologist missed, would be detected with the CAD system.

Additionally, the 1083 current cancer cases were analysed by the CAD system, which detected 975 (90%) of the 1083 cases.

Figure 1 summarises the findings so far in answer to the first two issues. Of the actionable cases, the radiologists alone detected 79% in the current year. CAD marks 90% of these, or 71% of all cases. In addition, CAD marks 18% of the 21% that a single radiologist misses. If the radiologist picks up all of these additional 18% of actionable cases, the combined sensitivity of the radiologist with CAD would be 97%.



Figure 1. Radiologists alone detected 79% in the current year (left), of which computer-aided detection (CAD) marks 90%, or 71% of cases. CAD also marks 18% of the 21% that a single radiologist misses (middle). If the radiologist picks up all 18% of the additional actionable cases, the combined sensitivity of the radiologist with CAD would be 97% (right).
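The arithmetic summarised in Figure 1 can be reproduced by multiplying the rates quoted above. This is a sketch only. The 97% figure is an upper bound, and it assumes, as the text does, that the radiologist acts on every correct prompt.

```python
# Fractions of all actionable cancers, as reported in the text.
radiologist_sensitivity = 0.79   # detected by a single reader
cad_on_detected = 0.90           # CAD marks 90% of current-year cancers
cad_on_missed = 0.84             # CAD prompts 84% (96/115) of actionable misses

missed = 1 - radiologist_sensitivity                              # 21% missed
detected_and_marked = radiologist_sensitivity * cad_on_detected   # about 71%
extra_if_all_acted_on = missed * cad_on_missed                    # about 18%
upper_bound = radiologist_sensitivity + extra_if_all_acted_on     # about 97%

print(f"{detected_and_marked:.0%} {extra_if_all_acted_on:.0%} {upper_bound:.0%}")
# 71% 18% 97%
```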

 
In conclusion, attending radiologists at the 13 institutions under study had a 21% false negative rate, and the CAD system demonstrated the capability to reduce this rate significantly.

The remaining question concerned the effect of CAD use on the overall recall rate. A prospective study was conducted at five sites and involved 14 radiologists. Historical data for a total of 23 682 cases read before installation of CAD showed a recall rate of 8.3%. After CAD was installed at the five sites, 14 817 cases showed a recall rate of 7.6%, a statistically insignificant change. The conclusion was that use of the CAD system did not result in a significant increase in the recall rate.


    Prospective studies to date
Although these results were gratifying to the manufacturer, R2 was well aware that, because the results were retrospective, they showed only the potential for improvement, not the actual improvement. The actual benefit of CAD can only be proven in prospective studies. Freer and Ulissey [3] performed the first prospective study. Over a period of 1 year, they recorded the initial interpretation of each mammogram, followed by a re-evaluation after review of the CAD prompts. The authors observed a 19.5% increase in the number of cancers detected due to the CAD prompts, with a modest increase in recall rate from 6.5% to 7.7%. In the population of 12 860 patients who participated in the study, the radiologist detected 41 malignancies before consulting the CAD prompts and 49 after, an increase of 8 malignancies. It is important to note that statistical significance is not an issue in this study, since an absolute change of diagnosis was recorded.
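The figures reported by Freer and Ulissey can be restated as relative changes and as detection rates per 1000 women screened. The per-1000 rates are not quoted in the text. They follow directly from the 12 860 patients and the cancer counts above. The helper name is hypothetical.

```python
def relative_increase(before: float, after: float) -> float:
    """Relative change from `before` to `after` (0.1 means a 10% increase)."""
    return (after - before) / before


patients = 12_860
cancers_before_cad, cancers_after_cad = 41, 49
recall_before, recall_after = 0.065, 0.077

print(f"{relative_increase(cancers_before_cad, cancers_after_cad):.1%}")  # 19.5%
print(f"{1000 * cancers_before_cad / patients:.2f} -> "
      f"{1000 * cancers_after_cad / patients:.2f} cancers per 1000")      # 3.19 -> 3.81 cancers per 1000
print(f"{relative_increase(recall_before, recall_after):.1%}")            # 18.5%
```

The relative rise in recall rate (about 18.5%) is of the same order as the relative rise in cancers found.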

Gur et al [4] followed another methodology, simply reporting the raw rate of breast cancer detection in their clinic over an 18-month period before and after installation of a CAD system. The authors concluded that the observed 1.7% increase in cancer detection rate was not statistically significant (95% confidence interval (CI) −11% to 19%). The 95% confidence interval places this result well within the range of all other prospective results, as well as results with human second reading. Nevertheless, there has been a strong tendency to interpret such results as a "negative result". For this reason, it is important to draw attention to some serious shortcomings of this type of measurement, as a number of leading radiologists pointed out in subsequent Letters to the Editor of the Journal of the National Cancer Institute. It is well known that many factors affect the cancer detection rate, such as age group and the ratio of prevalence to incidence cases. In Gur et al's work, none of the factors that can change the cancer detection rate over a 3-year period were controlled for or taken into account. Indeed, the authors noted that the fraction of patients undergoing mammography for the first time decreased from 40% to 30% between the beginning and the end of the study. Such a change alone could account for a decrease in cancer detection of up to 10% [5]. Similarly, even in a fixed population composed entirely of incidence cases, the cancer detection rate has been observed to decrease over time. This is because each screening round removes the more easily detected cancers, leaving fewer and more difficult cases to be detected in subsequent rounds. Finally, we should be careful to understand the intended purpose of CAD or, for that matter, of human double reading. Any initial increase in cancer detection must come from the population of cancers that would otherwise have been detected later. Therefore one would not expect the absolute number of cancers to change when integrated over time. What one might expect is that the distribution of cancer stages at detection should move progressively toward earlier stages. The authors did not report any data on the staging of cancers detected either before or after installation of CAD.
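A weighted average of first-round (prevalence) and repeat (incidence) detection rates shows how a shift in case mix alone can move the raw detection rate. The two per-1000 rates below are hypothetical round numbers chosen only for illustration and are not taken from Gur et al. Only the 40% and 30% first-screen fractions come from the text.

```python
def screening_detection_rate(first_round_fraction: float,
                             prevalence_rate: float,
                             incidence_rate: float) -> float:
    """Expected cancers per 1000 screens for a mix of first and repeat screens."""
    return (first_round_fraction * prevalence_rate
            + (1 - first_round_fraction) * incidence_rate)


# Hypothetical illustrative rates per 1000 screens (NOT from Gur et al).
PREVALENCE_RATE = 6.0
INCIDENCE_RATE = 3.0

start = screening_detection_rate(0.40, PREVALENCE_RATE, INCIDENCE_RATE)
end = screening_detection_rate(0.30, PREVALENCE_RATE, INCIDENCE_RATE)
print(f"{start:.2f} -> {end:.2f} per 1000 ({(end - start) / start:.1%})")
# 4.20 -> 3.90 per 1000 (-7.1%)
```

With these assumed rates, the change in case mix alone lowers the detection rate by about 7%. That is the same order as the effect cited above, and it occurs with no change in reading performance.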

Several other prospective studies have been conducted and reported at conferences, showing increases in detection rate that vary from 5% to 18% [6–9]. Most of these measured the rate at which the radiologist actually changed a diagnosis on malignant cases, rather than the raw cancer detection rate. An exception was the study conducted by Cupples [6], who, like Gur, measured the change in cancer detection rate but kept track of the effect of confounding factors. His paper has been submitted for publication, and the other studies are expected to follow. The study that is perhaps most interesting, and may yet prove most enlightening, was conducted by Young et al [9]. They found that the CAD system marked 21% of the cancers that a single radiologist missed, but only one-third of those, or 7%, actually induced the radiologist to change her mind during the study. In this case the actual benefit of CAD was one-third of the potential benefit. This is a puzzling and important observation. It suggests that much more benefit could be obtained by understanding how and why two-thirds of the true marks could be ignored than by expending great effort on further incremental increases in sensitivity. This is where our future efforts will be focused. Also, contrary to conventional wisdom about the benefit of CAD, the CAD-prompted cancers in this study were all found to be invasive.

It should be clear by now that CAD has more than sufficient sensitivity to detect false negative cases earlier. Burhenne et al [1] showed this conclusively, and more recent versions of the algorithm significantly reinforce that conclusion. We have seen that a second necessary condition is that uncontrolled variables which may change detection rates be controlled. This makes the Freer-style protocol [3] more reliable than simple measurements of raw detection rates. A further condition is that the number of cancer cases collected in a study be large enough to include a non-negligible number of false negatives that CAD has the potential to mark. Finally, we believe that the major open question affecting the usefulness of CAD concerns the radiologist's response to CAD prompts. We know that radiologists may dismiss true marks, sometimes on lesions that are not subtle. This is perhaps the most important unanswered question in CAD: what affects the radiologist's ability to respond correctly to correct CAD information?

An effort to begin answering this question was made in an interesting study by Karssemeijer [10], in which the author simulated combined performance using both the radiologist's assessment and the CAD system's suspiciousness index. Karssemeijer performed a localisation receiver operating characteristic (LROC) analysis of the 115 mass lesions in his study. He found that the mean sensitivity of the radiologists increased by 7% with CAD and by 10.5% with double reading. This study has several distinctive features: it involved only masses; it involved lesions missed because of incorrect interpretation; and the LROC analysis was done using an independent addition, or synthesis, of the radiologist's and the CAD system's likelihood assessments. One of its most intriguing findings is that the radiologists were not yet able to make full use of the information provided by CAD. Indeed, an identical analysis subsequently carried out with a more recent version of the CAD algorithm (version 8.0) showed that the synthesised single reading plus CAD performance matched double reading, yet the single radiologists using CAD did not reach this performance level. This suggests that sufficient information is available to equal human double reading, but that it has not yet been fully utilised.
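Karssemeijer's exact combination rule is not described here. The following is therefore only a minimal sketch of what an "independent synthesis" of two assessments could look like. It assumes that each reading gives every region a score. Reader ratings and CAD indices are converted to percentile ranks so that their scales are comparable, and the ranks are then averaged. The function names, the equal weighting and the ratings are all hypothetical.

```python
def percentile_ranks(scores: list[float]) -> list[float]:
    """Map scores to ranks in [0, 1] so that differently scaled ratings become comparable.

    Ties are broken by input order, which is acceptable for a sketch.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    denominator = max(len(scores) - 1, 1)  # avoid division by zero for a single score
    ranks = [0.0] * len(scores)
    for position, i in enumerate(order):
        ranks[i] = position / denominator
    return ranks


def synthesize(reader: list[float], cad: list[float], cad_weight: float = 0.5) -> list[float]:
    """Combine independent reader and CAD assessments as a weighted mean of percentile ranks."""
    reader_ranks, cad_ranks = percentile_ranks(reader), percentile_ranks(cad)
    return [(1 - cad_weight) * r + cad_weight * c for r, c in zip(reader_ranks, cad_ranks)]


# Hypothetical data for five regions: reader confidence (0-100) and CAD index (0-1).
reader_scores = [90, 20, 60, 10, 40]
cad_scores = [0.80, 0.95, 0.30, 0.10, 0.50]
print(synthesize(reader_scores, cad_scores))
# [0.875, 0.625, 0.5, 0.0, 0.5]
```

The second region (index 1) is rated low by the reader but ranked highest by CAD. It moves from fourth to second place in the combined ordering, which a single reader who dismissed the prompt would not achieve.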


    The key issue for CAD
We hope that we have shown that, contrary to the concern on which many radiologists purchasing CAD systems focus, there is by now ample evidence that CAD has adequate sensitivity to reduce false negatives, and that CAD does indeed mark actionable lesions that are otherwise missed. The remaining issue concerns the radiologist's actions, or behaviour, when using CAD. Specifically, why do radiologists dismiss true prompts? Many explanations have been proposed, including: a false marker rate too high for true marks to receive attention; inadequate training of users; the need to present more information to the radiologist; and the need for decision support tools. We touch on each of these in the remaining discussion.

The false marker rate
With regard to the false positive rate, it is our experience that a threshold producing approximately two false marks per case on average appears to be the largest false positive rate that is practical (or tolerable) across a large installed base of approximately 1400 systems. Experience has shown that at a false positive rate of nearly three per case, over one-half of users will complain about the false positive rate and lose confidence in the system. Furthermore, the tolerance for false positives, or the ability to dismiss them easily, varies from user to user. Our solution to this issue has been to make steady improvements in CAD algorithm performance, reducing the false positive rate for masses at constant sensitivity by approximately 30% between software releases. In recent years these improvements have focused on the mass detection algorithms, since the algorithms for detecting microcalcifications already appear to meet, or even exceed, most expectations. At the same time, recent releases of the software provide multiple thresholds, or operating points, to meet the varying demands of users. Table 3 shows the performance improvement over the last three algorithm releases, at a threshold chosen to keep the sensitivity constant at 90%.


Table 3. False positive rate at constant overall sensitivity
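A constant-sensitivity comparison of this kind amounts to two steps. First, for each algorithm version, choose the threshold at which a fixed fraction of true lesions is still marked. Then count the false marks that survive that threshold. The sketch below assumes that each candidate region carries a single suspiciousness score. The function names and all scores are hypothetical.

```python
import math


def threshold_for_sensitivity(lesion_scores: list[float], target: float) -> float:
    """Highest threshold at which at least `target` of the true lesions are still marked."""
    ranked = sorted(lesion_scores, reverse=True)
    # Round before ceil so that e.g. 0.9 * 10 is not pushed up by floating-point error.
    needed = max(1, math.ceil(round(target * len(ranked), 9)))
    return ranked[needed - 1]


def false_marks_per_case(false_scores: list[float], n_cases: int, threshold: float) -> float:
    """Mean number of false-positive candidates per case scoring at or above `threshold`."""
    return sum(score >= threshold for score in false_scores) / n_cases


# Hypothetical scores: 10 true lesions, and false candidates pooled from 4 cases.
demo_lesions = [0.97, 0.93, 0.91, 0.88, 0.85, 0.80, 0.76, 0.70, 0.64, 0.40]
demo_false = [0.90, 0.72, 0.66, 0.50, 0.30, 0.65, 0.20, 0.81, 0.60, 0.69]

t = threshold_for_sensitivity(demo_lesions, 0.90)
print(t, false_marks_per_case(demo_false, n_cases=4, threshold=t))
# 0.64 1.5
```

Running the same two functions on each version's scores, with the target fixed at 0.90, yields one false-marks-per-case figure per release. These are directly comparable because sensitivity is held fixed.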

 
Table 4 shows the performance of the CAD algorithm over the last three software releases at the operating points actually available. Version 8.0 has been submitted to the FDA for approval. It is worth pointing out that this most recent version, at the tightest threshold, has higher sensitivity than the version of the software reported in the original FDA study presented by Burhenne et al [1], with a far lower false marker rate.


Table 4. Available operating points

 
Training in the use of CAD
Training radiologists in the use of CAD has proved more important than originally thought. Because many normal cases will have false marks, we have long been aware that radiologists first exposed to CAD prompts without explanation are frequently puzzled by what the system marks, especially on normal cases. This can lead to loss of confidence in the device and a subsequent lack of attention to true marks, resulting in less than optimal performance. For this reason we consider it an important part of the sales process to complete the installation of every new system with a 1-day training session. This session is given by applications personnel who have been trained in mammography. The applications training for the radiologist consists of a review of approximately 20 test cases showing both the strengths and the weaknesses of the system, as well as typical false negatives and false positives. If possible, a sample of cancer cases from the site, with prior films, is also scanned and reviewed. The goal of this training is to give the user a workable understanding of the areas in which the system may help, and a realistic approach to dismissing false positives. Some evidence is accumulating that some radiologists need an even longer training period before they can make full use of the CAD information. Kathy Willison, of the Elizabeth Wendy Logan Clinic, studied the data from the clinic's prospective study in 2002. Based on that study, she has expressed the belief that the primary reason radiologists dismissed true marks was insufficient experience in the use of CAD (personal communication). In the study performed by Roelofs and Karssemeijer [11], five radiologists were trained on 48 cases with CAD, and five studied 148 cases. They too believe that the difference in training can be shown to have a positive effect.
Finally, Sue Astley and Fiona Gilbert are performing a study for the UK screening programme that involves 7 weeks of training with over 650 cases with CAD, including feedback. Preliminary results look encouraging but have not yet been finalised.

Presentation of additional information from the CAD algorithm
The radiologist may occasionally need more information than simply the presence of a CAD prompt. We have observed that displaying specifically what the CAD system determined to be suspicious sometimes helps a radiologist, particularly in dismissing false positives. For example, by observing the segmentation of the calcifications found, the radiologist may better understand why the algorithm displayed a prompt, even if s/he then chooses to dismiss it. Ultimately, we believe that even more quantitative information about a lesion should be made available to support the radiologist's decision regarding that lesion. In the past we provided a display that simply showed which marks, or suspicious regions, exceeded a certain threshold, without distinguishing between marks. In fact, however, not all marks are equally likely to indicate a true lesion. The computer algorithm uses a neural network classifier to rank all initial suspicious regions and assigns a "probability", or "index of suspiciousness", to each region. On average, true lesions tend to have a much higher value of this index than false positives. Making the size of each CAD prompt proportional to the computer probability conveys more relevant information directly to the radiologist.
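The size scaling described above can be sketched as a single mapping from the classifier's index to a marker size. This is a minimal illustration, assuming that the index lies in [0, 1]. The function name and the 24-pixel full-scale size are hypothetical choices. A real display might also impose a minimum size so that low-index marks stay visible, at the cost of strict proportionality.

```python
def prompt_size(index: float, full_scale_px: float = 24.0) -> float:
    """Marker size in pixels, proportional to the CAD suspiciousness index (clamped to [0, 1])."""
    clamped = min(max(index, 0.0), 1.0)
    return full_scale_px * clamped


# Hypothetical marks: a high-index region is drawn larger than a low-index one.
for region, index in {"region A": 0.92, "region B": 0.35}.items():
    print(region, round(prompt_size(index), 1))
```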

Decision support tools
Although all commercialisation efforts so far have focused on simple detection of lesions, a significant body of work in the research community indicates that radiologists may benefit from computerised information that helps determine whether a lesion is benign or malignant [12–14]. Indeed, the portion of false negatives due to incorrect interpretation may be at least as large as that due to failed detection. We believe the technology may also aid in the training of new mammographers. A computer-aided diagnosis system can contain a database of many more cancer lesions than a mammography resident, or even an experienced mammographer, would normally encounter. We have implemented a particular form of computer-aided diagnosis called the Reference Library, originally developed by Maryellen Giger [15–17] of the University of Chicago. The Reference Library contains a database of several thousand biopsy-proven cancers and several hundred benign lesions. When the radiologist wants more information about a prompted lesion, s/he can indicate it and call up the Reference Library. The Reference Library immediately compares the lesion in question with the internal database and displays the 12 "closest" lesions. "Closeness" can be defined in terms of overall computer likelihood or in terms of one of several features, such as degree of spiculation.
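A nearest-neighbour search is one natural way to picture the retrieval of the "closest" 12 lesions. The sketch below is not the Reference Library implementation. It assumes that each lesion is described by a few numeric features on a common scale, and it ranks library lesions by Euclidean distance over whichever features the user selects. The class, the function and the feature names `likelihood` and `spiculation` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class ReferenceLesion:
    lesion_id: str
    malignant: bool              # biopsy-proven pathology
    features: dict[str, float]   # hypothetical numeric features on a common scale


def closest_lesions(query: dict[str, float], library: list[ReferenceLesion],
                    feature_names: list[str], k: int = 12) -> list[ReferenceLesion]:
    """Return the k library lesions nearest to `query` over the selected features."""
    def distance(lesion: ReferenceLesion) -> float:
        return math.dist([lesion.features[f] for f in feature_names],
                         [query[f] for f in feature_names])
    return sorted(library, key=distance)[:k]


library = [
    ReferenceLesion("L1", True, {"likelihood": 0.9, "spiculation": 0.8}),
    ReferenceLesion("L2", False, {"likelihood": 0.2, "spiculation": 0.1}),
    ReferenceLesion("L3", True, {"likelihood": 0.7, "spiculation": 0.6}),
    ReferenceLesion("L4", False, {"likelihood": 0.4, "spiculation": 0.3}),
]
query = {"likelihood": 0.75, "spiculation": 0.25}

# The neighbours depend on which notion of "closeness" is chosen.
print([lesion.lesion_id for lesion in closest_lesions(query, library, ["likelihood"], k=2)])   # ['L3', 'L1']
print([lesion.lesion_id for lesion in closest_lesions(query, library, ["spiculation"], k=2)])  # ['L4', 'L2']
```

In this toy example, matching on likelihood returns two malignant neighbours, while matching on spiculation returns two benign ones. This is why the choice of feature matters when such a display is used as decision support.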


    Conclusion
We have presented data showing that a substantial false negative rate exists in screening mammography, and that CAD has adequate sensitivity to these false negatives to potentially improve the performance of a single reader significantly. Prospective studies to date have had mixed results, as have studies of human second reading, but they do provide an existence proof that CAD can improve performance. The full potential, however, appears to be limited by the fact that some radiologists, at least occasionally, dismiss true CAD prompts. This indicates a continued need for training in the use of CAD and for improvement in the specificity of CAD algorithms.


    References

  1. Warren Burhenne LJ, Wood SA, D'Orsi CJ, Feig SA, Kopans DB, O'Shaughnessy KF, et al. Potential contribution of computer-aided detection to the sensitivity of screening mammography. Radiology 2000;215:554–62.
  2. Harvey J, Fajardo L, Innis C. Previous mammograms in patients with impalpable breast carcinoma: retrospective vs blinded interpretation. AJR 1993;161:1167–72.
  3. Freer T, Ulissey M. Screening mammography with computer-aided detection: prospective study of 12,860 patients in a community breast center. Radiology 2001;220:781–6.
  4. Gur D, Sumkin JH, Rockette HE, Ganott M, Hakim C, Hardesty L, et al. Changes in breast cancer detection and mammography recall rates after the introduction of a computer-aided detection system. J Natl Cancer Inst 2004;96:185–90.
  5. Feig SA. Age-related accuracy of screening mammography: how should it be measured? Radiology 2000;214:633–40.
  6. Cupples TE. Impact of computer-aided detection (CAD) in a regional screening mammography program. Radiology 2001;221(P):520.
  7. Bandodkar P, Birdwell R, Ikeda D. Computer aided detection (CAD) with screening mammography in an academic institution: preliminary findings. Radiology 2002;225(P):458.
  8. Morton MJ, Whaley DH, Brandt KR. The effects of computer-aided detection (CAD) on a local/regional screening mammography program: prospective evaluation of 12,646 patients. Radiology 2002;225(P):459.
  9. Young WW, Destounis SV, Bonaccio E, Zuley ML. Computer-aided detection in screening mammography: can it replace the second reader in an independent double read? Preliminary results of a prospective double blinded study. Radiology 2002;225(P):600.
  10. Karssemeijer N. Computer-aided detection versus independent double reading of masses on mammograms. Radiology 2003;227:192–200.
  11. Roelofs T, Karssemeijer N. Effects of computer-aided diagnosis on radiologist's detection of breast masses. In: Pisano E, editor. The Seventh International Workshop on Digital Mammography; 2004 June 18–21; Chapel Hill, NC. Amsterdam, The Netherlands: Elsevier Science B. V., 2004.
  12. Jiang Y, Nishikawa R, Schmidt R, Metz C, Giger M, Doi K. Improving breast cancer diagnosis with computer-aided diagnosis. Acad Radiol 1999;6:22–33.
  13. Veldkamp W, Karssemeijer N, Otten J, Hendriks J. Automated classification of clustered microcalcifications into malignant and benign types. Med Phys 2000;27:2600–8.
  14. Chan HP, Sahiner B, Helvie M. Improvement of radiologist's characterization of mammographic masses by using computer-aided diagnosis: an ROC study. Radiology 1999;212:817–27.
  15. Giger ML, Huo Z, Lan L, Vyborny CJ. Intelligent search workstation for computer-aided diagnosis. In: Lemke HU, Inamura K, editors. CARS 2000: Computer Assisted Radiology and Surgery: Proceedings of the 14th International Congress and Exhibition, 2000 June 28–July 1; San Francisco, CA. Elsevier, 2000:822–7.
  16. Giger ML, Huo Z, Vyborny CJ, Lan L, Bonta I, Horsch K, et al. Intelligent CAD workstation for breast imaging using similarity to known lesions and multiple visual prompt aids. Proceedings of SPIE Medical Imaging: Image Processing 2002;4684:768–73.
  17. Giger ML, Huo Z, Vyborny CJ, Lan L, Nishikawa RM, Rosenbourgh I. Results of an observer study with an intelligent mammographic workstation for CAD. In: Peitgen H-O, editor. Proceedings of the Sixth International Workshop on Digital Mammography; 2002 June 22–25; Bremen, Germany. Berlin, Germany: Springer, 2003:297–303.



