British Journal of Radiology (2006) 79, S123-S126
© 2006 British Institute of Radiology
doi: 10.1259/bjr/37622515
Computer aided detection of masses in mammograms as decision support
N Karssemeijer, PhD 1, J D M Otten, MS 2, H Rijken 3 and R Holland, MD, PhD 3
Radboud University Nijmegen Medical Centre, 1 Department of Radiology, 2 Department of Epidemiology and Biostatistics and 3 National Expert and Training Centre for Breast Cancer Screening, Nijmegen, The Netherlands
Correspondence: Nico Karssemeijer, Department of Radiology, Radboud University Nijmegen Medical Centre, Nijmegen, The Netherlands. E-mail: n.karssemeijer{at}rad.umcn.nl
Abstract
Performance of a computer aided detection (CAD) system for masses in mammograms was investigated. Using data collected in an observer study, in which experienced screening radiologists read a series of 500 screening mammograms without CAD, performance of radiologists was compared to the standalone performance of the CAD system. Due to a larger number of false positives (FPs), the performance of CAD was lower than that of the readers. However, when analysis was restricted to mammographic regions identified by the radiologists, it was found that the CAD system was comparable to the readers in discriminating cancer from non-cancer among these regions. In a retrospective analysis, the effect of independent combination of reader scores with CAD was compared to independent combination of scores of two radiologists. No significant difference was found between the results of these two methods. Both methods improved single reading results significantly.
Introduction
Computer aided detection (CAD) systems have been developed for detection of abnormalities in mammograms. In current practice, locations of suspect regions detected by the computer are presented to the radiologists as prompts, to avoid abnormalities being overlooked [1]. Prospective studies indicate that this can improve the performance of readers [2, 3]. CAD programs have been widely accepted as a tool for detection of microcalcification clusters. Some doubt remains, however, about the effectiveness of the technology for detection of cancers appearing as masses, architectural distortions and asymmetries (referred to as "masses" in the rest of the paper). Radiologists appreciate the high sensitivity of these programs, but often find it difficult to deal with FP (false positive) mass prompts. FPs of microcalcification detection programs bother them less as those can be dismissed more easily. In this paper, we present results that indicate it may not be the quality of CAD programs for mass detection, but the way prompts are presented that hinders recognition of their value.
Experimental evidence suggests that misinterpretation of mammographic abnormalities is a more important cause of errors in breast cancer screening than perceptual oversights. First, in retrospective studies it is found that many missed breast cancers visible as masses are not very subtle. It is highly unlikely that many of those were missed because they were overlooked, in particular in screening programs where double reading is practiced. Second, in practice it appears frequently that a cancer prompted by CAD is not detected, which is understandable as there are too many FP prompts to act on all of them. Assuming that readers do use CAD when it is available, detection errors on regions prompted by CAD can only be explained by interpretation failure. This indicates that use of CAD in screening mammography should be aimed at helping radiologists with interpretation of masses rather than only helping them with visual search. Similar observations have been made in other areas. In an observer study employing eye tracking, it was found that the majority of errors in a cancer detection task in chest radiographs were due to misinterpretation and that very few nodules were missed by search errors [4].
In a previous study, we found that radiologists might be able to use current prompting systems to help with interpretation of masses [5], but that for this purpose the use of CAD should be radically different from what is currently recommended. Instead of ignoring prompts on regions already inspected, the reader should reconsider decisions with respect to inspected regions with the use of CAD, in particular when doubt remains. On the other hand, computer prompts on regions not identified as potential abnormalities might be better ignored, unless the prompt indicates a clear abnormality that was overlooked. In this paper, this way of using CAD mass prompts is investigated further. In particular, we compare standalone performance of CAD with that of the radiologists, where we distinguish two types of FPs of CAD: prompts on non-cancer locations also identified by one or more of the readers, and FP prompts that were not marked by any reader. In addition, we study the effect of independently combining observer scores with CAD prompts, using an improved version of the CAD algorithm.
Materials and methods
Observer study data
In this study we make use of data from an observer study, which has been described in detail previously [5, 6]. Ten experienced screening radiologists from The Netherlands and an international expert panel of five radiologists were involved in the study. In this paper, we use only the results of the ten Dutch radiologists. The study material includes mammograms of 500 cases, of which 250 were prior cancer cases. Half of the cancer cases were interval cancers and the other half were screen detected. All mammograms were taken from the Dutch breast cancer screening program and were randomly selected from cases screened between 1997 and 1999. However, selection was limited to cases in which mammograms of two previous screening rounds were available. Cases with insufficient image quality or poor positioning technique were also excluded. The time interval between subsequent screening mammograms of each case was two years on average. For the interval cancers, the time interval between the diagnostic and prior mammograms was in the range of 3 months to 26 months, with an average of 14 months. The 250 normal cases were selected at random from the same five districts and from the same period as the cases with cancer. These were all negative screening cases; no benign referrals were included.
In the observer study, each radiologist independently read the prior mammograms of all 500 cases during ten sessions spread out over two days. The original films were used and cases were presented in a random order on conventional alternators. Radiologists were unaware of the detection results and mammograms of cancer cases at the time of detection were not shown. For each case, the mammogram of the screening round previous to the prior was also mounted on the alternator, to allow radiologists to judge mammographic changes over time. For the study, radiologists were asked to mark and rate all regions that attracted their attention, also those that they would normally not rate suspicious enough for recall. For their ratings they used a scale of suspiciousness ranging from 0% to 100%. The readers also classified findings by type of abnormality. In this study we only investigated use of CAD mass detection results. Therefore, we only included findings of the radiologists that were classified in at least one of the three categories of mass, architectural distortion, or asymmetry, thus excluding all microcalcification findings.
A radiologist who did not participate in the reading reviewed all positive cases and marked the locations of the cancers in the priors when they were visible. It turned out that in 142 cases a visible lesion could be seen in the prior mammogram at the location where a cancer developed, either as significant abnormality or as minimal sign. The other cases showed no sign of abnormality. In 116 of these 142 cases a mass, architectural distortion, or asymmetry was the predominant sign; the others were microcalcification cases. This subset of 116 cases was used in this investigation, along with the 250 normals. All images used in the study were digitized using a Lumisys LS 85 film scanner (Lumiscan LS 85, Lumisys, Sunnyvale, CA). The digitized images were archived and processed later using the most recent release of a CAD algorithm (ImageChecker v8.0) developed by R2 Technology (R2 Technology Inc., Sunnyvale, CA). The software used allowed detected regions to be archived along with a level of suspicion and its coordinates. It is noted that the mass detection algorithm covers a wide range of mammographic abnormalities, ranging from spiculated and ill-defined masses to architectural distortions and focal asymmetry. However, it is not aimed at classifying abnormalities into various benign and malignant types, nor does it make use of temporal comparison between mammograms taken in subsequent screening rounds. Despite these limitations, the levels of suspicion computed by the CAD system can be used as a measure for the probability that a region is a cancer. In the study, we used CAD results obtained on the mammograms of the same screening round judged by the readers, i.e. 116 prior mammograms of cancer cases and 250 normals.
Performance measurement
It is common practice to report CAD performance by free-response receiver operating characteristics (FROC), in which detection sensitivity is plotted as a function of the number of FPs per image. Observer performance, on the other hand, is usually determined by one operating point or by regular receiver operating characteristic (ROC) analysis. The observer data collected in our study allowed computation of both FROC and ROC results for the readers, as they marked locations of abnormalities and assigned a level of suspiciousness to each finding. By computing the detection and FP rate using a range of thresholds on the scale of suspicion, performance measures at different operating points were obtained for each of the readers. In a similar way, performance of the CAD system was determined. Using FROC analysis, a direct comparison between radiologists and CAD could be made. TP (true positive) detections were determined by a distance criterion. If a CAD prompt or an annotation of a radiologist was close enough to a true cancer location, a TP was counted. For the radiologists, we used a distance of 2.5 cm. For CAD, a smaller distance criterion of 1.5 cm was used. These criteria were chosen taking into account firstly, the inaccuracy of the radiologists' annotations (which were drawn on small paper printouts of the mammograms) and secondly, the fact that with many CAD marks there is an increasing risk of erroneously counting a TP CAD mark due to a nearby FP. Case based FROC curves were computed, i.e. a TP was counted if a cancer was found in either the craniocaudal (CC) or the mediolateral oblique (MLO) view. The number of FPs per image was determined based on findings in the 250 normal cases. For the radiologists, only case based findings were available, i.e. if a lesion was visible in two views this was scored as one finding. Therefore, to construct FROC curves of the radiologists two FPs were counted per finding if both the MLO and CC views were present. 
In practice, this made no difference for the majority of the cases, as only MLO views were available. In the Dutch screening protocol, CC views in the subsequent screening rounds are only made if the radiographer determines that there is an indication that these might be useful for the interpretation, with dense breast pattern as the most common indicator. It was verified that almost all findings reported by the radiologists were marked in both views if they were available, which justifies our calculation.
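The FROC procedure described above can be illustrated with a minimal sketch. It is not the code used in the study: the `Mark` record, the `froc_points` function, and the layout of the inputs are hypothetical names chosen for illustration, and marks are assumed to carry coordinates in centimetres. Only the distance criteria (2.5 cm for radiologists, 1.5 cm for CAD), the case-based counting of TPs, and the use of normal cases for the FP rate are taken from the text.

```python
import math
from typing import NamedTuple


class Mark(NamedTuple):
    """One annotated region (hypothetical record layout)."""
    case_id: str
    view: str      # "MLO" or "CC"
    x_cm: float
    y_cm: float
    score: float   # level of suspicion


def froc_points(cancer_marks, normal_marks, lesions, n_normal_images,
                max_dist_cm, thresholds):
    """Return one (FPs per image, case-based sensitivity) pair per threshold.

    cancer_marks: marks made in cancer cases
    normal_marks: marks made in normal cases (every one is a false positive)
    lesions: {(case_id, view): (x_cm, y_cm)} true cancer locations
    """
    n_cancer_cases = len({case_id for case_id, _ in lesions})
    points = []
    for t in thresholds:
        detected = set()
        for m in cancer_marks:
            loc = lesions.get((m.case_id, m.view))
            if loc is None or m.score < t:
                continue
            # A hit in either view counts as a TP for the whole case.
            if math.hypot(m.x_cm - loc[0], m.y_cm - loc[1]) <= max_dist_cm:
                detected.add(m.case_id)
        n_fp = sum(1 for m in normal_marks if m.score >= t)
        points.append((n_fp / n_normal_images, len(detected) / n_cancer_cases))
    return points
```

Under these assumptions, the same function would be called with `max_dist_cm=2.5` for a radiologist's annotations and `max_dist_cm=1.5` for CAD prompts, after which the two sets of operating points can be plotted in one FROC diagram.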
At the current state of the art, computer aided detection algorithms for masses still display many FP prompts, of which a fraction can be easily recognized as obvious failures of the system to understand normal mammographic structure. Examples include prompts on crossing vessels or skin folds. We can identify such FPs by correlating the prompts with the findings of the readers. If a FP of CAD is on a location that was not reported by any of the 15 readers in the study, we call it a "zero false positive". It can be argued that these will not have much influence on the screening outcome if radiologists use CAD for interpretation of lesions they detect themselves. Therefore, we also computed FROC results of the system without these zero FPs.
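The separation of zero false positives from the other CAD FPs could be sketched as follows. The function name, the tuple layouts, and the matching distance of 2.5 cm between a CAD prompt and a reader finding are assumptions made for illustration; the text does not state which distance was used to decide that a prompt coincides with a reader finding.

```python
import math


def split_zero_false_positives(cad_fps, reader_findings, max_dist_cm=2.5):
    """Split CAD false positives by whether any reader marked the same location.

    cad_fps: list of (image_id, x_cm, y_cm, score) CAD prompts on non-cancer locations
    reader_findings: list of (image_id, x_cm, y_cm), pooled over all readers
    max_dist_cm: assumed matching distance (not given in the text)
    Returns (shared_fps, zero_fps).
    """
    shared_fps, zero_fps = [], []
    for prompt in cad_fps:
        image_id, x, y, _score = prompt
        marked_by_reader = any(
            img == image_id and math.hypot(x - rx, y - ry) <= max_dist_cm
            for img, rx, ry in reader_findings
        )
        (shared_fps if marked_by_reader else zero_fps).append(prompt)
    return shared_fps, zero_fps
```

FROC results without zero FPs would then be obtained by counting only `shared_fps` when computing the number of FPs per image.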
Independent combination of readers and CAD
By independently combining findings of the radiologists with detection results of the CAD program, we can investigate the potential contribution of CAD in improvement of mammographic interpretation [5]. For this purpose, we only considered locations in mammograms that the observers reported and annotated. As a consequence, we ignored possible TPs of the CAD system that the radiologist overlooked. To compute the combined response of a reader with CAD, for a finding marked by the reader, we check whether the location of the finding is marked by CAD and determine its level of suspicion. If two views are available and the finding is marked by CAD in both views, we take the highest level of suspicion assigned to either of the regions by CAD. If the finding is not marked by CAD at all, we assign a zero level. Subsequently, we compute the combined response by taking a weighted average of the reader score with the CAD suspicion level. Similarly, we compute results of independent double reading of two radiologists by taking the average of their scores if they both marked a finding at the same location, and by dividing the score by two if only one of them reported the finding. In a previous study this was found to be a better strategy than other ways of independent combination [7].
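The two combination rules described above can be written down compactly. This is a sketch under stated assumptions rather than the implementation used in the study: the function names are hypothetical, the weight of the CAD level is not given in the text and is set to 0.5 here purely as a placeholder, and the CAD suspicion level is assumed to have been rescaled to the same 0 to 100 range as the reader scores.

```python
def combine_reader_with_cad(reader_score, cad_levels, cad_weight=0.5):
    """Weighted average of a reader score and the CAD suspicion level.

    reader_score: reader's suspiciousness rating for the finding (0-100)
    cad_levels: CAD suspicion levels at the finding's location, one per view
                in which CAD marked it (empty if CAD did not mark it)
    cad_weight: assumed weight of the CAD level (not specified in the text)
    """
    # Highest level over the views; zero if CAD did not mark the finding.
    cad_level = max(cad_levels, default=0.0)
    return (1.0 - cad_weight) * reader_score + cad_weight * cad_level


def combine_two_readers(score_a, score_b):
    """Independent double reading: average if both readers marked the finding,
    half the single score if only one did. Pass None for a reader who did not
    mark the finding."""
    scores = [s for s in (score_a, score_b) if s is not None]
    return sum(scores) / 2.0
```

For example, a finding rated 60 by the reader and marked by CAD at levels 20 and 40 in the two views gives a combined response of 50 with equal weights, and a finding rated 80 by only one of two readers gives a double reading score of 40.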
By varying a detection threshold applied to the level of suspiciousness, "localized response" (LROC) curves were constructed that show the fraction of correctly localized lesions (sensitivity) as a function of the FP fraction [8]. A FP was counted when a case had at least one FP finding that exceeded the detection threshold. Mean results of independent double reading were computed by combining each reader with all other readers. In this way, mean double reading curves were obtained for each reader, which were subsequently averaged to get a mean curve over the readers.
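A minimal sketch of how the points of such an LROC curve could be computed is given below. The function name and input layout are hypothetical, and it is assumed here that the FP fraction is taken over the non-cancer cases; the text only states that a case counts as a FP when at least one of its FP findings exceeds the threshold.

```python
def lroc_points(lesion_scores, normal_max_scores, thresholds):
    """Return one (FP fraction, sensitivity) pair per detection threshold.

    lesion_scores: per cancer case, the (combined) score of the finding that
                   correctly localizes the lesion, or None if it was not marked
    normal_max_scores: per non-cancer case, the highest score among its FP
                       findings, or None if the case has no findings
    """
    points = []
    for t in thresholds:
        # A lesion is correctly localized if its score exceeds the threshold.
        n_tp = sum(1 for s in lesion_scores if s is not None and s > t)
        # A case is a FP if at least one FP finding exceeds the threshold,
        # which is equivalent to its highest FP score exceeding it.
        n_fp = sum(1 for s in normal_max_scores if s is not None and s > t)
        points.append((n_fp / len(normal_max_scores), n_tp / len(lesion_scores)))
    return points
```

The scores passed in can be single reader scores, reader scores combined with CAD, or double reading scores, so that the three reading modes are evaluated in the same way.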
Results
In Figure 1, the performance of the radiologists and CAD is shown in the FROC diagram. Each reader is represented by a different marker. Using typical referral rates in screening, it can be calculated that operating points in European screening programs are in the range of 0.005–0.02 FP/image. In that range performance is relatively low, which may be understood from the fact that of the positive cases only prior mammograms were shown. Clearly the reason for low performance is not that readers overlooked many mass lesions, because the readers did report the majority of the cancers at higher FP rates. Two curves show the performance of CAD. It appears that standalone CAD performance is lower than that of any reader, but if the evaluation is restricted to locations identified by at least one of the readers, the classification performance of CAD is similar to that of the readers.

Figure 1. Mass detection performance of CAD and 10 experienced screening radiologists. Radiologists are represented by different marks, and each point represents an operating point of a radiologist. The solid line shows standalone CAD results while the dashed line shows FROC results obtained by leaving out the zero false positives (FPs), i.e. prompts on non-cancer locations that none of the readers reported.
Results obtained by combining the reader scores with CAD prompts are shown in Figure 2. It appears that results of independent reading with CAD are similar to independent double reading. Statistical analysis was performed by comparing the mean sensitivity in the range of FP fraction less than 0.1 of the different readers for three reading modes: single reading, independent double reading and independent reading with CAD. The low range was taken because of its relevance in screening. Improvement of single reading by independent reading with CAD or independent double reading was significant. There was no statistically significant difference between the results obtained with the combined interpretation with CAD and independent double reading.
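The summary measure used in this comparison, the mean sensitivity for FP fractions below 0.1, could be computed from a set of LROC points along the following lines. The text does not describe how the mean was calculated, so the linear interpolation, the uniform sampling of the FP axis, and both function names are assumptions made for illustration. The sketch also assumes that the points have distinct FP fractions.

```python
def interpolate_sensitivity(points, fp_fraction):
    """Linearly interpolate sensitivity at a given FP fraction.

    points: (FP fraction, sensitivity) pairs with distinct FP fractions.
    Outside the measured range the nearest measured sensitivity is used.
    """
    pts = sorted(points)
    if fp_fraction <= pts[0][0]:
        return pts[0][1]
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= fp_fraction <= x1:
            return y0 + (y1 - y0) * (fp_fraction - x0) / (x1 - x0)
    return pts[-1][1]


def mean_sensitivity(points, max_fp_fraction=0.1, n_samples=101):
    """Average the interpolated sensitivity over [0, max_fp_fraction]."""
    xs = [max_fp_fraction * i / (n_samples - 1) for i in range(n_samples)]
    return sum(interpolate_sensitivity(points, x) for x in xs) / n_samples
```

One such value per reader and per reading mode would give paired samples across readers, which is the kind of input a paired statistical comparison of the three modes requires.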

Figure 2. Mean sensitivity for visible masses on prior mammograms, obtained by single reading, independent double reading and independent interpretation with CAD as a function of the false positive (FP) fraction.
Discussion
It was found that FROC performance of CAD is lower than that of any radiologist in the study. However, the difference between CAD and radiologists is not very large and may largely be attributed to obvious failures of CAD, identified as FPs on locations not reported by any of the readers. It is expected that the gap will be bridged in the near future when new CAD algorithms become available. It is also noted that radiologists in the study were experienced and motivated. The average performance of radiologists in practice may be lower than in our findings. FROC results show that the perception many radiologists have of CAD performing poorly on masses is not justified, because the difference between CAD and the human readers is not that large. Negative perception of CAD for detection of masses may be due to the operating point of CAD used in practice, around 0.4 FPs per image, and the way prompts are presented: radiologists only see the prompted locations and have no access to the importance or "strength" of the CAD marks. They cannot easily relate CAD results to their own performance, because they operate at a much lower FP rate. This problem is worse in screening programs with a low recall rate. For instance, when a radiologist operates at a four per cent recall rate, the radiologist's number of FPs per image is about 20 times lower than the setting of the CAD system. Moreover, in screening, radiologists see many more FP prompts than TPs, because the number of cancers in the population is low.
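The factor of about 20 can be reproduced with simple arithmetic. The sketch below rests on assumptions that are not stated in the text: that nearly all recalls in screening are false positives, that each recalled case contributes one FP finding, and that a case consists of two images (one MLO view per breast, in line with the remark in the methods that only MLO views were available for most cases). The function name is hypothetical.

```python
def reader_fp_per_image(recall_rate, images_per_case=2):
    """Approximate number of reader FPs per image.

    Assumes nearly every recall is a false positive with one FP finding per
    recalled case, and images_per_case images per case (both assumptions).
    """
    return recall_rate / images_per_case


cad_fp_per_image = 0.4                      # CAD operating point quoted in the text
reader_rate = reader_fp_per_image(0.04)     # four per cent recall rate
ratio = cad_fp_per_image / reader_rate      # how many times lower the reader's FP rate is
print(f"reader: {reader_rate:.3f} FP/image, CAD/reader ratio: {ratio:.0f}")
```

Under the same assumptions, recall rates of one and four per cent correspond to 0.005 and 0.02 FP/image, which matches the range of operating points quoted in the Results.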
We found that improvement of performance obtained by combining suspiciousness scores of the readers with prompt levels computed by CAD was significant, and similar to that obtained by independent double reading. This is in agreement with the FROC results, which showed that performance of CAD on regions identified by the readers is similar to the performance of the radiologists. Results indicate that CAD can be used in practice as an interpretation aid, i.e. to help decide which cases should be recalled. In fact, the independent combination of readers with CAD we implemented can be viewed as a simulation of interactive use of CAD prompts, in which existence of a CAD marker is only displayed if the user requests decision support at a certain location. This way of using CAD will be greatly facilitated if levels of the CAD markers, which indicate probability of malignancy, are made available to the reader. Further study is needed to determine how these probabilities should best be displayed. For instance, a colour scale or marker size could be employed.
N Karssemeijer is scientific consultant of and receives grant support from R2 Technology.
References
1. SM Astley. Computer-based detection and prompting of mammographic abnormalities. Br J Radiol 2004;77:S194–S200.
2. TW Freer and MJ Ulissey. Screening mammography with computer-aided detection: prospective study of 12,860 patients in a community breast center. Radiology 2001;220:781–6.
3. TE Cupples, JE Cunningham, JC Reynolds. Impact of computer-aided detection in a regional screening mammography program. AJR Am J Roentgenol 2005;185:944–50.
4. D Manning, S Ethell, T Donovan. Categories of observer error from eye-tracking and AFROC data. SPIE Medical Imaging 2004;5372:90–99.
5. N Karssemeijer, JD Otten, AL Verbeek, JH Groenewoud, HJ de Koning, JH Hendriks, et al. Computer-aided detection versus independent double reading of masses on mammograms. Radiology 2003;227:192–200.
6. JD Otten, N Karssemeijer, JH Hendriks, JH Groenewoud, J Francheboud, AL Verbeek, et al. Effect of recall rate on earlier detection of breast cancers based on the Dutch performance indicators. J Natl Cancer Inst 2005;97:748–54.
7. N Karssemeijer, JD Otten, AJ Roelofs, S van Woudenberg, JH Hendriks. Effect of independent multiple reading of mammograms on detection performance. SPIE Medical Imaging 2004;5372:82–89.
8. RG Swensson. Unified measurement of observer performance in detecting and localizing target objects on images. Med Phys 1996;23:1709–25.