
British Journal of Radiology (2006) 79, S117-S122
© 2006 British Institute of Radiology
doi: 10.1259/bjr/96931332


Full paper

Improving mammographic decision accuracy by incorporating observer ratings with interpretation time

R S Saunders, PhD 1 and E Samei, PhD 1,2

1 Duke Advanced Imaging Laboratories, Department of Radiology, 2424 Erwin Road, Suite 302, Duke University Medical Center, Durham, NC 27705, 2 Departments of Physics, Biomedical Engineering and Medical Physics, Duke University, Durham, NC 27710, USA

Correspondence: Ehsan Samei, Duke Advanced Imaging Laboratories, Department of Radiology, 2424 Erwin Rd, Suite 302, Duke University Medical Center, Durham, NC 27705, USA. E-mail: samei@duke.edu


    Abstract
 
Mammography is currently the most established technique for the early detection of breast cancer. However, it still produces errors, such as missing some early-stage cancers, and would benefit from further improvement. The objectives of this study were first, to measure the timing of correct and incorrect reading decisions in mammography and second, to exploit those dependencies to improve accuracy in mammographic interpretation. To address these objectives, an experiment was conducted in which experienced breast imaging radiologists reviewed 400 mammographic regions equally divided among images that contained simulated benign masses, malignant masses, malignant microcalcifications and no lesions. The experiment recorded each radiologist's decision as well as the time taken to interpret each mammogram. The results showed that incorrect detection decisions as well as incorrect classification decisions were associated with longer interpretation times (p<0.0001). The timing results were used to create a model that would flag cases for review that had a higher probability of error. The flagged cases had a median accuracy drop of 13% for detection decisions and 16% for classification decisions compared with unflagged cases. This suggests that interpretation time can be incorporated into mammographic decision-making in order to identify cases with higher probabilities of perceptual error that require further review.


    Introduction
 
Mammographic interpretation is a difficult perceptual task, with 20–40% of cancers missed in the initial mammographic screening [1–4]. Missed cancers are not the only perceptual error: false positives are also common, as the specificity of mammography ranges from 88% to 98% [1–3]. Reducing the number of missed cancers and increasing specificity in mammography should be one of the goals of perception science.

Previous perception studies have decomposed interpretation errors into three categories, based on the length of time the radiologist focuses on a potential lesion: search errors, recognition errors and decision-making errors [5, 6]. These studies indicate that search errors occur when the radiologist never focuses on the abnormality; recognition errors happen when the radiologist briefly examines a potential abnormality, but dismisses it very quickly; and decision-making errors arise when a radiologist examines a potential abnormality for an extended period of time, but still incorrectly classifies it [5]. Some previous studies have investigated these perceptual errors, but have generally considered them together [5, 7–9].

This study focused on the third category, decision-making errors. For screening, these errors occur after the radiologist has searched the image and recognized the area as a potential abnormality, but then incorrectly classifies the area as not containing a lesion. These errors can be more difficult to avoid than other perceptual errors because while improving the conspicuity of lesions can be expected to reduce search errors and recognition errors, it would not necessarily improve decision-making performance. In fact, decision-making errors have been suggested to be the primary perceptual errors in chest radiography [10]. To better understand and decrease decision-making errors, the purpose of this study was two-fold: (1) to measure the timing of correct and incorrect reading decisions in mammography and (2) to exploit those dependencies to improve the accuracy of mammographic interpretation.


    Methods and materials
 
This study isolated decision-making errors by controlling the search process and lesion variability. An image set of 400 mammographic regions was created by inserting simulated breast masses and microcalcifications into digital mammograms. The mammographic regions were then reviewed by experienced breast imaging radiologists, who rated whether the mammograms contained a lesion or not, and classified the lesion. The ratings and interpretation time for each observer were analysed to understand decision-making errors and whether incorporating interpretation time could improve accuracy.

Mammographic images
A database of 984 de-identified four-view mammograms was obtained with approval by the institutional review board (IRB). Each mammogram had been acquired on an indirect flat-panel mammography detector (GE Senographe 2000D; GE Medical Systems, Waukesha, WI) [11, 12]. Out of this database, 200 craniocaudal views were chosen for further analysis.

Lesion simulation
Simulated mammographic lesions, the realism of which was verified in previous studies, were embedded in the digital mammograms [13–15]. These simulated lesions included typically benign masses (oval circumscribed and oval obscured), typically malignant masses (irregular ill-defined and irregular spiculated), and typically malignant microcalcifications (fine linear branching and clustered pleomorphic). The contrast for these lesions was determined by a Monte Carlo model (xSpect) of the mammographic image acquisition [16]. The contrast was reduced by the expected scatter, which was calculated from previous models [17].

Image processing
The images were processed in two stages to enhance fine detail and provide sufficient contrast at the skin line [18, 19]. After this processing, the histogram of each image was analysed to find the appropriate window and level. The window and level were approximated by a sigmoid curve, which provided a smooth transition at the extremes of the greyscale range. All image processing was evaluated by an experienced breast imaging radiologist (JAB: 7 years' experience, 5000 cases per year). To minimize bias, this radiologist did not participate in the observer performance experiment.
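
The paper does not give the form of the sigmoid used for window and level, so the following is only a sketch under stated assumptions: a logistic curve centred on the level, with a gain of 4 chosen so that the slope at the centre matches that of a linear window of the same width. The function name and the pixel values are hypothetical.

```python
import math


def sigmoid_window(pixel, level, window):
    """Map a pixel value to a display value in [0, 1] with a logistic curve.

    Assumption: the gain of 4 makes the slope at pixel == level equal to
    1 / window, the slope of a conventional linear window of that width.
    """
    z = 4.0 * (pixel - level) / window
    # Evaluate the logistic function in a form that avoids overflow in exp().
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Hypothetical window and level, not values from the study.
level, window = 2000.0, 1000.0
for pixel in (1000.0, 1500.0, 2000.0, 2500.0, 3000.0):
    print(pixel, round(sigmoid_window(pixel, level, window), 3))
```

Unlike a linear window, which clips abruptly at the window edges, this mapping approaches 0 and 1 gradually, which is the smooth transition at the extremes described above.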

Observer performance experiment
The mammograms were reviewed by five experienced radiologists (average 11.2 years of experience as radiologist attending, average 9.8 years as mammography attending, average 160 cases per week). The radiologists reviewed images on a custom graphical user interface (GUI) that displayed a 5.12 cm × 5.12 cm region of the mammogram for interpretation. The radiologists rated each image based on whether it appeared to contain microcalcifications, a benign mass, a malignant mass, or no lesion. Images were viewed three times, once on a medical-grade liquid crystal display (LCD) (Nova V; National Display Systems, Morgan Hill, CA; 165 µm pixels) and twice on a medical-grade CRT (MGD 521; Barco LLC, Duluth, GA; 148 µm pixels). For each reading, the interface recorded the radiologist's interpretation time, defined as the interval between the time the mammogram was displayed and the time the radiologist recorded his or her rating.

The observer experiment controlled for other factors by adopting the following constraints. Each image was displayed at full resolution to maintain image fidelity. The image centre was indicated by four whiskers on each side in order to minimize image search. To reduce rating correlations between sequential images, radiologists could not return to an image once it had been rated. The radiologists viewed each display straight ahead and centred as some displays, such as LCDs, have different properties off-axis [20]. To maintain a similar image appearance, the radiologists could not adjust the image window and level. Finally, the display order, image order and image orientation were randomized to further reduce potential biasing effects.

Statistical analysis
The data were analysed to determine the performance at two different clinical tasks. One task was a screening task where radiologists must detect a mammographic lesion. For this detection task, the radiologists would be correct if they detected the lesion, even if they incorrectly classified it as benign or malignant. The other task was a diagnostic task where the radiologists had to differentiate between benign and malignant breast masses. In this classification task, the lesion had been detected and the radiologists were judged on whether the lesion was classified appropriately. For both tasks, accuracy was computed as the average of sensitivity and specificity.
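
The accuracy definition for the detection task can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the category labels, the function name and the toy readings are all hypothetical. Any rating other than "none" is counted as a detection, even when the lesion type is wrong.

```python
def detection_metrics(truth, ratings):
    """Sensitivity, specificity and their average for the detection task.

    truth and ratings are parallel lists of category labels; the label
    "none" means no lesion. Any other rating counts as a detection, even
    when the reported lesion type is wrong.
    """
    tp = fn = tn = fp = 0
    for actual, rated in zip(truth, ratings):
        lesion_present = actual != "none"
        lesion_reported = rated != "none"
        if lesion_present and lesion_reported:
            tp += 1
        elif lesion_present:
            fn += 1
        elif lesion_reported:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity, (sensitivity + specificity) / 2


# Hypothetical toy readings, not data from the study.
truth = ["benign", "malignant", "calc", "malignant", "none", "none", "none", "none"]
ratings = ["malignant", "malignant", "calc", "none", "none", "none", "benign", "none"]
sens, spec, acc = detection_metrics(truth, ratings)
print(sens, spec, acc)
```

In the toy data, the first reading rates a benign mass as malignant; it still counts as a true positive for detection, and would only count as an error in the classification task.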

For each task, the data were analysed to learn whether incorrect and correct decisions correlated with different interpretation times. First, the interpretation time was analysed using survival analysis, where the "survival time" of an image was defined as the length of time it remained unrated. The survival curves were plotted to qualitatively show whether rating errors affected interpretation time. Next, the interpretation time distributions for correct and incorrect ratings were compared using statistical tests. A Wilcoxon test compared the centre of the distributions, while a Brown-Forsythe test compared the width of the distributions [21]. Finally, the interpretation times were modelled as a function of decision type (e.g. true positive, false positive) using a Proportional Hazards model, allowing a further test of whether rating errors affected interpretation time [22, 23].
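
The "survival time" definition can be illustrated with an empirical survival curve. The sketch below assumes that every image was eventually rated (no censoring), so the curve is simply the fraction of readings still unrated after each time; the function name and the timing values are hypothetical. The Wilcoxon, Brown-Forsythe and Proportional Hazards analyses are not reproduced here.

```python
def survival_curve(times):
    """Empirical survival function for uncensored interpretation times.

    Returns (t, S(t)) pairs, where S(t) is the fraction of readings that
    remain unrated just after time t.
    """
    n = len(times)
    curve = []
    for t in sorted(set(times)):
        still_unrated = sum(1 for x in times if x > t)
        curve.append((t, still_unrated / n))
    return curve


# Hypothetical interpretation times in seconds, not data from the study.
correct_times = [4.0, 5.0, 5.0, 7.0]
incorrect_times = [6.0, 9.0, 12.0, 15.0]

print(survival_curve(correct_times))
print(survival_curve(incorrect_times))
```

A curve for incorrect decisions that lies above the curve for correct decisions indicates that incorrect readings tend to remain unrated for longer.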

After testing whether decision types had a statistically significant impact on interpretation time, two models were constructed to exploit that information. The first model used nominal logistic regression to fit the mammogram truth as a function of the observer ratings alone, interpretation time alone, or observer ratings combined with interpretation time. This fit was then used to predict mammogram truth for given observer data (either ratings, timing, or ratings plus timing). The second model operated similarly to a computer aided detection (CAD) system in that it did not make decisions on the mammogram truth, but rather flagged cases with a higher probability of incorrect decisions for further review by radiologists. To decide which cases to flag, a linear discriminant was used to find a threshold time that best separated false positives from true positives and false negatives from true negatives. Cases with interpretation times above these thresholds were flagged as they had a greater probability of being incorrect. These flagged cases should then be given further review by radiologists in order to improve their accuracy. Each model was evaluated for sensitivity, specificity and accuracy, with the variance of each quantity estimated using a bootstrap with 10 000 samples [24].
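
The flagging step can be sketched as follows. The paper does not specify how the linear discriminant was fitted, so this sketch makes a simplifying assumption: with a single feature, equal priors and a shared variance, the discriminant boundary reduces to the midpoint between the two class means. Only the threshold for positive decisions (false positives versus true positives) is shown; a threshold for negative decisions would be derived the same way. The percentile bootstrap is likewise one common choice, not necessarily the authors' procedure. All names and numbers are hypothetical.

```python
import random
from statistics import mean


def midpoint_threshold(correct_times, incorrect_times):
    """Boundary of a 1-D linear discriminant with equal priors and a shared
    variance (an assumption): the midpoint between the two class means."""
    return (mean(correct_times) + mean(incorrect_times)) / 2


def flag_cases(times, threshold):
    """Flag readings whose interpretation time exceeds the threshold."""
    return [t > threshold for t in times]


def bootstrap_interval(values, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of values."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_boot):
        # Resample the readings with replacement and record the mean.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int(alpha / 2 * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical interpretation times in seconds, not data from the study.
tp_times = [3.0, 4.0, 5.0, 4.0]
fp_times = [8.0, 10.0, 9.0, 9.0]
threshold = midpoint_threshold(tp_times, fp_times)

new_times = [3.5, 7.0, 9.5, 5.0]
flags = flag_cases(new_times, threshold)
print(threshold, flags)

# Hypothetical correctness (1 = correct) of a set of flagged readings.
flagged_correct = [1, 0, 0, 1, 0, 1, 0, 0]
print(bootstrap_interval(flagged_correct))
```

Comparing the bootstrap interval for flagged readings with the interval for unflagged readings is one way to judge whether the accuracy drop in the flagged group exceeds sampling noise.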


    Results
 
Detection task interpretation time
Figure 1 shows the timing results for the detection task. Incorrect decisions had longer interpretation times than correct decisions. The interpretation time differences between the four decision categories (false positives, false negatives, true positives and true negatives) were statistically significant both in terms of the mean time (Wilcoxon's χ² = 676, degrees of freedom (DOF) = 3, p<0.0001) and the timing variance (Brown-Forsythe's F = 78.5, DOF = 3, p<0.0001). As shown in Table 1, false positives had statistically significantly longer interpretation times than true positives, and false negatives had longer interpretation times than true negatives. The correlation of interpretation time with decision category was confirmed with a Proportional Hazards model, which also found that decision categories had a statistically significant effect on interpretation time (χ² = 462, DOF = 3, p<0.0001).



 
Figure 1. Interpretation times for correct and incorrect decisions in the detection task.

 


 
Table 1. Median interpretation time for different contingency table conditions. The error bars represent the 95% confidence interval of the median

 
Table 2 illustrates the results of the first predictive model incorporating interpretation time. The table shows that a model based on interpretation time and observer ratings performs slightly better than a model based on observer ratings alone, but not by a statistically significant amount. Interestingly, a model based solely on interpretation time generally performs above chance by a statistically significant amount, suggesting that interpretation time does provide useful information for predicting mammographic truth.



 
Table 2. Accuracy of models that incorporate rating data only, timing data only, or combine rating and timing data. The error bars represent the 95% confidence interval of the mean

 
Figure 2 illustrates the results of the second model, which flagged suspicious cases for further review. The figure shows that flagged cases generally had statistically significant drops in sensitivity and specificity. Table 3 shows the magnitude of the accuracy drop from the unflagged cases to the flagged cases. For each observer, there was a statistically significant drop in accuracy for the flagged cases.



 
Figure 2. Differences in (a) sensitivity and (b) specificity for the detection task with interpretation time flagging.

 


 
Table 3. Improvement in detection accuracy of unflagged cases over flagged cases. An asterisk indicates a statistically significant difference. The error bars represent the 95% confidence interval of the mean

 
Classification task interpretation time
Figure 3 illustrates the difference in interpretation times for correct and incorrect classifications of masses. For each observer, incorrect decisions had longer interpretation times. As with the detection task, the mean interpretation time differed between correct and incorrect decisions (Wilcoxon's χ² = 269, DOF = 3, p<0.0001), as did the width of the interpretation time distributions (Brown-Forsythe's F = 37.1, DOF = 3, p<0.0001). The relationship of interpretation time to decision category (false positive, true positive, false negative, true negative) was confirmed using a Proportional Hazards model, which also found that decision categories had a statistically significant effect on interpretation time (χ² = 191, DOF = 3, p<0.0001).



 
Figure 3. Interpretation times for masses correctly and incorrectly classified as benign or malignant.

 
Figure 4 shows the results of the flagging model for the classification task. For each observer, sensitivity and specificity dropped for the flagged cases. Table 4 shows that the drop in accuracy for flagged cases was statistically significant.



 
Figure 4. Differences in (a) sensitivity and (b) specificity for the classification task with interpretation time flagging.

 


 
Table 4. Improvement in classification accuracy of unflagged cases over flagged cases. An asterisk indicates a statistically significant difference. The error bars represent the 95% confidence interval of the mean

 

    Discussion
 
Perceptual errors have been investigated in previous work. One common means of investigation has been eye-position analysis [5, 7–9]. Eye-position analysis infers the type of error from the amount of time the radiologist focused on a potential abnormality. Eye-tracking relies on the central assumption that foveal attention indicates visual processing of particular areas. This introduces some uncertainty into the results, as foveal focus can include at least a 1° range. Notwithstanding these limitations, previous eye-tracking experiments largely agree with our detection timing results. For pulmonary nodule detection, incorrect decisions were associated with longer interpretation times for experienced radiologists [25]. For breast cancer screening, previous studies found that false positive results from normal mammograms had longer interpretation times than true positive results [5, 8, 9, 26] and false negative results had longer times than true negative results [5].

This study showed that interpretation time did correlate with decision category. These results could then be exploited. While a predictive model using interpretation time and observer ratings did not produce statistically significant improvements over a model using observer ratings alone, a flagging model similar to CAD systems did show promise. The flagging model could be used clinically to indicate mammograms requiring further review and potentially improve both the sensitivity and specificity of screening and diagnostic mammography.

In conclusion, this study investigated the potential for using interpretation time as a means of improving accuracy in screening and diagnostic tasks. Detection errors and classification errors had longer interpretation times than correct detection and classification decisions. Using linear discriminant analysis, we established a flagging program to highlight cases that had a greater probability of incorrect detection or classification decisions. The flagging creates an opportunity to improve mammographic accuracy by identifying cases with statistically lower sensitivities and specificities for further review.

This work was supported in part by grants from the NIH, R21-CA95308, and from the Department of Defense (DoD), USAMRMC, W81XWH-04-1-0323.


    Acknowledgments
 
Thanks are due to Gina Tourassi, Joseph Lo and Amar Chawla for serving as preliminary observers. The authors also thank Dave Delong and Jay Baker for their help in study design and analysis and our observers, Etta Pisano, Mary Scott Soo, Cherie Kuzmiak, Dag Pavic and Ruth Walsh. The authors also thank Andrew Karellas and Sankararaman Suryanarayanan of Emory University for permitting the use of their mammographic data set.


    References
 

  1. Houssami N, Irwig L, Simpson JM, McKessar M, Blome S, Noakes J. Sydney Breast Imaging Accuracy Study: comparative sensitivity and specificity of mammography and sonography in young women with symptoms. AJR Am J Roentgenol 2003;180:935–40.
  2. Carney PA, Miglioretti DL, Yankaskas BC, Kerlikowske K, Rosenberg R, Rutter CM, et al. Individual and combined effects of age, breast density, and hormone replacement therapy use on the accuracy of screening mammography. Ann Intern Med 2003;138:168–75.
  3. Pisano ED, Gatsonis C, Hendrick E, Yaffe M, Baum JK, Acharyya S, et al. Diagnostic performance of digital versus film mammography for breast-cancer screening. N Engl J Med 2005;353:1773–83.
  4. Bird RE, Wallace TW, Yankaskas BC. Analysis of cancers missed at screening mammography. Radiology 1992;184:613–7.
  5. Nodine CF, Mello-Thoms C, Kundel HL, Weinstein SP. Time course of perception and decision making during mammographic interpretation. AJR Am J Roentgenol 2002;179:917–23.
  6. Kundel HL, Nodine CF, Carmody D. Visual scanning, pattern recognition and decision-making in pulmonary nodule detection. Invest Radiol 1978;13:175–81.
  7. Nodine CF, Mello-Thoms C, Weinstein SP, Kundel HL, Conant EF, Heller-Savoy RE, et al. Blinded review of retrospectively visible unreported breast cancers: an eye-position analysis. Radiology 2001;221:122–9.
  8. Mello-Thoms C, Hardesty L, Sumkin J, Ganott M, Hakim C, Britton C, et al. Effects of lesion conspicuity on visual search in mammogram reading. Acad Radiol 2005;12:830–40.
  9. Mello-Thoms C, Britton C, Abrams G, Hakim C, Shah R, Hardesty L, et al. Head-mounted versus remote eye tracking of radiologists searching for breast cancer: a comparison. Acad Radiol 2006;13:203–9.
  10. Manning DJ, Ethell SC, Donovan T. Detection or decision errors? Missed lung cancer from the posteroanterior chest radiograph. Br J Radiol 2004;77:231–5.
  11. Suryanarayanan S, Karellas A, Vedantham S. Physical characteristics of a full-field digital mammography system. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 2004;533:560–70.
  12. Vedantham S, Karellas A, Suryanarayanan S, Albagli D, Han S, Tkaczyk EJ, et al. Full breast digital mammography with an amorphous silicon-based flat panel detector: physical characteristics of a clinical prototype. Med Phys 2000;27:558–67.
  13. Saunders RS Jr, Samei E, Baker JA. Simulation of breast lesions. In: 7th International Workshop on Digital Mammography; Durham, NC; 2004.
  14. Saunders RS, Samei E. Characterization of breast masses for simulation purposes. Proc SPIE 2004;5372:242–50.
  15. Saunders RS Jr, Samei E, Baker JA. Simulation of mammographic lesions. Acad Radiol 2006;13:860–70.
  16. Samei E, Flynn MJ. An experimental comparison of detector performance for direct and indirect digital radiography systems. Med Phys 2003;30:608–22.
  17. Boone JM, Lindfors KK, Cooper VN 3rd, Seibert JA. Scatter/primary in mammography: comprehensive results. Med Phys 2000;27:2408–16.
  18. Stahl M, Aach T, Dippel S. Digital radiography enhancement by nonlinear multiscale processing. Med Phys 2000;27:56–65.
  19. Davies AG, Cowen AR, Parkin GJS, Bury RF. Optimizing the processing and presentation of PPCR imaging. Proceedings of SPIE - The International Society for Optical Engineering 1996;2712:189–95.
  20. Krupinski EA, Johnson J, Roehrig H, Nafziger J, Lubin J. On-axis and off-axis viewing of images on CRT displays and LCDs: observer performance and vision model predictions. Acad Radiol 2005;12:957–64.
  21. Heiberger RM, Holland B. Statistical analysis and data display: an intermediate course with examples in S-plus, R, and SAS. New York, NY: Springer, 2004.
  22. Cox DR. Regression models and life-tables. J Royal Statistical Society. Series B (Methodological) 1972;34:187–220.
  23. Lawless JF. Statistical models and methods for lifetime data. New York, NY: Wiley, 1982.
  24. Efron B, Tibshirani R. An introduction to the bootstrap. New York, NY: Chapman & Hall, 1993.
  25. Manning D, Barker-Mill SC, Donovan T, Crawford T. Time-dependent observer errors in pulmonary nodule detection. Br J Radiol 2006;79:342–6.
  26. Nodine CF, Kundel HL, Mello-Thoms C, Weinstein SP, Orel SG, Sullivan DC, et al. How experience and training influence mammography expertise. Acad Radiol 1999;6:575–85.



