Paper |
1 Centre for Health Informatics and Multiprofessional Education, University College London, London and 2 Department of Radiology, St George's Hospital NHS Trust, London, UK
| Abstract |
|---|
| Introduction |
|---|
Over the last 5 years we have led a series of evaluations of a commercial CAD system, the R2 ImageChecker, with the aim of assessing its potential value as an aid in the UK National Health Service Breast Screening Programme (NHSBSP). We carried out three main evaluations, all of which have been published with additional detail elsewhere [1–3]. In this paper we present an overview of the three studies, focusing on ethical and methodological issues, and make some observations regarding the value and difficulty of these evaluations.
| Study One |
|---|
Method and materials
50 film readers took part: 30 consultant radiologists, 5 breast clinicians and 15 trained radiographers. All met the requirements of the NHSBSP. All were currently reading at least 5000 screening cases per annum. Film readers were given an explanation of prompting and of the behaviour of the system. They were advised on the typical frequency of prompts and given examples of cases where CAD often generates inappropriate prompts. All film readers then read a training roller, using the prompts.
The test set contained 180 cases of women undergoing routine screening at a centre using two views and double reading with arbitration. The sample included 60 confirmed cancers: 40 consecutive cancers detected through routine screening and 20 false negative interval cancers. Controls were selected from the same time period as the screen-detected cancers and had all been confirmed as normal at a subsequent screen. All were processed using Version 2.2 of the R2 ImageChecker®.
The sample cases were assigned to three rollers, having 15, 20 and 25 cancers. The order in which each reader viewed them was randomised, as was the order in which they viewed the conditions (prompted and unprompted). A minimum period of 1 week was left between reading a roller in the two conditions. Readers viewed the films on a standard roller viewer. Low resolution images of the films were provided on a paper form. In the prompted condition these were shown with prompts, in the unprompted condition they were shown without prompts. Readers were asked to give an overall decision on recall for each case.
Results
The ImageChecker® was judged to have correctly prompted 44 of the 60 cancer cases: 36 of the 40 screen-detected cases were prompted, a sensitivity of 90%, but only 8 of the 20 interval cases were prompted, a sensitivity of 40%.
The sensitivity and specificity of each reader on each roller under each condition were calculated. Sensitivities and specificities lie on a 0–1 scale and tend not to be normally distributed. A modified logit transform was therefore applied to the data. The transformed data were then analysed using repeated measures analysis of variance with two within-subject factors: the prompt and the roller. One analysis was performed for sensitivity and another for specificity. Testing for a difference in sensitivity, we found a significant effect for the roller (F(2, 76) = 26, p < 0.001), but not for the prompt (F(1, 38) = 0.003, p = 1.0) or for the interaction between roller and prompt (F(2, 76) = 2.2, p = 0.1). A comparable analysis for specificity found a significant effect for the roller (F(2, 76) = 5.0, p = 0.009), but not for the prompt (F(1, 38) = 0.13, p = 0.7) or for the interaction between roller and prompt (F(2, 76) = 0.9, p = 0.4). There was, therefore, no evidence that the prompts affected readers' sensitivities or specificities.
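The text does not say which modification of the logit was used. As a minimal sketch, assuming the common empirical logit with a 0.5 correction (an assumption, not the authors' stated method), the transform for one reader on one roller might look like this. The name `modified_logit` is hypothetical.

```python
import math


def modified_logit(successes: int, trials: int) -> float:
    """Empirical logit: log((x + 0.5) / (n - x + 0.5)).

    ASSUMPTION: the 0.5 offset is one common "modified" logit; the paper
    does not specify the form it used. The offset keeps the value finite
    when a reader detects none or all of the cancers on a roller.
    """
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    return math.log((successes + 0.5) / (trials - successes + 0.5))


# Hypothetical reader results on a roller containing 15 cancers.
for detected in (9, 12, 15):
    print(detected, modified_logit(detected, 15))
```

The transformed values, rather than the raw proportions, would then be passed to the repeated measures analysis of variance.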
| Study Two |
|---|
Method and materials
The cancer cases were selected using two criteria: the case must have been missed by at least one film reader in the past, and the cancer must be correctly prompted. Three different categories of case meeting the first criterion were used: false negative interval cancers (n=7), cases used in a previous experiment for which data sheets were still available (n=2), and cases missed by the first reader in normal double reading (n=31). 40 cases that met the criteria were included, with 4 further cancer cases not correctly prompted. Control films were unselected normal cases from the same period as the cancer cases, all of which had undergone a subsequent normal screen.
18 consultant radiologists, 15 trained radiographers and 2 breast clinicians took part, all of whom had participated in Study One. The same procedure was used for reading the cases as had been used in Study One.
Results
The sensitivity and specificity of each reader were calculated for each condition and each roller, as for Study One. In addition, data from the performance of individual readers working without CAD were used to simulate the effect of double reading with arbitration. For each pair of readers, the result was taken to be recall if both agreed on recall and no recall if both agreed on no recall. If they disagreed, the result was determined using the judgement of a third reader, selected at random.
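The simulation rule described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the data layout (one dictionary of reader decisions per case) and the function names are hypothetical, and the third reader is assumed to be drawn from all remaining readers.

```python
import random
from itertools import combinations


def double_read(decisions, first, second, rng):
    """Simulated double reading with arbitration for one case.

    decisions: dict mapping reader name -> True (recall) / False (no recall),
    taken from the unprompted (no CAD) condition.
    """
    a, b = decisions[first], decisions[second]
    if a == b:
        return a  # both readers agree, so their shared decision stands
    # Disagreement: a randomly selected third reader decides.
    others = [r for r in decisions if r not in (first, second)]
    return decisions[rng.choice(others)]


def simulate_pairs(case_decisions, rng):
    """Apply the rule to every pair of readers over a list of cases."""
    readers = sorted(case_decisions[0])
    return {
        (a, b): [double_read(case, a, b, rng) for case in case_decisions]
        for a, b in combinations(readers, 2)
    }


# Hypothetical decisions for two cancer cases read by three readers.
cases = [
    {"A": True, "B": True, "C": False},
    {"A": False, "B": True, "C": False},
]
outcomes = simulate_pairs(cases, random.Random(0))
print(outcomes[("A", "B")])
```

Sensitivity for each simulated pair is then the proportion of cancer cases on which the pair's outcome is recall.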
As in Study One, the computed sensitivities and specificities were transformed using a logit function before taking a mean value across all readers (or, for double reading, all combinations of two readers). After calculating the means, an inverse logit transform was used so that the results could be presented on the familiar 0–1 range.
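Averaging on the logit scale and transforming back can be illustrated with a short sketch. It uses the plain logit for simplicity, which is an assumption: the study applied a modified transform whose exact form is not given here, and the plain logit is undefined for proportions of exactly 0 or 1.

```python
import math


def logit(p: float) -> float:
    """Plain logit; only defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inv_logit(x: float) -> float:
    """Inverse logit, mapping any real number back to the 0-1 range."""
    return 1.0 / (1.0 + math.exp(-x))


def logit_mean(proportions):
    """Mean taken on the logit scale, reported back on the 0-1 scale."""
    transformed = [logit(p) for p in proportions]
    return inv_logit(sum(transformed) / len(transformed))


# Hypothetical reader sensitivities.
print(logit_mean([0.70, 0.80, 0.90]))
```

A mean computed this way generally differs from the arithmetic mean of the raw proportions, because values near 0 or 1 are stretched by the transform.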
Confidence intervals on the data were generated using a bootstrapping technique in which means were calculated in each of the three conditions for 999 random simulated samples generated from the sets of scores. The 95% confidence interval (CI) was taken to be the range between the 25th and 975th means. The same bootstrapping technique was used to calculate 95% CIs for two comparisons, one between single reading and single reading with CAD, and one between single reading and double reading.
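A percentile bootstrap of this kind can be sketched as below. The sketch assumes that readers are the resampling unit and that, for a comparison, the same resampled readers are used in both conditions; neither detail is stated in the text, and the function names are hypothetical.

```python
import random
import statistics

N_RESAMPLES = 999  # number of simulated samples, as stated in the text


def percentile_interval(values):
    """Return the 25th and 975th of 999 ordered values (1-based ranks)."""
    if len(values) != N_RESAMPLES:
        raise ValueError("expected exactly 999 bootstrap values")
    ordered = sorted(values)
    return ordered[24], ordered[974]


def bootstrap_mean_ci(scores, rng):
    """95% CI for the mean score in one condition."""
    n = len(scores)
    means = [
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(N_RESAMPLES)
    ]
    return percentile_interval(means)


def bootstrap_difference_ci(scores_a, scores_b, rng):
    """95% CI for mean(b) - mean(a), resampling readers in pairs."""
    n = len(scores_a)
    if n != len(scores_b):
        raise ValueError("both conditions need one score per reader")
    diffs = []
    for _ in range(N_RESAMPLES):
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = statistics.fmean([scores_a[i] for i in idx])
        mean_b = statistics.fmean([scores_b[i] for i in idx])
        diffs.append(mean_b - mean_a)
    return percentile_interval(diffs)


# Hypothetical per-reader sensitivities, unaided and with CAD.
unaided = [0.70, 0.75, 0.80, 0.72, 0.78]
with_cad = [0.74, 0.76, 0.79, 0.77, 0.80]
print(bootstrap_difference_ci(unaided, with_cad, random.Random(1)))
```

If the interval for the difference includes zero, the difference is not significant at the 5% level, which is the criterion applied in the results that follow.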
The data show that double reading increases sensitivity compared with single reading (0.81 compared with 0.77; 95% CI for the difference 0.014 to 0.077). The sensitivity for single reading with CAD is also greater than that for single reading (0.80 compared with 0.77; 95% CI for the difference −0.0027 to 0.064); however, since the lower end of the 95% CI for the difference is below zero, the difference is not statistically significant. The mean specificity is also improved both by CAD and by double reading. Again, the difference due to double reading is statistically significant, but that due to CAD is not.
| Study Three |
|---|
Prospective studies of CAD have used different designs. Some researchers have asked radiologists to record assessments before and after looking at prompts and then reported the proportion of cancers detected only after the prompts were consulted. Freer and Ulissey [5], who used this design, reported a 19.5% increase in cancer detection (it should be noted that the number of cancers involved in these studies is necessarily small; a 19.5% increase here was provided by eight additional cancers). Other researchers have compared the overall cancer detection rate in two distinct sets of cases, where CAD was used only in one set. Gur et al [6] found no difference in cancer detection rate for those cases viewed with and those viewed without CAD. Our study can be viewed as a replication of both the Freer and Ulissey and the Gur et al studies. Data were recorded both regarding the impact of prompts on individual readers' judgements and regarding the cancer detection rate in two sets of cases, with CAD only being available in one set.
Method and materials
We used a much improved version of the R2 ImageChecker® in this study, Version 5.0. From 21 March 2003 to 9 January 2004, as many cases as possible were processed using the ImageChecker®. However, not all the available roller viewers could be equipped with screens for the display of CAD prompts, so not all the films handled by the unit over the period were processed using the ImageChecker®. Although films were not randomly assigned to be read either with or without the ImageChecker®, there was no plausible bias in the allocation.
Films processed using the ImageChecker® were then displayed on a roller viewer equipped with the R2 CheckMate Ultra screen for the display of prompts. Films were read independently by film readers who recorded a first assessment before inspecting the prompts and a second assessment after inspecting the prompts. All of the films, whether processed by the ImageChecker® or not, were double read by two independent trained film readers. Cases considered normal by both readers were not recalled. Cases marked by either reader were allocated to "arbitration", a process involving discussion with two further consultant radiologists with a view to recall for further clinical assessment. Readers could also indicate that the case should be recalled for technical reasons or that the case would have to be re-read because prior films had not been retrieved. Such cases are excluded from our analyses.
Results
During the period of the study, 19 502 women attended for screening and 6111 cases were read with CAD. A total of 62 cancers were detected in 61 women. The ImageChecker® correctly prompted the cancer in 84% of cases (51/61). The false prompt rate, measured in a randomly chosen group of 613 non-cancer cases, was 0.40 prompts per film or 1.59 per case.
Table 1 shows the differences between the CAD and non-CAD groups in the percentage of cases sent to arbitration and recalled to assessment clinics, as well as in the cancer detection rate. There was a significant difference between the two groups in referral to arbitration and recall to assessment. It therefore seems that CAD does have an impact on readers' decisions to recall patients. The difference in cancer detection rates between the two groups is not statistically significant. This might suggest that CAD increases recall rates without increasing detection rates, implying a negative impact on specificity. However, the number of events available for the comparison of cancer detection rates is much smaller than that for the comparison of recall rates. The difference in cancer detection rates appears broadly in line with the differences in recall rates, and there may be an effect here that a larger study would reveal.
| Discussion |
|---|
The evidence from studies such as Study One and Study Two above, that CAD has an impact on radiologists' decision-making in experimental conditions, is less strong. Some studies, such as that of Chan et al [8] have shown an impact. Interestingly, Chan et al used receiver operating characteristic (ROC) analysis to explore the difference between CAD and non-CAD conditions. This is a more sensitive technique than that used in our studies, which simply tested for a significant difference between the sensitivities recorded in the two conditions. ROC analysis measures the difference between two conditions across a range of decision thresholds. It is therefore possible that an improvement in sensitivity detected using ROC analysis would not translate into a significant change at the actual threshold used by readers in making clinical decisions.
Many studies have failed to show an impact but these clearly lacked adequate power [9, 10]. Our first study was much larger than most previous studies and was expected to be more powerful. However, the extra statistical power was anticipated to come from the use of a larger number of readers than had been involved in earlier studies. Increasing the number of readers only augments the power of a study if readers behave differently. There is evidence that radiologists do vary substantially in terms of their overall sensitivity and specificity [11–13]. Many readers also report, anecdotally, that they believe they vary in their performance on particular signs. This variability is one reason why double reading is better than single reading. However, if the data from Study One are used to simulate double reading, it seems no better than single reading. This is because the readers in Study One achieved broadly similar levels of performance. This fact somewhat undermines the study's claim to statistical power and also suggests that the behaviour of the readers in the study differed in some important way from their behaviour in real life.
A case-by-case analysis of the cancers used in Study One revealed that there were 15 "too difficult cases", where there was no possibility of detecting an improvement due to prompting because the ImageChecker® failed to prompt the cancer. There were 29 "too easy cases", where there was no real scope for detecting an improvement because more than 90% of readers detected the cancer in the unprompted condition. There were, therefore, only 16 cases on which there was any real scope for detecting improvements. If we restrict attention to just these cases, the power of the study is greatly reduced (simulations suggest a power of approximately 50% to detect a 5% improvement in sensitivity). It was this analysis that moved us to attempt Study Two.
One further observation from Study One, which has been neglected in earlier publications, is that amongst the negative results there was one statistically significant effect. Although performance was similar in the intervention and control conditions, it did vary from roller to roller. One explanation for this would be that readers were expecting to find roughly similar numbers of cancers on all three rollers and so missed cancers at the end of the roller having the most cancers. It should be noted that although the order in which readers attempted the three test rollers was randomised, all readers first completed a training roller that may have served to calibrate their expectations. One of the features of this research is that relatively minor features of the experimental design appear to have a much larger impact than one might expect.
Conclusions drawn from Study Two are necessarily rather nuanced. The simulation of double reading shows a statistically significant improvement over single reading, confirming that the study avoided at least one of the problems of Study One. The effect due to CAD is not statistically significant but is very close to significance. This might be thought encouraging for CAD, but it should be remembered that, unlike Study One, this study used a set of cases specifically selected to maximise the chances of detecting an improvement due to CAD. The cases were selected both for difficulty (all had been missed by at least one reader previously) and for correct prompting. The majority of the 40 cancer cases had been missed by the first reader at screening. Such cases made up 14% of screen-detected cancers, and 61% of them were prompted by CAD. Our test set is thus representative of approximately 8% of screen-detected cancers, weighted towards the most difficult cases that CAD will detect.
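The share of screen-detected cancers represented by the test set appears to follow from multiplying the two proportions quoted above. The text implies this step but does not state it, so the calculation below is an assumption; the function name is hypothetical.

```python
def represented_share(missed_by_first_reader: float, prompted_by_cad: float) -> float:
    """Share of screen-detected cancers both missed by the first reader
    and correctly prompted (ASSUMES the two proportions simply multiply)."""
    return missed_by_first_reader * prompted_by_cad


share = represented_share(0.14, 0.61)
print(f"{share:.1%}")  # prints 8.5%
```

The product, about 8.5%, is in line with the "approximately 8%" quoted in the text.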
One of the interesting facts about Study Two is that a high proportion of correctly prompted cancers were not recalled. Some people in the field seem to have anticipated that all prompted cancers will be recalled [14]. This is clearly not the case. Our evidence suggests that a high proportion of prompted cancers will still be missed. We attempted to analyse why prompted cancers were not recalled, and the strongest factor seemed to be the film reader's prior confidence in his or her interpretation of the prompted findings as benign. The version of the ImageChecker® used in this study was able to indicate that certain prompts carried a higher confidence; this information seemed valuable, as fewer of the markers emphasised in this way were overruled by film readers.
A very striking methodological point emerges from Study Three. The designs replicated in this study, namely the Freer and Ulissey [5] and the Gur et al [6] designs, produce sharply different estimates of the impact of CAD. The point is dramatically illustrated in Table 3. The figures in the first column of data are taken from Table 2. The figures in the second column are the differences between the first two columns of Table 1 expressed as a proportion of the figure in the first column of Table 1. The comparison shows the extent to which the data are shaped by the methodology used. The data from the Freer-style analysis are probably the more secure, since that is the more controlled form of experiment. It is possible that the Gur-style analysis is contaminated in some way: the presence or absence of CAD may not be the only difference in the way cases were read in the CAD and non-CAD groups. For example, there may have been differences in how many cases certain readers read in the two groups, or readers in the CAD group may have been operating at higher levels of vigilance because they were aware that they were participating in a trial.
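The Gur-style figures described above are relative differences between two columns. A minimal sketch of that calculation follows; the sign convention (second column minus first) is an assumption, and the numbers in the usage line are placeholders rather than values from Table 1.

```python
def relative_difference(first_column: float, second_column: float) -> float:
    """Difference between two rates as a proportion of the first.

    ASSUMPTION: the difference is taken as second minus first; the text
    does not say which way round the subtraction was done.
    """
    if first_column == 0:
        raise ValueError("first_column must be non-zero")
    return (second_column - first_column) / first_column


# Placeholder rates only, NOT data from the study.
print(relative_difference(10.0, 12.0))
```

Expressing the difference relative to the first column puts it on the same footing as a Freer-style percentage increase, which is what allows the two designs to be set side by side.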
Interestingly, of the 12 cancers that were missed by a reader in Study Three, 9 were prompted by CAD but only 2 were recalled on the strength of the prompts. Although readers recalled significant numbers of additional cases on the strength of false prompts, the low specificity of prompts means that a number of true prompts are, inevitably, dismissed.
These studies should not be taken as the final word on CAD. Any evaluation is simply a snapshot that measures the performance of a rapidly evolving technology at a single moment in time. The results of future trials may make for an interesting comparison with our work.
| Acknowledgments |
|---|
| References |
|---|