| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
Full paper |
1 Department of Medical Physics and Engineering, Leeds Teaching Hospitals NHS Trust, Leeds General Infirmary, Leeds LS1 3EX, 2 Academic Unit of Medical Physics and Centre of Medical Imaging Research and, 3 Institute of Digital Innovation, University of Teesside, Middlesbrough TS1 3BA, UK
Correspondence: Dr David S Brettle, Medical Physics and Engineering, Leeds Teaching Hospitals NHS Trust, 1st Floor Wellcome Wing, Beckett Street, Leeds LS1 3EX, UK. E-mail: dsb{at}medphysics.leeds.ac.uk.
| Abstract |
|---|
| Introduction |
|---|
The issue is further complicated by the work of Rackow et al [5] who conducted a study into the ability to detect low contrast signals in structured noise, using 16 observers with radiology experience and 16 with no experience. They found that training does not give an advantage when detecting lesions in unfamiliar noise. They further suggested that talent may be more important than training in the fundamental detection task.
This limited review indicates that there is conflicting evidence about the relative impact of task and/or experience on detectability within anatomical noise, and this was investigated in the study described here. The factor of relevance to this study is the impact of the varying presentations of local area anatomical noise on detection.
Anatomical noise
In the medical viewing task the observer has to locate pathology embedded within a complex background of anatomical structure or clutter. In the case of projection imaging this clutter is a 2D representation of a 3D volume. In cross-sectional imaging anatomical structure is spatially defined within the slice of the image. Although anatomical presentation can differ greatly with anatomy and modality, local texture, within defined anatomical regions, tends to be consistent. This constant texture can be regarded as the "ground" [6] or background. The background is in addition to noise and often in addition to the signal. However, this does not exclude the signal being a part of the background.
There is evidence that local area anatomical background detrimentally affects lesion detection. Revesz et al [7, 8] investigated lung nodule detection in the presence of local structure. They reported a conspicuity measure, based on background structure, which correlated with probability of detection. Several publications have reported lower detectability of lesions in the presence of anatomical noise in chest radiographs [8-11]. Båth et al [12] reported a method for the removal of anatomical noise which improved the lower threshold of nodule detection in chest radiographs. There has also been work investigating the nature of anatomical clutter. Burgess et al [13] investigated the nature of anatomical noise in mammograms and reported that there was no difference in human results for mammographic and filtered noise backgrounds, whereas Båth et al [14] reported that certain regions in a chest radiograph acted as pure noise. It is clear that anatomical background has a significant impact on signal detection and requires further investigation.
Psychophysical testing
One approach for investigating signal detection within anatomical noise is to use psychophysical tests, which measure the psychological response to a physical stimulus. Many methods have been developed for psychophysical testing [15, 16]. For this work, measuring the threshold of detection (contrast sensitivity) of signals in anatomical backgrounds was identified as the most appropriate measure. Contrast sensitivity can be determined using the Method of Limits [17], which helps to reduce several observer errors; however, it cannot remove observer bias. A technique that is less dependent on observer bias is the two-alternative forced-choice (2AFC) method [15]. In a 2AFC experiment the observer is presented with two images: one contains the signal superimposed on the background, the other the background only. The observer has to report which image contains the signal. The main disadvantage of the 2AFC test is the requirement for a large number of image pairs [18], which makes traditional 2AFC experiments costly in both experimenter and observer time. However, previous work has shown the 2AFC to be accurate at determining threshold contrast [15, 18, 19]. The 2AFC test can be made more efficient by using a staircase technique for presenting the stimulus [20]. In the basic staircase method the stimulus intensity is decreased for correct responses and increased for incorrect responses [20]. Changing the ratio of correct/incorrect responses required for an intensity change causes the signal intensity to iterate to a predictable threshold with a known probability of detection [21]. Although this method is efficient at finding thresholds, its drawback is that the observer can predict the next step from the previous steps. Randomly interleaving two staircase 2AFC experiments prevents the observer from anticipating the next step.
Image synthesis
Psychophysical techniques such as the 2AFC are based on Statistical Decision Theory. This assumes uncertainty in making a decision because the decision is always made in the presence of noise [22]. Due to the statistical nature of these tests, large numbers of images and several observers are usually required. An additional requirement is that a gold standard is established against which observer response is compared. Psychophysical experiments using clinical images may be desirable, but they are difficult to conduct because it is hard both to obtain significant numbers of images with suitable anatomy/pathology and to establish a gold standard.
Some investigators have sought to redress some of these problems by using simulated lesions superimposed on clinical images [23, 24]. In medical imaging experiments the target signal is often spherical, simulating either a nodule or a well-circumscribed lesion. This type of signal has the obvious advantage that it is easy to simulate. Samei et al [25] investigated models for test details to simulate the profiles of genuine lung lesions. They conducted a review of studies that used nodule phantoms to simulate subtle lung lesions and concluded that all the nodules had Gaussian-like contrast profiles with diffuse edges and ranged in diameter from 2.5 mm to 20 mm. This is in agreement with the "designer" lesion profiles generated by Burgess et al [13], who used a distance ratio power function to generate their lesions. Other workers [26] have described sectional MR profiles with a top hat profile that closely matches the generalized form of the designer nodule.
A logical extension to the use of a synthetic lesion is to also use synthesized clinical backgrounds. Although there are many techniques for synthesizing texture [27-30], few can comprehensively cope with growing large area images from clinical seeds due to the complex, non-repetitive nature of clinical textures. Some users have addressed this problem by using synthetic images that have the same statistical characteristics as clinical textures [13, 29]. However, only a limited number of textures, such as breast structure, can be currently synthesized using this method. An alternative, proposed by Brettle et al [31], is to use the method of Efros and Leung [28] with modified seed samples. The modification is based on sample reordering, interchanging the outside edges with the centre of the sample. Edge non-continuities are removed by padding with solutions synthesized from the body of the original sample. This method allows large area textures to be generated from a small clinical image sample.
Aims
The aim of this work was to investigate the hypothesis that clinical experience is the determining factor in the ability to detect clinical abnormalities in dominant local area anatomical noise. To test this hypothesis the threshold of detection of clinically relevant features in clinically relevant backgrounds was determined for a large population of observers with varying levels of experience.
| Method and materials |
|---|
Test design
Pre-study investigations determined that the optimal experimental design for this study was an interleaved staircase 2AFC test. The parameters of the experiment were set at a 3:1 step ratio where the amplitude of the target signal was decreased for 3 correct responses in a row and increased for 1 incorrect. This causes the experiment to iterate to the 75% correct detection rate at the end of the experiment. The signal start amplitude was set at 60 grey levels and the percentage intensity step change was set at 9.1% after Burgess et al for comparison [13]. Each experiment comprised 100 trials, with the signal grey level varying as described above for each trial. In this way the experiment would iterate to threshold in approximately 15 min. Threshold was defined as the mean of the signal grey levels over the last five trials of the experiment, in order to accommodate a complete step change cycle.
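The staircase rule described above can be illustrated with a minimal Python sketch. It is not the software used in the study. The class and function names, the Weibull-style simulated observer and its `alpha` and `beta` parameters are hypothetical, and the sketch assumes that the 9.1% step is applied multiplicatively (a decrease multiplies the amplitude by 1 - 0.091 and an increase divides by the same factor), that the 100 trials are shared between two randomly interleaved staircases, and that the threshold is taken per staircase.

```python
import math
import random


class Staircase:
    """3-correct-down / 1-incorrect-up staircase on signal amplitude (grey levels)."""

    def __init__(self, start=60.0, step=0.091, n_down=3):
        self.level = start      # current signal amplitude in grey levels
        self.step = step        # fractional step size (9.1% in the text)
        self.n_down = n_down    # consecutive correct responses needed to decrease
        self.run = 0            # consecutive correct responses so far
        self.history = []       # amplitude presented on each trial

    def next_level(self):
        """Record and return the amplitude to present on the next trial."""
        self.history.append(self.level)
        return self.level

    def update(self, correct):
        """Apply the staircase rule after one observer response."""
        if correct:
            self.run += 1
            if self.run == self.n_down:
                self.level *= 1.0 - self.step   # assumed multiplicative decrease
                self.run = 0
        else:
            self.level /= 1.0 - self.step       # assumed inverse step for an increase
            self.run = 0

    def threshold(self, n_last=5):
        """Mean amplitude over the last n_last presented trials."""
        tail = self.history[-n_last:]
        return sum(tail) / len(tail) if tail else float("nan")


def simulated_observer(level, rng, alpha=10.0, beta=2.0):
    """Hypothetical 2AFC observer: Weibull-style psychometric function (assumption)."""
    p_correct = 1.0 - 0.5 * math.exp(-((level / alpha) ** beta))
    return rng.random() < p_correct


def run_interleaved(n_trials=100, seed=0, observer=simulated_observer):
    """Randomly interleave two staircases so the next step is hard to anticipate."""
    rng = random.Random(seed)
    stairs = [Staircase(), Staircase()]
    for _ in range(n_trials):
        stair = rng.choice(stairs)
        level = stair.next_level()
        stair.update(observer(level, rng))
    return stairs


if __name__ == "__main__":
    for i, stair in enumerate(run_interleaved()):
        print(f"staircase {i}: {len(stair.history)} trials, "
              f"threshold ~ {stair.threshold():.1f} grey levels")
```

Replacing `simulated_observer` with a function that records a real key press would turn the same loop into an experiment driver; the simulated observer is only there to make the sketch runnable.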
Image generation
All images were synthesized using the modified method of Efros and Leung [28] described by Brettle et al [31]. Two textures were generated from clinical texture seeds: an MR T1-weighted brain slice (Figure 1) and an X-ray trabecular bone texture (Figure 2). All synthesis was conducted using routines written in Matlab (The Mathworks, Inc., Natick, USA). Lesions were simulated using the designer nodule function described by Burgess et al [13] at 15 mm and 2.5 mm diameter (measured at the monitor face) for the brain and bone textures, respectively (Figure 3). The signal intensity was varied under computer control and was defined as the peak grey level in the lesion profile. These lesions were added on top of the anatomical backgrounds; no additional noise was added, although the original texture noise was preserved. All images were generated at 286 × 286 pixels with 8 bits of grey scale, and 128 images were generated for each texture. The effective pixel size of the monitor was 0.25 mm, giving an image size of 7.15 cm × 7.15 cm on the monitor face at 1:1 magnification. For consistency all the texture backgrounds were rescaled to a mean grey level of 128. To minimize fluctuations due to differential stimulus uncertainty [32] the background images were tested to identify those that had the highest correlation in a square central region with side length twice the largest signal diameter. This resulted in the 26 most correlated backgrounds being selected for use in the observer trials. The images were presented to the observer in pairs: one image contained the background plus signal; the other, background only. Fiducial markers were added to locate the signal exactly at the centre of the background image; this was to ensure there was no element of search in the detection task. A high contrast reference signal was always displayed to ensure the observer knew the signal they were looking for (Figure 4).
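As an illustration of how a signal of this kind can be placed on a background, the following Python sketch builds a radially symmetric nodule and adds it to the centre of an image. It is not the Matlab code used in the study. The profile `peak * (1 - rho**2)**exponent` is one common form of a distance ratio power function, and the exponent value of 1.5, the function names and the flat test background are assumptions made for the example.

```python
import math

PIXEL_MM = 0.25      # effective monitor pixel size from the text
IMAGE_SIZE = 286     # image width and height in pixels from the text


def mm_to_pixels(length_mm, pixel_mm=PIXEL_MM):
    """Convert a length at the monitor face to a pixel count."""
    return length_mm / pixel_mm


def designer_nodule(diameter_px, peak, size=IMAGE_SIZE, exponent=1.5):
    """Return a size x size nodule image as a list of rows.

    Assumed profile: s(rho) = peak * (1 - rho^2)^exponent for rho < 1, else 0,
    where rho is the distance from the image centre divided by the radius.
    """
    radius = diameter_px / 2.0
    centre = (size - 1) / 2.0
    image = []
    for y in range(size):
        row = []
        for x in range(size):
            rho = math.hypot(x - centre, y - centre) / radius
            row.append(peak * (1.0 - rho * rho) ** exponent if rho < 1.0 else 0.0)
        image.append(row)
    return image


def add_signal(background, nodule):
    """Add the nodule to the background and clip to the 8-bit range 0..255."""
    return [
        [min(255, max(0, round(b + s))) for b, s in zip(b_row, s_row)]
        for b_row, s_row in zip(background, nodule)
    ]


if __name__ == "__main__":
    # Flat stand-in for a synthesized texture with mean grey level 128.
    flat = [[128] * IMAGE_SIZE for _ in range(IMAGE_SIZE)]
    brain_nodule = designer_nodule(mm_to_pixels(15.0), peak=60.0)
    signal_image = add_signal(flat, brain_nodule)
    print("side length (cm):", IMAGE_SIZE * PIXEL_MM / 10.0)
    print("centre pixel:", signal_image[IMAGE_SIZE // 2][IMAGE_SIZE // 2])
```

With the stated pixel size, the 15 mm and 2.5 mm lesions correspond to 60 and 10 pixels, and the 286-pixel image spans 7.15 cm on each side. In a 2AFC pair the second image would simply be the unmodified background.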
Power calculations
A simple validated correlation model [35] with Poisson resampling [36] and limited data from repeat testing of two observers were available for the power calculation (see Table 1). The mean threshold values of these data were used to conduct the power calculation.
Observer selection
For group testing three groups were identified: Diagnostic, Non-diagnostic and Public. The Diagnostic group was formed from Radiologists, Radiographer Practitioners (Radiographer (D)) and Others 1 (who report diagnostic images). The Non-diagnostic group comprised Radiographers, Medical Physicists and Others 2 (who do not report, yet are familiar with viewing medical images). The Public group comprised Academics (University Staff), Students (University Students) and General Public (who are not familiar with viewing medical images).
The study was openly promoted and recruitment was by consecutive attendance. All volunteers read a common information sheet before participating, which specified the exclusion criteria. Only two observers were excluded from the study as they did not have their glasses with them. All observers were asked which group they belonged to before conducting the test and classification was by mutual agreement between the observer and the experiment controller.
Approximately half the observers were recruited within the Leeds Teaching Hospitals NHS Trust, UK; the rest were recruited over 3 days at the United Kingdom Radiological Congress (UKRC). This ensured a good cross-section of observers. In the final analysis there were 31 observers in the Diagnostic group, 38 in the Non-diagnostic group and 32 in the Public group.
Experimental control
The observers were free, and encouraged, to change their viewing distance in order to minimize contrast sensitivity effects. There was no time limit for the experiment although the study was designed so that each experiment lasted approximately 15 min. Training was by verbal feedback for the first few high contrast signals and the observer responses were monitored until the experiment controller was confident the observers were responding to the stimulus correctly. The observer had the option to terminate the test at any time.
Data analysis
Data analysis was conducted using a statistical analysis module, Analyse-IT (Analyse-It Software Ltd, Leeds, UK) for Microsoft Excel. Tests were conducted for normality using the Shapiro-Wilk parameter, normality plots and measures of skew and kurtosis. Depending on the outcome of the normality test, either a parametric between-groups analysis of variance (ANOVA) or the non-parametric Kruskal-Wallis test was conducted.
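For readers who want to see what the two between-groups statistics compute, the following is a minimal pure-Python sketch. It is not the Analyse-it implementation used in the study: the function names are hypothetical, the normality check that chooses between the two tests is not reproduced, and the example data are invented solely to exercise the code.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way (between-groups) ANOVA."""
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with the standard correction for ties."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1.0 - sum(c ** 3 - c for c in counts.values()) / (n ** 3 - n)
    return h / correction if correction > 0 else 0.0


if __name__ == "__main__":
    # Invented thresholds (grey levels) for three groups, for illustration only.
    demo = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print("F =", one_way_anova_f(demo))
    print("H =", kruskal_wallis_h(demo))
```

The F statistic is referred to an F distribution with (k - 1, N - k) degrees of freedom and H to a chi-squared distribution with k - 1 degrees of freedom, where k is the number of groups and N the total number of observers.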
| Results |
|---|
Analysis by group
The bone texture results in Figure 6 were analysed using parametric ANOVA. An F-statistic of 2.79 and a p-value of 0.066 were obtained, indicating that there is no evidence of a statistically significant difference between the groups for the bone texture at the 95% confidence level.
The brain texture results were analysed non-parametrically using the Kruskal-Wallis test. This returned a Kruskal-Wallis statistic of 15.25 and a p-value of 0.0005 (χ² approximation). Therefore it can be concluded that there is a statistically significant difference between the groups for the brain texture at the 95% confidence level.
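The reported p-values can be checked by hand, because with three groups both reference distributions have simple closed forms. The sketch below assumes that all 101 observers (31 + 38 + 32) contributed to each analysis, so that the ANOVA has (2, 98) degrees of freedom and the Kruskal-Wallis statistic has 2; the function names are hypothetical.

```python
import math


def chi2_sf_df2(h):
    """Upper-tail probability of a chi-squared variable with 2 degrees of freedom."""
    return math.exp(-h / 2.0)


def f_sf_df1_2(f, df2):
    """Upper-tail probability of an F(2, df2) variable (closed form for df1 = 2)."""
    return (1.0 + 2.0 * f / df2) ** (-df2 / 2.0)


N_OBSERVERS = 31 + 38 + 32   # Diagnostic + Non-diagnostic + Public
N_GROUPS = 3

p_brain = chi2_sf_df2(15.25)                      # Kruskal-Wallis, brain texture
p_bone = f_sf_df1_2(2.79, N_OBSERVERS - N_GROUPS) # ANOVA, bone texture

if __name__ == "__main__":
    print(f"brain texture p = {p_brain:.4f}")
    print(f"bone texture p = {p_bone:.3f}")
```

Under these assumptions the two values come out at approximately 0.0005 and 0.066, in line with the figures quoted above.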
| Discussion |
|---|
Allocating observers to the relevant groups was found to be best achieved by discussion and mutual agreement between the observer and the experiment controller. Radiographer practitioners were grouped with the radiologists, which may be considered a controversial step, but was considered to be justified because before practising the radiographer must be validated by a radiologist. It is emphasised that the groupings were primarily of experience of viewing medical images, not diagnostic training.
It was not possible to screen the observers for visual acuity. Each was asked to confirm that their vision was normally adjusted for the viewing task and visual correction was allowed if required. Vision testing would have added an ethical complexity beyond the scope of this study. It is accepted that non-reported poor visual acuity could have affected the results and future experiments should aim to address this.
Better power for the test could be achieved by collecting more samples. However, the experience from this study is that this is extremely difficult and other methods of better powering the test should be considered, e.g. improving precision. Collecting data from just two groups, diagnostic and public, may have made data collection more focused on the groups with the largest separation. To clarify whether experience is a significant factor it may be necessary to limit recruitment into the test more precisely, e.g. for the diagnostic group only recruit radiologists of more than 5 years' experience.
If the main hypothesis is true there should be a significant difference between the group results. This study showed no significant difference between the groups for the bone texture at the 95% level (p = 0.066) but a significant difference for the brain texture (p = 0.0005).
The results show a significant correlation between the brain texture and experience, even though clinical experience effects were minimized. A significant correlation was also found for the "between texture" performance of the Public group, but not for the Diagnostic group. This is consistent with the work of Davies et al [2] who suggested that some aspects of radiological skill may be based on changes in the effectiveness of early visual processes. The fact that the correlation was not found for both textures is consistent with the work of Burgess and Colborne [38] who concluded that internal noise is proportional to image noise and that this process is dominant when external noise is readily visible.
Group pre-selection could have confounded this study; this would occur if someone who is good at detection chooses a career that utilizes that ability. Additionally, training selection could have filtered selection into the diagnostic group. Both these processes would focus ability in the diagnostic group. This may occur to a lesser degree with the non-diagnostic group, whereas the public group reflects the general range of abilities. This confounder is difficult to adjust for and would ideally require testing a large group of medical students through from the undergraduate stage to radiologist appointment to quantify it.
An interesting aside is that often in psychophysical studies non-clinical observers are used in place of clinicians for convenience [13]. Usually at least one representative clinician is included to confirm the non-diagnostic results are comparable. However, this study indicates that this may only be valid for certain images and using non-clinical staff may underestimate true clinical performance at the detection task.
In summary, we have found a significant correlation between increasing experience and detection performance, but this was dependent on the composition of the local area anatomical noise. These results add to the evidence of certain previous studies [25, 39] and suggest that internal noise is proportional to the local area image content and that when the observer is not limited by internal noise effects, experience can improve detectability.
| Acknowledgments |
|---|
This work was funded by the NHS (Northern and Yorkshire) Research Training Fellowship.
Current address for Dr Berry, EB Imagistics Ltd, 9 Birk Lane, Leeds LS27 0ST, UK.
Received for publication February 28, 2006. Revision received June 14, 2006. Accepted for publication June 30, 2006.
| References |
|---|
lance/Psycho/Methods/staircase_2afc.html [Accessed 7 July 2006]