First published online July 19, 2006
British Journal of Radiology (2007) 80, 169-176
© 2007 British Institute of Radiology
doi: 10.1259/bjr/35012658

Full paper

Visual grading characteristics (VGC) analysis: a non-parametric rank-invariant statistical method for image quality evaluation

M Båth, PhD 1 and L G Månsson, PhD 2

1 Department of Medical Physics and Biomedical Engineering, Sahlgrenska University Hospital, SE-413 01 Göteborg, 2 Department of Radiation Physics, Göteborg University, SE-413 01 Göteborg, Sweden


    Abstract
Visual grading of the reproduction of important anatomical structures is often used to determine clinical image quality in radiography. However, many visual grading methods incorrectly use statistical methods that require data belonging to an interval scale. The rating data from the observers in a visual grading study with multiple ratings is ordinal, meaning that non-parametric rank-invariant statistical methods are required. This paper describes such a method for determining the difference in image quality between two modalities called visual grading characteristics (VGC) analysis. In a VGC study, the task of the observer is to rate his confidence about the fulfilment of image quality criteria. The rating data for the two modalities are then analysed in a manner similar to that used in receiver operating characteristics (ROC) analysis. The resulting measure of image quality is the VGC curve, which – for all possible thresholds of the observer for a fulfilled criterion – describes the relationship between the proportions of fulfilled image criteria for the two compared modalities. The area under the VGC curve is proposed as a single measure of the difference in image quality between two compared modalities. It is also described how VGC analysis can be applied to data from an absolute visual grading analysis study.


    Introduction
Using visual grading of the reproduction of important anatomical structures – especially those mentioned in the European quality criteria [1–3] – for evaluating image quality in radiography has become an established method for several reasons. First of all, the validity of such studies can be assumed to be high since the quality criteria are based on clinically relevant structures and the anatomical background is therefore included. Second, visual grading methods have in special cases been shown to agree both with methods based on receiver operating characteristics (ROC) analysis [4, 5] and with calculations of the physical image quality [6, 7]. This is important, and lends some support to the assumption that the possibility to detect pathology correlates to the reproduction of anatomy – the basic idea of visual grading. Discrepancies between the methods have been reported [8], but have been attributed to the different tasks of the methods rather than to low validity of visual grading. Third, visual grading studies are relatively easy to conduct, especially in comparison with ROC studies, which is important when optimizing equipment at the local level. How to perform visual grading studies has been extensively described [9–12], and the threshold for performing such studies is low. Fourth, the time consumption is moderate, at least for the observers, which means that it is realistic to believe that these methods can be implemented at almost any hospital. The workload on each participating radiologist is typically on the order of a few hours, which means that a study is easy to justify from an economic perspective for the hospital.

However, arguments against the use of visual grading are often presented. Some of these relate to the subjective nature of the task and state that studies of this type amount to a "beauty-contest" [13], meaning that they are not scientifically valid. However, the authors of the present work are of the opinion that this is a simplification of the problem of evaluating clinical image quality and an underestimation of the ability of the radiologist to recognize the needed reproduction of anatomy for making a diagnosis. According to Kundel [14], the images of the highest diagnostic quality are those that enable the observer "to most accurately report diagnostically relevant structures and features". The use of observer preference ("what images do you prefer?" or "with which images do you feel you can make the most secure diagnosis?") can easily be argued against since the observer is free to choose any criterion or criteria he finds appropriate for the task (in a clinical trial of screen–film combinations in portable chest radiography where the observers were asked to indicate their preference without any criteria, they usually chose the modality in use in their department at the time [15]). However, partly to overcome this problem, an international group of well-established radiologists and physicists developed the European quality criteria [1–3], which for specific examinations state important anatomical landmarks and their needed level of reproduction.

Another objection – of more relevance for the present paper – regards the analysis of visual grading data. Since the grading results in an ordinal scale, the ratings cannot be transformed into numerical values. Nevertheless, statistical methods that require not only interval scales but also specific distributions of the data are often used when it is evident that neither of these conditions hold for visual grading. The calculation of mean values and/or standard deviations of ratings are examples of such statistically forbidden operations that are common. The aim of this work is to present a method for interpreting and analysing the results in a way such that objections regarding the use of the data cannot be made for statistical reasons, meaning that the scale steps are treated as ordinal and no assumptions about the distribution of the data are being made. The method takes some of the strengths from both relative and absolute visual grading analysis (VGA), fulfilment of image criteria (IC) and ROC analysis to form a hybrid, which can be used to analyse the characteristics of the visual grading. The method is therefore called visual grading characteristics (VGC) analysis.


    Visual grading
Several types of visual grading methods are described in the literature. Two methods of great relevance to the present paper are IC and VGA. These methods will therefore be described in the following sections.

Fulfilment of image criteria (IC)
As previously mentioned, the European Commission has established "quality criteria" for different radiological examinations [1–3]. A subset of these quality criteria contains the ones referring to image quality, and these are suitable for use in visual grading. These criteria are statements of the needed level of reproduction of important anatomical structures. Fulfilment of image criteria (IC) [11] is a simple visual grading method, in which the task of the observer is to state whether a certain criterion is fulfilled or not in the image. An image criteria score (ICS) is then simply calculated as the proportion of fulfilled criteria [16].

IC has the advantage that the use of parametric statistics is unquestionable since the ICS – the proportion of fulfilled criteria – is the mean of a variable that can take the value of either zero or unity. The central limit theorem states that such a mean is normally distributed for large samples [17]. No mathematical or statistical errors are made in the calculation of the ICS. A second advantage is that since the ratings are absolute, an ICS corresponding to a clinically acceptable level of image quality can be stated. The third advantage is that the quality criteria, established and formulated by an international group of well-established radiologists and physicists, are used in their original form. This means that both the structures pointed to as important and the necessary level of reproduction (e.g. "visually sharp" or "visual") can be used.
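As a minimal sketch of the argument above, the ICS can be computed as the mean of a zero/one variable, with a large-sample normal approximation for its uncertainty. The function names and the 70-out-of-100 data set are hypothetical and chosen only for illustration; they do not come from the paper.

```python
from math import sqrt
from statistics import NormalDist


def image_criteria_score(fulfilled):
    """ICS: proportion of ratings in which the criterion was judged fulfilled."""
    return sum(fulfilled) / len(fulfilled)


def ics_confidence_interval(fulfilled, level=0.95):
    """Normal-approximation interval for the ICS; reasonable only for large samples."""
    n = len(fulfilled)
    ics = image_criteria_score(fulfilled)
    standard_error = sqrt(ics * (1.0 - ics) / n)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return ics - z * standard_error, ics + z * standard_error


# Hypothetical study: 70 of 100 ratings judged the criterion fulfilled (1), 30 not (0).
ratings = [1] * 70 + [0] * 30
ics = image_criteria_score(ratings)
low, high = ics_confidence_interval(ratings)
print(f"ICS = {ics:.2f}, 95% interval = ({low:.3f}, {high:.3f})")
```

The interval relies on the central limit theorem mentioned above, so it should be read with caution when the number of ratings is small or the ICS is close to 0 or 1.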

However, there are disadvantages with IC. Since there is no soft transition from "not fulfilled" to "fulfilled", the observer may have difficulties in deciding whether a specific criterion is fulfilled or not when the reproduction of the anatomical structure is close to the decision threshold of the observer. Also, no consideration is taken to how far away from the decision threshold the reproduction is, which may lead to difficulties in interpreting the results.

Visual grading analysis (VGA)
A second approach to visual grading is to let the observer grade the visibility of important structures, for example the structures from the European quality criteria, using a multistep scale. In this way, the observer is given more freedom to state his opinion about the image quality. VGA is either performed in an absolute manner, where the observer states his opinion about the visibility of a certain structure on an absolute scale (typically consisting of four to five scale steps ranging from "very bad" to "very good"), or in a relative manner, where the observer compares an image with a reference image and gives a statement of the relative visibility of the structure (typically consisting of five scale steps ranging from "much worse" to "much better") [10].

Prior to analysis, the scale steps in a VGA study are most often converted to numerical values. For example, in an absolute VGA study with four scale steps, the lowest scale step may be assigned the number 1 and the highest scale step the number 4. In a relative VGA study with five scale steps, the lowest scale step may be assigned the number –2 and the highest +2. The data from a VGA study are then used to calculate what is often referred to as a visual grading analysis score (VGAS) [7, 18–23], usually defined as simply the mean value of all ratings when the numerical representations of the scale steps are used. A statistical analysis based on the variation of the ratings is also often performed. However, the scale steps used belong to an ordinal scale and the numerical representations are merely arbitrarily chosen names of the scale steps and do not represent numbers on an interval scale (which is necessary for a mean value to have a true meaning). The scale steps belong to an ordered qualitative variable (where the only knowledge about the relationship between "3" and "2" is that "3" is more than "2") and not to a discrete quantitative variable. The VGAS therefore lacks mathematical and statistical validity. The use of the numerical representations of the ordinal scale steps in VGA for calculating a VGAS has also been criticized [24]. The lack of additivity of ordinal data demands non-parametric rank-invariant statistical methods, which means that the methods should be unaffected by any kind of ordered re-labelling of the scale categories [25].
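The following sketch illustrates what rank invariance means in practice. The ratings are hypothetical, and the rank-based statistic used here (the proportion of image pairs in which modality B is rated higher than modality A, with ties counted as one half) is chosen only as one example of a rank-invariant quantity. Re-labelling the scale steps in an order-preserving way reverses the sign of the difference in mean rating but leaves the rank-based statistic unchanged.

```python
from itertools import product

# Hypothetical ratings on a four-step ordinal scale (not data from the paper).
ratings_a = [1, 1, 3, 3, 3]
ratings_b = [2, 2, 2, 2, 4]


def mean(values):
    return sum(values) / len(values)


def prob_b_higher(a, b):
    """Proportion of (A, B) pairs in which B is rated higher; ties count one half."""
    score = sum(1.0 if y > x else 0.5 if y == x else 0.0 for x, y in product(a, b))
    return score / (len(a) * len(b))


# An order-preserving re-labelling of the scale steps: the ranking is untouched.
relabel = {1: 1, 2: 2, 3: 10, 4: 11}
relabelled_a = [relabel[r] for r in ratings_a]
relabelled_b = [relabel[r] for r in ratings_b]

mean_diff_original = mean(ratings_b) - mean(ratings_a)
mean_diff_relabelled = mean(relabelled_b) - mean(relabelled_a)
rank_stat_original = prob_b_higher(ratings_a, ratings_b)
rank_stat_relabelled = prob_b_higher(relabelled_a, relabelled_b)

print(mean_diff_original, mean_diff_relabelled)    # opposite signs
print(rank_stat_original, rank_stat_relabelled)    # identical values
```

With the original labels the mean favours modality B, and with the re-labelled steps it favours modality A, although no observer changed a single rating.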


    ROC analysis
The method proposed in the present paper for evaluating clinical image quality – VGC analysis – is based on concepts developed for ROC analysis. A short description of ROC analysis will therefore be given here.

The fundamental task for an observer in medical imaging is to state whether an image belongs to a healthy patient or whether the patient has a disease. This has led to the need for characterizing the performance of the observer. An intuitive measure of the quality of the observer might be the number of correct responses. However, such a measure has a serious drawback in that it is strongly dependent on the prevalence of signal (or disease). As an example, imagine an image data set corresponding to a selection of patients of which only, say, 1% suffers from a specific disease. If the observer in this case would state that the patient is healthy in all cases, he would, despite the failure of not detecting a single pathological case, end up with the impressive number of 99% correct responses. It is easily understood that a relevant measure needs to be independent of the prevalence of signal. Sensitivity (the probability that an observer detects an existing signal) and specificity (the probability that a healthy patient is determined as being healthy by the observer) are two common measures that fulfil the requirement of independence of the prevalence of signal. However, for a given observer the sensitivity and the specificity are closely correlated in that an increase in sensitivity stemming from a change in the decision threshold in most cases inevitably results in a decrease in specificity. This dependence on decision threshold leads to difficulties in comparing different observers. The positive predictive value (the probability that a patient determined as being sick by the observer actually has a disease) and the negative predictive value (the probability that a patient determined as being healthy by the observer actually is healthy) are two other commonly used measures, which are dependent on both the prevalence and the decision threshold.
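The prevalence example above can be checked with a few lines of arithmetic. The sketch below assumes a hypothetical set of 1000 patients, 1% of whom have the disease, and an observer who calls every patient healthy; the function name and the patient count are illustrative assumptions.

```python
def summarize(tp, fn, tn, fp):
    """Return the proportion of correct responses, sensitivity and specificity."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical data set: 10 diseased and 990 healthy patients, all called healthy.
result = summarize(tp=0, fn=10, tn=990, fp=0)
print(result)
```

The proportion of correct responses is 0.99 even though the sensitivity is zero, which is why a measure that depends on prevalence is unsuitable for characterizing the observer.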

However, the varying choices of the decision threshold are the essence of ROC analysis. The method provides a natural distinction between the inherent detectability of the signal (or disease) and the judgement of the observer, reflected in the positioning of the decision threshold. By deliberately varying the decision thresholds, the trade-offs between the true positive fraction (TPF = sensitivity) and the false positive fraction (FPF = 1 – specificity) can be established. In Figure 1a, four such choices of criterion levels are shown. (A confidence scale with five steps leads to four decision thresholds.) In principle, an infinite number of decision thresholds can be considered, thus generating a continuous ROC curve with the TPF given as a function of the FPF (Figure 1b). Such curves allow one to directly compare the inherent diagnostic capabilities of different diagnostic procedures. A more accurate procedure will generate a curve closer to the top left corner than a less accurate one. Curves situated on or near the diagonal represent totally non-informative procedures, the results of which are no better than pure guesswork. Thus, an ROC curve describes all possible compromises between true positive and false positive decisions inherent in a diagnostic procedure. Like sensitivity and specificity, it is independent of the prevalence of signal (or disease). Furthermore, ROC analysis is independent of any effect that different decision thresholds might have on the diagnostic process.


Figure 1. (a) Probability distributions (A = noise, B = signal) of a detection task showing 4 levels of decision thresholds X1 – X4. Values of X < X1 correspond to the first rating category ("1"), X1 < X < X2 to the second ("2"), etc., and X > X4 to the last ("5"). (b) The resulting ROC curve, giving the true positive fraction (TPF) as a function of the false positive fraction (FPF). The four operating points corresponding to the four decision thresholds in (a) are given by the boxes in (b).

 
Once the ROC points are calculated, some form of curve fitting must be applied to create the whole ROC curve. If a quantitative description of the difference between two modalities is desired, an objective curve fitting is necessary. The underlying assumption for such a fit is the existence of two overlapping distributions for the actual positive and negative cases (Figure 1a). The distributions need not be known. Any distribution that produces the calculated ROC points is suitable [26, 27]. For simplicity, normal distributions are often assumed – with the means µn and µs and the standard deviations σn and σs for the actually negative and positive cases, respectively. In such a case, it can be shown that the ROC points will appear on a straight line when they are plotted on "normal-deviate" axes, i.e. axes where the normal-deviate values z = (x – µ)/σ for the two distributions are scaled linearly [28, 29]. By fitting a straight line to the ROC points in such a binormal graph, a continuous ROC curve can be obtained.
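The straight-line property on normal-deviate axes can be sketched as follows. This is a simplified illustration under stated assumptions: the line is fitted by ordinary least squares, whereas dedicated ROC software typically uses maximum-likelihood estimation, and the operating points are generated from a hypothetical binormal model with intercept 1 and slope 1 rather than taken from any study. The area formula Φ(a/√(1 + b²)) is the standard result for a binormal curve z(TPF) = a + b·z(FPF).

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()


def fit_binormal(fpf, tpf):
    """Least-squares line z(TPF) = a + b * z(FPF) on normal-deviate axes."""
    zx = [std_normal.inv_cdf(p) for p in fpf]
    zy = [std_normal.inv_cdf(p) for p in tpf]
    n = len(zx)
    mx, my = sum(zx) / n, sum(zy) / n
    sxx = sum((x - mx) ** 2 for x in zx)
    sxy = sum((x - mx) * (y - my) for x, y in zip(zx, zy))
    b = sxy / sxx
    a = my - b * mx
    return a, b


def binormal_auc(a, b):
    """Area under a binormal ROC curve with intercept a and slope b."""
    return std_normal.cdf(a / sqrt(1.0 + b * b))


# Hypothetical operating points generated from a binormal model (a = 1, b = 1).
fpf_points = [0.1, 0.3, 0.5, 0.7]
tpf_points = [std_normal.cdf(1.0 + std_normal.inv_cdf(p)) for p in fpf_points]

a, b = fit_binormal(fpf_points, tpf_points)
auc = binormal_auc(a, b)
print(f"a = {a:.3f}, b = {b:.3f}, area = {auc:.3f}")
```

Operating points at exactly 0 or 1 have no finite normal deviate and would have to be excluded or handled separately before such a fit.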

Plotting the ROC points in a binormal graph when the underlying probability distributions are in fact not known may be questioned. However, according to Metz [26], the binormal assumption concerns only the functional form of the ROC curve, which can always be determined empirically. The ratings are ordinal rankings, and it is not necessary for the decision axis to have a particular numerical scale. Any monotonic transformation of the decision axis yields different underlying distributions, but does not alter the functional form of the ROC curve. This was demonstrated by Hanley [27], who showed that simply by changing the decision variable scale, the fitted normal distributions can be deformed into shapes of exponential, chi-square, and rectangular distributions and still represent the original ROC data set. Hanley [30] also reviewed the arguments given in the literature for the binormal model and showed that, in general, even if an alternative distribution were in fact correct, the binormal fit differs so little from the true form that it has no practical consequence. In fact, the variation arising from the sampling of cases usually obscures the possible lack of fit with the binormal model. Thus, the choice of model may be made pragmatically.

The accuracy index most often used is the area under the ROC curve, Az, for which the range of values is [0.5, 1.0], where 0.5 represents detection governed by chance only and 1.0 represents perfect detection. As such, it is a useful quantitative measure of the performance of the observer. In fact, the accuracy index Az has a direct relation to statistical decision theory. It can be shown [31] to be equal to the proportion of correct choices in two-alternative forced-choice (2AFC) experiments, in which the observer is presented two images – one image containing a signal, the other not – at the same time, with the task of deciding which of the two images contains the signal.


    A new method of visual grading – VGC analysis
Visual grading revisited
In ROC analysis, the difference between the signal and noise distributions is determined from the response of the observer to the two distributions, where the task of the observer is to state his confidence about the occurrence of a possible signal. In this way, the two distributions are sampled at points corresponding to the decision thresholds of the observer (Figure 1a). However, Figure 1a can also be interpreted as the thresholds used by an observer in a visual grading study using absolute ratings when comparing two different modalities. Due to variations in exposure, patient anatomy and other fluctuations, the perceived image quality in images produced with a given modality is not constant but gives rise to a distribution. (In Figure 1a, probability distributions A and B correspond to the perceived image quality from modalities A and B, respectively.) The shape and position of this probability distribution is unknown, and it is of course also dependent on the meaning of the term "image quality". However, the problems connected to the operationalization (the specification of measurable empirical referents for abstract concepts) of the phenomenon image quality are outside the scope of this paper, and it is assumed that the rating tasks of the observer are relevant.

For IC, it is easily understood that the ICS corresponds to the proportion of the area of the probability distribution to the right of the decision threshold of the observer for a fulfilled criterion. For example, if the observer would need a value of the decision variable larger than X3 for stating a criterion to be fulfilled in Figure 1a, the ICS for the two modalities "A" and "B" would be the proportions of the two probability distributions of the two modalities to the right of this threshold.

This similarity between ROC analysis and visual grading leads to the insight that it should be possible to use the well-established methods for evaluating ROC data for evaluating data from an absolute visual grading study in which the observer uses multiple scale steps to state his opinion about the image quality. In this way, the difference in the response (visual grading) of the observer to two modalities can be used to characterize the difference between the two modalities in the same way as the difference in the response of the observer to the signal and noise distributions is used to characterize the observer in an ROC study. The term visual grading characteristics (VGC) is therefore suitable for the method of analysing visual grading data in a fashion similar to that used for ROC data.

Visual grading characteristics
The basic VGC study is an expanded IC study in which the observer uses a multistep rating scale to state his opinion about the fulfilment of an image quality criterion. (As is described below, the analysis method can also be applied to data from absolute VGA studies.) VGC can in this way be interpreted as a repeated image criteria scoring, where the observer changes his threshold for the requirement for the fulfilment of each criterion in a similar way to when the scale steps in an ROC study are used by the observer to state the confidence of each positive/negative decision. In this way, the probability distribution of the images from each modality is sampled.

Example of criterion [1]:

Visually sharp reproduction of the trachea and proximal bronchi

Example of rating scale:

  1. Confident that the criterion is not fulfilled
  2. Somewhat confident that the criterion is not fulfilled
  3. Indecisive whether the criterion is fulfilled or not
  4. Somewhat confident that the criterion is fulfilled
  5. Confident that the criterion is fulfilled

By letting the observer state his confidence regarding the fulfilment of a criterion, an ordinal scale is obtained. Like in ROC analysis, it is not required that the different ratings correspond to given numerical intervals on the decision axis [27] or that all readers use the ratings with the same meaning [29, 32], since the ordinal scale is merely used to sample the probability distribution for each modality. As will be described below, the interobserver variation will, however, have some implication for the subsequent data analysis.

Data analysis
For each comparison to be made, a 2×n frequency table is put together, where n is the number of rating categories used. The frequency table gives the number of test results in each rating category separately for the two modalities being compared. To get the corresponding VGC points (pairs of the proportion of fulfilled criteria for modalities A and B), the 2×n frequency table is successively rearranged to 2×2 tables. The first point is calculated by regarding only a rating of "n" as a fulfilled criterion; the next by considering a rating of "n–1" or "n" as a fulfilled criterion, and so on. The last point, comprising all categories, always yields ICS_A = ICS_B = 1.0. In this way, n rating categories give (n–1) VGC points. In addition, the VGC curve always has its origin in ICS_A = ICS_B = 0.0.
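The rearrangement of the 2×n frequency table can be sketched as a cumulative sum taken from the highest rating category downwards. The frequencies below are hypothetical (they are not the values of Table 1), and the trapezoidal area is included only as a simple empirical summary; it is an assumption of this sketch and not the binormal fit described in this paper.

```python
def vgc_points(freq_a, freq_b):
    """Return (ICS_A, ICS_B) pairs from the origin (0, 0) to the end point (1, 1).

    freq_a and freq_b hold the number of ratings in categories 1..n for
    modalities A and B, respectively.
    """
    total_a, total_b = sum(freq_a), sum(freq_b)
    points = [(0.0, 0.0)]
    cum_a = cum_b = 0
    # Lower the threshold one rating category at a time, starting at category n.
    for count_a, count_b in zip(reversed(freq_a), reversed(freq_b)):
        cum_a += count_a
        cum_b += count_b
        points.append((cum_a / total_a, cum_b / total_b))
    return points


def trapezoidal_area(points):
    """Area under the piecewise-linear curve through the given points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical 2x5 frequency table: 100 images per modality, ratings 1 to 5.
freq_a = [10, 20, 30, 25, 15]
freq_b = [3, 10, 22, 30, 35]

points = vgc_points(freq_a, freq_b)
area = trapezoidal_area(points)
for ics_a, ics_b in points:
    print(f"ICS_A = {ics_a:.2f}, ICS_B = {ics_b:.2f}")
print(f"trapezoidal area = {area:.5f}")
```

With five rating categories the list contains the origin, four VGC points and the trivial end point (1, 1), in agreement with the count of (n–1) points given above.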

The VGC points can be plotted in a diagram to produce a VGC curve. To obtain a continuous VGC curve, the same argumentation that was used for obtaining the ROC curve from the ROC data using the binormal fit can be used for obtaining the VGC curve from the VGC data. Suitable ROC software such as ROCKIT or LABMRMC (C E Metz, University of Chicago) can be used for this, and also for calculating the area under the curve (AUC_VGC). (Due to the established connection between Az and ROC analysis, it is recommended that AUC_VGC be used to describe the area under the VGC curve.) The above-mentioned software packages also provide the standard deviation of the AUC_VGC, enabling a statistical analysis of the VGC results. According to Swets and Pickett [29], Az can be treated as a normally distributed variable in most cases (exceptions are if Az is close to 1 or if the number of cases is small (<50)). This means that standard statistical tests can be used for calculating confidence intervals and p-values from obtained values of AUC_VGC and its standard deviation. For example, if the obtained 95% confidence interval for the estimation of AUC_VGC does not cover 0.5 (the value 0.5 corresponding to equal image quality for the two modalities), the difference between the two modalities can be stated to be statistically significant at the 95% level.

Using VGC analysis for analysing VGA data
VGC presents the possibility of performing multiple-choice grading studies using image quality criteria (studies where the observer states his confidence about the fulfilment of a criterion), but it can also be used to analyse standard VGA data that stem from an absolute VGA study. Although conceptually different – VGC corresponds to the observer grading his confidence or agreement in the fulfilment of an image quality criterion whereas VGA corresponds to the observer grading his opinion about the reproduction or visibility of a certain structure – the statistical properties of the resulting data are the same. (In fact, if the structure rated in an absolute VGA study is extracted from an image quality criterion used in a VGC study, the two probability distributions should be identical.) By letting the requirement for a fulfilled criterion be a visibility higher than a specific rating, the absolute VGA data for two modalities are converted to an ICS pair. By repetitively changing the threshold to correspond to the different ratings, the ICS pairs needed for the VGC analysis are obtained.

Application
As an example of a VGC study, Table 1 presents the results from a fictitious study in which an observer has been presented 100 images each from two modalities with the task of – for each image – stating his confidence regarding the fulfilment of a specific image quality criterion. In Table 2, the corresponding pairs of ICS for the two modalities are given. Figure 2 presents the resulting VGC curve, along with the operating points used by the observer for Table 1, obtained from ROCFIT (C E Metz, University of Chicago). The AUC_VGC was determined to be 0.72 with a standard deviation of 0.037. Since the resulting 95% confidence interval for AUC_VGC (0.65, 0.79) does not cover 0.5, it can be stated that Modality B – regarding the specific criterion and observer – is statistically significantly better than Modality A.
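The confidence interval quoted above can be reproduced from the reported area (0.72) and standard deviation (0.037) under the normal approximation. The function name is hypothetical, and the two-sided p-value against the value 0.5 is an additional illustration that is not reported in the paper.

```python
from statistics import NormalDist


def auc_confidence_interval(auc, sd, level=0.95):
    """Normal-approximation confidence interval for an area under the curve."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return auc - z * sd, auc + z * sd


auc_vgc, sd_vgc = 0.72, 0.037  # values reported for the fictitious study
low, high = auc_confidence_interval(auc_vgc, sd_vgc)
significant = not (low <= 0.5 <= high)

# Two-sided p-value for the null hypothesis of equal image quality (area = 0.5).
z_score = (auc_vgc - 0.5) / sd_vgc
p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z_score)))

print(f"95% interval = ({low:.2f}, {high:.2f}), significant = {significant}")
```

Rounded to two decimals the interval is (0.65, 0.79), which matches the interval stated in the text.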


Table 1. Observer ratings from a fictitious visual grading characteristics (VGC) study

 

Table 2. The visual grading characteristics (VGC) data points (pairs of image criteria scores (ICS)) obtained from the data presented in Table 1

 

 
Figure 2. The visual grading characteristics (VGC) curve from the data presented in Tables 1 and 2. The boxes represent the operating points corresponding to the observer's interpretation of the scale steps of the rating scale.

 

    Discussion
VGC analysis consists of elements from both IC and relative and absolute VGA as well as from ROC analysis. The concept of VGC analysis can be interpreted as "IC meets ROC" with the VGC curve presenting the ICS_B (the proportion of images rated as fulfilling a criterion for modality B) as a function of the ICS_A (the proportion of images rated as fulfilling a criterion for modality A) for a grading task, just like the ROC curve describes the TPF (the proportion of images rated as containing a signal for the positive images) as a function of the FPF (the proportion of images rated as containing a signal for the negative images) for a detection task. (One important difference between the two curves is that the ROC curve describes an observer's ability to separate the signal and noise distributions belonging to one modality from each other, whereas the VGC curve describes the observer's opinion about the separation of the image quality distributions from two modalities.) For the observer, the resulting study is similar to absolute VGA with the use of a multistep ordinal scale for grading the image quality. The resulting measure of image quality, AUC_VGC, is finally, like in relative VGA, a relative measure of image quality, describing the image quality for modality B in comparison with modality A.

Using the statistical methods of ROC analysis, VGC analysis presents a solution to the need of non-parametric rank-invariant statistical methods for analysing the ordinal data from visual grading studies. The use of the ROC technique for comparing ordinal data from studies other than detection tasks has been proposed previously. Sonn and Svensson [25] studied changes in activities of daily living (ADL) measured by a 10-level ordinal scale, the Staircase of ADL, in rehabilitation medicine and used the ROC curve to analyse the systematic change in ADL levels between two age groups. The use of the ROC technique, enabling a statistically valid analysis of ordinal data, can probably be applied to many other rating tasks.

Strengths of VGC
IC is the visual grading method for which valid statistical methods have most often been used in the past. Dissatisfaction with the fact that the observer can only use a two-step rating scale in IC (criterion fulfilled/criterion not fulfilled) often leads to the use of VGA, which enables the use of multiple scale steps but is often analysed with invalid statistical methods. The use of VGC analysis can hopefully satisfy the needs for both a valid statistical method and freedom for the observer.

Furthermore, VGC analysis can be used directly on the image quality criteria defined by the European Commission – giving statements of the needed levels of reproduction for certain anatomical landmarks – without the need for extracting the relevant structures from the criteria and grading the visibility of these structures. This has the potential of leading to an increased validity in the use of the image quality criteria in multiple-choice grading studies. However, VGC analysis is not limited to the use of European quality criteria. Modifications of the original quality criteria have been proposed for chest radiography [16], lumbar spine radiography [33] and mammography [23, 34] and these modified criteria – as well as other relevant criteria – may meritoriously be used. Furthermore, the grading task is not limited to normal anatomy. If applicable, grading of image criteria based on pathology may also be used.

Weaknesses of VGC
The value of the AUCVGC can be criticized for the same reason as the Az can be questioned in ROC analysis. The index Az is useful in most cases because it reflects accuracy in general through a range of possible operating points [35]. However, doubts have been expressed by some investigators concerning the fact that a large part of the area comes from the rightmost part of the curve and thereby includes false positive fractions of limited or no clinical relevance. Also, crossing curves can cause confusion; one curve may have higher TPFs than another in the region of relevant FPFs, but if the curves cross for higher FPF values, the superiority of the first curve may be lost or even reversed if the area under each curve is used as an index of accuracy [27, 36]. In the same way, a large part of the area under the VGC curve comes from a part of the curve that corresponds to a very low threshold of the observer for judging a criterion as being fulfilled – possibly corresponding to an unacceptable image quality.
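The concern about where the area comes from can be illustrated with a small sketch. The ratings below are hypothetical, the helper names `vgc_points` and `segment_areas` are introduced here for illustration only, and the trapezoidal rule is assumed as a simple non-parametric estimate of the area; it is not necessarily the estimator a given study would use. Modality A is assumed to be plotted on the horizontal axis and modality B on the vertical axis.

```python
def vgc_points(ratings_a, ratings_b, n_levels):
    """Return (p_a, p_b) pairs, from the strictest to the most lenient threshold.

    p_x is the proportion of ratings for modality x at or above the threshold.
    The threshold n_levels + 1 gives the (0, 0) end point of the curve.
    """
    points = []
    for threshold in range(n_levels + 1, 0, -1):
        p_a = sum(r >= threshold for r in ratings_a) / len(ratings_a)
        p_b = sum(r >= threshold for r in ratings_b) / len(ratings_b)
        points.append((p_a, p_b))
    return points


def segment_areas(points):
    """Trapezoidal area under each straight segment between adjacent points."""
    return [(x1 - x0) * (y0 + y1) / 2
            for (x0, y0), (x1, y1) in zip(points, points[1:])]


# Hypothetical ratings on a 5-step scale (5 = criterion confidently fulfilled).
ratings_a = [2] * 4 + [3] * 8 + [4] * 6 + [5] * 2
ratings_b = [2] * 1 + [3] * 4 + [4] * 8 + [5] * 7

points = vgc_points(ratings_a, ratings_b, n_levels=5)
areas = segment_areas(points)
auc_vgc = sum(areas)

# Segments to the right of the point for threshold 4, i.e. the part of the
# curve that is only reached with a threshold of 3 or lower.
lenient_share = sum(areas[2:]) / auc_vgc

print(f"AUC_VGC = {auc_vgc:.4f}")
print(f"Share of area from thresholds of 3 or lower = {lenient_share:.2f}")
```

With these made-up ratings, roughly three quarters of the area is contributed by the part of the curve corresponding to a threshold of 3 or lower, which is the kind of dominance by lenient thresholds that the criticism refers to.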

The VGC curve
It is important to realise that a VGC curve is completely determined by the two underlying distributions of the modalities being studied (in the same way as the ROC curve is determined by the signal and noise distributions), but that the reverse does not hold. This means that an infinite number of different pairs of distributions result in the same VGC curve. However, some information about the distributions can still be obtained from the VGC curve. A VGC curve situated on or near the diagonal, for example, corresponds to a comparison between two modalities with practically identical image quality distributions. No matter which threshold an observer chooses to use, the ICS will always be equal for two such modalities. A symmetrical curve corresponds to two shifted distributions with equal variance. As a third example, consider a VGC curve that shows an advantage for modality B for a majority of the points, but crosses the diagonal at the rightmost part of the curve and there shows an advantage for modality A. Such a curve corresponds to a probability distribution for modality B with its centre to the right of that of modality A, but with the distribution for modality B being wider, so that at the leftmost part of the decision axis the proportion of the distribution to the left of a decision threshold is larger for modality B than for modality A (Figure 3). Such a result would stem from modality B producing images with more variation in image quality than modality A, the majority of the images from modality B being better. However, if the observer's threshold for judging a criterion as being fulfilled is very low, almost all images from modality A will be rated as fulfilling the criterion, whereas the images with the lowest image quality produced by modality B will fail to meet the requirement for a fulfilled criterion. An easily understood example would be if the criterion is heavily dependent on the amount of quantum noise in the image and modality B provides images with a lower average amount of quantum noise but a larger spread.
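The third example can be sketched numerically. The rating counts below are hypothetical and chosen only so that modality B has a higher centre but a wider spread than modality A; the function name `vgc_curve` and the convention of plotting modality A on the horizontal axis are assumptions made for this illustration.

```python
def vgc_curve(counts_a, counts_b):
    """Build empirical VGC curve points from rating counts per scale step.

    counts_x[i] is the number of ratings at scale step i + 1. Counts are
    cumulated from the highest step downwards, so the returned (p_a, p_b)
    pairs run from the strictest to the most lenient observer threshold.
    """
    total_a, total_b = sum(counts_a), sum(counts_b)
    points = [(0.0, 0.0)]  # threshold above the top scale step
    cum_a = cum_b = 0
    for n_a, n_b in zip(reversed(counts_a), reversed(counts_b)):
        cum_a += n_a
        cum_b += n_b
        points.append((cum_a / total_a, cum_b / total_b))
    return points


# Hypothetical counts for scale steps 1..5 (20 ratings per modality).
counts_a = [0, 2, 12, 6, 0]  # narrow distribution centred on step 3
counts_b = [3, 1, 2, 6, 8]   # higher centre, but a tail of poor ratings

points = vgc_curve(counts_a, counts_b)

# Which modality has the larger proportion of fulfilled criteria at each point.
favoured = ["B" if p_b > p_a else "A" if p_a > p_b else "-"
            for p_a, p_b in points]

for (p_a, p_b), side in zip(points, favoured):
    print(f"p_A = {p_a:.2f}  p_B = {p_b:.2f}  favours {side}")
```

In this sketch the points for the strict thresholds lie above the diagonal (favouring modality B), whereas the points for the lenient thresholds lie below it (favouring modality A), so the curve crosses the diagonal in its rightmost part.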


Figure 3. (a) Two image quality distributions and (b) the resulting asymmetrical visual grading characteristic (VGC) curve.

 
Finally, it must be emphasised that the present paper is only a first proposal for the use of ROC statistics for analysing visual grading data. Therefore, further investigations – theoretical as well as experimental – are needed to explore the properties of VGC analysis for different study designs.


    Conclusions
The rating data from the observers in a visual grading study with multiple ratings are ordinal, meaning that non-parametric rank-invariant statistical methods are required. The method presented in this paper for determining the difference in image quality between two modalities, VGC analysis, is such a statistical method. In a VGC study, the task of the observer is to rate his confidence about the fulfilment of image quality criteria. The rating data for the two modalities are then analysed in a manner similar to that used in ROC analysis. The resulting measure of image quality is the VGC curve, which – for all possible thresholds of the observer for a fulfilled criterion – describes the relationship between the proportions of fulfilled image criteria for the two compared modalities. The area under the VGC curve, AUCVGC, can be used as a single measure of the difference in image quality between two compared modalities. VGC analysis can also be applied to data from an absolute VGA study.


    Acknowledgments
 
The authors would like to acknowledge Markus Håkansson and Elisabeth Svensson for stimulating discussions.

Received for publication January 30, 2006. Revision received April 28, 2006. Accepted for publication May 30, 2006.


    References

  1. CEC. European guidelines on quality criteria for diagnostic radiographic images. Report EUR 16260 EN. Luxembourg: Office for official publications of the European Communities, 1996.
  2. CEC. European guidelines on quality criteria for diagnostic radiographic images in paediatrics. Report EUR 16261 EN. Luxembourg: Office for official publications of the European Communities, 1996.
  3. CEC. European guidelines on quality criteria for computed tomography. Report EUR 16262 EN. Luxembourg: Office for official publications of the European Communities, 1996.
  4. Sund P, Herrmann C, Tingberg A, Kheddache S, Månsson LG, Almén A, et al. Comparison of two methods for evaluating image quality of chest radiographs. Proc SPIE 2000;3981:251–7.
  5. Tingberg A, Herrmann C, Lanhede B, Almén A, Besjakov J, Mattsson S, et al. Comparison of two methods for evaluation of the image quality of lumbar spine radiographs. Radiat Prot Dosim 2000;90:165–8.
  6. Sandborg M, McVey G, Dance DR, Alm Carlsson G. Comparison of model predictions of image quality with results of clinical trials in chest and lumbar spine screen-film imaging. Radiat Prot Dosim 2000;90:173–6.
  7. Sandborg M, Tingberg A, Dance DR, Lanhede B, Almén A, McVey G, et al. Demonstration of correlations between clinical and physical image quality measures in chest and lumbar spine screen-film radiography. Br J Radiol 2001;74:520–8.
  8. Tingberg A, Båth M, Håkansson M, Medin J, Besjakov J, Sandborg M, et al. Evaluation of image quality of lumbar spine images: a comparison between FFE and VGA. Radiat Prot Dosim 2005;114:53–61.
  9. Månsson LG. Evaluation of radiographic procedures. Investigations related to chest imaging. (Thesis). Göteborg: Göteborg University; 1994.
  10. Månsson LG. Methods for the evaluation of image quality: a review. Radiat Prot Dosim 2000;90:89–99.
  11. Tingberg A. Quantifying the quality of medical X-ray images. An evaluation based on normal anatomy for lumbar spine and chest images. (Thesis). Lund: Lund University; 2000.
  12. Båth M. Imaging properties of digital radiographic systems. Development, application and assessment of evaluation methods based on linear-systems theory. (Thesis). Göteborg: Göteborg University; 2003.
  13. Chakraborty DP. Problems with the differential receiver operating characteristic (DROC) method. Proc SPIE 2004;5372:138–43.
  14. Kundel HL. Images, image quality and observer performance. Radiology 1979;132:265–71.
  15. Vucich J, Goodenough DJ, Lewicki A, Briefel E, Weaver KE. Use of anatomical criteria in screen/film selection for portable chest x-ray procedures. In: Cameron J, editor. Optimization of chest radiography. HHS Publication 80-8124. Rockville, MD: FDA, 1980:237–48.
  16. Lanhede B, Båth M, Kheddache S, Sund P, Björneld L, Widell M, et al. The influence of different technique factors on image quality of chest radiographs as evaluated by modified CEC image quality criteria. Br J Radiol 2002;75:38–49.
  17. Altman DG. Practical statistics for medical research. London: Chapman & Hall, 1991.
  18. Offiah AC, Hall CM. Evaluation of the Commission of the European Communities quality criteria for the paediatric lateral spine. Br J Radiol 2003;76:885–90.
  19. Tingberg A, Herrmann C, Lanhede B, Almén A, Sandborg M, McVey G, et al. Influence of the characteristic curve on the clinical image quality of lumbar spine and chest radiographs. Br J Radiol 2004;77:204–15.
  20. Sund P, Båth M, Kheddache S, Månsson LG. Comparison of visual grading analysis and determination of detective quantum efficiency for evaluating system performance in digital chest radiography. Eur Radiol 2004;14:48–58.
  21. Geijer H, Persliden J. Varied tube potential with constant effective dose at lumbar spine radiography using a flat-panel digital detector. Radiat Prot Dosim 2005;114:240–5.
  22. Wiltz HJ, Petersen U, Axelsson B. Reduction of absorbed dose in storage phosphor urography by significant lowering of tube voltage and adjustment of image display parameters. Acta Radiol 2005;46:391–5.
  23. Hemdal B, Andersson I, Grahn A, Håkansson M, Ruschin M, Thilander-Klang A, et al. Can the average glandular dose in routine digital mammography screening be reduced? A pilot study using revised image quality criteria. Radiat Prot Dosim 2005;114:383–8.
  24. Geijer H, Verdonck B, Beckman K-W, Andersson T, Persliden J. Digital radiography of scoliosis with a scanning method: initial evaluation. Radiology 2001;218:402–10.
  25. Sonn U, Svensson E. Measures of individual and group changes in ordered categorical data: application to the ADL staircase. Scand J Rehabil Med 1997;29:233–42.
  26. Metz CE. ROC methodology in radiologic imaging. Invest Radiol 1986;21:720–33.
  27. Hanley JA. Use of receiver operating characteristics (ROC) analysis in imaging experiments. In: Short Course Notes SC 34. Bellingham, WA: SPIE - The International Society for Optical Engineering, 1991.
  28. Green DM, Swets JA. Signal detection theory and psychophysics. New York, NY: Wiley, 1966.
  29. Swets JA, Pickett RM. Evaluation of diagnostic systems. New York, NY: Academic Press, 1982.
  30. Hanley JA. The robustness of the binormal model used to fit ROC curves. Med Decis Making 1988;8:197–203.
  31. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982;143:29–36.
  32. Hannequin P, Liehn JC, Fortier A, Elaerts J, Valeyre J. Comparison of phase analysis with factor analysis in equilibrium gated radionuclide angiography. Nucl Med Commun 1986;7:857–64.
  33. Almén A, Tingberg A, Besjakov J, Mattsson S. The use of reference image criteria in X-ray diagnostics: an application for the optimisation of lumbar spine radiographs. Eur Radiol 2004;14:1561–7.
  34. Grahn A, Hemdal B, Andersson I, Ruschin M, Thilander-Klang A, Börjesson S, et al. Clinical evaluation of a new set of image quality criteria for mammography. Radiat Prot Dosim 2005;114:389–94.
  35. Swets JA, Pickett RM. Assessing diagnostic technologies: reply to Habicht. Science 1980;207:1415.
  36. Habicht J-P. Assessing diagnostic technologies. Science 1980;207:1414.



