Existing Evaluation Methods - Tohoku University Institutional Repository (TOUR)

Many methods are used to evaluate a treatment effect or the accuracy of a test or diagnosis. These methods measure different aspects or properties of a treatment or diagnosis and may be sensitive to factors such as disease prevalence and the spectrum of the disease. Regardless of the method applied, the results are sensitive to the population being studied and the design of the study [Šimundić et al., 2009]. In this section, three commonly applied evaluation methods are described in detail.


6.1.1 Coefficient of Variation (COV)

The coefficient of variation (COV) is a measure of the variability of a set of data that is independent of the unit of measurement. It can therefore be used to compare the spread of datasets with different units of measurement. However, it is applicable only if the data are measured on a scale with a true zero.

COV is commonly determined in % using equation (58):

$$\mathrm{COV}(\%) = \frac{\mathrm{stdev}}{\mathrm{mean}} \times 100\% \tag{58}$$

where stdev is the standard deviation of a set of data and mean is the average or mean value of the set of data.
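Equation (58) can be illustrated with a minimal Python sketch. The input values are hypothetical and serve only as an illustration, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not state which form is meant.

```python
import statistics


def cov_percent(values):
    """Coefficient of variation in percent, following equation (58)."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("COV is undefined when the mean is zero")
    # Assumption: sample standard deviation (n - 1 denominator).
    return statistics.stdev(values) / mean * 100.0


# Hypothetical outcome values for one subject condition (illustration only).
outcome = [1.8, 2.0, 2.2, 2.1, 1.9]
print(f"COV = {cov_percent(outcome):.2f}%")

# Rescaling the data (a change of unit) leaves COV unchanged.
rescaled = [10.0 * v for v in outcome]
print(f"COV after rescaling = {cov_percent(rescaled):.2f}%")
```

Because the standard deviation and the mean scale by the same factor, the ratio is unit-free, which is why COV can be compared across measurements with different units.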

COV is used to assess the performance of radiotracers, such as CNS radiotracers. A good CNS radiotracer has a small COV, which indicates small variation in the measured outcome (e.g. BPND) [Guo et al., 2009]. A small COV within a subject condition is desirable. However, COV cannot be used to evaluate how well a radiotracer discriminates between subject conditions: the variation in SUVR differs across subject conditions, and COV is an index of variability within a subject condition, not across subject conditions.

6.1.2 Receiver Operating Characteristics (ROC)

Receiver operating characteristics (ROC) analysis is commonly applied to judge the accuracy of a diagnostic test. It requires a threshold, which is set to divide subjects into two groups, and involves identifying the subjects that are correctly or incorrectly classified. Subjects that are correctly identified as having a disease are termed true positives (TP), and those correctly identified as not having a disease are termed true negatives (TN). Subjects that are incorrectly identified as having and as not having a disease are termed false positives (FP) and false negatives (FN), respectively. These four cases are summarised in Table 6.1.

Table 6.1: Classification of subjects based on diagnostic test results and the actual outcome.

| Diagnostic test result | Actual outcome: positive | Actual outcome: negative |
|---|---|---|
| Positive | TP | FP |
| Negative | FN | TN |

The sensitivity of a test is defined as the probability of correctly identifying subjects with the disease and is calculated as TP / (TP+FN). The specificity of a test is defined as the probability of correctly identifying subjects without the disease and is calculated as TN / (FP+TN) [Bewick et al., 2004]. Positive predictive value (PPV) is the probability that a subject with a positive test result has the disease and is calculated as TP / (TP+FP). Negative predictive value (NPV) is the probability that a subject with a negative test result does not have the disease and is calculated as TN / (TN+FN) [Bewick et al., 2004].
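The four definitions above can be sketched in Python as follows. The function name and the counts are hypothetical and chosen only to illustrate the arithmetic; they are not taken from any study cited here.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV and NPV from the four counts of Table 6.1."""
    return {
        "sensitivity": tp / (tp + fn),  # TP / (TP + FN)
        "specificity": tn / (fp + tn),  # TN / (FP + TN)
        "ppv": tp / (tp + fp),          # TP / (TP + FP)
        "npv": tn / (tn + fn),          # TN / (TN + FN)
    }


# Hypothetical counts: 50 subjects with the disease, 50 without.
metrics = diagnostic_metrics(tp=45, fp=10, fn=5, tn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity are computed within the columns of Table 6.1 (actual outcome), whereas PPV and NPV are computed within its rows (test result), which is why the latter two change with disease prevalence.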

The receiver operating characteristics (ROC) plot is obtained by varying the threshold, determining the sensitivity and specificity at each threshold, and plotting the sensitivity against 1 − specificity (Figure 6.1). The area under the ROC curve (AUROC, Az) can be used as a global measure of diagnostic accuracy, but it cannot distinguish a test with higher sensitivity from one with higher specificity. The value of Az ranges from 0 to 1; the larger the value of Az, the higher the diagnostic accuracy.

Figure 6.1: Receiver Operating Characteristic plot with three curves of different Az values of 0.5, 0.75 and 0.9. The curve with Az of 0.75 is coloured in gray.

Az can be determined as a sum of trapezoids (empirical) or by fitting the curve (parametric) [Bewick et al., 2004]. It is often accompanied by a 95% confidence interval (CI) and a statistical test [Šimundić et al., 2009]. In amyloid imaging, Az, sensitivity and specificity are commonly employed to determine the diagnostic accuracy of a radiotracer in classifying subjects into HC, MCI and AD conditions, as classified by neuropsychological tests (Table 6.2). For clinically applied amyloid radiotracers, the values of sensitivity and specificity are mostly greater than 85% and vary even for the same radiotracer, because sensitivity and specificity depend on the population evaluated.
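The empirical (trapezoidal) determination of Az can be sketched in Python as below. This is a minimal illustration under stated assumptions: a score at or above the threshold is counted as a positive test result, labels are 1 for disease and 0 for no disease, and the scores and labels are hypothetical.

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) points, one per threshold.

    Assumption: a score >= threshold is a positive test result;
    labels are 1 (disease) or 0 (no disease).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve as a sum of trapezoids."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical scores and true conditions (illustration only).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
print(f"Az = {auc_trapezoid(roc_points(scores, labels)):.3f}")
```

Each threshold contributes one point to the curve, so a single Az value summarises all thresholds at once; the confidence interval mentioned above is not computed in this sketch.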


Table 6.2: Sensitivity and specificity of five clinically-applied amyloid radiotracers.

| Radiotracer | Sensitivity (%) | Specificity (%) | References |
|---|---|---|---|
| [¹¹C]PIB | 97.2 | 85.3 | Hatashita et al., 2014 |
| [¹¹C]BF227 | 97.5# | 81.7# | Shao et al., 2010; Furumoto et al., 2013 |
| [¹⁸F]flutemetamol | 95.2* | 89.3* | Hatashita et al., 2014; Vandenberghe et al., 2010 |
| [¹⁸F]florbetapir | 92.7$ | 95.3$ | Clark et al., 2011; Camus et al., 2012 |
| [¹⁸F]FACT | 90 | 100 | Furumoto et al., 2013 |

# Averaged values of Sensitivity (95,100) and of Specificity (92,71.4)

* Averaged values of Sensitivity (97.2,93.1) and of Specificity (85.3,93.3)

$ Averaged values of Sensitivity (93,92.3) and of Specificity (100,90.5)

Sensitivity, specificity, PPV and NPV depend on the threshold applied and the spectrum of the disease. Az is independent of the threshold but does not distinguish tests with high specificity from tests with high sensitivity. The values of sensitivity, specificity, PPV, NPV and Az range from 0.0 to 1.0. To apply ROC correctly, a statistical analysis with CI should be reported to determine the strength of the difference between two treatments or diagnostic tests in discriminating two groups of subjects. However, statistical results depend on the sample size and the disease spectrum in clinical studies.

6.1.3 Power & Sample Size Analysis

Effect size (Es) shows the magnitude of the difference between two datasets. It is scale-free (unitless), so it can be used to compare the relative magnitudes of effects in different data.

Several formulae exist for determining Es, and the appropriate choice depends on the dataset. The equation used to determine Es is similar to the Z-score or t-value formulae, but instead of dividing by the population standard deviation or the standard error (SE = σ/√n), respectively, a specified standard deviation is applied. The most common Es is Cohen's D, which can be determined using the means of the two datasets and a pooled standard deviation, σpooled.

$$Es = \frac{M_2 - M_1}{\sigma_{\mathrm{pooled}}} \tag{59}$$

$$\sigma_{\mathrm{pooled}}\ (\text{Cohen's D}) = \sqrt{\frac{SD_2^2 + SD_1^2}{2}} \tag{60}$$

where M2 is the mean of the sample or dataset 2 and M1 is the mean of the population or dataset 1. SD2 is the standard deviation of the sample or dataset 2 and SD1 is the standard deviation of the population or dataset 1.
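Equations (59) and (60) can be sketched in Python as follows. The two datasets are hypothetical values used for illustration only, and the use of the sample standard deviation is an assumption, since the text does not specify which form is meant.

```python
import math
import statistics


def cohens_d(dataset1, dataset2):
    """Effect size Es (Cohen's D) following equations (59) and (60)."""
    m1 = statistics.mean(dataset1)
    m2 = statistics.mean(dataset2)
    # Assumption: sample standard deviations (n - 1 denominator).
    sd1 = statistics.stdev(dataset1)
    sd2 = statistics.stdev(dataset2)
    sigma_pooled = math.sqrt((sd2 ** 2 + sd1 ** 2) / 2.0)  # equation (60)
    return (m2 - m1) / sigma_pooled                        # equation (59)


# Hypothetical measurements for two subject conditions (illustration only).
dataset1 = [1.0, 1.1, 1.2, 1.3, 1.4]
dataset2 = [1.5, 1.6, 1.7, 1.8, 1.9]
print(f"Es (Cohen's D) = {cohens_d(dataset1, dataset2):.3f}")
```

In this example both datasets have the same size and the same standard deviation, so the simple average of the two variances in equation (60) is appropriate.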

However, Es (Cohen's D) is applicable only if the variances of the two datasets are homogeneous, so that any difference between the two standard deviations is due to sampling variation. If the standard deviations of the two datasets differ greatly, this assumption is violated and the standard deviations cannot be pooled. If the sample sizes of the two datasets differ, σpooled (Hedges' G) is recommended, as it weights each standard deviation by its sample size.

$$\sigma_{\mathrm{pooled}}\ (\text{Hedges' G}) = \sqrt{\frac{(n_1 - 1) \cdot SD_1^2 + (n_2 - 1) \cdot SD_2^2}{n_1 + n_2 - 2}} \tag{61}$$

If σpooled (Hedges' G) is applied, Es should be corrected for a small positive bias.

$$\text{Corrected } Es\ (\text{Hedges' G}) = Es \cdot \left(1 - \frac{3}{4(n_1 + n_2) - 9}\right) \tag{62}$$

where n1 and n2 are the sample sizes of datasets 1 and 2, respectively.
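The weighted pooling of equation (61) and the bias correction of equation (62) can be sketched in Python as below. The datasets are hypothetical, and the sample standard deviation (n − 1 denominator) is assumed, which matches the (n − 1) weights in equation (61).

```python
import math
import statistics


def hedges_g(dataset1, dataset2):
    """Return (uncorrected Es, corrected Es) following equations (59), (61) and (62)."""
    n1, n2 = len(dataset1), len(dataset2)
    m1, m2 = statistics.mean(dataset1), statistics.mean(dataset2)
    var1 = statistics.variance(dataset1)  # SD1 squared
    var2 = statistics.variance(dataset2)  # SD2 squared
    # Equation (61): standard deviations weighted by sample size.
    sigma_pooled = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    es = (m2 - m1) / sigma_pooled
    # Equation (62): correction for the small positive bias.
    corrected = es * (1.0 - 3.0 / (4.0 * (n1 + n2) - 9.0))
    return es, corrected


# Hypothetical datasets of unequal size (illustration only).
dataset1 = [1.0, 1.2, 1.4]
dataset2 = [1.5, 1.6, 1.7, 1.8, 1.9]
es, corrected = hedges_g(dataset1, dataset2)
print(f"Es = {es:.3f}, corrected Es = {corrected:.3f}")
```

The correction factor is always below 1 and approaches 1 as n1 + n2 grows, so it matters mainly for small samples such as the one in this example.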

If the sample sizes of both datasets are the same but the standard deviations differ, the standard deviation of the control group should be used instead to determine Es (Glass's delta). This is based on the assumption that the measurements of the control group are not biased by the treatment or another external factor.
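Glass's delta can be sketched in Python in the same way. The control and treated values are hypothetical, and the sample standard deviation of the control group is an assumption about which form is intended.

```python
import statistics


def glass_delta(control, treated):
    """Es (Glass's delta): mean difference divided by the control-group SD."""
    mean_difference = statistics.mean(treated) - statistics.mean(control)
    # Only the control group's spread is used, on the assumption that it is
    # not affected by the treatment.
    return mean_difference / statistics.stdev(control)


# Hypothetical groups of equal size but clearly different spread.
control = [1.0, 1.1, 1.2, 1.3, 1.4]
treated = [1.3, 1.7, 1.9, 2.1, 2.5]
print(f"Es (Glass's delta) = {glass_delta(control, treated):.3f}")
```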

The use of Es is limited to normally distributed datasets and to the comparison of one type of measurement at a time. Moreover, Es is based on average values, which can differ widely depending on measurement reliability. For normally distributed datasets with equal variances, Es (Cohen's D) can also be converted to a common language effect size (CL), also known as Az in ROC [McGraw and Wong, 1992]:

$$CL = \Phi\left(\frac{\delta}{\sqrt{2}}\right) \tag{63}$$

where Φ is the cumulative distribution function of the standard normal distribution and δ is the population effect size of datasets with homogeneous variance, similar to Es (Cohen's D).
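Equation (63) can be evaluated with the Python standard library alone, as in the minimal sketch below. Here Φ is computed from the error function `math.erf`, and the δ values are arbitrary illustrations rather than results from this work.

```python
import math


def normal_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def common_language_effect_size(delta):
    """CL following equation (63): Phi(delta / sqrt(2))."""
    return normal_cdf(delta / math.sqrt(2.0))


# Arbitrary effect sizes for illustration.
for delta in (0.0, 0.5, 1.0, 2.0):
    print(f"delta = {delta:.1f} -> CL = {common_language_effect_size(delta):.3f}")
```

An effect size of zero gives CL = 0.5, which corresponds to a test that performs no better than chance, and CL approaches 1 as δ increases.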
