Substantial interobserver variation of thyroid volume and function by visual evaluation of thyroid 99mTc scintigraphy

Thyroid scintigraphy is an important tool in the management of thyroid disorders. The amount of tracer taken up by the gland depends not only on the thyroid disease per se; it also correlates inversely with the whole-body iodine pool and is influenced by factors such as iodinated contrast exposure. 99mTc pertechnetate (99mTc) is the tracer most commonly used owing to its short half-life and to its cost and availability. Radioiodine (131I) is widely used for treatment of non-toxic goitre in some countries [1]. Thyroid imaging is often needed for goitre size estimation since clinical assessment of the goitre volume is notoriously inaccurate [2]. The thyroid 131I uptake (RAIU) can be measured exactly by the administration of a tracer dose of 131I. However, a preliminary decision on whether the patient is eligible for 131I therapy is often based on the appearance of the thyroid 99mTc scintigram, and the scintigram is thus used semi-quantitatively for assessment of thyroid RAIU and goitre size. Jarløv et al. have previously assessed the extent to which clinicians differ in their evaluation of goitre [3, 4]. Regarding the diagnosis of solitary scintigraphically cold thyroid lesions, they found a moderate inter- as well as intraobserver variation; these results were in line with those of studies of diseases in other organs [5]. However, the validity of thyroid 99mTc scintigraphy for assessment of quantitative thyroid parameters has received little attention. We therefore investigated whether experienced specialists can make valid assessments of the thyroid RAIU and of goitre size based on a visual evaluation of thyroid 99mTc scintigrams.

MATERIAL AND METHODS

Study population and design

The scintigrams evaluated in this study were obtained from 171 patients who participated in our previous studies [6-9] on recombinant human thyroid-stimulating hormone (TSH) stimulated 131I therapy in patients with a non-toxic nodular goitre. The characteristics of the patients have been described previously [6-9]. All scintigrams were recorded routinely at the initial visits.

Two highly experienced specialists in endocrinology (E1 and E2) and two specialists in nuclear medicine (N1 and N2) participated. None of the four participants were provided with any information about the patients. Based on a visual judgment of a high-quality print of the scintigrams, the physicians were asked, blinded with respect to the other observers, to assess a) the thyroid 24-h RAIU, and b) the thyroid volume, each according to 16 predefined response categories (RAIU: 5% intervals; thyroid volume: 10 ml intervals for volumes ≤ 100 ml, 20-40 ml intervals for volumes in the range 101-200 ml, and 100 ml intervals for volumes > 200 ml). The true values were available from our previous studies [6-9]. Thus, the thyroid 24-h RAIU was determined after oral administration of a tracer activity of 0.5 MBq 131I, and the thyroid volume was measured by either magnetic resonance imaging (MRI), computed tomography or ultrasound, depending on the size of the goitre and the set-up of the respective study.

The survey was performed in two sessions at an interval of approximately four weeks in order to estimate the intraobserver variation. At the second evaluation, the scintigrams were rearranged to minimize the risk of recognition bias. Data from the first evaluation were used to analyse for accuracy of the assessments, while data from the second evaluation were used for determination of the intraobserver variation.

Statistical analyses

The two-sample Wilcoxon rank-sum test and the t-test were used for non-parametric and parametric data, respectively. The χ2-test was used for categorical variables. Odds were calculated in order to assess the probability of a correct assessment. For calculation of the inter- and intraobserver agreements, the kappa (κ) statistic was used. κ adjusts for the agreement that can be expected by chance alone. The κ coefficient can attain values between –1 and +1. According to the κ-value, the degree of agreement was characterised as poor (κ < 0.00), slight (0.00 ≤ κ ≤ 0.20), fair (0.21 ≤ κ ≤ 0.40), moderate (0.41 ≤ κ ≤ 0.60), substantial (0.61 ≤ κ ≤ 0.80), or almost perfect (0.81 ≤ κ ≤ 1). Since there were 16 possible categories for the variables in question, we calculated the weighted kappa (κω) [10]. The level of observer agreement was calculated using the bootstrap technique for inferring confidence values [11], with R = 1,000 bootstrap replications. κ and κω coefficients were compared as described by Gjørup & Jensen [10]. The statistical software used was STATA version 12.1 (STATA Corp LP, Texas, USA). p-values < 0.05 were considered significant.
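As an illustration of the agreement statistics described above, the following Python sketch computes a weighted kappa, a bootstrap interval and the verbal agreement label. It is not the STATA procedure used in the study: the linear weighting scheme, the percentile interval, the function names (`weighted_kappa`, `bootstrap_interval`, `agreement_label`) and the toy ratings are all assumptions made for the example.

```python
import random


def weighted_kappa(a, b, k):
    """Linearly weighted kappa for two raters using categories 0..k-1.

    Assumption: agreement weights w_ij = 1 - |i - j| / (k - 1).
    """
    n = len(a)
    row = [0] * k  # category counts, rater a
    col = [0] * k  # category counts, rater b
    observed_sum = 0.0
    for x, y in zip(a, b):
        row[x] += 1
        col[y] += 1
        observed_sum += 1.0 - abs(x - y) / (k - 1)
    observed = observed_sum / n
    # Weighted agreement expected by chance from the marginal distributions.
    expected = sum(
        (1.0 - abs(i - j) / (k - 1)) * row[i] * col[j]
        for i in range(k)
        for j in range(k)
    ) / (n * n)
    if 1.0 - expected < 1e-12:
        return float("nan")  # undefined when both raters use a single category
    return (observed - expected) / (1.0 - expected)


def bootstrap_interval(a, b, k, reps=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling patients with replacement."""
    rng = random.Random(seed)
    n = len(a)
    values = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]
        value = weighted_kappa([a[i] for i in idx], [b[i] for i in idx], k)
        if value == value:  # skip undefined (NaN) replicates
            values.append(value)
    values.sort()
    lower = values[int((alpha / 2) * len(values))]
    upper = values[min(len(values) - 1, int((1 - alpha / 2) * len(values)))]
    return lower, upper


def agreement_label(kappa):
    """Verbal scale quoted in the text (values between bands go to the higher band)."""
    if kappa < 0.0:
        return "poor"
    for limit, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= limit:
            return label
    return "almost perfect"


# Hypothetical category indices (0..15) given by two observers to ten scintigrams.
rater_1 = [2, 5, 7, 3, 9, 4, 6, 8, 1, 5]
rater_2 = [3, 5, 6, 3, 7, 4, 8, 8, 2, 4]
kw = weighted_kappa(rater_1, rater_2, k=16)
print(kw, agreement_label(kw), bootstrap_interval(rater_1, rater_2, k=16))
```

With 16 ordered categories, the weighting gives partial credit to near misses, which is why an unweighted κ would understate agreement between observers who differ by only one category.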

Trial registration: not relevant.

RESULTS

The thyroid 24-h 131I-uptake: accuracy of observer assessments

Table 1 shows the total number of correct assessments, together with the number of assessments that fell within one category above or below the true category. All observers had fewer than 25% correct assessments, while the highest score was 58% when a range of ± one category was accepted. Observer E1 performed significantly worse than the other observers in assessing the thyroid RAIU (p < 0.001).
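The two scoring rules used here (exact category, and ± one category) can be sketched as follows. The bin edges in `raiu_category` (0-5%, 5-10%, and so on, with an open-ended last category) are an assumption, since the exact category boundaries are not stated, and the four uptake values and assessments are hypothetical.

```python
def raiu_category(raiu_percent, width=5.0, n_categories=16):
    """Index of the 5%-wide response category containing an uptake value.

    Assumed edges: [0, 5), [5, 10), ...; the last category is open-ended.
    """
    index = int(raiu_percent // width)
    return max(0, min(index, n_categories - 1))


def proportion_correct(assessed, true, tolerance=0):
    """Share of assessments within `tolerance` categories of the true category."""
    hits = sum(abs(a - t) <= tolerance for a, t in zip(assessed, true))
    return hits / len(true)


true_raiu = [32.0, 18.5, 41.0, 27.0]              # hypothetical measured uptakes (%)
true_cat = [raiu_category(v) for v in true_raiu]  # [6, 3, 8, 5]
assessed_cat = [6, 2, 5, 6]                       # hypothetical observer choices

print(proportion_correct(assessed_cat, true_cat))               # 0.25
print(proportion_correct(assessed_cat, true_cat, tolerance=1))  # 0.75
```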

Figure 1 presents the number of assessments according to each of the 5% interval categories and in relation to the true values. Thus, the “0%” column represents a correct assessment, while the “–20%” column represents a RAIU scored four categories below the true value. The endocrinologists (E1 and E2) tended to underestimate the thyroid RAIU by choosing a category that was, on average, more than three categories (E1) and more than one category (E2) below the true category. This corresponds to a mean ± SD distance to the correct category of –16.3 ± 11.1% for E1 and –5.6 ± 11.9% for E2, in contrast to the more precise assessments made by N1 (0.5 ± 12.7%) and N2 (0.0 ± 10.5%). The odds for the observers’ ability to estimate the thyroid RAIU correctly were 0.188 (N1), 0.257 (N2), 0.062 (E1) and 0.286 (E2). In 67/171 (39%), 65/171 (38%), 17/171 (10%) and 64/171 (37%) of the cases, observer N1, N2, E1 and E2, respectively, assessed the thyroid RAIU correctly (± one category) in both evaluations.
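The reported odds and mean distances can be converted into more familiar quantities. The sketch below assumes that the odds were calculated as correct/incorrect assessments among the 171 scintigrams. The resulting counts are therefore derived values, not figures taken from Table 1, although they agree with the 6% to 22% range quoted in the Discussion.

```python
def odds_to_proportion(odds):
    """Invert odds = p / (1 - p)."""
    return odds / (1.0 + odds)


N_SCINTIGRAMS = 171
reported_odds = {"N1": 0.188, "N2": 0.257, "E1": 0.062, "E2": 0.286}

for observer, odds in reported_odds.items():
    p = odds_to_proportion(odds)
    # observer, percentage correct, implied number of correct assessments
    print(observer, round(100 * p, 1), round(p * N_SCINTIGRAMS))
# N1 15.8 27
# N2 20.4 35
# E1 5.8 10
# E2 22.2 38

# Mean distance in percentage points divided by the 5% category width
print(round(-16.3 / 5, 2), round(-5.6 / 5, 2))  # -3.26 -1.12
```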

In order to investigate whether the observers’ assessments depended on the thyroid RAIU, an arbitrary cut-off level of 30% was chosen. Both endocrinologists were significantly more accurate in their assessment if the true value of the thyroid RAIU was below 30% (E1: p < 0.001; E2: p < 0.002), while N2 was significantly better in his estimations with a thyroid RAIU above 30% (p < 0.001). The accuracy of N1 was unrelated to the thyroid RAIU (p = 0.591). Similar results were found by analysing data from the second evaluation, which supports the absence of a learning effect.
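A comparison of this kind can be framed as a 2 × 2 table (true RAIU below or above the cut-off, assessment correct or incorrect). The sketch below assumes a Pearson χ2-test without continuity correction. The text states only that the χ2-test was used for categorical variables, so the exact procedure is an assumption, and the counts are hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts for one observer (171 scintigrams in total):
# rows = true RAIU below / above 30%, columns = correct / incorrect assessment.
chi2, p = chi_square_2x2(30, 50, 10, 81)
print(round(chi2, 2), p)
```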

The thyroid 24-h 131I-uptake: observer variation and interspecialty agreement

The four observers were compared in six different pairs (Table 2). The κω-value (0.43) for the two specialists in nuclear medicine was significantly higher than the κω-value (0.21) for the endocrinologists (p < 0.0001). No complete agreement was reached among the four observers in any patient. The agreement between the two evaluations was determined for each of the four observers (Table 2). For both specialties there was a significant difference between κω-values (0.55 (N1) versus 0.41 (N2), p = 0.026; 0.34 (E1) versus 0.68 (E2), p < 0.0001). In 105/171 (61%), 114/171 (67%), 134/171 (78%) and 98/171 (57%) of the cases, observer N1, N2, E1 and E2, respectively, altered their assessment from the first to the second evaluation.

Data from the endocrinologists were pooled, resulting in 342 assessments, as were data from the nuclear medicine physicians. The two specialties were compared for each scintigram, resulting in a κω-value of 0.16 in the first evaluation and of 0.27 in the second evaluation (Table 2).

The thyroid volume: accuracy of observer assessments

Table 1 shows the total number of correct assessments. All observers had fewer than 25% correct assessments, while the highest score was 51% when a range of ± one category was accepted. Figure 2 presents the number of assessments, according to each of the categories, and in relation to the true value, in parallel with the RAIU data. The observers E1 and N2 tended to assess the thyroid volume as being too low, reflected by a mean ± SD deviation from the correct category of –2.84 ± 2.84 categories for E1 and –1.37 ± 2.99 for N2, in contrast to the more precise assessments made by E2 (–0.91 ± 2.88) and N1 (–0.89 ± 2.90). The odds for the observers’ ability to estimate the thyroid volume correctly were 0.286 (N1), 0.155 (N2), 0.171 (E1) and 0.204 (E2). In 58/171 (34%), 50/171 (29%), 62/171 (36%) and 61/171 (36%) of the cases, observer N1, N2, E1 and E2, respectively, assessed the thyroid volume correctly (± one category) in both evaluations. When an arbitrary cut-off level of 80 ml was chosen, all four observers were significantly better at estimating thyroid volumes below 80 ml (p < 0.001). Similar results were obtained in the repeat evaluation.

The thyroid volume: observer variation and interspecialty agreement

The four observers were compared in six different pairs (Table 2). The κω-value (0.35) for the two specialists in nuclear medicine was significantly higher than the κω-value (0.22) for the endocrinologists (p = 0.007). The agreement between the two evaluations was determined for each of the four observers (Table 2). There was a significant difference between κω-values for the two nuclear medicine specialists (0.55 (N1) versus 0.37 (N2), p = 0.003), but not for the two endocrinologists. In 114/171 (67%), 116/171 (68%), 91/171 (53%) and 120/171 (70%) of the cases, observer N1, N2, E1 and E2, respectively, altered their assessment from the first to the second evaluation. As in the analysis of the RAIU data, the two specialties were compared, which resulted in a κω-value of 0.26 in the first evaluation and of 0.30 in the second evaluation (Table 2).

DISCUSSION

When deciding on the choice of therapy in patients with non-toxic goitre, the size of the gland as well as the thyroid RAIU are crucial parameters [12]. Not all clinicians managing such patients have access to an accurate measurement of the thyroid size or to measures of the thyroid RAIU. At some centres, thyroid scintigraphy is the only examination that supplements the clinical examination and the biochemical tests. A previous study [3] on observers’ ability to differentiate between diffuse and multinodular goitres found that a higher agreement was obtained when thyroid scintigraphy was added to other routine tests. However, as demonstrated by the present data, scintigraphic imaging may be misleading in some respects. Thus, the thyroid 99mTc scintigram is a valid method neither for determination of goitre size nor for determination of the thyroid 24-h RAIU. The number of correct 24-h RAIU assessments (within categories of 5% intervals) was low, ranging from 6% to 22%. Even if an assessment was accepted as “correct” by the inclusion of one category below or above the true category, the maximum number of correct assessments only reached 58% for one observer, while the other three scored lower. The nuclear medicine specialists performed better than their colleagues in endocrinology (particularly due to low performance by one of the endocrinologists), who tended to underestimate the thyroid RAIU, especially in cases with a high thyroid RAIU. As for the assessment of thyroid volume, the accuracy was at a similarly low level. This is in line with the results from previous studies where comparisons between MRI, ultrasound and scintigraphy revealed pronounced differences in thyroid volume estimates [13-15]. In our study, all four observers estimated the thyroid volume significantly less accurately when it was above 80 ml. In theory, this may be explained by poorer scintigraphic visualization of large goitres. However, what seems to be a low scintigraphic thyroid uptake is, in fact, a dilution effect of the isotope being distributed into a larger thyroid volume, as supported by our previous study [16]. Indeed, in the present study, we demonstrated that the accuracy of the thyroid volume assessments did not depend on the thyroid RAIU. Cases with a low RAIU (< 10%) were excluded from our study. A possible criticism is that the scintigraphy and the measurement of the RAIU were not performed on the same occasion. The thyroid RAIU may have varied, but we reported previously that the RAIU was very stable among our patients in the study period [17].

The κ statistics used in the present study are widely used to quantify the agreement among observers in the evaluation of a specific variable, but κ-values can be misleading when results are compared between studies [18]. The four observers in the present study achieved low to moderate interobserver agreement of their assessments of both the thyroid 24-h RAIU and the thyroid volume, with most κω-values being below 0.41. In none of the 171 scintigrams did all four observers agree on the thyroid RAIU, and only in one case did they agree on thyroid volume. The interobserver agreement was significantly higher for the two specialists in nuclear medicine than for the endocrinologists. Still, high agreement rates do not necessarily imply a high accuracy when compared with a gold standard. In general, we found higher κω-values for the intra- than for the interobserver variation, as commonly seen in such studies [19]. However, inconsistency persists among experienced specialists in their interpretation of a thyroid scintigram, reflected by κω-values in the range of 0.34-0.68 for the intraobserver variation.
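The distinction between agreement and accuracy drawn above can be shown with a deliberately extreme, hypothetical example: two observers who always choose the same category, two below the true one, agree perfectly (κ = 1) yet are never correct. The function `cohen_kappa` is an unweighted κ written for this illustration only.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two raters."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from the product of the marginal proportions.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


true_cat = [3, 5, 7, 9, 11, 4]            # hypothetical true categories
observer_1 = [t - 2 for t in true_cat]    # consistently two categories too low
observer_2 = list(observer_1)             # identical to observer 1

accuracy = sum(o == t for o, t in zip(observer_1, true_cat)) / len(true_cat)
print(cohen_kappa(observer_1, observer_2), accuracy)  # 1.0 0.0
```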

In conclusion, the present study underlines that a visual evaluation of thyroid 99mTc scintigraphy is not useful for a valid assessment of either the thyroid 24-h RAIU or the thyroid volume. The poor intra- and especially interobserver agreement further invalidates the method. Importantly, 99mTc scintigraphy remains highly useful for differentiating between functioning and non-functioning thyroid nodules, and in these cases it is associated with high observer agreement [20].

Correspondence: Steen J. Bonnema, Endokrinologisk Afdeling M, Odense Universitetshospital, Søndre Boulevard 29, 5000 Odense C, Denmark. E-mail: steen.bonnema@dadlnet.dk

Accepted: 6 November 2013

Conflicts of interest: Disclosure forms provided by the authors are available with the full text of this article at www.danmedj.dk