Interrater reliability of subacromial ultrasound measures in the hands of novice sonographers
Shoulder pain is one of the most common causes of musculoskeletal pain with an estimated point prevalence in the 6.9-26% range in the community, and subacromial pain syndrome (SAPS) is reported to be the most frequent manifestation of shoulder pain. The subacromial structures are believed to be the primary cause of pain, but their aetiology varies. As a supplement to the clinical diagnosis, musculoskeletal ultrasound (US) is commonly used by radiologists and by non-radiologists such as rheumatologists, orthopaedic surgeons, physiotherapists and other health professionals with limited US training, hereafter referred to as sonographers.
A recent study reported excellent intra- and interrater reliability of subacromial measures using a standardised protocol performed by experienced sonographers. However, the reliability for novice sonographers has only been sparsely investigated, and since US is a highly operator-dependent modality, US experience may play a role. Furthermore, knowledge concerning interrater reliability may be useful in research design and interpretation and may have important implications for clinicians who use US but have limited experience with the modality.
The purpose of this study was to investigate the interrater reliability of standardised static and dynamic subacromial US measures in the hands of novice sonographers.
The investigations were conducted by two novice sonographers acting as raters: rater A (KM), a graduate medical student, who completed a three-day musculoskeletal course prior to the study and had no previous US experience; and rater B (AW), an orthopaedic resident with 50 unsupervised shoulder US examinations. The study adheres to the Guidelines for Reporting Reliability and Agreement Studies (GRRAS), and a training phase preceded the blinded study phase as proposed by the International Federation for Manual/Musculoskeletal Medicine. The open training phase consisted of supervised sessions including introduction to the protocol, practice training, clarification of details and doubts and, finally, approval of the assessment technique. Between the supervised sessions, the novice raters practised the US protocol jointly and unblinded on 16 asymptomatic and 13 symptomatic shoulders over a six-month period. Once consensus and an agreement of 70-80% were achieved, raters proceeded to the blinded study phase where they completed the US protocol twice on the symptomatic and twice on the asymptomatic contralateral shoulder. The order of raters and shoulder laterality changed between patients based on convenience. All measurements were conducted immediately after each image capture. The raters were blinded to health information, each other’s ratings (not in the same room) and own previous measures (using a new paper sheet). Each rater repositioned the patient (and probe) individually, according to the protocol, for each round of measures.
Patients were recruited consecutively from the Outpatient Clinic at the Department of Orthopaedics, Copenhagen University Hospital Hvidovre, Denmark. Eligibility was assessed clinically by experienced shoulder surgeons using standardised shoulder tests and non-standardised US. Patients were included if they were 18 years or older, had sufficient Danish proficiency and were diagnosed with subacromial pain with at least three positive recordings of the following five clinical tests: Hawkins-Kennedy, Neer’s, Jobe’s, Resisted External Rotation and Painful Arc [3, 8]. The exclusion criteria were rheumatic disease, pregnancy, adhesive capsulitis, glenohumeral osteoarthritis, radiculopathy, previous fracture, a complete supraspinatus tendon tear, subacromial steroid injection within the past four weeks and prior surgery or radiation therapy of the shoulder region. Patients with an asymptomatic contralateral shoulder (no pain/dysfunction for the past week; no history of fracture/ surgery/ radiation therapy to the area) were also included in the asymptomatic group.
Patients were instructed not to disclose information about the laterality of the symptomatic shoulder during their US examination. In the blinded study, the patients provided information concerning demographics and symptoms, completed the Shoulder Pain and Disability Index (SPADI) and underwent US examination.
The study was approved by the Institutional Review Board of the Capital Region of Denmark (VD-2019-164) and complies with the principles of the Helsinki Declaration. All patients were informed orally and in writing, and written consent was obtained.
The subacromial measures examined were supraspinatus tendon thickness and subacromial-subdeltoid bursa thickness in two positions (SUPRA1, SUPRA2, SASD1 and SASD2), acromio-humeral distance (AHD) and dynamic impingement (DI). Positioning, image capture and measures presented in this protocol (Figure 1) have shown good to excellent intra- and interrater reliability. The rater was positioned standing behind the patient. The patient was seated on an adjustable armless chair in a neutral trunk position, feet flat on the floor and face forward.
US was conducted using a Hitachi Arietta V70 scanner and a Hitachi L64 linear transducer, 18-5 MHz (Hitachi Medical Systems, Steinhausen, CH) with standardised settings preset for musculoskeletal small parts and fixed starting points in depth, gain and focus.
An a priori sample size calculation was conducted considering β = 0.1 and α = 0.05, expected intraclass correlation coefficient (ICC) = 0.75, minimal pre-specified ICC = 0.5 and two raters, resulting in 34 required subjects per group. Statistical testing was performed at a two-sided 5% significance level and 95% confidence intervals (CIs) were used. Measures were averaged for analyses. Continuous measures were tested for normality, and paired t-tests were used to compare measures from symptomatic and asymptomatic shoulders and to test rater differences. ICC (2,1) (two-way random effects, single rater, absolute agreement) was used to investigate the interrater reliability of continuous measures, interpreted as proposed by Portney et al. Agreement of continuous measures was determined using standard error of measurement, minimal detectable change and Bland-Altman plots with 95% limits of agreement. Categorical data (i.e. DI) were analysed using Cohen’s unweighted κ for reliability purposes with interpretation as proposed by Landis & Koch (1977). Statistical analyses were conducted in R Statistical Package version 3.5.1.
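As a rough illustration of the reliability and agreement statistics named above, the following Python sketch computes ICC (2,1) from two-way ANOVA mean squares (the Shrout and Fleiss formulation) together with the standard error of measurement (SEM) and the minimal detectable change (MDC). The study itself used R and does not state its SEM and MDC formulas, so the function names, the SEM formula SD × √(1 − ICC), the MDC formula 1.96 × √2 × SEM and the example values are all assumptions made for this sketch.

```python
import math
from statistics import stdev


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: list of rows, one row per shoulder, one value per rater.
    """
    n = len(ratings)        # number of subjects (shoulders)
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols                    # residual

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def sem(sd, icc):
    """Standard error of measurement (assumed formula: SD * sqrt(1 - ICC))."""
    return sd * math.sqrt(1.0 - icc)


def mdc95(sem_value):
    """Minimal detectable change at 95% (assumed formula: 1.96 * sqrt(2) * SEM)."""
    return 1.96 * math.sqrt(2.0) * sem_value


# Hypothetical thickness values in mm (rows = shoulders, columns = rater A, rater B).
example = [[5.1, 5.4], [6.0, 5.8], [4.7, 5.0], [5.5, 5.9], [6.3, 6.1]]
icc = icc_2_1(example)
pooled_sd = stdev([x for row in example for x in row])  # assumption: SD of all ratings
print(round(icc, 2), round(sem(pooled_sd, icc), 2), round(mdc95(sem(pooled_sd, icc)), 2))
```

Because the absolute-agreement form keeps the between-rater mean square in the denominator, a systematic offset between raters lowers ICC (2,1) even when the raters rank the shoulders identically.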
Trial registration: not relevant.
Thirty-three patients were included with five patients subsequently being excluded (complete supraspinatus tendon tear (n = 1), subacromial steroid injections within four weeks (n = 4)). The asymptomatic contralateral shoulders were included in 20 of the patients, resulting in a total of 48 shoulders. The 28 patients had a mean age of 51.7 years (standard deviation (SD): ± 13.9 years), 14 (50%) were female, 22 (79%) had their dominant side affected, mean BMI was 27.5 kg/m2 (SD: ± 4.3 kg/m2), mean symptom duration was three years (SD: ± 2.9 years) and mean SPADI score (0-100 (best)) was 46.7 (SD: ± 23.8).
Except for SASD2, no differences were recorded in agreement and reliability between position 1 and position 2 (Figure 1), and between symptomatic and asymptomatic shoulders (Table 1). No significant differences were found between absolute measures of the symptomatic and the asymptomatic side (p ≥ 0.068). In symptomatic shoulders, the mean was 0.0-0.4 mm larger than in asymptomatic shoulders (Table 1).
Bland-Altman plots of SUPRA and SASD measures revealed no systematic bias, with mean differences below 0.3 mm. AHD was negatively skewed, with a significant mean difference between the raters of 0.61 mm in asymptomatic shoulders (p = 0.045), rater A measuring smaller AHD than rater B (Figure 2).
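The bias and limits of agreement behind a Bland-Altman plot can be sketched as follows. This is a minimal Python illustration, not the study's R code; the function name and the AHD values are hypothetical, and the differences are taken as rater A minus rater B, so a negative bias means that rater A measures smaller values.

```python
from statistics import mean, stdev


def bland_altman(rater_a, rater_b):
    """Return (bias, lower limit, upper limit) for paired measurements.

    Bias is the mean of the differences (A - B); the 95% limits of
    agreement are bias +/- 1.96 * SD of the differences.
    """
    diffs = [a - b for a, b in zip(rater_a, rater_b)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical AHD values in mm for five shoulders.
ahd_a = [9.8, 10.5, 11.2, 8.9, 10.1]
ahd_b = [10.4, 11.0, 11.9, 9.6, 10.6]
bias, lower, upper = bland_altman(ahd_a, ahd_b)
print(f"bias {bias:.2f} mm, limits of agreement {lower:.2f} to {upper:.2f} mm")
```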
As described in Table 2, rater B identified DI more than twice as often as rater A did (15 versus 6).
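For readers who want to see how Cohen's unweighted κ follows from a 2 × 2 table of two raters' DI findings, a minimal Python sketch is given below. The cell counts are illustrative only: the marginal totals were chosen to match the DI prevalences reported in this article for the 28 symptomatic shoulders (18% for rater A, 43% for rater B), but the split into individual cells is an assumption and is not taken from Table 2.

```python
def cohens_kappa(both_pos, a_only, b_only, both_neg):
    """Cohen's unweighted kappa for two raters and a binary finding.

    both_pos: shoulders rated positive by both raters
    a_only:   positive for rater A only
    b_only:   positive for rater B only
    both_neg: negative for both raters
    """
    n = both_pos + a_only + b_only + both_neg
    p_observed = (both_pos + both_neg) / n
    a_pos = (both_pos + a_only) / n   # rater A's prevalence
    b_pos = (both_pos + b_only) / n   # rater B's prevalence
    # Agreement expected by chance from the two raters' marginal prevalences.
    p_expected = a_pos * b_pos + (1 - a_pos) * (1 - b_pos)
    return (p_observed - p_expected) / (1 - p_expected)


# Illustrative (assumed) cell counts: rater A positive in 5, rater B in 12 of 28.
kappa = cohens_kappa(both_pos=4, a_only=1, b_only=8, both_neg=15)
print(round(kappa, 2))
```

With these assumed counts the observed agreement is 19/28, while the agreement expected by chance is about 0.55, which gives a κ of about 0.29. The function is undefined when the expected agreement equals 1, for example when both raters rate every shoulder negative.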
The reliability of SUPRA was moderate to good, SASD was inconsistent, ranging from poor to good and AHD was moderate. The reliability of DI was fair in symptomatic and moderate in asymptomatic shoulders. No substantial differences were found in the agreement and reliability of the static measures between position 1 and position 2, or between symptomatic and asymptomatic shoulders, with the exception of the bursa measures in position 2.
The absence of differences between modified Crass and Crass position implies that using one position is sufficient. Position 2 may be painful since the supraspinatus tendon is stretched, and the less provocative position 1 is therefore preferred.
Supraspinatus tendon thickness
The reliability of SUPRA was good, and no significant differences were identified in thickness, agreement or reliability between positions or sides. Two studies measuring the interrater reliability of the longitudinal SUPRA tendon view had ICC values above 0.87 [5, 12]. The higher values in these studies may be explained by the use of experienced rather than novice sonographers, since US is an examiner-dependent modality.
No clinically relevant or significant differences in reliability or thickness between the symptomatic and the asymptomatic side were found for the measures of SUPRA (0.4 mm, p > 0.06). This runs counter to previous studies reporting changes because of tendinopathy [13, 14]. The discrepancy may be due to methodological differences since the previous studies measured the SUPRA in transverse view based on an average of three measures. Taking the contralateral shoulder as a reference, the findings of the present study did not support the theory that SAPS results in a measurable difference in SUPRA.
Subacromial-subdeltoid bursa thickness
The SASD was the least reliable structure investigated in the protocol, resulting in two opposing extremes: poor reliability in symptomatic shoulders (ICC = 0.41) and borderline excellent reliability in asymptomatic shoulders (ICC = 0.88). The poor reliability of SASD2 in symptomatic shoulders may be caused by the stretching of soft tissue (bursa and tendon) provoking pain, whereas SASD1 measured in a less pain-provoking position displayed moderate reliability.
The thin nature of the SASD also makes it prone to measurement error and, consequently, to poor reliability.
Knowledge of the clinical relevance of measuring SASD is sparse, but it has been suggested that no healthy bursa is thicker than 2 mm. Our results revealed no difference in mean thickness or in the prevalence of bursa thickness exceeding 2 mm between symptomatic (20%) and asymptomatic shoulders (24%). As many as 86% of the present study population had experienced symptoms for > 6 months and were therefore considered to have a manifest chronic condition, which may explain the absence of bursal and/or tendon thickening in symptomatic shoulders.
The AHD is a frequently investigated measure with reported interrater reliability ranging from moderate to excellent [5, 16, 17]. Our results of moderate reliability (ICC: 0.68-0.72) were in line with those of another study that included both novice and experienced sonographers and obtained an ICC of 0.70.
Comparison of AHD measures between symptomatic and asymptomatic shoulders did not reveal a difference (p = 0.43), and the measure may therefore have minimal clinical relevance. A very small AHD in symptomatic shoulders (< 6-7 mm) may, however, indicate superior migration of the humeral head as a result of a full-thickness rotator cuff tear.
The reliability of DI was fair (κ = 0.29) in symptomatic and moderate (κ = 0.46) in asymptomatic shoulders. DI CIs ranged from poor to substantial in symptomatic shoulders and covered a wide spectrum in asymptomatic shoulders. The latter was likely caused by the low prevalence of DI in asymptomatic shoulders (5-15%). Thus, a larger sample size is needed to assess this reliability. In symptomatic shoulders, rater B reported a higher prevalence of DI than rater A (43% versus 18%). In line with the observations of rater B, other studies have reported a prevalence of 36%, 41% and 57% in symptomatic shoulders, suggesting that rater A may have underrated DI. In asymptomatic shoulders, the raters agreed on a low prevalence, resulting in a high negative specific agreement (0.94) and a high prevalence-adjusted and bias-adjusted κ (PABAK) (0.80).
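The prevalence-adjusted statistics mentioned above can be sketched in Python as follows. The function names are hypothetical, and the cell counts are illustrative assumptions chosen to match the reported DI prevalences in the 20 asymptomatic shoulders (5% and 15%); they are not taken from Table 2.

```python
def pabak(both_pos, a_only, b_only, both_neg):
    """Prevalence-adjusted and bias-adjusted kappa for a binary finding.

    For two categories this reduces to 2 * observed agreement - 1.
    """
    n = both_pos + a_only + b_only + both_neg
    p_observed = (both_pos + both_neg) / n
    return 2 * p_observed - 1


def specific_agreement(both_pos, a_only, b_only, both_neg):
    """Return (positive, negative) specific agreement."""
    disagreements = a_only + b_only
    positive = 2 * both_pos / (2 * both_pos + disagreements)
    negative = 2 * both_neg / (2 * both_neg + disagreements)
    return positive, negative


# Illustrative (assumed) cell counts: rater A positive in 1, rater B in 3 of 20.
counts = dict(both_pos=1, a_only=0, b_only=2, both_neg=17)
pos_agree, neg_agree = specific_agreement(**counts)
print(round(pabak(**counts), 2), round(pos_agree, 2), round(neg_agree, 2))
```

With a rare finding, almost all agreement stems from jointly negative ratings. Negative specific agreement and PABAK can therefore be high even though κ, which corrects for the chance agreement implied by the low prevalence, remains only moderate.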
In the present study, we only obtained a “fair” reliability in symptomatic shoulders, which is likely explained by experienced raters being superior to novices, with κ values of 0.60 versus 0.44. The diagnosis of SAPS should not be based on the presence of DI on US, especially when US is performed by novices.
Strengths and limitations
A major strength of this study is its transferability to the clinical setting including image capture, positioning, adjustment of depth, gain, focus and immediate measurement. Furthermore, patients were recruited from a public outpatient clinic and an ICC model suited for generalisability was chosen. Another major strength of the present study is the use of clear and standardised inclusion criteria to identify patients with SAPS.
The choice of ICC model is also a strength. It was motivated by increased generalisability to the clinical setting, but does, however, result in a more conservative estimate with a shift in ICC from 0.58-0.93 to 0.41-0.88, i.e. from good to poor and from excellent to moderate.
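Why an absolute-agreement model yields more conservative estimates can be illustrated with a small Python sketch. The text does not state which alternative model produced the higher ICC range, so the comparison below between ICC (2,1) and a consistency-type ICC (3,1) is an assumption made purely for illustration, as are the function names and the AHD values.

```python
def mean_squares(ratings):
    """Two-way ANOVA mean squares for a subjects-by-raters table."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)               # between subjects
    msc = ss_cols / (k - 1)               # between raters
    mse = ss_err / ((n - 1) * (k - 1))    # residual
    return msr, msc, mse, n, k


def icc_absolute(ratings):
    """ICC(2,1): rater offsets count as disagreement."""
    msr, msc, mse, n, k = mean_squares(ratings)
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def icc_consistency(ratings):
    """ICC(3,1), consistency: rater offsets are ignored."""
    msr, _, mse, _, k = mean_squares(ratings)
    return (msr - mse) / (msr + (k - 1) * mse)


# Hypothetical AHD values in mm; rater B reads about 0.6 mm higher throughout.
ahd = [[9.8, 10.5], [10.5, 11.0], [11.2, 11.9], [8.9, 9.6], [10.1, 10.6]]
print(round(icc_consistency(ahd), 2), round(icc_absolute(ahd), 2))
```

When one rater measures systematically higher than the other, the consistency form stays close to 1 while the absolute-agreement form drops, because only the latter penalises the between-rater mean square.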
Agreement and reliability analyses were not conducted before proceeding to the study phase to ensure satisfactory levels. Furthermore, in future studies we suggest including experienced raters to give an idea of the impact that US experience has on both intra- and interrater reliability measures.
Because of a premature discontinuation due to COVID-19, the study was underpowered, with the inclusion of 28 symptomatic and 20 asymptomatic shoulders rather than the desired 34.
Measures were performed during the US examinations, allowing the raters to recall their own previous measures, with the risk of inducing bias.
Blinding of the symptomatic side proved difficult as patients often had difficulties performing the internal rotation in position 2 without exhibiting pain, which might have biased the raters.
The interrater reliability of novice sonographers was found to be moderate to good when assessing SUPRA and AHD. The assessment of SASD and DI showed reliability ranging from poor to good. No significant differences in SUPRA and SASD thickness were found between symptomatic and asymptomatic shoulders, which questions the clinical value of measuring these structures in similar populations with a long symptom duration.
Correspondence Birgitte Hougs Kjær. E-mail: Birgitte.Hougs.Kjaer@regionh.dk
Accepted 16 August 2023
Conflicts of interest none. Disclosure forms provided by the authors are available with the article at ugeskriftet.dk/dmj
Acknowledgements We thank the Department of Clinical Research, Hvidovre Hospital, for statistical support and the staff at the Department of Orthopedic Surgery, Hvidovre Hospital, for assistance with the inclusion of patients.
Cite this as Dan Med J 2023;70(11):A05230285
- Luime JJ, Koes BW, Hendriksen IJM et al. Prevalence and incidence of shoulder pain in the general population; a systematic review. Scand J Rheumatol. 2004;33(2):73-81.
- Witten A, Mikkelsen K, Mayntzhusen TW et al. Terminology and diagnostic criteria used in studies investigating patients with subacromial pain syndrome from 1972 to 2019: a scoping review. Br J Sports Med. 2023;57(13):864-71.
- Hegedus EJ, Cook C, Lewis J et al. Combining orthopedic special tests to improve diagnosis of shoulder pathology. Phys Ther Sport. 2015;16(2):87-92.
- Roy JS, Braën C, Leblond J et al. Diagnostic accuracy of ultrasonography, MRI and MR arthrography in the characterisation of rotator cuff disorders: a systematic review and meta-analysis. Br J Sports Med. 2015;49(20):1316-28.
- Kjær BH, Ellegaard K, Wieland I et al. Intra-rater and inter-rater reliability of the standardized ultrasound protocol for assessing subacromial structures. Physiother Theory Pract. 2017;33(5):398-409.
- Kottner J, Audige L, Brorson S et al. Guidelines for Reporting Reliability and Agreement Studies (GRRAS) were proposed. Int J Nurs Stud. 2011;48(6):661-71.
- Patijn J, Remvig L. Protocol formats for diagnostic procedures in manual/musculoskeletal medicine. 2007. F. A. O. M. M. MEDICINE.
- Witten A, Barfod KW, Thorborg K et al. Subacromial impingement syndrome. Ugeskr Læger. 2019;181:V03180215.
- Christiansen DH, Andersen JH, Haahr JP. Cross-cultural adaption and measurement properties of the Danish version of the Shoulder Pain and Disability Index. Clin Rehab. 2013;27(4):355-60.
- Bujang MA, Baharum N. A simplified guide to determination of sample size requirements for estimating the value of intraclass correlation coefficient: a review. Arch Orofac Sci. 2017;12(1):1-11.
- Portney LG, Watkins MP. Foundations of clinical research - applications to practice. New Jersey, US: Pearson Education Inc., 2009.
- Ingwersen KG, Hjarbaek J, Eshoej H et al. Ultrasound assessment for grading structural tendon changes in supraspinatus tendinopathy: an inter-rater reliability study. BMJ Open. 2016;6(5):e011746.
- Michener LA, Yesilyaprak SSS, Seitz AL et al. Supraspinatus tendon and subacromial space parameters measured on ultrasonographic imaging in subacromial impingement syndrome. Knee Surg Sports Traumatol Arthrosc. 2015;23(2):363-9.
- Cholewinski JJ, Kusz DJ, Wojciechowski P et al. Ultrasound measurement of rotator cuff thickness and acromio-humeral distance in the diagnosis of subacromial impingement syndrome of the shoulder. Knee Surg Sports Traumatol Arthrosc. 2008;16(4):408-14.
- Schmidt WA, Schmidt H, Schicke B, Gromnica-Ihle E. Standard reference values for musculoskeletal ultrasonography. Ann Rheum Dis. 2004;63(8):988-94.
- Pijls BG, Kok FP, Penning LIF et al. Reliability study of the sonographic measurement of the acromiohumeral distance in symptomatic patients. J Clin Ultrasound. 2010;38(3):128-34.
- McCreesh KM, Anjum S, Crotty JM, Lewis JS. Ultrasound measures of supraspinatus tendon thickness and acromiohumeral distance in rotator cuff tendinopathy are reliable. J Clin Ultrasound. 2016;44(3):159-66.
- Xu M, Li Z, Zhou Y et al. Correlation between acromiohumeral distance and the severity of supraspinatus tendon tear by ultrasound imaging in a Chinese population. BMC Musculoskelet Disord. 2020;21(1):106.
- Bureau NJ, Beauchamp M, Cardinal E, Brassard P. Dynamic sonography evaluation of shoulder impingement syndrome. AJR Am J Roentgenol. 2006;187(1):216-20.
- O'Connor PJ, Rankine J, Gibbon WW et al. Interobserver variation in sonography of the painful shoulder. J Clin Ultrasound. 2005;33(2):53-6.