Relation between breast cancer mortality and screening effectiveness: systematic review of the mammography trials

The randomised mammography screening trials have shown varying results. After 13 years of follow-up, the results range from a 42% decrease to a 2% increase in breast cancer mortality [1]. Debates about how these differences are best explained have mainly focused on trial quality, as some trials appear to be more reliable than others [1-3]. The most straightforward explanation, differences in screening effectiveness, has received little attention. Screening effectiveness can be defined as the ability to advance the time of diagnosis, which leads to the identification of more cancers than in an unscreened control group [3]. A screening programme that finds many cancers, e.g. owing to high sensitivity, should therefore lead to a larger reduction in breast cancer mortality relative to a control group than a programme that identifies fewer cancers.

One would also expect trials that were more effective in identifying cancers before they had metastasised to yield larger effects [3]. An indication that this may be the case was provided in a Letter to the Editor in The Lancet [4]. The authors found an association between the risk ratio for detecting node-positive cancers and the risk ratio for breast cancer mortality [4], but they included only women in the age-group 40-49 years and did not describe their methods.

The objective of this systematic review of the randomised mammography screening trials was to examine whether there is a relation between screening effectiveness and breast cancer mortality.

MATERIAL AND METHODS

The primary analysis was a linear regression (meta-regression) analysis weighted by the inverse variance of the log risk ratio (RR) for breast cancer mortality in each trial. The explanatory variable was screening effectiveness, defined as the log RR of being diagnosed with cancer (including carcinoma in situ) within the first seven years, and the outcome was the log RR of breast cancer mortality after seven and 13 years, respectively.
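The weighted regression described above can be sketched in a few lines of Python. This is a minimal illustration, not the Comprehensive Meta Analysis implementation: it assumes each trial contributes a log RR of cancer detection `x`, a log RR of breast cancer mortality `y` and the within-trial variance `v` of `y`, and it takes the between-trial variance `tau2` as given rather than estimating it by unrestricted maximum likelihood. The function name `weighted_meta_regression` and the example numbers are hypothetical.

```python
import math


def weighted_meta_regression(x, y, v, tau2=0.0):
    """Inverse-variance weighted regression of y on x.

    x    -- log RR of being diagnosed with cancer, one value per trial
    y    -- log RR of breast cancer mortality, one value per trial
    v    -- within-trial variance of each y
    tau2 -- between-trial variance (assumed known here); 0.0 gives a
            fixed-effect fit, a positive value gives random-effects weights
    Returns intercept, slope, their standard errors and a two-sided
    p-value for the slope (normal approximation).
    """
    w = [1.0 / (vi + tau2) for vi in v]
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    # Standard errors treat the weights as known inverse variances
    se_slope = math.sqrt(1.0 / sxx)
    se_intercept = math.sqrt(1.0 / sw + xbar ** 2 / sxx)
    p_slope = math.erfc(abs(slope / se_slope) / math.sqrt(2.0))
    return intercept, slope, se_intercept, se_slope, p_slope


# Hypothetical values for three trials, for illustration only;
# these points lie exactly on y = -0.10 - 0.50 * x
x = [0.0, 0.1, 0.2]
y = [-0.10, -0.15, -0.20]
v = [0.01, 0.02, 0.01]
b0, b1, se0, se1, p = weighted_meta_regression(x, y, v)
```

A negative slope here means that trials detecting relatively more cancers in the screened group would show a lower mortality RR, which is the direction the review expects.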

In additional regression analyses, the RRs of stage II+ cancers (those that are either node-positive or at least 2 cm in size) and of node-positive cancers were used as explanatory variables.
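The stage II+ definition used here can be written as a small classification rule. This is a sketch only: how cancers with missing nodal status or size were handled is not described at this point, so returning `None` for undecidable cases is an assumption, as is the hypothetical function name `is_stage_ii_plus`.

```python
from typing import Optional


def is_stage_ii_plus(node_positive: Optional[bool],
                     size_cm: Optional[float]) -> Optional[bool]:
    """Stage II+ as defined in this review: node-positive or >= 2 cm.

    node_positive -- True/False, or None if nodal status is unknown
    size_cm       -- tumour diameter in cm, or None if unknown
    Returns None when the category cannot be decided (an assumption of
    this sketch, not a rule stated in the review).
    """
    if node_positive is True:
        return True
    if size_cm is not None and size_cm >= 2.0:
        return True
    if node_positive is False and size_cm is not None:
        return False  # node-negative and known to be under 2 cm
    return None
```

Either criterion alone is enough to classify a cancer as stage II+, but ruling it out requires both node-negative status and a known size under 2 cm.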

Comprehensive Meta Analysis version 2.2.030, July 2006, was used (random effects model, unrestricted maximum likelihood).

Searches

The literature search was extensive. I searched PubMed with (breast neoplasms [MeSH] OR "breast cancer" OR mammography [MeSH] OR mammograph*) AND (mass screening [MeSH] OR screen*) and combined this search with a search on author names [1]. The latest search was performed in November 2008, and 24,479 records were imported into ProCite and searched for author names, cities and trial eponyms. Reference lists were scanned and letters, abstracts, grey literature and unpublished data were included.

A total of nine trials were found. They were performed in New York, Canada (two trials), the UK (two trials) and Sweden (four trials: Two-County, sometimes reported separately for the two counties, Kopparberg and Östergötland; Malmö; Stockholm; and Göteborg, divided into two sub-trials by age). Most trials covered the age range 45-64 years [1], but the UK Age Trial included only women aged 39-41 years [5].

Data

Trial data on relative risks for breast cancer mortality after seven and 13 years were taken from our 2009 Cochrane review [1]. In addition, I extracted data from the many papers included in this review on the total number of cancers (including carcinoma in situ) and the number of advanced cancers (the number in stages II-IV and the number that were node-positive).

Data on breast cancers from most of the trials vary from publication to publication, mostly because of changing cut-points for registration, different age groups and varying numbers of women in the analyses [1]. All retrieved data were entered into an Excel spreadsheet, and extensive validity checks were performed, e.g. calculating relative risks for finding cancers overall and cancers in specific stages, and comparing the results. Data used in the statistical analyses were checked again against the trial reports. In some cases, the total number of cancers and the number of women (the denominators for the calculations) differed slightly from those in the Cochrane review [1], because the present review used data broken down by stage and nodal status. These differences were immaterial, however, as the RRs for cancer detection were either identical or very similar to those of the review, the largest difference being 0.05 (1.44 rather than 1.49 in the Stockholm trial).
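The relative risks used in the validity checks, and the variance needed for inverse-variance weighting, follow the standard formulas for two groups of women. The sketch below uses hypothetical counts; the function names are illustrative, not taken from the review.

```python
import math


def risk_ratio(cases_screened, n_screened, cases_control, n_control):
    """RR of an event (cancer diagnosis or breast cancer death),
    screened group versus control group."""
    return (cases_screened / n_screened) / (cases_control / n_control)


def log_rr_variance(cases_screened, n_screened, cases_control, n_control):
    """Usual large-sample variance of log RR for two binomial groups."""
    return (1 / cases_screened - 1 / n_screened
            + 1 / cases_control - 1 / n_control)


# Hypothetical trial: 120 cancers among 20,000 screened women,
# 80 cancers among 20,000 control women
rr = risk_ratio(120, 20000, 80, 20000)
log_rr = math.log(rr)
var = log_rr_variance(120, 20000, 80, 20000)
```

Recomputing the RR from each publication's counts and denominators makes it easy to see whether differences between reports matter for the analysis.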

Data were available from all trials on breast cancer mortality and on the total number of cancers: Canada [6-10], Malmö [11, 12], Kopparberg [13-15], Östergötland [13, 14], Stockholm [12, 16, 17], Göteborg [18-21], New York [22, 23], Edinburgh [24, 25] and the UK Age Trial [5, 26]. Other papers provided additional information on the types of cancer [27-33].

Specific issues in the individual trials

In New York, about the same number of cancers was detected in the screened group and the control group, so it is surprising that a large effect was reported [1]. However, the cause-of-death assessment seems to have been biased, and some cancers in the control group, and their associated deaths, should have been excluded, as these patients were diagnosed with breast cancer before randomisation [1]. The Edinburgh trial was cluster-randomised, but the randomisation worked so poorly that 26% of the women in the control group and 53% in the study group belonged to the highest socioeconomic level. As a result, mammography screening was associated with a 26% reduction in cardiovascular mortality among invited women [1], a result that cannot have been caused by screening. Sensitivity analyses excluding these two trials were therefore performed.

Apart from the Malmö trial, the Swedish trials screened the whole control group 3-5 years after randomisation [1]. To avoid this serious contamination, only the cancers found before the control group screen were counted in the main analyses. In additional analyses, the contamination was disregarded and the cancers found at the control group screen were included.

In Göteborg, the number of cancers detected before the control group was screened was available only for the youngest age group, 39-49 years [20], whereas the number of deaths after seven years was available only for the slightly narrower age group 40-49 years [18]. Varying denominators have also been reported for the other Swedish trials, so the denominators corresponding to the number of deaths may differ slightly from those corresponding to the number of cancers.

In Östergötland, data on the number of cancers including the control group screen were not available after about seven years, only after a longer follow-up period. By then, the number of cancers in the study group had increased by only 16% [14]. These data were used in the analyses because only the RRs for cancers were needed, and in the trials these ratios changed very little when the total number of cancers changed as little as it did in Östergötland.

Tumour data from the control group in Stockholm had been multiplied by a factor that corresponded to the smaller size of this group compared with the screened group [17]. For the present analyses, the data were re-corrected by dividing by this factor.
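The re-correction simply reverses a scaling. The sketch below assumes the factor was the ratio of screened to control group size, which is how the sentence above reads but is not stated numerically here; the group sizes and the count are hypothetical.

```python
def undo_size_scaling(reported_control_count, n_screened, n_control):
    """Recover the observed control-group count from one that had been
    scaled up by n_screened / n_control (assumed form of the factor)."""
    factor = n_screened / n_control
    return reported_control_count / factor


# Hypothetical example: control group half the size of the screened group
observed = undo_size_scaling(reported_control_count=150.0,
                             n_screened=40000, n_control=20000)
```

After this step, RRs must be computed with the actual control-group denominator rather than the screened-group one, otherwise the scaling would be applied twice.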

Data on breast cancer mortality and on the number of cancers are shown in Table 1 and Table 2, respectively.

RESULTS

Screening effectiveness measured as total number of cancers

Screening effectiveness, defined as the RR for the total number of detected cancers, was not related to the reduction in breast cancer mortality (p = 0.19 after seven years, Figure 1a; p = 0.73 after 13 years, Figure 1b). Figure 1a shows a cluster of widely varying mortality estimates at approximately the same screening effectiveness. Furthermore, the New York trial is an outlier that unduly influences the analysis, shifting the regression line upwards, although a downward trend is expected, as detecting more cancers in the screened group should decrease breast cancer mortality. Regression analysis after exclusion of the New York and Edinburgh trials (see Material and methods) is more appropriate but did not change the findings (p = 0.43 after seven years and p = 0.61 after 13 years).

When the cancers detected at the control group screen were included, there was a significant relationship between screening effectiveness and the reduction in breast cancer mortality, with good fits to the regression lines (Figures 1c and 1d). However, this relationship was the opposite of what was expected: the more similar the number of cancers in the screened and control groups, the larger the effect (p = 0.02 both after seven and after 13 years). This relationship remained after exclusion of the New York and Edinburgh trials (p = 0.02 and p = 0.005, respectively).

Screening effectiveness measured as advanced stage cancers

For cancers in stage II and above, a significant relationship in the expected direction was found: fewer advanced cancers in the screened group than in the control group predicted a larger reduction in breast cancer mortality, both after seven years (p = 0.04) and after 13 years (p = 0.006) (Figures 2a and 2b). This relationship remained after exclusion of the New York and Edinburgh trials (p = 0.04 and p = 0.006, respectively).

For node-positive cancers, the expected trends were also significant (p = 0.008 after seven years and p = 0.04 after 13 years; Figures 2c and 2d). This relationship also persisted after exclusion of the New York and Edinburgh trials (p = 0.03 and p = 0.02, respectively).

Evidence of bias

The four regression lines for advanced cancers predicted a relative risk of breast cancer mortality ranging from 0.84 to 0.91 for zero screening effectiveness (the same proportion of advanced cancers in the screened group as in the control group, i.e. RR = 1 for the number of cancers, and log RR = 0). In the most powerful analysis, node-positive cancers after 13 years (Figure 2d), a screening effectiveness of zero predicted a relative risk of 0.84 for breast cancer mortality. This 16% reduction in breast cancer mortality was highly significant (p < 0.001; 95% confidence interval, 9% to 23% reduction; see appendix for details). This can only occur if there is bias, as screening with zero effectiveness cannot produce an effect.
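Because the regression is on the log scale, the predicted relative risk at zero screening effectiveness is the exponentiated intercept, and a 95% confidence interval follows from exponentiating the intercept plus or minus 1.96 standard errors. The sketch below shows this back-transformation; the function name is hypothetical, and the standard error of 0.04 in the example is made up for illustration, not the value behind the reported interval.

```python
import math


def rr_at_zero_effectiveness(intercept, se_intercept, z=1.959964):
    """Predicted mortality RR when the log RR of advanced cancers is 0,
    with a 95% confidence interval and a two-sided p-value."""
    rr = math.exp(intercept)
    lower = math.exp(intercept - z * se_intercept)
    upper = math.exp(intercept + z * se_intercept)
    p = math.erfc(abs(intercept / se_intercept) / math.sqrt(2.0))
    return rr, lower, upper, p


# Intercept corresponding to an RR of 0.84; the SE of 0.04 is hypothetical
rr, lower, upper, p = rr_at_zero_effectiveness(math.log(0.84), 0.04)
reduction_percent = (1 - rr) * 100   # a 16% reduction
```

A reduction computed this way is a property of the fitted line at log RR = 0, which is why any nonzero value there points to bias rather than to an effect of screening.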

DISCUSSION

Screening advances the time of diagnosis, and the total number of cancers detected in a screened group relative to the number detected in a control group is therefore an unbiased measure of screening effectiveness [3]. Some of the screening-detected cancers were not destined to cause symptoms or death in the women's remaining lifetime [34], but the extent of this overdiagnosis is rather closely related to the ability to advance the time of diagnosis: the longer the lead time, the more women will die from other causes before their cancers become symptomatic.

The better screening is at advancing the time of diagnosis, the more cancers will be found in a screened group compared with a control group. Furthermore, fewer of these cancers will be advanced, which is the objective of screening. Thus, an effective screening programme would be expected to yield a relatively large RR for the total number of cancers detected and a relatively low RR for the number of advanced cancers. It was therefore surprising that there was no relation between breast cancer mortality and screening effectiveness calculated on the basis of the total number of cancers, given that – in the same trials – breast cancer mortality was clearly more reduced in those trials that had fewer advanced cancers in the screened group.

This discrepancy, and the fact that zero screening effectiveness was associated with a 16% reduction in breast cancer mortality, suggest that the number of advanced cancers, the number of breast cancer deaths, or both are biased in favour of screening. For simplicity, each potential bias is explored separately below, assuming that the other bias does not exist and using data on node-positive cancers and mortality after 13 years.

Bias in number of node-positive cancers

Many values were missing: the number of node-positive cases was only twice the number of cases with unknown nodal status (Table 2). Node-negative cancers are not a relevant measure, as many of them are overdiagnosed. Metastatic disease is considered the best proxy for breast cancer mortality, but more women with positive nodes went unidentified in the control group than in the study group, because control group women were more likely to be treated in centres where careful nodal dissection was not the norm. This problem has been acknowledged for the Two-County and Canadian trials [35, 36] and is supported by the finding that, in the Canadian trial of women aged 40-49 years, 47% of those who died of breast cancer in the control group had node-negative cancer, compared with only 28% in the mammography group [1, 37].

An estimate of the size of this bias can be obtained from the Canadian trial [6, 37]. Assume that the RR for breast cancer mortality applies to both node-negative and node-positive cancers, which seems reasonable, as so many women with node-negative cancer died. The control group should then have had 1.2 times as many node-positive cancers as were actually reported (see appendix for details). If the number of node-positive cancers in the control group of each trial in Table 2 is multiplied by 1.2, the regression analysis predicts a 9% reduction in breast cancer mortality for zero screening effectiveness. Thus, if the Canadian findings can be generalised, about half of the observed 16% bias can be explained by underreporting of node-positive cancers in the control groups.
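Multiplying every control group's node-positive count by the same factor k lowers each trial's log RR of node-positive cancers by ln k. With the regression weights held fixed (they depend only on the mortality variances), the fitted slope b1 is unchanged and the intercept moves from b0 to b0 + b1·ln k, so the predicted RR at zero effectiveness is multiplied by k^b1. The sketch below checks this numerically on hypothetical data, not the data in Table 2; `fit_wls` and `adjust_for_underreporting` are illustrative helpers, not the software used in the study.

```python
import math


def fit_wls(x, y, w):
    """Weighted least-squares intercept and slope for y = b0 + b1 * x."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def adjust_for_underreporting(log_rr_node_pos, k=1.2):
    """Scale every control-group node-positive count by k, which
    subtracts ln(k) from each trial's log RR."""
    return [x - math.log(k) for x in log_rr_node_pos]


# Hypothetical trials, for illustration only
x = [-0.40, -0.20, -0.10, 0.05]   # log RR, node-positive cancers
y = [-0.35, -0.25, -0.20, -0.12]  # log RR, breast cancer mortality
w = [30.0, 45.0, 60.0, 25.0]      # 1 / variance of each y

b0, b1 = fit_wls(x, y, w)
b0_adj, b1_adj = fit_wls(adjust_for_underreporting(x), y, w)
# The slope is unchanged and the intercept shifts by b1 * ln(1.2)
```

With a positive slope, as in Figure 2, the adjustment pulls the predicted RR at zero effectiveness towards 1, which is the direction of the change from 0.84 to 0.91 reported above.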

Bias in assessment of cause of death

Assessment of the cause of death is inevitably biased in favour of screening, even when data from official cause-of-death registers are used [1]. One reason for this is that women who are screened are more likely to receive radiation treatment than controls, leading to an increased mortality from other causes and also to a reduction in local breast cancer recurrence. This makes it more likely that screened women with breast cancer will be assigned another cause of death [1].

The 16% bias in the regression analysis would disappear if the number of breast cancer deaths in the screened groups in Table 1 were multiplied by 1.2; the regression would then predict zero effect for zero screening effectiveness. A factor of 1.2 implies that 20% more breast cancer deaths occurred in the screened groups than were recorded. This may seem unrealistic, but the Östergötland trial shows that it can occur. The Östergötland investigators, who were not blinded when they assessed the cause of death, reported a 24% reduction in breast cancer mortality, whereas the official cause-of-death register showed only a 10% reduction [1]. The difference between 24% and 10% corresponds to an additional 19% of breast cancer deaths in the screened group.
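Both factors in the paragraph above come from the same arithmetic: multiplying the number of breast cancer deaths in the screened group by a factor multiplies the mortality RR by that factor. The sketch reproduces the calculation; `extra_deaths_needed` is a hypothetical helper name.

```python
def extra_deaths_needed(rr_reported, rr_reference):
    """Factor by which screened-group breast cancer deaths would have to
    rise to move the reported RR to the reference RR."""
    return rr_reference / rr_reported


# Östergötland: 24% reduction (investigators) vs 10% (official register)
ostergotland_factor = extra_deaths_needed(1 - 0.24, 1 - 0.10)

# Regression bias: an RR of 0.84 at zero effectiveness should be 1.0
regression_factor = extra_deaths_needed(0.84, 1.0)
```

With the rounded published percentages, the Östergötland factor is 0.90/0.76, about 1.18; the 19% in the text lies within the rounding of those percentages. The regression factor, 1/0.84, is about 1.19, i.e. the 1.2 used above.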

Data that would allow this bias to be estimated are lacking for the other trials, apart from the New York Health Insurance Plan (HIP) trial. In that trial, differential misclassification may be responsible for about half of the reported reduction in breast cancer mortality, because a similar number of dubious cases were selected for blinded review from each group, whereas a much smaller proportion of the screened group was finally classified as having died from breast cancer [38].

Limitations

The assumption of linearity appears reasonable. Although the trials spanned almost 30 years, the data points for advanced cancers (Figure 2) were evenly distributed around the regression lines. Furthermore, the choice of statistical model was immaterial. A fixed effect model is usually not recommended for meta-regression, but for the most powerful analysis it gave the same result as the random effects model. I did not incorporate the variance in the number of cancers in the analyses, but that would not have made any material difference either. The greatest uncertainty stems from the mortality estimates, because of the relatively small number of events.

The sensitivity and specificity of mammographic readings in the trials do not seem to have changed since the New York trial [1]. It is therefore difficult to understand why the trials from Kopparberg, Östergötland, Stockholm and Göteborg, which screened the whole control group 3-5 years after randomisation and therefore had small intervention contrasts, reported the largest reductions in breast cancer mortality after 13 years [1]. I included all trials, even the two flawed trials from New York and Edinburgh, to avoid accusations of selective reporting and to facilitate comparison with the smaller study of node-positive cancers in women aged 40-49 years [4], but it made no difference to the results whether or not these trials were included.

It could be argued that it is an oversimplification to suggest that a screening effectiveness of zero should lead to zero effect on breast cancer mortality. Screening brings forward the diagnosis of both localised and advanced cancers, so in theory a screened group could have more advanced cases than a control group and still have lower breast cancer mortality. However, such a possibility is of minor relevance compared with the biases identified in the present study and, more importantly, it cannot explain them, because most trials had fewer advanced cancers in the screened groups than in the control groups (see Figures 2a-d).

Implications for observational studies

The biases I identified were substantial. Furthermore, surgical and pathological expertise is likely to vary considerably between regions and over time. This suggests that comparative observational studies across regions, countries or time periods may be unreliable if cancer stages are used as measures of screening effectiveness or as surrogate markers for predicting an effect on breast cancer mortality.

What is the effect of screening?

Comprehensive systematic reviews have suggested that mammography screening reduces breast cancer mortality by 15-16% [1, 2]. This estimate is of the same size as the bias in the regression analysis of node-positive cancers.

Considering also the substantial bias related to determination of cause of death, the many flaws in the design and execution of the trials [1, 2] and the lack of an effect on all-cancer mortality, it seems reasonable to question whether screening has any life-extending effect [1, 2]. The present study and recent observational studies [39] support this concern.

CONCLUSION

The differences in the reported reductions in breast cancer mortality in the screening trials cannot be explained by differences in screening effectiveness. It is not clear what the effect of screening is, as the size of the bias was similar to the estimated effect.

Correspondence: Peter C Gøtzsche, The Nordic Cochrane Centre, Rigshospitalet, Department 3343, 2100 Copenhagen Ø, Denmark. E-mail: pcg@cochrane.dk

Accepted: 6 January 2011

Conflicts of interest: none

Funding: not relevant

Acknowledgement: I thank statistician Per-Henrik Zahl for his comments on the manuscript.