When we collide

June 3rd, 2019

A CMA Original, written by Cameron Sharpe, Progressive Content

Analysis based on data collected from content marketing campaigns can be enticing, but be careful with what you conclude.

Runners in their forties run quicker marathons than their twenty-something counterparts.

So suggests a set of data released by fitness tracking website Strava – a network that can lay claim to a gigantic data sample of engaged users enjoying and tracking various exercise types every single day.

On this basis alone, trends drawn from such a sample of users ought to be treated seriously. Yet the conclusion that both men and women over the age of 40 are recording quicker average marathon times than rivals decades younger is startling – particularly as we know that it can’t possibly be true.

The fittest twenty-something athletes are not covering the traditional marathon distance slower than their older running comrades – despite legitimate scientific evidence displaying a correlation between race times and running experience.

By way of tangible example, at the 2018 London Marathon, the fastest runner over 40 covered the course nearly 20 minutes slower than 22-year-old Shura Kitata Tola – an absolute chasm in running terms.

So, why does the massive data pull draw such a remarkable conclusion?

Although the analysis is slightly tongue-in-cheek, the conclusion is – perhaps by pure accident – a classic example of a specific problem with conclusions based on data polluted by unintended or unexplored variables, otherwise known as collider bias.

In this case, a network and study of this type are inadvertently slanted towards capturing the successful older athlete rather than their younger counterpart.

Simply put, a strong marathon runner in middle age is significantly more likely to have regularly logged their running on an app like Strava than the fittest young athletes. Many of the fittest athletes in the younger demographic are far less likely to have a presence on the site, let alone be so minded as to load their efforts onto the network religiously – perhaps instead posting a screenshot of their workout on Instagram or Facebook.

This difference in consumption habits and behaviours across generations is the unintended variable that skews the results and ultimate conclusions drawn (the same data suggested the older runners were also running more, which gives an additional indication of possible bias). In this case, it’s not so difficult to pick out the problems with this analysis, but in other areas it can be a lot trickier to spot your colliders.
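
To see how this kind of skew can flip a comparison, here is a minimal Python sketch. Every number in it is an invented assumption for illustration, not Strava data: the average finish times, the spread, and the chances of a runner logging a race are all made up. The point is only that when logging depends on both age and speed, the logged sample can tell the opposite story to the full field.

```python
import random
from statistics import mean


def simulate_runners(n=20000, seed=0):
    """Return a list of (age_group, finish_minutes, logged) tuples.

    All parameters are hypothetical and chosen only to illustrate the bias.
    """
    rng = random.Random(seed)
    runners = []
    for _ in range(n):
        group = rng.choice(["20s", "40s"])
        # Assumption: in the full field, younger runners are faster on average.
        true_mean = 240 if group == "20s" else 255
        minutes = rng.gauss(true_mean, 30)
        fast = minutes < 240
        # Assumption: strong older runners log almost everything,
        # while the fastest younger runners rarely use the app.
        if group == "40s":
            p_log = 0.9 if fast else 0.2
        else:
            p_log = 0.2 if fast else 0.5
        runners.append((group, minutes, rng.random() < p_log))
    return runners


def average_times(runners, logged_only):
    """Average finish time per age group, optionally for logged runs only."""
    result = {}
    for group in ("20s", "40s"):
        times = [m for g, m, logged in runners
                 if g == group and (logged or not logged_only)]
        result[group] = mean(times)
    return result


runners = simulate_runners()
print("Full field: ", average_times(runners, logged_only=False))
print("Logged only:", average_times(runners, logged_only=True))
```

Under these assumed numbers, the full field shows the twenty-somethings finishing faster, while the logged-only averages put the forty-somethings ahead. Nothing about the runners changed; only who ended up in the sample.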

For content marketers operating online, the blind spot that data protection creates in reporting tools makes this a real and tangible danger. For example, if your goal is to reach an audience amongst a younger age group – perhaps education materials for university undergraduates – you might be delighted and reassured when your Google Analytics data reports that 90% of your user base is under the age of 25.

However, because Google’s demographic information requires browsers to carry either a YouTube or Gmail account to be properly tracked in Analytics, the oldest users – who are far less likely to have either – are effectively being filtered out automatically. In this instance, the Google data capturing methodology is your collider.
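
The arithmetic behind that filter can be sketched in a few lines of Python. This is not how Google Analytics calculates anything; the function name, the true audience split and the trackable rates below are all hypothetical, picked only to show how a 60% share can be reported as 90%.

```python
def reported_share(true_share_young, trackable_young, trackable_old):
    """Share of the *reported* audience that is under 25.

    true_share_young: assumed real fraction of visitors under 25
    trackable_young / trackable_old: assumed fraction of each group that
    the reporting tool can attach demographic data to
    """
    seen_young = true_share_young * trackable_young
    seen_old = (1 - true_share_young) * trackable_old
    return seen_young / (seen_young + seen_old)


# Hypothetical audience: 60% under 25, with 90% of younger visitors
# trackable but only 15% of older visitors.
share = reported_share(0.60, 0.90, 0.15)
print(f"Reported under-25 share: {share:.0%}")
```

A quick check like this, run with your own best guesses for the trackable rates, gives a feel for how far a headline figure might sit from the audience you are actually reaching.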

It can be a confusing and potentially frustrating bias to consider, particularly where your reporting tools are not always set up to highlight these issues.

However, once you’ve added the concept to your feedback loop for campaigns and reporting, the quality and nuance of your insight gain an extra dimension – although you might find yourself running slower marathon times.
