Automating the fight against Referral Spam

Fighting Google Analytics Referrer Spam with ga-spam-control (v0.6.0)
created on 2016-06-30

I am running three small private websites that I care about that use Google Analytics.
And every other day my reports get messed up by some stupid referral spam:

Screenshot of a Google Analytics Dashboard polluted with referral spam

Lots of entries from spam websites such as share-buttons.xyz, free-traffic.xyz, traffic2cash.xyz or с.новым.годом.рф shamelessly begging for traffic.

… I don’t know if this bothers you as much as it does me - but I am sick of it and decided to do something about it.

What can you do against Referrer Spam?

You cannot prevent the spammers from sending false analytics data to your Google Analytics accounts. But there are two things you can do to mitigate the effects of referral spam in your Google Analytics reports:

  1. Use filters to block the spam domains from your reports (proactive)
  2. Use custom segments to exclude spam in your reports (retroactive)

Even though these measures work – they still suck. Because using filters and segments to exclude spam from your analytics data means manual effort for something that shouldn’t be necessary. Especially since Google could do that for us. But for reasons unknown to me they don’t.

So we form a vigilante group to protect ourselves from the spammers 👊.
… and since we are developers we build a machine 👾 which does the fighting for us.

Introducing “Google Analytics Spam Control”

To keep my referrer spam filters up-to-date without having to update them myself I have created a tool which does that for me: ga-spam-control

ga-spam-control logo

ga-spam-control (as in “Google Analytics Spam Control”) uses lists of known referrer spam domain names to create and update regular Google Analytics filters which block these domains from your reports.

… unfortunately filters don’t take effect on existing spam entries. But future reports will no longer include records with domain names which were identified as referrer spam.

Using segments to filter out referrer spam that made it past the filters
Animation: Using segments to filter spam that made it through the filters

Using ga-spam-control

ga-spam-control is a command-line utility for Linux, Mac OS and Windows. Its four most important commands are:

  1. filters status: Display your spam control status
  2. filters update: Update the spam filters of a given analytics account
  3. domains update: Update the list of known referrer spam domains
  4. domains find: Find new referrer spam in your analytics data

ga-spam-control help dialog

For full documentation of all available options, use the help action or have a look at the project documentation at github.com/andreaskoch/ga-spam-control.

Authorization

When you use ga-spam-control for the first time you will be asked to authorize the application to access your Google Analytics data and filter controls:

Authorizing ga-spam-control to access your Google Analytics account
Animation: OAuth authorization during first call to filters status

Your credentials are stored in ~/.ga-spam-control/credentials.json. For future actions, as long as your Google Analytics access token is valid, you will not be asked for authorization again.

Show your spam-control status

To show the current spam-control status of the Google Analytics accounts that you have access to, you can use the filters status action:

ga-spam-control filters status

Using the ga-spam-control filters status command to display the current spam-protection status of your Google Analytics accounts
Animation: ga-spam-control filters status

Because there are so many known referrer spam domains (currently more than 900), ga-spam-control cannot simply create a single filter. The pattern of a single Google Analytics filter is limited to 255 characters, so ga-spam-control distributes all spam domains across multiple filters:

Screenshot of the spam-control filter segments created by ga-spam-control
Animation: Google Analytics filters created by ga-spam-control
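
The splitting step can be illustrated with a small Python sketch. It is not the code ga-spam-control uses; the function name and the greedy packing strategy are my own assumptions. It escapes each domain, joins domains with "|" into one regular expression, and starts a new pattern whenever the next domain would push the current one past the 255-character limit:

```python
import re

MAX_PATTERN_LENGTH = 255  # filter pattern limit mentioned in the text


def build_filter_patterns(domains, max_length=MAX_PATTERN_LENGTH):
    """Pack domains into '|'-joined regex patterns of at most max_length characters.

    Hypothetical helper for illustration; not part of ga-spam-control.
    """
    patterns = []
    current = ""
    for domain in sorted(set(domains)):
        escaped = re.escape(domain)  # "." must not match arbitrary characters
        if len(escaped) > max_length:
            raise ValueError(f"domain does not fit into one filter: {domain}")
        candidate = escaped if not current else current + "|" + escaped
        if len(candidate) > max_length:
            # The current pattern is full: close it and start a new one.
            patterns.append(current)
            current = escaped
        else:
            current = candidate
    if current:
        patterns.append(current)
    return patterns
```

Each returned pattern would become one exclude filter. With more than 900 domains this yields several dozen filters, which matches the long filter list shown in the screenshot.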

The percentage behind each account indicates how many of the known referrer spam domains are currently blocked by the existing spam filters of your Google Analytics accounts. When the percentage is 100% it means all known referrer spam domains are blocked. 90% means most spam domain names are being blocked. And 0% generally means that you don’t have any spam filters installed.
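
One natural reading of that percentage is the share of known spam domains that already appear in an account's filters. The following Python sketch spells out that arithmetic; the function name and the handling of an empty list are assumptions, and the tool itself may compute the value differently:

```python
def blocked_percentage(known_domains, blocked_domains):
    """Share of known spam domains covered by existing filters, in percent.

    Hypothetical helper for illustration; not taken from ga-spam-control.
    """
    known = set(known_domains)
    if not known:
        return 0.0  # assumption: report 0% when no spam domains are known
    covered = known & set(blocked_domains)
    return 100.0 * len(covered) / len(known)
```

For example, if two of four known domains are covered by filters, the status would read 50%.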

Update or install spam-control filters

To install (or update) the spam-control filters for a given account you can use the filters update command:

ga-spam-control filters update <accountID>

ga-spam-control filters update creates and/or updates spam filters for a given Google Analytics account
Animation: Installing spam-control filters with ga-spam-control filters update

Update referrer spam domain list

To keep up with spammers you need to regularly update your spam domain list using the domains update command:

ga-spam-control domains update

Updating the list of referrer spam domain names with ga-spam-control domains update
Animation: Updating the local spam domain lists with ga-spam-control domains update

Find new referrer spam domains

Besides the lists of referrer spam domain names that are maintained by the community, you can also maintain your personal list of spam domains. In earlier versions of ga-spam-control this was done by a machine-learning algorithm, but because I could not get that to work reliably I made the spam detection process manual:

ga-spam-control domains find <accountID> <numberOfDaysToReview>

Manually locating referrer spam in your analytics data using ga-spam-control domains find
Animation: Locating new referrer spam in your analytics data using ga-spam-control domains find
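
The idea behind a manual review can be sketched in a few lines of Python. This is an illustration only: the input format (a mapping from referrer domain to hit count) and the function name are assumptions, not the way ga-spam-control reads or presents your analytics data. The sketch removes every referrer that is already on the known list and sorts the rest by hit count, so that a human can decide which of them are spam:

```python
def find_unknown_referrers(referrer_counts, known_spam_domains):
    """Return (domain, hits) pairs that are not on the known spam list.

    Results are sorted by hits (descending), then by domain name.
    Hypothetical helper for illustration; not part of ga-spam-control.
    """
    known = {domain.lower() for domain in known_spam_domains}
    unknown = [
        (domain, hits)
        for domain, hits in referrer_counts.items()
        if domain.lower() not in known
    ]
    return sorted(unknown, key=lambda item: (-item[1], item[0]))
```

The result is only a list of candidates. Legitimate referrers will show up in it as well, which is why the final decision stays manual.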

What is special about ga-spam-control?

There are already other tools which also create and maintain Google Analytics filters for you. But there are a few points in which ga-spam-control is different:

  1. Unlike most other tools, ga-spam-control is a command-line utility that can be scheduled to run automatically on any number of Google Analytics accounts. No manual interaction required.
  2. ga-spam-control is cross-platform and works on Linux, OS X and Windows.
  3. The source code of ga-spam-control is publicly available at github.com/andreaskoch/ga-spam-control, open for review and change requests.
  4. ga-spam-control uses multiple community referrer spam lists as a source for referrer spam domains.
  5. ga-spam-control makes it easy to find new referrer spam in your analytics data and add these domain names to your list of known referrer spam.
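
Scheduled runs, as mentioned in the first point, could look like the following Python wrapper. It is a minimal sketch under a few assumptions: the ga-spam-control binary is on your PATH, you have already authorized the tool once interactively, and the helper names are my own. It only uses the two commands shown earlier, domains update and filters update <accountID>:

```python
import subprocess


def build_commands(account_ids, binary="ga-spam-control"):
    """Build the command lines for one maintenance run.

    First refresh the spam domain list, then update the filters of each account.
    """
    commands = [[binary, "domains", "update"]]
    for account_id in account_ids:
        commands.append([binary, "filters", "update", str(account_id)])
    return commands


def run_all(account_ids, run=subprocess.run):
    """Execute all commands in order and stop at the first failure."""
    for command in build_commands(account_ids):
        run(command, check=True)  # check=True raises if the command fails
```

You would call run_all with your own account IDs from a small script and let cron or the Windows Task Scheduler start that script, for example once a day.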

How does ga-spam-control work?

ga-spam-control builds a list of known referrer spam domains and creates and maintains Google Analytics View Filters for these domains:

C4 model - Context diagram of ga-spam-control

ga-spam-control uses …
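
Building one list from several sources can be sketched as follows. The input format is an assumption (plain text, one domain per line, "#" starting a comment), and the function is an illustration rather than the tool's actual implementation:

```python
def merge_domain_lists(*lists):
    """Merge several plain-text domain lists into one sorted, de-duplicated list.

    Assumed format: one domain per line, '#' starts a comment.
    Hypothetical helper for illustration; not part of ga-spam-control.
    """
    merged = set()
    for text in lists:
        for line in text.splitlines():
            # Drop comments, surrounding whitespace and case differences.
            domain = line.split("#", 1)[0].strip().lower()
            if domain:
                merged.add(domain)
    return sorted(merged)
```

Normalizing case and removing duplicates matters here because the community lists overlap, and every redundant entry would waste some of the limited filter space.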

What about Segments?

Every time a new referral spam domain appears that has not yet been detected by the community, it will make its way into your Analytics reports before it can be identified and added to a spam domain list.

Once the new spammer has been identified, ga-spam-control can easily block it. But it cannot remove the existing spam entries from your analytics reports. This can only be done with segments:

Screenshot of a Google Analytics Segment which excludes referral spam from your Google Analytics views

Unfortunately Google Analytics Segments can only be created manually. So ga-spam-control currently can’t help you with that.

Using machine learning to detect referrer spam

Earlier versions of ga-spam-control contained a machine-learning component which tried to detect referrer spam by training a neural network.

Animation: A neural network for detecting referrer spam in your analytics data

Unfortunately I could not get it to work reliably across sites with different usage patterns, so I removed it. But I will build a machine-learning-based referrer spam detector that uses honeypot sites to help keep the referrer spam lists up-to-date.

Roadmap

Ideally Google would just include spam protection into Google Analytics and make this whole thing obsolete.

But until then I will use this tool as a playground and add some features here and there. The complete list of feature ideas is maintained in the README of the project: github.com/andreaskoch/ga-spam-control#roadmap

I will update this post when I release a new version of ga-spam-control (current version: v0.6.0).
