Destigmatising Mistakes: A Game Launch Incident Review

2 Aug '21

Incident Response

We don’t often talk about the mistakes we make, whether that’s because we feel guilty, ashamed, scared, or something else. So I thought it would be nice to share the one we (GES) made and what we learned from it. Hopefully, this will help reduce the pressure/stigma of making a mistake.

So grab a brew and let me tell you a story…

What do GES do?

GES are responsible for launching games, primarily on Vegas and Bingo Arcade. We also look after mini-games such as Prize Machine/Unwrap the Cash, and we make sure that Reality Check, one of our safer gambling tools, keeps working.

What happened?

TL;DR: games worked fine until a user got an error. We then threw a more critical error, which made the game unplayable.

In one of our updates, we improved our “session expired” error messages. If there was an error during our pre-launch checks, we would reset the launch config and ask the user to go back to the home page. We couldn’t fix the problem ourselves, because we don’t fetch the auth tokens. Sending someone back to the home page let the portals (e.g. Vegas/Bingo) re-authenticate the user for us.

This passed our reviews and testing — as these were pre-launch errors it was fine to wipe the launch config, since a user can’t launch a game anyway.

The problem was noticed once we deployed to live: we were accidentally wiping the launch config on every error, not just the pre-launch ones. As the service is written in React, whenever the gameName prop (which is part of the launch config) changes, we try to reload the game, as you might be loading a new game.

Removing the gameName counted as changing it, so we tried to reload the game. Because we had removed the gameName, we no longer knew what to load, so an error was thrown.

Lots of things count as an “error” to the game, including insufficient funds — so as soon as any “error” happened the game had to be hard reloaded in order to work.
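To make the failure mode concrete, here is a minimal sketch of the behaviour described above. It is not the real GES code: the types `LaunchConfig` and `GameError`, both handlers, and `onGameNameChange` (a plain function standing in for the React reload that fires when the gameName prop changes) are all hypothetical names made up for illustration.

```typescript
// Hypothetical names throughout; none of this is taken from the GES codebase.
type LaunchConfig = { gameName?: string; token?: string };
type GameError = { code: string; phase: "pre-launch" | "in-game" };

// Buggy behaviour as described: every error wipes the launch config.
function handleErrorBuggy(_config: LaunchConfig, _error: GameError): LaunchConfig {
  return {};
}

// Intended behaviour: only pre-launch errors reset the launch config.
function handleErrorFixed(config: LaunchConfig, error: GameError): LaunchConfig {
  return error.phase === "pre-launch" ? {} : config;
}

// Stand-in for the reload that runs whenever the gameName prop changes.
function onGameNameChange(prev: LaunchConfig, next: LaunchConfig): string {
  if (prev.gameName === next.gameName) return "no reload";
  // gameName was removed, so there is nothing to load.
  if (next.gameName === undefined) throw new Error("No game to load");
  return `reload ${next.gameName}`;
}

const config: LaunchConfig = { gameName: "example-slots", token: "abc" };
const insufficientFunds: GameError = { code: "INSUFFICIENT_FUNDS", phase: "in-game" };

// With the fix, an in-game error leaves the config alone, so nothing reloads.
const fixedResult = onGameNameChange(config, handleErrorFixed(config, insufficientFunds));
```

Passing the output of `handleErrorBuggy` to `onGameNameChange` instead throws, which mirrors the second, more critical error that users saw.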

How did you first spot the error?

Our monitoring automatically calls us out, but we also watch our graphs when releasing changes, so we saw the problem a few seconds before we got the call.

How did we respond?

At first, we couldn’t quickly work out what was causing the error — so we decided to roll back to the last version. This gave us more time to debug the issue and create a proper fix, rather than pushing up a hacky fix that might not have worked.

The downside to this was that we were going to have to do a bit of cleaning up of git tags and versions.

How long did the incident last?

We rolled back to the last version within about an hour. This would have been quicker, but our rollback job failed. Because it had only ever been run once before, it hadn’t been updated for a while.

Why was it so hard to find the issue?

We don’t do small, per-ticket releases like other squads as we are a service provider; if we did that, then each portal would end up being asked to version bump for each ticket we release. So we bundle up our releases and do them every few weeks to ease the strain on other teams.

This means our releases can get rather beefy, so it was hard to see which small change caused the issue.

If you have a large release, how did you narrow down what the error was?

We looked over the whole release and focused on the changes closest to where we thought the error was coming from. Thankfully, we were right.

What we should have done was roll back each commit and test to see if the error was there; this would have been much more accurate and would have shown the commit that caused the error.
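The commit-by-commit check can be sketched as below. This version swaps the linear walk for a binary search (the same idea as `git bisect`), which needs fewer test runs on a large release. It assumes the commits are ordered oldest to newest and that once the error appears it stays in every later commit. `findFirstBadCommit` and `isBroken` are hypothetical names: `isBroken` stands in for "deploy this commit to a test environment and see whether the error appears".

```typescript
// Hypothetical helper; the commit IDs and the isBroken check are illustrative.
function findFirstBadCommit(
  commits: string[],
  isBroken: (commit: string) => boolean
): string | undefined {
  let lo = 0;
  let hi = commits.length - 1;
  // If the newest commit is fine (or there are no commits), nothing to find.
  if (hi < 0 || !isBroken(commits[hi])) return undefined;
  while (lo < hi) {
    const mid = Math.floor((lo + hi) / 2);
    if (isBroken(commits[mid])) {
      hi = mid; // error already present: first bad commit is mid or earlier
    } else {
      lo = mid + 1; // still good here: first bad commit is later
    }
  }
  return commits[lo];
}

// Example release of five commits where the error first appears in "c3".
const releaseCommits = ["a1", "b2", "c3", "d4", "e5"];
const culprit = findFirstBadCommit(
  releaseCommits,
  (commit) => releaseCommits.indexOf(commit) >= 2
);
```

In this example `culprit` is `"c3"`, found after three checks rather than five.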

What did we learn?

Our monitoring and call-outs work :)
Our rollback job needs fixing.
Sometimes our releases can get a little big, making errors hard to find.
We need a nicer way to tidy up a failed release.
We are going to leave our release on the test/staging environments for longer before declaring it good and releasing to live.
We are going to wait longer before merging our release candidate into master, which makes rolling back easier.


Meet the author

Martin Blackburn

Principal Engineer · Gaming Tribe

Working at Sky Betting & Gaming since 2016

Principal engineer in the Games Experience Squad (GES), which is responsible for launching games from different providers across each of our products, creating mini-games such as the Prize Machine, and running Reality Check, one of our safer gambling tools.

JS
React
CSS
HTML
Node
