Here at Sky Betting & Gaming, we’ve had great success running fire drills against our production systems. By running these failure scenarios we’ve been able to increase confidence in our ability to support live service, enhance the knowledge of support staff and highlight weaknesses - be that with the services themselves or our support processes.
Looking to build on this success, I put forward the idea of running similarly structured events pre-go-live in the form of a GameDay. The aim was to gain similar learnings, but by doing it before the system went live we sought to capture issues in their infancy, before they had a chance to impact customers, and to help make a judgement call on the service's readiness to go live - thus the "FireDrill GameDay" was born. In this post, I'll summarise some of the key elements that made these events a success and highlight some of the lessons learned from running them over the last two years.
A FireDrill GameDay brings together two activities in Chaos Engineering: fire drills and GameDays.
A FireDrill GameDay is essentially a GameDay run using the tried and tested format we've developed for fire drills; as such, I'll refer to them as GameDays for the remainder of this post. What makes the format slightly different from our standard fire drills is that they are less formal, have multiple scenarios, and, in our case, are performed on pre-live systems.
With regard to planning and execution, we found it useful to include the following elements:
By clearly defining the scope of the GameDay we can ensure that we focus on the correct areas of the service with the appropriate participants. The scope would generally cover all components of a given service but you may choose to focus on areas of a service where the risk of failure is high or the state of readiness (to go live) is questionable.
Those involved in the GameDay should be notified at the earliest opportunity to ensure they are available and that preparation can start promptly. Generally, involvement would be limited to those who are to be responsible for supporting the service and those involved in its development. The support team would primarily be involved in the investigation and resolution of incidents, and the development team in planning and reviewing outcomes.
One or more Excons (exercise coordinators) will be required to run the incidents, and it's also useful to assign someone to the roles of SLM (Service Lifecycle Manager) and IC (Incident Commander).
In advance of the GameDay, publish a timeline of how the day will be broken down. Ensure that sufficient time is provided to investigate each issue, attempt to find a solution, and confirm that everything is back to normal. It is also advisable to have a break between scenarios to reset the environment and give those running the GameDay a rest, as well as time to prep for the next scenario.
Kick off the GameDay by providing a run down of the day’s plans and an overview of the system architecture. This is particularly useful where the target system is yet to go live and the support personnel are unfamiliar with it. Having the architect or lead developer provide this overview may be particularly beneficial as is the provision of links to useful resources such as runbooks and monitoring.
Multiple failure scenarios should be devised and tested well in advance, while ensuring they are kept hidden from those responsible for resolving the problem on the day. Devising the scenarios should be a team effort involving analysis of the system architecture and its failure modes. Issues experienced during development should be used to pinpoint potential problem areas. Ultimately, a list of scenarios should be broken down and detailed as shown here:
The selected scenarios should be thoroughly tested to obtain a clear understanding of how the simulated incident will play out and the best candidates taken forward to the GameDay.
Certain scenarios may benefit from simulated transactions being run against the service. This adds to the realism, with logs being populated with associated errors/warnings, while also helping to signpost the issue and its consequences.
During execution, be clear on any rules that might apply, e.g.:
During the exercise, it is good to keep reminding the participants of the questions we are seeking answers to:
Once all the scenarios have been executed those involved should be given time to take a break and consider any issues encountered especially in response to the questions asked in the previous section.
A retro board similar to the following may be used to decide which issues are to be fixed now and those which can be left until later:
We also made use of a Readiness-O-Meter to get a quick view of where people thought we were in regard to our ability to support the service.
Having now run several of these events - here are some of the key lessons learned:
Early involvement is key - start early in the project and get people involved from the outset. The GameDay should be treated as a project deliverable in its own right.
Choose scenarios wisely, favouring incidents that are likely to occur, to maximise what you learn. This ensures you provide valuable insights into the service and how it is supported.
Devise multiple scenarios - not only will you need backups but the more scenarios, the more you’ll learn about the service and the greater the potential for uncovering issues.
Resist efforts to include tests that should be done elsewhere; the GameDay is a complement not a substitute for more traditional forms of Operations Acceptance Testing like failure testing, backup recovery, and DR testing.
During the exercise do not lose sight of the end goal - keep reminding the participants of the questions we are seeking answers to e.g. Is system behaviour as expected? Are there any areas we can change to improve the supportability of the service(s)?
Make time for a closure-type event and associated activities ensuring that nothing gets left unresolved and without an owner.
Finally, the day of the GameDay itself is just a part of it - just as important are the conversations, analysis, and testing that happen in the lead-up.
Here at Sky Betting & Gaming, we create some spectacular websites and apps (but you already knew that). What's great about that is we generally use the same tools and techniques for developing these, so a breakthrough for one person can prove to be substantial for many. Granted, every squad is different so it's never going to be identical, but there are only so many ways you can develop, bundle, and release packages.
So, when we decided we wanted to look at improving our build performance, it was a no-brainer that we'd make a big fuss about it if we were successful - and oh boy, were we successful!
In the Games Experience Squad (GES), we previously used Babel and Webpack for building Next.js, then TypeScript and Rollup for compiling and bundling our packages. After some searching we found SWC (Speedy Web Compiler), a Rust-based compiler with all the bells and whistles needed to integrate with Webpack and Rollup, but also with a simple CLI. Little did we know, this compiler would be perfect for our use case and be far easier to integrate than we ever expected.
What's great about SWC, and what drew our attention to it, is that Next.js 12 is built on top of SWC. But without making explicit changes it still defaults to Babel, which meant we weren't making full use of it. As a direct replacement for Babel, SWC boasts impressive benchmarks, with claims of 20x faster single-thread speeds and an eye-watering 70x faster multi-core speed.
But how could we possibly make this work with a multitude of different packages within a monorepo? Well, we got lucky. SWC has an integration with Webpack, which let us directly swap out Babel. What was left then was the pesky Rollup package that used TypeScript. Thankfully, we were able to use rollup-plugin-swc to finalise our migration to an SWC-based application.
Let’s not beat around the bush, you’re all here for the data to see just how much of a difference it made.
Build Location | Old System | New System | Speedup (old time ÷ new time)
---|---|---|---
Local | 182.39s | 49.86s | 3.66x (365.80%)
Jenkins | 94.00s | 31.00s | 3.03x (303.22%)
At first glance the raw time savings may not look dramatic, but once you realise builds are now over three times faster across multiple environments, the potential impact across Sky Betting & Gaming is extraordinary.
Our response: after verifying and testing our builds, we informed the Gaming tribe first and then the entire sbg-tech channel via Slack, sharing our achievement along with encouragement and support for others to make these changes where possible.
Performance and technology are always changing and shaping the industry - as soon as you do one thing, the next best thing comes along. Fortunately, there are a number of new features under construction in SWC, including a bundler and minifier, which we are excited to see and which could potentially let us migrate away from Webpack and Rollup entirely for a full Rust-based build process.
In the talk I lamented my lack of drive in pursuing purely technical content any more (there's plenty of that in my old blog), instead reflecting my current career arc and giving a talk on "boring management stuff", as I prefaced it.
Of course, I don’t think it’s boring or particularly constrained to management stuff, but that was the first layer of my subterfuge and was reflected back in the organiser’s comment that DC151 hadn’t seen a talk like this before.
Nice.
So anyway, this is a brief outline of my talk. Well, of my point, really, which is actually pretty simple, all told.
The most succinct summary of what I was saying has since come from the co-organiser of DC151, Glenn:
We need to stop being the smartest in the room and start being the most helpful in the room.
— Glenn Pegden - ☎️📟💾 Ⓗằ⒞𝓴𝗘ṝ (@GlennPegden) September 12, 2021
And that really is the main point that I wanted to get across.
As an industry we’re obsessed with being smart. And that’s ok - good even, in the right context. We love our rockstars as much as the next group. But there’s a whole subsection of our industry that has adopted a slightly worse interpretation of that wherein it’s not enough to be smart, we have to be the Smartest Person In The Room.
But if you think back to all of the Smartest People In The Room that you’ve ever had to work with, unless they’ve made this realisation, I expect you’ll also, as I do, remember them as smart and angry.
Because they’ve done the research. They’ve put the work in. They’re smart and they’re right. But they’re not getting their own way.
What’s left to do?
Frustratingly, from that point of view, that’s only the beginning.
Everything that you want to achieve hinges on your ability to convince other human beings of your point of view. To trade positions with them so that you each compromise what’s happening in the right context so that you can get what you need and they can get what they need.
It’s really obvious when written down or said out loud. But remains elusive to some InfoSec professionals to this day, in my experience.
What is all too easy to forget is that we’re the tail, not the dog. Most businesses don’t exist to do Perfect Security, if there were such a thing. Given infinite resources and infinite time I’m sure we could endlessly iterate on what we’re designing and saying such that it improves, but in the meantime back in the real world we’ve got a requirement to make money. And InfoSec doesn’t make money.
And herein lies the rub. How to get your own way in InfoSec relies on a simple economic truth: is it cheaper to do what I say, or is it more expensive?
And if you fall on the wrong side of that then you’re almost never going to get your own way.
So how exactly do you make a non-revenue-generating business area like InfoSec not cost money, and get its own way?
This is where I revealed my final subterfuge for the DC151 crowd - not only was this a boring management talk, this was also a boring Security Architecture talk!
At Sky Betting & Gaming we’ve weaponised compliance and used it in what I like to refer to as “selective relieving of friction” - we introduced a very stripped down version of a very old idea - patterns - and made them work for our audience.
The top part of the document tells an engineer what tech and processes can be used to solve a known problem in a given context. Let's say "Authentication", as an example. It explains (pictorially and in text) what the solution needs to basically look like, what technology and processes are acceptable in its implementation, and what trade-offs need to be considered.
All good. Pretty standard.
But the coup de grâce is the second page. That’s where we pre-assess (and therefore effectively pre-approve) our pattern against the relevant compliance standards that are in place across our business.
And since compliance is a local issue, it boils down to this:
“Do it our way, and it’s free. Do it your own way and demonstrating compliance is your problem”
We’ve tipped the balance of the scales in terms of cost. It’s a known cost to implement this pattern since it’s just tech and processes. But if you build it yourself then not only do you have to come up with a solution you also need to go to the trouble of checking it’s compliant.
All of a sudden, solving problems in the way we’ve selected becomes economically viable compared to the other options.
So we achieve the double whammy - we’re smart, and right, but we’re also helpful and cheaper.
Win/win.
So grab a brew and let me tell you a story…
GES are responsible for launching games, primarily on Vegas and Bingo Arcade. We also look after mini-games such as Prize Machine/Unwrap the Cash as well as ensuring Reality Check works, which is one of our safer gambling tools.
TL;DR: games worked fine until a user got an error — then we threw a more critical error, making the game unplayable.
In one of our updates, we made an improvement to our "session expired" error messages. If there was an error during our pre-launch checks, we would reset the launch config and ask the user to go back to the home page — because we couldn't accurately fix the problem, as we don't fetch the auth tokens ourselves. By sending someone back to the home page, portals (e.g. Vegas/Bingo) could re-auth the user for us.
This passed our reviews and testing — as these were pre-launch errors it was fine to wipe the launch config, since a user can’t launch a game anyway.
The problem was noticed once we deployed to live: we were accidentally wiping the launch config on every error, not just the pre-launch ones. As the service is written in React, whenever the gameName prop is changed (which is part of the launch config), we try to reload the game — as you might be loading a new game. Removing the gameName counted as changing it, so we tried to reload the game. As we had removed the gameName, we no longer knew what to load, so an error was thrown.
Lots of things count as an “error” to the game, including insufficient funds — so as soon as any “error” happened the game had to be hard reloaded in order to work.
Our monitoring automatically calls us out, but we also watch our graphs when releasing changes, so we spotted the issue a few seconds before we got the call.
At first, we couldn’t quickly work out what was causing the error — so we decided to roll back to the last version. This gave us more time to debug the issue and create a proper fix, rather than pushing up a hacky fix that might not have worked.
The downside to this was that we were going to have to do a bit of cleaning up of git tags and versions.
We rolled back to the last version within about an hour. This would have been quicker, but our rollback job failed. As it’s only ever been run once before, it hadn’t been updated for a while.
We don’t do small, per-ticket releases like other squads as we are a service provider; if we did that, then each portal would end up being asked to version bump for each ticket we release. So we bundle up our releases and do them every few weeks to ease the strain on other teams.
This means our releases can get rather beefy, and so it was hard to see which small change caused the issue.
We looked at the overall release and worked out what changed around what we thought was causing the error, and thankfully we were right.
What we should have done was roll back each commit and test to see if the error was there; this would have been much more accurate and would have shown the commit that caused the error.
Part of our technology stack for backend account services (we’re talking super-backend, as in direct connections to the databases) uses an abstraction layer provided by a third-party. It’s closed source, but we have a route into them for bug fix and feature requests. One such feature request was to expose metrics to show the count of payments waiting to be fulfilled by one such utility server. As the year is 2020, this was delivered as a Prometheus query exporter application that could be scraped by our Prometheus instances. We could then graph these counts, and raise alerts when the fulfillment queue was growing beyond expectations, giving us additional visibility into problems.
This query_exporter application was delivered back in July, and it has been deployed and running on the relevant servers ever since. The endpoint has been exposed (though network access is restricted), and when we manually curl the metrics endpoint we see the results we expect. The work to start scraping the endpoint was paused behind a larger piece of work to overhaul our Prometheus offering, and so was picked up again recently. Here is where there was a fundamental breakdown of communication in understanding exactly what the query_exporter application actually does.
We assumed that the metrics endpoint would behave like others we have had experience with in the past, namely that requesting the endpoint would display the latest metrics gathered by the application. What was actually happening is that each time the endpoint was requested, it would make a request to the database which, while as efficient as it could be, still queried many millions of records. Keep this in mind as we go through the timeline of how this became a service affecting incident.
After deploying the new scrape target configuration to our test Prometheus instance, which was querying the endpoint on non-production servers, a change request was raised to begin scraping the metrics endpoint from our production Prometheus instance. The default values were used, changing only the metrics_path and params values to take into account the URL of the endpoint. Most notably, the default settings for scrape interval (30s) and timeout (10s) remained. The change was carried out, and only when the config became active and Prometheus started scraping the targets in production did we realise that additional firewall rules were needed. The change was marked "partially successful" as the config was in place but the targets were not yet being scraped. Off the back of this change request, tickets to add the necessary firewall rules were raised and passed to the relevant teams to progress.
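Before moving on to the firewall side of things, it's worth making those scrape settings concrete. The following is only a sketch: the job name, query parameter, host, and port are placeholders rather than the real values, and the interval and timeout simply write out the defaults mentioned above.

scrape_configs:
  - job_name: payments_query_exporter       # placeholder name, not the real job
    scrape_interval: 30s                    # the default interval mentioned above
    scrape_timeout: 10s                     # the default timeout mentioned above
    metrics_path: /metrics                  # adjusted to the exporter's URL
    params:
      query: ['pending_payments']           # illustrative parameter only
    static_configs:
      - targets: ['utility-server-01:9560'] # placeholder host and port

With a 30-second interval and a 10-second timeout, a scrape whose underlying database query takes longer than the interval means a new query starts before the previous one has finished, which is exactly how the load on the database compounded.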
A side note about how our firewall request tickets work. We raise a request with the requirements and other details (source, destination, ports, encryption, etc.) which gets passed to a human, then an automated process, then Security to review (not that Security aren't human, but you know) before being implemented by an automated process. It makes firewall rules simpler to raise, implement, and manage than if the process were entirely human-driven.
A day or so later, at approximately 17:04, the firewall automation process implemented the rule that we’d asked for, and Prometheus could finally get at the URLs it was so desperate for. The more database-intensive query (getting all payment records for the previous week) took nearly as long as the interval to return results, and very soon was taking longer than the interval. Even though the Prometheus scrape was timing out after 10 seconds, the query was still running to completion on the database. After around 15 minutes we were in a situation where the database was overloaded by this query running multiple times, so much so that other services which use this database (such as login) were unable to interact with it. As for the incident process itself, this was ultimately rather mundane. Our products had customer-facing banners applied, to ensure that there would be no further organic interactions with our services and the database. Once this specific query was identified as being the apparent cause of the increase in database load, the application on the utility servers was stopped and regular service was resumed shortly afterwards.
As with all major incidents and service interruptions, a Post Incident Review was held which was open to the whole company. From this meeting, a number of excellent observations and areas for improvement were made, which I will attempt to summarise below:
Chief among them was that we need to make sure we properly understand what the query_exporter application does. This of course has a knock-on effect that we need to account for firewall requirements earlier in the process of a piece of work, and build those requirements into the scope of the work.

A number of these improvements are already underway, and changes have been made to the query_exporter to ensure that it doesn't ever attempt to create more than two simultaneous connections to the database. As for the underlying cause of the incident (or the "root cause" if you insist on using such language), that has to be the fact that our assumptions as teams or individuals are ultimately formed by our past experiences. Given our past experiences would suggest that a metrics endpoint is relatively low-resource, we saw no problem with polling it at such a high rate. Engineers that work directly with the database, and who have indeed already written their own exporters that make use of database queries, would know for certain that this should be handled with a lot more caution.
Above all else, this has been a very useful reminder that even what appears to be a simple change to monitoring might in fact be the thing that causes your revenue-generating services to grind to a halt.
EDIT (2021-02-22): The contents of this blog post have been discussed further on the Software Misadventures podcast.
We built a container platform using Chef and Docker as a stepping stone to Kubernetes. It utilised our existing apps-on-virtual-machines deployment and operational patterns, and so this allowed Software Engineers to develop their applications in a cloud-native manner in a familiar way, and without needing to use the (at the time) brand new Kubernetes platform.
Now we have a very high level of confidence in said new Kubernetes platform, we are migrating projects over and decommissioning our middle-ground solution.
The migration actually went a lot smoother than we anticipated, especially seeing as this is the first large-scale service we have built on Kubernetes. One thing we hadn’t initially accounted for was the different ways in which health checks would be used by the underlying platform to detect whether or not traffic should be sent to a specific container. For some context, our current Corbenetes architecture looks similar to the below with one traefik container and two app containers per host (alongside a number of supporting containers for logging and metrics that aren’t shown here).
Traefik acts as a proxy, handling both TLS termination and sending requests to the two backends that are available on the host. The upstream load balancer examines the health endpoints of the applications (via traefik) as well as traefik's own /ping endpoint. When we do a restart of the application as part of a release, we stop traefik, which allows existing connections to the application to clear down, and takes it out of service in the load balancer so no further requests are sent to this host. Once the applications have finished dealing with their requests, they too are closed down. Because of this, the shutdown of the application itself doesn't need to be the cleanest, as by the time it receives the shutdown signal there are no requests being served by it.
The following image shows our current architecture now we're on Kubernetes, again very high level. We have traefik and app Pods, each exposed with a Service (essentially a cluster of Pods and a policy allowing access to them). The traefik Service is exposed outside the cluster to allow incoming connections from the load balancer, and we make use of the Ingress resource to direct traffic destined for specific URLs to the app backend Service.
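To give a feel for that last part, here is a minimal sketch of such an Ingress; the names, host, and API version are placeholders and modern defaults rather than our real configuration.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress               # placeholder name
spec:
  rules:
    - host: app.example.com       # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app         # placeholder backend Service name
                port:
                  number: 8081    # the application port (assumed from the probes below)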
Our health checks are defined as part of our Deployment manifest, for both traefik and the application. Initially we used the same health check endpoints as we had been using previously. Our manifest looked a little like this.
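# Probes for the traefik container (port 80 here; inferred from the ports, since the app's probes below use 8081)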
livenessProbe:
failureThreshold: 2
httpGet:
path: /ping
port: 80
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 5
readinessProbe:
failureThreshold: 2
httpGet:
path: /ping
port: 80
scheme: HTTP
periodSeconds: 5
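# Probes for the application container (port 8081, the same port used for the /login readiness check later)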
livenessProbe:
failureThreshold: 2
httpGet:
path: /ping
port: 8081
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 5
readinessProbe:
failureThreshold: 2
httpGet:
path: /ping
port: 8081
scheme: HTTP
periodSeconds: 5
This seemed to work out alright. Our Pods were up and running when we expected them to be, and when we force-stopped the application, the health check went bad and the Pod was deleted. It was at this point we started doing extensive performance testing of the application on this new platform. We noticed an issue with cold starts: it appeared that the first time a non-health-check request was made to each application Pod, the response time was around six seconds. This was obviously a suboptimal situation, and another reason why we do lots of performance testing using production-like load profiles.
After this first slow response, the application was absolutely fine. Turns out that the application doesn’t compile its JavaScript components until the first page load.
It's almost as though the team that built Kubernetes knew this would be a thing, as they specifically define both a livenessProbe and a readinessProbe that can be applied to the Pods. Their own documentation explains the difference between the two: the liveness probe tells the kubelet when to restart a container (for us, when the /ping endpoint returns a bad response), while the readiness probe tells it when the container is ready to start accepting traffic.

In our case, we needed to change our readinessProbe to try and load the /login endpoint, and allow the application to be properly ready for accepting all traffic, not just health check traffic.
readinessProbe:
failureThreshold: 5
httpGet:
path: /login?client_id=auth_provider
port: 8081
initialDelaySeconds: 25
periodSeconds: 10
timeoutSeconds: 1
successThreshold: 1
This was a great success. Each Pod took about 15 seconds longer to start, but that's a small price to pay to be confident in the ability of the Pod to handle traffic straight away. We went back to our performance testing and, lo and behold… we still saw issues with dropped requests during an application release, or scale-down of the Pods.
Something about how this application works did not appreciate the way in which Kubernetes was terminating it.
When Kubernetes decides that it needs to terminate a Pod (whether this be to evacuate a host, move things around for resource utilisation purposes, or just because there is a change to the Deployment warranting the creation of new Pods) it sends a SIGTERM (graceful shutdown request) to PID 1 of the container and waits for up to thirty seconds. In our case, PID 1 was the node parent process.
Remember the way we used to stop the applications on Corbenetes? We’d cut them off from receiving new requests and then eventually stop the application itself. Because at a basic level there is no link in Kubernetes between the state of the application Pods, and the state of the traefik Pods, we can’t hide the application behind traefik any more.
Thankfully, we had a way around this. We could use the Kubernetes container hooks to implement a PreStop hook to do something before Kubernetes attempts to stop the container. We knew that straight killing the node parent was bad, even if it was graceful, so we experimented with the best way to gracefully terminate the container.
lifecycle:
preStop:
exec:
command: [
"sh", "-c", "sleep 2 && kill -15 $(pidof node | awk '{print $1}') && sleep 2"
]
We ended up with this PreStop hook, which waits a couple of seconds, then sends a SIGTERM to the node child process, before waiting for another couple of seconds.
We went back to our performance tests and it was smoother than the proverbial. No matter how aggressively we were terminating Pods and restarting Deployments (the quickest way to test a release without actually releasing), we saw absolutely no degradation in service whatsoever.
This was all fine and dandy, but it taught us an important lesson: even if you are moving from one container-based platform to another, there is no such thing as a lift-and-shift. Each platform has its own nuances that need to be addressed. Now we know that Kubernetes is our standard, and we know how to handle zero-downtime deployments for our applications, we can properly spec out work in future, and not spend days chasing our tails trying to work out why it doesn't just work.
Based on analysis of Kubernetes clusters, this session demonstrates ways to "see" the workloads running on clusters, helping to better understand a system through visualisation, anomaly identification, and exploratory analysis. Using tools such as graph databases and visualisation tools, you'll see how they can help explore and understand cluster workloads, with examples of how these tools have identified issues and how they can help engage with users of the systems to share best practices and ultimately improve cluster performance.
You can download the Go source code to export Kubernetes objects to Neo4j and GEXF. The slide deck is also available.
]]>The incident in question was caused by a config change to the JMX settings of a number of Java applications we run. We use JMX to gather metrics from these applications, and a previous attempt at this change had already caused an incident that stopped these metrics from being collected. It was not customer impacting, but it did mean this change was already fairly high stress for the Engineer performing it. A fix had been added to the change for the metrics issue, and the change had successfully gone out to our non-production environments more slowly than normal.
Obviously there are exceptions to every rule, but mostly actions are taken via the use of Jenkins. A job is defined for many of the common actions we need to take; deploys are all scripted in this way for example. This allows us to perform actions on a large number of servers consistently. If there isn’t a specific job in Jenkins to perform an action, we have a job that allows us to run a command we define ourselves. When planning a change we write out the steps we will take, and provide evidence that these steps produced the desired result in our test environments. When choosing a Jenkins job to run we choose not only the appropriate job, but also any options we will pass to that job. That plan is then read by another Engineer, generally one of the Senior Engineers, and preferably one uninvolved in the change being planned. If the approving Engineer has any doubts then the change needs to be amended to address those doubts.
So, on to our change. The plan was reasonably straightforward.
It is worth explaining that last step in more detail, as that is where things did not go according to plan. As multiple different applications were being restarted the plan did not use the application restart job that we have scripted in Jenkins. This job is designed to restart a single application. Instead the job mentioned earlier to run commands we define was used. It was planned as:
These exact steps had worked in all our other environments without issue.
As we all know, mistakes happen, people are imperfect beings, and that's why we script as much of our work as we can. In this case the Engineer forgot to set the Concurrency on the job to restart the applications. In the Application Restart job, this would have caused a problem, but not a major one, as it always starts with a single server, and prompts the Engineer to check that the application has started cleanly before moving on. But that job was not used; the job that was used was an older job with fewer safeguards, and unfortunately it had a default value for the concurrency. As this job is used widely within the business, and may be used across hundreds of servers at a time, it makes sense to have a high default concurrency; however, in this instance that was sufficient to restart the application on all the Java application servers at once. These applications handle a number of functions surrounding customer logins, and consequently on 23 March at 10:27am all customer logins across all Sky Betting & Gaming products failed, for approximately one minute.
We try to work with a no-blame culture. It isn’t the fault of the individual, it is the fault of the platforms and procedures that allowed a mistake to have an undesired impact. So when the Engineer performing this change immediately put their hand up and said “I have caused an outage” it allowed the on-call engineers, and other interested parties to get on with the task of fixing the problem, and investigating the impact, without having to look at why the incident happened.
It's important not to panic when things go wrong. A natural instinct is to "Do Some Thing!" when things are going wrong, but the wrong thing done quickly can sometimes do more damage than doing nothing. Based on the Engineer's account of what happened during this incident, they saw the restart job connect to all the Java application servers at once, and nearly killed the job then and there. This would have stopped that job from bringing the applications back up, and would not have been good. Fortunately they paused, thought about what they wanted to do, and didn't take that action. This is always a useful way to handle the panic you will feel in that situation: pause and reflect on what you want to do. The extra time can bring clarity, and it is that clarity that panic robs from us.
They asked for help. This is also important; we are a large organisation, individual Engineers are not alone, and are not expected to know everything. The Engineer called for help, and some other Engineers who were more familiar with the application were able to confirm that it came back up cleanly, and also investigate the extent of the impact. Approximately 500 logins failed in that minute.
We have reviewed the incident after the fact, and the Engineer was open and honest about what they did. This has allowed us to improve our procedures, including adding extra scrutiny over the use of the Jenkins job that failed us in this instance, and further education on how to use the Application Restart job, which would actually have been able to do the job we needed, but which the Engineer who planned the change was not familiar with, as it is most often called as part of a deploy, and not directly.
I started by saying that I wanted to tell you why mentoring is hard. So many people have told this Engineer that they did the right thing, that it’s not their fault, and that everyone makes mistakes. I know that this doesn’t make them feel good. I know that the impact this had on the business, on our customers, and on their colleagues fills them with guilt. And I know that whenever something like this happens, it is part of my job to help them pick themselves up and carry on. I can do my best to try and assuage that guilt, but I know it is always going to be there. I can only tell them it gets better.
I know they feel this guilt, because on this occasion it was my change, and it was my mistake. Which just goes to show that mistakes can happen no matter how experienced or senior you are, so as I tell the Engineers I work with, and as I try to tell myself, “Don’t be so hard on yourself, it could happen to anyone”.
Practising your incident response is nothing new or special, and it's certainly not confined to the technology sector. I'm sure many of us remember fire drills at school. I was never really sure what good it was doing, but made the most of the time out of lessons, taking "walk calmly don't run" to a new level of slowness. It's this attitude that has attracted a lot of criticism of fire drills, the thought being that they breed complacency.
We found ourselves in this situation too; people were becoming complacent. Our fire drills that previously saw us revered within the company were delivering fewer and fewer positive outcomes, and they were seen as a chore rather than an exciting and useful pastime.
The scenarios that were run each week were primarily designed and run by myself; when other people ran them, they were facilitated by other platform-focused colleagues. This resulted in a lack of diversity in failure scenarios, specifically neglecting the wonderful ways in which applications can fail. This also had the unintended consequence that those running the fire drills very rarely ever took part as a responder, and so missed out on that vital experience.
The fact that the scenarios were primarily being run by one group of people also meant that product delivery teams were not aware of how their systems and services behaved under certain conditions. We do not do continuous verification of systems as part of their development or deployment; instead we rely on the scripted fire drills to identify and tease out the ways in which the systems fail.
Finally, and definitely the biggest contributing factor to the complacency we were seeing, was what we initially thought made our fire drills as immersive as a real incident. We had members of all on-call rotations (Platform, two Software Engineer rotas with different domain expertise made up of a few different squads, IC, and SLM). We saw that as soon as the cause of the incident was found to be something in one team's domain, that team was called in to investigate (e.g. /var/ has become unmounted somehow; don't figure out how or why, just escalate to Platform). During an incident this is exactly what you want, but while training engineers on their incident response and troubleshooting skills, we don't necessarily want to rely on SMEs and entire rotas for their domain knowledge. This causes documentation to become stale and, worse, to be written for those who already know how to restore service without the docs, rather than for someone who knows just enough to follow them safely.
Another theme we saw emerging was recurring actions off the back of a fire drill. There would be particular observations, for example "SystemX is not available for people in GroupY and GroupZ", that were never really addressed because the likelihood of needing access to SystemX was so low.
We also saw a very long cycle time for improvements that were raised with predominantly feature-delivery squads. Even though they were raised in the context of what was essentially an incident, they were treated almost as tech debt, passed to a BAU stream that wouldn't necessarily have all the context. Often these were relatively trivial things like changing a WARN event to an ERROR, or amending what events are thrown on certain HTTP response codes.
Our fire drill plans needed changing before the whole purpose of them was forgotten and everyone lost interest.
Things did change. We shifted the rota around so that a different squad was responsible for facilitating a drill each week. This was an immediate success. The level of engagement we saw was unbelievable, far higher than any previous attempts to drum up support from other people to run the drills.
The facilitating squad's place on the rota was effectively removed, with that role instead played by the person facilitating the drill. This meant that they could drip-feed information to the other participants if required. This has led to some fun and engaging ways in which their inability to help has been explained.
The drill can be held in whatever environment the team is comfortable running it in. We have traditionally used our Staging and Disaster Recovery environments for drills but with the advent of cloud technologies being used in the tribe, there is nothing to stop other environments being used (or indeed entire mock environments being spun up temporarily).
A big part of letting the squads plan and execute their own drills is that they can then own the mop-up from the drill, including picking the low-hanging fruit of service improvements (mainly documentation, changes to monitoring, logging, and observability, and minor code fixes such as those alluded to earlier).
This was initially put forward as a trial for a few months, and given the response we have had to it from leadership and those involved in running the drills, I’m sure it will become the new normal. As before though, we need to be mindful not to let the new normal become boring, and so we will be looking to iterate further on the process over time to ensure that people stay engaged, and our incident response drills stay relevant.
I recently attended the performance.now() conference in Amsterdam courtesy of Sky Betting & Gaming's Tech Ninja Fund. performance.now() was a single-track conference with fourteen world-class speakers, covering today's most important web performance insights. As a performance test engineer with a keen interest in front-end performance, I was looking forward to catching up on the latest developments, ideas, and approaches in the industry, and also hoped to pick up some tips along the way.
In this blog post, I intend to summarise my key takeaways from the conference and give a summary of some of the less technical talks with links to associated content.
Web performance has a massive influence on the quality of a user's experience and as such has always been a big topic in web development. Back in 2007 Steve Souders, the godfather of web performance, wrote "High Performance Web Sites", a guide for those wanting to improve web site performance. Things have moved on a long way since then, both in terms of user behaviour and technology. Despite these changes, one thing still holds true - "80-90% of end-user response time is at the frontend."
As a performance engineer with a focus on ensuring our websites are stable and can cope with predicted load, it's easy to lose sight of the fact that front-end performance is of equal importance and shouldn't be seen as a secondary concern or a problem for only front-end developers to address. After all, it's no good having 100% uptime if customers are turning away because the site is slow and unresponsive.
“The quickest request is the one never made” - Henri Helvetica
The opening talk of the conference was from Henri Helvetica, a freelance developer, entitled "A decade of disciplined delivery", in which he walked through the 14 rules set out in Souders' seminal publication to see how relevant they are today.
Here are a few of his examples which show how things have changed:
From these and other examples provided, it is clear that the 14 rules remain relevant, and failure to fully adopt them shows that there is much work still to be done. As Henri pointed out in his talk, the massive shift from desktop to mobile is key to these changes, with mobile devices eclipsing desktop in terms of web access back in 2016. Recent figures put the split at 52.48% to 44.59% in favour of mobile. Alongside the growth in mobile usage, greater mobile device fragmentation further adds to the challenge.
“We’ve built a web that largely dismisses affordable typical smartphones and the people that use them” - Tim Kadlec
Tim Kadlec, a performance consultant and trainer, gave a talk entitled "When JavaScript bytes", in which he highlighted the cost of JavaScript and the practical ways it can be reduced. Tim made an interesting comparison between the 1.7MB of code that helped land man on the moon and the 1.8MB shipped for your average mobile site. A big difference in code and purpose, but it does show what can be done where necessity dictates. Tim emphasised that JavaScript, byte for byte, is the most expensive resource on the web, with a 3x performance penalty in terms of network, on-device, and execution cost. With many sites neglecting to follow best practices like compression and reduction in bundle sizes, coupled with widespread use of lower-spec mobile devices on poor networks, there should be no room for complacency. Tim suggests enforcing strict limits on code size from the outset, or else continuing to chip away at legacy code while there is the opportunity to do so.
“Everyone who works on a web product shares ownership of performance and security - whether they know it or not” - Simon Hearne
Simon Hearne, Web Performance Solutions Engineer at Akamai, gave us a talk entitled "Deep dive into third-party performance". Through looking at post-mortems of recent incidents, Simon sought to equip his audience with the stories, tools, and techniques to manage third-party content, and what to look out for when evaluating a new third-party service. As you'd expect, third-party web content has seen a similar trajectory of growth to that of JavaScript in recent years, with the median website now made up of 37% third-party requests. In his talk, Simon acknowledged that third-party content, though a source of revenue, can become a source of irritation that creates friction between development and marketing teams. He encourages everyone to become third-party subject matter experts to advance both their careers and the products they work on.
While progress in addressing the issues contributing to poor web performance has been slower than many might have envisaged, the same cannot be said for the tools available to help tackle the problem. The last 10 years have seen the proliferation and advancement of monitoring solutions and diagnostic tools that provide the insights needed to help improve web performance. Monitoring solutions like New Relic and AppDynamics can provide a 360-degree view of performance, from infrastructure to application and Real User Monitoring (RUM). Tools like Lighthouse also provide valuable information aimed at improving site quality through analysis of pages against best practice criteria. All the talks at the conference drew on important insights obtained from these tools, and it is true to say that web performance monitoring and diagnostics is no longer the dark art it once was. However, like anything, you need to ensure you choose the right tool for the right job.
“We need good top-level metrics” - Annie Sullivan
Annie Sullivan, Software Engineer at Google, gave a talk entitled "Lessons learned from performance monitoring in Chrome". Annie's talk covered performance metric and benchmark design, dealing with benchmark noise in the lab, and understanding the subtleties of RUM data. With all the data now available to us, it's sometimes difficult to see the wood for the trees. Annie told us about the properties that make up a good metric and their associated use cases. While acknowledging that obtaining accurate insights from metrics isn't always easy, by focusing on the right metrics in the right setting we can gain a deeper understanding of performance problems, from the lab through to real-user experience.
Over the years I've made use of many tools to gain a better understanding of site performance - here are a few of my favourites that are free, easy to use, and provide a wealth of advice and insights:
Sitespeed.io: Not mentioned at the conference, but one of my favourites. Sitespeed.io is a set of open source tools that makes it easy to monitor and measure the performance of your web site. It is the complete toolbox to test and monitor your performance, or to check out how your competition is doing.
WebPageTest: Simple to use and lots of options to play with. WebPageTest is used for measuring and analysing the performance of web pages. A test can be kicked off from a variety of locations with a range of different browsers and test configurations.
Pagespeed Insights: PageSpeed Insights (PSI) reports on the performance of a page on both mobile and desktop devices, and provides suggestions on how that page may be improved. PSI provides both lab and field data about a page. Lab data is useful for debugging performance issues, as it is collected in a controlled environment.
While quick one-off tests can be executed against your site with relative ease, with a little more work it’s possible to integrate such tests into your continuous integration pipeline to provide feedback earlier in the development lifecycle.
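As a rough sketch of what that integration can look like, a CI job can run sitespeed.io in its Docker container against the site on every build and publish the report as an artifact. This is a hypothetical snippet written in a GitLab-CI-style YAML purely for illustration; the job name and URL are placeholders, not something shown at the conference.

# Hypothetical CI job; names and URL are placeholders.
performance-test:
  stage: test
  image: docker:latest
  services:
    - docker:dind
  script:
    # Run three iterations against the site; the report is written to sitespeed-result/
    - docker run --rm -v "$PWD:/sitespeed.io" sitespeedio/sitespeed.io:latest https://www.example.com -n 3
  artifacts:
    paths:
      - sitespeed-result/

Adding a performance budget to a job like this turns the feedback into a hard pass/fail signal, rather than a report someone has to remember to read.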
Once equipped with the knowledge, tools, and data necessary to improve site performance, there is still a very big challenge to managing and maintaining performance especially in a fast-paced environment where multiple teams can be releasing changes into production several times throughout the day.
“Everyone who touches a page should care about the performance of that page” - Tammy Everts
Tammy Everts' talk, entitled "The 7 habits of highly effective performance teams", shared tips and best practice gleaned from conversations with companies leading the way in web performance. These companies had one thing in common: a strong culture of performance.
The 7 habits covered in the talk were as follows:
Like other speakers, Tammy emphasised the need to set performance budgets. She recommends that budgets should be clear on what the budget is, when you go out of bounds, how long you were out, and when you're back in credit. With budgets in place, it should never be the responsibility of one person or a single team to act as a performance cop - with the right culture, everyone shares ownership and understands the impact of what they do. While it is beneficial to have performance specialists, performance must be collectively owned and considered from the outset for all projects and changes, through all stages of development and into Production.
As the web continues to evolve, performance engineers and web developers alike need to stay on top of their game, adhering to best practice principles while also seeking new opportunities to improve web performance through process, design, and tooling. Web design and development has come a long way over the last decade with mobile-friendly sites and responsive design now the norm. However, as sites continue to grow in size and complexity, alongside increasing mobile device fragmentation - there is a greater need than ever to ensure that we help build a web that performs for the many, and not just the few.
If you're interested to learn what the future of web performance might look like, it's worth checking out the final talk of the conference from Vitaly Friedman, creative lead at Smashing Magazine, entitled "The Future of Performance". Amongst other interesting insights, he tells us that 5G will likely widen the gap between the haves and have-nots, rather than close it.
Details of all the talks can be found here. I highly recommend the performance.now() conference, not just for those involved in web performance but anyone interested in web development in general and keen to advance their understanding of front-end web performance. It also helps that it is friendly, well organised, and Amsterdam is a great city to visit.
The third edition of performance.now() will take place on the 12th and 13th of November 2020. Sign up here to be notified when ticket sales open: https://perfnow.nl