Modernising our hosting infrastructure: what we’ve learnt
Most firms that build digital products expect to run and maintain them for many years after launch. During that time, the need to modernise and migrate the project’s web hosting infrastructure will almost certainly arise.
Browser, for example, currently supports many live projects, with the oldest dating back to 2010. As anybody interested in technology can tell you, a lot can change in that timeframe. This post is about how we’ve approached our ongoing hosting migration, and what we’ve learned so far.
But first, a caveat
We’re a heavy user of Amazon Web Services (AWS), so many of the technical benefits we’ve reaped from this process will reference specific AWS hosting services.
This is not to say such gains aren’t possible on Google Cloud, Microsoft Azure, or any of the other big players, but we won’t try to talk about the alternatives here. As always you should research services and form a plan that best suits your migration project.
Applying discovery to hosting infrastructure changes
When approaching this wholesale migration of multiple products, we tried to practice what we preach to our clients and engaged in a thorough discovery process for the project.
Luckily, our sister company, Twine, recently moved through a similar modernisation of its hosting platform, so we set out to learn as much as we could from their experience before we applied it to our own infrastructure.
We also thought it was important to not only look at how new technologies can make things more secure, but how they can also be used to de-risk business processes. As such, it was important to understand the key frustrations of our developers, and doing this helped to identify three key areas to focus on during the migration.
1. Standardise deployment processes
Different projects, from different times, had different deployment processes, ranging from continuous integration tools deploying automatically to out-of-date Capistrano scripts. On some old projects, it was unclear which method was in use, and which version of a given tool it relied on.
This caused unnecessary delay to deployments as not every developer had the correct version of the deployment tools running on their machine. Standardising and removing the dependency on developer machines to execute deployments would save time and cost.
2. Improve efficiency and raise performance
Because different projects relied on different external services (and versions thereof), our projects were deployed inefficiently. Some projects required their own virtual servers because of a specific software version dependency, spreading our hosting budget thinly instead of directing it towards better performance for all our clients.
3. Create opportunities for cross-training and project review
Over the potential near decade-long time span of a project, it’s likely that the original devs will have moved on, taking their in-depth knowledge of the project with them. Whilst documentation means that we’re never entirely stuck, there is a benefit to really knowing a project when called to maintain it.
Incidentally, Google has taken to solving this problem by deliberately rewriting perfectly good projects regularly, simply in order to cross-train staff.
Now, for most of us, such an approach (which Fergus Henderson details in section 2.11 of his report on Software Engineering at Google) would be absurd, but a migration project like this can be used as an opportunity for a developer to learn about a project. Allocating a project for migration to a dev that has never seen it before both spreads knowledge and gives the team a fresh perspective on how the project is managed. It also helps spread understanding of your devops processes amongst your team.
Our technical approach
This isn’t a technical guide, but from working on the recent Twine migration, and our experience with other projects, we settled on some core pillars to our web hosting setup.
Containerising our projects with Docker is a no-brainer for us, and the benefits have been explained far better by others than I ever could. The difference in our implementation, however, is that for our legacy projects we’ve built a single container holding everything we need, rather than following the Docker best practice of one container per service.
This is because nearly all of our projects push files off to Simple Storage Service (S3), use Relational Database Services (RDS), and otherwise use managed AWS products wherever possible already. This is helpful as it means we can lift the whole project and run it as a single container with cache, web server, and the application itself all wrapped up in one. We primarily do this in order to reduce the time required to migrate projects that, by default, expect services to be in certain places, and we make exceptions to this rule where we need to.
To be clear, there are drawbacks to this approach, specifically around how you handle logs. This is a tradeoff that may not be appropriate on some projects, particularly new greenfield efforts.
ECS and EC2
We settled on Amazon’s Elastic Container Service (ECS) using Elastic Compute Cloud (EC2) instances. Twine has used ECS with Fargate instead. However, our experience has taught us that when you’re migrating many projects quickly, there is a big benefit to having quick, direct access to the host instances to debug problems. We also found that we could run our deployments more cost-efficiently on EC2 clusters.
Elastic Kubernetes Service (EKS) is a perfectly acceptable solution if you’re looking to avoid vendor lock-in, but one we did not pursue heavily as we already had a high level of experience with ECS, and a high level of reliance on AWS products.
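To make the single-container-on-EC2 idea a little more concrete, here is a minimal sketch of what such an ECS task definition could look like, written as the keyword arguments that boto3's `register_task_definition` accepts. The project name, image name, memory figure, and port are all hypothetical placeholders rather than values from our setup, and the sketch only builds the arguments; it does not call AWS.

```python
def single_container_task_definition(project: str, image: str, memory_mib: int = 512) -> dict:
    """Build keyword arguments for the ECS `register_task_definition` call.

    Assumption: the whole legacy project (cache, web server, application)
    is packaged in one image, so the task holds exactly one container.
    """
    return {
        "family": project,
        # Run on EC2 container instances rather than Fargate.
        "requiresCompatibilities": ["EC2"],
        "containerDefinitions": [
            {
                "name": project,
                "image": image,
                "memory": memory_mib,  # hard memory limit in MiB (hypothetical value)
                "essential": True,
                # hostPort 0 asks ECS to pick a dynamic host port (bridge networking).
                "portMappings": [{"containerPort": 80, "hostPort": 0}],
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical project and image names, for illustration only.
    kwargs = single_container_task_definition("legacy-site", "example/legacy-site:1.0")
    print(kwargs["containerDefinitions"][0]["name"])
```

With credentials in place, these arguments could be passed as `boto3.client("ecs").register_task_definition(**kwargs)`. A per-service layout would simply list more entries under `containerDefinitions`.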
Continuous integration (CI) is another no-brainer that has been explained better elsewhere. You’re already doing this, right? CI, for us, was a chance to replace all our old, differing deployment methods with one automated solution. This has also reduced our dependence on specific team members with server access for deployments: anybody can deploy.
We’ve used Jenkins heavily in the past, but decided, in the end, to migrate to CircleCI on the recommendation of the Twine team and to avoid having to manage this service ourselves.
Branch merging rules
With everybody able to deploy at any time, we now have the opposite problem… anybody can deploy at any time! We made use of branch merge rules in our Bitbucket repositories to solve this issue. We typically still allow anybody to deploy, particularly to staging environments, but for branches deployed to production, we have added a restriction requiring a pull request to be created. This pull request must be approved by at least 2 developers before it can be merged, triggering CircleCI.
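The rule itself is enforced by Bitbucket's branch settings, not by our own code, but the logic is simple enough to sketch. The example below is an illustration only: the branch name `production`, the `PullRequest` class, and the choice to ignore an author's approval of their own pull request are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field

REQUIRED_APPROVALS = 2                # "at least 2 developers"
PROTECTED_BRANCHES = {"production"}   # hypothetical branch name


@dataclass
class PullRequest:
    target_branch: str
    author: str
    approvers: set = field(default_factory=set)


def can_merge(pr: PullRequest) -> bool:
    """Return True if the pull request may be merged (and so trigger a deploy)."""
    # Unprotected branches, such as staging, stay open to everybody.
    if pr.target_branch not in PROTECTED_BRANCHES:
        return True
    # Assumption: an author's own approval does not count.
    independent = pr.approvers - {pr.author}
    return len(independent) >= REQUIRED_APPROVALS


if __name__ == "__main__":
    pr = PullRequest("production", "ana", {"ben", "cat"})
    print(can_merge(pr))
```

A pull request into a protected branch with only one independent approval is held back, while anything targeting staging merges freely.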
So how is it going?
Migrating our hosting infrastructure to more modern AWS services has, so far, been an extremely positive project, resulting not only in meeting our key objectives but also throwing up some helpful improvements.
As a matter of course, we update software to patch security issues, but we often cannot go further due to the constraints of other projects running on the same servers. By moving to a Docker-based setup, we’re free to upgrade each project to the “best” versions it can handle with little effort. In many cases, this gives us an automatic, free performance improvement.
Improved performance & security
Moving services off old EC2 classic instances to a new ECS cluster means that we’ve been able to make use of much newer Amazon instance types, in this case, the M5 range of instances with much better performance for the cost.
It also means we can move from the old Elastic Load Balancers (ELBs) to the newer, and far more powerful, Application Load Balancers (ALBs). These allow us to use routing rules to perform important 301 redirects, such as forcing SSL, before the request even reaches the web server, improving performance. More importantly, though, the move to ALBs provides an opportunity to add security in several ways.
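As a rough illustration of the "forcing SSL" redirect, the sketch below builds the redirect action in the shape that boto3's `elbv2` client expects for a listener's default actions. It only constructs the dictionary and makes no AWS call. The `#{host}`, `#{path}`, and `#{query}` placeholders are the ALB's own syntax for reusing parts of the incoming request.

```python
def https_redirect_action() -> dict:
    """Build an ALB listener action that 301-redirects HTTP requests to HTTPS."""
    return {
        "Type": "redirect",
        "RedirectConfig": {
            "Protocol": "HTTPS",
            "Port": "443",
            "Host": "#{host}",      # keep the requested host name
            "Path": "/#{path}",     # keep the requested path
            "Query": "#{query}",    # keep the query string
            "StatusCode": "HTTP_301",
        },
    }


if __name__ == "__main__":
    print(https_redirect_action()["RedirectConfig"]["StatusCode"])
```

Attached as the default action of a port 80 listener (for example via `create_listener(..., Protocol="HTTP", Port=80, DefaultActions=[https_redirect_action()])`), this answers plain HTTP requests at the load balancer, so they never reach the container.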
The AWS Certificate Manager (ACM) integration, for example, now means that provisioning and renewing SSL certificates is trivial and, more importantly, free. We can now offer SSL support for any client, old or new, that is able to add a small validation entry (a CNAME record) to their DNS records. Under the previous ELB-based service, we had to manually purchase, upload, and renew certificates. In addition, ELB only supports one certificate per load balancer, meaning every client with SSL would incur the cost of its own load balancer.
ALB brings integration with another excellent tool, the Web Application Firewall (WAF). WAF allows us to mitigate known issues such as SQL injection attacks by defining rules that run at the infrastructure level. Furthermore, we can subscribe to third-party rule sets that are updated regularly by outside security companies for a small monthly fee. We’ve taken the opportunity to subscribe to several rule sets including mitigation against known CMS CVEs, for example. This means that even if we’re unable to quickly update a CMS application due to a breaking change, there is a good chance that the issue can be blocked before it ever reaches the application.
We’re not done yet but we’ve learned a lot from the process. The biggest thing we’ve learned is that when properly planned and executed, a hosting infrastructure modernisation project can yield advantages beyond efficiency, security, and performance. The benefit of cross-training staff and taking the opportunity to fully automate deployments releases development staff to do what they do best.