The Commitments of a Great Cloud Platform

Great Platform Engineering - Part I

Nov 07, 2024

I love the cloud.
We’re very fortunate to be part of such an impactful ecosystem that is evolving and maturing so fast.

Yet a decade of mismanaged config, poor engineering practices, and misaligned priorities between business and engineering has left many organizations burdened by overly expensive, under-performing cloud setups.

In this blog series, starting with this post, I’d like to explore what makes a good cloud platform.

Let’s be realistic: most organizations won’t need Netflix-level scalability or a nonsensical 99.9999% availability.
Instead, let’s think about real-world systems, accept the trade-offs of each design, and avoid fixing things that aren’t broken.
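
To put that availability figure in perspective, here is a minimal sketch that converts an availability target into the downtime it allows per year. The function name and the 365-day year are my own assumptions for illustration, not something taken from a specific SLA.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a 365-day year


def downtime_budget_seconds(availability_percent: float) -> float:
    """Return the yearly downtime (in seconds) allowed by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Compare a few common targets with the "six nines" mentioned above.
    for target in (99.0, 99.9, 99.99, 99.9999):
        seconds = downtime_budget_seconds(target)
        print(f"{target}% -> {seconds / 3600:.2f} hours per year")
```

Under these assumptions, 99.9% allows roughly 8.8 hours of downtime a year, while 99.9999% allows about 32 seconds. Few real-world systems need, or can justify paying for, the latter.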

The three-body problem

1. The Obvious Public-Facing Responsibility

Ultimately, your cloud environment will host public-facing applications, which will serve actual users. Your reputation is at stake. We build the platform to maximize the quality of service provided to our users.

What is our commitment to these users?

The platform will ensure the app is up as much as possible.
In many cases, it means adapting to surges in traffic, making sure the underlying infrastructure is healthy, and enabling the app to automatically deal with most issues it may encounter.
High security standards are set for the platform and the services it runs.
Network security, data security, app security, and automated processes should be in place.
Vulnerabilities that could have been found before production are unacceptable.
The platform will fail - and that’s okay - but we will recover as rapidly as possible, and our architecture and processes ensure minimal data loss (if any).

Although these commitments sound like common sense, cloud platforms are often built to fulfill technical requirements first, ignoring the essential qualities of the system and the external end-user experience.

Anecdote #1

I used to work for a well-known organization in the UK. During a long and complex cloud modernization project, outages happened.
Regardless of how small the outage, or at what time it happened, the relentless local tabloid press always made sure everyone knew about it. No pressure :)

2. Platform Engineering is nothing without its people

A platform is built around its engineers. The first and last requirement should be to make life easier for the platform’s internal users and for the platform team.

A (very biased) belief of mine is:

The engineer’s time is your most valuable resource, protect it.

How do we put this concept at the center of our platform practices?

The platform is documentation-driven.
Documentation, technical or otherwise, is at the center of everything we do.
Clear and usable documentation makes everyone’s life easier.
Shift the engineers’ attention to hard problems; automate the rest.
Obligatory caveat: the cost of automating a task should be reasonable.
Maintainability is a core feature of the platform and its processes.
Operational responsibilities are split.
Processes are built to enable application owners (e.g., devs) to own the deployments and operations of their own services.

In other words, the platform must be built to be rebuilt, in most cases, one Terraform module at a time.

A former tech lead of mine used to compare a cloud platform and its platform team to the classic WW2 Jeep: efficient, versatile, and completely rebuildable in no time.

Anecdote #2

This Jeep pep talk came after our biggest incident to date: a devastating terraform destroy in production.
It took six panicked engineers five hours, on a Slack call, to untangle the mess of dependencies between half-destroyed resources. This led to a radical shift in how we thought about and designed our systems.

3. It’s not someone else’s money

Whilst we’d all love to spend our days building whatever we want, however we want, we’re always bound to some financial responsibility.

However, the famous “Good, fast, or cheap — choose two.” saying does not necessarily apply here. In fact, you could easily end up with a bad, slow, and expensive platform :-).

Joking aside, how do we account for the money factor while building and running a cloud platform?

Right-sizing isn’t a one-time effort; the platform should evolve as workloads and demands change.
The design carefully weighs the types of resources (hosted vs managed vs off-the-shelf vs homemade) used, and their cost.
Here again, keep in mind that the engineer’s time is the most valuable resource.
Relevant metrics are defined and tracked.
On a multi-tenant platform, for example, cost-per-tenant or cost-per-app is a common KPI.
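
As an illustration of the cost-per-tenant KPI, here is a minimal sketch that sums billing line items by a tenant tag. The record shape, the tenant names, and the figures are all hypothetical; a real platform would read this data from its cloud provider’s billing export.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class CostLineItem:
    """One billing line item; the fields are hypothetical, for illustration only."""
    service: str
    tenant: Optional[str]  # None when the resource carries no tenant tag
    cost_usd: float


def cost_per_tenant(items: List[CostLineItem]) -> Dict[str, float]:
    """Sum costs per tenant; untagged spend is grouped under 'untagged'."""
    totals: Dict[str, float] = defaultdict(float)
    for item in items:
        totals[item.tenant or "untagged"] += item.cost_usd
    return dict(totals)


if __name__ == "__main__":
    # Made-up sample data, not real billing figures.
    sample = [
        CostLineItem("compute", "acme", 120.0),
        CostLineItem("database", "acme", 80.0),
        CostLineItem("compute", "globex", 45.5),
        CostLineItem("storage", None, 10.0),
    ]
    for tenant, total in sorted(cost_per_tenant(sample).items()):
        print(f"{tenant}: ${total:.2f}")
```

Keeping the untagged bucket separate makes missing tags visible, since untagged spend cannot be attributed to any tenant.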

Anecdote #3

I once witnessed a startup COO shout at the engineering team about an $800k AWS bill. I investigated.
Was it hidden AWS costs? Compromised hosts mining Bitcoin?
No. As it turned out, the issue was individual dev environments, which, for some reason, all included EC2 and RDS instances running around the clock.
Of course, whilst this wasn’t a new issue, a recent hiring spree had exacerbated this mismanagement.

Let’s get building

Platform Engineering is a vast topic, and I could have listed many more essential concepts or best practices (e.g., monitoring). The underlying truth remains the same: a cloud platform is at its most impactful when it is built with the engineers in mind, for the quality of service delivered to end users, and in a financially sustainable way.

As this is an introductory post, I tried to stay away from the nitty-gritty tech stuff.
I am very much looking forward to getting my hands dirty with more technical topics in the future.

What I really want this newsletter to be is a way for us all to share our expertise and experience, leading to a more mature Platform Engineering ecosystem.

Let’s be honest with each other. Share your feedback. Raise any mistakes I made. Point me to great resources. Engage in conversations. Let’s be friends.

Opinionated Engineering
