How to make sense of Cloud Architectures

The advantage of using Cloud solutions is that we can use pre-built services out of the box without investing thousands of man-hours into developing stuff on our own.

You can replace entire teams of developers if you know how to tie together ready-to-use services instead of reinventing the wheel.

A Cloud Architect has to be aware of all the options available out there to solve the problem at hand. For those who are just getting into it: Google Cloud alone has over 200 products.

In this article I want to outline a few patterns that help me quickly make sense of:

  • what a service does,
  • how to compare it with alternatives,
  • how to tie it together with other services
  • and how to visualize the whole architecture in an easy-to-understand diagram that fits into your imagination yet doesn't hurt your brain

The good news is that Cloud Architectures are designed for ease of understanding, because at this point in our civilization's history we cannot yet create systems capable of full autonomy at a high level. And if something is designed by humans, for humans to maintain, then you can distinguish patterns and abstraction layers in it.

Cloud Architecture is all about patterns and abstraction layers. This is the way things are, and most likely the way things will be until the singularity, because Machine Learning based autonomy, or any autonomy for that matter, grows from low-level abstractions upwards. So in any foreseeable future there will be services to tie together and maintain, however advanced those services are. And a Cloud Architect's job is exactly about tying services together.

My name is Anton Kulikalov, I'm a professional Cloud Architect. Today I wanna share with you the most important patterns that help me understand Cloud Architectures no matter what I do. These patterns can be applied to GCP, AWS, Azure or any other platforms and stand-alone SaaS products.

Pattern number one: Everything is an I/O device... essentially.

Whenever I'm trying to learn a new SaaS, I start by figuring out what the Input is and what the Output is.

Take Google Cloud Storage as an example:

  • If you store data objects in it, then the input is a data object and the output is a status message.
  • If you retrieve data objects, then the input is a path to a file, and the output is a data object.

Even a straightforward service like Google Cloud Storage has plenty of input/output variations. But within a Cloud Architecture I design, each service does just one specific thing. So, from that perspective, it has only one kind of input and one kind of output.
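
To make the I/O view concrete, here is a minimal Python sketch. The ObjectStore class and its store/retrieve methods are made up for illustration; this is not the Google Cloud Storage client API, just the two input/output pairs from the list above written down as function signatures.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectStore:
    """Toy in-memory stand-in for an object storage service (hypothetical)."""

    objects: dict[str, bytes] = field(default_factory=dict)

    def store(self, path: str, data: bytes) -> str:
        # Input: a data object (and where to put it). Output: a status message.
        self.objects[path] = data
        return "OK"

    def retrieve(self, path: str) -> bytes:
        # Input: a path to a file. Output: a data object.
        return self.objects[path]


bucket = ObjectStore()
status = bucket.store("avatars/bob.png", b"\x89PNG...")
avatar = bucket.retrieve("avatars/bob.png")
print(status, len(avatar))
```

Once a service is reduced to signatures like these, comparing it with an alternative mostly means comparing inputs and outputs.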

Pattern number two. Every I/O device should either be a Stateless Compute device or a Storage.

Avoid hybrids, aka Stateful Compute devices. They instantly complicate things. For the most part, a stateful compute device is a sign of poor architecture. Sometimes we have to use them, but generally speaking it's better to avoid them: Stateful Compute devices are more complex to maintain and harder to scale than stateless ones.

There are some exceptions though. Caches and Search Indexes are hybrids, and they are quite useful. Cache devices help stateless compute devices temporarily remember the results of heavy compute operations to decrease their response time. Search Indexes help Storage devices search through their data, because an efficient way of storing data is not an efficient way of filtering it, and vice versa.
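
Here is a rough sketch of the cache case, assuming a single process and an in-memory dictionary. The names heavy_compute and Cache are hypothetical, and a real setup would more likely use a managed cache service, but the idea is the same: the compute device stays stateless, and the only state lives in a disposable copy of earlier results.

```python
import time


def heavy_compute(n: int) -> int:
    """Stateless compute device: the output depends only on the input."""
    time.sleep(0.01)  # stand-in for an expensive operation
    return sum(i * i for i in range(n))


class Cache:
    """Hybrid device: it holds state, but only copies of earlier results."""

    def __init__(self, compute, ttl_seconds: float = 60.0):
        self._compute = compute
        self._ttl = ttl_seconds
        self._entries = {}  # input -> (expires_at, output)
        self.misses = 0

    def get(self, key):
        now = time.monotonic()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: skip the heavy operation
        self.misses += 1
        value = self._compute(key)
        self._entries[key] = (now + self._ttl, value)
        return value


cached = Cache(heavy_compute)
first = cached.get(1000)   # computed
second = cached.get(1000)  # served from the cache
print(first == second, cached.misses)
```

If the cache is wiped, nothing is lost except response time, which is what makes this kind of hybrid tolerable.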

Third pattern. Every I/O device shares the same set of characteristics: Reliability, Scalability and Maintainability.

  1. Reliability. It's what could possibly go wrong with a given device:
    1. It could be Hardware faults. They are random and independent of each other. A hard disk crash, a power grid blackout - these are the most frequent causes of hardware faults. For instance, hard drives have a mean time to failure of about 10 to 50 years. So, in a fleet of around 10,000 disks, that adds up to approximately 1 failed disk per day.
    2. Software errors. The most important characteristic of software errors is that they are always systematic, which makes them very dangerous and very interesting to debug.
    3. And my favorite kind of reliability issue is Human error. The manual actions performed by people, not machines. Humans can be considered as I/O devices too. Stateful, unreliable and unable to scale. Humans have decent search indexes and poorly designed mimicking of a data Storage device in their heads. Don't trust them. They are unpredictable. The only benefit human I/O devices have is their initial development cost. Though, maintenance cost can still be quite steep, so, avoid using human I/O devices if possible.
  2. Alright, the next characteristic of any I/O device is Scalability. Can it process 10x the current input? How about 1000x? You can answer this question only if you know exactly what the "current input" is. Otherwise any assumptions about the device's scalability would be baseless. So, to be confident about the scalability of a given I/O device, you first have to observe a real load pattern, and then simulate 10x or 1000x that workload. Don't try to skip the "observe a real load pattern" step: you'll end up reasoning from wrong premises, and your conclusions will be wrong as a result.
  3. And the last characteristic of any I/O device is Maintainability. It conventionally splits into 3 sub-characteristics:
    1. Operability - how easy it is for the operations team to keep this device running smoothly
    2. Simplicity - how easy it is for new engineers to understand the device
    3. And Evolvability - how easy it is to make changes and expand the device
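
The hardware-fault estimate under Reliability is simple arithmetic, and it's worth being able to redo it for your own fleet. The sketch below assumes every disk fails independently at a constant rate of one failure per mean time to failure (MTTF). That is a simplification, and the function names are mine.

```python
DAYS_PER_YEAR = 365


def expected_failures_per_day(disks: int, mttf_years: float) -> float:
    """Average disk failures per day, assuming each disk fails independently
    at a constant rate of one failure per MTTF."""
    return disks / (mttf_years * DAYS_PER_YEAR)


def disks_for_one_failure_per_day(mttf_years: float) -> float:
    """Fleet size at which you should expect about one failure per day."""
    return mttf_years * DAYS_PER_YEAR


for mttf_years in (10, 50):
    fleet = disks_for_one_failure_per_day(mttf_years)
    print(f"MTTF {mttf_years} years: one failure per day at about {fleet:,.0f} disks")
```

Under these assumptions, a fleet of somewhere between a few thousand and roughly eighteen thousand disks already loses one disk per day on average.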

Let's recap what we have so far:

  • Everything is an I/O device
  • Every I/O device should be either a Stateless Compute device or a Storage
  • Every I/O device has the same set of characteristics: Reliability, Scalability and Maintainability.

With all of this in mind, let's design a real-world Cloud Architecture.

This is Bob. An irrational random I/O device, aka Human. He uses our client app. We'll leave this app's architecture behind the scenes for now and focus on our backend architecture.

First, we'll need a web server, accepting our client app requests as an input and outputting respective responses. It's a Compute device.

We'll probably need to save some information about Bob, like his name or email address. And since we don't want to suffer, we'll choose to leave our web server stateless and use, let's say, a NoSQL database as a Storage device for Bob's name and email. Maybe Bob wants to upload his picture too, so we need another Storage device for data objects.

Okay, good enough, at this point our architecture does the usual stuff - Creates, Reads, Updates and Deletes our users' data. At some point we'll notice that our database is getting slow, so we'll add a search index to it to speed up reads. Then our Compute device will start experiencing overload, so, let's add caching to it.
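
As a toy illustration of the Storage plus Search Index pair, here is an in-memory sketch. UserStore and its methods are hypothetical names, not any particular database's API. Records are stored by user id, which is cheap to write and fetch, while a second structure keyed by email makes lookups by email cheap. The price is that every write has to keep both in sync.

```python
class UserStore:
    """Storage device with a small search index bolted on."""

    def __init__(self):
        self._by_id = {}         # primary storage: user id -> record
        self._id_by_email = {}   # search index: email -> user id

    def create(self, user_id, name, email):
        self._by_id[user_id] = {"name": name, "email": email}
        self._id_by_email[email] = user_id

    def read(self, user_id):
        return self._by_id.get(user_id)

    def update_email(self, user_id, new_email):
        record = self._by_id[user_id]
        del self._id_by_email[record["email"]]  # keep the index in sync
        record["email"] = new_email
        self._id_by_email[new_email] = user_id

    def delete(self, user_id):
        record = self._by_id.pop(user_id)
        del self._id_by_email[record["email"]]

    def find_by_email(self, email):
        # One dictionary lookup instead of scanning every record.
        user_id = self._id_by_email.get(email)
        return None if user_id is None else self._by_id[user_id]


users = UserStore()
users.create(1, "Bob", "bob@example.com")
users.update_email(1, "bob@work.example.com")
print(users.find_by_email("bob@work.example.com"))
```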

So far it was pretty simple... and really fragile, since we have neither horizontal scaling nor backups in place.

Let's zoom into our Web Server and add scalability to it.

We'll add a couple more replicas of our Web Server, and we'll need to distribute the incoming requests between these replicas. So we need another Stateless Compute Device: a Load Balancer. It will sit in front of all of our replicas and distribute the load.
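
A load balancer can be sketched as a pure function from a request to a replica. The version below picks a replica by hashing a request id, which keeps the function stateless. The replica names and the pick_replica function are made up for illustration, and managed load balancers typically offer other strategies too, such as round robin or least connections.

```python
import zlib


def pick_replica(request_id: str, replicas: list[str]) -> str:
    """Stateless choice: the same request id always maps to the same replica."""
    index = zlib.crc32(request_id.encode("utf-8")) % len(replicas)
    return replicas[index]


REPLICAS = ["web-1", "web-2", "web-3"]

for request_id in ("req-1", "req-2", "req-3", "req-1"):
    print(request_id, "->", pick_replica(request_id, REPLICAS))
```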

I guess we want the scaling to be automatic, based on the percentage of CPU utilized. So, we need a Storage device that aggregates logs and, among other things, lets us monitor CPU usage. And another Stateless Compute Device will read those logs and decide whether we need more or fewer replicas.
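
The scaling decision itself can be a tiny stateless function. The sketch below uses one common proportional rule (the Kubernetes Horizontal Pod Autoscaler documents the same formula): scale the replica count by the ratio of observed to target CPU utilization, then clamp it. The 60% target and the replica limits are placeholder values, not recommendations.

```python
import math


def desired_replicas(
    current_replicas: int,
    avg_cpu_percent: float,
    target_cpu_percent: float = 60,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Stateless decision: same metrics in, same replica count out."""
    wanted = math.ceil(current_replicas * avg_cpu_percent / target_cpu_percent)
    # Never scale below the floor or above the ceiling.
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(3, 90))  # overloaded: scale out
print(desired_replicas(4, 30))  # mostly idle: scale in
```

Note that the function only reads metrics and returns a number. The state (the logs) lives in the Storage device, and actually starting or stopping replicas is somebody else's job.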

Another critical part of any architecture is a health check that restarts or replaces the instances that appear to be unhealthy. This device can either be based on logs, or it can just make direct requests to our replicas and restart them if they fail to respond with status 200.
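
Here is a minimal sketch of the direct-request flavor. The probe and restart callables are injected, so the example runs without a network. In a real setup the probe would be an HTTP request to a health endpoint and the restart would be a call to your platform, both of which are assumptions outside this sketch.

```python
from typing import Callable, Iterable


def check_replicas(
    replicas: Iterable[str],
    probe: Callable[[str], int],
    restart: Callable[[str], None],
) -> list[str]:
    """Probe every replica once and restart those that don't answer 200."""
    restarted = []
    for replica in replicas:
        try:
            status = probe(replica)
        except Exception:
            status = None  # no answer at all counts as unhealthy
        if status != 200:
            restart(replica)
            restarted.append(replica)
    return restarted


# Fake fleet for demonstration: web-2 is unhealthy.
statuses = {"web-1": 200, "web-2": 503, "web-3": 200}


def fake_probe(replica: str) -> int:
    return statuses[replica]


def fake_restart(replica: str) -> None:
    statuses[replica] = 200  # pretend the restart fixed it


restarted = check_replicas(list(statuses), fake_probe, fake_restart)
print(restarted)
```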

What we just did is create a layer of abstraction: in our high-level architecture we still have the same Stateless Compute I/O Device, even though it now has its own architecture inside of it.

This brings us to the last pattern: Abstraction layers are everywhere.

If this were obvious to everyone, we wouldn't see architectures or codebases that look like spaghetti. So, where were we...

From the high-level architecture perspective our Web Server is a single Stateless Compute Device, because the same input will produce the same output. But when you zoom into it, you see a completely different picture. Because it's a different Abstraction layer.

With this knowledge in mind you can figure out what other architectural solutions would look like. And if you have a high-level understanding of these things, then it is much easier to figure out which services to use, no matter which platform you are on and no matter what those services are called.

If you enjoyed reading this, consider sharing it with the communities you are part of on Reddit, Slack, Discord or somewhere else. This would help me a lot! See you next time!
