Abstractions and serverless

Adam Warski
SoftwareMill Tech Blog
12 min read · Mar 19, 2020


IT systems are inherently hard to comprehend. That’s why we divide them into smaller pieces: if the division is done properly, instead of understanding everything about a system, we only need to understand each piece in isolation, plus how the pieces are composed. Performing the division “properly” is, of course, far from trivial; it is the crux of the problem, and one we’ve been studying for quite some time.

[Image: Abstraction by Zofia Warska]

This division is the process of creating abstractions. The role of an abstraction is to hide complexity and unnecessary details. Depending on context, abstractions have different names: APIs, systems, applications, microservices, modules, classes, functions, syscalls. These are all manifestations of the same goal.

However, we must be cautious not to confuse abstraction with generality. As Kevlin Henney defines it:

It is easy to mix up the notion of generality with the notion of abstraction — something I know I’ve done at times! Although they sometimes go hand in hand, they are not the same concept. Generalisation seeks to identify and define features that are common to different specialisations and possibilities; abstraction seeks to remove details that are not necessary.

As the system that we are developing grows larger, we need to introduce new layers of abstraction so that it’s still comprehensible. With each new layer, we gain a new level at which we can look at the system. It’s the connections between the components at a given layer of abstraction that become important, not the implementation details of each component.

Hence, the total size of the system that is comprehensible grows exponentially with the number of abstraction layers: each component from an abstraction layer can hide a number of sub-components, each sub-component a number of sub-sub-components etc. That’s a very important property, allowing us to create really complex systems, and still have at least some degree of control over them!
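To put a number on this: if each component hides, say, ten sub-components, then five layers of abstraction already describe a system of 10^5 = 100,000 elementary pieces, while at any single layer we still only deal with about ten elements at a time.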

We might start with a small project that’s just a bunch of classes; as things grow, we divide them into modules, each of which becomes semi-independent. Some time later, we might split our monolithic (but multi-module) application into a number of microservices. And at all times, our system doesn’t live in a vacuum: it communicates with other systems, each of which has its own layers of abstraction hidden from the public: microservices, modules, classes, etc.

When viewed from the outside, an abstraction forms an indivisible whole. We treat it as atomic: that’s the whole point of an abstraction; we don’t want to be concerned with its internals.

Where to divide?

As always, nothing comes for free: breaking down a system is not an easy task. It’s a constant struggle to pick the right size and the proper division lines. Bad abstractions do more harm than good; good abstractions are rare, and usually emerge only after suffering through a number of bad ones.

There are some undeniable abstraction success stories, though. Hiding lower-level concerns, such as machine code, assembly, networking, device access or kernel threading, is mostly a solved problem that serves us well every day. We rarely have to look beneath these abstractions; we use them as black boxes, trusting them to do their job.

When writing applications, we are not concerned with what’s beneath TCP/IP, how to directly access the HDD, or how threads are scheduled and contexts switched.

However, the higher we go in the hierarchy of abstractions, and the closer we get to the business domain, the more blurry and problematic this division becomes. There are many reasons for this. First of all, the problems we are dealing with stop being purely technical, and start being more business- and domain-oriented. It’s no longer just 0s and 1s, but complex processes in which humans are involved.

Secondly, the problems become more specialised. Networking, low-level machine code and threading are very general concepts, which are easily reused. The closer we get to the core problem domain, the more problem-specific our abstractions become, often created for a single project only.

Finally, as a consequence of this specialisation, we have less experience and fewer data points on where the “correct” abstraction boundary lies. With networking, we have lots of experience and plenty of use cases; with project-specific abstractions, this stops being the case.

Serverless

Good abstractions are good, bad abstractions are bad, but … what does it all have to do with serverless? More than you might initially think!

The so-called “serverless” approach allows us to deploy standalone functions (also known as lambdas), which are run on-demand on a provider’s infrastructure. The most well-known providers include AWS, Google Cloud and Azure. We only have to pay for the resources used by the function executions; no upfront allocation or fees are needed. And most importantly, we don’t have to provision and maintain servers.
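To make this concrete: such a function can be as small as a single class. Below is a minimal sketch of a JVM-based AWS Lambda handler, using the aws-lambda-java-core interface; the class name and input shape are illustrative:

```scala
import com.amazonaws.services.lambda.runtime.{Context, RequestHandler}

// A minimal "standalone function": the provider instantiates the class and
// invokes handleRequest on demand; there's no server code anywhere in sight.
class GreetHandler extends RequestHandler[java.util.Map[String, String], String] {
  override def handleRequest(input: java.util.Map[String, String], context: Context): String = {
    val name = Option(input.get("name")).getOrElse("world")
    s"Hello, $name!"
  }
}
```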

When creating a “serverless”-based system (the quotes are intentional: the servers are still there, they are just managed by somebody else!), we need not only to deploy the functions, but also to describe how they should be triggered. This is typically done through the cloud provider’s APIs, or through a GUI which visualises the current setup and allows modifying it. Triggers include API calls, database events, messages sent to a queue, and so on.

Developing a serverless system, we’ll quickly discover that we want to group our functions into applications / services / microservices. That’s exactly the same process of creating abstractions as described before! Serverless is not a panacea for the complexity of IT systems: quite the contrary; it might be the source of even greater complexity.

As soon as we try to group our functions into services, we’ll want to do the same with the wiring: the descriptions of how our functions should be triggered, what data sources they should read from and write to, and what other systems they need to access.

Summing up, a service in the serverless world consists of two parts: the functions, which constitute the service and define its logic; and the wiring, which describes how the functions interact with the external world, that is, what the function triggers are.

Abstractions are atomic

As mentioned before, we usually want to look at our abstractions as atomic, indivisible units. We want them to be “black boxes” which perform some kind of functionality. This applies both to services deployed “traditionally” and to “serverless” services.

In the traditional way, when deploying a service, we can usually define a deployment unit, such as a Docker image, a .jar file, or an executable. In order to treat “serverless” services as proper abstractions, we need to define the same thing: an indivisible unit that constitutes our service.

This rules out using the GUI to describe the wiring of our serverless functions and the triggers. We don’t want the knowledge of how our serverless service interacts with the external world to be spread across a number of services that our cloud provider offers.

But this is what happens now; for example, when using AWS, the functions might be defined in Lambda; the API in API Gateway; user management in Cognito; some metadata might be stored on S3; and so on. When using the web interface, it’s almost impossible to gain an understanding of how a service is implemented!

While not covered in detail here, the process of testing, continuous integration and delivery, and setting up multiple environments and deployments is equally important in both serverless and serverful projects. With serverless, things get a bit more complicated, as we have to test everything against a “live” setup, created in the managed environment.

Recovering the abstraction

It’s no surprise that people try to recreate the abstraction, so that a single serverless service can be easily comprehended and reproduced.

The most popular (but not the only!) project in this area is serverless.com; using the open-source variant, you create a serverless.yml descriptor, which fully describes the whole service. It contains details on the provider (as serverless.com can deploy services to Amazon, Google and Azure), a list of functions to deploy, their triggers, and other resources that need to be created (such as database tables).
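To give a flavour, a minimal descriptor might look roughly like the sketch below; the service and function names are illustrative, and the exact schema is defined by the framework’s documentation:

```yaml
service: user-service

provider:
  name: aws
  runtime: java11

functions:
  createUser:
    handler: example.CreateUserHandler
    events:
      - http:
          path: users
          method: post
```

Everything the service needs, from the runtime to the HTTP trigger, sits in this one file.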

[Image: Abstraction by Franciszek Warski]

Such a service can be deployed to the provider’s infrastructure using a command-line tool, which makes it easy both to re-create a service from scratch and to update an existing installation. This way, we can quite easily create the same service in multiple environments, such as production, staging and test.

Great! Problem solved. We have our abstraction layer back. Once again, our service is clearly delimited and fully described through serverless.yml. Each function can now be treated as a lower-level abstraction layer, and we can combine multiple such serverless services to form a larger system.

Back to square one

However … doesn’t this look familiar? Is this any different from defining a “traditional” deployment unit, such as a Docker image or a .jar file? After all, all of our code is in there, along with a description of how to wire the functions to external resources.

The main difference is that instead of packaging the deployment unit on a CI server, we run the serverless (or equivalent) tool. Which, if you take a closer look, is really an interpreter from YAML to the “AWS VM”.

What is the AWS VM? (Don’t worry, it’s not another AWS component, just a different perspective on the existing ones.) AWS (or GCP, or Azure) offers a number of services, ranging from running code (Lambda), through databases and storage (DynamoDB, RDS, S3), to complete feature implementations (Cognito). These services can be programmed through the GUI, the CLI, or API calls. What started as an elastic way of provisioning servers ended up as a fully programmable environment.
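To illustrate, here is a sketch of issuing one such “instruction” to this virtual machine through the AWS SDK for Java v2, rather than clicking through the GUI; the function name, handler class, role ARN and artifact path are all placeholders:

```scala
import java.nio.file.{Files, Paths}
import software.amazon.awssdk.core.SdkBytes
import software.amazon.awssdk.services.lambda.LambdaClient
import software.amazon.awssdk.services.lambda.model.{CreateFunctionRequest, FunctionCode, Runtime}

object DeployFunction extends App {
  // The same wiring a GUI click would perform, expressed as an API call:
  // upload a zipped artifact and register it as a function.
  val lambda = LambdaClient.create()
  lambda.createFunction(
    CreateFunctionRequest.builder()
      .functionName("createUser")
      .runtime(Runtime.JAVA11)
      .handler("example.CreateUserHandler") // placeholder handler class
      .role("arn:aws:iam::123456789012:role/lambda-exec") // placeholder role ARN
      .code(FunctionCode.builder()
        .zipFile(SdkBytes.fromByteArray(Files.readAllBytes(Paths.get("target/service.zip"))))
        .build())
      .build()
  )
}
```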

That way, when configuring AWS/GCP/Azure, we are in fact programming the induced “virtual machine”, defined by the services they offer.

It shouldn’t be a surprise that we are creating a deployment unit for our “serverless” service. As argued in the first section, it’s natural that in order to comprehend a large system, we want to split it into smaller, comprehensible pieces. Each such piece forms a whole; it’s much easier to understand a single piece when there’s a central place where it’s entirely described.

Taking such a perspective, the benefits of serverless are not that clear. We do get some savings from the fact that we only pay for the capacity we use, and that some concerns, such as auto-scaling, are handled for us. However, we take on a lot of additional complexity when it comes to development, testing, and describing the entire service.

It’s worth noting that the cost benefits of serverless also shouldn’t be taken for granted. If your service has a consistent load, it will probably be cheaper to go with a traditionally deployed service. Only when you experience traffic spikes, or run a very low-volume, occasionally used service, might serverless turn out cheaper. Always do the math before deciding on a solution based on cost!
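As a rough, illustrative back-of-the-envelope (using 2020-era AWS list prices, which do change): 10 million invocations a month, each running for 100 ms with 512 MB of memory, cost about $2 for the requests plus about $8 for the resulting 500,000 GB-seconds of compute, so around $10 in total. A small always-on instance costs a similar amount, but its price stays flat as load grows, whereas the serverless bill scales with every additional invocation.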

Deployment descriptors

Let’s take a look at the serverless.yml deployment descriptor. Doesn’t the name ring a bell? We’re repeating mistakes from the past, ones that I have experienced during my professional career, and I’m not even that old!

Re-discovering things already discovered in the 60s and 70s is a common phenomenon; take the recent rise in popularity of functional programming. Here, however, we don’t need to look that far: it’s enough to go back to the relatively recent era of application servers and XML.

Not so long ago, it was considered “best practice” to deploy applications into WebLogic or the JBoss Application Server. This involved writing a lot of XML which described how our application interacts with the outside world: what kind of databases it accesses; what kind of APIs it exposes; how it can be triggered through message queues. Everything was supposed to be declarative, portable, and separated from the business logic.

[Image: Abstraction by Stanisław Warski]

Slowly, people realised that this was a dead end. In the Java world, XML was first partially replaced by annotations, but this improved the situation only a bit. It took some more time to almost entirely abandon the idea of application servers, in favour of self-contained applications. Once again, nowadays the code usually gets packaged as an executable, or as a Docker container.

It turns out that descriptors are too limiting. They seem attractive at first: they are declarative; they allow fast bootstrapping; they cleanly separate concerns. However, convenience is the root of much evil. Once we start doing anything that’s not directly supported by descriptors, we need to start hacking around. And this might get ugly!

We’ve re-discovered that when applications handle their own wiring and manage the resources they need to use, the code turns out cleaner, lighter and more maintainable.

The same reasoning applies to serverless and serverless service descriptors. Just as with application servers, we’ll once again discover that we need the flexibility of a regular programming language to define the wiring of our service.

YAML

An important point to note is that the description of how a service is wired, how its functions interact, what the triggers are, etc., can (and possibly should!) be done declaratively. However, it’s crucial how this description is done.

YAML (or XML, or JSON) is a particularly bad choice, which is, sadly, very popular in a number of modern devops tools. YAML is not a programming language, but a way to represent a tree-like data structure. And that’s it!

Tree-like data structures are very limited, yet for some reason we try to coerce the complex descriptions of how our services are deployed, packaged and wired into this extremely constrained form of expression. We go as far as defining entire cluster configurations in YAML!

It seems that we are throwing away decades of CS research on programming languages. In YAML, we no longer have a way to define even the simplest abstraction. Even the absolutely basic operation of extracting common code (common configuration) into a constant requires special application-level support. At the same time, we perform crazy tricks to embed scripting, templating and conditional logic into our configuration files.

We’ve already found out that XML is not the way to go. It’s the same with YAML, if not worse. These formats are simply not flexible enough to efficiently describe any non-trivial configuration of a service.

We need to do better. Good programming practices, starting with DRY, cannot be forgotten. Note that we are not restricted to general-purpose languages in our search. Maybe a domain-specific, non-Turing-complete language such as Dhall can be a good middle-ground? Or maybe functional programming languages will have an important role to play here? They usually make it very ergonomic to define and manipulate immutable data structures.
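As a sketch of what this could look like in a general-purpose functional language (the types and fields below are made up for illustration; they are not any real deployment API):

```scala
// Wiring described as plain immutable data: common settings are extracted
// into an ordinary value, the basic DRY step that plain YAML cannot express.
case class FunctionConfig(
  handler: String,
  memoryMb: Int,
  timeoutSec: Int,
  triggers: List[String]
)

object ServiceWiring {
  val defaults = FunctionConfig(handler = "", memoryMb = 256, timeoutSec = 10, triggers = Nil)

  val functions = List(
    defaults.copy(handler = "users.Create", triggers = List("POST /users")),
    defaults.copy(handler = "users.Get", triggers = List("GET /users/{id}"))
  )
}
```

Constants, functions and types give us back the abstraction tools that YAML takes away, while the result is still just data.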

Work ahead

As it currently stands, using “serverless” for anything beyond a trivial service is a high-risk, complexity-inducing endeavour. I would think twice before using these offerings in their current form.

However, that’s not to say that the idea of serverless is flawed. Quite the contrary: the idea is very good! We just need a better implementation.

As we noted before, what we are doing with “serverless” is programming the AWS / Google Cloud / Azure “virtual machine”. This programming deserves a decent programming language. Only then will we be able to properly define the abstractions that our serverless services form.

Hence, for now, the best solution will quite often be to stick to the “serverful” model. Either way, there’s a lot of work ahead. We do want the features that serverless offers: pay-as-you-go; auto-scaling; security; out-of-the-box aggregated logging & metrics; and quick deployment. There are also a number of things serverless currently lacks which we definitely want to retain from the “serverful” setup: local testing; build reproducibility; environment reproducibility; centralised configuration; and clear service delineation.

What can the future bring? A serverless stack that is both locally runnable and managed, programmable through a modern programming language instead of YAML/XML? Or maybe “traditional” setups will gain the benefits of serverless, through an evolved k8s-like orchestrator, with application sidecars, easily deployable cluster-wide logging/metrics systems and intelligent scaling, combined with leaner packages and deployment pipelines? Time will tell!

For the time being, it’s good to keep in mind that the main challenge to be solved in many IT systems is not rapid bootstrapping, but creating understandable, maintainable and evolvable systems with clear abstraction boundaries.


Software engineer, Functional Programming and Scala enthusiast, SoftwareMill co-founder