Serverless is taking the world by storm. But is it? Performance limitations exclude serverless from many use cases, ranging from simple web APIs to cutting-edge AI. In this post we start by explaining latency and cold starts, exploring the underlying architecture, and examining what makes serverless slow. We then discuss the implications of these performance limitations for the applicability of serverless to various use cases and emerging coding patterns. Finally, we introduce Binaris, a new cloud serverless platform, now in beta, that breaks all performance barriers and lets developers build entire applications out of serverless functions.
Serverless platforms, like AWS Lambda and Azure Functions, let developers build applications out of functions. A developer writes the code for a function, uploads it to her serverless provider and gets a handle in return. Usually this handle is a simple URL. Whenever that URL is hit, the function gets executed. If the URL is hit once an hour, the function runs once an hour. If it’s hit a million times per second, the function runs a million times per second. The developer never has to worry about provisioning servers or routing network connections. This responsibility is offloaded to the serverless platform provider. This, of course, has a very positive impact on developer productivity and overall happiness.
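To make this concrete, here is a minimal sketch of what such a function looks like, using the AWS Lambda handler convention in Python. The event shape and the greeting logic are illustrative, not taken from any particular application:

```python
import json

def handler(event, context):
    # The platform calls this entry point every time the function's URL
    # is hit; `event` carries the request payload.
    name = (event.get("queryStringParameters") or {}).get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}"}),
    }
```

The developer uploads only this code; routing requests to it, and scaling it from one invocation an hour to a million a second, is the platform's job.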
Invocation latency is the time elapsed between the caller sending an invocation request and the response arriving back to the caller. In order to better understand latency, we start by breaking the invocation process into steps:
- A request originates from the invoker’s device, be it a laptop, phone, smart car or any other device. This generates a message or a series of messages that are sent over the Internet to the cloud. If the invoker is another cloud service, this step is naturally skipped.
- The network in the cloud routes the message to the serverless platform, which is typically one of many services available in the cloud.
- The serverless platform routes the message to the right server which has the code for the function loaded (more on that later).
- The function executes and generates a response.
The response is then routed back along the same path to the invoker. This is why we always refer to round-trip latencies. This process is illustrated below:
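From the caller's side, all of these steps collapse into one number: the time from sending the request to receiving the response. The sketch below times such round trips; the stub stands in for a real invocation (a real benchmark would issue an HTTPS request to the function's URL):

```python
import time

def invoke_stub():
    # Stand-in for a real network invocation; the sleep simulates the
    # round trip through the network and the serverless platform.
    time.sleep(0.005)
    return "ok"

def measure_round_trip(invoke, samples=10):
    """Return per-invocation round-trip latencies in milliseconds."""
    latencies = []
    for _ in range(samples):
        start = time.perf_counter()
        invoke()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies

latencies = measure_round_trip(invoke_stub)
print(f"median round trip: {sorted(latencies)[len(latencies) // 2]:.1f}ms")
```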
As we’ve seen above, the path for invoking a function goes through multiple routers and servers before a function starts executing. The more network hops a request or response has to go through, the more susceptible they are to latency variations. Any number of reasons can slow down the network. Messages might have to wait in queues if the network is congested or might be assigned different routes at different times according to varying network conditions. A message might be lost and need to be retransmitted. Finally, a server might fail altogether and messages might get stuck waiting for a new server to come online.
For these reasons (and others) invocation latencies vary over time, even within the span of a single second. Different invocations or different users will typically experience different latencies, even for consecutive invocations. The more complex the system, the larger the variance. This makes average latencies misleading. Instead, we speak about latencies in terms of percentiles.
A casual user will typically experience the median latency. On the other hand, heavy users (which are the ones we often care about most) will generate many more invocations and thus experience both the good and bad latencies. For those users we want to optimize our systems to respond well even at the 95th or 99th percentile latencies. In large distributed systems this is very difficult.
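The sketch below shows why the mean hides exactly the latencies heavy users feel. It uses only the Python standard library, on simulated samples with a small but heavy tail:

```python
import random
import statistics

# Simulated latency samples (ms): 95% fast, 5% in a heavy tail.
random.seed(1)
samples = [random.gauss(20, 3) for _ in range(950)] + \
          [random.uniform(100, 1000) for _ in range(50)]

# statistics.quantiles with n=100 yields the 1st..99th percentile cut points.
cuts = statistics.quantiles(samples, n=100)
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
print(f"p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
print(f"mean={statistics.mean(samples):.0f}ms  # the mean hides the tail")
```

The median stays near 20ms while the 99th percentile lands deep in the tail, which is exactly what a heavy user hitting the service thousands of times will notice.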
Show me the numbers
Let us follow the invocation path and put actual latency numbers to each step.
Step 1: Accessing the cloud
This step is largely dependent on geography. While worldwide latencies could easily surpass 100ms, the situation in the continental US is much better. Internet-to-cloud latencies range from 30-35ms in rural areas to 3-4ms in metropolitan centers. Cellular networks today could add a significant overhead of about 50ms, but upcoming 5G networks will reduce this overhead significantly, to 1-2ms. US-centric apps can reasonably expect most users to be around 10ms away from the cloud. That’s pretty fast.
Step 2: Cloud networking
In order to operate at scale, the cloud itself has a pretty complex network topology. At the top level, it is broken into regions. Amazon has 20 regions worldwide and already announced plans to deploy 4 more. Internally, each region consists of 2-6 Availability Zones, each consisting of multiple data centers. Those data centers are pretty large, hosting about 100,000 servers each, so they have multiple layers of network internally.
Cloud network latencies vary between cloud providers and breaking those down is beyond the scope of this post. In general, latencies within a single region range between 5 and 10ms.
Step 3: Serverless platform latency
In order to invoke your function, the serverless platform needs to locate a server with enough resources to run your code. Typically, it will look for a server that already has your code loaded and initialized, usually inside a container. This process can take a while. The graph below shows AWS Lambda latencies as benchmarked from another AWS server in the same region. We tested Lambda directly and also with API Gateway, an AWS service required to provide a client facing HTTPS endpoint for Lambda functions. We show results for median (50%) latencies, 90%, 95%, 99% and maximum latencies.
Step 4: Function run time
Naturally, this part varies greatly between different functions and different use cases.
Putting these numbers together, we are looking at over 100ms to invoke an AWS Lambda function (latencies are similar for other platforms). For heavily computational functions that run for seconds or minutes, an invocation overhead of 100ms is negligible. However, responsive functions implementing web or mobile backends or any other real-time service, typically run for 1-10ms. In these cases, a 100ms overhead is simply unacceptable. This overhead precludes serverless functions from many use cases, and that’s before we even mention cold starts.
The latencies mentioned above can be referred to as warm latencies. When measuring, we made sure that the serverless platform had been “warmed up” and had enough time to allocate and initialize the resources required to run your function. This is not always the case. Since serverless functions are invoked on demand and since users are only charged when their functions actually execute, a serverless platform cannot afford to allocate resources for all functions. Actually, such platforms are designed to support a very long tail of functions, most of which lie dormant most of the time.
When a serverless platform needs to invoke a function that’s been dormant for a while, we hit a cold start. The platform needs to find a server with enough capacity to run the function, load the function’s code onto that server, initialize that code and set up routing so that future invocations can reuse this server to invoke the same function. This is a complex process and in most serverless platforms takes about 1 second to complete.
Cold starts happen when functions are scaled from zero to one, i.e. when a function is invoked that hasn’t been used at all for a while. They also happen when functions scale from n to n+m, i.e. when more and more users begin to invoke the same function at the same time. In some cases, we’ve seen functions burst from zero to thousands of concurrent invocations inside a mere second.
Whenever such scaling occurs, a function hits a cold start and can suffer an additional second or more of invocation overhead. In many cases this is strictly unacceptable. In any case this introduces a level of unpredictability that complicates the design of distributed multi-service applications.
It is worth noting that workarounds exist for cold-start latencies. One could set up a service that invokes their functions periodically to make sure the serverless platform always keeps warm resources for those functions. Beyond the added cost, this approach defeats the purpose of serverless, as it requires users to set up and maintain those services. It also defeats the purpose of auto-scaling, as it requires such services to predict load variations and warm up enough functions ahead of time.
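Such a workaround typically looks like the sketch below: a loop pings each function on a fixed interval so the platform keeps warm capacity. The `invoke` callable, interval, and warm-up payload are all illustrative; in production this loop would be a cron job or a scheduled cloud event hitting the function's real URL:

```python
import time

def keep_warm(invoke, interval_s, rounds):
    """Ping a function every `interval_s` seconds for `rounds` iterations.

    The function itself must recognize the warm-up payload and return
    immediately, so the pings stay cheap.
    """
    for _ in range(rounds):
        invoke({"warmup": True})
        time.sleep(interval_s)

calls = []
keep_warm(lambda event: calls.append(event), interval_s=0.01, rounds=3)
print(f"sent {len(calls)} warm-up pings")
```

Note what this buys and what it costs: the pings keep one warm instance per function, but they do nothing for a burst that needs hundreds of instances at once.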
Performance limits use cases
The only reason anyone should ever care about performance is use cases. Performance in and of itself is useless. Why should I care if my car can go 300 mph if the freeway speed limit is 65? I really do care, though, that my bike can only do 20 mph, and would never use it to go on a freeway. Functions are much like cars (or bikes) in that sense. If they are too slow to invoke, we can only use them for a limited set of use cases.
Naturally, each application has its own unique characteristics and performance requirements, and your mileage may vary. We can, however, outline the general latency requirements for different use cases:
Unfortunately, the performance of serverless platforms today, in their warm state but especially considering cold starts, excludes them from most application use cases. The benefits of serverless are reserved mostly for DevOps, automation and ETL use cases.
Performance limits design patterns
Serverless not only manages infrastructure, it also lets developers break their applications into functions. We can think about functions as single purpose services. Functions are coded, tested and deployed independently, which makes them faster to build and easier to debug. Functions allow multiple developers to work in parallel on different parts of an application and simplify the process of sharing and reusing existing, tested code.
Building an application out of many functions opens the door for new design patterns that can better leverage the scale of the cloud:
Chaining is when functions are invoked sequentially, one after another. The result of one function is often used as input for the next.
Fan-out is when a function triggers parallel invocation of other functions. This is useful for data analysis as well as for running massive computations in parallel.
Fan-in is when outputs from multiple functions are combined by a single function to produce an overall result.
Recursion is when a function repeatedly invokes itself (or different instances of itself) until a terminating condition is met.
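These patterns are easiest to see in code. The sketch below simulates fan-out and fan-in with a thread pool standing in for parallel invocations; in a real application, each `square` call would be an independent serverless function invocation:

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    # Stand-in for a worker function; in a real fan-out, each call
    # would run as its own serverless invocation.
    return n * n

def fan_out_fan_in(items):
    # Fan-out: invoke one worker per item, in parallel.
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(square, items))
    # Fan-in: a single combining step produces the overall result.
    return sum(results)

print(fan_out_fan_in(range(1, 5)))  # 1 + 4 + 9 + 16
```

Chaining would simply feed the result of one such step into the next; recursion would have the combining step re-invoke the whole pipeline until a terminating condition is met.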
These patterns can be combined to form more complex designs. For example, the common data processing paradigm called MapReduce is simply a composition of the fan-out and fan-in patterns. The following illustration shows a potential architecture for a REST API, combining multiple functions and a parallelized list implementation using the fan-out pattern:
The power of functions and function composition is lost if latency is too high or unpredictable. For an application to execute in reasonable time, function invocation latencies have to be kept low, as they accumulate with every function added to the invocation chain. Moreover, it is critical for latency to be predictable, i.e. kept low even at high percentiles, as the odds of hitting at least one slow invocation grow quickly with the depth of the invocation chain.
With existing invocation latencies in the tens or hundreds of milliseconds, function composition is simply impractical. Indeed, leading serverless providers recommend against invoking one serverless function from another. Latency is the show-stopper for them.
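The compounding of tail latencies is easy to quantify. If each invocation independently has a 1% chance of landing above its 99th percentile, a chain of n calls avoids every slow invocation only with probability 0.99^n, a simplified model assuming independent invocations:

```python
def prob_slow_hit(depth, p_slow=0.01):
    """Probability that at least one invocation in a chain of `depth`
    calls is slow, given each call is slow with probability `p_slow`
    (independence assumed)."""
    return 1.0 - (1.0 - p_slow) ** depth

for depth in (1, 10, 50):
    print(f"chain depth {depth:2d}: "
          f"{prob_slow_hit(depth):.1%} chance of hitting a slow call")
```

Even a modest chain of ten functions gives roughly a one-in-ten chance of hitting a p99 outlier somewhere along the way, which is why both the median and the high percentiles have to stay low for composition to work.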
Binaris sets out to tackle latency
We founded Binaris in 2016 because we deeply believe in the power of serverless and the power of functions. We envision a future where applications are built entirely out of serverless functions.
To enable this, we started out by focusing on invocation latency. We built a cloud serverless platform that fires up functions in single digit milliseconds. Binaris is almost 40x faster than AWS Lambda (and similar serverless platforms) at the 99th percentile, and also faster than most self-managed container or instance based services:
We have designed Binaris from the ground up with functions in mind. We have implemented a function optimized container provisioning system, built our own instance provisioning layer and manage networking and load balancing using latency optimized algorithms. As a result, we have no cold starts and can provide predictable low latency, even for extremely bursty workloads.
Low latency is hard to achieve in cloud scale distributed systems. It is also critical that latencies are low enough to support your use case. If a system is too slow for your use case, you simply can’t use it.
Serverless, while providing many advantages in operations and developer productivity, is too slow today to support anything but DevOps, automation and ETL use cases. It is also too slow to allow multiple serverless functions to be composed together, limiting the architectural freedom of developers and precluding sharing and reuse of code.
Binaris is a serverless platform designed to support responsive and interactive use cases and provide developers with the freedom to build entire applications out of serverless functions. You can read more and sign up for free at www.binaris.com.