Many companies, including one of my previous employers, Yubl, were able to realise huge savings by moving to serverless.
AWS Lambda itself is really cheap, but there are other less obvious costs that can quietly creep up on you. And if you’re operating a service that has to run at scale constantly then a Lambda-based architecture would likely cost many times more than a service running in containers or VMs.
In this post let’s take a look at common pitfalls when it comes to understanding the cost of AWS Lambda.
100ms charging blocks
With AWS Lambda, you are charged for the invocation requests as well as the duration of the invocations. If your function is not used then you don’t pay for it, hence the motto of “don’t pay for idle”.
When your function runs, an invocation that runs for 2 seconds would cost twice as much as an invocation that runs for 1 second. But there’s a small caveat - durations are charged in 100ms blocks.
An invocation that runs for 50ms is rounded up to 100ms, so it costs exactly as much as an invocation that runs for only 25ms.
This also means that there is no cost benefit in optimizing a function whose average execution time is already below 100ms.
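To make the rounding concrete, here is a minimal sketch of the rule as described above. The function name is my own, and it models nothing beyond the 100ms blocks.

```python
import math

BILLING_BLOCK_MS = 100  # the charging block size described above


def billed_duration_ms(duration_ms: float) -> int:
    """Round an invocation's duration up to the next 100ms block."""
    return math.ceil(duration_ms / BILLING_BLOCK_MS) * BILLING_BLOCK_MS


for actual in (25, 50, 100, 110):
    print(f"a {actual}ms invocation is billed as {billed_duration_ms(actual)}ms")
```

Both the 25ms and the 50ms invocation land in the same 100ms block, which is why shaving time off an already sub-100ms function does not change the bill.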
More memory doesn’t always mean more expensive
With Lambda, the cost per invocation is also influenced proportionally by the memory size of the function. A function with 256MB of memory would cost twice as much per second as a function with 128MB of memory.
Moreover, CPU resources are also allocated proportionally - more memory equals more CPU. Combined with the 100ms charging blocks, this opens up an interesting optimization opportunity.
Consider a 128MB function with an average execution time of 110ms. Because the duration is rounded up to the nearest 100ms, it means on average you will pay for 200ms of execution time, or $0.000000416. You can give the function more memory to speed up the portion of the code that is not waiting for an IO operation to complete, and bring the average execution time down. With 192MB of memory, you might reduce the average execution time to under 100ms, and reduce the average invocation cost to $0.000000313. That is a 25% saving per invocation!
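The arithmetic above can be sketched in a few lines. I am assuming a price of $0.00001667 per GB-second, ignoring the per-request charge, and picking 95ms as a hypothetical "under 100ms" duration for the 192MB configuration.

```python
import math

# Assumption: Lambda duration price per GB-second; check the rate for your region.
PRICE_PER_GB_SECOND = 0.00001667


def invocation_cost(memory_mb: int, duration_ms: float) -> float:
    """Duration cost of one invocation, billed in 100ms blocks."""
    billed_ms = math.ceil(duration_ms / 100) * 100
    gb_seconds = (memory_mb / 1024) * (billed_ms / 1000)
    return gb_seconds * PRICE_PER_GB_SECOND


before = invocation_cost(128, 110)  # billed as 200ms
after = invocation_cost(192, 95)    # hypothetical duration, billed as 100ms
saving = 1 - after / before

print(f"128MB @ 110ms: ${before:.9f}")
print(f"192MB @  95ms: ${after:.9f}")
print(f"saving per invocation: {saving:.0%}")
```

The saving only materialises if the extra memory actually pushes the function into a lower billing block, so measure before and after rather than assuming it will.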
Figuring out what memory size to use is a dark art and requires tons of trial-and-error. Alex Casalboni has an elegant solution to automate this process using Step Functions, and you can learn more about it here.
Paying for wait time
As Ben Kehoe explained in this excellent post, even with Lambda you are still paying for idle whenever your function is waiting for IO operations to complete. For example, whenever your function talks to another AWS service or another API endpoint.
In a microservices architecture, where one Lambda function calls another (via API Gateway), you can also end up paying for this wait time multiple times.
Unfortunately there is not a lot we can do about it today. You could, however, adopt a fire-and-forget approach when you don’t need to perform further processing on the response. For example, when publishing a message to SNS, you don’t have to wait for the response as you’re probably not doing anything with the MessageId in the response anyway.
That said, I would strongly recommend that you consider the tradeoffs carefully before adopting this approach for two reasons:
Because durations are charged in 100ms blocks, these optimizations often don’t yield any cost savings. For example, a request to publish an SNS message would likely complete in under 40ms. Shaving an invocation time from 160ms to 120ms would result in no cost saving, as both would be rounded up to 200ms.
You won’t find out if the request actually succeeded. Therefore you won’t be able to handle the errors, or retry the operation. You won’t even have error logs to tell you that the operation failed. This leaves you in a terrible position, with no observability into state changes that you’re depending on.
To truly be able to not pay for idle, we will need platform changes to support millisecond-based billing and only charge for CPU time. However, my personal feeling is that, when you have cost concerns at this level then maybe your scale is already outgrowing the cost-per-invocation model and you should consider moving the workload elsewhere.
Besides the Lambda functions, it’s also very important to take into account the cost of the event sources. API Gateway, for instance, is charged at $3.50 per million API calls received plus data transfer charges. In practice, API Gateway is likely to cost you more than Lambda in production, sometimes several times more.
An API Gateway that receives a constant rate of 1000 requests/s would cost around $9,000 per month, plus the cost of Lambda invocations, CloudWatch Logs and data transfers. Given the cost involved at scale, you should also consider rewriting these high throughput APIs to run in containers or VMs.
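Here is where that figure comes from, as a quick sketch. It uses the $3.50 per million price quoted above, assumes a 30-day month, and leaves out data transfer, Lambda and CloudWatch Logs entirely.

```python
API_GATEWAY_PRICE_PER_MILLION = 3.50    # per million API calls, as quoted above
SECONDS_PER_MONTH = 30 * 24 * 60 * 60   # assumption: a 30-day month


def api_gateway_monthly_cost(requests_per_second: float) -> float:
    """Request charges only; data transfer is not included."""
    requests = requests_per_second * SECONDS_PER_MONTH
    return requests / 1_000_000 * API_GATEWAY_PRICE_PER_MILLION


print(f"${api_gateway_monthly_cost(1000):,.2f} per month")  # $9,072.00 per month
```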
AWS Step Functions is another service that, whilst it delivers a lot of value, can be really pricey when used at scale. At $25 per million state transitions, it’s one of the most expensive services in AWS that I have worked with. Have a look at this post to learn more about Step Functions, what it’s good for and when to use it.
Cost can be scale sensitive
Aside from APIs, Lambda is often used in conjunction with SNS, SQS or Kinesis to perform background processing. SNS and SQS are both charged by requests only, whereas Kinesis charges for shard hours on top of PUT requests. This makes Kinesis a comparatively expensive event source when the throughput is low.
At one message per second, this is how much each event source would likely cost per month:
However, by the time you reach a throughput of one thousand messages per second, the monthly cost for these event sources paints a very different picture.
The difference owes to the fact that Kinesis has a much lower cost per million requests at $0.014 vs. $0.5 and $0.4 for SNS and SQS respectively. This allows Kinesis’s cost to grow at a much slower rate as throughput goes up, which makes Kinesis a very attractive option for systems that have to operate at scale. Indeed, Netflix uses Kinesis to analyse VPC flow logs at massive scale!
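If you want to play with the numbers yourself, the shape of the comparison is a fixed monthly cost plus a per-request cost. The sketch below is a rough model, not a pricing calculator. The Kinesis figures (a single shard at $0.015 per hour and $0.014 per million PUT payload units) are my assumptions, as are the 30-day month and counting one request per message for every service.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: a 30-day month
HOURS_PER_MONTH = 30 * 24

# source -> (price per million requests, fixed monthly cost)
# Assumed list prices; swap in the rates that apply to your region.
PRICING = {
    "SNS": (0.50, 0.0),
    "SQS": (0.40, 0.0),
    # Assumption: one shard is enough for both throughput levels below.
    "Kinesis": (0.014, 0.015 * HOURS_PER_MONTH),
}


def monthly_cost(source: str, messages_per_second: float) -> float:
    """Fixed cost plus request charges, counting one request per message."""
    per_million, fixed = PRICING[source]
    millions = messages_per_second * SECONDS_PER_MONTH / 1_000_000
    return fixed + millions * per_million


for rate in (1, 1000):
    print(f"--- {rate} message(s) per second ---")
    for source in PRICING:
        print(f"{source:8s} ${monthly_cost(source, rate):,.2f} per month")
```

At low throughput the shard hours dominate the Kinesis bill. At high throughput the per-request price dominates everything, and that is where Kinesis pulls ahead.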
Another often overlooked cost of using Lambda lies with the services you use to monitor your functions. CloudWatch Logs, for instance, is a compulsory service. Anything you write to stdout would be captured and shipped to CloudWatch Logs. Even if you don’t write anything yourself, the Lambda service would always write three system messages for START, END and REPORT.
CloudWatch Logs charges you $0.5 per GB ingested as well as $0.03 per GB per month for storage. On its own, this is a very competitive price compared to many of its competitors! However, it’s still relatively expensive compared to the cost of the Lambda invocations. In fact, it’s very common for people to spend more (sometimes an order of magnitude more!) on CloudWatch Logs than Lambda in their production AWS account. Which is why I highly recommend you use structured logging and only sample debug logs in production.
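As a rough illustration of what sampling debug logs can look like, here is a minimal sketch using only the standard library. The `log` helper, its field names and the 1% sample rate are all hypothetical; in a real project you would more likely configure this in your logging library of choice.

```python
import json
import random
import sys

DEBUG_SAMPLE_RATE = 0.01  # hypothetical: keep roughly 1% of debug logs


def log(level, message, *, sample_rate=DEBUG_SAMPLE_RATE, rng=random.random, **fields):
    """Write one JSON log line to stdout; DEBUG lines are sampled.

    Returns True if the line was written, False if it was dropped.
    """
    if level == "DEBUG" and rng() >= sample_rate:
        return False  # never written, so never ingested by CloudWatch Logs
    record = {"level": level, "message": message, **fields}
    sys.stdout.write(json.dumps(record) + "\n")
    return True


log("INFO", "order created", order_id="abc-123")   # always written
log("DEBUG", "payload received", size_bytes=512)   # written about 1% of the time
```

Because every line is a JSON object, whatever service you ship the logs to can filter on `level` or any other field without parsing free text.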
Also, CloudWatch Logs has very limited search capabilities, so most people would ship their logs to another fully fledged log aggregation service anyway. Which means, you still incur the full cost of a log aggregation service, plus whatever you have to pay for CloudWatch Logs.
Finally, you should also consider the cost of data transfer for Lambda. Data transfers are charged at the standard EC2 rate, which is itself a very complex topic! The best explanation I have found is this diagram from the AWS open guide, courtesy of Corey Quinn.
The pricing for Azure Functions and Google Cloud Functions is very similar to Lambda, with Azure coming in slightly cheaper than the rest. That said, given the price differences are minimal and Lambda is a more mature platform with more supported event sources, I wouldn’t advocate anyone moving to Azure based on cost alone.
Both Azure Functions and Google Cloud Functions offer HTTP bindings out-of-the-box and do not charge extra for them. These HTTP bindings don’t offer the full range of features that API Gateway offers, such as caching, model validation, and authentication. For simple use cases where you don’t need all these extra features, these built-in HTTP bindings are both easier to work with, and a lot cheaper to operate!
Binaris also offers a simple, free-to-use HTTP binding, and what’s interesting about their FAAS platform is that it’s specialised for performance and cost. Currently, Binaris’s price is exactly 10% of Lambda and it’s blazingly fast by comparison. From what I have seen so far, it doesn’t suffer from the common problem of cold starts, or perhaps its cold starts are so fast that they’re indistinguishable from a warm invocation!
Binaris does not offer an extensive set of event sources yet, but if you’re building APIs and your primary concerns are predictable performance and cost then you should definitely check these guys out!
To summarise, there are a lot of nuances to fully understanding the cost of a serverless architecture built around AWS Lambda functions.
There are caveats to how function invocations are charged by duration, and that has implications on how, and when you should optimize a function to reduce its cost.
A Lambda function is always used with some event source, so you also need to factor in the cost of those event sources. The scale or throughput you operate at can have a telling impact too. Kinesis can seem expensive when used at small scale, but it is vastly more cost effective compared to SNS and SQS as throughput goes up.
As your API reaches “web scale”, the cost-per-invocation model of Lambda and API Gateway will become very expensive. At that point, you should consider porting or rewriting the API so it can run in a container, or on VMs. Alternatively, consider specialised platforms such as Binaris, which offers a much more competitive pricing structure compared to Lambda.
And finally, you also need to consider other peripheral costs such as CloudWatch Logs and data transfers. If you log all debug messages in production, then you will likely spend many times your Lambda invocation costs on CloudWatch Logs!
There is one other cost that I didn’t cover in this post, which is the cost of developer time and opportunity costs. These are important, and often underappreciated costs that contribute significantly to the total cost of ownership (TCO) of your architecture. And I feel they are also the most important cost savings one can make by moving to serverless!