My wish list for AWS Lambda in 2018

As a heavy Lambda user, I'm desperately waiting to see new features from AWS. These items would address many recurring challenges Lambda users face in production.

Posted by Yan Cui on July 18, 2018

Amazon Web Services (AWS) recently announced that Simple Queue Service (SQS) is finally a supported event source for Lambda. This is extremely exciting news, as I have been waiting for this for two long years! It got me thinking about what other features I am desperately waiting to see from AWS Lambda. After some quick brainstorming, here is my wish list for Lambda for 2018. These items would address many recurring challenges Lambda users face in production, including:

  • better monitoring at scale
  • cold start performance
  • scalability in spiky load scenarios

So, I hope someone from the Lambda team is reading this. Here we go!

1. Ability to forward logs to a Kinesis stream without going through CloudWatch Logs

CloudWatch Logs is a convenient tool for capturing logs from Lambda functions. It collects logs without adding latency to function invocations.

However, the search capabilities offered by CloudWatch Logs are simply not good enough. For instance, you can’t search across more than one log group at once, which is a huge limitation. Because of this, many users are forced to ship their logs to a log aggregation service, such as Logz.io or Splunk.

image-001

You can use a Lambda function to ship the logs. Although this approach works well initially, it can introduce big problems once you operate at scale, because it is tricky to control the concurrency of the log shipping function. On one hand, if you don't limit the function's concurrency, you can hit the regional concurrency limit. On the other hand, if you do limit its concurrency, you risk losing logs when the function is throttled.

At scale, a superior approach would be to pipe the logs to a Kinesis stream first. With Kinesis, you can control the concurrency of the subscriber function with the number of shards, which prevents the log shipping function from consuming too much of the available concurrency.
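For reference, this is roughly what that wiring looks like today; a minimal boto3 sketch, assuming the Kinesis stream and the IAM role that CloudWatch Logs assumes already exist (all names and ARNs below are placeholders):

```python
import boto3

logs = boto3.client("logs")

# Subscribe a Lambda function's log group to an existing Kinesis stream.
# CloudWatch Logs assumes the given IAM role to write to the stream.
logs.put_subscription_filter(
    logGroupName="/aws/lambda/my-function",
    filterName="ship-to-kinesis",
    filterPattern="",  # an empty pattern forwards every log event
    destinationArn="arn:aws:kinesis:us-east-1:123456789012:stream/log-shipping",
    roleArn="arn:aws:iam::123456789012:role/cwl-to-kinesis",
)
```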

image-002

This approach also removes the limitation of having a single subscriber function per log group. You can have many subscribers to a Kinesis stream. You can even use Kinesis Firehose or Kinesis Analytics to slice and dice the log data in different ways. Unfortunately, the tradeoff is another hop in the log shipping pipeline, which adds further delay before you see your logs. It would be far better if we had the option to send logs to a Kinesis stream directly. That way, we would reap all the above benefits, and might even see our logs faster than we do with CloudWatch Logs!

2. Option to pay for a pool of pre-warmed containers, and the ability to update desired capacity on demand

Adoption of Lambda skyrocketed when API Gateway became an event source. It opened the door for us to create APIs using fully managed, serverless technologies. With APIs, response time is of the utmost importance. Cold starts are one of the most frequently cited problems for folks using API Gateway and Lambda, and API response times often suffer as a result.
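For context, the workaround most of us rely on today is to keep containers warm ourselves, by having a CloudWatch Events schedule ping the function every few minutes and short-circuiting those pings in the handler. A minimal sketch, assuming a Python function and a self-chosen {"warmup": true} payload:

```python
import json

# A CloudWatch Events schedule rule invokes this function every few minutes
# with a payload like {"warmup": true}; real requests arrive from API Gateway.
def handler(event, context):
    if isinstance(event, dict) and event.get("warmup"):
        # Short-circuit: do no real work, just keep this container alive.
        return {"statusCode": 200, "body": "warm"}

    # ... normal request handling below ...
    return {
        "statusCode": 200,
        "body": json.dumps({"message": "hello"}),
    }
```

The obvious limitation is that each scheduled ping only keeps a single container warm, and it's on us to maintain the schedule.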

To minimise the number of cold starts, many have asked for the option to pay for a pool of pre-warmed containers. That way, we can always have enough warm containers to meet baseline traffic and will rarely experience cold starts.

image-003

But what about applications that experience predictable spikes in traffic? For example, food delivery services experience large spikes in traffic around both lunch and dinner. My employer, DAZN, is in the sports streaming business. We too experience huge spikes, as half a million users flood in to watch a match.

image-004

In these cases, it’s simply not cost efficient to maintain a large pool of containers at all times. Instead, we need the ability to update the size of the pool on demand. That way, right before an event we can increase the size of the pool to match the predicted level of traffic. When the event starts, there will be enough warm containers to handle the spike, and we can reduce the size of the pool after the event finishes.

image-005

3. Option to pay for a pool of ENIs to reduce cold start time for VPC functions

When scaling up the concurrency of Lambda functions, it's also necessary to scale their dependent resources. If your function runs inside a VPC, every instance of that function needs an accompanying ENI (Elastic Network Interface). Because ENIs are slow to create, cold start duration is impacted significantly.
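To make that concrete, "runs inside a VPC" means the function is configured with subnets and security groups, along these lines (a boto3 sketch with placeholder IDs):

```python
import boto3

lambda_client = boto3.client("lambda")

# Attach an existing function to a VPC. Every concurrent instance of the
# function then needs an ENI in one of these subnets, which is where the
# extra cold start time comes from. The IDs below are placeholders.
lambda_client.update_function_configuration(
    FunctionName="my-function",
    VpcConfig={
        "SubnetIds": ["subnet-0abc1234", "subnet-0def5678"],
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
    },
)
```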

image-006

In my experience, the overhead of creating ENIs can be as high as 10 seconds! Generally, AWS recommends not using VPCs and instead putting your own authentication in place, unless your function needs to access a VPC-protected resource such as ElastiCache or RDS.

image-007

This advice is inadequate for many enterprise users who need to deploy their functions inside private VPCs to comply with company policy. Even where it's not mandated, VPCs are still a great place to house functions because of the many security benefits they bring to the table.

Without VPCs, we can still control ingress traffic with IAM roles, but we lose the ability to effectively monitor and control egress traffic from our functions. A single compromised dependency can allow attackers to leak sensitive data, whether that be API keys, AWS credentials, or database connection strings. Deploying functions into VPCs would help us plug that gap in our security monitoring.

To minimise the impact this has on cold start time, many of us would be happy to pay a premium to maintain a pool of ENIs. Harking back to the earlier point about pre-warmed container pools, we would also need the ability to adjust the size of the ENI pool on demand. Perhaps the two features could even be tied together: when I raise the pool size for a function inside a VPC, the system would increase the size of the ENI pool as well. The cost of maintaining the ENI pool could then be factored into the premium we pay for the warm containers. This would simplify the operational aspect (for us customers, at least!) of working with these reserved pools of resources.

4. Option to raise the 500 per minute scaling limit via support ticket

There is currently a limit on the rate at which Lambda can scale up the number of concurrent executions: 500 per minute, which is enough in most cases. As previously mentioned, during live streaming events, users tend to flood in moments before the event kicks off. As a result, DAZN regularly experiences far bigger spikes than the 500 per minute limit can accommodate.

image-008

As with most AWS limits, AWS is happy to negotiate on a per-customer basis, provided you have the “right” use case. However, you have to know the right people to even discover that raising this limit is an option. In fact, many people aren't aware of the limit until they run into it in production. It is not listed on the AWS Lambda Limits documentation page; instead, it is briefly mentioned on the Lambda Scaling Behaviour page.

image-009

This is also one of the most confusing parts of the Lambda documentation. To this day, I still don't understand the behaviour described there. Alas, that's another wish for another post. I would like to see the 500 per minute scaling limit added to the AWS Lambda Limits documentation, along with the option to raise it via a support ticket.

5. Predictive scaling for Lambda

Another feature often requested by advanced users is predictive scaling. Most web traffic is predictable and conforms to a set of daily and weekly patterns, so much so that gov.uk is able to use machine learning to accurately predict their page views per hour.

image-010

Should it not then be possible for the Lambda team to do the same for our Lambda executions? They already have all the information needed to predict how much concurrency will be required based on historical patterns. Granted, their prediction engine is unlikely to achieve 100% accuracy. Luckily, it doesn't have to in order to be useful. Even if it scaled up a function's concurrency ahead of demand correctly only, say, 90% of the time, it would still eliminate the majority of the cold starts we experience. That would be a big win!
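To illustrate how much of the raw material already exists, here is a sketch that pulls a week of hourly peak concurrency from CloudWatch; this is exactly the kind of historical data a prediction engine, AWS's or our own, would learn from:

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

# Pull a week of hourly peak concurrency for the account.
now = datetime.utcnow()
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="ConcurrentExecutions",
    StartTime=now - timedelta(days=7),
    EndTime=now,
    Period=3600,          # one data point per hour
    Statistics=["Maximum"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])
```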

I would be happy even if predictive scaling were restricted to API Gateway functions. Functions that perform background processing rarely have an issue with cold starts, as the extra latency is not user facing and does not impact the UX.

6. CloudWatch to include the number of cold starts and concurrent executions as metrics for all functions

My last wish list item related to cold starts is better visibility. Tools such as IOpipe offer cold start metrics out of the box (see the screenshots below), and they have proved useful for many of their customers.

image-011

image-012
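In the meantime, the usual DIY approach is to track cold starts ourselves with a module-level flag; a minimal sketch (the custom namespace and metric name are my own choices, and in practice you may prefer to just log a line and use a metric filter rather than call CloudWatch inside the handler):

```python
import os

import boto3

cloudwatch = boto3.client("cloudwatch")

# Module-level state survives across invocations in the same container,
# so the first invocation in a container is, by definition, a cold start.
cold_start = True

def handler(event, context):
    global cold_start
    if cold_start:
        cold_start = False
        # Publish a custom metric; namespace and metric name are arbitrary.
        cloudwatch.put_metric_data(
            Namespace="Custom/Lambda",
            MetricData=[{
                "MetricName": "ColdStart",
                "Dimensions": [{
                    "Name": "FunctionName",
                    "Value": os.environ["AWS_LAMBDA_FUNCTION_NAME"],
                }],
                "Value": 1.0,
            }],
        )

    # ... normal request handling ...
    return {"statusCode": 200}
```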

I imagine it would be easy for Lambda to expose the number of cold starts as a metric in CloudWatch, too. Additionally, I would like to see the number of concurrent executions of a Lambda function exposed as a metric. As of today, you can see the total number of concurrent executions across all functions in CloudWatch. What you don't see is the concurrency of an individual function. That function-level metric is only enabled when you specify a reserved concurrency, which sets a maximum limit on the function's concurrency (I know, the name is misleading…).

image-013

Most people I have spoken with would love to see this metric, but not at the risk of unintentionally throttling their functions.
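For completeness, this is the setting that currently unlocks the per-function metric, and the reason people hesitate to use it (a boto3 sketch; the value 100 is arbitrary):

```python
import boto3

lambda_client = boto3.client("lambda")

# Setting a reserved concurrency makes CloudWatch emit the per-function
# ConcurrentExecutions metric, but it also caps the function at this value,
# so requests beyond it are throttled. That is the trade-off nobody wants
# to make just to get a metric.
lambda_client.put_function_concurrency(
    FunctionName="my-function",
    ReservedConcurrentExecutions=100,
)
```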

7. Finalizer handler so we can run cleanup code when a container is garbage collected

Lastly, it would be awesome to have a finalizer handler alongside the invocation handler. This would give users a consistent and reliable way to clean up resources when a container is garbage collected.

The official AWS Lambda Best Practices guide recommends reusing HTTP and database connections. What we lack is a way to dispose of these connections when the container is garbage collected. Instead, we have to rely on subpar mechanisms such as the database's idle connection timeout, which can take hours. As a result, we often end up with many idle connections to the database, increasing the risk of exhausting the number of connections the database can accept.

For instance, MySQL has a default max_connections of just 100! When a lot of idle connections are left behind by garbage-collected containers, new containers might not be able to connect. Additionally, the finalizer handler should be subject to strict timeouts to prevent abuse (similar to how the CLR restricts finalizers to 2 seconds), so that malicious code can't block the finalizer queue.
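To make the gap concrete, here is the recommended reuse pattern, sketched with pymysql and placeholder environment variables; notice there is nowhere sensible to call connection.close():

```python
import os

import pymysql

# Created once per container and reused across invocations, as the best
# practices guide recommends. Connection details come from environment
# variables (placeholders here).
connection = pymysql.connect(
    host=os.environ["DB_HOST"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
    database=os.environ["DB_NAME"],
)

def handler(event, context):
    with connection.cursor() as cursor:
        cursor.execute("SELECT 1")
        cursor.fetchone()
    return {"statusCode": 200}

# There is no hook that runs when this container is garbage collected, so
# connection.close() never gets called; the connection simply idles until
# the database times it out.
```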

Summary

As you can see from this wish list, cold starts and scalability are at the top of the list of issues I personally face. In fact, I get asked about cold starts all the time on my posts and at conferences. It is by far the most pressing issue Lambda users face today. Some might argue that paying for a pool of warm containers would violate the spirit of serverless. After all, many have defined serverless, at least in part, as not paying for unused resources.

They might also argue that doing so would erode the cost saving benefits of serverless. And they might be right :-) I would counter that Lambda delivers cost savings far beyond what we can see on our AWS bill. In my experience, the biggest saving comes from all the code you no longer have to write. After all, every line of code is a cost, and every line of code is a liability that you are responsible for.

image-014

The cost saving also comes from all the infrastructure you are no longer responsible for, which means you no longer need the in-house skillset to look after it. Considering that engineers are the most expensive resource, this delivers a massive saving! So, let me know what you think in the comments below. And feel free to share your wish list for AWS Lambda with me!

To see other AWS wish lists, you can follow the #awswishlist hashtag on Twitter, or check out the AWS Wishlist website.