This post is the third in a multi-part series on serverless adoption. In the first part we presented a guide to go all in with serverless adoption. In the second part we discussed how to migrate monoliths to serverless.
When you have a set of microservices running on VMs or inside Docker containers, consider moving some of them to serverless. Here are some commonly cited reasons:
- Cost: Services with low traffic, or cron jobs, are cheaper with serverless. This is because you don’t pay for them when they’re idle.
- Time to market: You get more done, faster. Free yourself of the undifferentiated heavy lifting of managing infrastructure. Focus on your users’ needs instead.
- Scalability: You can scale faster and more aggressively because you’re not constrained by the amount of spare capacity you reserve (and have to pay for) for scaling.
- Easier ops: You get logging and monitoring out of the box. You can also integrate with other services for tracing, alerting, and visualization.
- Security: You have more fine-grained control of access. Give each function only the permissions it needs to minimize attack surface.
- Resilience: You get multi-AZ out of the box with AWS Lambda.
Suffice it to say, there are many benefits to moving some of your microservices to serverless. But there are also challenges you need to overcome as you make the transition. You will find that many of the practices and tools you rely on are no longer suitable, and you need to rethink how you approach building production-ready microservices with serverless.
1. Organize your microservice into functions
One of the most frequently asked questions when porting an existing API to serverless is: “Should I have one function per endpoint and action or one function for the whole API?”
I am firmly in the first camp, where every function handles only one endpoint and action.
This is the right approach for the following reasons:
- Each function is simple as it does only one thing.
- You can tailor the IAM permissions for each function to give the least amount of privilege possible. This reduces the attack surface if one of the endpoints is compromised.
- When a function does only one thing, it will likely require fewer dependencies and less initialization code. This reduces the cold start time you will experience.
- It creates a clear, 1-to-1 mapping between business capability and function. You can see at a glance what business capabilities you have in an API without having to look inside the code.
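As a rough illustration of this layout, here is a minimal sketch of two single-purpose handlers for a hypothetical users API. The handler names, the in-memory `USERS` store, and the routes in the docstrings are assumptions made up for the example. The event fields (`pathParameters`, `body`) follow the API Gateway Lambda proxy format.

```python
import json

# Hypothetical in-memory store standing in for a real database table.
USERS = {"42": {"id": "42", "name": "Ada"}}


def get_user(event, context):
    """Handles GET /users/{id} and nothing else."""
    user_id = event["pathParameters"]["id"]
    user = USERS.get(user_id)
    if user is None:
        return {"statusCode": 404, "body": json.dumps({"message": "not found"})}
    return {"statusCode": 200, "body": json.dumps(user)}


def create_user(event, context):
    """Handles POST /users and nothing else."""
    user = json.loads(event["body"])
    USERS[user["id"]] = user
    return {"statusCode": 201, "body": json.dumps(user)}
```

In a real deployment each handler would be its own Lambda function, so `get_user` could be given read-only access while only `create_user` gets write access.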
2. Build observability into the microservice
With serverless, you can no longer rely on agents/daemons to collect and send logs and metrics for you. There is simply no place to install them. Instead, we need to rethink how we build observability into our serverless applications.
2.1 Log aggregation
With AWS Lambda, everything you write to stdout is captured and shipped to CloudWatch Logs. This process is asynchronous and doesn’t add any overhead to your function’s invocation time.
But CloudWatch Logs does not let you search across multiple functions. It is also missing many of the features of the ELK stack. You will outgrow CloudWatch Logs once you have a dozen or so functions. At that point, you have some options to stream the logs out of CloudWatch Logs:
- Stream to Amazon Elasticsearch if you are already using it. If not, read this post before making your decision.
- Stream to a Lambda function. The function can ship the logs to a log aggregation service, for example Logz.io, Splunk, or a self-hosted ELK stack.
- Stream to a Kinesis stream. You can then subscribe a Lambda function to the stream to ship the logs to a log aggregation service.
Whichever option you choose, you should automate the subscription process.
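One way to automate this is a small function that runs whenever a new log group is created and subscribes it to your destination. The sketch below assumes the function is triggered by the CloudTrail `CreateLogGroup` event, and that `logs_client` is a boto3 CloudWatch Logs client (`boto3.client("logs")`) passed in by the caller. The filter name and the factory function are hypothetical.

```python
def make_subscriber(logs_client, destination_arn, own_log_group=None):
    """Build a Lambda handler that subscribes newly created log groups.

    logs_client is assumed to be a boto3 CloudWatch Logs client.
    own_log_group is the log shipper's own log group, which is skipped
    so the shipper never ends up shipping its own logs in a loop.
    """

    def handler(event, context):
        # Assumed event shape: CloudTrail CreateLogGroup delivered as a rule event.
        log_group_name = event["detail"]["requestParameters"]["logGroupName"]
        if log_group_name == own_log_group:
            return None
        logs_client.put_subscription_filter(
            logGroupName=log_group_name,
            filterName="ship-logs",  # hypothetical filter name
            filterPattern="",        # an empty pattern forwards every log event
            destinationArn=destination_arn,
        )
        return log_group_name

    return handler
```

Passing the client in, rather than creating it inside the handler, keeps the sketch easy to exercise with a stub.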
You should log structured data in JSON and complement each message with useful context, such as request ID or user ID. This helps you debug problems in production and makes it easy to find related log messages.
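A minimal sketch of such a logger might look like the following. The field names (`level`, `message`, `requestId`, `userId`) and the module-level `LOG_CONTEXT` dictionary are assumptions for illustration, not a standard.

```python
import json
import sys
import time

# Context that should accompany every message in the current invocation,
# e.g. request ID or user ID. Populated at the start of each invocation.
LOG_CONTEXT = {}


def log(level, message, **fields):
    """Write one JSON log line to stdout and return it."""
    entry = {"level": level, "message": message, "timestamp": time.time()}
    entry.update(LOG_CONTEXT)  # shared context for this invocation
    entry.update(fields)       # message-specific details
    line = json.dumps(entry)
    sys.stdout.write(line + "\n")
    return line
```

A call such as `log("INFO", "order placed", orderId="o-9")` then produces a single line that a log aggregation service can index by field.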
CloudWatch Logs charges $0.50 per GB ingested. If you leave debug logging on in production, then you will likely spend many times your Lambda invocation cost on CloudWatch Logs. Instead, you should sample debug logs for a small percentage of invocations.
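Sampling can be as simple as deciding once per invocation whether debug logging is on. The sketch below assumes a 1% sample rate. Both the rate and the function name are illustrative.

```python
import random

DEBUG_SAMPLE_RATE = 0.01  # assumed: enable debug logs for about 1% of invocations


def should_log_debug(sample_rate=DEBUG_SAMPLE_RATE, rng=random.random):
    """Decide, once per invocation, whether debug logs are enabled.

    rng returns a float in [0, 1); it is injectable so the decision
    can be made deterministic in tests.
    """
    return rng() < sample_rate
```

You would call `should_log_debug()` at the start of the handler and pass the result to your logger, so that all debug messages for a sampled invocation are kept together.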
Finally, you should log the invocation event whenever a function errors. This lets you capture and replay failed invocations when you need to debug these failures.
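One possible way to do this is a decorator around the handler, sketched below. The decorator name and log fields are hypothetical. Bear in mind that invocation events can carry sensitive data, so you may want to redact fields before logging them.

```python
import functools
import json


def log_event_on_error(handler, emit=print):
    """Wrap a Lambda handler so a failed invocation logs its input event."""

    @functools.wraps(handler)
    def wrapper(event, context):
        try:
            return handler(event, context)
        except Exception as error:
            emit(json.dumps({
                "level": "ERROR",
                "message": "invocation failed",
                "errorType": type(error).__name__,
                "errorMessage": str(error),
                "invocationEvent": event,  # kept so the invocation can be replayed
            }, default=str))
            raise  # re-raise so Lambda still records the invocation as an error

    return wrapper
```

Because the error is re-raised, retries and error metrics behave exactly as they would without the wrapper.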
You get some metrics out of the box with CloudWatch. These include invocation count, duration, and error count. But custom application metrics have to be sent as part of the function’s invocation. This introduces latency overhead, which we want to avoid for user-facing APIs.
Many vendors, such as Datadog, support importing metrics from CloudWatch. Datadog also has special integration for Lambda. If you write custom metrics to stdout in DogStatsD format, Datadog will parse and ingest them as metrics. You should consider this approach for user-facing APIs even if you don’t use Datadog. If you stream your logs to a Lambda function or to a Kinesis stream first, you can handle these special messages in the function. Instead of sending them as logs, you can parse them and send them as custom metrics.
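To make the idea concrete, here is a sketch of both halves: formatting a metric as a log line, and recognizing that line again in a log-shipping function. The pipe-delimited `MONITORING|timestamp|value|type|name|#tags` layout is what I recall Datadog's Lambda integration documenting. Treat the exact field order as an assumption and check it against the current Datadog documentation. The function names are made up for the example.

```python
import time


def format_metric(name, value, metric_type="count", tags=(), timestamp=None):
    """Format a custom metric as a single log line (assumed layout)."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return "MONITORING|{}|{}|{}|{}|#{}".format(
        ts, value, metric_type, name, ",".join(tags)
    )


def parse_metric(line):
    """Return a metric dict for a metric line, or None for an ordinary log line."""
    if not line.startswith("MONITORING|"):
        return None
    parts = line.strip().split("|")
    if len(parts) != 6:
        return None
    _, ts, value, metric_type, name, raw_tags = parts
    tags = [tag for tag in raw_tags.lstrip("#").split(",") if tag]
    return {
        "timestamp": int(ts),
        "value": float(value),
        "type": metric_type,
        "name": name,
        "tags": tags,
    }
```

The function would `print(format_metric(...))` during the invocation. The log shipper calls `parse_metric` on each line and forwards anything that is not `None` to your metrics backend instead of your log store.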
Don’t forget to set up alarms and hook them up with ops management tools such as OpsGenie or PagerDuty. This is no different than what you do for existing microservices.
With Lambda, you should also enable the “Active Tracing” option via configuration. This allows X-Ray to trace the function invocations. If you instrument your code, then you can get a fine-grained breakdown of what is happening.
2.5 Correlation IDs
Complex workflows often need many functions to work together. You need to correlate the logs from all these functions to debug problems in the workflow. To do that, you need to capture and forward correlation IDs across different event sources.
With AWS Lambda, concurrency is managed at the platform level. You don’t have to worry about handling concurrent requests in your code anymore. This means you can safely store incoming correlation IDs in a global variable. That way, the logging module has a known location to find them and can include them in every log message.
To capture incoming correlation IDs, consider using a middleware engine, such as Middy. Create a reusable middleware to inspect the invocation event and extract the correlation IDs.
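Middy is a Node.js library, but the same idea can be sketched in Python with a decorator. The example below assumes correlation IDs arrive as HTTP headers prefixed with `x-correlation-`. The prefix, the `CORRELATION_IDS` global, and the decorator name are all assumptions for illustration.

```python
import functools
import uuid

PREFIX = "x-correlation-"  # assumed naming convention for correlation headers

# Safe as a global because each Lambda container handles one request at a time.
CORRELATION_IDS = {}


def capture_correlation_ids(handler):
    """Extract correlation IDs from the event before the handler runs."""

    @functools.wraps(handler)
    def wrapper(event, context):
        # Containers are reused between invocations, so reset the state first.
        CORRELATION_IDS.clear()
        headers = event.get("headers") or {}
        for name, value in headers.items():
            if name.lower().startswith(PREFIX):
                CORRELATION_IDS[name.lower()] = value
        if PREFIX + "id" not in CORRELATION_IDS:
            # No upstream ID: start a new one from the Lambda request ID if present.
            request_id = getattr(context, "aws_request_id", None)
            CORRELATION_IDS[PREFIX + "id"] = request_id or str(uuid.uuid4())
        return handler(event, context)

    return wrapper
```

A logging module can then read `CORRELATION_IDS` and merge it into every log message without the handler code having to pass it around.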
To forward correlation IDs in outbound traffic, you can create wrappers for HTTP and AWS SDK clients. In the wrapper, you will inject these captured correlation IDs depending on what you’re talking to.
With synchronous event sources, such as an API, include the correlation IDs as HTTP headers.
Things get trickier with asynchronous event sources, such as SNS, Kinesis, or S3. You can pass correlation IDs as message attributes with SNS or as object tags with S3. With Kinesis, they have to be part of the event payload, as there is no way to add additional context to the event.
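The helpers below sketch how a wrapper might inject correlation IDs for each of these targets. They only build the request parameters and leave the actual HTTP or AWS SDK call to the caller. The function names and the `__context__` payload field are hypothetical. The SNS attribute shape and the URL-encoded S3 `Tagging` string follow the formats the AWS SDK expects.

```python
import json
from urllib.parse import urlencode


def with_http_headers(headers, correlation_ids):
    """HTTP: send correlation IDs as request headers."""
    merged = dict(headers or {})
    merged.update(correlation_ids)
    return merged


def with_sns_attributes(message_attributes, correlation_ids):
    """SNS: send correlation IDs as message attributes."""
    merged = dict(message_attributes or {})
    for name, value in correlation_ids.items():
        merged[name] = {"DataType": "String", "StringValue": value}
    return merged


def with_s3_tagging(correlation_ids):
    """S3: send correlation IDs as object tags (URL-encoded Tagging string)."""
    return urlencode(correlation_ids)


def with_kinesis_payload(payload, correlation_ids):
    """Kinesis: records carry only data, so IDs travel inside the payload."""
    enriched = dict(payload)
    enriched["__context__"] = dict(correlation_ids)  # hypothetical reserved field
    return json.dumps(enriched).encode("utf-8")
```

On the consuming side, the capture step would need a matching branch for each event source, for example reading `__context__` back out of a decoded Kinesis record.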
The motivation for this approach is to make correlation IDs unintrusive to the developers. The following diagram illustrates how such a system would work.
The most important thing here is to apply the principle of least privilege. Each function should have the minimal amount of IAM permissions possible. As discussed earlier, having one function per API endpoint and action helps. It reduces what each function needs to access and the permissions it needs.
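For example, a function that only reads users from a DynamoDB table needs a single action on a single resource. The sketch below builds such a policy document. The table ARN, account ID, and helper name are placeholders invented for the example.

```python
import json


def least_privilege_policy(resource_arn, actions):
    """Build an IAM policy document allowing only the given actions on one resource."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": sorted(actions),
                "Resource": resource_arn,
            }
        ],
    }


# Hypothetical policy for a get-user function: read single items, nothing else.
GET_USER_POLICY = least_privilege_policy(
    "arn:aws:dynamodb:us-east-1:123456789012:table/users",  # placeholder ARN
    ["dynamodb:GetItem"],
)
```

Printing `json.dumps(GET_USER_POLICY, indent=2)` gives a document you can compare against the wildcard policies (`dynamodb:*` on `*`) that tend to creep into shared roles.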
Most teams work very hard to tighten the security around the VPC boundary of their service. But once inside, it’s a full trust environment where everyone has access to all your internal APIs. This fragile approach leaves you wide open if any service behind the VPC is compromised.
This approach doesn’t work with serverless. Take Amazon API Gateway as an example; there is no way to put an API behind a VPC. All APIs are publicly accessible. Instead, you need to authenticate requests to internal APIs. With Amazon API Gateway, you can control access down to each endpoint and each action. When combined with the least privilege principle, it leaves you in a good place.
For internal APIs, you should use AWS_IAM authentication. This allows your functions to sign HTTP requests to internal APIs using their IAM roles. Each function should be permitted to access only the endpoints it needs.
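As a sketch of what "only the endpoints it needs" can look like, the helper below builds an IAM policy that allows `execute-api:Invoke` on specific method and path combinations. The region, account ID, API ID, and stage values are placeholders, and the helper itself is hypothetical.

```python
def invoke_policy(region, account_id, api_id, stage, endpoints):
    """Allow invoking only the listed (method, path) pairs of one API.

    Each path is expected to start with "/", for example "/users/*".
    """
    resources = [
        "arn:aws:execute-api:{}:{}:{}/{}/{}{}".format(
            region, account_id, api_id, stage, method, path
        )
        for method, path in endpoints
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "execute-api:Invoke", "Resource": resources}
        ],
    }
```

A caller that only reads users would get `invoke_policy(..., [("GET", "/users/*")])` attached to its role, and any attempt to call other endpoints would be rejected by API Gateway.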
To protect yourself against DoS attacks, you need to configure sensible throttling limits. The default method throttling (10K requests per second!) for API Gateway is way too high.
These default limits allow a single endpoint to exhaust the account-level limit and render all APIs unavailable. An attack that does so will also incur significant AWS cost at the same time.
3.3 Application layer
The OWASP Top 10 threats continue to affect us in the serverless world.
You need to sanitize user inputs to protect against injection (A1) and XSS (A7) attacks.
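The two defenses look different in code. For injection, keep user input out of the query text by using bound parameters. For XSS, escape user input before placing it in HTML. The sketch below uses an in-memory SQLite database and a hypothetical `users` table purely to keep the example self-contained.

```python
import html
import sqlite3


def find_user(conn, name):
    """Look up users by name with a bound parameter, never string formatting."""
    # The "?" placeholder makes the driver treat name as data, not as SQL.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


def render_greeting(name):
    """Escape user-supplied text before embedding it in HTML."""
    return "<p>Hello, {}</p>".format(html.escape(name))
```

With this in place, an input such as `' OR '1'='1` simply matches no user, and `<script>` is rendered as text rather than executed.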
You also need to sanitize your functions’ outputs to avoid leaking unintended data (A3). This includes not returning an unfiltered exception message and stack trace in an HTTP response. These error details can sometimes include sensitive information, such as credentials or API keys.
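One way to enforce this is to catch unhandled errors at the edge of the function, log the details, and return a generic response. The decorator below is a sketch under that assumption. The name and the response body are illustrative.

```python
import functools
import json


def hide_error_details(handler, emit=print):
    """Log unhandled errors, but return only a generic message to the caller."""

    @functools.wraps(handler)
    def wrapper(event, context):
        try:
            return handler(event, context)
        except Exception as error:
            # Details go to the logs, where access can be controlled.
            emit(json.dumps({
                "level": "ERROR",
                "errorType": type(error).__name__,
                "errorMessage": str(error),
            }))
            # The HTTP response carries no exception message or stack trace.
            return {
                "statusCode": 500,
                "body": json.dumps({"message": "Internal server error"}),
            }

    return wrapper
```

Note that the error message still ends up in your logs, so the same care about credentials applies to who can read them.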
Use services, such as Snyk, to continuously scan your dependencies for known vulnerabilities (A9). If you know a function is no longer needed, then you should delete it right away. An unused function will continue to exist as an attack surface in your system.
3.4 Data layer
For S3, also enable Amazon Macie to detect inadvertent data leaks (A3) or security misconfigurations (A6).
In summary, the serverless paradigm brings many changes to how we build microservices. Many things stay the same, for example:
- how we identify service boundaries
- how we organize our codebase
- why we should follow the single responsibility principle
- why we should focus on testing the integration points of the system
We should remember the principles of microservices and continue to apply them where applicable.
But as we move into the world of serverless, we also need to acknowledge that some things are different. We need to adjust our server-centric views on security and resilience. And we need to avoid the temptation to shoehorn existing tools and practices into this new paradigm.
I hope you have enjoyed this series on migrating to serverless. Do let us know if you have any feedback or if there are any particular topics that we have missed.