While surfing the Internet the other day, I came across an interesting article: AWS Tips I Wish I’d Known Before I Started. Though not the most recent of articles, it’s a good read, and it touches on something important for critical production systems running behind the “Classic” version of AWS Elastic Load Balancer (ELB) – health checks. The default health check is a good starting point, but not the most comprehensive way to check the health of your application, so I want to take some time to go into more detail on ELB health checks for your applications.
A quick disclaimer first: if you’re looking for something that tells you the exact settings to use for your health check, this isn’t your guide. If a guide claims a single health check will do it all without tuning, it’s likely inaccurate. My goal is to give you enough context to evaluate your options and design the best ELB health check for your application without falling into common pitfalls.
First things first, establishing a health check strategy can prevent a lot of frustration and reduce system alarms, saving your on-call team members some sleepless nights. It’s also important to note that you won’t get a health check perfect on the first try, so don’t be afraid to tweak your strategy based on observed data.
When deciding on your ELB health check strategy, you’ll need to understand how its actual “pings” are configured: the interval, the timeout, and the healthy/unhealthy thresholds, all of which should be customized for your application. Regardless of what the documentation says, don’t keep the defaults unless they truly fit your application. To build a tailored health check, you’re going to need some background information. Here are five major things to consider when determining what your application can handle for your health check:
- An ELB consists of one or more nodes, and each node health checks your instances. AWS will not tell you how many nodes belong to your ELB, but you can currently count them by resolving the ELB’s DNS name; for example, dig +short [ELB DNS name here] returns one address per node.
- If you’ve enabled cross-zone load balancing (and you should, trust me), all instances will be health checked by all nodes. Otherwise, the nodes in each Availability Zone (AZ) will only health check the instances in the same AZ. Make sure your OS network parameters, as well as your application, are tuned so the checks don’t overwhelm your instances. Keep in mind that these kinds of changes have an impact on resources, so test, test, and test some more.
- Your HTTP health check will need to return a 200 OK to pass and it must do so within the timeout period defined for the health check. There cannot be any redirects.
- A health check is sent every interval, regardless of whether the previous one succeeded or failed.
- For TCP/SSL health checks, the ELB verifies that a TCP connection can be established through a simple connection open and close. For HTTP(S), the ELB makes a simple request against the health check URI and verifies that a 200 OK response code is returned. There is no need to include a leading / in your health check path; the service only needs the path relative to your webroot. If you serve multiple virtual hosts from your instances, make sure your default virtual host responds to the health check, and use a deep ping page to check the other virtual hosts.
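To make the node-counting point above concrete, here is a minimal sketch that does the programmatic equivalent of the dig command: it resolves the load balancer’s DNS name and counts the returned addresses. The ELB DNS name shown is a hypothetical placeholder, not a real load balancer.

```python
# Sketch: count ELB nodes by resolving the load balancer's DNS name.
# "my-elb-1234567890.us-east-1.elb.amazonaws.com" below is a
# hypothetical placeholder -- substitute your own ELB's DNS name.
import socket

def count_elb_nodes(elb_dns_name: str) -> int:
    # Each A record behind the ELB's DNS name corresponds to one
    # load balancer node that will be health checking your instances.
    _, _, addresses = socket.gethostbyname_ex(elb_dns_name)
    return len(addresses)

# Example (the result will vary as AWS scales the ELB up and down):
# count_elb_nodes("my-elb-1234567890.us-east-1.elb.amazonaws.com")
```

Because AWS adds and removes nodes as traffic changes, treat the count as a point-in-time snapshot rather than a constant.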
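To illustrate the HTTP requirements above (a literal 200 OK, no redirects, answered within the timeout), here is a minimal sketch of a deep ping endpoint using Python’s standard library. The /healthcheck path and the check_dependencies() stub are assumptions for illustration; your real dependency checks would go in that function.

```python
# Minimal sketch of a "deep ping" health check endpoint. The
# /healthcheck path and check_dependencies() stub are hypothetical.
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_dependencies() -> bool:
    # Placeholder: verify databases, caches, downstream services, etc.
    # Keep these checks fast -- the whole response must arrive inside
    # the ELB health check timeout or the check fails.
    return True

class HealthCheckHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthcheck" and check_dependencies():
            # The ELB only passes on a literal 200 -- never answer
            # the health check path with a 301/302 redirect.
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"OK")
        else:
            self.send_response(503)
            self.end_headers()

# To serve it: HTTPServer(("", 8080), HealthCheckHandler).serve_forever()
```

Note that the failure branch returns a 503 rather than simply hanging, so a broken dependency fails the check quickly instead of waiting out the timeout.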
All of the above are important and should be taken into consideration as you develop your ELB health check strategy. This is especially true if you are running a high-traffic application, where it is easy to end up with a large ELB that adds quite a bit of load to your instances through health checks alone. You need to plan for that and make sure you’re load testing prior to any “go live” dates. Failure to do so can – and will – leave your on-calls wanting to crawl into a corner, sob, and probably go slightly insane.
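As a back-of-envelope sketch of the load concern above, the arithmetic is simple: with cross-zone load balancing enabled, every ELB node health checks every instance once per interval. The node count and interval below are hypothetical example numbers, not recommendations.

```python
# Back-of-envelope sketch of ELB health check load per instance.
# The node count and interval are hypothetical example numbers.
def health_check_rps_per_instance(elb_nodes: int, interval_seconds: float) -> float:
    # With cross-zone load balancing enabled, every ELB node health
    # checks every instance once per interval.
    return elb_nodes / interval_seconds

# 8 ELB nodes pinging every 10 seconds means each instance absorbs
# 0.8 health check requests per second on top of its real traffic.
print(health_check_rps_per_instance(8, 10.0))  # 0.8
```

Run the numbers for your own ELB size before picking an interval; a large ELB with a short interval and an expensive deep ping page can add meaningful load by itself.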
In upcoming posts, we’ll be looking more closely at how health checks work and the importance of autoscaling – check back soon!