What we learned about cloud security running a SaaS in AWS for 5 years - Part 7 - Availability

This is Part 7 of a multi-part series of posts on how we securely ran ThreatSim in AWS for 5 years and did right by our customers. Some of the controls described are common sense and others are a bit more unusual.


Although Amazon operates highly available data centers, failures within a single region can still affect the hardware and software resources you depend on, such as storage, and availability also depends on mitigating DDoS and application-level attacks, to name a few. Having a DR plan that accounts for a cataclysmic outage of an entire AWS region is a must. In this post, we focus on how we ensured the availability of our services in AWS.

7.1 S3 Direct References

Description: If S3 is used within the application, do not use direct S3 URIs.

Why it's important: While S3's service reliability is very high, S3 outages have proven to be highly problematic for organizations that rely on it. When an S3 outage is limited to a specific AWS region (e.g. US-EAST), the first thought is to transfer all S3-based assets to another region. This is challenging if your application references S3 bucket URIs directly. Instead, the organization should create DNS CNAMEs that point to the S3 bucket and reference those names in the application. If the organization later needs to change regions or buckets, changing the CNAME's value is trivial compared to finding and replacing every instance of the S3 bucket URI.
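
One way to keep bucket URIs out of application code is to build every asset link from a single configurable hostname, and to scan the code base for references that slipped through. The following is a minimal sketch, not our production code: the hostname `assets.example.com` and the helper names `asset_url` and `find_direct_s3_references` are hypothetical, and how the hostname is served (a plain CNAME or a CDN in front of the bucket) is outside the sketch.

```python
import re
from typing import List
from urllib.parse import quote

# Hypothetical CNAME that points at the bucket. Moving buckets or regions
# then means changing DNS (or this one setting), not every reference.
ASSET_HOST = "assets.example.com"

# Matches common S3 endpoint forms such as s3.amazonaws.com,
# s3-us-west-2.amazonaws.com and s3.us-west-2.amazonaws.com.
S3_ENDPOINT = re.compile(r"s3[.-]([a-z0-9-]+\.)?amazonaws\.com")


def asset_url(key: str, host: str = ASSET_HOST) -> str:
    """Build an asset URL from the configurable hostname and an object key."""
    return f"https://{host}/{quote(key.lstrip('/'))}"


def find_direct_s3_references(text: str) -> List[str]:
    """Return the lines of text that still mention an S3 endpoint directly."""
    return [line.strip() for line in text.splitlines() if S3_ENDPOINT.search(line)]


print(asset_url("img/logo.png"))  # https://assets.example.com/img/logo.png
```

Running the scanner over templates and configuration files as part of a build is one way to catch new direct references before they ship.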

7.2 AWS WAF & Shield

Description: Based on the organization's risk profile, engineer the application so that AWS WAF or Shield is enabled.

Why it's important: AWS WAF and Shield give organizations application-level attack and DDoS mitigation for resources such as CloudFront distributions. Because introducing CloudFront into an application is non-trivial, the organization should (based on the application's risk profile) place public web sites behind CloudFront ahead of time. That way the organization is prepared when an application-level attack or DDoS occurs; implementing CloudFront in the middle of a security incident is cumbersome and error-prone. Additionally, security groups only support allow rules, so blocking a specific source IP at the network layer requires network ACLs (NACLs). WAF rules, with Shield for DDoS protection, can be used to block offending sources instead.
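
To make the idea of blocking by source IP concrete, here is a small local illustration of the check that an IP block rule performs conceptually. It is a sketch using Python's standard `ipaddress` module, not an AWS API call; the function name `is_blocked` is hypothetical and the addresses are documentation ranges.

```python
import ipaddress
from typing import Iterable


def is_blocked(source_ip: str, blocked_cidrs: Iterable[str]) -> bool:
    """Return True if source_ip falls inside any of the blocked CIDR ranges."""
    address = ipaddress.ip_address(source_ip)
    return any(
        address in ipaddress.ip_network(cidr, strict=False)
        for cidr in blocked_cidrs
    )


# Example block list: one whole range and one single host.
BLOCK_LIST = ["203.0.113.0/24", "198.51.100.7/32"]

print(is_blocked("203.0.113.45", BLOCK_LIST))  # True
print(is_blocked("192.0.2.10", BLOCK_LIST))    # False
```

A security group has no equivalent of this deny list, which is why the block has to live in a NACL or in a WAF rule.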

7.3 Disaster Recovery Plan

Description: Ensure the organization has a disaster recovery (DR) plan that considers AWS.

Why it's important: Should AWS or the organization's environment suffer a cataclysmic outage that impacts an entire AWS region (e.g. US-EAST), the organization should be able to fail over or transfer operations to a different AWS region. The DR and associated recovery plans should include (at a minimum) the ability to quickly create and configure the organization's VPC, subnets, etc. in a different region.
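
Recreating the network quickly is much easier when its layout is described as data with the region as a parameter. The sketch below is one assumed way to do that: the function `plan_network`, the two-tier layout and the /24 subnet size are illustrative choices, and in practice the resulting plan would feed an infrastructure-as-code tool rather than be used directly.

```python
import ipaddress
from typing import Dict, List


def plan_network(region: str, vpc_cidr: str, az_suffixes: List[str]) -> Dict:
    """Describe a VPC with one public and one private /24 subnet per AZ."""
    # Carve the VPC range into consecutive /24 blocks and hand them out in order.
    blocks = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=24)
    subnets = []
    for suffix in az_suffixes:
        for tier in ("public", "private"):
            subnets.append({
                "az": f"{region}{suffix}",
                "tier": tier,
                "cidr": str(next(blocks)),
            })
    return {"region": region, "vpc_cidr": vpc_cidr, "subnets": subnets}


# The same definition rendered for a primary and a recovery region.
primary = plan_network("us-east-1", "10.10.0.0/16", ["a", "b"])
recovery = plan_network("us-west-2", "10.20.0.0/16", ["a", "b"])
for subnet in recovery["subnets"]:
    print(subnet["az"], subnet["tier"], subnet["cidr"])
```

Giving each region its own non-overlapping CIDR range, as assumed here, keeps the option of connecting the two networks later.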

7.4 Minimal Health Instance Count

Description: Ensure that all critical services are backed by at least two instances.

Why it's important: All critical services should be backed by at least two EC2 instances. For example, critical web servers should live behind an ALB/ELB that includes at least two EC2 instances located in two different Availability Zones, so that the loss of a single instance or zone does not take the service down.
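
A quick way to reason about this control is to count the healthy instances behind a service and the distinct Availability Zones they occupy. The sketch below assumes a simple, hypothetical record format (`az` and `healthy` keys); it does not query AWS.

```python
from typing import Dict, List


def has_redundant_capacity(instances: List[Dict],
                           min_healthy: int = 2,
                           min_zones: int = 2) -> bool:
    """True if enough healthy instances exist across enough Availability Zones."""
    healthy = [i for i in instances if i["healthy"]]
    zones = {i["az"] for i in healthy}
    return len(healthy) >= min_healthy and len(zones) >= min_zones


web_tier = [
    {"id": "i-0aaa", "az": "us-east-1a", "healthy": True},
    {"id": "i-0bbb", "az": "us-east-1b", "healthy": True},
]
same_zone = [
    {"id": "i-0ccc", "az": "us-east-1a", "healthy": True},
    {"id": "i-0ddd", "az": "us-east-1a", "healthy": True},
]

print(has_redundant_capacity(web_tier))   # True
print(has_redundant_capacity(same_zone))  # False: both instances share one zone
```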

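7.5 NAT Gateways

Description: Use AWS NAT gateways instead of NAT instances.

Why it's important: NAT gateways are managed by AWS, which handles redundancy within the Availability Zone, scaling, and patching. That makes them inherently more robust and easier to keep secure than NAT instances you operate yourself.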

7.6 RDS Multi-AZ

Description: Ensure all critical RDS instances are configured as Multi-AZ deployments.

Why it's important: Multi-AZ deployments have higher availability than single-AZ instances, because RDS maintains a standby in a second Availability Zone and fails over to it automatically.
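
This control is easy to audit, because the RDS `DescribeDBInstances` response carries a `MultiAZ` flag for each instance. The sketch below works on a response-shaped dictionary so it runs without AWS credentials; the function name `single_az_instances` and the sample data are hypothetical.

```python
from typing import Dict, List


def single_az_instances(describe_response: Dict) -> List[str]:
    """Return identifiers of DB instances that are not Multi-AZ."""
    return [
        db["DBInstanceIdentifier"]
        for db in describe_response.get("DBInstances", [])
        if not db.get("MultiAZ", False)
    ]


# Hand-written sample in the shape of a DescribeDBInstances response.
# With boto3 installed, the real data would come from:
#   boto3.client("rds").describe_db_instances()
sample_response = {
    "DBInstances": [
        {"DBInstanceIdentifier": "app-primary", "MultiAZ": True},
        {"DBInstanceIdentifier": "reporting", "MultiAZ": False},
    ]
}

print(single_az_instances(sample_response))  # ['reporting']
```

For accounts with many instances the real call is paginated, so an audit would need to walk every page before reporting.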

7.6 ElastiCache Multi-AZ

Description: Ensure all critical ElastiCache clusters are configured with Multi-AZ.

Why it's important: Multi-AZ clusters have higher availability than single-AZ clusters.

7.7 S3 Cross-Region Replication

Description: Enable cross-region replication for critical S3 buckets.

Why it's important: During an S3 outage, having critical assets already replicated to another region makes it much easier to fail over to that region.
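
As a rough illustration, the replication rule for a bucket can be expressed as a small configuration document. The sketch below builds one in the shape accepted by the S3 `PutBucketReplication` operation; the helper name, the bucket name, the account ID and the IAM role name are all placeholders. Versioning must be enabled on both the source and destination buckets before replication can be configured.

```python
from typing import Dict


def replication_config(role_arn: str, dest_bucket: str,
                       rule_id: str = "replicate-critical-assets") -> Dict:
    """Build a replication configuration that copies every object."""
    return {
        "Role": role_arn,  # IAM role S3 assumes to replicate objects
        "Rules": [{
            "ID": rule_id,
            "Priority": 1,
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # empty prefix: apply to all objects
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {"Bucket": f"arn:aws:s3:::{dest_bucket}"},
        }],
    }


config = replication_config(
    role_arn="arn:aws:iam::123456789012:role/s3-replication",  # placeholder
    dest_bucket="example-assets-us-west-2",                    # placeholder
)
# With boto3 installed, this would be applied to the source bucket with:
#   boto3.client("s3").put_bucket_replication(
#       Bucket="example-assets-us-east-1", ReplicationConfiguration=config)
print(config["Rules"][0]["Destination"]["Bucket"])
```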

7.8 Auto Scaling Groups

Description: Use Auto Scaling groups for critical EC2 instances.

Why it's important: Placing critical EC2 instances in an Auto Scaling group with a minimum size means failed instances are replaced automatically and a minimum number of instances is available at any given time.
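
The minimum can be enforced where the group is defined. The sketch below builds the parameters for the Auto Scaling `CreateAutoScalingGroup` operation and refuses a floor below two instances; the helper name, group name, launch template name and subnet IDs are hypothetical, and it assumes each subnet sits in a different Availability Zone.

```python
from typing import Dict, List


def asg_request(name: str, launch_template: str, subnet_ids: List[str],
                min_size: int = 2, max_size: int = 4) -> Dict:
    """Build keyword arguments for an Auto Scaling group with a capacity floor."""
    if min_size < 2:
        raise ValueError("critical services need at least two instances")
    if max_size < min_size:
        raise ValueError("max_size must not be smaller than min_size")
    if len(set(subnet_ids)) < 2:
        raise ValueError("provide subnets in at least two Availability Zones")
    return {
        "AutoScalingGroupName": name,
        "LaunchTemplate": {"LaunchTemplateName": launch_template,
                           "Version": "$Latest"},
        "MinSize": min_size,
        "MaxSize": max_size,
        "DesiredCapacity": min_size,
        "VPCZoneIdentifier": ",".join(subnet_ids),  # comma-separated subnet IDs
        "HealthCheckType": "ELB",       # replace instances the load balancer marks unhealthy
        "HealthCheckGracePeriod": 300,  # seconds to wait before the first health check
    }


request = asg_request("web", "web-template", ["subnet-aaa", "subnet-bbb"])
# With boto3 installed:
#   boto3.client("autoscaling").create_auto_scaling_group(**request)
print(request["MinSize"], request["VPCZoneIdentifier"])
```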

7.9 Elastic IP EC2 Use

Description: Where possible, avoid attaching Elastic IPs to EC2 instances.

Why it's important: An Elastic IP exposes an EC2 instance directly to the internet. It's preferable to provide outbound internet access via a NAT gateway and to have inbound traffic traverse an ALB/ELB.

7.10 Enable RDS Snapshots

Description: Enable RDS snapshots to facilitate backup and recovery.

Why it's important: Enable an aggressive RDS snapshot schedule. AWS handles the snapshots for you, and the cost is small compared with the cost of losing data with no snapshot to restore from.
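
For automated backups, the schedule comes down to two settings on the instance: how many days of backups to retain and when the daily backup window runs. The sketch below builds parameters for the RDS `ModifyDBInstance` operation; the helper name, instance identifier, 14-day retention and backup window are illustrative assumptions, not the values we used. A retention period of 0 disables automated backups, so the helper rejects it.

```python
from typing import Dict


def backup_settings(db_instance_id: str, retention_days: int = 14,
                    window_utc: str = "03:00-04:00",
                    apply_immediately: bool = False) -> Dict:
    """Build keyword arguments that turn on automated RDS backups."""
    # RDS accepts 0 to 35 days; 0 would switch automated backups off.
    if not 1 <= retention_days <= 35:
        raise ValueError("retention_days must be between 1 and 35")
    return {
        "DBInstanceIdentifier": db_instance_id,
        "BackupRetentionPeriod": retention_days,
        "PreferredBackupWindow": window_utc,  # daily window, in UTC
        "ApplyImmediately": apply_immediately,
    }


settings = backup_settings("app-primary")
# With boto3 installed:
#   boto3.client("rds").modify_db_instance(**settings)
print(settings["BackupRetentionPeriod"])  # 14
```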

Previous Post: Part 6 - Access Security