What we learned about cloud security running a SaaS in AWS for 5 years - Part 4 - Network Security
This is Part 4 of a multi-part series of posts on how we securely ran ThreatSim in AWS for 5 years and never lost a customer (that we know of) due to any cloud security concerns.
In a traditional data center, network security plays a huge role in how you apply security controls. Firewalls, network segmentation, switches, IDS/IPS, etc. all make up your network security controls. In AWS, however, there are no firewalls per se, nor do customers have access to a span port to monitor all network traffic. There are AMI-based solutions, but they are generally gateway/firewall solutions that pass traffic out of the VPC. In a large-scale SaaS, I'm not a fan of using AMI-based traditional/legacy security gateways in AWS, especially for high-traffic applications. It just feels really awkward given that the "AWS way" of moving traffic in and out of your VPC is to use ELB/ALBs and NAT gateways.
One vendor solution that fills the IDS/IPS void and is frankly really cool (and an engineering feat IMHO) is ProtectWise. I saw their talk at re:Invent 2016 and was pretty impressed. Their first big customer was Netflix. So yeah.
Back to AWS: the primary network security control within AWS is the security group. One concept that I really like (and like explaining to customers) is that security groups essentially "firewall off" hosts from one another. This is far more granular than a traditional VLAN and subnet. I'd argue it's better than a traditional firewall since pretty much every host starts out with a default-deny policy toward everything.
One thing we learned about explaining AWS network security controls to customers and prospects is to describe the setup using traditional terms (e.g. DMZ, subnets, firewall rules, etc.). Having a nice network diagram with user data flows is also hugely helpful. Not all customer security people are hip to AWS security concepts, so explaining things in traditional terms is key.
Here are our network security controls:
VPC Use
Description: All EC2 instances should reside within a VPC.
Why it's important: VPCs provide a logical separation between resources and are more conducive to a secure environment.
Security Group Default-Deny
Description: Configure security groups so that only those services that support a valid business requirement are allowed to the resource.
Why it's important: A firewall configured with a default-deny policy is a battle-tested network security concept. Configure AWS security groups so that only those services required to satisfy a valid business requirement are permitted from the Internet into the environment's edge. Additionally, only allow those services required within the VPC. For example, an internet-exposed web server does not typically require SSH access to other devices within the VPC.
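One way to keep yourself honest about default-deny is to audit security groups for rules that are open to the world. Here is a minimal sketch, assuming you have already fetched a security group in the dict shape the EC2 describe-security-groups API returns. The function name, the allowlist of ports, and the sample group are all hypothetical and only there for illustration.

```python
# Hypothetical allowlist: the only TCP ports we expect to be open to the world.
ALLOWED_PUBLIC_PORTS = {80, 443}
WORLD = {"0.0.0.0/0", "::/0"}


def world_open_violations(security_group, allowed_ports=ALLOWED_PUBLIC_PORTS):
    """Return (protocol, from_port, to_port) for each ingress rule that is
    open to the whole internet on something other than one allowed port."""
    violations = []
    for rule in security_group.get("IpPermissions", []):
        sources = {r["CidrIp"] for r in rule.get("IpRanges", [])}
        sources |= {r["CidrIpv6"] for r in rule.get("Ipv6Ranges", [])}
        if not sources & WORLD:
            continue  # rule is scoped to specific ranges or groups
        protocol = rule["IpProtocol"]
        from_port = rule.get("FromPort")  # absent when protocol is "-1" (all)
        to_port = rule.get("ToPort")
        single_allowed_port = (
            protocol == "tcp" and from_port == to_port and from_port in allowed_ports
        )
        if not single_allowed_port:
            violations.append((protocol, from_port, to_port))
    return violations


# Made-up sample group: HTTPS open to the world is fine, SSH is not.
web_sg = {
    "GroupId": "sg-0123456789abcdef0",
    "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
    ],
}
print(world_open_violations(web_sg))  # [('tcp', 22, 22)]
```

This only looks at internet-wide sources. Rules scoped to specific CIDR ranges or other security groups still deserve a manual review against the business requirement.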
Security Groups By Role
Description: Create security groups that are specific to a given device role.
Why it's important: Security groups should be configured to be role-specific. For example, an internet-facing web server may have ports (TCP) 80 and 443 open to the ALB/ELB IP ranges and allow SSH access from a jumpbox. This allows a single security group to be applied to a scalable group of devices that will maintain a consistent network security posture.
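As a rough illustration of a role-specific group, the sketch below builds the ingress rules for a hypothetical "web" role. Note one assumption: instead of the load balancer's IP ranges mentioned above, it references the load balancer's and jumpbox's security groups as the traffic source, which is a different (though common) way to express the same intent. The function name and group IDs are made up.

```python
def web_role_ingress(alb_sg_id, jumpbox_sg_id):
    """Ingress rules for a hypothetical "web" role, in the IpPermissions
    shape used by the EC2 security group APIs."""

    def tcp_from_group(port, group_id, description):
        # Allow a single TCP port, but only from members of another group.
        return {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "UserIdGroupPairs": [{"GroupId": group_id, "Description": description}],
        }

    return [
        tcp_from_group(80, alb_sg_id, "HTTP from load balancer"),
        tcp_from_group(443, alb_sg_id, "HTTPS from load balancer"),
        tcp_from_group(22, jumpbox_sg_id, "SSH from jumpbox"),
    ]


# Placeholder group IDs; substitute the real ones from your account.
rules = web_role_ingress("sg-0aaaaaaaaaaaaaaaa", "sg-0bbbbbbbbbbbbbbbb")
for rule in rules:
    print(rule["FromPort"], rule["UserIdGroupPairs"][0]["Description"])
```

If you use boto3, a list like this can be passed as the IpPermissions argument of the EC2 client's authorize_security_group_ingress call. Because every instance in the role gets the same group, scaling the role out does not change the network posture.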
Subnet By Role
Description: Place EC2 instances on role-specific subnets.
Why it's important: Grouping instances with the same function into subnets is a logical separation that simplifies management. For example, a VPC may use the 10.0.10.0/24 subnet for public-facing web servers and 10.0.11.0/24 for internal application servers. In this pattern, it is easy to determine what role an instance plays simply by looking at its IP address. This pattern is also useful when working with different security controls (e.g. VPC flow data, log alert messages, etc.) since events can be prioritized within the context of the IP address. For example, the organization may create a VPC flow log alert that fires if non-HTTP traffic is seen going to a subnet that should only contain HTTP services.
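To make the "role from IP address" idea concrete, here is a small sketch using Python's standard ipaddress module and the two example subnets above. The role names and function names are hypothetical.

```python
import ipaddress

# Subnet-to-role map using the example ranges from the text.
SUBNET_ROLES = {
    ipaddress.ip_network("10.0.10.0/24"): "public-web",
    ipaddress.ip_network("10.0.11.0/24"): "internal-app",
}
HTTP_PORTS = {80, 443}


def role_for_ip(address):
    """Return the role of the subnet containing `address`, or None."""
    ip = ipaddress.ip_address(address)
    for network, role in SUBNET_ROLES.items():
        if ip in network:
            return role
    return None


def is_unexpected_web_traffic(dst_address, dst_port):
    """True if traffic is headed to the web subnet on a non-HTTP port."""
    return role_for_ip(dst_address) == "public-web" and dst_port not in HTTP_PORTS


print(role_for_ip("10.0.11.7"))                        # internal-app
print(is_unexpected_web_traffic("10.0.10.25", 3306))   # True
print(is_unexpected_web_traffic("10.0.10.25", 443))    # False
```

A real alert would need more context than this. For instance, return traffic for connections the web servers open themselves, or SSH from the jumpbox, would also arrive on non-HTTP ports and would need to be excluded.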
Subnet By Risk
Description: Place EC2 instances on subnets by risk level.
Why it's important: In addition to placing EC2 instances on subnets by role, it is also helpful to place instances on subnets by risk or trust level. For example, devices that are exposed to the internet (either directly or indirectly via an ALB/ELB) should be placed on a specific network segment. This pattern allows security personnel to understand the context of an event from the IP address. Critical data stores should be located on subnets that do not have any direct internet connectivity (e.g. elastic IP addresses).
Non-AWS Service Exposure
Description: When possible, do not expose self-managed (non-AWS) services to the internet. Rather, all exposed surfaces should be AWS services (e.g. ALB/ELB, S3, etc.).
Why it's important: AWS has a proven track record of robust and well-secured services. For example, there are no known exploits against AWS ALB/ELB services. While this does not eliminate the need to patch critical services, it reduces the likelihood that an attacker will successfully exploit an exposed vulnerability. Another advantage of only exposing AWS services is that it reduces the organization's attack surface, and AWS is often alerted to widespread security events early (e.g. AWS patched the Heartbleed bug before it was publicly known).
Egress Security Groups
Description: Filter network egress traffic from the VPC to the internet and other subnets.
Why it's important: Egress rules make an attacker's job more difficult by preventing data exfiltration or establishment of command and control (C2) from the compromised device to a host on the Internet. Given the nature of many applications and associated 3rd party software, it may not be possible to completely block all outbound traffic, but filtering the traffic down to a smaller list of services and protocols still shrinks the attacker's options. For example, an EC2 instance running Ubuntu may require HTTP/S access for updates. Since Ubuntu's update servers run on a variety of (likely) load-balanced hosts, specifying a specific IP address may not be practical. Additionally, 3rd party monitoring solutions (e.g. Datadog, New Relic, etc.) may require outbound HTTPS, but do not specify a specific IP address. In these cases it is permissible to allow only HTTP/S out.
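Keep in mind that a newly created security group allows all outbound traffic by default, so egress filtering has to be set up deliberately. The sketch below shows one way to spot that default rule and what an "HTTP/S out only" rule set could look like. It assumes the dict shape returned by the EC2 describe-security-groups API; the function names and sample group are hypothetical.

```python
def has_allow_all_egress(security_group):
    """True if any egress rule allows every protocol to every address."""
    for rule in security_group.get("IpPermissionsEgress", []):
        if rule.get("IpProtocol") != "-1":  # "-1" means all protocols
            continue
        cidrs = [r["CidrIp"] for r in rule.get("IpRanges", [])]
        cidrs += [r["CidrIpv6"] for r in rule.get("Ipv6Ranges", [])]
        if "0.0.0.0/0" in cidrs or "::/0" in cidrs:
            return True
    return False


def web_only_egress():
    """Egress rules allowing only outbound HTTP and HTTPS to any address."""
    return [
        {"IpProtocol": "tcp", "FromPort": port, "ToPort": port,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}
        for port in (80, 443)
    ]


# Made-up sample: a group that still carries the default allow-all rule.
default_sg = {
    "IpPermissionsEgress": [
        {"IpProtocol": "-1", "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
    ],
}
print(has_allow_all_egress(default_sg))                                 # True
print(has_allow_all_egress({"IpPermissionsEgress": web_only_egress()}))  # False
```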
Enable VPC Flow Data
Description: Enable VPC flow data so that the VPC's traffic is logged.
Why it's important: VPC flow data provides valuable insights when troubleshooting and investigating security incidents. Ideally, the organization should consume the flow data into a solution that is conducive to archival and searching.
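If you don't have a log platform in place yet, even a small script can get value out of flow data. Here is a minimal sketch that parses records in the default (version 2) flow log format and counts rejected connections per source address. It assumes the default space-separated field order; custom formats would need a different field list. The function names and sample records are illustrative only.

```python
from collections import Counter, namedtuple

# Field order of the default (version 2) VPC flow log record format.
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
)
FlowRecord = namedtuple("FlowRecord", FIELDS)


def parse_flow_record(line):
    """Parse one space-separated flow log line; return None if malformed."""
    parts = line.split()
    if len(parts) != len(FlowRecord._fields):
        return None
    return FlowRecord(*parts)


def top_rejected_sources(lines, limit=5):
    """Count REJECT records per source address, most frequent first."""
    counts = Counter()
    for line in lines:
        record = parse_flow_record(line)
        if record is not None and record.action == "REJECT":
            counts[record.srcaddr] += 1
    return counts.most_common(limit)


# Illustrative records, not real traffic.
sample = [
    "2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 "
    "20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK",
    "2 123456789010 eni-1235b8ca123456789 172.31.9.69 172.31.9.12 "
    "49761 3389 6 20 4249 1418530010 1418530070 REJECT OK",
]
print(top_rejected_sources(sample))  # [('172.31.9.69', 1)]
```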
No Direct Administrative Access
Description: Do not allow direct administrative access to devices in the VPC. Rather, use a jumpbox.
Why it's important: Restricting all administrative access to a jumpbox allows security personnel to implement strict security controls on a single device. A jumpbox also provides a "chokepoint" where the majority of access controls can be implemented. For example, configure a jumpbox that is only accessible from approved IP addresses or address ranges (e.g. VPN or corporate IP address).
Network Perimeter Awareness
Description: Obtain a comprehensive and up-to-date list of all externally exposed assets.
Why it's important: Given that elastic IPs (EIPs) and other AWS external IP addresses and resources are not dedicated to a customer and may change over time, it is critical that the organization have a process to determine its external presence. In some cases this will be IP addresses (e.g. EIPs); in other cases it will be hostnames (e.g. S3, ALB/ELB, RDS, etc.). This information is useful as part of the organization's external security assessment program.
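A perimeter inventory can start out as a simple script. The sketch below assumes you have already fetched the responses from the EC2 describe-addresses, ELBv2 describe-load-balancers, and RDS describe-db-instances calls (for example with boto3) and merges them into one list. The function name and sample data are hypothetical, and the sketch deliberately ignores pagination, S3 buckets, and instance public IPs that are not EIPs.

```python
def external_inventory(addresses, load_balancers, db_instances):
    """Build a sorted list of (kind, identifier) for internet-reachable assets.

    Each argument is the already-fetched response dict from the matching
    describe call.
    """
    assets = set()
    for eip in addresses.get("Addresses", []):
        assets.add(("elastic-ip", eip["PublicIp"]))
    for lb in load_balancers.get("LoadBalancers", []):
        if lb.get("Scheme") == "internet-facing":  # skip internal balancers
            assets.add(("load-balancer", lb["DNSName"]))
    for db in db_instances.get("DBInstances", []):
        if db.get("PubliclyAccessible") and "Endpoint" in db:
            assets.add(("rds", db["Endpoint"]["Address"]))
    return sorted(assets)


# Made-up responses using documentation-reserved addresses and names.
inventory = external_inventory(
    {"Addresses": [{"PublicIp": "203.0.113.10"}]},
    {"LoadBalancers": [
        {"Scheme": "internet-facing", "DNSName": "public-alb.example.com"},
        {"Scheme": "internal", "DNSName": "internal-alb.example.com"},
    ]},
    {"DBInstances": [
        {"PubliclyAccessible": False, "Endpoint": {"Address": "db.example.com"}},
    ]},
)
for kind, identifier in inventory:
    print(kind, identifier)
```

Running something like this on a schedule and diffing the output against the previous run is one way to notice when the external footprint changes.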