
Efficiency at Scale: A Tale of AWS Cost Optimization



Understanding total spend is a common challenge for cloud users, especially on projects with complex pricing models. This article explores the top AWS cost optimizations that will help you scale your platform effectively.

Author
Rudolf is a data scientist who has architected big data processing infrastructures on AWS and implemented data engineering solutions for Fortune 500 companies, including Disneyland Hong Kong and Philip Morris. He was invited to participate in NASA’s 2021 International Space Apps Challenge as a speaker and judge.

I recently launched a cryptocurrency analysis platform, expecting a small number of daily users. However, when some popular YouTubers found the site helpful and published a review, traffic grew so quickly that it overloaded the server, and the platform (Scalper.AI) became inaccessible. My original AWS EC2 environment needed extra support. After considering multiple solutions, I decided to use AWS Elastic Beanstalk to scale my application. Things were looking good and running smoothly, but I was taken aback by the costs in the billing dashboard.

This is not an uncommon issue. A survey from 2021 found that 82% of IT and cloud decision-makers have encountered unnecessary cloud costs, and 86% don’t feel they can get a comprehensive view of all their cloud spending. Though Amazon offers a detailed overview of additional expenses in its documentation, the pricing model is complex for a growing project. To make things easier to understand, I’ll break down a few relevant optimizations to reduce your cloud costs.

Why I Chose AWS

The goal of Scalper.AI is to collect information about cryptocurrency pairs (the assets swapped when trading on an exchange), run statistical analyses, and provide crypto traders with insights about the state of the market. The technical structure of the platform consists of three parts:

  • Data ingestion scripts
  • A web server
  • A database

The ingestion scripts gather data from different sources and load it to the database. I had experience working with AWS services, so I decided to deploy these scripts by setting up EC2 instances. EC2 offers many instance types and lets you choose an instance’s processor, storage, network, and operating system.

I chose Elastic Beanstalk for the remaining functionality because it promised smooth application management. The load balancer distributed traffic evenly across my server's instances, while the autoscaling feature added new instances under increased load. Deploying updates became very easy, taking just a few minutes.

Scalper.AI worked stably, and my users no longer faced downtime. Of course, I expected an increase in spending since I added extra services, but the numbers were much larger than I had predicted.

How I Could Have Reduced Cloud Costs

Looking back, there were many areas of complexity in my project’s use of AWS services. We’ll examine the budget optimizations I discovered while working with common AWS EC2 features: burstable performance instances, outbound data transfers, elastic IP addresses, and terminate and stop states.

Burstable Performance Instances

My first challenge was provisioning enough CPU power for my growing project. Scalper.AI's data ingestion scripts provide users with real-time information analysis; the scripts run every few seconds and feed the platform with the most recent updates from crypto exchanges. Each iteration of this process generates hundreds of asynchronous jobs, so the site's increased traffic necessitated more CPU power to keep processing time down.
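
The real ingestion code isn't public, but a minimal sketch of this fan-out pattern, with hypothetical exchange endpoints, looks something like this:

```python
import asyncio

import aiohttp

# Hypothetical exchange endpoints, one per crypto pair; the real
# platform tracks hundreds of pairs across several sources.
PAIR_URLS = [
    "https://api.example-exchange.com/ticker/BTC-USDT",
    "https://api.example-exchange.com/ticker/ETH-USDT",
]

async def fetch_ticker(session: aiohttp.ClientSession, url: str) -> dict:
    """Fetch the latest ticker snapshot for one pair."""
    async with session.get(url) as resp:
        resp.raise_for_status()
        return await resp.json()

async def ingest_once() -> list:
    """Fan out one async job per pair and collect the results."""
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_ticker(session, url) for url in PAIR_URLS]
        return await asyncio.gather(*tasks, return_exceptions=True)

async def main() -> None:
    while True:
        results = await ingest_once()
        # In the real pipeline, results are written to the database here.
        await asyncio.sleep(5)  # rerun every few seconds

if __name__ == "__main__":
    asyncio.run(main())
```

With hundreds of pairs, every run of this loop is a burst of concurrent work, which is exactly the load profile that matters for the billing discussion below.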

The cheapest instance offered by AWS with four vCPUs, a1.xlarge, would have cost me ~$75 per month at the time. Instead, I decided to spread the load between two t3.micro instances with two vCPUs and 1GB of RAM each. The t3.micro instances offered enough speed and memory for the job I needed at one-fifth of the a1.xlarge’s cost. Nevertheless, my bill was still larger than I expected at the end of the month.

In an effort to understand why, I searched Amazon's documentation and found the answer: t3 instances are burstable. When an instance's CPU usage falls below a defined baseline, it earns credits; when it bursts above the baseline, it spends the previously earned credits. If no credits are available, the instance spends Amazon-provided "surplus credits." Because of this earn-and-spend cycle, Amazon EC2 effectively averages an instance's CPU usage over 24 hours, and if the average stays above the baseline, the surplus is billed extra at a flat rate per vCPU-hour.

I monitored the data ingestion instances for multiple days and found that my CPU setup, which was intended to cut costs, did the opposite. Most of the time, my average CPU usage was higher than the baseline.

[Chart: CloudWatch graphs of daily cost (top) and CPU credit usage (bottom), configured via three drop-down menus.]

The above chart displays cost surges (top graph) and increasing CPU credit usage (bottom graph) during a period when CPU usage was above the baseline. The dollar cost is proportional to the surplus credits spent, since the instance is billed per vCPU-hour.
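
You don't have to watch these graphs in the console; the same credit metrics are available through the CloudWatch API. Here's a minimal boto3 sketch (the instance ID is a placeholder) that pulls hourly averages for the last 24 hours:

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
INSTANCE_ID = "i-0123456789abcdef0"  # placeholder instance ID

now = datetime.now(timezone.utc)
for metric in ("CPUUtilization", "CPUCreditUsage", "CPUSurplusCreditsCharged"):
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName=metric,
        Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
        StartTime=now - timedelta(hours=24),
        EndTime=now,
        Period=3600,  # one datapoint per hour
        Statistics=["Average"],
    )
    points = sorted(stats["Datapoints"], key=lambda d: d["Timestamp"])
    print(metric, [round(p["Average"], 2) for p in points])
```

A consistently nonzero CPUSurplusCreditsCharged series is the signal that a burstable instance is quietly costing more than a fixed-performance one would.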

I had initially analyzed CPU usage for a few crypto pairs; the load was small, so I thought I had plenty of space for growth. (I used just one micro-instance for data ingestion since fewer crypto pairs required less CPU power.) However, I realized the limitations of my original analysis once I decided to make my insights more comprehensive and support the ingestion of data for hundreds of crypto pairs—cloud service analysis means nothing unless performed at the correct scale.

Outbound Data Transfers

Another result of my site’s expansion was increased data transfers from my app due to a small bug. With traffic growing steadily and no more downtime, I needed to add features to capture and hold users’ attention as soon as possible. My newest update was an audio alert triggered when a crypto pair’s market conditions matched the user’s predefined parameters. Unfortunately, I made a mistake in the code, and audio files loaded into the user’s browser hundreds of times every few seconds.

The impact was huge. My bug generated audio downloads from my web servers, causing additional outbound data transfers. A tiny error in my code resulted in a bill almost five times larger than the previous ones. (This wasn’t the only consequence: The bug could cause a memory leak in the user’s computer, so many users stopped coming back.)

[Chart: CloudWatch graphs of daily cost (top) and outbound data transfer (bottom).]

The above chart displays cost surges (top graph) and increasing outbound data transfers (bottom graph). Because outbound data transfers are billed per GB, the dollar cost is proportional to the outbound data usage.

Data transfer costs can account for upward of 30% of AWS price surges. EC2 inbound transfer is free, but outbound transfer is billed per GB ($0.09 per GB when I built Scalper.AI). As I learned the hard way, it is important to be cautious with code affecting outbound data; reducing downloads or file loading where possible (or carefully monitoring these areas) will protect you from higher fees. These pennies add up quickly, since charges for transferring data from EC2 to the internet depend on the workload and AWS Region-specific rates. A final caveat unknown to many new AWS customers: Data transfer between AWS locations also costs money, with cross-Region and cross-Availability-Zone traffic billed at its own per-GB rates. Traffic between instances in the same availability zone is free over private IP addresses but billed if routed through public or Elastic IPs, so keeping internal traffic on private addresses avoids unnecessary transfer charges.
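
One cheap safeguard is to track the NetworkOut metric for your instances and convert bytes into an approximate dollar figure. A hedged boto3 sketch (the instance ID is a placeholder, and the $0.09/GB rate is simply what I paid; actual rates vary by Region and usage tier):

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
INSTANCE_ID = "i-0123456789abcdef0"  # placeholder instance ID
PRICE_PER_GB = 0.09  # the outbound rate I paid; check your Region's pricing

now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="NetworkOut",  # total bytes sent out by the instance
    Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
    StartTime=now - timedelta(days=1),
    EndTime=now,
    Period=86400,  # a single datapoint covering the whole day
    Statistics=["Sum"],
)
total_gb = sum(d["Sum"] for d in stats["Datapoints"]) / 1024**3
# NetworkOut counts all outbound traffic, so treat this as an upper bound.
print(f"Outbound yesterday: {total_gb:.2f} GB (~${total_gb * PRICE_PER_GB:.2f})")
```

Had I been running a check like this daily, the audio-file bug would have shown up as a spike within a day instead of as a line item on the monthly bill.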

Elastic IP Addresses

Even when using public addresses such as Elastic IP addresses (EIPs), it is possible to lower your EC2 costs. EIPs are static IPv4 addresses used for dynamic cloud computing. The “elastic” part means that you can assign an EIP to any EC2 instance and use it until you choose to stop. These addresses let you seamlessly swap unhealthy instances with healthy ones by remapping the address to a different instance in your account. You can also use EIPs to specify a DNS record for a domain so that it points to an EC2 instance.

AWS provides only five EIPs per account per region by default, making them a limited resource that gets costly with inefficient use. An EIP is free while it is attached to a running instance, but AWS charges a low hourly rate for each additional or unattached EIP and bills extra if you remap an EIP more than 100 times in a month; staying under these limits will lower costs.
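
Unattached addresses are the most common source of these charges, and they're easy to audit. A minimal boto3 sketch that lists EIPs not associated with any instance (the release call is commented out because it's destructive):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# An address without an AssociationId is allocated but unattached,
# which is exactly the state AWS bills hourly for.
for address in ec2.describe_addresses()["Addresses"]:
    if "AssociationId" not in address:
        print(f"Unassociated EIP: {address['PublicIp']} "
              f"(allocation {address['AllocationId']})")
        # Release it to stop the charge (destructive, so commented out):
        # ec2.release_address(AllocationId=address["AllocationId"])
```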

Terminate and Stop States

AWS provides two options for managing the state of running EC2 instances: terminate or stop. Terminating shuts down the instance, and the virtual machine provisioned for it will no longer be available. Any attached Elastic Block Store (EBS) volumes marked to delete on termination (the default for root volumes) will be detached and deleted, and all data stored locally on the instance will be lost. You will no longer be charged for the instance.

Stopping an instance is similar, with one small difference. The attached EBS volumes are not deleted, so their data is preserved, and you can restart the instance at any time. In both cases, Amazon no longer charges for using the instance, but if you opt for stopping instead of terminating, the EBS volumes will generate a cost as long as they exist. AWS recommends stopping an instance only if you expect to reactivate it soon.
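
Both state changes are a single API call, which makes it easy to script a shutdown of instances you don't need around the clock. A short boto3 sketch (the instance ID is a placeholder):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
INSTANCE_ID = "i-0123456789abcdef0"  # placeholder instance ID

# Stop: instance-hour billing ends, but attached EBS volumes
# persist and continue to accrue storage charges.
ec2.stop_instances(InstanceIds=[INSTANCE_ID])

# Terminate: instance-hour billing ends, and volumes whose
# DeleteOnTermination flag is set are deleted along with it.
# ec2.terminate_instances(InstanceIds=[INSTANCE_ID])
```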

But there's a feature that can enlarge your AWS bill at the end of the month even if you terminated an instance instead of stopping it: EBS snapshots. These are incremental backups of your EBS volumes stored in Amazon's Simple Storage Service (S3). Each snapshot holds the information you need to create a new EBS volume with your previous data. If you terminate an instance, its associated EBS volumes may be deleted automatically, but its snapshots will remain. As S3 charges by the volume of data stored, I recommend deleting these snapshots if you won't need them soon. AWS lets you monitor per-volume storage activity using the CloudWatch service:

  1. While logged into the AWS Console, from the top-left Services menu, find and open the CloudWatch service.
  2. On the left side of the page, under the Metrics collapsible menu, click on All Metrics.
  3. The page shows a list of services with metrics available, including EBS, EC2, S3, and more. Click on EBS and then on Per-volume Metrics. (Note: The EBS option will be visible only if you have EBS volumes configured on your account.)
  4. Click on the Query tab. In the Editor view, copy and paste the command SELECT AVG(VolumeReadBytes) FROM "AWS/EBS" GROUP BY VolumeId and then click Run. (Note: CloudWatch uses a dialect of SQL with a unique syntax.)
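
If you prefer code to the console, the same Metrics Insights query from step 4 can be run through the GetMetricData API. A minimal boto3 sketch:

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

now = datetime.now(timezone.utc)
response = cloudwatch.get_metric_data(
    MetricDataQueries=[{
        "Id": "ebs_reads",
        "Expression": 'SELECT AVG(VolumeReadBytes) FROM "AWS/EBS" GROUP BY VolumeId',
        "Period": 3600,  # hourly datapoints
    }],
    StartTime=now - timedelta(days=1),
    EndTime=now,
)
for result in response["MetricDataResults"]:
    # Volumes whose read activity is flat at zero are cleanup candidates.
    print(result["Label"], result["Values"])
```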

An overview of the CloudWatch monitoring setup described above (shown with empty data and no metrics selected). If you have existing EBS, EC2, or S3 instances on your account, these will show up as metric options and will populate your CloudWatch graph.

CloudWatch offers a variety of visualization formats for analyzing storage activity, such as pie charts, lines, bars, stacked area charts, and numbers. Using CloudWatch to identify inactive EBS volumes and snapshots is an easy step toward optimizing cloud costs.
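
To act on what those charts reveal, you can list unattached volumes and leftover snapshots directly. A hedged boto3 sketch (the delete calls are commented out because they're destructive):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Volumes in the "available" state are attached to no instance
# but are still billed for their provisioned storage.
volumes = ec2.describe_volumes(
    Filters=[{"Name": "status", "Values": ["available"]}]
)["Volumes"]
for volume in volumes:
    print(f"Unattached volume: {volume['VolumeId']} ({volume['Size']} GiB)")
    # ec2.delete_volume(VolumeId=volume["VolumeId"])

# Snapshots owned by this account survive instance termination
# and are billed for the data they keep in S3.
for snapshot in ec2.describe_snapshots(OwnerIds=["self"])["Snapshots"]:
    print(f"Snapshot: {snapshot['SnapshotId']} from {snapshot['StartTime']}")
    # ec2.delete_snapshot(SnapshotId=snapshot["SnapshotId"])
```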

Extra Money in Your Pocket

Though AWS tools such as CloudWatch offer decent solutions for cloud cost monitoring, various external platforms integrate with AWS for more comprehensive analysis. For example, cloud management platforms like VMware's CloudHealth show a detailed breakdown of top spending areas that can be used for trend analysis, anomaly detection, and cost and performance monitoring. I also recommend that you set up a CloudWatch billing alarm to detect any surges in charges before they become excessive.
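
Setting up that billing alarm is one API call. A minimal boto3 sketch (billing metrics are published only in us-east-1, the SNS topic ARN is a placeholder for one you'd create yourself, and billing alerts must be enabled in your account preferences first):

```python
import boto3

# Billing metrics are published only in us-east-1.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="monthly-bill-over-100-usd",
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=21600,  # EstimatedCharges updates a few times per day
    EvaluationPeriods=1,
    Threshold=100.0,  # fire once the month-to-date bill passes $100
    ComparisonOperator="GreaterThanThreshold",
    # Placeholder SNS topic that notifies you; create it beforehand.
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],
)
```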

Amazon provides many great cloud services that can help you delegate the maintenance work of servers, databases, and hardware to the AWS team. Though cloud platform costs can easily grow due to bugs or user errors, AWS monitoring tools equip developers with the knowledge to defend themselves from additional expenses.

With these cost optimizations in mind, you’re ready to get your project off the ground—and save hundreds of dollars in the process.

As an Advanced Consulting Partner in the Amazon Partner Network (APN), Toptal offers companies access to AWS-certified experts, on demand, anywhere in the world.

