
How Inadequate Volume Testing Spikes AWS S3 Charges

This story was contributed by an anonymous source due to privacy concerns.

Sometimes, as a cloud footprint increases and infrastructure grows, gaps in volume testing can create irregularities that lead to unforeseen cost spikes.

Our organization has a centralized FinOps team that supports central IT functions, finance, and cloud consumers in different and distinct pillars across the organization. As well as the native tools provided by AWS and Azure, we have a major Cloud Management Platform (CMP) tool that is used as a single pane of glass across both providers. In addition to its cost reporting and recommendation features, the CMP also provides alerting on cost anomalies through automated emails.

At the time this anomaly occurred, a single email typically contained anomalies across multiple billing accounts and subscriptions, which meant additional work to analyze each email, carve the anomalies up, and get the right messages to the correct audiences.

What we saw and what it meant

The issue started with alerts around the cost of AWS S3 API operations. These seemed to be going up disproportionately relative to the increase in S3 storage across the accounts. As we already had some tagging in place, we were able to identify that one of the applications was being migrated from staging into production, and with that came data volumes many times larger than those that had been used for testing.

We were also getting alerts around a very large increase in CloudTrail costs. The S3 and CloudTrail costs were increasing at a far steeper rate for this one application than for any of the others. Combined, this was costing several hundred dollars per day.

When we worked with the application team to dig into what was going on, we found that it was not simply the volume of data that was being stored, but the size and number of files in which it was stored.

Data partitioning decisions can be the simple choice of having a small number of large files or a large number of small files. In this case the original decision was to have a massive number of tiny files, and from a purely engineering perspective it had made sense. This decision was taken without considering how the volumes would scale or whether that would change the way certain charges were incurred.

The anomaly occurred when the workload was moved from a staging environment to a production environment and request volumes rocketed. This was compounded further by the tight relationship between the S3 API calls and CloudTrail, resulting in a double whammy on the cost front.
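To make the scaling effect concrete, here is a rough back-of-the-envelope sketch (not the actual figures from the incident) comparing the request-related charges for many tiny objects versus fewer, larger objects holding the same amount of data. The unit prices and object counts are illustrative assumptions based on typical published S3 request and CloudTrail data-event rates; substitute your own region's pricing.

```python
# Illustration (hypothetical numbers) of how object count, not data volume,
# drives S3 request charges and CloudTrail data-event charges.

# Illustrative unit prices (assumptions; check current AWS pricing for your region)
S3_PUT_PER_1000 = 0.005          # USD per 1,000 PUT requests
CLOUDTRAIL_DATA_PER_100K = 0.10  # USD per 100,000 data events recorded

def daily_request_cost(objects_written_per_day: int) -> float:
    """Cost of writing N objects per day when S3 data events are logged to CloudTrail."""
    s3_requests = objects_written_per_day * (S3_PUT_PER_1000 / 1_000)
    trail_events = objects_written_per_day * (CLOUDTRAIL_DATA_PER_100K / 100_000)
    return s3_requests + trail_events

# Same ~100 GB per day, partitioned differently:
many_small = daily_request_cost(100_000_000)  # 100 million ~1 KB objects
few_large = daily_request_cost(1_000)         # 1,000 ~100 MB objects

print(f"Many small objects: ${many_small:,.2f}/day")  # roughly $600/day
print(f"Few large objects:  ${few_large:,.2f}/day")   # roughly $0.01/day
```

The point of the sketch is that both the S3 request charge and the CloudTrail data-event charge scale with the number of objects rather than the bytes stored, which is why the two cost lines spiked together.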

One of the contributing factors that led to this anomaly was that the application teams had been used to developing on-prem solutions to on-prem constraints. Compared with on-prem, when designing for the cloud some constraints may be merely different while others are completely new; to many developers, product owners and senior leaders, the variable cost model of the cloud offers challenges and opportunities that are both different and new.

When this anomaly hit:

  • application teams were not focused on run costs when designing implementations, so did not use them as a metric when testing or moving to production;
  • application teams were not aware of some of the additional API and logging costs that get incurred;
  • central teams with the responsibility for setting policy but without any budget accountability were not obliged to consider the cost implications of their decisions.

What we did

Changes were made in the following areas:

  • the application was reengineered to make use of larger files to reduce the number of S3 calls;
  • after lengthy negotiations with one of the central teams, it was agreed to scale back on the data events and record only management events in CloudTrail (a minimal sketch of this change follows the list);
  • work started on refining the anomaly alerting mechanisms so that targeted emails go directly to the most appropriate recipients.
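For the CloudTrail change, the usual mechanism is the trail's event selectors. The following is a minimal boto3 sketch of what that configuration might look like; the trail name is a placeholder, and the exact selectors depend on how the trail was originally set up.

```python
import boto3

cloudtrail = boto3.client("cloudtrail")

# Hypothetical trail name; replace with the trail that was logging S3 data events.
TRAIL_NAME = "org-main-trail"

# Keep management events but drop the per-object S3 data events,
# which were generating one CloudTrail record per API call.
cloudtrail.put_event_selectors(
    TrailName=TRAIL_NAME,
    EventSelectors=[
        {
            "ReadWriteType": "All",
            "IncludeManagementEvents": True,
            "DataResources": [],  # no data events -> no per-request S3 logging charges
        }
    ],
)
```

Dropping data events also removes the per-request audit record, so, as the negotiations here showed, it is worth confirming the trade-off with the team that owns the logging policy before making the change.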

Advice / what we learned

  • When migrating an application from on-prem to the cloud, it’s not just the application and data that have to be migrated; the engineering mindset has to be migrated too.
  • Help the central engineering and the application teams understand the additional or ‘hidden’ costs of operating in the cloud.
  • Support application teams in constructing more detailed consumption forecasts that account for these additional costs.
  • Get all application and product owners to review their actual cloud costs against forecast on a weekly basis and flag when the two diverge (a sketch of pulling the actuals follows this list). This is especially important during the transitions through the various stages of software release deployments, data migrations and new client onboarding.
  • Ensure that central IT or engineering teams, and the business application owners that they support, work together to stay informed on changes to costs and constraints in the public cloud.
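As one way to support that weekly actual-versus-forecast review, the sketch below pulls the last seven days of spend for one tagged application from the Cost Explorer API and compares it with a forecast figure. The tag key, application name, forecast value, and 10% threshold are all placeholders for illustration, not part of the original story.

```python
import boto3
from datetime import date, timedelta

ce = boto3.client("ce")

# Hypothetical tag and weekly forecast for one application
APP_TAG_KEY = "app"
APP_TAG_VALUE = "order-ingest"
WEEKLY_FORECAST_USD = 1500.0

end = date.today()
start = end - timedelta(days=7)

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    Filter={"Tags": {"Key": APP_TAG_KEY, "Values": [APP_TAG_VALUE]}},
)

# Sum the daily unblended cost for the week
actual = sum(
    float(day["Total"]["UnblendedCost"]["Amount"]) for day in resp["ResultsByTime"]
)

print(f"Last 7 days: ${actual:,.2f} vs forecast ${WEEKLY_FORECAST_USD:,.2f}")
if actual > WEEKLY_FORECAST_USD * 1.1:  # flag a >10% overshoot
    print("Flag: actuals are running ahead of forecast")
```

A query like this only works if tagging is consistent, which is the same tagging discipline that let us attribute the original anomaly to a single application.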