This story was contributed by an anonymous source due to privacy concerns.
Sometimes, as a cloud footprint increases and infrastructure grows, gaps in volume testing can create irregularities that lead to unforeseen cost spikes.
Our organization has a centralized FinOps team that supports central IT functions, finance, and cloud consumers across distinct pillars of the organization. As well as the native tools provided by AWS and Azure, we have a major Cloud Management Platform (CMP) tool that is used as a single pane of glass across both providers. In addition to its cost reporting and recommendation features, the CMP provides alerting on cost anomalies through automated emails.
At the time this anomaly occurred, a single email typically contained anomalies across multiple billing accounts and subscriptions, which meant additional work to analyze them, carve them up, and get the right messages to the correct audiences.
The issue started with alerts around the cost of AWS S3 API operations. These seemed to be rising disproportionately relative to the increase in S3 storage across the accounts. Because we already had some tagging in place, we were able to identify that one of the applications was being migrated from staging into production, and with that came data volumes many times larger than those used for testing.
We were also getting alerts around a very large increase in CloudTrail costs. The S3 and CloudTrail costs were increasing at a far steeper rate for this one application than for any of the others. Combined, this was costing several hundred dollars per day.
When we worked with the application team to dig into what was going on, we found that it was not simply the volume of data that was being stored, but the size and number of files in which it was stored.
Data partitioning decisions can come down to a simple choice: a small number of large files, or a large number of small files. In this case the original decision was to have a massive number of tiny files, and from a purely engineering perspective it had made sense. But the decision was taken without considering how the volumes would scale, or whether that scaling would change the way certain charges were incurred.
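The request-cost impact of that partitioning choice can be sketched with some back-of-the-envelope arithmetic. The figures below are assumptions for illustration only (the per-request price of roughly $0.005 per 1,000 S3 PUT requests reflects S3 Standard in a typical region; check current pricing for your account), and the data volumes are hypothetical, not the actual application's numbers:

```python
# Illustrative comparison: writing the same data volume to S3 as many tiny
# objects vs. a few large ones. Storage cost is identical either way, but
# request charges scale with the *number* of objects written.

PUT_PRICE_PER_1000 = 0.005  # USD, assumed S3 Standard PUT request price

def put_cost(total_gb: float, object_size_mb: float) -> float:
    """USD cost of the PUT requests needed to write total_gb at a given object size."""
    num_objects = (total_gb * 1024) / object_size_mb
    return num_objects / 1000 * PUT_PRICE_PER_1000

data_gb = 1000  # hypothetical 1 TB of ingest

tiny = put_cost(data_gb, 0.1)    # ~100 KB objects
large = put_cost(data_gb, 100)   # ~100 MB objects

print(f"tiny objects:  ${tiny:,.2f} in PUT requests")
print(f"large objects: ${large:,.2f} in PUT requests")
```

For the same terabyte of data, the tiny-object layout incurs roughly a thousand times the request charges of the large-object layout, which is exactly the kind of cost dimension that stays invisible at staging volumes.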
The anomaly occurred when the workload was moved from a staging environment to a production environment and request volumes rocketed. This was compounded by the tight relationship between S3 API calls and CloudTrail, resulting in a double whammy on the cost front.
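The "double whammy" arises because, with CloudTrail S3 data events enabled, every S3 API call is also billed as a CloudTrail event. The prices and the daily request count below are assumptions for illustration (roughly $0.005 per 1,000 S3 PUT requests and $0.10 per 100,000 CloudTrail data events; verify against current pricing), but they show how the two charges move in lockstep:

```python
# Rough sketch of how S3 request charges and CloudTrail data-event charges
# compound: both are driven by the same request count, so a file layout that
# multiplies request volume multiplies both bills at once.

S3_PUT_PER_1000 = 0.005      # USD, assumed S3 PUT request price
CLOUDTRAIL_PER_100K = 0.10   # USD, assumed CloudTrail data-event price

def daily_cost(requests_per_day: int) -> tuple[float, float]:
    """Return (S3 request cost, CloudTrail data-event cost) per day in USD."""
    s3 = requests_per_day / 1_000 * S3_PUT_PER_1000
    trail = requests_per_day / 100_000 * CLOUDTRAIL_PER_100K
    return s3, trail

s3, trail = daily_cost(50_000_000)  # hypothetical 50M requests/day in production
print(f"S3 requests:       ${s3:,.2f}/day")
print(f"CloudTrail events: ${trail:,.2f}/day")
print(f"combined:          ${s3 + trail:,.2f}/day")
```

At a hypothetical 50 million requests per day, the combined figure lands in the several-hundred-dollars-per-day range described above, with CloudTrail adding a meaningful surcharge on top of the S3 request charges themselves.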
One of the contributing factors was that the application teams were used to developing on-prem solutions against on-prem constraints. When designing for the cloud, some constraints are merely different from on-prem while others are completely new; to many developers, product owners, and senior leaders, the variable cost model of the cloud presents challenges and opportunities that are both different and new.
When this anomaly hit:
Changes were made in the following areas: