Amy Ashby
FinOps Lead at Under Armour
My anomaly example relates to a time when my dev account's cost rose for a couple of days due to an unnoticed increase in AWS Lambda costs. Even at a small scale, runaway AWS Lambda costs are an important signal: left unchecked, they can compound into much larger cost spikes over time. Here is how we dealt with it.
This story was originally told as part of Amy Ashby’s FinOps X ‘23 talk.
I found this anomalous cost increase by breaking the spend down by service. From the billing CSV file, I could see three services that trended upward at the same time. Using this data, I took my investigation back to the console.
I filtered by those services. In that organization we had a Name tag, so I grouped the data by that tag and was able to pinpoint the exact time frame of the spike. With the exact names of the resources, I was equipped to have a conversation with the team that used the service.
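As a rough sketch of that triage, the pandas snippet below groups a Cost and Usage Report style export by day and service, then by the Name tag, to surface which resources spiked and when. The file name, column names, and product codes are illustrative assumptions; they vary by export format.

```python
import pandas as pd

# Illustrative sketch; column names and product codes are assumptions based on
# the CUR format and may differ in other billing exports.
cur = pd.read_csv("billing-export.csv", parse_dates=["lineItem/UsageStartDate"])
day = cur["lineItem/UsageStartDate"].dt.date

# Daily cost per service: surfaces the services that trended upward together.
by_service = (
    cur.groupby([day, "lineItem/ProductCode"])["lineItem/UnblendedCost"]
       .sum()
       .unstack(fill_value=0)
)
print(by_service.tail(7))

# Narrow to the spiking services and group by the Name tag to identify the
# exact resources and time frame (tag column name is an assumption).
spike = cur[cur["lineItem/ProductCode"].isin(
    ["AWSLambda", "AmazonStates", "AmazonCloudWatch"]
)]
by_name = (
    spike.groupby([spike["lineItem/UsageStartDate"].dt.date,
                   "resourceTags/user:Name"])["lineItem/UnblendedCost"]
         .sum()
)
print(by_name.sort_values(ascending=False).head(10))
```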
Equipped with the cost and usage data, I was able to have a constructive conversation, and the team was able to identify the process that was driving the cost. In this particular example, AWS Lambda, Step Functions, and some CloudWatch logs were generating the excess costs.
There was also a function processing data out of an SQS queue. If there was something wrong with a message, the function would fail and the message would be reprocessed, over and over in a loop. In this particular case, it was a dev account, the messages were badly formed, and no one noticed the cost and usage spikes at first glance.
So the process just kept reprocessing the same messages, adding to the anomalous cost to the tune of thousands of dollars in a day. By taking all of this information to the engineering team and having an informed dialog, they were able to inspect the messages and clear the bad data out of the queue.
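A minimal sketch of that failure mode is below, assuming a Lambda function consuming from SQS; the handler and `process` function are hypothetical. With an SQS event source mapping, an unhandled exception returns the batch to the queue, so without any safeguard a badly formed message is redelivered and billed again on every retry.

```python
import json

def handler(event, context):
    # Hypothetical consumer illustrating the loop described above: if any
    # record is malformed, json.loads raises, the invocation fails, and SQS
    # makes the messages visible again after the visibility timeout, so the
    # same bad message is reprocessed indefinitely.
    for record in event["Records"]:
        payload = json.loads(record["body"])  # raises on a badly formed message
        process(payload)

def process(payload):
    # Placeholder for the downstream work (e.g. kicking off Step Functions).
    pass
```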
This also set the stage for shifting our conversation to “how do we prevent this from happening in the future?”
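One common safeguard for this pattern, sketched below with boto3, is to attach a dead-letter queue via a redrive policy so a message that keeps failing is parked after a handful of receives instead of being reprocessed indefinitely. The queue names and URL here are placeholders, and the `maxReceiveCount` is an assumed value, not the setting the team actually chose.

```python
import json
import boto3

sqs = boto3.client("sqs")

# Create (or reuse) a dead-letter queue; names are placeholders.
dlq_url = sqs.create_queue(QueueName="orders-dlq")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Attach a redrive policy to the source queue: after 5 failed receives the
# message moves to the DLQ instead of being retried forever.
sqs.set_queue_attributes(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/orders",
    Attributes={
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        )
    },
)
```

Pairing a safeguard like this with an alarm on the dead-letter queue turns a silent cost loop into a visible operational signal.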