Cloud and Finance – Lessons learned
This is a joint post with Abiade Adedoyin.
Here at Expedia, we’re undergoing a strategic migration to the cloud. Like many others, we initially struggled to understand and manage our cloud spend. Over the past eight months, we have learned some lessons and adopted practices that let us manage our cloud spend more predictably and efficiently than before.
Here are the results of our practices:
- We increased the proportion of reserved instances (RIs) from under 40% to about 65% to 70%. See the “Compute Usage Hours” chart below. Our overall savings due to RIs grew fivefold during the same period.
- We maintain an RI utilization rate of over 90%. As we increased the proportion of RIs, the utilization rate dropped at first (reaching its lowest in Oct 2016) before bouncing back to the mid-90s. See the “RI Utilization Rate” chart.
- We are able to keep our cloud spend close to the forecast.
- We did not standardize on, and thereby constrain all workloads to, resource-sharing platforms like Mesos, ECS, Kubernetes, and Swarm for their potential to provide higher efficiency. Our container and cluster manager adoption is driven by agility, not resource efficiency.
- Most importantly, our teams retain the flexibility to use any instance type in any region we operate in.
The chart below shows the percentage of unused RIs for the same duration.
As you will read in this post, we make frequent tweaks to shift more of our workloads onto RIs while simultaneously increasing the RI utilization rate and reducing waste.
Shift in Thinking
All this started early September last year. One of our vigilant colleagues made a late-night observation that, at the rate we were spending, we would exceed the annual forecast by a staggering sum. A few of us started huddling in a war room for the next several weeks to understand what was going on, and to put in place some remedial measures as quickly as possible.
Our constraints were clear. First, we had to act quickly. Second, any change we made would take several billing cycles to take effect. Third, we needed to be careful about introducing measures that constrain choice and innovation. Data center-like practices such as capacity approvals, and rules about which team can use which instance type, would set our DevOps culture back.
In the end, we are pleased with what we learned and with the effects our measures produced. Read on to find out more.
Explain Centrally and Optimize Locally
Our prior attempt at taming costs centered on optimization. Since we have a number of teams in multiple locations using the cloud, we spun up a small temporary team to look at all the apps, minimize waste, and optimize as best as possible. This approach didn’t work well; its results lasted only about a month.
We then pivoted from cost optimization to cost transparency, to decentralize the problem and let each team optimize its own apps. We took a couple of steps to enable this. We first fixed our automation to tag all taggable resources. We then built automation to remove untagged resources within minutes.
These steps allowed each team to find out how much their apps cost on the cloud, and take steps to optimize. To this day, we maintain 99% tagging of all taggable resources on AWS.
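The compliance check behind this kind of cleanup can be sketched as follows. This is a hypothetical illustration, not our production automation: the required tag keys and the resource inventory are made up, and in practice the inventory would come from the cloud provider’s API rather than a hard-coded list.

```python
# Hypothetical tag-compliance check. REQUIRED_TAGS is an assumed
# schema, not our actual one; resources here mirror the shape of an
# AWS-style inventory (a list of {"Id": ..., "Tags": [...]} records).

REQUIRED_TAGS = {"team", "app", "cost-center"}  # assumed tag keys

def find_untagged(resources):
    """Return IDs of resources missing one or more required tags."""
    noncompliant = []
    for resource in resources:
        tag_keys = {t["Key"] for t in resource.get("Tags", [])}
        if not REQUIRED_TAGS <= tag_keys:
            noncompliant.append(resource["Id"])
    return noncompliant

inventory = [
    {"Id": "i-01", "Tags": [{"Key": "team", "Value": "search"},
                            {"Key": "app", "Value": "api"},
                            {"Key": "cost-center", "Value": "1001"}]},
    {"Id": "i-02", "Tags": [{"Key": "team", "Value": "search"}]},
]
print(find_untagged(inventory))  # flagged for removal
```

In a real setup, the flagged IDs would feed a removal (or quarantine) step after a grace period.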
Cost Awareness as part of DevOps Culture
A consequence of cost transparency at Expedia is that cost awareness has become a part of our DevOps culture. Most of our teams now routinely monitor their costs, and take steps to optimize. We also have an active internal tech blog where teams occasionally share and debate different optimization patterns and tradeoffs.
We fueled this process further by decentralizing our forecasting process. Some of our larger teams have begun to forecast their spend every quarter. At present, teams fill in templates to project their cloud usage based on their cloud migration plans. The template includes projected spend for some key cloud services as well as data transfer. We then consolidate these inputs to come up with the overall forecast. We start with an annual forecast, and refine it every quarter.
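The consolidation step is straightforward in spirit: sum each spend category across the team templates. The sketch below is illustrative only; the team names, categories, and figures are made up.

```python
# Hypothetical consolidation of per-team forecast templates into an
# overall forecast. Categories and dollar figures are invented.

team_forecasts = [
    {"team": "search",   "compute": 90_000, "storage": 12_000, "transfer": 4_000},
    {"team": "checkout", "compute": 55_000, "storage":  8_000, "transfer": 2_500},
]

def consolidate(forecasts):
    """Sum each spend category across all team templates."""
    total = {}
    for forecast in forecasts:
        for key, value in forecast.items():
            if key != "team":
                total[key] = total.get(key, 0) + value
    return total

print(consolidate(team_forecasts))
```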
Though this exercise is time-consuming for teams, we realized that the success of our cloud journey depends on being cost efficient, and that there is no better way than to make cost awareness part of every team’s activities. Our teams thus retain the freedom to consume any type of compute by taking responsibility for forecasting and cost optimization as part of their activities.
Measure, Measure and Measure
In order to spot trends early, we look at certain key cost related metrics weekly and monthly. Here are some of the metrics we monitor weekly:
- Percentage of utilized RI hours over total reserved instance hours
- Ratio between total amount spent on reserved and on-demand compute
- Cost of unallocated resources due to missing tags
- Monthly forecast vs actual spend to date
- Average daily spend
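The weekly metrics above are all simple ratios over billing data. The snippet below shows how they fit together, using invented numbers; none of the figures are Expedia’s actuals.

```python
# Illustrative weekly cost metrics on made-up inputs. In practice these
# would be pulled from billing/usage reports rather than hard-coded.

ri_hours_used = 9_300        # RI hours actually consumed this week
ri_hours_reserved = 10_000   # total reserved instance hours
reserved_spend = 120_000.0   # $ spent on reserved compute
on_demand_spend = 60_000.0   # $ spent on on-demand compute
untagged_spend = 1_400.0     # cost of resources with missing tags
forecast_month = 800_000.0   # this month's forecast
actual_to_date = 410_000.0   # actual spend so far this month
days_elapsed = 15            # days into the month

ri_utilization = ri_hours_used / ri_hours_reserved
reserved_ratio = reserved_spend / on_demand_spend
forecast_consumed = actual_to_date / forecast_month
avg_daily_spend = actual_to_date / days_elapsed

print(f"RI utilization:      {ri_utilization:.1%}")
print(f"Reserved/on-demand:  {reserved_ratio:.2f}")
print(f"Untagged cost:       ${untagged_spend:,.0f}")
print(f"Forecast consumed:   {forecast_consumed:.1%}")
print(f"Average daily spend: ${avg_daily_spend:,.0f}")
```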
This process allows us to take corrective measures sooner rather than later, though in recent months we have not had to take any action other than asking clarifying questions.
We also review a few additional metrics at the end of every month:
- Cost of different types of resources such as compute, storage, and shared services
- Volume of data transferred within each region and across regions, and the associated costs
- Unused RI types and their cost
- Instance types for which we have run out of reservations
This review prompts the next step, which is to top off our RIs.
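The top-off analysis can be sketched as a gap calculation: compare steady-state usage per region and instance type against the reservations already held, and recommend small incremental purchases where coverage falls short. This is a simplified, hypothetical model; the instance types, regions, counts, and the 90% coverage target are all illustrative.

```python
# Hedged sketch of a monthly RI "top off" recommendation. Usage and
# reservation counts are invented; real inputs would come from usage
# reports per region and instance type.

usage = {  # average concurrent instances observed last month
    ("us-east-1", "m4.xlarge"):  40,
    ("us-east-1", "c4.2xlarge"): 12,
    ("eu-west-1", "m4.xlarge"):   8,
}
reserved = {  # reservations currently held
    ("us-east-1", "m4.xlarge"):  30,
    ("us-east-1", "c4.2xlarge"): 14,
}

def top_off(usage, reserved, coverage=0.9):
    """Recommend extra RIs to cover `coverage` of steady usage."""
    recommendations = {}
    for key, count in usage.items():
        target = int(count * coverage)
        gap = target - reserved.get(key, 0)
        if gap > 0:  # only buy where coverage falls short
            recommendations[key] = gap
    return recommendations

print(top_off(usage, reserved))
```

Keeping the purchases small and monthly, as described above, means each gap stays modest and over-buying in one cycle is cheap to correct in the next.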
Smaller but More Frequent RI Purchases
We operate in multiple regions and expect teams to pick the right instance types for their applications themselves. We also don’t require teams to keep using the instance type they originally picked for an application. We believe this freedom is essential to drive agility and innovation.
We consequently end up consuming a diverse, ever-changing portfolio of instance types in each region. This makes upfront long-term planning to procure RIs impossible. Prior to September last year, we purchased RIs every 5-6 months. We now make small RI purchases almost every month, which allows us to adjust our RI spend and reduce waste on an ongoing basis.
However, analyzing all those instance types in each region and then procuring reservations is a laborious six-step dance.
This process involves analysis, obtaining recommendations from multiple sources and tools, validating those recommendations, and then procuring RIs. It is the best balance we could strike between the upfront planning that RIs require and our preference to retain choice for our teams. We eagerly await simplified pricing models designed for elasticity and developer choice.
Tech and Finance Partnership
Lastly, managing cloud spend efficiently depends on a close partnership between tech and finance teams. We have a small but excellent team that has collaborated over months to learn what to measure and to fine-tune the process. We now have a better understanding of the interplay between cloud and finance, and are able to ask better questions to stay ahead.
Prior to this partnership, we treated the cloud as an automation and DevOps activity, and finance as just a necessary detail. We rarely sat together to compare notes. This is no longer the case. We’re not done yet, and are bound to learn more lessons as we continue our cloud adoption.