Turbocharging Machine Learning
The Challenge – How can we scale bidding?
Expedia attracts customers through meta search sites like Trivago, Google Hotel Ads, TripAdvisor, and Kayak. As you might guess, meta sites are very competitive marketplaces. This comes as no surprise: the winner spot gets 80% of the clicks, and traffic drops off exponentially from there.
To compete effectively, both good prices and optimal bids are important – the meta site uses a proprietary algorithm that combines a hotel's price and its associated bid to choose the offer for the winner spot. We have to be smart about our bids for each of our hotels. If we bid too much for a poorly converting hotel, we buy traffic that does not result in revenue – we could lose a lot of money. Conversely, if we do not bid high enough for a well-performing hotel, we will get little or no traffic – we miss out on bookings.
To choose smart bids across our broad hotel portfolio we use linear programming (a mathematical technique for finding the variable values that maximize a linear objective function, subject to linear constraints). To do this we need a way to estimate how changes to our bids might change our objectives – for that we use predictive models to estimate the impact of our decisions. With these predictive models we can rapidly test how many different decisions move us toward our objective.
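To make the idea concrete, here is a toy version of a bid-allocation problem as a two-variable linear program. All of the hotels, revenue rates, and budget figures are invented for illustration – this is not our production formulation. It is solved by brute-force vertex enumeration, which works because the optimum of a feasible, bounded LP always lies at a vertex of the constraint polytope:

```python
from itertools import combinations

# Toy bid-allocation LP (all numbers invented for illustration):
# choose click volumes x1, x2 to buy for two hotels so as to
#   maximize   3*x1 + 1*x2         (expected revenue per click)
#   subject to 2*x1 + 1*x2 <= 100  (spend budget)
#              0 <= x1 <= 40, 0 <= x2 <= 60  (available traffic)
# Every constraint is written as a1*x + a2*y <= b.
A = [(2, 1), (1, 0), (0, 1), (-1, 0), (0, -1)]
b = [100, 40, 60, 0, 0]
c = (3, 1)

def solve_2d_lp(A, b, c):
    """Solve a 2-variable LP by checking every vertex of the feasible region."""
    best = None
    for i, j in combinations(range(len(A)), 2):
        (a1, a2), (a3, a4) = A[i], A[j]
        det = a1 * a4 - a2 * a3
        if abs(det) < 1e-12:
            continue  # parallel constraint pair: no intersection point
        # Cramer's rule for the intersection of constraints i and j
        x = (b[i] * a4 - a2 * b[j]) / det
        y = (a1 * b[j] - b[i] * a3) / det
        # keep the vertex only if it satisfies *all* constraints
        if all(ai * x + aj * y <= bk + 1e-9 for (ai, aj), bk in zip(A, b)):
            value = c[0] * x + c[1] * y
            if best is None or value > best[0]:
                best = (value, x, y)
    return best  # (objective value, x1, x2)

value, x1, x2 = solve_2d_lp(A, b, c)
print(value, x1, x2)  # 140.0 40.0 20.0
```

In production the same structure scales to many hotels and constraints, solved with a real LP solver rather than vertex enumeration.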
The Algorithm – Gradient Boosted Trees
The optimization described above, however, is limited by the quality of our predictive models. To build good predictive models we use machine learning algorithms. One of the machine learning algorithms most frequently used in the industry is Gradient Boosted Trees (GBT). This algorithm examines data to find important decision points that become branches in decision trees, then combines the output of many small decision trees into a single predictive model. A simplified version looks a bit like this:
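As a minimal sketch of the residual-fitting idea behind GBT: each round fits a depth-1 tree (a stump) to the current residuals and adds a shrunken copy of it to the ensemble. Real implementations such as XGBoost add regularization, second-order gradient information, and deeper trees; this toy keeps only the core loop for squared-error regression:

```python
def fit_stump(xs, rs):
    """Fit a depth-1 regression tree: (threshold, left_value, right_value)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # splitting at the max x leaves the right side empty
        left = [r for x, r in zip(xs, rs) if x <= t]
        right = [r for x, r in zip(xs, rs) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]  # assumes xs contains at least two distinct values

def fit_gbt(xs, ys, rounds=50, lr=0.3):
    """Boost: each round fits a stump to the residuals of the running prediction."""
    preds = [0.0] * len(ys)
    ensemble = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        ensemble.append((t, lv, rv))
        preds = [p + lr * (lv if x <= t else rv) for p, x in zip(preds, xs)]
    return ensemble

def predict(ensemble, x, lr=0.3):
    """Prediction is the learning-rate-scaled sum of all stump outputs."""
    return sum(lr * (lv if x <= t else rv) for t, lv, rv in ensemble)

# Tiny example: a step-shaped target the ensemble learns to reproduce.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]
model = fit_gbt(xs, ys)
print(round(predict(model, 2), 3), round(predict(model, 5), 3))  # 1.0 3.0
```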
Many companies, including Expedia, use the XGBoost open-source implementation of the GBT algorithm. To make the training process run fast, the authors of XGBoost wrote the core code in C++ and designed it to run efficiently across all the CPU cores on a single machine. If you would like to learn more about XGBoost, you can visit their GitHub repository or read this gentler introduction.
So what challenges did we have? There were several.
Problem 1 – Limited Scale Up
In our single-machine training process we used an x1.32xlarge – an EC2 instance with 128 vCPUs and roughly 2 terabytes of memory. This is the biggest AWS instance currently available. However, even this monster cannot tackle some of our training data sets. Further, as our business grows, our training data sets are only going to get bigger. We were already at the point where we had to make sacrifices:
- Train models for less than one-third of our points of sale
- Limit the amount of data we used for training
You can imagine how this will continue to get worse as we add more hotels and grow our traffic. Scaling up is no longer an option.
Problem 2 – Training Duration
Another challenge was the serial nature of the process; loading the training data set onto a single machine takes several hours, even in the cloud. Even once the data is loaded, running the training process pegs CPU utilization on the single EC2 node at 100% for most of a day. Several models needed to be trained together, so you can imagine the entire process was rather lengthy. As a result, we performed predictions less frequently than we wanted, and model updates were even rarer.
Problem 3 – Model Iteration Speed
Developers know that build times are critical – faster iteration on your work has a direct, positive impact on productivity. The same is true for data scientists: faster training means faster iteration on models. While data scientists can work on sampled data sets, there are many benefits to faster training and testing regardless of data set size.
The Solution – Scale Out
As we started investigating how we might distribute training across multiple machines, we found that XGBoost had a Spark-compatible version. While it may sound straightforward to adopt a Spark-enabled version of XGBoost and quickly get great results, it took a lot of work to get this running and, once it was running, to find optimal settings. The Spark-XGBoost documentation was incomplete, so we spent time reading through the Java / Scala code in XGBoost to better understand, for example, the impact of input parameters. Further, performance tuning involved adjusting not only XGBoost parameters, but also the Spark configuration and EC2 node sizing.
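For flavor, the kind of alignment we had to experiment with looks roughly like the sketch below. Every resource number, the class name, and the jar name are placeholders, not our production settings; the general shape is one executor per node, with spark.task.cpus matched to the executor cores, so that each XGBoost worker effectively gets a whole machine:

```shell
# Illustrative shape only: resource numbers, class name, and jar name
# are placeholders, not our production settings.
# One executor per node, with all of that node's cores, so each
# XGBoost worker runs on its own machine.
spark-submit \
  --master yarn \
  --num-executors 10 \
  --executor-cores 32 \
  --executor-memory 200g \
  --conf spark.task.cpus=32 \
  --class com.example.bidding.TrainModel \
  bidding-training-assembly.jar
```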
The Results – 15x Faster Model Training
As the chart below shows, the difference in training time between the single-node setup (“SingleHost”) and our Spark cluster was dramatic. A model that once took 10 hours to train now requires just 40 minutes. You read that correctly: for some models, we saw a 15x improvement in training time.
For those who are curious about the details of our Spark cluster:
- 10 nodes of r3.8xlarge
- Spark 2.1
- XGBoost 0.7
We also tested with a 20-node cluster but found diminishing returns in speed for the data sets and models we work with. We will continue experimenting in the future to see what additional performance gains we can make.
Training Time Comparison
As you can see from the comparison chart below, when we averaged out the amount of “training time per tree generated for the model” we found that Spark could train trees at a rate significantly faster than the single host.
While we did not have cost reduction as a goal we also found significant decreases in cost, as seen below:
More details about the costs comparison:
- Single node = 10 hours x $13.30/hr = $133 per training
- Cluster = 10 nodes x 1 hr x $2.66/hr = $26.60 per training
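The per-training figures are simply hourly price x hours x node count; as a quick check:

```python
# Sanity check of the per-training cost figures quoted above.
single_node_cost = 10 * 13.30   # 10 hours on one x1.32xlarge at $13.30/hr
cluster_cost = 10 * 1 * 2.66    # 10 r3.8xlarge nodes for 1 hour at $2.66/hr
print(single_node_cost, cluster_cost, single_node_cost / cluster_cost)
```

So the distributed setup was roughly a fifth of the cost per training run, on top of being 15x faster.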
Now that this work has been productionized and put into use in our bidding efforts, we have a number of future directions, including investigating other ways to distribute machine learning and using GPUs to train models.