Turbocharging Machine Learning

By Ram Hariharan in Big Data, Cloud, Data Science on July 6, 2017

The Challenge – How can we scale bidding?

Expedia attracts customers through meta search sites like Trivago, Google Hotel Ads, TripAdvisor, and Kayak. As you might guess, meta search sites are very competitive marketplaces: the winner spot gets 80% of the clicks, and traffic drops off exponentially from there.

Screenshot of a meta search website highlighting how much more prominent the winning bid position is

To compete effectively, good prices and an optimal bid are both important: the meta site uses a proprietary algorithm that combines the hotel's price and associated bid to choose a hotel offer for the winner spot. We have to be smart about our bids for each of our hotels. If we bid too much for a poorly converting hotel, we buy traffic that does not result in revenue and could lose a lot of money. Conversely, if we do not bid high enough for a well-performing hotel, we get little or no traffic and miss out on bookings.

To choose smart bids across our broad hotel portfolio we use linear programming (a mathematical technique for finding the variable values that maximize a linear objective function, subject to constraints). To do this we need a way to estimate how changes to our bids would change our objectives, so we use predictive models to estimate the impact of our decisions. With these predictive models we can rapidly test how many different decisions move us toward our objective.

Chart showing predictive models flowing into an optimization layer
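To make the decision-testing idea concrete, here is a toy sketch. Everything in it (the hotel numbers, the shapes of the predictive models, and the brute-force search standing in for a real linear program) is made up for illustration: simple models estimate clicks and expected profit for a candidate bid, and we evaluate many bid combinations to pick the best one.

```python
# Hypothetical sketch: choosing bids by rapidly evaluating many
# candidate decisions against simple predictive models.
from itertools import product

def predicted_clicks(hotel, bid):
    # Toy model: higher bids win more clicks, with diminishing returns.
    return hotel["base_clicks"] * bid / (bid + 1.0)

def expected_profit(hotel, bid):
    clicks = predicted_clicks(hotel, bid)
    revenue = clicks * hotel["conversion"] * hotel["booking_value"]
    cost = clicks * bid
    return revenue - cost

hotels = [
    {"name": "A", "base_clicks": 1000, "conversion": 0.05, "booking_value": 50.0},
    {"name": "B", "base_clicks": 500, "conversion": 0.01, "booking_value": 30.0},
]

candidate_bids = [0.25, 0.5, 1.0, 2.0]

# Test the impact of every bid combination on total expected profit.
best_bids = max(
    product(candidate_bids, repeat=len(hotels)),
    key=lambda bids: sum(expected_profit(h, b) for h, b in zip(hotels, bids)),
)
print(best_bids)  # → (1.0, 0.25)
```

Note how the well-converting hotel A earns a high bid while the poorly converting hotel B gets a low one, which is exactly the trade-off described above.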

The Algorithm – Gradient Boosted Trees

The optimization described above, however, is limited by the quality of our predictive models. To build good predictive models we use machine learning algorithms. One of the algorithms most frequently used in the industry is Gradient Boosted Trees (GBT). It examines the data to find important decision points that become branches in decision trees, then combines the output of many small decision trees into a single predictive model. A simplified version looks a bit like this:

Graph chart of Boosted Tree Training
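To make that picture concrete, here is a from-scratch sketch of the boosting recipe (an illustration of the general GBT idea, not XGBoost's actual implementation): each round fits a one-split "stump" to the current residual errors, and the final prediction is the sum of all the stumps' outputs.

```python
# Illustrative gradient boosting with one-split decision stumps and
# squared-error loss; a sketch of the idea, not XGBoost itself.

def fit_stump(xs, residuals):
    """Pick the split threshold minimizing squared error; return
    (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm = sum(left) / len(left) if left else 0.0
        rm = sum(right) / len(right) if right else 0.0
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    return best[1:]

def boost(xs, ys, rounds=50, lr=0.3):
    """Fit `rounds` stumps, each one correcting the previous ensemble."""
    stumps, preds = [], [0.0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lm, rm = fit_stump(xs, residuals)
        stumps.append((t, lm, rm))
        preds = [p + lr * (lm if x <= t else rm) for x, p in zip(xs, preds)]
    return stumps

def predict(stumps, x, lr=0.3):
    # The model's output is the scaled sum of all the small trees.
    return sum(lr * (lm if x <= t else rm) for t, lm, rm in stumps)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
model = boost(xs, ys)
print(round(predict(model, 2), 3), round(predict(model, 5), 3))  # → 1.0 5.0
```

Real GBT implementations use deeper trees, many features, and second-order gradient information, but the loop above (fit a small tree to the residuals, add it to the ensemble) is the core of the training process.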

Many companies, including Expedia, use the XGBoost open-source implementation of the GBT algorithm. To make training fast, the authors of XGBoost wrote the core code in C++ and designed it to run effectively across all the CPU cores of a single machine. If you would like to learn more about XGBoost, see the project's GitHub repository or this gentler introduction.

So what challenges did we have? There were several.

Problem 1 – Limited Scale Up

In our single-machine training process we used an x1.32xlarge, an EC2 instance with 128 vCPUs and roughly 2 terabytes of memory. This is the biggest AWS instance currently available, yet even this monster cannot tackle some of our training data sets. Further, as our business grows our training data sets are only going to get bigger. We had already reached the point of making sacrifices:

  • Train models for less than one-third of our points of sale
  • Limit the amount of data we used for training

You can imagine how this will continue to get worse as we add more hotels and grow our traffic. Scaling up is no longer an option.

Problem 2 – Training Duration

Another challenge was the serial nature of the process: loading the training data set onto a single machine takes several hours, even in the cloud. Once the data is loaded, training pegs CPU utilization on the single EC2 node at 100% for most of a day. Several models need to be trained together, so the entire process was lengthy. As a result we ran predictions less frequently than we wanted, and model updates were rarer still.

Problem 3 – Model Iteration Speed

Developers know build times are critical: faster iteration on your work has a direct impact on productivity. The same is true for data scientists, where faster training means faster iteration on models. While data scientists can work on sampled data sets, faster training and testing pays off regardless of data set size.

The Solution – Scale Out

As we started investigating how we might distribute training across multiple machines, we found that XGBoost had a Spark-compatible version. While it sounds straightforward to adopt a Spark-enabled version of XGBoost and quickly get great results, there was a lot of work to get it running and, once it was running, to find optimal settings. The Spark-XGBoost documentation was incomplete, so we spent time reading the Java / Scala code in XGBoost to better understand, for example, the impact of input parameters. Further, performance tuning involved adjusting not only XGBoost parameters, but also the Spark configuration and EC2 node sizing.
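To give a flavor of the tuning surface, here is a sketch of the three layers that had to be tuned together. The parameter names are real XGBoost and Spark settings, but the values are placeholders, not our production configuration.

```python
# Illustration of the layers tuned together; values are placeholders,
# not Expedia's production settings.

xgboost_params = {
    "eta": 0.1,        # learning rate per boosting round
    "max_depth": 8,    # size of each individual tree
    "nthread": 16,     # CPU threads used by each worker
}
num_round = 300        # boosting rounds
num_workers = 10       # Spark tasks sharing the training (XGBoost4J-Spark)

spark_conf = {
    "spark.executor.memory": "200g",
    "spark.executor.cores": "16",
    "spark.default.parallelism": "160",
}
```

The interaction between layers is what makes tuning slow: for example, the thread count per worker, the number of workers, and the executor core count all have to agree with each other and with the node size.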

The Results – 15x Faster Model Training

As the chart below shows, the difference in training time between the single-node ("SingleHost") setup and our Spark cluster was dramatic. A model that once took 10 hours to train now requires just 40 minutes. You read that correctly: for some models, we saw a 15x improvement in training time.

For those who are curious about the details of our Spark cluster:

  • 10 r3.8xlarge nodes
  • Spark 2.1
  • XGBoost 0.7

We also tested a 20-node cluster but found diminishing returns in speed for the data sets and models we work with. We will continue experimenting to see what additional performance gains we can make.

Training Time Comparison

As the comparison chart below shows, when we averaged the training time per tree generated for each model, the Spark cluster trained trees at a rate significantly faster than the single host.

Bar chart of tree-training rate (trees per second) for single host vs. Spark cluster: the single host achieves roughly one-fifth the rate of the Spark cluster; models 1 and 2 perform similarly, while models 3 and 4 train faster on Spark and slower on the single host

Costs

While cost reduction was not one of our goals, we also found significant decreases in cost, as seen below:

Training cost bar chart: single host about $130 per training run, Spark cluster about $25

More details about the costs comparison:

  • Single node = 10 hours x $13.30/hr = $133 per training
  • Cluster = 10 nodes x 1 hr x $2.66/hr = $26.60 per training
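As a quick sanity check on the arithmetic above (prices are the on-demand rates quoted in the bullets; note that 10 × $2.66 comes to $26.60):

```python
# Back-of-envelope check of the cost and speedup figures above.
single_host_cost = 10 * 13.30   # 10 hours on one x1.32xlarge at $13.30/hr
cluster_cost = 10 * 1 * 2.66    # 10 r3.8xlarge nodes for 1 hour at $2.66/hr

speedup = (10 * 60) / 40        # 10 hours down to 40 minutes

print(round(single_host_cost, 2), round(cluster_cost, 2), speedup)  # → 133.0 26.6 15.0
```

So the distributed setup is roughly 5x cheaper per training run while also being up to 15x faster.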

Next steps

Now that this work has been productionized and put to use in our bidding efforts, we have a number of future directions, including investigating other ways to distribute machine learning and using GPUs to train models.