Advancing Machine Learning with H2O

Author: Amitpal Tagore

Typical and automated machine learning

At IAS, we analyze billions of online advertisements, also known as ad impressions, every day for suspicious activity. A major concern is that computers can be used to automate online activity and cause a large number of ad impressions to be served. Because these impressions will never be seen by a human, they are a source of lost revenue for advertisers. Networks of computers that are used for this purpose are called botnets or just bots. However, a large number of ad impressions coming from a single IP address (IP) is not necessarily a bad sign. IPs can be associated with proxy servers, which can be used legitimately for business or privacy reasons. A proxy server funnels traffic from multiple sources on the internet through a single IP, allowing for anonymity and security. An emerging and increasingly popular example of this is Apple’s iCloud Private Relay; for a small fee, individuals can achieve a level of anonymity on the internet by having their activity funneled through a set of IPs owned by Apple and its partners. This will not stop Internet Service Providers from monitoring individual activity, but it will reduce the ability of third parties to track individuals online. Going back to the issue of ad fraud, internet traffic becomes suspicious when we receive unexpected signals from browsers or when a large fraction of the behavior we detect from an IP deviates from the norm.

Machine Learning

Given the large amount of data that we receive, separating proxy (human) and cloud (bot) IP addresses can seem like a daunting task. We could aggregate the data down to the IP level, inspect it by eye, and then code simple functions to separate suspicious and non-suspicious IPs based on our findings and intuitions. Nowadays, however, computers can take in much larger amounts of data and learn what the rules should be in a fraction of the time. When datasets start to become very large (e.g., hundreds of thousands of observations/rows with hundreds of features/columns), we begin to appreciate just how powerful machine learning can be.
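To make the manual approach concrete, here is a minimal base R sketch of aggregating impressions to the IP level and applying a hand-written rule. The column names, the toy data, and the threshold are all made up for illustration; they are not the signals or rules we actually use.

```r
# Toy impression log; column names and values are made up.
impressions <- data.frame(
  ip         = c("a", "a", "a", "a", "b", "b", "c"),
  user_agent = c("ua1", "ua1", "ua1", "ua1", "ua1", "ua2", "ua3"),
  stringsAsFactors = FALSE
)

# Aggregate down to one row per IP.
per_ip <- data.frame(ip = sort(unique(impressions$ip)),
                     stringsAsFactors = FALSE)
per_ip$n_impressions <- as.integer(table(impressions$ip)[per_ip$ip])
per_ip$n_user_agents <- as.integer(
  tapply(impressions$user_agent, impressions$ip,
         function(x) length(unique(x)))[per_ip$ip]
)

# A hand-written rule with an arbitrary threshold: many impressions
# that all share a single user agent.
is_suspicious <- function(n_impressions, n_user_agents) {
  n_impressions >= 4 & n_user_agents == 1
}
per_ip$suspicious <- is_suspicious(per_ip$n_impressions, per_ip$n_user_agents)
```

Rules like this are easy to read, but every new signal means another threshold to choose by hand, which is where machine learning starts to pay off.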

Machine learning requires time and experience. There are many algorithms to choose from, and knowledge of the algorithms, their assumptions about the data, and how they function is essential. For example, some algorithms are better suited to imbalanced classes and others to interpreting and understanding the predictions. Feature engineering relies on intuition and domain expertise, and it can be a trial-and-error process with many iterations. Optimizing the parameters of a particular model can also take some time, especially if there are a lot of parameters or if a large sample of data is needed to thoroughly test for differences in performance.

However, we can now turn to tools that help us automate the tuning and selection of each of these knobs. These tools may not always decrease the overall amount of CPU time needed to evaluate the different variables involved, but they certainly decrease the amount of coding and time needed on the part of the modeler. If considerable effort has gone into engineering the tool itself, automated ML software should decrease overall run time as well. Even if we prefer to develop our models manually, we can always use the output of these tools to inform our starting point or to narrow down our pool of models and features to a much more manageable set.

[Screenshot: one such tool, H2O Driverless AI, being used on a simple dataset via its GUI.]

H2O

H2O.ai is a company that develops software for automating machine learning tasks. Some of its products automate model selection and optimization, and some automate the entire model building process.

AutoML

Here, we turn to H2O AutoML, an open-source tool for automating the training of machine learning models. For example, if you have a data table with features and labels, then AutoML can try out a suite of algorithms, including tree-based methods, gradient-boosted algorithms, and deep learning. It can also create stacked models that use combinations of these to increase performance.

Before we can use AutoML, we need to do some feature engineering. We aggregate proprietary data from our logs and create a set of features for each IP address. We take care to make sure that data is reasonably balanced and that features derived from one IP address don’t enter our data set for multiple days. This last part is important to ensure that there is no cross-contamination in our training, validation, and test data sets.
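One way to guard against this kind of leakage is to keep a single row per IP address before splitting, so that no IP can land in more than one partition. The base R sketch below is illustrative only: the column names (ip, day, n_impressions) and the toy data are assumptions, not our production schema.

```r
# Toy data: one row per (IP, day); column names are hypothetical.
features_by_ip_day <- data.frame(
  ip  = rep(sprintf("10.0.0.%d", 1:10), each = 3),
  day = rep(c("2021-01-01", "2021-01-02", "2021-01-03"), times = 10),
  n_impressions = seq_len(30),
  stringsAsFactors = FALSE
)

# Keep one randomly chosen day per IP so each IP contributes a single row.
set.seed(1234)
one_row_per_ip <- do.call(rbind, lapply(
  split(features_by_ip_day, features_by_ip_day$ip),
  function(rows) rows[sample(nrow(rows), 1), ]
))

# Because each IP now appears once, a row-level random split cannot
# place the same IP in two different partitions.
partition_names <- c("train", "valid", "test")
one_row_per_ip$partition <- sample(partition_names, nrow(one_row_per_ip),
                                   replace = TRUE, prob = c(0.6, 0.2, 0.2))
```

With one row per IP, any later random split of the rows is automatically a split by IP as well.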

With our feature engineering done, we create a model using a very minimal amount of R code:

# load the H2O R package and start (or connect to) a local H2O cluster
library(h2o)
h2o.init()

# set your model-specific variables
# training_filename <-
# features <-
# response_vars <-

# grab a data.frame with features and labels
# (get_data is a placeholder for however you load your data)
training_data <- get_data(training_filename)
training_data <- training_data[, c(response_vars, features)]

# convert data.frame to H2O frame and split data
data_hex <- as.h2o(training_data)
split_h2o <- h2o.splitFrame(data_hex, c(0.6, 0.2), seed = 1234)
train_conv_h2o <- h2o.assign(split_h2o[[1]], "train") ## 60% of our data
valid_conv_h2o <- h2o.assign(split_h2o[[2]], "valid") ## 20% of our data
test_conv_h2o <- h2o.assign(split_h2o[[3]], "test") ## 20% of our data

# train the model; the validation split is used to rank the leaderboard
automl_h2o_models <- h2o.automl(
  x = features,
  y = response_vars,
  training_frame = train_conv_h2o,
  leaderboard_frame = valid_conv_h2o
)

# get the best model and its confusion matrix on the held-out test split
automl_leader <- automl_h2o_models@leader
perf <- h2o.performance(automl_leader, test_conv_h2o)
h2o.confusionMatrix(perf)

AutoML tests a variety of machine learning algorithms and automatically selects the best-performing one. Usually a stacked model, which uses a combination of algorithms, works best. For example, on a data set with a few hundred thousand rows, model training on an m5.16xlarge AWS EC2 instance takes a little less than an hour.

Driverless

Another problem is identifying bots based on activity, instead of focusing on detecting cloud-computing centers. This problem is similar to the previous one, except with a greater focus on the behavior of traffic coming from an IP during feature engineering. Another distinction is that traffic from an IP address cannot be classified simply as bot or not. In some cases, devices at a residential IP are used by legitimate humans, while other devices have been compromised by malware and are being controlled by a remote fraudster some fraction of the time. Because of this, our model operates at the individual impression level, and this means that our data set is much larger and much more varied.

Even though our predictions are made at the individual impression level, our features can be aggregated at any level, including IP, website, country, etc. We do our raw feature engineering using Spark on AWS EMR for speed and scalability. We call these our raw features because they will be further transformed in a later step. Moreover, we pre-aggregate our data every hour. This allows our pipeline to run even faster, even though more than an hour’s worth of data is used for scoring impressions as having come from either a bot or a human.
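The idea behind hourly pre-aggregation can be sketched in a few lines of base R. This is only a toy illustration: our actual pipeline runs in Spark, and the column names (ip, hour, impressions) and the window_features helper are hypothetical.

```r
# Hypothetical hourly pre-aggregates: one row per (IP, hour).
hourly <- data.frame(
  ip          = c("a", "a", "a", "b", "b", "a"),
  hour        = c(0, 1, 2, 1, 2, 3),
  impressions = c(10, 20, 30, 5, 15, 40),
  stringsAsFactors = FALSE
)

# Combine the pre-aggregated rows from a multi-hour window into
# per-IP features, without touching the raw impression logs again.
window_features <- function(hourly, end_hour, n_hours) {
  in_window <- hourly$hour > end_hour - n_hours & hourly$hour <= end_hour
  aggregate(impressions ~ ip, data = hourly[in_window, ], FUN = sum)
}

# Features covering hours 1 through 3.
last_three_hours <- window_features(hourly, end_hour = 3, n_hours = 3)
```

Each scoring run only has to combine a handful of small hourly tables, which is why the window can span more than an hour without slowing the pipeline down.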

For scoring, we turn to H2O Driverless AI (DAI), which goes a step further than AutoML. In addition to applying a number of machine learning algorithms, DAI automates feature generation as well. In our case, the raw features we have created are fed into DAI, which turns these into a final, and typically larger, set of features. DAI can automatically detect a variety of variable types, including timestamps, numeric, categorical, and text strings. It then performs transformations on these input data columns, which can include applying a number of functions to numeric data, one-hot encoding categorical data, frequency analysis, binning, text analysis, and more. It goes a step further and creates new features by combining raw or transformed features in various ways. As you can imagine, this could result in a small handful of features exploding into hundreds or even thousands of features. Indeed, this is exactly what the authors of DAI intended. The software is meant to be run on powerful GPU machines, so that such a massive number of features can be evaluated for importance. An important feature with such power is the ability to tune just how much feature explosion DAI will do; you can severely limit the number of features DAI will generate, or you can let it try it all.
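To give a feel for the kinds of transformations described above, here is a small base R sketch that applies a few of them by hand. It is not how DAI implements them, and the column names and bin edges are made up; the point is only to show how three raw columns quickly become ten.

```r
# Toy raw features; column names are hypothetical.
raw <- data.frame(
  browser     = c("chrome", "safari", "chrome", "firefox", "chrome"),
  impressions = c(3, 120, 45, 7, 900),
  clicks      = c(0, 4, 1, 0, 2),
  stringsAsFactors = FALSE
)

# One-hot encoding: one 0/1 column per category level.
one_hot <- model.matrix(~ browser - 1, data = raw)

# Frequency encoding: replace each category by how often it occurs.
browser_counts <- table(raw$browser)
raw$browser_freq <- as.numeric(browser_counts[raw$browser])

# Binning: bucket a numeric column into ordered ranges.
raw$impressions_bin <- cut(raw$impressions,
                           breaks = c(0, 10, 100, Inf),
                           labels = c("low", "medium", "high"))

# A numeric transformation and a feature built from two raw columns.
raw$log_impressions <- log1p(raw$impressions)
raw$click_rate <- raw$clicks / raw$impressions

engineered <- cbind(raw, one_hot)
```

Applying every transformation to every column, and then combining the results pairwise, is what drives the feature explosion.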

A common approach using DAI might include programmatically training tens or hundreds of models on a subsample of the data, using different input parameters to DAI. The easiest parameters to tweak are which machine learning algorithms to include or exclude, the time allowed for training, and the target accuracy and interpretability of the model. Then, we can examine the results of these various models on an external test data set. At this point, we decide what is most important to us: final performance on our test data, time to train, time to score, or interpretability. Often, it will be a combination of these. Finally, we can save the scoring pipeline, typically a MOJO (Model ObJect, Optimized), and use it in a variety of pipelines (Python, Spark, Hive, custom UDF, etc.).

We would like to show what programmatically training and evaluating hundreds of models would look like. However, because we cannot share our data, we turn to a publicly available dataset instead. We trained many models on a modified version of the breast cancer dataset available from scikit-learn. The data set has about 30 features, and the task is to predict whether an observation is benign or malignant. As can be seen from the GUI screenshots below, we can tune the knobs so that only the raw features are used, or we can let DAI perform transformations on those features. One of the most important features in this case seems to be a truncated SVD, which is a sort of linear combination of a subset of features. We trained many models, setting the knobs to different values, and compared the resulting false positive and false negative rates, highlighting the differences based on the scorer (AUC, F1, etc.) used (see figure below). Depending on the goal, which in this case would likely be to catch every malignant case regardless of the false positive count, we can select the scorer and DAI settings that suit us best.
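The trade-off between false positives and false negatives is easy to see with a small base R sketch. The labels and scores below are made up for illustration and are not taken from the breast cancer data or from any DAI model; rates_at is a hypothetical helper.

```r
# Toy labels (1 = malignant) and predicted probabilities; values are made up.
actual <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
score  <- c(0.95, 0.80, 0.60, 0.30, 0.70, 0.40, 0.20, 0.10, 0.05, 0.02)

# Error rates and F1 when everything scoring at or above the
# threshold is predicted malignant.
rates_at <- function(threshold) {
  predicted <- as.integer(score >= threshold)
  tp <- sum(predicted == 1 & actual == 1)
  fp <- sum(predicted == 1 & actual == 0)
  fn <- sum(predicted == 0 & actual == 1)
  tn <- sum(predicted == 0 & actual == 0)
  c(threshold = threshold,
    false_positive_rate = fp / (fp + tn),
    false_negative_rate = fn / (fn + tp),
    f1 = 2 * tp / (2 * tp + fp + fn))
}

# One row per threshold, one column per metric.
comparison <- t(sapply(c(0.5, 0.25), rates_at))
```

In this toy example, lowering the threshold from 0.5 to 0.25 removes the one missed malignant case at the cost of an extra false positive, which is the kind of trade-off we weigh when choosing a scorer.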

Similar to the analysis just shown using the breast cancer data set, we have trained hundreds of models on our data and selected the model parameters that lead to the best scores. Going back to our initial problem of identifying bots based on activity, we can now deploy our trained model and score billions of ad impressions every day. Because Driverless AI was used, we can be confident that we have extracted the most information possible from our features, or we have decided that speed or interpretability are more important than squeezing out the last drop of accuracy and precision from our model.

Final thoughts

Machine learning has come a long way since the term was first coined. Now, tools like DAI can help us in many ways. We were not able to discuss all of them within the scope of this blog post, but machine learning automation can help us by:

  • Reducing time to tune machine learning model parameters
  • Reducing time to train multiple models and investigate the performance of each
  • Automating feature engineering
  • Testing a range of model algorithms
  • Reducing time to deploy to production
  • Identifying differences in the training data and scoring data (data drift)
  • Visualizing data and machine learning steps
  • Interpreting results of feature engineering
