Using Apache Spark for Massively Parallel NLP

Jeff Palmucci posted July 17, 2015

Here at TripAdvisor we have a lot of reviews, several hundred million according to the last announcement. I work with machine learning, and one thing we love in machine learning is putting lots of data to use.

I’ve been working on an interesting problem lately and I’d like to tell you about it. In this post, I’ll set up the problem and the underlying technology that makes it possible. I’ll get into the algorithm itself in follow-up posts.

If you’ve been to Tripadvisor recently, you may notice that we have various pieces of meta data that we attach to hotels, restaurants, and attractions that appear on the site. Some of these (we call them tags) are simple yes or no questions that we get from various data sources. Does a hotel have a pool? Is this an Italian restaurant? Etc.

This information is great when you can get a reliable source, but what happens when you can’t? Maybe there’s a particular part of the world where you don’t have a good data source (we deal with all parts of the world, after all). Maybe there’s a very subjective question, like “is this a romantic hotel?” Any hotel manager would probably say yes to that question. A visitor may disagree.

So how do you get around these problems? It turns out that when you have as awesome a user base as us, you can just ask them. For a while now, we’ve been asking our users these simple yes or no questions at the end of our “write a review” form and other places. On average, they answer about 3 questions per review. That is incredibly useful to us, and by extension to other visitors on the site.

For the last little bit, I’ve been working on making the best possible use of these answers on a project we call “Auto Tagging”. You can simply ask a reviewer one of these tagging questions whenever you are not sure of an answer, but that’d waste their time and limit the coverage of the results. Instead, we build regression models based on natural language to predict the probability a user will answer “yes” or “no” to each question. That way we only have to ask if we are not sure of the answer given all the data at our disposal.

Specifically, we try to predict the probability that a user, when filling out a review form that we haven’t seen yet, would answer “yes” to the given tagging question. Note that is different from trying to predict a “yes” on a review form given the text on the form. That’s because we are not trying to predict whether the current reviewer had a romantic stay or if the hotel was very family friendly to them. We are trying to predict whether the next visitor to the hotel will have that experience. I’ve found that when people have a very romantic stay, they may vote the hotel romantic based just on their experience, and not necessarily any properties on the hotel. By predicting the expected proportion of future yes votes at a particular location, we can average out this noise and get an honest estimate of just how romantic, family-friendly, or whatever a particular hotel, restaurant, or attraction is.

We solve this problem using a semi-supervised form of logistic regression. A large portion of the model consists of “bag of words” type features from user submitted reviews on the properties. Since it is a semi-supervised technique, not only do we use the reviews on locations that we have tag votes on during training, we also use a large chunk of unlabeled data. Also, when applying the model to get the end results, we need to read and process all our reviews. On top of that, we have hundreds of different tags.

Think about that for a second. Millions and millions of reviews times hundreds of tags. The algorithm itself is pretty cool and I’ll talk about it in another blog post. Today, I’d like to talk about the technology that makes a solution possible.

We use Spark to power this algorithm. Spark is an excellent data parallel engine that allows you to spread your data among all the nodes in your cluster. It’s different than Map / Reduce in two important ways:

  1. It’s a lot easier to read and understand a Spark program because everything is laid out step by step without a lot of boilerplate. For example, check out the difference in implementing a word count (the “hello world of big data”) in Spark and Map / Reduce.
  2. It allows you to operate in memory, spilling to disk only when needed.

Given Spark as a base, it’s a pretty straightforward process working with all this data. We just read the reviews into memory spread across a bunch of the nodes in the cluster, and iteratively work one tag at a time. The large chunk of the time it takes to read and process the reviews into a useable format is just done once at the beginning of the process.

The entire process is broken down into three stages.

  • Dataset generation: We read all the reviews for all locations (restaurants / hotels / attractions) that have votes on them and generate a data set for each tag. This consists of feature selection, creating the feature vectors, and choosing the cross-validation sets. (The last 2 steps are actually nontrivial for this problem.)
  • Training: For each tag, profile the regularization parameter and train the model. This is an “embarrassingly parallel” stage that’s done pretty straightforwardly with Sparks “parallelize” operation. I just broadcast the data set to each node and parallelize the parameters I want to profile. Each task simply returns the out of sample error estimate that I use to select the model.
  • Application: Read all the reviews into memory. Yes, all of them. Iteratively score each tag on each location, dumping the results to Hadoop’s file system. Import these into Hive with “load data”, which is basically a quick HDFS rename operation.

This is all done remarkably efficiently. Spark gives me control over what to keep in memory and what to flush when it is no longer needed. I can also choose the way data is partitioned among the nodes. I make sure to partition by location ID so that reductions and grouping require very little inter-node communication.

On top of that, I’m able to use real, readable, Java code. I don’t have to restrict myself to working in some dialect of SQL, or deal with large amount of boilerplate and confusing code required with Map / Reduce.

I’m really happy with the control Spark gives me over our cluster and highly recommend you check it out.