At TripAdvisor, we use machine learning to assess whether a user’s review is substantive and helpful to other users. This article describes our motivations, technology, and results.
TripAdvisor members submit nearly one million reviews every week. We want to publish only the reviews that are helpful to other travelers, but our moderation team can’t possibly read every submitted review. If we can programmatically score a review’s helpfulness, we can automatically publish the obviously helpful reviews and only send to our moderators the few that are likely to be unhelpful.
We set out to build a text classifier that could “read” a review, score its helpfulness, and decide whether the review should be automatically rejected, automatically accepted and published, or queued for human moderation. Our tolerance for errors is asymmetric: it’s much better to publish a mediocre review than to reject a useful one and risk alienating a contributing member of our traveler community. Therefore, we bias the classifier to queue for moderation as many potentially unhelpful reviews as our moderation resources can support, without wasting moderators’ time on too many false positives (helpful reviews we should have auto-published).
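A minimal sketch of this three-way routing, written in Python for brevity. The names `Route` and `route_review`, the score `p_unhelpful`, and the threshold values are hypothetical and chosen only for illustration; the asymmetry in error tolerance shows up as a moderation threshold that can be tuned separately from an optional, much stricter rejection threshold.

```python
from enum import Enum


class Route(Enum):
    AUTO_PUBLISH = "auto_publish"
    MODERATE = "moderate"
    AUTO_REJECT = "auto_reject"


def route_review(p_unhelpful, moderate_at=0.5, reject_at=None):
    """Map a classifier score to a routing decision.

    p_unhelpful: hypothetical score in [0, 1]; higher means more likely unhelpful.
    moderate_at: scores at or above this value are queued for human moderators.
                 Lowering it queues more reviews (higher recall, lower precision).
    reject_at:   scores at or above this value are rejected without human review.
                 None disables automatic rejection entirely.
    """
    if reject_at is not None and p_unhelpful >= reject_at:
        return Route.AUTO_REJECT
    if p_unhelpful >= moderate_at:
        return Route.MODERATE
    return Route.AUTO_PUBLISH


if __name__ == "__main__":
    for p in (0.10, 0.70, 0.99):
        print(p, route_review(p).value)
```

With `reject_at` left as `None`, even a very high score only reaches a moderator, which is the conservative choice when rejecting a useful review is the costlier mistake.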
What do we mean by helpful and unhelpful?
Does the review tell travelers anything about the location that would help them decide whether to go? For example, this real review looks helpful:
Great location and very helpful staff
The location of this hotel is absolutely excellent (close to MoMA, Hells Kitchen district etc). Our room was nice and clean. The breakfast buffet was great. The concierges gave us very useful advice and the other staff members were also very helpful. We had a very enjoyable stay at this hotel. The wifi worked flawlessly for several devices once we got the access code from the reception (it was free).
Clearly, a review doesn’t have to be long or polished to be useful. The reviewer gave some nice details that could easily apply to your own vacation.
Compare that to:
Both my boys Loved the programs! One of my boys went to 5 classes and the other went to 2. They are still talking about it! They are hoping to go again next year! I would highly recommend this!
Yes, the reviewer is enthusiastic, but the details are unhelpful and say nothing about the classes themselves. Do you even know what is being reviewed?
Our human moderators are trained to quickly recognize whether a review is helpful or not. They then either publish it or reject it, in which case they send the author a helpful email explaining why. They’ve been doing this for 15 years, which, incidentally, gives us a wealth of data for training a machine learning system.
Technology and Methodology
We experimented with quite a number of approaches and settled on one that’s simple but effective.
1. Get some training data – Grab a set of reviews that human moderators decided were either publishable (helpful) or unhelpful, excluding reviews that were removed for other reasons.
   - This is a simple query against our mammoth Postgres database.
2. Extract bag-of-words features – Normalize the review text (which, in English, just means converting to lowercase), tokenize it (strip out punctuation and segment into words), and collect the set of unique words and bigrams (two-word phrases).
   - We use the tokenizer from ElasticSearch, as it is fast enough and very good at handling the unique requirements of each language.
3. Train – Train a classifier on the words and bigrams, selecting the best configuration using cross validation.
   - We opted for Stanford’s maximum entropy classifier, as it has good performance (both in terms of computation and in terms of accuracy), fits well into our Java 8 ecosystem, and has been heavily used in production settings for years. The classifier learns which words and phrases occur more often in reviews we reject and which occur more often in published reviews. It can then use the words of a new review to predict whether it is worth publishing. It’s also good at explaining itself; it can tell us what words contribute most to the classification.
   - We could have leveraged TripAdvisor’s extensive Hadoop/YARN/Spark computing cluster. However, the Stanford classifier doesn’t lend itself to parallel training, and the classifiers that come with Apache Spark are not as tested and robust as the Stanford classifier. We opted, then, for sequential training (which doesn’t need to be fast in this case) and parallel deployment, as described below.
4. Evaluate – Estimate precision and recall using a validation set.
   - Our training data from step 1 is biased by the system that determines whether a human moderator should look at a review in the first place. We don’t know for sure that auto-published (unmoderated) reviews are useful; we only know that they didn’t trigger our pre-existing moderation filters. Thus, to get a better estimate of recall, we gave our moderators a random set of reviews to label as helpful or unhelpful.
   - Precision is the fraction of reviews we queue for moderation that were actually worth rejecting: “true positives” / (“true positives” + “false positives”).
   - Recall is the fraction of unhelpful reviews that we queued for moderation: “true positives” / (“true positives” + “false negatives”).
5. Deploy – Install the trained classifier into our review processor.
   - The Stanford classifier is very fast to classify and is thread-safe, making it a natural fit into our multi-threaded system for evaluating and routing submitted reviews.
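The feature extraction and training steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the production pipeline: a regular expression stands in for the ElasticSearch tokenizer (and only handles plain English text), and a hand-rolled binary logistic regression, which is equivalent to a two-class maximum entropy model, stands in for the Stanford classifier. The function names, hyperparameters, and the four toy reviews are all made up for the example.

```python
import math
import re


def extract_features(text):
    """Return the set of unique lowercase words and bigrams in a review."""
    # Lowercase, then keep runs of letters, digits, and apostrophes as tokens.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    bigrams = {f"{a} {b}" for a, b in zip(tokens, tokens[1:])}
    return set(tokens) | bigrams


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, bias, text):
    """Estimated probability that a review is unhelpful."""
    z = bias + sum(weights.get(f, 0.0) for f in extract_features(text))
    return sigmoid(z)


def train(examples, epochs=200, lr=0.1, l2=0.01):
    """Fit binary logistic regression with stochastic gradient descent.

    examples: list of (text, label) pairs; label 1 = unhelpful, 0 = helpful.
    L2 decay is applied to a weight only when its feature is present.
    """
    data = [(extract_features(text), label) for text, label in examples]
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for feats, label in data:
            z = bias + sum(weights.get(f, 0.0) for f in feats)
            err = sigmoid(z) - label  # gradient of the log loss w.r.t. z
            bias -= lr * err
            for f in feats:
                w = weights.get(f, 0.0)
                weights[f] = w - lr * (err + l2 * w)
    return weights, bias


# Toy training set, invented for this sketch (1 = unhelpful, 0 = helpful).
TOY_EXAMPLES = [
    ("The room was clean and the breakfast buffet was great", 0),
    ("Hotel is close to the museum and the staff gave useful advice", 0),
    ("Does anyone know if the resort has a pool", 1),
    ("I need more characters to reach the limit", 1),
]

weights, bias = train(TOY_EXAMPLES)

# A linear model explains itself through its weights: the largest positive
# weights push toward "unhelpful", the most negative toward "helpful".
top_unhelpful = sorted(weights, key=weights.get, reverse=True)[:5]
top_helpful = sorted(weights, key=weights.get)[:5]

if __name__ == "__main__":
    print("unhelpful indicators:", top_unhelpful)
    print("helpful indicators:", top_helpful)
```

Because features are a set rather than counts, a word repeated ten times in a review contributes exactly as much as a word used once.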
Top Useful Words
The classifier found the following words to be the most useful for determining whether a review is helpful or not.
[Word lists: “Helpful Reviews Frequently Contain” and “Unhelpful Reviews Frequently Contain”]
Do you notice any trends? The helpful words tend to be specific and descriptive. The unhelpful words look like questions, complaints about our 200-character limit, or something unrelated to the reviewer’s experience.
Let’s see how these words helped us classify reviews.
The plot below illustrates the recall (the percentage of all unhelpful reviews we’re able to remove) at different levels of precision (the percentage of moderated reviews that are worth removing). The green line represents the expected performance of the classifier described above. The cloud around it represents a 95% confidence interval estimated by bootstrapping. For comparison, a classifier that guesses randomly (blue line) would have a precision equal to the fraction of our total review volume that is unhelpful, estimated at about 7%.
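For readers who want to reproduce this kind of analysis, here is a rough Python sketch of computing precision and recall at a score threshold, plus a percentile bootstrap interval for recall. The article does not say exactly how the bootstrap was set up, so resampling labeled reviews with replacement is an assumption, and the function names and toy scores are invented for illustration.

```python
import random


def precision_recall(scores, labels, threshold):
    """Precision and recall when reviews scoring >= threshold are queued.

    labels: 1 = unhelpful (the positive class), 0 = helpful.
    """
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def bootstrap_recall_interval(scores, labels, threshold,
                              n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for recall at a fixed threshold."""
    rng = random.Random(seed)
    n = len(scores)
    recalls = []
    for _ in range(n_resamples):
        # Resample the labeled reviews with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        _, r = precision_recall([scores[i] for i in idx],
                                [labels[i] for i in idx], threshold)
        recalls.append(r)
    recalls.sort()
    lo = recalls[int(n_resamples * alpha / 2)]
    hi = recalls[min(n_resamples - 1, int(n_resamples * (1 - alpha / 2)))]
    return lo, hi


# Toy validation set, invented for this sketch.
toy_scores = [0.90, 0.80, 0.70, 0.40, 0.30, 0.20, 0.10, 0.05]
toy_labels = [1, 1, 0, 1, 0, 0, 0, 0]

if __name__ == "__main__":
    print(precision_recall(toy_scores, toy_labels, 0.5))
    print(bootstrap_recall_interval(toy_scores, toy_labels, 0.5))
```

Sweeping `threshold` from high to low traces out the precision-recall trade-off: each step down queues more reviews, which can only raise recall but usually costs precision.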
To illustrate, here’s a review that the classifier scored as helpful but our moderators judged unhelpful:
Excellent service and lovely staff .Thank you for making our Christmas complete! We were welcomed with a smile and the service was spot on.
The words aren’t particularly indicative of being unhelpful, but together, they don’t say much.
On the other end, this review was classified as unhelpful, but moderators thought it was ok:
Spoilt day out with family ! “dog owners and small children beware”
I would normally never leave a review of this type but after my visit today I can do no other. The inn has been a regular stop off for a great lunch on our days out due to the fact that well behaved dogs are welcomed. My dog is a well trained springer spaniel and behaves better than most children but today my dog was attacked and bitten by, we are told, a regulars dog that has attacked before! Now my dog will heal but imagine this was a child, or elderly person. The couple of an older age were sitting to the right of the bar with a husky type dog. Please please please be aware if these people are in. If this dog is known to have aggressive tendencies, as told by the bar/waiting staff, then why the hell is it still allowed in? Is there not a duty of care towards customers from the management? I do look forward to the management response to this review. The food and inn are superb, just a pity about a known aggressive dog being allowed in a family environment.
It’s not the most sober or polished review, but it could be very helpful to other travelers.
So what were our most helpful and least helpful reviews according to the classifier?
The most helpful review, according to the classifier, was:
Fantastic staff, beach and grounds – bring a flashlight!
Overall I was most impressed by the staff, the beach and the resort grounds. We tipped generously and felt that the staff truly deserved extra tips. Most staff greeted us with a smile and Hola – and they always asked if we needed anything. Many of the staff had enough English to help us with whatever we needed. They are obviously hard workers, and take pride in their work. I certainly hope future guests take this into consideration, and I sincerely hope that Iberostar recognizes its staff as its best resource at this hotel. I’ve written a very extensive review, so if you want the finer details of our experience, read on:…
The review goes on for 7,000 more words!
The least helpful review (again, according to the classifier) was:
Nervous in NY prior to vacation
My boyfriend and I are scheduled to arrive January 10th. I did endless research and was assured by my travel agent that La Toc was everything it appeared to be and more. It would be $$$$$$$$ well spent. I have been reading reviews lately because Im super excited and Im feeling concerned. Understand, I am fully aware that a vacation alot of the time is what you make it and we have wonderful outgoing personalities but the price of this one allows me to put some responsibility on this “all inclusive resort”. I need some reassurance here please. Im wondering if we are going to have to struggle for the things and conditions we are expecting. I truly hope upon our return to revisit this site and write a review I can smile about. Does anyone know if there is someone at the resort I can speak with prior to quell my growing concerns? Thanks in Advance.
We built a classifier that scores whether a review’s text is likely to be helpful to other travelers. Its performance is good enough for choosing which reviews to route to our moderators, but even at its best precision it produces too many false positives for us to auto-reject any reviews. Remember that our tolerance for rejecting helpful reviews is very low.
There are many avenues we might consider for further performance improvements. We have looked at including non-word features, e.g., statistics on text length, words per sentence, and character entropy, but none improved on the words-only approach enough to warrant the additional complexity. In the future, we might consider taking into account the user’s review-writing history; if they’ve written helpful reviews in the past, perhaps we’ll give them more benefit of the doubt on a marginal review.
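To make the non-word features concrete, here is a small Python sketch of how such statistics could be computed. The exact definitions are assumptions: character entropy is taken to be the Shannon entropy of the review’s character distribution, sentences are split naively on terminal punctuation, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def character_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def text_statistics(text):
    """Compute simple non-word features for a review."""
    words = re.findall(r"\w+", text)
    # Naive sentence split on runs of '.', '!' or '?'; drops empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "length": len(text),
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "char_entropy": character_entropy(text),
    }


if __name__ == "__main__":
    print(text_statistics("Great food. Nice staff!"))
```

Very low entropy tends to flag padding such as repeated characters, which is one reason a statistic like this is a natural candidate feature even if it ends up adding little beyond the words themselves.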
We might also consider applying the classifier output in ways other than simply rejecting or publishing. For example, when we display reviews to other travelers, we could sort them by their expected helpfulness, as estimated by the classifier.