On TripAdvisor we have lists of restaurants and hotels which can be filtered, e.g., to show only restaurants serving a particular cuisine. Each location has a set of “tags” which determine which filtered lists it appears in: for example, a restaurant may have the tag “italian cuisine” or a hotel the tag “free wifi”, in which case the location is displayed when the corresponding filter is applied. These tags are determined by a combination of user votes and machine learning models described in this article. When a user selects a filter, we want to highlight the snippets of review text that are most relevant to the filters they selected. For example, when a user selects “italian cuisine” we want to highlight review text talking about italian food items.
We did this in three stages:
- Determine which terms in review text are most strongly associated with each tag (for example for italian food we wanted to find the names of food items).
- Scan the review text for instances of these terms and build up a set of snippets for each location.
- Choose the best snippets for each location.
Below we describe how we solved each of these problems.
Tag-Keyword Association Mining
The problem of determining which phrases in review text are most associated with each tag can be thought of as a case of association rule mining. To determine which words are most strongly associated with each tag, we built a contingency table for each tag/phrase pair, where the cells count the locations having that tag and phrase. To make this tractable, we limited consideration to “keywords” rather than all possible phrases in the review text. Keywords are phrases mined from review text using “segphrase”, which is described in this paper, and the results can be seen on hotel review pages. They consist of phrases such as dish names, the names of attractions near hotels, and so on. A resulting contingency table looks like this:
| | Has keyword "pizza" | Does not have keyword "pizza" |
| --- | --- | --- |
| Tagged as "italian" | 5000 | 15000 |
| Not tagged as "italian" | 50000 | 5000000 |
From these tables we can calculate various measures of association between the tag and the keyword. An obvious one is to compare the conditional probability of the keyword given presence of the tag, to the probability of the keyword given the absence of the tag.
In the above case we get
- P(pizza | italian) = 5000 / 20000 = 0.25
- P(pizza | not italian) = 50000 / 5050000 ≈ 0.01
So “pizza” is roughly 25 times more likely to appear in a restaurant tagged as italian than in one without the tag. We can use this ratio as a kind of “association score.” After computing it for each keyword/tag pair we can find the keywords with the highest scores for each tag. This method struggles with sparse keywords, however: if a keyword appears in only one location, it will have an infinite association score for each of that location’s tags. To deal with this we simply remove keywords with insufficient frequency, and also smooth the counts by adding a constant value to each cell of the table. The smoothing can be seen as a Bayesian method in which we place a beta prior distribution on each of the conditional probabilities we compute.
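The smoothed score can be computed directly from the four cells of the contingency table. A minimal sketch (the cell names and the smoothing constant here are illustrative choices, not the exact production values):

```python
# Smoothed "association score": ratio of P(keyword | tag) to P(keyword | no tag),
# computed from a 2x2 contingency table of location counts.

def association_score(a, b, c, d, alpha=1.0):
    """a: tagged & has keyword, b: tagged & lacks keyword,
    c: untagged & has keyword, d: untagged & lacks keyword.
    alpha: pseudo-count added to the cells (beta-prior smoothing)."""
    p_kw_given_tag = (a + alpha) / (a + b + 2 * alpha)
    p_kw_given_no_tag = (c + alpha) / (c + d + 2 * alpha)
    return p_kw_given_tag / p_kw_given_no_tag

# Using the table above (unsmoothed), "pizza" scores about 25.
score = association_score(5000, 15000, 50000, 5000000, alpha=0.0)
```

With `alpha > 0`, a keyword seen in a single location no longer gets an infinite score, which is exactly the sparsity fix described above.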
An alternative method to the conditional probability ratio is to calculate the extent to which the presence of the keyword and the tag deviate from being independent. This can be done by an analog of the method detailed in a classic paper where the goal was to determine when two words appearing in a sentence were part of a phrase or not. The idea is to score each tag-keyword pair with the likelihood ratio statistic for testing the null hypothesis of:
- H0: P(tag, keyword) = P(tag) * P(keyword)
The form of this statistic can be found in the above paper. In essence it tells us how much better a multinomial distribution fits the observed data, compared to the product of two binomial distributions. The main difference between this method and the former is that the scale of the statistic depends on the amount of data in the table. This can be advantageous since it deals with sparsity issues nicely: a keyword which appears in few locations will not get a high association score with any tag. On the other hand, it means that if we want to choose a score threshold above which to declare a keyword “relevant” to a tag, the threshold has to be chosen on a per-tag basis, since different tags have different amounts of data.
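One standard way to write this statistic for a 2x2 table is G² = 2 · Σ O · ln(O / E), where O are the observed cell counts and E the counts expected under independence. A sketch, assuming this standard formulation (the original implementation may differ in details):

```python
import math

# Log-likelihood ratio statistic (G^2) for the null hypothesis that
# tag and keyword occur independently, from a 2x2 table of location counts.

def llr(a, b, c, d):
    """Cells as before: rows = tagged / not tagged, cols = has / lacks keyword."""
    n = a + b + c + d
    observed = [a, b, c, d]
    # Expected counts under independence: row total * column total / n.
    expected = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
                (c + d) * (a + c) / n, (c + d) * (b + d) / n]
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
```

Note how the score grows with the counts: the same proportions with ten times the data give a ten-times-larger statistic, which is why thresholds must be set per tag.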
Results from these methods applied to the tag of “italian cuisine” look like:
(Table: top-scoring keywords for “italian cuisine” under each measure, Probability Ratio vs. Log Likelihood Ratio.)
The main difference in the results is that the probability ratio assigns a high score to keywords which appear almost entirely in restaurants with the tag, even when the total frequency of that keyword is small. On the other hand the likelihood ratio approach assigns higher scores to more common keywords since the score scales with both the frequency of the term, and the degree to which the term is associated with the tag.
Snippet Generation
Once we have a set of “relevant” keywords for each tag, we scan the review text for each location to pre-compute “snippets” which we can display. These are short substrings of review text which contain the keyword. When we compute the snippets we want to show some context in which the keyword appears, so we attempt to center it in the snippet. We also try to find snippets which don’t cross sentence boundaries, since an adjacent sentence may be irrelevant to the keyword. We therefore use the OpenNLP sentence tokenizer to split each review into sentences, look in each sentence for one of the keywords, and build a snippet by taking an appropriately long substring of the sentence. There are a few applications of these snippets, each with its own constraints on snippet length, so we generate a few different lengths of each.
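The extraction step can be sketched as follows. The real pipeline uses the OpenNLP sentence tokenizer; the naive regex split and the length parameter here are illustrative stand-ins:

```python
import re

# Sketch of snippet extraction: split a review into sentences, find a keyword,
# and take a window of the sentence roughly centered on the keyword.

def make_snippet(review, keyword, max_len=60):
    for sentence in re.split(r"(?<=[.!?])\s+", review):
        idx = sentence.lower().find(keyword.lower())
        if idx < 0:
            continue
        if len(sentence) <= max_len:
            return sentence
        # Center the keyword in a max_len window, clamped to the sentence.
        start = max(0, min(idx + len(keyword) // 2 - max_len // 2,
                           len(sentence) - max_len))
        return sentence[start:start + max_len]
    return None  # no sentence mentions the keyword
```

Because the window is clamped to a single sentence, the snippet never drags in text from a neighboring sentence that may be irrelevant to the keyword.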
Scoring Snippets for Sentiment
Depending on the application we may want to get a sense of the sentiment of the snippet (i.e., whether the snippet praises the location or is critical of it). In order to do this we could look at the “bubble rating” (the TripAdvisor equivalent of star ratings) of the review from which the snippet was extracted. However there are cases when a generally positive review has a negative sentence, and if the snippet comes from that sentence then the bubble rating of the review is not a good indicator of the sentiment.
Therefore we built a model which assigns a score to each snippet according to how positive it is. Since we have a giant corpus of review text it was easy to build our own model, which also ensures the model understands the vocabulary used in our reviews; a pre-trained model may work well on other types of data (e.g., product reviews) but not on hotel reviews. To build the training set we took all of our extracted snippets, along with the bubble rating of the review from which each snippet was extracted. We kept only snippets from reviews with 5 bubbles and those with 2 bubbles or fewer, and built a model to predict which of these two categories a snippet fell into.
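The idea can be illustrated with a toy version of the classifier: a bag-of-words linear model with logistic loss, trained by gradient descent. The production model was trained with vowpal wabbit on a far larger corpus; the tiny hand-made training set and learning-rate choices below are purely illustrative.

```python
import math
from collections import defaultdict

# Toy sentiment model: 1 = snippet from a 5-bubble review,
# 0 = snippet from a review with 2 bubbles or fewer.
train = [
    ("great hotel and outstanding stay", 1),
    ("amazing hotel very clean beautiful lobby", 1),
    ("wonderful service great location", 1),
    ("disgusting hotel and even worse service", 0),
    ("abysmal stay dirty room", 0),
    ("terrible service worse than expected", 0),
]

weights = defaultdict(float)  # one weight per token

def score(tokens):
    return sum(weights[t] for t in tokens)

for _ in range(200):  # epochs of plain stochastic gradient descent
    for text, label in train:
        tokens = text.split()
        p = 1 / (1 + math.exp(-score(tokens)))  # predicted P(positive)
        for t in tokens:  # logistic-loss gradient step
            weights[t] += 0.1 * (label - p)
```

After training, positive-sounding tokens such as "great" end up with positive weights and tokens such as "disgusting" with negative ones, mirroring the weight inspection described below.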
The software we used to estimate the model is vowpal wabbit, which can be found here. The features we used are just the tokens of text in the snippet. If we examine the weights of the resulting model we find that the features with the highest weights are all positive-sounding words, and likewise for the negative weights:
(Table: tokens with the highest and lowest weights in the trained model.)
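As a rough sketch of the setup, vowpal wabbit consumes plain-text examples where the label comes first and the features follow a pipe; with logistic loss the labels are +1/-1. The file names below are illustrative, not the actual production configuration:

```
# train.vw — one snippet per line; the features are just its tokens
1 | great hotel and outstanding stay
-1 | disgusting hotel and even worse service
```

Training and prediction then look like `vw train.vw --loss_function logistic -f sentiment.model` and `vw -t -i sentiment.model -p predictions.txt test.vw`.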
For the keyword “hotel,” some snippets which get a high score under the resulting model are:
- Great Hotel and Outstanding Stay
- Amazing Hotel, very clean, beautiful Lobby and in
On the other end of the score spectrum we have e.g.:
- This was a disgusting hotel and even worse service
- … overshadowed by an abysmal stay at this hotel
Conclusion
This article gave an overview of how we mine relevant segments of review text from hotel and restaurant reviews at TripAdvisor. These segments can be used in various products whenever we want to give the user a short summary of a location which may spur their interest. To make snippet generation tractable, we restricted snippets to those containing one of a pre-computed set of phrases. To produce relevant snippets we chose which phrases to highlight based on the filters the user selected, using a statistical measure of association. Finally, we scored each snippet for sentiment, using a simple linear classifier trained on our corpus of review text.