Query Understanding At TripAdvisor

Robert Hall posted July 11, 2019

Search is an important tool for navigating TripAdvisor.  The current implementation requires the user to specify a search query and optionally a “geo-scope” which prioritizes results which are within the specified country or city.  On the search result page we display a relevance ranked list of locations and other content, like forum posts, users, travel guides etc.

This approach has a few problems, the first of which is that users are typically accustomed to a “one box” search experience and so don’t break their query up into the two fields we require.  For example when looking for attractions in Paris the user may type the query “things to do in paris” with the worldwide geo-scope, rather than selecting Paris in the second box. When we interpret the query we just rank everything according to how well it matches the text, rather than restricting the results to just those in paris.  This leads to the top result of “Paris Las Vegas” which is a casino in Las Vegas and almost certainly not what the user wanted.

Another problem is that the ranked result list that we display is not the ideal experience for certain queries.  For a query like “hotels in Boston with free wifi” we would prefer to direct the user to a specific hotel list page, where the user can book a hotel.

In this case the hotels list page is better than the search results page since it is optimized for finding a hotel, including the ability to change dates or apply complex filtering, and enables the user to book a room.

To address these problems we began a project to do a better job of understanding search queries so that we can give the user the best experience possible.  This consists of:

  • Identifying when the users query contains a geo-scope, so that we can appropriately scope the results.
  • Identifying whether the query is asking for a list of locations and so could be better handled by the list page rather than the search results page.
  • Identifying when the user is asking about a specific location, so we could redirect them to the locations page.

The approach we have taken is to build a natural language processing pipeline, which aims to map the users query into a semantic frame which captures the intent of the query.  The pipeline consists of first segmenting the query into its components.  The types of components we care about are location names, “geo” names (e.g., cities, countries), and “target” phrases (e.g., “hotels with wifi” — the thing the user is searching for when it’s not a location name).  The segmentation is carried out by a combination of a CRF and LSTM model (see e.g., Bidirectional LSTM-CRF Models for Sequence Tagging by Huang et al.) which we trained on manually labeled data and implemented using TensorFlow.  The second step is to match the segments corresponding to geos and locations against our database.  This is what we refer to as the task of “entity recognition” and is the focus of this post. The final step is mapping the resulting segmented query along with the matched entities into a semantic frame.  This is achieved by a rule based system.

For example for the query “hotels in boston near fenway park” we first segment the query, which identifies “fenway park” as a location and “boston” as a geo.  We match the segments to our database and identify two candidates for “boston:” Boston, MA and Boston, England. Then finally by inspecting the segmentation and these candidates with our rule based system we map the query to a “nearby query” semantic frame, which means that the user was looking for a particular thing (in this case “hotels”) near to Fenway Park and in Boston.  To handle ambiguity we may end up mapping a query into multiple frames, for example if Boston in England also contained a place called “Fenway Park” then we may return a second frame corresponding to those locations – depending on the relative popularity of the two.

Entity Recognition

The problem of matching these segments of text against our database of locations is complicated since users typically don’t type out the exact name of the location.  Shown below are some examples of segments identified as locations along with the names of the locations that they probably refer to. These are found by hand and not by any model.

Segment TextMatching Location Name
westin carlsbadthe westin Carlsbad resort and spa
dreams puntacanadreams puntacana resort and spa
best western fallonBest western fallon inn and suites
The maxwell hotelThe maxwell hotel – a staypineapple hotel

These examples show some ways that users will systematically drop tokens out of the locations name when they submit their query.  Phrases like “resort and spa,” “inn and suites” are likely to be dropped, as well as the name of the hotel chain in the last example.  What’s more users will sometimes add tokens to the location name (“the”, “hotel” etc). This means that we can’t appeal to a pure text match to solve this problem.

The next most simple approach we could consider would be to score locations by some combination of their similarity to the text segment and some measure of their popularity (e.g., their pageviews over the last few months).  This approach is not satisfactory since we have cases where popularity can overwhelm the text match and other cases where the text similarity can overwhelm the popularity. For example the query “new york fries” refers to this location however using a combination of tf-idf similarity and popularity leads to matching it to New York City, since the popularity overwhelms the small advantage in tf-idf similarity that the real location has.  We could deal with this by making a more complicated heuristic matching function, but we don’t want to take this approach since we want something which can generalize to other languages.

Therefore the approach we took was to build a noisy channel model.  We assume that the user has some location in mind when they typed the query, and the text that we received is essentially a corrupted version of the location name.  Here the “noise” in the channel is just due to users misspelling the location name, or dropping or adding tokens to it as we saw above. At a high level, we can assume that we have some model which gives us the probability of each string given a particular location: P(M|L) where M is the “mention” — the string we observe in the query, and L is the location that the user was trying to express.  If we also know a prior distribution over locations P(L) then we can invoke Bayes’ rule to get at the probability of each location given a mention, and we can use this to assign locations to the mention.  In this case a reasonable prior distribution would be one based on the popularity of each location.

This model falls into the class of generative models, since it tries to explain how these “mentions” are generated.  The generative procedure associated with the model is shown in the below diagram. First the user chooses some location at random from P(L), then given the location they decide how to express the location name by sampling from P(M|L).

What remains is to decide on a structure for P(M|L) since this is the core of this approach.  What we used is something similar to a string edit model (see: Learning String Edit Distance by Ristad and Yianilos).  We essentially treat each location name and each mention as a sequence of tokens.  The way that the mention is constructed from the location name is by a sequence of edit operations on the tokens.  The operations we support are: copy (where we take a token from the location name and output it), insert (where we output a new token in the mention text) and delete (where we take a token out of the location name and don’t output it in the mention).  An example of a location name, a mention, and a sequence of string edit operations is shown below. The sequence of edit operations describes how we transform the location name into the mention. Depending on the edit operation at each step, one of the two strings might be empty.

Locationdreamspuntacanaresort
Mention
dreamsdominicuspuntacana
Edit Operationcopyinsertcopycopydelete

Note that there are many possible edit sequences which would lead to the same observed mention string (for example the extreme case where we delete every word of the location name then insert every word of the mention, and all the ways we could interleave the insert and delete operations).

Another way to view this edit model is as outputting a sequence of pairs of tokens where either element is allowed to be the empty token.  In the above table the tokens in each column would be the pairs. By associating a probability to each possible pair of tokens we end up with a model which gives us a joint probability P(M,L).  To compute that probability we need to sum over all the sequences of pairs which would have lead to the observed strings. This is achieved by a dynamic program which is very similar in structure to the Levenstein edit distance computation, except we sum the probabilities rather than minimizing the distance.  Using a similar dynamic program we compute the conditional P(M|L). Since we only supported very limited edit operations we end up with a number of parameters equal to three times the size of the vocabulary (one for each edit operation for each word).

This model is attractive since we can estimate the parameters in the absence of labeled training data.  Namely we can regard L as a latent variable and then appeal to the EM algorithm to estimate the parameters.  The procedure starts with a “guess” at the parameters (e.g., all equal) and iterates:

  1. For each mention, iterate over each location, compute P(M|L), and normalize to get P(L|M).
  2. Estimate the string edit model parameters using all the (M,L) pairs where each one is weighted by P(M|L), using the algorithm given in the above paper.

Note that there are really two latent variables here, the location assigned to each mention, and the sequence of string edit operations leading to the mention.  We effectively sum over both, the location explicitly in step 1 above, and the sequence during the dynamic program in step 2.

The iteration proposed in step 1 is extremely expensive since we are dealing with millions of locations.  So to make this procedure efficient we make a restriction which is that we only consider locations which are sufficiently similar to the mention text in terms of tf-idf similarity.  Namely we limit to the 20 most similar locations by this metric. This is just an ad-hoc optimization. With this approximation, we can fit the model to ~30K queries in a few seconds on a typical development machine.  

We can examine the learned parameters to see which tokens are most likely to be dropped out of the location name by the user, namely the words have the highest probabilities for the delete operation.  They are: hotel, and, the, resort, restaurant, spa, beach, inn, park. These all seem reasonable based on the example queries given above. Likewise we can see some cases where the noisy channel model improves over the simple heuristic of matching based on relevance and popularity.

Query TextRelevance + PopularityNoisy Channel
Dreams la RomanaLa RomanaDreams Dominicus La Romana
Doubletree phoenix northPhoenixDoubleTree by Hilton
Phoenix North
Big stuff bbqHot StuffBig Stuff Barbecue

In the first two cases the heuristic prefers matching to the larger geographical area rather than the specific hotel since the former has a much larger popularity which outweighs the improved relevance that the hotel has.  In the last case the heuristic makes a match on the basis of the word “stuff” which has a high tf-idf score. Presumably “Hot Stuff” is also more popular than “Big Stuff barbecue”. What’s interesting is that the noisy channel model finds the correct match even though we did not endow it with the knowledge that “bbq” and “barbecue” are essentially the same thing.  In this case it has made an insert and a delete operation in order to arrive at the observed query text.

Summary

We described at a high level our approach to query understanding and entity recognition.  The model we proposed to solve the latter is essentially an unsupervised noisy channel model which estimates probabilities for various string edit operations that we envision the user making when they type out a location name.  We gave the model a very simple parameterization where the only operations are insert delete and copy, and we observed an improvement over more simple heuristics for doing the matching. Going forwards it would be interesting to see if we can build a stronger model by allowing for more complex operations, such as to substitute one word for another, in the case where they mean the same thing (e.g., “bbq” and “barbecue” above) or when one word is a misspelling of another.  This would make the model more robust, since currently it depends on a critical mass of tokens from the location name actually appearing in the query text.

The entity recognition is just part of the larger project of query understanding, and in the future we will write more articles about the other steps of the process, namely the segmentation and framing, and share some experimental results.

Author’s Biography

Rob Hall is a Principal Machine Learning Engineer on the Machine Learning team at TripAdvisor, where he works on things like recommender systems, query understanding and various ranking models.  He graduated in 2012 from Carnegie Mellon University with a PhD jointly in Machine Learning and Statistics. After working as a data scientist for Etsy he joined Tripadvisor in 2016.