There are a lot of hotels on TripAdvisor. At the moment, there are 1790 hotels listed for Paris, 1054 hotels for London, and 466 hotels in New York City. We have been working on better ways to explore these hotels, and find an interesting place to stay. In this blog post, I’ll describe an upcoming feature on TripAdvisor, that uses Natural Language Processing (NLP) to find groups of hotels that have an interesting theme. It is a semi-automated process that can be scaled across many cities. There is some interesting technology behind it.
For New York City, we might show a list of collections like this…
We have collections for “Times Square Views”, “Trendy Soho”, “Catch a Show”, and “Art Deco Classic”, among others. Each of those is a group of about 10 hotels that fit the theme. How can we automate finding these interesting groups, even for cities we’re not familiar with?
Topic Modeling is Boring
The most textbook approach might be to build a topic model based on the review text for the hotels. I first tried a Latent Dirichlet Allocation (LDA) model. It assumes that each review is a mixture of a number of latent topics. You can represent each review as a vector with a “bag of words” representation of the review text.
I tried building an LDA, model, but got very boring results. Here are some of the first topics for New York, which a mix of a bunch of general hotel related topics.
0.018 breakfast + 0.015 nice + 0.015 coffee + 0.011 small + 0.009 staff + 0.008 location + 0.008 great + 0.008 clean + 0.006 bed + 0.006 stayed 0.032 staff + 0.020 great + 0.016 location + 0.014 stayed + 0.013 friendly + 0.012 helpful + 0.011 new + 0.010 york + 0.009 perfect 0.015 location + 0.015 great + 0.014 square + 0.013 times + 0.010 walk + 0.010 subway + 0.009 clean + 0.008 around + 0.007 staff 0.010 new + 0.007 distrikt + 0.007 great + 0.006 york + 0.006 breakfast + 0.006 city + 0.006 nyc + 0.005 service + 0.005 staff 0.012 staff + 0.009 yotel + 0.009 desk + 0.008 time + 0.007 front + 0.006 arrived + 0.006 made + 0.006 service + 0.005 great 0.023 new + 0.019 breakfast + 0.018 york + 0.013 square + 0.011 time + 0.011 empire + 0.010 state + 0.010 building 0.019 great + 0.017 staff + 0.016 bar + 0.012 breakfast + 0.012 location + 0.011 wine + 0.011 new + 0.009 york + 0.009 friendly 0.019 great + 0.019 park + 0.018 square + 0.017 location + 0.016 times + 0.016 central + 0.014 staff + 0.012 walking 0.025 air + 0.010 conditioning + 0.009 staff + 0.008 night + 0.007 service + 0.007 stayed + 0.006 smoking + 0.006 new
Single words just don’t have the richness to represent interesting topics. What about ‘breakfast’? Was it good, bad, free, or wonderful? Most hotel review mix up talking about the room, the location, the staff and more all at once,and these topics show that mixing. But we want to just focus on a single notable aspects of a hotel, not summarize it completely.
Phrases as raw material
Since single words aren’t very expressive, I wanted to find some good phrases in our review text. What is a phrase, exactly? In this case, we mean an n-gram of words that has a specfic semantic meaning. We tried the approach from the paper “Mining Quality Phrases from Massive Text Corpora”, by Liu et. al. at the Univeristy of Illinois and Microsoft (some interesting slides). Because they released an open source implementation of their SegPhrase algorithm, we were able to try it out.
It is a multiple step process, but the output is that you get a large set of phrases, and a score on how likely it is to be a phrase. We dubbed this score the “phrasiness” metric. So what are the strongest phrases in New York?
club_quarters 0.999620288 hampton_inn 0.999542304 rush_hour 0.999526829 frosted_glass 0.999506234 ritz_carlton 0.999476254 usa_today 0.999473892 jersey_boys 0.999458328 holiday_inn_express 0.999450456 art_deco 0.9994495 gordon_ramsay 0.999448261 battery_park 0.999418922 grand_central_station 0.999410719 naked_cowboy 0.999401511 yankee_stadium 0.999390047 penn_station 0.999386755 columbus_circle 0.999381838 charlie_chaplin 0.999381761 scrambled_eggs 0.999379073 jet_lag 0.999370422 affinia_dumont 0.999364144 harry_potter 0.999357816 les_halles 0.999352377 air_conditioning 0.999346666 mamma_mia 0.999345891 hudson_river 0.999345247 pinot_noir 0.999344796 woody_allen 0.999337025 fairy_tale 0.999306646 grand_central 0.999304571 radio_city_music_hall 0.999301883
The number shown is the phrasiness score. All of these are clear semantic concepts, not just a sequence of words that frequently co-occur.
This is a list of “phrases” with much lower “phrasiness” scores:
the_staff_were_great 0.116338786 located_on_the_ground_floor 0.116338612 the_rooms_are_very_small 0.116337216 requests_were_honored 0.116334616 an_older_crowd 0.11633381 booked_our_return 0.116329291 quite_rudely 0.116321934 and_pedro 0.116312719 helped_ourselves 0.11630484 after_spending_over 0.116302541 my_husband's_birthday 0.116302346 being_slightly 0.116297622 no_inconvenience 0.116296059 far_greater 0.116295743 longer_period_of_time 0.116291693
You can imagine that these phrases frequently occur in reviews, but don’t really represent a single concept.
We used a threshold of 0.5 for phrasiness, which reduced the approximately 250 thousand potential phrases for New York to around 35 thousand. While people often talk about air_conditioning in a review, and it is clearly a phrase, that’s not the kind of aspirational topic that people will want to browse through. Importantly, there are plenty of phrases that have more interesting emotional content, like art_deco, fairy_tale, andradio_city_music_hall. So we have the raw material we need to find collections. In fact, we may have too much raw material, as there are tens of thousands of a good phrases just for New York hotels.
word2vec Party Tricks
We have a bunch of phrases, and many of them are very similar. These phrases …
amazing_view_of_central_park amazing_views_of_central_park beautiful_view_of_central_park beautiful_views_of_central_park central_park_views fantastic_view_of_central_park fantastic_views_of_central_park great_view_of_central_park great_view_over_central_park great_views_of_central_park nice_view_of_central_park spectacular_view_of_central_park wonderful_views_of_central_park
seem like a great basis for making a hotel collection. But they are mixed in with tens of thousands of other phrases. It would be very helpful to cluster the related phrases into groups. But what features should we use to do the clustering? We need a similarity metric that puts these phrases close to each other.
The word2vec algorithm from Mikolov and other Google researchers is perfect for this case. Given a corpus of text, it maps each word into a numerical vector of say 100 dimensions. The mapping is done in a way that put words that are used in a similar way in the corpus are near each other in the vector space.
The papers on word2vec give lots of party tricks you can do by adding and subtracting the vectors. For example vector(‘Paris’) – vector(‘France’) + vector(‘Italy’) results in a vector that is very close to vector(‘Rome’), and vector(‘king’) – vector(‘man’) + vector(‘woman’) is close to the vector(‘queen’).
I took all 9 billion words of review text at TripAdvisor, and build a word2vec model with around 1 million unique words. I used the C implementation of word2vec which can be easily run from the command line.
Here are some example words, and their nearest neighbors:
great terrific 0.932 fantastic 0.908 fabulous 0.876 wonderful 0.861 brilliant 0.852 fab 0.849 superb 0.818 good 0.787 lovely 0.782 phenomenal 0.773 noisy noisey 0.934 loud 0.880 nosiy 0.748 noicy 0.729 distracting 0.726 busy 0.711 crowded 0.705 annoying 0.696 disruptive 0.693 rowdy 0.685 rude unfriendly 0.908 unhelpful 0.898 arrogant 0.882 impolite 0.876 unprofessional 0.872 surly 0.864 dismissive 0.849 abrupt 0.843 ignorant 0.840 hostile 0.826 bed beds 0.814 mattress 0.780 sofabed 0.717 couch 0.709 matress 0.708 duvet 0.690 bedroom 0.683 futon 0.653 beds 0.652 room 0.650
The numbers in the table are the cosine similarity of the top word with the near neighbors. We can see that word2vec is doing a nice job of mapping related words to be near each other.
For our short phrases, we just added the individual word vectors to create a phrase vector, ignoring stop words. So now we have a numerical vector of features for each phrase, and we can use any clustering technique.
With word2vec, people usually use the cosine distance, where a larger value is better. However, in clustering we usually work with a distance, where smaller is better. If we normalize our word2vec vectors so they have a unit norm, then we can just minimize the Euclidean distance, as this is equivalent to maximizing the cosine similarity (see the Properties section here for the simple derivation). So going forward, we will use the Euclidean distance of the two normalized word2vec phrase vectors as our distance metric.
Clustering phrases into tight topics
Usually, in a clustering problem we are interested in all of the clusters. Here, we have a slightly different situation. We want to group similar keywords into very focused groups, like our “view of central park” phrases, above. A small fraction of the clusters, say 100 or so, will be the basis for our collections. The rest of the clusters will be focused on more boring topics like air conditioning, and we’ll just ignore those.
What clustering technique will do a nice job of giving us those tight clusters? The default k-means approach for a fairly small k isn’t probably going to work too well. It will produce broad groups, and concentrate on getting each phrase into a good cluster, even though we don’t really care about most of the phrases.
Agglomerative clustering starts with each phrase in its own cluster, so we have N clusters of size 1. It then goes through and tries to combine each pair of current clusters into a new combined one. With complete linkage, the score of the combined cluster is the maximum of the distances between every pair of phrases in the combined cluster. The two clusters with the smallest maximum score after combination are actually combined, leaving us with N-1 clusters. This process is repeated until we have 2 clusters left.
The overall combination process forms a tree of nodes. At internal nodes, a cluster is composed of the phrases that are leaves above the node in the tree. If we pick a threshold score, we can prune off subtrees where every cluster has a score better than the threshold. We only need to make the agglomerative tree once, and then we can quickly experiment with different thresholds.
Let’s look at an example for our New York city hotel data. I chose 18 phrases, such that they formed some natural clusters. There are 4 keywords about the view, 6 about being comfortable, 2 about the wi-fi, 1 about breakfast, 3 about friendly staff, and 2 about feeling safe. With luck, the phrases should be clustered into these natural clusters at some point. Here is the clustering, with the natural clusters shown as different colors.
The number shown in internal nodes provide the order of the merge, where smaller numbers were merged first, and hence more similar. The leaf nodes were given numbers 0 to 17, and so internal node 18 was the first combined cluster with comfortable_beds_and_pillows and comfy_bed_and_pillows. The next node 19 combined friendly_accommodating_staff and friendly_assistance, etc.
In general, the clustering did an excellent job on the test case. A phrase like sooooo_comfy is the least like the other phrases talking about comfort, so it goes in last among the red group.
With the right threshold, we can get natural clusters by themselves. In practice, we can get many useful clusters to use for collections.
We can also go back and find the hotels that frequently mention the phrases in a cluster. These hotels would be the member of the collection formed by a given cluster.
The human curation post-process
Up to this point we have had a totally automated process, that use phrase detection, word2vec features, and agglomerative clustering to find clusters of phrases. The remaining step is to find the interesting ones. At this point we rely on human curation to pick out the interesting ones, and give them a snappy name.
The manual process is fairly easy. We provide a list of phrases for a cluster, the hotels that are the most related to the cluster, even some examples sentences from the review text for those hotels. It is fairly easy to look at that information, and determine if it would a good way to explore the hotels of a city. It is hard to mathematically define what is interesting, but easy for a human to know when they see it. The human also comes up with a clever name, which is also simple given the list of phrases.
Some interesting collections
The “Catch a Show” collection has phrases like this:
at_radio_city_music_hall b'way_shows beacon_theater beacon_theatre broadway_dance_center broadway_play broadway_plays broadway_shows broadway_shows_and_great_restaurants broadway_shows_and_restaurants comedy_shows david_letterman_show easy_walk_to_broadway_shows evening_entertainment great_shows radio_city_hall radio_city_music radio_city_music_hall radio_city_music_hall_and theater_shows theatre_shows walking_distance_to_broadway_shows walking_distance_to_broadway_theaters walking_distance_to_shows walking_distance_to_theatre
My personal favorite when I’m in New York, the “Near The High Line” collection has:
chelsea_market_and_high_line chelsea_market_and_the_highline high_line high_line_park highline_park highline_walk highline_walkway the_high_line_park
The whole process provides insight into a particular city, picking out interesting neighborhoods, features of the hotels, and nearby attractions. All the things someone staying in the hotel might be interested in. Different cities will result in different collections.
Once the user has some ideas, they can focus in on a few hotels, and see if they would be a great place to stay. It wouldn’t be possible without the insights that Natural Language Processing provides.