Using NLP to Find “Interesting” Collections of Hotels

Craig Schmidt posted October 5, 2015

There are a lot of hotels on TripAdvisor. At the moment, there are 1790 hotels listed for Paris, 1054 hotels for London, and 466 hotels in New York City. We have been working on better ways to explore these hotels and find an interesting place to stay. In this blog post, I’ll describe an upcoming feature on TripAdvisor that uses Natural Language Processing (NLP) to find groups of hotels that share an interesting theme. It is a semi-automated process that can be scaled across many cities, and there is some interesting technology behind it.

For New York City, we might show a list of collections like this…

[Image: example hotel collections shown for New York City]

We have collections for “Times Square Views”, “Trendy Soho”, “Catch a Show”, and “Art Deco Classic”, among others. Each of those is a group of about 10 hotels that fit the theme. How can we automate finding these interesting groups, even for cities we’re not familiar with?

Topic Modeling is Boring

The most textbook approach might be to build a topic model based on the review text for the hotels. I first tried a Latent Dirichlet Allocation (LDA) model. It assumes that each review is a mixture of a number of latent topics. You can represent each review as a vector with a “bag of words” representation of the review text.
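As a rough illustration, here is a minimal sketch of that approach, assuming gensim (the library is not named in this post) and a hypothetical tokenized_reviews list of token lists:

    from gensim import corpora, models

    # tokenized_reviews: a hypothetical list of token lists, one per review.
    dictionary = corpora.Dictionary(tokenized_reviews)                      # word <-> id mapping
    corpus = [dictionary.doc2bow(review) for review in tokenized_reviews]   # bag-of-words vectors

    # Fit an LDA model and inspect the top words in each latent topic.
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=20, passes=5)
    for topic in lda.print_topics(num_topics=10, num_words=10):
        print(topic)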

I tried building an LDA model, but got very boring results. Here are some of the first topics for New York, which are a mix of general hotel-related topics.

    0.018 breakfast + 0.015 nice + 0.015 coffee + 0.011 small + 0.009 staff + 0.008 location + 0.008 great + 0.008 clean + 0.006 bed + 0.006 stayed
    0.032 staff + 0.020 great + 0.016 location + 0.014 stayed + 0.013 friendly + 0.012 helpful + 0.011 new + 0.010 york + 0.009 perfect 
    0.015 location + 0.015 great + 0.014 square + 0.013 times + 0.010 walk + 0.010 subway + 0.009 clean + 0.008 around + 0.007 staff 
    0.010 new + 0.007 distrikt + 0.007 great + 0.006 york + 0.006 breakfast + 0.006 city + 0.006 nyc + 0.005 service + 0.005 staff 
    0.012 staff + 0.009 yotel + 0.009 desk + 0.008 time + 0.007 front + 0.006 arrived + 0.006 made + 0.006 service + 0.005 great 
    0.023 new + 0.019 breakfast + 0.018 york + 0.013 square + 0.011 time + 0.011 empire + 0.010 state + 0.010 building 
    0.019 great + 0.017 staff + 0.016 bar + 0.012 breakfast + 0.012 location + 0.011 wine + 0.011 new + 0.009 york + 0.009 friendly 
    0.019 great + 0.019 park + 0.018 square + 0.017 location + 0.016 times + 0.016 central + 0.014 staff + 0.012 walking 
    0.025 air + 0.010 conditioning + 0.009 staff + 0.008 night + 0.007 service + 0.007 stayed + 0.006 smoking + 0.006 new

Single words just don’t have the richness to represent interesting topics. What about ‘breakfast’? Was it good, bad, free, or wonderful? Most hotel reviews mix up talking about the room, the location, the staff, and more all at once, and these topics show that mixing. But we want to focus on a single notable aspect of a hotel, not summarize it completely.

Phrases as raw material

Since single words aren’t very expressive, I wanted to find some good phrases in our review text. What is a phrase, exactly? In this case, we mean an n-gram of words that has a specific semantic meaning. We tried the approach from the paper “Mining Quality Phrases from Massive Text Corpora”, by Liu et al. at the University of Illinois and Microsoft (some interesting slides). Because they released an open source implementation of their SegPhrase algorithm, we were able to try it out.

It is a multi-step process, but the output is a large set of phrases, each with a score for how likely it is to be a true phrase. We dubbed this score the “phrasiness” metric. So what are the strongest phrases in New York?

    club_quarters           0.999620288
    hampton_inn             0.999542304 
    rush_hour               0.999526829
    frosted_glass           0.999506234 
    ritz_carlton            0.999476254 
    usa_today               0.999473892
    jersey_boys             0.999458328 
    holiday_inn_express     0.999450456 
    art_deco                0.9994495
    gordon_ramsay           0.999448261
    battery_park            0.999418922 
    grand_central_station   0.999410719 
    naked_cowboy            0.999401511 
    yankee_stadium          0.999390047 
    penn_station            0.999386755 
    columbus_circle         0.999381838 
    charlie_chaplin         0.999381761 
    scrambled_eggs          0.999379073 
    jet_lag                 0.999370422 
    affinia_dumont          0.999364144 
    harry_potter            0.999357816 
    les_halles              0.999352377 
    air_conditioning        0.999346666 
    mamma_mia               0.999345891 
    hudson_river            0.999345247 
    pinot_noir              0.999344796 
    woody_allen             0.999337025 
    fairy_tale              0.999306646 
    grand_central           0.999304571 
    radio_city_music_hall   0.999301883

The number shown is the phrasiness score. All of these are clear semantic concepts, not just a sequence of words that frequently co-occur.

This is a list of “phrases” with much lower “phrasiness” scores:

    the_staff_were_great        0.116338786
    located_on_the_ground_floor 0.116338612
    the_rooms_are_very_small    0.116337216
    requests_were_honored       0.116334616
    an_older_crowd              0.11633381
    booked_our_return           0.116329291
    quite_rudely                0.116321934
    and_pedro                   0.116312719
    helped_ourselves            0.11630484
    after_spending_over         0.116302541
    my_husband's_birthday       0.116302346
    being_slightly              0.116297622
    no_inconvenience            0.116296059
    far_greater                 0.116295743
    longer_period_of_time       0.116291693

You can imagine that these phrases frequently occur in reviews, but don’t really represent a single concept.

We used a threshold of 0.5 for phrasiness, which reduced the approximately 250 thousand potential phrases for New York to around 35 thousand. While people often talk about air_conditioning in a review, and it is clearly a phrase, that’s not the kind of aspirational topic that people will want to browse through. Importantly, there are plenty of phrases that have more interesting emotional content, like art_deco, fairy_tale, and radio_city_music_hall. So we have the raw material we need to find collections. In fact, we may have too much raw material, as there are tens of thousands of good phrases just for New York hotels.
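The filtering itself is just a threshold on the score; a minimal sketch, assuming a hypothetical phrase_scores dict built from the SegPhrase output (toy values shown here):

    # Keep only phrases whose "phrasiness" score clears the 0.5 threshold.
    phrase_scores = {'art_deco': 0.9994495, 'the_staff_were_great': 0.116338786}  # toy example
    good_phrases = {p: s for p, s in phrase_scores.items() if s >= 0.5}
    print(len(good_phrases))  # roughly 35 thousand on the full New York data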

word2vec Party Tricks

We have a bunch of phrases, and many of them are very similar. These phrases …

    amazing_view_of_central_park
    amazing_views_of_central_park
    beautiful_view_of_central_park
    beautiful_views_of_central_park
    central_park_views
    fantastic_view_of_central_park
    fantastic_views_of_central_park
    great_view_of_central_park
    great_view_over_central_park
    great_views_of_central_park
    nice_view_of_central_park
    spectacular_view_of_central_park
    wonderful_views_of_central_park

seem like a great basis for making a hotel collection. But they are mixed in with tens of thousands of other phrases. It would be very helpful to cluster the related phrases into groups. But what features should we use to do the clustering? We need a similarity metric that puts these phrases close to each other.

The word2vec algorithm from Mikolov and other Google researchers is perfect for this case. Given a corpus of text, it maps each word into a numerical vector of, say, 100 dimensions. The mapping is done in a way that puts words that are used in similar ways in the corpus near each other in the vector space.

The papers on word2vec give lots of party tricks you can do by adding and subtracting the vectors. For example, vector(‘Paris’) – vector(‘France’) + vector(‘Italy’) results in a vector that is very close to vector(‘Rome’), and vector(‘king’) – vector(‘man’) + vector(‘woman’) is close to vector(‘queen’).

I took all 9 billion words of review text at TripAdvisor and built a word2vec model with around 1 million unique words. I used the C implementation of word2vec, which can be easily run from the command line.
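Once trained, the binary vectors can be loaded and queried from Python. Here is a minimal sketch assuming a recent version of gensim (not mentioned in the post), a hypothetical vectors.bin output file, and lowercased training text:

    from gensim.models import KeyedVectors

    # Load the binary vectors produced by the C word2vec tool (file name is hypothetical).
    vectors = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)

    # Nearest neighbors by cosine similarity.
    print(vectors.most_similar('noisy', topn=10))

    # The "party trick": paris - france + italy should land close to rome
    # (assuming the review text was lowercased before training).
    print(vectors.most_similar(positive=['paris', 'italy'], negative=['france'], topn=1))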

Here are some example words, and their nearest neighbors:

great
    terrific        0.932
    fantastic       0.908
    fabulous        0.876
    wonderful       0.861
    brilliant       0.852
    fab             0.849
    superb          0.818
    good            0.787
    lovely          0.782
    phenomenal      0.773
noisy
    noisey          0.934
    loud            0.880
    nosiy           0.748
    noicy           0.729
    distracting     0.726
    busy            0.711
    crowded         0.705
    annoying        0.696
    disruptive      0.693
    rowdy           0.685
rude
    unfriendly      0.908
    unhelpful       0.898
    arrogant        0.882
    impolite        0.876
    unprofessional  0.872
    surly           0.864
    dismissive      0.849
    abrupt          0.843
    ignorant        0.840
    hostile         0.826
bed
    beds            0.814
    mattress        0.780
    sofabed         0.717
    couch           0.709
    matress         0.708
    duvet           0.690
    bedroom         0.683
    futon           0.653
    beds            0.652
    room            0.650

The numbers in the table are the cosine similarity of the top word with the near neighbors. We can see that word2vec is doing a nice job of mapping related words to be near each other.

For our short phrases, we just added the individual word vectors to create a phrase vector, ignoring stop words. So now we have a numerical vector of features for each phrase, and we can use any clustering technique.

With word2vec, people usually use cosine similarity, where a larger value is better. However, in clustering we usually work with a distance, where smaller is better. If we normalize our word2vec vectors so they have unit norm, then we can just minimize the Euclidean distance, as this is equivalent to maximizing the cosine similarity (see the Properties section here for the simple derivation). So going forward, we will use the Euclidean distance between the two normalized word2vec phrase vectors as our distance metric.
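Here is a small sketch of the phrase-vector construction and the distance metric, again assuming gensim, a hypothetical vectors.bin file, and a hypothetical stop word list:

    import numpy as np
    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)  # hypothetical file
    STOP_WORDS = {'a', 'an', 'and', 'of', 'the'}                             # hypothetical list

    def phrase_vector(phrase):
        """Sum the word vectors of a phrase's non-stop words, then normalize to unit length."""
        words = [w for w in phrase.split('_') if w not in STOP_WORDS and w in vectors]
        v = np.sum([vectors[w] for w in words], axis=0)
        return v / np.linalg.norm(v)

    # For unit vectors u and v, ||u - v||^2 = 2 - 2*cos(u, v), so minimizing
    # Euclidean distance is the same as maximizing cosine similarity.
    u = phrase_vector('great_view_of_central_park')
    v = phrase_vector('central_park_views')
    print(np.linalg.norm(u - v))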

Clustering phrases into tight topics

Usually, in a clustering problem we are interested in all of the clusters. Here, we have a slightly different situation. We want to group similar keywords into very focused groups, like our “view of central park” phrases, above. A small fraction of the clusters, say 100 or so, will be the basis for our collections. The rest of the clusters will be focused on more boring topics like air conditioning, and we’ll just ignore those.

What clustering technique will do a nice job of giving us those tight clusters? The default k-means approach with a fairly small k probably isn’t going to work too well. It will produce broad groups, and concentrate on getting every phrase into a good cluster, even though we don’t really care about most of the phrases.

I chose the agglomerative clustering routine in the python sklearn library to do the clustering. The “complete linkage” option will try the hardest to keep each cluster pure.

Agglomerative clustering starts with each phrase in its own cluster, so we have N clusters of size 1. It then goes through and tries to combine each pair of current clusters into a new combined one. With complete linkage, the score of the combined cluster is the maximum of the distances between every pair of phrases in the combined cluster. The two clusters with the smallest maximum score after combination are actually combined, leaving us with N-1 clusters. This process is repeated until we have 2 clusters left.

The overall combination process forms a tree of nodes. At internal nodes, a cluster is composed of the phrases that are leaves above the node in the tree. If we pick a threshold score, we can prune off subtrees where every cluster has a score better than the threshold. We only need to make the agglomerative tree once, and then we can quickly experiment with different thresholds.
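As a sketch of that build-once, cut-many-times idea: the post uses sklearn’s agglomerative clustering, but scipy’s hierarchy module (an assumed substitute here) makes the one-time tree construction and the repeated threshold cuts explicit. Stand-in random vectors replace the real phrase vectors:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # phrase_matrix: one unit-normalized phrase vector per row (stand-in data here).
    rng = np.random.default_rng(0)
    phrase_matrix = rng.normal(size=(1000, 100))
    phrase_matrix /= np.linalg.norm(phrase_matrix, axis=1, keepdims=True)

    # Build the full complete-linkage merge tree once...
    tree = linkage(phrase_matrix, method='complete', metric='euclidean')

    # ...then experiment cheaply with different distance thresholds.
    for threshold in (0.4, 0.6, 0.8):
        labels = fcluster(tree, t=threshold, criterion='distance')
        print(threshold, labels.max(), 'clusters')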

Let’s look at an example for our New York City hotel data. I chose 18 phrases, such that they formed some natural clusters. There are 4 keywords about the view, 6 about being comfortable, 2 about the wi-fi, 1 about breakfast, 3 about friendly staff, and 2 about feeling safe. With luck, the phrases should be clustered into these natural clusters at some point. Here is the clustering, with the natural clusters shown as different colors.

[Image: dendrogram of the 18-phrase clustering example, with the natural clusters shown in different colors]

The numbers shown at the internal nodes give the order of the merges, where smaller numbers were merged first and hence were more similar. The leaf nodes were given numbers 0 to 17, so internal node 18 was the first combined cluster, joining comfortable_beds_and_pillows and comfy_bed_and_pillows. The next node, 19, combined friendly_accommodating_staff and friendly_assistance, etc.

In general, the clustering did an excellent job on the test case. A phrase like sooooo_comfy is the least like the other phrases talking about comfort, so it goes in last among the red group.

With the right threshold, we can get natural clusters by themselves. In practice, we can get many useful clusters to use for collections.

We can also go back and find the hotels that frequently mention the phrases in a cluster. These hotels would be the members of the collection formed by a given cluster.
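A minimal sketch of that matching step, assuming simple substring counting over a hypothetical reviews_by_hotel mapping (the post doesn’t describe the exact scoring):

    from collections import Counter

    def top_hotels_for_cluster(cluster_phrases, reviews_by_hotel, top_n=10):
        """Rank hotels by how often any phrase in the cluster appears in their reviews."""
        counts = Counter()
        for hotel, text in reviews_by_hotel.items():
            counts[hotel] = sum(text.count(phrase.replace('_', ' ')) for phrase in cluster_phrases)
        return [hotel for hotel, _ in counts.most_common(top_n)]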

The human curation post-process

Up to this point we have had a totally automated process that uses phrase detection, word2vec features, and agglomerative clustering to find clusters of phrases. The remaining step is to find the interesting ones. At this point we rely on human curation to pick them out and give them a snappy name.

The manual process is fairly easy. We provide a list of phrases for a cluster, the hotels that are most related to the cluster, and even some example sentences from the review text for those hotels. It is fairly easy to look at that information and determine whether it would be a good way to explore the hotels of a city. It is hard to mathematically define what is interesting, but easy for a human to know it when they see it. The human also comes up with a clever name, which is simple given the list of phrases.

Some interesting collections

The “Catch a Show” collection has phrases like this:

    at_radio_city_music_hall
    b'way_shows
    beacon_theater
    beacon_theatre
    broadway_dance_center
    broadway_play
    broadway_plays
    broadway_shows
    broadway_shows_and_great_restaurants
    broadway_shows_and_restaurants
    comedy_shows
    david_letterman_show
    easy_walk_to_broadway_shows
    evening_entertainment
    great_shows
    radio_city_hall
    radio_city_music
    radio_city_music_hall
    radio_city_music_hall_and
    theater_shows
    theatre_shows
    walking_distance_to_broadway_shows
    walking_distance_to_broadway_theaters
    walking_distance_to_shows
    walking_distance_to_theatre

My personal favorite when I’m in New York, the “Near The High Line” collection has:

    chelsea_market_and_high_line
    chelsea_market_and_the_highline
    high_line
    high_line_park
    highline_park
    highline_walk
    highline_walkway
    the_high_line_park

The whole process provides insight into a particular city, picking out interesting neighborhoods, features of the hotels, and nearby attractions: all the things someone staying in the hotel might be interested in. Different cities will result in different collections.

Once the user has some ideas, they can focus in on a few hotels, and see if they would be a great place to stay. It wouldn’t be possible without the insights that Natural Language Processing provides.