The newly redesigned TripAdvisor.com emphasizes traveler photos throughout the site, but not all of these photos are useful in every situation. Deep Learning networks provide an excellent opportunity for us to improve our users’ experience by highlighting the most attractive and useful photos for varying presentation contexts. This post will discuss our approach for gathering training data, developing a model, and scaling it up to over 110 million photos and 7 million places of interest.
Opportunity: Photo-centric redesign, lots of photos
This year, we redesigned TripAdvisor.com, and I think it looks great. It’s clean, modern, and focused on helping travelers find and book a great vacation. Part of what makes the redesigned site look nice is its emphasis on photos, both professional and amateur. Every detail page, every search result, every exploration shelf is filled with photos.
This redesign highlights an opportunity we’ve known for quite a while but hadn’t fully addressed: We have a lot of photos (over 110 million), but we could improve upon which photos we show and in what order for different contexts.
For example, sometimes we show great photos for restaurants:
Other times, our photo selection doesn’t make the most useful first impression:
Also, wouldn’t it be nice if, when we recommended hotels with a pool, we actually showed you photos of the pools, rather than photos of bedrooms and lobbies?
“Hotels with pools” (Where are the pools?)
We could ask property owners to rate photos, select the main photo for their listing, and tag photos by their scene type. We could hire an army of photo moderators to tag, rank, and select photos, but that would be slow and expensive.
Instead, we’ve found that Deep Learning networks, trained on fast GPU hardware, are surprisingly good at improving our photo selections.
Goal: Show attractive, useful photos
- Good hero photos – For each property, select as its main (or “hero”) photo the best photo we have.
- Relevant amenity photos – When we have a shelf of recommendations regarding a particular amenity (pool, beach, etc.), show photos of that amenity, not some generic thumbnail.
- Good default sort order – For each property, present its photos in an order that’s attractive and useful.
Approach: 15 curators, a mini-fridge of GPUs, and some great open-source software
Our approach follows a fairly standard machine learning process: gather training data; choose, train, and evaluate a model; and deploy it.
Gather training data
We had 15 human curators blast through hundreds of thousands of photos. They focused first on performing pairwise ranking of photos: given two photos from a property, choose the one that would most motivate you to learn more about the property. They also labeled photos as having a human in them or not, as photos focused on people tend to be less useful than photos focused on the property.
Our infrastructure for collecting data was dirt simple:
- We used the python library Pandas to assemble the photos into HTML pages.
- We used a CherryPy python http server to collect submitted CSVs.
For training data on hotel scene type (pool, beach, room, etc.), we scraped tagged photos from ImageNet.
Choose a model
Many Deep Learning tutorials walk you through building a convolutional neural network (CNN) from the ground up, starting from a randomly-initialized model and training it on your images. Depending on your situation, that might not be the best starting point. For us (and for many others), it was better to start with a model that was trained on many more images than we have. All of our models, for example, are built on top of the ResNet50 model. ResNet50 (He et al., 2015) is a 50-layer CNN trained for image classification on millions of labeled ImageNet photos. The figure to the right illustrates the similar ResNet-34 network. If you lop off the top layers that focus on ImageNet classification, you’re left with lower layers that provide an abstract representation that is useful for a variety of problems. We use that lower network as a feature extractor: input a rescaled image, output a 2,048-value vector of “bottleneck features” that becomes the input for our model. Because this lower network has already learned the basic features of images from its extensive ImageNet training, our model doesn’t have to, which drastically reduces the amount of training data we need.
We experimented with a number of approaches for selecting a good hero photo. We ultimately chose learning pairwise rankings using a Siamese Network, an architecture that learns on pairs of photos, one ranked above the other, and can then produce a single score for each photo that best preserves these rankings.
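As a rough illustration of the pairwise idea, not the production model, here is a minimal sketch in plain Python. It makes two assumptions that go beyond the description above: the loss is RankNet-style (the probability that the preferred photo wins is the sigmoid of the score difference), and a shared linear scorer stands in for the real network. The function names, toy features, and learning rate are all hypothetical.

```python
import math
import random

def score(weights, features):
    """Shared scorer: the same weights score both photos in a pair."""
    return sum(w * x for w, x in zip(weights, features))

def train_pairwise(pairs, n_features, epochs=200, lr=0.1, seed=0):
    """Fit scorer weights from (winner_features, loser_features) pairs."""
    rng = random.Random(seed)
    weights = [rng.uniform(-0.01, 0.01) for _ in range(n_features)]
    for _ in range(epochs):
        for winner, loser in pairs:
            diff = score(weights, winner) - score(weights, loser)
            p_win = 1.0 / (1.0 + math.exp(-diff))  # assumed RankNet-style probability
            # Gradient step on -log(p_win): push the winner's score above the loser's.
            step = lr * (1.0 - p_win)
            weights = [w + step * (a - b) for w, a, b in zip(weights, winner, loser)]
    return weights

# Toy pairs (hypothetical): in each pair the first photo was preferred by a curator.
pairs = [
    ([0.9, 0.2], [0.1, 0.3]),
    ([0.8, 0.7], [0.3, 0.6]),
    ([0.7, 0.1], [0.2, 0.9]),
]
weights = train_pairwise(pairs, n_features=2)
```

After training, `score(weights, features)` gives a single number per photo, so any set of photos for a property can be ranked without comparing them pair by pair.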
We used standard feedforward networks for classifying photos as having humans in them (or not), as well as classifying by scene type.
In all cases, we used two hidden dense layers, interleaved with dropout layers. In Keras syntax:
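    from keras.layers import Dense, Dropout
    from keras.models import Model, Sequential

    def create_mlp(input_size=2048, output_size=1,
                   hidden_layer_sizes=(2048, 2048),
                   dropout_rates=(0.5, 0.5)) -> Model:
        model = Sequential()
        # The first hidden layer also declares the input shape (the bottleneck features).
        model.add(Dense(hidden_layer_sizes[0], activation='relu', input_shape=(input_size,)))
        model.add(Dropout(dropout_rates[0]))
        # Remaining hidden layers, each followed by its own dropout layer.
        for (h, d) in zip(hidden_layer_sizes[1:], dropout_rates[1:]):
            model.add(Dense(h, activation='relu'))
            model.add(Dropout(d))
        # Sigmoid output for the binary case (e.g. human / no human).
        model.add(Dense(output_size, activation='sigmoid'))
        model.compile(optimizer="adadelta", loss="binary_crossentropy",
                      metrics=["binary_crossentropy", "accuracy"])
        return model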
Layer sizes and dropout rates were determined via random search and cross validation.
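A random search with cross validation could look roughly like the sketch below. The candidate layer sizes, the dropout range, the fold count, and the `evaluate` callback are all assumptions made for illustration; the callback is expected to train a model on the training indices and return a validation loss.

```python
import random

def sample_config(rng, n_layers=2):
    """Draw one hypothetical hyperparameter configuration."""
    return {
        "hidden_layer_sizes": tuple(rng.choice([256, 512, 1024, 2048]) for _ in range(n_layers)),
        "dropout_rates": tuple(round(rng.uniform(0.1, 0.7), 2) for _ in range(n_layers)),
    }

def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k roughly equal folds."""
    indices = list(range(n_samples))
    for fold in range(k):
        val = indices[fold::k]
        val_set = set(val)
        train = [i for i in indices if i not in val_set]
        yield train, val

def random_search(evaluate, n_samples, n_trials=100, k=5, seed=0):
    """Return (best_config, best_mean_loss) over n_trials random configurations.

    `evaluate(config, train_idx, val_idx)` is supplied by the caller and
    returns a validation loss for one fold.
    """
    rng = random.Random(seed)
    best_config, best_loss = None, float("inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        losses = [evaluate(config, tr, va) for tr, va in k_fold_indices(n_samples, k)]
        mean_loss = sum(losses) / len(losses)
        if mean_loss < best_loss:
            best_config, best_loss = config, mean_loss
    return best_config, best_loss
```

Each sampled configuration can be passed straight to a model builder that accepts `hidden_layer_sizes` and `dropout_rates` keyword arguments.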
Training it up
We have a mini-fridge-sized development machine with two consumer-grade GPUs. Keras, TensorFlow, and Pandas make it really fast and easy to train and cross-validate. Even with this modest hardware, our models are fast enough to finish training while you wait and to do a 100-trial random hyperparameter search overnight.
Evaluate and test
We’re still in the process of A/B testing our results, but some of the early numbers show that users who viewed restaurant hero photos selected by machine vision were more likely to click than those who saw the original hero photos.
Our production infrastructure uses Spark on YARN as the data layer and a Kubernetes pod filled with NVIDIA GeForce GTX 1080 Ti GPUs for computing the model output. These high-end cards can handle our volume with capacity to spare.
Results: Pretty food, fewer bathrooms
We’re very excited about the resulting photo selections. Here are some examples.
Better hero photos
We saw a lot of improvement on restaurant hero photos when we selected them using the pairwise-ranking model. Here are some examples.
Similarly for hotels, which photo do you think makes a better first impression?
Relevant amenity photos
We want to populate most of our amenity-specific placements with photos of that amenity, rather than a general-purpose hero photo. For example, on our two most popular shelves, “hotels with a pool” and “hotels on the beach”, we’d like to show photos of the pool and the beach, rather than photos of the lobby and the rooms.
Here, we select the photos with the highest pool probability and highest attractiveness score, deferring to the hero photo if it already features a pool.
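One way to express that selection rule is sketched below. How the two scores are combined is not spelled out above, so this sketch assumes a simple scheme: keep photos whose amenity probability clears a threshold, then take the most attractive one. The field names and the 0.8 threshold are hypothetical.

```python
def pick_amenity_photo(photos, hero_id, amenity="pool", threshold=0.8):
    """Return the id of the photo to show on an amenity shelf.

    photos: list of dicts with 'id', 'attractiveness', and 'scene_probs'
    (a mapping from scene type to classifier probability).
    """
    by_id = {p["id"]: p for p in photos}
    hero = by_id[hero_id]
    # Defer to the hero photo if it already shows the amenity.
    if hero["scene_probs"].get(amenity, 0.0) >= threshold:
        return hero_id
    candidates = [p for p in photos if p["scene_probs"].get(amenity, 0.0) >= threshold]
    if not candidates:
        return hero_id  # nothing confidently shows the amenity, so fall back
    return max(candidates, key=lambda p: p["attractiveness"])["id"]

photos = [
    {"id": "lobby", "attractiveness": 0.9, "scene_probs": {"pool": 0.05}},
    {"id": "pool_a", "attractiveness": 0.6, "scene_probs": {"pool": 0.95}},
    {"id": "pool_b", "attractiveness": 0.8, "scene_probs": {"pool": 0.90}},
]
print(pick_amenity_photo(photos, hero_id="lobby"))  # prints "pool_b"
```

The same function covers the beach shelf by passing `amenity="beach"`.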
And likewise for beach:
Better Default Sort Order
Users love to browse through photos. On mobile, users will look at dozens of photos in one session. We’d like to show them the most useful and attractive photos first.
Here are some examples where we improved the sort order by moving photos with helpful votes and a high pairwise-attractiveness ranking toward the front, and photos with a low ranking or with humans in them toward the back.
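A sort like this can be written as a single sort key. The sketch below assumes a strict priority order (photos without humans first, then more helpful votes, then higher attractiveness), which is only one of several plausible ways to combine these signals. The field names and the 0.5 human-probability cutoff are hypothetical.

```python
def sort_photos(photos, human_threshold=0.5):
    """Order photos for display under an assumed lexicographic priority."""
    def key(photo):
        has_human = photo["human_prob"] >= human_threshold
        # Tuples sort ascending: non-human photos first, then more helpful
        # votes, then higher attractiveness.
        return (has_human, -photo["helpful_votes"], -photo["attractiveness"])
    return sorted(photos, key=key)

photos = [
    {"id": "selfie", "human_prob": 0.9, "helpful_votes": 5, "attractiveness": 0.9},
    {"id": "dish", "human_prob": 0.1, "helpful_votes": 2, "attractiveness": 0.7},
    {"id": "patio", "human_prob": 0.0, "helpful_votes": 2, "attractiveness": 0.8},
    {"id": "menu", "human_prob": 0.2, "helpful_votes": 0, "attractiveness": 0.3},
]
print([p["id"] for p in sort_photos(photos)])
# prints ['patio', 'dish', 'menu', 'selfie']
```

A weighted sum of the three signals would be a natural alternative if a strict priority turns out to be too rigid.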
Travel planning is such a visual activity. The photos you see on TripAdvisor can make the difference between finding the perfect hotel and skipping over that hotel just because its first photos didn’t make a good first impression. This project proved to us that state-of-the-art machine vision based on convolutional deep neural networks is invaluable in automatically selecting the best photo for different display contexts. Since at least 2012, these algorithms have dominated object and scene classification problems, but we find that, even for seemingly subjective problems like “attractiveness,” these approaches perform very well.
So as you use TripAdvisor.com, keep a look out for gradually improving photos!
It is an understatement to say this work stood on the shoulders of giants. Outside of TripAdvisor, this work benefited tremendously from tools produced and open-sourced by Google (Keras, TensorFlow) and NVIDIA (CUDA), as well as models published by Microsoft and original research done at Google, Microsoft, University of Toronto, NYU, and Facebook. Inside Trip, Jeff Palmucci and Aaron Gonzales did much of the early foundational work that got our machine vision projects off the ground. Lastly, Anyi Wang has been a partner in the trenches, exploring new ways to improve hotel photos and incorporate click-through data.
Greg Amis is a Principal Software Engineer in the Machine Learning team at TripAdvisor in Needham, MA. He’s been at Trip for just over 3 years, working on machine vision, text processing (e.g., catching inappropriate content), and metadata processing (e.g., catching fraudulent reviews). Prior to TripAdvisor, he worked on military and other government contracts, doing everything from adaptive radar jamming to forecasting Navy personnel needs. Greg has a PhD from Boston University in Cognitive and Neural Systems, studying a type of neural network called Adaptive Resonance Theory and its application to semi-supervised learning and remote sensing.