Image Quality Assessment and why TripAdvisor cares about it

Imri Sofer posted April 30, 2019

TripAdvisor has embarked on a mission: to make every image on our website amazing. This is not an easy task. We have 1 million points of interest (POIs) on our website, from the Grand Canyon to Hoover Dam, and more than 200,000 bookable experiences, from speed-boat rentals in Miami to goat yoga in Snohomish, WA (yep, we have it!). As you can imagine, sifting through the tens of millions of images our users have uploaded to find the best picture for each POI and experience is a daunting undertaking.

In order to improve the images on our site, we have been working for the last couple of months on several computer vision models that can evaluate the technical quality and aesthetics of our images. While for many people the terms quality and aesthetics are synonymous, in computer vision it is common to differentiate between the two. Image quality assessment (IQA) is the problem of assessing the technical quality of the image – its blurriness, noise, compression artifacts, etc. Aesthetic assessment, on the other hand, is the problem of measuring the attractiveness or beauty of the image. Often the two concepts go hand in hand, since in most cases we won't label an image as aesthetically pleasing if it has JPEG artifacts, but many times they differ. For instance, few people will find a high quality image of a bloody wound aesthetically pleasing.

Why do we need an IQA model?

In many cases an aesthetics model can provide most of what we need (I just want to see the most beautiful image!!), but we found that there are situations that require a dedicated IQA model. One such use case is assessing the images that suppliers upload to our website: when a supplier uploads a new image of a bookable experience, we would like to analyze the image in real time to see if its quality meets our standards, and classify it as acceptable or unacceptable. If it is unacceptable, we can immediately notify the supplier that they should change the image.

There are several reasons that an IQA model would be more appropriate in this case:

  1. We see aesthetics as a ranking problem, and IQA as a classification problem. We have already built several aesthetics ranking models at TripAdvisor. We prefer ranking models for that task, since for the most part aesthetics is relative and task-dependent. Furthermore, it's much easier for people to compare the attractiveness of two images side-by-side than to rate the level of aesthetics of a single image. But while ranking models are good for comparing two images, they are not that useful when one wants to classify images. In our case we want to classify our images as acceptable vs. unacceptable, which makes ranking inappropriate for the task.
  2. Not all supplier images should be aesthetically pleasing, but all should have high quality. For instance, if a supplier is selling rides from LaGuardia airport to a nearby hotel, a simple clean image of a car will be very practical and aligned with users’ expectations. It would be nice if the car in the photo were on a beach with a sunset in the background, but we should not flag the image as unacceptable if it’s not.
  3. The size of an image needs to be considered: aesthetics is usually size invariant – if an image is beautiful as a thumbnail, it will be beautiful as a full-size picture. Therefore we can train and serve our aesthetics models on thumbnail versions of our images, which helps us scale our service. But quality is affected by the size of the image – a full-size image may have artifacts that are not visible in its thumbnail version. So we need a model that can run on images at their original size.

We therefore decided that we need to have our own Image Quality Assessment model.

IQA – a short history

Before the deep learning revolution, IQA models (like all computer vision models) used hand-crafted features that were meant to capture different aspects of the image. But as the last couple of years have taught us, it’s difficult to hand-craft vision features, yet very easy to let a computer learn them. In one of the first deep learning papers on IQA, Bianco et al. [1] built a CNN model named DeepBIQ. The model is based on a fine-tuned VGG16: it takes 30 random crops from the image, passes them through the network, and then averages the results. The performance of the model improved by 28% over the state-of-the-art models at that time.
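To make the crop-and-average idea concrete, here is a minimal PyTorch sketch of that style of inference. It is not DeepBIQ itself: the 224x224 crop size and the single-output regression head are assumptions for illustration, and the actual fine-tuning procedure is described in Bianco et al. [1].

```python
# A minimal sketch: score an image by averaging the predictions of a
# VGG16-based regressor over several random crops.
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

backbone = models.vgg16(pretrained=True)
backbone.classifier[-1] = nn.Linear(4096, 1)  # replace the 1000-class head with a single quality score
backbone.eval()

crop = transforms.Compose([
    transforms.RandomCrop(224),  # assumed crop size; the image must be at least 224x224
    transforms.ToTensor(),
])

def predict_quality(image: Image.Image, n_crops: int = 30) -> float:
    """Average the predicted quality score over n random crops of the image."""
    with torch.no_grad():
        crops = torch.stack([crop(image) for _ in range(n_crops)])
        return backbone(crops).squeeze(1).mean().item()
```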

DeepBIQ was originally trained on small datasets, since there were no large IQA datasets when it first came out, but last year a new dataset was published – the KonIQ dataset. This dataset is currently the best source of image quality ratings (see examples below). It has 10,000 high-resolution images with 1.2MM ratings on a scale of 1 to 5, and it is almost 10 times bigger than previous datasets of non-synthetic images for quality assessment.

[1] On the Use of Deep Learning for Blind Image Quality Assessment – Bianco et al. 2016

The ratings are based on the technical quality of the images, not their aesthetics, and the dataset creators validated the correctness of the ratings by comparing the crowdsourced ratings to expert opinions. Interestingly, while overall there is strong agreement between the crowd and the experts, there is one type of image where they differ in their opinion: images with shallow depth of field (like the one below). These images were considered to have low quality by the crowd, since they are blurred, but the experts saw the blurriness as an artistic effect, which does not reduce the quality of the image.

Our model

When we started our project we first tried to rebuild DeepBIQ. It was pretty difficult to reproduce their results, so we decided that instead of recreating the exact same architecture, we’d build our own. We switched to fastai, and built a model based on ResNet34, with two fully connected layers of 1024 units (including batch normalization and dropout). The network produced a single number, which was compared to the ground-truth rating using an L1 loss. The fastai ResNet also comes with an adaptive spatial pooling mechanism, which allows the model to run on an image of any size without resizing it. This added flexibility means that we don’t need to crop patches from the images to feed them into the model like in DeepBIQ – an idea that was used in a recent IQA model by Varga et al. [2] To speed up model training we trained the model on a downsampled version of the images; after the model converged we slightly increased the size of the images, and fine-tuned the model. We repeated this process of size increase and fine-tuning until we reached the original size.

[2] DeepRN: A content preserving deep architecture for blind image quality assessment – Varga et al. 2018
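For readers who want to picture the architecture, here is a plain-PyTorch sketch of a model along these lines. We built ours with fastai, so this is only an approximation: the layer widths follow the text above, while the dropout rates and exact head layout are assumptions for illustration.

```python
# A sketch of a ResNet34 regressor with an adaptive-pooling head, so it can
# accept images of any size.
import torch
import torch.nn as nn
from torchvision import models

class IQAModel(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = models.resnet34(pretrained=True)
        self.body = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool and fc
        self.pool = nn.AdaptiveAvgPool2d(1)  # adaptive pooling -> any input resolution
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.BatchNorm1d(512), nn.Dropout(0.25), nn.Linear(512, 1024), nn.ReLU(),
            nn.BatchNorm1d(1024), nn.Dropout(0.5), nn.Linear(1024, 1024), nn.ReLU(),
            nn.BatchNorm1d(1024), nn.Linear(1024, 1),  # single quality score
        )

    def forward(self, x):
        return self.head(self.pool(self.body(x)))

model = IQAModel().eval()
loss_fn = nn.L1Loss()  # regress the predicted score against the ground-truth rating

with torch.no_grad():
    # The adaptive pooling means arbitrary resolutions work, e.g. a 384x512 batch:
    print(model(torch.randn(2, 3, 384, 512)).shape)  # torch.Size([2, 1])
```

Progressive resizing then amounts to repeating the same training loop with dataloaders that serve progressively larger versions of the KonIQ images, fine-tuning the same weights at each step.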

Revising the metrics

When we started to evaluate the results we quickly realized that we had a metric issue. One major question we ask in every one of our data science projects is: how is this model going to be evaluated? In the IQA literature it is common to evaluate a model’s performance using the correlation coefficient between the ground truth and the model predictions. When we started the project we initially thought of using the same metric, but we then realized that it is not what we were looking for. The goal of IQA projects in academia is to match human quality perception. This means knowing if an image should be rated as 4.7 or 4.8 stars. However, for our use case we don’t care much about these fine differences – we only care whether the image is acceptable or not, since if it is unacceptable we need to take action (such as contacting the supplier, or removing the image from the site). We therefore decided to evaluate the model by its ability to distinguish between acceptable and unacceptable images. We found empirically that a threshold of 3.2 stars is a fairly decent way to divide the images into acceptable and unacceptable, which allowed us to switch to classification metrics for model evaluation.
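Concretely, the evaluation boils down to thresholding both the ground-truth ratings and the model’s predictions at 3.2 stars and then computing ordinary classification metrics. A minimal sketch (the arrays here are placeholders, not real data):

```python
# Threshold continuous quality scores and evaluate with classification metrics.
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

THRESHOLD = 3.2  # empirically chosen split between acceptable and unacceptable

ratings = np.array([4.1, 2.8, 3.5, 1.9])      # ground-truth mean ratings (placeholder values)
predictions = np.array([3.9, 3.0, 3.1, 2.2])  # model outputs (placeholder values)

y_true = ratings >= THRESHOLD
y_pred = predictions >= THRESHOLD

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
```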

Once we settled on evaluation metrics, we decided to re-examine the loss function we used in the model. Previous models were concerned with matching human perception, so they viewed giving 4 stars to a 3.5-star image as equally bad as giving it 3 stars – in both cases the difference is 0.5 stars. But for our use case there is a big difference between the two: in the first case the image, which has acceptable quality, will be classified as acceptable, but in the other case it would not. We therefore decided to make sure the model would emulate more closely what our metrics were trying to achieve.

We then created two additional models:

  • In the first one we transformed the ground-truth scores to binary labels (acceptable and unacceptable images), and used a cross-entropy loss.
  • In the second model, we transformed the ratings using a sigmoid function centered on the 3.2 rating (see the sketch after this list). The goal was to still use all the information that the ratings carry, but to map them into a perceptual space that is more aligned with our goal – it shrinks the difference between 4 and 5 stars relative to the difference between 3.2 and 4.2 stars.
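Here is a small sketch of both label transforms. The steepness parameter k of the sigmoid is an assumption for illustration; the text above only fixes the 3.2-star center.

```python
# Two ways to turn 1-5 star ratings into training targets.
import numpy as np

THRESHOLD = 3.2  # empirically chosen acceptability threshold

def binary_label(rating):
    """First model: hard acceptable/unacceptable label for a cross-entropy loss."""
    return float(rating >= THRESHOLD)

def sigmoid_label(rating, k=2.0):
    """Second model: squash ratings through a sigmoid centered at the threshold.
    k (steepness) is an assumed value; it controls how strongly differences far
    from the threshold are compressed."""
    return 1.0 / (1.0 + np.exp(-k * (rating - THRESHOLD)))

for r in (3.2, 4.2, 4.0, 5.0):
    print(r, binary_label(r), round(sigmoid_label(r), 3))
# The 4 -> 5 gap (~0.97 - 0.83 = 0.14) becomes much smaller than the
# 3.2 -> 4.2 gap (~0.88 - 0.50 = 0.38), matching the goal described above.
```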

As you can see from our results, simply binarizing the ratings degrades the model performance, probably because the fine-grained information in the ratings gets lost. On the other hand, the third model, which uses the sigmoid-transformed ratings and is aligned with our perception, significantly improved our results.

Results

After training the model on the KonIQ dataset, we evaluated it on our own images. The model did extremely well and was able to flag our low-quality images, as you can see from the example below:

Low quality examples:

And as expected we had some false alarms of high quality images with shallow depth of field:

Our model is currently used as part of our new pipeline that looks for the best cover images for POIs. It filters out images that may look great to the aesthetics model, but carry visible noise and artifacts that the aesthetics model misses. Later this year we are planning to integrate it with our supply platform to notify our suppliers when they inadvertently upload a low quality image.

Author’s Biography

Imri Sofer is a Data Science manager on TripAdvisor’s Rentals and Experiences team. His team solves problems which require machine learning at scale, including recommendation, ranking, computer vision, and NLP. He graduated in 2014 from Brown University with a PhD in cognitive science (aka other quantitative field), where he used machine learning and Bayesian statistics to understand how the human brain categorizes objects. After building personalized ranking models and recommender systems for Zulily and Zillow, he joined TripAdvisor at the end of 2017. In his spare time he likes to play board games, and update his short bio.