This is Part 1 of a series on how TripAdvisor uses event-driven architecture and read-optimized data stores to orchestrate complex work across a large fleet of microservices.
TripAdvisor runs a constellation of microservices, several of which operate as the authoritative source of truth for one small part of the business's overall data model. Each service marshals access and update requests for the data it manages through its public API.
The Join Problem
With CRUD (create, read, update, delete) responsibilities siloed behind different services, it can be difficult to efficiently satisfy a query that relies on data partitioned across those services.
For a hypothetical example, imagine you want to build a list of nearby restaurants, open on Mondays, which serve pasta. Geolocation-based filtering (1), restaurant dish tag data (2), and opening hours (3) might all exist behind three different services, with no efficient way of joining data across each of them (this is sometimes called the Join Problem).
A simple request-chaining-based approach might involve executing the most restrictive predicate first (get all nearby restaurants), then taking that list and passing it into the second-most restrictive predicate evaluation (serves pasta), then taking the output of that service and running the third predicate (opening hours) with yet another service call.
Now, this is absolutely a reasonable first MVP. It works in most contexts, and it's "fast enough" at low data volumes, but it does involve a LOT of data being passed around the network on each query, and it eventually hits a point where the naive approach can't scale. Suppose that instead of restaurants near me, I needed to do this for all restaurants in Canada – I'd be passing around info on tens of thousands of entities.
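The request-chaining approach can be sketched as follows. The `fetch_*` and `filter_*` functions are hypothetical stand-ins for remote calls to the three separate services; in a real deployment each call would carry the full candidate list over the network.

```python
# Sketch of request chaining across three hypothetical services, ordered
# from most to least restrictive predicate. Each function stands in for a
# remote call; the returned IDs are illustrative sample data.

def fetch_nearby_restaurant_ids(lat, lon):
    """Service (1): geolocation-based filtering (most restrictive first)."""
    return {101, 102, 103, 104}

def filter_serving_pasta(restaurant_ids):
    """Service (2): keep only restaurants whose dish tags include 'pasta'."""
    pasta_ids = {102, 103, 200}
    return restaurant_ids & pasta_ids

def filter_open_on(restaurant_ids, day):
    """Service (3): keep only restaurants open on the given day."""
    open_monday = {103, 104}
    return restaurant_ids & open_monday if day == "monday" else set()

def nearby_pasta_open_monday(lat, lon):
    # Each step below is a network hop carrying the whole candidate list.
    candidates = fetch_nearby_restaurant_ids(lat, lon)   # service (1)
    candidates = filter_serving_pasta(candidates)        # service (2)
    return filter_open_on(candidates, "monday")          # service (3)
```

With a handful of candidates this is cheap; with tens of thousands, every hop ships the full intermediate list across the wire.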
When request chaining starts failing, one solution is to build eventually-consistent, read-optimized stores. Instead of joining across services, we build a query-only service that has direct access to all the data it needs to satisfy a query efficiently, without any further service hops (either by joining across tables in a traditional RDBMS, or by using a denormalized data store like Elasticsearch).
We give up some of the benefits of strict per-service siloing of data, namely:
- single source of truth for our data
- operational simplicity
- minimizing duplication of data representations in API contracts and interfaces
- (some of) the ease of evolving data representations
And in exchange, we gain a service that can answer some types of queries really, really fast 🏎! It does this by maintaining a local copy of just the subsets of information owned by the earlier services (1), (2), and (3) that it cares about.
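The payoff looks something like the sketch below: a query-only service holding a denormalized local copy of just the fields it needs from services (1), (2), and (3), so all three predicates are evaluated in one place. The record shape and distance check are illustrative, not TripAdvisor's actual schema.

```python
# A denormalized local store: one pre-joined row per restaurant, combining
# the geo, dish-tag, and opening-hours subsets the query service cares about.
restaurants = [
    {"id": 101, "geo": (45.50, -73.57), "tags": {"sushi"}, "open_days": {"tue"}},
    {"id": 103, "geo": (45.51, -73.56), "tags": {"pasta"}, "open_days": {"mon", "tue"}},
]

def query(near, max_dist, tag, day):
    # All three predicates evaluated locally; zero service hops.
    # (Crude Manhattan-distance check, purely for illustration.)
    return [
        r["id"] for r in restaurants
        if abs(r["geo"][0] - near[0]) + abs(r["geo"][1] - near[1]) <= max_dist
        and tag in r["tags"]
        and day in r["open_days"]
    ]
```

The join work is paid once, at ingest time, instead of on every query.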
We have a number of such services in use at TripAdvisor, which we’ll dive into in the following posts in this series.
A great way to keep these read-optimized data stores up to date is by driving updates asynchronously through an event bus.
Eventing lets us build decoupled architectures where the production of a fact/event (e.g. “user @steve has posted a photo”) is completely independent from any knowledge regarding who might care about, and consume, such an event. This style of information broadcast is much easier to scale and maintain compared to direct remote service invocations. Now the (asynchronous) handling of an event can happen whenever the event consumer is ready to perform it, with the event bus acting as a near-infinite-capacity buffer, instead of an entire multi-step operation being tied to the uptime of its least-reliable service. New uses for existing event data can crop up over time without adding software complexity to the event sources.
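The decoupling described above can be shown with a minimal in-process sketch of a topic-based bus. A real deployment would use a dedicated broker (Kafka, RabbitMQ, etc.); the class and topic names here are illustrative.

```python
from collections import defaultdict
from queue import Queue

class EventBus:
    """Toy topic-based event bus: producers publish facts without any
    knowledge of who (if anyone) might consume them."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> subscriber queues

    def subscribe(self, topic):
        q = Queue()  # acts as a per-consumer buffer
        self._subscribers[topic].append(q)
        return q

    def publish(self, topic, event):
        # The producer's responsibility ends here; each consumer drains
        # its own queue whenever it is ready.
        for q in self._subscribers[topic]:
            q.put(event)

bus = EventBus()
feed_queue = bus.subscribe("photo.posted")    # feed-builder team's consumer
search_queue = bus.subscribe("photo.posted")  # search-indexer team's consumer

bus.publish("photo.posted", {"user": "@steve", "photo_id": 56123})
```

Note that adding the search indexer required no change to the producer: new uses of the event stream cost the event source nothing.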
The decoupling achieved here is useful in scaling dev teams themselves: as an engineering organization grows, the coordination and communication overhead grows with it. (See: Brooks's Law, The Mythical Man-Month, and the last… 40 years of our profession.)
If we can isolate the team responsible for a particular type of event from having to understand, in depth, the downstream implementations that react to it, we can reduce that cross-team overhead. A team of engineers owning a partial silo of the business's data, and the service controlling read and update access to that data, declares the format of the events representing data updates as part of its external contract. Its responsibility can often end there: other teams can build software reacting to these events completely independently, as long as the public contract isn't broken.
Keeping read-optimized data representations up to date involves following a few simple rules:
- There’s still one single authoritative source of truth about an entity/fact, and that is the service managing that entity’s CRUD operations (though other services can and do maintain shadow copies of parts of the data for queryability)
- When a fact about an entity managed by a service changes, that service must post an event to a shared message bus informing consumers that a change occurred
- This event broadcast must happen after the owning service successfully commits the change to its underlying durable store. (The implication is that if an event consumer receives an event telling it that a photo with ID 56123 was just published, and it comes to the authoritative service that published that event asking about photo ID 56123, the transaction that created it must now be committed and fully visible to all instances of that service.)
- The query-only service maintains its shadow copy by consuming the subset of events it cares about and updating its own data store in response, either directly from the event contents or, commonly, by querying the authoritative service for an updated representation of the changed entity (a pattern we call lightweight events – a topic we'll expand on in Part 3)
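The rules above can be sketched as a consumer handling a lightweight change event: the event carries only an ID, so the consumer re-queries the authoritative service for the fresh representation and upserts it into its local store. `fetch_photo` is a hypothetical client for the authoritative photo service's read API, and the event and document shapes are illustrative.

```python
# Shadow copy maintained by the query-only service: photo_id -> document.
local_store = {}

def fetch_photo(photo_id):
    """Stand-in for a call to the authoritative service's read API.
    Per the rules above, by the time we receive the event, the change
    is committed and visible here."""
    return {"id": photo_id, "owner": "@steve", "caption": "pasta!"}

def on_event(event):
    if event["type"] != "photo.updated":
        return  # not in the subset of events this service cares about
    # Lightweight event: refresh from the source of truth rather than
    # trusting the event to carry the full payload.
    doc = fetch_photo(event["photo_id"])
    local_store[doc["id"]] = doc

on_event({"type": "photo.updated", "photo_id": 56123})
```

Re-fetching from the authoritative service keeps the event contract tiny and makes duplicate or out-of-order deliveries harmless: the consumer always converges on the latest committed state.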
(A simplified view of how a search indexer might react to new content becoming available, and refresh its own denormalized joining store.)
A few years ago, we started mandating that key data management services (reviews, users, forum posts, etc.) publish change events on an event bus whenever they created/updated/deleted an entity. Since then, new services have sprung up to harness that event stream and build their own optimized representations of that data for feed-building, searching, etc.
In Part 2, we’ll dive into specific real applications and look at how the TripAdvisor site builds feeds of the users you follow.
Jean-Philippe joined TripAdvisor in 2013. Currently, he is the Technical Manager for Search and Discovery, the team responsible for building the search engine and event messaging platforms. Prior to TripAdvisor, he was an architect at Solace, designing event-driven distributed systems for the banking and high-frequency trading sectors. In his free time, he enjoys traveling with his wife and two daughters.