The Fundamental Problem of Search
08 Sep 2018

It has gnawed at my subconscious for the past 5 years. Even as I wrote Relevant Search it was there at the back of my mind weighing me down - the fundamental problem of search. But only now has the problem taken shape so that I can even begin to describe it. Succinctly, here it is:
Users don’t know what the heck they want and couldn’t tell you even if they did.
Yeah… so that’s about as far as I’ve gotten with it. I feel like I need something pithier. But let me break it down a bit and then you too will understand what’s been in my mind all these years. To put the above phrase less succinctly, but more accurately:
- Modern search interfaces are minimalistic and users don’t have much opportunity to tell you what they want - usually just a text box.
- Users have lots of criteria in mind when making a decision. If they want to find an event then they are considering the type of event, the location, the time, the date, the overall quality, and probably many other things as well.
- Different users have different criteria in mind when making decisions and different weighting of said criteria.
- Users often don’t know they have all of these criteria in mind. They believe you can “simply” find the set of matching documents and return them in whatever order they specify.
- Users believe that deciding whether or not a document matches is a binary decision, and that the order of the results can be exact. In truth both the match and the ordering are naturally “fuzzy”.
- Despite uncertain user intent and despite the fuzzy and uncertain nature of matching and ordering, the relevance engineer has to make both matching and ordering concrete and absolute.
Hopefully my original thesis makes a little more sense now. Users are unaware of their own internal processing as they work to satisfy an information need. Many of their assumptions about how search works are wrong - which makes the search experience disappointing. And the interface we provide users is so minimalistic that we have little insight into the user’s needs. In the sections below I’ll delve into several examples that will make the points above concrete and then follow up with some ideas about how we can overcome the fundamental problem of search. Here’s a hint though … it won’t be easy.
Matching and Sorting by Relevance
In the simplest possible scenario the user enters text into a search box and we match documents and sort the results based solely on relevance. That is, we find all the documents that contain at least one of the user’s keywords and we score the documents based upon TF*IDF. This seems pretty straightforward, right? But fuzziness is already starting to creep in.
First, let’s consider how we determine which set of documents match. If your user searches for “green mile” then we as humans recognize that the user is probably talking about the movie called The Green Mile. The search engine, though, is just going to return all documents that match the term green or the term mile. But documents that match only one of these terms probably aren’t going to be very relevant. One option is to require both terms to match, but this strategy is ill-advised because there are plenty of times when an either/or match might be relevant. If the user searches for “comedy romance” then they would probably prefer a comedy romance, but toward the end of the list a comedy or a romance film might be just fine.
In principle another option would be to return every document with a score above some cutoff value X. In practice this isn’t possible because TF*IDF scoring is not absolute; you don’t score 0 for perfectly irrelevant documents and 1 for perfectly relevant documents. Consider two queries, one for “it” (as in the movie It) and another for “shawshank” (as in The Shawshank Redemption). The term “it” is very common, so the best matching document - the movie It - will likely get a relatively low TF*IDF score. However, in the case of “shawshank” let’s say that we don’t actually have a document for The Shawshank Redemption, and the only document that matches does so because of a footnote in the description stating “from the director of The Shawshank Redemption”. Though this is a poor match, the score will be quite high because the word “shawshank” is so rare. In this example, we have a low-scoring document that is a great match and a high-scoring document that is a terrible match. It is just not possible to delineate between matching and non-matching documents based upon score alone.
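To see how this plays out, here’s a toy TF*IDF scorer over a made-up corpus. The titles and descriptions are invented purely to reproduce the “it” vs. “shawshank” disparity; real engines use more refined formulas, but the effect is the same:

```python
import math

# A toy corpus of movie descriptions (invented for illustration).
docs = {
    "It":           "it follows a group of kids terrorized by it",
    "Grant":        "from the director of the shawshank redemption",
    "Forrest Gump": "it is the story of a man who witnesses history",
    "Jaws":         "a shark terrorizes a beach town until it is stopped",
}

def tf_idf(term, doc_id):
    """Classic TF*IDF: term frequency times inverse document frequency."""
    words = docs[doc_id].split()
    tf = words.count(term)  # how often the term appears in this document
    df = sum(1 for text in docs.values() if term in text.split())
    return tf * math.log(len(docs) / df) if df else 0.0

# "it" appears almost everywhere, so even the best possible match scores low...
print(tf_idf("it", "It"))            # ~0.58
# ...while the rare "shawshank" makes a mere footnote match score high.
print(tf_idf("shawshank", "Grant"))  # ~1.39
```

No single cutoff X separates the great match (0.58) from the terrible one (1.39), which is exactly why score thresholds fail.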
So we see that even in the most basic text search scenario we already run into places where we can’t know the user’s intent and where the hope of perfect matching and perfect sorting breaks down. It gets worse though!
Matching and Sorting by Relevance and Quality
“I want to find a documentary about Abraham Lincoln.” Seems simple enough. So we retrieve all documents that match either abraham or lincoln and we sort by the default scoring algorithm so that documents that match both terms appear at the top. But there’s a problem: the user told you they want Abraham Lincoln documents, but implicit in the request is that they really want just the high quality results. So if you’re used to databases and you’ve just started working with search then the answer seems obvious - just sort by quality (or popularity or whatever field serves as a proxy for quality). If you do this you’ll immediately find yourself with a new issue: when sorted by quality, the top results will contain documents that are high quality but only match one of the terms and aren’t really very relevant at all. If you had a documentary on the life and times of the Biblical Abraham, for example, and it was a really high quality documentary, then it would jump up above the documents that are actually about Lincoln.
So again, for someone new to search, the next “answer” is clear: just set the minimum_should_match parameter to 100% to ensure that documents are only returned if they contain all of the terms the user queried. But this doesn’t really fix the problem. Consider a high quality documentary about Ulysses S. Grant which merely mentions Abraham Lincoln - a high quality result, but nonetheless irrelevant to the user. What’s more, minimum_should_match=100% can get you in a lot of trouble when the user searches by dumping lots of words into the search box and hoping that some of them match. For example, with “civil war abraham lincoln”, a documentary entitled “President Lincoln and the Civil War” would be entirely relevant yet would not be a match!
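For the record, here is roughly what that looks like in Elasticsearch. The index and field names are hypothetical, but the query shape is the standard match query:

```python
from elasticsearch import Elasticsearch  # official Python client

es = Elasticsearch()

# Hypothetical "movies" index with a "description" text field.
query = {
    "query": {
        "match": {
            "description": {
                "query": "civil war abraham lincoln",
                "minimum_should_match": "100%",  # every term must match
            }
        }
    }
}

# "President Lincoln and the Civil War" never contains the term
# "abraham", so this search drops it entirely despite its relevance.
results = es.search(index="movies", body=query)
```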
The best thing to do here is to boost by quality rather than use absolute sorting. By default the score of the document is based solely upon how well the text matches according to TF*IDF. But you can incorporate quality into the score by adding in some multiplier times the quality: total_score = text_score + k * quality. With this approach, in principle, you can adjust k so that the total score strikes the appropriate balance between pure text-score sorting (k = 0) and absolute quality sorting (k = inf).
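In Elasticsearch, one way to sketch this total_score = text_score + k * quality combination is a function_score query with boost_mode set to "sum". The quality field name and the value of k here are assumptions for illustration:

```python
k = 0.5  # hand-tuned balance between text relevance and quality

query = {
    "query": {
        "function_score": {
            "query": {"match": {"description": "abraham lincoln"}},
            "field_value_factor": {
                "field": "quality",  # assumed numeric quality/popularity field
                "factor": k,         # contributes k * quality
            },
            "boost_mode": "sum",     # total = text_score + k * quality
        }
    }
}
```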
Though this approach of linearly adding in quality is very commonly used and often effective, it can come with some nasty problems of its own. Ideally you would be able to find some k that works best in all cases. In practice you cannot. Refer back to the example of the “it” search and the “shawshank” search. In the “it” search, the best matching document will have a much lower text score than in a typical query. And in the “shawshank” query, even average matching documents will have potentially high scores. In both of these cases, if we calculate total_score as text_score + k * quality, then in the “it” search the quality component will have a much greater effect on sorting than it will for the “shawshank” query. It would be nice if somehow we could automatically scale k so that it was proportional to the general text scores for a given search. (More on this in a future post!)
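As a very naive sketch of that idea (not what the engine does for you, and not the approach I’ll describe in that future post): normalize each query’s text scores by the top text score before adding the quality term, so that k means the same thing for an “it” query as for a “shawshank” query. The hit fields here are hypothetical:

```python
def blend_quality(hits, k=0.5):
    """Re-rank hits so quality has a comparable effect on every query.

    Each hit is assumed to be a dict with hypothetical 'text_score'
    and 'quality' fields, both non-negative.
    """
    top = max(hit["text_score"] for hit in hits) or 1.0  # guard against 0
    for hit in hits:
        # Text scores now land in [0, 1] regardless of term rarity.
        hit["total_score"] = hit["text_score"] / top + k * hit["quality"]
    return sorted(hits, key=lambda h: h["total_score"], reverse=True)
```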
Side Bar: Multiple Objective Optimization
A big part of the underlying theme here is that search is a multiple objective optimization problem. That is, we are trying to optimize the scoring function so that multiple objectives are optimized simultaneously. However we do not know - and we cannot know - how important the objectives are relative to one another. The issue is perhaps most obvious in applications like Yelp where the different objectives are called out in the application: you’re looking for a restaurant - how would you like to organize the results? Distance? Price? Number of stars? If you’ve selected a food category or typed in a search, then how important should that be? From Yelp’s perspective, the answer cannot quite be known. The best we can do is to find some sort of balance between the various dimensions that empirically tends to maximize conversion rates. In modern implementations of search, Learning-to-Rank is a machine learning approach that does exactly this.
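Stripped to its core, the multi-objective problem looks like the sketch below: every result gets a weighted sum of objectives, and the entire difficulty hides in the weights. Learning-to-Rank effectively learns them from engagement data rather than guessing. All names and numbers here are invented:

```python
# Hypothetical objective weights; signs encode direction (closer and
# cheaper are better, so distance and price get negative weights).
weights = {"text_score": 1.0, "stars": 0.8, "distance_km": -0.3, "price_tier": -0.1}

def score(result):
    """Weighted linear combination of the objectives."""
    return sum(w * result[name] for name, w in weights.items())

restaurants = [
    {"id": "A", "text_score": 2.1, "stars": 4.5, "distance_km": 1.2, "price_tier": 2},
    {"id": "B", "text_score": 1.4, "stars": 5.0, "distance_km": 0.3, "price_tier": 3},
]
ranked = sorted(restaurants, key=score, reverse=True)
```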
Matching and Sorting by Relevance, Quality, and Date
Things get even worse when we involve precise quantities like date or price. Often users will want to sort their results by date, and this sounds like a perfectly reasonable thing to do. But you will encounter some very bad side effects when sorting exactly by things like date. Here’s why: the world is driven by the Pareto rule and your inventory is no different. If you are in e-commerce, then 20% of your inventory is where you get 80% of your sales, and the other 80% of your inventory brings in only 20% of your sales - it’s garbage. So let’s say our users are searching for “beer events” but they want to sort the results by date. Rather than showing the most relevant events such as beer festivals, brewing classes, and beer tastings, we’re going to show them irrelevant, date-ordered events such as business dinners or speed dating events that merely mention beer in their descriptions. Effectively we are scooping way down into the 80% of less desirable events simply because they happen sooner than the more relevant events that we should be returning.
Solving this is quite a challenge. Consider some alternatives:
- Boost by date: As presented in the last section, you can boost by date and make sure that the best documents are right at the top of the search results, kinda sorted by date. But when users choose to sort by a precise quantity like date, they will see any deviation from date order as evidence that search is broken and not to be trusted.
- Re-sort the most relevant documents by date: You can use the Elasticsearch rescore feature to find the set of the N most relevant documents and then re-sort them by date (see the sketch after this list). But how do you find a good value for N? If N is too low then users may page past all N results and you’ll have to either tell them there are no more results OR you’ll have to “show omitted results” and start over by date. On the other hand, if N is too high then the returned set will dip past the most relevant documents and pull up some of the 80% of less desirable results. Sorting this group by date means that some of these irrelevant or low quality documents will end up at the top of the search results.
- Sort by date then sort by relevance: If you think this is a good idea, then you haven’t put your thinking cap on yet today. Nevertheless I hear this tossed around as an option quite a bit. The problem is that if date includes a timestamp then it is effectively a continuous value. If your documents have timestamps with granularity down to the second then sorting by date followed by relevance is no different than just sorting by date.
- Bucket by date and sort by relevance within each bucket: As a variant on the previous idea, you do have the option of discretizing the date, chunking documents into buckets of, say, a day or a week, and sorting by relevance within each bucket. This might just be a great solution. If the user doesn’t expect exact date/time ordering then they will be more forgiving when the documents don’t appear in exact date order down to the second within the buckets. But there are still problems - within each bucket there are fewer documents to draw from. Nevertheless, the search engine will faithfully provide the best documents it has for each bucket. This means that as your bucket size gets smaller, the chances become higher that a bucket will be filled with irrelevant documents. It would be better not to return such a bucket at all, but per the Matching and Sorting by Relevance section above, scoring is not absolute, so it might still be difficult to decide which buckets should be omitted from the search results.
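For reference, here is roughly how the “top N re-sorted by date” option from the list above looks with the Elasticsearch rescore feature. Treat it as a sketch: the window size and field names are assumptions, the date-access syntax in the script varies by Elasticsearch version, and choosing a good window_size is exactly the unsolved part.

```python
query = {
    "query": {"match": {"description": "beer events"}},
    "rescore": {
        "window_size": 200,  # N: how many top hits to re-sort by date
        "query": {
            "rescore_query": {
                "function_score": {
                    "script_score": {
                        # Scores later dates higher (most recent first);
                        # for upcoming events you'd invert this, e.g. by
                        # subtracting from a large constant, to put the
                        # soonest events on top.
                        "script": {"source": "doc['date'].value.millis"}
                    }
                }
            },
            "query_weight": 0,          # discard the relevance score...
            "rescore_query_weight": 1,  # ...and order the window by date
        },
    },
}
```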
No Hard and Fast Solutions
As you can see in the sections above, I’m doing a good job of outlining a very big problem, but I’m not providing any easy solutions. That’s because there aren’t any! By their very nature, search and recommendation are and forever will be filled with nasty corner cases. Human language is dirty and imprecise, and your users’ information needs will be uncertain and highly varied. But don’t lose hope. Despite the many corner cases, search technology is still an excellent tool for helping users satisfy their information needs in the great majority of use cases.
What’s more, search is getting better. Learning to Rank is a machine learning technique for scoring documents that can automatically find the best balance between features like text relevance, static quality, and innumerable other things. Similarly, there has been lots of conversation in the search community about embedding vectors into search so that traditional inverted-index search can be used in conjunction with recent developments in machine learning (check it out!). Finally, I expect the user experience to continue to develop and improve. The dominant search experience for the past 15 years has been a text box at the top of the screen. But we’re seeing this give way to more conversational search experiences, like when you ask Siri to look up a phone number or ask Alexa to play a certain song. The visual experiences are changing too. Take a look at Google’s image search. There the left-nav faceted search has been replaced with a very intuitive tag-based slice-and-dice experience that allows you to quickly narrow down to the small set of results that fit your information need. I expect we will continue to get better and better experiences.