Search & Retrieval

Similarity Search for Grouped Content (Teaser)

Vector search has taken the world by storm. The idea is this: cast documents into a vector embedding space via some transformation (e.g. BERT) and then index them into a vector database that is capable of fast approximate nearest-neighbor search. Then when the user makes a text query, cast their query into this same embedding space, find the nearest vectors to the query vector, and return the corresponding documents.
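To make that pipeline concrete, here's a minimal sketch. It uses TF-IDF vectors as a stand-in for a learned embedding like BERT, and brute-force cosine similarity in place of a real approximate nearest-neighbor index; the documents and the query are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A tiny corpus standing in for the documents you would normally
# embed with a model like BERT and load into a vector database.
docs = [
    "how to tune elasticsearch relevance scoring",
    "best hiking trails near seattle",
    "an introduction to learning to rank for search",
]

# Stand-in for the embedding transformation.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)

# Cast the query into the same space, find the nearest document vectors,
# and return the corresponding documents in order of similarity.
query_vector = vectorizer.transform(["improve search ranking"])
scores = cosine_similarity(query_vector, doc_vectors)[0]
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.3f}  {doc}")
```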

It occurred to me today, while staring blankly at the sky, that this type of search won't work when searching for grouped content. For instance, what if I go onto Reddit and I'm not really looking for a single Reddit page, but for an entire subreddit that matches my interests? Here I'm not looking for a document, but for a group of documents that match my intent.

It has gnawed at my subconscious for the past 5 years. Even as I wrote Relevant Search it was there at the back of my mind, weighing me down - the fundamental problem of search. But only now has the problem taken shape enough that I can even begin to describe it. Succinctly, here it is:

Users don't know what the heck they want and couldn't tell you even if they did.

Haystack Highlights

On April 10th and 11th OpenSource Connections held their first (annual, I hope) Haystack search relevance conference. It was intended to be a small-and-casual, 50-person conference, but it ended up pulling in roughly 120 people, requiring OSC to scramble to find more space. The end result was one of the best conferences I've ever attended. In general, conference speakers have to aim their content at the lowest common denominator so that they don't lose their audience. At this conference, the lowest common denominator was really high! So there was no need to over-explain the boring introductory topics. Instead the speakers were able to jump into interesting and deep content immediately.

Needless to say, I came away with a ton of good information that I'm going to put to work at Eventbrite as soon as possible.

Better Click Tracking for Identifying Statistically High Performers - Part I

Click tracking is a way of boosting documents based upon the historical clickthrough rate they received when surfaced in search results. Here's how it works: Let's say we're building click tracking for an online store and we want to boost the documents that are getting the most attention. First you set up logging so that you can count how many times a particular item is clicked. Next you have a process that aggregates the clicks across, say, a week, and stores the value in a click_count field alongside the documents that you are serving from search. Finally, when someone performs a search, you boost the results according to the click_count so that items with high clickthrough rates start surfacing higher in search results. But if you think hard, there's a pretty nasty problem with this approach.
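Here's a minimal sketch of that naive pipeline. The click_count field comes from the description above; the log format and the log-based boosting formula are just assumptions for illustration.

```python
import math
from collections import Counter

# One entry per click over the past week; the format is an assumption.
click_log = ["item-42", "item-7", "item-42", "item-42", "item-7", "item-13"]

# Aggregate into a click_count per item, as if writing the value back
# into a click_count field on each document served from search.
click_count = Counter(click_log)

def boosted_score(relevance_score, item_id, weight=0.5):
    # Boost the base relevance score by the item's historical clicks.
    return relevance_score + weight * math.log1p(click_count[item_id])

print(boosted_score(1.0, "item-42"))  # heavily clicked item gets the biggest boost
print(boosted_score(1.0, "item-13"))
```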

(Can you figure it out?)

The problem is feedback. In the context of search results, the first page, and really the first few results, get all the love. Very few users are desperate enough to click through to the second page of results. So click tracking causes a nasty positive feedback dynamic to arise: users are shown a page of results, they only click on those results, and thus those first-page items get an additional boost. This makes it even more likely for these items to show up on the first page of results for other related searches, which exacerbates the problem, and so on. One way of addressing this is to track the typical clickthrough rate and boost a document only according to how much it exceeds that typical rate.
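A sketch of that correction might look like the following. It assumes we log impressions as well as clicks; the counts, field names, and the single typical-rate baseline are assumptions for illustration rather than the exact method developed later in this series.

```python
# Per-item counts over some window, say a week; the numbers are made up.
stats = {
    "item-42": {"impressions": 1000, "clicks": 120},
    "item-7":  {"impressions": 1000, "clicks": 35},
    "item-13": {"impressions": 50,   "clicks": 6},
}

# The typical clickthrough rate across all items.
total_impressions = sum(s["impressions"] for s in stats.values())
total_clicks = sum(s["clicks"] for s in stats.values())
typical_ctr = total_clicks / total_impressions

def click_boost(item_id, weight=10.0):
    s = stats[item_id]
    ctr = s["clicks"] / s["impressions"]
    # Boost only by how much the item exceeds the typical rate, so items
    # that merely ride first-page exposure get no extra lift.
    return weight * max(0.0, ctr - typical_ctr)

for item in stats:
    print(item, round(click_boost(item), 3))
```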

This is the first in a series of blog posts where we will examine how a more sophisticated version of click tracking can be implemented, and we will look at some of the neat offshoots of this work that allow you to do things like turn click logs into judgement lists. But first we start with a very simple example...

Embedding spaces are quite trendy in machine learning right now. With word2vec, for example, you can create an embedding for words that is capable of capturing analogy. Given the input "man is to king as woman is to what?", a word2vec embedding can be used to correctly answer "queen". (Remarkable, isn't it?) Embeddings like this can be used in a wide variety of domains. For example, facial photos can be projected into an embedding space for facial recognition tasks. However, I wonder if embeddings fall short in a domain that I am very near to - search.

Consider the facial recognition task: Each face photo is converted into an N-dimensional vector where N is often rather high (hundreds of values). Given a sample photograph of a face, if you want to find all of the photos of that person, then you have to search for all the photo vectors near the sample photo's vector. But, due to the curse of dimensionality, very high dimensional embedding spaces are not amenable to the data structures commonly used for spatial search, such as k-d trees.
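The king/queen analogy can be reproduced in a few lines with gensim and pretrained word2vec vectors. This is just a quick sketch; downloading the pretrained model is a sizable one-time cost.

```python
import gensim.downloader as api

# Download (once) and load pretrained word2vec vectors.
model = api.load("word2vec-google-news-300")

# "man is to king as woman is to what?"  ->  king - man + woman
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# [('queen', 0.71...)]
```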