Tokenizing Embedding Spaces for Faster, More Relevant Search

Embedding spaces are quite trendy right now machine learning. With word2vec for example, you can create an embedding for words that is capable of capturing analogy. Given the input “man is to king as woman is to what?”, a word2vec embedding can be used to correctly answer “queen”. (Remarkable isn’t it?) Embeddings like this can be used for a wide variety of different domains. For example, facial photos can be projected into an embedding space and for tasks of facial recognition. However I wonder if embeddings fall short in a domain that I am very near to - search. Consider the facial recognition task: Each face photo is converted into an N-dimensional vector where N is often rather high (hundreds of values). Given a sample photograph of a face, if you want to find all of the photos of that person then you have to search for all the photo vectors near to the sample photo’s vector. But, due to the curse of dimensionality, very high dimensional embedding spaces are not amenable to data structure commonly used for spatial search, such as k-d trees.

Maybe we can take a lesson from search, because search engines have no problem at all with high-dimensional feature spaces. Consider a feature vector for an English document, say, a description of a product in a catalog. In principle, the cardinality of this feature vector is giant - one floating point value for each word in the English vocabulary. This is a bit misleading of course; the representation used by search engines is very sparse - most documents do not contain most words in the English vocabulary - so the feature vector is mostly zeros. However the point I’m making still stands: search engines are good at finding matches in high-dimensional feature sets. Given a query - a set of words - a.k.a. a sparsely-encoded, high-dimensional feature vector - then a search engine can very quickly identify all the documents that contain these words - a.k.a. all similar feature vectors.

Using Search Technology for Large Scale Face Recognition Tasks

So let’s go back the the face recognition task. Let’s say we have a data set of every photograph of every face on Facebook and we want to make a face search application. If somehow we could tokenize a face - that is, convert it into a bunch of symbols similar to words in a text document - then we could reduce facial recognition to a simple application of search. What would this look like? Well in principle you could assemble an army of well trained curators to write up a tagged description of each face. The tags could read something like this

{caucasian, fair-complexion, straight-hair, black-hair, receeding-hairline, bald, no-beard, no-facial-hair, caterpillar-eyebrowse, arched-eyebrows, black-eyecolor, small-iris, bulging-eyes, large-nose, blubous-nose, broad-smile, missing-teeth, kinda-sinister-looking}

(In this example, our curators are obviously describing Gargamel.)


Given an index full of similarly tagged facial photos, someone searching for a new face would proceed as follows. First they would have a curator tag their face photo. (Don’t worry, we’ll relax this silly requirement in a moment.) Then they would make a textual search of these tags against the face search index. The search engine would quickly narrow the result down to only those that shared much of the terms in the query. Notice that a nice benefit of sparsely encoding faces as tags is that you become robust to missing or even incorrect information. For instance, if Gargamel wore a disguise as shown in the photo below, he would still have {black-hair, large-nose, bulbous-nose, bulging-eyes} and his photos would still be present in the filtered set of images!


Once the filtering step is done, the surviving images are then all scored and ordered from most likely to least likely matches. Out-of-the-box search engines use a TF*IDF-based scoring algorithm which serve as a pretty reasonable starting point even for a face search application like this. With TF*IDF scoring, less-common tokens, (such as Gargamel’s bulging-eyes, small-iris, and missing-teeth) will carry larger weight. Intuitively this makes sense, bulging-eyes is more noteworthy than, say, caucasian or black-hair.

Besides the natural ability to filter and sort result sets, using a search engine comes with other benefits. If you use Elasticsearch then you can immediately take advantage of the ability to partition your index across several machines and scale to very large data sets. Consequentially, you can also distribute search across these machines for fast searchability. Additionally, since we’re using a search engine, you can combine facial feature search with other types of search. For instance in addition to the facial_feature field above, you could have a geo_location field that carries latitude and longitude. Now you have location aware face search: “Find me all individuals matching Gargamel’s description in Tennessee”.

Tokenizing Embedding Spaces

The above solution is of course silly, because we don’t have an army of trained curators to tag face photos. However, if we could find some way to take a dense embedding vector and tokenize it into a sparse representation like that above, then we could have all of the benefits described above without the overhead of supporting that army of curators.

What would such a tokenization look like? Well for one thing, it wouldn’t be human readable; we wouldn’t take a vector of floating numbers and convert it into words like {fair-complexion, straight-hair, black-hair}. Rather we would produce pure tokens - symbols that carried some sort of descriptive information, (though we wouldn’t necessarily know what that information means). So we would send Gargamel’s image through some swanky machine learning algorithm to convert it into an embedded vector and then we would send that vector through our algorithm (still T.B.D.) and it would emit tokens: {a52b3gd2, b1a3ff63, 69e4a3b3, ...}. If we do this, then we get all the benefits outlined in the above paragraphs, and we can fire off our curation staff. The only thing we’ve lost is the human readability of the tokens.

But what algorithm is capable of mapping from embedding vectors to token space? Well, I don’t rightly know. But I do know some of the qualities that it must exhibit to be successful:

Yeah, Yeah… But how do we do it?

How do we find an appropriate mapping between the embedding space and the token space? This is a problem I leave to the reader - and a problem that I have not been able to solve myself! Here are some rough ideas that I have been working on, but I feel they all fall short:

Standard Disclaimer

Per my usual, I did not do any exhaustive research prior to writing up this post. Perhaps this is an idea that has been well studied and perhaps my foundational presumptions are just wrong! If so, then would someone kindly point me in the right direction? On the other hand, maybe this is indeed something new. In that case I would love to get your ideas on how to push this idea further. In any case I’d love to get your thoughts. Let’s start a conversation on Twitter - @JnBrymn.

comments powered by Disqus