Welcome to my technical blog! This first post should be relatively quick, discussing an information retrieval paper that I think makes a useful observation. This is also a first test of the platform I’m using for equations.
Softmax loss and ranking metrics
This post discusses An Analysis of the Softmax Cross Entropy Loss for Learning-to-Rank with Binary Relevance by Sebastian Bruch and collaborators at Google Research.
This paper comes from the information retrieval literature, and concerns ListNet loss specifically. We’ll consider this loss, and the paper’s relevance to computer vision in more detail later.
They make a simple observation that I hadn’t considered before reading their paper: ListNet loss (and softmax cross-entropy losses generally) bounds common ranking metrics.
Before diving in, let’s briefly introduce these topics.
Ranking quality metrics
Ranking quality metrics model real-world tasks, where the order of results is very important. For example, if a search engine presents a user with 10 results, of which 3 are relevant, then the relative position of those results strongly determines quality. Since retrieval quality is order-sensitive, ranking metrics are also order-sensitive.
Some examples of ranking metrics include:
- Recall at $k$: what fraction of relevant results are in the first $k$ results?
- Precision at $k$: what fraction of the first $k$ results are relevant?
- Average precision (AP): a mean of “precision at $k$” values at each position $k$ that contains a relevant result (or 0 for relevant results not in the result set).
- Discounted cumulative gain (DCG) and normalized discounted cumulative gain (NDCG): metrics that allow expressing more fine-grained notions of relevance.
- (Mean) reciprocal rank (MRR): the inverse of the rank of the first relevant document: $1$ if the ground-truth result is first, $\frac{1}{2}$ if it is second, etc.
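To make these concrete, here is a minimal sketch of three of these metrics in plain Python. The function names and the 0/1 relevance-list representation are my own choices for illustration, not from the paper:

```python
def precision_at_k(ranked_relevance, k):
    """Fraction of the first k results that are relevant.

    ranked_relevance: list of 0/1 relevance labels, in ranked order.
    """
    return sum(ranked_relevance[:k]) / k

def reciprocal_rank(ranked_relevance):
    """1 / rank of the first relevant result (0 if none is relevant)."""
    for i, rel in enumerate(ranked_relevance, start=1):
        if rel:
            return 1.0 / i
    return 0.0

def average_precision(ranked_relevance, num_relevant):
    """Mean of precision@k over positions k holding a relevant result.

    Relevant results missing from the ranking contribute 0, hence the
    division by the total number of relevant results.
    """
    precisions = [precision_at_k(ranked_relevance, k)
                  for k, rel in enumerate(ranked_relevance, start=1) if rel]
    return sum(precisions) / num_relevant

# A ranking with relevant results at positions 1 and 3:
ranking = [1, 0, 1, 0]
print(reciprocal_rank(ranking))       # 1.0
print(precision_at_k(ranking, 3))     # ≈ 0.667
print(average_precision(ranking, 2))  # ≈ 0.833
```

Note how all three values change only when the order of the 0/1 labels changes, which is exactly the stepwise behavior discussed below.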
The important point is that for many retrieval contexts, the main determinant of retrieval quality is the order of results.
ListNet loss
ListNet is a loss from the information retrieval literature. It has a motivation based on permutation probability, but it ultimately arrives at a variation on softmax cross-entropy loss.
ListNet uses a cross-entropy loss between a label distribution (using a continuous notion of relevance over a set of labeled results), and a predicted distribution over candidate documents.
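As a sketch of this idea, ListNet’s top-one formulation can be written in a few lines of Python. This is my own minimal rendering, assuming graded relevance labels as input; the softmax over the labels forms the target distribution:

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def listnet_top1_loss(scores, relevance):
    """Cross-entropy between the softmax of the relevance labels
    (the target distribution) and the softmax of predicted scores."""
    target = softmax(relevance)
    predicted = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(target, predicted))

# Scoring the relevant document highest yields a lower loss than
# scoring it lowest:
good = listnet_top1_loss([3.0, 1.0, 0.0], [1.0, 0.0, 0.0])
bad = listnet_top1_loss([0.0, 1.0, 3.0], [1.0, 0.0, 0.0])
```

With binary relevance labels and a single relevant document, the target distribution concentrates on that document, and the loss reduces to ordinary softmax cross-entropy with that document as the ground-truth class, which is the setting the paper analyzes.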
Proposition: Softmax CE bounds ranking metrics
This paper addresses a challenge in information retrieval models that are learned using gradient-based optimization:
Ranking metrics rely on the order of results. The metrics are stepwise, with abrupt changes in value only when the order of results changes. This makes these metrics unsuitable for direct gradient-based optimization.1
Because ranking metrics are not differentiable, models are trained using separate loss functions that are not directly related to ranking metrics. These are typically seen as surrogate losses – loss functions that enable optimization, but which are only loosely related to the target task.
The paper’s main claim is that ListNet loss, when used with binary notions of relevance, bounds two retrieval metrics: MRR and NDCG. I’ll focus on MRR here, since it’s more straightforward.
Proof sketch: Softmax cross-entropy bounds MRR
Part 1: Softmax bounds reciprocal rank
Given pre-softmax logits $s_1, \dots, s_n$, the predicted probability for class $i$ is computed as follows:

$$p_i = \frac{e^{s_i}}{\sum_{j=1}^{n} e^{s_j}}$$

For convenience, let’s define $Z = \sum_{j=1}^{n} e^{s_j}$ to be the partition function (the summation in the denominator).

Let $r_y$ be the rank of class $y$, if we sort the classes in descending order of $p_i$ (or, equivalently, if we sort the $s_i$ values or the $e^{s_i}$ values in descending order).

The key observation is that the partition function has $r_y$ terms that are $\geq e^{s_y}$: the term for class $y$ itself, plus one term for each class ranked above it. Let’s introduce arbitrary non-negative constants $c_j \geq 0$ to express this:

$$Z \geq \sum_{j=1}^{r_y} \left( e^{s_y} + c_j \right) \geq r_y e^{s_y}$$

Dividing through:

$$p_y = \frac{e^{s_y}}{Z} \leq \frac{e^{s_y}}{r_y e^{s_y}} = \frac{1}{r_y}$$

Since $\frac{1}{r_y}$ is the definition of reciprocal rank, we have demonstrated that the softmax probability of the ground-truth class is a lower bound on reciprocal rank.
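This per-query bound is easy to check numerically. A small self-contained sketch (helper names are mine):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(s - m) for s in logits]
    z = sum(exps)  # the partition function Z
    return [e / z for e in exps]

def rank(logits, i):
    """1-based rank of class i, sorting scores in descending order."""
    return 1 + sum(1 for s in logits if s > logits[i])

logits = [2.0, -1.0, 0.5, 3.0]
for i, p in enumerate(softmax(logits)):
    # Softmax probability never exceeds the reciprocal of the class's rank.
    assert p <= 1.0 / rank(logits, i)
```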
Part 2: Softmax cross-entropy bounds log MRR
I’m skipping over the multiple-label case, although it is a fairly straightforward extension of the above.
The formal claim made in the paper is that softmax cross-entropy bounds log MRR. There are a few related ways to get there. The paper uses Jensen’s inequality; I’ll argue from the inequality of geometric and arithmetic means, which is straightforward and avoids a tangent on Jensen’s inequality. For non-negative values $x_1, \dots, x_N$:

$$\left( \prod_{i=1}^{N} x_i \right)^{1/N} \leq \frac{1}{N} \sum_{i=1}^{N} x_i$$

Taking the logarithm of both sides:

$$\frac{1}{N} \sum_{i=1}^{N} \log x_i \leq \log \left( \frac{1}{N} \sum_{i=1}^{N} x_i \right)$$

We can apply this inequality to softmax cross-entropy loss over a dataset of $N$ queries, substituting $p_{y_i}$ (the softmax probability of the ground-truth class for query $i$) for $x_i$ above:

$$-\mathcal{L}_{CE} = \frac{1}{N} \sum_{i=1}^{N} \log p_{y_i} \leq \log \left( \frac{1}{N} \sum_{i=1}^{N} p_{y_i} \right) \leq \log \left( \frac{1}{N} \sum_{i=1}^{N} \frac{1}{r_{y_i}} \right) = \log \mathrm{MRR}$$

(The second inequality applies the per-query bound $p_{y_i} \leq \frac{1}{r_{y_i}}$ from Part 1, together with the monotonicity of $\log$.)

Therefore softmax cross-entropy loss bounds log MRR: $\mathcal{L}_{CE} \geq -\log \mathrm{MRR}$.
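The batch-level bound can also be sanity-checked empirically over random “queries.” This is a simulation sketch of mine, not an experiment from the paper:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(s - m) for s in logits]
    z = sum(exps)
    return [e / z for e in exps]

random.seed(0)
losses, reciprocal_ranks = [], []
for _ in range(1000):
    logits = [random.gauss(0.0, 2.0) for _ in range(10)]
    y = random.randrange(10)  # ground-truth class for this "query"
    p = softmax(logits)
    losses.append(-math.log(p[y]))  # softmax cross-entropy loss
    r = 1 + sum(1 for s in logits if s > logits[y])  # rank of class y
    reciprocal_ranks.append(1.0 / r)

mean_loss = sum(losses) / len(losses)
log_mrr = math.log(sum(reciprocal_ranks) / len(reciprocal_ranks))
assert mean_loss >= -log_mrr  # CE loss upper-bounds -log(MRR)
```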
This observation connects a loss function used in several retrieval domains to ranking metrics.
The relationship is a bound, and this does not argue that minimizing this loss is equivalent to maximizing these metrics. It’s also important to keep in mind that the ranking metrics here are over the training dataset, so this does not characterize generalization (i.e. MRR over the test set).
While the paper focuses on the ListNet formulation specifically, the argument applies to many uses of softmax cross-entropy losses for retrieval objectives. This is relevant to two threads of computer vision research:
First, there has been recent interest in learning-to-rank methods in computer vision, including minimizing differentiable formulations of ranking metrics.1 This observation connects these methods to a conventional loss function.
Second, there are several recent methods that use softmax cross-entropy losses at training time for fine-grained retrieval. For instance, the NormFace family of loss functions (including CosFace and ArcFace) uses softmax cross-entropy over all identities in the training dataset to supervise an embedding space for fine-grained recognition. Connecting these methods to retrieval metrics is directly relevant, because softmax cross-entropy here is used in a retrieval context.
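To illustrate the shared structure of these methods, here is a minimal normalized-softmax sketch of mine. It omits the margin terms that distinguish CosFace and ArcFace from plain NormFace, and the scale value is a typical hyperparameter choice, not a prescribed constant:

```python
import math

def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def normalized_softmax_loss(embedding, class_weights, y, scale=30.0):
    """Softmax cross-entropy over scaled cosine similarities between an
    embedding and one weight vector per identity in the training set."""
    e = l2_normalize(embedding)
    logits = [scale * sum(a * b for a, b in zip(e, l2_normalize(w)))
              for w in class_weights]
    # log-sum-exp with max subtraction for numerical stability
    m = max(logits)
    log_z = m + math.log(sum(math.exp(s - m) for s in logits))
    return log_z - logits[y]
```

An embedding aligned with its identity’s weight vector yields a near-zero loss, while a misaligned one is penalized heavily, which is exactly the retrieval-like behavior the bound above speaks to.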
Similarly, InfoNCE is a use of softmax cross-entropy loss motivated by noise-contrastive estimation. It is used in contrastive predictive coding and momentum contrast for unsupervised representation learning.2
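InfoNCE has the same softmax cross-entropy shape, with the positive key playing the role of the ground-truth class among a set of negatives. A minimal sketch (my own rendering; vectors are assumed pre-normalized, and the temperature value is a common choice rather than a fixed constant):

```python
import math

def info_nce_loss(query, positive, negatives, temperature=0.07):
    """Softmax cross-entropy where the target is the positive key's slot."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(s - m) for s in logits))
    return log_z - logits[0]  # cross-entropy with the positive at index 0
```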
Computer vision may be on the verge of unsupervised representation learning overtaking supervised pre-training, and many of these methods use softmax cross-entropy objectives. These methods are more similar to retrieval use cases than to classification (e.g. hundreds of thousands of instance-level classes), making ranking metrics a meaningful measure of quality.
While it is accurate to say that ranking metrics are not directly differentiable, there is an extensive literature on tricks that make these metrics (or approximations thereof) differentiable.
See, for example, SoftRank, which proposes a differentiable formulation of rank-based metrics like NDCG. This method casts each document score as the mean of a Gaussian distribution, then computes NDCG in expectation over the resulting rank distribution. This makes ranking metrics smooth, differentiable functions.
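To sketch the flavor of this smoothing (my own simplification of SoftRank’s first step, not the full method): treating each score as the mean of a Gaussian yields a differentiable expected rank built from pairwise “win” probabilities:

```python
import math

def soft_ranks(scores, sigma=1.0):
    """Expected 1-based ranks when each score is the mean of a Gaussian
    with standard deviation sigma. P(score_j beats score_i) is the CDF
    of the score difference, N(s_j - s_i, 2 * sigma^2), evaluated at 0."""
    def p_beats(a, b):
        # Phi((a - b) / (sigma * sqrt(2))), written via erf.
        return 0.5 * (1.0 + math.erf((a - b) / (2.0 * sigma)))
    return [1.0 + sum(p_beats(sj, si) for j, sj in enumerate(scores) if j != i)
            for i, si in enumerate(scores)]
```

As sigma shrinks, the soft ranks approach the hard ranks; larger sigma trades fidelity for smoothness, which is what makes gradient-based optimization possible.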
Another technique involves estimating a gradient for non-differentiable functions. Yang Song and collaborators’ Direct Loss Minimization paper, for example, estimates a gradient to directly optimize the mean average precision (MAP) metric in a computer vision context.
There has recently been a revival of learning-to-rank methods in computer vision, including directly minimizing differentiable formulations of ranking metrics. Here are a few recent papers applying these methods in computer vision:
- Hashing as Tie-Aware Learning to Rank, which uses the natural binning of hash representations to minimize a differentiable MAP formulation.
- Deep Metric Learning to Rank, a differentiable MAP formulation based on histogram binning of cosine distances.
- Learning with Average Precision, another differentiable MAP formulation, based on histogram binning.
- SoDeep: a Sorting Deep net to learn ranking loss surrogates, which proposes an independently-trained RNN to predict the rank of each element in a list of scores. This differentiable sorting approximation allows differentiably learning to mimic a ground-truth ordering of results (using a Spearman correlation based loss).
The PIRL method of unsupervised representation learning is also related, since its noise-contrastive estimation objective is related to softmax cross-entropy loss.
Under certain conditions, the two loss functions can be demonstrated to be asymptotically equivalent. The details are beyond the scope of this post, but it’s not clear to me whether PIRL’s NCE loss is equivalent to a softmax CE loss over all images, given the additional constraints that PIRL imposes (e.g. logits based on scaled cosine similarity). ↩