Softmax bounds ranking metrics
Welcome to my technical blog! This first post should be relatively quick, discussing an information retrieval paper that I think makes a useful observation. This is also a first test of the platform I’m using for equations.
Softmax loss and ranking metrics
This post discusses An Analysis of the Softmax Cross Entropy Loss for Learning-to-Rank with Binary Relevance by Sebastian Bruch and collaborators at Google Research.
This paper comes from the information retrieval literature, and concerns ListNet loss specifically. We’ll consider this loss, and the paper’s relevance to computer vision, in more detail later.
They make a simple observation that I hadn’t considered before reading their paper: ListNet loss (and softmax cross-entropy losses generally) bounds common ranking metrics.
Before diving in, let’s briefly introduce these topics.
Ranking quality metrics
Ranking quality metrics model real-world tasks where the order of results is very important. For example, if a search engine presents a user with 10 results, of which 3 are relevant, then the relative position of those results is very important to quality. Since retrieval quality is order-sensitive, ranking metrics are also order-sensitive.
Some examples of ranking metrics include:
- Recall at $k$: what fraction of relevant results are in the first $k$ results?
- Precision at $k$: what fraction of the first $k$ results are relevant?
- Average precision (AP): a mean of “precision at $k$” values at each position $k$ that contains a relevant result (or 0 for relevant results not in the result set).
- Discounted cumulative gain (DCG) and normalized discounted cumulative gain (NDCG): metrics that allow expressing more fine-grained notions of relevance.
- (Mean) reciprocal rank (MRR): the inverse of the rank of the first relevant document: $1$ if the ground-truth result is first, $\frac{1}{2}$ if it is second, etc.
The important point is that for many retrieval contexts, the main determinant of retrieval quality is the order of results.
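To make these definitions concrete, here is a small sketch in plain Python (the function names and the example relevance list are my own, chosen for illustration), computing recall at $k$, precision at $k$, and reciprocal rank over a ranked list of binary relevance labels:

```python
def recall_at_k(relevance, k):
    """Fraction of all relevant results that appear in the first k positions."""
    return sum(relevance[:k]) / sum(relevance)

def precision_at_k(relevance, k):
    """Fraction of the first k results that are relevant."""
    return sum(relevance[:k]) / k

def reciprocal_rank(relevance):
    """Inverse of the (1-based) rank of the first relevant result, or 0 if none."""
    for i, rel in enumerate(relevance):
        if rel:
            return 1 / (i + 1)
    return 0.0

# 10 results in ranked order, 3 of which are relevant (1 = relevant).
relevance = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
print(recall_at_k(relevance, 5))     # 2 of the 3 relevant results are in the top 5
print(precision_at_k(relevance, 5))  # 2 of the first 5 results are relevant
print(reciprocal_rank(relevance))    # first relevant result is at rank 2
```

Note that each of these depends on the scores only through the order they induce, which is the point of the next sections.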
ListNet loss
ListNet is a loss from the information retrieval literature. It has a motivation based on permutation probability, but it ultimately arrives at a variation on softmax cross-entropy loss.
ListNet uses a cross-entropy loss between a label distribution (using a continuous notion of relevance over a set of labeled results) and a predicted distribution over candidate documents.
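As a sketch of the top-one ListNet formulation (using NumPy; the function name is my own, and I assume binary relevance labels normalized into a distribution, which is the setting the paper analyzes):

```python
import numpy as np

def listnet_loss(scores, labels):
    """Top-one ListNet loss: cross-entropy between the normalized label
    distribution and the softmax distribution over predicted scores."""
    labels = np.asarray(labels, dtype=float)
    scores = np.asarray(scores, dtype=float)
    q = labels / labels.sum()              # label distribution
    z = scores - scores.max()              # stabilize the exponentials
    p = np.exp(z) / np.exp(z).sum()        # predicted distribution (softmax)
    return -np.sum(q * np.log(p + 1e-12))  # cross-entropy H(q, p)

# 4 candidate documents, 2 of which are relevant.
scores = np.array([2.0, 0.5, -1.0, 0.1])
labels = np.array([1.0, 0.0, 0.0, 1.0])
print(listnet_loss(scores, labels))  # decreases as relevant documents score higher
```

With a single relevant document, this reduces exactly to ordinary softmax cross-entropy with the relevant document as the ground-truth class.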
Proposition: Softmax CE bounds ranking metrics
This paper addresses a challenge in information retrieval models that are learned using gradient-based optimization:
Ranking metrics rely on the order of results. The metrics are stepwise, with abrupt changes in value only when the order of results changes. This makes these metrics unsuitable for direct gradient-based optimization.^{1}
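A quick illustration of this stepwise behavior (a sketch in plain Python; the function name is my own): reciprocal rank is unchanged under any perturbation of the scores that preserves their order, then jumps discontinuously when the order flips.

```python
def reciprocal_rank(scores, relevant_idx):
    """RR of the single relevant item under a descending sort of scores."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return 1 / (order.index(relevant_idx) + 1)

# The relevant item is index 0; its score varies while the distractor stays at 1.0.
for s in [0.2, 0.6, 0.99, 1.01, 2.0]:
    print(s, reciprocal_rank([s, 1.0], relevant_idx=0))
# RR stays at 1/2 for every score below 1.0, then jumps to 1 above it:
# the metric is flat almost everywhere, so its gradient is zero.
```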
Because ranking metrics are not differentiable, models are trained using separate loss functions that are not directly related to ranking metrics. These are typically seen as surrogate losses: loss functions that enable optimization, but which are only loosely related to the target task.
The paper’s main claim is that ListNet loss, when used with binary notions of relevance, bounds two retrieval metrics: MRR and NDCG. I’ll focus on MRR here, since it’s more straightforward.
Proof sketch: Softmax cross-entropy bounds MRR
Part 1: Softmax bounds reciprocal rank
Given pre-softmax logits $z_1, \ldots, z_n$, the predicted probability for class $i$ is computed as follows:

$$p_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$

For convenience, let’s define $Z = \sum_{j=1}^{n} e^{z_j}$ to be the partition (the summation in the denominator).

Let $r_i$ be the rank of class $i$ if we sort the classes by predicted probability in descending order (or, equivalently, if we sort $z_j$ values or $e^{z_j}$ values in descending order).

The key observation is that the partition has $r_i$ terms that are $\geq e^{z_i}$: every class $j$ ranked at or above class $i$ satisfies $e^{z_j} \geq e^{z_i}$. Let’s introduce arbitrary nonnegative constants $c_j \geq 0$ to express this:

$$Z = \sum_{j:\, r_j \leq r_i} \left( e^{z_i} + c_j \right) + \sum_{j:\, r_j > r_i} e^{z_j} \geq r_i \, e^{z_i}$$

Dividing through:

$$p_i = \frac{e^{z_i}}{Z} \leq \frac{e^{z_i}}{r_i \, e^{z_i}} = \frac{1}{r_i}$$

Since $\frac{1}{r_i}$ is the definition of reciprocal rank, we have demonstrated that the softmax probability of the ground-truth class is a lower bound on reciprocal rank.
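We can check this bound numerically (a sketch using NumPy; the logits are arbitrary random draws): for every class, the softmax probability never exceeds the reciprocal of that class’s rank.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=10)                   # arbitrary logits
p = np.exp(z) / np.exp(z).sum()           # softmax probabilities

# rank[i] = 1-based position of class i when sorted by descending logit
rank = np.empty(len(z), dtype=int)
rank[np.argsort(-z)] = np.arange(1, len(z) + 1)

assert np.all(p <= 1.0 / rank)            # p_i <= 1/r_i for every class
print(np.max(p * rank))                   # always <= 1.0
```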
Part 2: Softmax cross-entropy bounds log MRR
I’m skipping over the multiple-label case, although it is a fairly straightforward extension of the above.
The formal claim made in the paper is that softmax cross-entropy bounds log MRR. There are a few related ways we can get there. The paper uses Jensen’s inequality; I’ll argue this from the inequality of geometric and arithmetic means, which is straightforward and avoids a tangent on Jensen’s inequality. For nonnegative values $x_1, \ldots, x_m$:

$$\left( \prod_{k=1}^{m} x_k \right)^{1/m} \leq \frac{1}{m} \sum_{k=1}^{m} x_k$$

Taking logarithms of both sides (which preserves the inequality, since log is monotonic):

$$\frac{1}{m} \sum_{k=1}^{m} \log x_k \leq \log \left( \frac{1}{m} \sum_{k=1}^{m} x_k \right)$$

We can apply this inequality to softmax cross-entropy loss over a dataset of $m$ queries, substituting the ground-truth class probability $p_{y_k}$ of query $k$ for $x_k$ above, then applying the bound $p_{y_k} \leq \frac{1}{r_{y_k}}$ from Part 1:

$$-\mathcal{L}_{\text{CE}} = \frac{1}{m} \sum_{k=1}^{m} \log p_{y_k} \leq \log \left( \frac{1}{m} \sum_{k=1}^{m} p_{y_k} \right) \leq \log \left( \frac{1}{m} \sum_{k=1}^{m} \frac{1}{r_{y_k}} \right) = \log \mathrm{MRR}$$

Negating both sides gives $\mathcal{L}_{\text{CE}} \geq -\log \mathrm{MRR}$. Therefore softmax cross-entropy loss bounds log MRR.
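The bound is easy to verify numerically as well (a sketch with NumPy; a random batch of queries, each with one ground-truth class among $n$ candidates): the mean cross-entropy loss always upper-bounds $-\log \mathrm{MRR}$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 32, 100                            # 32 queries, 100 candidates each
z = rng.normal(size=(m, n))               # logits per query
y = rng.integers(0, n, size=m)            # ground-truth class per query

p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)       # row-wise softmax
rank = (z > z[np.arange(m), y][:, None]).sum(axis=1) + 1   # rank of ground truth

ce_loss = -np.log(p[np.arange(m), y]).mean()  # softmax cross-entropy
mrr = (1.0 / rank).mean()                     # mean reciprocal rank
assert ce_loss >= -np.log(mrr)                # the bound from the paper
print(ce_loss, -np.log(mrr))
```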
Implications
This observation connects a loss function used in several retrieval domains to ranking metrics.
The relationship is a bound: this does not argue that minimizing this loss is equivalent to maximizing these metrics. It’s also important to keep in mind that the ranking metrics here are over the training dataset, so this does not characterize generalization (i.e. MRR over the test set).
While the paper focuses on the ListNet formulation specifically, the argument applies to many uses of softmax cross-entropy losses for retrieval objectives. This is relevant to two threads of computer vision research:
First, there has been recent interest in learning-to-rank methods in computer vision, including minimizing differentiable formulations of ranking metrics.^{1} This observation connects these methods to a conventional loss function.
Second, there are several recent methods using softmax cross-entropy losses for fine-grained retrieval at training time. For instance, the NormFace family of loss functions (including CosFace and ArcFace) uses softmax cross-entropy over all identities in the training dataset to supervise an embedding space for fine-grained recognition. Connecting these methods to retrieval metrics is directly relevant, because softmax cross-entropy here is used in a retrieval context.
Similarly, InfoNCE loss is a use of softmax cross-entropy loss motivated by noise-contrastive estimation. It is used in contrastive predictive coding and momentum contrast for unsupervised representation learning.^{2}
Computer vision may be on the verge of unsupervised representation learning overtaking supervised pretraining, and many of these methods use softmax cross-entropy objectives. These methods are more similar to retrieval use cases than to classification (e.g. hundreds of thousands of instance-level classes), making ranking metrics a meaningful measure of quality.
Footnotes

While it is accurate to say that ranking metrics are not directly differentiable, there is an extensive literature on using tricks to make these metrics (or approximations thereof) differentiable.
See, for example, SoftRank, which proposes a differentiable formulation of rank-based metrics like NDCG. This method involves casting document scores as the centers of Gaussian distributions, then computing NDCG in expectation over the resulting score distributions. This makes ranking metrics smooth, differentiable functions.
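A minimal sketch of the smoothing idea (my own simplification of SoftRank’s first step, not the paper’s full method; the function name and the noise scale sigma are illustrative): treat each score as the mean of a Gaussian, compute pairwise “j outranks i” probabilities with the normal CDF, and sum them into a smooth expected rank.

```python
import math
import numpy as np

def expected_ranks(scores, sigma=0.1):
    """Smoothed (expected) 1-based ranks under Gaussian score noise."""
    s = np.asarray(scores, dtype=float)
    # P(item j outranks item i) when both scores carry N(0, sigma^2) noise:
    # a normal CDF of the score difference, computed via the error function.
    diff = (s[None, :] - s[:, None]) / (sigma * math.sqrt(2.0))
    beats = 0.5 * (1.0 + np.vectorize(math.erf)(diff))
    np.fill_diagonal(beats, 0.0)      # an item never outranks itself
    return 1.0 + beats.sum(axis=1)    # smooth, differentiable expected rank

print(expected_ranks([3.0, 1.0, 2.0], sigma=0.01))  # near the hard ranks 1, 3, 2
print(expected_ranks([3.0, 1.0, 2.0], sigma=10.0))  # heavily smoothed ranks
```

As sigma shrinks, the expected ranks approach the hard ranks; larger sigma trades fidelity for smoother gradients.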
Another technique involves estimating a gradient for non-differentiable functions. Yang Song and collaborators’ Direct Loss Minimization paper estimates a gradient to directly minimize the mean average precision (MAP) metric in a computer vision context, for example.
There has recently been a revival of learning-to-rank methods in computer vision, including directly minimizing differentiable formulations of ranking metrics. Here are a few recent papers applying these methods in computer vision:
- Hashing as Tie-Aware Learning to Rank, which uses the natural binning of hash representations to minimize a differentiable MAP formulation.
- Deep Metric Learning to Rank, a differentiable MAP formulation based on histogram binning of cosine distances.
- Learning with Average Precision, another differentiable MAP formulation, based on histogram binning.
- SoDeep: a Sorting Deep net to learn ranking loss surrogates, which proposes an independently-trained RNN to predict the rank of each element in a list of scores. This differentiable sorting approximation allows differentiably learning to mimic a ground-truth ordering of results (using a Spearman-correlation-based loss).

The PIRL method of unsupervised representation learning is also related, since its noise-contrastive estimation objective is related to softmax cross-entropy loss.
Under certain conditions, the two loss functions can be demonstrated to be asymptotically equivalent. The details are beyond the scope of this post, but it’s not clear to me whether PIRL’s NCE loss is equivalent to a softmax CE loss over all images, given the additional constraints that PIRL imposes (e.g. logits based on scaled cosine similarity). ↩