If At First You Don’t Succeed
Test-time Re-ranking for Zero-shot, Cross-domain Retrieval

University of York

Abstract


We introduce a novel method for zero-shot, cross-domain image retrieval. Our key contribution is a test-time Iterative Cluster-free Re-ranking (ICFRR) process that leverages gallery-gallery feature information to establish semantic links between query and gallery images. This enables the retrieval of relevant images even when they do not exhibit similar visual features but share underlying semantic concepts. The process can be combined with any pre-existing cross-domain feature extraction backbone to improve retrieval performance. When combined with a carefully chosen Vision Transformer backbone and a combination of zero-shot retrieval losses, our approach yields state-of-the-art results on the Sketchy, TU-Berlin and QuickDraw sketch-based retrieval benchmarks. We show that our re-ranking also improves performance with other backbones and outperforms other re-ranking methods applied with our backbone. Importantly, unlike many previous methods, none of the components in our approach are engineered specifically towards the sketch-based image retrieval task: it can be applied generally to any cross-domain, zero-shot retrieval task. We therefore also present new results on zero-shot cartoon-to-photo and art-to-product retrieval using the Office-Home dataset.

[Figure: retrieval examples 2 and 3, each shown before (no ICFRR) and after (with ICFRR) re-ranking]

ICFRR Explainer



In the example shown below, the initial query-gallery ranking (a) places an incorrect match high in the list (marked in red). For every gallery image, we now consider its gallery-gallery ranks (b). In (c) we highlight the positions of two highly ranked examples in the ranked lists of the other gallery images. Item 3 is an incorrect match. This is indicated by the fact that it appears high in the ranked lists of gallery images that are themselves ranked low against the query (items 150 and 151). Item 3 therefore incurs a high penalty in its updated distance to the query. Item 4, however, is a correct match: it appears high in the ranked lists of gallery images that are themselves ranked high against the query (items 1, 2 and 5), and so incurs only a small penalty. Once the distances have been updated, we re-rank (d) and the incorrect match moves further down the query-gallery ranked list.
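The penalty idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ICFRR update itself: the harmonic rank weight, the mixing factor alpha, the helper names (neighbour_ranks, rerank_once, ranking) and the toy distances are all hypothetical choices made for this example.

```python
def neighbour_ranks(gallery_dist, i):
    """Rank every other gallery item by its distance to item i (0 = closest)."""
    others = [j for j in range(len(gallery_dist)) if j != i]
    others.sort(key=lambda j: gallery_dist[i][j])
    return {j: rank for rank, j in enumerate(others)}


def rerank_once(query_dist, gallery_dist, alpha=0.5):
    """One illustrative re-ranking pass; returns updated query-gallery distances.

    Assumption: the penalty for item j is a weighted average of the query
    distances of the other gallery items, weighted by how high j appears
    in each of their gallery-gallery ranked lists.
    """
    n = len(query_dist)
    ranks = [neighbour_ranks(gallery_dist, i) for i in range(n)]
    updated = []
    for j in range(n):
        weighted_sum = 0.0
        weight_total = 0.0
        for i in range(n):
            if i == j:
                continue
            weight = 1.0 / (1 + ranks[i][j])  # high in i's list -> large weight
            weighted_sum += weight * query_dist[i]
            weight_total += weight
        penalty = weighted_sum / weight_total
        updated.append((1 - alpha) * query_dist[j] + alpha * penalty)
    return updated


def ranking(distances):
    """Gallery indices sorted from closest to farthest."""
    return sorted(range(len(distances)), key=lambda j: distances[j])


# Toy gallery: items 0-2 share the query's class, items 3-4 do not.
# Gallery-gallery distances come from made-up 1D positions.
positions = [0.0, 0.1, 0.25, 1.0, 1.1]
gallery_dist = [[abs(a - b) for b in positions] for a in positions]

# Made-up query-gallery distances: the wrong item 3 starts above the correct item 2.
query_dist = [0.1, 0.2, 0.4, 0.3, 0.8]

before = ranking(query_dist)
after = ranking(rerank_once(query_dist, gallery_dist))
```

In this toy setting, before is [0, 1, 3, 2, 4] and after is [0, 1, 2, 3, 4]: item 3 is pulled down because it sits at the top of the ranked list of item 4, which is itself far from the query. Calling rerank_once repeatedly on its own output would give an iterated variant of the same sketch.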


Full Methodology


Overview of the architecture and of train versus test operation. At training time, pairs of images are used to compute the cross-attention that supervises the cross-attention distillation loss. At test time, individual images are embedded and ranked, followed by our ICFRR re-ranking process.
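As a rough illustration of the test-time "embed and rank" step, the sketch below ranks gallery embeddings against a query embedding. It assumes the backbone has already produced the embeddings and that cosine distance is the comparison measure; both the distance choice and the names cosine_distance and rank_gallery are assumptions made for this example.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rank_gallery(query, gallery):
    """Return gallery indices ordered from closest to farthest, plus the distances."""
    distances = [cosine_distance(query, g) for g in gallery]
    order = sorted(range(len(gallery)), key=lambda i: distances[i])
    return order, distances


# Made-up 2D embeddings standing in for backbone outputs.
query_embedding = [1.0, 0.0]
gallery_embeddings = [[0.0, 1.0], [1.0, 0.1], [0.7, 0.7]]

initial_order, initial_distances = rank_gallery(query_embedding, gallery_embeddings)
```

Here initial_order is [1, 2, 0]. In the pipeline described above, a ranking of this kind would be the input to the re-ranking stage.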


Rationale


The initial ranking, based on similarities between the embeddings of the query and gallery images, will contain a mixture of correct and incorrect matches. However, the incorrect matches will not rank highly in the gallery-gallery lists of correct, highly ranked gallery images. Since gallery-gallery ranks depend on comparing visual features within the same domain, the problem is simplified and we are able to extract more semantically meaningful matches. This allows initially highly ranked but incorrect gallery images to be penalised and moved down the query-gallery ranked list. Iterating this process enables the discovery of gallery images that are more distantly connected to the query via multiple gallery-gallery matches.
In some cases, although the query and gallery images are of the same class, they may contain very dissimilar visual features. We provide a concrete example for a query sketch against gallery photos (though note that our approach is not limited to sketch-photo retrieval). For a query sketch of some wooden doors (A), the gallery image of glass doors (B) is initially ranked low, even though both belong to the same class. However, the gallery image of the church doors (C) contains visual features similar to those of the sketch (e.g. curved door shape, wooden material). The church doors bear some similarity to the glass doors (both are within door frames); however, the shape and materials are different, and so (B) is ranked relatively low in the gallery-gallery ranks for (C). The gallery image of the street house doors (D), on the other hand, is similar enough to (C) to be highly ranked (both contain a wooden door within a door frame) and, in turn, the glass doors (B) are similar enough to the house doors (D) to be highly ranked in the gallery-gallery ranks for (D) (rectangular door, within door frame). Our iterative re-ranking process is able to discover such connections and raise the rank position of difficult matches such as (B) against a very different query (A).
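The chain in the doors example can be pictured as a small graph search. The sketch below is not the re-ranking procedure; it only illustrates, under assumed neighbour lists, why several iterations may be needed before (B) becomes connected to the query (A). The neighbour lists and the function hops_from_query are hypothetical.

```python
from collections import deque


def hops_from_query(query_neighbours, gallery_neighbours):
    """Breadth-first search: number of hops from the query to each reachable gallery image."""
    hops = {name: 1 for name in query_neighbours}
    frontier = deque(query_neighbours)
    while frontier:
        current = frontier.popleft()
        for neighbour in gallery_neighbours.get(current, []):
            if neighbour not in hops:
                hops[neighbour] = hops[current] + 1
                frontier.append(neighbour)
    return hops


# Assumed "highly ranked" links taken from the doors example:
# the query sketch A matches the church doors C; C matches the house doors D;
# D matches the glass doors B.
query_neighbours = ["C"]
gallery_neighbours = {"C": ["D"], "D": ["C", "B"], "B": ["D"]}

hops = hops_from_query(query_neighbours, gallery_neighbours)
```

With these assumed links, hops is {"C": 1, "D": 2, "B": 3}: the glass doors are only reached through two intermediate gallery-gallery matches, which is the kind of distant connection the iterative process is meant to surface.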
This approach emulates the cognitive mechanism employed by the human brain, leveraging thematic semantic systems to establish meaningful connections between concepts.


Qualitative Results


Below we showcase some results of our ICFRR method across different datasets. Re-ranking step 0 shows the ranking obtained directly from the similarities output by the model; each subsequent step shows the ranking after a further iteration of ICFRR.


[Figure panels: TU-Berlin, Sketchy, Office-Home]

Quantitative Results