Text-Based Heuristic Annotation Guidelines

For a given query caption, this heuristic labels another caption as partially relevant in either of two cases: the two captions share the same set of nouns and differ only in their verbs (e.g., 'person eating cake'; 'person sitting next to a cake'), or they differ in their nouns but share the same verb(s) (e.g., 'person eating cake'; 'person eating cereals'). A caption is labeled positive with respect to the query if both share the same nouns and verbs, and negative if they share neither nouns nor verbs.

In most datasets, a caption consists of a subject, one or more objects (nouns), and the associated verbs, all of which can be extracted with an off-the-shelf part-of-speech (POS) tagger. Given the extracted nouns and verbs, we compute the percentage overlap between the nouns and verbs of the query caption and those of every other caption in the batch. Trivially, a 100% overlap marks a positive sample and a 0% overlap marks a negative sample. To select partial samples, we choose thresholds αn and αv on the percentage overlap of nouns and verbs, respectively. A sample is labeled partial if it is not a positive sample and meets either the noun threshold or the verb threshold. Intuitively, a shared verb is a strong signal of similarity between two actions and suggests that the corresponding visual features may also be similar.
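
The labeling rule above can be sketched in a few lines of Python. This is only an illustrative sketch, not the exact implementation: it assumes the nouns and verbs have already been extracted by a POS tagger and are passed in as sets, it assumes "percentage overlap" means intersection over union (the text does not fix a definition), and the names overlap, label_caption, alpha_n, and alpha_v, as well as the default threshold value of 0.5, are hypothetical. Captions with some overlap that falls below both thresholds are not covered by the description, so the sketch leaves them unlabeled.

```python
from typing import Optional, Set


def overlap(a: Set[str], b: Set[str]) -> float:
    """Overlap in [0, 1], computed as intersection over union (an assumption).

    Defined as 0.0 when both sets are empty (also an assumption).
    """
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def label_caption(
    query_nouns: Set[str],
    query_verbs: Set[str],
    other_nouns: Set[str],
    other_verbs: Set[str],
    alpha_n: float = 0.5,  # hypothetical noun threshold; no value is given in the text
    alpha_v: float = 0.5,  # hypothetical verb threshold; no value is given in the text
) -> Optional[str]:
    """Label another caption relative to the query caption."""
    noun_ov = overlap(query_nouns, other_nouns)
    verb_ov = overlap(query_verbs, other_verbs)

    if noun_ov == 1.0 and verb_ov == 1.0:
        return "positive"  # same nouns and same verbs
    if noun_ov == 0.0 and verb_ov == 0.0:
        return "negative"  # no shared nouns and no shared verbs
    if noun_ov >= alpha_n or verb_ov >= alpha_v:
        return "partial"   # not positive, but one threshold is met
    return None            # case not specified in the text; left unlabeled


# Query: 'person eating cake' (lemmatized nouns and verbs, assumed pre-extracted)
q_nouns, q_verbs = {"person", "cake"}, {"eat"}

same_nouns = label_caption(q_nouns, q_verbs, {"person", "cake"}, {"sit"})
same_verb = label_caption(q_nouns, q_verbs, {"person", "cereal"}, {"eat"})
identical = label_caption(q_nouns, q_verbs, {"person", "cake"}, {"eat"})
unrelated = label_caption(q_nouns, q_verbs, {"dog"}, {"run"})
print(same_nouns, same_verb, identical, unrelated)
```

With these inputs, the two example pairs from the description ('person sitting next to a cake' and 'person eating cereals') both come out as partial. Lowering alpha_n or alpha_v admits more captions as partial; raising them makes the partial set stricter.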

We note that this is only one of many possible heuristic functions.

Part-of-speech taggers used:

  1. Indian Languages: Link
  2. English: Link