Approximate matching

Measures ordinarily score a pair of gold and system annotations as 1 only when the two match exactly on all elements of the key.

For some kinds of measure, partial credit can be awarded for:

  • mention pairs with overlapping, but not identical, spans
  • mention pairs with related, but not identical, entity types
  • mention pairs with related, but not identical, KB entries (disambiguands)

Overlapping spans

To award partial credit to overlapping gold and system mentions, we use the scheme developed by Ryan Gabbard of BBN for LoReHLT:

We award systems for partial matches according to the degree of character overlap between system and key names. The partial match scoring algorithm has two parameters: the recall overlap strategy and the precision overlap strategy.

  • The per-name recall score of a name in the answer key is the fraction of its characters which overlap with the system name set according to the recall overlap strategy parameter. For the “MAX” strategy, this will be the characters overlapping with the single system name with maximum overlap. For the “SUM” strategy, this will be the number of its characters which overlap with any system mention.
  • The recall score for a system is the mean of the per-name recall scores for all names in the answer key.
  • The per-name precision score of a name in the system output is the fraction of its characters overlapped by the reference set, where “overlapping” is determined by the precision overlap strategy in the same manner as above for recall.
  • The precision score for a system is the mean of the per-name precision scores for all names in the system output.

This applies to measures with the following aggregators:

  • overlap-maxmax for recall and precision overlap strategies both MAX
  • overlap-maxsum for recall overlap strategy MAX and precision overlap strategy SUM
  • overlap-summax for recall overlap strategy SUM and precision overlap strategy MAX
  • overlap-sumsum for recall and precision overlap strategies both SUM

In the following example, the gold standard includes a mention from character 1 to 10 and another from 12 to 12. The system includes a mention from 1 to 5 and another from 6 to 12.

$ bash -c "\
neleval evaluate \
-m overlap-maxmax::span \
-m overlap-maxsum::span \
-m overlap-summax::span \
-m overlap-sumsum::span \
-m sets::span \
-g <(echo -e 'd\t1\t10\nd\t12\t12') \
   <(echo -e 'd\t1\t5\nd\t6\t12')"
ptp	fp	rtp	fn	precis	recall	fscore	measure
1.714	0.286	1.500	0.500	0.857	0.750	0.800	overlap-maxmax::span
1.857	0.143	1.500	0.500	0.929	0.750	0.830	overlap-maxsum::span
1.714	0.286	2.000	0.000	0.857	1.000	0.923	overlap-summax::span
1.857	0.143	2.000	0.000	0.929	1.000	0.963	overlap-sumsum::span
0	2	0	2	0.000	0.000	0.000	sets::span

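These figures are derived as follows, treating each span as inclusive of both endpoints:

  • Recall: gold mention 1 to 10 has 10 characters. Under MAX, its best single system mention covers 5 of them, scoring 5/10 = 0.5; under SUM, the two system mentions together cover all 10, scoring 1. Gold mention 12 to 12 is fully covered under either strategy, scoring 1. Hence rtp is 1.5 for MAX and 2.0 for SUM, and recall is rtp divided by the 2 gold mentions.
  • Precision: system mention 1 to 5 lies wholly within gold mention 1 to 10, scoring 1. System mention 6 to 12 has 7 characters: under MAX, its best single gold mention covers 5 of them, scoring 5/7 ≈ 0.714; under SUM, the two gold mentions together cover 6, scoring 6/7 ≈ 0.857. Hence ptp is 1.714 for MAX and 1.857 for SUM, and precision is ptp divided by the 2 system mentions.
  • sets::span scores 0 throughout because no system span is identical to a gold span.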

Caveats:

  • All mentions within the gold annotation must be non-overlapping.
  • All mentions within the system annotation must be non-overlapping.
  • There is (currently) no equivalent implementation for clustering metrics.
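The overlap scheme described in this section can be sketched in a few lines of Python. This is an illustrative reimplementation, not neleval's own code: the function names (overlap, per_name_scores, overlap_prf) are hypothetical, spans are assumed to be inclusive (start, end) character offsets within a single document, and mentions on each side are assumed to be non-overlapping, as the caveats require.

```python
def overlap(a, b):
    """Number of characters shared by two inclusive (start, end) spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def per_name_scores(names, others, strategy):
    """Score each span in `names` by its character overlap with `others`.

    MAX counts the overlap with the single best-overlapping span in `others`;
    SUM counts the overlap with all spans in `others`.
    """
    combine = {"MAX": max, "SUM": sum}[strategy]
    scores = []
    for name in names:
        length = name[1] - name[0] + 1
        overlaps = [overlap(name, other) for other in others]
        scores.append(combine(overlaps) / length if overlaps else 0.0)
    return scores


def overlap_prf(gold, system, recall_strategy, precision_strategy):
    """Return (ptp, rtp, precision, recall, fscore) for one document."""
    rtp = sum(per_name_scores(gold, system, recall_strategy))
    ptp = sum(per_name_scores(system, gold, precision_strategy))
    recall = rtp / len(gold) if gold else 0.0
    precision = ptp / len(system) if system else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return ptp, rtp, precision, recall, fscore


if __name__ == "__main__":
    gold = [(1, 10), (12, 12)]
    system = [(1, 5), (6, 12)]
    # Aggregator names put the recall strategy first, then the precision one.
    for r in ("MAX", "SUM"):
        for p in ("MAX", "SUM"):
            row = overlap_prf(gold, system, r, p)
            name = "overlap-%s%s" % (r.lower(), p.lower())
            print("\t".join("%.3f" % value for value in row), name)
```

Under these assumptions, the sketch reproduces the ptp, rtp, precis, recall and fscore columns of the four overlap rows in the example above.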

Approximate type matching

Rather than requiring entity types to match exactly, pairs of types can be scored using arbitrary weights. These can be specified to neleval evaluate with --type-weights. This option accepts a tab-delimited file with three columns:

  • gold type
  • system type
  • weight

For types not in this weight file, an exact match between gold type and system type scores 1, and any other pair scores 0. If multiple entries exist for the same gold/system pair, the maximum weight is used.
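The lookup rules above can be sketched as follows. This is a minimal illustration rather than neleval's implementation: load_type_weights and type_score are hypothetical names, and the fallback is assumed to apply per gold/system pair, so that any pair absent from the file is scored by exact match.

```python
def load_type_weights(lines):
    """Parse tab-delimited (gold type, system type, weight) rows.

    If a gold/system pair appears more than once, keep the maximum weight.
    """
    weights = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        gold_type, sys_type, weight = line.split("\t")
        key = (gold_type, sys_type)
        weights[key] = max(float(weight), weights.get(key, float("-inf")))
    return weights


def type_score(weights, gold_type, sys_type):
    """Weight for a pair listed in the file, else 1 for an exact match, else 0."""
    if (gold_type, sys_type) in weights:
        return weights[(gold_type, sys_type)]
    return 1.0 if gold_type == sys_type else 0.0


if __name__ == "__main__":
    weights = load_type_weights(["type1\ttype2\t0.123\n"])
    for gold_type, sys_type in [("type1", "type2"),
                                ("type1", "type1"),
                                ("type2", "type1")]:
        print(gold_type, sys_type, type_score(weights, gold_type, sys_type))
```

Note that the weights are directional: a weight given for gold type1 and system type2 says nothing about gold type2 and system type1, which here falls back to 0.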

The following example scores 0.123 where the gold type is type1 and the system type is type2.

$ bash -c " \
neleval evaluate --by-doc \
-m strong_typed_mention_match \
--type-weights <(echo -e 'type1\ttype2\t0.123') \
--gold <( \
echo -e 'doc1\t10\t20\tkbid\t1.0\ttype1'; \
echo -e 'doc2\t10\t20\tkbid\t1.0\ttype1'; \
echo -e 'doc3\t10\t20\tkbid\t1.0\ttype2'; \
echo -e 'doc4\t10\t20\tkbid\t1.0\ttype1'; \
echo -e 'doc4\t30\t40\tkbid\t1.0\ttype1'; \
) <( \
echo -e 'doc1\t10\t20\tkbid\t1.0\ttype2'; \
echo -e 'doc2\t10\t20\tkbid\t1.0\ttype1'; \
echo -e 'doc3\t10\t20\tkbid\t1.0\ttype1'; \
echo -e 'doc4\t10\t20\tkbid\t1.0\ttype2'; \
echo -e 'doc4\t30\t40\tkbid\t1.0\ttype2'; \
) \
"
ptp	fp	rtp	fn	precis	recall	fscore	measure
0.123	0.877	0.123	0.877	0.123	0.123	0.123	strong_typed_mention_match;docid="doc1"
1.000	0.000	1.000	0.000	1.000	1.000	1.000	strong_typed_mention_match;docid="doc2"
0.000	1.000	0.000	1.000	0.000	0.000	0.000	strong_typed_mention_match;docid="doc3"
0.246	1.754	0.246	1.754	0.123	0.123	0.123	strong_typed_mention_match;docid="doc4"
0.342	0.908	0.342	0.908	0.311	0.311	0.311	strong_typed_mention_match;docid=<macro>
1.369	3.631	1.369	3.631	0.274	0.274	0.274	strong_typed_mention_match;docid=<micro>

This currently only applies to measures with the sets aggregator.

Type match weighting with a hierarchy

neleval weights-for-hierarchy converts a hierarchy of types into the above --type-weights format. It uses a scheme with a decay parameter \(0 < d < 1\), such that a system mention is awarded:

  • 0 if its type is neither identical to nor an ancestor of the gold type
  • \(d ^ {\mathrm{depth}(\mathrm{goldtype})-\mathrm{depth}(\mathrm{systype})}\) if its type is identical to or an ancestor of the gold type, which gives 1 for an identical type

Thus:

  • \(d\) if its type is a parent of the gold type
  • \(d ^ 2\) if its type is a grandparent of the gold type

etc.
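The decay scheme can be illustrated with a short Python sketch. It is not the weights-for-hierarchy implementation, and it does not read that command's input format, which is not described here. Instead it assumes a hypothetical parent mapping from each type to its parent (None for a root), assumes the hierarchy is a tree with no cycles, and uses made-up type names.

```python
def hierarchy_weights(parent, decay):
    """Return (gold type, system type, weight) rows for a type hierarchy.

    Each gold type is paired with itself and with every ancestor; the weight
    is decay raised to the number of levels between the two types.
    Pairs not listed are meant to score 0.
    """
    if not 0 < decay < 1:
        raise ValueError("decay must satisfy 0 < d < 1")
    rows = []
    for gold_type in parent:
        sys_type, levels = gold_type, 0
        while sys_type is not None:
            rows.append((gold_type, sys_type, decay ** levels))
            sys_type = parent.get(sys_type)  # None once past the root
            levels += 1
    return rows


if __name__ == "__main__":
    # Hypothetical three-level hierarchy: entity > person > politician.
    parent = {"entity": None, "person": "entity", "politician": "person"}
    for gold_type, sys_type, weight in hierarchy_weights(parent, 0.5):
        # Same three tab-delimited columns as the --type-weights file.
        print("%s\t%s\t%s" % (gold_type, sys_type, weight))
```

A system type that is a descendant of the gold type (for example, system politician against gold person) gets no row and so scores 0, matching the first rule above.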