Approximate matching¶
Measures ordinarily award a score of 1 when a gold annotation and a system annotation match exactly on all elements of the key.
For some kinds of measure, it is possible to award partial credit for:
- mention pairs with overlapping, but not identical, spans
- mention pairs with related, but not identical, entity types
- mention pairs with related, but not identical, KB entries (disambiguands)
Overlapping spans¶
To award partial credit to overlapping gold and system mentions, we use the scheme developed by Ryan Gabbard of BBN for LoReHLT:
We award systems for partial matches according to the degree of character overlap between system and key names. The partial match scoring algorithm has two parameters: the recall overlap strategy and the precision overlap strategy.
- The per-name recall score of a name in the answer key is the fraction of its characters which overlap with the system name set according to the recall overlap strategy parameter. For the “MAX” strategy, this will be the characters overlapping with the single system name with maximum overlap. For the “SUM” strategy, this will be the number of its characters which overlap with any system mention.
- The recall score for a system is the mean of the per-name recall scores for all names in the answer key.
- The per-name precision score of a name in the system output is the fraction of its characters overlapped by the reference set, where “overlapping” is determined by the precision overlap strategy in the same manner as above for recall.
- The precision score for a system is the mean of the per-name precision scores for all names in the system output.
This applies to measures with aggregator:
- overlap-maxmax: for recall and precision overlap strategies both MAX
- overlap-maxsum: for recall overlap strategy MAX and precision overlap strategy SUM
- overlap-summax: for recall overlap strategy SUM and precision overlap strategy MAX
- overlap-sumsum: for recall and precision overlap strategies both SUM
In the following example, the gold standard includes a mention from character 1 to 10 and another from 12 to 12. The system includes a mention from 1 to 5 and another from 6 to 12. Offsets are inclusive, so the second gold mention is a single character long.
$ bash -c "\
neleval evaluate \
-m overlap-maxmax::span \
-m overlap-maxsum::span \
-m overlap-summax::span \
-m overlap-sumsum::span \
-m sets::span \
-g <(echo -e 'd\t1\t10\nd\t12\t12') \
<(echo -e 'd\t1\t5\nd\t6\t12')"
ptp    fp     rtp    fn     precis  recall  fscore  measure
1.714  0.286  1.500  0.500  0.857   0.750   0.800   overlap-maxmax::span
1.857  0.143  1.500  0.500  0.929   0.750   0.830   overlap-maxsum::span
1.714  0.286  2.000  0.000  0.857   1.000   0.923   overlap-summax::span
1.857  0.143  2.000  0.000  0.929   1.000   0.963   overlap-sumsum::span
0      2      0      2      0.000   0.000   0.000   sets::span
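The overlap figures in this output can be checked by hand. The following is a minimal Python sketch of the scheme as described above, not neleval's own implementation; the function names `overlap` and `partial_tp` are hypothetical, spans are assumed to be inclusive `(start, end)` pairs, and precision is assumed to be computed per system mention, which is what the `ptp` column suggests.

```python
def overlap(a, b):
    """Number of characters shared by two inclusive (start, end) spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def partial_tp(names, others, strategy):
    """Sum, over `names`, of the fraction of each name covered by `others`.

    "max" counts only the single best-overlapping span in `others`;
    "sum" counts characters overlapping any span in `others`, which is
    only valid when the spans in `others` do not overlap each other.
    """
    total = 0.0
    for name in names:
        length = name[1] - name[0] + 1
        shared = [overlap(name, other) for other in others]
        if strategy == "max":
            covered = max(shared, default=0)
        elif strategy == "sum":
            covered = sum(shared)
        else:
            raise ValueError("strategy must be 'max' or 'sum'")
        total += covered / length
    return total


gold = [(1, 10), (12, 12)]
system = [(1, 5), (6, 12)]

for recall_strategy in ("max", "sum"):
    for precision_strategy in ("max", "sum"):
        rtp = partial_tp(gold, system, recall_strategy)      # recall side
        ptp = partial_tp(system, gold, precision_strategy)   # precision side
        precision = ptp / len(system)
        recall = rtp / len(gold)
        denom = precision + recall
        fscore = 2 * precision * recall / denom if denom else 0.0
        print(
            f"overlap-{recall_strategy}{precision_strategy}: "
            f"ptp={ptp:.3f} rtp={rtp:.3f} "
            f"precis={precision:.3f} recall={recall:.3f} fscore={fscore:.3f}"
        )
```

For instance, under MAX the first gold mention (10 characters) shares at most 5 characters with any single system mention, giving 0.5, while under SUM all 10 of its characters are covered, giving 1.0.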
TODO: flesh out calculation
Caveats:
- All mentions within the gold annotation must be non-overlapping.
- All mentions within the system annotation must be non-overlapping.
- There is (currently) no equivalent implementation for clustering metrics.
Approximate type matching¶
Rather than requiring entity types to match exactly, types can be matched using arbitrary weights. These can be specified to neleval evaluate with --type-weights. This option accepts a tab-delimited file with three columns:
- gold type
- system type
- weight
For type pairs not in this weights file, an exact match between gold type and system type scores 1, and any other pair scores 0. If multiple gold/system entries exist, the maximum weight is used.
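The lookup rule can be sketched in Python as follows. This is an illustration under stated assumptions rather than neleval's code: `load_type_weights` and `type_weight` are hypothetical names, and "multiple gold/system entries" is read here as several rows for the same (gold type, system type) pair.

```python
def load_type_weights(lines):
    """Parse tab-delimited 'gold<TAB>system<TAB>weight' rows into a dict."""
    weights = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        gold_type, sys_type, weight = line.split("\t")
        key = (gold_type, sys_type)
        value = float(weight)
        # Assumption: repeated rows for a pair keep the maximum weight.
        if key not in weights or value > weights[key]:
            weights[key] = value
    return weights


def type_weight(weights, gold_type, sys_type):
    """Weight for a type pair, falling back to exact-match scoring."""
    if (gold_type, sys_type) in weights:
        return weights[(gold_type, sys_type)]
    return 1.0 if gold_type == sys_type else 0.0


weights = load_type_weights(["type1\ttype2\t0.123"])
print(type_weight(weights, "type1", "type2"))  # listed pair
print(type_weight(weights, "type1", "type1"))  # unlisted, exact match
print(type_weight(weights, "type2", "type1"))  # unlisted, mismatch
```

Note that the lookup is directional: a row for gold `type1` and system `type2` says nothing about gold `type2` and system `type1`.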
The following example scores 0.123 where the gold type is type1 and the system type is type2.
$ bash -c " \
neleval evaluate --by-doc \
-m strong_typed_mention_match \
--type-weights <(echo -e 'type1\ttype2\t0.123') \
--gold <( \
echo -e 'doc1\t10\t20\tkbid\t1.0\ttype1'; \
echo -e 'doc2\t10\t20\tkbid\t1.0\ttype1'; \
echo -e 'doc3\t10\t20\tkbid\t1.0\ttype2'; \
echo -e 'doc4\t10\t20\tkbid\t1.0\ttype1'; \
echo -e 'doc4\t30\t40\tkbid\t1.0\ttype1'; \
) <( \
echo -e 'doc1\t10\t20\tkbid\t1.0\ttype2'; \
echo -e 'doc2\t10\t20\tkbid\t1.0\ttype1'; \
echo -e 'doc3\t10\t20\tkbid\t1.0\ttype1'; \
echo -e 'doc4\t10\t20\tkbid\t1.0\ttype2'; \
echo -e 'doc4\t30\t40\tkbid\t1.0\ttype2'; \
) \
"
ptp    fp     rtp    fn     precis  recall  fscore  measure
0.123  0.877  0.123  0.877  0.123   0.123   0.123   strong_typed_mention_match;docid="doc1"
1.000  0.000  1.000  0.000  1.000   1.000   1.000   strong_typed_mention_match;docid="doc2"
0.000  1.000  0.000  1.000  0.000   0.000   0.000   strong_typed_mention_match;docid="doc3"
0.246  1.754  0.246  1.754  0.123   0.123   0.123   strong_typed_mention_match;docid="doc4"
0.342  0.908  0.342  0.908  0.311   0.311   0.311   strong_typed_mention_match;docid=<macro>
1.369  3.631  1.369  3.631  0.274   0.274   0.274   strong_typed_mention_match;docid=<micro>
This currently applies only to measures with the sets aggregator.
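The `<macro>` and `<micro>` rows in the output above can be checked from the per-document rows. The sketch below assumes, based on those figures, that the macro row averages each column over documents while the micro row sums the counts before dividing; the variable names are illustrative only.

```python
# Per-document (ptp, fp) values copied from the output above.
per_doc = {
    "doc1": (0.123, 0.877),
    "doc2": (1.000, 0.000),
    "doc3": (0.000, 1.000),
    "doc4": (0.246, 1.754),
}

# Macro: compute precision within each document, then average.
doc_precisions = [ptp / (ptp + fp) for ptp, fp in per_doc.values()]
macro_precision = sum(doc_precisions) / len(doc_precisions)

# Micro: pool the counts across documents, then compute precision once.
total_ptp = sum(ptp for ptp, _ in per_doc.values())
total_fp = sum(fp for _, fp in per_doc.values())
micro_precision = total_ptp / (total_ptp + total_fp)

print(f"macro precision: {macro_precision:.4f}")
print(f"micro precision: {micro_precision:.4f}")
```

The two differ because doc4 contributes two mentions: it counts once among four documents in the macro average but twice among five mentions in the micro average.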
Type match weighting with a hierarchy¶
neleval weights-for-hierarchy converts a hierarchy of types into the above --type-weights format. It uses a scheme with a decay parameter \(0 < d < 1\), such that a system mention is awarded:
- 0 if its type is neither identical to nor an ancestor of the gold type
- \(d ^ {\mathrm{depth}(\mathrm{goldtype})-\mathrm{depth}(\mathrm{systype})}\) if its type is identical to or an ancestor of the gold type
Thus:
- 1 if its type is identical to the gold type
- \(d\) if its type is a parent of the gold type
- \(d ^ 2\) if its type is a grandparent of the gold type
etc.
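To make the decay scheme concrete, here is a minimal Python sketch that produces (gold type, system type, weight) rows from a hierarchy. It is not the weights-for-hierarchy implementation, whose input format is not described here; the `parent` mapping, the type names, and the function `hierarchy_weights` are all hypothetical, and the hierarchy is assumed to be a tree with no cycles.

```python
def hierarchy_weights(parent, decay):
    """Return (gold_type, system_type, weight) rows for a type tree.

    `parent` maps each type to its parent type, or to None for a root.
    A system type is weighted decay ** k when it lies k steps above the
    gold type (k = 0 for the gold type itself); other pairs get no row.
    """
    if not 0 < decay < 1:
        raise ValueError("decay must satisfy 0 < d < 1")
    rows = []
    for gold_type in parent:
        steps = 0
        ancestor = gold_type
        while ancestor is not None:
            rows.append((gold_type, ancestor, decay ** steps))
            ancestor = parent[ancestor]
            steps += 1
    return rows


# Hypothetical three-level hierarchy: root > person > politician.
parent = {"root": None, "person": "root", "politician": "person"}

for gold_type, sys_type, weight in hierarchy_weights(parent, 0.5):
    # Same tab-delimited layout as the --type-weights file.
    print(f"{gold_type}\t{sys_type}\t{weight}")
```

The rows where gold and system types are identical carry weight 1 and are redundant with the default exact-match behaviour; they are kept here only to show the \(d^0\) case.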