Measures in detail

This page describes the measures listed by `neleval list-measures`.

| Measure | Key | Filter | Aggregator |
|---------|-----|--------|------------|
| **Mention evaluation measures** | | | |
| strong_mention_match | span | NA | sets |
| strong_typed_mention_match | span, type | NA | sets |
| strong_linked_mention_match | span | is_linked | sets |
| **Linking evaluation measures** | | | |
| strong_link_match | span, kbid | is_linked | sets |
| strong_nil_match | span | is_nil | sets |
| strong_all_match | span, kbid | NA | sets |
| strong_typed_link_match | span, type, kbid | is_linked | sets |
| strong_typed_nil_match | span, type | is_nil | sets |
| strong_typed_all_match | span, type, kbid | NA | sets |
| **Document-level tagging evaluation** | | | |
| entity_match | docid, kbid | is_linked | sets |
| **Clustering evaluation measures** | | | |
| muc | span | NA | muc |
| b_cubed | span | NA | b_cubed |
| b_cubed_plus | span, kbid | NA | b_cubed |
| entity_ceaf | span | NA | entity_ceaf |
| mention_ceaf | span | NA | mention_ceaf |
| pairwise | span | NA | pairwise |

Custom measures

A custom measure can be specified on the command line as:

```
<aggregator>:<filter>:<key>
```

For example, `sets:None:span+kbid` is equivalent to `strong_all_match`.
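As a rough illustration (not neleval's actual implementation), a `sets` measure keyed on `span+kbid` amounts to comparing sets of (docid, start, end, kbid) tuples between the gold and system annotations; the data below is invented:

```python
# Sketch of the sets aggregator with key span+kbid (strong_all_match).
# Annotations are (docid, start, end, kbid) tuples; NIL links use None here.
gold = {("d1", 0, 5, "Q1"), ("d1", 10, 15, None), ("d2", 3, 8, "Q2")}
system = {("d1", 0, 5, "Q1"), ("d1", 10, 15, "Q3"), ("d2", 3, 8, "Q2")}

tp = len(gold & system)   # tuples matched exactly on the key
fp = len(system - gold)   # spurious system tuples
fn = len(gold - system)   # missed gold tuples

precision = tp / (tp + fp)
recall = tp / (tp + fn)
fscore = 2 * precision * recall / (precision + recall)
print(tp, fp, fn, round(fscore, 3))  # → 2 1 1 0.667
```

A stricter or looser measure simply changes which fields enter the tuple: dropping `kbid` from the key yields `strong_mention_match`-style span matching.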

Grouped measures

By default, measures are aggregated over the corpus as a whole. Passing the `--by-doc` and/or `--by-type` flags to `neleval evaluate` instead aggregates measures per document or per entity type, and then reports per-document/per-type as well as overall (micro- and macro-averaged) performance. Note that for coreference aggregates the micro-average does not equate to whole-corpus aggregation; it represents clustering performance disregarding cross-document coreference.
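The distinction between the two averages can be sketched for a sets-style measure with hypothetical per-document counts (the numbers and names below are illustrative, not neleval output):

```python
# Hypothetical per-document (tp, fp, fn) counts for one measure.
per_doc = {"doc1": (8, 2, 0), "doc2": (1, 0, 3)}

def prf(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Micro-average: pool the counts across documents, then score once.
tp, fp, fn = map(sum, zip(*per_doc.values()))
micro_f = prf(tp, fp, fn)[2]

# Macro-average: score each document separately, then average the F-scores.
macro_f = sum(prf(*c)[2] for c in per_doc.values()) / len(per_doc)
print(round(micro_f, 3), round(macro_f, 3))  # → 0.783 0.644
```

The micro-average weights documents by their number of mentions, while the macro-average weights every document equally, so a small document with poor performance drags the macro-average down more.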

Key

The key defines how system output is matched against the gold standard.

| Key | Description |
|-------|-------------|
| docid | Document identifier must be the same |
| start | Start offset must be the same |
| end | End offset must be the same |
| span | Shorthand for (docid, start, end) |
| type | Entity type must be the same |
| kbid | KB identifier must be the same, or both must be NIL |
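Conceptually, a key is a projection of each annotation onto the fields used for matching; the sketch below illustrates this with invented field names and data, not neleval's internal representation:

```python
# Sketch: a key projects an annotation onto the fields used for matching.
KEYS = {
    "span": ("docid", "start", "end"),
    "span+kbid": ("docid", "start", "end", "kbid"),
    "docid+kbid": ("docid", "kbid"),
}

def project(annotation, key):
    """Return the tuple used to match this annotation under the given key."""
    return tuple(annotation[field] for field in KEYS[key])

ann = {"docid": "d1", "start": 0, "end": 5, "type": "PER", "kbid": "Q42"}
print(project(ann, "span"))       # → ('d1', 0, 5)
print(project(ann, "span+kbid"))  # → ('d1', 0, 5, 'Q42')
```

Two annotations match under a key exactly when their projected tuples are equal, which is why richer keys (more fields) produce stricter measures.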

Filter

The filter determines which mentions are retained (all others are removed) before precision, recall and F-score are calculated.

| Filter | Description |
|-----------|-------------|
| is_linked | Keep only mentions that are resolved to known KB identifiers |
| is_nil | Keep only mentions that are not resolved to known KB identifiers |
| is_first | Keep only the first mention in a document with a given KB/NIL identifier |

Note that the `is_first` filter is intended to provide a clustering evaluation analogous to the `entity_match` evaluation of linking performance.
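The three filters can be sketched as simple selections over a mention list; field names and data are illustrative (a NIL mention carries `kbid` of `None` here):

```python
# Sketch of the three filters as selections over annotations (illustrative).
mentions = [
    {"docid": "d1", "start": 0, "end": 5, "kbid": "Q1"},
    {"docid": "d1", "start": 10, "end": 15, "kbid": None},  # NIL mention
    {"docid": "d1", "start": 20, "end": 25, "kbid": "Q1"},  # repeat of Q1
]

is_linked = [m for m in mentions if m["kbid"] is not None]
is_nil = [m for m in mentions if m["kbid"] is None]

# is_first: keep only the earliest mention of each KB/NIL id per document.
seen, is_first = set(), []
for m in sorted(mentions, key=lambda m: (m["docid"], m["start"])):
    if (m["docid"], m["kbid"]) not in seen:
        seen.add((m["docid"], m["kbid"]))
        is_first.append(m)

print(len(is_linked), len(is_nil), len(is_first))  # → 2 1 2
```

Under `is_first`, the second Q1 mention is dropped, so repeated references to the same entity in a document count only once, mirroring a document-level tagging view.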

Aggregator

The aggregator defines how corpus-level scores are computed from individual instances.

| Aggregator | Description |
|------------|-------------|
| **Mention, linking, tagging evaluations** | |
| sets | Take the unique set of tuples as defined by key across the gold and system data, then micro-average document-level tp, fp and fn counts |
| overlap-{max,sum}{max,sum} | For tasks in which the gold and system must produce non-overlapping annotations, these scores account for partial overlap between gold and system mentions, as defined for the LoReHLT evaluation |
| **Clustering evaluation** | |
| muc | Count the total number of edits required to translate the gold clustering into the system clustering |
| b_cubed | Assess the proportion of each mention's cluster that is shared between the gold and system clusterings |
| entity_ceaf | Find the optimal one-to-one alignment between system and gold clusters by Dice coefficient, then report the total aligned score relative to aligning each cluster with itself |
| mention_ceaf | Find the optimal one-to-one alignment between system and gold clusters by number of overlapping mentions, then report the total aligned score relative to aligning each cluster with itself |
| pairwise | The proportion of truly co-clustered mention pairs that are predicted as such, etc., as used in computing BLANC |
| pairwise_negative | The proportion of truly not co-clustered mention pairs that are predicted as such, etc., as used in computing BLANC |
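As a concrete illustration of one clustering aggregator, the sketch below computes B-cubed precision and recall over two clusterings of the same mentions; it follows the standard B-cubed definition rather than neleval's implementation, and the data is invented:

```python
# Sketch of B-cubed over two clusterings of the same mentions (illustrative).
def b_cubed_score(gold_clusters, sys_clusters):
    """Return (precision, recall), each averaged over mentions."""
    gold_of = {m: c for c in gold_clusters for m in c}
    sys_of = {m: c for c in sys_clusters for m in c}
    mentions = list(gold_of)
    # For each mention, the fraction of its system (resp. gold) cluster
    # that lies in the same gold (resp. system) cluster.
    precision = sum(
        len(sys_of[m] & gold_of[m]) / len(sys_of[m]) for m in mentions
    ) / len(mentions)
    recall = sum(
        len(sys_of[m] & gold_of[m]) / len(gold_of[m]) for m in mentions
    ) / len(mentions)
    return precision, recall

gold = [frozenset({"a", "b", "c"}), frozenset({"d"})]
system = [frozenset({"a", "b"}), frozenset({"c", "d"})]
p, r = b_cubed_score(gold, system)
print(round(p, 3), round(r, 3))  # → 0.75 0.667
```

Here the system splits the gold cluster {a, b, c} and wrongly merges c with d, so both precision and recall fall below 1 even though every mention is present in both clusterings.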