Measures in detail

This page describes the measures listed by neleval list-measures.

Measure	Key	Filter	Aggregator
Mention evaluation measures
strong_mention_match	span	NA	sets
strong_typed_mention_match	span,type	NA	sets
strong_linked_mention_match	span	is_linked	sets
Linking evaluation measures
strong_link_match	span,kbid	is_linked	sets
strong_nil_match	span	is_nil	sets
strong_all_match	span,kbid	NA	sets
strong_typed_link_match	span,type,kbid	is_linked	sets
strong_typed_nil_match	span,type	is_nil	sets
strong_typed_all_match	span,type,kbid	NA	sets
Document-level tagging evaluation
entity_match	docid,kbid	is_linked	sets
Clustering evaluation measures
muc	span	NA	muc
b_cubed	span	NA	b_cubed
b_cubed_plus	span,kbid	NA	b_cubed
entity_ceaf	span	NA	entity_ceaf
mention_ceaf	span	NA	mention_ceaf
pairwise	span	NA	pairwise

Custom measures

A custom measure can be specified on the command-line as:

<aggregator>:<filter>:<key>

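For example, strong_all_match can be written as:

sets:None:span+kbid

Here None means no filter (shown as NA in the table above), and + joins the fields of the key.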
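To make the three-part format concrete, here is a minimal sketch of how such a string could be split into its parts. The MeasureSpec type and parse_measure function are hypothetical names used for illustration only; they are not part of neleval, whose own parsing may differ.

```python
from typing import NamedTuple, Optional, Tuple


class MeasureSpec(NamedTuple):
    """Hypothetical container for a parsed measure specification."""
    aggregator: str
    filter: Optional[str]
    key: Tuple[str, ...]


def parse_measure(spec: str) -> MeasureSpec:
    """Split '<aggregator>:<filter>:<key>' into its three parts."""
    aggregator, filt, key = spec.split(":")
    return MeasureSpec(
        aggregator=aggregator,
        # Assumption: the literal string "None" stands for "no filter".
        filter=None if filt == "None" else filt,
        # Key fields are joined with "+" on the command line.
        key=tuple(key.split("+")),
    )


print(parse_measure("sets:None:span+kbid"))
print(parse_measure("sets:is_linked:span+kbid"))
```

Under the same reading of the format, the second call corresponds to the row for strong_link_match in the table above.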

Grouped measures

By default, measures are aggregated over the corpus as a whole. The --by-doc and/or --by-type flags to neleval evaluate instead aggregate measures per document or per entity type, and then report both per-document/per-type and overall (micro- and macro-averaged) performance. Note that for coreference aggregators the micro-average is not equivalent to whole-corpus aggregation; it represents clustering performance disregarding cross-document coreference.
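The difference between micro- and macro-averaging can be illustrated with a small sketch over per-document tp, fp and fn counts. The function names, the example counts and the convention of scoring 0.0 when a denominator is zero are all assumptions made for illustration, not a description of neleval's internals.

```python
def prf(tp, fp, fn):
    """Precision, recall and f-score from raw counts (0.0 on empty denominators)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def micro_average(counts):
    """Sum the counts over all groups, then compute the scores once."""
    counts = list(counts)
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return prf(tp, fp, fn)


def macro_average(counts):
    """Compute the scores per group, then take the unweighted mean."""
    scores = [prf(*c) for c in counts]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))


# Hypothetical (tp, fp, fn) counts for two documents.
per_doc = {"doc1": (8, 2, 0), "doc2": (1, 1, 3)}

print("micro:", micro_average(per_doc.values()))
print("macro:", macro_average(per_doc.values()))
```

With these counts the larger document dominates the micro-average, while the macro-average weights both documents equally.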

Key

The key defines how system output is matched against the gold standard.

Key	Description
docid	Document identifier must be the same
start	Start offset must be the same
end	End offset must be the same
span	Shorthand for (docid, start, end)
type	Entity type must be the same
kbid	KB identifier must be the same, or must both be NIL
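As a rough illustration of how a key drives matching, the sketch below reduces each mention to a tuple of the key's fields and compares the resulting sets. Mentions are represented as plain dictionaries, and NIL identifiers are assumed to be strings starting with "NIL"; both choices, like the function names, are assumptions for this example rather than neleval's data model.

```python
def expand_key(fields):
    """Expand the 'span' shorthand into (docid, start, end)."""
    expanded = []
    for field in fields:
        expanded.extend(("docid", "start", "end") if field == "span" else (field,))
    return expanded


def key_tuple(mention, fields):
    """Project a mention onto the key's fields."""
    values = []
    for field in expand_key(fields):
        value = mention[field]
        # Assumption: any identifier starting with "NIL" counts as NIL,
        # so two NIL mentions match regardless of their NIL cluster id.
        if field == "kbid" and value.startswith("NIL"):
            value = "NIL"
        values.append(value)
    return tuple(values)


def match_counts(gold, system, fields):
    """Return (tp, fp, fn) over the unique key tuples."""
    g = {key_tuple(m, fields) for m in gold}
    s = {key_tuple(m, fields) for m in system}
    return len(g & s), len(s - g), len(g - s)


gold = [
    {"docid": "d1", "start": 0, "end": 5, "type": "PER", "kbid": "E1"},
    {"docid": "d1", "start": 10, "end": 15, "type": "ORG", "kbid": "NIL1"},
]
system = [
    {"docid": "d1", "start": 0, "end": 5, "type": "PER", "kbid": "E2"},
    {"docid": "d1", "start": 10, "end": 15, "type": "ORG", "kbid": "NIL7"},
    {"docid": "d1", "start": 20, "end": 25, "type": "LOC", "kbid": "E3"},
]

print(match_counts(gold, system, ["span"]))          # (2, 1, 0)
print(match_counts(gold, system, ["span", "kbid"]))  # (1, 2, 1)
```

Adding kbid to the key turns the first mention from a match into both a false positive and a false negative, because the span agrees but the KB identifier does not.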

Filter

The filter defines which mentions are removed before precision, recall and f-score are calculated.

Filter	Description
is_linked	Only keep mentions that are resolved to known KB identifiers
is_nil	Only keep mentions that are not resolved to known KB identifiers
is_first	Only keep the first mention in a document of a given KB/NIL identifier

Note that the is_first filter is intended to provide clustering evaluation similar to the entity_match evaluation of linking performance.
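The three filters could be sketched as follows. This assumes mentions are dictionaries, that NIL identifiers are strings starting with "NIL", and that "first" means the lowest start offset within a document; these are illustrative assumptions, and the function names are not neleval's own.

```python
def is_nil(mention):
    """Assumption: NIL identifiers are strings starting with 'NIL'."""
    return mention["kbid"].startswith("NIL")


def is_linked(mention):
    return not is_nil(mention)


def keep_first(mentions):
    """Keep only the first mention of each KB/NIL identifier per document."""
    seen = set()
    kept = []
    for m in sorted(mentions, key=lambda m: (m["docid"], m["start"], m["end"])):
        ident = (m["docid"], m["kbid"])
        if ident not in seen:
            seen.add(ident)
            kept.append(m)
    return kept


mentions = [
    {"docid": "d1", "start": 30, "end": 35, "kbid": "E1"},
    {"docid": "d1", "start": 0, "end": 5, "kbid": "E1"},
    {"docid": "d1", "start": 10, "end": 15, "kbid": "NIL1"},
    {"docid": "d2", "start": 4, "end": 9, "kbid": "E1"},
]

print([m["start"] for m in mentions if is_linked(m)])  # [30, 0, 4]
print([m["start"] for m in mentions if is_nil(m)])     # [10]
print([(m["docid"], m["start"]) for m in keep_first(mentions)])
```

After keep_first, each document retains one mention per identifier, which is what makes the result comparable to a document-level evaluation such as entity_match.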

Aggregator

The aggregator defines how corpus-level scores are computed from individual instances.

Aggregator	Description
Mention, linking, tagging evaluations
sets	Take the unique set of tuples as defined by key across the gold and system data, then micro-average document-level tp, fp and fn counts.
overlap-{max,sum}{max,sum}	For tasks in which the gold and system must produce non-overlapping annotations, these scores account for partial overlap between gold and system mentions, as defined for the LoReHLT evaluation.
Clustering evaluation
muc	Count the total number of edits required to transform the gold clustering into the system clustering
b_cubed	Assess the proportion of each mention’s cluster that is shared between gold and system clusterings
entity_ceaf	Calculate optimal one-to-one alignment between system and gold clusters based on Dice coefficient, and get the total aligned score relative to aligning each cluster with itself
mention_ceaf	Calculate optimal one-to-one alignment between system and gold clusters based on number of overlapping mentions, and get the total aligned score relative to aligning each cluster with itself
pairwise	Scores over co-clustered mention pairs (e.g. recall is the proportion of true co-clustered mention pairs that are predicted), as used in computing BLANC
pairwise_negative	Scores over mention pairs that are not co-clustered (e.g. recall is the proportion of true not co-clustered mention pairs that are predicted), as used in computing BLANC
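Two of the clustering aggregators above can be sketched in a few lines. The code below is a simplified illustration of the standard B-cubed and pairwise definitions, assuming that clusterings are lists of sets of mention identifiers and that gold and system cover the same mentions; it is not neleval's implementation, and the function names are hypothetical.

```python
from itertools import combinations


def mention_to_cluster(clustering):
    """Map each mention to the (frozen) cluster that contains it."""
    return {m: frozenset(members) for members in clustering for m in members}


def b_cubed(gold, system):
    """Average, over mentions, the shared proportion of each mention's cluster."""
    g, s = mention_to_cluster(gold), mention_to_cluster(system)
    mentions = g.keys() & s.keys()
    p = sum(len(g[m] & s[m]) / len(s[m]) for m in mentions) / len(mentions)
    r = sum(len(g[m] & s[m]) / len(g[m]) for m in mentions) / len(mentions)
    return p, r


def coclustered_pairs(clustering):
    """All unordered pairs of mentions that share a cluster."""
    return {
        frozenset(pair)
        for members in clustering
        for pair in combinations(sorted(members), 2)
    }


def pairwise(gold, system):
    """Precision and recall over co-clustered mention pairs."""
    g, s = coclustered_pairs(gold), coclustered_pairs(system)
    both = len(g & s)
    p = both / len(s) if s else 0.0
    r = both / len(g) if g else 0.0
    return p, r


gold = [{"a", "b", "c"}, {"d"}]
system = [{"a", "b"}, {"c", "d"}]

print("b_cubed:", b_cubed(gold, system))
print("pairwise:", pairwise(gold, system))
```

In this example the system splits c away from its gold cluster and merges it with d, which lowers both precision and recall under either aggregator.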