Measures in detail
This page describes the measures listed by neleval list-measures.
| Measure | Key | Filter | Aggregator |
|---|---|---|---|
| Mention evaluation measures | | | |
| strong_mention_match | span | NA | sets |
| strong_typed_mention_match | span,type | NA | sets |
| strong_linked_mention_match | span | is_linked | sets |
| Linking evaluation measures | | | |
| strong_link_match | span,kbid | is_linked | sets |
| strong_nil_match | span | is_nil | sets |
| strong_all_match | span,kbid | NA | sets |
| strong_typed_link_match | span,type,kbid | is_linked | sets |
| strong_typed_nil_match | span,type | is_nil | sets |
| strong_typed_all_match | span,type,kbid | NA | sets |
| Document-level tagging evaluation | | | |
| entity_match | docid,kbid | is_linked | sets |
| Clustering evaluation measures | | | |
| muc | span | NA | muc |
| b_cubed | span | NA | b_cubed |
| b_cubed_plus | span,kbid | NA | b_cubed |
| entity_ceaf | span | NA | entity_ceaf |
| mention_ceaf | span | NA | mention_ceaf |
| pairwise | span | NA | pairwise |
Custom measures
A custom measure can be specified on the command line as:
<aggregator>:<filter>:<key>
For example, strong_all_match is equivalent to:
sets:None:span+kbid
Here None means that no filter is applied, and the key fields are joined with +.
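To make the three-part format concrete, the following is a minimal sketch of how such a string could be split into its parts. The names MeasureSpec and parse_measure are hypothetical and are not part of neleval; the tool's own parsing may differ.

```python
from typing import NamedTuple, Optional, Tuple


class MeasureSpec(NamedTuple):
    """Hypothetical container for the parts of a custom measure."""
    aggregator: str
    filter: Optional[str]
    key: Tuple[str, ...]


def parse_measure(spec: str) -> MeasureSpec:
    """Split '<aggregator>:<filter>:<key>' into its three parts."""
    parts = spec.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected <aggregator>:<filter>:<key>, got {spec!r}")
    aggregator, filt, key = parts
    return MeasureSpec(
        aggregator=aggregator,
        # Assumption: the literal string "None" means no filter is applied.
        filter=None if filt == "None" else filt,
        # Assumption: key fields are separated by "+".
        key=tuple(key.split("+")),
    )


print(parse_measure("sets:None:span+kbid"))
print(parse_measure("sets:is_linked:span+kbid"))
```

Under these assumptions, the second call corresponds to the aggregator, filter and key listed for strong_link_match in the table above.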
Grouped measures
By default, measures are aggregated over the corpus as a whole. Passing the
--by-doc and/or --by-type flags to neleval evaluate instead
aggregates measures per document or entity type, and reports
per-document/type and overall (micro- and macro-averaged) performance. Note
that the micro-average is not equivalent to whole-corpus aggregation for
coreference aggregators; it instead represents clustering performance
disregarding cross-document coreference.
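The difference between the two averages can be illustrated with a small sketch for the set-based case. The per-document counts are made up, prf is a hypothetical helper, and the macro-average shown (an unweighted mean of per-document scores) is one common convention that may not match neleval's computation in every detail.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from tp, fp and fn counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Hypothetical per-document (tp, fp, fn) counts.
per_doc = {"doc1": (8, 2, 2), "doc2": (1, 1, 3)}

# Micro-average: sum the counts over documents, then score once.
totals = [sum(counts[i] for counts in per_doc.values()) for i in range(3)]
micro = prf(*totals)

# Macro-average: score each document, then take the unweighted mean.
scores = [prf(*counts) for counts in per_doc.values()]
macro = tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))

print("micro P/R/F: %.3f %.3f %.3f" % micro)
print("macro P/R/F: %.3f %.3f %.3f" % macro)
```

The small, poorly scored doc2 pulls the macro-average down further than the micro-average, because every document counts equally regardless of how many mentions it contains.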
Key
The key defines how system output is matched against the gold standard.
| Key | Description |
|---|---|
| docid | Document identifier must be the same |
| start | Start offset must be the same |
| end | End offset must be the same |
| span | Shorthand for (docid, start, end) |
| type | Entity type must be the same |
| kbid | KB identifier must be the same, or must both be NIL |
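As an illustration of how a key turns a mention into a comparable tuple, here is a minimal sketch. The dictionary layout, the make_key and normalise_kbid helpers, and the convention that NIL identifiers start with "NIL" are all assumptions made for this example, not neleval's internal representation.

```python
def normalise_kbid(kbid):
    """Collapse all NIL identifiers so that any two NIL mentions match."""
    # Assumption: a missing identifier or one starting with "NIL" is unlinked.
    return "NIL" if kbid is None or kbid.startswith("NIL") else kbid


def make_key(mention, fields):
    """Build the tuple used to match a mention, given a list of key fields."""
    key = []
    for field in fields:
        if field == "span":
            # span is shorthand for (docid, start, end).
            key.extend((mention["docid"], mention["start"], mention["end"]))
        elif field == "kbid":
            key.append(normalise_kbid(mention["kbid"]))
        else:
            key.append(mention[field])
    return tuple(key)


gold = {"docid": "d1", "start": 0, "end": 5, "type": "PER", "kbid": "NIL001"}
system = {"docid": "d1", "start": 0, "end": 5, "type": "ORG", "kbid": "NIL042"}

print(make_key(gold, ["span"]) == make_key(system, ["span"]))
print(make_key(gold, ["span", "kbid"]) == make_key(system, ["span", "kbid"]))
print(make_key(gold, ["span", "type"]) == make_key(system, ["span", "type"]))
```

The two mentions match on span and on span plus kbid (both are NIL), but not once type is added to the key.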
Filter
The filter defines which mentions are removed before precision, recall and F-score are calculated.
| Filter | Description |
|---|---|
| is_linked | Only keep mentions that are resolved to known KB identifiers |
| is_nil | Only keep mentions that are not resolved to known KB identifiers |
| is_first | Only keep the first mention in a document of a given KB/NIL identifier |
Note that the is_first filter is intended to provide clustering evaluation similar to the entity_match evaluation of linking performance.
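The three filters can be sketched as simple predicates over mentions. This is an illustration only: the mention dictionaries, the function bodies, and the convention that NIL identifiers start with "NIL" are assumptions, and "first" is taken here to mean earliest by start offset within a document.

```python
def is_nil(mention):
    """True if the mention is not resolved to a known KB identifier."""
    kbid = mention["kbid"]
    # Assumption: a missing identifier or one starting with "NIL" is unlinked.
    return kbid is None or kbid.startswith("NIL")


def is_linked(mention):
    """True if the mention is resolved to a known KB identifier."""
    return not is_nil(mention)


def is_first(mentions):
    """Keep the first mention of each KB/NIL identifier in each document."""
    seen = set()
    kept = []
    ordered = sorted(mentions, key=lambda m: (m["docid"], m["start"], m["end"]))
    for m in ordered:
        ident = (m["docid"], m["kbid"])
        if ident not in seen:
            seen.add(ident)
            kept.append(m)
    return kept


mentions = [
    {"docid": "d1", "start": 0, "end": 5, "kbid": "E1"},
    {"docid": "d1", "start": 10, "end": 15, "kbid": "NIL001"},
    {"docid": "d1", "start": 20, "end": 25, "kbid": "E1"},
    {"docid": "d2", "start": 3, "end": 8, "kbid": "E1"},
]

linked = [m for m in mentions if is_linked(m)]
nil = [m for m in mentions if is_nil(m)]
first = is_first(mentions)
print(len(linked), len(nil), len(first))
```

In this example is_first drops only the second mention of E1 in d1; the mention of E1 in d2 is kept because the filter works per document.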
Aggregator
The aggregator defines how corpus-level scores are computed from individual instances.
| Aggregator | Description |
|---|---|
| Mention, linking, tagging evaluations | |
| sets | Take the unique set of tuples as defined by key across the gold and system data, then micro-average document-level tp, fp and fn counts. |
| overlap-{max,sum}{max,sum} | For tasks in which the gold and system must produce non-overlapping annotations, these scores account for partial overlap between gold and system mentions, as defined for the LoReHLT evaluation. |
| Clustering evaluation | |
| muc | Count the total number of edits required to transform the gold clustering into the system clustering |
| b_cubed | Assess the proportion of each mention’s cluster that is shared between gold and system clusterings |
| entity_ceaf | Calculate optimal one-to-one alignment between system and gold clusters based on Dice coefficient, and get the total aligned score relative to aligning each cluster with itself |
| mention_ceaf | Calculate optimal one-to-one alignment between system and gold clusters based on number of overlapping mentions, and get the total aligned score relative to aligning each cluster with itself |
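| pairwise | Precision, recall and F-score over co-clustered mention pairs (e.g. recall is the proportion of true co-clustered mention pairs that are predicted), as used in computing BLANC |
| pairwise_negative | The same scores over mention pairs that are not co-clustered (e.g. recall is the proportion of true not co-clustered mention pairs that are predicted), as used in computing BLANC |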
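To give a feel for how some of these aggregators behave, the following is a simplified sketch of the sets, b_cubed and pairwise computations on toy data. It follows the textbook definitions rather than neleval's implementation, all function names are hypothetical, and the B-cubed version assumes that gold and system cluster the same set of mentions.

```python
from itertools import combinations


def prf(tp, fp, fn):
    """Precision, recall and F1 from tp, fp and fn counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def sets_score(gold_keys, system_keys):
    """Score the unique key tuples in the system against those in the gold."""
    gold_keys, system_keys = set(gold_keys), set(system_keys)
    tp = len(gold_keys & system_keys)
    return prf(tp, len(system_keys - gold_keys), len(gold_keys - system_keys))


def b_cubed(gold, system):
    """Average, over mentions, the shared proportion of each mention's cluster."""
    gold_of = {m: c for c in gold.values() for m in c}
    sys_of = {m: c for c in system.values() for m in c}
    mentions = gold_of.keys() & sys_of.keys()
    n = len(mentions)
    p = sum(len(gold_of[m] & sys_of[m]) / len(sys_of[m]) for m in mentions) / n
    r = sum(len(gold_of[m] & sys_of[m]) / len(gold_of[m]) for m in mentions) / n
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def co_clustered_pairs(clustering):
    """All unordered pairs of mentions that share a cluster."""
    return {
        frozenset(pair)
        for cluster in clustering.values()
        for pair in combinations(sorted(cluster), 2)
    }


def pairwise(gold, system):
    """Score predicted co-clustered pairs against true co-clustered pairs."""
    return sets_score(co_clustered_pairs(gold), co_clustered_pairs(system))


# Toy linking data keyed by (docid, start, end, kbid).
gold_keys = [("d1", 0, 5, "E1"), ("d1", 10, 15, "E2")]
system_keys = [("d1", 0, 5, "E1"), ("d1", 10, 15, "E3"), ("d1", 20, 25, "E4")]
print("sets     P/R/F: %.3f %.3f %.3f" % sets_score(gold_keys, system_keys))

# Toy clusterings over five mentions, identified here by integers.
gold = {"A": {1, 2, 3}, "B": {4, 5}}
system = {"X": {1, 2}, "Y": {3, 4, 5}}
print("b_cubed  P/R/F: %.3f %.3f %.3f" % b_cubed(gold, system))
print("pairwise P/R/F: %.3f %.3f %.3f" % pairwise(gold, system))
```

In the clustering example the system misplaces mention 3, which costs B-cubed less than it costs the pairwise score, since pairwise counts every broken or spurious pair involving that mention.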