The NEL evaluation tools are invoked using neleval, or ./nel inside the repository. Usage:

neleval <command> [<args>]

To list available commands:

neleval

To get help for a specific command:

neleval <command> -h

See Command-line reference.

The commands that are relevant to TAC KBP entity linking evaluation and analysis are described below.

Basic usage

The following describes a typical workflow. See also Convenience scripts for TAC KBP evaluation.

Convert gold standard to evaluation format

For data in TAC14 format:

neleval prepare-tac \
    -q /path/to/gold.xml \    # gold queries/mentions file
    /path/to/ \       # gold KB/NIL annotations file
    > gold.combined.tsv

For data in TAC12 and TAC13 format, remove extra columns first, e.g.:

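cat /path/to/ \
    | cut -f1,2,3 \
    > gold.trimmed.tab        # intermediate file (illustrative name)
neleval prepare-tac \
    -q /path/to/gold.xml \    # gold queries/mentions file
    gold.trimmed.tab \        # trimmed gold KB/NIL annotations
    > gold.combined.tsv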
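
As a quick sanity check of the column-trimming step, the sketch below runs the same cut on a single made-up annotation line. The query ID, KB ID and the two trailing columns are invented for illustration; real TAC12 and TAC13 files may carry different extra columns.

```shell
# Hypothetical TAC12/13-style gold line: query ID, KB/NIL ID, entity type,
# plus two extra columns (all values invented for illustration).
sample_line=$(printf 'EL_0001\tE0123456\tPER\tNW\tNO')

# Keep only the first three tab-separated columns.
trimmed=$(printf '%s\n' "$sample_line" | cut -f1,2,3)

printf '%s\n' "$trimmed"
```

Running cut on one line first is a cheap way to confirm that the file really is tab-separated before preparing the whole gold standard.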

Convert system output to evaluation format

For data in TAC14 format:

neleval prepare-tac \
    -q /path/to/system.xml \  # system mentions file
    /path/to/ \     # system KB/NIL annotations
    > system.combined.tsv

For data in TAC12 and TAC13 format, add a dummy NE type column first, e.g.:

cat /path/to/ \
    | awk 'BEGIN{OFS="\t"} {print $1,$2,"NA",$3}' \
    > system.typed.tab        # intermediate file (illustrative name)
neleval prepare-tac \
    -q /path/to/gold.xml \    # gold queries/mentions file
    system.typed.tab \        # system KB/NIL annotations with dummy type
    > system.combined.tsv
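
To see what the awk step does, the following sketch applies it to one made-up system output line. The three input columns (query ID, KB/NIL ID, confidence) and their values are assumptions for illustration only.

```shell
# Hypothetical TAC12/13-style system line: query ID, KB/NIL ID, confidence.
sample_line=$(printf 'EL_0001\tE0123456\t0.87')

# Insert a dummy "NA" entity type as the third column. awk splits input on
# whitespace by default, and OFS makes the output tab-separated.
with_type=$(printf '%s\n' "$sample_line" \
    | awk 'BEGIN{OFS="\t"} {print $1,$2,"NA",$3}')

printf '%s\n' "$with_type"
```

The result has four columns, with the original third column shifted to fourth place.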

Evaluate system output

To calculate micro-averaged scores for all evaluation measures:

neleval evaluate \
    -m all \                  # report all evaluation measures
    -f tab \                  # print results in tab-separated format
    -g gold.combined.tsv \    # prepared gold standard annotation
    system.combined.tsv \     # prepared system output
    > system.evaluation

To list available evaluation measures:

neleval list-measures

Advanced usage

The following describes additional commands for analysis.

Calculate confidence intervals

To calculate confidence intervals using bootstrap resampling:

neleval confidence \
    -m strong_typed_link_match \ # report CI for TAC14 wikification measure
    -f tab \                  # print results in tab-separated format
    -g gold.combined.tsv \    # prepared gold standard annotation
    system.combined.tsv \     # prepared system output
    > system.confidence

We recommend installing joblib (pip install joblib) and passing -j NUM_JOBS to run this in parallel. It is also faster to specify an individual evaluation measure (e.g., strong_typed_link_match) than a group of measures (e.g., tac).

A script is available to create reports comparing multiple systems.

Note that bootstrap resampling is not appropriate for NIL clustering measures. For more detail, see the Significance wiki page.

Calculate significant differences

To test whether the difference between two systems is significant:

neleval significance \
    --permute \               # use permutation method
    -f tab \                  # print results in tab-separated format
    -g gold.combined.tsv \    # prepared gold standard annotation
    system1.combined.tsv \    # prepared system1 output
    system2.combined.tsv \    # prepared system2 output
    > system1-system2.significance

We recommend calculating significance only for selected system pairs, as running all N choose 2 combinations of systems can take a while. You can also use -j NUM_JOBS to run this in parallel.
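
To get a feel for how quickly the number of pairs grows, the sketch below enumerates every pair of prepared system files and prints the corresponding command without running it. The function name print_pair_commands and the third system file are hypothetical; the flags are the ones used in the example above.

```shell
# Print one `neleval significance` command per pair of prepared system files.
# This is a dry run: the commands are printed, not executed.
print_pair_commands() {
    gold=$1
    shift
    while [ "$#" -gt 1 ]; do
        first=$1
        shift
        # Pair the current system with every system that follows it.
        for second in "$@"; do
            printf 'neleval significance --permute -f tab -g %s %s %s\n' \
                "$gold" "$first" "$second"
        done
    done
}

print_pair_commands gold.combined.tsv \
    system1.combined.tsv system2.combined.tsv system3.combined.tsv
```

Three systems yield three commands. From the printed list you can pick the pairs of interest and run only those.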

Note that bootstrap resampling is not appropriate for NIL clustering measures. For more detail, see the Significance wiki page.

Analyze error types

To create a table of classification errors:

neleval analyze \
    -s \                      # print summary table
    -g gold.combined.tsv \    # prepared gold standard annotation
    system.combined.tsv \     # prepared system output
    > system.analysis

Without the -s flag, the analyze command will list and categorize differences between the gold standard and system output.

Filter data for evaluation on subsets

The following describes a workflow for evaluation over subsets of mentions.

Filter prepared data

Prepared data is in a simple tab-separated format with one mention per line and six columns: document_id, start_offset, end_offset, kb_or_nil_id, score, entity_type. It is possible to use command line tools (e.g., grep, awk) to select mentions for evaluation, e.g.:

cat gold.combined.tsv \       # prepared gold standard annotation
    | egrep "^eng-(NG|WL)-" \ # select newsgroup and blog (WB) mentions
    > gold.WB.tsv             # filtered gold standard annotation
cat system.combined.tsv \     # prepared system output
    | egrep "^eng-(NG|WL)-" \ # select newsgroup and blog (WB) mentions
    > system.WB.tsv           # filtered system output
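
Because the prepared format is plain tab-separated text, awk can also filter on columns other than the document ID. The sketch below uses two made-up rows; the offsets, IDs, scores and entity types are invented, and the assumption that NIL identifiers start with "NIL" should be checked against your own data.

```shell
# Two made-up rows in the six-column prepared format:
# document_id, start_offset, end_offset, kb_or_nil_id, score, entity_type
sample=$(printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    eng-NG-0001 10 15 E0123456 1.0 PER \
    eng-WL-0002 42 50 NIL0001 1.0 ORG)

# Keep rows whose sixth column (entity_type) is exactly PER.
per_only=$(printf '%s\n' "$sample" | awk -F'\t' '$6 == "PER"')

# Keep rows whose fourth column (kb_or_nil_id) starts with NIL
# (assumed naming convention for NIL identifiers).
nil_only=$(printf '%s\n' "$sample" | awk -F'\t' '$4 ~ /^NIL/')

printf '%s\n' "$per_only" "$nil_only"
```

In practice the same awk expression would be applied to the prepared gold and system files in place of the inline sample.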

Evaluate on filtered data

After filtering, evaluation is run as before:

neleval evaluate \
    -m all \                  # report all evaluation measures
    -f tab \                  # print results in tab-separated format
    -g gold.WB.tsv \          # filtered gold standard annotation
    system.WB.tsv \           # filtered system output
    > system.WB.evaluation

Evaluate each document or entity type

To get a score for each document, or each entity type, as well as the macro-averaged score across documents, use --group-by in neleval evaluate. See Grouped measures.