
CoNLL-2025 Shared Task

Robust Word Sense Induction


We invite participants to benchmark systems for word sense induction (WSI) across multiple languages. Unlike traditional approaches, this task evaluates WSI without relying on predefined sense inventories, offering a more theoretically plausible framework for understanding word meanings.

Important Dates


Task Description


Participants will be provided with a set of headwords (target words) and, for each headword, a collection of contexts: sentences in which the headword occurs.

The goal is to cluster the sentences according to the sense in which the target word is used.

A different set of headwords and contexts will be provided for each of the following languages:

For each language, there will be approximately 25 headwords with 1500+ contexts each.

Motivation


Previous WSI evaluation approaches have faced significant limitations:

This shared task uses a novel evaluation framework that addresses these limitations through multi-annotator sense clustering and a robust evaluation metric that accounts for annotator agreement levels. By supporting multiple languages and avoiding predetermined sense inventories, it offers a more realistic and comprehensive evaluation of WSI systems.

The evaluation methodology

Traditional evaluation strategies rely on a fixed and arbitrary word sense inventory.

Our evaluation strategy is based on the following observation: in many cases it is clear that two particular occurrences of a word carry the same sense in their respective contexts; conversely, it is sometimes clear that the senses are different and no speaker of the language would lump them together. For the remaining pairs of occurrences, it is not obvious whether they should be classified as same or different. A fair and descriptive evaluation strategy should focus on the well-defined cases while discounting the influence of the contended pairs of instances.

The key idea is to have the instances annotated by many annotators, each of them creating their own sense inventory. From this data we can tell whether the annotators think that a pair of instances carries the same sense, carries different senses, or whether the outcome is unclear. The WSI system is then evaluated on all pairs of instances, with the contended pairs discounted.
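
For illustration only, the sketch below (in Python; the function name, the data layout, and the unanimity criterion are our own assumptions, not the official definitions) shows how pair-level judgements could be derived from several annotators' labelings: a pair of instances is treated as carrying the same sense if all annotators cluster the two together, as different if none do, and as contended otherwise.

    from itertools import combinations

    def pair_judgements(annotations):
        """Classify every pair of instances as 'same', 'different', or 'contended'.

        `annotations` is a list with one entry per annotator; each entry maps
        an instance index to that annotator's own sense label. A pair counts
        as 'same' if every annotator clusters the two instances together,
        'different' if no annotator does, and 'contended' otherwise.
        (Unanimity is an illustrative assumption, not the official criterion.)
        """
        n = len(annotations[0])
        judgements = {}
        for i, j in combinations(range(n), 2):
            votes = [ann[i] == ann[j] for ann in annotations]
            if all(votes):
                judgements[(i, j)] = "same"
            elif not any(votes):
                judgements[(i, j)] = "different"
            else:
                judgements[(i, j)] = "contended"
        return judgements

    # Three annotators labeling five occurrences of "band", each with their own inventory.
    annotators = [
        {0: "music", 1: "music", 2: "radio", 3: "ring", 4: "ring"},
        {0: "group", 1: "group", 2: "frequency", 3: "strip", 4: "jewellery"},
        {0: "ensemble", 1: "ensemble", 2: "spectrum", 3: "rubber", 4: "rubber"},
    ]
    print(pair_judgements(annotators)[(0, 1)])  # 'same': all annotators agree
    print(pair_judgements(annotators)[(3, 4)])  # 'contended': annotators disagree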

See our article and the repository for more details.

Submission Information


If you are interested in this shared task, please register so that we can keep you updated.

When the trial phase begins, we will release the headwords and their contexts, together with the full evaluation data for three headwords per language and the scorer, so that you can prepare your system for the competition.

For the test phase, we will release the remaining headwords and their contexts, but not the evaluation data; the evaluation data will be released after the test phase ends.

There will be two tracks, open and closed. In the open track you are free to use arbitrary external resources; in the closed track you may not use any external data except the source files we provide and the OSCAR 23.01 corpora for the respective languages.

You will be able to make up to 10 submissions during the test phase for each track and language.

Evaluation


The evaluation criterion is the average Weighted Shadow Rand Index (wsRI) over all headwords; see the article for its full definition. Intuitively, wsRI is the Adjusted Rand Index with the pairs of observations weighted by the level of agreement between annotators.
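
The following sketch (our own simplification, not the official scorer) conveys the weighting idea: each pair of instances contributes to a plain Rand-style accuracy with a weight derived from how decisively the annotators agree on it, so contended pairs are discounted. The exact wsRI definition, including the adjustment for chance, is given in the article.

    from itertools import combinations

    def weighted_rand_score(system, annotations):
        """A simplified, agreement-weighted Rand-style score (illustration only).

        `system` maps instance index to the system's cluster label; `annotations`
        is a list of per-annotator labelings. Each pair of instances is weighted
        by how decisively the annotators agree on it (1.0 for unanimous pairs,
        0.0 for an even split), so contended pairs are discounted. The official
        wsRI additionally adjusts for chance; see the task article.
        """
        numerator = 0.0
        denominator = 0.0
        for i, j in combinations(range(len(system)), 2):
            votes = [ann[i] == ann[j] for ann in annotations]
            agreement = max(votes.count(True), votes.count(False)) / len(votes)
            weight = 2 * agreement - 1              # 1.0 unanimous, 0.0 even split
            gold_same = votes.count(True) > len(votes) / 2
            system_same = system[i] == system[j]
            numerator += weight * (gold_same == system_same)
            denominator += weight
        return numerator / denominator if denominator else 0.0

    # A tiny example: two annotators, three instances, a system that matches them.
    annotators = [
        {0: "music", 1: "music", 2: "radio"},
        {0: "group", 1: "group", 2: "frequency"},
    ]
    system_output = {0: "a", 1: "a", 2: "b"}
    print(weighted_rand_score(system_output, annotators))  # 1.0: all weighted pairs correct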

Data Formats


Source files

The data is provided as a UTF-8 encoded, tab-separated text file. The headword is specified as a lemma and a part-of-speech tag, and is marked up within the sentence text using the "<" and ">" characters.

Sample for the English headword band-n:

    headword text
    band-n   The Beatles are arguably the most famous <band> in rock and roll history.
    band-n   This <band's> music was pop influenced.
    band-n   Three <bands> at 5 GHz have been allocated for WiFi and similar services.
    band-n   Put two large rubber <bands> around the base of the cup.
    band-n   If he put the gold <band> on your finger, he likes you pretty well.
    
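
A minimal sketch of reading such a file in Python (the file name band-n.tsv is our assumption; the actual file names in the release may differ):

    import csv
    import re

    # Read the tab-separated source file; each row carries the headword and a
    # context sentence in which the target word is delimited by "<" and ">".
    with open("band-n.tsv", encoding="utf-8", newline="") as source:
        reader = csv.DictReader(source, delimiter="\t")
        contexts = []
        for row in reader:
            marked = re.search(r"<([^>]+)>", row["text"])
            surface_form = marked.group(1) if marked else None
            contexts.append((row["headword"], surface_form, row["text"]))

    print(len(contexts), contexts[0])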

Submission files

The submitted response must consist of the same number of lines as the source data: a header line, followed by a single arbitrary cluster label per line, one for each context and in the same order as the source file.

The submission corresponding to the previous example might look like this:

    cluster
    a
    a
    b
    c
    d
    
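
A minimal sketch of producing such a file in Python, using a trivial one-cluster baseline (the file names are our assumptions; replace the labels with your system's output):

    # Write a submission file: a "cluster" header followed by one cluster label
    # per context, in the same order as the source file. This trivial baseline
    # assigns every context to a single cluster; replace `labels` with the
    # output of your system. (File names are illustrative.)
    with open("band-n.tsv", encoding="utf-8") as source:
        n_contexts = sum(1 for _ in source) - 1     # subtract the header line

    labels = ["a"] * n_contexts

    with open("band-n.submission.tsv", "w", encoding="utf-8") as out:
        out.write("cluster\n")
        out.writelines(f"{label}\n" for label in labels)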

Submissions will be made through this website.

Contact


This shared task is organized by Ondřej Herman, Miloš Jakubíček, Pavel Rychlý and Vojtěch Kovář at Lexical Computing and Masaryk University.

For questions and support, contact the task organizers at conll2025@sketchengine.eu.