
CoNLL-2025 Shared Task

Robust Word Sense Induction


We invite participants to benchmark systems for word sense induction (WSI) across multiple languages. Unlike traditional approaches, this task evaluates WSI without relying on predefined sense inventories, offering a more theoretically plausible framework for understanding word meanings.

Important Dates


Task Description


Participants will be provided with a set of headwords (target words) and, for each headword, a collection of contexts: sentences in which the headword occurs.

The goal is to cluster the sentences according to the sense in which the target word is used.

A different set of headwords and contexts will be provided for each of the following languages:

For each language, there will be approximately 25 headwords with 1500+ contexts each.

Motivation


Previous WSI evaluation approaches have faced significant limitations:

This shared task uses a novel evaluation framework that addresses these limitations through multi-annotator sense clustering and a robust evaluation metric that accounts for annotator agreement levels. By supporting multiple languages and avoiding predetermined sense inventories, it offers a more realistic and comprehensive evaluation of WSI systems.

The evaluation methodology

Traditional evaluation strategies rely on a fixed and arbitrary word sense inventory.

Our evaluation strategy is based on the following observation: in many cases it is clear that two particular occurrences of a word carry the same sense in their respective contexts; conversely, it is sometimes clear that the senses are different and no speaker of the language would lump them together. For the remaining pairs of occurrences, it is not obvious whether they should be classified as same or different. A fair and descriptive evaluation strategy should focus on the well-defined cases while discounting the influence of the contended pairs of instances.

The key idea is to have the instances annotated by many annotators, each of them creating their own sense inventory. From this data we can tell whether the annotators think that a pair of instances carries the same sense, carries different senses, or whether the outcome is unclear. The WSI system is then evaluated on all pairs of instances, with the contended pairs discounted.
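
For illustration only, the sketch below (in Python; the function name, the data layout, and the unanimity criterion are our own assumptions, not the official definitions) shows how pair-level judgements could be derived from several annotators' labelings: a pair of instances is treated as carrying the same sense if all annotators cluster the two together, as different if none do, and as contended otherwise.

    from itertools import combinations

    def pair_judgements(annotations):
        """Classify every pair of instances as 'same', 'different', or 'contended'.

        `annotations` is a list with one entry per annotator; each entry maps
        an instance index to that annotator's own sense label. A pair counts
        as 'same' if every annotator clusters the two instances together,
        'different' if no annotator does, and 'contended' otherwise.
        (Unanimity is an illustrative assumption, not the official criterion.)
        """
        n = len(annotations[0])
        judgements = {}
        for i, j in combinations(range(n), 2):
            votes = [ann[i] == ann[j] for ann in annotations]
            if all(votes):
                judgements[(i, j)] = "same"
            elif not any(votes):
                judgements[(i, j)] = "different"
            else:
                judgements[(i, j)] = "contended"
        return judgements

    # Three annotators labeling five occurrences of "band", each with their own inventory.
    annotators = [
        {0: "music", 1: "music", 2: "radio", 3: "ring", 4: "ring"},
        {0: "group", 1: "group", 2: "frequency", 3: "strip", 4: "jewellery"},
        {0: "ensemble", 1: "ensemble", 2: "spectrum", 3: "rubber", 4: "rubber"},
    ]
    print(pair_judgements(annotators)[(0, 1)])  # 'same': all annotators agree
    print(pair_judgements(annotators)[(3, 4)])  # 'contended': annotators disagree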

See our article and the repository for more details.

Submission Information


If you are interested in this shared task, please register so that we can keep you updated.

When the trial phase begins, we will release the headwords and their contexts, together with the full evaluation data for three headwords per language and the scorer, so that you can prepare your system for the competition.

For the test phase, we will release the remaining headwords and their contexts, but not the evaluation data; the evaluation data will be released after the test phase ends.

There will be two tracks, open and closed. In the open track you are free to use arbitrary external resources; in the closed track you may not use any external data except the source files we provide and the OSCAR 23.01 corpora for the respective languages.

You will be able to make up to 10 submissions during the test phase for each track and language.

Evaluation


The evaluation criterion is the average Weighted Shadow Rand Index (wsRI) over all headwords; see the article for its full definition. Intuitively, wsRI is the Adjusted Rand Index with the pairs of observations weighted by the level of agreement between annotators.
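
The following sketch (our own simplification, not the official scorer) conveys the weighting idea: each pair of instances contributes to a plain Rand-style accuracy with a weight derived from how decisively the annotators agree on it, so contended pairs are discounted. The exact wsRI definition, including the adjustment for chance, is given in the article.

    from itertools import combinations

    def weighted_rand_score(system, annotations):
        """A simplified, agreement-weighted Rand-style score (illustration only).

        `system` maps instance index to the system's cluster label; `annotations`
        is a list of per-annotator labelings. Each pair of instances is weighted
        by how decisively the annotators agree on it (1.0 for unanimous pairs,
        0.0 for an even split), so contended pairs are discounted. The official
        wsRI additionally adjusts for chance; see the task article.
        """
        numerator = 0.0
        denominator = 0.0
        for i, j in combinations(range(len(system)), 2):
            votes = [ann[i] == ann[j] for ann in annotations]
            agreement = max(votes.count(True), votes.count(False)) / len(votes)
            weight = 2 * agreement - 1              # 1.0 unanimous, 0.0 even split
            gold_same = votes.count(True) > len(votes) / 2
            system_same = system[i] == system[j]
            numerator += weight * (gold_same == system_same)
            denominator += weight
        return numerator / denominator if denominator else 0.0

    # A tiny example: two annotators, three instances, a system that matches them.
    annotators = [
        {0: "music", 1: "music", 2: "radio"},
        {0: "group", 1: "group", 2: "frequency"},
    ]
    system_output = {0: "a", 1: "a", 2: "b"}
    print(weighted_rand_score(system_output, annotators))  # 1.0: all weighted pairs correct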

Data Formats


Source files

The data is provided as a UTF-8 encoded, tab-separated text file. The headword is specified as a lemma and a part-of-speech tag, and is marked up within the sentence text using the "<" and ">" characters.

Sample for the English headword band-n:

    headword text
    band-n   The Beatles are arguably the most famous <band> in rock and roll history.
    band-n   This <band's> music was pop influenced.
    band-n   Three <bands> at 5 GHz have been allocated for WiFi and similar services.
    band-n   Put two large rubber <bands> around the base of the cup.
    band-n   If he put the gold <band> on your finger, he likes you pretty well.
    
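
A minimal sketch of reading such a file in Python (the file name band-n.tsv is our assumption; the actual file names in the release may differ):

    import csv
    import re

    # Read the tab-separated source file; each row carries the headword and a
    # context sentence in which the target word is delimited by "<" and ">".
    with open("band-n.tsv", encoding="utf-8", newline="") as source:
        reader = csv.DictReader(source, delimiter="\t")
        contexts = []
        for row in reader:
            marked = re.search(r"<([^>]+)>", row["text"])
            surface_form = marked.group(1) if marked else None
            contexts.append((row["headword"], surface_form, row["text"]))

    print(len(contexts), contexts[0])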

Submission files

The submitted response must consist of the same number of lines as the source data: a header line, followed by a single arbitrary cluster label per line, one for each context and in the same order as the source file.

The submission corresponding to the previous example might look like this:

    cluster
    a
    a
    b
    c
    d
    
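
A minimal sketch of producing such a file in Python, using a trivial one-cluster baseline (the file names are our assumptions; replace the labels with your system's output):

    # Write a submission file: a "cluster" header followed by one cluster label
    # per context, in the same order as the source file. This trivial baseline
    # assigns every context to a single cluster; replace `labels` with the
    # output of your system. (File names are illustrative.)
    with open("band-n.tsv", encoding="utf-8") as source:
        n_contexts = sum(1 for _ in source) - 1     # subtract the header line

    labels = ["a"] * n_contexts

    with open("band-n.submission.tsv", "w", encoding="utf-8") as out:
        out.write("cluster\n")
        out.writelines(f"{label}\n" for label in labels)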

Submissions will be made through this website.

Contact


This shared task is organized by Ondřej Herman, Miloš Jakubíček, Pavel Rychlý and Vojtěch Kovář at Lexical Computing and Masaryk University.

For questions and support, contact the task organizers at conll2025@sketchengine.eu.