Linguistic Patterns in European Public Organization Names

Tracking #: 3858-5072

Authors: 
Alvaro del Ser
Carlos Badenes-Olmedo

Responsible editor: 
Blerina Spahiu

Submission type: 
Full Paper
Abstract: 
This work addresses the challenge of classifying public sector organizations across multiple European languages using only their official names, a critical step for entity disambiguation in knowledge graph population. We employ ontology-based knowledge extraction to evaluate three Natural Language Processing approaches (rule-based keyword extraction, zero-shot Natural Language Inference, and embedding-based semantic similarity) under low-context, low-resource assumptions. Large Language Models are integrated across all three techniques. Our methodology systematically evaluates multilingual preprocessing, various state-of-the-art models, different supervision regimes, classification structures, and parameter optimization. We conduct a detailed evaluation across three specific domains (healthcare, administration, education) spanning multiple European countries, analyzing performance in relation to lexical structure and class balance. Results demonstrate that lightweight rule-based methods, particularly TF-IDF keyword selection, are effective in multilingual scenarios with minimal supervision. Natural Language Inference models offer competitive zero-shot performance but show deficiencies with unbalanced class distributions. Embedding-based methods provide the most consistent generalization across languages, with evidence of class coherence in vector space. We apply these techniques to a real-world use case, classifying contracting authorities in the EU Contract Hub platform, and outline additional applications and extensions for governance objectives and ontology refinement. This work highlights the feasibility of ontology-guided multilingual classification from short texts and its contribution to entity disambiguation challenges in formal knowledge representation systems, particularly when integrating diverse European organizational entities into structured knowledge bases.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By David Lindemann submitted on 26/Sep/2025
Suggestion:
Minor Revision
Review Comment:

The paper is interesting, well-structured, and well written. An introductory section explains the aim of the paper and the (possible) application of the presented methodology for classifying EU-based organizations by their name, e.g. for populating a knowledge graph. The main challenges are multilinguality (the dataset used consists of European organization names in 29 languages) and the fact that, apart from their name, the dataset contains no metadata describing the organizations. The remainder of section 1 outlines problems to be addressed, condensed in a two-fold research question: (1) how effectively the employed NLP techniques can assist with organization classification from names, and (2) whether these techniques can help describe variation in organization naming conventions across the EU.

Interestingly, and as explained in section 2 "background", the authors use Wikidata for retrieving organization metadata. It is stated that "resources such as Wikidata" contain valuable classifications of organizations to be used for training and evaluating the methods presented in this paper. It seems necessary to provide more insight here: what other "such" resources are available, and why choose Wikidata? Is this due to the span and quality of Wikidata in that regard (is it possible to make a statement about the completeness of organizations represented in Wikidata, e.g. through comparisons with ROR and/or others?), or due to its open license? Is it possible to trace where Wikidata got its organization metadata from, i.e., are those data somehow referenced?

Section 3 presents the three methods to be compared in terms of recall and precision (F1-score): (1) rule-based, (2) zero-shot LLM-based, and (3) embedding-based. This is well explained, though a word about the general interest of comparing methods from these three methodological families would strengthen the paper even more. Section 3.1 (where data retrieval from Wikidata is explained) talks about "endpoint limitations" at Wikidata, which forced the authors to limit query result sizes. It is not clear which endpoint was used, and whether endpoints alternative to the main Wikidata endpoint, such as QLever (Freiburg University), which are offered to bridge the mentioned shortcomings (the authors state they had to deal with "incomplete data responses", line 6/22), provide better solutions here. It is also stated that "Wikidata does not offer a preferred language for labels" (6/24), which is no longer true; especially in the context of this study, the recently introduced "mul" label ("multilingual", the default language-independent label for an entity) may be worth taking into consideration. Again, in lines 6/36f, "endpoint constraints" are mentioned, and it is said that future improvements could explore extraction from full Wikidata dumps; it seems that the authors have not explored QLever. Or do the same constraints apply there? In lines 6/44ff it is stated that Wikidata contains "sparse data" for "some countries", and that countries not meeting a "minimum threshold" (not further specified!) in that coverage have been excluded. Is this due to country size, or have rates of organization representation been compared to country size? [This is explained a bit better, though still fuzzily, in line 19/13f; should this be moved to section 3?]
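For concreteness, a paginated query of the following shape would address both issues raised here (result-size limits and the "mul" fallback label). This is a hypothetical reconstruction, since the paper's actual SPARQL is not given: the class QID (Q16917, hospital), the page size, and the endpoint are illustrative assumptions, and whether the label service behaves identically on QLever would need to be checked.

```python
# A sketch of a paginated Wikidata query with the "mul" label fallback.
# The class QID (Q16917, hospital) and the page size are illustrative.
import requests

WDQS = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?org ?orgLabel WHERE {{
  ?org wdt:P31/wdt:P279* wd:Q16917 ;   # instance of (a subclass of) hospital
       wdt:P17 wd:{country} .          # country of the organization
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{lang},mul,en". }}
}}
ORDER BY ?org                           # deterministic order, so paging is stable
LIMIT {limit} OFFSET {offset}
"""

def fetch_orgs(country_qid, lang, page_size=500):
    """Page through results to stay under endpoint result-size limits."""
    offset = 0
    while True:
        query = QUERY.format(country=country_qid, lang=lang,
                             limit=page_size, offset=offset)
        resp = requests.get(WDQS, params={"query": query, "format": "json"},
                            headers={"User-Agent": "org-classification-replication/0.1"})
        resp.raise_for_status()
        rows = resp.json()["results"]["bindings"]
        if not rows:
            return
        for row in rows:
            yield row["org"]["value"], row["orgLabel"]["value"]
        offset += page_size

# e.g. German hospitals labelled in German, falling back to "mul", then English:
# for uri, label in fetch_orgs("Q183", "de"): print(uri, label)
```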

Section 3.2 explains the design of the classification experiments. As for LLM-based methods, only locally run LLMs have been chosen, because "for few token tasks ... smaller models can offer sufficient performance" (7/25f). On the other hand, that choice means "under-performance, and ... computational constraints". It would be good to include a reason for not having chosen large models run on remote servers and available for free, such as DeepSeek. In contrast to that, the subsequent paragraph describes the use of DeepSeek-R1 in the presented study: is that a local copy of DeepSeek?
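If, as this reviewer suspects, a locally served distilled copy was used, the setup might resemble the following sketch; the Ollama server, the model tag, and the prompt wording are illustrative assumptions, not details taken from the paper.

```python
# A sketch of a local LLM call for name classification, assuming an Ollama
# server with a distilled DeepSeek-R1 model pulled locally; the model tag
# and prompt wording are illustrative assumptions.
import requests

PROMPT = ("Classify the following public organization name into exactly one of: "
          "healthcare, administration, education.\n"
          "Name: {name}\nAnswer with the class only.")

def classify_locally(name, model="deepseek-r1:7b"):
    resp = requests.post("http://localhost:11434/api/generate",
                         json={"model": model,
                               "prompt": PROMPT.format(name=name),
                               "stream": False})
    resp.raise_for_status()
    # Reasoning models may prepend their chain of thought; a real pipeline
    # would extract the final answer line from the returned text.
    return resp.json()["response"].strip()

# e.g. classify_locally("Ministerstwo Edukacji Narodowej")  # -> "education"
```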

This reviewer is not familiar with the employed methods; it must be said that the methods from all three families (rule-based, LLM-based, embedding-based) are well explained here, in a way quite understandable to non-experts, which the reviewer points out as a major plus of this paper. The same is true for the discussion of the results in section 4.

Regarding the domain of expertise of this reviewer, the final conclusion (19/49f), which is that the presented methods could contribute to validation and refinement/enrichment of Wikidata content, is particularly interesting. Are the authors planning to take action in this regard, and/or present their work at some Wikidata event?

Minor issues in the text:

The name "Wikidata" is inconsistently spelled (sometimes in camelcase, WikiData). The reviewer suggests not to use camelcase for that name, since the Wikimedia Foundation also uses the spelling "Wikidata".

Line 2/31, word 1 "named" > to be deleted

Regarding the datasets published together with the paper:

(A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data,

YES, the data is accessible and well organized, though the README file could contain more detailed information (e.g., by referring to sections of the paper).

(B) whether the provided resources appear to be complete for replication of experiments

The reviewer cannot judge on that. It would be good if the employed Wikidata SPARQL queries were given in the repository, so that the experiments could be replicated with fresh and up-to-date data.

(C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete.

The dataset is available at Zenodo.

Review #2
By John McCrae submitted on 03/Dec/2025
Suggestion:
Major Revision
Review Comment:

This paper compares rule-based, natural language inference (NLI), and embedding-based methods for classifying public sector organization names. The authors apply these under low-resource conditions, using Wikidata for ground truth. The paper finds, as I would expect, that embedding-based classifiers produce the best overall results.

This work provides an interesting evaluation of some NLP methods and proposes an interesting and practical multilingual task, but its originality lies more in the application and analysis within this specific domain than in novel algorithms. It covers several contemporary state-of-the-art approaches, but there are some obvious gaps. For example, a few-shot prompting approach would, I guess, produce highly competitive results, and other methods such as contrastive learning (e.g., SimCSE) and few-shot learning (e.g., SetFit) would make for a more complete analysis. Further, representing the classes with knowledge graph embeddings would be an interesting direction to take this work as well.
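To make the SetFit suggestion concrete: such a baseline needs only a handful of labelled names per class. A minimal sketch, assuming the setfit v1.x API; the backbone model and the tiny training set are illustrative, not taken from the paper.

```python
# A sketch of the SetFit few-shot baseline suggested above; the backbone
# model and the toy training data are illustrative assumptions.
from datasets import Dataset
from setfit import SetFitModel, Trainer, TrainingArguments

train_ds = Dataset.from_dict({
    "text": ["Hôpital Saint-Louis", "Ministère de l'Éducation nationale",
             "Universitätsklinikum Heidelberg", "Uniwersytet Warszawski"],
    "label": ["healthcare", "administration", "healthcare", "education"],
})

model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    labels=["healthcare", "administration", "education"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(batch_size=16, num_epochs=1),
    train_dataset=train_ds,
)
trainer.train()

print(model.predict(["Szpital Uniwersytecki w Krakowie"]))  # -> ['healthcare']
```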

The authors also make a low-resource assumption, which is surprising to me given the very large amount of training data that could be extracted from Wikidata. If this constraint were relaxed, many methods could be investigated that would likely lead to much better performance. The value of using only a few examples over simply collecting more data is not clear to me.

Given that this task is primarily a classification task, I also think that the authors should look at methods used in other tasks. For example, entity alignment, where entities are matched to elements in a knowledge graph, addresses many similar problems. I also note some similarity to large-class named entity benchmarks such as FewNERD, and the literature on this dataset could help the authors find methodologies that are performant on this task.

The presentation of the results is quite unclear, with several tables having the same title and the results not aligned across different tasks. It is not clear, for example, whether Table 8 and Table 7 are comparable and why they use different models. The results are also presented as "maximum F1", but I am not sure what this is a maximum over. Maximum values can also vary substantially, so the results may not be reliable. There is also no significance analysis of the results, and the varying levels of precision (decimal places) make it all hard to follow. Competitive baselines are not presented, as mentioned above. Results should be organised into tables comparing relevant results, with significance analysis and relevant baselines.
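As an illustration of the kind of significance analysis being requested, a paired bootstrap test over the shared test set is a common choice; the following is a sketch with placeholder inputs, not a prescription of a specific protocol.

```python
# A sketch of a paired bootstrap test on macro-F1 for two systems scored on
# the same test set. All inputs (y_true, preds_a, preds_b) are placeholders.
import numpy as np
from sklearn.metrics import f1_score

def paired_bootstrap(y_true, preds_a, preds_b, n_boot=10_000, seed=0):
    """Return the observed macro-F1 gap (A minus B) and a one-sided p-value."""
    rng = np.random.default_rng(seed)
    y_true, preds_a, preds_b = map(np.asarray, (y_true, preds_a, preds_b))
    n = len(y_true)
    observed = (f1_score(y_true, preds_a, average="macro")
                - f1_score(y_true, preds_b, average="macro"))
    losses = 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample test items with replacement
        delta = (f1_score(y_true[idx], preds_a[idx], average="macro")
                 - f1_score(y_true[idx], preds_b[idx], average="macro"))
        if delta <= 0:
            losses += 1
    return observed, losses / n_boot  # small p-value: A reliably beats B
```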

Overall, this is an interesting task, however I feel that it needs to be better situated with respect to the methodology and other similar tasks. The results need to be much clearer to support the conclusions of this work.

Review #3
By Mariana Damova submitted on 18/Jan/2026
Suggestion:
Major Revision
Review Comment:

This paper presents an evaluation of several approaches for classifying European public organization names. The paper is very well written with respect to language and the description of the work carried out. However, there are several substantial shortcomings that have to be addressed by the authors. First, the paper lacks motivation regarding the importance of the topic. Second, it lacks related work. Third, while discussing three domains (medical, public administration, and education), the only examples in the paper concern the medical domain, and more precisely three terms: hospital, university hospital, and clinic. Fourth, while claiming to cover the 27 languages of all European countries, the paper fails to present and discuss evidence of experiments and results on them. There are mentions of languages rich in resources and languages poor in resources, but it is not made clear which ones are which. Further, the tables in the paper comment on 8 languages (cf. Table 6) without explaining the reason for quoting exactly this set of languages.

That said, the paper outlines substantial work in running experiments with different approaches and models covering three methodologies (rule-based, natural language inference, and embedding-based), together with a substantial analysis of the peculiarities of each of these approaches. It would aid understanding of the authors' arguments to incorporate the information from the annexes into the body of the paper and to discuss it. Regarding the figures in the annexes, some of them are unclear because of the small font size and practically unreadable; they should be revised. Finally, the conclusions are actually inconclusive: there is no clear statement as to which approach is recommendable, or which approach is to be preferred in specific contexts. It would benefit the conclusions of this extensive work if something along these lines were asserted.

The overall work conducted and presented in this paper displays a systematic and well-structured working logic and a solid background of the authors in language technologies. The paper would be of interest to practitioners dealing with language models and classifiers from a technical standpoint, while not being particularly original. The results of the comparison between the approaches can also be of use.

Review #4
By Michael Rosner submitted on 13/Feb/2026
Suggestion:
Major Revision
Review Comment:

ORIGINALITY
The paper concerns the general problem of developing robust methods for name-based entity classification, an NLP problem which the authors label as foundational, insofar as it touches many particular downstream tasks. One of these is the particular subject of the paper: automated classification of public sector organisations (PSOs) across EU States using only their official names. This has emerged following the authors' own involvement in a study on procurement at the EU level within the health sector, for which the classification of public service organisations is crucial.

Although there are good reasons behind the restricted nature of this particular case, it gives rise to a shortcoming that the authors do not dwell upon: the extent to which the conclusions reached are transferable to other use cases whose parameters differ. Examples would be the classification of named entities where the classification structure is richer and/or more universally accepted than in the case studied, for example in the medical field: names of diseases, anatomical nomenclature. The classification structure of the case chosen is an artefact of socio-political and linguistic choices, in contrast to, e.g., the names of drugs or chemicals, which bear a more objective relationship to the underlying reality.

Given the importance of the classification problem tackled, a clear diagram of the target class structures on p5, showing both nested and linear classes, would be better than the nested bullet points that appear now. The authors also need to explicitly state that the target class structure (as opposed to the entity instances) was obtained from Wikidata, and how this was done.

Nevertheless, even though the authors recognise the formidable challenges of the chosen use-case, as listed in section 1.1, neither the problem itself nor the approaches adopted are particularly original. The main novelty of the paper lies in the pipeline developed for systematically investigating solutions which harness "the leverage of structured knowledge bases".

QUALITY OF WRITING

The English is generally of high quality with only a few typos as listed below. However, there are some issues with the structure and organisation of the paper.

Thus, the introduction clearly indicates (p3) five challenges pertaining to the case at hand, which "the leverage of structured knowledge bases" is proposed to address, leading to the formulation of the two research questions investigated in this study: RQ1 - how effectively can NLP techniques paired with KG data distinguish between medical, government, and educational organizations? and RQ2 - how do naming conventions of PSOs exhibit semantic variation across different EU members, and how effectively can these variations be captured by KG resources for entity disambiguation? These are reasonable RQs but:

(i) What do the authors mean by "NLP techniques paired with KG data" and "variation captured by KG resources"? Two distinct interpretations are exploited in the paper: the use of Wikidata (a) to generate an annotated dataset suitable for entity classification (p6 line 35), and (b) to supply "class prototypes" used as a basis for comparison in embedding-based methods, as described in section 3.2.3 (p11 line 39; see the sketch after this list). These two uses (and there may be others) need to be more clearly distinguished. They each relate to different RQs, but that relation is quite subtle and not clearly stated. The authors should therefore more clearly distinguish these two usages of KG resources and expand on how the two research questions are derived from them.

(ii) The two RQs are clearly stated at the outset of the paper. However, they are not mentioned subsequently. The discussion and conclusion are indeed loosely connected to issues relevant to the RQs, but the impact on answers to the RQs themselves is lost in the discussion. Therefore, I would recommend that the material in the discussion and conclusion sections be restructured so that the RQs become the main organising principle for the results of the investigations.
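For reference, usage (b), comparison against embedded class prototypes, can be sketched as follows; the multilingual encoder and the prototype phrases are illustrative assumptions, not the paper's exact setup.

```python
# A sketch of classification by cosine similarity to embedded "class
# prototypes"; the encoder and prototype phrases are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

prototypes = {
    "healthcare": "hospital, clinic, health centre",
    "administration": "ministry, municipality, public agency",
    "education": "school, university, academy",
}
proto_emb = model.encode(list(prototypes.values()), convert_to_tensor=True)

def classify(name):
    name_emb = model.encode(name, convert_to_tensor=True)
    scores = util.cos_sim(name_emb, proto_emb)[0]   # similarity to each class
    return list(prototypes)[int(scores.argmax())]   # highest-scoring class

# e.g. classify("Ospedale San Raffaele")  # -> "healthcare"
```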

Section 3 provides a comprehensive description of the methods used. These are summarised in Fig. 1, whose structure (the column labelled "Parameter Optimisation" as well as the two rightmost columns) is confusing and needs further explanation. Moreover, a reference to Fig. 1 is missing in the text.

The authors should note that several other tables and figures appearing in the paper are not referred to in the text. The authors are urged to check for references to all tables and figures. Also, references to tables appearing in the annexes should be explicitly labelled as such.

There is some confusion in the overall organisation of sections 3, 4 and 5, with insufficient separation between methodology, review of relevant literature, results of experiments, discussion of those results, and future work. Thus, the description of experiments is very long and includes some results, whilst the results and conclusions sections occupy less than a page. Some of the material under conclusions would be better placed under future work. Some restructuring here would be desirable to increase the clarity and impact of the paper.

Many references in the bibliography are to arXiv preprints when full peer-reviewed references exist, e.g.

L. Wang, N. Yang, X. Huang, L. Yang, R. Majumder and F. Wei, Improving Text Embeddings with Large Language Models, arXiv preprint arXiv:2401.00368 (2024), https://arxiv.org/abs/2401.00368
=>
L. Wang, N. Yang, X. Huang, L. Yang, R. Majumder and F. Wei, Improving Text Embeddings with Large Language Models, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, 2024, pp. 11897–11916.

Authors need to check the entire bibliography for updated peer-reviewed references.

SIGNIFICANCE OF RESULTS

The results are interesting and reflect a considerable amount of work. However, the main conclusions concerning the three classes of solution methods investigated are somewhat limited. Although the authors have designed a useful experimental pipeline, the paper does not provide much insight into how to solve the harder use-cases (e.g. overall coverage of rule systems, preprocessing of morphologically rich language data, threshold tuning for NLI approaches). More detail on these problematic areas would boost the significance of what has been achieved.
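On the last point, even a simple grid search over the entailment-score cutoff (below which the classifier abstains) on a development set would be informative. A sketch with placeholder inputs:

```python
# A sketch of threshold tuning for an NLI classifier: sweep the cutoff on
# the top-class entailment score below which the system abstains, keeping
# the value that maximises macro-F1 on a development set. All inputs
# (y_true, top_labels, top_scores) are placeholders.
import numpy as np
from sklearn.metrics import f1_score

def tune_threshold(y_true, top_labels, top_scores, grid=np.linspace(0.3, 0.9, 13)):
    classes = sorted(set(y_true))        # score only over the real classes
    best_t, best_f1 = None, -1.0
    for t in grid:
        preds = [lab if s >= t else "unknown"   # abstain under the cutoff
                 for lab, s in zip(top_labels, top_scores)]
        f1 = f1_score(y_true, preds, average="macro", labels=classes)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```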

TYPOS

p2 line 31: named developed => developed
p2 line 42: co-oficial => co-official
p4: NLI acronym should be defined earlier
p4 line 41: languages, we focus => languages focuses
p9: metods => methods
p11 line 40: its => it is
p11 line 42: chosen, organisation => chosen organisation
p12: conceptual dispersion the capacity => conceptual dispersion and the capacity
p12: we also supports => we also supported
p12: due to their absence => due to its absence
p13 line 51: the Github => GitHub
p14 line 22: variability. In comparison => variability, in comparison
p15: wikidata => Wikidata (check all occurrences)