Bilingual dictionary generation and enrichment via graph exploration

Tracking #: 2899-4113

Shashwat Goel
Jorge Gracia
Mikel Lorenzo Forcada

Responsible editor: 
Guest Editors Advancements in Linguistics Linked Data 2021

Submission type: 
Full Paper
In recent years, we have witnessed a steady growth of linguistic information represented and exposed as linked data on the Web. Such linguistic linked data has stimulated the development and use of openly available linguistic knowledge graphs, as it is the case or Apertium RDF, a collection of interconnected bilingual dictionaries represented and accessible through Semantic Web standards. In this work we explore techniques that exploit the graph nature of bilingual dictionaries to automatically infer new links (translations). We build upon a cycle density based method: partitioning the graph into biconnected components for a speedup, and simplifying the pipeline through a careful structural analysis that reduces hyperparameter tuning requirements. We also analyse the shortcomings of traditional evaluation metrics used for translation inference and propose to complement them with new ones, both-word precision (BWP) and both-word recall (BWR), aimed at being more informative of algorithmic improvements. On average over twenty-seven language pairs, our algorithm produces dictionaries about 70% the size of existing Apertium RDF dictionaries at a high BWP of 85% from scratch within a minute. Human evaluation shows that 78% of the additional translations generated for enrichment are correct as well. We further describe an interesting use-case: inferring synonyms within a single language, on which our initial human-based evaluation shows an average accuracy of 84%. We release our tool as a free/open-source software which can not only be applied to RDF data and Apertium dictionaries, but is also easily usable for other formats and communities.

Solicited Reviews:
Review #1
By John McCrae submitted on 25/Oct/2021
Review Comment:

I think this paper is in a very good state and the authors have taken into account the comments well.

A few minor issues I found in this reading:
p1. l23 "as it is the case of" => "as is the case for"
p1. l29. "on average" reads oddly. I would remove it.
p3. l6. "time response"... I guess you meant "response time" but you should probably just say "computation time" or "execution time"
p22 l19. I think it should be "(m)" after masculine

Review #2
Anonymous submitted on 22/Nov/2021
Review Comment:

In the revised version of the paper, the authors have properly addressed the reviewers' comments, clarifying the content where needed.
Minor remark: p2, c2, l47: pachina -> panchina

Review #3
By Basil Ell submitted on 28/Nov/2021
Review Comment:

Dear authors,

thank you for your detailed and helpfully clarifying responses to my review (review #2). The new submission of your paper is a great improvement. There are only minor points that I'd like to mention:

p1, abstract. "the case or Apertium RDF" -> "the case of Apertium RDF"

p2, column 1, line 41. "evaluation methods that are more reflective". Although I have a rough idea what you mean by the term "reflective", there might be a better term or a more detailed explanation; alternatively, simply remove that term from the abstract.

p2, column 2, Fig. 1. "Apertium RDF graph". What is shown here is not an RDF graph, but instead a graphical visualization of the language pairs within the Apertium RDF graph and their interconnectedness. The same holds for the graphs shown in Fig. 3 and Fig. 4: these are not RDF graphs.

p5, column 2, footnote 12. This is not a full sentence.

p12, column 2, lines 27 & 28: "While the in-production Apertium language pairs that are used in RDF". Maybe drop "that are used in RDF"?

p18, Fig. 6. One could remove the legend, as the labels occur below the plots. Also, one could remove the colors, as they do not provide additional information.