Assessing deep learning for query expansion in domain-specific Arabic information retrieval

Tracking #: 2026-3239

Authors: 
Wiem Lahbib
Ibrahim Bounhas
Yahya Slimani

Responsible editor: 
Guest Editors Semantic Deep Learning 2018

Submission type: 
Full Paper
Abstract: 
In information retrieval (IR), user queries are generally imprecise and incomplete, which is challenging, especially for complex languages like Arabic. IR systems are limited by the term-mismatch phenomenon, since they employ models based on exact matching between documents and queries to compute the required relevance scores. In this article, we propose to integrate domain terminologies into the Query Expansion (QE) process in order to improve Arabic IR results. Thus, we investigate two categories of corpus representation models, namely (i) word embeddings and (ii) graph-based representations. In the first category, we compare Latent Semantic Analysis (LSA) with neural deep learning-based models (i.e., Skip-gram, CBOW and GloVe). In the second, we build a co-occurrence-based probabilistic graph and compute similarities with BM25. To evaluate our approaches, we conduct multiple experimental scenarios. All experiments are performed on a test collection called Kunuz, whose documents are organized into several domains. This allows us to assess the impact of domain knowledge on QE. According to multiple state-of-the-art evaluation metrics, results show that incorporating domain terminologies into the QE process outperforms the same process without terminologies. Results also show that deep learning-based QE enhances recall.
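To make the expansion step described in the abstract concrete, here is a minimal, generic sketch of embedding-based QE in Python; the names (expand, embeddings, vocab) and the whole setup are illustrative assumptions, not the authors' actual code:

    import numpy as np

    def cosine(u, v):
        # Cosine similarity between two term vectors.
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def expand(query_terms, embeddings, vocab, k=5):
        # Append to the query the k terms most similar to each query term.
        # `embeddings` maps a term to its vector; `vocab` is a set of
        # candidate terms (e.g. a domain terminology). Both are assumed
        # inputs for illustration only.
        expansion = set()
        for q in query_terms:
            if q not in embeddings:
                continue
            scored = [(t, cosine(embeddings[q], embeddings[t]))
                      for t in vocab if t != q and t in embeddings]
            scored.sort(key=lambda ts: ts[1], reverse=True)
            expansion.update(t for t, _ in scored[:k])
        return list(query_terms) + sorted(expansion)

The expanded query is then scored against documents by the underlying retrieval model exactly as the original query would be.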
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 09/Oct/2018
Suggestion:
Minor Revision
Review Comment:

The paper describes a Query Expansion strategy for the Arabic language. The task is particularly challenging, as Arabic shows a high level of ambiguity. The authors tackle it using multiple methods and resources.

The paper has a very rich and strong experimental section, with numerous linguistic and methodological insights.

Compared to the previous version, the paper has been radically improved. Redundancy has been removed, the sections are organized sensibly, and the text went through a careful linguistic revision. There are still some layout flaws, such as the distance between text and tables (see, for example, below Table 1), equation (3), and the differing fonts and positions of the table and figure captions. These need to be addressed before publication.

I am still a bit skeptical about whether the use of word embeddings is enough to fit the Deep Learning issue, but, other than this, I support the paper's publication with minor revisions.

Comments:
- Is P@0 ok? Shouldn’t it be P@1?
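For reference, precision at cutoff k is conventionally defined (in LaTeX notation) as

    P@k = \frac{|\text{relevant documents among the top } k \text{ retrieved}|}{k}

which is undefined for k = 0 (division by zero), supporting the reading P@1.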

Review #2
Anonymous submitted on 30/Oct/2018
Suggestion:
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

The paper focuses on the usage of deep learning in the task of query expansion for Arabic domain texts.
The authors aim to prove that the incorporation of domain-specific terminology improves the results and that deep learning techniques improve the relevance of the retrieved data.

The authors provide a thorough investigation of the influence of domain terminology on the relevance of the retrieved data. They present experiments with various scenarios, such as with/without expansion and with/without terminology. For similarity estimation, the following methods have been tested: LSA, a probabilistic graph-based model, and deep neural learning.

I see some novelty in the following respects: the specific language and the experiments where the target domain is detected automatically. I also see the comparison of the graph-based methods with the neural ones as an advantage.

The paper is now organized in an understandable way. The related-works section is informative with regard to the task. The related works, the inclusion of terminologies into the query expansion process, the preprocessing, and the results are well exemplified and discussed in detail. Also, the survey is presented in a comparative way, which is an advantage. The authors share the successful experimental settings, but they also describe the problematic issues, such as the sources of noise, the compatibility of the various learning methods with the data characteristics, etc.
I think that the results and conclusions have added value in suggesting appropriate strategies for introducing terminologies into query expansion, towards better IR for other languages and domain-specific tasks.

The results are systematically presented with an indication of whether or not they are significant, as well as with explanations of the reasons for these results.

However, before being published, the text still needs a good proof-reading, especially in some parts: the abstract; Sections 4.1 and 4.2, whose first paragraphs have the same text (4.1 should be deleted as it is now, and 4.2 should become 4.1); two phrases for the authors to use consistently throughout the text, namely 'consists OF' and 'ON the one/other hand'; and the section on the impact of corpora size.

Some other comments/suggestions:

- Why is Arabic called a complex language in the abstract? In what sense is it complex? I would use another term, like 'morphologically rich', or the phrasing 'especially for' might be replaced by 'including languages like Arabic'.
- section 2: 'Sentences do not follow an exact structure as in French or English'. I would suggest 'a fixed word order' instead of 'an exact structure'
- what do the terms 'general' and 'semi-general' mean in Table 1?
- For the last sentence before Section 7: why is a filtering step, such as the usage of a general corpus, not applied to disregard the highly frequent words as non-domain terms?

Review #3
By Dagmar Gromann submitted on 28/Nov/2018
Suggestion:
Minor Revision
Review Comment:

(1) Summary
This article investigates the impact of incorporating domain terminologies, in the form of vector representations, into query expansion. To this end, several word embeddings as well as co-occurrence-based graph representations are evaluated on a collection of documents spanning several domains. An improvement can be observed when using domain-specific terminology, and even more so when using elaborate word representation methods. The two most interesting aspects are the application of the method to Arabic and the task-specific comparison of word embeddings with other word representation methods.

(2) Overall evaluation:
The comparison offered by the article and the specific application to the Arabic language are interesting and good contributions. However, the current style of writing and missing information make it difficult to follow the paper. Several details specified below in the detailed evaluation section need to be clarified, such as the corpus used for training the embeddings, how exactly they are trained, and the mismatch between the presentation of the results in the abstract and introduction and the actual results section. Nevertheless, most of these issues can be resolved with careful editing and proof-reading, which is why this should be a matter of minor revisions, given that the overall results are provided.

(3) Quality of writing
While the approach is very interesting and the paper is quite informative, this paper desperately requires thorough proof-reading by someone (near-)native in English. There are numerous typos and basic errors that lower the quality of the writing, and a whole section is provided twice, as 4.1 and 4.2. Several further examples can be found below in the minor comments section. Compliance with the template needs to be ensured, and academic citation rules should be followed. If there is more than one author, it is "Abbache et al. [1]" and not "Abbache [1]"; please correct this for the whole paper.

(4) Evaluation of the State-of-the-Art
Related work is detailed and structured in a convincing manner, even though I have to admit I am no expert on Query Expansion works specific to Arabic and its language resources.

(5) Detailed evaluation
- Introduction:
The improvement of the IR results does not necessarily entail the quality of the extracted terms, and especially not the quality of the terminology. Terminologies as linguistic resources are usually not just a random collection of terms but an organized, frequently onomasiological language resource. Also, the quality of the terms would need to be specified: what exactly do you mean by "accurate"? In terminology science there is something called "linguistic accuracy", which refers to the compliance of a given term with the morphological, syntactic, and phonological conventions of a language. However, the actual quality of the term in terms of consistency, transparency, and conciseness cannot be ensured this way. That is, whether the extracted words indeed represent adequate terms of the domain or only partial terms is difficult to determine this way.

- Arabic language specificities:
Interesting section on term variation in the Arabic language.

- Related work:
This section presents a large collection of QE approaches specific to the Arabic language and its language resources, as well as an interesting comparison. However, "Discussion" is conventionally a section that comes at the end, before the conclusion, and provides a detailed discussion of the results obtained in the paper itself and a comparison with results obtained by others. It is generally not a subsection of Related Work where the proposed approach is demarcated from other approaches; that demarcation is usually just part of the Related Work section.

- Terminology-based QE:
Which terminologies have you already built? How have you built them? Which domains?
In terms of terminology extraction, you represent the whole terminology with one vector? This is what it says at the moment. If so, please make the generation of this one vector for a whole terminology more explicit: how exactly is it learned?

Also, please specify the "text mining, graph-mining and deep learning techniques" that you use for creating candidate terminologies. Please do not separate pages by a table (i.e., Table 2), because this distorts the whole typesetting. In fact, this strange separation of pages makes it almost impossible to follow Section 4.5: which formula belongs to which part? "The similarity between a query q and a domain d is computed as follows:" is followed by equation (3) but should probably be followed by (2), which then suddenly appears at the beginning of the next column; this needs to be fixed urgently.

The organization of this section needs attention. It seems that you specify the terminology extraction methods in Section 4.5, which should then be a subsection of 4.3.

Please specify the proper similarity metric formula. Which type of correlation do you use? This cannot be determined from the description. The same goes for cosine similarity: if you want to provide an equation, do it properly.
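For reference, the standard form the authors could give (in LaTeX notation) is

    \cos(q, d) = \frac{q \cdot d}{\|q\| \, \|d\|} = \frac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2} \sqrt{\sum_i d_i^2}}

for two term vectors q and d.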

Provide the ITF formula as an equation, just like the others. Reference the equations in the text to make each reference clear. Saying that k1 and b are the "usual parameters" is not enough; please explain what they stand for. The most important question of this graph-mining section is: how is the graph actually built? This needs to be described appropriately.
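For reference, in the standard Okapi BM25 ranking function (in LaTeX notation)

    \mathrm{score}(q, d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1 (1 - b + b \cdot |d| / \mathrm{avgdl})}

k_1 controls term-frequency saturation and b controls document-length normalization, with typical defaults around k_1 = 1.2 and b = 0.75; spelling this out in the paper would answer the question.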

Stating that you train word2vec on a "large number of areas" is not enough: which corpora did you use? Did you train separate models for each area? How was the terminology vector finally produced? How did you obtain the GloVe embeddings?
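To illustrate the level of detail needed, a minimal per-domain training setup in Python with gensim might look as follows; the toy corpus and all hyperparameter values are assumptions for illustration, not the authors' actual configuration:

    from gensim.models import Word2Vec

    # Hypothetical input: tokenized sentences for one Kunuz domain.
    domain_sentences = [["example", "tokenized", "sentence"],
                        ["another", "example", "sentence"]]

    model = Word2Vec(
        sentences=domain_sentences,
        vector_size=100,  # embedding dimensionality
        window=5,         # context window size
        min_count=1,      # toy corpus; raise (e.g. to 5) on real data
        sg=1,             # 1 = Skip-gram, 0 = CBOW
        epochs=10,
    )
    model.save("kunuz_domain.w2v")

Reporting exactly these choices (corpus, per-domain vs. global training, submethod, and hyperparameters) would make the experiments reproducible.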

- Experimental Protocol:
Is the POS filtering reasonable for Arabic? For English, for instance, terms frequently also contain prepositions, such as "people with history" in a medical domain or "by law" in several domains, including the legal one.

- Results:
The highlighting in Table 8 is unclear. While BM25 obtains the highest result of 0.91 in the second section of title + description queries, the 0.88 values are highlighted. Why is that?

The marginal improvement that can be achieved by using additional terminology cannot be used to talk about "outperforming". In fact, it seems that the word representation method has a much larger effect on the whole process than the use of domain-specific terminologies in the expansion process, which does not seem to lead to a large improvement. If I have misunderstood, I suggest making the results clearer. If not, it is necessary to change the argumentation in the abstract and introduction.

To understand the comparison with the reference terminology, the results of this experiment need to be quantified. Also, the "built terminologies" need to be clarified, as mentioned above.

(6) Minor comments in order of occurrence
"into Query Expansion (QE) process" => into Query Expansion (QE) process
"corpus representation models" => word representation models
Skip-gram, CBOW => usually the frameworks for training are compared, that is, word2vec and GloVe, not the specific submethods, since Skip-gram and CBOW are also used in FastText etc.
"we build cooccurrences-based probabilistic graph" => a cooccurrence-based (with article and without "s")
"which documents are organized through several domains" => which provides documents in several domains
"state-of-the art" => state-of-the-art
"in QE process" => "into the QE process"
"a critic situation" => a critical situation
gendering => user is not automatically "he"; use s/he or they
"Bank" => bank (lowercase, otherwise example does not work, that is, "rely on") - change everywhere
"information more than just relying on the original query" => information with more than just
"to more relevant documents" => full stop missing
Relevance feedback (RF) => Feedback
sub set => subset
Wikipedia is an encyclopedia and not a lexicon (this sentence needs commas in the right places)
terms similarity => term similarity
Deep Learning, is able => no comma
"In the other hand, we compare" => the linker would be "On the other hand" but is the wrong one for this sentence; maybe "Furthermore,"?
"their ability to incorporate terminologies in Arabic QE" => those models do not incorporate anything but only represent words - you want to evaluate their ability to improve results when being incorporated into QE right?
"implication of domain terminologies in a QE process" => incorporation not implication
"extracting directly expansion terms" => extracting expansion terms directly
"accords more" => ?
"in the web" => on the Web
"ambiguity manifests" => manifests itself
"when associating to a single lexical unit several solutions" => when associationg several solutions that may .... to a single lexical unit
"returns to the richness" => ?
quotes are marked by quotation marks and not italics unless italics occur in the original text; also please encode quotation marks correctly, which is `` and '' in LaTeX (in the entire paper)
"reduced silence" => what does "silence" mean here? not self-evident
deduced => deducted (but I presume this is not the verb you actually want to use here)
"make possible" => make it possible
"of Islamic ontology" => of the Islamic ontology
Otair et al., [34] => no comma
To test their approach, Authors used => the authors
Authors reported => The authors
equals to 66% => equal to
"to add to a query words having the same lemmas or root" => "to add words to a query that have the same lemmas or roots"
"an approach that exploit" => exploits
"light stemmer." => no full stop in Table 1
From table 1 => Table 1
"existent works" => existing
majority of the work => works
section 4.1 => Section 4.1
"then number" => the number
footnote 5 has encoding problems (different font!)
"given that as" => no as
specifies => particularities
cf. figure 1 => see Fig. 1
please use separation markers for numbers, i.e., 10,761 words not 10761 words
too many to list them all