Semantic Enrichment of Hadith Corpus - Knowledge Graph Generation from Islamic Text

Tracking #: 3651-4865

Authors: 
Nigar Azhar Butt
Amna Basharat
Amna Binte Kamran

Responsible editor: 
Guest Editors KG Gen from Text 2023

Submission type: 
Full Paper
Abstract: 
Knowledge graphs from text have garnered substantial interest across various domains due to their potential to facilitate efficient information retrieval and knowledge exploration. However, knowledge graph generation from textual sources presents unique challenges, particularly in the Islamic domain, where primary sources of knowledge are texts in Arabic, which exhibit complex linguistic and cultural nuances. This paper presents a comprehensive methodology for generating a knowledge graph from the hadith corpus. Hadith, a fundamental resource in the Islamic domain, stands as one of the primary sources of Islamic legislation, encompassing the sayings, actions, and silent approvals of the Prophet Muhammad ﷺ. Leveraging Natural Language Processing techniques, we systematically extract, annotate, and interlink semantic entities and relationships from the hadith corpus, extend the SemanticHadith ontology for entity organisation, and compute textual similarities to establish semantic connections. We generate a comprehensive knowledge graph by applying these methods to six hadith collections, facilitating efficient information retrieval and knowledge exploration in the Islamic domain. This is an essential step towards annotating and linking the hadith corpus to allow semantic search to support scholars or students in creating, evolving, and consulting a digital representation of Islamic knowledge. The SemanticHadith knowledge graph is freely accessible at http://www.semantichadith.com.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 16/Jun/2024
Suggestion:
Accept
Review Comment:

Summary
The paper presents a detailed methodology for generating a knowledge graph from the hadith corpus, aiming to enhance the accessibility and interoperability of Islamic knowledge resources. It builds upon the SemanticHadith Ontology, introducing an updated version to better model entities and topics within hadith texts. The methodology incorporates meticulous data selection, natural language processing (NLP)-based entity extraction, and semantic modelling to facilitate the seamless exploration and retrieval of Islamic knowledge in the digital age.

Introduction Section

Strengths

1.Relevance and Importance: The section emphasizes the significance of structured knowledge representation using Knowledge Graphs (KGs) and highlights the underutilization of this technology in the Islamic knowledge domain. This is a relevant and important issue, addressing a clear gap in the field.
2.Context and Background: The introduction effectively sets the stage by providing context about the role of KGs in various domains and the specific challenges associated with the Islamic knowledge domain. This helps readers understand the motivation behind the research.
3.Clear Problem Statement: The section clearly outlines the problem: the lack of comprehensive semantic modelling for hadith texts and the need for better integration of Islamic knowledge into the semantic web ecosystem.
4.Methodology Overview: It briefly outlines the methodology, including data selection, NLP-based entity extraction, semantic modelling, and knowledge graph construction. This gives a high-level view of the approach without overwhelming details.

Potential Errors or Shortcomings

1.Generalization Issues: The section claims too broadly that existing hadith studies focus primarily on specific domains, such as prophetic medicine and the chain of narrators, and it does not acknowledge other studies in the domain.
2.Lack of Detail on NLP Techniques: The mention of NLP-based entity extraction could be more explicit. Specific techniques, tools, or methodologies employed for NLP tasks are not described.
3.Assumptions and Limitations: There is no discussion of the methodology's assumptions or potential limitations. For instance, challenges related to linguistic diversity, dialects, and the varying authenticity of hadith texts could be briefly mentioned.
4.Comparative Analysis: The section lacks a comparative analysis with other methodologies or approaches. It needs to discuss why the proposed approach is expected to be more effective than existing ones.
5.Impact on Users: While the section mentions the potential benefits of research and knowledge discovery, it could better articulate the practical implications for end-users, such as scholars, students, or the general public.
6.Citation Accuracy: Ensure that all claims, especially those about the current state of research and previous work, are properly cited. For example, the section references multiple studies [1-16] without clear attribution to specific claims.

Background Context and Motivation Section

Strength: Overall, the "Background Context and Motivation" section effectively highlights the significance of hadith and the need for formalized semantic modelling.

Additional Recommendations
1.Incorporate More Examples: Including more examples throughout the section would enhance clarity. For instance, provide examples of hadith that elaborate on specific Quranic verses or explain Islamic concepts.
2.Elaborate on Computational Techniques: While the section mentions various computational techniques (e.g., NER, NLP, similarity computations), a brief explanation of how these techniques are applied in the context of hadith literature would be beneficial.
3.Address Future Directions: Briefly touching on potential future directions or research opportunities in the semantic modelling of Islamic knowledge would provide a forward-looking perspective.

Methodology Section

1.Strength: Clarity and Coherence: The section provides a clear overview of the approach. However, it would benefit from a summary or a flowchart that visually represents the stages mentioned (in particular, placing Figure 3 directly here would give immediate context).
Data Selection and Acquisition
1.Unicode Conversion: While the conversion to Unicode is mentioned, the exact methodology or tools used for this conversion should be briefly detailed to provide transparency and reproducibility.
2.Data Description: The description of the corpus as comprising 34,458 hadith is clear, but explaining how the data was curated, or stating any inclusion/exclusion criteria, would add depth.
NLP-based Custom Knowledge Extraction
1.Model Training: The use of spaCy and transfer learning is appropriate. However, details on the training dataset size, epochs, and performance metrics (like precision, recall, and F1-score) would enhance understanding.
2.Expert Validation: Expert validation is crucial. It would be beneficial to provide examples of how discrepancies were resolved or the extent of expert involvement (e.g., the number of experts and their qualifications).
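To make the requested metrics concrete, the following is a minimal sketch of entity-level precision, recall, and F1 under exact span-and-label matching. It is not the authors' evaluation code: the function name `prf1`, the `(start, end, label)` span format, and the example labels are assumptions made purely for illustration.

```python
def prf1(gold, predicted):
    """Entity-level precision, recall and F1 over (start, end, label) spans.

    A prediction counts as correct only if both its boundaries and its
    label exactly match a gold annotation (strict matching).
    """
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical annotations for one hadith text (offsets and labels invented).
gold = [(0, 5, "PERSON"), (10, 15, "PLACE"), (20, 26, "ANGEL")]
pred = [(0, 5, "PERSON"), (10, 15, "PERSON"), (20, 26, "ANGEL"), (30, 34, "BOOK")]

precision, recall, f1 = prf1(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Reporting these figures per entity label, alongside the training set size and number of epochs, would let readers judge which concept types the model handles reliably.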
Similarity Computation and Interlinking of Hadith
1.Similarity Metrics: Using cosine similarity and pre-trained sentence transformers is appropriate. However, mentioning alternative methods considered or benchmark comparisons would add depth.
2.Expert Validation in Similarity: Elaborate on the process of expert validation in similarity computation. How many pairs were validated, and what criteria were used?
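For readers less familiar with the metric, the sketch below shows how cosine similarity over sentence embeddings can be turned into candidate hadith links. It assumes the embeddings have already been produced by some sentence encoder; the vectors, the hadith identifiers, the helper names, and the 0.8 threshold are all hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_pairs(embeddings, threshold):
    """Return (id_a, id_b, score) for every unordered pair scoring at or above threshold."""
    ids = sorted(embeddings)
    pairs = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            score = cosine_similarity(embeddings[a], embeddings[b])
            if score >= threshold:
                pairs.append((a, b, score))
    return pairs


# Toy 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
embeddings = {
    "h1": [1.0, 0.0, 1.0],
    "h2": [1.0, 0.0, 1.0],
    "h3": [0.0, 1.0, 0.0],
}
pairs = similar_pairs(embeddings, threshold=0.8)
print(pairs)
```

The all-pairs loop is quadratic in the number of hadith, and the threshold directly controls how many links are asserted, which is why the validation criteria and the number of validated pairs matter.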
Additional Recommendations
1.References: Verify all references are up-to-date and correctly cited. For instance, [12], [28], [34], [35], [36], [37], [38], and [39] should be cross-checked for accuracy and relevance.
2.Consistency: Maintain consistency in terminologies, such as "Hadith" vs "hadith" and "Sahih Bukhari" vs "Sahih al-Bukhari."

NLP Methodology for Entity Extraction Section

Strength: Overall, the section provides a comprehensive overview of the NLP methodology for entity extraction from the hadith corpus.

Additional Recommendations

1.Collaboration with Domain Experts: While the involvement of domain experts is noted, it would be helpful to provide more details on the expertise of these individuals and the extent of their participation (e.g., how many experts were involved and what their specific contributions were).
2.Customization Details: The modifications to the dataset are described in general terms. Including specific examples of the added concepts (e.g., holy books and angels) and of the removed labels would make this section more concrete and informative.
3.Transfer Learning: The explanation of transfer learning techniques is repetitive. Consolidating this information into a single, clear statement would improve readability.
4.Consistency: Ensure consistency in terminology and formatting throughout the section. For example, use either "NER model" or "Named Entity Recognition model" consistently.
5.References: Verify all references are up-to-date and correctly cited. Cross-check references [28], [29], and [40] for accuracy and relevance.

Design and Development of the Extended SemanticHadith Ontology Section

Strength: The section provides a comprehensive overview of the design and development process for the SemanticHadith ontology. It covers conceptual knowledge modelling, scope definition, reuse of existing ontologies, ontology design, modelling decisions, and integration and implementation.

Additional Recommendations
1.Interoperability: The emphasis on interoperability is good. However, the section could elaborate on any challenges faced while integrating these ontologies and how they were addressed.
2.Examples and Illustrations: Including more examples and illustrations throughout the section would enhance clarity and engagement. For instance, showing a sample ontology class with its properties and relations would be helpful.
3.Consistency: Ensure consistency in terminology and formatting throughout the section. For example, "RootNarrator" and "HadithNarrator" should be used consistently.
4.References: Verify all references are up-to-date and correctly cited. Cross-check references [10], [12], [13-15], [35], [36], [39], [41-43], [45-47], [48-52] for accuracy and relevance.

Results and Discussion Section

Strength: The "Results and Discussion" section is comprehensive and covers various aspects of the evaluation and application of the SemanticHadith ontology.

Additional Recommendations
1.Examples and Illustrations: Including more examples and visual aids throughout the section would enhance clarity. For instance, screenshots of the ontology in Protégé or WebVOWL, examples of SPARQL queries, and visual representations of knowledge graph integrations would be useful.
2.Consistency and Formatting: Ensure consistency in terminology and formatting throughout the section. For instance, terms like "SemanticHadith ontology," "knowledge graph," and "annotation" should be used consistently.
3.References: Ensure that all references are up-to-date and correctly cited. Cross-check references [51], [55], [56], and [57] for accuracy and relevance.

Review #2
By Md Kamruzzaman Sarker submitted on 12/Sep/2024
Suggestion:
Minor Revision
Review Comment:

This paper discusses a methodology for creating a knowledge graph from religious text, specifically the Islamic hadith texts. The authors have created a conceptual model for the ontology, provided instances to create version 2 of the SemanticHadith ontology, and then used NLP techniques to identify the instances and relationships needed to create the knowledge graph.

From the paper, it is not clear how many experts verified the results of the NLP techniques. The paper mentions that experts were involved, but it does not state the number of experts, their background, or how ties were broken when their decisions diverged.
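As one possible way to document tie-breaking, the sketch below applies a simple majority vote and flags ties for escalation. This is a hypothetical protocol offered for illustration, not a description of what the authors did; the function name `adjudicate` and the labels are invented.

```python
from collections import Counter


def adjudicate(labels):
    """Return the majority label, or None when there is no strict majority winner.

    A None result signals that the item should be escalated, for example
    to an additional expert or to a discussion round.
    """
    if not labels:
        return None
    counts = Counter(labels).most_common()
    # A tie exists when the two most frequent labels have equal counts.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


# Hypothetical expert judgements for two extracted entities.
print(adjudicate(["PERSON", "PERSON", "PLACE"]))  # clear majority
print(adjudicate(["PERSON", "PLACE"]))            # tie, needs escalation
```

Using an odd number of experts reduces, but does not eliminate, ties when more than two labels are possible.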

The SPARQL endpoint of the website (http://www.semantichadith.com/) is working; however, the faceted browser is not. It would be good to restore it. It currently shows: Error 40004
SR174: Log out of disk : Problem writing to the transaction log

Another problem I found is that the paper relies heavily on the SemanticHadith ontology, though this makes sense given that not many hadith collections are available in ontology format.

The paper is well written. The methodology used by the authors appears to consist of standard knowledge engineering techniques, such as NLP and Semantic Web technologies.

(1) originality - This paper's work in the Islamic hadith collection context is original, but not for general knowledge graph development.

(2) significance of the results - A knowledge graph of the hadith collections may enable automated reasoning over Islamic rules and regulations, which can be explored further.

(3) quality of writing is good.

Review #3
Anonymous submitted on 03/Oct/2024
Suggestion:
Minor Revision
Review Comment:

Summary:
The paper presents a comprehensive methodology for generating the SemanticHadith knowledge graph from the hadith corpus, a significant source of Islamic knowledge, and addresses the unique challenges posed by the linguistic and cultural nuances of Arabic texts. The hadith corpus is essential for Islamic legislation and understanding but has not been fully integrated into the semantic web ecosystem, especially in fields like biomedicine and education. The paper emphasizes the importance of structured knowledge representation for efficient information retrieval and exploration, highlighting the role of knowledge graphs (KGs) in organizing and interlinking information. The methodology involves data selection, Natural Language Processing (NLP)-based entity extraction, conceptual knowledge modeling, similarity computation, and interlinking with external data sources. The SemanticHadith ontology organizes entities and topics within the hadith texts, facilitating semantic search and exploration. The authors also address challenges in semantic modeling, such as variability in Arabic naming conventions and the need for expert validation in entity mapping. The knowledge graph is publicly accessible, promoting further research and collaboration in the field.

Strengths:


1. The authors present a comprehensive methodology for knowledge graph generation, involving stages such as data selection, NLP-based knowledge extraction, conceptual modeling, and similarity computation.

2. They use advanced Natural Language Processing (NLP) techniques, such as Named Entity Recognition (NER), to enhance entity extraction and relationship identification in hadith texts.

3. Expert validation is also incorporated to ensure reliability in textual authenticity and contextual relevance.

4. The paper emphasizes interoperability and standardization by leveraging existing vocabularies like Schema.org and DBpedia, ensuring the knowledge graph can be effectively utilized across different platforms and applications.

However, the paper could improve by discussing the following questions:
1. How does the complexity of computing similarity between Hadith pairs impact the accuracy of similarity scores? What approaches can be used to avoid inflated results?
2. What methods can be used to determine and quantify contextual relevance in Hadith similarity computation?
3. Since this is a resource paper, what scalability challenges would arise from the goal of encompassing a wide range of Islamic resources? How can the authors' methodology be adapted to handle the vastness of the Hadith corpus and related literature?
4. How does the crowdsourcing framework for expert validation contribute to the entity mapping process in Hadith analysis? Further details on its implementation and potential impact should be discussed.