Review Comment:
In their article, the authors present ChronoGrapher, an approach for the automated creation of event-centric knowledge graphs. Since the semantic modelling of events is a challenging task required for the creation of event knowledge graphs, the topic is highly relevant. The authors show how to extract and model relevant information from generic knowledge graphs, evaluate ChronoGrapher, amongst other things, through a RAG-based user study, and make the code to re-run the experiments available. However, I see several issues with the clarity of the goal and problem statement, the evaluation, and the methodology, as listed in the following.
1. Clarity of goal: The general goal of the KG extraction process, specifically the graph traversal, is not described clearly. For example, what is referred to as “step 1” (Section 1), “RQ1”, or the “first component” (Section 3.2) is variously described in Section 1 as “searching cues in a massive memory”, “event subgraph extraction”, “How to extract relevant events from a generic KG?” and “extracting relevant content from the generic KG”, and in Section 3.2 as “relevant content extraction” and “Informed graph traversal”.
2. Related Work: In the first paragraph, the connection between “(i)”, “(ii)”, and “(iii)” is rather unclear, specifically for “(ii)”. In “(i)”, a stronger motivation should be given for distinguishing between “manual” and automatic KG creation (see also my comment 7). The connection of the different methods in (ii) to the task of event-centric KG creation should be made more explicit. In (iii), the impression is given that RAG helps to surface the implicit knowledge in LLMs, while it rather does the opposite (it explicitly injects external knowledge).
3. Problem statement: Section 3, and maybe also Section 3.2, could benefit from a more formal problem statement (see the sketch after this list for the kind of formalisation I have in mind). For example, I struggle with “searches all sub-events”: if it searches for all of them, why is any filtering needed? And doesn’t the graph also contain non-events such as persons? Even in the case of events, are they all sub-events (e.g., what about preceding events)?
4. Definitions: Several definitions lack motivation (specifically Definitions 8 and 11) and perhaps intuitive examples (Definition 8), and some definitions are unclear: if, according to Definition 5, expanding a node means finding all its ingoing and outgoing nodes (Definitions 3 and 4), why does Definition 6 suddenly turn this into a ranking problem (and what do the scores even mean in this context)?
5. Filters: The role of the filters should be motivated more clearly. At first, it seems counterintuitive to skip locations and persons, since these entities typically are the core elements of an event (I assume it means that these nodes are not further expanded but are still relevant parts of the event graph).
6. Extraction from text: RQ2 combines two tasks (semantic modelling and information extraction from text) that seem rather unrelated. In general, the extraction from text plays a very minor role before Fig. 5 and should be motivated earlier in the paper. Two questions: (i) In an event-centric KG, what is the fraction of triples generated during the extraction from text compared to the other triples? (ii) Does this step solely follow what was done in [7], or does it include original research?
7. Evaluation: Many parts of the evaluation are performed as a comparison to EventKG. Here, it would be good to describe more precisely how the precision, recall and F1 scores are computed (see the sketch after this list). Also, while it is stated that EventKG was created “manually” (which is confusing, since EventKG is also extracted automatically from different generic KGs), it can be assumed not to be perfect/complete, so I would like to see a more in-depth comparison of an event as represented in EventKG and in ChronoGrapher. Such a comparison could, on the one hand, show that ChronoGrapher misses relevant information but, on the other hand, also show that it succeeds in skipping non-relevant information. This should motivate why ChronoGrapher is favoured over (or at least complementary to) EventKG in some settings.
8. Evaluation (RQ3): While I like the setting of Section 4.3, since it actually tests the applicability of the KG in realistic settings, I miss many details (specifically in the first paragraph of Section 4.3): (i) How are the metrics defined? (ii) How are the prompts created, and what prompting strategy do they follow? (iii) Are there examples of the six types of questions?
9. Evaluation (Data): All examples and all events in Table 6 are about conflicts, which raises the question of how the approach performs on, for example, political and sports events.
10. Data: It would be great to see an actual example KG file in the repository.
11. Methodology: In general, the whole methodology (except for the extraction from text, which is, however, just the application of a pre-trained model) is rather basic and relies heavily on heuristics. While I do not require that AI methods be part of every paper, a strong motivation is needed for why the selected approach is superior to learning-based methods.
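To make comments 3 and 7 above more concrete, here are two sketches in my own ad-hoc notation; they are only meant to illustrate the kind of formalisation and description I would like to see, not to prescribe the authors' actual definitions.

Re comment 3: a formal problem statement could roughly read: given a generic KG G = (V, R, T) with triples T ⊆ V × R × V and a seed event node e ∈ V, compute an event-centric subgraph G_e = (V_e, T_e) with V_e ⊆ V and T_e ⊆ T such that every node in V_e is reachable from e and satisfies an explicitly stated relevance criterion (e.g., being a sub-event, participant, location or temporally related event of e). Such a statement would also clarify which non-event nodes are kept and why filtering is needed at all.

Re comment 7: if the EventKG representation of an event is treated as the gold standard T_gold and the ChronoGrapher output as T_pred, I would expect definitions along the lines of

  precision = |T_pred ∩ T_gold| / |T_pred|
  recall    = |T_pred ∩ T_gold| / |T_gold|
  F1        = 2 * precision * recall / (precision + recall)

together with a statement of whether matching is done on full triples or only on entity/event nodes.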
Minor:
- Algorithm 1 needs some re-working. The part with the input parameters is strangely formatted, and more mathematical/pseudo-code notation would help (e.g., “N ← N ∪ {v}” instead of “add node in N”; see the sketch after this list of minor comments). The comments in curly brackets are confusing and should be formatted differently. “to_extract” is not used.
- Fig. 1: It is unclear whether “French Directory” and “French Consulate” are really events. This should at least be discussed.
- Notation: Use a more standardised notation. For example, “e={e_1, …}” is confusing; a set should be upper-case (“E”). Also, the double use of “n” for “node” and “number of iterations” is confusing.
- Fig. 3: A proper description of what is going on in Fig. 3 would be helpful. Since Section 3.2 gives, for each of the four stages, a textual description (Page 7, lines 25-30), an algorithm description (lines 30-36) and an example sub-image in Fig. 3, maybe restructure the section and describe it stage by stage with these three elements?
- Definition 11 comes out of nowhere and should be better introduced/motivated.
- The notation in the evaluation section is sometimes unclear, for example the parameter “domain_range” and the notion of “who = 1”.
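To illustrate the Algorithm 1 comment above, the following is a minimal sketch of the kind of explicit, set-based formulation I would find easier to follow. All names (seed, n_iterations, expand, passes_filter) are my own placeholders, and this is a guess at the algorithm's overall structure, not a claim about what Algorithm 1 actually computes.

def informed_traversal(seed, n_iterations, expand, passes_filter):
    # N <- {seed}; the frontier holds the nodes to expand in the next iteration
    nodes = {seed}
    frontier = {seed}
    for _ in range(n_iterations):
        next_frontier = set()
        for v in frontier:
            if not passes_filter(v):      # e.g., do not further expand locations/persons
                continue
            neighbours = set(expand(v))   # ingoing and outgoing nodes (Defs. 3/4)
            next_frontier |= neighbours - nodes
            nodes |= neighbours           # N <- N ∪ expand(v)
        frontier = next_frontier
    return nodes

In the paper this could of course be written as pseudo-code rather than Python; the point is the explicit set notation and the clearly separated input parameters.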
Very minor:
- Abstract: The three groundedness scores in the abstract are a bit too detailed / non-intuitive if the metric is not known to the reader.
- Page 4, line 42f: I don’t understand the “output …is regions” sentence.
- Fig. 3: “DB” is mentioned in the caption but nowhere else.
- Page 7, line 17: Sentence “For each pattern…” is unclear.
- Page 7, Lines 25-35: Consistently use “Stage” and don’t switch to “step”.
- Page 7, Line 33: “ ; ”
- Page 19, Line 2: “and and”
- Page 19, Line 47: “[are] available on GitHub”
- Table 8: The runtime is formatted differently (in bold).