Review Comment:
The paper extends the conference version of PATHFINDER by incorporating additional path semantics aligned with GQL, in particular the notion of GROUPS, and adding to the prior framework the corresponding algorithms. The work is evaluated over a combination of real-world and synthetic datasets, with the aim of demonstrating scalability and applicability.
The topic is relevant and timely, and the work builds upon a solid prior contribution. However, in its current form, I'm not sure that the paper meets the expectations for a journal extension, particularly with respect to the depth of the contribution, the completeness of the presentation, and the experimental validation.
(1) Originality:
The paper presents a technically sound extension of a previously published conference work. The integration of GROUPS into the framework constitutes a meaningful addition, and the overall approach remains original in the broader context of graph query processing.
That said, the degree of novelty with respect to the conference version appears limited. The main ideas and techniques are largely inherited from the earlier work, and the additional contribution, while relevant, does not significantly alter the conceptual core of the approach. As such, the level of originality is acceptable, but somewhat modest for a journal extension.
(2) Significance of the results:
The results presented in the paper are reasonable and consistent with the goals of the work. However, there are two important limitations that affect their overall significance.
First, from a theoretical and algorithmic perspective, the paper does not provide a fully self-contained description of the proposed methods. Several important procedural details within the algorithms are omitted or only partially described. Instead, the authors refer to an external arXiv version for the complete procedures (an arXiv version that is different from the conference one, which itself does not contain all of these details).
This is problematic for a journal extension, which should ideally be self-contained and should not rely on material outside the peer-reviewed manuscript for essential algorithmic details. This makes it difficult to fully assess the correctness and applicability of the approach. Moreover, it's unclear why such details are not included in the paper itself, given the absence of strict space constraints.
Second, the experimental evaluation does not constitute a substantial extension over the conference version. A significant portion of the setup is reused, including two of the three datasets. The only new dataset (Pokec) provides a limited additional perspective, as it explores aspects that seem to already be implicitly covered—often under more demanding conditions—by the synthetic Diamond dataset.
While the paper includes an evaluation of the GROUPS mode, this analysis remains relatively shallow. The experiments primarily show that increasing the number of groups leads to increased runtime, which is expected and does not provide deeper insight into the behaviour or usefulness of this semantics. Furthermore, only very small values of k are considered, and results for Wikidata are omitted (again, not quite sure why but the authors argue its "for brevity").
The lack of comparison with existing systems is understandable given that the proposed GROUPS semantics is not directly supported by current graph query engines. However, this makes it even more important to provide a thorough internal evaluation of this feature. In particular, the paper does not clearly demonstrate in which scenarios GROUPS provides a practical advantage (or if it does at all). For instance, it would be valuable to understand whether GROUPS helps mitigate path explosion by reducing output size, how it compares to returning individual paths (e.g., SHORTEST or TOP-k paths), and what trade-offs it introduces in terms of runtime and result granularity.
More generally, the current evaluation does not explore settings in which GROUPS is expected to be most relevant, such as graphs where multiple paths of different lengths exist between the same pair of nodes. As a result, the experimental section does not sufficiently support the practical relevance of this extension.
Taken together, these issues limit the overall impact and significance of the experiments presented.
(3) Quality of writing:
The quality of writing is uneven across the paper. The parts inherited from the conference version are generally clear and well presented. In contrast, the newly introduced material is noticeably less polished. In particular, the new sections contain typos and some grammatical issues. More importantly, key technical details are sometimes omitted all throughout the manuscript, making certain parts difficult to follow. This places an unnecessary burden on the reader, which is not justified in a journal extension without strict page limits. The paper contains one very useful running example, but also lacks sufficient illustrative examples in places where they would be helpful for understanding non-trivial constructions (e.g., a small example of the product graph).
A recurring concern is that important information is deferred to an external arXiv document rather than being included in the paper itself. This choice is difficult to justify in the context of a journal submission and contributes to the impression that the presentation of the new material is incomplete.
Overall, the manuscript would benefit from a thorough revision to improve clarity, completeness, and consistency.
(4) Data and reproducibility:
The authors provide a link to code, data, and queries, which is a positive aspect. However, the reproducibility of the work is only partially supported due to the concerns raised above.
While the availability of experimental resources is commendable, the lack of a fully self-contained description of the algorithms in the paper itself makes it difficult to independently reproduce or validate the theoretical aspects of the approach. In particular, relying on an external, non-peer-reviewed arXiv version for procedural details raises concerns.
It would be important to ensure that all essential components required to understand and reproduce the work—both theoretical and experimental—are clearly described within the paper and properly documented in the accompanying resources.
Minor comments:
Generally, spelling is uneven. Choose one (neighbor or neighbour, etc), UK or US.
P3
L20 well, technically you do not introduce Pathfinder here, you extend it from your previous work to cover GROUPS.
L42 we describe and extra examples -> and add?
Also, about this Remark: The claim that the manuscript provides additional algorithmic details and examples compared to the conference version is not entirely clear. In my reading, the level of detail remains largely comparable to the conference version, with the main differences being limited to the extensions themselves. Similarly, the examples appear to either be the same ones or follow closely those already presented, with only minor adaptations to account for the new material.
I would have expected a more substantial expansion in terms of both algorithmic detail and illustrative examples. In addition, the discussion on the complexity of the algorithms remains rather limited and could be further developed.
P4
If you’re defining all the basics…. then:
L37 \epsilon not defined
L42 k-fold concatenation not defined.
P5
L1 “greater than”
P7
L33 ‘in practice…’ why? what is the complexity difference? why do you construct the product graph lazily and in the complexity discussions this is still the bound you give, could you explain a bit more what are the practical differences in the lazy approach?
P10
l41 it runtime -> its runtime
l51 The intuition simple -> The intuition IS simple
P11
I’d appreciate a bit more detail on certain discussions, e.g.
L31 why do you require unambiguous here but not in the previous cases?
L32/33 this is easy to enforce for real-world RPQs?
What is the procedure Neighbors doing, here and in Algorithm 1?
P13
L15 why do you “crucially” need this? Or even better, why you don’t need it in the first algorithms but you do need it now? Can you expand on the differences?
L16 Again, a bit more detail on the runtime of the construction of Gx would be helpful.
L18 How much bigger? Please, don’t send the reader to another paper, explain…
L20 you have explained/defined what this is already in P7, and mentioned it already in P10…
P14
L37 are not shortest -> the shortest / by path -> by a path
L41 travaerse -> traverse
P15
L20 pipelinining -> pipelining
L22 k path if more -> paths
L31 Algorithm 3 work -> works
L44 haven’t -> have not
L47 END -> ENS / we can the output -> we can output
P16
L28 q state -> q the state
P17
L44 terminates -> terminating
L 47 to some nodes n -> to some node n
P18
why the need to compute the product graph lazily if all the bounds you are giving are still ~O(|A||G|) ?
P19
Paragraph starting in l41:
Where are you checking that the next node also respects regex? in IsValid? Is it in Neighbors? Please clarify this.
Because for example, in the example you give, when trying to expand Jane there are no outgoing follows or lives edges, so the search from there terminates. Where in your pseudocode’s algorithm is this?
Same question I had about the procedure Neigbors in the other algorithms. You need to clarify in what part of your code you’re “checking” that the regex for the path you are assembling is ok.
P20
L29: the same as -> the same way as
P22
L6 sued -> used
L9 Reachable a set -> Reachable is a set
L10 Groups a dictionary -> Groups is a dictionary (or use colon)
P23
L34-5 we do not yes -> we do not yet
|