A Survey on Semantic Scientific Workflow
Review by Bertram Ludaescher
The paper aims to provide a survey on "Semantic Scientific Workflow".
However, the paper falls short on a number of points:
* There is much talk of the need for "flexible" and "adaptive" workflows, but it is not quite clear what exactly the problems with scientific workflows are.
* There should be examples (not a single example is given).
* It is not clear what semantic technologies are offering to solve the (underspecified, under-explained) scientific workflow problems. Again, examples would have helped.
* According to the paper, many if not all of these technologies are not (widely) used with scientific workflows.
So instead of writing a broad survey that can only scratch the surface, why not narrow the scope, pick a particular "semantics problem" that scientific workflows might have (which one?), and then compare a few of the key approaches?
As a minor point: there are quite a few grammar and language problems.
Further comments:
* Section 1
- ... earth science and etc. --> delete "and"
- ... because it utilizes --> because it may utilize
- ... provenance information in detail, which is critical for further verification, analysis and new discovery.
--> not clear why/how provenance helps with discovery. Example?
- ... Besides, the specification should
--> Here and elsewhere in the paper, "Besides" is incorrect.
--> Replace for example with "In addition", "Additionally", "Moreover", or similar
- ... methodologies are required --> methods are required
- ... gives many advantages
--> may yield many advantages
--> also give concrete examples of these advantages or explain further
- ... Although, there are numerous independent efforts to make scientific workflows more flexible and adaptable, there is a new tendency towards the inclusion of semantic information, ontologies, and execution rules inside the workflow execution [2].
--> Replace "tendency" with "trend"? Also what drives this trend? Are there examples why this new technology works well? Or is the new trend just a new fashion? Examples, references, or plausible arguments please.
* Section 2:
- The term "flexible" occurs more than 20 times in the paper (a bit too often I think).
Also "adaptive" is often used together with "flexible".
Can you give a bit more detail of what you mean by these terms?
I understand that these are difficult to make more precise, but examples or some more detail would be desirable.
The closest to an explanation I found was:
"Flexibility denotes the ability of a workflow system to react to changes in its environment."
This is very broad. What kind of changes? Resources coming online and going offline?
Scientists who need to get their science done try to work in environments that do not change all that frequently.
I'm not convinced that flexibility and adaptiveness as a reaction to changes in the environment are the biggest challenge in scientific workflows. You should come up with some more examples.
- ... "Here, we enumerate some of the relevant requirements."
--> These "requirements" are not very clear.
--> You might also want to cite additional papers regarding requirements of scientific workflows.
For example: Sarah Cohen Boulakia, Ulf Leser: Search, adapt, and reuse: the future of scientific workflows. SIGMOD Record 40(2): 6-16 (2011).
- ... "senior service" --> what is that?
- ... Scientists usually are non-computer experts
--> This really depends. Some of the most advanced tools around are built by computational scientists (who are not computer scientists, but definitely computer experts).
- ... and intermediate products and are very helpful
--> delete the second "and"
- ... The complexity of a scientific workflow lies in its experimental goal and unreliable execution environment
--> replace "goal" with "goals"? Also do you mean "exploratory nature" rather than experimental goals?
Otherwise it is not clear to me why having goals adds to the complexity.
Please explain where the complexity comes from!
I also think that the "unreliable environment" is not the main problem (see e.g. the paper by Cohen-Boulakia and Leser above or some of the papers you cite on scientific workflows).
- ... Scientific workflows are usually executed in a highly heterogeneous and distributed environment, which is unreliable and evolving all over the time.
--> this should be qualified. The scientists I know try to work in a highly reliable environment that changes at a moderate pace only. Yes, they use different tools and platforms etc., but they work hard to minimize heterogeneity and maximize reliability (for example, some scientific workflow applications control expensive instruments, so the workflows have to be very reliable; some flexibility and adaptiveness might be needed, but all this is within very specific boundaries).
- ... For the purpose of improving reliability and handling various exceptions, an adaptive execution which dynamically selects available services or modifies the structure of a workflow at runtime runs first in the development of a scientific workflow.
--> It is not clear what this means.
- ... a workflow specification enhanced with semantic rules facilitates scientists to describe more complex decision and behavioral logics and also accommodates adaptable execution with more intelligent strategies, such as: following alternative execution paths in case of errors or unexpected exceptions.
--> I don't understand how a program or workflow can react to unexpected exceptions.
What would one do there? Can you give examples? I would have assumed that a designer foresees certain kinds of exceptions and then devises some exception handling mechanisms. This seems to be part of the normal design process.
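By way of illustration, here is a minimal Python sketch (all names invented, not taken from the paper) of what I would call ordinary, designed-in exception handling: the designer foresees a failure mode and declares an alternative path in advance.

    # Minimal sketch, all names invented: the designer foresees a class of
    # failures and wires in an alternative execution path ahead of time.
    # Nothing here reacts to an *unexpected* exception; the fallback is
    # part of the normal design.

    class ServiceUnavailableError(Exception):
        """Raised when a workflow step's service cannot be reached."""

    def invoke(service, data):
        # Stand-in for a real service call; the primary service fails here
        # so that the example exercises the fallback path.
        if service == "primary-alignment-service":
            raise ServiceUnavailableError(service)
        return f"{service} processed {data}"

    def run_step(data):
        try:
            return invoke("primary-alignment-service", data)
        except ServiceUnavailableError:
            # Foreseen failure mode -> pre-declared alternative path.
            return invoke("backup-alignment-service", data)

    print(run_step("sample-42"))  # backup-alignment-service processed sample-42

If the semantic rules buy anything beyond this kind of standard fallback logic, the paper needs to say what.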
- ... with the exponentially increasing volumes of data from scientific experiments, it is necessary to employ semantics into provenance in order to unambiguously interpret data in the correct context.
--> This and similar claims in the paper need some examples or other arguments to be backed up.
* Section 3
- ... Flexible scientific workflows require a flexible design of complex experimental logics, but also an intuitive way to describe user requirements to find a concrete service at runtime.
--> The techniques and so-called "standards" described here seem somewhat esoteric.
You also describe "dead" technology: ... "However, the application of both OWL-S and OWL-WS hasn't been prevalent for lack of tools to support the development [40]."
Isn't "not prevalent" an understatement?
Can you point to concrete examples where some of these technologies are used in scientific workflows and where they make a difference?
- ... and shot that rule-based ... --> and show ...
- ... a reasoning Web Service
--> what kind of reasoning?
- ... initial social order of a multi-agent system
--> what social order?
- ... However, since it is based on the BPEL, which not only cannot capture dynamic changes, but also offers a number of complex advanced features in business workflow, it is still not flexible enough for scientists to capture scientific processes.
--> looks like a non-sequitur to me. And why is it not flexible enough? What features specifically are missing from BPEL?
- ... it would be promising to combine the rule-based language and the multi-agent system to capture the scientific processes.
--> I'm just not convinced.
- ... The solution provides a flexible and scalable framework to complete complex tasks. However, there is still much to be done when it comes to scientific workflow.
--> What is the reader supposed to take away from such statements? The first sentence is a claim without supporting evidence, and the second doesn't say what needs to be done.
* Section 4:
- OWL-S is discussed again, and it is again concluded that OWL-S is dead: "However, it hasn't been widespread applied" (this sentence also has grammar errors)
- WSDL-S, SAWSDL, WSMO, SWSF ... --> it is difficult not to lose track with so many acronyms
Your own assessment: "However, all of them only provides a prototype and haven't been widespread applied."
makes me wonder whether those "standards" have come too early.
Shouldn't the process be: develop something that works and makes a difference, then standardize?
(not your fault; but I wonder why one should even look at all these "standards" if nobody seems to be using them...)
* Section 5:
... since the dynamic runtime changes at runtime, event-based exception handling seems to be a promising one.
--> This is one of many unverifiable, unsubstantiated claims in the paper.
... Exceptions arising at run-time can be detected as events and handlers reacts them based on its rules and knowledge base.
--> not sure what to make of this
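If the intended meaning is something like the following sketch (my guess, not the authors' design), then the paper should spell it out: each run-time failure is reified as an event, and a rule table maps event types to reactions.

    # My guess, in Python, at what "exceptions detected as events, handled
    # via rules" could mean; all event names and actions are invented.
    RULES = {
        "service_timeout": "retry",
        "missing_input":   "ask_user",
        "invalid_output":  "rerun_with_defaults",
    }

    def handle(event_type):
        action = RULES.get(event_type, "abort")  # default reaction: abort
        print(f"event {event_type!r} -> action {action!r}")
        return action

    handle("service_timeout")  # -> retry
    handle("disk_corrupted")   # -> abort (no matching rule)

Even a toy example like this would tell the reader what the "knowledge base" contains and where the rules come from.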
* Conclusions:
... We surveyed several solutions which employ the technologies from the Semantic Web community.
--> Neither the scientific workflow problems nor the Semantic Web solutions have become clear in the paper.
... In summary, it has become a tendency to employ declarative rules,
--> "tendency" is not the right word it seems. You need to give examples of how declarative rules can make a difference. I think they can, but the paper doesn't show how.
... semantic-based description, knowledge-based agent systems, ontologies and other technologies from Semantic Web Community into scientific workflow. However, there is still which needs to be done.
--> The last sentence doesn't parse.
... For instance, rule-based languages are usually very complex for non-computer scientists and it is necessary to hide this complexity and improve the usability of rule-based scientific workflows.
--> So what do you propose? What could be done about this?
... Besides, standards and frameworks for workflow specification and execution are still missing.
--> As mentioned above, I don't think lack of "standards" is the problem.
Review by Mark Gahegan
I found this paper to be sometimes too terse and lacking in synthesis, and sometimes very insightful and clear; I've indicated exactly where in the detailed comments below. Overall I think it is a useful contribution, but one that has the potential to be excellent.
If the introduction and section 4 could be improved, I think this paper would make a timely and strong contribution to a quite complex and fast-evolving topic. Since it is a summary article, it may attract a high number of citations—I hope so.
The paper is mostly well written, and clearly organized (below I note a few exceptions).
Detailed comments follow…
Introduction
The authors seem to struggle to differentiate scientific and business workflows. It is true that science workflows are often non-linear, in that they may contain cycles as the researcher moves back and forth between activities. Both Science workflows and Business workflows ARE dataflow-oriented, at least in my experience. I'd say that the Science workflows suit this description less though, since they can involve cycles of refinement and experimentation.
I also do not think it is true to say that Science workflows are USUALLY executed in an evolving environment, where resources may disappear. Most Science workflows, I would suggest, run on desktops and grids, in a fairly closed world. Resilience is a noble and worthy goal, but it is not THE goal (though it may be yours). Other noble goals are: repeatability, better code reuse, capturing provenance, in-silico science, etc.
So can I suggest that you perhaps take more care in highlighting the goals of science workflows, then by all means focus the discussion on the aspect you are most concerned about? Is it not possible to have semantics in a workflow even if it is not using the Web at all? I believe so, and so I find the phrase Semantic Scientific Workflow a little confusing. Could I suggest you insert the word Web somewhere in your title, to clarify?
Section 2.1
An excellent and concise summary of the requirements of Scientific workflows.
Section 2.2
Again, my experience of Science workflows is that they are NOT executed in a dynamic and evolving computational setting all that often, in fact there is a great deal of order and stability. I'm thinking particularly of all the Workflow tools used in bioinformatics and bioengineering, such as Taverna, MyExperiment, Galaxy, CellML, etc… Again, not to say that adaptive execution is not a worthy goal, it is, but be careful of overstating.
I believe that there are big challenges in richly describing the function of workflow components used in science—all but simple I/O components tend to be highly specialized and it might prove very difficult to reliably discover and integrate at runtime. Definitely worth pursuing this question, but also important to recognize the complexity of the task.
One of the most pressing workflow questions in a large, distributed science community is how to work out, for each step of the execution, whether to migrate the data or the computation (or both) to a more performant computer or site for execution. Perhaps mention this challenge too?
Section 3
Bottom of p4 section 3.2 should be 'show', not 'shot'.
Define what ECA (Event-Condition-Action) is. Section 3.2 is rather densely written, and difficult to follow. I'd suggest using richer descriptions of the work you highlight, to allow the reader to understand how you are differentiating it.
Section 3.3
This description is far richer—thus easy to follow the points being made.
Why do Agent-based workflows achieve high scalability (where others do not)?
Section 3.4
This section is also rather difficult to follow for readers unfamiliar with the paradigm. Mostly because it introduces several new terms without really explaining why they are useful.
Section 4.1
Change "ServiceGroundling" to "ServiceGrounding" on line 7.
Section 4.3
This could use some kind of example, perhaps showing what a simplified SAWSDL description might look like…? Then perhaps explaining the affordances that each kind of information provides. I suggest this because the descriptions are quite abstract, so the less familiar reader may have difficulty seeing the value in them.
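To make the suggestion concrete, something along the following lines would do (a Python sketch; the sawsdl:modelReference attribute and its namespace are taken from the W3C SAWSDL recommendation, while the WSDL file name and the ontology URIs it would contain are invented):

    # Sketch only: list the SAWSDL modelReference annotations in a WSDL
    # file. The namespace is the real one from the W3C recommendation;
    # the input file is hypothetical.
    import xml.etree.ElementTree as ET

    SAWSDL_NS = "http://www.w3.org/ns/sawsdl"

    tree = ET.parse("weather-service.wsdl")  # hypothetical annotated WSDL
    for elem in tree.iter():
        ref = elem.get(f"{{{SAWSDL_NS}}}modelReference")
        if ref:
            # Each annotation links a WSDL/schema element to one or more
            # ontology concepts; this is what a matchmaker searches over.
            print(elem.tag, "->", ref.split())

A small example like this, next to the annotated WSDL fragment itself, would let the less familiar reader see what the annotations buy: discovery and matchmaking over ontology concepts rather than plain type names.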
Section 4.4
What's the 'top-bottom approach'?
This section, like all the subsections of Section 4, seems too brief, and glosses over the real meat of this paper. Can you again provide examples of affordances and problems, for all subsections in Section 4? As described, the approach in Section 4.4 seems the most complete and flexible to me…
Ditto for Sections 4.5 and 4.6—more detail, more contrast, examples.
Section 4.7
Can you provide a synthesis, e.g. a summary table that neatly contrasts the most important aspects of the approaches you have described?
Section 5
Level of detail is good here, and I really liked these summaries of approaches to handling exceptions—very useful. But the first sentence of the very last paragraph in S5.3 is confusing to me.
Section 6
reads well.
Section 7
Ties together all the threads in a useful conclusion
References
OK
Review by Tobias Weigel
The authors set out to provide a survey of the current state of the art in the literature about the application of semantic technologies for scientific workflows. Being two fields with much active research, providing a profound survey on this interdisciplinary topic is challenging, yet very much relevant to both fields. Overall, this survey provides a broad overview of relevant approaches and implementations. Major challenges are addressed throughout the submission.
The introductory explanation of the authors' setting (p. 2, l. 2 and following) seems sound, but also hints at a hidden use case which should be explained. There should already be solid use cases available in the published literature, which should be referred to here. The authors describe the goal of their survey as to "review existing efforts which incorporate the achievements of Semantic Web community into scientific workflows for the purpose of making them more flexible and adaptable" (p. 2). Acting as the primary survey motivation, this should be mentioned much earlier in the introduction.
The most fundamental issue with the submission is that although the authors provide reasonable requirements for scientific workflows, they fail to use them consistently as the guiding structure of their survey. Each of the reviewed approaches should be thoroughly analyzed against each of the requirements to provide a means for objective reasoning and to help a reader new to the topic develop an understanding of the specific advantages and disadvantages. Lacking this fundamental step, the conclusions the authors draw cannot be objectively verified. For example, in section 3.2, "exception handling, reproducibility, etc." are considered to be "limited" in the reviewed implementations, but nothing is said as to why and how. In this section's summary, these criteria are not mentioned at all, yet rule-based languages are recommended as part of an approach the authors consider "promising".
Given that the manuscript is meant to be a survey paper, the literature review is not as precise as it should be, and it is sometimes not clear whether the authors are providing their own thoughts or summarizing secondary literature. Often, the statements about the reviewed approaches are much too short to give a reader unfamiliar with the specific implementations a clear picture of the relevant features of the surveyed work, particularly in view of the requirements.
The following is a non-comprehensive list of examples where the authors must provide further references and/or more detailed explanations for claims they make:
- p. 1: "it is inadvisable to directly reuse the technologies from the business community" - why?
- p. 1: "scientific workflows are usually executed in an evolving environment" - if this is the usual case, there should be references from multiple application domains
- p. 2: "a scientific workflow system which is resilient to the volatile execution environment and supports dynamic and adaptive workflow execution is the ultimate goal of the scientific workflow community" - this is a broad claim which must be backed with representative references
- p. 3: "For the purpose of improving reliability and handling various exceptions, an adaptive execution which dynamically selects available services or modifies the structure of a workflow at runtime runs first in the development of a scientific workflow" - how can an adaptive execution that is subject to run-time changes improve reliability? This sounds contradictory and needs more explanation.
- p. 3: "exponentially increasing volumes of data from scientific experiments" - this is known as a part of the challenge of data-intensive science and there are good references for it
- p. 4: "However, OWL-S focuses on modeling a workflow that is internal to a single service, i.e. an OWL-S composite process (workflow) specifies the steps interacting with a single service implementation, which does not accord with the reality." - this is incomprehensible without further explanation and references
- p. 4: ECA is not defined
- p. 6: "The work OWL-S was the first specification submitted to W3C in 2004 and has deeply affected the development of semantic web service." - this should be backed by references
- p. 6: "However, it hasn't been widespread applied because of its complexity and the top-down approach to modeling of services, which does not fit well with industrial developments of Service-Oriented Architecture (SOA)." - why? this needs more justification.
- p. 9: data provenance is not defined or explained, at least a reference should be provided
The submission also suffers repeatedly from various grammatical errors - including in the submission title - and vocabulary errors. The overall readability would be acceptable once these errors are corrected.
Submission in response to http://www.semantic-web-journal.net/blog/semantic-web-journal-special-ca...