Review Comment:
The paper presents an interesting effort to utilize ontology-based video scene descriptions and rule-based formalisms for reasoning over video scenes to detect suspicious activities, with a use case describing a car park.
The namespace of the proposed ontology is missing and the actual description logic used for the formal grounding of the ontology is not stated. This would be particularly important to understand the mathematical constructors utilized in the formalism, which, along with the computational properties of the proposed SWRL rules, would indicate overall reasoning complexity.
Without the ontology file, the proposed ontology cannot be evaluated at all, and cannot be implemented to verify the claims of the paper. It is not clear why sameObject is defined instead of using the standard owl:sameAs as part of the descriptions, which should be justified.
While spatiotemporal object relations are declared (such as for overlapping objects), alignment with the Allen relations and with the de facto standard vocabularies that define these and related relations, such as OWL-Time and the SWRL Temporal Ontology, is not explained (these are not even mentioned). Related KOSes, such as the Video Structure Ontology and the Video Ontology, are not considered.
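To illustrate what alignment with the Allen relations would involve, the following minimal sketch classifies the relation between two time intervals (for example, the frame ranges in which two objects appear). The function name and the relation labels are illustrative assumptions; the labels are plain strings, not the property IRIs of OWL-Time or the SWRL Temporal Ontology, and nothing here is taken from the paper.

```python
def allen_relation(a_start, a_end, b_start, b_end):
    """Return the Allen relation that holds from interval A to interval B.

    Intervals are (start, end) pairs with start < end, e.g. frame numbers.
    The returned labels are illustrative strings, not vocabulary IRIs.
    """
    if a_start >= a_end or b_start >= b_end:
        raise ValueError("intervals must have positive duration")

    # Disjoint or touching intervals.
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if b_end < a_start:
        return "after"
    if b_end == a_start:
        return "metBy"

    # From here on the intervals share more than a single point.
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "startedBy"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finishedBy"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"

    # Only the two partial-overlap cases remain.
    return "overlaps" if a_start < b_start else "overlappedBy"
```

Mapping each declared spatiotemporal relation to one of these thirteen cases would make the relationship to the standard vocabularies explicit.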
To determine whether two (moving or stationary) objects are identical, the paper sets the same shape and size as prerequisites, even though there are feature descriptors that are scale- and/or rotation-invariant, such as the Histogram of Oriented Gradients (HOG), Rotation-Invariant Histogram of Oriented Gradients (RIHOG), and Ordinal Pyramid Coding (OPC), which would allow the shape and size to differ.
Isn’t “developing an ontology that represents an object along with its position in every frame” overly resource-intensive? How does the proposed approach perform in (near) real-time applications? Is the frame-by-frame semantic description actually optimal for car park videos?
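To make the resource concern concrete, a back-of-the-envelope estimate of the number of RDF statements produced by a frame-by-frame description can be sketched as follows. All figures (frame rate, objects per frame, statements per object) are hypothetical assumptions chosen for illustration only; none of them is taken from the paper.

```python
def triples_per_hour(fps, objects_per_frame, triples_per_object):
    """Estimate the RDF statements generated per hour of video when every
    object is described in every frame. All inputs are assumed values."""
    frames_per_hour = fps * 3600
    return frames_per_hour * objects_per_frame * triples_per_object


# Hypothetical scenario: 25 fps, 10 tracked objects per frame,
# 6 statements per object (type, position, size, frame link, etc.).
estimate = triples_per_hour(fps=25, objects_per_frame=10, triples_per_object=6)
print(f"{estimate:,} triples per hour")  # 5,400,000 triples per hour
```

Even under such modest assumptions, the statement count grows linearly with video duration, which is why the reasoning cost over per-frame descriptions should be reported.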
In line 6 of Figure 4, only a symbolic URL is used. Why isn’t the actual URL provided?
“Text from the video is extracted and then converted to Resource Description Framework (RDF) using semantic web technologies and NLP.” - What about semantic enrichment? How are namespaces defined for the terms used in the description?
There are several wording issues in the paper. For example, “to reach the human-level perception for various scenarios” is an exaggeration. “LSCOM, SROIQ which are not completely based on description logic.” is a stranded sentence that does not make sense. Some further issues include the following:
“and linked with domain knowledge to acquire human-level perception” → “and linked with domain knowledge to acquire interpretation capabilities with software agents”
“Complex events, which are rare in nature, are hard to train” → “Those complex events that are rare are hard to train”
“and require massive computational capabilities” → “and massive computational capability requirements”
“can be used to reason spatial and temporal reasoning” → “can be used to perform spatial and temporal reasoning”
There are grammatical errors and typos throughout the manuscript that should be corrected, for example:
In Table 2, “May be” should be “Potentially”
“in a video scenes” → “in video scenes”
“in generation of” → “in the generation of”
“a description logic based knowledge representations, can be used for” → “a description logic-based knowledge representation, which can be used for”