Review Comment:
This paper describes the iSamples metadata schema, developed as part of the NSF-funded Internet of Samples project to support discovery, citation, and integration of material samples across domains such as geology, biology, archaeology, and genomics. The authors present a conceptual model, implemented in LinkML and supported by SKOS-based vocabularies, that enables integration of heterogeneous sampling records into a unified metadata registry. The schema is applied in production contexts and has been used to harmonize over six million sample records from participating repositories.
As an application report, the paper meets the core expectations of demonstrating practical deployment of Semantic Web technologies and addressing a real-world interoperability problem. The iSamples model addresses a longstanding and widely recognized gap in scientific data infrastructure: the lack of harmonized metadata for physical sample provenance. The application’s importance is well-motivated, the technologies used are standards-aligned, and its deployment is supported by extensive documentation and openly accessible infrastructure.
Nevertheless, the model exhibits several limitations—particularly around temporal representation, identifier handling, and vocabulary governance—that should be addressed to strengthen the long-term sustainability, semantic coherence, and interoperability of the system. These issues do not undermine the fundamental contribution of the paper, but they are significant enough to merit revision.
⸻
Evaluation Against Application Report Criteria
(1) Quality, Importance, and Impact of the Described Application
The iSamples schema provides a practical and lightweight approach to a domain-crossing challenge of high relevance: the structured description and integration of material samples across the sciences. The project targets a key barrier in the broader realization of FAIR data infrastructure for the physical sciences and natural history collections. Few prior efforts have attempted to unify sample metadata across domains while remaining deployable in modern data ecosystems. The authors’ use of JSON Schema and LinkML, combined with SKOS vocabularies and published crosswalks to standards like Schema.org, DataCite, and TDWG MIDS, demonstrates a commitment to technical interoperability.
The impact of the work is further underscored by its adoption across real data sources, notably SESAR, Open Context, GEOME, and the Smithsonian. The successful integration of millions of existing records using this schema is a compelling demonstration of its applicability and operational maturity. The model’s thoughtful decomposition of sample type into material type, sample object type, and sampled feature type enables scalable classification while avoiding rigid ontological over-specification.
Still, the current version of the schema includes some critical modeling simplifications that may limit its effectiveness in more complex or semantically aware systems. The most important of these relate to the representation of time and process (SamplingEvent), the structure and provenance of cross-referenced identifiers, the traceability of vocabulary mappings, and the modeling of procedural metadata. Each of these issues is discussed in detail below, with constructive suggestions for revision.
⸻
Detailed Review and Suggestions for Revision
Temporal Modeling Inconsistencies
The current schema conflates durative and instantaneous temporal representations by modeling SamplingEvent as a unitary structure that serves both as the record of sample collection and as the temporal marker of the event, with a single result_time field. The paper aligns SamplingEvent conceptually with prov:Activity and sosa:Sampling, yet both ontologies treat sampling as an activity that can extend over time: PROV-O provides prov:startedAtTime and prov:endedAtTime for this purpose, and a durative sampling aligns semantically with time:Interval rather than with a single instant.
The use of a single timestamp, possibly at coarse granularity (e.g., year or month), implicitly treats the sampling as an instantaneous event. This creates a mismatch with PROV-O and OWL-Time expectations and undermines interoperability with systems that rely on temporal reasoning, duration tracking, or sampling provenance chains.
Suggested revision: Introduce a clear distinction between a SamplingActivity (aligned with prov:Activity and sosa:Sampling) and a SamplingEvent modeled as a time:TemporalEntity, which may be either a time:Instant or a time:Interval. Link the activity to its temporal anchors using a dedicated property (e.g., hasTemporalAnchor). Because an existing result_time value can be carried over as a time:Instant, this would preserve backward compatibility while enabling more precise representation of sampling workflows and improved alignment with temporal standards.
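To make the suggestion concrete, the following is a minimal Python sketch of the proposed split, not a definitive design. The class names Instant, Interval, and SamplingActivity, the has_temporal_anchor attribute, and the from_legacy_result_time helper are all hypothetical and do not appear in the iSamples schema; they only illustrate how a single result_time could be lifted into the richer structure.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Union


@dataclass(frozen=True)
class Instant:
    """Stands in for time:Instant: a single point in time."""
    at: datetime


@dataclass(frozen=True)
class Interval:
    """Stands in for time:Interval: a bounded span with a start and an end."""
    start: datetime
    end: datetime

    def __post_init__(self) -> None:
        # Reject intervals whose end precedes their start.
        if self.end < self.start:
            raise ValueError("interval end precedes start")

    @property
    def duration(self) -> timedelta:
        return self.end - self.start


# Either kind of temporal entity may anchor a sampling activity.
TemporalEntity = Union[Instant, Interval]


@dataclass
class SamplingActivity:
    """Hypothetical class; would align with prov:Activity and sosa:Sampling."""
    label: str
    has_temporal_anchor: Optional[TemporalEntity] = None


def from_legacy_result_time(label: str, result_time: datetime) -> SamplingActivity:
    """Lift an existing single-timestamp record into the proposed shape."""
    return SamplingActivity(label, Instant(result_time))
```

Under this sketch, existing records need no new information: each becomes an activity anchored to an Instant, while new records may supply an Interval when start and end times are known.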
⸻
Identifier Modeling and Crosswalk Support
The model’s handling of identifiers is pragmatic but lacks the structure necessary to support automated reconciliation across data systems. The alternate_identifiers field, modeled as a multivalued string list, provides no way to express the scope, provenance, or issuing authority of each identifier. In a schema intended to unify data across museums, laboratories, field surveys, and registration systems, this omission is limiting.
The authors cannot—and should not—be expected to impose version control on external identifier registries. However, they can improve the schema’s utility by adopting a structure akin to Wikidata’s external identifier pattern, in which identifiers are contextualized by scheme, authority, and optionally dereferenceable URI patterns. Such an approach allows for lightweight crosswalks without requiring strict semantic equivalence or complete harmonization.
Suggested revision: Introduce a structured ExternalIdentifier class with fields for the identifier string, scheme name, optional URI template, and additional metadata (e.g., alignment date or alignment source). This would support practical reconciliation workflows, align with FAIR best practices, and improve the model’s ability to serve as a semantic integration pivot.
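As an illustration only, the structured identifier could look like the Python sketch below. The field names, the "{id}" placeholder convention, the "unknown" scheme label, and the from_legacy_string helper are assumptions made for this example; the actual schema would define its own slots in LinkML.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ExternalIdentifier:
    """Hypothetical structured replacement for one alternate_identifiers entry."""
    value: str                              # the identifier string itself
    scheme: str                             # name of the identifier scheme
    uri_template: Optional[str] = None      # e.g. "https://example.org/{id}"
    alignment_date: Optional[date] = None   # when the cross-reference was asserted
    alignment_source: Optional[str] = None  # who or what asserted it

    def resolve(self) -> Optional[str]:
        """Return a dereferenceable URI if a template is available."""
        if self.uri_template is None:
            return None
        return self.uri_template.replace("{id}", self.value)


def from_legacy_string(value: str) -> ExternalIdentifier:
    """Wrap a bare string from the existing list without inventing metadata."""
    return ExternalIdentifier(value=value, scheme="unknown")
```

Existing string values can be wrapped without loss, and scheme or provenance details can be filled in later as repositories supply them.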
⸻
Sampling Procedures and Protocol Metadata
The schema currently accommodates protocol descriptions only as unstructured text embedded in the description field of SamplingEvent, with informal encouragement to include persistent identifiers when available. This provides minimal affordances for provenance-aware systems or AI-based agents attempting to interpret or traverse procedural metadata.
Formal modeling of procedures using prov:Plan or workflow ontologies may be unnecessarily complex for the application’s immediate needs. However, a middle-ground approach—drawing from sosa:Procedure, prov:Plan, and Schema.org’s HowTo or CreativeWork classes—would allow the schema to support both human readability and machine interpretability. Such a model could support LLMs and semantic agents by enabling dereferenceable links to protocols, publications, or standards, while also accommodating rich natural language descriptions.
Suggested revision: Introduce a lightweight SamplingProcedure class with fields for label, description, optional identifier (e.g., DOI or URI), and links to related documentation. Align this construct with schema:HowTo, schema:CreativeWork, and sosa:Procedure. This will allow for machine-discoverable procedural metadata without requiring deep formalization, and support future integration with protocol registries or workflow systems.
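A minimal sketch of such a class is given below, under stated assumptions. The class and field names are hypothetical, and the choice to serialize as schema:HowTo with the name, description, identifier, and citation properties is one plausible mapping rather than a recommendation the authors are bound to.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SamplingProcedure:
    """Hypothetical lightweight procedure record."""
    label: str
    description: str = ""
    identifier: Optional[str] = None          # e.g. a DOI or URI, when available
    documentation: List[str] = field(default_factory=list)  # links to related documents

    def to_jsonld(self) -> dict:
        """Serialize as a schema.org HowTo, omitting fields that are empty."""
        doc = {
            "@context": "https://schema.org",
            "@type": "HowTo",
            "name": self.label,
        }
        if self.description:
            doc["description"] = self.description
        if self.identifier:
            doc["identifier"] = self.identifier
        if self.documentation:
            doc["citation"] = list(self.documentation)
        return doc
```

A record with only a label and free-text description remains valid, so current practice is still supported, while a record that carries an identifier becomes discoverable by machines.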
⸻
Vocabulary Governance and Alignment Provenance
The iSamples schema makes strong use of SKOS-based vocabularies for classifying material samples. However, while the vocabularies are technically well-formed and publicly hosted, the governance model for their evolution is underspecified. The authors acknowledge that versioning is not yet implemented, and no formal process is described for integrating extensions, handling deprecated terms, or tracking updates to alignments with external vocabularies.
Given the complexity and dynamism of scientific classification systems, especially in archaeology and biology, some form of alignment provenance is necessary. Even without enforcing strict versioning on external vocabularies, the authors could document alignment metadata such as the date of mapping, source of alignment, or context of use.
Suggested revision: Introduce lightweight metadata to track the provenance of vocabulary alignments (e.g., alignment date, source authority, mapping method). Consider formalizing versioning for internal iSamples vocabularies and documenting a governance process for extensions and updates.
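The kind of alignment record intended here can be sketched in Python as follows. This is an assumption-laden illustration: the VocabularyAlignment class, its field names, and the stale_alignments helper are hypothetical, and only the five SKOS mapping relations are taken from the SKOS standard itself.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List

# The mapping properties defined by SKOS.
SKOS_MAPPING_RELATIONS = frozenset({
    "skos:exactMatch",
    "skos:closeMatch",
    "skos:broadMatch",
    "skos:narrowMatch",
    "skos:relatedMatch",
})


@dataclass(frozen=True)
class VocabularyAlignment:
    """Hypothetical record of one mapping between a local and an external term."""
    local_term: str        # URI of the iSamples vocabulary term
    external_term: str     # URI of the term in the external vocabulary
    relation: str          # one of SKOS_MAPPING_RELATIONS
    alignment_date: date   # when the mapping was made or last reviewed
    source_authority: str  # who asserted the mapping
    mapping_method: str    # e.g. "manual" or "automated"

    def __post_init__(self) -> None:
        if self.relation not in SKOS_MAPPING_RELATIONS:
            raise ValueError(f"not a SKOS mapping relation: {self.relation}")


def stale_alignments(
    alignments: Iterable[VocabularyAlignment], cutoff: date
) -> List[VocabularyAlignment]:
    """Return alignments last reviewed before the cutoff date."""
    return [a for a in alignments if a.alignment_date < cutoff]
```

With dates attached, curators could periodically list mappings that predate a revision of an external vocabulary and review only those.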
⸻
Geospatial Modeling Ambiguities
The schema includes both sample_location (a direct geospatial coordinate) and a SamplingSite object that contains its own site_location. It is unclear how these two are intended to interact, or whether they may conflict. The duplication introduces the potential for redundant or contradictory location metadata.
Furthermore, while latitude and longitude are modeled as decimal degrees, elevation is captured as a free-text string, which complicates spatial reasoning, filtering, or transformation.
Suggested revision: Clarify the relationship between sample_location and sampling_site. Consider introducing a unified Location or Place class used consistently across contexts. Elevation should be structured as a numeric value with a unit and reference datum, potentially using QUDT or GeoSPARQL-aligned representations.
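One possible shape for a shared location structure is sketched below in Python. The Location and Elevation classes, their field names, and the restriction to metres and feet are assumptions for illustration; a production schema would more likely reference QUDT units and a controlled list of vertical datums.

```python
from dataclasses import dataclass
from typing import Optional

# Conversion factors to metres for the units this sketch accepts.
_METRES_PER_UNIT = {"m": 1.0, "ft": 0.3048}


@dataclass(frozen=True)
class Elevation:
    """Hypothetical structured elevation: numeric value, unit, reference datum."""
    value: float
    unit: str = "m"
    vertical_datum: Optional[str] = None  # name or code of the reference datum

    def __post_init__(self) -> None:
        if self.unit not in _METRES_PER_UNIT:
            raise ValueError(f"unsupported unit: {self.unit}")

    def in_metres(self) -> float:
        return self.value * _METRES_PER_UNIT[self.unit]


@dataclass(frozen=True)
class Location:
    """Hypothetical shared class for both sample and site locations."""
    latitude: float   # decimal degrees
    longitude: float  # decimal degrees
    elevation: Optional[Elevation] = None

    def __post_init__(self) -> None:
        if not -90.0 <= self.latitude <= 90.0:
            raise ValueError("latitude out of range")
        if not -180.0 <= self.longitude <= 180.0:
            raise ValueError("longitude out of range")
```

If both the sample and the site refer to the same Location type, a consumer can compare them directly and flag records where the two disagree.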
⸻
(2) Clarity and Readability of the Paper
The paper is clearly written and well organized, and it effectively communicates the motivation and architecture of the schema. The discussion of implementation choices is grounded in the constraints of real-world metadata sources, and the mappings to external standards are well-documented and accessible.
That said, some sections (particularly those dealing with temporal representation, identifier scoping, and vocabulary management) would benefit from sharper distinctions between similar constructs and more explicit modeling diagrams. A summary of the schema’s class structure, either as a visual diagram or a concise prose walkthrough, would help readers unfamiliar with LinkML better understand the overall architecture.
⸻
Conclusion
This paper presents a significant and timely application of Semantic Web technologies to a problem of broad relevance in data infrastructure: the standardized representation of material sampling metadata across disciplines. The application has already demonstrated measurable impact through its use in large-scale data integration efforts. Its emphasis on JSON-native deployment, lightweight classification, and Schema.org alignment makes it suitable for widespread adoption.
Nevertheless, several elements of the model, especially around time, identifiers, procedures, and vocabulary provenance, need refinement to ensure semantic consistency, extensibility, and long-term maintainability. Addressing these issues will not fundamentally alter the schema’s pragmatic philosophy but will make it more robust and better aligned with best practices for Semantic Web infrastructure.