A metadata schema for documenting material samples from multiple domains

Tracking #: 3785-4999

Authors: 
Steve Richard
Dave Vieglais
Andrea Thomer
Sarah Hyunju Song
Neil Davies
John Deck
Quan Gan
Eric C. Kansa
Sarah Kansa
John Kunze
Kerstin Lehnert
Danny Mandel
Chris Meyer
Rebecca Snyder
Ramona Walls

Responsible editor: 
Cogan Shimizu

Submission type: 
Application Report
Abstract: 
This paper documents a metadata schema, implementation, and associated vocabularies developed for the Internet of Samples (iSamples) project to integrate geoscience, archaeology/anthropology, biology and genomics sample descriptions in a single cross-domain catalog. To develop the sample description scheme for sample discovery across these disparate domains, we reviewed the metadata schema and example metadata from each project partner, as well as other existing schemes. Top level classes in the schema include MaterialSampleRecord, Curation, SamplingEvent, SamplingSite and Agent. By factoring sample type classification into material type, material sample object type, and sampled feature type, it has been possible to classify the approximately 6,000,000 samples in the combined corpus. Category vocabularies for these classifications were developed based on unique value summaries from related fields in the source sample metadata, tested using a card sorting exercise and by development of code for automated mapping from source metadata. Each vocabulary has on the order of 20 categories with some hierarchy; the category concepts are intended to be covering, but might overlap. These vocabularies are implemented in SKOS, and published with the ARDC Research Vocabularies Australia (RVA) vocabulary service. The metadata schema is defined using a LinkML YAML file, and implemented as a JSON schema used to validate instance documents. To support interoperability, mappings from the iSamples metadata schema to several other schemes are provided in the project GitHub.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 12/Aug/2025
Suggestion:
Major Revision
Review Comment:

This paper describes the iSamples metadata schema, developed as part of the NSF-funded Internet of Samples project to support discovery, citation, and integration of material samples across domains such as geology, biology, archaeology, and genomics. The authors present a conceptual model, implemented in LinkML and supported by SKOS-based vocabularies, that enables integration of heterogeneous sampling records into a unified metadata registry. The schema is applied in production contexts and has been used to harmonize over six million sample records from participating repositories.

As an application report, the paper meets the core expectations of demonstrating practical deployment of Semantic Web technologies and addressing a real-world interoperability problem. The iSamples model addresses a longstanding and widely recognized gap in scientific data infrastructure: the lack of harmonized metadata for physical sample provenance. The application’s importance is well-motivated, the technologies used are standards-aligned, and its deployment is supported by extensive documentation and openly accessible infrastructure.

Nevertheless, the model exhibits several limitations—particularly around temporal representation, identifier handling, and vocabulary governance—that should be addressed to strengthen the long-term sustainability, semantic coherence, and interoperability of the system. These issues do not undermine the fundamental contribution of the paper, but they are significant enough to merit revision.

Evaluation Against Application Report Criteria

(1) Quality, Importance, and Impact of the Described Application

The iSamples schema provides a practical and lightweight approach to a domain-crossing challenge of high relevance: the structured description and integration of material samples across the sciences. The project targets a key barrier in the broader realization of FAIR data infrastructure for the physical sciences and natural history collections. Few prior efforts have attempted to unify sample metadata across domains while remaining deployable in modern data ecosystems. The authors’ use of JSON Schema and LinkML, combined with SKOS vocabularies and published crosswalks to standards like Schema.org, DataCite, and TDWG MIDS, demonstrates a commitment to technical interoperability.

The impact of the work is further underscored by its adoption across real data sources, notably SESAR, Open Context, GEOME, and the Smithsonian. The successful integration of millions of existing records using this schema is a compelling demonstration of its applicability and operational maturity. The model’s thoughtful decomposition of sample type into material type, sample object type, and sampled feature type enables scalable classification while avoiding rigid ontological over-specification.

Still, the current version of the schema includes some critical modeling simplifications that may limit its effectiveness in more complex or semantically aware systems. The most important of these relate to the representation of time and process (SamplingEvent), the structure and provenance of cross-referenced identifiers, the traceability of vocabulary mappings, and the modeling of procedural metadata. Each of these issues is discussed in detail below, with constructive suggestions for revision.

Detailed Review and Suggestions for Revision

Temporal Modeling Inconsistencies

The current schema conflates durative and instantaneous temporal representations by modeling SamplingEvent as a unitary structure that serves both as the record of sample collection and the temporal marker of the event, with a single result_time field. While the paper aligns SamplingEvent conceptually with prov:Activity and sosa:Sampling, both of these ontologies model sampling as a durative process, requiring at minimum a start and end time and, semantically, an alignment with time:Interval.

The use of a single timestamp, possibly at coarse granularity (e.g., year or month), implicitly treats the sampling as an instantaneous event. This creates a mismatch with PROV-O and OWL-Time expectations and undermines interoperability with systems that rely on temporal reasoning, duration tracking, or sampling provenance chains.

Suggested revision: Introduce a clear distinction between a SamplingActivity (aligned with prov:Activity and sosa:Sampling) and a SamplingEvent modeled as a time:TemporalEntity, which may be either a time:Instant or time:Interval. Link the activity to its temporal anchors using a dedicated property (e.g., hasTemporalAnchor). This would preserve backward compatibility while enabling more precise representation of sampling workflows and improved alignment with temporal standards.
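
To make the suggested separation concrete, a minimal Python sketch follows. It is an illustration under stated assumptions, not the iSamples implementation: the names TemporalAnchor, SamplingActivity, has_temporal_anchor and from_result_time are hypothetical, and the only element taken from the paper is the existing single result_time value.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class TemporalAnchor:
    """Hypothetical stand-in for time:TemporalEntity (an instant or an interval)."""
    start: datetime
    end: Optional[datetime] = None  # None means the anchor is a single instant

    def __post_init__(self):
        # Reject intervals that end before they begin.
        if self.end is not None and self.end < self.start:
            raise ValueError("end must not precede start")

    @property
    def is_instant(self) -> bool:
        # An anchor with no end, or with end equal to start, is treated as an instant.
        return self.end is None or self.end == self.start


@dataclass
class SamplingActivity:
    """Hypothetical activity record, loosely analogous to prov:Activity / sosa:Sampling."""
    label: str
    has_temporal_anchor: TemporalAnchor


def from_result_time(label: str, result_time: datetime) -> SamplingActivity:
    """Wrap a legacy single timestamp as an instant so existing records stay valid."""
    return SamplingActivity(label, TemporalAnchor(start=result_time))
```

In this sketch a record that carries only result_time maps to an instant, while a multi-day collection can state both bounds, which is one way the backward compatibility mentioned above could be preserved.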

Identifier Modeling and Crosswalk Support

The model’s handling of identifiers is pragmatic but lacks the structure necessary to support automated reconciliation across data systems. The alternate_identifiers field, modeled as a multivalued string list, provides no way to express the scope, provenance, or issuing authority of each identifier. In a schema intended to unify data across museums, laboratories, field surveys, and registration systems, this omission is limiting.

The authors cannot—and should not—be expected to impose version control on external identifier registries. However, they can improve the schema’s utility by adopting a structure akin to Wikidata’s external identifier pattern, in which identifiers are contextualized by scheme, authority, and optionally dereferenceable URI patterns. Such an approach allows for lightweight crosswalks without requiring strict semantic equivalence or complete harmonization.

Suggested revision: Introduce a structured ExternalIdentifier class with fields for the identifier string, scheme name, optional URI template, and additional metadata (e.g., alignment date or alignment source). This would support practical reconciliation workflows, align with FAIR best practices, and improve the model’s ability to serve as a semantic integration pivot.
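
A minimal sketch of such a structure is given below, in Python for brevity. All field names, the "{id}" placeholder convention, and the example.org values in the usage note are assumptions made for illustration; none of them is drawn from the iSamples schema or from any real registry.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExternalIdentifier:
    """Hypothetical structured replacement for a bare alternate-identifier string."""
    value: str                              # the identifier string as issued
    scheme: str                             # name of the issuing scheme or authority
    uri_template: Optional[str] = None      # assumed convention: "{id}" marks the slot
    alignment_date: Optional[str] = None    # when the cross-reference was recorded
    alignment_source: Optional[str] = None  # who or what asserted the cross-reference

    def resolve(self) -> Optional[str]:
        """Return a dereferenceable URI if a template is known, else None.

        No URL escaping is applied; this is a sketch, not production code.
        """
        if self.uri_template is None:
            return None
        return self.uri_template.replace("{id}", self.value)
```

For example, ExternalIdentifier("ABC-123", "example-museum", "https://example.org/specimen/{id}").resolve() yields "https://example.org/specimen/ABC-123", whereas an identifier without a template stays a scoped but non-resolvable string.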

Sampling Procedures and Protocol Metadata

The schema currently accommodates protocol descriptions only as unstructured text embedded in the description field of SamplingEvent, with informal encouragement to include persistent identifiers when available. This provides minimal affordances for provenance-aware systems or AI-based agents attempting to interpret or traverse procedural metadata.

Formal modeling of procedures using prov:Plan or workflow ontologies may be unnecessarily complex for the application’s immediate needs. However, a middle-ground approach—drawing from sosa:Procedure, prov:Plan, and Schema.org’s HowTo or CreativeWork classes—would allow the schema to support both human readability and machine interpretability. Such a model could support LLMs and semantic agents by enabling dereferenceable links to protocols, publications, or standards, while also accommodating rich natural language descriptions.

Suggested revision: Introduce a lightweight SamplingProcedure class with fields for label, description, optional identifier (e.g., DOI or URI), and links to related documentation. Align this construct with schema:HowTo, schema:CreativeWork, and sosa:Procedure. This will allow for machine-discoverable procedural metadata without requiring deep formalization, and support future integration with protocol registries or workflow systems.
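
One possible shape for this class is sketched below in Python. The class, its field names, and the choice of the schema.org citation property for documentation links are assumptions for illustration only; the authors may prefer a different serialization.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SamplingProcedure:
    """Hypothetical lightweight procedure record (not part of the iSamples schema)."""
    label: str
    description: str = ""
    identifier: Optional[str] = None                        # e.g. a DOI or other URI
    documentation: List[str] = field(default_factory=list)  # links to related documents

    def to_jsonld(self) -> dict:
        """Serialize as a schema.org HowTo, omitting fields that are empty."""
        doc = {
            "@context": "https://schema.org/",
            "@type": "HowTo",
            "name": self.label,
        }
        if self.description:
            doc["description"] = self.description
        if self.identifier:
            doc["identifier"] = self.identifier
        if self.documentation:
            # Assumption: related documents are expressed with schema:citation.
            doc["citation"] = list(self.documentation)
        return doc
```

The free-text description remains available for human readers, while the optional identifier and links give software something to dereference.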

Vocabulary Governance and Alignment Provenance

The iSamples schema makes strong use of SKOS-based vocabularies for classifying material samples. However, while the vocabularies are technically well-formed and publicly hosted, the governance model for their evolution is underspecified. The authors acknowledge that versioning is not yet implemented, and no formal process is described for integrating extensions, handling deprecated terms, or tracking updates to alignments with external vocabularies.

Given the complexity and dynamism of scientific classification systems, especially in archaeology and biology, some form of alignment provenance is necessary. Even without enforcing strict versioning on external vocabularies, the authors could document alignment metadata such as the date of mapping, the source of the alignment, or the context of use.

Suggested revision: Introduce lightweight metadata to track the provenance of vocabulary alignments (e.g., alignment date, source authority, mapping method). Consider formalizing versioning for internal iSamples vocabularies and documenting a governance process for extensions and updates.
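
The kind of record intended here can be sketched in a few lines of Python. The class and field names are hypothetical; the only externally defined terms are the five SKOS mapping properties used to constrain the relation.

```python
from dataclasses import dataclass
from datetime import date

# The SKOS mapping properties, written as prefixed names.
SKOS_MAPPING_RELATIONS = frozenset({
    "skos:exactMatch",
    "skos:closeMatch",
    "skos:broadMatch",
    "skos:narrowMatch",
    "skos:relatedMatch",
})


@dataclass(frozen=True)
class VocabularyAlignment:
    """Hypothetical provenance record for one mapping between two concepts."""
    source_concept: str    # URI of the local concept
    target_concept: str    # URI of the external concept
    relation: str          # one of SKOS_MAPPING_RELATIONS
    alignment_date: date   # when the mapping was asserted
    source_authority: str  # who asserted it
    mapping_method: str    # e.g. "manual" or "automated"; free text in this sketch

    def __post_init__(self):
        # Reject anything that is not a recognized SKOS mapping property.
        if self.relation not in SKOS_MAPPING_RELATIONS:
            raise ValueError(f"unknown mapping relation: {self.relation}")
```

Storing one such record per mapping would let a consumer filter alignments by date or authority without requiring the external vocabulary itself to be versioned.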

Geospatial Modeling Ambiguities

The schema includes both sample_location (a direct geospatial coordinate) and a SamplingSite object that contains its own site_location. It is unclear how these two are intended to interact, or whether they may conflict. The duplication introduces the potential for redundant or contradictory location metadata.

Furthermore, while latitude and longitude are modeled as decimal degrees, elevation is captured as a free-text string, which complicates spatial reasoning, filtering, or transformation.

Suggested revision: Clarify the relationship between sample_location and sampling_site. Consider introducing a unified Location or Place class used consistently across contexts. Elevation should be structured as a numeric value with a unit and reference datum, potentially using QUDT or GeoSPARQL-aligned representations.
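
A minimal Python sketch of a unified location structure with structured elevation follows. Everything in it is an assumption made for illustration: the class names, the small unit table, and in particular the precedence rule in effective_location, which is only one possible answer to the question of how the two locations should interact.

```python
from dataclasses import dataclass
from typing import Optional

# Conversion factors to metres for the units this sketch accepts.
_TO_METRES = {"m": 1.0, "ft": 0.3048}


@dataclass(frozen=True)
class Elevation:
    """Hypothetical structured elevation: numeric value, unit, and reference datum."""
    value: float
    unit: str = "m"
    datum: str = "unspecified"

    def in_metres(self) -> float:
        # Raises KeyError for units outside the small table above.
        return self.value * _TO_METRES[self.unit]


@dataclass(frozen=True)
class Location:
    """Hypothetical location type shared by the sample and the sampling site."""
    latitude: float   # decimal degrees
    longitude: float  # decimal degrees
    elevation: Optional[Elevation] = None

    def __post_init__(self):
        if not -90.0 <= self.latitude <= 90.0:
            raise ValueError("latitude out of range")
        if not -180.0 <= self.longitude <= 180.0:
            raise ValueError("longitude out of range")


def effective_location(sample_location: Optional[Location],
                       site_location: Optional[Location]) -> Optional[Location]:
    """Assumed precedence rule: use the sample's own location, else the site's."""
    return sample_location if sample_location is not None else site_location
```

With elevation held as a number plus unit and datum, range filters and unit conversion become simple operations rather than text parsing.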

(2) Clarity and Readability of the Paper

The paper is clearly written and well organized, and it effectively communicates the motivation and architecture of the schema. The discussion of implementation choices is grounded in the constraints of real-world metadata sources, and the mappings to external standards are well-documented and accessible.

That said, some sections—particularly those dealing with temporal representation, identifier scoping, and vocabulary management—would benefit from clearer distinctions between similar constructs and more explicit modeling diagrams. A clearer summary of the schema’s class structure, either as a visual diagram or a concise prose walkthrough, would help readers unfamiliar with LinkML better understand the overall architecture.

Conclusion

This paper presents a significant and timely application of Semantic Web technologies to a problem of broad relevance in data infrastructure: the standardized representation of material sampling metadata across disciplines. The application has already demonstrated measurable impact through its use in large-scale data integration efforts. Its emphasis on JSON-native deployment, lightweight classification, and schema.org alignment makes it suitable for widespread adoption.

Nevertheless, several elements of the model—especially around time, identifiers, procedures, and vocabulary provenance—need refinement to ensure semantic consistency, extensibility, and long-term maintainability. Addressing these issues will not fundamentally alter the schema’s pragmatic philosophy but will make it more robust and aligned with best practices for semantic web infrastructure.

Review #2
Anonymous submitted on 17/Dec/2025
Suggestion:
Reject
Review Comment:

This paper, submitted as an application report, introduces a metadata schema for documenting material samples, along with a SKOS-based vocabulary for characterizing material samples according to different types. The work is developed within the iSamples project, which aims to integrate sample descriptions across domains. Overall, the paper presents a reasonable and potentially valuable application. However, while it demonstrates several strengths, the current version also shows notable weaknesses that should be addressed to meet the standards of the Semantic Web Journal.

Strengths
• The paper addresses the challenging and relevant problem of integrating descriptions of material samples across multiple scientific domains.
• It provides mappings between the iSamples schema and other relevant schemas, which is valuable for interoperability and reuse.

Weaknesses
• The paper ends with a Testing section and lacks a summary or conclusion. This omission makes it difficult to assess the overall impact of the work and reduces the conciseness and coherence of the paper.
• Table 1 presents related work, while Section 6 introduces additional details about these works when discussing mappings. The presentation could be improved to provide readers with a clearer overview and a more integrated understanding of the related efforts.
• The paper would benefit from stronger cross-referencing between sections to improve navigability and coherence.
• Including a running example and a visual representation of the metadata model (as already available on the project webpage) would significantly improve comprehension.

Detailed Comments:
• Page 2: The phrase “FAIR samples” appears without a reference. If FAIR refers to the FAIR Principles, please include a citation to the original paper (https://doi.org/10.1038/sdata.2016.18).
• Page 2, Table 1:
- The term IGSN is explained in a footnote later in the paper but not at its first occurrence in the table.
- The phrase “originally focused … kinds of samples” is not a complete sentence.
- DataCite is listed without any reference, footnote, or explanation, making it unclear.
- The full name of OGC should be provided at first mention.
- Footnote numbering is inconsistent throughout the paper (sometimes appearing before punctuation and sometimes after).
- Some notes provide very limited information (e.g., the note for ESS-DIVE).
• Page 3, Section 2.1: The description of the iSamples architecture is insufficient for readers to understand the overall system.
• Page 5, Section 2.2.4: The sentence “A free text description of the sample.” is incomplete.
• Page 5, Section 2.3.1: The sentence “Be sure to include description of the coordinate reference system used.” reads more like a user-manual instruction than a schema description.
• Page 5: Several sections introduce properties and examples (e.g., authorized_by) without sufficient context. An overview diagram of the data model and a running example would improve clarity.
• Page 5: Typo: “should used” → “should be used”.
• Page 6: Darwin Core is mentioned without explanation or introduction.
• Page 6: Footnotes for ORCID and ROR are duplicated and not provided at their first occurrence.
• Page 8: The rationale for setting the goal that each vocabulary should contain 20 categories is unclear and should be justified.
• Page 9: The Digital Extended Specimen information graph is introduced without explanation.

Overall, while the paper addresses an important problem and presents a promising application, substantial revisions are needed to improve the quality, clarity, and structure of the paper before it can be considered for acceptance in SWJ.