Tab2KG: Semantic Table Interpretation with Lightweight Semantic Profiles

Tracking #: 2731-3945

Authors: 
Simon Gottschalk
Elena Demidova

Responsible editor: 
Guest Editors DeepL4KGs 2021

Submission type: 
Full Paper
Abstract: 
Tabular data plays an essential role in many data analytics and machine learning tasks. Typically, tabular data does not possess any machine-readable semantics. In this context, semantic table interpretation is crucial for making data analytics workflows more robust and explainable. This article proposes Tab2KG - a novel method to automatically infer tabular data semantics and transform such data into a semantic data graph. We introduce original lightweight semantic profiles that enrich a domain ontology's concepts and relations and represent domain and table characteristics. We propose a one-shot learning approach that relies on these profiles to map a tabular dataset containing previously unseen instances to a domain ontology. In contrast to the existing semantic table interpretation approaches, Tab2KG relies on the semantic profiles only and does not require any instance lookup. This property makes Tab2KG particularly suitable in the data analytics context, in which data tables typically contain new instances. Our experimental evaluation on several real-world datasets from different application domains demonstrates that Tab2KG outperforms state-of-the-art semantic table interpretation baselines.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 10/May/2021
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

-- SUMMARY --
The paper describes Tab2KG, an approach for semantic table interpretation. Tab2KG does not rely on large-scale cross-domain knowledge graphs to recover the semantics of a given table but instead uses a domain data graph ("domain ontology" + some instances of entities and facts) that describes the specific domain/context of the table content. No instance lookup is used to match the columns of the table to possible properties; instead, the authors use what they call "semantic profiles": feature vectors that describe statistical aspects (average length, number of unique values, etc.) and types (string, boolean, time, etc.) of a list of values. These profiles describe both the columns of the table and the properties of the domain ontology (through the few instances/values). The core point is that two semantic profiles can be very similar to each other even without any overlap between the values used to build them (table cells and domain ontology entities/values, respectively). The approach leverages a Siamese network to assign similarities between semantic profiles and transforms the table into a data graph that uses the same schema as the domain ontology.
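To make the profile idea concrete, here is a minimal sketch of how such a semantic profile could be computed from a list of values; the concrete feature names are my own illustration, not the paper's exact feature set:

```python
from statistics import mean, stdev

def _is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

def semantic_profile(values):
    """Toy semantic profile of a column or relation: a small, instance-free
    feature vector built from summary statistics only (illustrative features,
    not Tab2KG's exact set)."""
    strings = [str(v) for v in values]
    numeric = [float(s) for s in strings if _is_number(s)]
    total_chars = sum(len(s) for s in strings) or 1
    return {
        "num_values": len(strings),
        "unique_ratio": len(set(strings)) / max(len(strings), 1),
        "avg_length": mean(len(s) for s in strings) if strings else 0.0,
        "digit_ratio": sum(c.isdigit() for s in strings for c in s) / total_chars,
        "numeric_share": len(numeric) / max(len(strings), 1),
        "numeric_mean": mean(numeric) if numeric else 0.0,
        "numeric_std": stdev(numeric) if len(numeric) > 1 else 0.0,
    }

# Two such profiles (one from table cells, one from domain graph values) are
# then compared by the Siamese network, even if the value sets do not overlap.
print(semantic_profile(["12.4", "13.1", "9.8"]))
```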

-- OVERALL --
In general, the paper proposes an interesting approach for table understanding, and I like the idea of using semantic profiles. It is worth pursuing techniques that try to alleviate the instance lookup step, which is often the pain point of every state-of-the-art paper in table interpretation. In this regard, the paper has some novelty, as the approach seems somewhat different from earlier works, and there are some ideas that contribute to the more general problem of web table understanding. Indeed, I do think that the approach cannot be directly compared to other table interpretation approaches, as the goal is different, and this should probably be remarked on a bit more in the introduction. My main concern regards the robustness of this approach, which should be evaluated and explained a bit better. The paper is well written and understandable; I liked the problem statement section, which is very clear. Some implementation details (e.g., representation formats), however, might be moved to footnotes or an appendix in my opinion.

-- MAIN COMMENT --
My main concern with the paper regards the robustness of the proposed approach, particularly the nature of these domain data graphs. I think this aspect is not well evaluated. The running example shows that the semantics of a table (weather example) can be recovered using an available ontology that describes that domain, but this seems inconsistent with what is evaluated. In the experiments, the data graphs are obtained from small data tables just a bit bigger than the data table of interest (in the number of columns), which sounds like a solution tailored to each specific table. I would appreciate a better explanation of why this solution is necessary. Why not use an existing soccer ontology, for example? Why does each table need its own ontology? How small do the data graphs have to be (in the number of instances and properties) for the system to achieve good accuracy? The range of possibilities here goes from small "tailored" ontologies (such as the ones used in the current evaluation) up to DBpedia: in which circumstances will this approach continue to work? Clarifying this limitation allows the reader to understand when this approach might be useful and under which assumptions.

-- MINOR COMMENTS --
There are some places in which I would appreciate more details when describing the experiments.
- 7.1.1. Synthetic Github Dataset: the authors use an interesting process to mine small KGs from Github. After describing the process, the authors refer to this dataset as a collection of tables. That is okay, I can see a KG transformed into a table, but something is not completely clear; maybe this part should be complemented with some examples.
- Unlike the other datasets, the Soccer (So) and Weapon Ads (WA) datasets provide a number of "data sources". What is a data source in this context? Is it synonymous with "data table"?
- SemTab Easy (SE): this dataset has been created to allow a comparison with T2K (an instance lookup solution), which is able to provide only data type relations ("columns are mapped to one class only"). If that is correct, then Table 3 might have a typo in the header.
- The Weapon Ads (WA) dataset seems more difficult than the others. Is this related to the larger size of the tables? What are the errors there?
- Also, in Table 2 it seems that T2KMatch is missing a result.
- Page 11, column B, rows 45-48: avoiding the presence of "cyclic structures". This is another aspect to clarify regarding the creation of the Github dataset and the filtering of the other datasets So, WA, ST, SE. From what I understood, a cyclic structure is the presence of two separate entities of the same class in the same row (like the myriad of tables describing presidents and their vice presidents in Wikipedia). How many tables have been filtered out from these datasets because of this aspect? I would appreciate more details when describing this limitation, since it is something that might happen often in tables.
- Table 1: it is not totally clear whether the number of tables refers to the number of tables before or after the transformation into triples (G, T, G_d^T).
- Page 14, column A, rows 22-27: "one-shot learning" or "zero-shot learning"?

Also, there is no mention of why the table schema (i.e., headers) is not considered. Sometimes the header is missing, but when it is present it can help the matching. Why not consider it? There are also several other works that rely on the schema of the table to alleviate the instance lookup problem.

Review #2
By Benno Kruit submitted on 19/May/2021
Suggestion:
Minor Revision
Review Comment:

This paper presents a novel approach to create mappings of data tables to domain ontologies by using lightweight semantic profiles of the ontology domains, which consist of summary statistics of the values in these domains.
The paper is well-written, and presents its contributions in a coherent way. The paper is well-structured and it is easy to follow the main arguments.
The artifacts on which this research is based (code, data) are published on Github in an appropriate way, and seem well-described and complete.
The paper has the potential to make a valid and useful contribution to the field, but the training and evaluation sections contain several issues which should be addressed before acceptance for publication.

Compared to existing approaches to semantic table interpretation, this work does not perform any instance matching, but relies solely on a neural network model trained on summary statistics of ontology domains and data table columns to match columns to data type relations. Then it combines these matches with the domain ontology to create data source mappings as RML, which can be used to convert the table to RDF. This method is evaluated on 5 test datasets from various domains, and in many cases outperforms existing methods with regard to accurately identifying ontology relations for columns. The approach is original, and also integrates the novel idea of re-usable data domain profiles.

The results of the evaluation are appealing. Although the semantic profiles consist of very little information (fewer than 50 features) and the learned model is relatively simple, the approach is still able to outperform existing methods on several datasets. However, the evaluation does not shed light on the _practical impact_ of the _theoretical limitations_ of the method, which the authors mention themselves in the paper. I will briefly discuss these here to check whether I understand them.

1. The model is unable to deal with multiple columns that contain values from the same domain, or to distinguish relations with identical non-literal ranges and domains. Do the authors have an estimate of how frequent these are in the evaluation data?

2. The authors argue that the model distinguishes between different relations by capturing implicit value correlations. For their running example, they show an incorrect mapping where "hasBegin" and "hasEnd" timestamps are swapped. The authors claim that their model would not make this mistake because the mean of the "hasEnd" timestamps is later than the mean of the "hasBegin" timestamps, which would be captured by the semantic profiles. However, this assumes that the values of the data type relation and the column are sampled from the same distribution, which does not hold when the data source comes from some specific partition (e.g., weather observations made at night, or athletes from one country). The authors do not discuss formal properties of their summary statistics (e.g., distinguishing different distributions in terms of skewness or kurtosis), or the possibility that different value sets result in near-identical data profiles. Would it be possible to make such assumptions explicit?

3. The data profile representation of _entity_ domains is purely in terms of the _string length_ of entity labels (and some character-class counts). This representation puts the entire burden of distinguishing two entity domains on their string-length similarity, which again depends on partitioning. For example, in their Soccer dataset there are tables in different languages, which have different string-length distributions for the same entity domain. How would this affect the model?
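To illustrate point 2 with a toy example of my own (the "night-time observations" scenario above, not taken from the paper), a column drawn from a partition of the domain gets a very different summary-statistics profile than the domain-wide relation, even though both instantiate the same relation:

```python
import random

random.seed(0)

# Domain-wide relation values vs. a column from a partitioned data source.
domain_hours = [random.uniform(0, 24) for _ in range(10000)]  # all observations
column_hours = [random.uniform(0, 6) for _ in range(200)]     # night-time table only

def quick_profile(xs):
    m = sum(xs) / len(xs)
    std = (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
    return round(m, 2), round(std, 2), round(min(xs), 2), round(max(xs), 2)

print("relation profile (mean, std, min, max):", quick_profile(domain_hours))
print("column profile   (mean, std, min, max):", quick_profile(column_hours))
# The profiles diverge strongly, so a purely profile-based matcher may miss the
# correct relation or prefer a relation whose domain-wide profile looks "night-like".
```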

In light of these theoretical limitations, the evaluation setup is unfortunately not sufficiently clear. First, I am unsure from the text on which data the Siamese network is trained: does the first paragraph of section 7.1 imply that it was trained only on Github data? From section 5.5, I understand that each KG (so, I assume, each of those extracted from the Github RDF files) is split into a domain KG and an artificial table. In Figure 10, the table and graph are equal, but is it correct that they never overlap in the training data? This is insufficiently clear from the text. Second, as far as I understand, the table mapping is only evaluated within the domain KG. In section 7.1, it is mentioned that "pairs of data tables" are identified of which the columns of the first table are a subset of the columns of the second. How is this subset defined (in terms of column headers, or all values)? How many pairs make up the evaluation setup per dataset? Or is it simply the pairs of tables and graphs that result from the transformation?
Third, the resulting domain KGs have different sizes, which means that the task is much harder for some domains than for others. In Table 1, it is unclear how the average numbers of data type and class relations are calculated: if the domain KGs and tables are generated from the same data, shouldn't the number of columns and the number of relations be equal? The authors do not report the sizes of the candidate sets per domain, so it is not possible from the paper to discern how hard the evaluation task is. If the Siamese network only has to distinguish between 3 to 11 relations within one domain KG, the task is significantly easier than the general-domain case, and the authors should report the performance of a datatype-only approach, for example. How many candidates does the column mapping step generate per column on average? Finally, the error analysis in section 8.3 does not distinguish between different datasets, while the performance varies wildly. Why is the performance on the Github dataset so much higher than on the Weapon Ads dataset, and does that have implications for the generalizability of the approach?
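To make the "datatype-only" suggestion concrete, such a baseline could be as trivial as the following sketch (my own illustration, with made-up names, not part of the paper):

```python
def infer_datatype(values):
    """Very rough datatype inference for a column (illustrative only)."""
    lowered = [v.strip().lower() for v in values]
    if all(v in {"true", "false", "0", "1"} for v in lowered):
        return "boolean"
    try:
        [float(v) for v in lowered]
        return "numeric"
    except ValueError:
        return "string"

def datatype_only_candidates(column_values, relation_datatypes):
    """Keep only the relations whose declared literal range matches the column's
    inferred datatype; `relation_datatypes` maps relation IRI -> datatype label."""
    dtype = infer_datatype(column_values)
    return [r for r, r_dtype in relation_datatypes.items() if r_dtype == dtype]

# With only a handful of candidate relations per domain graph, even this filter
# may already leave a single candidate per column.
relations = {"ex:hasTemperature": "numeric", "ex:hasStationName": "string"}
print(datatype_only_candidates(["12.4", "13.1"], relations))  # ['ex:hasTemperature']
```

Reporting how such a baseline performs per dataset would help contextualize the contribution of the Siamese network itself.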

In short, the work presented in this paper is appealing, but the paper insufficiently addresses how the theoretical limitations of using summary statistics for column-to-relation matching impact the performance in practice, and whether the evaluation setup is able to reflect this.
I am glad to re-evaluate my decision based on changes made to the paper.

Review #3
Anonymous submitted on 25/Jun/2021
Suggestion:
Major Revision
Review Comment:

This paper proposes a method for converting tabular data into a KG. It does so by matching a profile of data columns, consisting of extracted features, with a profile of possible data type relations.

Overall, I am not very impressed with the depth of this work. Reduced to its essence, it is about matching distributions of values. This is then decorated with graphs and tables (largely following semantic web technologies and ontologies), but it feels to me that the opportunity to make actual use of semantic information is neglected. Examples include the use of sub-properties, the explicit use of properties like transitivity, etc.; the graph structures appear as mere data containers in this work.

Not requiring instance lookup seems like a good property. However, the argument regarding the slowness of this lookup is not supported by evidence. I am aware that some slow implementations exist, but that does not mean that these techniques are inherently slow.
A further argument is made that not requiring lookup means that Tab2KG can be used in cases where no instances are available. However, I do not see this strongly supported by the experiments. Besides, the essential part of matching distributions is rather limited in scope and requires a lot of handcrafting as well. I would even argue that this goes up to the point that one would need instances to actually create the profiles in the first place.

In the introduction, it is argued that "DAWs treat tabular data as character sequences and numbers without any further semantics." While I do not refute that, this does not necessarily mean that this is a bad thing. In principle, one could use these networks and classify all parts of the given input (using, e.g., a transformer architecture). One could even imagine using a graph generative model after the encoder. I am not saying this is in any way easy, but your statement seems to discount this approach without proper argumentation. Some links to related papers: https://www.aclweb.org/anthology/2020.acl-main.398.pdf https://www.aclweb.org/anthology/2020.findings-emnlp.27.pdf https://arxiv.org/pdf/2103.12011.pdf https://arxiv.org/pdf/2105.07624.pdf and https://arxiv.org/pdf/2005.08314.pdf

I am a bit confused about why you chose simple statistical features instead of a more principled approach where you would directly compare the distributions of the occurring values, for example with the KL-divergence.
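To sketch what I mean (a minimal example of the suggestion, not something Tab2KG currently does; the binning choice is arbitrary):

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

def column_relation_kl(column_values, relation_values, bins=20, eps=1e-9):
    """Compare a table column against a domain relation by binning both value
    sets onto a shared grid and computing KL(column || relation)."""
    lo = min(np.min(column_values), np.min(relation_values))
    hi = max(np.max(column_values), np.max(relation_values))
    p, _ = np.histogram(column_values, bins=bins, range=(lo, hi))
    q, _ = np.histogram(relation_values, bins=bins, range=(lo, hi))
    return entropy(p + eps, q + eps)  # scipy normalizes the histograms internally

print(column_relation_kl([1.0, 2.0, 2.5, 3.0], [1.2, 2.1, 2.4, 3.1, 2.8]))
```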

In your problem statement, you introduce many definitions. Why not make them compatible with the RDF standard?

For your string features, it seems obvious nowadays that one would use a pre-trained language model to find similarities. However, you opt for things like statistics on string length, etc., which adds assumptions about the strings used. The example you give about "Germany" and "GER" is illustrative of this. A language model would likely not have any issue with that.
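A minimal sketch of this alternative (the model name is just one possible choice, and whether "GER" actually lands close to "Germany" for a given encoder would still need to be verified empirically):

```python
# Compare cell values with a pre-trained text encoder instead of
# string-length statistics.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["Germany", "GER"], convert_to_tensor=True)
print(util.cos_sim(embeddings[0], embeddings[1]))
```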

In the model, you normalize the data type profile together with the column profile. This struck me as rather odd. Why would that be needed, and in particular, why not do it separately? Doing it jointly might mean that one outlier in one of the two profiles greatly affects the other.
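A toy illustration of the concern, assuming joint min-max normalization over the concatenated profiles (which may not be exactly what the paper does): one outlier feature in the column profile squashes the relation profile towards zero.

```python
import numpy as np

def minmax(x):
    return (x - x.min()) / (x.max() - x.min())

relation_profile = np.array([2.0, 3.0, 4.0])
column_profile = np.array([2.0, 3.0, 400.0])  # one outlier feature

joint = minmax(np.concatenate([relation_profile, column_profile]))
print("relation features after joint normalization:   ", joint[:3])
print("relation features after separate normalization:", minmax(relation_profile))
```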

You mention that 'such a dataset is difficult to obtain'. It appears to me that you could still easily create a good semi-supervised pre-training dataset here.

In the experiments, you use an 80/20 split. Why? A 90/10 split is more common. Besides, given your limited dataset size, k-fold cross-validation would be appropriate.
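For instance, a 5-fold setup at the level of knowledge graphs could look like the following skeleton (the `evaluate` function is a placeholder for training and testing the model on a fold):

```python
import numpy as np
from sklearn.model_selection import KFold

def evaluate(train_ids, test_ids):
    """Placeholder: train the Siamese network on the training graphs and
    return the column-mapping F1 on the held-out graphs."""
    return len(test_ids) / (len(train_ids) + len(test_ids))  # dummy value

graph_ids = np.arange(100)  # placeholder for the knowledge-graph identifiers
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

scores = [evaluate(graph_ids[tr], graph_ids[te]) for tr, te in kfold.split(graph_ids)]
print(f"F1 across folds: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```

Reporting the mean and standard deviation across folds would also give a sense of the variance of the results.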

I do not get constraint 3 at the start of page 12. Shouldn't your method be able to detect these cases and deal with them?

In the experimental evaluation, much more insight would be gained from an extensive ablation study. One example of an experiment to include is what happens with more or fewer missing values. It is also unclear whether the hyper-parameters were sufficiently tuned.

Perhaps I misunderstand something, but a limitation of your work which appears to be glossed over is that you do not seem to pay any attention to literal values referring to existing entities at all. For example, if the string value 'Paris' occurs, you will not map it to the corresponding entity; instead, you will map it to some literal value in the graph.

Some minor comments:

* You mention that table data does not have any machine readable semantics. This is rather debatable. The table structure can be considered semantic information; the rows and columns indicate that these values are related to each other.

* Check https://developer.github.com/changes/2020-02-10-deprecating-auth-through... to update your Github code. I think this is a pretty cool dataset. I do wonder whether, after manually checking the respective licences, you could still distribute it.

* Something I could see you doing in your code, but which was not described in the Github data creation, is the removal of duplicate files. This is essential since otherwise you would have obvious dataset leakage. Note, however, that this duplication check does not guarantee that there is no leakage, since it only looks at the sha1 hash, which is sensitive to, for example, whitespace. So, leakage can still happen (see the sketch on normalized hashing after this list). Perhaps this latter point is still a major concern I have: how certain are you regarding not having dataset leakage problems for the different datasets?

* Arguably, contribution 4, should not be listed separately as a contribution. In any ML paper, it is to be expected that the authors provide their code for scrutiny.

* For numbers, you mention that you include the 'number of digits'. It appears that directly including the logarithm of the number would be more principled (see the sketch on logarithmic magnitude after this list).

* The current work does not consider cases where multiple columns combine into one value for the graph. Examples would be measurements and their unit or scale, or time intervals.

* It is unclear what happens if a column can be interpreted in multiple ways. For example, a column containing 0 and 1 values could be numeric or boolean. A column with values in {10,12,14,16} could be numeric, or it could indicate a time. Please clarify this better. Also, it might be clever to make the decision about the datatype dependent on other columns, instead of making it in isolation.
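Regarding the duplicate-file comment above, a whitespace-insensitive fingerprint is easy to add. A minimal sketch of what I mean (my own, not the authors' code; it still misses, e.g., reordered triples):

```python
import hashlib
import re

def content_fingerprint(raw_text: str) -> str:
    """Hash a file after normalizing whitespace and case, so that near-duplicates
    that differ only in formatting still collide."""
    normalized = re.sub(r"\s+", " ", raw_text).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

seen = set()

def is_duplicate(raw_text: str) -> bool:
    fingerprint = content_fingerprint(raw_text)
    if fingerprint in seen:
        return True
    seen.add(fingerprint)
    return False

print(is_duplicate("ex:a  ex:b\t ex:c ."))  # False
print(is_duplicate("ex:a ex:b ex:c ."))     # True: same content, other whitespace
```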
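And regarding the 'number of digits' comment, the two features are closely related for integers, but the logarithm varies smoothly and also works for fractions; a tiny illustration (feature names are my own):

```python
import math

def magnitude_features(x: float) -> dict:
    """'Number of digits' of the integer part vs. a log-scaled magnitude feature.
    For positive integers, digits(x) == floor(log10(x)) + 1, so log10 carries the
    same information but is smoother and remains informative for values < 1."""
    return {
        "num_digits": len(str(abs(int(x)))),
        "log10_magnitude": math.log10(abs(x)) if x != 0 else 0.0,
    }

print(magnitude_features(42))     # {'num_digits': 2, 'log10_magnitude': 1.62...}
print(magnitude_features(0.007))  # digit count collapses to 1, log10 stays informative
```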

Review #4
Anonymous submitted on 29/Jun/2021
Suggestion:
Major Revision
Review Comment:

#overview
This work proposes a solution for creating Knowledge Graphs from tables based on data profiling techniques. In particular, the data profiles comprise domain profiles and table profiles, which are provided as vectors of features and represented as semantic data. Domain profiles are patterns of ontology relations (only datatype relations in this work) and their statistical characteristics, such as value distributions of the data in a sample of the domain KG. Table profiles comprise the columns of a table and the statistical characteristics associated with each column. The table interpretation approach, named Tab2KG, considers the mapping of table columns to the ontology relations and transforms the table into the data graph.

#Originality and contribution
I think that the work is original and very interesting. I have a big concern about the lightweight domain KG. In the evaluation section, the authors mention "DBpedia as a cross-domain knowledge graph", and this seems to contradict what is stated in the introduction with respect to the state-of-the-art approaches: "In the context of DAW, the input data typically represents new instances (e.g., sensor observations, current road traffic events, . . . ), and substantial overlap between the tabular data values and entities within existing knowledge graphs cannot be expected". What if a domain KG is not available? What does "a sample" mean here? How do we measure whether the data in the domain KG are representative?

#presentation of the work
Reading the introduction, I got a slightly different understanding of what is then explained in the other sections. Remove redundant information and keep it short. We already have an explanation in Section 1 of what this work is doing; then a second explanation in Section 2 with the running example; then a detailed and formal description in the problem statement; then Section 4 with the details on profiles. I would suggest keeping a concise description in the introduction and maybe merging the problem statement and the running example into one section; in this way, we reduce the number of sections as well.

#other comments
*Definition 6: data type profiles -> what are the statistical characteristics, i.e., the features, associated with the literal relations? From the definition and the examples in the paper, this is not clear. I was expecting to see some numbers.

*Section 5.4: I can understand that the mapping is normalized to the range [0, 1], but I do not understand how this function measures the similarity: "Given a column profile and a data type relation profile, the mapping function returns a similarity score in the range". Can you provide a formula for how this is effectively measured? (One possible shape of such a formula is sketched after these comments.)

*The knowledge graph set was split into a training set (90%) and a test set (10%). Is this split used for all the other datasets? What happens if we keep 80% for training and 20% for testing?
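To illustrate what kind of formula the Section 5.4 comment is asking for: one common way a Siamese model produces a score in [0, 1] is a sigmoid over a learned distance between the two encoded profiles, e.g. (purely illustrative, not necessarily Tab2KG's actual formulation):

```latex
\mathrm{sim}(c, r) \;=\; \sigma\!\left(-\,w \,\bigl\lVert f(p_c) - f(p_r) \bigr\rVert_2 + b\right) \;\in\; (0, 1)
```

where f is the shared encoder, p_c and p_r are the column profile and the data type relation profile, and w, b are learned parameters. Stating the actual formulation explicitly in the paper would resolve the question.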

#Minors
*Missing verb: "In the case of a data table profile, these attributes the columns."
*Check the correctness of the verb "assign" + "to" or "with"; seas:rank 2 (check spaces)
*Check spaces in triples, e.g., rdf:type