Review Comment:
This paper proposes a method for converting tabular data into a KG. It does so by matching a profile of data columns, consisting of extracted features, with a profile of possible data type relations.
Overall, I am not very impressed with the depth of this work. Reduced to its essence, it is about matching distributions of values. This is then decorated with graphs and tables (largely following semantic web technologies and ontologies), but it feels to me that the opportunity to make actual use of semantic information is neglected. Examples include the use of sub-properties, the explicit use of properties like transitivity, etc.; the graph structures appear as mere data containers in this work.
Not requiring instance lookup seems like a good property. However, the argument regarding the slowness of this lookup is not supported by evidence. I am aware that some slow implementations exist, but that does not mean these techniques are inherently slow.
A further argument is made that not requiring lookup means that Tab2KG can be used for cases where no instances are available. However, I do not see this strongly supported by the experiments. Besides, the essential part of matching distributions is rather limited in scope and requires a lot of handcrafting as well. I would even argue that this goes up to the point that one would need instances to actually create the profiles in the first place.
In the introduction, it is argued that "DAWs treat tabular data as character sequences and numbers without any further semantics." While I do not refute that, this is not necessarily a bad thing. In principle, one could use these networks and classify all parts of the given inputs (using, e.g., a transformer architecture). One could even imagine using a graph generative model after the encoder. I am not saying this is in any way easy, but your statement seems to discount this approach without proper argumentation. Some links to related papers: https://www.aclweb.org/anthology/2020.acl-main.398.pdf, https://www.aclweb.org/anthology/2020.findings-emnlp.27.pdf, https://arxiv.org/pdf/2103.12011.pdf, https://arxiv.org/pdf/2105.07624.pdf, and https://arxiv.org/pdf/2005.08314.pdf
I am a bit confused about why you chose simple statistical features instead of a more principled approach that directly compares the distributions of the occurring values, for example with KL-divergence (a minimal sketch follows below).
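To make the suggestion concrete, here is a minimal sketch of what I have in mind; the function name and binning scheme are my own assumptions, not anything from the paper:

```python
# Sketch: compare two columns' value distributions directly via KL divergence,
# rather than via hand-picked summary statistics. Binning choices are mine.
import numpy as np
from scipy.stats import entropy

def distribution_divergence(col_a, col_b, bins=20):
    # Histogram both columns over a shared value range.
    lo = min(np.min(col_a), np.min(col_b))
    hi = max(np.max(col_a), np.max(col_b))
    p, _ = np.histogram(col_a, bins=bins, range=(lo, hi))
    q, _ = np.histogram(col_b, bins=bins, range=(lo, hi))
    eps = 1e-9  # smoothing keeps the divergence finite on empty bins
    return entropy(p + eps, q + eps)  # scipy normalizes and computes KL(p || q)
```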
In your problem statement, you introduce many definitions. Why not make them compatible with the RDF standard?
For your string features, it seems obvious nowadays that one would use a pre-trained language model to find similarities. However, you opt for things like statistics on string length, etc., which adds assumptions about the strings used. The example you give about 'Germany' and 'GER' is illustrative of this: a language model would likely have no issue with that.
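As an illustration of the alternative (the specific model and library are chosen purely for the example, not taken from the paper):

```python
# Sketch: a pre-trained sentence encoder for string similarity; abbreviation
# pairs like 'Germany'/'GER' plausibly land close in embedding space, which
# string-length statistics cannot capture.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(["Germany", "GER", "France"], convert_to_tensor=True)
print(util.cos_sim(emb[0], emb[1]))  # 'Germany' vs. 'GER'
print(util.cos_sim(emb[0], emb[2]))  # 'Germany' vs. 'France'
```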
In the model, you normalize the data type profile together with the column profile. This struck me as rather odd. Why would that be needed, and in particular, why not normalize them separately? Doing this jointly might mean that one outlier in one of the two profiles greatly affects the other.
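A toy illustration of the concern (the numbers are invented for the example):

```python
# With shared min-max scaling, one outlier in the column profile squashes
# the data type profile into a narrow band near zero; separate scaling
# preserves its spread.
import numpy as np

datatype_profile = np.array([1.0, 2.0, 3.0])
column_profile   = np.array([1.0, 2.0, 100.0])  # contains one outlier

joint = np.concatenate([datatype_profile, column_profile])
joint_scaled = (joint - joint.min()) / (joint.max() - joint.min())
print(joint_scaled[:3])  # data type profile squashed: [0. 0.0101 0.0202]

sep_scaled = (datatype_profile - datatype_profile.min()) / np.ptp(datatype_profile)
print(sep_scaled)        # separate scaling keeps the spread: [0. 0.5 1.]
```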
You mention that 'such a dataset is difficult to obtain'. It appears to me that you could still easily create a good semi-supervised pre-training dataset here.
In the experiments, you use an 80/20 split. Why? A 90/10 split is more common. Besides, given your limited dataset size, k-fold cross-validation would be appropriate.
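For reference, something along these lines would suffice (the data and classifier below are placeholders, not the paper's setup):

```python
# Sketch of 10-fold cross-validation with scikit-learn; X, y, and the
# classifier stand in for the paper's features and model.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X, y = rng.random((100, 5)), rng.integers(0, 2, 100)  # dummy data

scores = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=42).split(X):
    clf = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))
print(f"accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")  # mean over folds
```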
I do not get constraint 3 at the start of page 12. Shouldn't your method be able to detect these cases and deal with them?
In the experimental evaluation, much more insight would be gained from an extensive ablation study. One example of an experiment to include is what happens with more and with fewer missing values. It is also unclear whether the hyper-parameters were sufficiently tuned.
Perhaps I misunderstand something, but a limitation of your work which appears to be glossed over is that you do not seem to pay any attention to literal values referring to existing entities at all. For example, if a string value 'Paris' occurs, you will not map it to the corresponding entity; instead, you will map it to some literal value in the graph.
Some minor comments:
* You mention that table data does not have any machine-readable semantics. This is rather debatable: the table structure itself can be considered semantic information, as the rows and columns indicate that these values are related to each other.
* Check https://developer.github.com/changes/2020-02-10-deprecating-auth-through... to update your GitHub code. I think this is a pretty cool dataset. I do wonder whether, after manually checking the respective licences, you could still distribute it.
* Something I could see you doing in your code, but which was not described in the GitHub data creation, is the removal of duplicate files. This is essential, since otherwise you would have obvious dataset leakage. Note, however, that this duplication check does not guarantee that there is no leakage, since it only looks at the sha1 hash, which is sensitive to, for example, whitespace (see the sketch after this list). So leakage can still happen. This latter point may still be a major concern of mine: how certain are you that there are no dataset leakage problems for the different datasets?
* Arguably, contribution 4 should not be listed separately as a contribution. In any ML paper, it is to be expected that the authors provide their code for scrutiny.
* For numbers, you mention that you include the 'number of digits'. It appears that directly including the logarithm of the number would be more principled: for a positive integer x, the digit count is floor(log10(x)) + 1, so the logarithm carries the same information as a smooth feature rather than a coarsely binned one.
* The current work does not consider cases where multiple columns combine into one value for the graph. Examples would be measurements and their unit or scale, or time intervals.
* It is unclear what would happen if a column can be interpreted in multiple ways. For example, a column containing 0 and 1 values could be numeric or boolean. A column with values in {10, 12, 14, 16} could be numeric, or it could indicate a time. Please clarify this. Also, it might be clever to make the datatype decision dependent on other columns, instead of making it in isolation.
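Regarding the duplicate-file point above, a minimal sketch of why a raw sha1 check is fragile and what a whitespace-normalized check could look like (the function names are mine):

```python
# Raw sha1 over file bytes misses near-duplicates that differ only in
# whitespace; hashing whitespace-normalized content catches them.
import hashlib

def raw_hash(text: str) -> str:
    return hashlib.sha1(text.encode()).hexdigest()

def normalized_hash(text: str) -> str:
    canon = " ".join(text.split())  # collapse all whitespace, incl. line endings
    return hashlib.sha1(canon.encode()).hexdigest()

a = "name,age\nAlice,30\n"
b = "name,age\r\nAlice,30\r\n"  # identical content, different line endings
print(raw_hash(a) == raw_hash(b))                # False: passes the sha1 check
print(normalized_hash(a) == normalized_hash(b))  # True: caught as a duplicate
```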