Typology-based Semantic Labeling of Numeric Tabular Data

Tracking #: 2278-3491

Authors: 
Ahmad Alobaid
Emilia Kacprzak
Oscar Corcho

Responsible editor: 
Guest Editors EKAW 2018

Submission type: 
Full Paper
Abstract: 
A lot of tabular data are being published on the Web. Semantic labeling of such data may help in their understanding and exploitation. However, many challenges need to be addressed to do this automatically. With numbers, it can be even harder due to possible differences in measurement accuracy, rounding errors, and even the frequency of their appearance. Multiple approaches have been proposed in the literature to tackle the problem of semantic labeling of numeric values in existing tabular datasets. However, they suffer from several shortcomings: they are closely coupled with entity linking, rely on table context, need to profile the knowledge graph, and require manual training of the model. Above all, they treat all kinds of numeric values in the same way. In this paper, we tackle these problems and test our hypothesis that taking into account the typology of numeric data in semantic labeling yields a better solution.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Joana Malaverri submitted on 09/Oct/2019
Suggestion:
Accept
Review Comment:

(1) Originality:
This work is an extension of the paper "Fuzzy Semantic Labeling of Semi-Structured Numerical Data Sources", which describes an approach to label the numerical columns of tabular data based on the application of the fuzzy c-means technique. In this extension, the authors introduce a typology of numeric values, based on other work, as a way to improve the labeling of the numerical columns. The work is original in relation to previous work and the state of the art: in general, the status quo has focused mainly on textual columns. In particular, the authors present a model for constructing features based on different numerical types. These features serve as input for the classification and labeling of each numeric column. The methodology is well organized, which helps in understanding the application of the approach.

(2) Significance of the results:
The authors were able to validate the hypotheses of their work using a labeling approach for numerical columns of tabular data based on the application of the fuzzy c-means technique. The authors showed the results obtained using their approach and described their significance by comparing them with the results from previous work. Moreover, they also found types, such as categorical, that are under-represented in the existing related work.

(3) Quality of writing.
It is observed that the authors improved the writing and the organization of the text considering the previous version. However, there are still some points to be improved and clarified in order to improve the content of this work. The following list presents some points that need to be reviewed by the authors:

Section 2. Background:
The phrase "On the other hand, Stevens and Birkhoff [10] typology of numbers does not quite agree with this6." seems to be incomplete. Why do the authors of [10] not agree with the authors of [11]? It is necessary to clarify why Stevens and Birkhoff proposed a new typology of numbers. This becomes even more necessary because you are using this typology in your work.

In the phrase: "In this work, we first detect the type of numerical values as a way to improve the performance of semantic labelling without matching the exact numbers to a property of a matched entity, relying on an ontology, profiling knowledge graphs, manual elimination of properties, or tweaking of parameters (that is knowledge graph dependent). ..." What is your approach to detecting the type of numerical values? You only list other approaches and do not say anything about your own, so the phrase lacks clarity.

3. Results and discussion:
- In the phrase: "...which does not make the column look like an actual sequence(it looks like a subset..." A space is missing after the word sequence.

4. I recommend that the authors double-check the text to correct formatting errors and misspellings.

Review #2
By Ilaria Tiddi submitted on 03/Nov/2019
Suggestion:
Minor Revision
Review Comment:

I see the paper has improved hugely. It seems that most of the issues (mine and the other reviewers') were properly addressed.

Background: much better described. Perhaps a table summarising all types could improve readability?

Regarding my comment on the significance of the results: okay, I am convinced by your argument. I would add your answer to your experimental section as an additional justification of your work.

Results are much improved in clarity and soundness, thanks.

Minor:
- add ~ before \cite (ref 11) on page 2/ line 8 / right
- page 2 /r/ 38. We (humans) > Humans use
- page 4/l/26 : why do you bold "problem statement"? if you really want emphasis on it, rather use italic on the actual statement (line 28-32)
- in general, perhaps section 3 and 4 could become the same section. Something like related work and research question?
- page 8 /r/ 37 : dbp (or dbp:areaOfCatchment
- tables : add a full stop after the caption, and the two footnotes
- remove space (or ~) before some of the \footnotes 14, 19, 20 (did not check them all)
- do not overuse parentheses in your text; they break up the narrative a lot

Review #3
Anonymous submitted on 07/Nov/2019
Suggestion:
Minor Revision
Review Comment:

As already said in my first review, I like the idea, and I think it is a novel and interesting approach to label numerical columns of tables on the web.
I still have two major concerns (detailed below); however, I suggest a “minor revision”, as I see many improvements (and as I know that the two-strike rule would rule the paper out).

1) Type detection heuristics & evaluation:
Still, I am not convinced by the detection of some of the types, or by its evaluation. Your algorithm for the nominal hierarchical numbers is: “the numbers have the same number of digits, and they fail the sequential test; hence, they will be considered hierarchical.” Based on this detection, the “hierarchical” class could be any list of numbers with the same length (e.g. years, which you exclude manually in your experiments).
However, the example of a hierarchical type that you give in the paper is quite complex, and I wonder if this complex hierarchical type even exists in datasets; even more, since you did not report any hierarchical or categorical type in the T2Dv2 dataset.
Given the missing types in the dataset, the evaluation in Table 8 and Table 9 is not really broad and balanced: the sequential type is based on 1 column, the ordinal on 5 columns. So you basically only consider and test the “count” and “other” types?
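
The same-length concern can be illustrated with a minimal sketch. It assumes the quoted rule is applied literally and uses a hypothetical sequential test (distinct sorted values increasing by exactly 1); neither the function names nor that test come from the paper, whose actual implementation may differ.

```python
def digit_length(n: int) -> int:
    """Number of digits in an integer, ignoring the sign."""
    return len(str(abs(n)))


def is_sequential(values: list[int]) -> bool:
    """Assumed sequential test: distinct sorted values increase by exactly 1."""
    ordered = sorted(set(values))
    return len(ordered) > 1 and all(
        b - a == 1 for a, b in zip(ordered, ordered[1:])
    )


def looks_hierarchical(values: list[int]) -> bool:
    """Literal reading of the quoted rule: same digit count, not sequential."""
    same_length = len({digit_length(v) for v in values}) == 1
    return same_length and not is_sequential(values)


# Hypothetical columns, made up for illustration only.
years = [1987, 1992, 2004, 2011, 2018]   # plain years
codes = [1101, 1102, 1201, 1305, 2101]   # codes with a nested structure

print(looks_hierarchical(years))  # True
print(looks_hierarchical(codes))  # True
```

Under these assumptions, a column of years and a column of genuinely nested codes receive the same label, which is why the rule seems to need an additional check.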

Also, you do not discuss the precision and recall results of the “other” sub-type. I wonder what kind of types are in this “other” category? Do they belong to one of the other types? Or are these results indicating that there should be other (sub-)types for numeric columns?

2) Quality of writing:
The writing clearly improved in this version, however, there are still some misformulations, and also the organisation could still be improved. For instance:
- 6.4: The description of the detection algorithm is not very clear and should be reformulated and better organised.
For instance:
“because it *is* the most restrictive”,
“For the second one, it should be one of the sub-types that checks for equal digits” -> which second one?
“For the fourth, we check if it is hierarchical.”
- in the conclusions: “In this paper, we introduce a typology of numeric data taking into account the task of semantic labeling. We show that taking into account the typology of numeric data and using such information to perform semantic labeling results in better performance.”
- evaluations: While you split the paper in various sections and subsections in other parts of the paper (e.g. the very short section 4), you could restructure the evaluation of the type detection and labelling. At the moment both evaluations are in the same subsection and the result discussions can get a bit confusing.