Lessons Learnt from the Named Entity rEcognition and Linking (NEEL) Challenge Series

Tracking #: 1481-2693

Giuseppe Rizzo
Bianca Pereira
Andrea Varga
Marieke van Erp
Amparo Elizabeth Cano Basave

Responsible editor: 
Guest Editors Social Semantics 2016

Submission type: 
Full Paper

Abstract:
The large number of tweets generated daily provides policy makers with means to obtain insights into recent events around the globe in near real-time. The main barrier to extracting such insights is the impossibility of manually inspecting such a diverse and dynamic amount of information. This problem has attracted the attention of industry and research communities, resulting in algorithms for the automatic extraction of semantics from tweets and for linking them to machine-readable resources. While a tweet is superficially comparable to any other textual content, it hides a complex and challenging structure that requires domain-specific computational approaches for mining its semantics. The NEEL challenge series, established in 2013, has contributed to the collection of emerging trends in the field and to the definition of standardised benchmark corpora for entity recognition and linking in tweets, ensuring high-quality labelled data that facilitates comparisons between different approaches. This article reports the findings and lessons learnt through an analysis of specific characteristics of the created corpora, their limitations, lessons learnt from the different participants, and pointers for furthering the field of entity recognition and linking in tweets.

Decision: Minor Revision

Solicited Reviews:
Review #1
By Tim Baldwin submitted on 19/Dec/2016
Minor Revision
Review Comment:

The paper is much improved from the original version, and the authors have
taken on board many of the suggestions in my original review, which was great
to see. In particular, the broader discussion of the approaches taken by
participants and comparison across different years really enhanced the paper,
and represents a significant amount of extra work on the part of the authors.

I have some issues with minor claims in the paper/areas of clarification, but
don't believe the paper needs to go through a third round of reviews:

+ you state that a web user wants to find "exactly what she is looking for"
(p4), which sounds very much like navigational search, which is one of a
number of search modalities; you then go on to claim that the search query
will be composed "by" (should be "of") a mention "to" (should be "of") the
entity of interest; again, this is very much suggestive of navigational
search, which is less than half of searches in practical settings. See Broder
2002 ("A taxonomy of web search") for details. The answer here is to tone down
the controversial and unsubstantiated rhetoric, or remove this claim
altogether ... or provide references to back up this claim, but my experience
with search logs and the IR literature is very much at odds with this claim.

+ surely coverage is also a point of differentiation between KBs (you can have
two KBs which are in the same domain, use the same features to describe
entities, and are updated with similar frequency, but one can have much better
coverage than the other).

+ the URL for ACL (p8) is a bit strange; a more general URL would be https://www.aclweb.org/

+ your claim about long documents and KB update rates (p9) is a bit strange --
surely it's the aggregate count across all documents that affects the update
rate, not the length of individual documents?

+ what do you mean by "unsupervised annotation" (p12)? clarify

+ the "unsupervised naive algorithm" seems to simply merge together
string-identical NE mentions of the same type, in which case "clustering" is
too grand a term (p13); this seems to be used to generate the seed clusters, in
which case perhaps "seed cluster generation through merging of
string- and type-identical named entity mentions" would be more appropriate?
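For the revision, the operation being described can be sketched in a few lines; the `(surface_string, entity_type)` tuple representation below is an assumption for illustration, not the authors' actual data structure:

```python
from collections import defaultdict

def seed_clusters(mentions):
    """Merge NE mentions that are both string- and type-identical.

    `mentions` is assumed to be a list of (surface_string, entity_type)
    tuples. The (string, type) pair itself is the grouping key, so no
    similarity computation or iterative refinement takes place -- which
    is why "clustering" overstates what the algorithm does.
    """
    clusters = defaultdict(list)
    for surface, etype in mentions:
        clusters[(surface, etype)].append((surface, etype))
    return list(clusters.values())

mentions = [
    ("Obama", "Person"),
    ("Obama", "Person"),    # string- and type-identical: merged
    ("Obama", "Location"),  # same string, different type: kept apart
]
print(seed_clusters(mentions))  # two seed clusters
```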

+ is it possible to include the Challenge Annotation Guidelines in an
appendix, or at the very least, to include a URL in the paper? You mention
them a number of times as being critical to the success of the tasks, meaning
people should be provided access to them

+ you state that you use GATE to measure IAA, but how is this measured: is it
at the span+label level (over full spans, a la CoNLL), or at the word+label
level (over partial spans)? clarify

+ I couldn't understand what you meant by "The latter showed best performance,
holding more complexity in the definition of the feature sets" (p16) -- please
clarify

+ "as subsequent .. thus using the output as input" (p19) -- again, not sure
what you are trying to say; please clarify

+ "NIL clustering is addressed as a supervised learning task" -- say more, as
it's far from intuitive that you should be able to solve a clustering task
with a supervised approach
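For context, one standard reduction (which may or may not be what the participants did) is to train a pairwise coreference classifier and take the transitive closure of its positive decisions; a toy sketch, with a token-overlap heuristic standing in for the trained model:

```python
from itertools import combinations

def coref_score(m1, m2):
    """Stand-in for a trained pairwise classifier: a real system would
    score whether two NIL mentions refer to the same (unknown) entity;
    here a toy Jaccard token-overlap heuristic is used instead."""
    a, b = set(m1.lower().split()), set(m2.lower().split())
    return len(a & b) / max(len(a | b), 1)

def nil_clusters(mentions, threshold=0.5):
    """Cluster NIL mentions by union-find over positive pairwise
    decisions, i.e. the transitive closure of the classifier output."""
    parent = list(range(len(mentions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(mentions)), 2):
        if coref_score(mentions[i], mentions[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, m in enumerate(mentions):
        groups.setdefault(find(i), []).append(m)
    return list(groups.values())

# "Barack Obama"~"Obama" and "Obama"~"Michelle Obama" are positive pairs,
# so transitivity pulls all three into one cluster.
print(nil_clusters(["Barack Obama", "Obama", "Michelle Obama"]))
```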

+ in what way was the weighting in the 2013 "harmonic"? It appears to be a
simple macro-average (Equations 6 and 7), which has nothing to do with
harmonic averages
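To illustrate the distinction with hypothetical numbers: a macro-average is an arithmetic mean, while a harmonic mean of the same values is systematically lower when they are unbalanced, so the two terms should not be conflated:

```python
# Two hypothetical per-category scores, as in a macro-average over classes.
scores = [0.9, 0.5]

macro_average = sum(scores) / len(scores)                   # arithmetic mean
harmonic_mean = len(scores) / sum(1.0 / s for s in scores)  # penalises imbalance

print(f"macro-average: {macro_average:.3f}")   # 0.700
print(f"harmonic mean: {harmonic_mean:.3f}")   # 0.643
```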

+ the argument you make about the number of mentions in a tweet and its impact
on the overall evaluation seems wrong (p23) -- all of your evaluation is over
all tweets rather than for individual messages, making this whole argument
misleading at best, as it is not the number of mentions in a given tweet but
the distribution of classes across tweets that is going to have an impact on
overall results.

Language/presentation issues:

+ "enriched tweet" => "enriched tweets" (p1)

+ "to grant high quality" => "resulting in high quality" (p2)

+ "also experienced a strong involvement of the" => "also attracted
strong interest from" (p2)

+ "such as the NEEL-IT" => "such as NEEL-IT" (p2)

+ a minor but important distinction: I would say that named entities have
become a key aspect of "natural language processing" rather than
"computational linguistics" (p3)

+ "Machine Learning, Semantic Web." => "Machine Learning, and the Semantic
Web." (p3)

+ "is being referred by" => "is referred to by" (p3)

+ "in text, may not" => "in text may not" (p4)

+ "have been used interchangeably" => "have become interchangeable" (p3)

+ "explosion on ... generated to solve" => "explosion in ... proposed for"

+ "solve the task" => "approach to the task" (p3)

+ "a poor performance" => "poor performance" (p4)

+ "with short documents" => "over short documents" (p4)

+ "choice for" => "choice of" (p4)

+ "divided in" => "divided into" (p4)

+ "series of features" => "series of document-level features" (p4)

+ "low presence" => "relative absence" (p4)

+ "As more context is available the task becomes easier, little or no ..." =>
"More context makes the task easier, and little or no ..." (p4)

+ "perform a misspelling" => "misspells a term" (p4)

+ "Following Candidate Detection" => "Candidate Detection next" (p5)

+ "the partial end-to-end" => "partial end-to-end" (p6)

+ "further years" => "later years" (p7)

+ "that entity" => "that an entity" (p8)

+ "of Named Entity Recognition" => "of the Named Entity Recognition" (p8)

+ "benchmark of solutions for" => "benchmark for approaches to" (p8)

+ "focused on Web" => "focused on the Web" (p8)

+ "further sections" => "later sections" (p9)

+ "would advantage algorithms" => "would be biased towards algorithms" (p9)

+ "of use" => "for using" (p9)

+ "that few" => "that a few" (p9)

+ "extra effort of the organisation" => "extra effort on the part of the
organisers" (p9)

+ "to enabling" => "to enable" (p10)

+ "(FSD) algorithm [19]" => "(FSD) algorithm of [19]" (there are many FSD
algorithms nowadays)

+ "mined from Twitter" => "downloaded from Twitter" (p11)

+ "semantics diversity" => "semantic diversity" (p11)

+ "Table 4, 5, and 6" => "Tables 4, 5, and 6" (p11)

+ "it consisted" => "consisted" (p12)

+ "Consensus, for" => "Consensus: for" (p12)

+ "Adjudication, a" => "Adjudication: a" (p12)

+ "case of 2015 challenge" => "case of the 2015 challenge" (p12)

+ "Consistency checking, the" => "Consistency checking: the" (p13)

+ "cross-consistency check" is a strange term (occurs a number of times on
p13); perhaps "cross-checking of consistency of ..."?

+ "Adjudication Phase, where the" => "Adjudication Phase: the" (p13)

+ "Consistency checking, the" => "Consistency checking: the" (p13)

+ "iterated further Phase 1" => "iterated between Phases 1 and 2" (p13)

+ "be a measure" => "to be a measure" (p14)

+ "the difficulty" => "the reading difficulty" (p15)

+ "defines that" => "suggests that" (p15)

+ "and to translating" => "and translating" (p15)

+ "no-ASCII" => "non-ASCII" (p16)

+ "a classification tasks" => "a classification task" (p16)

+ "resulted" => "proved" (p17)

+ "a Support Vector Machines" => "a Support Vector Machine" (p17)

+ "addressed the Mention Detection with a large set of linguistic features and
lexicon related" => "addressed the Mention Detection task with a large set of linguistic and
lexicon-related features" (p17)

+ "Shows per year submissions and ..." => "Submissions and ..." (p18)

+ "page rank" => "PageRank" (and add reference)

+ "the so-called end-to-end" => "a so-called end-to-end" (p19)

+ "it is proposed a tokenisation ... based on [68]" => "a tokenisation
... based on [68] was used" (p19)

+ you use both "ngrams" and "n-grams" -- be consistent throughout the paper

+ "using Random Forest" => "using a random forest" (p19)

+ "as means" => "as a means" (p20)

+ "an empirically threshold" => "an empirically-determined threshold"

+ in Equations 2, 3, 10 and 11, use "\neg\in" rather than "\nexists" (which
looks like a funny "A" at first glance)

+ "Presents per year submissions" => "Submissions" (p21)

+ "Since the 2014 .. weighing" => "From the 2014 ... weighting"

+ "Equation 13, Equation 8" => "Equation 13, and Equation 8" (p22)

+ "none knowledge base entry" => "no knowledge base entry" (p22)

+ on p22, use the same font for all occurrences of
"strong_typed_mention_match", "strong_link_match" and "mention_ceaf"; at
present, you appear to use \textit sometimes and raw math mode at other times
(making for awkward character spacing)

+ there is spurious indentation after Equation 14

+ "and false negative (Equation 3)" => "and false negative (Equation 3)
counts" (p23)

+ "and false negative (Equation 11)" => "and false negative (Equation 11)
counts" (p23)

+ "Details of the algorithm is listed" => "The algorithm is detailed" (p23)

+ move the dangling "Where $E$ ..." paragraph to the end of the preceding
paragraph (" ... from Equation 14, where ...")

+ "$T$ is the set of tweets" => "$Tweet$ is the set of tweets"

+ "allows to measure" => "supports the measurement of" (p24)

+ "the NEEL challenges results" => "the NEEL challenge results" (p24)

+ "Table 15, shows" => "Table 15 shows" (p24)

+ "a 76.4% performance" => "a 76.4% precision" (p24)

+ "Table 16, presents" => "Table 16 presents" (p24)

+ "with over" => "by over" (p24)

+ "winner team followed" => "winning team proposed" (p25)

+ "a delta difference" => "an absolute improvement" (p25)

+ "leverage from the" => "leverage the" (p25)

+ "and to measure, while down-breaking" => "and break down" (p25)

+ "the NEEL-IT" => "NEEL-IT" (p25)

Review #2
By Hao Li submitted on 03/Jan/2017
Minor Revision
Review Comment:

This article describes the findings and lessons learnt through the Named Entity rEcognition and Linking (NEEL) challenge series. It covers the following aspects of the "entity recognition and linking in tweets" task: task description, created corpora, comparisons across approaches, lessons learnt, and future work. Generally speaking, the task is important and useful, and the article is solid, well-written, and easy to follow.

I have the following comments:
1. It would be very helpful to analyse the common errors made across systems; this would also shed some light on the problem and on future research directions.
2. It would be interesting to build a validation system, which takes all the submissions as input and outputs the final "best" submission.
3. Tweets are very short, so it would be interesting to create tweet-news corpora, instead of tweets only. This is related to another research task: how to link tweets to their most relevant news articles.