Semantic Search on the Web

Paper Title: 
Semantic Search on the Web
Authors: 
Bettina Fazzinga and Thomas Lukasiewicz
Abstract: 
Web search is a key technology of the Web, since it is the primary way to access content on the Web. Current standard Web search is essentially based on a combination of textual keyword search with an importance ranking of the documents depending on the link structure of the Web. For this reason, it has many limitations, and there is a plethora of research activities towards more intelligent forms of search on the Web, called semantic search on the Web, or also Semantic Web search. In this paper, we give a brief overview of existing such approaches, including our own, and sketch some possible future directions of research.
Submission type: 
Other
Responsible editor: 
Krzysztof Janowicz
Decision/Status: 
Accept
Reviews: 

Review 1 by Krzysztof Janowicz:

The paper introduces and motivates the need for and importance of semantic search for the web, discusses previous work and the authors' own approach, and points out directions of future work. While I agree that semantic search is one of the core topics to make the semantic web a reality and the paper is well written, I would propose to slightly refocus the content. IMO, a vision statement should be more like a roadmap than a review of existing work. Of course, knowledge of previous work is crucial; however, the most interesting parts of the paper are Sections 3 and 4, in which the authors' own approach is summarized and directions of future work are pointed out.

As far as I understand, the FGGL approach argues for annotating existing data and documents on the web using lightweight ontologies. By doing so, most content on the web becomes available for semantic search, and the power of SW reasoning can be combined with classical IR and ranking approaches from keyword-based search engines. This leads to the question of how to create ontologies for such 'an ontological index over the Web'. The success of the web is mostly due to its heterogeneity and the freedom to contribute whatever kind of data from potentially contradicting viewpoints. These viewpoints are not only reflected in the data but also in the conceptualizations, i.e., the meaning underlying the used vocabulary. Hence, I would argue that the creation and maintenance of these ontologies should be done in a Wikipedia-like manner, and different communities may require their own ontologies. Will these ontologies be visible to the user to make the semantic search transparent? Will the web interfaces expose the new reasoning capabilities to the user or hide them (e.g., for the sake of simplicity)? It would also be interesting to think about the role of ontology evolution in this context.

Moreover, it would be great to read some more details from an information retrieval perspective. For instance, the authors mention context. In classical IR work, inferring hidden contextual information to adapt the results to the user's needs is a crucial component; how can we use semantic web reasoning for this inference step? Would it make sense to have an ontology of user needs and motivations used for matching and query expansion?

Finally, regarding the challenge of (automatically) adding annotations to Web content: will these annotations be stored separately and used for reasoning on-the-fly (see the discussion of GRDDL vs. SAWSDL, ...)?

As you explicitly mention semantic similarity and its role for semantic search: note that the SimCat SIM-DL/SIM-DL_A similarity server is a DIG-compliant similarity reasoner for description logics-based ontologies and is used in application areas such as information retrieval and decision support systems.

Review 2 by Axel Polleres:
The article provides a brief overview of Semantic Search techniques on the Web, ranging from now more "historical" first attempts such as SHOE to recent approaches, including the authors' own work.

While the article is nice to read and summarises the presented approaches well, I would suggest that the authors revise their work before publication, since I believe that a lot of relevant recent work is missing. I will try to detail and explain the missing work in the following:

OntoSelect: http://olp.dfki.de/ontoselect/ This work tries to provide a search engine for ontology terms, which is another important issue neglected by many other search engines. With the rise of Linked Data, however, we will need approaches that help developers and publishers find vocabularies and terms to reuse. Ontology term search is still largely open, but will become important!

SWSE:

The SWSE project [1] has been around for 5 years now, and various publications have resulted from it. SWSE is a data warehouse approach to Semantic Web search, crawling and indexing RDF data from the Web, and providing keyword search, entity consolidation [2], ranking [3,4], SPARQL querying [5] and a form of Web-tolerant, scalable OWL reasoning over the OWL ontologies extracted from the Web [SAOR]. VisiNav [visinav], a graphical interface recently developed on top of SWSE, provides faceted entity-based querying/search.

1. Andreas Harth, Aidan Hogan, Jürgen Umbrich, Stefan Decker. "SWSE: Objects Before Documents!". In Proceedings of Billion Triple Semantic Web Challenge 2008, at the 7th International Semantic Web Conference (ISWC2008), Karlsruhe, Germany, 2008.

2. Aidan Hogan, Andreas Harth, Stefan Decker. "Performing Object Consolidation on the Semantic Web Data Graph". Proceedings of I3: Identity, Identifiers, Identification. Workshop at 16th International World Wide Web Conference (WWW2007), Banff, Alberta, Canada, 2007.

3. Aidan Hogan, Andreas Harth, Stefan Decker. "ReConRank: A Scalable Ranking Method for Semantic Web Data with Context". In Proceedings of the Second International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS 2006), Athens, GA, USA. November 5, 2006.

4. Andreas Harth, Sheila Kinsella, Stefan Decker: Using Naming Authority to Rank Data and Ontologies for Web Search. International Semantic Web Conference 2009: 277-292

5. Andreas Harth, Jürgen Umbrich, Aidan Hogan, Stefan Decker. "YARS2: A Federated Repository for Searching and Querying Graph Structured Data". In Proceedings of 6th International Semantic Web Conference (ISWC2007), Busan, Korea, 2007.

Sindice:
Sindice [6,7] initially started as a look-up index for RDF documents, inspired by classical search engines, but has since advanced to a more powerful engine that also includes ranking [8] and query facilities allowing join-free queries over Web data. Sindice also provides "quarantined" OWL reasoning capabilities [9] to carefully extend search results with inferences over Web data. Besides, the system provides an interactive search interface, extending faceted browsing ideas with the Sig.Ma engine [10], based on Sindice, which allows users to dynamically select and reject sources to refine search results.

6. Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, Holger Stenzhorn, Giovanni Tummarello: Sindice.com: a document-oriented lookup index for open linked data. IJMSO 3(1): 37-52 (2008)

7. Renaud Delbru, Nickolai Toupikov, Michele Catasta, Giovanni Tummarello: A Node Indexing Scheme for Web Entity Retrieval. ESWC (2) 2010: 240-256

8. Renaud Delbru, Nickolai Toupikov, Michele Catasta, Giovanni Tummarello, Stefan Decker: Hierarchical Link Analysis for Ranking Web Data. ESWC (2) 2010: 225-239

9. R. Delbru, A. Polleres, G. Tummarello and S. Decker. Context Dependent Reasoning for Semantic Documents in Sindice. In Proceedings of the 4th International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS). Karlsruhe, Germany, 2008.

10. Giovanni Tummarello, Richard Cyganiak, Michele Catasta, Szymon Danielczyk, Renaud Delbru, Stefan Decker: Sig.ma: live views on the web of data. WWW 2010: 1301-1304

The Coraal engine [11] was developed not for whole-Web search but for specific text search based on documents. It learns a fuzzy ontology from text documents based on linguistic patterns and the document structure, and provides a powerful query language and result ranking.

11. Vít Novácek, Tudor Groza, Siegfried Handschuh: CORAAL - Towards Deep Exploitation of Textual Resources in Life Sciences. AIME 2009: 206-215

Also probably worth a look is this work from Aberdeen: http://www.ontosearch.org/

As for building semantic queries from natural language queries, a recent approach by Zenz et al. [12] should be mentioned.

12. Gideon Zenz, Xuan Zhou, Enrico Minack, Wolf Siberski, Wolfgang Nejdl: From keywords to semantic queries - Incremental query construction on the semantic web. J. Web Sem. 7(3): 166-176 (2009)

Also, even more recently, at this year's ESWC2010 (obviously this was after the present article was submitted), an interesting approach for extracting SPARQL queries from natural language queries was presented [13].

13. Danica Damljanovic, Milan Agatonovic: Natural Language Interface to Ontologies: Combining Syntactic Analysis and Ontology-based Lookup through the User Interaction. ESWC2010

You further mention approximate query answering approaches; here, the approach presented by Oren et al. at ISWC2008 may be worth mentioning as well [14].

14. Eyal Oren, Christophe Guéret, Stefan Schlobach: Anytime Query Answering in RDF through Evolutionary Algorithms. International Semantic Web Conference 2008: 98-113

Lastly, it should be mentioned that industry-driven approaches are starting to encompass Semantic Search (apart from Google!). In particular, Yahoo!'s SearchMonkey [http://developer.yahoo.com/searchmonkey/] or, on top of that, BOSS [http://developer.yahoo.com/search/boss/] should be mentioned. As for the existing mention of Google, I suggest referring to a more concrete reference with details about what Google currently does. A recent WIRED article summarised the developments nicely: http://www.wired.com/magazine/2010/02/ff_google_algorithm/all/1

The whole first part of the paper suggests a classification of Semantic Search approaches; I am not sure whether all the approaches I additionally mentioned fit strictly into one of the three categories you defined.

Lastly, it would be nice (if possible) to get some more details on the authors' approaches and see a closer comparison with the existing approaches. What I'd be interested to know is to what extent the ontologies used are provided, or extracted from the Web. How does it compare to approaches like Coraal (lightweight ontologies extracted/learned purely from text), or SWSE/Sindice (reasoning from the ontologies explicitly provided on the Web)?

In total, I think it would be OK to extend the contribution by at least a page, to address these additional works, which IMO shouldn't be omitted, and compare to them.
