Three questions for Dr. Tom O’Hara, search expert

Tom O’Hara received his Ph.D. at New Mexico State University in Computer Science. His specialty is natural language processing, in particular lexical semantics (i.e., “how words mean”).

He is currently working on multilingual search at Fast, Inc., an enterprise search firm recently acquired by Microsoft.

1)-How do you expect searching on the web via search engines to be different or similar in the year 2020, compared to now?

The biggest change I expect is for it to be much more voice-oriented. Smart phones are driving innovation these days, and voice-enhanced interfaces would be a good way to get around smaller screen limitations. This would include both voice recognition for navigation and voice synthesis for highlighting results likely to be useful.

I see only incremental improvement in other natural language processing capabilities, as these are still long-term research issues. Entity recognition (e.g., company tagging) is already mainstream in enterprise search, so this should be applied more regularly in web searching. Word-sense disambiguation research is now quite mature, so that should start making its way into search engines. In addition, some relational tagging might be feasible (see the SemEval workshop task) to make search geared around semantic roles, such as actor and recipient.

Lastly, note that the web search interface today is really not much different than 10 years ago, excepting accelerators and other minor tweaks; so, we shouldn’t set our hopes too high.

2)-Is the “Semantic Web” already here, at least in some basic sense?

No, I don’t see it as existing in any real sense. There are aspects of it there, but only in limited applications. For me, it will take a while given that there should be broad utilization, say with 5% of all “useful” web pages having some sort of semantic annotation. This will require web authoring tools so that naive users can reliably annotate their pages with precise logical assertions. Having worked on such a tool for DARPA’s Rapid Knowledge Formation (RKF) program for just a limited domain (albeit non-trivial), I am sure that it will take much time just for this type of tool to be available for general knowledge engineering.

As an analogy, the current state of the Semantic Web is like the early days of Natural Language Processing (NLP), circa the Schankian scripts era (mid-70’s) when computers seemed on the edge of understanding language. It turned out to work well for limited domains, but broad-coverage understanding still is a long way off. Of course, NLP is much more ambitious than the Semantic Web; however, there is significant overlap in the underlying technologies, in particular with respect to ontologies.

A specific example might better illustrate this viewpoint. Friend of a Friend (FOAF) is perhaps the best-known application for the Semantic Web, yet it is still mainly limited to academic environments. For the Semantic Web to be really here, something like FOAF should be integrated into Facebook or a similar social networking system. The LiveJournal blogging/journaling system does have some support for FOAF, such as in the export of user interests. However, the number of LiveJournal users is quite small compared to Facebook (e.g., roughly 8 million unique monthly visitors versus 130 million).
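To make the FOAF point concrete, here is a minimal sketch of reading user interests out of a FOAF document. The sample data, the nickname, the example.org URLs, and the function name `interests_by_nick` are all made up for illustration; the layout is only an assumption about what such an export could look like, not LiveJournal’s actual output. Only the FOAF, RDF, and Dublin Core namespace URIs and the `foaf:Person`, `foaf:nick`, and `foaf:interest` terms come from the published vocabularies.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for the RDF, FOAF, and Dublin Core vocabularies.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical FOAF export; the user and the interest URLs are invented.
SAMPLE_FOAF = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <foaf:Person>
    <foaf:nick>example_user</foaf:nick>
    <foaf:interest dc:title="semantic web"
                   rdf:resource="http://example.org/interests/semantic-web"/>
    <foaf:interest dc:title="search engines"
                   rdf:resource="http://example.org/interests/search-engines"/>
  </foaf:Person>
</rdf:RDF>"""


def interests_by_nick(foaf_xml):
    """Map each person's nickname to the titles of their listed interests."""
    root = ET.fromstring(foaf_xml)
    title_attr = "{%s}title" % NS["dc"]  # namespaced attribute key
    result = {}
    for person in root.findall("foaf:Person", NS):
        nick = person.findtext("foaf:nick", default="", namespaces=NS)
        result[nick] = [
            interest.get(title_attr)
            for interest in person.findall("foaf:interest", NS)
        ]
    return result


interests = interests_by_nick(SAMPLE_FOAF)
```

Because the interests are machine-readable assertions rather than free text, a program can aggregate them across users without any language understanding, which is the kind of annotation the Semantic Web depends on.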

3)-Microsoft recently acquired the company you work for, Fast Inc. Why?

Microsoft acquired Fast, Inc. in order to beef up its enterprise search offering. It currently offers SharePoint, which is really just an entry-level system. By acquiring Fast, Microsoft gets intellectual property (IP) rights to ESP (Enterprise Search Platform), the industry-leading enterprise search system utilized by large corporations such as Verizon and McGraw-Hill.

Besides being entry level, SharePoint is heavily Microsoft Office oriented, mainly supporting document types like Word, Excel, PowerPoint, and Outlook. ESP has much broader support for other popular document types such as Lotus Notes. This highlights an aspect of enterprise search that is much different from web search. With web search, most of the documents are either in HTML, PDF, or one of the Microsoft Office types. Enterprise search must deal with niche file types like Documentum that particular companies or government agencies might use extensively and that are thus mission critical for search accessibility.

Another important feature of ESP over SharePoint is its support for XML documents. ESP allows for structured searches that account for hierarchical relationships present in XML documents, rather than simply searching over the text content of XML documents. This can be a computationally intensive process, which is one reason general search engines don’t support it.
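The difference between the two kinds of search can be sketched in a few lines of Python. This is only an illustration using the standard library, not how ESP works internally; the sample document and the function names `flat_search` and `structured_search` are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Made-up sample: "budget" appears in a title in one section
# and only in the body text of the other.
SAMPLE = """
<report>
  <section>
    <title>Budget</title>
    <body>Travel costs rose.</body>
  </section>
  <section>
    <title>Travel</title>
    <body>Budget approval is pending.</body>
  </section>
</report>
"""


def flat_search(root, term):
    """Text-only search: True if the term occurs anywhere in the document."""
    text = " ".join(root.itertext())
    return term.lower() in text.lower()


def structured_search(root, parent_tag, child_tag, term):
    """Return parent elements that have a direct child whose text contains the term."""
    hits = []
    for parent in root.iter(parent_tag):
        for child in parent.findall(child_tag):
            if term.lower() in (child.text or "").lower():
                hits.append(parent)
                break  # one matching child is enough for this parent
    return hits


root = ET.fromstring(SAMPLE)
found_anywhere = flat_search(root, "budget")
sections = structured_search(root, "section", "title", "budget")
matching_bodies = [s.findtext("body") for s in sections]
```

The flat search can only say that the word occurs somewhere, while the structured search returns just the section whose title matches. Walking the element hierarchy for every query is also where the extra computational cost comes from.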

