The two workshop areas are intimately connected. Search is one of the most common and powerful applications of text analytics, and text analytics is the best way to make search smarter.
Even though social media and sentiment analysis have received more press over the last couple of years, using text analytics to improve search actually delivers more business value to organizations. It saves time for users across the entire organization, cleans up chaotic collections of unorganized documents (including duplicates and near-duplicates), and improves business decisions by delivering the right information at the right time.
The payoff for improving search is so large that the biggest problem is believing the numbers – but multiple studies keep confirming them. Saving $6,750,000 a year per 1,000 employees is huge, and the figure is probably somewhat understated, since the amount of unstructured text continues to grow (see note below).
But this raises the question – how do search and text analytics work together? We spent considerable time on this in the workshop I conducted at the IKO 2016 conference, but the basic idea for those of you who could not attend the workshop can be summarized as follows:
- Faceted search or navigation is the best way to improve search (way better than just trying to improve those relevancy ranked lists).
- Facets require a large amount of metadata – content types, people, organizations, locations, products, processes, and more.
- Adding metadata has a number of well-known and largely unsolvable issues – getting people to tag documents at all, much less getting them to do a consistent, high-quality job of it.
- Text analytics can be used to generate more and more metadata that is consistent and high-quality (if done correctly).
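To make the link between metadata and facets concrete, here is a minimal sketch in Python. The documents, facet names, and helper functions are all made up for illustration and do not come from any particular search product. The point is only that facet counts and facet filters are simple operations once consistent metadata exists.

```python
from collections import Counter

# Hypothetical documents; titles, facet names, and values are illustrative only.
DOCS = [
    {"title": "Q3 supplier review", "content_type": "report",
     "organization": "Acme", "location": "Singapore"},
    {"title": "Onboarding checklist", "content_type": "procedure",
     "organization": "Acme", "location": "London"},
    {"title": "Market outlook", "content_type": "report",
     "organization": "Globex", "location": "London"},
]

def facet_counts(docs, facet):
    """Count how many documents carry each value of one facet."""
    return Counter(doc[facet] for doc in docs if facet in doc)

def filter_by(docs, **selected):
    """Keep only documents whose metadata matches every selected facet value."""
    return [d for d in docs
            if all(d.get(name) == value for name, value in selected.items())]

# A user narrows the collection one facet at a time.
reports = filter_by(DOCS, content_type="report")
london_reports = filter_by(reports, location="London")

print(facet_counts(DOCS, "content_type"))
print([d["title"] for d in london_reports])
```

If the metadata is missing or inconsistent (say, "London" in one document and "LON" in another), both the counts and the filters quietly go wrong, which is why the quality of the tags matters so much.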
What is the “correct” way to use text analytics to improve search? The answer is, of course, that there is no “one size fits all” solution. The most successful basic model is what I call the hybrid model. The hybrid model does not try to use text analytics to automatically tag documents (except in some cases), nor does it rely on an army of human taggers. Instead, it semi-automates the job of adding metadata tags, as seen in this summary:
- An author creates a document and publishes the document in a content management or SharePoint system.
- Text analytics software (that has been integrated with SharePoint, etc.) analyzes the document and discovers and suggests multiple metadata values – the primary subjects of the document and significant mentions of facet values – the people, organizations, locations, content type, and more.
- These suggestions are presented to the author who reviews the suggestions and either accepts them or offers alternative values.
- The “auto-tagged” and human-curated documents are then published into the appropriate repository – ready to be found quickly and intelligently in the search application.
- Finally, the metadata values can be incorporated into relevancy scores that are much more accurate and useful than simply counting the number of times a search term appears in a document.
This model combines the best of human and machine, providing the consistency and scalability of the machine and the depth and intelligence of the human. It also overcomes the issues of author-generated tags by presenting authors with values they can react to rather than asking them to generate all that metadata. Reacting to suggested values is a much easier cognitive task than thinking up the best keywords. It also turns out that authors are much more likely to actually provide this review. And if a number of authors simply say “yes” to whatever the software suggests, then you at least have the benefits of “automatic” tagging – not as good as a true hybrid solution, but better than no metadata at all.
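The suggest-then-review loop of the hybrid model can be sketched in a few lines of Python. This is a toy under stated assumptions: the vocabulary, the function names, and the substring matching are all hypothetical stand-ins, and real text analytics software would use far richer rules or models to generate its suggestions.

```python
# Hypothetical controlled vocabulary: facet name -> known values.
VOCABULARY = {
    "organization": ["Acme", "Globex"],
    "location": ["London", "Singapore"],
}

def suggest_tags(text):
    """Suggest facet values whose names appear in the document text.

    Plain substring matching stands in for real categorization and
    entity extraction here.
    """
    lowered = text.lower()
    suggestions = {}
    for facet, values in VOCABULARY.items():
        found = [value for value in values if value.lower() in lowered]
        if found:
            suggestions[facet] = found
    return suggestions

def review(suggestions, corrections=None):
    """Apply the author's review.

    Facets the author corrects take the author's values; every other
    suggestion is accepted as is.
    """
    final = {facet: list(values) for facet, values in suggestions.items()}
    final.update(corrections or {})
    return final

text = "Acme opened a new office in London last month."
suggested = suggest_tags(text)

# The author accepts the organization and adds a location the software missed.
published = review(suggested, corrections={"location": ["London", "Reading"]})
print(published)
```

Calling `review(suggested)` with no corrections is the "just say yes" case: the published tags are simply whatever the software suggested.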
There are, of course, variations on this model, and there are situations where it does not apply as well, for example, large collections of legacy documents or external documents. In those cases, the solution would normally be weighted more heavily toward the “automatic” side, but even there, a partial hybrid solution is still best. The human input in these cases comes about in at least two ways. First, as the “automatic” solution runs, tagging hundreds of thousands or millions of documents, subject matter experts (SMEs) and/or a team of librarians or information analysts can periodically review the text analytics-suggested tags for quality. How many documents to review, and how often, will vary by organization, document collection, and anticipated applications.
The second avenue for human input is provided by the feedback that SMEs, authors, and librarians/info analysts generate as they publish or review the text analytics results. This feedback can then be incorporated into the text analytics auto-categorization and entity extraction rules and models to refine and improve them. Having a sample document where a categorization or extraction rule was wrong not only gives the text analyst clues as to what went wrong, but also provides a test case for a new, refined rule.
These refined, improved rules can then be used not only to enhance the hybrid CM-text analytics-search model of tagging with facet metadata values, but also to improve the quality of tagging in those large-volume cases that are more automatic.
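Here is a small Python sketch of that idea: documents flagged during review become test cases for a refined rule. The rule format (required and excluded terms), the sample texts, and the function names are assumptions made for illustration, not the syntax of any actual text analytics product.

```python
def categorize(text, rule):
    """Return True if the text has every required term and no excluded term."""
    words = set(text.lower().split())
    return rule["all_of"] <= words and not (rule["none_of"] & words)

# Hypothetical feedback from reviewers: (document text, should it get the tag?)
FEEDBACK = [
    ("apple reported record iphone sales", True),
    ("apple pie recipe with cinnamon", False),
]

# The original rule tags anything that mentions "apple".
old_rule = {"all_of": {"apple"}, "none_of": set()}

# The refined rule excludes cooking vocabulary found in the failed sample.
new_rule = {"all_of": {"apple"}, "none_of": {"recipe", "pie"}}

def failures(rule):
    """List the feedback documents that the rule still gets wrong."""
    return [text for text, expected in FEEDBACK
            if categorize(text, rule) != expected]

print(failures(old_rule))
print(failures(new_rule))
```

Keeping every flagged document in the feedback set means each later refinement is checked against all earlier mistakes, not just the most recent one.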
There are a number of text analytics and search software vendors that like to claim that their solution is fully automatic. Just plug it in, and out comes quality metadata. My experience has been that these claims are almost always grossly overstated – both in terms of the effort needed to get them to work and the accuracy of the “automatic” solutions.
It does take a significant amount of work to develop highly accurate categorization, sentiment, and extraction capabilities, but that work is shrinking as we learn to build on early efforts with templates, better knowledge organization schemas, and shared best practices. In addition, developing these capabilities creates a platform that can be used for other applications besides search – business and customer intelligence, voice of the customer and voice of the employee, fraud detection, knowledge management applications like expertise location and community collaboration, and dozens more applications that utilize that most under-utilized resource, unstructured text.
But let’s postpone that discussion for another post.
NOTE on dollar savings per year per thousand employees:
This calculation in USD is based on a 30% improvement of search through the application of text analytics, as reported in the workshop I conducted for the IKO conference, and on the figures on the cost of bad search reported in an IDC study by Sue Feldman. A good summary of search studies, including the IDC study, can be found on the Search Technologies website: http://www.searchtechnologies.com/enterprise-search-surveys
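For readers who want to check the arithmetic, the short Python sketch below works backward from the two numbers given in this post. It assumes the savings are simply the improvement percentage multiplied by the annual cost of bad search. That simplification is mine, and the implied cost it produces is derived from the figures here, not quoted from the IDC study.

```python
# Figures taken from this post.
SAVINGS_PER_1000 = 6_750_000   # USD saved per year per 1,000 employees
IMPROVEMENT_PERCENT = 30       # improvement in search from text analytics
EMPLOYEES = 1_000

# Assumption: savings = improvement x cost of bad search, so the implied
# cost is savings divided by the improvement. Integer math avoids rounding.
implied_cost_per_1000 = SAVINGS_PER_1000 * 100 // IMPROVEMENT_PERCENT
implied_cost_per_employee = implied_cost_per_1000 // EMPLOYEES
savings_per_employee = SAVINGS_PER_1000 // EMPLOYEES

print(f"Implied cost of bad search per 1,000 employees: ${implied_cost_per_1000:,}")
print(f"Implied cost of bad search per employee:        ${implied_cost_per_employee:,}")
print(f"Savings per employee:                           ${savings_per_employee:,}")
```

Under that assumption, the savings come to $6,750 per employee per year, against an implied cost of bad search of $22,500 per employee per year.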