In an earlier post, I discussed the broad term “text analytics” and looked at a few of the automated functions grouped under that umbrella. It’s important to be able to differentiate these functions; for example, entity extraction can help you fill gaps in your database, but won’t necessarily give you comprehensive information on what your text is about. Today, though, I’d like to look at the flip side of this issue: are there any functions associated with text analytics that are labeled separately but actually provide similar information? The labels in question are “Concept Recognition” and “Classification” (though different providers may brand these services using varying terminology). What do these functions provide? Are they distinct enough to justify storing data from each? Let’s take a look at what these functions actually do:
Concept Recognition: Picks out the key ideas to give you an at-a-glance sense of what a document is about. Listed concepts may include corresponding “type” and “subtype” information and/or an importance score relative to the other concepts in the document.
Classification: Assigns one or more categories to the document as a whole. Typically, these categories are organized into a relatively shallow taxonomy*, so an article about a biotech CEO using herself as a guinea pig for experimental anti-aging therapy might be classified to “Science” or “Health::Therapy”.
Essentially, concept recognition and document classification both give you an idea of what subjects your text discusses; the former works on a much more granular level, while the latter gives you a broad categorization. This raises the question: how are these two functions linked? Do the individual concepts identified by a text analytics service inform the overall document classification?
In a perfect world (of text analytics), the specific concepts would not be separated from the broad taxonomic classifications. Niche topics like “scientific journals” would simply be nodes in a much larger, more comprehensive taxonomy:
Books & Literature::Books & Literature Products::Periodicals::Scholarly & Professional Journals::Scientific Journals
Unified topic classification is preferable to a separated approach because, ideally, all of that data fits together. Each specific concept links back to a broader subject matter, and high-level classification is informed by the prominence of individual topics. eContext’s hierarchy includes 450,000 individual topics, and each one has its own address in a massive tree. Our users can classify documents to specific concepts, but view those concepts through as broad or narrow a lens as necessary.
“Concept Recognition” and “Document Classification” limit users to the very beginning and very end of a long subject matter chain. By unifying these services with a comprehensive taxonomy, text analytics companies can provide topic classification that’s more informative, more organized, and supports a wider variety of use cases.
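To make the idea of viewing one classification at different levels a little more concrete, here is a rough Python sketch. It is not eContext’s API or implementation; the function names (split_path, roll_up, summarize), the importance scores, and the “Magazines” path are all made up for illustration. The only things taken from this post are the “::” path notation, the Scientific Journals path, and the “Health::Therapy” category. The sketch treats each concept as a full path in a tree, truncates that path to whatever depth you want, and adds up concept scores under each ancestor, so the same data yields either a broad category or a narrow topic.

```python
from collections import defaultdict

SEPARATOR = "::"  # path delimiter, as used in the taxonomy example in this post


def split_path(path):
    """Split a '::'-delimited taxonomy path into its list of nodes."""
    return [node.strip() for node in path.split(SEPARATOR)]


def roll_up(path, depth):
    """Truncate a path to at most `depth` levels (depth=1 is the broadest view)."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return SEPARATOR.join(split_path(path)[:depth])


def summarize(scored_paths, depth):
    """Sum the scores of all concepts that share an ancestor at `depth`."""
    totals = defaultdict(float)
    for path, score in scored_paths:
        totals[roll_up(path, depth)] += score
    return dict(totals)


SCI_JOURNALS = (
    "Books & Literature::Books & Literature Products::Periodicals"
    "::Scholarly & Professional Journals::Scientific Journals"
)

# Hypothetical concepts for one document, with invented importance scores.
document_concepts = [
    (SCI_JOURNALS, 0.5),
    # Assumed sibling path, not taken from any real taxonomy.
    ("Books & Literature::Books & Literature Products::Periodicals::Magazines", 0.25),
    ("Health::Therapy", 0.25),
]

if __name__ == "__main__":
    print(summarize(document_concepts, depth=1))  # broad lens
    print(summarize(document_concepts, depth=3))  # narrower lens
```

With depth=1 the two periodical concepts collapse into a single “Books & Literature” bucket, which plays the role of a document-level classification; raising the depth keeps more of each path and moves back toward individual concepts. A real service would likely weight and normalize scores in more careful ways than a plain sum.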
*Occasionally, a text analytics company will feature its own proprietary taxonomy, but most of the time, providers classify text to align with IAB or IPTC standards. While it’s helpful for providers to group their content according to industry conventions, as of 2016, these taxonomies could certainly use some expansion. FWIW, eContext’s taxonomy can be overlaid on these existing structures, to keep things consistent while still allowing for more granular build-outs.