I once heard a colleague say:
“When you think about it, every business problem is really a matter of adequate classification.”
Granted, he was doing PR for a classification software company, but broadly speaking, his point is tough to disprove. For starters, classification is undeniably crucial in areas like marketing and sales. That’s why we group people into demographics; it’s why we describe a lead according to its position in the funnel. But really, every member of your organization relies on classification to be successful, whether it’s an intern sorting through junk mail or the C-Suite framing objectives for their shareholders.
Imagine, in your email today, you were solicited by two different companies that you’d recently patronized, each asking you to fill out a customer satisfaction survey. Company A gives you a bunch of very broad, open-ended questions such as “How would you describe your experience?” and there’s a spacious free-form text field for each. Company B gives you a longer list of extremely limited questions — “Approximately how many minutes did it take for a representative to respond to your request?” — and a discrete list of options from which to choose.
Based on these two surveys, which company would you assume has a larger customer base?
Lacking any other information, Company B is the safe assumption. As an organization grows, the opinions of one individual become tougher and tougher to consider. What we need then is an effective way to group concepts together, to align and aggregate and make decisions based on a wider view of the landscape. This is the argument for effective classification; the ability to derive intelligence from larger and larger pools of information is both the burden and hallmark of successful businesses.
The Threat of Bad Classification
I say “effective classification” because, of course, we all know of examples where inadequate labeling actually inhibits understanding. It doesn’t matter if you’re talking about products or people: when the labels are too broad, too vague, or simply irrelevant, they marginalize nuance and cement a foundation of inaccuracy. Unchecked, that broken foundation will compromise decision-making. In short: bad labels are damaging.
So what we need then are criteria that help us to evaluate any classification method:
PRECISION – Are you classifying accurately? The goal here is to eliminate both false positives, where an item is assigned an incorrect label, and false negatives, where an item should receive a given label but doesn’t.
DEPTH – Going hand in hand with precision, the health of your classification depends on how much information your labels can communicate. “Turtle” is informative; “Serrated Hinged Terrapin” more so. Deeper granularity of labels fosters higher-fidelity understanding.
STRUCTURE – The danger of an excessively granular system is that the labels become so specific that they’re meaningless. That’s why we need structure: so that each classification is not just defined in a vacuum, but achieves meaning through its relationships to other classifications. To illustrate using the above example, “Serrated Hinged Terrapin” isn’t all that helpful if you don’t know what a turtle is to begin with.
RECALL – The amount of available information you can successfully classify. Do your labels completely cover the range of different possibilities, or are there items that don’t correspond with your criteria, and thus can’t be classified at all? It’s important to note that, in many classification systems, improving recall can adversely affect precision, and vice versa. Increasing recall often means broadening the rules, which can result in false positives. Trying to improve precision through stricter rules, on the other hand, means fewer items meet those tougher standards. It’s a balancing act.
FLEXIBILITY – There’s no such thing as perfect classification. Any system you can come up with will have some kind of flaw in its precision, depth, structure, or recall. Moreover, change is a constant in any organization, and any classification can become outdated eventually. (Fun fact: there once was a time when Harry Potter could be adequately characterized as a book series, as opposed to an international multimedia entertainment franchise.) If your classification system is going to maintain effectiveness over time, labels and rules must be constantly tested and easily revised.
RELEVANCE – Is the information conveyed via classification in line with what you need to know? This can be a tricky one. As we increase our capacity to record and store data on pretty much everything, many organizations are developing a mentality of “grab everything and make sense of it later.” This pack-rat idea is simple — sometimes even seemingly insignificant data can yield important insights. But if you don’t have the capacity to analyze all that data in an organized way, then you’re really just wasting resources on chasing irrelevance. It’s a good idea to approach classification with a clear goal of what you’re trying to achieve.
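The precision–recall tension described under RECALL can be made concrete with a short sketch. The scenario below is hypothetical: a toy keyword classifier labels customer messages as billing-related, and we compare a broad rule set against a strict one. The messages and keyword sets are invented purely for illustration.

```python
# Hypothetical data: (message text, is it truly about billing?)
messages = [
    ("my invoice is wrong", True),
    ("question about my bill", True),
    ("charge me up, great service!", False),   # "charge" here is not billing
    ("love the new charge port", False),
    ("when is payment due", True),
]

def classify(text, keywords):
    """Label a message as billing-related if it contains any rule keyword."""
    return any(k in text for k in keywords)

def precision_recall(keywords):
    """Score a keyword rule set against the labeled sample."""
    results = [(classify(text, keywords), truth) for text, truth in messages]
    tp = sum(1 for predicted, truth in results if predicted and truth)
    fp = sum(1 for predicted, truth in results if predicted and not truth)
    fn = sum(1 for predicted, truth in results if not predicted and truth)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

broad = {"invoice", "bill", "charge", "payment"}   # broader rules
strict = {"invoice", "payment due"}                # stricter rules

print(precision_recall(broad))    # catches every billing message, but two false positives
print(precision_recall(strict))   # no false positives, but misses one billing message
```

The broad set achieves perfect recall at the cost of mislabeling the two “charge” messages; the strict set never mislabels but fails to catch a genuine billing question. That is the balancing act in miniature.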
Understanding at Scale
Classification at its best is all about an economy of information: How can I consider the largest volume of data while losing as little meaning as possible in the process?
Think back to that survey question for a moment — Company A with its free-form approach, and Company B with its tightly controlled survey questions. Maybe when you read it, you thought: “Well clearly Company A is the more successful, because they’re taking the time to elicit organic responses. They seem like they’re really giving each customer individual attention.”
I would agree with that assessment to some extent — free-form answers can provide fuller overall meaning — but this kind of intel becomes harder to align and consider when you’re a large corporation operating in several markets. The ultimate goal, for a company classifying its data, is to capture all the nuance and variety you’d get from sitting there and manually considering each data point, but in a quantifiable way that makes that meaning accessible quickly and at massive scale.
eContext specializes in a specific kind of classification — the labeling of multimedia content according to the topics discussed — to gain fast, reliable intelligence on behalf of our clients.
eContext’s Semantic Classification
eContext is a rule-based classification engine that annotates content according to topics mentioned. Our clients use semantic classification to organize and interpret all kinds of data, including:
- Web content
- Social posts
- Customer feedback
- Videos
- Search queries
- Messaging
Now, there’s a whole ecosystem of tools available to mine insights from digital content — and many of those solutions don’t have anything to do with that content’s subject matter. Maybe you simply need to know when your customers are most active on social media, or to discover what percentage of searches on your website result in a sale.
But semantic data provides deeper understanding and utility, because it tells you what that content’s actually about. Semantic classification supports a wide variety of applications, including:
- Market research
- Personalized content delivery
- Query response (traditional search box, chatbot, or virtual agent)
- Media planning
- Customer service
- Brand safety
How to tell if you need semantic classification, in three simple questions:
1. Does your role involve any analysis of digital content?
2. Does it help to know what that content’s actually about?
3. Do you have the time or resources to analyze that content manually?
How eContext Recognizes Topics
As discussed above, any decent classification must be sufficiently structured, accurate, and flexible. eContext meets and maintains these standards through two unique elements:
- Topics are organized into a hierarchy comprising 25 verticals, 20 tiers of depth, and 450,000 individual nodes.
- Each topic carries a list of vocabulary rules that identify when that topic is being mentioned.
eContext’s organizational structure is the world’s largest taxonomy of commercial and social topics, comprising over 450,000 categories across 25 verticals. These categories are arranged in a hierarchical structure; the top tier includes very broad topics like “Arts & Entertainment”, “Health”, and “Travel”, while in lower tiers, the topics become more and more granular.
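To make the tier structure concrete, here is a minimal sketch of a hierarchical taxonomy in Python. The nesting shown — including the intermediate node “Rock Music” — is hypothetical and vastly abbreviated; the real taxonomy spans 25 verticals and up to 20 tiers. The point is that a node’s meaning comes from its chain of ancestors, not from the label alone.

```python
# Toy taxonomy (hypothetical shape and intermediate node names):
# each key is a topic, each value is its dict of child topics.
TAXONOMY = {
    "Arts & Entertainment": {
        "Music": {
            "Rock Music": {
                "David Bowie": {},
            },
        },
    },
    "Health": {},
    "Travel": {},
}

def path_to(topic, tree=TAXONOMY, trail=()):
    """Depth-first search for a topic; return its chain of ancestors, or None."""
    for node, children in tree.items():
        here = trail + (node,)
        if node == topic:
            return list(here)
        found = path_to(topic, children, here)
        if found:
            return found
    return None

print(path_to("David Bowie"))
# ['Arts & Entertainment', 'Music', 'Rock Music', 'David Bowie']
```

A classification of “David Bowie” thus carries every broader tier above it for free — the structural context argued for earlier.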
To classify text into its 450,000 topic categories, eContext utilizes a database of 55 million positive and negative vocabulary rules. Positive vocabularies indicate if a text string is eligible for a certain classification. For example, in the category “David Bowie”, positive vocabularies include “ziggy stardust” and “thin white duke”. Positive vocabularies let us know when people are using different words to talk about the same thing. Negative vocabularies indicate if a text string is ineligible for a certain classification. For example, in the category “Bow Ties”, negative vocabularies include “pasta” and “noodles”. These rules greatly improve the accuracy of the classification process.
eContext’s 55 million vocabulary rules are trained and maintained by subject experts. This instills common sense in a scalable process that classifies thousands of text inputs per second.
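To illustrate how positive and negative vocabulary rules interact, here is a toy sketch — an illustration of the idea only, not eContext’s actual engine. The rule sets are abbreviated from the examples above.

```python
# Toy positive/negative vocabulary matching (illustrative only).
RULES = {
    "David Bowie": {
        "positive": {"david bowie", "ziggy stardust", "thin white duke"},
        "negative": set(),
    },
    "Bow Ties": {
        "positive": {"bow tie", "bow ties"},
        "negative": {"pasta", "noodles"},  # "bow tie pasta" is food, not fashion
    },
}

def classify(text):
    """Return every category whose positive vocabulary matches the text
    and whose negative vocabulary does not."""
    text = text.lower()
    return [
        category
        for category, vocab in RULES.items()
        if any(term in text for term in vocab["positive"])
        and not any(term in text for term in vocab["negative"])
    ]

print(classify("The Thin White Duke era began in 1976"))  # ['David Bowie']
print(classify("He wore a red bow tie to the gala"))      # ['Bow Ties']
print(classify("Recipe: bow tie pasta with basil"))       # [] -- negative rule fires
```

The first call shows a positive rule catching an alias; the third shows a negative rule suppressing a surface match that would otherwise be a false positive.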
Accessing eContext Classification
eContext offers clients three different ways to take advantage of semantic classification: by web app, on-premise appliance, or API. Which of these options is best for a client depends on the volume of data to be enriched as well as the client’s available resources.
Classify.econtext.com — The most lightweight option, but also the least scalable. eContext’s browser-based tools allow users to review sample-size portions of web, social, or keyword classification. We recommend using the Classify site either to demo the accuracy and depth of eContext classifications, or as an easily accessible adjunct to one of our other solutions.
On-Premise — For clients that need to classify extremely high volumes of data and have the resources to install eContext’s architecture onsite. This represents the high end of the spectrum, and is only necessary if you have the data-ingestion rates of a social media company or big data aggregator.
eContext API — The vast majority of our clients access eContext’s classification engine via API. Users have access to the eContext Taxonomy, can extract topics from data in real-time, and can retrieve keywords from the eContext dataset.
Because the API is the most commonly used option here, subsequent posts will detail its use, including an overview of available functions and a guide to classifying content for a select variety of typical use cases.