Case Study Auto-Classification

Text Analytics Tools Solve Business Problem

The Problem

A prominent domain name broker owns over a million domain names, which it monetizes through advertising. Its sister division buys domain names and sells them to small and medium-sized companies. The company had two problems that needed to be addressed:

  1.  First, they wanted to generate keywords for each domain name to help increase advertising revenue by providing contextual ads.
  2.  Second, they needed an organizational strategy that would allow easy access by employees and customers to the domain name portfolio by category.

The Solution

We developed a multi-stage information enhancement process that incorporated a number of new techniques to enrich each domain name with metadata so that it could be auto-classified to a business taxonomy. The high-level view of the process is as follows:

  1. We processed the domains and split the name into individual tokens (words).
  2. We applied synonym expansion to each of the unique words to produce an associated keyword list.
  3. We developed a consumer / business taxonomy to organize the domain names.
  4. We resolved content rich web sites for unique words and word phrases by topic for each of the taxonomy nodes.
  5. We used machine learning to train the taxonomy and the auto-classifier.
  6. We executed the auto-classification process to classify the domain names to the taxonomy.


The Solution in Detail

Tokenizing Domain Names

To start the process, a domain name such as BrandNewBaby.com must be split into its unique words and word phrases, such as Brand-new and Baby. We used tools from Lexalytics, Inc. to accomplish this task and were able to accurately split 95 percent of the domain names into proper machine-readable words. The remaining 5 percent were problematic because they were acronyms or nonsense character strings, such as yeipret2.com or Bfd44.com.
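
As a rough illustration of the splitting step, the sketch below segments a domain label against a word list using dynamic programming. It is not the Lexalytics tooling used in the project. The tiny VOCAB set and the segment function are assumptions made for this example, and a real word list would be far larger.

```python
from functools import lru_cache

# Toy vocabulary (an assumption for illustration only).
VOCAB = {"brand", "new", "baby", "a", "by", "ran", "bra"}


def segment(domain, vocab=VOCAB):
    """Split a domain label into known words, or return None if no split exists."""
    # Drop the top-level domain and normalize case: "BrandNewBaby.com" -> "brandnewbaby".
    label = domain.lower().rsplit(".", 1)[0]

    @lru_cache(maxsize=None)
    def best(start):
        # Fewest-words split of label[start:], or None when nothing fits.
        if start == len(label):
            return ()
        candidates = []
        for end in range(start + 1, len(label) + 1):
            word = label[start:end]
            if word in vocab:
                rest = best(end)
                if rest is not None:
                    candidates.append((word,) + rest)
        return min(candidates, key=len) if candidates else None

    result = best(0)
    return list(result) if result is not None else None


print(segment("BrandNewBaby.com"))
print(segment("yeipret2.com"))
```

A label that cannot be covered by vocabulary words, such as a nonsense string, comes back as None and can be routed to manual review. Merging adjacent words into phrases such as Brand-new would need a further step that is not shown here.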

Synonym Expansion

We then used WordNet, the open source thesaurus from Princeton, to expand each word extracted from the domain names into a complementary set of synonyms. These synonyms served as keywords, as a basis for categorizing the domain name to a business taxonomy, and as facets for guided navigation search tools. For a single domain, this expansion generated 47 new terms from the original two words in the domain name. For example, the terms Baby and Brand-new generated the following lists:

Baby synonyms: angel face, babe, bairn, bambino, bundle, changeling, cherub, chick, child, crawler, foundling, infant, kid, little angel, little darling, little doll, little one, newborn, nipper, nursling, papoose, preemie, suckling, tad, toddler, tot, youngster.

Brand-new synonyms: afresh, anew, current, fresh, green, immature, inexperienced, innovatory, nascent, neoteric, newfangled, newly, newness, nouveau, pristine, recent, unused, untried, vernal, virgin, youthful.
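
The expansion step can be sketched as a lookup against a thesaurus. In the example below, the THESAURUS dictionary is a hypothetical miniature stand-in for WordNet, holding a few of the synonyms listed above, and expand_keywords is an illustrative name rather than part of any library.

```python
# Hypothetical miniature thesaurus standing in for a full resource such as WordNet.
THESAURUS = {
    "baby": ["infant", "newborn", "toddler", "tot"],
    "brand-new": ["fresh", "pristine", "unused"],
}


def expand_keywords(tokens, thesaurus=THESAURUS):
    """Return the tokens plus their synonyms, de-duplicated, in first-seen order."""
    keywords = []
    seen = set()
    for token in tokens:
        # Keep the original token first, then any synonyms the thesaurus knows.
        for term in [token] + thesaurus.get(token, []):
            if term not in seen:
                seen.add(term)
                keywords.append(term)
    return keywords


keywords = expand_keywords(["brand-new", "baby"])
print(keywords)
```

Tokens that are missing from the thesaurus pass through unchanged, so every domain name keeps at least its own words as keywords.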

Create a Taxonomy

We developed a 400-node business taxonomy with 16 high-level entry points. Each top-level category had 15 to 25 sub-categories. The top-level categories include:

Business     Family Life    Recreation    Society
Computers    Health         Region        Travel
Education    Home           Shopping      Special Events

The top-level Family Life category contained 25 sub-categories, including Babies, where the BrandNewBaby.com domain name would eventually be classified:

Adoption            Fashion & Apparel      Makeovers        Self Improvement
Babies              Genealogy              Parenting        Teens
Baby Photography    General Family Life    Pets             Weddings
Cooking             Holidays               Photo Sharing    Wedding Photography
Divorce             Kids                   Pre-Schools      Life Style
Fitness             Life Events            Religion
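
One simple way to hold a two-level taxonomy like this in code is a mapping from each top-level category to its sub-categories. The sketch below is an assumed representation, not the format used in the project, and it includes only a handful of the categories named above. Sub-categories for Business and Health are left empty because they are not listed here.

```python
# Partial taxonomy for illustration; only categories named in this case study appear.
TAXONOMY = {
    "Business": [],
    "Family Life": ["Adoption", "Babies", "Baby Photography", "Parenting", "Pets"],
    "Health": [],
}


def category_paths(taxonomy):
    """List every node as a path such as 'Family Life > Babies'."""
    paths = []
    for top, subs in taxonomy.items():
        paths.append(top)
        paths.extend(f"{top} > {sub}" for sub in subs)
    return paths


for path in category_paths(TAXONOMY):
    print(path)
```

Each path then serves as a classification target, so a domain name can be filed under a sub-category and still be reachable from its top-level entry point.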

Train the Taxonomy

The taxonomy was semantically neutral: a category label on its own gave the classifier nothing to match against. The category Family Life > Babies therefore required a machine learning training procedure to develop the criteria for a successful auto-classification process. To that end, we identified 5 to 10 web sites whose subject matter was quintessentially about baby care. For the Babies category, we selected the following web sites:

  1. Parents.com
  2. WebMD.com
  3. Kidica.com
  4. PregnancyAndBaby.com
  5. NIH.gov
  6. BabyCenter.com
  7. BabyZone.com
  8. BouncingBaby.com

We then crawled each web site and analyzed its content for unique words and word phrases. This process produced a list of about 100 words and phrases that were unique to the topic of baby care.
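
The case study does not say how topic-specific terms were selected from the crawled content. One common heuristic, shown here purely as an assumption, is to compare how often a word appears in the category's pages with how often it appears in unrelated background pages. The sample page text, the function names, and the min_count threshold below are all invented for this sketch, and short strings stand in for crawled pages.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def distinctive_terms(category_docs, background_docs, top_n=5, min_count=2):
    """Rank words by their rate in the category relative to the background."""
    cat = Counter(w for doc in category_docs for w in tokenize(doc))
    bg = Counter(w for doc in background_docs for w in tokenize(doc))
    cat_total = sum(cat.values())
    bg_total = sum(bg.values())
    scores = {}
    for word, count in cat.items():
        if count < min_count:
            continue  # ignore words seen too rarely to trust
        cat_rate = count / cat_total
        bg_rate = (bg[word] + 1) / (bg_total + 1)  # add-one smoothing
        scores[word] = cat_rate / bg_rate
    # Highest ratio first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


baby_pages = [
    "Newborn sleep tips: how to help your infant sleep through the night.",
    "Feeding your infant: breastfeeding and formula for the newborn.",
]
other_pages = [
    "How to help your business sleep less: tips for the night shift and your team.",
    "The best formula for your startup and how to pitch it.",
]
print(distinctive_terms(baby_pages, other_pages, top_n=2))
```

Words that never occur in the background pages rank first, while common words such as "your" and "the" fall to the bottom of the list.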

Auto-Classifying Domain Names

At this stage, we had generated two complete lists of terms:

  1. The first was derived from the original domain names using synonym expansion.
  2. The second came from the selected web sites that we had crawled for unique words and phrases.

We then created a machine-learning pattern-matching process to auto-classify the domain names to the new business taxonomy. This was an iterative training process in which we:

  1. Reviewed the classification results for errors in two areas:
    • Domain names that were not classified to the proper category.
    • Domain names that could not be classified to any category.
  2. Placed misclassified domain names in the proper category and reran the auto-classification engine to train on those changes.
  3. Identified new web sites to crawl for additional words and phrases where classification had failed.
    • We also had the option to manually add or delete words in the trainer during the QA phase. This allowed for quick improvements in classification quality.
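
The review loop above can be sketched with a deliberately simple classifier that scores each category by how many of its training terms overlap a domain name's keywords. This is a stand-in for the actual classification engine, whose internals are not described here. The OverlapClassifier class, its method names, and the sample terms are assumptions for illustration.

```python
class OverlapClassifier:
    """Toy keyword-overlap classifier with manual QA corrections."""

    def __init__(self, category_terms):
        # Maps each category path to its set of training terms.
        self.terms = {cat: set(words) for cat, words in category_terms.items()}

    def classify(self, keywords, min_overlap=1):
        """Return the category sharing the most terms, or None if none qualifies."""
        keywords = set(keywords)
        best, best_score = None, 0
        for cat in sorted(self.terms):  # sorted for a deterministic tie-break
            score = len(keywords & self.terms[cat])
            if score > best_score:
                best, best_score = cat, score
        return best if best_score >= min_overlap else None

    def correct(self, keywords, category):
        """QA step: file a reviewed domain's keywords under the proper category."""
        self.terms.setdefault(category, set()).update(keywords)

    def remove_term(self, category, term):
        """QA step: manually delete a misleading training term."""
        self.terms[category].discard(term)


clf = OverlapClassifier({
    "Family Life > Babies": ["infant", "newborn", "toddler"],
    "Family Life > Pets": ["puppy", "kitten"],
})
print(clf.classify(["brand-new", "baby", "infant", "newborn"]))
print(clf.classify(["bairn", "nipper"]))          # no overlap yet
clf.correct(["bairn", "nipper"], "Family Life > Babies")
print(clf.classify(["bairn"]))                    # classified after the correction
```

Each correction enlarges the term set for a category, which mirrors the rerun-after-review cycle: a domain name that failed on one pass can succeed on the next without any change to the scoring logic.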

Value to the Company

Applying existing text analytics tools, together with new techniques, to these business problems provided significant benefits over the existing platform:

  1. The expanded keyword list for each domain name provided:
    • Increased ad revenue by enabling contextual ads.
    • New metadata to enable guided navigation.
    • The ability to support advanced search options like "more like this."
  2. The solution provided an easy-to-use organizational strategy that lets customers view and navigate the domain names for sale by facet or topic.
  3. It also provided a set of tools for employees to view and value domain names when evaluating a portfolio for acquisition.

Lexington eBusiness Consulting

Lexington eBusiness developed the technical strategy and solution using existing text analytics tools and open source software. Lexalytics, Inc. provided the text analytics tools, incremental engineering expertise, and programming support to execute the project.




