Text Analytics Tools Solve Business Problem
A prominent domain name broker owns over a million domain names, which it monetizes through advertising.
Its sister division buys and sells domain names to small and medium-sized companies. The company had two problems that needed
to be addressed:
- First, they wanted to generate keywords for each domain name to help increase advertising revenue by providing contextual ads.
- Second, they needed an organizational strategy that would allow easy access by employees and customers to the domain name portfolio by category.
We developed a multi-stage information enhancement process that incorporated a number of
new techniques to enrich each domain name with metadata so that it could be auto-classified to a business taxonomy. The
high-level view of the process is as follows:
- We processed the domains and split each name into individual tokens (words).
- We applied synonym expansion to each of the unique words to produce an associated keyword list.
- We developed a consumer / business taxonomy to organize the domain names.
- We crawled content-rich web sites to extract unique words and word phrases by topic for each of the taxonomy categories.
- We used machine learning to train the taxonomy.
- We executed the auto-classification process to classify domain names to the taxonomy.
The Solution in Detail
Tokenizing Domain Names
To start the process, a domain name such as BrandNewBaby.com must be split into its
unique words and word phrases, such as Brand-new and Baby. We used tools from Lexalytics, Inc. to accomplish
this task. We were able to accurately split 95 percent of the domain names into proper machine-readable words. The remaining
5 percent were problematic in that they were proper acronyms or nonsense character strings such as yeipret2.com.
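The splitting step can be illustrated with a minimal sketch. It is not the Lexalytics implementation, which is not described here; it assumes a simple dictionary-based segmentation, and the names VOCAB, PHRASES, split_domain and merge_phrases are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical toy lexicon; a real system would use a full dictionary.
VOCAB: Set[str] = {"brand", "new", "baby"}
# Hypothetical phrase table that joins adjacent words into one phrase.
PHRASES: Dict[Tuple[str, ...], str] = {("brand", "new"): "brand-new"}


def split_domain(domain: str, vocab: Set[str]) -> Optional[List[str]]:
    """Split the name part of a domain into vocabulary words, or return None."""
    name = domain.lower().rsplit(".", 1)[0]  # drop the top-level domain
    # best[i] holds a segmentation of name[:i], or None if none was found.
    best: List[Optional[List[str]]] = [None] * (len(name) + 1)
    best[0] = []
    for end in range(1, len(name) + 1):
        for start in range(end):
            prefix = best[start]
            word = name[start:end]
            if prefix is not None and word in vocab:
                candidate = prefix + [word]
                # Prefer the segmentation with the fewest words.
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(name)]


def merge_phrases(tokens: List[str], phrases: Dict[Tuple[str, ...], str]) -> List[str]:
    """Replace adjacent word pairs that form a known phrase."""
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in phrases:
            merged.append(phrases[pair])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged
```

With this toy lexicon, BrandNewBaby.com splits into brand, new and baby, and the phrase table joins the first two into brand-new. A string such as yeipret2.com has no valid segmentation and returns None, which corresponds to the problematic names mentioned above.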
We then used the WordNet open source thesaurus from
Princeton to expand each word extracted from the domain names, generating a complementary set of synonyms that would be used
as keywords, as a basis for categorizing the domain name to a business taxonomy, and as facets for guided navigation search
tools. This expansion process for a single domain generated 47 new terms for the original two words in the domain name. For
example, the terms Baby and Brand-new generated the following lists:
Baby synonyms: angel face, babe, bairn, bambino, bundle, changeling, cherub, chick,
child, crawler, foundling, infant, kid, little angel, little darling, little doll, little one, newborn, nipper, nursling,
papoose, preemie, suckling, tad, toddler, tot, youngster.
Brand New synonyms: afresh, anew, current, fresh, green, immature, inexperienced, innovatory,
nascent, neoteric, newfangled, newly, newness, nouveau, pristine, recent, unused, untried, vernal, virgin, youthful.
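The synonym expansion can be sketched as a lookup that adds each token's synonyms to a keyword list. This is only an illustration: the small in-memory THESAURUS below is a hypothetical stand-in for a WordNet lookup and holds a handful of the synonyms listed above, and expand_keywords is an assumed helper name.

```python
from typing import Dict, List

# Hypothetical stand-in for a WordNet lookup, using a few synonyms from the lists above.
THESAURUS: Dict[str, List[str]] = {
    "baby": ["babe", "infant", "newborn", "toddler", "tot"],
    "brand-new": ["fresh", "pristine", "recent", "unused"],
}


def expand_keywords(tokens: List[str], thesaurus: Dict[str, List[str]]) -> List[str]:
    """Return the tokens plus their synonyms, without duplicates, in first-seen order."""
    keywords: List[str] = []
    seen = set()
    for token in tokens:
        # Keep the original token, then add whatever synonyms are known for it.
        for term in [token] + thesaurus.get(token, []):
            if term not in seen:
                seen.add(term)
                keywords.append(term)
    return keywords
```

Calling expand_keywords(["brand-new", "baby"], THESAURUS) returns the two original tokens followed by their synonyms. A token that is missing from the thesaurus is kept as a keyword on its own.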
Create a Taxonomy
We developed a 400-node business taxonomy with 16 high-level
entry points. Each top-level category had 15 to 25 sub-categories. The top-level categories are:
| Business | Family Life | Recreation | Society |
| Computers | Health | Region | Travel |
| Education | Home | Shopping | Special Events |
The top-level Family Life category contained 25 sub-categories, including
Babies, where the BrandNewBaby.com domain name would eventually be classified:
| Adoption | Fashion & Apparel | Makeovers | Self Improvement |
| Babies | Genealogy | Parenting | Teens |
| | General Family Life | Pets | Weddings |
| Cooking | Holidays | Photo Sharing | Wedding Photography |
| Divorce | Kids | Pre-Schools | Life Style |
| Fitness | Life Events | Religion | |
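A two-level taxonomy like this one can be held in a simple mapping from top-level category to sub-categories. The sketch below is an assumption about representation, not the structure used in the project. It lists only a few of the categories shown above, leaves the sub-categories of other top-level categories empty because they are not given here, and uses the hypothetical helper names category_paths and find_paths.

```python
from typing import Dict, List

# Partial, illustrative taxonomy; only a few categories from the tables above.
TAXONOMY: Dict[str, List[str]] = {
    "Business": [],  # sub-categories not listed in this article
    "Family Life": ["Adoption", "Babies", "Genealogy", "Parenting", "Pets", "Weddings"],
    "Travel": [],  # sub-categories not listed in this article
}


def category_paths(taxonomy: Dict[str, List[str]]) -> List[str]:
    """List every node as a path such as 'Family Life > Babies'."""
    paths: List[str] = []
    for top, subs in taxonomy.items():
        paths.append(top)
        paths.extend(f"{top} > {sub}" for sub in subs)
    return paths


def find_paths(taxonomy: Dict[str, List[str]], name: str) -> List[str]:
    """Return all paths whose last node matches the given category name."""
    return [p for p in category_paths(taxonomy) if p.split(" > ")[-1] == name]
```

Storing nodes as paths keeps the classification target unambiguous when the same sub-category name appears under more than one top-level category.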
Train the Taxonomy
The taxonomy was semantically neutral: a category label such as Family
Life > Babies did not by itself define what belonged in it. Each category therefore required a machine learning training
procedure to develop the criteria for a successful auto-classification process. To that end, we identified 5 to 10 web sites
whose subject matter was quintessentially about baby care. For the
Babies category we selected the following web sites:
We then crawled each web site and extracted its unique words
and word phrases. This process produced a list of about 100 words and phrases that were unique to the topic of baby care.
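One way to approximate "terms unique to a topic" is to keep the words that appear in the topic pages but not in a set of unrelated background pages. The sketch below rests on that assumption, since the article does not say how uniqueness was determined. It works on page text that has already been fetched, handles single words only, and uses the hypothetical names tokenize and topic_terms.

```python
import re
from collections import Counter
from typing import Iterable, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


def topic_terms(topic_pages: Iterable[str],
                background_pages: Iterable[str],
                limit: int = 100) -> List[str]:
    """Return up to `limit` words found in topic pages but not in background pages."""
    topic_counts = Counter(w for page in topic_pages for w in tokenize(page))
    background = {w for page in background_pages for w in tokenize(page)}
    # Most frequent topic words first, skipping anything seen in the background.
    distinctive = [w for w, _ in topic_counts.most_common() if w not in background]
    return distinctive[:limit]
```

The default limit of 100 mirrors the list size mentioned above. Extracting word phrases as well would need an additional step, such as counting adjacent word pairs.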
At this stage we had generated two complete lists of terms:
- The first was derived from the original domain names using synonym expansion.
- The second came from the selected web sites that we had crawled for unique words and phrases.
We then created a machine-learning pattern-matching process
to auto-classify the domain names to the new business taxonomy. This was an iterative training process:
- We reviewed the classification results for errors in two areas:
  - Domain names that were not classified to the proper category.
  - Domain names that could not be classified to any category.
- Domain names that were misclassified were simply placed in the proper category, and the auto-classification engine was rerun to train on those changes.
- Where classification failed, we identified new web sites to crawl to obtain additional words and phrases.
- We also had the option to manually add or delete words in the trainer during the QA phase. This allowed for quick improvements in classification quality.
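The review loop can be illustrated with a deliberately simple stand-in for the classifier. The article does not describe the actual machine-learning method, so this sketch assumes plain keyword overlap between a domain's expanded keyword list and each category's term list. The names classify, apply_correction and min_overlap are hypothetical.

```python
from typing import Dict, Optional, Set


def classify(keywords: Set[str],
             category_terms: Dict[str, Set[str]],
             min_overlap: int = 2) -> Optional[str]:
    """Return the category sharing the most terms with the keywords, or None."""
    best_category: Optional[str] = None
    best_score = 0
    for category, terms in category_terms.items():
        score = len(keywords & terms)
        if score > best_score:
            best_category, best_score = category, score
    # Too little overlap means the domain could not be classified.
    return best_category if best_score >= min_overlap else None


def apply_correction(keywords: Set[str],
                     correct_category: str,
                     category_terms: Dict[str, Set[str]]) -> None:
    """Crude stand-in for retraining: add a corrected domain's keywords to its category."""
    category_terms.setdefault(correct_category, set()).update(keywords)


category_terms = {
    "Family Life > Babies": {"infant", "newborn", "diaper", "toddler"},
    "Family Life > Pets": {"puppy", "kitten", "kennel"},
}
baby_result = classify({"baby", "infant", "newborn", "fresh"}, category_terms)
unclassified = classify({"puppy", "chow"}, category_terms)  # only one shared term
apply_correction({"puppy", "chow"}, "Family Life > Pets", category_terms)
after_fix = classify({"puppy", "chow"}, category_terms)
```

In this example the first domain is classified to Family Life > Babies, the second is left unclassified, and after the manual correction the same keywords are classified to Family Life > Pets. Adding or deleting individual words in category_terms plays the role of the manual trainer edits made during QA.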
Value to the
Applying existing text analytics tools and new techniques to these business problems provided significant benefits over the existing platform:
- The expanded keyword list for each domain name provided:
  - Increased ad revenue by enabling contextual ads.
  - New metadata to enable guided navigation.
  - Ability to support advanced search options like "more like this."
- Provided an easy-to-use organizational strategy for customers to view and navigate domain names that were for sale by facet or topic.
- Provided a set of tools for employees to view and value domain names when evaluating a portfolio for acquisition.
Lexington eBusiness Consulting
Lexington eBusiness developed the technical strategy and solution using existing
text analytics tools and open source software. Lexalytics, Inc. provided the text analytics tools, incremental engineering
expertise, and programming support to execute the project.