Quantcast
| Print |

Comparison and Compatibility Analysis

Structured Data Search Engines and
Unstructured Data (Full-Text) Search Engines

This paper is also available in PDF format: Structured Data Search Improves Full-Text Search Engines

Download AJAX Incremental Search Revisited

Before choosing a search engine technology it is important to consider how search systems are typically classified in two fundamentally different categories.  One category of search engines software was developed for searching full-text documents (web pages and documents on networks and the Internet).  The other category of search engines specializes in indexing and searching structured data (tables, relational databases, directories, catalogs). 

This white paper provides an overview of the principal differences between these two approaches and how to combine the two for optimal performance.  There are many variables in the mind of a search technology buyer these days:  usability, speed, hardware costs, configurability, leveraging of the latest search enhancements (machine learning, facets, etc.), handling of taxonomies, handling of spelling and synonyms, etc.  Application requirements for search engines rarely take this structured/unstructured data dimension in account fully and cannot be satisfied because the underlying search technology sought by users isn’t adequate to the data they want to access and the way they want it accessed.

In this white paper, Exorbyte presents how it has discovered radically more efficient ways of matching structured data which could never be attempted with full-text unstructured documents.  These approaches offer ways to locate data in any language, in very large data sets, at very high speed, and with a level of control that goes beyond most search software's capabilities:  down to the actual strings of characters contained in each key value pairs of each data record in enormous databases, not only locating exact matches but measuring how close all other near-matches are.  The Exorbyte architecture allows the use of Levenshtein edit distance and other advanced string matching algorithms, phonetics, and Semantics combined with advanced scripted rules.

Finally, this paper explores how such software (Exorbyte MatchMaker) can be combined with existing search infrastructures (including traditional full-text search engines) to transcend the widely accepted limits of search engines today.

Table of Contents

Common Search Engine Features

Preparing the Search EngineAll search technologies index blocks of searchable content so that the content is accessible faster and in a more structured way than in its original form.  The first step of indexing is to convert the documents, web pages, data feeds, databases into a usable form. Both structured data and full text search engines require this step and convert their input to a database-like format called an index.  The index can be a database, a compilation binary file or set of binary files, or any other form that the user-facing application can access easily and fast. 

The building of the index usually involves some standard pre-processing steps such as : 

  • removal of skip words (articles of speech such as “for”, “of”), 
  • standardization of verb conjugation endings and plurals, 
  • standardization of certain characters (“ü” in German goes to “ue”), 
  • removal and standardization of non-searchable characters like punctuation marks and hyphens, 
  • addition of mappings to aid in searching such as converting all letters to lower case, 
  • addition of phonetic mappings, etc.

This process reduces the size of the index and, more importantly, increases relevance of index entries to future user queries.
Search engines provide fast access to the indexed data by caching or loading their index binary file(s) into memory (RAM).  The search engine processes queries based on query parameters and returns a list of ranked results (by user or default parameters like relevance, date, categories, etc). The relevance of each result in the list is also calculated according to the query string.  Since the original data or documents remain in their original system (file system, database, network locations), matches produced using the index can be "de-referenced" using an identifier or address linking the user who chooses a given search result to the original document or data record.

back to top ^

Search Engines for Unstructured Data (Full-text)

Building Search ResultsThe index of a full-text search engine contains a list of all words occurring in each document collection along with a set of references to the documents containing them.  Such indexes are called “inverted indices.” Given a query (e.g. “Informatica Wolfram”) the engine determines which documents contain these words from the index. These first sets of results are called “intersection sets” and can become very large.  The relevance of individual documents is refined using additional criteria: the ranking algorithm. The relative frequency of words in each document (keyword density), the significance of individual words, and the distance between the query words are all important criteria for relevance ranking. Such techniques have been included into the open source search engine Lucene for instance. Moreover, additional retrieval techniques of commercial search engines can be implemented separately.  However, the quality of such refinements is also a function engineering skills and experience of the configuration engineers in charge of a search solution. Documents sets grow, change in content, change in format, and the user needs evolve in such ways that search technologies have to be constantly improved and customized to meet the needs of their users. The Commercial search engines like Google, Fast, or Endeca offer even more advanced criteria to determine a document's relevance.  However, their common feature is that their rankings are performed on the intersection sets described above.

Structured data and tables can be coerced into fitting this full-text indexing process.  Individual entries (table rows) are treated as separate documents. A structured search like “street: Broadway” or “city: New York” can then be performed without any problems as long as the syntax is perfect and matching method used is exact (no allowance for spelling or syntax variations at all). Otherwise the quantity of the comparable data will rise extremely which causes an increase of the processing time, too.

back to top ^

Error-Tolerance in Search (Fuzzy Search)

Error-tolerant search capabilities are increasingly becoming standard features of search engines because the error-tolerance makes searching more intuitive, comfortable, and fast for users.  E-commerce sites and other transaction-tied applications increase revenues by neutralizing “zero results” queries and providing meaningful results to more users.
Incorporating error-tolerance in full-text search can be done in two ways:

  1. building equivalence classes during pre-processing,
  2. or by applying similarity algorithms on queries to match index words (query rewriting).

Equivalence classes are built by using fixed rules to generate equivalent spellings for each word, the extra words are then stored in the index.  Soundex is the most well known phonetic similarity algorithm.  For instance, the word “Meier” in German would have the following synonyms as its equivalence class: “Maier, Mayer and Meyer.”  A search for “Maier” leads to an exact match with “Maier” and a set of absolute synonyms: “Meier, Mayer and Meyer”.  This search process is essentially a yes/no decision:  Is the word a match or not?  The engine is making an exact match with added chance of working with common synonyms and misspellings through manufactured synonyms in its index equivalence classes.  In other words the engine isn’t truly comparing the “similarity” of two strings but the “equality” of two strings.  Words not in the equivalence class will not be found.  A search for “testXersion” would never result in “testVersion” being found even though only 1 of 11 letters is different and there are no difference in structure  (one vs. two words for instance). 

In this example above, the system’s ability to locate words that present phonetic similarities is very limited.  Statistical analysis of search logs would confirm that most errors are either:

  • unexpected and unforeseen and 
  • infrequent (statistically insignificant by themselves the “long-tail” of search queries distribution). 

Another disadvantage of the process is the massive bloating of the original query with additional attributes.

 

… A query such as “Maier Softwere DestVersion” would be nearly impossible to resolve accurately for a full-text search engine, because it would require at least 125,000 arithmetic operations.

Query rewriting transforms the search query into the closest set of index words.  With this approach statistical similarity, semantic contexts (phrase removal, aliases) and syntactic criteria (e.g. “Is the query string an address or a phone number?”) can be incorporated into query processing to generate improved queries.  Modified queries are then used to perform exact matching against the index.  In certain circumstances, groups of similar (approximate) matches may be returned.

A common feature of both approaches is that they establish similarity between terms in a first step and then proceed with an exact match in a second step. This restriction to exact match is related to the way in which intersection sets are constructed. A query such as “Maier Softwere DestVersion” would be nearly impossible for a full-text engine, because it would require:

  • the intersection of at least the 50 most similar words to “Maier,”

  • with the 50 likewise to “Softwere”,

  • with the 50 most similar to “Destversion.” 

This would require 50x50x50 = 125,000 arithmetic operations. 

Aside from the computational problems, this procedure is deficient for another reason, namely that there are often more than a hundred ways of misspelling a given word.  Analysis of web search query logs shows 380 different misspellings of “Hewlett Packard” and 120 variations on “Lufthansa.”

Both equivalence classes and query rewriting methods ultimately perform exact searching on the index and for the above mentioned reasons it is not possible to check whether the query results leads to an empty set or not. In address data search for instance, such an approach could lead to locate the most similar street (to the query) in a different city than that specified in the query. The phenomenon is well-known from Google's “Did you mean” feature too. In this case sometimes the suggestion provided leads to “no results”. This behavior is not a result of a lack of quality, but has structural indexing causes.

back to top ^

Search Engines for Structured Data

Any off-the-shelf structured data search technology allows a wide variety of search techniques and ranking criteria per data field. Furthermore, it is also possible to use full-text engines like Lucene, which focus on string retrieval technologies at the document level for structured data but this technology is on an experimental stage (these mechanisms are not part of the official package and must be implemented separately at the user’s own risk).

Structured data search engines can generate search results dynamically faster than full-text engines can with more complexity and accuracy because the data they point to is already structured: making its semantic relationships easily interpretable and its lookup very fast.

As a result of fundamentally different architectures, search engines for structured data can:

  • compile their indexes much faster, 
  • and have a much more compact index, thereby requiring less hardware resources and making the overall engine much faster.

If the addition of data fields requires extra functionality (ex: a yellow pages address database needs to be extended to include operating times - numeric ranges - or geocodes - geometric proximity to another address), things may become relatively complex.  Too complex for most full-text search engines which require a separate index to be built for every extra functionality.

A search engine for structured data behaves like a database.  To add a new field, administrators only have to create the field and to specify its recall method.  Moreover flexible scripting constructs can be employed to tweak matching behavior and suit any application (Server-Side Scripting).  This prevents the bloating of the computing request, reduces the number of candidates for the mapping by excluding unsuitable candidates and minimizes costly programming efforts.

There is hardly a difference in exact match mode query processing times among each search engine type.  Approximate queries, on the other hand, show clear performance differences due in part to the structural differences described above.  In general, there are certain queries and results sets which full-text search architectures engines cannot even process because their indices don’t contain enough structure to define a complete method of expressing relevance.

back to top ^

Error-tolerant searching (fuzzy)

Since structured data search engines index entire sets of database tables, they are able to compute multiple intersection sets with varying ranking values and intersect these sets in a meaningful way to produce a final set of results.  For example, even when all query fields are misspelled or are moderately “similar” to the query value strings the engine can still compute the “closest” intersection set that is possible. This set also usually happens to be the least computationally expensive to compute.  For a full-text engine, this would be a combinatorial nightmare. 

Efficient structured data search engines go much further than the above “basic” scenario because they can create versatile indexes whose ranking algorithms can be dynamically modified by the client application.  This means the client system has the flexibility to react to a wide range of queries, and ever-changing search requirements.  Such adaptability mitigates obsolescence and provides future safety.

back to top ^

Hardware Requirements Comparison

Different types of search engines have significantly different hardware demands when dealing with structured data. Their differences in regard to hardware are directly attributable to their respective architectures.

In general, unstructured data tends to consist of fewer documents with many words while structured data (e.g. address data) have many “documents” (records) with fewer words per “document”.  If structured data is indexed by a full-text search engine, the resulting index is often relatively large and may even be larger than the original data set. Search engines for structured data, conversely, create indices which smaller than the original datasets (example : 20 to 40% smaller in the case of postal address data for instance). 

Let's compare the hardware requirements of Exorbyte MatchMaker and the Lucene open source search engine using a concrete example: 

A data set consisting of 50 Million personal address records (first name, last name, street number, street 1, street 2, zip, city, country, etc.).  This could be the yellow / white pages database for a large European country like Germany for instance.

For this database Exorbyte MatchMaker would produce an index capable of fitting into 2 GB of RAM maximum. The MatchMaker index server could thus run on a single 32-bit machine, and the same physical server could be used to host the database as well.

A Lucene index built on the same data would require a minimum of 14 GB of RAM in the best case and up to 28 GB in the worst case. If 32-bit systems were used, the Lucene index would thus need four to nine servers, as well as additional hardware to distribute the queries to the various servers should one decide to optimize retrieval processing via the Lucene MapReduce module.

Results of above memory utilization test case comparison between 3 Lucene and Exorbyte configurations
Results of above memory utilization test case comparison between 3 Lucene and Exorbyte configurations
To See Large Image Click Here

back to top ^

Conclusions

A full-text search engine is appropriate for indexing and searching unstructured data (documents, and web pages).  Structured data, on the other hand, can be handled by full-text search engines when:

  • a) engineering professionals are able to build up the engine according the business model using all necessary extensions relevant to the business case and setting up optimal hardware infrastructure and networking systems,
  • b) and the data structure and query logic are fixed (changes require significant reengineering of search engine configuration and optimization),
  • c) and the index can be kept relatively small (minimizing hardware requirements),
  • d) and indexing is not time-critical. Indexing of structured data into a full-text search engine will usually take hours or more,
  • e) and the data set is concise, needing few sorting criteria, so that hardware efficiency is not a critical consideration,
  • f) and voluminous description fields should be searched by the retrieval technologies of a classic full-text relevance ranking system.

A structured data search engine is thus essential for indexing databases, catalogs, directories, CRM data, XML feeds, and more when:

  • a) the table contains more than 1,000,000 records,
  • b) or fast, fuzzy searching is required (error tolerance),
  • c) or fuzzy multi-field or multi-word matching is desired,
  • d) or the index needs to be updated within seconds with changes in the original structured data being searched,
  • e) or the query logic needs to be flexible (modified easily),
  • f) or the client application needs dynamic control of the query logic,
  • g) or when approximate searches on multiple characteristics is needed (ex: geographic positioning, dates, taxonomies, or price information).

Gero Lueben, Daniel Nicollet, Thomas Steinmann - Exorbyte GmbH - October 2008

back to top ^

 

 

      Customers