Sphinx – the Beginner’s Guide


These days, hardly anyone searches an online store by wandering through the categories or scrolling down long lists of products.

There are plenty of onsite search tools that can make internal site search fast, intuitive and tailored to your customers’ needs.

In this series of articles we are going to review the functionality of the most popular eCommerce onsite search solutions. And the first search toolkit on the list is Sphinx.

What is Sphinx?

Sphinx is an open source search engine with fast full-text search capabilities.

High indexing speed, flexible search capabilities, integration with the most popular database management systems (e.g. MySQL, PostgreSQL) and API support for various programming languages (e.g. PHP, Python, Java, Perl, Ruby, .NET and C++): all of this makes the search engine popular with thousands of eCommerce developers and merchants.

This is what makes Sphinx stand out:

  • high indexing performance (up to 10-15 MB/s on one core)
  • rapid search performance (up to 150-250 queries per second on a core with 1,000,000 documents)
  • high scalability (the biggest known cluster indexes up to 3,000,000,000 documents, and the busiest one handles more than 50 million queries per day)
  • support for distributed real-time search
  • simultaneous support of several fields (up to 32 by default) for full-text document search
  • the ability to store a number of extra attributes for every document (e.g. groups, timestamps, etc.)
  • support for stop words
  • the ability to handle both single-byte encodings and UTF-8
  • support for morphological search
  • and dozens more

All in all, Sphinx has more than 50 different features (and this number is constantly growing). The full feature overview is available on the official website at sphinxsearch.com.

How Sphinx Works

The way the search engine works can be summed up in 2 key points:

  • using the source table, Sphinx creates its own index database
  • next, when you send an API query, Sphinx returns an array of IDs that correspond to those in the source table.
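
The two points above can be illustrated with a toy model. The sketch below is not how Sphinx is implemented; it only shows the idea of building a separate index from a source table and answering a query with document IDs. The table contents and the function names are made up for the illustration.

```python
from collections import defaultdict

# Hypothetical stand-in for the source table: document ID -> text.
SOURCE_TABLE = {
    1: "red aviator sunglasses",
    2: "blue laptop bag",
    3: "aviator leather jacket",
}


def build_index(table):
    """Map every word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in table.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the sorted IDs of the documents that contain every query word."""
    words = query.lower().split()
    if not words:
        return []
    ids = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(ids)


index = build_index(SOURCE_TABLE)
matching_ids = search(index, "aviator")
# The returned IDs are then used to fetch the full rows from the source table.
rows = [SOURCE_TABLE[doc_id] for doc_id in matching_ids]
print(matching_ids, rows)
```

The search step never touches the source table; only the final lookup by ID does, which is the same division of labour as between Sphinx and the database.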

Installing Sphinx on a Server

The installation procedure is pretty easy, and the official documentation provides step-by-step installation instructions for the supported platforms.

Here is an example of installing the search engine on CentOS:

wget http://sphinxsearch.com/files/sphinx-2.1.6-1.rhel6.x86_64.rpm
yum localinstall sphinx-2.1.6-1.rhel6.x86_64.rpm

When the installation is complete, Sphinx creates a default Config file. In the standard scenario, its path is:

/etc/sphinx/sphinx.conf

If you are going to use Sphinx for several projects at the same time, it’s generally advised to create a separate folder for each project’s Config file, Index and Logs.

For example:

Config path – /etc/sphinx/searchsuite.yasha.web.ra/
Index path  – /var/lib/sphinx/searchsuite.yasha.web.ra/
Logs path  – /var/log/sphinx/searchsuite.yasha.web.ra/
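
If you manage several projects, the three folders can be derived from the project name. The helper below is a hypothetical convenience sketch that simply follows the layout shown above; it is not part of Sphinx.

```python
from pathlib import PurePosixPath


def project_paths(project):
    """Build the per-project config, index and log folders for a project name."""
    return {
        "config": PurePosixPath("/etc/sphinx") / project,
        "index": PurePosixPath("/var/lib/sphinx") / project,
        "logs": PurePosixPath("/var/log/sphinx") / project,
    }


for kind, path in project_paths("searchsuite.yasha.web.ra").items():
    print(kind, path)
```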

Configuring the sphinx.conf File

The Sphinx Config file consists of 4 sections:

  • Data Source
  • Index
  • Indexer
  • Search Daemon

Here is how you can configure each of them:

1. Data Source

source catalogsearch_fulltext # catalogsearch_fulltext - the name of the source
{
    type               = mysql    # the type of the database Sphinx connects to 
    sql_host           =               # the host where the remote database is placed 
    sql_user           =               # a remote database user 
    sql_pass           =              # a remote database password 
    sql_db             = yasha_searchsuite    # the name of the remote database 
    sql_port           = 3306  # optional, default is 3306 ; the port, used to connect to the remote database 
    sql_sock           = /var/lib/mysql/mysql.sock    # the socket, used to connect to the remote database (if necessary) 

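    # the first column is the document ID; every attribute declared below must be selected here
    sql_query          = SELECT fulltext_id, product_id, store_id, data_index1, data_index2, data_index3, data_index4, data_index5 FROM catalogsearch_fulltext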

    sql_attr_uint      = fulltext_id    # sql_attr_* — the attributes that are returned during the search process 
    sql_attr_uint      = product_id
    sql_attr_uint      = store_id
    sql_field_string   = data_index1    # sql_field_* — these are the fields that should be indexed 
    ...
    sql_field_string   = data_index5

    sql_query_info     = SELECT * FROM catalogsearch_fulltext WHERE fulltext_id=$id  # additional query, used by the command-line search tool to display document details 
}   

2. Index

index catalogsearch_fulltext
{
    source            = catalogsearch_fulltext    # the data source
    path              = /var/lib/sphinx/searchsuite.yasha.web.ra/catalogsearch_fulltext    # the path to the location where the index is stored
    docinfo           = extern
    charset_type      = utf-8

    min_word_len      = 3    # the minimum length of a word to be indexed; shorter words are skipped
    min_prefix_len    = 0    # if 0 - the setting is off, > 0 - the minimum length of a word prefix (word beginning) to be indexed
    min_infix_len     = 3    # if 0 - the setting is off, > 0 - the minimum length of a word part (infix) to be indexed
}

And here is what some of the settings from the list above mean:

Prefixes: indexing prefixes allows you to run wildcard searches of the ‘wordstart*’ kind. If the minimum prefix length is set to a value greater than 0, the Indexer will index all the possible keyword prefixes (or, as we call them, word beginnings) in addition to the keyword itself.

For example, with a minimum prefix length of 3, Sphinx will add the extra ‘exa’, ‘exam’, ‘examp’ and ‘exampl’ prefixes to its index in addition to the keyword ‘example’ itself.

Note that prefixes shorter than the minimum allowed length will not be indexed.
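
As a rough illustration of which prefixes end up in the index, here is a minimal sketch. The function name is made up, and the code only mirrors the rule described above, not Sphinx internals.

```python
def prefixes(word, min_prefix_len):
    """List the prefixes of `word` that are at least `min_prefix_len` long.

    The full word is not included, since it is indexed in any case.
    """
    if min_prefix_len <= 0:  # 0 means prefix indexing is switched off
        return []
    return [word[:length] for length in range(min_prefix_len, len(word))]


print(prefixes("example", 3))  # ['exa', 'exam', 'examp', 'exampl']
```

Lowering the minimum length adds more entries per keyword, which is one reason this setting makes the index larger.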


Infixes: Sphinx is capable of including any infixes (aka word parts) in its index. E.g. with a minimum infix length of 2, indexing the keyword “test” will add its parts “te”, “es”, “st”, “tes” and “est” in addition to the main word.
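
The same idea can be sketched for infixes. Again, the function below is only an illustration of the rule with a made-up name, not the actual indexing code.

```python
def infixes(word, min_infix_len):
    """List the distinct parts of `word` that are at least `min_infix_len` long.

    The full word is not included, since it is indexed in any case.
    """
    if min_infix_len <= 0:  # 0 means infix indexing is switched off
        return []
    found = []
    for length in range(min_infix_len, len(word)):
        for start in range(len(word) - length + 1):
            part = word[start:start + length]
            if part not in found:
                found.append(part)
    return found


print(infixes("test", 2))  # ['te', 'es', 'st', 'tes', 'est']
```

The number of parts grows much faster with word length than the number of prefixes does, so infix indexing is the more expensive of the two settings.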

IMPORTANT! It’s not possible to enable these 2 settings at the same time. If you do, you’ll get a fatal error during indexation.

Also, enabling either of these 2 settings can significantly slow down indexation and search, especially when working with large data volumes.

3. Indexer

To configure the Indexer, you just need to set the memory limit that the Indexer is allowed to use.

indexer
{
    mem_limit   = 128M   # the maximum amount of memory the Indexer may use
}

4. Search Daemon

Here are the general Sphinx Search Daemon settings (supplied with the explanatory comments).

searchd
{
    listen           = 9312    # the port, used to connect to Sphinx 
    listen           = 9306:mysql41
    log              = /var/log/sphinx/searchsuite.yasha.web.ra/searchd.log    # Daemon log file
    query_log        = /var/log/sphinx/searchsuite.yasha.web.ra/query.log    # search log 
    read_timeout     = 5    # time (in seconds) the Daemon waits for a client request before closing the connection 

    max_children     = 30    # the maximum number of simultaneously processed queries; when set to 0, no limitation is applied 
    pid_file         = /var/run/sphinx/searchd.pid    # the file in which the Daemon process ID is stored 
    max_matches      = 1000
    seamless_rotate  = 1
    preopen_indexes  = 1
    unlink_old       = 1
    workers          = threads # for RT to work
    binlog_path      = /var/lib/sphinx/    # Binlog for crash recovery
}

Morphology

After the text is split into separate words, the morphology preprocessors come into play.

These mechanisms replace different forms of the same word with its basic, aka ‘normal’, form. This approach lets the search engine match a search query against all forms of the same word in the index.

When Sphinx morphology algorithms are enabled, the search engine returns the same search results for different forms of a word. E.g. the results are identical for both ‘laptop’ and ‘laptops’.

Sphinx supports 3 types of morphology preprocessors:

  • Stemmer
  • Lemmatizer
  • Phonetic algorithms

1. Stemmer

It’s the easiest and fastest morphology preprocessor. It lets the search engine find the word’s stem (a part of a word that remains unchanged for all its forms) without using any extra morphological dictionaries.

Basically, the Stemmer removes or replaces certain word suffixes and/or endings.

This morphology preprocessor works fine for most search queries. However, there are some exceptions. For instance, a stemmer cannot connect irregular forms, so ‘mouse’ and ‘mice’ will be treated as 2 separate words.

Also, the preprocessor can treat words that have different meanings but the same stem as identical.
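
To make the idea of suffix stripping more concrete, here is a deliberately naive sketch. It is not the algorithm Sphinx uses; the suffix list and the minimum stem length are assumptions chosen only for the demonstration.

```python
# Suffixes are tried in order. This short list is an illustrative assumption
# and is far smaller than the rule set of a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least `min_stem_len` letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for word in ("laptop", "laptops", "walked", "walking", "new", "news"):
    print(word, "->", toy_stem(word))
```

Here ‘laptop’ and ‘laptops’ share one stem, which is the desired effect, while ‘new’ and ‘news’ are merged as well even though they mean different things.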

To enable the Stemmer, add the following line to the Index section of the Config file:

morphology = stem_enru

2. Lemmatizer

Unlike the Stemmer, this morphology preprocessor uses morphological dictionaries, which lets the search engine strip the keyword down to its lemma. The lemma is a proper, natural language root word.

E.g. the search query ‘settings’ will be reduced to its infinitive form ‘set’.

To use the Lemmatizer, you need to download the morphological dictionaries. You can do that on the official website at sphinxsearch.com.

In the Indexer block of the Config file, you can find the lemmatizer_base option. This option lets you specify the path to the folder where you store all the dictionaries.

indexer
{
    ...
    lemmatizer_base = /var/lib/sphinx/data/dict/
}

When done, you need to set the morphology option to either the lemmatize_en or the lemmatize_en_all built-in value. In the latter case, Sphinx will apply the Lemmatizer and index all the possible root forms of a word.

3. Phonetic algorithms

At the moment, Sphinx supports 2 phonetic algorithms: Soundex and Metaphone. Currently, they both work for the English language only.

Basically, these algorithms substitute the words of the search query with specially crafted phonetic codes. This lets the search engine treat words that are spelled differently but sound alike as the same.

This kind of search can be of great help when searching by a customer’s first or last name.

To enable the phonetic algos, you need to specify the values of soundex or metaphone for the morphology option.

morphology = metaphone
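
To show what a phonetic code looks like, below is a minimal sketch of the classic American Soundex scheme. It is written from the textbook rules, and the codes Sphinx produces internally may differ in details, so treat it as an illustration only.

```python
# Letter groups that share a Soundex digit. Vowels, h, w and y have no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Return the 4-character classic Soundex code, e.g. 'Robert' -> 'R163'."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        if ch not in "hw":  # h and w do not separate letters with equal codes
            previous = digit
    return (code + "000")[:4]  # pad with zeros or cut to 4 characters


print(soundex("Smith"), soundex("Smyth"))    # S530 S530
print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Names that are spelled differently but sound alike get the same code, so a query for one of them also matches the other.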

Stop Words

The stopwords feature in Sphinx lets the search engine ignore certain keywords when creating an index and running searches.

All you need is to make a file with all your stop words, upload it to the server and set a path for Sphinx to find it.

When creating a list of stop words, it’s generally recommended to include the keywords that are mentioned in the text so frequently that they have no influence on search results. As a rule, these are articles, prepositions, conjunctions, etc.

With the help of the Indexer, it’s possible to build a frequency dictionary, i.e. a list of the index keywords sorted by frequency. You can do that using the --buildstops and --buildfreqs options.

stopwords = /var/lib/sphinx/data/stopwords.txt
stopwords = stopwords-ru.txt stopwords-en.txt
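
Conceptually, picking stop word candidates by frequency and then ignoring them looks like the sketch below. The sample texts and function names are made up, and this is not a replacement for the Indexer options mentioned above.

```python
from collections import Counter


def top_keywords(documents, limit):
    """Return the `limit` most frequent words across all documents."""
    counts = Counter(
        word for text in documents for word in text.lower().split()
    )
    return [word for word, _ in counts.most_common(limit)]


def remove_stopwords(text, stopwords):
    """Split the text into words and drop the ones listed as stop words."""
    return [word for word in text.lower().split() if word not in stopwords]


# Hypothetical sample data.
documents = [
    "the red bag and the blue bag",
    "the laptop and the charger",
    "the phone and the case",
]
stopwords = set(top_keywords(documents, 2))
print(sorted(stopwords))
print(remove_stopwords("the bag and the laptop", stopwords))
```

A frequency list only gives candidates. It is worth reviewing it by hand, because a frequent word such as a brand name can still matter for search results.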

Word Forms

The wordforms feature in Sphinx enables the search engine to deliver the same search results no matter which form of the search query is used. E.g. customers who are looking for ‘iphone 6’ or ‘i phone 6’ will get the same results.

This functionality comes in really handy if you need to define the normal word form in cases when the Stemmer can’t do it. Also, with a file that lists all the word forms, you will be able to easily set up a dictionary of search synonyms.

These dictionaries are used to normalize the search queries during indexation and when implementing search. Hence, to apply changes in the wordforms file, you need to run re-indexation.

An example of the file:

walks > walk
walked > walk
walking > walk

Note that starting with version 2.1.1, it’s possible to use “=>” instead of “>”. Starting with version 2.2.4, you can also use multiple destination tokens:

s02e02 => season 2 episode 2
s3 e3 => season 3 episode 3

wordforms = /var/lib/sphinx/data/wordforms.txt
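
The effect of such a file on a query can be sketched as follows. This is a simplified illustration with made-up function names: it only handles single-word source forms (so a line like ‘s3 e3 => ...’ is not covered), and it is not how Sphinx parses the file internally.

```python
def parse_wordforms(lines):
    """Parse 'source > dest' or 'source => dest' lines into a dictionary."""
    forms = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Check '=>' first, because it also contains '>'.
        separator = "=>" if "=>" in line else ">"
        source, _, destination = line.partition(separator)
        if not destination.strip():
            continue  # skip lines without a destination form
        forms[source.strip()] = destination.split()
    return forms


def normalize(text, forms):
    """Replace every word that has a listed form with its destination tokens."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(forms.get(word, [word]))
    return tokens


forms = parse_wordforms([
    "walks > walk",
    "walked > walk",
    "walking > walk",
    "s02e02 => season 2 episode 2",
])
print(normalize("She walked", forms))
print(normalize("watch s02e02", forms))
```

Because the same mapping has to be applied both to the indexed text and to the queries, changing the file requires re-indexation.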

Main Sphinx Commands

And finally, below you can find the list of the commands used for different operations with the search engine:

1. Editing Sphinx config file:
vi /etc/sphinx/searchsuite.yasha.web.ra/sphinx.conf

2. Indexing data from the targeted config sources:
sudo -usphinx indexer --config /etc/sphinx/searchsuite.yasha.web.ra/sphinx.conf --all --rotate

3. Launching the Search Daemon:
sudo -usphinx searchd --config /etc/sphinx/searchsuite.yasha.web.ra/sphinx.conf

4. Stopping the Search Daemon:
sudo -usphinx searchd --config /etc/sphinx/searchsuite.yasha.web.ra/sphinx.conf --stop

5. Checking whether the search engine is functioning correctly by querying the already created indexes (instead of ‘aviator’ you can use any other word):
sudo -usphinx search --config /etc/sphinx/searchsuite.yasha.web.ra/sphinx.conf aviator

Working with API

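// Load the Sphinx PHP API client from Magento's lib folder
include_once Mage::getBaseDir('lib') . DS . 'Sphinx' . DS . 'sphinxapi.php';
$instance = new SphinxClient();
$instance->SetServer('localhost', 9312);    // the host and the port of the Search Daemon
$instance->SetConnectTimeout(30);
$instance->SetArrayResult(true);    // return matches as a plain array
// Matches in data_index1 weigh the most, matches in data_index5 the least
$instance->SetFieldWeights(array('data_index1' => 5, 'data_index2' => 4, 'data_index3' => 3, 'data_index4' => 2, 'data_index5' => 1));
$instance->SetLimits(0, 1000, 1000);    // offset, limit, max matches
$instance->SetFilter('store_id', array(1, 0));    // only documents with store_id 1 or 0
// $queryText holds the search phrase entered by the customer
$result = $instance->Query('*' . $queryText . '*', 'catalogsearch_fulltext');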

Bottom Line

In this tutorial, I’ve tried to outline the main aspects of setting up and configuring Sphinx.

As you can see, by using this search engine, you can easily add a custom search to your Magento website.

Questions?

Feel free to leave a comment and I’ll get back to you. 🙂

Ellie is the Marketing Executive at Mageworx. Digital marketing expert by day, and a philomath by night, she can't help but share her knowledge and experience with the reader. eCommerce Allstars Podcast participant with over 70 authored articles online.
