{"id":5401,"date":"2016-05-30T11:17:05","date_gmt":"2016-05-30T11:17:05","guid":{"rendered":"https:\/\/blog.mageworx.com\/?p=5401"},"modified":"2022-05-16T12:10:40","modified_gmt":"2022-05-16T12:10:40","slug":"sphinx-the-beginners-guide","status":"publish","type":"post","link":"https:\/\/www.mageworx.com\/blog\/sphinx-the-beginners-guide","title":{"rendered":"Sphinx &#8211; the Beginner&#8217;s Guide"},"content":{"rendered":"\n<span class=\"span-reading-time rt-reading-time\" style=\"display: block;\"><span class=\"rt-label rt-prefix\">Reading Time: <\/span> <span class=\"rt-time\"> 7<\/span> <span class=\"rt-label rt-postfix\">minutes<\/span><\/span><p>These days, hardly anyone searches an online store by wandering through categories or scrolling down long lists of products.<\/p>\n<p>There are plenty of <em>onsite search tools<\/em> that can make internal site search fast, intuitive and tailored to your customers\u2019 needs.<\/p>\n<p>In this series of articles we are going to review the functionality of the most popular eCommerce onsite search solutions. The first search toolkit on the list is <a href=\"http:\/\/sphinxsearch.com\"><strong>Sphinx<\/strong><\/a>.<\/p>\n<h2>What is Sphinx?<\/h2>\n<p>Sphinx is an open source search engine with <em>fast full-text search capabilities<\/em>.<\/p>\n<p>High indexing speed, flexible search capabilities, integration with the most popular database management systems (e.g. MySQL, PostgreSQL) and support for various programming language APIs (e.g. 
for PHP, Python, Java, Perl, Ruby, .NET and C++). All of this makes the search engine popular with thousands of eCommerce developers and merchants.<\/p>\n<p>This is what makes Sphinx stand out:<!--more--><\/p>\n<ul>\n<li>high indexing performance (up to 10-15 MB\/s per core)<\/li>\n<li>high search performance (up to 150-250 queries per second per core against 1,000,000 documents)<\/li>\n<li>high scalability (the biggest known cluster indexes over 3,000,000,000 documents, and the busiest one handles more than 50 million queries per day)<\/li>\n<li>support for distributed real-time search<\/li>\n<li>support for several full-text fields per document (up to 32 by default)<\/li>\n<li>support for a number of extra attributes for every document (e.g. groups, timestamps, etc.)<\/li>\n<li>support for stop words<\/li>\n<li>support for both single-byte encodings and UTF-8<\/li>\n<li>support for morphological search<\/li>\n<li>and dozens more<\/li>\n<\/ul>\n<p>All in all, Sphinx has more than 50 different features (and this number is constantly growing). <a href=\"http:\/\/sphinxsearch.com\/docs\/\">Follow this link<\/a> for an overview of the search engine\u2019s functionality.<\/p>\n<h2>How Sphinx Works<\/h2>\n<p>The way the search engine works can be summed up in 2 key points:<\/p>\n<ul>\n<li>using the <em>source table<\/em>, Sphinx creates its own index database<\/li>\n<li>next, when you send an API query, Sphinx returns an array of IDs that correspond to those in the source table.<\/li>\n<\/ul>\n<h2>Installing Sphinx on a Server<\/h2>\n<p>The installation procedure is pretty easy. 
Follow the links below for step-by-step installation instructions on:<\/p>\n<ul>\n<li><a href=\"http:\/\/sphinxsearch.com\/docs\/latest\/installing-debian.html\">Debian and Ubuntu<\/a><\/li>\n<li><a href=\"http:\/\/sphinxsearch.com\/docs\/latest\/installing-redhat.html\">RedHat and CentOS<\/a><\/li>\n<li>and <a href=\"http:\/\/sphinxsearch.com\/docs\/latest\/installing-windows.html\">Windows<\/a><\/li>\n<\/ul>\n<p>Here is an example of installing the search engine on CentOS:<\/p>\n<pre class=\"theme:github font:courier-new font-size:16 line-height:18 lang:default decode:true\">wget http:\/\/sphinxsearch.com\/files\/sphinx-2.1.6-1.rhel6.x86_64.rpm\nyum localinstall sphinx-2.1.6-1.rhel6.x86_64.rpm<\/pre>\n<p>When the installation is complete, Sphinx creates a default config file. In the standard scenario its path is:<\/p>\n<p><em>\/etc\/sphinx\/sphinx.conf<\/em><\/p>\n<p>If you are going to use Sphinx for several projects at once, it\u2019s generally advised to create a separate folder for each project\u2019s config file, index and logs.<\/p>\n<p>E.g.<\/p>\n<p>Config path &#8211; <em>\/etc\/sphinx\/searchsuite.yasha.web.ra\/<\/em><br \/>\nIndex path\u00a0 &#8211; <em>\/var\/lib\/sphinx\/searchsuite.yasha.web.ra\/<\/em><br \/>\nLogs path\u00a0 &#8211; <em>\/var\/log\/sphinx\/searchsuite.yasha.web.ra\/<\/em><\/p>\n<h2>Configuring Sphinx.conf File<\/h2>\n<p>The Sphinx config file consists of 4 sections:<\/p>\n<ul>\n<li>Data Source<\/li>\n<li>Index<\/li>\n<li>Indexer<\/li>\n<li>Search Daemon<\/li>\n<\/ul>\n<p>Here is how you can configure each of them:<\/p>\n<h3>1. 
Data Source<\/h3>\n<pre class=\"theme:github font:courier-new font-size:16 line-height:18 lang:default decode:true\">source catalogsearch_fulltext # catalogsearch_fulltext - the name of the source\n{\n    type               = mysql    # the type of the database Sphinx connects to\n    sql_host           =               # the database host\n    sql_user           =               # the database user\n    sql_pass           =              # the database password\n    sql_db             = yasha_searchsuite    # the name of the database\n    sql_port           = 3306  # optional, default is 3306; the port used to connect to the database\n    sql_sock           = \/var\/lib\/mysql\/mysql.sock    # the UNIX socket used to connect to a local database (if necessary)\n\n    # the main query that fetches the documents to index; the first column must be the unique document ID,\n    # and every attribute declared below must be selected here\n    sql_query          = SELECT fulltext_id, product_id, store_id, data_index1, data_index2, data_index3, data_index4, data_index5 FROM catalogsearch_fulltext\n\n    sql_attr_uint      = fulltext_id    # sql_attr_*: the attributes that are stored in the index and returned with search results\n    sql_attr_uint      = product_id\n    sql_attr_uint      = store_id\n    sql_field_string   = data_index1    # sql_field_*: the fields that should be indexed for full-text search\n    ...\n    sql_field_string   = data_index5\n\n    sql_query_info     = SELECT * FROM catalogsearch_fulltext WHERE fulltext_id=$id  # used by the command-line search utility to fetch and display document details\n}\n<\/pre>\n<h3><strong>2. 
Index<\/strong><\/h3>\n<pre class=\"theme:github font:courier-new font-size:16 line-height:18 lang:default decode:true\">index catalogsearch_fulltext\n{\n    source            = catalogsearch_fulltext    # the data source\n    path              = \/var\/lib\/sphinx\/searchsuite.yasha.web.ra\/catalogsearch_fulltext    # the path (including the base file name) where the index files are stored\n    docinfo           = extern\n    charset_type      = utf-8\n\n    min_word_len      = 3    # the minimum length of a word to be indexed; shorter words are skipped\n    min_prefix_len    = 0    # 0 - prefix indexing is off; &gt; 0 - the minimum length of a word prefix to index\n    min_infix_len     = 3    # 0 - infix indexing is off; &gt; 0 - the minimum length of a word infix to index\n}<\/pre>\n<p>Here is what some of the settings above mean:<\/p>\n<p><strong>Prefixes<\/strong>: indexing prefixes lets you run wildcard searches with \u2018<em>wordstart*<\/em>\u2019 patterns. If the minimum prefix length is set to a value &gt; 0, the Indexer will index all the possible keyword prefixes (word beginnings) of at least that length in addition to the keyword itself.<\/p>\n<p>E.g. with <em>min_prefix_len = 3<\/em>, in addition to the keyword \u2018example\u2019 itself, Sphinx will add the extra prefixes <em>\u2018exa\u2019, \u2018exam\u2019, \u2018examp\u2019, \u2018exampl\u2019<\/em> to its index.<\/p>\n<p>Note that prefixes shorter than the minimum allowed length will not be indexed.<\/p>\n<p><strong>Infixes<\/strong>: Sphinx is also capable of indexing infixes (aka word parts). E.g. with <em>min_infix_len = 2<\/em>, indexing the keyword \u201ctest\u201d will add its parts \u201cte\u201d, \u201ces\u201d, \u201cst\u201d, \u201ctes\u201d, \u201cest\u201d in addition to the word itself.<\/p>\n<p><strong>IMPORTANT!<\/strong> It\u2019s not possible to enable these 2 settings at the same time. 
If you do, you\u2019ll get a fatal error during indexing.<\/p>\n<p>Also, enabling either of these 2 settings can significantly slow down indexing and search, especially when working with large data volumes.<\/p>\n<h3>3. Indexer<\/h3>\n<p>To configure the Indexer, you just need to set the maximum amount of memory it is allowed to use.<\/p>\n<pre class=\"theme:github font:courier-new font-size:16 line-height:18 lang:default decode:true\">indexer\n{\n    mem_limit   = 128M   # the maximum amount of RAM the indexer may use\n}<\/pre>\n<h3>4. Search Daemon<\/h3>\n<p>Here are the general Sphinx Search Daemon settings (with explanatory comments).<\/p>\n<pre class=\"theme:github font:courier-new font-size:16 line-height:18 lang:default decode:true\">searchd\n{\n    listen           = 9312    # the port used to connect to Sphinx via the native API\n    listen           = 9306:mysql41    # the port used to connect to Sphinx via the MySQL protocol\n    log              = \/var\/log\/sphinx\/searchsuite.yasha.web.ra\/searchd.log    # the daemon log file\n    query_log        = \/var\/log\/sphinx\/searchsuite.yasha.web.ra\/query.log    # the search query log\n    read_timeout     = 5    # the time (in seconds) the daemon waits for a client to send its request\n\n    max_children     = 30    # the maximum number of simultaneously processed queries; when set to 0, no limit is applied\n    pid_file         = \/var\/run\/sphinx\/searchd.pid    # the file where the searchd process ID is stored\n    max_matches      = 1000\n    seamless_rotate  = 1\n    preopen_indexes  = 1\n    unlink_old       = 1\n    workers          = threads # for RT to work\n    binlog_path      = \/var\/lib\/sphinx\/    # binlog for crash recovery\n}\n<\/pre>\n<h2>Morphology<\/h2>\n<p>After the text is split into separate words, the morphology preprocessors come into play.<\/p>\n<p>These mechanisms can replace different forms of the same word with the basic, aka \u2018<em>normal<\/em>\u2019 one. 
This approach lets the search engine match a search query against all forms of the same word in the index.<\/p>\n<p>When the Sphinx morphology algorithms are enabled, the search engine returns the same search results for different forms of a word. E.g. the results will be identical for both \u2018<em>laptop<\/em>\u2019 and \u2018<em>laptops<\/em>\u2019.<\/p>\n<p>Sphinx supports <em>3 types of morphology preprocessors<\/em>:<\/p>\n<ul>\n<li>Stemmer<\/li>\n<li>Lemmatizer<\/li>\n<li>phonetic algorithms<\/li>\n<\/ul>\n<h3>1. Stemmer<\/h3>\n<p>It\u2019s the simplest and fastest morphology preprocessor. It lets the search engine find the word\u2019s stem (the part of a word that remains unchanged across all its forms) without using any extra morphological dictionaries.<\/p>\n<p>Basically, the Stemmer removes or replaces certain word suffixes and\/or endings.<\/p>\n<p>This morphology preprocessor works fine for most search queries. However, there are some exceptions. For instance, with this method, \u2018set\u2019 and \u2018setting\u2019 will be treated as 2 separate queries.<\/p>\n<p>Also, the preprocessor can treat words that have different meanings but the same stem as identical.<\/p>\n<p>To enable the Stemmer (here, for both English and Russian), add the following line to the Index section:<\/p>\n<pre class=\"theme:github font:courier-new font-size:16 line-height:18 lang:default decode:true\">morphology = stem_enru<\/pre>\n<h3>2. Lemmatizer<\/h3>\n<p>Unlike the Stemmer, this morphology preprocessor uses morphological dictionaries, which lets the search engine reduce a keyword to its lemma. The lemma is a proper, natural language root word.<\/p>\n<p>E.g. the search query \u2018settings\u2019 will be reduced to its base form \u2018set\u2019.<\/p>\n<p>To use the Lemmatizer, you need to download the morphological dictionaries. 
You can do that on the official website at <a href=\"http:\/\/sphinxsearch.com\">sphinxsearch.com<\/a>.<\/p>\n<p>In the <em>Indexer<\/em> block of the config file you can find the <em>lemmatizer_base<\/em> option. This option lets you specify the path to the folder where you store all the dictionaries.<\/p>\n<pre class=\"theme:github font:courier-new font-size:16 line-height:18 lang:default decode:true\">indexer\n{\n    ...\n    lemmatizer_base = \/var\/lib\/sphinx\/data\/dict\/\n}<\/pre>\n<p>When done, set the <em>morphology<\/em> option in the Index section to either the <em>lemmatize_en<\/em> or the <em>lemmatize_en_all<\/em> built-in value. In the latter case, Sphinx will apply the Lemmatizer and index all the possible root forms of a word.<\/p>\n<h3>3. Phonetic algorithms<\/h3>\n<p>At the moment, Sphinx supports 2 phonetic algorithms: <strong>Soundex<\/strong> and <strong>Metaphone<\/strong>.<br \/>\nCurrently, they both work for the English language only.<\/p>\n<p>Basically, these algorithms replace the words of the search query with specially crafted phonetic codes. This lets the search engine treat words that are different but phonetically close as the same.<\/p>\n<p>This kind of search can be of great help when searching by a customer\u2019s first or last name.<\/p>\n<p>To enable a phonetic algorithm, set the morphology option to either soundex or metaphone:<\/p>\n<p><em>morphology = metaphone<\/em><\/p>\n<h2>Stop Words<\/h2>\n<p>The <em>stopwords<\/em> feature in Sphinx lets the search engine ignore certain keywords when building an index and running searches.<\/p>\n<p>All you need to do is create a file with all your stop words, upload it to the server and tell Sphinx where to find it.<\/p>\n<p>When creating a list of stop words, it\u2019s generally recommended to include the keywords that occur so frequently in the text that they have no influence on search results. 
As a rule, these are: <em>articles, prepositions, conjunctions,<\/em> etc.<\/p>\n<p>With the help of the Indexer it\u2019s possible to build a <em>frequency dictionary<\/em> of an index, where all the keywords are sorted by frequency. You can do that using the <em>--buildstops<\/em> and <em>--buildfreqs<\/em> indexer options. Then point the <em>stopwords<\/em> option of the index at one or several files:<\/p>\n<pre class=\"theme:github font:courier-new font-size:16 line-height:18 lang:default decode:true\">stopwords = \/var\/lib\/sphinx\/data\/stopwords.txt\nstopwords = stopwords-ru.txt stopwords-en.txt<\/pre>\n<h2>Word Forms<\/h2>\n<p>The wordforms feature in Sphinx enables the search engine to deliver the same search results no matter which form of a word is used in the search query. E.g. customers who are looking for \u2018iphone 6\u2019 or \u2018i phone 6\u2019 will get the same results.<\/p>\n<p>This functionality comes in really handy if you need to define the normal word form in cases when the Stemmer can&#8217;t do it. Also, with a file of word forms, you can easily set up a dictionary of search synonyms.<\/p>\n<p>These dictionaries are used to normalize words both during indexing and when searching. Hence, to apply changes made to the wordforms file, you need to re-index.<\/p>\n<p>An example of the file:<\/p>\n<p><em>walks &gt; walk<\/em><br \/>\n<em>walked &gt; walk<\/em><br \/>\n<em>walking &gt; walk<\/em><\/p>\n<p>Note that starting with version 2.1.1, it\u2019s possible to use \u201c=&gt;\u201d instead of \u201c&gt;\u201d. 
Starting with version 2.2.4 you can also use multiple destination tokens:<\/p>\n<p><em>s02e02 =&gt; season 2 episode 2<\/em><br \/>\n<em>s3 e3 =&gt; season 3 episode 3<\/em><\/p>\n<p>To enable word forms, specify the path to the file in the Index section:<\/p>\n<pre class=\"theme:github font:courier-new font-size:16 line-height:18 lang:default decode:true\">wordforms = \/var\/lib\/sphinx\/data\/wordforms.txt<\/pre>\n<h2>Main Sphinx Commands<\/h2>\n<p>And finally, below you can find the list of the commands used for different operations with the search engine:<\/p>\n<p>1. <a href=\"https:\/\/www.mageworx.com\/magento2-order-editor-extension.html\">Editing<\/a> the Sphinx config file:<br \/>\n<em>vi \/etc\/sphinx\/searchsuite.yasha.web.ra\/sphinx.conf<\/em><\/p>\n<p>2. Indexing data from all the sources defined in the config:<br \/>\n<em>sudo -usphinx indexer --config \/etc\/sphinx\/searchsuite.yasha.web.ra\/sphinx.conf --all --rotate<\/em><\/p>\n<p>3. Launching the Search Daemon:<br \/>\n<em>sudo -usphinx searchd --config \/etc\/sphinx\/searchsuite.yasha.web.ra\/sphinx.conf<\/em><\/p>\n<p>4. Stopping the Search Daemon:<br \/>\n<em>sudo -usphinx searchd --config \/etc\/sphinx\/searchsuite.yasha.web.ra\/sphinx.conf --stop<\/em><\/p>\n<p>5. Checking whether the search engine is functioning correctly (by querying the already created indexes):<br \/>\n<em>sudo -usphinx search --config \/etc\/sphinx\/searchsuite.yasha.web.ra\/sphinx.conf aviator<\/em> (instead of &#8216;aviator&#8217; you can use any other word).<\/p>\n<h2>Working with API<\/h2>\n<pre class=\"theme:github font:courier-new font-size:16 line-height:18 lang:default decode:true \">include_once Mage::getBaseDir('lib') . DS . 'Sphinx' . DS . 
'sphinxapi.php';\n\n\/\/ connect to the search daemon on its native API port\n$instance = new SphinxClient();\n$instance-&gt;SetServer('localhost', 9312);\n$instance-&gt;SetConnectTimeout(30);\n\/\/ return matches as a plain array instead of a hash keyed by document ID\n$instance-&gt;SetArrayResult(true);\n\/\/ matches in fields with a higher weight contribute more to relevance\n$instance-&gt;SetFieldWeights(array('data_index1' =&gt; 5, 'data_index2' =&gt; 4, 'data_index3' =&gt; 3, 'data_index4' =&gt; 2, 'data_index5' =&gt; 1));\n\/\/ offset 0, up to 1000 results returned, at most 1000 matches kept\n$instance-&gt;SetLimits(0, 1000, 1000);\n\/\/ only return documents whose store_id is 1 or 0\n$instance-&gt;SetFilter('store_id', array(1, 0));\n$result = $instance-&gt;Query('*'.$queryText.'*', 'catalogsearch_fulltext');\n<\/pre>\n<h2>Bottom Line<\/h2>\n<p>In this tutorial, I\u2019ve tried to outline the main aspects of setting up and configuring Sphinx.<\/p>\n<p>As you can see, by using this search engine, you can easily add a custom search to your Magento website.<\/p>\n<p><em>Questions?<\/em><\/p>\n<p>Feel free to leave a comment and I\u2019ll get back to you. \ud83d\ude42<\/p>\n","protected":false},"excerpt":{"rendered":"<p><span class=\"span-reading-time rt-reading-time\" style=\"display: block;\"><span class=\"rt-label rt-prefix\">Reading Time: <\/span> <span class=\"rt-time\"> 7<\/span> <span class=\"rt-label rt-postfix\">minutes<\/span><\/span>These days, hardly anyone is searching an online store by rambling among the categories or scrolling down the long lists of products. There is a bunch of available onsite search tools that can make an internal site search fast, intuitive and adjusted to any customer needs. 
In this series of articles we are going to [&hellip;]<\/p>\n","protected":false},"author":27,"featured_media":6048,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[255,425],"tags":[292],"class_list":{"0":"post-5401","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-magento-2","8":"category-magento-how-tos","9":"tag-sphinx"},"_links":{"self":[{"href":"https:\/\/www.mageworx.com\/blog\/wp-json\/wp\/v2\/posts\/5401","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.mageworx.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.mageworx.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.mageworx.com\/blog\/wp-json\/wp\/v2\/users\/27"}],"replies":[{"embeddable":true,"href":"https:\/\/www.mageworx.com\/blog\/wp-json\/wp\/v2\/comments?post=5401"}],"version-history":[{"count":12,"href":"https:\/\/www.mageworx.com\/blog\/wp-json\/wp\/v2\/posts\/5401\/revisions"}],"predecessor-version":[{"id":16009,"href":"https:\/\/www.mageworx.com\/blog\/wp-json\/wp\/v2\/posts\/5401\/revisions\/16009"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.mageworx.com\/blog\/wp-json\/wp\/v2\/media\/6048"}],"wp:attachment":[{"href":"https:\/\/www.mageworx.com\/blog\/wp-json\/wp\/v2\/media?parent=5401"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.mageworx.com\/blog\/wp-json\/wp\/v2\/categories?post=5401"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.mageworx.com\/blog\/wp-json\/wp\/v2\/tags?post=5401"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}