Rectangle 27 58

I am surprised that there isn't more information posted about Solr. Solr is quite similar to Sphinx but has more advanced features (AFAIK as I haven't used Sphinx -- only read about it).

Solr also provides the following additional features:

  • Multiple cores (think of these as separate databases with their own configuration and own indexes)
  • Highlighting of keywords (fairly easy to do in application code if you have regex-fu; however, why not let a specialized tool do a better job for you)
  • Update index via XML or delimited file
  • Communicate with the search server via HTTP (it can even return Json, Native PHP/Ruby/Python)
  • Index directly from the database with custom queries
  • Fast indexing (compare to MySQL full-text search indexing times) -- Lucene uses a binary inverted index format.
  • Boosting (custom rules for increasing relevance of a particular keyword or phrase, etc.)
  • Fielded searches (if a search user knows the field he/she wants to search, they narrow down their search by typing the field, then the value, and ONLY that field is searched rather than everything -- much better user experience)

BTW, there are tons more features; however, I've listed just the features that I have actually used in production. BTW, out of the box, MySQL supports #1, #3, and #11 (limited) on the list above. For the features you are looking for, a relational database isn't going to cut it. I'd eliminate those straight away.

Also, another benefit is that Solr (well, Lucene actually) is a document database (e.g. NoSQL) so many of the benefits of any other document database can be realized with Solr. In other words, you can use it for more than just search (i.e. Performance). Get creative with it :)

Sphinx too about Supports replication Multiple cores Boolean searches Highlighting of keywords Update index via XML -or delimited file- PDF, Word document indexing (via xml) Facets Stop words, synonyms, etc. Index directly from the database with custom queries Auto-suggest Fast indexing Boosting Fielded searches About Dynamic fields Aggregate fields Cache Autowarming I just don't know

Comparison of full text search engine - Lucene, Sphinx, Postgresql, My...

mysql postgresql full-text-search lucene sphinx
Rectangle 27 157

Good to see someone's chimed in about Lucene - because I've no idea about that.

Sphinx, on the other hand, I know quite well, so let's see if I can be of some help.

  • Result relevance ranking is the default. You can set up your own sorting should you wish, and give specific fields higher weightings.
  • Indexing speed is super-fast, because it talks directly to the database. Any slowness will come from complex SQL queries and un-indexed foreign keys and other such problems. I've never noticed any slowness in searching either.
  • I'm a Rails guy, so I've no idea how easy it is to implement with Django. There is a Python API that comes with the Sphinx source though.
  • The search service daemon (searchd) is pretty low on memory usage - and you can set limits on how much memory the indexer process uses too.
  • Scalability is where my knowledge is more sketchy - but it's easy enough to copy index files to multiple machines and run several searchd daemons. The general impression I get from others though is that it's pretty damn good under high load, so scaling it out across multiple machines isn't something that needs to be dealt with.
  • There's no support for 'did-you-mean', etc - although these can be done with other tools easily enough. Sphinx does stem words though using dictionaries, so 'driving' and 'drive' (for example) would be considered the same in searches.
  • Sphinx doesn't allow partial index updates for field data though. The common approach to this is to maintain a delta index with all the recent changes, and re-index this after every change (and those new results appear within a second or two). Because of the small amount of data, this can take a matter of seconds. You will still need to re-index the main dataset regularly though (although how regularly depends on the volatility of your data - every day? every hour?). The fast indexing speeds keep this all pretty painless though.

I've no idea how applicable to your situation this is, but Evan Weaver compared a few of the common Rails search options (Sphinx, Ferret (a port of Lucene for Ruby) and Solr), running some benchmarks. Could be useful, I guess.

I've not plumbed the depths of MySQL's full-text search, but I know it doesn't compete speed-wise nor feature-wise with Sphinx, Lucene or Solr.

Sphinx does allow you to update individual attributes of items in current indexes, but not remove/update full records.

sphinx RT allows you to do partial updates/removals. it is in early stage but already [almost] works. sphinxsearch.com/wiki/doku.php?id=rt_tutorial

Nothing can match Sphinx for speed, so if speed is your number one concern then Sphinx is the option to go for. Nice post

Comparison of full text search engine - Lucene, Sphinx, Postgresql, My...

mysql postgresql full-text-search lucene sphinx
Rectangle 27 79

I don't know Sphinx, but as for Lucene vs a database full-text search, I think that Lucene performance is unmatched. You should be able to do almost any search in less than 10 ms, no matter how many records you have to search, provided that you have set up your Lucene index correctly.

Here comes the biggest hurdle though: personally, I think integrating Lucene in your project is not easy. Sure, it is not too hard to set it up so you can do some basic search, but if you want to get the most out of it, with optimal performance, then you definitely need a good book about Lucene.

As for CPU & RAM requirements, performing a search in Lucene doesn't task your CPU too much, though indexing your data is, although you don't do that too often (maybe once or twice a day), so that isn't much of a hurdle.

It doesn't answer all of your questions but in short, if you have a lot of data to search, and you want great performance, then I think Lucene is definitely the way to go. If you're not going to have that much data to search, then you might as well go for a database full-text search. Setting up a MySQL full-text search is definitely easier in my book.

Compare to sphinx , lucence is tooo slow and bulky. I had used both in my project and i finally sticked to sphinx. Lucence is in java , and it takes a lot more CPU and RAM than Sphinx.

I have to disagree here. Lucene is lightning fast IF you build a correct index. You can basically do an advanced query over millions of records in just a couple of milliseconds. You just need to know what you are doing. And Lucene is in java... your point being? There's also .NET port, Lucene.NET btw.

but you clearly stated that you don't use sphinx, and v3sson has used both.

how can you state that lucene's performance is unmatched in the same sentence that you state you haven't used sphinx?

Valid questions. I never said that Lucene is faster than Sphinx, I mentioned that Lucene vs a database full-text search is unmatched. And it is. No question about that. Lucene is based upon an inverted index. Now I don't know Sphinx, as mentioned before, but if it also uses an inverted index or a similar indexing method then it is possible that they are equally performing. Stating that Lucene, compared to Sphinx, would be 'tooo slow and bulky' is not based upon facts. Especially not when it is only said that Lucene is in 'Java', which is just a ridiculous non-issue in terms of performance.

Comparison of full text search engine - Lucene, Sphinx, Postgresql, My...

mysql postgresql full-text-search lucene sphinx
Rectangle 27 48

Start the Jetty Application Server

Apart from answering OP's queries, Let me throw some insights on Apache Solr from simple introduction to detailed installation and implementation.

The boost helps you rank your results show up on top. Say, you're trying to search for a name john in the fields firstname and lastname, and you want to give relevancy to the firstname field, then you need to boost up the firstname field as shown.

http://localhost:8983/solr/collection1/select?q=firstname:john^2&lastname:john

As you can see, firstname field is boosted up with a score of 2.

The speed is unbelievably fast and no compromise on that. The reason I moved to Solr.

Regarding the indexing speed, Solr can also handle JOINS from your database tables. A higher and complex JOIN do affect the indexing speed. However, an enormous RAM config can easily tackle this situation.

The higher the RAM, The faster the indexing speed of Solr is.

  • ease of use and ease of integration with Django

Never attempted to integrate Solr and Django, however you can achieve to do that with Haystack. I found some interesting article on the same and here's the github for it.

  • resource requirements - site will be hosted on a VPS, so ideally the search engine wouldn't require a lot of RAM and CPU

Solr breeds on RAM, so if the RAM is high, you don't to have to worry about Solr.

Solr's RAM usage shoots up on full-indexing if you have some billion records, you could smartly make use of Delta imports to tackle this situation. As explained, Solr is only a near real-time solution.

Solr is highly scalable. Have a look on SolrCloud. Some key features of it.

  • Shards (or sharding is the concept of distributing the index among multiple machines, say if your index has grown too large)
  • Load Balancing (if Solrj is used with Solr cloud it automatically takes care of load-balancing using it's Round-Robin mechanism)
  • extra features such as "did you mean?", related searches, etc

For the above scenario, you could use the SpellCheckComponent that is packed up with Solr. There are a lot other features, The SnowballPorterFilterFactory helps to retrieve records say if you typed, books instead of book, you will be presented with results related to book.

Assuming that you are under LINUX environment, you could proceed to this article further. (mine was an Ubuntu 14.04 version)

Download Apache Solr from here. That would be version is 4.8.1. You could download new versions, I found this stable.

After downloading the archive , extract it to a folder of your choice. Say .. Downloads or whatever.. So it will look like Downloads/solr-4.8.1/

shankar@shankar-lenovo: cd Downloads/solr-4.8.1

Jetty is available inside the examples folder of the solr-4.8.1 directory , so navigate inside that and start the Jetty Application Server.

shankar@shankar-lenovo:~/Downloads/solr-4.8.1/example$ java -jar start.jar

Now , do not close the terminal , minimize it and let it stay aside.

( TIP : Use & after start.jar to make the Jetty Server run in the background )

To check if Apache Solr runs successfully, visit this URL on the browser. http://localhost:8983/solr

It runs on the port 8983 as default. You could change the port either here or directly inside the jetty.xml file.

java -Djetty.port=9091 -jar start.jar

This JAR file acts as a bridge between MySQL and JDBC , Download the Platform Independent Version here

After downloading it, extract the folder and copy themysql-connector-java-5.1.31-bin.jar and paste it to the lib directory.

shankar@shankar-lenovo:~/Downloads/solr-4.8.1/contrib/dataimporthandler/lib

To put Solr to use, You need to have some tables and data to search for. For that, we will use MySQL for creating a table and pushing some random names and then we could use Solr to connect to MySQL and index that table and it's entries.

CREATE TABLE test_solr_mysql
 (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  name VARCHAR(45) NULL,
  created TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (id)
 );
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jean');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jack');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jason');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Vego');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Grunt');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jasper');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Fred');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jenna');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Rebecca');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Roland');
shankar@shankar-lenovo: ~/Downloads/solr-4.8.1/example/solr/collection1/conf
<lib dir="../../../contrib/dataimporthandler/lib/" regex=".*\.jar" />
  <lib dir="../../../dist/" regex="solr-dataimporthandler-\d.*\.jar" />

Now add the DIH (Data Import Handler)

<requestHandler name="/dataimport" 
  class="org.apache.solr.handler.dataimport.DataImportHandler" >
    <lst name="defaults">
      <str name="config">db-data-config.xml</str>
    </lst>
</requestHandler>

If the file exists then ignore, add these lines to that file. As you can see the first line, you need to provide the credentials of your MySQL database. The Database name, username and password.

<dataConfig>
    <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/yourdbname" user="dbuser" password="dbpass"/>
    <document>
   <entity name="test_solr" query="select CONCAT('test_solr-',id) as rid,name from test_solr_mysql WHERE '${dataimporter.request.clean}' != 'false'
      OR `created` > '${dataimporter.last_index_time}'" >
    <field name="id" column="rid" />
    <field name="solr_name" column="name" />
    </entity>
   </document>
</dataConfig>

( TIP : You can have any number of entities but watch out for id field, if they are same then indexing will skipped. )

Add this to your schema.xml as shown..

<uniqueKey>id</uniqueKey>
<field name="solr_name" type="string" indexed="true" stored="true" />

This is where the real deal is. You need to do the indexing of data from MySQL to Solr inorder to make use of Solr Queries.

Hit the URL http://localhost:8983/solr on your browser. The screen opens like this.

As the marker indicates, go to Logging inorder to check if any of the above configuration has led to errors.

Ok so now you are here, As you can there are a lot of yellow messages (WARNINGS). Make sure you don't have error messages marked in red. Earlier, on our configuration we had added a select query on our db-data-config.xml, say if there were any errors on that query, it would have shown up here.

Fine, no errors. We are good to go. Let's choose collection1 from the list as depicted and select Dataimport

Using the DIH, you will be connecting to MySQL from Solr through the configuration file db-data-config.xml from the Solr interface and retrieve the 10 records from the database which gets indexed onto Solr.

To do that, Choose full-import , and check the options Clean and Commit. Now click Execute as shown.

Alternatively, you could use a direct full-import query like this too..

http://localhost:8983/solr/collection1/dataimport?command=full-import&commit=true

After you clicked Execute, Solr begins to index the records, if there were any errors, it would say Indexing Failed and you have to go back to the Logging section to see what has gone wrong.

Assuming there are no errors with this configuration and if the indexing is successfully complete., you would get this notification.

Seems like everything went well, now you could use Solr Queries to query the data that was indexed. Click the Query on the left and then press Execute button on the bottom.

You will see the indexed records as shown.

The corresponding Solr query for listing all the records is

http://localhost:8983/solr/collection1/select?q=*:*&wt=json&indent=true

Well, there goes all 10 indexed records. Say, we need only names starting with Ja , in this case, you need to target the column name solr_name, Hence your query goes like this.

http://localhost:8983/solr/collection1/select?q=solr_name:Ja*&wt=json&indent=true

That's how you write Solr Queries. To read more about it, Check this beautiful article.

this is one of the most comprehensive and well organized posts I've seen on SO. Awesome job.

Comparison of full text search engine - Lucene, Sphinx, Postgresql, My...

mysql postgresql full-text-search lucene sphinx
Rectangle 27 25

I'm looking at PostgreSQL full-text search right now, and it has all the right features of a modern search engine, really good extended character and multilingual support, nice tight integration with text fields in the database.

But it doesn't have user-friendly search operators like + or AND (uses & | !) and I'm not thrilled with how it works on their documentation site. While it has bolding of match terms in the results snippets, the default algorithm for which match terms is not great. Also, if you want to index rtf, PDF, MS Office, you have to find and integrate a file format converter.

OTOH, it's way better than the MySQL text search, which doesn't even index words of three letters or fewer. It's the default for the MediaWiki search, and I really think it's no good for end-users: http://www.searchtools.com/analysis/mediawiki-search/

In all cases I've seen, Lucene/Solr and Sphinx are really great. They're solid code and have evolved with significant improvements in usability, so the tools are all there to make search that satisfies almost everyone.

for SHAILI - SOLR includes the Lucene search code library and has the components to be a nice stand-alone search engine.

I believe that by PostgreSQL full-text search you're referring to Tsearch. But Tsearch does not support phrase search. It's still on their TODO list sai.msu.su/~megera/wiki/FTS_Todo.

Just done a bunch of testing on Postgres 9.0 full text search; was disappointed to find that French text isn't matched if the user forgets to get all the accents right. Matching of word forms is patchy - for example, in English "say" doesn't match text containing "said". Overall fairly impressive though for an integrated feature across the languages tested (en, fr, ru).

"OTOH, it's way better than the MySQL text search, which doesn't even index words of three letters or fewer." That's not a built-in restriction of MySQL -- it's whatever you set in the config file. If you want to index one-letter words, just change one value in the config.

For phrase searches see the "synonym" dictionary feature. That's an odd name for a feature which allows phrase searches to work well, but I had not trouble setting up a dictionary of legal terms that way.

Comparison of full text search engine - Lucene, Sphinx, Postgresql, My...

mysql postgresql full-text-search lucene sphinx
Rectangle 27 17

Elasticsearch is a search server based on Lucene. It provides a distributed, multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents. Elasticsearch is developed in Java and is released as open source under the terms of the Apache License.

We are using this search engine at our project and very happy with it.

Comparison of full text search engine - Lucene, Sphinx, Postgresql, My...

mysql postgresql full-text-search lucene sphinx
Rectangle 27 10

SearchTools-Avi said "MySQL text search, which doesn't even index words of three letters or fewer."

That said, MySQL fulltext has limitations: for one, it gets slow to update once you reach a million records or so, ...

Comparison of full text search engine - Lucene, Sphinx, Postgresql, My...

mysql postgresql full-text-search lucene sphinx
Rectangle 27 2

I would add mnoGoSearch to the list. Extremely performant and flexible solution, which works as Google : indexer fetches data from multiple sites, You could use basic criterias, or invent Your own hooks to have maximal search quality. Also it could fetch the data directly from the database.

The solution is not so known today, but it feets maximum needs. You could compile and install it or on standalone server, or even on Your principal server, it doesn't need so much ressources as Solr, as it's written in C and runs perfectly even on small servers.

In the beginning You need to compile it Yourself, so it requires some knowledge. I made a tiny script for Debian, which could help. Any adjustments are welcome.

As You are using Django framework, You could use or PHP client in the middle, or find a solution in Python, I saw some articles.

And, of course mnoGoSearch is open source, GNU GPL.

Comparison of full text search engine - Lucene, Sphinx, Postgresql, My...

mysql postgresql full-text-search lucene sphinx
Rectangle 27 0

I am surprised that there isn't more information posted about Solr. Solr is quite similar to Sphinx but has more advanced features (AFAIK as I haven't used Sphinx -- only read about it).

Solr also provides the following additional features:

  • Multiple cores (think of these as separate databases with their own configuration and own indexes)
  • Highlighting of keywords (fairly easy to do in application code if you have regex-fu; however, why not let a specialized tool do a better job for you)
  • Update index via XML or delimited file
  • Communicate with the search server via HTTP (it can even return Json, Native PHP/Ruby/Python)
  • Index directly from the database with custom queries
  • Fast indexing (compare to MySQL full-text search indexing times) -- Lucene uses a binary inverted index format.
  • Boosting (custom rules for increasing relevance of a particular keyword or phrase, etc.)
  • Fielded searches (if a search user knows the field he/she wants to search, they narrow down their search by typing the field, then the value, and ONLY that field is searched rather than everything -- much better user experience)

BTW, there are tons more features; however, I've listed just the features that I have actually used in production. BTW, out of the box, MySQL supports #1, #3, and #11 (limited) on the list above. For the features you are looking for, a relational database isn't going to cut it. I'd eliminate those straight away.

Also, another benefit is that Solr (well, Lucene actually) is a document database (e.g. NoSQL) so many of the benefits of any other document database can be realized with Solr. In other words, you can use it for more than just search (i.e. Performance). Get creative with it :)

Sphinx too about Supports replication Multiple cores Boolean searches Highlighting of keywords Update index via XML -or delimited file- PDF, Word document indexing (via xml) Facets Stop words, synonyms, etc. Index directly from the database with custom queries Auto-suggest Fast indexing Boosting Fielded searches About Dynamic fields Aggregate fields Cache Autowarming I just don't know

Comparison of full text search engine - Lucene, Sphinx, Postgresql, My...

mysql postgresql full-text-search lucene sphinx
Rectangle 27 0

Elasticsearch is a search server based on Lucene. It provides a distributed, multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents. Elasticsearch is developed in Java and is released as open source under the terms of the Apache License.

We are using this search engine at our project and very happy with it.

Comparison of full text search engine - Lucene, Sphinx, Postgresql, My...

mysql postgresql full-text-search lucene sphinx
Rectangle 27 0

I would add mnoGoSearch to the list. Extremely performant and flexible solution, which works as Google : indexer fetches data from multiple sites, You could use basic criterias, or invent Your own hooks to have maximal search quality. Also it could fetch the data directly from the database.

The solution is not so known today, but it feets maximum needs. You could compile and install it or on standalone server, or even on Your principal server, it doesn't need so much ressources as Solr, as it's written in C and runs perfectly even on small servers.

In the beginning You need to compile it Yourself, so it requires some knowledge. I made a tiny script for Debian, which could help. Any adjustments are welcome.

As You are using Django framework, You could use or PHP client in the middle, or find a solution in Python, I saw some articles.

And, of course mnoGoSearch is open source, GNU GPL.

Comparison of full text search engine - Lucene, Sphinx, Postgresql, My...

mysql postgresql full-text-search lucene sphinx
Rectangle 27 0

Good to see someone's chimed in about Lucene - because I've no idea about that.

Sphinx, on the other hand, I know quite well, so let's see if I can be of some help.

  • Result relevance ranking is the default. You can set up your own sorting should you wish, and give specific fields higher weightings.
  • Indexing speed is super-fast, because it talks directly to the database. Any slowness will come from complex SQL queries and un-indexed foreign keys and other such problems. I've never noticed any slowness in searching either.
  • I'm a Rails guy, so I've no idea how easy it is to implement with Django. There is a Python API that comes with the Sphinx source though.
  • The search service daemon (searchd) is pretty low on memory usage - and you can set limits on how much memory the indexer process uses too.
  • Scalability is where my knowledge is more sketchy - but it's easy enough to copy index files to multiple machines and run several searchd daemons. The general impression I get from others though is that it's pretty damn good under high load, so scaling it out across multiple machines isn't something that needs to be dealt with.
  • There's no support for 'did-you-mean', etc - although these can be done with other tools easily enough. Sphinx does stem words though using dictionaries, so 'driving' and 'drive' (for example) would be considered the same in searches.
  • Sphinx doesn't allow partial index updates for field data though. The common approach to this is to maintain a delta index with all the recent changes, and re-index this after every change (and those new results appear within a second or two). Because of the small amount of data, this can take a matter of seconds. You will still need to re-index the main dataset regularly though (although how regularly depends on the volatility of your data - every day? every hour?). The fast indexing speeds keep this all pretty painless though.

I've no idea how applicable to your situation this is, but Evan Weaver compared a few of the common Rails search options (Sphinx, Ferret (a port of Lucene for Ruby) and Solr), running some benchmarks. Could be useful, I guess.

I've not plumbed the depths of MySQL's full-text search, but I know it doesn't compete speed-wise nor feature-wise with Sphinx, Lucene or Solr.

Sphinx does allow you to update individual attributes of items in current indexes, but not remove/update full records.

sphinx RT allows you to do partial updates/removals. it is in early stage but already [almost] works. sphinxsearch.com/wiki/doku.php?id=rt_tutorial

Comparison of full text search engine - Lucene, Sphinx, Postgresql, My...

mysql postgresql full-text-search lucene sphinx
Rectangle 27 0

I am surprised that there isn't more information posted about Solr. Solr is quite similar to Sphinx but has more advanced features (AFAIK as I haven't used Sphinx -- only read about it).

Solr also provides the following additional features:

  • Multiple cores (think of these as separate databases with their own configuration and own indexes)
  • Highlighting of keywords (fairly easy to do in application code if you have regex-fu; however, why not let a specialized tool do a better job for you)
  • Update index via XML or delimited file
  • Communicate with the search server via HTTP (it can even return Json, Native PHP/Ruby/Python)
  • Index directly from the database with custom queries
  • Fast indexing (compare to MySQL full-text search indexing times) -- Lucene uses a binary inverted index format.
  • Boosting (custom rules for increasing relevance of a particular keyword or phrase, etc.)
  • Fielded searches (if a search user knows the field he/she wants to search, they narrow down their search by typing the field, then the value, and ONLY that field is searched rather than everything -- much better user experience)

BTW, there are tons more features; however, I've listed just the features that I have actually used in production. BTW, out of the box, MySQL supports #1, #3, and #11 (limited) on the list above. For the features you are looking for, a relational database isn't going to cut it. I'd eliminate those straight away.

Also, another benefit is that Solr (well, Lucene actually) is a document database (e.g. NoSQL) so many of the benefits of any other document database can be realized with Solr. In other words, you can use it for more than just search (i.e. Performance). Get creative with it :)

Sphinx too about Supports replication Multiple cores Boolean searches Highlighting of keywords Update index via XML -or delimited file- PDF, Word document indexing (via xml) Facets Stop words, synonyms, etc. Index directly from the database with custom queries Auto-suggest Fast indexing Boosting Fielded searches About Dynamic fields Aggregate fields Cache Autowarming I just don't know

Comparison of full text search engine - Lucene, Sphinx, Postgresql, My...

mysql postgresql full-text-search lucene sphinx
Rectangle 27 0

Start the Jetty Application Server

Apart from answering OP's queries, Let me throw some insights on Apache Solr from simple introduction to detailed installation and implementation.

The boost helps you rank your results show up on top. Say, you're trying to search for a name john in the fields firstname and lastname, and you want to give relevancy to the firstname field, then you need to boost up the firstname field as shown.

http://localhost:8983/solr/collection1/select?q=firstname:john^2&lastname:john

As you can see, firstname field is boosted up with a score of 2.

The speed is unbelievably fast and no compromise on that. The reason I moved to Solr.

Regarding the indexing speed, Solr can also handle JOINS from your database tables. A higher and complex JOIN do affect the indexing speed. However, an enormous RAM config can easily tackle this situation.

The higher the RAM, The faster the indexing speed of Solr is.

  • ease of use and ease of integration with Django

Never attempted to integrate Solr and Django, however you can achieve to do that with Haystack. I found some interesting article on the same and here's the github for it.

  • resource requirements - site will be hosted on a VPS, so ideally the search engine wouldn't require a lot of RAM and CPU

Solr breeds on RAM, so if the RAM is high, you don't to have to worry about Solr.

Solr's RAM usage shoots up on full-indexing if you have some billion records, you could smartly make use of Delta imports to tackle this situation. As explained, Solr is only a near real-time solution.

Solr is highly scalable. Have a look on SolrCloud. Some key features of it.

  • Shards (or sharding is the concept of distributing the index among multiple machines, say if your index has grown too large)
  • Load Balancing (if Solrj is used with Solr cloud it automatically takes care of load-balancing using it's Round-Robin mechanism)
  • extra features such as "did you mean?", related searches, etc

For the above scenario, you could use the SpellCheckComponent that is packed up with Solr. There are a lot other features, The SnowballPorterFilterFactory helps to retrieve records say if you typed, books instead of book, you will be presented with results related to book.

Assuming that you are under LINUX environment, you could proceed to this article further. (mine was an Ubuntu 14.04 version)

Download Apache Solr from here. That would be version is 4.8.1. You could download new versions, I found this stable.

After downloading the archive , extract it to a folder of your choice. Say .. Downloads or whatever.. So it will look like Downloads/solr-4.8.1/

shankar@shankar-lenovo: cd Downloads/solr-4.8.1

Jetty is available inside the examples folder of the solr-4.8.1 directory , so navigate inside that and start the Jetty Application Server.

shankar@shankar-lenovo:~/Downloads/solr-4.8.1/example$ java -jar start.jar

Now , do not close the terminal , minimize it and let it stay aside.

( TIP : Use & after start.jar to make the Jetty Server run in the background )

To check if Apache Solr runs successfully, visit this URL on the browser. http://localhost:8983/solr

It runs on the port 8983 as default. You could change the port either here or directly inside the jetty.xml file.

java -Djetty.port=9091 -jar start.jar

This JAR file acts as a bridge between MySQL and JDBC , Download the Platform Independent Version here

After downloading it, extract the folder and copy themysql-connector-java-5.1.31-bin.jar and paste it to the lib directory.

shankar@shankar-lenovo:~/Downloads/solr-4.8.1/contrib/dataimporthandler/lib

To put Solr to use, You need to have some tables and data to search for. For that, we will use MySQL for creating a table and pushing some random names and then we could use Solr to connect to MySQL and index that table and it's entries.

CREATE TABLE test_solr_mysql
 (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  name VARCHAR(45) NULL,
  created TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (id)
 );
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jean');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jack');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jason');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Vego');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Grunt');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jasper');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Fred');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jenna');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Rebecca');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Roland');
shankar@shankar-lenovo: ~/Downloads/solr-4.8.1/example/solr/collection1/conf
<lib dir="../../../contrib/dataimporthandler/lib/" regex=".*\.jar" />
  <lib dir="../../../dist/" regex="solr-dataimporthandler-\d.*\.jar" />

Now add the DIH (Data Import Handler)

<requestHandler name="/dataimport" 
  class="org.apache.solr.handler.dataimport.DataImportHandler" >
    <lst name="defaults">
      <str name="config">db-data-config.xml</str>
    </lst>
</requestHandler>

If the file exists then ignore, add these lines to that file. As you can see the first line, you need to provide the credentials of your MySQL database. The Database name, username and password.

<dataConfig>
    <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/yourdbname" user="dbuser" password="dbpass"/>
    <document>
   <entity name="test_solr" query="select CONCAT('test_solr-',id) as rid,name from test_solr_mysql WHERE '${dataimporter.request.clean}' != 'false'
      OR `created` > '${dataimporter.last_index_time}'" >
    <field name="id" column="rid" />
    <field name="solr_name" column="name" />
    </entity>
   </document>
</dataConfig>

( TIP : You can have any number of entities but watch out for id field, if they are same then indexing will skipped. )

Add this to your schema.xml as shown..

<uniqueKey>id</uniqueKey>
<field name="solr_name" type="string" indexed="true" stored="true" />

This is where the real deal is. You need to do the indexing of data from MySQL to Solr inorder to make use of Solr Queries.

Hit the URL http://localhost:8983/solr on your browser. The screen opens like this.

As the marker indicates, go to Logging inorder to check if any of the above configuration has led to errors.

Ok so now you are here, As you can there are a lot of yellow messages (WARNINGS). Make sure you don't have error messages marked in red. Earlier, on our configuration we had added a select query on our db-data-config.xml, say if there were any errors on that query, it would have shown up here.

Fine, no errors. We are good to go. Let's choose collection1 from the list as depicted and select Dataimport

Using the DIH, you will be connecting to MySQL from Solr through the configuration file db-data-config.xml from the Solr interface and retrieve the 10 records from the database which gets indexed onto Solr.

To do that, Choose full-import , and check the options Clean and Commit. Now click Execute as shown.

Alternatively, you could use a direct full-import query like this too..

http://localhost:8983/solr/collection1/dataimport?command=full-import&commit=true

After you clicked Execute, Solr begins to index the records, if there were any errors, it would say Indexing Failed and you have to go back to the Logging section to see what has gone wrong.

Assuming there are no errors with this configuration and if the indexing is successfully complete., you would get this notification.

Seems like everything went well, now you could use Solr Queries to query the data that was indexed. Click the Query on the left and then press Execute button on the bottom.

You will see the indexed records as shown.

The corresponding Solr query for listing all the records is

http://localhost:8983/solr/collection1/select?q=*:*&wt=json&indent=true

Well, there goes all 10 indexed records. Say, we need only names starting with Ja , in this case, you need to target the column name solr_name, Hence your query goes like this.

http://localhost:8983/solr/collection1/select?q=solr_name:Ja*&wt=json&indent=true

That's how you write Solr Queries. To read more about it, Check this beautiful article.

Your latest answer, wow. Long time no talk Shankar. I hope all is well with you my friend.

Comparison of full text search engine - Lucene, Sphinx, Postgresql, My...

mysql postgresql full-text-search lucene sphinx
Rectangle 27 0

SearchTools-Avi said "MySQL text search, which doesn't even index words of three letters or fewer."

That said, MySQL fulltext has limitations: for one, it gets slow to update once you reach a million records or so, ...

Comparison of full text search engine - Lucene, Sphinx, Postgresql, My...

mysql postgresql full-text-search lucene sphinx
Rectangle 27 0

Firstly, your requirement is a much better match for full text searching than "like" comparisons. As soon as you get this working, you'll want to boost "Taunton" over "Aunton" and then deal with different spellings, wildcards, and then you'll notice that the wildcards in your like statement slow the whole thing down. Full text searching does all that for you, and is pretty quick.

The specific answer to your question is a nasty one; I'm reluctant to give you code PHP code because I can't easily test it. I've created a little SQLFiddle to show how you can do it in SQL, though.

In essence, you need to turn your "hits" on individual searches into a derived table, and count each row's number of entries; the more times it turns up in your derived table, the more likely it is to be a "good" hit.

Querying muliple columns in MySQL table (PHP) - Stack Overflow

php mysql sql
Rectangle 27 0

I don't know Sphinx, but as for Lucene vs a database full-text search, I think that Lucene performance is unmatched. You should be able to do almost any search in less than 10 ms, no matter how many records you have to search, provided that you have set up your Lucene index correctly.

Here comes the biggest hurdle though: personally, I think integrating Lucene in your project is not easy. Sure, it is not too hard to set it up so you can do some basic search, but if you want to get the most out of it, with optimal performance, then you definitely need a good book about Lucene.

As for CPU & RAM requirements, performing a search in Lucene doesn't task your CPU too much, though indexing your data is, although you don't do that too often (maybe once or twice a day), so that isn't much of a hurdle.

It doesn't answer all of your questions but in short, if you have a lot of data to search, and you want great performance, then I think Lucene is definitely the way to go. If you're not going to have that much data to search, then you might as well go for a database full-text search. Setting up a MySQL full-text search is definitely easier in my book.

Compare to sphinx , lucence is tooo slow and bulky. I had used both in my project and i finally sticked to sphinx. Lucence is in java , and it takes a lot more CPU and RAM than Sphinx.

I have to disagree here. Lucene is lightning fast IF you build a correct index. You can basically do an advanced query over millions of records in just a couple of milliseconds. You just need to know what you are doing. And Lucene is in java... your point being? There's also .NET port, Lucene.NET btw.

but you clearly stated that you don't use sphinx, and v3sson has used both.

how can you state that lucene's performance is unmatched in the same sentence that you state you haven't used sphinx?

Valid questions. I never said that Lucene is faster than Sphinx, I mentioned that Lucene vs a database full-text search is unmatched. And it is. No question about that. Lucene is based upon an inverted index. Now I don't know Sphinx, as mentioned before, but if it also uses an inverted index or a similar indexing method then it is possible that they are equally performing. Stating that Lucene, compared to Sphinx, would be 'tooo slow and bulky' is not based upon facts. Especially not when it is only said that Lucene is in 'Java', which is just a ridiculous non-issue in terms of performance.

Comparison of full text search engine - Lucene, Sphinx, Postgresql, My...

mysql postgresql full-text-search lucene sphinx
Rectangle 27 0

Elasticsearch is a search server based on Lucene. It provides a distributed, multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents. Elasticsearch is developed in Java and is released as open source under the terms of the Apache License.

We are using this search engine at our project and very happy with it.

Comparison of full text search engine - Lucene, Sphinx, Postgresql, My...

mysql postgresql full-text-search lucene sphinx
Rectangle 27 0

I would add mnoGoSearch to the list. Extremely performant and flexible solution, which works as Google : indexer fetches data from multiple sites, You could use basic criterias, or invent Your own hooks to have maximal search quality. Also it could fetch the data directly from the database.

The solution is not so known today, but it feets maximum needs. You could compile and install it or on standalone server, or even on Your principal server, it doesn't need so much ressources as Solr, as it's written in C and runs perfectly even on small servers.

In the beginning You need to compile it Yourself, so it requires some knowledge. I made a tiny script for Debian, which could help. Any adjustments are welcome.

As You are using Django framework, You could use or PHP client in the middle, or find a solution in Python, I saw some articles.

And, of course mnoGoSearch is open source, GNU GPL.

Comparison of full text search engine - Lucene, Sphinx, Postgresql, My...

mysql postgresql full-text-search lucene sphinx
Rectangle 27 0

Good to see someone's chimed in about Lucene - because I've no idea about that.

Sphinx, on the other hand, I know quite well, so let's see if I can be of some help.

  • Result relevance ranking is the default. You can set up your own sorting should you wish, and give specific fields higher weightings.
  • Indexing speed is super-fast, because it talks directly to the database. Any slowness will come from complex SQL queries and un-indexed foreign keys and other such problems. I've never noticed any slowness in searching either.
  • I'm a Rails guy, so I've no idea how easy it is to implement with Django. There is a Python API that comes with the Sphinx source though.
  • The search service daemon (searchd) is pretty low on memory usage - and you can set limits on how much memory the indexer process uses too.
  • Scalability is where my knowledge is more sketchy - but it's easy enough to copy index files to multiple machines and run several searchd daemons. The general impression I get from others though is that it's pretty damn good under high load, so scaling it out across multiple machines isn't something that needs to be dealt with.
  • There's no support for 'did-you-mean', etc - although these can be done with other tools easily enough. Sphinx does stem words though using dictionaries, so 'driving' and 'drive' (for example) would be considered the same in searches.
  • Sphinx doesn't allow partial index updates for field data though. The common approach to this is to maintain a delta index with all the recent changes, and re-index this after every change (and those new results appear within a second or two). Because of the small amount of data, this can take a matter of seconds. You will still need to re-index the main dataset regularly though (although how regularly depends on the volatility of your data - every day? every hour?). The fast indexing speeds keep this all pretty painless though.

I've no idea how applicable to your situation this is, but Evan Weaver compared a few of the common Rails search options (Sphinx, Ferret (a port of Lucene for Ruby) and Solr), running some benchmarks. Could be useful, I guess.

I've not plumbed the depths of MySQL's full-text search, but I know it doesn't compete speed-wise nor feature-wise with Sphinx, Lucene or Solr.

Sphinx does allow you to update individual attributes of items in current indexes, but not remove/update full records.

sphinx RT allows you to do partial updates/removals. it is in early stage but already [almost] works. sphinxsearch.com/wiki/doku.php?id=rt_tutorial

Comparison of full text search engine - Lucene, Sphinx, Postgresql, My...

mysql postgresql full-text-search lucene sphinx