Integrate Sphinx into Drizzle
Integrate Sphinx more closely with Drizzle
Whiteboard
Discussion with Andrew on July 23rd; four possible approaches were discussed:
1. have a wrapper on the db side that manages the .conf file, launches indexer, talks to searchd, and does nothing itself
2. have a more tightly integrated wrapper on the db side that links to libsphinx directly and manages indexes itself; does not care about indexer/searchd
3. wait for an unspecified amount of time until at least alpha-quality dynamic updates are implemented, and have a wrapper that uses them by talking to searchd
4. same as 3, but by linking against libsphinx instead
From further discussion, option 2 seems the most feasible and the best choice. This option would have these requirements (a sketch of the wrapper follows the list):
* libsphinx used within the drizzle server
* only the needed functionality that searchd performs, taken from libsphinx, included
* configuration information stored somewhere - my.cnf?
* data source for sphinx at handler or storage engine level (?)
* searchd currently listens, logs, and proxies. The key is to identify which of those are needed. Logging can be performed by drizzle.
* Details of indexes "hidden" from the user
* How do we get a query within drizzle to result in the index being searched? What syntax is used - FT functions, or new Sphinx functions?
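To make the shape of option 2 concrete, here is a minimal sketch of such an in-server wrapper. Everything in it (SphinxWrapper, sphinx_index_t, sphinx_index_open/close) is a hypothetical placeholder, not the real libsphinx or Drizzle API; it only illustrates the ownership model: indexes live inside drizzled, configuration comes from the server's own options, and searchd's listening/logging/proxying duties fall away.

  #include <map>
  #include <string>
  #include <utility>

  // Hypothetical index handle plus open/close stubs standing in for
  // whatever libsphinx actually exposes.
  struct sphinx_index_t { std::string path; };

  static sphinx_index_t *sphinx_index_open(const std::string &path)
  {
    // stub: real code would open/mmap the on-disk index files
    sphinx_index_t *idx = new sphinx_index_t();
    idx->path = path;
    return idx;
  }

  static void sphinx_index_close(sphinx_index_t *idx)
  {
    delete idx;  // stub: real code would flush and unmap first
  }

  // The wrapper lives inside drizzled and owns all Sphinx indexes, so
  // there is no external searchd process and no sphinx.conf to manage;
  // configuration would come from the server's config (my.cnf?).
  class SphinxWrapper
  {
  public:
    explicit SphinxWrapper(const std::string &data_dir) : data_dir_(data_dir) {}

    ~SphinxWrapper()
    {
      for (std::map<std::string, sphinx_index_t *>::iterator it = indexes_.begin();
           it != indexes_.end(); ++it)
        sphinx_index_close(it->second);
    }

    // Only the searching functionality that searchd performs is pulled
    // in from libsphinx; listening, logging and proxying stay with the
    // server itself.
    sphinx_index_t *get_index(const std::string &name)
    {
      std::map<std::string, sphinx_index_t *>::iterator it = indexes_.find(name);
      if (it == indexes_.end())
      {
        sphinx_index_t *idx = sphinx_index_open(data_dir_ + "/" + name);
        it = indexes_.insert(std::make_pair(name, idx)).first;
      }
      return it->second;
    }

  private:
    std::string data_dir_;
    std::map<std::string, sphinx_index_t *> indexes_;
  };

The point of being in-process is that configuration and logging ride on the server's existing facilities rather than on a separate sphinx.conf and log file.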
1st stage/version:
1) hooks create index (or something), and builds an index (slowly) on that
2) hooks SELECT and does searching
3) does not handle INSERT/UPDATE/DELETE at stage 0
The idea being, get something that at least searches the index. Perhaps have the index built upon 'create index' being issued. The index is built once, and can be searched; nothing else is yet implemented at this stage. Possible to implement 'drop index' deleting the index. (See the sketch of these hooks below.)
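A rough sketch of what those stage-1 hooks might look like. The hook names and the sphinx_* calls are hypothetical, and the actual trigger syntax (FT functions vs. new Sphinx functions, per the open question above) is still undecided:

  #include <stdint.h>
  #include <string>
  #include <vector>

  // Hypothetical stand-ins for libsphinx; stage 1 only needs "build
  // once" and "search what was built".
  static bool sphinx_build_index(const std::string &/*name*/)
  {
    return true;  // stub: real code would pull rows and write index files
  }

  static bool sphinx_drop_index(const std::string &/*name*/)
  {
    return true;  // stub: real code would unlink the index files
  }

  static std::vector<uint64_t> sphinx_search(const std::string &/*name*/,
                                             const std::string &/*query*/)
  {
    return std::vector<uint64_t>();  // stub: would return matching doc ids
  }

  // 1) hook CREATE INDEX: build the whole index once, slowly, right here.
  bool on_create_index(const std::string &index_name)
  {
    return sphinx_build_index(index_name);
  }

  // 2) hook SELECT: a full-text match (whatever the syntax ends up
  //    being) returns matching document ids that the executor joins
  //    back to table rows.
  std::vector<uint64_t> on_select_match(const std::string &index_name,
                                        const std::string &query)
  {
    return sphinx_search(index_name, query);
  }

  // optional at this stage: DROP INDEX deletes the index files.
  bool on_drop_index(const std::string &index_name)
  {
    return sphinx_drop_index(index_name);
  }

  // 3) INSERT/UPDATE/DELETE are deliberately ignored at stage 0: the
  //    index is built once and stays read-only until rebuilt.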
2nd stage/version:
1) Build() (builds the index) - where does it get its data source - the handler? No need for a database client library; this works at a lower level. (Sketch below.)
(shodan) this belongs to 1st stage in fact - Build() should be implemented for CREATE INDEX anyway
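A sketch of the handler-level data source idea for Build(). The RowSource interface is hypothetical and stands in for whatever the handler/storage-engine layer actually provides, and the commented-out sphinx_* calls are invented names for illustration:

  #include <stdint.h>
  #include <string>

  // Hypothetical minimal row source at handler/storage-engine level.
  class RowSource
  {
  public:
    virtual ~RowSource() {}
    virtual bool next() = 0;               // advance to the next row
    virtual uint64_t doc_id() const = 0;   // primary key used as document id
    virtual std::string text() const = 0;  // contents of the indexed column
  };

  // Build() walks the table through the handler interface and feeds
  // each row to the indexer; no database client library is involved
  // because we are already inside the server.
  bool Build(RowSource &rows)
  {
    while (rows.next())
    {
      // hypothetical libsphinx calls:
      // sphinx_add_document(indexer, rows.doc_id(), rows.text());
    }
    // sphinx_finalize_index(indexer);
    return true;
  }

Pulling rows at this level avoids a round trip through a client library and a second copy of the data, which is exactly the advantage of running inside the server.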
Other useful info:
Searchd
* 6500 lines of code..
* first 1400 are logging, globals, helpers, networks buffers etc..
* 800 lines of distributed querying code...
* 50 lines of schema minimization..
* 50 lines of really old network proto fixup..
* 600 lines of parsing the network search request and emitting network search result..
* 800 lines that do all the searching including distributed stuff, multiquery optimizations, and merging search results across several indexes.
* 100 lines of search command network chatter again... ;)
* 200 lines of index rotation, as well as search-unrelated commands (excerpts, updates, buildkeywords, etc.)
* The remaining ~2000 lines are about some async ops, signal handling, config parsing, startup in general, and all that "main loop" stuff.
Of these:
800 + 50 + 800 lines could be reused if you want distributed support,
and fewer than 1000 would be needed if you don't.