GitHub - nextapps-de/flexsearch: Next-Generation full text search library for Br...

source link: https://github.com/nextapps-de/flexsearch

README.md
Search Library


Web's fastest and most memory-flexible full-text search library with zero dependencies.

When it comes to raw search speed, FlexSearch outperforms every other search library out there and also provides flexible search capabilities like multi-word matching, phonetic transformations or partial matching. Depending on the options used, it also provides the most memory-efficient index. Keep in mind that updating and/or removing existing items from the index has a significant cost. When your index needs to be updated very often, BulkSearch may be a better choice. FlexSearch also provides a non-blocking asynchronous processing model as well as web workers to perform any updates or queries on the index in parallel through dedicated balanced threads.

Installation Guide  •  API Reference  •  Example Options  •  Custom Builds  •  Flexsearch Server

Supported Platforms:

  • Browser
  • Node.js

FlexSearch Server is also available here: https://github.com/nextapps-de/flexsearch-server

Library Comparison:

Get Latest (Stable Release):

| Build | File | CDN |
|---|---|---|
| flexsearch.min.js | Download | https://cdn.jsdelivr.net/gh/nextapps-de/flexsearch@master/flexsearch.min.js |
| flexsearch.light.js | Download | https://cdn.jsdelivr.net/gh/nextapps-de/flexsearch@master/flexsearch.light.js |
| flexsearch.compact.js | Download | https://cdn.jsdelivr.net/gh/nextapps-de/flexsearch@master/flexsearch.compact.js |
| flexsearch.custom.js | Custom Build | |

All Features:

| Feature | flexsearch.min.js | flexsearch.compact.js | flexsearch.light.js |
|---|---|---|---|
| Presets | x | x | - |
| Async Processing | x | x | - |
| Web-Worker Sharding (not available in Node.js) | x | - | - |
| Contextual Indexes | x | x | x |
| Partial Matching | x | x | x |
| Multi-Phrase Search | x | x | x |
| Relevance-based Scoring | x | x | x |
| Auto-Balanced Cache by Popularity | x | - | - |
| Suggestions (Results) | x | - | - |
| Phonetic Matching | x | x | - |
| Customizable: Matcher, Encoder, Tokenizer, Stemmer, Filter | x | x | x |
| File Size (gzip) | 5.0 kb | 3.9 kb | 2.7 kb |

It is also pretty simple to make Custom Builds

Benchmark Ranking

Comparison: Benchmark "Gulliver's Travels"

Query Test: "Gulliver's Travels"

| Rank | Library Name | Library Version | Single Phrase (op/s) | Multi Phrase (op/s) | Not Found (op/s) |
|---|---|---|---|---|---|
| 1 | FlexSearch *** | 0.3.3 | 363757 | 182603 | 1627219 |
| 2 | Wade | 0.3.3 | 899 | 6098 | 214286 |
| 3 | JS Search | 1.4.2 | 735 | 8889 | 800000 |
| 4 | JSii | 1.0 | 551 | 9970 | 75000 |
| 5 | Lunr.js | 2.3.5 | 355 | 1051 | 25000 |
| 6 | Elasticlunr.js | 0.9.6 | 327 | 781 | 6667 |
| 7 | BulkSearch | 0.1.3 | 265 | 535 | 2778 |
| 8 | bm25 | 0.2 | 71 | 116 | 2065 |
| 9 | Fuse | 3.3.0 | 0.5 | 0.4 | 0.7 |

Memory Test: "Gulliver's Travels"

| Rank | Library Name | Library Version | Index Size * | Memory Allocation ** |
|---|---|---|---|---|
| 1 | FlexSearch **** | 0.3.1 | 1.33 Mb | 20.31 kb |
| 2 | Wade | 0.3.3 | 3.18 Mb | 68.53 kb |
| 3 | Fuse | 3.3.0 | 0.22 Mb | 156.46 kb |
| 4 | JSii | 1.0 | 8.9 Mb | 81.03 kb |
| 5 | bm25 | 0.2 | 6.95 Mb | 137.88 kb |
| 6 | BulkSearch | 0.1.3 | 1.53 Mb | 984.30 kb |
| 7 | Elasticlunr.js | 0.9.6 | 11.83 Mb | 68.69 kb |
| 8 | Lunr.js | 2.3.5 | 16.24 Mb | 84.73 kb |
| 9 | JS Search | 1.4.2 | 36.9 Mb | 53.0 kb |

* Index Size: The size of memory the index requires
** Memory Allocation: The amount of memory which was additionally allocated during a row of 10 queries
*** The preset "fastest" was used for this test
**** The preset "memory" was used for this test

Library Comparison: Benchmark "Gulliver's Travels"

Contextual Search

"TF-IDF and all kinds of variations (like BM25) is a big mistake in searching algorithms today. They don't provide neither: a meaningful relevance of a term nor the importance of it! Like many pseudo-intelligent algorithms this is also just an example of mathematical stupidity." — Thomas Wilkerling, Contextual-based Scoring, 2018

FlexSearch introduces a new scoring mechanism called Contextual Search, which was invented by Thomas Wilkerling, the author of this library. Contextual Search boosts queries to a completely new level but also requires additional memory. The basic idea of this concept is to limit relevance by its context instead of calculating relevance over the whole (unlimited) distance. Imagine you add a text block of several sentences to an index ID. Assuming a query combines the first and the last word of this text block, are they really relevant to each other? In this way contextual search also improves the results of relevance-based queries on large amounts of text data.
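The depth-limited context idea can be sketched in a few lines of plain JavaScript. This is a simplified illustration of the concept only; `buildContextMap` is an invented name and this is not FlexSearch's internal data structure:

```javascript
// Simplified sketch of a depth-limited contextual map (concept illustration
// only, not FlexSearch's internals). For each word, only neighbors within
// "depth" positions are registered as relevant context.
function buildContextMap(text, depth) {

    var words = text.toLowerCase().split(/\s+/);
    var map = {};

    words.forEach(function(word, i){

        map[word] = map[word] || new Set();

        // relevance is limited to the surrounding context window:
        var from = Math.max(0, i - depth);
        var to = Math.min(words.length - 1, i + depth);

        for (var j = from; j <= to; j++) {

            if (j !== i) map[word].add(words[j]);
        }
    });

    return map;
}

var context = buildContextMap("the quick brown fox jumps over the lazy dog", 2);

// "fox" lies within depth 2 of "quick", "dog" does not:
context["quick"].has("fox"); // → true
context["quick"].has("dog"); // → false
```

Words far apart in the text never become context for each other, which is exactly how the contextual index caps both the relevance calculation and the memory it needs.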


Note: This feature is not enabled by default. Read here how to enable it.

Compare BulkSearch vs. FlexSearch

| | BulkSearch | FlexSearch |
|---|---|---|
| Access | Read-Write optimized index | Read-Memory optimized index |
| Memory | Large: ~ 1 Mb per 100,000 words | Tiny: ~ 100 Kb per 100,000 words |
| Index Type | Bulk of encoded string data divided into chunks | 1. Lexical pre-scored dictionary<br>2. Contextual-based map |
| Strengths | fast adds<br>fast updates<br>fast removals | fast queries<br>memory-efficient index |
| Weaknesses | less powerful contextual search<br>less memory efficient (has to be defragmented from time to time) | updating / deleting existing items from the index is slow<br>adding items to an index optimized for super partial matching (tokenize: "full") is slow |
| Pagination | Yes | No |

Installation

HTML / Javascript

Use flexsearch.min.js for production and flexsearch.js for development.

<html>
<head>
    <script src="js/flexsearch.min.js"></script>
</head>
...

Use latest from CDN:

<script src="https://cdn.rawgit.com/nextapps-de/flexsearch/master/flexsearch.min.js"></script>

Or a specific version:

<script src="https://cdn.rawgit.com/nextapps-de/flexsearch/0.3.2/flexsearch.min.js"></script>

AMD

var FlexSearch = require("./flexsearch.js");

Node.js

npm install flexsearch

In your code include as follows:

var FlexSearch = require("flexsearch");

Or pass in options when requiring:

var index = require("flexsearch").create({/* options */});

API Overview

Global methods:

Index methods:

Usage

Create a new index

FlexSearch.create(<options>)

var index = new FlexSearch();

Alternatively you can also use:

var index = FlexSearch.create();

Create a new index and choosing one of the built-in profiles:

var index = new FlexSearch("speed");

Create a new index with custom options:

var index = new FlexSearch({

    // default values:

    encode: "balance",
    tokenize: "forward",
    async: false,
    worker: false,
    cache: false
});

Read more about custom options

Add items to an index

Index.add(id, string)

index.add(10025, "John Doe");

Search items

Index.search(string | options, <limit>, <callback>)

index.search("John");

Limit the result:

index.search("John", 10);

Async Search

Perform queries asynchronously:

index.search("John", function(result){
    
    // array of results
});

Passing a callback will always perform the search asynchronously, even if the "async" option was not set.

Perform queries asynchronously (Promise-based):

Make sure the option "async" is enabled on this instance

index.search("John").then(function(result){
    
    // array of results
});

Alternatively ES6:

async function search(query){
    
    const result = await index.search(query);
}

Custom Search

Pass custom options for each query:

index.search({
    
    query: "John",
    limit: 1000,
    threshold: 5, // >= initial threshold
    depth: 3,     // <= initial depth
    callback: function(results){/* ... */}
});

The same from above could also be written as:

index.search("John", {

    limit: 1000,
    threshold: 5,
    depth: 3
    
}, function(results){
    
    // ....
});

Suggestions

Also get suggestions for a query:

index.search({
    
    query: "John Doe",
    suggest: true
});

When suggestion is enabled, all results will be filled up (to the limit, default 1000) with similar matches, ordered by relevance.

Update item of an index

Index.update(id, string)

index.update(10025, "Road Runner");

Remove an item from the index

Index.remove(id)

index.remove(10025);

Reset index

index.clear();

Destroy the index

index.destroy();

Re-Initialize the index

Index.init(<options>)

Initialize (with same options):

index.init();

Initialize with new options:

index.init({

    /* options */
});

Re-initialization will also destroy the old index.

Add custom matcher

FlexSearch.registerMatcher({REGEX: REPLACE})

Add global matchers for all instances:

FlexSearch.registerMatcher({

    'ä': 'a', // replaces all 'ä' to 'a'
    'ó': 'o',
    '[ûúù]': 'u' // replaces multiple
});

Add private matchers for a specific instance:

index.addMatcher({

    'ä': 'a', // replaces all 'ä' to 'a'
    'ó': 'o',
    '[ûúù]': 'u' // replaces multiple
});
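Conceptually, a matcher table is just a set of regex → replacement pairs applied to the input before indexing. A plain-JavaScript sketch of that behavior (`applyMatchers` is a hypothetical helper, not part of the FlexSearch API):

```javascript
// Illustrative sketch of how a matcher table works: every key is treated
// as a regular expression and replaced globally in the input string.
// applyMatchers is a hypothetical helper, not part of the FlexSearch API.
function applyMatchers(matchers, str) {

    for (var key in matchers) {

        str = str.replace(new RegExp(key, "g"), matchers[key]);
    }

    return str;
}

applyMatchers({'ä': 'a', 'ó': 'o', '[ûúù]': 'u'}, "Bärûm"); // → "Barum"
```

Because keys are interpreted as regular expressions, character classes like `[ûúù]` collapse several replacements into one rule.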

Add custom encoder

Assign a custom encoder by passing a function during index creation/initialization:

var index = new FlexSearch({

    encode: function(str){
    
        // do something with str ...
        
        return str;
    }
});

Call a custom encoder directly:

var encoded = index.encode("sample text");

Register a global encoder

FlexSearch.registerEncoder(name, encoder)

Global encoders can be shared/used by all instances.

FlexSearch.registerEncoder("whitespace", function(str){

    return str.replace(/\s/g, "");
});

Initialize index and assign a global encoder:

var index = new FlexSearch({ encode: "whitespace" });

Call a global encoder directly:

var encoded = FlexSearch.encode("whitespace", "sample text");

Mix/Extend multiple encoders

FlexSearch.registerEncoder('mixed', function(str){
  
    str = this.encode("icase", str); // built-in
    str = this.encode("whitespace", str); // custom
    
     // do something additional with str ...
    
    return str;
});
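The mixing concept boils down to ordinary function composition. A minimal self-contained sketch, assuming two simple encoders that mirror the earlier examples (`icase` lower-cases, `whitespace` strips spaces; the names here are assumptions):

```javascript
// Sketch of encoder mixing as plain function composition (the registry
// object stands in for FlexSearch's named global encoders).
var encoders = {

    icase: function(str){ return str.toLowerCase(); },      // built-in equivalent
    whitespace: function(str){ return str.replace(/\s/g, ""); } // custom
};

function mixed(str) {

    str = encoders.icase(str);
    str = encoders.whitespace(str);

    return str;
}

mixed("Sample Text"); // → "sampletext"
```

The order matters: each encoder receives the output of the previous one, exactly as in the `registerEncoder("mixed", ...)` example above.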

Add custom tokenizer

A tokenizer splits words into components or chunks.

Define a private custom tokenizer during creation/initialization:

var index = new FlexSearch({

    tokenize: function(str){

        return str.split(/\s-\//g);
    }
});
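Any function that returns an array of chunks works as a tokenizer. For illustration, a hypothetical tokenizer that splits on whitespace and hyphens (an assumed rule, not the library's default):

```javascript
// Hypothetical tokenizer splitting on runs of whitespace and hyphens
// (an assumption for illustration; any function returning an array of
// string chunks can serve as a tokenizer).
function tokenize(str) {

    return str.split(/[\s-]+/);
}

tokenize("full-text search"); // → ["full", "text", "search"]
```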

Add language-specific stemmer and/or filter

Stemmer: several linguistic mutations of the same word (e.g. "run" and "running")

Filter: a blacklist of words to be entirely filtered out from indexing (e.g. "and", "to" or "be")

Assign a private custom stemmer or filter during creation/initialization:

var index = new FlexSearch({

    stemmer: {
        
        // object {key: replacement}
        "ational": "ate",
        "tional": "tion",
        "enci": "ence",
        "ing": ""
    },
    filter: [ 
        
        // array blacklist
        "in",
        "into",
        "is",
        "isn't",
        "it",
        "it's"
    ]
});
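The effect of these two options can be sketched in plain JavaScript. `normalize` is a hypothetical helper that mimics the idea, not the library's internal pipeline: drop blacklisted words, then apply the first matching suffix rule:

```javascript
// Illustrative sketch of what stemmer rules and a filter list do to a
// word stream (simplified; normalize is a hypothetical helper).
var stemmer = { "ational": "ate", "tional": "tion", "enci": "ence", "ing": "" };
var filter = ["in", "into", "is", "isn't", "it", "it's"];

function normalize(words) {

    return words

        // drop blacklisted words entirely:
        .filter(function(word){ return filter.indexOf(word) === -1; })

        // apply the first matching suffix rule:
        .map(function(word){

            for (var suffix in stemmer) {

                if (word.endsWith(suffix)) {

                    return word.slice(0, word.length - suffix.length) + stemmer[suffix];
                }
            }

            return word;
        });
}

normalize(["running", "is", "educational"]); // → ["runn", "educate"]
```

"is" is filtered out, "running" loses its "ing" suffix, and "educational" is rewritten via the "ational" → "ate" rule, so differently inflected forms land on the same index entry.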

Or assign stemmer/filters globally to a language:

Stemmers are passed as an object (key-value pairs), filters as an array.

FlexSearch.registerLanguage("us", {

    stemmer: { /* ... */ },
    filter:  [ /* ... */ ]
});

Or use some pre-defined stemmer or filter of your preferred languages:

<html>
<head>
    <script src="js/flexsearch.min.js"></script>
    <script src="js/lang/en.min.js"></script>
    <script src="js/lang/de.min.js"></script>
</head>
...

Now you can assign built-in stemmer during creation/initialization:

var index_en = new FlexSearch({
    stemmer: "en", 
    filter: "en" 
});

var index_de = new FlexSearch({
    stemmer: "de",
    filter: [ /* custom */ ]
});

In Node.js you just have to require the language pack files to make them available:

require("flexsearch.js");
require("lang/en.js");
require("lang/de.js");

It is also possible to compile language packs into the build as follows:

node compile SUPPORT_LANG_EN=true SUPPORT_LANG_DE=true

Get info about an index

This feature is available in DEBUG mode.

index.info();

Returns information e.g.:

{
    "id": 0,
    "memory": 10000,
    "items": 500,
    "sequences": 3000,
    "matchers": 0,
    "chars": 3500,
    "cache": false,
    "matcher": 0,
    "worker": false,
    "threshold": 7,
    "depth": 3,
    "contextual": true                                 
}

Chaining

Simply chain methods like:

var index = FlexSearch.create()
                      .addMatcher({'â': 'a'})
                      .add(0, 'foo')
                      .add(1, 'bar');
index.remove(0).update(1, 'foo').add(2, 'foobar');

Enable Contextual Scoring

Create an index and just set the limit of relevance as "depth":

var index = new FlexSearch({

    encode: "icase",
    tokenize: "strict",
    threshold: 7,
    depth: 3
});

Only the tokenizer "strict" is currently supported by the contextual index.

The contextual index requires an additional amount of memory, depending on the depth.

Try to use the lowest depth and the highest threshold that fit your needs.

Enable Auto-Balanced Cache

Create index and just set a limit of cache entries:

var index = new FlexSearch({

    profile: "score",
    cache: 10000
});

When passing a number as a limit, the cache automatically balances stored entries according to their popularity.

When just using "true", the cache is unbounded and actually performs 2-3 times faster (because the balancer does not have to run).

WebWorker Sharding (Browser only)

Workers get their own dedicated memory and run in their own dedicated threads without blocking the UI while processing. Especially for larger indexes, web workers improve speed and available memory a lot. The FlexSearch index was tested with a 250 Mb text file including 10 million words.

When the index isn't big enough, it is faster to use no web worker.

Create index and just set the count of parallel threads:

var index = new FlexSearch({

    encode: "icase",
    tokenize: "full",
    async: true,
    worker: 4
});

Adding items to worker index as usual (async enabled):

index.add(10025, "John Doe");

Perform search and simply pass in callback like:

index.search("John Doe", function(results){

    // do something with array of results
});

Or use promises accordingly:

index.search("John Doe").then(function(results){

    // do something with array of results
});

Options

FlexSearch is highly customizable. Making use of the right options can really improve your results as well as memory economy and query time.

| Option | Values | Description |
|---|---|---|
| profile | "memory"<br>"speed"<br>"match"<br>"score"<br>"balance"<br>"fastest" | The configuration profile. Choose your preference. |
| tokenize | "strict"<br>"forward"<br>"reverse"<br>"full"<br>function() | The indexing mode (tokenizer). Choose one of the built-ins or pass a custom tokenizer function. |
| encode | false<br>"icase"<br>"simple"<br>"advanced"<br>"extra"<br>"balance"<br>function() | The encoding type. Choose one of the built-ins or pass a custom encoding function. |
| cache | false<br>true<br>{number} | Enable/disable and/or set the capacity of cached entries. When passing a number as a limit, the cache automatically balances stored entries according to their popularity. Note: When just using "true", the cache has no limits and is actually 2-3 times faster (because the balancer does not have to run). |
| async | true<br>false | Enable/disable asynchronous processing. Each job will be queued for non-blocking processing. Recommended when using WebWorkers. |
| worker | false<br>{number} | Enable/disable and set the count of running worker threads. |
| depth | false<br>{number:0-9} | Enable/disable contextual indexing and set the contextual distance of relevance. |
| threshold | false<br>{number:0-9} | Enable/disable the threshold of minimum relevance all results should have. Note: It is also possible to set a lower threshold for indexing and pass a higher value when calling index.search(options). |
| stemmer | false<br>{string}<br>{function} | Disable, or pass in a language shorthand flag (ISO-3166) or a custom object. |
| filter | false<br>{string}<br>{function} | Disable, or pass in a language shorthand flag (ISO-3166) or a custom array. |

Tokenizer

The tokenizer affects the required memory as well as query time and the flexibility of partial matches. Try to choose the uppermost of these tokenizers that fits your needs:

| Option | Description | Example | Memory Factor (n = length of word) |
|---|---|---|---|
| "strict" | index whole words | foobar | * 1 |
| "forward" | incrementally index words in forward direction | foobar | * n |
| "reverse" | incrementally index words in both directions | foobar | * 2n - 1 |
| "full" | index every possible combination | foobar | * n * (n - 1) |
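The memory factors can be checked with a small helper (a sketch following the table above; `indexEntries` is an invented name, not a library function):

```javascript
// Number of index entries each tokenizer produces for one word of length n,
// following the memory factors in the table above.
function indexEntries(n, mode) {

    switch (mode) {

        case "strict":  return 1;           // the whole word only
        case "forward": return n;           // f, fo, foo, ...
        case "reverse": return 2 * n - 1;   // forward plus backward partials
        case "full":    return n * (n - 1); // every possible combination
    }
}

indexEntries(6, "full"); // "foobar" (n = 6) → 30 entries
```

For a 6-letter word, "strict" stores 1 entry, "forward" 6, "reverse" 11 and "full" 30, which is why the more tolerant tokenizers cost so much more memory.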

Phonetic Encoding

Encoding affects the required memory as well as query time and phonetic matches. Try to choose the uppermost of these encoders that fits your needs, or pass in a custom encoder:

| Option | Description | False Positives | Compression |
|---|---|---|---|
| false | Turn off encoding | no | no |
| "icase" (default) | Case in-sensitive encoding | no | no |
| "simple" | Phonetic normalizations | no | ~ 7% |
| "advanced" | Phonetic normalizations + Literal transformations | no | ~ 35% |
| "extra" | Phonetic normalizations + Soundex transformations | yes | ~ 60% |
| function() | Pass custom encoding: function(string):string | | |

Comparison (Matching)

Reference String: "Björn-Phillipp Mayer"

| Query | icase | simple | advanced | extra |
|---|---|---|---|---|
| björn | yes | yes | yes | yes |
| björ | yes | yes | yes | yes |
| bjorn | no | yes | yes | yes |
| bjoern | no | no | yes | yes |
| philipp | no | no | yes | yes |
| filip | no | no | yes | yes |
| björnphillip | no | yes | yes | yes |
| meier | no | no | yes | yes |
| björn meier | no | no | yes | yes |
| meier fhilip | no | no | yes | yes |
| byorn mair | no | no | no | yes |
| (false positives) | no | no | no | yes |

Memory Usage

The required memory for the index depends on several options:

| Encoding | Memory usage of every ~ 100,000 indexed words |
|---|---|
| false | 260 kb |
| "icase" (default) | 210 kb |
| "simple" | 190 kb |
| "advanced" | 150 kb |
| "extra" | 90 kb |

| Mode | Multiplied with: (n = average length of indexed words) |
|---|---|
| "strict" | * 1 |
| "forward" | * n |
| "reverse" | * 2n - 1 |
| "full" | * n * (n - 1) |

Contextual Index: multiply the sum above with * (depth * 2 + 1)
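Combining those factors gives a rough back-of-the-envelope calculator. This is a sketch based on the approximate figures above; `estimateKb` and `modeFactor` are invented helpers, not library functions:

```javascript
// Rough index-size estimate per ~100,000 indexed words, based on the
// approximate per-encoding figures and tokenizer multipliers above.
var encodingKb = { "false": 260, "icase": 210, "simple": 190, "advanced": 150, "extra": 90 };

function modeFactor(mode, n) {

    return { "strict": 1, "forward": n, "reverse": 2 * n - 1, "full": n * (n - 1) }[mode];
}

function estimateKb(encoding, mode, avgWordLength, depth) {

    var kb = encodingKb[encoding] * modeFactor(mode, avgWordLength);

    // the contextual index multiplies the sum:
    if (depth) kb *= depth * 2 + 1;

    return kb;
}

estimateKb("icase", "strict", 6);    // → 210 kb per ~100,000 words
estimateKb("icase", "strict", 6, 3); // → 1470 kb (210 * 7)
```

The depth-3 contextual index multiplies the base estimate by 7, which illustrates why the documentation recommends the lowest depth that fits your needs.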

Compare Memory Consumption

The book "Gulliver's Travels" (Jonathan Swift, 1726) was used for this test.



Presets

You can pass a preset during creation/initialization. They represent the following settings:

"default": Standard profile

{
    encode: "icase",
    tokenize: "forward"
}

"memory": Memory-optimized profile

{
    encode: "extra",
    tokenize: "strict",
    threshold: 7
}

"speed": Speed-optimized profile

{
    encode: "icase",
    tokenize: "strict",
    threshold: 7,
    depth: 2
}

"match": Matching-tolerant profile

{
    encode: "extra",
    tokenize: "full"
}

"score": Relevance-optimized profile

{
    encode: "extra",
    tokenize: "strict",
    threshold: 5,
    depth: 5
}

"balance": Most-balanced profile

{
    encode: "balance",
    tokenize: "strict",
    threshold: 6,
    depth: 3
}

"fastest": Absolute fastest profile

{
    encode: "icase",
    threshold: 9,
    depth: 1
}

Compare these presets:

Best Practices

Split Complexity

Whenever you can, try to divide content by category and add each to its own index, e.g.:

var feeds_2017 = new FlexSearch();
var feeds_2018 = new FlexSearch();
var feeds_2019 = new FlexSearch();

Use numeric IDs

It is recommended to use numeric id values as references when adding content to the index. The byte length of the passed ids influences the memory consumption significantly. If this is not possible you should consider using an index table to map the ids to indexes; this becomes especially important when using contextual indexes on a large amount of content.

e.g. instead of this:

index.add("fdf12cad-8779-47ab-b614-4dbbd649178b", "content");

you should probably use this:

var index_table = {
    "fdf12cad-8779-47ab-b614-4dbbd649178b": 0,
    "48b3041c-a243-4a52-b1ed-225041847366": 1,
    "7236c8b5-86e1-451a-842f-d9aba9642e4d": 2,
    // ....
};

index.add(index_table["fdf12cad-8779-47ab-b614-4dbbd649178b"], "content");

It is planned to provide a built-in feature which should replace this workaround.
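Until then, the mapping can be automated with a tiny helper (a hypothetical sketch; `createIdMapper` is not part of the FlexSearch API):

```javascript
// Hypothetical helper implementing the index-table workaround above: maps
// arbitrary string ids to small auto-incremented numeric ids.
function createIdMapper() {

    var table = {};
    var next = 0;

    return function(key){

        if (!(key in table)) table[key] = next++;

        return table[key];
    };
}

var idOf = createIdMapper();

idOf("fdf12cad-8779-47ab-b614-4dbbd649178b"); // → 0
idOf("48b3041c-a243-4a52-b1ed-225041847366"); // → 1

// index.add(idOf("fdf12cad-8779-47ab-b614-4dbbd649178b"), "content");
```

Repeated lookups of the same key return the same numeric id, so the mapper can be used inline in every `index.add` call.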

Export/Import Index

index.export() returns a serialized dump as a string.

index.import(string) takes a serialized dump as a string and loads it into the index.

Assuming you have one or several indexes:

var feeds_2017 = new FlexSearch();
var feeds_2018 = new FlexSearch();
var feeds_2019 = new FlexSearch();

Export indexes, e.g. to the local storage:

localStorage.setItem("feeds_2017", feeds_2017.export());
localStorage.setItem("feeds_2018", feeds_2018.export());
localStorage.setItem("feeds_2019", feeds_2019.export());

Import indexes, e.g. from the local storage:

feeds_2017.import(localStorage.getItem("feeds_2017"));
feeds_2018.import(localStorage.getItem("feeds_2018"));
feeds_2019.import(localStorage.getItem("feeds_2019"));

Debug

Do not use DEBUG in production builds.

If you get issues, you can temporarily set the DEBUG flag to true at the top of flexsearch.js:

DEBUG = true;

This enables console logging of several processes. Just open the browser's console to make this information visible.

Profiler Stats

Do not use PROFILER in production builds.

To collect some performance statistics of your indexes you need to temporarily set the PROFILER flag to true at the top of flexsearch.js:

PROFILER = true;

This enables profiling of several processes.

An array of all profiles is available on:

window.stats;

You can also just open the browser's console and enter this line to get stats.

The index of the array corresponds to the index.id.

Get stats from a specific index:

index.stats;

The returned stats payload is divided into several categories. Each of these categories provides its own statistic values.

Profiler Stats Properties

| Property | Description |
|---|---|
| time | The sum of time (ms) the process takes (lower is better) |
| count | How often the process was called |
| ops | Average operations per second (higher is better) |
| nano | Average cost (ns) per operation/call (lower is better) |

Custom Builds

Full Build:

npm run build

Compact Build:

npm run build-compact

Light Build:

npm run build-light

Build Language Packs:

npm run build-lang

Custom Build:

npm run build-custom SUPPORT_WORKER=true SUPPORT_ASYNC=true

Alternatively you can also use:

node compile SUPPORT_WORKER=true

The custom build will be saved to flexsearch.custom.xxxxx.js (the "xxxxx" is a hash based on the used build flags).

Supported Build Flags

| Flag | Values |
|---|---|
| DEBUG | true, false |
| PROFILER | true, false |
| SUPPORT_ENCODER (built-in encoders) | true, false |
| SUPPORT_WORKER | true, false |
| SUPPORT_CACHE | true, false |
| SUPPORT_ASYNC | true, false |
| SUPPORT_PRESETS | true, false |

Language Flags (includes stemmer and filter):

| Flag | Values |
|---|---|
| SUPPORT_LANG_EN | true, false |
| SUPPORT_LANG_DE | true, false |

Compiler Flags:

| Flag | Values |
|---|---|
| LANGUAGE_OUT | ECMASCRIPT3, ECMASCRIPT5, ECMASCRIPT5_STRICT, ECMASCRIPT6, ECMASCRIPT6_STRICT, ECMASCRIPT_2015, ECMASCRIPT_2017, STABLE |

Copyright 2019 Nextapps GmbH
Released under the Apache 2.0 License

