What if grep supported the functionality of a proper search engine like Elasticsearch without a need to install any servers or index the files before searching?

lmgrep is installed as just one executable file without any dependencies, provides a command-line interface, starts up instantly, and works on macOS, Linux, and, yes, even Windows. Have you ever wished that grep supported tokenization, stemming, etc., so that you don't have to write wildcard regular expressions all the time? I've shared that wish too, and one nice day I tried to scratch that itch by exposing the Lucene query syntax as a CLI utility. Give it a try and let me know how it goes.

I'm perfectly aware that comparing Lucene and grep is like comparing apples to oranges. However, I think that lmgrep is best compared with the very tool that inspired it, namely grep. Anyway, what does grep do? grep reads a line from stdin, examines the line to see if it should be forwarded to stdout, and repeats until stdin is exhausted [1]. lmgrep tries to mimic exactly that functionality. Of course, grep has many more options, but that is the essence of the tool. Several notable advantages of lmgrep over grep:

- Lucene query syntax is better suited for full-text search.
- Boolean operators make it possible to construct complex, well-designed queries.
- Text analysis can be customized to the language of the documents.
- The text analysis pipeline is flexible and includes lowercasing, ASCII-folding, stemming, etc.
- Regular expressions can be combined with other Lucene query components.

Lucene is a Java library that provides indexing and search features. Lucene has been in development for more than 20 years and it is the library that powers many search applications. Also, many developers are already familiar with the Lucene query syntax and know how to leverage it to solve complicated information retrieval problems.

However powerful Lucene is, it is not well suited for CLI applications. The main problem is the startup time of the JVM. To reduce the startup time I've compiled lmgrep with the native-image tool provided by GraalVM. This way, the startup time is around 0.01s on Linux, macOS, and Windows.

By default, lmgrep expects two parameters: a search query and a GLOB pattern (similar to regexp) to find the files to execute lmgrep on. I assume that the dear reader doesn't want to be tortured by an explanation of how file names are matched with GLOB, so I'll skip it. Instead, I'll focus on explaining how the search works within a file.

lmgrep creates a Lucene Monitor (Monitor) object from the provided search query. Each line of text is passed to the Monitor for searching. The Monitor then creates an in-memory Lucene index with a single document created out of the line of text. Then the Monitor runs the search query on that in-memory index in the good ol' Lucene way [3]. lmgrep takes the hits, formats them, and sends the results to STDOUT. That is how lmgrep does the full-text search.

The overall searching approach is similar to that of the Percolator in Elasticsearch. lmgrep just limits the number of stored search queries to one and treats every text line as a document. A cool thing compared with the Percolator is that lmgrep provides exact offsets of the matched terms, while Elasticsearch does not expose offsets when highlighting.

The described procedure seems to be somewhat inefficient.
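
To make the idea of treating every line as a document more concrete, here is a minimal sketch in plain Python. It is not lmgrep's code and it does not use Lucene: the function names (`normalize`, `index_line`, `match_line`, `search`) are made up for illustration, the "analyzer" is a toy that only lowercases, ASCII-folds, and strips a trailing "s", and the query is assumed to be a plain list of words that must all be present. It only shows the shape of the approach: one stored query, a throwaway single-document index per line, and exact offsets for every matched term.

```python
import re
import unicodedata

TOKEN = re.compile(r"\w+")


def normalize(token):
    """Toy analyzer step: lowercase, ASCII-fold, and strip a plural 's'."""
    folded = unicodedata.normalize("NFKD", token.lower())
    folded = "".join(ch for ch in folded if not unicodedata.combining(ch))
    # Crude stand-in for stemming; a real stemmer is far more careful.
    if len(folded) > 3 and folded.endswith("s"):
        folded = folded[:-1]
    return folded


def index_line(line):
    """Build a throwaway single-document index: term -> [(start, end), ...]."""
    index = {}
    for match in TOKEN.finditer(line):
        index.setdefault(normalize(match.group()), []).append(match.span())
    return index


def match_line(query_terms, line):
    """Return sorted offsets of matched terms, or None if any term is missing."""
    index = index_line(line)
    if not all(term in index for term in query_terms):
        return None
    return sorted({span for term in query_terms for span in index[term]})


def search(query, lines):
    """Run one fixed query against every line, yielding (line_no, line, offsets)."""
    query_terms = [normalize(t) for t in TOKEN.findall(query)]
    for line_no, line in enumerate(lines, start=1):
        offsets = match_line(query_terms, line)
        if offsets is not None:
            yield line_no, line, offsets
```

Calling `search("cafe open", lines)` on an iterable of lines yields one tuple per matching line, with offsets that point into the original, unnormalized text. That is the same kind of information lmgrep uses to format its hits, although the real tool gets it from Lucene rather than from a hand-rolled index like this one.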