Ask HN: Books about full text search?

 1 year ago
source link: https://news.ycombinator.com/item?id=33734259

54 points by sopromo 3 hours ago | 14 comments
I would love to learn more about FTS at a very low level, and I'm looking for books on the topic. Any good suggestions?

At a general audience level, "Index" is on my list to read. It covers the invention of the index up to digital search engines. https://www.nytimes.com/2022/02/09/books/review-index-histor...

"Introduction to Information Retrieval" is a textbook which is available online https://nlp.stanford.edu/IR-book/ Here's a review: http://glinden.blogspot.com/2009/02/book-review-introduction...

Another textbook which IMHO is a bit lower level is "Information Retrieval: Implementing and Evaluating Search Engines". The book website is down for me right now, but you can find it on Amazon here: https://www.amazon.com/Information-Retrieval-Implementing-Ev...

Another commenter linked to "Relevant Search", which is great if you want to learn how to effectively use a search engine to improve relevance (as opposed to how to implement a search engine). It's old, but another book in that vein that was really helpful for me earlier in my career is Lucene in Action: https://www.amazon.com/Lucene-Action-Second-Covers-Apache/dp...

Three reference textbooks are available openly:

* Introduction to Information Retrieval, http://informationretrieval.org/

* Information Retrieval in Practice, http://www.search-engines-book.com/

* Entity-Oriented Search, https://eos-book.org/

Modern Information Retrieval is also a classic reference. Not openly available but some contents are (were?) available online. Their site seems to be down but the Internet Archive has a copy.

Additional resources here:

* https://nlp.stanford.edu/IR-book/information-retrieval.html

* http://web.archive.org/web/20220708135205/http://grupoweb.up...

Not a book but Hellerstein’s CS186 from 2015 starting with Lecture 17 gave me a basic understanding (I think).

Playlist https://youtube.com/playlist?list=PLhMnuBfGeCDPtyC9kUf_hG_Qw...

Also from that lecture series, the low level is always IO. One disk read tends to dwarf n^2 in-memory algorithms.

And IO is all about tuning caches and hardware for the specific structural relationships in the data, the way in which it is accessed, and the hardware everything runs on.

Good luck.

Take a look at my post “Lucene: The Good Parts”—

https://blog.parse.ly/lucene/

The book mentioned there is Lucene in Action.

And then this YouTube presentation by a Lucene/Elasticsearch committer will give you a nice overview of some related algorithms—

https://youtu.be/eQ-rXP-D80U

Came here to recommend Managing Gigabytes as well. People these days are managing far more than gigabytes but the fundamental ideas remain useful.

Lucene in Action is a good introduction to Lucene, which can be helpful for learning Elasticsearch (the most used FTS these days).

I had the exact same impulse a couple of weeks ago and I currently have a paper copy of "Introduction to Information Retrieval" open on my desk.

From the perspective of someone that's only just decided to learn about full text search a bit more formally, it's very understandable, well paced, and packed with exercises to enhance understanding. I'm really enjoying it.

I'll definitely be bookmarking this thread!

Manning also have a book on Lucene, the library that powers Solr and Elasticsearch. IIRC the book covered how Lucene actually works under the hood and would therefore act as a good reference on the subject in general.

Taming Text is about building a question-answering system. It came out about the time Watson came online. It's less a plan than a cookbook of experiments using Apache products like Solr and OpenNLP, but it's a great tutorial on how question answering works.