
Ask HN: How do I train a custom LLM/ChatGPT on my own documents in Dec 2023?

source link: https://news.ycombinator.com/item?id=38759877

151 points by divan | 41 comments
There is a 5-month-old thread [1] on this, but it might already be outdated.

What is the best approach in Dec 2023 for feeding a custom set of documents to an LLM and getting non-hallucinating, decent results?

UPD: The question is generally about how to "teach" an LLM to answer questions using your set of documents (not necessarily training your own model, so approaches like RAG count).

[1] https://news.ycombinator.com/item?id=36832572

You don't train on documents. There are many startups claiming that but they are deliberately using a misleading term because they know that's what people are searching for.

You still do RAG. LlamaIndex is still the best option that I know of. Most of the startups that have working products are likely using LlamaIndex. All of the ones that say they are training on documents are actually using RAG.

Test it out. If it really and truly doesn't work, search for a script that creates question-and-answer pairs automatically with GPT-4, then try using that for QLoRA. I have never heard of anyone successfully using that for a private document knowledge base, though; only for skills like math, reasoning, Python, etc. I think the issue is that you need a LOT of data, and it needs to repeat the concepts, or any facts you need it to learn, many, many times in different supporting ways.

What absolutely does not work is trying to just feed a set of documents into fine-tuning. I have personally proven that dozens of times because I had a client who was determined to do it. He had been misled.

What it will do is learn the patterns that are in those documents.
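
To make the RAG suggestion concrete, the minimal LlamaIndex pipeline is only a few lines. A sketch, assuming the pre-0.10 import paths, an OPENAI_API_KEY in the environment, and documents sitting in a ./data folder:

```python
# Minimal RAG sketch with LlamaIndex (pre-0.10 import paths; assumes
# OPENAI_API_KEY is set and your documents live in ./data).
from llama_index import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()      # parse files into text
index = VectorStoreIndex.from_documents(documents)         # embed chunks, build a vector index

query_engine = index.as_query_engine(similarity_top_k=4)   # retrieve 4 chunks per question
response = query_engine.query("What does the contract say about termination?")
print(response)
```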

We just held a workshop about this a few weeks ago: https://red.ht/llmappdev We created a simple chatbot using local models with Ollama (llama.cpp), LlamaIndex and Streamlit. Have a look at the streamlit folder; it's super easy.

I used this simple example to teach about RAG, the importance of the system prompt and prompt injection. The notebook folder has a few more examples, local models can even do natural language SQL querying now.
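
For anyone wanting the fully local variant of that stack, swapping in Ollama looks roughly like this. The import paths are from llama_index 0.9.x and may differ in newer releases, so treat it as a sketch:

```python
# Sketch: the same kind of RAG pipeline, pointed at a local Ollama model instead
# of OpenAI. Import paths are from llama_index 0.9.x and may differ in newer releases.
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms import Ollama

service_context = ServiceContext.from_defaults(
    llm=Ollama(model="mistral"),   # assumes `ollama pull mistral` has already been run
    embed_model="local",           # small local HuggingFace embedding model, no API key needed
)

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
print(index.as_query_engine().query("Summarize the workshop notes."))
```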

This one seems like a good summary

Retrieval-Augmented Generation for Large Language Models: A Survey

https://arxiv.org/abs/2312.10997

The images in this post are also good for a high-level look:

https://twitter.com/dotey/status/1738400607336120573/photo/2

From various posts I have seen, people claim that phi-2 is a good model to start from.

If you just want to do embeddings, there are various tutorials to use pgvector for that.
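
If you go the pgvector route, the core is just a table with a vector column plus a nearest-neighbour query. A rough sketch (the table and column names are made up for illustration):

```python
# Sketch: embedding search with Postgres + pgvector. Assumes the extension is
# installed (CREATE EXTENSION vector) and OPENAI_API_KEY is set; the table and
# column names are purely illustrative.
import psycopg2
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> str:
    # text-embedding-ada-002 returns 1536-dimensional vectors
    vec = client.embeddings.create(model="text-embedding-ada-002", input=text).data[0].embedding
    return "[" + ",".join(str(x) for x in vec) + "]"   # pgvector literal format

conn = psycopg2.connect("dbname=docs")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS chunks (id serial PRIMARY KEY, body text, embedding vector(1536))")

# Index a chunk of a document
chunk = "Termination requires 30 days written notice..."
cur.execute("INSERT INTO chunks (body, embedding) VALUES (%s, %s::vector)", (chunk, embed(chunk)))
conn.commit()

# Query: <-> is L2 distance (pgvector also has <=> for cosine); keep the 10 closest chunks
cur.execute("SELECT body FROM chunks ORDER BY embedding <-> %s::vector LIMIT 10",
            (embed("what is the notice period for termination?"),))
print([row[0] for row in cur.fetchall()])
```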

Searching for "retrieval-augmented generation" or "RAG + LLM" will turn up more results.

To sing the praises of Bedrock again, it does have continuous pre-training as well as RAG “knowledge bases”. The former is based on JSON fragments and the RAG stuff is PDFs and other document formats.

With regards to its efficacy, I haven’t gone to production with it yet but I was reasonably impressed.

I uploaded 100 legal case documents to Bedrock via Claude and could push it pretty hard asking about the various cases and for situations across the knowledge base.

It did feel like it broke down and got confused at a certain point of complexity of questioning, but I still think it’s already useful as a “copilot” or search engine and surely it will only improve over time.

I forgot about the continuous pre-training thing. How long did it take and how much did it cost on Bedrock?

I had tried to suggest continuous pre-training to my client, but it seemed expensive, and when I mentioned that he lost interest and just kept wanting me to do fine-tuning.

Also to clarify, did you do the continuous pre-training or RAG? And did you compare the efficacy of one or the other or both?

I used the RAG knowledge bases for most of my testing described above.

I got a toy demo up and running with continuous pre-training but haven’t evaluated it unfortunately.

Are there public examples of working products using RAG, compared with fine-tuning or training from scratch?

The OpenAI Assistants API is an implementation of a RAG pipeline. It performs RAG both on any documents you upload and on any conversation you have with it that exceeds the context window.
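
For reference, the flow against the Assistants beta looks roughly like this; the `retrieval` tool name and the beta namespaces may change, so take it as a sketch rather than gospel:

```python
# Sketch of the Assistants API retrieval flow (late-2023 beta); endpoint and
# tool names were still in flux, so check the current docs.
import time
from openai import OpenAI

client = OpenAI()

doc = client.files.create(file=open("handbook.pdf", "rb"), purpose="assistants")
assistant = client.beta.assistants.create(
    model="gpt-4-1106-preview",
    instructions="Answer questions using only the uploaded documents.",
    tools=[{"type": "retrieval"}],          # built-in RAG over uploaded files
    file_ids=[doc.id],
)

thread = client.beta.threads.create()
client.beta.threads.messages.create(thread_id=thread.id, role="user",
                                    content="What is the leave policy?")
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)

while run.status in ("queued", "in_progress"):          # poll until the run finishes
    time.sleep(1)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

for message in client.beta.threads.messages.list(thread_id=thread.id):
    print(message.role, message.content[0].text.value)
```
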
Another question: which one is preferred for RAG, LlamaIndex or LangChain? Thanks in advance for your insights.

You basically don't use LangChain for anything besides 30-minute demos that you copied from someone else's GitHub. It has a completely spaghettified API, is not performant, and forces you into excessive mental contortions to reason about otherwise simple tasks.

LlamaIndex is pretty good.

LlamaIndex is mainly focused on RAG. LangChain does a ton of other stuff too. I'd focus on LlamaIndex first.

Well said. The problem is, there are way too many alternatives. Any idea how LlamaIndex's ingestion engine compares to unstructured.io (which is used in LangChain)?

PrivateGPT is one of the better-known examples, but most people are not aware that GPT-4 Assistants handle RAG natively now: https://platform.openai.com/docs/assistants/overview

AWS Bedrock is fairly easy. You can do it in 5 or 6 clicks.

You have to upload your documents to S3 and create a “Knowledge Base”, then sync your documents into a vector database like OpenSearch or Pinecone. You are then good to go via their playground or the AWS API.

I made a video here describing the process, check around 14 minutes in:

https://ensembleanalytics.io/blog/introducing-bedrock-knowle...

Bedrock is a decent product I think. All of the models in one place (apart from the big dogs from OpenAI) and a common API across them.
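
Once the knowledge base is synced, querying it from code is a single call. Something like the following, where the knowledge base ID and model ARN are placeholders and the boto3 API is new enough that you should double-check its current shape:

```python
# Sketch: querying a Bedrock knowledge base with boto3's RetrieveAndGenerate API.
# The knowledge base ID and model ARN below are placeholders.
import boto3

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = client.retrieve_and_generate(
    input={"text": "What were the key findings across the uploaded cases?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KBID12345",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-v2",
        },
    },
)
print(response["output"]["text"])
```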

Is there a limit? Could I create a knowledge base with 10,000 documents? 100k? 1M?

The documents are encoded as vectors and stored in a database, so I suspect it would be effectively unlimited. You would just pay for storage and compute.

AWS OpenSearch has fairly good integration so you could look up costs for that. It’s not the cheapest AWS service to run and not exactly serverless as you pay by the hour.

I’m sorry, I don’t understand those limits. It uses a lot of unfamiliar terms like “batch inference” and “modality”. I just want a nice UI that I can give my hard drive to and then ask it questions.

What is your use case? If you want to search your documents and get back relevant info while avoiding hallucination, you might avoid the text-generation step altogether.

Instead you can extract text embeddings from your documents, put them in a vector DB, and then you have a super-search. You convert your search query to an embedding, search the DB, and keep, e.g., the 10 closest matches.
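
A bare-bones version of that idea, without any vector DB at all, just to show the shape of it (the sentence-transformers model name is only an example):

```python
# Sketch: plain embedding search, no generation step. sentence-transformers is
# one option; the model name is a common default, not a recommendation.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Refunds are processed within 30 days of the return being received.",
    "Standard shipping takes 3-5 business days within the EU.",
]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)            # one vector per chunk

query_vec = model.encode(["how long do refunds take?"], normalize_embeddings=True)[0]
scores = chunk_vecs @ query_vec                                         # cosine similarity (unit-length vectors)

for i in np.argsort(scores)[::-1][:10]:                                 # e.g. the 10 closest matches
    print(round(float(scores[i]), 3), chunks[i])
```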

I haven't personally tried this for anything serious yet, but to get the thread started:

Cheshire Cat [0] looks promising. It's a framework for building AI assistants: you provide it with documents that it stores as "memories" which can be retrieved later. I'm not sure how well it works yet, but it has an active community on Discord and seems to be developing rapidly.

The main perk over the cloud options is that you can point it at any language model, including fully local ones; my install points at my local Ollama running Mistral.

[0] https://github.com/cheshire-cat-ai/core

But that's not training, that's RAG. They seem to be using Qdrant, which I believe is a vector store.

They've updated the question to clarify that RAG counts, and as many have noted, properly "training" on a set of documents isn't really a thing.

So far the recommendations are mostly hosted, so here's a local one: https://github.com/weaviate/Verba

I'm very happy with its results, even though the system is still young and a little bit janky. You can use it with either the GPT API or your local models through LiteLLM. (I'm running Ollama + dolphin-mixtral.)

Slightly off topic, but is there recommended advice on how to tune/train not for document retrieval but for consistent JSON output with specific enums?

I.e. given a text, always return a certain set of fields, where for some keys there is a fixed set of possible enum values, etc. One-shot prompting does work, but I'm curious how others approach this if you have training data on hand.

There are many interesting tools that achieve this, like Outlines[0] and jsonformer[1]. I haven't tried them myself but they look very promising.

[0]: https://github.com/outlines-dev/outlines [1]: https://github.com/1rgs/jsonformer

You want grammars to restrict the output; search for "GBNF grammar". Combine that with a good prompt that includes an example. Also check out outlines.dev.

For OpenAI, use their functions schema mechanism.

Aside from that, take a look at llama.cpp grammars.
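
To make the OpenAI route concrete: a tool schema with `enum` fields looks roughly like this. The field names are made up for illustration, and unlike grammar-constrained decoding it is not a hard guarantee, though in practice the model sticks to the schema:

```python
# Sketch: enum-constrained JSON extraction via OpenAI function calling.
# The schema and field names here are illustrative only.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "classify_ticket",
        "description": "Extract structured fields from a support ticket",
        "parameters": {
            "type": "object",
            "properties": {
                "category": {"type": "string", "enum": ["billing", "bug", "feature_request"]},
                "priority": {"type": "string", "enum": ["low", "medium", "high"]},
                "summary": {"type": "string"},
            },
            "required": ["category", "priority", "summary"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[{"role": "user", "content": "The invoice page 500s every time I open it, please fix ASAP"}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "classify_ticket"}},  # force this tool
)
print(json.loads(resp.choices[0].message.tool_calls[0].function.arguments))
```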

A go-to method is to ingest different chunk sizes based on the document hierarchy and then use LangChain with a bunch of retrievers depending on the doc type.

Then create an index of the metadata of each doc, so that you can ask the RAG bot what it can answer questions about.

Another way to ensure it stays on-domain is to generate synthetic questions & check for similarity against user queries. There's a whole rabbit hole of query decomposition to avoid straying off topic as well.
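
As a rough illustration of the different-chunk-sizes idea (the sizes and doc types here are arbitrary):

```python
# Sketch: different chunk sizes per document type, using LangChain's
# RecursiveCharacterTextSplitter; sizes and the doc-type mapping are arbitrary examples.
from langchain.text_splitter import RecursiveCharacterTextSplitter

SPLITTERS = {
    "contract": RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200),  # long clauses
    "email":    RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50),    # short messages
}

def chunk(doc_text: str, doc_type: str) -> list[str]:
    return SPLITTERS[doc_type].split_text(doc_text)

chunks = chunk(open("msa.txt").read(), "contract")
print(len(chunks), chunks[0][:80])
```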

If you’re looking for something that is hosted for you: at Notion we launched a feature for this a few weeks ago, and it works quite well in my experience. RAG is one of the techniques used. https://www.notion.so/blog/introducing-q-and-a

GPT-4 Turbo has a 128K (~300 pages) context window, which probably handles a lot of use cases that might previously have needed extra training/refinement.

Easiest is the OpenAI Assistants API. Use the playground and it's a no-code experience.

Unstract - https://unstract.com/ They are a month away from launch (both open source and cloud). The team might be able to give you a quick demo on your specific requirements.

What are you trying to do more specifically? You can use https://docalysis.com/ for most document RAG tasks.

Train on your own documents, or analyze your own documents for answers? Very different things.

For the first (fine-tuning), follow “AI Jason” on YouTube. He has some great tutorials.

For the second (RAG or similar), fire up a cloud VM with GPUs or use Ollama locally and read through the LlamaIndex docs on how to build a RAG pipeline.

Would you kindly elaborate a little on the difference between training on your own documents vs analyzing your documents for answers?

The word "training" implies creating a new model by fine-tuning an existing model on top of new documents.

As several other comments in this thread have already indicated: this is almost always the wrong direction. Which is confusing because it's the direction everyone always assumes they should go in at first.

The approach that does work is surprisingly simple: take the user's question, search for snippets of your documents that appear to be about that question, then paste all of those snippets into the prompt along with the user's question and see what answer you get.

This is known as RAG: Retrieval-Augmented Generation. It's a very powerful approach.
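
In code, that loop is only a few lines once you have some way to rank snippets. A sketch with a placeholder keyword ranking standing in for whatever search you actually use:

```python
# Sketch of the retrieve-then-prompt loop described above. The corpus and the
# naive keyword ranking are placeholders; swap in embedding search for real use.
from openai import OpenAI

client = OpenAI()

DOC_CHUNKS = [
    "Refunds are processed within 30 days of the return being received.",
    "Standard shipping takes 3-5 business days within the EU.",
]

def top_snippets(question: str, k: int = 5) -> list[str]:
    # rank chunks by crude word overlap with the question
    q = set(question.lower().split())
    return sorted(DOC_CHUNKS, key=lambda c: -len(q & set(c.lower().split())))[:k]

def answer(question: str) -> str:
    context = "\n\n".join(top_snippets(question))
    prompt = (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    resp = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(answer("What does the policy say about refunds?"))
```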
