
Collaboration & evaluation for LLM apps

source link: https://changelog.com/practicalai/253

Transcript

šŸ“ Edit Transcript

Changelog


Welcome to another episode of Practical AI. This is Daniel Whitenack. I am CEO and founder at Prediction Guard, and Iā€™m really excited today to be joined by Dr. Raza Habib, who is CEO and co-founder at Humanloop. How are you doing, Raza?

Hi, Daniel. Itā€™s a pleasure to be here. Iā€™m doing very well. Yeah, thanks for having me on.

Yeah, I'm super-excited to talk with you. I'm mainly excited to talk with you selfishly, because I see the amazing things that Humanloop is doing, and the really critical problems that you're thinking about... And every day of my life it's like "How am I managing prompts? How do my prompts do in this next model that I'm upgrading to? And how am I constructing workflows around using LLMs?", which definitely seems to be the main thrust of some of the things that you're thinking about at Humanloop. Before we get into the specifics of those things at Humanloop, would you mind setting the context for us in terms of workflows around these LLMs, and collaboration on teams? How did you start thinking about this problem, and what does that mean in reality for those working in industry right now, maybe more generally than at Humanloop?

Yeah, absolutely. So I guess on the question of how I came to be working on this problem, it was really something that my co-founders, Peter and Jordan, had been working on for a very long time, actually. Previously, Peter and I did PhDā€™s together around this area, and then when we started the company, it was a little while after transfer learning had started to work in NLP for the first time, and we were mostly helping companies fine-tune smaller models. But then sometime midway through 2022 we became absolutely convinced that the rate of progress for these larger models was so high, it was going to start to eclipse essentially everything else in terms of performanceā€¦ But more importantly, in terms of usability. It was the first time that instead of having to hand-annotate a new dataset for every new problem, there was this new way of customizing AI models, which was that you could write instructions in natural language, and have a reasonable expectation that the model would then do that thing. And that was unthinkable at the start of 2022, I would say, or maybe a little bit earlier.

So that was really what made us want to go work on this, because we realized that the potential impact of NLP was already there, but the accessibility had been expanded so far, and the capabilities of the models had increased so much that there was a particular moment to go do this. But at the same time, it introduces a whole bunch of new challenges, right? So I guess historically, the people who were building AI systems were machine learning experts; the way that you would do it is you would collect, annotate the data, youā€™d fine-tune a custom modelā€¦ It was typically being used for like one specific task at a time. There was a correct answer, so it was easy to evaluateā€¦ And with LLMs, the power also brings new challenges. So the way that you customize these models is by writing these natural language instructions, which are prompts, and typically that means that the people involved donā€™t need to be as technical. And usually, we see actually that the best people to do prompt engineering tend to have domain expertise. So often, itā€™s a product manager or someone else within the company who is leading the prompt engineering effortsā€¦ But you also have this new artifact lying around, which is the prompt, and it has a similar impact to code on your end application. So it needs to be versioned, and managed, and treated with the same level of respect and rigor that you would treat normal code, but somehow you also need to have the right workflows and collaboration that lets the non-technical people work with the engineers on the product, or the less technical people.

And then the extra challenge that comes with it as well is that itā€™s very subjective to measure performance here. So in traditional code weā€™re used to writing unit tests, integration tests, regression testsā€¦ We know what good looks like and how to measure it. And even in traditional machine learning, thereā€™s a ground truth dataset, people calculate metricsā€¦ But once you go into generative AI, it tends to be harder to say what is the correct answer. And so when that becomes difficult, then measuring performance becomes hard; if measuring performance is hard, how do you know when you make changes if youā€™re going to cause regressions? Or all the different design choices you have in developing an app, how do you make those design choices if you donā€™t have good metrics of performance?

And so those are the problems that motivated what weā€™ve built. And really, Humanloop exists to solve both of these problems. So to help companies with the task of finding the best prompts, managing, versioning them, dealing with collaboration, but then also helping you do the evaluation thatā€™s needed to have confidence that the models are going to behave as you expect in production.

And as related to these things, maybe you can start with one that you would like to start with and go to the others, butā€¦ In terms of managing, versioning prompts, evaluating the performance of these models, dealing with regressions, as youā€™ve kind of seen people try to do this across probably a lot of different clients, a lot of different industries, how are people trying to manage this, in maybe some good ways and some bad ways?

[05:52] Yeah, I think we see a lot of companies go on a bit of a journey. So early on, people were excited about generative AI and LLMs; there's a lot of hype around it now, so some people in the company just go try things out. And often, they'll start off using one of the large, publicly-available models - OpenAI, or Anthropic, Cohere, one of these; they'll prototype in the playground environment that those providers have. They'll eyeball a few examples, maybe they'll grab a couple of libraries that support orchestration, and they'll put together a prototype. And the first version is fairly easy to build; it's very quick to get to the first wow moment. And then, as people start moving towards production and they start iterating from that maybe 80% good enough version to something that they really trust, they start to run into these problems of like "Oh, I've got 20 different versions of this prompt, and I'm storing it as a string in code... And actually, I want to be able to collaborate with a colleague on this, and so now we're sharing things either via screen sharing, or -" You know, we've had some serious companies you would have heard of who were sending their model configs to each other via Microsoft Teams. And obviously, you wouldn't send someone an important piece of code through Slack or Teams or something like this. But because the collaboration software isn't there to bridge this technical/non-technical divide, those are the kinds of problems we see.

And so at this point - typically a year ago - people would start building their own solution. So more often than not, this was when people would start building in-house tools. Increasingly, because there are companies like Humanloop around, that's usually when someone books a demo with us, and they say "Hey, we've reached this point where actually managing these artifacts has become cumbersome. We're worried about the quality of what we're producing. Do you have a solution to help?" And the way that Humanloop helps, at least on the prompt management side, is we have this interactive environment; it's a little bit like the OpenAI playground, or the Anthropic playground, but a lot more fully featured and designed for actual development. So it's collaborative, it has history built in, you can connect variables and datasets... And so it becomes like a development environment for your LLM application. You can prototype the application, interact with it, try out a few things... And then people progress from that development environment into production through evaluation and monitoring.

You mentioned this kind of in passing, and Iā€™d love to dig into it a little bit more. You mentioned kind of the types of people that are coming at the table in designing these systems, and oftentimes domain experts ā€“ you know, previously, in working as a data scientist, it was always kind of assumed ā€œOh, you need to talk to the domain experts.ā€ But itā€™s sort of like ā€“ at least for many years, it was like data scientists talk to the domain experts, and then go off and build their thing. The domain experts were not involved in the sort of building of the system. And even then, the data scientists were maybe building things that were kind of foreign to software engineers. And what Iā€™m hearing you say is you kind of got like these multiple layers; you have like domain experts, who might not be that technical, youā€™ve got maybe AI and data people, who are using this kind of unique set of tools, maybe even theyā€™re hosting their own modelsā€¦ And then youā€™ve got like product software engineering people; it seems like a much more complicated landscape of interactions. How have you seen this kind of play out in reality in terms of non-technical people and technical people, both working together on something that is ultimately something implemented in code and run as an application?

I actually think one of the most exciting things about LLMs and the progress in AI in general is that product managers and subject matter experts can for the first time be very directly involved in implementing these applications. So I think it's always been the case that the PM or someone like that is the person who distills the problem, speaks to the customers, produces the spec... But there's this translation step where they produce that PRD document, and then someone else goes off and implements it. And because we're now able to program at least some of the application in natural language, it's actually accessible to those people very directly. And it's worth maybe having a concrete example.

[10:02] So I use an AI notetaker for a lot of my sales calls. And it records the call, and then I get a summary afterwards. And the app actually allows you to choose a lot of different types of summary. So you can say, ā€œHey, Iā€™m a salesperson. I want a summary that will extract budget, and authority, and a timeline.ā€ Versus you can say ā€œOh, actually, I had a product interview, and I want a different type of summary.ā€ And if you think about developing that application, the person who has the knowledge thatā€™s needed to say what a good summary is, and to write the prompt for the model, itā€™s the person who has that domain expertise. Itā€™s not the software engineer.

But obviously, the prompt is only one piece of the application. If you've got a question answering system, there's usually retrieval as part of this; there may be other components... Usually, the LLM is a block in a wider application. So you obviously still need the software engineers around, because they're implementing the bulk of the application, but the product managers can be much more directly involved. And then, actually, we see increasingly less involvement from machine learning or AI experts, and fewer people are fine-tuning their own models. So for the majority of product teams we're seeing, there is an AI platform team that maybe facilitates setting things up, but the bulk of the work is led by the product managers, and then the engineers.

And one interesting example of this on the extreme end is one of our customers thatā€™s a very large ad tech company, they actually do not let their engineers edit the prompts. So they have a team of linguists who do prompt development. The linguists finalize the prompts, theyā€™re saved in a serialized format and they go to production, but itā€™s a one-way transfer. So the engineers canā€™t edit them, because theyā€™re not considered able to assess the actual outputs, even though they are responsible for the rest of the application.

Just thinking about how teams interact and whoā€™s doing what, it seems like the problems that youā€™ve laid out are, I think, very clear and worth solving, but itā€™s probably hard to think about ā€œWell, am I building a developer tool? Or am I building something that these non-technical people interact with? Or is it both?ā€ How did you think about that as you kind of entered into the stages of bringing Humanloop into existence?

I think it has to be bothā€¦ And the honest answer is it evolved kind of organically by going to customers, speaking to them about their problems, and trying to figure out what the best version of a solution looked like. So we didnā€™t set out to build a tool that needed to do both of these things, but I think the reality is, given the problems that people face, you do need both.

An analogy to think about might be something like Figma. Figma is somewhere where multiple different stakeholders come together to iterate on things, and to develop them, and provide feedbackā€¦ And I think you need something analogous to that for gen AIā€¦ Although itā€™s not an exact analogy, because we also need to attach the evaluation to this. So itā€™s almost by necessity that weā€™ve had to do thatā€¦ But I also think that itā€™s very exciting. And the reason I think itā€™s exciting is because it is expanding who can be involved in developing these applications.

Break: [13:05]

You mentioned how this environment of domain experts and technical teams coming together in a collaborative way opens up new possibilities for both collaboration and innovation. I'm wondering if at this point you could just lay out... We've talked about the problems, and we've talked about those involved and those who would use such a system or platform to enable these kinds of workflows... Could you describe a little bit more what Humanloop is specifically, in terms of both what it can do and how these different personas engage with the system?

Yeah. So I guess in terms of what it can do, concretely, it's firstly helping you with prompt iteration, versioning and management, and then with evaluation and monitoring. And the way it does that is through a web app: there's a web UI where people come in, and in that UI is an interactive playground-like environment where people try out different prompts, compare them side by side with different models, and try them with different inputs. When they find versions that they think are good, they save them. And then those can be deployed from that environment to production, or even to a development or staging environment. So that's the development stage.

And then once you have something thatā€™s developed, whatā€™s very typical is people then want to put in evaluation steps into place. So you can define gold standard test sets, and then you can define evaluators within Humanloop. And evaluators are ways of scoring the outputs of a model or a sequence of models, because oftentimes the LLM is part of a wider application.

And so the way that scoring works is there are the very traditional metrics that you would have in code for any machine learning system - precision, recall, ROUGE, BLEU, the kinds of scores that anyone from a machine learning background would already be familiar with. But what's new in the LLM space is also things that help when outputs are more subjective. So we have the ability to do model-as-judge, where you might actually prompt another LLM to score the output in some way... And this can be particularly useful when you're trying to measure things like hallucination. So a very common thing to do is to ask the model "Is the final answer contained within the retrieved context?" Or "Is it possible to infer the answer from the retrieved context?" And you can calculate those scores.
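To make the model-as-judge idea concrete, here is a minimal Python sketch of a faithfulness evaluator along the lines Raza describes. It assumes the OpenAI Python client; the judge prompt, the model name and the yes/no scoring scheme are illustrative choices, not Humanloop's actual evaluator API.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a generated answer for faithfulness.

Context:
{context}

Answer:
{answer}

Can the answer be inferred from the context alone? Reply with only "yes" or "no"."""

def faithfulness_score(context: str, answer: str, judge_model: str = "gpt-4o-mini") -> int:
    """Return 1 if the judge model considers the answer grounded in the context, else 0."""
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(context=context, answer=answer)}],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().lower()
    return 1 if verdict.startswith("yes") else 0
```

A score like this can then be averaged over a test set during development, or attached to production logs for monitoring.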

And then the final way is we also support human evaluation. So in some cases you really do want feedback either from an end user, or from an internal annotator as well. And so we allow you to gather that feedback, either from your live production application, and have it logged against your data, or you can queue internal annotation tasks for a team. And I can maybe tell you a little bit more about in-production feedback, because that's actually where we started.

Yeah, yeah. Go ahead, I would love to hear more.

Yeah, so I think that because itā€™s so subjective for a lot of the applications that people are building, whether it be email generation, question answering, a language learning app - there isnā€™t a ā€œcorrect answer.ā€ And so people want to measure how things are actually performing with their end users. And so Humanloop makes it very easy to capture different sources of end user feedback. And that might be explicit feedback, things like thumbs up/thumbs down votes that you see in ChatGPT, but it can also be more implicit signals. So how did the user behave after they were shown some generated content? Did they progress to the next stage of the application? Did they send the generated email? Did they edit the text? And all of that feedback data becomes useful, both for debugging, and also for fine-tuning the model later on. So that evaluation data becomes this rich resource that allows you to continuously improve your application over time.
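As a rough illustration of capturing the explicit and implicit signals described above, here is a small hypothetical sketch that logs feedback events against the ID of a generated output; the event schema and field names are invented for the example and are not Humanloop's SDK.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical event record; real products have their own schemas.
@dataclass
class FeedbackEvent:
    generation_id: str          # id of the logged model output this feedback refers to
    kind: str                   # "vote", "edited", "sent", "abandoned", ...
    value: str | None = None    # e.g. "up" / "down" for explicit votes
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

feedback_log: list[FeedbackEvent] = []

def record_feedback(event: FeedbackEvent) -> None:
    """Append feedback so it can later be joined with the prompt/output logs."""
    feedback_log.append(event)

# Explicit signal: the user clicked thumbs-down on a generated email.
record_feedback(FeedbackEvent(generation_id="gen_123", kind="vote", value="down"))
# Implicit signal: the user sent the generated email without editing it.
record_feedback(FeedbackEvent(generation_id="gen_123", kind="sent"))
```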

[18:23] Yeah, thatā€™s awesome. And I know that that fits inā€¦ So maybe you could talk a little bit about how youā€™re ā€“ one of the things that you mentioned earlier is youā€™re seeing fewer people do fine-tuningā€¦ Which - I see this very commonly as aā€¦ Itā€™s not an irrelevant point, but itā€™s maybe a misconception, where a lot of teams come into this space and they just assume theyā€™re gonna be fine-tuning their modelsā€¦ And often, what they end up doing is fine-tuning their workflows or their language model chains, or the data that theyā€™re retrieving, or their prompt formats, or templates, or that sort of thing. Theyā€™re not really fine-tuning. I think thereā€™s this really blurred line right now for many teams that are adopting AI into their organization, where theyā€™ll frequently just use the term ā€œOh, Iā€™m training the AI to do this, and now itā€™s betterā€, but all theyā€™ve really done is just inject some data into their prompts, or something like that.

So could you maybe help clarify that distinction? And also, in reality, what youā€™re seeing people do with this capability of evaluation, both online and offline, and how thatā€™s filtering back into upgrades to the system, or actual fine-tunes of models?

Yeah. So I guess you're right, there's a lot of jargon involved... And especially for people who are new to the field, the word "fine-tuning" has a colloquial meaning, and then it has a technical meaning in machine learning, and the two end up being blurred. So fine-tuning in a machine learning context usually means doing some extra training on the base model, where you're actually changing the weights of the model, given some set of example pairs of inputs and outputs that you want. And then obviously, there's prompt engineering and maybe context engineering, where you're changing the instructions to the language model, or you're changing the data that's fed into the context, or how an agent system might be set up... And both are really important. Typically, the advice we give the majority of our customers, and what we see play out in practice, is that people should first push the limits of prompt engineering. Because it's very fast, it's easy to do, and it can have very high impact, especially around changing the sort of outputs you get, and also in helping the model have the right data it needs to answer the question.

So prompt engineering is kind of usually where most people start, and sometimes where people finish as well. And fine-tuning tends to be useful either if people are trying to improve latency or cost, or if they have like a particular tone of voice or output constraint that they want to enforce. So if people want the model to output valid JSON, then fine-tuning might be a great way to achieve that. Or if they want to use a local private model, because it needs to run on an edge device, or something like this, then fine-tuning I think is a great candidate.
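To ground the "fine-tune for an output constraint" case just mentioned, here is a hedged sketch of what a single training example might look like in the OpenAI-style chat fine-tuning JSONL format, where the assistant turn demonstrates the exact JSON the model should always emit; the field values and file name are invented for illustration.

```python
import json

# One training example: the assistant turn demonstrates the exact JSON shape
# we want the fine-tuned model to always produce.
example = {
    "messages": [
        {"role": "system", "content": "Extract the fields and reply with valid JSON only."},
        {"role": "user", "content": "Call notes: budget is 50k, decision maker is the CTO, timeline is Q3."},
        {"role": "assistant", "content": json.dumps({"budget": "50k", "authority": "CTO", "timeline": "Q3"})},
    ]
}

# Fine-tuning jobs typically take a JSONL file: one example object per line.
with open("finetune_train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")
```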

And it can also let you reduce costs, because oftentimes you can fine-tune a smaller model to get similar performance. The analogy I like to use is that fine-tuning is a bit like compilation. Once you've built your first version in a flexible language, when you want to optimize it you might switch to a compiled language, and you end up with a kind of compiled binary. I think there was a second part to your question, but just remind me, because I've lost the second part.

Yeahā€¦ Basically, you mentioned that maybe fewer people are doing fine-tunesā€¦ Maybe you could comment on ā€“ I donā€™t know if you have a sense of why that is, or how you would see that sort of progressing into this year, as more and more people adopt this technology, and maybe get better tooling around the - letā€™s not call it fine-tuning, so we donā€™t mix all the jargon, but the iterative development of these systems. Do you see that trend continuing, or how do you see that kind of going into maybe larger or wider adoption in 2024?

[22:21] Yeah, so I think that we've definitely seen less fine-tuning than we thought we would see when we launched this version of Humanloop back in 2022. And I think that's been true of others as well. I've spoken to friends at OpenAI... And OpenAI is expecting there will be more fine-tuning in the future, but they've been surprised that there wasn't more initially. I think some of that is because prompt engineering has turned out to be remarkably powerful, and also because some of the changes that people want to make to these models are more about getting factual context into the model. So one of the downsides of LLMs today is they're obviously trained on the public Internet, so they don't necessarily know private information about your company, and they tend not to know information past the training date of the model. And one way you might have thought you could overcome that is "I'm going to fine-tune the model on my company's data." But in practice, what people are finding is that a better solution is to use a hybrid system of search or information retrieval, plus generation. So what's come to be known as RAG, or retrieval-augmented generation, has turned out to be a really good solution to this problem.

And so the main reasons to fine-tune now are more about optimizing costs and latency, and maybe a little bit of tone of voice, but it's not needed so much to adapt the model to a specific use case. And fine-tuning is a heavier-duty operation, because it takes longer... You know, you can edit a prompt very quickly and then see what the impact is. With fine-tuning, you need to have the dataset that you want to fine-tune on, and then you need to run a training job and evaluate that job afterwards.

So there are certainly circumstances where itā€™s going to make sense. I think especially anyone who wants to use a private open source model will likely find themselves wanting to do more fine-tuningā€¦ But the quality of prompt engineering and the distance you can go with it I think took a lot of people by surprise.

And on that note, you mentioned the closed, proprietary model ecosystem versus open models that people might host in their own environment, and/or fine-tune on their own data... I know that Humanloop - you explicitly say that you support all of the models; you're integrating these closed models, and you integrate with open models... Why and how did you decide to include all of those? And in terms of the mix of what you're seeing with people's implementations, how do you see this proliferation of open models impacting the workflows that you're supporting in the future?

So the reason for supporting them, again, is largely customer pull, right? What we're finding is that many of our customers were using a mixture of models for different use cases, either because the large proprietary ones had slightly different performance trade-offs, or because they had use cases where they cared about privacy or latency, and so they couldn't use a public model for those instances. And so we had to support all of them. It really wouldn't be a useful product for our customers if they could only use it with one particular model.

And the way weā€™ve got around this is that we tried to integrate all of the publicly-available ones, but we also make it easy for people to connect their own models. So they donā€™t necessarily need us. As long as they expose the appropriate APIs, you can plug in any model to Humanloop.

That would be a matter of hosting the model and making sure that the model server - maybe one someone's running in their own AWS, or wherever - fulfills the API contract you're expecting in terms of responses.

That's exactly right. Yeah. And in terms of the proliferation of open source and how that's going, I think there's still a performance gap at the moment between the very best closed models - so GPT-4, or some of the better models from Anthropic - and the best open source... But it is closing. The latest models from, say, Mistral have proved to be very good, LLaMA 2 was very good... Increasingly, you're not giving up as big a performance gap, although there is still one, but you need to have high volumes for it to be economically competitive to host your own model. So the main reasons we see people doing it are related to data privacy. Companies that, for whatever reason, cannot or don't want to send data to a third party end up using open source... And then also, anyone who's doing things on edge, and who wants real-time or very low latency, ends up using open source.
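To make the earlier point about plugging in your own model behind a familiar API contract concrete, here is a small hedged sketch: an OpenAI-compatible client pointed at a self-hosted endpoint (for example, a vLLM server running in your own AWS account). The URL, API key and model name are placeholders.

```python
from openai import OpenAI

# Hypothetical self-hosted, OpenAI-compatible endpoint; URL, key and model
# name are placeholders, not real infrastructure.
client = OpenAI(
    base_url="https://llm.internal.example.com/v1",
    api_key="not-a-real-key",
)

response = client.chat.completions.create(
    model="my-private-llama",
    messages=[{"role": "user", "content": "Summarize this sales call: ..."}],
)
print(response.choices[0].message.content)
```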

Well, Raza, Iā€™d love for you to maybe describe, if you canā€¦ Weā€™ve kind of talked about the problems that youā€™re addressing, weā€™ve talked about the sort of workflows that youā€™re enabling, the evaluation, and some trends that youā€™re seeingā€¦ But Iā€™d love for you to describe if you can maybe for like a non-technical persona, like a domain expert whoā€™s engaging with the Humanloop system, and maybe for a more technical person whoā€™s integrating data sources or other things, what does it look like to use the Humanloop system? Maybe describe the roles in which these people are ā€“ like what theyā€™re trying to do from each perspective. Because I think that might be instructive for people that are trying to engage domain experts and technical people in a collaboration around these problems.

Absolutely. So maybe it might be helpful to have an imagined concrete example. A very common example we see is people building some kind of question answering system - maybe it's for their internal customer service stuff, or maybe they want to replace an FAQ, that kind of thing. So there's a set of documents, a question is going to come in, there'll be a retrieval step, and then they want to generate an answer. Typically, the PMs or the domain experts will be figuring out the requirements of the system: "What does good look like? What do we want to build?" And the engineers will be building the retrieval part, orchestrating all the model calls in code, integrating the Humanloop APIs into their system... And also, usually they lead on setting up evaluation. So maybe once it's set up, the domain experts might continue to do the evaluation themselves, but the engineers tend to set it up the first time.

So if you're the domain expert, typically you would start off in our playground environment, where you can just try things out. The engineers might connect a database to Humanloop for you - maybe they'll store the data in a vector database, and connect that to Humanloop. And then once you're in that environment, you could try different prompts with the models; you could try them with GPT-4, with Cohere, with an open source model, see what impact that has, see if you're getting answers that you like... Oftentimes early on it's not in the right tone of voice, or the retrieval system is not quite right, and so the model is not giving factually correct answers... So it takes a certain amount of iteration to get to the point where, even when you eyeball it, it's looking appropriate. And usually at that point people move on to doing a bit more rigorous evaluation.

So they might generate either automatically or internally a set of test cases, and theyā€™ll also come up with a set of evaluation criteria that matter to them in their context. Theyā€™ll set up that evaluation, run it, and then usually at that point they might deploy to production.

So thatā€™s the point at which things would end up with real users, they start gathering user feedbackā€¦ And usually, the situation is not finished at that point, because people then look at the production logs, or they look at the real usage data, and they will filter based on the evaluation criteria. And they might say ā€œHey, show me the ones that didnā€™t result in a good outcomeā€, and then theyā€™ll try and debug them in some way, maybe make a change to a prompt, rerun the evaluation and submit it.

And so the engineers are doing the orchestration of the code. Theyā€™re typically making the model calls, theyā€™ll add logging calls to Humanloopā€¦ So the way that works - thereā€™s a couple of ways of doing the integration, but you can imagine every time you call the model, youā€™re effectively also logging back to Humanloop what the inputs and outputs were, as well as any user feedback data. And then the domain experts are typically looking at the data, analyzing it, debugging, making decisions about how to improve things, and theyā€™re able to actually take some of those actions themselves in the UI.
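As a rough sketch of the engineer-side orchestration and logging just described, the snippet below retrieves context, fills a prompt template, calls a model and logs the inputs and outputs for later review; the retrieval and logging functions are stand-ins invented for the example, not the Humanloop SDK, and the model name is a placeholder.

```python
from openai import OpenAI

client = OpenAI()

PROMPT_TEMPLATE = """Answer the question using only the context below.

Context:
{context}

Question: {question}"""

def retrieve(question: str) -> list[str]:
    # Stand-in for a vector-database query (e.g. top-k chunks for the question).
    return ["Refunds are available within 30 days of purchase."]

def log_generation(inputs: dict, output: str) -> None:
    # Stand-in for a logging call to an observability/evaluation platform.
    print({"inputs": inputs, "output": output})

def answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    output = response.choices[0].message.content
    log_generation({"question": question, "context": context, "prompt": prompt}, output)
    return output
```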

[32:03] Yeah. So if I just kind of abstract that a bit to maybe give people a frame of thinking, it sounds like thereā€™s kind of this framework setup where thereā€™s data sources, thereā€™s maybe logging calls within a version of an applicationā€¦ If youā€™re using a hosted model or if youā€™re using a proprietary API, you decide thatā€¦ And so itā€™s kind of set up, and then thereā€™s maybe an evaluation or a prototyping phase, letā€™s call it, where the domain experts try their promptingā€¦ Eventually, they find prompts that they think will work well for these various steps in a workflow, or something like thatā€¦ Those are pushed, as you said, I think, one way into the actual code or application, such that the domain experts are in charge of the prompting, to some degree. And as youā€™re logging feedback into the system, the domain experts are able to iterate on their prompts, which hopefully then improve the system, and those are then pushed back into the production system, maybe after an evaluation or something. Is that a fair representation?

Yeah, I think thatā€™s a great representation. Thanks for articulating it so clearly. And the kinds of things that the evaluation becomes useful for is avoiding regression, say. So people might notice one type of problem. They go in and they change a prompt, or they change the retrieval system, and they want to make sure they donā€™t break what was already working. And so having good evaluation in place helps with that.

And then maybe itā€™s also worth ā€“ because I think we didnā€™t sort of do this at the beginningā€¦ Just thinking about what are the components of these LLM applications. So I think youā€™re exactly right, we sort of think of the blocks of LLM apps being composed of a base model. So that might be a private fine-tuned model, or one of these large public onesā€¦ A prompt template, which is usually an instruction to the model that might have gaps in it for retrieved data or context, a data collection strategy, and then that whole thing of like data collection, prompt template and model might be chained together in a loop, or might be repeated one after anotherā€¦ And thereā€™s an extra complexity, which is the models might also be allowed to call tools or APIs. But I think those pieces that get taken together more or less comprehensively cover things. So tools, data retrieval, prompt template and base model are the main components. But then within each of those you have a lot of design choices and freedom. So you have a combinatorially large number of decisions to get right when building one of these applications.
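To ground the components Raza lists (base model, prompt template with gaps for retrieved context, data retrieval, and tools), here is a hedged sketch of the kind of versioned configuration object such an application might revolve around; the class and field names are invented for illustration rather than taken from any particular product.

```python
from dataclasses import dataclass

# A hypothetical "prompt config": the unit people version, compare and deploy.
@dataclass(frozen=True)
class PromptConfig:
    model: str                      # base model, e.g. "gpt-4" or a private fine-tune
    template: str                   # instruction with gaps for retrieved context
    temperature: float = 0.2
    tools: tuple[str, ...] = ()     # names of APIs/tools the model may call

v1 = PromptConfig(
    model="gpt-4",
    template="Use the context below to answer.\n\nContext:\n{context}\n\nQuestion: {question}",
)
# A candidate change to evaluate against v1 before deploying.
v2 = PromptConfig(model="my-private-llama", template=v1.template, temperature=0.0)
```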

One of the things that you mentioned is this evaluation phase helping prevent regressions, because in testing the behavior of the model outputs you might make one change that looks like it's improving things on a small set of examples, but behaves differently across a wide range of examples... I'm wondering also - I could imagine two scenarios... You know, models are being released all the time, whether it's upgrading from this version of a GPT model to the next version, or from this Mistral fine-tune to that one over there... I'm thinking even in the past few days we've been using the Neural Chat model from Intel a good bit, and there's a version of that that Neural Magic released, a sparsified version where they pruned out some of the weights and layers to make it more efficient and run on more commodity hardware that's more widely available... And so one of the questions we were discussing is "Well, we could flip the version of this model to the sparse one, but we have to decide how to evaluate that over the use cases we care about." Because you could look at the output for a few test prompts, and it might look similar, or good, or even better, but at a wider scale it might be quite different in ways that you don't expect. So I could see the evaluation also being used for that, but I could also see where, if you're upgrading to a new model, it could just throw everything up in the air in terms of "Oh, this is an entirely different prompt format", right? Or "This is a whole new behavior from this new model, that is distinct from the old model." So how are you seeing people navigate that landscape of model upgrades?

[36:33] I think you should just view it as a change as you would to any other part of the system. And hopefully, the desired behavior of the model is not changing. So even if the model is changed, you still want to run your regression test and say ā€œOkay, are we meeting a minimum threshold that we had on these gold standard test set before?ā€

In general, I think evaluation - we see it happening in sort of three different stages during development. Thereā€™s during this interactive stage very early on, when youā€™re prototyping, you want fast feedback, youā€™re just looking to get a sense of ā€œIs this even working appropriately?ā€ At that stage, eyeballing examples, and looking at things side by side, in a very interactive way can be helpful.

And interactive testing can also be helpful for adversarial testing. A fixed test set doesn't tell you what will happen when a user who actually wants to break the system comes in. So a concrete example of this - you know, one of our customers has children as their end users, and they want to make sure that things are age-appropriate, so they have guardrails in place. But when they come to test the system, they don't want to just test it against an input that's benign. They want to see, if we really red-team this, can we break it? And there, interactive testing can be very helpful.

And then the next place where you kind of want testing in place is this regression testing, where you have a fixed set of evaluators on a test set, and you want to know ā€œWhen I make a change, does it get worse?ā€ And the final place we see people using it is actually from monitoring. So okay, Iā€™m in production now; thereā€™s new data flowing through. I may not have the ground truth answer, but I can still set up different forms of evaluator, and I want to be alerted if the performance drops below some threshold.
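As a rough sketch of the regression-testing stage described above, the snippet below runs an evaluator over a fixed gold-standard test set and fails the change if the mean score drops below a threshold; it reuses the hypothetical answer and faithfulness_score helpers from the earlier sketches, and the threshold and test case are purely illustrative.

```python
def run_regression(answer_fn, evaluator_fn, test_set: list[dict], threshold: float = 0.9) -> bool:
    """Score every test case and return True only if the mean clears the threshold."""
    scores = [evaluator_fn(case["context"], answer_fn(case["question"])) for case in test_set]
    mean_score = sum(scores) / len(scores)
    print(f"mean score {mean_score:.2f} (threshold {threshold})")
    return mean_score >= threshold

test_set = [
    {"question": "How long is the refund window?",
     "context": "Refunds are available within 30 days of purchase."},
]

# `answer` and `faithfulness_score` are the hypothetical helpers sketched earlier.
assert run_regression(answer, faithfulness_score, test_set), "Regression: score below threshold"
```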

So one of the things that Iā€™ve been thinking about throughout our conversation here, and thatā€™s I think highlighted by what you just mentioned in sort of the upgrades to oneā€™s workflow, and the various levels at which such a platform can benefit teamsā€¦ And it made me think of [unintelligible 00:38:31.06] I have a background in physics, and there were plenty of physics teams or collaborators that we worked with - you know, we were writing code - and not doing great sort of version control practicesā€¦ And not everyone was using GitHub, and there was sort of collaboration challenges associated with that, which are obviously solved by great code collaboration systems of various forms, that have been developed over timeā€¦ And I think thereā€™s probably a parallel here with some of the collaboration systems that are being built around both playgrounds, and prompts, and evaluation. Iā€™m wondering if thereā€™s any examples from clients that youā€™ve worked with, or maybe itā€™s just interesting use cases of surprising things theyā€™ve been able to do when going from sort of doing things ad hoc, and maybe versioning prompts in spreadsheets, or whatever it might be, to actually being able to work in a more seamless way between domain experts and technical staff. Are there any clients, or use cases, or surprising stories that come to mind?

[39:46] Yeah, it's a good question. I'm kind of thinking through them to see what the more interesting examples might be. I think that, fundamentally, it's not necessarily enabling completely new behavior, but it's making the old behavior significantly faster and less error-prone. Certainly fewer mistakes and less time spent - okay, so a surprising example... A publicly-listed company told me that one of the issues they were having is that, because they were sharing these prompt configs in Teams, they were getting differences in behavior based on whitespace being copied. Someone was playing around in the OpenAI playground, they copy-pasted into Teams... That person would copy-paste from Teams into code... And there were small whitespace differences, and you wouldn't think it would affect the models, but it actually did. And so they would get performance differences they couldn't explain. And actually, it just turned out that you shouldn't be sharing your code via Teams, right?

So I guess that's one surprising example. I think another thing as well is the complexity of apps that people are now beginning to be able to build. Increasingly, I think people are building simple agents; more complex agents are still not super-reliable, but a trend that we've been hearing a lot about from our customers recently is people trying to build systems that can use their existing software. An example of this is - you know, Ironclad is a company that's added a lot of LLM-based features to their product... And they're actually able to automate a lot of workflows that were previously being done by humans, because the models can use the APIs that exist within the Ironclad software. So they're actually able to leverage their existing infrastructure. But to get that to work, they had to innovate quite a lot in tooling. And in fact - you know, this isn't a plug for Humanloop - Ironclad in this case built a system called Rivet, which is their own open source prompt engineering and iteration framework. But I think it's a good example of, in order to achieve the complexity of that use case - this happened to be before tools like Humanloop were around - they had to build something themselves. And it's quite sophisticated tooling. I actually think Rivet's great, so people should check that out as well. It's an open source library; anyone can go and get the tool.

So yeah, I think the surprising things are how error-prone things are without good tooling, and the crazy ways in which people are solving problems. Another example of a mistake we saw: two different people triggered exactly the same annotation job. They had annotations in spreadsheets, and they both outsourced the same job to different annotation teams... which was obviously an expensive mistake to make. So very error-prone. And then I think also just impossible to scale to more complex agentic use cases.

Well, you already kind of alluded to some trends that you're seeing moving forward... As we draw to a close here, I'd love to know, from someone who's seeing a lot of different use cases being enabled through Humanloop and your platform, what's exciting for you as we move into this next year - maybe it's things that are happening in AI more broadly, or things that are being enabled by Humanloop, or things on your roadmap that you can't wait to go live... As you're lying in bed at night, getting excited for the next day of AI stuff, what's on your mind?

So AI more broadly, I just feel the rate of progress of capabilities is both exciting and scary. Itā€™s extremely fast; multimodal models, better generative models, models with increased reasoningā€¦ I think the range of possible applications is expanding very quickly, as the capabilities of the models expand.

I think people have been excited about agent use cases for a while; systems that can act on their own and go off and achieve something for you. But in practice, weā€™ve not seen that many people succeed in production with those. There are a couple of examples, Ironclad being a good oneā€¦ But it feels like weā€™re still at the very beginning of that, and I think Iā€™m excited about seeing more people get to success with that. Iā€™d say that the most common, successful applications weā€™ve seen today are mostly either retrieval-augmented applications, or more simple LLM applications. But increasingly, Iā€™m excited about seeing agents in production, and also multimodal models in production.

In terms of things that I'm particularly excited about from Humanloop, it's us becoming a proactive rather than a passive platform. So today, the product managers and the engineers drive the changes on Humanloop. But something that we're hopefully going to release later this year is a system where Humanloop itself can start proactively suggesting improvements to your application. Because we have the evaluation data, and because we have all the prompts, we can start saying things to you like "Hey, we have a new prompt for this application. It's a lot shorter than the one you have. It scores similarly on eval data. If you upgrade, we think we can cut your costs by 40%." And allowing people to then accept that change. So going from a system that is observing to a system that's actually intervening.

Thatā€™s awesome. Yeah, well, I definitely look forward to seeing how that rolls out, and I really appreciate the work that you and the team at Humanloop are doing to help us upgrade our workflows, and enable these sort of more complicated use cases. So thank you so much for taking time out of that work to join us. Itā€™s been a pleasure. I really enjoyed the conversation.

Thanks so much for having me, Daniel.


