Collaboration & evaluation for LLM apps
source link: https://changelog.com/practicalai/253
Transcript
Changelog
Play the audio to listen along while you enjoy the transcript.
Welcome to another episode of Practical AI. This is Daniel Whitenack. I am CEO and founder at Prediction Guard, and I'm really excited today to be joined by Dr. Raza Habib, who is CEO and co-founder at Humanloop. How are you doing, Raza?
Hi, Daniel. It's a pleasure to be here. I'm doing very well. Yeah, thanks for having me on.
Yeah, I'm super-excited to talk with you. I'm mainly excited to talk with you selfishly, because I see the amazing things that Humanloop is doing, and the really critical problems that you're thinking about... And every day of my life it's like "How am I managing prompts? And how does this next model that I'm upgrading to, how do my prompts do in that model? And how am I constructing workflows around using LLMs?", which definitely seems to be the main thrust of some of the things that you're thinking about at Humanloop. Before we get into the specifics of those things at Humanloop, would you mind setting the context for us in terms of workflows around these LLMs, and collaboration on teams? How did you start thinking about this problem, and what does that mean in reality for those working in industry right now, maybe more generally than just at Humanloop?
Yeah, absolutely. So I guess on the question of how I came to be working on this problem, it was really something that my co-founders, Peter and Jordan, had been working on for a very long time, actually. Previously, Peter and I did our PhDs together around this area, and then when we started the company, it was a little while after transfer learning had started to work in NLP for the first time, and we were mostly helping companies fine-tune smaller models. But then sometime midway through 2022 we became absolutely convinced that the rate of progress for these larger models was so high, it was going to start to eclipse essentially everything else in terms of performance... But more importantly, in terms of usability. It was the first time that instead of having to hand-annotate a new dataset for every new problem, there was this new way of customizing AI models, which was that you could write instructions in natural language, and have a reasonable expectation that the model would then do that thing. And that was unthinkable at the start of 2022, I would say, or maybe a little bit earlier.
So that was really what made us want to go work on this, because we realized that the potential impact of NLP was already there, but the accessibility had been expanded so far, and the capabilities of the models had increased so much, that there was a particular moment to go do this. But at the same time, it introduces a whole bunch of new challenges, right? So I guess historically, the people who were building AI systems were machine learning experts; the way that you would do it is you would collect and annotate the data, you'd fine-tune a custom model... It was typically being used for one specific task at a time. There was a correct answer, so it was easy to evaluate... And with LLMs, the power also brings new challenges. So the way that you customize these models is by writing these natural language instructions, which are prompts, and typically that means that the people involved don't need to be as technical. And usually, we see actually that the best people to do prompt engineering tend to have domain expertise. So often, it's a product manager or someone else within the company who is leading the prompt engineering efforts... But you also have this new artifact lying around, which is the prompt, and it has a similar impact to code on your end application. So it needs to be versioned, and managed, and treated with the same level of respect and rigor that you would treat normal code, but somehow you also need to have the right workflows and collaboration that lets the non-technical people, or the less technical people, work with the engineers on the product.
And then the extra challenge that comes with it as well is that it's very subjective to measure performance here. So in traditional code we're used to writing unit tests, integration tests, regression tests... We know what good looks like and how to measure it. And even in traditional machine learning, there's a ground truth dataset, people calculate metrics... But once you go into generative AI, it tends to be harder to say what the correct answer is. And so when that becomes difficult, then measuring performance becomes hard; and if measuring performance is hard, how do you know when you make changes whether you're going to cause regressions? Or with all the different design choices you have in developing an app, how do you make those design choices if you don't have good metrics of performance?
And so those are the problems that motivated what we've built. And really, Humanloop exists to solve both of these problems. So to help companies with the task of finding the best prompts, managing and versioning them, dealing with collaboration, but then also helping you do the evaluation that's needed to have confidence that the models are going to behave as you expect in production.
And as related to these things - maybe you can start with whichever one you'd like and go to the others, but... In terms of managing and versioning prompts, evaluating the performance of these models, dealing with regressions - as you've seen people try to do this across probably a lot of different clients, a lot of different industries, how are people trying to manage this, in maybe some good ways and some bad ways?
[05:52] Yeah, I think we see a lot of companies go on a bit of a journey. So early on, people were excited about generative AI and LLMs; there's a lot of hype around it now, so some people in the company just go try things out. And often, they'll start off using one of the large, publicly-available models - OpenAI, or Anthropic, or Cohere, one of these; they'll prototype in the playground environment that those providers have. They'll eyeball a few examples, maybe they'll grab a couple of libraries that support orchestration, and they'll put together a prototype. And the first version is fairly easy to build; it's very quick to get to the first wow moment. And then, as people start moving towards production and they start iterating from that maybe 80% good enough version to something that they really trust, they start to run into these problems of like "Oh, I've got 20 different versions of this prompt, and I'm storing it as a string in code... And actually, I want to be able to collaborate with a colleague on this, and so now we're sharing things via screen sharing" - you know, we've had some serious companies you would have heard of who were sending their model configs to each other via Microsoft Teams. And obviously, you wouldn't send someone an important piece of code through Slack or Teams or something like this. But because the collaboration software isn't there to bridge this technical/non-technical divide, those are the kinds of problems we see.
And so at this point - typically a year ago - people would start building their own solution. So more often than not, this was when people would start building in-house tools. Increasingly, because there are companies like Humanloop around, that's usually when someone books a demo with us, and they say "Hey, we've reached this point where actually managing these artifacts has become cumbersome. We're worried about the quality of what we're producing. Do you have a solution to help?" And the way that Humanloop helps, at least on the prompt management side, is we have this interactive environment; it's a little bit like the OpenAI playground, or the Anthropic playground, but a lot more fully featured and designed for actual development. So it's collaborative, it has history built in, you can connect variables, and datasets... And so it becomes like a development environment for your LLM application. You can prototype the application, interact with it, try out a few things... And then people progress from that development environment into production through evaluation and monitoring.
You mentioned this kind of in passing, and I'd love to dig into it a little bit more. You mentioned the types of people that are coming to the table in designing these systems, and oftentimes domain experts - you know, previously, in working as a data scientist, it was always kind of assumed "Oh, you need to talk to the domain experts." But at least for many years, it was like data scientists talk to the domain experts, and then go off and build their thing. The domain experts were not involved in the actual building of the system. And even then, the data scientists were maybe building things that were kind of foreign to software engineers. And what I'm hearing you say is you've kind of got these multiple layers; you have domain experts, who might not be that technical, you've got maybe AI and data people, who are using this kind of unique set of tools, and maybe they're even hosting their own models... And then you've got product and software engineering people; it seems like a much more complicated landscape of interactions. How have you seen this play out in reality in terms of non-technical people and technical people both working together on something that is ultimately implemented in code and run as an application?
I actually think one of the most exciting things about LLMs and the progress in AI in general is that product managers and subject matter experts can for the first time be very directly involved in implementing these applications. So I think it's always been the case that the PM or someone like that is the person who distills the problem, speaks to the customers, produces the spec... But there's this translation step where they produce that PRD document, and then someone else goes off and implements it. And because we're now able to program at least some of the application in natural language, it's actually accessible to those people very directly. And it's worth maybe having a concrete example.
[10:02] So I use an AI notetaker for a lot of my sales calls. And it records the call, and then I get a summary afterwards. And the app actually allows you to choose a lot of different types of summary. So you can say, "Hey, I'm a salesperson. I want a summary that will extract budget, and authority, and a timeline." Versus you can say "Oh, actually, I had a product interview, and I want a different type of summary." And if you think about developing that application, the person who has the knowledge that's needed to say what a good summary is, and to write the prompt for the model - it's the person who has that domain expertise. It's not the software engineer.
But obviously, the prompt is only one piece of the application. If you've got a question answering system, there's usually retrieval as part of this; there may be other components... Usually, the LLM is a block in a wider application. So you obviously still need the software engineers around, because they're implementing the bulk of the application, but the product managers can be much more directly involved. And then, actually, we see increasingly less involvement from machine learning or AI experts, and fewer people are fine-tuning their own models. So for the majority of product teams we're seeing, there is an AI platform team that maybe facilitates setting things up, but the bulk of the work is led by the product managers, and then the engineers.
And one interesting example of this on the extreme end is one of our customers that's a very large ad tech company - they actually do not let their engineers edit the prompts. So they have a team of linguists who do prompt development. The linguists finalize the prompts, they're saved in a serialized format, and they go to production, but it's a one-way transfer. So the engineers can't edit them, because they're not considered able to assess the actual outputs, even though they are responsible for the rest of the application.
Just thinking about how teams interact and who's doing what, it seems like the problems that you've laid out are, I think, very clear and worth solving, but it's probably hard to think about "Well, am I building a developer tool? Or am I building something that these non-technical people interact with? Or is it both?" How did you think about that as you entered into the stages of bringing Humanloop into existence?
I think it has to be both... And the honest answer is it evolved kind of organically, by going to customers, speaking to them about their problems, and trying to figure out what the best version of a solution looked like. So we didn't set out to build a tool that needed to do both of these things, but I think the reality is, given the problems that people face, you do need both.
An analogy to think about might be something like Figma. Figma is somewhere where multiple different stakeholders come together to iterate on things, and to develop them, and provide feedback... And I think you need something analogous to that for gen AI... Although it's not an exact analogy, because we also need to attach the evaluation to this. So it's almost by necessity that we've had to do that... But I also think that it's very exciting. And the reason I think it's exciting is because it is expanding who can be involved in developing these applications.
Break: [13:05]
You mentioned how this environment of domain experts coming together, and technical teams coming together in a collaborative environment, opens up new possibilities for both collaboration and innovation. I'm wondering if at this point you could just lay out... We've talked about the problems, we've talked about those involved, and the kinds of people that would use such a system or a platform to enable these kinds of workflows... Could you describe a little bit more what Humanloop is specifically, in terms of both what it can do, and how these different personas engage with the system?
Yeah. So I guess in terms of what it can do, concretely, it's firstly helping you with prompt iteration, versioning and management, and then with evaluation and monitoring. And the way it does that - there's a web app with a web UI where people come in, and in that UI is an interactive playground-like environment, where people basically try out different prompts; they can compare them side by side with different models, they can try them with different inputs, and when they find versions that they think are good, they save them. And then those can be deployed from that environment to production, or even to a development or staging environment. So that's the development stage.
And then once you have something that's developed, what's very typical is people then want to put evaluation steps into place. So you can define gold standard test sets, and then you can define evaluators within Humanloop. And evaluators are ways of scoring the outputs of a model or a sequence of models, because oftentimes the LLM is part of a wider application.
And so the way that scoring works is there are the very traditional metrics that you would have in code for any machine learning system. So precision, recall, ROUGE, BLEU - these kinds of scores that anyone from a machine learning background would already be familiar with. But what's new in the LLM space is also things that help when things are more subjective. So we have the ability to do model-as-judge, where you might actually prompt another LLM to score the output in some way... And this can be particularly useful when you're trying to measure things like hallucination. So a very common thing to do is to ask the model "Is the final answer contained within the retrieved context?" Or "Is it possible to infer the answer from the retrieved context?" And you can calculate those scores.
And then the final way is we also support human evaluation. So in some cases, you really do want either feedback from an end user, or from an internal annotator involved as well. And so we allow you to gather that feedback, either from your live production application, and have it logged against your data, or you can queue internal annotation tasks for a team. And I can maybe tell you a little bit more about in-production feedback, because that's actually where we started.
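[Editor's note] The model-as-judge idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Humanloop's actual API: the judge prompt wording, the `llm_complete` callable, and the yes/no scoring scheme are all hypothetical.

```python
# Hypothetical model-as-judge evaluator for hallucination in a RAG app.
# `llm_complete` stands in for any chat-completion client call; it is an
# assumption, not a real library function.

JUDGE_PROMPT = """You are grading a question-answering system.
Context: {context}
Answer: {answer}
Is the answer fully supported by the context? Reply YES or NO."""

def judge_groundedness(context: str, answer: str, llm_complete) -> bool:
    """Ask a second LLM whether the answer is contained in the retrieved context."""
    reply = llm_complete(JUDGE_PROMPT.format(context=context, answer=answer))
    return reply.strip().upper().startswith("YES")

def hallucination_rate(examples, llm_complete) -> float:
    """Fraction of (context, answer) pairs the judge marks as unsupported."""
    flagged = sum(
        not judge_groundedness(ctx, ans, llm_complete) for ctx, ans in examples
    )
    return flagged / len(examples)
```

In practice the judge's replies are noisy, which is why a score like this is usually tracked as an aggregate over a test set rather than trusted example by example.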
Yeah, yeah. Go ahead, I would love to hear more.
Yeah, so I think that because it's so subjective for a lot of the applications that people are building - whether it be email generation, question answering, a language learning app - there isn't a "correct answer." And so people want to measure how things are actually performing with their end users. And so Humanloop makes it very easy to capture different sources of end user feedback. And that might be explicit feedback, things like the thumbs up/thumbs down votes that you see in ChatGPT, but it can also be more implicit signals. So how did the user behave after they were shown some generated content? Did they progress to the next stage of the application? Did they send the generated email? Did they edit the text? And all of that feedback data becomes useful, both for debugging, and also for fine-tuning the model later on. So that evaluation data becomes this rich resource that allows you to continuously improve your application over time.
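[Editor's note] The explicit and implicit feedback signals described above could be modeled roughly as follows. The event names, the in-memory store, and all function names are illustrative assumptions; a real system would POST these events to a logging service rather than keep them in a dict.

```python
# Hypothetical feedback capture for generated content: one log per
# generation, with an optional explicit vote plus implicit behavior events.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class GenerationLog:
    log_id: str
    output: str
    vote: Optional[str] = None                   # explicit: "up" / "down"
    events: list = field(default_factory=list)   # implicit: "sent", "edited", ...

FEEDBACK_DB: dict = {}  # stand-in for a real datastore

def record_generation(log_id: str, output: str) -> None:
    FEEDBACK_DB[log_id] = GenerationLog(log_id, output)

def record_vote(log_id: str, vote: str) -> None:
    FEEDBACK_DB[log_id].vote = vote

def record_event(log_id: str, event: str) -> None:
    # e.g. the user sent the generated email, or edited it before sending
    FEEDBACK_DB[log_id].events.append(event)

def negative_examples() -> list:
    """Logs worth debugging: downvoted, or edited before use."""
    return [
        g for g in FEEDBACK_DB.values()
        if g.vote == "down" or "edited" in g.events
    ]
```

The point of a filter like `negative_examples` is exactly what Raza describes next: pulling out the generations that went badly so someone can debug them, and keeping the rest as candidate fine-tuning data.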
[18:23] Yeah, that's awesome. And I know that that fits in... So maybe you could talk a little bit about - one of the things that you mentioned earlier is you're seeing fewer people do fine-tuning... Which I see very commonly; it's not an irrelevant point, but it's maybe a misconception, where a lot of teams come into this space and they just assume they're going to be fine-tuning their models... And often, what they end up doing is fine-tuning their workflows, or their language model chains, or the data that they're retrieving, or their prompt formats, or templates, or that sort of thing. They're not really fine-tuning. I think there's this really blurred line right now for many teams that are adopting AI into their organization, where they'll frequently just use the term "Oh, I'm training the AI to do this, and now it's better", but all they've really done is inject some data into their prompts, or something like that.
So could you maybe help clarify that distinction? And also, in reality, what are you seeing people do with this capability of evaluation, both online and offline, and how is that filtering back into upgrades to the system, or actual fine-tunes of models?
Yeah. So I guess you're right, there's a lot of jargon involved... And especially for people who are new to the field, the word "fine-tuning" has a colloquial meaning, and then it has a technical meaning in machine learning, and the two end up being blurred. So fine-tuning in a machine learning context usually means doing some extra training on the base model, where you're actually changing the weights of the model, given some set of example pairs of the inputs and outputs that you want. And then obviously, there's prompt engineering and maybe context engineering, where you're changing the instructions to the language model, or you're changing the data that's fed into the context, or how an agent system might be set up... And both are really important. Typically, the advice we give the majority of our customers, and what we see play out in practice, is that people should first push the limits of prompt engineering. Because it's very fast, it's easy to do, and it can have very high impact, especially around changing the sort of outputs, and also in helping the model have the right data that's needed to answer the question.
So prompt engineering is usually where most people start, and sometimes where people finish as well. And fine-tuning tends to be useful either if people are trying to improve latency or cost, or if they have a particular tone of voice or output constraint that they want to enforce. So if people want the model to output valid JSON, then fine-tuning might be a great way to achieve that. Or if they want to use a local private model, because it needs to run on an edge device, or something like this, then fine-tuning I think is a great candidate.
And it can also let you reduce costs, because oftentimes you can fine-tune a smaller model to get similar performance. The analogy I like to use is that fine-tuning is a bit like compilation. Once you've built your first version of the application, when you want to optimize it, you might move to a compiled language, and you've got a kind of compiled binary. I think there was a second part to your question, but just remind me, because I've lost the second part.
Yeah... Basically, you mentioned that maybe fewer people are doing fine-tunes... Maybe you could comment on - I don't know if you have a sense of why that is, or how you would see that progressing into this year, as more and more people adopt this technology, and maybe get better tooling around the - let's not call it fine-tuning, so we don't mix all the jargon, but the iterative development of these systems. Do you see that trend continuing, or how do you see that going into maybe larger or wider adoption in 2024?
[22:21] Yeah, so I think that we've definitely seen less fine-tuning than we thought we would see when we launched this version of Humanloop back in 2022. And I think that's been true of others as well. I've spoken to friends at OpenAI... And OpenAI is expecting there will be more fine-tuning in the future, but they've been surprised that there wasn't more initially. I think some of that is because prompt engineering has turned out to be remarkably powerful, and also because some of the changes that people want to make to these models are more about getting factual context into the model. So one of the downsides of LLMs today is they're obviously trained on the public Internet, so they don't necessarily know private information about your company; they tend not to know information past the training date of the model. And one way you might have thought you could overcome that is "I'm going to fine-tune the model on my company's data." But I think in practice, what people are finding is that a better solution to that is to use a hybrid system of search or information retrieval, plus generation. So what's come to be known as RAG, or retrieval-augmented generation, has turned out to be a really good solution to this problem.
And so the main reasons to fine-tune now are more about optimizing costs and latency, and maybe a little bit tone of voice, but it's not needed so much to adapt the model to a specific use case. And fine-tuning is a heavier duty operation, because it takes longer... You know, you can edit a prompt very quickly and then see what the impact is. With fine-tuning, you need to have the dataset that you want to fine-tune on, and then you need to run a training job, and then evaluate that job afterwards.
So there are certainly circumstances where it's going to make sense. I think especially anyone who wants to use a private open source model will likely find themselves wanting to do more fine-tuning... But the quality of prompt engineering, and the distance you can go with it, I think took a lot of people by surprise.
And on that note, you mentioned the closed proprietary model ecosystem versus open models that people might host in their own environment, and/or fine-tune on their own data... I know that Humanloop explicitly says that you kind of support all of the models - you're integrating these sort of closed models, and integrating with open models... Why and how did you decide to include all of those? And in terms of the mix of what you're seeing with people's implementations, how do you see this proliferation of open models impacting the workflows that you're supporting in the future?
So the reason for supporting them, again, is largely customer pull, right? What we were finding is that many of our customers were using a mixture of models for different use cases, either because the large proprietary ones had slightly different performance trade-offs, or because there were use cases where they cared about privacy, or they cared about latency, and so they couldn't use a public model for those instances. And so we had to support all of them. It really was something where it wouldn't be a useful product to our customers if they could only use it for one particular model.
And the way we've gotten around this is that we've tried to integrate all of the publicly-available ones, but we also make it easy for people to connect their own models. So they don't necessarily need us. As long as they expose the appropriate APIs, you can plug in any model to Humanloop.
That would be a matter of hosting the model and making sure that the model server - that maybe someone's running in their own AWS, or wherever - fulfills the API contract that you're expecting in terms of responses.
That's exactly right. Yeah. And in terms of the proliferation of open source and how that's going, I think there's still a performance gap at the moment between the very best closed models - so GPT-4, or some of the better models from Anthropic - and the best open source... But it is closing. So the latest models from, say, Mistral have proved to be very good, LLaMA 2 was very good... Increasingly, you're not paying as big a performance penalty, although there is still one, but you need to have high volumes for it to be economically competitive to host your own model. So the main reasons we see people doing it are related to data privacy. Companies that, for whatever reason, cannot or don't want to send data to a third party end up using open source... And then also, anyone who's doing things on the edge, and who wants real-time or very low latency, ends up using open source.
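[Editor's note] The "API contract" idea from the exchange above can be made concrete with a small smoke test. The widely used chat-completions response shape is assumed here for illustration; the exact fields Humanloop expects from a self-hosted model are not specified in this conversation, so treat this shape as an assumption and check the vendor's documentation.

```python
# Sketch: verify that a self-hosted model server's response matches the
# expected (assumed, chat-completions-style) contract before wiring it in.

def validate_chat_response(resp: dict) -> str:
    """Return the generated text if the response matches the expected shape,
    otherwise raise ValueError - useful as a pre-integration smoke test."""
    try:
        choice = resp["choices"][0]
        content = choice["message"]["content"]
    except (KeyError, IndexError, TypeError) as exc:
        raise ValueError(f"response does not meet contract: {exc!r}") from exc
    if not isinstance(content, str):
        raise ValueError("message content must be a string")
    return content
```

Running a check like this against a few live responses from the model server someone is hosting in their own AWS account is a cheap way to confirm the contract is fulfilled before pointing production traffic at it.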
Well, Raza, I'd love for you to maybe describe, if you can... We've talked about the problems that you're addressing, we've talked about the sort of workflows that you're enabling, the evaluation, and some trends that you're seeing... But I'd love for you to describe, if you can - maybe for a non-technical persona, like a domain expert who's engaging with the Humanloop system, and maybe for a more technical person who's integrating data sources or other things - what does it look like to use the Humanloop system? Maybe describe what these people are trying to do from each perspective. Because I think that might be instructive for people that are trying to engage domain experts and technical people in a collaboration around these problems.
Absolutely. So maybe it might be helpful to have a kind of imagined concrete example. So a very common example we see is people building some kind of internal question answering system - maybe it's for their internal customer service staff, or maybe they want to replace an FAQ, or that kind of thing. So there's a set of documents, a question is going to come in, there'll be a retrieval step, and then they want to generate an answer. So typically, the PMs or the domain experts will be figuring out the requirements of the system - "What does good look like? What do we want to build?" And the engineers will be building the retrieval part, orchestrating all the model calls in code, integrating the Humanloop APIs into their system... And also, usually they lead on setting up evaluation. So maybe once it's set up, the domain experts might continue to do the evaluation themselves, but the engineers tend to set it up the first time.
So if you're the domain expert, typically you would start off in our playground environment, where you can just try things out. So the engineers might connect a database to Humanloop for you - maybe they'll store the data in a vector database, and connect that to Humanloop. And then once you're in that environment, you could try different prompts with the models; you could try them with GPT-4, with Cohere, with an open source model, see what impact that has, see if you're getting answers that you like... Oftentimes early on it's not in the right tone of voice, or the retrieval system is not quite right, and so the model is not giving factually correct answers... So it takes a certain amount of iteration to get to the point where, even when you eyeball it, it's looking appropriate. And usually, at that point people then move to doing a little bit more rigorous evaluation.
So they might generate, either automatically or internally, a set of test cases, and they'll also come up with a set of evaluation criteria that matter to them in their context. They'll set up that evaluation, run it, and then usually at that point they might deploy to production.
So that's the point at which things would end up with real users, and they start gathering user feedback... And usually, the situation is not finished at that point, because people then look at the production logs, or they look at the real usage data, and they will filter based on the evaluation criteria. And they might say "Hey, show me the ones that didn't result in a good outcome", and then they'll try and debug them in some way - maybe make a change to a prompt, rerun the evaluation, and resubmit it.
And so the engineers are doing the orchestration of the code. They're typically making the model calls, and they'll add logging calls to Humanloop... So the way that works - there are a couple of ways of doing the integration, but you can imagine that every time you call the model, you're effectively also logging back to Humanloop what the inputs and outputs were, as well as any user feedback data. And then the domain experts are typically looking at the data, analyzing it, debugging, making decisions about how to improve things, and they're able to actually take some of those actions themselves in the UI.
[32:03] Yeah. So if I just abstract that a bit to maybe give people a frame of thinking, it sounds like there's this framework setup where there's data sources, there's maybe logging calls within a version of an application... If you're using a hosted model or if you're using a proprietary API, you decide that... And so it's set up, and then there's maybe an evaluation or a prototyping phase, let's call it, where the domain experts try their prompting... Eventually, they find prompts that they think will work well for these various steps in a workflow, or something like that... Those are pushed, as you said, I think, one way into the actual code or application, such that the domain experts are in charge of the prompting, to some degree. And as you're logging feedback into the system, the domain experts are able to iterate on their prompts, which hopefully then improve the system, and those are then pushed back into the production system, maybe after an evaluation or something. Is that a fair representation?
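[Editor's note] The "log every model call" pattern described above can be sketched as a thin wrapper. This is not the real Humanloop SDK - `send_log`, the log fields, and the in-memory `LOGS` list are all illustrative stand-ins for an HTTP call to an observability endpoint.

```python
# Illustrative logging wrapper: every model call also records the template,
# inputs, output, and latency, so domain experts can review them later.
import time

LOGS = []  # stand-in for POSTing each entry to a logging service

def send_log(entry: dict) -> None:
    LOGS.append(entry)

def logged_call(model_fn, prompt_template: str, **inputs):
    """Render the template, call the model, and log the round trip."""
    prompt = prompt_template.format(**inputs)
    start = time.time()
    output = model_fn(prompt)
    send_log({
        "template": prompt_template,
        "inputs": inputs,
        "output": output,
        "latency_s": round(time.time() - start, 3),
    })
    return output
```

Logging the template and its inputs separately (rather than only the rendered prompt) is what lets a non-engineer later filter logs by prompt version and debug a specific template.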
Yeah, I think that's a great representation. Thanks for articulating it so clearly. And the kinds of things that the evaluation becomes useful for is avoiding regressions, say. So people might notice one type of problem. They go in and they change a prompt, or they change the retrieval system, and they want to make sure they don't break what was already working. And so having good evaluation in place helps with that.
And then maybe it's also worth - because I think we didn't do this at the beginning - just thinking about what the components of these LLM applications are. So I think you're exactly right; we think of LLM apps as being composed of a base model - that might be a private fine-tuned model, or one of these large public ones... A prompt template, which is usually an instruction to the model that might have gaps in it for retrieved data or context; a data collection strategy; and then that whole thing of data collection, prompt template and model might be chained together in a loop, or might be repeated one after another... And there's an extra complexity, which is that the models might also be allowed to call tools or APIs. But I think those pieces taken together more or less comprehensively cover things. So tools, data retrieval, prompt template and base model are the main components. But then within each of those you have a lot of design choices and freedom. So you have a combinatorially large number of decisions to get right when building one of these applications.
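[Editor's note] The blocks listed above - data retrieval, prompt template, and base model, chained together - can be sketched as a minimal pipeline. Tool calling is omitted for brevity, the documents and the keyword-overlap retriever are toy assumptions purely for illustration, and `model_fn` stands in for any LLM client.

```python
# Minimal sketch of a retrieval + prompt template + base model chain.

DOCS = [
    "Humanloop supports prompt versioning.",
    "RAG combines retrieval with generation.",
]

PROMPT_TEMPLATE = (
    "Answer using only this context:\n{context}\n\nQuestion: {question}"
)

def retrieve(question: str, docs=DOCS, k: int = 1) -> list:
    """Toy retriever: rank documents by keyword overlap with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(docs, key=lambda d: -len(set(d.lower().split()) & q_words))
    return ranked[:k]

def answer(question: str, model_fn) -> str:
    """Fill the retrieved context into the template and call the model."""
    context = "\n".join(retrieve(question))
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    return model_fn(prompt)
```

Each block here is one of the "design choices" Raza mentions: you could swap the retriever for a vector database, version the template, or point `model_fn` at a different provider, all independently.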
One of the things that you mentioned is this evaluation phase of what goes on as helping prevent regressions, because in sort of testing behaviorally the output of the models, you might make one change on a small set of examples that looks like it's improving things, but has sort of different behavior across a wide range of examples… I'm wondering also, I could imagine two scenarios… You know, models are being released all the time, whether it's upgrading from this version of a GPT model to the next version, or this Mistral fine-tune to this one over here… I'm thinking even in the past few days we've been using the Neural Chat model from Intel a good bit, and there's a version of that that Neural Magic released, that's a sparsified version of that, where they pruned out some of the weights and the layers to make it more efficient, and to run on better - or not better hardware, but more commodity hardware, that's more widely available… And so one of the questions that we were discussing is "Well, we could flip the version of this model to the sparse one, but we have to decide on how to evaluate that over the use cases that we care about." Because you could look at the output for a few test prompts, and it might look similar, or good, or even better, but on a wider scale might be quite different in ways that you don't expect. So I could see the evaluation also being used for that, but I could also see where if you're upgrading to a new model, it could just throw everything up in the air in terms of like "Oh, this is an entirely different prompt format", right? Or "This is a whole new behavior from this new model, that is distinct from an old model." So how are you seeing people navigate that landscape of model upgrades?
[36:33] I think you should just view it as a change, as you would any other part of the system. And hopefully, the desired behavior of the model is not changing. So even if the model is changed, you still want to run your regression tests and say "Okay, are we meeting a minimum threshold that we had on this gold standard test set before?"
In general, I think evaluation - we see it happening in sort of three different stages during development. There's the interactive stage very early on, when you're prototyping - you want fast feedback, you're just looking to get a sense of "Is this even working appropriately?" At that stage, eyeballing examples, and looking at things side by side, in a very interactive way can be helpful.
And interactive testing can also be helpful for adversarial testing. So a fixed test set doesn't tell you what will happen when a user who actually wants to break the system comes in. A concrete example of this - you know, one of our customers has children as their end users, and they want to make sure that things are age-appropriate, so they have guardrails in place. But when they come to test the system, they don't want to just test it against an input that's benign. They want to see, if we try, if we really red-team this, can we break it? And there, interactive testing can be very helpful.
And then the next place where you kind of want testing in place is this regression testing, where you have a fixed set of evaluators on a test set, and you want to know "When I make a change, does it get worse?" And the final place we see people using it is actually for monitoring. So okay, I'm in production now; there's new data flowing through. I may not have the ground truth answer, but I can still set up different forms of evaluator, and I want to be alerted if the performance drops below some threshold.
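The regression-testing stage described above can be sketched roughly like this: run a fixed evaluator over a gold-standard test set and flag any change (a prompt edit, a retrieval tweak, or a model swap) that drops the score below a threshold. Everything here is invented for illustration - the evaluator, the test set, and the threshold are all assumptions, and the stand-in model would be replaced by a real LLM call:

```python
def exact_match(output: str, expected: str) -> float:
    """A deliberately simple evaluator; real ones are often fuzzier."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def regression_check(generate, test_set, threshold=0.9):
    """Score every gold-standard case and compare the mean to a threshold."""
    scores = [exact_match(generate(case["input"]), case["expected"])
              for case in test_set]
    mean_score = sum(scores) / len(scores)
    return mean_score, mean_score >= threshold

# Stand-in "model" so the example runs; swap in a real LLM call here.
def fake_generate(text: str) -> str:
    return {"2+2": "4", "capital of France": "Paris"}.get(text, "unknown")

test_set = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]
mean_score, passed = regression_check(fake_generate, test_set)
print(f"score={mean_score:.2f} passed={passed}")  # score=1.00 passed=True
```

The same `regression_check` shape also covers the monitoring case: point it at a rolling window of production traffic with reference-free evaluators instead of a fixed test set, and alert when `passed` flips to `False`.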
So one of the things that I've been thinking about throughout our conversation here, and that's I think highlighted by what you just mentioned in sort of the upgrades to one's workflow, and the various levels at which such a platform can benefit teams… And it made me think of [unintelligible 00:38:31.06] I have a background in physics, and there were plenty of physics teams or collaborators that we worked with - you know, we were writing code - and not doing great sort of version control practices… And not everyone was using GitHub, and there were sort of collaboration challenges associated with that, which are obviously solved by great code collaboration systems of various forms, that have been developed over time… And I think there's probably a parallel here with some of the collaboration systems that are being built around both playgrounds, and prompts, and evaluation. I'm wondering if there's any examples from clients that you've worked with, or maybe it's just interesting use cases of surprising things they've been able to do when going from sort of doing things ad hoc, and maybe versioning prompts in spreadsheets, or whatever it might be, to actually being able to work in a more seamless way between domain experts and technical staff. Are there any clients, or use cases, or surprising stories that come to mind?
[39:46] Yeah, it's a good question. I'm kind of thinking through them to see what the more interesting examples might be. I think that, fundamentally, it's not necessarily enabling completely new behavior, but it's making the old behavior significantly faster, and less error-prone. Certainly, fewer mistakes and less time spent - okay, so a surprising example… Publicly-listed company, and they told me that one of the issues they were having is because they were sharing these prompt configs in Teams, they were having differences in behavior based on whitespace being copied. Someone was playing around with the OpenAI playground, they copy-pasted into Teams… That person would copy-paste from Teams into code… And there were small whitespace differences, and you wouldn't think it would affect the models, but it actually did. And so they would then get performance differences they couldn't explain. And actually, it just turned out that you shouldn't be sharing your code via Teams, right?
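That whitespace story is easy to reproduce: two prompts that look identical on screen can differ in bytes. One hypothetical safeguard (my illustration, not something the guest prescribes) is to fingerprint the exact bytes of each prompt version, so invisible copy-paste differences become visible:

```python
import hashlib

def prompt_fingerprint(prompt: str) -> str:
    """Hash the exact bytes so invisible differences show up as different IDs."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]

playground_prompt = "Summarize the document in three bullet points."
# The "same" text after a trip through a chat app: a non-breaking space
# and a trailing space have sneaked in.
pasted_prompt = "Summarize the document in\u00a0three bullet points. "

print(playground_prompt == pasted_prompt)  # False
print(prompt_fingerprint(playground_prompt) == prompt_fingerprint(pasted_prompt))  # False
```

Comparing fingerprints between the playground version and the deployed version would have surfaced the mismatch immediately, instead of leaving the team chasing unexplained performance differences.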
So I guess that's one surprising example. I think another thing as well is the complexity of apps that people are now beginning to be able to build. So increasingly, I think people are building simple agents; I think more complex agents are still not super-reliable, but a trend that we've been hearing a lot about from our customers recently is people trying to build systems that can use their existing software. An example of this is - you know, Ironclad is a company that's added a lot of LLM-based features to their product… And they actually are able to automate a lot of workflows that were previously being done by humans, because the models can use the APIs that exist within the Ironclad software. So they're actually able to leverage their existing infrastructure. But to get that to work, they had to innovate quite a lot in tooling. And in fact - you know, this isn't a plug for Humanloop. Ironclad in this case built a system called Rivet, which is their own open source prompt engineering and iteration framework. But I think it's a good example of, you know, in order to achieve the complexity of that use case - this happened to be before tools like Humanloop were around - they had to build something themselves. And it's quite sophisticated tooling. I actually think Rivet's great, so people should check that out as well. It's an open source library; anyone can go and get the tool.
So yeah, I think the surprising things are like how error-prone things are without good tooling, and the crazy ways in which people are solving problems. Another example of a mistake that we saw someone make is two different people triggered exactly the same annotation job. So they had annotation in spreadsheets, and they both outsourced the same job to different annotation teams… Which was obviously an expensive mistake to make. So very error-prone. And then I think also just impossible to scale to more complex agentic use cases.
Well, you already kind of alluded to some trends that you're seeing moving forward… As we kind of draw to a close here, I'd love to know from someone who's seeing a lot of different use cases being enabled through Humanloop and your platform, what's exciting for you as we move into this next year, in terms of - maybe it's things that are happening in AI more broadly, or things that are being enabled by Humanloop, or things that are on your roadmap, that you can't wait for them to go live… As you're lying in bed at night and getting excited for the next day of AI stuff, what's on your mind?
So AI more broadly, I just feel the rate of progress of capabilities is both exciting and scary. It's extremely fast; multimodal models, better generative models, models with increased reasoning… I think the range of possible applications is expanding very quickly, as the capabilities of the models expand.
I think people have been excited about agent use cases for a while; systems that can act on their own and go off and achieve something for you. But in practice, we've not seen that many people succeed in production with those. There are a couple of examples, Ironclad being a good one… But it feels like we're still at the very beginning of that, and I think I'm excited about seeing more people get to success with that. I'd say that the most common, successful applications we've seen today are mostly either retrieval-augmented applications, or more simple LLM applications. But increasingly, I'm excited about seeing agents in production, and also multimodal models in production.
In terms of things that I'm particularly excited about from Humanloop - I think it's us becoming a proactive rather than a passive platform. So today, the product managers and the engineers drive the changes on Humanloop. But something that we're going to hopefully release later this year is a system where Humanloop itself can start proactively suggesting improvements to your application. Because we have the evaluation data, because we have all the prompts, we can start saying things to you like "Hey, we have a new prompt for this application. It's a lot shorter than the one you have. It scores similarly on eval data. If you upgrade, we think we can cut your costs by 40%", and allowing people to then accept that change. And so going from a system that is observing, to a system that's actually intervening.
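The decision rule behind that kind of proactive suggestion could look roughly like this. To be clear, this is my own sketch of the idea, not Humanloop's implementation; the field names, numbers, and tolerance are all invented for illustration:

```python
def suggest_upgrade(current, candidate, score_tolerance=0.01):
    """Suggest swapping prompts when quality holds and token cost drops."""
    score_ok = candidate["eval_score"] >= current["eval_score"] - score_tolerance
    savings = 1 - candidate["tokens"] / current["tokens"]
    if score_ok and savings > 0:
        return f"Candidate scores similarly and cuts prompt cost by {savings:.0%}."
    return None  # no suggestion worth surfacing

# Invented eval results for the current prompt and a shorter candidate.
current = {"eval_score": 0.91, "tokens": 1200}
candidate = {"eval_score": 0.90, "tokens": 720}
print(suggest_upgrade(current, candidate))
# Candidate scores similarly and cuts prompt cost by 40%.
```

The interesting design choice is the human in the loop: the system only proposes the change, and a person accepts it - which is what distinguishes "intervening" from silently rewriting production prompts.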
That's awesome. Yeah, well, I definitely look forward to seeing how that rolls out, and I really appreciate the work that you and the team at Humanloop are doing to help us upgrade our workflows, and enable these sort of more complicated use cases. So thank you so much for taking time out of that work to join us. It's been a pleasure. I really enjoyed the conversation.
Thanks so much for having me, Daniel.
Our transcripts are open source on GitHub. Improvements are welcome.