Optional: Live Office Hours 10
Oct 28, 2025, 9:00 PM - 10:00 PM GMT+5:30
Audio Transcript
Robert Lavigne
Hello, hello.00:00:30
Hamel Husain
Hello.00:00:40
Trying to test my AirPods. Is it… hold on a second. Hello. Someone talk?00:00:43
Chris Iverson
Testing, testing…00:00:56
Hamel Husain
Hello?00:00:58
Hello?00:01:00
Aditya Sethuraman
I can hear you.00:01:02
Hamel Husain
Alright, yeah, let's go ahead and kick off, this Office Hours.00:01:04
Thanks for coming.00:01:08
We can do it the same like we have been doing.00:01:09
And, yeah.00:01:13
Raise your hand if you have a question.00:01:18
And we can… We can discuss.00:01:20
Alright, Dmitry.00:01:37
Dmitry
Hey, good to meet you guys. I had a question about evaluation when there's personalization in the context.00:01:39
So, you know,00:01:48
agent that is answering questions, where there's a bunch of, kind of, like, user context, could be user preferences, could be past interests, things like that. You know, I think it's,00:01:52
probably, like, two kinds of problems, right? One is just personalization itself, and, you know, whether this matches a user's interest is kind of one thing, and then the other thing is whether the LLM is using it effectively. So wondering…00:02:01
You know, any kind of general tips and tricks for how you think about that.00:02:14
Hamel Husain
Personalization?00:02:19
Dmitry
Yep.00:02:20
Hamel Husain
So…00:02:20
There's not really anything too, I would say, special about it, it's just, like, you need to make sure that you have the…00:02:23
the context…00:02:30
there in your traces, which you should, because you're gonna be giving the LLM some context about the user.00:02:32
So that you can personalize the response for them. And then from there, it's basically the same thing, of, like, doing error analysis.00:02:41
So there's not too much, you know…00:02:48
When you do, maybe, like, synthetic data generation, you might want to think about different personas. You might want to see if…00:02:54
Like, your personalization, is it… do you have…00:03:00
Do you see certain themes in, like, the users that you can then use to, like.00:03:03
Formulate hypotheses about how your app might break.00:03:09
And look for those.00:03:13
But other than that, there's not too much that…00:03:14
Or I can't really think of anything…00:03:17
Dmitry
Yeah, so you're basically just thinking of it as, you know, in the same way that your… whatever your user input is, is one part of the context, or any other part of the context, personalization is just another part of the context here, and making sure that you're covering a wide enough range of distribution.00:03:19
Hamel Husain
Yeah.00:03:34
Dmitry
Okay.00:03:35
Good. Yeah, makes sense.00:03:37
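A minimal sketch of the idea discussed here: keep the user context inside each trace, and cross a few persona dimensions to get synthetic inputs that cover a wider range. The `Trace` fields, the persona dimensions, and the example queries are all hypothetical placeholders, not anything named in the session.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Trace:
    """One record to review later; user_context is whatever the LLM was given."""
    user_query: str
    user_context: dict
    response: str = ""


# Hypothetical persona dimensions; replace with themes seen in your own users.
PREFERENCES = ["budget", "luxury"]
HISTORY = ["new_user", "long_history"]
QUERIES = ["Recommend a weekend trip", "What should I read next?"]


def build_synthetic_inputs():
    """Cross every persona dimension with every query."""
    return [
        Trace(user_query=q, user_context={"preference": p, "history": h})
        for p, h, q in product(PREFERENCES, HISTORY, QUERIES)
    ]


traces = build_synthetic_inputs()
```

Each generated trace carries its persona, so during error analysis a failure can be read next to the context that produced it.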
Hamel Husain
Robert.00:03:40
Robert Lavigne
So, my question is RAG-related. Let's take a scenario.00:03:42
You have a corpus of PDFs, and you're creating a tool to convert those PDFs into chunks that you would, you know, feed into your RAG via SQL or via vector or whatever. But fundamentally, the issue is more getting the PDFs00:03:47
converted into a consistent format.00:04:02
And this is actually something a client of mine brought up in a conversation a couple of days ago, so I'd like to get your thoughts on it. Problems and solutions that you might have run into in the past, getting00:04:05
your data clean, I guess, is the best question, through PDF converters, because sometimes it messes up tables, sometimes the ordering is wrong. Kind of, you know, things that you may have run into in the past, and maybe give some guiding direction. This is less about the LLM and more about data prep, right? Getting that data00:04:16
structured in such a way before you ingest it into whatever RAG that you have. And I'm finding most people that I'm talking to, their data is not clean to start with, so their RAG is not the issue, it's the organization of their data, how it got pulled, where some of the context may not have gotten00:04:35
collected with the table, as an example. So, general wide question there, but relating to, let's say, PDFs specifically, because there's a lot of variety in PDFs.00:04:54
Yeah, data is always a problem.00:05:05
Hamel Husain
Yeah.00:05:08
Yeah, I mean, the comment that data is always the problem is true. It, like, predates LLMs, you know, if you don't have your data00:05:10
clean,00:05:19
it affects everything. So…00:05:20
Yeah, you have to look at it, you have to look at your pipeline when you're debugging things, and like, if the data is being extracted correctly.00:05:24
There's not some magic trick I have in mind. You know, try to use good00:05:33
PDF extraction. My favorite one is… there's this company, Datalab.00:05:41
You might want to look into it. They have some open source models, and they host them as well. It's pretty good.00:05:49
The founder is someone that I used to work with, so I like it.00:05:57
Other than that, I can't really think of…00:06:04
Yeah, you have to make sure your data is being cleanly extracted and iterate on that problem.00:06:06
So…00:06:11
Robert Lavigne
Can you put a link to that Datalab? I just did a quick query there, and it's not coming up, so maybe it's a more complicated name than just Datalab?00:06:14
Or is it datalabplatform.com, probably?00:06:23
Yeah, that's probably it there, right?00:06:26
Hamel Husain
It's datalab.to.00:06:29
Robert Lavigne
The .to? Okay. Yeah.00:06:31
Thank you.00:06:33
Yeah, thanks.00:06:37
Hamel Husain
Yeah, no problem.00:06:38
Mirko?00:06:39
Mirko
Hi guys, so I have a quick question on annotating traces.00:06:41
My question is related to…00:06:47
the option to give a thumbs up, thumbs down00:06:51
to the user, essentially, to say whether an execution of the AI was good or bad.00:06:57
So I'm trying to kind of figure out where that fits into the picture of the error analysis approach that you have laid out in the course. Would you consider that thumbs up, thumbs down to be sort of like a…00:07:02
An outdated, primitive alternative.00:07:13
Or do you see that to be potentially a…00:07:16
I don't know, like, a first step of potentially getting a list of traces that, according to users, were not good.00:07:20
And then using those traces to kind of narrow down into, like.00:07:27
Like, what exactly was wrong, and then start the proper error analysis based on that subset, if you want.00:07:31
Hamel Husain
Yeah. So first of all, see if you can collect something better than a thumbs up, thumbs down from the user. Sometimes you can't, sometimes it's not feasible.00:07:39
You know, you're the best judge of, like, how much patience your users have and all that stuff.00:07:47
But I would say the thumbs up, thumbs down is just a filter for you to then potentially apply your own open coding.00:07:52
So, yeah, it's another signal that you can use to sample00:08:02
in higher-probability areas of failure.00:08:06
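A minimal sketch of using thumbs-down feedback as a sampling filter before open coding. The `feedback` field name and the 70/30 split are assumptions for illustration; the only point taken from the discussion is that user feedback steers which traces get reviewed, while the review itself is still done by hand.

```python
import random


def sample_for_review(traces, n, down_share=0.7, seed=0):
    """Pick n traces for manual review, oversampling thumbs-down ones.

    `traces` is a list of dicts with an optional "feedback" key
    ("down", "up", or missing). Field names are hypothetical.
    """
    rng = random.Random(seed)
    down = [t for t in traces if t.get("feedback") == "down"]
    rest = [t for t in traces if t.get("feedback") != "down"]
    # Cap each bucket by what is actually available.
    n_down = min(len(down), round(n * down_share))
    n_rest = min(len(rest), n - n_down)
    return rng.sample(down, n_down) + rng.sample(rest, n_rest)
```

Keeping some traces without a thumbs-down in the sample is a deliberate choice here: users do not flag every failure, so reviewing only flagged traces would miss problems nobody reported.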
Mirko
Right.00:08:14
Okay, makes sense. Makes sense.00:08:16
Thanks.00:08:19
Hamel Husain
Yeah, no problem.00:08:20
Anubha.00:08:30
Anubha Saxena
Yeah, it's so…00:08:32
I don't know, I… I… I've been struggling with, gaining confidence on the dataset that I've created. I don't know if it covers, like.00:08:35
a lot of data points, if those data points make sense or not make sense, so I do have free users' data, but how do I ensure that it is covering all the different use cases that I want to cover?00:08:45
And we have, like, thousands of data points. I can go through them, sample them, but…00:08:59
okay, I'm like, at the end, I don't get that confidence, so it is not really, like, the problem with data set and the error analysis and stuff, it's more about how do we gain confidence on what we're doing, so that we can confidently ship something, you know?00:09:05
Hamel Husain
Yeah.00:09:21
So you can only have so much confidence if you're in a regime where you're synthetically generating data. I mean, hopefully you have slightly more confidence than not synthetically generating data and not looking at anything.00:09:22
You know, but…00:09:34
Yeah, it's… you don't know what your users are gonna do, they're gonna behave in, like, different ways. And AI can't really save you that much in terms of synthetically generating what a user might do. It can to some degree, but…00:09:38
You know, humans…00:09:52
they tend to do things that are different. So, you know, the best way is, like.00:09:54
You said you had, like, free user data?00:10:00
Maybe it sounds like maybe that's not representative, or, you know, that's not enough.00:10:03
Yeah, I mean, nothing beats, like, real users, like, if there's, like, some kind of design partners that you have, or some early, like, friendly users that are kind of…00:10:09
more… You know, like, using the product… Like,00:10:19
deliberately… very deliberately with you. And, you know, you can kind of see what they're doing.00:10:27
And you can, like, refine your synthetic data, and say, okay, like, this is, like.00:10:33
Because, like, what… so the way you gain confidence is you're generating synthetic data, and then you… as you release your product.00:10:39
you find that, oh, like, my synthetic data actually covered a lot of these things already. Like, I anticipated these things. So you start to build some confidence.00:10:46
Whereas, and it's a process, but, it's hard to, like, have confidence unless you get that feedback of, like, oh, okay.00:10:56
this test suite is reasonable. And you, you know, you likely will not have the best data to begin with. There'll be things, like, you didn't think about.00:11:05
And so…00:11:14
It's just a matter of, like, okay. It's hard to have confidence if you don't ship it.00:11:17
Anubha Saxena
Yeah.00:11:22
Hamel Husain
It's a little bit of a chicken and egg problem, but…00:11:23
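One way to make this feedback loop concrete is to track how many failure categories seen after release were already anticipated by the synthetic test set. This is a sketch under the assumption that both sets have been labeled with category names during error analysis; the function name and labels are hypothetical.

```python
def coverage(synthetic_categories, observed_categories):
    """Return (share of observed categories anticipated, categories missed)."""
    observed = set(observed_categories)
    if not observed:
        return 1.0, set()
    anticipated = observed & set(synthetic_categories)
    missed = observed - anticipated
    return len(anticipated) / len(observed), missed
```

A rising share over successive releases is one signal that the test suite is reasonable, and the missed set is a list of scenarios to add.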
Anubha Saxena
Yeah.00:11:25
Another sort of related… related to datasets, if no one else has questions.00:11:27
So…00:11:34
we have this dataset. The idea that I have in mind is that when we are going through user data, we get more failure modes, and then we try to expand this dataset with the new use cases that we found.00:11:35
And we kind of, like, do that indefinitely, or till the product lifetime.00:11:48
This dataset is gonna grow, so if you make changes… you've used Braintrust, so we have, like, input, output, and expected.00:11:56
So, if the product changes.00:12:05
the expectation would change, and then you have to go through all this, like, data set, huge data set, with more than hundreds of data points, and update the expected values so that your evals now make sense. Do you have, like, any way around it, or any way to simplify this?00:12:08
Hamel Husain
So, why are there expected values in everything? Is your application very deterministic? Like, you have only reference-based evals?00:12:27
Anubha Saxena
So, for some things, I do. So, for example, if I want to generate a diagram, the first step in the pipeline is to identify which kind of diagram I want to generate, and that is a bit deterministic.00:12:39
So I give it, like, a list of all the diagrams that we support, and then it says, okay, this… for this user prompt, maybe you need to generate, like, flowcharts, so it's like a single… like, it's… it gives me JSON, so it gives me, like, the detected type, and then explanation and error, if it does not support.00:12:53
Hamel Husain
Okay, so, like, for a diagram or flowchart.00:13:13
When you update your product, you're getting a new expected flowchart that's different than the old one?00:13:18
Anubha Saxena
So, I was working today, in fact, to improve this thing.00:13:26
And I realized that it is not giving me the right diagram type. So I was changing the prompt.00:13:32
And then I realized that when I would look at the explanation the LLM would give, I would actually think that, you know what, maybe for this particular use case, it does make sense that it, like, gives me a different kind of diagram.00:13:40
In a way. So I was just, like… I realized that I'm changing the expectation as well. I'm not sure if you're supposed to do that, but to me, it made sense. And then I was thinking that after a few months, if I still, like, find that.00:13:54
to be happening, how would I be able to do that in such a huge dataset?00:14:09
Hamel Husain
Yeah. So it sounds like the expectation is a little bit loose.00:14:14
Like, you have a very specific… diagram in there?00:14:20
And it's meant as a reference, but it's not…00:14:25
like, you're using it as an assertion, but it's meant as a reference of sorts. Like, here's a diagram, you can think of a better diagram, but here's, like, an example of an acceptable diagram.00:14:29
Is that correct? And then, like, if you come up with a better diagram, then you can…00:14:40
Then, like, of course you can… It's not wrong.00:14:45
But, you have some… you have your expected diagram, I think, it sounds like, as a reference? Is that… am I understanding correct?00:14:49
Anubha Saxena
That is correct. One of the use cases that I just thought of, is that we currently only support, for example, 5 diagrams, and we try to… we try to fit the user prompts, or the…00:14:57
output into these 5 diagrams, but in future, maybe we'll support 10 diagrams. So even though something00:15:10
Was fit to a flow chart.00:15:17
We probably now have, like, a better diagram fit for it.00:15:21
And then we're gonna have to go through all the data points and figure out, like, what the mappings are, or expected values are.00:15:25
Hamel Husain
Yeah.00:15:31
So, the solution here… this is a great question. So, the solution in this situation is see if you can generalize the eval a bit. Like, right now, it seems like it's really narrow. Like, you're trying to say, okay.00:15:32
Does this user query result in this very specific diagram?00:15:45
But, you know, think of what you are doing as a human being. You're saying,00:15:51
okay, there's an expected diagram. It generated this other diagram.00:15:57
And you like that diagram more. And you're like, oh, the reference, let me update the reference, because this new thing is better, right?00:16:02
Well, that illustrates a problem with the eval itself.00:16:10
Like, that reasoning that you are doing00:16:15
should be the eval, right? So, not the very specific diagram, but, like, okay, how do you know what a good diagram is? Like, it needs to have, like, X, Y, and Z facts.00:16:20
It needs to be one of these00:16:31
types of diagrams. And, like, being one of, like, X types of diagrams, that could be just in code or something, that doesn't have to, like.00:16:35
you know, be in some kind of expected value. But if you relax it a bit.00:16:42
not trying to assert a very specific diagram, but you try to incorporate the reasoning of what is a good and bad diagram, now you can have something, like, a lot more flexible that sort of incorporates, like, your knowledge into the eval.00:16:48
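A minimal sketch of a relaxed check along these lines: instead of comparing against one expected diagram, assert in code that the type is one of the supported types and that the required facts appear. The set of types, the `detected_type` and `diagram` field names, and the substring matching are assumptions for illustration; judging whether a diagram is good beyond these checks would still need a human or a separately validated judge.

```python
# Hypothetical list of supported diagram types.
SUPPORTED_TYPES = {"flowchart", "mind_map", "sequence", "architecture", "er"}


def check_diagram(output: dict, required_facts: list[str]) -> list[str]:
    """Return a list of failure messages; an empty list means the check passed."""
    failures = []
    if output.get("detected_type") not in SUPPORTED_TYPES:
        failures.append("unsupported diagram type")
    body = output.get("diagram", "").lower()
    for fact in required_facts:
        # Naive substring match; a real check might parse the diagram source.
        if fact.lower() not in body:
            failures.append(f"missing: {fact}")
    return failures
```

Because the check lists criteria rather than one reference answer, adding a new diagram type means extending `SUPPORTED_TYPES` instead of relabeling every data point.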
Anubha Saxena
Yeah, I think it's gonna be a bit difficult, because the pipeline goes with, like, you detect the diagram type, and then if it is a flowchart, you try to generate a flowchart, and if…00:17:07
it is not a flowchart. So, for example, if I say, give me,00:17:19
System architecture of e-com website, and if this auto-detection part fails, and it says you need a mind map, it's just gonna create a mind map, and it's not gonna make sense, right?00:17:24
So, yeah, I think…00:17:38
I got what the problem is, like, maybe my evals don't need to be too strict, but… yeah.00:17:41
Hamel Husain
Yeah, so if you have, like, a step that is00:17:50
Trying to route or classify, like, what kind of…00:17:53
diagram should be there. That can be an eval in itself.00:17:56
That says, like, okay.00:18:00
Yeah, like, for this kind of problem, this should be the diagram, or this should be the route.00:18:04
But try to, like, generalize as much as possible.00:18:10
the eval, so that you don't get stuck. Because if you go… if you make it too specific, then yeah, it can't… then it's too brittle.00:18:14
Basically trying to replace yourself with the LLM to some extent.00:18:22
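A minimal sketch of evaluating the routing step on its own, with a set of acceptable diagram types per prompt rather than a single expected value. The test cases and `keyword_router` are hypothetical; the router is a deliberately naive stand-in for the real LLM classification call.

```python
CASES = [
    {"prompt": "System architecture of an e-commerce website",
     "acceptable": {"architecture", "flowchart"}},
    {"prompt": "Brainstorm marketing ideas", "acceptable": {"mind_map"}},
    {"prompt": "Steps in our refund process", "acceptable": {"flowchart"}},
]


def keyword_router(prompt: str) -> str:
    """Stand-in for the LLM router; swap in the real classification call."""
    text = prompt.lower()
    if "architecture" in text:
        return "architecture"
    return "mind_map"  # naive default, so some cases fail on purpose


def routing_accuracy(cases, route) -> float:
    """Share of cases where the routed type is in the acceptable set."""
    hits = [route(c["prompt"]) in c["acceptable"] for c in cases]
    return sum(hits) / len(hits)
```

Listing several acceptable types per prompt is what keeps this eval from becoming brittle when a newly supported diagram type is also a reasonable answer.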
Anubha Saxena
Okay, makes sense. Thank you.00:18:30
Hamel Husain
Yeah, no problem.00:18:33
Abhishek?00:18:36
Abhishek Panda
Hey, hi, Hamel.00:18:38
My question is not about the custom UI; that I will ask you tomorrow or the day after. My question is about synthetic data generation. So, last time when we spoke, we created scenarios, and we were able to synthetically generate the data. Now, considering it's a website builder kind of agent,00:18:39
now we can access the web elements as well, like, I can access the screenshots, I can access the full HTML content as well, or, you know, so…00:18:56
now, Hamel, I'm a bit confused, like, what modality or what…00:19:05
data should I consider? Because when we are a user interacting with the agent, right, you can see and you can visualize. Now, for an agent to understand that, because this is a user agent which is interacting with the website builder agent, and we are trying to create large synthetic data, like synthetic traces, so that we can do failure analysis offline.00:19:09
So…00:19:29
Now, how to model this conversation? Like, it was easy when it was only the JSON and the scenarios. Now, when we are accessing the web elements, we can access00:19:31
It's about how to couple them together. Like, what form of data shall I use? I mean, yes, there are ways, right? I mean, images is also there, HTML content is also there, so what are your thoughts? I mean, in this kind of ambiguous situation, what is your thought?00:19:40
Hamel Husain
Yeah.00:19:56
Synthetic data generation would be of very limited use in your use case. You can do some things, but you're talking about very long-running conversations that have00:19:57
Abhishek Panda
Yeah, like, multi…00:20:10
Hamel Husain
of turns, so you can't synthetically generate that. It's not gonna work.00:20:11
6 to 10.00:20:15
Okay, yeah, even 6 to 10 can be hard. Any… beyond, like, 3 or 4 is gonna start to get…00:20:16
I don't have a…00:20:23
Abhishek Panda
Good edit.00:20:24
The number of turns, thinking some kind of quitting strategy. Tell me if we are on the right path, or if we are over-experimenting.00:20:25
Hamel Husain
Well, I'm just saying, like, synthetic data… you know.00:20:33
It's gonna be hard to simulate a human. It's like, at some point, it gets really hard to simulate your users.00:20:38
Because if you can truly simulate a human.00:20:45
that's kind of AGI-ish. Like, we really… that's like a… like, we're not there yet, is what I'm trying to say. So we need to… we need to… it's… it's gonna be a lot better to,00:20:48
You can try to simulate a human, it's just going to… you know…00:21:01
It's gonna be difficult, like, what you want to try to do is use, like, real user data.00:21:07
And bootstrap off of that. And let, you know, like, early users.00:21:13
Especially in this context, where there's a lot of nuance, and a lot of…00:21:18
like, developer tool workflows, like, I think it would be very difficult to simulate. Because, like.00:21:24
Yeah, my intuition for developer tool workflows, like website builders and stuff like that, is that simulation is going to stay on, like, a very happy path.00:21:30
And there's a lot of, like, unexpected things that happen00:21:38
when you're coding,00:21:43
when they're trying to build something with AI,00:21:45
that are very specific to the user's environment and so on, so there's a lot of different footguns00:21:47
that I think might not surface, so I would say don't get too crazy with simulation.00:21:52
Try to get real user data00:21:59
with, like, design partners.00:22:04
Abhishek Panda
So, Hamel, how we created this: we tried to partner with the design partners, and they helped us with how we can create scenarios, okay, like, a mind map of how we can create scenarios. So, in step one, we created a JSON, like, a template for how we can define the scenarios. So, then we started.00:22:07
Hamel Husain
Okay, wait, sorry, let me just make sure this is clear. When I say design partners, I don't mean a consultant.00:22:25
Or someone like me that tells you what a thing is. Like, someone using the product. Someone, like, that's using it themselves.00:22:30
Abhishek Panda
Yeah, I mean…00:22:38
Hamel Husain
To do it.00:22:38
Abhishek Panda
And that product, so… Okay. That's sitting as design partner, yeah.00:22:39
Hamel Husain
Okay.00:22:44
Abhishek Panda
Because they know their customers well, right, because we are on the AI team. Now, we created scenarios, and then we said that, okay, only the conversation is not enough, you know, we want the web elements access as well. And then, along with the soft engines, we were able to access the elements. Right now, I mean, we can see, we can access it, though. That's what I was a bit confused about. Shall I scrape the full HTML content00:22:46
and try to do it? I've already tested that; however, I have tested only a few scenarios.00:23:09
It's working, but I have tested for, like, 2 to 4. Yes, it is able to generate conversation, and it is coupled with the web elements, okay? It's not like the conversation is something else, and there is something else in the web element.00:23:13
But, Hamel, how to make a trustable pipeline, that's what my question was to you.00:23:28
Because users are new, Hamel, I mean, and a lot of enhancement has been done, so a larger number of new users would come into the picture.00:23:37
So that's why we want to test it offline.00:23:49
Hamel Husain
Yeah, I mean, there's not any…00:23:55
shortcut that I can think of, like, especially, you know, for this kind of tool, what I would do, personally, is I would…00:23:58
the first 3 or 4 users, I would…00:24:04
I kind of do it with them.00:24:08
If you're in the situation where you're really trying to launch something and you're not sure.00:24:10
And, like, you've tested it yourself, which it sounds like you have, but you really want to, like, know if it works?00:24:14
Is I would… I would do that.00:24:21
in, Fine in.00:24:24
Abhishek Panda
Okay.00:24:28
now going to LLM judge, right? Or, you know, if this is clear here itself, then, you know, LLM Judge, whatever the process we have, or, you know, there also, we need some kind of modification, because as of now, we have just…00:24:30
created the prompt, like, what I learned from the course, that how we should design, like, a better prompt with right kind of context.00:24:42
And it is providing score with reasoning.00:24:50
Okay, so… but that was for the previous step, where we only had conversation.00:24:52
No…00:24:58
Now, now, I mean, shall I only test in my automated evaluator? My question is, shall I only test in terms of,00:24:59
I would say, I mean, agent conversation, like, what the response it is giving, or, you know, to see, you know, web elements. Then that would be complete agent testing altogether.00:25:09
I don't know, what would be your position for LLM judge?00:25:20
If I can access the web elements. I mean, right now, we can access the web elements.00:25:24
Hamel Husain
Yeah, your LLM judge has to be able to see everything that your LLM sees in production.00:25:28
Abhishek Panda
Oh, okay.00:25:34
Hamel Husain
It has to have the same exact context.00:25:36
Abhishek Panda
Okay.00:25:38
But, Hamel, under the hood…00:25:40
Reasoning and all those, I shouldn't pass into LLM judge, right? I mean, why assistant come up with that output?00:25:42
those tool callings… tool callings, I should, but reasoning?00:25:50
Hamel Husain
You might not need it.00:25:57
You might not.00:25:58
If you…00:26:01
It depends, like, is it… sometimes it can be harder to get rid of it than it is to, you know, keep it. Like, it depends what your pipeline is. And if you're running out of context window and stuff like that, okay, I get it. You basically want to know00:26:04
you know, Yeah, it could be interesting, but yeah,00:26:21
it all depends on the alignment that you can give the LLM as a judge. So, you know, this is, like, a meta-evaluation problem. So, like.00:26:29
In the… in the… at the end of the day.00:26:38
you know, it's kind of… we're guessing what will work best for your LLM judge to align it.00:26:41
or not.00:26:46
So, you might want to try both ways and see which one is aligning better with you.00:26:47
Abhishek Panda
Hamel, wouldn't it be biased? I think someone has put that question as well: wouldn't our automated evaluator be biased if we, you know, give the reasoning as input as well?00:26:53
Hamel Husain
Potentially, but you would detect the bias.00:27:04
Because you're gonna measure.00:27:07
Against your labels.00:27:08
Shreya Shankar
Yeah, I think the broader point is that you don't know definitively for every single application. Should I use reasoning?00:27:11
in the LLM judge, or should I not?00:27:18
In some cases, I could imagine it is helpful. Maybe there's a logical error in the reasoning or something, and then now the judge is able to see that and detect that that's an error.00:27:21
In other cases, I could believe, yeah, maybe it's biased, I don't know, the judge… it depends on the application, right? So I think Hamel's point is, like, can you just experiment with both: with the reasoning components, without them, with all of the context, without some of the context, and then figure out what has the highest00:27:31
Alignment with your labels.00:27:48
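A minimal sketch of that experiment: score each judge variant against the same human labels and compare agreement on passes and on fails separately. The label lists below are made-up placeholders to show the shape of the comparison, not results from any real judge; `True` is assumed to mean "pass".

```python
def alignment(human, judge):
    """Return (true positive rate, true negative rate) of judge vs. human labels."""
    tp = sum(h and j for h, j in zip(human, judge))
    tn = sum((not h) and (not j) for h, j in zip(human, judge))
    pos = sum(human)
    neg = len(human) - pos
    tpr = tp / pos if pos else float("nan")
    tnr = tn / neg if neg else float("nan")
    return tpr, tnr


# Placeholder labels for illustration only.
human_labels = [True, True, False, False]
judge_with_reasoning = [True, True, True, False]
judge_without_reasoning = [True, False, False, False]

for name, verdicts in [
    ("with reasoning", judge_with_reasoning),
    ("without reasoning", judge_without_reasoning),
]:
    tpr, tnr = alignment(human_labels, verdicts)
    print(f"{name}: TPR={tpr:.2f} TNR={tnr:.2f}")
```

Reporting the two rates separately, instead of a single accuracy number, makes a bias in one direction visible, which is how a judge skewed by the reasoning would show up against the labels.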
Abhishek Panda
Alright, so, Hamel and Shreya, then would it also depend on what kind of model I am using? Because I have, you know, data coming from different modalities.00:27:53
So, I mean, how good is the model? That would also depend when I am trying to couple all of this information, like the conversation, tool calling, I mean, the web elements, it may be a screenshot as well of the web UI, and also other stuff.00:28:04
Shreya Shankar
Yeah? Again, it's an experiment thing, right? Like, I've never seen a person who writes the LLM judge00:28:20
And the first time they write it, it's perfect, right? I just know, even every time I do it, experts do it, anybody does it, I'm gonna have to go through, like, many, many iterations to get good alignment, and I'm gonna have to experiment with all of these levers. Like, what's the model type, how many few-shot examples did I put in? How do I carefully phrase failure mode? Like, you just… you gotta do it. I mean, that's also kind of why I hate LLM as Judge, because there's so much effort to make it work.00:28:27
But if you do all of these things, as long as you get good alignment at the end of the day, right, it doesn't really matter00:28:54
What it is, what are all the components of it, how you got there.00:29:01
Abhishek Panda
Got it, got it.00:29:05
Cool, cool, then.00:29:08
Hamel Husain
We had someone raise their hand, but maybe they disappeared.00:29:09
Oh, there we go. Ellen.00:29:15
Ellen Wong
Oh, I just… I wasn't the one that raised their hand, but I do have a question. And…00:29:17
it's probably, so… so at work, we are starting to do some AI-powered features with LLMs, and we're in the very, very beginning of it. As in, when I asked how… how do… how do we validate or know we're, you know, building the right thing.00:29:24
the answer was really, we would just take a look. That's the level, so the level is definitely not nearly as advanced.00:29:41
And one of the current challenges, we are in a beta phase of building something like… we're trying to generate insights to help our customers make sense of their data.00:29:49
And, the issue is, though, the usage of this particular feature is very, very low.00:30:01
And so, like, when I'm trying to approach the team and say.00:30:09
well, how do we improve this? There seems to be many paths to choose from based on conversation thus far. Like, we can try, like, just using it ourselves, though we're not the target00:30:13
persona. We don't really quite have, like,00:30:27
a beta program, or it's not a vibrant one, given how low the usage has been.00:30:31
I think synthetic data would feel like we don't even know if it's gonna help us or not.00:30:36
And so, like.00:30:44
I'm learning all of these tools, I'm a little bit, like, unclear. Like, even if I go to the team and say, let's start with error analysis.00:30:47
It's just, like, we have so few data points that it feels, like, a little bit…00:30:56
unclear what I should ask of the team.00:31:01
So, yeah, I don't know if this… yeah.00:31:04
Hamel Husain
Yeah.00:31:08
Ellen Wong
there.00:31:09
Hamel Husain
Is it, like, kind of like a data analysis?00:31:10
Copilot, or something like that?00:31:12
Ellen Wong
Yeah, yeah, so within our product, we basically help customers, like the NFL and MLB, for example, engage their customers with, like, messages, email, and so on and so forth.00:31:15
And in our dashboard, there could be, like, how many emails were received, like, how many were clicked through, and that kind of data. And so the AI feature that we're building is just trying to call out, oh, maybe there's anomalies, or maybe there are certain things to pay attention to, or this particular campaign was not very successful, maybe take a look, like.00:31:26
That kind of insight.00:31:49
And of course, it really depends on the customer's, data.00:31:51
Which… yeah, so that… that's the… the goal of it is to provide value for… for people who are using the platform to engage their customer.00:31:55
But right now, for one reason or another, it's just people are not clicking on it. So we… we… we don't really know why, because it's one of the most requested features. Like, one of those things where, like, hey, you guys, you wanted it. So…00:32:07
Yeah, that…00:32:25
Hamel Husain
Now, okay, so I don't know everything about the product, but I'm trying to infer from listening.00:32:28
is… so… When you have a feature in a product that says.00:32:36
Push this sparkle button, and we'll give you insights.00:32:43
That kind of has been a genre of product for a while, even before, like, AI, of like, we are gonna give you insights.00:32:47
And that's very hard to do, because the notion of insights00:32:57
is very contextual.00:33:08
It's hard. It's kind of like dashboards.00:33:11
You know, you show a dashboard to an executive or a team, and you're like, oh, this is amazing, it tells the story. If only I had this dashboard, I could have these same insights. No. That dashboard was created to tell this one story that we, like, dug up.00:33:16
And so…00:33:33
what I'm trying to say is, and I'm just inferring from what you're saying, I could be reading too much between the lines, it could be useful to try to figure out, like, a…00:33:35
specific use case, or, like, a very specific… sort of…00:33:46
within, like, this realm of insights, okay, like…00:33:53
some very specific piece of data, or very specific report, or something that we can communicate to the user, say, okay, do you want to see this very specific thing?00:33:56
XYZ, ABC. We think it'll help you with whatever: decrease churn, increase…00:34:09
your close rate. I don't know, I'm just making this up, I don't know anything about your specific product. That way, you can, like, start with something very concrete.00:34:18
And then, hopefully, it, like, will help people click on it. Because if I see, like, click on this for insights.00:34:29
I mean, I might click on it just because I'm, like, a data person by training, so that always intrigues me a little bit, but I've never seen that go well.00:34:38
If that makes sense.00:34:49
Ellen Wong
Yeah, no, that… that makes a lot of sense. I, I think…00:34:53
what you actually just said gave me an insight into why we find it so difficult to even approach it beyond just looking, is because it's so general. It's literally clicking a sparkly button, it gives you insights of whatever, based on what LLM comes back with, and that makes error analysis really difficult, like.00:34:57
And so, I think… Yeah, I think kind of going back to the…00:35:17
Product discovery, what are questions we're trying to answer, narrowing on a specific use case will also make error analysis a lot… a lot more approachable.00:35:21
Thank you.00:35:32
Hamel Husain
Anyone else? Oh, okay, there we go. Chris.00:35:41
Chris Iverson
Hey, this might be a little outside the scope of what's intended for the course, but I was just curious,00:35:45
Hamel, do you have a framework for how you advise clients on information privacy concerns? Like, were to say the spectrum of…00:35:53
When it might be wise to host your own, or have an… enterprise… LLM that… in theory.00:36:03
preserves, or… What am I trying to say? Like, I understand from…00:36:12
Masking information that may be personally identifiable, but, like, from a competitive perspective, do you ever see a risk00:36:20
of… Exposing patterns to these giant models that might00:36:29
Eat away at the competitive advantage that you might be offering your customer.00:36:35
Hamel Husain
I'm not an expert in data privacy. I do have some general thoughts, like.00:36:43
Okay, privacy doesn't live in a vacuum, you know? There's privacy, there's cost, there's, time to market.00:36:48
There's engineering complexity. All of these things are kind of…00:36:57
considered jointly, if you know what I mean.00:37:01
And oh, and there's also capability. There's, like, you know, how good your models are.00:37:04
And so… Sometimes, people can get carried away.00:37:10
With privacy? Say, oh, we need privacy.00:37:15
you know, but they can sacrifice a lot of those other dimensions, to the extent where it can be detrimental. So you have to be really clear on, like, what those other… all these things are jointly.00:37:20
And, you know, like, Using an open… model…00:37:32
Okay, like, hosting your own LLMs is non-trivial.00:37:39
there's infrastructure that can do it, but… yeah, it's non-trivial. If you don't have, like, lots of throughput and stuff like that, it's gonna cost you. It probably will cost you more, on average, than, like, a lot more.00:37:43
Because, like, you know, large model providers are…00:37:55
Able to aggregate over everybody, and, you know, do things that you can't at, like, a small scale.00:37:58
So there's that. There's also, I mean, you know.00:38:06
People had similar concerns, I think, using, like, the cloud.00:38:10
like, okay, is it really private? I think over time, This, like,00:38:16
sort of group of considerations, this joint, like, considerations of, like, time to market, engineering, complexity, whatever, people kind of realized, like, a lot more people went to the cloud00:38:22
Than… than not. But some… but that's not to say that everyone went to the cloud, like, there are some people that do not go to the cloud, and they have, like, legit reasons for that. They, like, they have, like, a really… yeah, they just can't let their data out of their data center or something. So you have to just have to, like, be really cognizant of all those dimensions.00:38:34
When you do that.00:38:52
Chris Iverson
I guess maybe the question is more, like, among the spectrum of clients that you might have worked with.00:38:54
what do you see on both sides of the spectrum? Like, how do you get them to understand what they're trading off? Or,00:39:01
Are there errors that you see on both sides that are kind of extreme to, like, give people a practical sense of,00:39:10
Like, hey, maybe just implement some tests in case somebody's leaking stuff, or…00:39:17
Hamel Husain
Yeah, I see a lot more errors, or I only see errors, on the "don't use an API"00:39:23
side. I rarely ever see errors on the… like,00:39:31
like, we should self-host. Like, I've worked with companies that self-host, but they were a cloud themselves. They were like a cloud… they're not gonna use another cloud, so it doesn't make sense. So they inherently had to self-host, and it makes sense, but00:39:37
You know.00:39:53
Yeah, I would say there's a bias at this point in our journey of, like, people, you know, they still don't trust things, and it's natural.00:39:55
But yeah.00:40:03
Yeah.00:40:05
Chris Iverson
Maybe that's what it gets to, is, like, what, trust barriers are you getting clients over, if that makes sense?00:40:07
Hamel Husain
I usually don't have to work that hard to… I don't have to get them over any barrier, because, like, as soon as we get into self-hosting.00:40:16
I mean, self-hosting is… is expensive.00:40:23
So, that… capitalism tends to…00:40:27
Chris Iverson
Take care of itself.00:40:31
Example.00:40:33
Hamel Husain
Aaron?00:40:38
Aaron Moss
Yeah, so I am working on a multi-agent00:40:41
application that was kind of half-baked when I got hired, so I can't really change it at this point, so I didn't want to start at Double Black Diamond, just know that.00:40:46
But,00:40:56
Right now, they are kind of doing the classic, hey, we are wanting a breadth scope where people can ask questions of the data that we proprietarily put together.00:40:58
So it's very wide, and thus comes a lot of trade-offs of, like, it's really hard to be reliable, it's really hard to have,00:41:08
Like, align with SME taste, because there's just so many use cases and different angles that can be approached.00:41:17
We are… so I basically prepped the client and the sponsor and, like, the product manager, like, hey guys, this is what's gonna happen, just giving you… because you guys selected this breadth, this is the scope, this is what's gonna happen, so we're gonna try to do our best, but just know.00:41:23
So we've done about 5 evaluation cycles, as I call them, where we…00:41:41
Would, kind of go through the steps that you guys outlined, with a solution, and then we implement that solution in that next, like, version.00:41:46
The thing that I am kind of struggling with is, and I don't know, Juan, if some of the products that are being promoted in this class inherently do this, because right now we're doing it all in Google Sheets and whatnot, so we don't have it pumped in yet. But,00:41:56
basically struggling to connect the story of, like, all these evaluation cycles where we're doing the actual coding. It's matching on the next, you know, we're using it to classify on previous… our next cycles, which is good, and they're hitting, which means that we're finding those edges.00:42:14
But,00:42:31
is telling the story, and specifically, like, the success metrics. So, like, what I've tried to do is kind of categorize some of these axial code categories into two metrics that I can, like, allow my client to report out, which is, like.00:42:34
Reliability is one. We want it to be reliable, so here's, like, here's a general sense of true pass.00:42:50
for reliability, meaning it achieves the output intended every single time it gives an output as intended. And then the other one is, SME-approved output, so, like, some type of quality metric.00:42:58
So I'm kind of having… I'm trying to kind of…00:43:11
search in the darkness a little bit of, like, trying to come up with these, and I didn't know if, in your guys' experience of going through these iterations, if you have a multi, multi- iteration cycle, how do you kind of…00:43:15
communicate this in an effective way, or give those to people, so that way they can communicate it effectively. And the way that I've kind of expressed that is through these success metrics.00:43:29
As well as thinking, like, the next step, because we're getting closer to a maintenance and prod type of, expectation, which is, like, an LLM dashboard where these automated evaluators will come in key.00:43:40
to allow them to, kind of, like, the people that will take over, to allow them to debug. Because basically what you're going to get is you release this thing, someone finds the edge case, and they're like, this didn't work.00:43:53
And then, instead of reaching out to us, we're trying to develop the dashboard so that way they can kind of debug things. So it's a two-parter, like, success metrics, then also, like, LLM dashboard for, like, maintenance and kind of democratizing this…00:44:03
This ability to kind of debug some of these00:44:17
some of these known cases, I guess.00:44:20
Sorry, that's a lot of wind-up, Hamel, so I apologize, Shreya. But yeah.00:44:23
Hamel Husain
Yeah, I could try to answer. I mean, so, the tools that we…00:44:31
kind of give you an overview of in this class, you know, like Braintrust, Arize, LangSmith.00:44:37
They all have… and there's more than that, there's more tools than that, but those are the ones that we kind of run into the most, so we have them in the class.00:44:42
is, you know, they allow you to run evaluations and store the metrics, and version them. So you can, like, version the evaluation, so…00:44:50
Okay, so you might be changing your system on one hand, like your prompts, your RAG pipeline, your tools, whatever.00:44:58
And then you might be changing your evaluations. Your evaluations would be, like, being updated over time, your data sets will be updated over time.00:45:07
Anytime you change or update your datasets, you can version00:45:14
Like, anytime you change the evaluations, either, like, the LLM judges, or the data sets, or even, like, the evaluations themselves, you can version them.00:45:19
Because, like, some… you can't necessarily compare evaluation, like, apples to apples if, like.00:45:28
your evaluation is changing inherently, which is fine. It can be a little bit chaotic at first if you're first building something. In the steady state, it's less chaotic when, like.00:45:33
You know, maybe you're evaluating… you're changing your system a lot more than your evaluations in the steady state?00:45:44
And then what happens is, like, those tools allow you to see… so you can create, like… you have all your evals.00:45:52
You know, like, whatever, different judges, different code-based evals, and you'll have, like, a suite of them.00:45:58
You can produce, like, an aggregate score.00:46:05
over them, like, because a lot of them will be binary, pass or fail. You can have, like, an aggregate score you can report out, and then a lot of those tools, and I don't know, like, for example, Phoenix has this, like, it shows you, like, a timeline.00:46:08
Or, like, a plot over time, like, every time you run the eval, it'll show you the progression of that metric.00:46:21
Which you can report. So, that's one way… you don't have to do it that way, there's a lot of degrees of freedom. I'm just telling you that to give you an idea of how you could do it.00:46:28
Don't feel like you have to use the tool. A lot of times I don't use the tool like that. It's not just for that purpose.00:46:38
Cause it's really… it comes down to,00:46:46
like, how you like to version things. Like, you might be versioning things in your own way.00:46:49
Storing data and, like, how you like to plot things and have dashboards could be different.00:46:55
Shreya might have an idea.00:47:04
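A minimal sketch of the aggregate score described above: many binary pass/fail evals rolled up into one pass rate per run, which is the number a timeline plot would track. The run versions, eval names, and results are hypothetical, and this does not use any particular tool's API.

```python
from collections import defaultdict

# Hypothetical eval results: (run_version, eval_name, passed).
results = [
    ("v1", "cites_source", True),
    ("v1", "valid_sql", False),
    ("v1", "tone_ok", True),
    ("v1", "cites_source", False),
    ("v2", "cites_source", True),
    ("v2", "valid_sql", True),
    ("v2", "tone_ok", True),
    ("v2", "cites_source", False),
]


def aggregate_pass_rate(rows):
    """Return {run_version: fraction of binary checks that passed}."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for version, _eval_name, ok in rows:
        total[version] += 1
        passed[version] += int(ok)
    return {version: passed[version] / total[version] for version in total}


# One number per run; plotting these in run order gives the progression.
timeline = aggregate_pass_rate(results)
for version in sorted(timeline):
    print(version, f"{timeline[version]:.0%}")
```

If the eval suite or dataset itself changes between runs, the two numbers are no longer directly comparable, so it may help to store the eval version next to each score.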
Shreya Shankar
I don't know if this is directly answering your question, but I can speak to my experiences building a dashboard for the end users to help me00:47:07
do error analysis? Because maybe that was part of your question. You said you had, like, two axial codes that you were putting in front of the users to try to, label or provide feedback.00:47:16
And you had some hesitance, like, is this a good idea or not? I generally like this idea. I think it only works in cases where you know your end users are, like, really invested stakeholders, and they're, like, trying to make the product better with you. They're not gonna, like, run away. I don't know, you're not, like, building Facebook or, like, like, something that has to be, like, a perfect piece of technology, but something that, you know, both parties are invested in. So in my case, it's…00:47:26
public defenders in the state of California. They want useful, accurate AI analysis, for sure, and they're willing to help in whatever way to get there. So, with doing… with helping00:47:51
me with error analysis. I've built them a custom dashboard.00:48:04
to review outputs, and also provide feedback. Initially, an Axial code that I wanted, was00:48:08
like, basically, the labels that they gave me mapped one-to-one with certain axial codes. For example, one of my axial codes was extracting demographic information incorrectly from PDFs, like name, gender, race, whatever. And then00:48:17
but actually, specifically, it was extracting, I think, race, like, one of those four things. So I had an axial code for that in the dashboard for them to provide feedback, like.00:48:33
you know, is it correctly extracted or not? And it turns out the better thing to put in the dashboard was kind of a combination of axial codes, like four different00:48:44
things. Were all of these four criteria extracted correctly or not? And I think it was because, like, that's just how the human…00:48:54
looks at the output, they're able to see all four at once. It's easier for them to provide feedback on all four at the same time, and in most cases, I already know, because I did error analysis, that one failure mode is exhibited more than another, so having two different boxes for, like, gender and for race, like, one of those boxes is going to be empty, like, 99% of the time. So I had, like, a freeform box that was, like.00:49:02
Are all of these four things extracted correctly? Yes or no? Tell me which one is wrong. And then I had a little LLM pipeline myself that converted that open code into the exact axial code I want. And I felt like, okay, I can have this LLM pipeline because I know it's gonna be pretty correct, like, it's literally a classification task, extremely easy, and it's probably gonna be higher accuracy than a human anyways.00:49:27
And I, like, did a little bit of that verification, and it worked, so…00:49:51
I know I'm rambling, but here's just, like, it was just, like, a process of trial and error. Like, just because something that is an LLM failure mode doesn't mean it's the easiest way for the human to reason about it, and sometimes you need to do a little bit of adaptation and that extra mile, changing how people give feedback so that you can go and map it back to your axial codes, which map directly onto LLM failure modes.00:49:54
I guess another thing that I found was people did not like to give feedback in spreadsheets. They preferred to give it back in the dashboard. I… they… I don't know.00:50:18
And then the third thing was that sometimes people said that they had feedback to give, but they don't give it for every single LLM output. When they have seen the failure mode once, they, at least the public defenders, would think, oh, okay, I've already commented on this, like…00:50:33
Even though that failure mode is exhibited in another example, they wouldn't…00:50:51
go and annotate that failure mode again. They think, like, the LLM should remember, like, Shreya will know this is already a failure mode when she looks at this feedback, and then she'll fix it.00:50:55
When in reality, I would love for them to annotate the failure mode for all examples, because, like, that's how it works, right? We want multiple instances of the failure mode. So I think there's, like, a lot of weird mismatches that I've had to, like, encounter.00:51:04
But, you know, I think at the end of the day, all that matters is both of us are aligned towards the same goal, and I'm, like, willing to adapt, like, how I show, the information that I want them to label.00:51:18
I don't know if that's helpful.00:51:30
Aaron Moss
Yeah, so, no, I like your…00:51:32
One is, like, we have to put, a feedback capability to lower the friction level for the user feedback, as opposed to Sheets, because you… and kind of like how we try to… we ultimately want cloud traces with our custom traces to be together in one data, visual.00:51:36
they would want the same thing. I also think that it's worth…00:51:55
So that's one thing that we're gonna need to, like, add as a… pre…00:51:59
a pre-step to before we do a GA release is, like, some way to capture that more explicitly.00:52:05
The other thing is that you mentioned, that I resonated with, which is, like, basically.00:52:12
giving kind of a Shapley-type value, or, like, sense of what it's doing with multiple, multiple evaluators, or multiple extractor use cases, whatever you… failure modes, as a way to… and we've had some feedback from UAT testers that they were like, it gave me an answer, but it looked right.00:52:18
But, I would love to know why it used that. And I think our initial, attempt was to use reasoning tokens, like the logic. We had, like, a reasoning, part, like, inherent with Gemini, because they kind of… their flagship model does that. So we were using that as a value. I didn't really like it, it was just a lot of noise to signal, so maybe00:52:40
Maybe a more explicit way to do that is to…00:53:05
Approach one of the, failure modes.00:53:08
And kind of just allow for that communication to be… like, basically go with the one that you feel like is the most critical to that task, or…00:53:12
yeah, that task, and then allow that to be shown as the reasoning, as opposed to just a dialogue that the LLM is doing. So, those two things made sense to me. I had…00:53:22
if no one else has their hand raised. I do have a third. So, I'm doing this thing, I tried this attempt, and you guys can tell me, if I'm going about it in the wrong way, but…00:53:33
because of the use case of the breadth, where, hey, they want to ask many questions of this database that they put together, I am attempting to use the questions or queries in our UAT testing as a signal of what a depth task could look like, so if you have breadth, you need to identify a task that it could… it needs to be really good at, with the idea that that will then be signal to a further00:53:44
workflow that you could really kind of build towards. So what I've done is I've taken all of the, past questions that it's done well on, the application, and I've summarized those into tasks using my AI intern, summary. And so I've given that back to the SME and said, use this as a jumping-off point00:54:09
to then reach out to candidates to then see if we can group. So basically, kind of that bottom-up idea, where we're using questions that we're passing.00:54:34
And I didn't know if you guys had any, and I… I was trying to look in the course… course reader, if you had any, like.00:54:44
use cases for the questions that you're getting right, like what I'm doing, which is, like, I'm using the questions we got right to enhance product sense when we don't really have a direction right now.00:54:51
Shreya Shankar
It's a… that's a really good tool to use. You're right, we don't talk about it in the course reader, like, we try to focus on, like, open coding, axial coding, build these skills for failure modes, but you can equally apply this to, like, cases of success.00:55:07
and try to identify your concrete hypotheses for why things are successful. One of the patterns that I've found for these complex LLM applications is that00:55:21
If you're able to come up with, kind of, features that are more, like, binary or categorical, so, like, some part of the task was done correctly, or, you know, the data comes from a specific source, or the question falls in one of these three questions, like.00:55:34
Kind of categorical features.00:55:51
If you're able to come up with those that, like, explain LLM performance, think kind of like a decision tree or in, like, traditional machine learning, right? This task was very easy for us because we already had structured, categorically valued features to some extent.00:55:54
And then we could use tools like SHAP to figure out which ones correlate most with the outputs.00:56:09
You kind of want to do the same thing here.00:56:14
with…00:56:17
LLMs. Like, even though it's unstructured data, it's free-form outputs, if you can come up with features to explain the performance, then all of your error analysis becomes a lot easier. The way that you communicate trust to the end user becomes a lot easier, because then you can show the values for those features, and you know those features correlate with performance, you know, all sorts of things.00:56:18
So anything you can do to, like, move in that direction, I think, is, like, really helpful.00:56:37
If you… yeah.00:56:43
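A minimal sketch of the idea above, assuming each trace has already been labeled with a few categorical features and a binary pass/fail outcome. It computes a plain pass rate per feature value, which is a much simpler stand-in for a decision tree or SHAP analysis. The feature names and data are hypothetical.

```python
from collections import defaultdict

# Hypothetical labeled traces: categorical features plus a binary outcome.
traces = [
    {"source": "crm", "question_type": "lookup", "passed": True},
    {"source": "crm", "question_type": "trend", "passed": False},
    {"source": "warehouse", "question_type": "lookup", "passed": True},
    {"source": "warehouse", "question_type": "trend", "passed": True},
    {"source": "crm", "question_type": "lookup", "passed": True},
    {"source": "crm", "question_type": "trend", "passed": False},
]


def pass_rate_by(feature, rows):
    """Return {feature_value: pass rate} for one categorical feature."""
    counts = defaultdict(lambda: [0, 0])  # value -> [passed, total]
    for row in rows:
        bucket = counts[row[feature]]
        bucket[0] += int(row["passed"])
        bucket[1] += 1
    return {value: p / n for value, (p, n) in counts.items()}


by_type = pass_rate_by("question_type", traces)
by_source = pass_rate_by("source", traces)
print(by_type)
print(by_source)
```

A feature value with a consistently high pass rate is a candidate for a narrower, deeper task; one with a low pass rate points at where error analysis should go next.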
Aaron Moss
But I do think that this comes, you know, after you've already done, like, some rounds of error analysis, or you're in, like, a place where you're very stuck.00:56:45
Shreya Shankar
Like, intellectually with your product, how to make it better.00:56:51
Aaron Moss
Yeah, this is just me trying to use the internal signal from the questions that we have to…00:56:55
pivot our development from breadth to a depth piece that my guess is will add more value than just, hey, ask any question.00:57:03
Which, again, was the scope of the project before I showed up.00:57:13
But I do… I think basically what you're implying is… use a pass mode to group those, as opposed to going straight from questions to summary tasks, but use those as a definition of features, and then allow those features to educate what tasks it could be, as opposed to one-shotting it. Because that's what I was doing. I was like, take questions and one-shot it to a task, and00:57:18
And I gave it to my, PM and, the sponsor, and I was like, jumping off point, like, do what you will, like, do something with this, like, do product stuff with this. So that's a good, that's a good call. There's probably more… there's a sensitive classification that's probably required that will help you, like, chunk out a little bit. More signal, basically increase your signal on what could work, so…00:57:41
Hamel Husain
I hope it works, like, let us know if you're able to direct them into more specific use cases.00:58:07
Aaron Moss
Yeah, this one's… yeah, again, I'm just trying to bottom-up approach to…00:58:13
Give… give this product owner or product manager, like, an actual direction, as opposed to, we want to make something that00:58:19
just works, and I'm like, well, what does that mean? What is the product? What are we… who are we trying to go after here? Because again, like, I think to Ellen's conversation about, like, I have this, like, button that I push for insights, like, I'm trying to use, like, an actual, like, bottom-up to kind of get to the specif…00:58:28
specificity of, like, what is that feature. And so, Shreya's, like, point about, like.00:58:46
Pass-mode features as a way to increase your signal set.00:58:52
is way better than I was just like, one-shot it, and, like, let's get lucky. That's honestly kind of what the approach I was taking, so…00:58:58
Yeah, I'll let you guys know.00:59:06
Balki Nakshatrala
Sorry, I'm trying to figure out how to raise the hand, I can't figure out where that icon is. Can I ask the question?00:59:09
Hamel Husain
Yeah, go ahead, yeah.00:59:14
Balki Nakshatrala
So, quite a few AI pipelines have kind of a dependency on the quality of the previous step.00:59:16
So, when you're doing the error analysis.00:59:24
Do you also, kind of, mark certain things as not to be considered in, kind of, the next stage? Because, you know, even if you take basic RAG summarization, you know, if the input of the00:59:27
the relevant chunks is not of high quality, you can't really measure the summarization step really well. So I'm just curious, this is just a two-step process, but some AI pipelines we have, have, like, many steps. So I'm just curious.00:59:39
How would you suggest, managing the error analysis in such a pipeline?00:59:55
Hamel Husain
So one of the important techniques that we teach is, like, when you're doing error analysis, stop at the most upstream error.01:00:01
I think that may be in week 2, I don't know if you may have got there yet. And, okay, the reason for stopping at the first upstream error is exactly for the reason you say, because, like, there's a lot of causal things going on, and you don't want to bog yourself down on everything.01:00:12
And so that's how you handle it, is you…01:00:28
you're constraining it to upstream errors. And it's a heuristic meant to simplify your life.01:00:32
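A minimal sketch of the "stop at the most upstream error" heuristic, assuming a fixed step order and per-step review labels for a single trace. The step names and labels are hypothetical; in a manual review you would simply stop reading the trace at the first failing step.

```python
# Hypothetical pipeline order, most upstream step first.
PIPELINE_STEPS = ["retrieve", "rerank", "summarize", "format"]


def first_upstream_error(step_ok):
    """Return the earliest failing step in pipeline order, or None if all pass.

    Steps missing from step_ok are treated as passing.
    """
    for step in PIPELINE_STEPS:
        if not step_ok.get(step, True):
            return step
    return None


# Both rerank and summarize look wrong, but only the upstream one is recorded.
trace_labels = {"retrieve": True, "rerank": False, "summarize": False, "format": True}
label = first_upstream_error(trace_labels)
print(label)
```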
Balki Nakshatrala
Yeah, no, I caught that. My question is,01:00:38
let's say I'm trying to build LLM as a judge and evaluating01:00:42
In an automated way, later steps.01:00:46
you know, how do I know which ones to filter out, because the…01:00:48
evaluation of errors downstream that are all because of the upstream. I'm talking during the later… not during the error analysis, where to stop, I'm talking more about doing the evaluation using LLM judge or something later. Don't include these, otherwise it will… Yeah, so your LLM judge should…01:00:52
Hamel Husain
Be, testing a specific error?01:01:12
So, when your LLM judge is testing a very specific error.01:01:15
Then, it should be less of an issue about, like, many different errors, because you're trying to focus on, like, one specific one.01:01:20
There could be a reason why that…01:01:29
You know, maybe if there's… there's a lot of noise going on?01:01:33
And you have a good way of deterministically knowing if there's an error?01:01:37
That you don't want to be considered, then by all means, do that.01:01:42
But hopefully, it's not an issue if you're scoping the LLM. In most cases, if you're scoping the judge.01:01:46
properly.01:01:54
Balki Nakshatrala
You're saying… Even if the input quality is not good, you're still measuring an element that is01:01:58
Relation between the input and the output.01:02:06
Hamel Husain
It's a gen… okay, so, like, one question we get a lot in the course… is…01:02:12
If you have an LLM judge, what happens if the trace that you're judging is not relevant at all? Like, it's not as the… it's not topical for the judge.01:02:17
And our answer is, it should just pass.01:02:27
Because… it's either pass or fail. So hopefully you can construct your judge.01:02:31
to where…01:02:36
you know, you don't have to worry about, you know… So, is this kind of a similar idea? Like, can you… can we make the judge01:02:37
Such that… You know, it's specific, and it passes…01:02:45
If there's no problem, or it's irrelevant, and it fails if it is.01:02:52
You know, the reality is, like, okay, yeah, like.01:02:56
There's limited context windows, maybe there's a lot of, you know, if there's a lot of noise, if you can do something to filter things out that are not necessary.01:03:01
But… hopefully when it's scoped, it's…01:03:10
that's the way it's working. Maybe Shreya has other thoughts.01:03:15
Shreya Shankar
No.01:03:19
Hamel Husain
Okay.01:03:21
Shreya Shankar
Scope your failure modes really, really well. Yeah. Sometimes even over-scope it, in my opinion.01:03:21
Okay. Yeah, like, if you… I find that, like, the failure mode I usually run into is…01:03:27
sometimes I need to scope it even better, like, split that failure mode into two, because the LLM can't… LLM judge can't handle the whole thing.01:03:35
that failure mode happens a lot more than, oh, my LLM judge is magically able to align with me on all my failure modes that I've come up with, and then now I'm going to try to do cost optimization or something, and then bucket those failure modes together. So…01:03:44
Balki Nakshatrala
Let me just try rephrasing this one more way, maybe. So, when we talk about error analysis, we assume the input… forget about where in the pipeline it is, but at that point in the pipeline, we usually assume the inputs are,01:03:59
Correct? And then you are trying to evaluate the errors.01:04:15
My question is, but in some stages.01:04:18
The input quality may not be great to even measure that particular eval you are trying to,01:04:22
measure for when the quality is good. I'm just wondering if maybe I'm over-complicating this.01:04:30
Shreya Shankar
Is… are you saying, like, there are cases where the input quality is good enough for the LLM… original LLM to make a prediction, but then it's so… it's too bad for, like, an LLM judge to assess it?01:04:38
Balki Nakshatrala
If there's an upstream failure, but the pipeline goes through it, I'm talking about later when you're using LLM as a judge to evaluate, and you're trying to do, you know, metrics based on evaluation of all the different traces coming in.01:04:51
Do… should we filter out the, you know, traces that probably01:05:05
Actually, it failed in earlier step.01:05:11
From being included in the, you know, metrics.01:05:14
That you're calculating at the later step.01:05:17
Shreya Shankar
I see.01:05:21
It depends on what you want, right? Like, if your dashboards that you're creating that are quantifying your failure modes, like, I… for me, I want the denominator for every single one of those failure modes to be01:05:24
the same, because otherwise it's very hard to compare, like, which failure mode to prioritize over another. The moment the denominators are different, I have…01:05:37
No idea what's going on. I actually had a student do this01:05:46
a couple weeks ago, and I was like, I, like, you can't compare these numbers because the denominators are all different. But maybe there is a case in which, you know, you…01:05:51
know exactly, like, oh, I'm excluding these traces from this particular failure mode because… or, like, I understand this denominator better, like, it's a different thing, like, I don't know. It's hard to tell you, like, definitely do something, don't definitely do something, but I think generally I would be wary of01:06:00
trying to build a dashboard where I'm plotting, like, different series, and the denominators are different.01:06:16
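A minimal sketch of the shared-denominator point, assuming every failure mode is scored on the same set of traces so the rates can be ranked against each other. The failure-mode names and results are hypothetical.

```python
# Hypothetical judge results per trace: True means the failure mode occurred.
trace_failures = [
    {"bad_retrieval": True, "wrong_tone": False, "hallucinated_id": False},
    {"bad_retrieval": False, "wrong_tone": True, "hallucinated_id": False},
    {"bad_retrieval": True, "wrong_tone": False, "hallucinated_id": False},
    {"bad_retrieval": False, "wrong_tone": False, "hallucinated_id": False},
]


def failure_rates(rows):
    """Rate of each failure mode over the same denominator: all traces."""
    total = len(rows)
    modes = rows[0].keys()
    return {mode: sum(row[mode] for row in rows) / total for mode in modes}


rates = failure_rates(trace_failures)
# Because every rate shares one denominator, they can be compared directly.
worst = max(rates, key=rates.get)
print(rates, worst)
```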
Hamel Husain
Yeah, just to add on that, just to kind of… I would say, probably, you don't want to filter. I mean, this is my intuition. Like, you have to have, like, a really good reason to filter. If you have upstream things that are bad quality, that is an error.01:06:22
So, that should be a different error.01:06:35
And, you know, whatever.01:06:38
And, yeah, if that's causing… I mean, I guess, like…01:06:42
Hopefully, your process is such that you are focusing on the most upstream errors first.01:06:50
Cause, like.01:06:56
That's why we…01:06:58
try to do this so that you don't… you minimize that in your evals process? Because, like, it'll get really noisy then, yeah, because then you'll have… you'll trigger multiple errors. Which is fine, those errors actually exist. It doesn't really matter why they exist, but they exist. Like, if your RAG…01:07:00
like, workflow has an error in it, like a syntax error, so it's not retrieving any documents and it's causing everything to break, great. Let everything… let all the metrics look really bad. Let them all fail.01:07:20
I think that's good. They should fail.01:07:33
It looks bad, but it is bad. So… That's my opinion.01:07:35
Balki Nakshatrala
Thank you.01:07:42
Hamel Husain
Okay, we could take one more.01:07:46
Shreya Shankar
Yeah, big day.01:07:49
I could also stay longer, Hamel, if you need me.01:07:50
Hamel Husain
No, I can't actually stay, though.01:07:54
Vignesh Iyer
This is actually a direct follow-up on what you mentioned, Hamel, is when you make them generalistic and say, if something doesn't apply, then just have it pass, does that not kind of affect01:07:56
Your rates that you're calculating, like, your success rates, where if you just have a lot of things that01:08:10
Aren't applying and just… Possing.01:08:16
When you're calculating a sort of a rate.01:08:20
on… if it's on a data set. If it's not online, right? Like, when it's an online setting where you've got this evaluator running on production traces, then maybe, you don't look at the passing things, you just look at the failing things anyway. But if it's on a data set level, I'm assuming that data set01:08:23
Should be something that does apply to that evaluator, because if it doesn't, then your rates are kind of, like, off.01:08:42
aren't they?01:08:51
The rate of passing.01:08:56
So there'd be more things that kind of pass.01:08:58
If you just have a dataset full of, like, things that don't… apply.01:09:02
Hamel Husain
Okay, I'll try to respond, and then Shreya can…01:09:09
help me detect if I'm missing something. So,01:09:13
You know, if… let's say you have, like, 100 traces about…01:09:18
Not retrieving, like, you're… okay, you have 100 traces, you have a specific eval about, you know, some retrieval issue.01:09:25
Retrieval is not a good one. Let's say, like, it's a specific eval about a tool call.01:09:35
Okay.01:09:40
And… 90 of those traces don't even have that tool call.01:09:41
This is not, it's not, it's not relevant. But I would still say, like, if that tool wasn't called, and it wasn't…01:09:45
There was no error, then…01:09:52
there was no error. Like, that failure didn't occur.01:09:55
And the true… the prevalence of that problem01:09:59
is still reflected in the fact that it passes. Like, like, so that metric… you have to think, like, what is that metric measuring? It's measuring…01:10:04
Like, what is the prevalence of this failure in production?01:10:13
And, like, if the tool is never being called in production, then they'll rarely… the prevalence of it will be low.01:10:17
Vignesh Iyer
Right? Yes.01:10:23
Hamel Husain
So it's fine, in that framing, that you pass.01:10:25
Maybe Shreya has a different… they… yeah, I could be missing something, but that's…01:10:30
One way of explaining it.01:10:37
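The prevalence framing above can be sketched in a few lines. This is a minimal illustration, not the instructors' code: the `Trace` fields, the evaluator, and the 7/3 split among the ten traces that do call the tool are all made-up assumptions. The point it shows is that traces without the tool call pass and stay in the denominator, so the metric reads as "how often does this failure occur across all traces".

```python
from dataclasses import dataclass

@dataclass
class Trace:
    tool_called: bool   # hypothetical field: did this trace call the tool at all?
    tool_call_ok: bool  # hypothetical field: was the call correct (ignored if no call)

def tool_call_eval(trace: Trace) -> bool:
    """Return True (pass) unless the tool was called and the call was wrong."""
    if not trace.tool_called:
        return True  # the failure did not occur, so the trace passes
    return trace.tool_call_ok

def failure_prevalence(traces: list) -> float:
    """Failures divided by ALL traces, not only the ones that called the tool."""
    failures = sum(1 for t in traces if not tool_call_eval(t))
    return failures / len(traces)

# 100 traces: 90 never call the tool, 7 call it correctly, 3 call it incorrectly.
# (The 7/3 split is illustrative only.)
traces = (
    [Trace(tool_called=False, tool_call_ok=False)] * 90
    + [Trace(tool_called=True, tool_call_ok=True)] * 7
    + [Trace(tool_called=True, tool_call_ok=False)] * 3
)

print(f"failure prevalence: {failure_prevalence(traces):.2f}")
```

Filtering to the ten traces that call the tool would instead report a 30% failure rate, which answers a different question ("how often is the tool call wrong when it happens") with a different denominator.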
Vignesh Iyer
That answers my question. So that would be in the production case, but in your kind of regression testing, sense, you would want to have a regression test that was…01:10:38
For ones that are failing, right? Because that's when the next time you change your prompt or add a feature, you want to see if those ones that did apply, which you used to create the evaluator initially.01:10:48
Now, they're still relevant in terms of traces and in terms of what that trace implies. Now are they regressing my,01:11:01
newly changed pipeline, those at least, should apply, right? The ones that are in your CI.01:11:10
Hamel Husain
Yeah, so the CI one, you can be more, like, smart about it, you don't…01:11:19
Like, if you… especially if you have a lot of data, you can apply, like, certain tests to certain traces, and in that case, it's fine.01:11:23
But in… still, it's fine. If you're only applying certain tests to certain cases, then…01:11:30
I mean, the CI… the true prevalence rate in your CI is, like, not the same, you know? Like, CI is CI. You're trying to, like, discover an error. You're not trying to measure, like.01:11:38
what is the true prevalence, necessarily, in CI?01:11:50
You're just trying to trigger failures.01:11:53
I don't know if that helps.01:11:56
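A rough sketch of the CI idea, where each case lists only the tests that apply to it and the goal is to trigger failures rather than estimate prevalence. Everything here is hypothetical: the evaluator names, the case format, and the trace fields are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical evaluators: each takes a trace dict and returns True for pass.
def tool_args_valid(trace: dict) -> bool:
    return all("city" in call["args"] for call in trace["tool_calls"])

def answer_not_empty(trace: dict) -> bool:
    return bool(trace["answer"].strip())

EVALUATORS: Dict[str, Callable[[dict], bool]] = {
    "tool_args_valid": tool_args_valid,
    "answer_not_empty": answer_not_empty,
}

# Hypothetical CI cases: each one names only the tests that apply to it.
CI_CASES = [
    {"id": "case-1", "tests": ["tool_args_valid", "answer_not_empty"],
     "trace": {"tool_calls": [{"args": {"city": "Paris"}}], "answer": "Sunny."}},
    {"id": "case-2", "tests": ["answer_not_empty"],
     "trace": {"tool_calls": [], "answer": ""}},
    {"id": "case-3", "tests": ["tool_args_valid"],
     "trace": {"tool_calls": [{"args": {}}], "answer": "Rainy."}},
]

def run_ci(cases: List[dict],
           evaluators: Dict[str, Callable[[dict], bool]]) -> List[Tuple[str, str]]:
    """Run only the listed tests per case; return (case id, test name) failures."""
    failures = []
    for case in cases:
        for name in case["tests"]:
            if not evaluators[name](case["trace"]):
                failures.append((case["id"], name))
    return failures

failures = run_ci(CI_CASES, EVALUATORS)
print(failures)
```

The output is a list of triggered failures, not a rate. Because the cases were hand-picked, a pass percentage computed from them would not reflect production prevalence.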
Vignesh Iyer
That makes sense, yeah.01:12:00
Shreya Shankar
I put a link in… or I tried to explain this in the chat, but I… yeah, maybe it's a good point that it might be confusing when you have a failure mode, how to evaluate it for every single trace, for example, in the tool call.01:12:06
Right? Some traces don't need to have the tool called. So how would you…01:12:20
evaluate whether that criteria is a success or not. Well, if it didn't need the tool call, then make sure it did not exhibit the tool call.01:12:25
That would be success. So, "should it have called the tool" would be 0, and then "did it call the tool" should be 0. 0 and 0, exclusive OR is what I put in there, but 0 XOR 0 is 1, good job, success.01:12:36
"Should it have called the tool", if that's 1, then "did it call the tool" should also be 1.01:12:50
1 exclusive OR 1 is 1. Success. Failure happens whenever, like, the two values don't match. Like, it should not have called a tool, and it called a tool, or it should call the tool, but it did not call the tool.01:12:56
So, that's kind of… I feel like a lot of our failure modes, people think, okay, they're not applicable to the traces,01:13:10
so I'm gonna have, like, a third category of not applicable. I would encourage you to rethink that, like, I… probably it is applicable in some way.01:13:17
And figure out how to still make it binary, like, yes or no. Like, RAG being empty… retrieval being empty is also a great example of, like, people might think, oh, that only influences the success of Generation step, but actually, no, it affects everything afterwards that, like, relies on the presence of some retrieved documents. So all of those should be failure as well.01:13:27
If… You know, the…01:13:49
If we didn't retrieve what we needed to do, or something was incorrect. I don't know, does that make sense?01:13:51
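The binary framing for the tool call case can be written as a single comparison. This is a minimal sketch with hypothetical argument names. One caveat on terminology: the rule described is "success when the two values match", which is strictly the negation of exclusive OR (XNOR), since a plain XOR returns 0 for matching inputs.

```python
from itertools import product

def tool_call_success(should_call: bool, did_call: bool) -> bool:
    """Pass when expectation and behavior match (XNOR), fail on any mismatch."""
    return should_call == did_call

# Enumerate all four combinations; there is no "not applicable" outcome.
for should_call, did_call in product([False, True], repeat=2):
    verdict = "success" if tool_call_success(should_call, did_call) else "failure"
    print(f"should_call={int(should_call)} did_call={int(did_call)} -> {verdict}")
```

A trace that did not need the tool and did not call it counts as a success here, so every trace gets a yes-or-no result.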
Vignesh Iyer
Yeah, I think that's a… that's a nice way to put it. So the applicability of that, evaluator is kind of almost… could… I can imagine that being on any evaluator, just putting some clause to… to check the applicability and that kind of…01:13:57
Okay, yeah, that definitely helps.01:14:12
Shreya Shankar
Yeah, like, there might be a case in which you absolutely need a not… not-applicable value, but that creates complexity for your application, and so I highly encourage you to try to see if you can make it truly binary in some way.01:14:15
Hamel Husain
Shreya helps, like, remind me of another really good thing to keep in mind, which is… Like…01:14:32
Okay, so say this tool calling example that we were talking about. If… it might be the case, like, if you're trying to… let's say, for whatever reason, you have an LLM judge for a tool call, you're trying to, like, judge the output of the tool call or something, and it's something very specific that, like, presupposes the existence of the tool call.01:14:38
you know, it might be the case that you engineer that LLM judge so that it first checks, does a tool call exist? If it doesn't, then it just passes, like, with code. So it's not like…01:14:56
you know, you have to think about, like, the judge, right? Like, should you invoke the LLM? Yes or no? Like, that can be something to think about.01:15:07
So you don't… you might not always have to. That's a different tangent.01:15:16
You didn't ask about that, but it seems related, so…01:15:20
Vignesh Iyer
I'll just try to summarize that, so you mean, like, a two-step evaluator, almost.01:15:24
Hamel Husain
It's not really two-step, it's like, you can have deterministic logic in your evaluator. It doesn't have to be purely code or purely LLM, it's totally valid. You can have, like, code and LLM in the same eval, if it makes sense. Like, if there's…01:15:28
logic. It's like, oh, if you can tell… if, like, you can think of a situation where something can pass, like, you know for sure it passed, based on code.01:15:42
And just exit, then that's fine.01:15:52
But, like, for fail, you might need to go to the LLM judge. I'm just giving you an example. So it's always worth, like, thinking about it.01:15:55
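One way to picture an evaluator that mixes code and an LLM judge is a short-circuit: code handles the case that is known to pass, and the judge is invoked only when a tool call exists. This is a sketch under assumptions. The trace shape is hypothetical, and `stub_judge` is a stand-in for a real LLM call so the example runs offline.

```python
from typing import Callable, List

def evaluate_tool_output(trace: dict, llm_judge: Callable[[dict], bool]) -> bool:
    """Pass in code when no tool call exists; otherwise defer to the judge."""
    if not trace.get("tool_calls"):
        return True  # known pass: nothing to judge, so the LLM is never invoked
    return llm_judge(trace)

# Stand-in for an LLM judge (assumption): records which traces it was asked
# about and "fails" any trace whose first tool call produced an empty output.
judge_calls: List[str] = []

def stub_judge(trace: dict) -> bool:
    judge_calls.append(trace["id"])
    return trace["tool_calls"][0]["output"] != ""

traces = [
    {"id": "t1", "tool_calls": []},
    {"id": "t2", "tool_calls": [{"output": "42 results"}]},
    {"id": "t3", "tool_calls": [{"output": ""}]},
]

results = [evaluate_tool_output(t, stub_judge) for t in traces]
print(results)      # verdict per trace
print(judge_calls)  # traces that actually reached the judge
```

Skipping the judge for traces with no tool call saves LLM invocations and removes a source of judge noise, and the verdict stays binary.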
Vignesh Iyer
Makes sense.01:16:02
Yeah, no, I think all these examples were, were great. Thank you so much.01:16:04
Hamel Husain
Alright, I don't see any more questions.01:16:14
That means we will see you in the next one.01:16:17
Thank you for coming.01:16:22
Ellen Wong
Thank you!01:16:25
Live session where instructors will address questions. Instructors may present answers to common questions, followed by live Q&A