Week 2

Optional: Live Office Hours 5

Oct 15, 2025, 9:00 AM - 10:00 AM GMT+5:30

Audio Transcript

Pardeep Singh

Hello, Hamel, how are you?00:00:51

I think you're on mute right now.00:00:57

Hamel Husain

I said, you might as well just get started right now, since you're the only person.00:01:01

Pardeep Singh

You know, usually I don't speak in the meetings and stuff, but, you know, I saw you online, I thought I'll say hello.00:01:03

Hamel Husain

Yeah, yeah, definitely.00:01:12

Pardeep Singh

Cool.00:01:13

Hamel Husain

Do you have any questions?00:01:14

Pardeep Singh

Well, just wanted to say hi. You know, just to give you… I think we have one minute, just to give you some context on the product which I'm working on.00:01:17

So… They started as, let's do business process automation.00:01:25

But then we generalized, or we said, let's just focus on another use case. So this is gonna be… you can just describe what you want to test in your application, and it will automatically generate steps, corresponding code, execute it, throw away the code, run some evaluations.00:01:30

and then, you know, validate. So it's like, basically built a framework which is considered as, you know, testing agent for your coding agent.00:01:48

And,00:01:57

the things I have been, you know, looking into when I started working on this were, you know, when you only use LLMs to say what the next steps are, that's automation. That's not…00:01:58

Testing, because for testing, all you really want to do is…00:02:08

oh, user filled up this form, then you have to validate, saying, did this user fill up the form? And so, therefore, you have taken these different DOM snapshots, and then you're comparing these DOM snapshots, finding out what changed, and what changed actually was valid change or not a valid change. So it became that complex. And then I realized my bill.00:02:11

is going huge, because… so then I started working on how do you sanitize DOM so that, you know, you convert DOM into a smaller format.00:02:31

And then, but you now lose some of the context once in a while. I think one thing I learned so far in this course, which I have been struggling with earlier, was this generalization, which is when you have this DOM,00:02:39

And then you have to generate code against that DOM,00:02:51

Once in a while, it will just give me some different answer, right? I was like, what are you talking about? This doesn't make any sense. And then now what I did was I just added tracing00:02:56

the trace becomes a huge file, because, you know, like, you have multiple snapshots and blah blah blah, but at the very least, I can just, you know, quickly look at a trace and see, by the way, if this was my DOM, and this is what I asked to do.00:03:06

And this is the code it generated, why the hell did it generate this code?00:03:19

And, you know, things like those. So I am actually able to move much faster than I was able to before.00:03:23

So, you know, I was earlier just writing logs in my log file, collecting that log, but now I just had this mechanism of00:03:29

Enable tracing, and it will just trace the whole thing, and it will generate, like, you know.00:03:37

30MB file with all what has happened, and… but you still have to go through this. I'm really hoping to build some automation around it, to see how can I…00:03:42

parse this log easily and stuff like that. But, you know, I think this course has been really useful to find these shortcuts into meta-analysis. Awesome, thank you. Just wanted to say hi.00:03:50

Hamel Husain

Yeah, yeah, I mean, we'll be…00:04:02

talking about building your own custom annotation interfaces in a future lesson, so watch out for that, I have a feeling that will help you.00:04:05

Pardeep Singh

Yeah, totally. Cool.00:04:12

Hamel Husain

Alright, welcome everybody. We're gonna do the same thing again.00:04:16

As before, raise your hand if you have a question, and we will take it.00:04:21

One at a time. Looks like it's a small, cozy group.00:04:28

Right now, so… don't be afraid.00:04:33

Pardeep Singh

By the way, is this the only, or the last Q&A for this week, and the next one is next week, on 21st?00:04:35

Hamel Husain

Yeah, yeah, they're a little bit back-loaded.00:04:44

Pardeep Singh

So, there's, like, more…00:04:47

Hamel Husain

I think there's, like, 4 next week.00:04:50

Pardeep Singh

Right, okay.00:04:52

Hamel Husain

And then maybe 4 after that.00:04:52

Pardeep Singh

Okay, fair enough.00:04:55

Shreya Shankar

Yeah, 3, but…00:04:57

Yeah, we typically have more later on, because then folks have had time to read the material and have more questions.00:04:58

Pardeep Singh

Yeah, makes sense.00:05:06

Hamel Husain

Camille.00:05:11

Camille Rose

Hi! Good to see you all. I was on this morning, and I had a question about a comment that you made I wanted to have you expand on a little bit, and then I have another separate question. So the first question.00:05:13

Hamel Husain

And remind me, are you working at Apple? Yes. Is that you? Okay, yes.00:05:28

Camille Rose

Yeah, my hair was up last night, maybe you didn't recognize me.00:05:31

So… we… we are testing something with multiple models.00:05:37

And there was a comment that you made this morning that I thought was really interesting.00:05:48

Around,00:05:53

You want to make sure that you're focused on the right thing in the beginning, and that's probably optimizing the prompt.00:05:56

and then go and test it on different models. I just want to make sure that I heard that right.00:06:05

And if you could expand on that a little bit?00:06:12

Hamel Husain

Yeah.00:06:15

This is also fresh in my mind, because I was just reading the course reader very carefully myself earlier today.00:06:18

Camille Rose

Hmm.00:06:23

Hamel Husain

And it's important, before you go try to solve the gulf of generalization,00:06:24

to make sure that you have solved the gulf of specification.00:06:31

And, you know, trying to kind of change the model is kind of this gulf of generalization thing.00:06:36

So, like, you want to make sure, like, okay, have you specified everything as correctly as you can00:06:43

to the model?00:06:49

You know, that's usually what you want to do first, because, like, that's simpler, and you want to make sure, you know, like, you get all of that out.00:06:52

Before kind of trying to optimize the model.00:07:00

Camille Rose

Okay, and then any techniques or things that… words of wisdom when we're taking this and trying it amongst several different models?00:07:06

Hamel Husain

It's hard to say without knowing more, but I would say I like to start with the most powerful model, like the frontier model.00:07:24

And sort of work my way backwards and say, okay, like, can I make it work with the most powerful model? Then sort of say, okay, can I make it work with the less powerful model that's faster and cheaper? And how… where… how… what can I get away with?00:07:31

And understand the trade-offs, like the price, latency…00:07:46
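A minimal sketch of the "start with the most powerful model and work backwards" comparison described above. It assumes you have already measured a pass rate, cost, and latency per model on your own eval set; the model names, numbers, and the `cheapest_acceptable` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    name: str
    pass_rate: float          # fraction of eval cases passed, 0.0 to 1.0
    cost_per_1k_calls: float  # dollars; made-up numbers below
    p50_latency_s: float      # median latency in seconds


def cheapest_acceptable(results, min_pass_rate, max_latency_s):
    """Return the cheapest model that meets both bars, or None if none does."""
    ok = [
        r for r in results
        if r.pass_rate >= min_pass_rate and r.p50_latency_s <= max_latency_s
    ]
    return min(ok, key=lambda r: r.cost_per_1k_calls) if ok else None


# Hypothetical measurements, ordered from most to least powerful model.
results = [
    ModelResult("frontier", 0.97, 40.0, 6.0),
    ModelResult("mid-tier", 0.94, 8.0, 2.5),
    ModelResult("small", 0.81, 1.0, 0.8),
]

choice = cheapest_acceptable(results, min_pass_rate=0.90, max_latency_s=5.0)
print(choice.name if choice else "no model meets the bar")
```

The thresholds are a product decision rather than something the code can pick for you; the point is only to make the price, latency, and performance trade-off explicit.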

Camille Rose

Performance trade-off.00:07:51

Okay, okay, that's great.00:07:53

And then my other question, a little bit un… well, yeah, unrelated, is…00:07:55

How have you seen large teams…00:08:01

maybe you haven't, I don't know. Who, you know, we have… we have a team of quality engineers, we have a team of data scientists.00:08:05

And now I'm introducing this error analysis. And so one of the questions that came up to me today was.00:08:14

How are we testing… Our traditional software, if you will, with our automated test… testing… You know, process.00:08:23

in parallel with a new agentic workflow.00:08:35

Like, they're kind of having a hard time trying to understand how the error analysis can be integrated into what00:08:41

They're currently doing, but they are understanding the need.00:08:51

to change how they're working, and I'm just wondering if you've had any experience with this, with teams.00:08:54

Shreya Shankar

I think a common misconception is that people think error analysis is like unit tests.00:09:00

In fact, like, people will ask us, like, hey, evals are like unit tests, right? And we have to fight this misconception, because it's two different philosophies. Error analysis is, like, from the ground up, we're gonna derive what the failure modes are by looking at the data, and then build evals for them.00:09:07

But software is top-down. Like, I already know what my tests are gonna be, so I'm going to have my QA engineer, like, write these tests, run them, and see what happens. I think it's really important to drill down that these are, like, two different philosophies, and they're both very, very important, and it's…00:09:24

I've never seen error analysis get successfully integrated into this, like, QA software, like, top-down framework, because nobody can just anticipate all the failure modes that come up with AI applications.00:09:40

So, sorry, maybe it's not, like, a great answer. I'm sure Hamel has practical tips for, like, actually bridging, you know, this communication gap between the two, but…00:09:53

I would hesitate to just try to shoehorn error analysis into the QA process.00:10:04

Camille Rose

That's really interesting, and can you clarify a little bit of nomenclature for me? Because…00:10:12

I think you said this in one of your videos, I think it was on Lenny's podcast, when you really went into the Google spreadsheet with Nurture Boss, and I think you even said, Shreya, like, I'm not sure if eval is the right term in this context, because, like,00:10:18

I feel like what we're doing with error analysis is an eval, and you were saying, like, maybe we should be using a different word, so can you re-clarify that for me?00:10:34

Shreya Shankar

Yeah, I'm gonna say error analysis. So, part of this course is we try to ascribe terminology to specific practices, like error analysis, and we… you'll notice that we don't, like, ever say, this is an eval in the course, or we…00:10:44

we broadly say it in the beginning, but, like, everything that we teach, like, has a term. I'm hesitant to say evals to people, like, in the real world, because they have their own biases and philosophies. Like, if you say eval to a QA engineer, they're not gonna think error analysis.00:10:58

So that's kind of where I'm coming from. Not that, you know, something is or isn't concretely definitely an eval. Evals can be what you want them to be. Evals can be error analysis, like…00:11:13

I don't know, but… yeah, we know that it's very important to do error analysis, to do… analyze and measure and improve for AI applications, and I think it's, like, up to us as a community to try to figure out, okay, how are we going to incorporate these into broader organizational practices?00:11:24

Hamel Husain

Gonna drill down into that more. It's… it is a confusing term, because, like, the word eval…00:11:44

like, as an English term, sort of means a test.00:11:50

Camille Rose

Black.00:11:54

Hamel Husain

If I say, I'm gonna eval something, I'm gonna evaluate you, you feel like, okay, I'm gonna give you a test. And the problem is, like,00:11:55

The tests don't live in a silo.00:12:03

Okay, like, how do you come up with the test? What do you test? You know, how do you make sure the test is good?00:12:06

how do you then… like, all of these things, and so it doesn't live in a silo, so necessarily, Shreya and I have to, like, expand the scope of what we're teaching to include00:12:14

The stuff around the test, which is effectively lots of data analysis.00:12:26

And so, you could think of… you could rename this course to Data Analysis and Evals.00:12:32

for…00:12:39

Camille Rose

Oh, like…00:12:39

Hamel Husain

AI applications?00:12:40

Camille Rose

Yeah.00:12:41

Hamel Husain

That's the right way to think about it, and that's what error analysis is, is, like, data analysis.00:12:42

And so, that might help with the confusion.00:12:48

Because that's what we're really talking about, but it's so tied together,00:12:52

And a lot of people don't know that it's tied together.00:12:56

You can't really do one without the other, you shouldn't.00:12:59

And that's… that's kind of one of the motivations for the course, is that people were trying to do evals without data analysis.00:13:03

Camille Rose

Yeah.00:13:13

So when you say, just do an eval, like, what does that mean for you?00:13:14

Hamel Husain

No, it's always, like, data analysis, and I would… yeah, I would say, like, data analysis is more important.00:13:21

It is more important than the test, because the whole point is, like, learning something. Like, you learn something,00:13:26

you have good intuition, like, you have some tools in which to improve your AI application. Because you don't really care about the evals. You care about, is the application good? Are users happy? Do you trust your application? Is it getting better?00:13:33

And so, like, you know, data analysis by itself does a lot of that work, and then evals help you systematize things and, like, give you more tools.00:13:50

So the whole thing is like a toolkit.00:14:01

So it's really, like… Can we take a scientific approach to improving our application?00:14:04

Camille Rose

Hmm.00:14:12

Hamel Husain

And there's so many different tools in that scientific toolbox.00:14:13

Camille Rose

Gotcha.00:14:18

Hamel Husain

It's just that we couldn't name the course that, like, if we tried to name the course…00:14:19

I mean, maybe we could, like, data science for AI?00:14:24

Camille Rose

No, I'm taking…00:14:29

Hamel Husain

I feel like people wouldn't know what that means, yeah.00:14:30

Camille Rose

Completely different. Yeah.00:14:34

Okay, awesome.00:14:37

Hamel Husain

But that's what you're doing, and that's what you are doing. You are just doing data science for AI.00:14:39

Camille Rose

People maybe wouldn't sign up then. But, yeah, this has been great, thank you.00:14:45

Hamel Husain

Yeah.00:14:51

Ronak.00:14:54

Ronak Savla

Hi, I have a question. So, I'm building a system specifically for court reporters, stenographers. I'm building, specific to them, an AI proofreader.00:14:56

And they have, like, specific books that they follow, for grammar and, like, proofreading. And…00:15:11

I'm building right now on, GCP because I have some GCP credits, so, like, I'm using, Vertex AI.00:15:22

And the issue right now, like, I'm not sure where to start evals from, is because00:15:31

Each court reporter has their own preferences.00:15:38

So we have added, like, a data store just for their preferences, and, like, a RAG system for…00:15:42

The… for the books, that they follow. And…00:15:48

A preference for a court reporter that we suggest might not work for another court reporter. Like, where to use dashes, or where to use a comma.00:15:54

Or… or semicolon, right? So, how do I do error analysis on that, if they say, okay, this is great, but for someone else, it won't be00:16:05

correct?00:16:16

If that makes sense, sorry.00:16:20

Hamel Husain

Yeah.00:16:21

So…00:16:22

Yeah, I mean, I would still do error analysis, but make sure you have a domain expert.00:16:25

Which is, like, a court reporter helping.00:16:31

Ronak Savla

Yeah, my wife is a court reporter.00:16:33

Hamel Husain

Okay.00:16:34

Ronak Savla

And…00:16:35

Hamel Husain

Hopefully, the user is able to provide some feedback, if it's this personalized, of like, hey, like, style and things like that.00:16:36

Hopefully, there's enough implicit or explicit signals, or you want to design00:16:47

your product, so it captures that, so that you can do error analysis effectively. So you can say, okay, like, you know, one failure mode is00:16:54

not respecting user preferences. And you can kind of drill into that specific error and understand, like, what kinds of user preferences are not being respected.00:17:02

You know… And then that itself is, like, a good…00:17:14

Like, when you do your axial coding.00:17:19

you'll find, okay, like, if… let's say your main problem is this, like, personalization, those axial codes, like, drill into these different aspects of personalization, and hopefully you'll get some prioritized view of, okay, these are things you want to focus on.00:17:21
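A minimal sketch of getting that prioritized view, assuming each reviewed trace already carries a failure mode and an axial code from your annotation pass. The record layout, the code names, and the `prioritize` helper are hypothetical.

```python
from collections import Counter

# Hypothetical annotations from a domain expert (for example, a court reporter).
annotations = [
    {"trace_id": "t1", "failure_mode": "preference_not_respected", "axial_code": "dash_usage"},
    {"trace_id": "t2", "failure_mode": "preference_not_respected", "axial_code": "comma_usage"},
    {"trace_id": "t3", "failure_mode": "preference_not_respected", "axial_code": "dash_usage"},
    {"trace_id": "t4", "failure_mode": "wrong_style_guide_rule", "axial_code": "capitalization"},
    {"trace_id": "t5", "failure_mode": "preference_not_respected", "axial_code": "dash_usage"},
    {"trace_id": "t6", "failure_mode": "preference_not_respected", "axial_code": "semicolon_usage"},
]


def prioritize(annotations, failure_mode):
    """Count axial codes within one failure mode, most frequent first."""
    counts = Counter(
        a["axial_code"] for a in annotations if a["failure_mode"] == failure_mode
    )
    return counts.most_common()


for code, count in prioritize(annotations, "preference_not_respected"):
    print(f"{code}: {count}")
```

Frequency is only one way to rank; you may weight codes by how much each one hurts the user.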

Ronak Savla

Got it.00:17:38

Shreya Shankar

Yeah, I would echo that. I think for an application like this, you kind of want to go in with a hypothesis of, okay, users have their own preferences. I just don't believe that all preferences are always being respected all the time. So when you're doing your open coding.00:17:40

kind of look explicitly for adherence to preferences, note down, like, oh, they… this user had this preference, it wasn't respected, whatever. And of course, use a domain expert to help with that, and then I'm sure you will get failure modes that way. I think…00:17:53

You could also try to just do open coding00:18:07

for more general failure modes, but if you're feeling stuck and lost at, like, the sea of possibilities of failure modes, I suggest kind of going in with the instruction-following or preference-following failure modes first.00:18:11

Ronak Savla

Got it. Alright, yeah, thank you.00:18:24

Hamel Husain

Steve?00:18:29

Steve

Yep.00:18:32

Hello, Hamel. As you know, maybe I will put my camera on as well.00:18:33

Okay, give me a sec… It's charging…00:18:38

Oh, whatever. I just have a quick ques… I have different questions, actually, because we go through the lectures as well. Oh, nice to see you, Steve, yeah.00:18:43

Yeah, and as you know, we have started our evaluation journey as well, and we are… we haven't started the axial coding, we haven't actually started everything, but one thing from the lecture that we have taken into account is the notion of criteria drift.00:18:53

And also confirmation bias.00:19:11

So, this is something that we really want to ask, is like, which step we really need to include this part, because when we give this instruction to our annotator, we already have this output, but they don't know what direction they need to go, they don't know what should be a good00:19:13

or wrong outputs. So, if we include the criteria, because when we've created these prompts.00:19:30

We kind of know the objective of it, but we are scared of including this criteria too early, but when is too early for them to tell them about hallucination, about missing interpretation, about, I don't know, sensitive data? Should it be before or after the first round of iteration?00:19:38

Because we don't want to have, at some point, this criteria drift in a way that, for the annotators, they say, like, the first output is helpful, but in the second round of iterations, they have changed their mind.00:19:57

So I don't know whether you guys have any good practices to first define these criteria, and when should we define these criteria?00:20:13

Shreya Shankar

Yeah, it's a good question. I suggest if it's the first time you're doing error analysis.00:20:21

Try to do, like, a cheap version of the error analysis process, like, have very, very few00:20:26

domain experts that are doing the open coding, try to do it fast, do, like, a small sample, and then see… you look through them yourself, and then look at, okay, what are they saying? You know, what could the criteria be? Let me try to formulate some rubric of criteria, and then I'll go back and do a second pass that's a little bit more in-depth, maybe adding a new annotator, or like…00:20:34

having them spend longer on each trace to think a little bit more. I suggest, basically, like, you kind of accept the fact that if you really have no idea how subjective that your application is going to be, just…00:20:55

know that you're gonna have two, three rounds of back and forth with the annotators, and then once you feel like, okay, things are stabilizing, I understand the application, now let me go have annotators label this at larger scale, or spend more time on this, spend more resources on this, then go for it. But this is just kind of…00:21:06

I don't know your specific application, so this is the advice that I would give. If your application is truly, like, you don't have a good idea for, what the error analysis process is going to teach you. For some applications, it's, like, extremely specific. Maybe it's LLMs summarizing specific types of emails. It's, like, very narrow scope. I probably could guess some of the00:21:26

error modes that you might see in that, and I wouldn't feel like I need to do these open-ended passes.00:21:49

Or cheap passes with annotators to figure out what the criteria are. But something truly open-ended, like, if I had to go work for OpenAI and, like, help them do error analysis for ChatGPT, I wouldn't even know where to start, right? Because everybody's…00:21:56

Steve

asking.00:22:09

Shreya Shankar

all sorts of crazy questions, and I don't know what their preferences are. So that's just two completely different ends of the spectrum. Like, you know where your application lies, and then kind of just adjust… you can allocate different resources, to external annotations based on where you are in the process of error analysis.00:22:09

Steve

So do you think, like, after the first round, where it's really plain test, where we don't actually know what is good or wrong, we can have a collaboration with all the annotators to actually define together this criteria, or this objective of00:22:26

what should be good or what should be wrong. So we limit a bit this criteria drift over the time.00:22:42

Hamel Husain

So let me, let me, like, clarify something. So Criteria Drift…00:22:49

Sounds like a bad thing, like, oh, you're gonna drift, and drifting seems, like, bad.00:22:53

You can also, like, change that term and call it Criteria Discovery.00:22:58

It's a good thing.00:23:03

Your criteria will always drift.00:23:04

in the… what Shreya describes in the paper is, like, it's just a phenomenon, like, you…00:23:07

can't expect to specify your criteria up front. You really need to look at it. And…00:23:13

So, with that in mind, like, Sure, you can try to…00:23:20

You know, specify your criteria up front.00:23:26

But what I would suggest is, like, Do the open coding, actually.00:23:31

And just not… just, you know, observe what's happening. Do that multiple times. You can then, like, try to write some criteria down.00:23:37

But just know that this criteria discovery00:23:46

will take some rounds. And that's good, because there's no other way for you to come up with the criteria.00:23:50

It's just a phenomenon that you need to react and you need to see.00:23:56

To then, like, elicit all the requirements out of you.00:23:59

Steve

Okay.00:24:04

And I have a last question, it's more like on the lectures about LLM as a judge, because for sure on our side, we might need to take this into account.00:24:04

once we finish all the annotation, and once we… I understand there's two types, there's the specification and the generalization, and we need to use LLM as a judge for this specific category.00:24:13

But the thing that I do not understand is once we build the prompt of this LLM as a judge, and we see that the result is pretty good.00:24:24

How can we use these results to refine back the… system prompt that we have.00:24:35

Because what we've seen is, like, prompt engineering is pretty hard on some specific tasks that, for example, we forgot to mention about it, we can just add a specific rule on the prompt engineering, but for some00:24:42

notions such as hallucination, even though we have this LLM as a judge, LLM as a judge is just, like, kind of like a metric to say, like, this output is good or this output is wrong. But on the prompt engineering part, how can we actually fix00:24:55

Some notion of hallucination, or some notion of what you categorize as generalization.00:25:09

Shreya Shankar

Yeah, well, one thing I wanted to clarify earlier, I think, so LLM as Judge is a good tool to measure, like, generalization failure modes. If you have a specification failure mode, like, you didn't specify your preference on the prompt, do not go and try to build an LLM judge to measure this. Like, just go fix the prompt. Like, we don't want to waste time.00:25:18

So I think once you have measured a failure mode, and you know it's a true generalization problem, like, you're using a complex agent, and it's just something that LLMs don't know how to do.00:25:35

the…00:25:47

there's not that many possibilities that you can do. Like, option A is to use a more expensive model.00:25:48

you're probably already using that. You're probably already using a state-of-the-art model. So… but if you're not, then, like, try a more expensive model, run that on the eval, see how it does. Option B is to try to decompose your task.00:25:55

into some smaller-scoped tasks for the LLM. So if there's a way to break that task that the agent is doing into two different LLM steps in a workflow, or, like, three or four.00:26:08

The example that I give for this is, like, I work in a lot of document analytics. Sometimes people want to extract 20 fields from a document. You don't need to do them all in a single LLM call. You could have one LLM call do maybe 5 fields of extraction, and another LLM call do another 5 fields. So like that, you could kind of decompose the task.00:26:23

across many different LLM calls. Again, like, this is tedious, but hopefully you have your eval, so you can measure whether this change actually works.00:26:43

And then I guess the third thing that you could do is to try to fine-tune your own large language model to just get better at this task that you know it's already bad on. Unless you, like, have an00:26:53

AI team or, like, infrastructure team, I would not recommend doing the fine-tuning. Like, Hamel probably also has thoughts on that. Since we see a lot of people fine-tune models and then, like, not be able to keep up with the MLOps burden of, like, having to deal with that.00:27:06

And then I guess the last option is you can wait for a better model.00:27:23

Yeah.00:27:28
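A minimal sketch of the field-batching decomposition mentioned above (20 fields split into calls of 5). It assumes an extraction function that takes a document and a list of field names and returns a dict; `fake_extract` is a hypothetical stand-in for a real LLM call, and `batch_fields` and `extract_all` are illustrative names.

```python
def batch_fields(fields, batch_size):
    """Split the field list into consecutive batches of at most batch_size."""
    return [fields[i:i + batch_size] for i in range(0, len(fields), batch_size)]


def extract_all(document, fields, extract_fn, batch_size=5):
    """Run one extraction call per batch and merge the per-batch results."""
    merged = {}
    for batch in batch_fields(fields, batch_size):
        merged.update(extract_fn(document, batch))
    return merged


def fake_extract(document, batch):
    """Stand-in for an LLM call; a real version would prompt a model per batch."""
    return {name: f"<{name} from {len(document)} chars>" for name in batch}


fields = [f"field_{i}" for i in range(20)]
result = extract_all("some document text", fields, fake_extract)
print(len(batch_fields(fields, 5)), "calls,", len(result), "fields extracted")
```

Whether smaller batches actually help is something to confirm by rerunning your eval on the decomposed version.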

Steve

That's the…00:27:31

Hamel Husain

There's some op… there's, like, maybe a variant of one of Shreya's options, which is, like, you can increase…00:27:33

you know, test time compute, or increase reasoning. You can induce reasoning, like in the hallucination case.00:27:41

One common strategy is, like, force the model to cite the00:27:49

sources, and force it to reason about, like, check itself and verify that, you know, it's only giving answers grounded in sources, for example.00:27:54

Which is a, you know, kind of a fancy way of, like, you know, test time compute, like, making the model think more.00:28:05

And, you know, you're gonna make it… it's gonna make it slower, but, you know, there's some trade-offs.00:28:14

But yeah, I mean, that… kind of sums it up. Those are the strategies that tend to work.00:28:23

Is to effectively, you know, find a way to throw more…00:28:31

compute at it, and/or scope it down.00:28:35

You know, so you're throwing more compute at a smaller surface area, or the same amount of compute at a smaller surface area.00:28:39

Steve

Okay.00:28:47

Makes sense. Thank you, guys.00:28:48

Hamel Husain

Yeah.00:28:50

Steve

Thank you.00:28:51

Hamel Husain

Ben from the Future.00:28:54

Ben from the Future

Hey, Shreya. Great to be here from Australia. I don't normally get to join office hours, so this is a great time zone for me. Sorry if it's bad for everyone else, but we love it. So, thank you for running it here. We had a big win with error analysis. We built it into our platform, we got,00:28:55

experts to label a bunch00:29:14

of data, and we discovered that we got 99% of the way with just specification errors. So we thought we were finding generalization errors. Like, sometimes it would say one thing, and other times it would say another thing, because we'd run the same query, like,00:29:16

20 times or 100 times, and we get, like, a 2% failure rate, right? And what we realized is that it was saying different things because the model was a bit confused about which… what I was saying. So when we improved the specification.00:29:34

the answers came back more consistent, so that we solved a really difficult bug just by looking at lots and lots of data, which is really great, so thank you. That gave us, gave us hope. It feels like you're going to lose hope on these LLM projects, because it feels like whack-a-mole. Every time you think you can go to production, some other bug comes up, so it really felt like having a framework stopped us from going mad, stopped us from going crazy, so…00:29:45

Thanks for that.00:30:08

One thing I noticed, though, was we started to play with LLM as Judge, just to see if we could,00:30:09

like, was there a generalization error? In particular, it was really bad at dates. You've probably seen this. GPT-4.1 can't do date math in its head at all, like, it seems to fail really consistently on that. We eventually just gave it a tool, and it uses the tool to do the date math, and that fixed it.00:30:15

But again, that wasn't… that wasn't really an LLM-as-judge thing. It was like, it just really doesn't get this right, we'll give it some other tools, it fixed the problem.00:30:30

And so, I have this vision that I thought from this course I was going to get some way of00:30:38

I don't know, we use GPT-4.1 knowing how good is our agent going to be in production if I switch it to GPT-5.00:30:44

But I don't have the evaluation benchmark across hundreds of scenarios for my chatbot that would let me just switch models and be confident. And I was just wondering what your thoughts were on regression testing. Is that even achievable, or do I just need to get my subject matter experts to go through another 30000:30:51

human evaluations after I've switched to GPT-5?00:31:11

Is that the only way, really? Like, get humans to look at it all again? Which just feels effortful. I didn't want to ask immediately after launching 4.1 and be like, hey, can we do that all again? Another two months of testing? You know, like, that seemed…00:31:15

Hamel Husain

Yeah, so what we usually see is… so, when you're starting out with evals, you get… you conquer a lot of low-hanging fruit with error analysis.00:31:27

And it's only after you do that, then you can start to… Identify high impact errors.00:31:36

and errors that are not easily fixed, and then write evals for those. And you'll… you'll slowly build up a suite of that, but you've… it sounds like you've just come out of error analysis, and you've gone through… maybe you need to do it, like, a few more times, but your system will stabilize.00:31:44

And then it'll be clear that, okay.00:32:02

Like, you need to move into this…00:32:06

Creating automated evals phase of things, and that's where you'll have your suite of evals.00:32:08

that then you can run in production, like, you know, we'll teach in another… in a future lesson how to run these in production, like, against a sample of your traces, how to put some of them in CI/CD, so on and so forth.00:32:14

Shreya Shankar

Next week, we talk about CI/CD, so, we do definitely recommend, like, you…00:32:29

create some set of traces, and then you have automated evaluators for those traces, and so you can swap out your model, basically, regenerate the traces.00:32:34

No, sorry, not one set of… some set of traces, a set of, like, inputs, basically, to your system that will then generate traces, and then you have automated evaluators for those.00:32:45

And then you can swap out the model of your system and then see, you know, how does the performance change on the failure modes. I would say, like, not… you can't just rely on that.00:32:55

That tells you, kind of, the known unknowns, failure modes you've already uncovered. What we find with AI is you also have unknown unknowns. When you switch to GPT-5, like, it's gonna have some weird smells in the output, and you still… like, just… you can still ship it, but be a little bit wary, and, like, prioritize error analysis, especially on live traces, because something new might come up that you didn't expect, because the model is different.00:33:04

So, yeah.00:33:30
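
The regression loop described here (a fixed set of inputs, regenerated traces, automated evaluators per failure mode) might be sketched as below. This is a minimal sketch under assumptions, not the course's implementation: `old_model`, `new_model`, the inputs, and both evaluators are hypothetical stand-ins, and a real system would call a model API instead of these stubs. It only covers failure modes that are already known, so it complements error analysis on live traces rather than replacing it.

```python
from typing import Callable, Dict, List

# Fixed inputs to the system; every run regenerates fresh traces from these.
EVAL_INPUTS: List[str] = [
    "What is your refund policy?",
    "Cancel my subscription",
    "Do you ship to Canada?",
]


# Hypothetical stand-ins for two model versions (real code would call an API).
def old_model(prompt: str) -> str:
    return f"Sure! Here is some help with: {prompt}"


def new_model(prompt: str) -> str:
    if "refund" in prompt:
        return f"here is some help with: {prompt}"
    return f"Sure! {prompt}"


# One automated evaluator per known failure mode; True means "failure present".
EVALUATORS: Dict[str, Callable[[str], bool]] = {
    "missing_greeting": lambda output: not output.startswith("Sure"),
    "too_long": lambda output: len(output) > 200,
}


def failure_counts(model, inputs, evaluators) -> Dict[str, int]:
    """Regenerate outputs with `model` and count failures per failure mode."""
    counts = {name: 0 for name in evaluators}
    for prompt in inputs:
        output = model(prompt)
        for name, has_failure in evaluators.items():
            if has_failure(output):
                counts[name] += 1
    return counts


def regressions(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Failure modes that got worse after the swap, with the size of the increase."""
    return {n: after[n] - before[n] for n in before if after[n] > before[n]}


baseline = failure_counts(old_model, EVAL_INPUTS, EVALUATORS)
candidate = failure_counts(new_model, EVAL_INPUTS, EVALUATORS)
print("regressions:", regressions(baseline, candidate))
```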

Ben from the Future

And is there somewhere you could point me in the lectures where we talk about how to… I remember that it's… I'm imagining the slide where you talk about how to do LLM as judge, and there's a thing about specifying criteria. And I just noticed that when I'm doing my error analysis, I put in, like, an axial code,00:33:33

you know, and that basically allows me to see what's most problematic, as in I can focus my energy, but it doesn't give me an acceptance criteria for… to put into a sort of regression test, does that make sense? Do you often find yourself adding another field, which is specify all of your expectations, and then I can use that00:33:52

turn that failure mode into a test, does that make sense?00:34:12

Shreya Shankar

Yeah, so in this week's lecture, we take the axial code, and then we figure out how to make an LLM judge evaluator for that axial coding. Part of that process in writing the prompt is you're gonna build a detailed rubric, you're gonna have to do something to achieve high alignment with your00:34:15

codes from error analysis, and then that LLM judge prompt can kind of be plopped into your CI. So that LLM judge prompt, you'll go through the process of00:34:34

Having to improve your prompt… kind of like a mini version of what you did for your main system, but now this is going to be an easier problem because it's just a true/false task, and the failure mode is extremely scoped.00:34:45
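
Checking how well a binary judge agrees with the labels from error analysis could look like the sketch below. It assumes the judge's verdicts have already been collected as booleans; the function name and the example labels are made up for illustration, and true-positive and true-negative rates are one common way to report agreement, not necessarily the only measure worth tracking.

```python
import math
from typing import Dict, List


def alignment(human: List[bool], judge: List[bool]) -> Dict[str, float]:
    """Agreement between human labels and judge verdicts (True = failure present)."""
    if len(human) != len(judge):
        raise ValueError("human and judge must have the same length")
    positives = sum(human)
    negatives = len(human) - positives
    true_pos = sum(h and j for h, j in zip(human, judge))
    true_neg = sum((not h) and (not j) for h, j in zip(human, judge))
    return {
        # Share of human-labeled failures the judge also flagged.
        "tpr": true_pos / positives if positives else math.nan,
        # Share of human-labeled passes the judge also passed.
        "tnr": true_neg / negatives if negatives else math.nan,
    }


# Made-up labels for one axial code.
human_labels = [True, True, True, False, False, False, False, True]
judge_labels = [True, True, False, False, False, True, False, True]
scores = alignment(human_labels, judge_labels)
print(scores)
```

If either rate is low, the judge prompt and rubric get another iteration before the judge goes into CI.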

Ben from the Future

Right.00:34:58

Shreya Shankar

So every axial code could almost become a test, right? Like, as in, eventually, if you had the time, yeah. Yeah, like, people actually end up having, like, I've seen, like, between 5, sometimes even, like, 8,00:34:58

But, you know, I would always tell people… so, it says how many res… basically, it's about the resources you have. If you have infinite people, they can all build LLM judge, do error analysis all day.00:35:10

Yeah, prioritize what is most impactful for your application.00:35:23

Ben from the Future

Brilliant. Thanks, Shreya.00:35:27

Hamel Husain

Maksym.00:35:32

Maksym Diabin

Cool, yeah, thanks. Hello, everybody. Kudos to everybody who has made it this time. It's 6 AM for me, so I'm super excited to be here. So, thanks, thanks for the course, Hamel and Shreya.00:35:36

a little bit of a context. So, we are in a financial sector. We are building an agent00:35:50

to work with the outliers, like financial outliers detection, so basically there is some sort of alarm coming from a separate system that is, triggering our agent, and so we have no human input, so basically it's our system that's00:35:58

crafting the input, and we are trying to go to external sources, such as HTML tables, PDFs, whatsoever, and retrieve the,00:36:16

information from the most recent, sort of, data sources, and verify whether the, updated data source has the updated, sort of, data point, and essentially compare it to the,00:36:30

whatever has been, sort of, expected to the expectations, right? And so…00:36:47

I have two questions. To do the scraping, we need to use an MCP. We are using FireCrawl, which has two modes, scrape and extract. Extract basically gives us LLM over LLM, which is, in a sense, random data.00:36:55

And scrape more often than not, works…00:37:12

Correctly. So, how do you even measure, how do you even cope with the MCPs as those are rendering yet another level of00:37:17

You know, like, unpredictability, in a sense.00:37:27

Yeah. And this is the simple question, sorry.00:37:32

Hamel Husain

No, no, no problem. So, like, for this, like, structured data extraction task, where you have… the good news is, like.00:37:36

For these tasks, you can use reference-based evals, like the cheaper ones, because what you can do is you can assemble a dataset that has,00:37:44

like, pairs of inputs and outputs, where the inputs are these documents, or these pages, or HTML that you're scraping, and the expected output is the structured data.00:37:54

And that's, like, you can assert it. You don't need a LLM to judge whether or not it's right, you can use code to assert it. And that's good news, because now that eval is, like, a lot cheaper to run, and more…00:38:05

yeah, it's more deterministic, you don't need to align it with a human, but you do need to assemble that data. So,00:38:18

you know, The core, like, eval…00:38:27

Process, it will be, like, to curate that data carefully.00:38:31

To make a good eval harness, that's challenging.00:38:36

That is, you know, it's a process. Like, you're gonna have to create an eval harness.00:38:41

That kind of embodies… The failure modes that you see?00:38:48

Like, the different kinds of challenges. It's like… it's like an obstacle course for your…00:38:53

you know, for your AI system, of, like, all kinds of different scraping, all kinds of different documents, like, in varieties of different…00:39:00

you know, contexts in different ways that they fail. You're constantly updating it as you see more failures, like, oh, like, this is a new kind of crazy government website that I have to scrape, that I've never seen anything like before. Okay, great. Something like that should be in the data.00:39:10

So it's really like curating this dataset, and then… Tuning your system to…00:39:26

You know, to… to kind of do well on that eval.00:39:35
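
A reference-based eval of this kind, with pairs of input documents and expected structured output asserted in code, might look like the sketch below. The extractor, the HTML snippets, and the field names are all hypothetical; `extract_table` is a toy stand-in for the real scraping pipeline, and the dataset would grow as new kinds of failures show up.

```python
import re
from typing import Callable, Dict, List, Tuple

ROW = re.compile(r"<tr><td>([^<]+)</td><td>([^<]+)</td></tr>")


def extract_table(html: str) -> Dict[str, float]:
    """Toy extractor standing in for the system under test."""
    return {
        label.strip(): float(value.replace(",", ""))
        for label, value in ROW.findall(html)
    }


# Curated (input document, expected structured output) pairs.
DATASET: List[Tuple[str, Dict[str, float]]] = [
    (
        "<table><tr><td>Revenue</td><td>1,234.5</td></tr></table>",
        {"Revenue": 1234.5},
    ),
    (
        "<table><tr><td>Net income</td><td>-12</td></tr>"
        "<tr><td>EPS</td><td>0.31</td></tr></table>",
        {"Net income": -12.0, "EPS": 0.31},
    ),
]


def run_eval(extract: Callable[[str], dict], dataset) -> List[tuple]:
    """Return (index, expected, actual) for every example that does not match."""
    failures = []
    for index, (document, expected) in enumerate(dataset):
        actual = extract(document)
        if actual != expected:
            failures.append((index, expected, actual))
    return failures


print("failures:", run_eval(extract_table, DATASET))
```

No judge is involved, so the check is cheap and repeatable; the effort goes into curating `DATASET`.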

Maksym Diabin

Now, as the data changes const… so the data, the financial data keeps changing, kind of, we are thinking about introducing our own MCP, which will, in a sense, cache00:39:40

or kind of fall back to the sort of cached data set, or, like, cached data sources in our S3, so we are essentially going to kind of plug those in. Do you think that makes sense?00:39:52

Or should we stick to the live data, which keeps changing, and that kind of makes the… Dataset staleness.00:40:05

Problem not that kind of resolvable, if…00:40:13

Do you want me to… do you know what I mean?00:40:17

Shreya Shankar

I don't think I followed the last part, like…00:40:20

Are you saying for the evals, you… do you say we should… are you asking whether you should refresh the data for the evals, or, like, for your application, I have…00:40:23

I don't.00:40:34

Maksym Diabin

The actual financial data keeps changing over time, sometimes, like, once in a week, or, like, even more often, and so the data set in this sense, like, if you are trying to use the same URL with the same MCP, and retrieve the financial results, those have already changed, and so we cannot come up to the grips with the stale… you know what I mean.00:40:34

Shreya Shankar

Yeah, I think you can try to just host a ver- like, take that website, host it somewhere else, so that way you can control the data.00:40:57

Maksym Diabin

So that we have stale data.00:41:06

Shreya Shankar

Okay.00:41:07

That makes sense. Sorry, I had no idea if you were saying, like, for the application or whatever, I was like…00:41:08
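
Pinning the eval data so it stops changing could be sketched as a small snapshot store: the first fetch of a URL is saved, and later eval runs read the saved copy. Everything here is an assumption for illustration: `SnapshotStore` is a made-up name, a local directory stands in for the S3 bucket mentioned above, and `live_fetch` is whatever function actually retrieves the page.

```python
import hashlib
from pathlib import Path
from typing import Callable, Optional


class SnapshotStore:
    """Serves frozen copies of pages so eval inputs do not drift over time."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, url: str) -> Path:
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.root / f"{digest}.html"

    def save(self, url: str, content: str) -> None:
        self._path(url).write_text(content, encoding="utf-8")

    def fetch(self, url: str, live_fetch: Optional[Callable[[str], str]] = None) -> str:
        path = self._path(url)
        if path.exists():
            # A snapshot exists: always prefer it over the live source.
            return path.read_text(encoding="utf-8")
        if live_fetch is None:
            raise KeyError(f"no snapshot for {url}")
        content = live_fetch(url)
        self.save(url, content)  # freeze what was seen on first fetch
        return content
```

The expected values in the eval dataset then refer to the frozen copy, while the production application keeps reading live data.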

Maksym Diabin

Yeah, yeah, that's exactly the problem we are facing. Thanks a lot, thanks a lot for the answers.00:41:13

The second question. So, and we are still coming to grips with all the politics of evaluation; I still cannot get my hands on the prod data traces, only the dev data, so, like, all the corporate politics. Nonetheless, we have started with the,00:41:19

Path length, meaning, like, how many, steps we have in the multi-agent evaluation.00:41:38

tools used, which actually helped us a lot, so basically we understand, like, how many scrapes we have, and how many extracts, and basically, you know, like, we have found out, for instance, that Claude 3.7,00:41:44

was using tools, like, super extensively, like, up to 17 times or something like that. So we were able to switch over to Claude 4, which now reduced the tool use down to just scrape and extract, and that's it, which is, like, super achievement in this case.00:41:57

So now we're trying to do the regex, search for the specific number, basically, that's what Hamel has said. So any…00:42:11

other kind of top-of-your-mind ideas, and I know that we are kind of putting the horse behind the carrot, in a sense, but, like, due to the slow process, maybe we can do something else, kind of.00:42:22

upfront.00:42:34

like any other evaluations, we can try to put up front, right? As the progress of doing the eval cycle is very slow here, and I'm just.00:42:36
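
The three programmatic checks mentioned in this question (path length, tools used, and a regex search for a specific number) could be sketched as below. The trace format is an assumption made up for this example, a list of step dictionaries; real traces from an agent framework will be shaped differently.

```python
import re
from collections import Counter
from typing import Dict, List

# Hypothetical trace: one dict per step of the agent run.
TRACE: List[Dict] = [
    {"type": "llm", "content": "I need the latest filing."},
    {"type": "tool_call", "tool": "scrape", "args": {"url": "https://example.com/filing"}},
    {"type": "tool_call", "tool": "extract", "args": {"field": "revenue"}},
    {"type": "final", "content": "Reported revenue is 1,234.5 million USD."},
]

NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def path_length(trace: List[Dict]) -> int:
    """Number of steps in the run."""
    return len(trace)


def tool_usage(trace: List[Dict]) -> Counter:
    """How many times each tool was called."""
    return Counter(step["tool"] for step in trace if step["type"] == "tool_call")


def mentions_number(text: str, expected: float) -> bool:
    """True if any number in the text equals `expected` (thousands commas ignored)."""
    found = [float(match.replace(",", "")) for match in NUMBER.findall(text)]
    return any(abs(value - expected) < 1e-9 for value in found)


print(path_length(TRACE), dict(tool_usage(TRACE)))
print(mentions_number(TRACE[-1]["content"], 1234.5))
```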

Shreya Shankar

Hamel, do you want me to go for it, or…00:42:52

Hamel Husain

Yeah, go ahead, yeah. I can't think of anything else, but…00:42:56

Shreya Shankar

I mean, first of all, folks have never shown up and been like, we're itching to do more evals, like, I already did everything, like, what more can I do? People are usually drowning in the evals, so I think that's funny.00:42:58

I think the second thing is, if there are people who can do error analysis faster, and often this is actually true in organizations, like, for example, I have some interns who I can, like, ping today and get some labeled data back tomorrow, and it's not, you know, the quality of a domain expert that would take a week or two weeks to do some open coding or axial coding, but00:43:11

Something.00:43:36

Going through those kinds of codes yourself, or those axial codes yourself, then you can try to build as many, like, programmatic00:43:37

automated failure, but basically programmatic evaluators for. And often things that you think are LLM as judge evaluators, or are complex criteria, you can actually approximate pretty well with, like, keyword-based checks or programmatic checks.00:43:46

You won't… it won't be, like, perfect mapping, but you'll be like, oh, I'm able to recall, like, 95% of them, and that's, like, good enough. So I would say, like, it's…00:44:00

totally fine to build these kinds of approximate evaluators, and I love doing that. I think it really helps me prioritize.00:44:10

Like, what to do open coding and axial coding for, especially in the future.00:44:17
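
An approximate keyword-based evaluator, together with a recall check against a handful of human labels, might look like this sketch. The failure mode ("unhelpful refusal"), the keyword list, and the labeled examples are all invented for illustration; the point is only that recall can be measured before trusting the heuristic.

```python
from typing import List, Tuple

# Hypothetical phrases that tend to appear in the "unhelpful refusal" failure mode.
KEYWORDS = ("as an ai", "i cannot", "i'm unable")


def keyword_flag(output: str) -> bool:
    """Cheap stand-in for a judge: flag outputs containing any keyword."""
    text = output.lower()
    return any(keyword in text for keyword in KEYWORDS)


def recall(labeled: List[Tuple[str, bool]]) -> float:
    """Fraction of human-labeled failures that the keyword check catches."""
    failures = [output for output, is_failure in labeled if is_failure]
    if not failures:
        raise ValueError("need at least one human-labeled failure")
    return sum(keyword_flag(output) for output in failures) / len(failures)


# Made-up examples: (model output, human says this is a failure).
LABELED = [
    ("As an AI, I can't help with that.", True),
    ("I cannot access that page.", True),
    ("Sorry, that is outside what I do.", True),
    ("The Q3 revenue was 1.2M.", False),
]

print(f"recall: {recall(LABELED):.2f}")
```

The missed example shows which phrasing to add to the keyword list, or where the heuristic stops being a good approximation.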

So if that's something you want to… get started with, I mean.00:44:22

Maksym Diabin

We're.00:44:27

Shreya Shankar

Other things, I think, like, the last week of the course, we talk about building your own annotation interfaces, so, that's also something you can probably get started with early if… yeah.00:44:28

Maksym Diabin

Fair enough. Yeah, thank you so much. Thank you.00:44:39

Francesco Lanciana

Yes.00:44:45

Hi, I just… so I had just a few questions, it might be quicker. So, one was just around analyzing traces when you have…00:44:46

human in the loop, and how, like, that changes, kind of, how you would go about, you know, identifying, like, doing the open coding and axial coding. Is there a different way to think about it, or is it, like, an extension of the same trace, even though you've got multiple inputs in it? Like, how do you kind of approach that differently?00:44:56

Hamel Husain

Yeah, I don't think I approach it differently, really. I mean, because, like, a chatbot is a human in the loop. There's, like, human always, like, in every step, right? And so, that's, like, kind of a canonical example that, sort of, we…00:45:19

are kind of talking about. So I think it's… this really… I don't think anything really changes.00:45:33

You know, it's always good to know where the human is in the loop, so you can… you tend to focus on that.00:45:42

Like, you need to know, like, what the intent is.00:45:47

But I can't… I mean, yeah, I think it's…00:45:50

It's not rea- I wouldn't even call it an edge case.00:45:55

I would say this is the… it's the normal case of human in the loop.00:45:58

Shreya Shankar

I did like the point that you said, Hamel, about, like.00:46:03

you can kind of spend more… I don't know what this is.00:46:06

Blue?00:46:11

balloons situation. Yeah, like, so often when you're doing open coding and axial coding, you're not paying equal attention to every word in the trace. Like, that's just not feasible, especially if these traces are long, or they have a lot of tool calls.00:46:13

So you want to, like, figure out what are places in that trace to look at. This is also why building custom interfaces is very helpful, because it'll render the trace in a way specific to your application to, like, help your eyes spend more time on the most important parts of the trace.00:46:26

So knowing where the human is interjecting in the conversation is very helpful, right? Because you can just quickly spend more time on that part of the trace, doing error analysis, rather than read every word.00:46:42

Equally.00:46:56

Francesco Lanciana

Cool.00:46:58

Awesome, thanks. Another one was just around… I think I quite liked…00:46:59

The way of thinking about, like, starting with the simplest architecture humanly possible, kind of going through the exercise to get, you know, like, the failure modes, and then, you know, fixing it from there, and then kind of putting in evals, and then…00:47:06

Working through, like, the architecture. You know, as a way, one possible way to…00:47:21

To improve it. When… like, this might just be an experience thing, because I'm still quite new to it all, but, like, when do you typically find, like, oh, I can just skip ahead on certain parts of it? You know, like, for example, yeah, I'm building, just a…00:47:26

an LLM that goes alongside a calendar and can answer different kind of questions, or do different tasks, like creating an event, or updating an event, or, you know, removing an event, or, you know, creating a task, or asking… answering a question. So I can do a bunch of stuff, and so I could immediately look at that and be like, oh.00:47:45

it feels like you'd want an LLM router, and then, like, a bunch of agents, like, that just kind of feels like the right approach for this. Or, you know, would it be, like, jumping ahead to that, or being like, no, just really militant about, I'm just gonna do one agent or something that sits at the start, go through the whole process, and then introduce it once I have those evals?00:48:04

Shreya Shankar

This is a good question. I… I feel like I…00:48:26

have built this skill of, like, building the first pass, because in research, you always have to build a baseline, and the definition of a baseline when you're doing experiments00:48:32

is, like, the simplest possible thing that is also a very good faith attempt at solving the problem. So the simplest possible thing is writing Python, not even using an LLM, just writing 3 lines of Python, it's always returning the wrong thing, but, like, that's not a useful baseline, right? Like, it's not trying to do your task. So I would try to adopt this exercise of, okay, what's, like, kind of the least amount of code that I would have to write, or least moving parts of this00:48:42

system, but it's really, truly a good faith attempt in trying to solve the problem. And maybe that is, like, a single agent, very powerful, with, like, this set of tools, like, 5 or 6 tools that you can think of.00:49:07

And then you'd kind of write your system prompt using, like, a template, and then you just go generate a bunch of traces and do error analysis.00:49:20

That could very well be the thing… that could very well be your baseline.00:49:28

But depending on the task, right, some… if it's just… if it's not agentic or doesn't need tool calls, like, then you don't need to do that, right? So…00:49:34

Francesco Lanciana

Hmm.00:49:43

Yeah, with something like this, would, like, your first guess be to start with, like, just one big agent, or, you know, with lots of tools, or would you, like, immediately already be thinking, that sounds like you should break it up?00:49:44

Shreya Shankar

I… I like to start with one big agent.00:49:59

So here, okay, here's how I like to start. I like to think about, A, what is, like, the set of tools it needs, and then B, what's the context that it needs? So sometimes that's, like, it needs to read from this data source, it needs to, like, be… integrate into this knowledge base. So, like, I'll just, like, spend effort engineering those… that connection.00:50:06

And then I'll plug in, like, the best model, like GPT-5, and then I'll be like, that's my baseline.00:50:24

So I don't… yeah, if you have, like, external context, I would definitely prioritize that.00:50:30

And then, just, like, whatever's, like, minimal set of tools that you really… like, it really doesn't have to be many. Right now, I have a few students working on, like, building a data agent benchmark, so, like.00:50:35

I don't know, analyze… answering questions about data that's, like, in multiple databases. So, like, the only tools that they are using are00:50:48

connection to the database, and then Python. I'm not, like…00:50:56

adding more tools, even though they could use more tools. Just, like, bare minimum that they need. And I'm making sure that they actually can access the data, otherwise they will not be able to do the task.00:51:01

And that's it.00:51:10

I don't know if that was helpful, but…00:51:14

Francesco Lanciana

Yeah, no, it was, thank you.00:51:15

Hamel Husain

Archana?00:51:24

Archana Gadiraju

Yeah, hi guys. Thank you for the opportunity. I'm Archana from Singapore. I'm basically a product manager. I'm trying to build my own product right now, which is an EdTech product.00:51:31

So my question is, my first agent is a composition writing assistant.00:51:44

So it teaches students how to write a good composition.00:51:54

So, it already has a well-defined rubric.00:51:58

That is, you know, given out by educators and education boards across the world. So, when I'm doing AI evaluations, say, on the assistant, when they're giving feedback to the students, and, you know, the data set that I need to come up with for the building the AI evals, right.00:52:03

it's a really huge data set, if I have to come up with 100 compositions and go through each one of them, and how to break them down. So I'm looking for some help in terms of best practices to make it a bit more practical.00:52:22

for me, because if it's only responses with a few sentences, it's a different thing, but if it's, like, a 500-page… 500-word essays that kids are gonna write, and, you know, and I have to match it with rubrics.00:52:37

So, how do I go about it? So, I'm just looking for, like, the baseline suggestion here.00:52:51

Hamel Husain

Yeah, I mean, so… one… Thing is, like, one question is, like, can you…00:53:00

recruit some friendly customers or users or something like that. Get some seed of data that way. That's one strategy. Another thing is, 500 words doesn't seem like that much for an LLM to generate. You can definitely generate it.00:53:06

How you generate it, you know…00:53:24

I would give it a go, using the approach that we teach in this course, like using the dimensions, the tuples and whatnot.00:53:29

you might have more intuition than I about whether or not that is gonna work. Maybe you feel like it doesn't work for a good reason, you can tell me about it.00:53:38

Archana Gadiraju

No, I think it works, but the point is the number of dimensions that it might have, and the number of… it just comes up to two huge…00:53:48

Hamel Husain

Okay.00:53:58

Archana Gadiraju

large documents, how do you apply dimensions and tuples to come up with a… I'm sure LLMs can generate it, but we have to still proofread them and see if they're right. So reading through so many words is also, you know, kind of… there's human error.00:53:59

that's possible from my end when I'm actually validating the tuples, so I'm just trying to understand how large documents like these are, you know, what's the best practice around that?00:54:16

Hamel Husain

Yeah, so… okay, if you have, like, this 500-word essay composition that's an input.00:54:30

And you have, like, feedback from an LLM,00:54:36

The question becomes, like, is there something you can look at that… Gives you early feedback.00:54:41

Like, can you tell… For example, something is wrong just by looking at the feedback.00:54:49

Like… You know, like, something inherent in the feedback itself.00:54:56

Regardless of whether or not it's, like, correct.00:55:01

you know, relative to the composition. It's like, you don't like the tone, you don't like the way the feedback is organized, you know, things like that. That's, like, the…00:55:04

You know, one thing that you can kind of address first?00:55:15

And there might be… Ways to make it…00:55:19

Because, like, wrapped in the… so, like, it's very interesting, because, like, if you feel…00:55:27

that is daunting for you to do the error analysis, a lot of times, that's a product smell, that… how are your users going to know if it's good or bad either? And so, what you want to… for example, if your product is giving feedback.00:55:31

Archana Gadiraju

Mmm.00:55:47

Hamel Husain

you would want to, like, make it easier for the user to know if that feedback is good or not. Is this right or not? Because it's an AI, or hopefully they know it's an AI. And, you know, you want it to be able to cite exactly why… in that 500-page rubric, you want it to, like, provide examples, like, you said this00:55:49

in your essay, and that was misleading because of X, Y, Z or A, B, C. Okay, so that's gonna help you in your error analysis.00:56:08

So we want to, like, think in that direction of, like, how do we make it easy for you to…00:56:16

Archana Gadiraju

Hmm.00:56:24

Hamel Husain

interpret.00:56:25

Shreya might have different ideas. I have a feeling she might.00:56:27

Shreya Shankar

No, no, I think this is pretty fun, and the other thing I guess we could mention is, like, try to design your custom interface to make it easier for you to scan. So maybe that's, like, highlighting words, or, like.00:56:30

I don't know, highlighting sentence lengths, just, like, all of these, like, tricks, right, to just, like, make it really easy to read. And then…00:56:43

Yeah, like, looking for specific keywords. Like, if there are ways that you can write code to try to approximate these criteria in the rubric, then you can run those00:56:52

functions on…00:57:04

one of these essays that you're trying to grade. Don't treat anything as, like, yes, it definitely followed the rubric or didn't.00:57:06

But just, like, have those show up in your custom interface so that you can be like, oh, wait, this code-based evaluator flagged something, like, that's something to pay attention to while I'm grading this. So I would just try to think of as many heuristics as possible to make it easy to grade in the interface.00:57:14
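
Heuristic flags of this sort, meant to be shown next to an essay in a review interface rather than treated as a verdict, might be sketched as follows. The two checks (long sentences and missing connective words) and their thresholds are invented examples, not criteria from any actual rubric.

```python
import re
from typing import List, Sequence


def essay_flags(
    essay: str,
    max_words: int = 30,
    required: Sequence[str] = ("because", "however"),
) -> List[str]:
    """Return human-readable hints for a reviewer; none of them is a pass/fail."""
    flags = []
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", essay) if s.strip()]
    for number, sentence in enumerate(sentences, start=1):
        word_count = len(sentence.split())
        if word_count > max_words:
            flags.append(f"sentence {number} is long ({word_count} words)")
    lowered = essay.lower()
    for word in required:
        if not re.search(rf"\b{re.escape(word)}\b", lowered):
            flags.append(f"missing keyword: {word}")
    return flags


sample = "I like dogs. They are loyal because they stay with you."
for flag in essay_flags(sample, max_words=5):
    print(flag)
```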

Archana Gadiraju

Sure, thank you. I think I'm using my brain more than I ever have. And I have to go through all of these.00:57:30

Oh, thanks, guys.00:57:37

Hamel Husain

Thank you.00:57:40

Pardeep.00:57:41

Pardeep Singh

Okay, folks, I have a couple of questions. I think those questions are based upon what I heard in this conversation so far. The first one was, you know, somebody said, can we, you know, I think Shreya mentioned that we may want to run, evaluations in the pipeline.00:57:43

So, you know, you continuously do it, and this will also help in evaluation of models. If you want to go from, you know, 4.0 to 5, then you can evaluate those.00:57:59

One of the concerns that I've heard is cost. You know, traditional software is not costly to test.00:58:08

Running these evaluations and building a pipeline which has these evaluations not only slows things down,00:58:17

But it's also very costly.00:58:23

what do you folks think about that, or have you heard about this concern? That is question one. Second is if there is any best practice to reduce cost, just in general. From a traditional software perspective, that would mean staggering test cases and stuff like that, but if there is, you know, this is a two-part question.00:58:27

Hamel Husain

Shreya, you wanna take this?00:58:47

Shreya Shankar

Or I can do… Go for it, go for it.00:58:49

Hamel Husain

So this one is… that's a really good question. We actually talk about this specifically in the course. We will be talking about this, like, how to think about costs. So this is coming, but just to give you a little bit of a teaser,00:58:51

you know, you want to kind of carefully curate what's going into CI/CD. You don't want to put the whole world into CI/CD and run, like, millions of tests in there. You know, you want to…00:59:08

And you want to run things… some tests, like, very frequently, some things, like, maybe nightly or weekly?00:59:21

you know, want to stagger them, just like you're saying. And then for your production, like.00:59:27

Running these things, like, against production, you want to… You know, like, often sample.00:59:33

your traces.00:59:40

Pardeep Singh

Okay. You know, especially if you're at, like, a large scale.00:59:44

Hamel Husain

If not a large scale, then okay, run it on all your traces, no big deal.00:59:47

So… And, yeah, we'll be talking about that.00:59:53
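
The two cost controls mentioned here, staggering evals by how often they run and sampling production traces, could be sketched like this. The tier names, eval names, and sampling rate are hypothetical; the seeded sampler just makes a run reproducible.

```python
import random
from typing import Dict, List

# Hypothetical tiers: cheap code checks run often, costly LLM judges run rarely.
SCHEDULE: Dict[str, List[str]] = {
    "every_commit": ["json_schema_check", "regex_number_check"],
    "nightly": ["llm_judge_tone"],
    "weekly": ["full_regression_suite"],
}


def evals_for(trigger: str) -> List[str]:
    """Evals to run for a given trigger; unknown triggers run nothing."""
    return list(SCHEDULE.get(trigger, []))


def sample_traces(trace_ids: List[str], rate: float, seed: int = 0) -> List[str]:
    """Pick a reproducible random subset of production traces to evaluate."""
    if not 0 < rate <= 1:
        raise ValueError("rate must be in (0, 1]")
    if not trace_ids:
        return []
    count = max(1, round(len(trace_ids) * rate))
    return random.Random(seed).sample(trace_ids, count)


trace_ids = [f"trace-{i}" for i in range(200)]
print(len(sample_traces(trace_ids, rate=0.1)), evals_for("nightly"))
```

At small scale, `rate=1.0` simply evaluates every trace.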

Pardeep Singh

Okay. I think one other question which came up was somebody said, you know.00:59:56

testing while they're invoking a tool, and the tool can have, old, stale data. So the way I solved this problem was that,01:00:02

I don't test that portion of it. You know, you test your tool very specifically, which should have a very predictable outcome, and you just write test cases against it, which is very predictable. And then, through LLM,01:00:12

we just, you know, validate whether the correct tool with the correct inputs are called, and that's it. What tool returns, it doesn't really matter. This helps in, you know.01:00:24

reducing the, the load overall to the system, and it also runs quickly, overall. And you can also, you know, the tools don't have to be online while you are actually running your evaluations. I just thought I'd bring it up, since somebody mentioned that.01:00:34
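
The approach described here, checking only that the correct tool was called with the correct inputs while the tool itself is stubbed, might look like the sketch below. `fake_agent` is a hypothetical stand-in for an LLM-driven agent and `stub_search` replaces a real tool; the trace format is also an assumption.

```python
from typing import Callable, Dict, List


def stub_search(query: str) -> str:
    """Offline stub: returns a canned value instead of hitting a live source."""
    return "CANNED RESULT"


def fake_agent(question: str, tools: Dict[str, Callable], trace: List[dict]) -> str:
    """Hypothetical agent; a real one would let an LLM choose the tool and args."""
    args = {"query": question.lower().rstrip("?")}
    trace.append({"type": "tool_call", "tool": "search", "args": args})
    result = tools["search"](**args)
    trace.append({"type": "final", "content": result})
    return result


def called_with(trace: List[dict], tool: str, args: dict) -> bool:
    """True if the trace contains a call to `tool` with exactly these arguments."""
    return any(
        step.get("type") == "tool_call"
        and step.get("tool") == tool
        and step.get("args") == args
        for step in trace
    )


trace: List[dict] = []
fake_agent("Latest EPS?", {"search": stub_search}, trace)
print(called_with(trace, "search", {"query": "latest eps"}))
```

This keeps the eval fast and offline, but it says nothing about how the system behaves on what the real tool returns, so it works alongside end-to-end runs rather than instead of them.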

Hamel Husain

Yeah, I mean, it's worth noting that, like, the software components of your AI pipeline, you should still test your software in the software way. So, like, if you have tools, if you have, like, a grep tool, you certainly have unit tests on your grep tool to make sure your grep tool is… doesn't have syntax errors and, you know, can take valid inputs.01:00:48

So… yeah, I mean… Some tools are not deterministic.01:01:08

You know, you might have a tool that does a semantic search.01:01:17

Or a tool that… It's an LLM thing?01:01:21

Pardeep Singh

Like…01:01:24

Hamel Husain

In that case, you might have to do something different, but…01:01:25

Pardeep Singh

And even in… even in that scenario, wouldn't you… wouldn't we say that the tool… we should test that tool separately? You know, like, the evaluation for that tool should be separate.01:01:29

So, so you're not, like, dealing with, you know…01:01:39

I don't know what was wrong.01:01:42

Hamel Husain

Hmm.01:01:45

There's nothing wrong with testing the tool separately, I mean, that's totally… that's good. That's good that you're testing a tool, but I would say don't isolate… when you have a chain of causal events.01:01:48

You want to be careful of freezing, like, a mid part of that chain, because…01:02:00

a lot of times, you know, if your tool… if the input to a tool is, like, some natural language, another LLM.01:02:06

Pardeep Singh

Just giving it.01:02:13

Hamel Husain

then…01:02:15

you're gonna… you might… you might miss some critical failure. So I would just think really carefully about01:02:16

what, you know, the dynamics of the system. So I usually don't do that by default.01:02:24

Pardeep Singh

Okay. But…01:02:29

Hamel Husain

you might have a reason that you… you are comfortable with freezing things, but I would just be careful.01:02:31

Pardeep Singh

Okay.01:02:36

Shreya Shankar

Yeah, like, there are so many bugs from… for example, when I worked at Facebook,01:02:37

due to software bugs, like, creating data that was then just fed into machine learning models that made garbage predictions. Like, once, like, the sound was not working in Instagram for, like, 5 sec- 5 minutes, or I don't even know, some very small amount of time, corrupted all the data. Of course, like, it's gonna fail the Instagram sound, but I don't even know what they're running there, but from the ML data infrastructure perspective.01:02:42

I'm seeing, like, all sorts of, like, weird engagement.01:03:07

Right? So, like, I'm still getting the data, and my API still has to give a prediction, or some, like, recommendation, and you're right, I can't control it, and, like, yes, that software should be tested differently, but still, you know, the interaction means that I'm now aware of this bug, and01:03:12

can't get around that. So, I imagine that's true in Gen AI world as well.01:03:30

Hamel Husain

I think it's, like, it is kind of classic integration versus unit tests, so it's, like, this is where an integration test will find something, or the unit test will…01:03:35

Pardeep Singh

Yeah, yeah, fair enough.01:03:43

Cool, folks. Thank you.01:03:48

Hamel Husain

Yeah.01:03:50

I think we are at time, there's…01:03:53

Not any other questions that I see, but yeah, thank you everyone for coming. Please continue to ask questions in Discord.01:03:56

And… We will see you soon.01:04:03

Thank you.01:04:07

Francesco Lanciana

Thanks.01:04:10

Live session where instructors will address questions. Instructors may present answers to common questions, followed by live Q&A
