Week 1

OCT 8 Optional: Live Office Hours 1 WED 10/89:00 PM—10:00 PM (GMT+5:30) OPTIONAL Recording

Notes

Back

Recording

Optional: Live Office Hours 1

Oct 8, 20259:00 PM - 10:00 PM GMT+5:30

Audio Transcript

Chat Messages

Hamel Husain

Make sure everyone can unmute one moment.00:04:01

Hello! Can everyone hear me?00:04:24

Yes? Okay, great. Shreya's coming in a little bit, she had to restart her Zoom.00:04:27

So we'll just wait a moment.00:04:32

Really excited to have everyone.00:04:34

So this is the first live session, it's in office hours. We'll explain what that means in a little bit.00:04:38

Let me just get everyone admitted into the Zoom, one moment.00:04:43

Alright, we can go ahead and kick this off. Let me just see if any… anyone… people are still trickling in at a very high rate. Let me just give it another minute.00:05:03

So people don't miss out.00:05:14

shreya

Can you hear me?00:05:44

Hamel Husain

Yeah, I can hear you.00:05:45

shreya

Okay, cool.00:05:46

Hamel Husain

I was just saying welcome, everybody, and yeah, thanks, for coming. This is the first Office Hours. And so, as you know, just to orient you, the course…00:05:47

Is a flipped classroom format.00:05:59

So, all the lectures are recorded, and the reason we did that is because we found, in teaching this course multiple times, we don't want… we want you to be able to focus on the lessons.00:06:01

and sort of watch those without the constant interruption of questions, and then have focus sessions like these, where students can ask questions. And so.00:06:13

You know, we have a lot of office hours. We have 12 or so office hours. We may have more, if there's other questions. We expect this first office hours to be quite full, because it's the first one.00:06:24

There's multiple ways to get your questions answered. There's a Discord.00:06:39

That's a chat interface, if you haven't seen that yet. You should have got a welcome email. I will also, sometime during this office hours, put a link to all this information in the chat.00:06:45

We…00:06:55

maniacally check the Discord and answer all the questions, very quickly. We have 3 TAs as well, or… we have more TAs than that, but, like, 3 kind of almost full-time TAs.00:06:56

Is that just orientations of the way we do these office hours, is we… Kind of… have…00:07:11

people raise their hands, and then we answer questions. And even if you don't get your question answered, especially this first time, since it is quite a lot of people.00:07:18

You will… we have found that people learn a lot by observing other people's questions and learning that way, so don't be disheartened if you don't get your question answered. Also, there's lots of office hours, at different time zones.00:07:29

So, any… did I miss anything, Shreya? Anything you want to add?00:07:43

shreya

No. That's… I think that's it. Yeah, ask questions away.00:07:46

I don't want to eat up too much time.00:07:51

Hamel Husain

Yeah, okay, so what we'll do is, like, raise your hand with the Zoom. There's a button in Zoom where you can raise your hand.00:07:53

Do that, we'll just go first come, first serves, and we'll go for it.00:07:59

So right now, I only see Martin, so far, so let's go ahead with Martin.00:08:05

Martinson…00:08:13

shreya

Oh, you might need to allow everyone to unmute.00:08:15

Hamel Husain

Okay, let me… let me check that.00:08:18

shreya

They have the same problem I had.00:08:20

Hamel Husain

Okay, let me… C…00:08:22

shreya

One other…00:08:30

Comment before we get into question. In previous cohorts, sometimes people take AI-generated notes, totally fine, but then they post these notes on the internet, which is less fine if people are talking about proprietary stuff here. So we ask that you please don't post your AI-generated notes on the internet.00:08:31

Leak away everybody else's secrets.00:08:50

Alright, let's get to questions.00:08:54

Martin Siniawski

Great. Well, can you hear me?00:08:56

Hamel Husain

Yeah, I can hear you, too.00:08:58

Martin Siniawski

Well, nice to meet you guys. So, I have a question. I've been doing the error analysis process. I originally saw it on Lenny's podcast, so I've been doing it since then. And the challenge that I… I'm right in the middle of it.00:09:00

at this point, I'm gonna kinda implement… there's, like, very obvious things that are missing in our prompt, so I'm trying to fix those things. I already categorized them in, like, axial coding, all of that.00:09:12

The challenge that I have is envisioning how we're gonna deploy this.00:09:22

Because we do have… this is a live app with a good amount of users.00:09:26

And, yeah, and we haven't done this…00:09:31

big, changes yet, and we… and we do have, like, we have landfuse for tracing.00:09:36

We don't have a set of tests beforehand, so I guess I'm, yeah, thinking through how we're gonna be able to deploy this, and what's the optimal way to do so. I have some ideas, I can, like, share them, but I guess that's kind of, like, the setup for my problem right now.00:09:42

Hamel Husain

So when you say deploy, so you're doing the error analysis, and when you say deploy, are you talking about something specific?00:10:01

Martin Siniawski

Yeah, so error analysis kind of uncovered different, types of errors. There's a really large one, which has to do with, like, lacking domain expertise that we haven't yet added onto the prompt, right? So, I'm basically working on plugging those things with a combination of adding things to the prompt, and also implementing a knowledge base.00:10:08

Which we hadn't had before.00:10:28

And,00:10:31

And we're using both, for chat, we use, like, the OpenAI Assistance, and for voice, we use DAPI. So we are doing that in both places.00:10:32

But right now, it's, let's say, a development assistant, or a new assistant that I'm creating based on the production one, with all these new changes.00:10:41

So, at some point, we need to, like, make this the new production one, and see how it's performing live with users.00:10:50

And so… and making sure that we don't break other things, right, and that sort of thing.00:10:58

So… I'm getting close to that, because I'm implementing the changes in the manual testing that I'm doing.00:11:04

Seems to be doing a lot better.00:11:11

But I'm trying to figure out what do we do. If you do, like, a progressive rollout, for example, to, like, I don't know, a percentage of people, and we measure with a human annotate, like, some kind of human sampling on… on… with LangFuse for that assistant.00:11:14

We're just using… starting to use Hamming, which is for, like, testing voice assistants. Could be tested before production, so maybe creating a test suite with the error analysis cases that I uncovered, so I'm trying to figure out, like, how… what's the best approach.00:11:28

Hamel Husain

Yeah.00:11:43

shreya

Okay, I think we understand your question better now.00:11:43

I would say one thing, and before Hamel, one is, like, try to decouple00:11:47

error analysis from, like, deployment and tech stack decisions, because a tech stack is going to change all the time. Like, in the last year, I think I have cycled through 3 toolings that I use for either deployment or monitoring or whatnot. So, like.00:11:52

I don't know, try not to invest too much into something too early, because you have no idea if you're going to change it next month, would be my first piece of feedback. And then the second thing is.00:12:09

It's okay to change your process over time. Like, initially, I had a process where we would kind of show outputs to other human annotators first.00:12:18

establish confidence that way before kind of sending it to more people. Once we felt confident with that process, we're like, okay, you can just skip the initial human annotator phase because00:12:30

like, just… I don't know, we've, like, gotten better at this whole process. So there's no, like, one right way to do it, and if it makes you feel better, like, you can be more, aggressive about your gatekeeping early on,00:12:41

But that's kind of the generic advice that I have.00:12:56

Hamel Husain

Yeah, and one thing I'll add to that is…00:12:59

There's different stages of sophistication that you can engage in when it comes to production monitoring and sort of operationalizing these evals that you will end up creating. So, we will talk about that in one of the last chapters.00:13:01

or the last sections of this course is, okay, how do you put the evals you do end up creating in things like CICD, what you should think about when you're monitoring production logs, how do you use that? We haven't got there yet.00:13:17

So for… We will get there.00:13:29

But it is good to focus on error analysis now before jumping to that stage.00:13:34

Martin Siniawski

Got it, okay. Yeah, I think we're doing error analysis, and at the same time, we're kind of starting to establish metrics, KPIs, so we're kind of doing everything at the same time, and maybe this initial deployment is a little bit more…00:13:40

artisanal, I guess, and then follow on once, hopefully, we have more of a system, right? And checks and balances. I don't know if that sounds…00:13:51

Hamel Husain

No, that's totally… that's good, yeah, you should be artisan… it's great to be artisanal to begin with, that way you have something to iterate on. You don't have to be perfect at, you know… it's a lot of things to kind of wrap your head around.00:13:59

So it's totally fine. It's great that you're starting with error analysis, because most people don't start there.00:14:12

So you're doing the right thing.00:14:18

Martin Siniawski

Yeah, thanks for that. It's been really helpful, and we've really learned a lot, so appreciate it. Thank you.00:14:20

Hamel Husain

Yeah, thank you.00:14:25

shreya

Novell Clemens?00:14:28

Neville Clemens

Yeah, thank you, and very helpful going through the material for this week. The textbook is really well written, and the lectures are very helpful. So I think as part of, like.00:14:30

These first four chapters, we have, for every trace, we've developed a framework to identify what are the different possible failure modes, and which of these failure modes are present on every single trace.00:14:41

Which is a good practical framework to know which errors are more frequent, etc, and go and fix them. In larger organizations, there's often an appetite for, like, some kind of a metric to indicate the health of the performance, right? What's the top-line metric?00:14:54

Now we have all these different error, types.00:15:10

Have you seen, or do you recommend methods of, like, using that to create a composite score? Like, are there weightings for different types of errors, and how do you come up with those weightings to come up with a top-line quality score for the product?00:15:15

Hamel Husain

Okay, I can go first, and Shreya can weigh in. So, a lot of times you might have these different evaluators.00:15:32

And we haven't gotten into, like, creating the evaluators yet, like, automated ones, like LM as a judge and, you know, codebase and stuff like that. What you can do is, a lot… so it's really helpful to keep it simple.00:15:40

So, what we teach in this class is your evaluators will be binary, pass or fail.00:15:53

And so it's very straightforward, then, to create a composite, because now you can have an aggregate pass or fail. And you can get into things like weightings, but what I will say is I try to be a bit parsimonious about00:16:01

the evals, and not have, like, a sea of evals. You know, it depends, like, whether they're LM as a judge or codebase, and we'll get into that, like, when we talk about creating the evals, but it's very reasonable to create a composite.00:16:15

But I would first start with…00:16:30

Make sure you have the right evals, and, like, go through the creating some, and then you'll get more intuition of, like, how to create the composite.00:16:32

Because the way that we teach how to create the evals is a little bit different. You know, a lot of people focus on, like, hey, your eval should output a score, or, you know, things like that, and we tell you not to do that.00:16:41

So… you know, I would…00:16:53

I think it will become clear. Freya, you have any…00:16:58

shreya

I agree with everything you've said. One thing that, from my experience in, like, recommender systems, I think one thing Rexis, who's pre-LMs does well, is the top-line metrics are product metrics, like, click rate and stuff.00:17:01

And that's what the top-line metrics should be, because they're interpretable, and they tie directly to the business. And then you can have a whole suite of metrics that are more ML-related, for example, like ranking-based metrics or whatnot, that can… that you can use to help diagnose00:17:15

the top-line product metric. And I think the same thing is very applicable today for LLMs. Like, just keep your…00:17:30

Product metric top line, and then, like, have aggregate metrics for, like, every failure mode that you observe in error analysis, and hopefully one of those correlates with the product metric.00:17:35

Hamel Husain

Alright, next is Natish.00:17:50

nitish

Hello, Hamila and Shreya, thank you for, starting this course, and, material is really, really good. My question, was mainly around, creating synthetic data for error analysis. One of the things that, as I was reading the course, was, how do I make sure that,00:17:53

that the examples are, the data that I created is exhausted, or is capturing all the errors.00:18:10

And I was wondering if there's, like, prioritization techniques that you have followed to look at some selective production data to make sure we are capturing some sort of problems or errors first?00:18:17

To make sure we are targeting the goal towards it, and go problem by problem. Like, wearing my product hat, it's like, start with the goal, think about certain user persona, think about their problems, what their errors could be, and then annotate those errors first.00:18:29

Or something like that, to make sure that, I have some confidence that I've captured all the problems, or, like, there's a systematic way to capture errors, overall in the system.00:18:44

shreya

So, if I'm understanding correctly, the question is, like, can I make sure that I have coverage of my synthetic data? Or, like, how much of coverage should I have to feel confident? Yeah, it's a good question. I would say early on, you can only cover what you know, like, especially prior to deployment.00:18:59

I think knowing that you're going to uncover new modes of data post-deployment is probably good. Like, it's just going to happen, you cannot get around that. The other thing that I'll say00:19:14

is when you do deploy, make sure… start with randomly sampling production traces, because people don't even start there. Like, they have the question that you're having of, like, how can I be exhausted? Well, first look at a random sample, develop intuition of, like, oh, hey, like, are there bespoke failure modes in the sample that it would be helpful for me to collect more samples00:19:25

Of… and people can… you can do clustering techniques on your random sample, for example, to see if there's some outlier clusters.00:19:47

But those are… think of those as methods to help you understand the random sample that you're starting with.00:19:56

Once you know what modes of data you have, then you can go and stratify and sample for those modes, and like, it will become very clear to you how to sample from your production stream after you have very good sense of your random sample.00:20:03

Hamel Husain

Another thing I want to say about synthetic data is it's always good to try to have a hypothesis.00:20:18

Of what your failure modes are, are likely to be. You want to have a bit of an adversarial mentality.00:20:22

And try to develop a sense where you think your application will break.00:20:30

And the way that you can do that is to look at lots of logs, and use the product yourself, and get kind of a smell or intuition, like, hmm, you know what, I bet the product will break XYZ ABC, and it's… you should, you know, kind of prioritize that.00:20:35

To see. And it's an iterative process, not like a one-shot thing where you, like, produce some synthetic data, you're good, you'll see, like, when you produce synthetic data, you'll get more intuition, and you'll discover other areas. There's no such thing as, like, 100% guaranteed coverage.00:20:52

It's kind of a… You know, iterative game of trying to explore.00:21:07

But, yeah, the, the… having a hypothesis is good.00:21:12

nitish

Yeah. One small follow-up from there is, like, if I select, like, error dialysis based on, let's say, our output metric, for example.00:21:19

let's say that the example for the Nurture Boss, if the output of the agent resulted into, let's say, less than a 2-star rating after the conversation, or the appointment was not booked, things like that, and then work backwards to the traces that00:21:29

were captured. Is that selective bias?00:21:45

Is that a right technique or not?00:21:48

Hamel Husain

It's very reasonable. You should, try… so that's actually a little bit more advanced, like, kind of using metrics and techniques to filter your data, but that's exactly the right thinking. You should balance that with random sampling. Like, don't only look at those, because there's, like, the unknown unknowns.00:21:52

nitish

Yeah.00:22:08

Hamel Husain

And then there's the known, you know, known knowns of, like, okay, you know, hey, this is probably bad, and I need to dig into it. So I would, you know, definitely use your product sense of, hey, like.00:22:08

These are the things that would give me an early signal of, like, something might be wrong. And try to use all of that to, like, fill out the space.00:22:23

nitish

Sounds good. Thank you so much.00:22:32

Hamel Husain

Thank you.00:22:34

Ashwini Karadi?00:22:36

Ashwini Kurady

Yeah, hi. Hi, Himal. Hi, Freya. Thank you so much for this course. I'm really learning a lot, and yeah, my question… I'll directly come to the question. My question is related to the subjective data, where, kind of, the evaluation doesn't… if it doesn't have something called as a ground truth.00:22:38

And the evaluation completely depends upon the human feedback, and expert human feedback, especially in the particular domain.00:23:01

one way is to go through having a quality or quantity metrics, but how actually can I, evaluate the subjective data? That's my question.00:23:12

shreya

You can still apply error analysis! That is the beautiful thing about error analysis, because think of it as you're determining what properties of good outputs are, like, properties of bad outputs. That's what the whole process uncovers for you. And maybe what's hard or new about error analysis is that it requires domain experts.00:23:25

Like, for your case. That's the only thing that's different. Maybe you can't evaluate yourself, but a domain expert can help you with the error analysis and tell you, oh, this output is bad.00:23:45

If you'll see in the lesson, when Hamel and I do this for Nurture Boss, like, actually some of the feedback that we give is subjective. And Hamel gives more subjective feedback, because he's very aware of that context in the setting.00:23:56

Right? And you could totally do the same thing. I would say nothing changes except for maybe you involve some other stakeholders in the error analysis.00:24:09

Ashwini Kurady

Okay, okay.00:24:17

And, and also one follow-up question is, regarding, building the prompt. So…00:24:18

so the LLM models, how often can we change that? Like, whenever there is a new model, it feels like we get a better answer for the prompt, and the constant question that I get is, how can we keep changing the models again and again? And if it is just one single prompt, we're using just one model, that is okay, but if it's a combination of different models.00:24:27

then how can we kind of keep a constant pace with the models, for the prompt? How often can we change it? That's my question.00:24:50

Hamel Husain

This is an excellent question. That's the core, sort of benefit you'll get from this class, is when you create… when you do end up creating the evals, you will have, like, a measurement system, where you can change whatever you want, however… however often you want, and you can measure00:24:59

The efficacy of your changes.00:25:20

So, so, you will have some way of… you will have a metric.00:25:24

Ashwini Kurady

Okay.00:25:29

Okay, okay.00:25:30

Thank you, yeah, those are the questions, thanks.00:25:31

Hamel Husain

Jillian.00:25:34

Gillian Langor

Hi. I'm gonna pile on to the love. This is feeling super useful and, like, actionable, and I'm just so grateful to be here. So my question is tactical in nature.00:25:36

we have chosen Phoenix as a tool to use internally, and I'm, like, quite busy with, my team kind of instrumenting and getting things set up. So, like, broadly, advice at this early stage for, like, what to do, what rabbit holes to not go down, but specifically, I'm…00:25:49

And I'm sure we'll cover this later on, so you can punt on this if we have content for it later, but I'm getting the sense that we want to extract the prompts from the version-controlled codebase to be able to iterate on them.00:26:05

with higher frequency, and I sense that you… the tools can afford that, and I'm not entirely sure how to approach that, so…00:26:18

Is that something that you can do? Should you do that?00:26:27

any kind of guidance around prompt management and version control for prompt management, and how it interfaces with tools like Phoenix.00:26:32

Hamel Husain

Very good question. So…00:26:40

Like, we get this question a lot, like, how to manage… how do you manage the prompts, and what's the right way to do that?00:26:44

There's a lot of different ways to do that, you know, like, whatever tool you choose will have a prompt playground, will even have a prompt store. You know, if you choose Phoenix, Langfuse, Braintrust, they all have, like, a prompt store thing. Now.00:26:50

I… I can tell you… I'll first start with my personal preference, is, like, I like to manage the prompts in the codebase, and then structure the code so that it's easy to iterate.00:27:05

If you, like, if you're finding that, hey, like, I can't iterate that fast on my prompts because they're in my codebase.00:27:16

That's interesting, like, I would want to see if we can… You know, refactor the code.00:27:24

In a way that makes it easy to play with.00:27:32

And then one thing that I personally like to do, and this is a little bit of my opinion, is, you know, I love using things like notebooks, because it's interactive and allows me to, like, I can import something from a codebase, and I can play around with things. One…00:27:34

limitation of prompt playgrounds. Let's say you have your prompt in whatever tool, like Phoenix or whatever.00:27:50

Is… they don't have access to your code.00:27:57

you know, you can run a prompt, and you can run data through that prompt. You can, like, template out, but eventually, when your application has retrieval, it has tool calls, it has something like that, that prompt playground won't be able to execute those things, because those are inherently part of your codebase. It then becomes, like, starts to become very limiting.00:28:00

So, what I was… you know, what I would suggest is try really hard to make your code modular and easy to play with, because you can't… in the limit, it's gonna be very hard to decouple the prompt.00:28:21

Like, the playing, you know, your system.00:28:33

Gillian Langor

Awesome. That's… that's really helpful, and I think very directionally, it's gonna guide us in a slightly different direction than we're currently going, so thanks.00:28:36

Hamel Husain

Yeah, no problem.00:28:44

Abhisheikh?00:28:46

Abhishek Panda

Hi, Hamil. Hi, Sriya. Nice meeting you.00:28:50

So, I'm yet to complete Lesson 3, but my question is, so, I'm part of the AI team and our cross teams. They have built multi-agent systems, like site designer, website optimizer, logo generator. Now, we need to build evaluation for that. The first thing is.00:28:52

how to generate synthetic data, okay? They haven't created a robust benchmark set. We don't have a good starter set, because certain things are new, so it's newly getting launched. Now, for such cool start problem, how to create synthetic data, product managers can help with certain scenarios.00:29:09

But how we can build a robust offline before, making the product live? That is my first question, Hamid.00:29:28

If you could answer, then we can go to the second question.00:29:36

Hamel Husain

Sure,00:29:41

So, it is important to see, you know, I would push you as hard as you can to see if you can get early users, friendly users, design partners, whatever it is, to use your product.00:29:42

You can certainly generate synthetic data based upon your product knowledge, but what I'm also hearing from you is, like, oh, that might not be enough. And that's right, that's often not enough.00:29:54

In a lot of domains, where you don't necessarily know what the user is gonna do.00:30:04

And a lot of times, even when you know… you think you know, you don't know what the user is gonna do. So I would start with getting some design partners, looking at how they use the application to inform your synthetic data generation, to say, oh, like, this is really interesting, this is the way people are using it.00:30:10

Like, let me… and then you'll set… you'll put your adversarial hat on, and say, you know what, like, these are things that I think might be challenging.00:30:28

Let's put that in the test case.00:30:36

And… and start with that.00:30:39

Abhishek Panda

Got it. So you're saying that collaborative design partners try to understand certain scenarios, and see what are the failure cases, and then, you know, design your prompt such a way that you can generate such synthetic data, is it?00:30:42

Hamel Husain

Yeah, I would… I would definitely anchor on those failure cases. But then also, like, you know, just see how people are using it, and it's always good to have, you know, tests. You don't… you don't… you want the test…00:30:56

to be…00:31:10

correlated with reality, and not kind of an academic exercise of, this is what could happen, because that's when you can waste a lot of time. So it is good, if you can anchor… it's especially gonna be high value for you to find failure modes really fast.00:31:11

Abhishek Panda

Got it. And my second follow-up question is, Hamil, I mean, let's say some kind of changes is happening, right? Because agent will respond something, right, in the chat interface, you are the user. You said that, okay, make some changes to the website, it did the change.00:31:29

But, you are not happy with that. I mean, according to the agent response that I have made these changes.00:31:43

But it is not exactly what you wanted. So when we do error analysis, do we need to scrap those information in some way? Like, what are those changes happening within the website to make a better evaluation? Or, I'm making the evaluation more complex?00:31:49

By… because I believe that external data is required, right?00:32:05

Hamel Husain

Yeah, I mean, you should be logging all the data that your LLM sees.00:32:10

To be able to act.00:32:15

And so, if you're… if you have an observability system, you need to kind of log that along with your open codes and your axial codes, that you can correlate it back.00:32:18

Because later on, when you do write evals, your evals will have to take that context into account.00:32:30

Abhishek Panda

But within that, there are a lot of changes happening, right? Some images change is happening, some color change is happening. Let's say you said, I want magenta color, but you did something like red or purple kind of thing, and you said that, no, this is not what I wanted. I wanted this kind of design. So, you need access to all those information within your evaluation, is it? In our offline setup?00:32:37

Hamel Husain

Yeah, so you should get… you should be logging all, like, multi-turn interactions, all tool calls, all retrieval steps.00:32:56

Every user input, that is part of what you will need for the eval.00:33:04

Abhishek Panda

Okay, got it, got it, got it.00:33:11

Thank you.00:33:14

Hamel Husain

Yeah, no problem.00:33:15

Igor.00:33:18

Igor Kasianenko

Hi, really happy to be here.00:33:20

I have one question about the Discord. I really like the threats feature, and I would like to…00:33:23

write down my question after I ask it here, and follow up on it, maybe for the duration of all course. Like, do you have any recommendation how to make it meaningful, so the thread will conceive the message? Like, how many words should I put in the beginning of the thread?00:33:29

So, it will be easily searchable, and for all other threads, like… How to start the thread?00:33:46

Hamel Husain

I wouldn't stress too much about it. I would write… I would ask your question. We will answer the question, we kind of look at all the questions, and…00:33:54

Yeah, you know, what we do also is, so we will be giving y'all access to a thing called Delphi, which is an AI-assisted tool with everything that Trey and I have said about00:34:06

you know, a set about evals, including Discord… Discord information, so what we do is…00:34:20

I… we curate the Discord questions every so often, and we make it into an FAQ. The FAQ is publicly available. It might not mean much to you if you haven't taken the class, but you have taken the class, you will like it. I'll get… I'll grab a link and I'll put it in the chat right now.00:34:25

Igor Kasianenko

Yeah.00:34:45

Hamel Husain

Don't worry too much, we do curate and kind of massage the information, so I would just ask the question.00:34:45

shreya

Yeah, well, Dustin.00:34:53

Igor Kasianenko

I get to it.00:34:54

shreya

Yes.00:34:54

Igor Kasianenko

Let me try to ask it now, and so, I'm working on proof of concept of AI assistant for computer game.00:34:57

That is, like, first player shooter, and the log file is huge, like, 1 million lines of00:35:06

All coordinates and in-game events.00:35:13

And I don't know as… Program these data scientists how to tackle this problem.00:35:16

And the ideal product is the chatbot that will answer what happened in the game, what can user improve.00:35:23

I have the main expert who can describe it, and my first approach was00:35:31

Could you describe all good and bad behaviors?00:35:38

And he said it will take 2 years to describe this game because it's too complex. So my question is, is there any…00:35:41

like…00:35:48

evaluation or, like, test-driven design for language models, so even before I start doing first prompts, and I don't know if I can feed, like.00:35:50

10,000 of lines of events into language model, or should I reprocess it with time series language models, and all those questions? Like, can I go from00:36:00

tests before I start developing this application then.00:36:12

Hamel Husain

Yeah, this is actually, I have an FAQ about this, it's a funny one, it says, should I engage in eval-driven development? And the answer is, probably not.00:36:18

Because…00:36:27

There's an infinite surface area and almost infinite complexity. It's kind of counterintuitive versus traditional software under development. You can't start with some tests, but this is, like, general guidance to help guide you, and00:36:28

It's really good to start with error analysis. Now, you might be wondering, how the hell do I do error analysis? It's gonna take… you said it's gonna take 2 hours.00:36:42

For, like, an expert to give you some judgment on, like.00:36:51

what happened in one user interaction, and I also heard that there's, like, hours of log files, and it's very, onerous, let's say, to, like.00:36:55

And so, there's a couple of things that come to mind. One is…00:37:07

When you're doing error analysis, you don't have to be exhaustive about every single thing you find.00:37:12

Try to figure out, like, what the largest, most important theme of failure is, and just stop, and move on.00:37:19

You know, and there's certain heuristics you can use, so if there's, like, an upstream error.00:37:27

The most, like, the first error, the first problem you see.00:37:33

Especially if you have sets of failures that cascade on one another. Like, if you have an AI process, and there's, like, 10 steps, but, like, the third step fails, and that causes all the downstream steps to fail, you want to focus on the upstream one, and just stop and move on.00:37:36

Because, you know, that step is the most important to fix. And kind of, you know, and you can come back to these things later, that's one thing. Second thing is, in this class, what we'll teach is, when it comes to annotating data.00:37:54

You want to…00:38:08

And especially for your use case that you just told us about, this, like, video game stuff.00:38:09

Like, the off-the-shelf.00:38:14

things are gonna not… probably not work for you. You're gonna want to create your own custom annotation interface that renders your data in a way that someone can more quickly annotate it.00:38:16

You know? You don't want to throw, like, a wall of text on somebody. You have to figure out how to curate and navigate this data.00:38:28

And maybe even search it properly.00:38:37

So those are, those are two things that come to mind. I don't know if Shreya has any other, ideas.00:38:41

shreya

No, I think that's good. Yeah, check out Chapter 10 for custom interfaces, if you want to read that, like, today, but, we'll get to it.00:38:47

Igor Kasianenko

Yeah, thank you. May I have small clarification?00:38:56

I think the main problem that I'm facing now is that I cannot even define the success criteria, like.00:39:00

what should I ask LLM to output? And, like, can I write, like, 3 minimal requirements as success criteria to see if it's, like, capable of doing the task that I'm doing?00:39:07

shreya

I think so. Like, you can totally do these probing activities to understand LLM behavior. I find that for extremely complex scenarios, like, creative scenarios, it helps to just00:39:23

throw a bunch of different things, build intuition about what the LLM can and cannot do, and then maybe reason about, okay, I want to build a V1 of this thing, what do I think is realistic with whatever state-of-the-art model we have now?00:39:34

And then go from there. I try to generate some synthetic data, and then follow the error analysis process.00:39:47

Igor Kasianenko

Yeah, thank you so much.00:39:54

Hamel Husain

Because…00:39:57

Like, I'll sing.00:40:02

Vikas Pratap Singh

Hmm.00:40:04

Yeah, VPSN, if you remember, right? So, yeah, excited, again, to be part of this, course.00:40:05

And… I have a question, there is one,00:40:14

kind of products, I'm building with00:40:19

With help of my team, where we are taking documents and doing a field extraction.00:40:22

And recently what we did is we moved away from,00:40:30

A 72 billion parameter model to a 14 billion parameter model, because,00:40:37

the client was not willing to, pay the GPU cost for that.00:40:45

Hosting that, open source model.00:40:48

So, the change that we observed with the switch to this small language model was,00:40:51

With the same prompt.00:40:58

We are, when it comes to field abstraction, we are seeing that model is sometimes missing some, fields.00:41:00

Or, it is, Misrepresenting a lot of attributes.00:41:07

And so, to mitigate this problem, what we did is we created a strict JSON based on the documents we know.00:41:14

And also, we know some rules which apply to certain fields. So as a second pass in the pipeline, we take the JSON created by the model, and we pass it through this step to find out what are the issues.00:41:23

But the challenge I'm facing here is, at the model keep,00:41:39

changing the way it interprets certain fields, and if I pass.00:41:44

that stick JSON in the context itself, in the prompt, to say that, okay, give me the output in this format, then it's unable to,00:41:50

Populate, many of the values.00:41:59

So…00:42:02

In your experience, is there a way to handle this problem that I'm facing with working with this, small language model?00:42:03

shreya

Yeah, there's actually one problem.00:42:14

Vikas Pratap Singh

It's a visual model, yeah.00:42:16

shreya

It's a known problem that a lot of these smaller models don't have formatting adherence as well as the other models.00:42:17

What I can… the trick that has worked for me most of the time is I actually will not ask for structured output in my first call to the large… to the small language model, and then I'll have another call after that, which doesn't need to use a VLM.00:42:27

But that will convert the first LLM's output into some structured output.00:42:43

And that works very well for me. And then in terms of your precision and recall problem, I would say you have to re-architect your pipeline to work with the smaller models. So, for example, if you did a bunch of extractions in one LLM call, maybe you need to00:42:50

have a smaller portion of the document per LLM call, like, operate on a chunk or a page instead of the entire document. Or you need to have different LLM calls for different things that are being extracted. So instead of extracting 5 fields in one go.00:43:07

do one call for each field. And you can kind of experiment with that to see how you can kind of systematically make the pipeline more complex.00:43:23

Vikas Pratap Singh

Okay, so, so just, so what… one thing which, we implemented, just two days back was, converting that document into a 300 DPI, PNG image.00:43:32

And that… that has helped us to improve, obviously, the accuracy, of the model. So, what I'm understanding is, you're saying that,00:43:44

Because we know the structure of the documents, so pass,00:43:55

Part, like, kind of part by part, so that model can handle,00:44:00

the image and the output better. Is that what I'm saying, right?00:44:04

shreya

Exactly, the recipe is…00:44:09

Vikas Pratap Singh

Smaller portion.00:44:10

shreya

of the task to the LLM.00:44:11

Vikas Pratap Singh

And…00:44:13

And then the final pass will be stitching it together, which can be done without making a call to LLM as well.00:44:13

shreya

Yeah, without a call to LLM, or if you need a call to an LLM, at least it's not the vision model, right? It's just taking text and then massaging it into some JSON or something.00:44:20

Vikas Pratap Singh

Okay.00:44:31

Okay, yeah, thank you so much, this helps. Yep. Thank you.00:44:32

Hamel Husain

Thank you.00:44:36

Daniel Saad.00:44:37

Daniel Saad

Hi, I was enjoying reading the companion book, it was very nice. I really enjoyed it. I have a question I faced when I was reading about the reference-free and reference-based questions.00:44:40

evaluations. There's something there that came to my mind, that there are some cases that you have this business report generation, deep research, or something like discovering hidden relationships in knowledge graphs.00:44:51

That there is no ground truth, there is no predetermined correct answer, and there is some… the task is to find this novelty inside, or inside the data that is given to the…00:45:04

And, inside a problem as a context, or sometimes complete net. There's no measure to see if it's this report that's generated complete, or…00:45:16

there is no sense of how much I done exploration versus exploitation of the data that I have, okay? And that's… when that's to come to this evaluation, okay, that's make it very hard. I can give you a very concrete example. I have a graph rack over the… all of the documents inside the Parliament of Germany, and I created a graph rack. I can check for the faculty of the00:45:25

results that are coming out, and, is this const instance, but I kind of said, okay, this, this, my agent went through and find this novel hidden relationship there, or it's just a stop as first a step of exploration?00:45:50

And there is nothing new inside, it's what is generated, and everybody knows that, you know, one of the base surface facts is there.00:46:07

Hamel Husain

So is it the case that you don't know…00:46:19

what is good and bad? Is that what you're saying? Like, you have a retrieval…00:46:22

over documents, and you want, you know, you're not sure whether any given retrieval is good or bad, so it's hard to evaluate. Is that kind of… I'm just trying to make sure… I don't understand.00:46:27

Daniel Saad

I don't know how complete it is.00:46:38

So, because there are some hidden relationship between the graph that the agent can jump, but sometimes doesn't jump, okay?00:46:41

but for the human, it's very hard to, in that scale of data, to find that relationships, okay? So that's creating… I can create evaluation for the factuality, correctness, and, you know, for the… how much retrieval is going to, but I don't know how much exploration versus exploitation is doing at the same time.00:46:48

shreya

Yeah, so, to rephrase the question, maybe it's like, you have ways of evaluating precision, like, how good are the return results, but you're struggling to evaluate recall, which is…00:47:08

Is it return… is it missing out some things that it should return? One strategy that I've found helpful here for, like, extraction tests is to try to keep having the LLM00:47:18

like, keep querying an LLM to retrieve more, and then see, like, does the quality of those retrievals diminish over time? At what point does it do it? And then you'll kind of know, like, okay, after maybe 2 or 3 iterations.00:47:30

I really am not getting anything useful, like, the LLM thinks something is useful, but as a human, I'm seeing it as very low precision. So that just helps it build intuition about, like, the data that you have available in there, and how useful that is for the task.00:47:46

I just say this because of the point you said of, like, with the scale of the data, you cannot have a human go and, like, fully give you the recall.00:48:02

So you need to rely on some other heuristic, like, iteratively query until the LLM just cannot find anything interesting anymore. The nice thing about the LLM is if you ask it for something more, it'll give you something more, right? So it's up to you.00:48:10

to figure out what's interesting or not. Another thing is you could use, like, a non-LLM-based ranking,00:48:22

Mechanism, like a ranking model or something, and then that will just give you a list of all of the outputs, and then you as a human can manually scan that list until you see that there's, like, a point in which it's all super irrelevant.00:48:30

Daniel Saad

So, for example, precision at core, and recall at the core, and then move the call from 5 to the infinite and see what is happening.00:48:43

shreya

Yeah, but do it slowly, like, 5, 10, 100, whatever, like, until you get to a point where you just… it's all useless, yeah.00:48:51

Daniel Saad

And keeping the ranking in mind, if the ranking diminished also or not, so that's NDCG or other metrics for the ranking, MRR also diminished at the same time, or not.00:48:58

shreya

I would not think about those other complicated metrics right now. I would just start simple, vary your K,00:49:09

and then look at the recall at K as it gets bigger.00:49:17

Daniel Saad

Oh, thanks a lot, this was very helpful, thanks.00:49:23

Hamel Husain

Mubine.00:49:27

mubeen

Yeah, hey, how are you?00:49:30

Hamel Husain

Hi.00:49:32

mubeen

Oh, perfect.00:49:34

So yeah, my question is more around how would you think about… I know the importance of evals in this scenario that I'm thinking, but I want to ask, like, how would you go about executing it and then deploying it?00:49:35

So, what we are doing right now is we are building a compliance chatbot, essentially, for financial firms, right? Any firm that is regulated, any step that we have to take, we have to be very careful about, you know, whether or not we're stepping on some regulation, whether or not this triggers some other00:49:49

Aspect of the regulation that adds extra business burden or whatnot.00:50:07

Now, the problem that I'm facing is, we built this RAG model, it answers reasonably, okay, we have to, of course, build out the eval platform around it, but…00:50:12

We can't take too much of a risk with this being out in the wild and us learning on the fly. Like, what's gonna work and what's not gonna work.00:50:25

So, my question is, in your experience, how have you seen such platforms? Have they gone live? Have they worked? And, how has… how has evals helped in development, but also00:50:34

How do you figure out, okay, when it's ready to…00:50:47

Go out in the wild, and then probably not hallucinate as much as it should.00:50:50

Hamel Husain

So some kind of things that come to mind generally is, like, one, try to have some design partners or, like, some friendly users or customers that you can work with to get some initial read on what the data is. You don't want to just, like, only rely on synthetic data in this use case.00:50:58

Because, like, you know, there's a risk that you just don't know. You're wrong about your hypotheses.00:51:15

Two is, like, try to scope the feature.00:51:21

or the, you know, scope what you're doing. Don't just have it be…00:51:25

you know, for example, you know, ask me anything chatbot that can do any question about compliance, that's a lot of surface area, so I would, you know.00:51:29

Try to scope it down, and see, you know, build confidence in that scoped feature, and then you can expand scope, like, over time. But, you know, that's generally…00:51:40

like, an approach that I've seen work well. Shreya, you have any thoughts?00:51:52

Katya May

Had the good water.00:51:59

Hamel Husain

Oh, no problem.00:52:01

Yeah, I mean, that's basically it. It's like… you know, just…00:52:03

Be really… Put your adversarial hat on.00:52:11

And be extremely adversarial based upon your design partners and based upon your product knowledge about where your app can go wrong.00:52:15

mubeen

D.00:52:22

Hamel Husain

And that's really the best that you can do. And scope it down appropriately.00:52:23

mubeen

Okay, okay, thank you.00:52:27

shreya

Through all of that.00:52:30

Sorry, I was gone for, like, 20 seconds, but yeah, I would agree with all of that.00:52:32

Hamel Husain

Monica.00:52:43

Monica Cabello

Hi, everyone! My name's Monica. I… can you guys hear me okay? Yes, right. Oh, wonderful. So, I work at a prop tech company, and I know that we've talked a lot about, you know, error analysis, and the second, kind of.00:52:45

bigger scale AI agent product that we launched was a blog writer, and so it feels very simple for real estate agents, but00:53:01

We write thousands of blogs, you know, every month for thousands of clients across basically every single market of the United States, and so we have domain experts that can help us00:53:11

you know, evaluate whether the grammar looks okay. We even have, like, a Gen AI media producer that, like.00:53:25

You know, really looks at the imagery that gets produced from these blogs, but…00:53:34

sometimes these blog topics can get very specific. Things like, you know, heat pump rebates in X city, or, you know, you know, closing costs in Alabama, including the, you know, intangible taxes or tax breaks that you can get. And so.00:53:38

You know, right now, our product manager is the one that's, you know, running these evaluations and, like, being the,00:53:57

person that's in charge of all of that, but my question is, like, when a product gets this domain-specific, like, how much…00:54:06

do you have to rely on domain experts? And, like, how do you decide that00:54:15

you know, you're going too deep before, it's, like, overkill. I mean, we do have00:54:22

a design partner program that we launched, like, a couple of weeks back, and I lead that initiative, but we pay these people to be design partner programs, and it can get very expensive. Their time, you know, has some value in getting them to review, like.00:54:27

people across every single, you know, domain experience, it can be really hard. Like, I guess, would love your advice on00:54:46

what to do and how to evaluate these products when it's this specific. So, yeah.00:54:55

Hamel Husain

Yeah, it's always really good to keep yourself honest about your ability to have the right domain expertise, or be a proxy for domain expertise.00:55:03

And there's certain strategies you can use. Like, one is, you know, you want to definitely be talking to your users a lot, and understand, like, whether this… your application is helpful for them.00:55:14

You can also look at product metrics, like various usage metrics, even, like, some explicit product feedback mechanisms you might have in your product, like this blog writing.00:55:26

And look at that data very carefully, and see if you understand why00:55:36

you know, a real estate agent is not using the blogs you generate, or making, like, lots of edits on top of it, or just not using the feature, trying the feature, not using it. If they're not giving you feedback, or you don't have visibility, like, if it's a mystery to you, you're like, hmm.00:55:44

Why are these…00:56:00

agents, these real estate agents, not using this feature even after trying it. That's a good smell that you don't have a good mechanism for product feedback?00:56:01

So that's… that's definitely one thing to… that's something that I would keep in mind. I will say,00:56:14

from a product perspective, we will have… we haven't announced it yet, but there will be… Teresa Torres will be doing a guest lecture, where she will be talking about the intersection of continuous discovery and evals.00:56:20

in this course. We haven't put it on the calendar yet, but it's coming. So I would definitely come to that, because that is very much part of, I think, what you are…00:56:35

Interested in with that question?00:56:45

Lots of hearts.00:56:47

Monica Cabello

Wonderful, really excited about that, and I love that you caught that it's tough for us to call these AI agents agents, because our real estate agents are agents, and so we have to call them AI specialists, so that's a good one.00:56:49

Thank you.00:57:06

Hamel Husain

Yeah, no problem.00:57:07

Vignesh?00:57:09

Vignesh Iyer

Hi, Shreya and Hamil, thanks, for the opportunity. Yeah, basically, my question is regarding, sort of, almost pipeline, kind of, development, almost before the process of error analysis and kind of doing it together, whereby00:57:11

thinking of a scenario of an insurance validation, insurance claims validation, agent, right? Considering the fact that, there's different channels of input.00:57:28

That can come through.00:57:40

So, you've got a chatbot, which could, interface where claims come in, and then, you ask… or maybe claims don't come in, but you ask questions about policies, to the chatbot, versus when maybe claims00:57:41

come through as emails, or come through, an automated claim submission system to this agent. Product is kind of given the,00:57:56

the… the task of creating some kind of shared layer, right? Which takes in, from these different channels.00:58:07

And, that shared layer is able to interface with policies and with different types of input, and then analyze these policies and claims, and then tell you what's covered and what's not covered. So my question is,00:58:15

when thinking about something like error analysis and failure modes, I find that it's…00:58:29

Kind of getting a little bit more tricky when there's multiple channels of input, and…00:58:36

that kind of adds to more dimensions. Each input has its own kind of dimensionality that can come in structured input versus, like, unstructured.00:58:41

Would you sort of design the pipeline to have such a shared layer, as is being asked, so that these different channels of input, so multi-turn questioning.00:58:50

Plus, like, structured, like, emails or structured things from, like, a claims processing system come in? Or would you keep your pipelines separate and your shared layers separate so that your error analysis process is a little more00:59:01

kind of easier to do, perhaps.00:59:18

Hamel Husain

So my first… okay, so, I think there's two separate things going on, is, like, one is a question about architecture, how you design your system. I think it's a little bit separate. That's one set of questions. Another set of questions is how you do the evals and the error analysis.00:59:26

I would say, okay, like, we would have to dig into with you further to figure out the architecture stuff, but with the evals, what you want to do is, if you have different channels, and Nurture Boss has this too, like, text messages, voice, chat, email.00:59:40

And what you want to do is, you want to design your own custom interface.00:59:56

to render that data, and have filters, so that when you are doing the annotation and the error analysis, you are in the right mindset, and you have the right context, and it's not confusing to you. Because you're absolutely right. If you're trying to do error analysis, and it's, like, all these different channels, all these different formats, and you're jumping around, it can be very onerous, so…01:00:00

One thing that we will discuss in this class is, like, how to make your own annotation interfaces.01:00:19

And so I would… I wouldn't say you need a separate error analysis system, per se. I would really keep it the same. I would just be very, make it customized for your domain to make sure there's no friction in looking at the data.01:00:25

Vignesh Iyer

Got it, got it, makes sense. And then, following on with the architecture, for instance, you're mentioning then it's… it's fine for, like, your pipeline to kind of keep this in play, and generate traces from all of these channels, and keep the process the same. Yeah, I don't have… I don't have to say that the pipeline design01:00:42

is so tightly coupled to the error analysis that just because I want to do error analysis, I split my pipelines. I'd rather keep them the way they are.01:01:00

Hamel Husain

Yeah, I don't… there's no reason I can see, so far, that you should have a different pipeline for them. You should have metadata, you should be capturing the metadata of, like, how…01:01:10

you know, obviously, like, how… what the channel is. Is it text, is it voice, is it whatever? And you should be logging as much information as possible. You can always process this data downstream.01:01:20

Vignesh Iyer

Got it.01:01:31

Thank you.01:01:32

Hamel Husain

I just wanted to do a time… so we all… we are coming up on time, we have about a minute left,01:01:35

we will be… so if we didn't get to your question, you can continue to answer it in Discord, or ask your questions in Discord, sorry. And then we have lots of office hours.01:01:41

So, I would encourage you to do that. I might be able to stay on longer, I just have to look at my calendar real quick, so…01:01:53

Let me see…01:02:00

shreya

Yeah, I can stay on until 9, like… 40, I would say.01:02:01

So… We can probably get through 4. Let's try.01:02:07

Ashish?01:02:13

Ashish Toshniwal

Hey, how's it going? I have two quick questions. So, first of all, so imagine a case with an AI agent that has really…01:02:15

high bar for what is correct, versus, a failure. Like, say, just, like, for the sake of example, an email bot, or an EA bot, that has to actually get01:02:27

the scheduling right between one or more parties. In that case, when you're getting design partners, in the beginning.01:02:40

one of the things that I'm thinking about is it may… if it fails, it could actually cause, like, fairly large consequences to the person using it. Do you have any ideas or thoughts about how to gather that data from design partners without necessarily actually making state changes in the beginning that are fairly negative01:02:49

to the person using it. Like, reputational risk.01:03:08

shreya

Okay, I think… Camel… might have… Dropped.01:03:18

Hamel Husain

No, I'm here, I just was.01:03:25

shreya

Okay.01:03:26

Hamel Husain

try to sort my calendar, see if I can stay… I can stay a little bit, longer, sorry. Can you repeat? I didn't.01:03:26

Ashish Toshniwal

Yeah, yeah.01:03:33

Happy to repeat the question. So, essentially, imagine a case with, like, an email EA, and the bar for what is a correct answer and an incorrect answer is pretty high. You know, there's, like, the end user has really high expectations, and they'll be able to tell easily if something goes wrong, or if a meeting gets moved that wasn't supposed to be, or canceled, what have you.01:03:34

So, the question is essentially.01:03:57

When sourcing those design partners in the beginning, when you have a high risk of causing an adverse outcome.01:04:00

For, for your initial design partners.01:04:07

how do you minimize the reputational risk in gathering that data, with design partners? I'm curious if you guys have encountered situations like that.01:04:10

shreya

I have some thoughts that Hamil's way better suited to answer this than I am.01:04:20

Hamel Husain

So one is, like, obviously, like, you need to be lo- using the application yourself.01:04:25

Quite a bit. You know, I think we saw this with coding agents. Developers have a really high bar, and in the very nascent stages of coding agents, we had this exact same problem. And that… there was… in that case, it was a very…01:04:30

Fortuitous, in that the…01:04:46

The developers were also the domain experts, and they could iterate, and they were actually dogfooding it.01:04:48

We have to be careful about this word, dogfooding, it's a little bit abused. People often say, yeah, we're dogfooding it, but they're not really dogfooding it. So, the best way I know is to, like, use these things01:04:54

is, like, kind of step up the risk ladder. Like, one is, like, use it yourself. Number two is, like, get these, like, you know, a few design partners.01:05:07

You know…01:05:16

To… that are friendly to do this, and then based on that, like, really be adversarial and generate very, like.01:05:18

Think very creatively about the synthetic queries.01:05:26

If you start with, like, a few design partners, you'll get a lot more information that then you can use to basically01:05:29

Try to cause havoc in your product.01:05:38

You know? It's in ways that are plausible, and that's the kind of angle you want to go with.01:05:41

I do agree it's a challenging problem. I don't have an AI EA. I have, like, a physical EA for this exact reason. you know, if you make something like that, please let me know.01:05:49

Ashish Toshniwal

Definitely will. And then just really quickly,01:05:58

The other… the other quick question I have is essentially around… so, when you have a… in that scenario, for… I think that…01:06:02

We want to have domain experts essentially checking as a…01:06:12

you know, preventative measure to make sure that before something goes out, it actually is, is correct. Is that technically, like, the approval or rejection of a proposed Asian action, and if there's a rejection, explaining why?01:06:18

Do you think that makes sense to include on the trace, or is that itself an open coding step?01:06:33

Hamel Husain

When you're talking about reason, are you talking about an LLM reasoning itself?01:06:42

Ashish Toshniwal

Yeah, yeah, exactly. So, LLM, like an agentic reasoning where tool calls are happening, and then you see, basically, what the agent has proposed to do, sort of similar to Claude Code.01:06:45

Hamel Husain

Yeah.01:06:55

Ashish Toshniwal

And then imagine, like, in Cloud Code, there's an example where… or, you know, it's, like, manually… or rather, like, say no, tell us what you want different, essentially. Is that itself open coding?01:06:55

Hamel Husain

Yeah, so, okay, I love this question. So, no, it is not open coding. You cannot… do not outsource the open coding. Like, you…01:07:08

have to do the open coding. It looks like open coding, sort of, but it's not. And you should absolutely, like, the reasoning should absolutely be captured in the traces. The reasoning really is part of the output of the language model, it's just that you don't show it to the user.01:07:18

And that's something that normally, with any off-the-shelf01:07:36

you know, observability tool, it will capture the reasoning. You know, language model providers might not show you the whole reasoning, but whatever is shown is captured.01:07:40

Ashish Toshniwal

So in this case, I don't mean to interrupt, I just want to make sure that I said it clearly enough, a domain expert would be making the call of whether or not the output is good or bad, and then saying why if it's bad.01:07:51

Hamel Husain

Yeah oh, okay, so if the domain expert is saying why it's bad,01:08:04

That is a form of open coding, like, if they actually have a comments field and say, this is bad, but I would put your eyes on it.01:08:10

Ashish Toshniwal

Okay. Because people may not be doing a good job.01:08:17

Hamel Husain

And I would say, is this a good open code? Because I'm gonna go fix the open code.01:08:20

It might be a good open code, it might not be. I would guess that a lot of it wouldn't be, based on my experience in the wild.01:08:24

Ashish Toshniwal

Gotcha. So you think we should run a separate process as well, where we're looking at the full end-to-end trace?01:08:34

Hamel Husain

Yes, yeah, look at the full indent trace, look at the person's comments, take that all into account, and then you'll get a sense, like, okay, are these good? Are these not good? How's the data?01:08:39

how is this feedback? And you'll kind of see, okay, I would actually do it yourself first,01:08:51

And then, you know, see, like, how good of a proxy this, like… User comments is.01:09:00

Ashish Toshniwal

Thank you so much.01:09:05

Hamel Husain

Thank you.01:09:07

shreya

Isaiah?01:09:16

Isaiah Bendi

Yeah, thank you, Shira. Thank you, Hamil.01:09:20

Thank you for this cause. Really, really,01:09:23

Exciting. So you'll pardon my line of question, I'm not a technical…01:09:26

guide as such, I'm just a founder. But this course has really, really opened my mind to a lot of things. So my question basically is around01:09:32

So currently, we, we, have, an enterprise, software as a service that helps, our clients, majorly telecommunication clients in Africa.01:09:41

Manage their distribution, their inventory, their territories and dealers.01:09:52

So, just, about towards the end of last year, we transitioned this into a chat commerce.01:09:58

So this year, we sort of, also now, are kind of reimagining that into, to be managed by, by, by AI agents.01:10:04

But one of the things that we've realized with that transition is that we've constantly been having issues with, hallucination. We've had an issue where the agent jumps a particular… so you can imagine we're dealing with inventory, so there are checkouts and all of that, so instead of going to the checkout sometimes, it will just, pass, TNCs to customers, or it can pass to customers, products.01:10:14

you know, there's just a bit of a lot of, hallucination or jumping of the queues and all of that, so we… we thought the problem was the architecture.01:10:38

So we sort of transitioned and now say, okay, let's make it multi-agents, where, we have a dedicated agent that is dealing with the inventory, we have a dedicated agent that is dealing with the payments and all of that. But still, we're…01:10:47

also still, we're still experiencing an issue of hallucination here and there. So I was wondering how can we… and I've been really, really worried of how can I guide my team into, you know, error analysis and writing the evals and all of that. And then I saw this course on Lenny's podcast, and I was like, okay, I think this is it.01:10:58

I was hoping if you could help me with some guidance. I'm happy to also share this question, on the Discord channel.01:11:18

shreya

Yeah, so first off, I think this is a great, very rich problem that a lot of people are probably facing similar variants of, so you should definitely post in Discord. Two things that I have to say for the question is, one,01:11:26

when doing the error analysis, it sounds like hallucination is a really broad category of problems, but reality, if you think of hallucination as, like, the axial code, there's actually a lot of different codes under that, right? Like, hallucinations with a particular tool, or specific ways that it hallucinates. And I think it's very important to get an understanding of those, maybe make each of those their own failure modes.01:11:38

Because then you can measure the prevalence of those, and then when you do split into this multi-agent architecture, or, like, whatever strategies you want to do to improve the performance, you can then have that fine-grained feedback of, like, oh, hey, it helped in this failure mode, but maybe not in this other failure mode, or, like, this agent is still bad. I don't… I don't know, I'm just making up something, but…01:12:01

my result. So next week, we'll talk about how to take your axial codes and convert those to automated evaluators. So, like, LLM as judge, or program… programmatic evaluators, that you can then run on your tracers at scale, and then see, like, what is the failure rate of these modes.01:12:20

So, concretely, I would say, like, hang tight, wait for next week, but really go into axial codes and try to make sure your axial codes are not too coarse-grained, because that loses insight on, how you're going to be able to improve your system.01:12:38

Isaiah Bendi

But would you advise the multi-agents, route? Because earlier on, we implemented as a single agent that is managing different stages. Now we've transitioned into multi-agents. Do you think that architectural?01:12:56

shreya

I…01:13:08

I maybe… it's really hard to definitively say unless you can see the prevalence of your axial code failure modes. Like, if indeed you can break it down into, like, certain types of retrieval or worse than others, then the…01:13:08

multi-agent architecture makes a lot of sense. That might not be the solution to your problem, it might just be the case that maybe you don't have enough context for your agent, or I don't know, like, something else might reveal itself during error analysis. So it's really hard to say at this moment before01:13:24

seeing, like, error analysis or, like, measurement, like, what's the improvement solution? Does that make sense?01:13:40

I'd say follow the life cycle, go in order, there's a way to do it, and then it will be clearer at the end what's the best01:13:47

Tool to improve.01:13:54

Isaiah Bendi

Okay.01:13:55

Okay, great, great. Thank you so much.01:13:56

shreya

Yeah.01:13:59

Hamel Husain

I'm back, by the way. Oh, I'll stick one more.01:14:00

shreya

Let's take one more, and then I, like, have to really run, sorry.01:14:03

Hamel Husain

Yeah, yeah.01:14:06

shreya

Swaffee.01:14:09

Swati Jhawar

I'm the lucky one,01:14:14

Hi, Strez. Hi, Hamal.01:14:16

A real quick question for me. So, in my mind, the process is threefold. I need to know what tasks and corresponding prompts, then the required data set.01:14:19

And from there, what rubrics and grader or eval I want to get to. The challenge I ran into in couple of the domains I'm working01:14:32

it's not easy to get experts who are going to help me validate the synthetic data I'm going to produce, or the prompts that I'm writing. Is it a good description of what in real world will look like?01:14:41

I'm curious, are we in the industry, since you both are ahead in the curve, are we at a point where I can completely rely on synthetic?01:14:55

data and synthetic way of prompt generation, or do I constantly, even if sample set, need to have some human expert validation for me to feel confident that this is representing and it will not fall flat once I go out?01:15:05

So, curious if I can just go synthetic route completely, do I need human input?01:15:21

Just in the first stage of prompt generation and data generation for it.01:15:27

Hamel Husain

Yeah, I could try to answer this one, so…01:15:33

You know, synthetic data is only as good as your hypothesis.01:15:37

Swati Jhawar

And your hypothesis…01:15:40

Hamel Husain

you know, Yeah… I mean… It can be better over time, but it's kind of…01:15:42

Somewhat flimsy a lot of times in the beginning, just like any hypothesis is, and it's a narrative process, and, you know, it's hard to just rely on synthetic data, because you don't know what you don't know.01:15:51

Swati Jhawar

And, you know, people use applications in ways… always use them in ways you don't expect.01:16:02

Hamel Husain

And have different mental models than you do?01:16:08

And so I would say we're not at that point. And then second is, like, the domain expert question.01:16:10

Is… there has to be a way that you give the language model some, like, description of what you want. You know, it's not…01:16:16

Swati Jhawar

Thank you.01:16:26

Hamel Husain

And you need, you know.01:16:27

You want to align your product with what your users think is good.01:16:29

And that needs to come from somewhere. It has to come from the domain expert.01:16:35

It, you know, it can come from you. If you have enough01:16:39

context and domain knowledge, but sometimes you don't, and you need the domain expert. Let's say it's, like, a legal application, you might not be familiar with the law, or you don't know.01:16:43

And so the only way is domain expert. You know, Now…01:16:51

the only regime where you wouldn't maybe need that is, like, some kind of AGI, which is not here.01:16:58

Swati Jhawar

Yeah.01:17:04

Hamel Husain

It's like, you know, a general-purpose AI that you just trust with anything and everything.01:17:05

Where you… where you have, quote, evaluated it on so many tasks that every time you give it a task, it, like, you feel like it knows everything, then you're willing to take that risk. We're not there yet.01:17:09

And so… I think you need domain experts, and you need…01:17:21

To look at the data.01:17:27

Swati Jhawar

Right.01:17:29

Can I ask one quick clarification there, Hamas? So, if, instead of relying on one.01:17:29

Say I produce those task descriptions and their prompts using GPT-5, then I give it as an input to another LLM,01:17:35

To see… because all of them are trained on different data sets, and… and is there a better way I can come up, if I have to, for whatever reason, rely on Synthetic only, but use multiple LLMs to ultimately produce those task descriptions or corresponding prompts.01:17:44

Can that make it better, or still no? Like, I have to have human involved.01:18:00

Hamel Husain

I mean, it will make.01:18:07

shreya

I'm always wary of…01:18:08

like, just having a more complex and indirect process of LLMs talking to LLMs as a way of getting aligned with humans. I find that if the core problem is alignment with humans, we gotta put a human in there somewhere, and for your own sanity, reduce the amount of indirection and complexity within whatever LLM A talking to B, talking to C to come up with D. Like, I've seen that just go so01:18:11

awry.01:18:36

Swati Jhawar

Okay.01:18:37

shreya

But maybe I'm extremely jaded, and maybe it does work very well, I don't know.01:18:38

Swati Jhawar

No problem. Thank you so much, that was the question. Thank you.01:18:44

shreya

Okay, cool. I… I have to run. Hamel, I don't know if you want to say, but we will be back for office hours tonight in the flip time zone. So, see folks later, or maybe see an entirely different set of folks later, I have no idea. And then feel free to ask me Discord, also, if you have questions.01:18:52

Hamel Husain

Thank you.01:19:11

Live session where instructors will address questions. Instructors may present answers to common questions, followed by live Q&A

[

Home

](/parlance-labs/evals/2025-3/home)[

Community

](/parlance-labs/evals/2025-3)