Week 1

OCT 9 Optional: Live Office Hours 2 THU 10/99:00 AM—10:00 AM (GMT+5:30) OPTIONAL Recording

Notes

Back

Recording

Optional: Live Office Hours 2

Oct 9, 20259:00 AM - 10:00 AM GMT+5:30

Audio Transcript

Chat Messages

Shreya Shankar

Yay!00:01:21

Hamel Husain

Hey, how's it going?00:01:22

Shreya Shankar

Can you hear me?00:01:25

Hamel Husain

I can, yeah. Can you hear me?00:01:26

Shreya Shankar

Alright, we'll take.00:01:31

Hamel Husain

platform.00:01:32

As well?00:01:33

Shreya Shankar

Yes.00:01:33

Yes.00:01:34

Hamel Husain

Okay, go ahead.00:01:35

Shreya Shankar

I don't think we…00:01:37

Hamel Husain

I've been this late before.00:01:38

Shreya Shankar

I know, I know.00:01:40

Hamel Husain

Okay.00:01:43

Why? Okay, I see. Should I begin?00:01:48

Shreya Shankar

Yeah!00:01:51

Hamel Husain

Okay.00:01:52

Let me…00:01:54

Alright, welcome everybody.00:02:03

Shreya Shankar

Cool.00:02:04

I'm so curious how many people are gonna be here.00:02:06

Hamel Husain

Yeah. In this, like, midnight…00:02:09

Shreya Shankar

the U.S. time officer.00:02:13

katenesmyelova

Well, you know, I know that you are very, very US-centric, but people exist outside of US as well, you know?00:02:17

Shreya Shankar

Yeah, that's why we have these slides.00:02:25

katenesmyelova

I'm joking, I'm joking. Okay, thank you so much for having these sessions. I really appreciate that you are working early and late to cater to everyone.00:02:28

It's really cool, thank you so much.00:02:38

I'm just joking.00:02:40

Shreya Shankar

Oh, I'm glad that people are coming.00:02:42

Okay, so in the morning, we… well, in the morning for us in Pacific Time, in the last office hours, we…00:02:46

did it as follows. We basically had people raise their hand, like, use the raise hand feature in Zoom, and you can kind of see that button in the bottom toolbar. So if you click that, you'll be kind of placed at the top, and then we'll just kind of go in order of people who've raised their hands. And then we please ask two things,00:02:55

please wait in line to ask the question, so everyone hopefully gets a chance to ask. And then, also, if you're recording this with, like, an AI note-taker, please don't post your AI note-taker's notes on the internet, because people might, you know, not want their trade secrets filled up for the world.00:03:14

So… yeah.00:03:34

Maybe people want to start raising their hands, and we can go through. Also, Hamel, you have your hat!00:03:36

Hamel Husain

Yeah, I was gonna wear it, but I don't think you can see it. Evals are all you need.00:03:42

katenesmyelova

Aye.00:03:46

Hamel Husain

So I decided to take it off, because it's not that obvious. I just look like a clown in the recording.00:03:47

Yeah, let's get right into it.00:03:55

Shreya Shankar

Okay, Robert.00:03:58

Robert Lavigne

I, I was in the, 11.30 a.m. one, and we ran out of time, and I was actually near the end with the handoff, so I decided I would ask first this time.00:04:00

Shreya Shankar

I'm glad you came.00:04:10

Robert Lavigne

Well, you know, I plan on hitting all of these. It's, 11.30 at night for me here, but in any case…00:04:11

My question was more relating to…00:04:17

the type of aggressive eval that we want to do at the initial bid, and what I mean by that is, do we want to red team later and focus more on basic functionality and getting that right before we start breaking it, in a nutshell, right? At what point in time do you tell the people to start focusing on the jailbreaking aspects?00:04:21

of the tasks and whatnot, versus just the, you know, just be a nice person, let's kind of get a feel for it and flush it out. And just, when do you kind of dial in the red team, basically?00:04:39

That's it.00:04:52

Shreya Shankar

Emma, you wanna go?00:04:55

Hamel Husain

Yeah, so it's a… kind of…00:04:56

There's no right answer. Really, like… okay, a lot of people get obsessed with red teaming.00:04:59

And… you should only be obsessed with red teaming if you have a reason to be obsessed with red teaming. So, you know, if you have, like, some app that's, like, has the ability to expose some sensitive information.00:05:06

Then, maybe… you should… you should be worried about it. However, you know, like…00:05:20

If you're worried, like, your, AI can be jailbroken to curse at you, and you're just trying to book airline tickets, I don't know that's…00:05:26

particularly concerning. So, you know, it… it… you kind of have to be…00:05:37

kind of mindful of what exactly the error modes are. And I would… I would personally start with the actual failure mode. I would first, like, focus on usage. Like, does… can you get the AI to work?00:05:45

Because, like, it doesn't really matter whether you can jailbreak it or not, but the intended functionality doesn't work.00:05:58

And then… You know, that's the first, kind of, usually priority, and then…00:06:04

If you're worried about any kind of jailbreaking, it's…00:06:09

Can you… first of all, like, if you're trying to secure sensitive information, can you…00:06:14

can you secure that at the application level better? Can you not expose that to the AI to begin with, or can you do something else?00:06:20

But, yeah, I mean, it's a spectrum. You always also want to have a little bit of an adversarial attitude, even when you're not, like.00:06:29

you know, concretely focused on red teaming and jailbreaking, you still have a little bit of an adversary attitude in the sense that you want to… when you're generating synthetic data, or when you're testing your system, you kind of want to…00:06:40

See how you can break it.00:06:54

Shreya Shankar

Yeah, I would agree with that. Kind of… I feel like a lot of people also jump the gun on this. Sometimes people will, like, go try to do a bread teaming approach without even have done… doing their error analysis in the first place, which is, like.00:06:59

interesting, choice of priorities, but I wouldn't recommend that.00:07:15

Hamel Husain

Okay, Ashish.00:07:23

Ashish Bhatia

Thank you, folks. My question is on, sample selection for error analysis.00:07:26

And let's say your daughter assistant is already in production.00:07:33

Generally, there are challenges getting production data, right, for privacy and other reasons. But what's a good strategy to collect samples so you get enough diverse data to then be able to do the right level of error analysis?00:07:38

Shreya Shankar

Yeah, we get this question several times. The first thing I would say is…00:07:56

try to get the real data as much as possible. Sometimes you can't get the real data because of privacy reasons. See if you can get, kind of, annotated versions of it. See if you can create a separate team that just does…00:08:01

kind of redaction or something, like, they might have privileges that rest of people don't. It's worth investing in those teams, I've found, at least in, like, my projects.00:08:14

Legal applications, just because then it allows us to actually do evals and look at our errors and whatnot. Like, the Nurture Boss data, for example, that Hamel got, like, all of that is kind of redacted, and then it makes error analysis a lot easier when we can go through it with people.00:08:24

So that's the first thing I would say. The second thing is, start with a random sample.00:08:41

And when you do your error analysis on a random sample, you might find that there are some interesting edge cases, and then you can sample based on those edge cases. I think the pitfall to avoid here is, like, just trying to, like, cluster your traces based on embeddings.00:08:48

and then sample from those clusters, because they're quite arbitrary, right? They're just whatever the embeddings think are meaningful clusters. They're not actually what are clusters of errors in your application, or clusters that you believe are meaningful as a person building the application. So I just say start with a random sample, you're going to find some outliers in there, and then use that to kind of sample more, and then over time, you'll know kind of what strata you want to sample on.00:09:03

Ashish Bhatia

Okay, thank you.00:09:30

Shreya Shankar

Hamel, you might have something to add, or… oh.00:09:34

Hamel Husain

No, I think that's generally… I mean, you'll get more… you can get more sophisticated over time. It is good to start with random sampling, and then you can take a more of a kind of data analytics mindset to it, and say, where do you find you… there are higher probability of failures?00:09:36

based upon what you know, and you can apply all kinds of filters based on your dimension… business dimensions, or other kinds of intuition you have.00:09:53

You know, there's no, like, magic formula, let's say, but the idea is to try to…00:10:04

Balance, let's say, like, exploring unknown things, and being open to… you know.00:10:12

Finding, you know, like, doing random sampling and, like, sampling in areas where you know, or you feel like you have a good intuition that things are broken.00:10:20

Ashish Bhatia

Sounds good.00:10:33

Hamel Husain

Kate.00:10:35

katenesmyelova

Hello. I have a question, regarding…00:10:37

Evolve and guardrails. So, again, I'm a pretty newbie in this area, and this is why, right now, we are using AWS guardrails to ensure that the,00:10:43

Again.00:10:58

That we steer, on top of the problem, that we steer the discussion where we want it to be, and that we do not expose some internal00:11:01

Workings of our agents and stuff like this, but…00:11:11

Do evals and guardrails live together, or in the future, we can potentially replace the guardrails, and just remove them, and just use the evals to…00:11:16

Do all the stuff.00:11:28

What guardrails do right now?00:11:30

Hamel Husain

I can take this if you like. Yeah, so…00:11:35

Guardrails are a special case of evals. So, guardrails are…00:11:37

Evals that run in their critical path, their request-response path.00:11:43

That block or change or transform a particular00:11:48

AI output before it gets to the user.00:11:54

And so… Guard rails have special characteristics that make it feasible for that to happen. Usually, they're fast.00:11:57

Usually, they're cheap to run.00:12:06

And…00:12:10

you know, that you have to have a different trade-off. So, like, for example, we'll get into false positives and false negatives, but you might have a different trade-off.00:12:11

on what you block. Like, you don't want to be over… overly, censoring things, maybe?00:12:23

Maybe you do, I don't know. It depends on the domain.00:12:30

But you have to consider that. Whereas an eval is… you're not necessarily trying to block something, you're trying to measure something, but it's very similar. So, it's just a special case of that. You do want to align the guardrails with the human, you do want to go through a lot of the same activities.00:12:33

But… And so, in practice, what that ends up…00:12:51

Usually being… it also ends up being different than evals in some sense, because usually… because of the pragmatic nature of it being fast.00:12:56

And… and cheap is, like, you want to… you want it to be extremely scoped.00:13:07

And a lot of times, like, using a very small model, so…00:13:12

But it's… there's a lot of similarities there. There's just a… it's kind of a special case.00:13:18

And something you should keep in mind, we haven't gotten to that yet, we will cover guardrails in a future, lesson.00:13:22

katenesmyelova

Okay, cool. That's exactly why I asked this question, because to me, they are similar in capturing the problems, but they react differently.00:13:31

Thank you.00:13:43

Hamel Husain

Yeah, thanks.00:13:44

padma chandra

I can.00:13:51

Hamel Husain

Yeah, go ahead.00:13:52

padma chandra

Thanks.00:13:52

So, I had a couple of questions, similar to Kate. And I work in a health tech industry.00:13:53

And some of the biggest challenges that we have is, in fact, with data more than LLM. We have data coming from various health systems, which have different pharma at different ranges.00:14:01

And it's kind of becoming… I mean, it can easily go out of hand, and what I wanted to know is… I mean, perhaps you might cover this later, in your lessons. I want to understand, is there any specific foolproof,00:14:14

Framework that, even the health tech industry can follow.00:14:30

Even, that could help us, strengthen this kind of scenario where we are running into a bunch of various health data formats, various range issues.00:14:37

Et cetera, et cetera.00:14:50

Hamel Husain

Yeah,00:14:56

So, one thing you want to do, and I, you know, I'm not sure exactly what the question is, but I have some intuition, and you can correct me if I'm… if I'm off, is, okay, like, if,00:14:57

I think…00:15:09

if you were dealing with lots of data, it sounds like there's lots of different data for, like, a particular instance of, like, using your product that might be involved, let's say you're working with a patient. You want to log all of that data.00:15:10

In a way, like, in… in the sense that whatever the LLM sees, you want to log all of it. You know, and so when we talk about traces, traces incorporate00:15:23

everything the LLM sees, and everything the LLM outputs, and all the tools that are called, all the data that's retrieved by the LLM and considered by the LLM, all of that is part of the trace.00:15:37

And so to properly do error analysis and debug and write evaluations, you're gonna need to log all of that. Now, in medicine, it's complicated, like, you may not have access to look at all of that. And there's not…00:15:47

There's not the great…00:16:02

Sort of, is… okay, so evals become a lot harder if you cannot look at the data.00:16:04

You know, I don't have…00:16:11

a magic bullet there for you. You know, you can try to rely on some kind of proxy metrics, you can try to anonymize that data.00:16:15

There's not really great…00:16:24

you're kind of handcuffed, and I would say, okay, like, yeah, I would try really hard to see if00:16:28

There is a way that some amount of people can look at the data.00:16:34

you know, that's really the only thing that I can offer there.00:16:41

padma chandra

Thank you.00:16:44

Shreya Shankar

No, I think it's just… people run into this problem all the time, and then you just… there's no world in which you can do evals without being able to look at your data. Unfortunately, we have found that this is the case, and so what ends up happening is people either abandon their AI projects entirely, because, you know, they had no way to do the eval.00:16:46

Or they will try to figure out some way to look at the data. Sometimes you need proxy teams to do it, like.00:17:06

you'll figure out a way, you'll… otherwise, it's just… I don't… yeah, I don't know what else to say. I feel like Hamil and I have, like, worked with enough people at this point where it's like, we've never seen someone, like, successfully make it very far without ever being able to look what… look at their data, and it's just a matter of time before they're like, I'm just…00:17:14

They're so frustrated that they figure it out.00:17:31

So that's mainly our, like, thoughts on this privacy thing. And it's really unfortunate. I hope that in the future, some of these vendors can figure out how to build tooling to, like, automatically do logging with, like, the ability to strip the PII.00:17:35

padma chandra

rare, I don't know, like…00:17:49

Shreya Shankar

Or, like, being able to summarize these traces, like, the way that we do this for traditional data science, when we're analyzing, like, tables of numbers and structured data, we have ways of something called differential privacy, right? We have ways of being able to compute00:17:51

queries without revealing information about the people, and we just don't have those techniques yet for the unstructured data. So I'm hopeful that, you know, over the next 5 years or so, people will figure this out.00:18:08

padma chandra

Thank you.00:18:20

Hamel Husain

Robert.00:18:23

Robert Lavigne

Yeah, I didn't want to take up anybody else's, question period, so if anybody got hands up, please put on… perfect. But I'll make this one quick. I just want to make sure that I'm aligning myself with the coding assignment. So I've opted to build it as a Hugging Face space, kind of.00:18:26

Building my own code on my own way, kind of taking, you know, the instructions and kind of building my own eval, my own datasets, my own golden datasets, and just kind of building it all in my hugging face space.00:18:41

Is there a point where not…00:18:54

I guess cloning your repository is gonna get me00:18:57

in a fork, where I'm not necessarily having some of the code that you're… you're expecting us to have down the road, or am I safe to kind of continue down this journey in parallel and see where it takes me?00:19:01

Shreya Shankar

You can continue down this journey. I mean, if you want to do more advanced things, like maybe build your own UI, or try to deploy this, or, like, make it a mobile application or something, at the point where you're, like, sufficiently diverging, then I don't know if the hiking space is…00:19:14

Robert Lavigne

Hugging.00:19:29

Shreya Shankar

space, I don't even know what it is. It's right for you.00:19:29

Robert Lavigne

So, to answer the question, so instead of putting it on GitHub, it's a Git, but it also allows it to run as a demo, right? So you're both.00:19:32

Shreya Shankar

Right.00:19:39

Robert Lavigne

the code, as well as having a GPU running the environment. Got it.00:19:40

Show and tell.00:19:45

Shreya Shankar

I don't think that it… you will need to diverge from that, unless you, like, want to make it a mobile app, or, like, you want to do fancy things, like connect it to your databases, and then your database.00:19:46

Robert Lavigne

Yeah, my original plan was just to set up a couple of Hugging Face data sets that I populate with my, you know, synthetic data, with my golden data sets, with my meta notes, you know, for my open00:19:55

you know, stuff like that, right? But like I said, I don't know, or I wasn't too sure if I was deviating too early, and kind of coating myself into a different fork, basically.00:20:08

Hamel Husain

Yeah, I mean, yeah, it really depends on your skill level. I can… I have an intuition about what you're saying, is that your skill level is high enough.00:20:19

that you… it won't matter. In fact, if you have your own project, it's probably better. You know, like, it's good that you explored issues.00:20:27

Robert Lavigne

Let's just say, I've been coding since 1982.00:20:37

Hamel Husain

Okay. So, so you'll be fine. It's good that you have.00:20:39

Robert Lavigne

Well, I know I would be fine, I just didn't want to… I didn't want to be not fine within the course, if you know what I mean.00:20:42

Hamel Husain

Oh yeah, yeah, you'll be fine with the course. You can… you can apply the concepts, I think, quite easily to anything that you're building.00:20:47

Robert Lavigne

Okay.00:20:53

Hamel Husain

You don't have to do it exactly with our code.00:20:54

Robert Lavigne

Okay, cool.00:20:57

Thank you.00:20:58

Hamel Husain

Yeah.00:20:59

Francisco?00:21:02

Francesco Lanciana

Hello, so yeah, this is, maybe a two-part question. I'm quite new to a lot of this, I really only started building LLM applications in, like, the last month.00:21:04

And so I'm kind of struggling to even get to a point where it's like, I know that it's in a good enough spot of what I'm trying to accomplish to even run an eval, like, it's, like, you know, I don't know if I'm getting there. So, for example, what I'm trying to do, just at the moment.00:21:15

is to allow someone to create an event on a calendar. And so they could ask that in, like, you know, a hundred different ways, and there'll be some information that I need, there'll be some information that I need to get them to provide me, like a human-in-the-loop type,00:21:30

thing, and there'll be some that I'm able to infer. And it seemed complex enough that I was like, oh, I'm using Langraph, and I was like, oh, I'll use a…00:21:45

a React agent in Langraph and just give it a bunch of tools, and then I'll kind of get something working. You know, try and not kind of build this whole custom workflow, just get something.00:21:55

And then kind of run evals on that. And I guess, like, the question… there's probably two parts, is really… one is just, is that…00:22:08

how you'd go… like, is that how you'd kind of treat this? Of, like, oh, that seems complex enough that, you know, and you've got this human in the loop, and you've got, like, a few different… like, the paths can kind of diverge, like, kind of quickly. You'd want an agent, or you'd just be like, no, just go hand-rolled workflow, and then work from there? .00:22:19

Shreya Shankar

I don't think you need to start with a workflow, it's a good question. Like, in the Nurture Voss example in the lecture, we… we have an agent. It is multimodal, it takes in, like, voice trans… it takes in a lot of stuff, and it has a lot of tools, and it's, like, agentic in some way. And all we do is, like, look at the end-to-end, from initial user request to the, like, when the user literally exits the conversation.00:22:39

Right, that's, like, the unit of…00:23:02

data on which we do the error analysis on. And I think that's totally fine. I don't think you should shortchange your approach or, like, build a workflow just for the sake of, like.00:23:05

you know, making your error analysis easier or something. Like, you want to build, like, a reasonable first pass attempt that is both simple, but also expressive enough to do the tasks that you want.00:23:13

Hamel Husain

Yeah, I would start simple,00:23:28

You know, especially… it depends, also… On your skill level.00:23:30

It's… you know, I have intuition that you are… know how to code. I could be wrong, but you.00:23:36

Francesco Lanciana

Are you familiar with coding? And so, good stuff, just not built LLM application.00:23:42

Hamel Husain

Okay, sure, yeah. I think you could… you should definitely start with, like, without any frameworks. The frameworks can hide quite a lot from you.00:23:47

And can make things confusing and opaque when they shouldn't be, and it might get in the way of your learning.00:23:55

Shreya Shankar

I think LangGraph is okay. I would be wary of, like, frameworks that try to write the prompt for you, or do, like, application logic for you.00:24:03

But if it's just, like, orchestrating or gluing, like, a… I don't know, I think it's…00:24:13

Francesco Lanciana

That's… yeah, that's because I was… I kind of looked at, like, I tried writing something up and… and found that… I was like, oh, actually, yeah, adding… adding human loop's kind of annoying, like, checkpointing's kind of annoying. All of these things, it was like, yeah, Langraph seems to handle those primitives without being too opinionated,00:24:17

Yep.00:24:37

Hamel Husain

Yeah, yeah.00:24:38

Shreya Shankar

nuts.00:24:38

Hamel Husain

Yeah, no, that's totally… that's totally fair,00:24:39

Yeah, I think it's fine. If you feel okay with it, it's fine. If you feel like you're understanding all the bits and pieces and how it works, then it's okay. And if you're using Langraph, you might want to check out Langsmith.00:24:44

In week 3, we will have, we will start,00:24:56

Releasing vendor-specific homework walkthroughs as well.00:24:59

And, we'll give you a lot of vendor information, but for you specifically, because you're using Landgraph, you're going to want to check out Langsmith.00:25:03

Francesco Lanciana

Yeah, I've been using it, and it's… it's quite nice. Like, what it gives you out of the box with barely any work whatsoever was… yeah, it was pretty good.00:25:11

Yeah, I had a follow-up question, which was just, and I'm happy to go after it. It was essentially just, like, with those kind of agents, what are you evaling? Like, are you evaling up to the point that it's a human in the loop? Are you being like, I expect it.00:25:19

To ask the human, and so you're kind of checking that it does that, or are you trying to, like.00:25:33

give it something that it can do fully end-to-end without the human on the loop? Like, how… yeah, how are you kind of evaling something like that?00:25:39

Hamel Husain

So you would eval all the workflows. You want to… so, you want to do completed sessions.00:25:47

So, like, some of your requests will have… will require human approval, some won't require human approval.00:25:54

And what you want to do is, like, you want to observe real interactions with your product.00:26:01

And then you want to observe after the fact, hey, like, did this session go the way that it should have been? And that's what open coding is.00:26:08

And then you're just, you know, when you're doing open coding, if you recall, you should stop at the first error that you observe in the sequence, and move on to the next trace. So…00:26:16

I hope that clarifies it.00:26:27

Shreya Shankar

I think it is application-dependent, like, in the Nurture Boss example, we just cut when there's a handoff to the human, and, like, everything up to that point, right, is, like, a trace to be evaluated.00:26:31

In some people's cases, like, you might want it to be longer, like, I don't know if you're building, like, an AI data analyst, where there are actually multiple conversations.00:26:45

while they're exploring the data or something, then a session, like, might consist of multiple conversations, or it might consist of, like, interactions that are not through chat, or, like, it's really application-dependent, I would say.00:26:58

Francesco Lanciana

Cool.00:27:12

Thank you. Thanks.00:27:13

Hamel Husain

Lorinda.00:27:17

Laura… no, sorry, I might have pronounced your name incorrectly.00:27:18

Loredana-Cornelia Ninov

Hi, no worries. Sorry for being camera off, it's super, super early here. But, I'm a product manager, so I'm working on an agent that's in the e-commerce industry. It's actually customer-facing.00:27:22

And it's not a support agent, so it's quite complex. It's actually a system of 5 agents. So, my question is more like with the product manager hat on. So, I'm being really hands-on, but I wouldn't call myself technical, but I'm super interested in all of this and trying to learn it. So, from your experience.00:27:36

What do you think is a good recipe for, like.00:27:59

tag teaming between the product manager and, like, I have a data science team when building the valves.00:28:02

Hamel Husain

Yeah, so, what this course hopefully will give you intuition for is kind of give you the language in sort of…00:28:15

a good intuition on what is important. And so, you know, I hope that we have convinced you so far that error analysis is important, and someone with a domain expertise, perhaps that's yourself if you're the PM on the team, you know, person that's maybe in tasks with00:28:25

sort of making sure the product is aligned with user expectations. You should be involved with the eval process very intimately.00:28:45

And what you want to do is co-design that eval process such that it's easy for you to do.00:28:55

And that might mean that your engineering team00:29:03

You know, helps expose the right tools for you.00:29:06

So that you can do things like open coding, axial coding, so on and so forth.00:29:11

And it might be the case that, you know, like a data science team may help you.00:29:17

But you should definitely do it together, and hopefully, over time.00:29:22

they can make it to where you can drive the workflow. Let's say if you're the, you know, you're the one that makes the call of whether something should be shipped or not.00:29:27

you know, it might take some time to get there, to kind of build a shared context, but that's what I've seen work the best. And you're gonna learn a lot of things in this class, like…00:29:39

Okay, how to know whether an eval is good?00:29:50

And… You know, that is where you will kind of…00:29:54

push back on your teams if you're presented with an eval. Let's say for you, like, so in the wild, you might encounter a lot of things. You might encounter…00:30:01

You know, your data science team, for example, might tell you, hey, here's a hallucination score, toxicity score. They might tell you, here's an agreement metric with human labels. We'll get into all those, and you'll know, like, oh, maybe that's not the right thing.00:30:10

And so that's… that's kind of what I mean by co-design the evaluation process, because you're actually going to learn a lot, even if you're not doing00:30:27

All the mechanics of setting it up.00:30:37

I think what you should keep in mind is, like, how do you drive it?00:30:40

Loredana-Cornelia Ninov

Okay, I've been…00:30:47

I've been kind of co-writing evils with them, but in, like, a similar manner to what you were saying, like, if I'm just putting text into a spreadsheet, not actually coding anything, but, like.00:30:49

when I went through the first set of emails, I kind of realized that maybe the language isn't that good, or we're missing use cases in the product, so it definitely… I could see it's helpful.00:31:03

I'm just, like, trying to get a sense of, like…00:31:15

how is… what's the best approach for a non-technical PM to kind of work with the team? But yeah, your answer is great. Thanks so much, Anal.00:31:20

Shreya Shankar

We love spreadsheets, by the way.00:31:28

All the time for error analysis.00:31:32

Pamel's a spreadsheet wizard. He's using AI functions and everything in there.00:31:35

Hamel Husain

Yeah, I mean, it's not really, you know, you may have seen the Lenny podcast, and I used a spreadsheet during that demo for this reason, because I think that it should be approachable. Later on in the course, we will talk about00:31:39

designing your own data annotation interfaces, and you can… that's something that you can vibe engineer, quite nicely.00:31:53

And that should make it… you know, sometimes the spreadsheet may not be exactly what you want. And if the spreadsheet is working, great, stay in the worksheet. I'm just saying, you want to make it as painless as possible.00:32:01

Loredana-Cornelia Ninov

Okay.00:32:13

Well, it's why I'm here, to learn. Thanks so much.00:32:15

Hamel Husain

Thank you.00:32:19

Steve?00:32:22

Steve

Yes. Okay.00:32:24

Yes, Steve, yeah, I think my question was kind of related to the one from Laura Dono. Like, it is… I think in the class you mentioned, like, having…00:32:26

an evaluator switching between 3 different types of hats, between product, design, engineering. And my question was, like, in practice, like.00:32:36

If we so happen to have, like, a team that is kind of diverse, with multiple different types of profile, is it possible to have some sort of collaboration between those three different types of profile, or three different teams? And do you see any type of00:32:46

Pitfalls between Three different profiles, or someone that is more of a generalist and switching hats?00:33:00

Hamel Husain

Yeah, this is the… We will talk about this soon, so one… Thing that we encourage you…00:33:11

strongly, is wherever possible, have one person be a benevolent dictator in the evaluation process. And, we use that term a lot in the course reader, in the lectures. You'll hear the term benevolent dictator.00:33:18

Shreya Shankar

I think it's, like, at the end of lesson… it's at the end of this week, somewhere.00:33:32

Steve

Okay.00:33:38

Hamel Husain

And,00:33:39

Yeah, and hopefully that term catches your attention. It's meant to catch your attention. But basically, everywhere you can in the evals process, you want to simplify it, and you want to…00:33:41

find ways to, you know, kind of get 80% of the benefit and be fast. And one of the ways is, like, hey, have one person00:33:54

make the final call. You don't want a committee doing error analysis. It'll make it super expensive, super slow, you won't get anywhere, you won't get anywhere fast enough. And what you need to do is try really hard00:34:04

Yeah, to have a benevolent dictator. Now, if your product is, like, really large, and it covers a lot of service area, maybe in different countries, different regions, in a different language, you can have, like, benevolent dictators for different parts of the product, or different things.00:34:16

But you want to try to avoid committees where possible. And if you really feel like,00:34:29

You know, you can't get there.00:34:36

then I would say do a very limited amount of paired error analysis and try to train someone.00:34:38

or try to, kind of, delegate it to the benevolent dictator, appoint the benevolent dictator. So, that's the guidance there. I don't know if that answers that question, but that's what I thought of.00:34:48

Steve

And when you talk about, like, dictator, I do believe it's single person, like.00:34:59

The most advantages, like, avoiding any type of bias or change of thoughts, right?00:35:04

Hamel Husain

Yeah, I mean, the most advantage is, like, moving fast.00:35:12

you know, because, like, with a committee, you're gonna discuss every trace for, like, 30 minutes, it's not good. So,00:35:17

you know, there are situations where you might need multiple collaborators, and we talk about it in the course reader, and there's inner annotator agreement.00:35:25

And things like that, but I would say try to be honest with yourself, like, do you really need that? You know, can you… can you appoint a benevolent dictator?00:35:35

Steve

For sure. Great. Thank you.00:35:44

Hamel Husain

Adele.00:35:49

Adel Mandanas

Yeah, hey. I just realized it might be too early in the course to ask this, but I'm gonna ask anyway.00:35:51

I'm just wondering what the industry standard is for collecting all these LLM traces, especially if we don't use any of these agent or LLM frameworks. So, for context, we built our own AI platform that just directly calls these APIs.00:35:58

But I just realized that, like, we move way too fast and never collected any of the traces beyond just the conversation turns. So I'm just wondering if there are, like.00:36:14

Companies you work for that just, like, plugged into these existing tools, or if we really have no choice but to invest significant engineering effort, again, to, like, collect these traces ourselves.00:36:24

Hamel Husain

So the concept of traces is not, let's say, novel to AI. It's been around for a while in distributed systems. One standard in distributed systems is very common is something called open telemetry. And open telemetry is a standard,00:36:39

you know, that is… when you talk about all these LLM observability vendors, and we,00:36:56

kind of walk you through 3 of them in this class. Arise… sorry, Arise Phoenix, Langsmith, Braintrust. They're all compatible with OpenTelemetry, so if you are logging your traces with OpenTelemetry, then, you know, you can…00:37:04

easily… sort of… you know, onboard yourself into those. I won't say it's free.00:37:20

So, you still have to do, usually, a bit of instrumentation work if you want to fully instrument your system. Every system is a bit idiosyncratic, so it's… and if you are not…00:37:28

having an AI lens on it, it's likely that you may have missed something, you're not logging everything in your existing OpenTelemetry, so you might have to, like, redo some things.00:37:42

So that's one kind of approach. Another approach is…00:37:53

You know, if you're using certain frameworks, which you're not, A lot of them have…00:37:58

Try to, like, be compatible, and have bolt-ons that auto…00:38:05

log for you. But I don't think that's relevant for you, based on what you told me.00:38:11

Adel Mandanas

Got it. Yeah, I think OpenTelemetry is the way for us. And just to clarify, logging everything also means logging the tool results, right?00:38:16

Hamel Husain

Yes.00:38:25

Shreya Shankar

Yeah, that's super critical, because one day you're going to want to debug it, and then you will have no idea.00:38:26

what to expect, or what should have happened in the trace. Like, the tool results are super critical for that.00:38:32

Hamel Husain

Basically, what you should think about is, at any given point.00:38:38

can you completely reproduce what happened? And to completely reproduce what happened at any given turn, you need to…00:38:42

be able to show the LLM everything it saw that… at that time.00:38:50

Which is, like, the results of tool calls, the results of retrieval, all previous turns, any data transformations that you have done on those turns, whatever.00:38:55

So, yeah, it's, you have to log that.00:39:06

Adel Mandanas

Got it.00:39:11

Hamel Husain

Hi.00:39:15

Ahmad?00:39:16

ahmed ahmed

Hey, actually two-part now. Adele, what was the last part of your question? You were asking if it was important to save what?00:39:18

Adel Mandanas

The tool result, on top of the tool call.00:39:28

ahmed ahmed

Oh, okay. Okay, yeah. So, just, like, literally, just log everything then. Sweet. And then my actual question was,00:39:31

With regards to, like, having a benevolent dictator with error analysis, is that mostly, for velocity? Or is it so that you become, like, the outcomes become a lot more focused, because it's, like, one person00:39:42

Dictating and, like, avoiding, like, a split-brain, outcome.00:39:59

Hamel Husain

It is mostly about velocity, but it also focuses you a lot, because what ends up happening in a lot of organizations, like, people are not…00:40:07

Making decisions?00:40:15

they're kind of kicking the can down the road, and they're saying, oh, yeah, like, we don't know, it's fuzzy, we all decided this, no one's really making a decision, it's kind of waffling too much, and we need to stop it, you know? We need to force somebody to say yes or no.00:40:17

And that is a lot easier with the benevolent dictator. It clarifies everything.00:40:33

A lot of times in it.00:40:39

You know, gets you actually there.00:40:40

ahmed ahmed

Okay.00:40:43

Hamel Husain

Things get stuck in committee, like…00:40:43

ahmed ahmed

Yeah, yeah, politics of work, I get it. Yeah. Yeah.00:40:46

Thank you.00:40:51

Hamel Husain

Yeah.00:40:52

Any other questions? Don't be shy.00:41:00

katenesmyelova

Hopefully, I will ask way more after I read and go through more courses, so right now I was just able to go through half of it.00:41:08

But so far, it's really, it's really amazing, and thank you so much for this opportunity, and it just speaks dearly to my heart, because I come from the quality engineering background, and it's just like testing on steroids.00:41:16

Just amazing. Thank you.00:41:28

Hamel Husain

Thank you so much. That means a lot. Thank you for the kind words.00:41:30

Sonia.00:41:34

Shreya Shankar

Oh, you're muted.00:41:39

Sonya Kotov

Tight me unmuted, I'm sorry.00:41:41

This is more of, like, a maybe open kind of thing that I wonder, speculation, would love your thoughts on it.00:41:44

I…00:41:52

I think it's really interesting to see an application, and then see, like, the code behind it, the prompts behind it, kind of…00:41:54

Take it apart, essentially, and figure out what all the pieces are.00:42:02

obviously not that many examples out there, because a lot of companies don't want to share that, so I loved, getting to see the00:42:06

The real estate agent, like, actually seeing the queries that people are writing.00:42:15

I'm curious if you have any tips on where I can see more things like that breaking down these types of applications, or even… I don't know if this is ethical, but, like.00:42:20

maybe if someone posts a link to their cool LLM app that they built, how I can hack it to figure out what the system prompt is, any… any ways to pull those out. So, any kind of thoughts on that, on, like, reverse engineering things that I see out there, so that I can learn how to build better?00:42:32

Hamel Husain

So, it is very difficult. Nurture Boss, and I've worked with a lot of companies at this point. Nurture Boss is the only generous company willing to let me…00:42:53

open all of their traces. They're actually very… Yeah, it's very special.00:43:03

And I've tried very hard to find another company that let me do that.00:43:09

And, I have not been able to. Thank goodness for Nurture Boss. And, you know, it…00:43:14

Toy datasets are a very, very different nature than real ones.00:43:23

And so, you're absolutely right. It would be good if somebody wanted to00:43:30

Share their data completely in that application.00:43:36

I'm always looking for it. If someone has, like, a real production application that is being served to real users, that is a commercial product.00:43:41

that's sufficiently real, you know, and willing to share that data, like, I'm always open to it. So, alright, and then please send them my way, and we'll, you know, create another example.00:43:53

So I don't really have, like, great strategies, let's say. If, you know…00:44:03

One kind of small niche that you can do it is, like, if you have, like, desktop applications, or let's say local applications.00:44:13

That are running on your computer that sort of call LLMs?00:44:25

You could run a proxy that captures all the web traffic. That's very advanced.00:44:31

You know, when… so I, like, I've done that with frameworks.00:44:37

to figure out what the hell is going on, because sometimes frameworks do very complicated things. I have a blog post, if you're interested. It's very… it's kind of technical, but, I'll share it.00:44:43

Has a funny title.00:44:55

That's slightly vulgar, but we can, we can, share it here.00:44:56

Sonya Kotov

Please do, I'm excited to see it.00:45:02

Hamel Husain

Yeah, I'll get that.00:45:06

Shreya Shankar

I really think prompts are easier to find, like system prompts. For example, I, like, put some link in the chat. Somebody's out there trying to reverse engineer system prompts. I think it's much harder to find real traces. I truly have not seen…00:45:07

Examples of real traces.00:45:21

out there.00:45:24

steven

And those that you supply, Treya, look like…00:45:31

general, like, coding agent prompts, and, as opposed to, like, a vertical…00:45:35

Shreya Shankar

Exactly. Yeah. Yeah, so it's like someone went there and, like, tried to probe the system prompt out of the agent.00:45:41

Sonya Kotov

That's great. Thanks, you too, I appreciate it.00:45:50

Hamel Husain

Yeah.00:45:53

Shreya Shankar

there's one other dataset that I like to use for research, it's from Anthropic. It's called this Helpfulness and Harmlessness Dataset. And I just… I don't…00:45:53

think it's fully real, because I do think it was, like, red teamer people trying to create a bunch of dialogues, and again, it's for a general purpose assistant, but they are queries that were written by humans, so… I don't know.00:46:04

Hamel Husain

Francisco.00:46:22

Francesco Lanciana

No. Yeah, I was just thinking about… are you… so, with evals, I know, like, we're talking a lot about traces, but are you kind of collecting, like, a set of input, output, like, golden… I think they call it, like, golden something,00:46:24

to say, like, yeah, it should always, like, if you give it this, it should kind of come out this way. Is that kind of what you're trying to build to? It's, like, just a huge set of something like that, where you can kind of run it, you know, at any point, and be like, oh, how is the system behaving against all of these…00:46:43

These inputs, and are we still getting, kind of, the score we wanted, or is it, like, a lot more involved in that you're kind of looking at individual points in the trace, and then…00:46:59

The second part to that was, like, are you…00:47:08

Supposed to, when you're kind of building out your application, build it in a way where00:47:11

you can mock any, like, thing that will talk to anything external other than, like, the LLM, so you can always say, like, it will return this data when I run this input, and so you can kind of, like, replicate it. Or is that, like, not the way to think about it?00:47:17

Hamel Husain

Okay, I'll answer the second one first. It's always good to write your code in a modular way so that you can play with anything.00:47:33

So that you can rerun a tool call, you can run a retrieval step, you know, you can do whatever, and you can quickly understand what it is. The way that I work with these applications, and I write them myself, like.00:47:39

I try to make it very modular, so I can import a certain thing into a notebook.00:47:54

like a Jupyter notebook, and run that thing interactively, and play with it, play with the data that comes out, and kind of fiddle with it.00:47:59

And so, you know, that's something to think about. Depends on your application, what you like, what tools you like, and stuff like that.00:48:07

I think your first question is, like, are you working your way towards a golden dataset?00:48:15

You're working your way towards test cases.00:48:22

So, in the sense, like, you're trying to collect interesting failures?00:48:26

And,00:48:31

you know, use… if you, you know, in some sense, use those as, like, test cases. It's not just failures, it's failures and non-failures, but, you know, you want to make sure, like, the interesting failures are in there.00:48:35

And similar to software engineering, what you want to do is try to minimally reproduce the error.00:48:48

Okay? So, if you have, like, some kind of…00:48:56

Bot that has a specific error at, like, the 20th turn.00:49:01

Can you reproduce that error in one turn? If so, by all means, you know…00:49:07

Redo that in one turn, and have that be the data, because that's going to be a lot easier to test, rather than trying to simulate 20 turns.00:49:12

But the idea is, like, you want to…00:49:21

Create a minimal test case to reproduce the error.00:49:24

And we will… that is discussed also in the course reader. Maybe not yet, but it will get there when we talk about automated evaluation.00:49:29

Shreya Shankar

I think Chapter 9 is the one for this, specifically. And yeah, people will have different things, like, they'll have these, like, offline, like, CI, like, checks for known unknowns, but unfortunately, that's not all you need, because you will have unknown unknowns emerge for just random new failure modes, because…00:49:38

AI is like that, right? It'll generate output that sometimes is wrong, and…00:49:58

It's gonna happen if you deploy your application, so you kind of always need to also be reviewing new choices, over time.00:50:03

Francesco Lanciana

Yeah. Like, it seems very similar to, like, hey.00:50:12

do regression testing and, you know, just a standard application where you're kind of just like, oh yeah, you found a new edge case or bug, you'll put a test to make sure that thing doesn't kind of happen again, or at least you're tracking it. Whereas in this case, yeah, it seems very similar, except maybe that you're00:50:16

like, you're not saying, you know, all of these have to pass all the time, because, like, it's just… it's not gonna happen that way, but at least you've got a benchmark of how it's kind of going, and it's like, oh, is it moving up or down?00:50:34

Like, the family rights, I think. Yeah. Okay.00:50:48

Cool, thanks.00:50:51

Hamel Husain

Vikas. Vikas Singh.00:50:54

Vikas Pratap Singh

Hey.00:50:57

I, I have a question, related to, like, OpenAI, today released, a video, by title, ULs in Action. I'm not sure,00:50:58

If you have watched that video. So, I hope, like.00:51:10

My understanding is they are using this,00:51:14

pairwise expert kind of grading and blending the gators to get more of unbiased, I would say, evaluation performed. But at the same time, I was also thinking, for OpenAI scale.00:51:17

Like, like, how many people they will have to pair to, to kind of,00:51:32

Make sure that model is doing really good on these, Like, task, right? So…00:51:38

If you can provide any insight,00:51:44

Or maybe explain it in a different way that can, Help me understand this better.00:51:47

Hamel Husain

I would have to watch the video.00:51:55

Shreya Shankar

Yeah, I have not seen the.00:51:56

Hamel Husain

video.00:51:58

Shreya Shankar

You said it came out today.00:51:58

Okay. You can post this in Discord, that would be interesting, I'm sure other people would be…00:52:01

Interested in this.00:52:07

Vikas Pratap Singh

Sure, I think, Ashi is already, yeah.00:52:09

Shreya Shankar

Yeah.00:52:13

Vikas Pratap Singh

I'll post the video link in the Discord channel as well, yeah.00:52:14

Shreya Shankar

I think, to your question about, like, how do these organizations like OpenAI or Foundation model providers do evals, and if that's different from us, absolutely it is. There's actually a paper called GPQA.00:52:17

And that kind of describes the high-level process of… their whole goal is to take00:52:33

problems that the LLMs truly do not know how to do, and have never, ever been trained on, and try to increase that model capability in a general way, like finding new domains, for example, like biology or physics or whatnot. I don't think that's anybody's goal here, because you don't really want to compete with OpenAI and Anthropic and whatnot on that, right? We're trying to build applications that's a little bit more difficult00:52:40

different.00:53:05

ideally, the capabilities that your LLMs need are somehow represented somewhere in some way in the training sets.00:53:06

And you're just trying to design an agent with the right tools, or the right problems that are scoped down into good units, and described well. So that's kind of how I would say is the difference. Maybe Hamil, something to add to that.00:53:14

Vikas Pratap Singh

So just one more thing to that. It means, kind of, that for OpenAI or Anthropic, who are building these foundation models, the existing benchmarks, if they are already doing pretty awesome on those benchmarks, then how do they, up the tempo, right? They have to then come up with some different strategy.00:53:29

So I believe, in that video.00:53:47

That's my understanding, they are trying to explain that now, how they are making sure that00:53:50

on these human tasks, how the models are performing much better than the previous model, or something on those lines. Yeah, yeah.00:53:54

Hamel Husain

Yeah, I'll take a look at the video and… Comment in Discord as well.00:54:05

Isabella?00:54:10

Isabella

Yay.00:54:14

I'm working on the early stages of a chatbot, and when I went to go do error analysis, mostly what I found was that most traces contained, like, some sort of error that the chat made, or that what our tools made. And when I went to go kind of, like, do this measuring, or just, like, trying to categorize and count up everything that was wrong, it was pretty impossible to just, like, have 5 or 6 categories. So, like, each00:54:15

category.00:54:39

could be broken down further into subcategories, and so I just kind of wanted to, like, get a pulse check on, like.00:54:40

should I just really… like, there were obvious takeaways of things that are broken that we need to fix, and, like, is this still a useful thing to be doing when the product is mostly not working? Like, does there need to be, like, a level of quality before, like, this eval process really is…00:54:46

fruitful? Or, like, am I just gonna get bogged down trying to, like, look through all of these things, and should I just use, like, my intuition after kind of, like, looking through00:55:01

What the, you know, error, the failure modes are, to, like, go fix it, and then, like, reassess in, like, a week.00:55:10

Hamel Husain

Yeah, you are very successful, actually. I'm really happy to hear this. So, error analysis is meant to…00:55:17

very quickly uncover all kinds of errors, and you don't necessarily have to… like, if you see errors in error analysis, like, hey, I have an obvious bug in my code, hey, I have…00:55:24

I need to, like, this is a really dumb prompt issue, like, I can't believe I, like, accidentally had that. You'll find all kinds of stuff like that, just go fix it.00:55:38

The most important thing about this process, is you're gonna make your application better?00:55:49

It's not about having the evals, per se, it's about improving.00:55:57

And so, you should absolutely go fix those.00:56:01

Because you found gold. Like, you found…00:56:06

all these things. And, you know, there's a lot of burning things probably in your mind, like, okay, I need to just… you can… when I, you know, showed people error analysis, even Nurture Boss. So, like, when I showed Jacob from Nurture Boss error analysis.00:56:10

He… was busy for 3 months.00:56:27

fixing all the different things you found in error analysis before even we started to think about evals. And so that's absolutely normal, especially when you're early.00:56:31

But that's a good thing. You know, in getting good at error analysis is…00:56:43

like, really useful. Like, you can get really… like, it's a skill.00:56:49

And so, yeah, I would say good job.00:56:55

Isabella

Okay, great, thank you, that was really helpful.00:56:59

Hamel Husain

Yeah.00:57:01

Alright, I think, oh, we have Steve? Okay. We can do one more question.00:57:09

Steve

Yeah, I just had, like, a couple more questions, especially around, like, I'm curious about, like, when do you think it's the right moment to switch from, like, Anadator and human in the loop to, like.00:57:15

switching to LLM as a judge?00:57:28

And I also have a question about, like, when you talk about prompt engineering.00:57:31

I think you, like, in the class, it's especially mentioned about guardrails, but especially when you're creating, like, a system prompt for an agent.00:57:36

do you really need to put those guardrails, knowing that they're already in the system prompt of the model? Is it just repetitive, or are you just inputting tokens, and it's a waste or not?00:57:45

Hamel Husain

Okay, I'm gonna answer the second one first, because I can remember it better, is,00:57:59

It should use guardrails? Let the error analysis let you know whether you should need guardrails or not. Like, do the simplest thing first. It's much simpler to have a prompt than have a guardrail.00:58:06

See if, you know.00:58:16

you need the guardrail. If it's a really sensitive thing where you want to be absolutely sure that you don't do that, and it's, like, really, like.00:58:20

Keep me up at night, then yeah, have the guardrail.00:58:29

Steve

I kind of have to balance it, but I wouldn't say don't have guardrails just for the sake of guardrails.00:58:31

Hamel Husain

Because we feel that it is… we are good AI engineers because we have our guardrails. There's no… don't check these boxes for the sake of it. It should be needs-driven.00:58:37

So, and you might have to remind me of the first question.00:58:49

Shreya Shankar

Answer the first question.00:58:53

Hamel Husain

Go ahead.00:58:55

Shreya Shankar

Oh, it was, like, basically, when do I use an LLM judge? My short answer is just wait for next week. We'll teach you guys how to, like, create some of your axial codes, turn them into LLM judge evaluators to be able to measure them.00:58:56

And, like, align them also with their preferences.00:59:12

To add to the guardrails comment, I think it's also interesting to think about the context behind how guardrails came out for these LLMs, right? Somebody released ChatGPT, the ChatGPT API, like GPT 3.5,00:59:15

And these frontier models were not retraining models fast enough, right? Think about the time lag in between when 3.5 came out and GPT-4. So, when people saw a bunch of errors from ChatGPT 3.5,00:59:31

They needed bolt-on solutions to fix those errors. So guardrails emerge as, like, okay, I'm gonna try to detect toxicity, or, like, things that I know immediately are bad today, but we just have not trained a new model to solve those problems.00:59:47

So now, right, that these foundation model providers are shipping faster, there are more foundation models out there.01:00:00

You don't always need a guardrail for everything.01:00:07

Steve

Okay, got it. Thanks.01:00:10

katenesmyelova

Actually, talking of the guardrails, again, we are also in the early stages of development. We do not have a valve yet, which are Panza scores, and I was using, kind of, guardrails to, kind of.01:00:15

Well, we are still developing the prompt, so guardrails are making sure that we are not, doing the wrong thing, and we are not clicking any information, but I am not sure now if that's the right approach.01:00:29

Hamel Husain

Yeah, I mean, it might not be the right approach. I don't want to tell you it's the wrong approach, because there's a lot of details about, you know, your product that I don't know.01:00:47

So it would be presumptuous of me to tell you that it's right or wrong, but I would just be skeptical. Being skeptical is always a good…01:00:55

sort of, mentality. At the end of the day, if you're using an off-the-shelf01:01:03

guardrail, which I think you might be.01:01:09

Be very skeptical of it, because if… what is that? That is probably someone else's prompt.01:01:12

Or something like this. So, just be very careful.01:01:19

If you want to… Kind of increase your skepticism?01:01:23

Just generally, I'll share… you want… yeah, I'll share this blog post about looking behind the scenes at some guardrails stuff.01:01:30

I'll put it in the chat.01:01:40

And you can… Yeah, it'll probably make you more skeptical.01:01:42

Shreya Shankar

Let's take this last question, Vignish, and then I'm really tired, I'm gonna go sleep.01:01:50

Vignesh Iyer

Sure, thank you so much, for just pulling me in there. With the, axial coding, process, the, error analysis and axial coding, so those first hundred, traces that we're kind of looking at, once we've gone through the process, like, once, when we talk about going through the process once.01:01:57

Is it that those 100 need to pass in at the first time, or is it…01:02:16

we're passing a batch of the 100, getting some of the, you know, errors, and getting some of the codes out, and then in the next batch, we're sending in a new set of traces, or we're sending in the same kind of traces. How does that process work?01:02:22

Shreya Shankar

So, good question. I would say if you've done a round of open coding, and you feel like towards the end.01:02:40

you learned things about your traces or your own preferences that might motivate you to go back and revisit your open codes, I would say go back and open code your data again. This is actually more common01:02:49

than you think, often because we only really know what we want until… we don't know what we want until we see it, so if you're feeling that way, you should…01:03:02

go back and do it. Now, as your application is more mature, you might be much more consistent in open coding, and then you're just going to want to open code more, or, like, entirely new traces, which is also totally fine.01:03:11

Vignesh Iyer

Got it, got it. But the initial kind of intention of how the course was kind of teaching it was that those first hundred traces, like, all at once, right? It's not like a batch of them. You want to complete 100 before you move to the next, level, sort of.01:03:26

Shreya Shankar

this, again, so everything that we say is, like, don't take it word for word like a Bible, like, it doesn't mean that, like, if you do 99, it's gonna be a useless process or anything. I think you should do as many as you need to feel like, okay, something is stagnating, like, I'm not finding any new failure modes, or, like, I already know 15 things I need to fix right now, like, everything is gonna be messed up because01:03:42

I just forgot to mention something in the prompt that's important. Like, go do all of those things, right? Like, there's no rule that you have to do 100. We just say 100, A, because people always ask us what's the minimum number, so you have to say something, and then B, sometimes, you know, when you're doing this for the first time.01:04:07

you don't know, right? You don't have confidence in yourself, of like, oh, am I doing it the right way, and so forth, so you want to do enough reps to feel like, oh, okay, like, I think I'm getting the hang of this.01:04:27

So maybe, hopefully, that's helpful context. Feel free to01:04:37

You can break early if you need. Sometimes people do more, like, in cases where most of the traces are actually good, I might go, like, to 150, and that sounds like a lot, but actually it's not if it takes me only 5 to 10 seconds to look at a trace and tell you if it's good or bad. Like, I can do this in 5 minutes, and you're gonna get to that point, too.01:04:42

So, yeah.01:05:02

Vignesh Iyer

Got it. Thank you so much.01:05:05

Hamel Husain

Yeah, I've done so much air analysis at this point that…01:05:09

I keep going until I find really ridiculous bad things that are going wrong, because I know that they're there.01:05:12

And I kind of have developed an intuition where… how to look. It just takes practice.01:05:20

Vignesh Iyer

Yeah, and then the aim then to find those ridiculous things, and like someone asked earlier, probably to see if it's so ridiculous that you need to fix your system prompt, or kind of just let it go, and then continue the process to see if it refines into a category that's01:05:27

Having a lot more in it, versus if it's too ridiculous, you just gotta stop and go fix it, right?01:05:46

Those are kind of your two paths that you could take.01:05:51

Shreya Shankar

Yeah, there's always a hierarchy in priorities that would save for these errors. Like, some of them are just, like, you can't even believe it's there, and then others are like, okay…01:05:57

It's not ideal, but in the grand scheme of things, I cannot be fixing everything all the time. I only have finite resources, so… Right.01:06:07

Vignesh Iyer

Makes sense, thanks.01:06:17

Hamel Husain

Alright, well, it sounds good. Thank you, everyone, for coming to this office hours. It was really nice to have everyone. Thanks for coming so late.01:06:21

And are early, or on time.01:06:28

srp

Thank you.01:06:32

Hamel Husain

And, yeah, we'll see you in the next one.01:06:33

katenesmyelova

Thank you so much. See you, bye.01:06:36

Hamel Husain

Thank you.01:06:38

Live session where instructors will address questions. Instructors may present answers to common questions, followed by live Q&A

[

Home

](/parlance-labs/evals/2025-3/home)[

Community

](/parlance-labs/evals/2025-3)