OCT 28 Optional: Live coding workshop HW 3 w/ Isaac Flath TUE 10/283:30 AM—5:00 AM (GMT+5:30) OPTIONAL Recording
Notes
Recording
Optional: Live coding workshop HW 3 w/ Isaac Flath
Oct 28, 20253:30 AM - 5:00 AM GMT+5:30
Audio Transcript
Chat Messages
Isaac Flath
Blue.00:01:01
Hamel Husain
Hey, how's it going?00:01:03
Isaac Flath
Yeah, somehow I had a different link for the thing, so I'm glad you told me you were starting.00:01:04
Hamel Husain
I also panicked, I was like, oh, maybe I'm in the wrong meeting. It's like, let me check.00:01:10
Isaac Flath
Well, I was on my calendar, and I joined from there, and it said I was waiting, and then when you said you were starting, I was like, oh, well, maybe I should go in… log into Maven.00:01:15
Hamel Husain
That's what I did too, I logged into Maven.00:01:23
Isaac Flath
Okay.00:01:26
Hamel Husain
Just to be clear.00:01:26
That's interesting, I can see something on your screen now.00:01:38
Oh, there it is, okay. Oh, it's your arm.00:01:42
Isaac Flath
Yeah.00:01:46
Hamel Husain
Makes sense.00:01:47
Isaac Flath
I have to have it set up just right, otherwise stuff gets in the frame.00:01:52
You can see the edge of the green screen pretty nicely with the background.00:02:00
Hamel Husain
Oh, nice.00:02:04
Isaac Flath
It's kinda cool.00:02:06
Hamel Husain
Alright, we can wait a few minutes, and then kick it off.00:02:18
Or, wait, wait a minute, even.00:02:21
Don't have to wait too long.00:02:25
You want to talk a little bit about what you covered last time, and sort of how you might plan on going about it this time?00:02:32
Just to refresh people's memory.00:02:38
Isaac Flath
Yeah, so last time we really talked about,00:02:40
Creating a prompt, starting simple, how to create some example queries of users, how to look at them, why they might be,00:02:45
You know, good or bad, unrealistic or realistic? And to really think about it from the product standpoint.00:02:56
And then we created a small annotation app.00:03:03
to… to look at them, and did a little bit of oven coding. We didn't do a ton, where we looked at the,00:03:07
the trace, so we took notes on all the things we liked or didn't like. Some of them were…00:03:14
You know, things that were just wrong, and some of them were just things that we didn't quite like.00:03:20
And so, shared a video, and that's in the, the, the…00:03:25
homework, repo as well, where Hamill and I…00:03:30
Did open coding for, I think, like, 20 minutes, so you can see us go through lots of, lots of traces or so. Yeah, so then this time.00:03:35
We'll kind of be diving into the next homework assignment, Homework 3.00:03:43
Hamel Husain
Alright, exciting.00:03:48
Okay, yeah, let's go ahead and kick it off.00:03:50
If… you ready?00:03:52
Isaac Flath
Yeah.00:03:56
Okay.00:03:57
Alright, so here's homework 3.00:04:03
We can look through the homework and get an idea of what the homework is happening, what we're doing in the homework. So, a lot of this00:04:08
Optional stuff is a little bit like overlap from the previous, but in a little bit of a different way. So, the full learning, start from scratch and implement the content pipeline, and so the idea here is that you're gonna generate recipe bot traces and then label your own data.00:04:16
So we did that a little bit by creating some, user queries and things,00:04:33
Last week, and we showed a little bit of open coding. And here is where…00:04:38
you know, I'd recommend doing several of these with… by actually going through the app and the UI and the UX, and not necessarily automating immediately. And so, we're gonna show that. We can see the next step is,00:04:46
Once we get and label the data.00:05:03
We're gonna split into, train, dev, and test sets, and we'll talk about kind of why that is.00:05:05
To get this, and then we're gonna be really going into LLM as a judge, creating the prompt, validating it, measuring it on news traces, and doing a little bit of analysis on that as well.00:05:11
So, that's kind of the whole, thing of it. If anyone has questions, feel free to do, like, the hand-raising thing, or unmute, and…00:05:26
Just ask at any time. That is not a problem.00:05:33
So, the first thing we've got here is we can look inside the data directory. There's this dietary.csv. This is, like, a starting point.00:05:38
this is just a specific, we made tuples last time of different dimensions, where you might want to evaluate queries off of, or to generate, and one of those dimensions might be something you want to measure how well it's doing against. So in this case, this is for dietary restriction.00:05:50
And so, what you see here is you have a user query, something they might enter and write in.00:06:09
And then you have…00:06:15
a dietary restriction. And the idea is, we want to make a judge, we want to be able to know00:06:17
When we enter this query, and we get a response.00:06:23
Is there… does it actually adhere to the vegan dietary restriction or not?00:06:30
For this query, does the response adhere to00:06:36
gluten-free. How good is our chatbot adhering to this dietary restriction? And you might have lots of different types of things you have. In this case, we're zooming in kind of on this one problem, because we did a bunch of open coding, and when we did this open coding.00:06:41
You know, like we talked about last week, we identified, like, hey, this dietary restrictions are top priority.00:06:56
Or whatever that is.00:07:02
So, as you can imagine, there's a few things we need to do to accomplish this. First, we need the actual responses.00:07:04
There's no way we can tell if it's compliant unless we see the responses, so we need to do that.00:07:11
We need… To then, do some analysis and,00:07:18
labeling to know, did it… what was the real answer or not. We need to create a prompt, so let's jump into it. The first step is you have this dataset of queries and dietary restrictions.00:07:26
Maybe you generated it manually, maybe you had an LLM help with it,00:07:40
You know, ideally, a lot of these are from real user queries, you may or may not have those, though. But the first thing you want to do is…00:07:46
Like, really read through them, and make sure they all make sense.00:07:54
For example.00:07:59
raw vegan salad that doesn't taste like grass, okay? Someone wants… someone wants that. You know, is this…00:08:02
you might say is… I don't know, is raw vegan and vegan different? I don't actually know. Ideally, if you're making this, you've got either you or someone you know as a domain expert. This is something that you might want to look up and say.00:08:13
Should this actually be a different category? I guess we can just ask.00:08:25
Is vegan different?00:08:30
Then raw vegan… Dietary-wise.00:08:33
Lucinda Linde
Yeah.00:08:39
Isaac Flath
And so, if there's anything in here where you're like, I don't really understand it,00:08:40
That's something that you want to be able to understand, because if it's coming up this early, it's probably going to come up a lot.00:08:44
Jon Pedley
In number two, for example, you've got gluten-free breakfast, and a second restriction, which is they don't want eggs.00:08:53
So… It presumably breaks things if you're trying to have multiple restrictions at the end.00:09:00
Isaac Flath
Yeah, so you could try to have… so this… you might have the same query in a different one. You could duplicate it and say, like, no eggs if you want, as a second dietary restriction to, like, see, does it comply both? That might be the easiest way.00:09:08
But that would also be an interesting finding. Like, if you went through, and you did this, and you're doing analysis, and you find, like.00:09:22
Huh, it seems like all the ones that got wrong, they had two restrictions instead of just one.00:09:29
Jon Pedley
Hmm.00:09:33
Isaac Flath
That would be interesting, because then you'd have to say, like, okay, well, do our users actually have two restrictions? Because if they do often have two restrictions, then…00:09:35
You know, if it doesn't do it very well, then you gotta figure out how to fix it, right?00:09:45
So, okay, what vegan means, what raw vegan adds. Okay, so it is,00:09:54
Okay, yeah, so it does,00:10:04
you know, add a restriction, so it's different, so that's good. Learned a little bit. And go through and read them all. I'm not gonna read them all here, but here, here's one.00:10:06
You could say, I want something light, but filling.00:10:17
Is that reasonable enough for low-carb? Maybe?00:10:21
Comfort food that won't make me feel guilty?00:10:26
I don't know how it would get vegetarian from this, that vegetarian's restriction. So, I might say, this doesn't seem like a realistic query. Like, I wouldn't expect…00:10:29
Like, how would a chatbot, any chatbot, get this right? How would any person know that… I mean, does comfort food mean…00:10:39
It must be vegetarian, I don't… I don't think so.00:10:46
And so as you generate these,00:10:49
You do this, and you might say…00:10:55
what do we want in this… in this condition? I'm flexitarian, leaning vegan. Do you…00:10:59
Do you… do you want it… does it… does it have… if it doesn't adhere to vegan, does it,00:11:06
Does the… does it fail?00:11:11
I mean, they've… the prompt kind of indicates that they're not too strict about it.00:11:13
So is it a true failure if it doesn't adhere to vegan completely?00:11:18
I don't know. That's kind of a product question.00:11:23
You know, again, here, I would say, like, gluten light recipe. I'm not celiac, just sensitive.00:11:33
Maybe we want gluten-free, because we want to be careful with that. Maybe a little bit of gluten is okay. Like, where do we want to lean for these? And so some of these aren't always…00:11:38
Always black and white. I eat pretty clean most of the time.00:11:50
That's Whole30.00:11:58
I don't know, does… is… if someone says I eat pretty clean.00:12:00
Like, what does that mean? Does that mean sugar-free? Does that mean Whole30? Does that mean vegetarian? Like, I feel like that can mean something different for lots of people.00:12:04
And so, I might wonder, is this a realistic query? Is this… is this something that…00:12:14
I could expect to get right.00:12:21
Or what is the… what is the wrong answer here? You know, if… is there… is there no right answer that could adhere to Whole30?00:12:24
So something else.00:12:30
So, actually going through and reading all these.00:12:33
And really thinking about them, not just saying, like, oh, it looks good.00:12:37
samuel varghese
The primary reason it's 3D…00:12:43
Isaac Flath
pretty important. You can find quite a lot, and you can think of quite a lot of edge cases.00:12:44
samuel varghese
So that's actually…00:12:51
Hey, Isaac, I have a question.00:12:54
Isaac Flath
Yep.00:12:56
samuel varghese
Yeah, this is Sam, by the way.00:12:58
Isaac Flath
Bye.00:13:00
samuel varghese
Like, for that, for that… hi, for that example you just provided with the, the 30 thing, like, why wouldn't the right answer not be anything but a follow-up question?00:13:00
You'd have to… For clarification.00:13:13
Isaac Flath
Yeah, I mean, you'd have to decide whether, does your product, do you want it just to give an answer, or do you want to follow up? Because if you want a follow-up query, I think that's definitely reasonable.00:13:16
samuel varghese
So, there…00:13:26
Isaac Flath
Yeah, I mean…00:13:27
samuel varghese
Gotcha. I think a follow-up.00:13:28
Isaac Flath
That should be totally reasonable.00:13:30
samuel varghese
And if you've got it.00:13:31
Isaac Flath
Yeah, and if your product doesn't support it, you know, a follow-up question right now, because it's just here in a response, you might have to say, like, oh, well, now do I need a chatbot, or…00:13:34
samuel varghese
Do you want that flow? How do you want to design that? Right, right.00:13:45
Isaac Flath
Do you want to just give a recipe that's largely healthy?00:13:49
samuel varghese
You're gonna say that.00:13:53
Isaac Flath
And have it say, here's a largely… or, like, here's a,00:13:54
Where was this one that I was doing?00:14:01
samuel varghese
56, I think.00:14:04
Isaac Flath
Oh, here.00:14:05
samuel varghese
And of course.00:14:05
Isaac Flath
Yeah, I eat pretty clean most of the time. Like…00:14:06
samuel varghese
Right, exactly.00:14:09
Isaac Flath
Should it… I eat pretty clean most of the time? Should it say.00:14:10
samuel varghese
And…00:14:13
Isaac Flath
Would you want it to respond and say, like, here is a,00:14:14
a recipe that's generally accepted as healthy, feel free to make a new query if.00:14:20
samuel varghese
Like, this.00:14:26
Isaac Flath
This is a generally healthy recipe. Do you want it to say, like, I don't know exactly what pretty clean means, or do you want it to say, I chose to do low-carb, light meal as a pretty clean meal? Here it is. Like, should it just clarify the decision it made, or should it ask?00:14:28
samuel varghese
Oh, right, like, make an assumption and then ask for clarification if they want something more detailed or something.00:14:46
Yeah.00:14:54
Isaac Flath
tell them I made this assumption, or this is what I did, feel free to do a new query, is another option. Because I eat pretty clean most of the time.00:14:54
It could be that it's like, well, they need to be more specific and tell me exactly what they want, or it could mean that the person's just like, I eat pretty healthy, and if you're like, well, what dietary… like, what do you mean, diet-wise? It's like, I don't know, I just eat good. Like, I don't know, maybe this is indicative of someone who's…00:15:05
You know, just… Feels like they eat pretty healthy and don't really know a whole lot.00:15:24
Because it's a…00:15:29
samuel varghese
And they see me.00:15:30
Isaac Flath
It's a thing, they just want to feel healthy. I don't know.00:15:31
samuel varghese
Right, right.00:15:33
You know, whereas here, I would kind of expect this person to…00:15:36
Isaac Flath
Be able to answer some follow-up questions, or more likely.00:15:43
So…00:15:50
Yeah, but these are all, like, things you want to think about. And these are things that, like, it's really helpful not just to look at the queries, but then, you know, go through,00:15:54
The chatbot for.00:16:03
Hamel Husain
I think Kate has a question.00:16:06
katenesmyelova
Yeah, I have a question. I was just… when I was, working on my kind of set of examples, I was ceding intentionally some,00:16:08
Error… user error failures, like.00:16:19
very vegan ones, sorry, very vague ones, or very ambiguous ones, and also, I think, to have some contradictory ones as well. For example, I want, a dish with oysters, but I'm allergic to seafood.00:16:22
Or something, something very unusual. I want a Japanese dessert and, with oysters in it, so…00:16:38
The point is, is it… does it make sense to generate these kind of weird, outliers to see how the,00:16:48
LLM response?00:16:56
Hamel Husain
Yeah, I think there's… yeah, go ahead. Oh, go ahead. I mean, well, yeah, I think it makes sense, like…00:16:59
You can if you want. Ultimately, you have to try to have good hypotheses about how a person will use your product.00:17:04
Like, if it's a real-life scenario, you know, try to do that, try to have… come with a hypothesis about how your product might break under real constraints.00:17:11
There's no right or wrong answer in terms of, like, what synthetic data is good or bad, but just…00:17:21
You know, you'll get better over time in terms of, as you get real users, you'll get more information about what00:17:27
Are more promising.00:17:33
triggers for you to test. So, there's no… there's nothing that's…00:17:35
wrong. I mean, it's totally fine for you to do it, just…00:17:40
You know, the whole idea is, like, be hypothesis-driven.00:17:43
katenesmyelova
Yeah, exactly, because that was my hypothesis. The hypothesis is I want to see how the chatbot reacts to certain things that00:17:49
They are not correct.00:18:00
Hamel Husain
Sure, yeah.00:18:02
Isaac Flath
Yeah, I think that's… I think that's great, and I think it kind of comes down to, like, things like, like we were talking here, this is unclear.00:18:03
And so, like, what do I want the chatbot to respond with? It could be a question, it could be maybe I say Whole30 is what I… what pretty clean means. It could mean… it should say, please provide a more clear query.00:18:12
There's a lot of things, so… so for this, saying, if it… if it doesn't adhere to Whole30, is this a failure? Or should the chatbot respond in a different way?00:18:25
So yeah, I think this seems like a reasonable query that a user would ask, so I would say we should probably evaluate it.00:18:36
But if it made… Yeah, so I guess, I guess what I'm saying is,00:18:44
If it's not a realistic, or… no, if it's not a good query, because users enter bad queries.00:18:54
Just think through, like, well, what should your product respond with? Should it just say, no, that doesn't make sense? Should it follow up? Should it…00:19:00
Like, what, what, what's… What's the good user experience you want in those cases?00:19:09
Okay, so… Cool, so we looked through these, we cleaned those out, so, the,00:19:19
Next thing we want to do is we want to generate some traces.00:19:32
So, if you've never done this before, the first thing, you know, I would recommend is just use the product.00:19:35
And so… Like, there's 50 or 60 of these.00:19:44
Like, how long would it take me to go through and do all of them?00:19:49
Like, 60's a lot, but, what, probably… Maybe an hour?00:19:54
if I just use, like, Excel or something?00:20:01
You can automate later.00:20:04
But at least the first, like, 10, I would say, for sure. Maybe even more.00:20:07
just kind of go through and use your product, because it's not just about evaluating the queries, you also want to find, like, weird bugs in your product, or UX things, or like…00:20:16
hey, yeah, I wish there was, like, oh, I could put a question asking.00:20:26
And so, like… A good way to get started is, you just copy in your first one.00:20:30
And we can make these traces, you know, we can say, like, in Excel, query… response.00:20:39
And sure, this is a little bit, you know, annoying to do, but you could build your dataset just…00:20:51
You know, read it, think about it, you know, annotate it if you want.00:21:00
And… let's refresh, get a new session.00:21:04
And maybe you see, like, hey, your refreshing takes too long, maybe you find things that aren't formatted right, you know, maybe sometimes it returns with Markdown and doesn't format it well.00:21:13
And so you can start to see, like, what is my user actually seeing?00:21:23
And not just… You know, not just raw data.00:21:29
And so you can fairly quickly, and I know this formatting isn't the best here in Excel, but you can fairly quickly do things kind of the manual way, and I think you should at first.00:21:33
So we can… Clean this up a little bit.00:21:48
Hamel Husain
So, like, yeah, you have to put it in, like, yeah, you have to put it in the…00:21:51
Isaac Flath
And so, like, I don't know, I guess what I'm saying is, like, if you can't start.00:22:01
you don't necessarily need to wait for, like, engineering resources, or be like, well, I have to, like, build a script. You can, and I think eventually you should automate this so that you can get lots of data and see how your thing is going.00:22:06
iterate.00:22:21
Hamel Husain
This app, this particular app is simple enough, because it's kind of a single turn thing, in a way, most of the time, so you could probably put in a spreadsheet. It becomes, like, you know, complicated if you have, like, lots of turns, lots of tool calls, rag, and stuff like that, then yeah, then you'll need to build a viewer of some kind.00:22:21
Which is something we'll go over this week in the lessons, by the way.00:22:37
Isaac Flath
Yeah.00:22:42
Hamel Husain
So, Kate, you have a question?00:22:43
katenesmyelova
Yeah, sorry, I have a question, which is kind of,00:22:45
Related to this, but more for homework too, when you are categorizing things into, into different categories when you are looking at the open code, is it possible that the same response can fall under multiple categories?00:22:49
Hamel Husain
I can, yeah. Yep.00:23:06
I try to keep it simple and try to find the best category for it, but if you want to deal with multiple categories, you can do that. We try to tell you to choose one, just to keep your life simple, but if you want to do other things, you can.00:23:09
No problem.00:23:22
samuel varghese
Samuel? Hey, Hummel, your comment about, like, building your own viewer, like…00:23:24
I mean, I can write code. It seems like, a lot of work for… I don't know if the costs…00:23:31
Outweighs the, like, is the juice worth the squeeze kind of thing, like…00:23:38
Hamel Husain
Yeah, it is worth the squeeze. So we have a lesson dedicated to this, of, like, why you should…00:23:42
vibe code your own. And then in, like, future sessions, Isaac will show you vibe coding your own.00:23:48
As well, live. But we also recorded doing so live, also.00:23:54
samuel varghese
I saw the recordings, those products looked pretty good, but, like, you're saying, still, it's better just to build your own.00:23:59
Hamel Husain
A lot of times, yeah. I mean, you can do it in the Trace Viewer, you can use Langsmith, Arise, Braintrust, if you want.00:24:05
But a lot of times, you'll want to do your own, because, like, you know, you want to format it in a specific way. If you're happy with, like, Langchain, Brain Trust, Arise, then use it. No worries. But a lot of times, you will feel limited by it.00:24:12
samuel varghese
Oh, interesting. So, like, when you do your projects, like, you do one… you build your own viewer all the time?00:24:27
Hamel Husain
Yeah. Usually takes less than a day.00:24:32
And has Matt.00:24:35
samuel varghese
Got it.00:24:36
Hamel Husain
off.00:24:36
samuel varghese
Okay.00:24:37
Isaac Flath
Yeah, I mean, you know, the last, session we did last week, we made a simple viewer in, what was it, like, 2 or 3 minutes?00:24:41
And that wasn't very sophisticated, but it got things there, and then, you know, you can iterate from there, so…00:24:49
especially if you're… if you keep it simple, and you're like, hey, if I just… I have an export to CSV button, and…00:24:56
I can do it based off a CSV, you can usually make it in minutes if you're trying to…00:25:03
Integrate with something. It doesn't take… Doesn't take very long, either.00:25:08
samuel varghese
I just like that one feature where, once you create the prompt, and you're like.00:25:13
Input it for all your queries, like, you just push a button, and it cranks out the inputs and outputs.00:25:17
That kind of got me anyways, sorry.00:25:26
Hamel Husain
Yeah, no worries.00:25:28
katenesmyelova
But there is a tool already in the… in the GitHub, it's Annotations Py, isn't it?00:25:29
Isaac Flath
Yeah, I think so. Is that one I made? Let me check, look.00:25:37
katenesmyelova
Because I'm running it, and I can, I can definitely use it.00:25:41
Yep, that's the one.00:25:46
Isaac Flath
Yeah, I think I probably made this.00:25:50
I don't remember, but probably. Yes, exactly. And this didn't take very, very long to build at all.00:25:51
And so… Okay, yeah.00:25:58
Yeah, so… As you can see, it's a hundred and…00:26:01
30… 140 lines of code. It's not, it's not super long, it's not super complicated, but, yeah, it can help. I think I made this… I think there's a video of this being made, so we can,00:26:06
Share that for sure.00:26:19
Hamel Husain
Anuba, do you have a question?00:26:21
Anubha Saxena
Yeah, about the multi-turn, like, a conversational chatbot.00:26:25
Do we annotate every piece of conversation, like, every response, or do we annotate the entire conversation in that case?00:26:31
Hamel Husain
Usually start with the entire conversation,00:26:42
And so you wanna… yeah, you wanna annotate the entire conversation first.00:26:46
And you wanna stop at the…00:26:51
First error you see, the most upstream error you see.00:26:54
So I don't know if you recall that from… I believe it's week 2?00:26:58
If you've gotten that far?00:27:03
We talk about that quite a bit.00:27:05
Anubha Saxena
Yeah, I remember that. So one thing is open coding, and00:27:09
in that phase, I probably… I'm just looking for patterns, right? Just looking at the responses and what things…00:27:16
like, how the… how the LLM is responding, right?00:27:24
In that phase, I'm looking at the entire conversation on a whole, and not just pieces, and sort of, like, labeling every request response.00:27:29
Is that right? So, at that point, I'm not really actively looking for failure points, I'm just, like, trying to analyze what the, how the conversation is going.00:27:39
Does that make sense?00:27:51
Hamel Husain
I would say that's not… I would say you're looking for… you are looking for failures.00:27:53
It's not like… it's not…00:27:59
Completely open journaling, like, oh, this, this, conversation makes me feel warm inside. No.00:28:02
It's… this conversation failed at this specific… I mean, this… this is not right because of, like, very… something very specific.00:28:10
So, you know, when you see a specific failure, when you do your axial coding.00:28:17
And you say, okay, there's a specific thing you want to fix, now you can do some…00:28:26
more root cause analysis and say, like, why is this error happening? In that case, you'll… you can, you know, think about, okay, where is it failing, usually, in these traces that you…00:28:32
That you, identified.00:28:46
With the axial coats.00:28:49
Does that help?00:28:51
Anubha Saxena
It helps. I'm just wondering… so, how I've been doing, I've been just putting my notes down, in a way, so even if the conversation is right.00:28:53
I would just put on the sort of, like, good things I found in that conversation, where everything went right, or a few things went right, so that I'm also deriving sort of, like, use cases which are working. Is that…00:29:06
The right thing to do.00:29:21
Not just looking for failure modes.00:29:23
Hamel Husain
You can write down what's right.00:29:27
I would spend more time On what is wrong?00:29:29
Sometimes, like, writing down what's right helps you Clarify what is wrong?00:29:33
And so, it's a good thing to do, but you're gonna get more value out of what is wrong, ultimately.00:29:42
And so… Yeah, I would really focus on that.00:29:48
And once you go all the way through it, you'll realize, like, hey, the whole point of doing this exercise is to find errors, like, find specific errors.00:29:52
And so, you want to say, okay, like, if you do your axial code and, you know, you find, like, you're always having retrieval errors, then, like, okay, we need to talk about, you know, focus on retrieval. Or, if you're finding that there's a specific issue with a specific00:30:02
feature of your chatbot, let's say it's, like, scheduling or something like that, then, like.00:30:18
maybe it's the scheduling tool, I don't know. Well, then you have to, like, look at those traces and say, okay, like, what is failing in this set of traces? It doesn't mean they have to go through and annotate every single piece, it just means, like.00:30:24
okay, what, you know, can you do a little bit of root cause analysis? And usually, if you're reading the trace carefully, you already know what is wrong.00:30:38
Like, when you look at an axial code, you already should intuitively know why00:30:47
this failure mode is happening. If you're doing error analysis correctly, if you're really looking at that many traces, you'll know.00:30:54
Anubha Saxena
Hmm.00:31:01
Makes sense. Thank you.00:31:04
Hamel Husain
Yeah, no problem.00:31:06
Isaac Flath
Alright, so once you have that,00:31:12
You know, you get these, responses, you have all these traces, you want to label them.00:31:15
So…00:31:23
we have this, like, CSV file, or maybe it's in… if it's in Phoenix, and it works well in there, you could do that, or Brain Trust, or whatever.00:31:28
You could label it in the CSV, however, this is not a very,00:31:36
Ergonomic way. And so, this is where you'd want to00:31:42
Create some sort of annotation app.00:31:47
Whoever brought up that there is this annotation app here that I'd totally forgotten about, and there was a way to… to trace.00:31:49
some example traces in your app, so that when you use your app, it saves traces to JSON. If you have something like a CSV export that some have, you can go in here and you can build it off of that, or you can build the full integration.00:31:59
We did build that open coding one last week. Should we… should we make a quick annotation app, on this thing, or should we kind of get to more of the…00:32:15
analysis stuff.00:32:25
What do you think, Hamel?00:32:27
Hamel Husain
Probably analysis, what do you think? Yeah, I think analysis, maybe.00:32:29
Isaac Flath
Cool, so you'd label these, same idea as, open coding, axial coding, make sure you have a domain expert, that's doing this.00:32:34
So the idea is this is your ground truth, so that when you do something, you have it, and you eventually end up with label traces. So this is the same data, but now you have an actual label, and not just the query.00:32:44
So you have label, reasoning.00:32:59
Which is the reason that the LLM gave, what kind of confidence level that it has. And so you have these extra fields, and so the question is.00:33:01
What do you do with this? And how do you… how do you evaluate00:33:10
From here. The first thing you want to do is you want to split this into a train dev test set.00:33:18
So… The idea here is that if you put Examples into your prompt.00:33:25
into your system prompt. And then you evaluate to see if that example is right. That's cheating, you know? If you…00:33:33
look at the data too much, you can bias it. You don't want to train an LLM as a judge to get really good at a specific set. And so what you do is you split your data into thirds.00:33:44
You have, kind of, one set of data that you can do anything to.00:33:55
You have another set that helps you evaluate if you did it right. You did not build your system00:33:59
Based on this held-out data.00:34:06
But you, you did it. So let's… let's split it, and once we split it, maybe it'll make more sense. There's a couple of ways to do this.00:34:10
the ways you can do this are, like, you could do it in Excel, I'll just show that super quickly, and then we can show how to do this, programmatically as well.00:34:23
So this is… all I did is I imported it into Excel, I get this dataset.00:34:36
Now, in order to split it into these sets, You can…00:34:42
Choose 30% of them, or 10% of them for the dev set, maybe 40% for the train set, and 50%.00:34:48
So, the easiest way to do this is just to highlight the first ones and say that's dev, and go on. Unfortunately, that may not work, because,00:34:57
you might have generated them in a specific order. And so what you can do is you can create some kind of random number.00:35:07
Once you've done this, you can split it into sets.00:35:19
This is largely what we would do programmatically. And so we can say, if this is00:35:23
Below 0.1, it's a number between 0 and 1, so if it's in the lowest 10th percentile.00:35:29
We would say… Let's put this in our dev set, or our train set.00:35:37
Otherwise… If it's in…00:35:45
The next, I don't know, the next 40%? We'll say that's our dev set.00:35:55
And all the rest will go into our test set.00:36:04
Hamel Husain
And if anyone needs a reminder of what we're talking about here, this is section 5.4 of the course reader. We talk about data splits.00:36:07
for LMS Judge.00:36:16
So if you wanted something to reference after this, you can go back to it, you can… Cross-reference it.00:36:18
Isaac Flath
Crap.00:36:25
So once we have this split, you know, why did we bother?00:36:28
So, once we have the split, we have… A set of…00:36:32
We have a set of queries we can do anything with, a set of queries that we can validate with, and a set of queries to validate what we think would help against, and then a final test set that is kind of our last protection against releasing something to production that we think will improve things, but actually doesn't.00:36:42
And so…00:37:01
What you might do is you might look and say, I'm gonna look at this train set, kosher dessert for Passover, and based on these, I might… there's a lot of things I might do. I might put them in my system prompt. We don't really train, so to speak, DLM unless we're fine-tuning.00:37:03
But we might… Put these into the…00:37:21
system prompt. And we wouldn't want to evaluate against them, it's there.00:37:26
Now, again, these are things you want to, look at a little bit carefully. You don't want to look too much of the.00:37:32
Hamel Husain
Can we zoom in just a little bit, Isaac?00:37:39
Isaac Flath
It's like, one thing I'm noticing about this set.00:37:44
Hamel Husain
I think you need to zoom in by this,00:37:46
This thing down here, maybe, I don't know.00:37:48
Isaac Flath
This thing… Oh, that's a little too much. Okay, here we go.00:37:51
Is that better?00:37:57
Hamel Husain
Yeah.00:37:58
Isaac Flath
So what I'm noticing here is we have duplicates, and so that's something we want to…00:38:00
Decide if we wanted to get rid of.00:38:06
And so, once we… once we use the training stuff, we can put that in our system prompt as our examples if we'd like, or variants of them, or we might find two or three that have, like, the right idea and make kind of a mixture to give an example for our system prompt. We'll look at prompts in a second.00:38:10
Our dev set is we say, like, okay, I think I made the prompt better. Let me just do a quick check. We can… we can… we can look. We can compare on our dev set. We know… we already labeled the data with our experts, and so we can say, like, did this help or not help? And we can analyze those results.00:38:26
And maybe that means we say, like, oh, well, I actually didn't help, let's revert that and try a different idea. And you can kind of go back and forth over and over again.00:38:44
to try and get something that works well. And the idea is that…00:38:54
You're not going to be able to put every example or account for every type of user query in your system prompt.00:38:58
And so what you want is you want a system where I created it.00:39:07
By looking at one set of data.00:39:12
But, I want to know that it performs really well when looking at completely different data.00:39:16
Because if I look at data and I make a system and it only works on the data that I made the system for, that does not generalize to other things. And so that's the idea. We look at one set.00:39:23
We create a system that we think will work well, not just for that set, but for other similar data, and then we evaluate against it.00:39:32
And we go back and forth iterating.00:39:40
Finally, when we say, okay.00:39:44
we have something that works really well, let's deploy it to production, we have a final test, and that is, we look at yet another holdout set, where we say, like, okay, here's one we've looked at even less.00:39:47
that we're just gonna do a final check to say, are we really sure? We're gonna take entirely new data set, make sure we didn't end up…00:40:01
overfitting, they call it, or p-hacking, if it's in statistics, or, you know, look-ahead bias, if you're talking about time series, it's all kind of the same thing. Make sure that we have this other set that was completely set aside, and yes, it does, in fact, work well on data that we haven't seen.00:40:09
And that's the purpose of the test set, is, I guess, a more strict version of that.00:40:25
Are there any questions on the split? Because this is, you know, pretty important, and I don't think it's done right a lot.00:40:32
Anyone have any questions, comments?00:40:41
katenesmyelova
My only question would be, the rule of thumb about this percentage, just like 10, 40, 50, or…00:40:45
Kind of war would you suggest?00:40:55
Isaac Flath
Yeah, I would say, yeah, rule of thumb. Now, the bigger your test set and your dev sets are,00:40:58
you know, the more, the more you can rely on them, right? Like, obviously, if your test set is two queries, then you're like, I don't trust it. So if you don't trust it, because you're like, you know, I don't think it has… it covers the full range of what our users are asking, then you want to increase it.00:41:06
And there's two ways you can increase it. You can make the percentage bigger by taking away from one of the other sets.00:41:22
Or you can label more data. Labeling more data often solves, like, every problem. So yeah, 1040, 50, but I would say it's more about, like,00:41:30
more about… more about the quantity. So, for example, we can pretend, like…00:41:41
Maybe sugar-free is really important to our app, because lots of people are asking about sugar-free.00:41:48
And we look in our dev set, or a test set, or a train set, and we notice that maybe in our test set, there's only one example of a sugar-free dessert.00:41:54
A request for a sugar-free dessert, and we know that, I don't know.00:42:05
40%, maybe most of our market are people looking to, like, do baking for desserts, and we noticed that, like, 40% of the queries are sugar-free dessert.00:42:11
variants. I might say, like.00:42:22
well, I don't care how big the overall dataset is, it could be 90%. If I've only got one query about sugar-free desserts when, like, 40% of my user queries are about sugar-free desserts, like, that's a problem. So it's not just about the, like, the raw percentage or the raw number. You want it to be…00:42:25
Kind of representative and, like, represent what your users,00:42:43
will be asking. And some of those might be…00:42:48
You know, bad queries as well.00:42:51
Does that make sense? Or did I just muddy the waters?00:42:57
katenesmyelova
No, no, it makes total sense, and actually, this is what I was also implying from reading the book, that actually you need to modify your data set to cater to actual user queries, and see how they… this is why it's really useful.00:43:00
To look at actual user traces and see what they ask the most.00:43:18
Isaac Flath
Yeah, and, you know, we look here, and then you might say.00:43:22
Well, let's look at the split. So maybe we'll, we can do that here.00:43:27
Whether you use Excel or Pandas or whatever, we might say.00:43:33
Make a pivot table and do something like, let's look at dietary restrictions…00:43:38
Let's look at what set it's in…00:43:46
We could just, you know, do a count.00:43:49
And we can see a little bit of, like.00:43:53
alright, what does this look like? Like, this looks pretty… it looks… seems like, okay, gluten-free, paleo, low-carb, vegan, those are… those are big ones.00:43:56
Okay, the top category overall is vegetarian. Interestingly, it's not as represented in the test set. It's still got 4 queries, maybe that's enough, but, like.00:44:06
train, sorry, I had train invalid, I think, but,00:44:16
I use machine learning terminology instead of eval's terminology. But same idea, like, maybe I say, like, oh, hey, this just doesn't seem representative, and I see, like, oh.00:44:22
like, this… this is… maybe this is a big category, and I'm like, why is… why do we not have enough queries here? So looking at this is… is there. And this starts to get into a little bit of…00:44:31
partially, do I just have enough queries? Because if you do a random split and you have enough, it'll all come out, it'll all work out. But you might also then say, if it's too expensive to get more data.00:44:41
Look at, stratified splitting.00:44:52
And what stratified splitting is, is a way of saying, I want.00:44:55
samuel varghese
I don't know whether.00:45:00
Isaac Flath
Maybe a 20-40-40 split?00:45:01
But I want it in a way that…00:45:06
the dietary restrictions are evenly matched. If 50% of all recipes are vegan, I want 50% of the training, 50% of the dev, 50% of the,00:45:08
test set to be vegan, so they're evenly distributed through those. And that's called stratified splitting, which is a little bit different than just this random split.00:45:23
samuel varghese
So, Isaac, back to your example with the sugar-free.00:45:33
Isaac Flath
Yup.00:45:38
samuel varghese
We used to… the number still showed, like, two queries for tests, or dev… or dev, I think. And, so what you're saying is, like, if sugar-free is important to you.00:45:39
create more queries for sugar-free, right? Is that the… is that the solution?00:45:50
Isaac Flath
Yeah, yeah, I mean, more data, solves pretty much every problem, often. Like, if you just look at more data and label more data, that helps your evaluations a lot. There's, like, tricks where you can generate them synthetically, and…00:45:55
scale them out, it's never gonna be as good as, like, hand-crafting a query, of course. But, yeah, if you have sugar-free as a huge representation of your00:46:10
user queries, and you're not even close to representing that in your evaluation sets, you'd want to get more queries.00:46:20
samuel varghese
Got it, thanks.00:46:29
Balki Nakshatrala
I have a basic question.00:46:30
Isaac Flath
Go.00:46:33
Balki Nakshatrala
I understand in the traditional machine learning, the concept of training data set and why you have to do these splits.00:46:34
Are you referring to here training as in these are being provided as few-shot examples, or…00:46:40
Hamel Husain
Yes, we're talking about fuse shot examples, so this is, this is terminology borrowed from machine learning, but, you don't need the same, percentages from machine learning, because we're not actually training a model.00:46:48
Balki Nakshatrala
Yeah, exactly, that's what I was asking. You're referring to only the ones you include in FuseShot, not to even look at them to create the prompt. Like, my question is, in terms of creating the00:47:00
like, what…00:47:12
Hamel Husain
No, it's from the same. So you look at them to create the prompt, and the few-shot examples. So it's like, that's what training means, is, like, you train the model by prompting, whatever prompting is. So few-shot… few-shot examples are part of prompting. That's… those are part of the prompt.00:47:13
And then prompt… and then, you know, your instructions are part of the prompt, too. So either way, they're part of the prompt, and that is drawn from the training set.00:47:30
Balki Nakshatrala
Got it. Thank you.00:47:38
Hamel Husain
Yep.00:47:40
Kavita Sunku
Hey, I have a quick question here. A slight continuation to what Balki already asked. So, let's say we have this training set, and you derive a system prompt.00:47:41
That would satisfy these conditions. After the first iteration, you find that, in your test set, there were some scenarios that were not covered.00:47:54
When you run the second iteration of this, those, prompts that were… not prompts, but those traces that were part of the test set, do they now become part of the train set, or do we have to remove that from the second time we test, because that's already been covered?00:48:05
In your prompt.00:48:24
Hamel Husain
Yeah. So, your test set is solely meant to make sure… is a backstop to make sure that you have not overfit to the training set, or you have not overfit to the dev set.00:48:26
Okay. If you're… if you…00:48:39
find that you, measure your LLM as a judge against a test set, and those metrics are very different.00:48:41
It might mean you've overfit, it might mean your data is not…00:48:50
evenly split well enough. It might mean, like, something is wrong. Your test set has, like, very different queries than, like, your, train set.00:48:54
or your dev set. So something is fundamentally wrong, and you need to start over again, and you need to think, okay, how can I get representative data in my train set? And you need to reshuffle the data, and you need to measure, and you need to, like, start the process again.00:49:05
If… and you need to… if you already have things, in no situation should you really be putting data from… you shouldn't just pluck your test set data and put it in your train set and say, oh, I'm gonna put those examples00:49:21
From the train… from the test set in my prompt, because that would just be overfitting.00:49:36
And you don't want to do that.00:49:41
You know, so you have to… sort of, the test set is, like.00:49:44
samuel varghese
Come on.00:49:48
Hamel Husain
Academically, you're only supposed to look at it once. Even if you look at it more than once, then you start to overfit, but you don't have to go that far. That's, like, a little being overzealous.00:49:48
samuel varghese
into one of Halloween.00:49:58
Hamel Husain
So it's just meant to protect you.00:49:59
samuel varghese
the room.00:50:01
Hamel Husain
So, the idea is, like, you will know… you should know from the train.00:50:01
samuel varghese
Thanks a lot.00:50:06
Hamel Husain
Versus dev set difference, whether or not you are messed up. You have messed up.00:50:06
samuel varghese
this transition.00:50:11
Hamel Husain
But the test set is just, like, another insurance policy, because you may not, you know, just like to force you to look at it again.00:50:12
samuel varghese
In case.00:50:19
Hamel Husain
you overfit to your dev set. I hope that helps.00:50:20
Kavita Sunku
Okay. Yeah, it does, but…00:50:24
just a slight nuance to this is, again, to understand it a little better. Now, let's say the dev set is our second step of validation, right? Once we have trained the system prompt well enough, now we are going to run the dev set and see how it performs. At this point, we find that it00:50:27
there is something that is missing in our system prompt. Is that the right time to modify our system prompt and then try the entire iteration again, or should we just run through all the dev00:50:46
Test, and then go back to a few new set of traces.00:51:01
Hamel Husain
You can change your system prompt, I wouldn't use examples directly from the dev set in your prompt. And by the way, when you say system prompt, it doesn't necessarily have to be a system prompt, we should just say prompt.00:51:07
samuel varghese
No, shut up.00:51:17
Hamel Husain
Not system prompt, because that… Does it mean something specific? So, yeah, just don't use examples.00:51:18
samuel varghese
Fair enough.00:51:25
Hamel Husain
directly from the dev set.00:51:26
There's gonna be some leakage, like, if you look at the dev set and you let that inform your prompt.00:51:28
You are overfitting a little bit, but that's okay. The cost of not overfitting at all is infinite.00:51:35
Kavita Sunku
So, you can…00:51:43
Hamel Husain
can't be perfect. I mean, we're just trying to… so…00:51:44
Kavita Sunku
Okay.00:51:46
Hamel Husain
And that's what the test set helps you do, is like, if you looked at it all day, like, if you really overfit on your dev set, then you'll notice a difference between.00:51:47
samuel varghese
The dev and test set, but…00:51:55
Hamel Husain
Yeah, it's totally fine to, like.00:51:57
samuel varghese
You know, stuff.00:51:59
Hamel Husain
figure out…00:52:00
samuel varghese
But…00:52:01
Hamel Husain
How you might change your prompt.00:52:02
Kavita Sunku
There you go.00:52:03
Hamel Husain
And you might find an analogous example from your train set, potentially.00:52:03
But don't, don't use the dev set.00:52:07
samuel varghese
Examples. And you want to try and think of… I'm sorry, go ahead.00:52:11
Sounds good.00:52:20
Isaac Flath
Like, yeah, so in the same, like, you don't want to use the examples exactly, and you want to see as much as you can to find, like, like, a common thread of.00:52:20
samuel varghese
Like, that causes, like.00:52:30
Isaac Flath
lots of issues to keep it as general as you can. You do need to get specific.00:52:31
samuel varghese
What a moment.00:52:35
Isaac Flath
you know, you don't want to be like, hey, you know, if they mention vegetables, broccoli, and anything over 350 degrees with garlic, then I'm gonna put, like, a… you know, like, at some point, it gets too specific, where you're very clearly going to be overfitting to specific examples that you're looking at.00:52:36
So, as much as possible.00:52:51
You do need some examples to explain some things that maybe are hard to put into words, or to give just a few examples, but as much as possible, if you can find, like.00:52:53
You know, rather than…00:53:03
specific things? Like, is there any… any common threads that you can think of, and then how can you express it, rather than…00:53:05
Ending up with… A ton of very, very, very…00:53:13
Minute tweaks, based on what you saw on the dev set.00:53:18
Now, you do need some specifics, like, it's not a hard rule, but there's a little bit of a smell if you end up with, like, 100 lines of, you know, really minute, hey, when this combination of four things happen, then do this.00:53:23
Might mean you're starting to overfit a little bit.00:53:36
Kavita Sunku
Okay. Alright, thank you.00:53:40
Isaac Flath
Sorry, go ahead, whoever was talking.00:53:42
samuel varghese
Sorry, Isaac, this is Sam. So, like, say you find some generalizations and you want to, like, tweak the prompt. Like, can you use the same data again for running the test, or do you need to, like, generate new data? That's… that was my question. Like, iteratively, like, how do you…00:53:45
Can you keep leveraging the same data?00:54:03
Isaac Flath
Yeah, I mean, if you're saying, like, what is the absolute perfect world if I had unlimited resources? Like, sure, let's make a new train test, or new test and dev set every single time we look at it at all. It's not really realistic. The dev and the test are made because00:54:08
It's… it's not realistic to even…00:54:26
you know, recycle one dataset, so the test is, like, closer to a gold standard, and the dev we're gonna look at a bunch. You know, there is a point where you might need to create a new data set, because you feel like it's just…00:54:30
you've gone too far. But it's not every single time, because…00:54:43
Creating this, if you remember, isn't as simple as, like, oh, well, let me just ask for some more synthetic queries, and then let me ask for some synthetic… let's run it through, and then let's use LLM as a judge to label it, and done. Like, you can't fully automate that, because this is supposed to be your source of truth.00:54:51
You know, and00:55:09
If you fully automate what your system can do, then all you're evaluating against is, like, how well your system is today.00:55:11
And if you do something that's correct, but your today's system says it's not, then you're going to be penalized for that. And so, creating these dev and test sets takes some time, because you not just need to generate the synthetic queries, you need to make sure the synthetic queries, you know, actually, you know, make sense.00:55:18
Or at least, are representative of what a user might ask.00:55:36
And you need to make sure that, you know, you're fairly confident in the labels. Some of that is, you know, LLM as a judge, but some of that is, you know, spot checking, or looking at low-confidence ones, or maybe even labeling a lot of them.00:55:41
with a domain expert. So, it would be great to regenerate one every single time you look at it, but you should be iterating quite a lot, and you don't want to, like, recreate a dev set00:55:54
6 times a week. It's just not realistic.00:56:06
Hamel Husain
to know, like, what the purpose of this is. So, what we were trying to do is make sure your AI generalizes to unseen data.00:56:09
Okay.00:56:18
So, how do you know if your AI generalizes to unseen data? Is to give it data that you have never seen.00:56:19
and see, measure what that is. Now, what is that? That's the test set. You have never seen it, you haven't looked at it, you haven't, you haven't used that to inform anything about your prompt, anything about, you know, like, your few-shot examples, your prompt, anything, right?00:56:29
And that… the idea is, like, in real life.00:56:45
Users are gonna ask you things that you haven't seen either. Because, you know, like, you want it to be… you want it to generalize.00:56:49
And so, that's the whole idea, is like, you don't want to make sure that you are generalizing,00:56:57
You know, you just have to… you can't be perfect, but that's, like, this whole exercise of train, dev, test, set is meant to help you simulate that.00:57:04
Just keep that in mind.00:57:13
samuel varghese
Have a lovely day.00:57:14
Okay, just one last question. So, Hummel, so, like, if, like, say we go live today, and, like, in a month, you run, you do some new test data, and if it hits the threshold that you're looking for.00:57:16
then you just move on, right? Otherwise…00:57:30
Then you'll start trying to, like, maybe decipher the prompt, or tweak it, or whatever, right?00:57:34
Is that… is that usually the way you go about these systems?00:57:40
Hamel Husain
Yeah, I mean, so if you… if you find that, like, hey, you have iterated on your…00:57:43
you know, so this is the judge, okay? So, if you're in on your judge, now you trust the judge, and then you can use the judge to00:57:48
Score, like, your real systems.00:57:58
And you can use it to figure out, like, if you…00:58:00
are doing well enough. In reality, like.00:58:04
Yeah, I mean, it's not like you…00:58:08
do it once and you move on, if you're satisfied. You probably… like, your system's gonna change.00:58:10
So it's like an iterative process. Like, you kind of keep cycling through. You might keep an LLM judge around for a bit, you might…00:58:15
Even if it's giving a really good score, you might…00:58:23
You know, score your live production data every so often to see if00:58:26
See what's happening, and see if there's any drift occurring.00:58:31
But yeah.00:58:35
The idea is we're creating an evaluator, so…00:58:37
samuel varghese
You know, you want to use…00:58:39
Isaac Flath
Alright, cool. So, so we've got this split.00:58:49
The next step is to start working on a prompt.00:58:54
there is a example prompt in the repo, this judge prompt here, and so you can take a look at that. My advice is, like, start with something reasonable, start fairly quickly, don't necessarily try and…00:58:59
Perfectly cover every single case right off the bat. But, you know, if you wanted to adhere to dietary restrictions, you know, you might want a light definition of each one.00:59:14
And some simple evaluation criteria.00:59:25
Eventually, over time, You know, you iterate, you look at it, you could even start with,00:59:28
running it through your product, you can start with, like, running it through ChatGPT on a few things, just to get, like, a little iteration. I think, the, eval frameworks have… have these frameworks. And… and cover…00:59:36
kind of the core things that would be needed. So, what are the… what are the definitions it needs to know? Is it doing a pass-fail?00:59:49
What, kind of at the bottom, usually, at the end, the final instructions in terms of, like, what format it should, it should be.00:59:57
You know, these are… Where it's gonna be entered.01:00:07
give it an output format, whether it's JSON or…01:00:11
or whatever. And then over time, you can put in, or you can immediately put in, examples of ones that are either good or bad, based on… I like to do that based on what, I'm seeing as failure modes, rather than immediately off the bat.01:00:14
So,01:00:35
what I tend to do is I write a simple prompt, I start with that, I'll go to something like a cloud or a chat GPT, and I'll just paste it in and see, like, okay, it seems like it works okay, and then I will move into trying to automate after I've done a few of them.01:00:37
A few of them, through a cloud interface, moved to automate those to run them against the whole dev set.01:00:55
Cool.01:01:05
Alright.01:01:12
So, from there,01:01:14
there is, what ends up coming out is this judge performance and these, test predictions. And so you get data every single prediction. Did it pass? Did it fail?01:01:18
what was the query? What was the response? You get all this data. Again, you can, load this into a notebook. Maybe we can do that. You can also load this into,01:01:33
into Excel and do it there.01:01:46
Whichever you'd prefer.01:01:47
So let's go ahead and give it a shot with, with a notebook.01:01:51
Load, test predictions…01:01:56
Into a notebook.01:02:02
And give me…01:02:05
Hamel Husain
I was just saying, don't be shy, you can use the Whisper flow if you want.01:02:16
Isaac Flath
Oh, that's right. I've been using Mac Whisper recently.01:02:20
Hamel Husain
Yeah.01:02:23
Isaac Flath
I switched again. I know I switch every week, but…01:02:24
I don't know why I do that. Whenever I'm sharing screens, I read what I'm typing instead of just…01:02:31
Transcribing.01:02:36
Hamel Husain
I think it's, like…01:02:41
You feel like you're in public or something? So, usually when you're in public, it's like, maybe you don't like to talk out loud?01:02:42
Isaac Flath
Oh yeah, that's true. So while this is running, while it's creating that, I can show everybody, how you might, how you can load this01:02:50
into Excel, if that's something that you're familiar with.01:02:59
So you can get power data, you can load a JSON here, You browse, find your file.01:03:02
From here, you get this, like, weird-looking thing. This is a list. You might need to convert it. You can convert it to a table.01:03:10
And this allows you to expand out, this button now allows you to expand out to JSON. I know it's like…01:03:20
the JSON to a table. I know that's, like, kind of a lot of steps to, like, remember, but that Power Query lets you load it and get this into a table in Excel, where you can calculate everything, right? If the true level and the predicted label are the… are the same.01:03:27
then the query performed. The LLM as a judge was correct. If they don't match.01:03:43
it wasn't correct, and so you can use that to calculate your accuracy. You have all of this, so you can calculate your, you know, true positive rate, which is, like, your true positive, over, actual positives, and you can calculate all your metrics here.01:03:50
If you'd like, you can do it by dietary restriction and all that. So that's,01:04:08
I quite like Excel, so that's a good way to do it. You can also do it with a notebook.01:04:15
Let's look at the notebook.01:04:30
Okay, while the AI is still running, we can see that there is also, this Homework 3 walkthrough, which, walks through everything we just did programmatically, and last cohort's video, doing the same thing, walked through, this notebook, and so you can see how,01:04:34
How you might split the data, some information here, how you might, do a stratified split, split them programmatically, so if you're interested in seeing that, and seeing more information here, done a different way.01:04:55
Feel free to see that, and it kind of covers some of the analytics, which we'll look at here in a minute.01:05:10
Alright, so it does not find the file, so let's go ahead and see why not.01:05:25
Just looking in the wrong directory.01:05:41
Hamel Husain
This is the notebook that was just created by Copilot, right?01:05:45
Isaac Flath
Yes, exactly. It, yeah.01:05:48
Okay, so we've got this test predictions JSON, that's the one that I just opened up in Excel,01:05:53
It looks at it here, we can see it looked at the shape, so at 46 rows, 8 columns. We can see those here, we can see a little bit of a sample of this.01:05:59
And here's where you want to look through and just see what it's doing. I don't think you need to code all of this from scratch. You could definitely do it, you know, all this stuff in Excel for this case.01:06:11
But you do want to have a general idea of what's going on. And I think with a little bit of practice, it's very overwhelming at first if you've never coded, but with a little bit of practice, you can start to read them. And so you can see, okay, we've got true and predicted labels.01:06:25
It's getting some sort of missing labels.01:06:40
Okay, we're dropping the rows with any missing labels, so that's…01:06:45
That's fine, we might want to look to see,01:06:53
Later, if there really are missing labels, like, what's going on there.01:06:57
We can see it's doing some sort of strip, it's uppercasing everything.01:07:04
I think this is a little bit overkill. This is, what you would expect from AI to, like, really do some overkill things, but it's also not harmful to make everything uppercase, just to make sure that you don't have a lowercase pass and an uppercase pass that don't match.01:07:08
Hamel Husain
I'll say something about… so I think a data analysis is pretty underrated when it comes to…01:07:23
AI coding, AI engineering, vibe coding, whatever you want to call it. It's a spectrum, but I think, like.01:07:29
Yeah, especially with notebooks, you can… just, like, analyzing data is something people don't talk about too much, but AI is really good at making plots, charts, things like that, like the kinds of things that Isaac is showing.01:07:37
And in my experience, it tends to do a pretty good job. You can always, like… one way is to check if the outputs make sense. So, like, if the aggregated numbers make sense.01:07:51
You can cross-reference it against another way of doing things, just to, like, spot check.01:08:03
But, you know, generally speaking, I would encourage you to try it.01:08:08
Isaac Flath
Yeah, absolutely. And you can kind of look and say, like, okay.01:08:16
It's doing some sort of weird thing with my data frame.01:08:19
it's looking for unknown labels, okay, so maybe I say, like, here, and if you get good at it, let's see if this works, I can say,01:08:24
Add comments to explain what's happening.01:08:33
And why?01:08:38
I'm a beginner coder, or something.01:08:40
And if this doesn't work, you can kind of go over to the chat interface.01:08:44
And you can start to see, okay, really, kind of, what is it doing?01:08:50
Okay, so we're finding rows that are not in the label list.01:08:54
Cool.01:08:58
Ensures both columns only contain pass, fail, or unknown. Okay, great. If it doesn't contain one of those, then we can't evaluate it.01:09:00
Okay, and so then we can see the pass-fail.01:09:09
You know, in unknown counts. So this is great. So you can kind of use it to understand, and you don't need to necessarily…01:09:12
Like, the more you can code, and the more of the syntax you know, and the more hands-on you get, obviously the better.01:09:19
the more product knowledge you know, obviously the better. But a lot of this is, like, you don't necessarily need… I didn't need to know to type the squiggly.01:09:25
Okay, so it's creating a confusion matrix for us, and so we can see…01:09:33
There's pass, pass, there's 31 that, we thought would pass and actually passed.01:09:40
You know, there's some of these unknowns, there's, so, for the most part, this stood fairly well, you know?01:09:50
Hamel Husain
So just to give people more clarity here,01:09:56
So this… oh, wait, let me just… redo this one second.01:09:58
So, in this confusion matrix, let me see if I can draw on it. Oh gosh.01:10:02
Okay, let's try.01:10:08
It's kind of hard sometimes. So the diagonal is where it got things right, and the off-diagonal is where things are wrong. So that's, like, one thing to know.01:10:10
And then, you know, these are sort of what is predicted on this axis, and then this axis is…01:10:19
Oh wait, maybe I'm… Wait, which axis is what? Maybe you should tell us.01:10:26
Isaac Flath
Let's see, so I, I think, I think the…01:10:32
Compute confusion matrix. Okay, so let's.01:10:38
Hamel Husain
So it looks like it's true label, then predicted label, okay.01:10:41
Isaac Flath
Come with this to let me… Now, if rows… No. Weather.01:10:45
Rows slash columns.01:10:52
Hamel Husain
Looks like columns is predicted, so…01:10:56
Isaac Flath
as actual…01:10:58
Hamel Husain
So, so, like, this, this, okay, yeah. Looks like this axis is… is actual…01:11:00
And this, and this, columns is predicted.01:11:06
Sorry for drawing on it.01:11:10
Isaac Flath
No, it's good, okay.01:11:12
Yeah, you got it right, according to this. So, yeah, so,01:11:13
A lot of this, though, you can kind of… Use AI to help understand.01:11:18
Hamel Husain
That's, like, really encouraging, I mean, even you used AI, so it's not like you should use it. Like, it's good. Like, don't be shy.01:11:24
Isaac Flath
Yeah, well, and then once you see it, you see what it says, then you say, like, okay, let's see if I can understand a little bit about this, and just learn a little bit as you go. I think…01:11:31
I think where people often go wrong with AI is they use it to generate something, and then they never look at it.01:11:41
But if you generate something, and then you look at it a bit, you don't necessarily need to…01:11:48
you'll understand every little thing every time, but then you can use AI to learn a little bit more and more, and it won't take long before you can understand quite a bit.01:11:53
Hamel Husain
What are the unknown… what does unknown mean in this case? Do you, like.01:12:03
Isaac Flath
Yeah, so, in here, some of the predicted labels just said unknown. For whatever reason, the LLM as a judge said unknown.01:12:07
Hamel Husain
You're right, okay.01:12:14
Isaac Flath
And that's, you know, something that we might want to look into, actually, because, you know, like, when I see this unknown, my curiosity is like…01:12:16
Did it, like… maybe those are good unknowns. Like, maybe those are the queries where it was, like.01:12:23
It was impossible to answer, and it just said unknown, because…01:12:33
it was, like, a really unusual user query, or, like, maybe I would look at it, and I'd be like, I don't know if this pass surveils, you know?01:12:38
So maybe there's something really weird about those, and we want to figure out01:12:46
you know, how to handle them better, maybe that's a follow-up question, maybe we need to add better user instructions.01:12:52
Hamel Husain
down to make sure it doesn't ever say unknown. It can only… is only allowed to say pass or fail. But you bring up a good point. You don't want to just do that. You want to look into it and say, okay, like, maybe there's something interesting there. Before we clamp it down.01:13:02
maybe we should check it out. So that's… yeah, that's always good.01:13:16
Isaac Flath
Yeah, I think anytime you see something coming out of your, AI system, or your product in general, and you're like, I don't understand what happened at all, like, you probably want to at least look to see, like, maybe I should understand it. Like,01:13:21
Pretty much all.01:13:37
Hamel Husain
That's basically what this class is about, in general. Yeah.01:13:37
Isaac Flath
That's the… that's the cliff notes. Okay, so yeah, this visualizes it for you, and to, instead of figuring out whether these pass fails, maybe we should have scrolled down and just, looked at the charts it made for us. But you could see it as both counts and percentages.01:13:42
I do recommend looking at counts and percentages. I think, I've seen a lot, actually, where people just jump straight to percentages.01:13:59
And they see something like,01:14:06
They see something, like, really shocking, that they're like, wow, this did really bad, or like, hey, this did really well, and you're like, okay, but…01:14:13
you had a sample size of 2. You know, 50% accuracy doesn't tell you a whole lot. And so, looking at generally, like, what's my volume look like?01:14:21
what's my percentages look like in both cases. In this case, it's normalized, but, is very helpful.01:14:31
Alright, so we move on to some of these summary metrics, and you see…01:14:40
Hamel Husain
Thank you, Ryan.01:14:44
Isaac Flath
quite a bit.01:14:44
Hamel Husain
Isaac, that this is the best notebook co-pilot that I think I've seen. I think it's better than Cursor.01:14:45
Isaac Flath
Yeah, you know, I tried it, when it first came out, and I hated it, and… but recently, it's… it's not been bad, like, this is pretty good.01:14:52
Hamel Husain
Yeah, it's better. So, yeah.01:15:01
Isaac Flath
So here, cool, we have the accuracy calculations, and we might say,01:15:05
we might say this, like, maybe I don't know what TP is and TN is.01:15:15
So we can ask, and we can maybe just go in here and copy and paste, and say, like.01:15:21
Hamel Husain
Do you have to copy and paste, or can you, when you select it, can you hit Command-L or whatever, and can it…01:15:28
Isaac Flath
Oh, yeah, I think you can do, you can do Command-Shift-I to bring it over, and that brings it.01:15:33
Hamel Husain
Oh, gotcha.01:15:38
Isaac Flath
SL7 5 through… lines 5 through 8.01:15:39
Hamel Husain
I'm pretty sure Cursor doesn't do this.01:15:43
I have to check, but that's pretty good. It, like, tells you the cell and everything.01:15:46
Isaac Flath
Yeah, I actually don't use… I used to use Cursor a lot. I… I kind of stopped, because I felt like VS Code just caught up.01:15:50
I'd rather use… But, I don't know.01:15:56
katenesmyelova
In fact, one of the videos that you did show in the, I think it was for Homework 2, did show you using cursor with a notebook.01:16:00
So, yeah.01:16:11
Isaac Flath
Last set?01:16:12
katenesmyelova
Really impressive, too. No, no, no, no, it was in the recording, that was added to the, to the GitHub.01:16:13
Hamel Husain
Yeah, that was from last time we recorded it, so…01:16:21
It… yeah, we keep changing the tools.01:16:26
samuel varghese
I think, Isaac… I talk to Isaac every day, and we keep…01:16:29
Hamel Husain
Trying new things.01:16:33
Isaac Flath
Yeah. I'm so sick of trying new tools, until the next day comes, I'm ready to try a new tool.01:16:35
Hamel Husain
Like, we just started trying… we just started using AMP heavily, but that's, like, a different rabbit hole. Probably don't have.01:16:40
Isaac Flath
Oh, yeah.01:16:45
Hamel Husain
Into.01:16:45
Isaac Flath
Yeah, cool, so I can say, like, can you explain these formulas,01:16:46
Some more, not sure I understand.01:16:55
Like… Why do I care about precision?01:16:58
And so, you can start to kind of dive in. You can also, you know, look up stuff yourself,01:17:05
But yeah, I mean, I think… I think the problem is if you go and you say, like, oh, I don't know what any of that means, I'm gonna ignore it,01:17:12
Like, yeah, just kind of be curious, and AI can help you learn that way.01:17:20
Quite a bit.01:17:24
So this is computing while it's thinking. Computing… maybe I should not be using GPT-5, slow.01:17:26
Alright, we got more information than we probably…01:17:35
Why should I care about this? Okay, so true positives, predicted pass, a true leaf passed, FP, predicted pass, but truth wasn't pass.01:17:38
Okay, so that's, like, a pretty good understanding of, like, what these are. Precision… okay. Precision,01:17:48
everything we predicted past, how many were actually passed? Of all the true pass examples, how many… like… so this is, like, really helpful because,01:17:58
Like, if you don't remember, like, what's precision versus recall, but you kind of remember the idea, like, you can very quickly get this, you can…01:18:08
If this is too much, you can kind of ask it for something more concise,01:18:16
But yeah, I mean, I guess just…01:18:21
I don't want to, like, overdo it, but yeah, just kind of be curious and learn about these things as they come up.01:18:24
Cool, here we see, for the… each category.01:18:32
Precision Recall F1 score support. I'm a little bit confused. Pass, fail, unknown.01:18:38
Accuracy, okay, so it's putting, like, the general things below here, that seems a little confusing to me, so, maybe we would…01:18:43
See you.01:18:53
Classes first, and then averages.01:18:56
See, I say putting all this in the same… Table is confusing me.01:19:01
And you show the same… information, but… more understandable.01:19:09
Clear.01:19:17
Yeah, and just kind of, like, play around with it,01:19:20
See what it does here. Ask it to change and look at some different things.01:19:24
This is great.01:19:29
it just, like, it decided to put in, like, what are the misclassifications? Like, what are the failures? And it just grabbed…01:19:32
It looks like I just grabbed the failures, the 6 failures, and printed them for you.01:19:39
And so then I might, you can say, like, okay, is there any, any examples here? Okay, there's…01:19:44
There's two vegan, I don't know, and interestingly, both the vegan ones failed because they were unknown. I don't know, is that,01:19:50
Yeah, I mean, when there's 6,01:19:58
It's like, there's not really any excuse to not look at every single query.01:20:01
You know?01:20:05
It looks like there's this DF errors.01:20:06
And so, we can say…01:20:13
Learn a little bit of Python here. DFERRORS dot… maybe we want to look at the reasoning. Let's look at the response first.01:20:15
Oops.01:20:27
Skin.01:20:28
And we can do print here.01:20:32
Oh, this is one thing I do not… okay, there we go. Can we see the whole thing?01:20:38
Okay, so in this case, I might, figure out why I can't see the whole thing.01:20:47
Why can I not see the whole thing?01:20:54
And maybe it's truncating it or something, but also, I have the trace ID, I have everything I need, I could just go back to that CSV and look at it if, this is causing problems.01:20:58
But yeah, with, like, 6 failures.01:21:08
Like, you can look at every single one, if you'd like.01:21:12
Now, I guess if it's your dev set, you may not want to look too closely, but,01:21:18
You know, at least for the first iteration or so, it's very helpful to know in my opinion.01:21:23
Hamel Husain
Sam?01:21:32
Great question.01:21:33
samuel varghese
Yeah, hey guys, did you guys create this notebook, like, real time, or was this from the.01:21:35
Hamel Husain
Yeah, he just created it right now.01:21:40
Isaac Flath
This is… load a notebook with pandas and give me a confusion matrix. That was my prompt. The laziest prompt ever. So you can do a much better prompt and get a better notebook.01:21:42
samuel varghese
Oh my gosh, that's huge, okay.01:21:51
Isaac Flath
But yeah, it had a lot of good information in it, you know, and you can kind of keep prompting and keep digging in to understand,01:21:54
You can see there's some sort of truncation thing here, and so we might want to look at this.01:22:03
Is it truncating somewhere?01:22:07
So yeah, I would say that,01:22:14
And here, it's saving it for you into CSVs and images.01:22:17
Which is kind of nice, right? Like, maybe you want to say, like, hey,01:22:23
Put this in… a MVP directory.01:22:30
So I have… So I can compare when I improve things.01:22:36
kind of prompted. So, like, you can save some of these for future reference if you want fairly easily.01:22:41
Yeah, so I guess the point isn't that you should vibe code everything, it's just that,01:22:49
You can learn as you go with AI pretty easily.01:22:55
Well, easy's relative, but…01:22:58
Don't necessarily stop yourself because you feel like there has to be code, because you can get quite far.01:23:03
Okay.01:23:20
Homework 3, the results.01:23:23
Save it? MVP, great. I've got my PNG file of this confusion matrix, I've got all this saved. If I want to go back later, I can run it again, and I don't have to remember, like, well, what was the confusion matrix last time? I can… I can save it.01:23:30
If I'd like.01:23:47
Were there any last questions? We've got about 8 minutes left.01:23:52
Anything I should cover in the last 8 minutes, or just Q&A?01:23:58
Hamel Husain
Alright, doesn't look like anyone… there's any questions.01:24:17
Thanks so much, Isaac. This is great, as usual. Isaac teaches a class on01:24:20
how to code effectively with AI.01:24:27
And he's also a TA in this course, so that's why we have him do these sessions, because he is, like.01:24:30
you know, he spends all day long teaching people how to use AI in code, and he also is trying all the tools all the time.01:24:36
Highly recommend checking out his stuff.01:24:44
Isaac, can you tell people a little bit about how to find more about you, and…01:24:47
Isaac Flath
Yeah, so one thing I would recommend is, there's gonna be a really good post, I don't know exactly when it's gonna come out, but sometime in the next week.01:24:53
On this blog, which is gonna cover some of the basics of01:25:02
basics is relative, but it'll cover, like, the step of, like, brainstorming with AI, and then,01:25:09
creating, a plan with AI that's a good prompt, like, not what I just did, but, like, a good prompt, and then a to-do plan, which helps you go a lot further than what we're doing in these hours and a half sessions. And this was a guest lecture that a company called Spec Story did, who does, you know, really great stuff.01:25:15
They were clients of mine in the past. And so, we're just gonna publish it publicly on this Elite AI-assisted Coding, so I'd recommend going and following this. It's free, so,01:25:33
That's my free plug.01:25:45
Hamel Husain
So this newsletter is really cool. Actually, like, so, Isaac and Eleanor, Eleanor is the other TA you've probably seen in this course, Eleanor and Isaac teach, like, teach this course, and they have this newsletter where they actually try all the AI tools.01:25:46
and they kind of do a review of them, and they also try… they try a lot of different stuff with AI tools, like, if you feel overwhelmed with AI tools, and you're like, wow, I can't even keep up, especially the AI coding tools, there's so many of them, and you want to save time, this, like, newsletter is excellent, and you can, like.01:26:03
look at all the past ones. I use it myself to, like, save time, to be honest.01:26:21
Isaac Flath
Yeah, like, here's the tools,01:26:28
There's also guides, like, some of them simple, like, here's how you use dictation to, like, make it easier prompting, which is, like, it sounds…01:26:31
Everyone thinks it doesn't sound that important until they see someone, like, dictating, like, a long-form prompt that's, like, really helpful, because it… you can just get so much more information about the product and your intent out.01:26:40
So there's, like, guides for, you know, how I…01:26:52
built a MCP server with spec-first development, I didn't even touch any code, and, got something like a reasonable MVP. There's other ones where, you know, it's about, like.01:26:56
making presentations, how I make a presentation with AI. There's others about how I learn and how I…01:27:07
like, argue with the AI to, like, understand more and use it as a learning tool, or how I create blog posts, or promo videos, or, what's the difference between…01:27:13
CLI versus MCP, or, how I… here's a Gentic coding, a fast HTML RAG eval app. This was a previous one in this course. Products UV and AI agents, and how I get into details. So some of it's, like.01:27:24
Here's places where I think you can use, you know, AI quite liberally, and some of it's, you know, more in-depth. You can see we have quite a few, talks where we've hosted people, and we publish01:27:38
You know, all those.01:27:50
Hamel Husain
And the talks are great, because you don't even have to watch the video, you do these annotated posts, where you, like, you know, show, okay, what was said, and…01:27:51
Like, different timestamps and stuff, yeah. It's pretty cool.01:28:00
Isaac Flath
This is the last one by OpenAnds, which is an async agent, that's pretty cool. But then tips. These are, like, tips that we think are useful from…01:28:03
atomic GET commits to make… how do you make a library AI-friendly? We talked about that with,01:28:12
Danny, who's amazing,01:28:19
And FAQs, questions people have asked us, these are things that people have asked that we answered.01:28:21
So, there's a whole bunch of information here that's all free, so…01:28:27
If nothing else, I think it's worth a follow.01:28:30
So there's that.01:28:34
For the course…01:28:37
You can see it here. We just got done with the cohort. We had a lot of reviews. I'm starting to build out, like, my own little review site,01:28:41
Here… I'm just showing this off because it looks cool, but, you know… We got a lot of reviews, from people who are…01:28:53
you know, CEOs, technical consultants, software engineers, data scientists, like, of all levels of, educators.01:29:01
So, people of all levels, and that's actually one of the big things that we're, looking to do. So, this is our… this is our course. You can see there's a big discount for…01:29:11
Hamel and Treya students, so this is your secret promo code, so go ahead and use that if you choose to enroll.01:29:21
And yeah, probably the biggest thing that we're doing for Cohort 2 is we're, restructuring it so that it's,01:29:29
you know, available to people of all levels. We kind of…01:29:36
Did it as a fairly advanced, class, and that was, you know, great for some, and there's… we realize there's a larger…01:29:40
you know, range of skills that we can help, so we're kind of breaking it out into, figuring out how to structure it to… to serve everybody a bit better, so we're excited to do that.01:29:48
Hamel Husain
Alright, well, yeah, thanks a lot, Isaac. Really appreciate you doing this walkthrough, and yeah, we'll see you later this week for the next homework.01:29:59
Isaac Flath
Alright, yeah, I'll see ya Wednesday, I guess, a couple days from now.01:30:07
Hamel Husain
Yeah, thank you.01:30:11
katenesmyelova
Thanks, Steve.01:30:13
Cheers, bye.01:30:13
Get live hands-on guidance on applying the AI eval course concepts to the homework and get your questions answered. This is perfect for coders who want to learn more about better AI workflows, and for PMs interested in understanding the details and how AI can help you do AI Evals effectively. We'll use AI agents and models as a partner to speed up our work and deepen our understanding. You will learn to: Apply AI Evals concepts from the course to real datasets Use AI tools to explore and learn new concepts in a practical way Learn how to use AI effectively to complete tasks quicker This is a series for everyone, from developers to product managers. The only requirement is a desire to learn by doing.
[
Home
](/parlance-labs/evals/2025-3/home)[
Community
](/parlance-labs/evals/2025-3)