OCT 14 Optional: Live Office Hours 4 TUE 10/149:00 PM—10:00 PM (GMT+5:30) OPTIONAL Recording
Notes
Recording
Optional: Live Office Hours 4
Oct 14, 20259:00 PM - 10:00 PM GMT+5:30
Audio Transcript
Chat Messages
Hamel Husain
Did you hear me?00:00:58
Great.00:01:00
I'm here a minute early, but00:01:04
You can get started a minute early, if you want.00:01:06
So, anyone that has a question, we could do the same thing as before, is like, go ahead, raise your hand.00:01:09
And we can… Get started.00:01:15
If you have a question, don't be shy.00:01:22
SRP, how's it going?00:01:26
srp
Going good, thanks, Hamil and Shreya, for putting this course together.00:01:30
So, I have a question which is slightly broad-based. In a vertical AI application, one where the domain understanding is extremely high requirement.00:01:36
there would be RAG and other architectures that will be used along with LLL, right? So, when you have the pipeline which includes all of it, I'm sure you have come across it. How do you suggest you approach AI evals?00:01:50
Is there any difference? Is there any variation that you think we should do?00:02:04
Hamel Husain
Yeah. So…00:02:10
no matter what happens, you should always start with error analysis, and you should, kind of do error analysis in the way that we teach, which is, you know, start reading the trace from the beginning, and stop at the most upstream error you see, and note that for your error analysis. Now, later in the course, we will get into00:02:13
How do you debug RAG?00:02:32
And we'll have some analytical tools we show you how to debug agents, like, if you have a lot of handoffs between agents.00:02:35
But since, like, a lot of things… it's, like, very similar. You know, retrieval is a very specific thing.00:02:42
It kind of gets into debugging search.00:02:50
More so than debugging AI.00:02:53
And there's some… Analytical tools that we'll show you there.00:02:56
Let me, make Shreya co-host, one second. Oh, she is co-host.00:03:03
srp
Okay.00:03:08
Shreya Shankar
co-host.00:03:08
srp
All right, I haven't, I'm still catching up on the content, so I may have a little more nuanced question as we go along, but, thanks, thanks for the response.00:03:16
Hamel Husain
Yeah, no problem.00:03:27
srp
And one quick thing, Hamil, in the Delphi, there was one part where I noticed a nuance about, do you, look at the whole trace versus the output?00:03:30
So, some of those, questions. So, I presume from whatever responses I have seen otherwise in Discord, you always look at the entire trace. Inputs, all the call traces, and the final output.00:03:45
But only thing is, you just look at, even if it is a multi-turn system, look at the first sign of a problem, stop that, debug it, and then proceed.00:03:59
Hamel Husain
Yeah. So, debugging is an interesting word, like.00:04:10
So the… just to, like, zoom out a little bit, you want to try to simplify the process of error analysis where possible? So depending on your system.00:04:15
So, like, you don't want to get into too much debugging when you're doing error analysis, you just want to take note of the problem and save debugging for later.00:04:24
Because, like, debugging can take time. Like, why did this happen?00:04:31
you know, it's okay to, like, take some notes if you know. Sometimes it's obvious, it doesn't really take debugging, it's, like, just very obvious. But you don't want to do a lot of heavy root cause analysis at that step. You just want to kind of do the error analysis.00:04:37
And then, depending on the application, sometimes I, you know, If it's…00:04:52
If it's something where the user is expecting Like, a… answer?00:04:58
For a specific input?00:05:04
I'll look at the user input and the output.00:05:07
and see, does it look, like, reasonable? Like, can I tell if it's something went wrong or not? If I do think something went wrong, then I'll go and, like, start reading the trace from top to bottom.00:05:10
Some applications are not like that. Sometimes, like, applications are meant as, like, chat applications, or might be doing a lot of different things. In that case, like, it doesn't necessarily make sense to look at the output.00:05:20
Only, like, as a first pass, because the output might just be, like, you know, for one specific question, or one specific thing. In that case, like, yeah, definitely just… then I'll just start from the top and go to the bottom.00:05:31
That's a little bit more nuanced.00:05:44
srp
Got it.00:05:47
Thank you.00:05:49
Hamel Husain
Yeah.00:05:51
Robert.00:05:54
Robert Lavigne
audio here… sorry, audio needed to be on. So, one of the…00:05:57
multiple things that I'm kind of working on right now, and it kind of very much aligns with what we're working on here, is it's a multi-LLM backend, right? So, from a JSON list, I'll pull from, let's say, 25 different models, or 10, or 5, whatever's in the JSON, basically. And the idea that I'm having here is…00:06:02
an asynchronous passing of, let's say, 10 evals across all those models with the common backend, right? So.00:06:21
Hamel Husain
the backend vector, the back-end system prompt, all of that is common across the AE.00:06:28
Robert Lavigne
LLM options, but the idea would be a, let's say, a 10 eval pass across 10 models.00:06:33
To basically see a matrix of where the evals would fail at a model level, right? So, how does GPT-5 handle that prompt versus how does GPT-4-1, how does Clode, how does Gemini? That's the idea, right? I've got that then on the back end, and I'm able to then run mostly LLM,00:06:42
calls against that trace, so the idea being, give me an after, action report of that trace, looking for key parameters, so that part I've got covered.00:07:01
But what I've been really intrigued in this course so far is the idea more of the mathematical evals that we can run much faster, right? So one example would be, is the conversation complete? You know, count the number of user and assistants in the conversation to make sure that there's no breaks or stuff like that, right? What are maybe some of the00:07:10
lesser or more common ideas around some of these mathematical evals that I could run, either by leveraging maybe sentence transformers as an example.00:07:30
golden data, I know you've been talking a lot about, so, you know, running a sentence transformer against Golden Data across all of those models, if you know what I'm kind of getting at. It's the first time I'm verbalizing this, so if it kind of is too much, I apologize. You're literally the first person I've ever verbalized this to, so… Hello, everybody.00:07:41
So if that made sense, kind of, you know, your thoughts on that, basically.00:08:00
Shreya Shankar
I could take that question. So we actually get this question a lot. A lot of people, and I think it's there in a lot of blog posts, you know, embedding your traces, trying to see a similarity in embedding space to a gold trace.00:08:06
What I have found, and Hamil, I'm sure you feel similarly, is that that similarity score is not really interpretable.00:08:21
like, say you have a cosine similarity of, like, 0.46, okay, between your trace and the golden trace. What does that even mean, right? Does that mean there's a failure mode? Like, what… how do you know what to improve, right? You don't know what the improvement step is. You don't even know if, okay, like.00:08:29
is that… maybe your golden trace is, like, one of many golden traces. What if there's some other answer that could have been good? And I think that's, like, maybe it goes to a broader point of when we're doing application-centric evaluation, we don't necessarily always want to pin all our hopes and dreams on single references. Like, we want to do the error analysis process, we want to do open coding, axial coding, build evaluators around that, and these things00:08:46
These sentence similarity scores are sometimes a distraction, in my experience. Hamil, maybe you also have Are there takes?00:09:11
Hamel Husain
Yeah.00:09:19
Picked up on something… Sherry's answer is really good. I picked on… on, picked up on something that's in my mind that you said that is not a question you asked, but we've done this so many times that… and I've consulted with so many people that it kind of…00:09:20
there sometimes can be a signal in what you're saying that I, like, I wanna just double-click on, so…00:09:35
You mentioned, like, having a matrix of different models.00:09:42
And that you are… one of the dimensions that you are trying to hill climb on is, like.00:09:46
Doing evals against different models.00:09:51
A lot of students and a lot of people get into evals, that's their first00:09:55
Sort of instinct is let me… sort of…00:10:00
keep switching the model, or, like, optimize the model as the first thing they think of. I would almost say that's the last thing you should do.00:10:04
You should, you know… because a lot of times, like, when you think about the three gulfs.00:10:13
a lot of the errors that you might have, and from our experience, is like, you know, a gulf of specification. So if you're not crossing the gulf of specification, like, switching a model is almost like… is the gulf of generalization. Like, hey, the model's not good. And so a lot of people, they just…00:10:19
You know, have this idea that, like, okay.00:10:37
everything is just generalization, I already told the model what I want, but oftentimes it's not true, so I would just…00:10:40
Kind of make sure that you're not optimizing the wrong thing00:10:45
too fast. Like, really think hard about, okay, and like, when you do the error analysis, you can reflect, like, hey, is my prompt really good? Like, you know, do I have… am I giving it the right specification? If I'm… if I'm… am I providing good examples in that prompt? So on and so forth.00:10:50
Martin?00:11:15
Martin Siniawski
Hey, everyone. So great to see you. Last week, I was asking a question. I was saying that we had been doing the error analysis process, we had found some clear opportunities for improvement.00:11:18
Mostly from missing context, so I guess Gulf of specification. And I had questions about deployment, like how to, like, deploy safely, I guess. We ended up doing a combination of manual testing, a progressive rollout.00:11:29
Sampling traces from the new prompt, so we're in the process of that.00:11:42
But now I want to start creating some LLM judges.00:11:46
And I guess I'm having a hard time figuring out where to start.00:11:51
Because, I guess…00:11:54
I feel like many of our failure modes will probably disappear, or change quite a bit, now that we're specifying more context.00:11:57
So I'm kind of wondering, like, if I should maybe… it feels like the version of our prompt maybe was greener than I thought.00:12:05
And I'm trying to, like, we're trying to fix it and improve it based on error analysis, the findings.00:12:12
So I'm wondering, should we get into, like, doing LLM as judges right now? Or should we maybe do a few iterations of, like, actually trying to solve these problems, since there… it feels like there might be quite a bit of low-hanging fruit?00:12:17
And then maybe, once we have a more, maybe, established, more mature, maybe more subtle errors, we get into, like, the Adenos judges.00:12:28
Hamel Husain
Yeah, I mean, I think it's really good to iterate on error analysis a lot. Usually, when people do error analysis, they find lots of low-hanging fruits, and what we always say is, like, go fix those. You know, some of them are very obvious, and, you know, you may not want to necessarily do an eval for all of them, because some of them are just so trivial.00:12:39
Or something obvious, you know, is missing. And that's totally fine, you want to iterate a bit.00:12:58
And the fact that you found a lot of things is really good. That means, like, the process is working.00:13:04
Martin Siniawski
Yeah.00:13:09
Hamel Husain
It's totally normal. The whole process is not linear. Like, you're gonna jump around a lot, like, do error analysis, you might iterate on that a bit, you might do…00:13:11
like, write some evals, you might say, oh, like, I discovered something later on, I need to do error analysis again, because, like, I have these new hypotheses. Are I seeing something new in the data that just didn't, like, forgot about? .00:13:19
Martin Siniawski
Yeah.00:13:32
the issue that we have is that the fixes themselves maybe weren't super easy to implement in the sense that I just went in and tweaked a few things and I shipped it. It was more like, hey, okay.00:13:33
we're missing a lot of knowledge here on this domain. Our domain is, like, sleep science, let's say. So I ended up talking to our sleep scientist, we put together a document, we realized that, hey, this document is really long, so we had to figure out how to, like, bring it in without, like, adding it onto the prompt, so there were, like, multiple steps.00:13:44
we got… went through those, and we shipped it, and I think we're in a much better place. I think we might have a process like that for a few of these other things, so it will take some work to fix them.00:14:02
But I just, yeah, it just feels like everything is very much still in flux, and I wonder if it's worth, like, doing LLM as judges right now, or maybe… I don't know, we do them in a couple of weeks, when the things are more established, more mature.00:14:13
Shreya Shankar
It sounds like LLM Judge is not a good fit for right now, for, like, all the reasons that you're saying. Like, you're still trying to bring in extra context, like, solve problems that you, like, didn't even have an approach to solve in the first place. Like, you're talking to domain experts to do this. I think LLM as a judge is a good…00:14:24
thing… good tool in your tool belt to use when you already have a first pass approach to doing something, and you know that you have specified in the prompt, like, to the best of…00:14:42
I don't know, like, to a basic ability, and you know that this is going to be a problem that you're going to have over and over again, because you've just seen it, and so now you just want an automated way of measuring the prevalence of that failure mode in your traces. Great. But you're not going to be at that point for, like, for every single failure mode, for sure, and it'll take a while to get to that maturity, so I think you have a good intuition.00:14:53
Martin Siniawski
Thank you, thank you. I have a follow-up, maybe it's… you tell me if it's a quick one or not, but as we were going through this, we realized that our prompt is quite long, and started reading more about the problems of prompt length and all of that, and we heard from someone at VAPI, which is, like, the vendor we use for voice. They were saying, like, do it, like, 10,000,00:15:16
characters, or I don't know if it's characters or tokens, honestly, but 10,000 was the number they were giving us. I look at ours, and it's, like, 36,000. And so now I'm wondering, and we started… there's a document by Anthropic in terms of, like, prompt engineering with context engineering, and how to make it more modular. I guess what my question is.00:15:36
should I make this a priority right now? To, like…00:15:54
find a different architecture for the prompt that's more modular, that we are able to bring it down sub-10K.00:15:57
Or, I don't know.00:16:04
Shreya Shankar
Is this a voice agent? I don't know… It's a voice agent.00:16:05
Martin Siniawski
Agent?00:16:08
I don't know what the guidelines… But we're a kind approach, right?00:16:09
Shreya Shankar
Yeah, I don't know what the guidelines are for voice, like, what sister prompt length or whatever. I would say that it's worth, like, going over your prompt and try to ruthlessly cut out things that00:16:12
are not important. Like, a lot of times people have AI to write their prompts, or they invoke prompt optimization tools that just inflate the length of the prompt. So just, like, make sure you're not making those basic mistakes, like, you know what's in your prompt. And then, I think it's still, like, up in the air kind of research question on what are the better, like, modular,00:16:25
kind of prompting framework so you can write pieces of your specifications and then put those together. But again, like, this is where you really want to go into every word in your prompt and make sure you can cut out things that are not relevant.00:16:45
Martin Siniawski
Okay, and for text prompts, do you have a number that you shoot for, in terms of length?00:16:58
Shreya Shankar
Oh god, that's really difficult. I'll send a link to the Awesome Systems prompts, and just so you know that, you know, these prompts can be extremely long. Like, 30,000 is actually not that long in my mind, compared to some of the stuff that I've seen here. Just sent it in the chat.00:17:03
Martin Siniawski
Alright.00:17:20
Scott Meyer
Martin, actually.00:17:21
Shreya Shankar
it.00:17:22
Scott Meyer
I put it in the chat, Martin, but Real-Time Voice has a 16,000 token limit for OpenAI.00:17:22
So, that might be causing it.00:17:28
Martin Siniawski
Yeah, yeah, yeah. I don't know if that's a speech-to-speech, though, but…00:17:31
Scott Meyer
Point of voice, yeah.00:17:36
Martin Siniawski
Yeah, ours is, like, chained, so I don't know if that's different, but thank you. Thanks, everyone. Thank you all. That's great.00:17:37
Hamel Husain
One thing to keep in mind is, if you're talking about a more heavy lift, like switching architectures.00:17:42
You know, it's very tempting for people to, like, bike-shed architectures.00:17:50
And it's very expensive. And, especially if you're talking about increasing complexity of your system, that's when you really want to think about evals. You want to… you want to ground it with evals, because, now you are taking on technical debt, and you want to make sure that debt is paid for.00:17:55
with some performance gain. You want to convince yourself, like, it's worth it. So, I understand, like.00:18:14
that's a situation, you know, like, tweaking your prompt and things like that, that's a little bit less… that's easier, and not necessarily causing technical debt, per se. Maybe a little bit of debt, but not, you know, in the same way of…00:18:21
you know, increasing the complexity of your architecture. So…00:18:33
So that's when, like, don't just…00:18:38
in, like, try different architectures, like, do an eval-driven, like, approach of, like, hey, okay, let's try, like.00:18:41
You know, if you want to do, like, multi-agent.00:18:51
Something or the other. Okay, like, you need to have evals first.00:18:54
Martin Siniawski
Yeah, yeah, yeah, it's almost like we're gonna do a major refactor, let's make sure we have some tests laid out before we do that.00:18:58
Hamel Husain
Yeah.00:19:05
Martin Siniawski
Okay, thank you for that.00:19:06
Hamel Husain
Ashish.00:19:09
Ashish Bhatia
Yeah, well, and sure, thank you for the course, I've been loving it so far. Couple of things, we have a more mature product, so getting error analysis on, let's say, 100 traces.00:19:10
is challenging. A lot of them are just right outcomes. Any strategies on picking00:19:23
Should I go pick, scenarios where there were actually failures, or error codes, and things like that?00:19:29
To narrow down my, kind of, search space a little bit.00:19:36
And then second thing is, once you have, let's say, a mature framework, kind of similar, right? How do you go hunt for more errors? Do error analysis on just the new, unknown, kind of?00:19:42
Hamel Husain
Yeah. So, there's a lot of things you can do here. One is see if you can…00:19:58
if you have any way, any implicit signals from your users on, like, what is a failure and what's not, sometimes people have more explicit thumbs up, thumbs down, sometimes it's something else. Like,00:20:03
You know, some kind of failed outcome, or your desired user behavior didn't happen, whatever it is, you can always00:20:14
Let's try to sample in those regions a lot more. You can compute statistics over your traces, so…00:20:23
try to find outliers in various ways. So, like, you know, are there, like, some conversa… say if you have a chatbot, there are some conversations that have lots and lots of turns, like, what happened there? Too few turns, what happened there?00:20:29
And it's kind of, like, application-specific, like, and it's like a data analysis exercise of, hey, how can we filter this data? Can we brainstorm ways to filter our data?00:20:46
that is more likely to surface errors. And then it's an iterative process, like, what you do is you then, you know, do error analysis on those. Usually what happens is you'll get more ideas.00:20:59
when you sample in high regions of errors, and you'll get… you'll continue to refine, like, other hypotheses of where other errors might be. Like, for example, you might… when you do this sampling, you might discover, like, oh, like, this tool call, you'll get an intuition, like, this particular tool is always…00:21:11
It's kind of… flaky.00:21:29
You're like, great, let's review all traces where that tool fired.00:21:32
Or when this tool is called with this specific parameter.00:21:36
you know, it's flaky. Okay, let's do… let's find that. So, it's very application-specific.00:21:40
And it really depends, like, what you find.00:21:49
But you can always be… you know.00:21:52
It's worth it to have hypotheses.00:21:56
Australia might have a different angle to it.00:22:00
Shreya Shankar
No, I think…00:22:04
you covered it. Like, if you already have hypotheses of failure modes, and you need more examples of that failure mode, go, you know, filter your traces for those specific failure modes. You can use LLM as judge to do that, you can use, like, an embedding similarity, like, whatever technique, like, you feel… it can even be basic keywords, like, I know it's this particular tool call, so I will keyword search for that tool call in my traces, and then get that set.00:22:05
Right, so I think it's… Hamel's pretty spot on about, like, have it be hypothesis-driven. And then the other thing I would add is, like, always look at some random sample every week, because there are unknown unknowns, and so you just…00:22:29
Want to make sure you're covering that, and you don't have to, like, exhaustively do it until you're tired, but, yeah, it's good to just see new things every once in a while.00:22:41
Hamel Husain
Shehzad.00:22:55
Shehzad Akbar
Hi,00:22:59
First of all, this course was incredible. I didn't realize that for a lot of the automation, the actual learning would be just do it yourself manually, which is…00:23:01
I think the greatest thing I learned so far,00:23:10
So, maybe I'll give a quick, like, sort of preamble to, like, what I'm building, and then…00:23:16
I'll dive in one or two. So, effectively, I'm trying to build, like, a publishing pipeline that takes a synopsis of some sort and generates on the… on the other end of it, like, images that kind of illustrate the story. This is a really dumbed-down version.00:23:21
of what I'm doing, but that's kind of this hypotheses that I've been working through for a while. And,00:23:37
One… one question I have is, like, I…00:23:44
I realized very early on that a lot of the inputs won't be very different, right? It's not like a…00:23:48
B2C or a consumer-facing LLM, like, in a day job, I work in a healthcare startup, and, like, I understand, somebody asking about their benefits and their insurance can ask the same thing 50,000 different ways, and so you want to capture all those cases. But for me, building this as a publishing, like, automation engine for myself.00:23:54
the inputs won't be that variable. So is there… one, I guess the first question is, like, is there any value00:24:13
for me, generating, like, different variations of a synopsis input, or is it more value for me to… which is what I did, is take the one synopsis and just run my LLM with the same synopsis.00:24:19
50 times, and then read the output from that, and do my open coding. Like, is there value in generating variations of the input when I00:24:33
I know reasonably it'll be the same.00:24:42
Shreya Shankar
It's a good question, and I appreciate that you described your use case, but I'm still… I've never done your specific use case. I've found that depending on the model, you will have different variants in the output. So, for example, if I use GPT-40 Mini to do these, like, synopsis00:24:46
I don't know what to call it. I might have actually more variance in the outputs than, say, if I use, like, a GPT-5, or something that's trained a little bit better. So, I'd probably use judgment on that end, but I… I find that…00:25:05
doing this, like, repeated analysis often, like, has diminishing returns after I look at, like, 5 or 6 of the same one, because they really do all kind of look the same, especially if the temperature is pretty low.00:25:20
But then again, it's very task-specific, so if you find that in error analysis, you're uncovering diverse errors from doing this, then yeah, go for it, continue, you know, generating a bunch of samples with the same prompt.00:25:33
Shehzad Akbar
Got it.00:25:46
Shreya Shankar
And then, I guess, just to follow up, and let me know if it's, like.00:25:48
Shehzad Akbar
taking too much time. Well, the other part that I learned was my original prompt00:25:51
Had, like, 4 or 5 different steps in it, so it's like, hey, like, take the synopsis, generate a bunch of beats.00:25:56
Now you gotta narration, like, figure out the characters, and now let's, like, let's, like, think through the spreads, like, how you're gonna put this, lay it out, and, like, what's the narration there? I actually found it more valuable, first, both for myself and for the actual system, to actually break those out into separate prompts.00:26:01
Right? And so it was easier for me to have. Now I'm like.00:26:19
I call it a beat agent, a narrative agent, a character agent, there's a bunch of agents now that are working.00:26:23
And then I'm trying to understand, because now I'm like.00:26:28
I got myself into, like, a fuck, this is a… oh, sorry, crap, this is a lot of…00:26:31
a lot of, like, responses I have to read through to get to the end. There's, like, 14 beats, there's a bunch of character descriptions.00:26:36
And so, like, in terms of, like, optimizing for this, should I still be, like, running from synopsis to end output, and then reading that, or should I then…00:26:43
be focusing on very specific agents, so optimize my prompt for the beat agent really well, then move on to the next, and then move on to the next. I imagine some layer of this is subjective, but I'd be curious to understand if there's a direction you'd say in terms of how I optimize across…00:26:54
And these aren't, like, parallel-running LLMs, they're just sequential, it's not…00:27:13
Shreya Shankar
Yeah.00:27:18
Shehzad Akbar
So one goes to the.00:27:19
Shreya Shankar
Yeah, so… if I were to rephrase your question, it's basically, I have a workflow of LLM calls, and do I optimize this workflow end-to-end, or do I optimize it, like, single call at a time?00:27:20
I'm just using the term workflow because I think, like, Anthropic came out with some blog post that says workflows versus agents, and it's probably somewhere in the Discord, but I could post it later and…00:27:33
It's just easy to have that terminology. I would say…00:27:44
Okay, I think there's two parts to this for what I would do, and I'm curious what Hamel would do.00:27:48
first thing is, I… often when I build workflows, the LLM calls are varying in, kind of, task difficulty, so sometimes it's easy steps, sometimes it's hard steps, and it doesn't make sense to really spend all my energy… equal amounts of energy trying to optimize every single step, if I know some are harder than others. So what I typically like to do is do, like, end-to-end error analysis, actually, to try to get, like, an aggregate view of00:27:53
what steps do I hypothesize in this workflow, or failing the most?00:28:18
And then go and optimize those individually, starting from the most upstream.00:28:22
most complex. So, I know this is vague, but, I find that it's just helpful to, like… it goes back to, I want to have hypotheses around where my workflows are failing,00:28:28
like, think about it as I'm investing my time in this, so I, like, want my time to, like, pay off the most, and then I'll go and, like, fix those.00:28:39
Hamel Husain
Yeah, I think it's pretty good,00:28:50
for whatever reason… okay, so, like, I have… I use a lot of… Writing…00:28:54
It's just, like, things. I have my own tools for writing. And.00:28:59
Shreya Shankar
You know, for me, there's a lot of human in the loop.00:29:05
Hamel Husain
like, I route an outline first, like, you know, like, let's say I take a video from the course.00:29:07
you know, I'll have, like, a chapter summary of that, then I'll have, like, an outline.00:29:13
And then the outline becomes, like, a more fleshed-out thing with my commentary. And in that specific workflow, I…00:29:19
kind of… I know intuitively, because there's human in the loop, I know that00:29:28
That I need to get the first part right.00:29:32
Otherwise, the downstream doesn't work. So then I just, like, make sure, like, I iterated on the…00:29:35
On the outline but that's because I kind of skipped a step, because, like, you know.00:29:40
But Shreya's saying, like, okay, she wants intuition, like, where it's gonna break. I already kind of knew that, like, I need to focus on that.00:29:46
Because of the very linear nature of what I'm doing. I don't know if your thing is, like, linear in that way, but if you already know, then, like, sure, like, do that. But if you don't know, then you should do what Shreya's saying, is like, okay, do the whole thing and get hypotheses of, like, where it breaks.00:29:52
Shehzad Akbar
Yeah.00:30:12
Okay, this is… so my current process, and I'll stop talking at this, is very much linear. It's kind of… it's a really fun project, so…00:30:13
I'll iterate on it, and then it's, like, me originally, and then when it gets to the illustrations, I'm just doing books for my kids, so my kids will pick the pictures, and they'll be like, this is the one I like, and that's the one I like, and so it's like, what I'm trying to create is, like, a product experience that just makes my life easier as I'm going through those flows, and just, like.00:30:22
Hanging out with them.00:30:41
Shreya Shankar
That's awesome.00:30:43
Shehzad Akbar
There is a lot of human in the loop in there. Yeah, a… would probably then…00:30:44
Okay, this was helpful, so thank you. I think I kind of have a broad understanding.00:30:53
I'm basically trying to optimize for the whole thing, end-to-end, and then figure out the pieces in the middle that fall apart.00:30:58
Hamel Husain
Makes sense.00:31:08
Shehzad Akbar
Thank you. Thanks.00:31:10
Hamel Husain
Yeah.00:31:11
Ayush?00:31:13
Aayush Agrawal
Hey everyone, thanks for the course, and yeah, very much echo the fact that it's been awesome to hear00:31:14
like, just the drilling in of, like, just go manual, and that's awesome. I've been doing that for my team. Just some context, I'm building, like, a GenAI platform for my company to, like, build agents off of… on, and then, you know, do that scalably and safely, so it's really important to, like, get the evals part right, because that is the mode.00:31:21
And so my question is very tailored to that, actually going off of kind of what Shahzad was saying. So…00:31:37
when we were bringing in these SMEs, who are typically non-technical, and we want to bring them in in a very easy way.00:31:45
two questions. One is, like, as we have, like, you know, an agent that has maybe multiple different, you know, it's a workflow, so there's multiple sub-agents.00:31:51
you know, some of the tools that we're looking at, like Arise, for example, just their traces can be really detailed, right? You could go into, like, each and every workflow, which, sometimes that's important for maybe a technical person, but then, from an annotation point of view, my first question is, like.00:32:01
do you need to annotate against the overall trace, or do we want them to, like, dig in and, like, annotate the specific tool call, or, like, the specific LLM1 over 5 one?00:32:16
So that's first, and then the second is, like, whenever I've seen, like, SMEs come in, it's always been, like, give me a golden data set so that I can run my LLM as a judge. And that's what I've always heard them being brought in for. And so…00:32:29
like, I didn't get that really clearly from week one, and maybe it'll come up, but, like, when do we have to do golden data set stuff, and what is it used for? Because right now, all the talk is all about, like, LLM as a judge, automate as much as you can, and so that's kind of what I'm trying to…00:32:43
Deep dive into.00:33:02
Hamel Husain
Okay, so the first question around, how do you annotate traces, when you're beginning, I would recommend, like, annotating the trace as a whole, at a macro level, before, you know… later on, you can definitely annotate, like, specific things.00:33:03
But your question kind of goes to a broader point of, like, hey, we have these tools, like Arise.00:33:20
And, it's showing a lot of information. All that information may not… may or may not be relevant, depending on who's looking at the trace. And you might, from your…00:33:29
product.00:33:38
You have probably some things that you know00:33:39
are usually not as useful to look at as other things. And you may not… and so, this is one reason why we teach you in this course, you haven't got there yet, to create your own annotation interface.00:33:44
Because your data and your situation is always going to be unique, and you want to really dial that in to you.00:33:57
So, that kind of addresses that issue.00:34:06
Always forget the second question.00:34:10
Aayush Agrawal
The second question's around Godel datasets.00:34:15
Hamel Husain
Oh yeah, golden data set.00:34:17
Shreya Shankar
Maybe it's worth thinking about some, like, historical context on the golden data set. So I'll give two perspectives on, you know, why we hear this term so much. One is from the traditional ML literature, and the other one is from, foundation model training, right? In traditional ML,00:34:21
anytime you use a model, you actually have to train. You can never use something, like, off the shelf, right? You've got to do some training or fine-tuning. So in those cases, you can't even embark on your ML journey without having some labeled data set.00:34:36
to do machine learning. So, kind of, the golden data set has been kind of hammered into people from00:34:50
that perspective. The other perspective of foundation models, like OpenAI, Claude, etc, they talk a lot about golden data sets because it's very important for them, right? They are training these models from scratch, they are training these weights,00:34:56
you know, they have some target that they need to meet, on… when they can publish some sort of benchmark, so that people like us can use these models and feel confident about them. Now, in this, like, weird new application-centric world where we are not necessarily training or fine-tuning models.00:35:12
It is… I don't think you always need to be thinking about having golden data sets. Like, what is very important is kind of this…00:35:30
This bottom-up approach of how we are trying to prototype something, doing some error analysis, finding failure modes, thinking of failures as kind of properties that traces have, and thinking about our application as, like, an infinite stream of traces, and not necessarily a specific set.00:35:37
That we're trying to, like, benchmark, maximize.00:35:55
I think it's… yeah, so I wouldn't think too much about, like, we should have a golden data set, because that's very, very important for our application, like, unless you are trying to do training, or you have to use that artifact itself.00:35:59
I wouldn't worry about it. Now, there's one case, and we talk about this next week, and kind of CICD, of, you know, sometimes you know that there are some pesky failure modes, and when you ship changes to your application, you don't want to00:36:11
you know, fail at those… you don't want to introduce any regressions. So you might curate a small sample of, like, known traces or, like, adversarial traces for your application to just make sure that you can pass on those… that set that…00:36:25
I wouldn't call it a golden set, because I don't think you need to have golden ground truth answers for that. I think you just need to have your set of traces, and then verify that the properties hold on those, and then ship it. But you know, that's… that's a case where you might think about having, like, a predefined set of traces in CI to prevent regressions.00:36:40
Aayush Agrawal
Yeah, I think that's the place where it actually helps, because otherwise, today, I see, like, folks might get, like, 100 questions that kind of work, right? And they put that in their golden data set, and then they report against that. They're like, oh, we have 90% accuracy, 95, but then if you actually use the product 10 times, you're like, 6 out of 10, this is crap, like, this is not doing anything.00:36:59
Shreya Shankar
Yeah.00:37:19
Aayush Agrawal
process to actually fix that, because they're like, we have 95% accuracy, what are you talking about?00:37:19
Shreya Shankar
So you said it better than I didn't want to say it like that, but that's why I think that golden datasets are very misleading in the application-centric world.00:37:25
Aayush Agrawal
Cool, thanks so much.00:37:35
Hamel Husain
Abhishek.00:37:40
Abhishek Panda
Hi, Dean.00:37:43
So, Hammer and Shay, I have one question with respect to open and Excel coding, with respect to my current company agent product. So, the choice of observability tool in our company is Langfuse. The development has started since last couple of months, security integration, everything has happened. So, can't suggest the interest as of now. So, for open and Excel coding, I'm thinking, I mean, the learnings I have taken from the home.00:37:44
works, right? So, as of now, for the starter.00:38:09
Admin, the, user interface that is there for open and Excel coding we have seen in Homework 1, right? Sorry, Homework 2. Can I use a similar interface for, open coding?00:38:13
Only looking at the conversations. I know that the interface we have, as of now, their tool calling and other stuff is not happening. Only looking at the conversations, we are tagging it for open coding. So, could you recommend a similar strategy?00:38:25
Until, you know, the language is coming into picture.00:38:41
Hamel Husain
So, okay, stop me if I misunderstood your question, but, first of all, like, it's important that you log everything into your traces, so, like, you know, if you're not… you should log your tool calls, you should log everything, everything that the LM sees, you should be logging it into your traces, because ultimately.00:38:50
You might have to… you might need that context in order to do your open coding.00:39:09
Effectively. And then, secondly, like, I'm not 100% sure if LangFuse has…00:39:15
What their annotation interfaces look like, like, you know…00:39:23
Abhishek Panda
basic annotation we can do in LangFuse. I did some research. I mean, up to some extent, open coding is possible, but if that is also possible in Excel coding, I can take care myself, right? I mean…00:39:28
Once we have the reasoning, then we can use LLM to categorize it.00:39:39
Hamel Husain
Yeah, when you say reasoning, make sure, like, just make sure it's your reasoning.00:39:46
Abhishek Panda
Yay.00:39:50
Hamel Husain
Okay,00:39:51
you know, axial coding, yeah, I mean, it doesn't really matter where you do the open coding, you could do it in a spreadsheet, you could do it in LangFuse, you could do it in BrainTrad, it doesn't really matter as long as you end up doing it in some organized way that you can get to later. You know, I'm not… we're not dogmatic about the…00:39:53
Abhishek Panda
spreadsheet is not the… I mean, I have seen, like, many times you have mentioned spreadsheet, but do you think you get a right approach? Because we want to store that trace, right? I mean, like a JSON kind of thing. What was the user, what was the content, you know, if any dual-calling or anything.00:40:09
So, when it's spreadsheet, how we gonna store all those things? I mean, is it a screen… I mean, I don't know, I am unable to visualize what do you mean by spreadsheet.00:40:26
Hamel Husain
Yeah, I mean, you don't have to use a spreadsheet either if it seems, intractable.00:40:35
you know.00:40:41
all I'm trying to say with the spreadsheet is, like, it doesn't really matter what tool you use, as long as you can organize the data in a way that you can get to later, so you can do analysis on it.00:40:42
You know, it might not make sense for you to use a spreadsheet if you have, like, large traces. You won't be able to really read it very easily in a spreadsheet. But you could if you wanted to really force yourself, but, like, you don't…00:40:53
Yeah, I mean, if you're already using LangFuse, and they have a way to take notes for every trace, by all means, like, use it, and if you're happy with it.00:41:07
Seems reasonable.00:41:15
Abhishek Panda
Got it. So you're saying that, I mean, I can do my own research, like, how much flexibility is there in LineFuse? If not, I mean, until the time that development is happening within my company, I can use any kind of, you know, if I can build an interface where I can log those traces and try, I mean, our,00:41:17
The product analyst team, they can, you know, look at those observations for even.00:41:34
Hamel Husain
Yeah, in a later lesson, we'll be covering how to build your own annotation interface.00:41:39
Abhishek Panda
Okay, good news.00:41:45
Hamel Husain
And then you can, you know, kind of use these things as a backend if you want to, like a database you could read and write from.00:41:46
Abhishek Panda
So, we have that in our later tutorial, isn't it?00:41:53
Hamel Husain
Yeah.00:41:56
Abhishek Panda
Okay. My second question is, I mean, with respect to the dimension grouping, I mean, in the recipe bot, yes, it was kind of… I mean, don't mind, it was kind of a bit easy, like, you know, choosing the dishes, timings, okay, and then, you know, grouping were done. Or you are saying that is just one strategy for creating user queries? I mean, there could be, like, different strategies, like, I can think of my own.00:41:57
And put dimensions. I mean, dimensions is more like scenarios only, right?00:42:21
Hamel Husain
Yeah, you should definitely think of other strategies. This is just a strategy that we use quite a lot.00:42:26
Shreya Shankar
Yeah, you don't have to use our dimensions either. Please don't. They don't apply to everything. So if you can think of dimensions that are appropriate to your application, I think the only thing that really matters is, like, they should vary somewhat, or be complementary to each other, right? So…00:42:35
If… if you have, like, time of day as a dimension. Like, you don't need to have a dimension for, like.00:42:50
Abhishek Panda
Birth of day and second of day.00:42:56
Shreya Shankar
Sorry.00:42:58
Abhishek Panda
Let's say it is something like Logo Generator. What would you suggest in this kind of scenario?00:42:59
Shreya Shankar
I don't know who's using the logo, so I would put my product hat on and think, okay, who are the target users? How might they vary? What kinds of queries would they come up with? And kind of generate those… I would reason about it like a reasoning LLM for some number of tokens before kind of saying some…00:43:06
dimensions.00:43:25
Abhishek Panda
Got it, got it. Cool, I got my answer, thank you.00:43:27
Hamel Husain
Mirko.00:43:32
Mirko
Hi guys. So, first of all, thanks for the course, really, really helpful. I will try to keep it short. I would like to dial into a question a student had last week about system prompt iteration, where I think she asked whether you would…00:43:34
do this in the code versus doing this in one of those tools like Langsmith and Co. And I just watched those vendor videos and was curious whether you changed your mind in any view, because I kind of like the fact that there is an interface for doing this, but I can't figure out how this would work with very complex system prompts.00:43:48
that pull in data from, I don't know, RAC and some tool calls, so it feels very cumbersome to do this within this UI and plug all these variables together.00:44:08
Just wanted to see what it was…00:44:16
Hamel Husain
Wittner videos? Are you talking about the bake-off that we did?00:44:18
Mirko
Yeah, exactly, exactly that one I'm referring to. So I think, like, I just watched a Langsmith one for now, where he talks about, like, how you can use the playground, you just switch to the playground quite easily, but…00:44:21
it finds… I find it quite intuitive and… and easy for very simple prompts, but if you have long prompts that are just sort of, like, augmented with code and, again, with… with…00:44:32
with tool calls and data that's coming from RAC,00:44:43
It feels almost impossible to do this with, with this kind of inter…00:44:47
interface, yeah, I just wanted to see whether you have derived anything that would change your mind.00:44:50
Hamel Husain
Yeah, I mean, I like to… you know, playgrounds are always of very limited use for me.00:44:59
Sometimes, if it's a single turn.00:45:06
then, like, if I can reduce the error to a single turn.00:45:11
then the playground is tractable. As soon as you get into, like, variable number of turns, Then playgrounds become…00:45:15
Kind of difficult, because,00:45:24
You know, you have to… do you have to do a lot of work to template them correctly?00:45:27
Also, like, your tool calls, you have to, like, assume they are static. They're, like, your prompt, you know.00:45:31
Because, like, it's a causal chain of events, and so if you have a prom playground, you have to sort of freeze downstream events, because you have to template those in. And, you know, that's… that's not,00:45:39
you know, I don't like that, because I'm not getting the full picture. I might, you know, sometimes that causal chain, or often it is very important, right? Because if you're fixing an upstream problem, you know, then you, you know, an upstream problem will fix the downstream problem.00:45:52
Mirko
Yeah. And so, you know, what I like to do is version the prompts with my code.00:46:07
Hamel Husain
And… I like to make the code very modular, so that I can play with different entry points.00:46:12
you know, I usually have… able to, like.00:46:20
You could use whatever you want, like, you know, you can either create an interface, like an admin interface in your application, like, allow you to play with the prompt and, like, have the same conversation or same interaction the user's having.00:46:24
So I like to use notebooks sometimes. So, you know, like, Jupyter Notebooks is, like, an interactive coding environment where it's like, okay, I can play with code, and I can run…00:46:35
Steps and do whatever?00:46:47
And, like, fiddle with it and change things.00:46:49
So, I personally like to have it with code. The one trade-off is, like, you want to make sure that your domain experts can write prompts, or fiddle with prompts, and in that case, it is… makes… does make sense to have, like, an admin interface to your application that does have access to, like, your tool calls, your retrieval, whatever, and that people can write prompts and then, like, see the result.00:46:54
Mirko
Yeah, makes sense, makes sense. Again, I was… I think you alluded to the perfect thing, which was, like, the templating feels very, very difficult to do within these platforms, at least for my use case, I'm not sure. But yeah, quite helpful. Thanks, thanks very much.00:47:16
Hamel Husain
I don't know how to pronounce this.00:47:32
Ngoc Chau Nguyen
Cho. Yeah, my name is Cho, yeah.00:47:37
Hamel Husain
Okay. Thank you.00:47:39
Ngoc Chau Nguyen
Okay, yeah, first of all, thank you so much for the course. I'm so glad I know about this before the course starts.00:47:40
So, my first question is actually about a notation tool. So, let me know if, I should grade, for later session.00:47:50
So, for the context.00:48:01
I have done underwrote analysis myself and see the pattern in arrows.00:48:04
But, not in an organized and systematical way. And, I've worked with Lancemith on that process.00:48:10
And, I really…00:48:19
I was really caught up with the idea of building own evaluations tool in Notebook. I do that with Marimo. I try Marimo, I really like it. So…00:48:22
I want to ask about, can you elaborate the regions we should have our own tools for our analysis process?00:48:34
The main, kind of, I want to understand, kind of, the main leads of frustration that make you want to use your own tool instead of vendors, and what feature you wish to be there when using their tool.00:48:45
That's the first question, yeah.00:48:56
Hamel Husain
Yeah, I mean, it's a good question.00:49:00
Langsmith is… great, like…00:49:04
Harrison, you know, listens to us directly, and kind of has done as much as he could as, like, making everything, like, very nice, like, hotkeys, markdown rendering, open coding, you know, so on and so forth.00:49:07
The reason, like, why we like to nudge you towards creating your own is, like, if you're working on a product, a lot of times your data is very specific to you.00:49:23
And you can do, you know, you can render it in ways that make it easier for you to read. For example, if you have, like, widgets.00:49:35
if you have, like, UI elements that your chat interfaces, like, rendering, or even if, like, Freya has an example, which I can put in the chat, is like, okay, an email writing assistant, and00:49:43
you know, the annotation interface that she created is, like, it makes it look like an email. It has, like, you know, to and from, and it, like, makes it appear like an email, and that, like, reduces cognitive load, because, like, you're seeing the data in the way that it's meant to be seen. It's not, like.00:49:59
like, a JSON, or even, like, an unformatted markdown, because you can then, like, notice different things. Also, like.00:50:15
For example, the Nurture Boss use case.00:50:24
So, the Nurture Boss that we talk about in class, is like, their system prompt is quite large.00:50:27
And so, if you're gonna do error analysis, you probably don't want to keep looking at the system prompt, you want to just collapse that by default.00:50:33
So just that little… just that very little detail, like, collapse system prompt by default. So you would want to collapse system prompt by default, but you would want to see maybe the property name.00:50:41
And some other, maybe, few lines of, like, metadata about what00:50:53
property we're talking about. So that's very specific to that application, right? So it's like… but that would really help in your error analysis, and now you don't have to, like, scroll as much, you don't have to, like, you know, collapse this thing, it's less clicks, and over, like, thousands of traces, it adds up. So, you know.00:50:58
That's why it can be really good to create your own, to make it, like, to reduce all the pain as much as possible.00:51:16
Ngoc Chau Nguyen
Yes, yes, thank you. So, it's really specific to the application. So, is there only, like.00:51:25
kind of… I want to see from angles, if I am to build, my own annotation tool, then what kind of, framework or, like, things I should remember in order to build the… the, like, good… good enough tools out there.00:51:34
Hamel Husain
Yeah, that's… that's something that we do cover in later lessons, of, like, here are some…00:51:55
like, guidelines to keep in mind. It is in the course reader as well, so if you're very eager, you can read ahead all the way to that to get.00:52:00
Shreya Shankar
Yeah, it's Chapter.00:52:09
Ngoc Chau Nguyen
Okay, yeah.00:52:11
Shreya Shankar
Week 4, Chapter 10.00:52:12
Ngoc Chau Nguyen
Okay, thank you. Another question, yeah, definitely, we'll check chapter 10.00:52:14
So, another question I have is about, theoretically, theoretical saturation.00:52:21
So, did you ever find a time when you could not reach, that saturation?00:52:29
why doing error analysis? Like, usually, what are common mistakes that lead to this?00:52:36
And, either way, we stop seeing…00:52:44
So, I, I can see that,00:52:49
the arrows, repeating… repeating themselves. So, I wonder, does that reflect LM nature of, like, error repetition?00:52:52
Shreya Shankar
It's a good question. So the question is, is it possible to never reach theoretical saturation?00:53:06
I've always been able to reach it, I don't know if Hamel's thoughts on this, but if you feel like you're not able to, I would think that00:53:14
there are maybe some bigger problems at stake, so one is maybe you're trying to build an application as general purpose as, like, ChatGPT, which is trying to take in anything or say or do anything, and it's, like, coding assistant, and writing assistant, and any application under the sun you can imagine and dream of, and, like, that scope is a little bit too big for a small team to, like, really00:53:23
Do error analysis and…00:53:46
be able to hit every single possible failure mode or saturate. I think still you would saturate. You would see the same errors occur over and over, but that tail of errors might be much longer in such an open-ended application. The other thing I would say is you might not be able to saturate if you're, like, if you don't have a00:53:48
process. If you don't have the same person or same team of people doing open coding and axial coding, like, everyone has wildly different00:54:06
I don't know, interpretations of what makes for a good and bad trace, and so you can, like, kind of never have any consistency in your labeling process, then you can never saturate. So I would look into one of those two failure modes if you find that you're not able to do theoretic… get to theoretical saturation.00:54:15
Ngoc Chau Nguyen
Yes, yes. So it's scope and consistency in, like, team's, process, right? Yeah. Yeah. Thank you, thank you so much. And, another question's about,00:54:32
I have actually,00:54:43
learned a lot from the podcast. You talk with the product grow, I think. So, in that, Streya have said, that LM might be the mode, like, the only mode, so I would really want, kind of, extensive,00:54:46
answer on why, LM evals might be the mode. So…00:55:04
From your point of view, yeah.00:55:10
Shreya Shankar
Yeah, I guess I can answer it if I made this comment. I have no idea, maybe it was, like, taken out of context as a large thing, I should boiling watch that. But…00:55:12
broadly, I would say the evals process is your moat, because that's where you can inject your subjectivity and your taste, and be responsible for, you know, building the application to be as good or what you envision it to be.00:55:20
If you don't have that, then your application is just, like, a wrapper around ChatGP.00:55:33
like a basic prompt in ChatGPT. So then what differentiates you from your competitors, or even ChatGPT itself? I don't know. So that's kind of probably where I was getting at with the evals being your mode. Not the specific failure modes you uncover, or, like, any LLM is judged, but your process of, like, doing the open coding, axial coding, infusing your perspectives and your tastes.00:55:39
And iterating on it on that ways, because it's going to be different for everybody. Like, your recipe bot might be different from Hamel's recipe bot, different from mine.00:56:01
But, like, that's what kind of makes our products personalized, right?00:56:10
Ngoc Chau Nguyen
Yeah, thank you so much. Okay, thank you for all the answers. I'm satisfied, yeah.00:56:16
Hamel Husain
Navelle. Neville.00:56:24
Neville Clemens
Yes, Neville. Thanks for all of the course materials, and this last few days I've been going through Chapter 5.00:56:26
which is on L&M Judges.00:56:35
And, my question is related to going back to the topic of golden sets and the ongoing calibration of LLM judges. So I think the text lays out the process for… you create a, you know, a labeled data set, you divide it into…00:56:37
Your training set, your dev set, and then finally your evaluation set.00:56:54
And then you have your LLM that's calibrated, you have a sense of what is the true positive rate, the true negative rate of the LLM.00:56:59
And you use that, run the LLM on your production data, and then you can have the correction.00:57:06
Towards the end of the chapter, they talk about, or you guys talk about, redoing this process on a regular basis. So now, let's say your product is in production, and you're getting 10,000 traces a week.00:57:12
There's this process of redoing this, essentially sampling those 10,000 traces to create that eval set for the judges, and recomputing the TPR and TNR every week.00:57:26
Shreya Shankar
I think every week is probably a bit of a stretch here. I think it's more important that you do error analysis every week. For error analysis… sorry, for LLM as judge specifically, it depends on the nature of the failure mode that your judge is trying to judge. Like, if it's a pretty static or pretty well-defined failure mode, that it's not gonna change over time.00:57:42
Then, you know, you don't necessarily need to continually revisit00:58:01
the training set, dev set, test set, true positive, rate, false name. But if your failure mode is somewhat seasonal or time-dependent, or the definition of that varies based on the user, so for example, I worked with this fashion AI company, and one of their failure modes was… is the suggested out00:58:06
outfit appropriate for the weather. So this is, like, a great example of a failure mode that, you know, you could really nail it. It's really good for the summer, it has very good alignment, but then you just, like, didn't realize that weather appropriateness00:58:26
is going to be different in the winter, and maybe the LLM doesn't understand that very well, and then all of a sudden, that judge is not aligned anymore, given the current distribution of traces and queries that users have. So TLDR, I think you should just00:58:42
Be judicious here, like, depending on the nature of your failure mode, figure out if you need to relabel the traces for your judge alignment or not.00:58:56
Neville Clemens
Got it. Thank you.00:59:04
Hamel Husain
Vidya?00:59:08
Vidhya Sriram
Hey, good morning, guys. Thank you for designing the course the way you have done.00:59:11
Because I work with, early-stage startups.00:59:17
And it's already helped me use this… use my learnings as a diagnostic to devise them. For context, I also have a qualitative research background, and I'm a product strategist.00:59:23
But the kind of first principles that the course have instilled in me has built more rigor in my thinking and made me more confident in my recommendations, which is good for them, too.00:59:34
So, I… a lot of gratitude there. Thank you. And I have one friendly suggestion and a PSA here. I see 87 people, and a lot of people here on the video are technically fluent, but I am talking to people like me who are building their technical fluency. I think the homework should not be optional.00:59:44
There is so much value just with those 5 examples, so many failure modes.01:00:03
that I could, which made me scared about the Cloud and ChatGPT I'm using, but it also helped me look at how amazingly it can fail, and how it can escape our attention.01:00:09
Unless we apply these principles. So, it was wonderful going through the painstaking, but wonderful. Finally, my question is about perturbing a trace.01:00:21
I didn't quite get that concept at all.01:00:30
If I have to read up somewhere else, I will, because I only went through the Maven part of it this time, and not the course reader. So, please direct me accordingly. Thank you.01:00:33
Hamel Husain
Yeah, so perturbing a trace means changing a trace, synthetically. So, like, you know, for example, I worked on a natural language query assistant.01:00:44
Like, where you write in English what data you're looking for, and it goes and fetches that data, tries to do analysis for you.01:00:56
One of the key… one of the inputs into the… the context of that is, what is your schema?01:01:02
Your database schema.01:01:10
And,01:01:12
You know, like, so it's like, what is your database schema? And also, some other variables that correlate with the schema?01:01:15
And so… To perturb that trace, what we do is we change the schema.01:01:25
And we change the user query just slightly, so that the query, the user query matches the schema, so you're still asking for something that makes sense, and the schema is still valid.01:01:31
You know, that's a technical one, but you can, you know, in the Nurture Boss use case, if you perturb the query, you could say, okay,01:01:43
You know, instead of asking for a 3-bedroom, ask for a 5-bedroom.01:01:52
You know, so, like, just change, morph the query in, like.01:01:56
ways that you think will be beneficial. And the reason is, like, sometimes, instead of synthetically generating things from scratch.01:02:01
Sometimes you can use what you have.01:02:11
and sort of say, okay, can I morph this?01:02:14
in valid ways. And you have to think critically, like, is it valid? So, in the… in the Nurture Boss case, you would have to say.01:02:18
okay, I'm gonna change it to 3 to 5 bedroom, but you want to make sure that you have01:02:26
the 5-bedroom. Like, that property has 5 bedrooms, maybe. Unless you're specifically trying to test the edge case of the user asking for things that aren't there, that's totally valid. But you have to, like, design it and think about it carefully, like, okay, what is the failure mode that you're trying to trigger?01:02:30
Or that you hope to trigger, or what are you trying to explore? And sometimes it's easier to…01:02:49
kind of… Modify what you have.01:02:55
Vidhya Sriram
Yeah, that's the distinction I couldn't grasp. This… the 5-bedroom test, am I testing it for, whether it can recognize that it's not there, or am I manipulating it to test its sophistication? I couldn't quite grasp it. Is there…01:02:59
The example helped, but where do I read up more on it?01:03:17
Shreya Shankar
We have a chapter in the reader, it's either chapter 4 or 5, I cannot remember, but it's on multi-turn evaluation strategies, so check that one out.01:03:21
We give more examples there, but again, I think the answer to your question here broadly is, like, what's your hypothesis? And, like, what's your expected or ideal output? Do you want to make sure the agent indeed says, rejects 5-bedroom queries, or rejects queries that are out of scope, then go synthetically try to generate queries out of scope.01:03:30
Perturb or change existing traces.01:03:54
to reference information that's out of scope and see how the LLF does, kind of.01:03:57
That's one way of approaching it. So, kind of have… have an ideal behavior or outcome that you want to verify first, then you can look at your existing trace, figure out how to change them, or as we say, perturb them, to test for the behavior you want to test, and then verify that the outcome is what you expect.01:04:02
Vidhya Sriram
Okay, both are portable.01:04:21
Hamel Husain
And when we say the word perturb, by the way, we don't mean any kind of technical term of art or anything, it's just the English word perturb, which means to change, and that's all we mean. We don't mean anything more.01:04:22
Vidhya Sriram
Okay, I'm perturbed, but I will get clarified. Thank you. Yeah, I did the Maven part, but the course reader is… One quick question. Which is the order we should follow? I think the course reader has a rich material, but the first time I did the course reader didn't do Maven. I got caught with some questions. When I came to office hours, I couldn't make sense of it.01:04:33
Which is the right order?01:04:55
Shreya Shankar
It's very difficult. This time around, we are testing, asking people to read before watching the videos, and overall, I think there are fewer questions this time around. Who is… you don't know the counterfactual, you don't know, like, if you watched the videos first, maybe you would also have the same questions. Like, maybe you need to hear it twice in order to understand it, like, any, you know, new material.01:04:56
So I would say, you know, if you know that watching the videos… the other thing is, like, if you're a better video learner than a reading learner, then you should watch videos, so, you know, don't…01:05:18
you don't have to follow the recommendation or guideline to a T, do what's best for you, but broadly speaking, I will say that we have way fewer questions when people are reading before watching the videos.01:05:28
But, yeah, don't know why.01:05:39
Hamel Husain
I like reading before watching the videos because, maybe that's how I was trained, like, say, in college and whatnot, is like, you should read these chapters before you come to class, and, you know, whatever. And that really did work for me and make it stick. So, yeah, I would…01:05:41
You know… Okay.01:05:59
Consider that.01:06:01
Vidhya Sriram
Thank you.01:06:03
Hamel Husain
Aaron?01:06:05
Aaron Moss
Hey guys,01:06:08
I am the… this is my second go-around, so my question's probably maybe a little ahead, I don't know where we're at right now, but,01:06:10
Basically, we have been building in a large enterprise, we have been building, our first kind of real agentic application, that was kind of… the scope was already baked before I showed up, was hired.01:06:16
So there's some breadth concerns where I would love for us to actually replicate tasks, but… and we've made it through our first evaluation cycle, and have about 100, user questions with outputs and traces, and we did the open coding, the axle coding.01:06:30
My question is, is, kind of what my… I did, and I just wanted to verify if this is, like, good practice, is we kind of just, because of the upstream, identification of each output.01:06:49
Or within the… each actual coding, if… You can reuse that sample.01:07:02
Because there's clearly… and when I say reuse, basically, you've developed the solutions, you implement them, and then you reuse the same 100%, 100, like, sample within your application to then identify the next upstream source.01:07:10
And I've been… we did it about 3 or 4… like, probably 3 times before we were able to onboard enough more users to get new questions. I just wanted to make sure that was kind of a good practice to have.01:07:25
And if you, like, recommend kind of reusing and identifying upstream solutions, that cycle makes sense.01:07:37
Shreya Shankar
I think it makes a lot of sense, and I like that… I think in your case, you're actually on the more rigorous end of the error analysis process. Like, some people will just do it one time, or, like, they won't want… and it's totally fine, right? Like, if you realize that the more rigor you have, you actually benefit from it, or you're able to serve your users better, then definitely do it, and…01:07:45
sometimes I say, like.01:08:07
in… we go to, like, 5 rounds, 6 rounds, just because, like, sometimes it can be open-ended, or the stakes are pretty high. Like, I work in a lot of, like, social justice settings as well, so…01:08:09
Yeah, I don't think what you're saying is weird at all, if… That's the… If that answers your question.01:08:22
Aaron Moss
Yeah, and is there, like, a limit, or I guess you said kind of 6 times? Like, I was just curious…01:08:29
Shreya Shankar
Kind of like for us old…01:08:34
Aaron Moss
old millennials, the copy of a CD, like, you make a CD from a CD from a CD, so it's, like, degrading in quality. Like, if there's any type of decay that happens when you do that, and how far you can take it, because, like.01:08:36
In the absence of…01:08:51
a cold start, and I think maybe it's the… it touches on that previous question about perturbed. Like, you could take the original 100 and synthetically perturb them to kind of create these shades of truth. I mean, they're not really going to be net new.01:08:53
user story or pro… you know, kind of, like, work through, but they will generate more of a three-dimensional understanding of what failure modes could occur, by changing it from a 3-bedroom to a 5-bedroom to a whatever, so…01:09:10
Shreya Shankar
Yeah, okay, that's a good question. I would say the reason that we kind of reuse traces, or reuse kind of…01:09:26
sample similar traces to do open coding on. In my case, it's like, sometimes I'll have different people doing the open coding or axial coding, and whenever we do multiple rounds of it, I actually will, like, learn something new of, like, oh, actually, there's a more correct interpretation of this thing. Like, that's particularly useful in the high-stakes setting. Like, if I'm working in social justice, I need to make sure that any definition is correct. And, like, I'm not an expert.01:09:35
It's such and such law, so…01:09:58
it just… the more people look at it, the more likely I am to, like, arrive at the correct definition. But that's particular to my use case. If you find that there's something similar of, like, you need more iterations to, like.01:10:00
get to a stable interpretation, or a correct interpretation, then that makes a lot of sense, and I…01:10:13
wouldn't feel like any quality is degrading? I mean, if you feel like quality is degrading, then that's a separate thing. Like, often that happens because people kind of just blindly try to do open coding, axial coding, without any particular01:10:20
intention there, they're just like, let me just get to my 100. So I'm gonna, like, sample the same trace over and over.01:10:33
100 times, and then… yeah, or I'm perturbing the same trace over and over 100 times for the sake of it, just to get to 100, and then… then that's not meaningful to your application, right? So, I hope that…01:10:40
Aaron Moss
Yeah, it…01:10:51
Shreya Shankar
Helps in some way.01:10:51
Aaron Moss
No, it is. It basically makes me… it gives me… makes me feel gross by doing that too many times, because it's just, like, how in data science we used to use Smote.01:10:53
As the… as the catch-all, like, hey, we're gonna do this, and we're just gonna, like.01:11:01
Shreya Shankar
50.01:11:05
Aaron Moss
explode this out, and it's gonna be great, and everybody thought it was great for a while, and then, you know, changed. I was… so I had another question. So, like, I'm kind of leading01:11:05
this large enterprise, like, 15,000 people, and we're kind of doing this diffusion of AI practices. And I just wanted to… I wanted to ask, from your guys' perspective on evaluation and this process that, you know, I've been, like, integrating into my own01:11:17
workflow… Have you guys seen or found ways to help with01:11:34
getting this process to diffuse… I mean, basically, this course is one way, one method to diffuse this way of doing AI evaluation. I didn't know if you had, like, a more of a meta understanding or something written down of, like, how you guys are going about this, because that's what I'm basically having to replicate.01:11:40
in my enterprise is, like, I am documenting, like, the center, the best practice, and I need to replicate myself01:12:00
As, you know, replicate what I know to other people, and so I just didn't know if you had more of that, more of a meta kind of understanding.01:12:09
Hamel Husain
So… Best friend.01:12:21
Aaron Moss
Yeah. I love them. Yeah, so the thing I found most helpful is, like…01:12:22
Hamel Husain
sort of a Trojan horse… Of, like, getting people to get excited about evals is… to, kind of.01:12:28
Not even talk about evals, but just… You know, in every…01:12:36
like, status meeting, or whatever you want to call it. You have, like, hey, here's a list of bugs that we found.01:12:41
This is what we fixed.01:12:47
you know.01:12:49
this is, like, the metric around it. Don't even talk about it. You know, you just present that, and it's very, very quickly, people will be like, how are you doing this?01:12:51
Like, why are you finding so many issues all the time? What's your process? Very interesting.01:13:02
And you're like, well, let me show you, and then it's like evals, and then everyone starts doing evals. And that's a… if you kind of do it the other way, of saying, like, oh, let me tell you about evals.01:13:07
Everyone's gonna be like, get out of here. So, like, just, that's the strategy.01:13:18
Aaron Moss
No, that's great. I totally appreciate that, because, like, it is… and guess what? I'm doing it the other way. I'm, like, basically saying, here's the awesome process.01:13:25
And it fits me, because I used to be a social science researcher, and so, like, open coding, axle coding, is, like.01:13:37
what I used to do, and that's, like, what… how we, like, quantified latent variables. But now… so, like, I'm… I'm in it. I'm, like, I'm preaching to the choir here, but yeah, probably stick… stick with, like, the outcomes first, and that lead back into the process. So, that makes sense.01:13:44
Hamel Husain
Yeah, no, great question.01:14:00
Shreya Shankar
Another interesting point is, when we first started teaching this course, people would ask that question a lot, like, you know, how do I convince people that evals are important? And I would say that we get that question a lot frequently.01:14:01
a lot less frequently now, so I don't know what's happening. Maybe people are understanding that you need evals, like, who knows? Like, everyone's changing, and I would say don't get too discouraged if people are…01:14:13
like, anti-evals now, because if they want to keep using AI or build AI products, at some point, they're gonna realize the value of it.01:14:24
Aaron Moss
Yeah, I just… I just linked the… the tweet that Hamel had about, like, OpenAI's doing it, really trying to attach my… attach the process to the brand, and be like.01:14:32
Look, these guys are doing it, so clearly this is something we need to do, like, you know, trying to take that strategy, too, so…01:14:42
Hamel Husain
Yeah, it's the outcomes.01:14:52
The authority thing is probably a little bit better than telling people about evals, but… Yeah, even then, people…01:14:54
Might say, like…01:15:01
you know, they might not get it. But I think people will understand the outcomes.01:15:03
Shreya Shankar
Yeah.01:15:07
Okay, this is the last question, and then I have to drop 10 a.m, but…01:15:08
Samuel Thomas Elliott
Thank you. Thank you both.01:15:13
I feel less crazy being in these meetings, because I'm all in on evals. So, we have a really unique challenge where we're trying to teach, different AI tools, agents, stateless, stateful, sales best practices.01:15:17
And we've had engineering kind of building all of our prompts, and there's been a need for bringing in domain experts. We really have a need for bringing in PMs as well, who essentially are designing, you know, the purpose and what the outputs should be and what quality looks like.01:15:31
the immediate pushback that I'm getting right now with evals is as we set up Langsmith, essentially being able to show or describe that01:15:48
doing evals in Langsmith that's showing the, you know, multi-step LLM calls to get to the output that we're looking for is worth setting up and building LLM judges on, even though it's not a true end-to-end flow throughout our code.01:15:58
And the fight is essentially, it's not worth doing this, the only thing that's worth it is if we're able to do an eval that's truly end-to-end through our code, which isn't transparent and accessible to people that we need it to be transparent and accessible to.01:16:14
So, I think the solution is, let's set it up and run an eval through Langsmith that's not fully end-to-end, but as close to it as possible, and then I work with the engineers to run the full01:16:26
more cumbersome end-to-end through the code, and compare the results to be able to not say, oh, I feel like it's not worth it, but actually show the data to show what the delta is. Is that the right path to kind of get past this objection I'm getting, or is there a different way to approach this?01:16:39
Hamel Husain
Tell me a little bit more about this end-to-end through code, like, why it has to be separate? Because I didn't quite…01:16:58
Samuel Thomas Elliott
Supposedly through Langsmith, and the Langsmith team told me that, oh no, we can run it fully through code, but01:17:06
Supposedly, if we were to run an eval in Langsmith through their front end, it wouldn't be running through our entire codebase.01:17:13
Hamel Husain
Okay.01:17:23
Samuel Thomas Elliott
And this is also part of the challenge, is I'm a non-technical person doing this.01:17:24
That can't necessarily speak the same language as, you know, our engineers, and then our engineers can't necessarily speak the same language of, like, understanding01:17:29
that we can write really good LLM judges, it just takes a lot of testing, and trusting, and knowledge to write the prompt.01:17:37
Hamel Husain
Yeah. To get it.01:17:44
Samuel Thomas Elliott
To the point.01:17:44
Hamel Husain
I'm gonna guess, based on what you're saying. I mean, I can only infer a bit. So, like…01:17:45
what I'm hearing is, you want to run some… you want to run some evals, and you want to calculate some metrics.01:17:51
and you're using, like, let's say Langsmith, And,01:17:58
you know, running through your codebase, what that means to me is, like, you need to do, like, the tool calls, and the rag, and the whatever. And, you're not able to do that. The Langsmith is not gonna run those for you, because they're not in your codebase. And so,01:18:04
You know, you need… The thing is, like, you definitely want to…01:18:18
write your code so that you can run these things in a modular way. It is a smell if you can't easily invoke01:18:23
Like, easily have an entry point that you can invoke the whole workflow.01:18:32
Does that make sense? That you can test the workflow, and that you can instrument it01:18:38
so that it is logged to Langsmith, and that you can, like, calculate the metric in the real way. If you can't… if there's some kind of code problem where, like, you can't get to the code, then you kind of have to…01:18:43
And be like, we didn't design the code, like, very well.01:18:57
Samuel Thomas Elliott
Yeah, because I can get most of the inputs and things, like, into a dataset, to make sure that it's running properly. Yeah.01:19:01
Okay.01:19:10
I know that's vague, I'll update you on future meetings on where we're at, because this is a big project.01:19:11
Hamel Husain
Okay, sounds good.01:19:18
Alright, well, thanks for coming to today's Office Hours, really, we really enjoyed it.01:19:22
We'll see you all in the next one.01:19:27
Thank you.01:19:30
Live session where instructors will address questions. Instructors may present answers to common questions, followed by live Q&A
[
Home
](/parlance-labs/evals/2025-3/home)[
Community
](/parlance-labs/evals/2025-3)