OCT 21 Optional: Live Office Hours 6 TUE 10/219:00 PM—10:00 PM (GMT+5:30) OPTIONAL Recording
Notes
Recording
Optional: Live Office Hours 6
Oct 21, 20259:00 PM - 10:00 PM GMT+5:30
Audio Transcript
Chat Messages
Hamel Husain
Hello, everybody.00:01:23
Welcome to the office hours.00:01:27
We can… we can get started if you like. There's only…00:01:30
3 people here, so nice and cozy.00:01:35
You know.00:01:38
Robert Lavigne
Let me get that right.00:01:42
Hey, how's it going?00:01:44
Hamel Husain
Pretty good, how are you?00:01:46
Robert Lavigne
Doing well, doing well. I, you know, seeing we're just a few people here, I thought I'd just,00:01:47
tell you how great it's been so far. I've been definitely learning a lot. I don't have a very specific question today, seeing we're just waiting for people to come in, but00:01:53
I'm just getting into the RAG stuff now, which is gonna be good, because I've been doing a lot of various type of RAG stuff in the past, but I just wanted to give you an update. So, for the last 3 years, I've been running a lot of automated processes against00:02:02
system prompts and so forth across multiple agents. And the system I was using has a dashboard with all that data.00:02:16
But I've never used it as a trace. I've never even thought of it as a trace. So I created a archiver00:02:24
Last week, that would actually pull from their API and generate a whole bunch of JSON-Ls.00:02:30
of all of those conversation histories, and I'm now loading it as a private dataset in Hugging Face. So I'll now be able to do SQL calls against all of that, and do all of my, you know, actual coding and all of that, categorization.00:02:35
But it was interesting, one of the characters, one of the chatbots generated 2 megabytes worth of content, and I didn't even know that it generated that much until I started doing00:02:50
this eval stuff. So it's not just the evals that we're kind of learning from this, it's just re-looking at data that's been in front of us for the last two, three years.00:03:01
with a completely different set of eyes. So, now that we've got a lot more people here, I'm going to turn it back to you so you can actually do the proper questioning, but just wanted to.00:03:11
Hamel Husain
No worries, yeah. If you're up for it, you know, if you don't… if you might be open to…00:03:19
recording a video of you going through what you've learned, or something, and sharing it. Yeah, yeah.00:03:25
Robert Lavigne
Yeah, yeah, yeah.00:03:29
Hamel Husain
for the class.00:03:29
Robert Lavigne
Yeah, I did one a while back where I was showing the dashboard that I had created for the Disconnected Frontier Classic. And, you know, shout out to the people in the group. That was a dead chatbot that I wasn't doing anything with until all of a sudden the course came up. I said, I'll use that as a backend.00:03:30
And I was able to get 150 conversations on it last month.00:03:47
So that just gave me a whole new set of traces as what the opening gambit of the story is across 150 unique00:03:51
instances, so I would have never had that data had it not been for the course as well. So, I'll probably be doing it in the theme of that, but like I said, I've been evolving the dashboard that I've been using for that to be far more LLM as a judge, as well as running calculated.00:03:59
algorithmic elements to it. Like, small little things like message counts. So, you know, when you've got a thousand entries and you can see which ones are the ones that had, like.00:04:16
engagement versus the ones that were DOA, you know what I mean? Small little upgrades like that, I would have never done.00:04:25
Or I would have had an LLM do it, as opposed to saying, you know what, just run a Python to calculate all the, all the conversation entries, and just display that so I can see which ones and rank accordingly. And small little things like that.00:04:32
definitely got from this just by changing that thought process. So, yeah, if you're up for it, I'll definitely do a video, I'll post it, it'll probably be on my LinkedIn.00:04:45
Okay, back to you guys.00:04:55
Hamel Husain
Okay, back, thanks for…00:04:58
coming a little bit early, sometimes I come a few minutes early, and we chit-chat, so… people are welcome.00:05:01
So we can go ahead and get started. Martin.00:05:08
Martin Siniawski
Hey there, how are you? Good morning.00:05:12
So, just wanted to give you a quick update, and then a couple of questions, but just to give context,00:05:15
So, been following the course very closely. So far, what I'm finding most helpful, and I feel a little bit guilty about this, but you can tell me if that's alright or not, is, like, the error analysis part.00:05:21
In the sense that…00:05:31
We've just found so many things to improve, and I would say they're on the low-hanging fruit end, possibly, and we've been doing, like, for the past 3 weeks, 1-2-week sprints, addressing, and I think it's mostly Golf specification.00:05:34
So I'm still waiting for the dust to clear before implementing LLM as judges.00:05:50
So yeah, that's a little bit of the context.00:05:56
now I'm working… I sent a question at you, Hamil, you answered, so thanks for that. We have a coaching program that's 8 weeks.00:06:00
And people see the… they talk to an AI voice therapist every single day.00:06:09
And right now, what we've encountered through the error analysis is that00:06:15
the… the AI therapist is not very aware of the things that have been discussed before, and what things they need to discuss during the week, almost like a structure to it, right? Because there's things that need to be done every single week, but they need to be done once, or maybe a couple of times at most. So…00:06:21
We kind of tried to hack that before by putting everything on the prompt and giving some heuristics of when to trigger one way of talking versus the other one, but it hasn't been great.00:06:38
So right now, we're separating into a few more components, something that creates a… has some, like, persistence, and understands what's this… what's a week of the program look like, and what are the topics that need to be covered, and creates an agenda before.00:06:48
The problem gets triggered, and then something that will kind of,00:07:01
Process the conversation, and make sure which topics were covered, and then store those ones and mark them so they're not repeated next time.00:07:05
We were considering, like, a graph, like, a land graph or something like that to do that. You were saying, like, keep it simple. So I guess my first question, or very quick question, is, is this simple enough? Or that sounds like we're over-complicating it, and maybe…00:07:12
In one big prompt, maybe we're still able to do something like what we want to.00:07:27
Hamel Husain
So, it's good to keep it simple. What you're talking about.00:07:34
Seems like you're trying to manage memory, or have, like, memory.00:07:39
So that you can have continuity of an experience.00:07:44
Martin Siniawski
Yeah, and we already have memory for what the user says, and what they manifest, but not so much for the assistant to understand00:07:48
What kind of topics have they already discussed in a given week, let's say, and have more…00:07:58
Yeah, almost like a professional you're seeing every day for a given set of weeks, and you don't want them to repeat themselves and make sure they cover everything that's important.00:08:04
Hamel Husain
Yeah.00:08:13
you know, before you go to, like, Landgraph or anything like that, I would encourage you to try to implement00:08:16
Something in the dumbest way possible. Even if it's, like, even through memory, let's say.00:08:23
Martin Siniawski
Yeah. Of, like…00:08:30
Hamel Husain
you know, writing something to a database that, like, keeps state… Yes. …about, you know, what the plan is for a week or something like that.00:08:31
Until you feel like… There's a reason…00:08:40
more, like, from an engineering perspective, or it's coming cumbersome, or you like the, you know, you want to express it in that framework?00:08:46
But it's good to, like,00:08:55
Do it without the framework first, for a little bit, so that you can understand what the framework gives you.00:08:57
It'll give you, like, more…00:09:03
It'll give you a better ability to debug things.00:09:08
Cause sometimes the abstractions, they can hide quite a bit.00:09:11
Yep. And it's hard then to know… if you haven't done it first yourself, then it can be hard to understand, like, what is happening.00:09:15
Martin Siniawski
Yeah. Langraph, I think we've already, like, those types of approaches, we've already, like, postponed them for the indefinite.00:09:25
Hamel Husain
Beautiful.00:09:31
Martin Siniawski
Until we really need them.00:09:31
I'm just wondering, like, we're gonna be creating these, like, very small components, I'm hoping that,00:09:32
yeah, they create the agenda by understanding that persistence memory, and then they update what's being covered during the week. I don't know if that sounds simple enough, or if… because there's a world in which we try to keep on, like, stuffing it into a prompt, and putting all that logic in a large prompt with different variables.00:09:39
But it feels like it also gets tough to manage that.00:09:57
Hamel Husain
Yeah, I would try… I think the memory approach is good.00:10:03
Martin Siniawski
Okay. Like, you don't have to necessarily stuff it…00:10:05
Hamel Husain
keep concatenating onto a prompt. You can try to keep curating a memory.00:10:07
That is where you are… Cognizant of the size of that memory.00:10:13
And that you're pruning it as well?00:10:20
Yeah. And so, that seems reasonable to me.00:10:23
Martin Siniawski
Okay.00:10:25
One last thing is that it's emerging to us that00:10:26
So right now, we error analysis, right? We look at individual traces, maybe, yeah, just kind of sampling and randomly, or maybe with some kind of criteria, but…00:10:32
It feels to me that since our thing is a program which has 8 weeks, we're gonna have to start looking at individual users, and maybe all of their choices in a given week.00:10:41
Because there should be a logic to all of them, right? And the experience they have in multiple days. So it feels to me like maybe a different way of doing error analysis. I don't know if that's an approach that you've also.00:10:52
Hamel Husain
Yeah, that's really reasonable. Yeah, that makes sense, like…00:11:03
if you're doing error analysis and, like, it's reasonable for you to, like, look at all, like, a cohort of, like, the same week, that'll reduce your cognitive load.00:11:07
That's totally fine. Like, we do that, in other places as well, like, look at…00:11:15
you know, customers all from the same channel, or from the same segment, I can help you, like, get context and review things, instead of, like, jumping around. So it's totally reasonable.00:11:21
Martin Siniawski
But to us, it's not only the cohort, but also, like, looking at many traces for the same person, maybe in a period of time. That's also reasonable, I guess.00:11:31
Hamel Husain
Yeah, I think that's reasonable.00:11:40
Martin Siniawski
Okay.00:11:41
Sounds good.00:11:42
Shreya Shankar
Yeah, we just… we never recommend to start out with it, because when someone is doing error analysis for the first time, they don't know what to pay attention to if they're looking at 20 traces.00:11:43
But if you kind of have already been doing many rounds of it, or you're familiar with the process, then you should scale it up in the ways that make the most sense for you and your team.00:11:53
Martin Siniawski
Okay.00:12:03
Shreya Shankar
Yeah. One other comment, is you mentioned, like, oh, there's still so much value you're getting out of error analysis, which is really good, by the way. I think a common misconception in this course is that people just inadvertently think, oh, because we spend one week on error analysis, they will learn and finish one week.00:12:04
error analysis. And then we spend one week on LLM Judge, so they will just spend one week on LLM as Judge, and then magically, in, like, four weeks, you know, their product is, like, solved, right? And that's not the case, right? We…00:12:21
I would say that this is a year-long course of principles, like, that you will just continually apply, and they will all take varying amounts of time. That's great.00:12:35
So, don't feel like you're behind because you're still doing error analysis, right? Like, it's just the nature of the course. We cannot take years to teach.00:12:44
Martin Siniawski
Yeah, yeah. No, but that's actually a really helpful point. Maybe worth also, like, mentioning in the lectures more and more, if I missed it, because, yeah, it just feels like, oh, I need to transition to ELLM as judge, I didn't do it, you know, and I'm falling behind.00:12:52
Shreya Shankar
No, you're not falling behind. I mean, the videos are always there. You'll get to it when you need to, I promise you, you'll get there one day. One day, you will have solved Gulf of Specification, so…00:13:06
Martin Siniawski
And I'm doing the iterative rounds of error analysis, and it's great to see how we're fixing the things that we noticed before, you know, in the new versions, and that's also very rewarding, and it's working, so that's great. Well, thank you so much. Really appreciate it.00:13:16
Hamel Husain
Ani.00:13:30
Ani Ray
Hey, hey, hi, Shreya, hi, Hamil. Thanks for the excellent, course. So, I'm working on, like, agents that are maybe more technical, so, like, code optimization agents, and, like.00:13:32
agents built on, like, you know, logs, metrics, traces, and whatnot. I feel like, I just wanted to get your thoughts on, like, the approach that I should be taking as, you know, someone that wants to help with… as the PM, that wants to help with, like, the error analysis and, like.00:13:44
annotations and whatnot, but there's, you know, my domain knowledge isn't as thorough as, like, someone who's maybe an engineer that would be actually using these agents. So I'm curious if you have thoughts on how to, like.00:14:00
Approach, you know, the eval process for, kind of, these, code gen or, you know, technical agents.00:14:13
Hamel Husain
Can you… what kind of coding agent is it? Is it, like… like, cloud code type of thing, or…00:14:24
Ani Ray
So it's, like, something that will optim… like, you feed it your codebase, and then it'll find optimizations, performance optimizations, things like that. So, you know, an engineer can then go land the diff so that, you know, they realize this, you know, efficiency win on their codebase.00:14:29
Hamel Husain
Yeah, so domain knowledge is really key to doing any kind of evals. It's gonna be really hard, but…00:14:53
The best thing that I can say is to try to…00:15:00
Get domain knowledge by either working with engineers or talking to users, which often… and understand what users want, and a lot of times there is a gap between00:15:06
Like, you know, when you're doing your discovery, and what…00:15:16
like, what users want and what they need, you know, all the superpowers you have as a PM.00:15:22
And then, like, what your engineers think.00:15:27
they're building.00:15:30
And, you know, perhaps focus on that, but it is important that you also kind of…00:15:31
Get your domain expertise.00:15:37
My intuition is, like, you could potentially add value with that…00:15:39
Kind of that understanding, even as you're getting up to speed in the domain, with these more…00:15:45
like, UX-type things.00:15:52
And workflow-type things. It's just that you kind of have to take your… You could take your…00:15:55
user findings, and your discovery work, and see if you can filter your data a bit. Like, your users, the users that you talk to, are gonna give you some hypotheses.00:16:05
of, hey, like, this thing doesn't work, or, you know, we need to fix XYZ or ABC.00:16:17
And then you can try to validate the hypothesis somehow by looking at the data to try to find examples of that.00:16:24
That could be really interesting.00:16:30
I'm sure I might have another idea, but… huh.00:16:33
Shreya Shankar
Oh.00:16:37
Sorry, I was muted. No, I agree with everything you say. One other thing is I think that, like, using LLM judges and automated validators, like, if you go through the alignment process and create these things, with the domain expert there, now you have these00:16:39
Functions that you can, like, run on traces And then compute whether00:16:56
they're exhibiting failure modes or not, independent of the domain experts, which is really nice. And then you can do some sort of data analysis on, like, hey, you know, these failure modes are happening more, and I'm going to do some cohort analysis to figure out for what users00:17:03
they're experiencing this failure mode, and try to, like, propose improvements there. So I would, yeah, try to figure out how to bake in processes that create automated evaluators, so that way you can consume the outputs of the automated evaluators, and then use that to inform product.00:17:16
Hamel Husain
Oh, and make sure you come to Teresa's guest lecture next week. You know, she knows a lot about… she's the expert in, like, you know, discovery and a whole bunch of other product management disciplines.00:17:32
And, you know, she was a student in her first cohort, so she'll have a really unique insight how to fuse, like, evals with00:17:46
Like some of these other things.00:17:53
Ani Ray
Hmm, thank you.00:17:56
Hamel Husain
Anuba.00:18:01
Anubha Saxena
Thank you for the course, as everyone else says, right? Along with the course, what I really like is the community, like the Discord and these conversations, because I learn a lot from them.00:18:03
I have a few questions, actually. I come from software engineering background, and we write tests, you know, unit tests, integration tests, and all kinds of tests. And I was initially thinking that evals is also, you know, a kind of00:18:15
testing if everything is working fine or not. When… when I write…00:18:32
any kind of tests, especially unit tests, the more the better, right? Because I'm trying to guard, like, if the functionality is working fine or not.00:18:39
But in case of evals, my first question is, do we only write evals on failure mode? So after error analysis, when we find failures, do we only write evals for those? Or could it also be, for example, if I have some functionality.00:18:48
really vague example, but give me an image of a particular dimension, and then I want to ensure that the dimensions are within a range.00:19:04
Should I have…00:19:13
An eval for that, even though the, you know, the pipeline is working fine? Or… or not?00:19:15
And a follow-up on that actually is, how many evals are too many evals, you know? Because if you're doing error analysis all the time, and we're finding new things, maybe the old failure modes will no longer appear. So do we continue to run those evals, or is it okay to remove it, and when and how, basically?00:19:21
Hamel Husain
This is a really good question.00:19:43
So, in this class, like, We…00:19:45
In order to teach it without people getting confused, we…00:19:50
Have guided all of you to focus on failure modes that you find, rather than prospectively.00:19:57
Testing things?00:20:05
And it's just a guideline, really, and the reason behind that is… with LLMs.00:20:06
It's hard to anticipate what can go wrong, first of all.00:20:13
The surface area is a lot larger in terms of what can go wrong, and, you know, in addition to being a little bit more unpredictable.00:20:18
But also, a lot of the evals are significantly more expensive to build and maintain than… and even run, sometimes, oftentimes, than, let's say, you know, like your typical unit test, or even an integration test.00:20:27
So…00:20:45
That's why, like, as a first sort of guiding principle, we say, okay, like, focus on the ones you find, because if you try to approach it like software engineering.00:20:47
What ends up happening is you end up getting lost in the sea of evals, and that the evals don't…00:20:56
provide you with value, because they're expensive, and they're… can be time-consuming, and whatever. You know, you want to get…00:21:03
an ROI on the evals. Like, they have to serve you.00:21:11
And so…00:21:15
kind of aligning it with problems you find, the chances of the eval serving you is a lot higher. Now.00:21:16
There's a lot of nuance there.00:21:23
So… You know, there's different kinds of evals. There's the LLM judge.00:21:25
which is a lot more expensive. And then there's the code-based evals, which a lot of times can look a lot more like a unit test, and can, look, and can be, like, a lot cheaper to run and maintain.00:21:32
And so, it's a judgment call. If you…00:21:46
You know, if you find that…00:21:50
you have a good reason to believe, like, hey, like, I should test this, or this, like, really,00:21:52
you know, I'm really paranoid about this error, I really want to make sure it doesn't happen, yes, by all means, like, write that test. Especially if it's a code-based test, for sure. I mean, it's… that's very aligned with the software engineering mindset.00:21:59
Just keep in mind, like, okay, the tests…00:22:11
kind of have some cognitive load. These evals have some cognitive load, and you have to curate them.00:22:16
To say, like, hey, do these set of evals, are they providing me with signal?00:22:22
You don't want… a lot of times I've worked with companies, and their whole eval set, almost all of it, except for a few evals, are passing 100%.00:22:27
that's, like, a smell that… mmm, the evals are not that good. They're not providing you with value. You want to, you know, you want to, have some…00:22:37
signal on, like, where to improve. So, unlike software engineering, you know, software engineering, you feel really happy when you get the green checkmark in CICD, everything passes. You feel good. That's not really what you want to aim for here with evals.00:22:46
And so it's hard to say, like, you know, Shreya,00:23:01
kind of, I think, in the last office hours, or several office hours ago.00:23:08
Someone asked her how many evals00:23:12
she has, and I think she said, like, something around the order of 10, you know, maybe LLM judges, or something like that.00:23:15
that's… you know, I don't want to say there's, like, a number, just keep in mind, like, okay, the cost of different evals, knowing that the code-based evals are significantly cheaper, and then you have a fixed budget00:23:24
of, like… Theoretically, you have a fixed budget of how many you can possibly manage.00:23:37
And you have to say, like, okay, how, like,00:23:42
you know, which ones should I have in the set of evals?00:23:45
Shreya might have a…00:23:49
She usually has, like, an interesting other angle at it, so I should let her comment.00:23:52
Shreya Shankar
I don't know, interesting other angle. I will phrase your question as there's, like, two approaches to evals. There's a top-down approach, where you know some criteria that you want your application to follow, so you want to design00:23:56
design evals for those, and there's a bottom-up approach, which is what we teach in this class, which is I look at my data, I observe some failure modes, and then I will design the evals for those.00:24:12
Oops, sorry.00:24:24
We teach the bottom-up approach first.00:24:28
Because of all of our experience seeing people implement top-down.00:24:30
And then everything's still failing in production. Why is this the case? Well, if you remember in Chapter 1, we talk about this gulf of specification.00:24:36
Everything that you didn't actually write into your prompt, and you realize is really important.00:24:45
Those criteria never come top-down.00:24:51
Right? Like, they're just not in the prompt, you never thought of them. So, how could you have ever thought of them and created evals for them if they're not in the prompt?00:24:54
And, like, Martin said this earlier today, right? Like, he's, like, still there trying to bridge Gulf of specification. You cannot ever, like, solve those problems with the top-down approach.00:25:03
So that's why we always say start with the bottom-up approach, because the easiest golf to bridge is the golf of specification. Make sure everything, all your criteria is in there, in the prompt somehow. And then after that.00:25:14
you can think about, okay, like, now let me make evals for all of the things that I think are important to me. We always say start off with things that exhibit failures, because you should prioritize things that actually will turn users, and then you can go and write evals if you have extra resources for things that are nice to have.00:25:29
And sometimes, whenever, you know, you make a fundamental change to an application, like you change your agent model, like, when I move from GPT-4 to GPT-5, I might get rid of some evals because the GPT-5 just doesn't exhibit those failure modes anymore. That's totally fine.00:25:47
But yeah, hopefully that was, like, a…00:26:06
Philosophical answer to why we start bottom up.00:26:09
Focus on failure modes that we know people have, because that's the most important to the product, and then you can add more things once you have more resources.00:26:12
Anubha Saxena
That was really helpful, thank you. I have one more question, if we still have time for me. So we have a…00:26:22
Pretty… not pretty complex, but a complex… sort of…00:26:29
feature, where we have a chatbot, and the tools for that chatbot is, like, APIs, which are also LLM00:26:33
pipelines. So there are, like, two non-deterministic factors. First thing is, like, how the chatbot, like, the main chatbot is behaving, and how it is responding on top of the tool outcome.00:26:41
When we're doing error analysis on the chatbot itself.00:26:54
I personally think that maybe, for the first pass, at least, we can ignore what the outcome of these tools were, because it is also non-deterministic.00:27:00
And it can derail you from… because it's also, like, you know, operational factor. Those tools are owned by different teams, and then there's, like, you know, throwing at each other, like, it is, like, your tool is giving bad answers and those sort of things. But…00:27:10
how do we run meaningful evals in that case? And when I say meaningful, it is more about, you know, the interpersonal things, along with what makes sense for the product.00:27:27
Was I clear? I can also try to rephrase my question.00:27:44
Hamel Husain
I can try to answer, and then you can steer me.00:27:51
If I've misinterpreted, so…00:27:55
So, certainly, okay, like, it's always worth talking about error analysis, first, the first part. So, you definitely want to do error analysis with the idea that00:28:00
okay, you want to find the most upstream error in the chain of causal events. So if that's the tool call, that's the tool call. If it's something prior to a tool call, that's prior to a tool call. Whatever the first, kind of.00:28:11
logical… Or even, like, other… any kind of error that you see that is causing…00:28:25
you not to meet the user's goals. You should stop there, and, like, that should be your open code.00:28:31
And the reason for that is, is because there's a causal chain of events. Usually it doesn't… you should just fix the causal chain of event that's breaking first.00:28:37
Now, if you, like, are dealing with these tool calls, and they're, like, there's other teams,00:28:46
I don't really… So, I will say that AI is very similar to any other technology prop thing.00:28:56
half the consulting that I do.00:29:05
It's not really an AI problem, it's not really an evals problem, it's always this, like, people, organization problem.00:29:08
And…00:29:15
I don't know that it's any… not any… not necessarily a special sauce that I have encountered with AI.00:29:16
So I don't really have any kind of, like, advice for, okay, if you have multiple teams and you'd want to, like, how do you work around those teams?00:29:27
And, you know, finger-pointing, that's… Probably beyond evals.00:29:35
Anubha Saxena
So, in this case, if we think about maybe an an event scheduler tool.00:29:43
So, when we think about doing evals for a chatbot, which just schedules events, we just say that this tool was called with these parameters, and then we assume that it did what it had to do.00:29:51
In this case.00:30:04
it is… it is different, so we cannot just assume it as any other tool, and just say that, okay, this API was called with these parameters, and then we assume that00:30:07
it just did what it had to do, and the output was correct. And then do error analysis for that API separately.00:30:18
Hamel Husain
Yeah, okay, so, I mean, if you have a tool call that's unreliable, like, it's, you know, supposed to have a side effect of, like, scheduling an appointment and it's not.00:30:29
It's worth, kind of, Stopping and saying, okay, we're going to… we need to, like, fix this tool.00:30:38
Especially if it's in the middle of your… the causal chain of events.00:30:45
And I wouldn't assume anything like that is happening. I wouldn't assume the correctness of any tool. I would, you know…00:30:51
If you… if you could bring in…00:30:59
Sort of the logs of the side effect?00:31:03
being successful, of like, hey, the appointment being scheduled into the trace, even, so you can verify that it was successful.00:31:06
That would be good. It might be a matter of… You know.00:31:15
Making the software engineering practices of that team who maintains that tool better, so it's more reliable.00:31:21
Shreya Shankar
Yeah, sometimes in early-stage startups, like, the tool… the deterministic APIs will be failing, and, like, error analysis is not the right tool to do software engineering, right? Like, people should be writing unit tests, they should have good software engineering practices, so it's… yeah, as Hamel says, worse.00:31:30
Kind of creating that culture within the team to make sure that you can rely on your deterministic tools.00:31:47
Anubha Saxena
That makes sense. Thank you.00:31:56
Hamel Husain
Thank you.00:31:59
Melanie.00:32:00
Melanie Wuong
Hey Hamil and Sharia, great course. I just completed this week's, lectures, and saw that you said that the error analysis still persists for different modalities, which makes a lot of sense. What's not clear to me is how the dimensions would apply in an image context. Like, how would you…00:32:03
generate different, synthetic scenarios? Or do you maybe not pursue that and just try to gather a lot of images, which might burn through a lot of your00:32:21
Train, test, dev.00:32:32
pool. So yeah, I'm curious on that front.00:32:34
Shreya Shankar
Yeah, it's a good question. When you say images, I think there's, like, two types of image applications. One is you're consuming images as inputs, but generating text outputs. The other one is, like, a diffusion model, or, like, mid-journey of, like, you're taking in text or image inputs, but you're producing image outputs.00:32:37
I don't know what it is in your case, and I think that you…00:32:56
Melanie Wuong
Mine's, like, interpreting images and producing, like, an insight or, some structured data.00:33:02
Shreya Shankar
Yeah, yeah, okay. So in this case, I think it's actually very similar to how we think about dimensions for non-image modalities. Like, often the images are, like, complementing, you know, other text instructions, like types of queries or types of data,00:33:09
So, like, you can think about, you know.00:33:27
what are the types of inputs that people are gonna have anyways, and then, like, if images vary across those types of inputs, then you can, like, make a dimension for image varying in that sense. I think… if I think about a specific case, like, I work in a lot of, of course, legal applications, and00:33:29
like, general court cases, so whenever there are images, it's usually, like, footage of the event, or, like, frames from the incident, or evidence.00:33:48
But, like, the nature of that doesn't vary as much. The nature of the image doesn't vary as much as the, like, fact that the evidence image is there or not, or it was, like, taken by a police officer versus taken by somebody else on the scene.00:33:59
like, those are the kinds of dimensions that matter in, like, my case, and if you notice, they're actually, like, not about the image particularly. It's, like, just about the scenario in which I have the data.00:34:18
I don't know if that's true for your case, but I guess my broader take is, like, sometimes the way that you create dimensions is actually not that different based on the modality.00:34:28
Melanie Wuong
Hmm.00:34:39
So in your example, if you were trying to assess, like, what the image00:34:40
contains. How would you go about setting a dimension for that?00:34:45
Like, I don't know.00:34:50
Shreya Shankar
Yeah, I guess I never want to know what the image contains. It's more of, like, you know, does this evidence… does all of this data describe this type of incident, or exhibit a case in which some… so it's, like, kind of assuming all of the data together,00:34:51
So, maybe… but if I were to, like, do, like, captioning or something, I think I would, like.00:35:10
just… I would first just start with, like, a random sample of images that I think users will have, and then see if there's anything that varies within that sample of images. Like, sometimes it's low res, sometimes if it's, like.00:35:18
You know, like, the main item is obscured. Sometimes there's, like, multiple items that are of interest in the image.00:35:31
I'm just thinking of, like, a…00:35:40
case of, like, an AI styling assistant that I worked on in the past, and, like, they take image inputs of clothes and, like, want to caption them. And there's, like, a very high variance of, like, photo quality,00:35:42
But yeah, like, I think, like, just looking through some sample images and, like, trying to figure out if they change at all,00:35:57
That might help you inform some of your dementia.00:36:05
Melanie Wuong
And then just one last follow-up is.00:36:10
Hamel Husain
Just to be clear, sorry, like, the dimensions are talking about, like, sampling existing data, not necessarily synthetically generating data.00:36:12
Is that… just wanna make sure I understand.00:36:19
Melanie Wuong
It was more, yeah, to synthetically generate. Or, well, I guess to try to, because it's not in the hands of users yet, like, how would I go about00:36:22
simulating traces to make sure that when I'm building.00:36:30
Shreya Shankar
Oh, it's a story.00:36:35
Melanie Wuong
start with something. Like, just think about, like, if you were a user, what would they… like, just think about, like, 10 different people.00:36:36
Shreya Shankar
like… how my images vary in your case,00:36:43
like, when we did this exercise in the fashion case, like, we realized that image quality varies, like, sometimes people will take an image of their shirt also, and want the shirt captioned, but then, like, there's other items of clothing in there, so, like.00:36:48
That's, like, a type of, like, a dimension for us was, like, how many articles of clothing that are not the main article that we want captioned.00:37:04
To give you a specific answer there. And, like, yeah, we, like, kind of have to be creative and brainstorm these things. But…00:37:13
Such as the nature, I think, of synthetic data generation.00:37:21
Melanie Wuong
And then just a quick follow-up is, if we were to build, like, an LM as a judge for…00:37:26
like, say, one of the failure modes, in this scenario, if the…00:37:31
LLM has, like, limitations in, say, spatial reasoning, would that not carry through in the judge itself?00:37:36
And how would we go about that?00:37:42
Shreya Shankar
Yeah, it would, to some extent. So typically, in your main application, your LLM agent is, like, trying to do a lot, right? It's, like, trying to make a caption.00:37:45
for the article of clothing, it's, like, the main article in there. Typically, your judge is trying to do a much smaller scoped task, like, it's, like, only running on t-shirts, or, like, it's only trying to check a very specific binary failure mode.00:37:55
In which case… Probability of it, like, having spatial awareness00:38:10
general issues, it's not going to interfere as much, hopefully, if you align your judge prompt well.00:38:15
But yeah, that's just our experience, and kind of… that's why we tell people, like, when you do your LLM judge, focus on a very well-defined, like, narrowly scoped failure mode, so you can make sure your biases of the LLMs don't interfere.00:38:23
Melanie Wuong
Awesome, thank you.00:38:37
Hamel Husain
Daniel.00:38:40
Daniel Saad
Hi, thanks a lot. Sorry I don't have a picture at the moment, but,00:38:41
I would like to first thank you for the course. It changed my mind how we tackle the problems with the agents, and how we should deal in manner of how to do the project management, and how to set up ourselves for doing a project. It's really, really nice.00:38:47
My question is that, without getting too much into the details, that we have, we have these agents that generate these documents, and the document needs to pass these ISO standards, okay? And these ISO standards are not 1 or 2, there are sometimes more than 10, and each of these standards have this very specific way to00:39:03
Patterns, how should the document be written, okay?00:39:28
down to which word should be used, even. Okay, so that's the problem. So, you see, here.00:39:32
I don't have a problem of how to check my outcome, is that I have too many things that I need to check against. So that's… it's very specified, and sometimes these specifications clash with each other's. And at the same time, also, I need to adhere to all of these ESU problems. And one of the things that I do at the moment, I try to categorize these specifications.00:39:38
and create one evaluation for each of them. For example, in manner for language accuracy, I have an LLMS judge that does that, okay? But again, I pack around 5-6 rules that this LLMS judge needs to follow.00:40:03
And I don't know if it's an accurate way to do it, or this is an accurate way to tackle the problem.00:40:18
Hamel Husain
So the answer is, like, this really comes down to when you do your alignment phase, when you have, like, your labels.00:40:28
You can judge whether or not… or you can measure if your judge is… Good.00:40:35
According to the labels you have. Now, if your judge has too many… too much scope, and it's way too complicated for you to judge that, then that's a sign that maybe your judge is, you know, maybe you need to narrowly scope it. That's one kind of thought. Another, like, unrelated thought I had is…00:40:42
these different standards, some of them sound very specific. You mentioned, like.00:41:00
Having a specific word in a certain place, for example.00:41:05
It's worth it to try to be as creative as possible, and see if you can make any of these things code-based evals.00:41:09
To some extent, like, can you…00:41:17
parse the information out in a certain way and test certain things that you… that are deterministic in nature. I don't know. I don't know enough about the problem, but it's worth thinking about that.00:41:19
Daniel Saad
Yeah, I tried to do that. For example, they have these dictionaries that shouldn't follow, this word shouldn't be inside of it. So, for example, I tried to do that one way.00:41:30
But again, the problem… one of the problems that I'm facing is that, it sometimes is too specific, even for generating something, that you get clashing results. That's sometimes… that's the problem that we have at the moment. So, I have this pattern, so for example, I said my… to imprompt, okay, generate something based on this pattern with these rules.00:41:40
And what is coming back is because sometimes they are clashing with each other, or there are chains of the generation that happen after each other, that they sometimes clash with each other. And that's becoming a little bit problematic, and even the context of as a person to follow it is very hard, because there's too much rules, too many processes.00:42:03
And that's another problem, even for a human to follow it. That's the reason that we want to use agents or things to minimize this problem that could happen.00:42:26
Shreya Shankar
I don't quite follow the… maybe, can you illustrate your problem of flashing specifications? I don't think I've, like, heard of that before.00:42:37
Daniel Saad
So, for example…00:42:46
So, for example, I have this document, is a document that says, okay, that the things that you are going to, describe, or you are going to write, this process that you want to define, okay? Each sentence need to use, for example, shall instead of should, okay? That's one of the rules, and then it should follow this exact, exact, exact, patterns of writing.00:42:48
writing, okay? And then I have another topic, another issue that sometimes, sometimes has, okay, it's inside of it says, okay, you can also use different patterns, for example, this pattern, or this pattern. Then these two patterns becoming clashing with each other.00:43:12
Shreya Shankar
I see. Do you have any control over these?00:43:31
like, what specifications you put in the prompt, because the best way to, like, confuse an LLM or get different results every time is to actually provide conflicting instructions, because it won't know what to prioritize. So I would say, like, your primary goal probably is, like, how can you whittle down your specifications to conflict with each other00:43:34
As little as possible.00:43:57
Or, if that's very difficult, can you, like, prioritize them, or, like, have a rank order of maybe one specification is more important than another?00:43:59
Or you could have, like, priority classes of them, and then in your prompt, say, like, follow the first00:44:07
specifications more than the last… I don't know, you'll have to, like, figure that out. Basically, the idea is, like, you… there's no way to, like, post hoc repair00:44:14
any of this, like, you have to figure out how to make your instructions so that LLM not clash as much as possible.00:44:23
Daniel Saad
So, that's a really good approach, honestly, I think it's a really good idea, but I have a question. How can I do evaluation and prioritization? How can I see, okay, did the LLM follow this prioritization of the patterns?00:44:30
Hamel Husain
I wouldn't even do that. I would say, you need to… Pre-process your prompt.00:44:46
like, because, like, as a human, it's gonna struggle with the prioritization, too. If you gave me a long sheet of paper and say, like, hey, like, these are all these conflicting rules, when scenario A happens, do this rule, but then B happens, except for C, except for D, then, like, I'm gonna get it wrong. Like, even an intelligent being isn't gonna get it wrong. So I would say, like.00:44:52
To whatever extent you can…00:45:11
have a first stage as just, like… it sounds like it could be deterministic, like this prioritization, so we should just…00:45:13
We should do that outside the AI. Like, you know, do that in code, or do that somehow in, like…00:45:20
Compile, quote, compile the specification so it's clean.00:45:26
Shreya Shankar
Yeah, I would say, like, don't even think of this as being part of your pipeline. Like, try to do a first step of, like, can you even create a set of specifications that doesn't conflict with each other, or you can very well understand which ones conflict.00:45:31
And then you can think about, okay, how to do this compilation process at runtime, like, given a query… given some statement or writing.00:45:47
retrieve non-conflicting specifications. And then you can think about, okay, starting with synthetic data, doing error analysis, or whatever.00:45:57
Daniel Saad
Great, I will try it, and I'll come back next week and ask a follow-up question, how we go and see… Thanks a lot for the ideas, thanks a lot.00:46:08
Shreya Shankar
Yeah, of course.00:46:17
Yeah, sometimes the solution is don't even jump into error analysis yet. Like, read your data and clean your data.00:46:18
Daniel Saad
Thanks.00:46:28
Hamel Husain
Neil.00:46:28
Neel Bhat
How's it going, guys? I'm gonna echo everyone's kind sentiments here about the overall course. I've got two questions, they're kind of…00:46:31
completely unrelated, I'll try to get through them00:46:38
Quickly. One's for a personal project that I'm working on, and then the other one's, like, work-related. Personal project, I'll start with that one first, because I'm actually building something, and this course is a lot more relevant to that. It's actually a follow-up of, I think Shazzad asked this last week, related to kind of like a… so I'm building a little00:46:41
fun story generation app for my kids, very similar to Surjaz Idea, and a lot of things he called out last week,00:46:59
are exactly kind of the stuff I'm going through. It is a workflow of different things, generating the story. Off the story, I'm generating a, like, an image guide, which then00:47:07
I use to feed into generating the actual images for the pages to try and keep them consistent. And I also have an LLM generating off of the page text what the image prompt should be. So it's kind of a few different00:47:18
couple things are in sequence, and then I kind of parallelize out to just generate all the images, the page images. I have it, working well enough, I'd say it's like a V0, I've got some friends using it, some coworkers using it, and, like, roughly have a good idea now of, like, I haven't done, like, deep air analysis, but there's some very obvious things, as I've especially gone through and been kind of looking at00:47:30
some of the common, issues I've seen. What I'm having a hard time right now, or maybe it's… I don't know if it's lack of focus, or just not trying to figure out, like, where to go spend that time, is I can see at the end where one of my most common problems is, the image on the page just doesn't really match what the text says.00:47:49
Or it's not really a good, like, it's not the right, like, subject, or whatever it's trying to show. Or I kind of lose consistency across00:48:06
Like, you know, one of the scenes, a user is, you know, they're outside, or the, sorry, no, the characters are outside, the next scene, they're inside.00:48:15
there's a… I have hypotheses on where in the workflow some of that stuff might be failing, because I'm seeing the end result. I think, Sharia, you commented last time, just like, do the end-to-end, so I can see the end-to-end, I have all the different pieces laid out of, like, what all the prompts are.00:48:24
what I'm just kind of struggling on is, like, what is the pattern? Like, how do you guys approach, like, which area of the workflow to focus on? I know nothing is in a great spot, just because this is, like, V0, and I know I can make improvements on the image prompt, I can make improvements on every spot, but it's not obvious to me where to even start00:48:36
So more of just, like, a general question of, like, how do you guys approach that when you have multiple steps, multiple workflows, different models in use, and so on?00:48:54
Hamel Husain
Yeah, so I would, first, like, okay, I would collect that data set of these failure modes of, like, hey, the user queries, and just some examples for you to iterate on.00:49:04
And, you might know from error analysis, even loose error analysis, like, okay, which component00:49:14
is not good. Like, let's say the consistency. You mentioned consistency.00:49:21
So I would, you know, I would focus on one of these problems one at a time. So consistency is a problem.00:49:26
And I would just iterate and say, okay, like.00:49:32
can, you know, in your pipeline, is there enough information being, you know, in each of these different prompts to allow the model to be more consistent? How can I… you know, there's a lot of tricks for consistency. I'm not even the expert, because I don't…00:49:36
I just, you know, fiddle with the image models in the hobby sense.00:49:52
More so. So, You know, I would take one problem at a time.00:49:56
And do that, and just iterate a bit, before getting into, like, evals, per se, like…00:50:03
you know, like, formal evals, I would… because, like, you need to stabilize your system a little bit more and understand, like, okay, like, how do you even do it?00:50:09
But…00:50:17
Yeah, I mean, having the dataset that you collect is a form of evals, like, you're iterating on it, like, yourself, and you're, like, saying, okay, and you kind of, like,00:50:21
Yeah, like, stratify the dataset in a way, like, according to these different problems that you see.00:50:31
You know, but you can just focus on one to begin with.00:50:38
Neel Bhat
Yep, yep. And then, with that one, so, like, right now, the problem I have been focusing on is just, can I get the page, the image, to look… to at least somewhat match what I think is on the, on the page text?00:50:40
and it's, to me, it's either a… it's either just the image generation, which I see Igor here, yeah, I've now noticed, like, image generation is just hard, it's not gonna be perfect, which, right now, I'm seeing that. But it's also, I don't know yet if, like, is my prompt not good enough? Is the… the LLM that's taking…00:50:55
what it should be, what, like, the subject of the page, and trying to turn it into what the prompt should be, is that not good enough?00:51:10
is it just a… kind of that same process you're saying of just, like, pick one of those, try and make improvements until I get to a spot where I feel like, okay, maybe that's… that's good enough, it's specified well enough, and then pick the next piece of the workflow? Still focusing on the same problem, which would be just try and get the images to align better with what's on the page text.00:51:18
Hamel Husain
Yeah, I mean, it's always hard to know whether the L, your prompt is bad, or the model is bad.00:51:40
Neel Bhat
Yeah, that's true.00:51:47
Hamel Husain
I would say, like.00:51:48
you know, make sure you're using, like, the best model you can, first of all. That's a good way to try to minimize the model as bad.00:51:52
And it's worth studying lots of prompts, like image prompts,00:52:00
You know, especially… image prompts are, like, interesting, and different, and weird.00:52:06
Neel Bhat
Very different, yeah.00:52:10
Hamel Husain
You know, they sometimes, like.00:52:10
have a lot of language, or, like, from, like, photography, and…00:52:13
Other terms of art that normal people don't know about that are important.00:52:18
So it's hard to say without seeing your prompt, but I would say,00:52:23
Yeah, it's worth trying as hard as you can.00:52:29
You can also…00:52:32
Yeah, you could try to come up with ways to, like, if you want your image to have a certain style.00:52:35
Or… follow some… You know, template.00:52:43
You know, providing another image.00:52:47
like, some reference images and things like that. You know, these are all these ideas. It's hard to really know, it requires, like, some iteration.00:52:49
Neel Bhat
Yep, cool.00:52:56
Perfect, and that's kind of the path I've been going down, is just lots of trying in the fleeting hours I get when the kids are in bed. I'll quickly switch to the real life, which is work. It's a different, completely different nature of question, because it's less about building what I'm focusing a lot of time… my time here at work right now is.00:52:59
helping kind of adopt a lot of these practices and just general AI adoption across our company. And where I'm finding I'm spending a lot of my time right now helping different teams.00:53:14
So I'm working on getting, like, a little bit of a set of, like, AI champions kind of built up, but two very specific use cases, working with our CS team to help them get, like, a CS chatbot running, either through Zendesk or whatever other tools we're looking at, as well as a couple internal use cases where I was kind of working through, like, oh, let's just build a custom GPT, because that's probably the easiest, most accessible thing.00:53:25
And where I'm starting to try help, especially with stuff on this course, is like, while it's easy to build these, and we can probably get something going up pretty quickly, how do you know it's good and trustworthy? And trying to help others do a very lightweight version of this process, because I, as much as I'd love to have everybody here at Tavala take your guys' course, that's unrealistic, so I'm just curious.00:53:46
how do you guys do… are there resources, things that you guys have found helpful, where it's a little bit more of a lightweight version? What I've started with is, like, let's just set up a customer GT, and what I'm showing with, like, our CS chat, I was like, I just manually put in, in this Google spreadsheet, here's a bunch of other queries.00:54:06
That a user has asked, ran it through manually, and kind of showing people, like, you can then take basic notes on what you thought was good or not.00:54:22
But I'm just curious on, like, other, other things that you've found that have worked well.00:54:30
Before I just keep going down the path of, like, everyone just pull up their teeth and…00:54:35
Hamel Husain
Yeah, so the lightweight version of this whole thing is it just do error analysis?00:54:39
That's gonna get you 75% or more of all the value. You know… Even though it sounds silly.00:54:45
Neel Bhat
Yeah, no.00:54:54
Hamel Husain
It really will, and so I would just start there, and just do that.00:54:55
Neel Bhat
Cool.00:55:02
Love it!00:55:03
Thank you.00:55:04
Hamel Husain
Yeah, yeah, thank you.00:55:05
Abhishek.00:55:06
Abhishek Panda
Hi, Amel. Hi, Shreen.00:55:09
So, my question is with respect to automated evaluators. So, first thing is, I mean.00:55:11
The, automated device, is it dependent on my… the N number of Excel coding, the N number of failure categories that I am…00:55:17
Again, you know, getting through all the clusters that we have made via Excel coding, right? So, is it directly dependent on that, or it not need to be? Or, yes, I mean, for, you know, let's say a couple of, element SHH evaluators, it is dependent on that.00:55:25
And can I, also build some… evaluators.00:55:43
which is not actually defined on my failure… I mean, on the Excel coding. Let's say, in the analysis, we didn't define it, or… or it needs to be interrelated.00:55:49
Shreya Shankar
Oh.00:56:02
Hamel Husain
No, go ahead, go ahead.00:56:03
Shreya Shankar
Okay, so I would say the number of automated evaluators is, like, the same order of magnitude as your axial codes. And there isn't really a one-to-one mapping, because sometimes you will look at your axial code, realize, oh, this is a specification failure.00:56:05
I'm gonna go and improve my prompt, and then, like, I'm, like, pretty sure that's gonna fix the problem, and I'm…00:56:21
why then waste your time running an automated evaluator? And then, sometimes you'll realize, oh, I'm going to have some of these axial codes00:56:27
be codebase evaluators, I'm gonna have some of them be LLM judge evaluators. Maybe I'll actually split up one axial code into two00:56:37
a different LLM judge, because actually my axial code was too high a level, and maybe I need to be a little bit more specific about them. So we'll kind of go through this process, so that's why I say same order of magnitude, not necessarily one-to-one. Then your second00:56:47
question around maybe should you have more LLMS judge or automated evaluators? Yeah, you can have as many as you want, but I think if you were to prioritize, but you only have finite resources, focus on the things that you know are failing, and then suddenly, if you have a lot of extra time.00:57:05
And you have a lot of motivation to, like, go align more LLM judges, then yeah, I mean, measure literally whatever you want, and the methodology works regardless of whether it's a failure mode or not. Just quantifies prevalence of some feature or concept in your choices.00:57:23
Abhishek Panda
Got it, got it.00:57:40
One more question, Shreya and Hammer. So, I'm yet to complete this week's lecture, and of course, the CSE that we will have in our upcoming classes. So…00:57:42
I mean, our product is in this way, like, you interact with an agent, and based on your requirement, right, for… I mean, there are many services, and based on the content that you put, right, okay, this is what I want, then some other agents get called, and the job gets done.00:57:53
Okay. Now, I mean, my vice president directly asked me this question, that, he obviously got part of that course, so tell me that can we build… I mean, my product manager is also part of this course, he unable to join this particular class. He wanted to ask this question, that can we build a framework? I remember, Hamil, in the first class, you mentioned that, people sometimes think whatever they learned here, they can just automate the entire thing.00:58:10
But I thought I'll ask this question a bit later. I have some maturity now, and I understand, but I'm just thinking, or could you… some ideas, because if we start building separate evaluation for each of the agents, like, I mean, we have so many agents, so just thinking, I mean, what could be the…00:58:34
I mean, optimized way, because these agents are built by software engineers. They are not ML engineers, so they don't really understand a lot of other things. Let's say, for example, semantic analysis, or let's say image is getting generated, like, what measurement shall we look upon? So, there are many such agents, okay, like website builder, website designer, logos, and so…00:58:53
I mean, we are thinking, like, can we build… of course, error analysis, yes, that is manual-intensive, the first step.00:59:16
But some way, can we design a framework that will actually help these guys, like a placeholder, and they can do it. I hope I did put my question.00:59:21
Hamel Husain
Yeah, yeah, I mean, so, you can definitely have software and build tools to help you in this whole process, you just have to be really careful. My advice for you is…00:59:32
It's a process for you to discover00:59:44
I would say, you know, make sure you feel comfortable, and you've kind of mastered this whole eval process before you do that, number one. Number two, if you do decide to build tools, do an A-B test.00:59:47
And make sure, like, do it manually, or do it the way that we're teaching you, and then use your tools, and then reflect on, like, what you missed.00:59:58
and do that several times, it'll give you a lot of insight into, okay, like, what you can automate. Because what you feel like you can automate.01:00:07
And what you can automate, there can be a gulf there. And you just want to make sure that you understand what that is.01:00:16
Abhishek Panda
I, I, I understood. I understood that well. Maybe when I complete other lectures and get into Big Four, maybe I'll get some more knowledge, and I'll again come back to this question.01:00:22
No problem.01:00:36
Shreya Shankar
You're doing the right thing, don't automate too early.01:00:36
So…01:00:39
Abhishek Panda
Got it. How did you say? Yeah, maybe on the skin, yes.01:00:42
Hamel Husain
Dimitri.01:00:47
Dmitry Buykin
Hi, hello. I'm working on an interesting problem, regarding the, customer support, processing, you know, thousands of emails per day. Actually, it's even more, it's, yeah.01:00:49
around, 50,000 mLs per day, and a high variation of these cases.01:01:03
It was really kind of good to use this, axle coding and error analysis.01:01:11
So I identified the 10 primary kind of criterias for each, for the males, and,01:01:18
it looks fine in general, but when… I found that when starting updating the prompts for agents.01:01:26
And each prompt is around 5-6 pages, it's kind of converted, SOPs,01:01:32
Into the… this, prompt… wrong form.01:01:38
And I found it's kind of starting collapsing, because while I'm fixing one place, it's starting failing other criterias when I'm trying to run.01:01:43
evaluations.01:01:51
And my question, how to make systematically adapt the prompts to specific criterias?01:01:53
Because I, yeah, I found that I'm starting looping on the specific, prompts, and I cannot,01:02:00
Achieve the good quality enough for such, such cases.01:02:07
Yeah, this is kind of one issue here. Another issue is high variation of different cases, and,01:02:12
when I'm starting, kind of, changing in one place, let's say one agent, you're starting stealing jobs from another…01:02:19
Agents.01:02:27
And,01:02:28
Maybe generic kind of advice how to tackle this, high variation, high complexity of, various, high variability of cases.01:02:30
So, and maybe you have some experience helping you with such progress.01:02:38
Thanks.01:02:45
Hamel Husain
Yeah, so if you have, like, really large prompts, like you said, mentioned, 8 pages.01:02:48
long. It's quite long. And you know, if it's kind of, like, dense specification, it can be hard to tune it.01:02:55
One thing that I would want to dig into is, like, okay, can you… can you separate these prompts? Like, you know, a lot of times people with 8-page prompts, they basically have a lot of if-then statements, like, if this happens, do this. If this happens, in this use case, do this.01:03:05
In that case, you may want to have a… consider a router of some kind.01:03:19
Dmitry Buykin
Yeah, yeah, I have router.01:03:24
Hamel Husain
Okay.01:03:26
Dmitry Buykin
It's kind of… but it's mostly routing the specific case, and each,01:03:26
prompt, I'm not specified as a single huge prompt, it's more… it's kind of a combination01:03:32
Represent those chain of thought.01:03:39
Of, several prompts, mapped to the pedantic, models.01:03:41
And then, LLM is when executing and feeding the specific fields in the prompt. Each field has a kind of micro-prompt, which is guiding the model how to fill this.01:03:47
Properties extract from the… Aboriginal male-specific fields, or,01:03:59
Make some validation, or request,01:04:05
all to the external system. It's kind of aggregated. There's a huge checklist, actually.01:04:08
of, different, thanks.01:04:13
But, yeah, it's, it's more like code, not like a real, real prompt.01:04:18
But, again, I found this really challenging to… Iterate on these things.01:04:24
Mainly because it's really hard to trace all the specific change in the one area is starting, kind of.01:04:31
Changing overall quality of response.01:04:39
It's kind of starting collapsing, or…01:04:44
Well, when I'm trying to implement the fully… because of complexity of the real Real-world use cases.01:04:47
Hamel Husain
Is it… do you feel that your prompts… so, it's hard for me to understand everything, but,01:04:56
It sounds like you're assembling prompts, and kind of, like, it's a cumulative thing, where you have different components you're trying to fill out, and…01:05:04
You know, you have these, like, different, you said, micro-prompts.01:05:12
Dmitry Buykin
Yeah, it's a kind of set of different checklists, and each prompt is an independent checklist, it's just validating some… it's a validation checklist, then…01:05:16
Ensuring the specific, Data available, then ensuring the…01:05:26
specific requirements, like, can we answer this email or not? Also, debilitating the01:05:33
Who is the customer? What is his level on other things?01:05:40
And then, kind of, when it's generating also, it's also kind of, yeah.01:05:45
making the first attempt, then checking with another checklist, and so on. So it's kind of a multi-step,01:05:49
processes.01:05:55
Hamel Husain
Yeah, there's… so there's two things that I would…01:05:55
look into right away. One is.01:05:58
if you have a series of steps, is the prompt that you're supplying to each step, is it properly scoped? Have you scoped it down enough? Is the context scoped down enough? That's one thing I would check. Second thing is, if it's a chain of causal events.01:06:01
I would try to narrow down, like, which are the most upstream things that are failing here, and focus on that first.01:06:15
And not, you know…01:06:23
not try to focus on the entire breadth of everything, just focus on, like, what is failing first, and tackle that one at a time, rather than… because…01:06:25
the errors will cascade. You know, if… especially if you're passing context along to, you know.01:06:34
subsequent steps, which I feel like you might be doing, based on your explanation.01:06:42
And really, like, figure out, okay, where…01:06:48
which parts are failing, and just, like, yeah, one at a time. And you might want to even,01:06:52
Might be a good… Shit, like…01:06:59
point of, like, this is where you do evals, automated evals. Like, if you have, like, a specific component or a specific form you're trying to fill out.01:07:02
In this causal chain.01:07:09
Okay, like, it's a good… it sounds like the system is complex enough where you need to go to automated evals soon.01:07:11
Dmitry Buykin
Yeah, it's, representing Rio, actually. It's, Rio customer support, and they have, the specific specs for each type of,01:07:21
requests?01:07:30
And then I'm just trying to automate this part.01:07:31
And…01:07:35
Yeah, okay, I got this idea that we're kind of trying to decompose the things. Yeah, I'm trying to do it, and01:07:36
It's more or less stable, it's just, I mean, maybe fading up on the complexity. Maybe I need to just decompose to the smaller sub-agents.01:07:43
What is the…01:07:52
And… but I have another question, actually, related to how to create these simulations on scale, because,01:07:55
I have a lot of unstructured emails.01:08:03
Like, question emails and answer emails,01:08:07
Which generated by human support?01:08:11
And, I'm trying to extract this, synthetic examples.01:08:14
What… how the good answer should look like for specific type of questions.01:08:20
And maybe there are some good ideas how to implement this. I tried this approach with the01:08:25
Error analysis and, trying to…01:08:31
uses axial coding, and then based on this, specific kind of accesses, generate the good, balanced synthetic examples.01:08:34
But they look really… artificial, and… because sometimes the…01:08:45
Real emails can be one sentence long.01:08:51
And when it's generated by the machine, it's usually much more wordy and not related.01:08:55
to the real…01:09:03
kind of examples when there's a lot of… when text available, from the external systems, and that's why the01:09:05
Any syntactic example is just,01:09:14
If it's representing in text, it's, not good enough, because it's, Overly complex and overly detailed.01:09:18
Related to real responses.01:09:26
Hamel Husain
Shrey, you have any comments? I see the wheels turning.01:09:32
Dmitry Buykin
Thanks, Chris.01:09:35
Hamel Husain
Excellent.01:09:35
picture of that.01:09:36
Shreya Shankar
No, I think a recurring theme here is, like, how do you not get overwhelmed by a deluge of complexity? And I feel like the answer to that question for me is just keep doing it, keep doing the axial coding.01:09:39
like, keep doing error analysis, and then find ways to try to automate parts of it, and build up the parts that you automate. So, maybe, like, if you want to go back and apply those axial codes to other traces, you could have an LLM call01:09:52
apply those. I think Landon mentioned in the chat, like, the categorization trick we talked about. I don't know what exactly that is, but it might be, like, the Google spreadsheet that we use to try to apply each axial code to other traces, or back to our open codes and stuff.01:10:09
And, you know, like, just continuing to do it. I don't have a great…01:10:26
answer other than, you know, keep doing it, stick with it, and then once you do axial codes, once you have a starting point, you can maybe use that, complement with LLMs, apply it to your more and more traces, see if that helps you make sense of more and more complexity.01:10:31
But yeah.01:10:51
Hamel Husain
Yeah, so one thing I'll say is…01:10:53
There's different kinds of errors you'll have when you do your axial codes, like, one theme of errors you'll have is, like.01:10:56
The email doesn't sound right. It sounds artificial, or it sounds whatever.01:11:03
Writing is an extremely hard problem. I talk with Treya about this all the time. Like, if you try to, like, have a LM write a blog post, any long-form communication.01:11:07
I've never been able to prompt engineer a…01:11:18
AI to write exactly how I want. Always have to, like, delete stuff. I always have to, like, edit.01:11:21
Dmitry Buykin
stuff.01:11:27
Hamel Husain
And there's no… I've never been able to, no matter how hard I try.01:11:27
The reason I mention that, there's different… not… when you have your axial codes, like, you'll have some things that…01:11:31
Or maybe lower-hanging fruits, and some things that…01:11:38
Are, you know, going to be harder.01:11:42
You know, that's gonna be more in the generalization of golf.01:11:47
And you might want to prioritize the ones that are easier before you… and you can, like.01:11:51
You know, kind of narrow down, like, fix all the other things.01:11:55
And kind of use that… a little bit of that… some judgment of, okay, what things you intuitively have a better idea of how to fix.01:11:59
And yeah, like, generating synthetic emails might be hard. To actually sound like humans.01:12:11
Dmitry Buykin
Yeah.01:12:19
Yeah, and another issue, actually, that I cannot rely on human experts because, we have, kind of…01:12:21
highly, really kind of different type of,01:12:27
customer, agents, which is, some of them it's good, some of them not, and I found that even01:12:35
Human, answers sometimes not… addressing the…01:12:42
Issue of the customer, and it's, causing a lot of this,01:12:47
Question-answer, question-answer kind of chains, between the… with support.01:12:54
Where it's kind of refined over, let's say.01:12:59
5, 10 emails. What is it required for the customer?01:13:01
And… this is a really big issue, because even when I tried to define the01:13:07
Good instructions, collect, kind of, pull these instructions from the human experts.01:13:14
Then, day covering may be the most common Whoa.01:13:20
Five cases, which is, covering around 40% of all,01:13:25
Things, but the variation's really high, and even…01:13:31
Within single, let's say, payment processing.01:13:34
question sits around, 100 different categories.01:13:37
Amsterd.01:13:41
So I'm feeling that even with the 10 categories inside the Excel, Oh, coding…01:13:42
You know, tenure error, kind of,01:13:50
categories, I cannot cover all possible variations, so it's just exploding.01:13:53
Yeah.01:14:00
Hamel Husain
Yeah, I mean, you have to prioritize, you all have a limited budget of evals, and you have to, like, pick ones.01:14:04
And you… that's just the…01:14:09
Dmitry Buykin
Yeah, that's just the reality.01:14:12
Hamel Husain
It's true, like, yeah, yeah, because it's expensive.01:14:15
It's not… the evals are not free.01:14:18
Dmitry Buykin
Yeah, and maybe there's options to kind of automate the creation of Evolve somehow.01:14:23
not as LLM, as a judge,01:14:30
Creating a prompt for it, but that also kind of…01:14:33
Also generate some evals based on some methodology.01:14:37
Yeah, I'm thinking about this, how to actually compress this, available data into the more… Digestible form.01:14:44
Like, rules, or something, similar to this.01:14:53
Hamel Husain
Yeah, it's a good problem. I think about this problem as well.01:14:59
Dmitry Buykin
So I'm with you. Okay.01:15:02
Okay, anyway, thank you.01:15:07
Hamel Husain
Thank you.01:15:09
Shreya Shankar
Anyone else? I think Vignish. Oh, okay. But I also have to run exactly at 9.45, so let's see if…01:15:14
We can answer the question in 3 minutes.01:15:23
Vignesh Iyer
Yeah, try to be quick. Yeah, consider, like, a no-code platform, right? Similar to Neil Butt's question earlier. So, consider that we're building this for, very, non-technical users, and we want to have evals into the platform.01:15:26
What would you suggest would be, the best way to… to get, you know, some UI going that allows you to plug into these evals? Meaning, could I, should I…01:15:42
make sure that I have a way for batch executions to run where people provide, like, a dataset, maybe…01:15:56
from some, queries that they have, right, to do maybe error analysis, and maybe there's, like, a chat window that's open where you design your flows, and whenever you kind of test iteratively, right, like, while developing, that also kind of sends, traces.01:16:04
And that also can be used for error analysis. So, I need to, again, think about how I'm gonna bake this in to the users, like, that you need to do error analysis and go through this process. And I'm also thinking about the flows they would follow in terms of01:16:20
would it only be the online traces, that get logged? Would I allow them for, like, a batch of data sets to kind of run the flow on? And whether it also happens, like, during dev?01:16:36
If you've got any thoughts?01:16:49
Hamel Husain
So, go ahead, yeah, sorry.01:16:57
Shreya Shankar
Well, I think I may be confused, so you should go.01:16:59
Hamel Husain
Okay, so, my mental model of what you're saying is, like, okay, you're building lovable.01:17:03
And you're like, okay, how do you do error analysis on Lovable?01:17:09
So if I was, like…01:17:13
you know, with the… you know, I'm not working at Lovable, so, take a grain of salt, but I can imagine, if I was working at Lovable, I would start with…01:17:15
okay, how do I do error analysis intelligently? And it would be…01:17:25
Okay, what are some ways I can smartly sample the data that tells me something has failed?01:17:29
Or the user is not successful. And one kind of simple thing would be, like, hey, was the user able to publish their application?01:17:36
You know, because there is a user journey in Lovable of, like, deploying.01:17:44
Shreya Shankar
Your web application.01:17:49
Hamel Husain
And for things that… Okay, so I would, like, I would start doing data analysis.01:17:51
and say, okay, and I would study…01:17:57
what kind of statistics are associated with people who are deployed successfully versus not deployed successfully. I would look at, like, what tech stack are they trying to use? How many turns in the conversation were there? I would, like, try to come up with a lot of hypotheses, like, kind of like a data…01:17:59
data thinking.01:18:19
And try to come up with, like, a lot of hypotheses of, like, what leads to failure modes.01:18:21
As well. So I would, like, you know, look at the user signals to say, like, okay, what are…01:18:25
high likelihood failures. Then I would, like, try to learn more from a data perspective about those failures, and then I would start doing error analysis on those.01:18:32
Specifically, to get intuition on, okay, like, what are… Where is…01:18:41
lovable failing? Like, you know, is it some, you know, is it not…01:18:47
adhering to user instructions? Is it, you know,01:18:53
you know, producing code that isn't syntactically correct? Is it just not good with Python?01:18:57
you know, whatever, is it not good with, like, new frameworks? You know, whatever it might be. And I would start to maybe…01:19:02
Get my hands around it.01:19:10
you could probably get really far in something… so, like, coding agents are a little bit different. So, if you talk about, like, highly verifiable domain, like, code… coding…01:19:12
Shreya Shankar
You can…01:19:23
Hamel Husain
You know, you can iterate with some of these metrics a lot more.01:19:25
You can come up with, like, intermediate metrics that don't involve you know.01:19:30
like, the same kind of LLM judge automated evals. Like, you can… it's basically code-based evals of various kinds.01:19:36
To iterate against, like.01:19:43
you know, is the code syntactically valid? Like, does it have, like, these things that you think it should have? So on and so forth.01:19:46
That's how I would, like, start to go about it. I don't know what… I couldn't understand what the batch thing meant exactly, but…01:19:55
Maybe I misunderstood.01:20:03
Vignesh Iyer
Yeah, in this case, it's actually… think of it, like, as, like, a building a flow. You're kind of building the LLM pipeline as, like, a flow of, like, dropping nodes onto a canvas, and sort of, building LLM applications, no code.01:20:04
So it's not like you're generating code, you're building the LLM application, and I want users to be able to do eval.01:20:21
on whatever they've built, and they're non-technical. So, when I meant batch, meaning provide them a way to upload a dataset, and then kind of run, that flow that they've created on that dataset that produces traces, assuming that I need to bake in them that01:20:29
these are some synthetic queries or things that you may have to generate, for instance, right, to do it, trying to bake in that process. So I was wondering if I should only have, like, a chat window around my flow where people iterate, or also allow uploading, you know, datasets, allowing that process for something no-code.01:20:49
Hamel Husain
It's like, it just reminds me, for some reason, of,01:21:10
Eval gen, in a way?01:21:14
Which is a topic that Shreya will cover in a…01:21:18
Shreya Shankar
Yeah, next week, I'll talk about it.01:21:23
Vignesh Iyer
Don't care.01:21:25
Shreya Shankar
Yeah, and you can read Chapter 10 in the reader if you want to learn more about it.01:21:26
But yeah, I think it's going back to what Hamel says, like, you have the right idea, you know, upload a batch of data, and then maybe you can guide people through creating evaluators for it. I think there's no plug-and-play tool, like, you should probably design the thing yourself.01:21:32
And you feel free to read the paper, Who Validates the Validators? Look at other tools for inspiration, and then do some lightweight version, maybe, of error analysis or assisted error analysis in your product for your end users.01:21:48
Vignesh Iyer
Sure.01:22:03
Thank you.01:22:04
Hamel Husain
Alright, thanks everybody for coming. See you at the next office hours.01:22:13
Katya May
Thank you.01:22:17
Shreya Shankar
See you tonight. Yeah, bye.01:22:18
Live session where instructors will address questions. Instructors may present answers to common questions, followed by live Q&A
[
Home
](/parlance-labs/evals/2025-3/home)[
Community
](/parlance-labs/evals/2025-3)