OCT 23 Optional: Live Office Hours 8 THU 10/239:00 PM—10:00 PM (GMT+5:30) OPTIONAL Recording
Notes
Recording
Optional: Live Office Hours 8
Oct 23, 20259:00 PM - 10:00 PM GMT+5:30
Audio Transcript
Chat Messages
Hamel Husain
How's it going?00:00:23
Robert Lavigne
Doing well, you? Sorry, I'm just working on a few things.00:00:25
Hamel Husain
I'm good, yeah.00:00:28
Robert Lavigne
the,00:00:32
Seeing no one's here, I'll bring you up to speed. So I did a massive data pull from, like, from 2023 through 2024, and obviously 2025 is not complete, but pretty much the whole year is there.00:00:34
2024 is where I use this system more than any other system, and it has their logs in the back end, and you would see the user activity and whatnot. But the API gives you the full raw JSON, or the raw JSON, and it had a lot more detail00:00:45
And it turned out, hey, this is…00:01:03
the trace, you know what I mean? You know, so what was I seeing before? I was seeing a chat log, whereas now, through the API, I'm seeing a very structured, you know, where was the source, was it pulled from the LLM, or was it pulled from the vector, and what aspect? Like, all of that detail is there, so data that I had never seen.00:01:05
Or care to look at, until now.00:01:24
I've now got two and a half years worth, and, like, one of the agents in 2024 did, like, 10 megs worth of output. Like, what's that math? 10 million characters? Is that right? You know, so a significant amount of data that I can go through now.00:01:27
and revisit, and it occurred to me that I can now run scripts against that JSON and extract it into user-assistant pairing, because a lot of it was long-form conversation.00:01:44
And now I've taken, like, say, 20 conversations and converted it to 2,000 pairings, which is perfect for fine-tuning.00:01:54
you know, vector database, you know, all this other stuff that I could do.00:02:03
Hamel Husain
analysis.00:02:08
Robert Lavigne
error analysis and so forth, right? But before we get too serious with the stuff, I was going through last night's traces, and I can confirm that when someone says commit suicide as a thing, my AI did the right job in saying, seek support. So, from an eval point of view, that trace worked, but it also…00:02:09
Hamel Husain
Building a mental health thing.00:02:28
Robert Lavigne
No, that was just someone rage-quitting on my app. Thought they'd be smart. It's like when I put up,00:02:29
a thing for my city. Literally, the only queries were, where are the… where's the meth and where are the hookers? You know what I mean? So, going back to the earlier conversation, when you build these public-facing apps, you gotta account for the asshole factor.00:02:36
So, good thing in…00:02:51
Hamel Husain
What is your app new? Again?00:02:52
Robert Lavigne
It was just a story, so the idea being,00:02:53
Hamel Husain
Oh, the story thing, okay, go.00:02:56
Robert Lavigne
Yeah, the story thing, and I guess someone decided that they would be…00:02:57
crass, I guess is the word I would use, but I don't care. What I… I looked at that from a whole different perspective, whereas the old me would have looked at that and go, hey, someone rage quit and was a dick. Whereas now, I'm like, did the LLM respond the way that I would want them to respond to that, as opposed to saying, here are the steps to do what you just asked to do, right?00:03:02
So purely from a…00:03:21
a sampling point of view, it's a sample that I would not normally have looked at with the same mindset.00:03:23
That I would now, whereas before I would have said, oh, you know, flag that as an open court of rage quit. Whereas now, it's more of a flag as a, did… did the system…00:03:30
stop harm, basically, right? So even on these gain little things, you gotta account for the human element, I guess.00:03:40
And I guess maybe the opening question there is, do you factor that as in your system prompt, or do you just part and parcel that as something that is a guardrail, or do you just…00:03:48
You know, accept the fact that 20% of the population's just going to be nasty with your, online presence, for lack of a better term.00:03:58
So, I'll start there.00:04:06
Hamel Husain
Cool, I'm glad you got the data, yeah.00:04:09
Robert Lavigne
Yeah, oh, the data's amazing, I'm gonna be shifting through it for weeks.00:04:10
Hamel Husain
Okay, we're at the start of the office hours, we can…00:04:15
We can go ahead and, kick it off.00:04:19
We'll do the same thing as before. You can raise your hand if anyone has a question, and we can kind of step through it.00:04:23
Don't be shy.00:04:31
See, there's some people already here.00:04:33
So feel free to…00:04:37
Robert Lavigne
I might as well start, just as a follow-up to what I just said. So the official question would be, when you're going through your traces, and you're finding content that is not necessarily in alignment00:04:41
with the desired effects. Either people are, trying to jailbreak it, or basically red-teaming it, or just being whatever. How does that play into your sample rate? Do you just flag those as things that the prompt should negate, guardrails to be implemented, or just put it down to a percentage of the population?00:04:52
Hamel Husain
Yeah, I mean, you should definitely do your same open coding, whatever, and00:05:14
You know, like, eventually you might want to sample in areas where you think.00:05:21
maybe more problematic. It's always a judgment call of any kind of failure mode that you see. It's like, do you want to sample more? Do you have enough? Do you have an idea of how this failure occurs, and all the different ways it could occur? And so you would sample more, just like anything else. One thing I will say about00:05:27
Jailbreaking, or… You know, people… Kind of saying.00:05:44
trying to get you… get the AI to, say, commit suicide or whatever.00:05:51
Robert Lavigne
Or get you in trouble legally with, hey, this AI chatbot ordered 50,000, you know, hot dogs for me.00:05:55
Hamel Husain
Yeah, yeah, okay, so, like, you have to be pretty… like, people are, for some reason, obsessed with, like.00:06:02
Like, this, like, jailbreaking?00:06:09
In a lot, and like, you have to be really honest, like, does that jailbreak really matter or not?00:06:12
Sometimes it's just a toy, jailbreak, like…00:06:18
okay, I got the AI to talk back to me and tell me that I'm stupid. Well, really, I mean, like, if your user is wanting that.00:06:22
then…00:06:32
That's not your, you know, maybe it's not your problem. It's just a toy, like, they're just trying to do that, like, maybe it's not an issue for your product.00:06:33
So I wouldn't get obsessed over that. I would definitely focus on, is your product working first, before getting obsessed about jailbreaks. I'm just mentioning that because of people…00:06:40
Have some paranoia about that.00:06:52
Robert Lavigne
Yeah.00:06:54
Hamel Husain
For some reason.00:06:54
Robert Lavigne
Well, I'll give you a follow-up example. One,00:06:54
it came, I think, from Pakistan or someplace like that, so clearly, you know, a bot farm or whatnot, but the prompt was, tell me about your tools and their parameters and so forth. So it was basically trying to have the system prompt expose its API, fundamentally.00:06:58
You know, so small little things like that are nice to see in a roundabout way when you're going through your traces, because how does the system natively respond to that? Because we've got smarter LLMs now that are handling some of this stuff without our system prompts needing to be there, right? Whereas the older models, I think, would have failed on that one.00:07:13
Hamel Husain
Yeah. Yeah, so you just have to be… you have to, like, prioritize, like, how important is that to you, given all your other errors and things like that.00:07:33
Robert Lavigne
Thank you!00:07:42
Hamel Husain
Yeah.00:07:43
Any questions? No questions today? Everyone's slow… everyone's pretty shy.00:07:51
Man, they should, like, call on people randomly.00:07:56
Although I don't want to do that. That, gives me PTSD. I was in an environment where people got called on randomly.00:07:59
Robert Lavigne
Call on Chris Iverson, he wants to ask a question.00:08:07
Hamel Husain
Oh, okay, we got some takers, Samuel.00:08:12
Samuel Thomas Elliott
Yeah, I'm curious what your thoughts are on modeling out, like, visually, more of the complex workflows of LLM calls within apps, especially when, like, routers are involved, so that you get the PMs aligned with the engineers, aligned with domain experts, of understanding, kind of, what is actually happening that's leading to the output.00:08:15
beyond going and looking at a trace within something like Langsbeth.00:08:38
Hamel Husain
Yeah, so, there's some really good analytical techniques that we have found helpful, so, there's… that's coming up in the lectures this next week, but one that especially comes to mind is this, transition failure matrix. It's a way you can visualize,00:08:44
You know, if you have, like, multi-turn conversations or highly agentic workflows with lots of tool calls and loops and things, it can help you make sense of patterns.00:09:04
Of where things are failing, what handoff points are brittle.00:09:16
Like, what specific tools… Are tending to fail, and where.00:09:22
It's a nice way to kind of… See, like, some hot spots.00:09:27
So that's the one that I really like.00:09:33
Samuel Thomas Elliott
So, that makes sense from an eval standpoint. What do you think about this in, like, the design stage? When we're modifying a system that's working to add additional capabilities? Like, is it worth trying to keep up a complex diagram, that's showing…00:09:35
The if-thens that have to do with, oh, we're inserting this system prompt now with this user prompt, because we recognize the intent of this user prompt is…00:09:51
X, therefore we're going to route it to a Claude model versus a perplexity model.00:10:01
Hamel Husain
I think the main utility of those diagrams is to help you make sure that you understand your system.00:10:08
So, like… If it helps you, then… sure, do it.00:10:14
Samuel Thomas Elliott
Because it's not… it's like any… it's like taking notes, right? It's like… Yeah.00:10:19
Hamel Husain
This, like, oftentimes the notes are not helpful, it's the process of, like, writing down the notes that cement it into your brain.00:10:23
And so, like, it's kind of like that, I would say, for this.00:10:30
Samuel Thomas Elliott
Thank you.00:10:36
Hamel Husain
Yeah.00:10:37
Mubin.00:10:39
Mubeen
Hi, how are you?00:10:44
Hamel Husain
Pretty good, how you doing?00:10:46
Mubeen
All good, all good. So I have, like, a couple of questions. One is, in terms of,00:10:48
doing error analysis on RAG systems, the recommendation that you guys had was…00:10:56
you can generate them… you can generate prompts synthetically, essentially reverse engineers from the source or the chunks, and then say, okay, given this chunk, what would be a reasonable query that would, request this, and then see if actually that chunk is being retrieved.00:11:04
Mike, but what was interesting is you said you want to separate the retrieval evaluation from the, you know, overall evaluation, or the overall eval. Or did I catch that incorrectly?00:11:23
And in that separation, would you change anything in the process of evaluation? In the sense that, would you change the open coding, axial coding structure for retrieval, or is it exactly the same process, just to run on specifically the retrieval part?00:11:39
With that pass-fail… Structure, or whatever.00:11:55
Hamel Husain
Okay, so… What you should do is, like… okay, forget about retrieval for a second, you should do your…00:12:00
Error analysis. And if your error analysis uncovers that retrieval is a problem, then we can dig into retrieval.00:12:08
Specifically, and say, okay, retrieval is a sub-problem that… Is different.00:12:17
You know, and then you want to do your evaluation on retrieval. That becomes a search evaluation.00:12:25
like, it's not… it's not so much, like, an AI…00:12:32
kind of general purpose evaluation. It's a very specific, like, information retrieval one.00:12:36
And so that's when you would start thinking about building a retrieval dataset using these metrics, like.00:12:41
You know, precision, recall.00:12:48
Mubeen
Hmm.00:12:50
Yeah.00:12:50
Hamel Husain
things like that. So it's like, that's kind of the hierarchy.00:12:51
Mubeen
Okay, and one final follow-up. The example that I've been using was the compliance chatbot. Sorry if it's a tired one, but the problem that I saw there was, even though the chunks retrieved were correct.00:12:56
For example, let's take a very concrete example of a question of base capital requirements. How much money does a company need to have in its bank account to be, you know, safe and to be above board?00:13:12
So the chunk said 500K,00:13:27
the model said… the response said 50K. But now I'm confused slightly, because, you know, in the prompt, we have mentioned everything, do not hallucinate, do not generate values from yourself. Would that… where would you classify that from the gulfs? Is it more of a specification or a generalization issue?00:13:30
The implication being that if it's a generalization issue, we look at more use cases and see how often this occurs, and if we need to change something in the underlying model.00:13:48
Hamel Husain
I mean, from the… okay, it's like, academically, you can argue that what… from what you said is a generalization issue, but I would suspect there might be a specification issue, in the sense you could…00:14:00
Prompt the model, maybe in a way that says, like, hey, you know, produce citations, for example, which has the effect of getting it to think.00:14:16
And… Attend more to the facts.00:14:27
Which tends to help align, like, have it00:14:31
have greater coherence with the things that have been retrieved. And that may not work, but it's something to try. It's like, it is worth thinking about, is there any greater amount of specification that you can…00:14:35
provide.00:14:48
But if you try… if you, you know, and there's no, like, magic tricks, like, you just kind of have to try things and iterate. And if you find that, despite that, the model still isn't doing what you want, then it might be a generalization issue.00:14:49
Mubeen
Okay.00:15:08
Okay, fair enough. We'll look into the prompt. I think that's the most important part of it. As in, if we can improve it through specification, generalization seems like a bigger, much broader issue.00:15:09
Thank you.00:15:24
Hamel Husain
Yeah, no problem.00:15:25
srp
Okay, I think it's my turn next. So, I will try to articulate my question as clearly as possible. If it doesn't come through in the first one, I will repeat.00:15:29
So, we are working on a product.00:15:39
That, let me give it a little bit of color. So, it is to help two partners to work together.00:15:43
and develop and bring a product to the market. So, there are certain general principles, let us say, depending on whether you are working on a hardware box, like a Wi-Fi router kind of a product versus a software product, whether it is meant for enterprise.00:15:50
Business process, or for security, kind of, system-level work and all of it.00:16:07
Now… The…00:16:12
AI-based AI native product that we are working is essentially to guide the salespeople on how the incentive structure works for certain types of products.00:16:16
Okay? Now, when you do this, I want in the product, the rules of engagement that two partners have agreed to is codified in a document, and that is part of the repository.00:16:28
That will be used to answer the questions. But I don't want an information leak to happen, functionally, where the work… rules of engagement between, let us say, a Cisco and Accenture versus a Cisco and, let us say, Cognizant, it should not leak into each other.00:16:42
Right? So how do I, using evals,00:17:05
Test for it, for this information leak.00:17:09
I don't know whether I was able to articulate the00:17:13
So, it is more of a information security kind of thing, but I am coming from a product management perspective. How do I describe this, and how do I ensure that me and my team test for this?00:17:20
Hamel Husain
I feel like Shreya's wheels are turning, so maybe I'll…00:17:39
Shreya Shankar
Yeah, I can… I can try to… so sometimes you do run into this case, like, exactly what you're saying, where you have a hypothesis for failure mode that's gonna occur in your application, and you want to measure that, you know, like.00:17:42
even if you didn't find that by way of error analysis, or you would need to do this for, like, security purposes or something, I would then…00:17:56
imagine that I have gotten to…00:18:05
I've, like, done error analysis, I imagine that this leakage is an axial code in some way, and now I'm going to try to trade an LLM judge to classify this in my traces. The challenge you're going to run into here is that you might not actually have example traces.00:18:09
that exhibit this leakage failure mode. So you might actually have to go in.00:18:24
look at your existing traces and figure out how to change it in a way that actually realistically exhibits the failure mode, so then you can have some sort of balanced data set of failure mode… failures and non-failures, so you can train an LLM judge.00:18:30
I would go the approach of trying to do some LLM judge to detect this failure mode, and then see how well that actually works.00:18:48
And if that doesn't work very well, then, you know, consider other examples, or,00:18:55
yeah, I don't know, like, deploy to just a small fraction of people.00:19:02
Yeah.00:19:06
srp
Yeah, the point, Rhea, is I don't know how to frame it, the prompts for the LLM judge.00:19:08
to detect information leakage, because it's, yeah, I'm not able to articulate it very clearly.00:19:16
Shreya Shankar
I think that's okay. Use AI tools00:19:23
as much to your advantage as possible. Like, sometimes I feel like, oh, this is very complex. I'm gonna start out by maybe, like, using ChatGPT voice mode to try to describe it, see if an AI can, like, give a more pithy summary of what I've said.00:19:27
sometimes I, like, try to come up with examples, and then use that, really, to do the bulk of the work in the prompt, and then, you know, see as I iterate on the LLM judge prompt.00:19:42
maybe I'll be able to refine that a little bit more. I think it's okay to just start somewhere, like, pretty bad, and then try to use the tools that you have available, and use example demonstrations to communicate your point.00:19:52
srp
Okay. And just one sub… point to this, are there any…00:20:04
quote-unquote, security check kind of tools available, anything that, is there? I'm not… I'm not up to date on this, side at all.00:20:09
Shreya Shankar
Totally will.00:20:20
Hamel Husain
Go ahead. Oh, I do know that, like, there are Azure guardrails to…00:20:21
Shreya Shankar
that claim… To do things that may…00:20:25
match to what you're doing, but I don't want to say that they are.00:20:30
srp
Somebody who's not used those.00:20:34
Hamel Husain
I mean, so, yeah, like, the guardrails and stuff like that, what is a guardrail? Essentially, it's an LM judge, or it's a classifier of some kind. There's nothing magic in there, like…00:20:38
you… It's important to know… Well, first of all, it's important to know there is no 100% jailbreak-proof thing.00:20:50
So when you're talking about this specific situation, it's good to keep in mind, you can only mitigate, you cannot guarantee.00:21:00
srp
Hmm.00:21:08
Hamel Husain
And so, with that in mind, you have to…00:21:09
You know, like, red teaming of some kind may be good.00:21:14
srp
You just have to keep in mind that some motivated person…00:21:18
Hamel Husain
Let's say… especially if they're really…00:21:21
You know, if they're, like, a researcher who knows how to jailbreak, or, like, does jailbreaking…00:21:26
All day long, for fun.00:21:31
It might be hard to compete against that, but, you know, it's like… it's like any other security practice, there's no…00:21:34
like, bulletproof thing. It's just, like, how many… how much resources are you willing to spend?00:21:43
Because it's, like, infinite security is infinitely expensive.00:21:50
srp
distributed.00:21:55
Hamel Husain
Yeah.00:21:56
srp
Okay, good. I was wondering if there are any, tools that I'm not aware of at all, so… okay, but basic principles have to be used, use LLM as a jet, that's the basic,00:21:58
approach.00:22:14
Hamel Husain
Yeah, I mean, it's good to know, like, tools… There's no magic…00:22:15
tool that can do something beyond What you already know.00:22:20
I mean, maybe they've, like, worked really hard on the prompt, maybe they have a classifier, maybe help you arrive there, but it's not…00:22:27
It's not giving you something you don't know about.00:22:35
srp
Understood. Yeah. Okay. Thank you. That helps.00:22:39
Shreya Shankar
I also generally just never trust an off-the-shelf tool for such, like, a complex failure mode.00:22:48
Like, as a rule of thumb.00:22:53
I don't know why.00:22:56
Hamel Husain
Oh, yeah, I 100% agree with you.00:22:58
If you see a tool like a guardrail, It kind of…00:23:01
Takes advantage of people in a lot of situations that don't know any better.00:23:07
Because it's like, oh, like… There's a guardrail… It feels safe.00:23:13
And this is, like, some magic AI stuff.00:23:18
And this person has a guardrail, let me just use it.00:23:21
But now you know.00:23:26
And so, you know, you don't have to fall prey to that. And just be skeptical. I mean, it's okay to use tools, just be super skeptical, like, what is it doing? What is it doing that I'm not doing?00:23:28
If you look at… There's this blog post I'll share in the chat.00:23:37
And it's… there's, like, some popular software that has guardrails in it, and if you look behind the curtain of what it is, it's like a prompt.00:23:46
So, I'll share the blog post in the… in the chat, and you can check it out, and they can…00:23:56
Then share my skepticism going forward.00:24:04
Any questions here?00:24:21
Shreya Shankar
What are you for me.00:24:26
Mirko
I will throw in a question, given that otherwise it gets a bit awkward, I assume. Okay, mine is a pretty generic one. So, Hamil, I think you mentioned last time that in order to prototype00:24:28
AI, you use super notebooks and stuff like that. So I'm not a developer, I'm a technical product manager that knows some Python coding, so I'm using notebooks myself.00:24:39
Buds… as prototyping goes along, I feel sometimes it gets a bit challenging to…00:24:50
Keep track of system prompt iterations.00:24:58
So, I mean, you keep changing things here, changing things, I don't know, on a couple of places.00:25:02
I'm curious how you keep track of what you change, and how big of… like, how big those changes are when you go from iteration to iteration.00:25:09
Hamel Husain
Yeah. So I like to… Check those, prompts into…00:25:18
like, Git, and, version them alongside the code.00:25:25
Mirko
You might be interested in watching the recorded session with Isaac yesterday.00:25:28
Hamel Husain
Where we go over the homework and do it with AI. In one… in particular, what he does is he refactors the prompt out of the code into a Markdown file.00:25:35
srp
Right.00:25:46
Hamel Husain
And so that you can iterate on it. And then the markdown file gets read into… the code.00:25:46
Mirko
Got it. Okay, you basically… every single change would be…00:25:52
a different mark time file, essentially, right? So you basically.00:25:57
Hamel Husain
Oh, no, you don't need a different Markdown file, you can just… unless you want different prompts, like, at the same time, but if you're truly changing your prompt, you can just version it.00:26:00
I see.00:26:10
Mirko
Makes sense, makes sense. Okay, cool. Thanks.00:26:12
Hamel Husain
Okay, Mubin is asking a question in the chat, since she says her audio is not,00:26:22
Oh, no, sorry. Okay, well, Mubin did ask a question in the chat.00:26:30
Do you want… Mubin, do you want to answer… ask it live? You can do so if you want. You don't have to be shy, otherwise…00:26:37
Mubeen
No, it was a side quest, so that's why I was like, maybe it's not relevant. But the question that I had was, you've been pretty skeptical about tools00:26:44
Along, you know, traces and optimizations, and now even now, right, you're, like, your argument is, It's… it's…00:26:55
Unlikely that something…00:27:07
a tool is actually better than your own judgment on solving a specific problem to your own company. The question is, do we ever see tools getting good enough that they can be general-purpose solutions for these things?00:27:09
Or are LLMs just a different…00:27:25
are just nature differently, and therefore, it's… it's going to be high… it's either highly unlikely, or it's… it takes… it takes more time than we originally envisioned for it to happen.00:27:28
Shreya Shankar
That's a good question.00:27:42
I can try to answer it. So, I think there's two points that are important in this question. One is, as the developer, do I have a good mental model of how this tool behaves?00:27:45
So, I think it's okay sometimes for tools to be non-deterministic. For example, a random number generator.00:27:57
I just have a very good mental model of how a random number generator works. I know it's going to be random, I know the uniform random distribution, like, it makes sense to me, I know when to use it, so forth. The inherent part of the non-determinism is not bad, it's just, like, when I don't really have a good mental model of the behavior of the tool, which00:28:03
for some of these LLM judge or off-the-shelf tools, like, I really have no idea how it works.00:28:23
So that's one point. The other point is, even if I use them and then, like, go into all this painstaking detail to develop a good intuition of how it works.00:28:29
Sometimes, when it's wrong, I have no way to fix it. They don't expose the prompt to me, they don't expose to me what they're checking, what few-shot examples they're using, whether they're trading their own judge or whatnot, so there's just no way for me to, like, steer it to whatever I need in order for my application to be successful.00:28:40
Now, I would gladly use a tool if one of these two points changes, so somehow I'm able to understand the behavior of that tool better.00:28:58
Or… I actually have the ability to change or probe those internals, ideally both.00:29:07
But I just feel like none of the off-the-shelf tools have showed me possibility of making progress in either one.00:29:15
Hamel Husain
I could share my screen real quick and show something, around these lines that Shreya's saying.00:29:26
So, this is the blog post I mentioned earlier. I don't know if it's showing up correctly. Did you see it?00:29:31
Shreya Shankar
I see it.00:29:38
Hamel Husain
So, this is, you know, apologize for the title, but it's fun. So, this is a blog post.00:29:40
That kind of investigates different tools.00:29:47
Let's go… and so basically, let's talk about Guardrails. So Guardrails is like a… kind of a somewhat popular tool. It's like a Python project.00:29:52
And this is not some kind of criticism of guardrails as the tool itself, but I just want to illustrate what's happening. So, a lot of these tools are opaque, right? Like, we have these tools, some of them are more opaque than others.00:30:00
And basically what I go through in this blog post is, like, okay, a lot of these tools are not transparent, like, so let me go back to the guardrails.00:30:13
So the documentation of, you know.00:30:21
this guardrails is like, hey.00:30:25
you specify some kind of, schema, and like, you know, you might have a prompt, what kind of pet should I get, and what should I name it? And then you have a guardrail.00:30:28
The guardrail… Ensures that the answer is…00:30:39
It, you know, adheres to this schema, in that it actually, like, gives you the name of a pet.00:30:45
Okay.00:30:50
So, if you don't know any better, you're like, okay, this is a guardrail. It will, you know, just by the name itself, you might feel, oh, like, this is a tool, it's gonna help me, it's gonna make sure that I don't go off the rails.00:30:51
like, because it's called guardrails, so you're like, oh, wow, it must be… must be… they must know something I don't, it's guardrails. That sounds really safe.00:31:08
So what this, what this blog post does is, like, I actually put a proxy, like, a man-in-the-middle proxy, which is basically, like, I… you know, some of these libraries don't have good…00:31:17
observability.00:31:29
And so I put a proxy in the middle, and I observe what is the traffic, being sent.00:31:32
you know, to the LLM. Like, I intercept it. And I… and then I, like, inspect that traffic, and I… and I say, like, what is actually happening?00:31:39
So if you look at the… if you look at this, if, like, this code, right, it'll say, okay, like, it'll give you, you know, this demo looks interesting. It's like, okay, we will give it the name of a pet.00:31:47
The name of your pet, and then, like, a dog, and it seems good. But what's actually happening here is actually a prompt.00:31:59
And so when you invoke this code, it's like giving you… it's like a prompt. Given below is an XML that describes the information to extract from this document.00:32:08
You know, and it has this, like, schema. It says, only return valid JSON object, where the key of the field is JSON, blah blah blah. Here are examples of simple XML, JSON, blah blah blah, right?00:32:17
And then it sends this to the LLM first. After the LLM gives the answer, it then sends another request to the LLM. I was given the following response.00:32:30
help me correct this, blah blah blah. So this is, like, it failed, and it kind of, you know, is doing this, right? So it's good to know what's happening. So what Shreya's saying is, like, there's no magic here, right? It's like, this is prompts. And so you have to… it's good to know that, like.00:32:43
there's no guarantee that someone can't jailbreak this. And it's also, is this better than your prompt? Maybe? But maybe not. This is generic. This, by definition, is boilerplate that has…00:33:02
you know, that's, like, trying to apply across all use cases and all things, like, if you injected your own…00:33:16
Domain knowledge in here.00:33:24
Could it be better? Yeah, maybe. And so it's good… it's just good to know, and so this is, like, inherent limitations, so… and you can see this with a lot of tools, you know, even if you do, like, prompt… use prompt optimization tools, you'll see, like.00:33:26
It's not really magic, it's just… a lot of times there's prompts. There's even one example from down here.00:33:41
of… You know, there is, like… this… sort of… LLM chain thing?00:33:49
Okay?00:33:59
That's supposed to help you brainstorm. And basically what it has is a, it… it has these, like.00:34:00
you know… Prompts, like, it forces, like, a chain of thought.00:34:09
And it's not really magic, but, you know, if you go behind the scenes, like, you'll see what's happening. So, just be skeptical, I guess, and know, okay, like, behind the scenes, what's happening, and…00:34:14
Because a lot of times, the failure is in this. Like, you use a tool like this, you know.00:34:26
If you use a tool like this, like, there'll be some…00:34:33
situation in which this doesn't actually make sense. Like, the prompt collides with00:34:36
your specific use case, and you don't know why. You're just like, it's giving me weird stuff sometimes. But if you could see the whole prompt, then… or you understood this, then you'll understand it doesn't make sense. So I hope that helps.00:34:42
Mubeen
Yep, I think it… it does.00:34:57
But what I'm getting here is…00:34:59
Well, the philosophical question was, do you think this gets solved at any point in time in the future?00:35:02
Hamel Husain
So the thing, like, this gets solved is an interesting question, like, Is it…00:35:14
Can we ever get to the point where I can write a prompt that will work00:35:21
for everybody, equally as well. I would say, probably not.00:35:26
I was gonna say, there's always a lot of alpha in, like…00:35:30
You thinking about your use case specifically for you.00:35:34
Mubeen
Understood.00:35:39
Hamel Husain
Meh?00:35:40
Mubeen
Thank you.00:35:41
Hamel Husain
Andreas.00:35:45
Andreas Edlund
Yes.00:35:47
Hello.00:35:49
Hello, how you doing? Sorry, my voice.00:35:50
Okay, I have a question. Is it possible to share a screen real quick? Can I do that?00:35:54
Hamel Husain
Yeah, I think so.00:36:01
Andreas Edlund
Okay.00:36:02
Then I tried to do it, I think that would be easiest, I think.00:36:02
So… Yeah.00:36:08
Can you see my screen?00:36:13
Hamel Husain
Yeah.00:36:14
Andreas Edlund
So I'm pretty new into the course, I love it. I'm starting to do open coding and such, and I have my side project that I created during summer when traveling around Europe with a family. I have two kids. This is also similar to another guy here. He's writing stories for his kids.00:36:17
But…00:36:36
But we did this during the vacation, so I'm trying to explain for my kids in a nice and fun way, starting to use ChatGPT and such, but you get the point. I kind of, okay, let's just package it, making it easier and faster. So…00:36:37
So I started to play around with that. I don't want to start… but you basically find a place, if you're in France, you see castles all over the place, and you just want to, okay, what is this place? You take a photo, and you get the information that you need, and adapt it for the kids in your age, etc. You can also create visuals and stuff. So my question is obviously here is about evils.00:36:52
So that I will show you the response, and I just want to ask you about what would be a good approach to actually start00:37:14
Oh, no.00:37:22
to start doing the real evals here. So if he was gonna go into this one,00:37:24
Here's… obviously, in the prompt, you have it here in the right, I have an identify point of interest agent, kind of finding the place and doing some shakes and stuff like that.00:37:30
And then it,00:37:41
goes on, and then creates… generate the story based on the input that the user gives. So I don't want to go through the whole thing here, but… but…00:37:43
here is what the output is. It's supposed to be an engaging story, of course. The kids really want to listen to it. I check in on my own kids, and he's like, all of a sudden, Dad, wasn't that the place that we were at? But yes, I won. So he kind of reflected to it, and it's supposed to be engaging, and of course, not be…00:37:53
Making up things.00:38:16
So here is my question. Those are the two, kind of, main things. It cannot lie, it cannot make up things, for sure, and it needs to be engaging for the kids.00:38:18
And then you have some other obvious topics and dimensions, call it, that you want to assure. And then also looking at the images, you can expand on that as well to make that even, you know, next is… I guess the sky is the limit here, but…00:38:30
But what is your take on this? How would you approach00:38:46
should I start reading all the stories? Ask historians, is this the truth? Like, I don't know this stuff. I can just…00:38:50
I can just trust ChatGBT for now, and do some random checks, and here and there, check with my close things close by here, you know, in Sweden, where I am, I can check these things that I kind of know, but still.00:38:59
Question. How would you go about this?00:39:13
Hamel Husain
Yeah.00:39:18
Do you,00:39:19
Do you ever… it sounds like you're building this for yourself first, which is great. That's, like, a really good pattern.00:39:21
Andreas Edlund
my kids.00:39:26
Hamel Husain
Do you ever read a story to them and you don't like it?00:39:27
You're dignified.00:39:33
Something is missing in the story, or you wish it was a little bit different?00:39:34
Andreas Edlund
A little bit so-and-so. What I said, Ruben, that little kid, he liked it. He's like, yeah, it's interesting. He wanted to listen to it when he's going to bed.00:39:39
But, I haven't done too much of that work, to be honest.00:39:50
Hamel Husain
Yeah, I mean, what I would do is, like…00:39:58
you know, iterated on it with the audience you have. It's great, you know? Like, kids and stuff, and I would put a feedback mechanism in this…00:40:02
Application directly, and say, like, allow you to, you know, like, rate… write comments.00:40:12
That you can collect, so in addition to the, you know, you don't necessarily, you know, thumbs up, thumbs down, it's fine, but, like, I would… I would just do your open coding in here.00:40:21
Andreas Edlund
Oh…00:40:29
Hamel Husain
So that, like, you can, you know, when you're reading it to your kids or whatever, they'll, like, make a comment of some kind, like, hey, dad, like…00:40:29
this sucks, because of XYZ. I mean, they're not gonna say it like that.00:40:37
Andreas Edlund
That's not only me, then, that could be my friends that can also… Yeah, yeah. You need to write your… okay.00:40:41
Hamel Husain
And I would just… because, you know, the nature of it, Is, like, very,00:40:47
It's amenable to capturing that feedback, you know?00:40:55
And I feel like, I don't know, especially if it's you, you just make your open coding easier.00:40:59
This way?00:41:05
Because it's just easier, yeah, it's easier, plus, like, this is, like, this is your custom data annotation interface, like, you just need to add, you know, because you, like, and if you're looking at Brain Trust, you might not be able to, like, view it. It's easier to view it here.00:41:08
In a way.00:41:22
You can always go to BrainTrust to debug it, but, like, that's what… that's, like, a first step, I would say.00:41:25
Andreas Edlund
Okay, good. Thank you.00:41:30
Appreciate it.00:41:32
I think that that's… I probably have more questions, but can do that in another… another session.00:41:34
Hamel Husain
Anyone else?00:41:53
Gillian Langor
Didn't raise my hand.00:42:00
so, the…00:42:01
I've admittedly not had enough time to dig in as much on this topic, but I'm like.00:42:06
I'm wanting to kind of get a better mental model around00:42:13
what systems you can put in place? Oh, did I skip David? I see David has his hand.00:42:18
David Gonzalez
No, I just… I just put my hand up.00:42:22
Gillian Langor
Okay. False negatives, so I'm… I have, like, a PDF00:42:24
PDF tool that's pulling out specific language. And, like, big aha for me is, oh, we gotta evaluate retrieval, like, of course.00:42:28
A big thing that I wanna…00:42:39
be able to evaluate and report on is the rate of false negatives. Like, we fail to identify information in a document that was there. We fail to retrieve it and appropriately classify it.00:42:44
And you know, as I'm thinking about all these, like.00:42:57
You know, more traditional ways of, like, looking for fuzzy matching, and…00:43:01
You know, looking for keyword searches, like, really kind of specific ways to, like, hone in on the text, like, the human heuristic ways that you would go do that.00:43:08
I'm like, why wouldn't I just incorporate those in the prompt in the first place, and in the system in the first place? Like, I'd love your kind of high-level thoughts on how to think about the problem of evaluating false negatives in a world where you're trying to minimize the human-in-the-loop kind of oversight.00:43:17
Shreya Shankar
Yeah, I could take this. I work a lot on document and long text processing. So…00:43:36
I think your problem of trying to measure your false negative rate, you'll never be able to do that perfectly, because00:43:42
That's, like, you don't know, unless some… you have some human-labeled ground truth. Somebody went there, found all possible extractions. You'll never be able to compute what the ground truth recall of your system00:43:50
is. You'll only be able to tell maybe method A or prompt A worked better than prompt B because it had00:44:05
Higher recall than the other, but you won't know if it's, like, 100%.00:44:12
So I think one thing that we had to realize is, like, we'll just… we'll never be at that 100%, and we're just always going to try methods to try to just increase the recall. The other thing is that it turns out that there are actually ways to systematically decompose your task on an automated way, without relying on a human.00:44:18
And at a high level, what you want to think about is, like, splitting your documents into smaller chunks. So, for example, if you have a 100-page PDF, maybe 1-page chunks.00:44:37
And then running your prompt on each page, or each chunk, and then seeing if you add up all the extractions together.00:44:49
do you actually end up getting more recall than if I try to put all 100 pages in one LLM?00:44:59
Gillian Langor
Cool.00:45:06
Shreya Shankar
So this kind of data decomposition, you actually can do in an automated way, because you can try various different, like, chunk sizes or page sizes, and use your same prompt00:45:07
And then… compute… how much… the LLM outputs, and then see what chunk size gives you the most00:45:20
recall. So that's kind of what I would recommend, and there's, like, all sorts of ways you can get even fancier, in these kinds of task decompositions. Like, if your prompt is trying to extract 10 types of different things, maybe instead of doing all of it in one LLM call.00:45:29
you could have a separate LLF call for each00:45:47
type of thing you want extracted.00:45:50
But, yeah, I'm happy to chat.00:45:54
Gillian Langor
Kind of what we're doing today, but we're not sort of, like.00:45:55
evalu… like, systematizing and, like, iterating on those variables, which I think is an opportunity for optimization, of course. One quick follow-up on, like, chunking PDFs in particular. Yeah. Is it sensible to chunk things by sentence? Is that… imagine, like, a long PDF. Is that, like, overkill?00:45:58
Shreya Shankar
It is overkill. I find that the smallest chunking unit that I've ever used is one page, and, like, this is using modern models like GPT-40 Mini. I think, like, we've gotten to a point where we can reliably process a page, and actually, you do want to have longer chunks, because sometimes you can't make sense of a chunk in isolation, or you need other context from the doc in order to process it.00:46:20
Yeah, so I would say, like, you know, do one page minimum chunks, and then I like to vary it from, you know, one page.00:46:46
To some number of pages, that's a fraction of the model's advertised maximum context length.00:46:53
So… so it's like, it could be, like, 10% of the context length, 25%, 50%, 75%, 100%. Never do I ever find maximum recall in the 100%.00:47:02
But, you know, in one of those, I'll get the maximum recall, and, like, that's a great.00:47:14
Gillian Langor
There's one kind of idea that we have now, which is, you know, these documents have, like, sections and natural, you know, areas in the document, but actually, like, parsing the document and correctly chunking it by section is a problem in and of itself that you then want to evaluate, and it's like, is that worth the effort? But the semantic meaning in the document, it does align to sections, so you can00:47:18
kind of… shortcut the problem a little bit by looking at my section, which is, again.00:47:43
Shreya Shankar
Yes.00:47:48
Gillian Langor
recall problem, like, is that worth it, in your experience, to add that layer of knowledge?00:47:48
Shreya Shankar
I think it's overkill. I think start with segmenting it into chunks of various sizes, see how well your system does. If you find that chunks are hard to make sense of in isolation, often I find that simply including the previous chunk as well as the current chunk in the prompt is enough content.00:47:56
Gillian Langor
context.00:48:14
If that's not…00:48:15
Shreya Shankar
Sometimes, like, just including the first chunk as context for all of the chunks, because it might have, like, a table of contents or something.00:48:16
Yeah, I know Hamel posted a bunch of stuff I didn't read, but I can send you a paper that I wrote that kind of talks about various different configurations for… we talk about, like, how to systematically and automatically decompose LLM operations on long documents, and then some operators that we had to come up with.00:48:26
in order to enable these automatic rewrites, so I will send you that paper.00:48:45
It is technical, so, you know, I don't know… I don't want to, like, make people…00:48:50
super overwhelmed with things, but yeah, I'm happy to answer any question in this one.00:48:58
Gillian Langor
Thank you, appreciate it.00:49:03
Hamel Husain
Yeah, what I posted in the Zoom is just that, actually, Shreya is one of the world's experts, foremost experts, on documented processing with LLMs. She's, done, like, lots of research in this area.00:49:06
I actually go to her whenever I have, like, a hairy issue on this axis.00:49:17
Just because she studied it more than anyone else I know. She has a lot of papers on her website, you should definitely check it out. And then also, in the upcoming week, she will talk a little bit about her work as well. She'll, show people Doc ETL,00:49:26
And this other thing called Raggy, which is really interesting. It's, like, interactive way of exploring things like chunk sizes and stuff.00:49:38
So definitely keep an eye out for that.00:49:47
Shreya Shankar
No, thanks.00:49:51
I love talking about data processing.00:49:53
Turns out we have all of the evals problems that everyone else has.00:49:58
In addition to data problems.00:50:02
Hamel Husain
David Gonzalez.00:50:08
David Gonzalez
Alright, hi. Long-time listener, first-time caller.00:50:10
Okay. I thought that joke would be funnier.00:50:13
Hamel Husain
No, it's funny, it's funny. I just, like, yeah.00:50:16
David Gonzalez
Okay, it's more… this is more of an architecture question. First of all, when you're starting the process for, these evaluations, how do you…00:50:19
what… What… what kind of, what, what kind of models are you thinking of? What…00:50:28
What LLM… what kind of LLM models are you thinking about as, like, the first model to implement when you're doing your evaluations?00:50:32
Because obviously, like, you know, it makes sense once you have an initial evaluation and standard set up, it makes sense to optimize the evaluation, the underlying model. But when you're first starting the process, what kind of models are you thinking about?00:50:40
In terms of setting it up.00:50:55
Shreya Shankar
I like to use, like, GPT-5.00:50:58
Honestly.00:51:03
And if I'm able to… because usually when I'm doing… this is LLM judge-powered evals, I imagine. Usually when I'm doing that, I find that I just, like, don't have very good alignment off the bat.00:51:04
I've never had good alignment off the bat, so, like, I want everything to be set up for as much success and alignment as possible. Once I'm able to align my LLM judge, then I think about, okay, like, maybe this prompt is good, I'm gonna use a cheaper LLM, or I'm gonna do other cost optimization strategies, but…00:51:16
Yeah, that's just my take, like, start powerful.00:51:34
And then go cheaper once that's accurate enough.00:51:39
David Gonzalez
Okay, so for evaluations, a more logical model, that makes sense. What about when you're… let's say you're an AI product manager, and you're starting a new project from scratch? Like, what model do you select?00:51:42
Initially.00:51:54
Shreya Shankar
For the actual, like, application?00:51:57
David Gonzalez
Yeah, let's say something as simple as, like, a chatbot to something as more complicated as, like, data processing.00:52:00
Or what the.00:52:05
Shreya Shankar
Yeah,00:52:05
again, it depends on the application. If I need to be able to use a lot of tools, I've had pretty decent experience with GPT-5, so I might start with something like that. Like, I know it has the ability to, quote-unquote, think, I know that it can choose between different tools and generate tool parameters, so I'm happy with it.00:52:07
If I absolutely need long contacts, then I…00:52:23
well, okay, caveat is I have OpenAI credits as part of my research, those. On one project, I have Gemini credits, but not always.00:52:27
So I will try to use the GPT-4.1 series, because that's the longest context window. And I… for data processing tests, I always do, like, long context first.00:52:37
David Gonzalez
Okay.00:52:50
Shreya Shankar
And then…00:52:51
David Gonzalez
So there… there's a lot of talk now of, like, small language models, which have smaller parameters.00:52:52
what… conceptually, I'm thinking, like, when you're architecting a new product, like, when do you… when would you use an SLM versus an LLM?00:53:00
Shreya Shankar
I… again, like, I try to follow the start with something more capable, and then change it to SLMs once I know that the large language model is accurate, and I want to reduce cost.00:53:10
This is because if you start… in my experience, when you start out with a small language model, you run into a lot of gulf of generalization issues.00:53:23
That are hard to pinpoint as generalization issues, because the model's just not capable.00:53:32
And then you're just spending your time trying to, like, architect your entire pipeline around that.00:53:37
When that wouldn't be an issue if you were just using a more powerful LLM. Hallucination is a great example of this. Like, I feel like until they had the 4.1 series, GPTs did not figure out… were not, like, materially better at hallucination.00:53:43
So there are failure modes like this that, you know, you don't want to spend all of your time trying to solve, and you want to see if the best model can do it out of the box for you, and then do cost optimization, which we also have a chapter on, well, I think the last chapter in the reader.00:54:00
David Gonzalez
unless you hadn't.00:54:15
Shreya Shankar
good accuracy.00:54:16
David Gonzalez
So, in general, you want to choose a model that's as logical and as many parameters as possible with as large context model as possible that you can afford and you can implement.00:54:17
Okay.00:54:28
Shreya Shankar
Yeah, I mean, if you don't have long…00:54:29
choices or long context, you don't need to do 4.1, it's just if you do need to have a lot of context, then…00:54:31
I would say 4.1.00:54:38
David Gonzalez
Gotcha, it's very application-specific.00:54:40
Shreya Shankar
Yes, yeah.00:54:42
David Gonzalez
And then the final question is, in terms of, like, determining an underlying database for RAG,00:54:43
Like, what are your thought processes? So…00:54:49
when I was doing research independently, I was thinking of, was it a vector… I don't know, vector database, but then in one of the previous office hours, you said to…00:54:52
To not choose a vector database, so…00:55:04
when you're thinking about the architecture of an AI product.00:55:06
what, what criteria are you looking for for an underlying database for RAG?00:55:10
Shreya Shankar
I know you'll take this.00:55:20
Hamel Husain
Yeah, so, like,00:55:21
What I usually do is… okay, like, what we're trying to say in the other office hours is don't get overly caught up on00:55:23
Tools?00:55:31
being like…00:55:33
And the… what we're trying to say is, like, if your rag is not working, the last thing you should do is try to swap your vector database.00:55:34
Because a lot of people focus on tools, and really it's the process that tends to matter.00:55:42
What you choose doesn't matter so much,00:55:48
But what… how I end up making that decision has to do a lot with the skills and the existing technology stack that someone has. So, like, if someone is already using Postgres, then we, you know, like, using PGVector is great, because it's, like, already there.00:55:52
Basically, I try to, like, introduce the least amount of complexity to the situation.00:56:09
And that's the overarching principle of, like, all of this. Like, that actually, like.00:56:15
is behind what Treya's saying, too, about starting with the bigger model.00:56:20
before you go to, like, small language model. It's all about, making it… making the whole thing as simple as possible.00:56:25
Like, starting with approaches that, you know, reduce complexity.00:56:32
And so… Yeah, it's like, okay, if you, like.00:56:38
You know, if you're starting from scratch, and you have a clean slate, and you can, like, you don't… you have skills and, you know, whatever.00:56:43
You know, like, for example, I'm a Python developer,00:56:50
And I'm comfortable with, like, open source, and I have, like, a lot of friends who use LanceDB, so I use LanceDB, because I know that I can raise my hand and get lots of support with LanceDB from many different places. So, you know, I really like that.00:56:56
So it's kind of like…00:57:15
Yeah, it's like the technology stack, where I think I can get the best support, who do I like? But ultimately, it's like…00:57:18
that's where the decision comes from. It's not like, okay, we have…00:57:25
one vector database that we feel is always best. And the same thing goes with eval tools.00:57:30
David Gonzalez
Hmm, okay. So, yeah, it's kind of like the question, like, if someone asks you what programming language to use, it's like, use the one that solves the problem that you know.00:57:36
Hamel Husain
Yeah.00:57:45
David Gonzalez
Cool. Alright, thank you.00:57:46
Hamel Husain
Yeah, no problem.00:57:49
Utku?00:57:51
Utku Boran Torun
Hi, I'm currently focused on evaluating LLMs on software engineering tasks, and00:57:54
Currently, we try to evaluate our automated… automated call review bots.00:58:04
We ran this both in company, like, 5 months, I suppose.00:58:11
We have, annotations of developers, as, like.00:58:19
Useful comment and, and not useful comment.00:58:26
But we want to make this evaluation, in… in more automated way, because, with the…00:58:31
human annotation. We, highly…00:58:41
depend on, human labor. So, like, we want to automate this evolution, like, we have, some ideas, but what would you…00:58:46
digest.00:59:00
For this task.00:59:02
Hamel Husain
Okay, I'm trying… like, I may not understand the question completely, but I'll try to answer, so…00:59:07
Okay, if you're building a coding agent, always keep in mind that, like, first, make sure you're doing the error analysis. It sounds like some developer is leaving comments, that's great. That's a form of potential open coding, but you need to…00:59:15
understand. You might need to do a little bit of your own open coding on top of their open coding, or refine their open coding. You need to get, you know, do the error analysis process. Coding agents have some unique qualities, like, one is it's a highly verifiable domain, a lot of times.00:59:29
And so, you want to think really hard, like, is there a way that you can take advantage of more code checks, or static analysis?00:59:46
Utku Boran Torun
Yes.00:59:54
Hamel Husain
Or anything?00:59:55
Utku Boran Torun
Sorry, I need to interrupt. It is not… it's not code engagement. It is… it is… Code review bot.00:59:56
Hamel Husain
Okay, code review bot. Okay, so it's, like, related, so it's like, okay, even if it's code review, is there any kind of…01:00:05
Maybe it's not verifiable, but it's probably something's verifiable, I don't know.01:00:13
So, you know, like, think about that.01:00:18
Carefully, like, try to be as creative as possible, see how you can create lower cost, you know…01:00:22
Things, like lower-cost, evals.01:00:30
Those are the main things I can think of, in terms of things to keep in mind.01:00:33
Treya might have…01:00:41
Shreya Shankar
No, sorry.01:00:46
Utku Boran Torun
One idea that we have is,01:00:50
like, maybe we can fine-tune our LLM with the data we have, like, the annotated data.01:00:53
And then… use it, LLM as a judge, like… Would it be something like…01:01:03
Recursive, or, like, is it… is it… Reasonable way to do that.01:01:10
Hamel Husain
So I would not do any fine-tuning unless you have good evals, because then how do you know your fine-tuning is working?01:01:19
Utku Boran Torun
So, just be, you know…01:01:28
Hamel Husain
Just to be blunt, no. But…01:01:31
Okay. I'm just responding to what you're saying.01:01:34
Utku Boran Torun
Okay, yeah.01:01:37
Yeah, okay, thank you.01:01:40
Hamel Husain
Yeah, no problem.01:01:42
Anuba.01:01:46
Anubha Saxena
My question is more with the intention of learning from your experiences. So I'm setting up this evals process in my company, and what I've found is that evals, like, the process that we discuss is the most exciting part of it.01:01:48
Creating the dataset and…01:02:04
like, the company that I work in, we are very heavy on visuals, so we… like, the output of the pipeline is usually, like, a DSL, and it needs, like, good visualizations for humans to process it.01:02:08
So we are kind of, like, stuck mostly on that, like, creating the perfect, like, not even perfect, like, data set in the right format or structure, and then re-presenting it01:02:20
To something that the humans can process.01:02:33
have you faced similar, kind of, problems when you're, like, interacting with other organizations or something? Because I feel like it's, like, really annoying how blocking it is, because I cannot get to evals if I do not have, like, a dataset and the visuals for it.01:02:36
Hamel Husain
Can you tell me a little bit more what you mean by DSL, and a little bit…01:02:56
More details, if possible.01:03:00
Anubha Saxena
So, for example, we have,01:03:03
diagram generation, so we use different shapes, and these shapes are represented in complex JSONs, so…01:03:05
this JSON would not only just have, like, the type or the text, it would also have maybe the color, the size, the position, and a number of other things. And for huge, JSONs, it could be, like, really long, and it is, like, not humanly possible to process it.01:03:13
But then we… this is, like, one of the use cases. We have, like, a lot of other formats as well.01:03:32
and then sort of scaling the solution to different formats, because every format is different, so every visualization will be different. It is kind of very…01:03:37
blocking and exhausting. And most of the times, people just try to jump into evals, but they cannot really do proper evals, because you need to have… then everything is, like, manual, and people are just evaluating on 20 things that they can manually test, which isn't really good.01:03:49
So I was wondering if other organizations have the same problem.01:04:07
Hamel Husain
Yeah, is your application deployed to users already? Are users using it, or is it more in the prototype stage, or…01:04:12
Anubha Saxena
It is already deployed, it is very much used, but the problem is, so we can run the evals… this is one of the things I'm currently exploring, like, using our product for the evals itself.01:04:20
But another problem is we also traced some user data, and the user data cannot be, like.01:04:33
cannot be, like, moved back to another account just for visualizing it, right? So it needs to be anonymized first, and it needs to be in a platform which is, like.01:04:41
used for storage, like Databricks.01:04:52
We are also using BrainTrust, by the way, but we cannot, like, take the user data out of, like, those platforms into our product again.01:04:54
Hamel Husain
Yeah, I mean, so with this highly visual stuff.01:05:06
It's gonna be really important for you to build your own data annotation interface.01:05:10
And you might be able to…01:05:15
bootstrap your own product, because it's already doing the visualization.01:05:17
To give it, like, an admin view or something, where you can…01:05:23
Like, you know, someone with administrative access or the right01:05:27
permissions can review past interactions, so you don't have to rebuild the whole visualization infrastructure. You can somehow use those components. And then what you need is, like, an additional way to make comments, or do open coding, at least. You can, like, start with the open coding.01:05:31
And… You want to capture those? You can allow your users to even write comments, and, you know.01:05:50
chart… start to see the process. You know, you're gonna have to maybe write your own open codes on top of that, too, but it is, like, if you can somehow bake it into the product.01:06:00
That would be really useful. And then, like, as far as making it…01:06:11
More exciting, or more… Having… giving it a higher likelihood of this whole thing01:06:16
Getting funded, or your colleagues, or the organization.01:06:22
What we found is, like, if you…01:06:28
Kind of… when you're doing your error analysis,01:06:32
you don't even need to… you probably shouldn't even use the word evals. You should kind of come to every meeting with, like, hey, here's, like, the 5 issues we found.01:06:37
And, you know,01:06:45
And you will find that there's some low-hanging fruits, and then you will say, like, you can say, hey, we fixed these, like, 3 issues from last time, these are some other issues, and then people will start to get pretty intrigued.01:06:48
And say, oh, like, why… how is Anuba com… like… Finding all this stuff.01:07:02
And then… then you can… start talking about. You can answer the question.01:07:08
And then that's how you can…01:07:13
Make evals… give it legs, let's say.01:07:17
Anubha Saxena
Yeah. I,01:07:20
I wanted to do that, but right now I'm so stuck on the visualization and creating the right dataset part, that it's, like, very difficult to… so we're using BrainTrust, and it has, like, limited visualization capabilities, and then… and RainTrust also has a01:07:22
very complex UI in itself, so there are, like, buttons and a lot of things going on. So it is, like, very difficult for people to onboard to it. Even if you give, like, a link and just ask them to, you know, view the outputs, it is very overwhelming. So the incentive is very low for them, just01:07:39
Like, the… the amount of efforts they have to put into processing the data, information, and then.01:07:58
Hamel Husain
Oh yeah, it's not gonna work if you try to do this in BrainTrust. It's not… it's gonna fail, like, 100% probability. So you need your own data annotation interface. I don't think I could do the error analysis either, if it was, like, trying to look at all this JSON stuff, you know, like…01:08:05
I wouldn't be able to know if it's… there's, like, no, it's impossible, I think. So, like,01:08:23
Yeah, if there's a way that you can… you know.01:08:28
You might be able to do it, like, yourself. I don't know how complicated it is, to, like, render these things. You may be able to use AI to do it, I don't know…01:08:33
like, if these are standard, like, mermaid diagrams or something that is, like, you can, you know, the AI will know how to use, or… you might be able to do it. You might be able to recruit someone to help you.01:08:45
To create this annotation interface.01:08:58
Anubha Saxena
So at least…01:09:04
what we tried in the past was just wipe coding some stuff for rendering, and I think it's gonna, like, fail so badly in a few months, because people are just wipe coding right here, there, and then…01:09:05
no one's there to maintain it, you know? And then if things break, they're gonna scratch everything and start over again, which is not really…01:09:18
So I… I do think that using the product interfaces itself is the best way here, but there are, like, some limitations to it as well.01:09:26
But, yeah, that is… that is good, confirmation on my thoughts as well.01:09:36
Shreya Shankar
I think it's okay to go through many rounds of the custom interface you build, like, that happens for me all the time. Like, I'm on, like, the fifth version of an interface that I've built for…01:09:43
like, SF Public Defenders, and it's just… it is what it is, you know? The task changes over time, and the way… we… we discover new ways of, like, interpreting the outputs to make it easier for humans to grade, and it just means that we have to build new interfaces. I'm a really big fan of vibe coding, because I know that, like, because it takes so little time to vibe code, I can just do it again from scratch. So I wouldn't be, like, too afraid of that.01:09:53
The other thing that I found very useful is, like, having AI-assisted transformations at the trace level in your interface. So, for example, you might want, like, if you're looking at a trace and you want to visualize it because it's really long.01:10:17
Maybe you could have, like, a button that creates, like, a vega light plot based on information.01:10:31
in that trace, and that uses an LLM call under the hood to generate some Vega plot, and then renders the Vega plot. Like, I've done that before. The mermaid diagram, also a good idea of, like, if you click the button, LLM looks at the trace, and then, like, outputs a flow chart or something to help people understand the content in there.01:10:38
So if you find that, you know, there's different ways of01:10:59
Like, different recipes for making sense of the outputs.01:11:02
like, you can have buttons in your interface that call an LLM to, like.01:11:06
do that recipe to get the visualization for you. And you don't have to, like, predefine all of that up front, or, like, deterministic code to create those visualizations.01:11:12
So…01:11:22
Anubha Saxena
Makes sense. Thank you.01:11:23
Just to follow up on that, if I'm using LNM for doing that, wouldn't I have to evaluate that as well, and then it's just like a cycle?01:11:26
Shreya Shankar
Yes and… I don't think so, like…01:11:35
This is… how do I say this? I feel like using LLMs for these one-off, like, I decide when to evoke it, and it's just for debugging use, is very different. It's like, when I go to chat GPT and I ask it to, like, help me rephrase a sentence.01:11:39
Like, I don't need an eval for that one invocation of that tool, because, like, I'm supervising the thing, and I'm kind of using it as an aid for my own thinking.01:11:56
I feel like evals get very important when you're on this batch setting of, you know, you're repeatedly generating something, or users are consuming something, and you have no idea… you don't know everything that's being sent to them.01:12:06
And it's, like, very high stakes, right? They can churn. You're not gonna, like, churn out of error analysis because there's, like, a failed call to generate a mermaid diagram, I would hope.01:12:21
So… Yeah.01:12:30
Anubha Saxena
Excellent. Thank you.01:12:33
Shreya Shankar
Let's take Julian's question, and then that's the last question that I can take, but maybe Hamilton stay along.01:12:40
Gillian Langor
Okay, I wanted to kind of dovetail on that last point. This is where I'm struggling as well, is, like.01:12:45
getting all the information in the right place to start, and I feel, like, kind of overwhelmed because I'm, like, technical, but I'm not an engineer on the team, and I'm, like.01:12:52
probably the most, like, data science-y person, but I'm not a data scientist, so I'm, like, in this kind of weird in-between space, and Anuba, we're using, like, a component in our app to visualize the documents. That's, like, a necessary thing to interpret the trace, but our traces are in Phoenix.01:13:03
And I'm, like, going from one place to the other, and a question I have is… it's maybe really stupid, but the… if we built a tool that, like, let's say uses that component in the app that visualizes the, like, underlying document or something, so that you can make sense of the…01:13:19
highlighted bits. So that code is already done. It can highlight bits in the document.01:13:36
Shreya Shankar
The trace information is in… is somewhere else. It's not in the app.01:13:42
Gillian Langor
if I'm building another tool, and I'm… if I'm using that component, let's say, like, in another…01:13:49
yeah, vibe-coded tool, or some tool in our app environment.01:13:57
Am I storing the notes back where all the Phoenix traces live?01:14:02
Like, am I using Phoenix… the Phoenix annotation interface to capture the notes in the Vibecoded tool?01:14:07
Hamel Husain
Yeah, often, like… so there's two important things you said. Okay, so, like, one is…01:14:16
Okay, so you have some application code that is rendering your data, and that's a really big part of your application sometimes.01:14:21
And if you vibe code your own tool, you kind of have to, like, re-render that data. In the beginning.01:14:30
having a separate tool is fine, because you don't want to decouple… you want to move fast, and you want to view the data, you don't want to, like, get it blocked by all this stuff. In the limit, though, it does tend to… there tends to be a very high overlap between this annotation app and your application sometimes, and that's when, like Anubha was saying.01:14:37
And we were discussing, like, you might want to merge them, almost, because there's gonna be a lot of shared code.01:14:56
And then it also puts it on the critical path more, like, you can have engineers maintaining it, because now it's, like, part of, like, some admin view. So that's, like, one aspect.01:15:03
The second one is, yes, so, like, with Phoenix or anything else, and I do this with Phoenix all the time, is I just use it as a database, essentially.01:15:13
That I read and write to. And so, like, you read the traces out. When you make your notes, you can write your notes back01:15:22
to that same trace, and in the metadata of that trace. I think Phoenix has even,01:15:30
has a… has a special field for, like, open coding, now? Yeah.01:15:37
You can store it in there. You don't have to store it in there, you can store it also in metadata. You can store it anywhere, really, as long as you put it consistently somewhere.01:15:42
But yeah, it's like, I use it as a database for my annotations.01:15:50
Gillian Langor
Okay. More generally. That's reassuring. I was like, is that a crazy idea? Am I thinking about this wrong?01:15:56
Hamel Husain
No, it's the right idea, actually. I actually think it's, like, the preferred way of using Phoenix.01:16:00
for… yeah.01:16:09
Now…01:16:11
Phoenix is still useful in the sense, like, it's still nice to be able to, like, filter traces in that interface sometimes, like, quickly look up things, share traces with other people, whatever. It's not like the UI is, like, pointless, but, like, yeah, using it as a database is totally valid.01:16:12
Gillian Langor
Yeah. It's almost like, what's the point of Phoenix if we're building our own tooling around it and trying to justify? Like, I feel like it is very valuable to be able to look through the raw traces, everything that's there that might not be relevant for the evaluation purpose. Anyway, okay.01:16:27
Hamel Husain
Yeah, and what I really like about Phoenix, too, is, like, the… because it's open source, it, like, you… it exposes the backend database to you transparently.01:16:41
I think by default, like, if you just use it, like, from GitHub, it'll have a SQLite database, like, by default, and you can just connect directly to it without having… without even using an API. You can just query it.01:16:49
You know, so you can do that kind of stuff?01:17:04
Awesome.01:17:07
Gillian Langor
Okay, thank you.01:17:08
Hamel Husain
Okay, it looks like there's no more questions. Thank you, everybody, for… for coming.01:17:17
Dr Katya May
See you in the next one.01:17:24
Hamel Husain
Alright, thanks.01:17:29
Live session where instructors will address questions. Instructors may present answers to common questions, followed by live Q&A
[
Home
](/parlance-labs/evals/2025-3/home)[
Community
](/parlance-labs/evals/2025-3)