Week 4

OCT 29 Optional: Live Office Hours 11 WED 10/299:00 AM—10:00 AM (GMT+5:30) OPTIONAL Recording

Notes

Back

Recording

Optional: Live Office Hours 11

Oct 29, 20259:00 AM - 10:00 AM GMT+5:30

Audio Transcript

Chat Messages

Shreya Shankar

Hello? Hello? Okay.00:00:14

Hey, everyone!00:00:36

Hamel Husain

Hi, everybody. It's just me and you.00:00:39

Shreya Shankar

That's us!00:00:43

Vidhya Sriram

No, no, no, you're not alone.00:00:44

Shreya Shankar

Oh, okay. Oh, I know.00:00:47

We are worried… we are going to be alone starting next week, so… Yeah.00:00:53

Robert Lavigne

I'm gonna miss these sessions.00:01:00

Hamel Husain

Alright, we can go ahead and kick it off.00:01:08

Shreya Shankar

Yeah.00:01:10

Robert Lavigne

Yeah, you were here first, you wanna start?00:01:12

Vidhya Sriram

Yeah, I will, I will go, I'll keep it quick.00:01:15

Shreya Shankar

You don't have to keep it quick, we have so much time.00:01:21

Vidhya Sriram

Oh, wait, wait till it trickles down. I want to share the custom eval tool that I created.00:01:23

Shreya Shankar

Cool.00:01:29

Vidhya Sriram

It's all based on Lesson 10. I put it in the Discord.00:01:30

Shreya Shankar

Bye-bye!00:01:34

Vidhya Sriram

It's all based on Lesson 10, and that was… it was just great. I'm yet to use it with the traces and everything, extensively put it to use, but I wanted to share. And, this is something I think is going to be very useful with the founders I work with. They don't have any tools, and Excel sucks, God. So…00:01:35

I'm realizing it as I'm doing the homework.00:01:55

So… so that's, that's something that felt so good.00:01:59

Hamel Husain

You wanna demo it? You wanna share your screen and, put a CSV in there and show us?00:02:02

Vidhya Sriram

I want to put it to test a little more, because I was busy working on, you know, mapping the principles and the difficulties that I'm having, and I was trying to do that mapping and coming up with the design specs, and then doing it with Lovable, so it was more, you know… I will definitely do it in the next office hour, just that I want to give it a spin myself and then do it.00:02:08

Shreya Shankar

Yeah, feel free to just, like, take a scream brag when you do it, and put it in the Discord. I love to see it. I've never seen someone do Lovable, actually. Use lovable to.00:02:28

Vidhya Sriram

Cool.00:02:37

Shreya Shankar

Custom eval tool.00:02:37

Vidhya Sriram

So I'm curious how it works out.00:02:38

Yeah.00:02:42

Hamel Husain

It should be, it should work, it's… yeah.00:02:42

Shreya Shankar

Yeah, I'm sure it works.00:02:45

Vidhya Sriram

by how it has actually brought the things to life. Like, all the things that we think about, and that's discussed in Lesson 10. That was my… I loved Lesson 10, 11, and 12.00:02:46

Levin was especially helpful because I'm not technically fluid, I'm building that fluency, and I'm not a software engineer. So really understanding how it was and how it is, the before and after, that's… that was very instructive for me, but,00:02:56

my best, the way… the officers are so different for me, because I don't have the questions many others have when they are putting it to use with their products.00:03:12

I work with founders, and I'm trying to apply these principles. I could already do it in an informed way with 2-3 people, so that's the value I get. So, this tool, I am… I'm…00:03:22

I'm thinking it'll be great to, you know, to test with different people, and give it to them. You know, if it's private, they don't have to share everything with me, but it can be so helpful. So that's the idea. So I'm really looking forward to you giving it a try and also sharing your feedback.00:03:33

So, I'll do that.00:03:48

Shreya Shankar

Cool.00:03:50

Vidhya Sriram

I had, I had one tactical question. I didn't know what to do with the model cascades, the…00:03:51

lesson… the last two weeks… Yeah. …to,00:04:00

to GitHub, links. What are we supposed to do? There was no official homework.00:04:04

Shreya Shankar

Yes.00:04:10

Vidhya Sriram

buffers.00:04:11

Shreya Shankar

Good question. It's totally optional. We had it in the original curriculum because we very ambitiously thought, hey, once people have done their evals and have good accuracy, maybe now they want to do cost optimization. So you can think of model cascades as a tool to match the accuracy of the LLM that you had before, but now reduce the cost. So use cheaper LLMs for certain cases.00:04:11

Figure out ways to, you know, make the whole thing cheaper at scale.00:04:36

What we realized is that you don't have to… most people don't even get there, so I would focus on, you know, mastering the eval's techniques, making sure you have very good accuracy, and then come back and revisit the model cascades work.00:04:43

post in Discord, read that chapter in the reader, and then see if that makes sense to you to apply for cost reduction.00:04:56

Vidhya Sriram

Okay, but this is very different from the other technique where people use a reasoning model.00:05:03

To understand how it works, and try to replicate that with a cheaper model.00:05:10

Shreya Shankar

Yeah, that's very different, very different.00:05:15

Vidhya Sriram

It's almost the opposite, and what is that called?00:05:17

Shreya Shankar

I don't know.00:05:21

Vidhya Sriram

Okay.00:05:23

Okay.00:05:24

Shreya Shankar

I don't know.00:05:25

Vidhya Sriram

That is not distillation, either.00:05:26

Hamel Husain

This solution is, like, more of a fine-tuning thing. That's, like, very specific technique.00:05:29

Vidhya Sriram

Okay.00:05:35

Hamel Husain

It's sort of… conceptually is distillation-ish.00:05:35

But if it's not… if you said the word distillation, it would confuse submachine learning people, so… I want. Okay.00:05:40

Vidhya Sriram

All right, thank you. Robert, you go. I had one… one question, but that can wait.00:05:47

Robert Lavigne

No, no, don't go… finish your train of thought, that there's not a lot of people here, so keep going.00:05:53

Vidhya Sriram

Different subject. I wanted to start writing about this, because I see a lot of value in, one, writing to think better, second, also, just with the limited number of applications that I have done, I can see how differently you internalize.00:05:58

The concepts are one thing, but how you apply for different applications.00:06:15

So one question that came up was, usually I see people, when they write reviews and when they write posts, they refer this course and they also talk about a discount.00:06:19

Is that available for your next cohort? Then I can, you know, combine those things.00:06:28

How would I…00:06:34

Shreya Shankar

Yeah.00:06:35

Hamel Husain

Yeah, I can give you a link. I mean, it's… this one is probably… More than…00:06:36

It was when you… joined, but, I mean, the price keeps going up. But.00:06:43

Vidhya Sriram

I joined at 35% discount from Lenny's after… I did it at 2AM after listening to the podcast. I put in the first comment, and I joined the course. That's how I joined.00:06:49

Shreya Shankar

That's so funny.00:07:00

Vidhya Sriram

This is definitely more than what I paid. Yeah. That's not what I will want to discuss about, but yeah. In a way, it is…00:07:01

That's the funny thing, right? It's priceless.00:07:09

But also, it's on demand, so this is the time, so yeah, I get it.00:07:13

Yeah, go for it, Robin.00:07:20

Thank you.00:07:23

Hamel Husain

Yeah, wrong.00:07:24

Robert Lavigne

So,00:07:26

I guess the best way to describe this would be a custom trace, so let me walk you through the scenario that I'm in. So, I'm building these set of tools that would not be agentic tools, in the sense that they would not necessarily be, here's a list of tools that you can call, go off and call them. They'd be more00:07:28

procedural, so the idea being, user puts in a query, there'll be maybe 5 LLM00:07:42

processes that will take place before a reply comes back, and it'll always be those same procedural things, right? So the idea being, one will be pick a video based on the user's intent, right? So here's a list of 30 videos that might be a hit.00:07:49

queue that up, load it, play it while it's maybe then doing the LLM deep dive. That might take 30 seconds to come back, right? So they got something to watch. So…00:08:04

The problem I have is my traces don't exist in three-quarters of that code, because they're not necessarily an LLM tool call where, you know, the normal tracing would be done via the LLM00:08:13

path that I have in place. So, am I better off in a scenario like that to just say, you know what, I'm just going to build a custom trace that'll cover my entire toolset, including my LLM, and just… that'll be my trace?00:08:28

That's the question.00:08:43

Hamel Husain

It sounds like it, yeah, I mean, like…00:08:44

Robert Lavigne

I think I knew the answer.00:08:45

Hamel Husain

Is that relevant to the user interaction, you know?00:08:46

If it's, like, gives you… if it's contextual.00:08:49

to what is happening to the user, then yeah, you should put in trace.00:08:53

Robert Lavigne

Yeah, they will always be…00:08:58

well, a lot of that would be lost if I didn't have these logged, right? So normally, I would just put them to a debug or something like that, but nowadays, I'm not thinking in terms of a debug, I'm thinking in terms of a trace. So I'm kind of shifting my debugging steps to trace steps, and having a unified00:09:00

JSON or something that'll come out from any time that… that tool set or series of tools gets called in parallel, basically.00:09:17

Shreya Shankar

Yeah, that makes a lot of sense to me. For what it's worth, other people that I've talked to or work with also find at some point, like, they need to also be controlling the logging themselves. Like, the instrumentation that comes out of the AI observability tools kind of breaks when you have these, like, very…00:09:27

It makes sense to get to a point to do this. I think you can't, like, rely on the tool.00:09:44

Robert Lavigne

I'm not having… I'm not inventing the wheel for the sake of inventing something.00:09:48

Shreya Shankar

I don't… I don't think you are.00:09:52

Robert Lavigne

Okay, cool. Thank you.00:09:54

Hamel Husain

Yeah, the trace, kind of trace… the idea of traces, predates LLMs.00:09:55

And there's, like, standards like OpenTelemetry you can use that a lot of those00:10:00

Tools used, also, under the hood, and if you use those, you can… You know, log… Whatever you want.00:10:06

Shreya Shankar

Yeah.00:10:14

Robert Lavigne

So, once again, open telemetry, correct?00:10:15

Hamel Husain

Yeah.00:10:18

Shreya Shankar

Yeah, I'm sending you this link. And some of the, dashboard companies00:10:18

like, for example, Arisees, they might be able to work with anything that's logged in OpenTelemetry.00:10:24

Yeah, range… I think they all might, actually, at this point.00:10:32

Hamel Husain

Thetadog definitely does, but, like, I don't…00:10:36

Shreya Shankar

That's, like, not AI.00:10:39

Robert Lavigne

And, and…00:10:42

Hamel Husain

They're getting in there, they're doing AI now, yeah. Started.00:10:42

Robert Lavigne

Phoenix be something that I might consider as a back end to save this stuff, or is that maybe…00:10:46

Hamel Husain

Yeah, you could put it there, yeah, for sure.00:10:51

Robert Lavigne

No, that because I kind of like the fact that that's kind of available in a roundabout way. Okay, cool.00:10:52

Shreya Shankar

Looks like, yeah, Phoenix might work for you, just from a very basic scan. I just sent a couple of links.00:10:59

Robert Lavigne

Yeah, but I would try to just make sure it's compatible with OpenTelemetry, so… Yeah, so you're saying if I kind of get really familiar with OpenTelemetry's metadata, for lack of a better term, I've probably got a good baseline of understanding of what traces are that have gone back decades, but now are being repurposed for this purpose.00:11:05

Shreya Shankar

OpenTelemetry as a schema and, like, a framework also predates LLMs.00:11:23

Robert Lavigne

Yeah, that's kind of what I was getting at.00:11:28

Shreya Shankar

Jeff.00:11:30

Robert Lavigne

It's… Pure fundamentals, basically.00:11:30

Shreya Shankar

Yeah.00:11:33

Robert Lavigne

Okay, cool. Thank you.00:11:34

Shreya Shankar

Yeah.00:11:35

Hamel Husain

We got 14 people.00:11:42

Maybe there's some other brave soul in here.00:11:43

Shreya Shankar

Big niche.00:11:46

Vignesh Iyer

I'm not ready, so I'm usually last, so I thought I can jump in early today.00:11:47

Yeah, so,00:11:54

Regarding, like, the open telemetry and the traces, I thought maybe it's more… more than a question, it's kind of like, I thought maybe raised a discussion point.00:11:56

Like, your different aspects that you have, for instance, like your spans, your traces, and your observations. Maybe I thought we could chat about them, like, for instance, spans,00:12:05

from what I understand, a span groups something over a time interval, right, that occurs. So, is something a span only when it has something, it has other data aspects00:12:17

Within it, like tools, Is that, when you classify it as a span?00:12:32

Shreya Shankar

I don't know what a span is in the LLM.00:12:39

Hamel Husain

Yeah, let me… I can pull up a chart, just give me a moment.00:12:42

Shreya Shankar

Yeah, oh, there's one blog post.00:12:45

Hamel Husain

Yeah.00:12:47

Shreya Shankar

Like, read my mind, we are thinking of the same thing.00:12:47

Hamel Husain

I'm trying to find it. Oh yeah, here we go, let me see.00:12:50

Let me pull this… Thing up here.00:12:54

Shreya Shankar

I think it's not a clean one-to-one mapping with, like, pre-LLMs, because what…00:12:58

what is actually the case for your application, if you have, like, a conversational AI, is every conversation, like, prefix of the conversation, all of the messages always get sent to the LLM every time the user interacts, right? So it's…00:13:04

Vignesh Iyer

break.00:13:19

Shreya Shankar

It's kind of like… I don't even know what people would consider a span in this case. Oh, it looks like Hamila's found this. That's great.00:13:20

Hamel Husain

So I put the link to the tweet in there. There's also a blog post, I just have to find it, but, like,00:13:33

Yeah, maybe it's in this alt thing, let's see.00:13:38

Oh, no, okay. He was just good about making an alt, I guess. Alt, text, but,00:13:43

Yeah, I mean, it means a little bit different, but it's like…00:13:49

The atomic units are usually called a span.00:13:52

And.00:13:55

Shreya Shankar

It's like a message, I would imagine.00:13:56

Hamel Husain

Yeah.00:13:59

Shreya Shankar

No, I guess that's not even true. I don't know what the difference between a trace and a run is.00:14:00

Inland.00:14:05

Vignesh Iyer

Trace, from what I understand, is, like, one interaction.00:14:06

And the traces are grouped across, like,00:14:11

You could have multiple interactions, right? Like, multiple, calls.00:14:16

I think the span falls within the trace as one unit of work.00:14:20

Hamel Husain

Yeah, yeah, spans are within traces. So, usually trace is, like… you know, I kind of…00:14:25

I think most of the time, traces are, like, sessions, like, one… Sort of unit of…00:14:32

Let's say you're trying to get something done, a user is interacting with the AI, it's all the turns of a specific conversation, for example.00:14:39

Should be in a trace.00:14:46

Shreya Shankar

Yeah, I also agree, like, to me, a trace is the list of turns. Like, every turn is annotated with the content and the role. The role can be the user, the assistant, system role, or it can be, like, a tool role, in which it's, like.00:14:49

You know, tool, tool argument… whatever the tool returns, basically. That's, like, part of that.00:15:06

Vignesh Iyer

Got it, yeah, because currently, what I'm trying to do is,00:15:16

I've got, like, a no-code tool, right? And then, I'm trying to just get its trace… it does good tracing on the no-code platform itself.00:15:22

But when… when kind of tracing to something like Phoenix or, LangFuse, the issue is.00:15:32

it's… it's kind of shown in a linear way. Like, the tools are not shown underneath the conversation. It, like, just follows in a linear way, so they haven't,00:15:40

Kind of wrapped the relevant components00:15:50

well enough into these various, components, so I'm just trying to figure out what's what, and make sure I can get a nice hierarchy.00:15:54

which will obviously make the error analysis a lot easier, like pre-error analysis, like, I need to do this work. And also, when the evaluators are created too, it might be easier to reference the relevant elements00:16:03

On the evaluator a lot easier if they're not00:16:18

just in, like, a linear fashion, right? There's some sort of hierarchy.00:16:22

I'm sorry.00:16:27

Shreya Shankar

I don't think there's anything wrong with there being a linear…00:16:28

Vignesh Iyer

Okay.00:16:32

Shreya Shankar

think of a trace as a linear sequence, because ultimately, the LLM, or the AI,00:16:33

thinks of a trace as a linear sequence of messages, right? Like, that's what the LLM sees.00:16:39

And it might… it's easier to, I think, start out treating it linearly. I think you might get to a point where you realize, oh man, this error analysis is a little bit tedious, because there's some tools that are just taking up too much space, and maybe it's good to, like, collapse those.00:16:45

In which case, great! Awesome. Build your own custom interface and, like, recognize that, you know, you can collapse those messages, or provide some things as context. Maybe you don't need to show the system prompt every single time, stuff like that. But I… I feel like…00:16:59

It's worth, you know, trying to do the error analysis, treating it all as linear trees, and then evolving from there when you realize that there's, like, common patterns and what you can collapse.00:17:14

Vignesh Iyer

Okay, got it.00:17:27

Thanks.00:17:29

Hamel Husain

Someone's, asking a question. Oh, okay, another's saying thank you.00:17:43

Yujohn Nattrass

I gotta…00:17:49

Hamel Husain

Anyone else?00:17:49

Yujohn Nattrass

I have a question here.00:17:50

Hamel Husain

Yeah.00:17:52

Yujohn Nattrass

Thank you for your time here.00:17:52

to be honest, I haven't had time to go over the homework last week, but one thing that's been on my mind was CI, and how exactly people run CI.00:17:56

for your applications, like, I've seen people put it in a test suite, and then I was like, okay, that kind of makes sense. And then I try to use some of these, tools out there, and you could, like, run00:18:07

You have a, like, your run…00:18:19

an experiment on a dataset locally, and it prints out a report. And I was like, okay, that's interesting, I've never seen this, like…00:18:22

How… I'm like…00:18:30

in both cases, I guess I kind of understand, like, the purpose of CI. You want to, you know, see regressions, but, like, how are people, like, doing this in the wild? Like, should I try to stick it into a test suite, or should I…00:18:32

create a report where people can see it, and then they're like, okay, this is good, this change is good, we can move forward, or this is bad. That's where my question is.00:18:45

Hamel Husain

Yeah, let me see if I can… I'll try to find a screenshot.00:18:56

From a codebase that I'm allowed to show you. One second.00:18:59

Shreya Shankar

I can speak to two experiences that could be relevant. So, one is,00:19:04

I've worked on an AI fashion application before, now I advise that startup. They're, like, an AI personal stylist. Their CI is… well, I mean, it's also biased, because I wrote it, and I'm, like, a no-tools kind of person. Done entirely in GitHub Actions, it is just PyTests.00:19:11

And every time something fails, I log it to a database.00:19:29

and I require these tests to pass, or there's some unit tests00:19:34

I'm using unit tests loosely, but some cases in which I run evals that I know it's a little bit flaky, so I allow that to retry a few times. There are other cases where I only care for the aggregate statistics, so past, you know, more than 90% of these, and then I'm happy.00:19:40

Because it's just really hard to get 100% all the time on some of these cases. And I would say it's, like, no more than…00:19:56

I think the most we ever got was, like, 40 different test cases. Like, these are, like, really, like, edge cases that we kind of uncovered as we deployed the application, and we wanted to make sure that, you know, we could,00:20:04

we could solve those. Another experience that I had was before LLMs, but I think it would be more… it would still be applicable in LLMs. We used MLflow.00:20:19

basically, we would run MLflow jobs in CI, so that we could log all of the results.00:20:31

to MLflow, and then that would, like, give us a report. So every new pull request that's created, there's, like, a new experiment on MLflow,00:20:38

And it's triggered only if, like, certain code paths are triggered.00:20:47

That seemed to be nice, because then people could… we also had an automation that would, like, link to that report in the PR. That's, like, a bot that would come up with that. And… yeah, that was helpful when reviewing.00:20:51

Those are just my two experiences. Maybe Hamel also has.00:21:05

Hamel Husain

Yeah, let me just share… I'm not gonna share anything groundbreaking or anything.00:21:09

But, maybe… is it sharing my screen? Is it…00:21:14

Shreya Shankar

What is it?00:21:18

Hamel Husain

It's like, what about this… is that better?00:21:19

You're sharing the whole… okay.00:21:23

So this is just, like,00:21:24

an example, you just run it in CI. In this case, I think…00:21:27

This is, like, a TypeScript database, or a TypeScript codebase, and they're using…00:21:30

whatever unit testing framework, is it called Jest? I forget what it's called. I think it could be called Jest, I don't know. I'm a Python developer, so I don't remember anymore.00:21:36

Shreya Shankar

Yeah, Jess.00:21:44

Hamel Husain

And yeah, and so, like, they have test cases, and you can see,00:21:45

you know, they're just printing it out, and they also have a summary. I don't have the summary in this, like, screenshot. You know, this is, like, some examples of what some test cases… this is, like, some code-based ones.00:21:52

But, yeah, you just basically run it in CI, and you can do it a lot of different ways. You can store your data in…00:22:05

you're… in this case, in this specific case, the data was small. It wasn't, like, long traces or anything.00:22:12

You know, or the traces were short, so we just, like, put the data in the rebo.00:22:21

But… If that's not feasible, then you can store your data somewhere else externally and pull it.00:22:27

your observability platforms, like, Lang Smith, Rain Trust… I should just come up with the acronym for these three vendors, but those… the vendors will, like.00:22:34

They have a dataset thing that you can just pull from, and you can… those dataset things get reversed as well, so it's kind of convenient, and you can, like, actually run…00:22:45

The, evals, and then…00:22:55

like, calculate the metrics, you can store it there, you can pull it locally, you can even emit a report. This is not in this screenshot, but…00:22:58

At the end of this, they do emit a report.00:23:06

That just shows a summary of, like, how many failures00:23:10

percent of failures and, like, an overall score, something like that. It's just, like…00:23:13

Basically, just the summary is, like, you can use your existing CI infrastructure, you don't have to have something special. That's all.00:23:19

Yujohn Nattrass

I did have one more follow-up question. So, you spot a regression here.00:23:29

And then…00:23:34

I guess when you run the CI, you're also storing all the traces and spans, so you can go back and review it, or…00:23:36

Is that… be, like, too much data, or… it's perfectly fine.00:23:42

Hamel Husain

No, it's perfectly fine. You can tag it as, like, CI traces, you know? So you just organize your data so it's not overwhelming you.00:23:50

And you can just see, okay, like, what happened? And yeah, that's totally fine.00:23:58

Yujohn Nattrass

Oh, okay.00:24:06

Hamel Husain

Thank you. In fact, when you do, like, a lot of these… so the vendors, they'll have this concept called experiment.00:24:08

And then, you know, you can run the experiment, which is basically, like, calculate the metrics, and00:24:14

I mean, let's see if I can find an example.00:24:23

Give me a moment.00:24:26

I don't know if I have one handy that I can show, but I can try.00:24:29

Deceive.00:24:36

I have an example.00:24:39

Let's see…00:24:45

I'm on some legacy version of… Phoenix, apparently.00:25:00

Let me see if I can find, like, a…00:25:05

Let's see if I can find this…00:25:15

Let me share my screen again.00:25:31

I think it's… here. Okay, so this is, like, from the Phoenix docks.00:25:35

This is… every vendor has this, like, something like this, so it's like, okay, you can have a baseline, like, one run, right? This is one run on the left-hand side, there's a different run on the right-hand side, and then you can, like, compare, you know, these are, like, specific runs.00:25:39

And then you can, you know, you can compare as many runs as you want, and then there's another thing that we could, like, just shows you a diff, even, if you wanted to. But you can, like, compare side by side, like, oh, this is the baseline run, this is the new run, what failed, and whatever.00:25:57

Yujohn Nattrass

Okay.00:26:13

Hamel Husain

So you might look into that, the data sets experiments thing, it's kind of handy.00:26:15

Yujohn Nattrass

Right, right. Yeah, no, definitely will. Thank you.00:26:18

Hamel Husain

Yep.00:26:23

Anyone else? Don't be shy. It's your time.00:26:44

Pardeep

Okay, perhaps I'll ask a question.00:26:53

Camille,00:26:56

the small language models, I think I'm getting more and more curious about the small language models, you know, particularly thinking through it, like, you know.00:27:00

you know, I'm not sure, this is just a philosophical discussion, perhaps, which is better to use LLMs, which is large language models, or internally use small language models and fine-tune it for yourself.00:27:10

Which can be cheaper to deploy, and then, you know, tailored to your needs. And would you recommend organizations to do that?00:27:21

Why or why not? Have you, have you played around with those things yet?00:27:29

Hamel Husain

I, like, so, I used to be really excited about small language models.00:27:36

Actually, my first course on Maven was about fine-tuning.00:27:41

And it was, it was, like…00:27:45

kind of in this exciting time of what I would say, fine-tuning and, like, you know,00:27:49

And, yeah, I was bullish back then, I was doing a lot of fine-tuning.00:27:58

of open models, and, of even closed models.00:28:03

In fact, that CI screenshot I showed you was a client called ReChat. It was, like, real… another real estate, thing that we fine-tuned several things, including, like, the OpenAI. But…00:28:09

Since then… I've experienced a regime where a model's capability increases so rapidly.00:28:22

that anyone that I worked with that did fine-tuning.00:28:30

Back, like, threw it away.00:28:34

And started using Frontier models.00:28:37

And so… I kind of…00:28:41

Yeah, I guess, like, swallowed some bitter pill. I mean, I was, like, really enthusiastic, I was like, you should do fine-tuning, fine-tuning is good, blah blah blah. Even the… even in the fine-tuning course, I had a guest that was like, oh, fine-tuning is dead.00:28:46

And I was like, oh, like, I don't know if I believe that. But, you know, like, I think it's hard. It's hard to, like, you're kind of swimming upstream a lot of times with fine-tuning, like…00:28:58

Especially small models, like… It can be very expensive, to host your own model.00:29:11

Because you have to have really good hardware utilization. You need to be keeping everything busy, like, all the time.00:29:19

And that's hard to do if you are one company, and you're not aggregating across, like, lots of things.00:29:29

So, it can be… yeah, it's quite difficult,00:29:37

And, you know, this is, like, all the engineering time, not to mention the hardware, and all the other stuff.00:29:41

And so it just, like, is very difficult economically to beat hosted models, and then…00:29:47

You know, yeah, fine-tuning itself is, like…00:29:56

If you're doing evals, like, fine-tuning is, like, a lot easier. It almost… is somewhat…00:30:00

free, in a way, but it's not free. You have to do… you have to, like, kind of understand what you're doing.00:30:06

And so, it just… Philosophically, it's like…00:30:12

The number of people that it makes sense for is, like, shrinking.00:30:16

Like, all the time. Like, rapidly.00:30:20

Pardeep

Got it. So, one of the… the reason I asked this question is one of the situations I was dealing with was somebody working in a bank, you know, they're reaching out to say, we want to do something, but we want to keep our data private. So now there are a few ways to do it, which is, you know, we'll say, well, you have the RAG system.00:30:24

But then, you know, before making a call, you just sanitize everything in a way that, you know, just, you know, it just doesn't provide a lot of information. So I realized it's just not gonna work if you want to keep all your data private on-premise.00:30:41

and still want the frontier models, it's just not gonna work. So either you just deploy your own, you know.00:30:55

it could be large language models or small models. Fine-tune them in-house if you want to. But even if you don't want to, then you can just, you know, deploy those, you know, Llama models on-premise and still keep your data.00:31:03

right there. But you still have the opportunity to fine-tune them. And I think they were getting very excited, but I was like, no, the engineering cost may be just too high to bear unless, you know, you do the cost-benefit analysis of, well, if I build something like this.00:31:15

would it make sense financially to do this? But I think right now, financials are, like, out of the window, look, nobody's thinking about that.00:31:30

So… Well, we'll see.00:31:39

Yeah, thanks for the, thanks for the advice. I'll definitely look at the course.00:31:44

Hamel Husain

Sorry, when you say, look at the course, what do you mean? Oh, you mean the fine-tuning one?00:31:49

Pardeep

Yeah, defined anyone.00:31:53

Hamel Husain

Oh yeah, just, I'll put a link.00:31:54

Pardeep

That course is, like.00:31:57

Hamel Husain

open… It's a lot… let me… I'll give you a link to it.00:31:58

Pardeep

Okay. Yeah, I mean, I'm pretty curious right now. The problem which I'm trying to solve is, I think somebody, I think we had the side conversation here, which is, you know, I started with building00:32:02

you know, like, automation using, you know, Cua models, and I realized that it just doesn't work. You know, the brain is up in the cloud, and it just tells me what actions need to be done. So I said, okay, I'm just gonna build my own Cua. So I started building my own Cua.00:32:13

And I realized that this actually is a much harder problem, because now, you know, the DOM manipulation and website manipulation is a really hard problem, because, you know, people build different technology, different blah blah blah, different mistakes. So it was becoming harder and harder. Then I was like, well, if I would00:32:28

fine-tune the models to actually solve this problem, that actually would be good. But then I was like, well, that's also becoming tricky. So now I'm doing the combination of Goa models and the DOM manipulations that, you know, in some places Gua works, some places DOM works. So if I merge these two effectively.00:32:45

then, at the runtime, I can do things in a way which actually becomes more effective. So that is what we are trying to do right now. But I think the… this is how the whole idea of fine-tuning came in, but I'm also now thinking not to do it for now, just, you know.00:33:00

Try to push the boundaries on the frontier models, pay for it as much as possible, and then down the line decide what makes sense.00:33:16

Hamel Husain

Fine-tuning OpenAI models works really well, by the way. Like, I've done it many times.00:33:25

And it's just, you know, those… you can… you can fine-tune, like, a frontier-level model.00:33:29

It, it increases the costs.00:33:35

But, if you need increased capability, especially on a scope task, It can be okay.00:33:38

Especially if you're fine-tuning a smaller model.00:33:44

From a larger one, you can do desolation, essentially.00:33:48

Pardeep

Yeah, yeah.00:33:52

Hamel Husain

But, yeah, I mean, you all have to, like, there's a lot of dimensions in terms of, like, hey, let that be talked about, that you have to, like, consider, is it feasible or not to begin with.00:33:54

Pardeep

Totally, you know, I think the follow-up question is not technically, but I think the, the overall course here and, the Discord, are we gonna keep it open for, like, Arjun?00:34:06

Two minutes. My son, my son is right here.00:34:19

Hamel Husain

No, no, yeah, makes sense.00:34:23

Pardeep

So, the Discord itself, I think…00:34:25

if we make this as a community, I think there are obviously chances of brainstorming these ideas, and, you know, ask questions around. I'm wondering.00:34:30

Hamel Husain

Yeah, the Discord is gonna be there…00:34:36

Until Discord decides to sell to Salesforce, or whatever Discord does, I don't know, like, you know. But, like, I'm not going to shut it down by any means.00:34:39

Pardeep

Okay, cool, cool, thank you.00:34:49

Shreya Shankar

We have a question in the chat. Would you say fine-tuning is generally a last resort?00:35:03

animal.00:35:09

Hamel Husain

Yes.00:35:11

Shreya Shankar

I do it only at a scale where it makes sense, and the scale is crazier than you think. It's like, I…00:35:17

I have long document workloads, and I'm trying to do inference on, like, 100,000 plus documents, and I want them, like, within, like, a couple of days, and I think I'm gonna do it repeatedly.00:35:24

I don't know, over the next few weeks.00:35:37

It's, it's, like, not a, not a startup scale.00:35:40

Hamel Husain

If you're gonna, like, offer, like, a special… Service that does something.00:35:44

For a specific domain.00:35:51

then yeah, maybe it makes sense. And if you have, like, gathering lots of data.00:35:55

Then you can out-compete, everyone else, potentially, if you focus on one domain.00:36:00

But… most of the time, most people are not trying to do that.00:36:08

Like, out-compete with data. But at scale, it's possible.00:36:13

Vidhya Sriram

Is there a small project we can experiment with to try fine-tuning? Like, a beginner project, to get a sense of what the workflow is?00:36:19

Hamel Husain

Yeah, you can do, so the… probably the easiest, most approachable thing to do is…00:36:28

is to try to fine-tune OpenAI model, and I would just Google OpenAI fine-tuning, and look at their guides.00:36:35

And the reason I say that is, like, you know…00:36:43

To fine-tune an open model is, like, significantly more…00:36:47

Work, because you have to, like…00:36:51

Load the model, load the weights, get a GPU, debug CUDA, do all this stuff that can go wrong.00:36:53

So… with the OpenAI.00:37:00

You kind of just have to focus on data, assembling data.00:37:04

Which is a good place to think about.00:37:08

That's one place you can start. If you're ambitious, my favorite thing to do is… to,00:37:11

instruction tune a model. So take, like, a model that doesn't chat with you, and then feed it conversations, and then…00:37:22

It starts to be… have a, conversational… interaction.00:37:29

That is really good. I think that helps people give mental models, but that one's more…00:37:34

Yep, you kind of should be familiar with some machine learning to… because it's a lot of code.00:37:41

Vidhya Sriram

Okay, that's a deal breaker. Thanks.00:37:47

Shreya Shankar

Another reason I'm generally hesitant to say, oh, go do fine-tuning is because there's, like, a lot of little gotchas that are gonna be there across the entire process, like…00:37:51

people will say, oh, just do DPO, like, direct preference optimization. Well, no, actually, you should do this, like, workflow of first do SFT, supervised fine-tuning on some responses, then do DPO or, like, some other preference optimization, then do RLE chat, like, it's…00:38:02

it's just so bespoke with so many different steps along the way, and there's no, like, good tutorial out there for how to do it, because the process is actually very different for different domains. So you're kind of having to relearn all of the things that frontier model companies have come up with during post-training, and it's kind of, like, not the.00:38:19

Vidhya Sriram

You don't want to be focusing on all of that when you can focus on your application, right? So… Right, right.00:38:39

Shreya Shankar

That's kind of the trade-off that I see in my mind.00:38:43

Vidhya Sriram

Yeah.00:38:49

Yeah, because even those evals, there was a clear pecking order, and there were concepts you can understand, and then… and then wipe code. I could take that leap, because the concepts were clear, and, you know, I'm… it's not like that.00:38:50

Right? You're saying it's complex. I see Perdeep's hand up. Maybe it's a follow-up. I had another one, but I'll…00:39:03

Pardeep

Oh, I think this was a lingering hand up from the previous…00:39:11

Vidhya Sriram

Okay.00:39:14

Pardeep

You can stand up, so I feel bad.00:39:14

Vidhya Sriram

I… if there is no one else, I have a question about the,00:39:16

your recommendation about keeping it binary, I totally understand why Leica's scale is useless. That's, like, NPS for me. But… the binary, sometimes.00:39:24

becomes too reductive in terms of understanding how useful something was. Like, it was partially useful. Partially the tone was okay, but then you kind of lost your way in the end. That happens, right? So is it… is it better sometimes to use your judgment and have, yes, partially00:39:34

Successful and no. Those options.00:39:52

Shreya Shankar

what I would do in that case… so, I'm concretely imagining, like, a writing assistant. This happens to me all the time in writing, where the AI will write, it'll write fine, and then it'll, like, devolve into, like, just having bad writing, sounding like an AI, like that. I don't even know how to describe it, it's not great.00:39:55

I would create a failure mode that is specifically, you know, transitions to having, like.00:40:10

you know, bad tone, or having… or, like, not following my rules, or having choppy sentences. I would create that as a failure mode.00:40:17

And then try to measure that, rather than change my existing failure mode to be 3-way.00:40:27

Vidhya Sriram

Okay.00:40:34

Okay, so the norm is to have only two, so try… strive for that. Strive for making it binary.00:40:36

Is what you're saying, right?00:40:43

Shreya Shankar

I get burned every time I'm like, oh, but I think this is a special case, I should do, like, three-way classification, because the alignment is much harder, and then I have to somehow make a decision based on it.00:40:45

Like, oh, should I improve this or not? Should I prioritize something else or not? And…00:40:56

Anyways, at the end of the day, if I'm making a binary decision.00:41:03

I find that it's better to decompose my criteria into binary criteria.00:41:06

Vidhya Sriram

Okay, but I was trying out a RAG application where it brought the number of chunks, but, you know, the precision was good, but the recall was bad, that kind of a thing. So I wanted it to tell me00:41:12

I had a relevance check, and I wanted it to tell me that, yeah, it recovered some relevant chunks, so it was partially successful, but then it couldn't actually answer the question completely.00:41:24

That kind of a thing, that's where I felt the partial would have qualified it better for me.00:41:36

In this use case, it's still… I see.00:41:44

Shreya Shankar

You have to strive for binary? Okay, so I… maybe Hamil has some other feedback here.00:41:46

I think you… maybe you want to separate the, like, annotations you give for yourself, like, it's totally fine to annotate, oh, it was partial, this was all the… almost all the way there. Like, this is what I think of as, like, open coding feedback, but when it comes down to the axial coding, it's like, did it fail or not?00:41:51

Because that's going to inform what numbers are on my dashboard.00:42:09

The last… my dashboards are not useful if they're, like, 85% of the way there, like, an average of, like, everything is lukewarm, right? Like, that's not telling you how good your application is.00:42:13

But what you can do is drill down on the failure mode, if you see it's very low, all of them are zero in the recall, and then you'll see in your open codes, or your annotations, like, oh, like, this was partially all the way there, like, maybe this is a good example of something we could fix, or is closer to being fixed.00:42:24

Vidhya Sriram

Okay, it's real.00:42:42

Shreya Shankar

And maybe Hamil has thoughts on this as well, I'm curious.00:42:43

Hamel Husain

Yeah, I'm pretty militant about the binary.00:42:46

Vidhya Sriram

Okay.00:42:49

Hamel Husain

So…00:42:50

you know, for the same reasons that Shreya said, I don't think I have anything additional to add, except, yeah, I, like, I would…00:42:52

try to… yeah. You really need to do binary.00:43:02

Vidhya Sriram

Okay, I can qualify all I want, but the dashboard should be binary. You know, it raises the bar on us, in a way, to address.00:43:07

Hamel Husain

Yeah, because if, like, the rag is, like, only partially correct.00:43:14

Then, like, is the outcome really helpful for the user, and this, like, this is… is this correct? And I would say, no. Just fail it.00:43:18

It's failed.00:43:26

So.00:43:27

You know, like, because there's not, like, partialness, partially correct, like, I don't know what to do with that, you know? It can be partially correct in a lot of different ways, and it's really not… none of it is correct.00:43:28

So, I would say, no. If it fails, it needs to stay failed until we pass it.00:43:39

Vidhya Sriram

Okay.00:43:47

Shreya Shankar

Yeah, like, I think it's such a slippery slope. Once your dashboards are not aligned with the product success, it is so hard to recover from that.00:43:48

So…00:43:56

Vidhya Sriram

Right.00:43:57

Shreya Shankar

Yeah.00:43:58

Vidhya Sriram

Thank you. Yeah.00:44:00

I know how my son felt.00:44:01

So, I know. Thank you.00:44:03

Pardeep

Yeah, I was… I was just gonna, you know, add on to what Vidya said. I think the…00:44:08

I have one example of… in the past, I worked on something which was not LLM-related, but this was, basically…00:44:13

you know, people are doing something, and we want to see if they actually did something. So we had this, you know, image processing things where we will just, you know, monitor that and say, did it actually happen? And then we had built the classification model of this actually happened or not.00:44:21

And then we actually built something, you know, you record a video and send it to India, and we have this, you know, people tagging to say, oh yeah, this is the auditing mechanism of, yes, this actually happened, and now that becomes a course correction. But this was, like, a very Boolean way of doing things, whether it happened or not happened.00:44:36

That is one example of, you know, it's better to collect those facts as a very Boolean overall. There's another project we've worked on where, you know, you have these, you know, content where you need to do content classification, but now there are, like, 5 different ways of classifying the same content. How do you do it?00:44:52

Right? And, you know, then you have this multi-class classification problem, and then what we did was we said, okay, you know, how do you have this, you know, very good instructions on what would make this a classification of this category? So this is a multi-categorization, and then we first did ourselves.00:45:09

And that was, like, really hard, you know?00:45:27

multi-class class view. And then, you know, we again did the same thing. You know, we had these, you know, hundreds and thousands of, you know, inflow of content coming in, and there is somebody who is, you know, tagging these based upon the content, and then we started building tools around it to automatically detect based upon the people. But I think with LLMs, I think it becomes harder, but if you can figure out to partially quantify it, it may work, but I generally agree with00:45:29

you know, Boolean is usually a better way to solve it. Either it is useful or it's not useful, rather than partially useful. Because, you know, it's hard to quantify, you know, just in general.00:45:54

I just thought I'd share.00:46:08

Hamel Husain

Yeah, I mean, like.00:46:09

Vidhya Sriram

Thank you for me.00:46:11

Hamel Husain

You can do fine-grained… so fine-grained annotations…00:46:11

Just become more exponentially caught, like… become exponentially more… Costly when,00:46:15

you know, when you start to deviate from binary, as you get more and more complicated. So, like, the most impressive…00:46:25

annotation that I've ever heard of in machine learning is probably Tesla.00:46:32

So there's, like, a talk from Andrzej Kapati where he talks about like, Tesla…00:46:37

like, the machine learning, there's, like, hundreds, like, not… I don't know how many, like, thousands of annotators, hundreds of annotators. Each annotator is given, like.00:46:45

maybe months of training, and, like, there's, like, a book or manual, like, this thick, with all the rules and all of the things. So you can just imagine, like, how expensive that is. That's, like, very intense. You know, but they're fine-grained annotating, like, so many things.00:46:56

In, like, videos.00:47:11

So, like, there's no way… like, most people can't do that, like, you can't do that for LMs, like, it would just make…00:47:14

Annotation, prohibitively expensive.00:47:19

And so, like, you start going in that direction when you start, like… so the things that we're teaching is to, like, keep the process tractable.00:47:24

Because it doesn't take much to kind of bump it out of the… into intractable.00:47:33

The story about Tesla is, like, very unnecessary, I apologize.00:47:44

Being just confused.00:47:49

Everything.00:47:51

Vidhya Sriram

No, it was provocative. I didn't know much about it, so now it makes me think about it.00:47:55

Shreya Shankar

Oh, Trish just raised their head.00:48:07

Trish Uhl

Hey there, so this is awesome, course is awesome, community is awesome, office hours are awesome, thank you.00:48:11

Thank you. And I'm curious on the people side of this, so I lead the product and engineering teams for Enterprise Solutions for a financial services firm in North America.00:48:19

And, and in working with the non-technical… so this is new, right? Like, going through these evaluation processes, of course,00:48:31

And…00:48:41

especially working with the non-technical, the subject matter experts, and getting the business involved, I'm curious, like, other techniques. So I wouldn't want to do the Tesla thing like Hamel just described, right? Like, I don't want to do months of onboarding, months of training, and give everybody 5-ring binders. That's not my… that's not my intention. But I think, you know, one of the biggest call-outs from the… from this program00:48:42

so far…00:49:07

that I've been able to socialize with my team, and then also with the stakeholders at large, is to say.00:49:09

The first part of our evaluation process is deliberately effortful. I can see the bells and whistles and all the different dashboards that are built into some of the platforms that we're using, but we're going to ignore them for the first 100, because I want people to get in, I want people to understand what we're doing, I want people to understand the data.00:49:15

we're gonna get our legs under us, and then we'll be able to build, you know, then switch to a, you know, a different phase that then takes us to more simplicity and scale. So I've actually been using the Microsoft 5. I put them in the,00:49:34

I put them in the chat. So the Microsoft 5, so I actually do use the Likert scale to start out with on relevance, groundedness, coherence, fluency, and accuracy, but again, it's strictly for just that onboarding00:49:49

period in the beginning, because otherwise what I found, especially with the non-technical stakeholders.00:50:03

is that they were missing stuff. So things that should have failed, they didn't fail it because they didn't even think about what they were looking at. So as an example.00:50:10

One of the products that we're working with is taking a bunch of validated data and output from another system that's been validated00:50:19

It's mostly financial information, and then it's passing it to the LLM. It's an explainer. It's explaining how the previous product did whatever it did, and trying to help the user understand the inner workings of the other product.00:50:27

And what I found was that when we started it with the subject matter experts, they were validating the numbers in the outputs, but they weren't actually validating the words.00:50:41

So… so I was like, okay, like, the whole thing that we're trying to do here is to make sure that correctness is not only that it's not, you know, pulling or hallucinating on false numbers from the actual record and the dataset that was being sent, but it's actually phrasing it properly, like, you know, 15% up or 15% down, right? Like, is it actually… and so what I'm… what I'm finding is00:50:51

in just starting with that… with that level of effort, and I'm… I'm curious…00:51:16

How you guys are onboarding, especially non-technical stakeholders to this process.00:51:21

Shreya Shankar

I like the point that you've raised about, kind of, using lightgart scales as a way to essentially elicit open codes.00:51:30

Before coming up with rubrics that are going to be much more definitive on what pass and fail is. That is… sorry?00:51:38

Oh.00:51:48

Trish Uhl

Oh, sorry, I said yes, yeah. Oh, thanks, yeah.00:51:48

Shreya Shankar

Yeah, I think… I think the broader point is…00:51:52

Is that, yeah, like, when you're… I think a lot of people, actually, in the course use the benevolent dictator model of, like, we are going to try to figure out how to have one person do as much of the annotation as possible, they are the expert, they won't make these mistakes, but I definitely agree with you, it breaks once you need00:51:55

More people doing the annotation.00:52:13

And for these things, I would say, like, you have to somehow design the initial part of the process to get to a very detailed rubric.00:52:16

you still want to have binary, kind of, evaluations, but getting there is difficult, and maybe Likert scales is a way to do it. Maybe everybody doing open coding, and then you looking at the open coding, and then somehow trying to massage that into the right rubrics also could be a case.00:52:23

I don't know what the best solution is. Wow, there's, like.00:52:43

30 thumbs ups that just happened at the same time.00:52:46

Trish Uhl

No, I only hit it once!00:52:50

Shreya Shankar

But I don't know, it makes sense to me, and I think that's cool, and maybe, I don't know, some message for us is how can we be more explicit about that in the course reader, of, like, when you don't do Benevolent Dictator, here are some ideas to, like, get your rubrics00:52:56

Kind of.00:53:12

Trish Uhl

I'm also using it to… well, thank you for that, Shreya, because the other way that I'm using it is also to do calibration with the human raters as well, right? So again, just the initial, you know, we're talking, like, the first week or two, right? Yeah. Like, getting into this, just to get everybody, you know, into the process, but being able to calibrate, to be able to then…00:53:14

have something, some kind of structure to the conversation to debrief on how people rated what they rated. I've also been using it with my team, too, because they're new to this. And so, yeah, I'm finding that that's really helping to drive the conversations. It just… it just provides a little bit of structure. And then we can switch to binary, because now we know what pass-fail means.00:53:34

But yeah, but getting there has been a…00:53:57

Challenge.00:54:01

Hamel Husain

Yeah, I mean, it's an interesting discussion. I think, the way that I… so the way I then board people…00:54:03

To error analysis is to do it with them.00:54:09

Always.00:54:13

And always do it with them, because… It's the most important thing.00:54:15

That you can do in evals?00:54:20

Probably the most important thing you can do in AI.00:54:23

Yeah. In general. And so…00:54:25

I find that when we do it, and, like, I don't do it for them. Basically, what I do is, like, I do some, and then I let the other person drive.00:54:27

And I'm just the… critic, you know? Like, are you sure this is good? Why is this good?00:54:36

And really tried to push on that. I think about the product, and say, you know…00:54:43

And I just… You know, you can… once you practice that a bit, of being that person.00:54:51

You can get quite good at it, and…00:54:59

Yeah, it's a good way to, like, onboard people and get them to look at things. And then, often what I do is, when people get stuck.00:55:04

Of, like, hey, I have this error in my application, and I'm not sure…00:55:11

you know, I keep hearing that there's a problem.00:55:17

And then we dig into it a bit, be like, okay, how are you hearing about it? Do you have any signals? Or, tell me more about the problem. And then we search for traces, and then we look at data. We're like, let's look at the data right now, together.00:55:21

And I find that people, when you train, you can train them to look at the data quite nicely.00:55:34

If your approach works, then that's great. It sounds like it is working. But I also encourage you to, like, do some error analysis with them, and push them to think critically about their data, because people don't… people are not good at error analysis.00:55:41

out of the box. You can't just tell them, go look at it and do open coding.00:55:56

I find that no matter how much you explain it, you kind of have to do it.00:56:01

with them.00:56:06

Trish Uhl

Yeah, it's a really good call-out, Hamel. I mean, so basically what we do…00:56:09

is we actually have, like, working sessions. To your point, like, we'll have everybody… we'll have a group session. Everybody's on the group session, we'll walk through a scenario, we'll walk through a query, we'll walk through, you know, kind of a worked example and say, okay.00:56:14

You know, we ran this, we ran this input, we expected this output, here's what the, you know, and we walked through the whole thing, like, the whole kind of setup.00:56:30

And then we give people a bunch of test scripts, and they might go away for 30 minutes or something like that. Everybody goes off, because I want them to have that time to, like, sit and work with it, and think about it, and reflect, and then we bring everybody back together again in order to debrief.00:56:39

And so that's how we've been… that's how we've been running…00:56:54

Again, like, just that first onboarding to help people just kind of wrap their heads around it. I think to your point, because I've seen other teams where they're just, like.00:56:58

sending people off into the unknown, and then it's like, it's an email, and you should look at stuff, and then give us feedback. And it's like, no.00:57:09

Hamel Husain

Yeah, one thing you should think about when you're doing these group sessions… Sorry, Hamil?00:57:16

Is, like, when you're doing these group onboarding sessions, is make sure that you are teaching people how to simplify.00:57:22

Trish Uhl

Yeah.00:57:29

Hamel Husain

for whatever reason, in group sessions, people can waste time and, like, try to overanalyze a trace, and you have to coach people, or say, like, okay, stop at the first upstream error, we already got it, move on. And people will be like, no, no, there's other things, but like, no, that's the way we do this, is we just move on.00:57:31

And so people will learn also how to be efficient. So there's, like, two things you have to worry about. One is…00:57:46

is the person doing a good job at their analysis? And the second thing you have to worry about is, are they being efficient? And you have to kind of coach them on both.00:57:53

Trish Uhl

Yeah, that's a really… that's a really good call-out. We often find that in the second piece, right? So, the initial setup, and then give people time to kind of, you know, go do it and… and wrap their heads around it, and then when we come back in that debrief, that's where that… those efficiencies… that's a… that's a really good call-out, yeah, that's where that comes…00:58:01

That's where that comes out.00:58:20

And just even hearing people's different perspectives of how it is that they looked at, and how, you know, what they were thinking about when they rated something, or why did they rate something that way, and so on and so forth. And again, just getting that kind of alignment across, you know, what it means and what it is that we're trying to do has been really helpful in the dialogue.00:58:23

Rob, you had a good question. Actually, you know, it's really funny, we had, I'm just looking at what you put in the chat with the golden datasets. So.00:58:43

So…00:58:51

one of the products that we have is actually a multi-reg, multi-reg pipeline, or pipelines, plural, that are wrapped with agents. So it's multi-agent, and it's multi-rag pipelines. And so one of the things that I have found to be really effective with that is…00:58:52

to really separate those things out in conversation with the team as a… as a… as two different components of the same system. And so then talking about, and Rob, you and I had a conversation about this the other day, about…00:59:11

you know, the RAG system is oftentimes straight-up troubleshooting and QA, right? Like, what we would think of more akin to traditional, testing when you get into early stages of that, and then separate that from the AI evaluations that will ultimately happen with the app itself, if you will, that's built on top of the RAG.00:59:26

So, as an example, the two different datasets are…00:59:46

Again, we're using Golding datasets and reference documents on the app side.00:59:50

to bring the humans together, so, you know, I've got multiple stakeholders, compliance, legal advice and guidance, so on and so forth, that ultimately have to come together and have input into what the output is gonna need to look like and what the constraints are around it. It can say this, it can't say that, it can use this word, it can't use that word. So we've been building golden datasets as reference documents, again, for the humans.00:59:55

Versus, as an example, going back to the mechanics of the RAG system and just figuring out01:00:20

you know, is the retrieval process working? Are the, you know, is the ingestion working, so on and so forth, is then doing the, you know, the much more simple, question and answers and breaking that down into the actual test script. So we've got01:00:26

So that's the other thing that I'm finding is clarity, and there are different types of golden data sets, depending on the thing that you're looking at, that you're evaluating. And are you using it in order to build a scalable system that you can automate, like the things that we've been talking about with the dashboards and stuff? Or are you using it in order to be able to use it as a tool to bring stakeholders together and be able to have a conversation on defining, ultimately, what01:00:41

You know, the definition of done or what good looks like.01:01:06

That was a mouthful.01:01:10

Hamel Husain

No, thanks for sharing01:01:14

Okay, I don't think we have… oh, is Robert, you have a question?01:01:20

Robert Lavigne

Yeah, I just want to follow up on what Trish was just saying there. So, given that scenario that you were talking, Trish, there, where you're bringing in and onboarding a group, is there something to be said about having, let's say.01:01:25

To your point, 10…01:01:37

consistent queries that all people would do the same, thereby knowing that the system would more or less come back with the same type of flawed or non-flawed answer, and you would kind of get a sense for whether or not your01:01:39

Master list of pass-fails, based on what you know those questions would be against the existing system, would be a way of potentially calibrating whether or not they get it.01:01:53

If you know what I mean. Meaning, you know, if you know you've got 5 fails and 5 passes in your test set that they're being onboarded with, and they're all going through that same set.01:02:04

You know, if you're all of a sudden seeing some deviations between people in… with that original01:02:14

training set, for lack of a better term, before you, you know, let them go out into the wild, is kind of what I was kind of leaning towards.01:02:20

Trish Uhl

Yeah, that's exactly right. That's exactly what we're doing, and that onboarding process is…01:02:28

You know, let's say it's the first 5 queries, as an example, and everybody's doing… everybody's running the same ones, so we might… we might walk through the same… the… we might walk through one… well, we usually walk through one together, so everybody sees it at the same time, asks questions, and then, you know, we give them another 4 or 5 that people then go off01:02:33

you know, we go off a Zoom, people go and just work at their desks, and then we, you know, have a scheduled debrief session, like a follow-up, like, 30 minutes later, and come back and talk about what it is that people found. But to your point, it's all the… they're all running the same thing so that we can, you know, get a sense of…01:02:53

exactly that, how people came at it, or have questions, but it's common. They're all running the same test script at that point.01:03:10

Robert Lavigne

You're, you're almost testing your, your, your…01:03:20

Those people the same way you would test the evals, making sure the evals are doing it right.01:03:23

Trish Uhl

That's right, yeah, it's a really good call-out. That's exactly right. We're getting the… it's more about getting the people ready versus the… and then we can, you know, move towards, again, more towards scale and automation and building out more of the tooling, and then using some of the tooling that's embedded within the01:03:28

But it's really interesting, I mean, there's a lot of…01:03:46

there's some arm wrestling over the first part, and that is, well, why can't we just, you know, Trish, why can't we just look at the dashboards? It's giving me a whatever right here, and I'm like, you don't know enough about your data and enough about those metrics that are embedded in that platform to know if that's useful to us or, you know, or not. And again, like, I thought.01:03:50

That was, a really great call-out early in the… in the program, Shreya and Hamel is…01:04:09

That point alone has made this program worthwhile, is you have to know your data. And that first part of the evaluation process has to be effortful.01:04:15

Hamel Husain

feel like I have succeeded in life.01:04:29

Thank you.01:04:32

Okay, I think we're at time, and… yeah.01:04:38

We have one more office hours left after this, so, thanks everyone for coming.01:04:44

And see you in the next one, if you can make it to that.01:04:48

Shreya Shankar

Hey, guys.01:04:54

Andrew Zaf

Thank you. See you both.01:04:55

Hamel Husain

Thanks.01:04:57

Trish Uhl

Thank you!01:04:58

Live session where instructors will address questions. Instructors may present answers to common questions, followed by live Q&A

[

Home

](/parlance-labs/evals/2025-3/home)[

Community

](/parlance-labs/evals/2025-3)