Optional: Live Office Hours 12
Oct 31, 2025, 10:00 PM - 11:00 PM GMT+5:30
Audio Transcript
Hamel Husain
Testing, testing…00:01:12
Alright, welcome everybody. This is our last office hours, so… excited to have everybody here.00:02:07
we can…00:02:21
Go ahead and do the same thing that we have been doing the whole time. You could raise your hand and go ahead and ask a question, don't be shy.00:02:23
There's not too many of you here this time, so it's kind of cozy.00:02:32
We do have one more session in this class, we have Isaac later today.00:02:36
But yeah, Abhishek, do you have a… Question?00:02:41
Abhishek Panda
Yes, I do. Before I raise my hand, you mentioned it.00:02:47
Okay, so, Hamel, I have a question, and I have a complaint for you as well: in the entire course, yes, I learned how to label the traces and generate, but it was all on conversation.00:02:53
But what about agents where some other form of thing is happening? The recipe example was all conversation, but let's say there is a different kind of interface, and it is agent-driven.00:03:06
For those kinds of things, Hamel, what is your suggestion? Shall we take screenshots and use some kind of multimodal LLM to process all of those? Because the text itself is not enough, right? So, for those kinds of tools, what would be your suggestion?00:03:22
Hamel Husain
Yeah, I mean, so the technique we teach, like, generalizes… to any kind of… agent.00:03:42
It doesn't have to be conversational.00:03:50
You still want to do the same thing of error analysis, look at the traces, you still want to render those traces in a way that makes sense to you. If you have visual elements or visual artifacts that are produced,00:03:52
you want to show those in your annotation interface. You probably want to build your own annotation interface that surfaces that to you so you can annotate those things. And if you do want to judge certain things, you may want to have those visual artifacts as well.00:04:04
So… Does that make sense?00:04:18
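A minimal sketch of the kind of custom annotation view described above, assuming each trace step carries optional screenshot paths. The `TraceStep` fields and the `render_trace_html` helper are hypothetical names for illustration, not part of any vendor tool.

```python
import html
from dataclasses import dataclass, field


@dataclass
class TraceStep:
    role: str                     # e.g. "user", "agent", "tool" (assumed roles)
    text: str = ""
    image_paths: list[str] = field(default_factory=list)  # screenshots, generated images


def render_trace_html(trace_id: str, steps: list[TraceStep]) -> str:
    """Return one HTML fragment that shows text and images for a single trace."""
    parts = [f"<h1>Trace {html.escape(trace_id)}</h1>"]
    for i, step in enumerate(steps, start=1):
        parts.append(f"<h2>Step {i}: {html.escape(step.role)}</h2>")
        if step.text:
            parts.append(f"<pre>{html.escape(step.text)}</pre>")
        for path in step.image_paths:
            # Show the visual artifact next to the text so the annotator sees both.
            parts.append(f'<img src="{html.escape(path, quote=True)}" style="max-width:600px">')
    # Free-text box for open coding; saving the notes is left out of this sketch.
    parts.append('<textarea name="notes" placeholder="What went wrong?"></textarea>')
    return "\n".join(parts)
```

The point is only that text and visual artifacts are rendered together per trace; how notes are stored and how images are served would depend on your own stack.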
Abhishek Panda
Yes, yes, this does make sense.00:04:23
So, and also, Hamel, I have understood, I mean, throughout the course, and I have watched the videos on Braintrust and Phoenix, and also, in my company, we are using Langfuse as well, but I realized at the end why you kept that custom annotation interface for the end, because00:04:26
I mean…00:04:45
Yes, you can upload it, but real time, if you want… I mean, not real time, when you want to capture where visual effects are there, and you want to capture those things, how would you take screenshots and put them in real time into Langfuse? So, you need some interface00:04:46
to capture those traces and all those screenshots and everything. And then, I think once you have that in a storage layer,00:05:01
we can attach some observability to it. So that's what we are working on, and with what I learned, I was speaking to my product manager, and we were envisioning it. Are we going in the right direction?00:05:10
Hamel Husain
Yeah, yeah, you want to get those, and you want to be able to view those in your annotation interface, 100%.00:05:23
Abhishek Panda
Got it. I mean, I have one more question. I think I've been asking you this multiple times, so you must have understood: synthetically generating data00:05:29
by taking those other forms of data as well, along with conversation. I understand that you gave a caveat on multi-turn conversations. Yes, we can do it: when we provide everything to an LLM, it will generate something, but how robust is that? We need to see that.00:05:37
Synthetic is fine, but what about online evaluation? Hamel, that's where I'm a bit confused: when I'm getting real conversations, right, my automated evaluators00:05:54
also need access to the conversation. Yes, from a storage layer, we can get the conversation. We have our own internal memory storage, but what about that other information, those visual elements?00:06:06
Do I need to scrape that in some way, and do we need to build some real-time rules so that… because it will all be a CI/CD process, right? So, in real time, you want to see if there is drift in the metric. So… what would you suggest?00:06:18
Hamel Husain
It sounds like the visual… you're saying the visual is important in understanding whether something successful happened or not?00:06:33
And so, then, yes, you would need to…00:06:40
feed that visual information into the judge, or into the eval itself. So you would have to set up some infrastructure where you could00:06:44
for real-time… For real traces, that you could easily pull that into your judge.00:06:51
Abhishek Panda
Got it, got it, got it. Because we have agents like a website builder, a site designer, content generation, okay? So, to analyze these kinds of things in real time, we have some automated evaluators. One of them is the job to be done, okay? Now,00:06:59
let's say you are a user, right? You went there and said, I want to build a website, and it's a cupcake business. But what if my website builder agent generates a mobile business template? You as a user won't be happy, right? So, those are the kinds of things we want to analyze, and I'm envisioning that, with what I have learned from the course, along with the conversation, we put in that multimodal data and00:07:18
do some tight software engineering where00:07:39
we can do it synthetically offline, of course, and also online, where we can send… I mean, of course, there are practices for sending that data effectively to the evaluation in real time.00:07:42
Hamel Husain
Yeah. I mean, so you have to be parsimonious and try not to over-engineer things, if possible.00:07:54
So, I mean, it is, you know, LLMs can read HTML,00:08:02
And they might be able to just discern certain types of errors. Like, they should be able to see if something is built for mobile or not without, like, visually looking at it.00:08:06
And you have to make a decision of whether, like, when visual is really needed and when it's not needed. And you want to, like, just like everything else, you want to try to keep it as simple as possible.00:08:17
at all times, and really think carefully about whether or not you need to add more layers and more complexity to your evaluation pipeline. Like, is it really necessary?00:08:27
It, you know,00:08:38
even all the way down to, do you even need an LLM judge? Like, can you use a code-based evaluator of some kind, you know,00:08:42
Abhishek Panda
Okay.00:08:49
Hamel Husain
on certain errors. So, like, you really need to, at all times, and this is not just AI, it's any kind of… anything you're building, but all.00:08:50
Abhishek Panda
Possibly.00:08:57
Hamel Husain
But also, here, with evaluators, too, like, try to make it as simple as possible.00:08:57
Abhishek Panda
Correct. I mean, of course, during the evaluator stage, you mentioned the pitfalls, like,00:09:04
we don't necessarily need an LLM judge for all agentic applications. If I am able to do it with some simple code-based checks, then I'm good. I don't need to build an LLM as a judge. I mean, that's not a necessary step in evaluation. That is one of the techniques, but not a necessary step, right? Is that what you are trying to say?00:09:09
Hamel Husain
Yes. And then also, if you do use an LLM as a judge, you might not need, like, vision language models. Like, do you really need to see the visual thing? You have to really, like…00:09:28
be critical and say, like, yes, I really need to see the visual, or no, you know what, I don't think I need to see the visual, because the LLM… if I show it the HTML, okay, maybe it's fine.00:09:37
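A minimal sketch of a code-based check on generated HTML, with no LLM judge and no screenshots, in the spirit of keeping the evaluator as simple as possible. It assumes the cupcake example from the question; the `check_site` helper and the keyword list are hypothetical, and a viewport meta tag is only a rough proxy for mobile-friendliness.

```python
from html.parser import HTMLParser


class _PageFacts(HTMLParser):
    """Collects visible text and notes whether a viewport meta tag is present."""

    def __init__(self):
        super().__init__()
        self.has_viewport_meta = False
        self.text_chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "viewport":
            self.has_viewport_meta = True
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_chunks.append(data.lower())


def check_site(html_source, required_keywords):
    """Pass only if every keyword from the user's request appears in visible text."""
    facts = _PageFacts()
    facts.feed(html_source)
    text = " ".join(facts.text_chunks)
    missing = [k for k in required_keywords if k.lower() not in text]
    return {
        "responsive_viewport": facts.has_viewport_meta,
        "missing_keywords": missing,
        "passed": not missing,
    }
```

For a request like "cupcake business", `check_site(page, ["cupcake"])` fails when the agent returns an unrelated template. Whether a keyword check is strict enough is something error analysis on real traces would have to confirm.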
Abhishek Panda
Got it, got it. One last question, Hamel, now coming specifically to image generators. Of course, there's HTML, as you mentioned earlier, but when it comes to an image generator,00:09:49
that actually goes to those multimodal elements, right, where we are trying to understand the image. What if it was some image-generator kind of agentic app: based on your input, it tried to generate an image, and you as a user are not happy, and that's what we are trying to visualize. In this kind of application, what would be your suggestion? Because00:10:00
even if we think about traces, labeling them, and whatever we want to do, right? So, like, if you…00:10:22
Hamel Husain
I'm gonna label, do, like, image,00:10:27
Image generation, you will do the same error analysis. In fact, error analysis was kind of, like, born almost in image generation for machine learning.00:10:31
Or, not image generator, image classification in machine learning. It was kind of like the canonical, like, sort of use case for… or that's, like, the… the 101, sort of, of,00:10:41
Of error analysis.00:10:52
So you want to do error analysis, you want to, like, do your open coding on, like, what's wrong with image generation. And then similarly, you want to see if you can create00:10:54
You know, you might want to use a vision model to see if you can align the judge.00:11:03
According to…00:11:09
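A minimal sketch of checking how well a judge (vision-based or not) is aligned with human labels from error analysis, assuming binary pass/fail labels on the same traces. The `judge_agreement` helper is a hypothetical name; it reports the true positive rate and true negative rate separately rather than a single accuracy number.

```python
def judge_agreement(human_labels, judge_labels):
    """Compare judge verdicts with human verdicts (True = pass, False = fail)."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be the same length")
    pairs = list(zip(human_labels, judge_labels))
    tp = sum(1 for h, j in pairs if h and j)          # both say pass
    fn = sum(1 for h, j in pairs if h and not j)      # human pass, judge fail
    tn = sum(1 for h, j in pairs if not h and not j)  # both say fail
    fp = sum(1 for h, j in pairs if not h and j)      # human fail, judge pass
    # None signals that the rate is undefined because that class has no examples.
    tpr = tp / (tp + fn) if (tp + fn) else None
    tnr = tn / (tn + fp) if (tn + fp) else None
    return {"tpr": tpr, "tnr": tnr}
```

Reporting both rates keeps a judge that passes everything from looking good on a dataset where most traces pass.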
Abhishek Panda
I see. Why do you see that's wrong? So, so it's the same process.00:11:11
Got it, got it, got it. Just one more. So, Hamel, when you were at GitHub, did you try to… I mean, did your team try to label all this? Because when you are in code, there could be many such errors, right? So, how much did you really do when you built the evaluation for GitHub Copilot? I mean, that could be infinite, right? Considering GitHub is a very large-scale application, and there are…00:11:15
many varieties of code and errors.00:11:39
Hamel Husain
So, how did you scale it? I think this question… Yeah, so GitHub Copilot, we did a lot of different evaluations. We had, like, a lot of code-based evaluations. We leveraged, like, a lot of the data on GitHub, ran their unit tests at scale.00:11:41
I can send you… I can put a video in the chat where we talk about it. There's a lightning lesson.00:11:57
Abhishek Panda
That, that was… thank you.00:12:03
Hamel Husain
Katya.00:12:06
Katya May
Hello? Oh, wait, I suppose I can turn on my video, hey. Hi. So, not a question, I just wanted to, I don't know, update you on where I am. I'm one of those from Cohort 2, completely new to coding, but I knew I needed this information, so…00:12:08
And I got stuck on the homework, too, last time, because, you know, brand new to coding. But I went away, took a month, and did a deep dive.00:12:28
I took Harvard University's CS50 and CS50 Python, just trying to gobble up the basics, and came back.00:12:39
I've been able to do all the homeworks, so I just wanted to, like… honestly, in Cohort 2, sometimes I was up until 3 in the morning, crying, because I couldn't set up my environment, but then one of the lovely00:12:48
classmates did this 30-minute video for us noobs, last time in Cohort 2, just walking us through how to set up the environment, and I was so grateful to him. But yeah, I was able to do all the homeworks on RecipeBot, so just, yeah, thought I'd give you an update, and now, for the next month, I'll just apply all that to the app00:13:06
I'm building, and see how we go. That's great.00:13:30
Hamel Husain
I'm really happy to hear that.00:13:33
Katya May
Thanks. Just wanted to say that.00:13:35
Hamel Husain
Yeah, really happy to hear that.00:13:40
Teeba, do you have any ques… do you have a question? I see your camera's on.00:13:45
Teeba Alkhudairi
Yeah, I… I don't know if it's a question, but the status of where I'm at, and maybe a question,00:13:48
But yeah, like, I'm at a company that's around 3,000 employees, and everyone's doing AI projects, and I'm, like, snooping in, because I was supposed to do an AI product, but it got delayed, so I'm snooping in with other…00:13:57
teams, and I see, like, just doing the traces, I'm like, oh my god, like, how is this even in market?00:14:13
So, I'm now stuck at a point… well, not stuck, but it's been hard with the…00:14:20
Excel sheet way of doing things, especially because the JSON, it's hard to read, and it's very, you know, time-consuming, and you're missing stuff. Especially with data, you need to go back to reports, and check that they built the right reports, and all that.00:14:27
So, I'm trying to work with the AI governance team at our company to see, like, what tools we can use, but they feel like they're being bombarded, like, by so many teams for so many AI requests. So something that we're… I'm gonna try is build an agent.00:14:41
with Gemini, to see if I can…00:15:00
you know, make this faster and easier. I don't know if anyone's tried that before.00:15:04
I was just gonna…00:15:09
Hamel Husain
What does that, what does that involve, like, when you say build an agent, like, like a standalone one? Like, a prototype of something?00:15:11
Teeba Alkhudairi
Yeah.00:15:18
Hamel Husain
Okay.00:15:19
Teeba Alkhudairi
a custom UI, maybe. But I think at a company this large, and with many different use cases, it's better to have00:15:20
an external evaluator, right? Have you seen that? Or is the more custom UI00:15:29
a better way of going about this?00:15:36
Hamel Husain
Oh, I see the question.00:15:38
It can be often easier to start out with, like, one of the vendors, if… just to get your data in there.00:15:42
and get started.00:15:48
Eventually, you'll most likely start creating your own interfaces and things like that, but at least getting your data into a database of some kind.00:15:50
In the limit, like, what I usually do, even though I sometimes start off with vendors when I work with companies, because I do consulting for evals, is, you know, we start off…00:16:00
in, like, one of these vendors, and then we just end up using it as, like, a database for the, like, a custom front-end UI that we end up making for annotations.00:16:14
But we usually… so we don't throw it away. It's still useful.00:16:24
And it's still, like, useful to have it around, like, sometimes you just want to quickly search a trace in a different way, or do something, or pass it to a colleague, and…00:16:28
You don't want to build all the user interface stuff that they have.00:16:37
So it can kind of, like, your custom UI, custom annotation thing can just complement the vendor. So I would say.00:16:41
It's… Yeah, it's kind of a nice way to get started.00:16:49
Teeba Alkhudairi
under.00:16:54
Hamel Husain
Yeah, with the vendor. And, like, choosing the vendor, it kind of depends on, like, what your tech stack is, like, what the skills on your team are.00:16:54
Like, you know.00:17:02
Are you… are they, like, is it a Python shop, TypeScript, some other language? You know, are they using any frameworks already?00:17:04
Like, if they're using LangChain already, then maybe LangSmith is a good…00:17:13
If you… are you, like, a Python-heavy00:17:17
development shop, then, like, probably, like, Arize Phoenix would be better. If you need open source, then Arize Phoenix is good.00:17:20
If you, yeah. If you just want something, like, hosted, and you don't care about all those things, like, Braintrust is pretty good. So…00:17:28
And Braintrust has, like, a little bit more of a polished UI.00:17:37
Teeba Alkhudairi
When you say everything is hosted, you mean…00:17:43
Can you explain that?00:17:47
Hamel Husain
Oh, okay. So, like,00:17:49
like, with Phoenix, like, you would take that software, and, like, you figure out… you can use their, like, hosted version for you. They, like, host it for you, so you don't have to, like, have a server, and you don't have to, like, you know, host it. But sometimes, like, you want to host it yourself, because data privacy or something…00:17:52
is a concern. So, like, that… in that sense, like, Phoenix is flexible, because it's completely open source, so, like, you can take the code, and engineers can self… you can self-host it on your own infrastructure.00:18:09
But you might not need that. That involves some complexity, right?00:18:21
And so, if you don't care about, like.00:18:25
Yeah, just so it depends, like, what the…00:18:29
We just have to, like, look at, like, your company, like, okay, what the skills are, what the infrastructure is, what the stack is, and then, like, select a vendor.00:18:32
Teeba Alkhudairi
Got it.00:18:41
Yeah, that was helpful, thank you.00:18:44
Hamel Husain
Yeah.00:18:46
Vignesh?00:18:49
Vignesh Iyer
Yes, Hamel, thanks. Yeah, so there's a pipeline I've worked on in the past, which was basically some sort of ranking pipeline, an LLM workflow. Basically, there's a bunch of processes,00:18:51
business processes, that are available, and what I need to do is select, from the set of all the processes, the ones that00:19:08
are most valuable for optimization, like, optimizing those workflows with an LLM. A lot of the data regarding those workflows was in text, and that's why we're talking about LLMs and not00:19:22
classic ML ranking. So, I've got this pipeline, and the aim of the pipeline itself is scoring.00:19:38
It needs to score the processes, and the way we thought about it was we had two or three steps. The first was enhancing00:19:46
the workflow with more data on different metrics, like, is it more automatable, or, you know, different kinds of enhancements that we do to the data. And then the next step was to use a scorecard and say, we'd provide classification scores for00:19:57
these particular metrics to rank it out. And that finally produces a score, which we deterministically turn into a number based on enums or something.00:20:16
Now, my question is, if I had to think about evals for something that was already kind of00:20:27
a judging kind of step, and if I had to perform error analysis, it's already hard for me to think about what's better00:20:34
than the other.00:20:45
Hamel Husain
So if you have a pipeline that's supposed to produce a score, numeric score, and you're trying to rank, ultimately, like…00:20:47
You may wanna, you know… You may want to consider a classic machine learning component somewhere in there.00:20:54
Like, you know, you want to… you're trying to learn to rank.00:21:01
You're trying to… you know, producing a score is fundamentally a regression.00:21:04
And so, you know, you mentioned you have, like, an enhance stage and a scorecard00:21:10
stage, sort of, where you're trying to instruct the LLM to give you a score. I would say enhance is fine with an LLM. Producing a score with an LLM is kind of fraught with a lot of problems. You might want to turn that into, like, a regression00:21:15
with classic machine learning.00:21:29
And then, now you're in a much easier regime when it comes to evaluation. Now you just evaluate the regression model against some kind of ground truth that you might have, like some score, or, if you're trying to rank, you can…00:21:31
evaluate it strictly like a ranking problem, and not as an open-ended, you know, AI problem.00:21:48
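A minimal sketch of evaluating the pipeline strictly as a ranking problem, assuming you have ground-truth relevance labels for each process. NDCG@k is one common ranking metric chosen here for illustration; the speaker does not name a specific metric, and the item IDs in the usage note are placeholders.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain for relevances listed in ranked order."""
    return sum(rel / math.log2(position + 2) for position, rel in enumerate(relevances))


def ndcg_at_k(predicted_scores, true_relevance, k):
    """NDCG@k for one ranking.

    predicted_scores: {item_id: score from the pipeline}
    true_relevance:   {item_id: ground-truth relevance, higher is better}
    """
    ranked_items = sorted(predicted_scores, key=predicted_scores.get, reverse=True)[:k]
    gains = [true_relevance.get(item, 0.0) for item in ranked_items]
    ideal_gains = sorted(true_relevance.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0
```

A call such as `ndcg_at_k({"p1": 0.9, "p2": 0.5}, {"p1": 3, "p2": 2}, k=2)` compares the pipeline's ordering with the labeled ordering. The labels still have to come from somewhere, which is the data requirement raised in this exchange.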
Vignesh Iyer
Makes sense. I think just two points on that. I think the regression…00:21:57
the only issue we had with the regression was the data. Obviously, it becomes classic machine learning, right? So you need a decent amount of data, and there weren't too many data points, and a lot of them were kind of textual as well. And then, though we could use, like, textual.00:22:01
Hamel Husain
You could use textual inputs, I guess you just need a label. You need, like, what the actual score is.00:22:19
So.00:22:26
Vignesh Iyer
Got it.00:22:27
Hamel Husain
Yeah.00:22:28
Vignesh Iyer
Yeah.00:22:28
And the other last part was, we were actually not using numbers to score, we were kind of scoring in classification.00:22:29
kind of, like, generating a classification, which mapped, like, to an enum, but I still get… it still applies, what you mentioned, like, we could go that route, but just to clarify, like, we weren't, like, asking the LLM to produce a number, because, yeah, we did find that that was…00:22:37
very,00:22:54
Yeah, random. So it was, like, a classification, like, hierarchy, but it produced a text of that, which we mapped deterministically to, like, a number. So, like, good, very good, something like that, right?00:22:56
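A minimal sketch of the deterministic label-to-number step described here, assuming the LLM returns a text label per metric. Only "good" and "very good" come from the discussion; the other labels, the metric names, and the weights are hypothetical.

```python
from enum import IntEnum


class Rating(IntEnum):
    POOR = 1        # assumed label
    FAIR = 2        # assumed label
    GOOD = 3
    VERY_GOOD = 4


def parse_rating(raw_label: str) -> Rating:
    """Map the LLM's text label to a Rating, failing loudly on anything unexpected."""
    key = raw_label.strip().upper().replace("-", "_").replace(" ", "_")
    try:
        return Rating[key]
    except KeyError:
        raise ValueError(f"Unrecognized label: {raw_label!r}") from None


def process_score(labels_by_metric, weights):
    """Weighted sum of per-metric ratings; the weights are an assumption."""
    return sum(weights[metric] * parse_rating(label) for metric, label in labels_by_metric.items())
```

Raising on unknown labels, rather than defaulting to a middle value, makes malformed LLM output show up in error analysis instead of silently shifting the ranking.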
Hamel Husain
Yeah, I mean, you know.00:23:13
you might want to consider classic machine learning for the final stage of some kind. Doesn't mean you don't use a language model, you could always fine-tune a language model to, like, do something and still, like, adapt the language model to, like, do regression or do… produce a score. But you need to have either… there's no escape from having data, so…00:23:18
Vignesh Iyer
Yeah. You gotta have data to train, you gotta have data to evaluate, you gotta have data somewhere, so you have to assemble this data somehow.00:23:37
Hamel Husain
In some fashion.00:23:43
But yeah, I think… Consider classic, classic ML.00:23:45
Vignesh Iyer
Just reframe the problem, yeah. Awesome. Thank you.00:23:51
Hamel Husain
Vladimir.00:23:56
Vladimir Rodeski
Hey, Hamel, good to see you.00:23:59
Hamel Husain
Yeah, likewise.00:24:01
Vladimir Rodeski
Alright, so in the past few months, I've been working on this00:24:03
AI agent that basically allows you to backtest your trading strategies. You describe to the agent what kind of technical indicators you want to use, let's say, whether it's trend following or mean reversion,00:24:09
and which assets you want to test it on, and then it creates a backtest, executes it, and shows you the results: this is how much return you would have generated, the maximum drawdown, things like that.00:24:22
And it's working. One issue that I ran into is that it generates a lot of different backtests, and then I want the agent to analyze the results and come up with some conclusions, like, okay, these types of strategies work for these types of assets, or don't work for those types of assets, and things like that.00:24:35
So my question is, right now, these results are stored in files,00:24:54
and I want to look them up. Does it make sense to index those files using what we discussed in the class, or should I add a database and also store those structured results there? Like, how do I decide whether I need an extra layer, or try to reuse the existing thing?00:25:01
That's my first question. And the second question is,00:25:19
I also want the analysis agent, which could be the same or a different agent, to look at the results, and then form maybe new hypotheses, and also come up with an evaluation of why these results are happening, and what the next result should be. And I want this analysis to persist over time, so I build a knowledge base00:25:22
that you can then query, and this knowledge base can grow. So, I was thinking of using embeddings for that, and storing it in a vector DB.00:25:42
Yeah, so these are two questions.00:25:54
Hamel Husain
Okay. I think your first question is, like, should you… how should you store the data, or something like that? Like, you have all these analyses, and, should you…00:25:58
Vladimir Rodeski
Yes.00:26:07
Hamel Husain
I mean, you need a database of some kind. Database is, like, a loose term, like, anything is a database, conceptually. It depends, like, how formal… how much you want to formalize it. Like, you know, it depends, like, how much scale you have.00:26:11
you know, whatnot.00:26:25
Vladimir Rodeski
Yeah, so in this case.00:26:31
Hamel Husain
It sounds like you need to store the data and access it, so, whatever…00:26:32
Vladimir Rodeski
But then you cannot run some queries, for example, what are the best-performing metrics? You would have to scan an index and do the thing.00:26:36
So… Yeah, sorry.00:26:45
Hamel Husain
Oh, okay. Yeah, you might want to…00:26:49
You might want to do a database, because it sounds like you have some data inside each document that you want to pull out and have more readily available. You don't want just, like, a folder of documents, so it sounds like a database might be good. We could do some extraction and have, like, a schema of some kind.00:26:55
So it seems fine, as far as, like.00:27:14
So, I think you mentioned vector database.00:27:19
Vladimir Rodeski
Yeah.00:27:22
Hamel Husain
It's not…00:27:25
It's not, like, clear yet from the conversation that, like, vector database will solve your problems, I'm not sure. Like, do you need to semantically search00:27:29
Over documents, I don't know. Doesn't sound like you are, maybe you do.00:27:39
Vladimir Rodeski
I think so. One issue that I ran… I was just saving those results into files, and then I started running out of context window, because I said, okay, it tried to read the entire thing, and it's just too much.00:27:44
So, I want to be able to say, you know, here's the next area of research that I want to do, let's say. I want to dive more into mid-frequency trend-following strategies on crypto.00:27:57
Hamel Husain
I feel like a vector database is not necessarily the magic bullet for you, because you mentioned that you have, like, a lot of structured data as well.00:28:13
Vladimir Rodeski
Yeah.00:28:22
Hamel Husain
Like, you want to pull data out of the documents, and so you want, like…00:28:22
you want to give the LLM the ability to search in many different ways, not just semantically, but also, like, write SQL queries, write all kinds of different queries over your data, and you might want, like.00:28:27
We don't know what will work until we start doing evals, but, you know.00:28:40
You will likely end up exploring a lot of different approaches, including, like, hybrid search.00:28:46
not doing semantic search at all, just using, like, traditional search. You know, it's, like, starting from the most simple thing all the way to, like, more complicated.00:28:52
And… You might want to choose a database that…00:29:03
is flexible. I know, like, Postgres, for example, has, like, vector database… vector search in it.00:29:07
You know, but you can start with just using it as a database,00:29:16
as it was originally intended.00:29:20
And then see if that works.00:29:23
Like, if you give your LLM the ability just to run normal queries, does it find what you want? Maybe it does. If it does, then don't worry about vectors.00:29:25
But…00:29:35
You might find that, okay, there's some semantic information you want to query, and then you have to think about how you want to…00:29:36
expose that to the LLM.00:29:43
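A minimal sketch of the "start with a plain database" idea, assuming backtest results are extracted into a fixed schema and the LLM is given a read-only query tool. SQLite stands in for Postgres to keep the example self-contained; the table layout, the `run_readonly_query` helper, and the rows are made-up placeholders, not real results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE backtests (
        id INTEGER PRIMARY KEY,
        strategy_type TEXT,   -- e.g. trend_following, mean_reversion
        asset_class TEXT,
        total_return REAL,
        max_drawdown REAL,
        notes TEXT            -- free-text analysis, a candidate for later semantic search
    )
    """
)

# Placeholder rows for illustration only.
rows = [
    ("trend_following", "crypto", 0.42, 0.30, "placeholder note"),
    ("trend_following", "crypto", 0.18, 0.25, "placeholder note"),
    ("mean_reversion", "crypto", -0.05, 0.40, "placeholder note"),
    ("mean_reversion", "equities", 0.12, 0.10, "placeholder note"),
]
conn.executemany(
    "INSERT INTO backtests (strategy_type, asset_class, total_return, max_drawdown, notes) "
    "VALUES (?, ?, ?, ?, ?)",
    rows,
)


def run_readonly_query(sql, params=()):
    """Tool an agent could call: only SELECT statements are accepted."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    return conn.execute(sql, params).fetchall()


best = run_readonly_query(
    "SELECT strategy_type, asset_class, AVG(total_return) AS avg_return "
    "FROM backtests GROUP BY strategy_type, asset_class "
    "ORDER BY avg_return DESC LIMIT 1"
)
```

Aggregate questions like "which strategy type performs best per asset class" become one query instead of a scan over files. If evals later show that the free-text notes need semantic lookup, vector search can be added at that point.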
Vladimir Rodeski
And…00:29:45
Hamel Husain
you know, it's a pretty deep topic, retrieval, RAG is a really deep topic. There can be a whole… there are, like, entire classes on that. It's not just one vector a lot of times that works. A lot of times you need multiple vectors per document, and you have, like, lots of different representations per document. It gets fairly deep.00:29:47
So I want to say, like, okay, just keep your mind open to…00:30:07
The idea that you will have to retrieve in many different ways.00:30:12
Vladimir Rodeski
Okay.00:30:16
Thank you.00:30:18
It was good seeing you. Bye.00:30:20
Hamel Husain
Yeah, yeah, likewise.00:30:22
Arjun.00:30:24
Arjun Murakonda
Hey, Hamel. Firstly, thank you very much for00:30:28
organizing this and, like, just generally putting out a lot of content. It has been super helpful to me, and Abhishek, who was speaking earlier in the call as well, so, thank you very much. I have two questions. One, I think.00:30:32
just slightly adjacent to… evaluations, probably, but quite00:30:46
necessary. Like, one, engineering, and two,00:30:52
organizational challenges, so… I kind of want to get your perspective on it.00:30:56
So, the engineering challenge. What we are basically building, I guess it's public now: we at GoDaddy are building this multi-agent system called Airo, and it's a…00:31:07
you have an orchestrator in the middle, and then it has a bunch of agents that connect to it. Collecting traces has been a big engineering challenge for us. We have our own eventing platform00:31:23
and schemas and all that, right? And then, when you think about a multi-agent system, you have 30, 40 agents connecting to it.00:31:36
We need to get traces from the00:31:44
browser side, the server side, the orchestrator agent, and then we have an internal service that actually does quite a lot of this for other agents. So there are quite a lot of systems that we need to collect traces from.00:31:47
And it just has been really challenging for us to get that end-to-end logging.00:32:01
Are there other teams or people that are facing similar challenges? How do we navigate that, right?00:32:07
Because that's the fundamental blocker for us to really get to scaling our evaluations. We're struggling to capture what's happening scalably at this point.00:32:16
Hamel Husain
Yeah, I don't know if this helps, but NurtureBoss, which is the example we use a lot in the course, when we were working with them, you know, their infrastructure is, like, distributed.00:32:28
So, you send a text message.00:32:41
In a conversation, you might send the next text message 4 hours later00:32:44
to continue that conversation. And the server that receives that could be a different one than the one that was originally hit with the previous turn.00:32:48
And so, they actually… had trouble, or we had trouble, like, using off-the-shelf00:33:01
things like Braintrust, like, cleanly in their codebase, because they had to think about grabbing, like, the thread,00:33:10
you know, and then, like, updating it constantly00:33:19
in the way that the vendor…00:33:22
Like, they had to, you know, work inside the lines of what the vendor offered.00:33:25
And so, you know, there's a lot of, like, extra steps and, like, some extra complexity that didn't…00:33:31
sort of… work well with their system architecture. So they ended up just building their own observability,00:33:37
like, just completely, and it was, like, much faster, and they were able to do it. I don't know if that helps, I don't know if that's part of any problem, but that's something to think about.00:33:45
Observability is hard. It can be hard at scale.00:33:54
It's because you have to instrument your system,00:33:58
especially with a lot of components.00:34:01
Hopefully… it's hard, like, if you don't start off with it00:34:04
and you're trying to, like, kind of do it after the fact. Sometimes it can be hard.00:34:12
I don't really have any…00:34:18
solution, per se. One thing I can say is, like, sometimes…00:34:21
If observability is insurmountable, it is a smell that the code is too complicated.00:34:27
And, like, the thing, like… and so you might wanna…00:34:34
I don't know enough about anything that is going on inside of what you're building, but something to pay attention to if you feel like you can't instrument things.00:34:37
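A minimal sketch of one generic way to approach end-to-end logging across many components, assuming every component can attach a shared conversation or trace ID to the events it emits. This is not the approach any team in this discussion actually used; the `Event` fields and helper names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    trace_id: str     # shared ID propagated across browser, server, orchestrator, agents
    source: str       # which component emitted the event
    timestamp: float  # seconds since epoch, assumed comparable across components
    name: str


def stitch_traces(events):
    """Group events by trace_id and order each group by time."""
    traces = defaultdict(list)
    for event in events:
        traces[event.trace_id].append(event)
    return {tid: sorted(evs, key=lambda e: e.timestamp) for tid, evs in traces.items()}


def missing_sources(trace, expected_sources):
    """Report components that emitted nothing for this trace (an instrumentation gap)."""
    return set(expected_sources) - {event.source for event in trace}
```

A gap report like `missing_sources(trace, ["browser", "orchestrator"])` gives a concrete list of which components still need instrumentation, which can be easier to act on than a general request for better observability.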
Arjun Murakonda
Yeah,00:34:50
I think we can certainly talk more about it, but don't want to take too much time. Just, like, one more rev on this, and then we'll switch to the other question, I guess.00:34:53
essentially, we have quite a lot of… we're taking quite a lot of the pain out of the equation. We are, like, to your first suggestion, essentially, we are building it internally, and we are considering evals as a separate, you know, tool that we can plug into, right?00:35:01
That's how we are approaching it from an engineering challenge perspective. And then, to…00:35:16
We are trying to take the pain out of the equation by actually doing quite a lot of auto-instrumentation.00:35:25
Because there's multiple SDK choices that we offer to our agent teams, and for whatever reason, because we wanted to move fast, now we are in that place where we are trying to really serve multiple users. So, yeah, I think,00:35:32
Seems like, based on what you're suggesting, just because I think this first suggestion is something that we're on.00:35:52
But second one, I think, is something I need to still suss out more. And maybe this is segue into the other, like, really, no question, I guess, sorry, it's just more context. If there's anything, I'll stop there, if you want to add more.00:35:58
But,00:36:12
If none, I think then a segue into this is, essentially, we need to… the engineering problem… that's the engineering problem, but the organization problem is that we…00:36:14
need, like… it's like we need to sell evals to these teams for them to actually take observability seriously. They're stuck in the mindset of just observing browser-side events or clicks and stuff like that, and when you're building this multi-agent, right, you're00:36:24
capturing a lot of the clicks and inputs into it, but they're not really building for, like, the actual00:36:41
chat-based invocations and application style, right? So, we're not really collecting all that. So it's been quite a lot, a huge organizational challenge for us to sell them to00:36:49
sell them on the need to do observability, so it's like, how do we… how are people navigating this?00:37:01
Hamel Husain
Is there any team that's willing to do observability?00:37:07
Or is it kind of across the board, like, no one wants to do it?00:37:10
Arjun Murakonda
There are a couple, I think they're stuck in, like, building funnels and stuff like that, but not, like, understanding how to improve their agents, you know, like, they're building those more, yeah.00:37:14
Hamel Husain
Yeah, I mean, the best way to sell evals is to not…00:37:30
say the word evals, and it's to show… some results.00:37:33
And those results can be… they can just be like, hey, I found these issues systematically.00:37:39
You know, with your agent.00:37:45
and kind of keep coming with those insights, and then also have results, like, we fixed XYZ, ABC.00:37:48
And then eventually people will say, like.00:37:55
you know, kind of a side effect is like, oh, like, how are you doing that? Okay, this is…00:37:59
We're using this evals process. Now, prerequisite for that is data.00:38:05
You have to capture the data from somewhere, you can't… there's no way you can do evals without data.00:38:10
So I would almost pick… see if you can find, like, one team somewhere that has something close enough.00:38:15
And just…00:38:23
try to supercharge that team, like, do the… do this analysis even for them. And then, like, show it to everyone else, and show, like.00:38:25
the progress that they made to everyone else. Without even saying the word eval, saying, like, this is, you know, and then say, okay, if you want to do this, this is what you need to do.00:38:36
And don't bring up the word evals first.00:38:45
just…00:38:47
walk your way there, like, first, like, make sure… they're like, oh, okay, this, like, data thing, let me do that. And then, slowly, like.00:38:48
When they wake up one day, they're doing evals, but… Start, start, like, slow.00:38:57
Arjun Murakonda
Yeah.00:39:04
Perfect. Okay. We tried to push those POCs, so, no, I absolutely appreciate that, yeah. Thank you.00:39:04
Hamel Husain
Yeah.00:39:14
Pardeep?00:39:17
Pardeep
Hamel, first question, is this the… is this the last class of this course?00:39:21
Hamel Husain
Yes, yeah, this is the last office hours. There's a session later today, I think at 3 p.m.00:39:27
Pacific.00:39:32
Pardeep
Okay, cool. The question I have is, I have been working on another project, which is,00:39:33
you know, Whisper-like application, which is using Whisper in the end, which is the voice agent. One thing we're realizing is this is super expensive, so we are actually trying to use a local00:39:40
implementation of that, but now the tracing becomes really hard if you have to trace those things. So have you thought about a situation where either the small language models, let's imagine, which are natively sitting on a user's laptop or native devices.00:39:52
Then this becomes a little harder as a project to do, which is, you know, how do you collect these traces, and how do you manage that?00:40:05
Have you thought about that a little bit? Just wanna see what do you think.00:40:13
Hamel Husain
Yeah, it depends, like, you know, is phoning home every so often a… You know, an option?00:40:18
Doing it locally is totally reasonable, I think that makes a lot of sense.00:40:27
But, hopefully you can collect the logs.00:40:33
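A minimal sketch of the phone-home option discussed above, assuming traces are buffered on the device as JSON Lines and uploaded whenever a connection is available. The function names `log_trace` and `flush_traces`, the file layout, and the `send` callback are all hypothetical; `send` stands in for whatever upload mechanism the application already has.

```python
import json
from pathlib import Path
from typing import Callable, List


def log_trace(buffer_path: Path, trace: dict) -> None:
    """Append one trace to a local JSON Lines buffer (works offline)."""
    with buffer_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(trace) + "\n")


def flush_traces(buffer_path: Path, send: Callable[[List[dict]], None]) -> int:
    """Try to upload all buffered traces with `send` (a hypothetical uploader).

    Returns the number of traces uploaded. If `send` raises, for example
    because the device is offline, the buffer is left untouched so the
    traces can be retried later.
    """
    if not buffer_path.exists():
        return 0
    with buffer_path.open(encoding="utf-8") as f:
        batch = [json.loads(line) for line in f if line.strip()]
    if not batch:
        return 0
    try:
        send(batch)
    except Exception:
        return 0  # keep the buffer; try again on the next flush
    buffer_path.unlink()  # clear the buffer only after a successful upload
    return len(batch)
```

The design choice here is that nothing is deleted locally until the upload succeeds, so a dropped connection costs a retry rather than lost traces.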
Pardeep
So, okay, so here there are two types of logs that you possibly can collect. One is the translated00:40:39
logs automatically, then that means you are not evaluating how Whisper is doing, because, let's imagine that that is the technology. But then, once you have that, then you have some translations that run, and then you do some LLMs on top of that, or SLMs on top of that, which is, you know, to translate into a summary, or, you know, take all these sections.00:40:46
So even if it was in the cloud, do you recommend even…00:41:06
Evaluating the quality of translation, or that is just… we should assume that this is, like, a technology, it works?00:41:13
Or would you ever recommend that, or would you not recommend that?00:41:20
Hamel Husain
Yeah, I'm gonna make you laugh at my answer, you should do error analysis and see, like, if that is actually a problem you should care about.00:41:23
And the reason… it's… and I'm not saying that just to be funny, I'm saying that because…00:41:30
the models are really good at voice… voice-to-text now. Right. And it's quite possible it's not a problem.00:41:38
That you are even having?00:41:47
I mean, theoretically, you could eval it, but it's, you know, if you're doing error analysis of your system.00:41:50
I would guess, maybe, that it wouldn't be your most burning problem. It's just a guess. Like, my intuition wouldn't be, oh, like, that is gonna be a problem.00:41:57
And so, maybe not. And so… I would see…00:42:09
Yeah, and so then you might not need that engineering complexity if you're doing error analysis and you're… you can see, like.00:42:16
It doesn't seem like transcription is a problem, so let's not worry about it.00:42:23
It seems like, yeah… I…00:42:29
I like, like, this, like, transcribing locally, it makes sense. Like, I use Wispr Flow, and I can tell the way it works is that, like…00:42:36
It's recording your voice, it's like… and it's pushing it to the cloud, and then you get a transcription back.00:42:44
But that often fails when I don't have connection, and that's very annoying, so…00:42:51
Pardeep
Yeah, plus, you know, if you're transcribing something for an hour long when people are moving, then that also becomes really hard.00:42:56
We also realized the quality of the transcription actually is different. For example, testing out, you know, OpenAI models versus…00:43:02
you know, ElevenLabs, and I think there is a new company which is now claiming, I think they just raised, you know, I think a $100 million funding round.00:43:10
Hamel Husain
Hmm.00:43:19
Pardeep
Absolutely.00:43:20
Hamel Husain
What are you talking about? Yeah.00:43:20
Pardeep
you know, personalizing the voice itself, and I think people are getting more excited about that, which, to be… like, personally, I wouldn't ever want to do that. I wouldn't want to personalize voice-voice and get into that legal trouble, but people are getting excited about personalization. Now, evals for those things become harder and harder, because00:43:21
I've also heard people saying, you know, I was talking to one of the doctors, they were like, no, you know, this didn't work because they didn't personalize my voice that well. I'm like, wow, wow, like, why would you ever want to do that as a doctor, first of all? Second is, now, the only person who actually can evaluate is probably you. You have to, you know.00:43:41
Record yourself enough time and say, did it work properly or not?00:43:59
But I think the complexity of evaluations.00:44:04
you know, they have been increasing, and I'm wondering how you are keeping up.00:44:08
In this whole era, and where do we make a boundary to say, now, you know, this is not where you spend time on?00:44:13
Hamel Husain
Yeah, I mean, it's interesting, like, okay, like, with voice… Cloning… That's a kind of, so, like…00:44:25
What we talk about in this class is, like,00:44:38
a lot of times, it's, like, applied evals.00:44:41
Yeah, like, if you're building a product, this is how you do applied evals. If you're building a model00:44:46
that does something,00:44:50
that's a little bit of a different regime. If you're, like, training a model, you're building a model. You don't want to necessarily do what we're teaching, you want to go back into the classic machine learning phase. So, like, if you're trying to do voice cloning, for example,00:44:52
You know, there's a lot of metrics and a lot of other measurements that you can do to measure the coherence between the original and the generated.00:45:07
And you can do a lot of…00:45:17
like, analysis on that, and that's, like, a different regime completely. You don't need to do all this, like…00:45:19
open coding, axial coding, LLM as a judge, that doesn't even make sense, right? You want to iterate really fast using these, like, very specific techniques.00:45:24
And so… Yeah, I mean, I think…00:45:34
And so those will always be there. And you should always think, like, you know,00:45:40
you should always know, like, the problem you're trying to solve, like, it's not all LLMs all the way down sometimes. Sometimes you want to have classic machine learning somewhere, sometimes you want to do…00:45:46
just use code, and it's not really, like, an eval thing, so… But the core, like, this, like, core thing of, like, you invoke an LLM, it gives you some stuff.00:45:56
that's… the evals, it's, like, very general. It, like, works. It doesn't really have to…00:46:07
Doesn't change that much, like, the process of it?00:46:14
Hopefully we'll have better tools that help00:46:18
Kind of… have a human in the loop?00:46:21
To do it a little bit faster, but it's kind of… it's been the same, really.00:46:25
I don't find it's too…00:46:31
Too different. That's why we created the class.00:46:33
Because we found that we were just repeating the same thing over and over again to every client that we worked with, and so…00:46:36
We said, okay, like, it kind of deserves to be a class.00:46:43
Pardeep
Got it, got it. Okay, well, thanks, Hamel.00:46:47
Hamel Husain
Yeah.00:46:50
Manas.00:46:56
Manas Sur
Hi, Hamel.00:47:01
Hamel Husain
Hi. Thanks for the content.00:47:02
Manas Sur
And, I would like to say that I have been…00:47:05
doing some complex projects in the past as part of classical software engineering. I've never been this nervous00:47:11
shipping an agentic solution to customers, like I'm feeling now. But after this course,00:47:18
I think it gave me some direction in how to cover up all the scenarios and having a…00:47:24
more persistent system for the customers. Thanks for the content.00:47:31
And, yeah, so my first question is, it's a simple one.00:47:36
While I'm building online evaluation for my agentic system.00:47:41
So, it can have a workflow DAG with multiple agent nodes to come to a single outcome.00:47:46
Like, I'm creating code-based evaluators, and some as LLM judge.00:47:52
But, there are certain cases where maybe I have put more safe…00:47:58
evals right now, because I'm not sure.00:48:04
Like, while sending an email, I want to be double sure on the to address and from address, which come from a tool call.00:48:07
I don't want them to be swapped for any reason, so what I have done is I have created an eval where it will go and do a tool call again in a separate node, and it will check. If it fails, then we are failing the workflow.00:48:15
But till now, it has not failed. Is there a best practice where we can also go and remove evals00:48:30
from online evals, which are mostly the LLM judge evals,00:48:37
over time?00:48:42
Hamel Husain
Yeah. So it's basically, like, you can phase them out.00:48:46
or you can reduce the frequency at which you run the eval. If you find that the eval is always passing, like, 100% of the time, you know, the eval is not giving you that much signal, like, it's giving you the same signal every time. So, you know, instead of running it every time,00:48:50
you can run it, like, nightly, weekly, whatever, just to, like, have some backstop in there, so you can maybe catch regressions.00:49:06
But you can relax it all the way down. And you might even say, like, just turn it off, if it… you find that it never…00:49:15
is wrong. It's totally fine. It's not like software engineering, where… like, software engineering is, like, you have this comfort of, like, okay, like, these tests all pass. LLM…00:49:23
evals is more like risk mitigation. Does that make sense? And so the risk mitigation… there's never a 100% guarantee that nothing will go wrong. This is impossible. This is all just, like, how do we minimize risk, and…00:49:34
There's a cost to minimizing the risk, though. Like, eval is cost… it'll cost you some complexity to keep it running, some mental overhead. Same thing with, like, unit tests, but unit tests is, like.00:49:48
A little bit different.00:50:00
Like, you don't want to have too many unit tests either, and… you want to think carefully about, like, what you're testing.00:50:02
Same thing here, but it's also… but it's just a little bit different.00:50:07
Like, LLM evals can be a little bit more cognitive load.00:50:11
to reason about and maintain and manage and all that. So, just keep that in mind. It's totally reasonable just to, like, completely…00:50:16
sunset an eval, if it's… if its time is done.00:50:24
And I, I do, I do do that.00:50:30
Often.00:50:32
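A minimal sketch of the phase-out idea described above, assuming each online eval keeps a history of pass/fail results. The thresholds, the cadence labels, and the names `EvalHistory` and `choose_cadence` are hypothetical illustrations, not something prescribed in the session.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvalHistory:
    name: str
    results: List[bool]  # recent online results, True = pass


def choose_cadence(history: EvalHistory, min_runs: int = 200) -> str:
    """Suggest how often to run an eval based on how much signal it gives.

    Returns one of:
      "every_request"    - too little history, or the eval still catches failures
      "nightly"          - always passing; keep it as a cheap regression backstop
      "retire_candidate" - always passing over a long history; consider sunsetting
    """
    runs = len(history.results)
    if runs < min_runs:
        return "every_request"  # not enough evidence to relax yet
    if not all(history.results):
        return "every_request"  # the eval is still finding problems
    if runs >= 5 * min_runs:
        return "retire_candidate"
    return "nightly"
```

For a high-stakes check, such as confirming the recipient of an outgoing email, the thresholds would reasonably be much stricter than for a low-risk style check, since the point is trading risk against maintenance cost.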
Manas Sur
So, I have another one, the last one.00:50:34
For a task which might take longer, like, it can take days or weeks, use cases like you are trying to bring a renewal to closure from the start to end, which can take one month or a week or so. So there might be… my agents are trying to send reminder emails, responding to the customer's email, or making an agentic phone call.00:50:37
Or maybe a human in the loop, actual human, going and trying to have some conversation during this time.00:51:00
So now, if you see this complete flow, then there can… there are a lot of variation throughout this one month.00:51:06
And, like, each… so if you take the n-1 logic.00:51:13
then the evals can be… I don't know, it's like a huge number. I'm not able to think of an eval to cover, have more confident…00:51:17
metrics around eval here. So, is there any best practice where, like, to cover off all these long-duration agentic solutions?00:51:26
Hamel Husain
So it's a long-duration agentic solution. Is it able to keep… do you have any problems with context? How do you deal with context?00:51:39
Manas Sur
Yeah, so, like, anytime we are doing some conversation or transcript, we are trying to rank the important points and put into some vector store, that's the plan. And so whenever we are trying to reply to a customer, we go back and see what was the conversation through their vector database embeddings.00:51:48
Hamel Husain
So, with the… if you have long-running threads, it's totally reasonable to… Have different milestones.00:52:10
like, you know, sub-events that happen along the way. You're saying, like, it takes a month, okay? Like, can we break it down into, like.00:52:21
milestones. And then it's totally fine to do error analysis on those milestones, to say, okay, like, I can do00:52:29
You'd still be reviewing the whole trace, but…00:52:37
you can gate it to where, like, now the trace is ready for review once it's crossed, like, XYZ milestone.00:52:41
And then keep in mind, like, when you're doing error analysis, you're cert… you're kind of…00:52:48
stopping at the most upstream error you find.00:52:52
So you're gonna be… you end up, in these long-running conversations, you're gonna be focused first on the upstream stuff.00:52:56
And then you're gonna work your way down to, like, the more downstream stuff.00:53:02
That's the way that it should work, if you're doing the error analysis correctly.00:53:06
There's no escape, really, like, if this is long-running.00:53:10
What you might want to do is,00:53:14
So you have to do, like, error analysis…00:53:19
kind of intelligently, and see, like, what works for you. So don't be, like, dogmatic about it, so…00:53:22
What you can do is, like… Look at the final result.00:53:29
see if… Anything seemed, like, wrong?00:53:34
If everybody's, like, happy, and the user's happy, and whatever, the result is fine.00:53:39
it's somewhat reasonable to say that this passed, right? You don't have to, like, necessarily read, like, top to bottom all the way. Sometimes I just read, like, something about the intent, the user intent, the goals, and then I look at the final outcome.00:53:44
Then if I see, like, oh, something funny is here, or I have some signal that something went wrong, that can happen because of the final output, it can happen because of the explicit user signal, that, like, okay, this wasn't… this could have been better somehow, then you can try to look at it.00:54:00
That's one way to do it.00:54:17
Another way to do it is to do some analysis of your traces.00:54:19
And, you know, look at outliers. Say, like, okay, how many turns are in different milestones? How much time are we taking per milestone? What are some outliers?00:54:23
You know, like, oh, these, particular ones, they're having a lot of back and forth.00:54:33
In this particular milestone, or why is this… this… these traces are, like, this milestone's taking way longer than normal.00:54:39
So you gotta be… you can be intelligent in how you…00:54:47
search for issues. You don't have to get overwhelmed. You can kind of have targeted ways of looking at things.00:54:51
I'm not able to come up with all the different ideas for you, like…00:54:58
I'm trying to give you a sample of, like, how you might think about it. Like, cause your product probably has some…00:55:02
unique things that you can think of, of like, oh, okay, I can analyze the data.00:55:10
And do some data analysis of the traces to let me understand, even without reading them, per se, like, just, like, narrow down, like.00:55:16
ones with higher probability of, like, something is fishy. Does that make sense?00:55:28
So, that's maybe something to think about.00:55:33
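A minimal sketch of the outlier analysis described above, assuming each trace segment is recorded with a milestone label and a turn count. The field names (`trace_id`, `milestone`, `turns`), the function name `flag_outliers`, and the "three times the milestone median" rule are hypothetical; any metric such as elapsed time could be used in place of turns.

```python
from statistics import median
from typing import Dict, List


def flag_outliers(traces: List[dict], key: str = "turns", factor: float = 3.0) -> List[str]:
    """Return trace ids whose `key` is far above the median for their milestone.

    This narrows down which traces to read first; it does not replace
    reading them.
    """
    by_milestone: Dict[str, List[dict]] = {}
    for trace in traces:
        by_milestone.setdefault(trace["milestone"], []).append(trace)

    flagged = []
    for group in by_milestone.values():
        typical = median(t[key] for t in group)
        for t in group:
            if typical > 0 and t[key] > factor * typical:
                flagged.append(t["trace_id"])
    return flagged


example = [
    {"trace_id": "t1", "milestone": "reminder_sent", "turns": 2},
    {"trace_id": "t2", "milestone": "reminder_sent", "turns": 3},
    {"trace_id": "t3", "milestone": "reminder_sent", "turns": 2},
    {"trace_id": "t4", "milestone": "reminder_sent", "turns": 12},
    {"trace_id": "t5", "milestone": "negotiation", "turns": 5},
    {"trace_id": "t6", "milestone": "negotiation", "turns": 6},
    {"trace_id": "t7", "milestone": "negotiation", "turns": 7},
]
print(flag_outliers(example))  # ['t4']
```

Comparing each trace against its own milestone's median keeps a naturally long milestone, such as a negotiation, from drowning out an unusually chatty short one.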
Manas Sur
Yeah, thank you.00:55:37
Hamel Husain
Yeah.00:55:39
Vignesh.00:55:41
Vignesh Iyer
Yeah, actually, Hamel, just trying to brainstorm on what Manas is saying, and trying to see if I actually understand, like, he could correct me, or yourself as well. For, like, long-running tasks.00:55:43
I'm just thinking, it's long-running when it's actually in production, right? Can the problem not be rephrased to a short-running task, where those tasks do, actually00:55:57
run immediately after each other, while you're still doing the error analysis in the beginning. Like, maybe you're synthetically framing that process to kind of take place till the end.00:56:08
And if synthetic generation is, like, an option, you're kind of reframing it to kind of short, and then you still get the whole trace, and it looks like it's…00:56:20
a long-running task, but at the end of the day, it's just a workflow, right? And it's just that it was long-running because maybe you waited on someone to respond, if I understand correctly, but the agents themselves and the individual steps were not really long-running. They were doing their thing.00:56:31
Hamel Husain
Sure, yeah. So the idea is, like.00:56:50
You don't want to, like, okay, you want to pick some manageable…00:56:53
sort of way to review traces, right? Like, so if you just cut up all your multi-turn conversations into single turns, and just, like, review those single turns, like, that would be…00:56:58
intractable. Like, there would be a lot of noise. You'd be like, oh, like… Okay, but…00:57:12
You know, it would just get… there would be a lot of duplication, and a lot of just…00:57:18
It would be inefficient.00:57:23
On the other hand, if you go on the other extreme, of like, I'm only gonna wait until this month-long conversation is done.00:57:25
to then review it, you might be waiting too long, you know? Because, like, something meaningful has happened. So you want to try to pick a middle ground and say, like, okay, let me wait till enough has happened to where it does make sense to review. That's all I'm saying. So that's, like, that's this idea of milestones. It's like…00:57:32
pick some sensible, sort of, checkpoints in which you might then decide to look at the trace, and you can, like, you know, use that00:57:49
limited cognitive… Energy you have efficiently.00:57:58
Because there's a limited amount.00:58:02
Vignesh Iyer
Yeah, yeah, makes sense. Cool, thanks.00:58:06
Hamel Husain
Yeah.00:58:09
Trish, you got a question?00:58:15
Love the owl background, by the way.00:58:18
Trish Uhl
Oh, thanks! I came to listen, Hamel, I… this has been a really good conversation, like, it's just great to hear, like, how everybody's trying to apply it, and then just, like, just listening to the conversation and talking through. I don't think I had my hand up. Did I have my hand?00:58:21
Hamel Husain
No, no, I.00:58:35
Trish Uhl
Oh, no, okay. Oh, thanks.00:58:36
Hamel Husain
No, no problem.00:58:38
Aditya.00:58:39
Has a hand up.00:58:41
Aditya Sethuraman
Hi, Hamel. Thanks. So this is a question for you, and I think Shreya isn't here. It's not a technical question about the course. Thank you so much for the course, it's fantastic. I've learned a lot. My question is for you, what is your content diet?00:58:43
I ask this to everybody who I meet.00:58:58
Is there… because I can see you're clearly pretty prolific, you're… you've helped us so much with this course, you're very busy.00:59:01
So, what is it that you do to…00:59:08
Hamel Husain
Yeah, my content diet is, like, 5 friends.00:59:10
Aditya Sethuraman
Okay.00:59:13
Hamel Husain
I let the 5 friends tell me.00:59:14
What to look at?00:59:16
And they… they tell me,00:59:18
Yeah, they, like, send me text messages, I'm like, did you see this? I send them some text messages, did you see this? And I just contain it, and I ask questions within those, like, group of people.00:59:21
And it's just people that I feel like, oh, we are on the same page of, like, what we think is interesting, and what we think is noise, and whatever.00:59:31
Okay.00:59:39
So yeah, I don't try to consume the whole fire hose.00:59:40
Aditya Sethuraman
Yeah, yeah, exactly. Okay, excellent. Thank you.00:59:44
Hamel Husain
Yep.00:59:47
I think Balki had a…00:59:48
Balki Nakshatrala
Yeah.00:59:51
Hi, Hamel. Thank you so much. I've been a passive listener, but at the same time, we actually work on these things, so I was able to translate some of these things and start working on evals in our products.00:59:53
I think at the end of it, I mean, if I just summarize, I just want to, you know, validate01:00:07
I think the feedback loop is the core thing here, and it starts with proper capturing of log traces, analyzing it,01:00:14
do the manual eval first, which is all the open coding stuff.01:00:23
And then, select as fine-grained LLM-as-judge metrics as possible.01:00:27
But even that… It would probably have to be recalibrated every…01:00:35
Few weeks or something, based on if you've seen more new open codes.01:00:42
Is that correct? There's also a re-evaluation of that?01:00:46
Hamel Husain
Yeah, yeah, you have to go back. It's not like a linear, you're done. You have to, like, go back, keep… this is like a…01:00:50
Little bit of a spaghetti.01:00:57
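A minimal sketch of what a periodic recalibration check could look like, assuming a fresh batch of traces has been labelled by a human and also scored by the LLM judge. Measuring agreement with true positive and true negative rates is one common choice and is an assumption here; the function name `judge_agreement` is hypothetical.

```python
from typing import List, Optional, Tuple


def judge_agreement(
    human: List[bool], judge: List[bool]
) -> Tuple[Optional[float], Optional[float]]:
    """Compare LLM-judge verdicts with human labels (True = pass).

    Returns (tpr, tnr):
      tpr - share of human-passed traces the judge also passed
      tnr - share of human-failed traces the judge also failed
    A rate is None when there are no human labels of that class.
    """
    if len(human) != len(judge):
        raise ValueError("human and judge must have the same length")
    passed = sum(human)
    failed = len(human) - passed
    both_pass = sum(h and j for h, j in zip(human, judge))
    both_fail = sum((not h) and (not j) for h, j in zip(human, judge))
    tpr = both_pass / passed if passed else None
    tnr = both_fail / failed if failed else None
    return tpr, tnr
```

If either rate drops on newly labelled traces, that is a signal to go back to error analysis and revisit the judge prompt, rather than treating the judge as finished.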
Balki Nakshatrala
Yeah. So, I think what has been great for me is, normally we get caught up with all these buzzwords out there, and I… I could never connect to these, how these 6 metrics or 5 metrics people talk about could convey the system's performance, so… but…01:00:58
did I now oversimplify it? Like, can you maybe caution me against, like, some things that, you know.01:01:15
I should further be looking out for, rather than oversimplify it, like, I…01:01:21
Hamel Husain
No, I mean, no, I mean, you know, I… I don't think you're oversimplifying it, necessarily, like, the most important part is looking at your data. The most important part is, like, that error analysis part.01:01:27
And, like, constantly going back to that.01:01:40
And, yeah, there's no linear path through the whole thing. When you… when you get really good at it.01:01:44
And you feel comfortable, you can break a lot of rules.01:01:50
But, or break a lot of the guidelines that we told you, because then you'll get intuition on, like, oh, okay, what is it that we are doing? We can't, like, go… we try not to throw all the mess on you of, like, this is how you can break all the guidelines, because then it'll be hard for you to learn.01:01:53
Balki Nakshatrala
Yeah.01:02:09
Hamel Husain
It'll be very confusing.01:02:09
But just keep in mind that, like.01:02:11
later on, once you get really good at it, and you're like, oh, I think I can, like, relax this one guideline that Hamel told me.01:02:14
That's good. That's… that's okay.01:02:23
Balki Nakshatrala
Got it.01:02:27
So, one final question. Since we don't get real questions until we actually face some new challenges, are we able to still post in the Discord later? I know we have the access to the material, but questions in future, I think that's.01:02:27
Hamel Husain
Yeah, yeah, you could definitely post in the Discord, you'll have access to that.01:02:42
Balki Nakshatrala
Okay.01:02:46
Hamel Husain
And then you also have access to that Delphi, too. You know, let us know how that works. We, you know…01:02:47
Balki Nakshatrala
What is Delphi? Sorry.01:02:54
Hamel Husain
Yeah, it's the AI assistant that we have.01:02:55
If you signed up.01:02:58
Balki Nakshatrala
for it, yeah. If not, look at your onboarding email. Okay.01:02:59
Hamel Husain
there's instructions, like, how you can request access. It's like an AI tool that we give you. It's basically like a GPT with Shreya and us.01:03:03
Balki Nakshatrala
I heard everything.01:03:12
Hamel Husain
loaded into it.01:03:13
Balki Nakshatrala
Thank you. I appreciate it very much. This made many things clear, and, you know, I feel confident about even hard AI problems.01:03:14
to be very realistic about stating where we are, even with execs, and, you know, I think before even we go to clients, so I think it has provided a lot of confidence for me and my team. Thank you.01:03:22
Hamel Husain
Thank you, really nice to hear that.01:03:34
Okay, one last question with Vignesh.01:03:36
Vignesh Iyer
Actually, another question, Hamel. I saw your mail about this hackathon that you're gonna hold.01:03:40
Hamel Husain
boosts.01:03:46
Vignesh Iyer
somewhere there, right? So, yeah, I'm not even close to that, there's no way I'm getting there. But interested in, you mentioned something about it being online, possibly, if it goes well.01:03:47
would be interested to know, like, how would you conduct a hackathon like this for evals? Like, what was the plan?01:04:00
Hamel Husain
Oh, I just meant to say, like, I'm just open-minded to doing it online, I haven't…01:04:07
thought about it yet. I don't even know if I have time to… to create something like that. I'm working with some other people.01:04:12
to do the in-person hackathon. I'm working with Brian, who's a TA in this course. He's been a TA in the previous cohorts more so than this one.01:04:19
And he's doing a lot of the logistics. Putting on a hackathon is very time-consuming.01:04:31
So I don't know. I don't know if I would do it online. If I were to do it online, it would basically be, like… it would basically be, like, Kaggle.01:04:35
Okay. In fact, if I did online, I probably would… I don't know, we might put it on Kaggle.01:04:45
So,01:04:49
it would look like that. If you don't know what Kaggle is, you can look it up, it's like.01:04:51
Vignesh Iyer
Yeah, yeah.01:04:55
Hamel Husain
You know, machine learning competitions.01:04:56
Vignesh Iyer
Makes sense.01:04:59
Cool, thank you.01:05:00
Hamel Husain
Yeah.01:05:02
Alright, thank you, everybody.01:05:05
It was really nice to have all of you in the course. Please, keep in touch on Discord.01:05:07
You can also find me on X, LinkedIn, whatever your favorite social network is.01:05:13
Really happy to keep in touch with you there. Thank you.01:05:18
Balki Nakshatrala
Thank you.01:05:23
Aditya Sethuraman
Thank you.01:05:25
Live session where instructors will address questions. Instructors may present answers to common questions, followed by live Q&A.