Episode 40 | Why Your AI Agents Go Off the Rails — and the Harness That Saves Them | Ankur Bhatt | WRKdefined Podcast Network: Conversations Pushing The Boundaries of Work

Ankur Bhatt, Head of AI at Service Titan, joins Amy and Meg to explain why most AI agent initiatives die between demo and production, and what to do about it. Ankur has spent the last two years building production agents that handle high-stakes work like tax notices and payroll compliance, and he's published one of the most useful practitioner guides on the topic anywhere. The answer, he argues, isn't a better model; it's something called harness engineering. He breaks down why agents have "the cognitive ability of a PhD with the attention span of a two-year-old," the three failure modes that sink most deployments, and the six principles that turn probabilistic AI into reliable enterprise software. Plus: why writing code is no longer the bottleneck, why your next product probably shouldn't have a UI at all, and a Leadership Corner on managing peer egos when you're the most senior woman in the room.

⏰ TIMESTAMPS:
00:00 Cognitive ability of a PhD, attention span of a two-year-old
00:17 Meet Ankur Bhatt: VP AI @ Rippling, Head of AI @ Service Titan
01:22 From SAP/SuccessFactors to startup speed
03:25 What customers actually want from AI right now
06:32 The demo trap: six-day demo, six engineers, three months of fixes
08:23 What "harness engineering" actually means
09:47 Why architecture matters more, not less, in the agent era
13:01 Where the term "harness" came from (the Manus story)
15:47 Three failure modes: compound error, context overload, specification vacuum
19:16 Why agents are like ADHD partners: the executive-function problem
21:11 The six principles of harness engineering
24:01 The Montessori analogy: maps, stations, and skills
27:43 Why specs and PRDs matter more now, not less (planning mode)
28:55 Skills vs. hooks: what goes where
30:57 Building a skills marketplace inside your organization
35:45 The 10–20% problem: scaling individual productivity to a team
39:40 The new bottleneck has moved upstream
42:52 From features to agent experiences (the Karpathy home-control example)
45:22 The two layers of B2B agent design every leader misses
48:46 Leadership Corner: lonely at the top, surrounded by egos
49:26 Meg's "trust council" reframe
53:26 Where to focus your emotional energy (hint: not on changing your peers)
55:31 Managing egos as a core executive skill

🔑 KEY INSIGHTS:
- Why your AI agent goes off the rails: compound error, context overload, and specification vacuum, and how to design around all three
- The six principles of harness engineering, in order, starting with "give agents maps, not manuals"
- Skills vs. hooks: how to encode domain knowledge and enforce quality without overloading the model
- Why "spec before code" matters more in the agent era than it did in the human-engineer era
- The new SDLC: when writing code stops being the bottleneck, what becomes the bottleneck instead
- Why continuing to build point-and-click UIs may already be irrelevant, and what an "agent experience" looks like in B2B
- Leadership Corner: why peer loneliness usually isn't a peer problem, and how to build a trust council instead

📚 RESOURCES:
Ankur's article: Agentic Engineering: Why the Harness Matters More Than the Model: https://www.linkedin.com/pulse/agentic-engineering-why-harness-matters-more-than-model-ankur-bhatt-fyjwe/
Daniel Kahneman, Thinking Fast and Slow
Andrej Karpathy on the No Priors podcast (the home-control agent example)
Anthropic's "progressive disclosure" approach to skills
Rippling: https://www.rippling.com
ServiceTitan: https://www.servicetitan.com

🤝 CONNECT:
Ankur Bhatt: https://www.linkedin.com/in/ankurbhatt77/
Instagram: https://www.instagram.com/megandamyshow/
LinkedIn: https://www.linkedin.com/company/the-meg-amy-show

#AIAgents #HarnessEngineering #AI #AITransformation #FutureOfWork #SoftwareDevelopment #Leadership #MegAndAmyShow

Powered by the WRKdefined Podcast Network.

[00:00:00] It's a capable smart intern with the mind of a two-year-old, I sort of think. Or the attention span of a two-year-old. The attention span of a two-year-old. The cognitive ability of a PhD with the attention span of a two-year-old. Exactly. That is actually the right level I will say. Ankur Bhatt has spent the last two years leading AI engineering at some of the most AI-forward companies in enterprise software.

[00:00:24] Ankur. As VP of AI engineering at Rippling and now head of AI at Service Titan, he's built production agents that handle real customer problems. Tax notices, payroll compliance, things where getting it wrong costs real money. Everyone's excited about AI agents. The demos look amazing, but here's what most people won't tell you. Getting from demo to production is where most initiatives fail. Ankur just published what might be the most comprehensive practitioner guide I've read on why agents fail

[00:00:53] and how to fix them. The answer isn't a better model. It's something called harness engineering. Welcome Ankur. Hey Amy, hey Meg. Pleasure to be here. How's it going? It's been an AI-forward day as always. Exciting.

[00:01:21] I think everyone knows that we know you from SAP, and you spent like your entire childhood and young adulthood at SAP, many, many years inside a massive enterprise software machine. And then you did something quite bold and you moved to smaller AI companies, starting with Eightfold, then Rippling, and now Service Titan.

[00:01:48] And these types of companies are moving at a completely different speed. And I'm wondering what that just feels like. At SAP and then SuccessFactors, what I appreciated the most was the passion for customers and helping them through their enterprise transformation and running their business day to day. That is something which has stayed with me throughout my journey at SAP.

[00:02:18] Of course, nowadays, forward-deployed engineering is cool. I was maybe one of the first forward-deployed engineers for our nascent product. In these startups, what I've appreciated the most is the passion for customers, the proximity and closeness to the customers. Something which you start to miss in a larger organization because the layers create this separation where you're not touching and feeling how your customers are feeling every day. I think it's always been important.

[00:02:48] But in this moment where there's a lot of change, there's a lot of new opportunity. I think it's almost like even more important to be really close both with your own hands on these tools, but also getting really close with customers to see how they're thinking about their jobs changing and how they're thinking about things that they're eager to have access to in this world of AI.

[00:03:11] So maybe tell us a little bit about what that looks like for you and how you take what you've done with customers in the past and turn that into a discovery in this new moment. The thing which I observe across the customers I interacted with at Eightfold, Rippling and now ServiceTitan is that there is a recognition that AI is important for them, but they're looking for somebody to guide them.

[00:03:39] And a lot of times it's less about the technology per se, but it is more about the change management of what does it mean for my org and my team and my business? How do I really maximize the benefit AI can offer me?

[00:03:59] And today, what I'm observing across the board is, I think the Claude and ChatGPT value is very well understood. They may not follow the drama around OpenAI and Anthropic as we do, but they do appreciate them as apps which they have day-to-day on their phone or in their browser, which they are using instead of traditional tools as a whole.

[00:04:28] If I'm able to do these things for my personal productivity, what does this mean for my team? What does it mean for my org? And then everybody starts to sort of ask these fundamental questions and look for answers. What ends up happening is that any product they look at, they're coming in with similar expectations. But there has been a big step up in the models in the last six months.

[00:04:57] And suddenly, the capabilities they offer people are amazing. Any AI capability people look at in the product has to have that depth, has to have that accuracy, and has to make them 10x or 5x or x more productive for them to be able to get value. And this is something which I'm finding every customer is grappling with and looking for solutions to.

[00:05:28] But they'll quickly be able to assess and say, okay, is this creating value for me? Or not? We have a very, like a feedback mechanism of thumbs up, thumbs down. Do you get a punch in the stomach every time somebody gives you a thumbs down? No, we were getting like stream of thumbs down or thumbs up, like on two channels. And like, we could see that people like when they like something, they were really giving feedback and they don't like something that like they were very clear about what didn't work.

[00:05:56] So that just shows that the rapid pace of AI, which they are using today, has changed their expectation of what AI they expect in a product, and what type of an outcome they're expecting the AI should give them. And they're being very demanding in terms of, is this just a fancy demo? Or is this really going to help me run my business, answer critical questions for me,

[00:06:25] or let me run my day-to-day operations in an effective manner? Yeah. And this whole concept of, you know, a demo versus is it really valuable is an ongoing theme, right, that we're seeing. And, you know, we've heard from some listeners of the show that they're

[00:06:50] seeing this happen in real life where, you know, there's all this talk of, hey, you know, one person did this in six days and it's already in production. But the reality is so much different that really it was a demo and then, you know, six engineers spent three months trying to fix all the issues and deal with all the hallucinations and the gaslighting. And so, you know, I think that your experience,

[00:07:17] Ankur, of not only building AI applications, but actually being the leader of the AI transformation within your organizations in terms of how best to leverage the AI tools, to build AI tools, is quite unique and quite exciting. And as a result, you have started publishing some pretty amazing content

[00:07:42] to bring specific experience and broaden it to the larger systems thinking that we need here within the AI space. Your most recent one is about harnesses and how the improvement of models was obviously a huge

[00:08:05] leap forward for us. But that is not enough, that really we need to be understanding the harness and be tackling harness engineering. And so first of all, can you tell us what the heck a harness is? If you talk to researchers, they got inspired by Daniel Kahneman's Thinking, Fast and Slow, and that's where the reasoning came in. When to do thinking, when to do quick answering, when to look up

[00:08:34] the web, they're becoming a system of their own, where they are making these judgment decisions on any type of inquiry we are throwing at them. And that suddenly has opened up capabilities where the models can interact with their environment and react against that environment. What does that mean? That means if model is making a call and getting a response back, it can analyze, it can reason over that response.

[00:09:03] It can then decide to continue based on its reasoning to make the next step and the next step. The loop will run and you will get an outcome rather than a response, which is a very different way of thinking from a chat-like experience. In a chat-like experience, you asked a question, you got an answer. Here, you are giving an intent, you're giving an outcome you're looking for, and the model is going

[00:09:29] to reason over it, until it's like this over-eager intern who will keep going on passionately and not stop until it has made you happy and given you the outcome you asked. Or has properly gaslit you into thinking you might be happy. You might, yes, of course. Task completion focus for sure, as I've come to learn that it really matters that you inspect what

[00:09:56] task you give it because it's so motivated to complete the task that it will go about it with so much enthusiasm and go off the rails against intent pretty quickly. I think what makes your experience so unique, Ankur, is that you have lived in the software

[00:10:19] development life cycle for your entire career, thinking in terms of software as the business that you're in and the products that you deliver. I also tend to think this is when architecture becomes more important, both understanding the architecture within the models themselves and how that's progressing, but also understanding your own architecture in interacting with that

[00:10:44] to get a result. And I suspect that's really where this harness idea becomes strong. To me, it feels very much like an unlock that we had when we started to think about object-oriented architectures and being able to really break apart bigger systems, but to scale, not just scale what the systems could do,

[00:11:09] but scale how the teams were able to build and to create products. So coming back to the harness piece, help us understand what is the audience for which a harness becomes no longer a nice-to-have but a must-have, and what are the kind of problems that having a strategy around harnesses become an unlock for productivity within the software development world?

[00:11:39] It's sort of like when we had web or mobile, right? We all got conditioned to the idea that when we transition from web to mobile, the interface is changed, the phone knew your location, so our expectations on opening an app like Uber or Lyft meant it knew where I was and where I needed to go. And there's like certain conveniences sort of came in when I moved from a browser-based experience to a

[00:12:07] mobile app experience, right? And it is similar to what an agent type of experience is going to create for all of us, wherein the software will understand our intent and run through this loop to get us the outcome. So first and foremost, people should think about what's happening when

[00:12:31] they're asking these systems to take actions on their behalf because they will see this everywhere. I'm sorry, I just really want to like clarify what a harness is for folks. So, you know, there's the model and then for a time we had prompts and now we have harnesses. And so harness is kind of

[00:12:55] like an uber version of a prompt. The term came from Manus, the company based out of Hong Kong which got acquired by Meta. That was one of the first ones who built such an initial harness because everybody was trying to figure out: if I give a model a task, give some tools to it, give an outcome, it will run and try to satisfy the outcome. And people started

[00:13:24] iterating around how to maximize this. How do you take this power of the model, its ability to call tools, take actions, but then get to a more deterministic outcome at the end. Because these are all probabilistic systems. We are used to a well-defined software world where we define a requirement, we implement something and the software does exactly what it does. And if it doesn't do the

[00:13:53] requirement, then it's a bug. With probabilistic systems, it's not a bug actually. Because the model is relying on set of instructions given to it to achieve that outcome. And those set of instructions are not just the simple prompt you gave. Because the harness is taking your prompt, the task you are giving

[00:14:16] it. It is taking your instructions essentially and then translating for the model with giving it a set of tools if possible, giving it a set of memory, giving it a set of components to help the model achieve the outcome. And I think that's why we are seeing success on those harnesses. But the primitives

[00:14:42] these tools have built and are using are something which everybody can use to build an agent. Because if I'm in legal and I'm building an agent, or if I'm in, you know, tax and I'm building an agent which should handle a tax notice, which is one of the examples I built, or if I'm at Rippling or at ServiceTitan and I'm building an agent to be a chief of staff of a

[00:15:11] business owner, then I can learn from these coding harnesses: how they are taking instructions given by a product manager or a business person and translating them into working code. And that's where the term harness has started to come from. And so Ankur, you've identified that there's three failures that

[00:15:37] are very likely that people are going to run into when using, and can you identify what those three failures are? There are three things which always go wrong. The compound error of the decisions it takes, the context overload of the information it maintains, and the specification vacuum, lack of clarity of

[00:16:01] the instruction. First, it is making decisions. Every time you give it a task, it's not taking one decision, it is taking a series of decisions. So I give a task to a harness, like Cowork. It first reasons over it, calls some tools, gets some answers, then reasons over it again. So there is a compounding

[00:16:25] effect in a certain task I give it. It's typically making five to ten to twenty decisions over what it is observing is happening. And if my first decision is wrong, or my second decision is wrong, or my third decision is wrong, it just compounds. You may have seen, working in Cowork, it will start and it will go down a path and you are wondering why it is going down that path. But it just does, because it's sort of

[00:16:52] reasoning, acting, observing, reasoning, acting, observing. It's taking decisions and it just goes down that hole and it compounds. That's one. Second, there is a limit of information the harness can pass back to the model. That's what we call context. Think of it as the amount of information that can be sent to the model at any point in time. Which means I have to keep this information like clean and fresh because you

[00:17:22] asked me a task, I'm working on it, I take an action, I observe. So I'm building a lot of information around what I acted, what I observed. This is what is called context management because essentially what you're doing is helping the model like take actions, put something in

[00:17:47] memory to understand what action it has taken and then reason over it and do the next one. And this overload happens very fast if the information is not managed well enough. And third is the, what I call the specification vacuum. So you can give a single line as an instruction or a task, the model will take it and the harness will give it to the model and start running. But you can also

[00:18:16] give it well-defined instructions. I always use the metaphor of an intern. And all of us have had interns who come and work for us. And I can give an ill-formed task to my intern and say, go and fix this website for performance. Right? Now it's very vague: which website, which UI, performance in terms of what? What exactly is meant by performance is not clear. And the intern, over eager

[00:18:43] intern may still go and spend two days and come back and say, I looked at your website, but I couldn't find what is slow. And you may say, oh, I never wanted you to work on this website. I wanted you to work on the app, but you never specified that to the intern. And that's the same problem: the agent will get lost. And why it will get lost is because the harness will take an action, give the information back to the model and it will keep reasoning over it

[00:19:09] and keep going on. And you will be like, no, no, no, this is not what I meant. So those are the three things I always come back to. It's almost like these systems are anybody that's had to sort of work with an ADHD partner or child. It's you, you really have to be the executive function for these agent design, because otherwise you're going to have a lot of drift along the way.
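The compounding described above can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that every decision in a run is independently correct with the same probability; the 95% figure and the function name are hypothetical and do not come from the episode.

```python
# Illustrative assumption: each decision in the chain is independently
# correct with the same probability. Real agent steps are correlated,
# so treat this as intuition, not a measurement.
def chain_success_probability(per_step_accuracy: float, steps: int) -> float:
    """Probability that every one of `steps` decisions is correct."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


if __name__ == "__main__":
    # Five to twenty decisions per task is the range mentioned in the episode.
    for steps in (5, 10, 20):
        p = chain_success_probability(0.95, steps)
        print(f"{steps:>2} decisions at 95% each: {p:.0%} chance of a clean run")
```

Under that assumption, a 20-decision run at 95% per step finishes cleanly only about 36% of the time, which is one way to see why an early wrong turn matters so much.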

[00:19:35] You are absolutely right, Meg. I have a two-year-old and if I try to give him too many instructions, he will not carry all those instructions. Right. And it's exactly the same with a coding agent where if I give it too many instructions, it will pick up the last sentence and just run with it and forget about all the three things I said before it. What was the middle thing again? The famous quote from A Fish Called Wanda. Yes.

[00:20:03] Yes, exactly. Right. And that's exactly what happens. So Cowork is an example of an agent which internally has a harness and has the model essentially. And they are constantly working on solving these three problems. But of course, they can only do as much as the humans using them.

[00:20:26] Right. Like if my specification isn't clear, if I'm, you know, giving too much information and too much of my context, then even they can't handle it. Not because it didn't try. It tried very hard, but you couldn't get the productivity you were hoping for or you couldn't get the productivity you saw, you know, others are getting. And so lucky for us,

[00:20:51] you have some answers to these problems, I believe. Right. Six principles that that you've introduced and some really practical steps that people can take to avoid these failures. Do you want to share?

[00:21:09] Yeah. So the six principles are, in order: Principle one, give agents maps, not manuals. A map is a simple set of directions, not detailed instructions. Principle two, spec before coding. So define what done looks like

[00:21:33] before the agent even writes a single line of code. Principle three, tests as guardrails. So establish that the harness will run tests automatically. A failure means the agent has to stop, not us having to make it stop. Fourth, architecture as enforcement. So basically use hooks and the design of your system to enforce

[00:22:04] the patterns it should follow. So your architecture enforcement has to be a deterministic set of rules that get executed. Fifth, I tend to think of these agents you may be using as multiple different interns with different personalities. Use multiple different models, different agents, do a cross-agent

[00:22:29] collaboration and enforce that to get maximum outcome. And last but not least, failures: when you observe that the agent is not doing something, treat that as a bug in your harness. Fix the harness in terms of what you could have done better in giving it the map or the spec or having some tests or some hooks, rather than thinking that, oh, a better way of prompting will solve the problem.
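Principle three, tests as guardrails, can be sketched as a small loop in which the harness, not the human, decides when to stop. This is a minimal illustration under stated assumptions: `run_with_guardrails`, `RunResult`, and the toy agent and test functions are hypothetical names invented here, and a real harness would call a model and a real test runner instead.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RunResult:
    passed: bool
    attempts: int
    failures: List[str]


def run_with_guardrails(
    propose: Callable[[List[str]], str],
    run_tests: Callable[[str], List[str]],
    max_attempts: int = 5,
) -> RunResult:
    """The agent proposes, the harness tests, and failures feed the next attempt."""
    failures: List[str] = []
    for attempt in range(1, max_attempts + 1):
        candidate = propose(failures)    # stand-in for a model call
        failures = run_tests(candidate)  # deterministic check, not a prompt
        if not failures:
            return RunResult(True, attempt, [])
    # Attempt budget exhausted: the harness stops the run and reports why.
    return RunResult(False, max_attempts, failures)


# Toy stand-ins so the sketch runs end to end.
def toy_agent(previous_failures: List[str]) -> str:
    return "fixed" if previous_failures else "draft"


def toy_tests(candidate: str) -> List[str]:
    return [] if candidate == "fixed" else ["expected 'fixed'"]


if __name__ == "__main__":
    print(run_with_guardrails(toy_agent, toy_tests))
```

When the loop ends without passing, the remaining failures are the raw material for principle six: they point at what the map, spec, tests, or hooks were missing.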

[00:22:58] Awesome. Okay, so now if we talk about that context problem and the two-year-old's inability to remember the middle, there's a variety of components within the harness that you can leverage in order to set the context window. And those are the instructions, the skills, the hooks, the memory, you know, all these different

[00:23:27] components. When do you put something in a skill? When do you put something in an instruction? How do you think of this from a systems perspective, as the best way to get that two-year-old intern, you know, doing what you want? It's a capable, smart intern with the mind of a two-year-old, I sort of think. Or the attention span of a two-year-old. Attention span, yes. The attention span of a two-year-old.

[00:23:53] Yeah, it's got the cognitive ability of a PhD with the attention span of a two-year-old. Exactly. That is actually the right level, I would say. You will laugh, but I sort of think about it in terms of, you know, there is a Montessori education. And if you look at how they organize toddler classrooms, they're essentially doing some sort of a harness engineering. Clear example I'll give you, in my toddler class, they have different stations.

[00:24:19] So that's a way for the toddlers to remember, this is where you go and do art. This is where you go and play with dinosaurs. This is where you go and do, you know, water play, right? And that's a map given to my two-year-old. They don't need to remember all the details. All they need to remember is water play, this; dinosaurs, this; painting, that. And that essentially is the first principle of a map.

[00:24:48] People create CLAUDE.md or AGENTS.md files which are long, and like, don't do that. Make it very simple and directive, giving it more like an index. The second principle is that once you are at the station, of course, you are supposed to color. They don't know how to color. They don't know how to

[00:25:11] use a color on a paper. So what does the teacher do? They write or show them, show and tell, this is how you do it, which is essentially a spec. This is what you are telling the model. So before you get the model to start generating code for you, which it will, you are spending time with it and saying, this is the problem I'm trying to solve. What should done look like? The model is going to reason over

[00:25:38] your instruction. Because if you don't want the model to reason, then you're essentially not tapping into the power it has and trying to dumb it down, and they never can be dumbed down. What's so cool about this, Ankur, is, you know, obviously in my role, I've had access to amazing architects and great product people. And I have over the years had, you know, my share of decent ideas and terrible

[00:26:06] ideas and all of the things in between. But what I always had access to was to be able to sort of say, okay, well, I think we should go after this thing or I think it should roughly be this and pause and have the questions that these experts asked me help inform my own clarity of my idea. Because oftentimes there would be big gaping holes in the middle that weren't even hard for me to

[00:26:35] articulate what I wanted. It's just that I didn't even see them. And so then when somebody would say, well, okay, but then what would happen in this case? And I would be like, oh, okay, well, then I would want it to do that. Or what would happen in this other thing? Oh, well, and then if you do this, you're going, you know, it's going to take 10 times as long. Do you really want that? No, I want it to be something, you know, I could have more variance on the answer if I got it to some level of precision

[00:27:03] that's more than sufficient for this particular task. And so what I see here is giving that kind of expertise and availability to everyone if you follow your methodology, which is instead of just telling it to go do stuff, have that conversation element in the spec creation that helps you not only

[00:27:26] better understand the holes in your own thinking, but also avoid, to your point, going on a compounding tangent that causes you to end up in a place where you're like, this isn't anything I wanted at all. Right. And that's the planning mode, right? Yeah, absolutely. I mean, the funny thing is, even before these tools introduced planning, it became very clear that I had to really write down what they do, which is one of the reasons

[00:27:55] I started writing because we're all learning this together. So the more we share each other's learning, the more we can collectively understand how to harness these agents for our advantage, essentially. But going back to what you said, right? I think there is also this misconception now that just because I can get code created so fast, I don't need to do a PRD or a design anymore. That's just a fallacy.

[00:28:23] We were doing these artifacts for us, for debate and discussion between us: is the requirement very clear? PRDs, epics and stories, user journey maps, and jobs to be done were there to make sure that we could pressure test it before the code got written. The only difference is you're not writing for an engineer anymore, a junior engineer; you're writing for the model. So the instruction format is

[00:28:49] starting to change. And that's why people are talking about different types of markdown files these days. And now you've mentioned skills and hooks as well, right? In addition to the instructions and the specs. And so what's a skill versus what's a hook? So if you look at any skill, it is actually nothing but a precise set of instructions. It's essentially

[00:29:16] saying to the model, oh, I'm telling you to read this PDF. You don't have to go around figuring out how to read this PDF. I'm just giving you a PDF skill. And you just, whenever you need to read a PDF, just tell my skill and my skill will figure it out. And how does it know to go to that skill?

[00:29:38] That is where the map comes in. Because in the map, each skill has a unique identifier. So Anthropic is one of the first ones who pioneered this. Barry Zhang and his team from Anthropic, they were the innovators on this. In the file of skill, there is a header which says the name of the skill is PDF something and it has a one line description of the skill. Anthropic calls it

[00:30:08] progressive disclosure. So you're not telling it, oh, these are the five tools you can use to do a PDF document. There is one skill and then the skill will then internally take care of five types of PDF reading tools, whichever ones are needed, it will take care of that and figure it out. But that complexity is not exposed to the model.
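The header-plus-body idea described above can be sketched in a few lines. Everything here is an assumption made for illustration: the `---` delimited header layout, the sample `pdf-reader` skill, and the helper names `split_skill`, `build_map`, and `load_skill` are hypothetical and are not taken from any vendor's specification.

```python
from typing import Dict, List, Tuple

# Hypothetical skill file: a short header (name + one-line description)
# followed by the detailed instructions.
SKILL_FILE = """---
name: pdf-reader
description: Extract text and tables from PDF files.
---
Step-by-step instructions the model only sees once it picks this skill.
"""


def split_skill(text: str) -> Tuple[Dict[str, str], str]:
    """Split a skill file into (header fields, body)."""
    parts = text.split("---", 2)
    if len(parts) != 3 or parts[0].strip():
        raise ValueError("expected a header delimited by '---' lines")
    header: Dict[str, str] = {}
    for line in parts[1].strip().splitlines():
        key, _, value = line.partition(":")
        header[key.strip()] = value.strip()
    return header, parts[2].strip()


def build_map(skills: List[str]) -> str:
    """Only names and one-line descriptions go into the always-loaded map."""
    lines = []
    for text in skills:
        header, _body = split_skill(text)
        lines.append(f"- {header['name']}: {header['description']}")
    return "\n".join(lines)


def load_skill(skills: List[str], name: str) -> str:
    """Disclose the full instructions only when a skill is actually chosen."""
    for text in skills:
        header, body = split_skill(text)
        if header["name"] == name:
            return body
    raise KeyError(name)


if __name__ == "__main__":
    print(build_map([SKILL_FILE]))
    print(load_skill([SKILL_FILE], "pdf-reader"))
```

The design choice being illustrated is that the map stays one line per skill no matter how long each skill's body grows, so adding skills costs very little context until one is used.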

[00:30:33] Imagine that when you're dealing with engineering teams where you have multiple people and you're onboarding them, having some of this, you know, sort of codified for everyone would simplify things. Oh, absolutely. And also make sure you don't end up with bigger compliance risks than you're wanting to have in this context.

[00:30:57] A hundred percent. So one of the things we've done is create a skills marketplace, essentially, where you are codifying and encapsulating, okay, this is how we do tests, this is how we do code review, or this is how you do sanity check. And these are all then auto configured and deployed on every developer's machine. So the moment they set it up so that the cognitive load for a new engineer is much

[00:31:22] less. Once you've got AI-curious experts who are running ahead, codify their practices into skills and publish them for the rest of the organization so that they can then essentially get to the same level of productivity. Sounds to me like we're just in some ways reinventing or creating a next generation of what we in the past might have thought of as platform teams that were creating

[00:31:48] tools and guardrails and guidelines that informed the entire engineering stack going forward. Essentially you're capturing your domain knowledge as skills and encapsulating it and letting the platform teams essentially then build the harness and then run. I do want to come back to maybe your other question around hooks and tests, right? So on one side you're doing specs and

[00:32:16] maps. On the other side, going back to my toddler metaphor, essentially, there are rules he has to follow at the station. Like if he's at the water station, he shouldn't splash. Those are guardrails essentially, right? Let's take Claude Cowork as an example. You give an instruction. How does

[00:32:38] Cowork know that it has done the right thing? When you are asking Cowork to build an app, you can build tests and also get Cowork to run the tests constantly, the TDD metaphor. In fact, one of the ideas Simon Willison talks about is start with TDD, get the harness to write your tests and then tell it to keep going

[00:33:06] until the tests pass. So that way it has the guardrail, which is telling it, I have to build the spec. The spec has these tests and all the tests have to pass before I finish. The guardrails essentially get implemented as hooks in the system. So hooks is a technical term, which

[00:33:29] actually, again, Anthropic came up with for Claude, where we could add specific instructions, which the harness will call and check. And based on the output of that, it will say, did I finish correctly or not? You can write basically two types of hooks. One is functional hooks, right, which is testing the functionality and making sure it's working correctly.

[00:33:52] And second, you can have hooks which are non-functional in nature, which is checking patterns, checking, you know, code quality, checking, you know, referential integrity, looking for secrets. These are deterministic routines, which always run. So the model then knows, oh, I'm not complete yet, my task is not complete,

[00:34:15] because my like 30 view is highlighting some errors. And there is this METR study, which measures how long models can keep going on like this. This whole idea of long-horizon tasks is coming in. Following these principles, I've been able to build a spec and let it run for almost two hours.

[00:34:42] It will, like I can let it start and I can go and have my lunch and come back and, and it was running and... Yeah, there's gonna, there's gonna be like competitions of who can have the longest running thing that actually does the thing it's supposed to do, right? It'll, it'll be like the new, the new egg drop, right? Yes. Like you're, right? 100%. Oh, how high can you drop the egg? Yeah. 100%.

[00:35:08] So, so Ankur, I've been using these tools and doing quite a bit of building and, and fixing with vibe coding and so on. But I, it's just me, right? Like it's just a single person. So it's, it's pretty simple. And, and of course you're, you're in this world of building enterprise software

[00:35:30] with lots of people involved. This is kind of like the new SDLC, right? And like, what, what are you seeing as being like the biggest coordination costs and how are you mitigating those? Not everybody is able to comprehend how to maximize productivity using these agents. In any team, there are 10 to 20% of the people who are curious, who are experimenting and figuring out how to maximize,

[00:36:00] you know, the output and accuracy from these agents. So, uh, the teams and orgs are struggling to scale up. Like how do I take and bottle up this, uh, productivity my 100x engineer has and offer it to the rest of the org. So it's not very intuitively obvious what it takes to get productivity

[00:36:26] from them. So, so going from individual productivity to team productivity is the first-level challenge. The second challenge is making an organizational change to get value out of these tools. In a classic world, you followed Scrum or Agile and you did some routines in a certain manner. Look at the world now we just talked about where I can let hundreds of agents run for hours and hours

[00:36:56] overnight and get the code created. That means my org and the team doesn't need to function in that old way at all. People are like, people are coming up with routines where they spend hours and hours on the spec and code and the hooks. And then they let the, the teams are then letting the agent run overnight and then they come back next day to see the entire application done for them, entire products and features done for

[00:37:23] them. So this means you have to rethink and reimagine how your different roles interact and interface with each other. You still need a PM, you still need an architect or a designer, and you still need those capabilities, whether that is in specific people or whether, to a certain degree, you want everybody to have certain skills, a certain depth of knowledge. So each of them can then maximize and get better outcomes from

[00:37:52] these agents. So are the different roles and the way that they interact in order to have the best functioning team, is that starting to coalesce or is it still in kind of an experimental, constantly changing form? It's a constantly changing experiment. Because the harnesses are not sitting idle. They are innovating at a breakneck pace. If you saw the latest Claude Code version, now it has

[00:38:22] background tasks and scheduled tasks and routines; it can run on its own, learning from OpenClaw. Peter's kooky innovation with OpenClaw is actually building a harness which is far ahead of anybody else, which is why everybody is now studying that OpenClaw harness to say how do I bring

[00:38:44] those things into my harnesses. And to me, that is something which requires you to then think about what it means in your roles day to day. So that's one side. And then second, we are at an inflection point where those roles are changing drastically because writing code is no longer the

[00:39:10] bottleneck. So if writing code is no longer the bottleneck, then what an engineer has to do, what a PM has to do, or what a designer has to do, each of them has to evolve to figure out how they collaborate together, or do you need all three people? There are points of view, designers can evolve into this person who can do that for you. There could be engineers with the product sense, they can evolve into that or product managers with system design sense can evolve. So all of that-

[00:39:39] But to some degree, you really need more people being able to do those upfront specs and all the harness engineering because you have so much capacity now to do the coding, right? Like the bottleneck is more front-loaded now than back-loaded, right? Yes, exactly. I actually am starting to believe that there's two different bottlenecks and I'm willing for this

[00:40:08] point of view to evolve. I think you need your systems thinkers building not just these harnesses, but the entire life cycle process. Which parts need to be consistent? Again, if you're dealing with big systems, which parts need to be consistent and how do you avoid having everybody having to roll their own, getting more leverage across? And then I think to your other

[00:40:36] point, Amy, now when you start to think about like how to build features and capabilities, getting more in that front end of clarity of what to build, how it should work, and what would be the proper evaluation criteria to know that you achieved it. The question that I have in the back of my mind, Ankur, and I, again, recognize we're still early days.

[00:41:05] What is in your mind the proper kind of center of gravity unit of work for a capability? Is it one person that owns it end to end? Is it a couple people? How would that be broken up? And at what level of granularity would you want to think about subdividing so that, because in the end, you know, you're probably delivering multiple things

[00:41:32] to customers that need to make sense together. So are we going to get like all of these like teams of one plus, you know, millions of agents building a bunch of stuff. And then at the end, you've got this terrible Frankenstein thing that doesn't come together very nicely because it's missing its soul, you know, like what's, what's your thought about the unit of work?

[00:41:56] I think as product builders, you still have to create value for your customers. So you have to really work backwards from that lens. And that has not changed. You still have to think about what is valuable for your customers and, and have somebody who owns that definition

[00:42:13] of value and outcome that you're going to deliver to your customers. And that has to be a single owner deciding and debating and taking inputs from everybody else to decide on what's the outcome and the value we deliver to our customers. Then it's having a set of either agents, which are doing this on their behalf, or humans working together, which is where the idea of flexible teams comes in because if it is a simple enough task,

[00:42:41] maybe a single person with a set of agents can do it because your harness is built strong enough. I think you said you had three things, but I think you only said two. Was there a third? Did you want to get to that? When you start experiencing agents as an interaction model and you start seeing an agent run and deliver you an outcome, your lens about everything else changes. And that means I, I don't build any features anymore. I challenge

[00:43:11] people who think about, Oh, you're building features to get capabilities in your product, but you're not taking a step back and asking. I was always building the software to deliver an outcome for the customer. What is my agentic experience to do that? Offer that agent experience. What type of harness do I need to build in my product to launch those experiences?

[00:43:35] And then really completely re-imagine my product interaction model from a point and click or a mobile app type experience to an agent driven interaction. And because we are so lost in one and two, very few people are starting to re-imagine like agents over apps, right? Because I challenge people who say I want to run fast.

[00:44:01] But if you're still creating a traditional web UI where people have to point and click and do, you're already, you're creating something which is no longer relevant. And to me, one of the classic examples was Andrej Karpathy's appearance on the No Priors podcast. He talked about how he uses OpenClaw to manage his home. And he said very suddenly he had five apps to control temperature, sound, etc., etc. And now he has one agent, which takes care of that for him.

[00:44:31] So if that is the kind of experience you can offer to your customers, why would you still create an app or a UI and a point and click type of an experience? And organizations are not doing that type of re-imagining. And I feel very passionate about it because I want to make my life simpler and I want more agents to do stuff for me on my behalf. And I want everybody to think in agents and offer me their agents and have my agent talk to their agents.

[00:44:59] And I don't want to go to a UI and click something anymore. But that type of re-imagining is not happening that much. And this to me is the third biggest challenge because the moment you do one and two, you will immediately reach the conclusion, why am I still creating this experience for my customers? I'm not offering it to my engineers. And I think this gets to one thing that people have to watch out for.

[00:45:26] And that is that the customers that you're talking to probably don't have a full appreciation for what their future looks like when it's agentically enabled. And so it's on you as the person that cares about the customer to be able to read between the lines.

[00:45:44] If it is an agent using that, it is still your responsibility to not get a little bit confused by the data because we've already said agents are very task-oriented and may, you know, beat against something. It doesn't mean that's the most efficient or best way to solve the problem.

[00:46:02] And in a B2B context, you have not only the person that's using that solution, which may be an agent, as you said, but the business value that the organization is trying to drive. And oftentimes people struggle to understand those two layers are both a yes and. You have to think about both at all times.

[00:46:24] Absolutely, Meg, because we were conditioned to just create a well-formed, complex UI and we threw the problem over to our customers saying, you figure out my application to get the outcome you want. That's right. And that's no longer the case anymore. Now I have to do the work to figure out what is the outcome they're looking at. How do I have skills, my domain skills to call the right tools within my product?

[00:46:51] How do I have hooks to measure that outcome happened? And that's what I previously wrote about: how do you go from creating an agent product which is a demo to a real, trusted agent, which your customers can rely on day to day? My hope is that more and more work starts to happen, more and more thought starts to happen, because there is a whole lot of new experiences that have to be created.

[00:47:20] Think of Cowork or Claude Code: we have a completely new way of experiencing things, right? Like the loop running, and Claude has this funny vocabulary of stuff it uses, booping and blabbing, right? That's an interaction. That's an example of they figured out, okay, what do we do when the loop is running? Let's use some funny words, julienning and like, you know, so that people will stay engaged.

[00:47:46] But in a B2B world, when you're creating an agent, you have to think about different interactions. How do you show traces that the human can trust that, okay, agent is looking at the right thing? There is a whole sort of, I would say, class of components and experiences and primitives to be created to make these agents real in enterprise. I think we nailed it. I think this has given us a lot. I mean, there's like so many more questions I have.

[00:48:16] We'll have to wait for your next paper. Yes. And then I'll, and then I'll. I can go on. You can see me, Amy. I'm fascinated about all this. So thank you so much, Ankur. We really appreciate you both coming and sharing with our audience, and answering all our questions offline when Amy and I get stuck. You are a gem for so much patience and so much support. So thank you. Yeah. Thank you. Thanks for inviting me, Amy, and it was a pleasure sharing and learning together.

[00:48:46] You ready to leadership corner? Let's leadership corner. All right, here we go. In the last nine months, I've been promoted twice, and the CEO has adopted my approach as best practice across the organization. I'm clearly adding value and growing fast, but I'm lonely. I'm the most senior woman here, and most of my C-suite peers are emotionally immature. They react in the moment while I step back and see the big picture.

[00:49:14] I'm managing egos when I should be collaborating with peers. Do I stay and continue modeling a better way of working, or has this company become too small for me? Ooh, big question. So first off, I love the reflection.

[00:49:33] I think it's important to constantly take a pulse and ask yourself both your goals, your growth levers, and is the environment ripe and fruitful? And I also, you know, congratulations. Being promoted twice is a big deal, and it does signal a lot of trust.

[00:49:57] And I believe that, you know, having built upon that creates a lot of really good opportunities. So my instinct, and again, this is, you know, based on my own sort of values and goals, is that you probably need to reassess what your expectations are of your peers and your group.

[00:50:24] And what I mean by that is, when you get to a certain level of seniority, you really need to start building groups that help you and support you and can be there for you. So Amy and I have sort of taken on the Melinda Gates language of a trust council, people that you trust, that know you, that can give you advice, that can support you, that can help build you up.

[00:50:52] I have yet to be in a work environment where that trust council is guaranteed to be in my peer group. I have also been in many work environments where my peer groups had a lot of different motivations, skill levels, and interests that were not well aligned with my own.

[00:51:16] And I've had a pretty broad range of reactions and emotions to that, some more helpful than others. And so if I were to summarize the wisdom that I think I've built up over the years, it is that it is very important to build trusting relationships with your peers.

[00:51:36] It is very important to accept your peers for who they are and where they are and to work to support them in that as opposed to expect them to be something that they are not. And it is likely the case that your response of loneliness is a lot about just the level that you've achieved versus necessarily a failing of your peers.

[00:52:05] So if the challenge is loneliness, I guess my response would be you need to address your loneliness by adding other groups that you interact with to help you with that. If the challenge is opportunity in the job or ability to get good work done, then that's when you need to think about exiting.

[00:52:30] So I'm not hearing that opportunity isn't coming your way and growth isn't coming your way. And I'm also not necessarily hearing that your peers are sabotaging you or that you're dealing with a narcissist or anything like that.

[00:52:43] So I think this is probably, at least to my read, a moment to say it is time to expand your network of mentors and trusted people that help you with the loneliness and the collaboration and the ideation and let go of the idea that that's the responsibility of your peers. That's my read. That's what I would say. First off, Meg's a liar.

[00:53:13] No. So we did have a situation where we had the amazing trust council. I thought of that when I said it. Yes, that's true. But not every day. Very rare. It's very unusual. And it lasted like a very short window. And you have to appreciate that moment and relish it at the time and know that it will not repeat itself. Okay. So a couple things to add.

[00:53:42] So in terms of the peers, I love what Meg was saying. In terms of, you know, I am always looking for people's strengths and opportunities to connect with them. And so I love Meg's advice here. Now, you say they're emotionally immature and they have big egos. You're going to be able to find strengths. You're going to be able to find common ground. You're going to be able to find ways to connect.

[00:54:09] Unless you are dealing with extreme personalities like a narcissist or a saboteur or a narcissist that is a saboteur. You know, like so if you're dealing with that level of emotional drain, then that's maybe a different scenario altogether. But that's not necessarily what I'm reading into this.

[00:54:33] The other thing that I would say, in addition to building kind of your external trust council where you have that support, really focus a lot of your emotional energy on hiring and development. So you talked about, you know, should I be modeling a better way to work?

[00:54:54] You know, think of that not necessarily as to your peers because that, you know, maybe is a bit patronizing, is a bit, you know, off-putting to them and not going to help you with your goal of building trust and community.

[00:55:10] But in terms of, you know, thinking about who you're hiring and how you're developing your people and the other people within the company, you know, start building that muscle that way. Anything else to add? Anything I lied about there, Meg? No, no, no. You're the one that called me on the lying. So you're absolutely right. I did misspeak.

[00:55:38] But I did think that there's one thing to add, and that is that managing egos is an executive skill. It's a job you have to figure out how to do. It is not the fun part of the role, but it is absolutely a big part of the job in my experience. And I don't think my experience is that unique in that area.

[00:56:01] So when you work with people that have gotten to a certain level of career or professional success, you're just going to have to figure out how to navigate big egos. So while I'm not loving that for you, I think you could tell yourself a different story and say, hey, this is a skill that I need to build some resilience for. And think about it that way. Wonderful.

[00:56:29] Well, what a fun time we had with Ankur. Oh, my goodness. Yes, my brain is still swirling with all the ideas. Anybody that hasn't yet read his articles, I strongly recommend them. But make sure it's at a moment where you have some time because they're pretty deep and pretty hefty. Yeah. Yeah. Lots to unpack there. It hit me hard. It's in my tracks. Oh, my God. Oh, my God.

[00:56:56] Anybody else suffering from the Claude-y writing giving you...? I think I'm moving up the value chain from like anxiety and annoyance to actual irritation and agitation. So anyway, I don't know where that goes next. Probably rage. So I will try to manage. Rage quitting, I think is the. Or rage maxing. Maybe would you be. Oh, yes. We should be maxing more things anyway. I think we got to work on our maxing.

[00:57:23] So in the spirit of maxing on the Meg and Amy Show, thank you to all our listeners. We really appreciate you. Please do like and subscribe. And let's invent the future together, everyone. I believe in us. Let's make every day count. We'll be right back.
