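Stop Arguing About Frameworks: Your AI Agent Is Dying in a Loop
Six months, 30 AI agents, paying customers, real production environments. The takeaway: the framework debate is beside the point, and what actually kills agents are the loop and memory problems you never took seriously.
Key point: what decides whether an AI agent survives production is not the choice of framework but two problems that frameworks do not address: loop collapse and context rot.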
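For the past six months I have been running AI agents in production: not demos that run once in a Jupyter Notebook and get thrown away, but agents that face paying customers and handle real business logic. About 30 agents, across different industries, handling different tasks. I originally assumed the hard part would be picking the right framework. LangChain or CrewAI? AutoGen or the OpenAI Agents SDK? These debates rage in developer communities, and every framework has its loyal fans and its fierce critics.

Six months later, the truth is that the framework barely matters. Pick any one, as long as your team knows it well. What actually kills agents are two things almost nobody discusses in public: loop collapse and context rot.

Loop collapse has a typical signature: the agent calls a tool, gets a result, calls the same tool again, gets the same result, and keeps going until you kill it by hand or it burns through your API budget. You might think this is a simple logic bug that should be easy to catch. In practice, once the agent's decision path gets complicated and it needs 5, 10, or more tool calls to finish a task, loops become almost unavoidable. The reason is that the agent lacks a kind of intuition. After a few failed attempts a human realizes "this path is a dead end"; an agent does not. It follows its reasoning chain mechanically, and as soon as one state judgment in that chain is wrong, it falls into an endless loop.

Worse, loop collapse tends to strike when the task is almost done. You are one step from success and the agent gets stuck on exactly that step. I watched an agent handling a refund call the same order-lookup endpoint 12 times in a row, because every response said "processing" and it had never learned to wait or to ask. That is not the framework's fault. It is a flaw in the agent architecture itself. We expect agents to "think" like people, but their thinking is linear, short on context, and has no real grasp of what "repetition" means.

The other, more insidious killer is context rot. The longer an agent interacts with a user, the more its context window fills up with history. At first that history is useful. Past a certain threshold, though, the agent's responses become slow and confused, and eventually nonsensical. It forgets its original goal and treats earlier successes as a template for the current task, which produces absurd output. I saw an agent that had handled 20 different customer-service requests receive a 21st that was plainly about cancelling an order. Because its context was saturated with "return" content, it started walking the user through how to pack an item for return. The user was baffled and the support team was furious.

The root cause of context rot is that current models have no real long-term memory. They have a context window of limited length and no explicit notion of priority inside it: nothing marks a stale detail as less important than the current goal, so old and new, important and trivial, all compete on equal footing. As the window fills up, model quality drops sharply. You might think periodically clearing the context solves this. The catch is that clearing the context also wipes the agent's awareness of the current task state. It forgets what it has already done, so it either starts over or asks the user to supply a lot of information again. That brings us back to the loop problem: either you live with rotting context, or you clear it and fall into a cycle of amnesia, repetition, and more amnesia.

Frameworks solve neither problem. LangChain, AutoGen, and the rest mainly address how agents are built and orchestrated: they make it easier to chain LLM calls together and to integrate tools. Loop collapse and context rot are lower-level system problems that involve the agent's decision mechanism, memory management, and state tracking. Frameworks cannot help much here, because they implicitly assume the agent's reasoning chain is flawless and unbounded.

The solution that actually works comes from a different direction. In practice, the most effective approach I have found is an external, structured memory system. For example, I built my agents a Kanban board, not for humans to look at but for the agent itself. The board is a local-first MCP server. After each operation, the agent writes down the current task state, key decisions, completed steps, and the planned next step in a structured form. When the agent needs to reset its context, it reads the current state back from the board instead of relying on the messy history in its context window. It is like giving the agent an external hard drive, so it can "remember" how far it got without cramming everything into limited working memory.

The effect was immediate. Context rot eased considerably, because the agent can restore its state quickly after a restart. Loop collapse also dropped, because the board records the paths already tried: when deciding, the agent can see "this one was tried and it is a dead end" and avoid repeating it.

This is no silver bullet. The board itself needs managing, and if the agent writes inaccurate state, or the board's logic has bugs, it introduces new problems. But it does point in a direction: future agent architectures have to treat memory as a first-class citizen rather than as an appendage of the LLM's context window.

Looking back, the framework arguments are like debating which screwdriver to use to build a house while ignoring that the foundation is soft. A framework is a tool, and no tool, however good, makes up for a fundamental flaw in the architecture. Most community discussion about agents focuses on making them "smarter": better prompts, more elaborate reasoning chains, bigger models. All of that optimizes along the same dimension. The real breakthrough may come from another one: getting agents to remember and to stop. That means rethinking the agent's internal state management, the feedback it gets from tool calls, and its error-recovery strategy. All of this matters far more than choosing between LangChain and CrewAI.

If you are building your own agent, spend less time on framework selection and more on designing its memory system and loop detection. Otherwise, six months from now you will find your agent either burning through the API budget in a loop or driving users away with rotted context.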
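To make the two mechanisms described above concrete, here is a minimal Python sketch of loop detection plus an agent-facing task board. Everything in it is an illustrative assumption rather than the author's implementation: `LoopGuard`, `TaskCard`, `TaskBoard`, `check_order_status`, and the repeat threshold of 3 are hypothetical names and values, and the board is a plain in-memory store standing in for the local-first MCP server mentioned in the article.

```python
import json
from collections import Counter
from dataclasses import dataclass, field, asdict


class LoopGuard:
    """Flags a tool call once the same (tool, args, result) triple repeats too often."""

    def __init__(self, max_repeats: int = 3):  # threshold is an assumed default
        self.max_repeats = max_repeats
        self._seen = Counter()

    def record(self, tool: str, args: dict, result: str) -> bool:
        """Record one call; return True when the agent should stop retrying."""
        # Sort keys so that argument order does not hide a repeated call.
        key = (tool, json.dumps(args, sort_keys=True), result)
        self._seen[key] += 1
        return self._seen[key] >= self.max_repeats


@dataclass
class TaskCard:
    """One card on the board: enough state to resume after a context reset."""
    goal: str
    done_steps: list = field(default_factory=list)
    dead_ends: list = field(default_factory=list)
    next_step: str = ""


class TaskBoard:
    """In-memory stand-in for the external board (hypothetical, not an MCP server)."""

    def __init__(self):
        self._cards = {}

    def save(self, task_id: str, card: TaskCard) -> None:
        # Store a serialized copy so later mutation of `card` cannot alter the board.
        self._cards[task_id] = json.dumps(asdict(card))

    def load(self, task_id: str) -> TaskCard:
        return TaskCard(**json.loads(self._cards[task_id]))


def check_order_status(order_id: str) -> str:
    """Hypothetical tool that always reports the refund as still processing."""
    return "processing"


def run_refund_demo(max_repeats: int = 3):
    """Simulate the refund scenario: stop polling early and record the dead end."""
    guard = LoopGuard(max_repeats)
    board = TaskBoard()
    card = TaskCard(goal="refund order A-1", next_step="check_order_status")
    calls = 0
    for _ in range(12):  # upper bound mirrors the 12 calls in the anecdote
        args = {"order_id": "A-1"}
        result = check_order_status(**args)
        calls += 1
        if guard.record("check_order_status", args, result):
            card.dead_ends.append("check_order_status kept returning 'processing'")
            card.next_step = "wait or ask the user"
            break
    board.save("refund-A-1", card)
    return calls, board.load("refund-A-1")
```

Calling `run_refund_demo()` stops after the third identical response instead of the twelfth, and the card read back from the board carries the dead end and the revised next step, which is the state a freshly reset agent would start from. Keying the guard on the result as well as the arguments is a deliberate choice in this sketch: a poll whose answer changes is making progress and should not be flagged.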
Sources
- After 6 months of running AI agents in production I think the framework you pick barely matters. The thing that kills them is something else. - https://www.reddit.com/r/artificial/comments/1tlt8b9/after_6_months_of_running_ai_agents_in_production/
- Fireside chat at Sequoia Ascent 2026 from a ~week ago. Some highlights - https://nitter.net/karpathy/status/2049903821095354523#m