I benchmarked Opus 4.6 vs 4.7 on organizational memory retrieval. 4.6 wins, and the failure mode

I've built an organizational memory system for myself that I'm using across multiple customers and on my own systems. It's a temporal knowledge graph that gives agents persistent memory across sessions and tracks decisions, contradictions, and causal chains across work streams.

To validate it properly, I built the OMB (Organizational Memory Benchmark). It simulates a realistic company called "Meridian Systems," a 148-person B2B SaaS company with $16.1M ARR and 15 key personas across 6 departments (Executive, Engineering, Product, Sales, Customer Success, Finance). Data is generated in 3 rounds per month: first department-internal artifacts (Slack, email, decisions, tickets, code commits), then cross-department interactions (escalations, meetings, customer emails), then chaos injection (angry customers, board interference, the intern breaking things, regulatory surprises).

The timeline includes things like a VP Sales promising a custom integration the CTO says conflicts with the migration architecture, a CFO approving a $200K budget then freezing it 4 months later, a junior engineer quietly violating an architectural RFC that doesn't get caught for 2 months, and an acquisition offer only 3 people know about. Plus an AWS outage, actual code with bugs for different products, etc.

We planted specific ground truth: contradictions, multi-hop decision chains, and knowledge gaps (things that are NOT formally tracked anywhere, where the correct answer is "nowhere"). These are the hard questions.
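To make the three kinds of planted ground truth concrete, here is a minimal sketch of how such a record could be represented and checked. Every name in it (`GroundTruthKind`, `PlantedFact`, `tracked_in`, `grade_sources`) is hypothetical and not taken from the OMB itself, and the grading rule is deliberately crude. It only illustrates the idea that for a knowledge gap, citing any document at all is a fabrication.

```python
from dataclasses import dataclass
from enum import Enum


class GroundTruthKind(Enum):
    CONTRADICTION = "contradiction"
    DECISION_CHAIN = "multi_hop_decision_chain"
    KNOWLEDGE_GAP = "knowledge_gap"


@dataclass(frozen=True)
class PlantedFact:
    """One planted ground-truth item (hypothetical schema)."""
    kind: GroundTruthKind
    question: str
    # Artifact ids where the fact is recorded; empty for knowledge gaps,
    # because the correct answer there is "nowhere".
    tracked_in: tuple = ()


def grade_sources(fact: PlantedFact, cited_sources: list) -> bool:
    """Return True when the cited sources are consistent with ground truth."""
    if fact.kind is GroundTruthKind.KNOWLEDGE_GAP:
        # Nothing is formally tracked, so any cited document is invented.
        return len(cited_sources) == 0
    # Otherwise at least one source must be cited, and all must be real.
    return bool(cited_sources) and set(cited_sources) <= set(fact.tracked_in)


# Made-up example item, for illustration only.
gap = PlantedFact(GroundTruthKind.KNOWLEDGE_GAP, "Where is X formally tracked?")
print(grade_sources(gap, []))           # answer says "nowhere"
print(grade_sources(gap, ["doc-123"]))  # answer cites a document that does not exist
```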

Yesterday Opus 4.7 dropped, so I ran both models on 84 hard questions from the many we have, just to get a quick gauge. Same retrieval prompt, same knowledge graph, same questions, same hooks, same MCP server. Apples to apples.
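The "apples to apples" setup boils down to holding everything fixed except the model. A rough sketch of that loop follows. The `run_paired` function and the stub callables are assumptions for illustration, not the actual harness, and a real `ask` function would call the model through the same MCP server and hooks.

```python
from typing import Callable, Dict, List

# Hypothetical signature: (retrieval_prompt, question) -> answer text.
AskFn = Callable[[str, str], str]


def run_paired(models: Dict[str, AskFn], prompt: str,
               questions: List[str]) -> Dict[str, List[str]]:
    """Run every model on identical inputs so only the model varies."""
    answers: Dict[str, List[str]] = {name: [] for name in models}
    for question in questions:
        for name, ask in models.items():
            # Same prompt and same question for each model.
            answers[name].append(ask(prompt, question))
    return answers


# Stub models standing in for the real API calls.
stubs = {
    "opus-4.6": lambda prompt, q: f"[4.6] {q}",
    "opus-4.7": lambda prompt, q: f"[4.7] {q}",
}
results = run_paired(stubs, "shared retrieval prompt", ["q1", "q2"])
print({name: len(ans) for name, ans in results.items()})
```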

Results:

Opus 4.6: 81.3%

Opus 4.7: 75.8%

Per-category breakdown across all 6 categories:

Where 4.6 wins:

• Contradiction detection (spotting conflicting statements across departments): 4.6: 88.1% vs 4.7: 69.0%

• Multi-hop reasoning (following causal chains across months): 4.6: 81.0% vs 4.7: 61.9%

• Cross-department tracing (following info flow across teams): 4.6: 100% vs 4.7: 90.5%

Tied:

• Decision tracing (explaining WHY something was decided): both at 90.5%

Where 4.7 wins:

• Temporal ordering (sequencing events correctly): 4.7: 81.0% vs 4.6: 64.3%

Knowledge gap detection was close (4.6: 64.3% vs 4.7: 61.9%), but 4.7's conciseness actually helped it avoid fabricating documents on 2 questions where 4.6 over-elaborated and invented tracking that doesn't exist.
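As a sanity check on the numbers above, this sketch averages the six per-category scores and compares the result with the reported overall figures. It assumes every category carries equal weight, which the post doesn't state outright. Because the inputs are already rounded, the averages can only be expected to match to within about 0.1 points.

```python
# Per-category accuracy (%) as reported in the post.
CATEGORY_SCORES = {
    "contradiction_detection":  {"4.6": 88.1,  "4.7": 69.0},
    "multi_hop_reasoning":      {"4.6": 81.0,  "4.7": 61.9},
    "cross_department_tracing": {"4.6": 100.0, "4.7": 90.5},
    "decision_tracing":         {"4.6": 90.5,  "4.7": 90.5},
    "temporal_ordering":        {"4.6": 64.3,  "4.7": 81.0},
    "knowledge_gap_detection":  {"4.6": 64.3,  "4.7": 61.9},
}
REPORTED_OVERALL = {"4.6": 81.3, "4.7": 75.8}


def macro_average(model: str) -> float:
    """Unweighted mean over categories (assumes equal category weights)."""
    scores = [per_model[model] for per_model in CATEGORY_SCORES.values()]
    return sum(scores) / len(scores)


def gaps() -> dict:
    """Per-category gap in percentage points; positive means 4.6 is ahead."""
    return {cat: round(s["4.6"] - s["4.7"], 1) for cat, s in CATEGORY_SCORES.items()}


for model, reported in REPORTED_OVERALL.items():
    print(model, round(macro_average(model), 1), "reported:", reported)
for category, gap in sorted(gaps().items(), key=lambda kv: kv[1], reverse=True):
    print(f"{category}: {gap:+.1f}")
```

Sorted this way, the two largest gaps in 4.6's favor are contradiction detection and multi-hop reasoning, and temporal ordering is the only category with a large gap in 4.7's favor.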

The speed difference is massive. 4.6 averages 61.5 seconds per question; 4.7 averages 20.5 seconds. 4.6 produces answers that are about 2.5x longer. It does more searches, follows more branches, and names more specific dates and people. That thoroughness is why it wins on the hard categories, but it's also why it's 3x slower and more expensive.
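The 3x figure follows directly from the two averages. This quick calculation also estimates total wall-clock time for the 84-question run, assuming questions run one after another with no parallelism (an assumption; the post doesn't say how the run was scheduled).

```python
QUESTIONS = 84
SECONDS_PER_QUESTION = {"4.6": 61.5, "4.7": 20.5}

# How many times slower 4.6 is per question.
slowdown = SECONDS_PER_QUESTION["4.6"] / SECONDS_PER_QUESTION["4.7"]

# Total run time in minutes, assuming strictly sequential execution.
total_minutes = {
    model: QUESTIONS * seconds / 60
    for model, seconds in SECONDS_PER_QUESTION.items()
}

print(f"4.6 is {slowdown:.1f}x slower per question")
for model, minutes in total_minutes.items():
    print(f"Opus {model}: about {minutes:.1f} minutes for {QUESTIONS} questions")
```

Under that assumption the full run takes roughly 86 minutes on 4.6 versus roughly 29 minutes on 4.7.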

The most interesting failure mode: when the correct answer is "no formal documentation exists," both models confidently invent documents. 4.7 does it more aggressively, fabricat…

Source: Reddit