How I Traced a Memory Leak That Only Appeared After Hours of Runtime

I started this one directly inside Blackbox AI using the Claude Opus model because short profiling runs were not going to surface anything useful. The issue only appeared after sustained execution, so the approach had to simulate behavior over time rather than inspect a single snapshot.

The service itself looked stable. CPU usage was flat, response times were consistent, and garbage collection logs showed normal activity. But memory kept increasing slowly until the container was eventually killed. Not a spike, not a crash, just steady growth that made no sense relative to traffic.

Instead of taking isolated heap dumps, I fed the allocation paths, cache logic, and request lifecycle into Blackbox AI and used its AI Agents to simulate repeated execution cycles. The idea was to track which objects persisted across iterations rather than which ones existed at a single point in time.
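For anyone who wants to try the same "compare across cycles" idea locally, here is a minimal sketch using Python's standard tracemalloc module. It is not the tooling I actually ran. The `handle_request` function and the `_cache` dict are hypothetical stand-ins for the real service, and the cycle sizes are arbitrary.

```python
import tracemalloc

# Hypothetical stand-in for the service's cache; not the real service code.
_cache = {}

def handle_request(i):
    """Simulated request that allocates a payload and keeps a reference to it."""
    payload = bytes(1024)
    _cache[i] = payload  # the retained reference is what we want to surface
    return len(payload)

def growth_per_cycle(cycles=5, requests_per_cycle=1000):
    """Return the net traced memory growth (in bytes) for each cycle."""
    tracemalloc.start()
    previous = tracemalloc.take_snapshot()
    growth = []
    for c in range(cycles):
        for r in range(requests_per_cycle):
            handle_request(c * requests_per_cycle + r)
        current = tracemalloc.take_snapshot()
        # compare_to returns per-line differences between two snapshots
        diffs = current.compare_to(previous, "lineno")
        growth.append(sum(d.size_diff for d in diffs))
        previous = current
    tracemalloc.stop()
    return growth
```

If the numbers returned by `growth_per_cycle()` stay clearly positive cycle after cycle, something is holding on to allocations between iterations, even if any single snapshot looks unremarkable.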

That is where the pattern started to emerge.

Objects were being released, but not fast enough. A caching layer was holding references longer than expected. Not permanently, which is why it did not look like a traditional leak, but long enough that under continuous load, new allocations outpaced cleanup.

The tricky part was that the eviction logic was technically correct. It was based on access frequency, which worked well under test conditions. But production traffic had a different distribution, so certain entries were rarely accessed yet never expired quickly enough.

Using multi-file context inside Blackbox AI, I mapped how objects flowed into the cache and how eviction conditions were evaluated over time. Then I used iterative editing to test alternative strategies. One variation shifted eviction from access-based to time-based. Another combined both with a hard upper bound.

To avoid introducing a different problem, I used multi-model access to compare how each strategy behaved under simulated load. Some approaches reduced memory usage but caused excessive cache misses. Others maintained performance while still allowing memory to grow.
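The trade-off described above can be reproduced with a small harness that replays the same synthetic traffic against different cache strategies and reports hit rate alongside peak size. This is only a sketch under assumptions. The two cache classes, the hot/cold key mix, and every parameter value are hypothetical and are not the real service's strategies or traffic.

```python
import random
from collections import OrderedDict

class UnboundedCache:
    """Keeps everything: best hit rate, but size grows with traffic."""
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)
    def put(self, key, value):
        self._data[key] = value
    def __len__(self):
        return len(self._data)

class CappedCache:
    """Evicts the oldest inserted entry once max_entries is exceeded."""
    def __init__(self, max_entries):
        self._max = max_entries
        self._data = OrderedDict()
    def get(self, key):
        return self._data.get(key)
    def put(self, key, value):
        self._data[key] = value
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # drop the oldest entry
    def __len__(self):
        return len(self._data)

def simulate(cache, requests=20_000, hot_keys=50, cold_keys=50_000,
             hot_ratio=0.8, seed=0):
    """Replay skewed traffic and report hit rate and peak entry count."""
    rng = random.Random(seed)  # fixed seed so strategies see identical traffic
    hits = 0
    peak = 0
    for _ in range(requests):
        if rng.random() < hot_ratio:
            key = ("hot", rng.randrange(hot_keys))
        else:
            key = ("cold", rng.randrange(cold_keys))
        if cache.get(key) is not None:
            hits += 1
        else:
            cache.put(key, True)
        peak = max(peak, len(cache))
    return {"hit_rate": hits / requests, "peak_entries": peak}
```

Calling `simulate(UnboundedCache())` and `simulate(CappedCache(100))` with the same seed puts both numbers side by side, which is the comparison that matters: a strategy only counts as a fix if peak size stays bounded without the hit rate collapsing.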

The version that worked introduced a strict cap combined with time-based eviction, ensuring that no object could persist beyond a defined window regardless of access patterns.
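As a rough illustration of that shape (not the actual production code), here is a minimal cache with a hard entry cap and time-based expiry. The class name `BoundedTTLCache`, its parameters, and the injectable `clock` are all assumptions made for the sketch. Reads deliberately do not refresh an entry's timestamp, so frequent access cannot keep an object alive past the window.

```python
import time
from collections import OrderedDict

class BoundedTTLCache:
    """Cache with a hard size cap and a fixed lifetime per entry (a sketch)."""

    def __init__(self, max_entries, ttl_seconds, clock=time.monotonic):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (inserted_at, value), ordered from oldest to newest insertion
        self._data = OrderedDict()

    def _purge_expired(self, now):
        # Entries are in insertion order, so stop at the first live one.
        while self._data:
            key, (inserted_at, _) = next(iter(self._data.items()))
            if now - inserted_at < self._ttl:
                break
            del self._data[key]

    def put(self, key, value):
        now = self._clock()
        self._purge_expired(now)
        self._data.pop(key, None)  # re-insert so order matches insertion time
        self._data[key] = (now, value)
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # enforce the strict cap

    def get(self, key, default=None):
        self._purge_expired(self._clock())
        entry = self._data.get(key)
        return default if entry is None else entry[1]

    def __len__(self):
        return len(self._data)
```

One caveat with this sketch: expired entries are only purged when `get` or `put` is called, so a completely idle cache would hold them until the next request. Under continuous load, which is the scenario here, that is not a concern, and the cap bounds memory either way.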

After applying that change, memory usage stabilized. The gradual increase disappeared completely.

It was not a leak in the usual sense. It was retention that only became visible when observed across thousands of iterations.
