SWE-CI: New Benchmark Tests AI Agents on Long-Term Code Maintenance via CI

What SWE-CI Actually Does
SWE-CI is presented by its authors as the first repository-level benchmark built on the Continuous Integration (CI) loop. It aims to shift the evaluation of code generation from static, short-term functional correctness toward dynamic, long-term maintainability.
Key Details from the Paper
The benchmark comprises 100 tasks, each corresponding on average to:
- Evolution history spanning 233 days
- 71 consecutive commits in a real-world code repository
SWE-CI requires agents to resolve these tasks through dozens of rounds of analysis and coding iterations. This addresses a gap in current evaluation methods. LLM-powered agents have shown strong capabilities in automating software engineering tasks such as static bug fixing, as measured by benchmarks like SWE-bench. Real-world development, however, involves complex requirement changes and long-term feature iterations that static, one-shot repair paradigms fail to capture.
According to the paper, SWE-CI offers insight into how well agents sustain code quality throughout long-term evolution. The focus moves beyond one-off bug fixing to how agents handle the iterative nature of real software development.
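To make the multi-round idea concrete, here is a minimal toy sketch of a CI-style evaluation loop. It is not the paper's harness. The `Step` class, `run_ci_loop`, `toy_agent`, the dictionary "codebase", and the round limit are all hypothetical names and simplifications chosen for illustration. The point it shows is that tests from earlier steps keep running as later changes arrive, so an agent is scored on sustained correctness rather than a single fix.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Codebase = Dict[str, int]  # toy stand-in for a repository (assumption)


@dataclass
class Step:
    """One hypothetical requirement change plus the tests it introduces."""
    description: str
    tests: List[Callable[[Codebase], bool]]


def run_ci_loop(steps, agent, max_rounds=3):
    """Feed steps to the agent in order; return one pass/fail flag per step."""
    codebase: Codebase = {}
    active_tests = []  # accumulates, so older tests act as regression checks
    results = []
    for step in steps:
        active_tests.extend(step.tests)
        passed = False
        for round_no in range(1, max_rounds + 1):
            # The agent proposes a new version of the codebase each round.
            codebase = agent(codebase, step.description, round_no)
            if all(test(codebase) for test in active_tests):
                passed = True
                break
        results.append(passed)
    return results


def toy_agent(codebase, description, round_no):
    """Hypothetical agent: its 'quality' on a feature grows by one per round."""
    updated = dict(codebase)
    updated[description] = round_no
    return updated


STEPS = [
    Step("add_login", [lambda cb: cb.get("add_login", 0) >= 1]),
    Step("add_logout", [lambda cb: cb.get("add_logout", 0) >= 2]),
    Step("add_sso", [lambda cb: cb.get("add_sso", 0) >= 5]),
]

results = run_ci_loop(STEPS, toy_agent)
print(results)  # [True, True, False]
```

In this toy run the first step passes in one round, the second needs two rounds, and the third never passes within the three-round limit. A real harness would replace the dictionary with a repository checkout and the lambdas with the project's test suite.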
Technical Context
This type of benchmark is significant because most current AI coding agent evaluations focus on single-shot fixes or isolated coding problems. SWE-CI's CI-based approach better reflects how development actually happens in mature software projects, where changes accumulate over time and must maintain compatibility with existing systems.
For developers using AI coding agents, this benchmark could help identify which agents are better suited to long-term project maintenance than to quick fixes. The multi-round, iterative nature of the tasks tests persistence and consistency, qualities that matter when integrating AI assistance into ongoing development workflows.
📖 Read the full source: HN AI Agents