Coding Agent Logs Enable Open Federated Training

When you run coding agents like Claude Code or Codex CLI in agent mode, they log full session data to your local disk. These logs capture the whole interaction loop: your initial task, the model's reasoning, every tool call, every environment response, and every error and retry. Together these records form complete (state → action → reward → next state) tuples, the data format reinforcement learning researchers need.

What's in the logs

The source author checked their own machines and found:

Mac Mini: ~/.claude/projects/ containing 3.1GB across 1103 files from 574 agentic sessions
MacBook: ~/.codex/sessions/ containing 2.4GB across 3530 files from 79 agentic sessions
MacBook: ~/.claude/projects/ containing 652MB across 316 files from 99 agentic sessions

In total, they identified 775 sessions with real tool calls, containing approximately 41 million tokens. Extrapolated across thousands of developers, this could amount to hundreds of billions of tokens of real agentic trajectory data. No open dataset comparable to The Pile currently exists for this kind of data.
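
The extrapolation above is simple multiplication, and a rough sketch makes the assumption behind it visible. The helper names and the developer count below are illustrative and not from the post. The sketch assumes every developer holds as many tokens as the author, who is probably a heavier user than average, so treat the result as an upper-end estimate.

```shell
# Hypothetical helpers for the back-of-the-envelope estimate.

# tokens_per_session TOKENS SESSIONS: average tokens per session (integer division).
tokens_per_session() {
  echo $(( $1 / $2 ))
}

# extrapolate_tokens TOKENS_PER_DEV DEVELOPERS: total tokens if every
# developer had the same volume of logs (an assumption, not a measurement).
extrapolate_tokens() {
  echo $(( $1 * $2 ))
}
```

With the post's figures, tokens_per_session 41000000 775 prints 52903, and extrapolate_tokens 41000000 5000 prints 205000000000, i.e. about 205 billion tokens for a hypothetical 5,000 developers.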

Why this data matters

The environment provides clear feedback signals: the exit code is 0 or it is not, the tests pass or they do not. This supplies the missing training signal for causal reasoning, error recovery, and long-horizon planning, all areas where current models struggle. Big AI labs already collect this data internally to train their proprietary models, but there's no open equivalent because the data is fragmented across individual developer machines.
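
As a minimal illustration of how an exit code becomes a training signal, the sketch below maps a command's exit status to a binary reward. The function name reward_for is hypothetical, and real pipelines would likely use richer signals (test counts, partial credit); this only shows the simplest case the paragraph describes.

```shell
# Hypothetical helper: run a command and print a binary reward.
# Reward is 1 when the command exits with status 0, otherwise 0.
reward_for() {
  if "$@" >/dev/null 2>&1; then
    echo 1
  else
    echo 0
  fi
}
```

For example, reward_for make test would print 1 only if the test suite exits cleanly.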

The proposal: Federated learning

The post proposes federated learning, in which your data never leaves your machine. You would train a small LoRA adapter locally, share only the adapter weights with differential privacy noise added, and receive an improved global model in return. Everyone contributes compute and signal without exposing their raw data. Alternatively, the community could anonymize the logs and pool them into an open dataset for fine-tuning models.
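
The post does not specify how the noise would be added. One common pattern, shown here only as a sketch, is to clip the local update to a maximum L2 norm and then add Gaussian noise scaled to that norm. The function name, the one-number-per-line input format, and the parameters are all assumptions, and the noise level is not calibrated to any formal privacy budget.

```shell
# Hypothetical helper: clip_and_noise CLIP SIGMA SEED
# Reads a flattened weight update from stdin, one number per line.
# 1. Scales the vector down so its L2 norm is at most CLIP.
# 2. Adds Gaussian noise with standard deviation SIGMA * CLIP to each entry.
clip_and_noise() {
  awk -v clip="$1" -v sigma="$2" -v seed="$3" '
    { v[NR] = $1; sq += $1 * $1 }
    END {
      srand(seed)
      pi = atan2(0, -1)
      norm = sqrt(sq)
      scale = (norm > clip) ? clip / norm : 1
      for (i = 1; i <= NR; i++) {
        # Box-Muller transform: two uniform samples -> one standard normal sample.
        u1 = 1 - rand()   # in (0, 1], so log(u1) is defined
        u2 = rand()
        z = sqrt(-2 * log(u1)) * cos(2 * pi * u2)
        printf "%.6f\n", v[i] * scale + sigma * clip * z
      }
    }'
}
```

With SIGMA set to 0 the function only clips, which makes the behavior easy to check: the vector (3, 4) with CLIP 1 comes out as 0.600000 and 0.800000. A server would then average the noised updates it receives from many participants.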

Practical steps

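To preserve your logs (Claude Code deletes them after 30 days by default), raise cleanupPeriodDays. Note that this command overwrites ~/.claude/settings.json, so if you already have a settings file, add the key to it by hand instead: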

echo '{"cleanupPeriodDays": 36500}' > ~/.claude/settings.json
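
Because the one-liner above replaces the whole settings file, a more careful variant may be useful if you already have settings. The sketch below is one possible approach, not something from the post: set_cleanup_period is a hypothetical helper that creates the file when it is missing or empty, and otherwise merges the key with jq if jq is installed, keeping a .bak copy of the original.

```shell
# Hypothetical helper: set_cleanup_period DAYS [SETTINGS_FILE]
# Sets cleanupPeriodDays without discarding other keys in the file.
set_cleanup_period() {
  local days="$1"
  local settings="${2:-$HOME/.claude/settings.json}"
  mkdir -p "$(dirname "$settings")"
  if [ ! -s "$settings" ]; then
    # No settings yet (missing or empty file): safe to create from scratch.
    printf '{"cleanupPeriodDays": %d}\n' "$days" > "$settings"
  elif command -v jq >/dev/null 2>&1; then
    # Merge into the existing JSON via a temp file; keep a backup of the original.
    cp "$settings" "$settings.bak"
    if jq --argjson d "$days" '.cleanupPeriodDays = $d' "$settings.bak" > "$settings.tmp"; then
      mv "$settings.tmp" "$settings"
    else
      rm -f "$settings.tmp"
      return 1
    fi
  else
    echo "jq not found; add \"cleanupPeriodDays\": $days to $settings by hand" >&2
    return 1
  fi
}
```

Calling set_cleanup_period 36500 applies the same value as the one-liner while leaving any existing keys in place.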

To check what's on your own machines:

du -sh ~/.codex/sessions/ 2>/dev/null
du -sh ~/.claude/projects/ 2>/dev/null
find ~/.codex/sessions/ -name "*.jsonl" 2>/dev/null | wc -l
find ~/.claude/projects/ -name "*.jsonl" 2>/dev/null | wc -l
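
If you want the numbers in one line per directory, the commands above can be wrapped in a small function. This is a convenience sketch; log_stats is a hypothetical name, and the output columns (directory, .jsonl file count, size in kilobytes) are a choice made here, not a format the post asks for.

```shell
# Hypothetical helper: log_stats DIR
# Prints: DIR <tab> number of .jsonl files <tab> total size in KB.
# A missing directory is reported as 0 files and 0 KB instead of an error.
log_stats() {
  local dir="$1"
  local files kb
  if [ ! -d "$dir" ]; then
    printf '%s\t0\t0\n' "$dir"
    return 0
  fi
  # tr strips the padding some wc implementations add around the count.
  files="$(find "$dir" -type f -name '*.jsonl' | wc -l | tr -d ' ')"
  # du -sk prints "<kilobytes><tab><path>"; keep only the number.
  kb="$(du -sk "$dir" | cut -f1)"
  printf '%s\t%s\t%s\n' "$dir" "$files" "$kb"
}
```

Running log_stats ~/.codex/sessions and log_stats ~/.claude/projects gives two lines you can paste directly into a comment.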

The Reddit post encourages developers to share their numbers in the comments to gauge how much unused data exists across the community, with the goal of building an open counterpart to the labs' internal datasets if there's enough interest.

📖 Read the full source: r/LocalLLaMA