ICML 2026 Desk-Rejects 2% of Papers for LLM Review Policy Violations

ICML 2026 has implemented a two-policy framework for LLM usage in peer review and taken disciplinary action against reviewers who violated their agreed-upon policies. The conference desk-rejected 497 papers, representing approximately 2% of all submissions.
Policy Framework and Violations
ICML 2026 established two distinct policies for LLM use in reviewing:
- Policy A (Conservative): No LLM use allowed
- Policy B (Permissive): LLMs allowed to help understand papers and related works, and to polish reviews
Reviewers selected which policy they preferred to operate under, with no reviewer who strongly preferred Policy B being assigned to Policy A. The only reviewers assigned to Policy A were those who explicitly selected "Policy A" or "I am okay with either [Policy] A or B."
Detection and Consequences
795 reviews (~1% of all reviews) written by 506 unique reviewers assigned to Policy A were detected to have used LLMs in their review. These reviewers had explicitly agreed not to use LLMs. Every flagged instance was manually verified by a human to avoid false positives.
When a designated Reciprocal Reviewer for a submission produced such a review, their submission was rejected, resulting in 497 total rejections. All Policy A reviews detected to be LLM-generated were removed from the system.
If more than half of the reviews submitted by a Policy A reviewer were detected to be LLM-generated, all of their reviews were deleted and the reviewer was removed from the reviewer pool. 51 Policy A reviewers (about 10% of the 506 detected reviewers) fell into this category.
Technical Detection Method
The detection method involved watermarking submission PDFs with hidden LLM instructions that would subtly influence any review produced via an LLM. The technique:
- Created a dictionary of 170,000 phrases
- For each paper, sampled two phrases randomly from this dictionary (probability smaller than one in ten billion for any given pair)
- Watermarked PDFs with instructions visible only to an LLM, instructing it to include the two selected phrases in the review
- These watermarks would not be directly visible to a human reading the PDF
The method was based on recent work by Rao, Kumar, Lakkaraju, and Shah. The conference notes this technique may only catch the most egregious and careless uses of LLMs in reviewing, particularly where reviewers input the PDF to an LLM and directly copy-paste the output.
Impact and Context
The conference emphasized they are not making judgments about the quality of flagged reviews or reviewers' intentions, but simply enforcing the policies reviewers agreed to. The disruption has required removing violating reviews, potentially finding new reviewers, and desk-rejecting some submissions that had already received a full set of reviews.
This approach reflects the broader challenge conferences face in adapting to AI integration in research workflows while maintaining review integrity.
📖 Read the full source: HN LLM Tools
👀 See Also

Claude Opus 4.5 and Sonnet 4.5 removed from /model selection, require launch flag
Claude Opus 4.5 and Sonnet 4.5 are no longer available in the /model selection menu during sessions. Users must now start sessions with the --model flag specifying the full model ID to access these older versions.

Open Source vs Frontier Models: Single-File Canvas Car Scene Benchmark
A developer tested 12 models including GPT-5.5, Claude Opus 4.7, and Qwen 3.6 Plus on a single-file HTML canvas car driving animation task, with results publicly compared.

Microsoft releases Phi-4-reasoning-vision-15B multimodal model with training insights
Microsoft Research has released Phi-4-reasoning-vision-15B, a 15 billion parameter open-weight multimodal reasoning model available through Microsoft Foundry, HuggingFace, and GitHub. The model balances reasoning power with efficiency and excels at math/science reasoning and UI understanding.

Reddit discussion argues AI competition is closed vs open source, not US vs China
A r/LocalLLaMA post argues that framing AI competition as America vs China is a false narrative used to influence investors and politicians, with the real battle being between closed and open source models. The author notes Chinese labs are open sourcing models primarily for market relevance, not magnanimity, and could go closed source as market conditions change.