Leveraging Agent Skills for Writing CUDA Kernels with Upskill

Hugging Face has introduced a way to improve the performance of smaller AI models on complex tasks, such as writing CUDA kernels, by using agent skills. The approach relies on the new upskill tool, which lets you generate and evaluate agent skills with large models and then apply those skills to smaller or more cost-effective models.
Agent skills are packaged knowledge that can be shared between models and tools. They are defined as files containing markdown instructions and scripts, and they are particularly useful in niche or hard problem domains where models do not naturally excel.
Steps to Upskill Using Claude and the upskill Tool
1. Building a Kernel with Claude Opus 4.5: Start by using Claude Code to build a kernel interactively, then export the trace of that session. Along the way, draft skills can be iterated on and improved by trying them out with smaller models.
2. Creating an Agent Skill from the Trace: Once the kernel is built, instruct Claude to generate a skill file for the completed task. The Anthropic ‘skill creator’ can also help here, since it creates skills based on the agent's activity trace. upskill adds to this by also providing test cases for assessing how well the skill performs.
3. Applying the Skill across Models: Transfer the new skill to the models you want to use. Skills follow a standard directory layout, e.g., {agent}/skills/{skill_name}/SKILL.md. Use upskill eval commands to compare model performance with these skills, which shows differences in accuracy and token usage across tools such as Codex or Cursor.
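The directory layout mentioned in step 3 can be sketched in a few lines of Python. This is a minimal illustration, not part of upskill itself: the helper names `skill_path` and `install_skill`, the agent name `example_agent`, the skill name `cuda_kernels`, and the SKILL.md text are all hypothetical placeholders. Only the `{agent}/skills/{skill_name}/SKILL.md` pattern comes from the text above.

```python
import tempfile
from pathlib import Path


def skill_path(root: Path, agent: str, skill_name: str) -> Path:
    """Build {agent}/skills/{skill_name}/SKILL.md underneath root."""
    return root / agent / "skills" / skill_name / "SKILL.md"


def install_skill(root: Path, agent: str, skill_name: str, body: str) -> Path:
    """Write the skill's markdown instructions to the expected location."""
    target = skill_path(root, agent, skill_name)
    # Create the agent and skill directories if they do not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(body, encoding="utf-8")
    return target


if __name__ == "__main__":
    # Hypothetical skill content; a real skill would hold the instructions
    # distilled from the exported agent trace.
    placeholder_body = "# cuda_kernels\n\nInstructions for writing CUDA kernels go here.\n"
    with tempfile.TemporaryDirectory() as tmp:
        written = install_skill(Path(tmp), "example_agent", "cuda_kernels", placeholder_body)
        print(written.relative_to(tmp).as_posix())
```

Keeping the path construction in one function means the same skill can be copied to several agents by changing only the `agent` argument.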
Ultimately, skills can help reduce token consumption while maintaining accuracy, which matters for recurring tasks run on different models. However, their effectiveness varies from model to model, so a skill may need iterative refinement.
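One way to make the "fewer tokens, same accuracy" criterion concrete is a small check over evaluation results. The sketch below is an assumption about how such a comparison could be expressed and does not reflect how upskill reports or scores results: the `EvalResult` type, the `skill_helps` function, and the `max_accuracy_drop` tolerance are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """Summary of one evaluation run (hypothetical structure)."""
    accuracy: float  # fraction of test cases passed, from 0.0 to 1.0
    tokens: int      # total tokens consumed during the run


def skill_helps(baseline: EvalResult, with_skill: EvalResult,
                max_accuracy_drop: float = 0.0) -> bool:
    """Return True if the skill saves tokens without losing accuracy.

    max_accuracy_drop is the largest accuracy loss still treated as
    "maintaining accuracy"; the default allows no loss at all.
    """
    keeps_accuracy = with_skill.accuracy >= baseline.accuracy - max_accuracy_drop
    saves_tokens = with_skill.tokens < baseline.tokens
    return keeps_accuracy and saves_tokens
```

In practice the two `EvalResult` values would be filled in from runs of the same model with and without the skill. A skill that fails this check on a given model is a candidate for the iterative refinement described above.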
📖 Read the full source: Hugging Face Blog