Ollama Issues: License Compliance, Bugs, Performance Regressions

Ollama's Core Technology and Attribution Issues

Ollama's inference capability originally came entirely from llama.cpp, the C++ inference engine Georgi Gerganov created in March 2023. For over a year, Ollama's README did not mention llama.cpp, and its binary distributions did not include the MIT license notice required for the llama.cpp code they shipped.

Community members opened GitHub issue #3185 in early 2024 requesting license compliance; it went more than 400 days without a response from maintainers. After issue #3697 was opened in April 2024 specifically requesting acknowledgment of llama.cpp, Ollama co-founder Michael Chiang eventually added a single line to the bottom of the README: "llama.cpp project founded by Georgi Gerganov."

Technical Problems with Custom Backend

In mid-2025, Ollama stopped using llama.cpp as its inference backend and built a custom implementation directly on top of ggml. This custom backend reintroduced problems that llama.cpp had solved years earlier, including:

Broken structured output support
Vision model failures
GGML assertion crashes across multiple versions
Failures on models that worked fine in upstream llama.cpp
Lack of support for tensor types required by new releases like GPT-OSS 20B

Georgi Gerganov stated that Ollama had forked ggml and made bad changes to it.

Performance Benchmarks

Multiple community tests report llama.cpp running up to roughly 1.8x faster than Ollama on the same hardware with the same model:

161 tokens per second (llama.cpp) versus 89 tokens per second (Ollama), about 1.8x
On CPU, a performance gap of 30-50%
In a recent comparison on Qwen-3 Coder 32B, ~70% higher throughput with llama.cpp
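
The figures above mix two ways of stating the same gap: a speedup ratio ("1.8x") and a percentage ("70% higher"). The following is a minimal Python sketch of how the two relate, using only the 161 and 89 tokens-per-second numbers quoted above. The function names are illustrative and do not come from any benchmark tool.

```python
def speedup(fast_tps: float, slow_tps: float) -> float:
    """Return how many times higher the first throughput is than the second."""
    if slow_tps <= 0:
        raise ValueError("slow_tps must be positive")
    return fast_tps / slow_tps


def percent_higher(fast_tps: float, slow_tps: float) -> float:
    """Express the same gap as a percentage increase over the slower result."""
    return (speedup(fast_tps, slow_tps) - 1.0) * 100.0


# Throughput figures quoted in the text (tokens per second).
llama_cpp_tps = 161
ollama_tps = 89

print(f"{speedup(llama_cpp_tps, ollama_tps):.2f}x")               # 1.81x
print(f"{percent_higher(llama_cpp_tps, ollama_tps):.0f}% higher")  # 81% higher
```

Read this way, a "30-50%" gap corresponds to a ratio of 1.3x to 1.5x, and "~70% higher throughput" to about 1.7x, so the three bullet points describe gaps of different sizes rather than one uniform 1.8x result.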

The performance overhead comes from Ollama's daemon layer, poor GPU offloading heuristics, and a vendored backend that trails upstream.

Model Naming Issues

When DeepSeek released its R1 model family in January 2025, Ollama listed the smaller distilled versions (such as DeepSeek-R1-Distill-Qwen-32B) without clearly indicating that they were distilled models rather than the full R1 model.

📖 Read the full source: HN LLM Tools