Ollama is a front-end written in Go that wraps the llama.cpp back-end.
Here are the steps for setting up Ollama in a k8s cluster.
(1) Write a k8s yaml file (see the sketch below) that:
- uses the Ollama docker image (ollama/ollama),
- mounts persistent storage for the models at /root/.ollama,
- exposes Ollama port 11434 as a service (LoadBalancer).
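A minimal manifest might look like the following sketch. The PVC name, storage size, and labels are assumptions here; adjust them for your cluster.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models            # hypothetical name
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 50Gi              # models are large; size to taste
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama
        ports:
        - containerPort: 11434
        volumeMounts:
        - name: models
          mountPath: /root/.ollama   # persistent storage for pulled models
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: ollama-models
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
spec:
  type: LoadBalancer
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434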
(2) Log in to the Ollama pod and pull models:
kubectl exec -it comrite-web-ollama-8676669bb-jvx2z -- /bin/bash
nohup ollama pull llama2 &
nohup ollama pull codellama &
We do not need to run "ollama run llama2" inside the container, because the image already starts ollama serve; it loads a model on demand when a request arrives.
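To confirm which models have been pulled and are available to the server, you can list them through the API (using the same LoadBalancer IP as in the tests below):

curl http://192.168.86.88:11434/api/tags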
(3) Test:
curl http://192.168.86.88:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?"
}'
curl -X POST http://192.168.86.88:11434/api/generate -d '{
  "model": "codellama",
  "prompt": "Write me a function that outputs the fibonacci sequence"
}'
You can pass a parameter to disable streaming:
"stream": false
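For example, the same generate request then returns a single JSON response instead of a stream of chunks:

curl http://192.168.86.88:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'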