(1) Use the prebuilt container image published for llama-cpp-python:
https://github.com/abetlen/llama-cpp-python/pkgs/container/llama-cpp-python
(2) Mount Kubernetes storage (e.g. a PersistentVolumeClaim) at /models,
    and set the MODEL environment variable to the path of the right llama-model.gguf
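Steps (1) and (2) can be sketched as a single Deployment manifest. This is a minimal sketch, not a drop-in config: the names (llama-server, models-pvc), the :latest tag, and the model filename are assumptions to adapt to your cluster.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-server                # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-server
  template:
    metadata:
      labels:
        app: llama-server
    spec:
      containers:
        - name: llama-cpp-python
          # Image from the GHCR package linked above; pin a specific tag in practice.
          image: ghcr.io/abetlen/llama-cpp-python:latest
          ports:
            - containerPort: 8000
          env:
            - name: MODEL
              value: /models/llama-model.gguf   # point at your .gguf file
          volumeMounts:
            - name: models
              mountPath: /models                # k8s storage mounted as /models
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: models-pvc               # hypothetical PVC holding the model
```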
(3) Expose port 8000 through a LoadBalancer Service so the server is reachable from outside the cluster
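Step (3) as a manifest sketch; the name and selector are assumptions that should match whatever labels your Deployment uses:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: llama-server          # hypothetical name
spec:
  type: LoadBalancer          # asks the cloud provider for an external IP
  selector:
    app: llama-server         # must match the pod labels of your Deployment
  ports:
    - port: 8000
      targetPort: 8000
```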
(4) Browse to http://<external-ip>:8000/docs to explore the API in the auto-generated OpenAPI (Swagger) UI
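Besides the interactive /docs page, the server exposes OpenAI-compatible endpoints you can call directly. A sketch, assuming the LoadBalancer assigned 203.0.113.10 (substitute the EXTERNAL-IP that `kubectl get svc` reports for your Service); this needs the live server, so run it only against your deployment:

```shell
# Hypothetical external IP assigned by the LoadBalancer; substitute your own.
EXTERNAL_IP=203.0.113.10

# POST a completion request to the OpenAI-compatible endpoint.
curl -s "http://${EXTERNAL_IP}:8000/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Q: What is a GGUF file? A:", "max_tokens": 32}'
```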