Model is not working, any idea for a fix?

#1
by gopi87 - opened

CUDA_VISIBLE_DEVICES=2,3,0,1 \
~/llama.cpp/build/bin/llama-server \
  --model /mnt/nvme/nex/DeepSeek-V4-Flash.gguf \
  --tensor-split 1,1,1,1 \
  --cpu-moe \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --ctx-size 1000 \
  --batch-size 1000 \
  --ubatch-size 1000 \
  --parallel 1 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --threads 35 \
  --threads-batch 35 \
  --jinja \
  --host 0.0.0.0 \
  --port 8080

I am running it like this, but the model is producing gibberish.

I just ran "./llama-server -hf sokann/DeepSeek-V4-Flash-GGUF (or -m path/to/DeepSeek-V4-Flash.gguf) -c 32768 --host 0.0.0.0 --port 8085 --jinja -np 1 -lv 4" and it worked fine on 1x 3090 plus some system RAM.
I think the multi-GPU setup might be broken for now.
Also, I don't think the context, batch, and ubatch sizes are correct.
I just left everything on default and it figured it out by itself.

It looks like the gibberish is caused by the Q8 KV cache (--cache-type-k q8_0 and --cache-type-v q8_0). It works fine with the default F16 KV cache.

Exactly right, it looks like the KV cache has some problem.

Correct.
