Instructions to use sokann/DeepSeek-V4-Flash-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use sokann/DeepSeek-V4-Flash-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="sokann/DeepSeek-V4-Flash-GGUF", filename="DeepSeek-V4-Flash.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use sokann/DeepSeek-V4-Flash-GGUF with llama.cpp:
Install (macOS, Linux)
curl -LsSf https://llama.app/install.sh | sh # Start a local OpenAI-compatible server with a web UI: llama serve -hf sokann/DeepSeek-V4-Flash-GGUF # Run inference directly in the terminal: llama cli -hf sokann/DeepSeek-V4-Flash-GGUF
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama serve -hf sokann/DeepSeek-V4-Flash-GGUF # Run inference directly in the terminal: llama cli -hf sokann/DeepSeek-V4-Flash-GGUF
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf sokann/DeepSeek-V4-Flash-GGUF # Run inference directly in the terminal: ./llama-cli -hf sokann/DeepSeek-V4-Flash-GGUF
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf sokann/DeepSeek-V4-Flash-GGUF # Run inference directly in the terminal: ./build/bin/llama-cli -hf sokann/DeepSeek-V4-Flash-GGUF
Use Docker
docker model run hf.co/sokann/DeepSeek-V4-Flash-GGUF
- LM Studio
- Jan
- Ollama
How to use sokann/DeepSeek-V4-Flash-GGUF with Ollama:
ollama run hf.co/sokann/DeepSeek-V4-Flash-GGUF
- Unsloth Studio
How to use sokann/DeepSeek-V4-Flash-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for sokann/DeepSeek-V4-Flash-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for sokann/DeepSeek-V4-Flash-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for sokann/DeepSeek-V4-Flash-GGUF to start chatting
- Pi
How to use sokann/DeepSeek-V4-Flash-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama serve -hf sokann/DeepSeek-V4-Flash-GGUF
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "sokann/DeepSeek-V4-Flash-GGUF" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use sokann/DeepSeek-V4-Flash-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama serve -hf sokann/DeepSeek-V4-Flash-GGUF
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default sokann/DeepSeek-V4-Flash-GGUF
Run Hermes
hermes
- Atomic Chat new
- OpenClaw new
How to use sokann/DeepSeek-V4-Flash-GGUF with OpenClaw:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama serve -hf sokann/DeepSeek-V4-Flash-GGUF
Configure OpenClaw
# Install OpenClaw: npm install -g openclaw@latest # Register the local server and set it as the default model: openclaw onboard --non-interactive --mode local \ --auth-choice custom-api-key \ --custom-base-url http://127.0.0.1:8080/v1 \ --custom-model-id "sokann/DeepSeek-V4-Flash-GGUF" \ --custom-provider-id llama-cpp \ --custom-compatibility openai \ --custom-text-input \ --accept-risk \ --skip-health
Run OpenClaw
openclaw agent --local --agent main --message "Hello from Hugging Face"
- Docker Model Runner
How to use sokann/DeepSeek-V4-Flash-GGUF with Docker Model Runner:
docker model run hf.co/sokann/DeepSeek-V4-Flash-GGUF
- Lemonade
How to use sokann/DeepSeek-V4-Flash-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull sokann/DeepSeek-V4-Flash-GGUF
Run and chat with the model
lemonade run user.DeepSeek-V4-Flash-GGUF-{{QUANT_TAG}}List all available models
lemonade list
| diff --git a/ggml/include/ggml-rpc.h b/ggml/include/ggml-rpc.h | |
| index 5ad121ae5..de770ee63 100644 | |
| --- a/ggml/include/ggml-rpc.h | |
| +++ b/ggml/include/ggml-rpc.h | |
| extern "C" { | |
| #define RPC_PROTO_PATCH_VERSION 1 | |
| #ifdef __cplusplus | |
| -static_assert(GGML_OP_COUNT == 97, "GGML_OP_COUNT has changed - update RPC_PROTO_PATCH_VERSION"); | |
| +static_assert(GGML_OP_COUNT == 98, "GGML_OP_COUNT has changed - update RPC_PROTO_PATCH_VERSION"); | |
| #endif | |
| #define GGML_RPC_MAX_SERVERS 16 | |
| diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h | |
| index d6807b6dd..a050211b1 100644 | |
| --- a/ggml/include/ggml.h | |
| +++ b/ggml/include/ggml.h | |
| extern "C" { | |
| GGML_OP_RWKV_WKV7, | |
| GGML_OP_SOLVE_TRI, | |
| GGML_OP_GATED_DELTA_NET, | |
| + GGML_OP_LIGHTNING_INDEXER, | |
| GGML_OP_UNARY, | |
| extern "C" { | |
| struct ggml_tensor * state, | |
| int64_t K); | |
| + GGML_API struct ggml_tensor * ggml_lightning_indexer( | |
| + struct ggml_context * ctx, | |
| + struct ggml_tensor * q, | |
| + struct ggml_tensor * k, | |
| + struct ggml_tensor * weights, | |
| + float scale_embd, | |
| + float scale_heads); | |
| + | |
| // custom operators | |
| typedef void (*ggml_custom1_op_t)(struct ggml_tensor * dst , const struct ggml_tensor * a, int ith, int nth, void * userdata); | |
| diff --git a/ggml/src/ggml-cpu/ggml-cpu.c b/ggml/src/ggml-cpu/ggml-cpu.c | |
| index eb8341c9a..a290e6dad 100644 | |
| --- a/ggml/src/ggml-cpu/ggml-cpu.c | |
| +++ b/ggml/src/ggml-cpu/ggml-cpu.c | |
| static void ggml_compute_forward(struct ggml_compute_params * params, struct ggm | |
| { | |
| ggml_compute_forward_gated_delta_net(params, tensor); | |
| } break; | |
| + case GGML_OP_LIGHTNING_INDEXER: | |
| + { | |
| + ggml_compute_forward_lightning_indexer(params, tensor); | |
| + } break; | |
| case GGML_OP_MAP_CUSTOM1: | |
| { | |
| ggml_compute_forward_map_custom1(params, tensor); | |
| static int ggml_get_n_tasks(struct ggml_tensor * node, int n_threads) { | |
| case GGML_OP_FLASH_ATTN_BACK: | |
| case GGML_OP_SSM_CONV: | |
| case GGML_OP_SSM_SCAN: | |
| + case GGML_OP_LIGHTNING_INDEXER: | |
| { | |
| n_tasks = n_threads; | |
| } break; | |
| struct ggml_cplan ggml_graph_plan( | |
| { | |
| GGML_ABORT("fatal error"); | |
| } | |
| + case GGML_OP_LIGHTNING_INDEXER: | |
| + { | |
| + // temp buffer for dequantizing lightning indexer keys | |
| + const int64_t ne10 = node->src[1]->ne[0]; | |
| + cur += sizeof(float)*ne10*n_tasks; | |
| + } break; | |
| default: | |
| break; | |
| } | |
| diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp | |
| index 6724686b8..971798728 100644 | |
| --- a/ggml/src/ggml-cpu/ops.cpp | |
| +++ b/ggml/src/ggml-cpu/ops.cpp | |
| void ggml_compute_forward_fwht(const ggml_compute_params * params, ggml_tensor * | |
| } | |
| } | |
| } | |
| + | |
| +// ggml_compute_forward_lightning_indexer | |
| + | |
| +void ggml_compute_forward_lightning_indexer( | |
| + const ggml_compute_params * params, | |
| + ggml_tensor * dst) { | |
| + | |
| + const ggml_tensor * src0 = dst->src[0]; // q | |
| + const ggml_tensor * src1 = dst->src[1]; // k | |
| + const ggml_tensor * src2 = dst->src[2]; // weights | |
| + | |
| + const float scale_embd = ggml_get_op_params_f32(dst, 0); | |
| + const float scale_heads = ggml_get_op_params_f32(dst, 1); | |
| + | |
| + GGML_ASSERT(dst->type == GGML_TYPE_F32); | |
| + GGML_ASSERT(src0->type == GGML_TYPE_F32); | |
| + GGML_ASSERT(src2->type == GGML_TYPE_F32); | |
| + | |
| + GGML_TENSOR_TERNARY_OP_LOCALS | |
| + | |
| + GGML_ASSERT( nb0 == sizeof(float)); | |
| + GGML_ASSERT(nb00 == sizeof(float)); | |
| + | |
| + int n_embd = src0->ne[0]; | |
| + int n_head = src0->ne[1]; | |
| + int n_batch = src0->ne[2]; | |
| + int n_stream = src0->ne[3]; | |
| + int n_kv = src1->ne[2]; | |
| + | |
| + ggml_to_float_t const k_to_float = ggml_get_type_traits(src1->type)->to_float; | |
| + GGML_ASSERT((src1->type == GGML_TYPE_F32 || k_to_float) && "lightning indexer: unsupported K-type"); | |
| + | |
| + const int nr = n_kv; | |
| + const int ith = params->ith; | |
| + const int nth = params->nth; | |
| + | |
| + // (temporary) buffer for K converted to float | |
| + float * src1_row_f32 = (float *) params->wdata + ith*(1*n_embd + CACHE_LINE_SIZE_F32); | |
| + | |
| + // rows per thread | |
| + const int dr = (nr + nth - 1)/nth; | |
| + | |
| + // row range for this thread | |
| + const int ir0 = dr*ith; | |
| + const int ir1 = MIN(ir0 + dr, nr); | |
| + | |
| + for (int i_stream = 0; i_stream < n_stream; ++i_stream) { | |
| + for (int i_batch = 0; i_batch < n_batch; ++i_batch) { | |
| + for (int i_kv = ir0; i_kv < ir1; ++i_kv) { | |
| + char * src1_row = (char *) src1->data + i_kv*nb12 + i_stream*nb13; | |
| + if (k_to_float) { | |
| + k_to_float(src1_row, src1_row_f32, n_embd); | |
| + } else { | |
| + src1_row_f32 = (float *) src1_row; | |
| + } | |
| + float * src2_row = (float *) ((char *) src2->data + i_batch*nb21 + i_stream*nb23); | |
| + float * dst_row = (float *) ((char *) dst->data + i_batch*nb1 + i_stream*nb3); | |
| + float score = 0.0f; | |
| + for (int i_head = 0; i_head < n_head; ++i_head) { | |
| + // dot product of q and k for head i_head | |
| + float qk = 0.0f; | |
| + float * src0_row = (float *) ((char *) src0->data + i_head*nb01 + i_batch*nb02 + i_stream*nb03); | |
| + ggml_vec_dot_f32(n_embd, &qk, 0, src0_row, 0, src1_row_f32, 0, 1); | |
| + qk *= scale_embd; | |
| + // ReLU and weights | |
| + score += MAX(qk, 0.0f) * src2_row[i_head]; | |
| + } | |
| + score *= scale_heads; | |
| + dst_row[i_kv] = score; | |
| + } | |
| + } | |
| + } | |
| +} | |
| diff --git a/ggml/src/ggml-cpu/ops.h b/ggml/src/ggml-cpu/ops.h | |
| index a8e18c716..e956c25d3 100644 | |
| --- a/ggml/src/ggml-cpu/ops.h | |
| +++ b/ggml/src/ggml-cpu/ops.h | |
| void ggml_compute_forward_rwkv_wkv7(const struct ggml_compute_params * params, s | |
| void ggml_compute_forward_solve_tri(const struct ggml_compute_params * params, struct ggml_tensor * dst); | |
| void ggml_compute_forward_gla(const struct ggml_compute_params * params, struct ggml_tensor * dst); | |
| void ggml_compute_forward_gated_delta_net(const struct ggml_compute_params * params, struct ggml_tensor * dst); | |
| +void ggml_compute_forward_lightning_indexer(const struct ggml_compute_params * params, struct ggml_tensor * dst); | |
| void ggml_compute_forward_map_custom1(const struct ggml_compute_params * params, struct ggml_tensor * dst); | |
| void ggml_compute_forward_map_custom2(const struct ggml_compute_params * params, struct ggml_tensor * dst); | |
| void ggml_compute_forward_map_custom3(const struct ggml_compute_params * params, struct ggml_tensor * dst); | |
| diff --git a/ggml/src/ggml-cuda/argsort.cu b/ggml/src/ggml-cuda/argsort.cu | |
| index c4f08091e..33a38c23e 100644 | |
| --- a/ggml/src/ggml-cuda/argsort.cu | |
| +++ b/ggml/src/ggml-cuda/argsort.cu | |
| static __global__ void init_offsets(int * offsets, const int ncols, const int nr | |
| #endif // STRIDED_ITERATOR_AVAILABLE | |
| #ifdef GGML_CUDA_USE_CUB | |
| + | |
| +// returns the suggested maximum number of rows to process during one argsort_f32_i32_cuda_cub() call | |
| +int argsort_f32_i32_cuda_cub_chunk_nrows(const size_t nb01, const int64_t nrows) { | |
| + // perform argsort in chunks up to approximately this size (currently 64MB) | |
| + // to avoid excessive temporary buffers memory usage | |
| + const int chunk_bytes = 1 << 26; | |
| + | |
| + // calculate how many rows will fit in one chunk (must be at least one) | |
| + const int chunk_nrows = chunk_bytes > nb01 ? chunk_bytes / nb01 : 1; | |
| + | |
| + // limit the resulting amount to total nrows | |
| + return nrows < chunk_nrows ? nrows : chunk_nrows; | |
| +} | |
| + | |
| void argsort_f32_i32_cuda_cub(ggml_cuda_pool & pool, | |
| const float * x, | |
| int * dst, | |
| void ggml_cuda_op_argsort(ggml_backend_cuda_context & ctx, ggml_tensor * dst) { | |
| const size_t shared_mem = ncols_pad * sizeof(int); | |
| const size_t max_shared_mem = ggml_cuda_info().devices[ggml_cuda_get_device()].smpb; | |
| - if (shared_mem > max_shared_mem || ncols > 1024) { | |
| - ggml_cuda_pool & pool = ctx.pool(); | |
| - argsort_f32_i32_cuda_cub(pool, src0_d, (int *) dst_d, ncols, nrows, order, stream); | |
| - } else { | |
| - argsort_f32_i32_cuda_bitonic(src0_d, (int *) dst_d, ncols, nrows, order, stream); | |
| + // early return if we can use bitonic argsort | |
| + if (shared_mem <= max_shared_mem && ncols <= 1024) { | |
| + return argsort_f32_i32_cuda_bitonic(src0_d, (int *) dst_d, ncols, nrows, order, stream); | |
| + } | |
| + | |
| + const int chunk_nrows = argsort_f32_i32_cuda_cub_chunk_nrows(src0->nb[1], nrows); | |
| + | |
| + ggml_cuda_pool & pool = ctx.pool(); | |
| + | |
| + for (int64_t i = 0; i < nrows; i += chunk_nrows) { | |
| + int iter_nrows = chunk_nrows < nrows - i ? chunk_nrows : nrows - i; | |
| + | |
| + argsort_f32_i32_cuda_cub(pool, src0_d, (int *) dst_d, ncols, iter_nrows, order, stream); | |
| + | |
| + src0_d += ncols * iter_nrows; | |
| + dst_d += ncols * iter_nrows; | |
| } | |
| #else | |
| argsort_f32_i32_cuda_bitonic(src0_d, (int *) dst_d, ncols, nrows, order, stream); | |
| diff --git a/ggml/src/ggml-cuda/argsort.cuh b/ggml/src/ggml-cuda/argsort.cuh | |
| index 22b7306f2..3abb6448a 100644 | |
| --- a/ggml/src/ggml-cuda/argsort.cuh | |
| +++ b/ggml/src/ggml-cuda/argsort.cuh | |
| void ggml_cuda_op_argsort(ggml_backend_cuda_context & ctx, ggml_tensor * dst); | |
| #ifdef GGML_CUDA_USE_CUB | |
| +int argsort_f32_i32_cuda_cub_chunk_nrows(const size_t nb01, const int64_t nrows); | |
| void argsort_f32_i32_cuda_cub(ggml_cuda_pool & pool, | |
| const float * x, | |
| int * dst, | |
| diff --git a/ggml/src/ggml-cuda/top-k.cu b/ggml/src/ggml-cuda/top-k.cu | |
| index db1d39e2d..5e708e6c5 100644 | |
| --- a/ggml/src/ggml-cuda/top-k.cu | |
| +++ b/ggml/src/ggml-cuda/top-k.cu | |
| void ggml_cuda_op_top_k(ggml_backend_cuda_context & ctx, ggml_tensor * dst) { | |
| const int ncols_pad = next_power_of_2(ncols); | |
| const size_t shared_mem = ncols_pad * sizeof(int); | |
| const size_t max_shared_mem = ggml_cuda_info().devices[ggml_cuda_get_device()].smpb; | |
| + const bool use_bitonic = shared_mem <= max_shared_mem && ncols <= 1024; | |
| + const int chunk_nrows = argsort_f32_i32_cuda_cub_chunk_nrows(src0->nb[1], nrows); | |
| - ggml_cuda_pool_alloc<int> temp_dst_alloc(pool, ncols * nrows); | |
| + ggml_cuda_pool_alloc<int> temp_dst_alloc(pool, ncols * chunk_nrows); | |
| int * tmp_dst = temp_dst_alloc.get(); | |
| - if (shared_mem > max_shared_mem || ncols > 1024) { | |
| - argsort_f32_i32_cuda_cub(pool, src0_d, tmp_dst, ncols, nrows, GGML_SORT_ORDER_DESC, stream); | |
| - } else { | |
| - argsort_f32_i32_cuda_bitonic(src0_d, tmp_dst, ncols, nrows, GGML_SORT_ORDER_DESC, stream); | |
| + for (int64_t i = 0; i < nrows; i += chunk_nrows) { | |
| + int iter_nrows = chunk_nrows < nrows - i ? chunk_nrows : nrows - i; | |
| + | |
| + if (use_bitonic) { | |
| + argsort_f32_i32_cuda_bitonic(src0_d, tmp_dst, ncols, iter_nrows, GGML_SORT_ORDER_DESC, stream); | |
| + } else { | |
| + argsort_f32_i32_cuda_cub(pool, src0_d, tmp_dst, ncols, iter_nrows, GGML_SORT_ORDER_DESC, stream); | |
| + } | |
| + CUDA_CHECK(cudaMemcpy2DAsync(dst_d, k * sizeof(int), tmp_dst, ncols * sizeof(int), k * sizeof(int), iter_nrows, | |
| + cudaMemcpyDeviceToDevice, stream)); | |
| + | |
| + src0_d += ncols * iter_nrows; | |
| + dst_d += k * iter_nrows; | |
| } | |
| - CUDA_CHECK(cudaMemcpy2DAsync(dst_d, k * sizeof(int), tmp_dst, ncols * sizeof(int), k * sizeof(int), nrows, | |
| - cudaMemcpyDeviceToDevice, stream)); | |
| #else // GGML_CUDA_USE_CUB | |
| ggml_cuda_pool_alloc<int> temp_dst_alloc(pool, ncols * nrows); | |
| int * tmp_dst = temp_dst_alloc.get(); | |
| diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c | |
| index 0f682fd18..31a5cbc97 100644 | |
| --- a/ggml/src/ggml.c | |
| +++ b/ggml/src/ggml.c | |
| static const char * GGML_OP_NAME[GGML_OP_COUNT] = { | |
| "RWKV_WKV7", | |
| "SOLVE_TRI", | |
| "GATED_DELTA_NET", | |
| + "LIGHTNING_INDEXER", | |
| "UNARY", | |
| static const char * GGML_OP_NAME[GGML_OP_COUNT] = { | |
| "GLU", | |
| }; | |
| -static_assert(GGML_OP_COUNT == 97, "GGML_OP_COUNT != 97"); | |
| +static_assert(GGML_OP_COUNT == 98, "GGML_OP_COUNT != 98"); | |
| static const char * GGML_OP_SYMBOL[GGML_OP_COUNT] = { | |
| "none", | |
| static const char * GGML_OP_SYMBOL[GGML_OP_COUNT] = { | |
| "rwkv_wkv7(r, w, k, v, a, b, s)", | |
| "A X = B, A triangular, solve X", | |
| "gated_delta_net(q, k, v, g, beta, s)", | |
| + "lightning_indexer(q, k, weights, scale_embd, scale_heads)", | |
| "unary(x)", | |
| static const char * GGML_OP_SYMBOL[GGML_OP_COUNT] = { | |
| "glu(x)", | |
| }; | |
| -static_assert(GGML_OP_COUNT == 97, "GGML_OP_COUNT != 97"); | |
| +static_assert(GGML_OP_COUNT == 98, "GGML_OP_COUNT != 98"); | |
| static_assert(GGML_OP_POOL_COUNT == 2, "GGML_OP_POOL_COUNT != 2"); | |
| struct ggml_tensor * ggml_gated_delta_net( | |
| return result; | |
| } | |
| +// ggml_lightning_indexer | |
| + | |
| +struct ggml_tensor * ggml_lightning_indexer( | |
| + struct ggml_context * ctx, | |
| + struct ggml_tensor * q, | |
| + struct ggml_tensor * k, | |
| + struct ggml_tensor * weights, | |
| + float scale_embd, | |
| + float scale_heads) { | |
| + | |
| + GGML_ASSERT(q->type == GGML_TYPE_F32); | |
| + GGML_ASSERT(weights->type == GGML_TYPE_F32); | |
| + GGML_ASSERT(q->ne[0] == k->ne[0]); | |
| + GGML_ASSERT(q->ne[1] == weights->ne[0]); | |
| + GGML_ASSERT(k->ne[1] == 1); | |
| + GGML_ASSERT(q->ne[2] == weights->ne[1]); | |
| + GGML_ASSERT(weights->ne[2] == 1); | |
| + GGML_ASSERT(q->ne[3] == k->ne[3]); | |
| + GGML_ASSERT(k->ne[3] == weights->ne[3]); | |
| + | |
| + int64_t ne[4] = { k->ne[2], q->ne[2], 1, q->ne[3] }; | |
| + struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne); | |
| + | |
| + ggml_set_op_params_f32(result, 0, scale_embd); | |
| + ggml_set_op_params_f32(result, 1, scale_heads); | |
| + | |
| + result->op = GGML_OP_LIGHTNING_INDEXER; | |
| + result->src[0] = q; | |
| + result->src[1] = k; | |
| + result->src[2] = weights; | |
| + | |
| + return result; | |
| +} | |
| + | |
| //////////////////////////////////////////////////////////////////////////////// | |
| struct ggml_hash_set ggml_hash_set_new(size_t size) { | |
| diff --git a/src/models/deepseek4.cpp b/src/models/deepseek4.cpp | |
| index 759654228..38652ef83 100644 | |
| --- a/src/models/deepseek4.cpp | |
| +++ b/src/models/deepseek4.cpp | |
| ggml_tensor * llama_model_deepseek4::graph::build_lid_top_k( | |
| indexer_weights->ne[0], indexer_weights->ne[1]/n_stream, indexer_weights->ne[2], n_stream, | |
| indexer_weights->nb[1], indexer_weights->nb[2]/n_stream, indexer_weights->nb[3]/n_stream, 0); | |
| - indexer_q = ggml_permute(ctx0, indexer_q, 0, 2, 1, 3); | |
| - cb(indexer_q, "lid_q", il); | |
| - indexer_k = ggml_permute(ctx0, indexer_k, 0, 2, 1, 3); | |
| - cb(indexer_k, "lid_k", il); | |
| - | |
| - ggml_tensor * indexer_kq = ggml_mul_mat(ctx0, indexer_k, indexer_q); | |
| - cb(indexer_kq, "lid_kq", il); | |
| - | |
| - indexer_kq = ggml_cont(ctx0, ggml_permute(ctx0, indexer_kq, 2, 1, 0, 3)); | |
| - cb(indexer_kq, "lid_kq", il); | |
| - | |
| - ggml_tensor * indexer_score = ggml_relu(ctx0, indexer_kq); | |
| - indexer_score = ggml_mul(ctx0, indexer_score, indexer_weights); | |
| - indexer_score = ggml_sum_rows(ctx0, indexer_score); | |
| - indexer_score = ggml_cont(ctx0, ggml_permute(ctx0, indexer_score, 2, 1, 0, 3)); | |
| + ggml_tensor * indexer_score = ggml_lightning_indexer(ctx0, indexer_q, indexer_k, indexer_weights, 1.0f, 1.0f); | |
| cb(indexer_score, "lid_score", il); | |
| indexer_score = ggml_add(ctx0, indexer_score, inp_lid.kq_mask); | |