Most interesting!
Amazing job at optimizing various parts of the task.
It seems that being an MoE with 'only' 37B active params per token would put it within reach of CPU + RAM inference for the lucky hobbyist with an EPYC homelab and 8 or 16 memory channels on a second-hand single- or dual-socket Gen2 mobo (around $2500 used). Any idea how hard it would be (will it happen?) to support the new architecture in llama.cpp?
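For a rough sense of what that hardware buys you, here's a back-of-envelope decode-rate estimate. All the numbers are my own assumptions (4-bit quantization, DDR4-3200, decoding treated as purely memory-bandwidth-bound), not anything from the report:

    # Back-of-envelope CPU decode rate: decoding is roughly memory-bandwidth-
    # bound, so tokens/s ~= usable bandwidth / bytes read per token.
    ACTIVE_PARAMS = 37e9       # active params per token (MoE)
    BYTES_PER_PARAM = 0.5      # ~4-bit quantization (assumption)
    CHANNELS = 16              # dual-socket EPYC Gen2: 8 channels per socket
    BW_PER_CHANNEL = 25.6e9    # DDR4-3200 theoretical peak per channel (B/s)
    EFFICIENCY = 0.6           # realistic fraction of peak (assumption)

    bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM
    usable_bw = CHANNELS * BW_PER_CHANNEL * EFFICIENCY
    print(f"~{usable_bw / bytes_per_token:.0f} tokens/s decode upper bound")

That pencils out to roughly 13 tokens/s on the 16-channel setup, which is exactly why the MoE layout (reading only the active experts per token) makes this plausible at all on CPU.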
I must confess that my interest in LLMs is in grounded RAG, as I consider any intrinsic knowledge of the LLM to be unreliable overfitting. Is DeepSeek able to perform grounded RAG like Command R and Nous-Hermes 3, for instance?
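To be concrete about what I mean by grounded RAG, something along these lines, where the model is told to answer only from retrieved passages and cite them. This is just a generic sketch with made-up documents; Command R and Hermes 3 each define their own dedicated grounded-generation templates:

    # Generic grounded-RAG prompt construction (illustrative sketch only).
    question = "What does the report say about training cost?"  # example
    docs = [
        {"id": "doc_1", "text": "First retrieved passage..."},
        {"id": "doc_2", "text": "Second retrieved passage..."},
    ]

    # Inline each passage with its id so the model can cite it.
    context = "\n".join(f"[{d['id']}] {d['text']}" for d in docs)
    prompt = (
        "Answer using ONLY the documents below, citing document ids for "
        "every claim. If the answer is not in them, say so.\n\n"
        f"{context}\n\nQuestion: {question}"
    )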
Thx for this amazing model and all the insights in your report!
(yangxiaobo regularly submits copycat landing pages of AI products developed by other people as "Show HN", as evidenced by their posting history: https://news.ycombinator.com/submitted?id=yangxiaobo)