
Kimi's paper reveals the inference architecture that handles 80% of its traffic

Last updated: 2024-07-04
Cressy, reporting from Aofeisi
QbitAI (Quantum Bit) | WeChat official account QbitAI

The latest paper from Dark Side of the Moon (Moonshot AI) and Tsinghua University's KVCache.ai team reveals, for the first time, the inference architecture behind Kimi!

Kimi is one of the hottest stars in China's large-model industry. It has never lacked traffic, and is often even overloaded.

With the publication of the paper, the question of how Kimi was able to handle this huge amount of traffic has also been answered.

The inference architecture behind Kimi is called Mooncake, and its defining feature is a disaggregated design that separates the Prefill and Decoding stages.

Moreover, Mooncake was designed with high-traffic scenarios in mind and developed specifically for this situation.

In simulated scenarios, Mooncake delivers up to 525% higher throughput, and in real-world scenarios it handles about 75% more requests.

According to a Zhihu post by Xu Xinran, Vice President of Engineering at Dark Side of the Moon, more than 80% of Kimi's traffic is handled by this system.

Building a distributed system based on KV cache

The core of the entire Mooncake system design revolves around KV cache .

(The KV cache stores the key and value tensors computed during attention for previously processed tokens. Reusing these cached entries avoids redundant computation, which speeds up inference and reduces compute consumption in large models.)
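To make the benefit concrete, here is a toy sketch (our own illustration, not code from the paper) of why caching key/value pairs speeds up autoregressive decoding:

```python
# Illustrative only (not Mooncake's implementation): count how many
# key/value projections decoding needs with and without a KV cache.

def steps_without_cache(n_tokens):
    # each new token recomputes K/V for the whole prefix: 1 + 2 + ... + n
    return sum(range(1, n_tokens + 1))

def steps_with_cache(n_tokens):
    # cached prefixes mean one new K/V projection per generated token
    return n_tokens

print(steps_without_cache(100))  # 5050
print(steps_with_cache(100))     # 100
```

This quadratic-versus-linear gap is why reusing cached prefixes across requests, which Mooncake's scheduler tries to maximize, pays off.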

The reason is that the team expects demand for KV cache capacity to remain high for a long time, making the KV cache the natural target for optimization.

Structurally, Mooncake consists of a global scheduler (Conductor) , a Prefill node cluster, a Decoding node cluster, a distributed KVCache pool, and an RDMA communication component (Messenger) .

The global scheduler is the first stop after a user request reaches the system: it receives the request and dispatches it to Prefill and Decoding nodes according to the KV cache distribution and the current load.

The scheduler needs to comprehensively consider factors such as the reuse length of the KV cache and load balancing when scheduling to maximize the reuse of the KV cache.

Specifically for Mooncake, it adopts a heuristic automatic hotspot migration strategy that can automatically copy hotspot KV cache blocks without the need to accurately predict future accesses.

At the same time, this method of dynamically replicating hot KV cache blocks is also an important way to achieve load balancing.
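As a rough illustration of cache-aware scheduling (the scoring function, weights, and node fields below are our own assumptions, not details from the paper), a scheduler might trade prefix-cache reuse against node load like this:

```python
# Hypothetical sketch: pick the Prefill node that maximizes KV cache
# prefix reuse while penalizing nodes that are already heavily loaded.

def longest_cached_prefix(node_cache, request_tokens):
    """Length of the request prefix already cached on a node."""
    n = 0
    for cached, wanted in zip(node_cache, request_tokens):
        if cached != wanted:
            break
        n += 1
    return n

def pick_prefill_node(nodes, request_tokens, load_weight=1.0):
    # score = tokens we can skip recomputing, minus a load penalty
    def score(node):
        reuse = longest_cached_prefix(node["cached_tokens"], request_tokens)
        return reuse - load_weight * node["queued_tokens"]
    return max(nodes, key=score)

nodes = [
    {"name": "prefill-0", "cached_tokens": [1, 2, 3, 4], "queued_tokens": 1},
    {"name": "prefill-1", "cached_tokens": [1, 2],       "queued_tokens": 0},
]
print(pick_prefill_node(nodes, [1, 2, 3, 4, 5])["name"])  # prefill-0
```

Here the busier node still wins because its longer cached prefix saves more prefill work than its extra queue costs, which is the kind of trade-off the article describes.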

Experimental results show that, compared with random scheduling and load-balancing scheduling, Mooncake's scheduling policy significantly reduces TTFT (Time To First Token) and improves system performance.

After scheduling is completed, the tasks will be handed over to the Prefill and Decoding nodes for calculation respectively.

After a Prefill node receives a request forwarded by the scheduler, it reads any reusable cache from the KV cache pool, performs the prefill computation, and generates new KV cache entries.

For long context requests, Mooncake will also divide them into blocks and use multiple nodes for parallel processing to reduce latency.
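A minimal sketch of the chunking idea (illustrative only; a real implementation must preserve the attention dependencies between chunks, which this toy ignores):

```python
# Hypothetical sketch: split a long prompt into chunks so several
# Prefill workers can process them in parallel, then merge the
# resulting KV cache entries in order.
from concurrent.futures import ThreadPoolExecutor

def chunk_prompt(tokens, chunk_size):
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

def prefill_chunk(chunk):
    # stand-in for computing KV cache entries for one chunk
    return [("kv", t) for t in chunk]

prompt = list(range(10))          # a 10-token "prompt"
chunks = chunk_prompt(prompt, 4)  # three chunks: sizes 4, 4, 2

with ThreadPoolExecutor() as pool:
    kv_parts = list(pool.map(prefill_chunk, chunks))  # order preserved

kv_cache = [entry for part in kv_parts for entry in part]
print(len(kv_cache))  # 10
```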

In addition to receiving requests from the scheduler, the Decoding nodes also receive the KV cache generated during the Prefill phase; they then decode from this cache to generate the final output.

Among them, large-capacity, high-performance KV cache storage is provided by the cache pool; the RDMA communication component is responsible for KV cache transmission between different nodes with its advantages of high bandwidth and low latency.

In addition to its KV cache-centric workflow, Mooncake has another important feature: a disaggregated architecture.

One of the main reasons for adopting a disaggregated architecture is that the Prefill and Decoding stages have very different computational characteristics.

Specifically, the Prefill stage determines TTFT (Time To First Token), while the Decoding stage determines TBT (Time Between Tokens).
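The two metrics can be computed directly from token timestamps; the helpers below are our own illustration, not the paper's code:

```python
# Illustrative helpers for the two serving-latency metrics.

def ttft(request_time, token_times):
    """Time To First Token: request arrival -> first output token."""
    return token_times[0] - request_time

def tbt(token_times):
    """Time Between Tokens: mean gap between consecutive output tokens."""
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return sum(gaps) / len(gaps)

times = [1.0, 1.5, 1.6, 1.7, 1.8]  # output-token timestamps in seconds
print(ttft(0.0, times))        # 1.0
print(round(tbt(times), 3))    # 0.2
```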

The two stages therefore differ in computational complexity, memory-access patterns, parallelism granularity, and latency sensitivity.

Accordingly, the Dark Side of the Moon team split the two stages onto separate GPU node clusters, achieving resource isolation and stage-specific optimization.

In addition, the KV cache pool in Mooncake is also distributed, making full use of the idle CPU, DRAM, and SSD resources in the GPU cluster to achieve large-capacity, high-bandwidth KV cache storage and transmission, while also reducing the waste of idle resources.

Predict load in advance and reject excess requests in time

However, even though Mooncake adopts an efficient separation architecture, the extremely large traffic in the actual environment is still a test for the system.

In this regard, the author also proposed new coping strategies.

In an overload scenario, the key to scheduling is to decide whether to accept new requests.

Since Mooncake uses a disaggregated architecture, it can adopt an early-rejection strategy: requests are rejected in advance, during the Prefill phase, based on the load of the Decoding nodes.

Mooncake uses the SLO (Service Level Objective) satisfaction of TTFT and TBT as a load metric.

The specific SLO requirement is that the 90th percentile value (P90) of TTFT shall not exceed 10 times the processing time of a single request under no-load conditions, and the P90 value of TBT shall not exceed 5 times.
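A hedged sketch of this load check (the SLO multipliers come from the description above; the percentile helper, baseline numbers, and samples are illustrative):

```python
# Illustrative sketch: accept a new request only while the P90 of TTFT
# and TBT stay within the paper's stated SLO multiples of the
# unloaded (no-contention) processing times.

def p90(samples):
    """Simple nearest-rank 90th percentile (illustrative helper)."""
    s = sorted(samples)
    return s[int(0.9 * (len(s) - 1))]

def meets_slo(ttft_samples, tbt_samples, ttft_unloaded, tbt_unloaded):
    return (p90(ttft_samples) <= 10 * ttft_unloaded and
            p90(tbt_samples) <= 5 * tbt_unloaded)

def accept_request(ttft_samples, tbt_samples):
    # baseline unloaded latencies below are made-up example values
    return meets_slo(ttft_samples, tbt_samples,
                     ttft_unloaded=0.5, tbt_unloaded=0.05)

healthy = accept_request([1.0, 2.0, 3.0], [0.1, 0.2, 0.15])
print(healthy)  # True
```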

This early-rejection strategy significantly reduces wasted Prefill computation and improves resource utilization, but it also introduces a new problem: the loads of the Prefill and Decoding nodes begin to fluctuate, which lowers resource utilization and hurts system performance.

This is because, under the early-rejection strategy, the system's decision to reject a request lags behind the actual load, as the following four phases illustrate:

  • In phase 1, the loads of the Prefill node and the Decoding node are low. At this time, the scheduler will continue to accept new requests until the load of the Prefill node reaches the upper limit.

  • After entering phase 2, the requests processed by the Prefill nodes begin to enter the Decoding nodes, causing their load to rise rapidly. Once the Decoding load exceeds the threshold, the scheduler starts rejecting new requests, but the Prefill load is still high.

  • In phase 3, the load on the Prefill node begins to decrease as the scheduler rejects new requests. However, the previously accumulated requests are being processed in the Decoding phase, and the node load is still high.

  • Finally, in phase 4, the load on the Decoding node starts to decrease because all previous requests have been processed and new requests are rejected. At this time, the scheduler starts accepting new requests again, and the load on the Prefill node starts to increase again.

  • After that, this process will be repeated periodically, causing the load of the Prefill and Decoding nodes to fluctuate in opposite phases.

To address this problem, the Dark Side of the Moon team revised this simple early rejection strategy and proposed a prediction-based early rejection strategy to reduce the fluctuation of node load.

The core idea of this strategy is to predict the Decoding node load after a period of time and decide whether to reject the request based on the prediction.

Predictions can be made at two levels: request level and system level. Request-level predictions are more difficult because it is necessary to predict the execution time of a single request; system-level predictions are relatively easier because they only require predicting the overall load.

Mooncake uses a simplified system-level prediction method: it assumes the execution time of each request follows a fixed distribution, and forecasts the future load on that basis.
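A simplified sketch of such system-level prediction, under our own further assumption of a single fixed expected decode time per request (the paper assumes a fixed distribution; all names and numbers here are illustrative):

```python
# Illustrative sketch: predict how many requests will occupy the
# Decoding nodes at a future time, assuming every request decodes for
# roughly AVG_DECODE_TIME seconds, and reject new work if the
# predicted load would exceed capacity.

AVG_DECODE_TIME = 10.0  # assumed fixed expected decode duration (s)

def predicted_decoding_load(now, horizon, decode_starts, prefill_finish_times):
    t = now + horizon
    # requests already decoding that will still be running at time t
    still_running = sum(1 for s in decode_starts
                        if s + AVG_DECODE_TIME > t)
    # requests still in Prefill that will have entered Decoding by t
    incoming = sum(1 for f in prefill_finish_times
                   if f <= t and f + AVG_DECODE_TIME > t)
    return still_running + incoming

def should_accept(now, horizon, decode_starts, prefill_finish_times, capacity):
    load = predicted_decoding_load(now, horizon,
                                   decode_starts, prefill_finish_times)
    return load < capacity

load = predicted_decoding_load(
    now=0.0, horizon=5.0,
    decode_starts=[-8.0, -2.0],        # finish at t=2 and t=8
    prefill_finish_times=[1.0, 20.0])  # only the first joins before t=5
print(load)  # 2
```

Predicting at the horizon rather than reacting to the instantaneous load is what damps the anti-phase oscillation between Prefill and Decoding described above.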

Experimental results show that this prediction-based early rejection strategy can effectively alleviate the load fluctuation problem.

Ultimately, end-to-end performance evaluation shows that Mooncake's architecture design and optimization strategies effectively improve inference serving performance, with especially pronounced advantages on long contexts and in real-world scenarios.

On the ArXiv Summarization and L-Eval datasets, Mooncake achieves 20% and 40% higher throughput than the baseline method vLLM, respectively.

On the simulated dataset, Mooncake's throughput is up to 525% higher than vLLM's, and on the real dataset it also processes about 75% more requests.

The performance evaluation results under the overload scenario show that when using the prediction-based early rejection strategy, the number of rejected requests is reduced from 4183 in the baseline to 3589, indicating that the system's request processing capability has been improved.

Regarding future development, Zhang Mingxing, another author of the paper and assistant professor of the Department of Computer Science at Tsinghua University, said that judging from the current trend, the load of large model services will become more complex and diversified, and scheduling will become more complex and more important.

As for Dark Side of the Moon's own direction, Xu Xinran offered an answer: the disaggregated design means the whole system can evolve independently along two axes, "compute per dollar" and "bandwidth per dollar", which is also friendlier to hardware optimization.

Paper: https://arxiv.org/pdf/2407.00079
GitHub: https://github.com/kvcache-ai/Mooncake
Reference links:
[1] https://zhuanlan.zhihu.com/p/705910725
[2] https://zhuanlan.zhihu.com/p/706204757




-over-


