
Commit 631cfc2

Tighten compilation cache invariants around eagle
I'm recording my understanding of how eagle and the compilation cache work after discussing vllm-project#17211 with @luyuzhe111 and @WoosukKwon. In the future we will likely want to torch.compile multiple pieces of code (e.g. the decoder and encoder separately), and at that point we'll need to refactor the system to support it (each compiled region needs its own cache directory with its own hash). Until then, the current design seems fine.

Signed-off-by: rzou <zou3519@gmail.com>
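To make the future refactor mentioned above concrete, here is a minimal sketch, assuming hypothetical names (region_cache_dir, base_dir, region_name, region_config) that do not exist in vLLM today, of what a per-region cache directory keyed by its own hash could look like:

# Hypothetical sketch only: vLLM currently distinguishes compiled graphs by
# compilation_counter.num_graphs_seen, not by per-region hashes.
import hashlib
import os


def region_cache_dir(base_dir: str, region_name: str, region_config: str) -> str:
    # Derive a cache dir for one torch.compile'd region (e.g. "encoder",
    # "decoder", "eagle3_head") from a hash of that region's own config.
    region_hash = hashlib.sha256(
        f"{region_name}:{region_config}".encode()).hexdigest()[:10]
    return os.path.join(base_dir, f"{region_name}-{region_hash}")


# Example: region_cache_dir("/tmp/torch_compile_cache", "decoder", "tp=1")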
1 parent f62cad6 commit 631cfc2

File tree

1 file changed: +12 −0 lines changed


vllm/compilation/backends.py

Lines changed: 12 additions & 0 deletions
@@ -415,6 +415,18 @@ def __call__(self, graph: fx.GraphModule, example_inputs) -> Callable:
             self.compilation_config.cache_dir = cache_dir

         if compilation_counter.num_graphs_seen > 0:
+            # NOTE: Eagle3 compilation
+            # The eagle3 head is a separate model that gets run, so it needs
+            # its own cache dir (each cache dir is 1:1 with a model.forward).
+            # The eagle3 head does not need its own hash; the hash of the
+            # original model entirely determines the config of the eagle3 head.
+            #
+            # If you are here because you are using multiple torch.compile
+            # calls in a single model, please open an issue and let's discuss.
+            speculative_config = self.vllm_config.speculative_config
+            assert speculative_config is not None
+            assert speculative_config.method == "eagle3"
             cache_dir = self.compilation_config.cache_dir + \
                 f'-{compilation_counter.num_graphs_seen}'
         else:
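For readers skimming the hunk, here is a standalone sketch of the cache-dir rule it enforces, with base_cache_dir, num_graphs_seen, and speculative_method standing in for vLLM's real config and counter objects (a simplification, not the actual API):

# Simplified sketch of the cache-dir selection above. Graph 0 (the target
# model) keeps the base cache dir; any further graph must be the eagle3
# draft head and gets a "-{num_graphs_seen}" suffix so its compiled
# artifacts never collide with the target model's.
from typing import Optional


def select_cache_dir(base_cache_dir: str,
                     num_graphs_seen: int,
                     speculative_method: Optional[str]) -> str:
    if num_graphs_seen > 0:
        # Mirrors the new asserts: a second compiled graph is only expected
        # when eagle3 speculative decoding is enabled.
        assert speculative_method == "eagle3"
        return f"{base_cache_dir}-{num_graphs_seen}"
    return base_cache_dir


assert select_cache_dir("/tmp/cache/abc", 0, None) == "/tmp/cache/abc"
assert select_cache_dir("/tmp/cache/abc", 1, "eagle3") == "/tmp/cache/abc-1"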
