♻️ Cache Me if You Can: Accelerating Diffusion Models through Block Caching

CVPR 2024
1Meta GenAI, 2Technical University of Munich, 3MCML, 4University of Oxford
Work done during Felix's internship at Meta GenAI
[Teaser image]

Speeding up diffusion models through block caching. We observe that diffusion models perform many redundant layer computations across timesteps when generating an image. Our block caching technique avoids these unnecessary computations, thereby speeding up inference by 1.5x-1.8x while maintaining image quality. Compared to the standard practice of naively reducing the number of denoising steps to match our inference speed, our approach produces more detailed and vibrant results.

Abstract

Diffusion models have recently revolutionized the field of image synthesis due to their ability to generate photorealistic images. However, one of the major drawbacks of diffusion models is that the image generation process is costly. A large image-to-image network has to be applied many times to iteratively refine an image from random noise. While many recent works propose techniques to reduce the number of required steps, they generally treat the underlying denoising network as a black box. In this work, we investigate the behavior of the layers within the network and find that 1) the layers' output changes smoothly over time, 2) the layers show distinct patterns of change, and 3) the change from step to step is often very small. We hypothesize that many layer computations in the denoising network are redundant. Leveraging this, we introduce block caching, in which we reuse outputs from layer blocks of previous steps to speed up inference. Furthermore, we propose a technique to automatically determine caching schedules based on each block's changes over timesteps. In our experiments, we show through FID, human evaluation, and qualitative analysis that block caching allows us to generate images with higher visual quality at the same computational cost. We demonstrate this for different state-of-the-art models (LDM and EMU) and solvers (DDIM and DPM).

Method

Analysis



We observe that in diffusion models, not only the intermediate results x but also the internal feature maps change smoothly over time. (a) We visualize output feature maps of two layer blocks within the denoising network via PCA. Structures change smoothly, at different rates. (b) We also observe this smooth layer-wise change when plotting the change in output from one step to the next, averaged over many different prompts and randomly initialized noise. Besides the average, we show the standard deviation as a shaded area. The patterns remain consistent. (Configuration: LDM-512, DPM, 20 steps.)

We make three key observations:

  1. Smooth change over time. Similar to the intermediate images during denoising, the block outputs change smoothly and gradually over time. This suggests a clear temporal relation between the outputs of a block.
  2. Distinct patterns of change. The blocks do not behave uniformly over time. Rather, they change substantially during certain periods of the denoising process while remaining nearly inactive in others. The standard deviation shows that this behavior is consistent across different images and random seeds. Note that some blocks, for example those at higher resolutions (very early or very late in the network), change most in the last 20% of the steps, while deeper blocks at lower resolutions change more at the beginning.
  3. Small step-to-step difference. Almost every block exhibits extended periods during the denoising process in which its output changes only very little (see the sketch after this list for one way to quantify this).
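
Such a change metric can be computed per block by comparing its outputs at consecutive solver steps. Below is a minimal sketch assuming a PyTorch U-Net whose blocks return plain tensors; the hook-based tracking, the `unet(latent, t)` call signature, and the relative-L1 metric are illustrative assumptions rather than the paper's exact implementation.

```python
import torch

def relative_change(prev: torch.Tensor, curr: torch.Tensor) -> float:
    # Relative L1 change between a block's outputs at consecutive steps.
    return ((curr - prev).abs().mean() / curr.abs().mean().clamp(min=1e-8)).item()

@torch.no_grad()
def track_block_changes(unet, blocks, trajectory):
    """Record each block's step-to-step output change over one denoising run.
    `blocks` maps a name to a module inside `unet`; `trajectory` is the list
    of (latent, timestep) pairs visited by the solver."""
    outputs = {name: [] for name in blocks}
    hooks = [
        module.register_forward_hook(
            # assumes the block returns a single tensor
            lambda mod, inp, out, name=name: outputs[name].append(out.detach())
        )
        for name, module in blocks.items()
    ]
    for latent, t in trajectory:
        unet(latent, t)  # hypothetical call signature
    for h in hooks:
        h.remove()
    # result[name][k] = change of the block's output from step k to step k+1
    return {
        name: [relative_change(a, b) for a, b in zip(feats, feats[1:])]
        for name, feats in outputs.items()
    }
```

Averaging these curves over many prompts and seeds yields plots like the ones shown above.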

Block Caching

We hypothesize that many layer blocks perform redundant computations during steps in which their outputs change very little. To reduce these redundant computations and speed up inference, we propose Block Caching, sketched below.
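
As a minimal sketch of the idea (the wrapper interface and step-indexed schedule below are hypothetical, not the paper's implementation), a block can be wrapped so that it recomputes its output only at scheduled steps and otherwise returns its cached value:

```python
import torch.nn as nn

class CachedBlock(nn.Module):
    """Wraps one layer block of the denoising network; at steps not listed in
    the schedule, the block's previous output is reused instead of recomputed."""

    def __init__(self, block: nn.Module, recompute_steps: set):
        super().__init__()
        self.block = block
        self.recompute_steps = recompute_steps  # steps at which to run the block
        self.cache = None

    def forward(self, x, step: int):
        if self.cache is None or step in self.recompute_steps:
            self.cache = self.block(x)  # fresh computation, stored for reuse
        return self.cache  # cached output at all other steps
```

The cache must be reset between generated images, and each block gets its own recomputation schedule, derived automatically as described next.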

Automatic Cache Schedule

Not every block should be cached all the time. To make a more informed decision about when and where to cache, we rely on the change metrics visualized above. Our intuition is that for any layer block i, we retain a cached value, computed at time step t_a, as long as the accumulated change does not exceed a certain threshold δ. Once the threshold is exceeded at time step t_b, we recompute the block's output.
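
Deriving such a schedule from the measured change curves could look like the following sketch; the accumulation rule follows the description above, while the function name and input format are assumptions.

```python
def cache_schedule(changes, delta):
    """Given one block's per-step changes (changes[k] = change from step k to
    step k+1, e.g. as measured above), return the steps at which the block
    must be recomputed. A value cached at step t_a is kept until the change
    accumulated since t_a exceeds delta; that step becomes t_b."""
    recompute = {0}      # the first step always computes the block
    accumulated = 0.0
    for k, change in enumerate(changes):
        accumulated += change
        if accumulated > delta:
            recompute.add(k + 1)  # threshold exceeded: recompute at this step
            accumulated = 0.0
    return recompute
```

Sharing the same δ across all blocks turns the schedule into a single knob that trades inference speed against fidelity.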

Scale Shift Adjustment

To enable the model to adjust to using cached values, we introduce a very lightweight scale-shift adjustment mechanism wherever we apply caching. To this end, we add a timestep-dependent scalar shift and scale parameter for each layer that receives a cached input.
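
One plausible minimal form of this adjustment is a per-step lookup table of scalars applied to the cached features; the parameterization below is an assumption, as the description above only specifies timestep-dependent scalar scale and shift parameters per affected layer.

```python
import torch
import torch.nn as nn

class ScaleShift(nn.Module):
    """Timestep-dependent scalar scale and shift applied to a cached block
    output, letting the network adjust to receiving cached features."""

    def __init__(self, num_steps: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(num_steps))   # one scalar per step
        self.shift = nn.Parameter(torch.zeros(num_steps))

    def forward(self, cached: torch.Tensor, step: int) -> torch.Tensor:
        return self.scale[step] * cached + self.shift[step]
```

Since only a handful of scalars are added per cached block, the adjustment is cheap to train and adds negligible inference cost.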

Results

EMU + Caching

Given a fixed computational budget, we can perform more denoising steps and obtain higher-quality results. Here, we compare EMU with our caching approach at 20 steps against the default setup at 14 steps. At identical inference speed, our caching technique produces finer details and more vibrant colors.

[Side-by-side image comparisons: Ours vs. Baseline]

  - A magical portal opening to reveal a hidden realm of wonders.
  - A tranquil garden with cherry blossoms in full bloom under a full moon.
  - An ancient castle on a cliff overlooking a vast, mist-covered valley.
  - A yellow tiger with blue stripes.
  - A time-traveling wizard riding a mechanical steed through a portal, leaving trails of stardust in their wake.
  - A floating city in the clouds where airships navigate through tunnels of light, and majestic creatures soar in the skies.

Quantitative Results

We conduct a human evaluation study on the visual appeal of images generated with either our caching configuration or the baseline without caching. We always compare configurations with the same latency.




LDM + Caching

We show different configurations for the common LDM architecture. The caching configurations at 20 steps and the baseline configuration at 14 steps have the same latency. The baseline with 20 steps is about 1.5x slower. Our method often provides richer colors and finer details. Through our scale-shift adjustment, we avoid artifacts that are visible when naively applying block caching.



Quantitative Results

For different solvers, we test our caching technique against baselines with 1) the same number of steps or 2) the same latency. In all cases, our proposed approach achieves a significant speedup while improving visual quality, as measured by FID on a COCO subset from which all faces were removed (for privacy reasons). Legend: SS = scale-shift adjustment, Img/s = images per second.

BibTeX

@article{wimbauer2023cache,
  title={Cache Me if You Can: Accelerating Diffusion Models through Block Caching},
  author={Wimbauer, Felix and Wu, Bichen and Schoenfeld, Edgar and Dai, Xiaoliang and Hou, Ji and He, Zijian and Sanakoyeu, Artsiom and Zhang, Peizhao and Tsai, Sam and Kohler, Jonas and others},
  journal={arXiv preprint arXiv:2312.03209},
  year={2023}
}