♻️ Cache Me if You Can: Accelerating Diffusion Models through Block Caching

CVPR 2024
1Meta GenAI, 2Technical University of Munich, 3MCML, 4University of Oxford
Work done during Felix's internship at Meta GenAI
[Teaser image]

Speeding up diffusion models through block caching. We observe that diffusion models perform many redundant layer computations across timesteps when generating an image. Our block caching technique avoids these unnecessary computations, thereby speeding up inference by 1.5x-1.8x while maintaining image quality. Compared to the standard practice of naively reducing the number of denoising steps to match our inference speed, our approach produces more detailed and vibrant results.

Abstract

Diffusion models have recently revolutionized the field of image synthesis due to their ability to generate photorealistic images. However, one of the major drawbacks of diffusion models is that the image generation process is costly. A large image-to-image network has to be applied many times to iteratively refine an image from random noise. While many recent works propose techniques to reduce the number of required steps, they generally treat the underlying denoising network as a black box. In this work, we investigate the behavior of the layers within the network and find that 1) the layers' output changes smoothly over time, 2) the layers show distinct patterns of change, and 3) the change from step to step is often very small. We hypothesize that many layer computations in the denoising network are redundant. Leveraging this, we introduce block caching, in which we reuse outputs from layer blocks of previous steps to speed up inference. Furthermore, we propose a technique to automatically determine caching schedules based on each block's changes over timesteps. In our experiments, we show through FID, human evaluation, and qualitative analysis that block caching allows us to generate images with higher visual quality at the same computational cost. We demonstrate this for different state-of-the-art models (LDM and EMU) and solvers (DDIM and DPM).

Method

Analysis



We observe that in diffusion models, not only the intermediate results x but also the internal feature maps change smoothly over time. (a) We visualize output feature maps of two layer blocks within the denoising network via PCA. Structures change smoothly, at different rates. (b) We also observe this smooth layer-wise change when plotting the change in output from one step to the next, averaged over many different prompts and randomly initialized noise. Besides the average, we show the standard deviation as a shaded area. The patterns remain consistent. (Configuration: LDM-512, DPM, 20 steps.)

We make three key observations:

  1. Smooth change over time. Similar to the intermediate images during denoising, the block outputs change smoothly and gradually over time. This suggests a clear temporal relation between the outputs of a block.
  2. Distinct patterns of change. The blocks do not behave uniformly over time. Rather, they change substantially during certain periods of the denoising process while remaining nearly inactive in others. The standard deviation shows that this behavior is consistent across different images and random seeds. Note that some blocks, for example those at higher resolutions (very early or very late in the network), change most in the last 20% of the steps, while deeper blocks at lower resolutions change more at the beginning.
  3. Small step-to-step difference. Almost every block exhibits extended periods during the denoising process in which its output changes only very little (see the sketch after this list for one way to quantify this).
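
Such a change metric can be computed per block by comparing its outputs at consecutive solver steps. Below is a minimal sketch assuming a PyTorch U-Net whose blocks return plain tensors; the hook-based tracking, the `unet(latent, t)` call signature, and the relative-L1 metric are illustrative assumptions rather than the paper's exact implementation.

```python
import torch

def relative_change(prev: torch.Tensor, curr: torch.Tensor) -> float:
    # Relative L1 change between a block's outputs at consecutive steps.
    return ((curr - prev).abs().mean() / curr.abs().mean().clamp(min=1e-8)).item()

@torch.no_grad()
def track_block_changes(unet, blocks, trajectory):
    """Record each block's step-to-step output change over one denoising run.
    `blocks` maps a name to a module inside `unet`; `trajectory` is the list
    of (latent, timestep) pairs visited by the solver."""
    outputs = {name: [] for name in blocks}
    hooks = [
        module.register_forward_hook(
            # assumes the block returns a single tensor
            lambda mod, inp, out, name=name: outputs[name].append(out.detach())
        )
        for name, module in blocks.items()
    ]
    for latent, t in trajectory:
        unet(latent, t)  # hypothetical call signature
    for h in hooks:
        h.remove()
    # result[name][k] = change of the block's output from step k to step k+1
    return {
        name: [relative_change(a, b) for a, b in zip(feats, feats[1:])]
        for name, feats in outputs.items()
    }
```

Averaging these curves over many prompts and seeds yields plots like the ones shown above.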

Block Caching

We hypothesize that many layer blocks perform redundant computations during steps in which their outputs change very little. To reduce these redundant computations and speed up inference, we propose Block Caching, sketched below.
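
As a minimal sketch of the idea (the wrapper interface and step-indexed schedule below are hypothetical, not the paper's implementation), a block can be wrapped so that it recomputes its output only at scheduled steps and otherwise returns its cached value:

```python
import torch.nn as nn

class CachedBlock(nn.Module):
    """Wraps one layer block of the denoising network; at steps not listed in
    the schedule, the block's previous output is reused instead of recomputed."""

    def __init__(self, block: nn.Module, recompute_steps: set):
        super().__init__()
        self.block = block
        self.recompute_steps = recompute_steps  # steps at which to run the block
        self.cache = None

    def forward(self, x, step: int):
        if self.cache is None or step in self.recompute_steps:
            self.cache = self.block(x)  # fresh computation, stored for reuse
        return self.cache  # cached output at all other steps
```

The cache must be reset between generated images, and each block gets its own recomputation schedule, derived automatically as described next.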

Automatic Cache Schedule

Not every block should be cached all the time. To make a more informed decision about when and where to cache, we rely on the change metrics visualized above. Our intuition is that for any layer block i, we retain a cached value, computed at time step t_a, as long as the accumulated change does not exceed a certain threshold δ. Once the threshold is exceeded at time step t_b, we recompute the block's output.
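
Deriving such a schedule from the measured change curves could look like the following sketch; the accumulation rule follows the description above, while the function name and input format are assumptions.

```python
def cache_schedule(changes, delta):
    """Given one block's per-step changes (changes[k] = change from step k to
    step k+1, e.g. as measured above), return the steps at which the block
    must be recomputed. A value cached at step t_a is kept until the change
    accumulated since t_a exceeds delta; that step becomes t_b."""
    recompute = {0}      # the first step always computes the block
    accumulated = 0.0
    for k, change in enumerate(changes):
        accumulated += change
        if accumulated > delta:
            recompute.add(k + 1)  # threshold exceeded: recompute at this step
            accumulated = 0.0
    return recompute
```

Sharing the same δ across all blocks turns the schedule into a single knob that trades inference speed against fidelity.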

Scale Shift Adjustment

To enable the model to adjust to using cached values, we introduce a very lightweight scale-shift adjustment mechanism wherever we apply caching. To this end, we add a timestep-dependent scalar shift and scale parameter for each layer that receives a cached input.
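
One plausible minimal form of this adjustment is a per-step lookup table of scalars applied to the cached features; the parameterization below is an assumption, as the description above only specifies timestep-dependent scalar scale and shift parameters per affected layer.

```python
import torch
import torch.nn as nn

class ScaleShift(nn.Module):
    """Timestep-dependent scalar scale and shift applied to a cached block
    output, letting the network adjust to receiving cached features."""

    def __init__(self, num_steps: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(num_steps))   # one scalar per step
        self.shift = nn.Parameter(torch.zeros(num_steps))

    def forward(self, cached: torch.Tensor, step: int) -> torch.Tensor:
        return self.scale[step] * cached + self.shift[step]
```

Since only a handful of scalars are added per cached block, the adjustment is cheap to train and adds negligible inference cost.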

Results

EMU + Caching

Given a fixed computational budget, we can perform more denoising steps and obtain higher-quality results. Here, we compare EMU with our caching approach at 20 steps against the default setup at 14 steps. At identical inference speed, our caching technique produces finer details and more vibrant colors.

[Side-by-side image comparisons: Ours vs. Baseline]

  - A magical portal opening to reveal a hidden realm of wonders.
  - A tranquil garden with cherry blossoms in full bloom under a full moon.
  - An ancient castle on a cliff overlooking a vast, mist-covered valley.
  - A yellow tiger with blue stripes.
  - A time-traveling wizard riding a mechanical steed through a portal, leaving trails of stardust in their wake.
  - A floating city in the clouds where airships navigate through tunnels of light, and majestic creatures soar in the skies.

Quantitative Results

We conduct a human evaluation study on the visual appeal of images generated with either our caching configuration or the baseline without caching. We always compare configurations with the same latency.




LDM + Caching

We show different configurations for the common LDM architecture. The caching configurations at 20 steps and the baseline configuration at 14 steps have the same latency. The baseline with 20 steps is about 1.5x slower. Our method often provides richer colors and finer details. Through our scale-shift adjustment, we avoid artifacts that are visible when naively applying block caching.



Quantitative Results

For different solvers, we test our caching technique against baselines with 1) the same number of steps or 2) the same latency. In all cases, our proposed approach achieves a significant speedup while improving visual quality, as measured by FID on a COCO subset from which all faces were removed (for privacy reasons). Legend: SS = scale-shift adjustment, Img/s = images per second.

BibTeX

@article{wimbauer2023cache,
  title={Cache Me if You Can: Accelerating Diffusion Models through Block Caching},
  author={Wimbauer, Felix and Wu, Bichen and Schoenfeld, Edgar and Dai, Xiaoliang and Hou, Ji and He, Zijian and Sanakoyeu, Artsiom and Zhang, Peizhao and Tsai, Sam and Kohler, Jonas and others},
  journal={arXiv preprint arXiv:2312.03209},
  year={2023}
}