Diffusion models have recently revolutionized the field of image synthesis due to their ability to generate photorealistic images. However, one of the major drawbacks of diffusion models is that the image generation process is costly: a large image-to-image network has to be applied many times to iteratively refine an image from random noise. While many recent works propose techniques to reduce the number of required steps, they generally treat the underlying denoising network as a black box. In this work, we investigate the behavior of the layers within the network and find that 1) the layers' output changes smoothly over time, 2) the layers show distinct patterns of change, and 3) the change from step to step is often very small. We hypothesize that many layer computations in the denoising network are redundant. Leveraging this, we introduce Block Caching, in which we reuse outputs from layer blocks of previous steps to speed up inference. Furthermore, we propose a technique to automatically determine caching schedules based on each block's changes over timesteps. In our experiments, we show through FID, human evaluation, and qualitative analysis that Block Caching allows us to generate images with higher visual quality at the same computational cost. We demonstrate this for different state-of-the-art models (LDM and EMU) and solvers (DDIM and DPM).
We observe that in diffusion models, not only the intermediate results x but also the internal feature maps change smoothly over time. (a) We visualize output feature maps of two layer blocks within the denoising network via PCA. Structures change smoothly, but at different rates. (b) We also observe this smooth layer-wise change when plotting the change in output from one step to the next, averaged over many different prompts and randomly initialized noise. Besides the average, we also show the standard deviation as a shaded area. The patterns always remain the same. (Configuration: LDM-512, DPM, 20 steps.)
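For reference, the layer-wise change plotted in (b) can be measured as a relative difference between a block's outputs at consecutive steps. The following is a minimal sketch in PyTorch; the function name and exact normalization are assumptions for illustration, not necessarily the precise metric used in the paper.

import torch

def relative_change(prev: torch.Tensor, curr: torch.Tensor, eps: float = 1e-8) -> float:
    # Relative L1 change of a block's output feature map between two
    # consecutive denoising steps. Averaging this value over many prompts
    # and noise seeds yields curves like those shown in panel (b).
    return ((curr - prev).abs().mean() / (prev.abs().mean() + eps)).item()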
We make three key observations: 1) a block's output changes smoothly over the denoising process, 2) different blocks show distinct patterns of change, and 3) the change from one step to the next is often very small.
We hypothesize that many layer blocks perform redundant computations during steps in which their outputs change very little. To reduce these redundant computations and speed up inference, we propose Block Caching.
Not every block should be cached all the time. To make a more informed decision about when and where to cache, we rely on the change metrics visualized above. Our intuition is that for any layer block i, we retain a cached value, computed at time step t_a, as long as the accumulated change since t_a does not exceed a certain threshold δ. Once the threshold is exceeded at time step t_b, we recompute the block's output.
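This threshold criterion translates directly into a per-block caching schedule. Below is a minimal sketch, assuming changes[t] holds the measured relative change of the block's output from step t-1 to step t (e.g. averaged over prompts as above); the function name and details are illustrative, not the authors' exact implementation.

def caching_schedule(changes, delta):
    # Decide, for each denoising step, whether to recompute the block (True)
    # or reuse its cached output (False). We recompute whenever the change
    # accumulated since the last recomputation exceeds the threshold delta.
    recompute = [True]          # the first step always runs the block
    accumulated = 0.0
    for change in changes[1:]:
        accumulated += change
        if accumulated > delta:
            recompute.append(True)   # threshold exceeded -> refresh the cache
            accumulated = 0.0
        else:
            recompute.append(False)  # reuse the cached output
    return recompute

For example, caching_schedule([0.0, 0.1, 0.05, 0.4], delta=0.3) returns [True, False, False, True].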
To enable the model to adjust to using cached values, we introduce a lightweight scale-shift adjustment mechanism wherever we apply caching. To this end, we add timestep-dependent scalar scale and shift parameters for each layer that receives a cached input.
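A minimal sketch of such an adjustment for a single layer is given below, assuming one scalar scale and shift per timestep, initialized to the identity; the class name and parameterization are illustrative assumptions rather than the authors' exact implementation.

import torch
import torch.nn as nn

class ScaleShiftAdjustment(nn.Module):
    # Timestep-dependent scalar scale/shift applied to a cached feature map
    # before it is fed to the layer that would normally consume a freshly
    # computed input.
    def __init__(self, num_steps: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(num_steps))   # identity init
        self.shift = nn.Parameter(torch.zeros(num_steps))

    def forward(self, cached: torch.Tensor, step: int) -> torch.Tensor:
        return cached * self.scale[step] + self.shift[step]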
Given a fixed computational budget, we can perform more denoising steps and obtain higher-quality results. Here, we compare EMU with our caching approach at 20 steps against the default setup at 14 steps. At identical inference speed, our caching technique produces finer details and more vibrant colors.
A magical portal opening to reveal a hidden realm of wonders.
A tranquil garden with cherry blossoms in full bloom under a full moon.
An ancient castle on a cliff overlooking a vast, mist-covered valley.
A yellow tiger with blue stripes.
A time-traveling wizard riding a mechanical steed through a portal, leaving trails of stardust in their wake.
A floating city in the clouds where airships navigate through tunnels of light, and majestic creatures soar in the skies.
We conduct a human evaluation study of the visual appeal of images generated with caching versus the baseline without caching. We always compare configurations with the same latency.
We show different configurations for the common LDM architecture. The caching configurations at 20 steps and the baseline configuration at 14 steps have the same latency. The baseline with 20 steps is about 1.5x slower. Our method often provides richer colors and finer details. Through our scale-shift adjustment, we avoid artifacts that are visible when naively applying block caching.
For different solvers, we test our caching technique against baselines with 1) the same number of steps or 2) the same latency. In all cases, our proposed approach achieves a significant speedup while improving visual quality, as measured by FID on a COCO subset with all faces removed (for privacy reasons). Legend: SS = scale-shift adjustment, Img/s. = images per second.
@article{wimbauer2023cache,
title={Cache Me if You Can: Accelerating Diffusion Models through Block Caching},
author={Wimbauer, Felix and Wu, Bichen and Schoenfeld, Edgar and Dai, Xiaoliang and Hou, Ji and He, Zijian and Sanakoyeu, Artsiom and Zhang, Peizhao and Tsai, Sam and Kohler, Jonas and others},
journal={arXiv preprint arXiv:2312.03209},
year={2023}
}