Learnings from Scaling Visual Tokenizers for Reconstruction and Generation

by Philippe Hansen-Estruch, David Yan, Ching-Yao Chung, Orr Zohar, Jialiang Wang, Tingbo Hou, Tao Xu, Sriram Vishwanath, Peter Vajda, Xinlei Chen

Visual tokenization via auto-encoding is a critical component of state-of-the-art image and video generation, yet tokenizers have received far less attention than generators in scaling efforts. To address this gap, we introduce the Vision Transformer Tokenizer, or ViTok: a Vision Transformer-based auto-encoder enhanced with the Llama architecture and trained on large-scale datasets. Our study systematically explores scaling the bottleneck, encoder, and decoder sizes. We find that increasing the bottleneck size improves reconstruction but degrades generative performance when it becomes too large. Scaling the encoder yields no significant benefits for reconstruction and actively hinders downstream generation tasks, while scaling the decoder enhances reconstruction quality but has limited impact on generative performance. These findings suggest that scaling within the current auto-encoding paradigm offers limited benefits. However, we observe that the decoder behaves as a conditional generative model, balancing trade-offs between reconstruction and generative loss functions. Additionally, we find that videos are inherently more compressible than images at equivalent compression rates, presenting unique opportunities for future research. Through our scaling analysis, ViTok achieves competitive performance in image and video reconstruction across benchmarks like ImageNet-1K, COCO, and UCF-101, while reducing computational costs by 2–5× compared to prior methods. When integrated with Diffusion Transformers, ViTok sets new state-of-the-art benchmarks for class-conditional video generation, demonstrating its potential as a scalable and efficient visual tokenizer. More updates coming soon; star/watch the GitHub repo to stay posted!

ViTok Main Figure

We showcase our ViTok architecture and key findings from scaling auto-encoders for image and video reconstruction and generation below. We enhance traditional CNN-based auto-encoders by integrating Vision Transformers (ViTs) with an upgraded Llama architecture into an asymmetric auto-encoder framework, forming the Vision Transformer Tokenizer, or ViTok. Visual inputs are embedded as patches or tubelets, processed by a compact Llama encoder, and bottlenecked to create a latent code. The encoded representation is then upsampled and handled by a larger Llama decoder to reconstruct the input. Color-coded text boxes highlight the effects of scaling the encoder, adjusting the bottleneck size, and expanding the decoder. Additionally, we discuss trade-offs in loss optimization and the model's adaptability to video data. Our best-performing ViTok variant achieves competitive performance with prior state-of-the-art tokenizers while reducing computational burden. Below we present our findings in more detail alongside the related figures. Please refer to our paper for a comprehensive analysis and additional results. High-resolution tokenizer weights + more details coming soon!
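To make the asymmetric encoder/bottleneck/decoder layout concrete, here is a minimal PyTorch sketch of an auto-encoder of this shape (small encoder, low-channel bottleneck, larger decoder). It is our own illustration, not the ViTok implementation: the class name, layer sizes, and the use of standard `nn.TransformerEncoder` blocks in place of Llama-style blocks (RoPE, SwiGLU, RMSNorm) are all simplifications.

```python
# Minimal sketch of an asymmetric ViT auto-encoder (not the official ViTok code).
import torch
import torch.nn as nn


class AsymmetricViTAutoEncoder(nn.Module):
    def __init__(self, img_size=256, patch=16, c=16,
                 enc_dim=384, enc_depth=6,     # compact encoder (sizes assumed)
                 dec_dim=768, dec_depth=12):   # larger decoder (sizes assumed)
        super().__init__()
        self.patch = patch
        n_tokens = (img_size // patch) ** 2

        # Patch embedding: image -> sequence of tokens.
        self.patchify = nn.Conv2d(3, enc_dim, kernel_size=patch, stride=patch)
        self.enc_pos = nn.Parameter(torch.zeros(1, n_tokens, enc_dim))
        enc_layer = nn.TransformerEncoderLayer(enc_dim, nhead=6, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, enc_depth)

        # Bottleneck: project each token down to channel width c,
        # so the latent code holds E = n_tokens * c floating points.
        self.to_code = nn.Linear(enc_dim, c)

        # Decoder: upsample the code back to a wide token dimension.
        self.from_code = nn.Linear(c, dec_dim)
        self.dec_pos = nn.Parameter(torch.zeros(1, n_tokens, dec_dim))
        dec_layer = nn.TransformerEncoderLayer(dec_dim, nhead=12, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, dec_depth)
        self.to_pixels = nn.Linear(dec_dim, patch * patch * 3)

    def forward(self, x):
        B, _, H, W = x.shape
        tokens = self.patchify(x).flatten(2).transpose(1, 2)    # (B, N, enc_dim)
        z = self.to_code(self.encoder(tokens + self.enc_pos))   # (B, N, c) latent code
        h = self.decoder(self.from_code(z) + self.dec_pos)      # (B, N, dec_dim)
        patches = self.to_pixels(h)                             # (B, N, p*p*3)
        # Un-patchify back to an image.
        p, n = self.patch, H // self.patch
        img = patches.view(B, n, n, p, p, 3).permute(0, 5, 1, 3, 2, 4)
        return img.reshape(B, 3, H, W), z


x = torch.randn(2, 3, 256, 256)
recon, code = AsymmetricViTAutoEncoder()(x)
print(recon.shape, code.shape)  # (2, 3, 256, 256), (2, 256, 16)
```

The asymmetry is the key design choice: the encoder only needs to produce a compact code, while the decoder carries most of the capacity needed to map that code back to pixels.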

Findings

Finding 1. Regardless of code shape or FLOPs expended in auto-encoding, the total number of floating points in the latent code (E) is the most predictive bottleneck for visual reconstruction performance.

256p Image Reconstruction Results
256p image reconstruction sweep over floating points E. We evaluate ViTok S-B trained with the stage 1 objective using combinations of patch sizes p = {8, 16, 32} and channel widths c = {4, 8, 16, 32, 64} to investigate how the total number of floating points E influences FID, IS, SSIM, and PSNR in reconstruction tasks. Our findings reveal a strong correlation between log(E) and each of log(rFID), rIS, rSSIM, and rPSNR. This indicates that E is the primary bottleneck for reconstruction, irrespective of the code shape or FLOPs expended. Similar trends are observed across the ImageNet-1K and COCO datasets, indicating that these patterns hold regardless of the dataset used.
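For reference, E in this sweep is just the number of latent tokens times the channel width: E = (256 / p)² · c for a 256p image. A quick helper (ours, for illustration) makes the equivalences in the sweep explicit:

```python
# Back-of-the-envelope bookkeeping for the sweep above (helper is ours):
# a 256p image split into p x p patches with c channels per latent token
# yields E = (256 / p)^2 * c floating points in the code.
def latent_floats(side, p, c):
    return (side // p) ** 2 * c

for p in (8, 16, 32):
    for c in (4, 8, 16, 32, 64):
        print(f"p={p:2d}, c={c:2d} -> E={latent_floats(256, p, c):6d}")

# e.g. p=16, c=16 and p=8, c=4 both give E = 4096; Finding 1 says such
# configurations reconstruct comparably despite their different code shapes.
```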

Finding 2. In generative tasks, scaling the number of floating points in the code (E) does not consistently improve generative performance. Instead, optimal results are achieved by tuning both E and c to balance reconstruction and generation capabilities. A low E limits reconstruction quality, while high E and channel size c hinder the convergence and performance of the generative model.

256p Image Generation Results
256p image generation over E. We evaluate each tokenizer from our prior sweep with DiT. Results for CFG scales of 1.5 and 3.0 are shown in the left two and right two plots respectively. Our results show no strong linear correlation between log(E) and generation performance. Instead, a second-order trend reveals an optimal E for each patch size, indicating a complex interplay between E and the channel size c. This highlights the necessity of optimizing both parameters to balance reconstruction quality with generative capabilities.
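For readers unfamiliar with the setup, the generation experiments train a DiT-style denoiser on the frozen tokenizer's latent codes and decode its samples back to pixels. The sketch below is a hedged approximation of that training step; `tokenizer` and `denoiser` are placeholders, and the interpolation-style corruption and clean-code prediction target are our simplifications, not the paper's exact diffusion formulation or noise schedule.

```python
# Hedged sketch of training a latent denoiser on frozen tokenizer codes.
import torch

def latent_denoising_loss(tokenizer, denoiser, images, labels):
    with torch.no_grad():
        _, z = tokenizer(images)                     # (B, N, c) latent code
    t = torch.rand(z.shape[0], device=z.device)      # per-sample noise level in [0, 1)
    noise = torch.randn_like(z)
    z_t = (1 - t[:, None, None]) * z + t[:, None, None] * noise  # corrupted code
    pred = denoiser(z_t, t, labels)                  # class-conditional prediction
    return ((pred - z) ** 2).mean()                  # regress the clean code
```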

Finding 3. Scaling the encoder provides no benefits for reconstruction performance and can potentially worsen generation results.

Finding 4. While scaling the decoder can enhance reconstruction performance, it provides limited benefits for generation tasks.

256p Encoder Scaling on Image Reconstruction
Encoder scaling on 256p image reconstruction. We evaluate reconstruction metrics of ViTok trained with stage 1 over model sizes S-S, B-S, S-B, B-B, B-L, and L-L (encoder-decoder). There is no correlation between encoder size and reconstruction performance, indicating that scaling the encoder does not improve reconstruction capabilities. This argues that visual encoding does not require much computation.
256p Decoder Scaling on Image Reconstruction
Decoder scaling on 256p image reconstruction. Using the results from before, we plot various decoder sizes (S, B, L) against reconstruction performance. There is a strong correlation between decoder size and reconstruction performance, indicating that scaling the decoder improves reconstruction. However, increasing the decoder size from Base to Large does not provide the same performance boost as doubling E from 4096 to 8192.
256p Encoder Scaling on Image Generation
Encoder scaling on 256p image generation. We evaluate each tokenizer from before with DiT. We plot encoder size against generation metrics for CFG scales of 1.5 and 3.0 in the left two and right two plots respectively. There is a weak negative correlation between encoder size and final performance, indicating that scaling the encoder is harmful for generation results. This is compounded by the fact that larger encoders also slow training due to their added computational overhead.
256p Decoder Scaling on Image Generation
Decoder scaling on 256p image generation. Using the results from before, we plot various decoder sizes (S, B, L) against generation metrics for CFG scales of 1.5 and 3.0 in the left two and right two plots respectively. Unlike reconstruction, there is no clear correlation between decoder size and generation performance. This indicates that scaling the decoder has minimal overall benefit for auto-encoding.

Finding 5. There is a trade-off between rSSIM/rPSNR and rFID/rIS, influenced by the choice of loss weights and objectives (including perceptual and GAN losses). Consequently, the decoder can be viewed as a conditional generation model, which effectively extends the main generator.

Metric Trade-offs in 256p Image Reconstruction
Metric trade-offs in 256p image reconstruction. We train ViTok S-B/16 with stage 1, varying the LPIPS (LP in the figure) weight and using either an L1 or L2 (MSE) reconstruction loss. Additionally, we fine-tune ViTok S-B/16 with stage 2 and include the result as L2+LP+GAN. The results indicate that improving rFID/rIS scores through stronger perceptual and GAN losses requires a trade-off with rSSIM/rPSNR, i.e. a loss of information from the original image. This underscores the decoder's role as a generative component.
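As a concrete illustration of the loss mix described above, here is a hedged sketch of a stage-1 style objective: a pixel term (L1 or L2) plus an LPIPS perceptual term with a tunable weight; stage 2 would add a GAN term on top, which we omit. The default weight, the choice of the `lpips` package, and its VGG backbone are our assumptions, not the paper's exact configuration.

```python
# Hedged sketch of a stage-1 style reconstruction objective (not the paper's exact setup).
import torch.nn.functional as F
import lpips

perceptual = lpips.LPIPS(net="vgg").eval()   # LPIPS expects inputs roughly in [-1, 1]

def stage1_loss(recon, target, lpips_weight=1.0, use_l1=False):
    pixel = F.l1_loss(recon, target) if use_l1 else F.mse_loss(recon, target)
    percep = perceptual(recon, target).mean()
    return pixel + lpips_weight * percep
```

Raising `lpips_weight` (or adding the GAN term) is the knob that trades pixel-faithful metrics (rSSIM/rPSNR) for perceptual ones (rFID/rIS) in the figure above.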

Finding 6. Videos exhibit the same reconstruction bottleneck characteristics with respect to E as images do. However, auto-encoding takes advantage of the inherent compressibility of videos, enabling E to scale more effectively relative to the total number of pixels than images.

256p Video Reconstruction Results Detailed Over E
256p video reconstruction results over E. We train ViTok S-B with stage 1 on 16-frame 256p videos at 8 fps, varying tubelet patch sizes and temporal strides. Reconstruction performance is evaluated using rFID per frame, rFVD, rSSIM, and rPSNR on the Kinetics-700 validation, UCF-101 training, and Shutterstock validation sets. The results exhibit a trend similar to image reconstruction, with a strong correlation between E and reconstruction performance. As expected, videos are more compressible than a direct scaling from images would suggest.
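The E bookkeeping carries over to video: with tubelets of temporal stride q and spatial patch size p over T frames, E = (T / q) · (H / p) · (W / p) · c. A small helper (ours, with an illustrative configuration) shows the arithmetic for the 16-frame 256p setting:

```python
# Extending the E bookkeeping to video (helper and example configuration are ours).
def video_latent_floats(T, side, q, p, c):
    return (T // q) * (side // p) ** 2 * c

# A 16-frame 256p clip with 4x16 tubelets (temporal stride 4, spatial patch 16)
# and c = 16 channels per latent token:
print(video_latent_floats(16, 256, 4, 16, 16))   # 16384 floats
# Tokenizing each frame independently at E = 4096 would cost 16 * 4096 = 65536
# floats for the clip, i.e. 4x more than the video code above.
```

Finding 6 is that video reconstruction holds up far better under this kind of per-pixel reduction in E than a naive extrapolation from the image results would predict.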

Reconstruction Results

| Name | Params (M) | GFLOPs | ImageNet rFID↓ | ImageNet PSNR↑ | ImageNet SSIM↑ | COCO rFID↓ | COCO PSNR↑ | COCO SSIM↑ |
|------|-----------|--------|----------------|----------------|----------------|------------|------------|------------|
| SD-VAE | 59.3 | 162.2 | 0.78 | 25.08 | 0.705 | 4.63 | 24.82 | 0.720 |
| SDXL-VAE | - | - | 0.68 | 26.04 | 0.834 | 4.07 | 25.76 | 0.845 |
| OAI | - | - | 0.81 | 24.43 | 0.786 | 4.59 | 24.19 | 0.800 |
| Cosmos-CI | - | - | 2.02 | 31.74 | 0.700 | 5.60 | 31.74 | 0.703 |
| ViTok S-B/16 | 129.0 | 34.8 | 0.50 | 24.36 | 0.747 | 3.94 | 24.45 | 0.759 |
| ViTok S-L/16 | 426.8 | 113.4 | 0.46 | 24.74 | 0.758 | 3.87 | 24.82 | 0.771 |
256p image reconstruction comparison. We assess the reconstruction performance of ViTok on the 256p ImageNet-1K and COCO-2017 validation sets, benchmarking it against CNN-based tokenizers with an equivalent compression ratio (16× spatial compression). Our ViTok S-B/16 tokenizer achieves state-of-the-art (SOTA) rFID scores on both ImageNet-1K and COCO, outperforming other CNN-based continuous tokenizers while using significantly fewer FLOPs. Furthermore, ViTok remains competitive in SSIM and PSNR compared to prior methods. Scaling the decoder to Large further improves all of ViTok's reconstruction numbers.
| Name | Params (M) | GFLOPs | ImageNet rFID↓ | ImageNet PSNR↑ | ImageNet SSIM↑ | COCO rFID↓ | COCO PSNR↑ | COCO SSIM↑ |
|------|-----------|--------|----------------|----------------|----------------|------------|------------|------------|
| SD-VAE | 59.3 | 653.8 | 0.19 | - | - | - | - | - |
| ViTok S-B/16 | 129.0 | 160.8 | 0.18 | 26.72 | 0.803 | 2.00 | 26.14 | 0.790 |
512p image reconstruction comparison. We assess the reconstruction performance of our top-performing tokenizers on the 512p ImageNet-1K and COCO-2017 validation sets, benchmarking them against a CNN-based tokenizer with an equivalent compression ratio (16× spatial compression). Our ViTok S-B/16 tokenizer maintains state-of-the-art (SOTA) results across all metrics while significantly reducing FLOPs.
| Method | Params (M) | GFLOPs | # Tokens | rFID↓ | rFVD↓ | PSNR↑ | SSIM↑ |
|--------|-----------|--------|----------|-------|-------|-------|-------|
| TATS | 32 | Unk | 2048 | - | 162 | - | - |
| MAGViT | 158 | Unk | 1280 | - | 25 | 22.0 | 0.701 |
| MAGViTv2 | 158 | Unk | 1280 | - | 16.12 | - | - |
| LARP-L-Long | 174 | 505.3 | 1024 | - | 20 | - | - |
| ViTok S-B/4x8 | 129 | 160.8 | 1024 | 2.13 | 8.04 | 30.11 | 0.923 |
| ViTok S-B/8x8 | 129 | 73.2 | 512 | 2.78 | 20.05 | 28.55 | 0.898 |
| ViTok S-B/4x16 | 129 | 34.8 | 256 | 4.46 | 53.98 | 26.26 | 0.850 |
128p video reconstruction. We evaluate ViTok S-B/4x8, S-B/8x8, and S-B/4x16 on 16-frame 128p video reconstruction using the UCF-101 11k training set. ViTok S-B/4x8 achieves SOTA performance in rFVD along with strong compression statistics. ViTok S-B/8x8 and ViTok S-B/4x16 also deliver competitive reconstruction numbers at their higher compression rates. ViTok further reduces the total FLOPs required compared to prior transformer-based methods.