Visual tokenization via auto-encoding is a critical component of state-of-the-art image and video generation, yet tokenizers have received far less attention than generators in scaling efforts. To address this gap, we introduce the Vision Transformer Tokenizer (ViTok), a Vision Transformer-based auto-encoder enhanced with the Llama architecture and trained on large-scale datasets. Our study systematically explores scaling the bottleneck, encoder, and decoder sizes. We find that increasing the bottleneck size improves reconstruction but degrades generative performance when it becomes too large. Scaling the encoder yields no significant benefits for reconstruction and actively hinders downstream generation tasks, while scaling the decoder enhances reconstruction quality but has limited impact on generative performance. These findings suggest that scaling within the current auto-encoding paradigm offers limited benefits. However, we observe that the decoder behaves as a conditional generative model, balancing trade-offs between reconstruction and generative loss functions. Additionally, we find that videos are inherently more compressible than images at equivalent compression rates, presenting unique opportunities for future research. Through our scaling analysis, ViTok achieves competitive performance in image and video reconstruction across benchmarks like ImageNet-1K, COCO, and UCF-101, while reducing computational costs by 2–5× compared to prior methods. When integrated with Diffusion Transformers, ViTok sets new state-of-the-art benchmarks for class-conditional video generation, demonstrating its potential as a scalable and efficient visual tokenizer. More updates coming soon; star/watch the GitHub repository to stay posted!
We showcase our ViTok architecture and key findings from scaling auto-encoders for image and video reconstruction and generation below. We enhance traditional CNN-based auto-encoders by integrating Vision Transformers (ViTs) with an upgraded Llama architecture into an asymmetric auto-encoder framework, forming the Vision Transformer Tokenizer (ViTok). Visual inputs are embedded as patches or tubelets, processed by a compact Llama Encoder, and bottlenecked to create a latent code. The encoded representation is then upsampled and handled by a larger Llama Decoder to reconstruct the input. Color-coded text boxes highlight the effects of scaling the encoder, adjusting the bottleneck size, and expanding the decoder. Additionally, we discuss trade-offs in loss optimization and the model's adaptability to video data. Our best-performing ViTok variant achieves competitive performance with prior state-of-the-art tokenizers while reducing the computational burden. Below we present our findings in more detail, together with the related figures. Please refer to our paper for a comprehensive analysis and additional results. High-resolution tokenizer weights and more details are coming soon!
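To make the asymmetric encoder–bottleneck–decoder layout concrete, here is a minimal PyTorch sketch of the design described above. It is a simplified stand-in under our own assumptions: plain `nn.TransformerEncoder` blocks replace the upgraded Llama blocks, positional information and the tubelet (video) path are omitted, and all sizes are illustrative rather than the released configurations.

```python
import torch
import torch.nn as nn

class ViTokSketch(nn.Module):
    """Asymmetric ViT auto-encoder: small encoder -> low-dim bottleneck -> larger decoder.

    Hypothetical simplification of ViTok (the real model uses upgraded Llama-style blocks).
    """

    def __init__(self, patch=16, enc_dim=384, dec_dim=768, enc_depth=6, dec_depth=12, c=16):
        super().__init__()
        self.patch = patch
        # Patchify: each patch becomes one token (a Conv3d over tubelets would handle video).
        self.embed = nn.Conv2d(3, enc_dim, kernel_size=patch, stride=patch)
        enc_layer = nn.TransformerEncoderLayer(enc_dim, nhead=6, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=enc_depth)   # compact encoder
        self.to_code = nn.Linear(enc_dim, c)     # bottleneck: c floats per token, so E = L * c
        self.from_code = nn.Linear(c, dec_dim)   # upsample the code width for the decoder
        dec_layer = nn.TransformerEncoderLayer(dec_dim, nhead=12, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=dec_depth)   # larger decoder
        self.to_pixels = nn.Linear(dec_dim, patch * patch * 3)                  # unpatchify head

    def forward(self, x):                                     # x: (B, 3, H, W)
        B, _, H, W = x.shape
        p, hp, wp = self.patch, H // self.patch, W // self.patch
        tokens = self.embed(x).flatten(2).transpose(1, 2)     # (B, L, enc_dim), L = hp * wp
        code = self.to_code(self.encoder(tokens))             # latent code: (B, L, c)
        hidden = self.decoder(self.from_code(code))           # decode from the bottleneck
        patches = self.to_pixels(hidden)                      # (B, L, p * p * 3)
        recon = (patches.view(B, hp, wp, p, p, 3)
                        .permute(0, 5, 1, 3, 2, 4)
                        .reshape(B, 3, H, W))
        return recon, code
```

The asymmetry is visible in the sizes: the encoder only needs enough capacity to produce a good code, while most of the parameters sit in the decoder that maps the code back to pixels.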
Finding 1. Regardless of the code shape or the FLOPs expended in auto-encoding, the total number of floating points in the latent code (E) is the bottleneck that best predicts visual reconstruction performance.
Finding 2. In generative tasks, scaling the number of floating points in the code (E) does not consistently improve generative performance. Instead, optimal results come from tuning both E and the latent channel size c to balance reconstruction and generation: a low E limits reconstruction quality, while a high E and channel size c hinder the convergence and performance of the downstream generative model.
Finding 3. Scaling the encoder provides no benefits for reconstruction performance and can potentially worsen generation results.
Finding 4. While scaling the decoder can enhance reconstruction performance, it provides limited benefits for generation tasks.
Finding 5. There is a trade-off between rSSIM/rPSNR and rFID/rIS, influenced by the choice of loss weights and objectives (including perceptual and GAN losses). Consequently, the decoder can be viewed as a conditional generation model, which effectively extends the main generator.
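Finding 5 is easiest to see in the training objective itself. The sketch below is a generic composite reconstruction loss of the kind referenced above (pixel, perceptual, and GAN terms); the specific terms and weights are placeholders of ours, not ViTok's exact recipe. Shifting weight toward the pixel term favors rPSNR/rSSIM, while shifting it toward the perceptual and adversarial terms favors rFID/rIS and pushes the decoder toward behaving like a conditional generator.

```python
import torch.nn.functional as F

def tokenizer_loss(x, x_hat, lpips_fn, discriminator, w_perceptual=1.0, w_gan=0.1):
    """Illustrative composite auto-encoder objective (weights are placeholders).

    lpips_fn: a perceptual metric network (e.g. LPIPS); discriminator: a GAN critic.
    """
    pixel = F.l1_loss(x_hat, x)                  # pixel fidelity -> drives rPSNR / rSSIM
    perceptual = lpips_fn(x_hat, x).mean()       # perceptual similarity -> drives rFID / rIS
    adversarial = -discriminator(x_hat).mean()   # generic generator-side GAN term
    return pixel + w_perceptual * perceptual + w_gan * adversarial
```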
Finding 6. Videos exhibit the same reconstruction bottleneck characteristics with respect to E as images do. However, auto-encoding takes advantage of the inherent compressibility of videos, allowing E to scale more effectively relative to the total number of pixels than it does for images.
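To make the accounting behind Findings 1, 2, and 6 concrete: the latent code is a grid of tokens with c channels each, so E is simply the token count times c. The toy calculation below (our own, with an assumed c = 16) compares a 256p image with 16×16 patches to a 16-frame 128p clip with 4×8×8 tubelets, shapes consistent with the S-B/16 and 1024-token S-B/4x8 rows in the tables below. Both land at the same 48 pixels per latent float, and the point of Finding 6 is that the video case is the easier one to reconstruct at that budget because of temporal redundancy.

```python
def latent_floats(frames, height, width, t_stride, patch, c):
    """E = (#latent tokens) * c for a clip of shape (frames, height, width, 3).

    An image is the frames == 1, t_stride == 1 special case.
    """
    tokens = (frames // t_stride) * (height // patch) * (width // patch)
    return tokens * c, tokens

C = 16  # assumed channels per latent token; the paper sweeps this value

img_E, img_tokens = latent_floats(1, 256, 256, t_stride=1, patch=16, c=C)   # S-B/16 at 256p
vid_E, vid_tokens = latent_floats(16, 128, 128, t_stride=4, patch=8, c=C)   # S-B/4x8, 16x128x128

for name, E, tokens, pixels in [
    ("image", img_E, img_tokens, 1 * 256 * 256 * 3),
    ("video", vid_E, vid_tokens, 16 * 128 * 128 * 3),
]:
    print(f"{name}: tokens={tokens:5d}  E={E:6d}  pixels per latent float={pixels / E:.1f}")
# image: tokens=  256  E=  4096  pixels per latent float=48.0
# video: tokens= 1024  E= 16384  pixels per latent float=48.0
```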
Table 1. 256p image reconstruction on ImageNet-1K and COCO.

Name | Params (M) | GFLOPs | ImageNet rFID↓ | ImageNet PSNR↑ | ImageNet SSIM↑ | COCO rFID↓ | COCO PSNR↑ | COCO SSIM↑ |
---|---|---|---|---|---|---|---|---|
SD-VAE | 59.3 | 162.2 | 0.78 | 25.08 | 0.705 | 4.63 | 24.82 | 0.720 |
SDXL-VAE | - | - | 0.68 | 26.04 | 0.834 | 4.07 | 25.76 | 0.845 |
OAI | - | - | 0.81 | 24.43 | 0.786 | 4.59 | 24.19 | 0.800 |
Cosmos-CI | - | - | 2.02 | 31.74 | 0.700 | 5.6 | 31.74 | 0.703 |
ViTok S-B/16 | 129.0 | 34.8 | 0.50 | 24.36 | 0.747 | 3.94 | 24.45 | 0.759 |
ViTok S-L/16 | 426.8 | 113.4 | 0.46 | 24.74 | 0.758 | 3.87 | 24.82 | 0.771 |
Table 2. 512p image reconstruction on ImageNet-1K and COCO.

Name | Params (M) | GFLOPs | ImageNet rFID↓ | ImageNet PSNR↑ | ImageNet SSIM↑ | COCO rFID↓ | COCO PSNR↑ | COCO SSIM↑ |
---|---|---|---|---|---|---|---|---|
SD-VAE | 59.3 | 653.8 | 0.19 | - | - | - | - | - |
ViTok S-B/16 | 129.0 | 160.8 | 0.18 | 26.72 | 0.803 | 2.00 | 26.14 | 0.790 |
Table 3. Video reconstruction on UCF-101.

Method | Params (M) | GFLOPs | # Tokens | rFID↓ | rFVD↓ | PSNR↑ | SSIM↑ |
---|---|---|---|---|---|---|---|
TATS | 32 | Unk | 2048 | - | 162 | - | - |
MAGViT | 158 | Unk | 1280 | - | 25 | 22.0 | 0.701 |
MAGViTv2 | 158 | Unk | 1280 | - | 16.12 | - | - |
LARP-L-Long | 174 | 505.3 | 1024 | - | 20 | - | - |
ViTok S-B/4x8 | 129 | 160.8 | 1024 | 2.13 | 8.04 | 30.11 | 0.923 |
ViTok S-B/8x8 | 129 | 73.2 | 512 | 2.78 | 20.05 | 28.55 | 0.898 |
ViTok S-B/4x16 | 129 | 34.8 | 256 | 4.46 | 53.98 | 26.26 | 0.850 |