Convert image to video
Stable Video Diffusion online turns any image you like into a short video, whether you want to keep it as a memento or simply explore what the model can do.
- Competitive in Performance
- Stable Video Diffusion is released as two image-to-video models that generate 14 and 25 frames, respectively, at customizable frame rates between 3 and 30 frames per second. At the time of release, external evaluation found that these models, in their foundational form, surpass the leading closed models in user preference studies.
- Our Ever-Expanding Suite of AI Models
- Stable Video Diffusion is a proud addition to our diverse range of open-source models. Spanning across modalities including image, language, audio, 3D, and code, our portfolio is a testament to Stability AI’s dedication to amplifying human intelligence.
- Adaptable to Numerous Video Applications
- Our Stable Video Diffusion model can be easily adapted to various downstream tasks, including multi-view synthesis from a single image with finetuning on multi-view datasets. We are planning a variety of models that build on and extend this base, similar to the ecosystem that has been built around Stable Diffusion.
- Significant step
- Now available in research preview, this state-of-the-art generative AI video model represents a significant step in our journey toward creating models of every type for everyone.
Frequently asked questions
- What are the different variants of Stable Video Diffusion?
There are two variants: SVD and SVD-XT. SVD creates 576×1024 resolution videos with 14 frames, while SVD-XT extends the frame count to 25.
- What are the frame rates of Stable Video Diffusion models?
Both models, SVD and SVD-XT, can generate videos at frame rates ranging from 3 to 30 frames per second.
- What are the limitations of Stable Video Diffusion?
The model has difficulties generating videos without motion, cannot be controlled by text, struggles with rendering text legibly, and sometimes inaccurately generates faces and people.
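To make the frame counts and frame rates above concrete, here is a minimal Python sketch that works out how long a generated clip plays. It assumes the figures stated on this page (14 frames for SVD, 25 for SVD-XT, 3 to 30 frames per second). The names `FRAME_COUNTS` and `clip_duration_seconds` are hypothetical helpers for illustration, not part of any Stable Video Diffusion API.

```python
# Frame counts as stated on this page: SVD generates 14 frames, SVD-XT 25.
FRAME_COUNTS = {"SVD": 14, "SVD-XT": 25}

# Supported frame-rate range as stated on this page.
MIN_FPS, MAX_FPS = 3, 30


def clip_duration_seconds(variant: str, fps: int) -> float:
    """Return the playback length of one generated clip in seconds.

    Hypothetical helper: duration is simply frame count divided by frame rate.
    """
    if variant not in FRAME_COUNTS:
        raise ValueError(f"unknown variant: {variant!r}")
    if not MIN_FPS <= fps <= MAX_FPS:
        raise ValueError(f"fps must be between {MIN_FPS} and {MAX_FPS}, got {fps}")
    return FRAME_COUNTS[variant] / fps


if __name__ == "__main__":
    # Show the longest (lowest fps) and shortest (highest fps) clip per variant.
    for name in FRAME_COUNTS:
        longest = clip_duration_seconds(name, MIN_FPS)
        shortest = clip_duration_seconds(name, MAX_FPS)
        print(f"{name}: {shortest:.2f} s to {longest:.2f} s")
```

Under these assumptions, every clip is only a few seconds long: the lower the frame rate, the longer the same set of frames plays.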