ERNIE-ViLG 2.0 is a text-to-image model that offers better performance than DALL-E 2 and Stable Diffusion, two of the most popular text-to-image models currently available. The new model was designed and trained by a team of researchers at Baidu, and the results are breathtaking!
The results showed that ERNIE-ViLG 2.0 significantly outperforms DALL-E 2 and Stable Diffusion, a notable achievement that demonstrates the power of the ERNIE framework. The Metaverse Post team compared ERNIE-ViLG 2.0 with Stable Diffusion below:
These results strongly support the claim that ERNIE-ViLG 2.0 is a more effective text-to-image system than either DALL-E 2 or Stable Diffusion.
The U-Net architecture from Stable Diffusion is taken as a basis, with several changes (each is sketched in code below):
- Mixture of Denoising Experts: ten denoising networks are trained instead of just one, each responsible only for a certain range of diffusion steps.
- Textual knowledge: the words in the query are automatically reweighted, with keywords getting more weight.
- Visual knowledge: during training, objects are detected in intermediate generation results, and the loss function is up-weighted on the regions containing those objects.
As a result, the world's largest text-to-image model came out: 24 billion parameters, ten times bigger than Stable Diffusion.
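To make the Mixture of Denoising Experts idea concrete, here is a minimal PyTorch sketch of timestep-based expert routing. The expert count and contiguous bucketing follow the description above; the tiny conv stacks are toy stand-ins for the actual ERNIE-ViLG 2.0 denoising U-Nets, and all names here are hypothetical:

```python
import torch
import torch.nn as nn

class MixtureOfDenoisingExperts(nn.Module):
    """Routes each diffusion timestep to one of several denoising networks."""

    def __init__(self, num_experts=10, total_steps=1000):
        super().__init__()
        self.num_experts = num_experts
        self.total_steps = total_steps
        # In the real model each expert is a full denoising U-Net;
        # a tiny conv stack stands in for one here.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(4, 64, 3, padding=1),
                nn.SiLU(),
                nn.Conv2d(64, 4, 3, padding=1),
            )
            for _ in range(num_experts)
        )

    def forward(self, x_t, t):
        # Contiguous buckets: expert 0 covers the noisiest early steps,
        # the last expert covers the final refinement steps.
        idx = min(t * self.num_experts // self.total_steps, self.num_experts - 1)
        return self.experts[idx](x_t)

model = MixtureOfDenoisingExperts()
noisy_latent = torch.randn(1, 4, 64, 64)
predicted_noise = model(noisy_latent, t=700)  # handled by expert 7 alone
```

The appeal of this design is that only one expert runs per step, so inference cost stays roughly that of a single network while total capacity grows tenfold.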
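For the textual-knowledge change, one plausible reading is that embeddings of key tokens are scaled up before cross-attention. The scale factor and the way keywords are picked (for example, a part-of-speech tagger selecting nouns and adjectives) are assumptions, not the paper's exact mechanism:

```python
import torch

def upweight_keywords(token_embeddings, keyword_mask, scale=1.5):
    """Scale keyword-token embeddings so cross-attention favours them.

    token_embeddings: (batch, seq_len, dim) text-encoder output
    keyword_mask:     (batch, seq_len), 1.0 for keywords, 0.0 otherwise
    scale:            assumed boost factor
    """
    per_token = 1.0 + (scale - 1.0) * keyword_mask    # 1.0 or `scale` per token
    return token_embeddings * per_token.unsqueeze(-1)  # broadcast over dim

# Toy usage: boost tokens 1 and 3 of a 5-token prompt.
emb = torch.randn(1, 5, 768)
mask = torch.tensor([[0.0, 1.0, 0.0, 1.0, 0.0]])
emb = upweight_keywords(emb, mask)
```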
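And for the visual-knowledge change, a sketch of a denoising loss that up-weights regions flagged by an object detector. The mask format and the boost factor are assumptions:

```python
import torch
import torch.nn.functional as F

def region_weighted_loss(pred_noise, true_noise, object_mask, boost=5.0):
    """MSE denoising loss, up-weighted inside detected-object regions.

    object_mask: (batch, 1, H, W), 1.0 inside detector boxes, 0.0 outside
    boost:       assumed weighting factor for object regions
    """
    per_pixel = F.mse_loss(pred_noise, true_noise, reduction="none")
    weights = 1.0 + (boost - 1.0) * object_mask  # 1.0 outside, `boost` inside
    return (per_pixel * weights).mean()

# Toy usage: the detector flagged the top-left quadrant as an object.
pred = torch.randn(1, 4, 64, 64)
true = torch.randn(1, 4, 64, 64)
mask = torch.zeros(1, 1, 64, 64)
mask[..., :32, :32] = 1.0
loss = region_weighted_loss(pred, true, mask)
```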
In the HuggingFace public demo, prompts are simply translated from English to Chinese automatically before being fed into the model. A number of quirks flow from this:
- ERNIE has no knowledge of public figures popular in the West, though it certainly has its local favourites in China.
- As a result, the common trick of putting celebrity names into prompts to dramatically boost the quality of faces fails.
- Some prompts will be significantly distorted by the translation into Chinese. If you don't speak Chinese, surprises are in store for you.
- It doesn’t even know anything about Greg Rutkowski. For instance, ERNIE has no knowledge about Arnold Schwarzenegger.