Abstract
Text-to-image generation is a rapidly evolving frontier in artificial intelligence, enabling the transformation of natural language descriptions into visually coherent and semantically rich images. This paper presents a comprehensive review of state-of-the-art generative models, including Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and diffusion models, focusing on their capacity to produce high-fidelity, contextually accurate images from textual inputs. We also analyse leading text-to-image synthesis frameworks such as DALL-E 2, Stable Diffusion, Imagen, and Midjourney, assessing their advances in image quality, semantic alignment, diversity, and computational efficiency. Our systematic evaluation highlights significant progress in generating realistic, high-resolution images, while identifying persistent challenges in semantic consistency, fine-grained control, ethical considerations, and computational cost. We further discuss critical trade-offs between model performance and sustainability, motivating future research aimed at developing more efficient, fair, and environmentally responsible text-to-image generation systems. This survey is intended as a guiding resource for the next generation of sustainable AI-driven text-to-image synthesis technologies.
Keywords
Deep learning, Diffusion model, DALL-E, Generative models, Text-to-image generation