Abstract

Text-to-image generation is a rapidly evolving frontier in artificial intelligence, enabling the transformation of natural language descriptions into visually coherent and semantically rich images. This paper presents a comprehensive review of state-of-the-art generative models, including Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and diffusion models, focusing on their capacity to produce high-fidelity, contextually accurate images from textual input. We also analyse leading text-to-image synthesis frameworks such as DALL-E 2, Stable Diffusion, Imagen, and MidJourney, assessing their advances in image quality, semantic alignment, diversity, and computational efficiency. Our systematic evaluation highlights significant progress in generating realistic, high-resolution images while identifying persistent challenges in semantic consistency, fine-grained control, ethical considerations, and computational cost. We further discuss the trade-offs between model performance and sustainability, and we outline future research directions aimed at developing more efficient, fair, and environmentally responsible text-to-image generation systems. This survey is intended as a guiding resource for the next generation of sustainable AI-driven text-to-image synthesis technologies.

Keywords

Deep learning, Diffusion models, DALL-E, Generative models, Text-to-image generation
