Abstract

Text-to-image generation is a rapidly evolving frontier in artificial intelligence, enabling the transformation of natural language descriptions into visually coherent and semantically rich images. This paper presents a comprehensive review of state-of-the-art generative models, including Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and diffusion models, focusing on their capacity to produce high-fidelity, contextually accurate images from textual input. We also analyse leading text-to-image synthesis frameworks such as DALL-E 2, Stable Diffusion, Imagen, and MidJourney, assessing their advances in image quality, semantic alignment, diversity, and computational efficiency. Our systematic evaluation highlights significant progress in generating realistic, high-resolution images while identifying persistent challenges in semantic consistency, fine-grained control, ethical considerations, and computational cost. We further discuss the trade-offs between model performance and sustainability, and we outline future research directions aimed at developing more efficient, fair, and environmentally responsible text-to-image generation systems. This survey is intended as a guiding resource for the next generation of sustainable AI-driven text-to-image synthesis technologies.

Keywords

Deep learning, Diffusion models, DALL-E, Generative models, Text-to-image generation
