Abstract
The burden of depression and other mental illnesses is growing worldwide, motivating the development of diagnostic tools that are rapid, efficient, and consistent. This study proposes a framework for inferring patient-specific information from therapist-patient dialogues, addressing a gap that existing diagnostic solutions do not cover. The proposed signal processing scheme, the Patient-Specific Audio Extraction Pipeline (PSAAP), enhances the input supplied to Machine Learning (ML) models for mental illness detection. The method locates and measures non-verbal acoustic features, such as pitch, intensity, and Mel-Frequency Cepstral Coefficients (MFCCs), that play a crucial role in assessing mental health. Preprocessing steps including noise reduction, speaker diarization, and silence removal are applied to the DAIC-WOZ dataset to maintain audio quality. Speech characteristics instrumental to diagnosis are retained, so that depression symptoms such as monotony, slowed speech, and low pitch variability can be analyzed precisely. Quantitative results show that the framework improves signal-to-noise ratio (SNR) by up to 16 dB over existing methods. By isolating patient-specific vocal variables and removing therapist speech, the method makes the assessment trait-centered and dependable, allowing practitioners to make sound, clinically pertinent evaluations. A key strength of the framework is its general applicability to patient speech extraction, which can support mental healthcare screening, including in settings where computing capabilities are limited.
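The MFCC features named above are conventionally computed by taking the short-time Fourier transform (STFT) of the framed signal, mapping the power spectrum onto a mel-scale filterbank, and decorrelating the log energies with a discrete cosine transform (DCT). The following is a minimal NumPy-only sketch of that standard chain, not the authors' PSAAP implementation; all function names and parameter defaults (`n_fft=512`, `hop=256`, 26 filters, 13 coefficients) are illustrative assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)  # rising slope
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)  # falling slope
    return fb

def dct_ii(x, n_coeffs):
    # Unnormalized DCT-II along the last axis
    N = x.shape[-1]
    n = np.arange(N)
    basis = np.cos(np.pi / N * (n + 0.5)[None, :] * np.arange(n_coeffs)[:, None])
    return x @ basis.T

def mfcc(signal, sr, n_fft=512, hop=256, n_filters=26, n_coeffs=13):
    # Frame and window the signal, then take the STFT power spectrum
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Mel filterbank, log compression, DCT decorrelation
    mel_energy = power @ mel_filterbank(n_filters, n_fft, sr).T
    log_mel = np.log(mel_energy + 1e-10)
    return dct_ii(log_mel, n_coeffs)  # shape: (n_frames, n_coeffs)
```

In practice, pitch and intensity trajectories would be extracted alongside these coefficients, and the preprocessing stages described in the abstract (diarization, noise reduction, silence removal) would run before feature extraction.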
Keywords
Mel-frequency cepstral coefficients (MFCCs), Short-time Fourier transform (STFT), Distress Analysis Interview Corpus - Wizard of Oz (DAIC-WOZ), Discrete cosine transform (DCT)

