AI to Detect Speaker in a Speech
source link: https://towardsdatascience.com/ai-to-detect-speaker-in-a-speech-a1dae5b597b0?gi=cd351608271a
Using AI to detect the speaker in a speech from the voice data
May 15 · 5 min read
Picture by icons8 on Unsplash
With the advancement of AI, one can build many interesting and helpful applications in health, retail, finance, and various other domains. The main idea is to keep thinking about how we can utilize these advanced technologies and come up with interesting use cases.
Through this blog post, I intend to cover an AI application that can detect a speaker from their voice. I will also explain the process by which I created the dataset. The code and dataset are made available here. There are a few blog posts around this topic, but this one is different in two ways: first, it provides a clear guide on how to detect the speaker effectively using some best practices, without falling into pitfalls; second, at the end I cover some really interesting use cases and applications that can be extended from this work. So, let’s get started.
Creating the Dataset
I created a dataset consisting of voice clips of 5 celebrities/popular figures from India.
Dataset created of 5 celebrities/popular figures from India. Image Source ~ Wikipedia
I took many speeches/interviews of these celebrities from YouTube and converted them into MP3 files.
Further, I converted these MP3 files into spectrograms using the popular Python library Librosa. I generated one spectrogram for every 90-second interval of each MP3 clip.
import math
import librosa
import librosa.display
import matplotlib.pyplot as plt

def generate_spectogram(file, path, jump=90):
    total_time = librosa.get_duration(filename=file)
    till = math.ceil(total_time / jump)
    for i in range(till):
        x, sr = librosa.load(file, offset=i * jump, duration=jump)
        X = librosa.stft(x)
        Xdb = librosa.amplitude_to_db(abs(X))
        librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='log', cmap='gray_r')
        file_save = f"{path}/segment_{i}.png"  # one image per 90-second window; naming is illustrative
        plt.savefig(file_save, dpi=1200)
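The windowing arithmetic in the loop above can be sketched in plain Python; the 200-second clip duration below is made up for illustration:

```python
import math

def segment_offsets(total_time, jump=90):
    """Start times (in seconds) of each fixed-length spectrogram window."""
    return [i * jump for i in range(math.ceil(total_time / jump))]

# A hypothetical 200-second clip yields three windows starting at 0s, 90s, 180s
print(segment_offsets(200))  # → [0, 90, 180]
```

Note that the last window can be shorter than 90 seconds; `librosa.load` simply returns whatever audio remains after the final offset.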
These spectrograms look like:
Spectrogram from Voice
There’s a really good read on Librosa and Music Genre Classification.
Once we have converted the audio clips to images, we can train a supervised Convolutional Neural Network (CNN) model.
Some Challenges
Developing such an application had some challenges of its own:
- Our dataset contains voices of fairly similar people. Distinguishing a gunshot from a barking dog is not very difficult, as these are very different sounds. Differentiating one person’s voice from another’s is a much tougher problem.
- We created the dataset from YouTube speeches/interviews of these celebrities, so there is often background noise, another person or the interviewer speaking in between, or a crowd applauding.
- The dataset has at most 6–7 clips per person, which hampers the accuracy. A richer dataset would give better accuracy and confidence in detecting the person correctly.
Best Practices to Train such Models
While training this application, some things didn’t work well for me, and some things worked like a charm, boosting the model’s performance. In this section, I will call out the best practices for training such models without falling into pitfalls.
One such practice: save the spectrograms as high-resolution grayscale images, as in the snippet below.

librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='log', cmap='gray_r')
plt.savefig(file_save, dpi=1200)
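The `cmap='gray_r'` argument renders the spectrogram directly in grayscale. If the images had been saved in colour instead, a standard way to collapse them to a single channel is a weighted luminance sum; the BT.601 weights below are a common convention, not something from the original post:

```python
import numpy as np

def to_grayscale(rgb):
    """Collapse an H x W x 3 RGB image to H x W using BT.601 luminance weights."""
    return rgb @ np.array([0.299, 0.587, 0.114])

img = np.ones((4, 4, 3))   # a dummy all-white "spectrogram"
gray = to_grayscale(img)
print(gray.shape)          # → (4, 4)
```

A single-channel input also shrinks the first convolutional layer, which helps when the training set is as small as this one.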
Model Training and Accuracy
I trained the model using the fastai library, with a ResNet architecture for the CNN. The dataset and code are made available here.
The model gave an accuracy of about 80–85% on completely unseen test data (from different clips) when trained on the limited training set. The model’s performance can be improved by enriching the training dataset.
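The reported accuracy is simply the fraction of held-out spectrograms labelled with the right speaker. A minimal sketch, with invented labels standing in for the five celebrities:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true speaker labels."""
    assert len(y_true) == len(y_pred)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels for five unseen test spectrograms
y_true = ["speaker_1", "speaker_2", "speaker_1", "speaker_3", "speaker_2"]
y_pred = ["speaker_1", "speaker_2", "speaker_3", "speaker_3", "speaker_2"]
print(accuracy(y_true, y_pred))  # → 0.8
```

Because each 90-second window of a clip is scored independently, one can also majority-vote the window-level predictions to get a more robust per-clip answer.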
Other Interesting Possible Use Cases
A properly trained application like this could be used to:
- Automatically tag the speaker in a video/audio clip.
- Score how well someone can mimic a celebrity, by comparing the probability the model outputs for that celebrity.
- Build an application that guesses the singer of a random song and see how well the AI detects them. It’s like playing against the AI.
- Aid crime investigations by detecting, with high confidence, the speaker in tapped phone conversations.
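The mimicry-scoring idea above relies on per-speaker probabilities. A classifier’s final layer typically produces raw logits, which a softmax turns into a probability per speaker; the logit values below are made up for illustration:

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical logits for the 5 speakers; index 2 is the mimicked celebrity
probs = softmax([0.5, 1.0, 3.0, 0.2, 0.1])
mimic_score = probs[2]                    # higher = more convincing mimicry
print(round(mimic_score, 2))              # → 0.75
```

The same probability vector drives the other use cases too: tagging takes the argmax, while the investigation scenario would act only when the top probability clears a high threshold.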
Conclusion
Through this blog post, I covered an AI application that can detect a speaker from their voice. I also emphasized the best practices for training such models and other interesting use cases possible from this work. The dataset and code are made available here.
If you have any doubts or queries, do reach out to me. I would also be interested to know if you have an interesting AI application or use case in mind to work on.
About the author:
Abhishek Mungoli is a seasoned Data Scientist with experience in the ML field and a Computer Science background, spanning various domains, with a problem-solving mindset. He has excelled in various Machine Learning and Optimization problems specific to Retail, and is enthusiastic about implementing Machine Learning models at scale and sharing knowledge via blogs, talks, meetups, and papers.
My motive is always to simplify the toughest of things to their most simplified version. I love problem-solving, data science, product development, and scaling solutions. In my leisure time, I enjoy exploring new places and working out. Follow me on Medium, LinkedIn, or Instagram and check out my previous posts. I welcome feedback and constructive criticism. Some of my blogs:
- 5 Mistakes every Data Scientist should avoid
- Decomposing Time Series in a simple & intuitive way
- How GPU Computing literally saved me at work?
- Information Theory & KL Divergence, Part I and Part II
- Process Wikipedia Using Apache Spark to Create Spicy Hot Datasets
- A Semi-Supervised Embedding based Fuzzy Clustering
- Compare which Machine Learning Model performs Better
- Analyzing Fitbit Data to Demystify Bodily Pattern Changes Amid Pandemic Lockdown
- Myths and Reality around Correlation
- A Guide to Becoming Business-Oriented Data Scientist