[09-05-2019] Voice-based gender recognition using Gaussian mixture models¶
This blog post presents an approach to recognizing a speaker's gender by voice using Mel-frequency cepstrum coefficients (MFCC) and Gaussian mixture models (GMM). The post explains the following GitHub project: Voice-based-gender-recognition.

The aforementioned implementation uses the Free ST American English Corpus dataset (SLR45), a free American English corpus by Surfingtech containing utterances from 10 speakers (5 females and 5 males).
Keywords: Gender recognition, Mel-frequency cepstrum coefficients, The Free ST American English Corpus dataset, Gaussian mixture models
Introduction¶
The idea here is to recognize the gender of the speaker based on pre-generated Gaussian mixture models (GMM). Once the data is properly formatted, we train one Gaussian mixture model per gender on the Mel-frequency cepstrum coefficients (MFCC) gathered from the associated training wave files. With the models generated, we identify a speaker's gender by extracting the MFCCs from the testing wave files and scoring them against both models. These scores represent the likelihood that the speaker's MFCCs belong to each of the two models. The gender model with the highest score gives the probable gender of the speaker. The following table summarizes these main steps; for a detailed model of the processing steps, refer to the workflow graph in Fig_5.




Data formatting¶
Once you download your dataset, you will need to split it into two different sets:
Training set: This set will be used to train the gender models.
Testing set: This one will serve for testing the accuracy of the gender recognition.
I usually use 2/3 of the data for training and 1/3 for testing, but you can adjust that to your needs. The code provides an option for running the whole cycle using “Run.py”, or you can go step by step; for the data management, just run the following in your terminal:
$ python3 Code/DataManager.py
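The 2/3 to 1/3 split described above can be sketched in a few lines. This is only an illustration and not the repository's DataManager.py: the helper `split_files` and the file names are hypothetical, and the split is done per gender so that both sets keep the same female/male balance.

```python
import random


def split_files(files, train_fraction=2 / 3, seed=0):
    """Shuffle a list of file paths and split it into (train, test)."""
    shuffled = sorted(files)  # sort first so the result is reproducible
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names, one list per gender.
female_files = [f"female_{i:02d}.wav" for i in range(9)]
male_files = [f"male_{i:02d}.wav" for i in range(9)]

train_set, test_set = [], []
for files in (female_files, male_files):
    train, test = split_files(files)
    train_set += train
    test_set += test

print(len(train_set), "training files,", len(test_set), "testing files")
```

Changing `train_fraction` is all it takes to try a different ratio.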
Voice features extraction¶
The Mel-Frequency Cepstrum Coefficients (MFCC) are used here, since they are among the best-performing features in speaker verification. MFCCs are commonly derived as follows:
Take the Fourier transform of (a windowed excerpt of) a signal.
Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.
Take the logs of the powers at each of the mel frequencies.
Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
The MFCCs are the amplitudes of the resulting spectrum.
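The five steps above can be sketched for a single frame with NumPy. This is a minimal illustration under stated assumptions: the mel formula `2595 * log10(1 + f / 700)` is one common convention, the function names are made up for this example, and libraries typically add refinements such as pre-emphasis, liftering and DCT scaling, so the numbers will not match their output.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, rate):
    """Triangular overlapping filters, evenly spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):      # rising edge of the triangle
            bank[i, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge of the triangle
            bank[i, k] = (right - k) / (right - center)
    return bank


def mfcc_frame(frame, rate, n_filters=30, n_ceps=5, n_fft=512):
    # 1. Fourier transform of a windowed excerpt, turned into a power spectrum
    windowed = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed, n_fft)) ** 2 / n_fft
    # 2. map the powers onto the mel scale with triangular filters
    mel_energies = mel_filterbank(n_filters, n_fft, rate) @ power
    # 3. take the logs (a small constant avoids log(0))
    log_energies = np.log(mel_energies + 1e-10)
    # 4. discrete cosine transform (unnormalized DCT-II) of the mel log powers
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    # 5. the MFCCs are the amplitudes of the resulting spectrum
    return basis @ log_energies


# Synthetic 25 ms frame: a 440 Hz tone sampled at 16 kHz.
rate = 16000
t = np.arange(400) / rate
frame = np.sin(2 * np.pi * 440.0 * t)
coeffs = mfcc_frame(frame, rate)
print(coeffs.shape)
```

A real extractor applies this to every overlapping frame of the signal and stacks the results into a matrix with one row per frame.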
To extract MFCC features I usually use the python_speech_features library, which is simple to use and well documented:
import numpy as np
from sklearn import preprocessing
from scipy.io.wavfile import read
from python_speech_features import mfcc
from python_speech_features import delta


def extract_features(audio_path):
    """
    Extract MFCCs, their deltas and double deltas from an audio, performs CMS.

    Args:
        audio_path (str) : path to wave file without silent moments.

    Returns:
        (array) : Extracted features matrix.
    """
    rate, audio = read(audio_path)
    # note: a 50 ms window holds more than 512 samples at rates above 10.24 kHz;
    # python_speech_features then truncates each frame to nfft samples
    mfcc_feature = mfcc(audio, rate, winlen=0.05, winstep=0.01, numcep=5, nfilt=30,
                        nfft=512, appendEnergy=True)
    # normalize each coefficient to zero mean and unit variance over the file
    mfcc_feature = preprocessing.scale(mfcc_feature)
    deltas = delta(mfcc_feature, 2)
    double_deltas = delta(deltas, 2)
    # stack static, delta and double-delta features column-wise
    combined = np.hstack((mfcc_feature, deltas, double_deltas))
    return combined

Gaussian Mixture Models¶
According to D. Reynolds in Gaussian_Mixture_Models:
<< A Gaussian Mixture Model (GMM) is a parametric probability density function represented as a weighted sum of Gaussian component densities. GMMs are commonly used as a parametric model of the probability distribution of continuous measurements or features in a biometric system, such as vocal-tract related spectral features in a speaker recognition system. GMM parameters are estimated from training data using the iterative Expectation-Maximization (EM) algorithm or Maximum A Posteriori (MAP) estimation from a well-trained prior model. >>
In a way, you can consider a Gaussian mixture model as a probabilistic clustering that represents a certain data distribution as a sum of Gaussian density functions (check Fig_6). The densities forming a GMM are also called the components of the GMM. The likelihood of a data point (feature vector) under a model is given by the following equation [6]: \(\begin{equation} P(X | \lambda)=\sum_{k=1}^{K} w_{k} P_{k}\left(X | \mu_{k}, \Sigma_{k}\right) \end{equation}\), where \(\begin{equation} P_{k}\left(X | \mu_{k}, \Sigma_{k}\right)=\frac{1}{\sqrt{(2 \pi)^{d}\left|\Sigma_{k}\right|}} e^{-\frac{1}{2}\left(X-\mu_{k}\right)^{T} \Sigma_{k}^{-1}\left(X-\mu_{k}\right)} \end{equation}\) is the Gaussian distribution, with:
\(\lambda\) denotes the model parameters (weights, means and covariance matrices) estimated from the training data.
\(\mu_{k}\) is the mean of component \(k\).
\(\Sigma_{k}\) is the covariance matrix of component \(k\).
\(w_{k}\) is the weight of component \(k\).
\(k\) is the index of the GMM components, \(K\) their number.
\(d\) is the dimension of the feature vectors.
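To make the likelihood formula concrete, here is a small worked example in plain Python. It is a sketch under assumptions: the covariance matrices are taken to be diagonal (so the determinant is the product of the variances), the density uses the standard d-dimensional normalization \((2\pi)^{d}|\Sigma_{k}|\), and the two-component model with its weights, means and variances is invented for illustration.

```python
import math


def diag_gaussian_pdf(x, mean, variances):
    """Density of a Gaussian with diagonal covariance, evaluated at x."""
    d = len(x)
    det = math.prod(variances)  # determinant of a diagonal covariance matrix
    quad = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, variances))
    return math.exp(-0.5 * quad) / math.sqrt((2 * math.pi) ** d * det)


def gmm_likelihood(x, weights, means, variances):
    """Weighted sum of the component densities: P(X | lambda)."""
    return sum(w * diag_gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))


# Invented two-component model in two dimensions.
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

near_first = gmm_likelihood([0.0, 0.0], weights, means, variances)
between = gmm_likelihood([1.5, 1.5], weights, means, variances)
print(near_first, between)
```

A point sitting on a component mean gets a higher likelihood than a point halfway between the two components, which is the behavior the scoring step relies on.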
To train a Gaussian mixture model based on some collected features, you can use the scikit-learn library, specifically its GMM implementation in sklearn.mixture:
import os
import pickle
# sklearn.mixture.GMM was removed in scikit-learn 0.20; GaussianMixture replaces it
from sklearn.mixture import GaussianMixture


def save_gmm(gmm, name):
    """ Save Gaussian mixture model using pickle.

    Args:
        gmm        : Gaussian mixture model.
        name (str) : File name.
    """
    filename = name + ".gmm"
    with open(filename, 'wb') as gmm_file:
        pickle.dump(gmm, gmm_file)
    print("%5s %10s" % ("SAVING", filename,))

...
# get gender_voice_features using FeaturesExtraction
# generate gaussian mixture models (max_iter replaces the former n_iter argument)
gender_gmm = GaussianMixture(n_components=16, max_iter=200, covariance_type='diag', n_init=3)
# fit features to models
gender_gmm.fit(gender_voice_features)
# save gmm
save_gmm(gender_gmm, "gender")

Gender identification¶
The identification is done in three steps: first you retrieve the voice features, then you compute their likelihood of belonging to each gender, and finally you compare both scores and decide on the probable gender. The computation of the scores is done as follows [1]:
Given a speech segment Y and a speaker S, the gender recognition test can be restated as a basic hypothesis test between \(H_{f}\) and \(H_{m}\), where:
\(H_{f}\) : Y is a FEMALE
\(H_{m}\) : Y is a MALE
\begin{eqnarray} \frac{p\left(Y | H_{f}\right)}{p\left(Y | H_{m}\right)} \left\{\begin{array}{ll}{ \geq 1} & {\text { accept } H_{f}} \\ {< 1} & {\text { accept } H_{m}}\end{array} \right. \end{eqnarray}
where \(p\left(Y | H_{i}\right)\) is the probability density function for the hypothesis \(H_{i}\) evaluated for the observed speech segment Y, also called the likelihood of the hypothesis \(H_{i}\) given the speech segment Y [1].
import pickle
import numpy as np
from FeaturesExtractor import FeaturesExtractor


def identify_gender(vector, females_gmm, males_gmm):
    # female hypothesis scoring
    # (score() returns per-frame log-likelihoods with the legacy GMM class and
    # their mean with GaussianMixture; the comparison below holds in both cases)
    is_female_scores = np.array(females_gmm.score(vector))
    is_female_log_likelihood = is_female_scores.sum()
    # male hypothesis scoring
    is_male_scores = np.array(males_gmm.score(vector))
    is_male_log_likelihood = is_male_scores.sum()
    # print scores
    print("%10s %5s %1s" % ("+ FEMALE SCORE", ":", str(round(is_female_log_likelihood, 3))))
    print("%10s %7s %1s" % ("+ MALE SCORE", ":", str(round(is_male_log_likelihood, 3))))
    # find the winner aka the probable gender of the speaker
    if is_male_log_likelihood > is_female_log_likelihood:
        winner = "male"
    else:
        winner = "female"
    return winner


# init instances and load models
# (females_model_path and males_model_path must point to the saved .gmm files)
features_extractor = FeaturesExtractor()
with open(females_model_path, 'rb') as females_file:
    females_gmm = pickle.load(females_file)
with open(males_model_path, 'rb') as males_file:
    males_gmm = pickle.load(males_file)

# read the test directory and get the list of test audio files
file = "speakertestfile.wav"  # placeholder, replace with the path of a real test file
vector = features_extractor.extract_features(file)
winner = identify_gender(vector, females_gmm, males_gmm)
# the expected gender is parsed from the parent directory name, which assumes a
# path of the form <set>/<gender>s/<file>.wav (a "females" folder gives "female")
expected_gender = file.split("/")[1][:-1]
print("%10s %6s %1s" % ("+ EXPECTATION", ":", expected_gender))
print("%10s %3s %1s" % ("+ IDENTIFICATION", ":", winner))
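The hypothesis test is usually evaluated in the log domain: with frames treated as independent, the log-likelihood of a segment is the sum of its per-frame log-likelihoods, and a likelihood ratio of at least 1 becomes a log ratio of at least 0. The sketch below shows just that decision rule; the function name `decide_gender` and the score values are invented for illustration.

```python
def decide_gender(female_frame_scores, male_frame_scores):
    """Pick a hypothesis from per-frame log-likelihoods under each model."""
    log_p_female = sum(female_frame_scores)  # log p(Y | H_f)
    log_p_male = sum(male_frame_scores)      # log p(Y | H_m)
    log_ratio = log_p_female - log_p_male    # log of the likelihood ratio
    winner = "female" if log_ratio >= 0.0 else "male"
    return winner, log_ratio


# Invented per-frame scores for a two-frame segment.
winner, log_ratio = decide_gender([-10.0, -12.0], [-11.0, -14.0])
print(winner, log_ratio)
```

Working with sums of logs also avoids the numerical underflow that multiplying many small per-frame likelihoods would cause.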

Code & scripts¶
Obviously, the code provided on GitHub is more structured and advanced than what is provided here, since it is used to process multiple files and to compute the accuracy level.
Results summary¶
The results of the gender recognition tests can be summarized in the following table/ confusion matrix:
|                | Female expected | Male expected |
|----------------|-----------------|---------------|
| Female guessed | 563             | 28            |
| Male guessed   | 21              | 376           |
Using the previous results we can compute the following system characteristics:
Precision for female recognition = 563 / (563 + 28) = 0.95
Precision for male recognition = 376 / (376 + 21) = 0.95
Accuracy = 939 / 988 = 0.95
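The same figures can be recomputed directly from the confusion matrix. The sketch below uses only the four counts reported above; it also derives the recall per gender, which the post does not list, using the standard definition (correct guesses divided by all files of that gender).

```python
# Confusion matrix from the results table, keyed by (guessed, expected).
confusion = {
    ("female", "female"): 563,
    ("female", "male"): 28,
    ("male", "female"): 21,
    ("male", "male"): 376,
}
labels = ("female", "male")


def precision(label):
    """Share of the files guessed as `label` that really are `label`."""
    guessed_total = sum(confusion[(label, expected)] for expected in labels)
    return confusion[(label, label)] / guessed_total


def recall(label):
    """Share of the files that are `label` and were guessed as `label`."""
    expected_total = sum(confusion[(guessed, label)] for guessed in labels)
    return confusion[(label, label)] / expected_total


accuracy = sum(confusion[(label, label)] for label in labels) / sum(confusion.values())

for label in labels:
    print(f"{label}: precision={precision(label):.3f} recall={recall(label):.3f}")
print(f"accuracy={accuracy:.3f}")
```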
Conclusions¶
The system reaches 95% accuracy on gender detection, but this can differ for other datasets.
The code can be further optimized using multithreading, acceleration libraries and multiprocessing.
The accuracy can be further improved using GMM normalization, aka a UBM-GMM system.
References and Further readings¶
[1] Reynolds, Douglas A., Thomas F. Quatieri, and Robert B. Dunn. Speaker Verification Using Adapted Gaussian Mixture Models, Digital Signal Processing 10.1 (2000): 19-41. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.117.338&rep=rep1&type=pdf
[2] Sérgio R. F. Vieira, Eduardo M. B. de A. Tenório and Tsang Ing Ren, Speaker Verification Using Adapted Gaussian Mixture Models, August, 2014, https://github.com/embatbr/speechverify/blob/master/report/report.pdf
[3] Sina Khanmohammadi, Chun-An Chou, A Gaussian mixture model based discretization algorithm for associative classification of medical data, July, 2015, https://www.sciencedirect.com/science/article/pii/S0957417416301440
[4] Hanilçi, Cemal & Ertas, Figen. (2013). Investigation of the effect of data duration and speaker gender on text-independent speaker recognition. Computers & Electrical Engineering. 39. 10.1016/j.compeleceng.2012.09.014. https://www.researchgate.net/publication/235995473_Investigation_of_the_effect_of_data_duration_and_speaker_gender_on_textindependent_speaker_recognition
[5] The present and future of voiceprint based security, PDF and lecture video.
[6] Machine Learning in Action: Voice Gender Detection using GMMs: A Python Primer, https://appliedmachinelearning.blog/2017/06/14/voicegenderdetectionusinggmmsapythonprimer/