Training and evaluation metrics, along with model checkpoints and results, are logged directly to a W&B project, which is openly accessible [here](https://wandb.ai/wadaboa/titanet). If you want to perform a custom training run, you have to either disable W&B (see `parameters.yml`) or provide your own entity (your username), project and API key file location in `parameters.yml`. The W&B API key file is a plain text file containing a single line with your W&B API key, which you can get from [here](https://wandb.ai/authorize).
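
As a rough sketch of how such a setup is typically wired up (this is not this repository's actual code: the `wandb` block and every key name below are assumptions for illustration only), a training script could read the config and initialise W&B as follows:

```python
import wandb
import yaml

# Illustrative only: the actual structure and key names of parameters.yml may differ
with open("parameters.yml") as f:
    params = yaml.safe_load(f)

wandb_params = params.get("wandb", {})
if wandb_params.get("enabled", False):
    # The API key file is a plain text file holding a single line with the W&B API key
    with open(wandb_params["api_key_file"]) as f:
        wandb.login(key=f.read().strip())
    run = wandb.init(
        entity=wandb_params["entity"],    # your W&B username
        project=wandb_params["project"],  # your W&B project name
    )
```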
## Training & validation
This section shows training and validation metrics observed for around 75 epochs. In case you want to see more metrics, please head over to the [W&B project](https://wandb.ai/wadaboa/titanet).
### Baseline CE vs TitaNet CE
This experiment compares training and validation loss and accuracy of the baseline and TitaNet models trained with cross-entropy loss. As we can see, training metrics reach similar values, while validation metrics are much better with TitaNet. Moreover, plots suggest that the baseline model had a slight overfitting problem.

### TitaNet CE vs TitaNet ArcFace

This experiment compares training and validation loss and accuracy of two TitaNet models (model size "s"), one trained with cross-entropy loss and the other with ArcFace loss. The ArcFace parameters (scale and margin) are the ones specified in the original paper (30 and 0.2). As we can see, the metrics are quite similar and no major differences can be observed.
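
For reference, here is a minimal sketch of the ArcFace margin logic with the scale and margin values mentioned above (a generic illustration, not the repository's actual loss implementation):

```python
import torch
import torch.nn.functional as F

def arcface_logits(embeddings, weights, labels, scale=30.0, margin=0.2):
    """Add an angular margin to the target-class logit, then rescale.

    embeddings: (batch, dim) speaker embeddings
    weights: (num_speakers, dim) learnable class centres
    labels: (batch,) ground-truth speaker indices
    """
    # Cosine similarity between L2-normalised embeddings and class centres
    cos_theta = F.linear(F.normalize(embeddings), F.normalize(weights))
    theta = torch.acos(cos_theta.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
    # Apply the margin only to the ground-truth class angle
    target = F.one_hot(labels, num_classes=weights.size(0)).bool()
    cos_theta_m = torch.where(target, torch.cos(theta + margin), cos_theta)
    # The result is fed to a standard cross-entropy loss
    return scale * cos_theta_m

# Usage: loss = F.cross_entropy(arcface_logits(emb, class_centres, y), y)
```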

## Visualizations

This section shows some visual results obtained after training each embedding model for around 75 epochs. Please note that all figures represent the same set of utterances, even though different figures use different colours for the same speaker.

This test compares the baseline and TitaNet models on the LibriSpeech dataset used during training. Both models were trained with cross-entropy loss and 2D projections were performed with UMAP. As we can see, TitaNet produces much better separated speaker clusters than the baseline model.

This test compares the baseline and TitaNet models on the VCTK dataset, unseen during training. Both models were trained with cross-entropy loss and 2D projections were performed with UMAP. As above, TitaNet beats the baseline model by a large margin.

This test compares two 2D reduction methods, namely SVD and UMAP. Both figures rely on the TitaNet model trained with cross-entropy loss. As we can see, the choice of the reduction method highly influences our subjective evaluation, with UMAP giving much better separation in the latent space.
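
The reduction step itself is straightforward; a minimal sketch (assuming the `umap-learn` and `scikit-learn` packages, not necessarily the code used in this repository) could look like this:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from umap import UMAP  # provided by the umap-learn package

def project_2d(embeddings: np.ndarray, method: str = "umap") -> np.ndarray:
    """Project (num_utterances, embedding_size) speaker embeddings onto 2D."""
    if method == "umap":
        reducer = UMAP(n_components=2)
    elif method == "svd":
        reducer = TruncatedSVD(n_components=2)
    else:
        raise ValueError(f"Unknown reduction method: {method}")
    return reducer.fit_transform(embeddings)
```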
This test compares two TitaNet models, one trained with cross-entropy loss and the other one trained with ArcFace loss. Both figures rely on UMAP as their 2D reduction method. As we can see, there doesn't seem to be a winner in this example, as both models are able to obtain good clustering properties.
"Our baseline model is based on the d-vector concept. A d-vector is simply a way to refer to speaker embeddings generated by a DNN (Deep Neural Network), hence the \"d\" prefix. The standard way to compute such d-vectors, as described in [Generalized End-to-End Loss for Speaker Verification](https://arxiv.org/abs/1710.10467), is through a stack of LSTM layers processing spectrogram segments. In particular, the full spectrogram of shape $B\\times M\\times T$ is unfolded in a sequence of tensors of shape $B\\times M \\times S$, where $S$ is the segment length. Then, each segment is fed into a recurrent module and hidden states are collapsed in a single dimension by either averaging or simply taking the last one. Collapsed vectors are then projected onto the embedding size and once we have one embedding vector for each segment, the embedding vector of the full spectrogram is just the average of all its constituent segments' embeddings."