
Commit bebd310

add results images
1 parent 7e6319b commit bebd310

16 files changed (+197, -90 lines)

.gitignore (+1, -4)

@@ -137,11 +137,8 @@ data/
 # Training
 checkpoints/
 figures/
+pretrained/

 # Wandb
 init/wandb_api_key_file
 wandb/
-
-# Results
-pretrained/
-results/
README.md (+37, -3)

@@ -12,7 +12,7 @@ Nithin Rao Koluguri, Taejin Park, Boris Ginsburg,
 https://arxiv.org/abs/2110.04410.
 ```

-It is "small scale" because we only rely on the LibriSpeech dataset, instead of using VoxCeleb1, VoxCeleb2, SRE, Fisher, Switchboard and LibriSpeech, as done in the original work. The main reason for this choice is related to resources, as the combined dataset has 3373 hours of speech, with 16681 speakers and 4890K utterances, which is quite big to be trained on Google Colab. Instead, LibriSpeech has 336 hours of speech, with 2338 speakers and 634K utterances, which is sufficient to test the capabilities of the model. Moreover, we only test TitaNet on the speaker identification task, instead of testing it on speaker verification and diarization.
+It is "small scale" because we only rely on the LibriSpeech dataset, instead of using VoxCeleb1, VoxCeleb2, SRE, Fisher, Switchboard and LibriSpeech, as done in the original work. The main reason for this choice is related to resources, as the combined dataset has 3373 hours of speech, with 16681 speakers and 4890K utterances, which is quite big to be trained on Google Colab. Instead, the LibriSpeech subset that we consider has about 100 hours of speech, with 251 speakers and 28.5K utterances, which is sufficient to test the capabilities of the model. Moreover, we only test TitaNet on the speaker identification and verification tasks, instead of also testing it on speaker diarization.

 ## Installation

@@ -26,7 +26,7 @@ pip install -r init/requirements.txt

 ## Execution

-Both training and testing parts of the project are managed through a Jupyter notebook ([titanet.ipynb](titanet.ipynb)). The notebook contains a broad analysis of the dataset in use, an explanation of all the data augmentation techniques reported in the paper, a description of the TitaNet model and a way to train and test it. Hyper-parameters are handled via the `parameters.yml` file. To run the Jupyter notebook, execute the following command:
+Both training and testing parts of the project are managed through a Jupyter notebook ([titanet.ipynb](titanet.ipynb)). The notebook contains a broad analysis of the dataset in use, an explanation of all the data augmentation techniques reported in the paper, a description of the baseline and TitaNet models and a way to train and test them. Hyper-parameters are handled via the `parameters.yml` file. To run the Jupyter notebook, execute the following command:

 ```bash
 jupyter notebook titanet.ipynb
@@ -40,4 +40,38 @@ python3 src/train.py -p "./parameters.yml"

 Training and evaluation metrics, along with model checkpoints and results, are directly logged into a W&B project, which is openly accessible [here](https://wandb.ai/wadaboa/titanet). In case you want to perform a custom training run, you have to either disable W&B (see `parameters.yml`) or provide your own entity (your username), project and API key file location in the `parameters.yml` file. The W&B API key file is a plain text file that contains a single line with your W&B API key, that you can get from [here](https://wandb.ai/authorize).

-Currently, training and testing on Google Colab is only allowed to the repository owner, as it relies on a private SSH key to clone this Github repo. Please open an issue if your use case requires you to work on Google Colab.
+## Results
+
+This section shows some visual results obtained after training each embedding model for around 75 epochs. Please note that all figures represent the same set of utterances, even though different figures use different colours for the same speaker.
+
+### Baseline vs TitaNet on LibriSpeech
+
+This test compares the baseline and TitaNet models on the LibriSpeech dataset used for training. Both models were trained with cross-entropy loss and 2D projections were performed with UMAP. As we can see, the good training and validation metrics of the baseline model are not mirrored in this empirical test. Instead, TitaNet is able to form compact clusters of utterances, thus reflecting the high performance metrics obtained during training.
+
+Baseline | TitaNet
+:-------------------------:|:-------------------------:
+![](results/ls-baseline-ce-umap.png) | ![](results/ls-titanet-ce-umap.png)
+
+### Baseline vs TitaNet on VCTK
+
+This test compares the baseline and TitaNet models on the VCTK dataset, unseen during training. Both models were trained with cross-entropy loss and 2D projections were performed with UMAP. As above, TitaNet beats the baseline model by a large margin.
+
+Baseline | TitaNet
+:-------------------------:|:-------------------------:
+![](results/vctk-baseline-ce-umap.png) | ![](results/vctk-titanet-ce-umap.png)
+
+### SVD vs UMAP reduction
+
+This test compares two 2D reduction methods, namely SVD and UMAP. Both figures rely on the TitaNet model trained with cross-entropy loss. As we can see, the choice of the reduction method highly influences our subjective evaluation, with UMAP giving much better separation in the latent space.
+
+TitaNet LS SVD | TitaNet LS UMAP
+:-------------------------:|:-------------------------:
+![](results/ls-titanet-ce-svd.png) | ![](results/ls-titanet-ce-umap.png)
+
+### Cross-entropy vs ArcFace loss
+
+This test compares two TitaNet models, one trained with cross-entropy loss and the other one trained with ArcFace loss. Both figures rely on UMAP as their 2D reduction method. As we can see, there doesn't seem to be a winner in this example, as both models are able to obtain good clustering properties.
+
+Cross-entropy | ArcFace
+:-------------------------:|:-------------------------:
+![](results/ls-titanet-ce-umap.png) | ![](results/ls-titanet-arc-umap.png)
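
The SVD vs UMAP comparison in the new Results section hinges on how the learned speaker embeddings are reduced to two dimensions before plotting. The sketch below illustrates what such a projection step could look like; it is not code from this commit, and the `project_2d` helper, the 192-dimensional embedding size and the use of `scikit-learn`/`umap-learn` are assumptions made purely for illustration.

```python
# Illustrative sketch only: project high-dimensional speaker embeddings to 2D
# with either truncated SVD or UMAP, mirroring the two reduction methods
# compared in the Results section. Package choices and sizes are assumptions.
import numpy as np
from sklearn.decomposition import TruncatedSVD  # pip install scikit-learn
import umap  # pip install umap-learn


def project_2d(embeddings: np.ndarray, method: str = "umap") -> np.ndarray:
    """Reduce an (N, D) embedding matrix to (N, 2) points for plotting."""
    if method == "svd":
        reducer = TruncatedSVD(n_components=2)
    elif method == "umap":
        reducer = umap.UMAP(n_components=2)
    else:
        raise ValueError(f"Unknown reduction method: {method}")
    return reducer.fit_transform(embeddings)


# Toy usage: random vectors stand in for utterance-level embeddings
fake_embeddings = np.random.randn(500, 192).astype(np.float32)
points_svd = project_2d(fake_embeddings, method="svd")
points_umap = project_2d(fake_embeddings, method="umap")
```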

results/ls-baseline-ce-svd.png (23.9 KB)
results/ls-baseline-ce-umap.png (21.3 KB)
results/ls-titanet-arc-svd.png (20.6 KB)
results/ls-titanet-arc-umap.png (11.9 KB)
results/ls-titanet-ce-svd.png (20.3 KB)
results/ls-titanet-ce-umap.png (15.1 KB)
results/vctk-baseline-ce-svd.png (25.4 KB)
results/vctk-baseline-ce-umap.png (22.5 KB)
results/vctk-titanet-arc-svd.png (20.4 KB)
results/vctk-titanet-arc-umap.png (15.6 KB)
results/vctk-titanet-ce-svd.png (19.4 KB)
results/vctk-titanet-ce-umap.png (16 KB)

src/utils.py (+3, -7)

@@ -97,13 +97,8 @@ def visualize_embeddings(
     )

     # Store embeddings in a dataframe and compute cluster colors
-    embeddings_df = pd.DataFrame(
-        np.concatenate([embeddings, np.expand_dims(labels, axis=-1)], axis=-1),
-        columns=["x", "y", "l"],
-    )
-    embeddings_df.x = embeddings_df.x.astype(float)
-    embeddings_df.y = embeddings_df.y.astype(float)
-    embeddings_df.l = embeddings_df.l.astype(str)
+    embeddings_df = pd.DataFrame(embeddings, columns=["x", "y"], dtype=np.float32)
+    embeddings_df["l"] = np.expand_dims(labels, axis=-1)
     cluster_colors = {l: np.random.random(3) for l in np.unique(labels)}
     embeddings_df["c"] = embeddings_df.l.map(
         {l: tuple(c) for l, c in cluster_colors.items()}
@@ -120,6 +115,7 @@ def visualize_embeddings(
             color=c,
             label=f"{label} (C)",
            marker="^",
+            s=250,
         )
         if not only_centroids:
             ax.scatter(to_plot.x, to_plot.y, color=c, label=f"{label}")
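
The diff above simplifies how `visualize_embeddings` builds its dataframe: instead of concatenating coordinates and labels into one array and casting each column afterwards, the new code creates a float32 frame from the 2D embeddings and attaches the labels as a separate column. Below is a minimal, self-contained sketch of the resulting construction on toy data; the values are invented, and `labels` is assigned directly rather than through `np.expand_dims` as in the commit.

```python
# Toy reproduction of the dataframe construction after this commit; the data
# is made up for illustration and does not come from the repository.
import numpy as np
import pandas as pd

embeddings = np.array([[0.1, 0.2], [0.3, 0.4], [1.0, 1.1], [1.2, 0.9]])  # (N, 2) projected points
labels = np.array([7, 7, 42, 42])  # speaker label per utterance

# One float32 frame for the coordinates, labels kept as their own column
embeddings_df = pd.DataFrame(embeddings, columns=["x", "y"], dtype=np.float32)
embeddings_df["l"] = labels

# One random RGB colour per speaker, mapped onto every utterance (as in utils.py)
cluster_colors = {l: np.random.random(3) for l in np.unique(labels)}
embeddings_df["c"] = embeddings_df.l.map({l: tuple(c) for l, c in cluster_colors.items()})
print(embeddings_df)
```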
