Due to my current work, I need to review graphs from time to time. This post is a collection of methods and utilities that are helpful for this task.
t-SNE
Of course, I need to start with this because t-SNE is a very commonly used visualization technique - not only for graphs. t-SNE stands for t-distributed Stochastic Neighbor Embedding. The technique was introduced by Laurens van der Maaten and Geoffrey Hinton in 2008 as a way to visualize high-dimensional data, so practically every researcher dealing with machine learning has come across it at some point [1][2].
$$ \begin{equation} {\displaystyle p_{j\mid i}={\frac {\exp(-\lVert \mathbf {x} _{i}-\mathbf {x} _{j}\rVert ^{2}/2\sigma _{i}^{2})}{\sum _{k\neq i}\exp(-\lVert \mathbf {x} _{i}-\mathbf {x} _{k}\rVert ^{2}/2\sigma _{i}^{2})}}} \end{equation} $$This technique embeds high-dimensional data points into a two-dimensional map while preserving the neighbourhood relations between the points, i.e. points that are similar in the original space end up close together in the map. A typical use case is therefore the review/detection of clusters in a given dataset.
The algorithm is built on pairwise similarities between data points. Equation 1 shows how to calculate the conditional probability $p_{j\mid i}$ that the data point $x_i$ would pick $x_j$ as its neighbour. This probability can be read as a similarity between the two points. $\sigma_{i}^{2}$ is the variance of the Gaussian centred on $x_i$; it normalizes the probabilities and is chosen per point via the perplexity parameter. The similarity is calculated for all pairs of points except $p_{i\mid i}$, because the similarity of a data point to itself is not interesting in this case and is therefore defined as $0$.
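To make Equation 1 concrete, here is a minimal NumPy sketch. The single shared `sigma` is a simplification of my own; the real algorithm searches for one $\sigma_i$ per point so that each conditional distribution matches the chosen perplexity.

```python
import numpy as np

def conditional_probabilities(X, sigma=1.0):
    """Equation 1: p_{j|i} for every pair of points in X (shape (N, d))."""
    # Squared Euclidean distances between all pairs of points.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Unnormalized Gaussian affinities with a single bandwidth sigma
    # (real t-SNE tunes one sigma_i per point via a perplexity search).
    affinities = np.exp(-sq_dists / (2.0 * sigma ** 2))
    # p_{i|i} is defined as 0: a point is not its own neighbour.
    np.fill_diagonal(affinities, 0.0)
    # Normalize each row so the p_{j|i} for a fixed i sum to 1.
    return affinities / affinities.sum(axis=1, keepdims=True)
```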
Note that $p_{i\mid j}$ is not the same as $p_{j\mid i}$, because each data point has a different neighbourhood. To overcome this asymmetry, we use this equation:
$$ \begin{equation} {\displaystyle p_{ij}={\frac {p_{j\mid i}+p_{i\mid j}}{2N}}} \end{equation} $$...where $N$ is the number of data points.
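Continuing the sketch above (reusing `conditional_probabilities` from the previous block), the symmetrization of Equation 2 is a one-liner:

```python
def joint_probabilities(X, sigma=1.0):
    """Equation 2: symmetrized joint probabilities p_ij."""
    P = conditional_probabilities(X, sigma)  # from the previous sketch
    N = X.shape[0]                           # N = number of data points
    return (P + P.T) / (2.0 * N)
```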
To compare those similarities with the target plot, we define a corresponding similarity between the map points $\mathbf{y}_i$ and $\mathbf{y}_j$:
$$ \begin{equation} {\displaystyle q_{ij}={\frac {(1+\lVert \mathbf {y} _{i}-\mathbf {y} _{j}\rVert ^{2})^{-1}}{\sum _{k}\sum _{l\neq k}(1+\lVert \mathbf {y} _{k}-\mathbf {y} _{l}\rVert ^{2})^{-1}}}} \end{equation} $$Equation 1 uses a Gaussian distribution, whereas Equation 3 uses a Student's t-distribution with one degree of freedom. The t-distribution falls off quickly but has a heavy tail, which helps to overcome the crowding problem. Essentially, $q_{ij}$ should be as close as possible to $p_{ij}$.
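The same kind of sketch applies to Equation 3; here `Y` is assumed to be the current two-dimensional embedding:

```python
import numpy as np

def low_dimensional_affinities(Y):
    """Equation 3: Student-t similarities q_ij between map points Y (shape (N, 2))."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq_dists)
    # Exclude q_{i,i} from the normalization, analogous to p_{i|i} = 0.
    np.fill_diagonal(inv, 0.0)
    return inv / inv.sum()
```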
How far the two distributions are apart is measured with the Kullback–Leibler divergence, which quantifies the statistical difference between two probability distributions:
$$\begin{equation} {\displaystyle \mathrm {KL} \left(P\parallel Q\right)=\sum _{i\neq j}p_{ij}\log {\frac {p_{ij}}{q_{ij}}}} \end{equation}$$We are almost there. We now minimize $\mathrm {KL} \left(P\parallel Q\right)$ with respect to the map points $y_i$, typically using gradient descent; in practice this part is best left to an existing implementation, as sketched below. This post is a very high-level explanation. Additionally, you might take a look at this video, which explains the concept more graphically.
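As a rough sketch of Equation 4 and of how you would actually run t-SNE, the snippet below computes the divergence and then delegates the minimization to scikit-learn; the random `X` is only placeholder data.

```python
import numpy as np
from sklearn.manifold import TSNE

def kl_divergence(P, Q, eps=1e-12):
    """Equation 4: KL(P || Q), summed over all pairs (diagonal terms are 0)."""
    return np.sum(P * np.log((P + eps) / (Q + eps)))

# In practice the gradient descent on the map points y_i is left to an
# existing implementation, for example scikit-learn's:
X = np.random.rand(500, 50)   # placeholder: 500 points in 50 dimensions
Y = TSNE(n_components=2, perplexity=30.0).fit_transform(X)
# Y has shape (500, 2) and can be scatter-plotted to inspect clusters.
```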
References
- [1] Laurens van der Maaten, t-SNE, https://lvdmaaten.github.io/tsne/
- [2] Wikipedia, T-distributed stochastic neighbor embedding, https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding
UMAP
Coming up next
Celluloid: Record Graph developments
Sometimes it can be handy to observe the plotted graph during training. You can either create multiple plots or record a video right away. With Celluloid, this is a task of just a few lines, as the sketch below shows.
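A minimal sketch of the idea; the random scatter data is only a stand-in for whatever you actually plot at each training step:

```python
import numpy as np
import matplotlib.pyplot as plt
from celluloid import Camera

fig, ax = plt.subplots()
camera = Camera(fig)

for step in range(20):
    # Placeholder for the state of your graph/embedding at this training step.
    points = np.random.rand(100, 2) * (step + 1)
    ax.scatter(points[:, 0], points[:, 1], c="tab:blue")
    camera.snap()                   # capture the current figure state as a frame

animation = camera.animate()        # assemble the frames into a matplotlib animation
animation.save("training.mp4")      # requires ffmpeg; use a GIF writer if preferred
```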
The result might look like this: