Posters

Earth mover distance: compare distributions

Relates to the earth mover distance that Danyel Fisher (Honeycomb.io) mentioned during his observability talk on Monday (Day 2).

We propose a new, fast method for measuring distances between large numbers of related high-dimensional datasets, called the Diffusion Earth Mover’s Distance (Diffusion EMD, or DEMD). We model the datasets as distributions supported on a common data graph that is derived from the affinity matrix computed on the combined data. In cases where the graph is a discretization of an underlying closed Riemannian manifold, we prove that Diffusion EMD is topologically equivalent to the standard EMD with a geodesic ground distance. Diffusion EMD can be computed in Õ(n) time and is more accurate than similarly fast algorithms such as tree-based EMDs. We also show that Diffusion EMD is fully differentiable, making it amenable to future uses in gradient-descent frameworks such as deep neural networks. We demonstrate an application of Diffusion EMD to single-cell data collected from 210 COVID-19 patient samples at Yale New Haven Hospital. Here, Diffusion EMD can derive distances between patients on the manifold of cells at least two orders of magnitude faster than equally accurate methods. This distance matrix between patients can be embedded into a higher-level patient manifold, which uncovers structure and heterogeneity in patients.

Finally, we show DEMD’s incorporation into a neural ODE framework we recently developed called MIOFlow (Manifold Interpolating Flows) for learning dynamics from static snapshot data. MIOFlow uses an autoencoder with multiscale diffusion distances to find a manifold embedding of the data. Within this latent space, we then use the neural network to learn a continuous time derivative that performs dynamic optimal transport of static snapshot measurements of cells, and in the process infers continuous dynamics and single-cell trajectories. We show results of this in a cancer metastasis system measured using single-cell RNA sequencing.
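To make the multiscale diffusion idea above concrete, here is a minimal NumPy sketch. It is not the authors' implementation and does not achieve the Õ(n) running time: it uses dense matrix powers, which is only practical for small graphs. The function names (`diffusion_operator`, `diffusion_emd_sketch`), the Gaussian kernel with bandwidth `sigma`, the row normalization, and the scale weights controlled by `alpha` are all assumptions chosen for illustration. The sketch diffuses two distributions over a shared graph at dyadic time scales and sums weighted L1 differences between consecutive scales.

```python
import numpy as np


def diffusion_operator(points, sigma):
    """Row-stochastic diffusion matrix from a Gaussian affinity (assumed kernel)."""
    # Pairwise squared Euclidean distances over the combined data.
    sq_dists = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    affinity = np.exp(-sq_dists / (2.0 * sigma ** 2))
    # Normalize each row so that it sums to one (a Markov transition matrix).
    return affinity / affinity.sum(axis=1, keepdims=True)


def diffusion_emd_sketch(P, mu, nu, n_scales=4, alpha=0.5):
    """Illustrative multiscale diffusion distance between mu and nu.

    mu and nu are probability vectors over the graph nodes. The weights
    2^{-(n_scales - k - 1) * alpha} are an assumption of this sketch.
    """
    diff = np.asarray(mu, dtype=float) - np.asarray(nu, dtype=float)
    P_pow = P                      # P^(2^0)
    prev = diff @ P_pow            # difference diffused for one step
    total = 0.0
    for k in range(n_scales):
        P_pow = P_pow @ P_pow      # squaring gives P^(2^(k+1))
        cur = diff @ P_pow
        weight = 2.0 ** (-(n_scales - k - 1) * alpha)
        # Compare the diffused difference at two consecutive dyadic scales.
        total += weight * np.abs(prev - cur).sum()
        prev = cur
    # Coarsest scale: what remains of the difference after the longest diffusion.
    total += np.abs(prev).sum()
    return total


# Toy example: 50 points on a line, with point masses at different nodes.
points = np.linspace(0.0, 1.0, 50)[:, None]
P = diffusion_operator(points, sigma=0.1)

mu = np.zeros(50)
mu[0] = 1.0
nu_near = np.zeros(50)
nu_near[1] = 1.0
nu_far = np.zeros(50)
nu_far[49] = 1.0

d_near = diffusion_emd_sketch(P, mu, nu_near)
d_far = diffusion_emd_sketch(P, mu, nu_far)
print("near:", d_near, "far:", d_far)
```

In this toy setup, moving mass to a neighboring node should cost less than moving it across the whole line, which is the qualitative behavior expected of an earth mover's distance. Because every step is a matrix product and an absolute value, the same computation could be written in an autodiff framework, which is the property the abstract relies on for gradient-based use.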