Tutorial about clustering localization data#

Locan provides methods for clustering localizations in LocData objects. The methods all return a new LocData object that represents a collection of selections, one for each cluster.

from pathlib import Path

%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

import locan as lc
lc.show_versions(system=False, dependencies=False, verbose=False)
Locan:
   version: 0.20.0.dev41+g755b969

Python:
   version: 3.11.6

Synthetic data#

We simulate 2D localization data that follows a Thomas process, a type of Neyman-Scott cluster process:

rng = np.random.default_rng(seed=11)
locdata = lc.simulate_Thomas(parent_intensity=1e-5, region=((0, 1000), (0, 1000)), cluster_mu=1000, cluster_std=10, seed=rng)

locdata.print_summary()
identifier: "1"
comment: ""
source: SIMULATION
state: RAW
element_count: 6409
frame_count: 0
creation_time {
  2024-03-14T11:08:06.174673Z
}
fig, ax = plt.subplots(nrows=1, ncols=1)
locdata.data.plot.scatter(x='position_x', y='position_y', ax=ax, color='Blue', label='locdata')
plt.show()
[Figure: scatter plot of the simulated localizations, position_y versus position_x.]
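To make the simulation parameters concrete, the following is a minimal standard-library sketch of how a Thomas process can be generated: parents are scattered uniformly with a Poisson-distributed count, and each parent receives a Poisson-distributed number of offspring displaced by an isotropic Gaussian. The helpers `poisson` and `sketch_thomas` are hypothetical names introduced for illustration only. This is not Locan's implementation, it may handle region edges differently, and it will not reproduce the numbers shown above.

```python
import random


def poisson(lam, rng):
    """Draw a Poisson(lam) count by counting arrivals of a rate-lam process in unit time."""
    count = 0
    t = rng.expovariate(lam)
    while t < 1.0:
        count += 1
        t += rng.expovariate(lam)
    return count


def sketch_thomas(parent_intensity, region, cluster_mu, cluster_std, rng):
    """Hypothetical helper: return a list of (x, y) points of a Thomas process."""
    (x_min, x_max), (y_min, y_max) = region
    area = (x_max - x_min) * (y_max - y_min)
    points = []
    # The number of parents is Poisson distributed with mean intensity * area.
    for _ in range(poisson(parent_intensity * area, rng)):
        px = rng.uniform(x_min, x_max)
        py = rng.uniform(y_min, y_max)
        # Each parent gets a Poisson(cluster_mu) number of offspring.
        for _ in range(poisson(cluster_mu, rng)):
            x = rng.gauss(px, cluster_std)
            y = rng.gauss(py, cluster_std)
            # Simplification: offspring that fall outside the region are dropped.
            if x_min <= x <= x_max and y_min <= y <= y_max:
                points.append((x, y))
    return points


rng = random.Random(11)
points = sketch_thomas(1e-5, ((0, 1000), (0, 1000)), 1000, 10, rng)
```

With the parameters used in this tutorial, the expected number of parents is the intensity times the area, 1e-5 × 1000 × 1000 = 10, each with 1000 offspring on average. The actual count varies from run to run because both the number of parents and the cluster sizes are random.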

Cluster localizations by dbscan#

noise, clust = lc.cluster_dbscan(locdata, eps=20, min_samples=3)
assert noise.data.empty
fig, ax = plt.subplots(nrows=1, ncols=1)
locdata.data.plot.scatter(x='position_x', y='position_y', ax=ax, color='Yellow', label='locdata')
lc.LocData.concat(clust.references).data.plot.scatter(x='position_x', y='position_y', ax=ax, color='Blue', label='clustered data')
clust.data.plot.scatter(x='position_x', y='position_y', ax=ax, color='Red', s=10, label='cluster centroids')
plt.show()
[Figure: scatter plot of all localizations (yellow), clustered localizations (blue), and cluster centroids (red).]
clust.data.head()
   localization_count  position_x  uncertainty_x  position_y  uncertainty_y  region_measure_bb  localization_density_bb  subregion_measure_bb  region_measure  localization_density  subregion_measure
0                1016   20.379759       0.300643   85.374381       0.314089        3262.794976                 0.311389            229.091796         1000000              0.001016               4000
1                1050  681.909709       0.305784  248.276785       0.310572        4122.205751                 0.254718            257.346576         1000000              0.001050               4000
2                 986  353.614004       0.302981  512.621728       0.320747        3814.439353                 0.258491            248.125971         1000000              0.000986               4000
3                 996  690.951712       0.315008  513.619337       0.313548        3935.485208                 0.253082            250.933843         1000000              0.000996               4000
4                 368  992.523811       0.288485  636.239028       0.548494        1650.773324                 0.222926            170.375014         1000000              0.000368               4000
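The roles of eps and min_samples can be illustrated with a minimal pure-Python DBSCAN sketch. It assumes the scikit-learn convention, in which a point is a core point if at least min_samples points (including itself) lie within distance eps, and noise is labeled -1. Whether Locan follows exactly this convention is an assumption here. The function `dbscan_sketch` is a hypothetical name used for illustration and is far slower than a real implementation.

```python
from math import dist


def dbscan_sketch(points, eps, min_samples):
    """Hypothetical helper: return one label per point (0, 1, ... or -1 for noise)."""
    # Each neighborhood contains the point itself.
    neighbors = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(n) >= min_samples for n in neighbors]
    labels = [-1] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new cluster from the unassigned core point i.
        labels[i] = cluster
        stack = [i]
        while stack:
            k = stack.pop()
            if not is_core[k]:
                continue  # border points join a cluster but do not expand it
            for j in neighbors[k]:
                if labels[j] == -1:
                    labels[j] = cluster
                    stack.append(j)
        cluster += 1
    return labels


# Two tight groups of three points and one isolated point.
toy_points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11), (50, 50)]
labels = dbscan_sketch(toy_points, eps=2, min_samples=3)
```

In this toy data set the isolated point at (50, 50) has no neighbor within eps, so it stays labeled as noise, while each group of three forms its own cluster.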

Cluster localizations in the presence of noise#

Often, homogeneously distributed localizations are present that cannot be assigned to any cluster (so-called noise). In the calls used here, cluster_dbscan returns two LocData objects: the first holds the noise and the second holds the cluster collection.

locdata_cluster = lc.simulate_Thomas(parent_intensity=1e-5, region=((0, 1000), (0, 1000)), cluster_mu=1000, cluster_std=10, seed=rng)
locdata_noise = lc.simulate_Poisson(intensity=1e-4, region=((0, 1000), (0, 1000)), seed=rng)
locdata = lc.LocData.concat([locdata_cluster, locdata_noise])
noise, clust = lc.cluster_dbscan(locdata, eps=30, min_samples=3)
fig, ax = plt.subplots(nrows=1, ncols=1)
locdata.data.plot.scatter(x='position_x', y='position_y', ax=ax, color='Yellow', label='locdata')
lc.LocData.concat(clust.references).data.plot.scatter(x='position_x', y='position_y', ax=ax, color='Blue', label='clustered data')
noise.data.plot.scatter(x='position_x', y='position_y', ax=ax, color='Gray', label='noise')
clust.data.plot.scatter(x='position_x', y='position_y', ax=ax, color='Red', s=50, label='cluster centroids')
plt.show()
[Figure: scatter plot of all localizations (yellow), clustered localizations (blue), noise (gray), and cluster centroids (red).]
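A rough back-of-the-envelope calculation shows why most of the Poisson background ends up as noise with these settings. The sketch below assumes that min_samples counts the point itself (the scikit-learn convention, taken here as an assumption about Locan), and it ignores region edges and noise points that happen to lie close to a real cluster.

```python
from math import exp, factorial, pi

noise_intensity = 1e-4  # intensity passed to simulate_Poisson above
eps = 30                # eps passed to cluster_dbscan above
min_samples = 3         # min_samples passed to cluster_dbscan above

# Expected number of other noise points inside a disc of radius eps.
expected_neighbors = noise_intensity * pi * eps**2

# Assumption: a point needs min_samples - 1 other points nearby to be a core point.
needed = min_samples - 1

# Poisson probability of finding fewer than `needed` other noise points.
p_fewer = sum(
    expected_neighbors**k * exp(-expected_neighbors) / factorial(k)
    for k in range(needed)
)
p_core_by_chance = 1 - p_fewer
```

Under these assumptions a noise point has about 0.28 other noise points within eps on average, and only roughly 3% of noise points would become core points purely through chance encounters with other noise.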

If single localizations should be included as individual clusters, min_samples must be reduced to 1. In that case the returned noise object is always empty.

noise, clust = lc.cluster_dbscan(locdata, eps=20, min_samples=1)
assert noise.data.empty
fig, ax = plt.subplots(nrows=1, ncols=1)
locdata.data.plot.scatter(x='position_x', y='position_y', ax=ax, color='Yellow', label='locdata')
lc.LocData.concat(clust.references).data.plot.scatter(x='position_x', y='position_y', ax=ax, color='Blue', label='clustered data')
# noise.data.plot.scatter(x='position_x', y='position_y', ax=ax, color='Gray', label='noise')
clust.data.plot.scatter(x='position_x', y='position_y', ax=ax, color='Red', s=50, label='cluster centroids')
plt.show()
[Figure: scatter plot of all localizations (yellow), clustered localizations (blue), and cluster centroids (red).]
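With min_samples set to 1, every point is a core point, so DBSCAN reduces to finding the connected components of the graph that links all points closer than eps. The following union-find sketch illustrates this, again assuming that min_samples counts the point itself. The function `eps_components` is a hypothetical name used for illustration and is not part of Locan.

```python
from math import dist


def eps_components(points, eps):
    """Hypothetical helper: label connected components of the eps-neighborhood graph."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the components of every pair of points within distance eps.
    for i, p in enumerate(points):
        for j in range(i):
            if dist(p, points[j]) <= eps:
                parent[find(i)] = find(j)

    # Relabel roots as consecutive integers in order of first appearance.
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(len(points))]


toy_points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11), (50, 50)]
component_labels = eps_components(toy_points, eps=2)
```

Every point receives a label, including the isolated point at (50, 50), which forms a cluster of size one. This is why no noise remains in this setting.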

Other cluster functions are available in the locan.data.cluster module.