Tutorial about setting up an analysis pipeline and batch processing¶

When exploring data, you typically experiment with various analysis routines and parameters until you settle on an analysis pipeline. A pipeline procedure is then a script that defines the analysis steps for a single locdata object (or a single group of corresponding locdatas, as used for instance in 2-color measurements).

The Pipeline class can be used to combine the pipeline code, metadata and analysis results in a single pickleable object (meaning it can be serialized by the Python pickle module).

This pipeline can then be applied to a number of similar datasets. A batch process is a procedure for running a pipeline over multiple locdata objects and collecting and combining the results.

from pathlib import Path

%matplotlib inline

import matplotlib.pyplot as plt

import locan as lc

lc.show_versions(system=False, dependencies=False, verbose=False)

Locan:
   version: 0.21.0

Python:
   version: 3.11.12

# A path in which test data can be found:
TEST_DIR: Path = Path.cwd().parents[2] / "tests"
TEST_DIR

PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/locan/checkouts/stable/tests')

Apply a pipeline of different analysis routines¶

Load an ASDF data file¶

path = TEST_DIR / 'test_data/npc_gp210.asdf'
print(path, '\n')
dat = lc.load_locdata(path=path, file_type=lc.FileType.ASDF)

/home/docs/checkouts/readthedocs.org/user_builds/locan/checkouts/stable/tests/test_data/npc_gp210.asdf 

dat.properties

{'localization_count': 202,
 'position_x': np.float64(5623.488892810179),
 'uncertainty_x': np.float64(3.9700846636030356),
 'position_y': np.float64(6625.534602703435),
 'uncertainty_y': np.float64(3.999303432482808),
 'intensity': np.float32(3944778.0),
 'local_background': np.float32(1131.3207),
 'frame': np.int16(24),
 'region_measure_bb': np.float32(47134.543),
 'localization_density_bb': np.float32(0.004285604),
 'subregion_measure_bb': np.float32(870.98047)}

Set up an analysis procedure¶

First, define the analysis procedure (pipeline) in the form of a computation function. Make sure the first parameter is self, which refers to the Pipeline object. Add arbitrary keyword arguments thereafter. If the function ends with return self, compute() can be called directly upon instantiation and the call returns the Pipeline object.

def computation(self, locdata, n_localizations_min=4):
    
    # import required modules
    from locan.analysis import LocalizationPrecision
    
    # prologue
    self.file_indicator = locdata.meta.file.path
    self.locdata = locdata
    
    # check requirements
    if len(locdata)<=n_localizations_min:
        return None
    
    # compute localization precision
    self.lp = LocalizationPrecision().compute(self.locdata)
    
    return self
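
The calling convention described above can be illustrated without locan. The following is a minimal sketch: MiniPipeline and mean_computation are hypothetical stand-ins, not locan's implementation, and the sketch assumes that compute() simply forwards the stored keyword arguments to the computation function and returns whatever that function returns.

```python
class MiniPipeline:
    """Hypothetical stand-in that mimics the calling convention only."""

    def __init__(self, computation, **kwargs):
        self.computation = computation  # plain function that takes `self` first
        self.kwargs = kwargs            # keyword arguments for the computation

    def compute(self):
        # Pass the pipeline object explicitly as the `self` argument.
        return self.computation(self, **self.kwargs)


def mean_computation(self, values, n_values_min=4):
    # check requirements, analogous to the tutorial's computation function
    if len(values) <= n_values_min:
        return None
    # results are stored as attributes on the pipeline object
    self.mean = sum(values) / len(values)
    # returning self lets instantiation and compute() be chained
    return self


pipe_demo = MiniPipeline(computation=mean_computation, values=[1, 2, 3, 4, 5, 6]).compute()
print(pipe_demo.mean)
```

If the computation function returned nothing at the end, the chained call would evaluate to None and the object holding the results would be lost unless it had been assigned to a name beforehand.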

Run the analysis procedure¶

Instantiate a Pipeline object and run compute():

pipe = lc.Pipeline(computation=computation, locdata=dat, n_localizations_min=4).compute()
pipe.meta

Processed frames:: 100%|██████████| 24884/24884 [00:00<00:00, 62703.89it/s]

identifier: "1"
method {
  name: "Pipeline"
  parameter: "{\'computation\': <function computation at 0x78ce1a05e660>, \'locdata\': <locan.data.locdata.LocData object at 0x78ce1c3eb710>, \'n_localizations_min\': 4}"
}
creation_time {
  seconds: 1754766859
  nanos: 946351000
}

Results are available from the Pipeline object in the form of attributes defined in the computation function:

[attr for attr in dir(pipe) if not attr.startswith('__') and not attr.endswith('__')]

['_get_parameters',
 '_init_meta',
 '_update_meta',
 'computation',
 'computation_as_string',
 'compute',
 'count',
 'file_indicator',
 'kwargs',
 'locdata',
 'lp',
 'meta',
 'parameter',
 'report',
 'results',
 'save_computation']

pipe.lp.results.head()

   position_delta_x  position_delta_y  position_distance  original_index  frame
0        -11.189941          4.859863          12.199716               0     24
1         32.580078          4.170410          32.845909               1     25
2         13.549805        -15.439941          20.542370               3    141
3          4.669922          3.010254           5.556060               4    142
4         20.469727         14.750000          25.230383               6    239

pipe.lp.hist();
print(pipe.lp.distribution_statistics.parameter_dict())

{'position_delta_x_loc': np.float32(0.6029139), 'position_delta_x_scale': np.float32(13.2682), 'position_delta_y_loc': np.float32(-1.227687), 'position_delta_y_scale': np.float32(14.76087), 'position_distance_sigma': np.float64(14.067675781250028), 'position_distance_loc': 0, 'position_distance_scale': 1}

[Figure: output of pipe.lp.hist()]

You can recover the computation procedure:

pipe.computation_as_string()

'def computation(self, locdata, n_localizations_min=4):\n    \n    # import required modules\n    from locan.analysis import LocalizationPrecision\n    \n    # prologue\n    self.file_indicator = locdata.meta.file.path\n    self.locdata = locdata\n    \n    # check requirements\n    if len(locdata)<=n_localizations_min:\n        return None\n    \n    # compute localization precision\n    self.lp = LocalizationPrecision().compute(self.locdata)\n    \n    return self\n'

or save it as a text protocol using the save_computation method.

The Pipeline object is pickleable and can thus be saved and revisited later.

Apply the pipeline on multiple datasets - a batch process¶

Let’s create multiple datasets:

path = TEST_DIR / 'test_data/npc_gp210.asdf'
print(path, '\n')
dat = lc.load_locdata(path=path, file_type=lc.FileType.ASDF)

# strict inequalities exclude the bounds themselves
locdatas = [lc.select_by_condition(dat, f'{lower}<index<{upper}') for lower, upper in ((0,100), (101,202))]
locdatas

/home/docs/checkouts/readthedocs.org/user_builds/locan/checkouts/stable/tests/test_data/npc_gp210.asdf 

[<locan.data.locdata.LocData at 0x78ce11b01490>,
 <locan.data.locdata.LocData at 0x78ce12ea1810>]
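
Note that the two conditions use strict inequalities, so the subsets do not cover every localization. The following plain-Python sketch shows the effect; select_open_interval is a hypothetical helper, and the sketch assumes that the index of the 202 localizations runs from 0 to 201.

```python
def select_open_interval(indices, lower, upper):
    """Return the indices i with lower < i < upper (both bounds excluded)."""
    return [i for i in indices if lower < i < upper]


indices = list(range(202))         # assumption: default index 0..201
windows = ((0, 100), (101, 202))   # same bounds as in the tutorial

subsets = [select_open_interval(indices, lo, hi) for lo, hi in windows]
sizes = [len(subset) for subset in subsets]

# indices that end up in neither subset
selected = set().union(*subsets)
dropped = sorted(set(indices) - selected)

print(sizes, dropped)
```

Under this assumption the subsets hold 99 and 100 elements, and the indices 0, 100 and 101 fall into neither of them.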

Run the analysis pipeline as a batch process:

pipes = [lc.Pipeline(computation=computation, locdata=locdata).compute() for locdata in locdatas]

Processed frames:: 100%|██████████| 6164/6164 [00:00<00:00, 15587.32it/s]

Processed frames:: 100%|██████████| 18639/18639 [00:00<00:00, 43216.92it/s]
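
Because the computation function returns None for datasets with too few localizations, a batch result list may in general contain None entries. Both subsets used here are large enough, so this does not occur in this run. A minimal sketch of filtering such entries follows. It uses a hypothetical stand-in class (MiniPipeline) and hypothetical names (count_computation, run_batch) rather than locan, and assumes that compute() returns the computation's return value unchanged.

```python
class MiniPipeline:
    """Hypothetical stand-in for illustration; not locan's Pipeline."""

    def __init__(self, computation, **kwargs):
        self.computation = computation
        self.kwargs = kwargs

    def compute(self):
        return self.computation(self, **self.kwargs)


def count_computation(self, values, n_values_min=4):
    # datasets that are too small are rejected by returning None
    if len(values) <= n_values_min:
        return None
    self.count = len(values)
    return self


def run_batch(computation, datasets, **kwargs):
    """Run the computation on every dataset and keep only successful pipelines."""
    results = [
        MiniPipeline(computation=computation, values=data, **kwargs).compute()
        for data in datasets
    ]
    return [result for result in results if result is not None]


datasets = [[1, 2, 3, 4, 5], [1, 2], [3, 1, 4, 1, 5, 9]]
valid_pipes = run_batch(count_computation, datasets)
print([p.count for p in valid_pipes])
```

Filtering before any further processing avoids attribute errors when later steps access results on each element of the list.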

As long as the batch procedure runs in a single process, the identifier increases with every instantiation.

[pipe.meta.identifier for pipe in pipes]

['2', '3']
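
One way such behavior can arise is a counter shared by all instances of a class within one process. The sketch below illustrates that idea only; CountingPipeline is a hypothetical class and locan's actual mechanism may differ.

```python
import itertools


class CountingPipeline:
    """Hypothetical class whose instances draw identifiers from a shared counter."""

    # class-level counter: lives as long as the Python process
    _ids = itertools.count(1)

    def __init__(self):
        # identifiers are stored as strings, as in the output above
        self.identifier = str(next(CountingPipeline._ids))


instances = [CountingPipeline() for _ in range(3)]
identifiers = [instance.identifier for instance in instances]
print(identifiers)
```

A counter of this kind restarts in every new process, so identifiers produced by parallel workers or separate sessions would not be unique across runs.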

Visualize the combined results¶

fig, ax = plt.subplots(nrows=1, ncols=1)
for pipe in pipes:
    pipe.lp.plot(ax=ax, window=10)
plt.show()

[Figure: output of pipe.lp.plot(ax=ax, window=10) for both pipelines on shared axes]