Tutorial about managing files in batch processing#

When analysing a set of experiments, you typically need to collect, match, and group files according to their content and the experimental conditions.

The Files class in locan supports these tasks.

from pathlib import Path
import tempfile

%matplotlib inline

import matplotlib.pyplot as plt

import locan as lc
lc.show_versions(system=False, dependencies=False, verbose=False)
Locan:
   version: 0.20.0.dev41+g755b969

Python:
   version: 3.11.6

Some file structure to be analysed#

directory = Path(tempfile.mkdtemp())
subdirectory = directory.joinpath("sub_directory")
subdirectory.mkdir()
files = [
    directory / "sub_directory" / "file_group_a_0.data",
    directory / "sub_directory" / "file_group_a_1.data",
    directory / "sub_directory" / "file_group_b_2.data",
    directory / "sub_directory" / "corresponding_file_0.data",
    directory / "metadata.meta",
]
for file_ in files:
    file_.touch()
list(directory.glob("**/*.*"))
[PosixPath('/tmp/tmp_hvjxqz3/metadata.meta'),
 PosixPath('/tmp/tmp_hvjxqz3/sub_directory/file_group_b_2.data'),
 PosixPath('/tmp/tmp_hvjxqz3/sub_directory/file_group_a_1.data'),
 PosixPath('/tmp/tmp_hvjxqz3/sub_directory/file_group_a_0.data'),
 PosixPath('/tmp/tmp_hvjxqz3/sub_directory/corresponding_file_0.data')]
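
A directory created with tempfile.mkdtemp() is not removed automatically. As a minimal sketch using only the standard library (not part of locan), the same example structure can be built inside a tempfile.TemporaryDirectory context so that it is cleaned up afterwards. The helper name make_example_tree is hypothetical.

```python
import tempfile
from pathlib import Path


def make_example_tree(root: Path) -> list:
    """Create the tutorial's example files below `root` and return their paths."""
    subdirectory = root / "sub_directory"
    subdirectory.mkdir()
    paths = [
        subdirectory / "file_group_a_0.data",
        subdirectory / "file_group_a_1.data",
        subdirectory / "file_group_b_2.data",
        subdirectory / "corresponding_file_0.data",
        root / "metadata.meta",
    ]
    for path in paths:
        path.touch()
    return paths


with tempfile.TemporaryDirectory() as tmp:
    created = make_example_tree(Path(tmp))
    # Sort the glob result: the file system returns entries in arbitrary order.
    found = sorted(p.relative_to(tmp).as_posix() for p in Path(tmp).glob("**/*.*"))
    print(found)
# The temporary directory and its contents are removed when the block exits.
```

Sorting makes the listing reproducible, which the unsorted glob output above is not guaranteed to be.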

The Files class#

lc.Files?

Identify files#

files = lc.Files.from_glob(
    directory=directory,
    pattern="**/*.*"
)
files.df
file_path
0 /tmp/tmp_hvjxqz3/metadata.meta
1 /tmp/tmp_hvjxqz3/sub_directory/file_group_b_2....
2 /tmp/tmp_hvjxqz3/sub_directory/file_group_a_1....
3 /tmp/tmp_hvjxqz3/sub_directory/file_group_a_0....
4 /tmp/tmp_hvjxqz3/sub_directory/corresponding_f...

For each file a Path object is stored:

files.df.map(lambda x: x.name)
file_path
0 metadata.meta
1 file_group_b_2.data
2 file_group_a_1.data
3 file_group_a_0.data
4 corresponding_file_0.data
files.print_summary()
Number of files: 5
Base directory: /tmp/tmp_hvjxqz3
Columns: Index(['file_path'], dtype='object')
       file_path
count          5
unique         5

Exclude files#

files = lc.Files.from_glob(
    directory=directory,
    pattern="**/*.*"
)

files.df.map(lambda x: x.name)
file_path
0 metadata.meta
1 file_group_b_2.data
2 file_group_a_1.data
3 file_group_a_0.data
4 corresponding_file_0.data
stoplist = lc.Files.concatenate([
    lc.Files.from_glob(directory=files.directory, pattern="**/*.meta"),
    lc.Files.from_glob(directory=files.directory, pattern="**/*group_b*.*")
])
stoplist.df.map(lambda x: x.name)
file_path
0 metadata.meta
1 file_group_b_2.data
files.exclude(stoplist=stoplist)
files.df.map(lambda x: x.name)
file_path
2 file_group_a_1.data
3 file_group_a_0.data
4 corresponding_file_0.data
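
Conceptually, the exclusion above is a set difference between two collections of paths. The following is a minimal standard-library sketch of that idea, not locan's implementation; the flat directory layout and the variable names are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in [
        "file_group_a_0.data",
        "file_group_a_1.data",
        "file_group_b_2.data",
        "corresponding_file_0.data",
        "metadata.meta",
    ]:
        (root / name).touch()

    all_files = sorted(root.glob("*.*"))
    # The stoplist is the union of two glob selections.
    stoplist = set(root.glob("*.meta")) | set(root.glob("*group_b*.*"))
    # Keep every path that does not appear in the stoplist.
    kept_names = [p.name for p in all_files if p not in stoplist]

print(kept_names)
```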

Match corresponding files#

files = lc.Files.from_glob(
    directory=directory,
    pattern="**/*.*",
    regex="group_a_0"
)
files.df.map(lambda x: x.name)
file_path
0 file_group_a_0.data
corresponding_files = lc.Files.from_glob(
    directory=directory,
    pattern="**/*.*",
    regex="corresponding"
)
corresponding_files.df.map(lambda x: x.name)
file_path
0 corresponding_file_0.data
files.match_files(files=corresponding_files.df)
files.df.map(lambda x: x.name)
file_path other_file_path
0 file_group_a_0.data corresponding_file_0.data
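
The selection by glob pattern plus regex, followed by pairing two file lists, can be sketched with the standard library alone. This is a sketch under assumptions, not locan's implementation: the regex is applied with re.search to the file name only, and the two lists are paired by position, which is consistent with the single-row result above. The helper name select_files is hypothetical.

```python
import re
import tempfile
from pathlib import Path


def select_files(directory, pattern, regex=None):
    """Glob below `directory`, then optionally keep names matching `regex`."""
    paths = sorted(directory.glob(pattern))
    if regex is not None:
        paths = [p for p in paths if re.search(regex, p.name)]
    return paths


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in [
        "file_group_a_0.data",
        "file_group_a_1.data",
        "corresponding_file_0.data",
    ]:
        (root / name).touch()

    selected = select_files(root, "*.data", regex="group_a_0")
    partners = select_files(root, "*.data", regex="corresponding")
    # Pair the two selections row by row (assumed positional matching).
    pairs = [(a.name, b.name) for a, b in zip(selected, partners)]

print(pairs)
```

Positional pairing only gives meaningful results when both selections have the same length and order, so it is worth checking that before relying on the pairs.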

Match metadata files#

files = lc.Files.from_glob(
    directory=directory,
    pattern="**/*.*",
    regex="group_a_0"
)
files.df.map(lambda x: x.name)
file_path
0 file_group_a_0.data
files.match_file_upstream(pattern="*.meta")
files.df.map(lambda x: x.name)
file_path metadata
0 file_group_a_0.data metadata.meta
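
The result above suggests that the metadata file is looked up in the directories above each data file. A minimal sketch of such an upstream search with pathlib follows. It assumes that the search starts in the file's own directory and stops at a base directory; the function find_upstream and its stop parameter are hypothetical and not part of locan.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_upstream(start: Path, pattern: str, stop: Path) -> Optional[Path]:
    """Search the directory of `start` and then each parent up to `stop`."""
    for folder in [start.parent, *start.parent.parents]:
        matches = sorted(folder.glob(pattern))
        if matches:
            return matches[0]
        if folder == stop:
            break  # do not search above the base directory
    return None


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub_directory").mkdir()
    data_file = root / "sub_directory" / "file_group_a_0.data"
    data_file.touch()
    (root / "metadata.meta").touch()

    found = find_upstream(data_file, "*.meta", stop=root)
    missing = find_upstream(data_file, "*.toml", stop=root)
    found_name = found.name if found is not None else None

print(found_name, missing)
```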

Group files#

files = lc.Files.from_glob(
    directory=directory,
    pattern="**/file*.data"
)
files.df.map(lambda x: x.name)
file_path
0 file_group_b_2.data
1 file_group_a_1.data
2 file_group_a_0.data
files.set_group_identifier(name="A", pattern="group_a")
files.df
file_path group
0 /tmp/tmp_hvjxqz3/sub_directory/file_group_b_2.... NaN
1 /tmp/tmp_hvjxqz3/sub_directory/file_group_a_1.... A
2 /tmp/tmp_hvjxqz3/sub_directory/file_group_a_0.... A
files.set_group_identifier(name="B", pattern="group_b")
files.df
file_path group
0 /tmp/tmp_hvjxqz3/sub_directory/file_group_b_2.... B
1 /tmp/tmp_hvjxqz3/sub_directory/file_group_a_1.... A
2 /tmp/tmp_hvjxqz3/sub_directory/file_group_a_0.... A
files.group_identifiers()
Index(['A', 'B'], dtype='object')
grouped = files.grouped()
grouped.groups
{'A': [1, 2], 'B': [0]}
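
The grouping step can be pictured as labelling each file by the first identifier whose pattern occurs in its name and then collecting row indices per label. The sketch below uses plain dictionaries and assumes simple substring matching; it illustrates the idea and is not locan's implementation.

```python
# File names in the same order as the rows shown above.
names = ["file_group_b_2.data", "file_group_a_1.data", "file_group_a_0.data"]

# Group label -> substring that identifies the group (assumed matching rule).
identifiers = {"A": "group_a", "B": "group_b"}

groups = {}
for index, name in enumerate(names):
    for label, pattern in identifiers.items():
        if pattern in name:
            groups.setdefault(label, []).append(index)
            break  # assign each file to at most one group

groups = dict(sorted(groups.items()))
print(groups)
```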

Indexing and iterating over files#

files = lc.Files.from_glob(
    directory=directory,
    pattern="**/file*.data"
)
files.df.map(lambda x: x.name)
file_path
0 file_group_b_2.data
1 file_group_a_1.data
2 file_group_a_0.data

Slicing Files yields a new Files instance:

files[0:3]
<locan.locan_io.files.Files at 0x7f740224e510>

Indexing Files yields a Series with the selected row:

files[0]
file_path    /tmp/tmp_hvjxqz3/sub_directory/file_group_b_2....
Name: 0, dtype: object

Iterating over Files yields a namedtuple for each row:

for file in files:
    print(file)
    print(file.file_path)
Files(Index=0, file_path=PosixPath('/tmp/tmp_hvjxqz3/sub_directory/file_group_b_2.data'))
/tmp/tmp_hvjxqz3/sub_directory/file_group_b_2.data
Files(Index=1, file_path=PosixPath('/tmp/tmp_hvjxqz3/sub_directory/file_group_a_1.data'))
/tmp/tmp_hvjxqz3/sub_directory/file_group_a_1.data
Files(Index=2, file_path=PosixPath('/tmp/tmp_hvjxqz3/sub_directory/file_group_a_0.data'))
/tmp/tmp_hvjxqz3/sub_directory/file_group_a_0.data
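
In a batch job, this kind of iteration usually drives a loop that computes one result per file. The following sketch mimics the rows with collections.namedtuple so that it runs without locan; the FileRow type, the relative paths, and the use of the file stem as a stand-in result are assumptions for illustration.

```python
from collections import namedtuple
from pathlib import PurePosixPath

# Hypothetical stand-in for one row yielded during iteration.
FileRow = namedtuple("Files", ["Index", "file_path"])

rows = [
    FileRow(0, PurePosixPath("sub_directory/file_group_b_2.data")),
    FileRow(1, PurePosixPath("sub_directory/file_group_a_1.data")),
    FileRow(2, PurePosixPath("sub_directory/file_group_a_0.data")),
]

# Batch loop: derive one result per file and keep it keyed by the row index.
results = {}
for row in rows:
    results[row.Index] = row.file_path.stem

print(results)
```

Keying results by the row index keeps them aligned with the table of files, so they can be joined back to group labels or matched files later.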