Tutorial about managing files in batch processing¶
When analysing a set of experiments, you often need to collect, match and group files according to their content and experimental conditions.
The Files class helps with these tasks.
from pathlib import Path
import tempfile
%matplotlib inline
import matplotlib.pyplot as plt
import locan as lc
lc.show_versions(system=False, dependencies=False, verbose=False)
Locan:
version: 0.22.0.dev32+g4bfc3ab8b
Python:
version: 3.11.14
Some file structure to be analysed¶
directory = Path(tempfile.mkdtemp())
subdirectory = directory.joinpath("sub_directory")
subdirectory.mkdir()
files = [
directory / "sub_directory" / "file_group_a_0.data",
directory / "sub_directory" / "file_group_a_1.data",
directory / "sub_directory" / "file_group_b_2.data",
directory / "sub_directory" / "corresponding_file_0.data",
directory / "metadata.meta",
]
for file_ in files:
    file_.touch()
list(directory.glob("**/*.*"))
[PosixPath('/tmp/tmpuz6fwhw5/metadata.meta'),
PosixPath('/tmp/tmpuz6fwhw5/sub_directory/file_group_b_2.data'),
PosixPath('/tmp/tmpuz6fwhw5/sub_directory/file_group_a_1.data'),
PosixPath('/tmp/tmpuz6fwhw5/sub_directory/file_group_a_0.data'),
PosixPath('/tmp/tmpuz6fwhw5/sub_directory/corresponding_file_0.data')]
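The listing above is not in alphabetical order, because the order in which glob returns paths depends on the file system. The following is a minimal sketch using only the standard library (it does not use locan) that rebuilds the same structure and sorts the result to get a reproducible order. The variable names are chosen for illustration only.

```python
from pathlib import Path
import tempfile

# Relative paths mirroring the tutorial's file structure.
relative_names = [
    "sub_directory/file_group_a_0.data",
    "sub_directory/file_group_a_1.data",
    "sub_directory/file_group_b_2.data",
    "sub_directory/corresponding_file_0.data",
    "metadata.meta",
]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub_directory").mkdir()
    for name in relative_names:
        (root / name).touch()

    # Sort the relative POSIX-style paths so the order no longer
    # depends on the file system.
    found = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.*"))

for name in found:
    print(name)
```

Sorting matters whenever row positions are used later on, for example when files from two listings are paired by position.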
The Files class¶
lc.Files?
Identify files¶
files = lc.Files.from_glob(
directory=directory,
pattern="**/*.*"
)
files.df
| | file_path |
|---|---|
| 0 | /tmp/tmpuz6fwhw5/metadata.meta |
| 1 | /tmp/tmpuz6fwhw5/sub_directory/file_group_b_2.... |
| 2 | /tmp/tmpuz6fwhw5/sub_directory/file_group_a_1.... |
| 3 | /tmp/tmpuz6fwhw5/sub_directory/file_group_a_0.... |
| 4 | /tmp/tmpuz6fwhw5/sub_directory/corresponding_f... |
Each entry is stored as a Path object, so path attributes such as name are available:
files.df.map(lambda x: x.name)
| | file_path |
|---|---|
| 0 | metadata.meta |
| 1 | file_group_b_2.data |
| 2 | file_group_a_1.data |
| 3 | file_group_a_0.data |
| 4 | corresponding_file_0.data |
files.print_summary()
Number of files: 5
Base directory: /tmp/tmpuz6fwhw5
Columns: ['file_path']
Describe: file_path
count 5
unique 5
Exclude files¶
files = lc.Files.from_glob(
directory=directory,
pattern="**/*.*"
)
files.df.map(lambda x: x.name)
| | file_path |
|---|---|
| 0 | metadata.meta |
| 1 | file_group_b_2.data |
| 2 | file_group_a_1.data |
| 3 | file_group_a_0.data |
| 4 | corresponding_file_0.data |
stoplist = lc.Files.concatenate([
lc.Files.from_glob(directory=files.directory, pattern="**/*.meta"),
lc.Files.from_glob(directory=files.directory, pattern="**/*group_b*.*")
])
stoplist.df.map(lambda x: x.name)
| | file_path |
|---|---|
| 0 | metadata.meta |
| 1 | file_group_b_2.data |
files.exclude(stoplist=stoplist)
files.df.map(lambda x: x.name)
| | file_path |
|---|---|
| 2 | file_group_a_1.data |
| 3 | file_group_a_0.data |
| 4 | corresponding_file_0.data |
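Conceptually, excluding a stoplist amounts to removing every path that also appears in a second collection. The sketch below illustrates that idea with the standard library only. It is not locan's implementation, and matching on file names with fnmatch is an assumption made for this illustration.

```python
from fnmatch import fnmatch
from pathlib import Path

paths = [
    Path("metadata.meta"),
    Path("sub_directory/file_group_b_2.data"),
    Path("sub_directory/file_group_a_1.data"),
    Path("sub_directory/file_group_a_0.data"),
    Path("sub_directory/corresponding_file_0.data"),
]

# Patterns that define the stoplist (applied to the file name only).
stop_patterns = ["*.meta", "*group_b*.*"]

stoplist = {
    p for p in paths
    if any(fnmatch(p.name, pattern) for pattern in stop_patterns)
}

# Keep every path that is not on the stoplist, preserving the original order.
kept = [p for p in paths if p not in stoplist]

print([p.name for p in kept])
```

Building the stoplist as a separate collection, as the tutorial does, keeps the selection rules in one place so they can be inspected before anything is removed.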
Match corresponding files¶
files = lc.Files.from_glob(
directory=directory,
pattern="**/*.*",
regex="group_a_0"
)
files.df.map(lambda x: x.name)
| | file_path |
|---|---|
| 0 | file_group_a_0.data |
corresponding_files = lc.Files.from_glob(
directory=directory,
pattern="**/*.*",
regex="corresponding"
)
corresponding_files.df.map(lambda x: x.name)
| | file_path |
|---|---|
| 0 | corresponding_file_0.data |
files.match_files(files=corresponding_files.df)
files.df.map(lambda x: x.name)
| | file_path | other_file_path |
|---|---|---|
| 0 | file_group_a_0.data | corresponding_file_0.data |
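The example above pairs a single file with a single corresponding file, so it does not show how larger sets are aligned. When several files need pairing, one option is to derive an explicit key from each file name and align on that key before matching. The sketch below does this with the standard library only. It is not part of locan, the helper `trailing_index` is hypothetical, and `corresponding_file_1.data` is an invented name that does not exist in the tutorial's directory.

```python
import re
from pathlib import Path

data_files = [Path("file_group_a_0.data"), Path("file_group_a_1.data")]
# Deliberately listed in a different order than data_files.
other_files = [Path("corresponding_file_1.data"), Path("corresponding_file_0.data")]


def trailing_index(path):
    """Return the integer at the end of the file stem, or None if absent."""
    match = re.search(r"_(\d+)$", path.stem)
    return int(match.group(1)) if match else None


# Look up the corresponding file by its trailing index.
others_by_index = {trailing_index(p): p for p in other_files}

# Files without a partner are paired with None.
pairs = [(p, others_by_index.get(trailing_index(p))) for p in data_files]

for data_file, other_file in pairs:
    print(data_file.name, "->", other_file.name if other_file else None)
```

Pairing on an explicit key avoids relying on the order in which the two listings happen to be returned.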
Match metadata files¶
files = lc.Files.from_glob(
directory=directory,
pattern="**/*.*",
regex="group_a_0"
)
files.df.map(lambda x: x.name)
| | file_path |
|---|---|
| 0 | file_group_a_0.data |
files.match_file_upstream(pattern="*.meta")
files.df.map(lambda x: x.name)
| | file_path | metadata |
|---|---|---|
| 0 | file_group_a_0.data | metadata.meta |
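An upstream search of this kind can be pictured as walking from the directory of the data file towards a top directory and returning the first file that matches the pattern. The sketch below shows that idea with the standard library only. It is an illustration under that assumption, not locan's implementation, and the function `find_upstream` and its parameters are hypothetical.

```python
from pathlib import Path
import tempfile


def find_upstream(start_dir, pattern, top_dir):
    """Return the first file matching `pattern` in start_dir or its parents.

    The search stops at top_dir (inclusive) and returns None without a match.
    """
    current = start_dir
    while True:
        hits = sorted(current.glob(pattern))
        if hits:
            return hits[0]
        # Stop at the top directory or at the file system root.
        if current == top_dir or current == current.parent:
            return None
        current = current.parent


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    data_file = root / "sub_directory" / "file_group_a_0.data"
    data_file.parent.mkdir()
    data_file.touch()
    (root / "metadata.meta").touch()

    found = find_upstream(data_file.parent, "*.meta", top_dir=root)
    found_name = found.name if found else None

print(found_name)
```

Limiting the walk to a top directory keeps the search from picking up unrelated files outside the experiment's folder.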
Group files¶
files = lc.Files.from_glob(
directory=directory,
pattern="**/file*.data"
)
files.df.map(lambda x: x.name)
| | file_path |
|---|---|
| 0 | file_group_b_2.data |
| 1 | file_group_a_1.data |
| 2 | file_group_a_0.data |
files.set_group_identifier(name="A", pattern="group_a")
files.df
| | file_path | group |
|---|---|---|
| 0 | /tmp/tmpuz6fwhw5/sub_directory/file_group_b_2.... | NaN |
| 1 | /tmp/tmpuz6fwhw5/sub_directory/file_group_a_1.... | A |
| 2 | /tmp/tmpuz6fwhw5/sub_directory/file_group_a_0.... | A |
files.set_group_identifier(name="B", pattern="group_b")
files.df
| | file_path | group |
|---|---|---|
| 0 | /tmp/tmpuz6fwhw5/sub_directory/file_group_b_2.... | B |
| 1 | /tmp/tmpuz6fwhw5/sub_directory/file_group_a_1.... | A |
| 2 | /tmp/tmpuz6fwhw5/sub_directory/file_group_a_0.... | A |
files.group_identifiers()
Index(['A', 'B'], dtype='object')
grouped = files.grouped()
grouped.groups
{'A': [1, 2], 'B': [0]}
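The grouping above can be thought of as assigning a label to every file whose name matches a pattern and then collecting row indices per label. The sketch below reproduces that logic with the standard library only. It is not locan's implementation, and treating the patterns as regular expressions is an assumption made for this illustration.

```python
import re
from collections import defaultdict

# File names in the same row order as the table above.
names = ["file_group_b_2.data", "file_group_a_1.data", "file_group_a_0.data"]

# Group label -> pattern searched for in the file name.
identifiers = {"A": r"group_a", "B": r"group_b"}

groups = defaultdict(list)
for index, name in enumerate(names):
    for label, pattern in identifiers.items():
        if re.search(pattern, name):
            groups[label].append(index)
            break  # each file belongs to at most one group

for label in sorted(groups):
    print(label, groups[label])
```

Files that match no pattern stay out of every group, which corresponds to the NaN entry shown before the second identifier was set.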
Indexing and iterating over files¶
files = lc.Files.from_glob(
directory=directory,
pattern="**/file*.data"
)
files.df.map(lambda x: x.name)
| | file_path |
|---|---|
| 0 | file_group_b_2.data |
| 1 | file_group_a_1.data |
| 2 | file_group_a_0.data |
Slicing Files yields a new Files instance:
files[0:3]
<locan.locan_io.files.Files at 0x7d68672a1310>
Indexing Files yields a Series with the selected row:
files[0]
file_path /tmp/tmpuz6fwhw5/sub_directory/file_group_b_2....
Name: 0, dtype: object
Iterating over Files yields a namedtuple for each row:
for file in files:
    print(file)
    print(file.file_path)
Files(Index=0, file_path=PosixPath('/tmp/tmpuz6fwhw5/sub_directory/file_group_b_2.data'))
/tmp/tmpuz6fwhw5/sub_directory/file_group_b_2.data
Files(Index=1, file_path=PosixPath('/tmp/tmpuz6fwhw5/sub_directory/file_group_a_1.data'))
/tmp/tmpuz6fwhw5/sub_directory/file_group_a_1.data
Files(Index=2, file_path=PosixPath('/tmp/tmpuz6fwhw5/sub_directory/file_group_a_0.data'))
/tmp/tmpuz6fwhw5/sub_directory/file_group_a_0.data