Tutorial about managing files in batch processing#
When analysing a set of experiments, you often need to collect, match, and group files according to their content and the experimental conditions.
The Files class in locan supports each of these steps.
from pathlib import Path
import tempfile
%matplotlib inline
import matplotlib.pyplot as plt
import locan as lc
lc.show_versions(system=False, dependencies=False, verbose=False)
Locan:
version: 0.20.0.dev41+g755b969
Python:
version: 3.11.6
Some file structure to be analysed#
directory = Path(tempfile.mkdtemp())
subdirectory = directory.joinpath("sub_directory")
subdirectory.mkdir()
files = [
    directory / "sub_directory" / "file_group_a_0.data",
    directory / "sub_directory" / "file_group_a_1.data",
    directory / "sub_directory" / "file_group_b_2.data",
    directory / "sub_directory" / "corresponding_file_0.data",
    directory / "metadata.meta",
]
for file_ in files:
    file_.touch()
list(directory.glob("**/*.*"))
[PosixPath('/tmp/tmp_hvjxqz3/metadata.meta'),
PosixPath('/tmp/tmp_hvjxqz3/sub_directory/file_group_b_2.data'),
PosixPath('/tmp/tmp_hvjxqz3/sub_directory/file_group_a_1.data'),
PosixPath('/tmp/tmp_hvjxqz3/sub_directory/file_group_a_0.data'),
PosixPath('/tmp/tmp_hvjxqz3/sub_directory/corresponding_file_0.data')]
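A directory created with tempfile.mkdtemp() is not removed automatically, and Path.glob makes no promise about the order of its results (the listing above is not sorted). The following is a minimal sketch of the same setup using only the standard library, with automatic cleanup and a sorted listing. The helper name build_example_tree is hypothetical and not part of locan.

```python
from pathlib import Path
import tempfile


def build_example_tree(root: Path) -> list[Path]:
    """Create the example layout below `root` and return the created paths."""
    subdirectory = root / "sub_directory"
    subdirectory.mkdir()
    paths = [
        subdirectory / "file_group_a_0.data",
        subdirectory / "file_group_a_1.data",
        subdirectory / "file_group_b_2.data",
        subdirectory / "corresponding_file_0.data",
        root / "metadata.meta",
    ]
    for path in paths:
        path.touch()
    return paths


# The directory and its contents are deleted when the with-block exits.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build_example_tree(root)
    # Sort the relative paths to get a reproducible listing.
    names = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.*"))
    print(names)
```

For the rest of this tutorial the mkdtemp() variant is kept, because the directory has to outlive a single cell.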
The Files class#
lc.Files?
Identify files#
files = lc.Files.from_glob(
    directory=directory,
    pattern="**/*.*"
)
files.df
| | file_path |
|---|---|
| 0 | /tmp/tmp_hvjxqz3/metadata.meta |
| 1 | /tmp/tmp_hvjxqz3/sub_directory/file_group_b_2.... |
| 2 | /tmp/tmp_hvjxqz3/sub_directory/file_group_a_1.... |
| 3 | /tmp/tmp_hvjxqz3/sub_directory/file_group_a_0.... |
| 4 | /tmp/tmp_hvjxqz3/sub_directory/corresponding_f... |
For each file a Path object is stored:
# DataFrame.map replaces the deprecated DataFrame.applymap (pandas >= 2.1)
files.df.map(lambda x: x.name)
| | file_path |
|---|---|
| 0 | metadata.meta |
| 1 | file_group_b_2.data |
| 2 | file_group_a_1.data |
| 3 | file_group_a_0.data |
| 4 | corresponding_file_0.data |
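Because each entry is a Path object, its components are available as attributes without any string parsing. A short sketch with the standard library, using one of the paths listed above (PurePosixPath is chosen here only so that the example behaves the same on every platform):

```python
from pathlib import PurePosixPath

# PurePosixPath parses the string without touching the file system.
path = PurePosixPath("/tmp/tmp_hvjxqz3/sub_directory/file_group_a_0.data")

print(path.name)         # file name with extension
print(path.stem)         # file name without extension
print(path.suffix)       # extension including the dot
print(path.parent.name)  # name of the containing directory
```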
files.print_summary()
Number of files: 5
Base directory: /tmp/tmp_hvjxqz3
Columns: Index(['file_path'], dtype='object')
file_path
count 5
unique 5
Exclude files#
files = lc.Files.from_glob(
    directory=directory,
    pattern="**/*.*"
)
files.df.map(lambda x: x.name)
| | file_path |
|---|---|
| 0 | metadata.meta |
| 1 | file_group_b_2.data |
| 2 | file_group_a_1.data |
| 3 | file_group_a_0.data |
| 4 | corresponding_file_0.data |
stoplist = lc.Files.concatenate([
    lc.Files.from_glob(directory=files.directory, pattern="**/*.meta"),
    lc.Files.from_glob(directory=files.directory, pattern="**/*group_b*.*")
])
stoplist.df.map(lambda x: x.name)
| | file_path |
|---|---|
| 0 | metadata.meta |
| 1 | file_group_b_2.data |
files.exclude(stoplist=stoplist)
files.df.map(lambda x: x.name)
| | file_path |
|---|---|
| 2 | file_group_a_1.data |
| 3 | file_group_a_0.data |
| 4 | corresponding_file_0.data |
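Conceptually, excluding a stoplist removes every path that also appears in a second collection. The sketch below illustrates that idea with plain pathlib; it is not locan's implementation, and the function name exclude is only borrowed for readability.

```python
from pathlib import Path
import tempfile


def exclude(paths: list[Path], stoplist: list[Path]) -> list[Path]:
    """Return the entries of `paths` that do not appear in `stoplist`."""
    blocked = set(stoplist)
    return [path for path in paths if path not in blocked]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub_directory").mkdir()
    for name in (
        "metadata.meta",
        "sub_directory/file_group_a_0.data",
        "sub_directory/file_group_b_2.data",
    ):
        (root / name).touch()

    all_files = sorted(root.glob("**/*.*"))
    # Same two stoplist patterns as in the cells above.
    stoplist = list(root.glob("**/*.meta")) + list(root.glob("**/*group_b*.*"))
    kept = [path.name for path in exclude(all_files, stoplist)]
    print(kept)
```

In the Files output above, the remaining rows keep their original index labels (2, 3, 4) rather than being renumbered.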
Match corresponding files#
files = lc.Files.from_glob(
    directory=directory,
    pattern="**/*.*",
    regex="group_a_0"
)
files.df.map(lambda x: x.name)
| | file_path |
|---|---|
| 0 | file_group_a_0.data |
corresponding_files = lc.Files.from_glob(
    directory=directory,
    pattern="**/*.*",
    regex="corresponding"
)
corresponding_files.df.map(lambda x: x.name)
| | file_path |
|---|---|
| 0 | corresponding_file_0.data |
files.match_files(files=corresponding_files.df)
files.df.map(lambda x: x.name)
| | file_path | other_file_path |
|---|---|---|
| 0 | file_group_a_0.data | corresponding_file_0.data |
Match metadata files#
files = lc.Files.from_glob(
    directory=directory,
    pattern="**/*.*",
    regex="group_a_0"
)
files.df.map(lambda x: x.name)
| | file_path |
|---|---|
| 0 | file_group_a_0.data |
files.match_file_upstream(pattern="*.meta")
files.df.map(lambda x: x.name)
| | file_path | metadata |
|---|---|---|
| 0 | file_group_a_0.data | metadata.meta |
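In the result above, the data file in sub_directory is paired with metadata.meta from its parent directory. One way to picture such an upstream search is to walk the parent directories from the nearest outwards. The sketch below uses only pathlib; the helper find_upstream, its stop_at parameter, and the rule of taking the first sorted match are assumptions made for illustration and may differ from what match_file_upstream does.

```python
from pathlib import Path
import tempfile
from typing import Optional


def find_upstream(path: Path, pattern: str, stop_at: Path) -> Optional[Path]:
    """Return the first match for `pattern` in the parents of `path`, nearest first."""
    for parent in path.parents:
        matches = sorted(parent.glob(pattern))
        if matches:
            return matches[0]
        if parent == stop_at:
            break  # do not search above the base directory
    return None


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    data_file = root / "sub_directory" / "file_group_a_0.data"
    data_file.parent.mkdir()
    data_file.touch()
    (root / "metadata.meta").touch()

    found = find_upstream(data_file, "*.meta", stop_at=root)
    missing = find_upstream(data_file, "*.txt", stop_at=root)
    print(found.name if found else None, missing)
```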
Group files#
files = lc.Files.from_glob(
    directory=directory,
    pattern="**/file*.data"
)
files.df.map(lambda x: x.name)
| | file_path |
|---|---|
| 0 | file_group_b_2.data |
| 1 | file_group_a_1.data |
| 2 | file_group_a_0.data |
files.set_group_identifier(name="A", pattern="group_a")
files.df
| | file_path | group |
|---|---|---|
| 0 | /tmp/tmp_hvjxqz3/sub_directory/file_group_b_2.... | NaN |
| 1 | /tmp/tmp_hvjxqz3/sub_directory/file_group_a_1.... | A |
| 2 | /tmp/tmp_hvjxqz3/sub_directory/file_group_a_0.... | A |
files.set_group_identifier(name="B", pattern="group_b")
files.df
| | file_path | group |
|---|---|---|
| 0 | /tmp/tmp_hvjxqz3/sub_directory/file_group_b_2.... | B |
| 1 | /tmp/tmp_hvjxqz3/sub_directory/file_group_a_1.... | A |
| 2 | /tmp/tmp_hvjxqz3/sub_directory/file_group_a_0.... | A |
files.group_identifiers()
Index(['A', 'B'], dtype='object')
grouped = files.grouped()
grouped.groups
{'A': [1, 2], 'B': [0]}
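The grouping step can be pictured as assigning each file the label of the first pattern found in its name and then collecting row indices per label. The following sketch reproduces the mapping shown above with the standard library only. Treating the patterns as regular expressions searched in the file name is an assumption, and group_by_pattern is a hypothetical helper, not a locan function.

```python
import re
from collections import defaultdict

names = ["file_group_b_2.data", "file_group_a_1.data", "file_group_a_0.data"]
patterns = {"A": "group_a", "B": "group_b"}


def group_by_pattern(names: list[str], patterns: dict[str, str]) -> dict[str, list[int]]:
    """Map each group label to the indices of the names matching its pattern."""
    groups = defaultdict(list)
    for index, name in enumerate(names):
        for label, pattern in patterns.items():
            if re.search(pattern, name):
                groups[label].append(index)
                break  # the first matching label wins
    return dict(groups)


groups = group_by_pattern(names, patterns)
print(groups)
```

Names that match no pattern are left out of the result, which corresponds to the NaN entry seen after the first set_group_identifier call.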
Indexing and iterating over files#
files = lc.Files.from_glob(
    directory=directory,
    pattern="**/file*.data"
)
files.df.map(lambda x: x.name)
| | file_path |
|---|---|
| 0 | file_group_b_2.data |
| 1 | file_group_a_1.data |
| 2 | file_group_a_0.data |
Slicing Files yields a new Files instance:
files[0:3]
<locan.locan_io.files.Files at 0x7f740224e510>
Indexing Files yields a Series with the selected row:
files[0]
file_path /tmp/tmp_hvjxqz3/sub_directory/file_group_b_2....
Name: 0, dtype: object
Iterating over Files yields a namedtuple for each row:
for file in files:
    print(file)
    print(file.file_path)
Files(Index=0, file_path=PosixPath('/tmp/tmp_hvjxqz3/sub_directory/file_group_b_2.data'))
/tmp/tmp_hvjxqz3/sub_directory/file_group_b_2.data
Files(Index=1, file_path=PosixPath('/tmp/tmp_hvjxqz3/sub_directory/file_group_a_1.data'))
/tmp/tmp_hvjxqz3/sub_directory/file_group_a_1.data
Files(Index=2, file_path=PosixPath('/tmp/tmp_hvjxqz3/sub_directory/file_group_a_0.data'))
/tmp/tmp_hvjxqz3/sub_directory/file_group_a_0.data
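Since every row arrives as a namedtuple, a batch loop can read the index and the path by attribute name and collect one result per file. The sketch below imitates the rows printed above with a stand-in namedtuple so that it runs without locan; analyse is a hypothetical placeholder for a real per-file analysis.

```python
from collections import namedtuple
from pathlib import PurePosixPath

# Stand-in for the rows printed above; field names mirror that output.
Row = namedtuple("Files", ["Index", "file_path"])
rows = [
    Row(0, PurePosixPath("sub_directory/file_group_b_2.data")),
    Row(1, PurePosixPath("sub_directory/file_group_a_1.data")),
    Row(2, PurePosixPath("sub_directory/file_group_a_0.data")),
]


def analyse(path: PurePosixPath) -> str:
    """Hypothetical per-file step; a real analysis would load and process the data."""
    return path.stem


# Collect one result per row, keyed by the row index.
results = {row.Index: analyse(row.file_path) for row in rows}
print(results)
```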