locan.locan_io.files.Files

class locan.locan_io.files.Files(df=None, directory=None, exists=True, column='file_path')

Bases: object

Wrapper for a pandas.DataFrame with selected methods to identify, match, and group file paths.

Note

Iteration and indexing are implemented so that integer indexing or iterating over a Files instance returns a single row (as a Series or namedtuple). Slice indexing returns a new Files instance with the selected rows.
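The indexing contract in the note above can be illustrated with a minimal stand-in. This is a sketch only, not locan's implementation: `FileList` and `Row` are hypothetical names, and a plain list replaces the underlying DataFrame.

```python
from collections import namedtuple

# Hypothetical row type standing in for the Series/namedtuple a row is returned as.
Row = namedtuple("Row", ["file_path"])


class FileList:
    """Hypothetical container that mimics the indexing contract described above."""

    def __init__(self, paths):
        self._paths = list(paths)

    def __len__(self):
        return len(self._paths)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # A slice returns a new container of the same type.
            return FileList(self._paths[index])
        # An integer returns a single row.
        return Row(file_path=self._paths[index])

    def __iter__(self):
        # Iteration yields one row at a time.
        return (Row(file_path=p) for p in self._paths)


files = FileList(["a.txt", "b.txt", "c.txt"])
first = files[0]      # single row
subset = files[0:2]   # new FileList with two rows
rows = list(files)    # three rows
```

The point of the design is that a slice stays in the wrapper type, so wrapper methods remain available on the selection, while integer access drops down to the row itself.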

Parameters:

df (pd.DataFrame | dict[str, str] | None) – file names
directory (str | os.PathLike[Any] | None) – base directory
exists (bool) – whether to raise FileExistsError if a file in df does not exist
column (str) – key/column in df from which to take a file list

Variables:

df (pd.DataFrame) – dataframe carrying file paths
directory (Path) – base directory

Methods

`__init__`([df, directory, exists, column])
`add_glob`([pattern, regex, column])	Search for file paths using glob and/or regex pattern in base directory and provide files in new column.
`concatenate`([files, directory, exists])	Concatenate the file lists from multiple Files instances and set the base directory without further action.
`exclude`([stoplist, column, column_stoplist])	Exclude files in self.df.column according to stoplist.
`from_glob`([directory, pattern, regex, column])	Instantiate Files from a search with glob and/or regex patterns.
`from_path`([files, directory, column])	Instantiate Files from a collection of file paths.
`group_identifiers`()	Get categories defined in self.df.group.
`grouped`()	Get groupby instance based on group_identifiers.
`match_file_upstream`([column, pattern, ...])	Find a matching file by applying `locan.find_file_upstream()` on each file in self.df[column].
`match_files`(files[, column, other_column])	Add files in new column.
`print_summary`()	Print summary of Files.
`set_group_identifier`([name, pattern, glob, ...])	Set group_identifier name for files in column as identified by string pattern and/or glob pattern and/or regex and keep them in column "group".

add_glob(pattern='*.txt', regex=None, column='other_file_path')

Search for file paths using glob and/or regex pattern in base directory and provide files in new column.

A logging.warning is issued if the number of files found differs from the number of files in self.df.

Parameters:

pattern (Optional[str]) – glob pattern passed to Path.glob()
regex (Optional[str]) – regex pattern passed to re.search() and applied in addition to glob pattern
column (str) – Name of column in Files.df carrying these files

Return type:

Self
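The combination of a glob pattern with an additional regex filter, as described for add_glob, can be sketched with the standard library. This is not locan's code: `glob_with_regex` is a hypothetical helper, and matching the regex against the file name only (rather than the full path) is an assumption made for this example.

```python
import re
import tempfile
from pathlib import Path


def glob_with_regex(directory, pattern="*.txt", regex=None):
    """Return sorted paths in `directory` that match the glob and, if given, the regex."""
    paths = sorted(Path(directory).glob(pattern))
    if regex is not None:
        compiled = re.compile(regex)
        # Assumption: the regex is searched in the file name, in addition to the glob.
        paths = [p for p in paths if compiled.search(p.name)]
    return paths


with tempfile.TemporaryDirectory() as tmp:
    for name in ("cell_1.txt", "cell_2.txt", "notes.txt", "cell_3.csv"):
        (Path(tmp) / name).touch()
    # The glob keeps the three .txt files; the regex then drops notes.txt.
    found = [p.name for p in glob_with_regex(tmp, pattern="*.txt", regex=r"cell_\d+")]
```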

classmethod concatenate(files=None, directory=None, exists=True)

Concatenate the file lists from multiple Files instances and set the base directory without further action.

Parameters:

files (Optional[Iterable[Files]]) – sequence of Files instances
directory (UnionType[str, PathLike[Any], None]) – new base directory
exists (bool) – whether to raise FileExistsError if a file in files does not exist

Return type:

Files

exclude(stoplist=None, column='file_path', column_stoplist='file_path')

Exclude files in self.df.column according to stoplist.

Parameters:

stoplist (UnionType[Files, Iterable[bool | str | PathLike[Any]], None]) – Files to be excluded
column (str) – key/column in df from which to exclude files
column_stoplist (str) – key/column in stoplist from which to take files

Return type:

Self

classmethod from_glob(directory=None, pattern='*.txt', regex=None, column='file_path')

Instantiate Files from a search with glob and/or regex patterns.

Parameters:

pattern (str) – glob pattern passed to Path.glob()
regex (Optional[str]) – regex pattern passed to re.search() and applied in addition to glob pattern
directory (UnionType[str, PathLike[Any], None]) – new base directory in which to search
column (str) – Name of column in Files.df carrying these files

Return type:

Files

classmethod from_path(files=None, directory=None, column='file_path')

Instantiate Files from a collection of file paths.

Parameters:

files (UnionType[Sequence[str | PathLike[Any]], str, PathLike[Any], None]) – a single file path or a sequence of file paths
directory (UnionType[str, PathLike[Any], None]) – new base directory
column (str) – Name of column in Files.df carrying these files

Return type:

Files

group_identifiers()

Get categories defined in self.df.group.

Return type:

categories

grouped()

Get groupby instance based on group_identifiers.

Return type:

pandas.core.groupby.DataFrameGroupBy

match_file_upstream(column='file_path', pattern='*.toml', regex=None, directory=None, other_column='metadata')

Find a matching file by applying locan.find_file_upstream() on each file in self.df[column].

Parameters:

column (str) – Name of column in Files.df carrying files to match
pattern (Optional[str]) – glob pattern passed to Path.glob()
regex (Optional[str]) – regex pattern passed to re.search() and applied in addition to glob pattern
directory (UnionType[str, PathLike[Any], None]) – top directory in which to search
other_column (str) – Name of new column carrying files

Return type:

Self
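The idea of searching "upstream" for a matching file can be sketched as a walk from the file's own directory through its parents until a glob match is found or a top directory is reached. This is an illustrative assumption about the behavior, not the implementation of locan.find_file_upstream(); `find_upstream` and its `top` parameter are hypothetical names.

```python
import tempfile
from pathlib import Path


def find_upstream(start, pattern="*.toml", top=None):
    """Return the first file matching `pattern` in start's directory or a parent up to `top`."""
    current = Path(start).resolve().parent
    top = Path(top).resolve() if top is not None else None
    while True:
        matches = sorted(current.glob(pattern))
        if matches:
            return matches[0]
        if current == top or current.parent == current:
            # Reached the top directory or the filesystem root without a match.
            return None
        current = current.parent


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "experiment" / "run_1").mkdir(parents=True)
    (root / "experiment" / "metadata.toml").touch()
    data_file = root / "experiment" / "run_1" / "locs.txt"
    data_file.touch()

    # metadata.toml sits one directory above the data file.
    match = find_upstream(data_file, pattern="*.toml", top=root)
    match_name = match.name
    # No .yaml file exists anywhere between the data file and the top directory.
    missing = find_upstream(data_file, pattern="*.yaml", top=root)
```

Applied to every entry of a column, such a search would pair each data file with the nearest metadata file above it, which is what the other_column default of 'metadata' suggests.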

match_files(files, column='file_path', other_column='other_file_path')

Add files in new column.

A logging.warning is issued if the number of files differs from the number of files in self.df.

Parameters:

files – New file list
column – Name of column in Files.df carrying files to match
other_column – Name of new column carrying files

Return type:

Self

print_summary()

Print summary of Files.

Return type:

None

set_group_identifier(name=None, pattern=None, glob=None, regex=None, column='file_path')

Set group_identifier name for files in column as identified by string pattern and/or glob pattern and/or regex and keep them in column “group”.

Parameters:

name (Optional[str]) – new group_identifier
pattern (Optional[str]) – string pattern
glob (Optional[str]) – glob pattern passed to Path.match()
regex (Optional[str]) – regex pattern
column (str) – Name of column in Files.df carrying files to match

Return type:

Self
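How a group name might be assigned from a string pattern, a glob, and a regex can be sketched as follows. This is not locan's implementation: `assign_group` is a hypothetical helper, a dict replaces the "group" column, and requiring every supplied criterion to match (a logical AND) is an assumption, since the description says "and/or".

```python
import re
from pathlib import PurePosixPath


def assign_group(paths, name, pattern=None, glob=None, regex=None, groups=None):
    """Return a path -> group mapping in which matching paths are labeled `name`."""
    groups = dict(groups or {})
    for path in paths:
        text = str(path)
        if pattern is not None and pattern not in text:
            continue  # plain substring test
        if glob is not None and not PurePosixPath(text).match(glob):
            continue  # glob test, analogous to Path.match()
        if regex is not None and re.search(regex, text) is None:
            continue  # regex test
        groups[text] = name
    return groups


paths = ["data/control_1.txt", "data/treated_1.txt", "data/control_2.csv"]
# control_2.csv contains "control" but fails the *.txt glob, so it stays unlabeled.
groups = assign_group(paths, "control", pattern="control", glob="*.txt")
groups = assign_group(paths, "treated", regex=r"treated_\d+", groups=groups)
```

Calling the helper repeatedly with different names builds up the labels step by step, which mirrors how repeated set_group_identifier calls would fill the "group" column before grouped() is used.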