locan.locan_io.files.Files

class locan.locan_io.files.Files(df=None, directory=None, exists=True, column='file_path')

Bases: object

Wrapper for a pandas.DataFrame with selected methods to identify, match, and group file paths.

Note

Iteration and indexing are implemented such that integer indexing or iterating over a Files instance returns a single row (as a Series or namedtuple). Slice indexing returns a new Files instance with the selected rows.
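The indexing behaviour described in the note can be pictured with a small stand-alone container. This is only an illustrative sketch: `PathTable` is a hypothetical class, not part of locan, and it keeps plain paths in a list rather than rows in a DataFrame.

```python
from pathlib import Path


class PathTable:
    """Hypothetical stand-in for a row container (not locan's Files)."""

    def __init__(self, paths):
        self.paths = [Path(p) for p in paths]

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, key):
        if isinstance(key, slice):
            # Slice indexing returns a new container with the selected rows.
            return PathTable(self.paths[key])
        # Integer indexing returns a single row.
        return self.paths[key]

    def __iter__(self):
        # Iteration yields single rows, one at a time.
        return iter(self.paths)


table = PathTable(["a.txt", "b.txt", "c.txt"])
first = table[0]      # a single row
subset = table[0:2]   # a new PathTable holding two rows
```

Returning the same container type from a slice keeps the wrapper's methods available on any subset of rows.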

Parameters:
  • df (pd.DataFrame | dict[str, str] | None) – file names

  • directory (str | os.PathLike[Any] | None) – base directory

  • exists (bool) – if True, raise FileExistsError if a file in df does not exist

  • column (str) – key/column in df from which to take a file list

Variables:
  • df (pd.DataFrame) – dataframe carrying file paths

  • directory (Path) – base directory

Methods

__init__([df, directory, exists, column])

add_glob([pattern, regex, column])

Search for file paths using glob and/or regex pattern in base directory and provide files in new column.

concatenate([files, directory, exists])

Concatenate the file lists from multiple Files instances and set the base directory without further action.

exclude([stoplist, column, column_stoplist])

Exclude files in self.df.column according to stoplist.

from_glob([directory, pattern, regex, column])

Instantiate Files from a search with glob and/or regex patterns.

from_path([files, directory, column])

Instantiate Files from a collection of file paths.

group_identifiers()

Get categories defined in self.df.group.

grouped()

Get groupby instance based on group_identifiers.

match_file_upstream([column, pattern, ...])

Find a matching file by applying locan.find_file_upstream() on each file in self.df[column].

match_files(files[, column, other_column])

Add files in new column.

print_summary()

Print summary of Files.

set_group_identifier([name, pattern, glob, ...])

Assign the group identifier name to the files in column that match a string pattern, a glob pattern, and/or a regex, and store it in column "group".

add_glob(pattern='*.txt', regex=None, column='other_file_path')

Search for file paths using glob and/or regex pattern in base directory and provide files in new column.

A warning is logged (logging.warning) if the number of files found differs from the number of files in self.df.

Parameters:
  • pattern (Optional[str]) – glob pattern passed to Path.glob()

  • regex (Optional[str]) – regex pattern passed to re.search() and applied in addition to glob pattern

  • column (str) – Name of column in Files.df carrying these files

Return type:

Self

classmethod concatenate(files=None, directory=None, exists=True)

Concatenate the file lists from multiple Files instances and set the base directory without further action.

Parameters:
  • files (Optional[Iterable[Files]]) – sequence of Files instances

  • directory (UnionType[str, PathLike[Any], None]) – new base directory

  • exists (bool) – if True, raise FileExistsError if a file in files does not exist

Return type:

Files

exclude(stoplist=None, column='file_path', column_stoplist='file_path')

Exclude files in self.df.column according to stoplist.

Parameters:
  • stoplist (UnionType[Files, Iterable[bool | str | PathLike[Any]], None]) – Files to be excluded

  • column (str) – key/column in df from which to exclude files

  • column_stoplist (str) – key/column in stoplist from which to take files

Return type:

Self
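The idea of filtering against a stoplist can be sketched with the standard library alone. `exclude_paths` is a hypothetical helper, not locan's implementation, and it assumes that a file is excluded when its path compares equal to an entry in the stoplist.

```python
from pathlib import Path


def exclude_paths(paths, stoplist):
    """Hypothetical helper: drop every path that appears in `stoplist`."""
    stop = {Path(p) for p in stoplist}
    return [Path(p) for p in paths if Path(p) not in stop]


kept = exclude_paths(
    ["data/a.txt", "data/b.txt", "data/c.txt"],
    stoplist=["data/b.txt"],
)
```

Here `kept` holds the paths for `a.txt` and `c.txt`, in their original order.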

classmethod from_glob(directory=None, pattern='*.txt', regex=None, column='file_path')

Instantiate Files from a search with glob and/or regex patterns.

Parameters:
  • pattern (str) – glob pattern passed to Path.glob()

  • regex (Optional[str]) – regex pattern passed to re.search() and applied in addition to glob pattern

  • directory (UnionType[str, PathLike[Any], None]) – new base directory in which to search

  • column (str) – Name of column in Files.df carrying these files

Return type:

Files
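How a glob pattern and a regex combine can be sketched with `Path.glob()` and `re.search()`, the two calls named in the parameter descriptions. `search_files` is a hypothetical helper, not locan's code, and applying the regex to the full path string is an assumption.

```python
import re
import tempfile
from pathlib import Path


def search_files(directory, pattern="*.txt", regex=None):
    """Hypothetical helper: glob in `directory`, then filter by regex."""
    found = sorted(Path(directory).glob(pattern))
    if regex is not None:
        compiled = re.compile(regex)
        # Assumption: the regex is searched in the full path string.
        found = [p for p in found if compiled.search(str(p))]
    return found


with tempfile.TemporaryDirectory() as tmp:
    for name in ("cell_1.txt", "cell_2.txt", "background.txt", "notes.md"):
        (Path(tmp) / name).touch()
    # The glob keeps the three .txt files; the regex then drops background.txt.
    hits = search_files(tmp, pattern="*.txt", regex=r"cell_\d+\.txt$")
    names = [p.name for p in hits]
```

The glob narrows the search by file name pattern, and the regex acts as a second, stricter filter on what the glob returned.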

classmethod from_path(files=None, directory=None, column='file_path')

Instantiate Files from a collection of file paths.

Parameters:
  • files (UnionType[Sequence[str | PathLike[Any]], str, PathLike[Any], None]) – file path or sequence of file paths

  • directory (UnionType[str, PathLike[Any], None]) – new base directory

  • column (str) – Name of column in Files.df carrying these files

Return type:

Files

group_identifiers()

Get categories defined in self.df.group.

Return type:

categories

grouped()

Get groupby instance based on group_identifiers.

Return type:

pandas.core.groupby.DataFrameGroupBy

match_file_upstream(column='file_path', pattern='*.toml', regex=None, directory=None, other_column='metadata')

Find a matching file by applying locan.find_file_upstream() on each file in self.df[column].

Parameters:
  • column (str) – Name of column in Files.df carrying files to match

  • pattern (Optional[str]) – glob pattern passed to Path.glob()

  • regex (Optional[str]) – regex pattern passed to re.search() and applied in addition to glob pattern

  • directory (UnionType[str, PathLike[Any], None]) – top directory in which to search

  • other_column (str) – Name of new column carrying files

Return type:

Self
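An upstream search walks from a file's directory towards a top directory and returns the first match it meets. The sketch below illustrates that idea with the standard library only; `find_upstream` is a hypothetical helper and makes no claim about how locan.find_file_upstream() is implemented or what it does when nothing is found.

```python
import tempfile
from pathlib import Path


def find_upstream(start, pattern="*.toml", top=None):
    """Hypothetical helper: return the first glob match found while walking
    from `start` up through its parent directories, or None."""
    start = Path(start).resolve()
    top = Path(top).resolve() if top is not None else None
    for directory in (start, *start.parents):
        matches = sorted(directory.glob(pattern))
        if matches:
            return matches[0]
        if directory == top:
            break  # do not search above the top directory
    return None


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    data_dir = root / "experiment" / "day_1"
    data_dir.mkdir(parents=True)
    (root / "experiment" / "metadata.toml").touch()
    (data_dir / "localizations.txt").touch()

    # The metadata file sits one level above the data directory.
    match = find_upstream(data_dir, pattern="*.toml", top=root)
    relative = match.relative_to(root).as_posix()
    missing = find_upstream(data_dir, pattern="*.yaml", top=root)
```

Bounding the walk with a top directory keeps the search from picking up unrelated files further up the file system.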

match_files(files, column='file_path', other_column='other_file_path')

Add files in new column.

A warning is logged (logging.warning) if the number of files differs from the number of files in self.df.

Parameters:
  • files – New file list

  • column – Name of column in Files.df carrying files to match

  • other_column – Name of new column carrying files

Return type:

Self

print_summary()

Print summary of Files.

Return type:

None

set_group_identifier(name=None, pattern=None, glob=None, regex=None, column='file_path')

Assign the group identifier name to the files in column that match a string pattern, a glob pattern, and/or a regex, and store it in column “group”.

Parameters:
  • name (Optional[str]) – new group_identifier

  • pattern (Optional[str]) – string pattern

  • glob (Optional[str]) – glob pattern passed to Path.match()

  • regex (Optional[str]) – regex pattern

  • column (str) – Name of column in Files.df carrying files to match

Return type:

Self
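Labelling files by a glob pattern can be sketched with `Path.match()`, the call named for the glob parameter. This is an illustration with hypothetical file names and a plain dict in place of the "group" column; it does not show how locan combines several criteria.

```python
from pathlib import Path

paths = [
    Path("exp/control_1.txt"),
    Path("exp/control_2.txt"),
    Path("exp/treated_1.txt"),
]

groups = {}
for name, glob in (("control", "control_*.txt"), ("treated", "treated_*.txt")):
    for path in paths:
        # Path.match tests a relative glob against the right-hand end of the path.
        if path.match(glob):
            groups[path] = name
```

Each path ends up with exactly one label here because the two glob patterns do not overlap.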