piper.io.duplicate_files
piper.io.duplicate_files(source=None, glob_pattern='*.*', recurse=False, filesize=1, keep=False, xl_file=None)
Select files that have the same file size.
These files are assumed to be ‘duplicates’.
- Parameters
source – source directory, default None
glob_pattern – glob pattern used to filter file names, default ‘*.*’
recurse – if True, search the source directory recursively, default False
filesize – file size filter, default 1 (kb)
keep – {‘first’, ‘last’, False}, default False. Determines which duplicates (if any) to mark.
first: Mark duplicates as True except for the first occurrence.
last: Mark duplicates as True except for the last occurrence.
False: Mark all duplicates as True.
xl_file – path of an Excel workbook to write the results to, default None
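The keep options follow the semantics of pandas' duplicated method (linked under References). The short sketch below illustrates those semantics on a hypothetical Series of file sizes. It does not call duplicate_files, and the file names and sizes are made up for illustration.

```python
import pandas as pd

# Hypothetical file sizes in bytes; a.txt and c.txt share the size 2048.
sizes = pd.Series(
    [2048, 512, 2048, 4096],
    index=["a.txt", "b.txt", "c.txt", "d.txt"],
)

# keep="first": every repeat is flagged except the first occurrence.
first = sizes.duplicated(keep="first")

# keep="last": every repeat is flagged except the last occurrence.
last = sizes.duplicated(keep="last")

# keep=False: all members of a repeated group are flagged.
every = sizes.duplicated(keep=False)

print(pd.DataFrame({"size": sizes, "first": first, "last": last, "all": every}))
```

With keep=False both a.txt and c.txt are flagged, which is useful when every member of a same-size group should be reviewed.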
- Returns
- Return type
pd.DataFrame
Examples
from piper.io import duplicate_files
source = '/home/mike/Documents'
duplicate_files(source, glob_pattern='*.*', recurse=True, filesize=2000000, keep=False).query("duplicate == True")
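To make the "same file size" heuristic concrete, here is a minimal standard-library sketch of the idea. It is not piper's implementation: the helper name same_size_groups is hypothetical, it returns a plain dict rather than a DataFrame, and it omits the filesize, keep and xl_file options.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def same_size_groups(source, glob_pattern="*.*", recurse=False):
    """Group files under `source` by size; keep sizes shared by 2+ files."""
    root = Path(source)
    # rglob walks subdirectories, glob stays in the top-level directory.
    paths = root.rglob(glob_pattern) if recurse else root.glob(glob_pattern)

    by_size = defaultdict(list)
    for path in paths:
        if path.is_file():
            by_size[path.stat().st_size].append(path.name)

    # Only sizes that occur more than once are candidate duplicates.
    return {
        size: sorted(names)
        for size, names in by_size.items()
        if len(names) > 1
    }


# Demonstration on a throwaway directory with made-up files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_bytes(b"x" * 10)
    (Path(tmp) / "b.txt").write_bytes(b"y" * 10)
    (Path(tmp) / "c.txt").write_bytes(b"z" * 3)
    groups = same_size_groups(tmp)

print(groups)
```

Note that a.txt and b.txt are grouped even though their contents differ. Equal size only suggests a duplicate, so a content comparison (for example a hash) is a sensible follow-up before deleting anything.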
References
https://docs.python.org/3/library/pathlib.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html