piper.io.duplicate_files

piper.io.duplicate_files(source=None, glob_pattern='*.*', recurse=False, filesize=1, keep=False, xl_file=None)

Select files that have the same file size.

These files are assumed to be ‘duplicates’.
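
The size-based selection described above can be illustrated with a minimal standard-library sketch. It is not piper's implementation: group_by_size is a hypothetical helper, and its parameters only mirror the names in the signature. Note that equal size is a heuristic, so files flagged this way may still differ in content.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def group_by_size(source, glob_pattern="*.*", recurse=False):
    """Group files under `source` by size in bytes (hypothetical helper)."""
    root = Path(source)
    # rglob walks subdirectories, glob stays in the top-level directory.
    paths = root.rglob(glob_pattern) if recurse else root.glob(glob_pattern)
    groups = defaultdict(list)
    for path in paths:
        if path.is_file():
            groups[path.stat().st_size].append(path.name)
    # Keep only the sizes shared by more than one file.
    return {size: sorted(names) for size, names in groups.items() if len(names) > 1}


# Demonstration on throwaway files: two 5-byte files and one 11-byte file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_text("hello")
    (Path(tmp) / "b.txt").write_text("world")
    (Path(tmp) / "c.txt").write_text("longer text")
    same_size = group_by_size(tmp)
```

Here same_size maps the shared size to the file names that have it, so only a.txt and b.txt are reported.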

Parameters
  • source – source directory, default None

  • glob_pattern – glob pattern used to filter files by extension suffix, default ‘*.*’

  • recurse – if True, search the source directory recursively, default False

  • filesize – file size filter, default 1 (kb)

  • keep

    {‘first’, ‘last’, False}, default False. Determines which duplicates (if any) to mark.

    first: Mark duplicates as True except for the first occurrence.

    last: Mark duplicates as True except for the last occurrence.

    False: Mark all duplicates as True.

  • xl_file – path of an Excel workbook to write the results to, default None (no workbook is written)
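
The keep options follow the semantics of pandas duplicated, listed in the references. The sketch below shows the three settings on a small hand-made table. The column names name and size are assumptions made for illustration, and how piper builds its own duplicate column is not shown in this page.

```python
import pandas as pd

# Four hypothetical files; three of them share the size 5.
files = pd.DataFrame({
    "name": ["a.txt", "b.txt", "c.txt", "d.txt"],
    "size": [5, 5, 11, 5],
})

# keep='first': every repeat after the first occurrence is marked True.
first = files["size"].duplicated(keep="first").tolist()

# keep='last': every repeat before the last occurrence is marked True.
last = files["size"].duplicated(keep="last").tolist()

# keep=False: all members of a duplicated group are marked True.
all_marked = files["size"].duplicated(keep=False).tolist()

print(first)       # [False, True, False, True]
print(last)        # [True, True, False, False]
print(all_marked)  # [True, True, False, True]
```

With keep=False every file in a same-size group is flagged, which is why the example in this page can filter on duplicate == True to list all candidates.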

Returns

Return type

pd.DataFrame

Examples

from piper.io import duplicate_files

source = '/home/mike/Documents'

duplicate_files(source,
                glob_pattern='*.*',
                recurse=True,
                filesize=2000000,
                keep=False).query("duplicate == True")

References

https://docs.python.org/3/library/pathlib.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html