duke

powerplantmatching.duke.duke(datasets, labels=['one', 'two'], singlematch=False, showmatches=False, keepfiles=False, showoutput=False)

Run duke in one of two modes: Deduplication Mode locates duplicates within a single dataset, and Record Linkage Mode finds similar entries across two different datasets. In Record Linkage Mode (matching two datasets), set singlematch=True and call best_matches() afterwards.

Parameters
  • datasets (pd.DataFrame or [pd.DataFrame]) – A single dataframe is run in Deduplication Mode; a list of dataframes is run in Record Linkage Mode

  • labels ([str], default ['one', 'two']) – Labels for the linked dataframe

  • singlematch (boolean, default False) – Only in Record Linkage Mode. Only report the best match for each entry of the first named dataset. This does not guarantee a unique match in the second named dataset.

  • keepfiles (boolean, default False) – If True, temporary files are not deleted
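
The caveat on singlematch (the best match per entry of the first dataset need not be unique in the second) can be illustrated without running duke. The following is a minimal sketch in plain Python. The candidate list, the scores, and the helpers best_per_first and best_per_second are hypothetical and are not part of powerplantmatching. The second reduction only shows what a follow-up step could look like; it is not a description of how best_matches() is implemented.

```python
from typing import Dict, List, Tuple

# (id in first dataset, id in second dataset, similarity score)
Match = Tuple[str, str, float]

# Hypothetical candidate matches; in practice these would come from the matcher.
candidates: List[Match] = [
    ("a1", "b1", 0.95),
    ("a1", "b2", 0.60),
    ("a2", "b1", 0.80),
    ("a3", "b3", 0.70),
]


def best_per_first(matches: List[Match]) -> List[Match]:
    """Keep only the highest-scoring match for each entry of the first dataset."""
    best: Dict[str, Match] = {}
    for match in matches:
        key = match[0]
        if key not in best or match[2] > best[key][2]:
            best[key] = match
    return sorted(best.values())


def best_per_second(matches: List[Match]) -> List[Match]:
    """Keep only the highest-scoring match for each entry of the second dataset."""
    best: Dict[str, Match] = {}
    for match in matches:
        key = match[1]
        if key not in best or match[2] > best[key][2]:
            best[key] = match
    return sorted(best.values())


# Analogous to singlematch=True: one row per entry of the first dataset.
single = best_per_first(candidates)

# "b1" still appears twice, so the result is not unique in the second dataset.
second_ids = [m[1] for m in single]
has_duplicates = len(second_ids) != len(set(second_ids))

# An assumed follow-up reduction that makes the pairing one-to-one.
unique = best_per_second(single)
```

In this sketch, both "a1" and "a2" select "b1" as their best match, which is why a further step is needed after a single-match run when a one-to-one pairing is required.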