### Setup Virtual Environment and Install Dependencies

Source: https://github.com/dedupeio/dedupe-examples/blob/main/README.md

Commands to set up a virtual environment named 'dedupe-examples' and install project dependencies from 'requirements.txt'. This ensures a consistent development environment.

```bash
mkvirtualenv dedupe-examples
pip install -r requirements.txt

# to return to this environment in a later session
workon dedupe-examples
```

--------------------------------

### Run MySQL Example Script

Source: https://github.com/dedupeio/dedupe-examples/blob/main/README.md

Instructions to run the MySQL example, which demonstrates deduplication on IL campaign contributions data stored in MySQL. Refer to 'mysql_example/README.md' for detailed setup and execution.

```bash
cd mysql_example
# Refer to README.md for specific commands
```

--------------------------------

### Install Dependencies for Gazetteer Examples

Source: https://github.com/dedupeio/dedupe-examples/blob/main/gazetteer_example/README.md

Installs the Python dependencies for the in-memory Gazetteer example using pip and a requirements file. Use 'requirements-2.x.txt' with dedupe 2.x, or 'requirements-1.x.txt' with dedupe 1.x. It assumes a virtual environment is set up for dependency management.

```bash
pip install -r requirements-2.x.txt
```

```bash
pip install -r requirements-1.x.txt
```

--------------------------------

### Setup PostgreSQL Database for Gazetteer Example

Source: https://github.com/dedupeio/dedupe-examples/blob/main/gazetteer_example/README.md

Sets up a PostgreSQL database named 'dedupe_example' and exports the connection string as an environment variable. This is a prerequisite for running the PostgreSQL-backed Gazetteer example.

```bash
createdb dedupe_example
export DATABASE_URL=postgres:///dedupe_example
```

--------------------------------

### Run CSV Example Script

Source: https://github.com/dedupeio/dedupe-examples/blob/main/README.md

Steps to navigate into the CSV example directory, install the 'unidecode' library, and run the 'csv_example.py' script.
This script demonstrates deduplication on a list of early childhood education sites.

```bash
cd csv_example
pip install unidecode
python csv_example.py
```

--------------------------------

### Run Dedupe.io Example Scripts

Source: https://github.com/dedupeio/dedupe-examples/blob/main/mysql_example/README.md

These bash commands execute the Python scripts to initialize the MySQL database and run the deduplication example. Ensure dependencies are installed and MySQL configuration is complete before running.

```bash
cd mysql_example
python mysql_init_db.py
python mysql_example.py
```

--------------------------------

### Run Patent Example Script

Source: https://github.com/dedupeio/dedupe-examples/blob/main/README.md

Instructions to run the patent example script. This involves navigating to the 'patent_example' directory, installing 'unidecode', and executing the Python script to process patent data.

```bash
cd patent_example
pip install unidecode
python patent_example.py
```

--------------------------------

### Run Gazetteer Example Script

Source: https://github.com/dedupeio/dedupe-examples/blob/main/README.md

Commands to run the gazetteer example script. This example uses the Gazetteer class to link entries between two spreadsheets of electronics products.

```bash
cd gazetteer_example
python gazetteer_example.py
```

--------------------------------

### Clone Dedupe Examples Repository

Source: https://github.com/dedupeio/dedupe-examples/blob/main/README.md

Instructions to clone the dedupe-examples repository using Git. This is the first step to obtain the example scripts.

```bash
git clone https://github.com/dedupeio/dedupe-examples.git
cd dedupe-examples
```

--------------------------------

### Run PostgreSQL Big Dedupe Example Script

Source: https://github.com/dedupeio/dedupe-examples/blob/main/README.md

Steps to run the PostgreSQL example, which is a port of the MySQL campaign contributions example to handle a large dataset on PostgreSQL.
Consult 'pgsql_big_dedupe_example/README.md' for details.

```bash
cd pgsql_big_dedupe_example
# Refer to README.md for specific commands
```

--------------------------------

### Deduplicate MySQL Campaign Contributions

Source: https://github.com/dedupeio/dedupe-examples/blob/main/mysql_example/README.md

This script runs dedupe over the campaign contribution records stored in MySQL. It requires a pre-created 'contributions' database, populated beforehand (e.g. via 'mysql_init_db.py'), and correct MySQL connection details. The API calls below follow dedupe 2.x.

```python
import csv
import os
import sys
import time

from mysql import connector
from mysql.connector import errorcode

import dedupe


def read_records(db_config):
    """Load every row of the contributions table into a dict keyed by primary key."""
    try:
        conn = connector.connect(**db_config)
        cur = conn.cursor(dictionary=True)
        cur.execute("SELECT * FROM contributions.contributions")
        results = cur.fetchall()
        cur.close()
        conn.close()
    except connector.Error as err:
        if err.errno == errorcode.ER_ACCESS_DENIED_ERROR:
            print("Something is wrong with your user name or password")
        elif err.errno == errorcode.ER_BAD_DB_ERROR:
            print("Database does not exist")
        else:
            print(err)
        sys.exit(1)
    return {row['id']: row for row in results}


def preloaded_mysql_example(db_config):
    settings_file = 'mysql_example/mysql_learned_settings'
    training_file = 'mysql_example/mysql_training.json'

    print('importing data from mysql')
    data_d = read_records(db_config)
    print('loaded %s records from mysql' % len(data_d))

    # Fields the deduper will compare (dedupe dict-style field definition)
    fields = [
        {'field': 'firstname', 'type': 'String'},
        {'field': 'lastname', 'type': 'String'},
        {'field': 'zip', 'type': 'String', 'has missing': True},
    ]

    # If a settings file exists, reuse the learned model; otherwise train a new one.
    if os.path.exists(settings_file):
        print('reading learned settings from', settings_file)
        with open(settings_file, 'rb') as f:
            deduper = dedupe.StaticDedupe(f)
    else:
        print('creating new model')
        deduper = dedupe.Dedupe(fields)
        deduper.prepare_training(data_d)

        # Interactive labeling: mark candidate pairs as duplicate or distinct
        dedupe.console_label(deduper)
        deduper.train()

        # Save the labeled examples and learned settings for future runs
        with open(training_file, 'w') as f:
            deduper.write_training(f)
        with open(settings_file, 'wb') as f:
            deduper.write_settings(f)

    print('clustering...')
    start_time = time.time()

    # partition() returns clusters of duplicate records with confidence scores
    clustered_dupes = deduper.partition(data_d, 0.5)
    print('# duplicate sets:', len(clustered_dupes))

    print('writing results to output/mysql_output.csv')
    with open('output/mysql_output.csv', 'w', newline='') as outfile:
        writer = csv.writer(outfile)
        writer.writerow(['cluster_id', 'record_id', 'confidence'])
        for cluster_id, (records, scores) in enumerate(clustered_dupes):
            for record_id, score in zip(records, scores):
                writer.writerow([cluster_id, record_id, score])

    print('done')
    print('elapsed time: %s' % (time.time() - start_time))


if __name__ == '__main__':
    # Connection details for the pre-created 'contributions' database;
    # adjust these to match your mysql.cnf
    db_config = {
        'user': 'your_username',
        'password': 'your_password',
        'host': '127.0.0.1',
        'port': 3306,
        'database': 'contributions',
    }
    preloaded_mysql_example(db_config)
```

--------------------------------

### Run Gazetteer Example Evaluation

Source: https://github.com/dedupeio/dedupe-examples/blob/main/gazetteer_example/README.md

Executes the Python script for evaluating the in-memory Gazetteer matching example. This script is used to assess the performance and accuracy of the matching job.
```bash
python gazetteer_evaluation.py
```

--------------------------------

### Run PostgreSQL Gazetteer Example

Source: https://github.com/dedupeio/dedupe-examples/blob/main/gazetteer_example/README.md

Executes the Python script for the PostgreSQL-backed Gazetteer matching example. This script interacts with a PostgreSQL database to perform matching and updates relevant tables.

```bash
python gazetteer_postgres_example.py
```

--------------------------------

### Run Record Linkage Example Script

Source: https://github.com/dedupeio/dedupe-examples/blob/main/README.md

Steps to execute the record linkage example. This involves changing to the 'record_linkage_example' directory and running the Python script to link matching entries between two spreadsheets of electronics products.

```bash
cd record_linkage_example
python record_linkage_example.py
```

--------------------------------

### Run In-Memory Gazetteer Example

Source: https://github.com/dedupeio/dedupe-examples/blob/main/gazetteer_example/README.md

Executes the Python script for the in-memory Gazetteer matching example. This script performs matching entirely within the application's memory and produces 'gazetteer_output.csv'.

```bash
python gazetteer_example.py
```

--------------------------------

### Example Labeling Operation for Dedupe Training

Source: https://github.com/dedupeio/dedupe-examples/blob/main/README.md

Illustrates a typical interactive labeling session for training the dedupe model. Users are presented with pairs of records and asked to classify them as matches, non-matches, or unsure.

```text
Phone : 2850617
Address : 3801 s. wabash
Zip :
Site name : ada s. mckinley st. thomas cdc

Phone : 2850617
Address : 3801 s wabash ave
Zip :
Site name : ada s. mckinley community services - mckinley - st. thomas

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
```

--------------------------------

### Run Patent Disambiguation Example

Source: https://github.com/dedupeio/dedupe-examples/blob/main/patent_example/README.md

Executes the Python script for patent data disambiguation using the 'dedupe' library. No external dependencies beyond Python and the 'dedupe' library are required for this command.

```shell
python patent_example.py
```

--------------------------------

### Evaluate Patent Disambiguation Precision/Recall

Source: https://github.com/dedupeio/dedupe-examples/blob/main/patent_example/README.md

Runs a Python script to evaluate the precision and recall of the patent data disambiguation against provided reference data. Requires the 'dedupe' library and the reference CSV file.

```shell
python patent_evaluation.py
```
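--------------------------------

### Sketch: Normalizing Fields Before Comparison

Several of these examples normalize free-text fields (stripping whitespace, lowercasing, transliterating with 'unidecode') before comparing records. A minimal stdlib-only sketch of that idea — the `canonical_key` helper and the field names are illustrative, not part of the dedupe API:

```python
def normalize(value):
    """Lowercase and strip a field value before comparison."""
    return value.strip().lower()


def canonical_key(record):
    """Concatenate normalized name and zip fields into a coarse match key."""
    return ''.join(normalize(record.get(f, ''))
                   for f in ('firstname', 'lastname', 'zip'))


rec_a = {'firstname': ' Jane ', 'lastname': 'Doe', 'zip': '60601'}
rec_b = {'firstname': 'jane', 'lastname': 'DOE ', 'zip': '60601'}
assert canonical_key(rec_a) == canonical_key(rec_b)  # both 'janedoe60601'
```

Records that differ only in casing and stray whitespace collapse to the same key, which is why this kind of preprocessing reduces the work the learned model has to do.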
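--------------------------------

### Sketch: Precision and Recall Over Duplicate Pairs

'patent_evaluation.py' measures precision and recall of predicted duplicates against reference data. The underlying computation over sets of predicted and true duplicate pairs can be sketched as follows (the `precision_recall` function name is illustrative, not from the repository):

```python
def precision_recall(predicted, reference):
    """Compare predicted duplicate pairs against a reference (gold) set."""
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    return precision, recall


# 2 of 3 predicted pairs are correct, and 2 of 3 reference pairs are found
p, r = precision_recall({(1, 2), (3, 4), (5, 6)}, {(1, 2), (3, 4), (7, 8)})
```

Both values here come out to 2/3: precision penalizes extra predicted pairs, recall penalizes missed reference pairs.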
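--------------------------------

### Sketch: Grouping Matched Pairs into Clusters

Dedupe's matching steps emit pairs of record IDs with confidence scores; downstream, such pairs are commonly merged into clusters of records that refer to the same entity. A stdlib-only union-find sketch of that post-processing (dedupe itself performs its own clustering; this is not its API):

```python
def cluster_pairs(pairs):
    """Merge overlapping (id_a, id_b) pairs into connected-component clusters."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)  # union the two components

    clusters = {}
    for x in parent:
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())


groups = cluster_pairs([(1, 2), (2, 3), (4, 5)])
# two clusters: {1, 2, 3} and {4, 5}
```

Because (1, 2) and (2, 3) share record 2, they merge into one cluster even though records 1 and 3 were never directly compared.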