Sunday 15 August 2010

duplicates - Best way to deduplicate people records in Rails


I am writing a Rails app with a Person model that looks something like this:

  create_table "people", :force => true do |t|
    t.string   "first_name"
    t.string   "last_name"
    t.string   "email"
    t.datetime "created_at", :null => false
    t.datetime "updated_at", :null => false
  end

I have a two-step process:

  1. Populate the people records with people's names, for example "Tim Smith" and "Timothy Smith".
  2. Query an API to get a potential email address match for each of those people. Because of nicknames and the like, the names may contain duplicates that I do not know about in advance.

    After that processing, I might end up with data like this:

    Record 1: first_name: Tim last_name: Smith email: tim.smith@sampleemail.com

    Record 2: first_name: Timothy last_name: Smith email: tim.smith@sampleemail.com

    What is the best way to model the fact that these records are duplicates?

    UPDATE: Clarification

    After step 2, I know that these two records are duplicates (i.e. the same person). My question is how to represent that in the model. Should I add a "duplicate_of_person_id" field and, on the second record, store the ID of the first record in it? Is there a better way?
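
To make the "duplicate_of_person_id" idea concrete, here is a minimal plain-Ruby sketch. It uses a Struct as a stand-in for the people table rather than ActiveRecord, and the helper name canonical_person is hypothetical. It only shows how a reader of the data would follow the pointer back to the record that is treated as the real person.

```ruby
# Plain-Ruby stand-in for the people table plus the proposed pointer column.
Person = Struct.new(:id, :first_name, :last_name, :email, :duplicate_of_person_id)

# Follow duplicate_of_person_id until reaching a record that is not marked
# as a duplicate. `people_by_id` is a Hash of id => Person (an assumption of
# this sketch; in a Rails app this would be a database lookup instead).
def canonical_person(person, people_by_id)
  seen = {}
  while person.duplicate_of_person_id
    raise ArgumentError, "cycle in duplicate pointers" if seen[person.id]
    seen[person.id] = true
    person = people_by_id.fetch(person.duplicate_of_person_id)
  end
  person
end

tim     = Person.new(1, "Tim", "Smith", "tim.smith@sampleemail.com", nil)
timothy = Person.new(2, "Timothy", "Smith", "tim.smith@sampleemail.com", 1)
people_by_id = { tim.id => tim, timothy.id => timothy }

puts canonical_person(timothy, people_by_id).first_name
```

With this layout a record whose pointer is nil is either unique or the chosen original, so queries for "real" people only need to filter on that column being empty.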

    You can add all the records at once. The first scheme that comes to mind is to keep the record with the lowest ID as the winner and point all the dupes at it. You could also use has_and_belongs_to_many, which involves a separate table where each record says that two people are equal. The latter grows quadratically with the number of duplicate records, though, since every pair of equal records needs its own row.
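
The lowest-ID-wins scheme can be sketched in plain Ruby as follows. This is an illustration under assumptions, not a Rails implementation: matching is done on the email address compared case-insensitively, records without an email are left alone, and the names mark_duplicates_by_email, pointer_rows and pair_rows are made up for this example. The two small helpers at the end compare how many rows each approach needs for n records that all describe the same person.

```ruby
# Plain-Ruby stand-in for the people table plus a pointer column.
Person = Struct.new(:id, :first_name, :last_name, :email, :duplicate_of_person_id)

# Group records by email; within each group the lowest ID is the winner and
# every other record points at it. Records without an email are skipped.
def mark_duplicates_by_email(people)
  with_email = people.reject { |p| p.email.nil? }
  with_email.group_by { |p| p.email.downcase }.each_value do |group|
    winner = group.min_by(&:id)
    group.each do |person|
      person.duplicate_of_person_id = person.equal?(winner) ? nil : winner.id
    end
  end
  people
end

# Rows needed to record that n records are the same person:
# one pointer per duplicate ...
def pointer_rows(n)
  n - 1
end

# ... versus one join-table row per pair of equal records.
def pair_rows(n)
  n * (n - 1) / 2
end

people = [
  Person.new(1, "Tim", "Smith", "tim.smith@sampleemail.com"),
  Person.new(2, "Timothy", "Smith", "tim.smith@sampleemail.com"),
  Person.new(3, "Ann", "Jones", "ann.jones@sampleemail.com")
]
mark_duplicates_by_email(people)
people.each { |p| puts "#{p.id}: duplicate_of_person_id=#{p.duplicate_of_person_id.inspect}" }
```

For ten records of the same person the pointer approach stores 9 references while the pairwise table stores 45 rows, which is the quadratic growth mentioned above.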

    Or just copy all the information from one record into the other and delete the second.
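
A rough sketch of the copy-and-delete option, again in plain Ruby with a Struct standing in for the model. The helper merge_and_delete is hypothetical, and the rule it applies (only fill in fields that are blank on the record being kept) is an assumption; a real app would have to decide which value wins when both records have one.

```ruby
# Plain-Ruby stand-in for the people table.
Person = Struct.new(:id, :first_name, :last_name, :email)

# Copy into `keeper` any field it is missing, then remove `duplicate`
# from the in-memory list. The ID of the keeper is never overwritten.
def merge_and_delete(keeper, duplicate, people)
  (keeper.members - [:id]).each do |field|
    value = keeper[field]
    keeper[field] = duplicate[field] if value.nil? || value.to_s.empty?
  end
  people.delete_if { |p| p.equal?(duplicate) }
  keeper
end

tim     = Person.new(1, "Tim", "Smith", nil)
timothy = Person.new(2, "Timothy", "Smith", "tim.smith@sampleemail.com")
people  = [tim, timothy]

merge_and_delete(tim, timothy, people)
puts people.map(&:to_a).inspect
```

Unlike the pointer approach this loses the fact that "Timothy" was ever seen, and in a database any rows that reference the deleted record would need to be repointed first.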
