java - Encoding record samples for expectation maximization algorithm


First, I'm a programmer without a data science background, and my working knowledge of statistics is quite limited.

I'm creating an entity matching tool to match records across our internal datasets. I want to use the probabilistic matching technique described in these documents*. I understand how the technique works and how to apply it, except for the derivation of the agreement/disagreement weights using expectation maximization (EM).

Specifically, I'm unclear on how to encode the record pairs into the double[][] format required by the EM implementation I have available, which is Apache Commons Math's MultivariateNormalMixtureExpectationMaximization.

Here is a concrete example: matching company records.

A company has 2 fields: name (String) and country (enum), and I want to generate the m and u probabilistic weights using EM. How do I create the double[][] dataset for each field to feed into EM?

In the case of name, it is a string where there is approximate agreement / disagreement, determined using a string similarity method (edit distance, phonetic index, etc.; the details aren't relevant here).
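To make "approximate agreement" concrete, here is a minimal sketch of one way a name comparison could be reduced to a number. It assumes Levenshtein edit distance normalized by the longer string's length; the class name `NameSimilarity`, the case folding, and the threshold parameter are illustrative assumptions, not part of the technique described in the documents.

```java
// Sketch only: maps a pair of names to a similarity score in [0, 1].
// The metric (normalized Levenshtein) and all names here are assumptions.
final class NameSimilarity {

    /** Classic dynamic-programming Levenshtein edit distance, two rows at a time. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // cost of building b's prefix from an empty string
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // cost of deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two row buffers
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** 1.0 means identical after trimming and case folding; 0.0 means every position differs. */
    static double similarity(String a, String b) {
        String x = a.trim().toLowerCase(java.util.Locale.ROOT);
        String y = b.trim().toLowerCase(java.util.Locale.ROOT);
        int maxLen = Math.max(x.length(), y.length());
        if (maxLen == 0) {
            return 1.0; // two empty names are treated as identical
        }
        return 1.0 - (double) editDistance(x, y) / maxLen;
    }

    /** Collapses the score to agree (1.0) or disagree (0.0) using a caller-chosen threshold. */
    static double agreement(String a, String b, double threshold) {
        return similarity(a, b) >= threshold ? 1.0 : 0.0;
    }
}
```

Whether to keep the continuous score or collapse it to agree/disagree is a design choice; the threshold would need tuning against real data.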

In the case of country, the data is normalized, so agreement occurs only on an exact match. Some countries are over-represented and others under-represented. A record from an under-represented country should have a higher weight than one from an over-represented country.
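The intuition that agreement on a rare country should count for more can be sketched as a value-specific weight. The snippet below assumes a frequency-based heuristic in which the agreement weight for a value is log2(m / f), where f is the value's relative frequency in the data and m is an assumed probability of agreement among true matches. The class name, the method names, and the choice of this formula are illustrative assumptions, not something taken from the referenced documents.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: frequency-adjusted agreement weights for a categorical field.
final class CountryFrequencyWeights {

    /** Relative frequency of each country value in the dataset. */
    static Map<String, Double> relativeFrequencies(String[] countries) {
        Map<String, Integer> counts = new HashMap<>();
        for (String c : countries) {
            counts.merge(c, 1, Integer::sum);
        }
        Map<String, Double> freq = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            freq.put(e.getKey(), e.getValue() / (double) countries.length);
        }
        return freq;
    }

    /**
     * Assumed heuristic: agreement weight = log2(m / f).
     * A smaller frequency f yields a larger weight.
     */
    static double agreementWeight(double m, double frequency) {
        return Math.log(m / frequency) / Math.log(2.0);
    }
}
```

With this heuristic, two records agreeing on a country that makes up 1% of the data would contribute more evidence than two records agreeing on a country that makes up 60% of it.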

  1. What do the values in the inner double[] mean/represent?
  2. How many entries/columns should there be?
  3. How do I encode the records into a double[]?
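For illustration, one plausible reading of the double[][] layout is that each row is one candidate record pair and each column is the comparison outcome for one field. The sketch below works under that assumption: column 0 holds a name similarity in [0, 1] and column 1 holds 1.0 or 0.0 for country agreement. The `Company` class, the `PairEncoder` name, and the placeholder similarity are hypothetical. Whether a Gaussian mixture fits such mostly binary columns well is a separate question that this sketch does not address.

```java
// Sketch only: encodes candidate record pairs as comparison vectors.
// Assumed layout: rows = record pairs, columns = compared fields.
final class PairEncoder {

    /** Hypothetical record type with the two fields from the example. */
    static final class Company {
        final String name;
        final String country; // stand-in for the enum, e.g. a country code

        Company(String name, String country) {
            this.name = name;
            this.country = country;
        }
    }

    /** Placeholder metric: swap in any string similarity that returns a value in [0, 1]. */
    static double nameSimilarity(String a, String b) {
        return a.equalsIgnoreCase(b) ? 1.0 : 0.0;
    }

    /** One comparison vector for one pair: { nameSimilarity, countryAgreement }. */
    static double[] encodePair(Company left, Company right) {
        double name = nameSimilarity(left.name, right.name);
        double country = left.country.equals(right.country) ? 1.0 : 0.0;
        return new double[] { name, country };
    }

    /** Compares every record in a with every record in b: one row per candidate pair. */
    static double[][] encodeAll(Company[] a, Company[] b) {
        double[][] data = new double[a.length * b.length][];
        int row = 0;
        for (Company left : a) {
            for (Company right : b) {
                data[row++] = encodePair(left, right);
            }
        }
        return data;
    }
}
```

Under this layout the number of columns equals the number of compared fields (2 in the company example), and the number of rows equals the number of candidate pairs. For large datasets the full cross product is usually too big, so some form of blocking would be needed before encoding.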

* Documents describing the probabilistic matching technique using EM

