java - Encoding record samples for expectation maximization algorithm
First, I'm a programmer without a data science background, and my working knowledge of statistics is quite limited.
I'm creating an entity matching tool to match records across our internal datasets. I want to use the probabilistic matching technique described in these documents*. I have an understanding of how the technique works and how to apply it, except for the derivation of the agreement/disagreement weights using expectation maximization (EM).
Specifically, I'm unclear on how to encode the record pairs in the double[][] format required.
The EM implementation I have available is Apache Commons Math's MultivariateNormalMixtureExpectationMaximization.
Here's a concrete example: matching company records.
A company has 2 fields: name (String) and country (enum), and I want to generate the m and u probabilistic weights using EM. How do I create the double[][] dataset for each field to feed to EM?
In the case of name, a String, there is approximate agreement/disagreement, determined by a string similarity method (edit distance, phonetic index, etc.; the details aren't relevant here).
In the case of country, the data is normalized, so agreement occurs only on an exact match. Some countries are over-represented and others under-represented. A match on an under-represented country should carry a higher weight than a match on an over-represented country.
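One possible reading of the encoding, sketched here under assumptions rather than offered as a confirmed answer: each row of the double[][] is a comparison vector for one candidate record pair, and each column is one compared field. In this sketch, column 0 holds a name similarity in [0, 1] (a normalized edit distance stands in for whatever string similarity method is actually used) and column 1 holds 1.0 or 0.0 for exact country agreement. The `Company` and `PairEncoder` types are hypothetical, and country is modeled as a String code in place of the enum. Whether a Gaussian mixture suits a 0/1 column is a separate question this sketch does not settle.

```java
// Hypothetical record type; the field names are assumptions for illustration.
final class Company {
    final String name;
    final String country; // stands in for the enum, e.g. an ISO country code

    Company(String name, String country) {
        this.name = name;
        this.country = country;
    }
}

final class PairEncoder {
    // Levenshtein edit distance computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]; 1.0 means the strings are identical.
    static double nameSimilarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        return maxLen == 0 ? 1.0 : 1.0 - (double) editDistance(a, b) / maxLen;
    }

    // One comparison vector: { name similarity, country agreement }.
    static double[] encode(Company x, Company y) {
        return new double[] {
            nameSimilarity(x.name.toLowerCase(), y.name.toLowerCase()),
            x.country.equals(y.country) ? 1.0 : 0.0
        };
    }

    // One row per candidate pair across the two datasets (no blocking applied).
    static double[][] encodeAll(Company[] left, Company[] right) {
        double[][] rows = new double[left.length * right.length][];
        int k = 0;
        for (Company l : left) {
            for (Company r : right) {
                rows[k++] = encode(l, r);
            }
        }
        return rows;
    }
}
```

With this layout the number of columns equals the number of compared fields (2 here), and the number of rows equals the number of candidate pairs. Comparing every record against every other grows quadratically, so a real tool would probably restrict the pairs first.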
- What do the values in the inner double[] mean/represent?
- How many entries/columns should there be?
- How do I encode the records as double[]?
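For context on what m and u are estimated from, here is a minimal sketch of EM for a two-class mixture over binary agreement patterns, assuming the fields are conditionally independent. It is a hand-rolled illustration, not the Gaussian mixture that the Commons Math class fits, and nothing in it is taken from the linked documents. The class name `FellegiSunterEm`, the starting values (m = 0.9, u = 0.1, match proportion p = 0.1), and the 0.5 cutoff that turns a similarity score into agree/disagree are all assumptions. Here m is the probability that a field agrees given the pair is a true match, and u is the probability that it agrees given the pair is not a match.

```java
final class FellegiSunterEm {
    final double[] m; // P(field agrees | pair is a match)
    final double[] u; // P(field agrees | pair is a non-match)
    double p;         // estimated proportion of matching pairs

    FellegiSunterEm(int fields) {
        m = new double[fields];
        u = new double[fields];
        java.util.Arrays.fill(m, 0.9); // assumed starting values
        java.util.Arrays.fill(u, 0.1);
        p = 0.1;
    }

    // gamma[i][j] >= 0.5 is treated as "pair i agrees on field j" (assumed cutoff).
    void fit(double[][] gamma, int iterations) {
        int n = gamma.length;
        int k = m.length;
        double[] g = new double[n]; // posterior probability that pair i is a match
        for (int it = 0; it < iterations; it++) {
            // E-step: posterior match probability for each pair.
            for (int i = 0; i < n; i++) {
                double pm = p;
                double pu = 1 - p;
                for (int j = 0; j < k; j++) {
                    boolean agree = gamma[i][j] >= 0.5;
                    pm *= agree ? m[j] : 1 - m[j];
                    pu *= agree ? u[j] : 1 - u[j];
                }
                g[i] = pm / (pm + pu);
            }
            // M-step: re-estimate m, u and p from the posteriors.
            double sumG = 0;
            for (double gi : g) sumG += gi;
            for (int j = 0; j < k; j++) {
                double mNum = 0;
                double uNum = 0;
                for (int i = 0; i < n; i++) {
                    double a = gamma[i][j] >= 0.5 ? 1 : 0;
                    mNum += g[i] * a;
                    uNum += (1 - g[i]) * a;
                }
                m[j] = clamp(mNum / sumG);
                u[j] = clamp(uNum / (n - sumG));
            }
            p = clamp(sumG / n);
        }
    }

    // Keeps probabilities away from 0 and 1 so the log weights stay finite.
    static double clamp(double x) {
        return Math.min(1 - 1e-6, Math.max(1e-6, x));
    }

    // Agreement weight log2(m / u).
    static double agreementWeight(double m, double u) {
        return Math.log(m / u) / Math.log(2);
    }

    // Disagreement weight log2((1 - m) / (1 - u)).
    static double disagreementWeight(double m, double u) {
        return Math.log((1 - m) / (1 - u)) / Math.log(2);
    }
}
```

After `fit`, `agreementWeight(em.m[j], em.u[j])` and `disagreementWeight(em.m[j], em.u[j])` give the per-field weights. This sketch does not address the frequency-based adjustment for over- and under-represented countries.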
* Documents describing the probabilistic matching technique using EM