Density estimation using Gaussian mixtures in the presence of noisy, heterogeneous and incomplete data

We present a general algorithm to infer a d-dimensional distribution function given a set of heterogeneous, noisy observations or samples. This algorithm reconstructs the error-deconvolved or 'underlying' distribution function common to all samples, even when the individual samples have unique error and missing-data properties. The underlying distribution is modeled as a mixture of Gaussians, which is completely general. Model parameters are chosen to optimize a justified, scalar objective function: the logarithm of the probability of the data under the error-convolved model, where the error convolution is different for each data point. Optimization is performed by an Expectation Maximization (EM) algorithm, extended by a regularization technique and 'split-and-merge' procedure. These extensions mitigate problems with singularities and local maxima, which are often encountered when using the EM algorithm to estimate Gaussian density mixtures.

extreme_deconvolution(
  ydata,
  ycovar,
  xamp,
  xmean,
  xcovar,
  projection = NULL,
  weight = NULL,
  fixamp = NULL,
  fixmean = NULL,
  fixcovar = NULL,
  tol = 1e-06,
  maxiter = 1e+09,
  w = 0,
  logfile = NULL,
  splitnmerge = 0,
  maxsnm = FALSE,
  likeonly = FALSE,
  logweight = FALSE
)

Arguments

ydata: [ndata,dy] matrix of observed quantities
ycovar: [ndata,dy] / [ndata,dy,dy] / [dy,dy,ndata] matrix, list or 3D array of observational error covariances (if [ndata,dy] then the error correlations are assumed to vanish)
xamp: [ngauss] array of initial amplitudes (*not* [1,ngauss])
xmean: [ngauss,dx] matrix of initial means
xcovar: [ngauss,dx,dx] list of matrices of initial covariances
projection: [ndata,dy,dx] list of projection matrices
weight: [ndata] array of weights to be applied to the data points
fixamp: (default=None) None, True/False, or list of bools
fixmean: (default=None) None, True/False, or list of bools
fixcovar: (default=None) None, True/False, or list of bools
tol: (double, default=1.e-6) tolerance for convergence
maxiter: (long, default= 10**9) maximum number of iterations to perform
w: (double, default=0.) covariance regularization parameter (of the conjugate prior)
logfile: basename for several logfiles (_c.log has output from the c-routine; _loglike.log has the log likelihood path of all the accepted routes, i.e. only parts which increase the likelihood are included, during splitnmerge)
splitnmerge: (int, default=0) depth to go down the splitnmerge path
maxsnm: (Bool, default=False) use the maximum number of split 'n' merge steps, K*(K-1)*(K-2)/2
likeonly: (Bool, default=False) only compute the total log likelihood of the data
logweight: (bool, default=False) if True, weight is actually log(weight)

Value

avgloglikedata: avgloglikedata after convergence
xamp: updated xamp
xmean: updated xmean
xcovar: updated xcovar

References

Inferring complete distribution functions from noisy, heterogeneous and incomplete observations Jo Bovy, David W. Hogg, & Sam T. Roweis, Submitted to AOAS (2009) [arXiv/0905.2979]

Author

Jo Bovy, David W. Hogg, & Sam T. Roweis

Examples

if (FALSE) {
ydata <-
c(2.62434536, 0.38824359, 0.47182825, -0.07296862, 1.86540763,
  -1.30153870, 2.74481176, 0.23879310, 1.31903910, 0.75062962,
  2.46210794, -1.06014071, 0.67758280, 0.61594565, 2.13376944,
  -0.09989127, 0.82757179, 0.12214158, 1.04221375, 1.58281521,
  -0.10061918, 2.14472371, 1.90159072, 1.50249434, 1.90085595,
  0.31627214, 0.87710977, 0.06423057, 0.73211192, 1.53035547,
  0.30833925, 0.60324647, 0.31282730, 0.15479436, 0.32875387,
  0.98733540, -0.11731035, 1.23441570, 2.65980218, 1.74204416,
  0.80816445, 0.11237104, 0.25284171, 2.69245460, 1.05080775,
  0.36300435, 1.19091548, 3.10025514, 1.12015895, 1.61720311,
  1.30017032, 0.64775015, -0.14251820, 0.65065728, 0.79110577,
  1.58662319, 1.83898341, 1.93110208, 1.28558733, 1.88514116,
  0.24560206, 2.25286816, 1.51292982, 0.70190717, 1.48851815,
  0.92442829, 2.13162939, 2.51981682, 3.18557541, -0.39649633,
  -0.44411380, 0.49553414, 1.16003707, 1.87616892, 1.31563495,
  -1.02220122, 0.69379599, 1.82797464, 1.23009474, 1.76201118,
  0.77767186, 0.79924193, 1.18656139, 1.41005165, 1.19829972,
  1.11900865, 0.32933771, 1.37756379, 1.12182127, 2.12948391,
  2.19891788, 1.18515642, 0.62471505, 0.36126959, 1.42349435,
  1.07734007, 0.65614632, 1.04359686, 0.37999916, 1.69803203,
  0.55287144, 2.22450770, 1.40349164, 1.59357852, -0.09491185,
  1.16938243, 1.74055645, 0.04629940, 0.73378149, 1.03261455,
  -0.37311732, 1.31515939, 1.84616065, 0.14048406, 1.35054598,
  -0.31228341, 0.96130449, -0.61577236, 2.12141771, 1.40890054,
  0.97538304, 0.22483838, 2.27375593, 2.96710175, -0.85798186,
  2.23616403, 2.62765075, 1.33801170, -0.19926803, 1.86334532,
  0.81907970, 0.39607937, -0.23005814, 1.55053750, 1.79280687,
  0.37646927, 1.52057634, -0.14434139, 1.80186103, 1.04656730,
  0.81343023, 0.89825413, 1.86888616, 1.75041164, 1.52946532,
  1.13770121, 1.07782113, 1.61838026, 1.23249456, 1.68255141,
  0.68988323, -1.43483776, 2.03882460, 3.18697965, 1.44136444,
  0.89984477, 0.86355526, 0.88094581, 1.01740941, -0.12201873,
  0.48290554, 0.00297317, 1.24879916, 0.70335885, 1.49521132,
  0.82529684, 1.98633519, 1.21353390, 3.19069973, -0.89636092,
  0.35308331, 1.90148689, 3.52832571, 0.75136522, 1.04366899,
  0.77368576, 2.33145711, 0.71269214, 1.68006984, 0.68019840,
  -0.27255875, 1.31354772, 1.50318481, 2.29322588, 0.88955297,
  0.38263794, 1.56276110, 1.24073709, 1.28066508, 0.92688730,
  2.16033857, 1.36949272, 2.90465871, 2.11105670, 1.65904980,
  -0.62743834, 1.60231928, 1.42028220, 1.81095167, 2.04444209)

ydata  <- matrix(ydata,length(ydata),1)
N      <- dim(ydata)[1]
ycovar <- ydata*0 + 0.01
xamp   <- c(0.5,0.5)
xmean  <- matrix(c(0.86447943, 0.67078879, 0.322681, 0.45087394),2,2)
xcovar <-
  list(matrix(c(0.03821028, 0.04014796, 0.04108113, 0.03173839),2,2),
       matrix(c(0.06219194, 0.09738021, 0.04302473, 0.06778009),2,2))
projection <- list()
for (i in 1:N)
  projection[[i]] = matrix(c(i%%2,(i+1)%%2),1,2)
res <- extreme_deconvolution(ydata, ycovar, xamp, xmean, xcovar,
         projection=projection, logfile="ExDeconDemo")
}