This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

If you use code found here in a publication or presentation, I would appreciate it if you would acknowledge the source.

- Mixture proportion estimation via kernel mean embedding: Technique for estimating the maximum proportion of one distribution present in another, based on kernel mean embedding
- Mixture proportion estimation: ROC-based technique for estimating the maximum proportion of one distribution present in another
- Sparse approximation of a kernel mean: For scaling kernel density estimates and kernel mean embeddings of distributions.
- Robust kernel density estimation: Views the kernel density estimate as a mean in a Hilbert space, and estimates the mean robustly via M-estimation
- Surrogate losses for label-dependent costs: Code to generate the figures in the paper.
- Cluster nearest neighbor algorithm for file matching, and associated EM algorithm for fitting a mixture of PPCA models with missing attributes.
- TCEM: EM algorithm for fitting a multivariate Gaussian mixture model with truncated and censored data.
- Nested support vector machines for cost-sensitive and one-class classification
- SVM path algorithms for cost-sensitive and one-class classification
- $L_2$ kernel classification, optimizing the integrated squared error of the difference of densities
- MN-SCAnn: Nonparametric annotation of multivariate, contaminated data
- Weighted L2E for partial mixture estimation
- 2$\nu$-SVM, a cost-sensitive extension of the $\nu$-SVM
- Dyadic decision trees with free-splits, for classification and other set estimation problems
- COPAP: Cyclic order preserving assignment problem for shape matching
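
To make "the maximum proportion of one distribution present in another" concrete: for distributions F and H, it is the largest kappa in [0, 1] such that F = kappa\*H + (1 - kappa)\*G for some distribution G. For discrete distributions on a finite set this reduces to the smallest ratio f_i / h_i over outcomes with h_i > 0. The sketch below illustrates only this population-level quantity; it is not the sample-based estimators in the packages listed here, and the function name `max_mixture_proportion` is hypothetical.

```python
import numpy as np

def max_mixture_proportion(f, h):
    """Largest kappa in [0, 1] with f = kappa*h + (1-kappa)*g for some pmf g.

    f, h: 1-D arrays of probabilities over the same finite set of outcomes.
    Hypothetical helper for illustration; not part of the listed packages.
    """
    f = np.asarray(f, dtype=float)
    h = np.asarray(h, dtype=float)
    support = h > 0
    # g = (f - kappa*h) / (1 - kappa) must be nonnegative, so kappa <= f_i / h_i
    # wherever h_i > 0; the tightest such bound is the answer.
    return float(np.min(f[support] / h[support]))

h = np.array([0.5, 0.5, 0.0])
g = np.array([0.0, 0.5, 0.5])   # g puts no mass on the first outcome
f = 0.3 * h + 0.7 * g           # f contains h in proportion 0.3
kappa = max_mixture_proportion(f, h)
print(kappa)                    # approximately 0.3
```

Because g puts no mass on an outcome that h covers, none of h can be "hidden" inside g, so the recovered proportion matches the mixing weight used to build f.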

Click the link to download.

- Mixture proportion
estimation via kernel mean embedding. This code implements the
algorithm described in
H. Ramaswamy, C. Scott, and A. Tewari, "Mixture Proportion Estimation via Kernel Embedding of Distributions," arXiv:1603.02501.

The code is written in Python 2.7 and requires the scipy, numpy, matplotlib, and cvxopt packages.

- Mixture proportion
estimation (version 2). This code implements the algorithm described in
C. Scott, "A Rate of Convergence for Mixture Proportion Estimation, with Application to Learning from Noisy Labels," AISTATS 2015.

Under the hood this code contains a scalable implementation (programmed by Daniel LeJeune) of kernel logistic regression using random Fourier features, which should be useful in a number of other contexts.
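
As a rough illustration of the random Fourier feature idea mentioned above, here is a minimal sketch of a feature map for the Gaussian kernel k(x, y) = exp(-gamma \* ||x - y||^2). It is not the implementation shipped with this code; the names `make_rff_map`, `gamma`, and `n_features` are assumptions made for the example.

```python
import numpy as np

def make_rff_map(d, n_features, gamma, rng):
    """Return a function mapping rows of X (shape (n, d)) to random Fourier
    features whose inner products approximate exp(-gamma * ||x - y||^2).
    Hypothetical helper for illustration only.
    """
    # Frequencies are drawn from the kernel's spectral density, N(0, 2*gamma*I).
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    # Random phases make a single cosine per frequency sufficient.
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)

    def transform(X):
        return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

    return transform

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
gamma = 0.5
transform = make_rff_map(d=3, n_features=5000, gamma=gamma, rng=rng)
Z = transform(X)

# Compare the feature-space Gram matrix with the exact kernel matrix.
K_approx = Z @ Z.T
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_exact = np.exp(-gamma * sq_dists)
max_err = np.max(np.abs(K_approx - K_exact))
print(max_err)
```

Fitting an ordinary linear logistic regression on `Z` then approximates kernel logistic regression, with cost that grows linearly in the number of samples rather than quadratically.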

- Sparse approximation of a kernel
mean.
E. Cruz Cortes and C. Scott, "Sparse approximation of a kernel mean."

- Robust kernel density
estimation.
J. Kim and C. Scott, "Robust kernel density estimation," *Journal of Machine Learning Research*, vol. 13, pp. 2529-2565, 2012.

- Surrogate losses for
label-dependent costs. Generates the figures in this paper:
C. Scott, "Calibrated Surrogate Losses for Classification with Label-Dependent Costs," *Electronic Journal of Statistics*, vol. 6, pp. 958-992, 2012.

- Cluster nearest
neighbor algorithm for file matching, and associated EM algorithm for
fitting a mixture of PPCA models with missing attributes.
G. Lee, W. Finn, and C. Scott, "Statistical file matching of flow cytometry data," *J. Biomedical Informatics*, vol. 44, no. 4, pp. 663-676, 2011.

- TCEM: EM algorithm for
fitting a multivariate Gaussian mixture model with truncated and censored
data.
G. Lee and C. Scott, "EM algorithms for multivariate Gaussian mixture models with truncated and censored data," *Computational Statistics and Data Analysis*, vol. 56, no. 9, pp. 2816-2829, 2012.

- Nested support vector
machines:
Matlab code to generate cost-sensitive and one-class SVMs that are
properly nested (unlike standard SVMs) as the cost-asymmetry or density
level parameter is varied. The solution paths are piecewise linear with a
user-selected number of breakpoints.
G. Lee and C. Scott, "Nested support vector machines," to be published in *IEEE Trans. Signal Processing*.

- SVM path algorithms:
Matlab code to generate solution paths for the cost-sensitive SVM with
varying cost-asymmetry, and the one-class SVM with varying density level
parameter. The algorithms were inspired by the path algorithm of Hastie et
al., which varies a regularization parameter, and were implemented for
comparison with the nested SVM code above.
The OC-SVM path algorithm was
detailed here:
G. Lee and C. Scott, "The one class support vector machine solution path," *Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing* (ICASSP 2007), vol. 2, pp. II-521--II-524, Honolulu, USA, April 2007.

The CS-SVM algorithm differs from the one developed by Bach et al. in that we capture the cost asymmetry in a single parameter. The algorithm first finds the solution path over the regularization parameter with the cost asymmetry parameter fixed at a specific value (the negative sample size divided by the total sample size). Then, for any fixed value of the regularization parameter, it finds the solution path as the cost asymmetry parameter varies. The first of these two path algorithms is detailed in the following class project report by Gyemin.

G. Lee, "The Solution Path for the Balanced 2C-SVM," EECS 559 Class Project Report, University of Michigan, Fall 2006.

The second path algorithm has no documentation, but follows similar principles to the other algorithms.
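
The "specific value" of the cost asymmetry parameter used in the first stage can be illustrated with a little arithmetic. The sketch below assumes the common 2C-SVM parametrization in which each positive-class slack is weighted by C\*gamma and each negative-class slack by C\*(1 - gamma); that parametrization, the constant `C`, and the helper name are assumptions for illustration and are not taken from the code on this page. Under that assumption, setting gamma to the fraction of negative samples gives both classes the same total penalty weight.

```python
import numpy as np

def balanced_cost_asymmetry(y):
    """Return gamma = (# negative samples) / (# samples) for labels in {-1, +1}.
    Hypothetical helper for illustration only.
    """
    y = np.asarray(y)
    return float(np.sum(y == -1)) / y.size

y = np.array([+1, +1, +1, -1])        # 3 positives, 1 negative
gamma = balanced_cost_asymmetry(y)    # 1 / 4
C = 1.0                               # hypothetical regularization constant

# Assumed per-sample costs: C*gamma for positives, C*(1 - gamma) for negatives.
total_pos = np.sum(y == +1) * C * gamma
total_neg = np.sum(y == -1) * C * (1.0 - gamma)
print(gamma, total_pos, total_neg)
```

In this toy case the three positives and the single negative each contribute a total weight of 0.75, so neither class dominates the penalty purely because of its size.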

- $L_2$ kernel
classification: Matlab code to implement a method of classification
based on the $L_2$ distance or integrated squared error, and detailed in
J. Kim and C. Scott, "$L_2$ kernel classification," *IEEE Trans. Pattern Analysis and Machine Intelligence*, vol. 32, no. 10, pp. 1822-1831, Oct. 2010.

- MN-SCAnn: Matlab code for
nonparametric annotation of multivariate, contaminated data,
detailed here:
C. Scott and E. Kolaczyk, "Nonparametric assessment of contamination in multivariate data using generalized quantile sets and FDR," *J. Computational and Graphical Statistics*, vol. 19, no. 2, pp. 439-456, June 2010.

- Partial
mixture estimation: R code and documentation
for semi-parametric partial mixture estimation using a weighted L2
distance, applied to microarray differential expression
analysis and detailed in
D. Rossell, R. Guerra, and C. Scott, "Semi-parametric differential expression analysis via partial mixture estimation," *Statistical Applications in Genetics and Molecular Biology*, vol. 7, no. 1, article 15, 2008.

- 2$\nu$-SVM:
An implementation of the 2$\nu$-SVM, a cost-sensitive
extension of the $\nu$-SVM, based on the LIBSVM
package and described in this paper:
M. Davenport, R. Baraniuk, and C. Scott, "Tuning support vector machines for minimax and Neyman-Pearson classification," *IEEE Trans. Pattern Analysis and Machine Intelligence*, vol. 32, no. 10, pp. 1888-1898, Oct. 2010.

- Dyadic decision
trees:
Matlab/mex code for solving several set estimation
problems, including traditional binary and multi-class
classification, Neyman-Pearson classification, minimum
volume set estimation, and density level set estimation.
Estimates are based on free-split recursive dyadic
partitions. Practical for problems of dimension less than 10.
Thanks to Gilles Blanchard for helpful discussions
regarding the implementation of the dyadic binning
algorithm.
C. Scott and R. Nowak, "Minimax-optimal classification with dyadic decision trees," *IEEE Transactions on Information Theory*, vol. 52, no. 4, pp. 1335-1353, April 2006.

C. Scott and R. Nowak, "Learning minimum volume sets," *Journal of Machine Learning Research*, vol. 7, pp. 665-704, April 2006.

- Cyclic
contour matching:
Matlab code for aligning two point sets obtained by sampling
cyclic contours. Implements the
algorithms and reproduces the examples found in this paper:
C. Scott and R. Nowak, "Robust contour matching via the order preserving assignment problem," *IEEE Transactions on Image Processing*, vol. 15, no. 7, pp. 1831-1838, July 2006.

This work was supported in part by NSF Awards 0830490 and 0953135.