Machine Learning Bibliography
Books
- Hastie, Tibshirani, and Friedman, The Elements of
Statistical Learning, 2001
- Bishop, Pattern Recognition and Machine Learning,
2006
- Ripley, Pattern Recognition and Neural Networks,
1996
- Duda, Hart, and Stork, Pattern Classification,
2nd Ed., 2002
- Tan, Steinbach, and Kumar, Introduction
to Data Mining, Addison-Wesley, 2005.
- Schölkopf and Smola, Learning with Kernels, 2002
- Mardia, Kent, and Bibby, Multivariate Analysis,
1979
- Computational
Statistics (online book)
- Sutton and Barto, Reinforcement Learning: An
Introduction, MIT Press, 1998.
- Bertsekas and Tsitsiklis, Neuro-Dynamic Programming,
Athena Scientific, 1996.
Other machine learning courses
Data repositories
Background
Matlab Software
- CVX, a Matlab modeling framework for convex programs
by Michael Grant and Stephen Boyd (a small Python sketch in a similar spirit follows this list)
- YALMIP,
a high-level Matlab interface to a variety of convex program solvers, such
as SeDuMi
- SeDuMi, for solving second
order cone programs. Most if not all tractable convex programs can be cast
as such.
- LIBSVM, for
support vector classification (including multiclass),
regression, and one-class classification (novelty detection).
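The modeling-layer workflow that CVX and YALMIP provide can also be illustrated in Python. The sketch below is only an illustration under stated assumptions: it uses CVXPY (a Python analogue of CVX, not one of the Matlab packages above) on made-up data, stating a convex problem that the package then reformulates into a cone program for an underlying solver.

```python
# A minimal sketch (not tied to the Matlab packages above): state a convex
# problem at a high level and let the modeling layer produce the cone program.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)   # made-up data

x = cp.Variable(5)
# Nonnegative least squares: a simple convex program.
problem = cp.Problem(cp.Minimize(cp.sum_squares(A @ x - b)), [x >= 0])
problem.solve()
print(problem.status, x.value)
```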
Conferences/Publications
Nearest Neighbors
- The primary research area relating to nearest neighbor
methods is the problem of storage, data reduction, and
rapid calculation of nearest neighbors. A search on
"nearest neighbor search," "condensed nearest neighbors,"
or "edited nearest neighbors" will return a number of
references. (A brute-force sketch follows this list.)
- Theory: Devroye, Gyorfi and Lugosi, A Probabilistic
Theory of Pattern Recognition, 1996
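To make the basic method concrete, here is a minimal brute-force k-nearest-neighbor classifier in Python/NumPy on made-up data; the references above concern doing this efficiently (storage, data reduction, fast search), which this naive sketch ignores.

```python
# Brute-force k-NN classification by majority vote under Euclidean distance.
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)      # distance to every training point
        nearest = y_train[np.argsort(dists)[:k]]         # labels of the k closest
        values, counts = np.unique(nearest, return_counts=True)
        preds.append(values[np.argmax(counts)])          # majority vote
    return np.array(preds)

# Tiny made-up example.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([[0.05, 0.1], [1.0, 0.9]]), k=3))
```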
Density Estimation
- David Scott, Multivariate Density Estimation, 1992
(a small kernel density estimate sketch follows this list)
- Theory: Devroye and Lugosi, Combinatorial Methods in
Density Estimation, 2001
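As a minimal illustration of the estimators studied in these references, the following Python/NumPy sketch evaluates a one-dimensional Gaussian kernel density estimate on made-up data; the bandwidth is hand-picked here, not chosen in a data-driven way.

```python
# One-dimensional Gaussian kernel density estimate:
# f_hat(x) = (1 / (n h)) * sum_i K((x - x_i) / h), with K the standard normal density.
import numpy as np

def gaussian_kde(samples, grid, h):
    u = (grid[:, None] - samples[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return K.sum(axis=1) / (len(samples) * h)

samples = np.array([-1.2, -0.9, 0.1, 0.3, 1.5])   # made-up data
grid = np.linspace(-3, 3, 7)
print(gaussian_kde(samples, grid, h=0.5))
```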
Linear methods for classification
- Hastie et al., Bishop, and Duda et al. all have chapters
on LDA, logistic regression, and other linear
classifiers. (A small logistic regression sketch follows this list.)
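For concreteness, here is a minimal logistic regression fit by batch gradient descent in Python/NumPy on made-up data; the texts above give the proper statistical treatment (and LDA as an alternative linear rule).

```python
# Logistic regression for y in {0, 1} by gradient descent on the average log loss.
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=500):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append an intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))           # predicted probabilities
        w -= lr * Xb.T @ (p - y) / len(y)           # gradient step
    return w

X = np.array([[0.0], [1.0], [2.0], [3.0]])          # made-up data
y = np.array([0, 0, 1, 1])
print(fit_logistic(X, y))
```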
Decision Trees
- The first comprehensive treatment and still a standard
reference: Breiman, Friedman, Olshen, and Stone, Classification
and Regression Trees, 1984. (A sketch of the basic splitting step follows this list.)
- The other standard reference is Quinlan, J. R. C4.5:
Programs for Machine Learning. Morgan Kaufmann
Publishers, 1993.
- A somewhat recent survey of research on decision trees:
Sreerama K. Murthy: Automatic Construction of Decision
Trees from Data: A Multi-Disciplinary Survey. Data Min.
Knowl. Discov. 2(4): 345-389 (1998)
- Ripley has a nice chapter on decision trees -- probably
the best place to start.
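The core step in CART-style tree growing is scanning axis-aligned thresholds for the split that minimizes an impurity criterion. The sketch below shows that single step with the Gini index on made-up data; growing the full tree (recursion, stopping rules, pruning) is covered in Breiman et al.

```python
# Find the single best axis-aligned split by weighted Gini impurity.
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p**2)

def best_split(X, y):
    best = (None, None, np.inf)                    # (feature, threshold, impurity)
    n = len(y)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best

X = np.array([[1.0, 5.0], [2.0, 4.0], [3.0, 1.0], [4.0, 2.0]])  # made-up data
y = np.array([0, 0, 1, 1])
print(best_split(X, y))    # a pure split on feature 0 at threshold 2.0
```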
Error estimation
Boosting
- AdaBoost was first developed in Freund and Schapire, A
decision-theoretic generalization of on-line learning and
an application to boosting. Journal of Computer
and System Sciences, 55(1):119-139, 1997. (A stump-based sketch follows this list.)
- A simpler proof of AdaBoost's weak learning property is
given in Robert E. Schapire and Yoram Singer,
Improved boosting algorithms using confidence-rated
predictions. Machine Learning,
37(3):297-336, 1999.
- The view of AdaBoost as performing functional gradient
descent was observed by a number of researchers in the
late 1990s and early 2000s. A representative work is
L. Mason, J. Baxter, P. L. Bartlett, and
M. Frean, Functional gradient techniques for
combining hypotheses. In A. J. Smola, P. L.
Bartlett, B. Schölkopf, and D. Schuurmans,
editors, Advances in Large Margin Classifiers,
pages 221-246. MIT Press, 2000.
- Logitboost was introduced in Friedman, J.H., Hastie, T.,
Tibshirani, R.: Additive
logistic regression: a statistical view of boosting.
Annals of Statistics 28, 337-407 (with discussion)
(2000).
- Empirical
Bernstein Boosting,
Pannagadatta Shivaswamy, Tony Jebara; AISTATS 2010.
- Many other references to boosting can be found on Robert
Schapire's web page.
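The sketch below is a minimal AdaBoost implementation with decision stumps on made-up data, following the reweighting scheme of Freund and Schapire (1997); it is an illustration, not the paper's exact pseudocode.

```python
# AdaBoost with threshold stumps; labels are in {-1, +1}.
import numpy as np

def stump_predictions(X):
    """Enumerate all axis-aligned threshold stumps as +/-1 prediction vectors."""
    preds = []
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            s = np.where(X[:, j] <= t, -1.0, 1.0)
            preds.extend([s, -s])
    return np.array(preds)

def adaboost(X, y, n_rounds=10):
    H = stump_predictions(X)                 # candidate weak hypotheses
    w = np.full(len(y), 1.0 / len(y))        # example weights
    F = np.zeros(len(y))                     # combined score
    for _ in range(n_rounds):
        errs = (H != y) @ w                  # weighted error of each stump
        best = np.argmin(errs)
        eps = max(errs[best], 1e-12)
        alpha = 0.5 * np.log((1 - eps) / eps)
        w *= np.exp(-alpha * y * H[best])    # upweight the mistakes
        w /= w.sum()
        F += alpha * H[best]
    return np.sign(F)

X = np.array([[0.0], [1.0], [2.0], [3.0]])   # made-up data
y = np.array([-1.0, -1.0, 1.0, 1.0])
print(adaboost(X, y))                         # predictions on the training points
```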
Support Vector Machines
- The original paper: Corinna Cortes and V. Vapnik, "Support-Vector
Networks," Machine Learning, 20,
1995
- The standard reference: Schölkopf and Smola, Learning
with Kernels, 2002
- Algorithms for solving the SVM are discussed in E.
Osuna, R. Freund, and F. Girosi,
"Improved training algorithm for support vector
machines," NNSP'97, 1997, http://citeseer.ist.psu.edu/osuna97improved.html, and
in J. Platt, Fast Training of Support Vector Machines
using Sequential Minimal Optimization, in Advances in Kernel Methods - Support Vector
Learning, B. Schölkopf, C. Burges, and A.
Smola, eds., MIT Press, 1999. (A simpler subgradient-based sketch follows this list.)
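The sketch below trains a linear SVM by stochastic subgradient descent on the regularized hinge loss (in the spirit of Pegasos), on made-up data. This is not the SMO or decomposition methods cited above, just a compact alternative that shows the objective being optimized.

```python
# Linear SVM via stochastic subgradient descent on
# lam/2 * ||w||^2 + mean(max(0, 1 - y_i * (x_i . w))), with y in {-1, +1}.
import numpy as np

def linear_svm_sgd(X, y, lam=0.1, n_iter=1000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, n_iter + 1):
        i = rng.integers(len(y))
        eta = 1.0 / (lam * t)                              # decaying step size
        margin = y[i] * (X[i] @ w)
        grad = lam * w - (y[i] * X[i] if margin < 1 else 0.0)
        w -= eta * grad
    return w

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])  # made-up data
y = np.array([1.0, 1.0, -1.0, -1.0])
w = linear_svm_sgd(X, y)
print(w, np.sign(X @ w))
```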
Clustering
- K-means, EM for Gaussian mixture models, and hierarchical
clustering: see the recommended
texts, especially Hastie et al., Duda et al., and Bishop (although Bishop
doesn't discuss hierarchical clustering). K-means is also known as the
Lloyd-Max algorithm in the context of vector quantization. (A sketch of
Lloyd's iteration follows this list.)
- EM was originally introduced in Dempster, A. P., Laird,
N. M., and Rubin, D. B. (1977). Maximum likelihood
from incomplete data via the EM algorithm. Journal of the
Royal Statistical Society, Series B, 39, 1-38.
- Spectral clustering: an excellent introduction to spectral
clustering is the following: U. von Luxburg, "A
Tutorial on Spectral Clustering," Statistics and Computing 17(4),
395-416 (2007).
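A minimal sketch of Lloyd's algorithm for K-means, alternating assignment and centroid-update steps on made-up data, is given below.

```python
# Lloyd's algorithm: assign points to the nearest centroid, then recompute
# each centroid as the mean of its assigned points, until nothing moves.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)                       # assignment step
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])     # update step
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])  # made-up data
print(kmeans(X, k=2))
```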
Dimensionality reduction
- Principal components analysis: The book by Mardia, Kent, and Bibby
derives PCA for the "population" case (the sample case being analogous)
for both the maximum orthogonal variance perspective and the least squares
linear approximation perspective. Note that PCA is also known as the
Karhunen-Loeve transform (KLT). (A small sample-PCA sketch follows this list.)
- Multidimensional scaling: The book by Mardia, Kent and Bibby
has a clean and rigorous derivation of classical MDS, associated
optimality properties, and connections to PCA. It also discusses nonmetric
MDS methods.
- The "majorization" approach to metric MDS via stress minimization is
reviewed and analyzed by Jan de Leeuw, "Convergence of the Majorization
Method for Multidimensional
Scaling," Journal of Classification 5:163-180 (1988)
- Isomap
- Locally
linear embedding (LLE)
- Laplacian
eigenmaps
- Kernel PCA is covered in the book by Schölkopf and Smola, or see the
original paper referenced therein.
- Manifold learning
resource page
- Self-organizing maps, principal curves, and independent component
analysis (ICA) may be reviewed in Hastie et al.
- Factor analysis is treated in Mardia et al.
- An
Introduction to Variable and Feature Selection, an excellent survey
and introduction to methods of variable selection that appeared in
Journal of Machine Learning Research 3 (2003) 1157-1182.
- The following article describes extensive simulations for
various learning algorithms combined with different feature
selection methods, and offers some good intuition: Hua, J., Xiong, Z.,
Lowey, J., Suh, E., and E. R. Dougherty, Optimal
Number of Features as a Function of Sample Size for Various Classification
Rules, Bioinformatics, 21, No. 8, 1509-1515, 2005.
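For the sample case of PCA mentioned above, the following Python/NumPy sketch computes principal components from the eigendecomposition of the sample covariance (equivalently, the SVD of the centered data matrix) on made-up data.

```python
# Sample PCA: center the data, eigendecompose the sample covariance, and
# project onto the leading eigenvectors (maximum-variance directions).
import numpy as np

def pca(X, n_components):
    Xc = X - X.mean(axis=0)                          # center the data
    cov = Xc.T @ Xc / (len(X) - 1)                   # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)           # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components] # take the largest
    components = eigvecs[:, order]                   # top principal directions
    return Xc @ components, eigvals[order]           # scores, explained variances

X = np.array([[2.0, 0.1], [4.0, 0.2], [6.0, 0.15], [8.0, 0.3]])  # made-up data
scores, variances = pca(X, n_components=1)
print(scores.ravel(), variances)
```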
Nonlinear regression and Gaussian Processes