机器学习相关论文整理（仅作为笔记用）

qq_41598247 2020-06-27 05:11:46

其它机器学习、深度学习算法的全面系统讲解可以阅读《机器学习-原理、算法与应用》，清华大学出版社，雷明著，由SIGAI公众号作者倾力打造。

书的购买链接
书的勘误，优化，源代码资源
这篇文章整理出了机器学习、深度学习领域的经典论文。为了减轻大家的阅读负担，只列出了最经典的一批，如有需要，可以自己根据实际情况补充。

机器学习理论

PCA(probably approximately correct)学习理论

[1] L. Valiant. A theory of the learnable. Communications of the ACM, 27, 1984.

VC(Vapnik–Chervonenkis dimension)维

[1] Blumer, A., Ehrenfeucht, A., Haussler, D., Warmuth, M. K. Learnability and the Vapnik–Chervonenkis dimension. Journal of the ACM. 36 (4): 929–865, 1989.

[2] Natarajan, B.K. On Learning sets and functions. Machine Learning. 4: 67–97, 1989.

[3] Karpinski, Marek; Macintyre, Angus. Polynomial Bounds for VC Dimension of Sigmoidal and General Pfaffian Neural Networks. Journal of Computer and System Sciences. 54 (1): 169–176, 1997.

泛化理论

[1] Wolpert, D.H., Macready, W.G. No Free Lunch Theorems for Optimization. IEEE Transactions on Evolutionary Computation 1, 67, 1997.

[2] Wolpert, David. The Lack of A Priori Distinctions between Learning Algorithms. Neural Computation, pp. 1341-1390, 1996.

[3] Wolpert, D.H., and Macready, W.G. Coevolutionary free lunches. IEEE Transactions on Evolutionary Computation, 9(6): 721-735, 2005.

[4] Whitley, Darrell, and Jean Paul Watson. Complexity theory and the no free lunch theorem. In Search Methodologies, pp. 317-339. Springer, Boston, MA, 2005.

[5] Kawaguchi, K., Kaelbling, L.P, and Bengio. Generalization in deep learning. 2017.

[6] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals. Understanding deep learning requires rethinking generalization. international conference on learning representations, 2017.

最优化理论和方法

[1] L. Bottou. Stochastic Gradient Descent Tricks. Neural Networks: Tricks of the Trade. Springer, 2012.

[2] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the Importance of Initialization and Momentum in Deep Learning. Proceedings of the 30th International Conference on Machine Learning, 2013.

[3] Duchi, E. Hazan, and Y. Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. The Journal of Machine Learning Research, 2011.

[4] M. Zeiler. ADADELTA: An Adaptive Learning Rate Method. arXiv preprint, 2012.

[5] T. Tieleman, and G. Hinton. RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.Technical report, 2012.

[6] D. Kingma, J. Ba. Adam: A Method for Stochastic Optimization. International Conference for Learning Representations, 2015.

[7] Hardt, Moritz, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. Proceedings of The 33rd International Conference on Machine Learning. 2016.

决策树

[1] Breiman, L., Friedman, J. Olshen, R. and Stone C. Classification and Regression Trees, Wadsworth, 1984.

[2] J. Ross Quinlan. Induction of decision trees. Machine Learnin, 1(1): 81-106, 1986.

[3] J. Ross Quinlan. Learning efficient classification procedures and their application to chess end games. 1993.

[4] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco, CA, 1993.

贝叶斯分类器

[1] Rish, Irina. An empirical study of the naive Bayes classifier. IJCAI Workshop on Empirical Methods in Artificial Intelligence, 2001.

数据降维

主成分分析（PCA）

[1] Pearson, K. On Lines and Planes of Closest Fit to Systems of Points in Space. Philosophical Magazine. 2 (11): 559–572. 1901.

[2] Ian T. Jolliffe. Principal Component Analysis. Springer Verlag, New York, 1986.

[3] Scholkopf, B.,Smola,A.,Mulller,K.-P. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5), 1299-1319, 1998.

[4] Sebastian Mika,Bernhard Scholkopf,Alexander J Smola,Klausrobert Muller,Matthias Scholz Gun. Kernel PCA and de-noising in feature spaces. neural information processing systems, 1999.

流形学习

[1] Roweis, Sam T and Saul, Lawrence K. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500). 2000: 2323-2326.

[2] Belkin, Mikhail and Niyogi, Partha. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation. 15(6). 2003:1373-1396.

[3] He Xiaofei and Niyogi, Partha. Locality preserving projections. NIPS. 2003:234-241.

[4] Tenenbaum, Joshua B and De Silva, Vin and Langford, John C. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500). 2000: 2319-2323.

[5] Laurens Van Der Maaten, Geoffrey E Hinton. Visualizing Data using t-SNE. 2008, Journal of Machine Learning Research.

线性判别分析（LDA）

[1] Ronald A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7 Part 2: 179-188, 1936.

[2] Geoffrey J. McLachlan. Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York, 1992.

logistic回归

[1] Cox, DR. The regression analysis of binary sequences (with discussion). J Roy Stat Soc B. 20 (2): 215–242, 1958.

[2] David W Hosmer, Stanley Lemeshow. Applied logistic regression. Technometrics. 2000.

[3] Thomas P. Minka. A comparison of numerical optimizers for logistic regression, 2003.

[4] Kwangmoo Koh, Seung-Jean Kim, and Stephen Boyd. An interior-point method for large scale l1-regularized logistic regression. Journal of Machine Learning Research, 8:1519-1555, 2007.

[5] Chih-Jen Lin, Ruby C.Weng, S.Sathiya Keerthi. Trust Region Newton Method for Large-Scale Logistic Regrression. Journal of Machine Learning Research,9, 627-650, 2008.

[6] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, Chih-Jen Lin. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9, 1871-1874, 2008.

支持向量机（SVM）

[1] B.E.Boser, I.Guyon, and V. Vapni. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144-152. ACM Press, 1992.

[2] Cortes, C. and Vapnik, V. Support vector networks. Machine Learning, 20, 273-297, 1995.

[3] Bernhard Scholkopf, Christopher J. C. Burges, and Valdimir Vapnik. Extracting support data for a given task. 1995.

[4] Burges JC. A tutorial on support vector machines for pattern recognition. Bell Laboratories, Lucent Technologies, 1997.

[5] Scholkopf, Christopher J. C. Burges, and Alexander J. Smola, editor. Advances in Kernel Methods - Support Vector Learning, Cambridge, MA, MIT Press. 1998.

[6] John C. Platt. Fast training of support vector machines using sequential minimal optimization. 1998.

[7] C.-C. Chang and C.-J. Lin. LIBSVM: a Library for Support Vector Machines. ACM TIST, 2:27:1-27:27, 2011.

距离度量学习

[1] S. Chopra, R. Hadsell, Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2005), pages 349-356, San Diego, CA, 2005.

[2] Kilian Q Weinberger, Lawrence K Saul. Distance Metric Learning for Large Margin Nearest Neighbor Classification. Journal of Machine Learning Research, 2009.

集成学习

Bagging与随机森林

[1] Breiman, Leo. Random Forests. Machine Learning 45 (1), 5-32, 2001.

Boosting算法

[1] Freund, Y. Boosting a weak learning algorithm by majority. Information and Computation, 1995.

[2] Yoav Freund, Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. computational learning theory. 1995.

[3] Freund, Y. An adaptive version of the boost by majority algorithm. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, 1999.

[4] R.Schapire. The boosting approach to machine learning: An overview. In MSRI Workshop on Nonlinear Estimation and Classification, Berkeley, CA, 2001.

[5] Freund Y, Schapire RE. A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5):771-780. 1999.

[7] Jerome Friedman, Trevor Hastie and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics 28(2), 337–407. 2000.

[8] Jerome H Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 2001.

[9] Tianqi Chen, Carlos Guestrin. XGBoost: A Scalable Tree Boosting System. knowledge discovery and data mining, 2016.

[10] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, Tieyan Liu. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. neural information processing systems, 2017.

概率图模型

贝叶斯网络

[1] Nir Friedman, Dan Geiger, Moises Goldszmidt. Bayesian Network Classifiers. Machine Learning.1997.

隐马尔可夫模型

[1] Baum, L. E., Petrie, T. Statistical Inference for Probabilistic Functions of Finite State Markov Chains. The Annals of Mathematical Statistics. 37 (6): 1554–1563. 1966.

[2] Baum, L. E., Eagon, J. A. An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology. Bulletin of the American Mathematical Society. 73 (3): 360. 1967.

[3] Baum, L. E., Petrie, T., Soules, G., Weiss, N. A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains. The Annals of Mathematical Statistics. 41: 164. 1970

[4] Baum, L.E. An Inequality and Associated Maximization Technique in Statistical Estimation of Probabilistic Functions of a Markov Process. Inequalities. 3: 1–8. 1972.

[5] Lawrence R. Rabiner. A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of the IEEE. 77 (2): 257–286. 1989.

条件随机场

[1] Lafferty, J., McCallum, A., Pereira, F. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proc. 18th International Conf. on Machine Learning. Morgan Kaufmann. pp. 282–289. 2001.

内容转自：https://zhuanlan.zhihu.com/p/50837564?utm_source=wechat_session

...全文