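A collection of machine learning papers (for personal notes only)
For a comprehensive, systematic treatment of other machine learning and deep learning algorithms, see 《机器学习-原理、算法与应用》 (Machine Learning: Principles, Algorithms, and Applications), Tsinghua University Press, by Lei Ming, the author behind the SIGAI WeChat public account.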
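This article collects classic papers in machine learning and deep learning. To keep the reading load manageable, only the most classic ones are listed; add others as your own needs require.
Machine learning theory
PAC (probably approximately correct) learning theory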
[1] L. Valiant. A theory of the learnable. Communications of the ACM, 27, 1984.
VC (Vapnik–Chervonenkis) dimension
[1] Blumer, A., Ehrenfeucht, A., Haussler, D., Warmuth, M. K. Learnability and the Vapnik–Chervonenkis dimension. Journal of the ACM. 36 (4): 929–965, 1989.
[2] Natarajan, B.K. On Learning sets and functions. Machine Learning. 4: 67–97, 1989.
[3] Karpinski, Marek; Macintyre, Angus. Polynomial Bounds for VC Dimension of Sigmoidal and General Pfaffian Neural Networks. Journal of Computer and System Sciences. 54 (1): 169–176, 1997.
Generalization theory
[1] Wolpert, D.H., Macready, W.G. No Free Lunch Theorems for Optimization. IEEE Transactions on Evolutionary Computation 1, 67, 1997.
[2] Wolpert, David. The Lack of A Priori Distinctions between Learning Algorithms. Neural Computation, pp. 1341-1390, 1996.
[3] Wolpert, D.H., and Macready, W.G. Coevolutionary free lunches. IEEE Transactions on Evolutionary Computation, 9(6): 721-735, 2005.
[4] Whitley, Darrell, and Jean Paul Watson. Complexity theory and the no free lunch theorem. In Search Methodologies, pp. 317-339. Springer, Boston, MA, 2005.
[5] Kawaguchi, K., Kaelbling, L.P., and Bengio, Y. Generalization in deep learning. 2017.
[6] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals. Understanding deep learning requires rethinking generalization. International Conference on Learning Representations, 2017.
Optimization theory and methods
[1] L. Bottou. Stochastic Gradient Descent Tricks. Neural Networks: Tricks of the Trade. Springer, 2012.
[2] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the Importance of Initialization and Momentum in Deep Learning. Proceedings of the 30th International Conference on Machine Learning, 2013.
[3] J. Duchi, E. Hazan, and Y. Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. The Journal of Machine Learning Research, 2011.
[4] M. Zeiler. ADADELTA: An Adaptive Learning Rate Method. arXiv preprint, 2012.
[5] T. Tieleman, and G. Hinton. RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning. Technical report, 2012.
[6] D. Kingma, J. Ba. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations, 2015.
[7] Hardt, Moritz, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. Proceedings of The 33rd International Conference on Machine Learning. 2016.
Decision trees
[1] Breiman, L., Friedman, J., Olshen, R., and Stone, C. Classification and Regression Trees. Wadsworth, 1984.
[2] J. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1): 81-106, 1986.
[3] J. Ross Quinlan. Learning efficient classification procedures and their application to chess end games. 1983.
[4] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco, CA, 1993.
Bayesian classifiers
[1] Rish, Irina. An empirical study of the naive Bayes classifier. IJCAI Workshop on Empirical Methods in Artificial Intelligence, 2001.
Dimensionality reduction
Principal component analysis (PCA)
[1] Pearson, K. On Lines and Planes of Closest Fit to Systems of Points in Space. Philosophical Magazine. 2 (11): 559–572. 1901.
[2] Ian T. Jolliffe. Principal Component Analysis. Springer Verlag, New York, 1986.
[3] Schölkopf, B., Smola, A., Müller, K.-R. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5), 1299-1319, 1998.
[4] Sebastian Mika, Bernhard Schölkopf, Alexander J. Smola, Klaus-Robert Müller, Matthias Scholz, Gunnar Rätsch. Kernel PCA and de-noising in feature spaces. Advances in Neural Information Processing Systems, 1999.
Manifold learning
[1] Roweis, Sam T and Saul, Lawrence K. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500). 2000: 2323-2326.
[2] Belkin, Mikhail and Niyogi, Partha. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation. 15(6). 2003:1373-1396.
[3] He Xiaofei and Niyogi, Partha. Locality preserving projections. NIPS. 2003:234-241.
[4] Tenenbaum, Joshua B and De Silva, Vin and Langford, John C. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500). 2000: 2319-2323.
[5] Laurens Van Der Maaten, Geoffrey E Hinton. Visualizing Data using t-SNE. Journal of Machine Learning Research, 2008.
Linear discriminant analysis (LDA)
[1] Ronald A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7 Part 2: 179-188, 1936.
[2] Geoffrey J. McLachlan. Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York, 1992.
Logistic regression
[1] Cox, DR. The regression analysis of binary sequences (with discussion). J Roy Stat Soc B. 20 (2): 215–242, 1958.
[2] David W Hosmer, Stanley Lemeshow. Applied logistic regression. Technometrics. 2000.
[3] Thomas P. Minka. A comparison of numerical optimizers for logistic regression, 2003.
[4] Kwangmoo Koh, Seung-Jean Kim, and Stephen Boyd. An interior-point method for large scale l1-regularized logistic regression. Journal of Machine Learning Research, 8:1519-1555, 2007.
[5] Chih-Jen Lin, Ruby C. Weng, S. Sathiya Keerthi. Trust Region Newton Method for Large-Scale Logistic Regression. Journal of Machine Learning Research, 9, 627-650, 2008.
[6] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, Chih-Jen Lin. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9, 1871-1874, 2008.
Support vector machines (SVM)
[1] B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144-152. ACM Press, 1992.
[2] Cortes, C. and Vapnik, V. Support vector networks. Machine Learning, 20, 273-297, 1995.
[3] Bernhard Schölkopf, Christopher J. C. Burges, and Vladimir Vapnik. Extracting support data for a given task. 1995.
[4] Christopher J. C. Burges. A tutorial on support vector machines for pattern recognition. Bell Laboratories, Lucent Technologies, 1997.
[5] Bernhard Schölkopf, Christopher J. C. Burges, and Alexander J. Smola, editors. Advances in Kernel Methods - Support Vector Learning, Cambridge, MA, MIT Press. 1998.
[6] John C. Platt. Fast training of support vector machines using sequential minimal optimization. 1998.
[7] C.-C. Chang and C.-J. Lin. LIBSVM: a Library for Support Vector Machines. ACM TIST, 2:27:1-27:27, 2011.
Distance metric learning
[1] S. Chopra, R. Hadsell, Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2005), pages 349-356, San Diego, CA, 2005.
[2] Kilian Q Weinberger, Lawrence K Saul. Distance Metric Learning for Large Margin Nearest Neighbor Classification. Journal of Machine Learning Research, 2009.
Ensemble learning
Bagging and random forests
[1] Breiman, Leo. Random Forests. Machine Learning 45 (1), 5-32, 2001.
Boosting algorithms
[1] Freund, Y. Boosting a weak learning algorithm by majority. Information and Computation, 1995.
[2] Yoav Freund, Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Computational Learning Theory, 1995.
[3] Freund, Y. An adaptive version of the boost by majority algorithm. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, 1999.
[4] R. Schapire. The boosting approach to machine learning: An overview. In MSRI Workshop on Nonlinear Estimation and Classification, Berkeley, CA, 2001.
[5] Freund Y, Schapire RE. A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5):771-780. 1999.
[6] Jerome Friedman, Trevor Hastie and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics 28(2), 337–407. 2000.
[7] Jerome H Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 2001.
[8] Tianqi Chen, Carlos Guestrin. XGBoost: A Scalable Tree Boosting System. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
[9] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, Tieyan Liu. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Advances in Neural Information Processing Systems, 2017.
Probabilistic graphical models
Bayesian networks
[1] Nir Friedman, Dan Geiger, Moises Goldszmidt. Bayesian Network Classifiers. Machine Learning, 1997.
Hidden Markov models
[1] Baum, L. E., Petrie, T. Statistical Inference for Probabilistic Functions of Finite State Markov Chains. The Annals of Mathematical Statistics. 37 (6): 1554–1563. 1966.
[2] Baum, L. E., Eagon, J. A. An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology. Bulletin of the American Mathematical Society. 73 (3): 360. 1967.
[3] Baum, L. E., Petrie, T., Soules, G., Weiss, N. A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains. The Annals of Mathematical Statistics. 41: 164. 1970.
[4] Baum, L.E. An Inequality and Associated Maximization Technique in Statistical Estimation of Probabilistic Functions of a Markov Process. Inequalities. 3: 1–8. 1972.
[5] Lawrence R. Rabiner. A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of the IEEE. 77 (2): 257–286. 1989.
Conditional random fields
[1] Lafferty, J., McCallum, A., Pereira, F. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proc. 18th International Conf. on Machine Learning. Morgan Kaufmann. pp. 282–289. 2001.
Source: https://zhuanlan.zhihu.com/p/50837564?utm_source=wechat_session