| Authors | Shay Cohen; Sergios Theodoridis |
| Series | Intelligent Science and Technology Series |
| Publisher | China Machine Press |
| ISBN | 9782012111720 |
| Description |
---------------------------Bayesian Analysis in Natural Language Processing (2nd Edition)---------------------------
This book covers the methods and algorithms needed to read the literature on Bayesian learning in NLP fluently and to conduct research in the field. These methods and algorithms derive partly from machine learning and statistics, and partly were developed specifically for NLP. The book covers inference techniques such as Markov chain Monte Carlo sampling and variational inference, Bayesian estimation, and nonparametric modeling. To keep pace with the field's rapid evolution, this edition adds a new chapter on representation learning and neural networks in the Bayesian context. It also introduces fundamental concepts of Bayesian statistics, such as prior distributions, conjugacy, and generative modeling. Finally, it reviews basic NLP modeling techniques, such as grammar modeling, neural networks, and representation learning, and their use in Bayesian analysis.

---------------------------Machine Learning: A Bayesian and Optimization Perspective (2nd Edition, English)---------------------------
This book presents machine learning from a unified viewpoint through the two pillars of supervised learning: regression and classification. It begins with the fundamentals, including mean-square, least-squares, and maximum-likelihood methods, ridge regression, Bayesian decision theory classification, logistic regression, and decision trees. It then introduces more recent techniques, including sparsity-aware modeling methods; learning in reproducing kernel Hilbert spaces and in support vector machines; Bayesian inference, with a focus on the EM algorithm and its approximate variational-inference variants; Monte Carlo methods; probabilistic graphical models, focusing on Bayesian networks; hidden Markov models; and particle filtering. Dimensionality reduction and latent variable modeling are also treated in depth, and the book concludes with an extended chapter on neural networks and deep learning architectures. In addition, it covers the basics of statistical parameter estimation, Wiener and Kalman filtering, and convexity and convex optimization, devoting one chapter to stochastic approximation and the gradient descent family of algorithms, and presenting related concepts and algorithms for distributed optimization along with online learning techniques.
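Neither blurb shows any of this machinery in action, so here is a minimal illustrative sketch of the conjugacy idea that both descriptions mention, written in plain Python. The Beta-Binomial model choice, the function name, and the numbers are assumptions made for illustration; they are not examples from either book.

```python
# Minimal sketch of conjugate Bayesian updating with a Beta-Binomial model.
# Illustrative only: the model choice and the numbers below are assumed,
# not taken from either book.

def update_beta_binomial(alpha, beta, successes, failures):
    """Return the posterior Beta parameters after observing Binomial data.

    Conjugacy keeps the posterior in the prior's family:
    Beta(alpha, beta) prior + (successes, failures) observations
    -> Beta(alpha + successes, beta + failures) posterior.
    """
    return alpha + successes, beta + failures

# Uniform prior Beta(1, 1); observe 7 successes and 3 failures.
a, b = update_beta_binomial(1.0, 1.0, successes=7, failures=3)
print(f"Posterior: Beta({a:g}, {b:g}), mean = {a / (a + b):.2f}")
# Prints: Posterior: Beta(8, 4), mean = 0.67
```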
| Table of Contents |
---------------------------Bayesian Analysis in Natural Language Processing (2nd Edition)---------------------------
Translator's Preface
Preface to the Second Edition
Preface to the First Edition
Acknowledgments for the First Edition
Chapter 1 Preliminaries  1
1.1 Probability Measures  1
1.2 Random Variables  2
1.2.1 Continuous and Discrete Random Variables  2
1.2.2 Joint Distributions over Multiple Random Variables  3
1.3 Conditional Distributions  4
1.3.1 Bayes' Rule  5
1.3.2 Independent and Conditionally Independent Random Variables  6
1.3.3 Exchangeable Random Variables  6
1.4 Expectations of Random Variables  7
1.5 Models  9
1.5.1 Parametric vs. Nonparametric Models  9
1.5.2 Inference with Models  10
1.5.3 Generative Models  11
1.5.4 Independence Assumptions in Models  13
1.5.5 Directed Graphical Models  13
1.6 Learning from Data Scenarios  15
1.7 Bayesian and Frequentist Philosophy (the Tip of the Iceberg)  17
1.8 Summary  17
1.9 Exercises  18
Chapter 2 Introduction  19
2.1 Overview: Where Bayesian Statistics and Natural Language Processing Meet  19
2.2 First Example: The Latent Dirichlet Allocation Model  22
2.2.1 The Dirichlet Distribution  26
2.2.2 Inference  28
2.2.3 Summary  29
2.3 Second Example: Bayesian Text Regression  30
2.4 Summary  31
2.5 Exercises  31
Chapter 3 Priors  33
3.1 Conjugate Priors  33
3.1.1 Conjugate Priors and Normalization Constants  36
3.1.2 The Use of Conjugate Priors with Latent Variable Models  37
3.1.3 Mixtures of Conjugate Priors  38
3.1.4 Renormalized Conjugate Distributions  39
3.1.5 Discussion: To Be or Not to Be Conjugate  39
3.1.6 Summary  40
3.2 Priors over Multinomial and Categorical Distributions  40
3.2.1 The Dirichlet Distribution Revisited  41
3.2.2 The Logistic Normal Distribution  44
3.2.3 Discussion  48
3.2.4 Summary  49
3.3 Non-Informative Priors  49
3.3.1 Uniform and Improper Priors  50
3.3.2 Jeffreys Priors  51
3.3.3 Discussion  51
3.4 Conjugacy and Exponential Models  52
3.5 Multiple Parameter Draws in Models  53
3.6 Structural Priors  54
3.7 Summary  55
3.8 Exercises  56
Chapter 4 Bayesian Estimation  57
4.1 Learning with Latent Variables: Two Views  58
4.2 Bayesian Point Estimation  58
4.2.1 Maximum a Posteriori Estimation  59
4.2.2 Posterior Approximations Based on the MAP Solution  64
4.2.3 Decision-Theoretic Point Estimation  65
4.2.4 Summary  66
4.3 Empirical Bayes  66
4.4 Asymptotic Behavior of the Posterior  68
4.5 Summary  69
4.6 Exercises  69
Chapter 5 Sampling Algorithms  70
5.1 MCMC Algorithms: Overview  71
5.2 NLP Model Structure for MCMC Inference  71
5.3 Gibbs Sampling  73
5.3.1 Collapsed Gibbs Sampling  76
5.3.2 Operator View  79
5.3.3 Parallelizing the Gibbs Sampler  80
5.3.4 Summary  81
5.4 The Metropolis-Hastings Algorithm  82
5.5 Slice Sampling  84
5.5.1 Auxiliary Variable Sampling  85
5.5.2 The Use of Slice Sampling and Auxiliary Variable Sampling in NLP  85
5.6 Simulated Annealing  86
5.7 Convergence of MCMC Algorithms  86
5.8 Markov Chains: Basic Theory  88
5.9 Sampling Algorithms Not in the MCMC Realm  89
5.10 Monte Carlo Integration  91
5.11 Discussion  93
5.11.1 Computability of Distributions vs. Sampling  93
5.11.2 Nested MCMC Sampling  93
5.11.3 Runtime of MCMC Methods  93
5.11.4 Particle Filtering  93
5.12 Summary  95
5.13 Exercises  95
Chapter 6 Variational Inference  97
6.1 Variational Bound on the Marginal Log-Likelihood  97
6.2 Mean-Field Approximation  99
6.3 Mean-Field Variational Inference Algorithm  100
6.3.1 Dirichlet-Multinomial Variational Inference  101
6.3.2 Connection to the Expectation-Maximization Algorithm  104
6.4 Empirical Bayes with Variational Inference  106
6.5 Discussion  106
6.5.1 Initialization of the Inference Algorithms  107
6.5.2 Convergence Diagnostics  107
6.5.3 The Use of Variational Inference for Decoding  107
6.5.4 Variational Inference as KL Divergence Minimization  108
6.5.5 Online Variational Inference  109
6.6 Summary  109
6.7 Exercises  109
Chapter 7 Nonparametric Priors  111
7.1 The Dirichlet Process: Three Views  112
7.1.1 The Stick-Breaking Process  112
7.1.2 The Chinese Restaurant Process  114
7.2 Dirichlet Process Mixtures  115
7.2.1 Inference with Dirichlet Process Mixtures  116
7.2.2 The Dirichlet Process Mixture as a Limit of Mixture Models  118
7.3 The Hierarchical Dirichlet Process  119
7.4 The Pitman-Yor Process  120
7.4.1 Pitman-Yor Processes for Language Modeling  121
7.4.2 Power-Law Behavior of the Pitman-Yor Process  122
7.5 Discussion  123
7.5.1 Gaussian Processes  124
7.5.2 The Indian Buffet Process  124
7.5.3 Nested Chinese Restaurant Processes  125
7.5.4 Distance-Dependent Chinese Restaurant Processes  125
7.5.5 Sequence Memoizers  126
7.6 Summary  126
7.7 Exercises  127
Chapter 8 Bayesian Grammar Models  128
8.1 Bayesian Hidden Markov Models  129
8.2 Probabilistic Context-Free Grammars  131
8.2.1 PCFGs as Collections of Multinomials  133
8.2.2 Basic Inference Algorithms for PCFGs  133
8.2.3 Hidden Markov Models as PCFGs  136
8.3 Bayesian Probabilistic Context-Free Grammars  137
8.3.1 Priors on PCFGs  137
8.3.2 Monte Carlo Inference with Bayesian PCFGs  138
8.3.3 Variational Inference with Bayesian PCFGs  139
8.4 Adaptor Grammars  140
8.4.1 Pitman-Yor Adaptor Grammars  141
8.4.2 The Stick-Breaking View of PYAGs  142
8.4.3 Inference with PYAGs  143
8.5 Hierarchical Dirichlet Process PCFGs  144
8.6 Dependency Grammars  147
8.7 Synchronous Grammars  148
8.8 Multilingual Learning  149
8.8.1 Part-of-Speech Tagging  149
8.8.2 Grammar Induction  151
8.9 Further Reading  152
8.10 Summary  153
8.11 Exercises  153
Chapter 9 Representation Learning and Neural Networks  155
9.1 Neural Networks and Representation Learning: Why Now?  155
9.2 Word Embeddings  158
9.2.1 Skip-Gram Models for Word Embeddings  158
9.2.2 Bayesian Skip-Gram Word Embeddings  160
9.2.3 Discussion  161
9.3 Neural Networks  162
9.3.1 Frequentist Estimation and the Backpropagation Algorithm  164
9.3.2 Priors on Neural Network Weights  166
9.4 Modern Use of Neural Networks in Natural Language Processing  168
9.4.1 Recurrent and Recursive Neural Networks  168
9.4.2 The Vanishing and Exploding Gradient Problem  169
9.4.3 Neural Encoder-Decoder Models  172
9.4.4 Convolutional Neural Networks  175
9.5 Tuning Neural Networks  177
9.5.1 Regularization  177
9.5.2 Hyperparameter Tuning  178
9.6 Generative Modeling with Neural Networks  180
9.6.1 Variational Autoencoders  180
9.6.2 Generative Adversarial Networks  185
9.7 Summary  186
9.8 Exercises  187
Closing Remarks  189
Appendix A Basic Concepts  191
Appendix B Catalog of Probability Distributions  197
References  203
---------------------------Machine Learning: A Bayesian and Optimization Perspective (2nd Edition, English)---------------------------
Preface  iv
Acknowledgments  vi
About the Author  viii
Notation  ix
CHAPTER 1 Introduction  1
1.1 The Historical Context  1
1.2 Artificial Intelligence and Machine Learning  2
1.3 Algorithms Can Learn What Is Hidden in the Data  4
1.4 Typical Applications of Machine Learning  6
Speech Recognition  6
Computer Vision  6
Multimodal Data  6
Natural Language Processing  7
Robotics  7
Autonomous Cars  7
Challenges for the Future  8
1.5 Machine Learning: Major Directions  8
1.5.1 Supervised Learning  8
1.6 Unsupervised and Semisupervised Learning  11
1.7 Structure and a Road Map of the Book  12
References  16
CHAPTER 2 Probability and Stochastic Processes  19
2.1 Introduction  20
2.2 Probability and Random Variables  20
2.2.1 Probability  20
2.2.2 Discrete Random Variables  22
2.2.3 Continuous Random Variables  24
2.2.4 Mean and Variance  25
2.2.5 Transformation of Random Variables  28
2.3 Examples of Distributions  29
2.3.1 Discrete Variables  29
2.3.2 Continuous Variables  32
2.4 Stochastic Processes  41
2.4.1 First- and Second-Order Statistics  42
2.4.2 Stationarity and Ergodicity  43
2.4.3 Power Spectral Density  46
2.4.4 Autoregressive Models  51
2.5 Information Theory  54
2.5.1 Discrete Random Variables  56
2.5.2 Continuous Random Variables  59
2.6 Stochastic Convergence  61
Convergence Everywhere  62
Convergence Almost Everywhere  62
Convergence in the Mean-Square Sense  62
Convergence in Probability  63
Convergence in Distribution  63
Problems  63
References  65
CHAPTER 3 Learning in Parametric Modeling: Basic Concepts and Directions  67
3.1 Introduction  67
3.2 Parameter Estimation: the Deterministic Point of View  68
3.3 Linear Regression  71
3.4 Classification  75
Generative Versus Discriminative Learning  78
3.5 Biased Versus Unbiased Estimation  80
3.5.1 Biased or Unbiased Estimation  81
3.6 The Cramér-Rao Lower Bound  83
3.7 Sufficient Statistic  87
3.8 Regularization  89
Inverse Problems: Ill-Conditioning and Overfitting  91
3.9 The Bias-Variance Dilemma  93
3.9.1 Mean-Square Error Estimation  94
3.9.2 Bias-Variance Tradeoff  95
3.10 Maximum Likelihood Method  98
3.10.1 Linear Regression: the Nonwhite Gaussian Noise Case  101
3.11 Bayesian Inference  102
3.11.1 The Maximum a Posteriori Probability Estimation Method  107
3.12 Curse of Dimensionality  108
3.13 Validation  109
Cross-Validation  111
3.14 Expected Loss and Empirical Risk Functions  112
Learnability  113
3.15 Nonparametric Modeling and Estimation  114
Problems  114
MATLAB Exercises  119
References  119
CHAPTER 4 Mean-Square Error Linear Estimation  121
4.1 Introduction  121
4.2 Mean-Square Error Linear Estimation: the Normal Equations  122
4.2.1 The Cost Function Surface  123
4.3 A Geometric Viewpoint: Orthogonality Condition  124
4.4 Extension to Complex-Valued Variables  127
4.4.1 Widely Linear Complex-Valued Estimation  129
4.4.2 Optimizing With Respect to Complex-Valued Variables: Wirtinger Calculus  132
4.5 Linear Filtering  134
4.6 MSE Linear Filtering: a Frequency Domain Point of View  136
Deconvolution: Image Deblurring  137
4.7 Some Typical Applications  140
4.7.1 Interference Cancelation  140
4.7.2 System Identification  141
4.7.3 Deconvolution: Channel Equalization  143
4.8 Algorithmic Aspects: the Levinson and Lattice-Ladder Algorithms  149
Forward and Backward MSE Optimal Predictors  151
4.8.1 The Lattice-Ladder Scheme  154
4.9 Mean-Square Error Estimation of Linear Models  158
4.9.1 The Gauss-Markov Theorem  160
4.9.2 Constrained Linear Estimation: the Beamforming Case  162
4.10 Time-Varying Statistics: Kalman Filtering  166
Problems  172
MATLAB Exercises  174
References  176
CHAPTER 5 Online Learning: the Stochastic Gradient Descent Family of Algorithms  179
5.1 Introduction  180
5.2 The Steepest Descent Method  181
5.3 Application to the Mean-Square Error Cost Function  184
Time-Varying Step Sizes  190
5.3.1 The Complex-Valued Case  193
5.4 Stochastic Approximation  194
Application to the MSE Linear Estimation  196
5.5 The Least-Mean-Squares Adaptive Algorithm  198
5.5.1 Convergence and Steady-State Performance of the LMS in Stationary Environments  199
5.5.2 Cumulative Loss Bounds  204
5.6 The Affine Projection Algorithm  206
Geometric Interpretation of APA  208
Orthogonal Projections  208
5.6.1 The Normalized LMS  211
5.7 The Complex-Valued Case  213
The Widely Linear LMS  213
The Widely Linear APA  214
5.8 Relatives of the LMS  214
The Sign-Error LMS  214
The Least-Mean-Fourth (LMF) Algorithm  215
Transform-Domain LMS  215
5.9 Simulation Examples  218
5.10 Adaptive Decision Feedback Equalization  221
5.11 The Linearly Constrained LMS  224
5.12 Tracking Performance of the LMS in Nonstationary Environments  225
5.13 Distributed Learning: the Distributed LMS  227
5.13.1 Cooperation Strategies  228
5.13.2 The Diffusion LMS  231
5.13.3 Convergence and Steady-State Performance: Some Highlights  237
5.13.4 Consensus-Based Distributed Schemes  240
5.14 A Case Study: Target Localization  241
5.15 Some Concluding Remarks: Consensus Matrix  243
Problems  244
MATLAB Exercises  246
References  247
CHAPTER 6 The Least-Squares Family  253
6.1 Introduction  253
6.2 Least-Squares Linear Regression: a Geometric Perspective  254
6.3 Statistical Properties of the LS Estimator  257
The LS Estimator Is Unbiased  257
Covariance Matrix of the LS Estimator  257
The LS Estimator Is BLUE in the Presence of White Noise  258
The LS Estimator Achieves the Cramér-Rao Bound for White Gaussian Noise  259
Asymptotic Distribution of the LS Estimator  260
6.4 Orthogonalizing the Column Space of the Input Matrix: the SVD Method  260
Pseudoinverse Matrix and SVD  262
6.5 Ridge Regression: a Geometric Point of View  265
Principal Components Regression  267
6.6 The Recursive Least-Squares Algorithm  268
Time-Iterative Computations  269
Time Updating of the Parameters  270
6.7 Newton's Iterative Minimization Method  271
6.7.1 RLS and Newton's Method  274
6.8 Steady-State Performance of the RLS  275
6.9 Complex-Valued Data: the Widely Linear RLS  277
6.10 Computational Aspects of the LS Solution  279
Cholesky Factorization  279
QR Factorization  279
Fast RLS Versions  280
6.11 The Coordinate and Cyclic Coordinate Descent Methods  281
6.12 Simulation Examples  283
6.13 Total Least-Squares  286
Geometric Interpretation of the Total Least-Squares Method  291
Problems  293
MATLAB Exercises  296
References  297
CHAPTER 7 Classification: a Tour of the Classics  301
7.1 Introduction  301
7.2 Bayesian Classification  302
The Bayesian Classifier Minimizes the Misclassification Error  303
7.2.1 Average Risk  304
7.3 Decision (Hyper)Surfaces  307
7.3.1 The Gaussian Distribution Case  309
7.4 The Naive Bayes Classifier  315
7.5 The Nearest Neighbor Rule  315
7.6 Logistic Regression  317
7.7 Fisher's Linear Discriminant  322
7.7.1 Scatter Matrices  323
7.7.2 Fisher's Discriminant: the Two-Class Case  325
7.7.3 Fisher's Discriminant: the Multiclass Case  328
7.8 Classification Trees  329
7.9 Combining Classifiers  333
No Free Lunch Theorem  334
Some Experimental Comparisons  334
Schemes for Combining Classifiers  335
7.10 The Boosting Approach  337
The AdaBoost Algorithm  337
The Log-Loss Function  341
7.11 Boosting Trees  343
Problems  345
MATLAB Exercises  347
References  349
CHAPTER 8 Parameter Learning: a Convex Analytic Path  351
8.1 Introduction  352
8.2 Convex Sets and Functions  352
8.2.1 Convex Sets  353
8.2.2 Convex Functions  354
8.3 Projections Onto Convex Sets  357
8.3.1 Properties of Projections  361
8.4 Fundamental Theorem of Projections Onto Convex Sets  365
8.5 A Parallel Version of POCS  369
8.6 From Convex Sets to Parameter Estimation and Machine Learning  369
8.6.1 Regression  369
8.6.2 Classification  373
8.7 Infinitely Many Closed Convex Sets: the Online Learning Case  374
8.7.1 Convergence of APSM  376
8.8 Constrained Learning  380
8.9 The Distributed APSM  382
8.10 Optimizing Nonsmooth Convex Cost Functions  384
8.10.1 Subgradients and Subdifferentials  385
8.10.2 Minimizing Nonsmooth Continuous Convex Loss Functions: the Batch Learning Case  388
8.10.3 Online Learning for Convex Optimization  393
8.11 Regret Analysis  396
Regret Analysis of the Subgradient Algorithm  398
8.12 Online Learning and Big Data Applications: a Discussion  399
Approximation, Estimation, and Optimization Errors  400
Batch Versus Online Learning  402
8.13 Proximal Operators  405
8.13.1 Properties of the Proximal Operator  407
8.13.2 Proximal Minimization  409
8.14 Proximal Splitting Methods for Optimization  412
The Proximal Forward-Backward Splitting Operator  413
Alternating Direction Method of Multipliers (ADMM)  414
Mirror Descent Algorithms  415
8.15 Distributed Optimization: Some Highlights  417
Problems  417
MATLAB Exercises  420
References  422
CHAPTER 9 Sparsity-Aware Learning: Concepts and Theoretical Foundations  427
9.1 Introduction  427
9.2 Searching for a Norm  428
9.3 The Least Absolute Shrinkage and Selection Operator (LASSO)  431
9.4 Sparse Signal Representation  436
9.5 In Search of the Sparsest Solution  440
The ℓ2 Norm Minimizer  441
The ℓ0 Norm Minimizer  442
The ℓ1 Norm Minimizer  442
Characterization of the ℓ1 Norm Minimizer  443
Geometric Interpretation  444
9.6 Uniqueness of the ℓ0 Minimizer  447
9.6.1 Mutual Coherence  449
9.7 Equivalence of ℓ0 and ℓ1 Minimizers: Sufficiency Conditions  451
9.7.1 Condition Implied by the Mutual Coherence Number  451
9.7.2 The Restricted Isometry Property (RIP)  452
9.8 Robust Sparse Signal Recovery From Noisy Measurements  455
9.9 Compressed Sensing: the Glory of Randomness  456
Compressed Sensing  456
9.9.1 Dimensionality Reduction and Stable Embeddings  458
9.9.2 Sub-Nyquist Sampling: Analog-to-Information Conversion  460
9.10 A Case Study: Image Denoising  463
Problems  465
MATLAB Exercises  468
References  469
CHAPTER 10 Sparsity-Aware Learning: Algorithms and Applications  473
10.1 Introduction  473
10.2 Sparsity Promoting Algorithms  474
10.2.1 Greedy Algorithms  474
10.2.2 Iterative Shrinkage/Thresholding (IST) Algorithms  480
10.2.3 Which Algorithm? Some Practical Hints  487
10.3 Variations on the Sparsity-Aware Theme  492
10.4 Online Sparsity Promoting Algorithms  499
10.4.1 LASSO: Asymptotic Performance  500
10.4.2 The Adaptive Norm-Weighted LASSO  502
10.4.3 Adaptive CoSaMP Algorithm  504
10.4.4 Sparse-Adaptive Projection Subgradient Method  505
10.5 Learning Sparse Analysis Models  510
10.5.1 Compressed Sensing for Sparse Signal Representation in Coherent Dictionaries  512
10.5.2 Cosparsity  513
10.6 A Case Study: Time-Frequency Analysis  516
Gabor Transform and Frames  516
Time-Frequency Resolution  517
Gabor Frames  518
Time-Frequency Analysis of Echolocation Signals Emitted by Bats  519
Problems  523
MATLAB Exercises  524
References  525
CHAPTER 11 Learning in Reproducing Kernel Hilbert Spaces  531
11.1 Introduction  532
11.2 Generalized Linear Models  532
11.3 Volterra, Wiener, and Hammerstein Models  533
11.4 Cover's Theorem: Capacity of a Space in Linear Dichotomies  536
11.5 Reproducing Kernel Hilbert Spaces  539
11.5.1 Some Properties and Theoretical Highlights  541
11.5.2 Examples of Kernel Functions  543
11.6 Representer Theorem  548
11.6.1 Semiparametric Representer Theorem  550
11.6.2 Nonparametric Modeling: a Discussion  551
11.7 Kernel Ridge Regression  551
11.8 Support Vector Regression  554
11.8.1 The Linear ε-Insensitive Optimal Regression  555
11.9 Kernel Ridge Regression Revisited  561
11.10 Optimal Margin Classification: Support Vector Machines  562
11.10.1 Linearly Separable Classes: Maximum Margin Classifiers  564
11.10.2 Nonseparable Classes  569
11.10.3 Performance of SVMs and Applications  574
11.10.4 Choice of Hyperparameters  574
11.10.5 Multiclass Generalizations  575
11.11 Computational Considerations  576
11.12 Random Fourier Features  577
11.12.1 Online and Distributed Learning in RKHS  579
11.13 Multiple Kernel Learning  580
11.14 Nonparametric Sparsity-Aware Learning: Additive Models  582
11.15 A Case Study: Authorship Identification  584
Problems  587
MATLAB Exercises  589
References  590
CHAPTER 12 Bayesian Learning: Inference and the EM Algorithm  595
12.1 Introduction  595
12.2 Regression: a Bayesian Perspective  596
12.2.1 The Maximum Likelihood Estimator  597
12.2.2 The MAP Estimator  598
12.2.3 The Bayesian Approach  599
12.3 The Evidence Function and Occam's Razor Rule  605
Laplacian Approximation and the Evidence Function  607
12.4 Latent Variables and the EM Algorithm  611
12.4.1 The Expectation-Maximization Algorithm  611
12.5 Linear Regression and the EM Algorithm  613
12.6 Gaussian Mixture Models  616
12.6.1 Gaussian Mixture Modeling and Clustering  620
12.7 The EM Algorithm: a Lower Bound Maximization View  623
12.8 Exponential Family of Probability Distributions  627
12.8.1 The Exponential Family and the Maximum Entropy Method  633
12.9 Combining Learning Models: a Probabilistic Point of View  634
12.9.1 Mixing Linear Regression Models  634
12.9.2 Mixing Logistic Regression Models  639
Problems  641
MATLAB Exercises  643
References  645
CHAPTER 13 Bayesian Learning: Approximate Inference and Nonparametric Models  647
13.1 Introduction  648
13.2 Variational Approximation in Bayesian Learning  648
The Mean Field Approximation  649
13.2.1 The Case of the Exponential Family of Probability Distributions  653
13.3 A Variational Bayesian Approach to Linear Regression  655
Computation of the Lower Bound  660
13.4 A Variational Bayesian Approach to Gaussian Mixture Modeling  661
13.5 When Bayesian Inference Meets Sparsity  665
13.6 Sparse Bayesian Learning (SBL)  667
13.6.1 The Spike and Slab Method  671
13.7 The Relevance Vector Machine Framework  672
13.7.1 Adopting the Logistic Regression Model for Classification  672
13.8 Convex Duality and Variational Bounds  676
13.9 Sparsity-Aware Regression: a Variational Bound Bayesian Path  681
Sparsity-Aware Learning: Some Concluding Remarks  686
13.10 Expectation Propagation  686
Minimizing the KL Divergence  688
The Expectation Propagation Algorithm  688
13.11 Nonparametric Bayesian Modeling  690
13.11.1 The Chinese Restaurant Process  691
13.11.2 Dirichlet Processes  692
13.11.3 The Stick Breaking Construction of a DP  697
13.11.4 Dirichlet Process Mixture Modeling  698
Inference  699
13.11.5 The Indian Buffet Process  701
13.12 Gaussian Processes  710
13.12.1 Covariance Functions and Kernels  711
13.12.2 Regression  712
13.12.3 Classification  716
13.13 A Case Study: Hyperspectral Image Unmixing  717
13.13.1 Hierarchical Bayesian Modeling  719
13.13.2 Experimental Results  720
Problems  721
MATLAB Exercises  726
References  727
CHAPTER 14 Monte Carlo Methods  731
14.1 Introduction  731
14.2 Monte Carlo Methods: the Main Concept  732
14.2.1 Random Number Generation  733
14.3 Random Sampling Based on Function Transformation  735
14.4 Rejection Sampling  739
14.5 Importance Sampling  743
14.6 Monte Carlo Methods and the EM Algorithm  745
14.7 Markov Chain Monte Carlo Methods  745
14.7.1 Ergodic Markov Chains  748
14.8 The Metropolis Method  754
14.8.1 Convergence Issues  756
14.9 Gibbs Sampling  758
14.10 In Search of More Efficient Methods: a Discussion  760
Variational Inference or Monte Carlo Methods  762
14.11 A Case Study: Change-Point Detection  762
Problems  765
MATLAB Exercise  767
References  768
CHAPTER 15 Probabilistic Graphical Models: Part I  771
15.1 Introduction  771
15.2 The Need for Graphical Models  772
15.3 Bayesian Networks and the Markov Condition  774
15.3.1 Graphs: Basic Definitions  775
15.3.2 Some Hints on Causality  779
15.3.3 d-Separation  781
15.3.4 Sigmoidal Bayesian Networks  785
15.3.5 Linear Gaussian Models  786
15.3.6 Multiple-Cause Networks  786
15.3.7 I-Maps, Soundness, Faithfulness, and Completeness  787
15.4 Undirected Graphical Models  788
15.4.1 Independencies and I-Maps in Markov Random Fields  790
15.4.2 The Ising Model and Its Variants  791
15.4.3 Conditional Random Fields (CRFs)  794
15.5 Factor Graphs  795
15.5.1 Graphical Models for Error Correcting Codes  797
15.6 Moralization of Directed Graphs  798
15.7 Exact Inference Methods: Message Passing Algorithms  799
15.7.1 Exact Inference in Chains  799
15.7.2 Exact Inference in Trees  803
15.7.3 The Sum-Product Algorithm  804
15.7.4 The Max-Product and Max-Sum Algorithms  809
Problems  816
References  818
CHAPTER 16 Probabilistic Graphical Models: Part II  821
16.1 Introduction  821
16.2 Triangulated Graphs and Junction Trees  822
16.2.1 Constructing a Join Tree  825
16.2.2 Message Passing in Junction Trees  827
16.3 Approximate Inference Methods  830
16.3.1 Variational Methods: Local Approximation  831
16.3.2 Block Methods for Variational Approximation  835
16.3.3 Loopy Belief Propagation  839
16.4 Dynamic Graphical Models  842
16.5 Hidden Markov Models  844
16.5.1 Inference  847
16.5.2 Learning the Parameters in an HMM  852
16.5.3 Discriminative Learning  855
16.6 Beyond HMMs: a Discussion  856
16.6.1 Factorial Hidden Markov Models  856
16.6.2 Time-Varying Dynamic Bayesian Networks  859
16.7 Learning Graphical Models  859
16.7.1 Parameter Estimation  860
16.7.2 Learning the Structure  864
Problems  864
References  867
CHAPTER 17 Particle Filtering  871
17.1 Introduction  871
17.2 Sequential Importance Sampling  871
17.2.1 Importance Sampling Revisited  872
17.2.2 Resampling  873
17.2.3 Sequential Sampling  875
17.3 Kalman and Particle Filtering  878
17.3.1 Kalman Filtering: a Bayesian Point of View  878
17.4 Particle Filtering  881
17.4.1 Degeneracy  885
17.4.2 Generic Particle Filtering  886
17.4.3 Auxiliary Particle Filtering  889
Problems  895
MATLAB Exercises  898
References  899
CHAPTER 18 Neural Networks and Deep Learning  901
18.1 Introduction  902
18.2 The Perceptron  904
18.3 Feed-Forward Multilayer Neural Networks  908
18.3.1 Fully Connected Networks  912
18.4 The Backpropagation Algorithm  913
Nonconvexity of the Cost Function  914
18.4.1 The Gradient Descent Backpropagation Scheme  916
18.4.2 Variants of the Basic Gradient Descent Scheme  924
18.4.3 Beyond the Gradient Descent Rationale  934
18.5 Selecting a Cost Function  935
18.6 Vanishing and Exploding Gradients  938
18.6.1 The Rectified Linear Unit  939
18.7 Regularizing the Network  940
Dropout  943
18.8 Designing Deep Neural Networks: a Summary  946
18.9 Universal Approximation Property of Feed-Forward Neural Networks  947
18.10 Neural Networks: a Bayesian Flavor  949
18.11 Shallow Versus Deep Architectures  950
18.11.1 The Power of Deep Architectures  951
18.12 Convolutional Neural Networks  956
18.12.1 The Need for Convolutions  956
18.12.2 Convolution Over Volumes  965
18.12.3 The Full CNN Architecture  968
18.12.4 CNNs: the Epilogue  971
18.13 Recurrent Neural Networks  976
18.13.1 Backpropagation Through Time  978
18.13.2 Attention and Memory  982
18.14 Adversarial Examples  985
Adversarial Training  987
18.15 Deep Generative Models  988
18.15.1 Restricted Boltzmann Machines  988
18.15.2 Pretraining Deep Feed-Forward Networks  991
18.15.3 Deep Belief Networks  992
18.15.4 Autoencoders  994
18.15.5 Generative Adversarial Networks  995
18.15.6 Variational Autoencoders  1004
18.16 Capsule Networks  1007
Training  1011
18.17 Deep Neural Networks: Some Final Remarks  1013
Transfer Learning  1013
Multitask Learning  1014
Geometric Deep Learning  1015
Open Problems  1016
18.18 A Case Study: Neural Machine Translation  1017
18.19 Problems  1023
Computer Exercises  1025
References  1029
CHAPTER 19 Dimensionality Reduction and Latent Variable Modeling  1039
19.1 Introduction  1040
19.2 Intrinsic Dimensionality  1041
19.3 Principal Component Analysis  1041
PCA, SVD, and Low Rank Matrix Factorization  1043
Minimum Error Interpretation  1045
PCA and Information Retrieval  1045
Orthogonalizing Properties of PCA and Feature Generation  1046
Latent Variables  1047
19.4 Canonical Correlation Analysis  1053
19.4.1 Relatives of CCA  1056
19.5 Independent Component Analysis  1058
19.5.1 ICA and Gaussianity  1058
19.5.2 ICA and Higher-Order Cumulants  1059
19.5.3 Non-Gaussianity and Independent Components  1061
19.5.4 ICA Based on Mutual Information  1062
19.5.5 Alternative Paths to ICA  1065
The Cocktail Party Problem  1066
19.6 Dictionary Learning: the k-SVD Algorithm  1069
Why the Name k-SVD?  1072
Dictionary Learning and Dictionary Identifiability  1072
19.7 Nonnegative Matrix Factorization  1074
19.8 Learning Low-Dimensional Models: a Probabilistic Perspective  1076
19.8.1 Factor Analysis  1077
19.8.2 Probabilistic PCA  1078
19.8.3 Mixture of Factors Analyzers: a Bayesian View to Compressed Sensing  1082
19.9 Nonlinear Dimensionality Reduction  1085
19.9.1 Kernel PCA  1085
19.9.2 Graph-Based Methods  1087
19.10 Low Rank Matrix Factorization: a Sparse Modeling Path  1096
19.10.1 Matrix Completion  1096
19.10.2 Robust PCA  1100
19.10.3 Applications of Matrix Completion and Robust PCA  1101
19.11 A Case Study: fMRI Data Analysis  1103
Problems  1107
MATLAB Exercises  1107
References  1108
Index  1116