| Authors | Shay Cohen; Sergios Theodoridis |
| Series | Intelligent Science and Technology Series |
| Publisher | China Machine Press |
| ISBN | 9782012111720 |
| Description |
---------------------------Bayesian Analysis in Natural Language Processing (2nd Edition)---------------------------
This book covers the methods and algorithms needed to read the literature on Bayesian learning in NLP fluently and to conduct research in the field. These methods and algorithms derive partly from machine learning and statistics, and partly were developed specifically for NLP. The book covers inference techniques such as Markov chain Monte Carlo sampling and variational inference, Bayesian estimation, and nonparametric modeling. To keep pace with the field's rapid evolution, this edition adds a new chapter on representation learning and neural networks in the Bayesian context. It also introduces fundamental concepts of Bayesian statistics, such as prior distributions, conjugacy, and generative modeling. Finally, it reviews basic NLP modeling techniques, such as grammar modeling, neural networks, and representation learning, and their use in Bayesian analysis.

---------------------------Machine Learning: A Bayesian and Optimization Perspective (2nd Edition, English)---------------------------
This book presents machine learning from a unified viewpoint through the two pillars of supervised learning: regression and classification. It begins with the fundamentals, including mean-square, least-squares, and maximum-likelihood methods, ridge regression, Bayesian decision theory classification, logistic regression, and decision trees. It then introduces more recent techniques, including sparsity-aware modeling methods; learning in reproducing kernel Hilbert spaces and in support vector machines; Bayesian inference, with a focus on the EM algorithm and its approximate variational-inference variants; Monte Carlo methods; probabilistic graphical models, focusing on Bayesian networks; hidden Markov models; and particle filtering. Dimensionality reduction and latent variable modeling are also treated in depth, and the book concludes with an extended chapter on neural networks and deep learning architectures. In addition, it covers the basics of statistical parameter estimation, Wiener and Kalman filtering, and convexity and convex optimization, devoting one chapter to stochastic approximation and the gradient descent family of algorithms, and presenting related concepts and algorithms for distributed optimization along with online learning techniques.
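Neither blurb shows any of this machinery in action, so here is a minimal illustrative sketch of the conjugacy idea that both descriptions mention, written in plain Python. The Beta-Binomial model choice, the function name, and the numbers are assumptions made for illustration; they are not examples from either book.

```python
# Minimal sketch of conjugate Bayesian updating with a Beta-Binomial model.
# Illustrative only: the model choice and the numbers below are assumed,
# not taken from either book.

def update_beta_binomial(alpha, beta, successes, failures):
    """Return the posterior Beta parameters after observing Binomial data.

    Conjugacy keeps the posterior in the prior's family:
    Beta(alpha, beta) prior + (successes, failures) observations
    -> Beta(alpha + successes, beta + failures) posterior.
    """
    return alpha + successes, beta + failures

# Uniform prior Beta(1, 1); observe 7 successes and 3 failures.
a, b = update_beta_binomial(1.0, 1.0, successes=7, failures=3)
print(f"Posterior: Beta({a:g}, {b:g}), mean = {a / (a + b):.2f}")
# Prints: Posterior: Beta(8, 4), mean = 0.67
```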
| Table of Contents |
---------------------------Bayesian Analysis in Natural Language Processing (2nd Edition)---------------------------
Translator's Preface
Preface to the Second Edition
Preface to the First Edition
Acknowledgments for the First Edition
Chapter 1 Preliminaries  1
1.1 Probability Measures  1
1.2 Random Variables  2
1.2.1 Continuous and Discrete Random Variables  2
1.2.2 Joint Distributions over Multiple Random Variables  3
1.3 Conditional Distributions  4
1.3.1 Bayes' Rule  5
1.3.2 Independent and Conditionally Independent Random Variables  6
1.3.3 Exchangeable Random Variables  6
1.4 Expectations of Random Variables  7
1.5 Models  9
1.5.1 Parametric vs. Nonparametric Models  9
1.5.2 Inference with Models  10
1.5.3 Generative Models  11
1.5.4 Independence Assumptions in Models  13
1.5.5 Directed Graphical Models  13
1.6 Learning from Data Scenarios  15
1.7 Bayesian and Frequentist Philosophy (the Tip of the Iceberg)  17
1.8 Summary  17
1.9 Exercises  18
Chapter 2 Introduction  19
2.1 Overview: Where Bayesian Statistics and Natural Language Processing Meet  19
2.2 First Example: The Latent Dirichlet Allocation Model  22
2.2.1 The Dirichlet Distribution  26
2.2.2 Inference  28
2.2.3 Summary  29
2.3 Second Example: Bayesian Text Regression  30
2.4 Summary  31
2.5 Exercises  31
Chapter 3 Priors  33
3.1 Conjugate Priors  33
3.1.1 Conjugate Priors and Normalization Constants  36
3.1.2 The Use of Conjugate Priors with Latent Variable Models  37
3.1.3 Mixtures of Conjugate Priors  38
3.1.4 Renormalized Conjugate Distributions  39
3.1.5 Discussion: To Be or Not to Be Conjugate  39
3.1.6 Summary  40
3.2 Priors over Multinomial and Categorical Distributions  40
3.2.1 The Dirichlet Distribution Revisited  41
3.2.2 The Logistic Normal Distribution  44
3.2.3 Discussion  48
3.2.4 Summary  49
3.3 Non-Informative Priors  49
3.3.1 Uniform and Improper Priors  50
3.3.2 Jeffreys Priors  51
3.3.3 Discussion  51
3.4 Conjugacy and Exponential Models  52
3.5 Multiple Parameter Draws in Models  53
3.6 Structural Priors  54
3.7 Summary  55
3.8 Exercises  56
Chapter 4 Bayesian Estimation  57
4.1 Learning with Latent Variables: Two Views  58
4.2 Bayesian Point Estimation  58
4.2.1 Maximum a Posteriori Estimation  59
4.2.2 Posterior Approximations Based on the MAP Solution  64
4.2.3 Decision-Theoretic Point Estimation  65
4.2.4 Summary  66
4.3 Empirical Bayes  66
4.4 Asymptotic Behavior of the Posterior  68
4.5 Summary  69
4.6 Exercises  69
Chapter 5 Sampling Algorithms  70
5.1 MCMC Algorithms: Overview  71
5.2 NLP Model Structure for MCMC Inference  71
5.3 Gibbs Sampling  73
5.3.1 Collapsed Gibbs Sampling  76
5.3.2 Operator View  79
5.3.3 Parallelizing the Gibbs Sampler  80
5.3.4 Summary  81
5.4 The Metropolis-Hastings Algorithm  82
5.5 Slice Sampling  84
5.5.1 Auxiliary Variable Sampling  85
5.5.2 The Use of Slice Sampling and Auxiliary Variable Sampling in NLP  85
5.6 Simulated Annealing  86
5.7 Convergence of MCMC Algorithms  86
5.8 Markov Chains: Basic Theory  88
5.9 Sampling Algorithms Not in the MCMC Realm  89
5.10 Monte Carlo Integration  91
5.11 Discussion  93
5.11.1 Computability of Distributions vs. Sampling  93
5.11.2 Nested MCMC Sampling  93
5.11.3 Runtime of MCMC Methods  93
5.11.4 Particle Filtering  93
5.12 Summary  95
5.13 Exercises  95
Chapter 6 Variational Inference  97
6.1 Variational Bound on the Marginal Log-Likelihood  97
6.2 Mean-Field Approximation  99
6.3 Mean-Field Variational Inference Algorithm  100
6.3.1 Dirichlet-Multinomial Variational Inference  101
6.3.2 Connection to the Expectation-Maximization Algorithm  104
6.4 Empirical Bayes with Variational Inference  106
6.5 Discussion  106
6.5.1 Initialization of the Inference Algorithms  107
6.5.2 Convergence Diagnostics  107
6.5.3 The Use of Variational Inference for Decoding  107
6.5.4 Variational Inference as KL Divergence Minimization  108
6.5.5 Online Variational Inference  109
6.6 Summary  109
6.7 Exercises  109
Chapter 7 Nonparametric Priors  111
7.1 The Dirichlet Process: Three Views  112
7.1.1 The Stick-Breaking Process  112
7.1.2 The Chinese Restaurant Process  114
7.2 Dirichlet Process Mixtures  115
7.2.1 Inference with Dirichlet Process Mixtures  116
7.2.2 The Dirichlet Process Mixture as a Limit of Mixture Models  118
7.3 The Hierarchical Dirichlet Process  119
7.4 The Pitman-Yor Process  120
7.4.1 Pitman-Yor Processes for Language Modeling  121
7.4.2 Power-Law Behavior of the Pitman-Yor Process  122
7.5 Discussion  123
7.5.1 Gaussian Processes  124
7.5.2 The Indian Buffet Process  124
7.5.3 Nested Chinese Restaurant Processes  125
7.5.4 Distance-Dependent Chinese Restaurant Processes  125
7.5.5 Sequence Memoizers  126
7.6 Summary  126
7.7 Exercises  127
Chapter 8 Bayesian Grammar Models  128
8.1 Bayesian Hidden Markov Models  129
8.2 Probabilistic Context-Free Grammars  131
8.2.1 PCFGs as Collections of Multinomials  133
8.2.2 Basic Inference Algorithms for PCFGs  133
8.2.3 Hidden Markov Models as PCFGs  136
8.3 Bayesian Probabilistic Context-Free Grammars  137
8.3.1 Priors on PCFGs  137
8.3.2 Monte Carlo Inference with Bayesian PCFGs  138
8.3.3 Variational Inference with Bayesian PCFGs  139
8.4 Adaptor Grammars  140
8.4.1 Pitman-Yor Adaptor Grammars  141
8.4.2 The Stick-Breaking View of PYAGs  142
8.4.3 Inference with PYAGs  143
8.5 Hierarchical Dirichlet Process PCFGs  144
8.6 Dependency Grammars  147
8.7 Synchronous Grammars  148
8.8 Multilingual Learning  149
8.8.1 Part-of-Speech Tagging  149
8.8.2 Grammar Induction  151
8.9 Further Reading  152
8.10 Summary  153
8.11 Exercises  153
Chapter 9 Representation Learning and Neural Networks  155
9.1 Neural Networks and Representation Learning: Why Now?  155
9.2 Word Embeddings  158
9.2.1 Skip-Gram Models for Word Embeddings  158
9.2.2 Bayesian Skip-Gram Word Embeddings  160
9.2.3 Discussion  161
9.3 Neural Networks  162
9.3.1 Frequentist Estimation and the Backpropagation Algorithm  164
9.3.2 Priors on Neural Network Weights  166
9.4 Modern Use of Neural Networks in Natural Language Processing  168
9.4.1 Recurrent and Recursive Neural Networks  168
9.4.2 The Vanishing and Exploding Gradient Problem  169
9.4.3 Neural Encoder-Decoder Models  172
9.4.4 Convolutional Neural Networks  175
9.5 Tuning Neural Networks  177
9.5.1 Regularization  177
9.5.2 Hyperparameter Tuning  178
9.6 Generative Modeling with Neural Networks  180
9.6.1 Variational Autoencoders  180
9.6.2 Generative Adversarial Networks  185
9.7 Summary  186
9.8 Exercises  187
Closing Remarks  189
Appendix A Basic Concepts  191
Appendix B Catalog of Probability Distributions  197
References  203
---------------------------Machine Learning: A Bayesian and Optimization Perspective (2nd Edition, English)---------------------------
Preface  iv
Acknowledgments  vi
About the Author  viii
Notation  ix
CHAPTER 1 Introduction  1
1.1 The Historical Context  1
1.2 Artificial Intelligence and Machine Learning  2
1.3 Algorithms Can Learn What Is Hidden in the Data  4
1.4 Typical Applications of Machine Learning  6
Speech Recognition  6
Computer Vision  6
Multimodal Data  6
Natural Language Processing  7
Robotics  7
Autonomous Cars  7
Challenges for the Future  8
1.5 Machine Learning: Major Directions  8
1.5.1 Supervised Learning  8
1.6 Unsupervised and Semisupervised Learning  11
1.7 Structure and a Road Map of the Book  12
References  16
CHAPTER 2 Probability and Stochastic Processes  19
2.1 Introduction  20
2.2 Probability and Random Variables  20
2.2.1 Probability  20
2.2.2 Discrete Random Variables  22
2.2.3 Continuous Random Variables  24
2.2.4 Mean and Variance  25
2.2.5 Transformation of Random Variables  28
2.3 Examples of Distributions  29
2.3.1 Discrete Variables  29
2.3.2 Continuous Variables  32
2.4 Stochastic Processes  41
2.4.1 First- and Second-Order Statistics  42
2.4.2 Stationarity and Ergodicity  43
2.4.3 Power Spectral Density  46
2.4.4 Autoregressive Models  51
2.5 Information Theory  54
2.5.1 Discrete Random Variables  56
2.5.2 Continuous Random Variables  59
2.6 Stochastic Convergence  61
Convergence Everywhere  62
Convergence Almost Everywhere  62
Convergence in the Mean-Square Sense  62
Convergence in Probability  63
Convergence in Distribution  63
Problems  63
References  65
CHAPTER 3 Learning in Parametric Modeling: Basic Concepts and Directions  67
3.1 Introduction  67
3.2 Parameter Estimation: the Deterministic Point of View  68
3.3 Linear Regression  71
3.4 Classification  75
Generative Versus Discriminative Learning  78
3.5 Biased Versus Unbiased Estimation  80
3.5.1 Biased or Unbiased Estimation  81
3.6 The Cramér-Rao Lower Bound  83
3.7 Sufficient Statistic  87
3.8 Regularization  89
Inverse Problems: Ill-Conditioning and Overfitting  91
3.9 The Bias-Variance Dilemma  93
3.9.1 Mean-Square Error Estimation  94
3.9.2 Bias-Variance Tradeoff  95
3.10 Maximum Likelihood Method  98
3.10.1 Linear Regression: the Nonwhite Gaussian Noise Case  101
3.11 Bayesian Inference  102
3.11.1 The Maximum a Posteriori Probability Estimation Method  107
3.12 Curse of Dimensionality  108
3.13 Validation  109
Cross-Validation  111
3.14 Expected Loss and Empirical Risk Functions  112
Learnability  113
3.15 Nonparametric Modeling and Estimation  114
Problems  114
MATLAB Exercises  119
References  119
CHAPTER 4 Mean-Square Error Linear Estimation  121
4.1 Introduction  121
4.2 Mean-Square Error Linear Estimation: the Normal Equations  122
4.2.1 The Cost Function Surface  123
4.3 A Geometric Viewpoint: Orthogonality Condition  124
4.4 Extension to Complex-Valued Variables  127
4.4.1 Widely Linear Complex-Valued Estimation  129
4.4.2 Optimizing With Respect to Complex-Valued Variables: Wirtinger Calculus  132
4.5 Linear Filtering  134
4.6 MSE Linear Filtering: a Frequency Domain Point of View  136
Deconvolution: Image Deblurring  137
4.7 Some Typical Applications  140
4.7.1 Interference Cancelation  140
4.7.2 System Identification  141
4.7.3 Deconvolution: Channel Equalization  143
4.8 Algorithmic Aspects: the Levinson and Lattice-Ladder Algorithms  149
Forward and Backward MSE Optimal Predictors  151
4.8.1 The Lattice-Ladder Scheme  154
4.9 Mean-Square Error Estimation of Linear Models  158
4.9.1 The Gauss-Markov Theorem  160
4.9.2 Constrained Linear Estimation: the Beamforming Case  162
4.10 Time-Varying Statistics: Kalman Filtering  166
Problems  172
MATLAB Exercises  174
References  176
CHAPTER 5 Online Learning: the Stochastic Gradient Descent Family of Algorithms  179
5.1 Introduction  180
5.2 The Steepest Descent Method  181
5.3 Application to the Mean-Square Error Cost Function  184
Time-Varying Step Sizes  190
5.3.1 The Complex-Valued Case  193
5.4 Stochastic Approximation  194
Application to the MSE Linear Estimation  196
5.5 The Least-Mean-Squares Adaptive Algorithm  198
5.5.1 Convergence and Steady-State Performance of the LMS in Stationary Environments  199
5.5.2 Cumulative Loss Bounds  204
5.6 The Affine Projection Algorithm  206
Geometric Interpretation of APA  208
Orthogonal Projections  208
5.6.1 The Normalized LMS  211
5.7 The Complex-Valued Case  213
The Widely Linear LMS  213
The Widely Linear APA  214
5.8 Relatives of the LMS  214
The Sign-Error LMS  214
The Least-Mean-Fourth (LMF) Algorithm  215
Transform-Domain LMS  215
5.9 Simulation Examples  218
5.10 Adaptive Decision Feedback Equalization  221
5.11 The Linearly Constrained LMS  224
5.12 Tracking Performance of the LMS in Nonstationary Environments  225
5.13 Distributed Learning: the Distributed LMS  227
5.13.1 Cooperation Strategies  228
5.13.2 The Diffusion LMS  231
5.13.3 Convergence and Steady-State Performance: Some Highlights  237
5.13.4 Consensus-Based Distributed Schemes  240
5.14 A Case Study: Target Localization  241
5.15 Some Concluding Remarks: Consensus Matrix  243
Problems  244
MATLAB Exercises  246
References  247
CHAPTER 6 The Least-Squares Family  253
6.1 Introduction  253
6.2 Least-Squares Linear Regression: a Geometric Perspective  254
6.3 Statistical Properties of the LS Estimator  257
The LS Estimator Is Unbiased  257
Covariance Matrix of the LS Estimator  257
The LS Estimator Is BLUE in the Presence of White Noise  258
The LS Estimator Achieves the Cramér-Rao Bound for White Gaussian Noise  259
Asymptotic Distribution of the LS Estimator  260
6.4 Orthogonalizing the Column Space of the Input Matrix: the SVD Method  260
Pseudoinverse Matrix and SVD  262
6.5 Ridge Regression: a Geometric Point of View  265
Principal Components Regression  267
6.6 The Recursive Least-Squares Algorithm  268
Time-Iterative Computations  269
Time Updating of the Parameters  270
6.7 Newton's Iterative Minimization Method  271
6.7.1 RLS and Newton's Method  274
6.8 Steady-State Performance of the RLS  275
6.9 Complex-Valued Data: the Widely Linear RLS  277
6.10 Computational Aspects of the LS Solution  279
Cholesky Factorization  279
QR Factorization  279
Fast RLS Versions  280
6.11 The Coordinate and Cyclic Coordinate Descent Methods  281
6.12 Simulation Examples  283
6.13 Total Least-Squares  286
Geometric Interpretation of the Total Least-Squares Method  291
Problems  293
MATLAB Exercises  296
References  297
CHAPTER 7 Classification: a Tour of the Classics  301
7.1 Introduction  301
7.2 Bayesian Classification  302
The Bayesian Classifier Minimizes the Misclassification Error  303
7.2.1 Average Risk  304
7.3 Decision (Hyper)Surfaces  307
7.3.1 The Gaussian Distribution Case  309
7.4 The Naive Bayes Classifier  315
7.5 The Nearest Neighbor Rule  315
7.6 Logistic Regression  317
7.7 Fisher's Linear Discriminant  322
7.7.1 Scatter Matrices  323
7.7.2 Fisher's Discriminant: the Two-Class Case  325
7.7.3 Fisher's Discriminant: the Multiclass Case  328
7.8 Classification Trees  329
7.9 Combining Classifiers  333
No Free Lunch Theorem  334
Some Experimental Comparisons  334
Schemes for Combining Classifiers  335
7.10 The Boosting Approach  337
The AdaBoost Algorithm  337
The Log-Loss Function  341
7.11 Boosting Trees  343
Problems  345
MATLAB Exercises  347
References  349
CHAPTER 8 Parameter Learning: a Convex Analytic Path  351
8.1 Introduction  352
8.2 Convex Sets and Functions  352
8.2.1 Convex Sets  353
8.2.2 Convex Functions  354
8.3 Projections Onto Convex Sets  357
8.3.1 Properties of Projections  361
8.4 Fundamental Theorem of Projections Onto Convex Sets  365
8.5 A Parallel Version of POCS  369
8.6 From Convex Sets to Parameter Estimation and Machine Learning  369
8.6.1 Regression  369
8.6.2 Classification  373
8.7 Infinitely Many Closed Convex Sets: the Online Learning Case  374
8.7.1 Convergence of APSM  376
8.8 Constrained Learning  380
8.9 The Distributed APSM  382
8.10 Optimizing Nonsmooth Convex Cost Functions  384
8.10.1 Subgradients and Subdifferentials  385
8.10.2 Minimizing Nonsmooth Continuous Convex Loss Functions: the Batch Learning Case  388
8.10.3 Online Learning for Convex Optimization  393
8.11 Regret Analysis  396
Regret Analysis of the Subgradient Algorithm  398
8.12 Online Learning and Big Data Applications: a Discussion  399
Approximation, Estimation, and Optimization Errors  400
Batch Versus Online Learning  402
8.13 Proximal Operators  405
8.13.1 Properties of the Proximal Operator  407
8.13.2 Proximal Minimization  409
8.14 Proximal Splitting Methods for Optimization  412
The Proximal Forward-Backward Splitting Operator  413
Alternating Direction Method of Multipliers (ADMM)  414
Mirror Descent Algorithms  415
8.15 Distributed Optimization: Some Highlights  417
Problems  417
MATLAB Exercises  420
References  422
CHAPTER 9 Sparsity-Aware Learning: Concepts and Theoretical Foundations  427
9.1 Introduction  427
9.2 Searching for a Norm  428
9.3 The Least Absolute Shrinkage and Selection Operator (LASSO)  431
9.4 Sparse Signal Representation  436
9.5 In Search of the Sparsest Solution  440
The ℓ2 Norm Minimizer  441
The ℓ0 Norm Minimizer  442
The ℓ1 Norm Minimizer  442
Characterization of the ℓ1 Norm Minimizer  443
Geometric Interpretation  444
9.6 Uniqueness of the ℓ0 Minimizer  447
9.6.1 Mutual Coherence  449
9.7 Equivalence of ℓ0 and ℓ1 Minimizers: Sufficiency Conditions  451
9.7.1 Condition Implied by the Mutual Coherence Number  451
9.7.2 The Restricted Isometry Property (RIP)  452
9.8 Robust Sparse Signal Recovery From Noisy Measurements  455
9.9 Compressed Sensing: the Glory of Randomness  456
Compressed Sensing  456
9.9.1 Dimensionality Reduction and Stable Embeddings  458
9.9.2 Sub-Nyquist Sampling: Analog-to-Information Conversion  460
9.10 A Case Study: Image Denoising  463
Problems  465
MATLAB Exercises  468
References  469
CHAPTER 10 Sparsity-Aware Learning: Algorithms and Applications  473
10.1 Introduction  473
10.2 Sparsity Promoting Algorithms  474
10.2.1 Greedy Algorithms  474
10.2.2 Iterative Shrinkage/Thresholding (IST) Algorithms  480
10.2.3 Which Algorithm? Some Practical Hints  487
10.3 Variations on the Sparsity-Aware Theme  492
10.4 Online Sparsity Promoting Algorithms  499
10.4.1 LASSO: Asymptotic Performance  500
10.4.2 The Adaptive Norm-Weighted LASSO  502
10.4.3 Adaptive CoSaMP Algorithm  504
10.4.4 Sparse-Adaptive Projection Subgradient Method  505
10.5 Learning Sparse Analysis Models  510
10.5.1 Compressed Sensing for Sparse Signal Representation in Coherent Dictionaries  512
10.5.2 Cosparsity  513
10.6 A Case Study: Time-Frequency Analysis  516
Gabor Transform and Frames  516
Time-Frequency Resolution  517
Gabor Frames  518
Time-Frequency Analysis of Echolocation Signals Emitted by Bats  519
Problems  523
MATLAB Exercises  524
References  525
CHAPTER 11 Learning in Reproducing Kernel Hilbert Spaces  531
11.1 Introduction  532
11.2 Generalized Linear Models  532
11.3 Volterra, Wiener, and Hammerstein Models  533
11.4 Cover's Theorem: Capacity of a Space in Linear Dichotomies  536
11.5 Reproducing Kernel Hilbert Spaces  539
11.5.1 Some Properties and Theoretical Highlights  541
11.5.2 Examples of Kernel Functions  543
11.6 Representer Theorem  548
11.6.1 Semiparametric Representer Theorem  550
11.6.2 Nonparametric Modeling: a Discussion  551
11.7 Kernel Ridge Regression  551
11.8 Support Vector Regression  554
11.8.1 The Linear ε-Insensitive Optimal Regression  555
11.9 Kernel Ridge Regression Revisited  561
11.10 Optimal Margin Classification: Support Vector Machines  562
11.10.1 Linearly Separable Classes: Maximum Margin Classifiers  564
11.10.2 Nonseparable Classes  569
11.10.3 Performance of SVMs and Applications  574
11.10.4 Choice of Hyperparameters  574
11.10.5 Multiclass Generalizations  575
11.11 Computational Considerations  576
11.12 Random Fourier Features  577
11.12.1 Online and Distributed Learning in RKHS  579
11.13 Multiple Kernel Learning  580
11.14 Nonparametric Sparsity-Aware Learning: Additive Models  582
11.15 A Case Study: Authorship Identification  584
Problems  587
MATLAB Exercises  589
References  590
CHAPTER 12 Bayesian Learning: Inference and the EM Algorithm  595
12.1 Introduction  595
12.2 Regression: a Bayesian Perspective  596
12.2.1 The Maximum Likelihood Estimator  597
12.2.2 The MAP Estimator  598
12.2.3 The Bayesian Approach  599
12.3 The Evidence Function and Occam's Razor Rule  605
Laplacian Approximation and the Evidence Function  607
12.4 Latent Variables and the EM Algorithm  611
12.4.1 The Expectation-Maximization Algorithm  611
12.5 Linear Regression and the EM Algorithm  613
12.6 Gaussian Mixture Models  616
12.6.1 Gaussian Mixture Modeling and Clustering  620
12.7 The EM Algorithm: a Lower Bound Maximization View  623
12.8 Exponential Family of Probability Distributions  627
12.8.1 The Exponential Family and the Maximum Entropy Method  633
12.9 Combining Learning Models: a Probabilistic Point of View  634
12.9.1 Mixing Linear Regression Models  634
12.9.2 Mixing Logistic Regression Models  639
Problems  641
MATLAB Exercises  643
References  645
CHAPTER 13 Bayesian Learning: Approximate Inference and Nonparametric Models  647
13.1 Introduction  648
13.2 Variational Approximation in Bayesian Learning  648
The Mean Field Approximation  649
13.2.1 The Case of the Exponential Family of Probability Distributions  653
13.3 A Variational Bayesian Approach to Linear Regression  655
Computation of the Lower Bound  660
13.4 A Variational Bayesian Approach to Gaussian Mixture Modeling  661
13.5 When Bayesian Inference Meets Sparsity  665
13.6 Sparse Bayesian Learning (SBL)  667
13.6.1 The Spike and Slab Method  671
13.7 The Relevance Vector Machine Framework  672
13.7.1 Adopting the Logistic Regression Model for Classification  672
13.8 Convex Duality and Variational Bounds  676
13.9 Sparsity-Aware Regression: a Variational Bound Bayesian Path  681
Sparsity-Aware Learning: Some Concluding Remarks  686
13.10 Expectation Propagation  686
Minimizing the KL Divergence  688
The Expectation Propagation Algorithm  688
13.11 Nonparametric Bayesian Modeling  690
13.11.1 The Chinese Restaurant Process  691
13.11.2 Dirichlet Processes  692
13.11.3 The Stick Breaking Construction of a DP  697
13.11.4 Dirichlet Process Mixture Modeling  698
Inference  699
13.11.5 The Indian Buffet Process  701
13.12 Gaussian Processes  710
13.12.1 Covariance Functions and Kernels  711
13.12.2 Regression  712
13.12.3 Classification  716
13.13 A Case Study: Hyperspectral Image Unmixing  717
13.13.1 Hierarchical Bayesian Modeling  719
13.13.2 Experimental Results  720
Problems  721
MATLAB Exercises  726
References  727
CHAPTER 14 Monte Carlo Methods  731
14.1 Introduction  731
14.2 Monte Carlo Methods: the Main Concept  732
14.2.1 Random Number Generation  733
14.3 Random Sampling Based on Function Transformation  735
14.4 Rejection Sampling  739
14.5 Importance Sampling  743
14.6 Monte Carlo Methods and the EM Algorithm  745
14.7 Markov Chain Monte Carlo Methods  745
14.7.1 Ergodic Markov Chains  748
14.8 The Metropolis Method  754
14.8.1 Convergence Issues  756
14.9 Gibbs Sampling  758
14.10 In Search of More Efficient Methods: a Discussion  760
Variational Inference or Monte Carlo Methods  762
14.11 A Case Study: Change-Point Detection  762
Problems  765
MATLAB Exercise  767
References  768
CHAPTER 15 Probabilistic Graphical Models: Part I  771
15.1 Introduction  771
15.2 The Need for Graphical Models  772
15.3 Bayesian Networks and the Markov Condition  774
15.3.1 Graphs: Basic Definitions  775
15.3.2 Some Hints on Causality  779
15.3.3 d-Separation  781
15.3.4 Sigmoidal Bayesian Networks  785
15.3.5 Linear Gaussian Models  786
15.3.6 Multiple-Cause Networks  786
15.3.7 I-Maps, Soundness, Faithfulness, and Completeness  787
15.4 Undirected Graphical Models  788
15.4.1 Independencies and I-Maps in Markov Random Fields  790
15.4.2 The Ising Model and Its Variants  791
15.4.3 Conditional Random Fields (CRFs)  794
15.5 Factor Graphs  795
15.5.1 Graphical Models for Error Correcting Codes  797
15.6 Moralization of Directed Graphs  798
15.7 Exact Inference Methods: Message Passing Algorithms  799
15.7.1 Exact Inference in Chains  799
15.7.2 Exact Inference in Trees  803
15.7.3 The Sum-Product Algorithm  804
15.7.4 The Max-Product and Max-Sum Algorithms  809
Problems  816
References  818
CHAPTER 16 Probabilistic Graphical Models: Part II  821
16.1 Introduction  821
16.2 Triangulated Graphs and Junction Trees  822
16.2.1 Constructing a Join Tree  825
16.2.2 Message Passing in Junction Trees  827
16.3 Approximate Inference Methods  830
16.3.1 Variational Methods: Local Approximation  831
16.3.2 Block Methods for Variational Approximation  835
16.3.3 Loopy Belief Propagation  839
16.4 Dynamic Graphical Models  842
16.5 Hidden Markov Models  844
16.5.1 Inference  847
16.5.2 Learning the Parameters in an HMM  852
16.5.3 Discriminative Learning  855
16.6 Beyond HMMs: a Discussion  856
16.6.1 Factorial Hidden Markov Models  856
16.6.2 Time-Varying Dynamic Bayesian Networks  859
16.7 Learning Graphical Models  859
16.7.1 Parameter Estimation  860
16.7.2 Learning the Structure  864
Problems  864
References  867
CHAPTER 17 Particle Filtering  871
17.1 Introduction  871
17.2 Sequential Importance Sampling  871
17.2.1 Importance Sampling Revisited  872
17.2.2 Resampling  873
17.2.3 Sequential Sampling  875
17.3 Kalman and Particle Filtering  878
17.3.1 Kalman Filtering: a Bayesian Point of View  878
17.4 Particle Filtering  881
17.4.1 Degeneracy  885
17.4.2 Generic Particle Filtering  886
17.4.3 Auxiliary Particle Filtering  889
Problems  895
MATLAB Exercises  898
References  899
CHAPTER 18 Neural Networks and Deep Learning  901
18.1 Introduction  902
18.2 The Perceptron  904
18.3 Feed-Forward Multilayer Neural Networks  908
18.3.1 Fully Connected Networks  912
18.4 The Backpropagation Algorithm  913
Nonconvexity of the Cost Function  914
18.4.1 The Gradient Descent Backpropagation Scheme  916
18.4.2 Variants of the Basic Gradient Descent Scheme  924
18.4.3 Beyond the Gradient Descent Rationale  934
18.5 Selecting a Cost Function  935
18.6 Vanishing and Exploding Gradients  938
18.6.1 The Rectified Linear Unit  939
18.7 Regularizing the Network  940
Dropout  943
18.8 Designing Deep Neural Networks: a Summary  946
18.9 Universal Approximation Property of Feed-Forward Neural Networks  947
18.10 Neural Networks: a Bayesian Flavor  949
18.11 Shallow Versus Deep Architectures  950
18.11.1 The Power of Deep Architectures  951
18.12 Convolutional Neural Networks  956
18.12.1 The Need for Convolutions  956
18.12.2 Convolution Over Volumes  965
18.12.3 The Full CNN Architecture  968
18.12.4 CNNs: the Epilogue  971
18.13 Recurrent Neural Networks  976
18.13.1 Backpropagation Through Time  978
18.13.2 Attention and Memory  982
18.14 Adversarial Examples  985
Adversarial Training  987
18.15 Deep Generative Models  988
18.15.1 Restricted Boltzmann Machines  988
18.15.2 Pretraining Deep Feed-Forward Networks  991
18.15.3 Deep Belief Networks  992
18.15.4 Autoencoders  994
18.15.5 Generative Adversarial Networks  995
18.15.6 Variational Autoencoders  1004
18.16 Capsule Networks  1007
Training  1011
18.17 Deep Neural Networks: Some Final Remarks  1013
Transfer Learning  1013
Multitask Learning  1014
Geometric Deep Learning  1015
Open Problems  1016
18.18 A Case Study: Neural Machine Translation  1017
18.19 Problems  1023
Computer Exercises  1025
References  1029
CHAPTER 19 Dimensionality Reduction and Latent Variable Modeling  1039
19.1 Introduction  1040
19.2 Intrinsic Dimensionality  1041
19.3 Principal Component Analysis  1041
PCA, SVD, and Low Rank Matrix Factorization  1043
Minimum Error Interpretation  1045
PCA and Information Retrieval  1045
Orthogonalizing Properties of PCA and Feature Generation  1046
Latent Variables  1047
19.4 Canonical Correlation Analysis  1053
19.4.1 Relatives of CCA  1056
19.5 Independent Component Analysis  1058
19.5.1 ICA and Gaussianity  1058
19.5.2 ICA and Higher-Order Cumulants  1059
19.5.3 Non-Gaussianity and Independent Components  1061
19.5.4 ICA Based on Mutual Information  1062
19.5.5 Alternative Paths to ICA  1065
The Cocktail Party Problem  1066
19.6 Dictionary Learning: the k-SVD Algorithm  1069
Why the Name k-SVD?  1072
Dictionary Learning and Dictionary Identifiability  1072
19.7 Nonnegative Matrix Factorization  1074
19.8 Learning Low-Dimensional Models: a Probabilistic Perspective  1076
19.8.1 Factor Analysis  1077
19.8.2 Probabilistic PCA  1078
19.8.3 Mixture of Factors Analyzers: a Bayesian View to Compressed Sensing  1082
19.9 Nonlinear Dimensionality Reduction  1085
19.9.1 Kernel PCA  1085
19.9.2 Graph-Based Methods  1087
19.10 Low Rank Matrix Factorization: a Sparse Modeling Path  1096
19.10.1 Matrix Completion  1096
19.10.2 Robust PCA  1100
19.10.3 Applications of Matrix Completion and Robust PCA  1101
19.11 A Case Study: fMRI Data Analysis  1103
Problems  1107
MATLAB Exercises  1107
References  1108
Index  1116