Sparse Coding-白红宇

强烈建议你试试无所不能的chatGPT，快点击我

Sparse Coding

阅读量：5221 次

发布时间：2019-06-14

本文共 3715 字，大约阅读时间需要 12 分钟。

Sparse Coding

Sparse coding is a class of unsupervised methods for learning sets of over-complete bases to represent data efficiently. —— 过完备的基，无监督 The aim of sparse coding is to find a set of basis vectors $\mathbf{\phi}_i$ such that we can represent an input vector $\mathbf{x}$ as a linear combination of these basis vectors:

$\begin{align} \mathbf{x} = \sum_{i=1}^k a_i \mathbf{\phi}_{i} \end{align}$

over-complete
$\mathbf{x}\in\mathbb{R}^n$
k
n
The advantage of having an over-complete basis is that our basis vectors are better able to capture structures and patterns inherent in the input data.
the coefficients a_i are no longer uniquely determined
$\mathbf{x}$

过完备的基能发掘出数据的内在结构和模型，但是其系数表示将不唯一

Here, we define sparsity as having few non-zero components or having few components not close to zero. The requirement that our coefficients a_i be sparse means that given a input vector, we would like as few of our coefficients to be far from zero as possible. The choice of sparsity as a desired characteristic of our representation of the input data can be motivated by the observation that most sensory data such as natural images may be described as the superposition of a small number of atomic elements such as surfaces or edges. 原子图像的叠加 Other justifications such as comparisons to the properties of the primary visual cortex have also been advanced.

We define the sparse coding cost function on a set of m input vectors as

$\begin{align} \text{minimize}_{a^{(j)}_i,\mathbf{\phi}_{i}} \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i) \end{align}$

where S(.) is a sparsity cost function which penalizes a_i for being far from zero.

稀疏惩罚项

We can interpret the first term of the sparse coding objective as a reconstruction term which tries to force the algorithm to provide a good representation of $\mathbf{x}$ and the second term as a sparsity penalty which forces our representation of $\mathbf{x}$ to be sparse. The constant λ is a scaling constant to determine the relative importance of these two contributions.

Although the most direct measure of sparsity is the "L₀" norm ( $S(a_i) = \mathbf{1}(|a_i|>0)$ ), it is non-differentiable （不可微） and difficult to optimize in general. In practice, common choices for the sparsity cost S(.) are the L₁ penalty $S(a_i)=\left|a_i\right|_1$ and the log penalty $S(a_i)=\log(1+a_i^2)$ .

In addition, it is also possible to make the sparsity penalty arbitrarily small by scaling down a_i and scaling $\mathbf{\phi}_i$ up by some large constant. To prevent this from happening, we will constrain $\left|\left|\mathbf{\phi}\right|\right|^2$ to be less than some constant C. The full sparse coding cost function including our constraint on $\mathbf{\phi}$ is

$\begin{array}{rc} \text{minimize}_{a^{(j)}_i,\mathbf{\phi}_{i}} & \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i) \\ \text{subject to} & \left|\left|\mathbf{\phi}_i\right|\right|^2 \leq C, \forall i = 1,...,k \\ \end{array}$

Probabilistic Interpretation

到目前为止，我们所考虑的稀疏编码，是为了寻找到一个稀疏的、超完备基向量集，来覆盖我们的输入数据空间。现在换一种方式，我们可以从概率的角度出发，将稀疏编码算法当作一种“生成模型”。

我们将自然图像建模问题看成是一种线性叠加，叠加元素包括 k 个独立的源特征 $\mathbf{\phi}_i$ 以及加性噪声 ν ：

$\begin{align} \mathbf{x} = \sum_{i=1}^k a_i \mathbf{\phi}_{i} + \nu(\mathbf{x}) \end{align}$

我们的目标是找到一组特征基向量 $\mathbf{\phi}$ ，它使得图像的分布函数 $P(\mathbf{x}\mid\mathbf{\phi})$ 尽可能地近似于输入数据的经验分布函数 $P^*(\mathbf{x})$ 。一种实现方式是，最小化 $P^*(\mathbf{x})$ 与 $P(\mathbf{x}\mid\mathbf{\phi})$ 之间的 KL 散度，此 KL 散度表示如下：

$\begin{align} D(P^*(\mathbf{x})||P(\mathbf{x}\mid\mathbf{\phi})) = \int P^*(\mathbf{x}) \log \left(\frac{P^*(\mathbf{x})}{P(\mathbf{x}\mid\mathbf{\phi})}\right)d\mathbf{x} \end{align}$

因为无论我们如何选择 $\mathbf{\phi}$ ，经验分布函数 $P^*(\mathbf{x})$ 都是常量，也就是说我们只需要最大化对数似然函数 $P(\mathbf{x}\mid\mathbf{\phi})$ 。假设 ν 是具有方差σ² 的高斯白噪音，则有下式：

$\begin{align} P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) = \frac{1}{Z} \exp\left(- \frac{(\mathbf{x}-\sum^{k}_{i=1} a_i \mathbf{\phi}_{i})^2}{2\sigma^2}\right) \end{align}$

为了确定分布 $P(\mathbf{x}\mid\mathbf{\phi})$ ，我们需要指定先验分布 $P(\mathbf{a})$ 。假定我们的特征变量是独立的，我们就可以将先验概率分解为：

$\begin{align} P(\mathbf{a}) = \prod_{i=1}^{k} P(a_i) \end{align}$

此时，我们将“稀疏”假设加入进来——假设任何一幅图像都是由相对较少的一些源特征组合起来的。因此，我们希望 a_i 的概率分布在零值附近是凸起的，而且峰值很高。一个方便的参数化先验分布就是：

$\begin{align} P(a_i) = \frac{1}{Z}\exp(-\beta S(a_i)) \end{align}$

这里 S(a_i) 是决定先验分布的形状的函数。

当定义了 $P(\mathbf{x} \mid \mathbf{a} , \mathbf{\phi})$ 和 $P(\mathbf{a})$ 后，我们就可以写出在由 $\mathbf{\phi}$ 定义的模型之下的数据 $\mathbf{x}$ 的概率分布：

$\begin{align} P(\mathbf{x} \mid \mathbf{\phi}) = \int P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) P(\mathbf{a}) d\mathbf{a} \end{align}$

那么，我们的问题就简化为寻找：

$\begin{align} \mathbf{\phi}^*=\text{argmax}_{\mathbf{\phi}} < \log(P(\mathbf{x} \mid \mathbf{\phi})) > \end{align}$

这里 < . > 表示的是输入数据的期望值。

不幸的是，通过对 $\mathbf{a}$ 的积分计算 $P(\mathbf{x} \mid \mathbf{\phi})$ 通常是难以实现的。虽然如此，我们注意到如果 $P(\mathbf{x} \mid \mathbf{\phi})$ 的分布（对于相应的 $\mathbf{a}$ ）足够陡峭的话，我们就可以用 $P(\mathbf{x} \mid \mathbf{\phi})$ 的最大值来估算以上积分。估算方法如下：

$\begin{align} \mathbf{\phi}^{*'}=\text{argmax}_{\mathbf{\phi}} < \max_{\mathbf{a}} \log(P(\mathbf{x} \mid \mathbf{\phi})) > \end{align}$

跟之前一样，我们可以通过减小 a_i 或增大 $\mathbf{\phi}$ 来增加概率的估算值（因为 P(a_i) 在零值附近陡升）。因此我们要对特征向量 $\mathbf{\phi}$ 加一个限制以防止这种情况发生。

最后，我们可以定义一种线性生成模型的能量函数，从而将原先的代价函数重新表述为：

$\begin{array}{rl} E\left( \mathbf{x} , \mathbf{a} \mid \mathbf{\phi} \right) & := -\log \left( P(\mathbf{x}\mid \mathbf{\phi},\mathbf{a}\right)P(\mathbf{a})) \\ &= \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i) \end{array}$

其中 λ = 2σ²β ，并且关系不大的常量已被隐藏起来。因为最大化对数似然函数等同于最小化能量函数，我们就可以将原先的优化问题重新表述为：

$\begin{align} \mathbf{\phi}^{*},\mathbf{a}^{*}=\text{argmin}_{\mathbf{\phi},\mathbf{a}} \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i) \end{align}$

使用概率理论来分析，我们可以发现，选择 L₁ 惩罚和 $\log(1+a_i^2)$ 惩罚作为函数 S(.) ，分别对应于使用了拉普拉斯概率 $P(a_i) \propto \exp\left(-\beta|a_i|\right)$ 和柯西先验概率 $P(a_i) \propto \frac{\beta}{1+a_i^2}$ 。

Learning

使用稀疏编码算法学习基向量集的方法，是由两个独立的优化过程组合起来的。第一个是逐个使用训练样本 $\mathbf{x}$ 来优化系数 a_i ，第二个是一次性处理多个样本对基向量 $\mathbf{\phi}$ 进行优化。

如果使用 L₁ 范式作为稀疏惩罚函数，对 $a^{(j)}_i$ 的学习过程就简化为求解由 L₁ 范式正则化的最小二乘法问题，这个问题函数在域 $a^{(j)}_i$ 内为凸，已经有很多技术方法来解决这个问题（诸如CVX之类的凸优化软件可以用来解决L1正则化的最小二乘法问题）。如果 S(.) 是可微的，比如是对数惩罚函数，则可以采用基于梯度算法的方法，如共轭梯度法。

用 L₂ 范式约束来学习基向量，同样可以简化为一个带有二次约束的最小二乘问题，其问题函数在域 $\mathbf{\phi}$ 内也为凸。标准的凸优化软件（如CVX）或其它迭代方法就可以用来求解 $\mathbf{\phi}$ ，虽然已经有了更有效的方法，比如求解拉格朗日对偶函数（Lagrange dual）。

根据前面的的描述，稀疏编码是有一个明显的局限性的，这就是即使已经学习得到一组基向量，如果为了对新的数据样本进行“编码”，我们必须再次执行优化过程来得到所需的系数。这个显著的“实时”消耗意味着，即使是在测试中,实现稀疏编码也需要高昂的计算成本，尤其是与典型的前馈结构算法相比。

转载于:https://www.cnblogs.com/sprint1989/p/3981918.html

你可能感兴趣的文章

android开发学习笔记：圆角的Button

循环-12. 打印九九口诀表(15)

oracle树状索引详解（图摘取《收获不止oracle》）

Android Studio 设置代码提示和代码自动补全快捷键--Eclipse 风格 - 转

ORACLE基本操作备忘

Maven学习：项目之间的关系

SQL查询总结 - wanglei

安装cocoa pods时出现Operation not permitted - /usr/bin/xcodeproj的问题

makefile中使用变量

GIT笔记：将项目发布到码云

JavaScript:学习笔记(7)——VAR、LET、CONST三种变量声明的区别

JavaScript 鸭子模型

PHP典型功能与Laravel5框架开发学习笔记

SQL Server 如何查询表定义的列和索引信息

项目上传到github上

GCD 之线程死锁

NoSQL数据库常见分类

JS小工具_字符串转16进制数组_02

喝酒易醉，品茶养心，人生如梦，品茶悟道，何以解忧？唯有杜康！-- 愿君每日到此一游！

当前时间: 2024-11-23 12:23:07 当前IP: 18.224.70.11 联系邮箱:javaeecc@qq.com Copyright © 2020 - 2022 baihongyu.com 京ICP备2021015314号-2

强烈建议你试试无所不能的CHAT-GPT，快点击我

强烈建议你试试无所不能的CHAT-GPT，快点击我