Jun 5, 2009

gradient descent in optimization

Gradient Descent:
  • batch gradient descent
  • (fully) online gradient descent (stochastic GD, sequential GD)
  • mini-batches 
  • stochastic meta-descent

A lecture and a note on SMD
Another discussion

See the discussion of gradient descent on p. 240 of Bishop's book:
for batch optimization, use conjugate gradients or quasi-Newton methods, which are more robust and faster than batch GD.

But stochastic gradient descent is also very powerful.

One can either sweep through all the data in order or draw data points at random with replacement; mini-batches can also be used.
Stochastic gradient descent can be viewed as a stochastic approximation of the gradient, and the batch size determines the stochasticity of the approximation.
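For concreteness, here is a minimal sketch of the three gradient estimates on a toy least-squares problem (the data, the grad helper, and the 0.1 learning rate are illustrative choices, not taken from any of the cited code):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                      # toy design matrix
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)

def grad(w, Xb, yb):
    """Gradient of the mean squared error on the (mini-)batch (Xb, yb)."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(5)
eta = 0.1

# batch GD: one update per pass over the full data set
w -= eta * grad(w, X, y)

# stochastic (online) GD: one update per randomly drawn example
i = rng.integers(len(y))
w -= eta * grad(w, X[i:i+1], y[i:i+1])

# mini-batch GD: one update per small random subset
idx = rng.choice(len(y), size=32, replace=False)
w -= eta * grad(w, X[idx], y[idx])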

"Stochastic" 的由来:
假设 x ~ P(x) 是一个随机源, learning system 要优化的是 Expected risk
但是 P(x) 是不知道的,所以 batch GD 优化 empirical risk
stochastic 因为每次针对一个 instance,所以是 stochastic,它不 follow 真正的 gradient. 它成功的保证是
random noise introced by this procedure will not perturbate the average behavior of the algorithm, 这在实际中是可以保证的[stochasitc learning, Leon Botton]。
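Roughly, writing Q(x, w) for the per-example loss (the notation below approximates Bottou's, it is not a quote from his paper):

% expected risk: requires the unknown distribution P(x)
E(w)   = \int Q(x, w)\, dP(x)
% empirical risk: average over the observed sample x_1, ..., x_N
E_N(w) = \frac{1}{N} \sum_{n=1}^{N} Q(x_n, w)

Batch GD follows the exact gradient of E_N; stochastic GD uses the gradient of Q(x_n, w) for a single randomly drawn x_n, whose expectation is that same gradient.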

See also the paper Accelerated Training of Conditional Random Fields with Stochastic Meta-Descent:
Unfortunately most advanced gradient methods do not tolerate the sampling noise inherent in stochastic approximation: it collapses conjugate search directions (Schraudolph & Graepel, 2003) and confuses the line searches that both conjugate gradient and quasi-Newton methods depend upon. Full second-order methods are unattractive here because the computational cost of inverting the Hessian is better amortized over a large data set.
That is, CG and quasi-Newton methods cannot be used in the online setting.


1. On why stochastic GD is preferable to batch GD, Bishop gives two reasons. First, stochastic GD handles redundancy in the data better: if the entire data set is duplicated, the cost of a batch GD step doubles, while stochastic GD is unaffected.
Second, stochastic GD has a better chance of escaping local minima, because a stationary point of the total loss is generally not a stationary point of the loss on each individual data point (see the note below).
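One way to spell the second point out, writing the total loss as E(w) = \sum_n E_n(w) (the symbols are mine, roughly following Bishop's convention):

\nabla E(w^{\ast}) = \sum_{n} \nabla E_n(w^{\ast}) = 0
\quad\not\Rightarrow\quad
\nabla E_n(w^{\ast}) = 0 \ \text{for every } n

so an update based on a single E_n can still push w away from a stationary point of E.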

Another paper that likewise argues for the stochastic side, The general inefficiency of batch training for gradient descent learning:
batch training is almost always slower than on-line training--often orders of magnitude slower--especially on large training sets. The main reason is due to the ability of on-line training to follow curves in the error surface throughout each epoch, which allows it to safely use a larger learning rate and thus converge with less iterations through the training data. Empirical results on a large (20,000-instance) speech recognition task and on 26 other learning tasks demonstrate that convergence can be reached significantly faster using on-line training than batch training, with no apparent difference in accuracy.

So stochastic GD is faster and no worse in accuracy; by Bishop's account, accuracy may even be better. What benefit, then, do mini-batches bring?

2. Discussion of the learning rate

In the book Convex Optimization, the learning rate is called the step size and is chosen by a line search.
In Bishop's book, however, the learning rate is a fixed constant, as it is in the deepauto code, where it is set to 0.1. A large learning rate speeds up convergence, but too large a value is harmful and the iterates behave like a random walk (MCMC has a similar problem).
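A minimal sketch of the two choices on a toy quadratic (the objective and the backtracking constants 0.3/0.8 are illustrative defaults, not taken from either book):

import numpy as np

# toy objective: f(w) = 0.5 * w' A w with a moderately ill-conditioned A
A = np.diag([1.0, 10.0])
f = lambda w: 0.5 * w @ A @ w
grad = lambda w: A @ w

# (a) fixed learning rate, as in Bishop / deepauto
w = np.array([1.0, 1.0])
eta = 0.1
for _ in range(100):
    w = w - eta * grad(w)

# (b) backtracking (Armijo) line search, a standard step-size rule
w = np.array([1.0, 1.0])
alpha, beta = 0.3, 0.8              # sufficient-decrease and shrink factors
for _ in range(100):
    g = grad(w)
    t = 1.0
    while f(w - t * g) > f(w) - alpha * t * g @ g:
        t *= beta                   # shrink the step until sufficient decrease holds
    w = w - t * g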

In stochastic meta-descent, by contrast, the learning rate is called the gain vector and is an entire vector, one gain per parameter.

In the formulas, the dot denotes component-wise multiplication, and the gain vector is updated on every batch.
SMD implicitly uses the Hessian, but never needs to compute it explicitly; instead it relies on a trick called the Hessian-vector product.
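A sketch of one SMD step as I read the update rules in the SMD papers, written for minimization; grad_fn and hvp_fn stand for whatever routines compute the batch gradient and the Hessian-vector product, and mu / lam are the meta-learning-rate and decay hyperparameters:

import numpy as np

def smd_step(theta, eta, v, grad_fn, hvp_fn, mu=0.1, lam=0.99):
    """One SMD update for minimization (sketch, not a reference implementation).

    theta   : parameter vector
    eta     : gain vector (per-parameter learning rates)
    v       : auxiliary vector tracking how theta depends on past gains
    grad_fn : theta -> gradient on the current (mini-)batch
    hvp_fn  : (theta, v) -> Hessian-vector product H v on the same batch
    """
    g = grad_fn(theta)
    Hv = hvp_fn(theta, v)                         # curvature information along v
    # gains grow when the gradient keeps pointing the way v has been moving
    eta = eta * np.maximum(0.5, 1.0 - mu * g * v)
    theta = theta - eta * g                       # component-wise gain times gradient
    # v accumulates -eta*g with decay lam, corrected by the curvature term H v
    v = lam * v - eta * (g + lam * Hv)
    return theta, eta, v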
See a good discussion of this here.
If a finite-difference approximation is used for the Hessian-vector product, it must be a centered difference rather than a one-sided difference:

Hv ≈ (∇f(x + rv) − ∇f(x − rv)) / (2r), where ∇f is the gradient.

The error of the finite-difference approximation as a function of the step size h looks like the figure below.

It is cropped from the note mentioned above: the error first decreases and then increases (and the left part of the curve is indeed oscillatory).
When h is very small the error is large because of numerical (round-off) problems;
when h is large the error is large because of the truncation error of the method itself.
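A small experiment that reproduces this behaviour; the test function f(x) = sum(exp(x)) is my own choice, picked because its exact Hessian-vector product is exp(x) * v, so the error can be measured directly:

import numpy as np

# f(x) = sum(exp(x)): gradient is exp(x), Hessian is diag(exp(x))
grad = lambda x: np.exp(x)
exact_hvp = lambda x, v: np.exp(x) * v

x = np.linspace(-1.0, 1.0, 10)
v = np.ones_like(x)

for r in [1e-12, 1e-8, 1e-4, 1e-1]:
    centered = (grad(x + r * v) - grad(x - r * v)) / (2 * r)
    forward = (grad(x + r * v) - grad(x)) / r
    err_c = np.max(np.abs(centered - exact_hvp(x, v)))
    err_f = np.max(np.abs(forward - exact_hvp(x, v)))
    print(f"r={r:.0e}  centered error={err_c:.2e}  one-sided error={err_f:.2e}")

Tiny r is dominated by round-off, large r by truncation, and the centered difference has a much smaller truncation error than the one-sided one.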

deepauto/autoencoder
mini-batches:
a batch here means one of the many pieces the whole data set is split into;
an epoch is one pass through all of the batches.
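A sketch of that bookkeeping (the array shapes and the update_on placeholder are mine, not the deepauto code):

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 8))       # placeholder data set
batch_size, n_epochs = 100, 5

def update_on(batch):
    """Placeholder for one gradient update on a mini-batch."""
    pass

for epoch in range(n_epochs):           # one epoch = one pass over all batches
    perm = rng.permutation(len(data))   # reshuffle so batches differ per epoch
    for start in range(0, len(data), batch_size):
        batch = data[perm[start:start + batch_size]]
        update_on(batch)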
