Data Augmentation

Basic Methods

  • Random Cropping
  • Mirroring
  • Rotation
  • Scaling
  • Color Jitter
  • Saturation and Value Jitter
  • Noise
  • Brightness
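
A minimal sketch of a few of these augmentations on a toy grayscale image, using only NumPy. The function names and parameter values here are illustrative choices, not taken from any particular augmentation library:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_horizontal_flip(img, p=0.5):
    """Mirror the image left-right with probability p."""
    return img[:, ::-1] if rng.random() < p else img

def jitter_brightness(img, max_delta=0.2):
    """Shift all pixels by a random offset, clipping back to [0, 1]."""
    return np.clip(img + rng.uniform(-max_delta, max_delta), 0.0, 1.0)

def add_gaussian_noise(img, sigma=0.05):
    """Corrupt the image with zero-mean Gaussian noise."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

img = rng.random((4, 4))  # toy grayscale image with values in [0, 1]
aug = add_gaussian_noise(jitter_brightness(random_horizontal_flip(img)))
```

In practice these are chained randomly per training sample; libraries such as torchvision provide equivalent transforms.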

Categorizing the Methods

When segmentation masks are involved, the mask must stay spatially consistent with the image, and some augmentations (rotation, for example) break that consistency unless they are applied to the mask as well. On this basis the methods can be divided into two groups:

| Affects the mask (geometric) | Does not affect the mask (photometric) |
| --- | --- |
| Random cropping | Color jitter |
| Mirroring | Saturation and value changes |
| Rotation | Noise |
| Translation | Brightness changes |
| Scaling | |
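
A sketch of how this distinction plays out in an image/mask pipeline: geometric transforms are applied to both arrays, photometric ones only to the image. All names and parameter values here are illustrative:

```python
import numpy as np

def augment_pair(img, mask, rng):
    """Apply geometric ops to image AND mask, photometric ops to the image only."""
    if rng.random() < 0.5:                       # mirroring: affects the mask
        img, mask = img[:, ::-1], mask[:, ::-1]
    delta = rng.uniform(-0.2, 0.2)               # brightness: image only
    return np.clip(img + delta, 0.0, 1.0), mask

rng = np.random.default_rng(42)
img = rng.random((4, 4))
mask = (img > 0.5).astype(np.uint8)              # toy binary segmentation mask
aug_img, aug_mask = augment_pair(img, mask, rng)
```

Keeping the two code paths separate makes it hard to accidentally jitter the labels.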

How to Read a Paper

3-pass Approach

  1. Gives you a general idea about the paper
  2. Lets you grasp the paper’s content, but not its details
  3. Helps you understand the paper in depth

1 (~5 min)

Quick scan to get a bird’s-eye view of the paper

  1. Carefully read the title, abstract, and introduction.
  2. Read the section and sub-section headings, but ignore everything else.
  3. Read the conclusions.
  4. Glance over the references, mentally ticking off the ones you’ve already read.

Answer 5 Cs:

  1. Category: What type of paper is this? A measurement paper? An analysis of an existing system? A description of a research prototype?
  2. Context: Which other papers is it related to? Which theoretical bases were used to analyse the problem?
  3. Correctness: Do the assumptions appear to be valid?
  4. Contributions: What are the paper’s main contributions?
  5. Clarity: Is the paper well written?

2 (~1 hour)

Read the paper with greater care, but ignore details such as proofs. It helps to jot down the key points, or to make comments in the margins, as you read.

  1. Look carefully at the figures, diagrams and other illustrations in the paper. Pay special attention to graphs. Are the axes properly labelled? Are results shown with error bars, so that conclusions are statistically significant? Common mistakes like these will separate rushed, shoddy work from the truly excellent.

  2. Remember to mark relevant unread references for further reading (this is a good way to learn more about the background of the paper).

After this pass, you should be able to summarize the main thrust of the paper, with supporting evidence, to someone else. This level of detail is appropriate for a paper in which you are interested, but which does not lie in your research specialty.

3

The key to the third pass is to attempt to virtually re-implement the paper: that is, make the same assumptions as the authors and re-create the work.

Naive Bayes

The Bayesian Posterior Probability

$P(y \mid x_1, \dots, x_n) = \frac{P(y) P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}$

Naive Bayes

Under the assumption that the features are conditionally independent given the class, this simplifies to

$P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}$

Since the denominator is constant for a given input,

$P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$

$\Downarrow$

$\hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y)$

Different naive Bayes variants differ mainly in how $P(x_i \mid y)$ is modeled.
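
A toy illustration of the decision rule above, with made-up priors and conditional probabilities for three binary features (the class names and numbers are purely hypothetical):

```python
import numpy as np

# Hypothetical two-class problem with three binary features.
priors = {"spam": 0.4, "ham": 0.6}
# P(x_i = 1 | y) for each class (made-up numbers for illustration).
cond = {"spam": np.array([0.8, 0.6, 0.1]),
        "ham":  np.array([0.2, 0.3, 0.5])}

def predict(x):
    """argmax_y P(y) * prod_i P(x_i | y) for a binary feature vector x."""
    scores = {}
    for y, p in cond.items():
        likelihood = np.prod(np.where(x == 1, p, 1.0 - p))
        scores[y] = priors[y] * likelihood
    return max(scores, key=scores.get)

print(predict(np.array([1, 1, 0])))  # prints "spam"
```

For x = (1, 1, 0): spam scores 0.4 · 0.8 · 0.6 · 0.9 = 0.1728, ham scores 0.6 · 0.2 · 0.3 · 0.5 = 0.018, so the rule picks "spam".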

Common Naive Bayes Variants

  • Gaussian naive Bayes

$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma^2_y}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma^2_y}\right)$

  • Multinomial naive Bayes (widely used in text processing)

For each class $y$, the per-feature probabilities are described by the vector $\theta_y = (\theta_{y1},\ldots,\theta_{yn})$, estimated with smoothing as

$\hat{\theta}_{yi} = \frac{ N_{yi} + \alpha}{N_y + \alpha n}$,

where

$N_{yi} = \sum_{x \in T} x_i$ is the total count of feature $i$ over the training samples $T$ of class $y$,

$N_{y} = \sum_{i=1}^{n} N_{yi}$ is the total count over all features of class $y$, and

$\alpha$ is the smoothing parameter.
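
A small sketch of the smoothed estimate $\hat{\theta}_{yi}$ on a toy term-count matrix; the data and class labels are made up for illustration:

```python
import numpy as np

# Toy term-count matrix: rows = documents, columns = vocabulary terms.
T = np.array([[2, 1, 0],
              [3, 0, 1],
              [0, 2, 4],
              [1, 1, 3]])
y = np.array([0, 0, 1, 1])   # class label of each document
alpha = 1.0                  # Laplace smoothing
n = T.shape[1]               # vocabulary size

theta = np.empty((2, n))
for c in (0, 1):
    N_yi = T[y == c].sum(axis=0)               # count of term i in class c
    N_y = N_yi.sum()                           # total count in class c
    theta[c] = (N_yi + alpha) / (N_y + alpha * n)
```

Each row of `theta` is a proper distribution over the vocabulary, and the smoothing keeps unseen terms from getting probability zero.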


Two Basic Assumptions

  1. Features are conditionally independent given the class
  2. Features are equally important


Gradient Boosting

XGBoost

Model parameters and the objective function (training loss plus a regularization term):

$\text{obj}(\theta) = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$

Additive training: trees are added one at a time, keeping the previous ones fixed:

$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$

$\text{obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \text{constant}$

Taylor expansion to second order:

$\text{obj}^{(t)} \approx \sum_{i=1}^{n} \left[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + \text{constant}$

where

$g_i = \partial_{\hat{y}_i^{(t-1)}}\, l(y_i, \hat{y}_i^{(t-1)}), \qquad h_i = \partial^2_{\hat{y}_i^{(t-1)}}\, l(y_i, \hat{y}_i^{(t-1)})$

After removing the constants, the objective at step $t$ becomes

$\sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$

This is the objective for the $t$-th tree. It depends on the loss only through the first- and second-order gradients $g_i$ and $h_i$, which is why custom loss functions can be supported.

We still need to add the regularization term. Writing the tree as $f_t(x) = w_{q(x)}$, where $q$ maps a sample to a leaf and $w \in \mathbb{R}^T$ are the leaf weights, XGBoost uses

$\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$

Carrying the derivation further (with the regularization term included), the objective of the $t$-th tree becomes

$\text{obj}^{(t)} \approx \sum_{j=1}^{T} \left[ \left(\sum_{i \in I_j} g_i\right) w_j + \frac{1}{2} \left(\sum_{i \in I_j} h_i + \lambda\right) w_j^2 \right] + \gamma T$

where $I_j = \{i \mid q(x_i) = j\}$ is the set of indices of the samples assigned to the $j^{th}$ leaf, and $T$ is the number of leaves of the $t^{th}$ tree. Defining $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$, this can be rewritten as

$\text{obj}^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma T$

Each $w_j$ appears in an independent quadratic, so the optimum has a closed-form answer:

$w_j^\ast = -\frac{G_j}{H_j + \lambda}, \qquad \text{obj}^\ast = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$
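
A toy numeric check of the closed-form leaf solution for squared-error loss $l(y, \hat{y}) = (y - \hat{y})^2$, where $g_i = 2(\hat{y}_i^{(t-1)} - y_i)$ and $h_i = 2$. The data, the split, and $\lambda$ are made up for illustration:

```python
import numpy as np

y = np.array([1.0, 1.2, 3.0, 3.5])       # targets
y_hat = np.array([2.0, 2.0, 2.0, 2.0])   # prediction of the first t-1 trees
g = 2.0 * (y_hat - y)                    # first-order gradients
h = np.full_like(y, 2.0)                 # second-order gradients

lam = 1.0  # L2 regularization on leaf weights
# Suppose a candidate tree assigns samples {0, 1} to one leaf and {2, 3} to another.
for I_j in (np.array([0, 1]), np.array([2, 3])):
    G, H = g[I_j].sum(), h[I_j].sum()
    w_star = -G / (H + lam)              # optimal leaf weight w_j*
    obj_star = -0.5 * G**2 / (H + lam)   # this leaf's contribution to obj*
    print(I_j, round(w_star, 3), round(obj_star, 3))
```

Note that the leaf weights pull the predictions toward each leaf's residuals, shrunk by $\lambda$; summing the per-leaf `obj_star` terms (plus $\gamma T$) scores the candidate tree structure.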


Reference

  1. XGBoost docs: Introduction to Boosted Trees