
# Decision Trees

[toc]


## Classification Trees
  To grow a classification tree, consider the current node together with its feature set A and data set D. First decide whether the node can be split at all. Splitting stops when (1) all samples in D belong to a single class, (2) the samples take identical values on every feature in A, so no split can separate them, or (3) D contains fewer samples than a preset minimum split size. In case (1) the node becomes a leaf labelled with that single class; in cases (2) and (3) it becomes a leaf labelled with the majority class in D. When the node can be split, the key decision is which feature to split on; this is the most important step in decision tree learning, and different splitting criteria lead to different decision tree algorithms. After the split, the same procedure is applied recursively to each child node.
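
A minimal sketch of this recursive procedure, in Python over a toy data representation (each sample as a dict of feature values plus a label, D assumed non-empty); `choose_feature` is left as a caller-supplied criterion, since the possible criteria are discussed in the next section:

```python
def build_tree(D, A, choose_feature, min_samples=2):
    """Recursively grow a classification tree.

    D: list of (features_dict, label) pairs; A: set of feature names;
    choose_feature(D, A): splitting criterion supplied by the caller.
    """
    labels = [y for _, y in D]

    # Case 1: every sample in D has the same class -> leaf with that class.
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}

    # Cases 2 and 3: nothing left to split on (no features, or all samples
    # identical on A), or too few samples -> leaf with the majority class.
    identical = all(x[a] == D[0][0][a] for x, _ in D for a in A)
    if not A or identical or len(D) < min_samples:
        return {"leaf": max(set(labels), key=labels.count)}

    # Otherwise split on the feature chosen by the criterion and recurse.
    a_star = choose_feature(D, A)
    node = {"feature": a_star, "children": {}}
    for v in {x[a_star] for x, _ in D}:
        D_v = [(x, y) for x, y in D if x[a_star] == v]
        node["children"][v] = build_tree(D_v, A - {a_star}, choose_feature, min_samples)
    return node
```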

## Node Splitting Criteria

  First, the notion of entropy: $Ent(D) = -\sum_{i=1}^C p_i \log p_i$, where $p_i$ is the proportion of class $i$ among the $C$ classes in $D$. The smaller the entropy, the higher the purity. We want the nodes obtained after splitting on an attribute to be as pure as possible.
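
As a quick illustration, the formula in code (the helper name `entropy`, the use of numpy, and log base 2 are just conventions of this note):

```python
import numpy as np

def entropy(labels):
    """Ent(D) = -sum_i p_i * log2(p_i), with p_i the class proportions in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# A pure node has entropy 0; a 50/50 binary node has entropy 1 (with log base 2).
print(entropy([1, 1, 1, 1]))   # 0.0
print(entropy([0, 0, 1, 1]))   # 1.0
```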

Information gain: biased towards attributes with many distinct values; used by the ID3 decision tree.
$$
Gain(D,a) = Ent(D) - \sum_{v=1}^V\frac{|D^v|}{|D|}Ent(D^v)
$$
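
A corresponding sketch of $Gain(D,a)$ for a single discrete attribute, reusing the `entropy` helper above (X and y are assumed to be numpy arrays, with attribute a given as a column index):

```python
import numpy as np

def information_gain(X, y, a):
    """Gain(D, a) = Ent(D) - sum_v |D^v|/|D| * Ent(D^v), for discrete column a of X."""
    n = len(y)
    weighted = 0.0
    for v in np.unique(X[:, a]):
        mask = X[:, a] == v
        weighted += mask.sum() / n * entropy(y[mask])
    return entropy(y) - weighted
```
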
Gain ratio: biased towards attributes with few distinct values; used by the C4.5 decision tree.
$$
Gain\_ratio(D,a) = \frac{Gain(D,a)}{IV(a)}
\\
IV(a) = -\sum_{v=1}^V \frac{|D^v|}{|D|} \log \frac{|D^v|}{|D|}
$$
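
And the gain ratio, continuing the same sketch; note that C4.5 actually applies the gain ratio only to attributes whose information gain is above average, a heuristic this sketch omits:

```python
import numpy as np

def gain_ratio(X, y, a):
    """Gain_ratio(D, a) = Gain(D, a) / IV(a), for discrete column a of X."""
    _, counts = np.unique(X[:, a], return_counts=True)
    p = counts / len(y)
    iv = float(-(p * np.log2(p)).sum())          # intrinsic value IV(a)
    return information_gain(X, y, a) / iv if iv > 0 else 0.0
```
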
Gini index: the smaller the Gini index, the higher the purity, so we choose the attribute whose Gini index after the split is smallest; used by CART.
$$
\text{Gini value:}\quad Gini(D) = 1-\sum_{i=1}^C p_i^2
\\
\text{Gini index:}\quad Gini\_index(D,a) = \sum_{v=1}^V\frac{|D^v|}{|D|} Gini(D^v)
$$
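
Similarly, a sketch of the Gini value and the Gini index over a discrete attribute; the attribute minimizing `gini_index` is the one to split on:

```python
import numpy as np

def gini(labels):
    """Gini(D) = 1 - sum_i p_i^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

def gini_index(X, y, a):
    """Gini_index(D, a) = sum_v |D^v|/|D| * Gini(D^v), for discrete column a of X."""
    return sum((X[:, a] == v).mean() * gini(y[X[:, a] == v]) for v in np.unique(X[:, a]))
```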

## Regression Trees

  A regression tree is grown in much the same way as a classification tree. It is a binary tree that repeatedly splits nodes and thereby partitions the feature space; training samples are assigned to the resulting regions according to their feature values, and the value stored at a trained leaf is the mean of the training labels that fall into it (when the loss is squared error). To split a node, each feature is examined in turn, a candidate split value is chosen for it, and the squared-error loss of the resulting children is computed; the split actually used is the feature and split point that minimize this squared-error loss.
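
A sketch of that split search: for every feature and every candidate threshold, the split is scored by the total squared error when each child predicts the mean label of its samples, and the best (feature, threshold) pair is kept. The name `best_split` is just for illustration.

```python
import numpy as np

def best_split(X, y):
    """Return the (feature index, threshold) minimizing squared-error loss after the split."""
    best_j, best_t, best_loss = None, None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            # Each child predicts the mean of its labels, so the split cost is
            # the sum of within-child squared deviations.
            loss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if loss < best_loss:
                best_j, best_t, best_loss = j, t, loss
    return best_j, best_t
```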

## Pros and Cons of Decision Trees

### Pros

  1. Good interpretability
  2. Can handle missing values
  3. Insensitive to normalization and standardization of the data
  4. Fairly robust to outliers
  5. A non-linear model that can combine features
  6. Performs feature selection automatically

### Cons

  1. Prone to overfitting
  2. High variance: because the tree is grown top-down, errors made at the top levels propagate to the levels below
  3. The learned model is a piecewise-constant function, so it is not smooth; for regression tasks it can only predict a limited set of distinct values

The following advantages and disadvantages are reproduced verbatim from the scikit-learn decision tree documentation; a short usage example follows after the lists.

Some advantages of decision trees are:

  1. Simple to understand and to interpret. Trees can be visualised.
  2. Requires little data preparation. Other techniques often require data normalisation, dummy variables need to be created and blank values to be removed. Note however that this module does not support missing values.
  3. The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.
  4. Able to handle both numerical and categorical data. Other techniques are usually specialised in analysing datasets that have only one type of variable. See algorithms for more information.
  5. Able to handle multi-output problems.
  6. Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily explained by boolean logic. By contrast, in a black box model (e.g., in an artificial neural network), results may be more difficult to interpret.
  7. Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.
  8. Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.

The disadvantages of decision trees include:

  1. Decision-tree learners can create over-complex trees that do not generalise the data well. This is called overfitting. Mechanisms such as pruning (not currently supported), setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem.
  2. Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.
  3. The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.
  4. There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.
  5. Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.
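
Consistent with the scikit-learn notes above, overfitting is usually reined in by limiting tree depth or the minimum number of samples per leaf. A minimal usage example on the built-in iris dataset (the specific hyperparameter values here are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Cap the depth and the leaf size to mitigate the overfitting discussed above.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, min_samples_leaf=5, random_state=0)
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
print(export_text(clf, feature_names=list(load_iris().feature_names)))
```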

## References

[1] 周志华. 机器学习.
[2] Hastie, Tibshirani, Friedman. The Elements of Statistical Learning.
[3] scikit-learn. Decision Trees documentation.