计算并给出一个信息增益比率的样例。

信息增益比率

因为信息增益中,不同的属性值越多,信息增益也越大,所以通过信息增益比率来做相关尝试。

样例

Outlook Temperature Humidity 湿度 Wind Play
Sunny 晴朗 Hot High False No
Sunny 晴朗 Hot High True No
Overcast 低沉 Hot High False Yes
Rainy 雨 Mild 轻微 High False Yes
Rainy 雨 Cool 凉 Normal False Yes
Rainy 雨 Cool 凉 Normal True No
Overcast 低沉 Cool 凉 Normal True Yes
Sunny 晴朗 Mild 轻微 High False No
Sunny 晴朗 Cool 凉 Normal False Yes
Rainy 雨 Mild 轻微 Normal False Yes
Sunny 晴朗 Mild 轻微 Normal False Yes
Overcast 低沉 Mild 轻微 High True Yes
Overcast 低沉 Hot Normal False Yes
Rainy 雨 Mild 轻微 High True No
Outlook  Yes No Count of each group Entropy  
sunny 晴朗 2 3 5 0.971
overcast 低沉 4 0 4 0.000
rainy 雨 3 2 5 0.971
Results 结果 Values        
Information 信息 0.694      
Overall entropy 总熵 0.940      
Information gain 信息增益 0.247      
Split information 拆分信息 1.577      
Gain ratio 增益比率 0.156      
  • Overall entropy:-5/14 * log2(5/14) - 9/14 * log2(9/14) = 0.940
    • No 的数量为 5,Yes 的数量为 9,一共 14 个
  • Entropy 计算:
    • sunny 熵: -2/5 * log2(2/5) - 3/5 * log2(3/5) = 0.971
    • overcast 熵: -4/4 * log2(4/4) - 0/4 * log2(0/4) = 0
    • sunny 熵: -3/5 * log2(3/5) - 2/5 * log2(2/5) = 0.971
  • Information: 5/14 * 0.971 + 4/14 * 0 + 5/14 * 0.971 = 0.694
  • Information gain:0.940 - 0.694 = 0.247 此处有精度问题
  • Split Information:-5/14 * log2(5/14) - 4/14 * log2(4/14) - 5/14 * log2(5/14) = 1.577
  • Gain ratio:0.247 / 1.577 = 0.156

公式

overall entropy:

\[Overall = -\sum_{i=1}^{n}P(T_i)log_2P(T_i)\]

Entropy:

\[Entropy(A_j) = -\sum_{i=1}^{n}P(A_j|T_i)log_2P(A_j|T_i)\]

Information:

\[Information(A) = \sum_{j=1}^{m}P(A_j)Entropy(A_j)\]

Information gain (A):

\[InformationGain(A) = Overall - Information(A)\]

Split information:

\[SplitInformation(A) = -\sum_{j=1}^{m}P(A_j)log_2P(A_j)\]

Gain ratio:

\[GainRatio(A) = \frac{InformationGain(A)}{SplitInformation(A)}\]

参考

  • https://en.wikipedia.org/wiki/Information_gain_(decision_tree)
  • https://en.wikipedia.org/wiki/Information_gain_ratio
  • https://www.itheima.com/news/20210916/180526.html