
Implementing simple logistic regression with Python

Logistic regression is, plainly put, a classification method. The idea is to build a model from a set of data together with each record's category; later, given similar data, the model can predict which category each record belongs to.

For example: given a large collection of handwritten letters along with the actual letter each one represents, build a model; later, when new handwriting comes in, the model can recognize which letter each sample stands for. This kind of analysis, where the final result has more than two possible outcomes, is called multiclass or multinomial classification.

Another example: given a large set of historical chest X-ray images together with whether each image shows a tumor, build a model; later, given a new X-ray image, use the model to judge whether a tumor is present. This kind of analysis, where the final result has only two possible outcomes, is called binary or binomial classification.

The mathematical foundation of logistic regression is the sigmoid-like (S-shaped) function and the natural logarithm. What these functions have in common is that, over particular intervals, their values approach a fixed value or infinity without ever reaching it.

As the figure below shows, the value of δ(x) = 1 / (1 + exp(-x)) approaches 0 and 1 at the extremes but always stays within (0, 1). (Note: exp(x) denotes eˣ, where e is Euler's number.)

Likewise, in the next figure: as x decreases from 1 toward 0, the natural logarithm log(x) approaches negative infinity. Conversely, as x increases from 0 toward 1, log(1 - x) approaches negative infinity. (Note: log(x) is shorthand for logₑ(x), also written ln(x).)
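The curves above were shown as images in the original post; here is a minimal sketch that reproduces them with numpy and matplotlib (the sample ranges and layout are my own choices, not part of the original code):

    import matplotlib.pyplot as plt
    import numpy as np

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    # left: the sigmoid-like curve, which crosses (0, 0.5) and stays within (0, 1)
    x = np.linspace(-10, 10, 200)
    ax1.plot(x, 1 / (1 + np.exp(-x)))
    ax1.set_title('1 / (1 + exp(-x))')

    # right: log(x) -> -inf as x -> 0, and log(1 - x) -> -inf as x -> 1
    p = np.linspace(0.001, 0.999, 200)
    ax2.plot(p, np.log(p), label='log(x)')
    ax2.plot(p, np.log(1 - p), label='log(1 - x)')
    ax2.legend()
    ax2.set_title('natural log curves')

    plt.show()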

The inner workings of logistic regression are:

1. The result of linear regression ranges from negative to positive infinity. Take f(x) = b₀ + b₁x₁ + ⋯ + bᵣxᵣ: as x₁…xᵣ vary, the value of f(x) swings between negative and positive infinity. On an unbounded range we cannot classify. Suppose f(x) in [0, 100] means one thing and in [100, 200] means another; then what about [200, 300]? [300, 400]? [400, 500]? It never ends, so how could we proceed?

2. However, we can use a sigmoid-like function to squash the (-∞, +∞) result of f(x) into (0, 1). How? Simply pass f(x) as the argument to the sigmoid-like function δ(x). That is, written as δ(f(x)) = 1 / (1 + exp(-f(x))), the value of δ(f(x)) always falls within the interval (0, 1).

3. Finally, we set a threshold: for example, when δ(f(x)) is greater than 0.5 we take the result as positive, and when δ(f(x)) is less than or equal to 0.5 we take it as negative. (A small numeric sketch of these three steps follows this list.)
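To make the three steps concrete, here is a minimal sketch; the coefficients b0 = 1 and b1 = 2 are arbitrary illustrative values, not fitted ones:

    import numpy as np

    def classify(x, b0=1.0, b1=2.0, threshold=0.5):
        # step 1: the linear score can fall anywhere in (-inf, +inf)
        score = b0 + b1 * x
        # step 2: the sigmoid-like function squashes the score into (0, 1)
        prob = 1 / (1 + np.exp(-score))
        # step 3: compare against the threshold to pick a class
        return 1 if prob > threshold else 0

    print(classify(-3.0))  # score = -5, probability ~ 0.007 -> 0
    print(classify(3.0))   # score = 7, probability ~ 0.999 -> 1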

Here is how to implement two simple logistic regressions in Python:

I. Single-variate logistic regression:

1. Import the Python packages.

    import random
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report, confusion_matrix

2. Prepare the test data. Note how the threshold is set: a value greater than 0.5 maps to the positive result 1, otherwise to the negative result 0.

    # Single-Variate Logistic Regression With scikit-learn
    # for given equation: y = 1 / (1 + exp(-x)),
    # this equation's curve is S-shaped (a sigmoid), crosses (0, 0.5), and its y stays within (0, 1)
    # create a dataset for input(x) and output(y)
    y = []
    x = []
    r = []
    for i in range(-10, 11):
        x.append([i])
        value = 1 / (1 + np.exp(-i))
        result = 0
        if value > 0.5:
            result = 1
        y.append(value)
        r.append(result)

    print("x for equation: y = 1 / (1 + exp(-x)):")
    print(x)
    print("y for equation: y = 1 / (1 + exp(-x)):")
    print(y)
    print("r for equation: y = 1 / (1 + exp(-x)):")
    print(r)

3. Create the logistic regression model and fit it to the existing data. Note that fitting uses the x and r data sets (features and class labels), not the x and y data sets that a linear regression would use.

    # Create a Model and Train It
    # You should carefully match the solver and regularization method for several reasons:
    # 'liblinear' solver doesn’t work without regularization.
    # 'newton-cg', 'sag', 'saga', and 'lbfgs' don’t support L1 regularization.
    # 'saga' is the only solver that supports elastic-net regularization.
    model = LogisticRegression(solver='liblinear', random_state=0)
    model.fit(x, r)
    # print classes, intercept, coefficient
    print("classes:", model.classes_)
    print('intercept:', model.intercept_)
    print('coefficient:', model.coef_)

4. Predict on the old data to check whether, and how closely, the model's predictions agree with the actual results, and print the outcome as a classification report.

    # predict the old data
    r_matrix = model.predict_proba(x)
    print('predicted matrix:', r_matrix)

    r_resp = model.predict(x)
    print('predicted response:', r_resp)

    # score for old data
    score = model.score(x, r)
    print('score:', score)

    # print classification report
    print(classification_report(r, model.predict(x)))

5. Prepare two new labeled data sets (generated with the original equation), then predict them with the model to see whether its predictions on new data are accurate. Finally, plot the confusion matrix computed on the old data.

    # predict new data
    y_new = []
    x_new = []
    r_new = []
    for i in range(11, 20):
        x_new.append([i])
        value = 1 / (1 + np.exp(-i))
        result = 0
        if value > 0.5:
            result = 1
        y_new.append(value)
        r_new.append(result)
    r_resp = model.predict(x_new)
    print('predicted response:', r_resp)

    y_new = []
    x_new = []
    r_new = []
    for i in range(-20, -10):
        x_new.append([i])
        value = 1 / (1 + np.exp(-i))
        result = 0
        if value > 0.5:
            result = 1
        y_new.append(value)
        r_new.append(result)
    r_resp = model.predict(x_new)
    print('predicted response:', r_resp)

    # create the confusion matrix
    cm = confusion_matrix(r, model.predict(x))
    # show result on plot
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.imshow(cm)
    ax.grid(False)
    ax.xaxis.set(ticks=(0, 1), ticklabels=('Predicted 0s', 'Predicted 1s'))
    ax.yaxis.set(ticks=(0, 1), ticklabels=('Actual 0s', 'Actual 1s'))
    ax.set_ylim(1.5, -0.5)
    for i in range(2):
        for j in range(2):
            ax.text(j, i, cm[i, j], ha='center', va='center', color='red')
    plt.show()

II. Multi-variate logistic regression:

1. Prepare the test data. Here we assume two variables, x1 and x2, that influence the final label with weights 3 and -2 respectively.

    # Multi-Variate Logistic Regression With scikit-learn
    # The logit: f(x1, x2) = b0 + b1*x1 + b2*x2
    # The probability: p(x1, x2) = 1 / (1 + exp(-f(x1, x2)))
    # create a dataset for inputs(x1, x2), outputs(y) and probabilities(r)
    y = []
    x = []
    r = []
    for i in range(-10, 11):
        # presume b0 = 0, b1 = 3, b2 = -2, it means that x1 has positive effect, and x2 has negative effect
        x1 = random.randint(-100, 100)
        x2 = random.randint(-100, 100)
        b0 = 0
        b1 = 3
        b2 = -2
        x.append([x1, x2])
        value = 1 / (1 + np.exp(-(b0 + b1 * x1 + b2 * x2)))
        result = 0
        if value > 0.5:
            result = 1
        y.append(value)
        r.append(result)

    print("x for equation: y = 1 / (1 + exp(-x)):")
    print(x)
    print("y for equation: y = 1 / (1 + exp(-x)):")
    print(y)
    print("r for equation: y = 1 / (1 + exp(-x)):")
    print(r)

2. Create the logistic regression model and fit it to the test data. Note that only the 'liblinear' solver is used to build the model here; others such as 'newton-cg', 'sag', 'saga', and 'lbfgs' are not used. For what the other solvers do and how to use them, see the sklearn documentation (a brief sketch of swapping in another solver follows the code below).

    # Create a Model and Train It
    # You should carefully match the solver and regularization method for several reasons:
    # 'liblinear' solver doesn’t work without regularization.
    # 'newton-cg', 'sag', 'saga', and 'lbfgs' don’t support L1 regularization.
    # 'saga' is the only solver that supports elastic-net regularization.
    model = LogisticRegression(solver='liblinear', random_state=0)
    model.fit(x, r)
    # print classes, intercept, coefficient
    print("classes:", model.classes_)
    print('intercept:', model.intercept_)
    print('coefficient:', model.coef_)
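As a hedged sketch of what swapping solvers might look like, here is 'lbfgs' (scikit-learn's default solver) with its default L2 penalty, reusing the x and r data from step 1; the choice of solver is illustrative, not a recommendation:

    # a minimal sketch: same data, different solver
    model_lbfgs = LogisticRegression(solver='lbfgs', penalty='l2', random_state=0)
    model_lbfgs.fit(x, r)
    print('lbfgs score:', model_lbfgs.score(x, r))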

3. Use the created and fitted model to predict the old data, and print the classification report.

    # predict the old data
    r_matrix = model.predict_proba(x)
    print('predicted matrix:', r_matrix)

    r_resp = model.predict(x)
    print('predicted response:', r_resp)

    # score for old data
    score = model.score(x, r)
    print('score:', score)

    # print classification report
    print(classification_report(r, model.predict(x)))

4. Using the equation under test, prepare another labeled data set. Then predict the new data with the model to check prediction accuracy, and finally print the classification report.

    # predict new data
    y_new = []
    x_new = []
    r_new = []
    for i in range(-50, 51):
        # presume b0 = 0, b1 = 3, b2 = -2, it means that x1 has positive effect, and x2 has negative effect
        x1 = random.randint(-300, 300)
        x2 = random.randint(-300, 300)
        b0 = 0
        b1 = 3
        b2 = -2
        x_new.append([x1, x2])
        value = 1 / (1 + np.exp(-(b0 + b1 * x1 + b2 * x2)))
        result = 0
        if value > 0.5:
            result = 1
        y_new.append(value)
        r_new.append(result)
    print("r_new for equation: y = 1 / (1 + exp(-x)):")
    print(r_new)
    r_resp = model.predict(x_new)
    print('predicted new response:', r_resp)

    # score for new data
    score = model.score(x_new, r_new)
    print('score:', score)

    # print classification report
    print(classification_report(r_new, model.predict(x_new)))

That is the whole process of implementing logistic regression in a simple way. Only one solver was used to build the models here; what the other solvers do and how to use them is, for now, not entirely clear to me.

In fact, real-world situations are far more complicated than the simple simulations above. Logistic regression is only a basic technique for classifying linear data; its strengths are simplicity and popularity. Once you understand its inner workings, more complex regression analyses become easier to grasp.

References:

1. https://zh.wikipedia.org/wiki/%E9%82%8F%E8%BC%AF%E8%BF%B4%E6%AD%B8
2. https://courses.lumenlearning.com/boundless-algebra/chapter/graphs-of-exponential-and-logarithmic-functions/
3. https://realpython.com/logistic-regression-python/

Full source code:

https://github.com/bumblezhou/python_machine_learning/blob/master/logistic_regression.py
