Introduction to Naive Bayes
Naive Bayes is one of the simplest yet most powerful classification algorithms. It is based on Bayes' Theorem together with an assumption of independence among the predictors: the classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Naive Bayes handles both binary and multiclass classification problems.
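This "naive" independence assumption is what makes the model tractable: the joint likelihood of all the features factorizes into a product of per-feature likelihoods. For features x1, ..., xn it reads:

P(x1, ..., xn | class) = P(x1 | class) * P(x2 | class) * ... * P(xn | class)

This is exactly the product that the calculateClassProbabilities function in the code further below accumulates, one feature at a time.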
Bayes' Theorem
- Bayes' theorem describes the probability of an event based on prior knowledge of conditions that might be related to that event.
- It gives us a way to compute a conditional probability.
- Suppose we have a hypothesis (H) and evidence (E).
According to Bayes' theorem, the relationship between the probability of the hypothesis before getting the evidence, written P(H), and the probability of the hypothesis after getting the evidence, written P(H|E), is:
P(H|E) = P(E|H)*P(H)/P(E)
- Prior probability = P(H) is the probability before getting the evidence
- Posterior probability = P(H|E) is the probability after getting the evidence

In general,
P(class|data) = (P(data|class) * P(class)) / P(data)
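As a quick sanity check of this general form, here is a minimal sketch with made-up numbers: the priors and likelihoods for the two hypothetical classes "spam" and "ham" are illustrative assumptions, not values from any dataset.

Python3

# Illustrative (assumed) likelihoods and priors for two hypothetical classes
p_data_given_class = {'spam': 0.30, 'ham': 0.05}  # P(data|class)
p_class = {'spam': 0.40, 'ham': 0.60}             # P(class)

# P(data) is the total probability of the data across all classes
p_data = sum(p_data_given_class[c] * p_class[c] for c in p_class)

# P(class|data) = (P(data|class) * P(class)) / P(data)
for c in p_class:
    print(c, round(p_data_given_class[c] * p_class[c] / p_data, 3))
# spam 0.8
# ham 0.2

Note that the two posteriors sum to 1, because P(data) acts as a normalizing constant.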
Example of Bayes' Theorem
Suppose we have to find the probability that a randomly chosen card is a king, given that it is a face card.
There are 4 kings in a deck of 52 cards, which implies P(King) = 4/52.
Since all kings are face cards, P(Face|King) = 1.
There are 3 face cards in each suit of 13 cards, and there are 4 suits in total, so P(Face) = 12/52.
Therefore,
P(King|Face) = P(Face|King) * P(King) / P(Face) = (1 * 4/52) / (12/52) = 1/3
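The same arithmetic can be checked in a few lines of Python; using the standard library's Fraction keeps the result exact rather than a floating-point approximation.

Python3

from fractions import Fraction

p_king = Fraction(4, 52)         # P(King): 4 kings in a 52-card deck
p_face_given_king = Fraction(1)  # P(Face|King): every king is a face card
p_face = Fraction(12, 52)        # P(Face): 12 face cards in the deck

# P(King|Face) = P(Face|King) * P(King) / P(Face)
print(p_face_given_king * p_king / p_face)  # prints 1/3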
Download the dataset here
Code: Implementing the Naive Bayes algorithm from scratch using Python
Python3
# Importing libraries
import math
import random
import csv


# the categorical class names are changed to numeric data
# eg: yes and no encoded to 1 and 0
def encode_class(mydata):
    classes = []
    for i in range(len(mydata)):
        if mydata[i][-1] not in classes:
            classes.append(mydata[i][-1])
    for i in range(len(classes)):
        for j in range(len(mydata)):
            if mydata[j][-1] == classes[i]:
                mydata[j][-1] = i
    return mydata


# Splitting the data
def splitting(mydata, ratio):
    train_num = int(len(mydata) * ratio)
    train = []
    # initially the test set contains the whole dataset
    test = list(mydata)
    while len(train) < train_num:
        # index generated randomly from range 0
        # to length of test set
        index = random.randrange(len(test))
        # pop data rows from the test set and put them in train
        train.append(test.pop(index))
    return train, test


# Group the data rows under each class,
# eg: groups[0] and groups[1]
def groupUnderClass(mydata):
    groups = {}
    for i in range(len(mydata)):
        if mydata[i][-1] not in groups:
            groups[mydata[i][-1]] = []
        groups[mydata[i][-1]].append(mydata[i])
    return groups


# Calculating Mean
def mean(numbers):
    return sum(numbers) / float(len(numbers))


# Calculating (sample) Standard Deviation
def std_dev(numbers):
    avg = mean(numbers)
    variance = sum([pow(x - avg, 2) for x in numbers]) / float(len(numbers) - 1)
    return math.sqrt(variance)


def MeanAndStdDev(mydata):
    info = [(mean(attribute), std_dev(attribute)) for attribute in zip(*mydata)]
    # eg: mydata = [[a, b, c], [m, n, o], [x, y, z]]
    # here mean of 1st attribute = (a + m + x) / 3,
    # mean of 2nd attribute = (b + n + y) / 3
    # delete the summary of the last column (the class label)
    del info[-1]
    return info


# find Mean and Standard Deviation under each class
def MeanAndStdDevForClass(mydata):
    info = {}
    groups = groupUnderClass(mydata)
    for classValue, instances in groups.items():
        info[classValue] = MeanAndStdDev(instances)
    return info


# Calculate the Gaussian Probability Density Function
def calculateGaussianProbability(x, mean, stdev):
    expo = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev, 2))))
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * expo


# Calculate Class Probabilities
def calculateClassProbabilities(info, test):
    probabilities = {}
    for classValue, classSummaries in info.items():
        probabilities[classValue] = 1
        for i in range(len(classSummaries)):
            mean, std_dev = classSummaries[i]
            x = test[i]
            probabilities[classValue] *= calculateGaussianProbability(x, mean, std_dev)
    return probabilities


# Make a prediction - the class with the highest probability wins
def predict(info, test):
    probabilities = calculateClassProbabilities(info, test)
    bestLabel, bestProb = None, -1
    for classValue, probability in probabilities.items():
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classValue
    return bestLabel


# returns predictions for a set of examples
def getPredictions(info, test):
    predictions = []
    for i in range(len(test)):
        result = predict(info, test[i])
        predictions.append(result)
    return predictions


# Accuracy score
def accuracy_rate(test, predictions):
    correct = 0
    for i in range(len(test)):
        if test[i][-1] == predictions[i]:
            correct += 1
    return (correct / float(len(test))) * 100.0


# driver code

# add the data path on your system
filename = r'E:\user\MACHINE LEARNING\machine learning algos\Naive bayes\filedata.csv'

# load the file and store it in the mydata list
mydata = csv.reader(open(filename, "rt"))
mydata = list(mydata)
mydata = encode_class(mydata)
for i in range(len(mydata)):
    mydata[i] = [float(x) for x in mydata[i]]

# split ratio: 70% of the data is used for training and 30% for testing
ratio = 0.7
train_data, test_data = splitting(mydata, ratio)
print('Total number of examples are: ', len(mydata))
print('Out of these, training examples are: ', len(train_data))
print("Test examples are: ", len(test_data))

# prepare the model
info = MeanAndStdDevForClass(train_data)

# test the model
predictions = getPredictions(info, test_data)
accuracy = accuracy_rate(test_data, predictions)
print("Accuracy of your model is: ", accuracy)
Output:
Total number of examples are:  200
Out of these, training examples are:  140
Test examples are:  60
Accuracy of your model is:  71.2376788
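If the CSV file is not at hand, the same pipeline can be exercised end to end on synthetic data. The sketch below is a hypothetical demo, not part of the original article: it replaces the CSV-loading driver section with two randomly generated Gaussian clusters and reuses the functions defined above, so it must be run after those definitions.

Python3

# Hypothetical stand-in for filedata.csv: two Gaussian clusters,
# class 0 centred at (2, 2) and class 1 centred at (6, 6)
random.seed(0)  # reproducible data and train/test split
mydata = ([[random.gauss(2.0, 1.0), random.gauss(2.0, 1.0), 0] for _ in range(100)]
          + [[random.gauss(6.0, 1.0), random.gauss(6.0, 1.0), 1] for _ in range(100)])

train_data, test_data = splitting(mydata, 0.7)
info = MeanAndStdDevForClass(train_data)
predictions = getPredictions(info, test_data)
print("Accuracy on synthetic data: ", accuracy_rate(test_data, predictions))

Because the two clusters barely overlap, the reported accuracy should come out close to 100%.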