Here we will predict the quality of wine on the basis of the given features. We use the wine quality dataset from Kaggle. This dataset contains the fundamental features that are responsible for affecting the quality of the wine. By using several Machine Learning models, we will predict the quality of the wine. Here we will deal only with the white type of wine, and we use classification techniques to further check the quality of the wine, i.e., whether it is good or bad.
Dataset: here
Python3
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
Python3
# loading the data
Dataframe = pd.read_csv(r'D:\xdatasets\winequalityN.csv')
Python3
# show the first rows and the columns
Dataframe.head()
Python3
# getting info.
Dataframe.info()
Python3
# statistical summary of the numeric columns
Dataframe.describe()
Python3
# null value check
Dataframe.isnull().sum()
Python3
# plot pairplot
sb.pairplot(Dataframe)
# show graph
plt.show()
Python3
# plot histogram
Dataframe.hist(bins=20, figsize=(10, 10))
# show plot
plt.show()
Python3
# bar plot of alcohol content against quality score
plt.figure(figsize=[15, 6])
plt.bar(Dataframe['quality'], Dataframe['alcohol'])
plt.xlabel('quality')
plt.ylabel('alcohol')
plt.show()
Python3
# correlation by visualization
plt.figure(figsize=[18, 7])
# plot correlation (numeric columns only)
sb.heatmap(Dataframe.corr(numeric_only=True), annot=True)
plt.show()
Python3
colm = []
corr = Dataframe.corr(numeric_only=True)
# loop for columns
for i in range(len(corr.columns)):
    # loop for rows below the diagonal
    for j in range(i):
        # collect columns with an absolute correlation above 0.7
        if abs(corr.iloc[i, j]) > 0.7:
            colm.append(corr.columns[i])
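Printing `colm` shows which columns were flagged; on this dataset the strongly correlated pair is free sulfur dioxide / total sulfur dioxide, which is why total sulfur dioxide is dropped in the next step (a minimal check, assuming the loop above has already run):

Python3
# inspect the flagged highly correlated columns
print(colm)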
Python3
# drop column
new_df = Dataframe.drop('total sulfur dioxide', axis=1)
Python3
# fill missing values with the column means
new_df.update(new_df.fillna(new_df.mean(numeric_only=True)))
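As a quick sanity check (a minimal sketch, assuming the imputation above ran over the numeric columns), the null counts should now all be zero:

Python3
# verify that no missing values remain after imputation
print(new_df.isnull().sum())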
Python3
# select the categorical columns
cat = new_df.select_dtypes(include='O')
# create dummies of categorical columns
df_dummies = pd.get_dummies(new_df, drop_first=True)
print(df_dummies)
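The `cat` frame above simply collects the object-typed columns; printing its column names (a quick inspection step, not in the original code) should show that 'type' is the only categorical column in this dataset:

Python3
# inspect which columns are categorical
print(cat.columns)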
Python3
# label wines with a quality score of 7 or above as 'best quality'
df_dummies['best quality'] = [1 if x >= 7 else 0 for x in Dataframe.quality]
print(df_dummies)
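Since quality scores of 7 and above are relatively rare, the resulting binary target is imbalanced; it is worth inspecting the class counts before training (a quick check using the column just created):

Python3
# class balance of the binary target
print(df_dummies['best quality'].value_counts())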
Python3
# import libraries
from sklearn.model_selection import train_test_split

# independent variables
x = df_dummies.drop(['quality', 'best quality'], axis=1)
# dependent variable
y = df_dummies['best quality']
# creating train test splits
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=40)
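A quick shape check confirms the 80/20 split (a small verification step, not in the original code):

Python3
# confirm the sizes of the train and test splits
print(xtrain.shape, xtest.shape)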
Python3
# import libraries
from sklearn.preprocessing import MinMaxScaler

# create the scaler
norm = MinMaxScaler()
# fit the scaler on the training data only
norm_fit = norm.fit(xtrain)
# transformation of training data
scal_xtrain = norm_fit.transform(xtrain)
# transformation of testing data
scal_xtest = norm_fit.transform(xtest)
print(scal_xtrain)
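Because the scaler is fit on the training data only, every scaled training feature lies in [0, 1], while test values can fall slightly outside that range; a quick check (a minimal sketch) confirms this:

Python3
# scaled training features should span exactly [0, 1]
print(scal_xtrain.min(), scal_xtrain.max())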
Python3
# import libraries
from sklearn.ensemble import RandomForestClassifier
# metrics for error checking
from sklearn.metrics import mean_squared_error
from sklearn.metrics import classification_report

# create model variable
rnd = RandomForestClassifier()
# fit the model on the scaled training data
fit_rnd = rnd.fit(scal_xtrain, ytrain)
# checking the accuracy score
rnd_score = rnd.score(scal_xtest, ytest)
print('score of model is : ', rnd_score)
print('.................................')
print('calculating the error')
# predictions on the scaled test data
x_predict = rnd.predict(scal_xtest)
# checking mean squared error
MSE = mean_squared_error(ytest, x_predict)
# checking root mean squared error
RMSE = np.sqrt(MSE)
print('mean squared error is : ', MSE)
print('root mean squared error is : ', RMSE)
print(classification_report(ytest, x_predict))
Python3
# compare predictions with the original labels
x_predict = list(rnd.predict(scal_xtest))
predicted_df = {'predicted': x_predict, 'original': ytest}
pd.DataFrame(predicted_df).head(10)
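The fitted scaler and model can also score a brand-new wine. The sketch below (hypothetical usage, not part of the original article) reuses `norm_fit` and `rnd` on a single row of the test features, kept two-dimensional so the scaler accepts it:

Python3
# score one unseen sample through the same scaler + model pipeline
sample = xtest.iloc[[0]]
sample_scaled = norm_fit.transform(sample)
print('predicted best-quality flag:', rnd.predict(sample_scaled)[0])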
Article written by mayurbadole2407 and translated by Barcelona Geeks. The original can be accessed here. Licence: CC BY-SA