===============
Scikit-learn
===============

Basic Scikit-learn documentation
================================

.. topic:: Introduction

    The objective here is to have everything useful for the projects, not to make a complete documentation of the whole package.

    Scikit-learn (http://scikit-learn.org/stable/) is the well-known machine learning library in Python.

Here is a very short summary from DataCamp:

.. figure:: Cheatsheets/Sklearn_Cheat_Sheet.png
   :scale: 100 %
   :alt: map to buried treasure

   This Cheatsheet is taken from DataCamp.

Data Preparation
================

Many algorithms need the data to be normalized or scaled first.

Scaling Data
------------

MinMax scaler: rescales the data to the [0,1] range (can be adapted to [-1,1] or other ranges). Very sensitive to outliers.
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

.. sourcecode:: python

    from sklearn.preprocessing import MinMaxScaler

    scaler = MinMaxScaler()

    # Fit only to the training data
    scaler.fit(X_train)

    # Now apply the transformations to the data:
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

To get back to the original scale, use the scaler.inverse_transform() function.

Standard scaler: we subtract the mean of the data and divide by the standard deviation (we then get the so-called Z-scores):
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

.. sourcecode:: python

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()

    # Fit only to the training data
    scaler.fit(X_train)

    # Now apply the transformations to the data:
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

The standard scaler can still be rather sensitive to outliers. Subtracting the median instead of the mean, and dividing by the interquartile range, is much more robust to outliers:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html

.. sourcecode:: python

    from sklearn.preprocessing import RobustScaler

    scaler = RobustScaler()

    # Fit only to the training data
    scaler.fit(X_train)

    # Now apply the transformations to the data:
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

Encoding of data
----------------

Here is simple label encoding. It is usually used to turn strings (but also integers) into integer labels.

.. sourcecode:: python

    from sklearn.preprocessing import LabelEncoder, OneHotEncoder

    data['col_encoded'] = LabelEncoder().fit_transform(data['col'].astype(str))

    # Also possible from pandas (pd.Categorical.from_array was removed long ago):
    data['TRANSTYPE_label'] = pd.Categorical(data['TRANSTYPE']).codes
    data[['TRANSTYPE','TRANSTYPE_label']]

.. figure:: Images/Encoding_pandas.png
   :scale: 60 %
   :alt: map to buried treasure

A maybe more general way to label-encode several features in one shot (and save the encoders!) would be:

.. sourcecode:: python

    from sklearn import preprocessing

    list_encoders = []
    for column in data[cat_cols]:
        le = preprocessing.LabelEncoder()
        data[column + "_encoded"] = le.fit_transform(data[column].astype(str))
        list_encoders.append(le)

    # The list of columns to encode and the associated (already fitted!) encoders
    # are stored in a Python dictionary, so that they can be reused later:
    dico_encoders = {'columns_to_encode': cat_cols,
                     'list_encoders': list_encoders}

Let's say we have new data with the same features:

.. sourcecode:: python

    le_test = dico_encoders['list_encoders']
    data['TRANSTYPE_encoded'] = le_test[0].transform(data['TRANSTYPE'].astype(str))

    # We can also convert back to the original feature:
    data['TRANSTYPE_back'] = le_test[0].inverse_transform(data['TRANSTYPE_encoded'])
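OneHotEncoder is imported above but not demonstrated, so here is a minimal sketch (it reuses the ``data`` DataFrame and the column name ``'col'`` from above as assumptions; note that the ``sparse_output`` keyword replaced ``sparse`` in scikit-learn 1.2):

.. sourcecode:: python

    from sklearn.preprocessing import OneHotEncoder

    # handle_unknown='ignore' avoids errors on categories unseen at fit time
    enc = OneHotEncoder(handle_unknown='ignore',
                        sparse_output=False)   # use sparse=False on scikit-learn < 1.2
    onehot = enc.fit_transform(data[['col']])

    # One output column per category, e.g. 'col_A', 'col_B', ...
    print(enc.get_feature_names_out(['col']))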
Big question: when to use label encoding, and when to use one-hot encoding? See
https://datascience.stackexchange.com/questions/9443/when-to-use-one-hot-encoding-vs-labelencoder-vs-dictvectorizor
In essence, label encoding is OK when working with tree methods.

Stratified sampling
-------------------

.. sourcecode:: python

    from sklearn.datasets import make_classification
    from sklearn.model_selection import StratifiedShuffleSplit
    from sklearn.svm import SVC
    from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                                 recall_score, classification_report, confusion_matrix)

    # We use a utility to generate artificial classification data.
    X, y = make_classification(n_samples=100, n_informative=10, n_classes=3)

    # StratifiedShuffleSplit used to live in sklearn.cross_validation and took y
    # directly; in modern versions it lives in sklearn.model_selection and the
    # splits are produced by .split(X, y).
    sss = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)

    svc = SVC()   # the classifier was not defined in the original snippet
    for train_idx, test_idx in sss.split(X, y):
        X_train, X_test, y_train, y_test = X[train_idx], X[test_idx], y[train_idx], y[test_idx]
        svc.fit(X_train, y_train)
        y_pred = svc.predict(X_test)

    # macro and micro averaging are defined here:
    # http://stackoverflow.com/questions/31421413/how-to-compute-precision-recall-accuracy-and-f1-score-for-the-multiclass-case
    # Macro: take the average of the f1-score of each class.
    # Micro: compute the f1-score using the global count of true positives / false
    # negatives, etc. (you sum the number of true positives / false negatives for
    # each class). accuracy_score gives the same!
    print(f1_score(y_test, y_pred, average="macro"))
    print(precision_score(y_test, y_pred, average="macro"))
    print(recall_score(y_test, y_pred, average="macro"))

.. _RF_sklearn-label:

Decision Tree
=============

Here is a simple example of a decision tree on the Iris dataset:

.. sourcecode:: python

    from sklearn.datasets import load_iris
    from sklearn import tree

    iris = load_iris()
    clf = tree.DecisionTreeClassifier()
    clf = clf.fit(iris.data, iris.target)

We can use a visualization tool, graphviz, to print the decision tree. It is sometimes useful for understanding the classification of a particular observation.

.. sourcecode:: python

    import graphviz

    dot_data = tree.export_graphviz(clf, out_file=None,
                                    feature_names=iris.feature_names,
                                    class_names=iris.target_names,
                                    filled=True, rounded=True,
                                    special_characters=True)
    graph = graphviz.Source(dot_data)
    graph

.. figure:: Images/Graphviz_tree.png
   :scale: 60 %
   :alt: map to buried treasure
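If graphviz is not installed, scikit-learn (>= 0.21) can draw the same tree directly with matplotlib via ``sklearn.tree.plot_tree``; a minimal sketch reusing the ``clf`` fitted above:

.. sourcecode:: python

    import matplotlib.pyplot as plt
    from sklearn import tree

    # Same tree, rendered with matplotlib instead of graphviz
    fig, ax = plt.subplots(figsize=(12, 8))
    tree.plot_tree(clf, feature_names=iris.feature_names,
                   class_names=iris.target_names,
                   filled=True, rounded=True, ax=ax)
    plt.show()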
Random Forest
=============

The RF algorithm applied to the Iris dataset. For the algorithm description see :ref:`Random Forest `

.. sourcecode:: python

    import numpy as np
    import pandas as pd
    from matplotlib import pyplot as plt
    import seaborn as sns

    # Import of SKLEARN packages
    from sklearn.metrics import accuracy_score, roc_curve, auc, confusion_matrix
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import load_iris

    # LOADING THE IRIS DATASET:
    iris = load_iris()   # The Iris dataset is available through the scikit-learn API

    idx = list(range(len(iris.target)))
    np.random.shuffle(idx)   # We shuffle it (important if we want to split into train and test sets)
    X = iris.data[idx]
    y = iris.target[idx]

    # Load the data into a pandas DataFrame
    data_pd = pd.DataFrame(data=np.column_stack((X, y)),
                           columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'label'])
    data_pd.loc[data_pd['label']==0, 'species'] = 'setosa'
    data_pd.loc[data_pd['label']==1, 'species'] = 'versicolor'
    data_pd.loc[data_pd['label']==2, 'species'] = 'virginica'
    data_pd.head()

.. figure:: Images/Iris_Dataset.png
   :scale: 100 %
   :alt: Iris dataset

.. sourcecode:: python

    # A VERY SMALL EDA
    g = sns.PairGrid(data_pd, vars=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], hue="label")
    g.map_diag(sns.kdeplot)
    g.map_lower(sns.kdeplot)
    g.map_upper(plt.scatter)

.. figure:: Images/Iris_Seaborn.png
   :scale: 100 %
   :alt: Iris with seaborn

.. sourcecode:: python

    sns.boxplot(x='species', y='petal_length', data=data_pd)

.. figure:: Images/Iris_boxplot.png
   :scale: 100 %
   :alt: Iris with seaborn's boxplot

.. sourcecode:: python

    # FEATURE SELECTION
    feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
    target = 'label'

    def train(clf_name, clf, df_used, feature_cols, target):
        X = df_used[feature_cols].values
        y = df_used[target].values

        # CREATION OF TRAIN-TEST SETS
        # stratify=y ensures that the same proportion of labels is in both train and test sets!
        x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.7, stratify=y)

        # THE FIT ON THE TRAINING SET
        clf.fit(x_train, y_train)

        # THE CLASSIFICATION
        y_pred = clf.predict(x_test)

        # EVALUATION OF THE ACCURACY
        accuracy = accuracy_score(y_test, y_pred, normalize=True, sample_weight=None)
        print('Results with the classifier: ', clf_name.upper())
        print('Accuracy derived (1=100%): ', accuracy)
        # x_test is returned too, so it can be reused by the interpretation examples below
        return x_test, y_test, y_pred

    clf_name = 'Random Forest'
    clf = RandomForestClassifier(#max_depth=50,
                                 n_estimators=100,
                                 max_features='sqrt',      # 'auto' was removed in scikit-learn 1.3
                                 criterion='gini',         # or 'entropy'
                                 class_weight='balanced',
                                 bootstrap=True,
                                 random_state=21,
                                 n_jobs=-1)                # n_jobs=-1 uses all available cores!!!

    x_test, y_test, y_pred = train(clf_name, clf, data_pd, feature_cols, target)

    # RANKING OF VARIABLES (available for tree-based models such as Random Forest)
    print("Ranking of variables from Random Forest:")
    feature_importance_index_sorted = np.argsort(clf.feature_importances_)[::-1]
    for jj in feature_importance_index_sorted:
        print(feature_cols[jj], np.around(clf.feature_importances_[jj], decimals=3)*100, '%')

    # Accuracy and confusion matrix
    accuracy = accuracy_score(y_test, y_pred)
    print('Accuracy = ', accuracy)
    print('Confusion matrix:')
    C = confusion_matrix(y_test, y_pred)
    C_normalized = C / C.astype(float).sum()   # np.float was removed from NumPy; use the builtin float

    Classes = ['setosa', 'versicolor', 'virginica']
    C_normalized_pd = pd.DataFrame(C_normalized, columns=Classes, index=Classes)
    C_normalized_pd

.. figure:: Images/Accuracy_and_ConfMat.png
   :scale: 100 %
   :alt: Accuracy and Confusion matrix
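Rather than deriving every per-class metric by hand from the confusion matrix (as done in the next block), ``sklearn.metrics.classification_report`` gives precision, recall and F1 per class directly; a quick sketch on the same ``y_test`` / ``y_pred``:

.. sourcecode:: python

    from sklearn.metrics import classification_report

    # Per-class precision, recall, f1-score and support in one call
    print(classification_report(y_test, y_pred, target_names=Classes))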
.. sourcecode:: python

    # SENSITIVITY COMPUTATION (taken from
    # https://stackoverflow.com/questions/31324218/scikit-learn-how-to-obtain-true-positive-true-negative-false-positive-and-fal)
    FP = C_normalized_pd.sum(axis=0) - np.diag(C_normalized_pd)
    FN = C_normalized_pd.sum(axis=1) - np.diag(C_normalized_pd)
    TP = np.diag(C_normalized_pd)
    TN = C_normalized_pd.values.sum() - (FP + FN + TP)

    # Sensitivity, hit rate, recall, or true positive rate
    TPR = TP/(TP+FN)
    # Specificity or true negative rate
    TNR = TN/(TN+FP)
    # Precision or positive predictive value
    PPV = TP/(TP+FP)
    # Negative predictive value
    NPV = TN/(TN+FN)
    # Fall-out or false positive rate
    FPR = FP/(FP+TN)
    # False negative rate
    FNR = FN/(TP+FN)
    # False discovery rate
    FDR = FP/(TP+FP)
    # Overall accuracy
    ACC = (TP+TN)/(TP+FP+FN+TN)

    print('SENSITIVITY:')
    Sensitivity_pd = pd.DataFrame(list(zip(Classes, TPR)), columns=['species', 'Sensitivity'])
    Sensitivity_pd

.. figure:: Images/Sensitivity.png
   :scale: 100 %
   :alt: Sensitivity

Neural Networks
===============

Introduction to the Keras package
---------------------------------

Keras (https://keras.io/) is a Python library that works on top of the neural network libraries Theano and TensorFlow. Here is a very short summary from DataCamp:

.. figure:: Cheatsheets/Keras_Cheat_Sheet_Python.png
   :scale: 100 %
   :alt: map to buried treasure

   This Cheatsheet is taken from DataCamp.

Parameters Tuning
=================

GridSearchCV (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) performs an exhaustive grid search for parameter tuning, cross-validating each parameter combination on several train/test splits of the data. Here is an example on artificial data, taken from https://stackoverflow.com/questions/30102973/how-to-get-best-estimator-on-gridsearchcv-random-forest-classifier-scikit :

.. sourcecode:: python

    from sklearn.model_selection import GridSearchCV
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Build a classification task using 3 informative features
    X, y = make_classification(n_samples=1000,
                               n_features=10,
                               n_informative=3,
                               n_redundant=0,
                               n_repeated=0,
                               n_classes=2,
                               random_state=0,
                               shuffle=False)

    base_estimator = RandomForestClassifier(n_jobs=-1, oob_score=True)

    param_grid = {
        'n_estimators': [200, 700],
        'max_features': ['sqrt', 'log2']   # 'auto' was removed in scikit-learn 1.3
    }

    CV_rfc = GridSearchCV(estimator=base_estimator, param_grid=param_grid, cv=5)
    CV_rfc.fit(X, y)   # fit on the generated data (the original snippet used undefined X_train, y_train)
    print(CV_rfc.best_params_)
    print(CV_rfc.best_score_)
    print(CV_rfc.best_estimator_)

Output::

    {'n_estimators': 200, 'max_features': 'log2'}
    0.86
    RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                max_depth=None, max_features='sqrt', max_leaf_nodes=None,
                min_samples_leaf=1, min_samples_split=2,
                min_weight_fraction_leaf=0.0, n_estimators=700, n_jobs=-1,
                oob_score=True, random_state=None, verbose=0, warm_start=False)
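When the grid is large, an exhaustive search becomes expensive; ``sklearn.model_selection.RandomizedSearchCV`` samples a fixed number of parameter settings instead. A minimal sketch on the same ``base_estimator`` (the distributions chosen here are purely illustrative):

.. sourcecode:: python

    from scipy.stats import randint
    from sklearn.model_selection import RandomizedSearchCV

    # Sample 10 parameter settings instead of trying the full grid
    param_distributions = {
        'n_estimators': randint(100, 800),   # illustrative range
        'max_features': ['sqrt', 'log2'],
    }
    rnd_search = RandomizedSearchCV(estimator=base_estimator,
                                    param_distributions=param_distributions,
                                    n_iter=10, cv=5, random_state=0)
    rnd_search.fit(X, y)
    print(rnd_search.best_params_)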
How to measure the Feature importance in RF (or other algos)
=============================================================

Mean decrease accuracy (how much does the accuracy drop if we permute a feature's values?)
------------------------------------------------------------------------------------------

Taken from http://blog.datadive.net/selecting-good-features-part-iii-random-forests/

Another popular (maybe the most robust?) feature selection method is to directly measure the impact of each feature on the accuracy of the model. The general idea is to permute the values of each feature and measure how much the permutation decreases the accuracy of the model. Clearly, for unimportant variables, the permutation should have little to no effect on model accuracy, while permuting important variables should significantly decrease it.

This method was not directly exposed in sklearn at the time of the blog post, but it is straightforward to implement (recent versions ship it as ``sklearn.inspection.permutation_importance``, see the sketch after the example below). Continuing from the previous example of ranking the features in the Boston housing dataset:

.. sourcecode:: python

    import numpy as np
    from collections import defaultdict
    from sklearn.model_selection import ShuffleSplit   # used to live in sklearn.cross_validation
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score
    from sklearn.datasets import load_boston   # removed in scikit-learn 1.2; kept here for the original example

    boston = load_boston()
    names = boston["feature_names"]
    X = boston["data"]
    Y = boston["target"]

    rf = RandomForestRegressor()
    scores = defaultdict(list)

    # Cross-validate the scores on a number of different random splits of the data
    for train_idx, test_idx in ShuffleSplit(n_splits=100, test_size=.3).split(X):
        X_train, X_test = X[train_idx], X[test_idx]
        Y_train, Y_test = Y[train_idx], Y[test_idx]
        rf.fit(X_train, Y_train)
        acc = r2_score(Y_test, rf.predict(X_test))
        for i in range(X.shape[1]):
            X_t = X_test.copy()
            np.random.shuffle(X_t[:, i])
            shuff_acc = r2_score(Y_test, rf.predict(X_t))
            scores[names[i]].append((acc - shuff_acc) / acc)

    print("Features sorted by their score:")
    print(sorted([(round(np.mean(score), 4), feat)
                  for feat, score in scores.items()], reverse=True))

Output::

    Features sorted by their score:
    [(0.7276, 'LSTAT'), (0.5675, 'RM'), (0.0867, 'DIS'), (0.0407, 'NOX'),
     (0.0351, 'CRIM'), (0.0233, 'PTRATIO'), (0.0168, 'TAX'), (0.0122, 'AGE'),
     (0.005, 'B'), (0.0048, 'INDUS'), (0.0043, 'RAD'), (0.0004, 'ZN'),
     (0.0001, 'CHAS')]

In this example LSTAT and RM are the two features that most strongly impact model performance: permuting them decreases model performance by ~73% and ~57% respectively. Keep in mind though that these measurements are made only after the model has been trained on (and depends on) all of these features. This doesn't mean that if we train the model without one of these features, its performance will drop by that amount, since other, correlated features can be used instead.
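Recent scikit-learn (>= 0.22) exposes this technique as ``sklearn.inspection.permutation_importance``, which avoids the manual shuffling loop; a minimal sketch on a fitted regressor (reusing ``rf``, ``X_test``, ``Y_test`` and ``names`` from above):

.. sourcecode:: python

    from sklearn.inspection import permutation_importance

    # Permute each feature n_repeats times on the held-out set and
    # measure the mean drop in the default score (R^2 for a regressor)
    result = permutation_importance(rf, X_test, Y_test, n_repeats=10, random_state=0)
    for i in result.importances_mean.argsort()[::-1]:
        print(names[i], round(result.importances_mean[i], 4))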
Score explanation for individual observations (using LIME)
============================================================

Taken from http://pythondata.com/tag/explanation/

When working with classification and/or regression techniques, it is always good to have the ability to 'explain' what your model is doing. Using Local Interpretable Model-agnostic Explanations (LIME), you now have the ability to quickly provide visual explanations of your model(s). According to the paper, LIME is 'an algorithm that can explain the predictions of any classifier or regressor in a faithful way, by approximating it locally with an interpretable model.'

Classification examples
-----------------------

Simplest example, using only continuous features (not categorical). Here is an example based on the Iris dataset, taken from https://marcotcr.github.io/lime/tutorials/Tutorial%20-%20continuous%20and%20categorical%20features.html :

.. sourcecode:: python

    import numpy as np
    import sklearn.datasets
    import sklearn.ensemble
    import sklearn.metrics
    import sklearn.model_selection
    import lime
    import lime.lime_tabular

    iris = sklearn.datasets.load_iris()

    train, test, labels_train, labels_test = sklearn.model_selection.train_test_split(
        iris.data, iris.target, train_size=0.80)

    rf = sklearn.ensemble.RandomForestClassifier(n_estimators=500)
    rf.fit(train, labels_train)

Output::

    RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                max_depth=None, max_features='auto', max_leaf_nodes=None,
                min_impurity_decrease=0.0, min_impurity_split=None,
                min_samples_leaf=1, min_samples_split=2,
                min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=1,
                oob_score=False, random_state=None, verbose=0, warm_start=False)

.. sourcecode:: python

    sklearn.metrics.accuracy_score(labels_test, rf.predict(test))

Output::

    0.90

.. sourcecode:: python

    explainer = lime.lime_tabular.LimeTabularExplainer(train,
                                                       feature_names=iris.feature_names,
                                                       class_names=iris.target_names,
                                                       discretize_continuous=True)

    # We select one random observation (we could also take a particular one, e.g. i=20)
    i = np.random.randint(0, test.shape[0])
    exp = explainer.explain_instance(test[i], rf.predict_proba, num_features=2, top_labels=1)

    exp.show_in_notebook(show_table=True, show_all=False)

.. figure:: Images/LIME_on_IRIS_classification_figure.png
   :scale: 100 %
   :alt: map to buried treasure

Score explanation for individual observations (using tree-interpreter)
========================================================================

This package is designed for tree-based methods like decision trees, random forests and extremely randomized trees.

.. sourcecode:: python

    # Built on top of the RF model fitted on the Iris dataset above
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from treeinterpreter import treeinterpreter as ti   # 'import treeinterpreter as ti' does not expose predict()

    feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
    Classes = ['setosa', 'versicolor', 'virginica']

    def TreeInterpreter_contributions(x_test, fit, feature_cols):
        # Let's predict now for a single instance.
        instance = x_test[:1]   # keep the 2D shape: treeinterpreter expects an array of samples
        fit.predict_proba(instance)
        # Breakdown of feature contributions:
        prediction, bias, contributions = ti.predict(fit, instance)
        return prediction, bias, contributions

    def waterfall(series, pos_name, neg_name):
        # See http://pbpython.com/waterfall-chart.html
        fig, ax = plt.subplots(figsize=(7, 7))
        df = pd.DataFrame({pos_name: np.maximum(series, 0), neg_name: np.minimum(series, 0)})
        blank = series.cumsum().shift(1).fillna(0)
        df.plot(kind='bar', stacked=True, bottom=blank, color=['r', 'b'], ax=ax)
        ax.get_children()[len(series)-1].set_color('g')   # Put the prediction in green
        step = blank.reset_index(drop=True).repeat(3).shift(-1)
        step[1::3] = np.nan
        plt.plot(step.index, step.values, '-k', linewidth=0.5)
        return fig

    prediction, bias, contributions = TreeInterpreter_contributions(x_test, clf, feature_cols)

    class_selected = np.argmax(prediction[0])
    class_name_selected = Classes[class_selected]
    print(class_selected, class_name_selected)

    test2 = pd.Series([bias[0][class_selected]]
                      + list(contributions[0][:, class_selected])
                      + [-prediction[0][class_selected]],
                      index=['bias'] + feature_cols + ['Net'])

    fig = waterfall(test2, class_name_selected, 'Other labels')
    plt.ylim([-0.1, 1.1])
    plt.axhline(y=0., linewidth=1, color='k', dashes=(5, 5))
    plt.show()

.. figure:: Images/Contributions.png
   :scale: 100 %
   :alt: Contributions to the prediction probability by tree-interpreter
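The decomposition returned by ``ti.predict`` is exact: for each class, prediction = bias + sum of the feature contributions. A short check (reusing ``prediction``, ``bias`` and ``contributions`` from the block above):

.. sourcecode:: python

    # Per-feature contributions for the predicted class
    for feat, contrib in zip(feature_cols, contributions[0][:, class_selected]):
        print(feat, round(contrib, 4))

    # The decomposition is exact: bias + sum of contributions == prediction
    total = bias[0][class_selected] + contributions[0][:, class_selected].sum()
    print(total, prediction[0][class_selected])   # the two numbers should match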
Very important update in the field of interpretation
======================================================

- ELI5 (Explain me like I am 5): https://eli5.readthedocs.io/en/latest/
- SHAP:

  * https://github.com/slundberg/shap : main intro + examples
  * http://www.f1-predictor.com/model-interpretability-with-shap/
  * https://medium.com/civis-analytics/demystifying-black-box-models-with-shap-value-analysis-3e20b536fc80

Ensemble Classification
=======================

See

- http://scikit-learn.org/stable/modules/ensemble.html
- http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html
- http://scikit-learn.org/stable/auto_examples/ensemble/plot_voting_decision_regions.html#sphx-glr-auto-examples-ensemble-plot-voting-decision-regions-py
- http://scikit-learn.org/stable/auto_examples/ensemble/plot_voting_probas.html#sphx-glr-auto-examples-ensemble-plot-voting-probas-py

Here is a case with 3 classifiers, with a parameter grid search:

.. sourcecode:: python

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.model_selection import GridSearchCV

    iris = load_iris()

    clf1 = LogisticRegression(random_state=1)
    clf2 = RandomForestClassifier(random_state=1)
    clf3 = GaussianNB()

    # voting='hard' is majority voting, while voting='soft' averages the predicted
    # probabilities, which is less brutal than a hard vote.
    eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='soft')
    # If we want to weight the classifiers:
    #eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
    #                        voting='soft', weights=[2, 5, 1])

    params = {'lr__C': [1.0, 100.0],
              'rf__n_estimators': [20, 200]}

    grid = GridSearchCV(estimator=eclf, param_grid=params, cv=5)
    grid = grid.fit(iris.data, iris.target)

Clustering
==========

Excellent intro to the field: https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68

Gaussian Mixture Model (GMM)
----------------------------

http://scikit-learn.org/stable/modules/mixture.html

1. Classical GMM (a minimal sketch is given right below)
2. Bayesian GMM (developed in the rest of this section)
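For point 1, here is a minimal classical GMM sketch (my own illustration, not from the original notes): with ``sklearn.mixture.GaussianMixture`` the number of components must be chosen by hand, for instance by minimizing the BIC over a range of candidates.

.. sourcecode:: python

    import numpy as np
    from sklearn.mixture import GaussianMixture
    from sklearn.datasets import make_blobs

    # Toy data with 3 clusters (illustrative)
    X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

    # Fit a classical GMM for several numbers of components and keep the
    # one minimizing the Bayesian Information Criterion (BIC):
    bics = [GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
            for k in range(1, 7)]
    best_k = int(np.argmin(bics)) + 1
    print(best_k)   # expected to recover 3 on this toy data

    gmm = GaussianMixture(n_components=best_k, random_state=0).fit(X)
    labels = gmm.predict(X)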
For the Bayesian GMM (point 2), we can detect the number of components automatically (in theory). Due to its Bayesian nature, the variational algorithm needs more hyper-parameters than expectation-maximization, the most important of these being the concentration parameter ``weight_concentration_prior``. Specifying a low value for the concentration prior makes the model put most of the weight on a few components and set the remaining components' weights very close to zero. High values of the concentration prior allow a larger number of components to be active in the mixture.

The BayesianGaussianMixture class proposes two types of prior for the weights distribution: a finite mixture model with a Dirichlet distribution, and an infinite mixture model with a Dirichlet Process. In practice the Dirichlet Process inference algorithm is approximated and uses a truncated distribution with a fixed maximum number of components (the so-called stick-breaking representation). The number of components actually used almost always depends on the data.

Important parameters:

* weight_concentration_prior_type : str, defaults to 'dirichlet_process'. String describing the type of the weight concentration prior. Must be one of: 'dirichlet_process' (using the stick-breaking representation), 'dirichlet_distribution' (can favor more uniform weights).
* weight_concentration_prior : float | None, optional. The Dirichlet concentration of each component on the weight distribution. This is commonly called gamma in the literature. A higher concentration puts more mass in the center and leads to more components being active, while a lower concentration parameter leads to more mass at the edge of the mixture weights simplex. The value of the parameter must be greater than 0. If it is None, it is set to 1. / n_components.

See http://scikit-learn.org/stable/modules/mixture.html#variational-bayesian-gaussian-mixture

See http://scikit-learn.org/stable/modules/generated/sklearn.mixture.BayesianGaussianMixture.html#sklearn.mixture.BayesianGaussianMixture

Based on http://scikit-learn.org/stable/auto_examples/mixture/plot_concentration_prior.html#sphx-glr-auto-examples-mixture-plot-concentration-prior-py

.. sourcecode:: python

    import numpy as np
    import pandas as pd
    from sklearn.mixture import BayesianGaussianMixture
    import matplotlib as mpl
    import matplotlib.pyplot as plt
    %matplotlib inline

    # THE GENERATED DATASET
    # Parameters of the dataset
    random_state, n_components, n_features = 2, 3, 2
    colors = np.array(['#0072B2', '#F0E442', '#D55E00'])

    covars = np.array([[[.7, .0], [.0, .1]],
                       [[.5, .0], [.0, .1]],
                       [[.5, .0], [.0, .1]]])
    samples = np.array([500, 500, 500])
    means = np.array([[1, 0],
                      [2, 1],
                      [3, 2]])

    # Generate the data
    rng = np.random.RandomState(random_state)
    X = np.vstack([rng.multivariate_normal(means[j], covars[j], samples[j])
                   for j in range(n_components)])
    y = np.concatenate([j * np.ones(samples[j], dtype=int)
                        for j in range(n_components)])
    X.shape, y.shape

    estimator = BayesianGaussianMixture(
        weight_concentration_prior_type="dirichlet_process",   # (i.e. infinite); or "dirichlet_distribution" (i.e. finite)
        n_components=5 * n_components,   # deliberately too many: the fit should silence the extra ones
        reg_covar=0,
        init_params='random',
        max_iter=1500,
        mean_precision_prior=.8,
        random_state=random_state)

    estimator.fit(X)
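Because the Dirichlet process prior switches off unneeded components, one way to read the result (a sketch; the cutoff below is arbitrary) is to count the mixture weights that stay above a small threshold:

.. sourcecode:: python

    # Components whose mixture weight survived the fit (cutoff is arbitrary)
    active = estimator.weights_ > 1e-2
    print(np.sum(active), "active components out of", estimator.n_components)
    print(np.round(estimator.weights_[active], 3))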
We can then compare the fitted components to the true generating Gaussians:

.. sourcecode:: python

    def plot_ellipses(ax, weights, means, covars, edgecolor='black', by_weights='no'):
        for n in range(means.shape[0]):
            eig_vals, eig_vecs = np.linalg.eigh(covars[n])
            unit_eig_vec = eig_vecs[0] / np.linalg.norm(eig_vecs[0])
            angle = np.arctan2(unit_eig_vec[1], unit_eig_vec[0])
            # Ellipse needs degrees
            angle = 180 * angle / np.pi
            # Eigenvector normalization
            eig_vals = 2 * np.sqrt(2) * np.sqrt(eig_vals)
            # angle is keyword-only in recent matplotlib versions
            ell = mpl.patches.Ellipse(means[n], eig_vals[0], eig_vals[1],
                                      angle=180 + angle, edgecolor=edgecolor)
            ell.set_clip_box(ax.bbox)
            if by_weights == 'yes':
                ell.set_alpha(5 * weights[n])
            ell.set_facecolor('none')
            ax.add_artist(ell)

    fig = plt.figure(1, (17, 6))

    ax = plt.subplot(121)
    ax.scatter(X[:, 0], X[:, 1], s=5, marker='o', color=colors[y])
    ax.set_xlabel('x1', fontsize=15)
    ax.set_ylabel('x2', fontsize=15)
    # Fitted components (red) vs the true generating Gaussians (black)
    plot_ellipses(ax, estimator.weights_, estimator.means_, estimator.covariances_, edgecolor='red')
    plot_ellipses(ax, [1, 1, 1], means, covars)

    ax = plt.subplot(122)
    ax.scatter(X[:, 0], X[:, 1], s=5, marker='o', color=colors[y])
    ax.set_xlabel('x1', fontsize=15)
    ax.set_ylabel('x2', fontsize=15)
    # Same, but the alpha of each fitted ellipse is proportional to its weight
    plot_ellipses(ax, estimator.weights_, estimator.means_, estimator.covariances_, edgecolor='red', by_weights='yes')
    plot_ellipses(ax, [1, 1, 1], means, covars)

    plt.show()

.. sourcecode:: python

    estimator.weights_

Output::

    array([3.23183804e-01, 1.07006400e-01, 3.24464101e-01, 2.45301497e-01,
           4.14353274e-05, 2.58970793e-06, 1.61856745e-07, 1.01160466e-08,
           6.32252912e-10, 3.95158070e-11, 2.46973794e-12, 1.54358621e-13,
           9.64741381e-15, 6.02963363e-16, 3.76852102e-17])

.. figure:: Images/GMM_bayes.png
   :scale: 100 %
   :alt: map to buried treasure
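As a follow-up, the fitted mixture can be used directly for clustering; both calls below are standard BayesianGaussianMixture methods:

.. sourcecode:: python

    # Hard assignment: index of the most probable component for each point
    labels = estimator.predict(X)

    # Soft assignment: posterior probability of each component for each point
    probs = estimator.predict_proba(X)
    print(labels[:10])
    print(np.round(probs[0], 3))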