Key Drivers Analysis

Author: Dominic Schenone

Published: June 11, 2025

This post implements a few measures of variable importance, interpreted as a key drivers analysis, relating certain aspects of a payment card to customer satisfaction with that card.

Understanding the Landscape of Machine Learning

When diving into machine learning, two primary branches emerge: supervised and unsupervised learning. Each serves a different purpose and is best suited for particular types of business or research questions.

Tip: Knowing when to use supervised vs unsupervised learning is crucial. It’s not just about which algorithm performs best; it’s about which type of model is appropriate for the task at hand.


Supervised Learning: Learning from Labels

Supervised learning uses labeled data, where the correct outcome is known. The goal is to learn a mapping from inputs (features) to outputs (labels) that can be generalized to predict future outcomes.

Common Applications

  • Predicting customer churn (binary classification)
  • Forecasting sales or prices (regression)
  • Recognizing objects in images (multi-class classification)

Algorithms

  • Linear/Logistic Regression
  • Decision Trees, Random Forests
  • Support Vector Machines
  • K-Nearest Neighbors (KNN), which we’ll implement below

Challenges

  • Requires a well-labeled dataset
  • Risk of overfitting to training data
  • Performance is sensitive to irrelevant features or noisy labels

Unsupervised Learning: Finding Hidden Patterns

In contrast, unsupervised learning deals with unlabeled data. The algorithm tries to uncover hidden patterns, groupings, or structures from the input alone.

Common Applications

  • Customer segmentation in marketing
  • Anomaly detection in fraud analytics
  • Topic modeling in NLP

Algorithms

  • K-Means Clustering
  • Hierarchical Clustering
  • Principal Component Analysis (PCA)
  • Latent Class Models

Challenges

  • Evaluation is less straightforward—no ground truth
  • Sensitive to initialization and scaling
  • Choosing the right number of clusters or components is tricky

Supervised vs Unsupervised: Key Differences

| Aspect | Supervised Learning | Unsupervised Learning |
|---|---|---|
| Input Data | Labeled | Unlabeled |
| Goal | Predict outcomes | Find structure |
| Evaluation | Metrics like accuracy, RMSE | Subjective; silhouette score, etc. |
| Complexity | Depends on model type | Often computationally expensive |
| Common Use Cases | Classification, Regression | Clustering, Dimensionality Reduction |

Our Use Case: Data-Driven Insights Across Domains

In this assignment, we’ll apply both supervised and unsupervised learning to two unique datasets, each offering its own opportunities and challenges.

🐧 1. Palmer Penguins Dataset

  • Purpose: Supervised learning with K-Nearest Neighbors (KNN)
  • Task: Classify penguins by species based on physical characteristics

🏷️ 2. Brand Key Drivers Dataset

  • Purpose: Unsupervised learning via K-Means clustering
  • Task: Segment respondents by their perceptions of brand characteristics

In the next sections, we’ll explore how to prepare these datasets and apply machine learning techniques to draw actionable insights. Whether predicting outcomes or discovering hidden structures, machine learning gives us a powerful lens through which to understand the data.

Load and Inspect the Penguin Dataset
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

penguins = pd.read_csv("palmer_penguins.csv")
penguins_clean = penguins.dropna()  # Drop missing values

penguins_clean.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181 3750 male 2007
1 Adelie Torgersen 39.5 17.4 186 3800 female 2007
2 Adelie Torgersen 40.3 18.0 195 3250 female 2007
3 Adelie Torgersen 36.7 19.3 193 3450 female 2007
4 Adelie Torgersen 39.3 20.6 190 3650 male 2007
Visualize Feature Relationships by Species
sns.pairplot(penguins_clean, hue="species")
plt.suptitle("Pairplot of Penguin Features by Species", y=1.02)
plt.tight_layout()
plt.show()

Pairplot of Numeric Features in Penguin Dataset

The pairplot reveals strong species-level clustering in variables like bill_length_mm, flipper_length_mm, and body_mass_g. These features appear highly informative and will likely support strong performance in classification models such as K-Nearest Neighbors (KNN).
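
To back that visual impression with numbers, a quick summary of the same features by species can help. Below is a minimal sketch using the cleaned data from above:

# Hedged sketch: mean of the most informative-looking features by species
penguins_clean.groupby("species")[["bill_length_mm", "flipper_length_mm", "body_mass_g"]].mean().round(1)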

Compute Correlations Between Brand Attributes
drivers = pd.read_csv("data_for_drivers_analysis.csv")

# Remove non-feature columns
drivers_corr = drivers.drop(columns=["id", "brand"]).corr()
drivers_corr
satisfaction trust build differs easy appealing rewarding popular service impact
satisfaction 1.000000 0.255706 0.191896 0.184801 0.212985 0.207997 0.194561 0.171425 0.251098 0.254539
trust 0.255706 1.000000 0.399652 0.306493 0.480973 0.420704 0.423868 0.389409 0.503966 0.354062
build 0.191896 0.399652 1.000000 0.370705 0.443953 0.369456 0.435770 0.333667 0.407956 0.383639
differs 0.184801 0.306493 0.370705 1.000000 0.349693 0.418159 0.349758 0.266456 0.367615 0.416957
easy 0.212985 0.480973 0.443953 0.349693 1.000000 0.432904 0.461316 0.387304 0.456976 0.412092
appealing 0.207997 0.420704 0.369456 0.418159 0.432904 1.000000 0.481159 0.376080 0.425463 0.394282
rewarding 0.194561 0.423868 0.435770 0.349758 0.461316 0.481159 1.000000 0.350825 0.457016 0.384245
popular 0.171425 0.389409 0.333667 0.266456 0.387304 0.376080 0.350825 1.000000 0.378262 0.305265
service 0.251098 0.503966 0.407956 0.367615 0.456976 0.425463 0.457016 0.378262 1.000000 0.412313
impact 0.254539 0.354062 0.383639 0.416957 0.412092 0.394282 0.384245 0.305265 0.412313 1.000000
Visualize the Correlation Matrix with a Heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(drivers_corr, annot=True, cmap="coolwarm", fmt=".2f", center=0)
plt.title("Correlation Matrix: Brand Satisfaction and Attributes")
plt.tight_layout()
plt.show()

Correlation Heatmap of Brand Attributes

Variables such as trust, appealing, and service exhibit moderate positive correlation with satisfaction. These drivers could be critical in uncovering distinct consumer segments via clustering, and in predicting satisfaction using regression-style models.
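
Correlations look at one attribute at a time. As a first pass at the variable-importance angle this post set out to explore, the sketch below (assuming the same drivers DataFrame loaded above) fits a linear regression on standardized attributes and reads the coefficients as rough driver weights:

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hedged sketch: standardized regression coefficients as rough "key driver" weights
X_attrs = drivers.drop(columns=["id", "brand", "satisfaction"])
y_sat = drivers["satisfaction"]

lr = LinearRegression().fit(StandardScaler().fit_transform(X_attrs), y_sat)
pd.Series(lr.coef_, index=X_attrs.columns).sort_values(ascending=False).round(3)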

🧩 Unsupervised Learning: K-Means Clustering on Brand Perceptions

To discover hidden customer segments based on brand attribute perceptions, we’ll apply K-Means clustering to the key drivers dataset. This method groups observations into clusters based on similarity, helping marketers understand distinct consumer mindsets.

Scale Features for K-Means Clustering
from sklearn.preprocessing import StandardScaler

# Reload to ensure clean slate
import pandas as pd
drivers = pd.read_csv("data_for_drivers_analysis.csv")

# Use only feature columns (drop identifiers)
X = drivers.drop(columns=["id", "brand", "satisfaction"])

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Use Elbow Method to Select Number of Clusters
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

inertia = []
k_range = range(1, 11)

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)

plt.plot(k_range, inertia, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.grid(True)
plt.tight_layout()
plt.show()

Elbow Plot for K-Means Clustering
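
The elbow is not always sharp. As a complementary check on k, the minimal sketch below (reusing X_scaled from above) compares silhouette scores across a few candidate values:

from sklearn.metrics import silhouette_score

# Hedged sketch: silhouette score for each candidate k (higher is better)
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_scaled)
    print(f"k={k}: silhouette = {silhouette_score(X_scaled, labels):.3f}")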
Fit KMeans with Chosen k (e.g., 3)
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(X_scaled)

# Append cluster labels to original DataFrame
drivers['cluster'] = clusters
drivers[['cluster'] + list(X.columns)].groupby("cluster").mean().round(2)
trust build differs easy appealing rewarding popular service impact
cluster
0 0.90 0.87 0.80 0.94 0.86 0.85 0.83 0.89 0.84
1 0.15 0.12 0.11 0.12 0.09 0.07 0.19 0.08 0.08
2 0.74 0.51 0.18 0.69 0.52 0.56 0.71 0.56 0.17
Use PCA to Visualize Customer Segments
from sklearn.decomposition import PCA
import seaborn as sns

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

drivers['pca1'] = X_pca[:, 0]
drivers['pca2'] = X_pca[:, 1]

plt.figure(figsize=(8, 5))
sns.scatterplot(data=drivers, x="pca1", y="pca2", hue="cluster", palette="Set2", s=70)
plt.title("K-Means Clustering of Brand Perceptions")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.tight_layout()
plt.show()

PCA Visualization of K-Means Clusters

K-Means clustering revealed three distinct customer segments based on how they perceive brand attributes:

  • Cluster 0 – The Enthusiasts
    These respondents rate all brand dimensions—trust, ease, appeal, service, and impact—very highly. They are likely loyal, satisfied users who should be nurtured through loyalty programs or brand ambassador initiatives.

  • Cluster 1 – The Disengaged Skeptics
    Low across the board on all brand attributes. These customers either need substantial re-engagement or may be unresponsive to marketing efforts altogether.

  • Cluster 2 – The Selective Believers
    Moderately positive on select traits (e.g., trust, ease, popularity), but much less impressed with differentiation and impact. Messaging to this group should emphasize what makes the brand unique and valuable.

Together, these insights can guide targeted brand strategies and campaign design for different audience mindsets.
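
How much weight each segment deserves in a campaign plan also depends on its size. A quick count of respondents per cluster, as a minimal sketch using the labels assigned above:

# Hedged sketch: number of respondents in each segment
drivers["cluster"].value_counts().sort_index()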

🧮 Supervised Learning: K-Nearest Neighbors for Penguin Species Classification

In this analysis, we’ll use the Palmer Penguins dataset to train a K-Nearest Neighbors (KNN) classifier that predicts a penguin’s species based on physical characteristics like flipper length, bill depth, and body mass.

Load and Preprocess the Penguin Dataset
import pandas as pd
from sklearn.model_selection import train_test_split

# Load and drop missing values
penguins = pd.read_csv("palmer_penguins.csv").dropna()

# Select features and target
features = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]
X = penguins[features]
y = penguins["species"]

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
Cleaning the Penguin Data

This code loads the Palmer Penguins dataset, removes rows with missing values, and selects relevant features for classification. We split the dataset into training and test sets to evaluate model performance on unseen data, ensuring a fair and unbiased estimate.
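
Because stratify=y was used, the species proportions should be nearly identical in the two splits. A minimal sketch to confirm:

# Hedged sketch: check that the stratified split preserved species proportions
print(y_train.value_counts(normalize=True).round(2))
print(y_test.value_counts(normalize=True).round(2))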

Train and Predict Using K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Initialize model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Predict and evaluate
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Test Accuracy: {accuracy:.2%}")
print(classification_report(y_test, y_pred))
Test Accuracy: 78.00%
              precision    recall  f1-score   support

      Adelie       0.72      0.86      0.78        44
   Chinstrap       0.70      0.35      0.47        20
      Gentoo       0.89      0.92      0.90        36

    accuracy                           0.78       100
   macro avg       0.77      0.71      0.72       100
weighted avg       0.78      0.78      0.76       100
K-Nearest Neighbors

We train a K-Nearest Neighbors classifier with k=5, meaning predictions are based on the five closest neighbors in feature space. We then evaluate performance using accuracy and detailed classification metrics to understand how well the model performs on each species.
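
The choice of k=5 is a common default rather than a tuned value. The minimal sketch below (reusing X_train and y_train from above) compares a few values of k with cross-validation:

from sklearn.model_selection import cross_val_score

# Hedged sketch: 5-fold cross-validated accuracy for several values of k
for k in [3, 5, 7, 9, 11]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    print(f"k={k}: mean CV accuracy = {scores.mean():.2%}")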

Visualize the Confusion Matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, y_pred, labels=knn.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=knn.classes_)

disp.plot(cmap="Blues")
plt.title("Confusion Matrix: KNN on Penguin Species")
plt.tight_layout()
plt.show()

Confusion Matrix for KNN Classification
Don’t be Confused by the Confusion Matrix!

The confusion matrix provides a visual summary of prediction performance by showing how many observations were correctly or incorrectly classified. It highlights which species were most often confused—especially useful for understanding class-level weaknesses.
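
To read those class-level weaknesses directly from the matrix, the minimal sketch below row-normalizes it so each row shows how that species’ observations were distributed across predicted labels (assuming the cm computed above):

# Hedged sketch: each row sums to 1, so the diagonal reads as per-class recall
cm_norm = cm / cm.sum(axis=1, keepdims=True)
pd.DataFrame(cm_norm.round(2), index=knn.classes_, columns=knn.classes_)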

Project 2D Decision Boundaries with PCA
from sklearn.decomposition import PCA
import seaborn as sns

pca = PCA(n_components=2)
X_proj = pca.fit_transform(X)

penguins["pca1"] = X_proj[:, 0]
penguins["pca2"] = X_proj[:, 1]

# Fit KNN on the 2D projection and predict species for the full dataset
knn_all = KNeighborsClassifier(n_neighbors=5)
knn_all.fit(X_proj, y)
labels = knn_all.predict(X_proj)

penguins["predicted_species"] = labels

sns.scatterplot(data=penguins, x="pca1", y="pca2", hue="predicted_species", palette="Set2")
plt.title("KNN Classification with PCA Projection")
plt.tight_layout()
plt.show()

PCA Projection of KNN Classifier
PCA 2D Visualization

We use PCA to reduce the feature space to two dimensions for visualization. This lets us plot the predicted species in 2D and, with a little extra work, the decision regions themselves, making it easier to interpret how the KNN algorithm separates the classes.
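
The scatterplot above shows predicted labels; the minimal sketch below (reusing knn_all, X_proj, and penguins from the previous block) adds the decision regions by evaluating the PCA-space classifier on a mesh grid:

import numpy as np

# Hedged sketch: predict species over a grid in PCA space to shade decision regions
x_min, x_max = X_proj[:, 0].min() - 1, X_proj[:, 0].max() + 1
y_min, y_max = X_proj[:, 1].min() - 1, X_proj[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200), np.linspace(y_min, y_max, 200))

grid_labels = knn_all.predict(np.c_[xx.ravel(), yy.ravel()])
label_to_int = {label: i for i, label in enumerate(knn_all.classes_)}
zz = np.array([label_to_int[l] for l in grid_labels]).reshape(xx.shape)

plt.contourf(xx, yy, zz, levels=np.arange(len(knn_all.classes_) + 1) - 0.5, alpha=0.2, cmap="Set2")
sns.scatterplot(data=penguins, x="pca1", y="pca2", hue="predicted_species", palette="Set2")
plt.title("KNN Decision Regions in PCA Space")
plt.tight_layout()
plt.show()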

Our K-Nearest Neighbors (KNN) model achieved an overall accuracy of 78% in predicting penguin species based on physical measurements like bill length, flipper length, and body mass.

Key insights:

  • Adelie penguins were classified well with high recall (0.86), though a few were confused with Gentoo.
  • Gentoo penguins had the best precision and recall (F1-score: 0.90), showing that their physical traits are most distinct.
  • Chinstrap penguins were the most challenging to classify, with a recall of only 0.35—often mistaken for Adelie.

The confusion matrix confirms this pattern, with most errors occurring between Adelie and Chinstrap species.

Using PCA for visualization, we saw Gentoo penguins form a distinct cluster, while Adelie and Chinstrap overlap more—explaining the classifier’s difficulty.
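
How literally that 2D picture should be read depends on how much of the feature variance the two components retain. A quick check on the pca object fitted above, as a minimal sketch:

# Hedged sketch: variance explained by the two principal components
print(pca.explained_variance_ratio_.round(3))
print(f"Total variance captured: {pca.explained_variance_ratio_.sum():.1%}")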

Overall, this analysis shows that KNN is an effective and interpretable baseline model for biological classification problems, especially when species have clearly separable physical traits.

🧾 Conclusion: What We Learned from the Data

This project explored both unsupervised and supervised machine learning techniques to derive insight from real-world marketing and scientific datasets.

📊 K-Means Clustering (Unsupervised Learning)

Using brand perception data, K-Means revealed three distinct customer segments:

  • Enthusiasts who love the brand across all dimensions
  • Selective Believers who value trust but want more differentiation
  • Disengaged Skeptics with minimal brand attachment

These segments offer clear targets for marketing strategy, product messaging, and resource allocation.

🐧 K-Nearest Neighbors (Supervised Learning)

KNN was applied to predict penguin species based on physical traits. The model:

  • Achieved 78% accuracy on a held-out test set
  • Performed best on Gentoo and Adelie species
  • Struggled with Chinstrap, where features overlap with others

This demonstrates KNN’s utility as an interpretable baseline classifier—especially when decision boundaries are nonlinear but structured.


💡 Final Takeaways

  • Unsupervised learning is ideal for discovering hidden patterns when no outcome labels exist.
  • Supervised learning shines when predictions are needed from labeled data.
  • Choosing the right tool depends on your goal: explore or predict, segment or classify.

In marketing analytics, both approaches offer value—whether it’s understanding your audience before launch, or optimizing messaging once you’ve collected results.
