Info

  • Explained Variance tells you how much information each component retains from the original data.
  • Components’ Importance (Loadings) tells you which features contribute the most to each principal component, helping you interpret the transformed feature space.

Data Requirement

  • No null values

    • Impute them before running PCA. Because the data may be skewed, imputing the “median” is recommended.
  • Numeric
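A minimal sketch of the median imputation step, assuming scikit-learn is available. The toy array and the choice of two components are made up for illustration; only the median strategy comes from the note above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer

# Toy numeric data with missing values (illustrative only).
X = np.array([
    [1.0, 10.0, 100.0],
    [2.0, np.nan, 110.0],
    [3.0, 30.0, np.nan],
    [4.0, 40.0, 400.0],
    [np.nan, 50.0, 130.0],
])

# Replace each NaN with the median of its column.
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X)

# PCA can be fitted now that no NaN values remain.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_imputed)
```

The third column contains an outlier (400), which is the kind of skew where the median is less distorted than the mean.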

Explained Variance vs Components’ Importance

They are related but different.

Aspect | Explained Variance | Components’ Importance (Loadings)
Definition | The proportion of total data variance explained by each PC | The contribution (weight) of each original feature to each PC
What it shows | How much information each PC captures from the original data | Which original features are important for a given PC
Values | Each value is a ratio between 0 and 1 (percent variance explained) | Each value is a weight (positive or negative) representing the influence of a feature on a PC
Purpose | Helps decide how many components to keep | Helps interpret which features influence the PCs the most
Interpretation | Variance tells you how much of the original data’s variability is captured by a principal component | Loadings tell you which features are important for a principal component
Accessing | pca.explained_variance_ratio_ | pca.components_
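The two attributes in the Accessing row can be read from a fitted scikit-learn PCA object, roughly as sketched here. The synthetic data, the feature names f1 to f4, and the 90% threshold are assumptions for illustration. Note that pca.components_ holds unit-length direction vectors; some texts reserve the word "loadings" for these vectors scaled by the square root of each component's variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 4 numeric features; f1 and f2 are correlated.
base = rng.normal(size=200)
X = np.column_stack([
    base + 0.1 * rng.normal(size=200),
    base + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
    0.5 * rng.normal(size=200),
])
feature_names = ["f1", "f2", "f3", "f4"]

pca = PCA().fit(X)  # keep all components

# One ratio per component; each lies in [0, 1] and they sum to 1 here.
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative ratio reaches 90%
# (the threshold is an assumption, not a rule).
n_keep = int(np.searchsorted(cumulative, 0.90) + 1)

# components_ has shape (n_components, n_features): row i holds the weights of PC i.
weights = pca.components_
for i, row in enumerate(weights):
    top = feature_names[int(np.argmax(np.abs(row)))]
    print(f"PC{i + 1}: ratio={ratios[i]:.3f}, largest |weight| on {top}")
print("components to keep for 90% variance:", n_keep)
```

The ratios answer "how many components to keep", while each row of weights answers "which features drive this component".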

Info

Mismatched Ranking: Summary

  • Low Importance, High Explained Variance: The feature has a small loading on a component that explains a lot of variance. It might still be relevant, but it works in tandem with other features to capture broader patterns in the data. This can indicate redundancy or co-dependence with other features.
  • High Importance, Low Explained Variance: The feature has a large loading on a component that explains little variance. It is highly specialized and could be significant for specific insights, such as particular subgroups or nuanced patterns, even though that component accounts for little of the overall data variance.

Low feature importance but high explained variance?

Possible Explanations:

  • The feature may be correlated with other highly important features, meaning the overall variance captured by the component is high, but the feature itself is not the primary driver.
  • The principal component may be driven by a combination of multiple features, and this particular feature doesn’t dominate but is still part of a highly informative component.
  • The feature’s contribution might be spread across multiple components, so while it doesn’t strongly influence any one component, it still indirectly impacts the variance captured by the entire PCA model.

Interpretation:

  • The feature may still carry important information about the overall variance, but it’s not individually strong enough to stand out. It could be contributing subtly alongside other features, or its influence might be diluted across several components.
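One way to check whether a feature's influence is diluted across several components is to look at its squared weights in every PC, as in this sketch. The synthetic data and the variable names (spread, variance_share) are hypothetical; weighting squared weights by the variance ratio is one possible summary, not a standard "feature importance" defined by scikit-learn.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Hypothetical data: 300 samples, 4 numeric features mixed from 4 latent signals.
latent = rng.normal(size=(300, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
mixing = rng.normal(size=(4, 4))
X = latent @ mixing

pca = PCA().fit(X)  # keep all components
ratios = pca.explained_variance_ratio_
squared = pca.components_ ** 2  # shape (n_components, n_features)

# With all components kept, each feature's squared weights sum to 1 across PCs,
# so a column shows how that feature's weight is split between components.
spread = squared.sum(axis=0)

# Weight each squared entry by its component's variance ratio and sum over PCs.
# The result equals the feature's own share of the total variance.
variance_share = ratios @ squared

for j in range(X.shape[1]):
    print(f"feature {j}: squared weights per PC = {np.round(squared[:, j], 2)}, "
          f"variance share = {variance_share[j]:.3f}")
```

A feature with no single large entry in its column but a sizeable variance share matches the "diluted across several components" case described above.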

High feature importance but low explained variance?

Possible Explanations:

  • The feature may capture a specific pattern that is significant for that principal component but doesn’t account for much of the variability in the entire dataset. In other words, the feature may be specialized in capturing minor variations or trends in the data.
  • The feature may be relevant for explaining a niche aspect of the data that only shows up in later components, which inherently capture less variance than the first few components.
  • The feature could be important for a specific subspace of the data (e.g., capturing variation in a small cluster or subset of data points), but not for the broader structure of the entire dataset.

Interpretation:

  • While the feature is important for a specific component, it doesn’t necessarily impact the overall data structure in a meaningful way. This feature may be relevant in certain contexts or subsets of the data but may not be globally informative.
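The "high importance, low explained variance" case can be reproduced on synthetic data, as a rough sketch. The three features, their scales, and the name niche_pc are assumptions chosen so that one small, independent feature dominates a late component.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
n = 500
# Two features share a strong common signal; the third is small and independent.
shared = rng.normal(scale=5.0, size=n)
X = np.column_stack([
    shared + rng.normal(scale=1.0, size=n),
    shared + rng.normal(scale=1.0, size=n),
    rng.normal(scale=0.3, size=n),  # hypothetical "niche" feature
])

pca = PCA().fit(X)
ratios = pca.explained_variance_ratio_

# Find the component on which the niche feature (column 2) has its largest |weight|.
niche_pc = int(np.argmax(np.abs(pca.components_[:, 2])))
niche_weight = abs(pca.components_[niche_pc, 2])

print(f"niche feature dominates PC{niche_pc + 1} with |weight| = {niche_weight:.2f}")
print(f"that component explains {ratios[niche_pc]:.2%} of the variance")
```

Because the data are not standardized here, the niche feature's small scale is what pushes it into a late component; standardizing first would change the picture.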

Example Scenario

Low Feature Importance but High Explained Variance:

  • Imagine a component that captures a large amount of variance in patient vitals data, and you have a feature like “creatinine level” that is ranked low in importance. It could be that other vitals (e.g., blood pressure, heart rate) dominate the component, while “creatinine level” is part of the broader physiological changes captured by the component but doesn’t stand out individually.

High Feature Importance but Low Explained Variance:

  • A feature like “heart rate variability” might be important for a component that captures subtle but significant trends in the dataset (such as stress response), but the overall variance explained by that component could be small, indicating that this feature only affects a specific aspect of the data.