Maximum variance is not maximum information
PCA picks directions with the most spread. But if the class boundary runs parallel to the high-variance direction, the separating signal lies in a low-variance direction, and PCA projects it away. You lose the signal that matters for your task.
Variance is not the same as usefulness. If your goal is classification, the most variable direction may carry noise rather than signal. nomoselect instead finds the subspace that captures the most task-relevant structure, using exact observer geometry to measure what matters and what gets lost.
nomoselect treats dimensionality reduction as an observer design problem. Given labelled data and a task (classification, minority detection, equal-weight discrimination), it finds the observer that captures the most task-relevant information, with exact diagnostics on what is kept and what is hidden.
from nomoselect import GeometricSubspaceSelector
# Fits like sklearn
sel = GeometricSubspaceSelector(n_components=2, task="fisher")
sel.fit(X_train, y_train)
# Transforms like sklearn
X_reduced = sel.transform(X_test)
# But also reports what PCA cannot
report = sel.report()
print(report.summary())
# Shows: visible fraction, advantage over PCA,
# hidden load, regularisation audit, conservation check
Task observer captures 32% more class-relevant information than PCA at 2 components.
PCA captures near-zero task info. The task observer captures all of it. Maximum possible advantage.
Consistent gain even on well-separated data where PCA already does reasonably well.
64 features, 10 classes. The observer finds the 5-dimensional subspace that captures the most label structure.
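The benchmark harness itself is not shown here. As a rough, library-free stand-in on the same kind of 64-feature, 10-class data (scikit-learn's digits set), one can compare a variance-driven projection (PCA) with a label-aware one; LDA is used below purely as an illustration of a task-aware baseline, not as nomoselect's method:

```python
# Stand-in comparison (not nomoselect itself): variance-driven PCA vs a
# label-aware projection (LDA) on the 64-feature, 10-class digits data.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)  # 64 features, 10 classes
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

for name, reducer in [("PCA", PCA(n_components=5)),
                      ("LDA", LinearDiscriminantAnalysis(n_components=5))]:
    Z_tr = reducer.fit(X_tr, y_tr).transform(X_tr)  # PCA ignores y; LDA uses it
    Z_te = reducer.transform(X_te)
    acc = KNeighborsClassifier().fit(Z_tr, y_tr).score(Z_te, y_te)
    print(f"{name}: downstream accuracy in 5 dims = {acc:.3f}")
```

The label-aware projection typically retains more class structure at the same width, which is the effect the caption above describes.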
pip install nomoselect
Requires: nomogeo >= 0.4.0, numpy, scikit-learn >= 1.2.
cd nomoselect && python -m pytest tests/ -q
Covers selection, reporting, auditing, misuse detection, and all four task types.
The biggest gains appear when variance and task structure point in different directions. On well-separated data where PCA already aligns with the class boundary, both methods agree. The diagnostic report always tells you the exact advantage.
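That failure mode can be reproduced in a few lines with plain numpy and scikit-learn (again with LDA standing in for a generic label-aware method; nomoselect is not needed for the illustration). The labels live entirely on a low-variance axis, so PCA's first component ignores them:

```python
# Failure-mode demo: the class signal lies in a low-variance direction,
# so PCA's top component points almost entirely at label-free noise.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)
X = np.column_stack([
    rng.normal(scale=10.0, size=n),            # axis 0: high variance, no label info
    (2 * y - 1) + rng.normal(scale=0.5, size=n)  # axis 1: low variance, carries the labels
])

pc1 = PCA(n_components=1).fit(X).components_[0]  # aligns with the noisy axis
w = LinearDiscriminantAnalysis().fit(X, y).coef_[0]
w = w / np.linalg.norm(w)                        # label-aware direction

print("PCA weight on the signal axis:", abs(pc1[1]))  # near 0
print("LDA weight on the signal axis:", abs(w[1]))    # near 1
```

Projecting onto the first principal component here discards essentially all class information, while the label-aware direction keeps essentially all of it.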
When the number of features greatly exceeds sample size (e.g., genomic data with 7000+ features), pass the data through PCA first to reduce to a manageable size, then apply nomoselect. Direct application to very high dimensions is slow.
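A minimal sketch of that two-stage pipeline, on random data of genomic scale; only the PCA prefilter uses stock scikit-learn, and the selector call is left as a comment since it follows the API shown in the example above:

```python
# Two-stage pipeline for very wide data: PCA prefilter down to a
# manageable width, then the task-aware selector on the reduced data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 7000))   # n_samples << n_features, genomic-scale width
y = rng.integers(0, 2, size=120)

prefilter = PCA(n_components=100).fit(X)  # 7000 -> 100 features
X_small = prefilter.transform(X)
print(X_small.shape)  # (120, 100)

# Then apply the selector to the prefiltered data, as in the example above:
# sel = GeometricSubspaceSelector(n_components=2, task="fisher")
# sel.fit(X_small, y)
```

Note that with 120 samples, PCA can keep at most 120 components, so 100 is near the ceiling; any width well below the sample count works as a prefilter target.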