Spend Classification Engine
Millions of transactions arrive each year with inconsistent free-text descriptions and missing category codes. Manual tagging is slow and subjective, and it breaks down at scale, leaving the spend taxonomy too noisy for reliable sourcing analytics.
Trained a Random Forest classifier on labeled historical spend, combining TF-IDF features from supplier names and line-item text with categorical signals (GL account, cost center, supplier). Class imbalance handled with balanced class weights; performance validated with stratified k-fold cross-validation and a held-out test set.
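A minimal sketch of how such a pipeline could be wired together in scikit-learn, not the production code. The toy DataFrame, its column names (`supplier`, `line_text`, `gl_account`, `cost_center`, `category`), the macro-F1 metric, the 3-fold setting, and the 25% test split are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for labeled historical spend; every name and value is hypothetical.
templates = {
    "IT Hardware": ("Dell Inc", "laptop docking station", "6100", "IT01"),
    "Travel": ("Acme Travel", "flight booking economy", "6400", "SALES02"),
    "Office Supplies": ("Staples", "printer paper a4", "6200", "ADM03"),
}
df = pd.DataFrame([
    {"supplier": s, "line_text": f"{text} order {i}", "gl_account": gl,
     "cost_center": cc, "category": cat}
    for cat, (s, text, gl, cc) in templates.items()
    for i in range(12)
])

# TF-IDF on the two text fields, one-hot encoding on the categorical signals.
features = ColumnTransformer([
    ("supplier_text", TfidfVectorizer(), "supplier"),
    ("line_text", TfidfVectorizer(), "line_text"),
    ("categorical", OneHotEncoder(handle_unknown="ignore"),
     ["gl_account", "cost_center", "supplier"]),
])
model = Pipeline([
    ("features", features),
    ("clf", RandomForestClassifier(
        n_estimators=400, class_weight="balanced", random_state=0)),
])

X, y = df.drop(columns="category"), df["category"]
# Held-out test set, stratified so every category appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Stratified k-fold cross-validation on the training portion only.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="f1_macro")

model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(cv_scores.mean(), test_accuracy)
```

Keeping the vectorizers inside the `Pipeline` means they are refit on each training fold, so no vocabulary or IDF statistics leak from validation rows into training.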
High-confidence predictions are auto-classified straight into the taxonomy, covering the majority of spend, while low-confidence cases are routed for human review. This turns weeks of manual tagging into a near-real-time, analytics-ready feed.
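The routing step can be sketched as a simple threshold on the top class probability. The helper name `route_by_confidence`, the 0.8 cutoff, and the sample probabilities are hypothetical; a real threshold would be tuned against review capacity and the cost of a misclassification.

```python
import numpy as np

def route_by_confidence(proba, classes, threshold=0.8):
    """Return predicted labels, their confidence, and an auto-classify mask.

    proba: (n_rows, n_classes) array, e.g. the output of predict_proba.
    classes: class labels in the same column order as proba.
    threshold: hypothetical cutoff; rows below it go to human review.
    """
    proba = np.asarray(proba)
    conf = proba.max(axis=1)                           # top class probability per row
    labels = np.asarray(classes)[proba.argmax(axis=1)]
    auto = conf >= threshold                           # True -> auto-classify
    return labels, conf, auto

# Made-up probabilities for three transactions.
proba = [[0.92, 0.05, 0.03],
         [0.40, 0.35, 0.25],
         [0.10, 0.85, 0.05]]
classes = ["IT Hardware", "Office Supplies", "Travel"]

labels, conf, auto = route_by_confidence(proba, classes)
review_rows = np.flatnonzero(~auto)  # indices to send to the review queue
```

With these sample values, the first and third rows clear the cutoff and only the second row lands in the review queue.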
Tech stack
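from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
# df: labeled historical spend; the "category" label column name is assumed
txt_train, txt_test, y_train, y_test = train_test_split(
    df["line_text"], df["category"], stratify=df["category"], random_state=0)
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(txt_train)  # fit vocabulary on training text only
X_test = tfidf.transform(txt_test)
clf = RandomForestClassifier(
    n_estimators=400, class_weight="balanced")
clf.fit(X_train, y_train)
# route low-confidence rows to review
conf = clf.predict_proba(X_test).max(axis=1)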
Consistent, scalable spend taxonomy powering sourcing & savings analytics.