ML Fundamentals

Core ML Concepts

Before diving into specific algorithms, you need to understand the fundamental concepts that underpin all of machine learning. Every ML problem involves a model that learns a function mapping inputs to outputs from training data, and is then evaluated on unseen test data.

Input Features (X)                     Output / Target (y)
┌─────────────────┐                    ┌──────────────┐
│ x₁, x₂, ... xₙ │ ──▶   f(X)   ──▶  │ ŷ (predicted)│
└─────────────────┘                    └──────────────┘

The goal of ML: learn f(X) from data so that ŷ ≈ y

Linear Regression

Linear regression is the simplest and most foundational ML algorithm. It models the relationship between input features and a continuous output as a linear equation.

The Math (Simplified)

For a single feature, linear regression fits a line:

ŷ = w₁x + b
where:
ŷ = predicted value
x = input feature
w₁ = weight (slope)
b = bias (intercept)

For multiple features, it becomes:

ŷ = w₁x₁ + w₂x₂ + ... + wₙxₙ + b

The training process finds the values of weights w and bias b that minimize the loss function — typically Mean Squared Error (MSE):

MSE = (1/n) * Σ(yᵢ - ŷᵢ)²
where:
yᵢ = actual value for sample i
ŷᵢ = predicted value for sample i
n = number of samples
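The MSE formula above can be checked by hand; a minimal NumPy sketch with invented numbers:

```python
import numpy as np

# Hypothetical actual and predicted values for 4 samples
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 6.0])

# MSE = (1/n) * Σ(yᵢ - ŷᵢ)²
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # (0.25 + 0.0 + 1.0 + 1.0) / 4 = 0.5625
```

Squaring means the two errors of size 1.0 contribute four times as much as the error of size 0.5 contributes relative to its magnitude, which is why MSE punishes large misses so heavily.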
Price ($)
│                          ●
│                      ●  /
│                    ●   /
│                  ●    /  ●
│               ●     /  ●
│             ●     /
│           ●     /●
│         ●     /
│       ●     /
│    ●      /
│         /
└──────────────────────────── Size (sqft)

Linear regression finds the "best fit" line
through the data points.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Example: Predict house prices from size
# X = house sizes (sqft), y = prices ($)
X = np.array([[850], [1200], [1500], [1800], [2200], [2500], [3000]])
y = np.array([150000, 220000, 260000, 310000, 380000, 420000, 510000])

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(f"Weight (slope): {model.coef_[0]:.2f}")
print(f"Bias (intercept): {model.intercept_:.2f}")
print(f"MSE: {mean_squared_error(y_test, y_pred):.2f}")
print(f"R² Score: {r2_score(y_test, y_pred):.4f}")

# Predict price for a 2000 sqft house
new_house = np.array([[2000]])
predicted_price = model.predict(new_house)
print(f"Predicted price for 2000 sqft: ${predicted_price[0]:,.2f}")

Classification

Classification predicts a discrete category (class) rather than a continuous value. The most common types are binary classification (two classes) and multi-class classification (three or more classes).

Logistic Regression

Despite its name, logistic regression is a classification algorithm. It uses the sigmoid function to output a probability between 0 and 1.

Sigmoid Function: σ(z) = 1 / (1 + e^(-z))
Output
1.0 ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ●●●●●
                            ●●
                          ●●
0.5 ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ●●
                      ●●
                    ●●
0.0 ●●●●●●●●●●●●● ─ ─ ─ ─ ─ ─ ─ ─ ─
    ─────────────────────────────── Input (z)

If σ(z) >= 0.5 → Class 1 (Positive)
If σ(z) <  0.5 → Class 0 (Negative)
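The sigmoid and the 0.5 decision rule take only a few lines to implement; a minimal sketch:

```python
import math

def sigmoid(z):
    # σ(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

# Large positive z → near 1, large negative z → near 0
print(sigmoid(0))    # 0.5 exactly
print(sigmoid(4))    # ≈ 0.982
print(sigmoid(-4))   # ≈ 0.018

# Decision rule at the 0.5 threshold
label = 1 if sigmoid(2.0) >= 0.5 else 0
print(label)  # 1
```

Note the symmetry σ(z) + σ(-z) = 1, which is why the 0.5 threshold corresponds exactly to z = 0.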
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load a real dataset
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Train logistic regression
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred,
                            target_names=data.target_names))

# Get probability scores
probabilities = model.predict_proba(X_test)
print(f"Sample probabilities: {probabilities[0]}")
# e.g. [0.03, 0.97] → 97% confident it's class 1

Decision Trees

A decision tree makes predictions by learning a series of if/then rules from data. It splits the data at each node based on the feature and threshold that best separates the classes.

                ┌─────────────────────┐
                │  Income > $50,000?  │
                └──────────┬──────────┘
                    Yes /     \ No
                       /       \
       ┌─────────────────┐   ┌─────────────────┐
       │    Age > 30?    │   │  Credit Score   │
       │                 │   │     > 700?      │
       └────────┬────────┘   └────────┬────────┘
         Yes /     \ No        Yes /     \ No
            /       \             /       \
      ┌───────┐ ┌──────┐   ┌──────┐ ┌──────┐
      │Approve│ │Review│   │Review│ │ Deny │
      └───────┘ └──────┘   └──────┘ └──────┘

Decision trees are intuitive and explainable --
you can trace exactly why a prediction was made.
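The tree above is just nested if/then rules. Written out by hand (the thresholds and outcomes mirror the diagram, not any trained model):

```python
def loan_decision(income, age, credit_score):
    # Hand-coded rules matching the diagram above -- illustrative only
    if income > 50_000:
        return "Approve" if age > 30 else "Review"
    else:
        return "Review" if credit_score > 700 else "Deny"

print(loan_decision(income=80_000, age=35, credit_score=650))  # Approve
print(loan_decision(income=40_000, age=25, credit_score=720))  # Review
print(loan_decision(income=30_000, age=45, credit_score=600))  # Deny
```

Training a decision tree amounts to learning these thresholds and the order of the questions automatically from data.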

Advantages:

  • Highly interpretable — you can explain every prediction
  • Handle both numerical and categorical features
  • No feature scaling required
  • Can capture non-linear relationships

Disadvantages:

  • Prone to overfitting (easily grow too deep)
  • Unstable — small data changes can create very different trees
  • Greedy splitting may miss globally optimal solutions
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Single Decision Tree
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
tree_acc = accuracy_score(y_test, tree.predict(X_test))

# Random Forest (ensemble of trees)
forest = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,
    random_state=42
)
forest.fit(X_train, y_train)
forest_acc = accuracy_score(y_test, forest.predict(X_test))

print(f"Decision Tree Accuracy: {tree_acc:.4f}")
print(f"Random Forest Accuracy: {forest_acc:.4f}")

# Feature importance
for name, importance in zip(
    iris.feature_names, forest.feature_importances_
):
    print(f"  {name}: {importance:.4f}")

Training, Validation, and Test Split

Properly splitting your data is critical to building reliable models. You need to ensure the model generalizes to unseen data, not just memorizes the training examples.

The Three-Way Split

Full Dataset (100%)
┌────────────────────────────┬─────────────┬─────────────┐
│   Training Set (60-70%)    │ Validation  │  Test Set   │
│                            │  (15-20%)   │  (15-20%)   │
│   Used to learn model      │ Tune hyper- │ Final       │
│   parameters (weights)     │ parameters  │ unbiased    │
│                            │             │ evaluation  │
└────────────────────────────┴─────────────┴─────────────┘
Set          Purpose                                    When Used
Training     Learn model parameters (weights, biases)   During training
Validation   Tune hyperparameters, select best model    During development
Test         Final unbiased evaluation                  Once, at the end

Cross-Validation

When data is limited, k-fold cross-validation gives a more robust estimate of model performance by rotating which portion is used for validation.

5-Fold Cross-Validation:
Fold 1: [VAL][Train][Train][Train][Train] → Score₁
Fold 2: [Train][VAL][Train][Train][Train] → Score₂
Fold 3: [Train][Train][VAL][Train][Train] → Score₃
Fold 4: [Train][Train][Train][VAL][Train] → Score₄
Fold 5: [Train][Train][Train][Train][VAL] → Score₅
Final Score = Average(Score₁, Score₂, ..., Score₅)
from sklearn.model_selection import (
    train_test_split,
    cross_val_score,
    KFold
)
from sklearn.ensemble import RandomForestClassifier

# X, y = your feature matrix and labels

# Simple train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train/validation/test split
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.18, random_state=42, stratify=y_temp
)
# Results in ~70% train, 15% val, 15% test

# K-Fold Cross-Validation
model = RandomForestClassifier(n_estimators=100)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
print(f"CV Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
print(f"Per-fold scores: {scores}")

Overfitting and Underfitting

Understanding the balance between underfitting and overfitting is crucial for building models that generalize well.

Underfitting (High Bias):     too simple -- misses the pattern in the data.
Good Fit (Balanced):          just right -- captures the true underlying pattern.
Overfitting (High Variance):  too complex -- memorizes noise in the training data.
Aspect            Underfitting                    Overfitting
Training error    High                            Low (near zero)
Test error        High                            High
Model complexity  Too simple                      Too complex
Cause             Model cannot capture patterns   Model memorizes noise
Fix               More features, a more complex   Regularization, less
                  model, more training            complexity, more data

How to Detect

Plot training and test error against model complexity:

  • Training error falls steadily as the model grows more complex.
  • Test error falls at first, bottoms out at a "sweet spot", then rises again
    as the model starts to overfit.
  • High training and test error → underfitting. Low training error with high
    test error → overfitting. The optimal complexity sits at the test-error
    minimum.

Techniques to Prevent Overfitting

Technique                   How It Works
Regularization (L1/L2)      Adds penalty for large weights to the loss function
Early Stopping              Stop training when validation error starts increasing
Dropout (neural networks)   Randomly disable neurons during training
Cross-Validation            Use multiple train/val splits to assess generalization
More Training Data          More examples make it harder to memorize
Data Augmentation           Create synthetic training examples (rotations, flips)
Pruning (decision trees)    Remove branches that do not improve validation accuracy
Ensemble Methods            Combine multiple models to reduce variance
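Early stopping is simple enough to sketch by hand. Below is a minimal, illustrative gradient-descent loop on synthetic linear data (the dataset, learning rate, and patience values are all invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3x + noise (all values invented for illustration)
X = rng.uniform(-1, 1, size=(200, 1))
y = 3 * X[:, 0] + rng.normal(0, 0.3, size=200)

# Hold out a validation set to monitor
X_tr, y_tr = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

w, b = 0.0, 0.0          # model parameters
lr = 0.1                 # learning rate
patience = 10            # epochs to wait without improvement
best_val, best_params, bad_epochs = np.inf, (w, b), 0

for epoch in range(500):
    # One full-batch gradient-descent step on training MSE
    err = w * X_tr[:, 0] + b - y_tr
    w -= lr * 2 * np.mean(err * X_tr[:, 0])
    b -= lr * 2 * np.mean(err)

    # Early stopping: keep the best validation score seen so far
    val_mse = np.mean((w * X_val[:, 0] + b - y_val) ** 2)
    if val_mse < best_val - 1e-6:
        best_val, best_params, bad_epochs = val_mse, (w, b), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break

w, b = best_params  # restore the best checkpoint
print(f"w ≈ {w:.2f}, b ≈ {b:.2f}, val MSE ≈ {best_val:.3f}")
```

The key design point is that training stops based on *validation* error, never training error, and the best checkpoint is restored at the end.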
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve
import numpy as np

# Assumes X_train, y_train and X, y are already defined

# L2 Regularization (Ridge)
ridge = Ridge(alpha=1.0)  # alpha controls regularization strength
ridge.fit(X_train, y_train)

# L1 Regularization (Lasso) -- also performs feature selection
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
# Lasso can drive weights to exactly zero
print(f"Non-zero features: {np.sum(lasso.coef_ != 0)}")

# Learning curves to diagnose over/underfitting
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=100),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring='accuracy'
)
# If train and val scores diverge → overfitting
# If both scores are low → underfitting
print(f"Train: {train_scores.mean(axis=1)}")
print(f"Val:   {val_scores.mean(axis=1)}")

Bias-Variance Tradeoff

The bias-variance tradeoff is one of the most important concepts in ML. It explains why models err and how to balance complexity.

Total Error = Bias² + Variance + Irreducible Noise

Bias: error from wrong assumptions in the model
  → Underfitting. The model is too simple.

Variance: error from sensitivity to the training data
  → Overfitting. The model changes too much across different training sets.

Irreducible: noise inherent in the data
  → Cannot be reduced by any model.
As model complexity grows, bias² falls while variance rises; their sum, plus
the flat floor of irreducible noise, gives a U-shaped total error curve whose
minimum sits at intermediate complexity.
Model Type                Bias     Variance   Example
High bias, low variance   High     Low        Linear regression on non-linear data
Low bias, high variance   Low      High       Deep decision tree with no pruning
Balanced                  Medium   Medium     Regularized model with cross-validation

Evaluation Metrics

Choosing the right metric depends on your problem type and business requirements.

For Classification

Consider a binary classification problem — predicting whether an email is spam or not spam.

                   Predicted
              Positive    Negative
            ┌──────────┬──────────┐
   Actual   │    TP    │    FN    │
   Positive │  (Hit)   │  (Miss)  │
            ├──────────┼──────────┤
   Actual   │    FP    │    TN    │
   Negative │  (False  │ (Correct │
            │  Alarm)  │Rejection)│
            └──────────┴──────────┘

TP = True Positive:  Correctly identified as spam
FP = False Positive: Non-spam incorrectly flagged as spam
FN = False Negative: Spam that slipped through
TN = True Negative:  Non-spam correctly allowed through

Key Metrics

Metric                Formula                                           When to Use
Accuracy              (TP + TN) / (TP + TN + FP + FN)                   Balanced classes
Precision             TP / (TP + FP)                                    When false positives are costly (spam filter)
Recall (Sensitivity)  TP / (TP + FN)                                    When false negatives are costly (cancer detection)
F1 Score              2 * (Precision * Recall) / (Precision + Recall)   When you need balance between precision and recall
AUC-ROC               Area under the ROC curve                          Overall model quality across all thresholds
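These formulas are simple enough to compute by hand. A quick sketch using hypothetical confusion-matrix counts for a spam filter (the counts are made up):

```python
# Hypothetical counts from a spam classifier's confusion matrix
TP, FP, FN, TN = 80, 10, 20, 890

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f}  precision={precision:.3f}  "
      f"recall={recall:.3f}  f1={f1:.3f}")
# accuracy=0.970  precision=0.889  recall=0.800  f1=0.842
```

Note how accuracy looks excellent (97%) even though the filter misses 1 in 5 spam messages -- the large TN count dominates, which is exactly why accuracy misleads on imbalanced data.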

Precision vs Recall Tradeoff

Precision
1.0 │ ●
    │  ●
    │   ●
    │    ●●
    │      ●●
0.5 │        ●●●
    │           ●●●
    │              ●●●●
    │                  ●●●●●
0.0 │                       ●●●●●●
    └────────────────────────────── Recall
     0.0                        1.0
As you lower the classification threshold:
- Recall increases (catch more positives)
- Precision decreases (more false alarms)
The "right" threshold depends on business needs.
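The tradeoff is easy to see by sweeping the threshold over a toy set of scores (the labels and scores below are invented for illustration):

```python
# Toy labels and model scores (made up) to show the threshold tradeoff
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.8, 0.35, 0.2, 0.7, 0.6, 0.15, 0.55, 0.05]

def precision_recall(threshold):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, y_true))
    fp = sum(p == 1 and t == 0 for p, t in zip(preds, y_true))
    fn = sum(p == 0 and t == 1 for p, t in zip(preds, y_true))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for th in (0.7, 0.5, 0.3):
    p, r = precision_recall(th)
    print(f"threshold={th}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, lowering the threshold from 0.7 to 0.3 lifts recall from 0.6 to 1.0 while precision drops from 1.0 to about 0.71 -- the curve above in miniature.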

When to Use Which Metric

Scenario            Prioritize    Why
Spam filter         Precision     Better to let some spam through than block real emails
Cancer screening    Recall        Must not miss any potential cancer cases
Fraud detection     F1 / AUC      Need balance -- catch fraud without blocking legitimate transactions
Search engine       Precision@K   Top results must be relevant
Balanced dataset    Accuracy      Works well when classes are evenly distributed
Imbalanced dataset  F1 / AUC      Accuracy is misleading with a 99/1 class split
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    confusion_matrix,
    classification_report
)

# Assume y_test and y_pred from a trained model
y_test = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# All metrics at once
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print(f"\nAccuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score:  {f1_score(y_test, y_pred):.4f}")

# Full classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred,
                            target_names=['Not Spam', 'Spam']))

# AUC requires probability scores
y_proba = [0.9, 0.1, 0.8, 0.3, 0.2, 0.85, 0.6, 0.15, 0.95, 0.05]
print(f"AUC-ROC: {roc_auc_score(y_test, y_proba):.4f}")

For Regression

Metric                      Formula                      Characteristics
MAE (Mean Absolute Error)   (1/n) * Σ|yᵢ - ŷᵢ|           Robust to outliers, same unit as target
MSE (Mean Squared Error)    (1/n) * Σ(yᵢ - ŷᵢ)²          Penalizes large errors more heavily
RMSE (Root MSE)             √MSE                         Same unit as target, penalizes large errors
R² Score                    1 - (SS_res / SS_tot)        Proportion of variance explained; at most 1, negative for very poor fits
MAPE                        (1/n) * Σ|yᵢ - ŷᵢ| / |yᵢ|    Percentage error, interpretable
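Each of these metrics is a few lines of NumPy. A sketch with invented price data (in $1000s):

```python
import numpy as np

# Made-up actual vs predicted house prices (in $1000s)
y_true = np.array([200.0, 300.0, 250.0, 400.0])
y_pred = np.array([210.0, 290.0, 270.0, 380.0])

mae = np.mean(np.abs(y_true - y_pred))              # same unit as target
mse = np.mean((y_true - y_pred) ** 2)               # squared units
rmse = np.sqrt(mse)                                 # back to target units

ss_res = np.sum((y_true - y_pred) ** 2)             # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)      # total sum of squares
r2 = 1 - ss_res / ss_tot

mape = np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

print(f"MAE={mae:.1f}  MSE={mse:.1f}  RMSE={rmse:.2f}  "
      f"R²={r2:.4f}  MAPE={mape:.4f}")
```

Here MAE is 15 but RMSE is about 15.81: the gap between them grows with the spread of the errors, so comparing the two is a quick check for a few large misses.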

Putting It All Together: A Complete ML Workflow

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.pipeline import Pipeline

# 1. Load and explore data
df = pd.read_csv('customer_churn.csv')
print(df.describe())
print(f"Class distribution:\n{df['churn'].value_counts()}")

# 2. Feature engineering
X = df.drop('churn', axis=1)
y = df['churn']

# 3. Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 4. Build pipeline (scaling + model)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])

# 5. Hyperparameter tuning with cross-validation
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [3, 5, 10, None],
    'classifier__min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(
    pipeline, param_grid,
    cv=5, scoring='f1', n_jobs=-1
)
grid_search.fit(X_train, y_train)

# 6. Evaluate best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]

print(f"\nBest params: {grid_search.best_params_}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print(f"AUC-ROC: {roc_auc_score(y_test, y_proba):.4f}")

Summary

Concept                Key Takeaway
Linear Regression      Simplest model for continuous predictions -- fit a line
Logistic Regression    Classification using the sigmoid function for probabilities
Decision Trees         Interpretable but prone to overfitting without ensembles
Train/Val/Test Split   Always hold out unseen data for unbiased evaluation
Cross-Validation       More robust evaluation, especially with limited data
Overfitting            Model memorizes training data instead of learning patterns
Bias-Variance          Balance model simplicity (bias) vs flexibility (variance)
Evaluation Metrics     Choose based on business needs, not just accuracy