8 Machine Learning Approaches to Segmentation
Image segmentation is a fundamental task in computer vision that involves partitioning an image into meaningful regions or objects. While traditional methods rely on thresholding and edge detection, machine learning approaches offer powerful alternatives that can learn complex patterns from data. In this chapter, we explore two complementary approaches: K-means clustering for unsupervised segmentation and Random Forests for supervised pixel classification.
8.1 K-Means Clustering for Image Segmentation
K-means is an unsupervised learning algorithm that partitions data into K distinct clusters. In the context of image segmentation, we can think of each pixel as a data point in a feature space, and K-means groups similar pixels together based on their features.
8.1.1 The K-Means Algorithm
The algorithm works through an iterative process:
1. Initialization: choose K cluster centers (centroids), either randomly or using a smarter initialization strategy
2. Assignment step: assign each pixel to the nearest centroid based on distance
3. Update step: recalculate each centroid as the mean of all pixels assigned to its cluster
4. Repeat steps 2-3 until convergence (centroids stop moving significantly)
The key advantage of K-means is its simplicity and speed, making it suitable for large images. However, it requires specifying K in advance and assumes spherical clusters of similar size.
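The iterative loop above can be written directly in NumPy. The following is a minimal sketch for illustration, not a production implementation; the two-blob test data and the random initialization strategy are invented here:

```python
import numpy as np

def kmeans_simple(points, k, n_iters=50, seed=0):
    """Minimal K-means: points is (N, D); returns labels and centroids."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct random points
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: nearest centroid by squared Euclidean distance
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: mean of the points assigned to each cluster
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs should be recovered as two clusters
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(10, 0.1, (50, 2))])
labels, cents = kmeans_simple(pts, k=2)
```

A library implementation such as scikit-learn's `KMeans` adds smarter initialization and multiple restarts, but the assignment/update loop is the same.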
8.1.2 Feature Spaces for Pixel Clustering
When applying K-means to images, we have several choices for the feature space:
Color-only features: Each pixel is represented by its RGB (or other color space) values. This is the simplest approach and works well when regions have distinct colors.
Spatial features: Including the x, y coordinates along with color values helps create spatially coherent segments. The relative weighting of spatial versus color features affects the smoothness of segmentation.
Texture features: For images with textured regions, we can include local statistics like gradient magnitude, variance in local neighborhoods, or filter responses.
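Combining color and spatial features is mostly a matter of stacking columns. The sketch below builds such a feature matrix; the `spatial_weight` value of 0.5 is an arbitrary example, not a recommendation:

```python
import numpy as np

def pixel_features(image, spatial_weight=0.5):
    """Stack RGB values with scaled (y, x) coordinates per pixel.

    spatial_weight balances spatial versus color terms: larger values
    push K-means toward smoother, more contiguous segments.
    """
    h, w, c = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Normalize coordinates to [0, 1] so they are comparable to color values
    coords = np.stack([ys / max(h - 1, 1), xs / max(w - 1, 1)], axis=-1)
    feats = np.concatenate([image.astype(float), spatial_weight * coords], axis=-1)
    return feats.reshape(-1, c + 2)

img = np.random.rand(4, 6, 3)   # toy 4x6 RGB image in [0, 1]
F = pixel_features(img)          # shape (24, 5): R, G, B, y, x
```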
8.1.3 Representing Images as Feature Matrices
To apply K-means, we need to reshape our image tensor. Consider an RGB image of shape \((H, W, 3)\) where \(H\) is height, \(W\) is width, and 3 represents the RGB channels.
We reshape this to a 2D matrix of shape \((H \times W, 3)\) where each row represents one pixel. After clustering, we reshape the cluster labels back to \((H, W)\) to create the segmentation map.
# Reshape image to (num_pixels, num_features)
h, w, c = image.shape
pixels = image.reshape(-1, c)
# Apply K-means (here via scikit-learn)
from sklearn.cluster import KMeans
labels = KMeans(n_clusters=5, n_init=10).fit_predict(pixels)
# Reshape labels back to image dimensions
segmented = labels.reshape(h, w)
8.1.4 Choosing the Number of Clusters
A common challenge is selecting an appropriate value for K. Several heuristics can help:
Elbow method: Plot the within-cluster sum of squares (WCSS) versus K. The “elbow” point where the rate of decrease sharply changes suggests a good K value.
Silhouette analysis: Measures how similar each point is to its own cluster compared to other clusters. Higher average silhouette scores indicate better-defined clusters.
Domain knowledge: Sometimes the number of segments is known from the application context (e.g., separating foreground from background suggests K=2).
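The elbow method can be sketched with scikit-learn, whose `inertia_` attribute is exactly the WCSS. The three-color synthetic "pixels" below are invented for illustration; with three well-separated clusters, the curve drops steeply up to K=3 and flattens afterward:

```python
import numpy as np
from sklearn.cluster import KMeans

def wcss_curve(pixels, k_values):
    """Within-cluster sum of squares (KMeans.inertia_) for each K."""
    return [
        KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels).inertia_
        for k in k_values
    ]

# Synthetic pixels: three distinct colors plus a little noise
rng = np.random.default_rng(0)
pixels = np.vstack([
    rng.normal(loc=c, scale=0.02, size=(100, 3))
    for c in ([0.9, 0.1, 0.1], [0.1, 0.9, 0.1], [0.1, 0.1, 0.9])
])
wcss = wcss_curve(pixels, range(1, 7))
# Plotting wcss against K would show the elbow at K=3
```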
8.1.5 Limitations of K-Means for Segmentation
While K-means is fast and intuitive, it has several limitations:
- It assumes clusters are spherical and equally sized
- It’s sensitive to initialization and may converge to local optima
- It treats all features equally (unless we manually scale them)
- It doesn’t capture spatial relationships directly (unless we explicitly add spatial features)
- It performs hard assignment (each pixel belongs to exactly one cluster)
Despite these limitations, K-means remains a valuable tool for quick exploratory segmentation and as a preprocessing step for more complex algorithms.
8.2 Random Forests for Supervised Segmentation
While K-means is unsupervised, Random Forests offer a supervised approach where we train a classifier on labeled examples. This is particularly powerful when we have ground truth segmentation data or can easily label a subset of pixels.
8.2.1 Decision Trees as Building Blocks
A decision tree classifies data by learning a series of binary decisions. For image segmentation, each decision might be: “Is the red channel value greater than 128?” or “Is the gradient magnitude in the local neighborhood less than 15?”
The tree is built by recursively splitting the data to maximize the purity of resulting groups. Common splitting criteria include Gini impurity or information gain (entropy).
However, individual decision trees tend to overfit, learning overly specific patterns from the training data that don’t generalize well.
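A single tree of this kind can be sketched with scikit-learn. The toy task below is invented for illustration: "sky" pixels are defined purely by a high blue-channel value, so a shallow tree can recover the threshold rule:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
# Toy pixels: class 1 ("sky") is exactly those with blue channel > 0.6
X = rng.uniform(size=(400, 3))      # RGB values in [0, 1]
y = (X[:, 2] > 0.6).astype(int)     # hypothetical labeling rule

# A depth-2 tree is enough to learn a single-channel threshold
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
train_acc = tree.score(X, y)
```

Because the labels here are a deterministic function of one feature, even a shallow tree separates them perfectly; real pixel data is noisier, which is where deep trees start to overfit.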
8.2.2 Ensemble Learning with Random Forests
Random Forests address overfitting by training many decision trees and combining their predictions. The “forest” is random in two ways:
Bootstrap aggregating (bagging): Each tree is trained on a random subset of the training data, sampled with replacement.
Feature randomization: At each split, only a random subset of features is considered. This decorrelates the trees, making the ensemble more robust.
For classification, the final prediction is typically the majority vote across all trees. This ensemble approach significantly improves generalization compared to single trees.
8.2.3 Features for Pixel Classification
The choice of features is crucial for Random Forest segmentation. Unlike K-means, which typically uses raw pixel values, Random Forests benefit from richer feature sets:
Color features: RGB values, HSV components, or other color space representations
Texture features: Local variance, entropy, or filter responses in a neighborhood around each pixel
Gradient features: Magnitude and direction of color gradients
Multi-scale features: Features computed at different spatial scales to capture both fine details and broader patterns
Context features: Statistics from larger neighborhoods to incorporate spatial context
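A few of these features can be computed with standard image filters. The sketch below stacks RGB values, gradient magnitude, and local variance in a 5×5 window; the particular feature set and window size are just one reasonable choice, not a prescription:

```python
import numpy as np
from scipy.ndimage import uniform_filter, sobel

def rf_pixel_features(image, window=5):
    """Per-pixel feature stack: RGB, gradient magnitude, local variance."""
    gray = image.mean(axis=-1)
    # Gradient magnitude from horizontal and vertical Sobel responses
    gx, gy = sobel(gray, axis=1), sobel(gray, axis=0)
    grad_mag = np.hypot(gx, gy)
    # Local variance via E[x^2] - E[x]^2 over the window
    mean = uniform_filter(gray, size=window)
    mean_sq = uniform_filter(gray ** 2, size=window)
    local_var = np.clip(mean_sq - mean ** 2, 0, None)
    feats = np.dstack([image, grad_mag, local_var])
    return feats.reshape(-1, feats.shape[-1])

img = np.random.rand(8, 8, 3)
F = rf_pixel_features(img)   # shape (64, 5)
```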
8.2.4 Training a Random Forest Classifier
The training process requires labeled data. In image segmentation, this typically means:
- Prepare training data: Select representative pixels from different regions and assign class labels (e.g., “sky”, “grass”, “building”)
- Extract features: For each labeled pixel, compute the feature vector
- Train the forest: Use the features and labels to build the ensemble of trees
- Validate: Test on held-out pixels to assess generalization
The trained model can then predict labels for all pixels in new images by extracting their features and running them through the forest.
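The four steps above can be sketched end to end. The synthetic two-region image, the 40 "hand-labeled" pixels, and the use of raw RGB as features are all invented for illustration; a real pipeline would use the richer features of Section 8.2.3:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Synthetic image: left half dark (class 0), right half bright (class 1)
img = np.concatenate([
    rng.normal(0.2, 0.05, size=(16, 8, 3)),
    rng.normal(0.8, 0.05, size=(16, 8, 3)),
], axis=1)
h, w, c = img.shape
pixels = img.reshape(-1, c)
truth = np.tile(np.r_[np.zeros(8), np.ones(8)], h).astype(int)

# 1) Prepare training data: "label" a small subset, as an annotator would
idx = rng.choice(h * w, size=40, replace=False)
# 2) Extract features: here simply the RGB values of each labeled pixel
# 3) Train the forest on the labeled subset
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(pixels[idx], truth[idx])
# 4) Validate: predict every pixel and compare against the ground truth
segmented = forest.predict(pixels).reshape(h, w)
accuracy = (segmented.reshape(-1) == truth).mean()
```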
8.2.5 Advantages of Random Forests for Segmentation
Random Forests offer several benefits:
- They handle high-dimensional feature spaces well
- They provide feature importance scores, revealing which features are most discriminative
- They’re relatively robust to overfitting due to ensemble averaging
- They can model complex, non-linear decision boundaries
- They naturally handle multi-class problems
- Training can be parallelized across trees
8.2.6 Practical Considerations
Imbalanced classes: If one region type dominates the image (e.g., 90% sky), the classifier may become biased. Techniques like class weighting or stratified sampling can help.
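In scikit-learn, class weighting is a single parameter. The imbalanced toy data below (95% majority, 5% minority, drawn from two overlapping Gaussians) is invented to illustrate the mechanism; `class_weight="balanced"` reweights samples inversely to class frequency so the minority class still influences the splits:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Imbalanced toy data: 950 majority samples, 50 minority samples
X = np.vstack([rng.normal(0.0, 1.0, (950, 3)), rng.normal(1.5, 1.0, (50, 3))])
y = np.r_[np.zeros(950), np.ones(50)].astype(int)

weighted = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
).fit(X, y)

# Evaluate minority-class recall on held-out data from the same distributions
X_test = np.vstack([rng.normal(0.0, 1.0, (190, 3)), rng.normal(1.5, 1.0, (10, 3))])
y_test = np.r_[np.zeros(190), np.ones(10)].astype(int)
minority_recall = (weighted.predict(X_test[y_test == 1]) == 1).mean()
```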
Computational cost: While prediction is fast, feature extraction for every pixel can be expensive, especially for complex features or large images.
Spatial coherence: Random Forests classify each pixel independently, which can lead to noisy segmentations. Post-processing with spatial smoothing or conditional random fields can improve results.
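The simplest such post-processing is a median filter on the label map, which acts as a local majority vote. The noisy square label map below is synthetic; note the trade-off mentioned above: isolated errors disappear, but sharp corners of regions can be rounded off:

```python
import numpy as np
from scipy.ndimage import median_filter

# A clean label map (solid foreground square) with salt-and-pepper errors
labels = np.zeros((20, 20), dtype=int)
labels[5:15, 5:15] = 1
rng = np.random.default_rng(0)
flip = rng.random(labels.shape) < 0.05      # corrupt ~5% of pixels
noisy = np.where(flip, 1 - labels, labels)

# 3x3 median filter = majority vote in each neighborhood
smoothed = median_filter(noisy, size=3)
errors_before = int((noisy != labels).sum())
errors_after = int((smoothed != labels).sum())
```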
Hyperparameters: Key parameters include the number of trees (more is generally better but with diminishing returns), maximum tree depth (controls overfitting), and minimum samples per leaf (affects smoothness of decision boundaries).
8.3 Comparing K-Means and Random Forests
These two approaches represent different philosophies in machine learning:
| Aspect | K-Means | Random Forest |
|---|---|---|
| Learning type | Unsupervised | Supervised |
| Training data | No labels needed | Requires labeled pixels |
| Speed | Very fast | Moderate (depends on features) |
| Feature engineering | Minimal | Critical for performance |
| Number of classes | Fixed (K) | Flexible |
| Spatial coherence | Natural (with spatial features) | Requires post-processing |
| Interpretability | High (cluster centers) | Moderate (feature importance) |
In practice, these methods can be combined. For example, K-means can provide initial rough segmentation, which is then refined using a Random Forest trained on a small set of manually corrected labels.
8.4 Practical Example: Segmenting Natural Images
Let’s consider segmenting a landscape photograph into semantic regions (sky, water, vegetation, ground).
K-means approach: We might use K=4 clusters with features combining RGB values and normalized x, y coordinates. The spatial weighting would encourage contiguous regions. This gives a quick initial segmentation but may struggle with regions that have similar colors (e.g., blue sky and blue water).
Random Forest approach: We would manually label a few hundred pixels from each class, then extract features including RGB, HSV values, local gradient statistics, and texture measures in 5×5 windows. The trained forest can then leverage learned patterns (e.g., “sky pixels tend to have smooth gradients and appear in the upper portion of images”).
The choice depends on available resources (time for labeling), desired accuracy, and whether we expect to segment many similar images (which justifies the upfront cost of training a supervised model).
8.5 Extensions and Advanced Topics
Both methods have numerous extensions worth exploring:
For K-means:
- K-means++ initialization for better starting centroids
- Mini-batch K-means for very large images
- Spectral clustering for handling non-spherical clusters
- Gaussian Mixture Models for soft cluster assignments
For Random Forests:
- Extremely Randomized Trees for faster training
- Gradient Boosted Trees for improved accuracy
- Deep learning features as input to Random Forests
- Structured prediction to enforce spatial coherence
Modern deep learning approaches have largely superseded these classical methods for many segmentation tasks, but K-means and Random Forests remain valuable for their interpretability, efficiency on small datasets, and ease of implementation.
8.6 Summary
This chapter introduced two foundational machine learning approaches to image segmentation:
K-means clustering provides fast, unsupervised segmentation by grouping pixels based on similarity in feature space. Its simplicity makes it ideal for exploratory analysis, though it has limitations in handling complex segmentation scenarios.
Random Forests offer supervised segmentation where we train classifiers on labeled examples. By combining multiple decision trees and leveraging rich feature sets, they can learn sophisticated segmentation rules that generalize to new images.
Understanding these methods provides a foundation for appreciating more advanced techniques while offering practical tools that remain relevant for many real-world applications. The key insight is that segmentation can be framed as a machine learning problem—whether clustering similar pixels or learning to predict pixel labels from features.
8.7 Exercises
K-means exploration: Apply K-means to a color image using only RGB features. Then repeat with RGB + spatial coordinates. Compare the segmentations. What effect does the spatial weighting have?
Optimal K: Implement the elbow method for a sample image. Plot WCSS versus K for K ranging from 2 to 10. Where does the elbow occur? Does this match your intuitive expectation of the number of meaningful regions?
Feature importance: Train a Random Forest on a labeled image segment. Extract feature importance scores. Which features are most discriminative? Does this match your intuition about what distinguishes the different regions?
Texture features: Implement local variance as a texture feature. For each pixel, compute the variance of intensities in a 5×5 window. Use this as an additional feature for either K-means or Random Forest. How does it affect segmentation quality on a textured image?
Comparison study: Take an image and segment it using both K-means (K=3) and Random Forest (with 3 classes). Manually create ground truth labels for a test set of pixels. Compute accuracy for both methods. Which performs better? Why might that be?
Class imbalance: Create a Random Forest segmentation where one class represents only 5% of pixels. What happens to classification accuracy for the minority class? Implement class weighting and observe the effect.
Spatial post-processing: After Random Forest segmentation, apply a median filter to the label map. How does this affect the spatial coherence? What is the trade-off?