4  A Segmentation Pipeline: From Simple Rules to Smarter Algorithms

4.1 The Goal

Throughout this course we are working toward a single clinical question: given a microscopy image of urothelial cells, what is the nucleus-to-cytoplasm (N/C) ratio?

\[\text{N/C ratio} = \frac{\text{nucleus pixels}}{\text{cytoplasm pixels}}\]

A normal urothelial cell has a small nucleus surrounded by a generous cytoplasm — a low N/C ratio. A cancerous cell does the opposite: the nucleus swells while the cytoplasm shrinks, pushing the ratio upward. Pathologists use this number as one of their key diagnostic signals when reviewing slides.

To compute the N/C ratio automatically, we first need to segment every image: assign each pixel one of three labels — background (0), cytoplasm (1), or nucleus (2). That is the segmentation problem, and this chapter walks through four generations of increasingly principled approaches to solving it.
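Once a segmentation mask exists, the N/C ratio itself is a two-line computation. A minimal sketch (the function name `nc_ratio` is ours, not from the labs):

```python
import numpy as np

def nc_ratio(mask):
    """N/C ratio from a label mask (0 = background, 1 = cytoplasm, 2 = nucleus)."""
    nucleus   = (mask == 2).sum()
    cytoplasm = (mask == 1).sum()
    return nucleus / cytoplasm

toy = np.array([[0, 1, 1],
                [1, 2, 1],
                [1, 1, 0]])
print(nc_ratio(toy))   # 1 nucleus pixel / 6 cytoplasm pixels ≈ 0.167
```

Everything in this chapter is in service of producing a `mask` good enough for this division to be clinically meaningful.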

Each approach corresponds directly to a lab notebook in the companion series (Lab 01 through Lab 03-v2). This chapter explains the ideas behind each notebook, walks through the key code, and reflects honestly on what works, what does not, and why.


4.2 Getting the Data

Every lab begins the same way: cloning the repository and loading the dataset. You only need to do this once per session.

Code
!git clone https://github.com/emilsar/VocEd.git
%cd VocEd

After cloning, the dataset lives inside imagedata/. It contains 200 paired files:

Folder          Array shape              Contents
imagedata/X/    (3, 256, 256) float32    RGB image, pixel values in [0, 1]
imagedata/y/    (256, 256) int64         Segmentation mask — one integer label per pixel

Labels: 0 = background, 1 = cytoplasm, 2 = nucleus.

One thing to notice: the colour channels sit in axis 0 (channel-first format). This is PyTorch’s convention. Matplotlib expects the channels last — (256, 256, 3) — so whenever we display an image we transpose it: X[i].transpose(1, 2, 0).
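A quick shape check makes the two conventions concrete (transposing only rearranges the view of the data; no pixels are copied):

```python
import numpy as np

chw = np.zeros((3, 256, 256), dtype=np.float32)   # channels-first (PyTorch style)
hwc = chw.transpose(1, 2, 0)                       # channels-last (matplotlib style)
print(chw.shape, '->', hwc.shape)   # (3, 256, 256) -> (256, 256, 3)
```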

Code
import glob
import numpy as np
import matplotlib.pyplot as plt

N = len(glob.glob('imagedata/X/*.npy'))

X = np.stack([np.load(f'imagedata/X/{i}.npy') for i in range(N)])  # (N, 3, 256, 256)
y = np.stack([np.load(f'imagedata/y/{i}.npy') for i in range(N)])  # (N, 256, 256)

print(f'X  shape : {X.shape}   dtype: {X.dtype}')
print(f'y  shape : {y.shape}   dtype: {y.dtype}')
print(f'Unique labels in y: {np.unique(y)}')

4.3 Stage 1 — Grayscale Thresholding

Corresponds to Lab 01.

4.3.1 From Colour to Grayscale

Our images are full colour, but colour introduces complexity immediately: we have three channels (red, green, blue) per pixel instead of one. The simplest strategy is to collapse the three channels into a single grayscale value using the luminance formula from Chapter 2:

\[\text{gray} = 0.299 \cdot R \;+\; 0.587 \cdot G \;+\; 0.114 \cdot B\]

The weights reflect the human eye’s sensitivity to each channel — most sensitive to green, least to blue. After this conversion, every pixel is a single number between 0 (black) and 1 (white).

Code
def to_gray(img):
    """Convert a (3, H, W) channels-first image to (H, W) grayscale."""
    return 0.299 * img[0] + 0.587 * img[1] + 0.114 * img[2]

4.3.2 The Intensity Histogram

Before picking any thresholds, Lab 01 plots a per-class intensity histogram: the distribution of grayscale values separately for pixels whose true label is background, cytoplasm, and nucleus. This is the single most important diagnostic tool at this stage.

Code
IDX = 7
img7  = X[IDX]
mask7 = y[IDX]
gray7 = to_gray(img7)

fig, axes = plt.subplots(1, 3, figsize=(14, 4))

axes[0].imshow(img7.transpose(1, 2, 0))
axes[0].set_title('Original RGB (image 7)')
axes[0].axis('off')

axes[1].imshow(gray7, cmap='gray', vmin=0, vmax=1)
axes[1].set_title('Grayscale')
axes[1].axis('off')

colors = ['black', 'steelblue', 'crimson']
for cls, (name, col) in enumerate(zip(['background', 'cytoplasm', 'nucleus'], colors)):
    vals = gray7[mask7 == cls]
    axes[2].hist(vals, bins=60, alpha=0.6, color=col,
                 label=f'{name} (n={len(vals):,})', density=True)
axes[2].set_xlabel('Grayscale intensity')
axes[2].set_ylabel('Density')
axes[2].set_title('Per-class intensity histogram')
axes[2].legend()

plt.tight_layout()
plt.show()

In a haematoxylin-and-eosin (H&E) stained slide, three intensity bands emerge:

  • Nucleus — deeply stained dark purple, intensity ≈ 0.24
  • Cytoplasm — lightly stained pink, intensity ≈ 0.51
  • Background — unstained white glass slide, intensity ≈ 1.00

If the three distributions barely overlap, a threshold placed in each valley will work well. If they mix heavily, no threshold can cleanly separate them — and we have found the ceiling of this approach before we have even started.

4.3.3 The Two-Threshold Rule

The histogram suggests a natural two-threshold segmentation:

  • t_nucleus — maximum grayscale value a nucleus pixel can have. Pixels darker than this are labelled nucleus.
  • t_background — minimum grayscale value a background pixel can have. Pixels brighter than this are labelled background.
  • Everything in between is labelled cytoplasm.

Code
t_nucleus    = 0.45   # pixels darker than this  →  nucleus
t_background = 0.85   # pixels brighter than this →  background
                      # everything in between     →  cytoplasm

def segment(img, t_nucleus, t_background):
    """Apply two-threshold segmentation to a single (3, H, W) image."""
    gray = to_gray(img)
    pred = np.zeros(gray.shape, dtype=np.int64)          # start: everything background (0)
    pred[gray < t_nucleus]                                = 2   # dark  → nucleus
    pred[(gray >= t_nucleus) & (gray < t_background)]    = 1   # mid   → cytoplasm
    return pred

4.3.4 Measuring Quality with the Dice Score

A side-by-side image comparison gives a qualitative feel for quality. To compare methods numerically across all 200 images we need a single number per image. The Dice score (derived in Chapter 2) measures the overlap between the predicted mask and the ground-truth mask for one class:

\[\text{Dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}\]

where \(A\) is the set of predicted pixels for a class and \(B\) is the ground-truth set. Dice = 1 means perfect overlap; Dice = 0 means no overlap.

We report the average of cytoplasm Dice and nucleus Dice. Background is excluded because it dominates the pixel count and inflates scores — a segmenter that labels everything background would score very well on background Dice alone.

Code
def dice_score(pred, target, cls):
    """Dice score for a single class label."""
    pred_mask   = (pred   == cls)
    target_mask = (target == cls)
    intersection = (pred_mask & target_mask).sum()
    denominator  = pred_mask.sum() + target_mask.sum()
    if denominator == 0:
        return 1.0   # both empty → perfect match by convention
    return 2 * intersection / denominator


# ── Evaluate hand-picked thresholds across all 200 images ────────────────────
cyto_scores, nuc_scores = [], []

for i in range(N):
    pred_i = segment(X[i], t_nucleus, t_background)
    cyto_scores.append(dice_score(pred_i, y[i], cls=1))
    nuc_scores.append(dice_score(pred_i, y[i], cls=2))

print(f'Cytoplasm Dice : {np.mean(cyto_scores):.4f}')
print(f'Nucleus Dice   : {np.mean(nuc_scores):.4f}')
print(f'Mean Dice      : {(np.mean(cyto_scores) + np.mean(nuc_scores)) / 2:.4f}')

4.3.5 What Stage 1 Reveals

A mean Dice around 0.53 tells us two things at once. On one hand, thresholding does capture real structure — 0.53 is well above the zero we would get from random labelling. On the other, it misses a great deal. The core limitation is threshold fragility: the value t_nucleus = 0.45 that was chosen by looking at the histogram of image 7 may be too high or too low for an image with slightly different staining or illumination. Every image has its own intensity distribution, but our rule applies the same fixed cutoffs to all 200.

Quiz: Two-Threshold Rule

Given t_nucleus = 0.45 and t_background = 0.85, a pixel with grayscale intensity 0.72 is labelled:






4.4 Stage 2 — Optimising the Thresholds

Corresponds to Lab 02.

4.4.1 The Problem with Hand-Picking

In Stage 1 we chose thresholds by eye. There is no reason to believe these are the best possible values. The space of all threshold pairs (t_nucleus, t_background) is continuous — we could spend a long time trying combinations.

Bayesian optimisation is a smarter search strategy. Rather than evaluating threshold pairs randomly, it builds a statistical model of how Dice changes across the search space, and uses that model to decide which pair to try next. After 50 evaluations it converges on a near-optimal combination far faster than a grid search could. The algorithm is described in detail in Appendix B; here we focus on what it finds and what that teaches us.
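The full `gp_minimize` setup lives in Appendix B. As a rough illustration of searching the same two-parameter box, here is a plain random search over a hypothetical stand-in objective — the real objective would be the negative mean training Dice, and Bayesian optimisation would replace the blind uniform draws with model-guided proposals:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy objective standing in for "1 - mean training Dice":
# its minimum sits near the values the real optimiser finds.
def objective(t_nucleus, t_background):
    return (t_nucleus - 0.39) ** 2 + (t_background - 0.99) ** 2

# Plain random search over the box a Bayesian optimiser would explore.
best_pair, best_val = None, np.inf
for _ in range(50):
    t_nuc = rng.uniform(0.0, 1.0)    # candidate nucleus threshold
    t_bg  = rng.uniform(0.5, 1.0)    # candidate background threshold
    val = objective(t_nuc, t_bg)
    if val < best_val:
        best_pair, best_val = (t_nuc, t_bg), val

print(f'best pair found: ({best_pair[0]:.2f}, {best_pair[1]:.2f})')
```

Random search wastes evaluations in regions the model would already know are poor; that waste is what Bayesian optimisation eliminates, and why it matters when each evaluation means segmenting 152 images.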

4.4.2 Protecting Against Overfitting: The Train / Test Split

Before running any optimisation, Lab 02 introduces a crucial safeguard: the train / test split. After discarding the 9 images that contain no nucleus pixels (191 remain; see the code below), we divide them into:

  • Training set (80 %) — the 152 images the optimiser is allowed to tune against
  • Test set (20 %) — the 39 images held back until the very end

Without this split, we risk overfitting: finding thresholds that score well on the images we tuned against but fail on any new image seen in practice. The test set acts as a proxy for future, unseen data.

Lab 02 also stratifies the split by nucleus size, ensuring that images with small, medium, and large nuclei appear in both halves in proportion. This prevents all the hard cases — tiny or enormous nuclei — from accidentally piling up in one set.

Code
!pip install scikit-optimize --quiet

from sklearn.model_selection import train_test_split

# Drop images with no nucleus pixels — they cannot be evaluated meaningfully
has_nucleus = (y == 2).sum(axis=(1, 2)) > 0
X, y = X[has_nucleus], y[has_nucleus]
N = len(X)   # 191 images remain

# Stratify by nucleus size: bin images into quartiles, then split proportionally
nucleus_size = (y == 2).sum(axis=(1, 2))
quartile     = np.digitize(nucleus_size, np.percentile(nucleus_size, [25, 50, 75]))

train_idx, test_idx = train_test_split(
    np.arange(N), test_size=0.2, stratify=quartile, random_state=42
)
print(f'Training images : {len(train_idx)}')
print(f'Test images     : {len(test_idx)}')

4.4.3 What the Optimiser Finds

Bayesian optimisation searches for the threshold pair that maximises mean Dice across the 152 training images. After 50 evaluations it converges on:

  • t_nucleus ≈ 0.39 — slightly lower than our hand-picked 0.45
  • t_background ≈ 0.99 — dramatically higher than our hand-picked 0.85

The second value is the important one. Pushing t_background to 0.99 means almost nothing gets labelled background anymore. Pixels that our Stage 1 rule sent to the background bucket now land in the cytoplasm bucket. This massively increases the predicted cytoplasm area, which in turn corrects a systematic error: Stage 1 was silently misclassifying a large portion of cytoplasm pixels as background on every single image.

Code
def mean_dice(indices, t_nucleus, t_background):
    """Mean Dice (cytoplasm + nucleus) over a set of image indices."""
    scores = []
    for i in indices:
        pred = segment(X[i], t_nucleus, t_background)
        d = (dice_score(pred, y[i], 1) + dice_score(pred, y[i], 2)) / 2
        scores.append(d)
    return np.mean(scores)

lab01_nuc, lab01_bg = 0.45, 0.85   # hand-picked
best_nuc,  best_bg  = 0.3904, 0.99  # found by Bayesian optimisation

print(f'Test Dice — hand-picked  : {mean_dice(test_idx, lab01_nuc, lab01_bg):.4f}')
print(f'Test Dice — optimised    : {mean_dice(test_idx, best_nuc,  best_bg):.4f}')

The improvement is substantial — from roughly 0.74 to 0.81. This jump comes almost entirely from recalibrating t_background, not from any change to the image or the segmentation logic. The lesson: even the simplest possible model can perform much better with the right parameter values, and choosing those values by eye on a single image is unreliable.

Quiz: Bayesian Optimisation

Bayesian optimisation found the optimal t_background ≈ 0.99, far above the hand-picked 0.85. What does this reveal about Stage 1’s systematic error?






4.5 Stage 3 — Cleaning the Image

Corresponds to Lab 03.

4.5.1 The Ceiling of Pure Thresholding

Even with the best possible threshold values, a fundamental limitation remains: some cytoplasm pixels are dark enough to be confused with nucleus, and some nucleus pixels are bright enough to be confused with cytoplasm. The grayscale distributions of the two classes overlap, and no single intensity cutoff can cleanly separate them.

One response is to pre-process the image before thresholding, aiming to make the class distributions more distinct. Lab 03 tests two classical tools from image processing.

4.5.2 Gaussian Blur

Gaussian blur replaces each pixel with a weighted average of its neighbourhood, where nearby pixels get higher weight and distant pixels get lower weight. The result is a smoother image where salt-and-pepper noise is damped. Chapter 3 covers the Gaussian kernel in detail; here we apply it directly via scikit-image:

Code
!pip install scikit-optimize scikit-image --quiet

import skimage.filters as skf
import skimage.morphology as skm

# Gaussian blur on a single image — sigma controls the blur radius
# channel_axis=-1 tells skimage that the last axis is the colour channel
img7_hwc = X[7].transpose(1, 2, 0)   # (H, W, C) for skimage
blurred  = skf.gaussian(img7_hwc, sigma=1.5, channel_axis=-1)

4.5.3 Morphological Opening and Closing

After thresholding, the predicted mask often contains small isolated blobs (noise artefacts) and small holes (gaps inside otherwise solid regions). Two morphological operations from Chapter 3 address these:

  • Opening (erode then dilate) shrinks all regions first, letting small isolated blobs vanish entirely, then grows surviving regions back to their original size.
  • Closing (dilate then erode) fills small holes by first expanding regions to cover gaps, then shrinking back.

Lab 03 applies these to each class mask separately.
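Both operations are easy to see on a toy mask. A sketch using scipy.ndimage (skimage.morphology behaves the same way on boolean masks; the default structuring element here is a 3 × 3 cross):

```python
import numpy as np
from scipy.ndimage import binary_opening, binary_closing

mask = np.zeros((9, 9), dtype=bool)
mask[2:7, 2:7] = True    # solid 5 x 5 region
mask[4, 4] = False       # one-pixel hole inside it
mask[0, 8] = True        # isolated one-pixel speck

opened = binary_opening(mask)   # erode then dilate
closed = binary_closing(mask)   # dilate then erode

print(opened[0, 8])   # False — opening removed the speck
print(closed[4, 4])   # True  — closing filled the hole
```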

4.5.4 The Configurable Pipeline

Lab 03 wraps all of these steps into a single full_pipeline function controlled by boolean flags. Turning a flag off skips that step and passes the image or mask through unchanged. This makes it straightforward to test which combination of steps actually helps — an ablation study (see Lab 03’s group exercise).

Code
def full_pipeline(img, t_nucleus, t_background,
                  use_blur=True, use_open=True, use_close=True,
                  blur_sigma=1.5, morph_radius=3):
    """
    Full segmentation pipeline with optional blur and morphological cleanup.

    img           : (3, H, W) float32, channels-first
    t_nucleus     : pixels darker than this  → nucleus
    t_background  : pixels brighter than this → background
    use_blur      : apply Gaussian blur before thresholding
    use_open      : apply morphological opening after thresholding
    use_close     : apply morphological closing after thresholding
    blur_sigma    : Gaussian blur radius
    morph_radius  : radius of the disk structuring element for opening/closing
    """
    img_hwc = img.transpose(1, 2, 0)   # skimage expects channel-last

    # ── Optional: Gaussian blur ───────────────────────────────────────────────
    if use_blur:
        img_hwc = skf.gaussian(img_hwc, sigma=blur_sigma, channel_axis=-1)

    # ── Grayscale + threshold ─────────────────────────────────────────────────
    gray = 0.299 * img_hwc[:, :, 0] + 0.587 * img_hwc[:, :, 1] + 0.114 * img_hwc[:, :, 2]
    pred = np.zeros(gray.shape, dtype=np.int64)
    pred[gray < t_nucleus]                                = 2
    pred[(gray >= t_nucleus) & (gray < t_background)]    = 1

    # ── Optional: morphological cleanup, applied per class ────────────────────
    if use_open or use_close:
        disk = skm.disk(morph_radius)
        for cls in [1, 2]:
            cls_mask = (pred == cls).astype(bool)
            if use_open:
                cls_mask = skm.opening(cls_mask, disk)
            if use_close:
                cls_mask = skm.closing(cls_mask, disk)
            pred[pred == cls] = 0
            pred[cls_mask]    = cls

    return pred

4.5.5 Why the Gain Is Modest

After running Bayesian optimisation over all four parameters (t_nucleus, t_background, blur_sigma, morph_radius), the test-set Dice barely moves beyond Stage 2, holding at roughly 0.79. The reason is straightforward: Gaussian blur and morphological operations work on pixel intensities and mask shapes; they cannot manufacture class separation that is not there in the data. If nucleus and cytoplasm pixels overlap in grayscale, blurring before thresholding does not change that overlap — it can even worsen it by smearing boundaries. Morphological operations help at the margins (removing small artefacts, filling small holes) but introduce their own errors when they are too aggressive.
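The ceiling can be made concrete with a toy calculation: if two classes have overlapping intensity distributions, even the best possible threshold misclassifies the pixels in the overlap. The distributions below are illustrative, not measured from the dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
nucleus   = rng.normal(0.24, 0.10, 100_000)   # hypothetical intensity spread
cytoplasm = rng.normal(0.51, 0.10, 100_000)

# Accuracy of every candidate cutoff: nucleus below it, cytoplasm above it.
thresholds = np.linspace(0.0, 1.0, 1001)
accuracy = [((nucleus < t).mean() + (cytoplasm >= t).mean()) / 2
            for t in thresholds]
print(f'best achievable accuracy: {max(accuracy):.3f}')
```

Any pixel-wise rule on these intensities is capped near that value; extra separating power has to come from somewhere else, such as colour.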

Quiz: Preprocessing Limits

Stage 3 added Gaussian blur and morphological cleanup, yet test-set Dice barely improved over Stage 2. What is the fundamental reason?






4.6 Stage 4 — Targeted Cleanup

Corresponds to Lab 03-v2.

4.6.1 Diagnosing the Specific Failure Modes

Lab 03 applied generic cleanup everywhere — opening and closing on every pixel — without first asking what specifically goes wrong with a grayscale threshold. Lab 03-v2 begins with a diagnosis.

Two predictable failure modes arise from labelling every dark pixel as nucleus and every mid-intensity pixel as cytoplasm:

Holes inside nuclei. Urothelial cell nuclei stain unevenly. The interior of a nucleus can have lighter patches that fall above t_nucleus and get mislabelled as cytoplasm. In the predicted mask, a nucleus appears as a ring or crescent rather than a solid region.

Isolated cytoplasm debris. Cell debris, lipid droplets, and staining artefacts scatter throughout the slide. Many of these have mid-range intensity and get labelled cytoplasm, even though they are not attached to any cell. A genuine cytoplasm surrounds a nucleus; these fragments do not.

Generic opening and closing treat both failure modes only incidentally. Lab 03-v2 addresses each one directly.

4.6.2 Two Targeted Operations

binary_fill_holes — Given a binary mask of the predicted nucleus, this function identifies every pixel that is completely enclosed by nucleus pixels and flips it to nucleus. It plugs interior holes precisely, without altering anything outside the nucleus boundary.

remove_small_objects — Given a binary mask, this function removes any connected component — any contiguous group of touching pixels — whose total area falls below a user-specified threshold. Small debris blobs disappear; large, cohesive regions survive. Both the nucleus mask and the cytoplasm mask are cleaned this way. The size thresholds (min_nuc_size, min_cyto_size) become parameters that Bayesian optimisation can search over.
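The hole-filling operation is easiest to see on a toy "ring" nucleus:

```python
import numpy as np
from scipy.ndimage import binary_fill_holes

ring = np.zeros((7, 7), dtype=bool)
ring[1:6, 1:6] = True
ring[2:5, 2:5] = False    # mislabelled interior — the nucleus reads as a ring
filled = binary_fill_holes(ring)

print(ring.sum(), '->', filled.sum())   # 16 -> 25
```

Only pixels completely enclosed by the region are flipped; everything outside the boundary is untouched.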

Code
from scipy.ndimage import binary_fill_holes

def segment_morph(img, t_nuc, t_bg, min_nuc_size=50, min_cyto_size=50):
    """
    Targeted cleanup pipeline.

    img           : (3, H, W) float32, channels-first
    t_nuc         : grayscale threshold for nucleus
    t_bg          : grayscale threshold for background
    min_nuc_size  : nucleus fragments smaller than this (pixels) are removed
    min_cyto_size : cytoplasm fragments smaller than this (pixels) are removed
    """
    gray = to_gray(img)

    # ── Stage 1: initial grayscale threshold ──────────────────────────────────
    nuc_mask  = gray < t_nuc
    bg_mask   = gray > t_bg
    cyto_mask = ~nuc_mask & ~bg_mask

    # ── Stage 2: nucleus cleanup ──────────────────────────────────────────────
    # Remove tiny dark specks that are too small to be a real nucleus
    nuc_mask = skm.remove_small_objects(nuc_mask, min_size=int(min_nuc_size))
    # Fill lighter holes inside nucleus bodies that were mislabelled as cytoplasm
    nuc_mask = binary_fill_holes(nuc_mask)

    # ── Stage 3: cytoplasm cleanup ────────────────────────────────────────────
    # Remove isolated debris fragments that are too small to be real cytoplasm
    if cyto_mask.any():
        cyto_mask = skm.remove_small_objects(cyto_mask, min_size=int(min_cyto_size))

    pred = np.zeros(gray.shape, dtype=np.int64)
    pred[cyto_mask] = 1
    pred[nuc_mask]  = 2
    return pred

4.6.3 Why Size Beats Proximity

An earlier version of this pipeline tried a different cytoplasm filter: any cytoplasm pixel that was not within a fixed radius of a nucleus pixel was discarded as an artefact. The biological reasoning was sound — real cytoplasm wraps around a nucleus. But the criterion is geometrically too strict. Cytoplasm can extend far from the nucleus centre in a large cell, and the filter discarded real cytoplasm alongside the debris it was targeting.

remove_small_objects makes no spatial assumption. It only asks: is this connected region large enough to plausibly be real? Small means artefact; large means real. That criterion is robust to cell shape and size variation in a way that proximity is not.
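The size criterion itself is just connected-component labelling plus a size filter. A scipy-based sketch of what remove_small_objects does under the hood (`remove_small` is our illustrative name, not skimage's API):

```python
import numpy as np
from scipy.ndimage import label

def remove_small(mask, min_size):
    """Drop connected components smaller than min_size pixels (4-connectivity)."""
    labelled, n = label(mask)
    sizes = np.bincount(labelled.ravel())   # pixel count per component label
    keep = sizes >= min_size
    keep[0] = False                          # label 0 is the background
    return keep[labelled]

mask = np.zeros((8, 8), dtype=bool)
mask[1:5, 1:5] = True    # 16-pixel region: kept
mask[6, 6] = True        # 1-pixel speck: removed
cleaned = remove_small(mask, min_size=5)
print(cleaned.sum())     # 16
```

Note that the filter never asks where a component is, only how big it is — exactly the robustness the proximity criterion lacked.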

Quiz: Targeted Cleanup

Lab 03-v2 diagnoses two specific failure modes in grayscale-thresholded masks. Which pair of operations addresses each one?






4.7 Comparing All Four Stages

After running Bayesian optimisation independently for each pipeline, we can compare test-set Dice across the four approaches on a consistent basis:

Method                            Key idea                          Test Dice
Lab 01 — hand-picked thresholds   Grayscale + fixed cutoffs         ≈ 0.71
Lab 02 — Bayesian opt             Grayscale + learned cutoffs       ≈ 0.79
Lab 03 — blur + morphology        Preprocessing + learned cutoffs   ≈ 0.79
Lab 03-v2 — targeted cleanup      Size-based artefact removal       ≈ 0.80

4.7.1 What the Progression Tells Us

The biggest single improvement happens between Stage 1 and Stage 2. Bayesian optimisation replaces the hand-picked thresholds, and in doing so discovers that t_background should be pushed much higher than intuition suggests. This recalibrates the cytoplasm area estimate across the entire dataset and drives most of the Dice gain.

Stages 3 and 4 deliver smaller, incremental improvements. Each addresses a real failure mode — image noise and mask cleanup in Stage 3, nucleus holes and cytoplasm debris in Stage 4 — but neither can overcome the fundamental ceiling: all four methods share a grayscale threshold at their core, and two of the three classes overlap substantially in grayscale intensity. No amount of preprocessing resolves that overlap.

4.7.2 The Ceiling — and What Comes Next

This ceiling is not a failure of effort or creativity. It is a consequence of the representation. By projecting a full-colour image down to a single grayscale number, we discard the colour information that, to a human pathologist, is one of the clearest signals distinguishing nucleus from cytoplasm. Nuclei are purple; cytoplasm is pink. Grayscale merges these into overlapping intensity bands.

The natural next step is to stop discarding colour. Instead of asking how bright is this pixel?, we can ask what colour is this pixel? and train an algorithm to map RGB triples directly to class labels. That is exactly what the following chapters do — starting with simple colour-space classifiers and building toward deep convolutional networks that learn rich, multi-scale representations of cellular structure.

Quiz: Pipeline Comparison

The comparison table shows test-set Dice improving from ≈ 0.71 (Stage 1) to ≈ 0.80 (Stage 4). Which single stage transition accounts for the majority of that gain?




