Throughout this course we are working toward a single clinical question: given a microscopy image of urothelial cells, what is the nucleus-to-cytoplasm (N/C) ratio?
\[\text{N/C ratio} = \frac{\text{nucleus pixels}}{\text{cytoplasm pixels}}\]
A normal urothelial cell has a small nucleus surrounded by a generous cytoplasm — a low N/C ratio. A cancerous cell does the opposite: the nucleus swells while the cytoplasm shrinks, pushing the ratio upward. Pathologists use this number as one of their key diagnostic signals when reviewing slides.
To compute the N/C ratio automatically, we first need to segment every image: assign each pixel one of three labels — background (0), cytoplasm (1), or nucleus (2). That is the segmentation problem, and this chapter walks through four generations of increasingly principled approaches to solving it.
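Once a mask is in hand, the ratio itself is a short computation. A minimal sketch, assuming the label convention above (the labs may package this differently):

```python
import numpy as np

def nc_ratio(mask):
    """N/C ratio from a segmentation mask (0 = background, 1 = cytoplasm, 2 = nucleus)."""
    nucleus_px = (mask == 2).sum()
    cytoplasm_px = (mask == 1).sum()
    return nucleus_px / cytoplasm_px  # assumes at least one cytoplasm pixel
```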
Each approach corresponds directly to a lab notebook in the companion series (Lab 01 through Lab 03-v2). This chapter explains the ideas behind each notebook, walks through the key code, and reflects honestly on what works, what does not, and why.
Every lab begins the same way: cloning the repository and loading the dataset. You only need to do this once per session.
```python
!git clone https://github.com/emilsar/VocEd.git
%cd VocEd
```

After cloning, the dataset lives inside `imagedata/`. It contains 200 paired files:
| Folder | Array shape | Contents |
|---|---|---|
| `imagedata/X/` | (3, 256, 256) float32 | RGB image, pixel values in [0, 1] |
| `imagedata/y/` | (256, 256) int64 | Segmentation mask — one integer label per pixel |
Labels: 0 = background, 1 = cytoplasm, 2 = nucleus.
One thing to notice: the colour channels sit in axis 0 (channel-first format). This is PyTorch’s convention. Matplotlib expects the channels last — (256, 256, 3) — so whenever we display an image we transpose it: X[i].transpose(1, 2, 0).
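A quick shape check makes the convention concrete (illustrative only):

```python
import numpy as np

chw = np.zeros((3, 256, 256), dtype=np.float32)  # channels-first, as stored on disk
hwc = chw.transpose(1, 2, 0)                     # channels-last, as matplotlib expects
print(chw.shape, '->', hwc.shape)                # (3, 256, 256) -> (256, 256, 3)
```

With that convention in mind, we can load the full dataset: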
```python
import glob
import numpy as np
import matplotlib.pyplot as plt

N = len(glob.glob('imagedata/X/*.npy'))
X = np.stack([np.load(f'imagedata/X/{i}.npy') for i in range(N)])  # (N, 3, 256, 256)
y = np.stack([np.load(f'imagedata/y/{i}.npy') for i in range(N)])  # (N, 256, 256)

print(f'X shape : {X.shape}  dtype: {X.dtype}')
print(f'y shape : {y.shape}  dtype: {y.dtype}')
print(f'Unique labels in y: {np.unique(y)}')
```

Corresponds to Lab 01.
Our images are full colour, but colour introduces complexity immediately: we have three channels (red, green, blue) per pixel instead of one. The simplest strategy is to collapse the three channels into a single grayscale value using the luminance formula from Chapter 2:
\[\text{gray} = 0.299 \cdot R \;+\; 0.587 \cdot G \;+\; 0.114 \cdot B\]
The weights reflect the human eye’s sensitivity to each channel — most sensitive to green, least to blue. After this conversion, every pixel is a single number between 0 (black) and 1 (white).
```python
def to_gray(img):
    """Convert a (3, H, W) channels-first image to (H, W) grayscale."""
    return 0.299 * img[0] + 0.587 * img[1] + 0.114 * img[2]
```

Before picking any thresholds, Lab 01 plots a per-class intensity histogram: the distribution of grayscale values separately for pixels whose true label is background, cytoplasm, and nucleus. This is the single most important diagnostic tool at this stage.
```python
IDX = 7
img7 = X[IDX]
mask7 = y[IDX]
gray7 = to_gray(img7)

fig, axes = plt.subplots(1, 3, figsize=(14, 4))

axes[0].imshow(img7.transpose(1, 2, 0))
axes[0].set_title('Original RGB (image 7)')
axes[0].axis('off')

axes[1].imshow(gray7, cmap='gray', vmin=0, vmax=1)
axes[1].set_title('Grayscale')
axes[1].axis('off')

colors = ['black', 'steelblue', 'crimson']
for cls, (name, col) in enumerate(zip(['background', 'cytoplasm', 'nucleus'], colors)):
    vals = gray7[mask7 == cls]
    axes[2].hist(vals, bins=60, alpha=0.6, color=col,
                 label=f'{name} (n={len(vals):,})', density=True)
axes[2].set_xlabel('Grayscale intensity')
axes[2].set_ylabel('Density')
axes[2].set_title('Per-class intensity histogram')
axes[2].legend()

plt.tight_layout()
plt.show()
```

In a haematoxylin-and-eosin (H&E) stained slide, three intensity bands emerge:

- Nuclei are darkest: haematoxylin stains densely packed nuclear material, pushing nucleus pixels toward low grayscale values.
- Cytoplasm occupies the middle of the intensity range.
- Background is brightest: unstained regions of the slide are close to white.
If the three distributions barely overlap, a threshold placed in each valley will work well. If they mix heavily, no threshold can cleanly separate them — and we have found the ceiling of this approach before we have even started.
The histogram suggests a natural two-threshold segmentation:
- `t_nucleus` — maximum grayscale value a nucleus pixel can have. Pixels darker than this are labelled nucleus.
- `t_background` — minimum grayscale value a background pixel can have. Pixels brighter than this are labelled background.

```python
t_nucleus = 0.45      # pixels darker than this → nucleus
t_background = 0.85   # pixels brighter than this → background
# everything in between → cytoplasm

def segment(img, t_nucleus, t_background):
    """Apply two-threshold segmentation to a single (3, H, W) image."""
    gray = to_gray(img)
    pred = np.zeros(gray.shape, dtype=np.int64)           # start: everything background (0)
    pred[gray < t_nucleus] = 2                            # dark → nucleus
    pred[(gray >= t_nucleus) & (gray < t_background)] = 1 # mid → cytoplasm
    return pred
```

A side-by-side image comparison gives a qualitative feel for quality. To compare methods numerically across all 200 images we need a single number per image. The Dice score (derived in Chapter 2) measures the overlap between the predicted mask and the ground-truth mask for one class:
\[\text{Dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}\]
where \(A\) is the set of predicted pixels for a class and \(B\) is the ground-truth set. Dice = 1 means perfect overlap; Dice = 0 means no overlap.
We report the average of cytoplasm Dice and nucleus Dice. Background is excluded because it dominates the pixel count and inflates scores — a segmenter that labels everything background would score very well on background Dice alone.
```python
def dice_score(pred, target, cls):
    """Dice score for a single class label."""
    pred_mask = (pred == cls)
    target_mask = (target == cls)
    intersection = (pred_mask & target_mask).sum()
    denominator = pred_mask.sum() + target_mask.sum()
    if denominator == 0:
        return 1.0  # both empty → perfect match by convention
    return 2 * intersection / denominator

# ── Evaluate hand-picked thresholds across all 200 images ────────────────────
cyto_scores, nuc_scores = [], []
for i in range(N):
    pred_i = segment(X[i], t_nucleus, t_background)
    cyto_scores.append(dice_score(pred_i, y[i], cls=1))
    nuc_scores.append(dice_score(pred_i, y[i], cls=2))

print(f'Cytoplasm Dice : {np.mean(cyto_scores):.4f}')
print(f'Nucleus Dice   : {np.mean(nuc_scores):.4f}')
print(f'Mean Dice      : {(np.mean(cyto_scores) + np.mean(nuc_scores)) / 2:.4f}')
```

A mean Dice around 0.53 tells us two things at once. On one hand, thresholding does capture real structure — 0.53 is well above what random labelling would achieve. On the other, it misses a great deal. The core limitation is threshold fragility: the value `t_nucleus = 0.45` that was chosen by looking at the histogram of image 7 may be too high or too low for an image with slightly different staining or illumination. Every image has its own intensity distribution, but our rule applies the same fixed cutoffs to all 200.
Given `t_nucleus = 0.45` and `t_background = 0.85`, which label does a pixel with grayscale intensity 0.72 receive?
Corresponds to Lab 02.
In Stage 1 we chose thresholds by eye. There is no reason to believe these are the best possible values. The space of all threshold pairs (t_nucleus, t_background) is continuous — we could spend a long time trying combinations.
Bayesian optimisation is a smarter search strategy. Rather than evaluating threshold pairs randomly, it builds a statistical model of how Dice changes across the search space, and uses that model to decide which pair to try next. After 50 evaluations it converges on a near-optimal combination far faster than a grid search could. The algorithm is described in detail in Appendix B; here we focus on what it finds and what that teaches us.
Before running any optimisation, Lab 02 introduces a crucial safeguard: the train / test split. We divide the 200 images into:
Without this split, we risk overfitting: finding thresholds that score perfectly on our 200 images but fail on any new image seen in practice. The test set acts as a proxy for future, unseen data.
Lab 02 also stratifies the split by nucleus size, ensuring that images with small, medium, and large nuclei appear in both halves in proportion. This prevents all the hard cases — tiny or enormous nuclei — from accidentally piling up in one set.
```python
!pip install scikit-optimize --quiet

from sklearn.model_selection import train_test_split

# Drop images with no nucleus pixels — they cannot be evaluated meaningfully
has_nucleus = (y == 2).sum(axis=(1, 2)) > 0
X, y = X[has_nucleus], y[has_nucleus]
N = len(X)  # 191 images remain

# Stratify by nucleus size: bin images into quartiles, then split proportionally
nucleus_size = (y == 2).sum(axis=(1, 2))
quartile = np.digitize(nucleus_size, np.percentile(nucleus_size, [25, 50, 75]))

train_idx, test_idx = train_test_split(
    np.arange(N), test_size=0.2, stratify=quartile, random_state=42
)
print(f'Training images : {len(train_idx)}')
print(f'Test images     : {len(test_idx)}')
```

Bayesian optimisation searches for the threshold pair that maximises mean Dice across the 152 training images. After 50 evaluations it converges on:
- `t_nucleus` ≈ 0.39 — slightly lower than our hand-picked 0.45
- `t_background` ≈ 0.99 — dramatically higher than our hand-picked 0.85

The second value is the important one. Pushing `t_background` to 0.99 means almost nothing gets labelled background anymore. Pixels that our Stage 1 rule sent to the background bucket now land in the cytoplasm bucket. This massively increases the predicted cytoplasm area, which in turn corrects a systematic error: Stage 1 was silently misclassifying a large portion of cytoplasm pixels as background on every single image.
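We can see the recalibration directly by comparing how much of an image each rule sends to the background bucket. A quick check on a single image (the exact fractions will vary from image to image):

```python
# Fraction of pixels labelled background under each threshold pair
pred_hand = segment(X[7], 0.45, 0.85)  # Stage 1 hand-picked thresholds
pred_opt  = segment(X[7], 0.39, 0.99)  # Stage 2 optimised thresholds

print(f'background fraction, hand-picked : {(pred_hand == 0).mean():.3f}')
print(f'background fraction, optimised   : {(pred_opt == 0).mean():.3f}')
```

With that context, we can quantify the gain on the held-out test set: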
```python
def mean_dice(indices, t_nucleus, t_background):
    """Mean Dice (cytoplasm + nucleus) over a set of image indices."""
    scores = []
    for i in indices:
        pred = segment(X[i], t_nucleus, t_background)
        d = (dice_score(pred, y[i], 1) + dice_score(pred, y[i], 2)) / 2
        scores.append(d)
    return np.mean(scores)

lab01_nuc, lab01_bg = 0.45, 0.85  # hand-picked
best_nuc, best_bg = 0.3904, 0.99  # found by Bayesian optimisation

print(f'Test Dice — hand-picked : {mean_dice(test_idx, lab01_nuc, lab01_bg):.4f}')
print(f'Test Dice — optimised   : {mean_dice(test_idx, best_nuc, best_bg):.4f}')
```

The improvement is substantial — from roughly 0.74 to 0.81. This jump comes almost entirely from recalibrating `t_background`, not from any change to the image or the segmentation logic. The lesson: even the simplest possible model can perform much better with the right parameter values, and choosing those values by eye on a single image is unreliable.
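For readers who want to reproduce the search, here is a minimal sketch using scikit-optimize's `gp_minimize` together with the `mean_dice` helper above. The search bounds and settings are illustrative assumptions; Appendix B documents the actual configuration:

```python
from skopt import gp_minimize
from skopt.space import Real

# Illustrative search bounds (assumption): keep t_nucleus below t_background
space = [Real(0.05, 0.70, name='t_nucleus'),
         Real(0.70, 1.00, name='t_background')]

def objective(params):
    t_nuc, t_bg = params
    return -mean_dice(train_idx, t_nuc, t_bg)  # gp_minimize minimises, so negate

result = gp_minimize(objective, space, n_calls=50, random_state=42)
print('Best thresholds :', result.x)
print('Best train Dice :', -result.fun)
```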
Bayesian optimisation found the optimal t_background ≈ 0.99, far above the hand-picked 0.85. What does this reveal about Stage 1’s systematic error?
Corresponds to Lab 03.
Even with the best possible threshold values, a fundamental limitation remains: some cytoplasm pixels are dark enough to be confused with nucleus, and some nucleus pixels are bright enough to be confused with cytoplasm. The grayscale distributions of the two classes overlap, and no single intensity cutoff can cleanly separate them.
One response is to pre-process the image before thresholding, aiming to make the class distributions more distinct. Lab 03 tests two classical tools from image processing.
Gaussian blur replaces each pixel with a weighted average of its neighbourhood, where nearby pixels get higher weight and distant pixels get lower weight. The result is a smoother image where salt-and-pepper noise is damped. Chapter 3 covers the Gaussian kernel in detail; here we apply it directly via scikit-image:
```python
!pip install scikit-optimize scikit-image --quiet

import skimage.filters as skf
import skimage.morphology as skm

# Gaussian blur on a single image — sigma controls the blur radius
# channel_axis=-1 tells skimage that the last axis is the colour channel
img7_hwc = X[7].transpose(1, 2, 0)  # (H, W, C) for skimage
blurred = skf.gaussian(img7_hwc, sigma=1.5, channel_axis=-1)
```

After thresholding, the predicted mask often contains small isolated blobs (noise artefacts) and small holes (gaps inside otherwise solid regions). Two morphological operations from Chapter 3 address these:

- Opening (erosion followed by dilation) removes small isolated blobs that cannot survive the erosion step.
- Closing (dilation followed by erosion) fills small holes and gaps inside otherwise solid regions.
Lab 03 applies these to each class mask separately.
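To build intuition before the full pipeline, here is a toy demonstration on a hand-made binary mask (illustrative values, not from the lab):

```python
import numpy as np
import skimage.morphology as skm

mask = np.zeros((9, 9), dtype=bool)
mask[2:7, 2:7] = True  # a solid 5×5 region
mask[4, 4] = False     # a one-pixel hole inside it
mask[0, 8] = True      # an isolated one-pixel speck

disk = skm.disk(1)
print(skm.opening(mask, disk)[0, 8])  # False: opening removed the speck
print(skm.closing(mask, disk)[4, 4])  # True:  closing filled the hole
```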
Lab 03 wraps all of these steps into a single full_pipeline function controlled by boolean flags. Turning a flag off skips that step and passes the image or mask through unchanged. This makes it straightforward to test which combination of steps actually helps — an ablation study (see Lab 03’s group exercise).
```python
def full_pipeline(img, t_nucleus, t_background,
                  use_blur=True, use_open=True, use_close=True,
                  blur_sigma=1.5, morph_radius=3):
    """
    Full segmentation pipeline with optional blur and morphological cleanup.

    img          : (3, H, W) float32, channels-first
    t_nucleus    : pixels darker than this → nucleus
    t_background : pixels brighter than this → background
    use_blur     : apply Gaussian blur before thresholding
    use_open     : apply morphological opening after thresholding
    use_close    : apply morphological closing after thresholding
    blur_sigma   : Gaussian blur radius
    morph_radius : radius of the disk structuring element for opening/closing
    """
    img_hwc = img.transpose(1, 2, 0)  # skimage expects channel-last

    # ── Optional: Gaussian blur ───────────────────────────────────────────────
    if use_blur:
        img_hwc = skf.gaussian(img_hwc, sigma=blur_sigma, channel_axis=-1)

    # ── Grayscale + threshold ─────────────────────────────────────────────────
    gray = 0.299 * img_hwc[:, :, 0] + 0.587 * img_hwc[:, :, 1] + 0.114 * img_hwc[:, :, 2]
    pred = np.zeros(gray.shape, dtype=np.int64)
    pred[gray < t_nucleus] = 2
    pred[(gray >= t_nucleus) & (gray < t_background)] = 1

    # ── Optional: morphological cleanup, applied per class ────────────────────
    if use_open or use_close:
        disk = skm.disk(morph_radius)
        for cls in [1, 2]:
            cls_mask = (pred == cls)
            if use_open:
                cls_mask = skm.opening(cls_mask, disk)
            if use_close:
                cls_mask = skm.closing(cls_mask, disk)
            pred[pred == cls] = 0
            pred[cls_mask] = cls
    return pred
```

After running Bayesian optimisation over all four parameters (t_nucleus, t_background, blur_sigma, morph_radius), the test-set Dice barely moves beyond Stage 2 — from roughly 0.79 to 0.79. The reason is straightforward: Gaussian blur and morphological operations work on pixel intensities and mask shapes; they cannot manufacture class separation that is not there in the data. If nucleus and cytoplasm pixels overlap in grayscale, blurring before thresholding does not change that overlap — it can even worsen it by smearing boundaries. Morphological operations help at the margins (removing small artefacts, filling small holes) but introduce their own errors when they are too aggressive.
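The ablation behind that conclusion can be reproduced with a loop along these lines. A sketch: it holds the Lab 02 thresholds fixed while toggling the flags, which may differ from the exact protocol in Lab 03's group exercise:

```python
import itertools

for use_blur, use_open, use_close in itertools.product([False, True], repeat=3):
    scores = []
    for i in test_idx:
        pred = full_pipeline(X[i], best_nuc, best_bg,
                             use_blur=use_blur, use_open=use_open, use_close=use_close)
        scores.append((dice_score(pred, y[i], 1) + dice_score(pred, y[i], 2)) / 2)
    print(f'blur={use_blur!s:<5} open={use_open!s:<5} close={use_close!s:<5}'
          f' → mean Dice {np.mean(scores):.4f}')
```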
Stage 3 added Gaussian blur and morphological cleanup, yet test-set Dice barely improved over Stage 2. What is the fundamental reason?
Corresponds to Lab 03-v2.
Lab 03 applied generic cleanup everywhere — opening and closing on every pixel — without first asking what specifically goes wrong with a grayscale threshold. Lab 03-v2 begins with a diagnosis.
Two predictable failure modes arise from labelling every dark pixel as nucleus and every mid-intensity pixel as cytoplasm:
1. **Holes inside nuclei.** Urothelial cell nuclei stain unevenly. The interior of a nucleus can have lighter patches that fall above `t_nucleus` and get mislabelled as cytoplasm. In the predicted mask, a nucleus appears as a ring or crescent rather than a solid region.

2. **Isolated cytoplasm debris.** Cell debris, lipid droplets, and staining artefacts scatter throughout the slide. Many of these have mid-range intensity and get labelled cytoplasm, even though they are not attached to any cell. A genuine cytoplasm surrounds a nucleus; these fragments do not.
Generic opening and closing treat both failure modes only incidentally. Lab 03-v2 addresses each one directly.
- `binary_fill_holes` — Given a binary mask of the predicted nucleus, this function identifies every pixel that is completely enclosed by nucleus pixels and flips it to nucleus. It plugs interior holes precisely, without altering anything outside the nucleus boundary.

- `remove_small_objects` — Given a binary mask, this function removes any connected component — any contiguous group of touching pixels — whose total area falls below a user-specified threshold. Small debris blobs disappear; large, cohesive regions survive. Both the nucleus mask and the cytoplasm mask are cleaned this way. The size thresholds (`min_nuc_size`, `min_cyto_size`) become parameters that Bayesian optimisation can search over.
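A toy demonstration of both operations on hand-made masks (illustrative sizes, not from the lab):

```python
from scipy.ndimage import binary_fill_holes
import numpy as np
import skimage.morphology as skm

ring = np.zeros((7, 7), dtype=bool)
ring[1:6, 1:6] = True
ring[3, 3] = False  # a "nucleus" with a lighter interior hole
print(binary_fill_holes(ring)[3, 3])  # True: the enclosed hole is plugged

speck = np.zeros((7, 7), dtype=bool)
speck[0, 0] = True  # one-pixel "debris"
print(skm.remove_small_objects(speck, min_size=5).any())  # False: too small, removed
```

Lab 03-v2 wires both calls into a single pipeline: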
```python
from scipy.ndimage import binary_fill_holes

def segment_morph(img, t_nuc, t_bg, min_nuc_size=50, min_cyto_size=50):
    """
    Targeted cleanup pipeline.

    img           : (3, H, W) float32, channels-first
    t_nuc         : grayscale threshold for nucleus
    t_bg          : grayscale threshold for background
    min_nuc_size  : nucleus fragments smaller than this (pixels) are removed
    min_cyto_size : cytoplasm fragments smaller than this (pixels) are removed
    """
    gray = to_gray(img)

    # ── Stage 1: initial grayscale threshold ──────────────────────────────────
    nuc_mask = gray < t_nuc
    bg_mask = gray > t_bg
    cyto_mask = ~nuc_mask & ~bg_mask

    # ── Stage 2: nucleus cleanup ──────────────────────────────────────────────
    # Remove tiny dark specks that are too small to be a real nucleus
    nuc_mask = skm.remove_small_objects(nuc_mask, min_size=int(min_nuc_size))
    # Fill lighter holes inside nucleus bodies that were mislabelled as cytoplasm
    nuc_mask = binary_fill_holes(nuc_mask)

    # ── Stage 3: cytoplasm cleanup ────────────────────────────────────────────
    # Remove isolated debris fragments that are too small to be real cytoplasm
    if cyto_mask.any():
        cyto_mask = skm.remove_small_objects(cyto_mask, min_size=int(min_cyto_size))

    pred = np.zeros(gray.shape, dtype=np.int64)
    pred[cyto_mask] = 1
    pred[nuc_mask] = 2
    return pred
```

An earlier version of this pipeline tried a different cytoplasm filter: any cytoplasm pixel that was not within a fixed radius of a nucleus pixel was discarded as an artefact. The biological reasoning was sound — real cytoplasm wraps around a nucleus. But the criterion is geometrically too strict. Cytoplasm can extend far from the nucleus centre in a large cell, and the filter discarded real cytoplasm alongside the debris it was targeting.
remove_small_objects makes no spatial assumption. It only asks: is this connected region large enough to plausibly be real? Small means artefact; large means real. That criterion is robust to cell shape and size variation in a way that proximity is not.
Lab 03-v2 diagnoses two specific failure modes in grayscale-thresholded masks. Which pair of operations addresses each one?
After running Bayesian optimisation independently for each pipeline, we can compare test-set Dice across the four approaches on a consistent basis:
| Method | Key idea | Test Dice |
|---|---|---|
| Lab 01 — hand-picked thresholds | Grayscale + fixed cutoffs | ≈ 0.71 |
| Lab 02 — Bayesian opt | Grayscale + learned cutoffs | ≈ 0.79 |
| Lab 03 — blur + morphology | Preprocessing + learned cutoffs | ≈ 0.79 |
| Lab 03-v2 — targeted cleanup | Size-based artefact removal | ≈ 0.80 |
The biggest single improvement happens between Stage 1 and Stage 2. Bayesian optimisation replaces the hand-picked thresholds, and in doing so discovers that t_background should be pushed much higher than intuition suggests. This recalibrates the cytoplasm area estimate across the entire dataset and drives most of the Dice gain.
Stages 3 and 4 deliver smaller, incremental improvements. Each addresses a real failure mode — image noise and mask cleanup in Stage 3, nucleus holes and cytoplasm debris in Stage 4 — but neither can overcome the fundamental ceiling: all four methods share a grayscale threshold at their core, and two of the three classes overlap substantially in grayscale intensity. No amount of preprocessing resolves that overlap.
This ceiling is not a failure of effort or creativity. It is a consequence of the representation. By projecting a full-colour image down to a single grayscale number, we discard the colour information that, to a human pathologist, is one of the clearest signals distinguishing nucleus from cytoplasm. Nuclei are purple; cytoplasm is pink. Grayscale merges these into overlapping intensity bands.
The natural next step is to stop discarding colour. Instead of asking how bright is this pixel?, we can ask what colour is this pixel? and train an algorithm to map RGB triples directly to class labels. That is exactly what the following chapters do — starting with simple colour-space classifiers and building toward deep convolutional networks that learn rich, multi-scale representations of cellular structure.
The comparison table shows test-set Dice improving from ≈ 0.71 (Stage 1) to ≈ 0.80 (Stage 4). Which single stage transition accounts for the majority of that gain?