4  Building an Image Preprocessing Pipeline

4.1 Introduction

Preprocessing is the critical first step in any image analysis workflow. Before we can train machine learning models or perform segmentation analysis, we must understand how different preprocessing operations affect our images and metrics. Rather than applying all preprocessing steps at once, we’ll build our pipeline iteratively, adding one technique at a time and observing its effect on quantitative metrics.

This chapter uses the urothelial cell images from the dataset, where manual annotations label three classes: background (0), cytoplasm (1), and nucleus (2). We’ll develop a semi-automated pipeline that allows you to select an image, adjust parameters, and observe the effect of each preprocessing stage on both visual output and numerical quality metrics.


4.2 Understanding Segmentation Metrics

Before we begin preprocessing, we need to establish how we’ll measure whether our segmentation is good. Unlike classification (predicting a single label), segmentation assigns a label to every pixel, so we need metrics that compare spatial agreement between predicted and ground truth masks.

4.2.1 Dice Coefficient (F1 Score)

The Dice coefficient (also called F1 score or Sorensen index) measures the overlap between two sets. For image segmentation, it’s defined as:

\[\text{Dice} = \frac{2|X \cap Y|}{|X| + |Y|}\]

where \(X\) is the predicted segmentation and \(Y\) is the ground truth. The numerator is twice the intersection (overlap), and the denominator is the sum of areas.

Interpretation: - Dice = 1.0: Perfect overlap (ideal) - Dice = 0.5: Modest overlap - Dice = 0.0: No overlap (worst case)

Why use it: The Dice coefficient is symmetric and robust to class imbalance. It’s heavily weighted toward both precision and recall being good—a high Dice requires accurate boundaries in both dimensions.

Formula intuition: The factor of 2 makes it equivalent to the harmonic mean of precision and recall. It’s more commonly used than IoU in medical imaging.

4.2.2 Intersection over Union (IoU / Jaccard Index)

The Intersection over Union (IoU), also called the Jaccard index, is another standard metric:

\[\text{IoU} = \frac{|X \cap Y|}{|X \cup Y|}\]

where the intersection is divided by the union (all pixels belonging to either predicted or ground truth).

Interpretation: - IoU = 1.0: Perfect overlap - IoU = 0.5: Moderate overlap - IoU = 0.0: No overlap

Relationship to Dice: - IoU and Dice are related: \(\text{Dice} = \frac{2 \times \text{IoU}}{1 + \text{IoU}}\) - For good segmentations (IoU > 0.7), Dice ≈ 0.82 or higher - IoU is stricter: an IoU of 0.7 corresponds to a Dice of ~0.82

Why use it: IoU is intuitive (area of overlap divided by total area) and commonly used in computer vision challenges.

4.2.3 Per-Class Metrics

Because we have multiple classes (background, cytoplasm, nucleus), we calculate Dice and IoU separately for each class. This is critical because:

  • Background typically dominates by pixel count—naive metrics can look good if you simply predict “background” everywhere
  • Nucleus is small and medically important—we need to know if nucleus segmentation is accurate
  • Cytoplasm is intermediate—accurate cytoplasm prediction shows the pipeline distinguishes cellular boundaries

We report: - Per-class Dice: Dice for background, cytoplasm, and nucleus separately - Mean Dice: Average across all classes - Per-class IoU: IoU for each class - Mean IoU: Average across all classes

4.2.4 Implementation

Let’s first load the Cedars Sinai Data:

import numpy as np
import matplotlib.pyplot as plt
from skimage.measure import label
import pandas as pd

# ============================================================================
# LOAD CEDARS SINAI UROTHELIAL CELL DATA
# ============================================================================

# Note: If running in Jupyter/Colab, uncomment and run:
# !git clone https://github.com/emilsar/Cedars.git
# %cd Cedars/Project3
# !python 1_prepdata.py

# Load the preprocessed data — tries Windows path first, falls back to local Cedars clone
try:
    urothelial_cells = pd.read_pickle("C:/Cedars/Project3/urothelial_cell_toy_data.pkl")
except FileNotFoundError:
    urothelial_cells = pd.read_pickle("Cedars/Project3/urothelial_cell_toy_data.pkl")

# Convert to uint8 image format (0-255)
# Original shape: (batch, channels, height, width)
# Target shape: (batch, height, width, channels)
images = np.transpose(urothelial_cells["X"].numpy() * 255, (0, 2, 3, 1)).astype(np.uint8)
labels = urothelial_cells["y"]
import numpy as np

def dice_coefficient(pred, target, class_label):
    """
    Calculate Dice coefficient for a specific class.
    
    Parameters:
    -----------
    pred : np.ndarray
        Predicted segmentation (integer labels)
    target : np.ndarray
        Ground truth segmentation (integer labels)
    class_label : int
        The class to evaluate (0, 1, or 2)
    
    Returns:
    --------
    dice : float
        Dice coefficient for this class (0 to 1)
    """
    pred_binary = (pred == class_label).astype(float)
    target_binary = (target == class_label).astype(float)
    
    intersection = np.sum(pred_binary * target_binary)
    sum_pred = np.sum(pred_binary)
    sum_target = np.sum(target_binary)
    
    # Handle edge case where both pred and target are empty
    if sum_pred == 0 and sum_target == 0:
        return 1.0
    
    if sum_pred + sum_target == 0:
        return 0.0
    
    dice = 2 * intersection / (sum_pred + sum_target)
    return dice

def iou_coefficient(pred, target, class_label):
    """
    Calculate Intersection over Union (IoU) for a specific class.
    
    Parameters:
    -----------
    pred : np.ndarray
        Predicted segmentation (integer labels)
    target : np.ndarray
        Ground truth segmentation (integer labels)
    class_label : int
        The class to evaluate (0, 1, or 2)
    
    Returns:
    --------
    iou : float
        IoU for this class (0 to 1)
    """
    pred_binary = (pred == class_label).astype(float)
    target_binary = (target == class_label).astype(float)
    
    intersection = np.sum(pred_binary * target_binary)
    union = np.sum(np.logical_or(pred_binary, target_binary))
    
    # Handle edge case where union is empty
    if union == 0:
        return 1.0 if intersection == 0 else 0.0
    
    iou = intersection / union
    return iou

def calculate_all_metrics(pred, target, class_labels=[0, 1, 2]):
    """
    Calculate Dice and IoU for all classes plus mean metrics.
    
    Returns:
    --------
    metrics : dict
        Dictionary containing per-class and mean metrics
    """
    metrics = {}
    
    dice_scores = []
    iou_scores = []
    
    for class_label in class_labels:
        dice = dice_coefficient(pred, target, class_label)
        iou = iou_coefficient(pred, target, class_label)
        
        class_name = ['Background', 'Cytoplasm', 'Nucleus'][class_label]
        metrics[f'Dice_{class_name}'] = dice
        metrics[f'IoU_{class_name}'] = iou
        
        dice_scores.append(dice)
        iou_scores.append(iou)
    
    metrics['Dice_Mean'] = np.mean(dice_scores)
    metrics['IoU_Mean'] = np.mean(iou_scores)
    
    return metrics

def print_metrics(metrics):
    """Pretty-print metrics table."""
    print("=" * 70)
    print(f"{'Class':<15} {'Dice':<15} {'IoU':<15}")
    print("=" * 70)
    
    class_names = ['Background', 'Cytoplasm', 'Nucleus']
    for i, class_name in enumerate(class_names):
        dice = metrics[f'Dice_{class_name}']
        iou = metrics[f'IoU_{class_name}']
        print(f"{class_name:<15} {dice:<15.4f} {iou:<15.4f}")
    
    print("-" * 70)
    print(f"{'Mean':<15} {metrics['Dice_Mean']:<15.4f} {metrics['IoU_Mean']:<15.4f}")
    print("=" * 70)

4.3 Stage 1 - Grayscale Conversion and Basic Segmentation

4.3.1 Pipeline Overview for Stage 1

In this initial stage, we perform the minimum necessary preprocessing:

  1. Convert RGB to grayscale using the BT.709 standard (modern HDTV standard)
  2. Apply intensity-based thresholding to create initial segmentation masks for nucleus and cytoplasm
  3. Combine masks into a final 3-class segmentation
  4. Calculate and visualize metrics to establish a baseline

No denoising, no morphological operations—just the raw result of thresholding on the grayscale image. This baseline helps us understand the effect of each subsequent enhancement.

4.3.2 Stage 1 Implementation

import numpy as np
import matplotlib.pyplot as plt
import cv2
from scipy.ndimage import gaussian_filter

# ============================================================================
# STAGE 1: GRAYSCALE CONVERSION AND BASIC INTENSITY THRESHOLDING
# ============================================================================

# PARAMETERS (User-adjustable)
image_number = 7
nucleus_max = 0.3      # Intensity threshold for nucleus (dark pixels)
cytoplasm_min = 0.3    # Lower bound for cytoplasm (mid-gray)
cytoplasm_max = 0.7    # Upper bound for cytoplasm

# Get the image
img = images[image_number]
labels_true = labels[image_number]

# STEP 1: Grayscale Conversion
# Using BT.709 standard (modern HDTV): 0.2125*R + 0.7154*G + 0.0721*B
img_uint8 = img.astype(np.uint8)
img_gray = cv2.cvtColor(img_uint8, cv2.COLOR_RGB2GRAY).astype(np.float32) / 255.0

# STEP 2: Intensity-Based Segmentation
nucleus_mask = (img_gray < nucleus_max).astype(int)
cytoplasm_mask = np.logical_and(
    img_gray >= cytoplasm_min, 
    img_gray <= cytoplasm_max
).astype(int)
background_mask = (img_gray > cytoplasm_max).astype(int)

# STEP 3: Combine into 3-class segmentation
segmented = np.zeros_like(img_gray, dtype=int)
segmented[nucleus_mask == 1] = 2
segmented[cytoplasm_mask == 1] = 1
segmented[background_mask == 1] = 0

# STEP 4: Calculate Metrics
metrics_stage1 = calculate_all_metrics(segmented, labels_true, class_labels=[0, 1, 2])

# ============================================================================
# VISUALIZATION
# ============================================================================

fig, axes = plt.subplots(3, 3, figsize=(18, 12))
fig.subplots_adjust(hspace=0.35, wspace=0.3)

# Row 1: Input and Ground Truth
axes[0, 0].imshow(img)
axes[0, 0].set_title("Original RGB Image", fontsize=12, fontweight='bold')
axes[0, 0].axis('off')

axes[0, 1].imshow(img_gray, cmap='gray')
axes[0, 1].set_title("Grayscale Conversion", fontsize=12, fontweight='bold')
axes[0, 1].axis('off')

axes[0, 2].imshow(labels_true, cmap='viridis', vmin=0, vmax=2)
axes[0, 2].set_title("Ground Truth (Expert Annotation)", fontsize=12, fontweight='bold')
axes[0, 2].axis('off')

# Row 2: Segmentation Masks
axes[1, 0].imshow(nucleus_mask, cmap='viridis', vmin=0, vmax=1)
axes[1, 0].set_title("Nucleus Mask\n(intensity < 0.3)", fontsize=11)
axes[1, 0].axis('off')

axes[1, 1].imshow(cytoplasm_mask, cmap='viridis', vmin=0, vmax=1)
axes[1, 1].set_title("Cytoplasm Mask\n(0.3 ≤ intensity ≤ 0.7)", fontsize=11)
axes[1, 1].axis('off')

axes[1, 2].imshow(background_mask, cmap='gray')
axes[1, 2].set_title("Background Mask\n(intensity > 0.7)", fontsize=11)
axes[1, 2].axis('off')

# Row 3: Segmentation and Histogram
axes[2, 0].imshow(segmented, cmap='viridis', vmin=0, vmax=2)
axes[2, 0].set_title("Predicted Segmentation\n(Stage 1)", fontsize=12, fontweight='bold')
axes[2, 0].axis('off')

# Intensity Histogram with thresholds
axes[2, 1].hist(img_gray.flatten(), bins=50, alpha=0.7, color='blue', edgecolor='black')
axes[2, 1].axvline(nucleus_max, color='red', linestyle='--', linewidth=2, label=f'Nucleus threshold ({nucleus_max})')
axes[2, 1].axvline(cytoplasm_min, color='orange', linestyle='--', linewidth=2, label=f'Cytoplasm min ({cytoplasm_min})')
axes[2, 1].axvline(cytoplasm_max, color='green', linestyle='--', linewidth=2, label=f'Cytoplasm max ({cytoplasm_max})')
axes[2, 1].set_xlabel('Pixel Intensity (0.0-1.0)', fontsize=11)
axes[2, 1].set_ylabel('Frequency', fontsize=11)
axes[2, 1].set_title("Intensity Distribution with Threshold Lines", fontsize=12, fontweight='bold')
axes[2, 1].legend(fontsize=9)
axes[2, 1].grid(alpha=0.3)

axes[2, 2].axis('off')

plt.suptitle('Stage 1: Grayscale Conversion and Basic Intensity Thresholding', 
             fontsize=14, fontweight='bold', y=0.995)
plt.show()

# Print metrics
print("\n" + "="*70)
print("STAGE 1: GRAYSCALE CONVERSION AND BASIC SEGMENTATION")
print("="*70)
print_metrics(metrics_stage1)

print(f"\nImage Number: {image_number}")
print(f"Nucleus Threshold: < {nucleus_max}")
print(f"Cytoplasm Threshold: {cytoplasm_min} to {cytoplasm_max}")


======================================================================
STAGE 1: GRAYSCALE CONVERSION AND BASIC SEGMENTATION
======================================================================
======================================================================
Class           Dice            IoU            
======================================================================
Background      0.8262          0.7038         
Cytoplasm       0.8264          0.7042         
Nucleus         0.6586          0.4910         
----------------------------------------------------------------------
Mean            0.7704          0.6330         
======================================================================

Image Number: 7
Nucleus Threshold: < 0.3
Cytoplasm Threshold: 0.3 to 0.7

4.3.3 Understanding the Intensity Distribution Histogram

The intensity histogram with threshold lines is one of the most important visualizations in preprocessing because it shows whether your thresholds are well-chosen.

What it shows: - The x-axis represents pixel intensity values (0.0 = black, 1.0 = white) - The y-axis represents the frequency (count) of pixels at each intensity - The blue histogram shows the distribution of all pixel intensities in the image - The colored vertical lines show where your thresholds are placed

Why we plot it: We need to understand whether the three classes (nucleus, cytoplasm, background) are actually separated in intensity space. If they overlap heavily, no threshold-based approach will work well.

What improvement looks like: - Good separation: The three classes form distinct peaks in the histogram, with valleys between them where the threshold lines sit. Example: nucleus peak at 0.2, cytoplasm peak at 0.5, background peak at 0.8 - Poor separation: The histogram is noisy and spread out with no clear peaks. Thresholds placed in noise will capture spurious pixels

Interpreting your thresholds relative to the histogram:

Ideal case:
         nucleus_max (red line)  cytoplasm_min (orange)  cytoplasm_max (green)
              ↓                          ↓                        ↓
  ████    ██████         ██████         ██████    ████
█       █        █       █      █       █      █        █
─────────────────────────────────────────────────────────
0.0                0.3              0.7                   1.0

The thresholds sit in the "valleys" between peaks—pixel counts are low there.

Poor case:

         Red, orange, and green lines all mixed together in noisy region:
  ██████████
█████████████████████  ← Uniform noise, no clear peaks
─────────────────────
0.0                    1.0

Thresholds can't separate classes effectively.

What to look for across stages: - Stage 1: Histogram is noisy, thresholds may sit in high-frequency regions (suboptimal) - Stage 2 (after denoising): Histogram becomes sharper, noise reduced, peaks more distinct - Stage 3 (after edge enhancement): Peaks become even sharper, valleys deeper - Stage 4: Should show no change in histogram shape (morphological ops don’t affect pixel intensities)

The histogram guides parameter selection: if your thresholds don’t align with clear valleys in the histogram, adjust them until they do.

4.3.4 Interpreting Stage 1 Results

At this stage, the results are typically mediocre. Why? Because:

  • No denoising: Noise in the original image affects pixel intensities, causing incorrect threshold assignments
  • No edge enhancement: Blurry boundaries between structures lead to misclassification at interfaces
  • No morphological cleaning: Small noise artifacts and holes aren’t removed

The Stage 1 metrics serve as a baseline. We expect Dice and IoU to improve as we add preprocessing steps. The intensity histogram shows whether the classes are well-separated in intensity space, and whether the chosen thresholds are reasonable.


4.4 Stage 2 - Adding Non-Local Means Denoising

4.4.1 Why Denoise?

Microscopy images contain noise from several sources: - Photon noise (from the sensor) - Electronic noise (from the microscope’s electronics) - Biological variability (cellular structures naturally have fuzzy boundaries)

Non-local means (NL-means) denoising is a patch-based technique that removes noise while preserving edges and fine structures. Unlike Gaussian blur (which smooths everything equally), NL-means finds similar patches elsewhere in the image and averages them, preserving structure.

4.4.2 Stage 2 Implementation

from skimage.restoration import denoise_nl_means

# ============================================================================
# STAGE 2: ADD NON-LOCAL MEANS DENOISING
# ============================================================================

# Additional parameter for Stage 2
nl_means_strength = 0.3  # h parameter: higher = more aggressive denoising

# STEP 1: Grayscale Conversion (same as Stage 1)
img_gray = cv2.cvtColor(img_uint8, cv2.COLOR_RGB2GRAY).astype(np.float32) / 255.0

# STEP 2: Non-Local Means Denoising
img_denoised = denoise_nl_means(
    img_gray,
    h=nl_means_strength,           # Denoising strength
    fast_mode=True,                # Use faster approximation
    patch_size=10,                 # Size of patches to compare
    patch_distance=10              # Search radius for similar patches
)

# STEP 3: Intensity-Based Segmentation (same as Stage 1, but on denoised image)
nucleus_mask = (img_denoised < nucleus_max).astype(int)
cytoplasm_mask = np.logical_and(
    img_denoised >= cytoplasm_min, 
    img_denoised <= cytoplasm_max
).astype(int)
background_mask = (img_denoised > cytoplasm_max).astype(int)

# STEP 4: Combine into 3-class segmentation
segmented = np.zeros_like(img_denoised, dtype=int)
segmented[nucleus_mask == 1] = 2
segmented[cytoplasm_mask == 1] = 1
segmented[background_mask == 1] = 0

# STEP 5: Calculate Metrics
metrics_stage2 = calculate_all_metrics(segmented, labels_true, class_labels=[0, 1, 2])

# ============================================================================
# VISUALIZATION
# ============================================================================

fig, axes = plt.subplots(3, 3, figsize=(18, 12))
fig.subplots_adjust(hspace=0.35, wspace=0.3)

# Row 1: Comparison of Original vs Denoised
axes[0, 0].imshow(img_gray, cmap='gray')
axes[0, 0].set_title("Original Grayscale", fontsize=12, fontweight='bold')
axes[0, 0].axis('off')

axes[0, 1].imshow(img_denoised, cmap='gray')
axes[0, 1].set_title(f"After NL-Means Denoising\n(h={nl_means_strength})", fontsize=12, fontweight='bold')
axes[0, 1].axis('off')

axes[0, 2].imshow(labels_true, cmap='viridis', vmin=0, vmax=2)
axes[0, 2].set_title("Ground Truth", fontsize=12, fontweight='bold')
axes[0, 2].axis('off')

# Row 2: Difference and Segmentation Masks
diff = np.abs(img_gray - img_denoised)
im = axes[1, 0].imshow(diff, cmap='hot')
axes[1, 0].set_title("Difference (Noise Removed)", fontsize=11)
axes[1, 0].axis('off')
plt.colorbar(im, ax=axes[1, 0], fraction=0.046, pad=0.04)

axes[1, 1].imshow(nucleus_mask, cmap='viridis', vmin=0, vmax=1)
axes[1, 1].set_title("Nucleus Mask", fontsize=11)
axes[1, 1].axis('off')

axes[1, 2].imshow(segmented, cmap='viridis', vmin=0, vmax=2)
axes[1, 2].set_title("Predicted Segmentation\n(Stage 2)", fontsize=12, fontweight='bold')
axes[1, 2].axis('off')

# Row 3: Histograms Comparison
axes[2, 0].hist(img_gray.flatten(), bins=50, alpha=0.6, color='blue', 
         label='Original', edgecolor='black')
axes[2, 0].hist(img_denoised.flatten(), bins=50, alpha=0.6, color='red', 
         label='Denoised', edgecolor='black')
axes[2, 0].set_xlabel('Pixel Intensity', fontsize=11)
axes[2, 0].set_ylabel('Frequency', fontsize=11)
axes[2, 0].set_title("Intensity Distribution Comparison", fontsize=11)
axes[2, 0].legend()
axes[2, 0].grid(alpha=0.3)

axes[2, 1].hist(img_denoised.flatten(), bins=50, alpha=0.7, color='blue', edgecolor='black')
axes[2, 1].axvline(nucleus_max, color='red', linestyle='--', linewidth=2, 
            label=f'Nucleus threshold ({nucleus_max})')
axes[2, 1].axvline(cytoplasm_min, color='orange', linestyle='--', linewidth=2, 
            label=f'Cytoplasm min ({cytoplasm_min})')
axes[2, 1].axvline(cytoplasm_max, color='green', linestyle='--', linewidth=2, 
            label=f'Cytoplasm max ({cytoplasm_max})')
axes[2, 1].set_xlabel('Pixel Intensity', fontsize=11)
axes[2, 1].set_ylabel('Frequency', fontsize=11)
axes[2, 1].set_title("Denoised Image Histogram with Thresholds", fontsize=11)
axes[2, 1].legend(fontsize=9)
axes[2, 1].grid(alpha=0.3)

axes[2, 2].axis('off')

plt.suptitle('Stage 2: Adding Non-Local Means Denoising', 
             fontsize=14, fontweight='bold', y=0.995)
plt.show()

# Print metrics and comparison
print("\n" + "="*70)
print("STAGE 2: NON-LOCAL MEANS DENOISING")
print("="*70)
print_metrics(metrics_stage2)

print("\n" + "="*70)
print("STAGE COMPARISON: Stage 1 → Stage 2")
print("="*70)
print(f"{'Metric':<20} {'Stage 1':<15} {'Stage 2':<15} {'Improvement':<15}")
print("-"*70)
for key in metrics_stage1.keys():
    if 'Mean' in key:
        s1 = metrics_stage1[key]
        s2 = metrics_stage2[key]
        improvement = s2 - s1
        print(f"{key:<20} {s1:<15.4f} {s2:<15.4f} {improvement:+.4f}")


======================================================================
STAGE 2: NON-LOCAL MEANS DENOISING
======================================================================
======================================================================
Class           Dice            IoU            
======================================================================
Background      0.8356          0.7176         
Cytoplasm       0.8795          0.7850         
Nucleus         0.7921          0.6557         
----------------------------------------------------------------------
Mean            0.8357          0.7194         
======================================================================

======================================================================
STAGE COMPARISON: Stage 1 → Stage 2
======================================================================
Metric               Stage 1         Stage 2         Improvement    
----------------------------------------------------------------------
Dice_Mean            0.7704          0.8357          +0.0653
IoU_Mean             0.6330          0.7194          +0.0865

4.4.3 What Changed?

By adding non-local means denoising, we expect: - Dice and IoU to increase (cleaner image → better segmentation) - Intensity histogram to show sharper, more distinct peaks (less noise in intensity values) - Noise difference visualization to show what was removed

The magnitude of improvement depends on how noisy the original image is.


4.5 Stage 3 - Adding Edge Enhancement and Gaussian Blur

4.5.1 Why Edge Enhancement?

After denoising, edges can still be soft and blurry. Unsharp masking is a classical technique that: 1. Creates a blurred version of the image 2. Subtracts it from the original 3. Adds this “detail” back to the original

This sharpens edges without amplifying noise (because we denoise first).

After enhancement, we apply gentle Gaussian blur to smooth small variations while preserving the enhanced edges. This creates cleaner, more decisive boundaries for thresholding.

4.5.2 Stage 3 Implementation

from skimage.filters import unsharp_mask

# ============================================================================
# STAGE 3: ADD UNSHARP MASKING (EDGE ENHANCEMENT) AND GAUSSIAN BLUR
# ============================================================================

# Additional parameters for Stage 3
unsharp_radius = 2.0      # Radius of the blurred version in unsharp mask
unsharp_amount = 1.0      # How much detail to add back (1.0 = 100%)
blur_sigma = 1.5          # Gaussian blur standard deviation

# STEP 1: Grayscale Conversion (same as previous stages)
img_gray = cv2.cvtColor(img_uint8, cv2.COLOR_RGB2GRAY).astype(np.float32) / 255.0

# STEP 2: Non-Local Means Denoising (same as Stage 2)
img_denoised = denoise_nl_means(
    img_gray,
    h=nl_means_strength,
    fast_mode=True,
    patch_size=10,
    patch_distance=10
)

# STEP 3: Unsharp Masking (Edge Enhancement)
img_enhanced = unsharp_mask(
    img_denoised,
    radius=unsharp_radius,
    amount=unsharp_amount
)

# STEP 4: Gaussian Blur (Light smoothing)
img_blurred = gaussian_filter(img_enhanced, sigma=blur_sigma)

# STEP 5: Intensity-Based Segmentation (on the processed image)
nucleus_mask = (img_blurred < nucleus_max).astype(int)
cytoplasm_mask = np.logical_and(
    img_blurred >= cytoplasm_min, 
    img_blurred <= cytoplasm_max
).astype(int)
background_mask = (img_blurred > cytoplasm_max).astype(int)

# STEP 6: Combine into 3-class segmentation
segmented = np.zeros_like(img_blurred, dtype=int)
segmented[nucleus_mask == 1] = 2
segmented[cytoplasm_mask == 1] = 1
segmented[background_mask == 1] = 0

# STEP 7: Calculate Metrics
metrics_stage3 = calculate_all_metrics(segmented, labels_true, class_labels=[0, 1, 2])

# ============================================================================
# VISUALIZATION
# ============================================================================

fig, axes = plt.subplots(3, 4, figsize=(20, 12))
fig.subplots_adjust(hspace=0.35, wspace=0.3)

# Row 1: Processing Pipeline
axes[0, 0].imshow(img_gray, cmap='gray')
axes[0, 0].set_title("1. Grayscale", fontsize=11, fontweight='bold')
axes[0, 0].axis('off')

axes[0, 1].imshow(img_denoised, cmap='gray')
axes[0, 1].set_title("2. Denoised", fontsize=11, fontweight='bold')
axes[0, 1].axis('off')

axes[0, 2].imshow(img_enhanced, cmap='gray')
axes[0, 2].set_title(f"3. Edge Enhanced\n(unsharp r={unsharp_radius})", fontsize=11, fontweight='bold')
axes[0, 2].axis('off')

axes[0, 3].imshow(img_blurred, cmap='gray')
axes[0, 3].set_title(f"4. Blurred\n(σ={blur_sigma})", fontsize=11, fontweight='bold')
axes[0, 3].axis('off')

# Row 2: Masks and Segmentation
axes[1, 0].imshow(nucleus_mask, cmap='viridis', vmin=0, vmax=1)
axes[1, 0].set_title("Nucleus Mask", fontsize=10)
axes[1, 0].axis('off')

axes[1, 1].imshow(cytoplasm_mask, cmap='viridis', vmin=0, vmax=1)
axes[1, 1].set_title("Cytoplasm Mask", fontsize=10)
axes[1, 1].axis('off')

axes[1, 2].imshow(labels_true, cmap='viridis', vmin=0, vmax=2)
axes[1, 2].set_title("Ground Truth", fontsize=10)
axes[1, 2].axis('off')

axes[1, 3].imshow(segmented, cmap='viridis', vmin=0, vmax=2)
axes[1, 3].set_title("Predicted Segmentation\n(Stage 3)", fontsize=11, fontweight='bold')
axes[1, 3].axis('off')

# Row 3: Histogram Progression
axes[2, 0].hist(img_gray.flatten(), bins=50, alpha=0.5, color='blue', 
         label='Original', edgecolor='black')
axes[2, 0].hist(img_denoised.flatten(), bins=50, alpha=0.5, color='orange', 
         label='Denoised', edgecolor='black')
axes[2, 0].hist(img_enhanced.flatten(), bins=50, alpha=0.5, color='red', 
         label='Edge Enhanced', edgecolor='black')
axes[2, 0].set_xlabel('Pixel Intensity', fontsize=10)
axes[2, 0].set_ylabel('Frequency', fontsize=10)
axes[2, 0].set_title("Intensity Distribution: All Stages", fontsize=11)
axes[2, 0].legend(fontsize=9)
axes[2, 0].grid(alpha=0.3)

axes[2, 1].hist(img_blurred.flatten(), bins=50, alpha=0.7, color='green', edgecolor='black')
axes[2, 1].axvline(nucleus_max, color='red', linestyle='--', linewidth=2, 
             label=f'Nucleus threshold ({nucleus_max})')
axes[2, 1].axvline(cytoplasm_min, color='orange', linestyle='--', linewidth=2, 
             label=f'Cytoplasm min ({cytoplasm_min})')
axes[2, 1].axvline(cytoplasm_max, color='purple', linestyle='--', linewidth=2, 
             label=f'Cytoplasm max ({cytoplasm_max})')
axes[2, 1].set_xlabel('Pixel Intensity', fontsize=10)
axes[2, 1].set_ylabel('Frequency', fontsize=10)
axes[2, 1].set_title("Final Processed Image with Thresholds", fontsize=11)
axes[2, 1].legend(fontsize=9)
axes[2, 1].grid(alpha=0.3)

axes[2, 2].axis('off')
axes[2, 3].axis('off')

plt.suptitle('Stage 3: Edge Enhancement and Gaussian Blur', 
             fontsize=14, fontweight='bold', y=0.995)
plt.show()

# Print metrics and comparison
print("\n" + "="*70)
print("STAGE 3: EDGE ENHANCEMENT AND GAUSSIAN BLUR")
print("="*70)
print_metrics(metrics_stage3)

print("\n" + "="*70)
print("CUMULATIVE STAGE COMPARISON: Stage 1 → Stage 2 → Stage 3")
print("="*70)
print(f"{'Metric':<25} {'Stage 1':<15} {'Stage 2':<15} {'Stage 3':<15}")
print("-"*70)
for key in metrics_stage1.keys():
    if 'Mean' in key:
        s1 = metrics_stage1[key]
        s2 = metrics_stage2[key]
        s3 = metrics_stage3[key]
        print(f"{key:<25} {s1:<15.4f} {s2:<15.4f} {s3:<15.4f}")


======================================================================
STAGE 3: EDGE ENHANCEMENT AND GAUSSIAN BLUR
======================================================================
======================================================================
Class           Dice            IoU            
======================================================================
Background      0.8354          0.7173         
Cytoplasm       0.8793          0.7845         
Nucleus         0.7927          0.6565         
----------------------------------------------------------------------
Mean            0.8358          0.7195         
======================================================================

======================================================================
CUMULATIVE STAGE COMPARISON: Stage 1 → Stage 2 → Stage 3
======================================================================
Metric                    Stage 1         Stage 2         Stage 3        
----------------------------------------------------------------------
Dice_Mean                 0.7704          0.8357          0.8358         
IoU_Mean                  0.6330          0.7194          0.7195         

4.5.3 Key Observations

At Stage 3, we expect: - Clearer intensity peaks in the histogram (edges now sharper) - Further improvement in Dice/IoU metrics - Better separation between nucleus, cytoplasm, and background in intensity space


4.6 Stage 4 - Adding Morphological Operations

4.6.1 Why Morphological Operations?

Even after careful thresholding, segmentation masks often contain: - Small noise artifacts (isolated pixels that don’t belong) - Holes inside regions (small unfilled areas within nucleus or cytoplasm)

Binary opening removes small objects: - Erosion: Shrinks objects (small objects disappear) - Dilation: Grows objects back (restores larger objects to original size) - Combined effect: Small artifacts disappear, large structures survive

Binary closing fills holes: - Dilation: Grows objects (fills holes) - Erosion: Shrinks back to original size - Combined effect: Small holes inside regions are filled

For cell segmentation: opening cleans nucleus, closing fills holes in both nucleus and cytoplasm.

4.6.2 Stage 4 Implementation

from skimage.morphology import binary_opening, binary_closing, disk

# ============================================================================
# STAGE 4: ADD MORPHOLOGICAL OPERATIONS (OPENING AND CLOSING)
# ============================================================================

# Additional parameters for Stage 4
morph_disk_size = 3       # Size of morphological kernel
closing_disk_size = 2     # Size of closing kernel (typically smaller)

# STEPS 1-4: Same as Stage 3 (Grayscale → Denoise → Enhance → Blur)
img_gray = cv2.cvtColor(img_uint8, cv2.COLOR_RGB2GRAY).astype(np.float32) / 255.0

img_denoised = denoise_nl_means(
    img_gray,
    h=nl_means_strength,
    fast_mode=True,
    patch_size=10,
    patch_distance=10
)

img_enhanced = unsharp_mask(
    img_denoised,
    radius=unsharp_radius,
    amount=unsharp_amount
)

img_blurred = gaussian_filter(img_enhanced, sigma=blur_sigma)

# STEP 5: Intensity-Based Segmentation
nucleus_mask = (img_blurred < nucleus_max).astype(int)
cytoplasm_mask = np.logical_and(
    img_blurred >= cytoplasm_min, 
    img_blurred <= cytoplasm_max
).astype(int)

# STEP 6: Morphological Operations
# Opening: Remove small artifacts
nucleus_opened = binary_opening(nucleus_mask, disk(morph_disk_size)).astype(int)
cytoplasm_opened = binary_opening(cytoplasm_mask, disk(morph_disk_size)).astype(int)

# Closing: Fill small holes
nucleus_closed = binary_closing(nucleus_opened, disk(closing_disk_size)).astype(int)
cytoplasm_closed = binary_closing(cytoplasm_opened, disk(closing_disk_size)).astype(int)

# STEP 7: Combine into 3-class segmentation
segmented = np.zeros_like(img_blurred, dtype=int)
segmented[nucleus_closed == 1] = 2
segmented[cytoplasm_closed == 1] = 1
# Background is implicitly 0

# STEP 8: Calculate Metrics
metrics_stage4 = calculate_all_metrics(segmented, labels_true, class_labels=[0, 1, 2])

# ============================================================================
# VISUALIZATION
# ============================================================================

fig, axes = plt.subplots(4, 4, figsize=(20, 14))
fig.subplots_adjust(hspace=0.4, wspace=0.3)

# Row 1: Processing Pipeline
axes[0, 0].imshow(img_gray, cmap='gray')
axes[0, 0].set_title("1. Grayscale", fontsize=10, fontweight='bold')
axes[0, 0].axis('off')

axes[0, 1].imshow(img_denoised, cmap='gray')
axes[0, 1].set_title("2. Denoised", fontsize=10, fontweight='bold')
axes[0, 1].axis('off')

axes[0, 2].imshow(img_enhanced, cmap='gray')
axes[0, 2].set_title("3. Edge Enhanced", fontsize=10, fontweight='bold')
axes[0, 2].axis('off')

axes[0, 3].imshow(img_blurred, cmap='gray')
axes[0, 3].set_title("4. Blurred", fontsize=10, fontweight='bold')
axes[0, 3].axis('off')

# Row 2: Nucleus Processing
axes[1, 0].imshow(nucleus_mask, cmap='viridis', vmin=0, vmax=1)
axes[1, 0].set_title("Nucleus Mask\n(after thresholding)", fontsize=10)
axes[1, 0].axis('off')

axes[1, 1].imshow(nucleus_opened, cmap='viridis', vmin=0, vmax=1)
axes[1, 1].set_title(f"Nucleus Opened\n(disk={morph_disk_size})", fontsize=10)
axes[1, 1].axis('off')

axes[1, 2].imshow(nucleus_closed, cmap='viridis', vmin=0, vmax=1)
axes[1, 2].set_title(f"Nucleus Closed\n(disk={closing_disk_size})", fontsize=10)
axes[1, 2].axis('off')

artifacts = nucleus_mask - nucleus_opened
axes[1, 3].imshow(artifacts, cmap='gray')
axes[1, 3].set_title("Removed Artifacts\n(Nucleus)", fontsize=10)
axes[1, 3].axis('off')

# Row 3: Cytoplasm Processing
axes[2, 0].imshow(cytoplasm_mask, cmap='viridis', vmin=0, vmax=1)
axes[2, 0].set_title("Cytoplasm Mask\n(after thresholding)", fontsize=10)
axes[2, 0].axis('off')

axes[2, 1].imshow(cytoplasm_opened, cmap='viridis', vmin=0, vmax=1)
axes[2, 1].set_title(f"Cytoplasm Opened\n(disk={morph_disk_size})", fontsize=10)
axes[2, 1].axis('off')

axes[2, 2].imshow(cytoplasm_closed, cmap='viridis', vmin=0, vmax=1)
axes[2, 2].set_title(f"Cytoplasm Closed\n(disk={closing_disk_size})", fontsize=10)
axes[2, 2].axis('off')

artifacts_cyto = cytoplasm_mask - cytoplasm_opened
axes[2, 3].imshow(artifacts_cyto, cmap='gray')
axes[2, 3].set_title("Removed Artifacts\n(Cytoplasm)", fontsize=10)
axes[2, 3].axis('off')

# Row 4: Final Segmentation and Ground Truth
axes[3, 0].imshow(labels_true, cmap='viridis', vmin=0, vmax=2)
axes[3, 0].set_title("Ground Truth", fontsize=10, fontweight='bold')
axes[3, 0].axis('off')

axes[3, 1].imshow(segmented, cmap='viridis', vmin=0, vmax=2)
axes[3, 1].set_title("Predicted Segmentation\n(Stage 4)", fontsize=10, fontweight='bold')
axes[3, 1].axis('off')

axes[3, 2].hist(img_blurred.flatten(), bins=50, alpha=0.7, color='green', edgecolor='black')
axes[3, 2].axvline(nucleus_max, color='red', linestyle='--', linewidth=2, 
             label=f'Nucleus threshold ({nucleus_max})')
axes[3, 2].axvline(cytoplasm_min, color='orange', linestyle='--', linewidth=2, 
             label=f'Cytoplasm min ({cytoplasm_min})')
axes[3, 2].axvline(cytoplasm_max, color='purple', linestyle='--', linewidth=2, 
             label=f'Cytoplasm max ({cytoplasm_max})')
axes[3, 2].set_xlabel('Pixel Intensity', fontsize=10)
axes[3, 2].set_ylabel('Frequency', fontsize=10)
axes[3, 2].set_title("Intensity Distribution with Thresholds", fontsize=10)
axes[3, 2].legend(fontsize=8)
axes[3, 2].grid(alpha=0.3)

axes[3, 3].axis('off')

plt.suptitle('Stage 4: Morphological Operations (Opening and Closing)', 
             fontsize=14, fontweight='bold', y=0.998)
plt.show()

# Print metrics and comparison
print("\n" + "="*70)
print("STAGE 4: MORPHOLOGICAL OPERATIONS")
print("="*70)
print_metrics(metrics_stage4)

print("\n" + "="*70)
print("COMPLETE PIPELINE COMPARISON: All Stages")
print("="*70)
print(f"{'Metric':<25} {'Stage 1':<15} {'Stage 2':<15} {'Stage 3':<15} {'Stage 4':<15}")
print("-"*110)
for key in sorted(metrics_stage1.keys()):
    if 'Mean' in key or 'Nucleus' in key or 'Cytoplasm' in key:
        s1 = metrics_stage1[key]
        s2 = metrics_stage2[key]
        s3 = metrics_stage3[key]
        s4 = metrics_stage4[key]
        print(f"{key:<25} {s1:<15.4f} {s2:<15.4f} {s3:<15.4f} {s4:<15.4f}")
/tmp/ipykernel_15219/623457497.py:39: FutureWarning: `binary_opening` is deprecated since version 0.26 and will be removed in version 0.28. Use `skimage.morphology.opening` instead.
  nucleus_opened = binary_opening(nucleus_mask, disk(morph_disk_size)).astype(int)
/tmp/ipykernel_15219/623457497.py:40: FutureWarning: `binary_opening` is deprecated since version 0.26 and will be removed in version 0.28. Use `skimage.morphology.opening` instead.
  cytoplasm_opened = binary_opening(cytoplasm_mask, disk(morph_disk_size)).astype(int)
/tmp/ipykernel_15219/623457497.py:43: FutureWarning: `binary_closing` is deprecated since version 0.26 and will be removed in version 0.28. Use `skimage.morphology.closing` instead.
  nucleus_closed = binary_closing(nucleus_opened, disk(closing_disk_size)).astype(int)
/tmp/ipykernel_15219/623457497.py:44: FutureWarning: `binary_closing` is deprecated since version 0.26 and will be removed in version 0.28. Use `skimage.morphology.closing` instead.
  cytoplasm_closed = binary_closing(cytoplasm_opened, disk(closing_disk_size)).astype(int)


======================================================================
STAGE 4: MORPHOLOGICAL OPERATIONS
======================================================================
======================================================================
Class           Dice            IoU            
======================================================================
Background      0.8349          0.7165         
Cytoplasm       0.8792          0.7845         
Nucleus         0.7932          0.6573         
----------------------------------------------------------------------
Mean            0.8358          0.7195         
======================================================================

======================================================================
COMPLETE PIPELINE COMPARISON: All Stages
======================================================================
Metric                    Stage 1         Stage 2         Stage 3         Stage 4        
--------------------------------------------------------------------------------------------------------------
Dice_Cytoplasm            0.8264          0.8795          0.8793          0.8792         
Dice_Mean                 0.7704          0.8357          0.8358          0.8358         
Dice_Nucleus              0.6586          0.7921          0.7927          0.7932         
IoU_Cytoplasm             0.7042          0.7850          0.7845          0.7845         
IoU_Mean                  0.6330          0.7194          0.7195          0.7195         
IoU_Nucleus               0.4910          0.6557          0.6565          0.6573         

4.6.3 Key Observations

At Stage 4, we expect: - Nucleus and cytoplasm masks to be cleaner (fewer artifacts) - Nuclei to be more cohesive (holes filled) - Final Dice and IoU metrics to reach their peak (for this image) - Opening removes visible noise (shown in artifact difference panels)


4.7 Understanding the Complete Pipeline

4.7.1 Pipeline Summary

The complete 4-stage pipeline progressively improves segmentation:

Stage Operations Purpose
Stage 1 Grayscale + Thresholding Baseline; establish what thresholding alone can achieve
Stage 2 + Non-Local Means Remove noise while preserving edges
Stage 3 + Edge Enhancement + Blur Sharpen boundaries, smooth noise
Stage 4 + Opening + Closing Remove artifacts, fill holes

Each stage builds on the previous, and metrics should generally improve. However: - Some images may benefit more from certain stages than others - Over-processing (e.g., too much denoising) can remove fine details - The optimal parameters vary with image quality and cellular structure

4.7.2 Parameter Tuning Guidelines

nucleus_max (intensity threshold for nucleus) - Adjust based on intensity histogram; nucleus typically has values < 0.3 - If too high: cytoplasm misclassified as nucleus - If too low: nucleus not detected

cytoplasm_min and cytoplasm_max (intensity range for cytoplasm) - Should encompass the “mid-gray” values in the histogram - If range too narrow: missing cytoplasm pixels - If range too broad: background or nucleus included

nl_means_strength (denoising parameter h) - Higher values (0.5+): more aggressive denoising, but removes fine structures - Lower values (0.1-0.2): minimal denoising, preserves detail - Default (0.3): good balance for most cell images

unsharp_radius and unsharp_amount (edge enhancement) - Radius typically 1.5–3.0 (size of detail to enhance) - Amount 0.5–2.0 (how much to enhance; 1.0 = 100%) - Higher values sharpen more but risk enhancing noise

blur_sigma (Gaussian blur smoothing) - 1.0–2.0: gentle smoothing, preserves edges - 3.0+: aggressive smoothing, may lose detail - Too high can undo edge enhancement

morph_disk_size (morphological kernel) - Larger disks (5–7) remove bigger artifacts but risk shrinking nucleus - Smaller disks (2–3) preserve fine structures but miss larger artifacts - Typically 3–5 for cell images

4.7.3 When to Stop?

The pipeline is most useful when: 1. Metrics improve with each stage on your specific images 2. Visual inspection shows meaningful improvements in segmentation 3. Parameters are stable (small changes don’t drastically change results)

If metrics plateau or degrade at a stage, that stage may not be beneficial for your particular image quality.


4.8 Practical Workflow

To use this pipeline effectively:

  1. Select an image that represents your dataset’s characteristics
  2. Run Stage 1 to establish a baseline
  3. Examine the metrics and intensity histogram
  4. Progressively add stages (2, 3, 4) and observe metric changes
  5. Adjust parameters based on visual inspection
  6. Once parameters are satisfactory, apply the pipeline to the entire dataset

In the next chapter, we’ll automate this process to apply the pipeline across many images, calculate aggregate statistics, and prepare data for training.


4.9 Exercises

Exercise 4.1: Run Stage 1 on five different images from the dataset. Do the Dice scores vary significantly? What does this tell you about image quality consistency?

Exercise 4.2: For the Stage 2 visualization, estimate from the intensity histogram where the nucleus, cytoplasm, and background peaks should be. Do your manually selected thresholds align with these peaks?

Exercise 4.3: Modify Stage 3 to try different unsharp_radius values (1.0, 2.0, 3.0). How does the Dice coefficient change? What’s the trade-off between sharpening and artifacts?

Exercise 4.4: In Stage 4, set morph_disk_size to 1, 3, 5, and 7. Create a table showing how nucleus and cytoplasm Dice scores change. What disk size is optimal?

Exercise 4.5: Design an experiment where you run the full 4-stage pipeline on an image with very different parameter values. Document which parameters have the largest effect on Dice and IoU metrics.

Sign in to save progress

0 / 0

📚 Gradebook

Loading…

Sign in to save progress