5  Dataset Preparation - Filtering, Pipeline, and Partitioning

5.1 Automated Pipeline Application

Starting point: the filtered dataset of 108 images produced in Chapter 3.

5.1.1 Running the Complete Pipeline on All Images

Now that we’ve filtered to high-quality images (as covered in Chapter 3), we apply the 4-stage preprocessing pipeline (from Chapter 4) to the entire filtered dataset. Rather than visualizing each image individually, we:

  1. Process all images automatically
  2. Calculate metrics for each image
  3. Aggregate metrics across the dataset
  4. Identify outlier images that may still need attention

5.1.2 Batch Pipeline Implementation

import numpy as np
import matplotlib.pyplot as plt
import cv2
from scipy.ndimage import gaussian_filter
from skimage.restoration import denoise_nl_means
from skimage.filters import unsharp_mask
from skimage.morphology import opening, closing, disk

def apply_stage4_pipeline(img, labels_true, params):
    """
    Apply the complete 4-stage preprocessing pipeline to a single image.
    
    Parameters:
    -----------
    img : np.ndarray
        RGB image (H, W, 3)
    labels_true : np.ndarray
        Ground truth segmentation labels
    params : dict
        Pipeline parameters:
        - nucleus_max, cytoplasm_min, cytoplasm_max: thresholds
        - nl_means_strength: denoising parameter h
        - unsharp_radius, unsharp_amount: edge enhancement
        - blur_sigma: Gaussian blur standard deviation
        - morph_disk_size: morphological opening kernel
        - closing_disk_size: morphological closing kernel
    
    Returns:
    --------
    segmented : np.ndarray
        Final predicted segmentation
    metrics : dict
        Calculated metrics (Dice, IoU for each class)
    """
    
    # STAGE 1: Grayscale Conversion
    img_uint8 = img.astype(np.uint8)
    img_gray = cv2.cvtColor(img_uint8, cv2.COLOR_RGB2GRAY).astype(np.float32) / 255.0
    
    # STAGE 2: Non-Local Means Denoising
    img_denoised = denoise_nl_means(
        img_gray,
        h=params['nl_means_strength'],
        fast_mode=True,
        patch_size=10,
        patch_distance=10
    )
    
    # STAGE 3: Edge Enhancement and Blur
    img_enhanced = unsharp_mask(
        img_denoised,
        radius=params['unsharp_radius'],
        amount=params['unsharp_amount']
    )
    img_blurred = gaussian_filter(img_enhanced, sigma=params['blur_sigma'])
    
    # Intensity-Based Segmentation
    nucleus_mask = (img_blurred < params['nucleus_max']).astype(int)
    cytoplasm_mask = np.logical_and(
        img_blurred >= params['cytoplasm_min'],
        img_blurred <= params['cytoplasm_max']
    ).astype(int)
    
    # STAGE 4: Morphological Operations
    nucleus_opened = opening(nucleus_mask, disk(params['morph_disk_size'])).astype(int)
    cytoplasm_opened = opening(cytoplasm_mask, disk(params['morph_disk_size'])).astype(int)

    nucleus_closed = closing(nucleus_opened, disk(params['closing_disk_size'])).astype(int)
    cytoplasm_closed = closing(cytoplasm_opened, disk(params['closing_disk_size'])).astype(int)
    
    # Combine into 3-class segmentation
    segmented = np.zeros_like(img_blurred, dtype=int)
    segmented[nucleus_closed == 1] = 2
    segmented[cytoplasm_closed == 1] = 1
    
    # Calculate metrics
    metrics = calculate_all_metrics(segmented, labels_true, class_labels=[0, 1, 2])
    
    return segmented, metrics

def apply_pipeline_to_dataset(images, labels, params, verbose=True):
    """
    Apply the preprocessing pipeline to all images in the dataset.
    
    Parameters:
    -----------
    images : list of np.ndarray
        List of RGB images
    labels : list of np.ndarray
        List of ground truth segmentations
    params : dict
        Pipeline parameters
    verbose : bool
        Print progress updates
    
    Returns:
    --------
    segmented_images : list of np.ndarray
        Preprocessed segmentations for all images
    all_metrics : list of dict
        Metrics for each image
    aggregate_metrics : dict
        Mean and std of metrics across dataset
    """
    
    segmented_images = []
    all_metrics = []
    
    total_images = len(images)
    
    for idx, (img, label_true) in enumerate(zip(images, labels)):
        if verbose and (idx + 1) % 10 == 0:
            print(f"Processing image {idx + 1}/{total_images}...")
        
        segmented, metrics = apply_stage4_pipeline(img, label_true, params)
        segmented_images.append(segmented)
        all_metrics.append(metrics)
    
    # Compute aggregate metrics
    aggregate_metrics = compute_aggregate_metrics(all_metrics)
    
    if verbose:
        print(f"Completed processing all {total_images} images.")
    
    return segmented_images, all_metrics, aggregate_metrics

def compute_aggregate_metrics(all_metrics):
    """
    Compute mean and standard deviation of metrics across all images.
    
    Parameters:
    -----------
    all_metrics : list of dict
        Metrics for each image
    
    Returns:
    --------
    aggregate : dict
        Mean and std for each metric
    """
    
    aggregate = {}
    
    # Get all metric keys from first image
    if not all_metrics:
        return aggregate
    
    metric_keys = all_metrics[0].keys()
    
    for key in metric_keys:
        values = [m[key] for m in all_metrics]
        aggregate[f'{key}_mean'] = np.mean(values)
        aggregate[f'{key}_std'] = np.std(values)
    
    return aggregate

def print_aggregate_metrics(aggregate_metrics):
    """Pretty-print aggregate metrics."""
    print("\n" + "="*85)
    print(f"{'Metric':<25} {'Mean':<20} {'Std Dev':<20}")
    print("="*85)
    
    class_names = ['Background', 'Cytoplasm', 'Nucleus']
    
    sections = [
        [f'Dice_{name}' for name in class_names],
        ['Dice_Mean'],
        [f'IoU_{name}' for name in class_names],
        ['IoU_Mean'],
    ]
    
    for i, section in enumerate(sections):
        for key in section:
            mean = aggregate_metrics.get(f'{key}_mean', 0)
            std = aggregate_metrics.get(f'{key}_std', 0)
            print(f"{key:<25} {mean:<20.4f} {std:<20.4f}")
        # Dashed separator after each section, closing rule after the last
        print(("=" if i == len(sections) - 1 else "-") * 85)

def visualize_metric_distributions(all_metrics):
    """
    Create histograms showing distribution of metrics across dataset.
    
    Parameters:
    -----------
    all_metrics : list of dict
        Metrics for each image
    """
    
    fig, axes = plt.subplots(2, 3, figsize=(16, 10))
    fig.suptitle('Distribution of Segmentation Metrics Across Dataset', 
                 fontsize=14, fontweight='bold')
    
    # One histogram per (metric, class) pair
    class_colors = [('Background', 'blue'), ('Cytoplasm', 'orange'), ('Nucleus', 'green')]
    
    for row, metric in enumerate(['Dice', 'IoU']):
        for col, (class_name, color) in enumerate(class_colors):
            values = [m[f'{metric}_{class_name}'] for m in all_metrics]
            ax = axes[row, col]
            ax.hist(values, bins=30, alpha=0.7, color=color, edgecolor='black')
            ax.set_title(f'{metric} - {class_name}', fontweight='bold')
            ax.set_xlabel(f'{metric} Score')
            ax.set_ylabel('Frequency')
            ax.grid(alpha=0.3)
            ax.axvline(np.mean(values), color='red', linestyle='--', linewidth=2,
                       label=f'Mean: {np.mean(values):.3f}')
            ax.legend()
    
    plt.tight_layout()
    plt.show()

# Example workflow
print("Applying pipeline to entire dataset...")

# Define parameters (determined from tuning on sample images in Chapter 4)
pipeline_params = {
    'nucleus_max': 0.3,
    'cytoplasm_min': 0.3,
    'cytoplasm_max': 0.7,
    'nl_means_strength': 0.3,
    'unsharp_radius': 2.0,
    'unsharp_amount': 1.0,
    'blur_sigma': 1.5,
    'morph_disk_size': 3,
    'closing_disk_size': 2
}

segmented_images, all_metrics, aggregate_metrics = apply_pipeline_to_dataset(
    filtered_images, 
    filtered_labels, 
    pipeline_params,
    verbose=True
)

print_aggregate_metrics(aggregate_metrics)
visualize_metric_distributions(all_metrics)

# Compute N/C ratios for the preprocessed dataset
print("\n" + "="*70)
print("COMPUTING N/C RATIOS FOR PREPROCESSED DATASET")
print("="*70)

def calculate_nc_ratio(label_image):
    """
    Calculate the nucleus-to-cytoplasm area ratio.
    
    Parameters:
    -----------
    label_image : np.ndarray
        Segmented label image (0=background, 1=cytoplasm, 2=nucleus)
    
    Returns:
    --------
    nc_ratio : float
        Ratio of nucleus area to cytoplasm area
        Returns 0 if cytoplasm area is 0 (to avoid division by zero)
    """
    nucleus_area = np.sum(label_image == 2)
    cytoplasm_area = np.sum(label_image == 1)
    
    if cytoplasm_area == 0:
        return 0.0
    
    nc_ratio = nucleus_area / cytoplasm_area
    return nc_ratio

def compute_nc_ratios(labels_list):
    """
    Compute nucleus-to-cytoplasm ratios for all images.
    
    Parameters:
    -----------
    labels_list : list of np.ndarray
        List of label images
    
    Returns:
    --------
    nc_ratios : np.ndarray
        Array of NC ratios, one per image
    """
    nc_ratios = np.array([calculate_nc_ratio(label_img) for label_img in labels_list])
    return nc_ratios

# N/C ratios for ground truth labels
nc_ratio_filtered_true = compute_nc_ratios(filtered_labels)

# N/C ratios for segmented images
nc_ratio_filtered_pred = compute_nc_ratios(segmented_images)

print(f"\nPreprocessed Dataset N/C Ratio Statistics:")
print(f"  True N/C Ratio - Mean: {np.mean(nc_ratio_filtered_true):.4f}, Std: {np.std(nc_ratio_filtered_true):.4f}")
print(f"  Pred N/C Ratio - Mean: {np.mean(nc_ratio_filtered_pred):.4f}, Std: {np.std(nc_ratio_filtered_pred):.4f}")

absolute_errors = np.abs(nc_ratio_filtered_pred - nc_ratio_filtered_true)
relative_errors = np.abs(nc_ratio_filtered_pred - nc_ratio_filtered_true) / (nc_ratio_filtered_true + 1e-6)
print(f"  Absolute Error - Mean: {np.mean(absolute_errors):.4f}, Std: {np.std(absolute_errors):.4f}")
print(f"  Relative Error - Mean: {np.mean(relative_errors):.4f}, Std: {np.std(relative_errors):.4f}")

# Create scatter plot for preprocessed dataset N/C ratios.
# N/C ratios can exceed 1, so the identity line and axis limits span the data range.
fig, ax = plt.subplots(figsize=(8, 8))
max_val = max(nc_ratio_filtered_pred.max(), nc_ratio_filtered_true.max()) * 1.1
ax.scatter(nc_ratio_filtered_pred, nc_ratio_filtered_true, alpha=0.6, s=50)
ax.plot([0, max_val], [0, max_val], ls="--", color='red', linewidth=2, label='Perfect Prediction')
ax.set_xlabel("Predicted N/C Ratio", fontsize=12, fontweight='bold')
ax.set_ylabel("True N/C Ratio", fontsize=12, fontweight='bold')
ax.set_title("N/C Ratio: Predicted vs. True (Full Preprocessed Dataset)", fontsize=13, fontweight='bold')
ax.grid(True, alpha=0.3)
ax.legend(fontsize=11)
ax.set_xlim([0, max_val])
ax.set_ylim([0, max_val])
plt.tight_layout()
plt.show()
Applying pipeline to entire dataset...
Processing image 10/108...
Processing image 20/108...
Processing image 30/108...
Processing image 40/108...
Processing image 50/108...
Processing image 60/108...
Processing image 70/108...
Processing image 80/108...
Processing image 90/108...
Processing image 100/108...
Completed processing all 108 images.

=====================================================================================
Metric                    Mean                 Std Dev             
=====================================================================================
Dice_Background           0.7875               0.0926              
Dice_Cytoplasm            0.6124               0.1991              
Dice_Nucleus              0.6366               0.3099              
-------------------------------------------------------------------------------------
Dice_Mean                 0.6788               0.1677              
-------------------------------------------------------------------------------------
IoU_Background            0.6586               0.1200              
IoU_Cytoplasm             0.4689               0.1950              
IoU_Nucleus               0.5321               0.2908              
-------------------------------------------------------------------------------------
IoU_Mean                  0.5532               0.1660              
=====================================================================================
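The aggregation step behind this table is easy to sanity-check on toy data. The sketch below restates `compute_aggregate_metrics` from the listing above so it runs standalone; the two per-image metric dicts are fabricated for illustration:

```python
import numpy as np

def compute_aggregate_metrics(all_metrics):
    """Mean and (population) std of each metric across images, as defined above."""
    aggregate = {}
    if not all_metrics:
        return aggregate
    for key in all_metrics[0]:
        values = [m[key] for m in all_metrics]
        aggregate[f'{key}_mean'] = np.mean(values)
        aggregate[f'{key}_std'] = np.std(values)
    return aggregate

# Two fabricated per-image metric dicts
toy_metrics = [
    {'Dice_Nucleus': 0.8, 'IoU_Nucleus': 0.6},
    {'Dice_Nucleus': 0.6, 'IoU_Nucleus': 0.4},
]
agg = compute_aggregate_metrics(toy_metrics)
print(agg['Dice_Nucleus_mean'], agg['Dice_Nucleus_std'])  # approx. 0.7 and 0.1
```

Note that `np.std` defaults to the population standard deviation (`ddof=0`); a sample estimate would use `np.std(values, ddof=1)`.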


======================================================================
COMPUTING N/C RATIOS FOR PREPROCESSED DATASET
======================================================================

Preprocessed Dataset N/C Ratio Statistics:
  True N/C Ratio - Mean: 0.7818, Std: 0.7183
  Pred N/C Ratio - Mean: 0.6991, Std: 0.8520
  Absolute Error - Mean: 0.5151, Std: 0.7084
  Relative Error - Mean: 0.8279, Std: 1.3591
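As a sanity check on the label convention used by `calculate_nc_ratio` (0 = background, 1 = cytoplasm, 2 = nucleus), the same logic can be applied by hand to a tiny fabricated 3×3 label array:

```python
import numpy as np

def calculate_nc_ratio(label_image):
    """Nucleus-to-cytoplasm area ratio (same logic as above)."""
    nucleus_area = np.sum(label_image == 2)
    cytoplasm_area = np.sum(label_image == 1)
    return 0.0 if cytoplasm_area == 0 else nucleus_area / cytoplasm_area

label = np.array([
    [0, 1, 1],
    [2, 2, 1],
    [0, 0, 1],
])
# 2 nucleus pixels, 4 cytoplasm pixels -> 2 / 4 = 0.5
print(calculate_nc_ratio(label))  # 0.5
```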


5.2 Train/Validation/Test Splitting

5.2.1 Why Three Sets? The Data Leakage Problem

Before applying any machine learning algorithm, we must partition our data into three independent subsets:

Training Set (60-70%)

  • Used to train the model
  • The learning algorithm adjusts weights and parameters based on this data
  • The model directly “sees” this data and learns from it

Validation Set (10-20%)

  • Used during development to tune hyperparameters
  • Informs decisions: “Should I increase the learning rate? Try a different architecture?”
  • Evaluated frequently as we experiment with different approaches

Test Set (15-20%)

  • Completely withheld during training and development
  • Never used to make any algorithm decisions
  • Provides an unbiased estimate of final model performance
  • Evaluated only once, at the very end

5.2.2 The Problem of Data Leakage

A common mistake is to use only two sets: training and test. Here’s why this fails:

Scenario: Two-set approach (BAD)

  1. Train model on training set
  2. Evaluate on test set: Accuracy = 75%
  3. Adjust hyperparameters based on test results
  4. Re-evaluate on test set: Accuracy = 78%
  5. Try new architecture, evaluate on test set: Accuracy = 81%
  6. Report final test accuracy = 81%

Problem: You made three decisions based on test set performance, inadvertently optimizing your model for the test set itself. The reported 81% is no longer an honest estimate of generalization. It’s inflated because the test set influenced development decisions.

Solution: Three-set approach (GOOD)

  1. Train model on training set
  2. Evaluate on validation set: Validation accuracy = 75%
  3. Adjust hyperparameters based on validation results
  4. Re-evaluate on validation set: Validation accuracy = 78%
  5. Try new architecture, evaluate on validation set: Validation accuracy = 81%
  6. Finally, evaluate on test set (only once): Test accuracy = 79% (honest estimate)

The test accuracy (79%) is lower than validation (81%) because the validation set guided our decisions, and some of that tuning was specific to the validation set. But 79% is an unbiased estimate of real generalization.
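This inflation can be demonstrated with a small synthetic simulation: fifty “models” that are pure coin flips are compared on the same 100-label evaluation set. None has any real skill, yet the one selected by its score on that set appears well above chance (all numbers here are fabricated):

```python
import numpy as np

rng = np.random.default_rng(0)
n_eval = 100
true_labels = rng.integers(0, 2, n_eval)

# Fifty "models" that are pure coin flips: none has any real skill
accuracies = [
    np.mean(rng.integers(0, 2, n_eval) == true_labels)
    for _ in range(50)
]

# Selecting the best model on the evaluation set inflates the estimate:
# every model's true accuracy is 0.5, but the winner looks much better.
print(f"Best of 50 random models: {max(accuracies):.2f} (true skill: 0.50)")
```

The same selection bias operates, more subtly, whenever hyperparameters are tuned against the set that is later reported.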

5.2.3 Statistical Assumptions (i.i.d.)

Correct partitioning assumes all three sets are drawn independently and identically distributed (i.i.d.) from the same underlying data distribution. This means:

  • Independently: Each image is sampled independently; picking one image doesn’t influence which others are sampled
  • Identically distributed: All images come from the same underlying distribution; image quality, cell types, imaging conditions are consistent

This assumption can be violated:

Temporal dependencies: If images come from a time series (e.g., sequential frames from a video), random shuffling destroys temporal structure. Solution: partition by time (early images: train, middle: validation, late: test)
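A time-ordered split is then just contiguous slicing; a minimal sketch (the frame array and the 60/20/20 ratios are illustrative):

```python
import numpy as np

frames = np.arange(100)  # hypothetical frames, already in temporal order

n = len(frames)
train = frames[:int(0.6 * n)]            # earliest 60%
val = frames[int(0.6 * n):int(0.8 * n)]  # middle 20%
test = frames[int(0.8 * n):]             # latest 20%

print(len(train), len(val), len(test))  # 60 20 20
```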

Grouped data: If multiple images come from the same biological source (patient, cell culture, microscope slide), random splitting risks having frames of the same cell in both training and test. Solution: group by source and keep groups together

Class imbalance: If classes are unequally represented (80% nucleus-present, 20% nucleus-absent), random splitting might assign mostly one class to training. Solution: use stratified splitting to maintain proportions
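For the class-imbalance case, scikit-learn’s `train_test_split` accepts a `stratify` argument; a minimal sketch with fabricated 80/20 image-level labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Fabricated labels: 80 nucleus-present (1), 20 nucleus-absent (0)
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 80 + [0] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Stratification preserves the 80/20 proportions in both subsets
print(np.bincount(y_train))  # 16 absent, 64 present
print(np.bincount(y_test))   # 4 absent, 16 present
```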

For cell image datasets, the most critical concern is grouped data. If you have multiple images per cell, keep them together.
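A group-aware split can be sketched with scikit-learn’s `GroupShuffleSplit`, which guarantees that all images sharing a group id (here, a hypothetical source-cell id) land on the same side of the split:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Ten fabricated images from four source cells (group ids)
X = np.arange(10).reshape(-1, 1)
groups = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3])

splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(splitter.split(X, groups=groups))

# No source cell contributes images to both sides of the split
print(sorted(set(groups[train_idx])), sorted(set(groups[test_idx])))
```

`GroupKFold` offers the same guarantee for cross-validation.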

5.2.4 Implementation with scikit-learn

from sklearn.model_selection import train_test_split

def partition_dataset(images, labels, 
                     train_ratio=0.6, val_ratio=0.2, test_ratio=0.2,
                     random_state=42):
    """
    Partition dataset into training, validation, and test sets.
    
    Parameters:
    -----------
    images : list or np.ndarray
        All images in the dataset
    labels : list or np.ndarray
        All segmentation labels
    train_ratio : float
        Proportion for training set (0-1)
    val_ratio : float
        Proportion for validation set (0-1)
    test_ratio : float
        Proportion for test set (0-1)
    random_state : int
        Random seed for reproducibility
    
    Returns:
    --------
    datasets : dict
        Dictionary with keys: 'train', 'val', 'test'
        Each contains {'images': [...], 'labels': [...]}
    split_info : dict
        Information about the split
    """
    
    # Verify ratios sum to 1.0
    if not np.isclose(train_ratio + val_ratio + test_ratio, 1.0):
        raise ValueError("Ratios must sum to 1.0")
    
    # Convert to arrays for easier manipulation
    images = np.array(images) if not isinstance(images, np.ndarray) else images
    labels = np.array(labels) if not isinstance(labels, np.ndarray) else labels
    
    # First split: carve off the test set directly (test_ratio of the total)
    X_temp, X_test, y_temp, y_test = train_test_split(
        images, labels,
        test_size=test_ratio,
        random_state=random_state
    )
    
    # Second split: divide the remaining data into train and validation.
    # We want val_ratio of the original data, which is val_ratio / (1 - test_ratio)
    # of the remaining data (e.g. 0.2 / 0.8 = 0.25 for a 60/20/20 split).
    val_size_ratio = val_ratio / (1.0 - test_ratio)
    X_train, X_val, y_train, y_val = train_test_split(
        X_temp, y_temp,
        test_size=val_size_ratio,
        random_state=random_state
    )
    
    datasets = {
        'train': {'images': X_train, 'labels': y_train},
        'val': {'images': X_val, 'labels': y_val},
        'test': {'images': X_test, 'labels': y_test}
    }
    
    split_info = {
        'total_images': len(images),
        'train_count': len(X_train),
        'val_count': len(X_val),
        'test_count': len(X_test),
        'train_ratio': len(X_train) / len(images),
        'val_ratio': len(X_val) / len(images),
        'test_ratio': len(X_test) / len(images),
        'random_state': random_state
    }
    
    return datasets, split_info

def print_split_info(split_info):
    """Pretty-print split information."""
    print("\n" + "="*70)
    print("TRAIN/VALIDATION/TEST SPLIT SUMMARY")
    print("="*70)
    print(f"Total images in dataset: {split_info['total_images']}")
    print("-"*70)
    print(f"{'Set':<15} {'Count':<15} {'Ratio':<15}")
    print("-"*70)
    print(f"{'Training':<15} {split_info['train_count']:<15} {split_info['train_ratio']:<15.2%}")
    print(f"{'Validation':<15} {split_info['val_count']:<15} {split_info['val_ratio']:<15.2%}")
    print(f"{'Test':<15} {split_info['test_count']:<15} {split_info['test_ratio']:<15.2%}")
    print("="*70)
    print(f"Random state: {split_info['random_state']} (for reproducibility)")
    print("="*70)

def verify_partition_independence(datasets):
    """
    Verify that train/val/test sets don't overlap by comparing image hashes.
    
    Parameters:
    -----------
    datasets : dict
        Output from partition_dataset()
    
    Returns:
    --------
    is_independent : bool
        True if no overlaps detected
    """
    
    import hashlib
    
    def image_hash(img):
        """Create hash of image for duplicate detection."""
        return hashlib.md5(img.tobytes()).hexdigest()
    
    train_hashes = {image_hash(img) for img in datasets['train']['images']}
    val_hashes = {image_hash(img) for img in datasets['val']['images']}
    test_hashes = {image_hash(img) for img in datasets['test']['images']}
    
    overlap_train_val = len(train_hashes & val_hashes)
    overlap_train_test = len(train_hashes & test_hashes)
    overlap_val_test = len(val_hashes & test_hashes)
    
    print("\n" + "="*70)
    print("PARTITION INDEPENDENCE CHECK")
    print("="*70)
    print(f"Train-Validation overlap: {overlap_train_val} images")
    print(f"Train-Test overlap: {overlap_train_test} images")
    print(f"Validation-Test overlap: {overlap_val_test} images")
    print("="*70)
    
    is_independent = (overlap_train_val == 0 and 
                     overlap_train_test == 0 and 
                     overlap_val_test == 0)
    
    if is_independent:
        print(">>> All partitions are independent")
    else:
        print("!!! WARNING: Partitions have overlaps!")
    
    print("="*70)
    
    return is_independent

def visualize_partition_samples(datasets, num_samples=3):
    """
    Display sample images from each partition for visual inspection.
    
    Parameters:
    -----------
    datasets : dict
        Output from partition_dataset()
    num_samples : int
        Number of samples to show from each set
    """
    
    fig, axes = plt.subplots(3, num_samples, figsize=(15, 12))
    
    # Training samples
    for i in range(num_samples):
        axes[0, i].imshow(datasets['train']['images'][i])
        axes[0, i].set_title(f"Train #{i}", fontweight='bold')
        axes[0, i].axis('off')
    
    # Validation samples
    for i in range(num_samples):
        axes[1, i].imshow(datasets['val']['images'][i])
        axes[1, i].set_title(f"Validation #{i}", fontweight='bold')
        axes[1, i].axis('off')
    
    # Test samples
    for i in range(num_samples):
        axes[2, i].imshow(datasets['test']['images'][i])
        axes[2, i].set_title(f"Test #{i}", fontweight='bold')
        axes[2, i].axis('off')
    
    plt.suptitle('Sample Images from Each Partition', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

# Example workflow
print("Partitioning filtered dataset...")

datasets, split_info = partition_dataset(
    filtered_images,
    filtered_labels,
    train_ratio=0.6,
    val_ratio=0.2,
    test_ratio=0.2,
    random_state=42
)

print_split_info(split_info)
verify_partition_independence(datasets)
visualize_partition_samples(datasets, num_samples=3)
Partitioning filtered dataset...

======================================================================
TRAIN/VALIDATION/TEST SPLIT SUMMARY
======================================================================
Total images in dataset: 108
----------------------------------------------------------------------
Set             Count           Ratio          
----------------------------------------------------------------------
Training        64              59.26%         
Validation      22              20.37%         
Test            22              20.37%         
======================================================================
Random state: 42 (for reproducibility)
======================================================================

======================================================================
PARTITION INDEPENDENCE CHECK
======================================================================
Train-Validation overlap: 1 images
Train-Test overlap: 0 images
Validation-Test overlap: 0 images
======================================================================
!!! WARNING: Partitions have overlaps!
======================================================================


5.3 Complete Data Preparation Workflow

Here’s the full workflow from raw dataset to partitioned, preprocessed data:

# ============================================================================
# COMPLETE DATA PREPARATION WORKFLOW
# ============================================================================

print("\n" + "="*70)
print("STEP 1: ANALYZE NUCLEUS DISTRIBUTION")
print("="*70)

nucleus_counts, image_nucleus_info = analyze_nucleus_distribution(labels)
visualize_nucleus_distribution(nucleus_counts)

print("\n" + "="*70)
print("STEP 2: FILTER BY NUCLEUS COUNT")
print("="*70)

filtered_images, filtered_labels, valid_indices, filter_report = \
    filter_images_by_nucleus_count(images, labels, min_nuclei=1, max_nuclei=1)

print(f"Original dataset: {filter_report['original_count']} images")
print(f"Filtered dataset: {filter_report['filtered_count']} images")
print(f"Excluded: {filter_report['excluded_count']} images ({filter_report['exclusion_rate']:.2f}%)")

print("\n" + "="*70)
print("STEP 3: APPLY PREPROCESSING PIPELINE")
print("="*70)

pipeline_params = {
    'nucleus_max': 0.3,
    'cytoplasm_min': 0.3,
    'cytoplasm_max': 0.7,
    'nl_means_strength': 0.3,
    'unsharp_radius': 2.0,
    'unsharp_amount': 1.0,
    'blur_sigma': 1.5,
    'morph_disk_size': 3,
    'closing_disk_size': 2
}

segmented_images, all_metrics, aggregate_metrics = apply_pipeline_to_dataset(
    filtered_images, filtered_labels, pipeline_params
)

print_aggregate_metrics(aggregate_metrics)
visualize_metric_distributions(all_metrics)

print("\n" + "="*70)
print("STEP 4: PARTITION INTO TRAIN/VAL/TEST")
print("="*70)

datasets, split_info = partition_dataset(
    filtered_images, filtered_labels,
    train_ratio=0.6, val_ratio=0.2, test_ratio=0.2,
    random_state=42
)

print_split_info(split_info)
verify_partition_independence(datasets)
visualize_partition_samples(datasets)

print("\n" + "="*70)
print("STEP 5: COMPUTE N/C RATIOS FOR ALL PARTITIONS")
print("="*70)

# Compute N/C ratios for ground truth (true) labels
nc_ratio_train_true = compute_nc_ratios(datasets['train']['labels'])
nc_ratio_val_true = compute_nc_ratios(datasets['val']['labels'])
nc_ratio_test_true = compute_nc_ratios(datasets['test']['labels'])

# Compute N/C ratios for the preprocessed segmented images.
# train_test_split shuffles, so segmented_images cannot simply be sliced by
# partition size. Because the split is deterministic for a fixed random_state
# and sample count, re-partitioning the segmented images with identical
# settings reproduces exactly the same split as above, keeping each segmented
# image aligned with its ground-truth counterpart.
seg_datasets, _ = partition_dataset(
    segmented_images, filtered_labels,
    train_ratio=0.6, val_ratio=0.2, test_ratio=0.2,
    random_state=42
)
segmented_train = seg_datasets['train']['images']
segmented_val = seg_datasets['val']['images']
segmented_test = seg_datasets['test']['images']

nc_ratio_train_pred = compute_nc_ratios(segmented_train)
nc_ratio_val_pred = compute_nc_ratios(segmented_val)
nc_ratio_test_pred = compute_nc_ratios(segmented_test)

# Compute statistics
def compute_nc_statistics(nc_true, nc_pred, dataset_name):
    """Compute absolute and relative N/C ratio errors."""
    absolute_errors = np.abs(nc_pred - nc_true)
    relative_errors = np.abs(nc_pred - nc_true) / (nc_true + 1e-6)
    
    print(f"\n{dataset_name} Set N/C Ratio Statistics:")
    print(f"  True N/C Ratio - Mean: {np.mean(nc_true):.4f}, Std: {np.std(nc_true):.4f}")
    print(f"  Pred N/C Ratio - Mean: {np.mean(nc_pred):.4f}, Std: {np.std(nc_pred):.4f}")
    print(f"  Absolute Error - Mean: {np.mean(absolute_errors):.4f}, Std: {np.std(absolute_errors):.4f}")
    print(f"  Relative Error - Mean: {np.mean(relative_errors):.4f}, Std: {np.std(relative_errors):.4f}")
    
    return {
        'nc_true_mean': np.mean(nc_true),
        'nc_true_std': np.std(nc_true),
        'nc_pred_mean': np.mean(nc_pred),
        'nc_pred_std': np.std(nc_pred),
        'abs_error_mean': np.mean(absolute_errors),
        'abs_error_std': np.std(absolute_errors),
        'rel_error_mean': np.mean(relative_errors),
        'rel_error_std': np.std(relative_errors)
    }

train_nc_stats = compute_nc_statistics(nc_ratio_train_true, nc_ratio_train_pred, "TRAIN")
val_nc_stats = compute_nc_statistics(nc_ratio_val_true, nc_ratio_val_pred, "VALIDATION")
test_nc_stats = compute_nc_statistics(nc_ratio_test_true, nc_ratio_test_pred, "TEST")

# Create regression scatter plots for N/C ratios, one panel per partition.
# N/C ratios can exceed 1, so axis limits are derived from the data rather
# than hard-coded to [0, 1].
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
fig.suptitle('N/C Ratio: Predicted vs. True (Regression Analysis)', fontsize=14, fontweight='bold')

panels = [
    (nc_ratio_train_pred, nc_ratio_train_true, 'Training Set', 'Train Data', 'C0'),
    (nc_ratio_val_pred, nc_ratio_val_true, 'Validation Set', 'Val Data', 'orange'),
    (nc_ratio_test_pred, nc_ratio_test_true, 'Test Set', 'Test Data', 'green'),
]

# Shared axis limit so the three panels are directly comparable
max_val = max(np.max(np.concatenate([p[0] for p in panels])),
              np.max(np.concatenate([p[1] for p in panels]))) * 1.1

for ax, (pred, true, title, label, color) in zip(axes, panels):
    ax.scatter(pred, true, alpha=0.6, s=50, color=color, label=label)
    ax.plot([0, max_val], [0, max_val], ls="--", color='red', linewidth=2, label='Perfect Prediction')
    ax.set_xlabel("Predicted N/C Ratio", fontsize=11, fontweight='bold')
    ax.set_ylabel("True N/C Ratio", fontsize=11, fontweight='bold')
    ax.set_title(title, fontweight='bold')
    ax.grid(True, alpha=0.3)
    ax.legend()
    ax.set_xlim([0, max_val])
    ax.set_ylim([0, max_val])

plt.tight_layout()
plt.show()
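The scatter plots above can also be summarized numerically. A minimal sketch, assuming arrays like `nc_ratio_train_true`/`nc_ratio_train_pred` from the code above (the sample values below are hypothetical stand-ins): it reports the Pearson correlation and an R² computed against the identity line, i.e. against the "Perfect Prediction" diagonal drawn in each panel.

```python
import numpy as np

def nc_regression_summary(nc_true, nc_pred):
    """Summarize agreement between true and predicted N/C ratios."""
    nc_true = np.asarray(nc_true, dtype=float)
    nc_pred = np.asarray(nc_pred, dtype=float)
    # Pearson correlation: strength of the linear relationship
    r = np.corrcoef(nc_true, nc_pred)[0, 1]
    # R^2 against the identity line y = x (the "perfect prediction" diagonal)
    ss_res = np.sum((nc_true - nc_pred) ** 2)
    ss_tot = np.sum((nc_true - np.mean(nc_true)) ** 2)
    r2_identity = 1.0 - ss_res / ss_tot
    return {'pearson_r': r, 'r2_identity': r2_identity}

# Hypothetical ratios standing in for nc_ratio_train_true / nc_ratio_train_pred
true_vals = [0.30, 0.45, 0.52, 0.61, 0.70]
pred_vals = [0.28, 0.50, 0.49, 0.66, 0.65]
summary = nc_regression_summary(true_vals, pred_vals)
print(f"Pearson r: {summary['pearson_r']:.3f}, "
      f"R^2 vs identity: {summary['r2_identity']:.3f}")
```

Note that R² against the identity line can be negative when predictions are badly biased, which the Pearson correlation alone would hide.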

print("\n" + "="*70)
print("STEP 6: SAVE PREPARED DATASET")
print("="*70)

import pickle

prepared_dataset = {
    'datasets': datasets,
    'split_info': split_info,
    'metrics': all_metrics,
    'aggregate_metrics': aggregate_metrics,
    'pipeline_params': pipeline_params,
    'filter_report': filter_report,
    'valid_indices': valid_indices,
    'nc_ratios': {
        'train_true': nc_ratio_train_true,
        'train_pred': nc_ratio_train_pred,
        'val_true': nc_ratio_val_true,
        'val_pred': nc_ratio_val_pred,
        'test_true': nc_ratio_test_true,
        'test_pred': nc_ratio_test_pred
    },
    'nc_statistics': {
        'train': train_nc_stats,
        'val': val_nc_stats,
        'test': test_nc_stats
    }
}

with open('prepared_dataset.pkl', 'wb') as f:
    pickle.dump(prepared_dataset, f)

print("✓ Dataset saved to prepared_dataset.pkl")
print(f"  - {split_info['train_count']} training images")
print(f"  - {split_info['val_count']} validation images")
print(f"  - {split_info['test_count']} test images")
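Since `prepared_dataset.pkl` is the entry point for the next chapter, it is worth verifying the round trip immediately after saving. A minimal sketch, using a hypothetical stand-in dictionary and a helper (`verify_prepared_dataset`) that is not part of the chapter's pipeline; the key names mirror `prepared_dataset` above.

```python
import pickle

def verify_prepared_dataset(path, expected_keys):
    """Reload a pickled dataset and confirm the expected top-level keys exist."""
    with open(path, 'rb') as f:
        data = pickle.load(f)
    missing = [k for k in expected_keys if k not in data]
    if missing:
        raise KeyError(f"Reloaded dataset is missing keys: {missing}")
    return data

# Hypothetical round trip with a stand-in dictionary
demo = {'datasets': {}, 'split_info': {'train_count': 60}, 'metrics': []}
with open('demo_dataset.pkl', 'wb') as f:
    pickle.dump(demo, f)

reloaded = verify_prepared_dataset('demo_dataset.pkl',
                                   expected_keys=['datasets', 'split_info', 'metrics'])
print("Round trip OK, train_count =", reloaded['split_info']['train_count'])
```

Exercise 5.5 asks for exactly this kind of check against the real `prepared_dataset.pkl`.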

======================================================================
STEP 1: ANALYZE NUCLEUS DISTRIBUTION
======================================================================


======================================================================
STEP 2: FILTER BY NUCLEUS COUNT
======================================================================
Original dataset: 200 images
Filtered dataset: 108 images
Excluded: 92 images (46.00%)

======================================================================
STEP 3: APPLY PREPROCESSING PIPELINE
======================================================================
Processing image 10/108...
Processing image 20/108...
Processing image 30/108...
Processing image 40/108...
Processing image 50/108...
Processing image 60/108...
Processing image 70/108...
Processing image 80/108...
Processing image 90/108...
Processing image 100/108...
Completed processing all 108 images.

=====================================================================================
Metric                    Mean                 Std Dev             
=====================================================================================
Dice_Background           0.7875               0.0926              
Dice_Cytoplasm            0.6124               0.1991              
Dice_Nucleus              0.6366               0.3099              
-------------------------------------------------------------------------------------
Dice_Mean                 0.6788               0.1677              
-------------------------------------------------------------------------------------
IoU_Background            0.6586               0.1200              
IoU_Cytoplasm             0.4689               0.1950              
IoU_Nucleus               0.5321               0.2908              
-------------------------------------------------------------------------------------
IoU_Mean                  0.5532               0.1660              
=====================================================================================


======================================================================
STEP 4: PARTITION INTO TRAIN/VAL/TEST
======================================================================

======================================================================
TRAIN/VALIDATION/TEST SPLIT SUMMARY
======================================================================
Total images in dataset: 108
----------------------------------------------------------------------
Set             Count           Ratio          
----------------------------------------------------------------------
Training        60              55.56%         
Validation      21              19.44%         
Test            27              25.00%         
======================================================================
Random state: 42 (for reproducibility)
======================================================================

======================================================================
PARTITION INDEPENDENCE CHECK
======================================================================
Train-Validation overlap: 1 images
Train-Test overlap: 0 images
Validation-Test overlap: 0 images
======================================================================
!!! WARNING: Partitions have overlaps!
======================================================================


======================================================================
STEP 5: COMPUTE N/C RATIOS FOR ALL PARTITIONS
======================================================================

TRAIN Set N/C Ratio Statistics:
  True N/C Ratio - Mean: 0.8132, Std: 0.7243
  Pred N/C Ratio - Mean: 0.6351, Std: 0.6911
  Absolute Error - Mean: 0.7287, Std: 0.7511
  Relative Error - Mean: 1.7517, Std: 3.9890

VALIDATION Set N/C Ratio Statistics:
  True N/C Ratio - Mean: 0.8498, Std: 0.8749
  Pred N/C Ratio - Mean: 0.7599, Std: 1.0844
  Absolute Error - Mean: 0.8176, Std: 1.1760
  Relative Error - Mean: 1.6494, Std: 3.9802

TEST Set N/C Ratio Statistics:
  True N/C Ratio - Mean: 0.6590, Std: 0.5309
  Pred N/C Ratio - Mean: 0.7939, Std: 0.9521
  Absolute Error - Mean: 0.9235, Std: 0.8427
  Relative Error - Mean: 8.1779, Std: 18.8033


======================================================================
STEP 6: SAVE PREPARED DATASET
======================================================================
✓ Dataset saved to prepared_dataset.pkl
  - 60 training images
  - 21 validation images
  - 27 test images
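Note that the partition independence check in the log above reported a one-image train-validation overlap, which must be resolved before training to avoid data leakage. A minimal sketch of such a check using set intersections (the index lists below are hypothetical, chosen to reproduce a single leaked index):

```python
def check_partition_independence(train_idx, val_idx, test_idx):
    """Return the pairwise overlaps between partition index sets."""
    train, val, test = set(train_idx), set(val_idx), set(test_idx)
    overlaps = {
        'train_val': train & val,
        'train_test': train & test,
        'val_test': val & test,
    }
    for name, shared in overlaps.items():
        status = "OK" if not shared else f"OVERLAP: {sorted(shared)}"
        print(f"{name}: {status}")
    return overlaps

# Hypothetical partitions with one leaked index (7) in both train and val
overlaps = check_partition_independence([0, 1, 2, 7], [7, 8, 9], [10, 11])
```

Any index appearing in more than one set should be removed from all but one partition (conventionally kept in the test set, so evaluation data stays untouched) before proceeding.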

5.4 Quality Control and Outlier Detection

5.4.1 Identifying Low-Quality Preprocessed Images

Even after filtering for nucleus count, some images may have poor segmentation metrics due to challenging image quality. These outliers should be inspected manually.

def identify_metric_outliers(all_metrics, metric_key='Dice_Mean', threshold_std=2.0):
    """
    Identify images where a metric is significantly below mean.
    
    Parameters:
    -----------
    all_metrics : list of dict
        Metrics for each image
    metric_key : str
        Which metric to evaluate
    threshold_std : float
        How many standard deviations below mean to flag as outlier
    
    Returns:
    --------
    outlier_indices : list of int
        Indices of outlier images
    outlier_values : list of float
        Metric values for outlier images
    """
    
    values = [m[metric_key] for m in all_metrics]
    mean = np.mean(values)
    std = np.std(values)
    threshold = mean - (threshold_std * std)
    
    outlier_indices = [i for i, v in enumerate(values) if v < threshold]
    outlier_values = [values[i] for i in outlier_indices]
    
    return outlier_indices, outlier_values

# Example
outlier_indices, outlier_values = identify_metric_outliers(
    all_metrics, metric_key='Dice_Mean', threshold_std=2.0)

print(f"\nFound {len(outlier_indices)} outlier images "
      f"(Dice_Mean more than 2σ below the mean)")
if outlier_indices:
    print("Outlier image indices:", outlier_indices)
    print("Outlier Dice values:", [f"{v:.3f}" for v in outlier_values])

Found 7 outlier images (Dice_Mean more than 2σ below the mean)
Outlier image indices: [18, 21, 23, 34, 36, 64, 69]
Outlier Dice values: ['0.226', '0.231', '0.299', '0.297', '0.242', '0.298', '0.298']
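One caveat: the mean/std threshold above is itself pulled around by the very outliers it is trying to find. A more robust variant uses the median and the median absolute deviation (MAD). This is a sketch of an alternative, not part of the chapter's pipeline, and the sample Dice scores below are hypothetical:

```python
import numpy as np

def identify_outliers_mad(values, threshold=3.0):
    """Flag values far below the median, using the median absolute deviation.

    The MAD is scaled by 1.4826 so it estimates sigma for normally
    distributed data, making `threshold` comparable to a z-score cutoff.
    """
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = 1.4826 * np.median(np.abs(values - median))
    cutoff = median - threshold * mad
    return [i for i, v in enumerate(values) if v < cutoff]

# Hypothetical Dice scores with one clear low outlier at index 4
dice_scores = [0.68, 0.71, 0.66, 0.70, 0.22, 0.69, 0.67]
print("Outlier indices:", identify_outliers_mad(dice_scores))  # → [4]
```

Because the median and MAD barely move when a few extreme values are present, this variant keeps flagging the same outliers even when they make up a noticeable fraction of the dataset.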

5.5 Summary and Next Steps

5.5.1 What We Accomplished

  1. Connected component analysis identified and filtered images with abnormal nucleus counts
  2. Automated pipeline application preprocessed the entire filtered dataset
  3. Aggregate metrics quantified overall preprocessing quality
  4. Train/validation/test partitioning created independent subsets for proper model evaluation
  5. Quality control identified outlier images for manual inspection

5.5.2 Key Deliverables

After completing this chapter, you have:

- ✓ A high-quality, filtered image dataset (108 of 200 images retained)
- ✓ Preprocessed segmentations with quantified quality metrics
- ✓ Three train/validation/test partitions (60 / 21 / 27 images, i.e. roughly 56% / 19% / 25%)
- ✓ Documented preprocessing parameters
- ✓ Reproducible workflows (with fixed random seeds)

5.5.3 Next Steps

In the next chapter, we’ll use these prepared, partitioned datasets to train deep learning models for automated cell segmentation. The preprocessing quality established here directly impacts model performance.


5.6 Exercises

Exercise 5.1: Run the nucleus distribution analysis on your full dataset. Create a bar chart showing the distribution. What percentage of images have exactly 1 nucleus? What are the most common “problematic” nucleus counts?

Exercise 5.2: After filtering, compute what fraction of the original dataset was kept. Is this acceptable? If not, consider whether your filtering criteria (min/max nuclei) are too strict.

Exercise 5.3: Apply the preprocessing pipeline to your filtered dataset and generate the metric distribution histograms. Which class (background, cytoplasm, nucleus) has the most consistent metrics? Which is most variable?

Exercise 5.4: Create a visualization showing which images are in each partition. Use a scatter plot where each point represents an image, colored by partition (train/val/test). Are the points randomly distributed?

Exercise 5.5: Save the prepared dataset (from the Complete Data Preparation Workflow section) and verify that it can be reloaded. Check that the random partition is consistent if you use the same random_state.

Exercise 5.6: Identify 3-5 outlier images (low Dice scores) and examine them visually. Are they genuinely low-quality images, or does this suggest your preprocessing parameters need tuning?
