Dataset ready: 108 images after filtering.
5 Dataset Preparation - Filtering, Pipeline, and Partitioning
5.1 Automated Pipeline Application
5.1.1 Running the Complete Pipeline on All Images
Now that we’ve filtered to high-quality images (as covered in Chapter 3), we apply the 4-stage preprocessing pipeline (from Chapter 4) to the entire filtered dataset. Rather than visualizing each image individually, we:
- Process all images automatically
- Calculate metrics for each image
- Aggregate metrics across the dataset
- Identify outlier images that may still need attention
5.1.2 Batch Pipeline Implementation
from skimage.restoration import denoise_nl_means
from skimage.filters import unsharp_mask
from skimage.morphology import opening, closing, disk
from scipy.ndimage import gaussian_filter
import cv2
def apply_stage4_pipeline(img, labels_true, params):
"""
Apply the complete 4-stage preprocessing pipeline to a single image.
Parameters:
-----------
img : np.ndarray
RGB image (H, W, 3)
labels_true : np.ndarray
Ground truth segmentation labels
params : dict
Pipeline parameters:
- nucleus_max, cytoplasm_min, cytoplasm_max: thresholds
- nl_means_strength: denoising parameter h
- unsharp_radius, unsharp_amount: edge enhancement
- blur_sigma: Gaussian blur standard deviation
- morph_disk_size: morphological opening kernel
- closing_disk_size: morphological closing kernel
Returns:
--------
segmented : np.ndarray
Final predicted segmentation
metrics : dict
Calculated metrics (Dice, IoU for each class)
"""
# STAGE 1: Grayscale Conversion
img_uint8 = img.astype(np.uint8)
img_gray = cv2.cvtColor(img_uint8, cv2.COLOR_RGB2GRAY).astype(np.float32) / 255.0
# STAGE 2: Non-Local Means Denoising
img_denoised = denoise_nl_means(
img_gray,
h=params['nl_means_strength'],
fast_mode=True,
patch_size=10,
patch_distance=10
)
# STAGE 3: Edge Enhancement and Blur
img_enhanced = unsharp_mask(
img_denoised,
radius=params['unsharp_radius'],
amount=params['unsharp_amount']
)
img_blurred = gaussian_filter(img_enhanced, sigma=params['blur_sigma'])
# Intensity-Based Segmentation
nucleus_mask = (img_blurred < params['nucleus_max']).astype(int)
cytoplasm_mask = np.logical_and(
img_blurred >= params['cytoplasm_min'],
img_blurred <= params['cytoplasm_max']
).astype(int)
# STAGE 4: Morphological Operations
nucleus_opened = opening(nucleus_mask, disk(params['morph_disk_size'])).astype(int)
cytoplasm_opened = opening(cytoplasm_mask, disk(params['morph_disk_size'])).astype(int)
nucleus_closed = closing(nucleus_opened, disk(params['closing_disk_size'])).astype(int)
cytoplasm_closed = closing(cytoplasm_opened, disk(params['closing_disk_size'])).astype(int)
# Combine into 3-class segmentation
segmented = np.zeros_like(img_blurred, dtype=int)
segmented[nucleus_closed == 1] = 2
segmented[cytoplasm_closed == 1] = 1
# Calculate metrics
metrics = calculate_all_metrics(segmented, labels_true, class_labels=[0, 1, 2])
return segmented, metrics
def apply_pipeline_to_dataset(images, labels, params, verbose=True):
"""
Apply the preprocessing pipeline to all images in the dataset.
Parameters:
-----------
images : list of np.ndarray
List of RGB images
labels : list of np.ndarray
List of ground truth segmentations
params : dict
Pipeline parameters
verbose : bool
Print progress updates
Returns:
--------
segmented_images : list of np.ndarray
Preprocessed segmentations for all images
all_metrics : list of dict
Metrics for each image
aggregate_metrics : dict
Mean and std of metrics across dataset
"""
segmented_images = []
all_metrics = []
total_images = len(images)
for idx, (img, label_true) in enumerate(zip(images, labels)):
if verbose and (idx + 1) % 10 == 0:
print(f"Processing image {idx + 1}/{total_images}...")
segmented, metrics = apply_stage4_pipeline(img, label_true, params)
segmented_images.append(segmented)
all_metrics.append(metrics)
# Compute aggregate metrics
aggregate_metrics = compute_aggregate_metrics(all_metrics)
if verbose:
print(f"Completed processing all {total_images} images.")
return segmented_images, all_metrics, aggregate_metrics
def compute_aggregate_metrics(all_metrics):
"""
Compute mean and standard deviation of metrics across all images.
Parameters:
-----------
all_metrics : list of dict
Metrics for each image
Returns:
--------
aggregate : dict
Mean and std for each metric
"""
aggregate = {}
# Get all metric keys from first image
if not all_metrics:
return aggregate
metric_keys = all_metrics[0].keys()
for key in metric_keys:
values = [m[key] for m in all_metrics]
aggregate[f'{key}_mean'] = np.mean(values)
aggregate[f'{key}_std'] = np.std(values)
return aggregate
def print_aggregate_metrics(aggregate_metrics):
"""Pretty-print aggregate metrics."""
print("\n" + "="*85)
print(f"{'Metric':<25} {'Mean':<20} {'Std Dev':<20}")
print("="*85)
class_names = ['Background', 'Cytoplasm', 'Nucleus']
# Print per-class Dice
for class_name in class_names:
key = f'Dice_{class_name}'
mean = aggregate_metrics.get(f'{key}_mean', 0)
std = aggregate_metrics.get(f'{key}_std', 0)
print(f"{key:<25} {mean:<20.4f} {std:<20.4f}")
print("-"*85)
# Print Mean Dice
key = 'Dice_Mean'
mean = aggregate_metrics.get(f'{key}_mean', 0)
std = aggregate_metrics.get(f'{key}_std', 0)
print(f"{key:<25} {mean:<20.4f} {std:<20.4f}")
print("-"*85)
# Print per-class IoU
for class_name in class_names:
key = f'IoU_{class_name}'
mean = aggregate_metrics.get(f'{key}_mean', 0)
std = aggregate_metrics.get(f'{key}_std', 0)
print(f"{key:<25} {mean:<20.4f} {std:<20.4f}")
print("-"*85)
# Print Mean IoU
key = 'IoU_Mean'
mean = aggregate_metrics.get(f'{key}_mean', 0)
std = aggregate_metrics.get(f'{key}_std', 0)
print(f"{key:<25} {mean:<20.4f} {std:<20.4f}")
print("="*85)
def visualize_metric_distributions(all_metrics):
"""
Create histograms showing distribution of metrics across dataset.
Parameters:
-----------
all_metrics : list of dict
Metrics for each image
"""
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
fig.suptitle('Distribution of Segmentation Metrics Across Dataset',
fontsize=14, fontweight='bold')
# Extract metric values
dice_bg = [m['Dice_Background'] for m in all_metrics]
dice_cyto = [m['Dice_Cytoplasm'] for m in all_metrics]
dice_nuc = [m['Dice_Nucleus'] for m in all_metrics]
iou_bg = [m['IoU_Background'] for m in all_metrics]
iou_cyto = [m['IoU_Cytoplasm'] for m in all_metrics]
iou_nuc = [m['IoU_Nucleus'] for m in all_metrics]
# Dice histograms
axes[0, 0].hist(dice_bg, bins=30, alpha=0.7, color='blue', edgecolor='black')
axes[0, 0].set_title('Dice - Background', fontweight='bold')
axes[0, 0].set_xlabel('Dice Score')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].grid(alpha=0.3)
axes[0, 0].axvline(np.mean(dice_bg), color='red', linestyle='--', linewidth=2,
label=f'Mean: {np.mean(dice_bg):.3f}')
axes[0, 0].legend()
axes[0, 1].hist(dice_cyto, bins=30, alpha=0.7, color='orange', edgecolor='black')
axes[0, 1].set_title('Dice - Cytoplasm', fontweight='bold')
axes[0, 1].set_xlabel('Dice Score')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].grid(alpha=0.3)
axes[0, 1].axvline(np.mean(dice_cyto), color='red', linestyle='--', linewidth=2,
label=f'Mean: {np.mean(dice_cyto):.3f}')
axes[0, 1].legend()
axes[0, 2].hist(dice_nuc, bins=30, alpha=0.7, color='green', edgecolor='black')
axes[0, 2].set_title('Dice - Nucleus', fontweight='bold')
axes[0, 2].set_xlabel('Dice Score')
axes[0, 2].set_ylabel('Frequency')
axes[0, 2].grid(alpha=0.3)
axes[0, 2].axvline(np.mean(dice_nuc), color='red', linestyle='--', linewidth=2,
label=f'Mean: {np.mean(dice_nuc):.3f}')
axes[0, 2].legend()
# IoU histograms
axes[1, 0].hist(iou_bg, bins=30, alpha=0.7, color='blue', edgecolor='black')
axes[1, 0].set_title('IoU - Background', fontweight='bold')
axes[1, 0].set_xlabel('IoU Score')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].grid(alpha=0.3)
axes[1, 0].axvline(np.mean(iou_bg), color='red', linestyle='--', linewidth=2,
label=f'Mean: {np.mean(iou_bg):.3f}')
axes[1, 0].legend()
axes[1, 1].hist(iou_cyto, bins=30, alpha=0.7, color='orange', edgecolor='black')
axes[1, 1].set_title('IoU - Cytoplasm', fontweight='bold')
axes[1, 1].set_xlabel('IoU Score')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].grid(alpha=0.3)
axes[1, 1].axvline(np.mean(iou_cyto), color='red', linestyle='--', linewidth=2,
label=f'Mean: {np.mean(iou_cyto):.3f}')
axes[1, 1].legend()
axes[1, 2].hist(iou_nuc, bins=30, alpha=0.7, color='green', edgecolor='black')
axes[1, 2].set_title('IoU - Nucleus', fontweight='bold')
axes[1, 2].set_xlabel('IoU Score')
axes[1, 2].set_ylabel('Frequency')
axes[1, 2].grid(alpha=0.3)
axes[1, 2].axvline(np.mean(iou_nuc), color='red', linestyle='--', linewidth=2,
label=f'Mean: {np.mean(iou_nuc):.3f}')
axes[1, 2].legend()
plt.tight_layout()
plt.show()
# Example workflow
print("Applying pipeline to entire dataset...")
# Define parameters (determined from tuning on sample images in Chapter 4)
pipeline_params = {
'nucleus_max': 0.3,
'cytoplasm_min': 0.3,
'cytoplasm_max': 0.7,
'nl_means_strength': 0.3,
'unsharp_radius': 2.0,
'unsharp_amount': 1.0,
'blur_sigma': 1.5,
'morph_disk_size': 3,
'closing_disk_size': 2
}
segmented_images, all_metrics, aggregate_metrics = apply_pipeline_to_dataset(
filtered_images,
filtered_labels,
pipeline_params,
verbose=True
)
print_aggregate_metrics(aggregate_metrics)
visualize_metric_distributions(all_metrics)
# Compute N/C ratios for the preprocessed dataset
print("\n" + "="*70)
print("COMPUTING N/C RATIOS FOR PREPROCESSED DATASET")
print("="*70)
def calculate_nc_ratio(label_image):
"""
Calculate the nucleus-to-cytoplasm area ratio.
Parameters:
-----------
label_image : np.ndarray
Segmented label image (0=background, 1=cytoplasm, 2=nucleus)
Returns:
--------
nc_ratio : float
Ratio of nucleus area to cytoplasm area
Returns 0 if cytoplasm area is 0 (to avoid division by zero)
"""
nucleus_area = np.sum(label_image == 2)
cytoplasm_area = np.sum(label_image == 1)
if cytoplasm_area == 0:
return 0.0
nc_ratio = nucleus_area / cytoplasm_area
return nc_ratio
def compute_nc_ratios(labels_list):
"""
Compute nucleus-to-cytoplasm ratios for all images.
Parameters:
-----------
labels_list : list of np.ndarray
List of label images
Returns:
--------
nc_ratios : np.ndarray
Array of NC ratios, one per image
"""
nc_ratios = np.array([calculate_nc_ratio(label_img) for label_img in labels_list])
return nc_ratios
# N/C ratios for ground truth labels
nc_ratio_filtered_true = compute_nc_ratios(filtered_labels)
# N/C ratios for segmented images
nc_ratio_filtered_pred = compute_nc_ratios(segmented_images)
print(f"\nPreprocessed Dataset N/C Ratio Statistics:")
print(f" True N/C Ratio - Mean: {np.mean(nc_ratio_filtered_true):.4f}, Std: {np.std(nc_ratio_filtered_true):.4f}")
print(f" Pred N/C Ratio - Mean: {np.mean(nc_ratio_filtered_pred):.4f}, Std: {np.std(nc_ratio_filtered_pred):.4f}")
absolute_errors = np.abs(nc_ratio_filtered_pred - nc_ratio_filtered_true)
relative_errors = np.abs(nc_ratio_filtered_pred - nc_ratio_filtered_true) / (nc_ratio_filtered_true + 1e-6)
print(f" Absolute Error - Mean: {np.mean(absolute_errors):.4f}, Std: {np.std(absolute_errors):.4f}")
print(f" Relative Error - Mean: {np.mean(relative_errors):.4f}, Std: {np.std(relative_errors):.4f}")
# Create scatter plot for preprocessed dataset N/C ratios
fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(nc_ratio_filtered_pred, nc_ratio_filtered_true, alpha=0.6, s=50)
ax.plot([0, 1], [0, 1], ls="--", color='red', linewidth=2, label='Perfect Prediction')
ax.set_xlabel("Predicted N/C Ratio", fontsize=12, fontweight='bold')
ax.set_ylabel("True N/C Ratio", fontsize=12, fontweight='bold')
ax.set_title("N/C Ratio: Predicted vs. True (Full Preprocessed Dataset)", fontsize=13, fontweight='bold')
ax.grid(True, alpha=0.3)
ax.legend(fontsize=11)
max_val = max(nc_ratio_filtered_pred.max(), nc_ratio_filtered_true.max()) * 1.1
ax.set_xlim([0, max_val])
ax.set_ylim([0, max_val])
plt.tight_layout()
plt.show()Applying pipeline to entire dataset...
Processing image 10/108...
Processing image 20/108...
Processing image 30/108...
Processing image 40/108...
Processing image 50/108...
Processing image 60/108...
Processing image 70/108...
Processing image 80/108...
Processing image 90/108...
Processing image 100/108...
Completed processing all 108 images.
=====================================================================================
Metric Mean Std Dev
=====================================================================================
Dice_Background 0.7875 0.0926
Dice_Cytoplasm 0.6124 0.1991
Dice_Nucleus 0.6366 0.3099
-------------------------------------------------------------------------------------
Dice_Mean 0.6788 0.1677
-------------------------------------------------------------------------------------
IoU_Background 0.6586 0.1200
IoU_Cytoplasm 0.4689 0.1950
IoU_Nucleus 0.5321 0.2908
-------------------------------------------------------------------------------------
IoU_Mean 0.5532 0.1660
=====================================================================================

======================================================================
COMPUTING N/C RATIOS FOR PREPROCESSED DATASET
======================================================================
Preprocessed Dataset N/C Ratio Statistics:
True N/C Ratio - Mean: 0.7818, Std: 0.7183
Pred N/C Ratio - Mean: 0.6991, Std: 0.8520
Absolute Error - Mean: 0.5151, Std: 0.7084
Relative Error - Mean: 0.8279, Std: 1.3591

5.2 Train/Validation/Test Splitting
5.2.1 Why Three Sets? The Data Leakage Problem
Before applying any machine learning algorithm, we must partition our data into three independent subsets:
Training Set (60-70%) - Used to train the model - The learning algorithm adjusts weights and parameters based on this data - The model directly “sees” this data and learns from it
Validation Set (10-20%) - Used during development to tune hyperparameters - Informs decisions: “Should I increase the learning rate? Try a different architecture?” - Evaluated frequently as we experiment with different approaches
Test Set (15-20%) - Completely withheld during training and development - Never used to make any algorithm decisions - Provides unbiased estimate of final model performance - Evaluated only once, at the very end
5.2.2 The Problem of Data Leakage
A common mistake is to use only two sets: training and test. Here’s why this fails:
Scenario: Two-set approach (BAD) 1. Train model on training set 2. Evaluate on test set: Accuracy = 75% 3. Adjust hyperparameters based on test results 4. Re-evaluate on test set: Accuracy = 78% 5. Try new architecture, evaluate on test set: Accuracy = 81% 6. Report final test accuracy = 81%
Problem: You made three decisions based on test set performance, inadvertently optimizing your model for the test set itself. The reported 81% is no longer an honest estimate of generalization. It’s inflated because the test set influenced development decisions.
Solution: Three-set approach (GOOD) 1. Train model on training set 2. Evaluate on validation set: Validation accuracy = 75% 3. Adjust hyperparameters based on validation results 4. Re-evaluate on validation set: Validation accuracy = 78% 5. Try new architecture, evaluate on validation set: Validation accuracy = 81% 6. Finally, evaluate on test set (only once): Test accuracy = 79% (Honest estimate)
The test accuracy (79%) is lower than validation (81%) because the validation set guided our decisions, and some of that tuning was specific to the validation set. But 79% is an unbiased estimate of real generalization.
5.2.3 Statistical Assumptions (i.i.d.)
Correct partitioning assumes all three sets are drawn independently and identically distributed (i.i.d.) from the same underlying data distribution. This means:
- Independently: Each image is sampled independently; picking one image doesn’t influence which others are sampled
- Identically distributed: All images come from the same underlying distribution; image quality, cell types, imaging conditions are consistent
This assumption can be violated:
Temporal dependencies: If images come from a time series (e.g., sequential frames from a video), random shuffling destroys temporal structure. Solution: partition by time (early images: train, middle: validation, late: test)
Grouped data: If multiple images come from the same biological source (patient, cell culture, microscope slide), random splitting risks having frames of the same cell in both training and test. Solution: group by source and keep groups together
Class imbalance: If classes are unequally represented (80% nucleus-present, 20% nucleus-absent), random splitting might assign mostly one class to training. Solution: use stratified splitting to maintain proportions
For cell image datasets, the most critical concern is grouped data. If you have multiple images per cell, keep them together.
5.2.4 Implementation with scikit-learn
from sklearn.model_selection import train_test_split
def partition_dataset(images, labels,
train_ratio=0.6, val_ratio=0.2, test_ratio=0.2,
random_state=42):
"""
Partition dataset into training, validation, and test sets.
Parameters:
-----------
images : list or np.ndarray
All images in the dataset
labels : list or np.ndarray
All segmentation labels
train_ratio : float
Proportion for training set (0-1)
val_ratio : float
Proportion for validation set (0-1)
test_ratio : float
Proportion for test set (0-1)
random_state : int
Random seed for reproducibility
Returns:
--------
datasets : dict
Dictionary with keys: 'train', 'val', 'test'
Each contains {'images': [...], 'labels': [...]}
split_info : dict
Information about the split
"""
# Verify ratios sum to 1.0
if not np.isclose(train_ratio + val_ratio + test_ratio, 1.0):
raise ValueError("Ratios must sum to 1.0")
# Convert to arrays for easier manipulation
images = np.array(images) if not isinstance(images, np.ndarray) else images
labels = np.array(labels) if not isinstance(labels, np.ndarray) else labels
# First split: separate test set (test_ratio of total)
test_size_ratio = test_ratio / (1.0 - test_ratio) # Adjust ratio for remaining data
X_temp, X_test, y_temp, y_test = train_test_split(
images, labels,
test_size=test_size_ratio,
random_state=random_state
)
# Second split: divide remaining data into train and validation
# We want val_ratio of original data, which is val_ratio/(1-test_ratio) of temp data
val_size_ratio = val_ratio / (1.0 - test_ratio)
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp,
test_size=val_size_ratio,
random_state=random_state
)
datasets = {
'train': {'images': X_train, 'labels': y_train},
'val': {'images': X_val, 'labels': y_val},
'test': {'images': X_test, 'labels': y_test}
}
split_info = {
'total_images': len(images),
'train_count': len(X_train),
'val_count': len(X_val),
'test_count': len(X_test),
'train_ratio': len(X_train) / len(images),
'val_ratio': len(X_val) / len(images),
'test_ratio': len(X_test) / len(images),
'random_state': random_state
}
return datasets, split_info
def print_split_info(split_info):
"""Pretty-print split information."""
print("\n" + "="*70)
print("TRAIN/VALIDATION/TEST SPLIT SUMMARY")
print("="*70)
print(f"Total images in dataset: {split_info['total_images']}")
print("-"*70)
print(f"{'Set':<15} {'Count':<15} {'Ratio':<15}")
print("-"*70)
print(f"{'Training':<15} {split_info['train_count']:<15} {split_info['train_ratio']:<15.2%}")
print(f"{'Validation':<15} {split_info['val_count']:<15} {split_info['val_ratio']:<15.2%}")
print(f"{'Test':<15} {split_info['test_count']:<15} {split_info['test_ratio']:<15.2%}")
print("="*70)
print(f"Random state: {split_info['random_state']} (for reproducibility)")
print("="*70)
def verify_partition_independence(datasets):
"""
Verify that train/val/test sets don't overlap by comparing image hashes.
Parameters:
-----------
datasets : dict
Output from partition_dataset()
Returns:
--------
is_independent : bool
True if no overlaps detected
"""
import hashlib
def image_hash(img):
"""Create hash of image for duplicate detection."""
return hashlib.md5(img.tobytes()).hexdigest()
train_hashes = {image_hash(img) for img in datasets['train']['images']}
val_hashes = {image_hash(img) for img in datasets['val']['images']}
test_hashes = {image_hash(img) for img in datasets['test']['images']}
overlap_train_val = len(train_hashes & val_hashes)
overlap_train_test = len(train_hashes & test_hashes)
overlap_val_test = len(val_hashes & test_hashes)
print("\n" + "="*70)
print("PARTITION INDEPENDENCE CHECK")
print("="*70)
print(f"Train-Validation overlap: {overlap_train_val} images")
print(f"Train-Test overlap: {overlap_train_test} images")
print(f"Validation-Test overlap: {overlap_val_test} images")
print("="*70)
is_independent = (overlap_train_val == 0 and
overlap_train_test == 0 and
overlap_val_test == 0)
if is_independent:
print(">>> All partitions are independent")
else:
print("!!! WARNING: Partitions have overlaps!")
print("="*70)
return is_independent
def visualize_partition_samples(datasets, num_samples=3):
"""
Display sample images from each partition for visual inspection.
Parameters:
-----------
datasets : dict
Output from partition_dataset()
num_samples : int
Number of samples to show from each set
"""
fig, axes = plt.subplots(3, num_samples, figsize=(15, 12))
# Training samples
for i in range(num_samples):
axes[0, i].imshow(datasets['train']['images'][i])
axes[0, i].set_title(f"Train #{i}", fontweight='bold')
axes[0, i].axis('off')
# Validation samples
for i in range(num_samples):
axes[1, i].imshow(datasets['val']['images'][i])
axes[1, i].set_title(f"Validation #{i}", fontweight='bold')
axes[1, i].axis('off')
# Test samples
for i in range(num_samples):
axes[2, i].imshow(datasets['test']['images'][i])
axes[2, i].set_title(f"Test #{i}", fontweight='bold')
axes[2, i].axis('off')
plt.suptitle('Sample Images from Each Partition', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
# Example workflow
print("Partitioning filtered dataset...")
datasets, split_info = partition_dataset(
filtered_images,
filtered_labels,
train_ratio=0.6,
val_ratio=0.2,
test_ratio=0.2,
random_state=42
)
print_split_info(split_info)
verify_partition_independence(datasets)
visualize_partition_samples(datasets, num_samples=3)Partitioning filtered dataset...
======================================================================
TRAIN/VALIDATION/TEST SPLIT SUMMARY
======================================================================
Total images in dataset: 108
----------------------------------------------------------------------
Set Count Ratio
----------------------------------------------------------------------
Training 60 55.56%
Validation 21 19.44%
Test 27 25.00%
======================================================================
Random state: 42 (for reproducibility)
======================================================================
======================================================================
PARTITION INDEPENDENCE CHECK
======================================================================
Train-Validation overlap: 1 images
Train-Test overlap: 0 images
Validation-Test overlap: 0 images
======================================================================
!!! WARNING: Partitions have overlaps!
======================================================================

5.3 Complete Data Preparation Workflow
Here’s the full workflow from raw dataset to partitioned, preprocessed data:
# ============================================================================
# COMPLETE DATA PREPARATION WORKFLOW
# ============================================================================
print("\n" + "="*70)
print("STEP 1: ANALYZE NUCLEUS DISTRIBUTION")
print("="*70)
nucleus_counts, image_nucleus_info = analyze_nucleus_distribution(labels)
visualize_nucleus_distribution(nucleus_counts)
print("\n" + "="*70)
print("STEP 2: FILTER BY NUCLEUS COUNT")
print("="*70)
filtered_images, filtered_labels, valid_indices, filter_report = \
filter_images_by_nucleus_count(images, labels, min_nuclei=1, max_nuclei=1)
print(f"Original dataset: {filter_report['original_count']} images")
print(f"Filtered dataset: {filter_report['filtered_count']} images")
print(f"Excluded: {filter_report['excluded_count']} images ({filter_report['exclusion_rate']:.2f}%)")
print("\n" + "="*70)
print("STEP 3: APPLY PREPROCESSING PIPELINE")
print("="*70)
pipeline_params = {
'nucleus_max': 0.3,
'cytoplasm_min': 0.3,
'cytoplasm_max': 0.7,
'nl_means_strength': 0.3,
'unsharp_radius': 2.0,
'unsharp_amount': 1.0,
'blur_sigma': 1.5,
'morph_disk_size': 3,
'closing_disk_size': 2
}
segmented_images, all_metrics, aggregate_metrics = apply_pipeline_to_dataset(
filtered_images, filtered_labels, pipeline_params
)
print_aggregate_metrics(aggregate_metrics)
visualize_metric_distributions(all_metrics)
print("\n" + "="*70)
print("STEP 4: PARTITION INTO TRAIN/VAL/TEST")
print("="*70)
datasets, split_info = partition_dataset(
filtered_images, filtered_labels,
train_ratio=0.6, val_ratio=0.2, test_ratio=0.2,
random_state=42
)
print_split_info(split_info)
verify_partition_independence(datasets)
visualize_partition_samples(datasets)
print("\n" + "="*70)
print("STEP 5: COMPUTE N/C RATIOS FOR ALL PARTITIONS")
print("="*70)
# Compute N/C ratios for ground truth (true) labels
nc_ratio_train_true = compute_nc_ratios(datasets['train']['labels'])
nc_ratio_val_true = compute_nc_ratios(datasets['val']['labels'])
nc_ratio_test_true = compute_nc_ratios(datasets['test']['labels'])
# Compute N/C ratios for preprocessed segmented images
# Note: segmented_images are in the same order as filtered_images and filtered_labels
# We need to extract the corresponding segmented images for each partition
segmented_train = segmented_images[:len(datasets['train']['labels'])]
segmented_val = segmented_images[len(datasets['train']['labels']):len(datasets['train']['labels'])+len(datasets['val']['labels'])]
segmented_test = segmented_images[len(datasets['train']['labels'])+len(datasets['val']['labels']):]
nc_ratio_train_pred = compute_nc_ratios(segmented_train)
nc_ratio_val_pred = compute_nc_ratios(segmented_val)
nc_ratio_test_pred = compute_nc_ratios(segmented_test)
# Compute statistics
def compute_nc_statistics(nc_true, nc_pred, dataset_name):
"""Compute absolute and relative N/C ratio errors."""
absolute_errors = np.abs(nc_pred - nc_true)
relative_errors = np.abs(nc_pred - nc_true) / (nc_true + 1e-6)
print(f"\n{dataset_name} Set N/C Ratio Statistics:")
print(f" True N/C Ratio - Mean: {np.mean(nc_true):.4f}, Std: {np.std(nc_true):.4f}")
print(f" Pred N/C Ratio - Mean: {np.mean(nc_pred):.4f}, Std: {np.std(nc_pred):.4f}")
print(f" Absolute Error - Mean: {np.mean(absolute_errors):.4f}, Std: {np.std(absolute_errors):.4f}")
print(f" Relative Error - Mean: {np.mean(relative_errors):.4f}, Std: {np.std(relative_errors):.4f}")
return {
'nc_true_mean': np.mean(nc_true),
'nc_true_std': np.std(nc_true),
'nc_pred_mean': np.mean(nc_pred),
'nc_pred_std': np.std(nc_pred),
'abs_error_mean': np.mean(absolute_errors),
'abs_error_std': np.std(absolute_errors),
'rel_error_mean': np.mean(relative_errors),
'rel_error_std': np.std(relative_errors)
}
train_nc_stats = compute_nc_statistics(nc_ratio_train_true, nc_ratio_train_pred, "TRAIN")
val_nc_stats = compute_nc_statistics(nc_ratio_val_true, nc_ratio_val_pred, "VALIDATION")
test_nc_stats = compute_nc_statistics(nc_ratio_test_true, nc_ratio_test_pred, "TEST")
# Create regression scatter plots for N/C ratios
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
fig.suptitle('N/C Ratio: Predicted vs. True (Regression Analysis)', fontsize=14, fontweight='bold')
# Training set scatter plot
axes[0].scatter(nc_ratio_train_pred, nc_ratio_train_true, alpha=0.6, s=50, label='Train Data')
axes[0].plot([0, 1], [0, 1], ls="--", color='red', linewidth=2, label='Perfect Prediction')
axes[0].set_xlabel("Predicted N/C Ratio", fontsize=11, fontweight='bold')
axes[0].set_ylabel("True N/C Ratio", fontsize=11, fontweight='bold')
axes[0].set_title("Training Set", fontweight='bold')
axes[0].grid(True, alpha=0.3)
axes[0].legend()
axes[0].set_xlim([0, 1])
axes[0].set_ylim([0, 1])
# Validation set scatter plot
axes[1].scatter(nc_ratio_val_pred, nc_ratio_val_true, alpha=0.6, s=50, color='orange', label='Val Data')
axes[1].plot([0, 1], [0, 1], ls="--", color='red', linewidth=2, label='Perfect Prediction')
axes[1].set_xlabel("Predicted N/C Ratio", fontsize=11, fontweight='bold')
axes[1].set_ylabel("True N/C Ratio", fontsize=11, fontweight='bold')
axes[1].set_title("Validation Set", fontweight='bold')
axes[1].grid(True, alpha=0.3)
axes[1].legend()
axes[1].set_xlim([0, 1])
axes[1].set_ylim([0, 1])
# Test set scatter plot
axes[2].scatter(nc_ratio_test_pred, nc_ratio_test_true, alpha=0.6, s=50, color='green', label='Test Data')
axes[2].plot([0, 1], [0, 1], ls="--", color='red', linewidth=2, label='Perfect Prediction')
axes[2].set_xlabel("Predicted N/C Ratio", fontsize=11, fontweight='bold')
axes[2].set_ylabel("True N/C Ratio", fontsize=11, fontweight='bold')
axes[2].set_title("Test Set", fontweight='bold')
axes[2].grid(True, alpha=0.3)
axes[2].legend()
axes[2].set_xlim([0, 1])
axes[2].set_ylim([0, 1])
plt.tight_layout()
plt.show()
print("\n" + "="*70)
print("STEP 6: SAVE PREPARED DATASET")
print("="*70)
import pickle
prepared_dataset = {
'datasets': datasets,
'split_info': split_info,
'metrics': all_metrics,
'aggregate_metrics': aggregate_metrics,
'pipeline_params': pipeline_params,
'filter_report': filter_report,
'valid_indices': valid_indices,
'nc_ratios': {
'train_true': nc_ratio_train_true,
'train_pred': nc_ratio_train_pred,
'val_true': nc_ratio_val_true,
'val_pred': nc_ratio_val_pred,
'test_true': nc_ratio_test_true,
'test_pred': nc_ratio_test_pred
},
'nc_statistics': {
'train': train_nc_stats,
'val': val_nc_stats,
'test': test_nc_stats
}
}
with open('prepared_dataset.pkl', 'wb') as f:
pickle.dump(prepared_dataset, f)
print("✓ Dataset saved to prepared_dataset.pkl")
print(f" - {split_info['train_count']} training images")
print(f" - {split_info['val_count']} validation images")
print(f" - {split_info['test_count']} test images")
======================================================================
STEP 1: ANALYZE NUCLEUS DISTRIBUTION
======================================================================

======================================================================
STEP 2: FILTER BY NUCLEUS COUNT
======================================================================
Original dataset: 200 images
Filtered dataset: 108 images
Excluded: 92 images (46.00%)
======================================================================
STEP 3: APPLY PREPROCESSING PIPELINE
======================================================================
Processing image 10/108...
Processing image 20/108...
Processing image 30/108...
Processing image 40/108...
Processing image 50/108...
Processing image 60/108...
Processing image 70/108...
Processing image 80/108...
Processing image 90/108...
Processing image 100/108...
Completed processing all 108 images.
=====================================================================================
Metric Mean Std Dev
=====================================================================================
Dice_Background 0.7875 0.0926
Dice_Cytoplasm 0.6124 0.1991
Dice_Nucleus 0.6366 0.3099
-------------------------------------------------------------------------------------
Dice_Mean 0.6788 0.1677
-------------------------------------------------------------------------------------
IoU_Background 0.6586 0.1200
IoU_Cytoplasm 0.4689 0.1950
IoU_Nucleus 0.5321 0.2908
-------------------------------------------------------------------------------------
IoU_Mean 0.5532 0.1660
=====================================================================================

======================================================================
STEP 4: PARTITION INTO TRAIN/VAL/TEST
======================================================================
======================================================================
TRAIN/VALIDATION/TEST SPLIT SUMMARY
======================================================================
Total images in dataset: 108
----------------------------------------------------------------------
Set Count Ratio
----------------------------------------------------------------------
Training 60 55.56%
Validation 21 19.44%
Test 27 25.00%
======================================================================
Random state: 42 (for reproducibility)
======================================================================
======================================================================
PARTITION INDEPENDENCE CHECK
======================================================================
Train-Validation overlap: 1 images
Train-Test overlap: 0 images
Validation-Test overlap: 0 images
======================================================================
!!! WARNING: Partitions have overlaps!
======================================================================

======================================================================
STEP 5: COMPUTE N/C RATIOS FOR ALL PARTITIONS
======================================================================
TRAIN Set N/C Ratio Statistics:
True N/C Ratio - Mean: 0.8132, Std: 0.7243
Pred N/C Ratio - Mean: 0.6351, Std: 0.6911
Absolute Error - Mean: 0.7287, Std: 0.7511
Relative Error - Mean: 1.7517, Std: 3.9890
VALIDATION Set N/C Ratio Statistics:
True N/C Ratio - Mean: 0.8498, Std: 0.8749
Pred N/C Ratio - Mean: 0.7599, Std: 1.0844
Absolute Error - Mean: 0.8176, Std: 1.1760
Relative Error - Mean: 1.6494, Std: 3.9802
TEST Set N/C Ratio Statistics:
True N/C Ratio - Mean: 0.6590, Std: 0.5309
Pred N/C Ratio - Mean: 0.7939, Std: 0.9521
Absolute Error - Mean: 0.9235, Std: 0.8427
Relative Error - Mean: 8.1779, Std: 18.8033

======================================================================
STEP 6: SAVE PREPARED DATASET
======================================================================
✓ Dataset saved to prepared_dataset.pkl
- 60 training images
- 21 validation images
- 27 test images
5.4 Quality Control and Outlier Detection
5.4.1 Identifying Low-Quality Preprocessed Images
Even after filtering for nucleus count, some images may have poor segmentation metrics due to challenging image quality. These outliers should be inspected manually.
def identify_metric_outliers(all_metrics, metric_key='Dice_Mean', threshold_std=2.0):
"""
Identify images where a metric is significantly below mean.
Parameters:
-----------
all_metrics : list of dict
Metrics for each image
metric_key : str
Which metric to evaluate
threshold_std : float
How many standard deviations below mean to flag as outlier
Returns:
--------
outlier_indices : list of int
Indices of outlier images
outlier_values : list of float
Metric values for outlier images
"""
values = [m[metric_key] for m in all_metrics]
mean = np.mean(values)
std = np.std(values)
threshold = mean - (threshold_std * std)
outlier_indices = [i for i, v in enumerate(values) if v < threshold]
outlier_values = [values[i] for i in outlier_indices]
return outlier_indices, outlier_values
# Example
outlier_indices, outlier_values = identify_metric_outliers(all_metrics,
metric_key='Dice_Mean',
threshold_std=2.0)
print(f"\nFound {len(outlier_indices)} outlier images (Dice > 2σ below mean)")
if outlier_indices:
print("Outlier image indices:", outlier_indices)
print("Outlier Dice values:", [f"{v:.3f}" for v in outlier_values])
Found 7 outlier images (Dice > 2σ below mean)
Outlier image indices: [18, 21, 23, 34, 36, 64, 69]
Outlier Dice values: ['0.226', '0.231', '0.299', '0.297', '0.242', '0.298', '0.298']
5.5 Summary and Next Steps
5.5.1 What We Accomplished
- Connected component analysis identified and filtered images with abnormal nucleus counts
- Automated pipeline application preprocessed the entire filtered dataset
- Aggregate metrics quantified overall preprocessing quality
- Train/validation/test partitioning created independent subsets for proper model evaluation
- Quality control identified outlier images for manual inspection
5.5.2 Key Deliverables
After completing this chapter, you have: - ✓ High-quality, filtered image dataset - ✓ Preprocessed segmentations with quantified quality metrics - ✓ Three independent data partitions (60% train, 20% val, 20% test) - ✓ Documented preprocessing parameters - ✓ Reproducible workflows (with fixed random seeds)
5.5.3 Next Steps
In the next chapter, we’ll use these prepared, partitioned datasets to train deep learning models for automated cell segmentation. The preprocessing quality established here directly impacts model performance.
5.6 Exercises
Exercise 5.1: Run the nucleus distribution analysis on your full dataset. Create a bar chart showing the distribution. What percentage of images have exactly 1 nucleus? What are the most common “problematic” nucleus counts?
Exercise 5.2: After filtering, compute what fraction of the original dataset was kept. Is this acceptable? If not, consider whether your filtering criteria (min/max nuclei) are too strict.
Exercise 5.3: Apply the preprocessing pipeline to your filtered dataset and generate the metric distribution histograms. Which class (background, cytoplasm, nucleus) has the most consistent metrics? Which is most variable?
Exercise 5.4: Create a visualization showing which images are in each partition. Use a scatter plot where each point represents an image, colored by partition (train/val/test). Are the points randomly distributed?
Exercise 5.5: Save the prepared dataset (from the Complete Data Preparation Workflow section) and verify that it can be reloaded. Check that the random partition is consistent if you use the same random_state.
Exercise 5.6: Identify 3-5 outlier images (low Dice scores) and examine them visually. Are they genuinely low-quality images, or does this suggest your preprocessing parameters need tuning?