BatchCorrection Function

Correct batch effects in multi-sample single-cell RNA-seq datasets.

Overview

The BatchCorrection function removes technical batch effects that arise when samples are processed in different batches, sequencing runs, or experimental conditions. It uses advanced R-based computational methods to harmonize data while preserving biological variation.

Class Information

Module: celline.functions.batch_cor
Class: BatchCorrection
Base Class: CellineFunction

Parameters

Constructor Parameters

Parameter	Type	Required	Description
`output_file_path`	`str`	Yes	Path for batch-corrected output files
`filter_func`	`Optional[Callable[[SampleSchema], bool]]`	No	Function to filter samples for correction

JobContainer Structure

Field	Type	Description
`nthread`	`str`	Number of threads (fixed to 1)
`cluster_server`	`str`	Cluster server name
`logpath`	`str`	Batch correction log path
`r_path`	`str`	R script directory
`exec_root`	`str`	Execution root directory
`proj_path`	`str`	Project root path
`sample_ids`	`str`	Comma-separated sample IDs
`project_ids`	`str`	Comma-separated project IDs
`output_dir`	`str`	Output directory path
`logpath_runtime`	`str`	Runtime log path

Usage Examples

Python API

Basic Batch Correction

      from celline import Project
from celline.functions.batch_cor import BatchCorrection

# Create project
project = Project("./my-project")

# Basic batch correction for all samples
batch_function = BatchCorrection(
    output_file_path="/path/to/batch_corrected",
    filter_func=None
)

# Execute function
result = project.call(batch_function)

Filtered Batch Correction

      from celline import Project
from celline.functions.batch_cor import BatchCorrection
from celline.DB.dev.model import SampleSchema

# Create project
project = Project("./my-project")

# Filter function for specific conditions
def same_tissue_filter(schema: SampleSchema) -> bool:
    return "brain" in (schema.title or "").lower()

# Batch correction with filtering
batch_function = BatchCorrection(
    output_file_path="/path/to/brain_batch_corrected",
    filter_func=same_tissue_filter
)

# Execute function
result = project.call(batch_function)

Platform-Specific Correction

      from celline import Project
from celline.functions.batch_cor import BatchCorrection
from celline.DB.dev.model import SampleSchema

# Create project
project = Project("./my-project")

# Filter by sequencing platform
def platform_filter(schema: SampleSchema) -> bool:
    return schema.platform.startswith("Illumina")

# Batch correction for specific platform
batch_function = BatchCorrection(
    output_file_path="/path/to/illumina_corrected",
    filter_func=platform_filter
)

# Execute function
result = project.call(batch_function)

Multi-Study Correction

      from celline import Project
from celline.functions.batch_cor import BatchCorrection
from celline.DB.dev.model import SampleSchema

# Create project
project = Project("./my-project")

# Filter for multiple studies
target_studies = ["GSE123456", "GSE789012", "GSE345678"]

def multi_study_filter(schema: SampleSchema) -> bool:
    return schema.parent in target_studies

# Batch correction across studies
batch_function = BatchCorrection(
    output_file_path="/path/to/multi_study_corrected",
    filter_func=multi_study_filter
)

# Execute function
result = project.call(batch_function)

CLI Usage

      # Basic batch correction
celline run batch

# Batch correction with verbose logging
celline run batch --verbose

Implementation Details

Batch Effect Sources

Common sources of batch effects in single-cell data:

Source	Description	Impact
Sequencing Run	Different sequencing lanes/machines	Technical variation
Sample Preparation	Different processing dates	Protocol variation
Cell Isolation	Different isolation methods	Methodological bias
Library Preparation	Kit lot variations	Technical artifacts
Operator Effects	Different personnel	Handling variation
Environmental	Temperature, humidity changes	Subtle technical effects

Correction Methods

The function employs multiple R-based correction algorithms:

ComBat-seq

      # Negative binomial model for count data
library(sva)
corrected_counts <- ComBat_seq(
    counts = count_matrix,
    batch = batch_info,
    group = condition_info
)

Harmony

      # Fast harmonic correction
library(harmony)
corrected_embedding <- RunHarmony(
    object = seurat_object,
    group.by.vars = "batch",
    reduction = "pca"
)

MNN (Mutual Nearest Neighbors)

      # Batch correction via MNN
library(batchelor)
corrected <- fastMNN(sample_list, cos.norm = TRUE)

scVI Integration

      # Deep learning approach (if available)
library(reticulate)
# Python scVI integration

File Organization

Batch correction outputs are systematically organized:

      project_root/
├── batch/
│   ├── batch_corrected_TIMESTAMP.h5ad    # Corrected AnnData object
│   ├── batch_corrected_TIMESTAMP.rds     # Corrected Seurat object
│   ├── logs/
│   │   ├── removebatch_TIMESTAMP.log     # Correction log
│   │   └── runtime/
│   │       └── removebatch_TIMESTAMP.sh  # Runtime log
│   └── removebatch_TIMESTAMP.sh          # Generated R script

Correction Workflow

Prerequisites

Samples must be:

Counted: Cell Ranger count completed
Quality Controlled: Basic QC filtering applied
Normalized: Data normalization completed
Compatible: Same species and cell types

Processing Pipeline

Sample Selection: Applies filter function to identify target samples
Batch Identification: Automatically detects batch variables
Data Integration: Combines count matrices from multiple samples
Batch Detection: Identifies technical batch effects
Correction Application: Applies appropriate correction method
Quality Assessment: Evaluates correction effectiveness
Output Generation: Saves corrected data in multiple formats

Batch Variable Detection

The function automatically identifies batch variables:

      # Automatic batch detection
batch_vars <- c("orig.ident", "sequencing_run", "prep_date", "platform")
detected_batches <- identify_batch_effects(
    seurat_object,
    variables = batch_vars,
    threshold = 0.05
)

Methods

`call(project: Project) -> Project`

Main execution method that performs batch correction.

Parameters:

project: The Celline project instance

Returns: Updated project instance

Process:

Filters samples based on provided filter function
Validates prerequisites for batch correction
Creates output directories
Generates R batch correction script
Executes correction pipeline
Saves corrected results

`register() -> str`

Returns the function identifier for registration.

Filter Functions

Common Filter Patterns

      # Filter by tissue type
def tissue_filter(tissue_type):
    def filter_func(schema):
        return tissue_type.lower() in (schema.title or "").lower()
    return filter_func

# Filter by time period
def temporal_filter(start_date, end_date):
    def filter_func(schema):
        return start_date <= schema.date <= end_date
    return filter_func

# Filter by cell count threshold
def cell_count_filter(min_cells):
    def filter_func(schema):
        return schema.cell_count >= min_cells
    return filter_func

# Combined filters for complex scenarios
def comprehensive_filter(schema):
    return (
        schema.species == "Homo sapiens" and
        "cortex" in (schema.title or "").lower() and
        schema.platform.startswith("10x") and
        schema.cell_count >= 1000
    )

Batch-Specific Filters

      # Same laboratory filter
def lab_filter(lab_name):
    def filter_func(schema):
        return schema.laboratory == lab_name
    return filter_func

# Protocol consistency filter
def protocol_filter(protocol_version):
    def filter_func(schema):
        return schema.protocol_version == protocol_version
    return filter_func

Quality Assessment

Correction Evaluation Metrics

The function evaluates correction quality using multiple metrics:

Metric	Description	Good Value
Silhouette Score	Batch mixing quality	> 0.7
kBET	Batch effect test	< 0.05
LISI	Local inverse Simpson index	> 2.0
ASW	Average silhouette width	> 0.5
Graph Connectivity	Biological preservation	> 0.8

Visualization Outputs

Generated visualizations include:

      # Before/after comparison plots
plot_umap_before_after(seurat_object)
plot_batch_effects_heatmap(corrected_data)
plot_correction_metrics(evaluation_results)

# Quality control plots
plot_gene_expression_conservation(original, corrected)
plot_cell_type_preservation(corrected_data)

Performance Considerations

Memory Requirements

Batch correction memory needs:

Samples	Cells per Sample	Memory Needed
2-5	5K-10K	32-64 GB
5-10	5K-10K	64-128 GB
10+	5K-10K	128+ GB

Processing Time

Correction time by method and data size:

Method	Sample Count	Processing Time
Harmony	5 samples	1-2 hours
ComBat-seq	5 samples	2-4 hours
FastMNN	5 samples	3-6 hours
scVI	5 samples	4-8 hours

Cluster Computing

For large-scale correction:

      #!/bin/bash
#PBS -N BatchCorrection
#PBS -l nodes=1:ppn=16
#PBS -l mem=256gb
#PBS -l walltime=12:00:00

module load R/4.1.0
Rscript batch_correction_script.R

Error Handling

Common Issues

Insufficient Samples: Need ≥2 samples for batch correction
Memory Exhaustion: Large datasets require substantial RAM
Batch Confounding: Biological and technical effects correlated
Missing Metadata: Batch variables not properly defined

Troubleshooting

      # Validate batch correction prerequisites
def validate_batch_correction(samples):
    if len(samples) < 2:
        raise ValueError("Need at least 2 samples for batch correction")
    
    # Check for batch variable availability
    batch_vars = set()
    for sample in samples:
        batch_vars.add(sample.schema.platform)
        batch_vars.add(sample.schema.prep_date)
    
    if len(batch_vars) < 2:
        print("Warning: Limited batch variation detected")

Best Practices

Sample Selection

Biological Coherence: Correct samples from same tissue/condition
Technical Diversity: Ensure multiple batches are represented
Quality Control: Remove low-quality samples before correction
Sample Size: Include sufficient cells per batch (≥500)

Method Selection

      # Choose correction method based on data characteristics
def select_correction_method(n_samples, n_cells, has_replicates):
    if n_samples <= 5 and n_cells <= 50000:
        return "Harmony"  # Fast and effective
    elif has_replicates:
        return "ComBat-seq"  # Good for designed experiments
    elif n_samples > 10:
        return "FastMNN"  # Scalable for many samples
    else:
        return "scVI"  # Deep learning for complex cases

Pipeline Integration

      # Complete batch correction workflow
from celline import Project
from celline.functions.preprocess import Preprocess
from celline.functions.batch_cor import BatchCorrection
from celline.functions.integrate import Integrate

# Process samples
project = Project("./my-project")
project.call(Preprocess())  # QC first

# Apply batch correction
project.call(BatchCorrection(
    output_file_path="./corrected_data",
    filter_func=None
))

# Final integration if needed
project.call(Integrate(
    filter_func=None,
    outfile_name="final_integrated"
))

Output Formats

Supported Output Types

Format	Description	Use Case
HDF5/H5AD	AnnData format	Python analysis
RDS	Seurat object	R analysis
H5Seurat	Seurat HDF5	Cross-platform
CSV/TSV	Matrix format	General purpose
Loom	Hierarchical format	Visualization

Metadata Preservation

Corrected outputs preserve essential metadata:

      # Metadata maintained after correction
corrected_metadata = {
    "original_sample_id": "GSM123456",
    "batch_corrected": True,
    "correction_method": "Harmony",
    "correction_date": "2024-01-15",
    "original_batch": "batch_1",
    "correction_parameters": {...}
}

Preprocess - Quality control before batch correction
Integrate - Alternative integration approach
CreateSeuratObject - Create objects for correction
Reduce - Storage optimization after correction

Troubleshooting

Common Issues

Overcorrection: Biological signals removed with batch effects
Undercorrection: Batch effects remain after correction
Memory Issues: Insufficient RAM for large datasets
Method Failure: Correction algorithm fails to converge

Debug Mode

Enable detailed R logging:

      # In R script
options(error = traceback)
sessionInfo()
print("Batch variables detected:")
print(table(metadata$batch))

Manual Correction

For debugging, run correction manually:

      # Navigate to batch directory
cd batch/

# Execute correction script manually
Rscript removebatch_TIMESTAMP.sh

Validation Steps

      # Validate correction results
def validate_correction(original, corrected):
    # Check cell count preservation
    assert corrected.n_obs == original.n_obs
    
    # Check gene preservation
    assert corrected.n_vars == original.n_vars
    
    # Evaluate batch mixing
    batch_score = calculate_batch_mixing(corrected)
    assert batch_score > 0.5, "Poor batch correction"

BatchCorrection Function

Overview

Class Information

Parameters

Constructor Parameters

JobContainer Structure

Usage Examples

Python API

Basic Batch Correction

Filtered Batch Correction

Platform-Specific Correction

Multi-Study Correction

CLI Usage

Implementation Details

Batch Effect Sources

Correction Methods

ComBat-seq

Harmony

MNN (Mutual Nearest Neighbors)

scVI Integration

File Organization

Correction Workflow

Prerequisites

Processing Pipeline

Batch Variable Detection

Methods

call(project: Project) -> Project

register() -> str

Filter Functions

Common Filter Patterns

Batch-Specific Filters

Quality Assessment

Correction Evaluation Metrics

Visualization Outputs

Performance Considerations

Memory Requirements

Processing Time

Cluster Computing

Error Handling

Common Issues

Troubleshooting

Best Practices

Sample Selection

Method Selection

Pipeline Integration

Output Formats

Supported Output Types

Metadata Preservation

Related Functions

Troubleshooting

Common Issues

Debug Mode

Manual Correction

Validation Steps

`call(project: Project) -> Project`

`register() -> str`