Integrate Function

Integrate multiple single-cell RNA-seq samples into a single batch-corrected dataset.

Overview

The Integrate function combines multiple single-cell RNA-seq samples into a unified analysis, correcting for batch effects and technical variations while preserving biological differences. It uses R-based integration methods to create harmonized datasets.

Class Information

Module: celline.functions.integrate
Class: Integrate
Base Class: CellineFunction

Parameters

Constructor Parameters

Parameter	Type	Required	Description
`filter_func`	`Optional[Callable[[SampleSchema], bool]]`	No	Function to filter samples for integration
`outfile_name`	`Optional[str]`	No	Custom output filename (default: timestamped)

JobContainer Structure

Field	Type	Description
`nthread`	`str`	Number of threads (fixed to 1)
`cluster_server`	`str`	Cluster server name
`jobname`	`str`	Job identifier
`logpath`	`str`	Integration log file path
`r_path`	`str`	R script directory
`exec_root`	`str`	Execution root
`sample_ids`	`str`	Comma-separated sample IDs
`project_ids`	`str`	Comma-separated project IDs
`all_bcmat_path`	`str`	Comma-separated matrix paths
`all_data_sample_dir_path`	`str`	Comma-separated data directories
`outfile_path`	`str`	Output file path
`logpath_runtime`	`str`	Runtime log path
`project_name`	`str`	Project name
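
The table above shows that list-valued inputs (sample IDs, project IDs, matrix paths) are passed to the job as comma-separated strings. A minimal sketch of that flattening follows. `SampleRecord`, `join_field`, and `build_job_fields` are hypothetical names used for illustration only; this is not Celline's actual implementation.

```python
from typing import Iterable, NamedTuple


class SampleRecord(NamedTuple):
    """Hypothetical stand-in for the per-sample values that feed the job."""
    sample_id: str
    project_id: str
    bcmat_path: str


def join_field(values: Iterable[str]) -> str:
    """Flatten values into one comma-separated string."""
    values = list(values)
    for value in values:
        # A comma inside a value would make the joined string ambiguous.
        if "," in value:
            raise ValueError(f"value contains a comma: {value!r}")
    return ",".join(values)


def build_job_fields(samples: list[SampleRecord]) -> dict[str, str]:
    """Build a few of the string fields listed in the table."""
    return {
        "nthread": "1",  # the table states this is fixed to 1
        "sample_ids": join_field(s.sample_id for s in samples),
        "project_ids": join_field(s.project_id for s in samples),
        "all_bcmat_path": join_field(s.bcmat_path for s in samples),
    }
```

Rejecting values that already contain a comma is a design choice in this sketch: it surfaces a malformed path early rather than letting the R side split it incorrectly.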

Usage Examples

Python API

Basic Integration

from celline import Project
from celline.functions.integrate import Integrate

# Create project
project = Project("./my-project")

# Basic integration of all samples
integrate_function = Integrate(filter_func=None, outfile_name=None)

# Execute function
result = project.call(integrate_function)

Custom Output Name

from celline import Project
from celline.functions.integrate import Integrate

# Create project
project = Project("./my-project")

# Integration with custom output name
integrate_function = Integrate(
    filter_func=None,
    outfile_name="brain_atlas_integrated"
)

# Execute function
result = project.call(integrate_function)

Filtered Integration

from celline import Project
from celline.functions.integrate import Integrate
from celline.DB.dev.model import SampleSchema

# Create project
project = Project("./my-project")

# Filter function to include only specific samples
def brain_samples_filter(schema: SampleSchema) -> bool:
    return "brain" in schema.title.lower() if schema.title else False

# Integration with sample filtering
integrate_function = Integrate(
    filter_func=brain_samples_filter,
    outfile_name="brain_samples_only"
)

# Execute function
result = project.call(integrate_function)

Species-Specific Integration

from celline import Project
from celline.functions.integrate import Integrate
from celline.DB.dev.model import SampleSchema

# Create project
project = Project("./my-project")

# Filter by species
def human_samples_filter(schema: SampleSchema) -> bool:
    return schema.species == "Homo sapiens"

# Integration of human samples only
integrate_function = Integrate(
    filter_func=human_samples_filter,
    outfile_name="human_integration"
)

# Execute function
result = project.call(integrate_function)

CLI Usage

# Basic integration
celline run integrate

# Integration with verbose logging
celline run integrate --verbose

Implementation Details

Integration Methods

The function draws on the following established integration methods:

Seurat Integration: CCA-based integration
Harmony: Fast integration for large datasets
FastMNN: Mutual nearest neighbors approach
scVI: Deep learning-based integration (if available)

R Script Integration

The function generates and executes R scripts:

# Load required libraries
library(Seurat)
library(harmony)
library(SeuratWrappers)

# Load sample data
sample_list <- list()
for (i in 1:length(sample_paths)) {
    sample_list[[i]] <- Read10X_h5(sample_paths[i])
}

# Create Seurat objects
seurat_objects <- lapply(sample_list, function(x) {
    CreateSeuratObject(counts = x, min.cells = 3, min.features = 200)
})

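# Merge the objects and compute PCA, which RunHarmony needs as input
merged_object <- merge(seurat_objects[[1]], y = seurat_objects[-1])
merged_object <- NormalizeData(merged_object)
merged_object <- FindVariableFeatures(merged_object)
merged_object <- ScaleData(merged_object)
merged_object <- RunPCA(merged_object)

# Integrate samples (assumes a "sample" column in the object's metadata)
integrated <- RunHarmony(
    object = merged_object,
    group.by.vars = "sample",
    reduction = "pca"
)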

# Save integrated object
saveRDS(integrated, file = output_path)

File Organization

Integration outputs are organized systematically:

project_root/
├── integration/
│   ├── integrated_TIMESTAMP.rds    # Integrated Seurat object
│   ├── logs/
│   │   ├── integrate_TIMESTAMP.log     # Integration log
│   │   └── RUNTIME_integrate_TIMESTAMP.log  # Runtime log
│   └── integrate_TIMESTAMP.sh      # Generated shell script that runs the R integration
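
To locate these files programmatically, the layout above can be expressed as a small path helper. This is a sketch under assumptions: `integration_paths` is a hypothetical function, and the timestamp format used in the example is a guess, since the source only shows the placeholder `TIMESTAMP`.

```python
from datetime import datetime
from pathlib import Path


def integration_paths(project_root: str, stamp: str) -> dict[str, Path]:
    """Return the output locations shown in the tree for one run."""
    base = Path(project_root) / "integration"
    logs = base / "logs"
    return {
        "rds": base / f"integrated_{stamp}.rds",
        "log": logs / f"integrate_{stamp}.log",
        "runtime_log": logs / f"RUNTIME_integrate_{stamp}.log",
        "script": base / f"integrate_{stamp}.sh",
    }


# Assumed timestamp format, for illustration only.
stamp = datetime(2024, 1, 2, 3, 4, 5).strftime("%Y%m%d%H%M%S")
paths = integration_paths("project_root", stamp)
print(paths["rds"].as_posix())
# project_root/integration/integrated_20240102030405.rds
```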

Integration Workflow

Prerequisites

Samples must be:

Counted: Cell Ranger count completed
Optionally Preprocessed: QC filtering applied
Compatible: Same species and similar protocols

Processing Steps

Sample Discovery: Identifies all counted samples
Filtering: Applies user-defined filter function
Validation: Checks sample compatibility
Matrix Loading: Loads count matrices from each sample
Preprocessing: Normalizes and scales data
Integration: Applies batch correction methods
Dimensionality Reduction: Computes integrated embeddings
Output Generation: Saves integrated object
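
The first three steps (discovery, filtering, validation) can be sketched in plain Python. The `Sample` dataclass below is a hypothetical stand-in rather than Celline's `SampleSchema`, and the two-sample minimum is an assumption of this sketch, not a documented rule.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Sample:
    """Hypothetical stand-in for a project sample."""
    key: str
    species: str
    counted: bool


def select_samples(
    samples: list[Sample],
    filter_func: Optional[Callable[[Sample], bool]] = None,
) -> list[Sample]:
    # Sample discovery: keep only samples whose counting step has finished.
    selected = [s for s in samples if s.counted]
    # Filtering: apply the user-defined predicate, if one was given.
    if filter_func is not None:
        selected = [s for s in selected if filter_func(s)]
    # Validation: refuse to mix species, and require something to integrate.
    species = {s.species for s in selected}
    if len(species) > 1:
        raise ValueError(f"multiple species selected: {sorted(species)}")
    if len(selected) < 2:
        raise ValueError("integration needs at least two samples")
    return selected
```

Raising on mixed species, instead of printing a warning, keeps an incompatible selection from reaching the expensive R step.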

Quality Control

The integration process includes automatic QC:

# Before integration
pre_integration_metrics <- list(
    n_samples = length(sample_list),
    total_cells = sum(sapply(sample_list, ncol)),
    total_genes = nrow(sample_list[[1]])
)

# After integration
# calculate_batch_score() is not a Seurat or harmony function; it stands in
# for a batch-mixing metric defined elsewhere.
post_integration_metrics <- list(
    integrated_cells = ncol(integrated_object),
    batch_correction_score = calculate_batch_score(integrated_object)
)

Methods

`call(project: Project) -> Project`

Main execution method that performs sample integration.

Parameters:

project: The Celline project instance

Returns: Updated project instance

Process:

Discovers and filters samples
Validates prerequisites
Generates R integration script
Executes integration pipeline
Saves integrated results

`register() -> str`

Returns the function identifier for registration.

Filter Functions

Common Filter Patterns

# Filter by tissue type
def tissue_filter(tissue_name):
    def filter_func(schema):
        return tissue_name.lower() in (schema.title or "").lower()
    return filter_func

# Filter by platform
def platform_filter(platform_name):
    def filter_func(schema):
        return schema.platform == platform_name
    return filter_func

# Filter by date range
def date_filter(start_date, end_date):
    def filter_func(schema):
        return start_date <= schema.date <= end_date
    return filter_func

# Combined filters
def combined_filter(schema):
    return (
        schema.species == "Homo sapiens" and
        "brain" in (schema.title or "").lower() and
        schema.platform.startswith("Illumina")
    )
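
The patterns above can also be composed instead of being hand-written for each combination. A minimal sketch, assuming a hypothetical `FakeSchema` that exposes only `title` and `species`; `all_of`, `title_contains`, and `species_is` are illustrative helpers, not part of Celline.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FakeSchema:
    """Hypothetical stand-in exposing only the attributes used below."""
    title: Optional[str]
    species: str


Predicate = Callable[[FakeSchema], bool]


def all_of(*predicates: Predicate) -> Predicate:
    """Return a filter that passes only when every predicate passes."""
    def combined(schema: FakeSchema) -> bool:
        return all(p(schema) for p in predicates)
    return combined


def title_contains(text: str) -> Predicate:
    """Case-insensitive substring match that tolerates a missing title."""
    needle = text.lower()
    return lambda schema: needle in (schema.title or "").lower()


def species_is(name: str) -> Predicate:
    return lambda schema: schema.species == name


human_brain = all_of(species_is("Homo sapiens"), title_contains("brain"))
print(human_brain(FakeSchema("Adult Brain cortex", "Homo sapiens")))  # True
print(human_brain(FakeSchema(None, "Homo sapiens")))                  # False
```

The resulting callable takes a single schema argument and returns a `bool`, so it has the shape that `filter_func` expects.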

Filter Examples

      # Use case: Integrate only cortical samples
cortex_filter = lambda s: "cortex" in (s.title or "").lower()
project.call(Integrate(filter_func=cortex_filter))

# Use case: Integrate samples from specific study
study_filter = lambda s: s.parent == "GSE123456"
project.call(Integrate(filter_func=study_filter))

Performance Considerations

Memory Requirements

Integration memory needs scale with sample number and size:

Samples	Cells per Sample	Memory Needed
2-5	5K-10K	16-32 GB
5-10	5K-10K	32-64 GB
10+	5K-10K	64+ GB

Processing Time

Integration time depends on method and data size:

Method	Sample Count	Processing Time
Harmony	5 samples	30-60 minutes
Seurat CCA	5 samples	1-2 hours
FastMNN	5 samples	45-90 minutes

Cluster Computing

For large integrations:

#!/bin/bash
#PBS -N Integration
#PBS -l nodes=1:ppn=8
#PBS -l mem=128gb
#PBS -l walltime=8:00:00

module load R/4.1.0
Rscript integration_script.R
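
The resource values in the script above are hard-coded. A hedged sketch of rendering the same header from parameters follows; `render_pbs` and `PBS_TEMPLATE` are hypothetical names, and the defaults simply mirror the values shown above.

```python
from string import Template

PBS_TEMPLATE = Template("""#!/bin/bash
#PBS -N $jobname
#PBS -l nodes=1:ppn=$ppn
#PBS -l mem=${mem_gb}gb
#PBS -l walltime=$walltime

module load R/$r_version
Rscript $script
""")


def render_pbs(jobname: str, script: str, ppn: int = 8, mem_gb: int = 128,
               walltime: str = "8:00:00", r_version: str = "4.1.0") -> str:
    """Fill the PBS header with per-job values."""
    if ppn < 1 or mem_gb < 1:
        raise ValueError("ppn and mem_gb must be positive")
    return PBS_TEMPLATE.substitute(
        jobname=jobname, ppn=ppn, mem_gb=mem_gb,
        walltime=walltime, r_version=r_version, script=script,
    )
```

For example, `render_pbs("Integration", "integration_script.R", mem_gb=64)` would produce the same script with a 64 GB memory request.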

Integration Methods

Harmony Integration

Fast and effective for most datasets:

# Harmony workflow
merged <- merge(x = seurat_list[[1]], y = seurat_list[2:length(seurat_list)])
merged <- NormalizeData(merged)
merged <- FindVariableFeatures(merged)
merged <- ScaleData(merged)
merged <- RunPCA(merged)
integrated <- RunHarmony(merged, group.by.vars = "orig.ident")

Seurat CCA Integration

Canonical Correlation Analysis approach:

# CCA workflow
anchors <- FindIntegrationAnchors(object.list = seurat_list)
integrated <- IntegrateData(anchorset = anchors)
DefaultAssay(integrated) <- "integrated"
integrated <- ScaleData(integrated)
integrated <- RunPCA(integrated)

FastMNN Integration

Mutual nearest neighbors approach:

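# FastMNN workflow
# batchelor::fastMNN() does not accept Seurat objects directly;
# SeuratWrappers::RunFastMNN() wraps it for a list of Seurat objects
# and passes extra arguments such as cos.norm through.
library(SeuratWrappers)
integrated <- RunFastMNN(object.list = seurat_list, cos.norm = TRUE)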

Output Analysis

Integrated Object Structure

The output Seurat object contains:

# Assays
integrated@assays$RNA          # Original counts
integrated@assays$integrated   # Integrated data (if CCA)

# Reductions
integrated@reductions$pca      # Principal components
integrated@reductions$harmony  # Harmony embeddings
integrated@reductions$umap     # UMAP coordinates

# Metadata
integrated@meta.data$orig.ident     # Original sample identity
integrated@meta.data$sample_id      # Sample identifier
integrated@meta.data$batch          # Batch information

Quality Assessment

Evaluate integration quality:

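# Visualize integration
DimPlot(integrated, group.by = "orig.ident")
DimPlot(integrated, group.by = "cell_type")  # requires a "cell_type" metadata column

# Calculate batch metrics
# assess_batch_correction() is a placeholder for a user-defined metric;
# the harmony package does not export a function with this name.
batch_score <- assess_batch_correction(integrated)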
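
One simple way to put a number on batch mixing, offered as an illustration and not as what Celline computes: for each cluster, take the Shannon entropy of the batch labels of its cells and divide by the log of the total number of batches. Values near 1 mean the batches are evenly mixed in that cluster; values near 0 mean one batch dominates. The sketch assumes cluster and batch labels have been exported from the Seurat metadata as two equal-length lists; `batch_mixing_entropy` is a hypothetical helper.

```python
import math
from collections import Counter, defaultdict


def batch_mixing_entropy(clusters: list[str], batches: list[str]) -> dict[str, float]:
    """Normalized Shannon entropy of batch labels within each cluster (0 to 1)."""
    if len(clusters) != len(batches):
        raise ValueError("clusters and batches must have one entry per cell")
    n_batches = len(set(batches))
    if n_batches < 2:
        raise ValueError("need at least two batches to measure mixing")
    # Count cells per batch inside each cluster.
    per_cluster: dict[str, Counter] = defaultdict(Counter)
    for cluster, batch in zip(clusters, batches):
        per_cluster[cluster][batch] += 1
    scores = {}
    for cluster, counts in per_cluster.items():
        total = sum(counts.values())
        entropy = sum(-(c / total) * math.log(c / total) for c in counts.values())
        # Normalize by the maximum possible entropy across all batches.
        scores[cluster] = entropy / math.log(n_batches)
    return scores
```

A low score is not automatically a failure: a cell type that is present in only one sample should stay batch-specific, so scores are best read alongside the `DimPlot` output.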

Error Handling

Common Issues

Memory Exhaustion: Reduce sample count or use cluster computing
R Package Missing: Install required packages (Seurat, harmony)
Incompatible Samples: Ensure same species and protocols
File Not Found: Verify all samples are counted

Troubleshooting

# Check sample compatibility
def check_compatibility(samples):
    species_set = set(s.schema.species for s in samples)
    if len(species_set) > 1:
        print(f"Warning: Multiple species found: {species_set}")
    
    platform_set = set(s.schema.platform for s in samples)
    if len(platform_set) > 1:
        print(f"Info: Multiple platforms: {platform_set}")

Best Practices

Sample Selection

Biological Coherence: Integrate related samples/conditions
Technical Compatibility: Use similar protocols and platforms
Quality Control: Ensure all samples pass QC
Batch Consideration: Be aware of batch effects

Integration Strategy

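# Progressive integration for large projects
from celline.functions.integrate import Integrate

def progressive_integration(project, batch_size=5):
    """Integrate samples in batches to manage memory (one output per batch)."""
    samples = list(project.samples.values())

    for i in range(0, len(samples), batch_size):
        batch = samples[i:i+batch_size]
        # Build the key set once per batch and bind it as a default argument,
        # so the filter does not depend on the loop variable at call time.
        batch_keys = {b.schema.key for b in batch}
        filter_func = lambda s, keys=batch_keys: s.key in keys

        project.call(Integrate(
            filter_func=filter_func,
            outfile_name=f"batch_{i//batch_size + 1}"
        ))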

Related Functions

Count - Generate count matrices before integration
Preprocess - Quality control before integration
CreateSeuratObject - Create Seurat objects
Reduce - Storage optimization after integration

Troubleshooting

Common Issues

R Memory Limit: Increase R memory or use cluster
Package Conflicts: Use consistent R package versions
File Permissions: Ensure write access to integration directory
Sample Mismatch: Verify sample metadata consistency

Debug Mode

Enable detailed R logging:

# In R script
options(error = traceback)
sessionInfo()
print(ls())  # List objects in environment

Manual Integration

For debugging, run integration manually:

# Navigate to integration directory
cd integration/

# Execute integration script manually
bash integrate_TIMESTAMP.sh
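
When several runs have accumulated, it may help to pick out the most recent generated script before re-running it by hand. A small sketch, assuming the `integrate_*.sh` naming shown in the file layout; `latest_integration_script` is a hypothetical helper, not a Celline function.

```python
from pathlib import Path
from typing import Optional


def latest_integration_script(integration_dir: str) -> Optional[Path]:
    """Return the most recently modified integrate_*.sh, or None if there is none."""
    scripts = list(Path(integration_dir).glob("integrate_*.sh"))
    if not scripts:
        return None
    # Modification time is used instead of parsing the timestamp in the name,
    # because the exact timestamp format is not specified here.
    return max(scripts, key=lambda p: p.stat().st_mtime)
```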
