Preprocess Function
Perform quality control and cell type filtering on counted single-cell data.
Overview
The Preprocess function performs comprehensive quality control on Cell Ranger count output. It calculates quality metrics, detects doublets using Scrublet, and filters cells based on gene count, mitochondrial content, and cell type criteria.
Class Information
- Module: celline.functions.preprocess
- Class: Preprocess
- Base Class: CellineFunction
Parameters
Constructor Parameters
Parameter | Type | Required | Description |
---|---|---|---|
target_celltype | Optional[list[str]] | No | List of target cell types to keep; when omitted, all cell types are included |
Usage Examples
Python API
Basic Preprocessing
from celline import Project
from celline.functions.preprocess import Preprocess
# Create project
project = Project("./my-project")
# Basic preprocessing (includes all cell types)
preprocess_function = Preprocess()
# Execute function
result = project.call(preprocess_function)
Cell Type-Specific Preprocessing
from celline import Project
from celline.functions.preprocess import Preprocess
# Create project
project = Project("./my-project")
# Preprocess with specific cell types
target_celltypes = ["Neuron", "Astrocyte", "Oligodendrocyte"]
preprocess_function = Preprocess(target_celltype=target_celltypes)
# Execute function
result = project.call(preprocess_function)
Multiple Preprocessing Scenarios
from celline import Project
from celline.functions.preprocess import Preprocess
# Create project
project = Project("./my-project")
# Scenario 1: All cell types
project.call(Preprocess())
# Scenario 2: Neuronal cells only
neuronal_types = ["Neuron", "Interneuron", "Motor_neuron"]
project.call(Preprocess(target_celltype=neuronal_types))
# Scenario 3: Immune cells only
immune_types = ["T_cell", "B_cell", "Macrophage", "NK_cell"]
project.call(Preprocess(target_celltype=immune_types))
CLI Usage
Basic Usage
# Preprocess all cell types
celline run preprocess
# Preprocess specific cell types
celline run preprocess --target-celltype Neuron Astrocyte
# Multiple cell types
celline run preprocess --target-celltype T_cell B_cell Macrophage NK_cell
CLI Arguments
Argument | Type | Description |
---|---|---|
--target-celltype, -t | str+ | One or more target cell types to include in preprocessing |
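The `str+` type suggests that the flag accepts one or more values. The following is a minimal sketch of how such an argument could be declared with `argparse`. It assumes `nargs="+"` and a default of `None`; both are assumptions, since the actual `add_cli_args` implementation is not shown in this document, and `build_parser` is a hypothetical helper.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser mirroring the documented flag; not Celline's own code.
    parser = argparse.ArgumentParser(prog="preprocess")
    parser.add_argument(
        "--target-celltype",
        "-t",
        type=str,
        nargs="+",     # one or more values, matching the "str+" type
        default=None,  # assumed: omitting the flag means no cell type restriction
        help="Target cell types to include in preprocessing",
    )
    return parser


args = build_parser().parse_args(["--target-celltype", "Neuron", "Astrocyte"])
print(args.target_celltype)
```

With `nargs="+"`, argparse collects every value after the flag into a list, which matches the `Optional[list[str]]` type of the constructor parameter.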
Implementation Details
Prerequisites
The function requires samples to be:
- Counted: Cell Ranger count must be completed
- Cell Type Predicted: Cell type prediction must be finished
Quality Control Pipeline
The preprocessing pipeline performs the following steps:
- Data Loading: Reads Cell Ranger HDF5 output using Scanpy
- Metadata Integration: Joins count data with cell type predictions
- Doublet Detection: Uses Scrublet to identify potential doublets
- QC Metrics Calculation: Computes standard quality control metrics
- Cell Filtering: Applies multi-criteria filtering
- Output Generation: Saves filtered cell information
Filtering Criteria
The function applies the following quality control filters:
Filter | Criteria | Purpose |
---|---|---|
Gene Count (Min) | ≥ 200 genes | Remove low-quality cells |
Gene Count (Max) | ≤ 5000 genes | Remove potential multiplets |
Mitochondrial Content | ≤ 5% | Remove dying/stressed cells |
Doublet Detection | Not predicted doublet | Remove computational doublets |
Cell Type | In target list | Include only specified types |
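The criteria in the table can be combined into one per-cell predicate. The following is a minimal sketch, not the function's actual implementation: the name `passes_qc` and its argument names are hypothetical, and it assumes that all criteria must hold at once and that an omitted target list means no cell type restriction.

```python
from typing import Optional, Sequence


def passes_qc(
    n_genes: int,
    pct_mt: float,
    predicted_doublet: bool,
    cell_type: str,
    target_celltype: Optional[Sequence[str]] = None,
) -> bool:
    """Return True when a cell satisfies every filter listed in the table."""
    if not 200 <= n_genes <= 5000:  # gene count bounds (inclusive)
        return False
    if pct_mt > 5.0:  # mitochondrial content in percent
        return False
    if predicted_doublet:  # Scrublet doublet call
        return False
    if target_celltype is not None and cell_type not in target_celltype:
        return False
    return True


print(passes_qc(2450, 2.1, False, "Neuron", ["Neuron", "Astrocyte"]))
print(passes_qc(5421, 8.2, True, "Doublet"))
```

The first call describes a cell inside every bound and in the target list, so it passes. The second fails on gene count, mitochondrial content, and doublet status.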
Scanpy Integration
The function uses Scanpy for data processing:
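import scanpy as sc
# Read Cell Ranger output
adata = sc.read_10x_h5(path.resources_sample_counted)
# Flag mitochondrial genes; qc_vars=["mt"] below requires this column in adata.var
adata.var["mt"] = adata.var_names.str.startswith("mt-")  # use "MT-" for human data
# Calculate QC metrics
sc.pp.calculate_qc_metrics(
    adata,
    qc_vars=["mt"],
    percent_top=None,
    log1p=False,
    inplace=True,
)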
Scrublet Doublet Detection
Doublet detection using Scrublet:
import scrublet as scr
# Initialize Scrublet
scrub = scr.Scrublet(adata.X)
# Detect doublets
doublet_scores, predicted_doublets = scrub.scrub_doublets(verbose=False)
# Add to observation metadata
adata.obs["doublet_score"] = doublet_scores
adata.obs["predicted_doublets"] = predicted_doublets
Output Files
Cell Information File
The function generates a comprehensive cell information file:
data/SAMPLE_ID/cell_info.tsv
File Structure
Column | Type | Description |
---|---|---|
barcode | str | Cell barcode sequence |
project | str | Project identifier |
sample | str | Sample identifier |
cell | str | Unique cell identifier |
cell_type | str | Predicted cell type |
include | bool | Pass QC filters |
n_genes_by_counts | int | Number of detected genes |
total_counts | int | Total UMI count |
pct_counts_mt | float | Mitochondrial gene percentage |
doublet_score | float | Scrublet doublet score (0-1; higher means more doublet-like) |
predicted_doublets | bool | Doublet prediction |
Example Output
barcode project sample cell cell_type include n_genes_by_counts total_counts pct_counts_mt doublet_score predicted_doublets
AAACCTGAGAAGGCCT-1 GSE123456 GSM789012 GSM789012_1 Neuron true 2450 8924 2.1 0.05 false
AAACCTGAGAAGGCCT-2 GSE123456 GSM789012 GSM789012_2 Astrocyte true 1876 5234 1.8 0.03 false
AAACCTGAGAAGGCCT-3 GSE123456 GSM789012 GSM789012_3 Doublet false 5421 15673 8.2 0.92 true
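Downstream code can read this file with any TSV reader. The sketch below uses only the standard library and an in-memory copy of the example rows. It assumes the file is tab-separated and that booleans are written as lowercase `true`/`false`, as in the example; the helper name `included_cells_by_type` is hypothetical.

```python
import csv
import io
from collections import Counter

HEADER = [
    "barcode", "project", "sample", "cell", "cell_type", "include",
    "n_genes_by_counts", "total_counts", "pct_counts_mt",
    "doublet_score", "predicted_doublets",
]
ROWS = [
    ["AAACCTGAGAAGGCCT-1", "GSE123456", "GSM789012", "GSM789012_1", "Neuron",
     "true", "2450", "8924", "2.1", "0.05", "false"],
    ["AAACCTGAGAAGGCCT-2", "GSE123456", "GSM789012", "GSM789012_2", "Astrocyte",
     "true", "1876", "5234", "1.8", "0.03", "false"],
    ["AAACCTGAGAAGGCCT-3", "GSE123456", "GSM789012", "GSM789012_3", "Doublet",
     "false", "5421", "15673", "8.2", "0.92", "true"],
]
# In-memory stand-in for data/SAMPLE_ID/cell_info.tsv
EXAMPLE_TSV = "\n".join("\t".join(row) for row in [HEADER, *ROWS]) + "\n"


def included_cells_by_type(handle) -> Counter:
    """Count cells flagged include == true, grouped by cell_type."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(
        row["cell_type"] for row in reader if row["include"].lower() == "true"
    )


counts = included_cells_by_type(io.StringIO(EXAMPLE_TSV))
print(dict(counts))
```

For a real file, pass an open file handle instead of the `io.StringIO` object, for example `open("data/SAMPLE_ID/cell_info.tsv", newline="")`.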
Quality Control Metrics
Standard Metrics
The function calculates standard single-cell QC metrics:
Metric | Description | Typical Range |
---|---|---|
n_genes_by_counts | Genes with non-zero counts | 200-5000 |
total_counts | Total UMI count per cell | 1000-50000 |
pct_counts_mt | Mitochondrial gene percentage | 0-20% |
doublet_score | Scrublet doublet score | 0-1 |
Mitochondrial Gene Detection
Mitochondrial genes are identified by prefix:
# Default mitochondrial gene prefix
mt_prefix = "mt-" # For mouse
# For human data, use "MT-"
# Mark mitochondrial genes
adata.var["mt"] = adata.var_names.str.startswith(mt_prefix)
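To make the prefix rule concrete, here is a small standard-library sketch of how a mitochondrial percentage follows from gene names and counts for a single cell. The function `pct_counts_mt` is a hypothetical illustration; in the pipeline this value comes from Scanpy's `calculate_qc_metrics`.

```python
from typing import Sequence


def pct_counts_mt(
    gene_names: Sequence[str],
    counts: Sequence[float],
    mt_prefix: str = "mt-",
) -> float:
    """Percentage of a cell's counts that come from genes with the given prefix."""
    if len(gene_names) != len(counts):
        raise ValueError("gene_names and counts must have the same length")
    total = sum(counts)
    if total == 0:
        return 0.0  # avoid division by zero for empty cells
    mt_total = sum(c for g, c in zip(gene_names, counts) if g.startswith(mt_prefix))
    return 100.0 * mt_total / total


# Mouse-style gene names: 5 of 100 counts are mitochondrial
print(pct_counts_mt(["mt-Co1", "mt-Nd1", "Actb", "Gapdh"], [3, 2, 60, 35]))
# Human-style gene names with the mouse prefix: nothing matches
print(pct_counts_mt(["MT-CO1", "ACTB"], [3, 97]))
# Same cell with the human prefix
print(pct_counts_mt(["MT-CO1", "ACTB"], [3, 97], mt_prefix="MT-"))
```

The prefix match is case-sensitive, so using the mouse prefix on human data reports zero mitochondrial content and silently disables the mitochondrial filter.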
Cell Type Integration
Cell types from prediction are integrated with QC data:
import polars as pl
# Read cell type predictions
celltype_data = pl.read_csv(
    path.data_sample_predicted_celltype,
    separator="\t",
).rename({"scpred_prediction": "cell_type"})
# Join with observation data on the shared "cell" identifier
obs = obs.join(celltype_data, on="cell")
Error Handling
Common Issues
- Missing Count Data: Ensure Cell Ranger count completed successfully
- Missing Cell Type Predictions: Run cell type prediction first
- Memory Issues: Large datasets may require substantial RAM
- File Permissions: Check read/write permissions
Validation Checks
import os
# Inside the per-sample loop: skip samples that are not ready
if not (path.is_counted and path.is_predicted_celltype):
    print(f"Sample {sample_id} not ready for preprocessing")
    continue
# Validate that the cell type prediction file exists
if not os.path.exists(path.data_sample_predicted_celltype):
    raise FileNotFoundError("Cell type predictions not found")
Performance Considerations
Memory Usage
Memory requirements depend on dataset size:
Dataset Size | Memory Needed |
---|---|
<5K cells | 4 GB |
5K-20K cells | 8 GB |
20K-50K cells | 16 GB |
>50K cells | 32+ GB |
Processing Time
Typical processing times:
Dataset Size | Processing Time |
---|---|
<5K cells | 30 seconds |
5K-20K cells | 1-2 minutes |
20K-50K cells | 2-5 minutes |
>50K cells | 5+ minutes |
Optimization Tips
- Filter Early: Apply cell type filters before heavy computations
- Memory Management: Process samples sequentially for large datasets
- Batch Processing: Group similar samples for efficiency
Methods
call(project: Project) -> Project
Main execution method that performs preprocessing on all eligible samples.
Parameters:
- project: The Celline project instance
Returns: Updated project instance
Process:
- Reads samples from samples.toml
- Validates prerequisites for each sample
- Loads count data and cell type predictions
- Calculates QC metrics and detects doublets
- Applies filtering criteria
- Saves filtered cell information
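The control flow of these steps can be sketched as a loop that skips samples whose prerequisites are not met. This is an illustrative outline only: `SampleState`, `preprocess_all`, and the `process_one` callback are hypothetical names standing in for Celline's internal objects, and the second sample ID is made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class SampleState:
    # Hypothetical stand-in for the per-sample status object
    sample_id: str
    is_counted: bool
    is_predicted_celltype: bool


def preprocess_all(
    samples: Iterable[SampleState],
    process_one: Callable[[SampleState], None],
) -> List[str]:
    """Run process_one on every ready sample and return the processed IDs."""
    processed = []
    for sample in samples:
        # Validate prerequisites; skip samples that are not ready
        if not (sample.is_counted and sample.is_predicted_celltype):
            print(f"Sample {sample.sample_id} not ready for preprocessing")
            continue
        # Load, compute QC, filter, and save would happen inside process_one
        process_one(sample)
        processed.append(sample.sample_id)
    return processed


samples = [
    SampleState("GSM789012", is_counted=True, is_predicted_celltype=True),
    SampleState("GSM789013", is_counted=True, is_predicted_celltype=False),
]
done = preprocess_all(samples, process_one=lambda sample: None)
print(done)
```

Skipping rather than raising lets one incomplete sample be reported without stopping the run for the rest of the project.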
add_cli_args(parser: ArgumentParser) -> None
Adds CLI-specific arguments to the argument parser.
cli(project: Project, args: Namespace) -> Project
CLI entry point that processes command-line arguments and executes the function.
Integration with Pipeline
Typical Workflow
from celline import Project
from celline.functions.count import Count
from celline.functions.predict_celltype import PredictCelltype
from celline.functions.preprocess import Preprocess
from celline.functions.create_seurat import CreateSeuratObject
# Complete pipeline
project = Project("./my-project")
# Count reads
project.call(Count(nthread=8))
# Predict cell types (`model` must be defined beforehand; it is not created in this snippet)
project.call(PredictCelltype(model))
# Preprocess with QC
project.call(Preprocess(target_celltype=["Neuron", "Astrocyte"]))
# Create Seurat objects
project.call(CreateSeuratObject(useqc_matrix=True))
Cell Type Filtering
Supported Cell Types
The function supports any cell types from your prediction model:
Common Neuroscience Types:
- Neuron, Astrocyte, Oligodendrocyte, Microglia
- Interneuron, Pyramidal_neuron, Motor_neuron
- OPC (Oligodendrocyte Precursor Cell)
Common Immunology Types:
- T_cell, B_cell, NK_cell, Macrophage
- CD4_T_cell, CD8_T_cell, Regulatory_T_cell
- Plasma_cell, Dendritic_cell
Common Development Types:
- Stem_cell, Progenitor_cell
- Epithelial_cell, Endothelial_cell
- Fibroblast, Smooth_muscle_cell
Custom Cell Type Lists
# Create custom cell type groups
epithelial_types = [
    "Epithelial_cell", "Basal_cell", "Ciliated_cell",
    "Goblet_cell", "Club_cell",
]
# Preprocess with custom types
project.call(Preprocess(target_celltype=epithelial_types))
Related Functions
- Count - Generate count matrices before preprocessing
- PredictCelltype - Predict cell types before preprocessing
- CreateSeuratObject - Create Seurat objects after preprocessing
- Integrate - Integrate preprocessed samples
Troubleshooting
Common Issues
- Memory Error: Reduce dataset size or increase system memory
- Missing Dependencies: Install required Python packages (scanpy, scrublet, polars)
- File Not Found: Ensure count and prediction steps completed successfully
- Empty Output: Check cell type names match prediction output
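For the empty-output case, a quick check is to compare the requested cell types against the labels that actually occur in the prediction file. The sketch below is a hypothetical helper (`unmatched_celltypes` is not part of Celline) that uses the standard library's `difflib` to suggest near matches for labels that differ only in case or spelling.

```python
import difflib
from typing import Dict, Iterable, List


def unmatched_celltypes(
    targets: Iterable[str],
    predicted_labels: Iterable[str],
) -> Dict[str, List[str]]:
    """Map each target label missing from the predictions to close matches."""
    available = sorted(set(predicted_labels))
    return {
        target: difflib.get_close_matches(target, available, n=3, cutoff=0.6)
        for target in targets
        if target not in available
    }


report = unmatched_celltypes(
    targets=["Neuron", "astrocyte", "T_cell"],
    predicted_labels=["Neuron", "Astrocyte", "Microglia", "Neuron"],
)
for target, suggestions in report.items():
    print(f"{target!r} not found; close matches: {suggestions}")
```

An empty suggestion list means the label is probably absent from the prediction model altogether, rather than misspelled.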
Debug Mode
Enable detailed logging:
import logging
logging.basicConfig(level=logging.DEBUG)
# Also enable scanpy logging
import scanpy as sc
sc.settings.verbosity = 3
Manual Quality Control
For custom QC parameters:
# Custom filtering criteria (built-in thresholds in parentheses)
custom_filter = {
    "min_genes": 500,    # Stricter lower bound on detected genes (built-in: 200)
    "max_genes": 3000,   # Stricter upper bound on detected genes (built-in: 5000)
    "max_mt_pct": 10,    # More permissive mitochondrial threshold (built-in: 5%)
    "min_counts": 1000,  # Minimum total UMI count per cell
}
# Apply custom filters in your analysis
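One way to apply thresholds like these is to filter the rows of cell_info.tsv after loading them. The sketch below is a minimal illustration under stated assumptions: `apply_custom_filter` is a hypothetical helper, all bounds are treated as inclusive, and the rows reuse the values from the example output rather than real data.

```python
from typing import Any, List, Mapping

custom_filter = {
    "min_genes": 500,
    "max_genes": 3000,
    "max_mt_pct": 10,
    "min_counts": 1000,
}


def apply_custom_filter(
    rows: List[Mapping[str, Any]],
    thresholds: Mapping[str, float],
) -> List[Mapping[str, Any]]:
    """Keep rows whose QC metrics fall within the given thresholds."""
    return [
        row
        for row in rows
        if thresholds["min_genes"] <= row["n_genes_by_counts"] <= thresholds["max_genes"]
        and row["pct_counts_mt"] <= thresholds["max_mt_pct"]
        and row["total_counts"] >= thresholds["min_counts"]
    ]


cells = [
    {"cell": "GSM789012_1", "n_genes_by_counts": 2450, "total_counts": 8924, "pct_counts_mt": 2.1},
    {"cell": "GSM789012_2", "n_genes_by_counts": 1876, "total_counts": 5234, "pct_counts_mt": 1.8},
    {"cell": "GSM789012_3", "n_genes_by_counts": 5421, "total_counts": 15673, "pct_counts_mt": 8.2},
]
kept = apply_custom_filter(cells, custom_filter)
print([cell["cell"] for cell in kept])
```

The third cell passes the relaxed mitochondrial threshold (8.2% is below 10%) but is still dropped because its gene count exceeds `max_genes`.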