Quick Start Tutorial
This tutorial walks you through a complete single-cell RNA-seq analysis workflow in Celline, using real data.
🎯 What You'll Learn in This Tutorial
- Creating and configuring a Celline project
- Obtaining samples from public databases
- Data preprocessing and quality control
- Interactive analysis and visualization
📊 Data Used
This tutorial uses single-cell RNA-seq data from mouse brain cells (GSE115189).
- Dataset: GSE115189
- Number of samples: 2
- Cell count: Approximately 3,000 cells
- Species: Mus musculus
🚀 Step 1: Project Setup
Creating a Project Directory
# Create working directory
mkdir celline-tutorial
cd celline-tutorial
# Initialize Celline project
# The project name is optional - if not provided, current directory will be used
celline init tutorial-project
Configuration Verification
# Verify installation by listing all available functions
celline list
# Display system information and check installation status
celline info
# Configure execution settings (optional)
celline config --system multithreading --nthread 2
What these commands do:
- celline list: Shows all available analysis functions (add, download, count, etc.)
- celline info: Displays system information and validates your installation
- celline config: Sets execution preferences. Available systems are multithreading (local) or PBS (cluster)
Configuration options explained:
- Execution system: multithreading runs jobs on your local machine, PBS submits to a cluster
- Number of threads: 2 is safe for most systems; increase it if you have more CPU cores
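If you are unsure how many threads to request, a quick check of the cores available on your machine helps. This is a minimal Python sketch; any equivalent check works:
# Check how many CPU cores are available before choosing --nthread
import os

cores = os.cpu_count() or 1
print(f"Available CPU cores: {cores}")
print(f"A conservative --nthread choice: {max(1, cores // 2)}")  # leave headroom for other work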
🔍 Step 2: Data Acquisition
Adding Samples
# Add an entire dataset from GEO (Gene Expression Omnibus)
celline run add GSE115189
# Alternative: Add individual samples one by one
# celline run add GSM3169075 GSM3169076
What the add command does:
- Downloads metadata for the specified dataset or samples
- Queries public databases (GEO, SRA) to find associated sequencing runs
- Creates a samples.toml file with sample information
- Sets up the project structure for data download
Verifying Sample Information
# Check added samples
cat samples.toml
Example output:
GSM3169075 = "Control sample 1"
GSM3169076 = "Control sample 2"
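If you prefer to inspect the file programmatically, here is a minimal Python sketch that reads samples.toml with the standard-library tomllib module (Python 3.11+), assuming the flat sample-ID-to-description layout shown above:
# List the registered samples from samples.toml (tomllib requires Python 3.11+)
import tomllib

with open("samples.toml", "rb") as f:
    samples = tomllib.load(f)

for sample_id, description in samples.items():
    print(f"{sample_id}: {description}")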
⬇️ Step 3: Data Download
Downloading Sequencing Data
# Download sequencing data using multiple threads for faster processing
celline run download --nthread 2
# Monitor download progress
ls -la resources/
# Check detailed download status
find resources/ -name "*.log" -exec tail -5 {} \;
Understanding the download process:
- Downloads FASTQ files from NCBI SRA (Sequence Read Archive)
- --nthread 2 enables parallel downloading for faster completion
- Files are saved in the resources/[SAMPLE_ID]/raw/ directory
- Log files in resources/[SAMPLE_ID]/log/ track download progress
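As a quick sanity check, the following Python sketch tallies the downloaded FASTQ files and their total size per sample, assuming the resources/[SAMPLE_ID]/raw/ layout described above:
# Summarize downloaded FASTQ files per sample (assumes resources/[SAMPLE_ID]/raw/)
from pathlib import Path

for sample_dir in sorted(Path("resources").iterdir()):
    fastqs = list(sample_dir.glob("raw/**/*.fastq.gz"))
    total_gb = sum(f.stat().st_size for f in fastqs) / 1e9
    print(f"{sample_dir.name}: {len(fastqs)} FASTQ files, {total_gb:.2f} GB")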
Verifying Data Structure
# Check downloaded files
find resources/ -name "*.fastq.gz" | head -5
🧮 Step 4: Count Processing
Counting with Cell Ranger
# Set up reference transcriptome for alignment (required before counting)
# This command will prompt you to specify species and reference path
celline run set_transcriptome
# Process raw FASTQ files to generate count matrices
celline run count
What happens during counting:
- set_transcriptome: Registers the reference genome/transcriptome used for alignment
- count: Runs Cell Ranger or a similar tool to align reads and count gene expression
- Creates feature-barcode matrices (genes × cells) for each sample
- Generates quality control metrics and filtering information
This process takes time (typically several hours or more). Upon completion, the following will be generated for each sample:
- Feature-barcode matrix
- Quality metrics
- Cell/gene summary
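Once counting finishes, you can peek at a generated matrix directly from Python. This is only a sketch: the path below assumes a Cell Ranger-style filtered_feature_bc_matrix directory and is hypothetical; adjust it to wherever the count step wrote its output for your sample.
# Inspect a Cell Ranger-style filtered matrix (the path below is an assumption)
import gzip
from pathlib import Path

import numpy as np
from scipy.io import mmread  # pip install scipy numpy

matrix_dir = Path("resources/GSM3169075/counted/outs/filtered_feature_bc_matrix")  # hypothetical path

counts = mmread(str(matrix_dir / "matrix.mtx.gz")).tocsc()  # genes x cells, sparse
with gzip.open(matrix_dir / "barcodes.tsv.gz", "rt") as f:
    n_barcodes = sum(1 for _ in f)

print(f"Matrix shape (genes x cells): {counts.shape}")
print(f"Cell barcodes: {n_barcodes}")
print(f"Median UMIs per cell: {np.median(np.asarray(counts.sum(axis=0)).ravel()):.0f}")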
🔬 Step 5: Quality Control and Preprocessing
Creating Seurat Objects
# Convert count matrices to Seurat objects for analysis in R
celline run create_seurat
About Seurat objects:
- Seurat is an R package for single-cell analysis
- This step converts raw count data into a structured format
- Enables downstream analysis like clustering, differential expression
- Creates .rds files that can be loaded in R
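The exact output location of the .rds files can vary, so here is a simple way to find them from Python without assuming a path:
# Locate the Seurat .rds files created by create_seurat (no path assumptions)
from pathlib import Path

for rds in sorted(Path(".").rglob("*.rds")):
    size_mb = rds.stat().st_size / 1e6
    print(f"{rds} ({size_mb:.1f} MB)")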
Executing Preprocessing
# Perform quality control filtering and cell/gene selection
celline run preprocess
# Examine QC results
ls -la data/*/cell_info.tsv
# View preprocessing summary
head -n 20 data/*/cell_info.tsv
Quality control steps performed:
- Cell filtering: Removes low-quality cells based on gene count and mitochondrial content
- Gene filtering: Removes genes expressed in too few cells
- Doublet detection: Identifies potential cell doublets (two cells captured together)
- Metrics calculation: Computes QC statistics for each cell
Checking QC Metrics
# Check QC statistics for each sample
head data/GSM3169075/cell_info.tsv
Example output:
barcode project sample cell cell_type include n_genes_by_counts pct_counts_mt doublet_score predicted_doublets
AAACCTGAGAAGGCCT-1 GSE115189 GSM3169075 GSM3169075_1 Neuron true 2847 1.2 0.05 false
🌐 Step 6: Interactive Analysis
Starting the Web Interface
# Start interactive mode
celline interactive
Access http://localhost:8080 in your browser.
Interface Features
- Sample Overview: Sample list and processing status
- Quality Control: QC metrics visualization
- Cell Type Analysis: Cell type distribution overview
- Gene Expression: Gene expression pattern exploration
📈 Step 7: Verifying Analysis Results
Checking Basic Statistics
# Verify analysis results with Python
# First install required packages: pip install polars matplotlib
import polars as pl
# Load the quality control results
cell_info = pl.read_csv("data/GSM3169075/cell_info.tsv", separator="\t")
# Display basic statistics
print(f"Total cells processed: {len(cell_info)}")
print(f"Cells passing QC filters: {cell_info.filter(pl.col('include')).height}")
print(f"Cell types detected: {cell_info['cell_type'].unique().to_list()}")
# Show QC metric ranges
print(f"Gene count range: {cell_info['n_genes_by_counts'].min()} - {cell_info['n_genes_by_counts'].max()}")
print(f"Mitochondrial % range: {cell_info['pct_counts_mt'].min():.1f} - {cell_info['pct_counts_mt'].max():.1f}")
Understanding the QC metrics:
- n_genes_by_counts: Number of genes detected per cell (typical range: 200-5000)
- pct_counts_mt: Percentage of mitochondrial gene expression (should be ≤5%)
- include: Boolean flag indicating whether the cell passes all QC filters
- cell_type: Predicted or annotated cell type (if available)
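To see how such thresholds play out on your own data, here is a minimal polars sketch that applies the rule-of-thumb cutoffs quoted above to cell_info.tsv. These are illustrative values, not necessarily the exact filters Celline's preprocess step uses:
# Apply illustrative QC thresholds and compare with Celline's include flag
import polars as pl

cell_info = pl.read_csv("data/GSM3169075/cell_info.tsv", separator="\t")
passing = cell_info.filter(
    (pl.col("n_genes_by_counts") >= 200)
    & (pl.col("n_genes_by_counts") <= 5000)
    & (pl.col("pct_counts_mt") <= 5.0)
    & (~pl.col("predicted_doublets"))
)
print(f"Cells passing the illustrative thresholds: {passing.height} / {cell_info.height}")
print(f"Cells kept by Celline's include flag: {cell_info.filter(pl.col('include')).height}")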
Visualizing Quality Metrics
# Visualizing QC metrics
import matplotlib.pyplot as plt
# Gene count distribution
plt.figure(figsize=(10, 6))
plt.subplot(1, 2, 1)
plt.hist(cell_info["n_genes_by_counts"], bins=50)
plt.xlabel("Number of genes")
plt.ylabel("Number of cells")
# Mitochondrial gene percentage distribution
plt.subplot(1, 2, 2)
plt.hist(cell_info["pct_counts_mt"], bins=50)
plt.xlabel("Mitochondrial gene percentage")
plt.ylabel("Number of cells")
plt.tight_layout()
plt.show()
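A common follow-up view is to plot gene count against mitochondrial percentage per cell, colored by whether the cell passed QC. This sketch reuses the cell_info frame loaded in the previous snippets:
# Genes vs. mitochondrial % per cell, split by QC outcome (reuses cell_info from above)
import matplotlib.pyplot as plt
import polars as pl

passed = cell_info.filter(pl.col("include"))
failed = cell_info.filter(~pl.col("include"))

plt.figure(figsize=(6, 5))
plt.scatter(passed["n_genes_by_counts"], passed["pct_counts_mt"], s=5, label="kept")
plt.scatter(failed["n_genes_by_counts"], failed["pct_counts_mt"], s=5, label="filtered out")
plt.xlabel("Number of genes")
plt.ylabel("Mitochondrial gene percentage")
plt.legend()
plt.show()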
🔬 Step 8: Advanced Analysis (Optional)
Cell Type Prediction
# Cell type prediction using pre-trained models
celline run predict_celltype
Dimensionality Reduction and Clustering
# Dimensionality reduction (PCA, UMAP)
celline run reduce
# Batch effect correction
celline run integrate
Custom Analysis
# Custom analysis example in Python
from celline import Project
from celline.data import Seurat
# Load project
project = Project("./")
# Get Seurat object
seurat = project.seurat("GSE115189", "GSM3169075")
# Execute custom analysis
# ...
📊 Interpreting Results
Quality Control Metrics
- n_genes_by_counts: Range of 200-5000 is typical
- pct_counts_mt: 5% or less is recommended
- doublet_score: 0.3 or less is typical
Cell Filtering
Statistics after filtering:
# Check filtering statistics
celline run info
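To get the same picture across every sample at once, here is a minimal sketch that aggregates each cell_info.tsv, assuming the data/[SAMPLE_ID]/cell_info.tsv layout used earlier in this tutorial:
# Per-sample filtering statistics from every cell_info.tsv under data/
from pathlib import Path

import polars as pl

for path in sorted(Path("data").glob("*/cell_info.tsv")):
    df = pl.read_csv(path, separator="\t")
    kept = df.filter(pl.col("include")).height
    print(f"{path.parent.name}: {kept} / {df.height} cells kept ({100 * kept / df.height:.1f}%)")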
🔄 Workflow Automation
Batch Processing Script
#!/bin/bash
# complete_workflow.sh
# Project initialization
celline init
# Add samples
celline run add GSE115189
# Data processing pipeline
celline run download --nthread 4
celline run count
celline run create_seurat
celline run preprocess
# Analysis
celline run predict_celltype
celline run reduce
echo "Analysis complete! Check results with celline interactive."
Python API Usage Example
# Python API example - complete analysis pipeline
from celline import Project
from celline.functions import Add, Download, Count, CreateSeurat, Preprocess
# Initialize project in current directory
project = Project("./")
# Step-by-step analysis pipeline
try:
    # Step 1: Add samples from GEO database
    print("Adding samples...")
    add_func = Add(sample_ids=["GSE115189"])
    project = add_func.call(project)

    # Step 2: Download sequencing data
    print("Downloading data...")
    download_func = Download()
    project = download_func.call(project)

    # Step 3: Count gene expression
    print("Processing counts...")
    count_func = Count(nthread=2)
    project = count_func.call(project)

    # Step 4: Create Seurat objects
    print("Creating Seurat objects...")
    seurat_func = CreateSeurat()
    project = seurat_func.call(project)

    # Step 5: Quality control and preprocessing
    print("Running QC and preprocessing...")
    preprocess_func = Preprocess()
    project = preprocess_func.call(project)

    print("Analysis pipeline completed successfully!")
except Exception as e:
    print(f"Analysis failed: {e}")
    print("Check log files in resources/*/log/ for detailed error information")
🎉 Congratulations!
In this tutorial, you have completed the following:
- ✅ Creating a Celline project
- ✅ Data acquisition from public databases
- ✅ Data preprocessing and quality control
- ✅ Using the interactive analysis environment
📚 Next Steps
- CLI Reference - More detailed command options
- Interactive Mode - Detailed web interface features
- API Reference - Usage with Python programming
- Developer Guide - Developing custom features
💡 Tips and Best Practices
Memory Management
# For large datasets
celline config --nthread 1 # Reduce thread count
Progress Monitoring
# Check progress with log files
tail -f resources/*/log/*.log
Data Backup
# Backup important results
tar -czf results_backup.tar.gz data/ results/
Tutorial complete! For further analysis, please see Advanced Usage.