/Quick Start Tutorial

Quick Start Tutorial

This tutorial will teach you a complete workflow for performing single-cell RNA-seq analysis using Celline with real data.

🎯 What You'll Learn in This Tutorial

  • Creating and configuring a Celline project
  • Obtaining samples from public databases
  • Data preprocessing and quality control
  • Interactive analysis and visualization

📊 Data Used

This tutorial uses single-cell RNA-seq data from mouse brain cells (GSE115189).

  • Dataset: GSE115189
  • Number of samples: 2 samples
  • Cell count: Approximately 3,000 cells
  • Species: Mus musculus

🚀 Step 1: Project Setup

Creating a Project Directory

      # Create working directory
mkdir celline-tutorial
cd celline-tutorial

# Initialize Celline project
# The project name is optional - if not provided, current directory will be used
celline init tutorial-project

    

Configuration Verification

      # Verify installation by listing all available functions
celline list

# Display system information and check installation status
celline info

# Configure execution settings (optional)
celline config --system multithreading --nthread 2

    

What these commands do:

  • celline list: Shows all available analysis functions (add, download, count, etc.)
  • celline info: Displays system information and validates your installation
  • celline config: Sets execution preferences. Available systems are multithreading (local) or PBS (cluster)

Configuration options explained:

  • Execution system: multithreading runs jobs on your local machine, PBS submits to a cluster
  • Number of threads: 2 is safe for most systems, increase if you have more CPU cores

🔍 Step 2: Data Acquisition

Adding Samples

      # Add an entire dataset from GEO (Gene Expression Omnibus)
celline run add GSE115189

# Alternative: Add individual samples one by one
# celline run add GSM3169075 GSM3169076

    

What the add command does:

  • Downloads metadata for the specified dataset or samples
  • Queries public databases (GEO, SRA) to find associated sequencing runs
  • Creates a samples.toml file with sample information
  • Sets up the project structure for data download

Verifying Sample Information

      # Check added samples
cat samples.toml

    

Example output:

      GSM3169075 = "Control sample 1"
GSM3169076 = "Control sample 2"

    

⬇️ Step 3: Data Download

Downloading Sequencing Data

      # Download sequencing data using multiple threads for faster processing
celline run download --nthread 2

# Monitor download progress
ls -la resources/

# Check detailed download status
find resources/ -name "*.log" -exec tail -5 {} \;

    

Understanding the download process:

  • Downloads FASTQ files from NCBI SRA (Sequence Read Archive)
  • --nthread 2 enables parallel downloading for faster completion
  • Files are saved in resources/[SAMPLE_ID]/raw/ directory
  • Log files in resources/[SAMPLE_ID]/log/ track download progress

Verifying Data Structure

      # Check downloaded files
find resources/ -name "*.fastq.gz" | head -5

    

🧮 Step 4: Count Processing

Counting with Cell Ranger

      # Set up reference transcriptome for alignment (required before counting)
# This command will prompt you to specify species and reference path
celline run set_transcriptome

# Process raw FASTQ files to generate count matrices
celline run count

    

What happens during counting:

  • set_transcriptome: Registers reference genome/transcriptome for alignment
  • count: Runs Cell Ranger or similar tool to align reads and count gene expression
  • Creates feature-barcode matrices (genes × cells) for each sample
  • Generates quality control metrics and filtering information

This process takes time (several hours~). Upon completion, the following will be generated for each sample:

  • Feature-barcode matrix
  • Quality metrics
  • Cell/gene summary

🔬 Step 5: Quality Control and Preprocessing

Creating Seurat Objects

      # Convert count matrices to Seurat objects for analysis in R
celline run create_seurat

    

About Seurat objects:

  • Seurat is an R package for single-cell analysis
  • This step converts raw count data into a structured format
  • Enables downstream analysis like clustering, differential expression
  • Creates .rds files that can be loaded in R

Executing Preprocessing

      # Perform quality control filtering and cell/gene selection
celline run preprocess

# Examine QC results
ls -la data/*/cell_info.tsv

# View preprocessing summary
head -n 20 data/*/cell_info.tsv

    

Quality control steps performed:

  • Cell filtering: Removes low-quality cells based on gene count and mitochondrial content
  • Gene filtering: Removes genes expressed in too few cells
  • Doublet detection: Identifies potential cell doublets (two cells captured together)
  • Metrics calculation: Computes QC statistics for each cell

Checking QC Metrics

      # Check QC statistics for each sample
head data/GSM3169075/cell_info.tsv

    

Example output:

      barcode project sample  cell    cell_type   include n_genes_by_counts   pct_counts_mt   doublet_score   predicted_doublets
AAACCTGAGAAGGCCT-1  GSE115189   GSM3169075  GSM3169075_1    Neuron  true    2847    1.2 0.05    false

    

🌐 Step 6: Interactive Analysis

Starting the Web Interface

      # Start interactive mode
celline interactive

    

Access http://localhost:8080 in your browser.

Interface Features

  1. Sample Overview: Sample list and processing status
  2. Quality Control: QC metrics visualization
  3. Cell Type Analysis: Cell type distribution verification
  4. Gene Expression: Gene expression pattern exploration

📈 Step 7: Verifying Analysis Results

Checking Basic Statistics

      # Verify analysis results with Python
# First install required packages: pip install polars matplotlib
import polars as pl

# Load the quality control results
cell_info = pl.read_csv("data/GSM3169075/cell_info.tsv", separator="\t")

# Display basic statistics
print(f"Total cells processed: {len(cell_info)}")
print(f"Cells passing QC filters: {cell_info.filter(pl.col('include')).height}")
print(f"Cell types detected: {cell_info['cell_type'].unique().to_list()}")

# Show QC metric ranges
print(f"Gene count range: {cell_info['n_genes_by_counts'].min()} - {cell_info['n_genes_by_counts'].max()}")
print(f"Mitochondrial % range: {cell_info['pct_counts_mt'].min():.1f} - {cell_info['pct_counts_mt'].max():.1f}")

    

Understanding the QC metrics:

  • n_genes_by_counts: Number of genes detected per cell (typical range: 200-5000)
  • pct_counts_mt: Percentage of mitochondrial gene expression (should be ≤5%)
  • include: Boolean flag indicating if cell passes all QC filters
  • cell_type: Predicted or annotated cell type (if available)

Visualizing Quality Metrics

      # Visualizing QC metrics
import matplotlib.pyplot as plt

# Gene count distribution
plt.figure(figsize=(10, 6))
plt.subplot(1, 2, 1)
plt.hist(cell_info["n_genes_by_counts"], bins=50)
plt.xlabel("Number of genes")
plt.ylabel("Number of cells")

# Mitochondrial gene percentage distribution
plt.subplot(1, 2, 2)
plt.hist(cell_info["pct_counts_mt"], bins=50)
plt.xlabel("Mitochondrial gene percentage")
plt.ylabel("Number of cells")

plt.tight_layout()
plt.show()

    

🔬 Step 8: Advanced Analysis (Optional)

Cell Type Prediction

      # Cell type prediction using pre-trained models
celline run predict_celltype

    

Dimensionality Reduction and Clustering

      # Dimensionality reduction (PCA, UMAP)
celline run reduce

# Batch effect correction
celline run integrate

    

Custom Analysis

      # Custom analysis example in Python
from celline import Project
from celline.data import Seurat

# Load project
project = Project("./")

# Get Seurat object
seurat = project.seurat("GSE115189", "GSM3169075")

# Execute custom analysis
# ...

    

📊 Interpreting Results

Quality Control Metrics

  • n_genes_by_counts: Range of 200-5000 is typical
  • pct_counts_mt: 5% or less is recommended
  • doublet_score: 0.3 or less is typical

Cell Filtering

Statistics after filtering:

      # Check filtering statistics
celline run info

    

🔄 Workflow Automation

Batch Processing Script

      #!/bin/bash
# complete_workflow.sh

# Project initialization
celline init

# Add samples
celline run add GSE115189

# Data processing pipeline
celline run download --nthread 4
celline run count
celline run create_seurat
celline run preprocess

# Analysis
celline run predict_celltype
celline run reduce

echo "Analysis complete! Check results with celline interactive."

    

Python API Usage Example

      # Python API example - complete analysis pipeline
from celline import Project
from celline.functions import Add, Download, Count, CreateSeurat, Preprocess

# Initialize project in current directory
project = Project("./")

# Step-by-step analysis pipeline
try:
    # Step 1: Add samples from GEO database
    print("Adding samples...")
    add_func = Add(sample_ids=["GSE115189"])
    project = add_func.call(project)
    
    # Step 2: Download sequencing data
    print("Downloading data...")
    download_func = Download()
    project = download_func.call(project)
    
    # Step 3: Count gene expression
    print("Processing counts...")
    count_func = Count(nthread=2)
    project = count_func.call(project)
    
    # Step 4: Create Seurat objects
    print("Creating Seurat objects...")
    seurat_func = CreateSeurat()
    project = seurat_func.call(project)
    
    # Step 5: Quality control and preprocessing
    print("Running QC and preprocessing...")
    preprocess_func = Preprocess()
    project = preprocess_func.call(project)
    
    print("Analysis pipeline completed successfully!")
    
except Exception as e:
    print(f"Analysis failed: {e}")
    print("Check log files in resources/*/log/ for detailed error information")

    

🎉 Congratulations!

In this tutorial, you have completed the following:

  • ✅ Creating a Celline project
  • ✅ Data acquisition from public databases
  • ✅ Data preprocessing and quality control
  • ✅ Using the interactive analysis environment

📚 Next Steps

💡 Tips and Best Practices

Memory Management

      # For large datasets
celline config --nthread 1  # Reduce thread count

    

Progress Monitoring

      # Check progress with log files
tail -f resources/*/log/*.log

    

Data Backup

      # Backup important results
tar -czf results_backup.tar.gz data/ results/

    

Tutorial complete! For further analysis, please see Advanced Usage.