FFmpeg-RTMP is a distributed video transcoding system that documents architectural patterns, design invariants, and failure semantics observed under real load. While the system runs in production and is available for reuse, its primary goal is to communicate design tradeoffs and operational lessons rather than to serve as a general-purpose or commercially supported platform.
This reference implementation demonstrates:
- Architectural patterns: Pull-based coordination, state machine guarantees, idempotent operations
- Design invariants: What never changes, even under failure conditions
- Failure semantics: Explicit documentation of retry boundaries and terminal states
- Operational tradeoffs: Why certain design choices were made over alternatives
- Performance characteristics: Measured behavior under realistic workloads (45,000+ jobs tested)
- State Machine Correctness: FSM with validated transitions and row-level locking prevents race conditions
- Failure Mode Documentation: Explicit boundaries between transient (retry) and terminal (fail) errors
- Graceful Degradation: Heartbeat-based failure detection with configurable recovery semantics
- Production Patterns: Exponential backoff, connection pooling, graceful shutdown demonstrated at scale
- Transparency: Design decisions documented with rationale and alternatives considered
- Not a commercial platform: No support, SLAs, or stability guarantees across versions
- Not general-purpose: Optimized for batch transcoding workloads, not real-time streaming
- Not plug-and-play: Requires understanding of distributed systems concepts for deployment
- Not feature-complete: Focuses on core patterns; many production features deliberately omitted
- Systems researchers studying distributed coordination patterns
- Engineers evaluating architectural approaches for similar problems
- Students learning production distributed systems design
- Teams seeking a reference implementation to adapt for specific use cases
This is a teaching tool backed by real operational data, not a turnkey solution.
This reference implementation is organized to clearly separate concerns:
- master/ - Orchestration: job scheduling, failure detection, state management
- worker/ - Execution: job processing, FFmpeg integration, metrics collection
- shared/ - Common libraries: FSM, retry semantics, database abstractions
See ARCHITECTURE.md for detailed design discussion and CODE_VERIFICATION_REPORT.md for implementation validation.
For studying the system behavior locally:
# One-command setup: builds, runs, and verifies everything
./scripts/run_local_stack.sh

See docs/LOCAL_STACK_GUIDE.md for details.
The reference implementation can be deployed across multiple nodes to study distributed behavior patterns.
- Go 1.24+ (for building binaries)
- Python 3.10+ (optional, for analysis scripts)
- FFmpeg (for transcoding)
- Linux with kernel 4.15+ (for RAPL power monitoring)
# Clone and build
git clone https://github.com/psantana5/ffmpeg-rtmp.git
cd ffmpeg-rtmp
make build-master
# Set API key for authentication
export MASTER_API_KEY=$(openssl rand -base64 32)
# Start master service
# - TLS enabled by default (auto-generates self-signed cert)
# - SQLite persistence (master.db)
# - Job retry (3 attempts default)
# - Prometheus metrics on port 9090
./bin/master --port 8080 --api-key "$MASTER_API_KEY"
# Optional: Start monitoring stack (VictoriaMetrics + Grafana)
make vm-up-build

# On worker node(s)
git clone https://github.com/psantana5/ffmpeg-rtmp.git
cd ffmpeg-rtmp
make build-agent
# Set same API key as master
export MASTER_API_KEY="<same-key-as-master>"
# Register and start agent
# Concurrency settings affect failure mode behavior
./bin/agent \
--register \
--master https://MASTER_IP:8080 \
--api-key "$MASTER_API_KEY" \
--max-concurrent-jobs 4 \
--poll-interval 3s \
--insecure-skip-verify
# Note: --insecure-skip-verify only for self-signed certs in research environments

# Submit via CLI
./bin/ffrtmp jobs submit \
--master https://MASTER_IP:8080 \
--scenario 1080p-h264 \
--bitrate 5M \
--duration 300
# Or via REST API
curl -X POST https://MASTER_IP:8080/jobs \
-H "Authorization: Bearer $MASTER_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"scenario": "1080p-h264",
"confidence": "auto",
"parameters": {"duration": 300, "bitrate": "5M"}
}'
# Workers poll master and execute jobs
# Observe state transitions and failure recovery patterns
# Monitor progress at https://MASTER_IP:8080/jobs

- Master API: https://MASTER_IP:8080/nodes (registered nodes, health status)
- Prometheus Metrics: http://MASTER_IP:9090/metrics
- Grafana (optional): http://MASTER_IP:3000 (admin/admin)
- VictoriaMetrics (optional): http://MASTER_IP:8428
For systemd service configuration, see deployment/README.md.
The Edge Workload Wrapper demonstrates OS-level resource constraint patterns for compute workloads. This experimental component explores non-owning governance models where workloads survive wrapper crashes.
- Non-owning supervision: Workloads run independently of wrapper lifecycle
- Attach semantics: Govern already-running processes without restart
- Graceful fallback: OS-level constraints degrade gracefully without root/cgroups
- Exit tracking: Capture exit codes, reasons, and execution duration
# Run FFmpeg with resource constraints
ffrtmp run \
--job-id transcode-001 \
--sla-eligible \
--cpu-quota 200 \
--memory-limit 4096 \
-- ffmpeg -i input.mp4 -c:v h264_nvenc output.mp4
# Attach to existing process (demonstrates attach semantics)
ffrtmp attach \
--pid 12345 \
--job-id existing-job-042 \
--cpu-weight 150 \
--nice -5
# Auto-discovery watch daemon (NEW!)
ffrtmp watch \
--scan-interval 10s \
--enable-state \
--enable-retry \
--watch-config /etc/ffrtmp/watch-config.yaml

Demonstrates automatic process discovery and governance patterns. Explores techniques for:
- Non-intrusive process discovery via /proc scanning
- State persistence across daemon restarts
- Configuration-driven process filtering and governance
Example deployment:
# Install experimental daemon
sudo ./deployment/install-edge.sh
# Configure discovery rules
sudo nano /etc/ffrtmp/watch-config.yaml
# Start service
sudo systemctl start ffrtmp-watch

See deployment/WATCH_DEPLOYMENT.md for implementation details.
- Wrapper Architecture - Design patterns and philosophy
- Wrapper Examples - Usage demonstrations
The reference implementation demonstrates several resource management approaches:
Running with privileged access (research/production):
# Full cgroup support for resource isolation
sudo ./bin/agent \
--register \
--master https://MASTER_IP:8080 \
--api-key "$MASTER_API_KEY" \
--max-concurrent-jobs 4 \
--poll-interval 3s

Benefits of privileged execution:
- Strict CPU quotas via cgroups (v1/v2)
- Hard memory limits with OOM protection
- Complete process isolation per job
- Resource exhaustion prevention
Graceful degradation without privileges:
- Disk space monitoring (always enforced)
- Timeout enforcement (always enforced)
- Process priority control via nice
- CPU/memory limits disabled (monitoring only)
Jobs support configurable limits for studying resource contention:
{
"scenario": "1080p-h264",
"parameters": {
"bitrate": "4M",
"duration": 300
},
"resource_limits": {
"max_cpu_percent": 200, // 200% = 2 CPU cores
"max_memory_mb": 2048, // 2GB memory limit
"max_disk_mb": 5000, // 5GB temp space required
"timeout_sec": 600 // 10 minute timeout
}
}

Default constraints:
- CPU: All available cores (numCPU × 100%)
- Memory: 2048 MB (2GB)
- Disk: 5000 MB (5GB)
- Timeout: 3600 seconds (1 hour)
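Expressed in the job-submission format shown above, those defaults correspond to a resource_limits block like this (illustrative; the CPU value depends on the worker, e.g. numCPU × 100% = 800 on an 8-core machine):

```
"resource_limits": {
    "max_cpu_percent": 800,    // numCPU × 100% (example: 8 cores)
    "max_memory_mb": 2048,     // 2GB
    "max_disk_mb": 5000,       // 5GB
    "timeout_sec": 3600        // 1 hour
}
```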
1. CPU Limits (cgroup-based)
- Demonstrates per-job CPU percentage allocation (100% = 1 core)
- Supports cgroup v1 and v2
- Fallback to nice priority without root
2. Memory Limits (cgroup-based)
- Hard memory caps via Linux cgroups
- OOM (Out of Memory) protection
- Automatic process termination if limits exceeded
- Requires root for enforcement
3. Disk Space Monitoring
- Pre-job validation (reject at 95% usage)
- Always enforced (no privileges required)
- Configurable cleanup policies for temporary files
4. Timeout Enforcement
- Per-job timeout with context-based cancellation
- SIGTERM → SIGKILL escalation
- Process group cleanup
5. Process Priority
- Nice value = 10 (lower than system services)
- Always enforced (no privileges required)
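The SIGTERM → SIGKILL escalation in item 4 can be reproduced outside the agent with coreutils timeout, which sends SIGTERM at the deadline and SIGKILL after a grace period. The wrapper function and the 30s grace value below are illustrative, not the agent's actual implementation:

```shell
#!/bin/sh
# Wrap any command with a deadline: SIGTERM at expiry,
# SIGKILL 30 seconds later if the process ignores SIGTERM.
# coreutils timeout exits with 124 when the deadline is hit.
run_with_timeout() {
  deadline="$1"; shift
  timeout --signal=TERM --kill-after=30 "$deadline" "$@"
}

# Illustrative usage (the ffmpeg invocation is an example):
# run_with_timeout 600 ffmpeg -i input.mp4 -c:v libx264 output.mp4
```

The agent implements the same policy internally with context-based cancellation; this standalone form is handy for experimenting with how a given workload reacts to SIGTERM.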
The system exports Prometheus metrics demonstrating:
- Resource usage patterns: CPU, memory, GPU utilization per job
- Job lifecycle: Active jobs, completion rates, latency distribution
- Hardware monitoring: GPU power, temperature (NVIDIA)
- Encoder availability: NVENC, QSV, VAAPI runtime detection
- Bandwidth tracking: Input/output bytes, compression ratios
- SLA classification: Intelligent job categorization (production vs test/debug)
Metrics endpoint: http://worker:9091/metrics
Documentation:
- Auto-Attach Documentation - Process discovery patterns
- Bandwidth Metrics Guide - Bandwidth tracking implementation
- SLA Tracking Guide - Service level monitoring approach
- SLA Classification Guide - Job classification methodology (99.8% compliance with 45K+ jobs)
- Alerting Guide - Prometheus alert configuration
Test Results (45,000+ jobs across 31 scenarios):
- 99.8% SLA compliance observed
- Automatic retry recovers transient failures (network errors, node failures)
- FFmpeg failures terminal (codec errors, format issues)
- Heartbeat-based failure detection (90s timeout, 3 missed heartbeats)
See CODE_VERIFICATION_REPORT.md for implementation validation and docs/SLA_CLASSIFICATION.md for complete testing methodology.
720p Fast Encoding:
"resource_limits": {
"max_cpu_percent": 150, // 1.5 cores
"max_memory_mb": 1024, // 1GB
"timeout_sec": 300 // 5 minutes
}

1080p Standard Encoding:
"resource_limits": {
"max_cpu_percent": 300, // 3 cores
"max_memory_mb": 2048, // 2GB
"timeout_sec": 900 // 15 minutes
}

4K High Quality Encoding:
"resource_limits": {
"max_cpu_percent": 600, // 6 cores
"max_memory_mb": 4096, // 4GB
"timeout_sec": 3600 // 1 hour
}

System requirements:

Minimum (without root):
- Linux kernel 3.10+
- /tmp with 10GB+ free space
- 2GB+ RAM per worker
Recommended (with privileged access):
- Linux kernel 4.5+ (cgroup v2 support)
- /tmp with 50GB+ free space
- 8GB+ RAM per worker
- SSD storage for /tmp
Additional documentation:
- Resource Limits Guide - Configuration reference
- Production Features - Additional hardening patterns
- Troubleshooting - Common issues
For development and experimentation, Docker Compose provides a single-machine setup:
# Clone and start
git clone https://github.com/psantana5/ffmpeg-rtmp.git
cd ffmpeg-rtmp
make up-build
# Submit test jobs
make build-cli
./bin/ffrtmp jobs submit --scenario 1080p-h264 --bitrate 5M --duration 60
# View metrics at http://localhost:3000

Note: Docker Compose is for local testing only. For distributed deployment, see above.
See shared/docs/DEPLOYMENT_MODES.md for deployment comparisons.
Retry Semantics:
- Transport-layer retry only (HTTP requests, heartbeats, polling)
- Exponential backoff: 1s → 30s, max 3 retries
- Context-aware (respects cancellation)
- Job execution never retried (FFmpeg failures terminal)
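The transport-layer policy above can be sketched as a small wrapper: retry on failure with delays of 1s, 2s, 4s, …, capped at 30s, giving up after 3 retries. This is a hedged illustration of the policy, not the Go client's actual code:

```shell
#!/bin/sh
# Retry a command with exponential backoff: 1s initial delay,
# doubling per attempt, capped at 30s, max 3 retries after
# the initial try.
retry_with_backoff() {
  max_retries=3 delay=1 cap=30 attempt=0
  until "$@"; do
    attempt=$((attempt + 1))
    if [ "$attempt" -gt "$max_retries" ]; then
      echo "giving up after $max_retries retries" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    if [ "$delay" -gt "$cap" ]; then delay="$cap"; fi
  done
}

# Illustrative usage: wrap a heartbeat POST (endpoint path assumed)
# retry_with_backoff curl -sf -X POST "https://MASTER_IP:8080/heartbeat" \
#   -H "Authorization: Bearer $MASTER_API_KEY"
```

Note that only transport operations go through this path; a failed FFmpeg run is terminal and is never re-executed.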
Graceful Shutdown:
- Worker: Stop accepting jobs, drain current jobs (30s timeout)
- Master: LIFO shutdown order (HTTP → metrics → scheduler → DB → logger)
- No workload interruption (jobs complete naturally or timeout)
- Async coordination via shutdown.Done() channel
Readiness Checks:
- FFmpeg validation before accepting work
- Disk space verification
- Master connectivity check
- HTTP 200 only when truly ready (Kubernetes-friendly)
Centralized Logging:
- Structured directory: /var/log/ffrtmp/<component>/<subcomponent>.log
- Multi-writer: file + stdout (systemd journald compatible)
- Automatic fallback to ./logs/ without privileges
Documentation:
- Production Readiness Guide - Complete pattern documentation
- Security Review - Security audit
- Audit Summary - Technical debt elimination
Concurrency:
- Workers process multiple jobs simultaneously (--max-concurrent-jobs)
- Hardware-aware configuration tool: ffrtmp config recommend
Reliability:
- TLS/HTTPS enabled by default (auto-generated certificates)
- API authentication via MASTER_API_KEY
- SQLite persistence (jobs survive restarts)
- Automatic retry with exponential backoff
Observability:
- Built-in Prometheus metrics (port 9090)
- Dual engine support (FFmpeg/GStreamer)
See docs/README.md for comprehensive documentation.
Failure Detection:
- Heartbeat-based (90s timeout, 3 missed heartbeats)
- Identifies dead nodes and orphaned jobs
Automatic Reassignment:
- Jobs from failed workers automatically reassigned
- Smart retry for transient failures (network errors, timeouts)
- FFmpeg failures terminal (not retried)
- Max 3 retry attempts with exponential backoff
Stale Job Handling:
- Batch jobs timeout after 30min
- Live jobs timeout after 5min inactivity
Multi-level priorities: Live > High > Medium > Low > Batch
Queue-based scheduling: live, default, batch queues with different SLAs
FIFO within priority: Fair scheduling for same-priority jobs
- TLS/mTLS between master and workers
- API key authentication required
- Certificate auto-generation support
# Example: Submit high-priority job
./bin/ffrtmp jobs submit \
--scenario live-4k \
--queue live \
--priority high \
--duration 3600
# Configure master
./bin/master \
--port 8080 \
--max-retries 5 \
--scheduler-interval 10s \
--api-key "$MASTER_API_KEY"
# Configure worker
./bin/agent \
--master https://MASTER_IP:8080 \
--max-concurrent-jobs 4 \
--poll-interval 3s \
--heartbeat-interval 30s

See docs/README.md for complete implementation details.
Demonstrates engine selection patterns for different workload characteristics:
- FFmpeg (default): General-purpose file transcoding
- GStreamer: Low-latency live streaming
- Auto-selection: System chooses based on workload type
- Hardware acceleration: NVENC, QSV, VAAPI support for both
# Auto-select engine (default)
./bin/ffrtmp jobs submit --scenario live-stream --engine auto
# Force specific engine
./bin/ffrtmp jobs submit --scenario transcode --engine ffmpeg
./bin/ffrtmp jobs submit --scenario live-rtmp --engine gstreamer

Auto-selection logic:
- LIVE queue → GStreamer (low latency)
- FILE/batch → FFmpeg (better for offline)
- RTMP streaming → GStreamer
- GPU+NVENC+streaming → GStreamer
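Most of those rules reduce to a simple decision table on queue and output type; a shell sketch of that logic (illustrative only — the real selection happens inside the Go agent, and the GPU+NVENC rule is omitted here for brevity):

```shell
#!/bin/sh
# Decision-table sketch of engine auto-selection:
# LIVE queue or RTMP output -> GStreamer (low latency),
# everything else (file/batch) -> FFmpeg (better offline).
select_engine() {
  queue="$1" output="$2"
  case "$queue:$output" in
    live:*) echo gstreamer ;;   # LIVE queue -> GStreamer
    *:rtmp) echo gstreamer ;;   # RTMP streaming -> GStreamer
    *)      echo ffmpeg ;;      # FILE/batch -> FFmpeg
  esac
}
```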
See docs/DUAL_ENGINE_SUPPORT.md for details.
This reference system can be used to study:
- Distributed coordination: Master-worker patterns, state machine guarantees, failure detection
- Resource management: CPU/memory limits, cgroup isolation, graceful degradation
- Retry semantics: Transient vs terminal failures, exponential backoff, idempotent operations
- Observability patterns: Metrics collection, distributed tracing, structured logging
- Energy efficiency: Power consumption during video transcoding (Intel RAPL)
- Workload scaling: Performance characteristics across multiple nodes
Master-worker architecture demonstrating coordination patterns:
- Master Node: Job orchestration, failure detection, metrics aggregation
- HTTP API (Go)
- VictoriaMetrics (30-day retention)
- Grafana (visualization)
- Worker Nodes: Job execution, resource monitoring, heartbeat reporting
- Hardware auto-detection
- Pull-based job polling
- Local metrics collection
- Result reporting
Docker Compose stack for experimentation:
- Nginx RTMP (streaming server)
- VictoriaMetrics (time-series database)
- Grafana (dashboards)
- Go Exporters (CPU/GPU metrics via RAPL/NVML)
- Python Exporters (QoE metrics, analysis)
- Alertmanager (alert routing)
See shared/docs/DEPLOYMENT_MODES.md for architecture diagrams.
Primary documentation: docs/README.md - Complete reference guide
- Configuration Tool - Hardware-aware worker configuration
- Concurrent Jobs Guide - Parallel job processing
- Job Launcher Script - Batch job submission
- Deployment Success Report - Real-world deployment case study
- Dual Engine Support - FFmpeg + GStreamer selection patterns
- Production Features - Reliability patterns (TLS, auth, retry, metrics)
- Deployment Modes - Architecture comparison
- Internal Architecture - Runtime model and operations
- Distributed Architecture - Master-worker coordination
- Production Deployment - Systemd service configuration
- Getting Started Guide - Initial setup
- Running Tests - Test scenarios and execution
- Go Exporters Quick Start - Metrics collection setup
- Troubleshooting - Common issues
- Architecture Overview - System design and data flow
- Exporters Quick Reference - Metrics collection patterns
- Exporters Overview - Master-side metrics
- Master Exporters Deployment - Master metrics setup
- Worker Exporters - Worker-side metrics
- Worker Exporters Deployment - Worker metrics setup
- Energy Advisor - ML-based efficiency analysis
- Documentation Index - Complete technical documentation
# Build components
make build-master # Build master node binary
make build-agent # Build worker agent binary
make build-cli # Build ffrtmp CLI tool
make build-distributed # Build all
# Get hardware-aware configuration
./bin/ffrtmp config recommend --environment production --output text
# Run services
./bin/master --port 8080 --api-key "$MASTER_API_KEY"
./bin/agent --register --master https://MASTER_IP:8080 \
--api-key "$MASTER_API_KEY" \
--max-concurrent-jobs 4 \
--insecure-skip-verify
# Submit and manage jobs
./bin/ffrtmp jobs submit --scenario 1080p-h264 --bitrate 5M --duration 300
./bin/ffrtmp jobs status <job-id>
./bin/ffrtmp nodes list
# Systemd service management
sudo systemctl start ffmpeg-master
sudo systemctl start ffmpeg-agent
sudo systemctl status ffmpeg-master
# Monitor and observe
curl -k https://localhost:8080/nodes # List registered workers
curl -k https://localhost:8080/jobs # List jobs
curl http://localhost:9090/metrics # Prometheus metrics
journalctl -u ffmpeg-master -f # View master logs
journalctl -u ffmpeg-agent -f # View worker logs

# Stack management
make up-build # Start Docker Compose stack
make down # Stop stack
make ps # Show container status
make logs SERVICE=victoriametrics # View service logs
# Testing scenarios
make test-single # Run single stream test
make test-batch # Run batch test matrix
make run-benchmarks # Run benchmark suite
make analyze # Analyze results
# Development tools
make lint # Run linting
make format # Format code
make test # Run test suite

Observe job reassignment after worker failure:
# Submit long-running jobs
./bin/ffrtmp jobs submit --scenario 4K-h265 --bitrate 15M --duration 3600
./bin/ffrtmp jobs submit --scenario 1080p-h264 --bitrate 5M --duration 1800
# Monitor initial assignment
curl -k https://master:8080/jobs
# Kill a worker mid-job (simulate failure)
sudo systemctl stop ffmpeg-agent # On worker node
# Observe master detecting failure (90s timeout)
# Watch job reassignment to healthy workers
curl -k https://master:8080/jobs # Check job state transitions
# Analyze recovery time and behavior
journalctl -u ffmpeg-master -f

Observations to study:
- Heartbeat failure detection timing (3 × 30s = 90s)
- Job state transitions (running → failed → queued)
- Reassignment latency
- Worker re-registration behavior
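One way to watch those state transitions is to poll /jobs and reduce each entry to an id/state pair. A jq-based sketch — the field names id and state are assumptions about the response's JSON shape, so adjust them to the actual API:

```shell
#!/bin/sh
# Read a /jobs JSON array on stdin and print one "id state" line
# per job. Field names (id, state) are assumed, not the documented
# schema.
summarize_jobs() {
  jq -r '.[] | "\(.id) \(.state)"'
}

# Illustrative poll loop against a live master:
# while true; do
#   curl -sk -H "Authorization: Bearer $MASTER_API_KEY" \
#     https://MASTER_IP:8080/jobs | summarize_jobs
#   sleep 5
# done
```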
Test cgroup-based resource limits under contention:
# Submit multiple jobs with different CPU limits
./bin/ffrtmp jobs submit --scenario 1080p-h264 --duration 600 \
--cpu-limit 200 # 2 cores
./bin/ffrtmp jobs submit --scenario 1080p-h264 --duration 600 \
--cpu-limit 100 # 1 core
# Monitor actual CPU usage via Prometheus metrics
curl http://worker:9091/metrics | grep cpu_usage
# Compare observed vs requested CPU allocation
# Study cgroup enforcement effectiveness

Compare codec energy consumption patterns:
# Start local development stack
make up-build && make build-cli
# Test H.264 codec
./bin/ffrtmp jobs submit --scenario 4K60-h264 --bitrate 10M --duration 120
./bin/ffrtmp jobs submit --scenario 1080p60-h264 --bitrate 5M --duration 60
# Test H.265 codec
./bin/ffrtmp jobs submit --scenario 4K60-h265 --bitrate 10M --duration 120
./bin/ffrtmp jobs submit --scenario 1080p60-h265 --bitrate 5M --duration 60
# Analyze energy consumption via RAPL metrics
python3 scripts/analyze_results.py
# View power consumption dashboards
# Open Grafana at http://localhost:3000

Deploy distributed mode with agents on your build servers:
# CI/CD pipeline submits jobs to master after each release
curl -X POST https://master:8080/jobs \
-H "Authorization: Bearer $MASTER_API_KEY" \
-H "Content-Type: application/json" \
-d @benchmark_config.json
# Results automatically aggregated and visualized
# Alerts fire if performance regressions detected

Contributions are welcome! See the detailed documentation for development guidelines.
See LICENSE file for details.
The project includes comprehensive test coverage for critical components:
# Run all tests with race detector
cd shared/pkg
go test -v -race ./...
# Run tests with coverage report
go test -v -coverprofile=coverage.out ./models ./scheduler ./store
go tool cover -html=coverage.out

Test Coverage:
- models: 85% (FSM state machine fully tested)
- scheduler: 53% (priority queues, recovery logic)
- store: Comprehensive database operations tests
- agent: Engine selection, optimizers, encoders
CI/CD:
- Automated testing on every push
- Race condition detection
- Multi-architecture builds (amd64, arm64)
- Binary artifacts for master, worker, and CLI
See CONTRIBUTING.md for testing guidelines.
Core documentation has been streamlined for clarity:
- docs/README.md - Complete system documentation (NEW)
- docs/CONFIGURATION_TOOL.md - Hardware-aware config tool
- CONCURRENT_JOBS_IMPLEMENTATION.md - Parallel processing guide
- QUICKSTART.md - Get started in 5 minutes
- docs/ARCHITECTURE.md - System design and architecture
- DEPLOYMENT.md - Production deployment guide
- CONTRIBUTING.md - Contribution guidelines
- docs/LOCAL_STACK_GUIDE.md - Local development setup
- CHANGELOG.md - Version history
Additional technical documentation is available in docs/archive/ for reference.