When Pluto TV set out to modernize its observability stack, we faced a challenge familiar to many engineering teams: how do you evaluate multiple observability platforms without locking yourself into a single vendor and without re-instrumenting your entire codebase for each one?
The answer was OpenTelemetry.
The Problem
Pluto TV's infrastructure spans 30+ services running on Kubernetes across AWS. As the platform grew, so did our need for comprehensive observability—structured logs, distributed traces, and metrics—all correlated with one another.
We were evaluating several observability vendors, but the traditional approach of vendor-specific instrumentation meant we'd have to re-instrument each service every time we changed platforms. That wasn't acceptable at our scale.
The OpenTelemetry Approach
OpenTelemetry (OTel) is a vendor-neutral open standard for collecting telemetry data. By instrumenting once with OTel, we could route our data to any supported backend—Datadog, Jaeger, Honeycomb, or others—simply by changing our collector configuration.
Our architecture had three layers:
- Application instrumentation: Each service emits spans and logs using the OTel SDK (see the sketch after this list)
- OTel Collector: A DaemonSet deployed via Helm on our Kubernetes cluster that aggregates and processes telemetry
- Export destinations: Datadog as our primary backend, with the ability to dual-export to other vendors for side-by-side comparison
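To make the first layer concrete, here is a minimal sketch of what instrumenting a Go service with the OTel SDK and an OTLP exporter can look like. The endpoint, service name, and option choices are illustrative assumptions, not our exact internal setup:

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

func initTracer(ctx context.Context) (*sdktrace.TracerProvider, error) {
	// Export OTLP over gRPC to the collector; the endpoint is illustrative
	// (in practice the DaemonSet exposes the collector on every node).
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("localhost:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceName("example-service"), // hypothetical service name
		)),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}

func main() {
	ctx := context.Background()
	tp, err := initTracer(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer tp.Shutdown(ctx)

	// The service only ever talks to the OTel API; which backend receives
	// the data is decided entirely by the collector's configuration.
	tracer := otel.Tracer("example-service")
	_, span := tracer.Start(ctx, "startup")
	span.End()
}
```

Because the application exports plain OTLP to the local collector, swapping Datadog for another backend—or dual-exporting for a side-by-side comparison—is purely a collector configuration change.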
Scaling to Production
The journey from proof of concept to production at scale involved several key milestones.
Custom Golang Libraries
We built custom Golang libraries that captured trace and span IDs within log events. This allowed us to link 400 million log lines daily directly to their corresponding traces in Datadog—turning what would have been separate debugging steps into a single correlated view.
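Our internal libraries aren't public, but the core idea is simple: pull the span context out of the request context and stamp its IDs onto every log record. Here is a minimal sketch using the standard library's slog; the helper name and field names are illustrative, not our actual API:

```go
package logging

import (
	"context"
	"log/slog"

	"go.opentelemetry.io/otel/trace"
)

// WithTraceContext returns a logger that carries the active trace and span
// IDs as fields, so each log line can be joined to its trace in the backend.
// (Helper and field names are illustrative, not our internal API.)
func WithTraceContext(ctx context.Context, logger *slog.Logger) *slog.Logger {
	sc := trace.SpanContextFromContext(ctx)
	if !sc.IsValid() {
		return logger
	}
	return logger.With(
		slog.String("trace_id", sc.TraceID().String()),
		slog.String("span_id", sc.SpanID().String()),
	)
}
```

A handler can then log via `WithTraceContext(ctx, slog.Default())`, and every line it emits carries the IDs needed to jump from the log line straight to its trace.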
Kubernetes DaemonSet
I contributed to the Helm chart that deployed the Datadog Agent as a DaemonSet across our Kubernetes cluster. This ensured every node automatically participated in log and trace ingestion, with per-service transformation rules applied at the collector level.
Volume
At peak, our pipeline processes 100,000 trace events per second, all routed through the OTel Collector before reaching Datadog.
Results
- Onboarded 13 microservices in 3 months
- 40% improvement in debugging efficiency through correlated logs and traces
- Vendor flexibility preserved: switching or adding backends requires no code changes
- Observability coverage across 30+ services in production
Lessons Learned
Start with the collector, not the SDK. Getting the OTel Collector running and exporting to your destination first makes SDK instrumentation feel incremental and low-risk.
Custom context propagation matters. Off-the-shelf auto-instrumentation is a great start, but the real value comes from ensuring trace context flows correctly through async jobs, queues, and cross-service calls.
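For example, a producer has to inject the current trace context into the message it enqueues, and the consumer has to extract it before starting its own span—otherwise the trace breaks at the queue boundary. A rough sketch using the W3C TraceContext propagator; the `Message` type is a stand-in for whatever your queue client actually uses:

```go
package queue

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

// Message is a stand-in for your queue or job payload; Headers carries the
// serialized trace context across the async boundary.
type Message struct {
	Body    []byte
	Headers map[string]string
}

var propagator = propagation.TraceContext{}

// Inject writes the current span context into the message headers on the
// producer side, just before the message is enqueued.
func Inject(ctx context.Context, msg *Message) {
	if msg.Headers == nil {
		msg.Headers = map[string]string{}
	}
	propagator.Inject(ctx, propagation.MapCarrier(msg.Headers))
}

// Extract restores the span context on the consumer side, so spans started
// for the job are parented to the producer's trace.
func Extract(ctx context.Context, msg *Message) context.Context {
	return propagator.Extract(ctx, propagation.MapCarrier(msg.Headers))
}

// Example consumer: continue the trace across the queue hop.
func handle(ctx context.Context, msg *Message) {
	ctx = Extract(ctx, msg)
	_, span := otel.Tracer("worker").Start(ctx, "process-job")
	defer span.End()
	// ... process msg.Body ...
}
```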
Budget for the learning curve. OpenTelemetry's power comes with complexity. Allocating time for the team to understand the data model—traces, spans, resources, attributes—pays dividends in adoption speed.
If your team is evaluating OpenTelemetry, start small: pick one service, instrument it, and export to your existing tools. The flexibility you gain is well worth the upfront investment.