Choosing your signals
Once you have set up OpenTelemetry on Platon, the next question is when to add a metric, when to add a log, and when to add a span. This page describes how the three signals relate, and offers some heuristics for choosing between them.
The three signals
Each signal is best suited to a different question.
- Metric — how often, how fast, or how big something is. Anything you want to graph or alert on. Cheap to aggregate across billions of events, but not useful for debugging one specific request.
- Log — what happened on a single request. Rich narrative. Essential for error context, audit trails, and post-hoc investigation. Expensive to aggregate at scale.
- Trace — why a specific request was slow, or how a request moved across services. The timeline of one request, made up of spans. Rarely read in bulk, but powerful when you have a problematic request.
The signals overlap. You can count log events. You can put narrative text in a span attribute. But each signal has a different cost profile, and each is optimized for a different question.
Practical heuristics
| You want to answer… | Reach for… |
|---|---|
| "User 42 clicked enroll" (business event) | Log |
| "How many enrollments per minute?" | Metric (counter) |
| "What is the p95 of enrollment latency?" | Metric (histogram) |
| "Why is enrollment slow for course X today?" | Trace |
| "How many enrollments failed in the last hour?" | Metric (counter), with logs for drill-down |
| "Did this request actually reach the eligibility-checker?" | Trace |
| "Show me every request that hit endpoint Y with status 500" | Log |
| "Is our error rate climbing?" | Metric and alert |
Common anti-patterns
- Logging what should be a metric. Emitting logger.info("order placed") for every order and then counting log lines is slow and expensive. Use a counter, and reserve the log line for a business-meaningful event with context.
- Metricifying what should be a log. Attaching user_id as a metric label turns one time-series into millions. Mimir pays for every unique label combination, and high cardinality is the most common way to overload a metrics backend.
- Putting bulk payload data in span attributes. Spans have attribute size limits, and Tempo query costs scale with span size. Do not put an entire JSON request body in a span.
- Alerting on log patterns instead of metrics. Log-based alerts are fragile (they break when the log format changes) and expensive (grepping Loki at scale is slow). Emit a metric when the event happens, and alert on the metric.
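To make the cardinality point concrete, here is a rough back-of-the-envelope sketch. It assumes the worst case, where every combination of label values actually occurs, so the number of time series for one metric is the product of the distinct values per label. The label names and counts are hypothetical, not measurements from Platon.

```python
from math import prod


def series_count(distinct_values_per_label: dict[str, int]) -> int:
    """Upper bound on time series for one metric.

    Each unique combination of label values is its own series, so the
    worst case is the product of the distinct values of every label.
    """
    return prod(distinct_values_per_label.values())


# Hypothetical label sets, for illustration only.
bounded = {"endpoint": 20, "method": 4, "status_class": 5}
with_user_id = {**bounded, "user_id": 50_000}

print(series_count(bounded))       # 400
print(series_count(with_user_id))  # 20000000
```

One unbounded label multiplies the whole series count, which is why identifiers such as user_id usually belong in a log line or a span attribute instead.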
Tracing across services
Tracing is most valuable when it crosses service boundaries — one trace captures the enrollment request hitting enrollment-service, which calls eligibility-checker, which calls course-catalog. For this to work, trace context must propagate between services.
In most cases you do not write propagation code yourself. OTel auto-instrumentation sets and reads the W3C traceparent HTTP header in virtually every mainstream HTTP client and server, and in several message-queue libraries. Adding the SDK and enabling auto-instrumentation is enough.
Cases where you may need to handle propagation manually:
- Custom HTTP clients that do not use an auto-instrumented library.
- Message queues your OTel distribution does not instrument (rare today — Kafka, RabbitMQ, and NATS are all covered).
- Background jobs with no incoming request (the job is the root span and there is no header to read).
For the mechanics, see the upstream OpenTelemetry context propagation docs.
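For the manual cases listed above, it can help to see what the traceparent header actually carries. The following is a minimal stdlib-only sketch, not the OTel SDK's propagator: it accepts only version 00 of the W3C format (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags), and the names SpanContext, parse_traceparent, format_traceparent, new_root_context, and child_context are hypothetical helpers invented for illustration. In real code, prefer the propagation API of your OTel SDK.

```python
import re
import secrets
from typing import NamedTuple, Optional

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class SpanContext(NamedTuple):
    trace_id: str  # 32 hex characters
    span_id: str   # 16 hex characters
    sampled: bool


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the context carried by a traceparent header, or None if invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are explicitly invalid in the specification.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def format_traceparent(ctx: SpanContext) -> str:
    """Serialize a context into a version-00 traceparent header value."""
    return f"00-{ctx.trace_id}-{ctx.span_id}-{'01' if ctx.sampled else '00'}"


def new_root_context() -> SpanContext:
    """Background job with no incoming request: start a fresh trace."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8), True)


def child_context(parent: SpanContext) -> SpanContext:
    """Outgoing call: keep the trace ID, mint a new span ID."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


# Custom HTTP client: read the incoming header, forward a child context.
incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parent = parse_traceparent(incoming)
outgoing_headers = {"traceparent": format_traceparent(child_context(parent))}
```

The trace ID stays constant across every hop while each hop gets its own span ID, which is what lets the backend stitch the services into one trace.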
The service.name resource attribute is what Grafana Tempo uses to group all spans from one service. Set it once via OTEL_SERVICE_NAME and every span your app emits inherits it.
Correlating the three signals
When all three signals are wired up, you can move between them:
- From log to trace. Structured logs include trace_id and span_id (auto-wired by the logging instrumentation). In Grafana's Loki view, click the trace ID to open the full trace in Tempo.
- From metric graph to trace. Metric exemplars are representative trace IDs attached to data points. When you see a latency spike, click the exemplar to jump to one of the slow requests.
- From trace back to logs. In Tempo, filter logs for {trace_id="..."} and Loki surfaces every log line that participated in the request.
This correlation is the main reason teams wire up all three signals rather than stopping at logs.
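As an illustration of what a correlated log line looks like, the sketch below uses only the Python standard library to stamp trace_id and span_id onto every record. It is a stand-in for what the logging instrumentation does automatically, not its actual implementation: the context variable, the class names, and the logger name are all hypothetical, and the JSON field layout is an assumption rather than Platon's log schema.

```python
import contextvars
import io
import json
import logging

# Hypothetical stand-in for the active span context that the OTel SDK tracks.
_current_ids = contextvars.ContextVar("current_ids", default=None)


class TraceContextFilter(logging.Filter):
    """Copy the active trace and span IDs onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        ids = _current_ids.get() or {}
        record.trace_id = ids.get("trace_id", "")
        record.span_id = ids.get("span_id", "")
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
            "span_id": record.span_id,
        })


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("enrollment-demo")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceContextFilter())
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)

# Pretend a span is active while the request is handled.
_current_ids.set({
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "span_id": "00f067aa0ba902b7",
})
log.info("enrollment accepted")

line = json.loads(buffer.getvalue())
print(line["trace_id"])  # 4bf92f3577b34da6a3ce929d0e0e4736
```

Because the trace ID is a structured field rather than free text, the log backend can index it and link each line to its trace.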
Where to start
For teams adopting OTel, enabling traces with auto-instrumentation usually gives the most value for the least effort. It typically takes one SDK import per language, and the result is cross-service timing and slow-request drill-down that is otherwise hard to get. The other two signals then follow with little extra work:
- Auto-instrumented HTTP metrics give you the RED method (Rate, Errors, Duration) for every endpoint without custom code.
- Logs are likely already working, and trace-log correlation comes with the logging instrumentation.