Dapper, Google’s distributed systems tracing infrastructure, is designed to provide Google developers with deeper insights into the behavior of complex distributed systems.
Introduction#
Consider a service composed of five servers:
- A frontend (A)
- Two middleware servers (B and C)
- Two backend servers (D and E)
When a user sends a request to frontend server A, it subsequently issues two RPC calls to B and C. B returns a response immediately, whereas C requires further calls to backend servers D and E before returning its result to A. Finally, A responds to the initial request. For this request, a basic distributed tracing system must record identifiers and timestamps for each message sent and received on every machine involved.
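The spans such a system records for this request can be sketched as a tree, one span per RPC. The field names below are illustrative, not Dapper's actual schema:

```python
# Illustrative span tree for the example request: A fans out to B and C,
# and C fans out to backends D and E before replying.
trace = {
    "trace_id": "t-1",
    "spans": [
        {"span_id": 1, "parent": None, "name": "A: handle request"},
        {"span_id": 2, "parent": 1,    "name": "A -> B"},
        {"span_id": 3, "parent": 1,    "name": "A -> C"},
        {"span_id": 4, "parent": 3,    "name": "C -> D"},
        {"span_id": 5, "parent": 3,    "name": "C -> E"},
    ],
}

# C's span (id 3) has two children: the calls to D and E.
children_of_c = [s["span_id"] for s in trace["spans"] if s["parent"] == 3]
print(children_of_c)  # [4, 5]
```

Parent links are what let the tracing system reconstruct the full call chain from individually recorded send/receive events.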
Design Objectives#
Low Overhead: The performance impact on instrumented applications should be negligible.
Application-level Transparency: Developers should be unaware of the tracing system, requiring no additional development effort or code modifications for integration.
Scalability: The system must handle the anticipated scale of services and clusters for the foreseeable future.
Timely Data Analysis: Collected data should be available for analysis ideally within one minute.
Proposed Solution#
Tracing Paradigms#
Currently, academia and industry employ two primary methods to correlate records with a specific request:
Black-box Paradigm: Only records message identifiers and timestamps for each send/receive event on individual machines, subsequently using statistical inference techniques to establish correlations.
In simple terms, this approach avoids generating additional logs. Instead, it collects existing log information and correlates events using statistical inference techniques, such as regression analysis.
Annotation-based Paradigm: Correlates all request information, identifiers, and timestamps through a globally unique ID via invasive code instrumentation, followed by analysis and processing.
Dapper employs the Annotation-based Paradigm.
Core Principles and Workflow#
Terminology#
Trace: Represents the complete call chain tracking for a single request.
Span: Denotes the request/response process between two services. A single Trace comprises multiple Spans.
TraceID: A globally unique ID generated for the duration of a Trace. All Spans within this Trace inherit this ID.
SpanID: Identifies an individual Span within a Trace. Each non-root Span also records the SpanID of its parent, so the Spans of a Trace form a tree.
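The terms above can be summarized in an illustrative record layout (a sketch, not Dapper's actual wire format):

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative Span record for the terminology above; field names are invented.
@dataclass
class Span:
    trace_id: int             # TraceID: shared by every Span in the Trace
    span_id: int              # SpanID: identifies this Span within the Trace
    parent_id: Optional[int]  # None for the root Span
    name: str
    start_us: int = 0         # send/receive timestamps, in microseconds
    end_us: int = 0

root = Span(trace_id=0xCAFE, span_id=1, parent_id=None, name="frontend.Request")
child = Span(trace_id=root.trace_id, span_id=2, parent_id=root.span_id,
             name="backend.Call")
assert child.trace_id == root.trace_id  # all Spans inherit the TraceID
```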
Core Workflow#
When a thread handles a traced control path, Dapper stores a trace context in ThreadLocal storage. This trace context is a small, easily copyable container housing Span attributes such as trace id and span id.
For deferred or asynchronous computations, most Google developers utilize a common control flow library to construct callbacks, which are executed by thread pools or other executors. Dapper ensures all callbacks store the trace context of their creator. When a callback executes, this trace context is associated with the appropriate thread. Thus, the IDs Dapper uses for trace reconstruction transparently apply to asynchronous control flows.
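This propagation can be sketched in Python, using `contextvars` as a stand-in for Dapper's ThreadLocal trace context; the names here are illustrative, not Dapper's API:

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Stand-in for Dapper's per-thread trace context.
trace_context = contextvars.ContextVar("trace_context", default=None)

def traced_callback(fn):
    """Capture the creator's context so the callback later runs with it."""
    ctx = contextvars.copy_context()  # snapshot at callback-creation time
    return lambda *args, **kwargs: ctx.run(fn, *args, **kwargs)

def work():
    # Executed on a pool thread, yet sees the creator's trace context.
    return trace_context.get()

trace_context.set({"trace_id": 42, "span_id": 7})
with ThreadPoolExecutor(max_workers=1) as pool:
    result = pool.submit(traced_callback(work)).result()
print(result)  # {'trace_id': 42, 'span_id': 7}
```

Without the wrapper, the pool thread would see the default (`None`) context; capturing at creation time is what makes the IDs apply transparently to asynchronous control flows.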
After a request concludes, the application writes the span data to a local log file.
The Dapper daemon pulls these log files and ingests the data into the Dapper collectors.
The Dapper collectors write the results into BigTable (open-source equivalent: HBase), with each trace recorded as a single row.
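The collector's write pattern can be sketched as grouping spans by trace id into per-trace rows (an illustrative in-memory stand-in for the Bigtable write):

```python
from collections import defaultdict

# Spans arrive out of order from many daemons; the collector groups them so
# each trace becomes one row keyed by its trace id (illustrative only).
incoming = [
    {"trace_id": "t1", "span_id": 2},
    {"trace_id": "t2", "span_id": 1},
    {"trace_id": "t1", "span_id": 1},
]

rows = defaultdict(list)  # row key (trace id) -> span columns
for span in incoming:
    rows[span["trace_id"]].append(span["span_id"])

print(dict(rows))  # {'t1': [2, 1], 't2': [1]}
```

The trace-id row key fits Bigtable's sparse column model well: each span becomes a column in its trace's row, so rows tolerate traces of arbitrary width.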
Application-defined Annotations#
Dapper also allows application developers to enrich trace data by adding custom information.
Through explicit API calls, developers can inject arbitrary application-specific content into Dapper traces. For instance, input and output parameters for a specific Span can be recorded.
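A hypothetical annotation helper in this spirit (the function name and keys below are invented for illustration, not Dapper's actual API):

```python
# A span carrying application-defined annotations (illustrative layout).
span = {"name": "UserLookup", "annotations": []}

def annotate(span, key, value):
    """Attach arbitrary application-specific content to a span."""
    span["annotations"].append((key, value))

annotate(span, "request.user_id", 1234)   # record an input parameter
annotate(span, "response.row_count", 1)   # record an output detail
print(span["annotations"])
# [('request.user_id', 1234), ('response.row_count', 1)]
```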
Application-level Transparency#
Within the Java ecosystem, Application-level transparency for tracing is typically achieved using Java Agent and bytecode instrumentation techniques, as exemplified by tools like Pinpoint.
Sampling Rate#
The sampling rate directly impacts application performance overhead. In high-throughput systems, a relatively low sampling rate (e.g., 0.01%) is often employed.
Experience with applications at Google leads us to believe that aggressive sampling does not hinder the analysis of the most significant patterns in high-throughput services. If an important execution pattern occurs once in such a system, it will occur thousands of times.
However, in lower-traffic systems, even a 1% sampling rate might miss critical events. In such scenarios, an adaptive sampling rate can be designed, where the rate decreases under high load and increases automatically in systems with lower request volumes.
Furthermore, it is conceivable to implement a threshold-based rule, for instance, mandating the sampling of any Trace whose total duration exceeds a specific percentile (e.g., p95).
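One possible adaptive scheme is sketched below: it targets a fixed number of sampled traces per time window, so the rate falls under high load and rises on low-traffic services. The target value and window bookkeeping are assumptions for illustration, not part of Dapper:

```python
import random

class AdaptiveSampler:
    """Aims for a fixed count of sampled traces per window (illustrative)."""

    def __init__(self, target_per_window=10):
        self.target = target_per_window
        self.rate = 1.0  # start by sampling everything

    def end_window(self, requests_seen):
        """Recompute the rate from the previous window's traffic."""
        if requests_seen > 0:
            self.rate = min(1.0, self.target / requests_seen)

    def should_sample(self):
        return random.random() < self.rate

s = AdaptiveSampler(target_per_window=10)
s.end_window(requests_seen=10_000)  # high load -> rate drops
print(s.rate)  # 0.001
s.end_window(requests_seen=5)       # low traffic -> sample everything
print(s.rate)  # 1.0
```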
Sampling Optimization#
The original Dapper paper describes the sampling optimization as follows:
For every span in the collection system, we hash its trace id into a scalar z (0 ≤ z ≤ 1). If z is less than our collection sampling coefficient, we retain the span and write it to Bigtable; otherwise, we discard it. By relying on the trace id for the sampling decision, we either sample the entire trace or discard the entire trace, rather than processing only some spans within a trace.
This approach implements a secondary sampling coefficient during the Dapper collection process to globally control the final data write rate.
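A minimal sketch of this hash-based decision, assuming SHA-256 as the hash function (the paper does not specify which hash Dapper uses):

```python
import hashlib

def keep_trace(trace_id: str, coefficient: float) -> bool:
    """Map the trace id to z in [0, 1) and keep it if z < coefficient."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    z = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return z < coefficient

# Because the decision depends only on the trace id, every span of a given
# trace reaches the same keep/discard verdict, on any collector:
decisions = {keep_trace("trace-42", 0.1) for _ in range(5)}
assert len(decisions) == 1
```

The key property is determinism: no coordination between collectors is needed, yet traces are always kept or dropped as a whole.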
Additionally, it might be beneficial to incorporate a filtering pipeline, similar to those found in log collectors like Filebeat, to perform secondary sampling based on specific parameters of Traces and spans.
Overhead#
Trace Generation Overhead#
The primary overhead in Dapper stems from Trace generation and log writing. Therefore, selecting an appropriate sampling rate is crucial.
Collection Overhead#
Reading local trace data can also interfere with the monitored workload.
In containerized environments, it is recommended to deploy the Dapper daemon process as a sidecar.