Monitoring

Introduction

Observability (also known as “monitoring”) lets you determine whether the Daml Enterprise solution is healthy. If it is not healthy, observability helps you diagnose the root cause. Observability has three parts: metrics, logs, and traces. These are described in this section.

To avoid becoming overwhelmed by the number of metrics and log messages, follow these steps:

The remaining sections provide references to more detailed information.

Hands-On with the Daml Enterprise - Observability Example

The Daml Enterprise - Observability Example GitHub repository provides a complete reference example for exploring the metrics that Daml Enterprise exposes. You can use it to explore the collection, aggregation, filtering, and visualization of metrics. It is self-contained, with the following components:

  • An example Docker compose file to create a run-time for all the components
  • Some shell scripts to generate requests to the Daml Enterprise solution
  • A Prometheus config file to scrape the metrics data
  • Grafana template files to visualize the metrics in a meaningful way, such as the example dashboard shown below

[Figure: Dashboard with metrics. A dashboard showing metrics to measure the health of the system.]

Golden Signals and Key Metrics Quick Start

The best practice for monitoring a microservices application is an approach known as the Golden Signals, or the RED method. In this approach, metric monitoring determines whether the application is healthy and, if it is not healthy, which service is the root cause of the issue. Golden Signal metrics are supported for all HTTP and gRPC endpoints. Key metrics specific to Daml Enterprise are also available. These are described below.

The following Golden Signal metrics for each HTTP and gRPC API are available:

  • Input request rate, as a counter
  • Error rate, as a counter (discussed below)
  • Latency (the time to process a request), as a histogram
  • Size of the payload, as a counter, following the Apache HTTP precedent

You can filter or aggregate each metric using its accompanying labels. The instrumentation labels added to each HTTP API metric are as follows:

  • http_verb: the HTTP verb (for example: GET, POST)
  • http_status: the status code (for example: 200, 401, 403, 504)
  • host: the host identifier
  • daml_version: the Daml release number
  • service: a string to identify what Daml service or Canton component is running in this process (for example: participant, domain, json_api), as well as domain if several Canton components run in a single process
  • path: the request made to the endpoint (for example: /v1/create, /v1/exercise)

The gRPC protocol is layered on top of HTTP/2, so certain labels from the above section (such as daml_version and service) are included. The labels added by default to each gRPC API metric are as follows:

  • canton_version: the Canton protocol version
  • grpc_code: the human readable status code for gRPC (for example: OK, CANCELLED, DEADLINE_EXCEEDED)
  • The type of the client/server gRPC request, under the labels grpc_client_type and grpc_server_type
  • The protobuf package and service names, under the labels grpc_service_name and grpc_method_name
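
As a rough illustration of filtering and aggregating by label, the sketch below parses a few sample lines in the Prometheus text exposition format and computes, per service, the share of requests that ended with a 5xx status. The metric name daml_http_requests_total and the sample values are assumptions made for this example; check the names that your deployment actually exposes.

```python
import re
from collections import defaultdict

# Matches labelled sample lines such as: name{label="value",...} 12
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Yield (metric_name, labels_dict, value) for each labelled sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match:
            labels = dict(LABEL_RE.findall(match.group("labels")))
            yield match.group("name"), labels, float(match.group("value"))


def error_ratio_by_service(text, metric="daml_http_requests_total"):
    """Aggregate a request counter by the service label; return the 5xx share."""
    total = defaultdict(float)
    errors = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name != metric:  # the metric name is an assumption for this sketch
            continue
        service = labels.get("service", "unknown")
        total[service] += value
        if labels.get("http_status", "").startswith("5"):
            errors[service] += value
    return {s: errors[s] / total[s] for s in total if total[s] > 0}


# Made-up sample data, not output captured from a real node.
SAMPLE = """
# TYPE daml_http_requests_total counter
daml_http_requests_total{http_verb="POST",http_status="200",service="json_api",path="/v1/create"} 90
daml_http_requests_total{http_verb="POST",http_status="504",service="json_api",path="/v1/create"} 10
"""

print(error_ratio_by_service(SAMPLE))
```

Because counters are cumulative, this ratio covers the whole lifetime of the process. In practice you would more likely compute rates over a time window in Prometheus or Grafana.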

The following other key metrics are monitored:

  • A binary gauge indicates whether the node is healthy or not healthy. This can also be used to infer which node is passive in a highly available configuration because it will show as not being healthy, while the active node is always healthy.
  • A binary gauge signals whether a node is active or passive, for identifying which node is the active node.
  • A binary gauge detects when pruning is occurring.
  • Each participant node measures the count of the inflight (dirty) requests so the user can see if the maxDirtyRequests limit is close to being hit. The metrics are: canton_dirty_requests and canton_max_dirty_requests.
  • Each participant node records the distribution of events (updates) received by the participant and allows drill-down by event type (package upload, party creation, or transaction), status (success or failure), participant ID, and application ID (if available). The counter is called daml_indexer_events_total.
  • The ledger event requests are totaled in a counter called daml_indexer_metered_events_total.
  • JVM garbage collection metrics are collected.

This list is not exhaustive. It highlights the most important metrics.

Set Up Metrics Scraping

Enable the Prometheus Reporter

Prometheus is recommended for metrics reporting. Other reporters (jmx, graphite, and csv) are supported, but they are deprecated. Any such reporter should be migrated to Prometheus.

Prometheus can be enabled using:

canton.monitoring.metrics.reporters = [{
  type = prometheus
  address = "localhost" // default
  port = 9000 // default
}]
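
Once the reporter is enabled, a quick way to check that metrics are being served is to fetch the endpoint and list the metric names it returns. The following is a minimal sketch that assumes the reporter serves the Prometheus text format over plain HTTP at the /metrics path on the configured address and port; the path and the helper names are assumptions for this example.

```python
import re
from urllib.request import urlopen


def metrics_url(address="localhost", port=9000, path="/metrics"):
    """Build the scrape URL from the reporter settings (path is an assumption)."""
    return f"http://{address}:{port}{path}"


def fetch_metrics(url, timeout=5.0):
    """Return the raw text served at the metrics endpoint."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def metric_names(text):
    """Return the sorted, de-duplicated metric names found in exposition text."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            # The name ends at the first "{" (labels) or whitespace (value).
            names.add(re.split(r"[{\s]", line, maxsplit=1)[0])
    return sorted(names)


# Usage against a running node (not executed here):
#   print(metric_names(fetch_metrics(metrics_url())))
```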

Prometheus-Only Metrics

Some metrics are available only when using the Prometheus reporter. These metrics include common gRPC and HTTP metrics (which help you to measure the four golden signals), and JVM GC and memory usage metrics (if enabled). The metrics are documented in detail below.

Any metric marked with * is available only when using the Prometheus reporter.

Deprecated Reporters

JMX-based reporting (for testing purposes only) can be enabled using:

canton.monitoring.metrics.reporters = [{ type = jmx }]

Additionally, metrics can be written to a file:

canton.monitoring.metrics.reporters = [{
  type = jmx
}, {
  type = csv
  directory = "metrics"
  interval = 5s // default
  filters = [{
    contains = "canton"
  }]
}]

or reported via Graphite (to Grafana) using:

canton.monitoring.metrics.reporters = [{
  type = graphite
  address = "localhost" // default
  port = 2003
  prefix.type = hostname // default
  interval = 30s // default
  filters = [{
    contains = "canton"
  }]
}]

When using the graphite or the csv reporter, Canton periodically evaluates all metrics matching the given filters. Filter for only those metrics that are relevant to you.

In addition to Canton metrics, the process can also report Daml metrics (of the Ledger API server). Optionally, JVM metrics can be included using:

canton.monitoring.metrics.report-jvm-metrics = yes // default no

Metrics

The following sections contain the common metrics exposed for Daml services supporting a Prometheus metrics reporter.

For the metric types referenced below, see the relevant Prometheus documentation.

Participant Metrics

canton.<domain>.conflict-detection.sequencer-counter-queue

  • Summary: Size of conflict detection sequencer counter queue
  • Description: The task scheduler will work off tasks according to the timestamp order, scheduling the tasks whenever a new timestamp has been observed. This metric exposes the number of un-processed sequencer messages that will trigger a timestamp advancement.
  • Type: Counter
  • Qualification: Debug

canton.<domain>.conflict-detection.task-queue

  • Summary: Size of conflict detection task queue
  • Description: The task scheduler will schedule tasks to run at a given timestamp. This metric exposes the number of tasks that are waiting in the task queue for the right time to pass. A huge number does not necessarily indicate a bottleneck; it could also mean that a huge number of tasks have not yet arrived at their execution time.
  • Type: Gauge
  • Qualification: Debug

canton.<domain>.dirty-requests

  • Summary: Number of requests being validated on the domain
  • Description: Number of requests that are currently being validated on the domain. This also covers requests submitted by other participants.
  • Type: Counter
  • Qualification: Debug

canton.<domain>.protocol-messages.confirmation-request-creation

  • Summary: Time to create a confirmation request
  • Description: The time that the transaction protocol processor needs to create a confirmation request.
  • Type: Timer
  • Qualification: Debug

canton.<domain>.protocol-messages.confirmation-request-size

  • Summary: Confirmation request size
  • Description: Records the histogram of the sizes of (transaction) confirmation requests.
  • Type: Histogram
  • Qualification: Debug

canton.<domain>.protocol-messages.transaction-message-receipt

  • Summary: Time to parse a transaction message
  • Description: The time that the transaction protocol processor needs to parse and decrypt an incoming confirmation request.
  • Type: Timer
  • Qualification: Debug

canton.<domain>.request-tracker.sequencer-counter-queue

  • Summary: Size of record order publisher sequencer counter queue
  • Description: Same as for conflict-detection, but measuring the sequencer counter queues for the publishing to the ledger api server according to record time.
  • Type: Counter
  • Qualification: Debug

canton.<domain>.request-tracker.task-queue

  • Summary: Size of record order publisher task queue
  • Description: The task scheduler will schedule tasks to run at a given timestamp. This metric exposes the number of tasks that are waiting in the task queue for the right time to pass.
  • Type: Gauge
  • Qualification: Debug

canton.<domain>.sequencer-client.application-handle

  • Summary: Timer monitoring time and rate of sequentially handling the event application logic
  • Description: All events are received sequentially. This handler records the rate and time it takes the application (participant or domain) to handle the events.
  • Type: Timer
  • Qualification: Debug

canton.<domain>.sequencer-client.delay

  • Summary: The delay on the event processing
  • Description: Every message received from the sequencer carries a timestamp that was assigned by the sequencer when it sequenced the message. This timestamp is called the sequencing timestamp. The component receiving the message on the participant, mediator, or topology manager side is the sequencer client. Upon receiving the message, the sequencer client computes the difference between the sequencing time and the computer’s local clock and exposes this difference as the given metric. The difference includes the clock skew and the processing latency between the sequencer assigning the timestamp and the recipient receiving the message. If the difference is large compared to the usual latencies and if clock skew can be ruled out, then the node is still trying to catch up with events that were sequenced by the sequencer a while ago. This can happen after the node has been offline for a while or if the node is too slow to keep up with the messaging load.
  • Type: Gauge
  • Qualification: Debug
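
The delay described above can be sketched as a simple timestamp difference. In the example below, the function names, the sample timestamps, and the factor of 10 used to decide what counts as "large compared to the usual latencies" are all illustrative assumptions, not values taken from Canton.

```python
from datetime import datetime, timedelta, timezone


def sequencer_delay(sequencing_time: datetime, local_time: datetime) -> timedelta:
    """Difference between the local clock and the sequencing timestamp.

    The result includes both clock skew and processing latency.
    """
    return local_time - sequencing_time


def looks_like_catch_up(delay: timedelta, usual_latency: timedelta,
                        factor: float = 10.0) -> bool:
    """Heuristic: flag a delay far above the usual latency (factor is assumed)."""
    return delay > usual_latency * factor


sequenced_at = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
received_at = datetime(2024, 1, 1, 12, 0, 30, tzinfo=timezone.utc)

delay = sequencer_delay(sequenced_at, received_at)
print(delay.total_seconds())
print(looks_like_catch_up(delay, usual_latency=timedelta(milliseconds=200)))
```

A large value only suggests that the node is catching up once clock skew has been ruled out, so treat the heuristic as a prompt to investigate rather than as a diagnosis.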

canton.<domain>.sequencer-client.event-handle

  • Summary: Timer monitoring time and rate of entire event handling
  • Description: Most event handling cost should come from the application-handle. This timer measures the full time (which should be just marginally more than the application handle).
  • Type: Timer
  • Qualification: Debug

canton.<domain>.sequencer-client.handler.actual-in-flight-event-batches

  • Summary: Nodes process the events from the domain’s sequencer in batches. This metric tracks how many such batches are processed in parallel.
  • Description: Incoming messages are processed by a sequencer client, which combines them into batches of size up to ‘event-inbox-size’ before sending them to an application handler for processing. Depending on the system’s configuration, the rate at which event batches are sent to the handler may be throttled to avoid overwhelming it with too many events at once. The metric tracks how many of these batches have been sent to the application handler but have not yet been fully processed. Indicators that the configured upper bound may be too low: This metric is constantly close to the configured maximum, which is exposed via ‘max-in-flight-event-batches’, while the system’s resources are under-utilized. Indicators that the configured upper bound may be too high: Out-of-memory errors crashing the JVM or frequent garbage collection cycles that slow down processing. This metric can help identify potential bottlenecks or issues with the application’s processing of events and provide insights into the overall workload of the system.
  • Type: Counter
  • Qualification: Saturation

canton.<domain>.sequencer-client.handler.max-in-flight-event-batches

  • Summary: Nodes process the events from the domain’s sequencer in batches. This metric tracks the upper bound of such batches being processed in parallel.
  • Description: Incoming messages are processed by a sequencer client, which combines them into batches of size up to ‘event-inbox-size’ before sending them to an application handler for processing. Depending on the system’s configuration, the rate at which event batches are sent to the handler may be throttled to avoid overwhelming it with too many events at once. Configured by the ‘maximum-in-flight-event-batches’ parameter in the sequencer-client config. The metric shows the configured upper limit on how many batches the application handler may process concurrently. The metric ‘actual-in-flight-event-batches’ tracks the actual number of currently processed batches.
  • Type: Gauge
  • Qualification: Saturation
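
To make the "upper bound may be too low" indicator more concrete, the sketch below checks whether a series of actual-in-flight-event-batches readings stays close to the configured maximum. The thresholds (90% of the maximum, 95% of the samples) and the function names are assumptions chosen for illustration; whether resources are under-utilized must be judged from other metrics.

```python
def share_near_maximum(samples, maximum, closeness=0.9):
    """Fraction of samples that are at least `closeness` times the maximum."""
    if maximum <= 0 or not samples:
        return 0.0
    near = sum(1 for value in samples if value >= closeness * maximum)
    return near / len(samples)


def bound_may_be_too_low(samples, maximum, resources_underutilized,
                         min_share=0.95):
    """Heuristic: readings sit near the limit while resources are idle."""
    return resources_underutilized and (
        share_near_maximum(samples, maximum) >= min_share
    )


# Made-up readings of actual-in-flight-event-batches with a maximum of 20.
readings = [19, 20, 20, 20]
print(share_near_maximum(readings, 20))
print(bound_may_be_too_low(readings, 20, resources_underutilized=True))
```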

canton.<domain>.sequencer-client.submissions.dropped

  • Summary: Count of send requests that did not cause an event to be sequenced
  • Description: Counter of send requests for which we did not witness a corresponding event being sequenced by the supplied max-sequencing-time. There could be many reasons for this happening: the request may have been lost before reaching the sequencer, the sequencer may be at capacity and the max-sequencing-time was exceeded by the time the request was processed, or the supplied max-sequencing-time may just be too small for the sequencer to be able to sequence the request.
  • Type: Counter
  • Qualification: Debug

canton.<domain>.sequencer-client.submissions.in-flight

  • Summary: Number of sequencer send requests we have that are waiting for an outcome or timeout
  • Description: Incremented on every successful send to the sequencer. Decremented when the event or an error is sequenced, or when the max-sequencing-time has elapsed.
  • Type: Counter
  • Qualification: Debug

canton.<domain>.sequencer-client.submissions.overloaded

  • Summary: Count of send requests which receive an overloaded response
  • Description: Counter that is incremented if a send request receives an overloaded response from the sequencer.
  • Type: Counter
  • Qualification: Debug

canton.<domain>.sequencer-client.submissions.sends

  • Summary: Rate and timings of send requests to the sequencer
  • Description: Provides a rate and time of how long it takes for send requests to be accepted by the sequencer. Note that this is just for the request to be made and not for the requested event to actually be sequenced.
  • Type: Timer
  • Qualification: Debug

canton.<domain>.sequencer-client.submissions.sequencing

  • Summary: Rate and timings of sequencing requests
  • Description: This timer is started when a submission is made to the sequencer and then completed when a corresponding event is witnessed from the sequencer, so will encompass the entire duration for the sequencer to sequence the request. If the request does not result in an event no timing will be recorded.
  • Type: Timer
  • Qualification: Debug

canton.<domain>.traffic-control.event-above-traffic-limit

  • Summary: Event was not delivered because of traffic limit exceeded
  • Description: An event was not delivered because of insufficient traffic credit.
  • Type: Meter
  • Qualification: Traffic

canton.<domain>.traffic-control.event-delivered

  • Summary: Event was delivered
  • Description: An event was delivered.
  • Type: Meter
  • Qualification: Traffic

canton.<domain>.traffic-control.extra-traffic-credit-available

  • Summary: Current amount of extra traffic remaining
  • Description: Gets updated with every event received.
  • Type: Gauge
  • Qualification: Traffic

canton.<domain>.traffic-control.traffic-state-topology-transaction

  • Summary: Records a new top up on the participant
  • Description: Records top up events and the new extra traffic limit associated.
  • Type: Gauge
  • Qualification: Traffic

canton.commitments.compute

  • Summary: Time spent on commitment computations.
  • Description: Participant nodes compute bilateral commitments at regular intervals. This metric exposes the time spent on each computation. If the time to compute the commitments starts to exceed the commitment intervals, this likely indicates a problem.
  • Type: Timer
  • Qualification: Debug

canton.db-storage.<service>.executor.queued

  • Summary: Number of database access tasks waiting in queue
  • Description: Database access tasks get scheduled in this queue and get executed using one of the existing asynchronous sessions. A large queue indicates that the database connection is not able to deal with the large number of requests. Note that the queue has a maximum size. Tasks that do not fit into the queue will be retried, but won’t show up in this metric.
  • Type: Counter
  • Qualification: Debug
  • Instances: locks, write, general

canton.db-storage.<service>.executor.running

  • Summary: Number of database access tasks currently running
  • Description: Database access tasks run on an async executor. This metric shows the current number of tasks running in parallel.
  • Type: Counter
  • Qualification: Debug
  • Instances: locks, write, general

canton.db-storage.<service>.executor.waittime

  • Summary: Scheduling time metric for database tasks
  • Description: Every database query is scheduled using an asynchronous executor with a queue. The time a task is waiting in this queue is monitored using this metric.
  • Type: Timer
  • Qualification: Debug
  • Instances: locks, write, general

canton.db-storage.<storage>

  • Summary: Timer monitoring duration and rate of accessing the given storage
  • Description: Covers both read from and writes to the storage.
  • Type: Timer
  • Qualification: Debug

canton.db-storage.<storage>.load

  • Summary: The load on the given storage
  • Description: The load is a factor between 0 and 1 describing how much of an existing interval has been spent reading from or writing to the storage.
  • Type: Gauge
  • Qualification: Debug

canton.db-storage.alerts.multi-domain-event-log

  • Summary: Number of failed writes to the multi-domain event log
  • Description: Failed writes to the multi-domain event log indicate an issue requiring user intervention. In the case of domain event logs, the corresponding domain no longer emits any subsequent events until domain recovery is initiated (e.g. by disconnecting and reconnecting the participant from the domain). In the case of the participant event log, an operation might need to be reissued. If this counter is larger than zero, check the Canton log for error details.
  • Type: Counter
  • Qualification: Debug

canton.db-storage.alerts.single-dimension-event-log

  • Summary: Number of failed writes to the event log
  • Description: Failed writes to the single dimension event log indicate an issue requiring user intervention. In the case of domain event logs, the corresponding domain no longer emits any subsequent events until domain recovery is initiated (e.g. by disconnecting and reconnecting the participant from the domain). In the case of the participant event log, an operation might need to be reissued. If this counter is larger than zero, check the Canton log for error details.
  • Type: Counter
  • Qualification: Debug

canton.dirty_requests*

  • Summary: Number of requests being validated.
  • Description: Number of requests that are currently being validated. This also covers requests submitted by other participants.
  • Type: Gauge
  • Qualification: Debug
  • Labels:
    • participant: The id of the participant for which the value applies.

canton.max_dirty_requests*

  • Summary: Configured maximum number of requests currently being validated.
  • Description: Configuration for the maximum number of requests that are currently being validated. This also covers requests submitted by other participants. A negative value means no configuration value was provided and no limit is enforced.
  • Type: Gauge
  • Qualification: Debug
  • Labels:
    • participant: The id of the participant for which the value applies.
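
The two gauges above can be combined to see how close a participant is to its limit. The following is a minimal sketch that takes the two gauge values as plain numbers; the function names and the 80% warning threshold are assumptions for illustration. It follows the description above in treating a negative maximum as "no limit enforced".

```python
def dirty_request_utilization(dirty, max_dirty):
    """Return dirty / max_dirty, or None when no limit is enforced."""
    if max_dirty < 0:
        return None  # negative value: no limit configured
    if max_dirty == 0:
        return float("inf") if dirty > 0 else 0.0
    return dirty / max_dirty


def near_limit(dirty, max_dirty, threshold=0.8):
    """True when utilization reaches the (assumed) warning threshold."""
    utilization = dirty_request_utilization(dirty, max_dirty)
    return utilization is not None and utilization >= threshold


print(dirty_request_utilization(40, 50))
print(near_limit(40, 50))
print(near_limit(40, -1))
```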

canton.prune

  • Summary: Duration of prune operations.
  • Description: This timer exposes the duration of pruning requests from the Canton portion of the ledger.
  • Type: Timer
  • Qualification: Debug

canton.prune.max-event-age

  • Summary: Age of oldest unpruned event.
  • Description: This gauge exposes the age of the oldest, unpruned event in hours as a way to quantify the pruning backlog.
  • Type: Gauge
  • Qualification: Debug

canton.updates-published

  • Summary: Number of updates published through the read service to the indexer
  • Description: When an update is published through the read service, it has already been committed to the ledger. The indexer will subsequently store the update in a form that allows for querying the ledger efficiently.
  • Type: Meter
  • Qualification: Debug

daml.cache.evicted_weight*

  • Summary: The sum of weights of cache entries evicted.
  • Description: The total weight of the entries evicted from the cache.
  • Type: Counter
  • Qualification: Debug
  • Labels:
    • name: The cache for which the metrics are registered.

daml.cache.evictions*

  • Summary: The number of the evicted cache entries.
  • Description: When an entry is evicted from the cache, the counter is incremented.
  • Type: Counter
  • Qualification: Debug
  • Labels:
    • name: The cache for which the metrics are registered.

daml.cache.hits*

  • Summary: The number of cache hits.
  • Description: When a cache lookup encounters an existing cache entry, the counter is incremented.
  • Type: Counter
  • Qualification: Debug
  • Labels:
    • name: The cache for which the metrics are registered.

daml.cache.misses*

  • Summary: The number of cache misses.
  • Description: When a cache lookup first encounters a missing cache entry, the counter is incremented.
  • Type: Counter
  • Qualification: Debug
  • Labels:
    • name: The cache for which the metrics are registered.
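
The hit and miss counters are most useful when combined into a hit ratio. The sketch below shows the arithmetic on plain numbers; because the counters are cumulative, it also computes the ratio between two successive readings. The function names and sample values are assumptions made for this example.

```python
def hit_ratio(hits, misses):
    """Return hits / (hits + misses), or None when there were no lookups."""
    lookups = hits + misses
    return hits / lookups if lookups else None


def hit_ratio_between(previous, current):
    """Hit ratio for the interval between two (hits, misses) counter readings."""
    return hit_ratio(current[0] - previous[0], current[1] - previous[1])


print(hit_ratio(75, 25))
print(hit_ratio_between((100, 50), (160, 70)))
```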

daml.commands.delayed_submissions

  • Summary: The number of the delayed Daml commands.
  • Description: The number of Daml commands that have been delayed internally because they have been evaluated to require the ledger time further in the future than the expected latency.
  • Type: Meter
  • Qualification: Debug

daml.commands.failed_command_interpretations

  • Summary: The number of Daml commands that failed in interpretation.
  • Description: The number of Daml commands that have been rejected by the interpreter (e.g. badly authorized action).
  • Type: Meter
  • Qualification: Debug

daml.commands.max_in_flight_capacity

  • Summary: The maximum number of Daml commands that can await completion.
  • Description: The maximum number of Daml commands that can await completion in the Command Service.
  • Type: Counter
  • Qualification: Debug

daml.commands.max_in_flight_length

  • Summary: The number of the Daml commands awaiting completion.
  • Description: The number of Daml commands currently awaiting completion in the Command Service.
  • Type: Counter
  • Qualification: Debug

daml.commands.reassignment_validation

  • Summary: The time to validate a reassignment command.
  • Description: The time to validate a submitted Daml command before it is fed to the interpreter.
  • Type: Timer
  • Qualification: Debug

daml.commands.submissions

  • Summary: The time to fully process a Daml command.
  • Description: The time to validate and interpret a command before it is handed over to the synchronization services to be finalized (either committed or rejected).
  • Type: Timer
  • Qualification: Latency

daml.commands.submissions_running

  • Summary: The number of the Daml commands that are currently being handled by the ledger api server.
  • Description: The number of the Daml commands that are currently being handled by the ledger api server (including validation, interpretation, and handing the transaction over to the synchronization services).
  • Type: Counter
  • Qualification: Debug

daml.commands.valid_submissions

  • Summary: The total number of the valid Daml commands.
  • Description: The total number of the Daml commands that have passed validation and were sent to interpretation in this ledger api server process.
  • Type: Meter
  • Qualification: Debug

daml.commands.validation

  • Summary: The time to validate a Daml command.
  • Description: The time to validate a submitted Daml command before it is fed to the interpreter.
  • Type: Timer
  • Qualification: Debug

daml.db.commit.duration.seconds*

  • Summary: The time needed to perform the SQL query commit.
  • Description: This metric measures the time it takes to commit an SQL transaction relating to the <operation>. It roughly corresponds to calling commit() on a DB connection.
  • Type: Timer
  • Qualification: Debug
  • Labels:
    • name: The operation/pool for which the metric is registered.

daml.db.compression.duration.seconds*

  • Summary: The time needed to decompress the SQL query result.
  • Description: Some index database queries that target contracts involve a decompression step. For such queries this metric represents the time it takes to decompress contract arguments retrieved from the database.
  • Type: Timer
  • Qualification: Debug
  • Labels:
    • name: The operation/pool for which the metric is registered.

daml.db.exec.duration.seconds*

  • Summary: The time needed to run the SQL query and read the result.
  • Description: This metric encompasses the time measured by query and commit metrics. Additionally it includes the time needed to obtain the DB connection, optionally roll it back and close the connection at the end.
  • Type: Timer
  • Qualification: Debug
  • Labels:
    • name: The operation/pool for which the metric is registered.

daml.db.query.duration.seconds*

  • Summary: The time needed to run the SQL query.
  • Description: This metric measures the time it takes to execute a block of code (on a dedicated executor) related to the <operation> that can issue multiple SQL statements such that all run in a single DB transaction (either committed or aborted).
  • Type: Timer
  • Qualification: Debug
  • Labels:
    • name: The operation/pool for which the metric is registered.

daml.db.translation.duration.seconds*

  • Summary: The time needed to turn serialized Daml-LF values into in-memory objects.
  • Description: Some index database queries that target contracts and transactions involve a Daml-LF translation step. For such queries this metric stands for the time it takes to turn the serialized Daml-LF values into in-memory representation.
  • Type: Timer
  • Qualification: Debug
  • Labels:
    • name: The operation/pool for which the metric is registered.

daml.db.wait.duration.seconds*

  • Summary: The time needed to acquire a connection to the database.
  • Description: SQL statements are run in a dedicated executor. This metric measures the time it takes between creating the SQL statement corresponding to the <operation> and the point when it starts running on the dedicated executor.
  • Type: Timer
  • Qualification: Debug
  • Labels:
    • name: The operation/pool for which the metric is registered.

daml.execution.cache.contract_state.register_update

  • Summary: The time spent to update the cache.
  • Description: The total time spent in sequential update steps of the contract state caches updating logic. This metric is created with debugging purposes in mind.
  • Type: Timer
  • Qualification: Debug

daml.execution.cache.key_state.register_update

  • Summary: The time spent to update the cache.
  • Description: The total time spent in sequential update steps of the contract state caches updating logic. This metric is created with debugging purposes in mind.
  • Type: Timer
  • Qualification: Debug

daml.execution.cache.read_through_not_found

  • Summary: The number of cache read-throughs resulting in not found contracts.
  • Description: On cache misses, a read-through query is performed against the Index database. When the contract is not found (as a result of this query), this counter is incremented.
  • Type: Counter
  • Qualification: Debug

daml.execution.cache.resolve_divulgence_lookup

  • Summary: The number of lookups trying to resolve divulged contracts on active contracts cache hits.
  • Description: Divulged contracts are not cached in the contract state caches. On active contract cache hits, where stakeholders are not within the submission readers, a contract activeness lookup is performed against the Index database. On such lookups, this counter is incremented.
  • Type: Counter
  • Qualification: Debug

daml.execution.cache.resolve_full_lookup

  • Summary: The number of lookups trying to resolve divulged contracts on archived contracts cache hits.
  • Description: Divulged contracts are not cached in the contract state caches. On archived contract cache hits, where stakeholders are not within the submission readers, a full contract activeness lookup (including fetching contract arguments) is performed against the Index database. On such lookups, this counter is incremented.
  • Type: Counter
  • Qualification: Debug

daml.execution.engine

  • Summary: The time spent executing a Daml command.
  • Description: The time spent by the Daml engine executing a Daml command (excluding fetching data).
  • Type: Timer
  • Qualification: Debug

daml.execution.engine_running

  • Summary: The number of Daml commands currently being executed.
  • Description: The number of the commands that are currently being executed by the Daml engine (excluding fetching data).
  • Type: Counter
  • Qualification: Debug

daml.execution.get_lf_package

  • Summary: The time to fetch individual Daml code packages during interpretation.
  • Description: The interpretation of a command in the ledger api server might require fetching multiple Daml packages. This metric exposes the time needed to fetch the packages that are necessary for interpretation.
  • Type: Timer
  • Qualification: Debug

daml.execution.lookup_active_contract

  • Summary: The time to look up individual active contracts during interpretation.
  • Description: The interpretation of a command in the ledger api server might require fetching multiple active contracts. This metric exposes the time to look up individual active contracts.
  • Type: Timer
  • Qualification: Debug

daml.execution.lookup_active_contract_count_per_execution

  • Summary: The number of the active contracts looked up per Daml command.
  • Description: The interpretation of a command in the ledger api server might require fetching multiple active contracts. This metric exposes the number of active contracts that must be looked up to process a Daml command.
  • Type: Histogram
  • Qualification: Debug

daml.execution.lookup_active_contract_per_execution

  • Summary: The compound time to look up all active contracts in a single Daml command.
  • Description: The interpretation of a command in the ledger api server might require fetching multiple active contracts. This metric exposes the compound time to look up all the active contracts in a single Daml command.
  • Type: Timer
  • Qualification: Debug

daml.execution.lookup_contract_key

  • Summary: The time to look up individual contract keys during interpretation.
  • Description: The interpretation of a command in the ledger api server might require fetching multiple contract keys. This metric exposes the time needed to look up individual contract keys.
  • Type: Timer
  • Qualification: Debug

daml.execution.lookup_contract_key_count_per_execution

  • Summary: The number of contract keys looked up per Daml command.
  • Description: The interpretation of a command in the ledger api server might require fetching multiple contract keys. This metric exposes the number of contract keys that must be looked up to process a Daml command.
  • Type: Histogram
  • Qualification: Debug

daml.execution.lookup_contract_key_per_execution

  • Summary: The compound time to look up all contract keys in a single Daml command.
  • Description: The interpretation of a command in the ledger api server might require fetching multiple contract keys. This metric exposes the compound time needed to look up all the contract keys in a single Daml command.
  • Type: Timer
  • Qualification: Debug

daml.execution.retry

  • Summary: The number of the interpretation retries.
  • Description: The total number of interpretation retries attempted due to mismatching ledger effective time in this ledger api server process.
  • Type: Meter
  • Qualification: Debug

daml.execution.total

  • Summary: The overall time spent interpreting a Daml command.
  • Description: The time spent interpreting a Daml command in the ledger api server (includes executing Daml and fetching data).
  • Type: Timer
  • Qualification: Debug

daml.execution.total_running

  • Summary: The number of Daml commands currently being interpreted.
  • Description: The number of the commands that are currently being interpreted (includes executing Daml code and fetching data).
  • Type: Counter
  • Qualification: Debug

daml.executor.runtime.completed*

  • Summary: The number of tasks completed in an instrumented executor.
  • Description: The number of tasks completed by this executor
  • Type: Meter
  • Qualification: Debug
  • Labels:
    • name: The name of the executor service.
    • type: The type of the executor service. Can be fork_join or thread_pool.

daml.executor.runtime.duration*

  • Summary: The time a task runs in an instrumented executor.
  • Description: A task is considered running only after it has started execution.
  • Type: Timer
  • Qualification: Debug
  • Labels:
    • name: The name of the executor service.
    • type: The type of the executor service. Can be fork_join or thread_pool.

daml.executor.runtime.idle*

  • Summary: The time that a task is idle in an instrumented executor.
  • Description: A task is considered idle if it was submitted to the executor but it has not started execution yet.
  • Type: Timer
  • Qualification: Debug
  • Labels:
    • name: The name of the executor service.
    • type: The type of the executor service. Can be fork_join or thread_pool.

daml.executor.runtime.running*

  • Summary: The number of tasks running in an instrumented executor.
  • Description: The number of currently running tasks.
  • Type: Counter
  • Qualification: Debug
  • Labels:
    • name: The name of the executor service.
    • type: The type of the executor service. Can be fork_join or thread_pool.

daml.executor.runtime.submitted*

  • Summary: The number of tasks submitted to an instrumented executor.
  • Description: Number of tasks that were submitted to the executor.
  • Type: Meter
  • Qualification: Debug
  • Labels:
    • name: The name of the executor service.
    • type: The type of the executor service: fork_join or thread_pool.

daml.index.active_contracts_buffer_size

  • Summary: The buffer size for active contracts requests.
  • Description: A Pekko stream buffer is added at the end of all streaming queries to absorb temporary downstream backpressure (e.g. when the client is slower than upstream delivery throughput). This metric gauges the size of the buffer for queries requesting active contracts that satisfy a given predicate.
  • Type: Counter
  • Qualification: Saturation

daml.index.completions_buffer_size

  • Summary: The buffer size for completions requests.
  • Description: A Pekko stream buffer is added at the end of all streaming queries to absorb temporary downstream backpressure (e.g. when the client is slower than upstream delivery throughput). This metric gauges the size of the buffer for queries requesting the completed commands in a specific period of time.
  • Type: Counter
  • Qualification: Saturation

daml.index.db.active_contract_lookup_batch_size

  • Summary: The batch sizes in the active contract lookup batch-loading Contract Service.
  • Description: The number of active contract lookups contained in a batch, used in the batch-loading Contract Service.
  • Type: Histogram
  • Qualification: Debug

daml.index.db.active_contract_lookup_buffer_capacity

  • Summary: The capacity of the active contract lookup queue.
  • Description: The maximum number of elements that can be kept in the queue of active contract lookups in the batch-loading queue of the Contract Service.
  • Type: Counter
  • Qualification: Debug

daml.index.db.active_contract_lookup_buffer_delay

  • Summary: The queuing delay for the active contract lookup queue.
  • Description: The queuing delay for the pending active contract lookups in the batch-loading queue of the Contract Service.
  • Type: Timer
  • Qualification: Debug

daml.index.db.active_contract_lookup_buffer_length

  • Summary: The number of the currently pending active contract lookups.
  • Description: The number of the currently pending active contract lookups in the batch-loading queue of the Contract Service.
  • Type: Counter
  • Qualification: Debug

daml.index.db.compression.create_argument_compressed

  • Summary: The size of the compressed arguments of a create event.
  • Description: Event information can be compressed by the indexer before storing it in the database. This metric collects statistics about the size of compressed arguments of a create event.
  • Type: Histogram
  • Qualification: Debug

daml.index.db.compression.create_argument_uncompressed

  • Summary: The size of the decompressed argument of a create event.
  • Description: Event information can be compressed by the indexer before storing it in the database. This metric collects statistics about the size of decompressed arguments of a create event.
  • Type: Histogram
  • Qualification: Debug

daml.index.db.compression.create_key_value_compressed

  • Summary: The size of the compressed key value of a create event.
  • Description: Event information can be compressed by the indexer before storing it in the database. This metric collects statistics about the size of compressed key value of a create event.
  • Type: Histogram
  • Qualification: Debug

daml.index.db.compression.create_key_value_uncompressed

  • Summary: The size of the decompressed key value of a create event.
  • Description: Event information can be compressed by the indexer before storing it in the database. This metric collects statistics about the size of decompressed key value of a create event.
  • Type: Histogram
  • Qualification: Debug

daml.index.db.compression.exercise_argument_compressed

  • Summary: The size of the compressed argument of an exercise event.
  • Description: Event information can be compressed by the indexer before storing it in the database. This metric collects statistics about the size of compressed arguments of an exercise event.
  • Type: Histogram
  • Qualification: Debug

daml.index.db.compression.exercise_argument_uncompressed

  • Summary: The size of the decompressed argument of an exercise event.
  • Description: Event information can be compressed by the indexer before storing it in the database. This metric collects statistics about the size of decompressed arguments of an exercise event.
  • Type: Histogram
  • Qualification: Debug

daml.index.db.compression.exercise_result_compressed

  • Summary: The size of the compressed result of an exercise event.
  • Description: Event information can be compressed by the indexer before storing it in the database. This metric collects statistics about the size of compressed result of an exercise event.
  • Type: Histogram
  • Qualification: Debug

daml.index.db.compression.exercise_result_uncompressed

  • Summary: The size of the decompressed result of an exercise event.
  • Description: Event information can be compressed by the indexer before storing it in the database. This metric collects statistics about the size of decompressed result of an exercise event.
  • Type: Histogram
  • Qualification: Debug
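
Each compressed metric above has an uncompressed counterpart, so the two can be compared to estimate how much space compression saves. The following is a minimal sketch, assuming you have already extracted a total or mean size in bytes from each of the two histograms. The function name and the sample values are hypothetical.

```python
def space_saved(uncompressed_bytes: float, compressed_bytes: float) -> float:
    """Return the fraction of space saved by compression (0.0 means no saving)."""
    if uncompressed_bytes <= 0:
        # Nothing was stored, so there is nothing to save.
        return 0.0
    return 1.0 - compressed_bytes / uncompressed_bytes


# Hypothetical sizes, not real measurements.
saving = space_saved(uncompressed_bytes=2000, compressed_bytes=500)
print(f"space saved: {saving:.0%}")
```

A negative result means the compressed form is larger than the original, which can happen for very small payloads.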

daml.index.db.flat_transactions_stream.translation

  • Summary: The time needed to turn serialized Daml-LF values into in-memory objects.
  • Description: Some index database queries that target contracts and transactions involve a Daml-LF translation step. For such queries this metric stands for the time it takes to turn the serialized Daml-LF values into in-memory representation.
  • Type: Timer
  • Qualification: Debug

daml.index.db.lookup_active_contract

  • Summary: The time spent fetching a contract using its id.
  • Description: This metric exposes the time spent fetching a contract using its id from the index db. It is then used by the Daml interpreter when evaluating a command into a transaction.
  • Type: Timer
  • Qualification: Debug

daml.index.db.lookup_key

  • Summary: The time spent looking up a contract using its key.
  • Description: This metric exposes the time spent looking up a contract using its key in the index db. It is then used by the Daml interpreter when evaluating a command into a transaction.
  • Type: Timer
  • Qualification: Debug

daml.index.db.reassignment_stream.translation

  • Summary: The time needed to turn serialized Daml-LF values into in-memory objects.
  • Description: Some index database queries that target contracts and transactions involve a Daml-LF translation step. For such queries this metric stands for the time it takes to turn the serialized Daml-LF values into in-memory representation.
  • Type: Timer
  • Qualification: Debug

daml.index.db.translation.get_lf_package

  • Summary: The time needed to deserialize and decode a Daml-LF archive.
  • Description: Before a Daml archive can be used in interpretation, it needs to be deserialized and decoded, in other words converted into its in-memory representation. This metric represents the time necessary to do that.
  • Type: Timer
  • Qualification: Debug

daml.index.db.tree_transactions_stream.translation

  • Summary: The time needed to turn serialized Daml-LF values into in-memory objects.
  • Description: Some index database queries that target contracts and transactions involve a Daml-LF translation step. For such queries this metric stands for the time it takes to turn the serialized Daml-LF values into in-memory representation.
  • Type: Timer
  • Qualification: Debug

daml.index.flat_transactions_buffer_size

  • Summary: The buffer size for flat transactions requests.
  • Description: A Pekko stream buffer is added at the end of all streaming queries to absorb temporary downstream backpressure (e.g. when the client is slower than upstream delivery throughput). This metric gauges the size of the buffer for queries requesting flat transactions in a specific period of time that satisfy a given predicate.
  • Type: Counter
  • Qualification: Saturation

daml.index.ledger_end_sequential_id

  • Summary: The sequential id of the current ledger end kept in memory.
  • Description: The ledger end’s sequential id is a monotonically increasing integer value representing the sequential id ascribed to the most recent ledger event ingested by the index db. Note that only a subset of all ledger events are ingested and given a sequential id. These are: creates, consuming exercises, non-consuming exercises and divulgence events. This value can be treated as a counter of all such events visible to a given participant. This metric exposes the latest ledger end’s sequential id registered in the in-memory data set.
  • Type: Gauge
  • Qualification: Debug

daml.index.lf_value.compute_interface_view

  • Summary: The time to compute an interface view while serving transaction streams.
  • Description: The Transaction API allows clients to request events by interface-id. When an event matches the interface, an interface view is computed, which adds to the latency. This metric represents the time for each such computation.
  • Type: Timer
  • Qualification: Debug

daml.index.package_metadata.decode_archive

  • Summary: The time to decode a package archive to extract metadata information.
  • Description: This metric represents the time spent scanning each uploaded package for new interfaces and corresponding templates.
  • Type: Timer
  • Qualification: Debug

daml.index.package_metadata.view_init

  • Summary: The time to initialize package metadata view.
  • Description: As the mapping between interfaces and templates is not persistent, it is computed on each Indexer restart by loading all packages that were ever uploaded and scanning them to extract metadata information.
  • Type: Timer
  • Qualification: Debug

daml.index.transaction_trees_buffer_size

  • Summary: The buffer size for transaction trees requests.
  • Description: A Pekko stream buffer is added at the end of all streaming queries to absorb temporary downstream backpressure (e.g. when the client is slower than upstream delivery throughput). This metric gauges the size of the buffer for queries requesting transaction trees.
  • Type: Counter
  • Qualification: Saturation

daml.indexer.current_record_time_lag

  • Summary: The lag between the record time of a transaction and the wall-clock time registered at the ingestion phase to the index db (in milliseconds).
  • Description: Depending on the systemic clock skew between different machines, this value can be negative.
  • Type: Gauge
  • Qualification: Debug
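
The lag described above is a difference between two timestamps taken on potentially different machines, which is why it can be negative. The following is a minimal sketch of that arithmetic. It assumes the lag is computed as ingestion wall-clock time minus record time. The sign convention and the function name are assumptions, not taken from the implementation.

```python
def record_time_lag_ms(record_time_ms: int, ingestion_wall_clock_ms: int) -> int:
    """Lag in milliseconds between a transaction's record time and its ingestion.

    Assumed convention: positive when ingestion happened after the record time.
    """
    return ingestion_wall_clock_ms - record_time_ms


# Hypothetical timestamps in milliseconds since the epoch.
print(record_time_lag_ms(record_time_ms=1_000, ingestion_wall_clock_ms=1_250))
# With clock skew, the ingesting machine's clock may be behind the record time.
print(record_time_lag_ms(record_time_ms=1_000, ingestion_wall_clock_ms=990))
```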

daml.indexer.events*

  • Summary: Number of transactions processed.
  • Description: Represents the total number of transaction acceptance, transaction rejection, package upload, party allocation, etc. events processed.
  • Type: Meter
  • Qualification: Debug
  • Labels:
    • participant_id: The id of the participant.
    • application_id: The application generating the events.
    • event_type: The type of ledger event processed (transaction, package upload, party allocation, configuration change).
    • status: Indicates if the transaction was accepted or not. Possible values accepted|rejected.

daml.indexer.last_received_record_time

  • Summary: The time of the last event ingested by the index db (in milliseconds since EPOCH).
  • Description: The last received record time is a monotonically increasing integer value that represents the record time of the last event ingested by the index db. It is measured in milliseconds since the EPOCH time.
  • Type: Gauge
  • Qualification: Debug

daml.indexer.ledger_end_sequential_id

  • Summary: The sequential id of the current ledger end kept in the database.
  • Description: The ledger end’s sequential id is a monotonically increasing integer value representing the sequential id ascribed to the most recent ledger event ingested by the index db. Note that only a subset of all ledger events are ingested and given a sequential id. These are: creates, consuming exercises, non-consuming exercises and divulgence events. This value can be treated as a counter of all such events visible to a given participant. This metric exposes the latest ledger end’s sequential id registered in the database.
  • Type: Gauge
  • Qualification: Debug

daml.indexer.metered_events*

  • Summary: Number of ledger events that are metered.
  • Description: Represents the number of events that will be included in the metering report. This is an estimate of the total number and not a substitute for the metering report.
  • Type: Meter
  • Qualification: Debug
  • Labels:
    • participant_id: The id of the participant.
    • application_id: The application generating the events.

daml.lapi.streams.acs_sent

  • Summary: The number of the active contracts sent by the ledger api.
  • Description: The total number of active contracts sent over the ledger api streams to all clients.
  • Type: Counter
  • Qualification: Traffic

daml.lapi.streams.active

  • Summary: The number of the active streams served by the ledger api.
  • Description: The number of ledger api streams currently being served to all clients.
  • Type: Gauge
  • Qualification: Debug

daml.lapi.streams.completions_sent

  • Summary: The number of the command completions sent by the ledger api.
  • Description: The total number of completions sent over the ledger api streams to all clients.
  • Type: Counter
  • Qualification: Traffic

daml.lapi.streams.transaction_trees_sent

  • Summary: The number of the transaction trees sent over the ledger api.
  • Description: The total number of the transaction trees sent over the ledger api streams to all clients.
  • Type: Counter
  • Qualification: Traffic

daml.lapi.streams.transactions_sent

  • Summary: The number of the flat updates sent over the ledger api.
  • Description: The total number of the flat updates sent over the ledger api streams to all clients.
  • Type: Counter
  • Qualification: Traffic

daml.lapi.streams.update_trees_sent

  • Summary: The number of the update trees sent over the ledger api.
  • Description: The total number of the update trees sent over the ledger api streams to all clients.
  • Type: Counter
  • Qualification: Traffic

daml.parallel_indexer.input_buffer_length

  • Summary: The number of elements in the queue in front of the indexer.
  • Description: The indexer has a queue in order to absorb the back pressure and facilitate batch formation during the database ingestion.
  • Type: Counter
  • Qualification: Saturation

daml.parallel_indexer.inputmapping.batch_size

  • Summary: The batch sizes in the indexer.
  • Description: The number of state updates contained in a batch used in the indexer for database submission.
  • Type: Histogram
  • Qualification: Debug

daml.parallel_indexer.output_batched_buffer_length

  • Summary: The size of the queue between the indexer and the in-memory state updating flow.
  • Description: This counter counts batches of updates passed to the in-memory flow. Batches are dynamically-sized based on amount of backpressure exerted by the downstream stages of the flow.
  • Type: Counter
  • Qualification: Debug

daml.parallel_indexer.seqmapping.duration

  • Summary: The duration of the seq-mapping stage.
  • Description: The time that a batch of updates spends in the seq-mapping stage of the indexer.
  • Type: Timer
  • Qualification: Debug

daml.parallel_indexer.updates

  • Summary: The number of the state updates persisted to the database.
  • Description: The number of the state updates persisted to the database. There are updates such as accepted transactions, configuration changes, package uploads, party allocations, rejections, etc.
  • Type: Counter
  • Qualification: Traffic

daml.services.index.<operation>

  • Summary: The time to execute an index service operation.
  • Description: The index service is an internal component responsible for access to the index db data. Its operations are invoked whenever a client request received over the ledger api requires access to the index db. This metric captures time statistics of such operations.
  • Type: Timer
  • Qualification: Debug
  • Instances: get_transaction_metering, prune, configuration_entries, lookup_configuration, party_entries, list_known_parties, get_parties, get_participant_id, lookup_maximum_ledger_time, get_events_by_contract_key, get_events_by_contract_id, lookup_contract_key, lookup_contract_state_without_divulgence, lookup_active_contract, get_active_contracts, get_transaction_tree_by_id, get_transaction_by_id, transaction_trees, transactions, get_completions_limited, get_completions, latest_pruned_offsets, current_ledger_end, get_ledger_configuration, package_entries, get_lf_archive, list_lf_packages

daml.services.index.in_memory_fan_out_buffer.prune

  • Summary: The time to remove all elements from the in-memory fan-out buffer.
  • Description: It is possible to remove the oldest entries of the in-memory fan out buffer. This metric exposes the time needed to prune the buffer.
  • Type: Timer
  • Qualification: Debug

daml.services.index.in_memory_fan_out_buffer.push

  • Summary: The time to add a new event into the buffer.
  • Description: The in-memory fan-out buffer is a buffer that stores the last ingested maxBufferSize accepted and rejected submission updates as TransactionLogUpdate. It allows bypassing IndexDB persistence fetches for recent updates for flat and transaction tree streams, command completion streams and by-event-id and by-transaction-id flat and transaction tree lookups. This metric exposes the time spent on adding a new event into the buffer.
  • Type: Timer
  • Qualification: Debug
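
The buffer described above keeps only the most recent maxBufferSize updates and serves recent reads without going to the IndexDB. The following is a conceptual sketch of that behavior using a bounded deque. It is not the participant's implementation: the class, its methods, and the use of integer offsets and string updates are all hypothetical simplifications.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple


class FanOutBuffer:
    """Toy bounded buffer holding the last max_buffer_size (offset, update) pairs."""

    def __init__(self, max_buffer_size: int) -> None:
        # A deque with maxlen silently evicts the oldest entry when full.
        self._entries: Deque[Tuple[int, str]] = deque(maxlen=max_buffer_size)

    def push(self, offset: int, update: str) -> None:
        self._entries.append((offset, update))

    def size(self) -> int:
        return len(self._entries)

    def updates_after(self, start_offset: int) -> Optional[List[str]]:
        """Return buffered updates after start_offset, or None if the buffer
        no longer covers that range and the caller must read from the database."""
        if not self._entries or start_offset < self._entries[0][0]:
            return None
        return [update for offset, update in self._entries if offset > start_offset]


buffer = FanOutBuffer(max_buffer_size=3)
for offset in range(1, 6):
    buffer.push(offset, f"u{offset}")

print(buffer.size())
print(buffer.updates_after(3))
print(buffer.updates_after(1))
```

In this model, a request that starts before the oldest buffered offset returns None, which stands in for falling back to a persistence fetch.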

daml.services.index.in_memory_fan_out_buffer.size

  • Summary: The size of the in-memory fan-out buffer.
  • Description: The actual size of the in-memory fan-out buffer. This metric is mostly targeted for debugging purposes.
  • Type: Histogram
  • Qualification: Saturation

daml.services.read.<operation>

  • Summary: The time to execute a read service operation.
  • Description: The read service is an internal interface for reading the events from the synchronization interfaces. The metrics expose the time needed to execute each operation.
  • Type: Timer
  • Qualification: Debug
  • Instances: incomplete_reassignment_offsets, get_connected_domains, state_updates

daml.services.write.<operation>

  • Summary: The time to execute a write service operation.
  • Description: The write service is an internal interface for changing the state through the synchronization services. The methods in this interface are all methods that are supported uniformly across all ledger implementations. This metric exposes the time needed to execute each operation.
  • Type: Timer
  • Qualification: Debug
  • Instances: prune, submit_configuration, allocate_party, upload_packages, submit_reassignment_running, submit_reassignment, submit_transaction_running, submit_transaction

daml.services.write.submit_transaction.count

  • Summary: The number of submitted transactions by the write service.
  • Description: The write service is an internal interface for changing the state through the synchronization services. The methods in this interface are all methods that are supported uniformly across all ledger implementations. This metric exposes the total number of the submitted transactions.
  • Type: Timer
  • Qualification: Traffic

Domain Metrics

canton.<component>.sequencer-client.application-handle

  • Summary: Timer monitoring time and rate of sequentially handling the event application logic
  • Description: All events are received sequentially. This handler records the rate and time it takes the application (participant or domain) to handle the events.
  • Type: Timer
  • Qualification: Debug
  • Instances: topology-manager, mediator, sequencer

canton.<component>.sequencer-client.delay

  • Summary: The delay on the event processing
  • Description: Every message received from the sequencer carries a timestamp that was assigned by the sequencer when it sequenced the message. This timestamp is called the sequencing timestamp. The component receiving the message on the participant, mediator or topology manager side is the sequencer client. Upon receiving the message, the sequencer client compares the sequencing time with the computer’s local clock and exposes the difference as the given metric. The difference includes the clock skew and the processing latency between assigning the timestamp on the sequencer and receiving the message by the recipient. If the difference is large compared to the usual latencies and if clock skew can be ruled out, then it means that the node is still trying to catch up with events that were sequenced by the sequencer a while ago. This can happen after having been offline for a while or if the node is too slow to keep up with the messaging load.
  • Type: Gauge
  • Qualification: Debug
  • Instances: topology-manager, mediator, sequencer
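
The reasoning above can be sketched as a small check: compute the delay as local clock minus sequencing timestamp, then compare it with the usual latency. The function names, the factor of 10, and the sample values below are hypothetical. Choose a threshold that fits your own baseline, and rule out clock skew separately.

```python
def sequencer_delay_ms(sequencing_timestamp_ms: int, local_clock_ms: int) -> int:
    """Difference between the local clock and the sequencing timestamp."""
    return local_clock_ms - sequencing_timestamp_ms


def looks_like_catch_up(delay_ms: float, usual_latency_ms: float, factor: float = 10.0) -> bool:
    """Heuristic: the delay is large compared to the usual latency.

    The factor is an arbitrary illustrative threshold, not a recommended value.
    """
    return delay_ms > factor * usual_latency_ms


# Hypothetical values: a 40 ms delay against a 50 ms baseline is unremarkable.
delay = sequencer_delay_ms(sequencing_timestamp_ms=1_000, local_clock_ms=1_040)
print(delay, looks_like_catch_up(delay, usual_latency_ms=50))
# A one-minute delay against the same baseline suggests the node is catching up.
print(looks_like_catch_up(60_000, usual_latency_ms=50))
```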

canton.<component>.sequencer-client.event-handle

  • Summary: Timer monitoring time and rate of entire event handling
  • Description: Most event handling cost should come from the application-handle. This timer measures the full time (which should be just marginally more than the application handle).
  • Type: Timer
  • Qualification: Debug
  • Instances: topology-manager, mediator, sequencer

canton.db-storage.<service>.executor.queued

  • Summary: Number of database access tasks waiting in queue
  • Description: Database access tasks get scheduled in this queue and get executed using one of the existing asynchronous sessions. A large queue indicates that the database connection is not able to deal with the large number of requests. Note that the queue has a maximum size. Tasks that do not fit into the queue will be retried, but won’t show up in this metric.
  • Type: Counter
  • Qualification: Debug
  • Instances: locks, write, general

canton.db-storage.<service>.executor.running

  • Summary: Number of database access tasks currently running
  • Description: Database access tasks run on an async executor. This metric shows the current number of tasks running in parallel.
  • Type: Counter
  • Qualification: Debug
  • Instances: locks, write, general

canton.db-storage.<service>.executor.waittime

  • Summary: Scheduling time metric for database tasks
  • Description: Every database query is scheduled using an asynchronous executor with a queue. The time a task is waiting in this queue is monitored using this metric.
  • Type: Timer
  • Qualification: Debug
  • Instances: locks, write, general

canton.db-storage.<storage>

  • Summary: Timer monitoring duration and rate of accessing the given storage
  • Description: Covers both read from and writes to the storage.
  • Type: Timer
  • Qualification: Debug

canton.db-storage.<storage>.load

  • Summary: The load on the given storage
  • Description: The load is a factor between 0 and 1 describing how much of an existing interval has been spent reading from or writing to the storage.
  • Type: Gauge
  • Qualification: Debug
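
The load factor described above is the share of an interval spent on storage access. The following is a minimal sketch of that ratio, assuming you know the busy time and the interval length in the same unit. The function name and sample values are hypothetical, and the clamping to the 0 to 1 range is an assumption made for robustness.

```python
def storage_load(busy_seconds: float, interval_seconds: float) -> float:
    """Fraction of the interval spent reading from or writing to the storage."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # Clamp so that measurement noise cannot push the result outside [0, 1].
    return min(1.0, max(0.0, busy_seconds / interval_seconds))


# Hypothetical sample: 15 seconds of storage access within a 60 second interval.
print(storage_load(busy_seconds=15, interval_seconds=60))
```

A value close to 1 means the storage was busy for almost the whole interval.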

canton.db-storage.alerts.multi-domain-event-log

  • Summary: Number of failed writes to the multi-domain event log
  • Description: Failed writes to the multi domain event log indicate an issue requiring user intervention. In the case of domain event logs, the corresponding domain no longer emits any subsequent events until domain recovery is initiated (e.g. by disconnecting and reconnecting the participant from the domain). In the case of the participant event log, an operation might need to be reissued. If this counter is larger than zero, check the canton log for errors for details.
  • Type: Counter
  • Qualification: Debug

canton.db-storage.alerts.single-dimension-event-log

  • Summary: Number of failed writes to the event log
  • Description: Failed writes to the single dimension event log indicate an issue requiring user intervention. In the case of domain event logs, the corresponding domain no longer emits any subsequent events until domain recovery is initiated (e.g. by disconnecting and reconnecting the participant from the domain). In the case of the participant event log, an operation might need to be reissued. If this counter is larger than zero, check the canton log for errors for details.
  • Type: Counter
  • Qualification: Debug

canton.mediator.event-rejected

  • Summary: Event rejected because of traffic limit exceeded
  • Description: This metric is being incremented every time a sequencer rejects an event because the sender does not have enough credit.
  • Type: Meter
  • Qualification: Traffic

canton.mediator.max-event-age

  • Summary: Age of oldest unpruned mediator response.
  • Description: This gauge exposes the age of the oldest, unpruned mediator response in hours as a way to quantify the pruning backlog.
  • Type: Gauge
  • Qualification: Debug

canton.mediator.outstanding-requests

  • Summary: Number of currently outstanding requests
  • Description: This metric provides the number of currently open requests registered with the mediator.
  • Type: Gauge
  • Qualification: Debug

canton.mediator.requests

  • Summary: Total number of processed requests
  • Description: This metric provides the total number of requests processed since the system was started.
  • Type: Meter
  • Qualification: Debug

canton.mediator.sequencer-client.handler.actual-in-flight-event-batches

  • Summary: Nodes process the events from the domain’s sequencer in batches. This metric tracks how many such batches are processed in parallel.
  • Description: Incoming messages are processed by a sequencer client, which combines them into batches of size up to ‘event-inbox-size’ before sending them to an application handler for processing. Depending on the system’s configuration, the rate at which event batches are sent to the handler may be throttled to avoid overwhelming it with too many events at once. Indicators that the configured upper bound may be too low: This metric is constantly close to the configured maximum, which is exposed via ‘max-in-flight-event-batches’, while the system’s resources are under-utilized. Indicators that the configured upper bound may be too high: Out-of-memory errors crashing the JVM or frequent garbage collection cycles that slow down processing. The metric tracks how many of these batches have been sent to the application handler but have not yet been fully processed. This metric can help identify potential bottlenecks or issues with the application’s processing of events and provide insights into the overall workload of the system.
  • Type: Counter
  • Qualification: Saturation

canton.mediator.sequencer-client.handler.max-in-flight-event-batches

  • Summary: Nodes process the events from the domain’s sequencer in batches. This metric tracks the upper bound of such batches being processed in parallel.
  • Description: Incoming messages are processed by a sequencer client, which combines them into batches of size up to ‘event-inbox-size’ before sending them to an application handler for processing. Depending on the system’s configuration, the rate at which event batches are sent to the handler may be throttled to avoid overwhelming it with too many events at once. Configured by the ‘maximum-in-flight-event-batches’ parameter in the sequencer-client config. The metric shows the configured upper limit on how many batches the application handler may process concurrently. The metric ‘actual-in-flight-event-batches’ tracks the actual number of currently processed batches.
  • Type: Gauge
  • Qualification: Saturation
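
The two in-flight batch metrics are meant to be read together. The following is a minimal sketch of the "upper bound may be too low" indicator: it checks whether sampled values of actual-in-flight-event-batches stay close to the configured maximum. The function name and the 0.9 threshold are hypothetical, and the check covers only half of the indicator, since you still need to confirm separately that system resources are under-utilized.

```python
from typing import Sequence


def constantly_near_maximum(samples: Sequence[int], maximum: int, threshold: float = 0.9) -> bool:
    """True if every sample is at or above threshold * maximum.

    The threshold is an illustrative value, not a recommended setting.
    """
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return bool(samples) and all(s >= threshold * maximum for s in samples)


# Hypothetical samples of actual-in-flight-event-batches with a maximum of 10.
print(constantly_near_maximum([9, 10, 10], maximum=10))
print(constantly_near_maximum([2, 10, 9], maximum=10))
```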

canton.mediator.sequencer-client.submissions.dropped

  • Summary: Count of send requests that did not cause an event to be sequenced
  • Description: Counter of send requests for which we did not witness a corresponding event being sequenced by the supplied max-sequencing-time. There could be many reasons for this happening: the request may have been lost before reaching the sequencer, the sequencer may be at capacity and the max-sequencing-time was exceeded by the time the request was processed, or the supplied max-sequencing-time may just be too small for the sequencer to be able to sequence the request.
  • Type: Counter
  • Qualification: Debug

canton.mediator.sequencer-client.submissions.in-flight

  • Summary: Number of sequencer send requests we have that are waiting for an outcome or timeout
  • Description: Incremented on every successful send to the sequencer. Decremented when the event or an error is sequenced, or when the max-sequencing-time has elapsed.
  • Type: Counter
  • Qualification: Debug

canton.mediator.sequencer-client.submissions.overloaded

  • Summary: Count of send requests which receive an overloaded response
  • Description: Counter that is incremented if a send request receives an overloaded response from the sequencer.
  • Type: Counter
  • Qualification: Debug
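
Counters such as ‘submissions.dropped’ and ‘submissions.overloaded’ only ever grow while the process is running, so they are usually read as an increase over a time window rather than as an absolute value. The following is a minimal sketch of that calculation. The function names are hypothetical, and the reset handling (treating a lower value as a restart from zero) is an assumption about how the samples are collected.

```python
def counter_increase(previous: float, current: float) -> float:
    """Increase of a monotonic counter between two samples.

    If the current value is lower than the previous one, the process is
    assumed to have restarted and the counter to have begun again at zero.
    """
    if current < previous:
        return current
    return current - previous


def per_second_rate(previous: float, current: float, interval_seconds: float) -> float:
    """Average per-second increase over the sampling interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return counter_increase(previous, current) / interval_seconds
```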

canton.mediator.sequencer-client.submissions.sends

  • Summary: Rate and timings of send requests to the sequencer
  • Description: Provides a rate and time of how long it takes for send requests to be accepted by the sequencer. Note that this is just for the request to be made and not for the requested event to actually be sequenced.
  • Type: Timer
  • Qualification: Debug

canton.mediator.sequencer-client.submissions.sequencing

  • Summary: Rate and timings of sequencing requests
  • Description: This timer is started when a submission is made to the sequencer and completed when a corresponding event is witnessed from the sequencer, so it encompasses the entire duration for the sequencer to sequence the request. If the request does not result in an event, no timing is recorded.
  • Type: Timer
  • Qualification: Debug

canton.sequencer.db-storage.<storage>

  • Summary: Timer monitoring duration and rate of accessing the given storage
  • Description: Covers both reads from and writes to the storage.
  • Type: Timer
  • Qualification: Debug

canton.sequencer.db-storage.<storage>.load

  • Summary: The load on the given storage
  • Description: The load is a factor between 0 and 1 describing how much of an existing interval has been spent reading from or writing to the storage.
  • Type: Gauge
  • Qualification: Debug
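
To make the definition of the load factor concrete, the sketch below derives a value between 0 and 1 from the time spent on storage access within an interval. How Canton measures the busy time internally is not described here, so the inputs and the function name are assumptions for illustration only.

```python
def storage_load(busy_seconds: float, interval_seconds: float) -> float:
    """Fraction of the interval spent reading from or writing to storage.

    The result is clamped to the range [0, 1].
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return min(1.0, max(0.0, busy_seconds / interval_seconds))
```

For example, 2.5 seconds of storage access within a 10-second interval corresponds to a load of 0.25.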

canton.sequencer.db-storage.alerts.multi-domain-event-log

  • Summary: Number of failed writes to the multi-domain event log
  • Description: Failed writes to the multi-domain event log indicate an issue requiring user intervention. In the case of domain event logs, the corresponding domain no longer emits any subsequent events until domain recovery is initiated (e.g. by disconnecting and reconnecting the participant from the domain). In the case of the participant event log, an operation might need to be reissued. If this counter is larger than zero, check the Canton log for error details.
  • Type: Counter
  • Qualification: Debug

canton.sequencer.db-storage.alerts.single-dimension-event-log

  • Summary: Number of failed writes to the event log
  • Description: Failed writes to the single dimension event log indicate an issue requiring user intervention. In the case of domain event logs, the corresponding domain no longer emits any subsequent events until domain recovery is initiated (e.g. by disconnecting and reconnecting the participant from the domain). In the case of the participant event log, an operation might need to be reissued. If this counter is larger than zero, check the Canton log for error details.
  • Type: Counter
  • Qualification: Debug

canton.sequencer.db-storage.general.executor.queued

  • Summary: Number of database access tasks waiting in queue
  • Description: Database access tasks get scheduled in this queue and get executed using one of the existing asynchronous sessions. A large queue indicates that the database connection is not able to deal with the large number of requests. Note that the queue has a maximum size. Tasks that do not fit into the queue will be retried, but won’t show up in this metric.
  • Type: Counter
  • Qualification: Debug

canton.sequencer.db-storage.general.executor.running

  • Summary: Number of database access tasks currently running
  • Description: Database access tasks run on an async executor. This metric shows the current number of tasks running in parallel.
  • Type: Counter
  • Qualification: Debug

canton.sequencer.db-storage.general.executor.waittime

  • Summary: Scheduling time metric for database tasks
  • Description: Every database query is scheduled using an asynchronous executor with a queue. The time a task is waiting in this queue is monitored using this metric.
  • Type: Timer
  • Qualification: Debug

canton.sequencer.db-storage.locks.executor.queued

  • Summary: Number of database access tasks waiting in queue
  • Description: Database access tasks get scheduled in this queue and get executed using one of the existing asynchronous sessions. A large queue indicates that the database connection is not able to deal with the large number of requests. Note that the queue has a maximum size. Tasks that do not fit into the queue will be retried, but won’t show up in this metric.
  • Type: Counter
  • Qualification: Debug

canton.sequencer.db-storage.locks.executor.running

  • Summary: Number of database access tasks currently running
  • Description: Database access tasks run on an async executor. This metric shows the current number of tasks running in parallel.
  • Type: Counter
  • Qualification: Debug

canton.sequencer.db-storage.locks.executor.waittime

  • Summary: Scheduling time metric for database tasks
  • Description: Every database query is scheduled using an asynchronous executor with a queue. The time a task is waiting in this queue is monitored using this metric.
  • Type: Timer
  • Qualification: Debug

canton.sequencer.db-storage.write.executor.queued

  • Summary: Number of database access tasks waiting in queue
  • Description: Database access tasks get scheduled in this queue and get executed using one of the existing asynchronous sessions. A large queue indicates that the database connection is not able to deal with the large number of requests. Note that the queue has a maximum size. Tasks that do not fit into the queue will be retried, but won’t show up in this metric.
  • Type: Counter
  • Qualification: Debug

canton.sequencer.db-storage.write.executor.running

  • Summary: Number of database access tasks currently running
  • Description: Database access tasks run on an async executor. This metric shows the current number of tasks running in parallel.
  • Type: Counter
  • Qualification: Debug

canton.sequencer.db-storage.write.executor.waittime

  • Summary: Scheduling time metric for database tasks
  • Description: Every database query is scheduled using an asynchronous executor with a queue. The time a task is waiting in this queue is monitored using this metric.
  • Type: Timer
  • Qualification: Debug

canton.sequencer.max-event-age

  • Summary: Age of oldest unpruned sequencer event.
  • Description: This gauge exposes the age of the oldest, unpruned sequencer event in hours as a way to quantify the pruning backlog.
  • Type: Gauge
  • Qualification: Debug

canton.sequencer.processed

  • Summary: Number of messages processed by the sequencer
  • Description: This metric measures the number of successfully validated messages processed by the sequencer since the start of this process.
  • Type: Meter
  • Qualification: Debug

canton.sequencer.processed-bytes

  • Summary: Number of message bytes processed by the sequencer
  • Description: This metric measures the total number of message bytes processed by the sequencer. If the message received by the sequencer contains duplicate or irrelevant fields, the contents of these fields do not contribute to this metric.
  • Type: Meter
  • Qualification: Debug

canton.sequencer.sequencer-client.handler.actual-in-flight-event-batches

  • Summary: Nodes process the events from the domain’s sequencer in batches. This metric tracks how many such batches are processed in parallel.
  • Description: Incoming messages are processed by a sequencer client, which combines them into batches of size up to ‘event-inbox-size’ before sending them to an application handler for processing. Depending on the system’s configuration, the rate at which event batches are sent to the handler may be throttled to avoid overwhelming it with too many events at once. Indicators that the configured upper bound may be too low: This metric is constantly close to the configured maximum, which is exposed via ‘max-in-flight-event-batches’, while the system’s resources are under-utilized. Indicators that the configured upper bound may be too high: Out-of-memory errors crashing the JVM or frequent garbage collection cycles that slow down processing. The metric tracks how many of these batches have been sent to the application handler but have not yet been fully processed. This metric can help identify potential bottlenecks or issues with the application’s processing of events and provide insights into the overall workload of the system.
  • Type: Counter
  • Qualification: Saturation

canton.sequencer.sequencer-client.handler.max-in-flight-event-batches

  • Summary: Nodes process the events from the domain’s sequencer in batches. This metric tracks the upper bound of such batches being processed in parallel.
  • Description: Incoming messages are processed by a sequencer client, which combines them into batches of size up to ‘event-inbox-size’ before sending them to an application handler for processing. Depending on the system’s configuration, the rate at which event batches are sent to the handler may be throttled to avoid overwhelming it with too many events at once. Configured by the ‘maximum-in-flight-event-batches’ parameter in the sequencer-client config. The metric shows the configured upper limit on how many batches the application handler may process concurrently. The metric ‘actual-in-flight-event-batches’ tracks the actual number of currently processed batches.
  • Type: Gauge
  • Qualification: Saturation

canton.sequencer.sequencer-client.submissions.dropped

  • Summary: Count of send requests that did not cause an event to be sequenced
  • Description: Counter of send requests for which we did not witness a corresponding event being sequenced by the supplied max-sequencing-time. There could be many reasons for this happening: the request may have been lost before reaching the sequencer, the sequencer may be at capacity and the max-sequencing-time was exceeded by the time the request was processed, or the supplied max-sequencing-time may just be too small for the sequencer to be able to sequence the request.
  • Type: Counter
  • Qualification: Debug

canton.sequencer.sequencer-client.submissions.in-flight

  • Summary: Number of sequencer send requests we have that are waiting for an outcome or timeout
  • Description: Incremented on every successful send to the sequencer. Decremented when the event or an error is sequenced, or when the max-sequencing-time has elapsed.
  • Type: Counter
  • Qualification: Debug

canton.sequencer.sequencer-client.submissions.overloaded

  • Summary: Count of send requests which receive an overloaded response
  • Description: Counter that is incremented if a send request receives an overloaded response from the sequencer.
  • Type: Counter
  • Qualification: Debug

canton.sequencer.sequencer-client.submissions.sends

  • Summary: Rate and timings of send requests to the sequencer
  • Description: Provides a rate and time of how long it takes for send requests to be accepted by the sequencer. Note that this is just for the request to be made and not for the requested event to actually be sequenced.
  • Type: Timer
  • Qualification: Debug

canton.sequencer.sequencer-client.submissions.sequencing

  • Summary: Rate and timings of sequencing requests
  • Description: This timer is started when a submission is made to the sequencer and completed when a corresponding event is witnessed from the sequencer, so it encompasses the entire duration for the sequencer to sequence the request. If the request does not result in an event, no timing is recorded.
  • Type: Timer
  • Qualification: Debug

canton.sequencer.subscriptions

  • Summary: Number of active sequencer subscriptions
  • Description: This metric indicates the number of subscriptions that are currently open and actively served at the sequencer.
  • Type: Gauge
  • Qualification: Debug

canton.sequencer.time-requests

  • Summary: Number of time requests received by the sequencer
  • Description: When a Participant needs to know the domain time it will make a request for a time proof to be sequenced. It would be normal to see a small number of these being sequenced, however if this number becomes a significant portion of the total requests to the sequencer it could indicate that the strategy for requesting times may need to be revised to deal with different clock skews and latencies between the sequencer and participants.
  • Type: Meter
  • Qualification: Debug

canton.sequencer.traffic-control.event-delivered-cost

  • Summary: Cost of delivered event.
  • Description: Cost of an event that was delivered.
  • Type: Meter
  • Qualification: Traffic

canton.sequencer.traffic-control.event-received-size

  • Summary: Raw size of an event received in the sequencer.
  • Description: This is the raw payload size of an event, on the write path. Final event cost calculation.
  • Type: Meter
  • Qualification: Traffic

canton.sequencer.traffic-control.event-rejected-cost

  • Summary: Cost of rejected event.
  • Description: Cost of an event that was rejected because it exceeded the sender’s traffic limit.
  • Type: Meter
  • Qualification: Traffic

canton.topology-manager.sequencer-client.handler.actual-in-flight-event-batches

  • Summary: Nodes process the events from the domain’s sequencer in batches. This metric tracks how many such batches are processed in parallel.
  • Description: Incoming messages are processed by a sequencer client, which combines them into batches of size up to ‘event-inbox-size’ before sending them to an application handler for processing. Depending on the system’s configuration, the rate at which event batches are sent to the handler may be throttled to avoid overwhelming it with too many events at once. Indicators that the configured upper bound may be too low: This metric is constantly close to the configured maximum, which is exposed via ‘max-in-flight-event-batches’, while the system’s resources are under-utilized. Indicators that the configured upper bound may be too high: Out-of-memory errors crashing the JVM or frequent garbage collection cycles that slow down processing. The metric tracks how many of these batches have been sent to the application handler but have not yet been fully processed. This metric can help identify potential bottlenecks or issues with the application’s processing of events and provide insights into the overall workload of the system.
  • Type: Counter
  • Qualification: Saturation

canton.topology-manager.sequencer-client.handler.max-in-flight-event-batches

  • Summary: Nodes process the events from the domain’s sequencer in batches. This metric tracks the upper bound of such batches being processed in parallel.
  • Description: Incoming messages are processed by a sequencer client, which combines them into batches of size up to ‘event-inbox-size’ before sending them to an application handler for processing. Depending on the system’s configuration, the rate at which event batches are sent to the handler may be throttled to avoid overwhelming it with too many events at once. Configured by the ‘maximum-in-flight-event-batches’ parameter in the sequencer-client config. The metric shows the configured upper limit on how many batches the application handler may process concurrently. The metric ‘actual-in-flight-event-batches’ tracks the actual number of currently processed batches.
  • Type: Gauge
  • Qualification: Saturation

canton.topology-manager.sequencer-client.submissions.dropped

  • Summary: Count of send requests that did not cause an event to be sequenced
  • Description: Counter of send requests for which we did not witness a corresponding event being sequenced by the supplied max-sequencing-time. There could be many reasons for this happening: the request may have been lost before reaching the sequencer, the sequencer may be at capacity and the max-sequencing-time was exceeded by the time the request was processed, or the supplied max-sequencing-time may just be too small for the sequencer to be able to sequence the request.
  • Type: Counter
  • Qualification: Debug

canton.topology-manager.sequencer-client.submissions.in-flight

  • Summary: Number of sequencer send requests we have that are waiting for an outcome or timeout
  • Description: Incremented on every successful send to the sequencer. Decremented when the event or an error is sequenced, or when the max-sequencing-time has elapsed.
  • Type: Counter
  • Qualification: Debug

canton.topology-manager.sequencer-client.submissions.overloaded

  • Summary: Count of send requests which receive an overloaded response
  • Description: Counter that is incremented if a send request receives an overloaded response from the sequencer.
  • Type: Counter
  • Qualification: Debug

canton.topology-manager.sequencer-client.submissions.sends

  • Summary: Rate and timings of send requests to the sequencer
  • Description: Provides a rate and time of how long it takes for send requests to be accepted by the sequencer. Note that this is just for the request to be made and not for the requested event to actually be sequenced.
  • Type: Timer
  • Qualification: Debug

canton.topology-manager.sequencer-client.submissions.sequencing

  • Summary: Rate and timings of sequencing requests
  • Description: This timer is started when a submission is made to the sequencer and completed when a corresponding event is witnessed from the sequencer, so it encompasses the entire duration for the sequencer to sequence the request. If the request does not result in an event, no timing is recorded.
  • Type: Timer
  • Qualification: Debug

Health Metrics

The following metrics are exposed for all components.

daml_health_status

  • Description: The status of the component
  • Values:
    • 0: Not healthy
    • 1: Healthy
  • Labels:
    • component: the name of the component being monitored
  • Type: Gauge
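
The following sketch shows one way to pick out unhealthy components from a metrics scrape in the Prometheus text format. It is a simplified illustration: the sample text and its component names are made up, and the parser only handles the plain label syntax shown in the sample.

```python
import re

# Matches lines such as: daml_health_status{component="sequencer"} 1.0
_HEALTH_LINE = re.compile(
    r'^daml_health_status\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
_COMPONENT = re.compile(r'component="(?P<name>[^"]*)"')


def unhealthy_components(exposition_text: str) -> list[str]:
    """Return the components whose daml_health_status gauge is not 1."""
    result = []
    for line in exposition_text.splitlines():
        match = _HEALTH_LINE.match(line.strip())
        if not match:
            continue  # comments, blank lines, and other metrics
        component = _COMPONENT.search(match.group("labels"))
        if component and float(match.group("value")) != 1.0:
            result.append(component.group("name"))
    return result


# Hypothetical scrape output; the component names are invented for the example.
SAMPLE = """\
# TYPE daml_health_status gauge
daml_health_status{component="sequencer"} 1.0
daml_health_status{component="db-storage"} 0.0
"""

print(unhealthy_components(SAMPLE))
```

In a real deployment, the same check is more commonly expressed as an alerting rule in the monitoring system than as a script.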

gRPC Metrics

The following metrics are exposed for all gRPC endpoints. These metrics have the following common labels attached:

  • grpc_service_name:
    fully qualified name of the gRPC service (e.g. com.daml.ledger.api.v1.ActiveContractsService)
  • grpc_method_name:
    name of the gRPC method (e.g. GetActiveContracts)
  • grpc_client_type:
    type of client connection (unary or streaming)
  • grpc_server_type:
    type of server connection (unary or streaming)
  • service:
    Canton service’s name (e.g. participant, sequencer, etc.)

daml_grpc_server_duration_seconds

  • Description: Distribution of the durations of serving gRPC requests
  • Type: Histogram
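
A histogram is exported as cumulative bucket counts, so latency percentiles have to be estimated from the buckets. The sketch below uses linear interpolation inside the matching bucket, similar in spirit to the histogram_quantile function in Prometheus. The bucket boundaries in the usage note are hypothetical, and the lower bound of the first bucket is assumed to be 0.

```python
def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    The estimate interpolates linearly inside the bucket containing the rank.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                # Cannot interpolate into the open-ended bucket.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound
```

With buckets [(0.1, 50), (0.5, 90), (1.0, 100), (inf, 100)], the estimated 95th percentile falls halfway into the 0.5 to 1.0 second bucket. The accuracy of such an estimate depends on how finely the buckets are spaced around the value of interest.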

daml_grpc_server_messages_sent_total

  • Description: Total number of gRPC messages sent (on either type of connection)
  • Type: Counter

daml_grpc_server_messages_received_total

  • Description: Total number of gRPC messages received (on either type of connection)
  • Type: Counter

daml_grpc_server_started_total

  • Description: Total number of started gRPC requests (on either type of connection)
  • Type: Counter

daml_grpc_server_handled_total

  • Description: Total number of handled gRPC requests
  • Labels:
    • grpc_code: returned gRPC status code for the call (OK, CANCELLED, INVALID_ARGUMENT, etc.)
  • Type: Counter
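
Because daml_grpc_server_handled_total carries the grpc_code label, an error ratio can be derived from it. The sketch below assumes that the per-code values are counter increases over the same time window and treats every code other than OK as an error. Whether codes such as CANCELLED should count as errors is a policy decision, so this is only one possible definition.

```python
def grpc_error_ratio(handled_by_code: dict[str, float]) -> float:
    """Fraction of handled requests whose gRPC status code is not OK."""
    total = sum(handled_by_code.values())
    if total == 0:
        return 0.0  # no traffic in the window
    errors = total - handled_by_code.get("OK", 0.0)
    return errors / total
```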

daml_grpc_server_messages_sent_bytes

  • Description: Distribution of payload sizes in gRPC messages sent (both unary and streaming)
  • Type: Histogram

daml_grpc_server_messages_received_bytes

  • Description: Distribution of payload sizes in gRPC messages received (both unary and streaming)
  • Type: Histogram

HTTP Metrics

The following metrics are exposed for all HTTP endpoints. These metrics have the following common labels attached:

  • http_verb:
    HTTP verb used for a given call (e.g. GET or PUT)
  • host:
    fully qualified hostname of the HTTP endpoint (e.g. example.com)
  • path:
    path of the HTTP endpoint (e.g. /v1/parties/create)
  • service:
    Daml service’s name (json_api for the HTTP JSON API Service)

daml_http_requests_duration_seconds

  • Description: Distribution of the durations of serving HTTP requests
  • Type: Histogram

daml_http_requests_total

  • Description: Total number of HTTP requests completed
  • Labels:
  • Type: Counter

daml_http_websocket_messages_received_total

  • Description: Total number of WebSocket messages received
  • Type: Counter

daml_http_websocket_messages_sent_total

  • Description: Total number of WebSocket messages sent
  • Type: Counter

daml_http_requests_payload_bytes

  • Description: Distribution of payload sizes in HTTP requests received
  • Type: Histogram

daml_http_responses_payload_bytes

  • Description: Distribution of payload sizes in HTTP responses sent
  • Type: Histogram

daml_http_websocket_messages_received_bytes

  • Description: Distribution of payload sizes in WebSocket messages received
  • Type: Histogram

daml_http_websocket_messages_sent_bytes

  • Description: Distribution of payload sizes in WebSocket messages sent
  • Type: Histogram

Pruning Metrics

The following metrics are exposed for all pruning processes. These metrics have the following labels:

  • phase:
    The name of the pruning phase being monitored

daml_services_pruning_prune_started_total

  • Description: Total number of started pruning processes
  • Type: Counter

daml_services_pruning_prune_completed_total

  • Description: Total number of completed pruning processes
  • Type: Counter
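
One possible use of the two pruning counters is to compare them per phase: a phase whose started count stays ahead of its completed count has pruning runs that are still in progress or that did not finish. The sketch below assumes the counter values have already been grouped by the phase label. The phase names used in any example input are hypothetical.

```python
def unfinished_prunes(
    started_total: dict[str, float], completed_total: dict[str, float]
) -> dict[str, float]:
    """Per phase, the number of pruning runs started but not completed."""
    result = {}
    for phase, started in started_total.items():
        pending = started - completed_total.get(phase, 0.0)
        if pending > 0:
            result[phase] = pending
    return result
```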

JVM Metrics

The following metrics are exposed for the JVM, if enabled.

runtime_jvm_gc_time

  • Description: Time spent in a given JVM garbage collector in milliseconds
  • Labels:
    • gc: Garbage collector regions (e.g. G1 Old Generation, G1 New Generation)
  • Type: Counter

runtime_jvm_gc_count

  • Description: The number of collections that have occurred for a given JVM garbage collector
  • Labels:
    • gc: Garbage collector regions (e.g. G1 Old Generation, G1 New Generation)
  • Type: Counter

runtime_jvm_memory_area

  • Description: JVM memory area statistics
  • Labels:
    • area: Can be heap or non_heap
    • type: Can be committed, used or max

runtime_jvm_memory_pool

  • Description: JVM memory pool statistics
  • Labels:
    • pool: Defined pool name.
    • type: Can be committed, used or max

Logging

Canton uses Logback as the logging library. All Canton logs derive from the logger com.digitalasset.canton. By default, Canton will write a log to the file log/canton.log using the INFO log-level and will also log WARN and ERROR to stdout.

How Canton produces log files can be configured extensively on the command line using the following options:

  • -v (or --verbose) is a short option to set the Canton log level to DEBUG. This is likely the most common log option you will use.
  • --debug sets all log levels except stdout to DEBUG. Stdout is set to INFO. Note that DEBUG logs of external libraries can be very noisy.
  • --log-level-root=<level> configures the log-level of the root logger. This changes the log level of Canton and of external libraries, but not of stdout.
  • --log-level-canton=<level> configures the log-level of only the Canton logger.
  • --log-level-stdout=<level> configures the log-level of stdout. This will usually be the text displayed in the Canton console.
  • --log-file-name=log/canton.log configures the location of the log file.
  • --log-file-appender=flat|rolling|off configures if and how logging to a file should be done. The rolling appender will roll the files according to the defined date-time pattern.
  • --log-file-rolling-history=12 configures the number of historical files to keep when using the rolling appender.
  • --log-file-rolling-pattern=YYYY-mm-dd configures the rolling file suffix (and therefore the frequency) of how files should be rolled.
  • --log-truncate configures whether the log file should be truncated on startup.
  • --log-profile=container provides a default set of logging settings for a particular setup. Only the container profile is supported, which logs to STDOUT. It turns off flat file logging to avoid storage leaks due to log files within a container.
  • --log-immediate-flush=false turns off immediate flushing of the log output to the log file.

Note that if you use --log-profile, the order of the command line arguments matters. The profile settings can be overridden on the command line by placing adjustments after the profile has been selected.

Canton supports the normal log4j logging levels: TRACE, DEBUG, INFO, WARN, and ERROR.

For further customization, a custom logback configuration can be provided using JAVA_OPTS.

JAVA_OPTS="-Dlogback.configurationFile=./path-to-file.xml" ./bin/canton --config ...

If you use a custom logback configuration file, the command line arguments for logging will not have any effect, except that --log-level-canton and --log-level-root can still be used to adjust the log levels of the Canton and root loggers.

Viewing Logs

A log file viewer such as lnav is recommended to view Canton logs and resolve issues. Among other features, lnav has automatic syntax highlighting, convenient filtering for specific log messages, and the ability to view log files of different Canton components in a single view. This makes viewing logs and resolving issues more efficient than using standard UNIX tools such as less or grep.

The following features are especially useful when using lnav:

  • Viewing log files of different Canton components in a single view, merged according to timestamps (lnav <log1> <log2> ...).
  • Filtering specific log messages in (:filter-in <regex>) or out (:filter-out <regex>). When filtering messages (for example, with a given trace-id), a transaction can be traced across different components, especially when using the single-view-feature described earlier.
  • Searching for specific log messages (/<regex>) and jumping between them (n and N).
  • Automatic syntax highlighting of parts of log messages (such as timestamps) and log messages themselves (for example, WARN log messages are yellow).
  • Jumping between error (e and E) and warn messages (w and W).
  • Selectively activating and deactivating different filters and files (TAB and SPACE to activate/deactivate a filter).
  • Marking lines (m) and jumping back and forth between marked lines (u and U).
  • Jumping back and forth between lines that have the same trace-id (o and O).

The custom lnav log format file for Canton logs canton.lnav.json is bundled in any Canton release. You can install it with lnav -i canton.lnav.json. JSON-based log files (which need to use the file suffix .clog) can be viewed using the canton-json.lnav.json format file.
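
For environments where lnav is not available, the merged view and the filter-in behavior described above can be approximated with a short script. This is a sketch under stated assumptions: every log line is assumed to start with a fixed-width, sortable timestamp of 23 characters, multi-line entries such as stack traces are not handled, and the tid:abc marker in the sample lines is an invented stand-in for however the trace-id appears in your log pattern.

```python
import heapq
import re
from typing import Iterable, Iterator


def merge_logs(*logs: Iterable[str]) -> Iterator[str]:
    """Merge already time-ordered log streams into one time-ordered stream.

    Assumes each line starts with a timestamp such as
    "2024-01-01 12:00:00,123" (23 characters) that sorts lexicographically.
    """
    return heapq.merge(*logs, key=lambda line: line[:23])


def filter_in(lines: Iterable[str], pattern: str) -> list[str]:
    """Keep only the lines matching `pattern`."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


# Hypothetical log lines from two components.
participant_log = [
    "2024-01-01 12:00:00,100 INFO tid:abc submit",
    "2024-01-01 12:00:00,300 INFO tid:abc done",
]
sequencer_log = [
    "2024-01-01 12:00:00,200 INFO tid:abc sequenced",
    "2024-01-01 12:00:00,250 INFO tid:xyz other",
]

traced = filter_in(merge_logs(participant_log, sequencer_log), r"tid:abc")
for line in traced:
    print(line)
```

The result interleaves the matching lines from both components in timestamp order, which is the same idea as following one trace-id across components in the single view.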

Detailed Logging

By default, logging omits details to avoid writing sensitive data into log files. For debugging or educational purposes, you can turn on additional logging using the following configuration switches:

canton.monitoring.logging {
    event-details = true
    api {
        message-payloads = true
        max-method-length = 1000
        max-message-lines = 10000
        max-string-length = 10000
        max-metadata-size = 10000
    }
}

This turns on payload logging in the ApiRequestLogger, which records every gRPC API invocation, and turns on detailed logging of the SequencerClient and the transaction trees. Note that all additional events are logged at DEBUG level.

Note

Detailed event logging happens within a gRPC API interceptor. This creates a sequential bottleneck, as every message that is sent or received is translated into a pretty-printed string. You will not be able to achieve the same performance while this setting is turned on.

Tracing

For further debugging, Canton provides a trace-id which allows you to trace the processing of requests through the system. The trace-id is exposed to logback through the mapped diagnostic context (MDC) and can be included in the logback output pattern using %mdc{trace-id}.

The trace-id propagation is enabled by setting the canton.monitoring.tracing.propagation = enabled configuration option, which is enabled by default.

You can configure the service where traces and spans are reported for observing distributed traces. Refer to Traces for a preview.

Jaeger and Zipkin are supported. For example, Jaeger reporting can be configured as follows:

monitoring.tracing.tracer.exporter {
  type = jaeger
  address = ... // default: "localhost"
  port = ... // default: 14250
}

This configuration connects to a running Jaeger server to report tracing information.

You can run Jaeger in a Docker container as follows:

docker run --rm -it --name jaeger \
  -p 16686:16686 \
  -p 14250:14250 \
  jaegertracing/all-in-one:1.22.0

If you prefer not to use Docker, you can download the binary for your specific OS at Download Jaeger. Unzip the file and then run the binary jaeger-all-in-one (no arguments are needed). By default, Jaeger will expose port 16686 (for its UI, which can be seen in a browser window) and port 14250 (to which Canton will report trace information). Be sure to properly expose these ports.

Make sure that all Canton nodes in the network report to the same Jaeger server to have an accurate view of the full traces. Also, ensure that the Jaeger server is reachable by all Canton nodes.

Apart from Jaeger, Canton nodes can also be configured to report in Zipkin or OTLP formats.

Sampling

You can change how often spans are sampled and reported to the configured exporter. By default, it will always report (monitoring.tracing.tracer.sampler.type = always-on). You can configure it to never report (monitoring.tracing.tracer.sampler.type = always-off), although this is less useful. Also, you can configure only a specific fraction of spans to be reported as follows:

monitoring.tracing.tracer.sampler = {
  type = trace-id-ratio
  ratio = 0.5
}

You can also change the parent-based sampling property. By default, it is turned on (monitoring.tracing.tracer.sampler.parent-based = true). When turned on, a span is sampled if and only if its parent is sampled (the root span follows the configured sampling strategy). As a result, there are never incomplete traces: either the full trace is sampled or none of it is. If you turn this property off, all spans follow the configured sampling strategy and ignore whether the parent is sampled.
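
The interaction between ratio sampling and the parent-based property can be sketched as follows. This is an illustration of the general idea behind trace-id-ratio and parent-based samplers, not Canton's implementation: the function names are hypothetical, and the decision rule (comparing the lower 64 bits of the trace id against a bound derived from the ratio) is an assumption made for the example.

```python
from typing import Optional


def ratio_sampled(trace_id: int, ratio: float) -> bool:
    """Deterministic decision based on the lower 64 bits of the trace id.

    The same trace id always yields the same decision, so all nodes that
    see the trace agree without coordinating.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bound = round(ratio * (1 << 64))
    return (trace_id & 0xFFFFFFFFFFFFFFFF) < bound


def should_sample(
    trace_id: int,
    ratio: float,
    parent_sampled: Optional[bool] = None,
    parent_based: bool = True,
) -> bool:
    """Follow the parent's decision when parent-based sampling is on.

    Root spans (parent_sampled is None) always use the ratio strategy.
    """
    if parent_based and parent_sampled is not None:
        return parent_sampled
    return ratio_sampled(trace_id, ratio)
```

With parent_based left on, a child span of a sampled parent is always sampled regardless of the ratio, which is what keeps traces complete.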

Known Limitations

Not every trace that can be observed in the logs is reported to the configured trace collector service. Traces that originate from console commands or that are part of the transaction protocol are largely reported, while other types of traces are added to the set of reported traces as the need arises.

Also, the transaction protocol trace has a known limitation: once a command is submitted and its trace is fully reported, a new trace is created for any resulting Daml events that are processed. This occurs because the ledger API does not propagate any trace context information from the command submission to the transaction subscription. As an example, when a participant creates a Ping contract, you can see the full transaction processing trace of the Ping command being submitted. However, a participant that processes the Ping by exercising Respond and creating a Pong contract creates a separate trace instead of using the same one.

This differs from a situation where a single Daml transaction results in multiple actions at the same time, such as archiving and creating multiple contracts. In that case, a single trace encompasses the entire process, since it occurs as part of a single transaction rather than the result of an external process reacting to Daml events.

Traces

Traces contain operations that are each represented by a span. A trace is a directed acyclic graph (DAG) of spans, where the edges between spans are defined as parent/child relationships (the definitions come from the OpenTelemetry glossary).

Canton reports several types of traces. One example: every Canton console command that interacts with the Admin API starts a trace whose initial span lasts for the entire duration of the command, including the gRPC call to the specific Admin API endpoint.

Graph of a Canton ping trace containing 18 spans

Traces of Daml command submissions are important. The trace illustrated in the figure results when you perform a Canton ping using the console. The ping is a smoke test that sends a Daml transaction (create Ping, exercise choice Pong, exercise choice Archive) to test a connection. It uses a particular smart contract that is preinstalled on every Canton participant. The command uses the Admin API to access a preinstalled application, which then issues Ledger API commands operating on this smart contract. In this example, the trace contains 18 spans. The ping is started by participant1, and participant2 is the target. The trace focuses on the message exchange through the sequencer without digging deep into the message handlers or further processing of transactions.

In some cases, spans may start later than the end of their parents, due to asynchronous processing. This typically occurs when a new operation is placed on a queue to be handled later, which immediately frees the parent span and ends it.
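To make the parent/child structure and the effect of asynchronous processing concrete, here is a minimal Python sketch. The Span class, its field names, and the timings are all hypothetical and are not taken from a real Canton trace. For simplicity, the sketch assumes that each span has at most one parent.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    span_id: int
    parent_id: Optional[int]  # None for the root span
    start: float              # illustrative timestamps, in seconds
    end: float


def starts_after_parent_ended(spans: List[Span]) -> List[int]:
    """Return the ids of spans that start only after their parent has ended."""
    by_id: Dict[int, Span] = {s.span_id: s for s in spans}
    late = []
    for span in spans:
        if span.parent_id is None:
            continue  # the root span has no parent to compare against
        parent = by_id.get(span.parent_id)
        if parent is not None and span.start > parent.end:
            late.append(span.span_id)
    return late


# Hypothetical timings chosen to illustrate queued work.
example = [
    Span(1, None, 0.0, 10.0),  # root span covering the whole operation
    Span(2, 1, 0.1, 0.5),      # places work on a queue and ends immediately
    Span(3, 2, 0.8, 2.0),      # handled later, after span 2 has ended
]
print(starts_after_parent_ended(example))  # prints [3]
```

A span reported by such a check is not a sign of a broken trace. It only indicates that the child operation was handled after the parent had already handed it off.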

The initial span (span 1) covers the duration of the ping operation. In span 2, the GrpcPingService in the participant node handles a gRPC request made by the console. It also lasts for the duration of the ping operation.

The Canton ping consists of three Daml commands:

  1. The admin party for participant1 creates a Ping contract.
  2. The admin party for participant2 exercises the Respond consuming choice on the contract, which results in the creation of a Pong contract.
  3. The admin party for participant1 exercises the Ack consuming choice on the Pong contract.

The submission of the first of the three Daml commands (the creation of the Ping contract) starts at span 3 in the example trace. Due to a limitation explained in the Known Limitations section, the other two Daml command submissions are not linked to this trace, although you can find them separately. In any case, span 2 completes only once all three Daml commands are completed.

At span 3, the participant node is on the client side of the Ledger API. In other use cases, it could be an application integrated with the participant. This span lasts for the duration of the gRPC call, which is received on the server side in span 4 and handled by the CantonSyncService in span 5. The request is then received and acknowledged, but not fully processed. It is processed asynchronously later, which means that spans 3 through 5 complete before the request is handled.

The following steps are missing from the trace and account for part of the gap between spans 5 and 6:

  • The domain routing where the participant decides which domain to use for the command submission.
  • The preparation of the initial set of messages to be sent.

The Canton transaction protocol, which has seven phases, begins at span 6. In this span, participant1 sends a request to sequencer1 to sequence the initial set of confirmation request messages as part of phase 1 of the protocol.

At span 7, sequencer1 receives the request and registers it. Receipt of the messages is not part of this span. That happens asynchronously at a later point.

At span 18, as part of phase 2, mediator1 receives an informee message. It only needs to validate and register it. Since it doesn’t need to respond, span 18 has no children.

As part of phase 3, participant2 receives a message (see span 8), and participant1 also receives a message (see span 9). Both participants asynchronously validate the messages. Because participant2 is only an observer, it does not need to respond, so span 8 has no children. participant1 does respond, which is visible at span 10. There, it again makes a call to sequencer1, which receives it at span 11.

At span 12, participant1 receives a successful send response message that signals that its message to the mediator was successfully sequenced. This occurs as part of phase 4, where confirmation responses are sent to the mediator. The mediator receives it at span 13, and it validates the message (phase 5).

In spans 14 and 15, mediator1 (now at phase 6) asks sequencer1 to send the transaction result messages to the participants.

To end this round of the transaction protocol, participant1 and participant2 receive their messages at spans 16 and 17, respectively. The messages are asynchronously validated, and their projections of the virtual shared ledger are updated (phase 7).

As mentioned, there are two other transaction submissions that are unlinked from this ping trace but are part of the operation. The second one starts at a span titled admin-ping.processTransaction, which is created by participant2. The third one has the same name but is initiated by participant1.

Node Health Status

Each Canton node exposes rich health status information. Running:

<node>.health.status

returns a status object, which can be one of:

  • Failure: if the status of the node cannot be determined, including an error message of why it failed
  • NotInitialized: if the node is not yet initialized
  • Success[NodeStatus]: if the status could be determined, including the detailed status

The NodeStatus differs depending on the node type. A participant node responds with a message containing:

  • Participant id: the participant id of the node
  • Uptime: the uptime of this node
  • Ports: the ports on which the participant node exposes the Ledger and the Admin API
  • Connected domains: the list of domains to which the participant is properly connected
  • Unhealthy domains: the list of domains to which the participant is trying to connect, but the connection is not ready for command submission
  • Active: true if this instance is the active replica (It can be false in the case of the passive instance of a high-availability deployment.)

A domain node or a sequencer node responds with a message containing:

  • Domain id: the unique identifier of the domain
  • Uptime: the uptime of this node
  • Ports: the ports on which the domain node exposes the Public and the Admin API
  • Connected Participants: the list of connected participants
  • Sequencer: a boolean flag indicating whether the embedded sequencer writer is operational

A domain topology manager or a mediator node returns:

  • Node uid: the unique identifier of the node
  • Uptime: the uptime of this node
  • Ports: the ports on which the node hosts its APIs
  • Active: true if this instance is the active replica (It can be false in the case of the passive instance of a high-availability deployment.)

Additionally, every node returns a components field detailing the health state of each of its internal runtime dependencies. The actual components differ per node and can give further insights into the node’s current status. Example components include storage access, domain connectivity, and sequencer backend connectivity.

Health Checks

gRPC Health Check Service

Each Canton node can optionally be configured to start a gRPC server exposing the gRPC Health Service. Passive nodes (see High Availability for more information on active/passive states) return NOT_SERVING. Consider this when configuring liveness and readiness probes in a Kubernetes environment.

The precise way the state is computed is subject to change.

Here is an example monitoring configuration to place inside a node configuration object:

monitoring.grpc-health-server {
  address = "127.0.0.1"
  port = 5861
}

Note

The gRPC health server is configured per Canton node, not per process, as is the case for the HTTP health check server (see below). This means that the configuration must be inserted within a node’s configuration object.

Note

To support usage as a Kubernetes liveness probe, the health server exposes a service named liveness that should be targeted when configuring a gRPC probe. This service always returns SERVING.

HTTP Health Check

Optionally, the Canton process can expose an HTTP endpoint indicating whether the process believes it is healthy. This may be used as an uptime check or as a Kubernetes liveness probe. If enabled, the /health endpoint responds to an HTTP GET request with a 200 HTTP status code if healthy, or with a 500 status code and a plain text description of the reason if unhealthy.

To enable this health endpoint, add a monitoring section to the Canton configuration. Since this health check is for the whole process, add it directly to the canton configuration rather than for a specific node.

canton {
  monitoring.health {
    server {
      port = 7000
    }

    check {
      type = ping
      participant = participant1
      interval = 30s
    }
  }
}

This health check causes participant1 to “ledger ping” itself every 30 seconds. The process is considered healthy if the ping is successful.
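As a rough illustration of how an external uptime check might consume this endpoint, the following Python sketch sends a GET request and interprets the 200 and 500 responses described above. The function name check_health is hypothetical, and the address 127.0.0.1 is an assumption (only the port, 7000, comes from the example configuration). Treating an unreachable endpoint as unhealthy is a design choice of this sketch.

```python
import urllib.error
import urllib.request
from typing import Tuple


def check_health(url: str, timeout: float = 5.0) -> Tuple[bool, str]:
    """Send a GET request to the health endpoint; return (healthy, detail)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status == 200, body
    except urllib.error.HTTPError as err:
        # Error statuses such as 500 are raised as HTTPError; the body
        # carries the plain text description of why the check failed.
        return False, err.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError) as err:
        # Connection refused, timeouts, and similar failures.
        return False, f"unreachable: {err}"


if __name__ == "__main__":
    # Assumed address; adjust the host and port to your deployment.
    healthy, detail = check_health("http://127.0.0.1:7000/health")
    print("healthy" if healthy else f"unhealthy: {detail}")
```

Because the configured ping runs every 30 seconds, polling the endpoint more often than that is unlikely to reveal a state change any sooner.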

Health Dumps

To receive efficient support, provide as much information as possible. For this purpose, Canton implements an information-gathering facility that collects key system information for support staff. If you encounter an error and need assistance, follow these steps:

  • Start Canton in interactive mode, with the -v option to enable debug logging: ./bin/canton -v -c <myconfig>. This provides a console prompt.
  • Reproduce the error by following the steps that previously caused the error. Write down these steps so they can be provided to support staff.
  • After you observe the error, type health.dump() into the Canton console to generate a ZIP file.

This creates a dump file (.zip) that stores the following information:

  • The configuration you are using, with all sensitive data stripped from it (no passwords).
  • An extract of the log file. Sensitive data is not logged into log files.
  • A current snapshot of Canton metrics.
  • A stacktrace for each running thread.

Provide the gathered information to your support contact together with the exact list of steps that led to the issue. Providing complete information is very important to help troubleshoot issues.

Remote Health Dumps

When you run a console configured to access remote nodes, the health.dump() command gathers health data from the remote nodes and packages it into the resulting ZIP files. No special action is required. You can obtain the health data of a specific node by targeting it when running the command. For example:

remoteParticipant1.health.dump()

When packaging large amounts of data, increase the default timeout of the dump command:

health.dump(timeout = 2.minutes)