Operational Processes

Managing domain entities

Domain bootstrapping

If you’re running a domain node in its default configuration, it will have a sequence and mediator embedded and these components will be automatically bootstrapped for you.

However if your domain operates with external sequencers and mediators for improved availability and performance properties, you need to instead configure a domain manager node (which only runs topology management) and bootstrap your domain with at least one external sequencer node and one external mediator node as illustrated below:

domainManager1.setup.bootstrap_domain(Seq(sequencer1), Seq(mediator1))

Domain managers are configured as domain-managers under the canton configuration. Domain managers are configured similarly to domain nodes, except that there is no sequencer, mediator, public api or service agreement configs.

Please note that if your sequencer is database based and you’re horizontally scaling it as described under sequencer high availability, you do not need to pass all sequencer nodes into the command above. Since they all share the same relational database, you only need to run this initialization step on one of them.

For other non-database based sequencer such as Ethereum or Fabric sequencers you need to have each node initialized individually. For these kinds of sequencers you can either initialize them as part of the initial domain bootstrap shown above or you can dynamically add a new sequencer at a later point like follows:

domainManager1.setup.onboard_new_sequencer(
  initialSequencer = sequencer1,
  newSequencer = sequencer2,
)

Distributed domain bootstrapping with separate consoles

The process outlined in the previous section only works if all nodes are accessible from the same console environment. In cases where they may each have their own isolated console environment, the bootstrapping process must be coordinated in steps with the exchange of data via files using any secure channel of communication between the environments:

// Domain manager's console: writes domain params to file
{
  domainManager1.service.get_static_domain_parameters.writeToFile(paramsFile)
}
// Sequencer's console: reads domain params from file and writes public key
{
  val domainParameters = StaticDomainParameters.tryReadFromFile(paramsFile)

  val initResponse =
    sequencer.initialization.initialize_from_beginning(domainId, domainParameters)

  initResponse.publicKey.writeToFile(file)
}
// Domain manager's console: reads sequencer's public key
{
  val sequencerPublicKey = SigningPublicKey.tryReadFromFile(file)

  domainManager1.setup.helper.authorizeKey(
    sequencerPublicKey,
    "sequencer",
    SequencerId(domainId),
  )
}
// Mediator's console: writes public key
mediator1.keys.secret.generate_signing_key("initial-key").writeToFile(file)

// Domain manager's console: reads mediator's public key and writes initial topology snapshot
{
  val mediatorKey = SigningPublicKey.tryReadFromFile(file)

  domainManager1.setup.helper.authorizeKey(
    mediatorKey,
    "mediator1",
    MediatorId(domainId),
  )

  domainManager1.topology.mediator_domain_states.authorize(
    TopologyChangeOp.Add,
    domainId,
    MediatorId(domainId),
    RequestSide.Both,
  )

  domainManager1.topology.all
    .list()
    .collectOfType[TopologyChangeOp.Positive]
    .writeToFile(file)
}
// Sequencer's console: reads initial topology snapshot and writes connection info
{
  val initialTopology =
    StoredTopologyTransactions
      .tryReadFromFile(file)
      .collectOfType[TopologyChangeOp.Positive]

  sequencer.initialization.bootstrap_topology(initialTopology)
  sequencer.sequencerConnection.writeToFile(file)
}
// Mediator's console: reads sequencer connection and domain params
{
  val sequencerConnection = SequencerConnection.tryReadFromFile(file)
  val domainParameters = StaticDomainParameters.tryReadFromFile(paramsFile)

  mediator1.mediator
    .initialize(
      domainId,
      MediatorId(domainId),
      domainParameters,
      sequencerConnection,
      None,
    )
  mediator1.health.wait_for_initialized()
}
// Domain manager's console: reads sequencer connection
{
  val sequencerConnection = SequencerConnection.tryReadFromFile(file)

  domainManager1.setup.init(sequencerConnection)

  domainManager1.health.wait_for_initialized()
}

Similarly, dynamically onboarding new sequencers (supported by Fabric and Ethereum sequencers) can be achieved in separate consoles as follows:

// Second sequencer's console: write signing key to file
{
  secondSequencer.keys.secret
    .generate_signing_key(s"${secondSequencer.name}-signing")
    .writeToFile(file1)
}

// Domain manager's console: write domain params and current topology
{
  domainManager1.service.get_static_domain_parameters.writeToFile(paramsFile)

  val sequencerSigningKey = SigningPublicKey.tryReadFromFile(file1)
  domainManager1.setup.helper.authorizeKey(
    sequencerSigningKey,
    s"${secondSequencer.name}-signing",
    sequencerId,
  )
  domainManager1.setup.helper.waitForKeyAuthorizationToBeSequenced(
    sequencerId,
    sequencerSigningKey,
  )
  domainManager1.topology.all
    .list(domainId.filterString)
    .collectOfType[TopologyChangeOp.Positive]
    .writeToFile(file1)
}

// Initial sequencer's console: read topology and write snapshot to file
{
  val topologySnapshotPositive =
    StoredTopologyTransactions
      .tryReadFromFile(file1)
      .collectOfType[TopologyChangeOp.Positive]

  val sequencingTimestamp = topologySnapshotPositive.lastChangeTimestamp.getOrElse(
    sys.error("topology snapshot is empty")
  )

  sequencer.sequencer.snapshot(sequencingTimestamp).writeToFile(file2)
}

// Second sequencer's console: read topology, snapshot and domain params
{
  val topologySnapshotPositive =
    StoredTopologyTransactions
      .tryReadFromFile(file1)
      .collectOfType[TopologyChangeOp.Positive]

  val state = SequencerSnapshot.tryReadFromFile(file2)

  val domainParameters = StaticDomainParameters.tryReadFromFile(paramsFile)

  secondSequencer.initialization
    .initialize_from_snapshot(
      domainId,
      topologySnapshotPositive,
      state,
      domainParameters,
    )
    .publicKey

  secondSequencer.health.initialized() shouldBe true

}

Dynamic domain parameters

In addition to the parameters that are specified in the configuration, some parameters can be changed at runtime (i.e., while the domain is running); these are called dynamic domain parameters.

A participant can get the current parameters on a domain it is connected to using the following command:

participant.topology.domain_parameters_changes.get_latest(mydomain.id)

A domain operator can update some of the parameters as follows:

mydomain.service.update_dynamic_parameters(_.copy(
  participantResponseTimeout = TimeoutDuration.ofSeconds(10)
))

Importing existing Contracts

You may have existing contracts, parties, and DARs in other Daml Participant Nodes (such as the Daml sandbox) that you want to import into your Canton-based participant node. To address this need, you can extract contracts and associated parties via the ledger api, modify contracts, parties, and daml archived as needed, and upload the data to Canton using the Canton Console.

You can also import existing contracts from Canton as that is useful as part of Canton upgrades across major versions with incompatible internal storage.

importing ledger contracts from other Daml Participant Nodes or instances of Canton based on previous major versions

Preparation

As contracts (1) “belong to” parties and (2) are instances of Daml templates defined in Daml Archives (DARs), importing contracts to Canton also requires creating corresponding parties and uploading DARs.

  • Contracts are often interdependent requiring care to honor dependencies such that the set of imported contracts is internally consistent. This requires particular attention if you choose to modify contracts prior to their import.
  • Additionally use of divulgence in the original ledger has likely introduced non-obvious dependencies that may impede exercising contract choices after import. As a result such divulged contracts need to be re-divulged as part of the import (by exercising existing choices or if there are no-side-effect-free choices that re-divulge the necessary contracts by extending your Daml models with new choices).
  • Party Ids have a stricter format on Canton than on non-Canton ledgers ending with a required “fingerprint” suffix, so at a minimum, you will need to “remap” party ids.
  • Canton contract keys do not have to be unique, so if your Daml models rely on uniqueness, consider extending the models using these strategies or limit your Canton Participants to connect to a single Canton domain with unique contract key semantics.
  • Canton does not support implicit party creation, so be sure to create all needed parties explicitly.
  • In addition you could choose to spread contracts, parties, and DARs across multiple Canton Participants.

With the above requirements in mind, you are ready to plan and execute the following three step process:

  1. Download parties and contracts from the existing Daml Participant Node and locate the DAR files that the contracts are based on.
  2. Modify the parties and contracts (at the minimum assigning Canton-conformant party ids).
  3. Provision Canton Participants along with at least one Canton Domain. Then upload DARs, create parties, and finally the contracts to the Canton participants. Finally connect the participants to the domain(s).

Importing an actual Ledger

To follow along with this guide, ensure you have installed and unpacked the Canton release bundle and run the following commands from the “canton-X.Y.Z” directory to set up the initial topology.

export CANTON=`pwd`
export CONF="$CANTON/examples/03-advanced-configuration"
export IMPORT="$CANTON/examples/07-repair"
bin/canton \
  -c $IMPORT/participant1.conf,$IMPORT/participant2.conf,$IMPORT/participant3.conf,$IMPORT/participant4.conf \
  -c $IMPORT/domain-export-ledger.conf,$IMPORT/domain-import-ledger.conf \
  -c $CONF/storage/h2.conf,$IMPORT/enable-preview-commands.conf \
  --bootstrap $IMPORT/import-ledger-init.canton

This sets up an “exportLedger” with a set of parties consisting of painters, house owners, and banks along with a handful of paint offer contracts and IOUs.

Define the following helper functions useful to extract parties and contracts via the ledger api:

def queryActiveContractsFromDamlLedger(
    hostname: String,
    port: Port,
    tls: Option[TlsClientConfig],
    token: Option[String] = None,
)(implicit consoleEnvironment: ConsoleEnvironment): Seq[CreatedEvent] = {

  // Helper to query the ledger api using the specified command.
  def queryLedgerApi[Svc <: AbstractStub[Svc], Result](
      command: GrpcAdminCommand[_, _, Result]
  ): Either[String, Result] =
    consoleEnvironment.grpcAdminCommandRunner
      .runCommand("sourceLedger", command, ClientConfig(hostname, port, tls), token)
      .toEither

  (for {
    // Identify all the parties on the ledger and narrow down the list to local parties.
    allParties <- queryLedgerApi(LedgerApiCommands.PartyManagementService.ListKnownParties())
    localParties = allParties.collect {
      case PartyDetails(party, _, isLocal) if isLocal => LfPartyId.assertFromString(party)
    }

    // Look up the ledger id needed next to query for the contracts.
    ledgerId <- queryLedgerApi(LedgerApiCommands.LedgerIdentityService.GetLedgerIdentity())

    // Query the ActiveContractsService for the actual contracts
    acs <- queryLedgerApi(
      LedgerApiCommands.AcsService.GetActiveContracts(ledgerId, localParties.toSet)
    )
  } yield acs.map(_.event)).valueOr(err =>
    throw new IllegalStateException(s"Failed to query parties, ledger id, or acs: $err")
  )
}

def removeCantonSpecifics(acs: Seq[CreatedEvent]): Seq[CreatedEvent] = {
  def stripPartyIdSuffix(suffixedPartyId: String): String =
    suffixedPartyId.split(SafeSimpleString.delimiter).head

  acs.map { event =>
    ValueRemapper.convertEvent(identity, stripPartyIdSuffix)(event)
  }
}

def lookUpPartyId(participant: ParticipantReference, party: String): PartyId =
  participant.parties.list(filterParty = party + SafeSimpleString.delimiter).map(_.party).head

As the first step, export the active contract set (ACS). To illustrate how to import data from non-Canton ledgers, strip the Canton-specifics by making the party ids generic (stripping the Canton-specific suffix).

val acs =
  queryActiveContractsFromDamlLedger(
    exportLedger.config.ledgerApi.address,
    exportLedger.config.ledgerApi.port,
    exportLedger.config.ledgerApi.tls.map(_.clientConfig),
  )

val acsExported = removeCantonSpecifics(acs).toList

Step number two involves preparing the Canton participants and domain by uploading DARs and creating parties. Here we choose to place the house owners, painters, and banks on different participants.

placing contracts on all the correct Canton Participants

Also modify the events to be based on the newly created party ids.

// Decide on which canton participants to host which parties along with their contracts.
// We place house owners, painters, and banks on separate participants.
val participants = Seq(participant1, participant2, participant3)
val partyAssignments =
  Seq(participant1 -> houseOwners, participant2 -> painters, participant3 -> banks)

// Connect to domain prior to uploading dars and parties.
participants.foreach { participant =>
  participant.domains.connect_local(importLedgerDomain)
  participant.dars.upload(darPath)
}

// Create canton party ids and remember mapping of plain to canton party ids.
val toCantonParty: Map[String, String] =
  partyAssignments.flatMap { case (participant, parties) =>
    val partyMappingOnParticipant = parties.map { party =>
      participant.ledger_api.parties.allocate(party, party)
      party -> lookUpPartyId(participant, party).toLf
    }
    partyMappingOnParticipant
  }.toMap

// Create traffic on all participants so that the repair commands will pick an identity snapshot that is aware of
// all party allocations
participants.foreach { participant =>
  participant.health.ping(participant, workflowId = importLedgerDomain.name)
}

// Switch the ACS to be based on canton party ids.
val acsToImportToCanton =
  acsExported.map(ValueRemapper.convertEvent(identity, toCantonParty(_)))

As the third step, perform the actual import to each participant filtering the contracts based on the location of contract stakeholders and witnesses.

// Disconnect from domain temporarily to allow import to be performed.
participants.foreach(_.domains.disconnect(importLedgerDomain.name))

// Pick a ledger create time according to the domain's clock.
val ledgerCreateTime =
  consoleEnvironment.environment.domains
    .getRunning(importLedgerDomain.name)
    .get
    .clock
    .now
    .toInstant

// Filter active contracts based on participant parties and upload.
partyAssignments.foreach { case (participant, rawParties) =>
  val parties = rawParties.map(toCantonParty(_))
  val participantAcs = acsToImportToCanton
    .collect {
      case event
          if event.signatories.intersect(parties).nonEmpty
            || event.observers.intersect(parties).nonEmpty
            || event.witnessParties.intersect(parties).nonEmpty =>
        val wrappedCreatedEvent = WrappedCreatedEvent(event)

        SerializableContractWithWitnesses(
          utils
            .contract_data_to_instance(wrappedCreatedEvent.toContractData, ledgerCreateTime),
          Set.empty,
        )
    }

  participant.repair.add(importLedgerDomain.name, participantAcs, ignoreAlreadyAdded = false)
}

def verifyActiveContractCounts() = {
  Map[LocalParticipantReference, (Boolean, Boolean)](
    participant1 -> ((true, true)),
    participant2 -> ((true, false)),
    participant3 -> ((false, true)),
  ).foreach { case (participant, (hostsPaintOfferStakeholder, hostsIouStakeholder)) =>
    val expectedCounts =
      (houseOwners.map { houseOwner =>
        houseOwner.toPartyId(participant) ->
          ((if (hostsPaintOfferStakeholder) paintOffersPerHouseOwner else 0)
            + (if (hostsIouStakeholder) 1 else 0))
      }
        ++ painters.map { painter =>
          painter.toPartyId(participant) -> (if (hostsPaintOfferStakeholder)
                                               paintOffersPerPainter
                                             else 0)
        }
        ++ banks.map { bank =>
          bank.toPartyId(participant) -> (if (hostsIouStakeholder) iousPerBank else 0)
        }).toMap[PartyId, Int]

    assertAcsCounts((participant, expectedCounts))
  }
}

/*
  If the test fails because of Errors.MismatchError.NoSharedContracts error, it could be worth to
  extend the scope of the suppressing logger.
 */
loggerFactory.assertLogsUnorderedOptional(
  {
    // Finally reconnect to the domain.
    participants.foreach(_.domains.reconnect(importLedgerDomain.name))

To demonstrate that the imported ledger works, let’s have each of the house owners accept one of the painters’ offer to paint their house.

def yesYouMayPaintMyHouse(
    houseOwner: PartyId,
    painter: PartyId,
    participant: ParticipantReference,
): Unit = {
  val iou = participant.ledger_api.acs.await[Iou.Iou](houseOwner, Iou.Iou)
  val bank = iou.value.payer
  val paintProposal = participant.ledger_api.acs
    .await[Paint.OfferToPaintHouseByPainter](
      houseOwner,
      Paint.OfferToPaintHouseByPainter,
      pp => pp.value.painter == painter.toPrim && pp.value.bank == bank,
    )
  val cmd = paintProposal.contractId
    .exerciseAcceptByOwner(houseOwner.toPrim, iou.contractId)
    .command
  val _ = clue(
    s"$houseOwner accepts paint proposal by $painter financing through ${bank.toString}"
  )(participant.ledger_api.commands.submit(Seq(houseOwner), Seq(cmd)))
}

// Have each house owner accept one of the paint offers to illustrate use of the imported ledger.
houseOwners.zip(painters).foreach { case (houseOwner, painter) =>
  yesYouMayPaintMyHouse(
    lookUpPartyId(participant1, houseOwner),
    lookUpPartyId(participant1, painter),
    participant1,
  )
}

// Illustrate that acceptance of have resulted in
{
  val paintHouseContracts = painters.map { painter =>
    participant2.ledger_api.acs
      .await[Paint.PaintHouse](lookUpPartyId(participant2, painter), Paint.PaintHouse)
  }
  assert(paintHouseContracts.size == 4)
  paintHouseContracts
}

This guide has demonstrated how to import data from non-Canton Daml Participant Nodes or from a Canton Participant of a lower major version as part of a Canton upgrade.

Backup and Restore

It is recommended that your database is frequently backed up so that the data can be restored in case of a disaster.

In the case of a restore, a participant can replay missing data from the domain considering the domain’s backup is more recent than that of the participant’s. It is important that the participant’s backup is not more recent than that of the domain’s as that would constitute a ledger fork. Therefore if you backup both participant and domain, always backup participant database before the domain.

In case of a domain restore from a backup, if a participant is ahead of the domain, the participant will refuse to connect to the domain and you must either:

  • restore the participant’s state to a backup before the disaster of the domain,
  • or roll out a new domain as a repair strategy in order to recover from a lost domain

We recommend that in production, a domain should be run with offsite synchronous replication to assure the most crucial data is always safely backed up and as up-to-date as possible.

Postgres Example

If you are using Postgres to persist the participant or domain node data, you can create backups to a file and restore it using Postgres’s utility commands pg_dump and pg_restore as shown below:

Backing up Postgres database to a file:

pg_dump -U <user> -h <host> -p <port> -w -F tar -f <fileName> <dbName>

Restoring Postgres database data from a file:

pg_restore -U <user> -h <host> -p <port> -w -d <dbName> <fileName>

Although the approach shown above works for small deployments, it is not recommended in larger deployments. For that, we suggest looking into incremental backups and refer to the resources below:

Database Failover

A database backup allows you to recover the ledger up to the point when the last backup was created. However, any command accepted after creation of the backup may be lost in case of a disaster. Therefore, restoring a backup will likely result in data loss.

If such data loss is unacceptable, you need to run Canton against a replicated database. If the data in one replica gets lost, the database can still failover to another replica without any data loss. For detailed instructions on how to setup a replicated database and how to perform failovers, we refer to the database system documentation, e.g. the high availability documentation of PostgreSQL.

It is strongly recommended to configure replication as synchronous. That means, the database should report a database transaction as successfully committed only after it has been persisted to all database replicas. In PostgreSQL, this corresponds to the setting synchronous_commit = on. If you do not follow this recommendation, you may observe data loss and/or a corrupt state after a database failover.

For PostgreSQL, Canton strives to validate the database replication configuration and fail with an error, if a misconfiguration is detected. However, this validation is of a best-effort nature; so it may fail to detect an incorrect replication configuration. For Oracle, no attempt is made to validate the database configuration. Overall, you should not rely on Canton detecting mistakes in the database configuration.

Ledger Pruning

Pruning the ledger frees up storage space by deleting state no longer needed by participants, domain sequencers, and mediators. It also serves as a mechanism to help implement right-to-forget mandates such as GDPR.

The following commands allow you to prune events and inactive contracts up to a specified time from the various components:

  • Prune participants via the prune command specifying a “ledger offset” obtained by specifying a timestamp received by a call to “get_offset_by_time”.
  • Prune domain sequencers and mediators via their respective prune_at commands.

The pruning operations impact the “regular” workload (lowering throughput during pruning by as much as 50% in our test environments), so depending on your requirements it might make sense to schedule pruning at off-peak times or during maintenance windows such as after taking database backups.

The following canton console code illustrates best practices such as :

  • Error handling ensures that pruning errors raise an alert. Catching the CommandFailure exception also ensures that a problem encountered while pruning one component still lets pruning other components proceed allowing corresponding storage to be freed up.
  • Pruning one node at a time rather than all nodes in parallel somewhat limits the impact on concurrently executing workload. If you configure pruning to run during a maintenance window with no concurrent workload, and as long as the database backend has sufficient capacity, you may prune participants and domains in parallel.
import com.digitalasset.canton.console.{CommandFailure, ParticipantReference}
import com.digitalasset.canton.data.CantonTimestamp
import java.time.Duration

def pruneAllNodes(pruneUpToIncluding: CantonTimestamp)(implicit env: ConsoleEnvironment): Unit = {
  import env._

  // If pruning a particular component fails, alert the user, but proceed pruning other components.
  // Therefore prune failures in one component still allow other components to be pruned
  // minimizing the chance of running out of overall storage space.
  def alertOnErrorButMoveOn(
      component: String,
      ts: CantonTimestamp,
      invokePruning: CantonTimestamp => Unit,
  ): Unit =
    try {
      invokePruning(ts)
    } catch {
      case _: CommandFailure =>
        logger.warn(
          s"Error pruning ${component} up to ${ts}. See previous log error for details. Moving on..."
        )
    }

  // Helper to prune a participant by time for consistency with domain prune signatures
  def pruneParticipantAt(p: ParticipantReference)(pruneUpToIncluding: CantonTimestamp): Unit = {
    val pruneUpToOffset = p.pruning.get_offset_by_time(pruneUpToIncluding.toInstant)
    pruneUpToOffset match {
      case Some(offset) => p.pruning.prune(offset)
      case None => logger.info(s"Nothing to prune up to ${pruneUpToIncluding}")
    }
  }

  val participantsToPrune = participants.all
  val domainsToPrune = domains.all

  // Prune all nodes one after the other rather than in parallel to limit the impact on concurrent workload.
  participantsToPrune.foreach(participant =>
    alertOnErrorButMoveOn(participant.name, pruneUpToIncluding, pruneParticipantAt(participant))
  )

  domainsToPrune.foreach { domain =>
    alertOnErrorButMoveOn(
      s"${domain.name} sequencer",
      pruneUpToIncluding,
      domain.sequencer.pruning.prune_at,
    )
    alertOnErrorButMoveOn(
      s"${domain.name} mediator",
      pruneUpToIncluding,
      domain.mediator.prune_at,
    )
  }
}

Invoke pruning from within your scheduling environment and by specifying the ledger data retention period like so:

val retainMostRecent = Duration.ofDays(30)
pruneAllNodes(CantonTimestamp.now().minus(retainMostRecent))

Pruning Ledgers in Test Environments

While it is a best practice for test environments to match production configurations, testing pruning involves challenges related to the amount of retained data:

  • Test environments may not have the same amount of storage space to hold data volumes present in production.
  • It may be impractical to wait long enough until test environments have accrued data to expected production retention times that are often measured in months.

As a result you may choose to prune test environments more aggressively. When using databases other than Oracle with a lower retention time, use the same code as when pruning production. On Oracle however you may observe performance degradation when pruning the majority of the ledger data in one go. In such cases breaking up pruning invocations into multiple chunks likely speeds up pruning:

// An example test environment configuration in which hardly any data is retained.
val pruningFrequency = Duration.ofDays(1)
val retainMostRecent = Duration.ofMinutes(20)
val pruningStartedAt = CantonTimestamp.now()
val isOracle = true

// Deleting the majority of rows from an Oracle table has been observed to
// take a long time. Avoid non-linear performance degradation by breaking up one prune call into
// several calls with progressively more recent pruning timestamps.
if (isOracle && retainMostRecent.compareTo(pruningFrequency) < 0) {
  val numChunks = 8L
  val delta = pruningFrequency.minus(retainMostRecent).dividedBy(numChunks)
  for (chunk <- 1L to numChunks) yield {
    val chunkRetentionTimestamp = pruningFrequency.minus(delta.multipliedBy(chunk))
    pruneAllNodes(pruningStartedAt.minus(chunkRetentionTimestamp))
  }
}

pruneAllNodes(pruningStartedAt.minus(retainMostRecent))

Repairing Participants

Canton enables interoperability of distributed participants and domains. Particularly in distributed settings without trust assumptions, faults in one part of the system should ideally produce minimal irrecoverable damage to other parts. For example if a domain is irreparably lost, the participants previously connected to that domain need to recover and be empowered to continue their workflows on a new domain.

This guide will illustrate how to replace a lost domain with a new domain providing business continuity to affected participants.

Recovering from a Lost Domain

Note

Please note that the given section describes a preview feature, due to the fact that using multiple domains is only a preview feature.

Suppose that a set of participants have been conducting workflows via a domain that runs into trouble. In fact consider that the domain has gotten into such a disastrous state that the domain is beyond repair, for example:

  • The domain has experienced data loss and is unable to be restored from backups or the backups are missing crucial recent history.
  • The domain data is found to be corrupt causing participants to lose trust in the domain as a mediator.

Next the participant operators each examine their local state, and upon coordinating conclude that their participants’ active contracts are “mostly the same”. This domain-recovery repair demo illustrates how the participants can

  • coordinate to agree on a set of contracts to use moving forward, serving as a new consistent state,
  • copying over the agreed-upon set of contracts to a brand new domain,
  • “fail over” to the new domain,
  • and finally continue running workflows on the new domain having recovered from the permanent loss of the old domain.

Repairing an actual Topology

To follow along with this guide, ensure you have installed and unpacked the Canton release bundle and run the following commands from the “canton-X.Y.Z” directory to set up the initial topology.

export CANTON=`pwd`
export CONF="$CANTON/examples/03-advanced-configuration"
export REPAIR="$CANTON/examples/07-repair"
bin/canton \
  -c $REPAIR/participant1.conf,$REPAIR/participant2.conf,$REPAIR/domain-repair-lost.conf,$REPAIR/domain-repair-new.conf \
  -c $CONF/storage/h2.conf,$REPAIR/enable-preview-commands.conf \
  --bootstrap $REPAIR/domain-repair-init.canton

To simplify the demonstration, this not only sets up the starting topology of

  • two participants, “participant1” and “participant2”, along with
  • one domain “lostDomain” that is about to become permanently unavailable leaving “participant1” and “participant2” unable to continue executing workflows,

but also already includes the ingredients needed to recover:

  • The setup includes “newDomain” that we will rely on as a replacement domain, and
  • we already enable the “enable-preview-commands” configuration needed to make available the “repair.change_domain” command.

In practice you would only add the new domain once you have the need to recover from domain loss and also only then enable the repair commands.

We simulate “lostDomain” permanently disappearing by stopping the domain and never bringing it up again to emphasize the point that the participants no longer have access to any state from domain1. We also disconnect “participant1” and “participant2” from “lostDomain” to reflect that the participants have “given up” on the domain and recognize the need for a replacement for business continuity. The fact that we disconnect the participants “at the same time” is somewhat artificial as in practice the participants might have lost connectivity to the domain at different times (more on reconciling contracts below).

lostDomain.stop()
Seq(participant1, participant2).foreach { p =>
  p.domains.disconnect(lostDomain.name)
  // Also let the participant know not to attempt to reconnect to lostDomain
  p.domains.modify(lostDomain.name, _.copy(manualConnect = true))
}
"lostDomain" has become unavailable and neither participant can connect anymore

Even though the domain is “the node that has broken”, recovering entails repairing the participants using the “newDomain” already set up. As of now, participant repairs have to be performed in an offline fashion requiring participants being repaired to be disconnected from the the new domain. However we temporarily connect to the domain, to let the topology state initialize, and disconnect only once the parties can be used on the new domain.

Seq(participant1, participant2).foreach(_.domains.connect_local(newDomain))

// Wait for topology state to appear before disconnecting again.
clue("newDomain initialization timed out") {
  eventually()(
    (
      participant1.domains.active(newDomain.name),
      participant2.domains.active(newDomain.name),
    ) shouldBe (true, true)
  )
}
// Run a few transactions on the new domain so that the topology state chosen by the repair commands
// really is the active one that we've seen
participant1.health.ping(participant2, workflowId = newDomain.name)

Seq(participant1, participant2).foreach(_.domains.disconnect(newDomain.name))

With the participants connected neither to “lostDomain” nor “newDomain”, each participant can

  • locally look up the active contracts assigned to the lost domain using the “testing.pcs_search” command made available via the “features.enable-testing-commands” configuration,
  • and invoke “repair.change_domain” (enabled via the “features.enable-preview-commands” configuration) in order to “move” the contracts to the new domain.
// Extract participant contracts from "lostDomain".
val contracts1 =
  participant1.testing.pcs_search(lostDomain.name, filterTemplate = "^Iou", activeSet = true)
val contracts2 =
  participant2.testing.pcs_search(lostDomain.name, filterTemplate = "^Iou", activeSet = true)

// Ensure that shared contracts match.
val Seq(sharedContracts1, sharedContracts2) = Seq(contracts1, contracts2).map(
  _.filter { case (_isActive, contract) =>
    contract.metadata.stakeholders.contains(Alice.toLf) &&
      contract.metadata.stakeholders.contains(Bob.toLf)
  }.toSet
)

clue("checking if contracts match") {
  sharedContracts1 shouldBe sharedContracts2
}

// Finally change the contracts from "lostDomain" to "newDomain"
participant1.repair.change_domain(
  contracts1.map(_._2.contractId),
  lostDomain.name,
  newDomain.name,
)
participant2.repair.change_domain(
  contracts2.map(_._2.contractId),
  lostDomain.name,
  newDomain.name,
  skipInactive = false,
)

Note

The code snippet above includes a check that the contracts shared among the participants match (as determined by each participant, “sharedContracts1” by “participant1” and “sharedContracts2” by “participant2). Should the contracts not match (as could happen if the participants had lost connectivity to the domain at different times), this check fails soliciting the participant operators to reach an agreement on the set of contracts. The agreed-upon set of active contracts may for example be

  • the intersection of the active contracts among the participants
  • or perhaps the union (for which the operators can use the “repair.add” command to create the contracts missing from one participant).

Also note that both the repair commands and the “testing.pcs_search” command are currently “preview” features, and therefore their names may change.

Once each participant has associated the contracts with “newDomain”, let’s have them reconnect, and we should be able to confirm that the new domain is able to execute workflows from where the lost domain disappeared.

Seq(participant1, participant2).foreach(_.domains.reconnect(newDomain.name))

// Look up a couple of contracts moved from lostDomain
val Seq(iouAlice, iouBob) = Seq(participant1 -> Alice, participant2 -> Bob).map {
  case (participant, party) =>
    participant.ledger_api.acs.await[Iou.Iou](party, Iou.Iou, _.value.owner == party.toPrim)
}

// Ensure that we can create new contracts
Seq(participant1 -> ((Alice, Bob)), participant2 -> ((Bob, Alice))).foreach {
  case (participant, (payer, owner)) =>
    participant.ledger_api.commands.submit_flat(
      Seq(payer),
      Seq(
        Iou
          .Iou(
            payer.toPrim,
            owner.toPrim,
            Iou.Amount(value = 200, currency = "USD"),
            List.empty,
          )
          .create
          .command
      ),
    )
}

// Even better: Confirm that we can exercise choices on the moved contracts
Seq(participant2 -> ((Bob, iouBob)), participant1 -> ((Alice, iouAlice))).foreach {
  case (participant, (owner, iou)) =>
    participant.ledger_api.commands
      .submit_flat(Seq(owner), Seq(iou.contractId.exerciseCall(owner.toPrim).command))
}
"newDomain" has replaced "lostDomain"

In practice, we would now be in a position to remove the “lostDomain” from both participants and to disable the repair commands again to prevent accidental use of these “dangerously powerful” tools.

This guide has demonstrated how participants can recover from losing a domain that has been permanently lost or somehow become irreparably corrupted.