After nuking my second LND node (the first died of hardware failure) through my own typo, aided by the thoughtless CLI design of LND's node tools, I decided to take the plunge into a mature and complex implementation of the protocol: Eclair by ACINQ. It has been almost a year since (the node was born on Christmas Day 2023), with 50 thousand transactions routed and over 30 BTC of value relayed. In this post, I'd like to reflect on my experience with Eclair, go over some of the gotchas and issues, and highlight some of the good choices I've made since the beginning of this adventure.
Learnings from Past Experience
While I was learning the Lightning Network and had very little understanding of how things worked in the broader Bitcoin space, Umbrel was the go-to solution that helped me get off the ground. It proved easy and somewhat educational, but it was not something I would run continuously in production or trust with any significant amount of bitcoin that I could not afford to lose. Lightning is built on top of the L1 (Bitcoin) network, but it manages channel state in its own database, negotiated and agreed upon with its peers. Any failure of state integrity may result in the complete loss of liquidity or in hefty penalty transactions (a significant loss of capital). A Lightning node that participates in routing public transactions is also required to be online constantly, with as little downtime as possible and only short periods offline at a time. Otherwise, you risk force-closure of channels due to expired HTLCs, whose timeouts are measured in blocks.
The Setup
Taking all of my learnings into consideration, I decided to first invest in reliable enterprise-grade hardware:

- Server-grade hardware with ECC memory, a reliable power supply, and a reliable CPU
- UPS (Uninterruptible Power Supply) to avoid headaches from electrical spikes or drop-outs
- Reliable enterprise SSDs and NVMes
- ZFS (filesystem) to mirror the critical storage and ensure full data integrity (bit-rot prevention); you do need to tune ZFS for your specific workload and reliability needs
- A reliable, replicated database (PostgreSQL) with two local replicas and one remote, configured to require at least two replicas to commit each transaction to disk (see the sketch right after this list)
- Backups! On-site and off-site backups of the critical configuration that you can use to restore the node if your house burns down
- Spare parts, redundancy, backups, monitoring
- Reliable and stable internet connectivity
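On the PostgreSQL side, the "at least two replicas must commit" requirement boils down to two lines on the primary. A minimal sketch, assuming three standbys named replica1 through replica3 (the names and count are illustrative, not my exact setup):

# postgresql.conf on the primary: a commit is acknowledged only after
# at least two of the named standbys have flushed it to disk.
synchronous_commit = on
synchronous_standby_names = 'ANY 2 (replica1, replica2, replica3)'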
The software stack is Eclair 0.11.0 (the latest release as of this writing), PostgreSQL 16 with two replicas, Bitcoin Core 27.2 (with redundant storage of blocks), an additional Bitcoin Core instance running on a separate node and kept in sync with the chain (in case the primary fails), and Ubuntu 22.04 with the latest Docker packages from the official Docker repository.
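For reference, pointing Eclair at PostgreSQL instead of the default SQLite is a driver setting in eclair.conf. A minimal sketch, as I understand the keys from the reference configuration (the credentials are placeholders):

eclair.db {
  driver = "postgres" // default is "sqlite"
  postgres {
    database = "eclair"
    host = "localhost"
    port = 5432
    username = "eclair"
    password = "change-me"
  }
}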
All Major Gotchas That I Came Across
While Eclair is mature and very stable in itself, it does have some quirks and design choices you need to account for when running your node. The software is written in Scala and requires a specific JVM version: a JRE to run it, and Maven to build it. That doesn't mean other versions won't work, but you may hit unpleasant bugs that result in catastrophic failures of your node, with nobody to help you. All of the requirements are listed in the release notes and installation guide. Whenever in doubt, RTFM first, then ask questions.
Limited Support by the FOSS Community
Eclair is not the most popular implementation of the Lightning protocol, so it is hard to find tools or plugins to help you manage the node. The only GUI available so far is RTL, and with a very limited feature set. For any sort of statistics, you are limited to either Prometheus (extensive metrics are available) or writing your own SQL on top of the Eclair tables.
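As an example of the SQL route, here is a rough sketch of a daily routing summary. The table and column names (audit.relayed, amount_msat, direction, timestamp) are assumptions based on the Postgres schema my node runs; verify them against your version before relying on it:

-- Hypothetical: relayed payments and volume per day.
SELECT date_trunc('day', "timestamp") AS day,
       count(*)                       AS relays,
       sum(amount_msat) / 1000        AS volume_sat
FROM audit.relayed
WHERE direction = 'OUT' -- count each relay once, assuming paired IN/OUT rows
GROUP BY 1
ORDER BY 1 DESC;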
On-chain Fee Differences Between Yours and Partner Nodes
This one hit me hard, and many times. I've had more than a few channel force-closures because of the conservative and safe default settings. The worst part is that it strikes during huge fee spikes, exactly when force-closing a channel is at its most expensive. I am still not 100% sure how a large feerate mismatch can be exploited in practice, and opted to increase the tolerance levels to avoid surprise FCs:
eclair.on-chain-fees {
  feerate-tolerance {
    ratio-low = 0.01 // will allow remote fee rates as low as 1% of our local feerate (fee spikes)
    ratio-high = 20.0 // will allow remote fee rates as high as 20x our local feerate (fee drops)
  }
}
It is up to you and your risk tolerance to define something reasonable that still allows for secure and reliable node operation.
Initial Lightning Network State Sync
When I first started running the node, I had very few channels and startup times were fast. Later, once I had expanded the number of channels, I noticed it took my node 6-12 hours before it was fully in sync and routing traffic quickly. Given that ACINQ maintains one of the largest nodes on the network, I knew something in my settings was causing the issue. After some research, I came across the setting that whitelists node IDs for network state sync, which immediately rang a bell, since I knew from my LND days that not all peers are used for network sync. Setting the list to my most reliable and largest peers brought the startup settling time back down to minutes:
eclair.sync-whitelist = [
"03864ef025fde8fb587d989186ce6a4a186895ee44a926bfc370e2c366597a3f8f",
...
]
You do not need many public keys here; keeping the list to between 5 and 10 is plenty.
Automatic MAX HTLC Adjustment for the Channel
One of the killer features of Eclair is its ability to automatically adjust a channel's MAX HTLC, reducing the number of transactions that fail due to insufficient liquidity. These updates can be used by observers to estimate your channels' balances, but with smart configuration and a little thinking you can keep things reasonably private while still maintaining good transaction flow:
eclair.channel.channel-update.min-time-between-updates = 1 hour # allow adjustments at most once per hour
eclair.channel.channel-update.balance-thresholds = [
  {
    available-sat = 10000
    max-htlc-sat = 0 // 0% of 10000
  },
  ...
]
You can have as many threshold tiers as you need; just make sure the channel's MAX HTLC ends up within reasonable ranges. You should account for multiple transactions moving through the channel at once, as well as for the channel size and the average payment amount in sats.
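To make that concrete, here is a purely illustrative ladder; the tier values are invented for this example and should be adapted to your channel sizes and average payment size:

eclair.channel.channel-update.balance-thresholds = [
  { available-sat = 10000,   max-htlc-sat = 0      }, // nearly drained: advertise nothing
  { available-sat = 100000,  max-htlc-sat = 50000  }, // advertise about half of what is available
  { available-sat = 1000000, max-htlc-sat = 400000 }  // cap well below the real balance for privacy
]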
Max Accepted HTLCs
By design, a Lightning channel is limited to a specific number of in-flight HTLCs, and that limit is fixed when the channel is opened, with no way to change it unless you close and reopen the channel with new settings. If your node routes a lot of small transactions (zaps), you may quickly fail many of them because of that limit (I think the default was in the single-digit range):
eclair.channel.max-htlc-value-in-flight-percent = 98 # default, I believe, is half (50%)
eclair.channel.max-accepted-htlcs = 50
The settings above allow the channel to be utilized more fully and to carry more concurrent transactions without clogging up.
CLTV Delta
This setting is global for Eclair and defines the CLTV delta your node requires when relaying: the number of blocks between an incoming HTLC's expiry and the outgoing one, i.e. your safety margin (in time, measured in blocks) to react on-chain before an HTLC expires. Setting this too high may cause many HTLCs to fail on smaller nodes with not-so-great centrality, and reduce the number of routed transactions:
# CLTV delta
eclair.channel.expiry-delta-blocks = 60
The default is 144, but I found that setting it to 60 (the minimum possible for my node's setup and configuration) yields better routing results. It does expose you to more risk of expired HTLCs causing force-closures, but so far I have seen only one on my node.
Allocate Sufficient Memory
You will want to adjust the heap size for Eclair, since the default is too small to run any sizable node. Setting JAVA_OPTS=-Xmx32g (or half the size of your available RAM) is a good start. I would advise having at least 32GB of RAM for the node, and allocating at least 16GB (JAVA_OPTS=-Xmx16g) for smooth and fast operation.
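With the standalone distribution, the heap size is just an environment variable picked up by the launcher. A sketch (the data directory path is illustrative; in Docker, set the same variable on the container instead):

# Give the JVM a 16GB heap before starting the node.
export JAVA_OPTS="-Xmx16g"
./eclair-node/bin/eclair-node.sh -Declair.datadir=/data/eclair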
And More Settings and Parameters to Tune
I have covered only some of the major settings I felt were worth writing about, but there is much more you can configure and tweak. Read all of the Guides, and pay particular attention to Configure and the sample reference configuration file.
Good Decisions
First, going with Eclair was the right choice, along with using server-grade hardware with ECC RAM and reliable storage. Second, having the database replicated across three separate nodes, one of them off-site, saved me from certain destruction of all state and loss of funds. Third, deciding to maintain channels only with reliable and stable nodes saved me from some bad force-closures: I will close a channel if the peer goes up and down too frequently, regardless of how well it routes. Even big nodes run by single operators fail badly, and so do nodes operated by companies. Keeping an eye on your node and its health, and on the health of its peers, is something very few operators do, and skipping it can cause failures and unnecessary loss of your funds and theirs.
Lastly, if you decide to run a routing node, you have a responsibility to maintain it well and to monitor its health. There are many tools you can use; with Eclair, Prometheus and Grafana work well. Keep your node's packages updated and watch for security issues that surface from time to time, so you can mitigate them quickly.
Conclusion
So far I am satisfied with Eclair, despite all the difficulties and headaches I've had with it. It is not perfect, and it requires me to build small tools for some basic things, but I need a stable and reliable node that I can trust, and Eclair has proved to be all that I wanted. It saved my bacon a few times: I once nuked one of the PostgreSQL servers and all of its data, managed to do the same to another replica, and was still able to recover and rebuild everything from the remaining replica. Eclair is also stateless during runtime and guarantees the consistency of the node regardless of how it fails. Even if you pull the plug on the node's server, it will come back up and recover a consistent state that is in agreement with its peers.
Is it for everyone? No, it is definitely not for everyone, nor for anyone who just wants a small node with a few channels to run their online shop. You could run a very reliable and trusted node for that shop with Eclair, but you will need some technical skills to set it up, maintain it, and recover it when things go wrong.
In the end, it all comes down to you: your skills, your willingness to learn, and your risk tolerance. For me, it was the right choice, and I have no regrets, despite not having access to the latest shiny features of the Lightning Network.