Upgrading

Upgrading a DC/OS cluster

An upgrade is the process of moving between major releases to add new features or to replace existing features with new functionality. You can upgrade DC/OS only if you used the advanced installation process to install DC/OS on your cluster.

IMPORTANT: An upgrade is required only when changing the major or minor version of your DC/OS installation. Example: 1.13 --> 2.0

  • To update to a newer maintenance version (e.g. 2.0.1 to 2.0.2), refer to the instructions for patching.
  • To modify the cluster configuration, refer to the instructions for patching.

If upgrading is performed on a supported OS with all prerequisites fulfilled, then the upgrade should preserve the state of running tasks on the cluster.

Important guidelines

  • The Production installation method is the only recommended upgrade path for DC/OS. It is recommended that you familiarize yourself with the DC/OS Deployment Guide before proceeding.
  • Review the release notes before upgrading DC/OS.
  • Due to a cluster configuration issue with overlay networks, it is recommended to set enable_ipv6 to false in config.yaml when upgrading or configuring a new cluster. You can find additional information and a more detailed remediation procedure in our latest critical product advisory. Enterprise
  • If IPv6 is disabled in the kernel, then IPv6 must be disabled in the config.yaml file.
  • The DC/OS Enterprise license key must reside in a genconf/license.txt file. Enterprise
  • The DC/OS GUI and other higher-level system APIs may be inconsistent or unavailable until all master nodes have been upgraded. When this occurs:
    • The DC/OS GUI may not provide an accurate list of services.
  • An upgraded DC/OS Marathon leader cannot connect to the leading Mesos master until that master has also been upgraded. The DC/OS UI cannot be trusted until all masters are upgraded. There are multiple Marathon scheduler instances and multiple Mesos masters, each being upgraded, and the Marathon leader may not be the Mesos leader.
  • Task history in the Mesos UI will not persist through the upgrade.

Supported upgrade paths

The following tables list the supported upgrade paths for DC/OS 2.0.

DC/OS 1.13 to 2.0 Upgrade Paths

  Upgrade From   1.13.0, 1.13.1, 1.13.2, 1.13.3, 1.13.4, 1.13.5, 1.13.6, 1.13.7, 1.13.9 *, 1.13.10 *
  Upgrade To     2.0.0, 2.0.1, 2.0.2, 2.0.3, 2.0.4, 2.0.5, 2.0.6

  * See Warning Below

WARNING: The DC/OS 1.13.9 release and subsequent releases include a data format change for the persisted dcos-net state. If you upgrade from one of these releases to a release earlier than 2.0.4, this change can cause critical issues with dcos-net. Because of this, we recommend upgrading to release 2.0.4 or higher.
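The recommendation in the warning can be expressed as a small pre-flight check. The following is a minimal sketch, not part of the DC/OS tooling: the function names version_ge and target_is_safe are hypothetical, and the comparison assumes that GNU sort with the -V (version sort) option is available on the host.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helpers; these are not shipped with DC/OS.

# version_ge A B: succeeds when version A is greater than or equal to B.
# Assumes GNU sort, whose -V option orders version strings.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# target_is_safe INSTALLED TARGET: intended for 1.13.x to 2.0.x upgrades.
# Fails when INSTALLED is 1.13.9 or later and TARGET is earlier than 2.0.4.
target_is_safe() {
    local installed="$1" target="$2"
    if version_ge "$installed" "1.13.9" && ! version_ge "$target" "2.0.4"; then
        return 1
    fi
    return 0
}

# Example:
#   target_is_safe 1.13.9 2.0.3 || echo "choose 2.0.4 or higher"
```

The check only encodes the recommendation stated in the warning; it does not replace the upgrade path table.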

Modifying DC/OS configuration Enterprise

You cannot change your cluster configuration at the same time as upgrading to a new version. Cluster configuration changes must be done with a patch to an already installed version. For example, you cannot simultaneously upgrade a cluster from 1.13 to 2.0 and add more public agents. You can add more public agents with a patch to 1.13 and then upgrade to 2.0, or you can upgrade to 2.0 and then add more public agents by patching 2.0 after the upgrade.

Instructions

These steps must be performed for version upgrades.

Prerequisites

  • Enterprise users: DC/OS Enterprise downloads can be found here. Enterprise
  • Open Source users: DC/OS Open Source downloads can be found here. Open Source
  • Mesos, Mesos Frameworks, Marathon, Docker and all running tasks in the cluster should be stable and in a known healthy state.
  • For Mesos compatibility reasons, we recommend upgrading any running Marathon-on-Marathon instances to Marathon version 1.3.5 before proceeding with this DC/OS upgrade.
  • You must have access to copies of the config files used with the previous DC/OS version: config.yaml and ip-detect.
  • You must be using systemd 218 or newer to maintain task state.
  • All hosts (masters and agents) must be able to communicate with all other hosts as described at network security.
  • In CentOS or RedHat, install IP sets with this command (used in some IP detect scripts): sudo yum install -y ipset
  • You must be familiar with using systemctl and journalctl command line tools to review and monitor service status. Troubleshooting notes can be found at the end of this document.
  • You must be familiar with the DC/OS Production Installation instructions.
  • Take a snapshot of ZooKeeper prior to upgrading. Marathon supports rollbacks, but does not support downgrades.
  • Take a snapshot of the IAM database prior to upgrading. This is very easy to do and should be considered a necessity.
  • Ensure that Marathon event subscribers are disabled before beginning the upgrade. Leave them disabled after completing the upgrade, as this feature is now deprecated.

NOTE: Marathon event subscribers are disabled by default. Check whether the line --event_subscriber "http_callback" has been added to /opt/mesosphere/bin/marathon.sh on your master node(s), for example by opening the file with sudo vi /opt/mesosphere/bin/marathon.sh. If the line is present, you must remove it to disable event subscribers.
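If you prefer not to open the file in an editor, the check described in the note can be scripted. This is a minimal sketch; the function name has_event_subscriber is hypothetical, and the file path in the example is the one given in the note.

```shell
#!/usr/bin/env bash
# has_event_subscriber FILE: succeeds when FILE contains the
# --event_subscriber flag. The -e option keeps grep from treating the
# leading dashes of the pattern as a command line option.
has_event_subscriber() {
    grep -q -e '--event_subscriber' "$1"
}

# Example use on a master node:
#   if has_event_subscriber /opt/mesosphere/bin/marathon.sh; then
#       echo "event subscribers are enabled; remove the flag before upgrading"
#   fi
```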

Enterprise

  • Verify that all Marathon application constraints are valid before beginning the upgrade. Use this script to check if your constraints are valid.
  • Back up your cluster. Enterprise
  • Optional: You can add custom node and cluster health checks to your config.yaml.
  • Verify that all your masters are in a healthy state:
    • Check the Exhibitor UI to confirm that all masters have joined the quorum successfully (the status indicator will show green). The Exhibitor UI is available at http://<dcos_master>:8181/.
    • Verify that the output of curl http://<dcos_master_private_ip>:5050/metrics/snapshot includes the metric registrar/log/recovered with a value of 1 for each master.
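The metrics check above can be scripted so that it is easy to repeat on each master. The following is a minimal sketch under two assumptions: the snapshot is compact JSON on a single line, and metric names may appear with escaped slashes. The function names metric_value and master_recovered are hypothetical.

```shell
#!/usr/bin/env bash
# metric_value NAME: read a metrics snapshot (JSON) on stdin and print the
# value of metric NAME, or nothing if the metric is absent.
# Backslashes are stripped first because some JSON encoders write "/" as "\/".
# NAME is used inside a grep pattern, so it should not contain regex
# metacharacters.
metric_value() {
    tr -d '\\' | grep -o "\"$1\":[0-9.eE+-]*" | head -n 1 | cut -d: -f2
}

# master_recovered: succeeds when registrar/log/recovered equals 1.
master_recovered() {
    local value
    value="$(metric_value 'registrar/log/recovered')"
    [ -n "$value" ] && awk -v v="$value" 'BEGIN { exit !(v == 1) }'
}

# Example, run once per master:
#   curl -s http://<dcos_master_private_ip>:5050/metrics/snapshot | master_recovered \
#       && echo "registrar log recovered"
```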

Bootstrap Node

This procedure upgrades a DC/OS 1.13 cluster to DC/OS 2.0.

  1. Copy your existing config.yaml and ip-detect files to an empty genconf folder on your bootstrap node. The folder should be in the same directory as the installer.

  2. The syntax of the config.yaml file may differ from that of the earlier version. For a detailed description of the current config.yaml syntax and parameters, see the documentation.

    • You cannot change the exhibitor_storage_backend setting during an upgrade.
  3. After updating the config.yaml, compare the old config.yaml and new config.yaml. Verify that there are no differences in pathways or configurations. Changing these while upgrading can lead to catastrophic cluster failures.

  4. Modify the ip-detect file if necessary.

  5. Build your installer package.

    1. Download the dcos_generate_config.ee.sh Enterprise or dcos_generate_config.sh Open Source file.

    2. Generate the installation files. Replace <installed_cluster_version> in the command below with the DC/OS version currently running on the cluster you intend to upgrade, for example 1.13.9.

      Enterprise

      dcos_generate_config.ee.sh --generate-node-upgrade-script <installed_cluster_version>
      

      Open Source

      dcos_generate_config.sh --generate-node-upgrade-script <installed_cluster_version>
      
    3. The command in the previous step will produce a URL in the last line of its output, prefixed with Node upgrade script URL:. Record this URL for use in later steps. It will be referred to in this document as the “Node upgrade script URL”.

  6. Run the nginx container to serve the installation files using the Docker run command. For <your-port>, specify the port value that is used in the Node upgrade script URL.

sudo docker run -d -p <your-port>:80 -v $PWD/genconf/serve:/usr/share/nginx/html:ro nginx
  7. Go to the DC/OS Masters procedure to continue the upgrade.
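Several of the bootstrap node steps lend themselves to small shell helpers. The following is a minimal sketch, not part of the DC/OS installer: the function names config_changes, upgrade_script_url and url_port are hypothetical, the file names in the example are placeholders, and the sketch assumes that the generator writes the "Node upgrade script URL:" line to standard output.

```shell
#!/usr/bin/env bash
# config_changes OLD NEW: print a unified diff of two config.yaml files.
# Exits 0 when the files are identical and 1 when they differ.
config_changes() {
    diff -u "$1" "$2"
}

# upgrade_script_url: read the generator output on stdin and print the URL
# from the last line that starts with "Node upgrade script URL:".
upgrade_script_url() {
    sed -n 's/^Node upgrade script URL:[[:space:]]*//p' | tail -n 1
}

# url_port URL: print the explicit port of an http(s) URL, or nothing if
# the URL does not carry one.
url_port() {
    printf '%s\n' "$1" | sed -n 's#^[a-z]*://[^/:]*:\([0-9][0-9]*\).*#\1#p'
}

# Example flow (file names are placeholders):
#   config_changes old-config.yaml genconf/config.yaml
#   dcos_generate_config.sh --generate-node-upgrade-script 1.13.9 | tee generate.log
#   url="$(upgrade_script_url < generate.log)"
#   port="$(url_port "$url")"
#   sudo docker run -d -p "${port}:80" -v "$PWD/genconf/serve:/usr/share/nginx/html:ro" nginx
```

Deriving the port from the recorded URL keeps the nginx port mapping consistent with the URL that the nodes will download from.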

DC/OS Masters

Proceed with upgrading every master node one at a time in any order using the following procedure. When you complete each upgrade, monitor the Mesos master metrics to ensure the node has rejoined the cluster and completed reconciliation.

  1. Download and run the node upgrade script:

    curl -O <Node upgrade script URL>
    sudo bash dcos_node_upgrade.sh
    
  2. Verify that the upgrade script succeeded and exited with the status code 0:

    echo $?
    0
    
  3. Validate the upgrade by running the following commands on the master node:

    1. Monitor Exhibitor and wait for it to converge.

      On DC/OS Enterprise clusters with a static master list use the command:

      sudo curl --cacert /var/lib/dcos/exhibitor-tls-artifacts/root-cert.pem --cert /var/lib/dcos/exhibitor-tls-artifacts/client-cert.pem --key /var/lib/dcos/exhibitor-tls-artifacts/client-key.pem https://localhost:8181/exhibitor/v1/cluster/status
      

      On other clusters use the command:

      curl http://localhost:8181/exhibitor/v1/cluster/status
      

      Wait until the response shows that all hosts have "description":"serving".

    2. Wait until the dcos-mesos-master unit is up and running.

    3. Verify that the output of curl http://localhost:5050/metrics/snapshot includes the metric registrar/log/recovered with a value of 1.

      NOTE: If you are upgrading from permissive to strict mode, this URL will be curl https://... and you will need a JWT for access.

      Enterprise
    4. Verify that /opt/mesosphere/bin/mesos-master --version indicates that the upgraded master is running the version of Mesos specified in the release notes, for example 1.9.1.

    5. Verify that the number of under-replicated ranges in CockroachDB has dropped to zero as the IAM database is replicated to the new master. Run the following command and confirm that the ranges_underreplicated column shows only zeros.

    sudo /opt/mesosphere/bin/cockroach node status --ranges --certs-dir=/run/dcos/pki/bouncer --host=$(/opt/mesosphere/bin/detect_ip)
    
    +----+---------------------+--------+---------------------+---------------------+------------------+-----------------------+--------+--------------------+------------------------+
    | id |       address       | build  |     updated_at      |     started_at      | replicas_leaders | replicas_leaseholders | ranges | ranges_unavailable | ranges_underreplicated |
    +----+---------------------+--------+---------------------+---------------------+------------------+-----------------------+--------+--------------------+------------------------+
    |  1 | 172.31.7.32:26257   | v1.1.4 | 2018-03-08 13:56:10 | 2018-02-28 20:11:00 |              195 |                   194 |    195 |                  0 |                      0 |
    |  2 | 172.31.10.48:26257  | v1.1.4 | 2018-03-08 13:56:05 | 2018-03-05 13:33:45 |              200 |                   199 |    200 |                  0 |                      0 |
    |  3 | 172.31.23.132:26257 | v1.1.4 | 2018-03-08 13:56:01 | 2018-02-28 20:18:41 |              187 |                   187 |    187 |                  0 |                      0 |
    +----+---------------------+--------+---------------------+---------------------+------------------+-----------------------+--------+--------------------+------------------------+
    

    If the ranges_underreplicated column lists any non-zero values, wait a minute and rerun the command. The values will converge to zero after all data is safely replicated.

  4. Go to the DC/OS Agents procedure to continue the upgrade.
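Two of the master validation checks, the Exhibitor status and the CockroachDB replication table, can be evaluated by a script instead of by eye. The following is a minimal sketch. The function names all_serving and no_underreplicated_ranges are hypothetical, and the parsing assumes that the Exhibitor response carries one "description" field per host and that the CockroachDB output uses the table layout shown in the validation step.

```shell
#!/usr/bin/env bash
# all_serving: read the Exhibitor cluster status JSON on stdin. Succeeds
# when at least one host is listed and every "description" is "serving".
all_serving() {
    local body total serving
    body="$(cat)"
    total="$(printf '%s' "$body" | grep -o '"description"' | wc -l | tr -d ' ')"
    serving="$(printf '%s' "$body" | grep -o '"description" *: *"serving"' | wc -l | tr -d ' ')"
    [ "$total" -gt 0 ] && [ "$total" -eq "$serving" ]
}

# no_underreplicated_ranges: read the cockroach node status --ranges table
# on stdin. Succeeds when every data row shows 0 in ranges_underreplicated.
no_underreplicated_ranges() {
    awk -F'|' '
        # Until the header row is seen, look for the column by name.
        col == 0 {
            for (i = 1; i <= NF; i++) {
                name = $i; gsub(/ /, "", name)
                if (name == "ranges_underreplicated") col = i
            }
            next
        }
        # Data rows start with "|"; border rows start with "+".
        /^[ \t]*\|/ {
            value = $col; gsub(/ /, "", value)
            rows++
            if (value != "0") bad++
        }
        END { exit !(col > 0 && rows > 0 && bad == 0) }
    '
}

# Example:
#   curl -s http://localhost:8181/exhibitor/v1/cluster/status | all_serving
```

Both functions only read standard input, so you can pipe in the output of the corresponding commands from the validation steps.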

DC/OS Agents

Be aware that when upgrading agent nodes, there is a five-minute timeout for the agent to respond to health check pings from the Mesos masters before the agent node and its tasks expire.

On all DC/OS agents:

  1. Navigate to the /opt/mesosphere/lib directory and delete this library file. Deleting this file will prevent conflicts.

      libltdl.so.7
    
  2. Download and run the node upgrade script.

    curl -O <Node upgrade script URL>
    sudo bash dcos_node_upgrade.sh
    
  3. Verify that the upgrade script succeeded and exited with the status code 0.

    echo $?
    0
    
  4. Validate the upgrade.

    • Verify that the output of curl http://<dcos_agent_private_ip>:5051/metrics/snapshot includes the metric slave/registered with a value of 1.
    • Monitor the Mesos UI to verify that the upgraded node rejoins the DC/OS cluster and that tasks are reconciled (http://<master-ip>/mesos). If you are upgrading from permissive to strict mode, this URL will be https://<master-ip>/mesos.
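Because an agent may take some time to re-register after the upgrade script finishes, the validation can be wrapped in a polling loop. The following is a minimal sketch. The function names poll_until, registered_in_snapshot and agent_registered are hypothetical, the snapshot is assumed to be compact JSON, and the 300-second budget in the example simply mirrors the five-minute timeout mentioned for agents; it is not a DC/OS setting.

```shell
#!/usr/bin/env bash
# poll_until TIMEOUT INTERVAL CMD...: rerun CMD every INTERVAL seconds until
# it succeeds or TIMEOUT seconds have been spent waiting.
# Returns 0 on success and 1 on timeout.
poll_until() {
    local timeout="$1" interval="$2" waited=0
    shift 2
    until "$@"; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# registered_in_snapshot: read a metrics snapshot on stdin and succeed when
# slave/registered has a value of 1 (written as 1 or 1.0).
registered_in_snapshot() {
    tr -d '\\' | grep -Eq '"slave/registered":1(\.0*)?[,}]'
}

# agent_registered URL: fetch the snapshot from URL and check it.
agent_registered() {
    curl -fsS "$1" | registered_in_snapshot
}

# Example:
#   poll_until 300 10 agent_registered "http://<dcos_agent_private_ip>:5051/metrics/snapshot"
```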

Troubleshooting Recommendations

The following commands should provide insight into upgrade issues:

On All Cluster Nodes

sudo journalctl -u dcos-download
sudo journalctl -u dcos-spartan
sudo systemctl | grep dcos
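On a node with many DC/OS units, it can help to reduce the systemctl listing to the units that have failed. The following is a minimal sketch; the function name failed_dcos_units is hypothetical, and the parsing assumes the default systemctl column order of UNIT, LOAD, ACTIVE, SUB.

```shell
#!/usr/bin/env bash
# failed_dcos_units: read systemctl output on stdin and print the names of
# DC/OS units whose ACTIVE column is "failed". The unit name is located by
# its "dcos" prefix because failed units are listed with a leading marker.
failed_dcos_units() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^dcos/) {
                if ($(i + 2) == "failed") print $i
                break
            }
        }
    }'
}

# Example:
#   sudo systemctl | failed_dcos_units
```

You can then pass each reported unit name to sudo journalctl -u to review its log.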

If your upgrade fails because of a custom node or cluster check, run these commands for more details:

dcos-check-runner check node-poststart
dcos-check-runner check cluster

On DC/OS Masters

Enterprise

sudo journalctl -u dcos-exhibitor
less /opt/mesosphere/active/exhibitor/usr/zookeeper/zookeeper.out
sudo journalctl -u dcos-mesos-dns
sudo journalctl -u dcos-mesos-master

Open Source

sudo journalctl -u dcos-exhibitor
less /var/lib/dcos/exhibitor/zookeeper/zookeeper.out
sudo journalctl -u dcos-mesos-dns
sudo journalctl -u dcos-mesos-master

On DC/OS Agents

sudo journalctl -u dcos-mesos-slave

Notes:

  • Packages available in the DC/OS 2.0 Catalog are newer than those in older versions of the Catalog. Services are not automatically upgraded when DC/OS is installed because not all DC/OS services have upgrade paths that will preserve existing state.