How to Recover a Lost Node in a Vertica Database

Following a question I saw in one of the online forums, I decided to write a step-by-step article on how to recover one failed node when the host/node is no longer available.

  • If you lose one node in your cluster (and the host is no longer accessible), you won't be able to remove it from the database, because a node has to be up to be removed, so you have to fix this in another way.
Before you jump in head first, remember that if you use this solution in your environment, I will not be held responsible for any damage caused. So let's see how we can destroy one of the hosts and then recover the entire host plus its Vertica node. In this scenario I will use a 3-node Vertica cluster running on AWS, as shown below:
[dbadmin@ip-10-333-333-210 ~]$ admintools -t list_allnodes
 Node           | Host          | State | Version         | DB
----------------+---------------+-------+-----------------+-----
 v_db1_node0001 | 10.333.333.210 | UP    | vertica-7.2.1.0 | db1
 v_db1_node0002 | 10.333.333.211 | UP    | vertica-7.2.1.0 | db1
 v_db1_node0003 | 10.333.333.226 | UP    | vertica-7.2.1.0 | db1
As you can see, my database is up and running. I'll go ahead and terminate one of the nodes (this simulates the loss of the node):
[root@ip-10-333-333-210 ~]# aws ec2 terminate-instances  --instance-ids i-5d65634232
-------------------------------
|     TerminateInstances      |
+-----------------------------+
||   TerminatingInstances    ||
|+---------------------------+|
||        InstanceId         ||
|+---------------------------+|
||  i-5d656683               ||
|+---------------------------+|
|||      CurrentState       |||
||+-------+-----------------+||
||| Code  |      Name       |||
||+-------+-----------------+||
|||  32   |  shutting-down  |||
||+-------+-----------------+||
|||      PreviousState      |||
||+---------+---------------+||
|||  Code   |     Name      |||
||+---------+---------------+||
|||  16     |  running      |||
||+---------+---------------+||
Next I can see that my database is still running (thanks to the built-in HA), but it is missing a node (the one we dropped).
[dbadmin@ip-10-333-333-210 ~]$ admintools -t list_allnodes
 Node           | Host          | State | Version         | DB
----------------+---------------+-------+-----------------+-----
 v_db1_node0001 | 10.333.333.210 | UP    | vertica-7.2.1.0 | db1
 v_db1_node0002 | 10.333.333.211 | UP    | vertica-7.2.1.0 | db1
 v_db1_node0003 | 10.333.333.226 | DOWN  | unavailable     | db1
So how can we fix this? Here is the list of steps:
  1. Create a new EC2 instance.
  2. Install the required packages.
  3. Download the Vertica RPM and install it.
  4. Create the dbadmin user and the verticadba group.
  5. Enable password-less SSH access between the nodes for the dbadmin user.
  6. Create the same directory structure as on the lost node.
  7. Copy the admintools.conf file.
  8. Create the catalog location.
  9. Recover the database using the force option.

Create new EC2 instance.

  • It has to be in the same VPC and subnet as the lost node.
  • Assign it the same IP address as the lost node, in our case 10.333.333.226.
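If you launch the replacement instance with the AWS CLI, the private IP can be fixed at launch time via the `--private-ip-address` option. A minimal sketch; the AMI, subnet, key pair and instance type below are placeholders you must replace with your own values:

```shell
# Launch a replacement instance with the SAME private IP as the lost node.
# ami-xxxxxxxx, subnet-xxxxxxxx and my-keypair are placeholders.
aws ec2 run-instances \
    --image-id ami-xxxxxxxx \
    --instance-type c3.2xlarge \
    --key-name my-keypair \
    --subnet-id subnet-xxxxxxxx \
    --private-ip-address 10.333.333.226
```

The subnet must be the one the lost node lived in, otherwise the fixed private IP cannot be assigned.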

Install the required packages and setup host for Vertica installation.

For this task, use the script below, unless you already have an AMI template ready.
yum install -y openssh which dialog gdb mcelog sysstat rsync python* telnet ruby* java* sudo openssh-server openssh-clients ntp wget

chkconfig sshd on
service sshd start

service iptables save
service iptables stop
chkconfig iptables off

chkconfig ntpd on
service ntpd start

echo 'session required pam_limits.so' >> /etc/pam.d/su

echo '# Controls the default maximum open files' >> /etc/sysctl.conf
echo 'fs.file-max = 65536' >> /etc/sysctl.conf
echo '# Controls the default maximum size of a message queue' >> /etc/sysctl.conf
echo 'kernel.msgmnb = 65536' >> /etc/sysctl.conf
echo '# Controls the maximum size of a message, in bytes' >> /etc/sysctl.conf
echo 'kernel.msgmax = 65536' >> /etc/sysctl.conf
echo '# Controls the maximum shared segment size, in bytes' >> /etc/sysctl.conf
echo 'kernel.shmmax = 68719476736' >> /etc/sysctl.conf
echo '# Controls the maximum number of shared memory segments, in pages' >> /etc/sysctl.conf
echo 'kernel.shmall = 4294967296' >> /etc/sysctl.conf
echo '# The following line added by Vertica tools.' >> /etc/sysctl.conf
echo 'vm.max_map_count = 503831' >> /etc/sysctl.conf
echo 'vm.swappiness = 10' >> /etc/sysctl.conf

# Limits

echo 'dbadmin - nproc  4096' >> /etc/security/limits.conf
echo 'dbadmin - fsize  unlimited' >> /etc/security/limits.conf
echo 'dbadmin - nofile 65536' >> /etc/security/limits.conf
echo 'dbadmin - nice   0' >> /etc/security/limits.conf

# Disk readahead set to 4096 (persist in rc.local and apply now)

for DISK in $(df | grep vertica | awk '{print $1}'); do
    echo "blockdev --setra 4096 $DISK" >> /etc/rc.d/rc.local
    blockdev --setra 4096 "$DISK"
done

# Disable transparent hugepages

echo never > /sys/kernel/mm/transparent_hugepage/enabled

# Disable defrag

echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Set the I/O scheduler to deadline - this assumes the Vertica mount points
# show "vertica" in the df output

for DISK in $(df | grep vertica | awk '{print $1}' | sed 's|/dev/||g'); do
    echo deadline > /sys/block/$DISK/queue/scheduler
done

sysctl -p

Download the vertica rpm and install it.

Make sure you use the same version as the one in the other two nodes.
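If you are not sure which build is running on the surviving nodes, you can check the installed package there before downloading (this assumes Vertica was installed from the RPM, as it is in this cluster):

```shell
# Run on one of the good nodes: show the exact installed Vertica version
rpm -q vertica
```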

[root@ip-10-333-333-226 tmp]# rpm -ihv vertica-7.2.1-0.x86_64.RHEL6.rpm
Preparing...                          ################################# [100%]
Updating / installing...
   1:vertica-7.2.1-0                  ################################# [100%]

Vertica Analytic Database V7.2.1-0 successfully installed on host ip-10-333-333-226

To complete your NEW installation and configure the cluster, run:
 /opt/vertica/sbin/install_vertica

To complete your Vertica UPGRADE, run:
 /opt/vertica/sbin/update_vertica

Create the dbadmin user and the verticadba group.

  Run these commands on the new node (10.333.333.226).

useradd dbadmin
groupadd verticadba
gpasswd -a dbadmin verticadba

Enable password-less ssh access between the nodes for dbadmin user.

This has to be enabled between all the nodes.

  • I won't describe here how it is done, since it is trivial and there are many sources on the internet.
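Still, a minimal sketch of what it usually looks like (the key is written to a demo path here; on the real node you would use ~/.ssh/id_rsa, and the IPs are the ones from this cluster):

```shell
# As dbadmin: generate a key pair without a passphrase (-N '').
# Demo path used here; on the real node use -f ~/.ssh/id_rsa instead.
ssh-keygen -t rsa -b 2048 -N '' -q -f /tmp/dbadmin_id_rsa

# Then push the public key to every node so dbadmin can hop around
# without a password; repeat from each node so it works in both directions:
#   ssh-copy-id -i /tmp/dbadmin_id_rsa.pub dbadmin@10.333.333.210
#   ssh-copy-id -i /tmp/dbadmin_id_rsa.pub dbadmin@10.333.333.211
#   ssh-copy-id -i /tmp/dbadmin_id_rsa.pub dbadmin@10.333.333.226
```

Verify with something like `ssh dbadmin@10.333.333.210 hostname`; it should log in without prompting for a password.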

Create the same directory structure as in the lost node.

To see which directories you need, open the /opt/vertica/config/admintools.conf file on one of the good nodes.
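For example, the node definitions can be pulled out with sed. Assuming the usual [Nodes] layout, where each node line lists its host, catalog path and data path (the file content below is a hypothetical excerpt, re-created only for the demo; on a real system point sed at /opt/vertica/config/admintools.conf instead):

```shell
# Hypothetical excerpt of admintools.conf, re-created here for the demo only.
cat > /tmp/admintools.conf <<'EOF'
[Nodes]
v_db1_node0001 = 10.333.333.210,/vertica_catalog,/vertica_data
v_db1_node0002 = 10.333.333.211,/vertica_catalog,/vertica_data
v_db1_node0003 = 10.333.333.226,/vertica_catalog,/vertica_data
[SSHConfig]
EOF

# Print only the [Nodes] section: this is where the catalog and data
# paths for each node (including the lost node0003) are listed.
sed -n '/^\[Nodes\]/,/^\[SSHConfig\]/p' /tmp/admintools.conf
```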

# Create the directories
mkdir -p /vertica_catalog
mkdir -p /vertica_data
mkdir -p /mnt/vertica_temp

# Grant ownership to dbadmin
chown -R dbadmin:verticadba /vertica_catalog
chown -R dbadmin:verticadba /vertica_data
chown -R dbadmin:verticadba /mnt/vertica_temp

chown -R dbadmin:verticadba /opt/vertica

Copy admintools.conf file.

Copy the admintools.conf file from one of the good nodes onto the new node.

[dbadmin@ip-10-333-333-210 .ssh]$ scp /opt/vertica/config/admintools.conf dbadmin@10.333.333.226:/opt/vertica/config/admintools.conf
dbadmin@10.333.333.226's password:
admintools.conf

Create catalog location.

 You need to create the catalog location on the new node.

  • Since it is node 3 that we lost, we need to recreate the catalog directory structure exactly as it was before. Just check how it looks on one of the good nodes and replicate it with the node0003 naming.
[root@ip-10-333-333-226 vertica_catalog]# mkdir -p /vertica_catalog/db1/v_db1_node0003_catalog/Catalog

[root@ip-10-333-333-226 vertica_catalog]# chown -R dbadmin:verticadba /vertica_catalog

 Recover the database using the force option.

 Now that everything is ready, we need to recover the node. We use the --force option, which enables automatic recovery of the node.

[dbadmin@ip-10-333-333-226 v_db1_node0003_catalog]$ /opt/vertica/bin/admintools  -t restart_node -s 10.333.333.226 -d db1 --force
Info: no password specified, using none
*** Restarting nodes for database db1 ***
        restart host 10.333.333.226 with catalog v_db1_node0003_catalog
        issuing multi-node restart
        Starting nodes:
                v_db1_node0003 (10.333.333.226)
        Starting Vertica on all nodes. Please wait, databases with large catalog may take a while to initialize.
        Node Status: v_db1_node0001: (UP) v_db1_node0003: (DOWN)
        Node Status: v_db1_node0001: (UP) v_db1_node0003: (DOWN)
        Node Status: v_db1_node0001: (UP) v_db1_node0003: (RECOVERING)
        Node Status: v_db1_node0001: (UP) v_db1_node0003: (UP)
Restart Nodes result:  1
See if all nodes are up:
[dbadmin@ip-10-333-333-226 Catalog]$ /opt/vertica/bin/admintools -t list_allnodes
 Node           | Host          | State | Version         | DB
----------------+---------------+-------+-----------------+-----
 v_db1_node0001 | 10.333.333.210 | UP    | vertica-7.2.1.0 | db1
 v_db1_node0002 | 10.333.333.211 | UP    | vertica-7.2.1.0 | db1
 v_db1_node0003 | 10.333.333.226 | UP    | vertica-7.2.1.0 | db1

Conclusion:

  • Yes, it can be done! It is not a solution I recommend, just a workaround. But sometimes, when you get stuck, you have to handle it the way you can :).
Hope this was helpful!