Update mapping and reindex Elasticsearch indices

From Archivematica
Revision as of 04:15, 19 January 2021 by Sevein (talk | contribs)

This is now part of our official upgrading docs: https://www.archivematica.org/en/docs/latest/admin-manual/installation-setup/upgrading/upgrading/.

---

Backup Elasticsearch data

The easiest way to back up the Elasticsearch data is to copy the data directory while the service is stopped:

 sudo service elasticsearch stop
 sudo tar cvfz var_lib_elasticsearch.tgz /var/lib/elasticsearch
 sudo service elasticsearch start
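
Since the reindexing procedure relies on this backup for recovery, it may be worth confirming that the archive is readable before going further. The following is a minimal sketch, not part of the original procedure: backup_dir is a hypothetical helper name, and it stores paths relative to the parent directory (so a restore would extract into /var/lib).

```bash
#!/bin/bash
# Hypothetical helper: archive a directory, then confirm the archive can be listed.
# Usage: backup_dir <source_dir> <archive.tgz>
backup_dir() {
  local src="$1" archive="$2"
  # -C stores paths relative to the parent directory of the source.
  tar czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
  # Listing the archive fails if it is truncated or corrupt.
  tar tzf "$archive" > /dev/null || return 1
  echo "Backup written to $archive"
}

# Example, run with root privileges while Elasticsearch is stopped:
#   backup_dir /var/lib/elasticsearch var_lib_elasticsearch.tgz
```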

List indices to resize the Elasticsearch heap size when needed

Use the following command to list indices:

 curl -s -X GET 'http://localhost:9200/_cat/indices/%2A?v=&s=index:desc'

The output should show something like this:


 root@archivematica-test-server:~# curl -s -X GET 'http://localhost:9200/_cat/indices/%2A?v=&s=index:desc'
 health status index         uuid                   pri rep docs.count docs.deleted store.size pri.store.size
 yellow open   transfers     lYqkYjwZRy2XG8CP_3S3PQ   5   1          0            0      1.2kb          1.2kb
 yellow open   transferfiles K5gnDZyOQz2JdIeZ6adJsQ   5   1          0            0      1.2kb          1.2kb
 yellow open   aips          yAyK_koXThaZcWsBYfzN7w   5   1         17            0    101.4mb        101.4mb
 yellow open   aipfiles      TVrrX8jkRhWWxGfvK_M6zg   5   1      11987            0      2.9gb          2.9gb

Read the Elasticsearch heap size from /etc/default/elasticsearch (Ubuntu) or /etc/sysconfig/elasticsearch (CentOS):

 root@ny-gclibrary-test-release-1:~# grep ES_JAVA_OPTS= /etc/default/elasticsearch 
 #ES_JAVA_OPTS=
 ES_JAVA_OPTS="-Xms2g -Xmx2g"

The heap size in this example is 2 GB.

Ensure that the Elasticsearch heap size is greater than the largest store.size in the indices list. In our example the largest index (aipfiles) is 2.9 GB, so the heap should be raised to at least 3 GB:

  • Edit /etc/default/elasticsearch or /etc/sysconfig/elasticsearch.
  • Change ES_JAVA_OPTS to a bigger value. In our example:
 ES_JAVA_OPTS="-Xms3g -Xmx3g"
  • Restart the Elasticsearch service for the changes to take effect (sudo service elasticsearch restart).
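
The comparison between the heap size and the largest index can also be scripted. The sketch below is an illustration only: the function names are hypothetical, it assumes the column layout of the listing shown above (store.size in column 9), and it assumes sizes use binary units (1gb = 1024mb).

```bash
#!/bin/bash
# Read `_cat/indices?v` output on stdin and print the largest store.size in whole MB.
max_store_mb() {
  awk '
    function to_mb(s,   n) {
      s = tolower(s); n = s + 0          # awk keeps the leading number, drops the unit
      if (s ~ /tb$/) return n * 1024 * 1024
      if (s ~ /gb$/) return n * 1024
      if (s ~ /mb$/) return n
      if (s ~ /kb$/) return n / 1024
      return n / (1024 * 1024)           # plain bytes, e.g. "460b"
    }
    # Skip the header row and any short lines.
    $1 != "health" && NF >= 9 { mb = to_mb($9); if (mb > max) max = mb }
    END { printf "%d\n", max }
  '
}

# Print the -Xmx value of an ES_JAVA_OPTS string (e.g. "-Xms2g -Xmx2g") in whole MB.
heap_mb() {
  echo "$1" | awk '{
    if (match($0, /-Xmx[0-9]+[kKmMgG]/)) {
      v = substr($0, RSTART + 4, RLENGTH - 4)   # e.g. "2g"
      n = v + 0
      u = tolower(substr(v, length(v)))
      if (u == "g") n *= 1024
      else if (u == "k") n /= 1024
      printf "%d\n", n
    }
  }'
}

# Succeed when the heap (MB) is larger than the largest index (MB).
heap_is_sufficient() {
  [ "$1" -gt "$2" ]
}

# Example:
#   store=$(curl -s 'http://localhost:9200/_cat/indices/%2A?v=&s=index:desc' | max_store_mb)
#   heap=$(heap_mb '-Xms2g -Xmx2g')
#   heap_is_sufficient "$heap" "$store" || echo "Increase the heap before reindexing"
```

With the example listing, the largest index is about 2969 MB, so a 2g heap (2048 MB) is too small and a 3g heap (3072 MB) passes the check.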

Run script to reindex and use new mappings

Use the following script:

#!/bin/bash


es_url="http://localhost:9200"

index_list='aips aipfiles transfers transferfiles'

echo -e "\nIndex list before reindexing:\n"
curl -s -X GET "${es_url}/_cat/indices/%2A?v=&s=index:desc"
echo -e "\n"

# Copy each index into a temporary ${index}_new index with the _reindex API:
for index in $index_list; do
    echo "Reindex ${index} in ${index}_new..."
    curl -s -X POST \
      "${es_url}/_reindex" \
      -H 'Content-Type: application/json' \
      -d '{
      "source": {
        "index": "'"${index}"'"
      },
      "dest": {
        "index": "'"${index}_new"'"
      }
    }' > /dev/null
done

echo -e "\n\n"

echo -e "Index list after tmp indices creation\n"
# Keep the listing: it is used later to check which temporary indices exist.
indices_output=$(curl -s -X GET "${es_url}/_cat/indices/%2A?v=&s=index:desc")
echo "$indices_output"
echo -e "\n"

# Delete the original indices
for index in $index_list; do
  echo "Deleting ${index}..."
  curl -s -X DELETE "${es_url}/${index}" > /dev/null
done

#Restart archivematica-dashboard to create indices with new mappings
echo -e "\nRestarting archivematica-dashboard"
sudo service archivematica-dashboard restart

#Wait 30 seconds
echo "Wait 30 seconds to ensure dashboard has created the empty indices with new mapping"
sleep 30
echo -e "\n"

# When an index has no docs, _reindex does not create the new index (typically the transferfiles index),
# so check that the temporary index exists before reindexing from it.
# Reindex from the *_new indices:
for index in $index_list; do
  if echo "$indices_output" | grep -q "${index}_new"; then
    echo "Indexing ${index} using ${index}_new ..."
    curl -s -X POST \
      "${es_url}/_reindex" \
      -H 'Content-Type: application/json' \
      -d '{
      "source": {
        "index": "'"${index}_new"'"
      },
      "dest": {
        "index": "'"${index}"'"
      }
    }' > /dev/null
  fi
done

echo -e "\n"

# Delete the temporary indices
for index in $index_list; do
  if echo "$indices_output" | grep -q "${index}_new"; then
     echo "Deleting ${index}_new..."
     curl -s -X DELETE "${es_url}/${index}_new" > /dev/null
  fi
done

echo -e "\n\nReindexing done:\n"
curl -s -X GET "${es_url}/_cat/indices/%2A?v=&s=index:desc"
echo -e "\n"

In our example the script takes about 11 minutes and produces the following output:

root@archivematica-test-server:~# time ./script_reindex_new_map.sh 

Index list before reindexing:

health status index         uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   transfers     lYqkYjwZRy2XG8CP_3S3PQ   5   1          3            0     11.6kb         11.6kb
yellow open   transferfiles K5gnDZyOQz2JdIeZ6adJsQ   5   1          0            0      1.2kb          1.2kb
yellow open   aips          yAyK_koXThaZcWsBYfzN7w   5   1         17            0    101.4mb        101.4mb
yellow open   aipfiles      TVrrX8jkRhWWxGfvK_M6zg   5   1      12905            0      2.6gb          2.6gb


Reindex aips in aips_new...
Reindex aipfiles in aipfiles_new...
Reindex transfers in transfers_new...
Reindex transferfiles in transferfiles_new...



Index list after tmp indices creation

health status index         uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   transfers_new gdFevH8yRdiNTdrPcfo8Lg   5   1          0            0       460b           460b
yellow open   transfers     lYqkYjwZRy2XG8CP_3S3PQ   5   1          3            0     11.6kb         11.6kb
yellow open   transferfiles K5gnDZyOQz2JdIeZ6adJsQ   5   1          0            0      1.2kb          1.2kb
yellow open   aips_new      uJ-ehaYLTfe_1lOSErfu3Q   5   1         17            0     96.8mb         96.8mb
yellow open   aips          yAyK_koXThaZcWsBYfzN7w   5   1         17            0    101.4mb        101.4mb
yellow open   aipfiles_new  00Xxu7v2QvWsq92gM247xQ   5   1      12905            0      3.1gb          3.1gb
yellow open   aipfiles      TVrrX8jkRhWWxGfvK_M6zg   5   1      12905            0      2.6gb          2.6gb


Deleting aips...
Deleting aipfiles...
Deleting transfers...
Deleting transferfiles...

Restarting archivematica-dashboard
Wait 30 seconds to ensure dashboard has created the empty indices with new mapping


Indexing aips using aips_new ...
Indexing aipfiles using aipfiles_new ...
Indexing transfers using transfers_new ...


Deleting aips_new...
Deleting aipfiles_new...
Deleting transfers_new...


Reindexing done:

health status index         uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   transfers     FC7aSVPmSmmCc_LTv1AQRA   5   1          3            0      1.2kb          1.2kb
yellow open   transferfiles 5JMAft3FQwmosZQFi7eJNw   5   1          0            0      1.2kb          1.2kb
yellow open   aips          EtwXG3-4SO2Px-4QMRufXA   5   1         17            0    102.1mb        102.1mb
yellow open   aipfiles      -PFuzslgTeWJ4CWny8VZoA   5   1      12905            0        3gb            3gb



real	10m47.114s
user	0m0.068s
sys	0m0.032s

NOTE: The script can fail if Elasticsearch runs out of Java heap space (check the Elasticsearch logs, typically under /var/log/elasticsearch/). In that case the indices will be empty: restore /var/lib/elasticsearch from the backup, increase the Elasticsearch heap size and try again.
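
One way to confirm that a failed run was caused by heap exhaustion is to search the log for the JVM's out-of-memory error. This is only a sketch: log_has_oom is a hypothetical helper, and the log file path in the example is an assumption that depends on your installation and cluster name.

```bash
#!/bin/bash
# Hypothetical helper: succeed if the given log file mentions a JVM out-of-memory error.
log_has_oom() {
  # -F matches the string literally, -q suppresses output.
  grep -qF 'java.lang.OutOfMemoryError' "$1"
}

# Example (the path is an assumption; adjust it to your system):
#   if log_has_oom /var/log/elasticsearch/elasticsearch.log; then
#     echo "Heap exhausted: restore the backup and increase the heap size"
#   fi
```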

The script uses the Elasticsearch API and performs the following actions:

  • Reindex the transfers, transferfiles, aips and aipfiles indices into new temporary indices
  • Delete original indices
  • Restart archivematica-dashboard service to create empty indices with new mappings
  • Reindex from temporary indices
  • Delete temporary indices
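
The steps above delete the original indices without checking that the temporary copies are complete. A cautious variant could compare document counts first, using the listing the script already captures in indices_output. The sketch below is an assumption-laden illustration: docs_count and copy_is_complete are hypothetical names, and it relies on docs.count being column 7 as in the output above. Note that the listing can lag behind recent writes (in the example output transfers_new is listed with 0 documents although 3 are copied back later), so a failed check may simply mean the listing should be fetched again.

```bash
#!/bin/bash
# Print docs.count (column 7) for the index named $1, reading `_cat/indices?v` output on stdin.
docs_count() {
  awk -v idx="$1" '$3 == idx { print $7 }'
}

# Succeed only if <index>_new holds as many documents as <index>.
# Usage: copy_is_complete "<listing>" <index>
copy_is_complete() {
  local listing="$1" index="$2" src tmp
  src=$(echo "$listing" | docs_count "$index")
  tmp=$(echo "$listing" | docs_count "${index}_new")
  [ -n "$src" ] || return 1          # original index not found in the listing
  [ "$src" -eq 0 ] && return 0       # nothing to copy; no temporary index is created
  [ -n "$tmp" ] && [ "$src" -eq "$tmp" ]
}

# Example, placed before the loop that deletes the original indices:
#   for index in $index_list; do
#     copy_is_complete "$indices_output" "$index" || { echo "${index}_new is incomplete"; exit 1; }
#   done
```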

Restore Elasticsearch heap size when needed

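If you increased the heap size only for the reindexing, you can restore the previous value afterwards:

  • Edit /etc/default/elasticsearch or /etc/sysconfig/elasticsearch.
  • Change ES_JAVA_OPTS back to its previous value (in our example, ES_JAVA_OPTS="-Xms2g -Xmx2g").
  • Restart the Elasticsearch service (sudo service elasticsearch restart).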
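
The edit itself can be scripted if you prefer. The following sketch uses a hypothetical set_heap helper and assumes that the active ES_JAVA_OPTS line contains only the heap settings, since the whole line is replaced; commented-out lines are left untouched.

```bash
#!/bin/bash
# Hypothetical helper: set the heap size on the active (uncommented) ES_JAVA_OPTS line.
# Usage: set_heap <config_file> <size>, for example: set_heap /etc/default/elasticsearch 2g
# Assumption: any other options on that line are discarded.
set_heap() {
  local file="$1" size="$2" tmp
  tmp=$(mktemp) || return 1
  sed "s|^ES_JAVA_OPTS=.*|ES_JAVA_OPTS=\"-Xms${size} -Xmx${size}\"|" "$file" > "$tmp" || return 1
  # Write back in place so the file keeps its owner and permissions.
  cat "$tmp" > "$file" && rm -f "$tmp"
}

# Example, run with root privileges:
#   set_heap /etc/default/elasticsearch 2g && service elasticsearch restart
```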