Elasticsearch Development
Archivematica 0.9+ stores AIP file information, such as METS data, using Elasticsearch. This data can be searched from the Archival Storage area of the dashboard or accessed programmatically.
Programmatic access to indexed AIP data
To access indexed AIP data using a custom script or application, find an Elasticsearch interface library for the programming language you've chosen to use. In Archivematica we use Python with the pyes library. In our developer documentation, we'll outline the use of pyes to access AIP data, but any combination of programming language and interface library, such as PHP with Elastica, should work.
Connecting to Elasticsearch
On this page we'll run through an example of interfacing with Elasticsearch data using a Python script that leverages the pyes library.
The first step, when using pyes, is to import the module. The following code imports pyes functionality on a system on which Archivematica is installed.
import sys

# Add Archivematica's externals directory, which contains pyes, to the import path.
sys.path.append("/home/demo/archivematica/src/archivematicaCommon/lib/externals")

from pyes import *
Next you'll want to create a connection to Elasticsearch.
conn = ES('127.0.0.1:9200')
Full text searching
Once connected to Elasticsearch, you can perform searches. Below is the code needed to do a "wildcard" search for all AIP files indexed by Elasticsearch and retrieve the first 20 items. Instead of a wildcard you could supply keywords, such as a certain AIP UUID.
start_page = 1
items_per_page = 20
q = StringQuery('*')  # wildcard: match every indexed AIP file

results = None  # stays None if the query fails
try:
    results = conn.search_raw(
        query=q,
        indices='aips',
        type='aip',
        start=start_page - 1,
        size=items_per_page
    )
except Exception:
    print('Query error.')
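The example above only requests the first page. As a minimal sketch, assuming the start value is a zero-based item offset (as with Elasticsearch's own "from" parameter), a one-based page number could be converted into an offset before searching. The helper name page_to_offset is hypothetical and is not part of pyes or Archivematica.

```python
def page_to_offset(page, items_per_page):
    """Convert a one-based page number into a zero-based item offset.

    Hypothetical helper: assumes the search call expects an item offset,
    not a page number.
    """
    if page < 1 or items_per_page < 1:
        raise ValueError('page and items_per_page must be positive')
    return (page - 1) * items_per_page


# Offsets for the first and third pages when each page holds 20 items.
first_offset = page_to_offset(1, 20)
third_offset = page_to_offset(3, 20)
```

Under that assumption, the returned value would be passed as the start argument in place of a fixed number.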
Querying for specific data
While the "StringQuery" query type is good for broad searches, you may want to restrict a search to a specific field to reduce false positives. Below is an example that uses "TermQuery" to match a value in a specific field. Because Elasticsearch, by default, lowercases term values when indexing and a term query is not analyzed, the value searched for must also be lowercase.
import sys
sys.path.append("/usr/lib/archivematica/archivematicaCommon/externals")
import pyes

conn = pyes.ES('127.0.0.1:9200')

# The term value is given in lowercase to match the indexed term.
q = pyes.TermQuery("METS.amdSec.ns0:amdSec_list.@ID", "amdsec_8")

results = None  # stays None if the query fails
try:
    results = conn.search_raw(query=q, indices='aips')
except Exception:
    print('Query failed.')
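Since a term value with the wrong case will typically match nothing, it can help to normalise the value before building the query. The sketch below builds a raw Elasticsearch term query as a plain dictionary rather than through pyes. The helper name build_term_query is hypothetical, and lowercasing is only appropriate if the field was indexed with an analyzer that lowercases terms.

```python
import json


def build_term_query(field, value):
    """Return a raw Elasticsearch term query with a lowercased value.

    Hypothetical helper: assumes the field was indexed with an analyzer
    that lowercases terms.
    """
    return {'query': {'term': {field: value.lower()}}}


# The mixed-case ID is lowercased before it is placed in the query.
query = build_term_query('METS.amdSec.ns0:amdSec_list.@ID', 'amdSec_8')
print(json.dumps(query))
```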
Displaying search results
Now that you've performed a couple of searches, you can display some results. The logic below loops over each hit in the result set, each of which represents an AIP file, and prints the UUID of the AIP the file belongs to, the Elasticsearch document ID of the indexed file data, and the path of the file within the AIP.
# Defined before the check so later code can use it even when there are no results.
document_ids = []

if results:
    for item in results.hits.hits:
        aip = item._source
        print('AIP ID: ' + aip['AIPUUID'] + ' / Document ID: ' + item._id)
        print('Filepath: ' + aip['filePath'])
        print('')
        document_ids.append(item._id)
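As a rough sketch of the same idea, the gathering of hit data can be separated from the printing, so that a hit with a missing field does not raise an error. The example works on a plain dictionary shaped like a raw Elasticsearch response. The function name summarize_hits and all sample values are hypothetical.

```python
def summarize_hits(response):
    """Return (AIP UUID, document ID, file path) tuples, one per hit."""
    summaries = []
    for hit in response.get('hits', {}).get('hits', []):
        source = hit.get('_source', {})
        summaries.append(
            (source.get('AIPUUID'), hit.get('_id'), source.get('filePath'))
        )
    return summaries


# Hypothetical response with a single hit; real values will differ.
sample_response = {
    'hits': {
        'hits': [
            {
                '_id': 'example-doc-id',
                '_source': {
                    'AIPUUID': 'example-aip-uuid',
                    'filePath': 'objects/example.txt',
                },
            }
        ]
    }
}

for aip_uuid, doc_id, file_path in summarize_hits(sample_response):
    print('AIP ID: %s / Document ID: %s' % (aip_uuid, doc_id))
    print('Filepath: %s' % file_path)
```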
Fetching specific documents
If you want to get Elasticsearch data for a specific AIP file, you can use the Elasticsearch document ID. The code above populates the document_ids list; the code below uses it to retrieve each document and extract a specific item of data, the file's format name, from it.
# Index and type names as used in the searches above.
index_name = 'aips'
type_name = 'aip'

for document_id in document_ids:
    data = conn.get(index_name, type_name, document_id)
    # Walk the indexed METS structure down to the format name.
    format_name = (
        data['METS']['amdSec']['ns0:amdSec_list'][0]['ns0:techMD_list'][0]
        ['ns0:mdWrap_list'][0]['ns0:xmlData_list'][0]['ns1:object_list'][0]
        ['ns1:objectCharacteristics_list'][0]['ns1:format_list'][0]
        ['ns1:formatDesignation_list'][0]['ns1:formatName']
    )
    print('Format for document ID ' + document_id + ' is ' + format_name)
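A long chain of keys and list indexes like the one above raises an error as soon as one level is absent. One possible way to guard against that, sketched here with a hypothetical helper named get_nested and a trimmed, made-up document, is to walk the path step by step and fall back to a default.

```python
def get_nested(data, path, default=None):
    """Follow keys and list indexes in path; return default if a step is missing."""
    current = data
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Trimmed, hypothetical document; real indexed documents nest far deeper.
sample = {'METS': {'amdSec': {'ns0:amdSec_list': [{'@ID': 'amdSec_8'}]}}}

# The first path exists; the second asks for a list item that is not there.
found = get_nested(sample, ['METS', 'amdSec', 'ns0:amdSec_list', 0, '@ID'])
missing = get_nested(
    sample, ['METS', 'amdSec', 'ns0:amdSec_list', 1, '@ID'], 'unknown'
)
```

The same path list could hold the full sequence of keys used to reach the format name.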
Augmenting documents
To add data to an Elasticsearch document, you'll need the document ID. The following code uses an Elasticsearch query to find a document and update it with additional data. Note that the name of the field being added, "__public", is prefixed with two underscores. This practice prevents accidental overwriting of system or Archivematica-specific data; system data is prefixed with a single underscore.
import sys
sys.path.append("/usr/lib/archivematica/archivematicaCommon/externals")
import pyes

conn = pyes.ES('127.0.0.1:9200')
q = pyes.TermQuery("METS.amdSec.ns0:amdSec_list.@ID", "amdsec_8")

try:
    results = conn.search_raw(query=q, indices='aips')
    if results:
        for item in results.hits.hits:
            print('Updating ID: ' + item['_id'])
            document = item['_source']
            document['__public'] = 'yes'
            # Re-index the modified document under its existing ID.
            conn.index(document, 'aips', 'aip', item['_id'])
except Exception:
    print('Query or update failed.')
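The double-underscore naming convention can also be enforced in code before a document is re-indexed. The following is a minimal sketch that works on a plain dictionary. The helper name add_custom_field and the sample values are hypothetical, not part of Archivematica.

```python
def add_custom_field(document, name, value):
    """Return a copy of document with a double-underscore custom field set."""
    if not name.startswith('__'):
        raise ValueError('custom field names must start with two underscores')
    updated = dict(document)  # shallow copy; the input is left unchanged
    updated[name] = value
    return updated


# Hypothetical document source; real documents contain many more fields.
original = {'AIPUUID': 'example-aip-uuid', 'filePath': 'objects/example.txt'}
updated = add_custom_field(original, '__public', 'yes')
```

Returning a copy keeps the fetched document intact until the update is written back.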