Difference between revisions of "Elasticsearch Development"

From Archivematica

Revision as of 15:22, 17 August 2012

Archivematica 0.9+ stores AIP file information, such as METS data, in Elasticsearch. This data can be searched from the Archival Storage area of the dashboard or accessed programmatically.

Programmatic Access to indexed AIP data

To access indexed AIP data from a custom script or application, find an Elasticsearch client library for the programming language you've chosen. Archivematica uses Python with the pyes library, and this developer documentation outlines the use of pyes to access AIP data, but any combination of language and client library, such as PHP with Elastica, should work.

Connecting to Elasticsearch

On this page we'll run through an example of interfacing with Elasticsearch data using a Python script that leverages the pyes library.

The first step, when using pyes, is to import the module. The following code imports pyes on a system where Archivematica is installed.

import sys
sys.path.append("/home/demo/archivematica/src/archivematicaCommon/lib/externals")
from pyes import *

Next you'll want to create a connection to Elasticsearch.

conn = ES('127.0.0.1:9200')

Querying Elasticsearch

Once connected to Elasticsearch, you can search for specific data. The code below performs a "wildcard" search for all AIP files indexed by Elasticsearch and retrieves the first 20 items. Instead of a "wildcard" search you could also supply keywords, such as a specific AIP UUID.

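start_page     = 1
items_per_page = 20

q = StringQuery('*')

results = None  # stays None if the query fails, so later checks are safe
try:
    results = conn.search_raw(
        query=q,
        indices='aips',
        type='aip',
        # offset of the first hit on the requested page
        start=(start_page - 1) * items_per_page,
        size=items_per_page
    )
except Exception:
    print 'Query error.'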
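The paging arithmetic and the keyword variant mentioned above can be sketched without a running Elasticsearch server. This is only an illustration: the helper names `page_to_offset` and `aip_uuid_query_string` are hypothetical, the UUID is made up, and the `AIPUUID:"..."` form assumes the Lucene-style field syntax accepted by Elasticsearch query strings, with the field name borrowed from the result-display code on this page. The sketch is written to run on both Python 2 and Python 3.

```python
def page_to_offset(page, per_page):
    """Convert a 1-based page number to the 0-based offset of its first hit."""
    if page < 1 or per_page < 1:
        raise ValueError('page and per_page must be >= 1')
    return (page - 1) * per_page


def aip_uuid_query_string(aip_uuid):
    """Build a query string limited to one AIP (assumed field name: AIPUUID)."""
    # Escape backslashes and double quotes so the value stays one quoted phrase.
    escaped = aip_uuid.replace('\\', '\\\\').replace('"', '\\"')
    return 'AIPUUID:"%s"' % escaped


print(page_to_offset(3, 20))
print(aip_uuid_query_string('a-made-up-uuid'))
```

Under those assumptions, the returned string could be passed to `StringQuery` in place of `'*'`, and the offset used as the `start` argument of the search.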

Displaying search results

Now that you've performed a search, you can display some results. The logic below cycles through each hit in the results, each representing an AIP file, and prints the UUID of the AIP the file belongs to, the Elasticsearch document ID of the indexed file data, and the path of the file within the AIP.

if results:
    document_ids = []
    for item in results.hits.hits:
        aip = item._source
        print 'AIP ID: ' + aip['AIPUUID'] + ' / Document ID: ' + item._id
        print 'Filepath: ' + aip['filePath']
        print
        document_ids.append(item._id)
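To make the shape of the response easier to see, here is a minimal sketch that walks a hand-written stand-in for a raw search response. The sample dictionary and its UUID, IDs and paths are invented for illustration, and `summarize_hits` is a hypothetical helper. Only the `hits.hits`, `_id`, `_source`, `AIPUUID` and `filePath` names come from the code above. The sketch uses plain dictionary access rather than the attribute access shown above, so it runs without pyes.

```python
# Invented sample mimicking the layout of a raw Elasticsearch search response.
sample_results = {
    'hits': {
        'total': 2,
        'hits': [
            {'_id': 'doc-1',
             '_source': {'AIPUUID': 'uuid-A', 'filePath': 'objects/a.txt'}},
            {'_id': 'doc-2',
             '_source': {'AIPUUID': 'uuid-A', 'filePath': 'objects/b.jpg'}},
        ],
    },
}


def summarize_hits(results):
    """Return (AIP UUID, document ID, file path) tuples, one per hit."""
    rows = []
    for item in results['hits']['hits']:
        source = item['_source']
        rows.append((source['AIPUUID'], item['_id'], source['filePath']))
    return rows


for aip_uuid, document_id, file_path in summarize_hits(sample_results):
    print('AIP ID: %s / Document ID: %s / Filepath: %s'
          % (aip_uuid, document_id, file_path))
```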

Fetching specific documents

If you want to get Elasticsearch data for a specific AIP file, you can use the Elasticsearch document ID. The code above populates the document_ids list; the code below uses it to retrieve individual documents and extract a specific item of data from each one.

index_name = 'aips'  # index searched above
type_name = 'aip'    # document type searched above

for document_id in document_ids:
    data = conn.get(index_name, type_name, document_id)

    # walk the indexed METS structure down to the format name
    format_name = data['METS']['amdSec']['ns0:amdSec_list'][0]['ns0:techMD_list'][0]['ns0:mdWrap_list'][0]['ns0:xmlData_list'][0]['ns1:object_list'][0]['ns1:objectCharacteristics_list'][0]['ns1:format_list'][0]['ns1:formatDesignation_list'][0]['ns1:formatName']

    print 'Format for document ID ' + document_id + ' is ' + format_name
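The long chain of lookups above raises `KeyError` or `IndexError` as soon as one level is absent from a document. One possible way to guard against that is a small path-walking helper. The name `dig` and the shortened sample document are hypothetical; a real document would need the full path shown above.

```python
def dig(data, path, default=None):
    """Follow a sequence of dict keys and list indexes.

    Returns default as soon as any step along the path is missing.
    """
    current = data
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Shortened, invented stand-in for an indexed document.
sample = {'METS': {'amdSec': {'ns0:amdSec_list': [{'ns1:formatName': 'Plain text'}]}}}

# Path that exists in the sample.
print(dig(sample, ['METS', 'amdSec', 'ns0:amdSec_list', 0, 'ns1:formatName']))
# Path with a list index that does not exist, so the default is returned.
print(dig(sample, ['METS', 'amdSec', 'ns0:amdSec_list', 1, 'ns1:formatName'], 'unknown'))
```

Returning a default keeps a loop over many documents running when a single document lacks the expected metadata, at the cost of hiding which step was missing.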

Elasticsearch administration

Archivematica comes with a plugin for Elasticsearch, called Elasticsearch Head, that provides a web application for browsing and administering Elasticsearch data. It can be accessed at http://your.host.name:9200/_plugin/head/.

Elasticsearch Head lets you delete an index if need be. If you need to delete an Elasticsearch index programmatically instead, this can be done with pyes using the following code.

import sys
sys.path.append("/home/demo/archivematica/src/archivematicaCommon/lib/externals")
from pyes import *
conn = ES('127.0.0.1:9200')

try:
    conn.delete_index('aips')
except Exception:
    print "Error deleting index or index already deleted."
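Because deleting an index cannot be undone, a script may want a guard around the call. The following is a minimal sketch of one such guard, not part of Archivematica: `delete_index_if_allowed` and `FakeConnection` are hypothetical names, and the fake connection only stands in for a pyes connection so that the example runs without a server. The only pyes call assumed is `delete_index`, as used above.

```python
def delete_index_if_allowed(conn, index_name, allowed=('aips',)):
    """Delete index_name through conn.delete_index only if it is allow-listed.

    Returns True when the delete call succeeded, False otherwise.
    """
    if index_name not in allowed:
        return False
    try:
        conn.delete_index(index_name)
    except Exception:
        return False
    return True


class FakeConnection(object):
    """Stand-in for a pyes connection; it only records requested deletions."""

    def __init__(self):
        self.deleted = []

    def delete_index(self, name):
        self.deleted.append(name)


fake = FakeConnection()
print(delete_index_if_allowed(fake, 'aips'))   # allow-listed, so deletion is attempted
print(delete_index_if_allowed(fake, 'other'))  # not allow-listed, so nothing is deleted
```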