Archivematica - User contributions [en]

Elasticsearch Development

2019-02-15T17:24:47Z

Jraddaoui:

From Archivematica 0.9, AIP package information, such as METS data, is indexed
using [http://www.elasticsearch.org/ Elasticsearch (ES)]. This data can be
searched from the Archival Storage area of the dashboard or can be interfaced
with programmatically. For Elasticsearch administration information, such as
how to delete an Elasticsearch index, please reference the
[[Administrator_manual_1.2#Elasticsearch|administrator manual]].

'''NB.''' From Archivematica 1.7.0, users can choose whether to index
information using Elasticsearch, so the information below might not apply. It
depends on how your Archivematica instance has been configured.

'''NB.''' In Archivematica 1.9.0, the Elasticsearch version support has been
upgraded from ES 1.x to the 6.x version. Check [https://wiki.archivematica.org/Elasticsearch_Development_1.9 this page]
if you're running that Archivematica version or higher.

=Programmatic Access to indexed AIP data=

To access indexed AIP data using a custom script or application, find an
Elasticsearch API (Application Programming Interface) library for the
programming language you are most comfortable with. In Archivematica we use
Python with the Elasticsearch supported
[https://github.com/elastic/elasticsearch-py library]. In our developer
documentation, we will demonstrate how to use this and Python to access AIP
data, but any programming language, such as PHP and
[https://github.com/ruflin/Elastica/ Elastica], should work.

A list of [https://www.elastic.co/guide/en/elasticsearch/client/index.html officially]
supported and
[https://www.elastic.co/guide/en/elasticsearch/client/community/current/index.html community]
supported libraries can be found on the Elasticsearch website.

==Connecting to Elasticsearch (Archivematica 1.3+)==

The following example will demonstrate access to the indexes using
Elasticsearch's own
[https://github.com/elastic/elasticsearch-py Python library].

===Importing the Elasticsearch API module===

The first step is to import the Elasticsearch module and connect to the
Elasticsearch server.

<pre>
from __future__ import print_function
import sys

from elasticsearch import Elasticsearch
from elasticsearch.exceptions import RequestError, NotFoundError

conn = Elasticsearch(['127.0.0.1:9200'])
</pre>

'''NB.''' The additional module imports are included here so that the
examples below can be copied and pasted as desired.

===Full text searching===

Once connected to Elasticsearch, you can perform searches. Below is the code
needed to do a wildcard ('*') search for all indexed AIP files. We retrieve the
first 20 items. Instead of providing a wildcard you could also supply keywords,
such as a specific AIP UUID.

<pre>
start_page = 1
items_per_page = 20

# ref: string query, https://git.io/vhgUw
wildcard_query = {
    "query": {
        "query_string": {
            "query": "*",
        },
    }
}

try:
    results = conn.search(
        body=wildcard_query,
        index="aips",
        doc_type="aip",
        # Offset of the first hit on the requested page.
        from_=(start_page - 1) * items_per_page,
        size=items_per_page,
    )
except RequestError:
    print("Query error")
    sys.exit()
except NotFoundError:
    print("No results found")
    sys.exit()
</pre>

There are a number of ways to construct Elasticsearch queries. The
Elasticsearch website provides useful reference material:
[https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html Elasticsearch Full Text Queries].
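
As a minimal sketch of the keyword variant mentioned above, the helper below builds a <code>query_string</code> body for arbitrary keywords. The function name <code>build_keyword_query</code> is hypothetical (it is not part of Archivematica or elasticsearch-py), and quoting the keywords as a phrase is an assumption made so that a value such as a UUID is searched as a whole.

```python
import json


def build_keyword_query(keywords=""):
    """Return a query_string body; fall back to "*" when keywords is empty.

    Hypothetical helper, not part of Archivematica or elasticsearch-py.
    """
    if not keywords:
        return {"query": {"query_string": {"query": "*"}}}
    # Escape backslashes and double quotes, then quote the whole phrase.
    escaped = keywords.replace("\\", "\\\\").replace('"', '\\"')
    return {"query": {"query_string": {"query": '"{}"'.format(escaped)}}}


body = build_keyword_query("f7428196-a11b-4093-b311-d43d607d54ca")
print(json.dumps(body, indent=2))
```

The resulting dictionary can be passed as the <code>body</code> argument of <code>conn.search</code> in place of the wildcard query.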

===Querying for specific data===

While the string query-type is good for broad searches, you may want to
narrow a search down to a specific field of data to reduce false positives.
Below is an example of searching documents, using a "term" query to match
criteria within specific data.

<pre>
start_page = 1
items_per_page = 20

# ref: term query, https://git.io/vhrI9
term_query = {
    "query": {
        "constant_score": {
            "filter": {
                "term": {
                    "mets.ns0:mets_dict_list.ns0:amdSec_dict_list.@ID":
                        "amdsec_8"
                }
            }
        }
    }
}

try:
    results = conn.search(
        body=term_query,
        index="aips",
        doc_type="aip",
        # Offset of the first hit on the requested page.
        from_=(start_page - 1) * items_per_page,
        size=items_per_page,
    )
except RequestError:
    print("Query error")
    sys.exit()
except NotFoundError:
    print("No results found")
    sys.exit()
</pre>

Note that constructing the query is not straightforward. Fields and values
are stored in Elasticsearch in lowercase, whereas properties work in uppercase
and might not work in lowercase. You can analyze a query with a
<code>curl</code> statement along the lines of:

<pre>
curl -X GET "http://127.0.0.1:9200/_analyze?pretty=true" \
-H 'Content-Type: application/json' -d'
{
"field": "METS.ns0:mets_dict_list.ns0:amdSec_dict_list.@ID",
"text": "amdSec_8"
}'
</pre>

The result is an analysis of the query string we want to use. It can reveal
common mistakes in our intuition, for example, searching for
<code>amdSec_8</code> as a mixed-case string.
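
One way to avoid the mixed-case mistake is to lowercase the value before building the term query. The sketch below assumes the field was indexed with an analyzer that lowercases terms, as described above; <code>build_term_query</code> is a hypothetical helper name.

```python
def build_term_query(field, value):
    """Return a constant_score term query with the value lowercased.

    Hypothetical helper; assumes the indexed terms are lowercase.
    """
    return {
        "query": {
            "constant_score": {
                "filter": {"term": {field: value.lower()}}
            }
        }
    }


query = build_term_query(
    "mets.ns0:mets_dict_list.ns0:amdSec_dict_list.@ID", "amdSec_8"
)
print(query["query"]["constant_score"]["filter"]["term"])
```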

===Displaying search results===

Now that you have performed a couple of searches, you can display some results.
The logic below cycles through each hit in a results set. For each AIP file,
the UUID and filepath of the AIP are printed to the console.

<pre>
res = results.get("hits")
if res is not None:
    for r_ in res.items():
        if r_[0] == "total":
            print("Total results:", r_[1])
        if r_[0] == "hits":
            print("Results returned:", len(r_[1]))
            for aip_index in r_[1]:
                # aip_index will be the complete AIP record as a Python dict
                if aip_index.get("_source"):
                    print(
                        "AIP ID: {}".format(
                            aip_index.get("_source").get("uuid")
                        )
                    )
                    print(
                        "Filepath: {}".format(
                            aip_index.get("_source").get("filePath")
                        )
                    )
</pre>
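The same traversal can be written as a small function that works on any response dictionary, which makes it possible to try the logic without a running Elasticsearch server. This is only a sketch: <code>summarise_hits</code> is a hypothetical name, and the sample response is hand-written with made-up values.

```python
def summarise_hits(results):
    """Return (total, [(uuid, file_path), ...]) from a search response dict."""
    hits = results.get("hits") or {}
    rows = []
    for hit in hits.get("hits", []):
        source = hit.get("_source") or {}
        rows.append((source.get("uuid"), source.get("filePath")))
    return hits.get("total", 0), rows


# Hand-written sample shaped like a search response; the values are made up.
sample = {
    "hits": {
        "total": 1,
        "hits": [
            {"_source": {"uuid": "example-uuid", "filePath": "/tmp/example.7z"}}
        ],
    }
}
total, rows = summarise_hits(sample)
print("Total results:", total)
for uuid, file_path in rows:
    print("AIP ID: {} / Filepath: {}".format(uuid, file_path))
```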

===Fetching specific documents===

The AIP index inside Archivematica is separated into two Elasticsearch document
types: the AIP as a whole, and its individual files.

* ''aip''
* ''aipfile''

If you know the UUID of a specific digital object, it might be easier to
retrieve information about it by querying the <code>aipfile</code> document
type instead. The example below shows how we might retrieve the format
identification for the file with the UUID
''f7428196-a11b-4093-b311-d43d607d54ca'':

<pre>
from __future__ import print_function
import sys

from elasticsearch import Elasticsearch
from elasticsearch.exceptions import RequestError, NotFoundError

conn = Elasticsearch(["http://127.0.0.1:9200"])

start_page = 1
items_per_page = 20

term_query = {
    "query": {
        "constant_score": {
            "filter": {
                "term": {"FILEUUID": "f7428196-a11b-4093-b311-d43d607d54ca"}
            }
        }
    }
}

try:
    results = conn.search(
        body=term_query,
        index="aips",
        doc_type="aipfile",
        # Offset of the first hit on the requested page.
        from_=(start_page - 1) * items_per_page,
        size=items_per_page,
    )
except RequestError:
    print("Query error")
    sys.exit()
except NotFoundError:
    print("No results found")
    sys.exit()

res = results.get("hits")
if res is not None:
    for res_ in res.get("hits"):
        file_record = res_.get("_source")
        if file_record:
            try:
                # Walk the nested METS structure down to the format key.
                puid = file_record["METS"]["amdSec"] \
                    ["ns0:amdSec_dict_list"][0] \
                    ["ns0:techMD_dict_list"][0] \
                    ["ns0:mdWrap_dict_list"][0] \
                    ["ns0:xmlData_dict_list"][0] \
                    ["ns1:object_dict_list"][0] \
                    ["ns1:objectCharacteristics_dict_list"][0] \
                    ["ns1:format_dict_list"][0] \
                    ["ns1:formatRegistry_dict_list"][0] \
                    ["ns1:formatRegistryKey"]
                print("Format ID: {}".format(puid))
            except (KeyError, IndexError):
                print("Problem accessing index.")
                sys.exit(1)
</pre>
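Long chains of keys and list indexes like the one above fail as soon as one level is missing. A small lookup helper can make this easier to handle. The sketch below is an illustration only: <code>dig</code> is a hypothetical name, and the sample record is made up and much shorter than a real indexed document.

```python
def dig(record, path, default=None):
    """Follow a list of keys and indexes through nested dicts and lists.

    Returns `default` as soon as one step of the path is missing.
    """
    current = record
    for key in path:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Made-up, heavily shortened record for illustration only.
record = {"METS": {"amdSec": {"ns0:amdSec_dict_list": [{"@ID": "amdSec_8"}]}}}

print(dig(record, ["METS", "amdSec", "ns0:amdSec_dict_list", 0, "@ID"]))
print(dig(record, ["METS", "amdSec", "missing", 0], default="unknown"))
```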

===Archivematica Transfer Index===

Information about Transfers is also indexed in Archivematica. The information
is less rich than what is stored for the AIP, so it is not covered in detail
here. We can use the wildcard query from the AIP examples above to begin to
look at what is in the transfer index. The query would look as follows:

<pre>
from __future__ import print_function
import sys

from elasticsearch import Elasticsearch
from elasticsearch.exceptions import RequestError, NotFoundError

conn = Elasticsearch(["http://127.0.0.1:62002"])

start_page = 1
items_per_page = 20

# ref: string query, https://git.io/vhgUw
wildcard_query = {
    "query": {
        "query_string": {
            "query": "*",
        },
    }
}

try:
    results = conn.search(
        body=wildcard_query,
        index="transfers",
        doc_type="transfer",
        # Offset of the first hit on the requested page.
        from_=(start_page - 1) * items_per_page,
        size=items_per_page,
    )
except RequestError:
    print("Query error")
    sys.exit()
except NotFoundError:
    print("No results found")
    sys.exit()

res = results.get("hits")
if res is not None:
    for res_ in res.get("hits"):
        print(res_.get("_source"))
</pre>

===Further reading===

Elasticsearch provides API functions beyond
[https://www.elastic.co/guide/en/elasticsearch/reference/current/search.html searching].
Users who wish to make use of these capabilities in Python
can look at the Python [https://elasticsearch-py.readthedocs.io/en/master/ library documentation].

The complete [https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html Elasticsearch documentation]
reference is also available. Many of the commands are described with examples
that can be run using the [https://curl.haxx.se/ curl] command line tool.

==Connecting to Elasticsearch (Archivematica 0.9 up to Archivematica 1.2)==

Here we will run through an example of interfacing with older versions of
Archivematica using Elasticsearch with a Python script that leverages the pyes
library.

'''NB.''' Pyes use was [https://github.com/artefactual/archivematica/commit/b0ac6c642a2f070fc7a0f7c198a51c2d0509b7f7 removed]
in Archivematica 1.3, though with some modification it should still be
possible to adapt the examples below to query the ES indexes.

===Importing the pyes module===

The first step, when using pyes, is to require the module. The following code
imports pyes functionality on a system on which Archivematica is installed.

<pre>
import sys
sys.path.append("/home/demo/archivematica/src/archivematicaCommon/lib/externals")
from pyes import *
</pre>

Next you'll want to create a connection to Elasticsearch.

<pre>
conn = ES('127.0.0.1:9200')
</pre>

===Full text searching===

Once connected to Elasticsearch, you can perform searches. Below is the code
needed to do a "wildcard" search for all AIP files indexed by Elasticsearch and
retrieve the first 20 items. Instead of doing a "wildcard" search you could
also supply keywords, such as a certain AIP UUID.

<pre>
start_page = 1
items_per_page = 20

q = StringQuery('*')

try:
    results = conn.search_raw(
        query=q,
        indices='aips',
        type='aip',
        # Offset of the first hit on the requested page.
        start=(start_page - 1) * items_per_page,
        size=items_per_page
    )
except Exception:
    print 'Query error.'
</pre>

===Querying for specific data===

While the "StringQuery" query type is good for broad searches, you may want to
narrow a search down to a specific field of data to reduce false positives.
Below is an example of searching documents, using "TermQuery", matching
criteria within specific data. By default, Elasticsearch stores term values
in lowercase, so the term value searched for must also be lowercase.

<pre>
import sys
sys.path.append("/usr/lib/archivematica/archivematicaCommon/externals")
import pyes

conn = pyes.ES('127.0.0.1:9200')

q = pyes.TermQuery("METS.amdSec.ns0:amdSec_list.@ID", "amdsec_8")

try:
    results = conn.search_raw(query=q, indices='aips')
except Exception:
    print 'Query failed.'
</pre>

===Displaying search results===

Now that you've performed a couple of searches, you can display some results.
The logic below cycles through each hit in a results set, each representing
an AIP file, and prints the UUID of the AIP the file belongs to, the
Elasticsearch document ID corresponding to the indexed file data, and the path
of the file within the AIP.

<pre>
if results:
    document_ids = []
    for item in results.hits.hits:
        aip = item._source
        print 'AIP ID: ' + aip['AIPUUID'] + ' / Document ID: ' + item._id
        print 'Filepath: ' + aip['filePath']
        print
        document_ids.append(item._id)
</pre>

===Fetching specific documents===

If you want to get Elasticsearch data for a specific AIP file, you can use the
Elasticsearch document ID. The above code populates the
<code>document_ids</code> array and the below code uses this data, retrieving
individual documents and extracting a specific item of data from each document.

<pre>
# Index and document type used by the searches above.
index_name = 'aips'
type_name = 'aip'

for document_id in document_ids:
    data = conn.get(index_name, type_name, document_id)

    # Walk the nested METS structure down to the format name.
    format_name = data['METS']['amdSec'] \
        ['ns0:amdSec_list'][0] \
        ['ns0:techMD_list'][0] \
        ['ns0:mdWrap_list'][0] \
        ['ns0:xmlData_list'][0] \
        ['ns1:object_list'][0] \
        ['ns1:objectCharacteristics_list'][0] \
        ['ns1:format_list'][0] \
        ['ns1:formatDesignation_list'][0] \
        ['ns1:formatName']

    print 'Format for document ID ' + document_id + ' is ' + format_name
</pre>

===Augmenting documents===

To add additional data to an Elasticsearch document, you will need the document
ID. The following code shows an Elasticsearch query being used to find a
document and update it with additional data. Note that the name of the data
field being added, "__public", is prefixed with two underscores. This practice
prevents the accidental overwriting of system or Archivematica-specific data.
System data is prefixed with a single underscore.

<pre>
import sys
sys.path.append("/usr/lib/archivematica/archivematicaCommon/externals")
import pyes

conn = pyes.ES('127.0.0.1:9200')

q = pyes.TermQuery("METS.amdSec.ns0:amdSec_list.@ID", "amdsec_8")

results = conn.search_raw(query=q, indices='aips')

try:
    if results:
        for item in results.hits.hits:
            print 'Updating ID: ' + item['_id']

            document = item['_source']
            document['__public'] = 'yes'
            conn.index(document, 'aips', 'aip', item['_id'])
except Exception:
    print 'Query failed.'
</pre>

Elasticsearch Development

2019-02-15T17:24:04Z

Jraddaoui:

Elasticsearch Development 1.9

2019-02-15T17:22:58Z

Jraddaoui: Created page with "Work in progress."

Work in progress.

Storage Service API

2017-12-29T01:42:20Z

Jraddaoui: Add browsing notes


The [[Storage Service]] API provides programmatic access to moving files around in storage areas that the Storage Service has access to.

The API is written using [http://django-tastypie.readthedocs.io/en/latest/ TastyPie].

{| class="wikitable" style="background-color:#ffeecc;" cellpadding="10";
| Improvement Note: TastyPie is less well supported than [http://www.django-rest-framework.org/ Django REST Framework], both in terms of docs & community. We should look at replacing TastyPie with DRF.
|}

Endpoints require authentication with a username and API key. These can be submitted as GET parameters (e.g. <code>?username=test&api_key=e6282adabed84e39ffe451f8bf6ff1a67c1fc9f2</code>) or as a header (e.g. <code>Authorization: ApiKey test:e6282adabed84e39ffe451f8bf6ff1a67c1fc9f2</code>).
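
As an illustration, the two ways of submitting credentials could be built in Python as follows. This is a sketch using only the standard library; the function names are hypothetical and the key is the example value from the text above.

```python
from urllib.parse import urlencode


def api_key_header(username, api_key):
    """Return the Authorization header in the "ApiKey user:key" form."""
    return {"Authorization": "ApiKey {}:{}".format(username, api_key)}


def api_key_query(username, api_key):
    """Return the same credentials as a GET query string (without the "?")."""
    return urlencode([("username", username), ("api_key", api_key)])


print(api_key_header("test", "e6282adabed84e39ffe451f8bf6ff1a67c1fc9f2"))
print(api_key_query("test", "e6282adabed84e39ffe451f8bf6ff1a67c1fc9f2"))
```

The header form keeps the key out of URLs, which may end up in server logs.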

== A note about browsing ==

A detailed schema can be found for each resource by appending "schema/" to its "get all" URL.

Example:
$ curl -X GET -H"Authorization: ApiKey test:95141fc645ed97a95893f1f865d24687f89a27ad" 'http://localhost:8000/api/v2/location/schema/?format=json'
{
"allowed_detail_http_methods": [
"get",
"post"
],
"allowed_list_http_methods": [
"get"
],
"default_format": "application/json",
"default_limit": 20,
"fields": {
"description": {
"blank": false,
"default": "No default provided.",
"help_text": "Unicode string data. Ex: \"Hello World\"",
"nullable": false,
"primary_key": false,
"readonly": true,
"type": "string",
"unique": false,
"verbose_name": "description"
},
"enabled": {
"blank": true,
"default": true,
"help_text": "True if space can be accessed.",
"nullable": false,
"primary_key": false,
"readonly": false,
"type": "boolean",
"unique": false,
"verbose_name": "Enabled"
},
"path": {
"blank": false,
"default": "No default provided.",
"help_text": "Unicode string data. Ex: \"Hello World\"",
"nullable": false,
"primary_key": false,
"readonly": true,
"type": "string",
"unique": false,
"verbose_name": "path"
},
"pipeline": {
"blank": false,
"default": "No default provided.",
"help_text": "Many related resources. Can be either a list of URIs or list of individually nested resource data.",
"nullable": false,
"primary_key": false,
"readonly": false,
"related_schema": "/api/v2/pipeline/schema/",
"related_type": "to_many",
"type": "related",
"unique": false,
"verbose_name": "pipeline"
},
"purpose": {
"blank": false,
"default": "No default provided.",
"help_text": "Purpose of the space. Eg. AIP storage, Transfer source",
"nullable": false,
"primary_key": false,
"readonly": false,
"type": "string",
"unique": false,
"verbose_name": "Purpose"
},
"quota": {
"blank": false,
"default": null,
"help_text": "Size, in bytes (optional)",
"nullable": true,
"primary_key": false,
"readonly": false,
"type": "string",
"unique": false,
"verbose_name": "Quota"
},
"relative_path": {
"blank": false,
"default": "",
"help_text": "Path to location, relative to the storage space's path.",
"nullable": false,
"primary_key": false,
"readonly": false,
"type": "string",
"unique": false,
"verbose_name": "Relative Path"
},
"resource_uri": {
"blank": false,
"default": "No default provided.",
"help_text": "Unicode string data. Ex: \"Hello World\"",
"nullable": false,
"primary_key": false,
"readonly": true,
"type": "string",
"unique": false,
"verbose_name": "resource uri"
},
"space": {
"blank": false,
"default": "No default provided.",
"help_text": "A single related resource. Can be either a URI or set of nested resource data.",
"nullable": false,
"primary_key": false,
"readonly": false,
"related_schema": "/api/v2/space/schema/",
"related_type": "to_one",
"type": "related",
"unique": false,
"verbose_name": "space"
},
"used": {
"blank": false,
"default": 0,
"help_text": "Amount used, in bytes.",
"nullable": false,
"primary_key": false,
"readonly": false,
"type": "string",
"unique": false,
"verbose_name": "Used"
},
"uuid": {
"blank": true,
"default": "",
"help_text": "Unique identifier",
"nullable": false,
"primary_key": false,
"readonly": false,
"type": "string",
"unique": true,
"verbose_name": "uuid"
}
},
"filtering": {
"pipeline": 2,
"purpose": 1,
"quota": 1,
"relative_path": 1,
"space": 2,
"used": 1,
"uuid": 1
}
}

This schema describes, among other things, the fields in the resource (including the schema URI of related resource fields) and the fields that allow filtering. A filtering value is either a list of Django ORM filters (e.g. startswith, exact, lte) or the number 1 or 2. In TastyPie, 1 means that all filter types are allowed on the field, and 2 means that filtering may also span the fields of the related resource. For example, locations can be filtered by their pipeline UUID by joining the field names with two underscores in a request parameter: <code>/api/v2/location/?pipeline__uuid=<uuid></code>

For more info on how to interact with the API see:

http://django-tastypie.readthedocs.io/en/v0.13.1/interacting.html
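
The double-underscore filter parameters described above can be assembled with the standard library. The sketch below only builds the request path; <code>filtered_url</code> is a hypothetical helper and the UUID is the example pipeline UUID used elsewhere on this page.

```python
from urllib.parse import urlencode


def filtered_url(resource, **filters):
    """Build a "get all" path for a resource with optional filter parameters.

    Filter names use the field__lookup form, e.g. pipeline__uuid.
    """
    path = "/api/v2/{}/".format(resource)
    if not filters:
        return path
    # Sort the filters so that the generated URL is deterministic.
    return path + "?" + urlencode(sorted(filters.items()))


print(filtered_url("location", pipeline__uuid="dd354557-9e6e-4918-9fe3-a65b00ecb1af"))
print(filtered_url("pipeline", description__startswith="Archivematica"))
```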

== Pipeline ==

=== Get all pipelines ===

* '''URL''': <code>/api/v2/pipeline/</code>
* '''Verb''': GET
* '''Parameters''': Query string parameters
** <code>description</code>: Description of the pipeline
** <code>uuid</code>: UUID of the pipeline
* '''Response''': JSON
** <code>meta</code>: Metadata on the response: number of hits, pagination information
** <code>objects</code>: List of pipelines. See [[#Get pipeline details]] for format

Returns information about all the pipelines in the system. Can be [http://django-tastypie.readthedocs.io/en/latest/resources.html#basic-filtering filtered] by the description or uuid. Disabled pipelines are not returned.

Example:
$ curl -X GET -H"Authorization: ApiKey test:95141fc645ed97a95893f1f865d24687f89a27ad" 'http://localhost:8000/api/v2/pipeline/?description__startswith=Archivematica' | python -m json.tool
{
"meta": {
"limit": 20,
"next": null,
"offset": 0,
"previous": null,
"total_count": 1
},
"objects": [
{
"description": "Archivematica on alouette",
"remote_name": "127.0.0.1",
"resource_uri": "/api/v2/pipeline/dd354557-9e6e-4918-9fe3-a65b00ecb1af/",
"uuid": "dd354557-9e6e-4918-9fe3-a65b00ecb1af"
}
]
}

=== Create new pipeline ===

* '''URL''': <code>/api/v2/pipeline/</code>
* '''Verb''': POST
* '''Parameters''': JSON body
** Should contain fields for a new pipeline: <code>uuid</code>, <code>description</code>, <code>api_key</code>, <code>api_username</code>
** <code>create_default_locations</code>: If True, will associate default [[Storage Service#Locations | Locations]] with the newly created pipeline
** <code>shared_path</code>: If default locations are created, create the [[Storage Service#Currently Processing | processing]] location at this path in the local filesystem
** <code>remote_name</code>: IP or hostname of the pipeline. If not provided and <code>create_default_locations</code> is set, will try to populate from the IP of the request.
* '''Response''': JSON with data for the pipeline

If the 'Pipelines disabled on creation' setting is set, the pipeline will be disabled by default, and will not respond to queries.

Example:
$ curl -X POST -H"Authorization: ApiKey test:95141fc645ed97a95893f1f865d24687f89a27ad" -H"Content-Type: application/json" -d'{"uuid": "99354557-9e6e-4918-9fe3-a65b00ecb199", "description": "Test pipeline", "create_default_locations": true, "api_username": "demo", "api_key": "03ecb307f5b8012f4771d245d534830378a87259"}' 'http://192.168.1.42:8000/api/v2/pipeline/'
{
"create_default_locations": true,
"description": "Test pipeline",
"remote_name": "192.168.1.42",
"resource_uri": "/api/v2/pipeline/99354557-9e6e-4918-9fe3-a65b00ecb199/",
"uuid": "99354557-9e6e-4918-9fe3-a65b00ecb199"
}
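
The JSON body for this request could be assembled in Python as sketched below. <code>new_pipeline_payload</code> is a hypothetical helper; leaving out optional fields that were not supplied is an assumption about what the endpoint accepts, based on the parameter list above. The values are the ones from the curl example.

```python
import json


def new_pipeline_payload(uuid, description, api_username, api_key,
                         create_default_locations=False,
                         shared_path=None, remote_name=None):
    """Return the JSON body for creating a pipeline, as a string."""
    payload = {
        "uuid": uuid,
        "description": description,
        "api_username": api_username,
        "api_key": api_key,
        "create_default_locations": create_default_locations,
    }
    # Optional fields are only sent when a value was supplied.
    if shared_path is not None:
        payload["shared_path"] = shared_path
    if remote_name is not None:
        payload["remote_name"] = remote_name
    return json.dumps(payload, sort_keys=True)


body = new_pipeline_payload(
    "99354557-9e6e-4918-9fe3-a65b00ecb199",
    "Test pipeline",
    "demo",
    "03ecb307f5b8012f4771d245d534830378a87259",
    create_default_locations=True,
)
print(body)
```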

=== Get pipeline details ===

* '''URL''': <code>/api/v2/pipeline/<UUID>/</code>
* '''Verb''': GET
* '''Parameters''': None
* '''Response''': JSON
** <code>description</code>: Pipeline description
** <code>remote_name</code>: IP or hostname of the pipeline. For use in API calls
** <code>resource_uri</code>: URI for this pipeline in the API
** <code>uuid</code>: UUID of the pipeline

== Space ==

{| class="wikitable" style="background-color:#ffeecc;" cellpadding="10";
| Improvement Note: Is there no way to create Spaces in the API?
|}

=== Get all spaces ===

* '''URL''': <code>/api/v2/space/</code>
* '''Verb''': GET
* '''Parameters''': Query string parameters
** <code>access_protocol</code>: Protocol that the [[Storage Service#Space | Space]] uses. Must be searched based on the database code.
** <code>path</code>: Space's path
** <code>size</code>: Maximum size in bytes. Can use greater than (size__gt=1024), less than (size__lt=1024), and other Django [https://docs.djangoproject.com/en/1.8/ref/models/querysets/#field-lookups field lookups].
** <code>used</code>: Bytes stored in this space. Can use greater than (used__gt=1024), less than (used__lt=1024), and other Django [https://docs.djangoproject.com/en/1.8/ref/models/querysets/#field-lookups field lookups].
** <code>uuid</code>: UUID of the Space
* '''Response''': JSON
** <code>meta</code>: Metadata on the response: number of hits, pagination information
** <code>objects</code>: List of spaces. See [[#Get space details]] for format

Returns information about all the spaces in the system. Can be [http://django-tastypie.readthedocs.io/en/latest/resources.html#basic-filtering filtered] by several fields: access protocol, path, size, amount used, UUID and verified status. Disabled spaces are not returned.

=== Get space details ===

* '''URL''': <code>/api/v2/space/<UUID>/</code>
* '''Verb''': GET
* '''Parameters''': None
* '''Response''': JSON
** <code>access_protocol</code>: Database code for the access protocol
** <code>last_verified</code>: Date of last verification. This is a stub feature
** <code>path</code>: Space's path
** <code>resource_uri</code>: URI to the resource in the API
** <code>size</code>: Maximum size of the space in bytes.
** <code>used</code>: Bytes stored in this space.
** <code>uuid</code>: UUID of the space
** <code>verified</code>: If the space is verified. This is a stub feature
** Other space-specific fields

=== Browse space path ===

* '''URL''': <code>/api/v2/space/<UUID>/browse/</code>
* '''Verb''': GET
* '''Parameters''': Query string parameters
** <code>path</code>: Path inside the Space to look
* '''Response''': JSON
** <code>entries</code>: List of entries at path, files or directories
** <code>directories</code>: List of directories in path. Subset of <code>entries</code>.

{| class="wikitable" style="background-color:#ffffcc;" cellpadding="10";
| Version 1: Returns paths as strings
Version 2: Returns all paths base64 encoded
|}
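With version 2 of the API, a client has to decode the returned paths before using them. The sketch below assumes that the decoded paths are UTF-8 text, which may not hold for every filesystem; <code>decode_browse_response</code> is a hypothetical name and the sample response is made up.

```python
import base64


def decode_browse_response(response):
    """Decode the base64 encoded paths of a version 2 browse response."""
    def decode(values):
        # Assumption: the decoded bytes are UTF-8 text.
        return [base64.b64decode(value).decode("utf-8") for value in values]

    return {
        "entries": decode(response.get("entries", [])),
        "directories": decode(response.get("directories", [])),
    }


def encode(path):
    """Encode a path the same way, to build a made-up sample response."""
    return base64.b64encode(path.encode("utf-8")).decode("ascii")


sample = {
    "entries": [encode("objects"), encode("README.txt")],
    "directories": [encode("objects")],
}
print(decode_browse_response(sample))
```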

== Location ==

{| class="wikitable" style="background-color:#ffeecc;" cellpadding="10";
| Improvement Note: Is there no way to create Locations in the API?
|}

=== Get all locations ===

* '''URL''': <code>/api/v2/location/</code>
* '''Verb''': GET

=== Get location details ===

* '''URL''': <code>/api/v2/location/<UUID>/</code>
* '''Verb''': GET

=== Move files to this location ===

* '''URL''': <code>/api/v2/location/<UUID>/</code>
* '''Verb''': POST
* '''Parameters''': JSON body
** <code>origin_location</code>: URI of the Location the files should be moved from
** <code>pipeline</code>: URI of the [[Storage Service#Pipeline | pipeline]]. Both Locations must be associated with this pipeline.
** <code>files</code>: List of dicts containing <code>source</code> and <code>destination</code>. The source and destination are paths relative to their Location of the files to be moved.

Intended for use when creating Transfers, SIPs and similar, and in other cases where files need to be moved but not tracked by the Storage Service.
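
The JSON body for a move request could be built as in the sketch below. <code>move_files_payload</code> is a hypothetical helper, and the location URI, pipeline URI and file paths are made-up illustrative values; the URI format is assumed to follow the <code>resource_uri</code> pattern shown for pipelines.

```python
import json


def move_files_payload(origin_location_uri, pipeline_uri, files):
    """Return the body for moving files; `files` is a list of
    (source, destination) pairs, each relative to its Location."""
    return {
        "origin_location": origin_location_uri,
        "pipeline": pipeline_uri,
        "files": [
            {"source": source, "destination": destination}
            for source, destination in files
        ],
    }


payload = move_files_payload(
    "/api/v2/location/11111111-1111-1111-1111-111111111111/",  # made-up URI
    "/api/v2/pipeline/dd354557-9e6e-4918-9fe3-a65b00ecb1af/",
    [("transfer1/file.txt", "currentlyProcessing/transfer1/file.txt")],
)
print(json.dumps(payload, indent=2))
```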

=== Browse location path ===

* '''URL''': <code>/api/v2/location/<UUID>/browse/</code>
* '''Verb''': GET
* '''Parameters''': Query string parameters
** <code>path</code>: Path inside the Location to look
* '''Response''': JSON
** <code>entries</code>: List of entries in <code>path</code>, files or directories
** <code>directories</code>: List of directories in <code>path</code>. Subset of <code>entries</code>.

{| class="wikitable" style="background-color:#ffffcc;" cellpadding="10";
| Version 1: Returns paths as strings
Version 2: Returns all paths base64 encoded
|}

=== SWORD collection ===

* '''URL''': <code>/api/v2/location/<UUID>/sword/collection/</code>
* '''Verb''': GET, POST

See [[Sword API]] for details

== Package ==

=== Get all packages ===

* '''URL''': <code>/api/v2/file/</code>
* '''Verb''': GET

=== Create new package ===

* '''URL''': <code>/api/v2/file/</code>
* '''Verb''': POST
* '''Parameters''': JSON. Fields for a new package:
** <code>uuid</code>: UUID of the new package
** <code>origin_location</code>: URI of the Location where the package is currently
** <code>origin_path</code>: Path to the package, relative to the origin_location
** <code>current_location</code>: URI of the Location where the package should be stored
** <code>current_path</code>: Path where the package should be stored, relative to the current_location
** <code>package_type</code>: Type of package this is. One of: <code>AIP</code>, <code>AIC</code>, <code>DIP</code>, <code>transfer</code>, <code>SIP</code>, <code>file</code>, <code>deposit</code>
** <code>size</code>: Size of the package
** <code>origin_pipeline</code>: URI of the pipeline the package is from
** <code>related_package_uuid</code>: UUID of a package that is related to this one. E.g. UUID of a DIP when storing an AIP

Creates a database entry tracking the package (AIP, transfer, etc). If the package is an AIP, DIP or AIC and the current_location is an AIP or DIP storage location it also moves the files from the source to destination location. If the package is a Transfer and the current_location is transfer backlog, it is also moved.

This is handled through the modified <code>obj_create</code> function, which calls <code>Package.store_aip</code> or <code>Package.backlog_transfer</code>

=== Get package details ===

* '''URL''': <code>/api/v2/file/<UUID>/</code>
* '''Verb''': GET

=== Update package contents ===

* '''URL''': <code>/api/v2/file/<UUID>/</code>
* '''Verb''': PUT
* '''Parameters''': JSON body
** <code>reingest</code>: Flag marking this request as a reingest. Reduces the chance of accidentally modifying an AIP.
** <code>uuid</code>: UUID of the existing package
** <code>origin_location</code>: URI of the Location where the package is currently
** <code>origin_path</code>: Path to the package, relative to the origin_location
** <code>current_location</code>: URI of the Location where the package should be stored
** <code>current_path</code>: Path where the package should be stored, relative to the current_location
** <code>package_type</code>: Type of package this is. One of: <code>AIP</code>, <code>AIC</code>
** <code>size</code>: Size of the package
** <code>origin_pipeline</code>: URI of the pipeline the package is from. This must be the same pipeline reingest was started on (tracked through <code>Package.misc_attributes.reingest_pipeline</code>)

Updates the contents of a package during reingest. If the package is an AIP or AIC, currently stored in an AIP storage location, and the 'reingest' parameter is set, it will call <code>Package.finish_reingest</code> and merge the new AIP with the existing one.

This is implemented using a modified <code>obj_update</code> which calls <code>obj_update_hook</code>.

=== Update package metadata ===

* '''URL''': <code>/api/v2/file/<UUID>/</code>
* '''Verb''': PATCH
* '''Parameters''': JSON body
** <code>reingest</code>: Pipeline UUID or None.

Used to update metadata stored in the database for the package. Currently, this is used to update the reingest status.

{| class="wikitable" style="background-color:#ffeecc;" cellpadding="10";
| Improvement Note: Currently, this always sets Package.misc_attributes.reingest to None, regardless of what value was actually passed in.
|}

This is implemented using a modified <code>obj_update</code> which calls <code>obj_update_hook</code>. <code>update_in_place</code> also helps.

=== Delete package request ===

* '''URL''': <code>/api/v2/file/<UUID>/delete_aip/</code>
* '''Verb''': POST
* '''Parameters''': JSON body
** <code>event_reason</code>: Reason for deleting the AIP
** <code>pipeline</code>: URI of the pipeline the delete request is from
** <code>user_id</code>: User ID requesting the deletion. This is the ID of the user on the pipeline, and must be an integer greater than 0.
** <code>user_email</code>: Email of the user requesting the deletion.
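A client can check the <code>user_id</code> constraint before sending the request. The sketch below is a hypothetical helper and not part of the Storage Service; the reason and email address are made-up values.

```python
def delete_request_payload(event_reason, pipeline_uri, user_id, user_email):
    """Return the body for a delete request, validating user_id first."""
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(user_id, bool) or not isinstance(user_id, int) or user_id <= 0:
        raise ValueError("user_id must be an integer greater than 0")
    return {
        "event_reason": event_reason,
        "pipeline": pipeline_uri,
        "user_id": user_id,
        "user_email": user_email,
    }


payload = delete_request_payload(
    "Duplicate of another AIP",
    "/api/v2/pipeline/dd354557-9e6e-4918-9fe3-a65b00ecb1af/",
    1,
    "archivist@example.com",
)
print(payload)
```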

=== Recover AIP request ===

* '''URL''': <code>/api/v2/file/<UUID>/recover_aip/</code>
* '''Verb''': POST
* '''Parameters''': JSON body
** <code>event_reason</code>: Reason for recovering the AIP
** <code>pipeline</code>: URI of the pipeline the recovery request is from
** <code>user_id</code>: User ID requesting the recovery. This is the ID of the user on the pipeline, and must be an integer greater than 0.
** <code>user_email</code>: Email of the user requesting the recovery.

=== Download single file ===

* '''URL''': <code>/api/v2/file/<UUID>/extract_file/</code>
* '''Verb''': GET, HEAD
* '''Parameters''': Query string parameters
** <code>relative_path_to_file</code>: Path to the file to download, relative to the package path.
* '''Response''': Stream of the requested file

Returns a single file from the Package. If the package is compressed, it downloads the whole AIP and extracts it.

This responds to HEAD because AtoM uses HEAD to check for the existence of a file.

{| class="wikitable" style="background-color:#ffeecc;" cellpadding="10";
| Improvement Note: HEAD and GET should not perform the same functions. HEAD should be updated to not return the file, and to only check for existence. Currently, the storage service has no way to check if a file exists except by downloading and extracting this AIP
|}

If the package is in [[Storage Service#Arkivum | Arkivum]], the package may not actually be available. This endpoint checks if the package is locally available. If it is, it is returned as normal. If not, it returns <code>202</code> and emails the administrator about the attempted access.

=== Download package ===

* '''URL''': <code>/api/v2/file/<UUID>/download/</code>
* '''URL''': <code>/api/v2/file/<UUID>/download/<chunk number>/</code> (for [[Storage Service#LOCKSS-o-matic | LOCKSS]] harvesting)
* '''Verb''': GET, HEAD
* '''Parameters''': None
* '''Response''': Stream of the package

Returns the entire package as a single file. If the AIP is uncompressed, a single file is created using <code>tar</code>.

If the download URL includes a chunk number, the endpoint attempts to serve the specified LOCKSS chunk for that package. If the package is not in LOCKSS, it returns the whole package.

This responds to HEAD because AtoM uses HEAD to check for the existence of a file.

{| class="wikitable" style="background-color:#ffeecc;" cellpadding="10"
| Improvement Note: HEAD and GET should not perform the same function. HEAD should be updated to only check for existence, without returning the file.
|}

If the package is in [[Storage Service#Arkivum | Arkivum]], the package may not actually be available. This endpoint checks if the package is locally available. If it is, it is returned as normal. If not, it returns <code>202</code> and emails the administrator about the attempted access.
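
The two URL forms and the two accepted verbs can be sketched in Python. The host in <code>BASE_URL</code> and the helper names are hypothetical, authentication is omitted, and the request is only built, not sent.

```python
import urllib.request

BASE_URL = "http://localhost:8000"  # hypothetical Storage Service address


def download_url(package_uuid, chunk_number=None):
    """Return the download URL, with the chunk number appended when given."""
    url = f"{BASE_URL}/api/v2/file/{package_uuid}/download/"
    if chunk_number is not None:
        url += f"{chunk_number}/"
    return url


def build_download_request(package_uuid, chunk_number=None, method="GET"):
    """Build (but do not send) a GET or HEAD request for the package."""
    if method not in ("GET", "HEAD"):
        raise ValueError("the endpoint accepts GET and HEAD only")
    return urllib.request.Request(download_url(package_uuid, chunk_number), method=method)
```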

=== Get pointer file ===

* '''URL''': <code>/api/v2/file/<UUID>/pointer_file/</code>
* '''Verb''': GET
* '''Parameters''': None
* '''Response''': Stream of the pointer file.

=== Check fixity ===

* '''URL''': <code>/api/v2/file/<UUID>/check_fixity/</code>
* '''Verb''': GET
* '''Parameters''': Query string parameters
** <code>force_local</code>: If true, download and run fixity on the AIP locally, instead of using the Space-provided fixity if available.
* '''Response''': JSON
** <code>success</code>: True if the verification succeeded, False if the verification failed, None if the scan could not start
** <code>message</code>: Human-readable string explaining the report; it will be empty for successful scans.
** <code>failures</code>: List of 0 or more errors
** <code>timestamp</code>: ISO-formatted string with the datetime of the last fixity check. If the check was performed by an external system, this will be provided by that system. If not provided, or on error, it will be None.
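
Because <code>success</code> has three possible values, a client has to distinguish a failed check from one that never started. A minimal sketch, assuming the response body has already been read as text; the helper name <code>summarize_fixity</code> and the status labels are hypothetical:

```python
import json


def summarize_fixity(response_body):
    """Map the JSON report to a (status, failures) pair."""
    report = json.loads(response_body)
    success = report.get("success")
    if success is True:
        status = "passed"
    elif success is False:
        status = "failed"
    else:
        # success is None (JSON null): the scan could not start.
        status = "not started"
    return status, report.get("failures") or []
```

Testing with <code>is True</code> and <code>is False</code> rather than plain truthiness keeps the None case from being mistaken for a failure.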

=== AIP storage callback request ===

* '''URL''': <code>/api/v2/file/<UUID>/send_callback/post_store/</code>
* '''Verb''': GET

Request to call any Callbacks configured to run post-storage for this AIP.

{| class="wikitable" style="background-color:#ffeecc;" cellpadding="10"
| Improvement Note: This only works on locally available AIPs (AIPs stored in Spaces that are available via a UNIX filesystem layer).
|}

=== Get file information for package ===

* '''URL''': <code>/api/v2/file/<UUID>/contents/</code>
* '''Verb''': GET
* '''Response''': JSON
** <code>success</code>: True
** <code>package</code>: UUID of the package
** <code>files</code>: List of dictionaries with file information. Each dictionary has:
*** <code>source_id</code>: UUID of the file to index
*** <code>name</code>: Relative path of the file inside the package
*** <code>source_package</code>: UUID of the SIP this file is from
*** <code>checksum</code>: Checksum of the file, or an empty string
*** <code>accessionid</code>: Accession number, or an empty string
*** <code>origin</code>: UUID of the Archivematica dashboard this is from

Returns metadata about every file within the package.

=== Update file information for package ===

* '''URL''': <code>/api/v2/file/<UUID>/contents/</code>
* '''Verb''': PUT
* '''Parameters''': JSON list of dictionaries with information on the files to be added. Each dict must have the following attributes:
** <code>relative_path</code>: Relative path of the file inside the package
** <code>fileuuid</code>: UUID of the file to index
** <code>accessionid</code>: Accession number, or an empty string
** <code>sipuuid</code>: UUID of the SIP this file is from
** <code>origin</code>: UUID of the Archivematica dashboard this is from

Adds a set of files to a package.
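
A minimal sketch of preparing the PUT body in Python, checking that every dictionary carries the attributes listed above before serializing. The helper name <code>build_contents_payload</code> is hypothetical, and the check reflects this page's description rather than the server's own validation.

```python
import json

REQUIRED_KEYS = {"relative_path", "fileuuid", "accessionid", "sipuuid", "origin"}


def build_contents_payload(files):
    """Serialize a list of file dicts, checking each has every required key."""
    for index, entry in enumerate(files):
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            raise ValueError(f"file {index} is missing: {sorted(missing)}")
    return json.dumps(files)
```

An empty string is accepted for <code>accessionid</code>, so the key must be present even when there is no accession number.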

=== Delete file information for package ===

* '''URL''': <code>/api/v2/file/<UUID>/contents/</code>
* '''Verb''': DELETE

Removes all file records associated with this package.

=== Query file information on packages ===

* '''URL''': <code>/api/v2/file/metadata/</code>
* '''Verb''': GET, POST
* '''Parameters''': Query string parameters. At least one is required; the others are optional.
** <code>relative_path</code>: Relative path of the file inside the package
** <code>fileuuid</code>: UUID of the file
** <code>accessionid</code>: Accession number
** <code>sipuuid</code>: UUID of the SIP this file is from
* '''Response''': JSON. List of dicts with file information about the files that match the query.
** <code>accessionid</code>: Accession number, or an empty string
** <code>file_extension</code>: File extension
** <code>filename</code>: Name of the file, sans path.
** <code>relative_path</code>: Relative path of the file inside the package
** <code>fileuuid</code>: UUID of the file to index
** <code>sipuuid</code>: UUID of the SIP this file is from
** <code>origin</code>: UUID of the Archivematica dashboard this is from
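
A query URL for this endpoint can be built as sketched below. The host in <code>BASE_URL</code> and the helper name <code>metadata_query_url</code> are hypothetical; the sketch enforces the rule that at least one parameter must be supplied and rejects names not listed above.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"  # hypothetical Storage Service address
ALLOWED_PARAMS = {"relative_path", "fileuuid", "accessionid", "sipuuid"}


def metadata_query_url(**params):
    """Return the metadata query URL for the given search parameters."""
    if not params:
        raise ValueError("at least one query parameter is required")
    unknown = set(params) - ALLOWED_PARAMS
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")
    return f"{BASE_URL}/api/v2/file/metadata/?{urlencode(params)}"
```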

=== Reingest AIP ===

* '''URL''': <code>/api/v2/file/<UUID>/reingest/</code>
* '''Verb''': POST
* '''Parameters''': JSON body
** <code>pipeline</code>: UUID of the pipeline to reingest on
** <code>reingest_type</code>: Type of reingest to start. One of <code>METADATA_ONLY</code> (metadata-only reingest), <code>OBJECTS</code> (partial reingest), <code>FULL</code> (full reingest)
** <code>processing_config</code>: Optional. Name of the processing configuration to use on full reingest
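
The JSON body for a reingest request could be assembled as in this sketch. The helper name <code>build_reingest_body</code> is hypothetical; the sketch rejects any <code>reingest_type</code> other than the three listed above and only includes <code>processing_config</code> when one is given.

```python
import json

REINGEST_TYPES = ("METADATA_ONLY", "OBJECTS", "FULL")


def build_reingest_body(pipeline_uuid, reingest_type, processing_config=None):
    """Return the JSON body for the reingest endpoint as a string."""
    if reingest_type not in REINGEST_TYPES:
        raise ValueError(f"reingest_type must be one of {REINGEST_TYPES}")
    body = {"pipeline": pipeline_uuid, "reingest_type": reingest_type}
    # processing_config is optional, so it is left out unless provided.
    if processing_config is not None:
        body["processing_config"] = processing_config
    return json.dumps(body)
```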

=== SWORD endpoints ===

* '''URL''': <code>/api/v2/file/<UUID>/sword/</code>
* '''URL''': <code>/api/v2/file/<UUID>/sword/media/</code>
* '''URL''': <code>/api/v2/file/<UUID>/sword/state/</code>

See [[Sword API]] for details.

[[Category:Development documentation]]