Storage API

From Archivematica
Revision as of 15:17, 22 January 2014 by Hbecker (talk | contribs) (Added proposed changes section)

This is the discussion page for the Archivematica Storage API (Issue #5158), requirements, and proposed implementations.

Goals

  • Get transfers from more locations (eg. FTP, NFS, HTTP, etc)
  • Store AIPs more flexibly (eg. LOCKSS, FEDORA)
    • be able to break AIPs into smaller chunks for external storage requirements, and store metadata about the chunks
  • Configure where to store transfer backlog, quarantine location, etc

Initial Research

Goal: Look at all the places Archivematica currently accesses the filesystem, and categorize them.

Categories:

  • Transfers
    • Starting transfer, puts in watched directory according to transfer type
  • Quarantine
    • copied to watchedDirectories/quarantined/
  • Backlog transfer
    • Send to backlog (MicroServiceChainLink abd6d60c-d50f-4660-a189-ac1b34fafe85)
    • Retrieve from backlog through create SIP search, or SIP Arrangement (planned)
  • Currently Processing
    • Anything initiated by putting files in a watchedDirectory, anything being processed by a MicroServiceChain
    • touched everywhere: in the Python code and in all the client scripts
    • Probably best to keep local to Archivematica
    • Already set up to be move-able with %sharedDirectory% as long as folder structure inside %sharedDirectory% is preserved
  • AIP Storage
    • done in one place: src/MCPClient/lib/clientScripts/storeAIP.py
  • Uploaded DIPs?

Ways Archivematica can touch the filesystem:

  • Python libraries & UNIX utilities, mostly
  • Python's open(), shutil.{move|copy|rmtree}
    • mostly just in currently processing
  • Python's os module (checking if file/directory exists, create directory, remove file)
  • cp, mv, mkdir, rm, chmod as client scripts
    • mostly just processing, or moving within processing dirs
    • create transfer backup (MicroServiceChainLink 478512a6-10e4-410a-847d-ce1e25d8d31c)
    • Check for 'move to processing directory' that fetches from quarantine, backlog
      • Usually their own chainlinks, so should be straightforward to change
  • dashboard configs (eg. AIP storage location, transfer source)
    • dashboard.components.main.models.py SourceDirectory, StorageDirectory

Proposed Changes

Generic way to move files

Ticket: #6248

Problem: The storage service needs an easier, more generic way to move files & folders around. This mechanism can be called by the functions that create transfers and SIPs, or that store files in backlog or AIP storage.

Solution: Each Space has a 'staging path' which MUST be locally accessible to the storage service (i.e. things like 'mv' work), and MAY/SHOULD be on the destination filesystem if possible (possible for something like NFS, not for LOCKSS).

Each Space implements a few functions:

  • Space.move_to_storage_service()
    • handles moving from the Space to the destination's staging area
    • possible parameters: source path (on Space), destination path (on Space)
    • optimizations can be done based on the destination Space, but are optional
    • when complete, file/directory in question is in the destination's staging area, and may or may not be in the original spot
  • Space.post_move_to_storage_service()
    • handles anything specific to a Space that has to be done after a move
    • eg. LOCKSS talking with LOM
    • will probably be nothing in most cases
  • Space.move_from_storage_service(*args, **kwargs)
    • handles moving from this Space's staging area to the Space destination
    • possible parameters: source path (in Space's staging), destination path
    • may have nothing to do (a no-op) if move_to_storage_service already optimized the move
    • when completed, file/directory in question is in the final destination, may or may not be in staging area
      • possibly not in the case of LOCKSS
  • Space.post_move_from_storage_service()
    • anything Space specific after file is moved to final destination
    • eg delete local copy
    • eg LOCKSS updating pointer files, polling for result?
  • Space.get_size()
    • returns the size in bytes of the file/directory to be moved
    • used to check against the quota and report a failure back to the client

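As a rough illustration of the interface listed above, here is a minimal sketch of a Space backed by a locally mounted directory. The class name LocalFilesystemSpace, the constructor arguments and the exact method parameters are assumptions made for this example (the proposal only lists "possible parameters"); it assumes Python 3.

```python
import os
import shutil


class LocalFilesystemSpace:
    """Hypothetical Space backed by a locally mounted directory."""

    def __init__(self, path, staging_path):
        self.path = path                  # root of the Space
        self.staging_path = staging_path  # must be locally accessible to the storage service

    def get_size(self, source_path):
        """Return the size in bytes of a file or directory tree inside this Space."""
        full = os.path.join(self.path, source_path)
        if os.path.isfile(full):
            return os.path.getsize(full)
        total = 0
        for dirpath, _dirnames, filenames in os.walk(full):
            for name in filenames:
                total += os.path.getsize(os.path.join(dirpath, name))
        return total

    def move_to_storage_service(self, source_path, destination_path, destination_space):
        """Copy from this Space into the destination Space's staging area."""
        src = os.path.join(self.path, source_path)
        dst = os.path.join(destination_space.staging_path, destination_path)
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        if os.path.isdir(src):
            shutil.copytree(src, dst)
        else:
            shutil.copy2(src, dst)
        return dst

    def post_move_to_storage_service(self, *args, **kwargs):
        """Nothing Space-specific to do for a local filesystem."""

    def move_from_storage_service(self, source_path, destination_path):
        """Move from this Space's staging area to its final location."""
        src = os.path.join(self.staging_path, source_path)
        dst = os.path.join(self.path, destination_path)
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        shutil.move(src, dst)
        return dst

    def post_move_from_storage_service(self, *args, **kwargs):
        """Nothing Space-specific to do for a local filesystem."""
```

In this sketch move_to_storage_service copies rather than moves, so the original stays in place; a LOCKSS-style Space would presumably do its real work in the two post_ hooks instead.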

By default, any transfer between Spaces goes from the source Space to the storage service (SS), and then to the destination Space. All optimizations to bypass the SS are implemented in the move_to_storage_service function.
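The default two-hop flow could be orchestrated roughly as follows. This is only a sketch: RecordingSpace is a stand-in that records which hooks were called, and move_between_spaces, QuotaExceeded, the quota argument and the exact order of the post_ hooks are assumptions for illustration, not part of the proposal.

```python
class RecordingSpace:
    """Hypothetical stand-in Space that only records which hooks were called."""

    def __init__(self, name, size=0):
        self.name = name
        self.size = size
        self.calls = []

    def get_size(self, path):
        self.calls.append("get_size")
        return self.size

    def move_to_storage_service(self, source_path, destination_path, destination_space):
        self.calls.append("move_to_storage_service")

    def post_move_to_storage_service(self):
        self.calls.append("post_move_to_storage_service")

    def move_from_storage_service(self, source_path, destination_path):
        self.calls.append("move_from_storage_service")

    def post_move_from_storage_service(self):
        self.calls.append("post_move_from_storage_service")


class QuotaExceeded(Exception):
    """Raised so the failure can be reported back to the client."""


def move_between_spaces(source, destination, source_path, destination_path, quota=None):
    # Check the size against the quota before moving anything.
    size = source.get_size(source_path)
    if quota is not None and size > quota:
        raise QuotaExceeded("%d bytes exceeds quota of %d" % (size, quota))
    # Hop 1: source Space -> destination's staging area on the storage service.
    source.move_to_storage_service(source_path, destination_path, destination)
    source.post_move_to_storage_service()
    # Hop 2: staging area -> final location in the destination Space.
    destination.move_from_storage_service(destination_path, destination_path)
    destination.post_move_from_storage_service()
    return size
```

Any optimization that bypasses the storage service would live inside the source Space's move_to_storage_service, leaving this orchestration unchanged.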

Another possibility is to have an array of optimizations[source][destination] = optimization_function(). If the value is not null, call that; otherwise fall back to the default. This would be part of the managing resource (see below). Thoughts?
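A minimal sketch of that lookup table, using a nested dict keyed by Space type. The type names ("nfs", "lockss") and the function names are made up for this example.

```python
def default_move(source, destination, path):
    """Fallback: go through the storage service's staging area."""
    return "via storage service"


def nfs_to_nfs(source, destination, path):
    """Hypothetical optimization: copy directly between two NFS mounts."""
    return "direct NFS copy"


# optimizations[source][destination] = optimization function (absent means none)
OPTIMIZATIONS = {
    "nfs": {"nfs": nfs_to_nfs},
}


def pick_move_function(source_type, destination_type):
    """Return the registered optimization, or the default if there is none."""
    optimization = OPTIMIZATIONS.get(source_type, {}).get(destination_type)
    return optimization if optimization is not None else default_move
```

Compared with hiding optimizations inside move_to_storage_service, a table like this keeps every shortcut visible in one place, at the cost of the managing resource having to know about Space types.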


To manage all this, a Resource (not a ModelResource) will have endpoints like 'create_transfer', 'store_aip' and 'move_to_backlog' that perform the correct operation and pass the correct paths to the move_to/from_storage_service functions.
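One way to picture such a managing resource, written as a plain Python class rather than against any particular API framework. MoveResource, the purpose names in the locations mapping, and the injected mover callable are all hypothetical; only the three endpoint names come from the text above.

```python
import posixpath


class MoveResource:
    """Hypothetical managing resource; not tied to a database model."""

    def __init__(self, locations, mover):
        # locations maps a purpose name to (space, base_path); names are illustrative.
        self.locations = locations
        # mover(source_space, destination_space, source_path, destination_path)
        self.mover = mover

    def _move(self, source_purpose, destination_purpose, relative_path):
        src_space, src_base = self.locations[source_purpose]
        dst_space, dst_base = self.locations[destination_purpose]
        return self.mover(
            src_space,
            dst_space,
            posixpath.join(src_base, relative_path),
            posixpath.join(dst_base, relative_path),
        )

    def create_transfer(self, relative_path):
        return self._move("transfer_source", "currently_processing", relative_path)

    def store_aip(self, relative_path):
        return self._move("currently_processing", "aip_storage", relative_path)

    def move_to_backlog(self, relative_path):
        return self._move("currently_processing", "backlog", relative_path)
```

Each endpoint only decides which pair of locations is involved; the actual work is delegated to whatever implements the move_to/from_storage_service sequence.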


Packages probably only need to track AIPs/AICs and backlogged transfers, i.e. things going from a currently processing location to the storage service. They do not need to track incoming transfers, SIPs, etc., i.e. things being moved to a pipeline for processing that aren't intended to reside on the SS for long.