Difference between revisions of "Upload DIP"


Revision as of 11:24, 25 October 2012

Upload to AtoM

This service is implemented as a Python script that, given a DIP UUID, deposits the corresponding DIP into the specified AtoM service. The two ends communicate using the SWORD protocol. The script accepts a number of arguments and can be run manually, although it is normally called from Archivematica.

The Archivematica dashboard provides an interface to configure the parameters passed to the script. This interface can be found under Administration > AtoM DIP upload. On the same page you will find a description of every available argument. The options are explained below:

  1. Pass the DIP UUID using --uuid.
  2. Pass the AtoM service details using --url, --email and --password.
  3. (optional) By default, this service includes the DIP within the deposit request. This is not feasible if the package is too large: we recommend avoiding it if the DIP is over two megabytes. For larger DIPs, this service can instead send your DIP using rsync, a flexible and advanced data transfer tool. Two options are available if you want to use rsync: --rsync-command and --rsync-target. More details are given below.
  4. (optional) Are you troubleshooting a problem? Use --debug to increase the script's verbosity.
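
The size recommendation in option 3 can be checked before choosing between the default deposit and rsync. The following is only a minimal sketch: the function names are hypothetical, and reading "two megabytes" as 2 * 1024 * 1024 bytes is an assumption, not something the script itself defines.

```python
import os

# Assumption: "two megabytes" is interpreted as 2 * 1024 * 1024 bytes.
RSYNC_THRESHOLD_BYTES = 2 * 1024 * 1024


def dip_size_bytes(dip_path):
    """Return the total size in bytes of all regular files under dip_path."""
    total = 0
    for root, _dirs, files in os.walk(dip_path):
        for name in files:
            full = os.path.join(root, name)
            # Skip symlinks so linked files are not counted
            if os.path.isfile(full) and not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def should_use_rsync(dip_path, threshold=RSYNC_THRESHOLD_BYTES):
    """Return True when the DIP is larger than the threshold."""
    return dip_size_bytes(dip_path) > threshold
```

A caller could use `should_use_rsync()` to decide whether to pass the rsync options at all.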

The script can be found under /usr/lib/archivematica/upload-qubit/upload-qubit.py (check out the source code).
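
To make the argument list concrete, here is a hedged sketch that only builds the command line and does not run it. The flag names and the script path come from this page; everything else is an assumption: that each flag takes its value as a separate argument, that --debug takes no value, that the script can be executed directly, and the example URL, e-mail and rsync target, which are placeholders.

```python
import shlex

# Script path as given on this page
SCRIPT = "/usr/lib/archivematica/upload-qubit/upload-qubit.py"


def build_upload_command(uuid, url, email, password,
                         rsync_target=None, rsync_command=None, debug=False):
    """Build the argument list for upload-qubit.py without running it."""
    cmd = [SCRIPT, "--uuid", uuid, "--url", url,
           "--email", email, "--password", password]
    # The rsync options are only added when a value is supplied
    if rsync_target:
        cmd += ["--rsync-target", rsync_target]
    if rsync_command:
        cmd += ["--rsync-command", rsync_command]
    if debug:
        cmd.append("--debug")
    return cmd


if __name__ == "__main__":
    # All values below are placeholders, not real service details
    example = build_upload_command(
        uuid="c920c100-5df2-11df-a08a-0800200c9a66",
        url="http://atom.example.org",
        email="demo@example.org",
        password="secret",
        rsync_target="archivematica@atom.example.org:/tmp",
        debug=True,
    )
    print(" ".join(shlex.quote(part) for part in example))
```

Building a list rather than a single string avoids quoting problems if the command is later passed to `subprocess`.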

Tips to deal with big DIPs

If you expect to work with large DIPs, please refer to the next two sections to make the process more reliable:

Send your DIPs using rsync

This is the most complicated part of the process. rsync can send files over a network by connecting to an rsync server, which you can set up on your AtoM side, but most of the time people prefer to use SSH as a secure and reliable transport. The trick is to set up both sides so that the authentication step is passwordless; the good news is that SSH provides an easy way to do that using SSH keys.

You can just google it, but here is a brief explanation.

On the Archivematica end, we have to create an SSH key for the archivematica user by running the command shown below. Please leave the passphrase field blank.

archivematica.server $ sudo -u archivematica ssh-keygen

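That will generate the public/private key pair under /var/lib/archivematica/.ssh/id_dsa.pub (public) and /var/lib/archivematica/.ssh/id_dsa (private). The file names depend on the key type that ssh-keygen generates, so they may differ on your system (for example id_rsa.pub and id_rsa).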

On the AtoM side, we have to authorize that key for one of your local users. My recommendation is to create a dedicated local user for this purpose:

atom.server $ sudo useradd -d /home/archivematica -m -N -r -s /bin/bash archivematica
atom.server $ sudo passwd -l archivematica

Now, copy the contents of the public key to your AtoM server by appending them to the new user's authorized_keys file. The default location is /home/archivematica/.ssh/authorized_keys.
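
The step above can be sketched in Python as follows. This is an illustration under assumptions, not part of Archivematica or AtoM: `authorize_key` is a hypothetical helper, it must run on the AtoM server as the target user (or the files must be chowned to that user afterwards), and the restrictive permissions reflect what sshd typically expects by default.

```python
import os


def authorize_key(public_key_line, home_dir):
    """Append a public key to <home_dir>/.ssh/authorized_keys if it is missing."""
    ssh_dir = os.path.join(home_dir, ".ssh")
    os.makedirs(ssh_dir, mode=0o700, exist_ok=True)
    os.chmod(ssh_dir, 0o700)  # enforce the mode regardless of the umask

    auth_file = os.path.join(ssh_dir, "authorized_keys")
    key = public_key_line.strip()

    content = ""
    if os.path.exists(auth_file):
        with open(auth_file) as f:
            content = f.read()

    # Only append when the exact key line is not already present
    if key not in content.splitlines():
        with open(auth_file, "a") as f:
            if content and not content.endswith("\n"):
                f.write("\n")  # keep the new key on its own line
            f.write(key + "\n")

    os.chmod(auth_file, 0o600)
    return auth_file
```

For example, `authorize_key(open("id_dsa.pub").read(), "/home/archivematica")` would add the key generated earlier, assuming the public key file has already been copied to the AtoM server.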

Before you try all this from Archivematica, please make sure that the SSH connection can be established as expected. Furthermore, the first time you connect, the client will ask you to confirm and store the host key fingerprint, which requires user intervention. If you don't do that, this service won't work.

rsync with SSH needs a shell on the remote server to work properly. If you are concerned about this, try rssh, a restricted shell that can be configured to limit its usage to rsync. Other methods exist, such as using the SSH authorized_keys file to limit the commands the remote client can execute.
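
One way to verify that the connection really is passwordless is to run ssh in batch mode, where it fails instead of prompting for a password or for host key confirmation. The sketch below assumes the OpenSSH client is installed and that the check is run as the archivematica user; the function names and the host name in the usage note are hypothetical.

```python
import subprocess


def build_ssh_check(user, host):
    """Return an ssh command that succeeds only if login needs no prompt."""
    return ["ssh",
            "-o", "BatchMode=yes",       # never prompt; fail instead
            "-o", "ConnectTimeout=10",   # give up after ten seconds
            "%s@%s" % (user, host),
            "true"]                      # harmless remote command


def ssh_is_ready(user, host):
    """Run the check and return True when ssh exits with status 0."""
    result = subprocess.run(build_ssh_check(user, host),
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0
```

If `ssh_is_ready("archivematica", "atom.example.org")` returns False, connect once manually to accept the host key and confirm that the public key was authorized. Note that if the remote account uses a restricted shell such as rssh, the remote `true` command may be refused even though rsync itself would work.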

Configure AtoM to process your requests asynchronously

This is described in the AtoM wiki: https://www.qubit-toolkit.org/wiki/Job_scheduling

Upload to ContentDM

TODO

Mockup of Export DIP

0.9 Export DIP

Chat about requirements

we've begun changing the workflow to get ready for multi-file upload (e.g. all the files in the AIP instead of a single AIP.zip)
(11:08:25 AM) vangarderen.peter: austin: are you still getting a 500 error on that? I don't have an Archivematica session open right now
(11:08:41 AM) austin: I haven't tested it again since yesterday.. sec
(11:08:57 AM) Sevein: Austin, we should check URL variables values
(11:09:10 AM) austin: I did but couldnt find any issues
(11:09:21 AM) Sevein: mmm...
(11:10:11 AM) austin: should I setup a tunnel? I can only leave it on for a hour or so
(11:10:40 AM) Sevein: ok, that would be nice
(11:10:47 AM) austin: moment 
(11:11:17 AM) vangarderen.peter: so I've changed the workflow directories slightly for this. See http://code.google.com/p/archivematica/source/detail?r=300
(11:11:35 AM) vangarderen.peter: however, this shouldn't affect the current instance of upload-qubit.py
(11:12:13 AM) vangarderen.peter: it just gets called when files hit 9-uploadDIP instead of 8-uploadDIP now, see http://code.google.com/p/archivematica/source/diff?spec=svn300&r=300&format=side&path=/trunk/includes/incron.tab
(11:12:44 AM) austin: just because I just used it... a completely nsfw ip tool http://www.moanmyip.com/
(11:13:20 AM) Sevein: vangarderen.peter: but the script is called in the same way, AFAIK it shouldn't be a problem
(11:13:29 AM) vangarderen.peter: yes, exactly
(11:13:45 AM) vangarderen.peter: so let's make sure that works as before first, then move on to the next step
(11:13:54 AM) vangarderen.peter: so the next steps will be:
(11:14:55 AM) vangarderen.peter: 1) for each AIP, Archivematica writes just the access format copies of each file to 8-reviewDIP/$UUID
(11:15:04 AM) vangarderen.peter: ^ Joseph and I are working on that today
(11:16:06 AM) vangarderen.peter: 2) user reviews files in 8-reviewDIP/$UUID and decides to upload it to ICA-AtoM by dropping /$UUID into 9-uploadDIP
(11:17:31 AM) Sevein: I am not very used to Archivematica concepts, but I can see what you mean more or less
(11:17:34 AM) vangarderen.peter: 3) upload-qubit.py gets triggered when /$UUID is dropped into 9-uploadDIP, it reads the files contents (flat file, no nesting) of the directory, loops over each file as a single file upload to Qubit
(11:17:56 AM) Sevein: so upload-qubit.py is getting a path?
(11:18:20 AM) austin: Sevein join my screen
(11:18:23 AM) vangarderen.peter: yes, /$UUID
(11:18:43 AM) Sevein: ok
(11:18:45 AM) vangarderen.peter: where $UUID == c920c100-5df2-11df-a08a-0800200c9a66
(11:18:57 AM) vangarderen.peter: or d0810770-5df2-11df-a08a-0800200c9a66, etc.
(11:19:07 AM) Sevein: I see
(11:19:19 AM) vangarderen.peter: i.e. we generate a UUID to identify each SIP
(11:19:19 AM) austin: Sevein: $# = file name  $@ = path
(11:19:22 AM) vangarderen.peter: http://www.famkruithof.net/uuid/uuidgen
(11:19:37 AM) vangarderen.peter: right
(11:19:44 AM) Sevein: k
(11:19:55 AM) vangarderen.peter: then you will need a parent information object in Qubit to which you can link each of these files
(11:21:47 AM) coolidge49685 entered the room.
(11:22:06 AM) coolidge49685 is now known as peterVG
(11:22:16 AM) peterVG: lost my jabber connection in client
(11:22:22 AM) peterVG: can read but can't write
(11:22:28 AM) peterVG: here in web client now
(11:22:32 AM) peterVG: to continue...
(11:22:42 AM) peterVG: *first* you will need a parent information object in Qubit to which you can link each of these files
(11:23:34 AM) peterVG: just use the name of the directory ($UUID) as the title, assign the Level of Description to Series and make it a top-level (i.e. its parent it the root information object)
(11:24:06 AM) peterVG: (its parent *is* the root information object)
(11:25:23 AM) peterVG: for each file, just use its filename as the information object title (like the current digital object import default) and set its level of description to item
(11:25:38 AM) peterVG: let's get that ^ working first
(11:26:30 AM) Sevein: ok, I see
(11:27:06 AM) peterVG: then the final iteration will include reading additional metadata from a SIP.XML file which will be included in /$UUID. This XML file will contain additional information object field values for the parent series level and each file to include as parameters in the upload
(11:27:26 AM) Sevein: ah ok
(11:27:40 AM) peterVG: I better capture all this in a wiki page ^
(11:27:42 AM) Sevein: I saw an example sent by Austin
(11:27:49 AM) Sevein: Peter, that would be great
(11:28:03 AM) Sevein: I think some jabber connection problems interrupted your description
(11:28:09 AM) peterVG: but it mostly makes sense, what we are trying to do?
(11:28:16 AM) Sevein: 100%
(11:28:24 AM) peterVG: yes, I can't post and your posts are showing up double
(11:28:27 AM) Sevein: I don't see it difficult
(11:28:32 AM) peterVG: great!
(11:28:46 AM) ***peterVG creating wiki page <insert music>
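
As a rough illustration of the first iteration discussed in the chat above (one parent information object titled with the directory's UUID at Series level, plus one Item per file in a flat directory), the mapping could be sketched like this. The function name and dictionary keys are hypothetical and do not correspond to any Qubit or upload-qubit.py API; the later SIP.XML metadata step is not handled.

```python
import os


def plan_upload(uuid_dir):
    """Describe the objects to create for a flat /$UUID directory."""
    # The directory name ($UUID) becomes the title of the parent object
    uuid = os.path.basename(os.path.normpath(uuid_dir))
    parent = {"title": uuid, "level_of_description": "Series"}

    items = []
    for name in sorted(os.listdir(uuid_dir)):
        # Flat directory: subdirectories are ignored, no nesting
        if os.path.isfile(os.path.join(uuid_dir, name)):
            items.append({"title": name,
                          "level_of_description": "Item",
                          "parent": uuid})
    return parent, items
```

Each entry in `items` would then correspond to a single file upload linked to the parent.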