Large datasets

What happens when a body of materials to be ingested consists of thousands of files (e.g. a large social science research dataset), or when one file is extremely large (e.g. an HD video file)?
*The large number of files could be broken up and distributed across multiple AIPs, with the relationships between them expressed in the METS structMaps.
**The dataset could be broken into a parent AIP that acts as an Archival Information Collection, consisting entirely of a METS structMap listing all of its child AIPs; each child AIP would link back to the parent AIP in its own structMap (a sketch of such a parent METS file follows this list).
*A large single file could be broken into multiple segments, each stored in its own AIP. Video files could be delivered to end users in these segments, the way large video files are delivered on YouTube, for example.
*Other types of large files might have to be merged back into one file for delivery to a user (a split-and-merge sketch also follows this list).
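
One way to realize the parent/child arrangement is a parent METS file whose structMap consists entirely of <code>mptr</code> pointers to the child AIPs, each of which would carry a reciprocal pointer back. Below is a minimal sketch in Python using lxml; the div TYPE values, the UUID-based file naming, and the <code>aip://</code> href scheme are illustrative assumptions, not an established Archivematica convention.

<syntaxhighlight lang="python">
# Sketch: a parent "AIC" METS file whose structMap only points at child AIPs.
# Namespaces are the standard METS/xlink ones; everything else is assumed.
import uuid

from lxml import etree

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
NSMAP = {"mets": METS_NS, "xlink": XLINK_NS}


def build_parent_structmap(child_aips):
    """child_aips: list of (uuid, label) pairs, one per child AIP."""
    mets = etree.Element(f"{{{METS_NS}}}mets", nsmap=NSMAP)
    struct_map = etree.SubElement(
        mets, f"{{{METS_NS}}}structMap",
        TYPE="logical", LABEL="Archival Information Collection",
    )
    collection = etree.SubElement(
        struct_map, f"{{{METS_NS}}}div",
        TYPE="Archival Information Collection",
    )
    for aip_uuid, label in child_aips:
        child = etree.SubElement(
            collection, f"{{{METS_NS}}}div",
            TYPE="Archival Information Package", LABEL=label,
        )
        # mptr points at the child AIP's own METS file; the child's own
        # structMap would hold a matching pointer back to this parent.
        etree.SubElement(
            child, f"{{{METS_NS}}}mptr", LOCTYPE="URL",
            attrib={f"{{{XLINK_NS}}}href": f"aip://{aip_uuid}/METS.{aip_uuid}.xml"},
        )
    return etree.tostring(mets, pretty_print=True).decode()


print(build_parent_structmap(
    [(uuid.uuid4(), "dataset-part-1"), (uuid.uuid4(), "dataset-part-2")]
))
</syntaxhighlight>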
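
For the single huge file, a splitter would be needed during ingest to write the file out as fixed-size segments, one per sibling AIP, with a matching merge step for formats that must be reassembled before delivery. A minimal sketch, assuming an arbitrary 2 GiB segment size and a hypothetical <code>.segNNNN</code> naming scheme:

<syntaxhighlight lang="python">
# Sketch: split one very large file into fixed-size segments (one per
# sibling AIP) and merge them back for delivery. Sizes and naming are
# arbitrary illustrative choices, not an Archivematica convention.
import os
import shutil

SEGMENT_SIZE = 2 * 1024 ** 3   # 2 GiB per segment (assumption)
BUFFER_SIZE = 64 * 1024 ** 2   # copy in 64 MiB reads to bound memory use


def split_file(path, out_dir):
    """Write path out as <name>.seg0000, <name>.seg0001, ... in out_dir."""
    segments = []
    with open(path, "rb") as src:
        index = 0
        while True:
            chunk = src.read(min(BUFFER_SIZE, SEGMENT_SIZE))
            if not chunk:
                break  # end of the source file
            seg_path = os.path.join(
                out_dir, f"{os.path.basename(path)}.seg{index:04d}"
            )
            with open(seg_path, "wb") as seg:
                seg.write(chunk)
                written = len(chunk)
                # Keep streaming into this segment until it is full.
                while written < SEGMENT_SIZE:
                    chunk = src.read(min(BUFFER_SIZE, SEGMENT_SIZE - written))
                    if not chunk:
                        break
                    seg.write(chunk)
                    written += len(chunk)
            segments.append(seg_path)
            index += 1
    return segments


def merge_segments(segments, out_path):
    """Reassemble the original file from its ordered segments."""
    with open(out_path, "wb") as dst:
        for seg_path in segments:
            with open(seg_path, "rb") as seg:
                shutil.copyfileobj(seg, dst, BUFFER_SIZE)
</syntaxhighlight>
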
[[Category:Development documentation]]
 