A bottleneck is a point of congestion in a system, typically a place of limited resources, where workflow is prone to slow down.
- For more information on bottlenecks, see Wikipedia.
Archivematica uses its distributed, multi-processing MCP system to mitigate the traditional problems of a processing system. However, this places greater importance on two other bottlenecks: network and disk activity.
In Archivematica processing, networking comes into play for two key reasons:
- distributing tasks
- accessing the central file store over the network
Distributing the tasks and collecting the results generates fairly light network traffic, but if the network is congested, it will hurt the performance of the system by slowing task assignment and the return of results.
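The difference between the two kinds of traffic can be illustrated with a rough estimate. The sketch below is only a back-of-the-envelope calculation: the link speed, message size, and object size are illustrative assumptions, not measurements from an Archivematica installation, and `transfer_seconds` is a hypothetical helper that ignores latency and protocol overhead.

```python
def transfer_seconds(size_bytes, link_bits_per_second):
    """Ideal time to move size_bytes over a link, ignoring latency and overhead."""
    return size_bytes * 8 / link_bits_per_second


# Illustrative assumptions, not measurements from Archivematica:
LINK_BPS = 1_000_000_000      # a 1 Gbit/s link
TASK_MESSAGE_BYTES = 1_000    # a small task assignment or result message
OBJECT_BYTES = 1_000_000_000  # a 1 GB object read from the central file store

task_time = transfer_seconds(TASK_MESSAGE_BYTES, LINK_BPS)
object_time = transfer_seconds(OBJECT_BYTES, LINK_BPS)
print(f"task message: {task_time:.6f} s")
print(f"object read:  {object_time:.1f} s")
```

Under these assumed numbers, the file transfer dominates by several orders of magnitude, which is why access to the central file store matters more than the task messages themselves.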
We are currently investigating distributed file systems to reduce some of the delay of accessing files remotely. See below.
Hard drive access is one of the key bottlenecks in the Archivematica system. All of the operations performed on the objects require reading the objects from the drive. There are a number of ways to improve disk read performance in a system.
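Before changing the storage setup, it can help to measure how fast files are actually being read. The following is a minimal sketch, not part of Archivematica: `measure_read_throughput` is a hypothetical helper that reads one file sequentially and reports how long it took.

```python
import sys
import time


def measure_read_throughput(path, chunk_size=1024 * 1024):
    """Read the file at `path` sequentially; return (bytes_read, seconds)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:  # an empty read means end of file
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total, elapsed


if __name__ == "__main__" and len(sys.argv) > 1:
    size, seconds = measure_read_throughput(sys.argv[1])
    mib = size / (1024 * 1024)
    # Guard against a zero-length timing on very small files.
    rate = mib / seconds if seconds > 0 else float("inf")
    print(f"read {mib:.1f} MiB in {seconds:.3f} s ({rate:.1f} MiB/s)")
```

Note that the operating system caches recently read files in memory, so a second run against the same file usually reports a much higher rate than the drive itself can deliver.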
RAID (redundant array of inexpensive disks) is a way of distributing the load of a file system across a set of drives. There are various RAID levels, which offer different degrees of redundancy.
- For more information on RAID, see Wikipedia.
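To show how spreading data over drives distributes the load, here is a simplified sketch of RAID 0 style striping, assuming a stripe unit of one block and no redundancy. `locate_block` is a hypothetical name used for illustration; real RAID implementations use larger stripe units, and other RAID levels add mirroring or parity.

```python
def locate_block(logical_block, num_disks):
    """Map a logical block number to (disk_index, block_on_disk) under simple striping."""
    if num_disks < 1:
        raise ValueError("num_disks must be at least 1")
    # Consecutive logical blocks rotate across the disks.
    return logical_block % num_disks, logical_block // num_disks


for block in range(6):
    disk, offset = locate_block(block, 3)
    print(f"logical block {block} -> disk {disk}, block {offset}")
```

Because consecutive blocks land on different drives, a large sequential read can be served by all of the drives at once.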
Distributed File System
Distributed file systems are arguably a subset of RAIDs. They are distributed over multiple machines to form a single file system. This has the potential to lighten the network load for processing.
We are looking at using a distributed file system with Archivematica. See Issue 669.
Ceph is a distributed file system that is currently (July 2011) in alpha development. A beta 1.0 release is scheduled for August 21, 2011; see their roadmap.