Meeting 20110713

From Archivematica

Development

  • Joseph has been working on the code side of the MCP
  • Joseph has just started looking at the Gearman project and at possibly integrating it. He is talking to Austin about the install.
  • Gearman will not address our threading issues, and may make them worse: there is not enough information in the Gearman API to figure out how the blocking calls behave in a multithreaded environment
  • Peter says the timing is good, as Joseph has a lot of MCP refactoring to do for 0.7.2; it is better for us to benefit from work on another open-source component that provides what we were doing with jobs/tasks/queuing, etc.
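
For readers unfamiliar with the job/task/queuing model discussed above, here is a minimal sketch of the general pattern: workers register named functions with a job server, and clients submit jobs by function name. It uses only the Python standard library and is not the Gearman API. `JobServer`, `register`, `submit`, `work_forever`, `shutdown`, and `run_demo` are hypothetical names chosen for illustration, and everything runs in one process, whereas Gearman would distribute the work across machines.

```python
import queue
import threading


class JobServer:
    """Hypothetical in-process stand-in for a job server (NOT the Gearman API)."""

    def __init__(self):
        self._functions = {}        # function name -> callable that does the work
        self._jobs = queue.Queue()  # pending jobs: (name, payload, result box)

    def register(self, name, func):
        """A worker advertises that it can run jobs called `name`."""
        self._functions[name] = func

    def submit(self, name, payload):
        """A client queues a job; the returned queue will hold the result."""
        if name not in self._functions:
            raise KeyError(f"no worker registered for {name!r}")
        result = queue.Queue(maxsize=1)
        self._jobs.put((name, payload, result))
        return result

    def work_forever(self):
        """Worker loop: take jobs off the queue until a None sentinel arrives."""
        while True:
            job = self._jobs.get()
            if job is None:
                break
            name, payload, result = job
            result.put(self._functions[name](payload))

    def shutdown(self):
        """Ask the worker loop to stop after the jobs already queued."""
        self._jobs.put(None)


def run_demo():
    server = JobServer()
    # Each supported task is exposed as one named function.
    server.register("reverse", lambda data: data[::-1])
    worker = threading.Thread(target=server.work_forever)
    worker.start()
    pending = server.submit("reverse", "archivematica")
    answer = pending.get(timeout=5)
    server.shutdown()
    worker.join()
    return answer
```

Calling `run_demo()` submits one job and returns the worker's result. Note that every supported task needs its own registered function name, which is the modularity concern Joseph raises in the chat log.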

Deployment

Testing

  • Evelyn has been testing Heritrix and Wayback

Chat log

(10:42:36 AM) epmclellan: I can take minutes for Archivematica meeting
(10:43:30 AM) ARTi: thanks epmclellan
(10:44:16 AM) epmclellan: dev news?
(10:44:23 AM) epmclellan: berwin22: ?
(10:44:44 AM) berwin22: I've been working on the code side of the MCP
(10:45:04 AM) berwin22: I've just started looking at the gearman project, and possibly integrating it.
(10:45:12 AM) berwin22: Talking to austin about the install
(10:45:21 AM) peterVG: berwin22: cool
(10:45:37 AM) berwin22: I like its layout over other distributed systems
(10:45:57 AM) berwin22: allows for redundancy to be in place
(10:46:43 AM) peterVG: will it address our threading issues or is that still a candidate for a larger re-factor (e.g. port to Java or C)
(10:46:54 AM) JessicaB left the room (quit: Quit: Leaving.).
(10:46:58 AM) peterVG: ...in 0.9 (if we do)
(10:47:44 AM) berwin22: threading issue still exists, possibly more so.
(10:47:44 AM) berwin22: Not enough information in their API to figure out how the blocking calls behave in a multithreaded environment
(10:49:08 AM) berwin22: clients/workers may be putting their timestamp on tasks, as the assigned time, as the server wouldn't be the mcp
(10:49:23 AM) djjuhasz: Gearman sounds interesting - I'd like to check out the PHP version for Qubit
(10:50:08 AM) peterVG: berwin22: that sounds okay (timestamps)
(10:50:12 AM) berwin22: That's all from me at the moment
(10:50:41 AM) epmclellan: is there supposed to be an anti-spam thing on the minutes page? Where I have to type the words in the box?
(10:50:44 AM) djjuhasz: Are you considering Gearman to replace Twisted then?
(10:51:05 AM) djjuhasz: epmclellan: Think when you hit "Save"
(10:51:22 AM) epmclellan: then I get a box with nonsense words I'm supposed to type?
(10:51:30 AM) djjuhasz: berwin22:  Are you considering Gearman to replace Twisted then?
(10:51:32 AM) epmclellan: that's what I'm getting now
(10:52:44 AM) epmclellan: David just helped me out
(10:52:59 AM) berwin22: Gearman and twisted have two different functions, so no, not replace, but yes, considering pulling twisted.
(10:52:59 AM) berwin22: It would mean moving the xml rpc back to the standard python library, which I'm fine with.
(10:53:36 AM) djjuhasz: hmm, okay - I'll have to clarify the difference between Twisted and Gearman with you outside the meeting berwin22 :)
(10:53:55 AM) berwin22: Gearman provides a generic application framework to farm out work to  other machines or processes that are better suited to do the work.
(10:54:25 AM) berwin22: Twisted is an event-driven networking engine 
(10:54:30 AM) djjuhasz: right
(10:55:24 AM) djjuhasz: thanks
(10:55:51 AM) peterVG: gearman sounds ideal for MCP
(10:56:22 AM) peterVG: timing is good as you have a lot of MCP re-factor for 0.7.2 
(10:57:07 AM) peterVG: better for us to benefit from work on other open-source component that provides what we were doing with job/tasks/queing etc
(10:57:44 AM) peterVG: multiple language plugins, oh, and its in Java (leave door open for multi-threading Gearman processes?) ;-)
(10:58:36 AM) berwin22: I think we can get a client working on windows too 
(10:59:07 AM) peterVG: nice
(10:59:22 AM) peterVG: more dev?
(10:59:37 AM) peterVG: web archiving testing is more like deployment, huh?
(10:59:42 AM) epmclellan: I think so
(10:59:47 AM) berwin22: I'm not sure gearman will work for us though
(10:59:53 AM) peterVG: berwin22: NO!!
(11:00:02 AM) peterVG: let's talk after meeting
(11:00:05 AM) berwin22: k
(11:00:21 AM) ARTi: yeah, not much archivematica related
(11:00:31 AM) ARTi: got heritrix and wayback up and running
(11:00:38 AM) epmclellan: pretty impressive
(11:00:42 AM) peterVG: great
(11:00:57 AM) epmclellan: I have done some crawls of Rockefeller site
(11:01:07 AM) epmclellan: am trying to analyze the WARC format
(11:01:11 AM) ARTi: however, neither of those supply indexing, and nutchwax requires a hadoop cluster :/
(11:01:13 AM) epmclellan: and also checked out the rendering in wayback
(11:01:25 AM) epmclellan: ARTi: there seem to be quite a few tools out there
(11:01:29 AM) ARTi: cool
(11:01:45 AM) epmclellan: WARC is big, there's a bunch of institutions developing tools for it
(11:01:51 AM) berwin22: My concern is that the MCP is designed to only execute supported tasks on a client,
(11:01:51 AM) berwin22: The gearman implementation of this is each supported feature has a function.
(11:01:51 AM) berwin22: This would mean the system would be far less modular; unless I can think of a way around this.
(11:01:52 AM) epmclellan: rendering, extraction, indexing, etc
(11:02:22 AM) epmclellan: so even if we don't get everything we want now, we know that other institutions are working on these tools
(11:02:32 AM) peterVG: ARTi: we'll need indexing & search. there's three other options listed at netpreserve.org, pls evaluate with epmclellan and decide on which to try first
(11:02:48 AM) epmclellan: yes, we've been looking at those
(11:02:57 AM) peterVG: kewl
(11:03:44 AM) epmclellan: what I meant was, even if the tools aren't perfect now at least we know that warc is very popular
(11:03:55 AM) epmclellan: especially with big institutions like Library of Congress
(11:04:05 AM) ARTi: yeah, wera the other indexer option.. require nutchwax too
(11:04:16 AM) ARTi: again hadoopfs cluster thingy :/
(11:04:39 AM) djjuhasz: Does somebody have a link for "ndexing & search. there's three other options listed at netpreserve.org"?
(11:04:39 AM) ARTi: but Ive read some stuff about people using solr for their warc files
(11:04:46 AM) ***djjuhasz is curious
(11:05:13 AM) ARTi: http://netpreserve.org/software/downloads.php
(11:05:14 AM) peterVG: http://netpreserve.org/software/downloads.php
(11:05:18 AM) peterVG: doh!
(11:05:47 AM) peterVG: link shootout
(11:06:36 AM) djjuhasz: thanks x 2 :)
(11:06:38 AM) ARTi: only reason Im wary of hadoop is its suppose to be pretty heavy weight..  mapreduce/GoogleFS clone
(11:08:14 AM) peterVG: yes, stay away from the Hadoop. 
(11:08:45 AM) peterVG: think ahead to deploying on-site for other clients. so ease of install/maintenance is a tool requirement
(11:09:03 AM) peterVG: time?
(11:09:23 AM) epmclellan: I've been testing Heritrix and Wayback, that's it for testing
(11:09:25 AM) epmclellan: no docs
(11:09:29 AM) epmclellan: that's it from me
(11:10:06 AM) ARTi: all from me on archivematica
(11:10:15 AM) berwin22: all from me
(11:10:26 AM) peterVG: PEACE!
(11:10:26 AM) epmclellan: ok, it's a wrap