Meeting 20110713
Development
- Joseph has been working on the code side of the MCP
- Joseph has just started looking at the Gearman project and at possibly integrating it. He is talking to Austin about the install.
- Gearman will not address our threading issues, and may make them worse: there is not enough information in its API to figure out how the blocking calls behave in a multithreaded environment
- Peter says the timing is good, since a lot of MCP refactoring is already planned for 0.7.2; it is better for us to benefit from another open-source component that provides what we were doing ourselves with jobs, tasks, queuing, etc.
Deployment
- Austin has VM versions of Heritrix and Wayback running
- We are looking for indexing and search tools, starting with the ones listed at http://netpreserve.org/software/downloads.php; we could also try Solr
Testing
- Evelyn has been testing Heritrix and Wayback
Chat log
(10:42:36 AM) epmclellan: I can take minutes for Archivematica meeting
(10:43:30 AM) ARTi: thanks epmclellan
(10:44:16 AM) epmclellan: dev news?
(10:44:23 AM) epmclellan: berwin22: ?
(10:44:44 AM) berwin22: I've been working on the code side of the MCP
(10:45:04 AM) berwin22: I've just started looking at the gearman project, and possibly integrating it.
(10:45:12 AM) berwin22: Talking to austin about the install
(10:45:21 AM) peterVG: berwin22: cool
(10:45:37 AM) berwin22: I like it's layout over other distributed systems
(10:45:57 AM) berwin22: allows for redundancy to be in place
(10:46:43 AM) peterVG: will it address our threading issues or is that still a candidate for a larger re-factor (e.g. port to Java or C)
(10:46:54 AM) JessicaB left the room (quit: Quit: Leaving.).
(10:46:58 AM) peterVG: ...in 0.9 (if we do)
(10:47:44 AM) berwin22: threading issue still exists, possibly more so.
(10:47:44 AM) berwin22: Not enough informaion in their API to figure out how the blocking calls behave in a multithreaded enviroment
(10:49:08 AM) berwin22: clients/workers may be putting their timestamp on tasks, as the assigned time, as the server wouldn't be the mcp
(10:49:23 AM) djjuhasz: Gearman sounds interesting - I'd like to check out the PHP version for Qubit
(10:50:08 AM) peterVG: berwin22: that sounds okay (timestamps)
(10:50:12 AM) berwin22: That's all from me at the moment
(10:50:41 AM) epmclellan: is there supposed to be an anti-spam thing on the minutes page? Where I have to type the words in the box?
(10:50:44 AM) djjuhasz: Are you considering Gearman to replace Twisted then?
(10:51:05 AM) djjuhasz: epmclellan: Think when you hit "Save"
(10:51:22 AM) epmclellan: then I get a box with nonsense words I'm supposed to type?
(10:51:30 AM) djjuhasz: berwin22: Are you considering Gearman to replace Twisted then?
(10:51:32 AM) epmclellan: that's what I'm getting now
(10:52:44 AM) epmclellan: David just helped me out
(10:52:59 AM) berwin22: Gearman and twisted have two different functions, so no, not replace, but yes, considering pulling twisted.
(10:52:59 AM) berwin22: It would mean moving the xml rpc back to the standard python library, which I'm fine with.
(10:53:36 AM) djjuhasz: hmm, okay - I'll have to clarify the difference between Twisted and Gearman with you outside the meeting berwin22 :)
(10:53:55 AM) berwin22: Gearman provides a generic application framework to farm out work to other machines or processes that are better suited to do the work.
(10:54:25 AM) berwin22: Twisted is an event-driven networking engine
(10:54:30 AM) djjuhasz: right
(10:55:24 AM) djjuhasz: thanks
(10:55:51 AM) peterVG: gearman sounds ideal for MCP
(10:56:22 AM) peterVG: timing is good as you have a lot of MCP re-factor for 0.7.2
(10:57:07 AM) peterVG: better for us to benefit from work on other open-source component that provides what we were doing with job/tasks/queing etc
(10:57:44 AM) peterVG: multiple language plugins, oh, and its in Java (leave door open for multi-threading Gearman processes?) ;-)
(10:58:36 AM) berwin22: I think we can get a client working on windows too
(10:59:07 AM) peterVG: nice
(10:59:22 AM) peterVG: more dev?
(10:59:37 AM) peterVG: web archiving testing is more like deployment, huh?
(10:59:42 AM) epmclellan: I think so
(10:59:47 AM) berwin22: I'm not sure gearman will work for us though
(10:59:53 AM) peterVG: berwin22: NO!!
(11:00:02 AM) peterVG: let's talk after meeting
(11:00:05 AM) berwin22: k
(11:00:21 AM) ARTi: yeah, not much archivematica related
(11:00:31 AM) ARTi: got heritrix and wayback up and running
(11:00:38 AM) epmclellan: pretty impressive
(11:00:42 AM) peterVG: great
(11:00:57 AM) epmclellan: I have done some crawls of Rockefeller site
(11:01:07 AM) epmclellan: am trying to analyze the WARC format
(11:01:11 AM) ARTi: however, neither of those supply indexing, and nutchwax requires a hadoop cluster :/
(11:01:13 AM) epmclellan: and also checked out the rendering in wayback
(11:01:25 AM) epmclellan: ARTi: there seem to be quite a few tools out there
(11:01:29 AM) ARTi: cool
(11:01:45 AM) epmclellan: WARC is big, there's a bunch of institutions developing tools for it
(11:01:51 AM) berwin22: My concern is that the MCP is designed to only execute supported tasks on a client,
(11:01:51 AM) berwin22: The gearman implementation of this is each supported feature has a function.
(11:01:51 AM) berwin22: This would mean the system would be far less modular; unless I can think of a way around this.
(11:01:52 AM) epmclellan: rendering, extraction, indexing, etc
(11:02:22 AM) epmclellan: so even if we don't get everything we want now, we know that other institutions are working on these tools
(11:02:32 AM) peterVG: ARTi: we'll need indexing & search. there's three other options listed at netpreserve.org, pls evaluate with epmclellan and decide on which to try first
(11:02:48 AM) epmclellan: yes, we've been looking at those
(11:02:57 AM) peterVG: kewl
(11:03:44 AM) epmclellan: what I meant was, even if the tools aren't perfect now at least we know that warc is very popular
(11:03:55 AM) epmclellan: especially with big institutions like Library of Congress
(11:04:05 AM) ARTi: yeah, wera the other indexer option.. require nutchwax too
(11:04:16 AM) ARTi: again hadoopfs cluster thingy :/
(11:04:39 AM) djjuhasz: Does somebody have a link for "ndexing & search. there's three other options listed at netpreserve.org"?
(11:04:39 AM) ARTi: but Ive read some stuff about people using solr for their warc files
(11:04:46 AM) ***djjuhasz is curious
(11:05:13 AM) ARTi: http://netpreserve.org/software/downloads.php
(11:05:14 AM) peterVG: http://netpreserve.org/software/downloads.php
(11:05:18 AM) peterVG: doh!
(11:05:47 AM) peterVG: link shootout
(11:06:36 AM) djjuhasz: thanks x 2 :)
(11:06:38 AM) ARTi: only reason Im wary of hadoop is its suppose to be pretty heavy weight.. mapreduce/GoogleFS clone
(11:08:14 AM) peterVG: yes, stay away from the Hadoop.
(11:08:45 AM) peterVG: think ahead to deploying on-site for other clients. so ease of install/maintenance is a tool requirement
(11:09:03 AM) peterVG: time?
(11:09:23 AM) epmclellan: I've been testing Heritrix and Wayback, that's it for testing
(11:09:25 AM) epmclellan: no docs
(11:09:29 AM) epmclellan: that's it from me
(11:10:06 AM) ARTi: all from me on archivematica
(11:10:15 AM) berwin22: all from me
(11:10:26 AM) peterVG: PEACE!
(11:10:26 AM) epmclellan: ok, it's a wrap