Difference between revisions of "Meeting 20110713"

From Archivematica

Revision as of 13:12, 13 July 2011

Development

  • Joseph has been working on the code side of the MCP
  • Joseph has just started looking at the Gearman project and at possibly integrating it. He is talking to Austin about the install.
  • Gearman will not address our threading issues: there is not enough information in its API to figure out how the blocking calls behave in a multithreaded environment
  • Peter says the timing is good, since Joseph has a lot of MCP refactoring to do for 0.7.2; it is better for us to benefit from work on another open-source component that provides what we were doing with jobs/tasks/queuing etc.

Deployment

Testing

  • Evelyn has been testing Heritrix and Wayback

Chat log

(10:42:36 AM) epmclellan: I can take minutes for Archivematica meeting
(10:43:30 AM) ARTi: thanks epmclellan
(10:44:16 AM) epmclellan: dev news?
(10:44:23 AM) epmclellan: berwin22: ?
(10:44:44 AM) berwin22: I've been working on the code side of the MCP
(10:45:04 AM) berwin22: I've just started looking at the gearman project, and possibly integrating it.
(10:45:12 AM) berwin22: Talking to austin about the install
(10:45:21 AM) peterVG: berwin22: cool
(10:45:37 AM) berwin22: I like its layout over other distributed systems
(10:45:57 AM) berwin22: allows for redundancy to be in place
(10:46:43 AM) peterVG: will it address our threading issues or is that still a candidate for a larger re-factor (e.g. port to Java or C)
(10:46:54 AM) JessicaB left the room (quit: Quit: Leaving.).
(10:46:58 AM) peterVG: ...in 0.9 (if we do)
(10:47:44 AM) berwin22: threading issue still exists, possibly more so.
(10:47:44 AM) berwin22: Not enough information in their API to figure out how the blocking calls behave in a multithreaded environment
(10:49:08 AM) berwin22: clients/workers may be putting their timestamp on tasks, as the assigned time, as the server wouldn't be the mcp
(10:49:23 AM) djjuhasz: Gearman sounds interesting - I'd like to check out the PHP version for Qubit
(10:50:08 AM) peterVG: berwin22: that sounds okay (timestamps)
(10:50:12 AM) berwin22: That's all from me at the moment
(10:50:41 AM) epmclellan: is there supposed to be an anti-spam thing on the minutes page? Where I have to type the words in the box?
(10:50:44 AM) djjuhasz: Are you considering Gearman to replace Twisted then?
(10:51:05 AM) djjuhasz: epmclellan: Think when you hit "Save"
(10:51:22 AM) epmclellan: then I get a box with nonsense words I'm supposed to type?
(10:51:30 AM) djjuhasz: berwin22:  Are you considering Gearman to replace Twisted then?
(10:51:32 AM) epmclellan: that's what I'm getting now
(10:52:44 AM) epmclellan: David just helped me out
(10:52:59 AM) berwin22: Gearman and twisted have two different functions, so no, not replace, but yes, considering pulling twisted.
(10:52:59 AM) berwin22: It would mean moving the xml rpc back to the standard python library, which I'm fine with.
(10:53:36 AM) djjuhasz: hmm, okay - I'll have to clarify the difference between Twisted and Gearman with you outside the meeting berwin22 :)
(10:53:55 AM) berwin22: Gearman provides a generic application framework to farm out work to  other machines or processes that are better suited to do the work.
(10:54:25 AM) berwin22: Twisted is an event-driven networking engine 
(10:54:30 AM) djjuhasz: right
(10:55:24 AM) djjuhasz: thanks
(10:55:51 AM) peterVG: gearman sounds ideal for MCP
(10:56:22 AM) peterVG: timing is good as you have a lot of MCP re-factor for 0.7.2 
(10:57:07 AM) peterVG: better for us to benefit from work on other open-source component that provides what we were doing with job/tasks/queing etc
(10:57:44 AM) peterVG: multiple language plugins, oh, and its in Java (leave door open for multi-threading Gearman processes?) ;-)
(10:58:36 AM) berwin22: I think we can get a client working on windows too 
(10:59:07 AM) peterVG: nice
(10:59:22 AM) peterVG: more dev?
(10:59:37 AM) peterVG: web archiving testing is more like deployment, huh?
(10:59:42 AM) epmclellan: I think so
(10:59:47 AM) berwin22: I'm not sure gearman will work for us though
(10:59:53 AM) peterVG: berwin22: NO!!
(11:00:02 AM) peterVG: let's talk after meeting
(11:00:05 AM) berwin22: k
(11:00:21 AM) ARTi: yeah, not much archivematica related
(11:00:31 AM) ARTi: got heritrix and wayback up and running
(11:00:38 AM) epmclellan: pretty impressive
(11:00:42 AM) peterVG: great
(11:00:57 AM) epmclellan: I have done some crawls of Rockefeller site
(11:01:07 AM) epmclellan: am trying to analyze the WARC format
(11:01:11 AM) ARTi: however, neither of those supply indexing, and nutchwax requires a hadoop cluster :/
(11:01:13 AM) epmclellan: and also checked out the rendering in wayback
(11:01:25 AM) epmclellan: ARTi: there seem to be quite a few tools out there
(11:01:29 AM) ARTi: cool
(11:01:45 AM) epmclellan: WARC is big, there's a bunch of institutions developing tools for it
(11:01:51 AM) berwin22: My concern is that the MCP is designed to only execute supported tasks on a client,
(11:01:51 AM) berwin22: The gearman implementation of this is each supported feature has a function.
(11:01:51 AM) berwin22: This would mean the system would be far less modular; unless I can think of a way around this.
(11:01:52 AM) epmclellan: rendering, extraction, indexing, etc
(11:02:22 AM) epmclellan: so even if we don't get everything we want now, we know that other institutions are working on these tools
(11:02:32 AM) peterVG: ARTi: we'll need indexing & search. there's three other options listed at netpreserve.org, pls evaluate with epmclellan and decide on which to try first
(11:02:48 AM) epmclellan: yes, we've been looking at those
(11:02:57 AM) peterVG: kewl
(11:03:44 AM) epmclellan: what I meant was, even if the tools aren't perfect now at least we know that warc is very popular
(11:03:55 AM) epmclellan: especially with big institutions like Library of Congress
(11:04:05 AM) ARTi: yeah, wera the other indexer option.. require nutchwax too
(11:04:16 AM) ARTi: again hadoopfs cluster thingy :/
(11:04:39 AM) djjuhasz: Does somebody have a link for "ndexing & search. there's three other options listed at netpreserve.org"?
(11:04:39 AM) ARTi: but Ive read some stuff about people using solr for their warc files
(11:04:46 AM) ***djjuhasz is curious
(11:05:13 AM) ARTi: http://netpreserve.org/software/downloads.php
(11:05:14 AM) peterVG: http://netpreserve.org/software/downloads.php
(11:05:18 AM) peterVG: doh!
(11:05:47 AM) peterVG: link shootout
(11:06:36 AM) djjuhasz: thanks x 2 :)
(11:06:38 AM) ARTi: only reason Im wary of hadoop is its suppose to be pretty heavy weight..  mapreduce/GoogleFS clone
(11:08:14 AM) peterVG: yes, stay away from the Hadoop. 
(11:08:45 AM) peterVG: think ahead to deploying on-site for other clients. so ease of install/maintenance is a tool requirement
(11:09:03 AM) peterVG: time?
(11:09:23 AM) epmclellan: I've been testing Heritrix and Wayback, that's it for testing
(11:09:25 AM) epmclellan: no docs
(11:09:29 AM) epmclellan: that's it from me
(11:10:06 AM) ARTi: all from me on archivematica
(11:10:15 AM) berwin22: all from me
(11:10:26 AM) peterVG: PEACE!
(11:10:26 AM) epmclellan: ok, it's a wrap