ADMIRAL project announcements: 2010

Friday 19 November 2010

ADMIRAL Sprint 15 Plan

Notes from the planning meeting for sprint 15 can be found at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_PlanMeeting_15.

The primary goal of this sprint is to complete the work for Phase 1 of the ADMIRAL project.

As such, rather than set a time-frame for the sprint and then pick activities and tasks to fill it, we have listed the work needed to wrap up this phase. The main goal is closing the loop: researchers' data is submitted via ADMIRAL to the Library Databank service, and can then be viewed via the web by the original researchers.

This will form a basis for iterative, researcher-led improvements in phase 2 of the project.

Friday 12 November 2010

ADMIRAL Sprint 14 review

A review of sprint 14 can be seen at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_SprintReview_14.

Progress has been quite satisfying, with several scheduled activities completed, and more progress than planned on the ADMIRAL-to-Databank dataset submission tool, all in spite of a disk failure on one of our servers that took about 1.5 days' effort to recover. This is balanced by less-than-planned progress on deploying ADMIRAL for the Development Group, and not yet having researcher feedback on the dataset submission interface.

We are still awaiting a finalized Databank API test suite.

A new ADMIRAL instance for the Evolutionary Development group has been built and deployed, but has not yet been finally configured and handed over for use by them.

We have a functional dataset submission tool and web interface, but there are some significant usability problems that need to be addressed before we want to consider showing it to researchers. This remaining work is mainly in the area of dataset selection, and the required server components have been developed and tested: it just remains to update the web page to use the new capabilities.

As intended, we completed work to revise the ADMIRAL local store test suite, before scheduling the sprint reviewed here.

Tuesday 26 October 2010

ADMIRAL Sprint 14 Plan

Notes from a planning meeting for sprint 14 can be found at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_PlanMeeting_14.

The planned focus for this sprint is:

Deploy ADMIRAL system for Development group
Complete repository submission testing
Implement a repository submission tool for researchers to use

Monday 25 October 2010

ADMIRAL Sprint 13 review

A review of sprint 13 can be seen at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_SprintReview_13.

In summary, progress has been quite satisfying, with most scheduled activities completed, and also some unscheduled progress. Our new recruit is coming up to speed well with the project development and testing activities.

The initial dataset selection and display functions that will allow researchers to view their submitted datasets are functionally complete (apart from finalizing their deployment via the RDF Databank).

We have made substantial progress on updating the local file store system build scripts and test suite, with a view to building a system for use by the Evolutionary Development group.

Some progress has been made on updating and finalizing the Library Services RDF Databank service, but we are still awaiting a version that we can test and use as a target for dataset submission scripts.

For ongoing work, our immediate priorities are to (a) finish updating the local file store test suite, and (b) configure and deploy a system for the Development group. Beyond that, we need to focus on finalizing the RDF Databank API test suite, and building tools to allow researchers to select and submit datasets to the Databank.

ADMIRAL Sprint 13 Plan

Notes from a planning meeting for sprint 13 can be found at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_PlanMeeting_13.

Due to an unplanned staff departure, sprint 12 was abandoned part way through. Sprint 13 was deferred until a replacement could be engaged. (The intervening period was used to wrap up some development work on the MILARQ project.)

(Further, although the sprint was planned per schedule, writing up the meeting got overlooked until the date of the sprint review, hence this rather overdue post.)

Monday 6 September 2010

ADMIRAL Sprint 12 plan

Sprint 12 has been planned to run over the next 3 weeks.

The Sprint 12 planning meeting is reported at:
http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_PlanMeeting_12.

The three main activities scheduled are:

Deploy ADMIRAL system for Development group
Repository submission testing (mainly coordinating with AR)
Implement repository content display

Friday 3 September 2010

ADMIRAL Sprint 11 review

This sprint has been substantially taken up with introducing a new developer to the various ADMIRAL technologies. Progress (not enough) has been made with improving and debugging the ADMIRAL system generation scripts, but the resulting system is not passing its tests. Some progress has been made towards creating a utility for displaying dataset content in a user-friendly fashion.

Review of sprint 11: http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_SprintReview_11

Saturday 28 August 2010

WissKi project for scientific collaboration and data sharing

As part of my CLAROS-related activity, I've been taking a poke around the WissKi project (http://www.wiss-ki.eu/), which is a German-funded, Drupal-based collaboration platform for scientific research and data sharing, and which also uses CIDOC-CRM as a base ontology.

Generally, this looks like an interesting project and I wonder if we shouldn't be looking to establish links with other data management work in Oxford and beyond. I have been asked to attend a WissKi meeting in September, so it will be interesting to see what common themes we can find.

Among other things, they have assembled a couple of useful ontology-related lists:

Name authorities: http://www.wiss-ki.eu/techwatch/name-authorities
Metadata standards: http://www.wiss-ki.eu/techwatch/metadata-standards

Wednesday 25 August 2010

ADMIRAL Sprint 10 review and Sprint 11 plan

A new (half-time) developer started on the ADMIRAL project yesterday. After the usual administrative details, and setting up a development environment, we did a mini sprint plan for the next 2 weeks of the ADMIRAL project. I say a mini sprint plan, as we didn't do the full activity/user story selection, task breakdown and scope bartering, but rather reviewed the remnants of the most recent active sprint plan and identified key unfinished tasks to be tackled.

The next goal for the project is to complete the functionality covered by phase 1 of the project plan by the end of October, with front-to-back submission of research datasets to the Library Services Databank repository service, and providing visible web-based feedback to our research partners of the submitted datasets. This we intend to use as the basis for iterative improvements and enhancements in phase 2 of the project, with the researchers guiding us concerning what constitutes useful metadata to capture and expose with the submitted datasets.

The sprint plan for the period to 7 September aims to:

review, debug and update documentation for the ADMIRAL system scripted creation procedure
create a new ADMIRAL file sharing deployment for the evolutionary development group
file store bug fixes (password over unencrypted HTTP channel; https access reporting server configuration error)
progress work on Shuffl RDF support

Review of sprint 10:
http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_SprintReview_10

Plan for sprint 11:
http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_SprintPlan_11

We've also had a technical meeting with the Library Services developer of RDFDatabank (aka Databank), the data repository system:
http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_20100825_technical_meeting_with_library_services

Monday 23 August 2010

Gridworks for data cleaning?

I've noticed a fair buzz recently from open government data people about Gridworks, and specifically this blog post from Jeni Tennison:

http://www.jenitennison.com/blog/node/145

I'm reminded of some problems faced publishing the FlyWeb data (http://imageweb.zoo.ox.ac.uk/wiki/index.php/FlyWeb_project), and also of some discussions with Alistair Miles about tooling for cleaning up Malariagen data (http://www.malariagen.net/).

Unsurprisingly, similar problems appear to be faced in publishing government data as open linked data, and the solution that is finding favour there is Gridworks. If it works for them, then I figure it should also work for some of the research data we are trying to deal with. I'm thinking this is something we should look to explore in later phases of the ADMIRAL project, under the broad heading of building more formal structures around raw data (WP6).

Saturday 24 July 2010

JSON for linked data - migrating to RDF

See:
http://shuffl-announce.blogspot.com/2010/07/json-for-linked-data-migrating-to-rdf.html

Project partners meeting - 17-Jun-2010 (delayed post)

This entry was mistakenly posted to the wrong blog, so I'm belatedly moving it to its rightful home, for the record.

We held a project partners meeting on 17 June, notes from which can be found at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_20100617_project_partners_meeting

The main focus for this meeting was discussion of our initial experiences with the OULS data repository, and much of the meeting was taken up with quite arcane technical discussion. Overall, it's looking very promising: we have been able to create test submissions through the provided web interface, and the API looks very straightforward for us to use by other means.

Monday 19 July 2010

ADMIRAL Sprint 9 review and Sprint 10 planning

The review of sprint 9 has been published at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_SprintReview_9. The main achievements for this sprint were:

a very useful project partner meeting
substantially completed the Databank test suite, except for tests dealing with metadata merging
started work on RDF serialization for Shuffl, via JRON
initial page to display RDFDatabank contents - using jQuery and rdfQuery to create a user display from RDF data
brief project description published in D-Lib

We start sprint 10 with the unfortunate news that our second developer is resigning with immediate effect to attend to family commitments. This will certainly impact achievements in the coming weeks, though we hope to maintain forward progress while we investigate and put in place alternative arrangements. This will surely test our risk assessment claim that an agile approach to development helps to mitigate staffing-related risks!

The planning notes and plan for sprint 10 are at:

Our focus for the next sprint will be:

RDF Databank test cases for metadata merging
finish RDF Databank web page to display dataset details
progress deployment of ADMIRAL data store with other research groups
progress RDF/XML serialization for Shuffl data

Wednesday 7 July 2010

ADMIRAL Sprint 9 plan

The ADMIRAL sprint 9 plan has been posted at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_SprintPlan_9.

This was a very simplified planning meeting for a short sprint, with a focus on scheduling priority tasks from those already planned.

For the next sprint, we aim to:

Complete test suite for accessing ADMIRAL databank
Progress RDF creation using Shuffl
Arrange meetings with Development and Elephant groups

With a view to working towards a front-to-back, researcher-to-repository, test deployment, including:

capturing dataset descriptions as RDF
packaging data+description and submitting to the RDFDatabank test system

Tuesday 6 July 2010

ADMIRAL Sprint 8 Review

The review for sprint 8 is posted at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_SprintReview_8.

Project activity has been impacted by unplanned staff absences, but useful progress has been made on a number of technical fronts. Interaction with Library Services to develop the data repository service has been going well, and we look forward to rapid progress in this area. Work to add RDF serialization to Shuffl is under way. Rolling out ADMIRAL data stores to additional research groups has been stalled.

This review is being posted nearly a week later than planned, so includes a period not covered by the original sprint plan.

Summary of achievements this sprint:

project partner and OULS meetings
attendance at JISC TransferSummit meeting
brief project description for D-Lib magazine
Shuffl WebDAV file browsing implementation complete
RDFDatabank access
RDFDatabank initial test suite
negotiating revisions to RDFDatabank API in light of experience
started on RDF serialization for Shuffl

SWORD white paper: relevant to ADMIRAL?

I've just read through a white paper about directions for the SWORD deposit protocol: http://sword2depositlifecycle.jiscpress.org/

I recognize here many of the points we've been discussing about the ADMIRAL submission API to the Library Services RDF Databank:

submitting datasets as packages of files
selective updating within dataset packages
accessing manifest and content
accessing metadata about the package
etc.

I'm not advocating at this stage that we should be trying to track the SWORD work, but I do think we should try to ensure that nothing prevents us from creating a full SWORD interface to RDF Databank at some stage in the future.

Friday 2 July 2010

RDF implementation for Shuffl, via RDF-in-JSON

I'm throwing my hat into the RDF-in-JSON ring, planning an implementation of RDF/XML serialization for Shuffl workspaces and cards using RDFQuery [1] and JRON [2]. My initial design notes are at http://code.google.com/p/shuffl/wiki/JRON_implementation_notes.

[1] http://code.google.com/p/rdfquery/
[2] http://decentralyze.com/2010/06/04/from-json-to-rdf-in-six-easy-steps-with-jron/

Thursday 3 June 2010

Sprint 8 plan

The ADMIRAL sprint 8 plan has been posted at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_SprintPlan_8.

Notes from the planning meeting are at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_PlanMeeting_8.

(See previous post for Sprint 7 review.)

ADMIRAL Sprint 7 Review

The review for sprint 7 is posted at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_SprintReview_7.

Progress has been achieved on a number of fronts, and we are now ready for access to a test deployment of the OULS Databank repository system. The ADMIRAL data store is in use by the Silk research group.

We attended the JISC managing research data programme meeting, and have established lines of potential close collaboration with the Oxford-based SUDAMIH project, which were further developed at our project advisory board meeting [1]. We also discussed collaboration with CLARION, particularly in the area of integrating their embargo manager into a different environment.

Summary of achievements this sprint:

Project and steering group meetings;
attended JISCMRD programme meeting;
Shuffl access to and browsing of ADMIRAL shared files;
abandoned attempt to use Ubuntu 10.04;
sample code for creating BagIt (Zip) packages;
enhancements to user management scripts;
initial investigation of RDF parsing using jQuery.

[1] http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_20100520_advisory_board_meeting
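
The sample BagIt packaging code mentioned in the list above is not reproduced here. As a rough illustration of the idea only, the following Python sketch builds a Zip file laid out as a BagIt bag: a `bagit.txt` declaration, a `data/` payload directory and a checksum manifest. The function name, bag name and file contents are hypothetical, and the BagIt version string and the choice of MD5 are assumptions rather than anything taken from the ADMIRAL code.

```python
import hashlib
import io
import zipfile


def make_bag_zip(bag_name, payload):
    """Build an in-memory Zip file laid out as a BagIt bag.

    payload maps relative file names to their content as bytes.
    Returns the bytes of the Zip file.
    """
    manifest_lines = []
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as bag:
        for name in sorted(payload):
            content = payload[name]
            # Payload files live under <bag>/data/
            bag.writestr("%s/data/%s" % (bag_name, name), content)
            digest = hashlib.md5(content).hexdigest()
            manifest_lines.append("%s  data/%s\n" % (digest, name))
        # The manifest lists one checksum per payload file
        bag.writestr("%s/manifest-md5.txt" % bag_name, "".join(manifest_lines))
        # Bag declaration (version string is an assumption)
        bag.writestr("%s/bagit.txt" % bag_name,
                     "BagIt-Version: 0.97\n"
                     "Tag-File-Character-Encoding: UTF-8\n")
    return buffer.getvalue()
```

Calling `make_bag_zip("mybag", {"readme.txt": b"hello"})` returns the bytes of a Zip file that could be written to disk or sent to a repository.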

Thursday 27 May 2010

Not-so-lucid Lynx?: Ubuntu 10.04 not ready for ADMIRAL

We recently tried to update our ADMIRAL system configuration to use Ubuntu 10.04 (Lucid Lynx) rather than 9.10 (Karmic Koala). It has been our plan to base ADMIRAL on a version of Ubuntu designated for Long Term Support (LTS), and 10.04 is the most recent such system.

Unfortunately, we have run into a number of problems that mean we shall stick with 9.10, which has been performing very well, for the time being. At the heart of these problems is the vmbuilder package we use to generate ADMIRAL system images.

Ongoing notes about the problems encountered can be seen here http://imageweb.zoo.ox.ac.uk/wiki/index.php/Ubuntu_10.04.

Friday 7 May 2010

mod_dav, do we have a problem?

Tracking down a strange bug this morning, using Javascript code in FireFox+FireBug against Apache 2.2 and mod_dav. My test code does a PUT soon followed by a HEAD to the same URI (part of logic that tests to see if the file just created actually exists). But the HEAD command returns inconsistent results: sometimes 404, sometimes not, running the same test suite with no other changes. I did manage to see the incorrect sequence in a Wireshark trace, so I think that lets FF off the hook here.

A simple, but dubious, workaround is to put a 100ms delay in the test suite after initially creating the test fixture data; now all tests run fine, repeatedly.
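
A fixed delay depends on timing that may differ between machines. One alternative, sketched here in Python rather than the Javascript of the actual test suite, is to poll until the new resource is visible or a timeout expires. The name `wait_until` and its parameters are hypothetical, and `probe` stands in for whatever code issues the HEAD request and checks the status.

```python
import time


def wait_until(probe, timeout=2.0, interval=0.05):
    """Call probe() repeatedly until it returns True or the timeout expires.

    Returns True if probe() succeeded, False if `timeout` seconds passed
    without success. probe() is always called at least once.
    """
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A test fixture could then call something like `wait_until(lambda: head_status(uri) == 200)` after the PUT, where `head_status` is again a hypothetical helper. This does not explain the inconsistent HEAD results; it only makes the workaround less dependent on a particular delay value.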

I still find that FireFox can be a bit inconsistent in the time it takes to do things (garbage collection?), so some tests time out occasionally, but I can live with this.

Using: MacOS 10.5 and 10.6, Apache 2.2 (XAMPP distribution), Firefox 3.5.7, jQuery 1.4.2

Friday 30 April 2010

Sprint 6 and progress-to-date review

Sprint 6 has been generally fruitful:

Silk Group file sharing is ready for use;
Silk Group file sharing online user documentation has been prepared;
Shuffl WebDAV storage module has been implemented;
Shuffl is now able to run from and save workspace data to an ADMIRAL system;
HTTP authentication works well with AJAX calls;
JISC 6-month report has been submitted.

The full review is available at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_SprintReview_6.

This review follows a different pattern to the usual end-of-sprint review. Having a data sharing platform ready for use by researchers was originally estimated to take about 1.5 months, but the work has actually taken about 3 months. The question driving this review is: "Why so long?".

In summary, we don't feel the execution has been greatly lacking (though other experts may disagree and we welcome discussion). The mismatch between plan and execution seems to be due to oversights in the planning. Some lessons we offer are:

We should not assume that using tried-and-tested software means that there is no development risk.
We need to create a system that provides the user with a smooth, secure interface. Focusing on the user experience can easily push software components out of their design "comfort zone".
Plan for an extended period of "scoping experiments" with any external technology, to be conducted alongside user requirements gathering.
Plan to script as much as possible of a system build and configuration: this takes longer to begin with, but saves time in the long term as it facilitates repeating a build with minor variations.
Plan to create automated tests for as much as possible. It may seem strange to be creating "unit tests" for accessing a file system, but they have been invaluable for our work in testing access control mechanisms.
Plan to build separate systems for testing and live deployment. Automated tests should run with either.
When working with a number of users, plan a separate system build and deployment phase for each.
We estimate that ongoing user interactions require about half a day per week.
Estimated project management overheads (for a small project on the scale of ADMIRAL): 25% for the project manager, and 10% for other team members.
We estimate that about 20% of our development time is spent on creating technical documentation through wikis and blogs, but this is generally absorbed in the development effort estimates.

Projects aimed at changing user behaviour, such as ADMIRAL, are critically dependent on a solid technology foundation. Despite not being focused on technology development, we cannot dismiss the engineering effort that is needed to create a smooth, dependable experience for the users.

We estimate that access to in-depth expertise about the software we were using may have saved 25%-50% in some of the areas of time overrun.

Thursday 29 April 2010

ADMIRAL and DropBox (WIN)

From the early days of ADMIRAL's conception, DropBox's user-friendly multi-platform interface was a model that we wished to emulate. Unfortunately the non-local storage of files in DropBox is a showstopper for some of our users.

For users who are not concerned about off-site storage of their data, we have experimented with using DropBox alongside ADMIRAL, with a good degree of success. Our findings are described in the ADMIRAL wiki.

In summary, DropBox can be used to synchronize a remote computer with a user's ADMIRAL filestore area, thus providing easy-to-use remote access, file recovery, file versioning and easy updating from any web browser. This works as an add-on to, and does not affect, the core ADMIRAL capabilities.

Web resource authentication and AJAX (WIN)

We've just run a series of experiments to test how web resource access controls work for AJAX requests. The good news is that (for us, at least, using a Firefox browser and Apache server) they work just the way we want them to.

The test environment used was:

Ubuntu 9.10 with Apache 2.2 server with mod_dav, etc., configured with ADMIRAL user accounts
Ubuntu 9.10 Firefox 3.5.9
Shuffl development code

We copied the Shuffl source code into a WebDAV-enabled area of the file server, modified one of the Shuffl demo applications to register a WebDAV storage handler, and loaded the demo application workspace in a browser.

We were able to:

save a modified workspace into a non-access-controlled WebDAV directory without entering any user credentials,
save a modified workspace into an access controlled directory on entering appropriate user credentials,
access but not modify files in a user's directory when authenticated as the research group leader.

Further, we observed that, when logged in as one user, attempts to access the area of another user were refused until correct credentials were provided.

We had anticipated that the AJAX calls might require authentication by the Javascript code if the required authentication cookies were not already established when loading the web page. The plan was to ensure that the application web page would be protected so that user credentials would be required when loading that page. As it turns out, even this simple step is not needed.

Wednesday 28 April 2010

Shuffl WebDAV test cases all running (WIN)

We have successfully implemented a WebDAV storage module for Shuffl that supports all storage interface functions, including listing the contents of a collection.

This paves the way for a more user-friendly Shuffl interface for loading and saving workspaces.

Tuesday 27 April 2010

WebDAV and Javascript same-origin violations

We've noticed some strange problems using WebDAV to access a server running on the local development machine (i.e. "localhost"). We're using Ajax code running in Firefox to issue the WebDAV HTTP requests, and an Apache 2.2 server running mod_dav, etc., to service them. We're using a combination of FireBug and Wireshark to monitor HTTP traffic.

The immediate symptom we see is that the HTTP requests using methods other than GET or POST (specifically DELETE and MKCOL) are being rejected with access-denied errors without being logged by FireBug. But looking at a Wireshark trace, what we see is an HTTP OPTIONS request and response, with no further HTTP exchange for the operation requested.

What appears to be happening is that the HTTP OPTIONS request is being used to perform "pre-flight" checking of a cross-origin resource sharing request, per http://www.w3.org/TR/access-control/, which is being refused by the server's response.

This was puzzling for us in two ways:

That a request to localhost was being rejected in this way, and
The use of the cross-origin sharing protocol, which is performed "under the hood" by recent versions of Firefox.

The rejection of localhost requests is not consistent: on a MacBook, the request is allowed (still using Firefox and Apache), but on some Ubuntu systems it is refused. When the request is refused, the workaround is to send it explicitly to the fully qualified domain name rather than just "localhost". (This is a bit of a pain as it means our test cases are not fully portable, and I'm hoping we can later find an Apache configuration option to allow this level of resource sharing.)

UPDATE

It turns out that the above "observation" is a complete red herring. We had failed to notice that the web page was being loaded using the full domain name of the host, rather than http://localhost/.... When a URI of the form http://localhost/... is used to load the HTML page that invokes the Javascript code, then WebDAV access to localhost works just as expected.
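
The underlying rule is that a browser treats two URIs as having the same origin only if scheme, host name and port all match, comparing host names as text rather than by the address they resolve to. A minimal Python sketch of that comparison follows; the function names and the host name `myhost.example.org` are hypothetical, and this is an illustration of the rule, not browser code.

```python
from urllib.parse import urlsplit

# Ports implied when a URI does not give one explicitly
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return the (scheme, host, port) triple that identifies a URI's origin."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return (scheme, host, port)


def same_origin(page_url, request_url):
    """True if a request from page_url to request_url is a same-origin request."""
    return origin(page_url) == origin(request_url)
```

With a page loaded from `http://myhost.example.org/shuffl/test.html`, a request to `http://localhost/webdav/` is cross-origin even when both names refer to the same machine, which is consistent with the pre-flight OPTIONS request seen in the trace.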

Wireshark

We've been using Wireshark to help debug and understand protocol flows. I've used Wireshark before, and its predecessor Ethereal, but I've been very impressed at how easy recent versions are to install and use (on Linux and MacOS, at least) for high-level software debugging.

The HTTP protocol decode is really useful, and it handles messy details like re-assembling TCP packets so that protocol units are clearly displayed.

Also, it works very well with a local loopback interface, so it's not necessary to figure out arcane filters to exclude background network traffic when debugging a local client/server interaction.

Under Linux, remember that the Pcap library is also needed - the Ubuntu package name is libcap2-bin. Under recent versions of Ubuntu, it is also necessary to set appropriate privileges: see http://wiki.wireshark.org/CaptureSetup/CapturePrivileges.

Listing a directory using WebDAV

I found it surprisingly hard to find this simple recipe on the web, so thought I'd document it here. The difficulty of achieving this with AtomPub has been one factor holding back wider use and further development of Shuffl, so hopefully that will be changing.

To use WebDAV to list the contents of a directory, issue an HTTP request like this:

PROPFIND /webdav/ HTTP/1.1
Host: localhost
Depth: 1

<?xml version="1.0"?>
<a:propfind xmlns:a="DAV:">
  <a:prop><a:resourcetype/></a:prop>
</a:propfind>

Or, using curl, a Linux shell command like this:

curl -i -X PROPFIND http://localhost/webdav/ --upload-file - -H "Depth: 1" <<end
<?xml version="1.0"?>
<a:propfind xmlns:a="DAV:">
<a:prop><a:resourcetype/></a:prop>
</a:propfind>
end

The response is an XML file like this:

HTTP/1.1 207 Multi-Status
Date: Tue, 27 Apr 2010 09:38:30 GMT
Server: Apache/2.2.14 (Unix) DAV/2 mod_ssl/2.2.14 OpenSSL/0.9.8l PHP/5.3.1 mod_perl/2.0.4 Perl/v5.10.1
Content-Length: 706
Content-Type: text/xml; charset="utf-8"

<?xml version="1.0" encoding="utf-8"?>
<D:multistatus xmlns:D="DAV:" xmlns:ns0="DAV:">

<D:response xmlns:lp1="DAV:">
  <D:href>/webdav/</D:href>
  <D:propstat>
    <D:prop>
      <lp1:resourcetype><D:collection/></lp1:resourcetype>
    </D:prop>
    <D:status>HTTP/1.1 200 OK</D:status>
  </D:propstat>
</D:response>

<D:response xmlns:lp1="DAV:">
  <D:href>/webdav/README</D:href>
  <D:propstat>
    <D:prop>
      <lp1:resourcetype/>
    </D:prop>
    <D:status>HTTP/1.1 200 OK</D:status>
  </D:propstat>
</D:response>

<D:response xmlns:lp1="DAV:">
  <D:href>/webdav/shuffltest/</D:href>
  <D:propstat>
    <D:prop>
      <lp1:resourcetype><D:collection/></lp1:resourcetype>
    </D:prop>
    <D:status>HTTP/1.1 200 OK</D:status>
  </D:propstat>
</D:response>

</D:multistatus>

There, that was easy!
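
For completeness, here is a sketch of the same recipe in Python using only the standard library; it is not the Shuffl code, which is Javascript. The function names are hypothetical. `parse_multistatus` extracts each href from a response like the one above and notes whether it is a collection; `list_collection` issues the PROPFIND request and assumes an unauthenticated server at the given host.

```python
import http.client
import xml.etree.ElementTree as ET

DAV = "{DAV:}"  # ElementTree notation for the DAV: namespace

PROPFIND_BODY = (
    '<?xml version="1.0"?>\n'
    '<a:propfind xmlns:a="DAV:">\n'
    '  <a:prop><a:resourcetype/></a:prop>\n'
    '</a:propfind>\n'
)


def parse_multistatus(xml_text):
    """Return a list of (href, is_collection) pairs from a 207 response body."""
    entries = []
    root = ET.fromstring(xml_text)
    for response in root.iter(DAV + "response"):
        href = response.findtext(DAV + "href")
        is_collection = response.find(".//" + DAV + "collection") is not None
        entries.append((href, is_collection))
    return entries


def list_collection(host, path):
    """Issue PROPFIND with Depth: 1 and return the parsed listing."""
    conn = http.client.HTTPConnection(host)
    try:
        conn.request("PROPFIND", path, body=PROPFIND_BODY,
                     headers={"Depth": "1", "Content-Type": "text/xml"})
        reply = conn.getresponse()
        if reply.status != 207:
            raise RuntimeError("unexpected status %d" % reply.status)
        return parse_multistatus(reply.read())
    finally:
        conn.close()
```

Note that the first entry returned is the requested collection itself, so a directory listing would normally skip it.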

Tuesday 13 April 2010

Sprint 6 plan

The ADMIRAL sprint 6 plan has been posted at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_SprintPlan_6.

Notes from the planning meeting are at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_PlanMeeting_6.

(See previous post for Sprint 5 review)

Thursday 1 April 2010

Sprint 5 review

We've just reviewed a particularly long sprint. Notes of the review are at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_SprintReview_5.

(We chose to run a long sprint with an intermediate review because we feel that having sprints less than 2 weeks creates too much planning overhead in our particular circumstances - reviewing and adjusting the plan half way through the sprint seems to be a workable compromise.)

The primary goal of the sprint was to complete deployment of a live ADMIRAL file sharing service for the Silk Group, based on a new virtual hosting environment. We fell just short of that goal, mainly due to having to introduce new, unanticipated, access control mechanisms to handle the group's specific requirements. But in the process of doing this, we seem to have developed a solution to a problem for which the conventional wisdom seems to be "you can't do that", viz. effective file access control for files that can be created via HTTP/WebDAV or locally or by CIFS. The new ingredient that makes this possible is Linux ACLs.

An initial draft security model for ADMIRAL has been created, though will surely benefit from review and revision as the project proceeds.

Maybe the biggest disappointment of this sprint was our failure to convene an intended monthly meeting, due primarily to unavailability of key partners, and possibly also because of not allowing enough lead time for planning.

Tuesday 30 March 2010

First attempt at a security model for ADMIRAL

As we are offering to safeguard real users' research data, I thought we should attempt "due diligence" that we were doing so in a reasonable fashion. To this end, I have been working on a security model for the ADMIRAL data stores.

This is new territory for me, and I'm fairly sure there are many things that I have overlooked, failed to properly think through, or just got plain wrong. But, like all the ADMIRAL working documents, it's public and open to review, which in this case I would eagerly welcome.

The security model is documented at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_LSDS_security_model.

I fully expect to have to revisit this over the course of the project, as requirements are developed and concerns are identified. I hope that by starting now with an imperfect model, we'll have plenty of time to clean it up and make it fit for purpose.

Thursday 25 March 2010

Reviewing new survey returns - "steady as she goes"

We held a brief meeting to review some additional data usage survey returns from the Behaviour group. Notes are at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_20100325_Data_Surveys_Review.

There is nothing here to suggest any change in our main priorities, but some interest was noted for automatic versioning of data and data visualization.

Meanwhile, we're making progress on some tricky access control configuration issues to meet specific requirements from the Silk group, and are learning to use Linux ACLs to meet the requirements. Some notes about how this is done are at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_ACL_file_access_control, but until we have a full test suite in place, this remains work in progress.
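
To give a flavour of the kind of rule involved, the Python sketch below builds a `setfacl` command that would give one user read-only access to another user's directory tree, including files created later. It is an illustration only and not the ADMIRAL configuration, which is described on the wiki page above; the function name, user name and path are hypothetical.

```python
def readonly_acl_command(reader, directory):
    """Build a setfacl command granting `reader` read-only access.

    -R applies the change recursively and -m modifies the existing ACL.
    The "u:" entry covers existing files and directories; the "d:u:" entry
    sets a default ACL so that new files inherit the same access.
    "X" grants search permission on directories without making ordinary
    files executable.
    """
    access_entry = "u:%s:rX" % reader
    default_entry = "d:u:%s:rX" % reader
    return ["setfacl", "-R", "-m",
            "%s,%s" % (access_entry, default_entry), directory]
```

The resulting list can be passed to `subprocess.run(..., check=True)` on a system with ACL support enabled on the file system.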

Friday 19 March 2010

The role of workflows

These are some notes from a discussion with my colleague, Jun Zhao, who has been asked about using our research partners as use-case studies for workflow sharing.

Our immediate response to this request was one of skepticism, based on our belief that none of our research partners would be willing to try using workflow-based tools because we couldn't see that they would gain sufficient benefits to justify the "activation energy" of deploying and learning to use such tools. In the past, our partners have been dismissive of using even very simple systems for which they could not perceive immediate benefits.

This was a somewhat surprising conclusion given the enthusiasm for workflow sharing among other bioinformatics researchers, and also researchers in other disciplines, and we wondered why this might be.

We considered each of our research group partners, covering Drosophila genomics, evolutionary development, animal behaviour, mechanical properties and evolutionary factors affecting silk, and elephant conservation in Africa. We noticed that:

each research group used quite manually intensive experimental procedures and data analysis, of which the mechanized data analysis portions were quite a small proportion,
the nature of the procedures and analysis techniques used in the different groups was very diverse, with very little opportunity for sharing between them.

This seems to stand in contrast to genetic studies that screen large numbers of samples for varying levels of gene products, or high throughput sequencing looking for significant similarities or differences in the gene sequences of different sample populations. The closest our research partners come to this is the evolutionary development group, who use shotgun gene sequencing approaches to look for interesting gene products, but even here the particular workflows used appear to be highly dependent on the experimental hypothesis being tested.

What conclusions can we draw from this? Mainly, we think, that it would be a mistake for computer scientists and software tool developers to assume that a set of tools that has been found useful by one group of researchers is useful to all groups studying similar natural phenomena. Details of experiment design would appear to be a dominant indicator for the suitability of a particular type of tool.

Sprint 5 plan review

Because of the long duration of this sprint (ADMIRAL Sprint 5, 3.5 weeks), the plan has been reviewed part way through. The main changes are pushing back on diagramming activities, and pushing forwards on file synchronization and access control interface investigations.

See: http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_PlanMeeting_5.

Sheer curation as "curation by addition"

We are finally getting stuck into selecting metadata for ADMIRAL data capture for preservation, via Databank (http://databank.ouls.ox.ac.uk/). This blog post by Ben O'Steen is most helpful:
http://oxfordrepo.blogspot.com/2008/10/modelling-and-storing-phonetics.html:

"... the way I believe is best for this type of data is to capture and to curate by addition. Rather than try to get systems to learn the individual ways that researchers will store their stuff, we need to capture whatever they give us and, initially, present that to end-users. In other words, not to sweat it that the data we've put out there has a very narrow userbase, as the act of curation and preservation takes time."

I think this very nicely articulates the tone for the ADMIRAL project as we set out to curate our research partners' data.

As I write this, we've just had a follow-up meeting with one of our research group partners, CH. It is very interesting to note that what we're aiming to offer initially duplicates a facility the research group have already provisioned for themselves (shared filestore), but with just enough additional capability to be useful (automatic daily backup), so in this sense we really are adding small capabilities to researchers' existing practices. Capturing elements from this, and moving them to Databank, should prove to be another small addition.

Other related links include:

https://confluence.ucop.edu/display/Curation/BagIt - BagIt specification, a very simple specification for packaging and shipping directory subtrees.
http://databank.ouls.ox.ac.uk/ - OULS Databank, Ben's realization of what is described in his blog article. Digging around in here illustrates what Ben describes in his blog post.
http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_Databank_submission_requirements - Metadata selection for Databank submission.

It will be interesting to see how well we can put "curation by addition" into practice.
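
To give a feel for how simple the BagIt packaging mentioned above is, here is a minimal Python sketch that copies a directory subtree into a bag: a data/ payload directory, a bagit.txt declaration and a checksum manifest. The function name make_bag is hypothetical, and the version string ("1.0") and checksum algorithm (SHA-256) are assumptions taken from the later published form of the specification; they are not necessarily what ADMIRAL or Databank use.

```python
import hashlib
import shutil
from pathlib import Path


def make_bag(source_dir, bag_dir):
    """Copy source_dir into bag_dir/data and write minimal BagIt tag files."""
    source = Path(source_dir)
    bag = Path(bag_dir)
    payload = bag / "data"
    # copytree creates bag_dir and bag_dir/data; it fails if data/ already exists.
    shutil.copytree(source, payload)

    # Bag declaration: identifies the directory as a bag.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # Payload manifest: one "<checksum>  <path>" line per payload file,
    # with paths given relative to the top of the bag.
    lines = []
    for path in sorted(p for p in payload.rglob("*") if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(bag).as_posix()}\n")
    (bag / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")
    return bag
```

Calling make_bag("results", "results-bag") on a hypothetical results directory would leave the original files untouched and produce a self-describing copy that can be shipped and later verified against its manifest, which fits the "capture whatever they give us" approach.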

Meeting with Silk Group researcher

Held a meeting today with CH of the Silk Group. This was partly a follow-up from the data surveys, and partly to prepare for our first live LSDS deployment. There were (mercifully) few surprises, the main points noted being:

Expectations for the usability of the access control interface are set by the Lacie NAS box that the group currently use. For the time being we'll configure the users manually, and later we'll look into a UI for creating and modifying LDAP entries.
File sharing with automatic backup is an important advance in functionality over bare NAS.
Desirability/priority of looking at automatic harvesting to LSDS is raised by our discussions; we will raise the priority of looking at solutions for this. (We've already tried to deploy the Fascinator "Watcher", but that didn't work for us. Another promising option is iFolder.)

Meeting notes are at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_20100319_meeting_CH

Tuesday 16 March 2010

ADMIRAL virtualization environment running

Installing VMWare ESXi

Over the past few days, we have received and installed new server hardware, installed VMWare ESXi, and transferred our test file sharing system build to run in the ESXi environment.

VMWare ESXi is a "bare metal" virtualization host; that is, it installs directly onto some server hardware, rather than onto an existing operating system (in contrast to systems like VMWare Server).

Installing and using ESXi turned out to be very easy - much easier than, say, setting up a VMWare Server under Linux. In hindsight, I think this is because ESXi is a dedicated environment, with really no choices to be made, and hence far fewer system components to be configured. This is fortunate, as the VMWare documentation is pretty impenetrable. The biggest hurdle to overcome was the fact that the ESXi system management console software (vSphere) has to be installed on a Windows client, XP service pack 2 or later, which was slightly awkward for us, since we are mostly a Linux and Mac shop these days.

Installing ADMIRAL file sharing for the Silk group

Transferring the virtual machine images from our KVM test environment to ESXi went pretty smoothly. A small change to the script used to run the virtual machine image builder (vmbuilder) allows VMWare disk images to be generated directly. Copying these files to the ESXi system is a slightly fiddly 2-stage process, taking about 20 minutes in total, as the disk image file is quite large at 600+ MB.

Getting the system running under ESXi required some rethinking of our original approach. While the pre-built image would boot directly into a newly created VM, the networking would not work in our chosen VM configuration until VMWare tools is installed. This in turn requires that the original system image has Linux kernel development tools installed (Ubuntu kernel-package and linux-headers), which are largely responsible for the large size of the disk image file. With this taken care of, the VMWare tools installation runs very smoothly, and the system can be rebooted with functional networking. (NOTE: ESXi does not support NAT, only bridged networking, so an IP address must be allocated for the new server before networking can be activated.)

The need to install VMWare tools to get networking capability means that our original plan of doing much of the system configuration automatically on first system boot has been dropped in favour of using manually initiated scripts to handle post-boot system configuration. Scripts for configuring certificates, LDAP, Samba and automatic backups require a fair degree of user interaction, but are otherwise quite straightforward.

After all this, our test suite (which checks file access via Samba, file access via WebDAV and direct HTTP access) ran straight away against the new server. For this, having the pyunit-based test suite was a real boon.
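
For readers unfamiliar with this style of testing, the sketch below shows roughly what a pyunit (unittest) check of direct HTTP access can look like. It is not taken from the ADMIRAL test suite: to stay self-contained it starts a throwaway local web server over a temporary directory as a stand-in for the real file sharing server, and the class, file and function names are all hypothetical.

```python
import functools
import http.server
import tempfile
import threading
import unittest
import urllib.error
import urllib.request
from pathlib import Path


class TestHttpFileAccess(unittest.TestCase):
    """Check that a file in the shared area can be read back over HTTP."""

    def setUp(self):
        # Stand-in for the shared file area and the web server under test.
        self.tmp = tempfile.TemporaryDirectory()
        Path(self.tmp.name, "hello.txt").write_text("test data\n", encoding="utf-8")
        handler = functools.partial(
            http.server.SimpleHTTPRequestHandler, directory=self.tmp.name
        )
        self.server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
        threading.Thread(target=self.server.serve_forever, daemon=True).start()
        self.base_url = "http://127.0.0.1:%d/" % self.server.server_address[1]
        # Ignore any proxy settings so requests go straight to the local server.
        self.opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

    def tearDown(self):
        self.server.shutdown()
        self.server.server_close()
        self.tmp.cleanup()

    def test_read_existing_file(self):
        with self.opener.open(self.base_url + "hello.txt") as response:
            self.assertEqual(response.status, 200)
            self.assertEqual(response.read(), b"test data\n")

    def test_missing_file_is_404(self):
        with self.assertRaises(urllib.error.HTTPError) as ctx:
            self.opener.open(self.base_url + "no-such-file.txt")
        self.assertEqual(ctx.exception.code, 404)
        ctx.exception.close()


def run_suite():
    """Run the checks and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestHttpFileAccess)
    return unittest.TextTestRunner(verbosity=1).run(suite)
```

Calling run_suite() runs both checks. Pointing a suite like this at a newly built server, rather than at the local stand-in, only needs a different base URL, which is what makes such tests useful after a platform move like the one described here.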

Currently, we are using Ubuntu 9.10 ("Karmic") for our ADMIRAL server platform. We do intend to update to version 10.04 ("Lucid") when it becomes available, as this version has been designated for "long term support". The LDAP configuration in 9.10 does seem to be something of a work-in-progress, so we fully expect to revisit this work when we come to upgrade the base system. We have also dropped Kerberos from our platform for the time being, because of difficulties getting it to work with a range of common clients. LDAP seems to be a reasonable compromise, as it allows all ADMIRAL facilities to be authenticated and authorized from a common source.

More information about the setup is at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_VMWare_ESXi_notes.

Monday 8 March 2010

Sprint 5 plan

The ADMIRAL sprint 5 plan has been posted at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_SprintPlan_5.

Notes from the planning meeting are at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_PlanMeeting_5.

Notes from sprint 4 retrospective meeting are at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_SprintReview_4. The summary position for this sprint is:

Good progress was made on a number of fronts, with all but two tasks (i.e. DropBox testing and databank submission metadata requirements) completed to the point of allowing further progress. But getting all the security features to work exactly as intended is still proving tricky. A good start has been made on automated testing of the LSDS features. For the next sprint, we aim to use the progress so far (LDAP for authentication and authorization), focus on a first live deployment, and start to think about basic annotation of datasets.

The value of pair working and group working is showing itself. The web site blitz set out to create a published web site, and that was achieved in a day. Pair working has helped us to get up and running quickly (e.g. setting up the test framework). But GK needs to back off unplanned involvement until asked!

We've also held the first of a series of one-to-one meetings with our research users, notes of which are posted in the project wiki. The main purpose of these meetings has been to clarify and extend the information gleaned from the initial data audit surveys, especially with respect to understanding their requirements regarding data volumes and frequency of access. The meetings have been kept short and sweet, as we don't want the researchers to feel that we're eating into their valuable time whenever we ask for a meeting.

Friday 19 February 2010

First review of data usage surveys

A first meeting has been held to review the initial returns from the data usage survey from the Silk and Development research groups.

Meeting notes are at
http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_20100219_meeting_Data_Surveys_Review.

The main outcome is that our current development priorities appear to match what the researchers would like to see us provide.

Wednesday 17 February 2010

Sprint 4 plan

The ADMIRAL sprint 4 plan has been posted at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_SprintPlan_4.

Notes from the planning meeting are at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_PlanMeeting_4.

Notes from sprint 3 retrospective meeting are at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_SprintReview_3. The summary position for this sprint is:

The general feeling is that forward progress is being made, but slower than hoped due mainly to unanticipated technical problems with integration of SSO authentication with file sharing services. We plan to push ahead with these issues only partially resolved in the hope that solutions will be found later. Meanwhile, some features will not benefit fully from SSO integration.

Friday 5 February 2010

ADMIRAL public code repository established

We have finally created a public code repository for the ADMIRAL project. It is at http://code.google.com/p/admiral-jiscmrd/.

We had put off establishing a code repository for ADMIRAL until we had a clearer view of the requirements:

Should it be public, or private? If we are using the repository for user-related information, then a public repository is not appropriate. In any case, changes to Shuffl performed as part of the ADMIRAL project will be maintained within the Shuffl project (http://code.google.com/p/shuffl/).
What kind of version management is required? We had in mind to use Mercurial, for reasons mentioned elsewhere (http://imageweb.zoo.ox.ac.uk/wiki/index.php/Mercurial_repository_publication), but did not want to commit in case any specific reasons to use a different versioning system were discovered.

Over the past couple of weeks, we have been investing some effort into configuring Samba, Apache and WebDAV to work with Kerberos authentication (http://imageweb.zoo.ox.ac.uk/wiki/index.php/Zakynthos_Configuration). It seems that what we are doing with Kerberos is (a) pushing at the boundaries of what is commonly deployed and documented, and (b) something that a number of people have asked about doing, so we felt it was time to put some of the things we are learning into public view.

We've chosen a Google Code project ("admiral-jiscmrd", as "admiral" was already taken) and have decided to go with the original plan of using Mercurial version management.

The initial content of the public code repository is a set of scripts and configuration files we are developing to automate the assembly of a virtual machine image for file sharing and web access with access control linked to a Kerberos SSO authentication infrastructure (in our case, Oxford University's).

Thursday 4 February 2010

Research data management support issues

We had a useful meeting today with our departmental IT support group to discuss ADMIRAL and its ongoing support beyond the life of the ADMIRAL project. I believe this meeting brought into focus some policy management issues that will apply across a range of systems that attempt to support data management for individuals and small research groups that don't have resources to do their own IT support. Notes from the meeting are at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_20100204_meeting_-_IT_Support.

The issue raised is that the IT support team are (understandably) hesitant to take on support for any particular system - even when these are just assemblies of off-the-shelf components, such as the base ADMIRAL data sharing platform. Yet, if we are to succeed in providing data management services to researchers, some framework for ongoing support must be worked out. The range of options is additionally constrained by the fact that commercial cloud-hosted services bring problems of responsibility for and jurisdiction over the data. In Oxford, the legal team have expressed concern about having University-owned content hosted on systems that may be anywhere in the world.

In our department, IT support do provide some help and support for staff desktop and personal machines, though not (generally) for particular applications running on them.

Assuming we have effective systems for research data management that researchers are actually using, what options do we have, and what resourcing is needed, to ensure that such systems are adequately supported? One key element of support will be application of security fixes.

These considerations give rise to some additional requirements:

The basic research data management system should be simple, generic and built from standard software elements which are individually supported.
The basic system, incorporating essential data storage and backup features, should be generic and usable by a sufficient number of researchers that a standard base configuration can be deployed and supported for diverse research groups.
Essential security fixes should generally be automatically applied, with a minimum of user intervention. This means, for example, that product versions that are part of a standard operating system distribution should be used in preference to newer versions. (Occasional manual restarts will inevitably be required.)
The integration of basic system components should, as far as possible, be a loose coupling by lightweight and standard protocols, so that the system as a whole is not unduly dependent on a particular version of any software component.
Any specially-written software should be incorporated in such a way that its failure does not jeopardize essential data security or accessibility. The outputs of any such software (e.g. annotation tools) should be simple, standard formats that can easily be recovered using widely available off-the-shelf applications.

I currently perceive three possible (and not mutually exclusive) modes of support:

Universities and departments recognize the value to their research goals of taking on the burden of supporting some additional systems (e.g. as Southampton University have done).
If a basic system can serve the needs of researchers across several institutions, then maybe an open-source style of support model can be used, in which users as a community, backed up by a handful of technical staff, provide mutual support.
If a sufficiently large community use the system, maybe there is an opportunity for a small business to provide support on a commercial basis, e.g. funded by a fixed fee paid from research grants where data preservation and publication is a key requirement.

These are just some initial thoughts - I think there is plenty of scope for further discussion, to involve researchers, developers, funding agencies and policy makers.

Monday 1 February 2010

Sprint 3 plan

The ADMIRAL sprint 3 plan has been posted at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_SprintPlan_3.

Notes from the planning meeting are at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_PlanMeeting_3.

Notes from sprint 2 retrospective meeting are at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_SprintReview_2.

Now that we have a larger team, the style of reporting is being changed to reflect that much of the material is initially being drawn up in face-to-face meetings on index cards: the meeting notes summarize the discussions, and the plan focuses more on the selected schedule of activities and tasks.

Monday 18 January 2010

Selecting a platform for ADMIRAL

Part of the past week or so has been spent coming to a (tentative) decision on the basic platform for ADMIRAL data sharing. The requirements are summarized at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_LSDS_requirements_and_survey.

Reviewing the requirements, none of the more exotic options seemed to adequately address all the points. We've also been giving heightened consideration to (a) ongoing supportability of the platform by departmental IT, and (b) allowing users to use their normal SSO credentials for accessing the shared data - this is turning out to be a more important feature for user acceptance than originally allowed. All this directs us towards a platform that consists primarily of very common components:

Ubuntu-based Linux server (JeOS)
CIFS for local file sharing (mainly because reliable clients are standard in all common desktop operating systems)
Apache2+WebDAV for remote file sharing (we might have tried to use WebDAV for all file sharing, but have concerns that it could be awkward to set up in some environments)
Apache2+WebDAV to provide the basis for web application access to data, to provide additional services, such as annotation and visualization
For remote field workers, we plan to experiment with Dropbox to synchronize with the shared file service as and when Internet connections allow.
Mercurial will be trialled as an option for providing versioning of research data. The main advantage of Mercurial over Subversion for this is that it doesn't leave any hidden files in the working directory (e.g., I have used Subversion to version-manage Linux configuration files (such as those in /etc/...), and occasionally find that the hidden .svn directories can cause problems with some system management tools). Mercurial is also a distributed version management system (unlike Subversion), and might be trialled as an alternative to Dropbox for synchronizing with remote workers.
SSH/SFTP as a fallback for access to files and other facilities. SSH is a protocol that often succeeds where others fail, and can be used to tunnel arbitrary service protocols if necessary. There are quite easy-to-use (though not seamless) SFTP clients for Windows (e.g. WinSCP), MacOS (e.g. CyberDuck) and Linux (e.g. FileZilla?).

For deployment, I'm currently planning to use Ubuntu-hosted KVM virtualization. The other obvious choice would be VMWare, as that is widely used, but I have found that remote access to a VMware server or similar hosting environment from non-Windows clients can be problematic. Also, it appears that KVM is well-integrated with Ubuntu's cloud computing infrastructure (UEC/Eucalyptus), which is itself API-compatible with Amazon EC2. This seems to give us a range of deployment options.

For using single-sign-on (SSO) credentials, the Oxford University SSO mechanisms are underpinned by Kerberos. It seems that all of the key features proposed for use (CIFS, HTTP and SSH) can be configured to use Kerberos authentication, so we should be able to use standard SSO credentials for accessing all the main services.

Daily automatic backup will be provided by installing a standard Tivoli client on the system, which will perform scheduled backup to the University Hierarchical File Storage (HFS) service. Alternative backup mechanisms could easily be configured in different deployments.

This combination of well-tried software seems to be able to meet all of our initial requirements, and provide a basis for additional features as requirements are identified.

Our aim is to create a system that will be used after the ADMIRAL project completes, so it is important that it is something our departmental IT support can support. To this end, the various choices are subject to review when I can have a proper discussion with our IT support, who will have more experience of likely operational problems.

It is worth noting that there are two other data management projects in Oxford with some similar requirements (NeuroHub and EIDCSR); we have arranged to keep in touch, pool resources and adopt a common solution to the extent that makes sense for each project. The choices indicated here remain subject to review in discussion with these other projects.

Monday 4 January 2010

ADMIRAL sprint 2 plan

The initial plan for ADMIRAL sprint 2 has been posted at http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_SprintPlan_2.
The plan for Sprint 1 has also been updated with review/retrospective notes - see http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL_SprintPlan_1.