Catalogues and data model

This topic is dedicated to the design of the Software Catalogues feature in ScienceSoft. The general idea is to provide access to a large collection of software products (applications, tools, tests, etc.) with information about the authors, the support options, the repositories from which the software can be obtained, links to documentation, etc.
In my opinion, the questions to be answered first are:

- How to describe a software product? We need a schema, does it exist already? Can we use anything from models like CIM or similar descriptions? What are the relationships with other entities like authors, projects and collaborations, institutes, companies, etc?

- How to implement the catalogue? The ScienceSoft web site currently runs on Drupal and a simple implementation can be done using standard Drupal features. However, this may not scale. What about data import/export? APIs?

- Where to take the data from? Of course one source is direct human registration, but can we use any existing database? Several suggestions have been provided already, among which: the EGI Application Database, Ohloh.net, the "Projet PLUME" at CNRS in France, the OpenScience Project in the US and others. How can we extract/merge/moderate/validate information?

We can start with a simple Drupal taxonomy and a form just to see how it looks, but a more long-term design is necessary.

If you want to contribute to this discussion, feel free to reply to this topic or create more specific topics.

There have been some UK-Australian collaborative efforts in this space. In particular, JISC in the UK and ANDS in Australia have worked together on the SIMAL catalogue, which uses the DOAP (Description of a Project) standard as its base schema.

I would also suggest RDF and DOAP. Using Drupal's RDF Proxy you could create a searchable catalog. It shouldn't be too much work for the projects to publish RDF descriptions; some may even already do so.
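To give an idea of how little a project would have to publish, here is a minimal sketch that writes a DOAP description as RDF/XML using only the Python standard library. The project name, the example.org URLs and the function name doap_description are made up for illustration; only the DOAP and RDF namespace URIs and the property names (name, shortdesc, homepage, license, repository, GitRepository, location) come from the DOAP vocabulary itself.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DOAP = "http://usefulinc.com/ns/doap#"

# Use readable prefixes instead of auto-generated ns0/ns1 in the output.
ET.register_namespace("rdf", RDF)
ET.register_namespace("doap", DOAP)


def doap_description(name, shortdesc, homepage, license_url, repo_url):
    """Return a small DOAP document (RDF/XML) as a string.

    Hypothetical helper: the choice of fields is an assumption, not a
    ScienceSoft requirement.
    """
    root = ET.Element(f"{{{RDF}}}RDF")
    project = ET.SubElement(root, f"{{{DOAP}}}Project")
    ET.SubElement(project, f"{{{DOAP}}}name").text = name
    ET.SubElement(project, f"{{{DOAP}}}shortdesc").text = shortdesc
    # Links are expressed as rdf:resource attributes, not as text.
    ET.SubElement(project, f"{{{DOAP}}}homepage", {f"{{{RDF}}}resource": homepage})
    ET.SubElement(project, f"{{{DOAP}}}license", {f"{{{RDF}}}resource": license_url})
    repository = ET.SubElement(project, f"{{{DOAP}}}repository")
    git = ET.SubElement(repository, f"{{{DOAP}}}GitRepository")
    ET.SubElement(git, f"{{{DOAP}}}location", {f"{{{RDF}}}resource": repo_url})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # All values below are placeholders.
    print(doap_description(
        name="example-tool",
        shortdesc="A hypothetical command-line tool",
        homepage="https://example.org/example-tool",
        license_url="https://example.org/licenses/example",
        repo_url="https://example.org/git/example-tool.git",
    ))
```

A catalogue could then harvest such files from a well-known location in each project, which is roughly the publishing model suggested above; whether that fits the Drupal setup is a separate question.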

Schema should come first

I'd say that before deciding "how", one has to understand "what" needs to be described/formalised. A schema can be formalised and populated using a variety of methods. Is it clear what characteristics of software need to be recorded? Is it also clear what actions will be performed on stored objects? Is it even clear what the objects are and what the granularity is?

For example, are we talking about complex distributions (e.g. gLite), simple solutions consisting of several components (e.g. VOMS), single components (e.g. voms-proxy-init), single binary/source packages, or all of these? Obviously, each object would be characterised by a name: are we talking about an arbitrary string or a formal namespace? Each of these would also be characterised by versions, since several versions may exist for each name. Each version may be available on a number of architectures. Then one would need to know all the support info, bug report links, license details, source repository, and so on.

Actions would include browsing, searching, updating, modifying, removing, adding and annotating, and may be applicable to every attribute (e.g. the license may change, or the structure of packaging, or the dependencies).

I am sure there are plenty of schemas around, since there are very many software repositories and catalogues. In order to decide whether any such schema is suitable for ScienceSoft, one has to compile a list of needed features and check whether they can be mapped onto one schema or another.
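To make the list of characteristics and actions above a bit more concrete, here is a rough sketch of what a first cut of such a model could look like. The class names (Release, SoftwareEntry, Catalogue), the field names and the sample values are all assumptions for the sake of discussion, not an agreed ScienceSoft schema; it only covers names, versions, architectures and a few of the actions (adding, searching, modifying, removing).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Release:
    """One version of a product and the architectures it is available on."""
    version: str
    architectures: List[str] = field(default_factory=list)


@dataclass
class SoftwareEntry:
    """A catalogued object; attribute names are illustrative only."""
    name: str
    license: str = ""
    repository_url: str = ""
    support_url: str = ""
    bug_tracker_url: str = ""
    releases: List[Release] = field(default_factory=list)

    def add_release(self, version: str, architectures: List[str]) -> None:
        self.releases.append(Release(version, list(architectures)))

    def versions_for(self, architecture: str) -> List[str]:
        """Versions of this entry available on the given architecture."""
        return [r.version for r in self.releases if architecture in r.architectures]


class Catalogue:
    """In-memory catalogue keyed by name (here an arbitrary string)."""

    def __init__(self) -> None:
        self._entries: Dict[str, SoftwareEntry] = {}

    def add(self, entry: SoftwareEntry) -> None:
        if entry.name in self._entries:
            raise ValueError(f"duplicate name: {entry.name}")
        self._entries[entry.name] = entry

    def get(self, name: str) -> SoftwareEntry:
        return self._entries[name]

    def remove(self, name: str) -> None:
        del self._entries[name]

    def search(self, text: str) -> List[str]:
        """Case-insensitive substring search on entry names."""
        needle = text.lower()
        return sorted(n for n in self._entries if needle in n.lower())


if __name__ == "__main__":
    catalogue = Catalogue()
    entry = SoftwareEntry(name="example-client", license="Apache-2.0")
    entry.add_release("1.0.0", ["x86_64"])
    entry.add_release("1.1.0", ["x86_64", "i386"])
    catalogue.add(entry)
    print(catalogue.search("client"), entry.versions_for("i386"))
```

Even this toy version forces the open questions to the surface: the name is used as the unique key, so the "arbitrary string or formal namespace" decision cannot be postponed for long.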

Granularity

>-- Is it even clear what are the objects and what is the granularity? For example, are we talking about complex distributions (like e.g. gLite), simple solutions consisting of several components (like e.g. VOMS), single components (like e.g. voms-proxy-init), or single binary/source packages, or all of these?

In order to encourage re-usability, software projects should in general be as modular as possible. It could be possible to start with some of the EMI products as 'good examples' and to provide some 'templates for the schema'. Nevertheless, the communities behind the software projects should then drive the decision about granularity and about how much manpower they invest in maintaining the (granular) information.
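One way to leave the granularity decision to the communities is to let every entry carry its own level and optionally contain other entries. The sketch below is only an illustration of that idea: the Granularity levels mirror the examples quoted above, while the class and function names, and the particular nesting of gLite, VOMS and voms-proxy-init, are assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Iterator, List, Tuple


class Granularity(Enum):
    DISTRIBUTION = "distribution"  # e.g. gLite
    PRODUCT = "product"            # e.g. VOMS
    COMPONENT = "component"        # e.g. voms-proxy-init
    PACKAGE = "package"            # a single binary/source package


@dataclass
class Node:
    """A catalogue entry that may be composed of finer-grained entries."""
    name: str
    level: Granularity
    parts: List["Node"] = field(default_factory=list)

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "Node"]]:
        """Yield (depth, node) pairs, parents before their parts."""
        yield depth, self
        for part in self.parts:
            yield from part.walk(depth + 1)


def names_at(root: Node, level: Granularity) -> List[str]:
    """Names of all entries registered at the given granularity."""
    return [node.name for _, node in root.walk() if node.level is level]


# Illustrative nesting only; a community could stop at any level.
tree = Node("gLite", Granularity.DISTRIBUTION, [
    Node("VOMS", Granularity.PRODUCT, [
        Node("voms-proxy-init", Granularity.COMPONENT),
    ]),
])

if __name__ == "__main__":
    for depth, node in tree.walk():
        print("  " * depth + f"{node.name} ({node.level.value})")
```

A project that does not want to maintain component-level information simply registers a node without parts, so the cost of finer granularity stays with those who choose it.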

Morris Riedel - Deputy Division Leader Federated Systems and Data - Juelich Supercomputing Centre

I like the idea of using RDF and DOAP. As for the schema, we have started drafting some classes. It is very loosely based on equivalent CIM objects. I'd like to have an initial simple set of objects we can play with, maybe inviting people to try and register something and see how it looks. The initial ideas are jotted down in the attached file. I can add a wiki-like section to the site so we can work together on this.

Alberto Di Meglio

CERN - IT

Re-use of existing schemas

Hi, looking at the uploaded PDF, it might be interesting to check whether eduPerson et al. (for example) already have similar fields we can use, and then add the ones that are missing. A future advantage might be a direct link and re-usability with Shib-based access (persons/organizations), etc.

Morris Riedel - Deputy Division Leader Federated Systems and Data - Juelich Supercomputing Centre

Update on implementation

A brief update on the status of the web site as of the end of August.

The Member, Organization and Collaboration entities have been implemented as described in the proposed schema. It's an initial implementation of course, but it is useful to start working with real registrations. The RDF metadata is available for both the Organization and Collaboration instances, although I haven't tried to use it from outside ScienceSoft yet.

I've sent a request to the EMI Collaboration Board members to register themselves and their Organizations. Once we have enough people and Institutes, I'd like to extend the invitation to all EMI members and to the subscribers of the announce and discuss mailing lists.

The next step is the implementation of the Software Project class, so we can start registering real products and have the basic elements to start thinking about the marketplace, the reports, etc. There are already a few interesting proposals from Jedrzej in the Marketplace topic in this forum.

Alberto Di Meglio

CERN - IT
