New niche content project

Those who have met me might have noticed an occasional interest in politics. Not party politics particularly – though I do have a horse in that race – but rather the philosophy of it and my own contrary positions on current events. It was a pleasure to integrate this interest into the company and open a new website taking up a contrary position on a hot issue: the economics of coffee. Check it out at free-trade-coffee.com.

Search Monkey Event Standard Object

I located the Search Monkey event standard object. Yahoo doing this makes it the de facto standard markup, though it’s a shame it differs at the namespace level from their previous recommendation. The difference of two characters makes FIL incompatible!

Of course, my SM plugin – which is a more basic version of the same thing – is now completely redundant. I’m actually quite pleased by that; it did seem like Mission Impossible.

Hints on browsing embedded RDFa data as data

I want to share a few notes about viewing the embedded data from RDFa pages, as a sort of mini-guide for anyone interested.

The thing to get out of the way upfront is that the easiest thing to extract looks the ugliest and is often hard to follow. It’s worth taking a few precautions to avoid the horror of machine-generated RDF/XML. So, install the Tabulator Firefox extension from MIT and find the button labelled “N3” – it looks like a dense network icon. Hit that for a compact text-based view, and un-toggle the default Tabular view as screen space requires. The default is the loose network icon.

To actually extract the data, use the RDFa Distiller service. Put in a URL and this service gives you ugly RDF/XML by default, but the Tabulator extension comes to the rescue. With Tabulator, hitting “Go” gives you – unsurprisingly – a table, and switching completely to N3 is just two clicks.

In table mode, Tabulator will pick up and cache labels for things as it goes along and will use the last bit of the URL if it doesn’t have a label yet. If your URLs look ugly, then the view in Tabulator will look ugly – hopefully your URLs are pretty.

Pretty can also be bad, especially if the unique part of a URL is at the front. For example:

  • http://feelitlive.com/events/2009/7/3/W2/2UH/Hyde+Park/Blur#event
  • http://feelitlive.com/events/2009/7/4/HA9/0WS/Wembley+Stadium/Take+That#event
  • http://feelitlive.com/events/2009/7/5/SW7/2AP/Royal+Albert+Hall/The+Killers#event

will all be rendered as “event” – very confusing! If this happens, switching to N3 may be the way forward. Part of the problem seems to be that Tabulator does not read RDFa on its own, which makes it harder to access the RDF and harder for Tabulator to calculate good labels. Apparently the next version will read RDFa – great.
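To see why those three URLs collapse to the same label, here is a minimal sketch. It is not Tabulator's actual code: `last_bit_label` is my assumption of what "the last bit of the URL" means (the fragment if there is one, otherwise the final path segment), and `path_aware_label` is a purely hypothetical alternative that ignores the fragment.

```python
from urllib.parse import urlsplit, unquote_plus


def last_bit_label(uri: str) -> str:
    """Assumed labelling rule: the fragment if present, else the last path segment."""
    parts = urlsplit(uri)
    if parts.fragment:
        return parts.fragment
    segments = [s for s in parts.path.split("/") if s]
    return unquote_plus(segments[-1]) if segments else parts.netloc


def path_aware_label(uri: str, keep: int = 2) -> str:
    """Hypothetical alternative: ignore the fragment, keep the last `keep` path segments."""
    parts = urlsplit(uri)
    segments = [unquote_plus(s) for s in parts.path.split("/") if s]
    return " / ".join(segments[-keep:]) if segments else parts.netloc


uris = [
    "http://feelitlive.com/events/2009/7/3/W2/2UH/Hyde+Park/Blur#event",
    "http://feelitlive.com/events/2009/7/4/HA9/0WS/Wembley+Stadium/Take+That#event",
    "http://feelitlive.com/events/2009/7/5/SW7/2AP/Royal+Albert+Hall/The+Killers#event",
]

for uri in uris:
    # Every URI gets the label "event" under the first rule, but a distinct one under the second.
    print(last_bit_label(uri), "|", path_aware_label(uri))
```

Under the first rule every row is labelled "event"; the second rule at least tells the venues and artists apart.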

Impressions from the SemTech 2009 Tuesday Keynotes

A useful 2×2 grid was discussed with dimensions being

  • the value you are able to add from the user’s perspective,
  • and the number of users there are.

The “no go” zone is where you are unable to add much value and there are not very many users. I felt presence or absence of competition was a third dimension of interest.

An interesting point was made about attempting to create a compelling user experience when passionate about technology. I do tend to appreciate the wonders of tech more than most, though FIL, for example, adds value entirely because of UI features (though not aesthetic UI features right at the moment).

The Siri virtual assistant is a compelling product. The core innovation is taking a declarative library of process flows and turning that into a progressive, conversational series of questions with process- and context-aware auto-completion. The auto-completion was based on the model of possible intents relative to the supported process models.

It should be simple enough to add intent into many auto-completion features without needing to enumerate or model intents or processes very thoroughly, but doing so is apparently very powerful.

Updated Generic Event Info presentation application

Last night I was able to successfully test the Generic Event Information Search Monkey presentation application on upcoming, eventful and FeelItLive.com. I’ve pushed this version to the Gallery.

Going to work on the semantic web

I’d like to explain why I feel that the Semantic Web is a great solution for helping professional contractors and freelancers find work, and for helping companies source talent as well. I’m making assumptions in line with my experiences in the IT sector, but any solution this different technologically will benefit from being re-usable across sectors and I plan to touch on this too.

At the moment we freelancers spam agencies with Word documents that get indexed for keywords and they spam us whenever a job comes in with the same keywords on it.  I get no visibility or acknowledgement that data is updated and agencies continue to spam out roles apparently based on out-of-date home address and rate expectations. There seems to be no systematic way for me to express that I am or am not looking for roles or that I dislike being interrupted while on site.

It just doesn’t seem to work!

For example, my CV mentions the challenging T-SQL and PL/SQL work I did on a data warehousing project at Virgin Mobile using data taken out of SingleView and Strategix, and I get regularly spammed by people looking for “SingleView developers”, which as far as I’m aware is actually a UNIX scripting role, if not a role involving a proprietary technology stack.

What if an HR guy at a telecommunications company wants an SQL developer with experience in his sector? The relevant keywords are “Virgin Mobile”, which should be a strong match (a telecommunications company), and “SingleView”, also a strong match (a telecommunications billing platform). Other words like “Data Warehousing”, “T-SQL” and “PL/SQL” are unequivocal, but notice I do not claim to be a data warehousing expert; I merely worked on a project of that variety. On its own, that should be a weak match based on the type of relationship between the abstract concept and my career.

Likewise there are problems of “SQL” appearing in the job spec and “T-SQL” and “PL/SQL” on the CV which should probably still be strong matches despite being less similar in terms of matching tokens. Different ways of breaking up words automatically will lead to different answers, which does not inspire confidence that a keyword driven solution can scale while retaining full accuracy.
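A minimal sketch of the tokenisation problem, using two deliberately naive tokenisers of my own invention (neither is taken from any real agency system). The job spec and CV strings are shortened from the example above.

```python
import re


def split_on_whitespace(text: str) -> set[str]:
    """Naive tokeniser: lower-case, split on whitespace only."""
    return set(text.lower().split())


def split_on_non_alnum(text: str) -> set[str]:
    """Naive tokeniser: lower-case, split on anything that is not a letter or digit."""
    return {token for token in re.split(r"[^a-z0-9]+", text.lower()) if token}


job_spec = "SQL developer"
cv = "T-SQL and PL/SQL work on a data warehousing project"

for tokenizer in (split_on_whitespace, split_on_non_alnum):
    shared = tokenizer(job_spec) & tokenizer(cv)
    # Whitespace splitting keeps "t-sql" and "pl/sql" whole, so "sql" never matches;
    # splitting on punctuation exposes "sql" but also produces stray tokens like "t" and "pl".
    print(tokenizer.__name__, sorted(shared))
```

The same two documents match or fail to match depending purely on how words are broken up, which is exactly the instability described above.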

An Information Retrieval solution that does not understand the difference between a skill and a project goal, and which is not combined with an accurate taxonomy of companies, technologies, and products, is not going to rank matches as well as one that does. People’s lives are too interesting and varied to be captured by one consistent schema. For instance, consider medicine – an entirely different taxonomy of specialist skills, not project based like freelancing and therefore involving different relationships between concepts. This makes the problem essentially semantic, and on a large scale as well since there are many jobs and candidates in the economy.
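To make the strong-versus-weak distinction concrete, here is a minimal sketch of relationship-aware matching. Everything in it is an illustrative assumption: the relation names, the weights and the tiny taxonomy are mine, not part of any published vocabulary or ranking method.

```python
# Assumed weights: how strongly each kind of relationship supports a match.
RELATION_WEIGHT = {
    "has_skill": 1.0,
    "worked_for": 0.8,
    "used_product": 0.8,
    "project_type": 0.3,  # I worked on such a project; I do not claim expertise
}

# Assumed taxonomy: concept -> broader concepts it implies.
TAXONOMY = {
    "T-SQL": {"SQL"},
    "PL/SQL": {"SQL"},
    "Virgin Mobile": {"telecommunications"},
    "SingleView": {"telecommunications"},
}

# The CV as (relation, concept) edges rather than a bag of keywords.
CV = [
    ("has_skill", "T-SQL"),
    ("has_skill", "PL/SQL"),
    ("worked_for", "Virgin Mobile"),
    ("used_product", "SingleView"),
    ("project_type", "data warehousing"),
]


def score(cv, wanted):
    """For each wanted concept, return the best weight of any CV edge that reaches it."""
    best = {}
    for relation, concept in cv:
        reachable = {concept} | TAXONOMY.get(concept, set())
        for hit in reachable & wanted:
            best[hit] = max(best.get(hit, 0.0), RELATION_WEIGHT[relation])
    return best


print(score(CV, {"SQL", "telecommunications", "data warehousing"}))
```

With these made-up numbers, “SQL” scores as a strong match through the taxonomy even though the token never appears on the CV, while “data warehousing” stays weak because of the kind of relationship involved.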

I’ve just finished reading an article describing exactly this kind of IR solution. It is not fundamentally a Semantic Web problem, merely a semantic one, but an IR approach that is both schema agnostic and graph-based is undoubtedly suited to searching career data across sectors.

There is no evidence in the spam I get that these problems are solved. Let alone the problems of owning and controlling my own data and controlling which agencies I want to trade with.

Then of course, Virgin Mobile no longer exists, which is a clue that it is less relevant and an opportunity for an incomplete taxonomy to do financial damage.

These last requirements make the problem obviously a web problem. A problem screaming for Linked Data.

By retaining control of my CV at its own URI, I would not need to keep spamming agencies; instead, they come to me via HTTP, to where my data is, to get the latest version. If I deny an agency access to my career progress – as I would if I no longer wished to trade with them – then even if they copied the data without authorisation I’m going to be decreasingly likely to get spammed simply because my data is old. As each agency tries to update and gets the bad news, cost incentives will help to ensure they comply and delete cached data.
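Here is a minimal sketch of that refresh cycle, with no real networking. The agency names, the `ALLOWED_AGENCIES` list and the CV fields are invented, and the agency side is assumed to be well behaved: it replaces its copy on a 200 and drops it on a refusal.

```python
from http import HTTPStatus

# Hypothetical access list and CV record kept by the CV owner.
ALLOWED_AGENCIES = {"good-agency"}
CURRENT_CV = {"version": 7, "available": False}


def serve_cv(agency_id: str):
    """What my server might answer when an agency asks for the CV at its URI."""
    if agency_id not in ALLOWED_AGENCIES:
        return HTTPStatus.FORBIDDEN, None
    return HTTPStatus.OK, dict(CURRENT_CV)


def refresh(cache: dict, agency_id: str) -> HTTPStatus:
    """Agency-side refresh (assumed compliant): update on 200, delete the cached copy otherwise."""
    status, body = serve_cv(agency_id)
    if status == HTTPStatus.OK:
        cache[agency_id] = body
    else:
        cache.pop(agency_id, None)
    return status


# Both agencies start with the same stale copy.
cache = {
    "good-agency": {"version": 3, "available": True},
    "spam-agency": {"version": 3, "available": True},
}
for agency in list(cache):
    refresh(cache, agency)
print(cache)
```

An agency that ignores the refusal keeps only version 3, which gets less useful every month; that is the incentive doing the work.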

I’m assuming an ecosystem of tool vendors and community sites like LinkedIn or the sites of professional associations such as the Professional Contractors Group or the ACM would have a role in providing tools for me to put my CV on-line semantically and would offer search interfaces to agencies according to my preferences.  I would have no problem with the ACM gaining profit from doing so and it is natural I tell them about my skills and interests. SPARQL endpoints are a natural publishing medium for data at this scale, and with what I imagine would be a high level of heterogeneity you would want to use something like Pellet to drive federated search while supporting reasoning and custom rules to add value.

Likewise where product taxonomies are owned by trusted companies such as Oracle and Microsoft they can be kept up to date by the firm, as well as being expanded by other companies and organisations such as sourceforge, freshmeat, codehaus, or dbpedia in a web-like way. In particular, while project hosting sites can add coverage for the projects they host, sites like dbpedia that are more general can expand the schema with arbitrary information. If the taxonomies are expressed in a common language then unplugging technology taxonomies and replacing them with medical taxonomies would be straightforward.

There is a similar opportunity for technology vendors to step in and deny taxonomy data to agencies that have angered their developer community, and likewise for tool vendors and intermediaries (the PCG for example, or unions) to deny access to candidate data that they host for members.

The combined social influence of the data owners and communities, the increase in IR accuracy and distributed data maintenance would improve the work seeking and worker sourcing experience. Those seeking workers would have fewer weak or false results to contact and seekers benefit from receiving less spam.

Agencies – especially bad agencies – are the obvious losers as they would be automated away by tools, and pressured to avoid the spam they rely on today. Good agencies would be able to compete by developing better ways to deal with heterogeneity, superior match making, by providing better billing services, favourable contracts, better transparency, and helping to sell opportunities to workers – i.e. to do more of what they should be doing now.

So, I hope I have explained the current problems, why solutions are necessarily semantic in a graph-oriented model, and why economically efficient and pleasant solutions are necessarily web-like, as well as the social and economic pressures that would apply to improve service quality.

Moving forward on Search Monkey

I’ve moved forward marginally with my Search Monkey presentation app, and backwards a little as well.

The backward step was that some of the DataRSS content in the Yahoo Index appears to have vanished; this broke the examples in the previous post and on the Search Monkey site. I don’t think there is a lot a lowly web site operator like me can do to recover that, but I hope coverage will continue to be at least patchy. I look forward to the day they just make the RDF available directly; I can see this being much more scalable for them and certainly easier for me.

At the same time I have made a little progress with adding hCalendar support. The level of adoption for hCalendar makes this compelling, though the lack of precision does mean it’s fundamentally restricted. I will not, for example, be able to enhance results for pages that mention more than one event, even if one event is clearly (to a human) the primary topic of the page. Choosing meaningful graphics is also circuitous.

On the topic of graphics, it is often not possible to depict an event that has not yet occurred. You might choose a graphic for purely aesthetic reasons, use an inconsistent rationale that is difficult to capture, or may abstract away the reasoning into another software module making it unavailable to the UI (as in my case).  This scenario seems to require a specific predicate which I will coin and document as I get around to it. A microformats equivalent is a non-starter as I do not have the inclination to make official representations to standards bodies for work of speculative value (see below).

In total then, the roadmap will be something like this (in order of priority):

  1. Basic support for hCalendar (actually tested on upcoming), with simple safeguards for the ambiguous cases.
  2. Support for RDFa enhanced pages mentioning multiple events, using foaf:primaryTopic to disambiguate.
  3. Support for cases where the event itself can be previewed as a commercial proposition without requiring excess detail in the data, using the new predicate unless I can find one.
  4. Support for event ontology actor and factor and associated foaf:depiction triples such that a factor or actor can be reliably depicted in the result.
  5. Support for the organiser and logo hCalendar and hCard properties. This is useless to me as the interesting entity for music events is almost never the organiser and never the attendees [ unless you are on a date ;-) ], but it’s feasible so deserves a modicum of attention. I don’t think photo is worth supporting as that would immediately break on conferences.
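Item 2 of the roadmap can be sketched roughly as follows. This is plain Python over (subject, predicate, object) tuples, not Search Monkey code; the FOAF and RDF URIs are the real ones, while `EVENT_CLASS`, the page URI and the event URIs are placeholders standing in for whichever event vocabulary and pages are involved.

```python
FOAF_PRIMARY_TOPIC = "http://xmlns.com/foaf/0.1/primaryTopic"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EVENT_CLASS = "http://example.org/vocab#Event"  # placeholder, not a real vocabulary term


def pick_event(page_uri, triples):
    """Return the one event to show for a page, or None if the page is ambiguous."""
    events = [s for s, p, o in triples if p == RDF_TYPE and o == EVENT_CLASS]
    if len(events) == 1:
        return events[0]
    # Several events: only enhance the result if the page names one as its primary topic.
    primary = [o for s, p, o in triples if s == page_uri and p == FOAF_PRIMARY_TOPIC]
    if len(primary) == 1 and primary[0] in events:
        return primary[0]
    return None


page = "http://example.org/gig"
triples = [
    (page + "#main", RDF_TYPE, EVENT_CLASS),
    (page + "#support", RDF_TYPE, EVENT_CLASS),
    (page, FOAF_PRIMARY_TOPIC, page + "#main"),
]
print(pick_event(page, triples))
```

Returning None for the ambiguous case is the “simple safeguard” from item 1: better an unenhanced result than one describing the wrong event.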

I’m realistic about the importance of this project for making money as there is no predictable way to ensure large scale adoption of the plugin and the difference in click-through rates is also unpredictable. As a result, I’ll be moving this forward primarily as a hackspace project rather than a commercial one, albeit commercially motivated. Doing it on hack evenings is a simple way to box off an amount of time.

Progress will be slow for other reasons: I have to locate implementations of the relevant vocabularies, trawl the Yahoo Index for fully crawled examples and possibly establish some test cases. This will all take time and much of it will be out of my control.

UK Gov shares its data cheaply, avoiding change

No, not a story of intrusion and data-loss, this is data the government should be sharing – job adverts.

While apparently being in a position to know, Mark Birbeck “speculates” publicly that the UK Gov are using RDFa because:

by using RDFa to mark-up vacancies on each individual government site, it’s possible to allow each department to publish jobs however it sees fit. Many companies want to have some centralised information, not just government, but this usually involves imposing on each department some new database system or workflow. By using RDFa as the interface, each department merely needs to have the ability to publish HTML, and then they can share their data.

The second [reason] is that by publishing vacancies using RDFa, it’s easy for third-parties to ‘scrape’ the data into their own databases, in a reliable way.

For example, some external company could import all vacancies for a particular region or of a particular type, and then show them on their own site.

Of course, all the benefits he mentioned are provably true – just run a page through a distiller – this is not bleeding edge technology any longer, it’s merely cutting edge.
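The “import all vacancies for a particular region” case can be sketched with nothing but the Python standard library. This is a toy and not a conforming RDFa parser (a real distiller handles prefixes, nesting, `content` attributes and much more); the `ex:` vocabulary and the two vacancies in `PAGE` are made up for the example.

```python
from html.parser import HTMLParser


class VacancyScraper(HTMLParser):
    """Toy scraper: starts a record at each `typeof`, then stores the text of
    each element carrying a `property` attribute. Assumes properties are not nested."""

    def __init__(self):
        super().__init__()
        self.vacancies = []
        self._prop = None   # property currently being read, if any
        self._tag = None    # tag that opened it
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "typeof" in attrs:
            self.vacancies.append({})
        if "property" in attrs and self.vacancies:
            self._prop, self._tag, self._text = attrs["property"], tag, []

    def handle_data(self, data):
        if self._prop:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._prop and tag == self._tag:
            self.vacancies[-1][self._prop] = "".join(self._text).strip()
            self._prop = None


# Made-up page using a hypothetical "ex:" vocabulary.
PAGE = """
<div typeof="ex:Vacancy">
  <span property="ex:title">Policy Analyst</span>
  <span property="ex:region">London</span>
</div>
<div typeof="ex:Vacancy">
  <span property="ex:title">Statistician</span>
  <span property="ex:region">Leeds</span>
</div>
"""

scraper = VacancyScraper()
scraper.feed(PAGE)
london = [v for v in scraper.vacancies if v.get("ex:region") == "London"]
print(london)
```

The department only had to publish HTML with a few extra attributes; everything else happens on the consumer’s side.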

Artists switching focus from recorded to live

According to The Times:

For years, bands would embark on loss-making tours around the world in an attempt to drum up sales for their latest album, but now the practice has been turned on its head, with groups more likely to give their records away in order to sell tickets to their shows.

The data, compiled by PRS for Music, the official body that collects and distributes royalties for British songwriters and music publishers, shows that live music generated £1.28 billion in 2008, compared with £1.24 billion yielded by the recorded music business.

It had to happen eventually… disintermediation, online sales, and a general distaste for fakeness meant record company power was fading. It’s the fakeness aspect that I hope to tap into at FeelItLive.com, by offering media and news from a wide range of sources.

VCal RDFa and Search Monkey

I just returned home from the 3rd London hack evening, which is a highly productive get together of, well, nerds in Islington. With a nod to the done manifesto, I’ve continued on into the night and done what I’ve called a “Generic Event Information” enhanced search result format for Yahoo. It’s a little thing that you can choose to add to your Yahoo search experience.

One of the challenges with this work is that Yahoo has not given these widgets a particularly easy name to throw around, but here’s a picture to help explain things:

[Image: Generic Event Info, Ultravox example]

The text appears in a search result for Ultravox in London at the Roundhouse and as you can see it includes the usual snippet of text (or the event summary, if it’s longer) plus some very simple name value pairs – “Location” and “Starts”, which is altered to “Started” for events that have already started.

The idea is to answer the basic when and where questions common to events and also to allow users to quickly scan through search results and exclude events that they cannot attend and focus on those that they can attend, which is obviously those that they aren’t already missing! Therefore the “Starts” vs “Started” thing.

“Generic”

I can’t see a single person outside of the RDFa fan-club using this plugin unless it supports a wide variety of different sites and subject areas. There is no advantage to me in putting out something that only works for FeelItLive.com, yet it’s only FeelItLive.com that I’m aware of that is publishing VCal RDFa and Yahoo do not allow wildcards. I want people to use it, because I want them to focus their attention on my search results.

Chicken, meet Egg.

So, I’m doing two things:

  • Labelling the thing generic – there is no point even mentioning FIL in the promotional blurb. It can only confuse matters.
  • Inviting anyone and everyone to attach a comment to this post letting me know where they have published VCal RDFa. If they do that, I’ll do my best to support as many sites as Yahoo will allow.

I’ll also extend the same invitation to users of similar vocabularies (Edit: iCal, Event ontology, event microformats etc), though VCal is the one recommended by Yahoo so I’ll have to see if anything else is supported.