Open Data: the basics for newbies

What is open data?


There are loads of examples of open data, and it comes in loads of formats. At heart, it’s data that is open and free to use, published in an accessible, machine-readable format. Strictly speaking it can be published in any format, even a JPEG or a PDF, but the PDF has become a running joke in the community: it’s great for people to read but a bit rubbish for computers, because it’s hard to get the data out in a usable form.

Open data also comes with a licence, and the licence is what makes it open. Everything else is just the icing on the cake. The Open Government Licence (OGL) and Creative Commons licences are common examples.

When does it stop being open data? Can poor quality stop data being open? No: the open data licence isn’t a stamp of approval or quality. It only tells you that the data is open and what restrictions apply.

Is all open data numerical? No. Examples include mapping data, lists of landfill sites, or lists of library books and when they were last borrowed. For one person, totals and averages are the biggest switch-offs in open data.

More and more academic research is being published as open data. There are efforts underway to publish out-of-copyright books as open data. A museum in France has opened up its entire collection as open data.

Why is open data good?

It’s a value exchange. If you exchange, say, health data, there’s a value to you in getting insight and analysis from your data.

Broadly, the aims are better products and services, through a better, data-based understanding. This is true in both the private and public sectors. In the case of public sector data, it’s already ours. We own it! So it should be opened up, because we paid for it to be created.

We don’t always know the benefit of opening data. It’s a case of taking a punt and seeing what emerges, and there’s no chance of that happening if the data stays closed.

Privacy & Problems

In general, open data shouldn’t contain personally identifiable information. However, some personal data is generally considered acceptable, like public sector job titles or MPs’ expenses data.

Have there been negative consequences of opening data? In the US, there have been some examples of class actions based on open data. For example, Netflix released some data that people were able to combine with other data to identify individuals within it. While it’s usually fairly easy to anonymise data, edge cases sometimes make it possible to identify people.

Linked data sits at the far end of the machine-readable scale. For example, a PDF is one star data, Excel is two star, CSV is three star, and linked data is five star. The stars are just about how accessible the format of the data is. It’s not as important as it sounds, because plenty of conversion tools are available.
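
As a rough illustration of why a format like CSV rates higher than a PDF, the sketch below reads a small CSV with nothing but Python’s standard library. The column names and rows are made up; the point is only that each value arrives already labelled by its field, with no scraping needed.

```python
import csv
import io

# Made-up CSV content standing in for a published file.
csv_text = "site,type\nNorth Pit,landfill\nSouth Pit,landfill\n"

# DictReader uses the header row to label every value with its field name.
rows = list(csv.DictReader(io.StringIO(csv_text)))
site_names = [row["site"] for row in rows]
```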

How can you rely on open data? Will it be there if your app relies on it? Well, you can’t completely, because you aren’t paying for it. (And even paying isn’t a guarantee.) Checking the metadata can give you an idea of how often it’s refreshed. And some sources are more reliable than others. The Open Data Institute allows you to certify your data, which helps build trust. One major counter-example: the US government open data sets just disappeared. It could be a switch from one president to the next leading to archiving and replacing, or it could be political. However, because of the way data is licensed, once it has been opened it can’t be closed again.

Some definitions

A schema is a description of which data goes in which field. Following a schema makes importing data much easier. For example, if two publishers of tree data use different names for the same field (idtree and treeid), it is very hard for machines to spot that the data is of the same type.
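
As a minimal sketch of what following a schema buys you, suppose two councils publish tree data under different field names. The alias table, the agreed name `tree_id`, the `species` field and the `to_schema` helper are all assumptions made up for this example, not part of any real tree data schema.

```python
# Hypothetical alias table: maps each publisher's field name to one agreed name.
FIELD_ALIASES = {
    "idtree": "tree_id",
    "treeid": "tree_id",
    "tree_id": "tree_id",
    "species": "species",
}


def to_schema(record):
    """Rename a record's fields to the agreed schema, rejecting unknown fields."""
    normalised = {}
    for field, value in record.items():
        key = field.strip().lower()
        if key not in FIELD_ALIASES:
            raise KeyError(f"field {field!r} is not in the schema")
        normalised[FIELD_ALIASES[key]] = value
    return normalised


council_a = {"idtree": "T-001", "species": "Oak"}
council_b = {"treeid": "T-002", "species": "Ash"}

# Once both records use the same field names, they can be combined directly.
combined = [to_schema(council_a), to_schema(council_b)]
```

If both publishers had used the same schema from the start, the alias table would not be needed at all.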

Is Twitter open data or public data? It’s public, not open. Republishing someone’s Twitter feed on your site without embedding the tweets is against the terms and conditions.

Metadata is data about data – information about the data that’s being shared.

Platforms, data stores and portals are all places where people can publish or obtain data.

An example

The Food Standards Agency makes its business rating data open through APIs (a means of two computers passing information between them), so you can access and use it. They also have approved premises data, but that’s spread across a bunch of separate Excel spreadsheets. One good example, one bad.
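
To give a feel for what using an API looks like from the receiving end, here is a small sketch that parses the kind of JSON response an API might return. The response body, its field names (`establishments`, `name`, `rating`) and the `names_with_rating` helper are hypothetical, invented for this example. They are not the Food Standards Agency’s actual format, so check the real API documentation before relying on any of them.

```python
import json

# Hypothetical response body; a real API defines its own structure and field names.
response_body = """
{
  "establishments": [
    {"name": "Example Cafe", "rating": "5"},
    {"name": "Sample Takeaway", "rating": "3"},
    {"name": "Demo Diner", "rating": "5"}
  ]
}
"""


def names_with_rating(body, rating):
    """Return the names of establishments whose rating equals the one given."""
    data = json.loads(body)
    return [e["name"] for e in data["establishments"] if e["rating"] == rating]


top_rated = names_with_rating(response_body, "5")
```

The contrast with the spreadsheets is that nothing here has to be opened, copied or tidied by hand before a program can use it.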

Moving beyond the basics

When you’re an expert in Excel, what tools should you move on to? Well, sometimes the best tool is the one you’re most familiar with. If you’re an Excel ninja, that’s fine. Only move on when you get interested in something else. OpenRefine is one option, as it allows you to “clean” datasets by removing duplicates, misspellings and other issues.
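
For a sense of what “cleaning” involves, the sketch below removes duplicates that differ only in capitalisation or stray spaces. It is a simplified stand-in written for this post, not how any particular tool works internally, and it does not attempt to fix misspellings. The library names and both helper functions are made up for the example.

```python
def clean_key(value):
    """Build a comparison key: collapse runs of whitespace and ignore case."""
    return " ".join(value.split()).lower()


def deduplicate(rows):
    """Keep the first occurrence of each value, comparing by cleaned key."""
    seen = set()
    unique = []
    for row in rows:
        key = clean_key(row)
        if key not in seen:
            seen.add(key)
            # Store the value with its whitespace tidied but its case preserved.
            unique.append(" ".join(row.split()))
    return unique


raw = ["Brighton Library", "brighton  library", " Hove Library", "Brighton Library "]
cleaned = deduplicate(raw)
```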

Excel doesn’t have mapping or spatial tools, so look at BatchGeo or CartoDB.
