If you are an open data researcher you will need to handle a lot of
different file formats from datasets. Sadly, most of the time you
don't get to choose which file format is best for your project;
you have to cope with all of them to be sure that
you won't hit a dead end. There's always someone who knows the solution
to your problem, but that doesn't mean that answers come easy. Here is a
guide to each file format from the Open Data Handbook, with a suggestion for a Python library to use for each one.
JSON is a simple file format that is very easy for
any programming language to read. Its simplicity means that it is
generally easier for computers to process than other formats, such as XML.
Working with JSON in Python is almost the same as working with a
Python dictionary. You will need the json library, which comes
preinstalled with every Python version from 2.6 onwards.
import json

json_data = open("file root")
data = json.load(json_data)
Then data["key"] returns the value stored under that key of the JSON object.
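To show that the loaded data really behaves like a dictionary, here is a minimal sketch that iterates over its keys, assuming the top-level JSON value is an object:

import json

# A minimal sketch, assuming "file root" points to a JSON file whose
# top-level value is an object (a dictionary once loaded).
json_data = open("file root")
data = json.load(json_data)
json_data.close()

for key in data:
    print(data[key])   # prints the value stored under each key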
XML is a widely used format for data exchange
because it gives good opportunities to keep the structure of the data
and the way files are built, and it allows developers to write parts of
the documentation in with the data without interfering with the reading
of it. This is pretty easy in Python as well. You will need the minidom
module, which is also preinstalled.
from xml.dom import minidom

xmldoc = minidom.parse("file root")
itemlist = xmldoc.getElementsByTagName("name")
This returns a list of all elements with the "name" tag.
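If you also want the text inside each element, a minimal sketch like the following works, assuming every "name" element contains plain text:

from xml.dom import minidom

xmldoc = minidom.parse("file root")
itemlist = xmldoc.getElementsByTagName("name")
for item in itemlist:
    # firstChild is the text node inside the element (an assumption
    # about the structure of the document)
    print(item.firstChild.data)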
RDF is a W3C-recommended format and makes it
possible to represent data in a form that makes it easier to combine
data from multiple sources. RDF data can be stored in XML and JSON,
among other serializations. RDF encourages the use of URLs as
identifiers, which provides a convenient way to directly interconnect
existing open data initiatives on the Web. RDF is still not widespread,
but it has been a trend among Open Government initiatives, including the
British and Spanish Government Linked Open Data projects. The inventor
of the Web, Tim Berners-Lee, has recently proposed a five-star scheme
that includes linked RDF data as a goal to be sought for open data
initiatives. I use rdflib for this file format. Here is an example.
from rdflib.graph import Graph

g = Graph()
g.parse("file root", format="format")
for stmt in g:
    print(stmt)
In RDF you can run queries too and return only the data you want, but this isn't as easy as parsing it.
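For completeness, here is a minimal sketch of such a query using rdflib's SPARQL support; the query itself is just a placeholder that returns the first ten triples, so replace it with your own pattern:

from rdflib.graph import Graph

g = Graph()
g.parse("file root", format="format")

# A placeholder SPARQL query: it simply selects the first ten triples.
results = g.query("""
    SELECT ?subject ?predicate ?object
    WHERE { ?subject ?predicate ?object }
    LIMIT 10
""")
for row in results:
    print(row)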
Spreadsheets. Many authorities have their information available
in spreadsheets, for example Microsoft Excel. This data can often be
used immediately with the correct descriptions of what the different
columns mean. However, in some cases there can be macros and formulas in
spreadsheets, which may be somewhat more cumbersome to handle. It is
therefore advisable to document such calculations next to the
spreadsheet, since that is generally more accessible for users to read. I
prefer to use a tool like xls2csv and then use the output file as a CSV.
But if for any reason you want to work with an xls file directly, the best
source I have found is www.python-excel.org. The most popular library there is the first one, xlrd. There is also another library, openpyxl, for working with xlsx files.
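As a rough sketch of what reading an xls file with xlrd looks like (the path and the assumption that the data lives in the first worksheet are mine):

import xlrd

book = xlrd.open_workbook("file root")
sheet = book.sheet_by_index(0)           # assume the data is in the first sheet
for rownum in range(sheet.nrows):
    print(sheet.row_values(rownum))      # list with the cell values of this row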
Comma Separated Values (CSV) files can be a very
useful format because they are compact and thus suitable for transferring large
sets of data with the same structure. However, the format is so spartan
that the data are often useless without documentation, since it can be almost
impossible to guess the significance of the different columns. It is
therefore particularly important for comma-separated formats that the
documentation of the individual fields is accurate. Furthermore, it is
essential that the structure of the file is respected, as a single
omission of a field may disturb the reading of all remaining data in the
file without any real opportunity to rectify it, because it cannot be
determined how the remaining data should be interpreted. You can use the
csv Python library. Here is an example.
import csv

with open('eggs.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in reader:
        print ', '.join(row)
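Since the meaning of the columns is so important, it can also help to read the file with csv.DictReader, which, assuming the first row holds the column names, gives you each row as a dictionary keyed by those names:

import csv

with open('eggs.csv', 'rb') as csvfile:
    # Assumes the first row of the file contains the column names.
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row)   # each row is a dictionary keyed by column name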
Plain text (txt) files are very easy for computers to
read. They generally exclude structural metadata from inside the
document, however, meaning that developers will need to create a parser
that can interpret each document as it appears. Some problems can be
caused by moving plain text files between operating systems: MS
Windows, Mac OS X and other Unix variants each have their own way of telling
the computer that they have reached the end of a line. You can load
the txt file easily, but how you use it after that depends on the format
of the data.
text_file = open("file root", "r")
lines = text_file.read()
This example will return the whole txt file as a single string.
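Because of the line-ending differences mentioned above, a minimal sketch like this, using Python 2's universal-newline mode, reads the file line by line regardless of which operating system produced it:

# "rU" is Python 2's universal-newline mode, which normalises the
# different line endings used by Windows, Mac OS X and Unix.
text_file = open("file root", "rU")
for line in text_file:
    print(line.strip())
text_file.close()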
PDF. Here is the biggest problem in open data file
formats. Many datasets have their data in PDF and unfortunately it isn't
easy to read and then edit them. PDF is really presentation oriented
and not content oriented. But you can use PDFMiner
to work with it. I won't include any example here since it isn't a
trivial one, but you can find anything you want in its documentation.
HTML. Nowadays much data is available in HTML format
on various sites. This may well be sufficient if the data is very
stable and limited in scope. In some cases it could be preferable to
have the data in a form that is easier to download and manipulate, but since it is
cheap and easy to refer to a page on a website, it might be a good
starting point for displaying data. Typically it is most
appropriate to use tables in HTML documents to hold data, and then it is
important that the various data fields are displayed and given IDs
which make it easy to find and manipulate the data. Yahoo has developed a
tool, YQL, that can extract
structured information from a website, and such tools can do much more
with the data if it is carefully tagged. I have used a Python
library called Beautiful Soup many times in my projects.
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_file)
soup.title
soup.title.name
soup.title.string
soup.title.parent.name
soup.p
soup.p['class']
soup.a
soup.find_all('a')
soup.find(id="link3")
These are only a few of the things you can do with this library. Calling
a tag like this returns its content. You can find more in its documentation.
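For the common case of data held in an HTML table, here is a minimal sketch; the table id "data" and the html_file variable are assumptions you would adapt to the page you are scraping:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_file)
# "data" is a hypothetical id; use whatever id the page gives its table.
table = soup.find('table', id='data')
for row in table.find_all('tr'):
    cells = [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
    print(cells)   # list with the text of each cell in the row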
Scanned image. Yes, it is true. This is probably the least
suitable form for most data, but both TIFF and JPEG-2000 can at least
be marked with documentation of what is in the picture, right up to
marking up an image of a document with the full text content of the document.
If the images are clean, containing only text and without any noise, you can
use a library called pytesser. You will need the PIL library to use it.
Here is an example.
from pytesser import *

image = Image.open('fnord.tif')  # Open image object using PIL
print image_to_string(image)
Proprietary formats. Last but not least, some
dedicated systems have their own data formats that they can save
or export data in. It can sometimes be enough to expose data in such a
format, especially if it is expected that further use will take place in a
system similar to the one the data comes from. Where further information
on these proprietary formats can be found should always be indicated,
for example by providing a link to the supplier's website. Generally it
is recommended to display data in non-proprietary formats where
feasible. I suggest googling whether there is a library specific to the
format of your dataset.