OData: A Communications Protocol for Sharing Open Data
Microsoft released its “Open Data Protocol” (OData) in 2010 and has made it a part of its “Open Government Data Initiative” (OGDI). When I first looked into OData a few months ago, I concluded that it was “just data formatted using AtomPub,” which “has some virtues” but “is only a ‘standard’ insofar as other people recognize and use it as a standard.” As I have investigated further, however, I’ve concluded that OData is probably the most promising communications protocol currently available for making data shareable, discoverable, accessible and mashable. For this reason I think we should include OData (or something like it) belongs in the roadmap for open data development.
What OData is NOT
OData is not a file format like CSV or Excel (XLSX). It is also not a data storage system like MySQL or Postgresql. Data that is shared via OData can be exported to CSV or Excel, and it can be stored using MySQL or Postgresql as well as using no-SQL databases such as MongoDB. However, OData does not replace any of those technologies, and likewise, they do not substitute for what OData does.
OData is also not a system for standardizing web services. OData collections use a RESTful API to share their data resources, but OData is not attempting to becoming “one standard to rule them all,” and it is not a system for managing APIs, like Mashery or 3Scale.
What OData IS
OData is a protocol, like SMTP or HTTP or RSS: a system of digital message formats and rules for exchanging those messages. SMTP is a messaging protocol for sending email, and HTTP is a protocol for communicating between web browsers and web servers. Similarly, OData is a protocol for communicating when sharing data.
OData has a number of similarities to RSS, upon which it is based. It uses AtomPub, a variant of RSS, as the basis for a RESTful interface through which an OData client application can communicate with an OData server to perform the four basic functions of data storage and retrieval: create, retrieve, update and delete (CRUD).# Whereas RSS is simply a read-only protocol, OData is therefore a read-and-write protocol. The goal of RSS was to provide “really simple syndication” — the simplest possible system for publishing news feeds and other information lists so that those news items and lists could be syndicated to multiple websites. OData does pretty much the same thing, with a few enhancements to make it suitable for “really simple data sharing” via HTTP:
- OData does not package “news items.” Instead, it packages “rows of data.” A single item in RSS is typically a reference to a blog post or other article. For OData, a single item is a row of data, comparable to a record in a database table or one of the rows in an Excel spreadsheet.
- OData provides support for sharing information about the data schema of a data resource. An HTTP query to an OData resource can retrieve a list of the data fields contained in that resource, as well as the data type of each field—number, date, text, etc.
- Whereas RSS responds to a feed request by providing a fixed list of items, OData allows the data request to provide parameters specifying how the items returned should be filtered and sorted. For example, you can tell it to only return items whose “state” field is “New York.” (Socrata also provides the ability to sort and filter results within a dataset, so this is a valuable feature if we want our open data solution to be competitive.)
- Whereas RSS only allows retrieving information, OData supports the rest of CRUD as well: creating, updating and deleting.
- Finally, OData provides support for relationships that can link an item in one data table to some other item in a separate data table. It therefore can serve as a protocol not just for sharing individual tables of information but also as a protocol for sharing relational databases involving multiple data tables.
Microsoft has published an article titled “Introducing OData,” which explains how it works in additional technical detail. I’ve also created a three-minute YouTube video which highlights some of the things that OData enables.
What This Means for Open Data
According to the Open Data Foundation, the purpose of open data is to make it “possible for a user to easily perform tasks” such as the following:
- Share knowledge with other users
- Discover the existence of data
- Access the data for research and analysis
- Find detailed information describing the data and its production processes
- Access the data sources and collection instruments from which and with which the data was collected, compiled, and aggregated
- Effectively communicate with the agencies involved in the production, storage, distribution of the data
To that list of tasks, I would add that data should also be easily mashable, meaning it should be easy to select and combine data from multiple sources and produce new data objects that are themselves easy to share, discover, access, etc.
To make data more discoverable, open data systems such as Socrata and CKAN have invested quite a bit of effort in data visualization. New Amsterdam Ideas is currently building a CKAN-based website for a client, and we have seen firsthand how well this works and its limitations. The idea is that when a user uploads a CSV file or an Excel spreadsheet, the software tries to automatically parse that data and display it on the website as a chart, table or map. Sometimes this works, and sometimes it doesn't. When it fails, the reason is usually due to inconsistencies in the way the CSV or Excel file was created. I'll describe some of the reasons for those inconsistencies below. The problem is not that there is something “wrong” with the CSV or Excel file formats. The problem is that those formats are designed to be very flexible, and that flexibility in turn creates multiple opportunities for users to create datasets that are structured in ways that do not lend themselves to automatic parsing and visualization by computers.
The OData protocol solves this problem by requiring a little bit more structure and definition than you can expect to get from a spreadsheet file alone. These additional bits of structure make OData collections equivalent to web-based, interactive databases. OData collections can be browsed, sorted and filtered to create new subsets from existing datasets. Moreover, an application that is OData-aware can recognize the schema of any OData collection, and this makes it potentially easy to build websites that can atomatically visualize OData collections as tables, charts or as points on maps.
OData and Drupal Views
The most common and important use case for sharing and displaying data involves data tables organized into rows and columns. In the Drupal world, this sort of display is often created using the Views module. An OData feed rendered inside an HTML page looks quite a bit like something generated by a Drupal view, as you can see in the example below, which I’ve taken from one of the examples in Microsoft’s DataPublic demo:
Just as you might see with a Drupal View, there is a tab at the top that lets users switch between viewing the data as a table or as points plotted on a geographical map. Options at the top of the table let site visitors narrow the items to be displayed in various ways. Users can also sort the list by clicking on the column headers for “name,” “address,” “city,” etc.
Again, this is all functionality that Views can provide in a Drupal context. There is even an existing OData module for Drupal that makes it possible to display OData feeds as Drupal views. OData’s advantage over Views is that it is a fairly lightweight protocol for data sharing that is not limited to Drupal. Any website that is capable of recognizing and rendering the OData protocol can provide this sort of display and its accompanying options for filtering, sorting, etc. Moreover, an OData-capable website can generate that sort of display and its accompanying options automatically. All a user needs to do to generate that display is supply a URL to the OData feed. This means that creating visualizations of OData should be easier, because it does not require a Drupal developer with Views expertise to manually build each separate visualization. With OData, users simply need to supply the URL, and the visualization can build itself.
I have evaluated some of the other possible approaches that might become the basis for an open data standard. They include: the CSV and/or Excel file formats; the Google Data Protocol (GData), RDF and DataRSS. However, each of these has issues that limit its usefulness as a data sharing protocol.
Excel is the most commonly used spreadsheet application, and its file format has therefore become a de facto standard for sharing data tables. Most if not all competing spreadsheet applications provide the ability to import and export their data using an Excel format, as do many database applications. However, the Excel file format has a number of limitations as an open data protocol:
Limits on size of data tables
The latest version of Excel can handle a maximum of 16,384 columns and 1,048,576 rows. This is adequate for some purposes but not for all. For example, census data or meteorological data may contain millions of records.
No support for presorting, filtering or performing other basic data transformations
Since Excel is simply a file format, users are limited to downloading entire Excel files and then using whatever software exists on their computer to perform operations that select subsets or perform other transformations on a data resource. Suppose, for example, that we had a database containing thousands of rows of crime statistics for all 50 states in the United States that look like this:
Someone who only wants statistics for the state of New York would have to first download the entire file and then import it into a program which lets them extract the subset of data that they actually want. Similarly, if someone wanted to sort the data according to a particular column (such as number of homicides or total number of crimes), they would have to perform that manipulation after downloading the file.
Inconsistencies in structure of data tables
One of the reasons for Excel’s success as a spreadsheet is that it offers users a great deal of flexibility. A single Excel spreadsheet can actually contain multiple tables of data. It is a three-dimensional spreadsheet, so data can be kept in multiple sheets. Any cell in a spreadsheet can contain a formula which calculates its value based on the values in other cells. Excel also offers formatting options that make it possible to generate attractive reports based on spreadsheet data, including charts and graphs. Header and footer information can be placed above and below a table and formatted to extend across multiple cells to improve readability. In the example above, for example, the word “Date” is centered above both the “Month” and “Year” columns.
This flexibility comes at a price, however, when it comes to sharing Excel spreadsheets as open data. Excel users will inevitably use this flexibility when creating data resources, and it is unrealistic to expect that they will be willing to take the effort needed to refactor their spreadsheets into neatly organized rows and columns prior to uploading their files for the purpose of sharing. Inevitably, therefore, Excel spreadsheets will contain inconsistencies that create difficulties for computer programs that try to recognize that data and generate automatic visualizations. For example, here are some of the inconsistencies I noticed in a single, one-page Excel budget spreadsheet:
None of these discrepancies weaken Excel’s usefulness as a spreadsheet. Human readers understand how to read the data anyway. However, these discrepancies do limits its usefulness as a protocol for data that can be automatically extracted, parsed and mashed up with other data.
No standard way of defining data schema
Relational databases such as MySQL, Postgresql or Filemaker typically organize their data into fields. Each data table contains a defined set of fields, each of which has defined characteristics. For example, the “node” table which is defined in the MySQL database for a Drupal website contains a field named “nid” and defined as an integer field, meaning that integers are the only type of data that can be entered there. Another field, named “type,” is a text field. Other fields are used as timestamps, which means that they specify a date and time.
Excel also supports a number of similar data types — integers, text, dates, etc. — but it does not enforce a rule saying, “everything in column A must be an integer,” or “everything in column C must be text.” It also does not require assigning names to columns. When someone creates a spreadsheet, they may choose to enter a row (or multiple rows) of column title information at the top of the spreadsheet, but this is entirely optional. This flexibility is part of what makes Excel useful as a spreadsheet, but it again means that Excel spreadsheets are not sufficiently standardized to serve as a protocol for open data that can be automatically visualized, filtered, or mashed up with other data.
Limited support for metadata
When NAI and the Open Knowledge Foundation responded to an open data RFP from the World Bank, one of the features that the asked about was the ability to add metadata to individual rows, columns and cells of data within a dataset. CKAN does not currently offer that ability, and Excel does not provide it either. The current version of the Excel software makes it possible to attach comments to individual spreadsheet cells, but making those comments visible to browsers of an open data website would be a significant programming challenge. The only way users could reasonably be expected to view Excel comments would be to first download the Excel file, then open it using the Excel application (which they would have to own), and then inspecting the spreadsheet within Excel to find the comments.
CSV and other delimited text formats
The comma-separated values (CSV) file format is often used as a more generic, simplified format than Excel for sharing tabular data. Sometimes tab-separate value (TSV) files are also used for this purpose. These files are basically text files which use a designated character as a separator to indicate when one column ends and the next column begins.
CSV/TSV files do not have some of the features that create inconsistency problems with Excel. For example, they do have multiple data sheets, formulas in place of set values, or the ability to fill multiple cells with a single value. The delimited text format is also simpler and easier to parse than the Excel file format. Nevertheless, they are still likely to suffer from many of the same inconsistency issues as Excel. There is still there is no standard way of defining data schema, no support for metadata, and no support for presorting, filtering or performing other basic data transformations.
Spreadsheet file formats including CSV and Excel are useful standards for importing and exporting data. Drupal Views and OData can both export their data as CSV or Excel, and an open data solution should retain these options.
Databases (SQL, MongoDB, etc.)
Relational databases such as MySQL or Postgresql address many of the limitations inherent in file formats such as Excel or CSV. They enable sorting and filtering; they support relationships between data tables; and they support data schema definitions. Their main limitation as a data storage format is that they do not provide a built-in mechanism for adding metadata annotations at the level of individual data cells. However, this is not a major drawback and could be overcome with some custom programming. NoSQL databases such as MongoDB or CouchDB could also be used effectively for open data storage and might offer some advantages.
However, no database can serve by itself as an open data protocol, and none has sufficient standing as a database standard to serve as a de facto data standard.
The Google Data API (GData) has been described sometimes as a competitive alternative to OData for publishing data feeds. However, this is not really the case. The Google was developed by Google as a standard protocol for accessing the data in many of Google’s own applications, including Google Maps, Google search and about 20 others. However, Google has never attempted to offer GData as a general protocol for data sharing. Moreover, it seems to be moving away from using GData even for its own applications. The developers’ documentation for the Google Data API carries a prominent disclaimer at the top of the page which states, “Warning: Most newer Google APIs are not Google Data APIs. The Google Data APIs documentation applies only to the older APIs that are listed in the Google Data APIs directory.”
Like OData, the Google Data API is based on AtomPub and provides a RESTful interface capable of performing the basic CRUD functions for data storage and retrieval. However, it is missing several of the features of OData that are important for a data sharing protocol. GData does not provide support for sharing information about the data schema of a resource. It does not provide a standard mechanism for filtering and sorting data items. It does not have a system that defines relationships between items in different data tables.
DataRSS was a proposal put forward in 2009 by Pito Salas, who noted that he had been “looking extensively at the great variety of data-oriented REST and REST-ish APIs that are appearing, especially as part of various government transparency efforts. As an example, there is the Sunlight Foundation’s API to look up information about congress people, or the Follow TheMoney API to look up information about lobbying and political contributions.” Salas observed that “For a ‘consumer’ (that would be a programmer) of this information it’s pretty time-consuming and error prone to study the documentation of each of these ‘similar but different’ APIs. Most are quite well documented but still each has to be discovered and studied separately.” As a result, he concluded, “Creating applications (either browsers, or widgets, or middleware applications) that use and combine information from more than one source is hard.”
To address this problem, Salas proposed taking a lesson from the RSS experience and developing a data sharing protocol that would “allow a single access method to access a very broad range of data,” “be easy and cheap to implement,” and “not require any centralization.” Salas originally called his proposed standard “DataRSS” and subsquently renamed it “Decentralized Data Discovery.”
Beyond this general proposal, however, Salas has not published any code or demos of his concept since 2009. It appears that OData largely implements his ideas.
Resource Description Framework (RDF) “Resource Description Framework - in - attributes” (RDFa) are specifications for a “metadata data model” published by the World Wide Web Consortium (W3C). They specify a way of marking up the text in web pages with tags which give the text semantically meanings that can be automatically extracted by computers. For example, “Friend of a Friend” (FOAF) markup can be added to web pages to indicate that two people are friends of each other.
RDF provides a means of automatically extracting meaningful data from web pages, but it is not in itself a data sharing protocol. It is a specification that can be used to add metadata to existing information and may have some usefulness within an open data system.
A blog post comparing RDF vs. OData vs. GData vs. DataRSS:
- Where to find data: http://radar.oreilly.com/2010/12/where-to-find-data-2.html