Open Refine describes itself as 'a power tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services and linking it to databases like Freebase.' It was borne out of a project started by Google (and used to be called Google Refine), but is now an open source project hosted on Github.
This is a free application that you install on your system. Once installed it runs in your browser, so while it looks and behaves like a web application it is entirely contained on your own computer.
While it has a wide range of features, I'll focus on one sub-set in this blog post. How to 'wrangle' messy or un-structured data to make it more structured. This is a necessary first step if you want to analyse the data in a spreadsheet or other statistical analysis tool.
In my experience Open Refine is the best tool to work with if you need to tidy up messy data. So what can it do?
Before you do anything, you need to import your data. As you can see there are a wide range of options available to you. Most people are likely to use XLS or CSV. However, it's nice to have the option to import direct from XML, JSON or a Google Spreadsheet too.
- TSV, CSV, or values separated by a custom separator you specify
- Excel (.xls, xlsx)
- XML, RDF as XML
- Google Spreadsheets
- RDF N3 triples
OK, you've imported your data, now what? There are various issues the Open Refine can help you fix. I'll look at some of the more common ones.
Finding and removing duplicates
Perhaps one of the most common data quality issues relates to duplicate data. This could happen at the time you collect data. For example, counting the same person twice at different times or places. Or it could happen when you aggregate your data. For example accidentally copying and pasting from one Excel file to another twice.
Open Refine makes it easy to find duplicate entries in your data. Start by finding the column where you want to check for duplicates. In the menu at the top of the column select 'Facet', 'Customised facets' and then 'Duplicates facet'. You'll now be presented with a box on the left hand panel. This gives you two choices:
False - Click this to show only values that are not matching
True - Click this to show only values that are matching (ie duplicates)
Click the 'exclude' link to then remove the selected data.
Grouping similar data
Next imagine you have data where there are multiple responses that mean the same thing, but were expressed differently. For example:
- Young people
- And so on
You know that they mean the same thing, but the people entering the data have used different ways to express it. Clearly you could avoid this problem by using pre-determined taxonomies that you select from. However, that's no help once you've already collected the data.
Luckily this is also easy to fix. Select 'Facet' and then 'Text facet'. Again, the panel on the left will show a list of fields in that column that have unique names. See the video below for an example. If you hover over the name you'll see how many rows have this name and the option to edit the names. If you edit and save then the new name will be applied to all the rows at once.
If the data you need to group has a clear pattern to it then Open Refine has a powerful feature that can make this process even faster. Start by selecting 'Edit cells' and then 'Cluster and edit'. As the screenshot below shows you can now see all cells in that column that have similar responses. You can select a common name for them and group them in one go.
Trim leading and trailing whitespace
Another issue you may face is where responses have a space at the start or the end of each response. While the text facet option above will also deal with this manually, you can perform this operation automatically across your entire dataset.
‘Edit cell’, ‘Commons transforms’, ‘Trim leading and trailing whitespace’. You’ll see a confirmation message at the top of the screen showing how many cells were transformed.
Split multi-valued cells
What if you have data in one cell that should be split across two cells? Open Refine makes it simple to split data out by specific criteria. For example:
- By separator (e.g. wherever there is a specific character like ,.;-? or other character of your using
- By field length (e.g. after 5, 9 or 11 characters
To do this click on ‘Edit column’ then ‘Split into several columns’.
What else can it do?
Open Refine has many other options. One of the most interesting is the ability to enhance your data, drawing on data elsewhere on the Internet. Below are examples of recipes that can help you do this. Recipes are worked examples that have been documented. You can refer to these to help with specific tasks you need to carry out. Some examples:
- Trim whitespace from beginning and end of values
- Parse an IP address to extract a Country Name, Latitude & Longitude
- Removing duplicate rows when Exact values are found in a column
- Translate street addresses to lat/lng coordinates
- Fetching URLs From Web Services
How to avoid messy data
Open Refine is a great tool for cleaning up messy data. In my experience it's much faster than Excel or other spreadsheets at this task. However, in my view we should try to avoid data quality issues like these to begin with. That's not always possible of course, but there are situations where it is.
You may find the following resources useful: