By now, we have covered different strategies and tools that help you find and access data from open access sources on the web. However, as an open access data user, you need to know that not all available data online is ready to be used.No matter where the data comes from, you always need to be check the quality of your dataset. Data can be messy and “dirty” with lots of problems and mistakes, particularly so crowd-sourced data. You you need to think about raw data in terms of their limitations and ways you can make them shiny and valuable assets to you as a journalist. In this module, we will learn how to use OpenRefine to clean you data. OpenRefine is a free of charge and powerful open source softwarethat lets you do
- Data profiling – assessing the current state of your data and
- Data Cleaning – correcting problems spotted during data profiling.
Today’s lessons includes:
1. Basic steps of OpenRefine,
- how to import a dataset
- how to get around in the main interface.
2. Analyzing and Fixing Data
- get to know your dataset
- spot errors in it
- repair mistakes
More specifically, we will learn:
- sorting data
- faceting data
- using simple cell transformations – clustering
- detecting duplicates
- removing matching rows
The Basic Steps in OpenRefine
You will need to download OpenRefine from the web into your personal computer. Click the Open Refine logo at top left to return to the opening screen. Create a new project from the file. To open a datafile, you will need to save it into your computer.
You can import data file by clicking “Open file” and create a new project. For this exercise, we will use the dataset collections for menus from the New York Libraries. Dataset “Menu”. Once the interface of your data appears, you will see that the program recognizes you data in CVS format. Proceed to creating the project.
Data are easier to manipulate and explore when they are sorted in a logic way you. Besides, sorting data as a visual aid and reordering rows permanently is a prerequisite for more advanced operations in OpenRefine. Cell values can be sorted according to their type:
- Text: alphabetical (a to z) or reversed alphabetical (z to a)
- Numbers: smallest first or largest first
- Dates: earliest first or latest first
- Booleans: false then true or true then false
You can also choose where to sort errors and blanks by dragging them in the desired order. Errors could be sorted first (to spot them more easily) and blank values at the end (since they interest us less) with valid (normal) values in the middle.
Facets allow you to get useful insights of your dataset, by helping you examine different aspects of your data. They also allow you to apply a transformation to a subset of your data, for more advanced functions. You will mostly use the following types of facets:
- Text facets, for faceting text. Example: County or city names
- Numeric facets for numbers and dates. Example: expenditures, crime rate, etc.
- Predefined/customized facets
Text facets help you examine data that are displayed in text. Example: if your dataset has a column containing cities or country names, you will want to see at a glance what the different values are for that field and the number of occurrences for each one.You will be able to identify issues related to :
- Period and commas
- Apostrophes in places it shouldn’t be
- Other problems with data
Whereas a text facet returned a list of different choices, numeric facets yield ranges of values.
Some of the open access data you will encounter sloppy categorization of the data ad Open refine helps you identify categories that naturally fall together, by spotting similarities in their labeling. You can continue editing entries manually, but Open Refine also has a clustering function to help clean text fields.
Cluster button finds matching clusters that the machine thinks belong together. Example, in our dataset the category “dinner”is annotated in a variety of ways. Clustering will identify similar categories together. But as you scroll down, you will find entries that the algorithm has correctly grouped together. You can edit the text in the boxes at the right to choose the text that will be substituted, when you merge the clusters. If you believe that the identified categories belong together, then click “Merge Selected & Re-cluster”. Only those clusters you that check will be merged.
Detecting & Deleting Duplicates
Duplicates are annoying records that happen to appear twice (or more) in a dataset. Keeping identical records is a waste of space and can generate false representation of your data, so we will want to remove these duplicates. But, first you need to identify duplicates. The easiest way to do so is to use the customized facet “The duplicates facet.”However, it has its limitations. This facet only works on text strings and we cannot be used reliably on numeric data. In the following example, we will use it identify duplicates on our “Menu dataset” for “Places” category. Let’s sa we only want a dataset that contains one menu from each of the locations in our dataset, however our dataset contains multiple menus for each location. To detect duplicates for text data, perform the following:
- Place | Facet | Customized facets | Duplicates facet. Result will display “False” results( meaning entries that do not have duplicates and “True” – 15336 (these are marked as duplicates). Click on “True”results.
- Next, navigate to Place | Edit cells | Blank down. This function replaces the “Place” of duplicate rows by blanks (every original row will keep its ID, only subsequent ones will lose them)
- To find the duplicates: Place | Facet | Customized facets | Facet by blank. This function will display only the redundant rows
- To remove them: All |Edit Rows | Remove all matching rows