Gaining a functional level of data literacy is foundational in learning how to use data to communicate.
Data literacy is the ability to derive meaningful information from data, just as literacy in general is the ability to derive information from the written word.
Although this foundation is neither complicated nor technical, it will equip you with the language, definitions and jargon that you will need in order to engage in any dialogue around data processing.
For this reason, this module is recommended as pre-course material for most of our other training modules, and serves as a starting point for anyone wanting to learn more about the processing of data.
The formats in which technology allows information to be stored have had a big impact on the way that we are able to work with information.
When we store information as data, a whole world of opportunity opens up, supported by the skills and practices of various industries already in existence. These include computer science, software engineering, visual design, journalism, economics, academia, geographic information systems (GIS), and more.
Everything we see can be tracked and converted into data: our height, our age, the number of transport modes we take to commute to work, the number of children at school, or the number who are not attending school.
When that data is in the right format, meaning that the data elements are arranged in rows and columns, it takes the shape of a dataset that can deliver the answers we need.
It is useful at this stage to contextualise data. To do so, we must ask: What is data? What is information? What is knowledge? And lastly, how are they all connected?
To best understand this, let us explore the following example.
Data: Golf ball size = 43mm
This piece of data as it stands alone is fairly meaningless. So we know the size of our golf ball. So what?
Let's add in another piece of data.
Data: Min size of golf ball for the tournament = 43mm
Aha! Now we have some useful information.
Information: This golf ball can be used in the tournament.
What do we know?
Knowledge: I now know that I need to check if the size of my golf balls is acceptable for different tournaments going forward.
Thus, knowledge is information that is understood, applied and learned.
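The golf ball example can be sketched in a few lines of Python. This is only an illustration: the function name `is_allowed` and the tournament names and minimum sizes in the dictionary are invented here and do not come from any real rulebook.

```python
# Data: two stand-alone facts from the example above.
ball_size_mm = 43   # size of our golf ball
min_size_mm = 43    # minimum size allowed in the tournament


def is_allowed(ball_mm: float, minimum_mm: float) -> bool:
    """Combine two pieces of data into one piece of information."""
    return ball_mm >= minimum_mm


# Information: this golf ball can be used in the tournament.
allowed = is_allowed(ball_size_mm, min_size_mm)
print("Ball can be used in the tournament:", allowed)

# Knowledge: the same check is reused for other tournaments.
# These names and minimum sizes are hypothetical.
tournament_minimums_mm = {"Club Open": 43, "Regional Cup": 44}
results = {
    name: is_allowed(ball_size_mm, minimum)
    for name, minimum in tournament_minimums_mm.items()
}
print(results)
```

The data only becomes information once the two values are compared, and the reusable check stands in for the knowledge gained.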
Data is categorised as being either qualitative or quantitative.
Data that is numeric in form is categorised as quantitative data when, as the name suggests, the numbers are either quantifiable or measurable. Qualitative data, by contrast, is data that is descriptive in nature. While qualitative data can take on a numeric form, it more commonly exists in other forms such as text.
It is important to know the difference between these two data types, as the processes that can be performed on the data, and the insights that can be extracted from that processed data, differ vastly.
Any data values that refer to the quality of something are known as qualitative data. A description of the colours, texture and feel of an object, a description of experiences, and information gathered during an interview are all qualitative data. While this type of data is more commonly captured in the form of text, it is not limited to text. Qualitative data can also exist in forms such as audio, video or photographs.
Numeric data that describes a data record in terms of its qualities, rather than being an indication of quantity or measure, is also categorised as being qualitative. An example of this is a column containing identity numbers of citizens of a country. This data is stored in the dataset as a number, but it is not quantifiable because it does not make sense to average, count or sum such numeric data. Doing so does not provide a meaningful measure of anything.
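A small sketch of why identity numbers are qualitative, assuming three made-up ID values (they are not real identity numbers). Python will happily compute their average, but the result describes nobody:

```python
from statistics import mean

# Made-up identity numbers, for illustration only.
id_numbers = [1001, 1002, 1007]

# The calculation runs, but the answer is not a measure of anything:
# it is not even one of the IDs in the dataset.
average_id = mean(id_numbers)
print(average_id, average_id in id_numbers)

# Storing IDs as text makes their role as labels explicit and
# prevents accidental arithmetic on them.
id_labels = [str(number) for number in id_numbers]
print(id_labels)
```

Treating such columns as text is a common precaution, since it stops spreadsheet or analysis software from summing or averaging them by mistake.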
Qualitative data can also be considered categorical. This means that each individual data record is assigned one of a fixed selection of categories, and all the data records are distributed among those categories. This makes it possible to analyse different categories of data within a dataset in relation to each other.
The three main categorical classifications are: binary, nominal (or unordered), and ordinal.
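The three classifications can be illustrated with a short Python sketch. The records, field names and category lists below are hypothetical and chosen only to show the difference between the types:

```python
from collections import Counter

# Hypothetical records for three people.
employed = [True, False, True]                     # binary: only two possible values
sector = ["health", "education", "health"]         # nominal: categories with no natural order
education = ["primary", "tertiary", "secondary"]   # ordinal: categories with a meaningful order

# Nominal categories can be counted, but sorting them carries no meaning.
sector_counts = Counter(sector)
print(sector_counts)

# Ordinal categories can be ranked once the order is stated explicitly.
EDUCATION_ORDER = ["primary", "secondary", "tertiary"]
ranked = sorted(education, key=EDUCATION_ORDER.index)
print(ranked)
```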
Below are a few examples of different methods for obtaining qualitative data:
Quantitative data tells you something about a measure or quantity and is data that refers to a number. For example: height, age, GDP, temperature, area, and price.
This data type consists of values that can be measured precisely, rather than through interpretation. This makes quantitative data objective, replicable, and verifiable.
Quantitative data is also quantifiable. For example, adding up a column of data containing the ages of people, and dividing by the total number of people in the study, provides the average age of the people.
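The average-age calculation described above looks like this in Python. The ages are invented sample values:

```python
# Hypothetical column of ages from a study of five people.
ages = [32, 45, 27, 51, 38]

total_age = sum(ages)                 # add up the column
average_age = total_age / len(ages)   # divide by the number of people

print(total_age, average_age)
```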
Quantitative data has two basic classifications: discrete or continuous.
Discrete data consists of measurements that are integers. This is data that has gaps in it (for example, the gap between 1 and 2), where the numbers can only be whole numbers, such as the number of people in a household, scores on a test (where you receive e.g. 7/10), or shoe sizes.
Continuous data consists of measurements that can take on any value within a defined range, such as weight, length, and duration. In continuous data, all values are possible with no gaps in between (e.g. 1.27843369981472).
The difference can be seen in that your shoe size (discrete) is different from the size of your foot (continuous).
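One way to picture the shoe size example is to snap a continuous measurement onto a fixed set of steps. The sketch below is an assumption for illustration: the function `nearest_half` and its 0.5 cm step are invented here and do not correspond to any real shoe-sizing system.

```python
def nearest_half(length_cm: float) -> float:
    """Snap a continuous length to the nearest 0.5 step (a made-up size scale)."""
    return round(length_cm * 2) / 2


# Continuous: a foot can be any length within a range.
foot_lengths_cm = [24.3, 26.71, 25.0]

# Discrete: only a fixed set of sizes exists, with gaps between them.
sizes = [nearest_half(length) for length in foot_lengths_cm]
print(sizes)
```

Many different foot lengths end up on the same size, which is why the discrete value carries less detail than the continuous measurement behind it.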
Quantitative data includes:
Below are a few examples of different methods for obtaining quantitative data:
Data can be captured and stored in different formats. These formats exist to support the different tasks we need to perform in order to understand the problems we are trying to solve, or to investigate and shape the stories we are trying to write.
As a result, data must be stored in a format that supports whichever task is necessary, be it as unstructured data, which is better for communication and human consumption, or as structured data, which is better for computer processing because it is organised into a consistent arrangement of rows and columns, as in a relational database.
Unstructured data is often text-heavy, although it typically contains data such as dates, numbers or facts. Because that data is presented irregularly, it cannot easily be processed by a computer.
An example of unstructured data could be the information contained in an email, whereas an example of structured data is the information contained within a spreadsheet.
"My name is Jane, I am a 32 year old female South African, and I live in Newlands in Cape Town, which is in the Western Cape province of South Africa"
The above example would result in irregularities when imported into a spreadsheet, making it difficult to process for analysis.
"name", "age", "gender", "nationality", "location"
"Jane", 32, "female", "South African", "Newlands, Cape Town, Western Cape, South Africa"
The above is an example of CSV or comma separated values data. Data in this structured format can be read directly by spreadsheet software. This means it is machine readable.
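As a minimal sketch of what "machine readable" means in practice, the CSV record above can be parsed with Python's standard `csv` module. The variable names are chosen here for illustration, and `skipinitialspace=True` is needed only because this example has a space after each comma:

```python
import csv
import io

# The CSV example from above, held in a string instead of a file.
raw = (
    '"name", "age", "gender", "nationality", "location"\n'
    '"Jane", 32, "female", "South African", '
    '"Newlands, Cape Town, Western Cape, South Africa"\n'
)

# DictReader uses the first row as field names for every following row.
rows = list(csv.DictReader(io.StringIO(raw), skipinitialspace=True))
person = rows[0]

# Every value is read as text, so numbers must be converted explicitly.
age = int(person["age"])

print(person["name"], age)
print(person["location"])  # the quoted commas stay inside one field
```

The quotation marks around the location are what keep its internal commas from being treated as column separators.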
There are many different formats of structured, or machine readable data, and some examples of these are:
First Name | Last Name | Goals |
---|---|---|
Benni | McCarthy | 31 |
Shaun | Bartlett | 29 |
Katlego | Mphela | 23 |
Bernard | Parker | 23 |
Phil | Masinga | 19 |
Taking the data above as an example, here is that table shown in various formats:
First Name	Last Name	Goals
Benni	McCarthy	31
Shaun	Bartlett	29
Katlego	Mphela	23
Bernard	Parker	23
Phil	Masinga	19
Please note that it is inherently difficult to display tab separated values on a web page. The main takeaway should be that each column is separated by a tab and each row is on a new line.
First Name,Last Name,Goals
Benni,McCarthy,31
Shaun,Bartlett,29
Katlego,Mphela,23
Bernard,Parker,23
Phil,Masinga,19
[
  { "First Name":"Benni", "Last Name":"McCarthy", "Goals":"31" },
  { "First Name":"Shaun", "Last Name":"Bartlett", "Goals":"29" },
  { "First Name":"Katlego", "Last Name":"Mphela", "Goals":"23" },
  { "First Name":"Bernard", "Last Name":"Parker", "Goals":"23" },
  { "First Name":"Phil", "Last Name":"Masinga", "Goals":"19" }
]
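The same table can be written out in each of these formats from a single Python structure. This is a sketch using only the standard library; the helper `to_delimited` is a name invented for this example. Note one difference from the JSON listing above: here the goals are kept as numbers rather than text.

```python
import csv
import io
import json

header = ["First Name", "Last Name", "Goals"]
rows = [
    ["Benni", "McCarthy", 31],
    ["Shaun", "Bartlett", 29],
    ["Katlego", "Mphela", 23],
    ["Bernard", "Parker", 23],
    ["Phil", "Masinga", 19],
]


def to_delimited(delimiter: str) -> str:
    """Write the table as text, one row per line, columns split by `delimiter`."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


tsv_text = to_delimited("\t")   # tab separated values
csv_text = to_delimited(",")    # comma separated values

# JSON: a list of records, each pairing a column name with its value.
records = [dict(zip(header, row)) for row in rows]
json_text = json.dumps(records, indent=2)

print(csv_text)
print(json_text)
```

All three outputs hold the same information; only the way columns and rows are marked differs.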
If you wish to explore other structured data formats, we recommend the Open Data Handbook.
Metadata is the data about the data: a summary of the basic information regarding a given dataset. Every dataset should have an accompanying metadata document. It should detail the data source and collection methodology, as well as field descriptions.
Metadata is necessary for understanding the scope of the data and its limitations, for decoding and verifying the contents of the data, and for understanding how to interpret it. It should also provide the definitions of any categorical data in the dataset. For example, if the data contains categories of high, medium or low, then the metadata needs to define these categories so that users of the data can understand exactly what high, medium and low mean in terms of that dataset.
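To make this concrete, a metadata document could be represented as a simple Python dictionary. Everything in this sketch is hypothetical: the dataset, the field name `income_band`, the thresholds and the helper `describe` are all invented for illustration.

```python
# A hypothetical metadata record for an invented survey dataset.
metadata = {
    "source": "Example household survey (hypothetical)",
    "collection_method": "Face-to-face interviews",
    "fields": {
        "income_band": {
            "description": "Monthly household income, grouped into bands",
            "categories": {
                "low": "less than 5000 per month",
                "medium": "5000 to 20000 per month",
                "high": "more than 20000 per month",
            },
        },
    },
}


def describe(field: str, value: str) -> str:
    """Look up what a category value means according to the metadata."""
    return metadata["fields"][field]["categories"][value]


print(describe("income_band", "high"))
```

Without the `categories` entry, a user of the dataset would have no way of knowing where "medium" ends and "high" begins.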
Not all data downloaded online or requested from government, for example, is delivered with the metadata document. Thus, when you request a dataset from an official source, always ensure you request the metadata to accompany the data.
The three types of metadata are: descriptive, structural, and administrative.
Descriptive metadata is used for discovery and identification. This includes attributes such as title, author, abstract, or keywords.
Structural metadata indicates how compound objects are put together, for example, how pages are ordered to form chapters. It describes the types, versions, relationships and other characteristics of digital materials.
Administrative metadata provides information to help manage a resource, such as when and how it was created, file type and other technical information, and who can access it.
Note: there are several subsets of administrative metadata.
Source: NISO (2004). Understanding Metadata. Bethesda, MD: NISO Press, p. 1.
View an example of what a detailed metadata document looks like.
Before we wrap up this module, let's try to identify a few data types. Using information about yourself as a subject, assess whether the following data fields are examples of quantitative or qualitative data:
Data Value | Data Type |
---|---|
Name | Qualitative |
ID number | Qualitative |
Gender | Qualitative |
Age | Quantitative |
Age Group (e.g. 35-45) | Qualitative, Categorical |
Education level | Qualitative, Categorical |
Employment Sector | Qualitative, Categorical |
Income | Quantitative, Discrete |
Population Count | Quantitative, Discrete |
Height | Quantitative, Continuous |
Of the above values that could be assigned to any person in a dataset:
This curriculum has been developed by OpenUp in collaboration with School of Data.
You are free to use, share, and adapt this content to your needs. Do you want to teach others? Let us know how we can help.