OpenUp | Data Basics

Gaining a functional level of data literacy is foundational in learning how to use data to communicate.

Data literacy is the ability to derive meaningful information from data, just as literacy in general is the ability to derive information from the written word.

Although this foundation is neither complicated nor technical, it will equip you with the language, definitions and/or jargon that you will need in order to engage in any dialogue around data processing.

For this reason, this module is recommended as pre-course material for most of our other training modules, and serves as a starting point for anyone wanting to learn more about the processing of data.

The formats in which technology allows information to be stored have had a big impact on the way that we are able to work with information.

When we store information as data, a whole world of opportunity opens up, supported by the skills and practices of many existing fields, for example: computer science, software engineering, visual design, journalism, economics, academia, geographic information systems (GIS), and more.

But what exactly is data?

Everything we see can be tracked and converted into data: our height, our age, the number of transport modes we take to commute to work, the number of children at school, or the number who are not attending school.

And when that data (aka information) is in the right format, meaning that the data elements are arranged in rows and columns, it takes the shape of a dataset that can deliver the required answers.

In summary, data is:

Information
A value assigned to a thing
All around us

The fundamentals of data

It is useful at this stage to contextualise data. In doing so, we must ask: what is data? What is information? What is knowledge? And lastly, how are they all connected?

To best understand this, let us explore the following example.

Data: Golf ball size = 43mm

This piece of data as it stands alone is fairly meaningless. So we know the size of our golf ball. So what?

Let's add in another piece of data.

Data: Min size of golf ball for the tournament = 43mm

Aha! Now we have some useful information.

Information: This golf ball can be used in the tournament.

What do we know?

Knowledge: I now know that I need to check if the size of my golf balls is acceptable for different tournaments going forward.

Thus, knowledge is information that is understood, applied and learned.
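
The same chain can be sketched in a few lines of Python. This is only an illustration: the variable and function names are made up for this example, and the values are the ones used in the golf ball example above.

```python
# Data: two isolated values (taken from the example above).
ball_size_mm = 43
tournament_min_size_mm = 43

# Information: combining the two values answers a question.
ball_is_allowed = ball_size_mm >= tournament_min_size_mm


# Knowledge: a reusable rule we can apply to any ball and any tournament.
def is_ball_allowed(size_mm, min_size_mm):
    """Return True when the ball meets the tournament's minimum size."""
    return size_mm >= min_size_mm


print(ball_is_allowed)
print(is_ball_allowed(41, 43))
```

The first two variables are data, the comparison turns them into information, and the function captures the knowledge that the check should be repeated for every tournament.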

Data types

Data is categorised as being either qualitative or quantitative.

Data that is numeric in form is categorised as quantitative data when, as the name suggests, the numbers are either quantifiable or measurable. Qualitative data, by contrast, is descriptive in nature. While qualitative data can take on a numeric form, it more commonly exists in other forms such as text.

It is important to know the difference between these two data types, as the processes that can be performed on the data, and the insights that can be extracted from that processed data, differ vastly.

Qualitative data

Any data values that refer to the quality of something are known as qualitative data. A description of the colour, texture and feel of an object, a description of experiences, or information gathered during an interview are all qualitative data. While this type of data is more commonly captured in the form of text, it is not limited to text. Qualitative data can also exist in forms such as audio, video or photographs.

Numeric data that describes a data record in terms of its qualities, rather than being an indication of quantity or measure, is also categorised as being qualitative. An example of this is a column containing identity numbers of citizens of a country. This data is stored in the dataset as a number, but it is not quantifiable because it does not make sense to average, count or sum such numeric data. Doing so does not provide a meaningful measure of anything.

Qualitative data can also be considered categorical. This means that each data record is assigned one of a set of categories, and all the records in the dataset are distributed across those categories. This makes it possible to analyse different categories of data within a dataset in relation to each other.

The three main categorical classifications are: binary, nominal (or unordered), and ordinal.

Binary data distinguishes records by placing them into one of two mutually exclusive categories, for example yes/no, true/false, or right/wrong.
Nominal (or unordered) data assigns records to named categories that do not inherently have a value or rank, for example colours or marital status.
Ordinal data assigns records to categories that contain some sort of implicit or natural order, for example short/medium/tall or a scale from 1 to 10.
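
To see how the three classifications behave differently in practice, here is a minimal Python sketch. The records and the field names (employed, marital_status, height_band) are hypothetical and only borrow the example categories from the list above.

```python
from collections import Counter

# Hypothetical records, invented for illustration.
people = [
    {"employed": "yes", "marital_status": "married", "height_band": "tall"},
    {"employed": "no", "marital_status": "single", "height_band": "short"},
    {"employed": "yes", "marital_status": "single", "height_band": "medium"},
]

# Binary: exactly two mutually exclusive categories.
employed_counts = Counter(p["employed"] for p in people)

# Nominal: named categories with no rank, so counting is meaningful
# but sorting them from "low" to "high" is not.
marital_counts = Counter(p["marital_status"] for p in people)

# Ordinal: the categories have a natural order, which we state explicitly.
HEIGHT_ORDER = ["short", "medium", "tall"]
sorted_by_height = sorted(people, key=lambda p: HEIGHT_ORDER.index(p["height_band"]))

print(employed_counts)
print(marital_counts)
print([p["height_band"] for p in sorted_by_height])
```

Note that the order of an ordinal category has to be spelled out by us. Sorting the words alphabetically would put "medium" before "short", which is not the order we mean.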

Below are a few examples of different methods for obtaining qualitative data:

Structured or unstructured interviews
Focus groups
Direct observation
Participant observation
Written documents
Artifacts

Quantitative data

Quantitative data tells you something about a measure or quantity and is data that refers to a number. For example: height, age, GDP, temperature, area, and price.

This data type is made up of values that can be measured precisely, rather than through interpretation. This makes quantitative data objective, replicable, and verifiable.

Quantitative data is also quantifiable. For example, adding up a column of data containing people's ages, and dividing by the total number of people in the study, provides the average age of those people.
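
As a small worked example, the calculation can be written in Python as follows. The ages are hypothetical values chosen only to illustrate the arithmetic.

```python
# Hypothetical ages of five people in a study (illustrative values only).
ages = [23, 35, 41, 29, 32]

# Add up the column and divide by the number of people.
average_age = sum(ages) / len(ages)

print(average_age)  # 160 / 5 = 32.0
```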

Quantitative data has two basic classifications: discrete or continuous.

Discrete data is when the measurements can only take certain separate values, typically whole numbers (integers). This is data that has gaps in it (for example, the gap between 1 and 2), such as the number of people in a household, scores on a test (where you receive e.g. 7/10), or shoe sizes.

Continuous data is when the measurements can take on any value within a defined range, such as weight, length, and duration. In continuous data, all values are possible with no gaps in between (e.g. 1.27843369981472).

The difference can be seen in that your shoe size (discrete) is different from the size of your foot (continuous).
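
One way to picture the difference is to ask which values are allowed. The sketch below is an assumption-laden illustration: the function names and the ranges (a test out of 10, a duration of up to one hour) are invented for this example.

```python
def is_valid_test_score(score, max_score=10):
    """Discrete: only whole numbers from 0 to max_score are possible."""
    return isinstance(score, int) and 0 <= score <= max_score


def is_valid_duration(seconds, max_seconds=3600.0):
    """Continuous: any value within the range is possible."""
    return 0.0 <= seconds <= max_seconds


print(is_valid_test_score(7))                # a whole-number score
print(is_valid_test_score(7.5))              # falls in a gap between scores
print(is_valid_duration(1.27843369981472))   # any value in the range is fine
```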

Quantitative data includes:

Measurements
Counts
Quantification
Calculations
Estimations
Predictions

Below are a few examples of different methods for obtaining quantitative data:

Polls
Questionnaires
Surveys
Rating scales
Physiological measurements (such as observation, direct or indirect measurement, laboratory tests)
Manipulating pre-existing statistical data using computational techniques

Data formats

Data can be captured and stored in different formats. These formats exist to support the different tasks we need to perform in order to understand the problems we are trying to solve, or to investigate and shape the stories we are trying to write.

As a result, data must be stored in a format that supports whichever task is necessary, be it unstructured data, which is better for communication and human consumption, or structured data, which is better for computer processing because it is organised in a consistent, predictable way, as in a spreadsheet or relational database.

Unstructured data is often text-heavy, although it typically also contains data such as dates, numbers or facts. Because of the data's irregular presentation, it cannot easily be processed by a computer.

An example of unstructured data could be the information contained in an email, whereas an example of structured data is the information contained within a spreadsheet.

What do unstructured and structured data look like?

Unstructured data

"My name is Jane, I am a 32 year old female South African, and I live in Newlands in Cape Town, which is in the Western Cape province of South Africa"

The above example would result in irregularities when imported into a spreadsheet, making it difficult to process for analysis.

Structured data

"name", "age", "gender", "nationality", "location"

"Jane", 32, "female", "South African", "Newlands, Cape Town, Western Cape, South Africa"

The above is an example of CSV (comma separated values) data. Data in this structured format can be read directly by spreadsheet software. This means it is machine readable.
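
To illustrate what "machine readable" means, here is a minimal sketch that reads the structured example using Python's built-in csv module. The variable names are our own. The skipinitialspace option is used because the example has a space after each comma.

```python
import csv
import io

# The structured example from above, as plain text.
raw = '''"name", "age", "gender", "nationality", "location"
"Jane", 32, "female", "South African", "Newlands, Cape Town, Western Cape, South Africa"
'''

# DictReader uses the first line as the column names.
reader = csv.DictReader(io.StringIO(raw), skipinitialspace=True)
records = list(reader)
person = records[0]

# Values are read as text, so numbers must be converted explicitly.
age = int(person["age"])

print(person["name"])
print(age)
print(person["location"])
```

Because the location is wrapped in quotation marks, the commas inside it are treated as part of the value and not as column separators.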

There are many different formats of structured, or machine readable data, and some examples of these are:

XLS and XLSX are file extensions for Microsoft Excel spreadsheet files. XLSX is an open, XML-based (Extensible Markup Language) format used by Excel 2007 and later, while XLS is the older binary format used by earlier versions.
Comma Separated Values (or CSV) stores tabular data in plain text. Each data record sits on a single line and consists of one or more fields separated by commas.
Tab Separated Values (or TSV) stores tabular data in plain text. Each data record sits on a single line and consists of one or more fields separated by tab characters.
JavaScript Object Notation (or JSON) is a text-based data interchange format designed for transmitting structured data. It is most commonly used for transferring data between web applications and web servers.

Examples of structured data formats

First Name	Last Name	Goals
Benni	McCarthy	31
Shaun	Bartlett	29
Katlego	Mphela	23
Bernard	Parker	23
Phil	Masinga	19

Taking the data above as an example, here is that table shown in various formats.

Tab separated values

First Name  Last Name Goals
Benni  McCarthy 31
Shaun Bartlett 29
Katlego  Mphela 23
Bernard  Parker 23
Phil  Masinga 19

Please note that it is inherently difficult to display tab separated values on a web page. The main takeaway should be that each column is separated by a tab and each row is on a new line.
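
Since tabs are hard to see on a page, the following Python sketch builds the same rows as tab separated values and prints them with repr, which shows each tab character as \t. The variable names are our own.

```python
# The table from above, one list per row.
rows = [
    ["First Name", "Last Name", "Goals"],
    ["Benni", "McCarthy", "31"],
    ["Shaun", "Bartlett", "29"],
    ["Katlego", "Mphela", "23"],
    ["Bernard", "Parker", "23"],
    ["Phil", "Masinga", "19"],
]

# Join the fields of each row with a single tab character.
tsv_lines = ["\t".join(row) for row in rows]

for line in tsv_lines:
    print(repr(line))  # repr makes the otherwise invisible tabs visible
```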

Comma separated values

First Name,Last Name,Goals
Benni,McCarthy,31
Shaun,Bartlett,29
Katlego,Mphela,23
Bernard,Parker,23
Phil,Masinga,19

JSON

[
   {
      "First Name":"Benni",
      "Last Name":"McCarthy",
      "Goals":"31"
   },
   {
      "First Name":"Shaun",
      "Last Name":"Bartlett",
      "Goals":"29"
   },
   {
      "First Name":"Katlego",
      "Last Name":"Mphela",
      "Goals":"23"
   },
   {
      "First Name":"Bernard",
      "Last Name":"Parker",
      "Goals":"23"
   },
   {
      "First Name":"Phil",
      "Last Name":"Masinga",
      "Goals":"19"
   }
]
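
Because these formats all describe the same table, a computer can convert between them. The sketch below is one possible way to do this with Python's built-in csv and json modules. The variable names are our own.

```python
import csv
import io
import json

# The goal-scorer table as comma separated values.
csv_text = """First Name,Last Name,Goals
Benni,McCarthy,31
Shaun,Bartlett,29
Katlego,Mphela,23
Bernard,Parker,23
Phil,Masinga,19
"""

# Read each CSV row into a dictionary keyed by the column names.
records = list(csv.DictReader(io.StringIO(csv_text)))

# Write the same records out as JSON text.
json_text = json.dumps(records, indent=3)
print(json_text)

# Reading the JSON back gives us the same records to work with.
top_scorer = max(json.loads(json_text), key=lambda r: int(r["Goals"]))
print(top_scorer["First Name"], top_scorer["Last Name"])
```

CSV has no notion of data types, so every value is read as text. This is why the goals appear in quotation marks in the JSON, and why the sketch converts them with int before comparing them.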

If you wish to explore other structured data formats, we recommend reading the Open Data Handbook.

Metadata

Metadata is the data about the data: a summary of the basic information regarding a given dataset. Every dataset should have an accompanying metadata document. It should detail the data source and the collection methodology, as well as field descriptions.

Metadata is necessary for understanding the scope of the data and its limitations, for decoding and verifying the contents of the data, and for understanding how to interpret the data. It should also provide the definitions of any categorical data in the dataset. For example, if the data contains categories of high, medium or low, then the metadata needs to define these parameters so that users of the data can understand exactly what high, medium and low mean in terms of that dataset.
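
To make this concrete, here is a sketch of how a very small metadata document could be represented in Python. Everything in it is hypothetical: the source, the field names and the category definitions are invented purely for illustration and do not describe a real dataset.

```python
# A hypothetical metadata document for an imaginary dataset.
# Every name and value below is invented purely for illustration.
metadata = {
    "source": "Example Municipality household survey",
    "collection_methodology": "Door-to-door questionnaire",
    "fields": {
        "household_size": {
            "description": "Number of people living in the household",
        },
        "income_band": {
            "description": "Monthly household income category",
            "categories": {
                "low": "below 5000",
                "medium": "5000 to 20000",
                "high": "above 20000",
            },
        },
    },
}


def explain(field, value):
    """Look up what a categorical value means according to the metadata."""
    return metadata["fields"][field]["categories"][value]


print(explain("income_band", "medium"))
```

Without the category definitions, a user of the dataset would have no way of knowing where "low" ends and "medium" begins.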

Not all data downloaded online or requested from government, for example, is delivered with the metadata document. Thus, when you request a dataset from an official source, always ensure you request the metadata to accompany the data.

The three types of metadata are: descriptive, structural, and administrative.

Descriptive metadata is used for discovery and identification. This includes attributes such as title, author, abstract, or keywords.

Structural metadata indicates how compound objects are put together, for example, how pages are ordered to form chapters. It describes the types, versions, relationships and other characteristics of digital materials.

Administrative metadata provides information to help manage a resource, such as when and how it was created, file type and other technical information, and who can access it. Note: there are several subsets of administrative metadata.

Source: NISO. (2004) Understanding Metadata. Bethesda, MD: NISO Press, p.1

View an example of what a detailed metadata document looks like.

Test yourself!

Before we wrap up this module, let's try to identify a few data types. Using information about yourself as a subject, assess whether the following data fields are examples of quantitative or qualitative data:

Name
ID number
Gender
Age
Education level
Employment sector
Nationality
Income
Date of birth
Height
Weight

Data Value	Data Type
Name	Qualitative
ID number	Qualitative
Gender	Qualitative
Age	Quantitative
Age Group (e.g. 35-45)	Qualitative, Categorical
Education level	Qualitative, Categorical
Employment Sector	Qualitative, Categorical
Income	Quantitative, Discrete
Population Count	Quantitative, Discrete
Height	Quantitative, Continuous

Of the above values that could be assigned to any person in a dataset:

Your name and your gender are qualitative data. They describe something about yourself.
An ID Number is interesting because although it is captured as numeric data, summing up a group of people's ID numbers does not make logical sense. Thus ID Number is also classified as being qualitative data.
A numeric value assigned to someone's age is certainly descriptive, however this is quantitative data. It would be interesting and relevant to understand the average age of people falling into a given category.
Age group, however, is definitely qualitative data, i.e. when age is captured as a group such as "35 - 45 years old". Age group data like this is also categorical because we know that a person falls into a given age group category.
The same applies to education level and employment sector. These are both qualitative and categorical data values.
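A person's income is quantitative data, and although it may contain two decimal places, it is discrete because it is counted in whole units of the smallest denomination of the currency (for example, cents).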
Population count data or a count of a number of people within a given area, for example, is also quantitative and discrete.
A person's height is quantitative, and is also continuous. A very fine measure of height could be 170.182645 cm.
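
The answers above can also be captured as a simple lookup, which lets us check whether an operation such as averaging makes sense for a field. This is a minimal sketch: the classifications are copied from the answer table, while the function name is our own.

```python
# Classification copied from the answer table above.
DATA_TYPES = {
    "Name": "Qualitative",
    "ID number": "Qualitative",
    "Gender": "Qualitative",
    "Age": "Quantitative",
    "Age Group": "Qualitative, Categorical",
    "Education level": "Qualitative, Categorical",
    "Employment Sector": "Qualitative, Categorical",
    "Income": "Quantitative, Discrete",
    "Population Count": "Quantitative, Discrete",
    "Height": "Quantitative, Continuous",
}


def can_be_averaged(field):
    """Only quantitative fields give a meaningful sum or average."""
    return DATA_TYPES[field].startswith("Quantitative")


print(can_be_averaged("Age"))
print(can_be_averaged("ID number"))
```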

Credit

This curriculum has been developed by OpenUp in collaboration with School of Data.