Codebooks
In short, a codebook is a document that describes the content, structure, and variables of a dataset. We normally consult the codebook to understand a dataset.
At a minimum, a codebook contains:
- Introduction: Previous considerations, definitions, etc.
- Description of the variables.
And, at a minimum, a dataset contains:
- Units of analysis and codes.
- Relevant variables.
- Other variables.
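To make the two structures concrete, here is a minimal sketch in R. Everything in it is hypothetical: the unit codes, variable names, values and descriptions are invented for illustration and do not come from any real dataset.

```r
# Toy dataset (hypothetical values): one row per unit of analysis,
# here a country-year.
toy <- data.frame(
  ccode = c("AAA", "AAA", "BBB", "BBB"),  # unit code
  year  = c(2000, 2001, 2000, 2001),      # second part of the unit of analysis
  democ = c(1, 1, 0, 0),                  # a relevant variable
  src   = c("x", "x", "y", "y"),          # an "other" variable
  stringsAsFactors = FALSE
)

# Toy codebook: one row per variable in the dataset.
codebook <- data.frame(
  variable    = c("ccode", "year", "democ", "src"),
  description = c("Country code",
                  "Year of observation",
                  "1 = democracy, 0 = dictatorship",
                  "Identifier of the data source"),
  stringsAsFactors = FALSE
)

# Variables that appear in the data but are not documented in the codebook.
undocumented <- setdiff(names(toy), codebook$variable)
print(undocumented)
```

An empty result from the last line means that every column of the dataset has an entry in the codebook, which is a quick sanity check before starting any analysis.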
How to use codebooks (Spanish | Catalan)
Democracy and Dictatorship dataset
A simple example of a dataset is the Democracy and Dictatorship (DD) dataset, also known as the Cheibub-Gandhi-Vreeland (CGV) index (Cheibub, Gandhi, and Vreeland 2010). This dataset can be found on the personal webpage of one of the authors, José Antonio Cheibub.
Using Excel
Since the DD dataset is relatively small (9,159 observations and 78 variables), a good choice for learning is to open it in Excel or Google Sheets. The video attached to this page shows how to explore the data via Google Sheets dynamic tables. The steps are the following:
- Download the Excel file from the author’s webpage.
- Upload the Excel file to Google Sheets.
- Select all the rows and columns.
- Go to Data -> Create a filter.
Using R
R requires more sophisticated data-wrangling skills, but this type of software is essential for large datasets (the DD dataset has more than 9,000 observations). The following lines of code show how to load the DD dataset into R and display its first six rows. When we load a dataset into R, we must convert it to an R object, normally a dataframe. Table 1 shows the result of applying the function head() to the dd dataframe.
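# foreign provides read.dta() for Stata files; dplyr re-exports as_tibble()
library(foreign)
library(dplyr)
# Read the Stata file from the URL and convert it to a tibble
dd <- as_tibble(read.dta("https://uofi.box.com/shared/static/bba3968d7c3397c024ec.dta"))
# Display the first six rows
head(dd)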
order | ctryname | year | aclpcode | cowcode | cowcode2 | ccdcodelet | ccdcodenum | aclpyear | cowcode2year | cowcodeyear | chgterr | ychgterr | flagc_cowcode2 | flage_cowcode2 | entryy | exity | cid | wdicode | imf_code | politycode | bankscode | dpicode | uncode | un_region | un_region_name | un_continent | un_continent_name | aclp_region | bornyear | endyear | dupcow | dupwdi | dupun | dupdpi | dupimf | dupbanks | exselec | legselec | closed | dejure | defacto | defacto2 | lparty | incumb | type2 | collect | nheads | nmil | nhead | npost | ndate | eheads | ageeh | emil | royal | headdiff | ehead | epost | edate | tenure08 | comm | ecens08 | edeath | flageh | democracy | assconfid | poppreselec | regime | tt | ttd | tta | flagc | flagdem | flagreg | agedem | agereg | stra |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Afghanistan | 1946 | 142 | 700 | 700 | AFG | 1 | 1421946 | 7001946 | 7001946 | 0 | 0 | 1 | 0 | 1946 | 2008 | 700 | AFG | 512 | 700 | 10 | AFG | 4 | 34 | Southern Asia | 142 | Asia | 9 | 1919 | 2008 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | Mohammad Zahir Shah | king | 11.08.33 | 0 | 14 | 0 | 1 | 0 | Mohammad Zahir Shah | king | 11.08.33 | 20 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 1 | 1 | 1 | 18 | 18 | 0 |
2 | Afghanistan | 1947 | 142 | 700 | 700 | AFG | 1 | 1421947 | 7001947 | 7001947 | 0 | 0 | 0 | 0 | 1946 | 2008 | 700 | AFG | 512 | 700 | 10 | AFG | 4 | 34 | Southern Asia | 142 | Asia | 9 | 1919 | 2008 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Mohammad Zahir Shah | king |  | 0 | 15 | 0 | 1 | 0 | Mohammad Zahir Shah | king |  | 20 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 19 | 19 | 0 |
3 | Afghanistan | 1948 | 142 | 700 | 700 | AFG | 1 | 1421948 | 7001948 | 7001948 | 0 | 0 | 0 | 0 | 1946 | 2008 | 700 | AFG | 512 | 700 | 10 | AFG | 4 | 34 | Southern Asia | 142 | Asia | 9 | 1919 | 2008 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Mohammad Zahir Shah | king |  | 0 | 16 | 0 | 1 | 0 | Mohammad Zahir Shah | king |  | 20 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 20 | 20 | 0 |
4 | Afghanistan | 1949 | 142 | 700 | 700 | AFG | 1 | 1421949 | 7001949 | 7001949 | 0 | 0 | 0 | 0 | 1946 | 2008 | 700 | AFG | 512 | 700 | 10 | AFG | 4 | 34 | Southern Asia | 142 | Asia | 9 | 1919 | 2008 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Mohammad Zahir Shah | king |  | 0 | 17 | 0 | 1 | 0 | Mohammad Zahir Shah | king |  | 20 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 21 | 21 | 0 |
5 | Afghanistan | 1950 | 142 | 700 | 700 | AFG | 1 | 1421950 | 7001950 | 7001950 | 0 | 0 | 0 | 0 | 1946 | 2008 | 700 | AFG | 512 | 700 | 10 | AFG | 4 | 34 | Southern Asia | 142 | Asia | 9 | 1919 | 2008 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Mohammad Zahir Shah | king |  | 0 | 18 | 0 | 1 | 0 | Mohammad Zahir Shah | king |  | 20 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 22 | 22 | 0 |
6 | Afghanistan | 1951 | 142 | 700 | 700 | AFG | 1 | 1421951 | 7001951 | 7001951 | 0 | 0 | 0 | 0 | 1946 | 2008 | 700 | AFG | 512 | 700 | 10 | AFG | 4 | 34 | Southern Asia | 142 | Asia | 9 | 1919 | 2008 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Mohammad Zahir Shah | king |  | 0 | 19 | 0 | 1 | 0 | Mohammad Zahir Shah | king |  | 20 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 23 | 23 | 0 |
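With 78 columns, the printed output is hard to read, so it often helps to keep only the variables of interest. The following is a minimal sketch in base R. It assumes a small stand-in dataframe, dd_toy (a hypothetical name), built by hand from a few columns and values of the table above, so that it runs without downloading the full file.

```r
# Hand-built stand-in for dd, using a few columns and values from the
# six rows shown in the table (dd_toy is a hypothetical name).
dd_toy <- data.frame(
  ctryname  = rep("Afghanistan", 6),
  year      = 1946:1951,
  democracy = rep(0, 6),
  regime    = rep(5, 6),
  agereg    = 18:23,
  stringsAsFactors = FALSE
)

# Keep only the columns we want to inspect.
keep <- c("ctryname", "year", "democracy", "regime")
dd_small <- dd_toy[, keep]

print(dd_small)
```

With dplyr loaded, the same selection on the real data could be written as select(dd, ctryname, year, democracy, regime). The meaning of each code, such as the values of regime, should be looked up in the DD codebook rather than guessed.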
These are the most common functions for exploring a dataframe, all applied here to the dd dataframe:
- head(dd): Displays the first 6 rows.
- tail(dd): Displays the last 6 rows.
- dim(dd): Displays the number of rows and columns.
- glimpse(dd): Displays all the columns and the first observations of the dataframe in a vertical format.
- View(dd): Opens a spreadsheet-style viewer similar to Excel.
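The exploration functions can be tried without downloading anything. The sketch below uses a small invented dataframe, explore_toy (a hypothetical name with made-up values), and base R only. It uses str() as a rough base-R counterpart of glimpse(), and leaves out View() because it needs an interactive session.

```r
# Invented data: two hypothetical countries observed over four years.
explore_toy <- data.frame(
  ctryname  = rep(c("Country A", "Country B"), each = 4),
  year      = rep(1946:1949, times = 2),
  democracy = c(0, 0, 1, 1, 0, 0, 0, 0),
  stringsAsFactors = FALSE
)

first_rows <- head(explore_toy, 3)  # first 3 rows (the default is 6)
last_rows  <- tail(explore_toy, 3)  # last 3 rows
size       <- dim(explore_toy)      # number of rows, then number of columns

print(first_rows)
print(last_rows)
print(size)

# Compact vertical overview: one line per column with its type
# and first values.
str(explore_toy)
```

Both head() and tail() accept a second argument with the number of rows to display, which is handy when six rows are too many or too few.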
Other codebooks
Other codebooks you might want to explore:
- UCDP Dyadic Dataset Codebook Version 20.1 (Pettersson 2020): The UCDP codebooks are usually clean and easy to follow. See more on the UCDP webpage.
- COW Trade Data v4.0 (Barbieri and Keshk 2016): It contains sections on the procedures and on missing data. This is a notable example of how creating a dataset can itself lead to a paper with substantive findings. In this case, the authors realized that missing trade data was associated not only with the level of development of the reporting country, but also with the type of relations within a dyad. Two countries in conflict are less likely to report trade data between them than countries with peaceful relations (Barbieri, Keshk, and Pollins 2009).
- Intergovernmental Organizations v3: The paper associated with this dataset is a good example of the new evidence, in the form of graphs and charts, that a new dataset can provide (Pevehouse et al. 2019).