Data Analysis and Visualisation in Python: Glossary

Key Points

Before we start
  • Python is an open source and platform independent programming language.

  • Jupyter Notebook and the Spyder IDE are great tools to code in and interact with Python. With the large Python community it is easy to find help on the internet.

Short Introduction to Programming in Python
  • Python is an interpreted language which can be used interactively (executing one command at a time) or in scripting mode (executing a series of commands saved in a file).

  • One can assign a value to a variable in Python. Those variables can be of several types, such as string, integer, floating point and complex numbers.

  • Lists and tuples are similar in that they are ordered sequences of elements; they differ in that a tuple is immutable (cannot be changed after it is created).

  • Dictionaries are data structures that provide mappings between keys and values.
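
The points above can be sketched in a few lines. This is a minimal illustration only; all variable names and values are invented for the example.

```python
# Variables can hold values of several built-in types
name = "Python"      # string
count = 3            # integer
ratio = 0.5          # floating point number
z = 2 + 3j           # complex number

# A list is mutable: an element can be replaced in place
numbers = [1, 2, 3]
numbers[0] = 10

# A tuple is immutable: item assignment raises a TypeError
point = (1, 2, 3)
try:
    point[0] = 10
    tuple_changed = True
except TypeError:
    tuple_changed = False

# A dictionary maps keys to values; new pairs can be added
translation = {"one": "first", "two": "second"}
translation["three"] = "third"

print(type(name).__name__, type(count).__name__,
      type(ratio).__name__, type(z).__name__)
print(numbers, tuple_changed, translation["three"])
```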

Starting With Data
  • Libraries enable us to extend the functionality of Python.

  • Pandas is a popular library for working with data.

  • A DataFrame is a Pandas data structure that allows one to access data by column (name or index) or row.

  • Aggregating data using the groupby() function enables you to generate useful summaries of data quickly.

  • Plots can be created from DataFrames or subsets of data that have been generated with groupby().
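
As a rough sketch of column access and aggregation, the example below builds a small DataFrame by hand and summarises it with groupby(). The column names and values are assumptions made up for illustration, not data from the lesson.

```python
import pandas as pd

# Hypothetical survey records, invented for this example
surveys = pd.DataFrame({
    "species_id": ["DM", "DM", "PF", "PF", "PF"],
    "weight": [40.0, 44.0, 7.0, 8.0, 9.0],
})

# Access a single column by its name
weights = surveys["weight"]

# Group rows by species and compute the mean weight of each group
mean_weight = surveys.groupby("species_id")["weight"].mean()
print(mean_weight)
```

If matplotlib is installed, a summary like this could then be drawn with something like mean_weight.plot(kind="bar").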

Indexing, Slicing and Subsetting DataFrames in Python
  • In Python, portions of data can be accessed using indices, slices, column headings, and condition-based subsetting.

  • Python uses 0-based indexing, in which the first element in a list, tuple or any other sequential data structure has an index of 0.

  • Pandas enables common data exploration steps such as data indexing, slicing and conditional subsetting.
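
The access patterns listed above might look as follows on a small, made-up DataFrame. Note that position-based slices with iloc exclude the stop position, whereas label-based slices with loc include the end label.

```python
import pandas as pd

# Hypothetical data, invented for this example
surveys = pd.DataFrame({
    "species_id": ["DM", "PF", "DM", "NL"],
    "year": [1977, 1977, 2002, 2002],
    "weight": [40.0, 7.0, 44.0, 180.0],
})

# Position 0 is the first row (0-based indexing)
first_row = surveys.iloc[0]

# Position-based slice: the stop position is excluded (rows 0 and 1)
first_two = surveys.iloc[0:2]

# Label-based slice: the end label is included (rows 0, 1 and 2)
by_label = surveys.loc[0:2, ["species_id", "weight"]]

# Condition-based subsetting: keep only the rows from 2002
recent = surveys[surveys["year"] == 2002]

print(first_two)
print(by_label)
print(recent)
```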

Data Types and Formats
  • Pandas uses different names for data types than Python does, for example object for textual data.

  • A column in a DataFrame can only have one data type.

  • The data type of a single DataFrame column can be checked using its dtype attribute.

  • Make conscious decisions about how to manage missing data.

  • A DataFrame can be saved to a CSV file using the to_csv method.
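
A minimal sketch of checking data types, handling missing values in two different ways, and writing CSV output is given below. The data are invented, and the CSV is written to an in-memory buffer rather than a file so that the example leaves nothing behind; passing a file name to to_csv would write to disk instead.

```python
import io

import pandas as pd

# Hypothetical data with one missing weight
surveys = pd.DataFrame({
    "species_id": ["DM", "PF", "DM"],
    "weight": [40.0, None, 44.0],
})

# Each column has a single data type; the name reported for text
# columns can depend on the installed pandas version
print(surveys["species_id"].dtype)
weight_dtype = surveys["weight"].dtype
print(weight_dtype)

# Count the missing values before deciding how to handle them
n_missing = surveys["weight"].isnull().sum()

# Option 1: fill missing weights with the mean of the known weights
filled = surveys["weight"].fillna(surveys["weight"].mean())

# Option 2: drop incomplete rows, then write the rest as CSV text
buffer = io.StringIO()
surveys.dropna().to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Filling and dropping lead to different results, which is why the choice should be a deliberate one.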

Combining DataFrames with Pandas
  • Pandas’ merge and concat can be used to combine subsets of a DataFrame, or even data from different files.

  • The join function combines DataFrames based on index or column.

  • Joining two DataFrames can be done in multiple ways (such as left, right, inner and outer) depending on what data must be in the final DataFrame.

  • to_csv can be used to write out DataFrames in CSV format.
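
To illustrate how the join type changes the result, the sketch below merges two small, invented tables on a shared species_id column and then stacks two subsets with concat. The table contents are assumptions for the example only.

```python
import pandas as pd

# Hypothetical tables; "XX" has no match in the species table
surveys = pd.DataFrame({
    "record_id": [1, 2, 3],
    "species_id": ["DM", "PF", "XX"],
})
species = pd.DataFrame({
    "species_id": ["DM", "PF", "NL"],
    "genus": ["Dipodomys", "Perognathus", "Neotoma"],
})

# Inner join: keep only the keys present in both tables
inner = pd.merge(surveys, species, on="species_id", how="inner")

# Left join: keep every survey row; unmatched genus values become NaN
left = pd.merge(surveys, species, on="species_id", how="left")

# concat stacks DataFrames with the same columns on top of each other
stacked = pd.concat([surveys.iloc[:2], surveys.iloc[2:]], ignore_index=True)

print(inner)
print(left)
print(stacked)
```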

Data Workflows and Automation
  • Loops help automate repetitive tasks over sets of items.

  • Loops combined with functions provide a way to process data more efficiently than we could by hand.

  • Conditional statements enable execution of different operations on different data.

  • Functions enable code reuse.
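
A small sketch that combines a loop, a function and a conditional statement is shown below. The function name, the threshold and the data are hypothetical choices made for illustration.

```python
def classify_weight(weight, threshold=20.0):
    """Return 'small' or 'large' relative to an illustrative threshold."""
    if weight < threshold:
        return "small"
    else:
        return "large"


# Hypothetical weights recorded in two different years
weights_by_year = {1977: [40.0, 7.0], 2002: [44.0, 180.0, 8.0]}

# Loop over the years and reuse the same function for every value
large_counts = {}
for year, weights in weights_by_year.items():
    labels = [classify_weight(w) for w in weights]
    large_counts[year] = labels.count("large")

print(large_counts)
```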

Making Plots With plotnine
  • The data, aes variables and a geometry are the main elements of a plotnine graph.

  • With the + operator, additional scale_*, theme_*, xlab/ylab and facet_* elements are added.

Data Ingest and Visualization - Matplotlib and Pandas
  • Matplotlib is the engine behind plotnine and Pandas plots.

  • The object-based nature of matplotlib plots enables their detailed customization after they have been created.

  • Export plots to a file using the savefig method.
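
The object-based workflow could be sketched as follows, assuming matplotlib is installed. The data, labels and file name are invented; the non-interactive Agg backend and the temporary directory are choices made here so that the example runs without a display and does not clutter the working directory.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt

# Create the figure and axes objects, then draw on the axes
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [4, 1, 9])

# Customize the plot after it has been created
ax.set_xlabel("x value")
ax.set_ylabel("y value")
ax.set_title("Illustrative line plot")

# Export the figure to a file with savefig
out_path = os.path.join(tempfile.mkdtemp(), "example_plot.png")
fig.savefig(out_path)
plt.close(fig)

print(out_path)
```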

Accessing SQLite Databases Using Python and Pandas
  • sqlite3 provides a SQL-like interface to read, query, and write SQL databases from Python.

  • sqlite3 can be used with Pandas to read SQL data to the familiar Pandas DataFrame.

  • Pandas and sqlite3 can also be used to transfer between the CSV and SQL formats.
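
A minimal round trip between Pandas and SQLite might look like the sketch below. It uses an in-memory database and an invented table so that nothing is written to disk; a real workflow would pass a database file name to sqlite3.connect.

```python
import sqlite3

import pandas as pd

# In-memory database used only for this example
con = sqlite3.connect(":memory:")

# Hypothetical data written from a DataFrame into a SQL table
surveys = pd.DataFrame({
    "record_id": [1, 2, 3],
    "species_id": ["DM", "PF", "DM"],
    "weight": [40.0, 7.0, 44.0],
})
surveys.to_sql("surveys", con, index=False)

# Query the table and read the result back into a DataFrame
heavy = pd.read_sql_query(
    "SELECT record_id, weight FROM surveys WHERE weight > 20", con
)
con.close()

print(heavy)
```

Combining read_csv with to_sql moves data from CSV into SQL, while read_sql_query followed by to_csv goes the other way.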

Glossary

0-based indexing
is a way of assigning indices to elements in a sequential, ordered data structure starting from 0, i.e. where the first element of the sequence has index 0.
CSV (file)
is an acronym for Comma-Separated Values. A CSV file stores tabular data (numbers, strings, or a combination of the two) in plain text, with columns separated by commas and rows separated by line breaks.
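
As a small illustration of this layout, the sketch below parses a made-up piece of CSV text with Python's built-in csv module. Every value is read as a string, so numbers must be converted explicitly.

```python
import csv
import io

# Invented CSV text: one header row followed by two records
csv_text = "species_id,weight\nDM,40\nPF,7\n"

# csv.reader splits each line into a list of column values
rows = list(csv.reader(io.StringIO(csv_text)))
header, records = rows[0], rows[1:]

print(header)
print(records)
```
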
database
is an organized collection of data.
dataframe
is a two-dimensional labeled data structure with columns of (potentially) different type.
data structure
is a particular way of organizing data in memory.
data type
is a particular kind of item that can be assigned to a variable, defined by the values it can take, the programming language in use and the operations that can be performed on it.
dictionary
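is a Python data structure designed to contain key-value pairs. Keys must be hashable (for example integers, floats, strings or tuples), while values can be objects of any type. Elements of a dictionary can be accessed by their key and can be modified. Since Python 3.7 a dictionary preserves the order in which keys were inserted, but its elements are accessed by key rather than by position.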
docstring
is an optional documentation string to describe what a Python function does.
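
For instance, a docstring is written as the first statement of a function body and can be read back through the __doc__ attribute or with help(). The function below is a hypothetical example.

```python
def to_kilograms(grams):
    """Convert a mass in grams to kilograms."""
    return grams / 1000


# The docstring is stored on the function object itself
doc = to_kilograms.__doc__
print(doc)
```
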
faceting
is the act of plotting relationships between set variables in multiple subsets of the data with the results appearing as different panels in the same figure.
float
is a Python data type designed to store positive and negative decimal numbers by means of a floating point representation.
function
is a group of related statements that perform a specific task.
integer
is a Python data type designed to store positive and negative integer numbers.
interactive mode
is an online mode of operation in which the user writes commands directly on the command line one by one and executes them immediately by pressing a key on the keyboard, usually Return.
join key
is a variable or an array representing the column names over which pandas.DataFrame.join() merges together columns of different data sets.
library
is a set of functions and methods grouped together to perform a specific kind of task.
list
is a Python data structure designed to contain sequences of integers, floats, strings, other objects, or any combination of these. The sequence is ordered and indexed by integers, starting from 0. Elements of a list can be accessed by their index and can be modified.
loop
is a sequence of instructions that is continually repeated until a condition is satisfied.
NaN
is an acronym for Not-a-Number and indicates either that a value is missing or that a calculation cannot produce a meaningful result.
None
is an object that represents no value.
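
The difference between NaN and None can be sketched in plain Python. NaN is a floating point value that is not equal to anything, including itself, so it is detected with math.isnan; None is a separate object that is tested with the is operator.

```python
import math

nan = float("nan")

# NaN never compares equal, not even to itself
nan_equals_itself = (nan == nan)

# Use math.isnan to detect it instead
nan_detected = math.isnan(nan)

# Arithmetic involving NaN produces NaN again
nan_sum = nan + 1.0

# None represents the absence of a value and is tested with "is"
nothing = None
none_detected = nothing is None

print(nan_equals_itself, nan_detected, math.isnan(nan_sum), none_detected)
```
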
scripting mode
is an offline mode of operation in which the user writes the commands to be executed in a text file (with the .py extension for Python), which is then compiled or interpreted to run the program. Note that the standard Python interpreter compiles a script to bytecode at run time and then interprets it; the bytecode of imported modules is cached to speed up later runs.
sequential (data structure)
is an ordered group of objects stored in memory which can be accessed specifying their index, i.e. their position, in the structure.
SQL
or Structured Query Language, is a domain-specific language for managing data stored in a relational database management system (RDBMS).
SQLite
is a self-contained, public domain SQL database engine.
string
is a Python data type designed to store sequences of characters.
tuple
is a Python data structure designed to contain sequences of integers, floats, strings, other objects, or any combination of these. The sequence is ordered and indexed by integers, starting from 0. Elements of a tuple can be accessed by their index but cannot be modified.