Conda package manager#

Content in this lesson has been inspired by and adapted from a number of sources.

Introduction#

Conda is an open source package management and environment management system that runs on multiple operating systems (Windows, Linux, macOS). Its features include:

  • Conda quickly installs, runs and updates packages and their dependencies.

  • Conda easily creates, saves, loads and switches between environments on your local computer.

  • It was created for Python programs, but it can package and distribute software for any language.

Conda is a tool that helps find and install packages, but also lets you manage different software environments where you can install different configurations of packages. For example, this enables you to install different versions of Python in two separate environments without creating incompatibilities in either of those projects.

Conda, Miniconda and Anaconda

It’s common to be confused when confronted with Conda, Miniconda and Anaconda. Conda is specifically the package and environment manager tool itself. Miniconda is a distribution of Python that includes the Conda package manager and a few other core packages. Anaconda is another distribution of Python that includes the Conda package manager but also includes a number of widely used Python packages and other Anaconda features such as the Anaconda Navigator.

Figure: Miniconda versus Anaconda. Reproduced from Introduction to Conda for Data Scientists (https://carpentries-incubator.github.io/introduction-to-conda-for-data-scientists/01-getting-started-with-conda/index.html).

Conda is widely used across scientific computing and data science due to its well-populated package ecosystem and environment management capabilities.

  • Conda installs prebuilt packages, which allows complicated packages to be installed in one step because someone else has already built the tool with the right compilers and libraries.

  • The cross-platform nature of Conda makes it easier for users to share environments. This helps researchers share their computational environment alongside their data and analysis, improving the reproducibility of their research.

  • Conda also provides access to widely used machine learning and data science libraries such as TensorFlow, SciPy and NumPy. These are available as pre-configured, hardware-specific packages (such as GPU-enabled TensorFlow), allowing code to be as performant as possible.

Installing Conda#

On ARC#

We provide a module for Anaconda, so you don’t need to install it yourself, but you do need to take additional steps to stop it filling up your home directory when you create additional environments.

Using conda on ARC

You can also install Miniconda in /nobackup if you prefer.
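
A common way to keep environments and the package cache out of a home directory is to point Conda’s envs_dirs and pkgs_dirs settings at another location. The sketch below only writes an example configuration file for inspection. The /nobackup/$USER/conda layout and the file name condarc-example are assumptions made for illustration, and the ARC guide referenced above remains the authoritative set of steps.

```shell
# Assumed layout: a per-user directory under /nobackup (adjust to your system).
conda_root="/nobackup/${USER:-your-username}/conda"

# Write an example configuration file. Nothing is applied to Conda here;
# the settings take effect only once copied into your own ~/.condarc.
cat > condarc-example <<EOF
envs_dirs:
  - ${conda_root}/envs
pkgs_dirs:
  - ${conda_root}/pkgs
EOF

# Show the result so it can be checked before use.
cat condarc-example
```

With these two settings in place, new environments and downloaded packages are stored under the chosen directory rather than in the home directory.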

On another system#

You can install Conda from a number of sources, including the Miniconda and Anaconda distributions described above.

Conda is cross-platform, so all these distributions have installers for Windows, macOS and Linux. For Miniconda, visit the Miniconda page on the Conda website, select the installer corresponding to your operating system and run the downloaded file on your machine. You may be prompted to select various settings during installation; we recommend leaving these at their defaults if you’re unsure.

If you have questions or issues installing Conda locally please get in touch via the Research Computing Contact form.

Conda environments#

As well as managing packages, Conda also allows you to create and manage environments. A Conda environment is a directory that contains a specific set of installed packages and tools. This allows you to separate the dependencies of different projects cleanly. For example, you can use Python 3.7 in one Conda environment to reproduce a collaborator’s results but use Python 3.10 in your own projects without any hassle. Conda makes it easy to switch between different environments and allows you to create and delete them as required. Conda environments also make it easier to share an environment setup between machines and with collaborators, as environments can be exported to a text file.

The base environment

By default Conda includes the base environment. This contains a starting installation of Python and the dependencies of the Conda tool itself. Therefore, it’s best practice to not install packages into the base environment and create your own environments into which you install the tools you need.
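
One way to follow this practice is to check which environment is active before installing anything. The sketch below assumes a POSIX-like shell and relies on the CONDA_DEFAULT_ENV variable, which Conda sets to the name of the active environment. The function name warn_if_base is purely illustrative and is not part of Conda.

```shell
# Hypothetical helper: warn when the active Conda environment is "base".
# CONDA_DEFAULT_ENV holds the name of the currently active environment.
warn_if_base() {
    if [ "${CONDA_DEFAULT_ENV:-}" = "base" ]; then
        echo "Warning: base is active; create and activate a project environment first." >&2
        return 1
    fi
    return 0
}
```

You could then chain it in front of an install, for example warn_if_base && conda install PACKAGE, so the install only runs when an environment other than base is active.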

Creating environments#

You can create an environment with Conda with the subcommand conda create. When creating an environment we need to give it a name; we recommend giving it a name related to the project you’re going to use the environment for.

$ conda create --name py39-env python=3.9

The above command tells Conda to create a new environment called py39-env and install Python version 3.9 into it. We can specify multiple packages when creating a Conda environment by separating each package name with a space.

$ conda create --name data-sci-env pandas=1.4.2 matplotlib=3.5.1 scikit-learn

With the above command we create a new environment but don’t explicitly ask for Python to be installed. However, because the packages we’ve specified depend on Python to run, Conda will install the highest version of Python that is compatible with them.

Activating environments#

To use a Conda environment we need to activate it. Activating an environment sets up the terminal we’re using so that it can see all of the packages installed in the environment, making it ready for use.

$ conda activate data-sci-env

(data-sci-env)$

You use the subcommand conda activate ENVNAME for environment activation, where ENVNAME is the name of the environment you wish to activate. You can see it has successfully activated when it returns your prompt with the environment name prepended in brackets.

Deactivating environments#

You can deactivate your current environment with another simple subcommand conda deactivate.

(data-sci-env)$ conda deactivate

Listing current environments#

If you ever want to see the list of environments on your machine you can use the subcommand conda env list. This returns a list of the available Conda environments and the location of each environment in your filesystem.

$ conda env list
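
The listing can also be checked from a script, for instance to test whether an environment already exists before creating it. The sketch below assumes the usual layout of the listing, in which comment lines start with # and the environment name is the first column. The helper name env_exists and the sample listing are illustrative only, so the paths on your machine will differ.

```shell
# Hypothetical helper: succeed if an environment named $1 appears in
# `conda env list` output supplied on standard input.
env_exists() {
    # Skip comment lines (starting with "#"); the first column is the name.
    awk -v name="$1" '$0 !~ /^#/ && $1 == name { found = 1 } END { exit !found }'
}

# Illustrative listing only; it stands in for real `conda env list` output.
sample_listing='# conda environments:
#
base                  *  /home/user/miniconda3
data-sci-env             /home/user/miniconda3/envs/data-sci-env'

if printf '%s\n' "$sample_listing" | env_exists data-sci-env; then
    echo "data-sci-env exists"
fi
```

Against a real installation the same helper would be used as conda env list | env_exists data-sci-env.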

Removing a Conda environment#

It is also possible to delete a Conda environment through the remove subcommand. This command is outlined below in relation to removing specific packages but can also be used to delete an entire Conda environment.

To remove the py39-env we created earlier we use the command:

$ conda remove --name py39-env --all

Conda checks for user confirmation that we wish to proceed and outlines for us exactly which packages are being removed. On proceeding with removing the environment all associated environment files and packages are deleted.

Important

Using conda remove to delete an environment is irreversible. You cannot undo deletion of an environment to the exact state it was in before deletion. However, if you have exported details of your environment it is possible to recreate it.

Sharing Conda environments#

If you need to share a Conda environment with others or between machines, it’s possible to use Conda to export a file containing a specification of the packages installed in that environment. With this environment file and Conda installed on another device, it’s possible to recreate the environment with the same specification.

Let’s assume we want to share our data-sci-env Conda environment with others. To do this we first need to create the environment.yml file containing our environment specification. You can create a very detailed specification that includes operating system specific hashes with the command:

$ conda activate data-sci-env

(data-sci-env)$ conda env export > environment.yml

Above, we activate the environment we want to create an environment.yml file from and then use the command conda env export. This outputs the environment specification to the standard output in the terminal so to capture and write this to a file we redirect the output to environment.yml.

This command also exports a line beginning prefix: that specifies the directory location of the environment on your filesystem. This isn’t required when sharing your environment and should be removed. You can do this manually, or use grep to filter the line out when exporting your environment.

(data-sci-env)$ conda env export | grep -v ^prefix: > environment.yml
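
To see what the filter does without touching a real environment, the sketch below applies the same grep to a small hand-written sample. The file names, package list and prefix path are invented for illustration, and the sample is simplified compared with real export output.

```shell
# Hand-written stand-in for `conda env export` output (illustrative values only).
cat > sample-export.yml <<'EOF'
name: data-sci-env
channels:
  - defaults
dependencies:
  - pandas=1.4.2
  - matplotlib=3.5.1
prefix: /home/user/miniconda3/envs/data-sci-env
EOF

# Keep every line except the one that starts with "prefix:".
grep -v '^prefix:' sample-export.yml > environment-shared.yml

# Show the filtered file: everything from the sample except the prefix line.
cat environment-shared.yml
```

Quoting the pattern, as done here, stops the shell from interpreting the ^ character itself.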

We can share the environment.yml file with collaborators and/or commit the file to version control to ensure people can recreate the required Conda environment.

You can recreate a Conda environment from a file with the following command:

$ conda env create -f environment.yml

Here we tell Conda to create a new environment and use the -f option to specify that it should be built from a file containing an environment specification. We pass the path to the environment file as the argument following -f.

Creating a cross platform environment file#

As noted above, using conda env export creates a highly specific environment file. This often causes difficulties when sharing environments across operating systems, as the environment.yml contains operating system specific hashes for each package.

There are two possible methods of creating a more flexible environment.yml.

1. Using conda env export --from-history#

By default conda env export exports an environment’s entire specification, including the dependencies of the packages you explicitly installed and their associated hashes. If you use conda env export --from-history, Conda only exports the packages you explicitly requested, for example with conda create or conda install. It does not include the dependencies of those packages and therefore allows different operating systems to resolve those dependencies more flexibly.

For the above example with data-sci-env we would export a more flexible environment.yml with:

(data-sci-env)$ conda env export --from-history | grep -v ^prefix: > environment.yml

2. Manually create an environment.yml#

The other option is to write the environment.yml file by hand. This is often more fiddly than just exporting an environment but can be preferable, as it lets you make sure all the desired dependencies of your project are captured. Environment files are written in YAML, a human-readable data format, and have the standard pattern of:

name: data-sci-env
channels:
- defaults
dependencies:
- scikit-learn
- matplotlib=3.5.1
- pandas=1.4.3

Here you specify the environment name, a list of Conda channels used to install packages and, under dependencies, a list of packages to be installed. You can also include version specifications within the environment.yml, as with matplotlib=3.5.1 and pandas=1.4.3 above.
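
As a sketch of how version specifications can be written, the block below creates a hypothetical environment file from the shell. The file name environment-flexible.yml, the Python pin and the pandas range are illustrative choices, not part of the environment used in this lesson.

```shell
# Write an illustrative environment file; all versions here are example choices.
cat > environment-flexible.yml <<'EOF'
name: data-sci-env
channels:
  - defaults
dependencies:
  - python=3.10          # any 3.10.x release
  - pandas>=1.4,<2       # a range of acceptable versions
  - matplotlib=3.5.1     # pinned to 3.5.1
  - scikit-learn         # no constraint: let Conda choose
EOF

# Show the file that would be passed to conda env create -f.
cat environment-flexible.yml
```

Looser constraints such as the range on pandas give Conda more room to find compatible packages on another operating system, at the cost of less exact reproducibility.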

Understanding the differences between the ways to create environment files is important when deciding how best to share your project. You need to balance reproducibility against portability. conda env export captures the exact specification of an environment, including all installed packages, their dependencies and package hashes. Sometimes this level of detail should be included to ensure maximum reproducibility of a project, for example when looking to validate results, but it needs to be weighed against allowing people to reproduce your work on other systems.

Using Conda to install packages#

With the Conda command line tool, searching for and installing packages can be performed with the following subcommands:

  • conda search

  • conda install

Searching for packages#

$ conda search python

This command searches for packages based on the argument provided. It searches in package repositories called Conda channels, which are remote locations to which built Conda packages have been uploaded. By default Conda uses the defaults channel, which points to the Anaconda-maintained package repositories https://repo.anaconda.com/pkgs/main and https://repo.anaconda.com/pkgs/r. Other channels, such as conda-forge, are also available, and we can specify which channels to use when searching for or installing packages.

$ conda search 'python[channel=conda-forge]'

You can also search for specific version requirements with conda search:

$ conda search 'python>=3.8'

You can combine the two conditions shown above (searching a specific channel and for a specific version):

$ conda search 'python[channel=conda-forge]>=3.8'

Installing packages#

Installing packages via Conda is performed using the install subcommand with the format conda install PACKAGE, where PACKAGE is the name of the package you wish to install.

Earlier we created the data-sci-env and installed some useful data science packages. We’ve discovered we also need the statsmodels package for some extra work we want to do so we’ll look at using conda install to install this package within our existing environment.

To install packages into an existing environment we need to activate it with the subcommand shown above.

$ conda activate data-sci-env

(data-sci-env)$ conda install statsmodels

Conda will ask whether we’re happy to proceed with the installation and lists all the other packages that will be installed or updated to satisfy the requirements of our specified package. We confirm we wish to proceed by entering y and pressing Return.

This installs any packages that are not currently installed. Conda caches packages locally in case they are required by other packages; this speeds up installs but uses more disk space to maintain the cache.

Removing packages#

Another crucial aspect of managing an environment involves removing packages. Conda includes the remove subcommand for this operation, which allows you to specify a list of packages you wish to remove. You can do this within an activated environment, or specify to Conda the environment from which you want to remove packages.

When creating our data-sci-env we installed pandas=1.4.2. Let’s imagine we made a mistake here and wanted a different version. We could remove this version of pandas with the following command:

$ conda remove -n data-sci-env pandas

When removing packages, as with installing them, Conda will ask for user confirmation to proceed. Removing one package may also lead to the removal of additional packages and can cause other packages to update; the confirmation prompt lists these changes.

With these changes made we can now install a newer version of pandas using conda install.
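
A minimal sketch of that install step is shown below. It assumes the data-sci-env environment from earlier exists, and 1.5.3 is only an example of a later pandas release. The --dry-run option is used so that Conda reports what would change without modifying the environment; drop it to perform the install.

```shell
# Example target: a later pandas release (illustrative choice of version).
spec="pandas=1.5.3"

# Only attempt the install when conda is available on this machine.
if command -v conda >/dev/null 2>&1; then
    # --dry-run reports the planned changes without applying them.
    conda install --name data-sci-env --dry-run "$spec" \
        || echo "dry run failed for $spec" >&2
else
    echo "conda not found; would run: conda install --name data-sci-env $spec"
fi
```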

Updating a package#

The above example is slightly artificial, as removing a package in order to install a more recent version is a long-winded way of doing things with Conda. If we want to update a package to a more recent version, Conda provides the update subcommand. Crucially, conda update will update a package to its most recent version and can’t be used to specify a particular version.

Let’s say we wanted to update the matplotlib library to the most recent version in our data-sci-env.

$ conda activate data-sci-env

(data-sci-env)$ conda update matplotlib

When requesting to update a package Conda will also update other dependencies of the package that you wish to update, and can potentially install new packages that are required.

Summary#

Important