Authorizing GitHub OAuth with Apache Airflow
Table of Contents
- What is Apache Airflow
- GitHub Token
- GitHub OAuth
- Airflow Config
What is Apache Airflow
Apache Airflow is an open-source workflow management platform created by the community to programmatically author, schedule, and monitor workflows. Airflow has a modular architecture and uses a message queue to orchestrate an arbitrary number of workers.
Airflow is written in Python, and workflows are created via Python scripts. Airflow is designed under the principle of "configuration as code". While other "configuration as code" workflow platforms exist using markup languages like XML, using Python allows developers to import libraries and classes to help them create their workflows.
Airflow uses directed acyclic graphs (DAGs) to manage workflow orchestration. Tasks and dependencies are defined in Python and then Airflow manages the scheduling and execution. DAGs can be run either on a defined schedule (e.g. hourly or daily) or based on external event triggers (e.g. a file appearing in Hive). Previous DAG-based schedulers like Oozie and Azkaban tended to rely on multiple configuration files and file system trees to create a DAG, whereas in Airflow, DAGs can often be written in one Python file.
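To make the DAG idea concrete, here is a minimal sketch of how declared dependencies determine a run order. It uses Python's standard-library graphlib module as a stand-in and is not Airflow's own API; the task names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical tasks; each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform_a": {"extract"},
    "transform_b": {"extract"},
    "load": {"transform_a", "transform_b"},
}


def execution_order(deps):
    """Return one valid order in which every task runs after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Because the graph is acyclic, a valid order always exists. The two transform tasks do not depend on each other, so a scheduler is free to run them in either order or in parallel.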
GitHub Token
Personal access tokens (PATs) are an alternative to using passwords for authentication to GitHub when using the GitHub API or the command line.
Create a personal access token at https://github.com/settings/tokens with the read:enterprise, read:org, and user scopes.
Token Team ID
To get your team ID, call the endpoint https://api.github.com/orgs/<org-name>/teams with your token. The response is a JSON array containing one object per team.
In the object for your team, the value of the id key is your team ID.
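The lookup can be scripted. The following is a minimal sketch using only the standard library; the function names are hypothetical, and it assumes you identify your team by its slug field in the response.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def build_teams_request(org_name, token):
    """Build an authenticated GET request for the organization's team list."""
    url = f"{API_ROOT}/orgs/{org_name}/teams"
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
    return urllib.request.Request(url, headers=headers)


def find_team_id(teams, slug):
    """Return the id of the team whose slug matches, or None if absent."""
    for team in teams:
        if team.get("slug") == slug:
            return team["id"]
    return None


def fetch_team_id(org_name, token, slug):
    """Call the API (needs network access and a valid token)."""
    request = build_teams_request(org_name, token)
    with urllib.request.urlopen(request) as response:
        return find_team_id(json.load(response), slug)
```

Calling fetch_team_id("<org-name>", "<your-token>", "<team-slug>") returns the numeric ID. This sketch reads only the first page of results, so an organization with many teams would need pagination on top of it.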
GitHub OAuth
GitHub's OAuth implementation supports the standard authorization code grant type.
To authorize your OAuth app, consider which authorization flow best fits your app.
- Web application flow: used to authorize users for standard OAuth apps that run in the browser (the implicit grant type is not supported).
- Device flow: used for headless apps, such as CLI tools.
We will use the web application flow to authorize users for our Airflow instance:
- Users are redirected to request their GitHub identity.
- Users are redirected back to your site by GitHub.
- Your app accesses the API with the user's access token.
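The three steps above can be sketched in Python as follows. This is an illustration of the flow rather than the code Airflow runs internally; the function names and the default scope value are assumptions, while the two URLs are GitHub's OAuth authorize and access-token endpoints.

```python
import secrets
import urllib.parse
import urllib.request

AUTHORIZE_URL = "https://github.com/login/oauth/authorize"
TOKEN_URL = "https://github.com/login/oauth/access_token"


def build_authorize_url(client_id, redirect_uri, state, scope="read:org"):
    """Step 1: the URL the user's browser is redirected to on GitHub."""
    query = urllib.parse.urlencode({
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": scope,   # assumed scope, adjust to what your app needs
        "state": state,   # random value checked again on the callback
    })
    return f"{AUTHORIZE_URL}?{query}"


def build_token_request(client_id, client_secret, code):
    """Step 2: exchange the code GitHub sent to the callback for a token."""
    body = urllib.parse.urlencode({
        "client_id": client_id,
        "client_secret": client_secret,
        "code": code,
    }).encode("ascii")
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        headers={"Accept": "application/json"},  # ask for a JSON response
        method="POST",
    )


def api_auth_header(access_token):
    """Step 3: header to send with API calls made on the user's behalf."""
    return {"Authorization": f"Bearer {access_token}"}


# A fresh, unguessable state value protects the callback against forgery.
state = secrets.token_urlsafe(16)
```

The callback handler should compare the state it receives with the one it issued before exchanging the code.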
Create an OAuth app at https://github.com/settings/developers. Once it is registered, you will have two values:
- Client ID
- Client Secret
Airflow Config
The first time you run Airflow, it creates a file called airflow.cfg in your $AIRFLOW_HOME directory (~/airflow by default). This file contains Airflow's configuration, and you can edit it to change any of the settings. Add the values gathered so far to airflow.cfg:
[webserver]
authenticate = True
auth_backend = airflow.contrib.auth.backends.github_enterprise_auth

[github_enterprise]
api_rev = v3
host = github.com
client_id = <YOUR_CLIENT_ID>
client_secret = <YOUR_CLIENT_SECRET>
oauth_callback_route = /home
allowed_teams = <YOUR_TEAM_ID>
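Before restarting the webserver, it can help to confirm that no setting was left empty or as a placeholder. The following is a minimal sketch with hypothetical helper names; it assumes the GitHub settings sit in a [github_enterprise] section of airflow.cfg and that allowed_teams is a comma-separated list of numeric team IDs.

```python
import configparser

REQUIRED_KEYS = (
    "host",
    "client_id",
    "client_secret",
    "oauth_callback_route",
    "allowed_teams",
)


def _load(cfg_text):
    # Interpolation is disabled so a literal "%" in a value is not an error.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return parser


def missing_github_settings(cfg_text, section="github_enterprise"):
    """List required keys that are absent, empty, or still a <PLACEHOLDER>."""
    parser = _load(cfg_text)
    if not parser.has_section(section):
        return list(REQUIRED_KEYS)
    missing = []
    for key in REQUIRED_KEYS:
        value = parser.get(section, key, fallback="").strip()
        if not value or (value.startswith("<") and value.endswith(">")):
            missing.append(key)
    return missing


def allowed_team_ids(cfg_text, section="github_enterprise"):
    """Parse allowed_teams into a list of integer team IDs."""
    raw = _load(cfg_text).get(section, "allowed_teams", fallback="")
    return [int(part) for part in raw.split(",") if part.strip()]
```

For example, missing_github_settings(open(path_to_cfg).read()) returns an empty list when every required key has a real value, where path_to_cfg is the location of your airflow.cfg.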
Now when you start Airflow and open the web UI, you should be redirected to the GitHub sign-in page. After you log in, if you belong to one of the allowed teams, you are automatically redirected to your Airflow application page.
Implementing GitHub OAuth in Airflow, when your developers already use GitHub in their projects, gives you two benefits. First, developers do not need a separate login credential for Apache Airflow. Second, they can control which teams in their organization can access the Airflow application. This approach simplifies your OAuth implementation while using the tools your developers are most comfortable with.