diff --git a/.gitignore b/.gitignore index 30683a4..3e9297d 100644 --- a/.gitignore +++ b/.gitignore @@ -1,2 +1,16 @@ .DS_Store .ipynb_ckeckpoints + +## Terraform +**/.terraform/* +*.tfstate +*.tfstate.* +crash.log +crash.*.log +override.tf +override.tf.json +*_override.tf +*_override.tf.json +.terraform.tfstate.lock.info +*.terraformrc +terraform.rc diff --git a/LINUX.md b/LINUX.md index 299bf83..1b607e8 100644 --- a/LINUX.md +++ b/LINUX.md @@ -6,6 +6,42 @@ A part of the setup will be done on your **local machine** but most of the confi Please **read instructions carefully and execute all commands in the following order**. If you get stuck, don't hesitate to ask a teacher for help :raising_hand: +This setup is largely automated with **Terraform** and **Ansible**. There are three main components to the setup! **Terraform** and **Ansible** are _Infrastructure as Code_ tools. +- **Terraform** excels at creating and destroying cloud resources, like virtual machines, IP addresses, databases and more! +- **Ansible** is used to configure Linux machines with specific settings and software. Perfect for fine-tuning the Virtual Machine you will be creating! + +## Part 1: Set up your local computer + +In this section you'll set up your local computer and create some accounts. It will include things like: +1. Install some communication tools: Zoom, Slack +2. Create some accounts: GitHub, Google Cloud Platform (GCP) +3. Install Visual Studio Code (VS Code) +4. Install and authenticate the GCP command line tool: `gcloud` +5. Install **terraform** on your local computer +6. Create your virtual machine with **terraform** and connect to it with **VS Code**! + +## Part 2: Configure your Virtual Machine Part 1 + +All parts of this section happen on your virtual machine. + +This section includes: +1. Authenticate your virtual machine with `gcloud` +2. Download and run an **ansible** playbook to partially configure your virtual machine +3. 
Log in to the GitHub command line tool on your virtual machine +4. Copy the Le Wagon recommended **dotfiles**. **Dotfiles** are settings that will enhance your terminal and developer experience! + +## Part 3: Configure your Virtual Machine Part 2 + +All parts of this section happen on your virtual machine. + +In this section you will: +1. Download and run a second **ansible** playbook for some more fine-tuning +2. Test your setup to make sure that everything has installed correctly +3. Create isolated Python environments for all your challenges + + +Don't worry, we'll go into more detail in each of the individual sections. + Let's start :rocket: @@ -89,62 +125,15 @@ Have you signed up to GitHub? If not, [do it right away](https://github.com/join :point_right: **[Enable Two-Factor Authentication (2FA)](https://docs.github.com/en/authentication/securing-your-account-with-two-factor-authentication-2fa/configuring-two-factor-authentication#configuring-two-factor-authentication-using-text-messages)**. GitHub will send you text messages with a code when you try to log in. This is important for security and also will soon be required in order to contribute code on GitHub. -## SSH key +## Chrome - your browser -We want to safely communicate with your virtual machine using [SSH protocol](https://en.wikipedia.org/wiki/Secure_Shell). We need to generate a SSH key to authenticate. +Install the Google Chrome browser if you haven't got it already and set it as your __default browser__. -- Open your terminal +Follow the steps for your system from this link :point_right: [Install Google Chrome](https://support.google.com/chrome/answer/95346?co=GENIE.Platform%3DDesktop&hl=en-GB) -
- 💡 Windows tip +__Why Chrome?__ -We highly recommend installing [Windows Terminal](https://apps.microsoft.com/store/detail/windows-terminal/9N0DX20HK701?hl=fr-fr&gl=FR) from the Windows Store (installed on Windows 11 by default) to perform this operation -
- -- Create a SSH key - -
- Windows - -```bash -# replace "your_email@example.com" with your GCP account email -ssh-keygen.exe -t ed25519 -C "your_email@example.com" -``` -
- -
- MacOS & Linux - -```bash -# replace "your_email@example.com" with your GCP account email -ssh-keygen -t ed25519 -C "your_email@example.com" -``` -
- - -You should get the following message: `> Generating public/private algorithm key pair.` -- When you are prompted `> Enter a file in which to save the key`, press Enter -- You should be asked to `Enter a passphrase` - this is optional if you want additional security. To continue without a passphrase press enter without typing anything when asked to enter a passphrase. - -ℹ️ Don't worry if nothing prompt when you type, that is perfectly normal for security reasons. - -- You should be asked to `Enter same passphrase again`, do it. - -**❗️ You must remember this passphrase.** - -
- ❗️ /home/your_username/.ssh/id_ed25519 already exists. -If you receive this message, you may already have an SSH Key with the same name (if you are a Le Wagon Alumni or are using SSH Authentication with Github). - -To create a separate SSH key to exclusively use for this bootcamp use the following: - -```bash -# replace "your_email@example.com" with your GCP account email -ssh-keygen -t ed25519 -f ~/.ssh/de-bootcamp -C "your_email@example.com" -``` - -Your new SSH Key will be named `de-bootcamp`. Make sure to remember it for later! -
+We recommend using it as your default browser as it's the most compatible with testing or running your code, as well as working with Google Cloud Platform. Firefox is an alternative; however, we don't recommend using other browsers like Opera, Internet Explorer or Safari. ## Google Cloud Platform setup @@ -287,314 +276,487 @@ Go to your project [APIs dashboard](https://console.cloud.google.com/apis/dashbo - Compute Engine is now enabled on your project -## Virtual Machine (VM) +## Visual Studio Code -**👌 Note: Skip to the next section if you already have a VM set up** +### Installation -_Note: The following section requires you already have a [Google Cloud Platform](https://cloud.google.com/) account associated with an active [Billing account](https://console.cloud.google.com/billing)._ +Let's install [Visual Studio Code](https://code.visualstudio.com) text editor. -- Go to console.cloud.google.com > > Compute Engine > VM instances > Create instance -- Name it `lewagon-data-eng-vm-`, replace `` with your own, e.g. 
`krokrob` -- Region `europe-west1`, choose the closest one among the [available regions](https://cloud.google.com/compute/docs/regions-zones#available) +Copy (`Ctrl` + `C`) the commands below then paste them in your terminal (`Ctrl` + `Shift` + `v`): - gcloud-console-vm-create-instance -- In the section `Machine configuration` under the sub-heading `Machine type` -- Select General purpose > PRESET > e2-standard-4 +```bash +wget -qO- https://packages.microsoft.com/keys/microsoft.asc | gpg --dearmor > packages.microsoft.gpg +``` - gcloud-console-vm-e2-standard4 -- Boot disk > Change - - Operating system > Ubuntu - - Version > Ubuntu 22.04 LTS x86/64 - - Boot disk type > Balanced persistent disk - - Size > upgrade to 150GB +```bash +sudo install -o root -g root -m 644 packages.microsoft.gpg /etc/apt/trusted.gpg.d/ +``` - gcloud-console-vm-ubunt -- Open `Networking, Disks, ...` under `Advanced options` -- Open `Networking` +```bash +sudo sh -c 'echo "deb [arch=amd64,arm64,armhf signed-by=/etc/apt/trusted.gpg.d/packages.microsoft.gpg] https://packages.microsoft.com/repos/code stable main" > /etc/apt/sources.list.d/vscode.list' +``` - gcloud-console-vm-networking -- Go to `Network interfaces` and click on `default default (...)` with a downward arrow on the right. +```bash +rm -f packages.microsoft.gpg +``` - gcloud-console-vm-network-interfaces -- This opened a box `Edit network interface` -- Go to the dropdown `External IPv4 address`, click on it, click on `RESERVE STATIC EXTERNAL IP ADDRESS` +```bash +sudo apt update +``` - gcloud-console-vm-create-static-ip -- Give it a name, like "lewagon-data-eng-vm-ip-" (replace `` with your own) and description "Le Wagon - Data Engineering VM IP". This will take a few seconds. +```bash +sudo apt install -y code +``` - gcloud-console-reserve-static-ip +These commands will ask for your password: type it in. -- You will now have a public IP associated with your account, and later to your VM instance. 
Click on `Done` at the bottom of the section `Edit network interface` you were in. +:warning: When you type your password, nothing will show up on the screen, **that's normal**. This is a security feature to mask not only your password as a whole but also its length. Just type in your password and when you're done, press `Enter`. - gcloud-console-new-external-ip +### Launching from the terminal -### Public SSH key -- Open the `Security` section +Now let's launch VS Code from **the terminal**: - gcloud-console-vm-security -- Open the `Manage access` subsection +```bash +code +``` - gcloud-console-manage-access -- Go to `Add manually generated SSH keys` and click `Add item` +:heavy_check_mark: If a VS Code window has just opened, you're good to go :+1: - gcloud-console-add-manual-ssh-key -- In your terminal display your public SSH key: - - Windows: navigate to where you created your SSH key and open `id_ed25519.pub` +:x: Otherwise, please **contact a teacher** - - Mac/Linux users can use: - ```bash - cat ~/.ssh/id_ed25519.pub - # OR cat ~/.ssh/de-bootcamp.pub if you created a unique key - ``` -- Copy your public SSH key and paste it: - gcloud-console-add-ssh-key-pub -- On the right hand side you should see +### VS Code Remote SSH Extension - gcloud-console-vm-price-month -- You should be good to go and click `CREATE` at the bottom +We need to connect VS Code to a virtual machine in the cloud so you will only work on that machine during the bootcamp. A pretty useful [**Remote SSH Extension**](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-ssh) is available on the VS Code Marketplace. - gcloud-console-vm-create -- It will take a few minutes for your virtual machine (VM) to be created. Your instance will show up like below when ready, with a green circled tick, named `lewagon-data-eng-vm-krokrob` (`krokrob` being replaced by your GitHub username). 
+- Open VS Code > Open the [command palette](https://code.visualstudio.com/docs/getstarted/userinterface#_command-palette) > Type `Extensions: Install Extensions` - gcloud-console-vm-instance-running -- Click on your instance +VSCode extensions - Search - Remote - gcloud-console-vm-running -- Go down to the section `SSH keys`, and write down your username (you need it for the next section) +- Install the extension - gcloud-console-vm-username +VS Code extensions - Remote - Details -Congrats, your virtual machine is up and running, it is time to connect it with VS Code! +That's the only extension you should install on your _local_ machine, we will install additional VS Code extensions on your _virtual machine_. -## Visual Studio Code +## Google Cloud CLI -### Installation +The `gcloud` Command Line Interface (CLI) is used to communicate with Google Cloud Platform services through your terminal. -Let's install [Visual Studio Code](https://code.visualstudio.com) text editor. +### Install gcloud -Copy (`Ctrl` + `C`) the commands below then paste them in your terminal (`Ctrl` + `Shift` + `v`): +Add the `APT` repository and install with: ```bash -wget -qO- https://packages.microsoft.com/keys/microsoft.asc | gpg --dearmor > packages.microsoft.gpg +echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list +sudo apt-get install apt-transport-https ca-certificates gnupg +curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - +sudo apt-get update && sudo apt-get install google-cloud-sdk +sudo apt-get install google-cloud-sdk-app-engine-python ``` +To test your install, open a new terminal and run: + ```bash -sudo install -o root -g root -m 644 packages.microsoft.gpg /etc/apt/trusted.gpg.d/ +gcloud --version ``` +👉 [Install documentation 🔗](https://cloud.google.com/sdk/docs/install#deb) 
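If you'd like a quick sanity check before moving on, here is a minimal sketch that reports whether the tools installed so far are on your `PATH`. The helper name `require_cmd` is made up for this example and is not part of `gcloud` or VS Code; it only relies on the shell built-in `command -v`.

```bash
# Hypothetical helper (not part of any tool): report whether a command is on the PATH.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Tools installed so far on your local machine.
for tool in code gcloud; do
  require_cmd "$tool" || true
done
```

If a tool is reported as missing, open a new terminal and re-run the matching install step above.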
+ + +### Authenticate gcloud + +We need to authenticate the `gcloud` CLI tool and set the project so it can interact with Google from the terminal. + +To authenticate `gcloud`, run: ```bash -sudo sh -c 'echo "deb [arch=amd64,arm64,armhf signed-by=/etc/apt/trusted.gpg.d/packages.microsoft.gpg] https://packages.microsoft.com/repos/code stable main" > /etc/apt/sources.list.d/vscode.list' +gcloud auth login ``` +And follow the prompts. For pasting into the terminal, you might need to use CTRL + SHIFT + V. + +You also need to set the GCP project that you are working in. For this section, you'll need your GCP Project ID, which can be found on the GCP Console at this [link here](https://console.cloud.google.com). Make sure you copy the _Project ID_ and **not** the _Project number_. + +To set your project, replace `` with your GCP Project ID and run: + ```bash -rm -f packages.microsoft.gpg +gcloud config set project ``` +Confirm your setup with: + ```bash -sudo apt update +gcloud config list ``` +You should get an output similar to: + ```bash -sudo apt install -y code +[core] +account = taylorswift@domain.com # Should be your GCP email +disable_usage_reporting = True +project = my-gcp-project # Should be your GCP Project ID + +Your active configuration is: [default] ``` -These commands will ask for your password: type it in. -:warning: When you type your password, nothing will show up on the screen, **that's normal**. This is a security feature to mask not only your password as a whole but also its length. Just type in your password and when you're done, press `Enter`. +### Application Default Credentials -### Launching from the terminal +Application Default Credentials are for authenticating our **code** (Terraform and Python 🐍) to interact with Google services and resources. It's a small distinction between `gcloud` and **code**, but an important one. 
-Now let's launch VS Code from **the terminal**: +To authenticate your **Application Default Credentials**, in your terminal run: ```bash -code +gcloud auth application-default login ``` -:heavy_check_mark: If a VS Code window has just opened, you're good to go :+1: +And follow the prompts. It should open a web page to log in to your Google account. -:x: Otherwise, please **contact a teacher** +## Terraform -### VS Code Remote SSH Extension +Terraform is an Infrastructure as Code (IaC) tool for creating (and destroying) resources in the cloud. -We need to connect VS Code to a virtual machine in the cloud so you will only work on that machine during the bootcamp. A pretty useful [**Remote SSH Extension**](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-ssh) is available on the VS Code Marketplace. +Install some basic requirements: +```bash +sudo apt-get update && sudo apt-get install -y gnupg software-properties-common +``` -- Open VS Code > Open the [command palette](https://code.visualstudio.com/docs/getstarted/userinterface#_command-palette) > Type `Extensions: Install Extensions` +Terraform is not available to **apt** by default, so we need to manually add the repository. 
+```bash +wget -O- https://apt.releases.hashicorp.com/gpg | \ + gpg --dearmor | \ + sudo tee /usr/share/keyrings/hashicorp-archive-keyring.gpg > /dev/null +``` -VSCode extensions - Search - Remote +```bash +gpg --no-default-keyring \ + --keyring /usr/share/keyrings/hashicorp-archive-keyring.gpg \ + --fingerprint +``` -- Install the extension +```bash +echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] \ + https://apt.releases.hashicorp.com $(lsb_release -cs) main" | \ + sudo tee /etc/apt/sources.list.d/hashicorp.list +``` -VS Code extensions - Remote - Details +Now we can install terraform directly with **apt** 👇 +```bash +sudo apt update +sudo apt-get install terraform +``` -That's the only extension you should install on your _local_ machine, we will install additional VS Code extensions on your _virtual machine_. +Verify the installation with: -### Virtual Machine connection +```bash +terraform --version +``` -- Open VS Code > Open the [command palette](https://code.visualstudio.com/docs/getstarted/userinterface#_command-palette) > Type `Remote-SSH: Connect to Host...` -vscode-connect-to-host +## Provisioning your Virtual Machine with Terraform -- Click on `Add a new host` -- Type `ssh -i @`, for instance, my username is `somedude`, my private SSH key is located at `~/.ssh/id_rsa` on my local computer, my VM has a public IP of `34.77.50.76`: I'll type `ssh -i ~/.ssh/id_rsa somedude@34.77.50.76` +You can create Cloud Resources like Virtual Machines in different ways: +- Through the Google Cloud [Compute Engine Console 🔗](https://console.cloud.google.com/compute/overview) +- Using `gcloud` +- With **Infrastructure as Code** tools like Terraform -vscode-ssh-connection-command +We'll be creating our Virtual Machine with Terraform. +We're almost at the point of creating your Virtual Machine. -- When prompted to `Select SSH configuration file to update`, pick the one in your home directory, under the `.ssh` folder, `~/.ssh/config` basically. 
Usually VS Code will pick automatically the best option, so their default should work. +The specifications of the Virtual Machine and Network Settings you'll use for the bootcamp are: +- Operating System: Ubuntu 22.04 LTS +- CPU: 4 Virtual CPU cores (2 physical CPU cores) +- RAM: 16 GB +- Storage (Persistent Disk): 100 GB balanced +- Static External IP address - so it's easier to log in. -vscode-add-host-ssh-config +### Cost 💸 -- You should get a pop-up on the bottom right notifying you the host has been added +Creating and running a Virtual Machine on Google Cloud Platform costs money! -vscode-host-added +If you have created a new Google Cloud Platform account, the cost of the Virtual Machine will be covered by the $300 USD credit for the first 90 days if you are diligent with turning off your Virtual Machine (or finish the _Linux and Bash_ challenge today 😎). -- Open again the [command palette](https://code.visualstudio.com/docs/getstarted/userinterface#_command-palette) > Type `Remote-SSH: Connect to Host...` > Pick your VM IP address +❗ **The cost of running a Virtual Machine with our configuration 24 hours a day, 7 days a week is ~$150 USD per month.** ❗ -vscode-add-new-host +You can massively reduce the cost by only running the Virtual Machine when you use it. You will _NOT_ be charged for the vCPUs and RAM while the Virtual Machine is off! -- The first time, VSCode might ask you for a security permission like below, say yes / continue. +You will always pay for the Storage (equivalent of your hard-drive on your local computer). It's ~$10 USD per month for 100 GB. -vscode-remote-connection-confirm +The rule of thumb is: if Google can rent the resource out to someone else when you're not using it, you only pay for it when you are using the resource. That's why you don't pay for the CPU and RAM when you are not using them (Google can rent them out to someone else), but you always pay for Storage (Google can't rent it out to someone else because it has your data on it). 
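To get a feel for how much switching the Virtual Machine off saves, here is a rough back-of-the-envelope sketch. It uses the two figures quoted above (~$150 USD per month running 24/7, ~$10 USD per month for storage) and **assumes** that the $150 already includes the storage, so the compute part is ~$140 and scales with the hours the VM is on. The function name `estimate_monthly_cost` is made up for this example, and real GCP pricing varies by region and over time.

```bash
estimate_monthly_cost() {
  # $1 = hours per week the VM is running (0 to 168)
  awk -v hours="$1" 'BEGIN {
    always_on = 150   # ~USD per month running 24/7 (figure from this guide)
    storage   = 10    # ~USD per month for 100 GB, paid even when the VM is off
    compute   = always_on - storage   # assumption: the 150 includes storage
    printf "%.2f\n", storage + compute * hours / 168
  }'
}

estimate_monthly_cost 40    # e.g. 8 hours a day, 5 days a week
estimate_monthly_cost 168   # never switched off
```

With these assumptions, 40 hours a week comes to roughly $43 USD per month instead of ~$150.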
-- Open again the [command palette](https://code.visualstudio.com/docs/getstarted/userinterface#_command-palette) > Type `Terminal: Create New Terminal (in active workspace)` > You now have a Bash terminal in your virtual machine! +### Download terraform files -vscode-command-palette-new-terminal -
-vscode-terminal +We almost have all the necessary parts to create your VM using **terraform**. We need to download the terraform files and change a few values. -- Still on your *local* computer, lets create a more readable version of your machine to connect to! +First we'll create a folder and download the terraform files with: ```bash -code ~/.ssh/config +mkdir -p ~/wagon-de-bootcamp +curl -L -o ~/wagon-de-bootcamp/main.tf https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/main.tf +curl -L -o ~/wagon-de-bootcamp/provider.tf https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/provider.tf +curl -L -o ~/wagon-de-bootcamp/variables.tf https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/variables.tf +curl -L -o ~/wagon-de-bootcamp/terraform.tfvars https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/terraform.tfvars +curl -L -o ~/wagon-de-bootcamp/.terraform.lock.hcl https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/.terraform.lock.hcl ``` -You should see something like the following: + +### Set variables + +Open up the file `~/wagon-de-bootcamp/terraform.tfvars` in VS Code or any other code editor. + +It should look like: ```bash -Host - HostName - IdentityFile - User +project_id = "" +region = "" +zone = "" +instance_name = "" +instance_user = "" ``` -You can now change Host to whatever you would like to see as the name of your connection or in terminal with `ssh `! -❗️ It is important that the `Host` alias does not contain any whitespaces ❗️ +We'll need to change some values in this file. Here's where you can find the required values: +- **project_id:** from the GCP Console at this [link here](https://console.cloud.google.com). 
+- **region:** take a look at the GCP Region and Zone documentation at this [link here](https://cloud.google.com/compute/docs/regions-zones). We strongly recommend you choose the closest geographical region. +- **zone:** a zone is a subset of a region. It is almost always the same as **region** appended with `-a`, `-b`, or `-c`. +- **instance_name:** we recommend naming your VM: `lw-de-vm-`, replacing `` with your GitHub username. +- **instance_user:** in your terminal, run `whoami` + +After completing this file, it should look similar to: ```bash -# For instance -Host "de-bootcamp-vm" - HostName 34.77.50.76 # replace with your VM's public IP address - IdentityFile - User +project_id = "wagon-bootcamp" +region = "europe-west1" +zone = "europe-west1-b" +instance_name = "lw-de-vm-tswift" +instance_user = "taylorswift" +``` + +Make sure to save the `terraform.tfvars` file, then navigate into the directory with the terraform files with: + +``` +cd ~/wagon-de-bootcamp ``` -**The setup of your local machine is over. All following commands will be run from within your 🚨 virtual machine**🚨 terminal (via VS code for instance) +And initialise and test the files with: +```bash +terraform init + +terraform plan +``` -## VS Code Extensions +And check the output. Towards the bottom there should be a line: -Let's install some useful extensions to VS Code. +``` +Plan: 2 to add, 0 to change, 0 to destroy +``` -- Open your VS Code instance and make sure you're connected to the remote server. At the bottom left, you'll see: +We'll be adding: +- A compute engine instance +- A static external IP address -vscode-ssh +❗ If you have any errors, read the error and debug. If you need some help, raise a ticket with a teacher. 
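The relationship between **region** and **zone** described above can be checked before running `terraform plan`. This is only a sketch: `zone_in_region` is a hypothetical helper written for this guide, and it assumes a zone is always the region name followed by a dash and a single lowercase letter.

```bash
zone_in_region() {
  # $1 = region, $2 = zone
  case "$2" in
    "$1"-[a-z]) return 0 ;;   # e.g. europe-west1 + "-b"
    *) return 1 ;;
  esac
}

# Example with the sample values from terraform.tfvars above.
if zone_in_region "europe-west1" "europe-west1-b"; then
  echo "zone matches region"
else
  echo "zone does not belong to region" >&2
fi
```

Swap in your own `region` and `zone` values; a mismatch here is a common reason for `terraform plan` to fail.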
-- Open the VS Code terminal (`CMD` + `` ` `` or `CTRL` + `` ` ``) then run the following commands: +If everything was successful, create your VM with: ```bash -code --install-extension ms-vscode.sublime-keybindings -code --install-extension emmanuelbeziat.vscode-great-icons -code --install-extension ms-python.python -code --install-extension KevinRose.vsc-python-indent -code --install-extension ms-python.vscode-pylance -code --install-extension redhat.vscode-yaml -code --install-extension ms-azuretools.vscode-docker -code --install-extension tamasfe.even-better-toml +terraform apply -auto-approve +``` + +It might take a while for Terraform to create the cloud resources. Once you see: + +``` +Apply complete! Resources: 2 added, 0 changed, 0 destroyed. ``` -Here is a list of the extensions you are installing: -- [Sublime Text Keymap and Settings Importer](https://marketplace.visualstudio.com/items?itemName=ms-vscode.sublime-keybindings) -- [VSCode Great Icons](https://marketplace.visualstudio.com/items?itemName=emmanuelbeziat.vscode-great-icons) -- [Python](https://marketplace.visualstudio.com/items?itemName=ms-python.python) -- [Python Indent](https://marketplace.visualstudio.com/items?itemName=KevinRose.vsc-python-indent) -- [Pylance](https://marketplace.visualstudio.com/items?itemName=ms-python.vscode-pylance) -- [YAML](https://marketplace.visualstudio.com/items?itemName=redhat.vscode-yaml) -- [Docker](https://marketplace.visualstudio.com/items?itemName=ms-azuretools.vscode-docker) -- [Even Better TOML](https://marketplace.visualstudio.com/items?itemName=tamasfe.even-better-toml) +Your Virtual Machine should be up and running! Check the GCP Compute Engine console at this [link here](https://console.cloud.google.com/compute/instances) to confirm. -## Command line tools +## Virtual Machine connection -### Zsh & Git +### Create SSH keys -Instead of using the default `bash` [shell](https://en.wikipedia.org/wiki/Shell_(computing)), we will use `zsh`. 
+We need to connect VS Code to our Virtual Machine in the cloud so you will only work on that machine during the bootcamp. We'll use the [Remote - SSH Extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-ssh) that we previously installed. -We will also use [`git`](https://git-scm.com/), a command line software used for version control. +To create the VS Code SSH configuration, run the following in your terminal: + +```bash +gcloud compute config-ssh +``` -Let's install them, along with other useful tools: -- Open an **VS Code terminal** connected to your VM -- Copy and paste the following commands: +`gcloud` may tell you it needs to create a directory to continue. Accept and you should get an output similar to: ```bash -sudo apt update -sudo apt install -y vim tmux tree git ca-certificates curl jq unzip zsh \ -apt-transport-https gnupg software-properties-common direnv sqlite3 make \ -postgresql postgresql-contrib build-essential libssl-dev zlib1g-dev \ -libbz2-dev libreadline-dev libsqlite3-dev wget llvm \ -libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev \ -gcc default-mysql-server default-libmysqlclient-dev libpython3-dev openjdk-8-jdk-headless +You should now be able to use ssh/scp with your instances. +For example, try running: + + $ ssh lw-de-vm-tswift.europe-west1-b.wagon-bootcamp +# $ ssh lw-de-vm-.. ``` -These commands might ask for your password, if they do: type it in. -:warning: When you type your password, nothing will show up on the screen, **that's normal**. This is a security feature to mask not only your password as a whole but also its length. Just type in your password and when you're done, press `Enter`. 
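The host name in the sample output above appears to follow the pattern instance name, zone, project ID, joined with dots. Assuming that pattern holds for your setup, this small sketch builds the alias from the values in your `terraform.tfvars` (`ssh_alias` is a made-up helper, not a `gcloud` command):

```bash
ssh_alias() {
  # $1 = instance_name, $2 = zone, $3 = project_id
  printf '%s.%s.%s\n' "$1" "$2" "$3"
}

# Sample values from this guide; replace them with your own.
ssh_alias "lw-de-vm-tswift" "europe-west1-b" "wagon-bootcamp"
```

You could then test the connection from your local terminal with `ssh "$(ssh_alias lw-de-vm-tswift europe-west1-b wagon-bootcamp)" whoami`, using your own values.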
+### Connect with VS Code + +To connect to your Virtual Machine, click on the small symbol at the very bottom-left corner of VS Code: + +![](/images/vscode_remote_highlight.png) + +It should bring up a menu; click on **Connect to Host...**: -### GitHub CLI installation +![](/images/vscode_remote_menu.png) -Let's now install [GitHub official CLI](https://cli.github.com) (Command Line Interface). It's a software used to interact with your GitHub account via the command line. +Click on the name of your Virtual Machine: -In your terminal, copy-paste the following commands and type in your password if asked: +![](/images/vscode_remote_hosts.png) + +A new VS Code window will open. You may be asked to select the platform of the remote host; select **Linux**. You will then be asked to _fingerprint_ the connection. VS Code is asking if you trust the remote host you are trying to connect to. Hit enter to continue. + +![](/images/vscode_remote_fingerprint.png) + +And you are connected! It should look similar to: + +![](/images/vscode_remote_connected.png) + +Notice the connection in the very bottom-left corner of your VS Code window. It should have the Connection type (SSH), and the name of the host you are connected to. + +**The setup of your local machine is over. All following commands will be run from within your 🚨 virtual machine**🚨 terminal (via VS Code) + +
+Viewing your SSH Configuration + +If you want to view your SSH configuration: +1. Start by clicking the symbol in the bottom-left corner of VS Code +2. Click on **Connect to Host...** +3. Click on **Configure SSH Hosts...** +4. Select the configuration file, usually the file at the top of the list. +5. View your configuration file! You may need to edit this configuration if you change computers, or want to work on more than one computer during the bootcamp. + +
+ + +## VM gcloud and Application Default Credentials + +We'll be doing some of the steps again, but that's because the virtual machine is a completely new computer! Luckily for us, `gcloud` comes pre-installed on the virtual machine. + + +### Authenticate gcloud + +We need to authenticate the `gcloud` CLI tool and set the project so it can interact with Google from the terminal. + +To authenticate `gcloud`, run: ```bash -curl -fsSL https://cli.github.com/packages/githubcli-archive-keyring.gpg | sudo dd of=/usr/share/keyrings/githubcli-archive-keyring.gpg -echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/githubcli-archive-keyring.gpg] https://cli.github.com/packages stable main" | sudo tee /etc/apt/sources.list.d/github-cli.list > /dev/null -sudo apt update -sudo apt install -y gh +gcloud auth login ``` -To check that `gh` has been successfully installed on your machine, you can run: +And follow the prompts. For pasting into the terminal, you might need to use CTRL + SHIFT + V. + +You also need to set the GCP project that you are working in. For this section, you'll need your GCP Project ID, which can be found on the GCP Console at this [link here](https://console.cloud.google.com). Make sure you copy the _Project ID_ and **not** the _Project number_. + +To set your project, replace `` with your GCP Project ID and run: ```bash -gh --version +gcloud config set project ``` -:heavy_check_mark: If you see `gh version X.Y.Z (YYYY-MM-DD)`, you're good to go :+1: +Confirm your setup with: -:x: Otherwise, please **contact a teacher** +```bash +gcloud config list +``` +You should get an output similar to: + +```bash +[core] +account = taylorswift@domain.com # Should be your GCP email +disable_usage_reporting = True +project = my-gcp-project # Should be your GCP Project ID + +Your active configuration is: [default] +``` -## Oh-my-zsh -Let's install the `zsh` plugin [Oh My Zsh](https://ohmyz.sh/). 
+### Application Default Credentials -In a terminal execute the following command: +Application Default Credentials are for authenticating our **code** (Terraform and Python 🐍) to interact with Google services and resources. It's a small distinction between `gcloud` and **code**, but an important one. + +To authenticate your **Application Default Credentials**, in your terminal run: ```bash -sh -c "$(curl -fsSL https://raw.github.com/ohmyzsh/ohmyzsh/master/tools/install.sh)" +gcloud auth application-default login ``` -If asked "Do you want to change your default shell to zsh?", press `Y` +And follow the prompts. It should open a web-page to login to your Google account. + + +## VM configuration with Ansible -At the end your terminal should look like this: +We'll be using [Ansible](https://docs.ansible.com/ansible/latest/getting_started/introduction.html) to configure your Virtual Machine with some software, configurations, packages, and frameworks that you'll use in the bootcamp. -![Ubuntu terminal with OhMyZsh](https://github.com/lewagon/setup/blob/master/images/oh_my_zsh.png) +Let's start by confirming that ansible is installed. In your terminal run: + +```bash +ansible --version +``` + +You should get an output similar to (some version numbers might change, that's fine): + +``` +ansible [core 2.17.9] + config file = /etc/ansible/ansible.cfg + configured module search path = ['/home/tswift/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules'] + ansible python module location = /usr/lib/python3/dist-packages/ansible + ansible collection location = /home/tswift/.ansible/collections:/usr/share/ansible/collections + executable location = /usr/bin/ansible + python version = 3.12.3 (main, Feb 4 2025, 14:48:35) [GCC 13.3.0] (/usr/bin/python3) + jinja version = 3.1.2 + libyaml = True +``` + +❗ If not, raise a ticket with a teacher. 
+ +### Ansible Playbook 1 + +Create a folder and download the ansible files: + +```bash +mkdir -p ~/vm-ansible-setup/playbooks + +curl -L -o ~/vm-ansible-setup/ansible.cfg https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/vm-ansible-setup/ansible.cfg +curl -L -o ~/vm-ansible-setup/hosts https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/vm-ansible-setup/hosts +curl -L -o ~/vm-ansible-setup/playbooks/setup_vm_part1.yml https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/vm-ansible-setup/playbooks/setup_vm_part1.yml +``` + +And run with: + +```bash +cd ~/vm-ansible-setup +ansible-playbook playbooks/setup_vm_part1.yml +``` -:heavy_check_mark: If it does, you can continue :+1: +And the playbook should start running! -:x: Otherwise, please **ask for a teacher** +❗ If any errors occur, raise a ticket with a teacher. You can safely run the playbook again. + +### What is the playbook installing? + +This playbook is installing a few things. While it is running, let's go through them: +- Updating system packages. Ubuntu uses the `APT` package manager. +- Changing the default shell from **bash** to **zsh**, a more customizable shell that is extensible and looks great! +- Installing the **Oh-My-ZSH** plugin for the **zsh** shell. We'll use it a bit later to add some quality of life plugins and extensions for `zsh`. +- Installing **Docker** on your Virtual Machine. Docker is an open platform for developing, shipping, and running applications. You will use it throughout the bootcamp. +- Installing some **Kubernetes (k8s)** tooling: Kubernetes is a system designed for auto-scaling containerized applications. + - Installing **kubectl**: `kubectl` is the CLI tool for interacting with kubernetes clusters. + - Installing **minikube**: Minikube is a way to quickly spin up a local kubernetes cluster. Great for developing!
+- Installing **terraform**: we've already installed it once, but we need to install it on our VM! **Terraform** is an Infrastructure as Code (IaC) tool. +- Install the **GitHub CLI**: the CLI tool that we'll use to interact with your GitHub account directly from the terminal. + +The playbook is also running checks to see if things are installed or not. This is so you can safely re-run the playbook without any problems. ## GitHub CLI @@ -649,120 +811,6 @@ gh auth status :x: If not, **contact a teacher**. -## Google Cloud CLI - -Install the `gcloud` CLI to communicate with [Google Cloud Platform](https://cloud.google.com/) through your terminal: -```bash -echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list -sudo apt-get install apt-transport-https ca-certificates gnupg -curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - -sudo apt-get update && sudo apt-get install google-cloud-sdk -sudo apt-get install google-cloud-sdk-app-engine-python -``` -👉 [Install documentation](https://cloud.google.com/sdk/docs/install#deb) - -### Create a service account key 🔑 - -**👌 Note: Skip to the next section if you already have a service account key** - -Now that you have created a `GCP account` and a `project` (identified by its `PROJECT_ID`), we are going to configure the actions (API calls) that you want to allow your code to perform. - -
- 🤔 Why do we need a service account key ? - - - You have created a `GCP account` linked to your credit card. Your account will be billed according to your usage of the ressources of the **Google Cloud Platform**. The billing will occur if you consume anything once the free trial is over, or if you exceed the amount of spending allowed during the free trial. - - In your `GCP account`, you have created a single `GCP project`, identified by its `PROJECT_ID`. The `GCP projects` allow you to organize and monitor more precisely how you consume the **GCP** ressources. For the purpose of the bootcamp, we are only going to create a single project. - - Now, we need a way to tell which ressources within a `GCP project` our code will be allowed to consume. Our code consumes GCP ressources through API calls. - - Since API calls are not free, it is important to define with caution how our code will be allowed to use them. During the bootcamp this will not be an issue and we are going to allow our code to use all the API of **GCP** without any restrictions. - - In the same way that there may be several projects associated with a GCP account, a project may be composed of several services (any bundle of code, whatever its form factor, that requires the usage of GCP API calls in order to fulfill its purpose). - - GCP requires that the services of the projects using API calls are registered on the platform and their credentials configured through the access granted to a `service account`. - - For the moment we will only need to use a single service and will create the corresponding `service account`. -
-Since the [service account](https://cloud.google.com/iam/docs/service-accounts) is what identifies your application (and therefore your GCP billing account and ultimately your credit card), you are going to want to be cautious with the next steps. - -⚠️ **Do not share you service account json file 🔑** ⚠️ Do not store it on your desktop, do not store it in your git codebase (even if your git repository is private), do not let it by the coffee machine, do not send it as a tweet. - -- Go to the [service accounts page](https://console.cloud.google.com/apis/credentials/serviceaccountkey) -- Select your project in the list of recent projects if asked to -- Create a service account: - - Click on **CREATE SERVICE ACCOUNT**: - - Give a `Service account name` to that account - - Click on **CREATE AND CONTINUE** - - Click on **Select a role** and choose `Quick access/Basic` then **Owner**, which gives full access to all ressources - - Click on **CONTINUE** - - Click on **DONE** -- Download the service account json file 🔑: - - Click on the newly created service account - - Click on **KEYS** - - Click on **ADD KEY** then **Create new key** - - Select **JSON** and click on **CREATE** - -![](images/gcp_create_key.png) - -The browser has now saved the service account json file 🔑 in your downloads directory (it is named according to your service account name, something like `le-wagon-data-123456789abc.json`) - - -### Configure Cloud sdk - -- Open the service account json file with any text editor and copy the key - ``` - # It looks like: - { - "type": "service_account", - "project_id": "kevin-bootcamp", - "private_key_id": "1234567890", - "private_key": "-----BEGIN PRIVATE KEY-----\nXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX\n-----END PRIVATE KEY-----\n", - "client_email": "bootcamp@kevin-bootcamp.iam.gserviceaccount.com", - "client_id": "1234567890", - "auth_uri": "https://accounts.google.com/o/oauth2/auth", - "token_uri": "https://oauth2.googleapis.com/token", -
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs", - "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/bootcamp%40kevin-bootcamp.iam.gserviceaccount.com" - } - ``` -- **on your Virtual Machine**, create a `~/.gcp_keys` directory, then create a json file in it: - ``` bash - mkdir ~/.gcp_keys - touch ~/.gcp_keys/le-wagon-de-bootcamp.json - ``` -- Open the json file then store the service account json file pasting the key: - ```bash - code ~/.gcp_keys/le-wagon-de-bootcamp.json - ``` - ![service account json key](images/service_account_json_key.png) - - ❗️Don't forget to **save** the file with `CMD` + `s` or `CTRL` + `s` - -- Authenticate the `gcloud` CLI with the google account you used for GCP - ```bash - # Replace service_account_name@project_id.iam.gserviceaccount.com with your own - SERVICE_ACCOUNT_EMAIL=service_account_name@project_id.iam.gserviceaccount.com - KEY_FILE=$HOME/.gcp_keys/le-wagon-de-bootcamp.json - gcloud auth activate-service-account $SERVICE_ACCOUNT_EMAIL --key-file=$KEY_FILE - ``` -- List your active account and check your email address you used for GCP is present - ```bash - gcloud auth list - ``` -- Set your current project - ```bash - # Replace `PROJECT_ID` with the `ID` of your project, e.g. `wagon-bootcamp-123456` - gcloud config set project PROJECT_ID - ``` -- List your active account and current project and check your project is present - ```bash - gcloud config list - ``` - - ## Dotfiles Let's pimp your zsh and and vscode by installing lewagon recommanded dotfiles **on your Virtual Machine** @@ -909,474 +957,344 @@ you don't want your email to appear in public repositories you may contribute to -### zsh default terminal +--- -Set `zsh` as your default VS Code terminal. +Once you have finished installing the **dotfiles**, kill your terminal (little trash can at the top right of the terminal window) and re-open it. 
You might have to do it a few times until it looks similar to: -- Open terminal default profile settings +![](/images/vscode_after_ansible1.png) - Terminal profile settings -- Select `zsh /usr/bin/zsh` +The terminal should read as `zsh`. - Terminal zsh profile +## VM configuration with Ansible - Part 2 -## Disable SSH passphrase prompt +### Ansible Playbook 2 -You don't want to be asked for your passphrase every time you communicate with a distant repository. So, you need to add the plugin `ssh-agent` to `oh my zsh`: +We'll be using a second **Ansible** playbook to further configure your Virtual Machine. -First, open the `.zshrc` file: +Start by downloading the ansible playbook: ```bash -code ~/.zshrc +curl -L -o ~/vm-ansible-setup/playbooks/setup_vm_part2.yml https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/vm-ansible-setup/playbooks/setup_vm_part2.yml ``` -Then: -- Spot the line starting with `plugins=` -- Add `ssh-agent` at the end of the plugins list - -:heavy_check_mark: Save the `.zshrc` file with `Ctrl` + `S` and close your text editor. - - -## Docker πŸ‹ - -Docker is an open platform for developing, shipping, and running applications. - -### Install Docker and Docker Compose - -Setup the dock apt repo +And run with: ```bash -sudo install -m 0755 -d /etc/apt/keyrings - -curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg - -sudo chmod a+r /etc/apt/keyrings/docker.gpg +cd ~/vm-ansible-setup +ansible-playbook playbooks/setup_vm_part2.yml ``` -```bash -echo \ - "deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \ - "$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \ - sudo tee /etc/apt/sources.list.d/docker.list > /dev/null -``` +And the playbook should start running! If you're asked if you want VS Code to behave more like Sublime Text, click accept. 
-Install the right packages +❗ If any errors occur, raise a ticket with a teacher. You can safely run the playbook again. -``` -sudo apt-get update -sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin -``` +
+❓ Why two Ansible playbooks? -Finally give your user permission to use `docker` +This second ansible playbook requires GitHub authorisation to fork the `lewagon/data-engineering-challenges` repository and it is also editing some of the Le Wagon recommended **dotfiles**. So we separated the process into two steps. +
-```bash -sudo groupadd docker -sudo usermod -aG docker $USER -newgrp docker -``` +### What is the playbook installing? -Run `docker run hello-world`, you should see something like: +This playbook is installing and configuring a few things. While it is running, let's go through them: -
- ❗️ Permission denied while trying to connect to the Docker daemon socket. ❗️ +**Python and Poetry** -If you receive an error similar to the one below, navigate to the [GCP Compute Engine Console](https://console.cloud.google.com/compute/instances) and shut down your VM by selecting the tick box next to your VM instance and clicking STOP (closing and reopening VSCode is not enough). +Ubuntu 22.04 has Python pre-installed, but not the version we're going to use. We are going to use Python [3.12.8](https://www.python.org/downloads/release/python-3128/). -![](images/docker_permission_denied_socket.png) +- Install **pyenv** and **pyenv-virtualenv**. We'll use **pyenv** to manage the Python versions installed on the VM. +- Install Python 3.12.8 with pyenv. +- Install **pipx**: [Pipx](https://pipx.pypa.io/stable/) is used to install python packages we want _globally_ available while still using virtual environments, like Poetry! +- Installing a few global python packages with **pipx**: + - **Poetry:** [Poetry](https://python-poetry.org/) is a modern Python package manager we will use throughout the bootcamp. + - **Ruff:** [Ruff](https://docs.astral.sh/ruff/) is used to format and lint Python code. + - **tldr:** [tldr](https://github.com/tldr-pages/tldr) provides a much more readable version of `man` pages. Useful for quickly finding out how a program works. -It will take a few minutes for your VM to turn off. Once it's fully off, turn your VM on again by checking the box next to the VM instance and clicking START. Give the VM a few minutes to fully start up and connect through VSCode. Once connected try `docker run hello-world` again. If you don't get an output similar to the below image, raise a ticket with a teacher. -
+**VS Code Configuration** -![](images/docker_hello.png) - -### Enable Artifact Registry API - -**👌 Note: Skip to the next section if you already have an Artifact Registry repository** - -[Artifact Registry](https://cloud.google.com/artifact-registry) is a GCP service you will use to store artifacts such as Docker images. The storage units are called repositories. - -- Enable the service within your project using the `gcloud` CLI: - ```bash - gcloud services enable artifactregistry.googleapis.com - ``` -- Create a new Docker repository: - ```bash - # Set the repository name - REPOSITORY=docker-hub - # Set the location of the repository. Available locations: gcloud artifacts locations list - LOCATION=europe-west1 - gcloud artifacts repositories create $REPOSITORY \ - --repository-format=docker \ - --location=$LOCATION \ - --description="Docker images storage" - ``` - -### Gcloud authentication for Docker - -You need to grant Docker access to push artifacts to (and pull from) your repository. There are different authentication methods, [gcloud credentials helper](https://cloud.google.com/artifact-registry/docs/docker/authentication#gcloud-helper) being the easiest. - -- Define the repository hostname matching the repository `$LOCATION`: - ```bash - # If $LOCATION is "europe-west1" - HOSTNAME=europe-west1-docker.pkg.dev - ``` -- Configure gcloud credentials helper: - ```bash - gcloud auth configure-docker $HOSTNAME - ``` -- Type `y` to accept the configuration -- Check your credentials helper is set: - ```bash - cat ~/.docker/config.json - ``` - You should get: - ```bash - { - "credHelpers": { - "europe-west1-docker.pkg.dev": "gcloud" - } - }% - ``` - - -## Kubernetes -Kubernetes (K8s) is a system designed to make deploying auto-scaling containerized applications easily. - -### Install kubectl -Kubectl is the cli for interacting with k8s! - -https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/ +- Installing some **VS Code** extensions, but only on your VM.
Here's a list of the extensions that are being installed: + - [Sublime Text Keymap and Settings Importer](https://marketplace.visualstudio.com/items?itemName=ms-vscode.sublime-keybindings) + - [VSCode Great Icons](https://marketplace.visualstudio.com/items?itemName=emmanuelbeziat.vscode-great-icons) + - [Python](https://marketplace.visualstudio.com/items?itemName=ms-python.python) + - [Python Indent](https://marketplace.visualstudio.com/items?itemName=KevinRose.vsc-python-indent) + - [Pylance](https://marketplace.visualstudio.com/items?itemName=ms-python.vscode-pylance) + - [YAML](https://marketplace.visualstudio.com/items?itemName=redhat.vscode-yaml) + - [Docker](https://marketplace.visualstudio.com/items?itemName=ms-azuretools.vscode-docker) + - [Even Better TOML](https://marketplace.visualstudio.com/items?itemName=tamasfe.even-better-toml) +- Update the VS Code Python Interpreter path. -```bash -curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" -curl -LO "https://dl.k8s.io/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl.sha256" +**Shell and System Configuration** -echo "$(cat kubectl.sha256) kubectl" | sha256sum --check +- Create the **direnv** poetry function. The same one from the lecture! This makes it easier to work with poetry. +- Adding some **Oh-My-ZSH** Plugins: by modifying your `.zshrc` file. Here's a list of the extra plugins: + - **pyenv**: Auto-complete for pyenv, a tool used to manage python virtual environments + - **gcloud**: Auto-complete for the gcloud CLI tool + - **ssh-agent**: Saves your SSH password so you only have to enter it once per session. + - **direnv**: A tool to load `.envrc` files when you `cd` into a directory. Great for loading environment variables. 
+- Installing **Spark**: Spark is a distributed data processing framework. -sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl +**Data Engineering Challenges Repository** -kubectl version --client -kubectl version --client --output=yaml -``` +The challenges that you'll be working on throughout the bootcamp! The playbook forks the **data-engineering-challenges** repository from **lewagon** to your own GitHub account, then clones that fork from your GitHub account onto your Virtual Machine. -### Install minikube +### Restart Virtual Machine -Minikube is a way to quickly spin up a local kubernetes cluster! +Once the playbook has finished running, you need to completely shut down your Virtual Machine so that some of the configuration updates (specifically **pyenv** and **Docker**) take effect. -https://minikube.sigs.k8s.io/docs/start/ +To shut down your VM, navigate to the GCP Compute Engine Instances [console page 🔗](https://console.cloud.google.com/compute/instances). -```bash -curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 -sudo install minikube-linux-amd64 /usr/local/bin/minikube -``` - -### Test installation -To test that you can launch a cluster run: -```bash -minikube start -``` -you should see your cluster booting up : +Select your VM instance and click on the stop button: -![](images/minikube_start.png) +![](/images/gcp_vm_stop.png) -Then to check the cluster run: -```bash -kubectl get po -A -``` -you should be able to see your cluster running! : +Wait for a few minutes until the VM shows that it is completely off. You may need to refresh the page, as the GCP Console doesn't update dynamically. -![](images/minikube_base.png) +When the VM is completely off, turn it on again by selecting the check box next to your instance and clicking **START/RESUME**. Give it a minute to spin up, then connect via VS Code.
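As an alternative to clicking through the console, the same stop and start can be sketched with the `gcloud` CLI from your **local machine** (not from inside the VM, since stopping it ends your session). This is an assumption-laden sketch rather than part of the official setup: `my-vm` and `europe-west1-b` are placeholders for your own instance name and zone, and `run` is a hypothetical dry-run wrapper that only prints the commands unless you set `DRY_RUN=0`.

```bash
# Placeholders: replace with your own instance name and zone.
INSTANCE="my-vm"
ZONE="europe-west1-b"

# DRY_RUN=1 (the default here) only prints each command so you can review it first.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical wrapper: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

run gcloud compute instances stop "$INSTANCE" --zone="$ZONE"
run gcloud compute instances start "$INSTANCE" --zone="$ZONE"
```

Once the printed commands look right, re-run with `DRY_RUN=0` to actually stop and start the instance.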
-To tear it all down for now: -```bash -minikube delete --all -``` +## Check your Virtual Machine Setup +We've used two ansible playbooks to configure our Virtual Machine. Let's run some manual checks in the terminal to make sure that everything has installed correctly. -## Terraform +❗ If any of these checks error out, raise a ticket with a teacher. -Terraform is a tool for infrastructure as code (IAC) to define resources to create in the cloud! +#### Python -### Install terraform +🧪 To test: -Install some basic requirements ```bash -sudo apt-get update && sudo apt-get install -y gnupg software-properties-common +python --version ``` -Terraform is not avaliable to apt by default so we need to make it avaliable! -```bash -wget -O- https://apt.releases.hashicorp.com/gpg | \ - gpg --dearmor | \ - sudo tee /usr/share/keyrings/hashicorp-archive-keyring.gpg > /dev/null -``` +Should return: -```bash -gpg --no-default-keyring \ - --keyring /usr/share/keyrings/hashicorp-archive-keyring.gpg \ - --fingerprint ``` - -```bash -echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] \ - https://apt.releases.hashicorp.com $(lsb_release -cs) main" | \ - sudo tee /etc/apt/sources.list.d/hashicorp.list +Python 3.12.8 ``` -Now we can install terraform directly with apt 👇 -```bash -sudo apt update -sudo apt-get install terraform -``` +#### Pyenv -Verify the installation with: +🧪 To test: ```bash -terraform --version +pyenv versions ``` +Should return: +``` + system +* 3.12.8 (set by /home//.pyenv/version) +``` -## Spark +Note: There should be an `*` next to 3.12.8. 
-Spark is a data processing framework: +#### Pipx -Move to your home directory: +🧪 To test: ```bash -cd ~ +pipx list ``` -Download spark: +Should return something similar to: -```bash -wget https://archive.apache.org/dist/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz ``` - -Open the tarball: - -```bash -mkdir -p ~/spark && tar -xzf spark-3.5.3-bin-hadoop3.tgz -C ~/spark +venvs are in
/home//.local/share/pipx/venvs +apps are exposed on your $PATH at /home//.local/bin +manual pages are exposed at /home//.local/share/man + package poetry 2.1.1, installed using Python 3.12.8 + - poetry + package ruff 0.11.0, installed using Python 3.12.8 + - ruff + package tldr 3.3.0, installed using Python 3.12.8 + - tldr + - man1/tldr.1 ``` -Set the environment variables needed by spark: - -```bash -echo "export SPARK_HOME=$HOME/spark/spark-3.5.3-bin-hadoop3" >> .zshrc -echo 'export PATH=$PATH:$SPARK_HOME/bin' >> .zshrc -``` +#### Docker -Let's restart our shell: +🧪 To test: ```bash -exec zsh +docker run hello-world ``` -Test Spark works by running: +Should return: -```bash -spark-shell ``` +Unable to find image 'hello-world:latest' locally +latest: Pulling from library/hello-world +e6590344b1a5: Pull complete +Digest: sha256:7e1a4e2d11e2ac7a8c3f768d4166c2defeb09d2a750b010412b6ea13de1efb19 +Status: Downloaded newer image for hello-world:latest -You should see an output similar to: +Hello from Docker! +This message shows that your installation appears to be working correctly. -```bash -Setting default log level to "WARN". -To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). -25/01/15 11:33:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable -Spark context Web UI available at http://de-vm-lrae-test.europe-north1-b.c.wagon-de.internal:4040 -Spark context available as 'sc' (master = local[*], app id = local-1736940788403). -Spark session available as 'spark'. -Welcome to - ____ __ - / __/__ ___ _____/ /__ - _\ \/ _ \/ _ `/ __/ '_/ - /___/ .__/\_,_/_/ /_/\_\ version 3.5.3 - /_/ +To generate this message, Docker took the following steps: + 1. The Docker client contacted the Docker daemon. + 2. The Docker daemon pulled the "hello-world" image from the Docker Hub. + (amd64) + 3.
The Docker daemon created a new container from that image which runs the + executable that produces the output you are currently reading. + 4. The Docker daemon streamed that output to the Docker client, which sent it + to your terminal. -Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 1.8.0_432) -Type in expressions to have them evaluated. -Type :help for more information. - -scala> -``` -Type `:quit` and hit enter to exit the spark-shell and continue. +To try something more ambitious, you can run an Ubuntu container with: + $ docker run -it ubuntu bash +Share images, automate workflows, and more with a free Docker ID: + https://hub.docker.com/ -## Python & Pip +For more examples and ideas, visit: + https://docs.docker.com/get-started/ +``` -Ubuntu 22.04 has Python pre-installed, but not the version we're going to use. We are going to use Python 3.12 ([3.12.8](https://www.python.org/downloads/release/python-3128/)). +#### Kubernetes -Let's install pyenv to manage our python versions: +We can start by testing `minikube`: ```bash -git clone https://github.com/pyenv/pyenv.git ~/.pyenv -source ~/.zprofile -exec zsh +# Start +minikube start ``` -We'll also install a useful `pyenv` plugin called [`pyenv-virtualenv`](https://github.com/pyenv/pyenv-virtualenv). Although we will be using `poetry` for Python package and virtual environment management, `pyenv-virtualenv` is useful for controlling python versions locally. +Should return: -```bash -git clone https://github.com/pyenv/pyenv-virtualenv.git $(pyenv root)/plugins/pyenv-virtualenv -exec zsh ``` - -Now install Python 3.12.8: -```bash -pyenv install 3.12.8 -pyenv global 3.12.8 +😄 minikube v1.35.0 on Ubuntu 22.04 (amd64) +✨ Automatically selected the docker driver. Other choices: none, ssh +📌 Using Docker driver with root privileges +👍 Starting "minikube" primary control-plane node in "minikube" cluster +🚜 Pulling base image v0.0.46 ... +💾 Downloading Kubernetes v1.32.0 preload ...
+ > gcr.io/k8s-minikube/kicbase...: 500.31 MiB / 500.31 MiB 100.00% 88.19 M + > preloaded-images-k8s-v18-v1...: 333.57 MiB / 333.57 MiB 100.00% 32.20 M +🔥 Creating docker container (CPUs=2, Memory=3900MB) ... +🐳 Preparing Kubernetes v1.32.0 on Docker 27.4.1 ... + ▪ Generating certificates and keys ... + ▪ Booting up control plane ... + ▪ Configuring RBAC rules ... +🔗 Configuring bridge CNI (Container Networking Interface) ... +🔎 Verifying Kubernetes components... + ▪ Using image gcr.io/k8s-minikube/storage-provisioner:v5 +🌟 Enabled addons: storage-provisioner, default-storageclass +🏄 Done! kubectl is now configured to use "minikube" cluster and "default" namespace by default ``` -Now `python --version` should return `3.12.8` - -## Pipx - -Next we are going to install [pipx](https://pypa.github.io/pipx/) to install python packages we want globally available while still using virtual environments. - -Let's upgrade `pip` first: +And then make sure the kubernetes CLI utility, `kubectl`, works with: ```bash -pip install --upgrade pip +# Get pods +kubectl get po -A ``` -And install `pipx`: +Should return something similar to: -```bash -python -m pip install --user pipx # --user so that each ubuntu user can have his own 'pipx' -python -m pipx ensurepath -exec zsh ``` - -Lets install a [tldr](https://github.com/tldr-pages/tldr) with pipx - -```bash -pipx install tldr +NAMESPACE NAME READY STATUS RESTARTS AGE +kube-system coredns-668d6bf9bc-mg7b6 1/1 Running 0 72s +kube-system etcd-minikube 1/1 Running 0 78s +kube-system kube-apiserver-minikube 1/1 Running 0 76s +kube-system kube-controller-manager-minikube 1/1 Running 0 76s +kube-system kube-proxy-stk77 1/1 Running 0 72s +kube-system kube-scheduler-minikube 1/1 Running 0 76s +kube-system storage-provisioner 1/1 Running 1 (41s ago) 75s ``` -Now `tldr` should be globally available (for the current user), test it out with: +And because `minikube` is resource intensive, delete the cluster for now with: ```bash -tldr
ls +# Delete the cluster +minikube delete --all ``` -Much more readable than the classic `man ls` (although sometimes you will still need to delve into the man pages to get all of the details!) and it even has pages not included in man such as `tldr gh`: - -tldr - - -Lets add a few more packages we want globally available - -### black +Should return: -[black](https://black.readthedocs.io/en/stable/) for helping to format code - -```bash -pipx install black +``` +🔥 Deleting "minikube" in docker ... +🔥 Removing /home//.minikube/machines/minikube ... +💀 Removed all traces of the "minikube" cluster. +🔥 Successfully deleted all profiles ``` -### Poetry - -[Poetry](https://python-poetry.org/) is a modern Python package manager we will use throughout the bootcamp. +#### Terraform -Install Poetry running the following command in your VS Code terminal: +🧪 To test: ```bash -pipx install poetry +terraform --version ``` -Then, let's update default poetry behavior so that virtual envs are always created where `poetry install` is run. -During the bootcamp, you'll see a `.venv` folder being created inside each challenge folder. +Should return: -```bash -poetry config virtualenvs.in-project true ``` - -Finally, update your VScode settings to tell it that this `.venv` relative folder path will be your default interpreter! - -1. Open the Command Palette ( 🪟 ctrl + shift + P / 🍎 cmd + shift + P ) -2. Search for: **Preference: Open Remote Settings (JSON)** - when you open your settings that should be two panels. -3. In the panel that opens on the **right side** search for the line: `python.defaultInterpreterPath` -4. Replace the value (probably `"~/.pyenv/shims/python"`) so that it looks like: - -```yml -"python.defaultInterpreterPath": ".venv/bin/python", +Terraform v1.11.2 +on linux_amd64 ``` -## Direnv +#### Spark -[Direnv](https://direnv.net/) is a great utility that will look for `.envrc` files in your directories.
When you `cd` into directories with a `.envrc` files, paths will automatically be updated. In our case, this will simplify our workflow and allow us to not have to worry about Poetry managed Python virtual environments. - -1. First, setup the *direnv hook* to your zsh shell so that direnv gets activated anytime a `.envrc` file exists in current working directory. +🧪 To test: ```bash -code ~/.zshrc +spark-shell ``` -```bash -plugins=(git gitfast ... pyenv ssh-agent direnv) # add `direnv` to the existing list of plugins -``` +Should take you into the spark shell that looks like: -2. Second, let's configure what will happens anytime `.envrc` file is found +``` +Setting default log level to "WARN". +To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). +25/03/18 08:54:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable +Spark context Web UI available at http://lw-de-vm.europe-north1-b.c.wagon-de.internal:4040 +Spark context available as 'sc' (master = local[*], app id = local-1742288096829). +Spark session available as 'spark'. +Welcome to + ____ __ + / __/__ ___ _____/ /__ + _\ \/ _ \/ _ `/ __/ '_/ + /___/ .__/\_,_/_/ /_/\_\ version 3.5.3 + /_/ -```bash -code ~/.direnvrc -``` -- Paste the following lines - ```bash - layout_poetry() { - if [[ ! -f pyproject.toml ]]; then - log_error 'No pyproject.toml found. Use `poetry new` or `poetry init` to create one first.'
- exit 2 - fi - # create venv if it doesn't exist - poetry run true - - export VIRTUAL_ENV=$(poetry env info --path) - export POETRY_ACTIVE=1 - PATH_add "$VIRTUAL_ENV/bin" - } - ``` -- Save and close the file - -😎 Now, **anytime you `cd` into a challenge folder which contains a `.envrc` file which contains `layout_poetry()` command inside, the function will get executed and your virtual env will switch to the poetry one that is defined by the `pyproject.toml` !** -- No need to prefix all commands with `poetry run `, but simply `` -- Each challenge will have its own virtual env, and it will be seamless for you to switch between challenges/envs +Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 1.8.0_442) +Type in expressions to have them evaluated. +Type :help for more information. +scala> +``` -## Let's Make! +Type `:quit` and hit enter to exit the spark-shell and continue. -Lets clone the challenges onto your **virtual machine** +That's all the testing we'll do for now! -```bash -export GITHUB_USERNAME=`gh api user | jq -r '.login'` -echo $GITHUB_USERNAME -``` -Then: +## Let's Make! -```bash -mkdir -p ~/code/$GITHUB_USERNAME && cd $_ -gh repo fork lewagon/data-engineering-challenges --clone -``` +Almost there! In the second ansible playbook, the `lewagon/data-engineering-challenges` repository was forked from Le Wagon to your GitHub account. Let's review how it works. Our setup will look a bit like this: - +![](/images/repo_overview.png) This allows you to work on challenges, but if we push any changes to the content, you can still access them! Check your remotes match `origin` your data engineering challenges and `upstream` lewagon's!
```bash -cd data-engineering-challenges +cd ~/code/$(gh api user | jq -r '.login')/data-engineering-challenges git remote -v -# origin git@github.com:your_github_username/data-engineering-challenges.git (fetch) -# origin git@github.com:your_github_username/data-engineering-challenges.git (push) -# upstream git@github.com:lewagon/data-engineering-challenges.git (fetch) -# upstream git@github.com:lewagon/data-engineering-challenges.git (push) +``` + +Should return: + +``` +origin git@github.com:/data-engineering-challenges.git (fetch) +origin git@github.com:/data-engineering-challenges.git (push) +upstream git@github.com:lewagon/data-engineering-challenges.git (fetch) +upstream git@github.com:lewagon/data-engineering-challenges.git (push) ``` From challenge folder root **on the vm**, we'll run `make install`, which triggers 3 operations: diff --git a/WINDOWS.md b/WINDOWS.md index 2ec99aa..f012d81 100644 --- a/WINDOWS.md +++ b/WINDOWS.md @@ -6,6 +6,42 @@ A part of the setup will be done on your **local machine** but most of the confi Please **read instructions carefully and execute all commands in the following order**. If you get stuck, don't hesitate to ask a teacher for help :raising_hand: +This setup is largely automated with **Terraform** and **Ansible**. There are three main components to the setup! **Terraform** and **ansible** are _Infrastructure as Code_ tools. +- **Terraform** excels at creating and destroying cloud resources, like virtual machines, IP addresses, databases and more! +- **Ansible** is used to configure linux machines with specific settings and software. Perfect for fine-tuning the Virtual Machine you will be creating! + +## Part 1: Setup your local computer + +In this section you'll setup your local computer and create some accounts. It will include things like: +1. Install some communication tools: Zoom, Slack +2. Create some accounts: Github, Google Cloud Platform (GCP) +3. Install Visual Studio Code (VS Code) +4. 
Install and authenticate the GCP command line tool: `gcloud` +5. Install **terraform** on your local computer +6. Create your virtual machine with **terraform** and connect to it with **VS Code**! + +## Part 2: Configure your Virtual Machine Part 1 + +All parts of this section happen on your virtual machine. + +This section includes: +1. Authenticate your virtual machine with `gcloud` +2. Download and run an **ansible** playbook to partially configure your virtual machine +3. Log in to the GitHub command line tool on your virtual machine +4. Copy the Le Wagon recommended **dotfiles**. **Dotfiles** are settings that will enhance your terminal and developer experience! + +## Part 3: Configure your Virtual Machine Part 2 + +All parts of this section happen on your virtual machine. + +In this section you will: +1. Download and run a second **ansible** playbook for some more fine tuning +2. Test your setup to make sure that everything has installed correctly +3. Create isolated Python environments for all your challenges + + +Don't worry, we'll go into more detail in each of the individual sections. + Let's start :rocket: @@ -89,62 +125,15 @@ Have you signed up to GitHub? If not, [do it right away](https://github.com/join :point_right: **[Enable Two-Factor Authentication (2FA)](https://docs.github.com/en/authentication/securing-your-account-with-two-factor-authentication-2fa/configuring-two-factor-authentication#configuring-two-factor-authentication-using-text-messages)**. GitHub will send you text messages with a code when you try to log in. This is important for security and also will soon be required in order to contribute code on GitHub. -## SSH key +## Chrome - your browser -We want to safely communicate with your virtual machine using [SSH protocol](https://en.wikipedia.org/wiki/Secure_Shell). We need to generate a SSH key to authenticate. +Install the Google Chrome browser if you haven't got it already and set it as your __default browser__. 
-- Open your terminal +Follow the steps for your system from this link :point_right: [Install Google Chrome](https://support.google.com/chrome/answer/95346?co=GENIE.Platform%3DDesktop&hl=en-GB) -
- 💡 Windows tip - -We highly recommend installing [Windows Terminal](https://apps.microsoft.com/store/detail/windows-terminal/9N0DX20HK701?hl=fr-fr&gl=FR) from the Windows Store (installed on Windows 11 by default) to perform this operation -
+__Why Chrome?__ -- Create a SSH key - -
- Windows - -```bash -# replace "your_email@example.com" with your GCP account email -ssh-keygen.exe -t ed25519 -C "your_email@example.com" -``` -
- -
- MacOS & Linux - -```bash -# replace "your_email@example.com" with your GCP account email -ssh-keygen -t ed25519 -C "your_email@example.com" -``` -
- - -You should get the following message: `> Generating public/private algorithm key pair.` -- When you are prompted `> Enter a file in which to save the key`, press Enter -- You should be asked to `Enter a passphrase` - this is optional if you want additional security. To continue without a passphrase press enter without typing anything when asked to enter a passphrase. - -ℹ️ Don't worry if nothing prompt when you type, that is perfectly normal for security reasons. - -- You should be asked to `Enter same passphrase again`, do it. - -**❗️ You must remember this passphrase.** - -
- ❗️ /home/your_username/.ssh/id_ed25519 already exists. -If you receive this message, you may already have an SSH Key with the same name (if you are a Le Wagon Alumni or are using SSH Authentication with Github). - -To create a separate SSH key to exclusively use for this bootcamp use the following: - -```bash -# replace "your_email@example.com" with your GCP account email -ssh-keygen -t ed25519 -f ~/.ssh/de-bootcamp -C "your_email@example.com" -``` - -Your new SSH Key will be named `de-bootcamp`. Make sure to remember it for later! -
+We recommend using it as your default browser as it's the most compatible with testing or running your code, as well as working with Google Cloud Platform. Firefox is an alternative; however, we don't recommend using other browsers like Opera, Internet Explorer or Safari. ## Google Cloud Platform setup @@ -287,281 +276,467 @@ Go to your project [APIs dashboard](https://console.cloud.google.com/apis/dashbo - Compute Engine is now enabled on your project -## Virtual Machine (VM) +## Visual Studio Code -**👌 Note: Skip to the next section if you already have a VM set up** +### Installation -_Note: The following section requires you already have a [Google Cloud Platform](https://cloud.google.com/) account associated with an active [Billing account](https://console.cloud.google.com/billing)._ +Let's install [Visual Studio Code](https://code.visualstudio.com) text editor. -- Go to console.cloud.google.com > > Compute Engine > VM instances > Create instance -- Name it `lewagon-data-eng-vm-`, replace `` with your own, e.g. `krokrob` -- Region `europe-west1`, choose the closest one among the [available regions](https://cloud.google.com/compute/docs/regions-zones#available) +- Go to [Visual Studio Code download page](https://code.visualstudio.com/download). +- Click on "Windows" button +- Open the file you have just downloaded. +- Install it with few options: - gcloud-console-vm-create-instance -- In the section `Machine configuration` under the sub-heading `Machine type` -- Select General purpose > PRESET > e2-standard-4 +![VS Code installation options](https://github.com/lewagon/setup/blob/master/images/windows_vscode_installation.png) - gcloud-console-vm-e2-standard4 -- Boot disk > Change - - Operating system > Ubuntu - - Version > Ubuntu 22.04 LTS x86/64 - - Boot disk type > Balanced persistent disk - - Size > upgrade to 150GB +When the installation is finished, launch VS Code. 
- gcloud-console-vm-ubunt -- Open `Networking, Disks, ...` under `Advanced options` -- Open `Networking` - gcloud-console-vm-networking -- Go to `Network interfaces` and click on `default default (...)` with a downward arrow on the right. +### VS Code Remote SSH Extension - gcloud-console-vm-network-interfaces -- This opened a box `Edit network interface` -- Go to the dropdown `External IPv4 address`, click on it, click on `RESERVE STATIC EXTERNAL IP ADDRESS` +We need to connect VS Code to a virtual machine in the cloud so you will only work on that machine during the bootcamp. A pretty useful [**Remote SSH Extension**](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-ssh) is available on the VS Code Marketplace. - gcloud-console-vm-create-static-ip -- Give it a name, like "lewagon-data-eng-vm-ip-" (replace `` with your own) and description "Le Wagon - Data Engineering VM IP". This will take a few seconds. +- Open VS Code > Open the [command palette](https://code.visualstudio.com/docs/getstarted/userinterface#_command-palette) > Type `Extensions: Install Extensions` - gcloud-console-reserve-static-ip +VSCode extensions - Search - Remote -- You will now have a public IP associated with your account, and later to your VM instance. Click on `Done` at the bottom of the section `Edit network interface` you were in. +- Install the extension - gcloud-console-new-external-ip +VS Code extensions - Remote - Details -### Public SSH key -- Open the `Security` section +That's the only extension you should install on your _local_ machine, we will install additional VS Code extensions on your _virtual machine_. 
- gcloud-console-vm-security -- Open the `Manage access` subsection - gcloud-console-manage-access -- Go to `Add manually generated SSH keys` and click `Add item` +## Google Cloud CLI - gcloud-console-add-manual-ssh-key -- In your terminal display your public SSH key: - - Windows: navigate to where you created your SSH key and open `id_ed25519.pub` +The `gcloud` Command Line Interface (CLI) is used to communicate with Google Cloud Platform services through your terminal. - - Mac/Linux users can use: - ```bash - cat ~/.ssh/id_ed25519.pub - # OR cat ~/.ssh/de-bootcamp.pub if you created a unique key - ``` -- Copy your public SSH key and paste it: +### Install gcloud - gcloud-console-add-ssh-key-pub -- On the right hand side you should see - gcloud-console-vm-price-month -- You should be good to go and click `CREATE` at the bottom +To install, download the Google Cloud CLI installer from this [link here 🔗](https://cloud.google.com/sdk/docs/install#windows). - gcloud-console-vm-create -- It will take a few minutes for your virtual machine (VM) to be created. Your instance will show up like below when ready, with a green circled tick, named `lewagon-data-eng-vm-krokrob` (`krokrob` being replaced by your GitHub username). +Once it's finished downloading, launch the installer and follow the prompts. You only need to install `gcloud` for the current user. - gcloud-console-vm-instance-running -- Click on your instance +On the last screen of the installer there will be four check boxes. Make sure that the boxes for `Start Google SDK Shell` and `Run gcloud init to configure the Google Cloud CLI` are selected, then click **Finish**. This should open a new **Command Prompt** window and ask a series of questions like: +- **Do you want to log in?** - type `y`, hit enter, and follow the prompts. It should open a web browser to log in to your Google account. 
+- **Pick cloud project to use** - Select the GCP Project ID that you want to connect with `gcloud` +- **Select your region and zone** - You can safely enter `n`. It's not important to us at the moment. - gcloud-console-vm-running -- Go down to the section `SSH keys`, and write down your username (you need it for the next section) +Once you've completed the `gcloud` setup, close **Command Prompt** and re-open it, then run: - gcloud-console-vm-username +```bash +gcloud config list +``` -Congrats, your virtual machine is up and running, it is time to connect it with VS Code! +You should get an output similar to: +``` +[accessibility] +screen_reader = True/False # depends on install options +[core] +account = your_email@domain.com +disable_usage_reporting = True/False # depends on install options +project = your_gcp_project -## Visual Studio Code +Your active configuration is: [default] +``` -### Installation +Now `gcloud` is installed and authenticated 🚀 -Let's install [Visual Studio Code](https://code.visualstudio.com) text editor. -- Go to [Visual Studio Code download page](https://code.visualstudio.com/download). -- Click on "Windows" button -- Open the file you have just downloaded. -- Install it with few options: +### Application Default Credentials -![VS Code installation options](https://github.com/lewagon/setup/blob/master/images/windows_vscode_installation.png) +Application Default Credentials are for authenticating our **code** (Terraform and Python 🐍) to interact with Google services and resources. It's a small distinction between `gcloud` and **code**, but an important one. -When the installation is finished, launch VS Code. +To authenticate your **Application Default Credentials**, in your terminal run: +```bash +gcloud auth application-default login +``` -### VS Code Remote SSH Extension +And follow the prompts. It should open a web page to log in to your Google account. 
-We need to connect VS Code to a virtual machine in the cloud so you will only work on that machine during the bootcamp. A pretty useful [**Remote SSH Extension**](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-ssh) is available on the VS Code Marketplace. -- Open VS Code > Open the [command palette](https://code.visualstudio.com/docs/getstarted/userinterface#_command-palette) > Type `Extensions: Install Extensions` +## Terraform -VSCode extensions - Search - Remote +Terraform is an Infrastructure as Code (IaC) tool for creating (and destroying) resources in the cloud. -- Install the extension +### Download -VS Code extensions - Remote - Details +To install terraform, download the **zip archive** from the Terraform install page at this [link here 🔗](https://developer.hashicorp.com/terraform/install). -That's the only extension you should install on your _local_ machine, we will install additional VS Code extensions on your _virtual machine_. +❗ If you are using Windows 10 or 11, download the **AMD64** version (64-bit version). + +1. Use File Explorer to go to the location where you downloaded the **terraform zip archive** + +2. **Unzip** the archive and two files should appear: `terraform.exe` and `license.txt`. + +3. Copy `terraform.exe` + +4. Navigate to your home directory (`C:\Users\\`) and create a directory named `cli_apps` + +5. Paste `terraform.exe` in the `cli_apps` directory + +### Add terraform to PATH -### Virtual Machine connection +We need to manually add **Terraform** to the `PATH` environment variable. The `PATH` variable contains a list of directories that your computer looks in for programs that we run from the command prompt. -- Open VS Code > Open the [command palette](https://code.visualstudio.com/docs/getstarted/userinterface#_command-palette) > Type `Remote-SSH: Connect to Host...` +To update your path: +1. Open Windows Search and search for: **Environment Variables** -vscode-connect-to-host +2. 
Click **Environment Variables** or **Edit environment variables for your account** -- Click on `Add a new host` -- Type `ssh -i @`, for instance, my username is `somedude`, my private SSH key is located at `~/.ssh/id_rsa` on my local computer, my VM has a public IP of `34.77.50.76`: I'll type `ssh -i ~/.ssh/id_rsa somedude@34.77.50.76` +3. Click **New** at the top right of this window -vscode-ssh-connection-command +4. Enter: `C:\Users\YOUR_USERNAME\cli_apps` - Make sure to replace `YOUR_USERNAME` with your computer's username. +5. Click **Ok** to close the `Path` variable window, and click **Ok** again to close the Environment Variable window. -- When prompted to `Select SSH configuration file to update`, pick the one in your home directory, under the `.ssh` folder, `~/.ssh/config` basically. Usually VS Code will pick automatically the best option, so their default should work. +6. Close **Command Prompt** and open it again -vscode-add-host-ssh-config +Verify the installation with: + +```bash +terraform --version +``` + + +## Provisioning your Virtual Machine with Terraform + +You can create Cloud Resources like Virtual Machines in different ways: +- Through the Google Cloud [Compute Engine Console 🔗](https://console.cloud.google.com/compute/overview) +- Using `gcloud` +- With **Infrastructure as Code** tools like Terraform + +We'll be creating our Virtual Machine with Terraform. + +We're almost at the point of creating your Virtual Machine. + +The specifications of the Virtual Machine and Network Settings you'll use for the bootcamp are: +- Operating System: Ubuntu 22.04 LTS +- CPU: 4 Virtual CPU cores (2 physical CPU cores) +- RAM: 16 GB +- Storage (Persistent Disk): 100 GB balanced +- Static External IP address - so it's easier to log in. -- You should get a pop-up on the bottom right notifying you the host has been added +### Cost 💸 -vscode-host-added +Creating and running a Virtual Machine on Google Cloud Platform costs money! 
-- Open again the [command palette](https://code.visualstudio.com/docs/getstarted/userinterface#_command-palette) > Type `Remote-SSH: Connect to Host...` > Pick your VM IP address +If you have created a new Google Cloud Platform account, the cost of the Virtual Machine will be covered by the $300 USD credit for the first 90 days if you are diligent with turning off your Virtual Machine (or finish the _Linux and Bash_ challenge today 😎). -vscode-add-new-host +❗ **The cost of running a Virtual Machine with our configuration 24 hours a day, 7 days a week is ~$150 USD per month.** ❗ -- The first time, VSCode might ask you for a security permission like below, say yes / continue. +You can massively reduce the cost by only running the Virtual Machine when you use it. You will _NOT_ be charged for the vCPUs and RAM while the Virtual Machine is off! -vscode-remote-connection-confirm +You will always pay for the Storage (equivalent of the hard drive on your local computer). It's ~$10 USD per month for 100 GB. -- Open again the [command palette](https://code.visualstudio.com/docs/getstarted/userinterface#_command-palette) > Type `Terminal: Create New Terminal (in active workspace)` > You now have a Bash terminal in your virtual machine! +The rule of thumb is: if Google can rent the resource out to someone else when you're not using it, you only pay for it when you are using the resource. That's why you don't pay for the CPU and RAM when you are not using them: Google can rent them out to someone else. You always pay for Storage because Google can't rent it out to someone else while it has your data on it. + +### Download terraform files + +We almost have all the necessary parts to create your VM using **terraform**. We need to download the terraform files and change a few values. 
+ +First we'll create a folder and download the terraform files with: + +Using the Command Prompt (cmd), run the following: + +```cmd +mkdir %USERPROFILE%\wagon-de-bootcamp + +curl -L -o "%USERPROFILE%\wagon-de-bootcamp\main.tf" https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/main.tf + +curl -L -o "%USERPROFILE%\wagon-de-bootcamp\provider.tf" https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/provider.tf + +curl -L -o "%USERPROFILE%\wagon-de-bootcamp\variables.tf" https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/variables.tf + +curl -L -o "%USERPROFILE%\wagon-de-bootcamp\terraform.tfvars" https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/terraform.tfvars + +curl -L -o "%USERPROFILE%\wagon-de-bootcamp\.terraform.lock.hcl" https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/.terraform.lock.hcl +``` -vscode-command-palette-new-terminal -
-vscode-terminal -- Still on your *local* computer, lets create a more readable version of your machine to connect to! +### Set variables + +Open up the file `C:\Users\\wagon-de-bootcamp\terraform.tfvars` in VS Code or any other code editor. + +It should look like: ```bash -code ~/.ssh/config +project_id = "" +region = "" +zone = "" +instance_name = "" +instance_user = "" ``` -You should see something like the following: +We'll need to change some values in this file. Here's where you can find the required values: +- **project_id:** from the GCP Console at this [link here](https://console.cloud.google.com). +- **region:** take a look at the GCP Region and Zone documentation at this [link here](https://cloud.google.com/compute/docs/regions-zones). We strongly recommend you choose the closest geographical region. +- **zone:** Zone is a subset of region. It is almost always the same as **region** appended with `-a`, `-b`, or `-c`. +- **instance_name:** we recommend naming your VM: `lw-de-vm-`, replacing `` with your GitHub username. +- **instance_user:** in Command Prompt, run `echo %username%` + +After completing this file, it should look similar to: ```bash -Host - HostName - IdentityFile - User +project_id = "wagon-bootcamp" +region = "europe-west1" +zone = "europe-west1-b" +instance_name = "lw-de-vm-tswift" +instance_user = "taylorswift" ``` -You can now change Host to whatever you would like to see as the name of your connection or in terminal with `ssh `! -❗️ It is important that the `Host` alias does not contain any whitespaces ❗️ +Make sure to save the `terraform.tfvars` file, then navigate into the directory with the terraform files with: + +``` +cd %USERPROFILE%\wagon-de-bootcamp +``` + +And initialise and test the files with: ```bash -# For instance -Host "de-bootcamp-vm" - HostName 34.77.50.76 # replace with your VM's public IP address - IdentityFile - User +terraform init + +terraform plan ``` -**The setup of your local machine is over. 
All following commands will be run from within your 🚨 virtual machine**🚨 terminal (via VS code for instance) +And check the output. Towards the bottom there should be a line: + +``` +Plan: 2 to add, 0 to change, 0 to destroy +``` + +We'll be adding: +- A compute engine instance +- A static external IP address + +❗ If you have any errors, read the error and debug. If you need some help, raise a ticket with a teacher. + +If everything was successful, create your VM with: + +```bash +terraform apply -auto-approve +``` + +It might take a while for Terraform to create the cloud resources. Once you see: + +``` +Apply complete! Resources: 2 added, 0 changed, 0 destroyed. +``` +Your Virtual Machine should be up and running! Check the GCP Compute Engine console at this [link here](https://console.cloud.google.com/compute/instances) to confirm. -## VS Code Extensions -Let's install some useful extensions to VS Code. +## Virtual Machine connection -- Open your VS Code instance and make sure you're connected to the remote server. At the bottom left, you'll see: +### Create SSH keys -vscode-ssh +We need to connect VS Code to our Virtual Machine in the cloud so you will only work on that machine during the bootcamp. We'll use the [Remote - SSH Extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-ssh) that we previously installed. 
-- Open the VS Code terminal (`CMD` + `` ` `` or `CTRL` + `` ` ``) then run the following commands: +To create the VS Code SSH configuration, run the following in your terminal: ```bash -code --install-extension ms-vscode.sublime-keybindings -code --install-extension emmanuelbeziat.vscode-great-icons -code --install-extension ms-python.python -code --install-extension KevinRose.vsc-python-indent -code --install-extension ms-python.vscode-pylance -code --install-extension redhat.vscode-yaml -code --install-extension ms-azuretools.vscode-docker -code --install-extension tamasfe.even-better-toml +gcloud compute config-ssh ``` -Here is a list of the extensions you are installing: -- [Sublime Text Keymap and Settings Importer](https://marketplace.visualstudio.com/items?itemName=ms-vscode.sublime-keybindings) -- [VSCode Great Icons](https://marketplace.visualstudio.com/items?itemName=emmanuelbeziat.vscode-great-icons) -- [Python](https://marketplace.visualstudio.com/items?itemName=ms-python.python) -- [Python Indent](https://marketplace.visualstudio.com/items?itemName=KevinRose.vsc-python-indent) -- [Pylance](https://marketplace.visualstudio.com/items?itemName=ms-python.vscode-pylance) -- [YAML](https://marketplace.visualstudio.com/items?itemName=redhat.vscode-yaml) -- [Docker](https://marketplace.visualstudio.com/items?itemName=ms-azuretools.vscode-docker) -- [Even Better TOML](https://marketplace.visualstudio.com/items?itemName=tamasfe.even-better-toml) +`gcloud` may tell you it needs to create a directory to continue. Accept and you should get an output similar to: +```bash +You should now be able to use ssh/scp with your instances. +For example, try running: + + $ ssh lw-de-vm-tswift.europe-west1-b.wagon-bootcamp +# $ ssh lw-de-vm-.. 
+``` + +### SSH File Permissions + +Windows has strict permissions for SSH files by default, we need to alter some permissions on the SSH configuration that was created by `gcloud` so VS Code can read the files and manage the SSH connection. + +In Command Prompt run: + +```cmd +icacls %USERPROFILE%\.ssh\config /inheritance:r + +icacls %USERPROFILE%\.ssh\config /grant:r %USERNAME%:(R) + +icacls %USERPROFILE%\.ssh\config /grant:r SYSTEM:(R) + +icacls %USERPROFILE%\.ssh\config +``` + +And: + +```cmd +icacls %USERPROFILE%\.ssh\google_compute_engine /inheritance:r + +icacls %USERPROFILE%\.ssh\google_compute_engine /grant:r %USERNAME%:(R) + +icacls %USERPROFILE%\.ssh\google_compute_engine /grant:r SYSTEM:(R) + +icacls %USERPROFILE%\.ssh\google_compute_engine +``` + +### Connect with VS Code + +To connect to your Virtual Machine, click on the small symbol at the very bottom-left corner of VS Code: + +![](/images/vscode_remote_highlight.png) + +It should bring up a menu, click on **Connect to Host...**: -## Command line tools +![](/images/vscode_remote_menu.png) -### Zsh & Git +Click on the name of your Virtual Machine: -Instead of using the default `bash` [shell](https://en.wikipedia.org/wiki/Shell_(computing)), we will use `zsh`. +![](/images/vscode_remote_hosts.png) -We will also use [`git`](https://git-scm.com/), a command line software used for version control. +A new VS Code window will open. You may be asked to select the platform of the remote host, select **Linux**. You will then be asked to _fingerprint_ the connection. VS Code is asking if you trust the remote host you are trying to connect to. Hit enter to continue. -Let's install them, along with other useful tools: -- Open an **VS Code terminal** connected to your VM -- Copy and paste the following commands: +![](/images/vscode_remote_fingerprint.png) + +And you are connected! 
It should look similar to: + +![](/images/vscode_remote_connected.png) + +Notice the connection in the very bottom-left corner of your VS Code window. It should show the connection type (SSH) and the name of the host you are connected to. + +**The setup of your local machine is over. All following commands will be run from within your 🚨 virtual machine**🚨 terminal (via VS Code) +
+Viewing your SSH Configuration + +If you want to view your SSH configuration: +1. Start by clicking the symbol in the bottom-left corner of VS Code +2. Click on **Connect to Host...** +3. Click on **Configure SSH Hosts...** +4. Select the configuration file, usually the one at the top of the list. +5. View your configuration file! You may need to edit this configuration if you change computers, or want to work on more than one computer during the bootcamp. + +
+ + +## VM gcloud and Application Default Credentials + +We'll be doing some of the steps again, but that's because the virtual machine is a completely new computer! Luckily for us, `gcloud` comes pre-installed on the virtual machine. + + +### Authenticate gcloud + +We need to authenticate the `gcloud` CLI tool and set the project so it can interact with Google from the terminal. + +To authenticate `gcloud`, run: ```bash -sudo apt update -sudo apt install -y vim tmux tree git ca-certificates curl jq unzip zsh \ -apt-transport-https gnupg software-properties-common direnv sqlite3 make \ -postgresql postgresql-contrib build-essential libssl-dev zlib1g-dev \ -libbz2-dev libreadline-dev libsqlite3-dev wget llvm \ -libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev \ -gcc default-mysql-server default-libmysqlclient-dev libpython3-dev openjdk-8-jdk-headless +gcloud auth login ``` -These commands might ask for your password, if they do: type it in. +And follow the prompts. For pasting into the terminal, you might need to use CTRL + SHIFT + V. -:warning: When you type your password, nothing will show up on the screen, **that's normal**. This is a security feature to mask not only your password as a whole but also its length. Just type in your password and when you're done, press `Enter`. +You also need to set the GCP project that you are working in. For this section, you'll need your GCP Project ID, which can be found on the GCP Console at this [link here](https://console.cloud.google.com). Make sure you copy the _Project ID_ and **not** the _Project number_. -### GitHub CLI installation +To set your project, replace `` with your GCP Project ID and run: -Let's now install [GitHub official CLI](https://cli.github.com) (Command Line Interface). It's a software used to interact with your GitHub account via the command line. 
+```bash +gcloud config set project +``` -In your terminal, copy-paste the following commands and type in your password if asked: +Confirm your setup with: ```bash -curl -fsSL https://cli.github.com/packages/githubcli-archive-keyring.gpg | sudo dd of=/usr/share/keyrings/githubcli-archive-keyring.gpg -echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/githubcli-archive-keyring.gpg] https://cli.github.com/packages stable main" | sudo tee /etc/apt/sources.list.d/github-cli.list > /dev/null -sudo apt update -sudo apt install -y gh +gcloud config list ``` -To check that `gh` has been successfully installed on your machine, you can run: +You should get an output similar to: ```bash -gh --version +[core] +account = taylorswift@domain.com # Should be your GCP email +disable_usage_reporting = True +project = my-gcp-project # Should be your GCP Project ID + +Your active configuration is: [default] ``` -:heavy_check_mark: If you see `gh version X.Y.Z (YYYY-MM-DD)`, you're good to go :+1: -:x: Otherwise, please **contact a teacher** +### Application Default Credentials +Application Default Credentials are for authenticating our **code** (Terraform and Python 🐍) to interact with Google services and resources. It's a small distinction between `gcloud` and **code**, but an important one. -## Oh-my-zsh +To authenticate your **Application Default Credentials**, in your terminal run: -Let's install the `zsh` plugin [Oh My Zsh](https://ohmyz.sh/). +```bash +gcloud auth application-default login +``` -In a terminal execute the following command: +And follow the prompts. It should open a web-page to login to your Google account. + + +## VM configuration with Ansible + +We'll be using [Ansible](https://docs.ansible.com/ansible/latest/getting_started/introduction.html) to configure your Virtual Machine with some software, configurations, packages, and frameworks that you'll use in the bootcamp. + +Let's start by confirming that ansible is installed. 
In your terminal run: + +```bash +ansible --version +``` + +You should get an output similar to (some version numbers might differ, that's fine): + +``` +ansible [core 2.17.9] + config file = /etc/ansible/ansible.cfg + configured module search path = ['/home/tswift/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules'] + ansible python module location = /usr/lib/python3/dist-packages/ansible + ansible collection location = /home/tswift/.ansible/collections:/usr/share/ansible/collections + executable location = /usr/bin/ansible + python version = 3.12.3 (main, Feb 4 2025, 14:48:35) [GCC 13.3.0] (/usr/bin/python3) + jinja version = 3.1.2 + libyaml = True +``` + +❗ If not, raise a ticket with a teacher. + +### Ansible Playbook 1 + +Create a folder and download the ansible files: ```bash -sh -c "$(curl -fsSL https://raw.github.com/ohmyzsh/ohmyzsh/master/tools/install.sh)" +mkdir -p ~/vm-ansible-setup/playbooks + +curl -L -o ~/vm-ansible-setup/ansible.cfg https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/vm-ansible-setup/ansible.cfg +curl -L -o ~/vm-ansible-setup/hosts https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/vm-ansible-setup/hosts +curl -L -o ~/vm-ansible-setup/playbooks/setup_vm_part1.yml https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/vm-ansible-setup/playbooks/setup_vm_part1.yml ``` -If asked "Do you want to change your default shell to zsh?", press `Y` +And run with: -At the end your terminal should look like this: +```bash +cd ~/vm-ansible-setup +ansible-playbook playbooks/setup_vm_part1.yml +``` -![Ubuntu terminal with OhMyZsh](https://github.com/lewagon/setup/blob/master/images/oh_my_zsh.png) +And the playbook should start running! -:heavy_check_mark: If it does, you can continue :+1: +❗ If any errors occur, raise a ticket with a teacher. You can safely run the playbook again. 
-:x: Otherwise, please **ask for a teacher** +### What is the playbook installing? + +This playbook installs a few things. While it is running, let's go through them: +- Updating system packages. Ubuntu uses the `APT` package manager. +- Changing the default shell from **bash** to **zsh**, a more customizable shell that is extensible and looks great! +- Installing the **Oh-My-ZSH** plugin for the **zsh** shell. We'll use it a bit later to add some quality of life plugins and extensions for `zsh`. +- Installing **Docker** on your Virtual Machine. Docker is an open platform for developing, shipping, and running applications. You will use it throughout the bootcamp. +- Installing some **Kubernetes (k8s)** tooling: Kubernetes is a system designed for auto-scaling containerized applications. + - Installing **kubectl**: `kubectl` is the CLI tool for interacting with Kubernetes clusters. + - Installing **minikube**: Minikube is a way to quickly spin up a local Kubernetes cluster. Great for developing! +- Installing **terraform**: we've already installed it once, but we need to install it on our VM! **Terraform** is an Infrastructure as Code (IaC) tool. +- Installing the **GitHub CLI**: the CLI tool that we'll use to interact with your GitHub account directly from the terminal. + +The playbook also checks whether each item is already installed. This is so you can safely re-run the playbook without any problems. ## GitHub CLI @@ -616,120 +791,6 @@ gh auth status :x: If not, **contact a teacher**.
-## Google Cloud CLI - -Install the `gcloud` CLI to communicate with [Google Cloud Platform](https://cloud.google.com/) through your terminal: -```bash -echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list -sudo apt-get install apt-transport-https ca-certificates gnupg -curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - -sudo apt-get update && sudo apt-get install google-cloud-sdk -sudo apt-get install google-cloud-sdk-app-engine-python -``` -πŸ‘‰ [Install documentation](https://cloud.google.com/sdk/docs/install#deb) - -### Create a service account key πŸ”‘ - -**πŸ‘Œ Note: Skip to the next section if you already have a service account key** - -Now that you have created a `GCP account` and a `project` (identified by its `PROJECT_ID`), we are going to configure the actions (API calls) that you want to allow your code to perform. - -
- πŸ€” Why do we need a service account key ? - - - You have created a `GCP account` linked to your credit card. Your account will be billed according to your usage of the ressources of the **Google Cloud Platform**. The billing will occur if you consume anything once the free trial is over, or if you exceed the amount of spending allowed during the free trial. - - In your `GCP account`, you have created a single `GCP project`, identified by its `PROJECT_ID`. The `GCP projects` allow you to organize and monitor more precisely how you consume the **GCP** ressources. For the purpose of the bootcamp, we are only going to create a single project. - - Now, we need a way to tell which ressources within a `GCP project` our code will be allowed to consume. Our code consumes GCP ressources through API calls. - - Since API calls are not free, it is important to define with caution how our code will be allowed to use them. During the bootcamp this will not be an issue and we are going to allow our code to use all the API of **GCP** without any restrictions. - - In the same way that there may be several projects associated with a GCP account, a project may be composed of several services (any bundle of code, whatever its form factor, that requires the usage of GCP API calls in order to fulfill its purpose). - - GCP requires that the services of the projects using API calls are registered on the platform and their credentials configured through the access granted to a `service account`. - - For the moment we will only need to use a single service and will create the corresponding `service account`. -
- -Since the [service account](https://cloud.google.com/iam/docs/service-accounts) is what identifies your application (and therefore your GCP billing account and ultimately your credit card), you are going to want to be cautious with the next steps. - -⚠️ **Do not share you service account json file πŸ”‘** ⚠️ Do not store it on your desktop, do not store it in your git codebase (even if your git repository is private), do not let it by the coffee machine, do not send it as a tweet. - -- Go to the [service accounts page](https://console.cloud.google.com/apis/credentials/serviceaccountkey) -- Select your project in the list of recent projects if asked to -- Create a service account: - - Click on **CREATE SERVICE ACCOUNT**: - - Give a `Service account name` to that account - - Click on **CREATE AND CONTINUE** - - Click on **Select a role** and choose `Quick access/Basic` then **Owner**, which gives full access to all ressources - - Click on **CONTINUE** - - Click on **DONE** -- Download the service account json file πŸ”‘: - - Click on the newly created service account - - Click on **KEYS** - - Click on **ADD KEY** then **Create new key** - - Select **JSON** and click on **CREATE** - -![](images/gcp_create_key.png) - -The browser has now saved the service account json file πŸ”‘ in your downloads directory (it is named according to your service account name, something like `le-wagon-data-123456789abc.json`) - - -### Configure Cloud sdk - -- Open the service account json file with any text editor and copy the key - ``` - # It looks like: - { - "type": "service_account", - "project_id": "kevin-bootcamp", - "private_key_id": "1234567890", - "private_key": "-----BEGIN PRIVATE KEY-----\nXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX\n-----END PRIVATE KEY-----\n", - "client_email": "bootcamp@kevin-bootcamp.iam.gserviceaccount.com", - "client_id": "1234567890", - "auth_uri": "https://accounts.google.com/o/oauth2/auth", - "token_uri": "https://oauth2.googleapis.com/token", - 
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs", - "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/bootcamp%40kevin-bootcamp.iam.gserviceaccount.com" - } - ``` -- **on your Virtual Machine**, create a `~/.gcp_keys` directory, then create a json file in it: - ``` bash - mkdir ~/.gcp_keys - touch ~/.gcp_keys/le-wagon-de-bootcamp.json - ``` -- Open the json file then store the service account json file pasting the key: - ```bash - code ~/.gcp_keys/le-wagon-de-bootcamp.json - ``` - ![service account json key](images/service_account_json_key.png) - - ❗️Don't forget to **save** the file with `CMD` + `s` or `CTRL` + `s` - -- Authenticate the `gcloud` CLI with the google account you used for GCP - ```bash - # Replace service_account_name@project_id.iam.gserviceaccount.com with your own - SERVICE_ACCOUNT_EMAIL=service_account_name@project_id.iam.gserviceaccount.com - KEY_FILE=$HOME/.gcp_keys/le-wagon-de-bootcamp.json - gcloud auth activate-service-account $SERVICE_ACCOUNT_EMAIL --key-file=$KEY_FILE - ``` -- List your active account and check your email address you used for GCP is present - ```bash - gcloud auth list - ``` -- Set your current project - ```bash - # Replace `PROJECT_ID` with the `ID` of your project, e.g. `wagon-bootcamp-123456` - gcloud config set project PROJECT_ID - ``` -- List your active account and current project and check your project is present - ```bash - gcloud config list - ``` - - ## Dotfiles Let's pimp your zsh and and vscode by installing lewagon recommanded dotfiles **on your Virtual Machine** @@ -876,474 +937,344 @@ you don't want your email to appear in public repositories you may contribute to -### zsh default terminal +--- -Set `zsh` as your default VS Code terminal. +Once you have finished installing the **dotfiles**, kill your terminal (little trash can at the top right of the terminal window) and re-open it. 
You might have to do it a few times until it looks similar to: -- Open terminal default profile settings +![](/images/vscode_after_ansible1.png) - Terminal profile settings -- Select `zsh /usr/bin/zsh` +The terminal should read as `zsh`. - Terminal zsh profile +## VM configuration with Ansible - Part 2 -## Disable SSH passphrase prompt +### Ansible Playbook 2 -You don't want to be asked for your passphrase every time you communicate with a distant repository. So, you need to add the plugin `ssh-agent` to `oh my zsh`: +We'll be using a second **Ansible** playbook to further configure your Virtual Machine. -First, open the `.zshrc` file: +Start by downloading the ansible playbook: ```bash -code ~/.zshrc +curl -L -o ~/vm-ansible-setup/playbooks/setup_vm_part2.yml https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/vm-ansible-setup/playbooks/setup_vm_part2.yml ``` -Then: -- Spot the line starting with `plugins=` -- Add `ssh-agent` at the end of the plugins list - -:heavy_check_mark: Save the `.zshrc` file with `Ctrl` + `S` and close your text editor. - - -## Docker πŸ‹ - -Docker is an open platform for developing, shipping, and running applications. - -### Install Docker and Docker Compose - -Setup the dock apt repo +And run with: ```bash -sudo install -m 0755 -d /etc/apt/keyrings - -curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg - -sudo chmod a+r /etc/apt/keyrings/docker.gpg +cd ~/vm-ansible-setup +ansible-playbook playbooks/setup_vm_part2.yml ``` -```bash -echo \ - "deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \ - "$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \ - sudo tee /etc/apt/sources.list.d/docker.list > /dev/null -``` +And the playbook should start running! If you're asked if you want VS Code to behave more like Sublime Text, click accept. 
-Install the right packages +❗ If any errors occur, raise a ticket with a teacher. You can safely run the playbook again. -``` -sudo apt-get update -sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin -``` - -Finally give your user permission to use `docker` +
+❓ Why two Ansible playbooks? -```bash -sudo groupadd docker -sudo usermod -aG docker $USER -newgrp docker -``` +This second ansible playbook requires GitHub authorisation to fork the `lewagon/data-engineering-challenges` repository and it is also editing some of the Le Wagon recommended **dotfiles**. So we separated the process into two steps. +
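Because the second playbook needs GitHub authorisation, a quick pre-flight check can save a failed run. This is a minimal sketch, not something the playbook requires: `github_ready` is a hypothetical helper name, and it relies only on `gh auth status`, which exits with a non-zero status when you are not logged in.

```bash
# Hypothetical pre-flight check (not part of the Le Wagon setup): succeed only
# if the GitHub CLI is available and already authenticated.
github_ready() {
  command -v gh >/dev/null 2>&1 && gh auth status >/dev/null 2>&1
}

if github_ready; then
  echo "GitHub CLI is authenticated: the second playbook can fork the repository."
else
  echo "Not authenticated yet: run 'gh auth login' before the second playbook."
fi
```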
-Run `docker run hello-world`, you should see something like: +### What is the playbook installing? -
- ❗️ Permission denied while trying to connect to the Docker daemon socket. ❗️ +This playbook installs and configures a few things. While it is running, let's go through them: -If you receive an error similar to the one below, navigate to the [GCP Compute Engine Console](https://console.cloud.google.com/compute/instances) and shut down your VM by selecting the tick box next to your VM instance and clicking STOP (closing and reopening VSCode is not enough). +**Python and Poetry** -![](images/docker_permission_denied_socket.png) +Ubuntu 22.04 has Python pre-installed, but not the version we're going to use. We are going to use Python [3.12.8](https://www.python.org/downloads/release/python-3128/). -It will take a few minutes for your VM to turn off. Once it's fully off, turn your VM on again by checking the box next to the VM instance and clicking START. Give the VM a few minutes to fully start up and connect through VSCode. Once connected try `docker run hello-world` again. If you don't get an output similar to the below image, raise a ticket with a teacher. -
+- Install **pyenv** and **pyenv-virtualenv**. We'll use **pyenv** to manage the Python versions installed on the VM +- Install Python 3.12.8 with pyenv +- Install **pipx**: [Pipx](https://pipx.pypa.io/stable/) is used to install python packages we want _globally_ available while still using virtual environments, like Poetry! +- Installing a few global python packages with **pipx**: + - **Poetry:** [Poetry](https://python-poetry.org/) is a modern Python package manager we will use throughout the bootcamp. + - **Ruff:** [Ruff](https://docs.astral.sh/ruff/) is used to format and lint Python code. + - **tldr:** [tldr](https://github.com/tldr-pages/tldr) offers much more readable versions of `man` pages. Useful for quickly finding out how a program works. -![](images/docker_hello.png) - -### Enable Artifact Registry API - -**👌 Note: Skip to the next section if you already have an Artifact Registry repository** - -[Artifact Registry](https://cloud.google.com/artifact-registry) is a GCP service you will use to store artifacts such as Docker images. The storage units are called repositories. - -- Enable the service within your project using the `gcloud` CLI: - ```bash - gcloud services enable artifactregistry.googleapis.com - ``` -- Create a new Docker repository: - ```bash - # Set the repository name - REPOSITORY=docker-hub - # Set the location of the repository. Available locations: gcloud artifacts locations list - LOCATION=europe-west1 - gcloud artifacts repositories create $REPOSITORY \ - --repository-format=docker \ - --location=$LOCATION \ - --description="Docker images storage" - ``` - -### Gcloud authentication for Docker - -You need to grant Docker access to push artifacts to (and pull from) your repository. There are different authentication methods, [gcloud credentials helper](https://cloud.google.com/artifact-registry/docs/docker/authentication#gcloud-helper) being the easiest.
- -- Define the repository hostname matching the repository `$LOCATION`: - ```bash - # If $LOCATION is "europe-west1" - HOSTNAME=europe-west1-docker.pkg.dev - ``` -- Configure gcloud credentials helper: - ```bash - gcloud auth configure-docker $HOSTNAME - ``` -- Type `y` to accept the configuration -- Check your credentials helper is set: - ```bash - cat ~/.docker/config.json - ``` - You should get: - ```bash - { - "credHelpers": { - "europe-west1-docker.pkg.dev": "gcloud" - } - }% - ``` - - -## Kubernetes -Kubernetes (K8s) is a system designed to make deploying auto-scaling containerized applications easily. - -### Install kubectl -Kubectl is the cli for interacting with k8s! - -https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/ +**VS Code Configuration** -```bash -curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" -curl -LO "https://dl.k8s.io/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl.sha256" +- Installing some **VS Code** extensions, but only on your VM. Here's a list of the extensions that are being installed: + - [Sublime Text Keymap and Settings Importer](https://marketplace.visualstudio.com/items?itemName=ms-vscode.sublime-keybindings) + - [VSCode Great Icons](https://marketplace.visualstudio.com/items?itemName=emmanuelbeziat.vscode-great-icons) + - [Python](https://marketplace.visualstudio.com/items?itemName=ms-python.python) + - [Python Indent](https://marketplace.visualstudio.com/items?itemName=KevinRose.vsc-python-indent) + - [Pylance](https://marketplace.visualstudio.com/items?itemName=ms-python.vscode-pylance) + - [YAML](https://marketplace.visualstudio.com/items?itemName=redhat.vscode-yaml) + - [Docker](https://marketplace.visualstudio.com/items?itemName=ms-azuretools.vscode-docker) + - [Even Better TOML](https://marketplace.visualstudio.com/items?itemName=tamasfe.even-better-toml) +- Update the VS Code Python Interpreter path. 
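Once the playbook has finished, you may want to confirm that the extensions listed above are present. The sketch below is an illustration only: `check_extensions` is a hypothetical helper name, and the extension IDs are taken from the `itemName=` part of the Marketplace links above. It reads a list of installed IDs on standard input, such as the output of `code --list-extensions`.

```bash
# Hypothetical helper: read installed extension IDs on stdin and report
# whether each expected ID (taken from the Marketplace links) is present.
check_extensions() {
  local installed
  installed="$(cat)"
  local missing=0
  local ext
  for ext in \
    ms-vscode.sublime-keybindings \
    emmanuelbeziat.vscode-great-icons \
    ms-python.python \
    KevinRose.vsc-python-indent \
    ms-python.vscode-pylance \
    redhat.vscode-yaml \
    ms-azuretools.vscode-docker \
    tamasfe.even-better-toml; do
    # -i: ignore case, -x: match the whole line, -F: treat the ID as a fixed string
    if printf '%s\n' "$installed" | grep -qixF "$ext"; then
      echo "ok: $ext"
    else
      echo "missing: $ext"
      missing=$((missing + 1))
    fi
  done
  return "$missing"                # 0 means every extension was found
}

# Usage in the VS Code terminal on the VM:
#   code --list-extensions | check_extensions
```

The comparison ignores case so that it does not matter how the installed IDs are capitalised.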
-echo "$(cat kubectl.sha256) kubectl" | sha256sum --check +**Shell and System Configuration** -sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl +- Create the **direnv** poetry function. The same one from the lecture! This makes it easier to work with poetry. +- Adding some **Oh-My-ZSH** plugins by modifying your `.zshrc` file. Here's a list of the extra plugins: + - **pyenv**: Auto-complete for pyenv, a tool used to manage Python versions + - **gcloud**: Auto-complete for the gcloud CLI tool + - **ssh-agent**: Remembers your SSH key passphrase so you only have to enter it once per session. + - **direnv**: A tool to load `.envrc` files when you `cd` into a directory. Great for loading environment variables. +- Installing **Spark**: Spark is a distributed data processing framework -kubectl version --client -kubectl version --client --output=yaml -``` +**Data Engineering Challenges Repository** -### Install minikube +The challenges that you'll be working on throughout the bootcamp! The playbook forks the **data-engineering-challenges** repository from **lewagon** to your own GitHub user, then clones that repository from your GitHub account onto your Virtual Machine. -Minikube is a way to quickly spin up a local kubernetes cluster! +### Restart Virtual Machine -https://minikube.sigs.k8s.io/docs/start/ +Once the playbook has finished running, you need to completely shut down your Virtual Machine so that some of the configuration updates (specifically **pyenv** and **Docker**) take effect. -```bash -curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 -sudo install minikube-linux-amd64 /usr/local/bin/minikube -``` +To shut down your VM, navigate to the GCP Compute Engine Instances [console page 🔗](https://console.cloud.google.com/compute/instances).
-### Test installation -To test that you can launch a cluster run: -```bash -minikube start -``` -you should see your cluster booting up : +Select your VM instance and click on the stop button: -![](images/minikube_start.png) +![](/images/gcp_vm_stop.png) -Then to check the cluster run: -```bash -kubectl get po -A -``` -you should be able to see your cluster running! : +Wait for a few minutes until the VM shows that it is completely off. You may need to refresh the page, as the GCP Console doesn't update dynamically. -![](images/minikube_base.png) +When the VM is completely off, turn it on again by selecting the check box next to your instance and clicking **START/RESUME**. Give it a minute to spin up, then connect via VS Code. -To tear it all down for now: -```bash -minikube delete --all -``` +## Check your Virtual Machine Setup +We've used two ansible playbooks to configure our Virtual Machine. Let's run some manual checks in the terminal to make sure that everything has installed correctly. -## Terraform +❗ If any of these checks error out, raise a ticket with a teacher. -Terraform is a tool for infrastructure as code (IAC) to define resources to create in the cloud! +#### Python -### Install terraform +🧪 To test: -Install some basic requirements ```bash -sudo apt-get update && sudo apt-get install -y gnupg software-properties-common +python --version ``` -Terraform is not avaliable to apt by default so we need to make it avaliable!
-```bash -wget -O- https://apt.releases.hashicorp.com/gpg | \ - gpg --dearmor | \ - sudo tee /usr/share/keyrings/hashicorp-archive-keyring.gpg > /dev/null -``` +Should return: -```bash -gpg --no-default-keyring \ - --keyring /usr/share/keyrings/hashicorp-archive-keyring.gpg \ - --fingerprint ``` - -```bash -echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] \ - https://apt.releases.hashicorp.com $(lsb_release -cs) main" | \ - sudo tee /etc/apt/sources.list.d/hashicorp.list +Python 3.12.8 ``` -Now we can install terraform directly with apt 👇 -```bash -sudo apt update -sudo apt-get install terraform -``` +#### Pyenv -Verify the installation with: +🧪 To test: ```bash -terraform --version +pyenv versions ``` +Should return: +``` + system +* 3.12.8 (set by /home//.pyenv/version) +``` -## Spark +Note: There should be an `*` next to 3.12.8. -Spark is a data processing framework: +#### Pipx -Move to your home directory: +🧪 To test: ```bash -cd ~ +pipx list ``` -Download spark: +Should return something similar to: -```bash -wget https://archive.apache.org/dist/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz ``` - -Open the tarball: - -```bash -mkdir -p ~/spark && tar -xzf spark-3.5.3-bin-hadoop3.tgz -C ~/spark +venvs are in /home//.local/share/pipx/venvs +apps are exposed on your $PATH at /home//.local/bin +manual pages are exposed at /home//.local/share/man + package poetry 2.1.1, installed using Python 3.12.8 + - poetry + package ruff 0.11.0, installed using Python 3.12.8 + - ruff + package tldr 3.3.0, installed using Python 3.12.8 + - tldr + - man1/tldr.1 ``` -Set the environment variables needed by spark: - -```bash -echo "export SPARK_HOME=$HOME/spark/spark-3.5.3-bin-hadoop3" >> .zshrc -echo 'export PATH=$PATH:$SPARK_HOME/bin' >> .zshrc -``` +#### Docker -Let's restart our shell: +🧪 To test: ```bash -exec zsh +docker run hello-world ``` -Test Spark works by running: +Should return: -```bash -spark-shell ``` +Unable to find image
'hello-world:latest' locally +latest: Pulling from library/hello-world +e6590344b1a5: Pull complete +Digest: sha256:7e1a4e2d11e2ac7a8c3f768d4166c2defeb09d2a750b010412b6ea13de1efb19 +Status: Downloaded newer image for hello-world:latest -You should see an output similar to: +Hello from Docker! +This message shows that your installation appears to be working correctly. -```bash -Setting default log level to "WARN". -To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). -25/01/15 11:33:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable -Spark context Web UI available at http://de-vm-lrae-test.europe-north1-b.c.wagon-de.internal:4040 -Spark context available as 'sc' (master = local[*], app id = local-1736940788403). -Spark session available as 'spark'. -Welcome to - ____ __ - / __/__ ___ _____/ /__ - _\ \/ _ \/ _ `/ __/ '_/ - /___/ .__/\_,_/_/ /_/\_\ version 3.5.3 - /_/ - -Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 1.8.0_432) -Type in expressions to have them evaluated. -Type :help for more information. +To generate this message, Docker took the following steps: + 1. The Docker client contacted the Docker daemon. + 2. The Docker daemon pulled the "hello-world" image from the Docker Hub. + (amd64) + 3. The Docker daemon created a new container from that image which runs the + executable that produces the output you are currently reading. + 4. The Docker daemon streamed that output to the Docker client, which sent it + to your terminal. -scala> -``` -Type `:quit` and hit enter to exit the spark-shell and continue. 
+To try something more ambitious, you can run an Ubuntu container with: + $ docker run -it ubuntu bash +Share images, automate workflows, and more with a free Docker ID: + https://hub.docker.com/ -## Python & Pip +For more examples and ideas, visit: + https://docs.docker.com/get-started/ +``` -Ubuntu 22.04 has Python pre-installed, but not the version we're going to use. We are going to use Python 3.12 ([3.12.8](https://www.python.org/downloads/release/python-3128/)). +#### Kubernetes -Let's install pyenv to manage our python versions: +We can start by testing `minikube`: ```bash -git clone https://github.com/pyenv/pyenv.git ~/.pyenv -source ~/.zprofile -exec zsh +# Start +minikube start ``` -We'll also install a useful `pyenv` plugin called [`pyenv-virtualenv`](https://github.com/pyenv/pyenv-virtualenv). Although we will be using `poetry` for Python package and virtual environment management, `pyenv-virtualenv` is useful for controlling python versions locally. +Should return: -```bash -git clone https://github.com/pyenv/pyenv-virtualenv.git $(pyenv root)/plugins/pyenv-virtualenv -exec zsh ``` - -Now install Python 3.12.8: -```bash -pyenv install 3.12.8 -pyenv global 3.12.8 +😄 minikube v1.35.0 on Ubuntu 22.04 (amd64) +✨ Automatically selected the docker driver. Other choices: none, ssh +📌 Using Docker driver with root privileges +👍 Starting "minikube" primary control-plane node in "minikube" cluster +🚜 Pulling base image v0.0.46 ... +💾 Downloading Kubernetes v1.32.0 preload ... + > gcr.io/k8s-minikube/kicbase...: 500.31 MiB / 500.31 MiB 100.00% 88.19 M + > preloaded-images-k8s-v18-v1...: 333.57 MiB / 333.57 MiB 100.00% 32.20 M +🔥 Creating docker container (CPUs=2, Memory=3900MB) ... +🐳 Preparing Kubernetes v1.32.0 on Docker 27.4.1 ... + ▪ Generating certificates and keys ... + ▪ Booting up control plane ... + ▪ Configuring RBAC rules ... +🔗 Configuring bridge CNI (Container Networking Interface) ... +🔎 Verifying Kubernetes components...
+ ▪ Using image gcr.io/k8s-minikube/storage-provisioner:v5 +🌟 Enabled addons: storage-provisioner, default-storageclass +🏄 Done! kubectl is now configured to use "minikube" cluster and "default" namespace by default ``` -Now `python --version` should return `3.12.8` - - -## Pipx - -Next we are going to install [pipx](https://pypa.github.io/pipx/) to install python packages we want globally available while still using virtual environments. -Let's upgrade `pip` first: +And then make sure the Kubernetes CLI utility, `kubectl`, works with: ```bash -pip install --upgrade pip +# Get pods +kubectl get po -A ``` -And install `pipx`: +Should return something similar to: -```bash -python -m pip install --user pipx # --user so that each ubuntu user can have his own 'pipx' -python -m pipx ensurepath -exec zsh ``` - -Lets install a [tldr](https://github.com/tldr-pages/tldr) with pipx - -```bash -pipx install tldr +NAMESPACE NAME READY STATUS RESTARTS AGE +kube-system coredns-668d6bf9bc-mg7b6 1/1 Running 0 72s +kube-system etcd-minikube 1/1 Running 0 78s +kube-system kube-apiserver-minikube 1/1 Running 0 76s +kube-system kube-controller-manager-minikube 1/1 Running 0 76s +kube-system kube-proxy-stk77 1/1 Running 0 72s +kube-system kube-scheduler-minikube 1/1 Running 0 76s +kube-system storage-provisioner 1/1 Running 1 (41s ago) 75s ``` -Now `tldr` should be globally available (for the current user), test it out with: +And because `minikube` is resource intensive, tear it down for now with: ```bash -tldr ls +# Delete the cluster +minikube delete --all ``` -Much more readable than the classic `man ls` (although sometimes you will still need to delve into the man pages to get all of the details!)
and it even has pages not included in man such as `tldr gh`: - -tldr - - -Lets add a few more packages we want globally available - -### black +Should return: -[black](https://black.readthedocs.io/en/stable/) for helping to format code - -```bash -pipx install black +``` +🔥 Deleting "minikube" in docker ... +🔥 Removing /home//.minikube/machines/minikube ... +💀 Removed all traces of the "minikube" cluster. +🔥 Successfully deleted all profiles ``` -### Poetry - -[Poetry](https://python-poetry.org/) is a modern Python package manager we will use throughout the bootcamp. +#### Terraform -Install Poetry running the following command in your VS Code terminal: +🧪 To test: ```bash -pipx install poetry +terraform --version ``` -Then, let's update default poetry behavior so that virtual envs are always created where `poetry install` is run. -During the bootcamp, you'll see a `.venv` folder being created inside each challenge folder. +Should return: -```bash -poetry config virtualenvs.in-project true ``` - -Finally, update your VScode settings to tell it that this `.venv` relative folder path will be your default interpreter! - -1. Open the Command Palette ( 🪟 ctrl + shift + P / 🍎 cmd + shift + P ) -2. Search for: **Preference: Open Remote Settings (JSON)** - when you open your settings that should be two panels. -3. In the panel that opens on the **right side** search for the line: `python.defaultInterpreterPath` -4. Replace the value (probably `"~/.pyenv/shims/python"`) so that it looks like: - -```yml -"python.defaultInterpreterPath": ".venv/bin/python", +Terraform v1.11.2 +on linux_amd64 ``` -## Direnv +#### Spark -[Direnv](https://direnv.net/) is a great utility that will look for `.envrc` files in your directories. When you `cd` into directories with a `.envrc` files, paths will automatically be updated. In our case, this will simplify our workflow and allow us to not have to worry about Poetry managed Python virtual environments. - -1.
First, setup the *direnv hook* to your zsh shell so that direnv gets activated anytime a `.envrc` file exists in current working directory. +🧪 To test: ```bash -code ~/.zshrc +spark-shell ``` -```bash -plugins=(git gitfast ... pyenv ssh-agent direnv) # add `direnv` to the existing list of plugins -``` +Should take you into the Spark shell, which looks like: -2. Second, let's configure what will happens anytime `.envrc` file is found +``` +Setting default log level to "WARN". +To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). +25/03/18 08:54:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable +Spark context Web UI available at http://lw-de-vm.europe-north1-b.c.wagon-de.internal:4040 +Spark context available as 'sc' (master = local[*], app id = local-1742288096829). +Spark session available as 'spark'. +Welcome to + ____ __ + / __/__ ___ _____/ /__ + _\ \/ _ \/ _ `/ __/ '_/ + /___/ .__/\_,_/_/ /_/\_\ version 3.5.3 + /_/ -```bash -code ~/.direnvrc -``` -- Paste the following lines - ```bash - layout_poetry() { - if [[ ! -f pyproject.toml ]]; then - log_error 'No pyproject.toml found. Use `poetry new` or `poetry init` to create one first.'
- exit 2 - fi - # create venv if it doesn't exist - poetry run true - - export VIRTUAL_ENV=$(poetry env info --path) - export POETRY_ACTIVE=1 - PATH_add "$VIRTUAL_ENV/bin" - } - ``` -- Save and close the file - -😎 Now, **anytime you `cd` into a challenge folder which contains a `.envrc` file which contains `layout_poetry()` command inside, the function will get executed and your virtual env will switch to the poetry one that is defined by the `pyproject.toml` !** -- No need to prefix all commands with `poetry run `, but simply `` -- Each challenge will have its own virtual env, and it will be seamless for you to switch between challenges/envs +Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 1.8.0_442) +Type in expressions to have them evaluated. +Type :help for more information. +scala> +``` -## Let's Make! +Type `:quit` and hit enter to exit the spark-shell and continue. -Lets clone the challenges onto your **virtual machine** +That's all the testing we'll do for now! -```bash -export GITHUB_USERNAME=`gh api user | jq -r '.login'` -echo $GITHUB_USERNAME -``` -Then: +## Let's Make! -```bash -mkdir -p ~/code/$GITHUB_USERNAME && cd $_ -gh repo fork lewagon/data-engineering-challenges --clone -``` +Almost there! In the second ansible playbook, the `lewagon/data-engineering-challenges` repository was forked from Le Wagon to you. Let's review how it works. Our setup will look a bit like this: - +![](/images/repo_overview.png) This allows you to work on challenges, but if we push any changes to the content, you can still access them! Check your remotes match `origin` your data engineering challenges and `upstream` lewagon's! 
```bash -cd data-engineering-challenges +cd ~/code/$(gh api user | jq -r '.login')/data-engineering-challenges git remote -v -# origin git@github.com:your_github_username/data-engineering-challenges.git (fetch) -# origin git@github.com:your_github_username/data-engineering-challenges.git (push) -# upstream git@github.com:lewagon/data-engineering-challenges.git (fetch) -# upstream git@github.com:lewagon/data-engineering-challenges.git (push) +``` + +Should return: + +``` +origin git@github.com:/data-engineering-challenges.git (fetch) +origin git@github.com:/data-engineering-challenges.git (push) +upstream git@github.com:lewagon/data-engineering-challenges.git (fetch) +upstream git@github.com:lewagon/data-engineering-challenges.git (push) ``` From challenge folder root **on the vm**, we'll run `make install`, which triggers 3 operations: diff --git a/_partials/docker.md b/_partials/docker.md deleted file mode 100644 index 6c78316..0000000 --- a/_partials/docker.md +++ /dev/null @@ -1,34 +0,0 @@ -## Docker πŸ‹ - -Docker is an open platform for developing, shipping, and running applications. - -_if you already have Docker installed on your machine please update with the latest version_ - -### Install Docker - -Go to [Docker](https://docs.docker.com/get-docker/) website and choose your operating system: - -![](images/docker.png) - -Then follow the setup instructions, you are going to install a desktop application. - -Once done and launched, check Docker is up and running: - -```bash -docker info -``` - -You should get: - -
- ❗️ I received a permission denied when trying to connect to the Docker Daemon socket. - -If you receive an error similar to the one below, navigate to the (GCP Compute Engine Console)[https://console.cloud.google.com/compute/instances] and STOP your VM (closing VSCode is not enough). - -![](images/docker_permission_denied_socket.png) - -It will take a few minutes for your VM to turn off. Once it's fully off, turn your VM on again (check the box and click START) and try `docker run hello-world` again. If this doesn't work, raise a ticket with a teacher. - -
- -![](images/docker_info.png) diff --git a/_partials/dotfiles_terminal.md b/_partials/dotfiles_terminal.md new file mode 100644 index 0000000..062d588 --- /dev/null +++ b/_partials/dotfiles_terminal.md @@ -0,0 +1,7 @@ +--- + +Once you have finished installing the **dotfiles**, kill your terminal (little trash can at the top right of the terminal window) and re-open it. You might have to do it a few times until it looks similar to: + +![](/images/vscode_after_ansible1.png) + +The terminal should read as `zsh`. diff --git a/_partials/gcp_adc_auth.md b/_partials/gcp_adc_auth.md new file mode 100644 index 0000000..17a6c0f --- /dev/null +++ b/_partials/gcp_adc_auth.md @@ -0,0 +1,11 @@ +### Application Default Credentials + +Application Default Credentials are for authenticating our **code** (Terraform and Python 🐍) to interact with Google services and resources. It's a small distinction between `gcloud` and **code**, but an important one. + +To authenticate your **Application Default Credentials**, in your terminal run: + +```bash +gcloud auth application-default login +``` + +And follow the prompts. It should open a web-page to login to your Google account. diff --git a/_partials/gcp_auth_vm_heading.md b/_partials/gcp_auth_vm_heading.md new file mode 100644 index 0000000..24060ce --- /dev/null +++ b/_partials/gcp_auth_vm_heading.md @@ -0,0 +1,3 @@ +## VM gcloud and Application Default Credentials + +We'll be doing some of the steps again, but that's because the virtual machine is a completely new computer! Luckily for us, `gcloud` comes pre-installed on the virtual machine. diff --git a/_partials/gcp_cli_oauth.md b/_partials/gcp_cli_oauth.md new file mode 100644 index 0000000..e5fa057 --- /dev/null +++ b/_partials/gcp_cli_oauth.md @@ -0,0 +1,36 @@ +### Authenticate gcloud + +We need to authenticate the `gcloud` CLI tool and set the project so it can interact with Google from the terminal. 
+ +To authenticate `gcloud`, run: + +```bash +gcloud auth login +``` + +And follow the prompts. For pasting into the terminal, you might need to use CTRL + SHIFT + V + +You also need to set the GCP project that you are working in. For this section, you'll need your GCP Project ID, which can be found on the GCP Console at this [link here](https://console.cloud.google.com). Make sure you copy the _Project ID_ and **not** the _Project number_. + +To set your project, replace `` with your GCP Project ID and run: + +```bash +gcloud config set project +``` + +Confirm your setup with: + +```bash +gcloud config list +``` + +You should get an output similar to: + +```bash +[core] +account = taylorswift@domain.com # Should be your GCP email +disable_usage_reporting = True +project = my-gcp-project # Should be your GCP Project ID + +Your active configuration is: [default] +``` diff --git a/_partials/gcp_cli_setup.md b/_partials/gcp_cli_setup.md index 9329452..05c83fd 100644 --- a/_partials/gcp_cli_setup.md +++ b/_partials/gcp_cli_setup.md @@ -1,14 +1,77 @@ +## Google Cloud CLI -## `gcloud` CLI +The `gcloud` Command Line Interface (CLI) is used to communicate with Google Cloud Platform services through your terminal. -Before Setting up our Google Cloud Platform account let's configure the `gcloud` CLI (A command line interface for Google Cloud Platform). 
Run the below and follow the terminal prompts to update your $PATH and enable shell command completion for the `.zshrc` file: +### Install gcloud + +$MAC_START +Install with `brew`: ```bash brew install --cask google-cloud-sdk ``` -Then you can: +Then install `gcloud` with: ```bash $(brew --prefix)/Caskroom/google-cloud-sdk/latest/google-cloud-sdk/install.sh ``` + +To test your install, open a new terminal and run: + +```bash +gcloud --version +``` + +👉 [Install documentation 🔗](https://cloud.google.com/sdk/docs/install#mac) +$MAC_END +$WINDOWS_START + +To install, download the Google Cloud CLI installer from this [link here 🔗](https://cloud.google.com/sdk/docs/install#windows). + +Once it's finished downloading, launch the installer and follow the prompts. You only need to install `gcloud` for the current user. + +On the last screen of the installer there will be four check boxes. Make sure that the boxes for `Start Google SDK Shell` and `Run gcloud init to configure the Google Cloud CLI` are selected, then click **Finish**. This should open a new **Command Prompt** window and ask a series of questions like: +- **Do you want to log in?** - type `y`, hit enter, and follow the prompts. It should open a web browser to log in to your Google account. +- **Pick cloud project to use** - Select your GCP Project ID that you want to connect with `gcloud` +- **Select your region and zone** - You can safely enter `n`. It's not important to us at the moment. 
+ +Once you've completed the `gcloud` setup, close **Command Prompt** and re-open it, then run: + +```bash +gcloud config list +``` + +You should get an output similar to: + +``` +[accessibility] +screen_reader = True/False # depends on install options +[core] +account = your_email@domain.com +disable_usage_reporting = True/False # depends on install options +project = your_gcp_project + +Your active configuration is: [default] +``` + +Now `gcloud` is installed and authenticated 🚀 +$WINDOWS_END +$LINUX_START +Add the `APT` repository and install with: + +```bash +echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list +sudo apt-get install apt-transport-https ca-certificates gnupg +curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - +sudo apt-get update && sudo apt-get install google-cloud-sdk +sudo apt-get install google-cloud-sdk-app-engine-python +``` + +To test your install, open a new terminal and run: + +```bash +gcloud --version +``` +👉 [Install documentation 🔗](https://cloud.google.com/sdk/docs/install#deb) +$LINUX_END diff --git a/_partials/gcp_setup_end.md b/_partials/gcp_setup_end.md deleted file mode 100644 index 90d3e12..0000000 --- a/_partials/gcp_setup_end.md +++ /dev/null @@ -1,52 +0,0 @@ - -
- ℹ️ How to find the absolute path of a file? - You can drag and drop the file in your terminal. -
- -**Restart** your terminal and run: - -``` bash -echo $GOOGLE_APPLICATION_CREDENTIALS -``` - -The ouptut should be the following: - -```bash -/some/absolute/path/to/your/gcp/SERVICE_ACCOUNT_JSON_FILE_CONTAINING_YOUR_SECRET_KEY.json -``` - -Now let's verify that the path to your service account json file is correct: - -``` bash -cat $(echo $GOOGLE_APPLICATION_CREDENTIALS) -``` - -👉 This command should display the content of your service account json file. If it does not, ask for a TA 🙏 - -Your code and utilities are now able to access the resources of your GCP account. - -Let's proceed with the final steps of configuration... - -- List the service accounts associated to your active account and current project -```bash -gcloud iam service-accounts list -``` -- Retrieve the service account email address, e.g. `SERVICE_ACCOUNT_NAME@PROJECT_ID.iam.gserviceaccount.com` -- List the roles of the service account from the cli (replace PROJECT_ID and SERVICE_ACCOUNT_EMAIL) -```bash -gcloud projects get-iam-policy PROJECT_ID \ ---flatten="bindings[].members" \ ---format='table(bindings.role)' \ ---filter="bindings.members:SERVICE_ACCOUNT_EMAIL" -``` -- You should see that your service account has a role of `roles/owner` - -
- Troubleshooting - -- `AccessDeniedException: 403 The project to be billed is associated with an absent billing account.` - - Make sure that billing is enabled for your Google Cloud Platform project https://cloud.google.com/billing/docs/how-to/modify-project -
- -🏁 You are done with the GCP setup! diff --git a/_partials/gcp_setup_mid.md b/_partials/gcp_setup_mid.md deleted file mode 100644 index 6bbe4ec..0000000 --- a/_partials/gcp_setup_mid.md +++ /dev/null @@ -1,11 +0,0 @@ -- Store the service account json file somewhere you'll remember, for example: - -``` bash -/Users/MACOS_USERNAME/code/GITHUB_NICKNAME/gcp/SERVICE_ACCOUNT_JSON_FILE_CONTAINING_YOUR_SECRET_KEY.json -``` - -- Store the **absolute path** to the `JSON` file as an environment variable: - -``` bash -echo 'export GOOGLE_APPLICATION_CREDENTIALS=/path/to/the/SERVICE_ACCOUNT_JSON_FILE_CONTAINING_YOUR_SECRET_KEY.json' >> ~/.aliases -``` diff --git a/_partials/gcp_setup_wsl.md b/_partials/gcp_setup_wsl.md deleted file mode 100644 index cfb48b2..0000000 --- a/_partials/gcp_setup_wsl.md +++ /dev/null @@ -1,67 +0,0 @@ -We will now move the service account json file from your Windows disk to the Ubuntu disk. This will allow the development tools in Ubuntu to access to the ressources of your GCP account. - -First, let's create a directory in which we will store the file. - -👉 Open an Ubuntu terminal and run the following commands - -🚨 replace `GITHUB_NICKNAME` by your **GitHub** nickname - -``` bash -cd ~/code/GITHUB_NICKNAME -ls -la -``` - -If the command does not show the `dotfiles` directory, ask for a TA 🙏 - -Otherwise, you can proceed with the setup: - -``` bash -mkdir gcp -``` - -![](images/wsl-gcp-dir.png) - -We will now move the service account json file to the `gcp` directory we just created. - -Open a Windows **File Explorer** (Win + E) and locate the `gcp` directory in the Ubuntu file system. 
- -You can either: -- Use the **Quick access** link that we created earlier -- manually type the location of the `gcp` directory in the Ubuntu file system in the address bar: - -``` -\\wsl$\Ubuntu\home\UBUNTU_USERNAME\code\GITHUB_NICKNAME -``` - - -🚨 if you opt for the second option: -- replace `UBUNTU_USERNAME` by the username that you choose during the **Ubuntu** setup -- replace `GITHUB_NICKNAME` by your **GitHub** nickname - -![](images/wsl-gcp-key.png) - -Once you have located the `gcp` directory in the Windows **File Explorer**, move the service account json file that you downloaded inside of it. - -The file should now be visible from Ubuntu file system. - -👉 Open an Ubuntu terminal and verify that the service account json file has been moved - -``` bash -cd gcp -ls -la -``` - -![](images/wsl-gcp-dir-2.png) - -If you do not see the service account json file listed in the `gcp` directory, ask for a TA 🙏 - -We will now store the path to your service account json file in an environment variable. - -🚨 in the following command, replace: -- `UBUNTU_USERNAME` by the username that you choose during the **Ubuntu** setup -- `GITHUB_NICKNAME` by your **GitHub** nickname -- `SERVICE_ACCOUNT_JSON_FILE_CONTAINING_YOUR_SECRET_KEY.json` by the name of your service account json file - -``` bash -echo 'export GOOGLE_APPLICATION_CREDENTIALS=/home/UBUNTU_USERNAME/code/GITHUB_NICKNAME/gcp/SERVICE_ACCOUNT_JSON_FILE_CONTAINING_YOUR_SECRET_KEY.json' >> ~/.aliases -``` diff --git a/_partials/intro.md b/_partials/intro.md index 87a0ec9..0b12a82 100644 --- a/_partials/intro.md +++ b/_partials/intro.md @@ -6,4 +6,40 @@ A part of the setup will be done on your **local machine** but most of the confi Please **read instructions carefully and execute all commands in the following order**. If you get stuck, don't hesitate to ask a teacher for help :raising_hand: +This setup is largely automated with **Terraform** and **Ansible**. There are three main components to the setup! 
**Terraform** and **ansible** are _Infrastructure as Code_ tools. +- **Terraform** excels at creating and destroying cloud resources, like virtual machines, IP addresses, databases and more! +- **Ansible** is used to configure linux machines with specific settings and software. Perfect for fine-tuning the Virtual Machine you will be creating! + +## Part 1: Setup your local computer + +In this section you'll set up your local computer and create some accounts. It will include things like: +1. Install some communication tools: Zoom, Slack +2. Create some accounts: Github, Google Cloud Platform (GCP) +3. Install Visual Studio Code (VS Code) +4. Install and authenticate the GCP command line tool: `gcloud` +5. Install **terraform** on your local computer +6. Create your virtual machine with **terraform** and connect to it with **VS Code**! + +## Part 2: Configure your Virtual Machine Part 1 + +All parts of this section happen on your virtual machine. + +This section includes: +1. Authenticate your virtual machine with `gcloud` +2. Download and run an **ansible** playbook to partially configure your virtual machine +3. Login to the Github command line tool on your virtual machine +4. Copy the Le Wagon recommended **dotfiles**. **Dotfiles** are settings that will enhance your terminal and developer experience! + +## Part 3: Configure your Virtual Machine Part 2 + +All parts of this section happen on your virtual machine. + +In this section you will: +1. Download and run a second **ansible** playbook for some more fine tuning +2. Test your set up to make sure that everything has installed correctly +3. Create isolated python environments for all your challenges + + +Don't worry, we'll go into more detail in each of the individual sections. 
+ Let's start :rocket: diff --git a/_partials/kata.md b/_partials/kata.md deleted file mode 100644 index 4c26cbf..0000000 --- a/_partials/kata.md +++ /dev/null @@ -1,7 +0,0 @@ -## (Bonus) Kata - -If you are done with your setup, please ask around if some classmates need some help with theirs (macOS, Linux, Windows). We will have our first lectures at 2pm and will talk about the Setup you just did + onboard you on Kitt. - -If you don't have a lot of experience with `git` and GitHub, please [(re-)watch this workshop](https://www.youtube.com/watch?v=Z9fIBT2NBGY) (`1.25` playback speed is fine). - -If you do, then you can wait for the first lecture working on this [Tic-Tac-Toe Kata](https://www.codewars.com/kata/5b817c2a0ce070ace8002be0/train/python) diff --git a/_partials/nbextensions.md b/_partials/nbextensions.md deleted file mode 100644 index 5d0a698..0000000 --- a/_partials/nbextensions.md +++ /dev/null @@ -1,75 +0,0 @@ -## `jupyter` notebook extensions - -Pimp your `jupyter` notebooks with awesome extensions: - -```bash -# install nbextensions -jupyter contrib nbextension install --user -jupyter nbextension enable toc2/main -jupyter nbextension enable collapsible_headings/main -jupyter nbextension enable spellchecker/main -jupyter nbextension enable code_prettify/code_prettify -``` - -### Custom CSS - -Improve the display of the [`details` disclosure elements](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/details) in your notebooks. - -Open `custom/custom.css` in the config directory: -```bash -cd $(jupyter --config-dir) -mkdir -p custom -touch custom/custom.css - custom/custom.css -``` -Edit `custom.css` with: - -```css -summary { - cursor: pointer; - display:list-item; -} -summary::marker { - font-size: 1em; -} -``` - -You can close . 
- -### `jupyter` check up - -Let's reset your terminal: - -```bash -exec zsh -``` - -Now, check you can launch a notebook server on your machine: - -```bash -jupyter notebook -``` - -Your web browser should open on a `jupyter` window: - -![jupyter.png](images/jupyter.png) - -Click on `New`: - -![jupyter_new.png](images/jupyter_new.png) - -A tab should open on a new notebook: - -![jupyter_notebook.png](images/jupyter_notebook.png) - -### `nbextensions` check up - -Perform a sanity check for `jupyter notebooks nbextensions`. Click on `Nbextensions`: - -![jupyter_nbextensions.png](images/jupyter_nbextensions.png) - -Untick _"disable configuration for nbextensions without explicit compatibility"_ then check that _at least_ all `nbextensions` circled in red are enabled: - -![nbextensions.png](images/nbextensions.png) - -You can close your web browser then terminate the jupyter server with `CTRL` + `C`. diff --git a/_partials/osx_python.md b/_partials/osx_python.md deleted file mode 100644 index b659a56..0000000 --- a/_partials/osx_python.md +++ /dev/null @@ -1,75 +0,0 @@ -## Installing Python (with [`pyenv`](https://github.com/pyenv/pyenv)) - -Before installing Python, please check your `xz` version with: - -```bash -brew info xz -``` - -It should be more than `5.2.0`, **if not** you should run: - -```bash -sudo rm -rf /usr/local/opt/xz -brew upgrade -brew install xz -``` - -Then run: - -```bash -brew install readline -``` - -macOS comes with an outdated version of Python that we don't want to use. You might already have installed Anaconda or something else to tinker with Python and Data Science packages. All of this does not really matter as we are going to do a professional setup of Python where you'll be able to switch which version you want to use whenever you type `python` in the terminal. 
- -First let's install `pyenv` with the following Terminal command: - -```bash -brew install pyenv -exec zsh -``` - -Let's install the [latest stable version of Python](https://www.python.org/doc/versions/) supported by Le Wagon's curriculum: - -```bash -pyenv install -``` - -This command might take a while, this is perfectly normal. Don't hesitate to help other students seated next to you! - -
 - 🛠 Troubleshooting - -If you encounter an error installing Python with `pyenv` about `zlib`: - -```txt -zipimport.ZipImportError: can't decompress data; zlib not available -``` - -Install `zlib` with: - -```bash -brew install zlib -export LDFLAGS="-L/usr/local/opt/zlib/lib" -export CPPFLAGS="-I/usr/local/opt/zlib/include" -``` - -Then try to install Python again: - -```bash -pyenv install -``` - -It could raise another error about `bzip2`, you can ignore it and continue to the next step. - -
-
- -OK once this command is complete, we are going to tell the system to use this version of Python **by default**. This is done with: - -```bash -pyenv global -exec zsh -``` - -To check if this worked, run `python --version`. If you see ``, perfect! If not, ask a TA that will help you debug the problem thanks to `pyenv versions` and `type -a python` (`python` should be using the `.pyenv/shims` version first). diff --git a/_partials/pip.md b/_partials/pip.md deleted file mode 100644 index 4392db5..0000000 --- a/_partials/pip.md +++ /dev/null @@ -1,43 +0,0 @@ -## Python packages - -Now that we have a pristine `lewagon` virtual environment, it's time to install some packages in it. - -First, let's upgrade `pip`, the tool to install Python Packages from [pypi.org](https://pypi.org). In the latest terminal where the virtualenv `lewagon` is activated, run: - -```bash -pip install --upgrade pip -``` - -Then let's install some packages for the first weeks of the program: - -$MAC_START -If your computer uses **Apple Silicon**, expand the paragraph below and go through it. Otherwise ignore it. - -
 - 👉 Setup for Apple Silicon 👈 - -``` bash -pip install -r https://raw.githubusercontent.com/lewagon/data-setup/master/specs/releases/apple_silicon.txt -``` -
- -If your computer uses **Apple Intel**, expand the paragraph below and go through it. Otherwise ignore it. - -
 - 👉 Setup for Apple Intel 👈 - -``` bash -pip install -r https://raw.githubusercontent.com/lewagon/data-setup/master/specs/releases/apple_intel.txt -``` -
-$MAC_END -$WINDOWS_START -``` bash -pip install -r https://raw.githubusercontent.com/lewagon/data-setup/master/specs/releases/linux.txt -``` -$WINDOWS_END -$LINUX_START -``` bash -pip install -r https://raw.githubusercontent.com/lewagon/data-setup/master/specs/releases/linux.txt -``` -$LINUX_END diff --git a/_partials/repo_overview.md b/_partials/repo_overview.md index 01c1869..6160191 100644 --- a/_partials/repo_overview.md +++ b/_partials/repo_overview.md @@ -1,34 +1,27 @@ ## Let's Make! -Lets clone the challenges onto your **virtual machine** - -```bash -export GITHUB_USERNAME=`gh api user | jq -r '.login'` -echo $GITHUB_USERNAME -``` - -Then: - -```bash -mkdir -p ~/code/$GITHUB_USERNAME && cd $_ -gh repo fork lewagon/data-engineering-challenges --clone -``` +Almost there! In the second ansible playbook, the `lewagon/data-engineering-challenges` repository was forked from Le Wagon to you. Let's review how it works. Our setup will look a bit like this: - +![](/images/repo_overview.png) This allows you to work on challenges, but if we push any changes to the content, you can still access them! Check your remotes match `origin` your data engineering challenges and `upstream` lewagon's! 
```bash -cd data-engineering-challenges +cd ~/code/$(gh api user | jq -r '.login')/data-engineering-challenges git remote -v -# origin git@github.com:your_github_username/data-engineering-challenges.git (fetch) -# origin git@github.com:your_github_username/data-engineering-challenges.git (push) -# upstream git@github.com:lewagon/data-engineering-challenges.git (fetch) -# upstream git@github.com:lewagon/data-engineering-challenges.git (push) +``` + +Should return: + +``` +origin git@github.com:/data-engineering-challenges.git (fetch) +origin git@github.com:/data-engineering-challenges.git (push) +upstream git@github.com:lewagon/data-engineering-challenges.git (fetch) +upstream git@github.com:lewagon/data-engineering-challenges.git (push) ``` From challenge folder root **on the vm**, we'll run `make install`, which triggers 3 operations: diff --git a/_partials/terraform.md b/_partials/terraform.md index b9c925f..63ad3b9 100644 --- a/_partials/terraform.md +++ b/_partials/terraform.md @@ -1,15 +1,56 @@ ## Terraform -Terraform is a tool for infrastructure as code (IAC) to define resources to create in the cloud! +Terraform is a tool for infrastructure as code (IAC) to create (and destroy) resources in the cloud. -### Install terraform +$MAC_START +You can use `brew` to install terraform. In your terminal, run: -Install some basic requirements +```bash +brew tap hashicorp/tap +brew install hashicorp/tap/terraform +``` +$MAC_END +$WINDOWS_START +### Download + +To install terraform, download the **zip archive** from the Terraform install page at this [link here 🔗](https://developer.hashicorp.com/terraform/install). + +❗ If you are using Windows 10 or 11, download the **AMD64** version (64 bit version). + +1. Use File Explorer to go to the location where you downloaded the **terraform zip archive** + +2. **Unzip** the archive and two files should appear: `terraform.exe` and `license.txt`. + +3. Copy `terraform.exe` + +4. 
Navigate to your home directory (`C:\Users\\`) and create a directory named `cli_apps` + +5. Paste `terraform.exe` in the `cli_apps` directory + +### Add terraform to PATH + +We need to manually add **Terraform** to the `PATH` environment variable. The `PATH` variable contains a list of directories that your computer looks in for programs that we run from the command prompt. + +To update your path: +1. Open Windows Search and search for: **Environment Variables** + +2. Click **Environment Variables** or **Edit environment variables for your account** + +3. Click **New** at the top right of this window + +4. Enter: `C:\Users\YOUR_USERNAME\cli_apps` - Make sure to replace `YOUR_USERNAME` with your computer's user name. + +5. Click **Ok** to close the `Path` variable window, and click **Ok** again to close the Environment Variable window. + +6. Close **Command Prompt** and open it again +$WINDOWS_END +$LINUX_START +Install some basic requirements: ```bash sudo apt-get update && sudo apt-get install -y gnupg software-properties-common ``` -Terraform is not avaliable to apt by default so we need to make it avaliable! +Terraform is not available to **apt** by default, so we need to manually add the repository. 
```bash wget -O- https://apt.releases.hashicorp.com/gpg | \ gpg --dearmor | \ @@ -28,11 +69,12 @@ echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] \ sudo tee /etc/apt/sources.list.d/hashicorp.list ``` -Now we can install terraform directly with apt 👇 +Now we can install terraform directly with **apt** 👇 ```bash sudo apt update sudo apt-get install terraform ``` +$LINUX_END Verify the installation with: diff --git a/_partials/terraform_vm.md b/_partials/terraform_vm.md new file mode 100644 index 0000000..e304621 --- /dev/null +++ b/_partials/terraform_vm.md @@ -0,0 +1,171 @@ +## Provisioning your Virtual Machine with Terraform + +You can create Cloud Resources like Virtual Machines in different ways: +- Through the Google Cloud [Compute Engine Console 🔗](https://console.cloud.google.com/compute/overview) +- Using `gcloud` +- With **Infrastructure as Code** tools like Terraform + +We'll be creating our Virtual Machine with Terraform. + +We're almost at the point of creating your Virtual Machine. + +The specifications of the Virtual Machine and Network Settings you'll use for the bootcamp are: +- Operating System: Ubuntu 22.04 LTS +- CPU: 4 Virtual CPU cores (2 physical CPU cores) +- RAM: 16 GB +- Storage (Persistent Disk): 100 GB balanced +- Static External IP address - so it's easier to log in. + +### Cost 💸 + +Creating and running a Virtual Machine on Google Cloud Platform costs money! + +If you have created a new Google Cloud Platform account, the cost of the Virtual machine will be covered by the $300 USD credit for the first 90 days if you are diligent with turning off your Virtual Machine (or finish the _Linux and Bash_ challenge today 😎). + +❗ **The cost of running a Virtual Machine with our configuration 24 hours a day, 7 days a week is ~$150 USD per month.** ❗ + +You can massively reduce the cost by only running the Virtual Machine when you use it. 
You will _NOT_ be charged for the vCPUs and RAM while the Virtual Machine is off! + +You will always pay for the Storage (equivalent of your hard-drive on your local computer). It's ~$10 USD per month for 100 GB. + +The rule of thumb is: if Google can rent the resource out to someone else when you're not using it, you only pay for it while you are using it. That's why you don't pay for CPU and RAM while the Virtual Machine is off (Google can rent them out to someone else), but you always pay for Storage (Google can't rent it out to someone else because it has your data on it). + +### Download terraform files + +We almost have all the necessary parts to create your VM using **terraform**. We need to download the terraform files and change a few values. + +First we'll create a folder and download the terraform files with: + +$MAC_START +```bash +mkdir -p ~/wagon-de-bootcamp +curl -L -o ~/wagon-de-bootcamp/main.tf https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/main.tf +curl -L -o ~/wagon-de-bootcamp/provider.tf https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/provider.tf +curl -L -o ~/wagon-de-bootcamp/variables.tf https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/variables.tf +curl -L -o ~/wagon-de-bootcamp/terraform.tfvars https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/terraform.tfvars +curl -L -o ~/wagon-de-bootcamp/.terraform.lock.hcl https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/.terraform.lock.hcl +``` +$MAC_END +$WINDOWS_START +Using the Command Prompt (cmd), run the following: + +```cmd +mkdir %USERPROFILE%\wagon-de-bootcamp + +curl -L -o "%USERPROFILE%\wagon-de-bootcamp\main.tf" 
https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/main.tf + +curl -L -o "%USERPROFILE%\wagon-de-bootcamp\provider.tf" https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/provider.tf + +curl -L -o "%USERPROFILE%\wagon-de-bootcamp\variables.tf" https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/variables.tf + +curl -L -o "%USERPROFILE%\wagon-de-bootcamp\terraform.tfvars" https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/terraform.tfvars + +curl -L -o "%USERPROFILE%\wagon-de-bootcamp\.terraform.lock.hcl" https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/.terraform.lock.hcl +``` +$WINDOWS_END +$LINUX_START +```bash +mkdir -p ~/wagon-de-bootcamp +curl -L -o ~/wagon-de-bootcamp/main.tf https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/main.tf +curl -L -o ~/wagon-de-bootcamp/provider.tf https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/provider.tf +curl -L -o ~/wagon-de-bootcamp/variables.tf https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/variables.tf +curl -L -o ~/wagon-de-bootcamp/terraform.tfvars https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/terraform.tfvars +curl -L -o ~/wagon-de-bootcamp/.terraform.lock.hcl https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/.terraform.lock.hcl +``` +$LINUX_END + + +### Set variables + +$MAC_START +Open up the file `~/wagon-de-bootcamp/terraform.tfvars` in VS Code or any other code editor. 
+$MAC_END +$WINDOWS_START +Open up the file `C:\Users\\wagon-de-bootcamp\terraform.tfvars` in VS Code or any other code editor. +$WINDOWS_END +$LINUX_START +Open up the file `~/wagon-de-bootcamp/terraform.tfvars` in VS Code or any other code editor. +$LINUX_END + +It should look like: + +```bash +project_id = "" +region = "" +zone = "" +instance_name = "" +instance_user = "" +``` + +We'll need to change some values in this file. Here's where you can find the required values: +- **project_id:** from the GCP Console at this [link here](https://console.cloud.google.com). +- **region:** take a look at the GCP Region and Zone documentation at this [link here](https://cloud.google.com/compute/docs/regions-zones). We strongly recommend you choose the closest geographical region. +- **zone:** Zone is a subset of region. It is almost always the same as **region** appended with `-a`, `-b`, or `-c`. +- **instance_name:** we recommend naming your VM: `lw-de-vm-`. Replace `` with your GitHub username. +$MAC_START +- **instance_user:** in your terminal, run `whoami` +$MAC_END +$WINDOWS_START +- **instance_user:** in Command Prompt, run `echo %username%` +$WINDOWS_END +$LINUX_START +- **instance_user:** in your terminal, run `whoami` +$LINUX_END + +After completing this file, it should look similar to: + +```bash +project_id = "wagon-bootcamp" +region = "europe-west1" +zone = "europe-west1-b" +instance_name = "lw-de-vm-tswift" +instance_user = "taylorswift" +``` + +Make sure to save the `terraform.tfvars` file, then navigate into the directory with the terraform files with: + +``` +$MAC_START +cd ~/wagon-de-bootcamp +$MAC_END +$WINDOWS_START +cd %USERPROFILE%\wagon-de-bootcamp +$WINDOWS_END +$LINUX_START +cd ~/wagon-de-bootcamp +$LINUX_END +``` + +And initialise and test the files with: + +```bash +terraform init + +terraform plan +``` + +And check the output. 
Towards the bottom there should be a line: + +``` +Plan: 2 to add, 0 to change, 0 to destroy +``` + +We'll be adding: +- A compute engine instance +- A static external IP address + +❗ If you have any errors, read the error and debug. If you need some help, raise a ticket with a teacher. + +If everything was successful, create your VM with: + +```bash +terraform apply -auto-approve +``` + +It might take a while for Terraform to create the cloud resources. Once you see: + +``` +Apply complete! Resources: 2 added, 0 changed, 0 destroyed. +``` + +Your Virtual Machine should be up and running! Check the GCP Compute Engine console at this [link here](https://console.cloud.google.com/compute/instances) to confirm. diff --git a/_partials/tldr.md b/_partials/tldr.md deleted file mode 100644 index 548b10f..0000000 --- a/_partials/tldr.md +++ /dev/null @@ -1,36 +0,0 @@ -## TLDR - -Add TLDR - a modern addition to MAN pages, which will help you find nice documentation and examples on most Linux commands: - -```bash -cd ~ -pip3 install -U pip -pip3 install tldr -``` -❗️ It is one of the very few tools we will install from the default system python interpreter, because it has se few [dependencies](https://github.com/tldr-pages/tldr/blob/main/requirements.txt) - -You can try `tldr` with: - -```bash -tldr gh -``` - -ℹ️ It's normal that it takes ~1 minute the first time, as the cache needs to be built. Subsequent calls will be fast. - -Finally you should get: - -tldr - -## gRPCurl - -gRPCurl is `curl` for [gRPC servers](https://grpc.io/docs/what-is-grpc/introduction/). 
- -- Install `grpcurl` - ```bash - curl -s https://grpc.io/get_grpcurl | bash - ``` -- Add `grpcurl` to your `PATH` - ```bash - echo '# Add grpcurl to PATH' >> ~/.zshrc - echo 'PATH=$PATH:$HOME/.grpcurl/bin/' >> ~/.zshrc - ``` diff --git a/_partials/ubuntu_ansible_part1.md b/_partials/ubuntu_ansible_part1.md new file mode 100644 index 0000000..c622902 --- /dev/null +++ b/_partials/ubuntu_ansible_part1.md @@ -0,0 +1,63 @@ +## VM configuration with Ansible + +We'll be using [Ansible](https://docs.ansible.com/ansible/latest/getting_started/introduction.html) to configure your Virtual Machine with some software, configurations, packages, and frameworks that you'll use in the bootcamp. + +Let's start by confirming that ansible is installed. In your terminal run: + +```bash +ansible --version +``` + +You should get an output similar to (some version numbers might change, that's fine): + +``` +ansible [core 2.17.9] + config file = /etc/ansible/ansible.cfg + configured module search path = ['/home/tswift/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules'] + ansible python module location = /usr/lib/python3/dist-packages/ansible + ansible collection location = /home/tswift/.ansible/collections:/usr/share/ansible/collections + executable location = /usr/bin/ansible + python version = 3.12.3 (main, Feb 4 2025, 14:48:35) [GCC 13.3.0] (/usr/bin/python3) + jinja version = 3.1.2 + libyaml = True +``` + +❗ If not, raise a ticket with a teacher. 
+ +### Ansible Playbook 1 + +Create a folder and download the ansible files: + +```bash +mkdir -p ~/vm-ansible-setup/playbooks + +curl -L -o ~/vm-ansible-setup/ansible.cfg https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/vm-ansible-setup/ansible.cfg +curl -L -o ~/vm-ansible-setup/hosts https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/vm-ansible-setup/hosts +curl -L -o ~/vm-ansible-setup/playbooks/setup_vm_part1.yml https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/vm-ansible-setup/playbooks/setup_vm_part1.yml +``` + +And run with: + +```bash +cd ~/vm-ansible-setup +ansible-playbook playbooks/setup_vm_part1.yml +``` + +And the playbook should start running! + +❗ If any errors occur, raise a ticket with a teacher. You can safely run the playbook again. + +### What is the playbook installing? + +This playbook installs a few things. While it is running, let's go through them: +- Updating system packages. Ubuntu uses the `APT` package manager. +- Changing the default shell from **bash** to **zsh**, a more customizable shell that is extensible and looks great! +- Installing the **Oh-My-ZSH** framework for the **zsh** shell. We'll use it a bit later to add some quality of life plugins and extensions for `zsh`. +- Installing **Docker** on your Virtual Machine. Docker is an open platform for developing, shipping, and running applications. You will use it throughout the bootcamp. +- Installing some **Kubernetes (k8s)** tooling: Kubernetes is a system for deploying, scaling, and managing containerized applications. + - Installing **kubectl**: `kubectl` is the CLI tool for interacting with Kubernetes clusters. + - Installing **minikube**: Minikube is a way to quickly spin up a local Kubernetes cluster. Great for developing! 
+- Installing **terraform**: we've already installed it once, but we need to install it on our VM! **Terraform** is an Infrastructure as Code (IaC) tool. +- Installing the **GitHub CLI**: the CLI tool that we'll use to interact with your GitHub account directly from the terminal. + +The playbook also checks whether each item is already installed, so you can safely re-run it without any problems. diff --git a/_partials/ubuntu_ansible_part2.md b/_partials/ubuntu_ansible_part2.md new file mode 100644 index 0000000..f17099c --- /dev/null +++ b/_partials/ubuntu_ansible_part2.md @@ -0,0 +1,85 @@ +## VM configuration with Ansible - Part 2 + +### Ansible Playbook 2 + +We'll be using a second **Ansible** playbook to further configure your Virtual Machine. + +Start by downloading the ansible playbook: + +```bash +curl -L -o ~/vm-ansible-setup/playbooks/setup_vm_part2.yml https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/vm-ansible-setup/playbooks/setup_vm_part2.yml +``` + +And run with: + +```bash +cd ~/vm-ansible-setup +ansible-playbook playbooks/setup_vm_part2.yml +``` + +And the playbook should start running! If you're asked if you want VS Code to behave more like Sublime Text, click accept. + +❗ If any errors occur, raise a ticket with a teacher. You can safely run the playbook again. + +
+❓ Why two Ansible playbooks? + +The second Ansible playbook needs GitHub authorisation to fork the `lewagon/data-engineering-challenges` repository, and it also edits some of the Le Wagon recommended **dotfiles**. Both depend on steps you complete by hand after the first playbook, so the process is split into two steps. +
+ +### What is the playbook installing? + +This playbook installs and configures a few things. While it is running, let's go through them: + +**Python and Poetry** + +Ubuntu 22.04 has Python pre-installed, but not the version we're going to use. We are going to use Python [3.12.8](https://www.python.org/downloads/release/python-3128/). + +- Installing **pyenv** and **pyenv-virtualenv**. We'll use **pyenv** to manage the Python versions installed on the VM. +- Installing Python 3.12.8 with pyenv. +- Installing **pipx**: [Pipx](https://pipx.pypa.io/stable/) installs Python packages that we want _globally_ available, like Poetry, while still keeping each one in its own virtual environment. +- Installing a few global Python packages with **pipx**: + - **Poetry:** [Poetry](https://python-poetry.org/) is a modern Python package manager we will use throughout the bootcamp. + - **Ruff:** [Ruff](https://docs.astral.sh/ruff/) is used to format and lint Python code. + - **tldr:** [tldr](https://github.com/tldr-pages/tldr) provides a much more readable version of `man` pages. Useful for quickly finding out how a program works. + +**VS Code Configuration** + +- Installing some **VS Code** extensions, but only on your VM. 
Here's a list of the extensions that are being installed: + - [Sublime Text Keymap and Settings Importer](https://marketplace.visualstudio.com/items?itemName=ms-vscode.sublime-keybindings) + - [VSCode Great Icons](https://marketplace.visualstudio.com/items?itemName=emmanuelbeziat.vscode-great-icons) + - [Python](https://marketplace.visualstudio.com/items?itemName=ms-python.python) + - [Python Indent](https://marketplace.visualstudio.com/items?itemName=KevinRose.vsc-python-indent) + - [Pylance](https://marketplace.visualstudio.com/items?itemName=ms-python.vscode-pylance) + - [YAML](https://marketplace.visualstudio.com/items?itemName=redhat.vscode-yaml) + - [Docker](https://marketplace.visualstudio.com/items?itemName=ms-azuretools.vscode-docker) + - [Even Better TOML](https://marketplace.visualstudio.com/items?itemName=tamasfe.even-better-toml) +- Update the VS Code Python Interpreter path. + +**Shell and System Configuration** + +- Create the **direnv** poetry function. The same one from the lecture! This makes it easier to work with poetry. +- Adding some **Oh-My-ZSH** Plugins: by modifying your `.zshrc` file. Here's a list of the extra plugins: + - **pyenv**: Auto-complete for pyenv, a tool used to manage python virtual environments + - **gcloud**: Auto-complete for the gcloud CLI tool + - **ssh-agent**: Saves your SSH password so you only have to enter it once per session. + - **direnv**: A tool to load `.envrc` files when you `cd` into a directory. Great for loading environment variables. +- Installing **Spark**: Spark is a distributed data processing framework + +**Data Engineering Challenges Repository** + +The challenges that you'll be working on throughout the bootcamp! The playbook is forking the **data-engineering-challenges** repository from **lewagon** to your own GitHub user. Then cloning that repository from your GitHub account down onto your Virtual Machine. 
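To make the fork-then-clone step less of a black box, here is a minimal sketch of the same idea done by hand. It assumes the GitHub CLI is already authenticated and that the repository lives under `~/code/<github-user>/data-engineering-challenges`. The helper names `challenges_dir` and `fork_and_clone_challenges` are hypothetical, and the playbook's actual tasks may differ.

```bash
#!/usr/bin/env bash
# Sketch only: illustrates the fork + clone flow, not the playbook's real tasks.

# Hypothetical helper: print the directory the challenges repo is cloned into
# for a given GitHub username (assumed layout: ~/code/<user>/data-engineering-challenges).
challenges_dir() {
  printf '%s/code/%s/data-engineering-challenges\n' "$HOME" "$1"
}

# Hypothetical helper: fork lewagon's repo to your account, clone your fork,
# and add the original repository as the "upstream" remote.
fork_and_clone_challenges() {
  gh_user="$(gh api user --jq '.login')" || return 1
  target="$(challenges_dir "$gh_user")"

  # Create the fork on GitHub without cloning it automatically
  gh repo fork lewagon/data-engineering-challenges --clone=false || return 1

  # Clone your fork over SSH into the expected folder
  mkdir -p "$(dirname "$target")"
  git clone "git@github.com:${gh_user}/data-engineering-challenges.git" "$target" || return 1

  # Keep a reference to the original repository for future updates
  git -C "$target" remote add upstream git@github.com:lewagon/data-engineering-challenges.git
}
```

Nothing runs until `fork_and_clone_challenges` is called. Afterwards, `git remote -v` inside the cloned folder should list your fork as `origin` and the `lewagon` repository as `upstream`.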
+ +### Restart Virtual Machine + +Once the playbook has finished running, you need to completely shut down your Virtual Machine so that some of the configuration updates (specifically **pyenv** and **Docker**) take effect. + +To shut down your VM, navigate to the GCP Compute Engine Instances [console page 🔗](https://console.cloud.google.com/compute/instances). + +Select your VM instance and click on the stop button: + +![](/images/gcp_vm_stop.png) + +Wait for a few minutes until the VM shows that it is completely off. You may need to refresh the page, as the GCP Console doesn't dynamically update. + +When the VM is completely off, turn it on again by selecting the check box next to your instance and clicking **START/RESUME**. Give it a minute to spin up, then connect via VS Code. diff --git a/_partials/ubuntu_gcloud.md b/_partials/ubuntu_gcloud.md index 0ce1342..32d97ca 100644 --- a/_partials/ubuntu_gcloud.md +++ b/_partials/ubuntu_gcloud.md @@ -10,7 +10,7 @@ sudo apt-get install google-cloud-sdk-app-engine-python ``` 👉 [Install documentation](https://cloud.google.com/sdk/docs/install#deb) -### Create a service account key 🔑 + + diff --git a/_partials/ubuntu_terraform.md b/_partials/ubuntu_terraform.md new file mode 100644 index 0000000..0cbf9e7 --- /dev/null +++ b/_partials/ubuntu_terraform.md @@ -0,0 +1,41 @@ +## Terraform + +Terraform is an Infrastructure as Code (IaC) tool for defining the resources to create in the cloud! + +### Install terraform + +Install some basic requirements: +```bash +sudo apt-get update && sudo apt-get install -y gnupg software-properties-common +``` + +Terraform is not available to apt by default, so we need to add the HashiCorp repository first! 
+```bash +wget -O- https://apt.releases.hashicorp.com/gpg | \ + gpg --dearmor | \ + sudo tee /usr/share/keyrings/hashicorp-archive-keyring.gpg > /dev/null +``` + +```bash +gpg --no-default-keyring \ + --keyring /usr/share/keyrings/hashicorp-archive-keyring.gpg \ + --fingerprint +``` + +```bash +echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] \ + https://apt.releases.hashicorp.com $(lsb_release -cs) main" | \ + sudo tee /etc/apt/sources.list.d/hashicorp.list +``` + +Now we can install terraform directly with apt πŸ‘‡ +```bash +sudo apt update +sudo apt-get install terraform +``` + +Verify the installation with: + +```bash +terraform --version +``` diff --git a/_partials/ubuntu_vm_test.md b/_partials/ubuntu_vm_test.md new file mode 100644 index 0000000..fb019fe --- /dev/null +++ b/_partials/ubuntu_vm_test.md @@ -0,0 +1,216 @@ +## Check your Virtual Machine Setup + +We've used two ansible playbooks to configure our Virtual Machine. Let's run some manual checks in the terminal to make sure that everything has installed correctly. + +❗ If any of these checks error out, raise a ticket with a teacher. 
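Before going through the checks one by one, a quick first pass is to see which commands are on your `PATH` at all. The sketch below is an optional convenience, not part of the official setup: `check_tools` is a hypothetical helper, and the tool list in the usage comment is an assumption based on the checks in this section. It only confirms that each command can be found, not that its version is correct.

```bash
#!/usr/bin/env bash
# Sketch: report which expected command-line tools can be found on PATH.

# Hypothetical helper: prints "ok: <tool>" or "missing: <tool>" for each argument
# and returns a non-zero status if at least one tool is missing.
check_tools() {
  check_tools_missing=0
  for check_tools_name in "$@"; do
    if command -v "$check_tools_name" >/dev/null 2>&1; then
      printf 'ok: %s\n' "$check_tools_name"
    else
      printf 'missing: %s\n' "$check_tools_name"
      check_tools_missing=1
    fi
  done
  return "$check_tools_missing"
}

# Usage (tool names assumed from the checks in this section):
#   check_tools python pyenv pipx docker minikube kubectl terraform spark-shell
```

Any line starting with `missing:` points at the check to look at first. The detailed checks that follow still matter, because they also confirm versions and runtime behaviour.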
+ +#### Python + +πŸ§ͺ To test: + +```bash +python --version +``` + +Should return: + +``` +Python 3.12.8 +``` + +#### Pyenv + +πŸ§ͺ To test: + +```bash +pyenv versions +``` + +Should return: + +``` + system +* 3.12.8 (set by /home//.pyenv/version) +``` + +Note: There should be an `*` next to 3.12.8 + +#### Pipx + +πŸ§ͺ To test: + +```bash +pipx list +``` + +Should return something similar too: + +``` +venvs are in /home//.local/share/pipx/venvs +apps are exposed on your $PATH at /home//.local/bin +manual pages are exposed at /home//.local/share/man + package poetry 2.1.1, installed using Python 3.12.8 + - poetry + package ruff 0.11.0, installed using Python 3.12.8 + - ruff + package tldr 3.3.0, installed using Python 3.12.8 + - tldr + - man1/tldr.1 +``` + +#### Docker + +πŸ§ͺ To test: + +```bash +docker run hello-world +``` + +Should return: + +``` +Unable to find image 'hello-world:latest' locally +latest: Pulling from library/hello-world +e6590344b1a5: Pull complete +Digest: sha256:7e1a4e2d11e2ac7a8c3f768d4166c2defeb09d2a750b010412b6ea13de1efb19 +Status: Downloaded newer image for hello-world:latest + +Hello from Docker! +This message shows that your installation appears to be working correctly. + +To generate this message, Docker took the following steps: + 1. The Docker client contacted the Docker daemon. + 2. The Docker daemon pulled the "hello-world" image from the Docker Hub. + (amd64) + 3. The Docker daemon created a new container from that image which runs the + executable that produces the output you are currently reading. + 4. The Docker daemon streamed that output to the Docker client, which sent it + to your terminal. 
+ +To try something more ambitious, you can run an Ubuntu container with: + $ docker run -it ubuntu bash + +Share images, automate workflows, and more with a free Docker ID: + https://hub.docker.com/ + +For more examples and ideas, visit: + https://docs.docker.com/get-started/ +``` + +#### Kubernetes + +We can start by testing `minikube`: + +```bash +# Start +minikube start +``` + +Should return: + +``` +πŸ˜„ minikube v1.35.0 on Ubuntu 22.04 (amd64) +✨ Automatically selected the docker driver. Other choices: none, ssh +πŸ“Œ Using Docker driver with root privileges +πŸ‘ Starting "minikube" primary control-plane node in "minikube" cluster +🚜 Pulling base image v0.0.46 ... +πŸ’Ύ Downloading Kubernetes v1.32.0 preload ... + > gcr.io/k8s-minikube/kicbase...: 500.31 MiB / 500.31 MiB 100.00% 88.19 M + > preloaded-images-k8s-v18-v1...: 333.57 MiB / 333.57 MiB 100.00% 32.20 M +πŸ”₯ Creating docker container (CPUs=2, Memory=3900MB) ... +🐳 Preparing Kubernetes v1.32.0 on Docker 27.4.1 ... + β–ͺ Generating certificates and keys ... + β–ͺ Booting up control plane ... + β–ͺ Configuring RBAC rules ... +πŸ”— Configuring bridge CNI (Container Networking Interface) ... +πŸ”Ž Verifying Kubernetes components... + β–ͺ Using image gcr.io/k8s-minikube/storage-provisioner:v5 +🌟 Enabled addons: storage-provisioner, default-storageclass +πŸ„ Done! 
kubectl is now configured to use "minikube" cluster and "default" namespace by default +``` + +And then make sure the kubernetes CLI utility, `kubectl`, works with: + +```bash +# Get pods +kubectl get po -A +``` + +Should return something similar too: + +``` +NAMESPACE NAME READY STATUS RESTARTS AGE +kube-system coredns-668d6bf9bc-mg7b6 1/1 Running 0 72s +kube-system etcd-minikube 1/1 Running 0 78s +kube-system kube-apiserver-minikube 1/1 Running 0 76s +kube-system kube-controller-manager-minikube 1/1 Running 0 76s +kube-system kube-proxy-stk77 1/1 Running 0 72s +kube-system kube-scheduler-minikube 1/1 Running 0 76s +kube-system storage-provisioner 1/1 Running 1 (41s ago) 75s +``` + +And because `minikube` is resource intensive, stop it for now with: + +```bash +# Stop +minikube delete --all +``` + +Should return: + +``` +πŸ”₯ Deleting "minikube" in docker ... +πŸ”₯ Removing /home//.minikube/machines/minikube ... +πŸ’€ Removed all traces of the "minikube" cluster. +πŸ”₯ Successfully deleted all profiles +``` + +#### Terraform + +πŸ§ͺ To test: + +```bash +terraform --version +``` + +Should return: + +``` +Terraform v1.11.2 +on linux_amd64 +``` + +#### Spark + +πŸ§ͺ To test: + +```bash +spark-shell +``` + +Should take you into the spark shell that looks like: + +``` +Setting default log level to "WARN". +To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). +25/03/18 08:54:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable +Spark context Web UI available at http://lw-de-vm.europe-north1-b.c.wagon-de.internal:4040 +Spark context available as 'sc' (master = local[*], app id = local-1742288096829). +Spark session available as 'spark'. 
+Welcome to + ____ __ + / __/__ ___ _____/ /__ + _\ \/ _ \/ _ `/ __/ '_/ + /___/ .__/\_,_/_/ /_/\_\ version 3.5.3 + /_/ + +Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 1.8.0_442) +Type in expressions to have them evaluated. +Type :help for more information. + +scala> +``` + +Type `:quit` and hit enter to exit the spark-shell and continue. + +That's all the testing we'll do for now! diff --git a/_partials/virtualenv.md b/_partials/virtualenv.md deleted file mode 100644 index 341f545..0000000 --- a/_partials/virtualenv.md +++ /dev/null @@ -1,24 +0,0 @@ -## Python Virtual Environment - -Before we start installing relevant Python packages, we will isolate the setup for the Bootcamp into a **dedicated** virtual environment. We will use a `pyenv` plugin called [`pyenv-virtualenv`](https://github.com/pyenv/pyenv-virtualenv). - -First let's install this plugin: - -```bash -git clone https://github.com/pyenv/pyenv-virtualenv.git $(pyenv root)/plugins/pyenv-virtualenv -exec zsh -``` - -Let's create the virtual environment we are going to use during the whole bootcamp: - -```bash -pyenv virtualenv lewagon -``` - -Let's now set the virtual environment with: - -```bash -pyenv global lewagon -``` - -Great! Anytime we'll install Python package, we'll do it in that environment. diff --git a/_partials/vscode_remote_ssh.md b/_partials/vscode_remote_ssh.md index 29e3eaa..301d89d 100644 --- a/_partials/vscode_remote_ssh.md +++ b/_partials/vscode_remote_ssh.md @@ -11,65 +11,3 @@ We need to connect VS Code to a virtual machine in the cloud so you will only wo VS Code extensions - Remote - Details That's the only extension you should install on your _local_ machine, we will install additional VS Code extensions on your _virtual machine_. 
- -### Virtual Machine connection - -- Open VS Code > Open the [command palette](https://code.visualstudio.com/docs/getstarted/userinterface#_command-palette) > Type `Remote-SSH: Connect to Host...` - -vscode-connect-to-host - -- Click on `Add a new host` -- Type `ssh -i @`, for instance, my username is `somedude`, my private SSH key is located at `~/.ssh/id_rsa` on my local computer, my VM has a public IP of `34.77.50.76`: I'll type `ssh -i ~/.ssh/id_rsa somedude@34.77.50.76` - -vscode-ssh-connection-command - - -- When prompted to `Select SSH configuration file to update`, pick the one in your home directory, under the `.ssh` folder, `~/.ssh/config` basically. Usually VS Code will pick automatically the best option, so their default should work. - -vscode-add-host-ssh-config - -- You should get a pop-up on the bottom right notifying you the host has been added - -vscode-host-added - -- Open again the [command palette](https://code.visualstudio.com/docs/getstarted/userinterface#_command-palette) > Type `Remote-SSH: Connect to Host...` > Pick your VM IP address - -vscode-add-new-host - -- The first time, VSCode might ask you for a security permission like below, say yes / continue. - -vscode-remote-connection-confirm - -- Open again the [command palette](https://code.visualstudio.com/docs/getstarted/userinterface#_command-palette) > Type `Terminal: Create New Terminal (in active workspace)` > You now have a Bash terminal in your virtual machine! - -vscode-command-palette-new-terminal -
-vscode-terminal - -- Still on your *local* computer, lets create a more readable version of your machine to connect to! - -```bash -code ~/.ssh/config -``` - -You should see something like the following: - -```bash -Host - HostName - IdentityFile - User -``` -You can now change Host to whatever you would like to see as the name of your connection or in terminal with `ssh `! - -❗️ It is important that the `Host` alias does not contain any whitespaces ❗️ - -```bash -# For instance -Host "de-bootcamp-vm" - HostName 34.77.50.76 # replace with your VM's public IP address - IdentityFile - User -``` - -**The setup of your local machine is over. All following commands will be run from within your 🚨 virtual machine**🚨 terminal (via VS code for instance) diff --git a/_partials/vscode_ssh_connection.md b/_partials/vscode_ssh_connection.md new file mode 100644 index 0000000..c2f671e --- /dev/null +++ b/_partials/vscode_ssh_connection.md @@ -0,0 +1,89 @@ +## Virtual Machine connection + +### Create SSH keys + +We need to connect VS Code to our Virtual Machine in the cloud so you will only work on that machine during the bootcamp. We'll use the [Remote - SSH Extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-ssh) that we previously installed. + +To create the VS Code SSH configuration, run the following in your terminal: + +```bash +gcloud compute config-ssh +``` + +`gcloud` may tell you it needs to create a directory to continue. Accept and you should get an output similar to: + +```bash +You should now be able to use ssh/scp with your instances. +For example, try running: + + $ ssh lw-de-vm-tswift.europe-west1-b.wagon-bootcamp +# $ ssh lw-de-vm-.. +``` + +$WINDOWS_START +### SSH File Permissions + +Windows has strict permissions for SSH files by default, we need to alter some permissions on the SSH configuration that was created by `gcloud` so VS Code can read the files and manage the SSH connection. 
+ +In Command Prompt run: + +```cmd +icacls %USERPROFILE%\.ssh\config /inheritance:r + +icacls %USERPROFILE%\.ssh\config /grant:r %USERNAME%:(R) + +icacls %USERPROFILE%\.ssh\config /grant:r SYSTEM:(R) + +icacls %USERPROFILE%\.ssh\config +``` + +And: + +```cmd +icacls %USERPROFILE%\.ssh\google_compute_engine /inheritance:r + +icacls %USERPROFILE%\.ssh\google_compute_engine /grant:r %USERNAME%:(R) + +icacls %USERPROFILE%\.ssh\google_compute_engine /grant:r SYSTEM:(R) + +icacls %USERPROFILE%\.ssh\google_compute_engine +``` +$WINDOWS_END + +### Connect with VS Code + +To connect to your Virtual Machine, click on the small symbol at the very bottom-left corner of VS Code: + +![](/images/vscode_remote_highlight.png) + +It should bring up a menu, click on **Connect to Host...**: + +![](/images/vscode_remote_menu.png) + +Click on the name of your Virtual Machine: + +![](/images/vscode_remote_hosts.png) + +A new VS Code window will open. You may be asked to select the platform of the remote host, select **Linux**. You will then be asked to _fingerprint_ the connection. VS Code is asking if you trust the remote host you are trying to connect to. Hit enter to continue. + +![](/images/vscode_remote_fingerprint.png) + +And you are connected! It should look similar too: + +![](/images/vscode_remote_connected.png) + +Notice the connection in the very bottom-left corner of your VS Code window. It should have the Connection type (SSH), and the name of the host you are connected to. + +**The setup of your local machine is over. All following commands will be run from within your 🚨 virtual machine**🚨 terminal (via VS Code) + +
+Viewing your SSH Configuration + +If you want to view your SSH configuration: +1. Start by clicking the symbol in the bottom-left corner of VS Code +2. Click on **Connect to Host...** +3. Click on **Configure SSH Hosts...** +4. Select the configuration file, usually the one at the top of the list. +5. View your configuration file! You may need to edit this configuration if you change computers, or want to work on more than one computer during the bootcamp. + +
diff --git a/_partials/win_docker.md b/_partials/win_docker.md deleted file mode 100644 index f8f44cf..0000000 --- a/_partials/win_docker.md +++ /dev/null @@ -1,23 +0,0 @@ -## Docker πŸ‹ - -Docker is an open platform for developing, shipping, and running applications. - -_if you already have Docker installed on your machine please update with the latest version_ - -### Install Docker - -Go to [Docker for WSL2](https://docs.docker.com/docker-for-windows/wsl/). - -Download and install the Docker Desktop WSL 2 backend. - -Once done, start Docker. - -You should be able to run in a Ubuntu terminal: - -```bash -docker run hello-world -``` - -The following message should print: - -![](images/docker_hello.png) diff --git a/_partials/win_jupyter.md b/_partials/win_jupyter.md deleted file mode 100644 index 1a725e4..0000000 --- a/_partials/win_jupyter.md +++ /dev/null @@ -1,42 +0,0 @@ - -## Configuring Jupyter Notebook to open in your browser - -Let's generate the configuration file for **Jupyter Notebook**... - -``` bash -jupyter notebook --generate-config -``` - -⚠️ Please copy the path returned by the previous command. - -We will now edit the generated Jupyter configuration file: - -``` bash - $HOME/.jupyter/jupyter_notebook_config.py -``` - -Locate the following line in the configuration file: - -``` python -# c.NotebookApp.use_redirect_file = True -``` - -And replace it with this one: - -``` python -c.NotebookApp.use_redirect_file = False -``` - -Let's try to run Jupyter: - -``` bash -jupyter notebook -``` - -This command should have opened a Jupyter page in your browser: - -![](images/wsl_jupyter_notebook.png) - -If it is not the case, please call a TA. - -To stop the Jupyter server in the terminal, press `Ctrl` + `C`, enter y, then press Enter. 
diff --git a/automation/README.md b/automation/README.md new file mode 100644 index 0000000..f603ece --- /dev/null +++ b/automation/README.md @@ -0,0 +1,563 @@ +# πŸ’» Automated DE GCP VM Setup - Terraform + Ansible + +This document contains instructions to set a largely automated VM setup on Google Cloud Platform. + +This setup has three main components, with the second and third components being largely automated: + +## Part 1: Setup of your local computer + +In this section we'll setup your local computer with some required software. This will include: +1. Installing some communication tools: Zoom, Slack +1. Creating some accounts: Github, Google Cloud Platform +1. Installing Visual Studio Code (VS Code) and some useful VS Code extensions +1. Installing and authenticating the GCP Command Line Tool (CLI): `gcloud`. You'll be using `gcloud` to create the connection between your local machine and the virtual machine you'll be creating! +1. Installing **Terraform** on your local computer. Terraform is an **Infrastructure as Code** (IaC) tool that you'll use to automate the creation of your VM! +1. Connect VS Code to your VM! + +## Part 2: Configuration of your Virtual Machine part 1 + +All parts of this section will happen on your VM. + +In this section you will: +1. Authenticate your VM with `gcloud` and for code that interacts with GCP Services +1. Run an `ansible` playbook. Ansible is another **Infrastructure as Code** tool that is used to automate the configuration and installation of software on computers. Perfect for fine tuning your VM. +1. Login to the GitHub CLI tool on your VM +1. Copy some **dotfiles** provided. **Dotfiles** are more settings that will enhance your terminal and developer experience. + +## Part 3: Configuration of your Virtual Machine part 2 + +All parts of this section will happen on your VM. + +In this section you will: +1. Run a second `ansible` playbook. This playbook will: + 1. Install some tools and frameworks (like python) + 1. 
Fork the `data-engineering-challenges` repository + 1. Install a Python virtual environment for every challenge using Poetry +1. Test your setup to make sure everything is working as intended 👌 + +
+ +# 1️⃣ Local Machine Setup + +We'll start with some communication tools that are widely used. + +## 1.1. Zoom + +Use existing partial + +## 1.2. Slack + +Use existing partial + +## 1.3. GitHub + +Use existing partial + +## 1.4. Google Cloud Platform Setup + +Use existing partial + +## 1.5. GCP APIs + +Use existing partial + +## ✨ 1.6. Download `gcloud` locally + +Link to `gcloud` CLI install docs at this [link here](https://cloud.google.com/sdk/docs/install). + + +
+🪟 Windows + +A note about using Windows Terminal vs PowerShell here. + +Use the installer + +
+ +
+ +
+🍎 MacOS + +Use existing partial from DS setup +
+ +
+ +
+🐧 Linux + +Use existing partial from DS setup +
+ +Remove section on creating a service account. Or leave it in but don't create a key for it. + +## 1.7. Authorize local `gcloud` + +Use existing partial from DS setup + +Add something about Windows for the installer + +- 🪟 Windows: TODO - unsure. I assume it's the same +- 🍎 MacOS and 🐧 Linux: `gcloud auth login` + +## 1.8. Visual Studio Code + +Use existing partial. Modify for **Remote - SSH** connection. + +## ✨ 1.9. Install Terraform Locally + +Terraform installer docs at this [link here](https://developer.hashicorp.com/terraform/install). + +
+🪟 Windows + +Download binary and run + +
+ +
+ +
+🍎 MacOS + +```bash +brew tap hashicorp/tap +brew install hashicorp/tap/terraform +``` +
+ +
+ +
+🐧 Linux + +```bash +wget -O - https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg +echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list +sudo apt update && sudo apt install terraform +``` +
+ +## ✨ 1.10. Create your VM with Terraform + +Will require different instructions for Windows, MacOS, and Linux. + +Download the `terraform` files needed to provision your VM. This will be a `curl` from the repo. + +```bash +# MacOS & Linux. TODO: add Windows +# TODO: Change the branch name before merging +mkdir -p ~/wagon-de-bootcamp +curl -L -o ~/wagon-de-bootcamp/main.tf https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/main.tf +curl -L -o ~/wagon-de-bootcamp/provider.tf https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/provider.tf +curl -L -o ~/wagon-de-bootcamp/variables.tf https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/variables.tf +curl -L -o ~/wagon-de-bootcamp/terraform.tfvars https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/terraform.tfvars +curl -L -o ~/wagon-de-bootcamp/.terraform.lock.hcl https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/.terraform.lock.hcl +``` + +For testing, copy the terraform files from `infra/` in the repo into `~/code//de-vm-setup`. Files to copy: +- `main.tf` +- `provider.tf` +- `variables.tf` +- `terraform.tfvars` + +We'll need the username of your LOCAL computer. This is important to know and set. +- 🪟 Windows: TODO - unsure. +- 🍎 MacOS: type `whoami` in the terminal +- 🐧 Linux: type `whoami` in the terminal + +Open up `terraform.tfvars` in VS Code and take a look at it. It should look like: + +```bash +project_id = "" +region = "" +zone = "" +instance_name = "" +instance_user = "" +``` + +You'll need to change some values in this file. A good reference is the GCP Console available at this [link here](https://console.cloud.google.com). 
+ +To determine your `region` and `zone`, take a look at the GCP Region and Zones documentation at this [link here](https://cloud.google.com/compute/docs/regions-zones). We strongly recommend that you select a region that is as close to you geographically as possible. + +Once completed, the file should look similar to: + +```bash +project_id = "my-gcp-project" +region = "europe-west1" +zone = "europe-west1-b" +instance_name = "de-bootcamp-vm" +instance_user = "taylorswift" # the result of `whoami` +``` + +Make sure to save the `terraform.tfvars` file and run: + +```bash +# MacOS & Linux. TODO: add Windows +cd ~/wagon-de-bootcamp + +terraform init + +terraform apply -auto-approve +``` + +Your VM should now be up and running! Check the GCP Compute Engine Console at this [link here](https://console.cloud.google.com/compute/instances) to confirm. + +## ✨ 1.11. Connect to your VM + +In a terminal enter the following command: + +```bash +gcloud compute config-ssh +``` + +And connect via VS Code. + +TODO: Add image assets + +
+ +# 2️⃣ Virtual Machine Part 1 + +## ✨ 2.1. Connect to VM and confirm `ansible` install + +In your VM's terminal, type: + +```bash +ansible --version +``` + +The output should look like: + +TODO: Add image asset + +## ✨ 2.2. Authenticate GCP CLI and ADC + +Use existing partial for bulk of this. + +`gcloud` comes pre-installed on GCP Virtual Machines! + +We need to authenticate `gcloud` on our virtual machine, and set up Application Default Credentials (ADC), so we can interact with GCP services from the command line and in our code. + +```bash +# gcloud login +gcloud auth login + +# ADC login +gcloud auth application-default login +``` + +Set your GCP project in `gcloud`: + +```bash +# Replace `PROJECT_ID` with the `ID` of your project +gcloud config set project PROJECT_ID +``` + +Confirm: + +```bash +gcloud config list +``` + +## ✨ 2.3. Run Ansible playbook 1 + +Download the first Ansible playbook with the following: + +```bash +# TODO: Update if merged +mkdir -p ~/vm-ansible-setup/playbooks +curl -L -o ~/vm-ansible-setup/ansible.cfg https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/vm-ansible-setup/ansible.cfg +curl -L -o ~/vm-ansible-setup/hosts https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/vm-ansible-setup/hosts +curl -L -o ~/vm-ansible-setup/playbooks/setup_vm_part1.yml https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/vm-ansible-setup/playbooks/setup_vm_part1.yml +``` + +And run with: + +```bash +cd ~/vm-ansible-setup && ansible-playbook playbooks/setup_vm_part1.yml +``` + +❗ If any errors occur, contact a teacher. You can safely run the playbook again. + +Close all your terminals and open a new one (you might have to do this a few times; the shell should change from `bash` to `zsh`). It should look like: + +TODO: Add image assets +(Imagine basic zsh + OMZ) + +## 2.4. GitHub CLI Auth + +Use existing partial + +Can't be easily automated without creating and copying SSH keys, and generating GitHub personal access tokens (PATs). + +## 2.5. Copy LW Dotfiles + +Use existing partial + +Can't be easily automated, needs student input. + +Close all terminals and open a new one. It should look like: + +TODO: Add image asset +(Imagine LW zsh setup) + +
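To confirm that the shell change from the first playbook took effect, a minimal sketch like the following reads your login shell from the user database and checks that it ends in `zsh`. The `is_zsh` helper is a hypothetical name used for illustration, and the check assumes `getent` is available on the VM.

```bash
#!/usr/bin/env bash
# is_zsh is a hypothetical helper: succeeds when the given shell path ends in "zsh".
is_zsh() {
  case "$1" in
    */zsh | zsh) return 0 ;;
    *) return 1 ;;
  esac
}

# Look up the login shell recorded for the current user (field 7 of the passwd entry).
login_shell="$(getent passwd "$(id -un)" | cut -d: -f7)"

if is_zsh "$login_shell"; then
  echo "Login shell is zsh: ${login_shell}"
else
  echo "Login shell is '${login_shell:-unknown}', not zsh yet; open a new terminal or run the playbook again"
fi
```

A new terminal only picks up the change after the old sessions are closed, which is why the steps above ask you to close all terminals first.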
+ +# 3️⃣ Virtual Machine Part 2 + +In this section we'll run a second `ansible` playbook and check our setup. + +## ✨ 3.1. Run Ansible playbook 2 + +Download the second Ansible playbook with the following: + +```bash +curl -L -o ~/vm-ansible-setup/playbooks/setup_vm_part2.yml https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/vm-ansible-setup/playbooks/setup_vm_part2.yml +``` + +And run with: + +```bash +cd ~/vm-ansible-setup && ansible-playbook playbooks/setup_vm_part2.yml +``` + +❗ If any errors occur, contact a teacher. You can safely run the playbook again. + +Once the playbook has finished, you need to completely SHUT DOWN your VM from the GCP console at [this link here](https://console.cloud.google.com/compute/instances). Closing VS Code and opening it again is not sufficient. + +TODO: add image assets + +## ✨ 3.2. Check your Setup + +Things to check: + +Python: + +```bash +python --version +``` + +Should return: + +```bash +Python 3.12.8 +``` + +TODO: Add image assets from existing partial + +Pyenv: + +```bash +pyenv versions +``` + +Should return: + +```bash + system +* 3.12.8 (set by /home//.pyenv/version) +``` + +Pipx: + +```bash +pipx list +``` + +Should return: + +```bash +venvs are in /home//.local/share/pipx/venvs +apps are exposed on your $PATH at /home//.local/bin +manual pages are exposed at /home//.local/share/man + package poetry 2.1.1, installed using Python 3.12.8 + - poetry + package ruff 0.11.0, installed using Python 3.12.8 + - ruff + package tldr 3.3.0, installed using Python 3.12.8 + - tldr + - man1/tldr.1 +``` + +Data Engineering Challenges repo remotes: + +```bash +cd ~/code/$(gh api user | jq -r '.login')/data-engineering-challenges +git remote -v +``` + +Should return: + +```bash +origin git@github.com:/data-engineering-challenges.git (fetch) +origin git@github.com:/data-engineering-challenges.git (push) +upstream git@github.com:lewagon/data-engineering-challenges.git (fetch) 
+upstream git@github.com:lewagon/data-engineering-challenges.git (push) +``` + +Docker: + +```bash +docker run hello-world +``` + +Should return: + +```bash +Unable to find image 'hello-world:latest' locally +latest: Pulling from library/hello-world +e6590344b1a5: Pull complete +Digest: sha256:7e1a4e2d11e2ac7a8c3f768d4166c2defeb09d2a750b010412b6ea13de1efb19 +Status: Downloaded newer image for hello-world:latest + +Hello from Docker! +This message shows that your installation appears to be working correctly. + +To generate this message, Docker took the following steps: + 1. The Docker client contacted the Docker daemon. + 2. The Docker daemon pulled the "hello-world" image from the Docker Hub. + (amd64) + 3. The Docker daemon created a new container from that image which runs the + executable that produces the output you are currently reading. + 4. The Docker daemon streamed that output to the Docker client, which sent it + to your terminal. + +To try something more ambitious, you can run an Ubuntu container with: + $ docker run -it ubuntu bash + +Share images, automate workflows, and more with a free Docker ID: + https://hub.docker.com/ + +For more examples and ideas, visit: + https://docs.docker.com/get-started/ +``` + +TODO: Add image assets from existing partial + +Kubernetes: + +We can start by testing `minikube`: + +```bash +# Start +minikube start +``` + +Should return: + +```bash +😄 minikube v1.35.0 on Ubuntu 22.04 (amd64) +✨ Automatically selected the docker driver. Other choices: none, ssh +📌 Using Docker driver with root privileges +👍 Starting "minikube" primary control-plane node in "minikube" cluster +🚜 Pulling base image v0.0.46 ... +💾 Downloading Kubernetes v1.32.0 preload ... + > gcr.io/k8s-minikube/kicbase...: 500.31 MiB / 500.31 MiB 100.00% 88.19 M + > preloaded-images-k8s-v18-v1...: 333.57 MiB / 333.57 MiB 100.00% 32.20 M +🔥 Creating docker container (CPUs=2, Memory=3900MB) ... +🐳 Preparing Kubernetes v1.32.0 on Docker 27.4.1 ... 
+ ▪ Generating certificates and keys ... + ▪ Booting up control plane ... + ▪ Configuring RBAC rules ... +🔗 Configuring bridge CNI (Container Networking Interface) ... +🔎 Verifying Kubernetes components... + ▪ Using image gcr.io/k8s-minikube/storage-provisioner:v5 +🌟 Enabled addons: storage-provisioner, default-storageclass +🏄 Done! kubectl is now configured to use "minikube" cluster and "default" namespace by default +``` + +And then make sure the Kubernetes CLI utility, `kubectl`, works with: + +```bash +# Get pods +kubectl get po -A +``` + +Should return: + +```bash +NAMESPACE NAME READY STATUS RESTARTS AGE +kube-system coredns-668d6bf9bc-mg7b6 1/1 Running 0 72s +kube-system etcd-minikube 1/1 Running 0 78s +kube-system kube-apiserver-minikube 1/1 Running 0 76s +kube-system kube-controller-manager-minikube 1/1 Running 0 76s +kube-system kube-proxy-stk77 1/1 Running 0 72s +kube-system kube-scheduler-minikube 1/1 Running 0 76s +kube-system storage-provisioner 1/1 Running 1 (41s ago) 75s +``` + +And because `minikube` is resource intensive, delete the cluster for now with: + +```bash +# Delete all minikube clusters and profiles +minikube delete --all +``` + +Should return: + +```bash +🔥 Deleting "minikube" in docker ... +🔥 Removing /home//.minikube/machines/minikube ... +💀 Removed all traces of the "minikube" cluster. +🔥 Successfully deleted all profiles +``` + +Terraform: + +```bash +terraform --version +``` + +Should return: + +```bash +Terraform v1.11.2 +on linux_amd64 +``` + +Spark: + +```bash +spark-shell +``` + +Should take you into the Spark shell, which looks like: + +```bash +Setting default log level to "WARN". +To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). +25/03/18 08:54:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable +Spark context Web UI available at http://lw-de-vm.europe-north1-b.c.wagon-de.internal:4040 +Spark context available as 'sc' (master = local[*], app id = local-1742288096829). +Spark session available as 'spark'. +Welcome to + ____ __ + / __/__ ___ _____/ /__ + _\ \/ _ \/ _ `/ __/ '_/ + /___/ .__/\_,_/_/ /_/\_\ version 3.5.3 + /_/ + +Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 1.8.0_442) +Type in expressions to have them evaluated. +Type :help for more information. + +scala> +``` + +Type `:quit` and hit enter to exit the spark-shell and continue. + +## 3.3. Run make install + +Use existing partial + +To create python venvs. + +```bash +export GITHUB_USERNAME=`gh api user | jq -r '.login'` +echo $GITHUB_USERNAME + +cd ~/code/$GITHUB_USERNAME/data-engineering-challenges + +make install +``` diff --git a/automation/infra/.terraform.lock.hcl b/automation/infra/.terraform.lock.hcl new file mode 100644 index 0000000..4eaaa9e --- /dev/null +++ b/automation/infra/.terraform.lock.hcl @@ -0,0 +1,21 @@ +# This file is maintained automatically by "terraform init". +# Manual edits may be lost in future updates. 
+ +provider "registry.terraform.io/hashicorp/google" { + version = "6.24.0" + hashes = [ + "h1:Y9f/Q1dBiYpd8BvfSrkvSF3smM0SlHCoh66+KF0uzB8=", + "zh:0e7bb01149f50eabab725e8a0efadcb1cbfd7389f45adfb12e04f4f15a4fb5eb", + "zh:4172d07d61168e4246125e77ba5c67e96309783e2a8cd885cc51f3a73e7f14e2", + "zh:6952c1305d10b456170b2b7c34f0013ce4fd67161f6e7aa6daef61490da60252", + "zh:8ab7621209b352b12a0947865975ff83048c55a870a11306603b1b8052a3926b", + "zh:ba93efa1562d17f65001f8cce016ba903289ed985a7bec4b4d6339e3f52af3eb", + "zh:bc70ee209b816f74c9ffeaca9d3c85191ba8173f9f3f19425821a1ae9e4d47ea", + "zh:c9e8432861770f86a38a29c74d57cd5ecd7bec38fff0c719ed6136d34ae95ccd", + "zh:dfabe73e6de0cefa0b158f82647ca15325aa42bd0d8894ff82de02aed1c5814f", + "zh:e2798adc0d6edf9eb5e9ccbc2f4cd3914a0c76258e20690c86d7404490c10904", + "zh:f569b65999264a9416862bca5cd2a6177d94ccb0424f3a4ef424428912b9cb3c", + "zh:f8884173e9334c3c408ecb869e44478061ffd1f23de6a204f5ab454a55ea9f12", + "zh:f8ecbc3274389f6fbb5ff5fc10f06db2390d02f50c0b35ef1c07f0203c341717", + ] +} diff --git a/automation/infra/main.tf b/automation/infra/main.tf new file mode 100644 index 0000000..15b9f0a --- /dev/null +++ b/automation/infra/main.tf @@ -0,0 +1,51 @@ +resource "google_compute_address" "static_ip" { + name = "${var.instance_name}-static-ip" + region = var.region +} + + +resource "google_compute_instance" "my-instance" { + name = var.instance_name + machine_type = "e2-standard-4" + zone = var.zone + + boot_disk { + initialize_params { + image = "ubuntu-2204-jammy-v20250305" + size = 100 + type = "pd-balanced" + } + } + + network_interface { + network = "default" + access_config { + nat_ip = google_compute_address.static_ip.address + network_tier = "PREMIUM" + } + } + + metadata_startup_script = <<-EOT +#!/bin/bash +set -e + +# Ensure the user exists +if ! 
id "${var.instance_user}" &>/dev/null; then + useradd -m -s /bin/bash ${var.instance_user} + echo "${var.instance_user} ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers +fi + +# Ensure Ansible is installed +if ! command -v ansible &> /dev/null; then + apt update -y + apt install -y software-properties-common + add-apt-repository --yes --update ppa:ansible/ansible + apt install -y ansible +fi + +# Output Ansible version +sudo -u ${var.instance_user} ansible --version + +echo "Ansible installed successfully!" +EOT +} diff --git a/automation/infra/outputs.tf b/automation/infra/outputs.tf new file mode 100644 index 0000000..7d8908a --- /dev/null +++ b/automation/infra/outputs.tf @@ -0,0 +1,9 @@ +output "static_ip" { + description = "Static Public IP of the VM" + value = google_compute_address.static_ip.address +} + +output "instance_ip" { + description = "External IP of the VM" + value = google_compute_instance.my-instance.network_interface[0].access_config[0].nat_ip +} diff --git a/automation/infra/provider.tf b/automation/infra/provider.tf new file mode 100644 index 0000000..9e68fc4 --- /dev/null +++ b/automation/infra/provider.tf @@ -0,0 +1,13 @@ +terraform { + required_providers { + google = { + source = "hashicorp/google" + version = "6.24.0" + } + } +} + +provider "google" { + project = var.project_id + region = var.region +} diff --git a/automation/infra/terraform.tfvars b/automation/infra/terraform.tfvars new file mode 100644 index 0000000..d8554b0 --- /dev/null +++ b/automation/infra/terraform.tfvars @@ -0,0 +1,5 @@ +project_id = "" +region = "" +zone = "" +instance_name = "" +instance_user = "" diff --git a/automation/infra/variables.tf b/automation/infra/variables.tf new file mode 100644 index 0000000..0566410 --- /dev/null +++ b/automation/infra/variables.tf @@ -0,0 +1,24 @@ +variable "project_id" { + description = "GCP Project ID" + type = string +} + +variable "region" { + description = "GCP Region" + type = string +} + +variable "zone" { + description = "GCP 
Zone" + type = string +} + +variable "instance_name" { + description = "VM name" + type = string +} + +variable "instance_user" { + description = "Instance username" + type = string +} diff --git a/automation/vm-ansible-setup/ansible.cfg b/automation/vm-ansible-setup/ansible.cfg new file mode 100644 index 0000000..39517a9 --- /dev/null +++ b/automation/vm-ansible-setup/ansible.cfg @@ -0,0 +1,5 @@ +[defaults] +inventory = hosts +host_key_checking = False +retry_files_enabled = False +interpreter_python = auto diff --git a/automation/vm-ansible-setup/hosts b/automation/vm-ansible-setup/hosts new file mode 100644 index 0000000..13cfabe --- /dev/null +++ b/automation/vm-ansible-setup/hosts @@ -0,0 +1,2 @@ +[local] +localhost ansible_connection=local diff --git a/automation/vm-ansible-setup/playbooks/setup_vm_part1.yml b/automation/vm-ansible-setup/playbooks/setup_vm_part1.yml new file mode 100644 index 0000000..489d2be --- /dev/null +++ b/automation/vm-ansible-setup/playbooks/setup_vm_part1.yml @@ -0,0 +1,285 @@ +--- +- name: Setup VM Part 1 - Update VM with non-dependant software + hosts: localhost + connection: local + become: true + vars: + my_user: "{{ lookup('env', 'USER') }}" + architecture: "{{ 'amd64' if ansible_architecture == 'x86_64' else ansible_architecture }}" + tasks: + - name: Debug Variables + debug: + msg: + - "Running playbook as the user: {{ my_user }}" + - "Running Ubuntu version: {{ ansible_distribution_release }}" + - "On architecture: {{ architecture }}" + + - name: Install apt packages + block: + - name: Check last apt update time + stat: + path: /var/lib/apt/periodic/update-success-stamp + register: apt_stamp + tags: apt + + - name: Update apt package list if update >24 hours ago + apt: + update_cache: true + become: true + when: apt_stamp.stat.exists == false or (ansible_date_time.epoch | int - apt_stamp.stat.mtime | int) > 86400 + tags: apt + + - name: Install required apt packages + apt: + name: + - vim + - tmux + - tree + - git + - 
ca-certificates + - curl + - jq + - unzip + - zsh + - apt-transport-https + - gnupg + - software-properties-common + - direnv + - sqlite3 + - make + - postgresql + - postgresql-contrib + - build-essential + - libssl-dev + - zlib1g-dev + - libbz2-dev + - libreadline-dev + - libsqlite3-dev + - wget + - llvm + - libncursesw5-dev + - xz-utils + - tk-dev + - libxml2-dev + - libxmlsec1-dev + - libffi-dev + - liblzma-dev + - gcc + - default-mysql-server + - default-libmysqlclient-dev + - libpython3-dev + - openjdk-8-jdk-headless + state: present + become: true + tags: apt + + - name: Set zsh as default shell + user: + name: "{{ my_user }}" + shell: /usr/bin/zsh + + - name: Install Oh My Zsh + block: + - name: Check if Oh My Zsh is installed + stat: + path: "/home/{{ my_user }}/.oh-my-zsh" + register: omz_check + tags: omz + + - name: Install Oh My Zsh from repo + shell: | + sh -c "$(curl -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)" "" --unattended + become_user: "{{ my_user }}" + when: not omz_check.stat.exists + tags: omz + + - name: Install Docker + block: + - name: Check if Docker is installed + stat: + path: /usr/bin/docker + register: docker_check + tags: docker + + - name: Add Docker GPG key + apt_key: + url: "https://download.docker.com/linux/ubuntu/gpg" + keyring: "/etc/apt/keyrings/docker.gpg" + state: present + when: not docker_check.stat.exists + tags: docker + + - name: Add Docker APT repository + apt_repository: + repo: "deb [arch={{ architecture }} signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu {{ ansible_distribution_release }} stable" + state: present + when: not docker_check.stat.exists + tags: docker + + - name: Update APT cache + apt: + update_cache: true + when: not docker_check.stat.exists + + - name: Install Docker packages + apt: + name: + - docker-ce + - docker-ce-cli + - containerd.io + - docker-buildx-plugin + - docker-compose-plugin + state: present + when: not 
docker_check.stat.exists + + - name: Ensure docker group exists + group: + name: docker + state: present + tags: docker + + - name: Add user to the docker group + user: + name: "{{ my_user }}" + groups: docker + append: true + tags: docker + + - name: Authenticate gcloud with Docker + block: + - name: Check if Docker authentication config exists + stat: + path: "/home/{{ my_user }}/.docker/config.json" + register: docker_config + tags: docker + + - name: Check if gcloud is authenticated with Docker + shell: "grep -q 'gcr.io' /home/{{ my_user }}/.docker/config.json" + become_user: "{{ my_user }}" + register: gcloud_docker_check + ignore_errors: true + changed_when: false + when: docker_config.stat.exists + tags: docker + + - name: Authenticate gcloud with Docker + shell: | + gcloud auth configure-docker + become_user: "{{ my_user }}" + when: not docker_config.stat.exists or gcloud_docker_check.rc != 0 + tags: docker + + - name: Enable GCP Artifact Registry API + block: + - name: Check if GCP Artifact Registry API is enabled + shell: | + gcloud services list --enabled --format="value(config.name)" | grep -q "^artifactregistry.googleapis.com$" + become_user: "{{ my_user }}" + register: gcp_ar_check + ignore_errors: true + changed_when: false + tags: gcp_ar + + - name: Enable GCP Artifact Registry API + shell: | + gcloud services enable artifactregistry.googleapis.com + become_user: "{{ my_user }}" + when: gcp_ar_check.rc != 0 + tags: gcp_ar + + - name: Install kubectl + block: + - name: Check if kubectl is installed + stat: + path: /usr/local/bin/kubectl + register: kubectl_check + tags: kubectl + + - name: Download kubectl binary and add to path + get_url: + url: "https://dl.k8s.io/release/{{ lookup('url', 'https://dl.k8s.io/release/stable.txt') }}/bin/linux/amd64/kubectl" + dest: "/usr/local/bin/kubectl" + mode: "0755" + when: not kubectl_check.stat.exists + tags: kubectl + + - name: Install minikube + block: + - name: Check if minikube is installed + stat: + path: 
/usr/local/bin/minikube + register: minikube_check + tags: minikube + + - name: Download minikube binary and add to path + get_url: + url: "https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64" + dest: "/usr/local/bin/minikube" + mode: "0755" + when: not minikube_check.stat.exists + tags: minikube + + - name: Install terraform + block: + - name: Check if terraform is installed + stat: + path: /usr/bin/terraform + register: terraform_check + tags: terraform + + - name: Add HashiCorp GPG key + apt_key: + url: "https://apt.releases.hashicorp.com/gpg" + keyring: "/usr/share/keyrings/hashicorp-archive-keyring.gpg" + state: present + tags: terraform + + - name: Add HashiCorp GPG repository + apt_repository: + repo: "deb [arch={{ architecture }} signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com {{ ansible_distribution_release }} main" + state: present + when: not terraform_check.stat.exists + tags: terraform + + - name: Update APT cache + apt: + update_cache: true + when: not terraform_check.stat.exists + tags: terraform + + - name: Install terraform + apt: + name: terraform + state: present + when: not terraform_check.stat.exists + tags: terraform + + - name: Install Github CLI + block: + - name: Check if Github CLI is installed + stat: + path: /usr/bin/gh + register: gh_check + tags: gh_cli + + - name: Add GitHub CLI GPG key + get_url: + url: "https://cli.github.com/packages/githubcli-archive-keyring.gpg" + dest: "/usr/share/keyrings/githubcli-archive-keyring.gpg" + mode: "0644" + when: not gh_check.stat.exists + tags: gh_cli + + - name: Add GitHub CLI APT repository + apt_repository: + repo: "deb [arch={{ architecture }} signed-by=/usr/share/keyrings/githubcli-archive-keyring.gpg] https://cli.github.com/packages stable main" + state: present + when: not gh_check.stat.exists + tags: gh_cli + + - name: Install GitHub CLI + apt: + name: gh + state: present + when: not gh_check.stat.exists + tags: gh_cli 
diff --git a/automation/vm-ansible-setup/playbooks/setup_vm_part2.yml b/automation/vm-ansible-setup/playbooks/setup_vm_part2.yml new file mode 100644 index 0000000..ab7aadd --- /dev/null +++ b/automation/vm-ansible-setup/playbooks/setup_vm_part2.yml @@ -0,0 +1,312 @@ +--- +- name: Setup VM Part 2 - Update VM with dependant software + hosts: localhost + connection: local + become: true + vars: + my_user: "{{ lookup('env', 'USER') }}" + architecture: "{{ 'amd64' if ansible_architecture == 'x86_64' else ansible_architecture }}" + ansible_shell_executable: /usr/bin/zsh + tasks: + - name: Debug Variables + debug: + msg: + - "Running playbook as the user: {{ my_user }}" + - "Running Ubuntu version: {{ ansible_distribution_release }}" + - "On architecture: {{ architecture }}" + + - name: Install Spark + block: + - name: Check if Spark is installed + stat: + path: "/home/{{ my_user }}/spark/spark-3.5.3-bin-hadoop3" + register: spark_installed + tags: spark + + - name: Install Spark (Async) + shell: | + wget -q https://archive.apache.org/dist/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz -O /tmp/spark.tgz + mkdir -p ~/spark && tar -xzf /tmp/spark.tgz -C ~/spark + become_user: "{{ my_user }}" + when: not spark_installed.stat.exists + async: 3600 # Runs in the background for up to 60 minutes + poll: 0 # Continue running other tasks + register: spark_async + tags: spark + + - name: Debug Spark async job ID + debug: + msg: "Spark async job ID: {{ spark_async.ansible_job_id }}" + when: not spark_installed.stat.exists + + - name: Add .zshrc plugins - direnv, pyenv, gcloud + lineinfile: + path: "~/.zshrc" + regexp: "plugins=" + line: "plugins=(git gitfast last-working-dir common-aliases zsh-syntax-highlighting history-substring-search pyenv ssh-agent direnv gcloud)" + create: true + become_user: "{{ my_user }}" + tags: zshrc + + - name: Add Spark, pyenv, pipx to PATH + blockinfile: + path: "~/.zshrc" + block: | + # Spark + export SPARK_HOME=$HOME/spark/spark-3.5.3-bin-hadoop3 + 
export PATH=$PATH:$SPARK_HOME/bin + + # pyenv + export PYENV_ROOT="$HOME/.pyenv" + [[ -d $PYENV_ROOT/bin ]] && export PATH="$PYENV_ROOT/bin:$PATH" + eval "$(pyenv init - zsh)" + + # pipx + export PATH="$HOME/.local/bin:$PATH" + marker: "# {mark} ANSIBLE CREATED BLOCK: path vars" + create: true + become_user: "{{ my_user }}" + tags: zshrc + + - name: Create direnv Poetry function + blockinfile: + path: "~/.direnvrc" + block: | + layout_poetry() { + if [[ ! -f pyproject.toml ]]; then + log_error 'No pyproject.toml found. Use `poetry new` or `poetry init` to create one first.' + exit 2 + fi + # create venv if it doesn't exist + poetry run true + + export VIRTUAL_ENV=$(poetry env info --path) + export POETRY_ACTIVE=1 + PATH_add "$VIRTUAL_ENV/bin" + } + marker: "# {mark} ANSIBLE CREATED BLOCK: layout_poetry" + create: true + mode: "0644" + become_user: "{{ my_user }}" + + - name: VS Code Config + block: + - name: Detect VS Code CLI path + shell: | + find /home/{{ my_user }}/.vscode-server/cli/servers/ -maxdepth 5 -type f -name "code" | head -n 1 + register: vscode_cli_path + changed_when: false + tags: vscode + + - name: Check installed VS Code extensions + shell: "{{ vscode_cli_path.stdout }} --list-extensions" + register: vscode_installed_extensions + changed_when: false + when: vscode_cli_path.stdout | length > 0 + tags: vscode + + - name: Install missing VS Code extensions + shell: "{{ vscode_cli_path.stdout }} --install-extension {{ item }}" + loop: + - ms-vscode.sublime-keybindings + - emmanuelbeziat.vscode-great-icons + - ms-python.python + - KevinRose.vsc-python-indent + - ms-python.vscode-pylance + - redhat.vscode-yaml + - ms-azuretools.vscode-docker + - tamasfe.even-better-toml + when: vscode_cli_path.stdout | length > 0 and vscode_installed_extensions.stdout is defined and (item not in vscode_installed_extensions.stdout) + become_user: "{{ my_user }}" + tags: vscode + + - name: Update VS Code python interpreter to poetry venv + lineinfile: + path: 
"~/.vscode-server/data/Machine/settings.json" + regexp: ' "python.defaultInterpreterPath": "~/.pyenv/shims/python",' + line: ' "python.defaultInterpreterPath": ".venv/bin/python",' + create: true + become_user: "{{ my_user }}" + tags: vscode + + - name: Install pyenv and pyenv-virtualenv + block: + - name: Check if pyenv is installed + stat: + path: "/home/{{ my_user }}/.pyenv" + register: pyenv_installed + tags: pyenv + + - name: Install pyenv + git: + repo: "https://github.com/pyenv/pyenv.git" + dest: "/home/{{ my_user }}/.pyenv" + become_user: "{{ my_user }}" + when: not pyenv_installed.stat.exists + tags: pyenv + + - name: Check if pyenv-virtualenv is installed + stat: + path: "/home/{{ my_user }}/.pyenv/plugins/pyenv-virtualenv" + register: pyenv_virtualenv_installed + tags: pyenv + + - name: Install pyenv-virtualenv + git: + repo: "https://github.com/pyenv/pyenv-virtualenv.git" + dest: "/home/{{ my_user }}/.pyenv/plugins/pyenv-virtualenv" + become_user: "{{ my_user }}" + when: not pyenv_virtualenv_installed.stat.exists + tags: pyenv + + - name: Install Python 3.12.8 using pyenv + block: + - name: Check if Python 3.12.8 is installed + command: > + zsh -c 'export PYENV_ROOT="$HOME/.pyenv"; + export PATH="$PYENV_ROOT/bin:$PATH"; + eval "$(pyenv init --path)"; + pyenv versions --bare' + become_user: "{{ my_user }}" + register: python_installed + changed_when: false + tags: python + + - name: Install Python 3.12.8 with pyenv - takes a while + command: > + zsh -c 'export PYENV_ROOT="$HOME/.pyenv"; + export PATH="$PYENV_ROOT/bin:$PATH"; + eval "$(pyenv init --path)"; + pyenv install 3.12.8 && pyenv global 3.12.8' + become_user: "{{ my_user }}" + when: "'3.12.8' not in python_installed.stdout" + tags: python + + - name: Install pipx + block: + - name: Check if pipx is installed + stat: + path: "/home/{{ my_user }}/.local/bin/pipx" + register: pipx_installed + tags: pipx + + - name: Install pipx + shell: | + source ~/.zshrc + pip install --upgrade pip + python -m 
ensurepip --default-pip + python -m pip install --user pipx + python -m pipx ensurepath + become_user: "{{ my_user }}" + when: not pipx_installed.stat.exists + tags: pipx + + - name: Install pipx packages + block: + - name: Check installed pipx packages + command: "/home/{{ my_user }}/.local/bin/pipx list" + become_user: "{{ my_user }}" + register: pipx_list + changed_when: false + tags: pipx_packages + + - name: Install missing pipx packages - poetry, tldr, ruff + command: "/home/{{ my_user }}/.local/bin/pipx install {{ item }}" + loop: + - poetry + - tldr + - ruff + become_user: "{{ my_user }}" + when: "item not in pipx_list.stdout" + tags: pipx_packages + + - name: Ensure Poetry venv is set to in-project + shell: | + source ~/.zshrc + poetry config virtualenvs.in-project true + become_user: "{{ my_user }}" + changed_when: false + + - name: Fork and clone lewagon/data-engineering-challenges + block: + - name: Get GitHub Username using gh CLI + command: gh api user --jq '.login' + become_user: "{{ my_user }}" + register: github_username + changed_when: false + tags: github + + - name: Check if lewagon/data-engineering-challenges has been forked + command: gh repo list --json nameWithOwner --jq '.[] | select(.nameWithOwner == "{{ github_username.stdout }}/data-engineering-challenges")' + become_user: "{{ my_user }}" + register: fork_exists + changed_when: false + failed_when: false + tags: github + + - name: Fork lewagon/data-engineering-challenges repo + command: gh repo fork lewagon/data-engineering-challenges --remote + become_user: "{{ my_user }}" + when: fork_exists.stdout | trim == "" + tags: github + + - name: Wait for GitHub fork to be fully available + command: "gh repo view {{ github_username.stdout }}/data-engineering-challenges" + become_user: "{{ my_user }}" + register: fork_ready + retries: 10 + delay: 10 + until: fork_ready.rc == 0 + changed_when: false + tags: github + + - name: Check if ~/code/{{ github_username.stdout 
}}/data-engineering-challenges exist on the VM + stat: + path: "/home/{{ my_user }}/code/{{ github_username.stdout }}/data-engineering-challenges" + register: local_repo_exists + tags: github + + - name: Clone data-engineering-challenges to VM + git: + repo: "git@github.com:{{ github_username.stdout }}/data-engineering-challenges.git" + dest: "/home/{{ my_user }}/code/{{ github_username.stdout }}/data-engineering-challenges" + clone: true + update: false + accept_hostkey: true + become_user: "{{ my_user }}" + when: not local_repo_exists.stat.exists + tags: github + + - name: Check data-engineering-challenges git remotes + command: git remote -v + args: + chdir: "/home/{{ my_user }}/code/{{ github_username.stdout }}/data-engineering-challenges" + become_user: "{{ my_user }}" + register: existing_remotes + changed_when: false + tags: github + + - name: Add upstream remote to data-engineering-challenges + command: git remote add upstream git@github.com:lewagon/data-engineering-challenges.git + args: + chdir: "/home/{{ my_user }}/code/{{ github_username.stdout }}/data-engineering-challenges" + become_user: "{{ my_user }}" + when: "'upstream' not in existing_remotes.stdout" + tags: github + + - name: Wait for Spark installation to complete - Can take a while. 
1 Retry = 60s + async_status: + jid: "{{ spark_async.ansible_job_id }}" + become_user: "{{ my_user }}" + register: spark_result + until: spark_result.finished + retries: 150 + delay: 60 + when: spark_async.ansible_job_id is defined + tags: spark + + # - name: Create Poetry environments + # shell: | + # GITHUB_USERNAME=$(gh api user | jq -r '.login') + # cd ~/code/$GITHUB_USERNAME/data-engineering-challenges && make install + # become_user: "{{ my_user }}" diff --git a/build.rb b/build.rb index 01d8f9b..692b640 100755 --- a/build.rb +++ b/build.rb @@ -15,29 +15,30 @@ setup/macos_slack setup/slack_settings setup/github - ssh_key + chrome gcp_setup - virtual_machine + homebrew osx_vscode vscode_remote_ssh - vscode_extensions - cli_tools - setup/oh_my_zsh + gcp_cli_setup + gcp_cli_oauth + gcp_adc_auth + terraform + terraform_vm + vscode_ssh_connection + gcp_auth_vm_heading + gcp_cli_oauth + gcp_adc_auth + ubuntu_ansible_part1 setup/gh_cli - ubuntu_gcloud - gcp_setup_linux dotfiles dotfiles_new_student dotfiles_new_laptop dotfiles_new_laptop_heading dotfiles_new_laptop - zsh_default_terminal - setup/ssh_agent - ubuntu_docker - kubernetes - terraform - ubuntu_spark - ubuntu_python + dotfiles_terminal + ubuntu_ansible_part2 + ubuntu_vm_test repo_overview dbeaver setup/kitt @@ -54,29 +55,28 @@ setup/windows_slack setup/slack_settings setup/github - ssh_key + chrome gcp_setup - virtual_machine win_vscode vscode_remote_ssh - vscode_extensions - cli_tools - setup/oh_my_zsh + gcp_cli_setup + gcp_adc_auth + terraform + terraform_vm + vscode_ssh_connection + gcp_auth_vm_heading + gcp_cli_oauth + gcp_adc_auth + ubuntu_ansible_part1 setup/gh_cli - ubuntu_gcloud - gcp_setup_linux dotfiles dotfiles_new_student dotfiles_new_laptop dotfiles_new_laptop_heading dotfiles_new_laptop - zsh_default_terminal - setup/ssh_agent - ubuntu_docker - kubernetes - terraform - ubuntu_spark - ubuntu_python + dotfiles_terminal + ubuntu_ansible_part2 + ubuntu_vm_test repo_overview dbeaver setup/kitt @@ 
-93,29 +93,29 @@ setup/ubuntu_slack setup/slack_settings setup/github - ssh_key + chrome gcp_setup - virtual_machine setup/ubuntu_vscode vscode_remote_ssh - vscode_extensions - cli_tools - setup/oh_my_zsh + gcp_cli_setup + gcp_cli_oauth + gcp_adc_auth + terraform + terraform_vm + vscode_ssh_connection + gcp_auth_vm_heading + gcp_cli_oauth + gcp_adc_auth + ubuntu_ansible_part1 setup/gh_cli - ubuntu_gcloud - gcp_setup_linux dotfiles dotfiles_new_student dotfiles_new_laptop dotfiles_new_laptop_heading dotfiles_new_laptop - zsh_default_terminal - setup/ssh_agent - ubuntu_docker - kubernetes - terraform - ubuntu_spark - ubuntu_python + dotfiles_terminal + ubuntu_ansible_part2 + ubuntu_vm_test repo_overview dbeaver setup/kitt diff --git a/images/gcp_vm_stop.png b/images/gcp_vm_stop.png new file mode 100644 index 0000000..45525ee Binary files /dev/null and b/images/gcp_vm_stop.png differ diff --git a/images/repo_overview.png b/images/repo_overview.png new file mode 100644 index 0000000..6ca7867 Binary files /dev/null and b/images/repo_overview.png differ diff --git a/images/vscode_after_ansible1.png b/images/vscode_after_ansible1.png new file mode 100644 index 0000000..bf3cb2c Binary files /dev/null and b/images/vscode_after_ansible1.png differ diff --git a/images/vscode_remote_connected.png b/images/vscode_remote_connected.png new file mode 100644 index 0000000..fc35c6f Binary files /dev/null and b/images/vscode_remote_connected.png differ diff --git a/images/vscode_remote_fingerprint.png b/images/vscode_remote_fingerprint.png new file mode 100644 index 0000000..a964696 Binary files /dev/null and b/images/vscode_remote_fingerprint.png differ diff --git a/images/vscode_remote_highlight.png b/images/vscode_remote_highlight.png new file mode 100644 index 0000000..2d608ee Binary files /dev/null and b/images/vscode_remote_highlight.png differ diff --git a/images/vscode_remote_hosts.png b/images/vscode_remote_hosts.png new file mode 100644 index 0000000..6e25b1f Binary files 
/dev/null and b/images/vscode_remote_hosts.png differ diff --git a/images/vscode_remote_menu.png b/images/vscode_remote_menu.png new file mode 100644 index 0000000..2218351 Binary files /dev/null and b/images/vscode_remote_menu.png differ diff --git a/macOS.md b/macOS.md index 8e63d2b..f003c70 100644 --- a/macOS.md +++ b/macOS.md @@ -6,6 +6,42 @@ A part of the setup will be done on your **local machine** but most of the confi Please **read instructions carefully and execute all commands in the following order**. If you get stuck, don't hesitate to ask a teacher for help :raising_hand: +This setup is largely automated with **Terraform** and **Ansible**. There are three main components to the setup! **Terraform** and **ansible** are _Infrastructure as Code_ tools. +- **Terraform** excels at creating and destroying cloud resources, like virtual machines, IP addresses, databases and more! +- **Ansible** is used to configure linux machines with specific settings and software. Perfect for fine-tuning the Virtual Machine you will be creating! + +## Part 1: Setup your local computer + +In this section you'll setup your local computer and create some accounts. It will include things like: +1. Install some communication tools: Zoom, Slack +2. Create some accounts: Github, Google Cloud Platform (GCP) +3. Install Visual Studio Code (VS Code) +4. Install and authentication the GCP command line tool: `gcloud` +5. Install **terraform** on your local computer +6. Create your virtual machine with **terraform** and connect to it with **VS Code**! + +## Part 2: Configure your Virtual Machine Part 1 + +All parts of this section happen on your virtual machine. + +This section includes: +1. Authenticate your virtual machine with `gcloud` +2. Download and run an **ansible** playbook to partially configure your virtual machine +3. Login to the Github command line tool on your virtual machine +4. Copy the Le Wagon recommended **dotfiles**. 
**Dotfiles** are settings that will enhance your terminal and developer experience! + +## Part 3: Configure your Virtual Machine Part 2 + +All parts of this section happen on your virtual machine. + +In this section you will: +1. Download and run a second **ansible** playbook for some more fine tuning +2. Test your set up to make sure that everything has installed correctly +3. Create isolated python environments for all your challenges + + +Don't worry, we'll go into more detail in each of the individual sections. + Let's start :rocket: @@ -89,62 +125,15 @@ Have you signed up to GitHub? If not, [do it right away](https://github.com/join :point_right: **[Enable Two-Factor Authentication (2FA)](https://docs.github.com/en/authentication/securing-your-account-with-two-factor-authentication-2fa/configuring-two-factor-authentication#configuring-two-factor-authentication-using-text-messages)**. GitHub will send you text messages with a code when you try to log in. This is important for security and also will soon be required in order to contribute code on GitHub. -## SSH key +## Chrome - your browser -We want to safely communicate with your virtual machine using [SSH protocol](https://en.wikipedia.org/wiki/Secure_Shell). We need to generate a SSH key to authenticate. +Install the Google Chrome browser if you haven't got it already and set it as a __default browser__. -- Open your terminal +Follow the steps for your system from this link :point_right: [Install Google Chrome](https://support.google.com/chrome/answer/95346?co=GENIE.Platform%3DDesktop&hl=en-GB) -
- 💡 Windows tip - -We highly recommend installing [Windows Terminal](https://apps.microsoft.com/store/detail/windows-terminal/9N0DX20HK701?hl=fr-fr&gl=FR) from the Windows Store (installed on Windows 11 by default) to perform this operation -
+__Why Chrome?__ -- Create a SSH key - -
- Windows - -```bash -# replace "your_email@example.com" with your GCP account email -ssh-keygen.exe -t ed25519 -C "your_email@example.com" -``` -
- -
- MacOS & Linux - -```bash -# replace "your_email@example.com" with your GCP account email -ssh-keygen -t ed25519 -C "your_email@example.com" -``` -
- - -You should get the following message: `> Generating public/private algorithm key pair.` -- When you are prompted `> Enter a file in which to save the key`, press Enter -- You should be asked to `Enter a passphrase` - this is optional if you want additional security. To continue without a passphrase press enter without typing anything when asked to enter a passphrase. - -ℹ️ Don't worry if nothing prompt when you type, that is perfectly normal for security reasons. - -- You should be asked to `Enter same passphrase again`, do it. - -**❗️ You must remember this passphrase.** - -
- ❗️ /home/your_username/.ssh/id_ed25519 already exists. -If you receive this message, you may already have an SSH Key with the same name (if you are a Le Wagon Alumni or are using SSH Authentication with Github). - -To create a separate SSH key to exclusively use for this bootcamp use the following: - -```bash -# replace "your_email@example.com" with your GCP account email -ssh-keygen -t ed25519 -f ~/.ssh/de-bootcamp -C "your_email@example.com" -``` - -Your new SSH Key will be named `de-bootcamp`. Make sure to remember it for later! -
+We recommend to use it as your default browser as it's most compatible with testing or running your code, as well as working with Google Cloud Platform. Another alternative is Firefox, however we don't recommend using other tools like Opera, Internet Explorer or Safari. ## Google Cloud Platform setup @@ -287,85 +276,75 @@ Go to your project [APIs dashboard](https://console.cloud.google.com/apis/dashbo - Compute Engine is now enabled on your project -## Virtual Machine (VM) - -**πŸ‘Œ Note: Skip to the next section if you already have a VM set up** - -_Note: The following section requires you already have a [Google Cloud Platform](https://cloud.google.com/) account associated with an active [Billing account](https://console.cloud.google.com/billing)._ - -- Go to console.cloud.google.com > > Compute Engine > VM instances > Create instance -- Name it `lewagon-data-eng-vm-`, replace `` with your own, e.g. `krokrob` -- Region `europe-west1`, choose the closest one among the [available regions](https://cloud.google.com/compute/docs/regions-zones#available) - - gcloud-console-vm-create-instance -- In the section `Machine configuration` under the sub-heading `Machine type` -- Select General purpose > PRESET > e2-standard-4 +## Homebrew +### 1. Install: +On Mac, you need to install [Homebrew](http://brew.sh/) which is a Package Manager. +It will be used as soon as we need to install some software. +To do so, open your Terminal and run: - gcloud-console-vm-e2-standard4 -- Boot disk > Change - - Operating system > Ubuntu - - Version > Ubuntu 22.04 LTS x86/64 - - Boot disk type > Balanced persistent disk - - Size > upgrade to 150GB - - gcloud-console-vm-ubunt -- Open `Networking, Disks, ...` under `Advanced options` -- Open `Networking` +```bash +/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)" +``` - gcloud-console-vm-networking -- Go to `Network interfaces` and click on `default default (...)` with a downward arrow on the right. 
+This will ask for your confirmation (hit `Enter`) and your **macOS user account password** (the one you use to [log in](https://support.apple.com/en-gb/HT202860) when you reboot your Macbook). +:warning: When typing a password in the Terminal, you will **not** get a visual feedback (something like `*****`), this is **normal**!! Type the password and confirm by typing `Enter`. - gcloud-console-vm-network-interfaces -- This opened a box `Edit network interface` -- Go to the dropdown `External IPv4 address`, click on it, click on `RESERVE STATIC EXTERNAL IP ADDRESS` +
+ 🛠 If you get a Error: Not a valid ref: refs/remotes/origin/master error - gcloud-console-vm-create-static-ip -- Give it a name, like "lewagon-data-eng-vm-ip-" (replace `` with your own) and description "Le Wagon - Data Engineering VM IP". This will take a few seconds. - gcloud-console-reserve-static-ip +The full error would be: -- You will now have a public IP associated with your account, and later to your VM instance. Click on `Done` at the bottom of the section `Edit network interface` you were in. +``` bash +Error: Not a valid ref: refs/remotes/origin/master : +fatal: ambiguous argument 'refs/remotes/origin/master': unknown revision or path not in the working tree. +``` - gcloud-console-new-external-ip +Run the following commands to solve it: -### Public SSH key -- Open the `Security` section +``` bash +rm -fr $(brew --repo homebrew/core) # because you can't `brew untap homebrew/core` +brew tap homebrew/core +``` - gcloud-console-vm-security -- Open the `Manage access` subsection +
- gcloud-console-manage-access -- Go to `Add manually generated SSH keys` and click `Add item` +If you already have Homebrew, it will tell you so, that's fine, go on. - gcloud-console-add-manual-ssh-key -- In your terminal display your public SSH key: - - Windows: navigate to where you created your SSH key and open `id_ed25519.pub` +### 2. Make sure you are on the latest version: - - Mac/Linux users can use: - ```bash - cat ~/.ssh/id_ed25519.pub - # OR cat ~/.ssh/de-bootcamp.pub if you created a unique key - ``` -- Copy your public SSH key and paste it: +```bash +brew update +``` - gcloud-console-add-ssh-key-pub -- On the right hand side you should see +
+ 🛠 If you get a /usr/local must be writable error - gcloud-console-vm-price-month -- You should be good to go and click `CREATE` at the bottom +Just run this: - gcloud-console-vm-create -- It will take a few minutes for your virtual machine (VM) to be created. Your instance will show up like below when ready, with a green circled tick, named `lewagon-data-eng-vm-krokrob` (`krokrob` being replaced by your GitHub username). +``` bash +sudo chown -R $USER:admin /usr/local +brew update +``` - gcloud-console-vm-instance-running -- Click on your instance +
- gcloud-console-vm-running -- Go down to the section `SSH keys`, and write down your username (you need it for the next section) +### 3. Then install some useful software: - gcloud-console-vm-username +Proceed running the following in the terminal (you can copy / paste all the lines at once). -Congrats, your virtual machine is up and running, it is time to connect it with VS Code! +```bash +brew upgrade git || brew install git +brew upgrade gh || brew install gh +brew upgrade wget || brew install wget +brew upgrade imagemagick || brew install imagemagick +brew upgrade jq || brew install jq +brew upgrade openssl || brew install openssl +brew upgrade tree || brew install tree +brew upgrade ncdu || brew install ncdu +brew upgrade xz || brew install xz +brew upgrade readline || brew install readline +``` ## Visual Studio Code @@ -396,170 +375,404 @@ We need to connect VS Code to a virtual machine in the cloud so you will only wo That's the only extension you should install on your _local_ machine, we will install additional VS Code extensions on your _virtual machine_. -### Virtual Machine connection -- Open VS Code > Open the [command palette](https://code.visualstudio.com/docs/getstarted/userinterface#_command-palette) > Type `Remote-SSH: Connect to Host...` +## Google Cloud CLI -vscode-connect-to-host +The `gcloud` Command Line Interface (CLI) is used to communicate with Google Cloud Platform services through your terminal. -- Click on `Add a new host` -- Type `ssh -i @`, for instance, my username is `somedude`, my private SSH key is located at `~/.ssh/id_rsa` on my local computer, my VM has a public IP of `34.77.50.76`: I'll type `ssh -i ~/.ssh/id_rsa somedude@34.77.50.76` +### Install gcloud -vscode-ssh-connection-command +Install with `brew`: +```bash +brew install --cask google-cloud-sdk +``` -- When prompted to `Select SSH configuration file to update`, pick the one in your home directory, under the `.ssh` folder, `~/.ssh/config` basically. 
Usually VS Code will pick automatically the best option, so their default should work. +Then install `gcloud` with: -vscode-add-host-ssh-config +```bash +$(brew --prefix)/Caskroom/google-cloud-sdk/latest/google-cloud-sdk/install.sh +``` -- You should get a pop-up on the bottom right notifying you the host has been added +To test your install, open a new terminal and run: -vscode-host-added +```bash +gcloud --version +``` -- Open again the [command palette](https://code.visualstudio.com/docs/getstarted/userinterface#_command-palette) > Type `Remote-SSH: Connect to Host...` > Pick your VM IP address +👉 [Install documentation 🔗](https://cloud.google.com/sdk/docs/install#mac) -vscode-add-new-host -- The first time, VSCode might ask you for a security permission like below, say yes / continue. +### Authenticate gcloud -vscode-remote-connection-confirm +We need to authenticate the `gcloud` CLI tool and set the project so it can interact with Google from the terminal. -- Open again the [command palette](https://code.visualstudio.com/docs/getstarted/userinterface#_command-palette) > Type `Terminal: Create New Terminal (in active workspace)` > You now have a Bash terminal in your virtual machine! +To authenticate `gcloud`, run: -vscode-command-palette-new-terminal -
-vscode-terminal +```bash +gcloud auth login +``` + +And follow the prompts. For pasting into the terminal, you might need to use CTRL + SHIFT + V + +You also need to set the GCP project that you are working in. For this section, you'll need your GCP Project ID, which can be found on the GCP Console at this [link here](https://console.cloud.google.com). Make sure you copy the _Project ID_ and **not** the _Project number_. -- Still on your *local* computer, lets create a more readable version of your machine to connect to! +To set your project, replace `` with your GCP Project ID and run: ```bash -code ~/.ssh/config +gcloud config set project ``` -You should see something like the following: +Confirm your setup with: ```bash -Host - HostName - IdentityFile - User +gcloud config list ``` -You can now change Host to whatever you would like to see as the name of your connection or in terminal with `ssh `! -❗️ It is important that the `Host` alias does not contain any whitespaces ❗️ +You should get an output similar to: ```bash -# For instance -Host "de-bootcamp-vm" - HostName 34.77.50.76 # replace with your VM's public IP address - IdentityFile - User +[core] +account = taylorswift@domain.com # Should be your GCP email +disable_usage_reporting = True +project = my-gcp-project # Should be your GCP Project ID + +Your active configuration is: [default] ``` -**The setup of your local machine is over. All following commands will be run from within your 🚨 virtual machine**🚨 terminal (via VS code for instance) +### Application Default Credentials -## VS Code Extensions +Application Default Credentials are for authenticating our **code** (Terraform and Python 🐍) to interact with Google services and resources. It's a small distinction between `gcloud` and **code**, but an important one. -Let's install some useful extensions to VS Code.
+To authenticate your **Application Default Credentials**, in your terminal run: -- Open your VS Code instance and make sure you're connected to the remote server. At the bottom left, you'll see: +```bash +gcloud auth application-default login +``` -vscode-ssh +And follow the prompts. It should open a web-page to login to your Google account. -- Open the VS Code terminal (`CMD` + `` ` `` or `CTRL` + `` ` ``) then run the following commands: + +## Terraform + +Terraform is an infrastructure as code (IaC) tool for creating (and destroying) resources in the cloud. + +You can use `brew` to install terraform. In your terminal, run: ```bash -code --install-extension ms-vscode.sublime-keybindings -code --install-extension emmanuelbeziat.vscode-great-icons -code --install-extension ms-python.python -code --install-extension KevinRose.vsc-python-indent -code --install-extension ms-python.vscode-pylance -code --install-extension redhat.vscode-yaml -code --install-extension ms-azuretools.vscode-docker -code --install-extension tamasfe.even-better-toml +brew tap hashicorp/tap +brew install hashicorp/tap/terraform ``` -Here is a list of the extensions you are installing: -- [Sublime Text Keymap and Settings Importer](https://marketplace.visualstudio.com/items?itemName=ms-vscode.sublime-keybindings) -- [VSCode Great Icons](https://marketplace.visualstudio.com/items?itemName=emmanuelbeziat.vscode-great-icons) -- [Python](https://marketplace.visualstudio.com/items?itemName=ms-python.python) -- [Python Indent](https://marketplace.visualstudio.com/items?itemName=KevinRose.vsc-python-indent) -- [Pylance](https://marketplace.visualstudio.com/items?itemName=ms-python.vscode-pylance) -- [YAML](https://marketplace.visualstudio.com/items?itemName=redhat.vscode-yaml) -- [Docker](https://marketplace.visualstudio.com/items?itemName=ms-azuretools.vscode-docker) -- [Even Better TOML](https://marketplace.visualstudio.com/items?itemName=tamasfe.even-better-toml) +Verify the installation
with: + +```bash +terraform --version +``` + + +## Provisioning your Virtual Machine with Terraform + +You can create Cloud Resources like Virtual Machines in different ways: +- Through the Google Cloud [Compute Engine Console 🔗](https://console.cloud.google.com/compute/overview) +- Using `gcloud` +- With **Infrastructure as Code** tools like Terraform + +We'll be creating our Virtual Machine with Terraform. + +We're almost at the point of creating your Virtual Machine. + +The specifications of the Virtual Machine and Network Settings you'll use for the bootcamp are: +- Operating System: Ubuntu 22.04 LTS +- CPU: 4 Virtual CPU cores (2 physical CPU cores) +- RAM: 16 GB +- Storage (Persistent Disk): 100 GB balanced +- Static External IP address - so it's easier to log in. + +### Cost 💸 + +Creating and running a Virtual Machine on Google Cloud Platform costs money! + +If you have created a new Google Cloud Platform account, the cost of the Virtual machine will be covered by the $300 USD credit for the first 90 days if you are diligent with turning off your Virtual Machine (or finish the _Linux and Bash_ challenge today 😎). + +❗ **The cost of running a Virtual Machine with our configuration 24 hours a day, 7 days a week is ~$150 USD per month.** ❗ + +You can massively reduce the cost by only running the Virtual Machine when you use it. You will _NOT_ be charged for the vCPUs and RAM while the Virtual Machine is off! + +You will always pay for the Storage (equivalent of your hard-drive on your local computer). It's ~$10 USD per month for 100 GB. + +The rule of thumb is: if Google can rent the resource out to someone else when you're not using it, you only pay for it when you are using the resource. That's why you don't pay for the CPU and RAM when you are not using them (Google can rent them out to someone else), but you always pay for Storage: Google can't rent it out to someone else because it has your data on it.
+ +### Download terraform files + +We almost have all the necessary parts to create your VM using **terraform**. We need to download the terraform files and change a few values. + +First we'll create a folder and download the terraform files with: + +```bash +mkdir -p ~/wagon-de-bootcamp +curl -L -o ~/wagon-de-bootcamp/main.tf https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/main.tf +curl -L -o ~/wagon-de-bootcamp/provider.tf https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/provider.tf +curl -L -o ~/wagon-de-bootcamp/variables.tf https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/variables.tf +curl -L -o ~/wagon-de-bootcamp/terraform.tfvars https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/terraform.tfvars +curl -L -o ~/wagon-de-bootcamp/.terraform.lock.hcl https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/infra/.terraform.lock.hcl +``` + + +### Set variables + +Open up the file `~/wagon-de-bootcamp/terraform.tfvars` in VS Code or any other code editor. + +It should look like: + +```bash +project_id = "" +region = "" +zone = "" +instance_name = "" +instance_user = "" +``` + +We'll need to change some values in this file. Here's where you can find the required values: +- **project_id:** from the GCP Console at this [link here](https://console.cloud.google.com). +- **region:** take a look at the GCP Region and Zone documentation at this [link here](https://cloud.google.com/compute/docs/regions-zones). We strongly recommend you choose the closest geographical region. +- **zone:** Zone is a subset of region. It is almost always the same as **region** appended with `-a`, `-b`, or `-c`. +- **instance_name:** we recommend naming your VM `lw-de-vm-`, replacing `` with your GitHub username.
+- **instance_user:** in your terminal, run `whoami` + +After completing this file, it should look similar to: + +```bash +project_id = "wagon-bootcamp" +region = "europe-west1" +zone = "europe-west1-b" +instance_name = "lw-de-vm-tswift" +instance_user = "taylorswift" +``` + +Make sure to save the `terraform.tfvars` file, then navigate into the directory with the terraform files with: + +``` +cd ~/wagon-de-bootcamp +``` + +And initialise and test the files with: + +```bash +terraform init + +terraform plan +``` + +And check the output. Towards the bottom there should be a line: + +``` +Plan: 2 to add, 0 to change, 0 to destroy +``` + +We'll be adding: +- A compute engine instance +- A static external IP address + +❗ If you have any errors, read the error and debug. If you need some help, raise a ticket with a teacher. + +If everything was successful, create your VM with: + +```bash +terraform apply -auto-approve +``` + +It might take a while for Terraform to create the cloud resources. Once you see: + +``` +Apply complete! Resources: 2 added, 0 changed, 0 destroyed. +``` + +Your Virtual Machine should be up and running! Check the GCP Compute Engine console at this [link here](https://console.cloud.google.com/compute/instances) to confirm. + + +## Virtual Machine connection + +### Create SSH keys + +We need to connect VS Code to our Virtual Machine in the cloud so you will only work on that machine during the bootcamp. We'll use the [Remote - SSH Extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-ssh) that we previously installed. + +To create the VS Code SSH configuration, run the following in your terminal: + +```bash +gcloud compute config-ssh +``` + +`gcloud` may tell you it needs to create a directory to continue. Accept and you should get an output similar to: + +```bash +You should now be able to use ssh/scp with your instances.
+For example, try running: + + $ ssh lw-de-vm-tswift.europe-west1-b.wagon-bootcamp +# $ ssh lw-de-vm-.. +``` + + +### Connect with VS Code + +To connect to your Virtual Machine, click on the small symbol at the very bottom-left corner of VS Code: + +![](/images/vscode_remote_highlight.png) + +It should bring up a menu, click on **Connect to Host...**: + +![](/images/vscode_remote_menu.png) + +Click on the name of your Virtual Machine: + +![](/images/vscode_remote_hosts.png) + +A new VS Code window will open. You may be asked to select the platform of the remote host, select **Linux**. You will then be asked to _fingerprint_ the connection. VS Code is asking if you trust the remote host you are trying to connect to. Hit enter to continue. + +![](/images/vscode_remote_fingerprint.png) + +And you are connected! It should look similar to: + +![](/images/vscode_remote_connected.png) + +Notice the connection in the very bottom-left corner of your VS Code window. It should have the Connection type (SSH), and the name of the host you are connected to. + +**The setup of your local machine is over. All following commands will be run from within your 🚨 virtual machine**🚨 terminal (via VS Code) + +
+Viewing your SSH Configuration + +If you want to view your SSH configuration: +1. Start by clicking the symbol in the bottom-left corner of VS Code +2. Click on **Connect to Host...** +3. Click on **Configure SSH Hosts...** +4. Select the configuration file. Usually the file at the top of the list. +5. View your configuration file! You may need to edit this configuration if you change computers, or want to work on more than one computer during the bootcamp. + +
+ + +## VM gcloud and Application Default Credentials + +We'll be doing some of the steps again, but that's because the virtual machine is a completely new computer! Luckily for us, `gcloud` comes pre-installed on the virtual machine. + +### Authenticate gcloud -## Command line tools +We need to authenticate the `gcloud` CLI tool and set the project so it can interact with Google from the terminal. -### Zsh & Git +To authenticate `gcloud`, run: -Instead of using the default `bash` [shell](https://en.wikipedia.org/wiki/Shell_(computing)), we will use `zsh`. +```bash +gcloud auth login +``` + +And follow the prompts. For pasting into the terminal, you might need to use CTRL + SHIFT + V -We will also use [`git`](https://git-scm.com/), a command line software used for version control. +You also need to set the GCP project that you are working in. For this section, you'll need your GCP Project ID, which can be found on the GCP Console at this [link here](https://console.cloud.google.com). Make sure you copy the _Project ID_ and **not** the _Project number_. -Let's install them, along with other useful tools: -- Open an **VS Code terminal** connected to your VM -- Copy and paste the following commands: +To set your project, replace `` with your GCP Project ID and run: ```bash -sudo apt update -sudo apt install -y vim tmux tree git ca-certificates curl jq unzip zsh \ -apt-transport-https gnupg software-properties-common direnv sqlite3 make \ -postgresql postgresql-contrib build-essential libssl-dev zlib1g-dev \ -libbz2-dev libreadline-dev libsqlite3-dev wget llvm \ -libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev \ -gcc default-mysql-server default-libmysqlclient-dev libpython3-dev openjdk-8-jdk-headless +gcloud config set project ``` -These commands might ask for your password, if they do: type it in.
+Confirm your setup with: + +```bash +gcloud config list +``` + +You should get an output similar to: + +```bash +[core] +account = taylorswift@domain.com # Should be your GCP email +disable_usage_reporting = True +project = my-gcp-project # Should be your GCP Project ID + +Your active configuration is: [default] +``` -:warning: When you type your password, nothing will show up on the screen, **that's normal**. This is a security feature to mask not only your password as a whole but also its length. Just type in your password and when you're done, press `Enter`. -### GitHub CLI installation +### Application Default Credentials -Let's now install [GitHub official CLI](https://cli.github.com) (Command Line Interface). It's a software used to interact with your GitHub account via the command line. +Application Default Credentials are for authenticating our **code** (Terraform and Python 🐍) to interact with Google services and resources. It's a small distinction between `gcloud` and **code**, but an important one. -In your terminal, copy-paste the following commands and type in your password if asked: +To authenticate your **Application Default Credentials**, in your terminal run: ```bash -curl -fsSL https://cli.github.com/packages/githubcli-archive-keyring.gpg | sudo dd of=/usr/share/keyrings/githubcli-archive-keyring.gpg -echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/githubcli-archive-keyring.gpg] https://cli.github.com/packages stable main" | sudo tee /etc/apt/sources.list.d/github-cli.list > /dev/null -sudo apt update -sudo apt install -y gh +gcloud auth application-default login ``` -To check that `gh` has been successfully installed on your machine, you can run: +And follow the prompts. It should open a web-page to login to your Google account.
+ + +## VM configuration with Ansible + +We'll be using [Ansible](https://docs.ansible.com/ansible/latest/getting_started/introduction.html) to configure your Virtual Machine with some software, configurations, packages, and frameworks that you'll use in the bootcamp. + +Let's start by confirming that ansible is installed. In your terminal run: ```bash -gh --version +ansible --version ``` -:heavy_check_mark: If you see `gh version X.Y.Z (YYYY-MM-DD)`, you're good to go :+1: +You should get an output similar to (some version numbers might change, that's fine): -:x: Otherwise, please **contact a teacher** +``` +ansible [core 2.17.9] + config file = /etc/ansible/ansible.cfg + configured module search path = ['/home/tswift/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules'] + ansible python module location = /usr/lib/python3/dist-packages/ansible + ansible collection location = /home/tswift/.ansible/collections:/usr/share/ansible/collections + executable location = /usr/bin/ansible + python version = 3.12.3 (main, Feb 4 2025, 14:48:35) [GCC 13.3.0] (/usr/bin/python3) + jinja version = 3.1.2 + libyaml = True +``` +❗ If not, raise a ticket with a teacher. -## Oh-my-zsh +### Ansible Playbook 1 -Let's install the `zsh` plugin [Oh My Zsh](https://ohmyz.sh/).
+Create a folder and download the ansible files: -In a terminal execute the following command: +```bash +mkdir -p ~/vm-ansible-setup/playbooks + +curl -L -o ~/vm-ansible-setup/ansible.cfg https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/vm-ansible-setup/ansible.cfg +curl -L -o ~/vm-ansible-setup/hosts https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/vm-ansible-setup/hosts +curl -L -o ~/vm-ansible-setup/playbooks/setup_vm_part1.yml https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/vm-ansible-setup/playbooks/setup_vm_part1.yml +``` + +And run with: ```bash -sh -c "$(curl -fsSL https://raw.github.com/ohmyzsh/ohmyzsh/master/tools/install.sh)" +cd ~/vm-ansible-setup +ansible-playbook playbooks/setup_vm_part1.yml ``` -If asked "Do you want to change your default shell to zsh?", press `Y` +And the playbook should start running! -At the end your terminal should look like this: +❗ If any errors occur, raise a ticket with a teacher. You can safely run the playbook again. -![Ubuntu terminal with OhMyZsh](https://github.com/lewagon/setup/blob/master/images/oh_my_zsh.png) +### What is the playbook installing? -:heavy_check_mark: If it does, you can continue :+1: +This playbook installs a few things. While it is running, let's go through them: +- Updating system packages. Ubuntu uses the `APT` package manager. +- Changing the default shell from **bash** to **zsh**, a more customizable shell that is extensible and looks great! +- Installing the **Oh-My-ZSH** plugin for the **zsh** shell. We'll use it a bit later to add some quality of life plugins and extensions for `zsh`. +- Installing **Docker** on your Virtual Machine. Docker is an open platform for developing, shipping, and running applications.
You will use it throughout the bootcamp +- Installing some **Kubernetes (k8s)** tooling: Kubernetes is a system designed for auto-scaling containerized applications. + - Installing **kubectl**: `kubectl` is the CLI tool for interacting with kubernetes clusters. + - Installing **minikube**: Minikube is a way to quickly spin up a local kubernetes cluster. Great for developing! +- Installing **terraform**: we've already installed it once, but we need to install it on our VM! **Terraform** is an Infrastructure as Code (IaC) tool. +- Installing the **GitHub CLI**: the CLI tool that we'll use to interact with your GitHub account directly from the terminal. -:x: Otherwise, please **ask for a teacher** +The playbook is also running checks to see if things are installed or not. This is so you can safely re-run the playbook without any problems. ## GitHub CLI @@ -614,120 +827,6 @@ gh auth status :x: If not, **contact a teacher**. -## Google Cloud CLI - -Install the `gcloud` CLI to communicate with [Google Cloud Platform](https://cloud.google.com/) through your terminal: -```bash -echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list -sudo apt-get install apt-transport-https ca-certificates gnupg -curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - -sudo apt-get update && sudo apt-get install google-cloud-sdk -sudo apt-get install google-cloud-sdk-app-engine-python -``` -👉 [Install documentation](https://cloud.google.com/sdk/docs/install#deb) - -### Create a service account key 🔑 - -**👌 Note: Skip to the next section if you already have a service account key** - -Now that you have created a `GCP account` and a `project` (identified by its `PROJECT_ID`), we are going to configure the actions (API calls) that you want to allow your code to perform. -
- 🤔 Why do we need a service account key ? - - - You have created a `GCP account` linked to your credit card. Your account will be billed according to your usage of the ressources of the **Google Cloud Platform**. The billing will occur if you consume anything once the free trial is over, or if you exceed the amount of spending allowed during the free trial. - - In your `GCP account`, you have created a single `GCP project`, identified by its `PROJECT_ID`. The `GCP projects` allow you to organize and monitor more precisely how you consume the **GCP** ressources. For the purpose of the bootcamp, we are only going to create a single project. - - Now, we need a way to tell which ressources within a `GCP project` our code will be allowed to consume. Our code consumes GCP ressources through API calls. - - Since API calls are not free, it is important to define with caution how our code will be allowed to use them. During the bootcamp this will not be an issue and we are going to allow our code to use all the API of **GCP** without any restrictions. - - In the same way that there may be several projects associated with a GCP account, a project may be composed of several services (any bundle of code, whatever its form factor, that requires the usage of GCP API calls in order to fulfill its purpose). - - GCP requires that the services of the projects using API calls are registered on the platform and their credentials configured through the access granted to a `service account`. - - For the moment we will only need to use a single service and will create the corresponding `service account`. -
- -Since the [service account](https://cloud.google.com/iam/docs/service-accounts) is what identifies your application (and therefore your GCP billing account and ultimately your credit card), you are going to want to be cautious with the next steps. - -⚠️ **Do not share you service account json file 🔑** ⚠️ Do not store it on your desktop, do not store it in your git codebase (even if your git repository is private), do not let it by the coffee machine, do not send it as a tweet. - -- Go to the [service accounts page](https://console.cloud.google.com/apis/credentials/serviceaccountkey) -- Select your project in the list of recent projects if asked to -- Create a service account: - - Click on **CREATE SERVICE ACCOUNT**: - - Give a `Service account name` to that account - - Click on **CREATE AND CONTINUE** - - Click on **Select a role** and choose `Quick access/Basic` then **Owner**, which gives full access to all ressources - - Click on **CONTINUE** - - Click on **DONE** -- Download the service account json file 🔑: - - Click on the newly created service account - - Click on **KEYS** - - Click on **ADD KEY** then **Create new key** - - Select **JSON** and click on **CREATE** - -![](images/gcp_create_key.png) - -The browser has now saved the service account json file 🔑 in your downloads directory (it is named according to your service account name, something like `le-wagon-data-123456789abc.json`) - - -### Configure Cloud sdk - -- Open the service account json file with any text editor and copy the key - ``` - # It looks like: - { - "type": "service_account", - "project_id": "kevin-bootcamp", - "private_key_id": "1234567890", - "private_key": "-----BEGIN PRIVATE KEY-----\nXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX\n-----END PRIVATE KEY-----\n", - "client_email": "bootcamp@kevin-bootcamp.iam.gserviceaccount.com", - "client_id": "1234567890", - "auth_uri": "https://accounts.google.com/o/oauth2/auth", - "token_uri": "https://oauth2.googleapis.com/token", - 
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs", - "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/bootcamp%40kevin-bootcamp.iam.gserviceaccount.com" - } - ``` -- **on your Virtual Machine**, create a `~/.gcp_keys` directory, then create a json file in it: - ``` bash - mkdir ~/.gcp_keys - touch ~/.gcp_keys/le-wagon-de-bootcamp.json - ``` -- Open the json file then store the service account json file pasting the key: - ```bash - code ~/.gcp_keys/le-wagon-de-bootcamp.json - ``` - ![service account json key](images/service_account_json_key.png) - - ❗️Don't forget to **save** the file with `CMD` + `s` or `CTRL` + `s` - -- Authenticate the `gcloud` CLI with the google account you used for GCP - ```bash - # Replace service_account_name@project_id.iam.gserviceaccount.com with your own - SERVICE_ACCOUNT_EMAIL=service_account_name@project_id.iam.gserviceaccount.com - KEY_FILE=$HOME/.gcp_keys/le-wagon-de-bootcamp.json - gcloud auth activate-service-account $SERVICE_ACCOUNT_EMAIL --key-file=$KEY_FILE - ``` -- List your active account and check your email address you used for GCP is present - ```bash - gcloud auth list - ``` -- Set your current project - ```bash - # Replace `PROJECT_ID` with the `ID` of your project, e.g. `wagon-bootcamp-123456` - gcloud config set project PROJECT_ID - ``` -- List your active account and current project and check your project is present - ```bash - gcloud config list - ``` - - ## Dotfiles Let's pimp your zsh and and vscode by installing lewagon recommanded dotfiles **on your Virtual Machine** @@ -874,474 +973,344 @@ you don't want your email to appear in public repositories you may contribute to -### zsh default terminal +--- -Set `zsh` as your default VS Code terminal. +Once you have finished installing the **dotfiles**, kill your terminal (little trash can at the top right of the terminal window) and re-open it. 
You might have to do it a few times until it looks similar to: -- Open terminal default profile settings +![](/images/vscode_after_ansible1.png) - Terminal profile settings -- Select `zsh /usr/bin/zsh` +The terminal should read as `zsh`. - Terminal zsh profile +## VM configuration with Ansible - Part 2 -## Disable SSH passphrase prompt +### Ansible Playbook 2 -You don't want to be asked for your passphrase every time you communicate with a distant repository. So, you need to add the plugin `ssh-agent` to `oh my zsh`: +We'll be using a second **Ansible** playbook to further configure your Virtual Machine. -First, open the `.zshrc` file: +Start by downloading the ansible playbook: ```bash -code ~/.zshrc +curl -L -o ~/vm-ansible-setup/playbooks/setup_vm_part2.yml https://raw.githubusercontent.com/lewagon/data-engineering-setup/lorcanrae/automated-setup/automation/vm-ansible-setup/playbooks/setup_vm_part2.yml ``` -Then: -- Spot the line starting with `plugins=` -- Add `ssh-agent` at the end of the plugins list - -:heavy_check_mark: Save the `.zshrc` file with `Ctrl` + `S` and close your text editor. - - -## Docker 🐋 - -Docker is an open platform for developing, shipping, and running applications. - -### Install Docker and Docker Compose - -Setup the dock apt repo +And run with: ```bash -sudo install -m 0755 -d /etc/apt/keyrings - -curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg - -sudo chmod a+r /etc/apt/keyrings/docker.gpg +cd ~/vm-ansible-setup +ansible-playbook playbooks/setup_vm_part2.yml ``` -```bash -echo \ - "deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \ - "$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \ - sudo tee /etc/apt/sources.list.d/docker.list > /dev/null -``` +And the playbook should start running! If you're asked if you want VS Code to behave more like Sublime Text, click accept. 
-Install the right packages +❗ If any errors occur, raise a ticket with a teacher. You can safely run the playbook again. -``` -sudo apt-get update -sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin -``` +
+❓ Why two Ansible playbooks? -Finally give your user permission to use `docker` +This second ansible playbook requires GitHub authorisation to fork the `lewagon/data-engineering-challenges` repository, and it also edits some of the Le Wagon recommended **dotfiles**. So we separated the process into two steps. +
-```bash -sudo groupadd docker -sudo usermod -aG docker $USER -newgrp docker -``` +### What is the playbook installing? -Run `docker run hello-world`, you should see something like: +This playbook installs and configures a few things. While it runs, let's go through them: -
- ❗️ Permission denied while trying to connect to the Docker daemon socket. ❗️ +**Python and Poetry** -If you receive an error similar to the one below, navigate to the [GCP Compute Engine Console](https://console.cloud.google.com/compute/instances) and shut down your VM by selecting the tick box next to your VM instance and clicking STOP (closing and reopening VSCode is not enough). +Ubuntu 22.04 has Python pre-installed, but not the version we're going to use. We are going to use Python [3.12.8](https://www.python.org/downloads/release/python-3128/) -![](images/docker_permission_denied_socket.png) +- Install **pyenv** and **pyenv-virtualenv**. We'll use **pyenv** to manage the Python versions installed on the VM +- Install Python 3.12.8 with pyenv +- Install **pipx**: [Pipx](https://pipx.pypa.io/stable/) is used to install python packages we want _globally_ available while still using virtual environments, like Poetry! +- Installing a few global python packages with **pipx**: + - **Poetry:** [Poetry](https://python-poetry.org/) is a modern Python package manager we will use throughout the bootcamp. + - **Ruff:** [Ruff](https://docs.astral.sh/ruff/) is used to format and lint Python code. + - **tldr:** [tldr](https://github.com/tldr-pages/tldr) has a much more readable version of `man` pages. Useful for quickly finding out how a program works. -It will take a few minutes for your VM to turn off. Once it's fully off, turn your VM on again by checking the box next to the VM instance and clicking START. Give the VM a few minutes to fully start up and connect through VSCode. Once connected try `docker run hello-world` again. If you don't get an output similar to the below image, raise a ticket with a teacher. -
+**VS Code Configuration** -![](images/docker_hello.png) - -### Enable Artifact Registry API - -**👌 Note: Skip to the next section if you already have an Artifact Registry repository** - -[Artifact Registry](https://cloud.google.com/artifact-registry) is a GCP service you will use to store artifacts such as Docker images. The storage units are called repositories. - -- Enable the service within your project using the `gcloud` CLI: - ```bash - gcloud services enable artifactregistry.googleapis.com - ``` -- Create a new Docker repository: - ```bash - # Set the repository name - REPOSITORY=docker-hub - # Set the location of the repository. Available locations: gcloud artifacts locations list - LOCATION=europe-west1 - gcloud artifacts repositories create $REPOSITORY \ - --repository-format=docker \ - --location=$LOCATION \ - --description="Docker images storage" - ``` - -### Gcloud authentication for Docker - -You need to grant Docker access to push artifacts to (and pull from) your repository. There are different authentication methods, [gcloud credentials helper](https://cloud.google.com/artifact-registry/docs/docker/authentication#gcloud-helper) being the easiest. - -- Define the repository hostname matching the repository `$LOCATION`: - ```bash - # If $LOCATION is "europe-west1" - HOSTNAME=europe-west1-docker.pkg.dev - ``` -- Configure gcloud credentials helper: - ```bash - gcloud auth configure-docker $HOSTNAME - ``` -- Type `y` to accept the configuration -- Check your credentials helper is set: - ```bash - cat ~/.docker/config.json - ``` - You should get: - ```bash - { - "credHelpers": { - "europe-west1-docker.pkg.dev": "gcloud" - } - }% - ``` - - -## Kubernetes -Kubernetes (K8s) is a system designed to make deploying auto-scaling containerized applications easily. - -### Install kubectl -Kubectl is the cli for interacting with k8s! 
- -https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/ +- Installing some **VS Code** extensions, but only on your VM. 
Here's a list of the extensions that are being installed: + - [Sublime Text Keymap and Settings Importer](https://marketplace.visualstudio.com/items?itemName=ms-vscode.sublime-keybindings) + - [VSCode Great Icons](https://marketplace.visualstudio.com/items?itemName=emmanuelbeziat.vscode-great-icons) + - [Python](https://marketplace.visualstudio.com/items?itemName=ms-python.python) + - [Python Indent](https://marketplace.visualstudio.com/items?itemName=KevinRose.vsc-python-indent) + - [Pylance](https://marketplace.visualstudio.com/items?itemName=ms-python.vscode-pylance) + - [YAML](https://marketplace.visualstudio.com/items?itemName=redhat.vscode-yaml) + - [Docker](https://marketplace.visualstudio.com/items?itemName=ms-azuretools.vscode-docker) + - [Even Better TOML](https://marketplace.visualstudio.com/items?itemName=tamasfe.even-better-toml) +- Update the VS Code Python Interpreter path. -```bash -curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" -curl -LO "https://dl.k8s.io/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl.sha256" +**Shell and System Configuration** -echo "$(cat kubectl.sha256) kubectl" | sha256sum --check +- Create the **direnv** poetry function. The same one from the lecture! This makes it easier to work with poetry. +- Adding some **Oh-My-ZSH** plugins by modifying your `.zshrc` file. Here's a list of the extra plugins: + - **pyenv**: Auto-complete for pyenv, a tool used to manage Python versions + - **gcloud**: Auto-complete for the gcloud CLI tool + - **ssh-agent**: Remembers your SSH key passphrase so you only have to enter it once per session. + - **direnv**: A tool to load `.envrc` files when you `cd` into a directory. Great for loading environment variables. 
+- Installing **Spark**: Spark is a distributed data processing framework -sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl +**Data Engineering Challenges Repository** -kubectl version --client -kubectl version --client --output=yaml -``` +The challenges that you'll be working on throughout the bootcamp! The playbook is forking the **data-engineering-challenges** repository from **lewagon** to your own GitHub user. Then cloning that repository from your GitHub account down onto your Virtual Machine. -### Install minikube +### Restart Virtual Machine -Minikube is a way to quickly spin up a local kubernetes cluster! +Once the playbook has finished running, you need to completely shut down your Virtual Machine so that some of the configuration updates (specifically **pyenv** and **Docker**) take effect. -https://minikube.sigs.k8s.io/docs/start/ +To shut down your VM, navigate to the GCP Compute Engine Instances [console page 🔗](https://console.cloud.google.com/compute/instances). -```bash -curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 -sudo install minikube-linux-amd64 /usr/local/bin/minikube -``` +Select your VM instance and click on the stop button: -### Test installation -To test that you can launch a cluster run: -```bash -minikube start -``` -you should see your cluster booting up : +![](/images/gcp_vm_stop.png) -![](images/minikube_start.png) +Wait for a few minutes until the VM shows that it is completely off. You may need to refresh the page, as the GCP Console doesn't dynamically update. -Then to check the cluster run: -```bash -kubectl get po -A -``` -you should be able to see your cluster running! : +When the VM is completely off, turn it on again by selecting the check box next to your instance and clicking **START/RESUME**. Give it a minute to spin up, then connect via VS Code. 
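If you prefer the terminal to the console page, the same stop/start cycle can be sketched with the `gcloud` CLI from your **local machine** (not from the VM itself, since stopping the VM ends your session). This is a minimal sketch under assumptions: `my-vm-name` and `europe-west1-b` are placeholder values that do not come from this setup, so replace them with your own instance name and zone. By default it only prints the commands; set `DRY_RUN=0` to actually run them.

```bash
# Placeholder values (assumptions): replace with your own instance name and zone.
VM_NAME="my-vm-name"
VM_ZONE="europe-west1-b"

# Full stop/start cycle for one Compute Engine instance.
# With DRY_RUN unset or set to 1, the gcloud commands are only printed.
restart_vm() {
  local name="$1" zone="$2" run="echo"
  if [ "${DRY_RUN:-1}" = "0" ]; then run=""; fi
  $run gcloud compute instances stop "$name" --zone="$zone"
  $run gcloud compute instances start "$name" --zone="$zone"
}

restart_vm "$VM_NAME" "$VM_ZONE"
```

Once the instance is running again, connect via VS Code as usual.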
-![](images/minikube_base.png) -To tear it all down for now: +## Check your Virtual Machine Setup -```bash -minikube delete --all -``` +We've used two ansible playbooks to configure our Virtual Machine. Let's run some manual checks in the terminal to make sure that everything was installed correctly. +❗ If any of these checks error out, raise a ticket with a teacher. -## Terraform +#### Python -Terraform is a tool for infrastructure as code (IAC) to define resources to create in the cloud! +🧪 To test: -### Install terraform - -Install some basic requirements ```bash -sudo apt-get update && sudo apt-get install -y gnupg software-properties-common +python --version ``` -Terraform is not avaliable to apt by default so we need to make it avaliable! -```bash -wget -O- https://apt.releases.hashicorp.com/gpg | \ - gpg --dearmor | \ - sudo tee /usr/share/keyrings/hashicorp-archive-keyring.gpg > /dev/null -``` +Should return: -```bash -gpg --no-default-keyring \ - --keyring /usr/share/keyrings/hashicorp-archive-keyring.gpg \ - --fingerprint ``` - -```bash -echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] \ - https://apt.releases.hashicorp.com $(lsb_release -cs) main" | \ - sudo tee /etc/apt/sources.list.d/hashicorp.list +Python 3.12.8 ``` -Now we can install terraform directly with apt 👇 -```bash -sudo apt update -sudo apt-get install terraform -``` +#### Pyenv -Verify the installation with: +🧪 To test: ```bash -terraform --version +pyenv versions ``` +Should return: +``` + system +* 3.12.8 (set by /home//.pyenv/version) +``` -## Spark +Note: There should be an `*` next to 3.12.8 -Spark is a data processing framework: +#### Pipx -Move to your home directory: +🧪 To test: ```bash -cd ~ +pipx list ``` -Download spark: +Should return something similar to: -```bash -wget https://archive.apache.org/dist/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz ``` - -Open the tarball: - -```bash -mkdir -p ~/spark && tar -xzf spark-3.5.3-bin-hadoop3.tgz -C 
~/spark +venvs are in /home//.local/share/pipx/venvs +apps are exposed on your $PATH at /home//.local/bin +manual pages are exposed at /home//.local/share/man + package poetry 2.1.1, installed using Python 3.12.8 + - poetry + package ruff 0.11.0, installed using Python 3.12.8 + - ruff + package tldr 3.3.0, installed using Python 3.12.8 + - tldr + - man1/tldr.1 ``` -Set the environment variables needed by spark: +#### Docker -```bash -echo "export SPARK_HOME=$HOME/spark/spark-3.5.3-bin-hadoop3" >> .zshrc -echo 'export PATH=$PATH:$SPARK_HOME/bin' >> .zshrc -``` - -Let's restart our shell: +🧪 To test: ```bash -exec zsh +docker run hello-world ``` -Test Spark works by running: +Should return: -```bash -spark-shell ``` +Unable to find image 'hello-world:latest' locally +latest: Pulling from library/hello-world +e6590344b1a5: Pull complete +Digest: sha256:7e1a4e2d11e2ac7a8c3f768d4166c2defeb09d2a750b010412b6ea13de1efb19 +Status: Downloaded newer image for hello-world:latest -You should see an output similar to: +Hello from Docker! +This message shows that your installation appears to be working correctly. -```bash -Setting default log level to "WARN". -To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). -25/01/15 11:33:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable -Spark context Web UI available at http://de-vm-lrae-test.europe-north1-b.c.wagon-de.internal:4040 -Spark context available as 'sc' (master = local[*], app id = local-1736940788403). -Spark session available as 'spark'. -Welcome to - ____ __ - / __/__ ___ _____/ /__ - _\ \/ _ \/ _ `/ __/ '_/ - /___/ .__/\_,_/_/ /_/\_\ version 3.5.3 - /_/ +To generate this message, Docker took the following steps: + 1. The Docker client contacted the Docker daemon. + 2. The Docker daemon pulled the "hello-world" image from the Docker Hub. + (amd64) + 3. 
The Docker daemon created a new container from that image which runs the + executable that produces the output you are currently reading. + 4. The Docker daemon streamed that output to the Docker client, which sent it + to your terminal. -Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 1.8.0_432) -Type in expressions to have them evaluated. -Type :help for more information. - -scala> -``` -Type `:quit` and hit enter to exit the spark-shell and continue. +To try something more ambitious, you can run an Ubuntu container with: + $ docker run -it ubuntu bash +Share images, automate workflows, and more with a free Docker ID: + https://hub.docker.com/ -## Python & Pip +For more examples and ideas, visit: + https://docs.docker.com/get-started/ +``` -Ubuntu 22.04 has Python pre-installed, but not the version we're going to use. We are going to use Python 3.12 ([3.12.8](https://www.python.org/downloads/release/python-3128/)). +#### Kubernetes -Let's install pyenv to manage our python versions: +We can start by testing `minikube`: ```bash -git clone https://github.com/pyenv/pyenv.git ~/.pyenv -source ~/.zprofile -exec zsh +# Start +minikube start ``` -We'll also install a useful `pyenv` plugin called [`pyenv-virtualenv`](https://github.com/pyenv/pyenv-virtualenv). Although we will be using `poetry` for Python package and virtual environment management, `pyenv-virtualenv` is useful for controlling python versions locally. +Should return: -```bash -git clone https://github.com/pyenv/pyenv-virtualenv.git $(pyenv root)/plugins/pyenv-virtualenv -exec zsh ``` - -Now install Python 3.12.8: -```bash -pyenv install 3.12.8 -pyenv global 3.12.8 +😄 minikube v1.35.0 on Ubuntu 22.04 (amd64) +✨ Automatically selected the docker driver. Other choices: none, ssh +📌 Using Docker driver with root privileges +👍 Starting "minikube" primary control-plane node in "minikube" cluster +🚜 Pulling base image v0.0.46 ... +💾 Downloading Kubernetes v1.32.0 preload ... 
+ > gcr.io/k8s-minikube/kicbase...: 500.31 MiB / 500.31 MiB 100.00% 88.19 M + > preloaded-images-k8s-v18-v1...: 333.57 MiB / 333.57 MiB 100.00% 32.20 M +🔥 Creating docker container (CPUs=2, Memory=3900MB) ... +🐳 Preparing Kubernetes v1.32.0 on Docker 27.4.1 ... + ▪ Generating certificates and keys ... + ▪ Booting up control plane ... + ▪ Configuring RBAC rules ... +🔗 Configuring bridge CNI (Container Networking Interface) ... +🔎 Verifying Kubernetes components... + ▪ Using image gcr.io/k8s-minikube/storage-provisioner:v5 +🌟 Enabled addons: storage-provisioner, default-storageclass +🏄 Done! kubectl is now configured to use "minikube" cluster and "default" namespace by default ``` -Now `python --version` should return `3.12.8` - - -## Pipx - -Next we are going to install [pipx](https://pypa.github.io/pipx/) to install python packages we want globally available while still using virtual environments. -Let's upgrade `pip` first: +And then make sure the kubernetes CLI utility, `kubectl`, works with: ```bash -pip install --upgrade pip +# Get pods +kubectl get po -A ``` -And install `pipx`: +Should return something similar to: -```bash -python -m pip install --user pipx # --user so that each ubuntu user can have his own 'pipx' -python -m pipx ensurepath -exec zsh ``` - -Lets install a [tldr](https://github.com/tldr-pages/tldr) with pipx - -```bash -pipx install tldr +NAMESPACE NAME READY STATUS RESTARTS AGE +kube-system coredns-668d6bf9bc-mg7b6 1/1 Running 0 72s +kube-system etcd-minikube 1/1 Running 0 78s +kube-system kube-apiserver-minikube 1/1 Running 0 76s +kube-system kube-controller-manager-minikube 1/1 Running 0 76s +kube-system kube-proxy-stk77 1/1 Running 0 72s +kube-system kube-scheduler-minikube 1/1 Running 0 76s +kube-system storage-provisioner 1/1 Running 1 (41s ago) 75s ``` -Now `tldr` should be globally available (for the current user), test it out with: +And because `minikube` is resource intensive, delete the cluster for now with: ```bash -tldr 
ls +# Stop and delete the cluster +minikube delete --all ``` -Much more readable than the classic `man ls` (although sometimes you will still need to delve into the man pages to get all of the details!) and it even has pages not included in man such as `tldr gh`: - -tldr - - -Lets add a few more packages we want globally available - -### black +Should return: -[black](https://black.readthedocs.io/en/stable/) for helping to format code - -```bash -pipx install black +``` +🔥 Deleting "minikube" in docker ... +🔥 Removing /home//.minikube/machines/minikube ... +💀 Removed all traces of the "minikube" cluster. +🔥 Successfully deleted all profiles ``` -### Poetry - -[Poetry](https://python-poetry.org/) is a modern Python package manager we will use throughout the bootcamp. +#### Terraform -Install Poetry running the following command in your VS Code terminal: +🧪 To test: ```bash -pipx install poetry +terraform --version ``` -Then, let's update default poetry behavior so that virtual envs are always created where `poetry install` is run. -During the bootcamp, you'll see a `.venv` folder being created inside each challenge folder. +Should return: -```bash -poetry config virtualenvs.in-project true ``` - -Finally, update your VScode settings to tell it that this `.venv` relative folder path will be your default interpreter! - -1. Open the Command Palette ( 🪟 ctrl + shift + P / 🍎 cmd + shift + P ) -2. Search for: **Preference: Open Remote Settings (JSON)** - when you open your settings that should be two panels. -3. In the panel that opens on the **right side** search for the line: `python.defaultInterpreterPath` -4. Replace the value (probably `"~/.pyenv/shims/python"`) so that it looks like: - -```yml -"python.defaultInterpreterPath": ".venv/bin/python", +Terraform v1.11.2 +on linux_amd64 ``` -## Direnv +#### Spark -[Direnv](https://direnv.net/) is a great utility that will look for `.envrc` files in your directories. 
When you `cd` into directories with a `.envrc` files, paths will automatically be updated. In our case, this will simplify our workflow and allow us to not have to worry about Poetry managed Python virtual environments. - -1. First, setup the *direnv hook* to your zsh shell so that direnv gets activated anytime a `.envrc` file exists in current working directory. +🧪 To test: ```bash -code ~/.zshrc +spark-shell ``` -```bash -plugins=(git gitfast ... pyenv ssh-agent direnv) # add `direnv` to the existing list of plugins -``` +Should take you into the Spark shell, which looks like: -2. Second, let's configure what will happens anytime `.envrc` file is found +``` +Setting default log level to "WARN". +To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). +25/03/18 08:54:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable +Spark context Web UI available at http://lw-de-vm.europe-north1-b.c.wagon-de.internal:4040 +Spark context available as 'sc' (master = local[*], app id = local-1742288096829). +Spark session available as 'spark'. +Welcome to + ____ __ + / __/__ ___ _____/ /__ + _\ \/ _ \/ _ `/ __/ '_/ + /___/ .__/\_,_/_/ /_/\_\ version 3.5.3 + /_/ -```bash -code ~/.direnvrc -``` -- Paste the following lines - ```bash - layout_poetry() { - if [[ ! -f pyproject.toml ]]; then - log_error 'No pyproject.toml found. Use `poetry new` or `poetry init` to create one first.' 
- exit 2 - fi - # create venv if it doesn't exist - poetry run true - - export VIRTUAL_ENV=$(poetry env info --path) - export POETRY_ACTIVE=1 - PATH_add "$VIRTUAL_ENV/bin" - } - ``` -- Save and close the file - -😎 Now, **anytime you `cd` into a challenge folder which contains a `.envrc` file which contains `layout_poetry()` command inside, the function will get executed and your virtual env will switch to the poetry one that is defined by the `pyproject.toml` !** -- No need to prefix all commands with `poetry run `, but simply `` -- Each challenge will have its own virtual env, and it will be seamless for you to switch between challenges/envs +Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 1.8.0_442) +Type in expressions to have them evaluated. +Type :help for more information. +scala> +``` -## Let's Make! +Type `:quit` and hit enter to exit the spark-shell and continue. -Lets clone the challenges onto your **virtual machine** +That's all the testing we'll do for now! -```bash -export GITHUB_USERNAME=`gh api user | jq -r '.login'` -echo $GITHUB_USERNAME -``` -Then: +## Let's Make! -```bash -mkdir -p ~/code/$GITHUB_USERNAME && cd $_ -gh repo fork lewagon/data-engineering-challenges --clone -``` +Almost there! In the second ansible playbook, the `lewagon/data-engineering-challenges` repository was forked from Le Wagon to your own GitHub account. Let's review how it works. Our setup will look a bit like this: - +![](/images/repo_overview.png) This allows you to work on challenges, but if we push any changes to the content, you can still access them! Check your remotes match `origin` your data engineering challenges and `upstream` lewagon's! 
```bash -cd data-engineering-challenges +cd ~/code/$(gh api user | jq -r '.login')/data-engineering-challenges git remote -v -# origin git@github.com:your_github_username/data-engineering-challenges.git (fetch) -# origin git@github.com:your_github_username/data-engineering-challenges.git (push) -# upstream git@github.com:lewagon/data-engineering-challenges.git (fetch) -# upstream git@github.com:lewagon/data-engineering-challenges.git (push) +``` + +Should return: + +``` +origin git@github.com:/data-engineering-challenges.git (fetch) +origin git@github.com:/data-engineering-challenges.git (push) +upstream git@github.com:lewagon/data-engineering-challenges.git (fetch) +upstream git@github.com:lewagon/data-engineering-challenges.git (push) ``` From challenge folder root **on the vm**, we'll run `make install`, which triggers 3 operations: