My Documentation

PySpark
- Basic PySpark documentation
- ETL in Spark
- Machine Learning using the MLlib package
  - The Random Forest
  - Kernel Density Estimation (here 1-D only)
- Machine Learning using the ML package
- Parallelization of scikit-learn in Spark
- Text analysis in PySpark
  - Dealing with text: Tokenizer, Hashing, IDF
- Spark Streaming
- Scala Spark
- Petastorm library
  - Conversion of Spark data to Petastorm dataset
Python, Jupyter and Pandas
- Exercises on Python (Pandas, NumPy, …)
- Basic Jupyter info
- Getting notebooks to work on a server and accessing them over SSH
- Python environments: pip & conda
- Python linting, static code analysis
  - Pylint
  - Flake8
- Visual Studio Code set-up
- Python basic info
- Unit tests in Python: pytest
  - Coverage (of unit test): pytest-cov
- How to package an application in Python
- NumPy basic documentation
- Basic Pandas documentation
- Useful plots
- Dask, or parallel Pandas
- Python APIs:
  - Flask
  - Streamlit
Scikit-learn
- Basic Scikit-learn documentation
- Data Preparation
- Decision Tree
- Random Forest
- Neural Networks
  - Introduction to Keras package
- Parameter Tuning
- How to measure feature importance in RF (or other algorithms)
  - Mean decrease accuracy (i.e. how much does the accuracy drop if we permute the values of a feature?)
- Score explanation for individual observations (using LIME)
  - Classification examples
- Score explanation for individual observations (using tree-interpreter)
- Very important update in the field of interpretation:
- Ensemble Classification
- Clustering
  - Gaussian Mixture Model (GMM)
Text Mining in Python
- Libraries and useful links
  - Basic functions
- Intro to regular expressions (REGEX)
- Bag of Words (BOW)
- TF-IDF (Term Frequency - Inverse Document Frequency)
- Cosine distance, Cosine similarity
- Word2Vec
- GloVe
- FastText
- BERT (Bidirectional Encoder Representations from Transformers)
- Chatbot
Deep Learning
- Introduction
- Keras
- TensorFlow
Reinforcement Learning
- Definitions
- Value Function
- Discounting Factor
- How to learn? Monte-Carlo vs Temporal Difference Learning
- Exploration/Exploitation trade-off
- RL approaches
- The Policy Function
- The Q-function
- Q-Learning
Time Series in Python
- Basic TS concepts
  - Forecast quality metrics
  - Augmented Dickey-Fuller test
- Main forecasting methods
- Basic Pandas Time Series
- Time series decomposition
- Tsfresh: extracting features from time series
- Seasonal ARIMA: an example
Algorithms
- Supervised Machine Learning
- Unsupervised Machine Learning - Clustering
- Deep Learning
- Machine Learning Ranking
Statistics and Probabilities
- Statistics
- Probabilities
Great plot libraries
- Matplotlib
- Plotly and Dash
  - Installation
SQL Server documentation
- Basic SQL documentation
- PostgreSQL
Useful Bash commands (or batch)
- Find the 10 largest files in a directory:
- Find a previous command matching some word in Bash
- Find RAM type in Windows:
- Top
- Finding files and folders in Linux
- Symbolic links between folders and files
Useful Git commands
- Define name globally
- GitHub: how to create a repo locally and then push it to remote?
- Bitbucket/GitHub: how to create SSH keys and connect to a Bitbucket server
- Pull requests: how-to
- How to create a local (empty) branch and upstream it to remote?
- What to do when your local dev branch is behind remote master?
- A branch exists in remote and not in local, how to get on it?
- My branch “feature” is based on branch “development” and I wish to bring SPECIFIC new files from “development” into “feature”. How?
- How to compare 2 branches from the command line?
- Avoiding git pull
- Git aliases
- Other useful commands
- Git push configuration: matching vs simple
- Versioning in git: git tag
- What is a detached head?
- Adoption of a git flow
- After updating my Windows password, I cannot connect or push: authentication error… (stash-bitbucket)
- Book & Cheatsheets
- Screen commands
- TMUX commands
Useful VIM commands
Data types
Interesting business models
- Transaction data analytics
  - The RFM segmentation
Sphinx
- Basic Sphinx commands
  - Subject Subtitle
  - Inline Markup
DevOps
- Airflow
  - Airflow CLI
- Docker
- Docker-compose
- Kubernetes
- OpenShift
- CI/CD development
  - GitLab
  - Jenkins
  - Azure DevOps
  - Tests
  - Git Flow
Infrastructure as Code (IaC)
- Terraform
MLOps - Machine learning life cycle
- MLflow
- Kubeflow
DataOps
- Feature Store
  - Hopsworks
  - FEAST
Monitoring of ML models
Azure Cloud, Databricks
- Azure Cloud
  - Azure Key Vault
- Azure Databricks
Data Science Interview Questions
- Data Science
- Data Engineering
- MLOps
Ray
