My Documentation
- Pyspark
- Basic Pyspark documentation
- Installation of Spark (3.0.1) on Ubuntu machine
- Certifications
- Spark-submit tasks
- Killing YARN applications:
- Simple HDFS commands
- Importing Pyspark modules
- Creation of Pyspark objects
- RDDs
- Dataframes
- Partitions in pyspark
- The partition skew problem
- Spark executor/cores and memory management: Resources allocation in Spark
- Spark UI
- Basic commands
- Reading/writing data
- User-Defined Schemas
- Random sampling, stratified sampling
- Aggregating in Pyspark
- Joins
- Window functions
- Generate a column with dates between 2 dates
- Generate an array of dates between 2 dates
- explode operation
- Fill forward or backward in spark
- Arrays: Create time series format from row time series (ArrayType format)
- Revert from time series (list) format to traditional (exploded) format
- Converting dates in Pyspark
- Maps/dictionaries in pyspark
- NaN/Null/None handling
- Saving a table in Hadoop
- Filtering data in Pyspark
- Opening tables from Data Warehouse
- User-defined functions (UDF)
- Pandas UDF
- ETL in Spark
- Machine Learning using the MLlib package
- Machine Learning using the ML package
- Parallelization of scikit-learn into spark
- Text analysis in Pyspark
- Spark Streaming
- Scala Spark
- Petastorm library
- Python, Jupyter and Pandas
- Exercises on Python (Pandas, Numpy,…)
- Basic Jupyter info
- Getting notebooks to work on a server and accessing them over SSH
- Python environments: pip & conda
- Python linting, static code analysis
- Visual Studio Code set-up
- Python basic info
- Unit tests in Python: pytest
- How to package an application in python
- Numpy basic documentation
- Basic Pandas documentation
- Good Pandas links:
- Loading Pandas dataframe from file
- Creation of some data in a Pandas dataframe
- Creating dataframe with several objects per cell
- Stacking of dataframes in Pandas
- How to shuffle the columns of a dataframe?
- Pandas and memory
- Re-setting of index in Pandas dataframes
- Iterating over Pandas dataframe rows:
- Check number of nulls in each feature column
- Identify which columns are categorical and which are not (important for some ML algorithms)
- Deleting a column, or list of columns:
- Displaying dataframes to screen
- Reading very big files in chunks
- Reading JSON blobs (from command line)
- Retrieval of data from SQL data warehouse
- Exporting data to SQL warehouse
- Transform format of dataframe: collapse multiple columns into one
- Apply function to all rows (axis=1) or to all columns (axis=0):
- Dataframe containing column of lists
- Exploding a dataframe of lists of items (with ID column) into exploded ID-item column
- Group by operations in Pandas
- Ranking inside groups
- Apply vs transform operations on groupby objects
- Comparison SQL-Pandas
- Merging and Concatenation operations
- Pivot operations
- Melting operation
- Pandas Cheatsheet
- Assigning values to dataframe
- Percentiles - quantiles in Pandas
- Saving of Pandas dataframe to LIBSVM file format and inverse
- Check that 2 dataframes are equal
- Cutting a dataframe into train-test-validation sets
- Useful plots
- Dask, or parallel Pandas
- Python APIs:
- Scikit-learn
- Basic Scikit-learn documentation
- Data Preparation
- Decision Tree
- Random Forest
- Neural Networks
- Parameters Tuning
- How to measure the Feature importance in RF (or other algo)
- Score explanation for individual observation (using LIME)
- Score explanation for individual observation (using tree-interpreter)
- Very important update in the field of interpretation:
- Ensemble Classification
- Clustering
- Text Mining in Python
- Deep Learning
- Reinforcement Learning
- Time Series in Python
- Algorithms
- Statistics and Probabilities
- Great plot libraries
- SQL Server documentation
- Useful Bash commands (or batch)
- Useful Git commands
- Define name globally
- GitHub: how to create a repo locally and then push it to remote?
- Bitbucket/GitHub: how to create SSH keys and connect to the Bitbucket server
- Pull requests: how-to
- How to create a local (empty) branch and upstream it to remote?
- What to do when your local dev branch is behind remote master?
- A branch exists in remote and not in local, how to get on it?
- My branch “feature” is based on branch “development” and I wish to bring SPECIFIC new files from “development” into “feature”. How?
- How to compare 2 branches from the command line?
- Avoiding git pull
- Git aliases
- Other useful commands
- Git push configuration: matching vs simple
- Versioning in git: git tag
- What is a detached head?
- Adoption of a git flow
- After updating my Windows password, I cannot connect or push: authentication error… (stash-bitbucket)
- Book & Cheatsheets
- Screen commands
- TMUX commands
- Useful VIM commands
- Data types
- Interesting business models
- Sphinx
- DevOps
- Infrastructure as code (IaC)
- MLOps - Machine learning life cycle
- MLFlow
- 1. MLFlow Tracking: https://www.mlflow.org/docs/latest/tracking.html
- 2. MLFlow Projects: https://www.mlflow.org/docs/latest/projects.html
- 3. MLFlow Models: https://www.mlflow.org/docs/latest/models.html
- 4. MLFlow Model registry: https://www.mlflow.org/docs/latest/model-registry.html
- How to create an open-source MLFlow server on Docker or Kubernetes
- SonarQube (or SonarCloud, SonarLint): static code analysis
- Kubeflow
- MLFlow
- DataOps
- Monitoring of ML models
- Azure Cloud, Databricks
- Data Science Interviews Questions
- Ray