===============================================
Python, Jupyter and Pandas
===============================================
Exercises on Python (Pandas, Numpy,...)
===============================================
- https://www.w3resource.com/python-exercises/
- https://pynative.com/python-exercises-with-solutions/
- https://edabit.com/challenge/xbZR26rHMNo32yz35
and many more ...
Basic Jupyter infos
===============================================
When you launch a Jupyter notebook, you can adjust its display width by running the following at the beginning:
.. sourcecode:: python
from IPython.core.display import display, HTML
# the usual trick is to inject a bit of CSS to widen the notebook container:
display(HTML("<style>.container { width:95% !important; }</style>"))
# or, to use the full browser width:
display(HTML("<style>.container { width:100% !important; }</style>"))
Loading a package from a given (possibly different) directory:
.. sourcecode:: python
import sys
sys.path.append('/home/user/folder/python-package')
Useful internal address: https://sb-hdp-e4.fspa.myntet.se:8400/
Useful Jupyter notebook tricks: https://www.dataquest.io/blog/advanced-jupyter-notebooks-tutorial/
What is the meaning of the "python -m" flag?
The -m stands for module-name: it should be a valid Python module (or package) name. The -m flag makes Python search sys.path for the named module and execute its contents as the __main__ module.
It's worth mentioning that for a package this **only** works if the package contains a __main__.py file; otherwise the package cannot be executed directly:
.. sourcecode:: python
python -m some_package some_arguments
The python interpreter will look for a __main__.py file in the package path and execute it. This is roughly equivalent to:
python path_to_package/__main__.py somearguments
It will execute the content after:
if __name__ == "__main__":
- https://appdividend.com/2021/02/18/what-is-the-meaning-of-python-m-flag/
- https://stackoverflow.com/questions/7610001/what-is-the-purpose-of-the-m-switch
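To make this concrete, here is a minimal sketch of an executable package (hypothetical name some_package, not taken from the links above):
.. sourcecode:: python
# Layout:
# some_package/
#     __init__.py
#     __main__.py
#
# Content of some_package/__main__.py:
import sys

def main(args):
    print(f"some_package called with arguments: {args}")

if __name__ == "__main__":
    main(sys.argv[1:])

# From the parent directory of some_package, run:
# python -m some_package arg1 arg2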
Getting notebooks to work on a server and accessing them using ssh
=================================================================
How to keep a Jupyter notebook (or pyspar3Jupyter) alive through ssh:
- Log in to the server with ssh (e.g. using putty)
- Type: nohup pyspar3Jupyter > save.txt & (save.txt will then contain the address of the notebook)
- Type: jobs -l to get the pid; this will be useful when you want to kill your pyspark session.
- ps -aux | grep b_number (this gets the pid if the ssh session has already been shut down)
- kill 5442 (if pid=5442)
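To actually reach the notebook from your own machine, a common (generic) trick is an ssh tunnel; a sketch, assuming the notebook listens on port 8888 on the server (use the port shown in save.txt):
.. sourcecode:: python
# On your local machine:
# ssh -L 8888:localhost:8888 your_user@the_server
# Then open http://localhost:8888 (with the token from save.txt) in your local browser.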
Python environments: pip & conda
===============================================
Very good intro: https://towardsdatascience.com/devops-for-data-science-making-your-python-project-reproducible-f55646e110fa
pip: Installing through proxy
-----------------------------------------------
.. sourcecode:: python
pip install --proxy=https://p998phd:p998phd@proxyvip-se.sbcore.net:8080 --trusted-host pypi.python.org -U PACKAGE_NAME
How to check list of python packages installed through pip: https://pip.pypa.io/en/stable/reference/pip_list/
.. sourcecode:: python
# linux
python -m pip list
# windows
py -m pip list # (although I think "python -m pip list" works too)
# or simply
pip list
Cleaning the pip cache:
.. sourcecode:: python
python -m pip cache purge
pip, venv & setup.py: create a simple virtual environment for model development
-----------------------------------------------
**venv**: the main command to create the virtual environment is (https://docs.python.org/3/library/venv.html):
python3 -m venv /path/to/new/virtual/environment/venv_name
See https://madewithml.com/courses/mlops/packaging/
.. sourcecode:: python
python3 -m venv venv
source venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
pip install -e .
Let's unpack what's happening here:
- Creating a virtual environment named venv
- Activating our virtual environment. Type deactivate to exit out of the virtual environment.
- Upgrading required packages so we download the latest package wheels.
- Installing (our package) from (our) setup.py (-e / --editable installs the project in develop mode)
Example of a setup.py: https://github.com/GokuMohandas/MLOps/blob/main/setup.py
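For reference, a minimal setup.py along those lines could look like this (hypothetical package name and dependencies, to be adapted):
.. sourcecode:: python
# setup.py (minimal sketch)
from setuptools import setup, find_packages

setup(
    name="my_package",          # hypothetical name
    version="0.1.0",
    packages=find_packages(),
    python_requires=">=3.7",
    install_requires=[
        "numpy",                # hypothetical dependencies
        "pandas",
    ],
)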
pip: How does `pip install -e .` work?
-----------------------------------------------
This is taken from: https://www.reddit.com/r/learnpython/comments/ayx7za/how_does_pip_install_e_work_is_there_a_specific/
pip install is a command that takes a package and installs it inside the **site-packages** folder of your Python installation (be it your main/system-wide Python installation, or one inside a virtual environment).
Normally, when you do this by simply writing a package name, like with pip install requests, pip looks for the package in the Python Package Index, or PyPI, which is a website. However, pip can also look for packages which are in other places (including inside your computer right now), and properly copy them to your site-packages folder. This is useful in a few specific cases:
If you download the source code directly, e.g. from a github repository or another similar platform, you can use pip install to install this package without having to resort to PyPI. Now granted this is not very useful, since most people who create good packages and share them on github will also add them to PyPI anyway, but the option is there.
Install a specific version of a package which is not directly available through PyPI, but may be reachable through github and others. Think about an unstable/dev build of a project: the devs don't want to make it available through PyPI to keep unaware users from downloading broken code, but you can pip install it as a Python package anyway, at your own risk.
Install your own code as a package on your own machine. This basically copies your code over to the site-packages folder and treats it like any other package you've downloaded. Useful for testing and developing, since this makes your package behave like it would in any other system once you release it to the world. This is where pip install . comes into play: the dot is an actual argument, standing for the directory you're currently in. Most of the time you'll pip install your own packages using a terminal already inside of the project's folder, which is why you see the dot as sort of a default argument. Also keep in mind that you will need some specific files in order for your package to be "installable", like a setup.py and possibly some __init__.py.
Last thing to note is that pip install will install the current package as it is right now. If you pip install a package you're developing and add some new files to it afterwards, these changes will not be reflected on the actual package installed beforehand. To avoid having to pip install the package again and again after each change, you can pass the *-e* flag to make an editable install; in this case, changes to your files inside the project folder will automatically reflect in changes on your installed package in the site-packages folder.
.. sourcecode:: python
pip install -e .
Create a package out of your code (wheels)
-----------------------------------------------
See this excellent post: https://godatadriven.com/blog/a-practical-guide-to-using-setup-py/ (todo: extract example from it)
The wheel package format allows Python developers to package a project's components so they can be easily and reliably installed in another system. Just like the JAR format in the JVM world, a wheel is a compressed, single-file build artifact, typically the output of a CI/CD system. Similar to a JAR, a wheel contains not only your source code but references to all of its dependencies as well.
Wheels are packages that can be installed using pip from either a public repository like Pypi or a private repository.
Here is wheel official documentation: https://wheel.readthedocs.io/en/stable/quickstart.html
Essentially, from your setup.py file, you can create your wheel using the command:
.. sourcecode:: python
python setup.py bdist_wheel
Sometimes people use python setup.py sdist bdist_wheel instead. See https://medium.com/ochrona/understanding-python-package-distribution-types-25d53308a9a for what an sdist (source distribution) is, compared to a bdist (built distribution).
To install a wheel file, use pip install:
.. sourcecode:: python
# If it's ready to be used
pip install someproject-1.5.0-py2.py3-none-any.whl
# But of course you might want to install the package in editable mode, if you are still working on it (let's say setup.py is in current folder):
pip install -e . # that acts on the setup.py file, not any already created wheel (I think)
Note for **Databricks**: you can install wheels and run them as jobs (if there is an entrypoint to run within the wheel of course): https://databricks.com/blog/2022/02/14/deploy-production-pipelines-even-easier-with-python-wheel-tasks.html . To run a Job with a wheel, first build the Python wheel locally or in a CI/CD pipeline, then upload it to cloud storage. Specify the path of the wheel in the task and choose the method that needs to be executed as the entrypoint. Task parameters are passed to your main method via *args or **kwargs.
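As a sketch only (hypothetical names; the exact wiring is described in the Databricks link above), the method chosen as entrypoint can simply accept the task parameters:
.. sourcecode:: python
# my_package/entrypoint.py (hypothetical module shipped inside the wheel)
import sys

def main(*args):
    # the task parameters configured on the job end up here
    print(f"Job started with parameters: {args}")

if __name__ == "__main__":
    main(*sys.argv[1:])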
Alternative to setup.py: https://godatadriven.com/blog/a-practical-guide-to-setuptools-and-pyproject-toml/ This uses setuptools, setup.cfg and pyproject.toml instead of setup.py. Now build your project by running
.. sourcecode:: python
python -m build --wheel  # requires the "build" package: pip install build
Or, as before, we can do an editable install with pip install -e .
Here is a sample project, as an example: https://github.com/pypa/sampleproject . Although the setup.py is still populated, the pyproject.toml file is already present and the package can be built using python -m build --wheel
Question: we might have a requirements.txt as well as the install_requires declaration within setup.py (or better, setup.cfg)... is there a duplication/conflict? Actually no, as explained here:
- https://towardsdatascience.com/requirements-vs-setuptools-python-ae3ee66e28af
- https://packaging.python.org/en/latest/discussions/install-requires-vs-requirements/
In essence, the install_requires declaration specifies the minimal list of packages needed for the project to run correctly. In contrast, requirements.txt specifies the (usually pinned) collection of packages needed to recreate a full Python (virtual) environment.
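A small illustration of that difference (hypothetical package names and versions):
.. sourcecode:: python
# setup.cfg -> loose, minimal constraints, so the package stays broadly installable:
# [options]
# install_requires =
#     pandas>=1.0
#     requests
#
# requirements.txt -> fully pinned, to rebuild one specific environment:
# pandas==1.5.3
# requests==2.28.2
# pytest==7.2.1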
Conda packages
-----------------------------------------------
What about creating Conda packages? https://docs.conda.io/projects/conda-build/en/latest/user-guide/tutorials/build-pkgs.html
Here examples for Conda packages with exercises (and comparison with wheels): https://python-packaging-tutorial.readthedocs.io/en/latest/conda.html
Conda environments
-----------------------------------------------
Once installed, in linux, the .bashrc file will contain the block:
.. sourcecode:: python
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/home/philippe/anaconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
eval "$__conda_setup"
else
if [ -f "/home/philippe/anaconda3/etc/profile.d/conda.sh" ]; then
. "/home/philippe/anaconda3/etc/profile.d/conda.sh"
else
export PATH="/home/philippe/anaconda3/bin:$PATH"
fi
fi
unset __conda_setup
# <<< conda initialize <<<
That block initializes the conda environment after the .bashrc file is reloaded. Thus there is NO NEED to add a line such as export PATH="/home/philippe/anaconda3/bin:/home/philippe/anaconda3/condabin:$PATH" or similar. See https://github.com/conda/conda/issues/7980
Check the environments:
.. sourcecode:: python
conda env list
# or
conda info --envs
There should be a base, and others, if they were created.
Then, to activate a different environment:
.. sourcecode:: python
source activate env_name # ("source" is needed on Unix, not on Windows)
#or
conda activate env_name
How to create a new environment with some packages:
1. From the command line (see also the `conda documentation `_):
.. sourcecode:: python
conda create -n env_name --yes --quiet python=3.7 numpy scipy scikit-learn statsmodels
2. From an environment.yml file (see also the `conda documentation `_):
.. sourcecode:: python
conda env create -f environment.yml
To prepare such an environment.yml file, see the dedicated conda `documentation page `_. Basically:
.. sourcecode:: python
name: stats # the name of the env
dependencies:
- numpy
- pandas
Or more complex:
.. sourcecode:: python
name: stats2
channels:
- javascript
dependencies:
- python=3.6 # or 2.7
- bokeh=0.9.2
- numpy=1.9.*
- nodejs=0.10.*
- flask
- pip:
- Flask-Testing
How to clean the conda package cache:
.. sourcecode:: python
conda clean --all
Pyenv & pipenv python environments
---------------------------------------
Comparison of different python environment management tools: venv, virtualenv, pyenv, pipenv, conda (and docker): https://www.pluralsight.com/tech-blog/managing-python-environments/
**Pyenv**: As opposed to Pipenv, Pyenv is a tool for managing *multiple* Python installations.
Installation of pyenv and using pyenv to install different python versions: https://www.liquidweb.com/kb/how-to-install-pyenv-on-ubuntu-18-04/
See also https://menziess.github.io/howto/manage/python-versions/ for installation/uninstallation.
- Install from git: git clone https://github.com/pyenv/pyenv.git ~/.pyenv
- configure the environment:
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo -e 'if command -v pyenv 1>/dev/null 2>&1; then\n eval "$(pyenv init -)"\nfi' >> ~/.bashrc
bash
note: on Windows you can change an environment variable with: set PATH=%PATH%;C:\Users\some\path\
- Look at available python versions: pyenv install --list
- install a specific version: pyenv install 3.8.3
- check the installed python versions: pyenv versions
Ex:
* system (set by /root/.pyenv/version)
3.8.3
- Now easy to switch between different installed versions:
pyenv global 3.8.3
Note: as several posts noted (for example `here `_ and `here `_), the python installation sometimes lacks a few things. Remedy with this:
sudo apt install -y make build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev libncursesw5-dev xz-utils tk-dev libffi-dev liblzma-dev python-openssl git
**Pipenv**: Pipenv is capable of using pyenv in the background to create and activate virtual environments that require different python versions.
Installation of pipenv: https://menziess.github.io/howto/manage/virtual-environments/#3-creating-a-virtual-environment
Note that some people recommend installing pipenv for the user only (see here, step 1 only: https://www.digitalocean.com/community/tutorials/how-to-install-python-3-and-set-up-a-programming-environment-on-an-ubuntu-20-04-server). A user installation prevents breaking any system-wide packages. If pipenv isn't available in your shell after installation, you'll need to add the user base's binary directory to your PATH.
pip3 install --user pipenv
and be sure to add ~/.local/bin to your PATH environment variable, e.g.: export PATH=$PATH:/home/[your_user]/.local/bin/
To upgrade pipenv at any time:
pip3 install --user --upgrade pipenv
Once pipenv is installed and configured, we can create a new pipenv project in a project directory with (https://www.pluralsight.com/tech-blog/managing-python-environments/)
$ pipenv --python $PYTHON_VERSION
which will initialize the project using the specified Python version (if pyenv is installed, it can even install Python versions on-demand). To start with, this creates:
- a Pipfile config file at the project home specifying Python version, sources, and any installed packages
- a fresh virtual environment housed in the pipenv working directory
We no longer have to manage installs with pip and virtual environments separately - pipenv takes care of both! To install a package, simply running
$ pipenv install $PACKAGE_NAME
will both install the package into the virtual environment, and write the package as a dependency into the Pipfile. This Pipfile is then all we need to rebuild the project elsewhere, rather than the requirements.txt used by other managers - simply running pipenv install on a directory with a Pipfile will recreate the environment. To activate the environment,
$ pipenv shell
Pipenv exhaustively builds out the dependency graph, flagging any issues and generating a validated Pipfile.lock for fully specifying every dependency in the project. We can trigger this manually for the requirements in our Pipfile with
$ pipenv lock
To deactivate your virtual environment, run:
$ deactivate
Remove the virtual environment:
$ pipenv --rm
Note switching (https://menziess.github.io/howto/manage/virtual-environments/#5-switching-to-pipenv): If you are already using another virtual environment tool, switching is quite easy. If you run pipenv install, it automatically detects the requirements.txt file:
requirements.txt found, instead of Pipfile! Converting…
Or you can explicitly pass the requirements.txt file as an argument, which may be useful if you have put development dependencies in a separate file:
$ pipenv install -r dev-requirements.txt --dev
And if you want to switch back to using requirements.txt files, you can run:
$ pipenv lock -r > requirements.txt
$ pipenv lock -r -d > dev-requirements.txt
See for this: https://pipenv.kennethreitz.org/en/latest/advanced/#generating-a-requirements-txt
Note (see https://github.com/pypa/pipenv/issues/3150): in Azure DevOps I have been using a line like this:
$ pipenv install -d --system --deploy --ignore-pipfile
pipenv install --ignore-pipfile is nearly equivalent to pipenv sync, but pipenv sync will never attempt to re-lock your dependencies as it is considered an atomic operation. pipenv install by default does attempt to re-lock unless using the --deploy flag.
More infos:
- https://pypi.org/project/pipenv/
- https://pipenv-fork.readthedocs.io/en/latest/basics.html
Python linting, static code analysis
=======================================
Pylint
---------------------------------------
How to create a configuration file .pylintrc in your project: pylint --generate-rcfile > .pylintrc
Also, how to format the report (could we put the format in the .pylintrc?); here is an example (https://community.sonarsource.com/t/pylint-results-not-reported-uploaded-by-scanner/4208):
.. sourcecode:: python
#Let's have a function:
$ cat sample.py
def function1(rrrr_mm_dd):
print "We do not use any argument"
$ pylint sample.py -r n --msg-template="{path}:{line}: [{msg_id}({symbol}), {obj}] {msg}" | tee pylint.txt
No config file found, using default configuration
************* Module sample
sample.py:1: [C0111(missing-docstring), ] Missing module docstring
sample.py:1: [C0111(missing-docstring), function1] Missing function docstring
Flake8
---------------------------------------
Visual Studio Code set-up
=======================================
Taken from https://menziess.github.io/howto/enhance/your-python-vscode-workflow/
The default values of the settings.json file can be seen in https://code.visualstudio.com/docs/getstarted/settings
In settings.json (ctrl-shift-P):
.. sourcecode:: python
{
"python.pythonPath": ".venv/bin/python"
}
For testing and linting, we can install the following in the local (project) environment:
pipenv install -d mypy autopep8 \
flake8 pytest bandit pydocstyle
The settings of vscode can be overridden by workspace settings per project. In settings.json:
.. sourcecode:: python
{
"python.autoComplete.addBrackets": true,
"python.formatting.provider": "autopep8",
"python.jediEnabled": false,
"python.linting.mypyEnabled": true,
"python.linting.flake8Enabled": true,
"python.linting.pylintEnabled": false,
"python.linting.pydocstyleEnabled": true,
"python.testing.unittestEnabled": false,
"python.testing.nosetestsEnabled": false,
"python.testing.pytestEnabled": true,
"python.testing.pytestArgs": [
"tests"
]
}
Some of these frameworks produce temporary folders, which can clutter your file explorer, and slow down file indexing. You can disable indexing for these files by passing a glob pattern to the files.watcherExclude field:
.. sourcecode:: python
{
"files.watcherExclude": {
"**/build/**": true,
"**/dist/**": true,
"**/.ipynb_checkpoints/**": true,
"**/*.egg-info/**": true,
"**/.pytest_cache/**": true,
"**/__pycache__/**": true,
"**/.mypy_cache/**": true,
"**/.venv/**": true
},
"files.exclude": {
"**/.pytest_cache/**": true,
"**/.mypy_cache/**": true,
"**/__pycache__/**": true,
"**/*.egg-info/**": true
}
}
Python basic info
=======================================
Formats for printing
---------------------------------------
See https://www.geeksforgeeks.org/python-output-formatting/
The general syntax for a format placeholder is: %[flags][width][.precision]type
.. sourcecode:: python
# print integer and float value
print("Geeks : % 2d, Portal : % 5.2f" %(1, 05.333))
# print exponential value
print("% 10.3E"% (356.08977))
Using format():
.. sourcecode:: python
# using format() method and referring to the position of the object
print('{0} and {1}'.format('Geeks', 'Portal'))
# combining positional and keyword arguments
print('Number one portal is {0}, {1}, and {other}.'
.format('Geeks', 'For', other ='Geeks'))
# using format() method with number
print("Geeks :{0:2d}, Portal :{1:8.2f}".
format(12, 00.546))
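Since Python 3.6 the same kind of formatting can also be written with f-strings (same :width.precision specifiers as format()), which I find more readable:
.. sourcecode:: python
name, value = "Geeks", 0.546
print(f"{name} :{12:2d}, Portal :{value:8.2f}")
# Geeks :12, Portal :    0.55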
How many cores in the edge node?
-----------------------------------------------
.. sourcecode:: python
import multiprocessing
print(multiprocessing.cpu_count())
56
This is similar to the linux command nproc --all (or grep -c ^processor /proc/cpuinfo).
The command grep ^cpu\\scores /proc/cpuinfo | uniq | awk '{print $4}'
gives 14: these are the true (physical) cores, while 56 is the number of threads.
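Related checks directly from Python (os.cpu_count() also counts logical cores, i.e. threads):
.. sourcecode:: python
import os
print(os.cpu_count())                # logical cores (threads), same value as multiprocessing.cpu_count()
print(len(os.sched_getaffinity(0)))  # cores actually available to this process (Linux only)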
Basic dates in Python
-----------------------------------------------
How to add/subtract some time to/from dates in python?
.. sourcecode:: python
import datetime
from dateutil.relativedelta import relativedelta
sub_days = datetime.datetime.today() + relativedelta(days=-6)
sub_months = datetime.datetime.today() + relativedelta(months=-6)
sub_years = datetime.datetime.today() + relativedelta(years=-6)
sub_hours = datetime.datetime.today() + relativedelta(hours=-6)
sub_mins = datetime.datetime.today() + relativedelta(minutes=-6)
sub_seconds = datetime.datetime.today() + relativedelta(seconds=-6)
print("Current Date Time:", datetime.datetime.today())
print("Subtract 6 days:", sub_days)
print("Subtract 6 months:", sub_months)
print("Subtract 6 years:", sub_years)
print("Subtract 6 hours:", sub_hours)
print("Subtract 6 mins:", sub_mins)
print("Subtract 6 seconds:", sub_seconds)
How to convert dates from datetime to string:
.. sourcecode:: python
from datetime import datetime
datetime.today().strftime("%Y-%m-%d")
How to get first day of month:
.. sourcecode:: python
from datetime import datetime
datetime.today().replace(day=1)
Docstrings in functions and classes:
-----------------------------------------------
Docstrings are a great tool for code understanding, especially when the code was not written by you... or when you wrote it a long time ago! The idea is to supply each function and class with a consistent explanation of its aim (what it is needed for, what it does) and a description of the input and output objects. It is a good habit to use them.
There is a PEP on docstrings (PEP257): https://www.python.org/dev/peps/pep-0257/
Here are a few examples, taken/adapted from https://www.geeksforgeeks.org/python-docstrings/:
One line docstrings (for very obvious functions):
.. sourcecode:: python
def power(a, b):
"""Returns arg1 raised to power arg2."""
return a**b
# To access the function description, for example from your notebook, you can use:
print(power.__doc__)
# Or similarly:
help(power)
Multi line docstrings:
.. sourcecode:: python
def my_function(arg1,arg2):
"""
Summary line.
Extended description of function.
Parameters:
arg1 (int): Description of arg1
arg2 (int): Description of arg2
Returns:
result (int): Description of return value
"""
result = arg1+arg2
return result
print(my_function.__doc__)
Class docstrings:
.. sourcecode:: python
class ComplexNumber:
"""
This is a class for mathematical operations on complex numbers.
Attributes:
real (int): The real part of complex number.
imag (int): The imaginary part of complex number.
"""
def __init__(self, real, imag):
"""
The constructor for ComplexNumber class.
Parameters:
real (int): The real part of complex number.
imag (int): The imaginary part of complex number.
"""
def add(self, num):
"""
The function to add two Complex Numbers.
Parameters:
num (ComplexNumber): The complex number to be added.
Returns:
ComplexNumber: A complex number which contains the sum.
"""
re = self.real + num.real
im = self.imag + num.imag
return ComplexNumber(re, im)
help(ComplexNumber) # to access Class docstring
help(ComplexNumber.add) # to access method's docstring
PEP - Code Refactoring - Autopep8
-----------------------------------------------
See https://pypi.org/project/autopep8/
.. sourcecode:: python
autopep8 --in-place --aggressive --aggressive code.py
If done with Visual Studio Code, the settings should be adapted. Type 'Ctrl + ,' and this will open the options palette. There, type in proxy and this will show all the proxy settings. Click on the settings.json file and update the contents so they look like the following:
.. sourcecode:: python
{
"http.proxy": "http://{your_pid_here}:{your_pid_here}@proxyvip-se.sbcore.net:8080",
"http.proxyStrictSSL": false,
"python.linting.enabled": true,
"python.linting.pep8Args": [
"--ignore=E501,E265"
],
"python.linting.pep8Enabled": true,
"python.linting.pylintEnabled": true,
"python.pythonPath": "C:\\Anaconda3\\python.exe",
"window.zoomLevel": 0,
"python.dataScience.jupyterServerURI": "http://sb-hdpdev-e3.fspa.myntet.se:4191/?token=test"
}
Unit tests in Python: pytest
=======================================================
Good links:
- https://realpython.com/pytest-python-testing/
- https://menziess.github.io/howto/test/python-code/
- Testing Flask app: https://testdriven.io/blog/flask-pytest/ with example: https://gitlab.com/patkennedy79/flask_user_management_example/-/tree/main/
Tests can be considered at three levels:
* Unit: Unit tests test the functionality of an individual unit of code isolated from its dependencies. They are the first line of defense against errors and inconsistencies in your codebase. They test from the inside out, from the programmer's point of view.
* Functional (or integration): Functional/integration tests test multiple components of a software product to make sure the components are working together properly. Typically, these tests focus on functionality that the user will be utilizing. They test from the outside in, from the end user's point of view.
* End-to-end: End-to-end tests exercise the whole application, from the user interface down to the backing services, in an environment as close to production as possible.
Both unit and functional testing are fundamental parts of the Test-Driven Development (TDD: https://testdriven.io/test-driven-development/) process. Testing should be combined with a Continuous Integration (CI) process to ensure that your tests are constantly being executed, ideally on each commit to your repository.
How to discover the unit tests (pytest): https://docs.pytest.org/en/stable/goodpractices.html#test-discovery
Tests outside application code: Putting tests into an extra directory outside your actual application code might be useful if you have many functional tests or for other reasons want to keep tests separate from actual application code (often a good idea). Note that no __init__.py is necessary in the tests/ folder, as Pytest can identify the files natively:
.. sourcecode:: python
setup.py
mypkg/
__init__.py
app.py
view.py
tests/
test_app.py
test_view.py
...
About fixtures (from the link above):
Imagine you’re writing a function, format_data_for_display(), to process the data returned by an API endpoint. The data represents a list of people, each with a given name, family name, and job title. The function should output a list of strings that include each person’s full name (their given_name followed by their family_name), a colon, and their title. To test this, you might write the following code:
.. sourcecode:: python
def format_data_for_display(people):
... # Implement this!
def test_format_data_for_display():
people = [
{
"given_name": "Alfonsa",
"family_name": "Ruiz",
"title": "Senior Software Engineer",
},
{
"given_name": "Sayid",
"family_name": "Khan",
"title": "Project Manager",
},
]
assert format_data_for_display(people) == [
"Alfonsa Ruiz: Senior Software Engineer",
"Sayid Khan: Project Manager",
]
Now suppose you need to write another function to transform the data into comma-separated values for use in Excel. The test would look awfully similar:
.. sourcecode:: python
def format_data_for_excel(people):
... # Implement this!
def test_format_data_for_excel():
people = [
{
"given_name": "Alfonsa",
"family_name": "Ruiz",
"title": "Senior Software Engineer",
},
{
"given_name": "Sayid",
"family_name": "Khan",
"title": "Project Manager",
},
]
assert format_data_for_excel(people) == """given,family,title
Alfonsa,Ruiz,Senior Software Engineer
Sayid,Khan,Project Manager
"""
If you find yourself writing several tests that all make use of the same underlying test data (or python object), then a fixture may be in your future. You can pull the repeated data into a single function decorated with @pytest.fixture to indicate that the function is a pytest fixture:
.. sourcecode:: python
import pytest
@pytest.fixture
def example_people_data():
return [
{
"given_name": "Alfonsa",
"family_name": "Ruiz",
"title": "Senior Software Engineer",
},
{
"given_name": "Sayid",
"family_name": "Khan",
"title": "Project Manager",
},
]
You can use the fixture by adding it as an argument to your tests. Its value will be the return value of the fixture function:
.. sourcecode:: python
def test_format_data_for_display(example_people_data):
assert format_data_for_display(example_people_data) == [
"Alfonsa Ruiz: Senior Software Engineer",
"Sayid Khan: Project Manager",
]
def test_format_data_for_excel(example_people_data):
assert format_data_for_excel(example_people_data) == """given,family,title
Alfonsa,Ruiz,Senior Software Engineer
Sayid,Khan,Project Manager
"""
Each test is now notably shorter but still has a clear path back to the data it depends on. Be sure to name your fixture something specific. That way, you can quickly determine if you want to use it when writing new tests in the future!
Another simple fixture example (from https://menziess.github.io/howto/test/python-code/):
.. sourcecode:: python
# Let's have some function
def say_hello_to(name='World'):
return f'Hello {name}!'
# We define here the fixture in the test file:
"""Some data for our tests."""
from pytest import fixture
@fixture
def names():
return 'Bob', '', None, 123, [], ()
# Now the test can run like this, to test many different formats at once (defined in the fixture function):
def test_say_hello_to(names):
assert say_hello_to('Stefan') == 'Hello Stefan!'
bob, empty, none, integer, li, tup = names
assert say_hello_to(bob) == 'Hello Bob!'
assert say_hello_to(empty) == 'Hello !'
assert say_hello_to(none) == 'Hello None!'
assert say_hello_to(integer) == 'Hello 123!'
assert say_hello_to(li) == 'Hello []!'
assert say_hello_to(tup) == 'Hello ()!'
Doctest: we can also write tests directly in function docstrings (note that in the example below the expected output is intentionally wrong, to show what a failure looks like):
.. sourcecode:: python
# Here some function with a test in the docstring:
def say_hello_to(name='World'):
"""Say hello.
>>> say_hello_to('Stefan')
'Hello Bob!'
"""
return f'Hello {name}!'
Now the test will run like this:
➜ pytest --doctest-modules
...
009 >>> say_hello_to('Stefan')
Expected:
'Hello Bob!'
Got:
'Hello Stefan!'
So here, the test is defined in the docstring itself!
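For completeness, a version of the docstring where the expected output matches, so that pytest --doctest-modules passes:
.. sourcecode:: python
def say_hello_to(name='World'):
    """Say hello.

    >>> say_hello_to('Stefan')
    'Hello Stefan!'
    """
    return f'Hello {name}!'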
How to parametrize test functions in Pytest (a mix of an old post here: https://www.softwaretestinghelp.com/pytest-tutorial/ and the documentation here: https://docs.pytest.org/en/6.2.x/parametrize.html#:~:text=%40pytest.mark.parametrize%20allows%20one%20to%20define%20multiple%20sets%20of,enables%20parametrization%20of%20arguments%20for%20a%20test%20function.):
Let's say we have 2 files, `parametrize/mathlib.py` and `parametrize/test_mathlib.py`. In `parametrize/mathlib.py` insert the following code that will return the square of a number.
.. sourcecode:: python
def cal_square(num):
return num * num
In the parametrize/test_mathlib.py we have the related tests:
.. sourcecode:: python
import mathlib
# Test case 1
def test_cal_square_1( ):
assert mathlib.cal_square(5) == 25
# Test case 2
def test_cal_square_2( ):
assert mathlib.cal_square(6) == 36
# Test case 3
def test_cal_square_3( ):
assert mathlib.cal_square(7) == 49
and so on: there might be a large number of values we need to check. How can we simplify this and instead have ONE single test function, parametrized so that multiple input/expected-output pairs are fed to the function under test?
.. sourcecode:: python
import pytest
import mathlib
@pytest.mark.parametrize("test_input,expected_output", [ (5, 25), (6, 36), (7, 49), (8, 64) ] )
def test_cal_square(test_input, expected_output):
assert mathlib.cal_square(test_input) == expected_output
You can also parametrize multiple parameters at once like this:
.. sourcecode:: python
import pytest
@pytest.mark.parametrize("x", [0, 1])
@pytest.mark.parametrize("y", [2, 3])
def test_foo(x, y):
pass
Some advice on how to document unit tests (from https://testdriven.io/blog/flask-pytest/): Let's say we have some class User within a /project/models.py file. The test related to the instantiation of that class would look like this:
.. sourcecode:: python
from project.models import User
def test_new_user():
"""
GIVEN a User model
WHEN a new User is created
THEN check the email, hashed_password, and role fields are defined correctly
"""
user = User('patkennedy79@gmail.com', 'FlaskIsAwesome')
assert user.email == 'patkennedy79@gmail.com'
assert user.hashed_password != 'FlaskIsAwesome'
assert user.role == 'user'
Tests are one of the most difficult aspects of a project to maintain. Often, the code (including the level of comments) for test suites is nowhere near the quality of the code being tested.
A common practice is to use the GIVEN-WHEN-THEN structure:
* GIVEN - what are the initial conditions for the test?
* WHEN - what is occurring that needs to be tested?
* THEN - what is the expected response?
Coverage (of unit test): pytest-cov
-----------------------------------------------------------
Coverage gives the fraction of the code which is covered by unit tests, in percent. You can define a .coveragerc file that basically tells what not to include in the coverage calculation. Pytest-cov is built on top of the coverage.py package (https://coverage.readthedocs.io/en/latest/index.html).
For example (see https://coverage.readthedocs.io/en/latest/source.html#source)
.. sourcecode:: python
[run]
omit =
# omit anything in a .local directory anywhere
*/.local/*
# omit everything in /usr
/usr/*
# omit this single file
utils/tirefire.py
Also a single function or class can be omitted by adding the comment next to its start (see https://coverage.readthedocs.io/en/coverage-4.3.3/excluding.html, https://coverage.readthedocs.io/en/latest/config.html)
.. sourcecode:: python
class MyObject(object):
def __init__(self):
blah1()
blah2()
def __repr__(self): # pragma: no cover
return ""
So here the "# pragma: no cover" prevents __repr__ from being counted in the coverage calculation. If we want to omit the full class from the coverage calculation:
.. sourcecode:: python
class MyObject(object): # pragma: no cover
Some good links on coverage:
- https://rorymurdock.github.io/2019/11/23/Code-Coverage.html, https://gist.github.com/rorymurdock/f8c1ace6e35684261823530e19510478
- https://pypi.org/project/pytest-cov/, https://coverage.readthedocs.io/en/latest/index.html
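To actually produce the coverage report with pytest-cov, the typical invocation is (assuming the package under test is called mypkg and the tests live in tests/):
.. sourcecode:: python
pytest --cov=mypkg --cov-report=term-missing tests/
# --cov points at the package to measure, --cov-report=term-missing also lists the uncovered line numbers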
How to package an application in python
===========================================================
Good links:
- General tutorial: https://packaging.python.org/tutorials/packaging-projects/
- Here is an example of project that can be used to build a package: https://github.com/pypa/sampleproject
Numpy basic documentation
===========================================================
.. figure:: Cheatsheets/Numpy_Python_Cheat_Sheet.png
:scale: 100 %
:alt: map to buried treasure
This Cheatsheet is taken from DataCamp.
Basic Pandas documentation
============================================================
.. topic:: Introduction
The objective here is to have everything useful for the projects, not to make a complete documentation of the whole package. Here I will try to document both version 1.6 and >2.0. A special emphasis will be put on the machine learning module ml (mllib is outdated).
Good Pandas links:
----------------------------
A good link on data manipulations: https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/
Loading Pandas dataframe from file
------------------------------------------------------------
.. sourcecode:: python
#Loading a Pandas dataframe:
df_pd = pd.read_csv("/home/BC4350/Desktop/Iris.csv")
Creation of some data in a Pandas dataframe
------------------------------------------------------------
.. sourcecode:: python
# A set of baby names and birth rates:
names = ['Bob','Jessica','Mary','John','Mel']
births = [968, 155, 77, 578, 973]
#We merge the 2 lists using the zip function:
BabyDataSet = list(zip(names,births))
#We create the DataFrame:
df = pd.DataFrame(data = BabyDataSet, columns=['Names', 'Births'])
Names Births
0 Bob 968
1 Jessica 155
2 Mary 77
3 John 578
4 Mel 973
Creating dataframe with several objects per cell
------------------------------------------------------------
.. sourcecode:: python
a = ['a1','a2','a3']
b = ['b1','b2','b3']
uu = [[a,b] for a,b in list(zip(a,b))]
vv = [{'a':a,'b':b} for a,b in list(zip(a,b))]
df = pd.DataFrame()
df['version_list'] = uu
df['version_dico'] = vv
df
version_list version_dico
0 [a1, b1] {'a': 'a1', 'b': 'b1'}
1 [a2, b2] {'a': 'a2', 'b': 'b2'}
2 [a3, b3] {'a': 'a3', 'b': 'b3'}
Stacking of dataframes in Pandas
------------------------------------------------------------
This will create a new df that contains the columns of both dataframes:
.. sourcecode:: python
df1 = pd.DataFrame([1,2,3],columns=['A'])
df2 = pd.DataFrame([4,5,6],columns=['B'])
df3 = pd.concat([df1,df2],axis=1)
How to shuffle the columns of a dataframe?
------------------------------------------------------------
Simply by using the "sample" method, which natively shuffles rows (only). So we first transpose the df, shuffle the rows, and transpose back:
.. sourcecode:: python
# Shuffling the columns
df_T = df.T
df_T = df_T.sample(frac=1)
df = df_T.T
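Alternatively, the sample method also takes an axis argument, so the double transpose can be avoided (same result in one line):
.. sourcecode:: python
# axis=1 samples columns instead of rows
df = df.sample(frac=1, axis=1)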
Pandas and memory
------------------------------------------------------------
How to estimate the size a dataframe takes in memory?
.. sourcecode:: python
df = pd.DataFrame(np.random.random((100,100)))
df.values.nbytes
80000 #number of bytes
#Here it gives the number of bytes for EACH column:
df.memory_usage()
# df.info() gives the types of the columns and the total memory used
df.info()
Re-setting of index in Pandas dataframes
---------------------------------------------------
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html
.. sourcecode:: python
# Use a column of df for index:
ts_all.set_index('transactiondate',inplace=True)
# Reset index to 0,1,2,3... (note that the old index will be as the first column of the df)
ts_all.reset_index(inplace=True)
Iterating over Pandas dataframe rows:
---------------------------------------------------
http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.iterrows.html
A simple example:
.. sourcecode:: python
for i, row in df.iterrows():
    print(row)
Check number of nulls in each feature column
-----------------------------------------------------
.. sourcecode:: python
# This will output all column names and the number of nulls in them
nulls_per_column = df.isnull().sum()
print(nulls_per_column)
Identify which columns are categorical and which are not (important for some ML algorithms)
--------------------------------------------------------------------
.. sourcecode:: python
# Create a boolean mask for categorical columns
categorical_feature_mask = df.dtypes == object
# Get list of categorical column names
categorical_columns = df.columns[categorical_feature_mask].tolist()
# Get list of non-categorical column names
non_categorical_columns = df.columns[~categorical_feature_mask].tolist()
Deleting a column, or list of columns:
-----------------------------------------------------
.. sourcecode:: python
df.drop(['column'], axis=1, inplace=True)
df.drop(['column1','column2'], axis=1, inplace=True)
Displaying dataframes to screen
-----------------------------------------------------
.. sourcecode:: python
#This allows you to display as many rows as you wish when you display the dataframe (works also for max_rows):
pd.options.display.max_columns = 50 #By default 20 only
#This display the 5 first rows:
df.head(5)
#This display the 5 last rows:
df.tail(5)
Display several dataframes in the same HTML format in one cell:
.. sourcecode:: python
from IPython.core import display as ICD
ICD.display(df1.head())
ICD.display(df2.head())
Reading very big files using chunk
-----------------------------------------------------
For csv files that can be bigger than the RAM, we can load them in chunks and perform an operation on each chunk (here, for example, a filtering):
.. sourcecode:: python
def filter_is_long_trip(data):
"Returns DataFrame filtering trips longer than 20 minutes"
is_long_trip = (data.trip_time_in_secs > 1200)
return data.loc[is_long_trip]
chunks = []
for chunk in pd.read_csv(filename, chunksize=1000):
chunks.append(filter_is_long_trip(chunk))
#or in a simpler way:
chunks = [filter_is_long_trip(chunk) for chunk in pd.read_csv(filename,chunksize=1000) ]
#Then we can use these filtered chunks and stack them into a single dataframe:
df = pd.concat(chunks)
Reading JSON blobs (from command line)
-----------------------------------------------------
.. sourcecode:: python
import pandas as pd
import sys
json_string = sys.argv[1]
print(pd.DataFrame(eval(json_string)))
# We run the code like this (quote the blob so the shell passes it as a single argument): python test_json.py "{'knid':{'0':'123456','1':'123456','2':'123457'},'score':{'0':'C2-1','1':'C2-2','2':'C4-1'},'join_dt':{'0':'2018-01-01','1':'2018-01-02','2':'2018-01-03'}}"
.. figure:: Images/Json_output.png
:scale: 100 %
:alt: Json output
Retrieval of data from SQL data warehouse
-----------------------------------------------------
This exports the data in a simple array:
.. sourcecode:: python
import pyodbc as odbc
# Some super SQL query
sql = """SELECT top 100
table as RUN_TS
,b.[AC_KEY]
,[PROBABILITY_TRUE]
FROM [DB].[test].[B_DCS_DK_ROL] b
JOIN db.ctrl.run_info r ON r.RUN_ID=b.RUN_ID
"""
conn = odbc.connect(r'Driver={SQL Server};Server=SERVER;Database=DB;Trusted_Connection=yes;')
crsr = conn.cursor()
crsr.execute(sql)
params=crsr.fetchall()
crsr.close()
conn.close()
But if we want to have the data immediately loaded into a dataframe, then we can use these functions:
.. sourcecode:: python
import pypyodbc as odbc
def Extract_data_from_SQLserver(Server,DataBase,SQLcommand):
cnxn = odbc.connect(r'Driver={SQL Server};Server='+Server+';Database='+DataBase+';Trusted_Connection=yes;')
cursor = cnxn.cursor()
#THE EXTRACTION OF HEADER AND DATA
res = cursor.execute(SQLcommand)
header = [col[0] for col in res.description]
data = cursor.fetchall()
#LOADING THE RESULT INTO A DATAFRAME
df = pd.DataFrame(data, columns=header)
cursor.close()
cnxn.close()
return df
#And we can use it like this:
#some SQL command:
SQLcommand = """
select *
from db.dbo.table
order by field1, field2
"""
df = Extract_data_from_SQLserver('server','db',SQLcommand)
Exporting data to SQL warehouse
-------------------------------------------
Let's say we have some dataframe, here FinalListModel1:
.. sourcecode:: python
import pypyodbc as odbc
conn = odbc.connect(r'Driver={SQL Server};Server=SERVER;Database=DB;Trusted_Connection=yes;')
rows1 = list(FinalListModel1['caseid'])
rows2 = list(FinalListModel1['recordkey'])
rows3 = list(FinalListModel1['score1'])
rows = list(zip(rows1,rows2,rows3))
cursor = conn.cursor()
stm="""
DROP TABLE [DB].[dbo].[table]
CREATE TABLE [DB].[dbo].[table] (
[caseid] nvarchar(255),
[recordkey] nvarchar(255),
[score1] float
)
"""
res = cursor.execute(stm)
cursor.executemany('INSERT INTO [DB].[dbo].[table] VALUES (?, ?, ?)', rows)
conn.commit()
cursor.close()
conn.close()
Transform format of dataframe: collapse multiple columns into one
------------------------------------------------------------------------------------------------
https://stackoverflow.com/questions/28520036/how-to-collapse-columns-into-row-elements-in-pandas
Here the task is to collapse multiple columns into one, keeping the same index (called "level_1" in the result)
.. sourcecode:: python
df = pd.DataFrame(np.random.rand(4,5), columns = list('abcde'))
df.head()
a b c d e
0 0.682871 0.287474 0.896795 0.043722 0.629443
1 0.456231 0.158333 0.796718 0.967837 0.611682
2 0.499535 0.545836 0.403043 0.465932 0.733136
3 0.553565 0.688499 0.813727 0.183788 0.631529
df.unstack().reset_index()
level_0 level_1 0
0 a 0 0.682871
1 a 1 0.456231
2 a 2 0.499535
3 a 3 0.553565
4 b 0 0.287474
5 b 1 0.158333
6 b 2 0.545836
7 b 3 0.688499
8 c 0 0.896795
9 c 1 0.796718
10 c 2 0.403043
11 c 3 0.813727
12 d 0 0.043722
....
19 e 3 0.631529
# A more convenient form could be:
df2 = df.unstack().reset_index().loc[:,['level_1',0]]
df2.columns = ['index','value']
df2.set_index('index',inplace=True)
df2
value
index
0 0.682871
1 0.456231
2 0.499535
3 0.553565
0 0.287474
1 0.158333
2 0.545836
3 0.688499
0 0.896795
1 0.796718
2 0.403043
3 0.813727
0 0.043722
...
3 0.631529
Apply function to all rows (axis=1) or to all columns (axis=0):
--------------------------------------------------------------------------------
.. sourcecode:: python
#We need a function: here it counts the number of NaN in a x object
def num_missing(x):
return sum(x.isnull())
#Applying per column:
print "Missing values per column:"
print df.apply(num_missing, axis=0) #axis=0 defines that function is to be applied on each column
#Applying per row:
print "Missing values per row:"
print df.apply(num_missing, axis=1).head() #axis=1 defines that function is to be applied on each row
See also http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html#pandas.DataFrame.apply
Note that it is also possible to pass extra arguments to the function (if it takes any) via the "args" parameter of apply, for example: df.apply(your_function, args=(2,3,4)). Here is another example:
.. sourcecode:: python
def subtract_custom_value(x, custom_value):
return x-custom_value
df.apply(subtract_custom_value, args=(5,))
See also http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html#pandas.Series.apply
Dataframe containing column of lists
------------------------------------------------
1. From 1 column of lists to several columns (explode operation)
Based on https://stackoverflow.com/questions/35491274/pandas-split-column-of-lists-into-multiple-columns
Keeping lists in a column is handy, for example when dealing with time series, or in general to hold different data formats in the same dataframe.
How to explode the lists to several columns?
Let's say we have a df like this:
.. sourcecode:: python
d1 = {'teams': [['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],
['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG']]}
df2 = pd.DataFrame(d1)
print (df2)
teams
0 [SF, NYG]
1 [SF, NYG]
2 [SF, NYG]
3 [SF, NYG]
4 [SF, NYG]
5 [SF, NYG]
6 [SF, NYG]
We can explode the column of lists in 2 columns in the same dataframe like this:
.. sourcecode:: python
df2[['team1','team2']] = pd.DataFrame(df2.teams.values.tolist(), index= df2.index)
print (df2)
teams team1 team2
0 [SF, NYG] SF NYG
1 [SF, NYG] SF NYG
2 [SF, NYG] SF NYG
3 [SF, NYG] SF NYG
4 [SF, NYG] SF NYG
5 [SF, NYG] SF NYG
6 [SF, NYG] SF NYG
We can also do the same and create a new dataframe:
.. sourcecode:: python
df3 = pd.DataFrame(df2['teams'].values.tolist(), columns=['team1','team2'])
print (df3)
team1 team2
0 SF NYG
1 SF NYG
2 SF NYG
3 SF NYG
4 SF NYG
5 SF NYG
6 SF NYG
Doing the same operation with an apply function is a bad idea, as it is very slow (it loops over rows).
For the same kind of operation in Spark there is the command "explode". See section "Revert from time series (list) format to traditional (exploded) format".
2. From several columns to 1 column of lists
How to do the inverse operation in Pandas, i.e. making a column of lists from several columns? In Spark I know how to do it (see subsection "Create time series format from row time series").
In pandas a simple apply function can do it (although might be slow):
.. sourcecode:: python
df = pd.DataFrame({'a': [1, 2, 3],
'b': [4, 5, 6]})
df.head()
a b
0 1 4
1 2 5
2 3 6
df['ab'] = df[['a', 'b']].apply(lambda x: list(x), axis = 1)
df.head()
a b ab
0 1 4 [1, 4]
1 2 5 [2, 5]
2 3 6 [3, 6]
Note that there is a MUCH faster way (try %timeit), since apply is a slow function:
df['ab'] = [[a,b] for a,b in zip(df['a'], df['b'])]
The problem is that this syntax is not as flexible (it does not scale to a long list of columns); see the sketch just below.
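A compact alternative that stays fast and works for an arbitrary list of columns is to go through the underlying numpy values (a sketch; cols can be any list of column names):
.. sourcecode:: python
cols = ['a', 'b']                    # any list of columns
df['ab'] = df[cols].values.tolist()  # list of row-wise lists, much faster than apply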
A better way of doing this (also suited to multiple columns at once). Very useful, as I often need to do such operations to convert events into time series:
.. sourcecode:: python
# https://stackoverflow.com/questions/40357671/apply-list-function-on-multiple-columns-pandas
# One df with 2 columns. We want to create a column with lists of B based on column A groups
df = pd.DataFrame({'A': [1,1,2,2,2,2,3],'B':['a','b','c','d','e','f','g']})
df = df.groupby('A').agg({'B': lambda x: list(x)})
print (df)
B
A
1 [a, b]
2 [c, d, e, f]
3 [g]
Exploding a dataframe of lists of items (with ID column) into exploded ID-item column
--------------------------------------------------------------------------------------------------------
From https://towardsdatascience.com/pandas-tips-i-wish-i-knew-before-ef4ea6a39e1a
Let’s create a DataFrame with a column that has a random number of elements in lists:
.. sourcecode:: python
import random
n = 10
df = pd.DataFrame(
    {
        "list_col": [[random.randint(0, 10) for _ in range(random.randint(3, 5))] for _ in range(n)],
    }
)
df.shape #(10, 1) output
list_col
0 [0, 8, 4, 10]
1 [0, 9, 9, 7]
2 [7, 1, 0, 9, 6]
3 [1, 3, 7]
4 [1, 0, 1]
Now, let’s execute the explode function.
.. sourcecode:: python
df = df.explode("list_col")
df.shape #(40, 1) output
list_col
0 0
0 8
0 4
0 10
1 0
1 9
1 9
1 7
Group by operations in Pandas
------------------------------------------------
For a dataframe df with column ID, we can create a group by ID and count like this:
.. sourcecode:: python
df.groupby(['ID']).size().reset_index(name='count')
#Or equivalently:
df.groupby(['ID']).size().rename('count').reset_index()
Where the rename just gives a name to the new column created (the count) and the reset_index gives a dataframe shape to the grouped object.
Multiple aggregation on groups:
.. sourcecode:: python
#Here if we want to aggregate on several standard methods, like sum and max:
df.groupby(['ID'])[['age','height']].agg(['max','sum'])
#We can also aggregate using a user-defined function:
def data_range(series):
return series.max() - series.min()
df.groupby(['ID'])[['age','height']].agg(data_range)
#We can also name the aggregates; in recent pandas versions this is done with named aggregation:
df.groupby(['ID']).agg(my_sum=('age','sum'), my_range=('age',data_range))
In the case we want to make counts of the biggest groups in a dataframe:
.. sourcecode:: python
#If we want to group by only one feature, "ID" and see which are biggest groups, then the simplest is:
df['ID'].value_counts()
#Equivalently (same result), we can use:
df[['ID']].groupby(['ID']).size().sort_values(ascending=False)
#or: df[['ID']].groupby(['ID']).size().reset_index(name="count").sort_values("count",ascending=False) for a df with named column
.. figure:: Images/Groupby0.png
:scale: 70 %
:alt: map to buried treasure
.. sourcecode:: python
#Equivalently (same result but with named "count" column), we can use:
df[['ID']].groupby(['ID']).size().reset_index(name="count").sort_values("count",ascending=False)
In the case where we want several features to be grouped, the second method above is appropriate:
.. sourcecode:: python
#Equivalently (same result), we can use:
df[['ID','merchant','Target2']].groupby(['ID','merchant','Target2']).size().sort_values(ascending=False)
#This produces the series at left, in the following figure.
#An equivalent way outputs the same info but as a dataframe (with named new column), not a pandas series:
df[['ID','merchant','Target2']].groupby(['ID','merchant','Target2']).size().reset_index(name='count').sort_values(['count'],ascending=False)
.. figure:: Images/Groupby1.png
:scale: 70 %
:alt: map to buried treasure
Now the case where we want to extract N rows randomly per group: let's say we have a dataframe and group by a key "b":
.. sourcecode:: python
df = pd.DataFrame({'a': [1,2,3,4,5,6,7,8,9,10,11], 'b': [1,1,1,0,0,0,0,2,2,2,2]})
df.head(20)
#There are 2 ways to do it:
#slower, but ouptut sorted by key:
df.groupby('b', group_keys=False).apply(pd.DataFrame.sample, n=2).head(20)
#much faster, just output not sorted by key:
df.sample(frac=1).groupby('b').head(2)
Ranking inside groups
-----------------------------------------------------
Let's say you want to rank data grouped by some columns: (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.rank.html )
We start from some dataframe:
.. sourcecode:: python
caseid merchant time
0 1 a 1
1 1 a 2
2 1 a 3
3 2 b 1
4 2 b 2
5 2 c 1
.. sourcecode:: python
df['rank'] = df.groupby(['caseid','merchant'])['time'].rank(ascending=False).astype(int)
#Result:
caseid merchant time rank
0 1 a 1 3
1 1 a 2 2
2 1 a 3 1
3 2 b 1 2
4 2 b 2 1
5 2 c 1 1
Apply vs transform operations on groupby objects
-----------------------------------------------------
Investigate here: https://stackoverflow.com/questions/27517425/apply-vs-transform-on-a-group-object
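A minimal sketch of the difference (my own toy example, not taken from that post): apply/agg on a group returns one value per group, while transform broadcasts the result back to the original shape, which is what you need to build window-like columns.
.. sourcecode:: python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1, 2, 10]})

# apply (here acting as an aggregation): one row per group
print(df.groupby('key')['val'].apply(lambda s: s.sum()))
# key
# a     3
# b    10

# transform: same length as the original df, so it can be assigned as a new column
df['group_sum'] = df.groupby('key')['val'].transform('sum')
print(df)
#   key  val  group_sum
# 0   a    1          3
# 1   a    2          3
# 2   b   10         10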
Comparison SQL-Pandas
------------------------------
An EXCELLENT post comparing Pandas and SQL is here: https://codeburst.io/how-to-rewrite-your-sql-queries-in-pandas-and-more-149d341fc53e
SQL-like WINDOW functions... how to do them in Pandas?
Here are good examples of SQL window functions and their Pandas equivalents.
A first SQL query:
.. sourcecode:: python
SELECT state_name,
state_population,
SUM(state_population)
OVER() AS national_population
FROM population
ORDER BY state_name
Pandas:
.. sourcecode:: python
df.assign(national_population=df.state_population.sum()).sort_values('state_name')
A second SQL query:
.. sourcecode:: python
SELECT state_name,
state_population,
region,
SUM(state_population)
OVER(PARTITION BY region) AS regional_population
FROM population
ORDER BY state_name
Pandas: (here on ONE COLUMN! the "state_population")
.. sourcecode:: python
df.assign(regional_population=df.groupby('region')['state_population'].transform('sum')).sort_values('state_name')
Example of computing the cumulative sum of a quantity over 2 groups:
.. sourcecode:: python
df = pd.DataFrame({'col1' : ['a','a','b','b','a'],
'col2' : ['2013/01/03 00:00:00', '2013/03/05 09:43:31', '2013/03/07 00:00:00',\
'2013/03/07 00:00:00', '2013/03/07 00:00:00'],
'col3' : [1,3,1,2,0]})
df = df.sort_values(['col1','col2'])
col1 col2 col3
0 a 2013/01/03 00:00:00 1
1 a 2013/03/05 09:43:31 3
4 a 2013/03/07 00:00:00 0
2 b 2013/03/07 00:00:00 1
3 b 2013/03/07 00:00:00 2
df = df.assign(cumsum_col3=df.groupby('col1')['col3'].transform('cumsum')).sort_values('col1')
col1 col2 col3 cumsum_col3
0 a 2013/01/03 00:00:00 1 1
1 a 2013/03/05 09:43:31 3 4
4 a 2013/03/07 00:00:00 0 4
2 b 2013/03/07 00:00:00 1 1
3 b 2013/03/07 00:00:00 2 3
In spark it would have been:
.. sourcecode:: python
df = pd.DataFrame({'col1' : ['a','a','b','b','a'],
'col2' : ['2013/01/03 00:00:00', '2013/03/05 09:43:31', '2013/03/07 00:00:00',\
'2013/03/07 00:00:00', '2013/03/07 00:00:00'],
'col3' : [1,3,1,2,0]})
df = df.sort_values(['col1','col2'])
dff = sqlContext.createDataFrame( df )
dff.show()
+----+-------------------+----+
|col1| col2|col3|
+----+-------------------+----+
| a|2013/01/03 00:00:00| 1|
| a|2013/03/05 09:43:31| 3|
| b|2013/03/07 00:00:00| 1|
| b|2013/03/07 00:00:00| 2|
| a|2013/03/07 00:00:00| 0|
+----+-------------------+----+
from pyspark.sql.window import Window
from pyspark.sql.functions import asc, sum  # note: this shadows the builtin sum
window = Window.partitionBy('col1').orderBy(asc('col1'),asc('col2'))
dff=dff.withColumn('cumsum_col3', sum('col3').over(window))
dff.orderBy(asc('col1'),asc('col2')).show()
+----+-------------------+----+-----------+
|col1| col2|col3|cumsum_col3|
+----+-------------------+----+-----------+
| a|2013/01/03 00:00:00| 1| 1|
| a|2013/03/05 09:43:31| 3| 4|
| a|2013/03/07 00:00:00| 0| 4|
| b|2013/03/07 00:00:00| 1| 3|
| b|2013/03/07 00:00:00| 2| 3|
+----+-------------------+----+-----------+
In general, comparison between simple SQL and Pandas operations: http://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html
A simple selection for a few different id, in SQL:
.. sourcecode:: python
SELECT KNID,CREATIONDATE,CREDIT_SCORE,produkt_count,customer_since
FROM table
WHERE KNID in('0706741860','2805843406','2002821926','0711691685','0411713083')
And with pandas:
.. sourcecode:: python
knid_list = ['0706741860','2805843406','2002821926','0711691685','0411713083']
for i,item in enumerate(knid_list):
if i==0: filter_knids = (data['KNID']==item)
if i>0 : filter_knids = (data['KNID']==item)|filter_knids
data.loc[filter_knids,['KNID','CREATIONDATE','CREDIT_SCORE','produkt_count','customer_since']]
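A simpler, more idiomatic way to express the same filter is :py:func:`isin`, the standard pandas counterpart of the SQL ``IN`` clause:
.. sourcecode:: python
knid_list = ['0706741860','2805843406','2002821926','0711691685','0411713083']
# Boolean mask: True for rows whose KNID is in the list
data.loc[data['KNID'].isin(knid_list), ['KNID','CREATIONDATE','CREDIT_SCORE','produkt_count','customer_since']]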
Merging and Concatenation operations
---------------------------------------------------
In Pandas, all types of merging operations (the "join" in SQL) are done using the :py:func:`merge` command (see http://pandas.pydata.org/pandas-docs/stable/merging.html ):
.. sourcecode:: python
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=True,
suffixes=('_x', '_y'), copy=True, indicator=False)
Note: if you need to merge 2 dataframes using several columns at the same time, it is possible:
.. sourcecode:: python
new_df = pd.merge(A_df, B_df, how='inner', left_on=['A_c1','c2'], right_on = ['B_c1','c2'])
Here is an excellent comparison between SQL and Pandas: http://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html#compare-with-sql-join
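The heading also mentions concatenation; here is a minimal sketch of :py:func:`concat` (standard pandas API, made-up dataframes for illustration):
.. sourcecode:: python
import pandas as pd
df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [5, 6], 'b': [7, 8]})
# Stack the rows (axis=0); ignore_index rebuilds a clean 0..n-1 index
pd.concat([df1, df2], axis=0, ignore_index=True)
# Put the dataframes side by side (axis=1), aligning on the index
pd.concat([df1, df2], axis=1)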
Pivot operations
---------------------------------
The pivot operation reshapes the dataframe: the values of one column become the new column labels. Let's say we have some data as a time series, for different customers A, B, C...:
.. sourcecode:: python
import pandas.util.testing as tm; tm.N = 3  # note: pandas.util.testing is deprecated (removed in pandas 2.x)
def unpivot(frame):
N, K = frame.shape
data = {'balance' : frame.values.ravel('F'),
'customer' : np.asarray(frame.columns).repeat(N),
'date' : np.tile(np.asarray(frame.index), K)}
return pd.DataFrame(data, columns=['date', 'customer', 'balance'])
df = unpivot(tm.makeTimeDataFrame())
.. figure:: Images/pivot_table1.png
:scale: 100 %
:alt: output
.. sourcecode:: python
df_pivot = df.pivot(index='date', columns='customer', values='balance')
.. figure:: Images/pivot_table2.png
:scale: 100 %
:alt: output
Melting operation
---------------------------------
The melt operation unpivots the dataframe from wide to long format. Let's say we have this df:
.. sourcecode:: python
df = pd.DataFrame([[2, 4, 7, 8, 1, 3, 2013], [9, 2, 4, 5, 5, 6, 2014]], columns=['Amy', 'Bob', 'Carl', 'Chris', 'Ben', 'Other', 'Year'])
df
.. figure:: Images/PandasMelt1.png
:scale: 100 %
:alt: Pandas Melt
Now we want to reorganize the df so that we have one column "Year" and one column "Name", which contains all the names. We then expect a third column containing the values:
.. sourcecode:: python
df_melt = pd.melt(df, id_vars=['Year'], var_name='Name') #value_name='bidule' if we want to change the name of the value column.
df_melt
.. figure:: Images/PandasMelt2.png
:scale: 100 %
:alt: Pandas Melt
Pandas Cheatsheet
------------------
.. figure:: Cheatsheets/Python_Pandas_Cheat_Sheet_2.png
:scale: 100 %
:alt: map to buried treasure
This Cheatsheet is taken from DataCamp.
Also have a look at the cookbook: http://pandas.pydata.org/pandas-docs/stable/cookbook.html
Assigning values to a dataframe
---------------------------------------------
We have a dataframe df with columns A and B, and want to assign values to a new column ln_A:
.. sourcecode:: python
df = pd.DataFrame({'A': range(1, 6), 'B': np.random.randn(5)})
df
A B
0 1 0.846677
1 2 0.749287
2 3 -0.236784
3 4 0.004051
4 5 0.360944
df = df.assign(ln_A = lambda x: np.log(x.A))
df
A B ln_A
0 1 0.846677 0.00
1 2 0.749287 0.693
2 3 -0.236784 1.098
3 4 0.004051 1.386
4 5 0.360944 1.609
#We can also do like this to assign to a whole column:
newcol = np.log(df['B'])
df = df.assign(ln_B=newcol)
df
A B ln_A ln_B
0 1 0.846677 0.00 -0.166
1 2 0.749287 0.693 -0.288
2 3 -0.236784 1.098 NaN
3 4 0.004051 1.386 -5.508
4 5 0.360944 1.609 -1.019
#Of course, assignment to a whole column is better done using the simpler command: df['ln_B2'] = np.log(df['B'])
#But the assign command is powerful because it allows the use of lambda functions.
#Also, user-defined functions can be applied, using assign:
def function_me(row):
if row['A'] != 2:
rest = 5
return rest
else:
rest = 2
return rest
df = df.assign(bidon=df.apply(function_me, axis=1))
df
A B ln_A ln_B bidon
0 1 0.846677 0.00 -0.166 5
1 2 0.749287 0.693 -0.288 2
2 3 -0.236784 1.098 NaN 5
3 4 0.004051 1.386 -5.508 5
4 5 0.360944 1.609 -1.019 5
Assigning using a function (with use of the .apply method of dataframes):
.. sourcecode:: python
#Let's say we have a dataframe with a column "credit_score"; you want to encode it using your own rules:
df = pd.DataFrame(['c-1','c-3','c-2'],columns=['credit_score'])
def set_target(row):
if row['credit_score'] =='c-1' :
return 0
elif row['credit_score'] =='c-2' :
return 1
elif row['credit_score'] =='c-3' :
return 2
else:
return 99
#Creating new variable called "Target"
df = df.assign(credit_score_encoded=df.apply(set_target, axis=1))
df
credit_score credit_score_encoded
0 c-1 0
1 c-3 2
2 c-2 1
Percentiles - quantiles in Pandas
--------------------------------------------
For example, to get the 5% percentile and the 95% percentile of a dataframe (for all columns, here columns are "2015" and "2016"), we can do:
.. sourcecode:: python
df.quantile([0.05,0.95])
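Relatedly, :py:func:`describe` can report arbitrary percentiles alongside the usual summary statistics (standard pandas API; the dataframe below is made up for illustration):
.. sourcecode:: python
import numpy as np
import pandas as pd
df = pd.DataFrame({'2015': np.random.randn(1000), '2016': np.random.randn(1000)})
# Same 5% and 95% quantiles as above, embedded in the full summary table
df.describe(percentiles=[0.05, 0.95])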
Saving a Pandas dataframe to the LIBSVM file format and back
----------------------------------------------------------------
The ``LIBSVM`` file format is often used in Spark (especially <=1.6).
.. sourcecode:: python
import pandas as pd
import numpy as np
from sklearn.datasets import dump_svmlight_file
df = pd.DataFrame()
df['Id'] = np.arange(10)
df['F1'] = np.random.rand(10,)
df['F2'] = np.random.rand(10,)
df['Target'] = np.random.randint(2,size=10) #map(lambda x: -1 if x < 0.5 else 1, np.random.rand(10,))
X = df[np.setdiff1d(df.columns,['Id','Target'])]
y = df.Target
dump_svmlight_file(X,y,'smvlight.dat',zero_based=True,multilabel=False)
#Now reading an SVMLight file into (almost) a pandas object:
from sklearn.datasets import load_svmlight_file
data = load_svmlight_file('smvlight.dat')
XX,yy = data[0],data[1]
Note: we may also load two (or more) datasets at once with ``load_svmlight_files`` (note the plural):
.. sourcecode:: python
from sklearn.datasets import load_svmlight_files
X_train, y_train, X_test, y_test = load_svmlight_files( ("/path/to/train_dataset.txt", "/path/to/test_dataset.txt") )
Check that 2 dataframes are equal
---------------------------------------------
...and if not what differs between them:
.. sourcecode:: python
def dataframes_comparison_tool(d1,d2):
df1 = d1.copy()
df2 = d2.copy()
df1 = df1.fillna(0)
df2 = df2.fillna(0)
ne_stacked = (df1 != df2).stack()
changed = ne_stacked[ne_stacked]
difference_locations = np.where(df1 != df2)
changed_from = df1.values[difference_locations]
changed_to = df2.values[difference_locations]
return pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
dataframes_comparison_tool(result,dask_result)
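Note that recent pandas versions (1.1+) ship a built-in :py:func:`DataFrame.compare` that does much the same job for two identically-labelled dataframes:
.. sourcecode:: python
# Shows the 'self' vs 'other' value for every cell that differs
result.compare(dask_result)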
Pandas and memory
--------------------------------------
.. sourcecode:: python
#lists all dataframes in memory
alldfs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)]
print(alldfs) # df1, df2
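To see how much memory a single dataframe takes, :py:func:`DataFrame.memory_usage` and :py:func:`DataFrame.info` are the standard tools (here ``df`` stands for any dataframe):
.. sourcecode:: python
# Per-column memory in bytes; deep=True also counts the Python objects inside 'object' columns
df.memory_usage(deep=True)
# Summary including the total memory footprint
df.info(memory_usage='deep')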
Cutting a dataframe into train-test-validation sets
--------------------------------------------------------------------------
.. sourcecode:: python
def train_validate_test_split(df, train_percent=.6, validate_percent=.2, seed=None):
np.random.seed(seed)
perm = np.random.permutation(df.index)
m = len(df.index)
train_end = int(train_percent * m)
validate_end = int(validate_percent * m) + train_end
train = df.iloc[perm[:train_end]]
validate = df.iloc[perm[train_end:validate_end]]
test = df.iloc[perm[validate_end:]]
return train, validate, test
np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))
train, validate, test = train_validate_test_split(df,train_percent=0.6,validate_percent=0.2) #if validate_percent=0, then test is simply the complement of the train set.
Useful plots
===============
The swarm (bee swarm) plot of seaborn
--------------------------------------
.. sourcecode:: python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
iris = load_iris()
df = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
columns= iris['feature_names'] + ['species'])
# Create bee swarm plot with Seaborn's default settings
sns.swarmplot(x='species',y='petal length (cm)',data=df)
plt.xlabel('species')
plt.ylabel('length')
plt.show()
.. figure:: Images/Swarbee_plot.png
:scale: 100 %
:alt: map to buried treasure
This plot is taken from DataCamp.
Computation of PDF AND CDF plots (having only PDF)
--------------------------------------------------------------------
The underlying data is not shown here, but it is roughly a dataframe with a column df['fraction'] holding the PDF. We want a multi-panel figure with both the PDF and the CDF.
.. sourcecode:: python
# This formats the plots such that they appear on separate rows
fig, axes = plt.subplots(nrows=2, ncols=1)
# Plot the PDF (on older matplotlib/pandas versions, use normed=True instead of density=True)
df.fraction.plot(ax=axes[0], kind='hist', bins=30, density=True, range=(0,.3))
# Plot the CDF on the second row
df.fraction.plot(ax=axes[1], kind='hist', bins=30, density=True, cumulative=True, range=(0,.3))
plt.show()
And the output is:
.. figure:: Images/PDF_CDF.png
:scale: 100 %
:alt: map to buried treasure
This plot is taken from DataCamp.
Matplotlib: main functions
--------------------------------
.. sourcecode:: python
fig.savefig('2016.png', dpi=600, bbox_inches='tight')
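For context, here is a minimal sketch of the usual figure/axes workflow that produces the ``fig`` object being saved (standard matplotlib API):
.. sourcecode:: python
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 2 * np.pi, 100)
fig, ax = plt.subplots(figsize=(6, 4))    # one figure with a single axes
ax.plot(x, np.sin(x), label='sin(x)')     # simple line plot
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.legend()
fig.savefig('sine.png', dpi=600, bbox_inches='tight')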
Saving objects in Python
--------------------------------
Here are the functions for saving objects (using pickle; under Python 2 the faster cPickle could be used, but it is not always available) and compressing them (using gzip):
.. sourcecode:: python
import gzip
import pickle
def save(myObject, filename):
'''
Save an object to a compressed disk file.
Works well with huge objects.
'''
#import cPickle #(not always installed)
#file = gzip.GzipFile(filename, 'wb')
#cPickle.dump(myObject, file, protocol = -1)
#file.close()
#store the object
#myObject = {'a':'blah','b':range(10)}
file = gzip.open(filename,'wb') #ex: 'testPickleFile.pklz'
pickle.dump(myObject,file)
file.close()
def load(filename):
'''
Loads a compressed object from disk
'''
#file = gzip.GzipFile(filename, 'rb')
#myObject = cPickle.load(file)
#file.close()
#return myObject
#restore the object
file = gzip.open(filename,'rb') #ex: 'testPickleFile.pklz'
myObject = pickle.load(file)
file.close()
return myObject
And we can use them like this:
.. sourcecode:: python
myObject = {'a':'blah','b':range(10)}
#store the object
save(myObject,'bidule.pklz')
#restore the object
myNewObject = load('bidule.pklz')
print( myObject )
print( myNewObject )
Dask, or parallel Pandas
=====================================
Links:
- Cheatsheet: http://docs.dask.org/en/latest/_downloads/daskcheatsheet.pdf
- Dask general documentation: http://docs.dask.org/en/latest/dataframe.html
- Intro: https://towardsdatascience.com/how-i-learned-to-love-parallelized-applies-with-python-pandas-dask-and-numba-f06b0b367138
- Intro: https://sigdelta.com/blog/dask-introduction/
- On a cluster of several machines: http://matthewrocklin.com/blog/work/2017/01/12/dask-dataframes
- Dask overview video (16 minutes): https://www.youtube.com/watch?v=ods97a5Pzw0
- Detailed Dask overview video (40 minutes): https://www.youtube.com/watch?v=mjQ7tCQxYFQ
- Parallelizing sklearn: https://github.com/dask/dask-examples/blob/master/machine-learning.ipynb
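As a quick taste, a minimal sketch using the public dask.dataframe API (column names are made up for illustration):
.. sourcecode:: python
import pandas as pd
import dask.dataframe as dd
# Wrap an existing pandas dataframe into a Dask dataframe split into 4 partitions
pdf = pd.DataFrame({'merchant': ['a', 'b', 'a', 'c'] * 250, 'amount': range(1000)})
ddf = dd.from_pandas(pdf, npartitions=4)
# Operations are lazy; .compute() triggers the (parallel) execution and returns a pandas object
result = ddf.groupby('merchant')['amount'].sum().compute()
print(result)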
Other package: swifter:
- https://github.com/jmcarpenter2/swifter
- https://medium.com/@jmcarpenter2/swiftapply-automatically-efficient-pandas-apply-operations-50e1058909f9
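A minimal sketch of the swifter usage pattern (it wraps the pandas ``apply`` and decides automatically whether to vectorize or parallelize):
.. sourcecode:: python
import pandas as pd
import swifter  # registers the .swifter accessor on pandas objects
df = pd.DataFrame({'x': range(10000)})
# Same semantics as df['x'].apply(...), potentially much faster on large data
df['y'] = df['x'].swifter.apply(lambda v: v ** 2)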
Python APIs
=============================================
Flask
---------------------------------------------
Flask fast tutorial: https://flask.palletsprojects.com/en/1.1.x/quickstart/
One nice example of data in and result output: https://pythonbasics.org/flask-template-data/
.. sourcecode:: python
from flask import Flask, render_template, request
app = Flask(__name__, template_folder='templates')
@app.route('/')
def student():
return render_template('student.html')
@app.route('/result', methods=['POST', 'GET'])
def result():
if request.method == 'POST':
result = request.form
return render_template("result.html", result=result)
if __name__ == '__main__':
app.run(debug=True)
Then put these templates into a "templates" folder in the project repo. The two snippets below are minimal sketches in the spirit of the linked tutorial (field names are placeholders): student.html is a form that POSTs to /result, and result.html renders the submitted fields in a table.
student.html:
.. sourcecode:: html
<form action="/result" method="POST">
<p>Name: <input type="text" name="Name" /></p>
<p><input type="submit" value="submit" /></p>
</form>
result.html:
.. sourcecode:: html
<table border="1">
{% for key, value in result.items() %}
<tr><th>{{ key }}</th><td>{{ value }}</td></tr>
{% endfor %}
</table>
Then, to launch the app, run the following (the Flask CLI looks for app.py or wsgi.py by default; otherwise set the FLASK_APP environment variable first):
.. sourcecode:: python
python -m flask run
Examples of deployment of a flask app using Azure DevOps:
- https://docs.microsoft.com/en-us/azure/devops/pipelines/ecosystems/python-webapp?view=azure-devops
- https://elevate-org.com/2019/10/15/build-devops-ci-cd-pipeline-for-python-flask-with-azure-devops/
Streamlit
----------------------------------------------
https://www.geeksforgeeks.org/deploy-a-machine-learning-model-using-streamlit-library/
Streamlit cheatsheet: https://share.streamlit.io/daniellewisdl/streamlit-cheat-sheet/app.py
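A minimal sketch of a Streamlit app (save it as e.g. ``app.py`` and launch it with ``streamlit run app.py``; the widgets used are standard Streamlit API):
.. sourcecode:: python
import streamlit as st
st.title('Demo app')
# A simple interactive widget: the whole script reruns on every change
n = st.slider('Pick a number', min_value=0, max_value=100, value=10)
st.write('The square of', n, 'is', n ** 2)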