Lecture notes for the course
# Computer-Guided Data Analysis
in the Bachelor's study program in Physics at Karlsruhe Institute of Technology (KIT)

U. Husemann (based on material from G. Quast, T. Ferber and U. Husemann)

Summer Semester 2024

## Foreword
These lecture notes were created for the course Computer-Guided Data Analysis (CgDA) at Karlsruhe Institute of Technology (KIT) in the summer semester 2024. They follow the content of the lectures closely, provide some additional information and are suitable both for follow-up work on the course and for self-study. The lecture notes are written as a Jupyter notebook, integrating text in Markdown with equations in $\LaTeX$ and computer code in Python.

The aim of this course is to teach central key skills for dealing with data in physics studies and beyond. These skills are applied in the physics laboratory course. The content of this course includes the basics of digital signal processing and statistical data analysis. The lectures introduce these topics and the practical exercises, which closely accompany the lectures, give participants the opportunity to apply what they have learned to relevant questions in their physics studies. The central tool of the course for the processing and graphical representation of data is the programming language Python with the program libraries NumPy and Matplotlib.

The course is a compulsory key qualification with 2 ECTS credits in the Bachelor's study program in physics at KIT and is usually taken in the second semester in preparation for the physics laboratory course in the third semester. This translation of the lecture notes into English was done using a [DeepL](https://www.deepl.com) translation as the starting point. It can be used as a starting point for the topics of signal processing and data analysis in the English-language Master's study program in Physics at KIT.

The concept of the course was largely developed by Günter Quast and further refined by changing lecturers. In the original version, the basics of the Python tool were taught "just in time" together with the topics of signal processing and data analysis. Since the winter semester 2023, the new course "Programming and Algorithms", developed by Torben Ferber, has been offered in the first semester of the Bachelor's study program in Physics. The use of Python in CgDA builds on this course.

Ulrich Husemann in May 2024

# Chapter 1: Introduction

## Course Philosophy

We have formulated the qualification objective of CgDA as follows: "Students master the basics of statistical analysis and the visualization of data and can apply these using concrete examples". In other words, in CgDA you should acquire a basic understanding of how to handle data and learn the basic tools for computer-based data analysis. You will use these tools in the physics laboratory course from the third semester onwards. In the longer term, we want you to be able to use the computer **creatively** to solve scientific problems.

These qualification goals contribute to your "data literacy" in a broader context. Data is often referred to as the "gold of the 21st century" or the "oil of the 21st century". Accordingly, data literacy is one of the key skills of the 21st century.

> **Excursus: Elements of data literacy**
> 1. establish a data culture
> 1. provide data
> 1. evaluate data
> 1. interpret results
> 1. interpret data
> 1. derive action
>
> Source: [Competence framework for data literacy (in German)](https://hochschulforumdigitalisierung.de/sites/default/files/dateien/HFD_AP_Nr_47_DALI_Kompetenzrahmen_WEB.pdf), see also [Essential Elements of Comprehensive Data Literacy](https://files.eric.ed.gov/fulltext/ED620527.pdf)

You will be the most important data literacy teacher yourself: The best way to learn it is by *trying it out yourself*. You will get an overview of the concepts and background in the lecture. In the practical exercises that accompany the lecture, you will have the opportunity to gain insights into signal processing and statistical data analysis by handling data yourself. At the same time, you will develop computer code and thus expand your programming toolbox. This often requires you to do your own research in online documentation. Generative artificial intelligence methods may also be suitable for your research. Learning in a team is explicitly encouraged!

>**Comment**: Of course, the great master detective Sherlock Holmes was already aware of the great importance of data. In the short story "A Scandal in Bohemia" from 1891, Arthur Conan Doyle puts the following words into his mouth:
>
> *"I have no data yet. It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts."*

## Computers and data in science

Computers have become an integral part of modern science. Typical computer applications in the natural and engineering sciences include

>**Automated control and regulation of experiments**
> Many modern experiments are controlled by computers. Data is often collected via a **finite state machine** (FSM), which is used to configure the experiment and trigger the start and stop of data collection. Switching power supplies on and off, opening and closing valves, monitoring or actively stabilizing temperature, commonly referred to as **slow control**, are further typical computer applications in science. Safety aspects such as an emergency shutdown are also computer-controlled. This is often summarized (also in industrial systems) under the generic term **programmable logic controller** (PLC).

>**Acquisition, storage and evaluation of measurement data**
> The measuring equipment of an experiment usually supplies electrical signals such as currents or voltages, which are digitized and processed by an electronic **data acquisition (DAQ) system**. The data acquisition system is often a special computer, e.g., based on programmable logic (field programmable gate array, FPGA). The data is then usually stored and analyzed using standard computers. Reading a scale on a pointer measuring device is practically only used in education.

>**Algebraic and numerical calculations and simulations**
> Many processes in nature and technology are so complicated that they cannot be meaningfully described using analytical calculations with paper and pencil. For complicated analytical calculations, computer algebra software can help to carry out transformations or simplifications. Computer algebra is the subject of a "sister lecture" to CgDA, which is offered in the second half of the summer semester.
>The basic physical equations can often be formulated, but not solved exactly. One example of this is the equation of motion of complicated mechanical systems such as the double pendulum. These problems are solved numerically on a computer. Software that simulates complicated physical systems numerically includes packages based on the **Monte Carlo** (MC) method. This method is based on random numbers and will be briefly introduced in chapter 2 of these lecture notes. Simulations using the **finite element method** (FEM) are based on the decomposition of the system into smaller and simpler elements, where the physical equations are easier to solve.
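
To give a first impression of the Monte Carlo idea mentioned above, the following minimal sketch estimates $\pi$ from uniformly distributed random numbers. It is only an illustration: the function name `estimate_pi`, the seed and the sample size are assumptions chosen for this example.

```python
import numpy as np

def estimate_pi(n_points, seed=1):
    """Estimate pi from the fraction of random points inside the unit quarter circle."""
    rng = np.random.default_rng(seed)   # random number generator with fixed seed
    x = rng.random(n_points)            # uniform random numbers in [0, 1)
    y = rng.random(n_points)
    inside = x**2 + y**2 < 1.0          # True for points inside the quarter circle
    # area of quarter circle / area of unit square = pi/4
    return 4.0 * np.mean(inside)

print( estimate_pi(100_000) )
```

The statistical uncertainty of such an estimate decreases only with the square root of the number of random points, which is typical for Monte Carlo methods.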

>**Documentation and communication between scientists**
> Lab books and classic telephones are a thing of the past in most laboratories. Electronic logbooks are used to document experimental setups, experimental conditions and the performance of experiments, while smartphones are used for photos and videos. Jupyter notebooks, among others, are suitable for documenting experiments; this has also been the method of choice in the physics laboratory courses at KIT for a few years now. Communication between scientists also takes place on the computer. Historically, today's [World Wide Web originated from the communication needs of physicists at CERN](https://home.cern/science/computing/where-web-was-born), the particle physics research center near Geneva, Switzerland. In large-scale international physics experiments, video conferencing was used for communication between working groups worldwide at least 20 years before the coronavirus pandemic. Today, video conferencing tools such as Zoom, Teams or Webex have become an integral part of global communication.

### Hardware, operating systems, software
The hardware, operating systems and software used in science are diverse: The hardware ranges from simple microcontrollers (e.g., [Arduino](https://www.arduino.cc)) and single-board computers (e.g., [Raspberry Pi](https://www.raspberrypi.org)) to standard PCs and highly specialized programmable electronics. In addition to the commercial operating systems Windows and macOS, the free UNIX-like operating system Linux is today's standard for scientific computing.
Over the course of time, various programming languages have been used for scientific computing. Until the 1990s, Fortran (Formula Translation) was the standard language; it was then partially replaced by the programming languages C and C++. Since the mid-2000s, Python has become established as the standard for many aspects of scientific computing, graphical representation and, above all, machine learning. For these reasons, Python is the programming tool of choice for the CgDA course.

### Interaction with the computer
There are many ways of interacting with modern computers. Initially in mainframe computers, and from the 1970s also in PCs, the **terminal** (also: console), a combination of a keyboard and screen, was the most important device for entering and displaying data. Since the 1980s, **graphical user interfaces** (GUI) have become established in addition to the terminal. As a result, the computer mouse emerged as an additional input device. Since the late 2000s, smartphones, tablets and virtual reality goggles have added control via touchscreens, gestures and voice control.

Even today, interaction with the computer via the terminal is still possible and still the method of choice for many applications in the scientific world. The command-line interface (CLI) serves as the interface, e.g. via a virtual terminal in the GUI. This allows individual commands as well as small programs (scripts) to be executed and their output to be displayed.

### Control of workflows
In science, a long and sometimes complicated sequence of work steps is often necessary to achieve a result. An essential part of "[good scientific practice](#gwp)" is the reproducibility of results (possibly even after many years). The use of the computer ensures the correct processing of the workflow and the reproducibility of the results.
>**Example: Measurement of a current-voltage characteristic**
>
><img src="img/IV.png" width=50%>
>
> 1. model prediction: For a resistor $R$, a linear dependence of the current $I$ on the applied voltage $V$ is expected based on Ohm's law: $I=V/R$. For a given voltage, the computer calculates the expected current for a known resistance - this is the physical model on which the measurement is based.
> 1. control of the experiment, data acquisition and calibration: The computer now sets a specified series of voltages at the voltage source. It then reads a value for the measured current from the ammeter. Known deviations of the measured value from the true value are corrected by calibrating the measured values if necessary.
> 1. statistical analysis: The data is now compared on the computer with the model (Ohm's law), in this case by fitting a straight line to the data. A measurement of the resistance and its statistical uncertainty is derived from the slope of the straight line. Other systematic influences may also need to be taken into account.
> 1. graphical representation: The (calibrated) measurement data and the straight line are used to generate plots or tables on the computer.

Simple work processes such as those in the example above can already be carried out "by hand" at school. On the computer, both interactive control via a GUI and automated control via a script would be conceivable. The following table summarizes the advantages and disadvantages of these methods:
| | Interactive | Automated |
|-|-|-|
| **Advantages** | ++ very intuitive | + relatively easy to learn |
| | + very easy to learn | ++ fully reproducible |
| | | + universally applicable |
| | | + simultaneous documentation of configuration and workflow |
| **Disadvantages** | - complicated workflows hardly reproducible | - training required |
| | - no documentation | |
| | - usually limited to one application | |
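
The four steps of the current-voltage example above can be sketched as an automated script. This is a minimal illustration under stated assumptions, not the procedure used in the laboratory course: the ammeter readout is replaced by simulated data, and the resistance `R_TRUE`, the measurement uncertainty `SIGMA_I` and the helper `model_current` are hypothetical names and values chosen for this example. The straight-line fit uses `np.polyfit`; fitting methods are treated properly later in the course.

```python
import numpy as np

# all numbers below are assumptions for illustration, not real measurement data
R_TRUE = 100.0     # "true" resistance in ohm
SIGMA_I = 0.5e-3   # uncertainty of each current measurement in ampere

def model_current(voltage, resistance):
    """Step 1, model prediction: Ohm's law I = V/R."""
    return voltage / resistance

# Step 2, data acquisition: a simulated ammeter readout replaces the real hardware
rng = np.random.default_rng(seed=42)
voltages = np.linspace(1.0, 10.0, 10)   # voltages set at the source, in volt
currents = model_current(voltages, R_TRUE) + rng.normal(0.0, SIGMA_I, voltages.size)

# Step 3, statistical analysis: straight-line fit I = slope*V + offset
weights = np.full(voltages.size, 1.0 / SIGMA_I)   # np.polyfit expects weights 1/sigma
(slope, offset), cov = np.polyfit(voltages, currents, 1, w=weights, cov="unscaled")
sigma_slope = np.sqrt(cov[0, 0])
R_fit = 1.0 / slope                  # the slope of I(V) is 1/R
sigma_R = sigma_slope / slope**2     # error propagation for R = 1/slope

# Step 4, representation: table of the data and the fit result
for v, i in zip(voltages, currents):
    print( f"{v:5.1f} V   {1e3 * i:7.2f} mA" )
print( f"R = {R_fit:.2f} +- {sigma_R:.2f} ohm" )
```

Because every setting (voltages, uncertainty, random seed) is written down in the script, rerunning it reproduces exactly the same result, which is the main advantage listed in the "Automated" column of the table.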

## Course structure

1. **Introduction**: Organizational issues, computers and data in science, recommendations on working environment and literature, review of Python
1. **Statistical Data Analysis I - Probabilities**: Measurement and probability, probability theory, one-dimensional and multidimensional probability distributions, covariance and correlation, error propagation
1. **Digital Signal Processing**: Digital data and data formats, fundamentals of signal processing
1. **Statistical Data Analysis II - Parameter Estimation**: Least squares method, function fitting, parameter uncertainties, numerical optimization (optional), outlook (maximum likelihood method, hypothesis testing, machine learning)

<a name="gwp"></a>
>**Excursus: honesty in studies**
>
>KIT pursues the philosophy of *research-oriented teaching*. This means, among other things, that the ethical standards of research are taught and practiced during the study programs. This includes *scientific integrity* and *good scientific practice*, as laid down by the German Research Foundation in its [Code of Conduct for Safeguarding Good Scientific Practice](https://zenodo.org/doi/10.5281/zenodo.3923601).
>
>In this course, teamwork is very much encouraged in the practical exercises, but attempts to cheat, such as copying solutions or inventing or falsifying data, are prohibited and may lead to the exclusion of candidates from further examinations. The legal framework for this is provided by the [General Regulations on Honesty in Practicals and Examinations (in German)](https://www.sle.kit.edu/downloads/AmtlicheBekanntmachungen/2007_06.pdf). The CgDA team will investigate attempts at cheating: A proven attempt to cheat will result in failing the assignment; repeated attempts to cheat will result in failing the course.
>
>In recent years, new possibilities of generative artificial intelligence (AI), e.g. ChatGPT, Microsoft Copilot or GitHub Copilot, have also opened up new cheating possibilities - we will of course also investigate these. Feel free to use generative AI for research or to summarize topics, but do the fact check yourself - you are responsible for the result, not the AI! This is what we did for translating these lecture notes into English using [DeepL](https://www.deepl.com).

### Recommendation: working environment

It is best to use your own laptop computer with the Linux, Windows or macOS operating system for the course. Your laptop should meet the following hardware requirements (as of 2024)
- **Processor**: at least Intel Core i3, AMD Ryzen 3 or Apple M1
- **RAM**: at least 4 GB, preferably 8 GB
- **Free hard disk space**: at least 50 GB

Unfortunately, at this point in time, there is no free Python environment for tablets (even with a keyboard).

You will need a working Python environment for the course. There are three straightforward ways to arrive at a working Python environment, which you already know from the course "Programming and Algorithms":
1. **Your own installation**: Own laptop with Python installation (e.g. Miniconda), development environment/editor (e.g. Visual Studio Code) and version control (git). This option is the most suitable for the course because it gives you the most flexibility and you will learn the most in the process.
1. **JupyterLab**: Web interface for Jupyter notebooks on KIT servers. This option works on any device with an internet connection, web browser and keyboard, but the web interface does not provide a good editor. Jupyter notebooks are currently used as standard in the physics laboratory course.
1. **Pool room**: Room of the KIT Department of Physics, equipped with Linux PCs on which the required software is installed. With this option you do not have to worry about the software installation, but you are dependent on free times in the pool room.

### Recommendation: literature
There is currently no textbook for this course that follows a similar philosophy and/or covers all content. There are several books or e-books on statistical data analysis in physics that are well suited to deepening the topics of probabilities and parameter estimation:
- G. Cowan: *Statistical Data Analysis*, Oxford (1997), [book](https://katalog.bibliothek.kit.edu/cgi-bin/koha/opac-detail.pl?biblionumber=177314), [e-book](https://katalog.bibliothek.kit.edu/cgi-bin/koha/opac-detail.pl?biblionumber=1132425): compact, clearly written book.
- M. Erdmann, T. Hebbeker, A. Schmidt: *Statistische Methoden in der Experimentalphysik*, Pearson (2020), [KIT Library](https://katalog.bibliothek.kit.edu/cgi-bin/koha/opac-detail.pl?biblionumber=1176256): Focus on practical aspects, modern topics, computer use. Only in German.
- G. Bohm, G. Zech: *Introduction to Statistics and Data Analysis for Physicists*, DESY eBook (2006), [e-book](https://www-library.desy.de/preparch/books/vstatmp_engl.pdf): detailed and complete.
- V. Blobel, E. Lohrmann: *Statistische und numerische Methoden der Datenanalyse*, DESY eBook (2012), [website](https://www.desy.de/~sschmitt/blobel/ebuch.html): Combining mathematical foundations and computer use (in Fortran). Only in German.
- R. J. Barlow: *Statistics: A Guide to the Use of Statistical Methods in the Physical Sciences*, Wiley (1989), [book](https://katalog.bibliothek.kit.edu/cgi-bin/koha/opac-detail.pl?biblionumber=1090191): classic of statistical methods in physics.

In addition, Günter Quast has created a series of scripts, including [Data Evaluation in the Laboratory Course](https://etpwww.etp.kit.edu/~quast/Skripte/Datenauswertung.html) and [Fitting Data with the Least Squares Method](https://etpwww.etp.kit.edu/~quast/Skripte/Chi2Method.pdf). There is also an [Introduction to Jupyter notebooks and various tutorials on Jupyter, Python and statistical data analysis](https://etpwww.etp.kit.edu/~quast/jupyter/jupyterTutorial.html). These are all available in German only.

### Recapitulation: Python

This course uses the **Python** programming language, which you already know from the "Programming and Algorithms" course in the first semester. Python has the following advantages:
- Python is a **high-level programming language** that is universally applicable and at the same time **intuitive and easy to learn**.
- Python programs can not only be used as compiled computer codes, but can also be executed (interpreted) **interactively**. Python can therefore be used as a scripting language, as in this course.
- Python supports **several common programming paradigms** such as procedural, object-oriented, data-oriented or functional programming.
- Python is available for **all common operating systems** and provides an **extensive program library** for standard functions and classes. Python is also easy to extend and there is a vast number of additional libraries.

A good working knowledge of Python is a great advantage for this course. Many of the foundations for this were laid in the "Programming and Algorithms" course. The following points are particularly relevant, also with regard to the application in the physics laboratory course (indicating chapters in the lecture "Programming and Algorithms" by Torben Ferber in the winter semester 2023/2024):
- Fundamentals: variables, functions, assignments, control structures, built-in data structures (lectures 2-4).
- Scientific computing with NumPy and SciPy: data structures, vectorization, interpolation (lectures 11-12).
- Graphical representation of data with Matplotlib (lecture 13).
- Data formats, input and output (lecture 12).
- Jupyter (lecture 13).

It is recommended to review the materials from these lectures of the course "Programming and Algorithms" in preparation for CgDA using the [slides](https://web.etp.kit.edu/~ferber/vorlesungen/programmieren_und_algorithmen/slides/) and [web pages with further materials](https://web.etp.kit.edu/~ferber/vorlesungen/programmieren_und_algorithmen/html/intro.html) (name and password: student). These materials are currently only available in German.

In addition, videos with step-by-step instructions on Jupyter, Python, NumPy and Matplotlib from the course "Einführung in das rechnergestützte Arbeiten" (ERA) by Andreas Poenicke as well as [several tutorials for Jupyter and Python](https://etpwww.etp.kit.edu/~quast/jupyter/jupyterTutorial.html) by Alexander Heidelbach and Günter Quast are available – currently only in German.



### Scientific computing with Python

Python provides extensive libraries for scientific computing. In this course, you will use the standard libraries [NumPy](https://numpy.org) and [SciPy](https://scipy.org), which you have already become familiar with in "Programming and Algorithms".

> **NumPy** is a basic library for scientific computing that provides powerful tools for dealing with multidimensional arrays (vectors, matrices, tensors). These data structures are currently (2024) the de facto standard for working with arrays in Python. Based on these data structures, NumPy provides efficiently implemented numerical methods for mathematical functions, random numbers, linear algebra, Fourier transforms, etc. In Python, NumPy is loaded with the statement ```import numpy as np```.

> The **SciPy** library is based on NumPy data structures and contains implementations of algorithms for numerical integration, interpolation, optimization, linear algebra and statistics.
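
As a small illustration of the SciPy functionality listed above, the following sketch integrates a function numerically and finds a root. The choice of functions is an assumption made for this example; `scipy.integrate.quad` and `scipy.optimize.brentq` are only two of many available routines.

```python
import numpy as np
from scipy import integrate, optimize

def gauss(x):
    """Probability density of the standard normal distribution."""
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

# numerical integration: probability content of the interval [-1, 1]
prob, abs_err = integrate.quad(gauss, -1.0, 1.0)
print( prob, abs_err )

# root finding (part of scipy.optimize): solve cos(x) = x in the interval [0, 1]
root = optimize.brentq(lambda x: np.cos(x) - x, 0.0, 1.0)
print( root )
```

Both routines accept ordinary Python functions that work on NumPy data types, which is what "based on NumPy data structures" means in practice.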

A complete "ecosystem" of programming tools has developed around NumPy and SciPy. These tools include, among others
- [Project Jupyter](https://jupyter.org) for interactive programming in several programming languages (see below),
- [Matplotlib](https://matplotlib.org) for graphical representation of data (see below),
- [SymPy](https://www.sympy.org/en/index.html) for symbolic computing in Python (similar to computer algebra systems such as Mathematica),
- [pandas](https://pandas.pydata.org) for more complex data structures and their analysis.

### Interactive programming
In addition to executing computer programs as compiled machine code or as interpreted scripts, computer algebra systems such as Mathematica use the approach of storing programs in the form of **(electronic) notebooks**. Notebooks offer the advantage of combining texts, interactively executable program code and illustrations in one electronic document. For this reason, they are also used at KIT in the physics laboratory course: The notebook contains the entire experiment in a single document, from the theoretical principles to the execution and measurement data to the evaluation. This way of working is ideal for documenting smaller projects as part of your studies and beyond. For more complex workflows, such as the analysis of particle physics data, more powerful tools are used.

The current standard for interactive programming environments is [Project Jupyter](https://jupyter.org). The project consists of the web-based user interface *JupyterLab* and the document format *Jupyter Notebook*, in which the notebooks are stored.

<img src="img/jupyterlab_en.png">

You have several options for working with Jupyter Notebooks in this course:
1. web interface to [JupyterLab server of the SCC](https://jupyter-test.scc.kit.edu/). This new service of the SCC was established with the help of the KIT Physics Department and is available in its pilot phase for this course. This approach only requires an end device with internet connection, web browser and (at least virtual) keyboard.
1. web interface to [JupyterLab instance of the Physics Department](https://jupytermachine.etp.kit.edu/). This approach also only requires an end device with an internet connection, web browser and (at least a virtual) keyboard.
1. source code editor, e.g. [Visual Studio Code](https://code.visualstudio.com) with Jupyter extension. This approach offers the most comprehensive and convenient editing functions. It requires a local Python and JupyterLab installation, e.g. via Miniconda.
1. web interface to the local Python and JupyterLab installation on your own device (see figure above).

One disadvantage of all these approaches is that it is not yet possible to work interactively on a project with several users at the same time ("collaboratively").

### Graphical representation of data
Graphical representation is an essential element in gaining and communicating scientific knowledge from data:
- In exploratory data analysis and data mining (also: descriptive statistics), unknown data can often be "combed through" graphically more easily than in arrays in order to recognize patterns or correlations.
- Statistical inference (also: inductive statistics) uses graphical illustrations to display data (often compressed) and fit models to the data (e.g. "straight line fit") in order to draw conclusions about their parameters (e.g. slope of the line).
- Graphical representations often play a central role in the dissemination and communication of scientific results. Results are documented in protocols and then published in the form of specialist articles, slide presentations, websites, social media posts, etc.

One of the qualification objectives of this course is to enable you to analyze data, compare it with models and present it graphically in a meaningful way. This qualification is relevant for physics laboratory courses, Bachelor's and Master's theses and far beyond your studies. Central questions that you should ask yourself when graphically representing data are
- What "message" should be conveyed with the graph?
- Which variables should be shown? Which value ranges are suitable?
- What type of graph is suitable for the variables and the message?

In this course, we would like to introduce you to [Matplotlib](https://matplotlib.org) as a tool for the graphical representation of data in Python and give you tips on suitable forms of representation.

> **Matplotlib** 
>
> <img src="img/matplotlib.png" width=70%>
>
> The current standard tool for visualization in Python in conjunction with NumPy and SciPy is the library [Matplotlib](https://matplotlib.org). The illustrations are (with the right settings) of very high quality and can be designed and animated interactively. They can be integrated directly into JupyterLab and saved in many different file formats. A number of other graphics packages are based on Matplotlib. You can access the functionality of Matplotlib in various ways. The most common application programming interface (API) is [`matplotlib.pyplot`](https://matplotlib.org/stable/api/pyplot_summary.html); it offers functionality in the style of the commercial platform for programming and numerical calculations [MATLAB](https://de.mathworks.com/products/matlab.html), which is widely used in the engineering sciences. In Python, this API is usually imported with ```import matplotlib.pyplot as plt```. There is also an API for object-oriented programming.
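
The two ways of accessing Matplotlib mentioned above can be compared in a short sketch. The plotted functions and labels are assumptions chosen for illustration; both variants produce equivalent figures.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0.0, 2.0 * np.pi, 100)

# pyplot style: functions act on an implicit "current" figure
plt.figure()
plt.plot(x, np.sin(x), label="sin(x)")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()

# object-oriented style: explicit figure and axes objects
fig, ax = plt.subplots()
ax.plot(x, np.cos(x), label="cos(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```

In a Jupyter notebook the figures are displayed below the cell; in a stand-alone script, a final `plt.show()` is needed. The object-oriented style is usually easier to keep track of once a figure contains several panels.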

In [None]:
# import NumPy and matplotlib
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# basic math
a = 5                   # assignment
b = 5 * 2 - 3 + 10 / 2  # basic arithmetic (the division / always returns a float)
c = 10 % 7              # modulo operation (remainder of division)
d = 2**3                # power
print( a, b, c, d )

In [None]:
# data types
e = True            # truth value (bool): True, False
f = -1234           # integer number (int): -3, 2, 1234
g = 1.234           # floating-point number (float): 1.234
h = 1.-2.j          # complex number (complex): 1.-2.j
i = 'Hello'         # string (str): 'Hello'

In [None]:
# comparison operators (return: truth value)
a == a              # equal
a != b              # not equal
a > c               # greater than
a < b               # smaller than
a >= a              # greater or equal
c <= d              # smaller or equal

In [None]:
# logical operators (return: truth value)
True and True       # logical AND
False or True       # logical OR
not False           # logical negation

In [None]:
"""control structures (indentation is mandatory!)"""

# for loop: for <variable> in <list>:
for i in [1,2,5]:
    print( i )

# while loop: while <condition>:
i = 0
while i <= 3: 
    print( i )
    i += 1 # increment i by 1

# branching: if <condition 1>: elif <condition 2>: else:
c1, c2 = False, True
if c1:
    print( "if" )
elif c2:
    print( "elif" )
else:
    print( "else" )

In [None]:
"""Python's built-in container types"""

# tuple: immutable (values cannot be changed)
a = (3.1, 'car', 1.3-1.2j)
print( a )

# list: mutable (values can be changed)
b = [3.1, 'car', 1.3-1.2j]
b[1] = 'bicycle'
print( b )

# dictionary = key-value pairs
c = {'last':'Husemann', 
     'first':'Ulrich',
     'phone':24038 }
print( c )

In [None]:
"""working with Python lists"""

# some list
a = [ 12, 3, 2, 79, -1, 346 ]

# list length
print( len( a ) )

# print element with index 4 (starting from 0)
print( a[4] )

# slice list: pick elements with indexes 1..3 
print( a[1:4] )

# sort list in place
a.sort()
print( a )

# list comprehension: create a list with squared values of all integers between 1 and 10
b = [ i**2 for i in range(1,11) ]
print( b )

In [None]:
"""example on functions: compute Fibonacci sequence"""

def fibonacci( fmin = 0, fmax = 100 ):
    """function to print the Fibonacci numbers f with fmin <= f < fmax."""

    # first two elements of sequence
    f0, f1 = 0, 1

    # compute elements below fmax using the recurrence relation f(n) = f(n-2) + f(n-1)
    while f0 < fmax:
        if f0 >= fmin:
            print( f0 )
        f0, f1 = f1, f0 + f1

    # return value: first Fibonacci number that is not smaller than fmax
    return f0

# call the function to output the Fibonacci numbers
print( "Printing Fibonacci numbers smaller than 1000:" )
f = fibonacci( fmax = 1000 )

### NumPy crash course
NumPy provides powerful data structures for homogeneous arrays, i.e., arrays whose elements are all of the same type. Wherever possible, arithmetic operations are applied element-wise to the entire NumPy array at once (vectorization) instead of in an explicit Python loop. The code required for this is already precompiled in NumPy and can therefore be executed very fast.

In [None]:
# import NumPy with short name np
import numpy as np

# many ways to initialize NumPy arrays (here: only one-dimensional arrays = vectors)
a = np.array( [-1.3, 2.2, -3.1] )   # explicit values
b = np.zeros( 3 )                   # vector of three zeros
c = np.ones( 5 )                    # vector of five ones
d = np.linspace(0.,2.,5)            # vector of five equidistant values in interval [0,2]
e = np.arange(0,5)                  # vector of integers between 0 and (excluding) 5

# apply arithmetic operations on each element of the array
f = d - c
print( f )

# apply built-in function for cosine on each element of the array (after element-wise product of d and e)
g = np.cos( d*e )
print( g )
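
The speed advantage of operating on entire arrays can be made visible with a rough timing comparison. This is a minimal sketch: the array size and the evaluated polynomial are assumptions for illustration, and the measured times depend on the computer used.

```python
import time
import numpy as np

n = 200_000
x = np.linspace(0.0, 1.0, n)

# explicit Python loop over all elements
start = time.perf_counter()
y_loop = np.empty(n)
for k in range(n):
    y_loop[k] = 3.0 * x[k]**2 + 1.0
t_loop = time.perf_counter() - start

# vectorized NumPy expression acting on the whole array
start = time.perf_counter()
y_vec = 3.0 * x**2 + 1.0
t_vec = time.perf_counter() - start

print( np.allclose(y_loop, y_vec) )   # both approaches give the same numbers
print( t_loop, t_vec )                # elapsed times in seconds
```

For more reliable timings, the IPython magic command `%timeit` in a Jupyter notebook repeats the measurement several times.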

>**Excursus: scalar data types in Python**
>
>In physics, some applications do require system-oriented programming, and therefore, details about the bits and bytes and their representation in the computer are relevant. In different computer systems, this representation is different, both in terms of the number and possibly also the arrangement of the bits: On the common x86_64 architecture (Intel, AMD), a signed integer number of the C type `int` is stored with 32 bits. The most significant bit (MSB) always contains the sign; together with the remaining 31 bits, values between $-2^{31}$ and $+2^{31}-1$ can be represented. Python's native `int`, in contrast, has no fixed size.
> NumPy uses its own data types instead of Python's native data types. These can be represented as values with a fixed number of bits (regardless of the system). Here are some examples:
>- `np.bool_`: truth values (stored as 8 bits, unlike the Python type `bool`)
>- `np.int8`: 8-bit integer (sign bit + 7 bits, $-2^7 = -128$ to $2^7-1=127$)
>- `np.uint8`: unsigned 8-bit integer (8 bits, 0 to $2^8 - 1 = 255$)
>- Analogously: `np.int16`, `np.uint16`, `np.int32`, `np.uint32`, `np.int64`, `np.uint64` etc.
>
> Alternatively, and probably more commonly, data types are named after the types of the C programming language. This naming is, however, machine-dependent: on the x86_64 architecture, for example, `np.intc` is a 32-bit integer and thus corresponds to `np.int32`.
>
> Floating point numbers (floats) are particularly important for the numerical calculation of physical quantities. Here the numerical precision is often relevant; it is determined by the number of bits for the mantissa and the exponent (remember: in the number $-3.14159\cdot 10^{12}$, the mantissa is $3.14159$ and the exponent is $12$):
>- A 32-bit floating point number is referred to as a "single precision" representation on the x86_64 architecture. As with integers, the MSB represents the sign. The following eight bits (23 to 30) are used for the exponent and the remaining 23 bits (22 to 0) for the mantissa. In NumPy, the corresponding data type is `np.float32` (corresponds to `np.single` on x86_64).
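>- For more precise calculations, floating point numbers are available in "double precision" (`np.float64`, corresponds to `np.double` on x86_64, 52-bit mantissa, 11-bit exponent) or in extended precision (`np.longdouble`, available as `np.float128` on x86_64 Linux systems; despite the name, this type typically stores an 80-bit extended-precision number padded to 128 bits rather than a true "quadruple precision" number).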
>- For complex numbers, `np.complex64` corresponds to two 32-bit floating point numbers for the real and imaginary parts.
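
The properties listed in the excursus can be inspected directly with `np.iinfo` and `np.finfo`. The following is a minimal sketch; the variable names (`small`, `wrapped`, `tiny`) are chosen here for illustration and are not part of the lecture code.

```python
import numpy as np

# value ranges of fixed-width integer types
for int_type in (np.int8, np.uint8, np.int32):
    info = np.iinfo(int_type)
    print(int_type.__name__, info.bits, info.min, info.max)

# floating point types: total bits, mantissa bits, exponent bits, machine epsilon
for float_type in (np.float32, np.float64):
    info = np.finfo(float_type)
    print(float_type.__name__, info.bits, info.nmant, info.nexp, info.eps)

# fixed-width integers wrap around silently in array arithmetic
small = np.array([127], dtype=np.int8)
wrapped = small + np.array([1], dtype=np.int8)
print(wrapped, wrapped.dtype)

# limited precision: adding a number well below machine epsilon does not change 1.0
one = np.float32(1.0)
tiny = np.finfo(np.float32).eps / np.float32(4.0)
print(one + tiny == one)
```

The wrap-around in the integer example is a common source of errors when sensor or image data are stored in small integer types; converting to a wider type before doing arithmetic avoids it.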

In [None]:
"""multidimensional data types and linear algebra"""

# import NumPy
import numpy as np

# initialize a 4x3 matrix (4 rows, 3 columns) and reshape it to 2x6
matrix1 = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
matrix_reshaped = matrix1.reshape( 2, 6 )
print( matrix_reshaped )

# transpose and multiply all elements with 1.5
matrix2 = 1.5*matrix1.transpose() 
print( matrix2 )

# matrix product of two matrices (4x3 times 3x4 gives 4x4)
matrix3 = np.dot(matrix1,matrix2)    
print( matrix3 )

# scalar product of first row of matrix1 and third column of matrix2 (result: 1x1 matrix)
scalar_product = np.dot(matrix1[:1,:],matrix2[:,2:3])
print( scalar_product )
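
Checking array shapes is often the quickest way to understand the result of a matrix operation. The following minimal sketch repeats the matrices from the cell above and compares slicing (which keeps two dimensions) with integer indexing (which drops one); the names `product`, `as_matrix` and `as_scalar` are illustrative.

```python
import numpy as np

matrix1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
matrix2 = 1.5 * matrix1.transpose()

# check the shapes before multiplying: (4, 3) times (3, 4) gives (4, 4)
print(matrix1.shape, matrix2.shape)

# for two-dimensional arrays, the @ operator is equivalent to np.dot
product = matrix1 @ matrix2
print(product.shape, np.allclose(product, np.dot(matrix1, matrix2)))

# slices such as [:1, :] keep two dimensions, so the product is a 1x1 matrix
as_matrix = np.dot(matrix1[:1, :], matrix2[:, 2:3])

# integer indices drop a dimension, so the product of two vectors is a plain scalar
as_scalar = np.dot(matrix1[0, :], matrix2[:, 2])

print(as_matrix.shape, as_scalar)
```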

In [None]:
"""random numbers in NumPy"""

# import NumPy
import numpy as np

# initialize random number generator with a seed (more on random numbers later)
rng = np.random.default_rng( seed=42 )

# draw a random number, uniformly distributed in the interval [0,1)
a = rng.random()
# the generator produces a sequence of random numbers, hence a ≠ b
b = rng.random() 
# a 3x2 matrix of random numbers can be generated in one go
c = rng.random(size=(3,2)) 

print( a, b )
print( c )

# random numbers can also be generated for other distributions (more later)
# normal/Gaussian
d = rng.normal(-2,0.5,[10]) 
# binomial
e = rng.binomial(5,0.2,[10])     
# Poisson
f = rng.poisson(5,[10,10])      

print( d, e )
print( f )
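
The role of the seed can be illustrated with a short check: generators initialized with the same seed reproduce the same sequence, which makes analyses repeatable. This is a minimal sketch; the seed value 43 and the names `rng1`, `rng2`, `rng3` are arbitrary choices for illustration.

```python
import numpy as np

# two generators initialized with the same seed produce the same sequence
rng1 = np.random.default_rng(seed=42)
rng2 = np.random.default_rng(seed=42)
sample1 = rng1.random(size=5)
sample2 = rng2.random(size=5)
print(np.array_equal(sample1, sample2))

# a generator with a different seed produces a different sequence
rng3 = np.random.default_rng(seed=43)
sample3 = rng3.random(size=5)
print(np.array_equal(sample1, sample3))

# all values lie in the half-open interval [0, 1)
print(sample1.min() >= 0.0, sample1.max() < 1.0)
```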

### Matplotlib crash course 

In [None]:
"""first steps: plotting a function"""

# import NumPy
import numpy as np

# import pyplot module from matplotlib
import matplotlib.pyplot as plt 	

# x values: 200 equidistant real numbers between -20 and 20
x = np.linspace( -20, 20, 200) 	
	
# y values: sinc function applied to the entire array of x values
# (this grid of 200 points does not contain x = 0, so there is no division by zero)
y = np.sin(x)/x 			

# plot y vs. x
plt.plot(x,y) 

# save as vector graphics in PDF format
plt.savefig('sinc.pdf')

# display on screen
plt.show() 
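
The expression `np.sin(x)/x` is undefined at `x = 0`. If the grid of x values may contain zero, one possible alternative is NumPy's built-in `np.sinc`, which is defined as $\sin(\pi t)/(\pi t)$ and returns 1 at $t = 0$. The following is a minimal sketch under that assumption; the grid with a step size of 0.2 is chosen here only so that it contains zero exactly.

```python
import numpy as np

# x values from -20 to 20 in steps of 0.2; element 100 is exactly zero
x = np.arange(-100, 101) * 0.2

# np.sinc(t) computes sin(pi*t)/(pi*t) and returns 1 at t = 0,
# so sin(x)/x corresponds to np.sinc(x/pi)
y = np.sinc(x / np.pi)

# compare with the direct formula away from x = 0
nonzero = x != 0
print(np.allclose(y[nonzero], np.sin(x[nonzero]) / x[nonzero]))

# value at x = 0 (the limit of sin(x)/x)
print(y[~nonzero])
```

The array `y` can be passed to `plt.plot(x, y)` exactly as in the cell above.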

In [None]:
"""if you are counting discrete items: use a bar chart"""

# import NumPy
import numpy as np

# import pyplot module from matplotlib
import matplotlib.pyplot as plt 	

# five bars
x = np.arange( 5 ) # integer numbers from 0 to 4
y = np.array( [ 10, 15, 4, 12, 2 ] ) # some values

# bar chart
bars = plt.bar( x, y )

# y axis label
plt.ylabel('Number of Entries')

# x axis ticks get names
plt.xticks( x,["Knives", "Forks", "Table Spoons", "Tea Spoons", "Chopsticks" ])

# save as PDF
plt.savefig("barchart.pdf") 
plt.show()

In [None]:
"""
if you are discretizing continuous values: 
use a histogram, in which each bin contains the frequency of the class
"""

# import NumPy
import numpy as np

# import pyplot module from matplotlib
import matplotlib.pyplot as plt 	

# array of 1000 Gaussian random numbers
rng = np.random.default_rng( 42 ) 
x = rng.normal( 5., 0.1, 1000 )

# histogram with fixed number of bins
# returns arrays of bin contents and bin edges (and the drawn patches)
nbins = 10 
n, bins, patches = plt.hist(x, nbins)

# axis labels with units
plt.xlabel( "$x$ (cm)" )
plt.ylabel( "Frequency")

plt.savefig( "histogram.pdf" )
plt.show()
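
The bin contents and bin edges can also be computed without drawing anything, using `np.histogram`. The following minimal sketch uses the same random numbers as the cell above; the names `counts`, `edges`, `centers` and `width` are illustrative.

```python
import numpy as np

# same 1000 Gaussian random numbers as in the histogram above
rng = np.random.default_rng(42)
x = rng.normal(5., 0.1, 1000)

# np.histogram fills the bins without drawing anything
nbins = 10
counts, edges = np.histogram(x, bins=nbins)

# nbins bins are delimited by nbins + 1 edges
print(len(counts), len(edges))

# with automatic binning, the edges span the data range, so every value is counted
print(counts.sum())

# bin centers and (constant) bin width, e.g. for plotting or further analysis
centers = 0.5 * (edges[:-1] + edges[1:])
width = edges[1] - edges[0]
print(centers, width)
```

Note that ten bins require eleven edges; this off-by-one relation is also the reason why 51 edges define 50 bins when bin edges are passed explicitly.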


In [None]:
"""histograms can also be stacked on top of each other"""

# import NumPy
import numpy as np

# import pyplot module from matplotlib
import matplotlib.pyplot as plt 	

# arrays of exponential and normal random numbers
rng = np.random.default_rng( 42 ) 
x1 = rng.exponential( scale=2.0, size=10000 )
x2 = rng.normal( loc=5., scale=0.5, size=1000 )

# stacked histograms
binedges = np.linspace( 0, 10, 51 )
n, bins, patches = plt.hist( (x1,x2), bins=binedges, histtype='barstacked' )

# axis labels
plt.xlabel( "$x$ (cm)" ) 
plt.ylabel( "Frequency" )

plt.savefig( "histogram_stack.pdf" )
plt.show()

In [None]:
"""individual measurements: display as data points with uncertainty bars"""

# import NumPy
import numpy as np

# import pyplot module from matplotlib
import matplotlib.pyplot as plt 	

# initialize random number generator
rng = np.random.default_rng( 42 )

# data points on the straight line y = 8 - x, shifted by random offsets drawn from a Gaussian distribution
N, yerr = 10, 0.5
x = np.linspace( 1, 10, N )
y = 8.-( x + rng.normal(size=N, scale=yerr))

# plot with uncertainty bars on y values and axis labels
err = plt.errorbar( x, y, yerr=yerr, fmt='bo' )
plt.xlabel('$x$')
plt.ylabel('$y$')

plt.savefig('datapoints.pdf')
plt.show()

In [None]:
"""pairs of measurements: scatter plot"""

# import NumPy
import numpy as np

# import pyplot module from matplotlib
import matplotlib.pyplot as plt 	

# initialize random number generator
rng = np.random.default_rng( 42 )

# 1000 random data points around straight line with y=8-x
N, yerr = 1000, 0.5
x = 10.*rng.uniform( size=N ) # uniformly in x
y = 8.-( x + rng.normal(size=N, scale=yerr) ) # Gaussian around y=8-x in y

# scatter plot with marker size 1
plt.scatter( x, y, s=1.0 )

plt.savefig( "scatterplot.pdf" )
plt.show()

The plots shown here are only intended to give you a first impression of how data can be displayed with Matplotlib. The Matplotlib library offers many more options for displaying data. You can find sample code in the [Matplotlib gallery](https://matplotlib.org/stable/gallery/index.html). It is best to try things out for yourself, and remember: graphics are only one tool for displaying data as meaningfully as possible.