---
# Jupyter notebook tutorial:
### Test and demonstrate usage of PhyPraKit stand-alone tools

                                                Günter Quast, July 2023



---
---
## Jupyter Notebook Fundamentals

This file of type `.ipynb` contains a tutorial as a `Jupyter notebook`.
`Jupyter` provides a browser interface with a (simple) development environment 
for *Python* programs and explanatory texts in intuitive *Markdown* format.
The input of formulas in *LaTeX* format is also supported.

A summary of the most important commands for using *Jupyter* as a working environment can be
found in the notebook
[*JupyterCheatsheet.ipynb*](https://git.scc.kit.edu/yh5078/datenanalyse/-/blob/master/jupyter/JupyterCheatsheet.ipynb)
(German).
Basics for statistical data analysis can be found in the notebooks
[*IntroStatistik.ipynb*](https://git.scc.kit.edu/yh5078/datenanalyse/-/blob/master/jupyter/IntroStatistik.ipynb)
(German) and
[*Fehlerrechnung.ipynb*](https://git.scc.kit.edu/yh5078/datenanalyse/-/blob/master/jupyter/Fehlerrechnung.ipynb) (German).

In *Jupyter*, code and text are entered into individual cells.
Active cells are indicated by a blue bar in the margin.
They can be in two states: in edit mode the input field is white, in command mode it is grayed out.
Clicking in the border area selects the command mode, clicking in the text field of a code cell
switches to edit mode.
The `esc` key can also be used to leave the edit mode.

Pressing `a` in command mode creates a new empty cell above the active cell, `b` creates one below.
Entering `dd` deletes the corresponding cell.

Cells can be either of the type `Markdown` or `Code`.
Entering `m` in command mode sets the type Markdown, entering `y` selects the type Code.

A cell is processed (its text is rendered or its code is executed) by pressing `shift+return`,
or `alt+return` if a new, empty cell should also be created.

The settings mentioned here as well as the insertion, deletion or execution of cells can also be
executed via the pull-down menu at the top.

---
---


## Test and demonstrate usage of PhyPraKit stand-alone tools

PhyPraKit provides some Python scripts that perform basic actions on data and fit models
defined in a *yaml* file. In general, no additional user code is needed.

  - *plotData*   plot data and uncertainties from file in *yaml* format
  - *plotCSV*    plot data from a file in CSV format; the German decimal ',' is replaced by '.'
  - *run_phyFit* run a fit defined in a *yaml* file
  - *csv2yml*    convert data in CSV format (e.g. MS Excel export) to *yaml* format
  - *smoothCSV*  resample data from a CSV file

The **kafe2** package also provides a stand-alone tool,

   - kafe2go     run a fit with *kafe2* from an input file in *yaml* format

Scripts are executed with the Jupyter *%run* magic command. For this to work, the
Python script must be specified with its full path or be located in the current Jupyter working directory.


---

## General remarks

The stand-alone scripts take a number of parameters on the command line. If a script is started without
any parameters, or with the option `-h`, usage help is printed. See this example:

In [None]:
%run plotCSV.py -h

--- 
## Plot data from CSV file

CSV or "comma-separated values" is a common format in data science for storing tabular data in human-readable form.

As an example, we consider the file `Wellenform.csv`; the first few lines look as follows:

```
Time,Channel A
(ms),(V)

-0.34927999,-0.00045778
-0.34799999,-0.00045778
-0.34671999,-0.00045778
-0.34543999,-0.00045778
        ...

```
The first three lines constitute the so-called "header", which contains meta-information describing the
data in each column; the numerical values that follow are the actual data.
The first line contains the so-called "keys", i.e. the names of the data entries in the respective columns,
and the second line contains the physical units of these values. The third line is empty.
The so-called "field separator" is `','` in this case.
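
The header structure described here can be sketched with the Python standard library. This is only an illustration and not the code used by *plotCSV*; the helper name `read_csv_with_header` is hypothetical, and the first lines of the example file are embedded as a string so that the cell runs on its own.

```python
import csv

# First lines of the example file, embedded to keep the sketch self-contained
csv_text = """Time,Channel A
(ms),(V)

-0.34927999,-0.00045778
-0.34799999,-0.00045778
-0.34671999,-0.00045778
-0.34543999,-0.00045778
"""

def read_csv_with_header(text, n_header=3, delimiter=","):
    """Hypothetical helper: return keys, units and data columns of a CSV text."""
    lines = text.splitlines()
    header = list(csv.reader(lines[:n_header], delimiter=delimiter))
    keys, units = header[0], header[1]  # line 1: keys, line 2: units
    # skip empty rows, then convert the remaining rows to numbers
    rows = [r for r in csv.reader(lines[n_header:], delimiter=delimiter) if r]
    # transpose the rows into one list of floats per column
    columns = [[float(v) for v in col] for col in zip(*rows)]
    return keys, units, columns

keys, units, columns = read_csv_with_header(csv_text)
for key, unit, column in zip(keys, units, columns):
    print(key, unit, column)
```

The number of header lines is passed explicitly, mirroring the role of the `-H 3` option in the command used below.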

With this information and the help of the script introduced in the previous cell, we can now plot the data,
i.e. generate a graphical representation of it. As `,` is the default field separator, the only additional
information needed is the total number of header lines, which is 3, so the command to execute is

```%run plotCSV -H 3 Wellenform.csv```.

Try it out on the following code cell:

In [None]:
%run plotCSV -H 3 Wellenform.csv

As a result, a properly labelled graph is shown. 

---

## Statistical analysis of measured data

A typical first step in data analysis consists of inspecting the frequency distribution of the measured data.
The program *plotData* contains the necessary code; it shows the distribution and calculates the mean
and standard deviation of the data. The file *simple_data.ydat*, as shown below, contains all the necessary
input and can easily be edited using the editor provided by Jupyter. Just double-click on the file name
in the file list on the left-hand side of your Jupyter window to open it. To create a new file, right-click
in the list, provide a name for the new, empty file, and open it by double-clicking.

```
  # Beispiel einer Histogramm-Darstellung
  # -------------------------------------
  type: histogram
  title: "Wiederholte Messungen von Tischhöhen"
  label: Beispieldaten
  x_label: 'Höhe h (cm)'
  y_label: 'Verteilungsdichte f(h)'

  # Daten:
  raw_data: [
  79.83,79.63,79.68,79.82,80.81,79.97,79.68,80.32,79.69,79.18,
  80.04,79.80,79.98,80.15,79.77,80.30,80.18,80.25,79.88,80.02 ]
  n_bins: 20
  bin_range: [79., 81.]
  # alternatively an array for the bin edges can be specified
  #bin_edges: [79., 79.5, 80, 80.5, 81.]

  model_label: Gauss-Verteilung
  model_density_function: |
    def normal_distribution(x, mu=79.9, sigma=0.346):
      return np.exp(-0.5 *((x-mu)/sigma)**2)/np.sqrt(2.*np.pi*sigma**2)
```

The simple command to run the example looks like this: 

```
%run plotData simple_data.ydat
```

In [None]:
%run plotData simple_data.ydat
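
For comparison with the numbers reported by *plotData*, the mean and standard deviation of the `raw_data` listed above can be computed with the standard library. This is a minimal sketch; it assumes the sample standard deviation (normalised by N-1), which may or may not be the convention used by *plotData*.

```python
import statistics

# raw_data as listed in the input file above
raw_data = [
    79.83, 79.63, 79.68, 79.82, 80.81, 79.97, 79.68, 80.32, 79.69, 79.18,
    80.04, 79.80, 79.98, 80.15, 79.77, 80.30, 80.18, 80.25, 79.88, 80.02,
]

mean = statistics.mean(raw_data)
# sample standard deviation, normalised by N-1 (an assumption, see text)
std = statistics.stdev(raw_data)
# uncertainty of the mean for N independent measurements
std_of_mean = std / len(raw_data) ** 0.5

print(f"mean = {mean:.3f} cm")
print(f"standard deviation = {std:.3f} cm")
print(f"uncertainty of the mean = {std_of_mean:.3f} cm")
```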

---

## Performing fits of analytical models to data

Fitting models to experimental data, or parametrizing measurements with a functional dependence,
is another routine task in data analysis. Two stand-alone fitting programs relying on
the *phyFit* or *kafe2* packages are provided:

 - kafe2go
 - run_phyFit

First, let us see how the interfaces are defined by running the scripts with the `-h` option (for "help"):

In [None]:
%run run_phyFit -h

In [None]:
%run kafe2go.py -h

Now, run a very simple fit of a straight line to data with only independent uncertainties in the x- and y-directions, as
specified in the file *simpleFit.yfit*. You may want to inspect the input by double-clicking on the file name.

In [None]:
%run run_phyFit simpleFit.yfit

In [None]:
%run kafe2go simpleFit.yfit

#### A more complex fit example with different types of uncertainties

In many cases, uncertainties are a bit more complex than in the previous example. We
now consider different types of uncertainties affecting the x- and/or y-directions.
These can be independent or correlated, and absolute or relative, as exemplified in the
file *test_xy.yfit*. To inspect this input file, double-click on its name in the directory
listing on the left-hand side. The file will be displayed in an editor tab, where
it is possible to change the contents and try out modifications.
Executing this example is no more complicated than the first one:

In [None]:
%run run_phyFit.py test_xy.yfit

Note that a simplified data format is used above, relying on default values for the properties of uncertainties,
which are assumed to be independent, absolute and uncorrelated if not specified otherwise.
Running *kafe2* yields the same result as *phyFit* if the option to calculate asymmetric
parameter uncertainties is chosen. This is shown here:

In [None]:
%run kafe2go --asymmetric test_xy.yfit
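
The distinction between independent and correlated, absolute and relative uncertainties made in this section can be expressed as a covariance matrix. The following sketch shows the standard textbook construction under the assumption that each correlated component is fully correlated among all data points. It is not taken from the *phyFit* or *kafe2* source code, and the helper name `covariance_matrix` and the example numbers are hypothetical.

```python
def covariance_matrix(values, independent_abs=0.0, independent_rel=0.0,
                      correlated_abs=0.0, correlated_rel=0.0):
    """Hypothetical helper: covariance matrix from four uncertainty components.

    Relative components are multiplied by the data values. Independent
    components enter only the diagonal; correlated components (assumed to be
    fully correlated among all points) fill the whole matrix.
    """
    n = len(values)
    cor_abs = [correlated_abs] * n
    cor_rel = [correlated_rel * v for v in values]
    cov = [[cor_abs[i] * cor_abs[j] + cor_rel[i] * cor_rel[j] for j in range(n)]
           for i in range(n)]
    for i, v in enumerate(values):
        # independent variances add up on the diagonal
        cov[i][i] += independent_abs ** 2 + (independent_rel * v) ** 2
    return cov

# hypothetical y values with 0.1 independent absolute and 10 % correlated relative uncertainty
y = [1.0, 2.0, 4.0]
for row in covariance_matrix(y, independent_abs=0.1, correlated_rel=0.1):
    print(["%.3f" % element for element in row])
```

Off-diagonal elements are non-zero only when a correlated component is present, which is why purely independent uncertainties can be specified as a simple list of values.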

## Fitting a model to histogram data
Fitting a model to histogram data is also possible. In this case a cost function based
on least squares is often not a good approximation, and therefore both *phyFit* and *kafe2*
use by default a negative log-likelihood function that takes the Poisson nature of
the uncertainties into account. Here are the commands to run the example with *phyFit* and *kafe2*:

In [None]:
%run run_phyFit.py hFit.yfit

In [None]:
%run kafe2go hFit.yfit
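
To make the idea of a Poisson-based cost function more concrete, here is a minimal sketch of the negative log-likelihood for a set of bin counts. It illustrates the general formula only and is not the implementation used in *phyFit* or *kafe2*; the function name `poisson_nll` and the bin contents are hypothetical.

```python
import math

def poisson_nll(observed, expected):
    """Negative log-likelihood of bin counts n_i for Poisson expectations mu_i > 0.

    Each bin contributes -ln P(n_i; mu_i) = mu_i - n_i * ln(mu_i) + ln(n_i!).
    """
    return sum(mu - n * math.log(mu) + math.lgamma(n + 1)
               for n, mu in zip(observed, expected))

# hypothetical bin contents, including an empty bin
observed = [0, 2, 5, 3, 1]
model_a = [0.5, 2.0, 4.5, 3.0, 1.0]   # expectations close to the data
model_b = [2.0, 2.0, 2.0, 2.0, 2.0]   # flat expectation

print("nlL for model_a:", poisson_nll(observed, model_a))
print("nlL for model_b:", poisson_nll(observed, model_b))
```

In contrast to a least-squares cost with uncertainties estimated from the observed counts, this expression remains well-defined for empty bins.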

--- 

## Handling data in CSV format

The CSV format (for comma- or character-separated values) is quite common in data science, and
many software packages or hardware devices export data in this format or at least support
it (including MS Excel and Leybold Cassy).

PhyPraKit provides the tool *csv2yml* to ease the conversion to the more general *yaml*
format. After converting the input data, extra lines can be added using any text editor
or, better, the editor provided as part of Jupyter notebooks.
Running the tool with the `-h` option shows all available options:

In [None]:
%run csv2yml.py -h

#### CSV Example
 
The command to convert a file with audio data looks like this:

In [None]:
%run csv2yml.py AudioData.csv

The CSV tools of PhyPraKit can also handle the output of typical Windows programs that use decimal commas
instead of the internationally used dot. To be unambiguous, the field delimiter is then `';'` and not the
usual comma. We just need to tell the tools *csv2yml*, *plotCSV* or *smoothCSV* to take this into account.
To generate a valid *yaml* block from such an input, execute:

In [None]:
%run csv2yml -d ";" Excel_output.csv
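
What the handling of decimal commas amounts to can be sketched in a few lines of standard-library Python. This is an illustration only, not the code of the PhyPraKit tools; the function name `parse_decimal_comma_csv` and the embedded file excerpt are hypothetical.

```python
import csv

# Hypothetical excerpt of a spreadsheet export with decimal commas and ';' as field delimiter
excel_text = """x;y
0,05;0,35
0,36;0,26
0,68;0,52
"""

def parse_decimal_comma_csv(text, delimiter=";", n_header=1):
    """Hypothetical helper: read keys and numeric rows, replacing ',' by '.'."""
    lines = text.splitlines()
    keys = next(csv.reader(lines[:n_header], delimiter=delimiter))
    data = []
    for row in csv.reader(lines[n_header:], delimiter=delimiter):
        if row:  # skip empty lines
            data.append([float(value.replace(",", ".")) for value in row])
    return keys, data

keys, data = parse_decimal_comma_csv(excel_text)
print(keys)
for row in data:
    print(row)
```

Splitting on `;` first and replacing the decimal comma afterwards is what keeps the two roles of the comma apart.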

Using a text editor, e.g. by creating a new, empty file by right-clicking in the directory list on the left-hand side
and double-clicking on it, the *yaml* block from the above output can be copied to a new *yaml* file. Mark the text by moving
the mouse over the respective lines with the left button pressed, then press `<ctrl>-c` to copy; activate the destination
window and area by pointing with the mouse and left-clicking, then type `<ctrl>-v`.

This file should also contain additional information, most importantly the "meta-data" giving information
on the origin of the data. The key fields may need adjusting to be compatible with *run_phyFit* or
*kafe2go*, and a fit model should be added.

A valid fit-input file for a straight-line-fit based on the data contained in the file *Excel_output.csv* looks like this:

```
x_data: [0.05, 0.36, 0.68, 0.8, 1.09, 1.46, 1.71, 1.83, 2.44, 2.09, 3.72, 4.36, 4.6]
y_data: [0.35, 0.26, 0.52, 0.44, 0.48, 0.55, 0.66, 0.48, 0.75, 0.7, 0.75, 0.8, 0.9]
y_errors: [0.06, 0.07, 0.05, 0.05, 0.07, 0.07, 0.09, 0.1, 0.11, 0.1, 0.11, 0.12, 0.1]
x_errors: 3% 

# model specification
model_label: 'line fit'
model_function: |
    def linModel(x, a=0, y0=1.):
      return y0 + a * x
```

In [None]:
%run run_phyFit.py simpleFit.yfit
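
To illustrate what a straight-line fit does with the data shown above, here is a minimal sketch of the closed-form weighted least-squares solution. It is a simplification and not the method used by *run_phyFit* or *kafe2go*: only the `y_errors` are taken into account and the relative `x_errors` are ignored, so the result will differ somewhat from that of the full fit. The function name `weighted_line_fit` is hypothetical.

```python
def weighted_line_fit(x, y, sigma_y):
    """Hypothetical helper: fit y = y0 + a*x by weighted least squares.

    Returns slope a, intercept y0 and their uncertainties; x uncertainties
    are NOT taken into account in this simplified sketch.
    """
    w = [1.0 / s ** 2 for s in sigma_y]
    S = sum(w)
    Sx = sum(wi * xi for wi, xi in zip(w, x))
    Sy = sum(wi * yi for wi, yi in zip(w, y))
    Sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    delta = S * Sxx - Sx ** 2
    a = (S * Sxy - Sx * Sy) / delta
    y0 = (Sxx * Sy - Sx * Sxy) / delta
    return a, y0, (S / delta) ** 0.5, (Sxx / delta) ** 0.5

# data from the fit-input file shown above
x_data = [0.05, 0.36, 0.68, 0.8, 1.09, 1.46, 1.71, 1.83, 2.44, 2.09, 3.72, 4.36, 4.6]
y_data = [0.35, 0.26, 0.52, 0.44, 0.48, 0.55, 0.66, 0.48, 0.75, 0.7, 0.75, 0.8, 0.9]
y_errors = [0.06, 0.07, 0.05, 0.05, 0.07, 0.07, 0.09, 0.1, 0.11, 0.1, 0.11, 0.12, 0.1]

a, y0, sigma_a, sigma_y0 = weighted_line_fit(x_data, y_data, y_errors)
print(f"a  = {a:.3f} +/- {sigma_a:.3f}")
print(f"y0 = {y0:.3f} +/- {sigma_y0:.3f}")
```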

### Fixing oversampling issues

Sometimes exported CSV data suffer from oversampling, i.e. far more values are recorded than are needed for
a meaningful analysis of the data. Fortunately, this can be fixed retrospectively using the
stand-alone tool *smoothCSV*. Execute the following line to see what it is meant to do:

In [None]:
%run smoothCSV -h

In [None]:
%run smoothCSV -H 3 -w 10 -r Wellenform.csv

The result of this action is a significantly reduced data volume of the
waveform shown in the very first example of this tutorial notebook.
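
The general idea behind reducing an oversampled data set can be sketched as averaging over consecutive blocks of values. This illustrates the technique under stated assumptions; it does not describe how *smoothCSV* is implemented or what its options mean, for which the help output above is the authoritative reference. The function name `block_average` and the sample values are hypothetical.

```python
def block_average(values, width):
    """Hypothetical helper: average consecutive, non-overlapping blocks of `width` values."""
    n_blocks = len(values) // width  # an incomplete trailing block is dropped
    return [sum(values[i * width:(i + 1) * width]) / width for i in range(n_blocks)]

samples = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
reduced = block_average(samples, 2)
print(len(samples), "->", len(reduced), "values:", reduced)
```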