--- title: "mmstat4" author: - name: "Sigbert Klinke" email: sigbert@hu-berlin.de date: "`r format(Sys.time(), '%d %B, %Y')`" output: html_document: toc: true toc_depth: 2 vignette: > %\VignetteIndexEntry{mmstat4} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r , echo=FALSE} suppressPackageStartupMessages(library("mmstat4")) ``` # Introduction ## Aim This package was designed with the aim of distributing educational resources for statistics courses targeted at students. Once the teaching materials have been downloaded, the primary functions of this package include: * Opening Shiny Apps, R, Python, and R Markdown files directly within RStudio. * Loading data files seamlessly. * Viewing HTML and PDF files conveniently in a web browser. With this feature, students can access various educational materials such as interactive apps, R code, data files, and other resources that can be helpful in learning statistical concepts. By providing easy access to these materials, the package aims to facilitate the learning process for students and make it more interactive and engaging. GitHub allows you to download the repository as a ZIP file. You can find the option to download under the Code button (Download ZIP) in the repository. mmstat4 works with this ZIP file, but you can also use one of your own ZIP files. In my courses, I assume that all R programs run in a freshly started R environment, meaning there are no path dependencies, and all necessary libraries are loaded within the R program. My repositories contain not only the example programs for the students but also the programs I use to create images and tables, as well as the Shiny Apps I demonstrate. ## Installation You can install `mmstat4` from CRAN using: ```{r, eval=FALSE} install.packages("mmstat4") ``` Alternatively, you can install the development version from GitHub using `devtools`: ```{r, eval=FALSE} devtools::install_github("sigbertklinke/mmstat") ``` ## Getting started A component of the package includes a small ZIP file containing educational materials. Initially, we need to instruct `mmstat4` to utilize this ZIP file instead of the larger ZIP file for my Data Analysis I and II courses. ```{r, eval=FALSE} ghget('local') ghopen("example_mcnemar.R") # open a R example file ``` `ghget` returns the key (`local`) associated with the currently active ZIP file. `ghopen` launches the example file in RStudio. To access the equivalent Python script, use: ```{r, eval=FALSE} ghopen("example_mcnemar.py") # open a Python example file ``` Note: To run Python scripts, ensure local Python installation. Scripts execute within `mmstat4.xxxx` virtual environment, created upon script run or open. User approval is crucial. Upon setup, script checks for `init_py.R` in ZIP file. If found, it's executed, often installing Python modules with `reticulate::py_install('module name')`. Shiny apps can also be launched in RStudio and run locally. ```{r, eval=FALSE} ghopen("pca_best_line/app.R") # open a Shiny app ``` Data files can be loaded with: ```{r, eval=FALSE} x <- ghload("TelefonDaten.csv") # load a data set head(x) ``` ```{r, echo=FALSE} x <- ghget("local") x <- ghload("TelefonDaten.csv") # load a data set head(x) ``` HTML and PDF files will open in the default application: ```{r, eval=FALSE} ghopen("Formelsammlung.pdf") # open a PDF file ``` # Using a ZIP file or repository ## `ghget` A ZIP file or repository can be stored locally or in the internet. A key-value approach can be used to determine the location of the source ZIP file. If no key is defined then `ghget` uses the base name of the source ZIP file as the key. ```{r, eval=FALSE} ghget(dummy="https://github.com/sigbertklinke/mmstat4.dummy/archive/refs/heads/main.zip") ``` Three repository keys are predefined: `hu.data`, `hu.stat` and `dummy`. You can retrieve them via ```{r, eval=FALSE} ghget('dummy') ghget('hu.stat') ghget('hu.data') ``` If you do not use a key, the programme will create one and return it as result. ```{r, error=TRUE} ghget(system.file("zip/mmstat4.dummy.zip", package = "mmstat4")) ghget("https://github.com/sigbertklinke/mmstat4.dummy/archive/refs/heads/main.zip") # tries https://github.com/my/github_repo/archive/refs/heads/[main|master].zip ghget("my/github_repo") # will fail # ghget() # uses 'hu.data' ``` `ghget` downloads the ZIP file, saves it to a temporary location and unpacks it. For non-temporary locations, see the FAQ. ## Full and short names for files In addition, unique short names, related to the ZIP file content, are generated from the path components. After unpacking the ZIP file, unique short names are generated for these files. ```{r} ghget('dummy') gd <- ghdecompose(ghlist(full.names=TRUE)) head(gd) ``` The file name is split into four parts. The last two parts, `minpath` and `filename`, are used to create short names: 1. the short name for `/tmp/RtmpXXXXXX/mmstat4.dummy-main/LICENSE` is `LICENSE`. There was no other file named `LICENSE` in the ZIP file. Therefore, it is sufficient to address this file in the ZIP file. 2. the short name for `/tmp/RtmpXXXXXX/mmstat4.dummy-main/data/BANK2.sav` is `data/BANK2.sav`. There is another file called `BANK2.sav` in the ZIP file, but to address it uniquely, `data/BANK2.sav` is sufficient for this file in the ZIP file (the other is `dbscan/BANK2.sav`). Currently, no check is made whether two files with identical basenames are also identical in content. ```{r} ghlist("BANK2", full.names=TRUE) # full names ghlist("BANK2") # short names ``` ## `ghopen`, `ghload`, `ghsource` The short names (or the full names) can be used to work with the files ```{r, eval=4} x <- ghload("data/BANK2.sav") # load data via rio::import ghopen("univariate/example_ecdf.R") # open file in RStudio editor ghsource("univariate/example_ecdf.R") # execute file via source ghlist("example_ecdf") # "univariate/" was unnecessary ``` ## `ghlist`, `ghquery` With `ghlist` you can get a list of unique (short) names for all files or a subset based on a regular expression `pattern` in the repository ```{r} str(ghlist()) # get all short names ghlist("\\.pdf$") # get all short names of PDF files ``` With `ghquery` you can query the list of unique (short) names for all files based on the overlap distance. ```{r} ghlist("bnk") # pattern = "bnk ghquery("bnk") # nearest string matching to "bnk" ``` ## `ghfile`, `ghpath`, `ghdecompose` `ghfile` tries to find a unique match for a given file and returns the full path. If there is no unique match, an error is returned with some possible matches. `ghdecompose` builds a data frame and decomposes the full names of the files into * `outpath` the path part which is the same for all files (basically the place where the ZIP file is extraced to), * `inpath` the path part that is not used in `minpath`, but in the ZIP file, * `minpath` the minimum path part, so that all files are uniquely addressable, * `filename` the base name of the file, and * `source` the input for shortpath. The short names for the files are built from the components `minpath` and `filename`. `ghpath` builds up the short name with various path components from a `ghdecompose` object. ```{r} ghfile('data/BANK2.sav') ghget(local=system.file("zip", "mmstat4.dummy.zip", package="mmstat4")) fnf <- ghlist(full.names=TRUE) dfn <- ghdecompose(fnf) head(dfn) head(ghpath(dfn)) ``` ## RStudio addins The package comes with two RStudio addins (see under `Addins -> MMSTAT4`): * _Open a file from a zip file_ (`ghopenAddin`), which gives access to the unzipped zip file and opens the selected file in an RStudio editor window. * _Execute a Shiny app from a zip file_ (`ghappAddin`), which extracts all directories containing Shiny apps and opens the selected app in a web browser (using the default browser). # Creating an own ZIP file ## Preparation 1: Libraries/Modules used Currently there are the following routines to support R/Python code snippets: * `pkglist` or `modlist`, which extracts all `library`/`require`/`import` calls from code snippets and returns a frequency table of the packages or and modules called. ```{r} ghget(local=system.file("zip", "mmstat4.dummy.zip", package="mmstat4")) files <- ghlist(pattern="*.R$", full.names = TRUE) cat(head(pkglist(files, repos="https://cloud.r-project.org"), 12)) ``` Note that the line for `CHAID` is commented out. The package cannot be found in CRAN, but you can install it from R-Forge. ```{r, eval=FALSE} cat(head(pkglist(files, repos=c("https://cloud.r-project.org", "http://R-Forge.R-project.org")), 12)) ``` You can add a file `init_R.R` or `init_py.R` to your ZIP file, which installs the necessary R packages or Python modules. ## Preparation 2: Scripts run independently `checkFiles` checks whether each R code snippet runs smoothly in a freshly started R. ```{r, eval=FALSE} # just check the last files from the list # Note that the R console will show more output (warnings etc.) checkFile(files, start=435) # alternatively: Rsolo ``` Three modes are available for checking a `file`: 1. `exist`: Does the source file exist? 2. `parse`: Is `parse(file)` or `python -m "file"` successful? (default) 3. `run`: Is `Rscript "file"` or `python3 "file"` successful? ## Preparation 3: Searching for (and removing) duplicate files `dupFiles` uses checksums to check whether files exist twice. ```{r} files <- ghlist(full.names = TRUE) head(dupFiles(files)) # alternatively: Rdups ``` Note: there is also an error message if the necessary libraries are not installed! ## ZIP file and access names Once you created your ZIP file you need to know under which names a specific file can be accessed. In the example we use a ZIP file which comes with the package `mmstat4`: ```{r} ghget(local=system.file("zip", "mmstat4.dummy.zip", package="mmstat4")) ghnames <- ghdecompose(ghlist(full.names=TRUE)) ghnames[58,] ``` The shortest possible name is determined by `minpath` and `filename`. But all other paths determined by `uniquepath`, `minpath` and `filename` should also work. For file number 58, the following access names are possible: * `BANK2.sav` will not work since more than one file named `BANK2.sav` in the ZIP file. * `dbscan/BANK2.sav` will work since this the shortest possible name. * `cluster/dbscan/BANK2.sav`, `data/cluster/dbscan/BANK2.sav`, and `examples/data/cluster/dbscan/BANK2.sav` will work. ```{r, error=TRUE} x1 <- ghload("BANK2.sav") x2 <- ghload("dbscan/BANK2.sav") x3 <- ghload("cluster/dbscan/BANK2.sav") x4 <- ghload("data/cluster/dbscan/BANK2.sav") x5 <- ghload("examples/data/cluster/dbscan/BANK2.sav") ``` # Frequently asked questions ## Something is not working properly. Where can I get help? Please email me at `sigbert@hu-berlin.de`. You can also try the current development version of the package from GitHub: ```{r, eval=FALSE} # install.packages("devtools") devtools::install_github("sigbertklinke/mmstat4") ``` ## Can I use a password protected ZIP file? No, this is not supported. ## How can I force a reload of a zip file? ```{r, eval=FALSE} ghget("dummy", .force=TRUE) ``` ## How can I store a zip file permanently? ```{r, eval=FALSE} ghget("dummy", .tempdir=FALSE) # install non-temporarily ghget("dummy", .tempdir="~/mmstat4") # install non-temporarily to ~/mmstat4 ghget("dummy", .tempdir=TRUE) # install again temporarily ``` Note: If a repository was installed permanently and you switch back to temporarily storage then the downloaded files will not be deleted. ## How can I find all directories with Shiny apps? ```{r, eval=FALSE} ghget("dummy", .tempdir=TRUE) ghlist(pattern="/(app|server)\\.R$") ghopen("dbscan") # open the app ``` ## How can I find all `csv` data files? ```{r} ghget("dummy", .tempdir=TRUE) ghlist(pattern="\\.csv$", ignore.case=TRUE, full.names=TRUE) # use mmstat4::ghload for importing ghlist(pattern="\\.csv$") pechstein <- ghload("pechstein.csv") str(pechstein) ``` ## What should I install to use Python scripts? For Ubuntu (Linux) install: ```{r, eval=FALSE} sudo apt-get install python3 python3-dev python3-pip python3-venv libbz2-dev ``` Note: `mmstat4` installs these Python modules `numpy`, `scipy`, `statsmodels`, `pandas`, `scikit-learn`, `matplotlib`, and `seaborn` by default. ## `init_py.R` is only called if the virtual environment is created. Can I force a new call? Yes, delete the virtual environment and recreate it ```{r, eval=FALSE} reticulate::virtualenv_remove('mmstat4') ghinstall('py', force=TRUE) ``` # Default repositories The package recognises three standard repositories: `dummy`, `hu.stat`, and `hu.data`. ```{r, echo=FALSE, message=FALSE, warnings=FALSE, results='asis'} tabl <- " | Repository | Size | ZIP file location | | :--------------- | :-------------| :--------| | `dummy` | 3 MB | `https://github.com/sigbertklinke/mmstat4.dummy/archive/refs/heads/main.zip` | | `hu.data` | 29 MB | `https://github.com/sigbertklinke/mmstat4.data/archive/refs/heads/main.zip` | | `hu.stat` | 31 MB | `https://github.com/sigbertklinke/mmstat4.stat/archive/refs/heads/main.zip` | " cat(tabl) ``` `dummy` is small subsample of `hu.stat` and `hu.data` which is intended for examples and test purposes. ## Lecture Notes Sigbert Klinke, HU Berlin ### Basic statistics I+II (in german) Mathematische Grundlagen - Einführung - Grundbegriffe - Univariate Verteilungen - Parameter univariater Verteilungen - Bivariate Verteilungen - Parameter bivariater Verteilungen - Regressionanalyse - Zeitreihenanalyse - Indexzahlen - Wahrscheinlichkeitsrechnung - Zufallsvariablen - So lügt man mit Statistik - Wichtige Verteilungsmodelle - Stichprobentheorie - Statistische Schätzverfahren - Regressionsmodell - Konfidenzintervalle - Statistische Testverfahren - Parameterische Tests - Nichtparametrische Tests ```{r, eval=FALSE} ghget("hu.stat") ghopen("Statistik.pdf") ghopen("Aufgaben.pdf") ghopen("Loesungen.pdf") ghopen("Formelsammlung.pdf") ``` ### Data analysis General - R - Basics and data generation - Test and estimation theory - Parameter of distributions - Distribution - Transformations - Robust statistics - Missing values - Subgroup analysis - Correlation and association - Multivariate graphics - Principal component analysis - Exploratory factor analysis - Reliability - Cluster analysis - Regression analysis - Linear regression - Nonparametric regression - Classification and regression trees - Neural networks ```{r, eval=FALSE} ghget("hu.data") ghopen("dataanalysis.pdf") ``` ## Lecture Notes Bernd Rönz, HU Berlin (in german) ### Computergestützte Statistik I mit SPSS 10 (2001) Einführung - Entdeckung und Identifikation von Ausreißern - Prüfung der Verteilungsform von Variablen - Parametervergleiche bei unbhängigen Stichproben - Anhänge A-D, Literaturverzeichnis, Index ```{r, eval=FALSE} ghget("hu.data") ghopen("cs1_roenz.pdf") ``` ### Computergestützte Statistik II mit SPSS 10 (2000) Vorwort - Überprüfung von Zusammenhängen - Regressionsanalyse - Reliabilitäts- und Homogenitätsanalyse von Konstrukten - Anhänge A-H, Literaturverzeichnis, Stichwortverzeichnis ```{r, eval=FALSE} ghget("hu.data") ghopen("cs2_roenz.pdf") ``` ### Generalisierte lineare Modelle mit SPSS 10 (2001) Einführung - Verallgemeinerte lineare Modelle (generalized linear models, GLM) - Modellierung binärer Daten - Das multinomiale Logit Modell - Modellierung multinomialer Daten (log-lineare Modelle) - Literaturverzeichnis, Index ```{r, eval=FALSE} ghget("hu.data") ghopen("glm_roenz.pdf") ```