| Title: | Count Words and Characters in R Markdown and Jupyter Notebooks |
|---|---|
| Description: | Computes word, character, and non-whitespace character counts in R Markdown documents and Jupyter notebooks, with or without code chunks. Returns results as a data frame. |
| Authors: | Sigbert Klinke [aut, cre] |
| Maintainer: | Sigbert Klinke <[email protected]> |
| License: | GPL-3 |
| Version: | 0.3.1 |
| Built: | 2026-05-15 06:53:58 UTC |
| Source: | https://github.com/sigbertklinke/rmdwc |
This function extracts text from specific cell types (e.g., markdown) in one or more .ipynb files
and counts the number of characters, words, and lines. It optionally excludes certain patterns (e.g., code fences).
The function uses a helper function rmdcount() to perform the counting on the extracted text.

    ipynbcount(
      files,
      celltype = c("markdown"),
      space = "[[:space:]]",
      word = "[[:space:]]+",
      line = "\n",
      exclude = "```\\{.*?```"
    )


| Argument | Description |
|---|---|
| files | character: vector of paths to .ipynb files |
| celltype | character: vector indicating which cell types to include (default is `c("markdown")`) |
| space | character: pattern to split a text at spaces (default: `"[[:space:]]"`) |
| word | character: pattern to split a text at word boundaries (default: `"[[:space:]]+"`) |
| line | character: pattern to split lines (default: `"\n"`) |
| exclude | character: pattern to exclude text parts, e.g. code chunks (default: ```` "```\\{.*?```" ````) |

This function assumes that the notebook files are valid JSON and contain a list of cells under the cells field.
It temporarily writes the extracted content to a file to reuse the rmdcount() logic.
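
The extraction step described above can be pictured with a small base R sketch. It is not the package's implementation: the notebook is a hand-built nested list standing in for parsed JSON (reading the JSON itself is skipped), and `extract_cells()` is a hypothetical helper introduced only for illustration.

```r
# A hand-built stand-in for a parsed notebook: a list with a `cells` field,
# where each cell has a type and its source lines.
notebook <- list(
  cells = list(
    list(cell_type = "markdown", source = c("# Title\n", "Some text")),
    list(cell_type = "code",     source = c("x <- 1\n", "print(x)")),
    list(cell_type = "markdown", source = "Closing remarks")
  )
)

# Hypothetical helper: collect the text of all cells of the requested types.
extract_cells <- function(nb, celltype = c("markdown")) {
  keep <- Filter(function(cell) cell$cell_type %in% celltype, nb$cells)
  texts <- vapply(keep, function(cell) paste(cell$source, collapse = ""),
                  character(1))
  paste(texts, collapse = "\n")
}

md_only <- extract_cells(notebook)
md_code <- extract_cells(notebook, celltype = c("markdown", "code"))

# Write the extracted text to a temporary file so that a file-based counter
# could be pointed at it, then read it back and clean up.
tmp <- tempfile(fileext = ".Rmd")
writeLines(md_only, tmp)
round_trip <- readLines(tmp)
unlink(tmp)

print(round_trip)
```

Going through a temporary file keeps a single counting routine for both input formats, at the cost of one extra write per notebook.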
A data frame with counts of characters, words, and lines for each file. Additional columns include file (base name) and path (directory).

    file <- system.file('ipynb/example_data_analysis.ipynb', package="rmdwc")
    ipynbcount(file) # without code
    ipynbcount(file, celltype=c("markdown", "code")) # with code

rmdcount counts lines, words, bytes, characters and non-whitespace characters in R Markdown files excluding code chunks.
txtcount counts lines, words, bytes, characters and non-whitespace characters in plain text files.
Note that the counts may differ slightly from those of Unix wc and LibreOffice,
because the result depends on how a line, a word and a character are defined.

    rmdcount(
      files = NULL,
      space = "[[:space:]]",
      word = "[[:space:]]+",
      line = "\n",
      exclude = "```\\{.*?```"
    )

    txtcount(
      files = NULL,
      space = "[[:space:]]",
      word = "[[:space:]]+",
      line = "\n"
    )


| Argument | Description |
|---|---|
| files | character: file name(s) |
| space | character: pattern to split a text at spaces (default: `"[[:space:]]"`) |
| word | character: pattern to split a text at word boundaries (default: `"[[:space:]]+"`) |
| line | character: pattern to split lines (default: `"\n"`) |
| exclude | character: pattern to exclude text parts, e.g. code chunks (default: ```` "```\\{.*?```" ````) |

We define:
Line: the number of lines. This differs from Unix wc -l, which counts the number of newline characters.
Word: one or more characters delimited by white space. However, a "word" is in general a fuzzy concept; for example, is "3.141593" a word? Different programs may therefore count differently. For more details see the discussion of the LibreOffice bug "Word count gives wrong results - Another Example", Comment 5.
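
To make the two definitions concrete, the following base R sketch contrasts the line count used here with a count of newline characters in the style of wc -l, and shows how the default word pattern treats a number. The sample strings and variable names are made up for illustration; only the patterns are taken from the defaults above.

```r
# Illustrative input: two lines, but only one newline character.
txt <- "first line\nsecond line"

# Lines as defined here: the pieces obtained by splitting at the line pattern.
n_lines <- length(strsplit(txt, "\n")[[1]])

# Newline characters, which is what wc -l reports.
n_newlines <- lengths(regmatches(txt, gregexpr("\n", txt)))

# With the default word pattern, "3.141593" is kept as a single word.
words <- strsplit("pi is 3.141593", "[[:space:]]+")[[1]]
n_words <- length(words)

print(c(lines = n_lines, newlines = n_newlines, words = n_words))
```

The text has two lines but only one newline, so the two conventions report different numbers for the same input.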
The following approach is used to detect lines, bytes, words, characters and non-whitespace characters:
lines: strsplit(rmd, line)[[1]] with line='\n'
bytes: charToRaw(rmd)
words: strsplit(rmd, word)[[1]] with word='[[:space:]]+'
characters: strsplit(rmd, '')[[1]]
non-whitespace characters: strsplit(gsub(space, '', rmd), '')[[1]] with space='[[:space:]]'
If rmdcount is used, code chunks are deleted with gsub('```\\{.*?```', '', rmd) before counting; txtcount counts the text as it is, including code chunks.
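
The steps above can be put together in a few lines of base R. This is a minimal sketch under stated assumptions, not the package's code: `count_sketch()` and the sample document are hypothetical, and the three-backtick fence is built with `strrep()` only so that the listing is easy to embed.

```r
# Three backticks, assembled programmatically.
fence <- strrep("`", 3)

# Hypothetical helper that mirrors the five counting steps listed above.
count_sketch <- function(rmd, space = "[[:space:]]", word = "[[:space:]]+",
                         line = "\n",
                         exclude = paste0(fence, "\\{.*?", fence)) {
  # Delete code chunks first when an exclusion pattern is given.
  if (nzchar(exclude)) rmd <- gsub(exclude, "", rmd)
  list(
    lines = length(strsplit(rmd, line)[[1]]),
    bytes = length(charToRaw(rmd)),
    words = length(strsplit(rmd, word)[[1]]),
    chars = length(strsplit(rmd, "")[[1]]),
    nonws = length(strsplit(gsub(space, "", rmd), "")[[1]])
  )
}

# Made-up document containing one code chunk.
doc <- paste0("One two\nthree\n", fence, "{r}\nx <- 1\n", fence, "\nfour")

without_chunks <- count_sketch(doc)                # chunk removed first
with_chunks    <- count_sketch(doc, exclude = "")  # text counted as is

str(without_chunks)
str(with_chunks)
```

For this input, removing the chunk leaves four lines and four words, while counting the text as it is gives six lines and nine words. The pattern relies on R's default regular expression engine, in which `.` also matches a newline and `*?` is non-greedy.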
A data frame with the following elements:
basename of file
number of lines
number of words
number of bytes
number of characters
number of non-whitespace characters
path of file

    # count excluding code chunks
    files <- system.file('rmarkdown/rstudio_pdf.Rmd', package="rmdwc")
    rmdcount(files)
    # count including code chunks
    txtcount(files) # or rmdcount(files, exclude='')
    # count for a set of R Markdown docs (pattern is a regular expression)
    files <- list.files(path=system.file('rmarkdown', package="rmdwc"), pattern="\\.Rmd$", full.names=TRUE)
    rmdcount(files)
    # use of rmdcount() in an R Markdown document
    if (interactive()) {
      files <- system.file('rmarkdown/rstudio_pdf.Rmd', package="rmdwc")
      file.edit(files) # SAVE(!) the file and knit it
    }
    # count including code chunks
    files <- system.file('rmarkdown/rstudio_pdf.Rmd', package="rmdwc")
    txtcount(files)

Applies rmdcount to the current R Markdown document.

    rmdcountAddin()

nothing

    if (interactive()) rmdcountAddin()

Counts words, characters and non-whitespace characters in a string. It is used by rmdcount; see the details there.

    rmdwcl(rmd, space = "[[:space:]]", word = "[[:space:]]+", line = "\n")


| Argument | Description |
|---|---|
| rmd | character: R Markdown document as string |
| space | character: pattern to split a text at spaces (default: `"[[:space:]]"`) |
| word | character: pattern to split a text at word boundaries (default: `"[[:space:]]+"`) |
| line | character: pattern to split lines (default: `"\n"`) |

a list

    file <- system.file('rmarkdown/rstudio_pdf.Rmd', package="rmdwc")
    fcont <- readChar(file, file.info(file)$size)
    rmdwcl(fcont)
