The R source code comparison is based on similarity coefficients for the names used in R programs or expressions. Use cases would be the detection of
In the first case, detection of similar code sequences can lead to better code quality if similar code is embedded in a function rather than repeatedly in different places. In the second case, cheating is looked for.
The goal, however, is not perfect detection of similar code sequences, but rather to give clues as to where similar code sequences might be.
We have several steps to take:
A thanks to Masatoshi Nishimura to his blog post The Best Document Similarity Algorithm: A Beginner’s Guide which is very inspiring.
files <- ... # get file names from somehere, e.g. list.files()
prgs <- sourcecode(files, title=basename(files))
docs <- documents(prgs, type="names")
sim <- similarities(docs) # you may use alternatively tfidf()
dfsim <- matrix2dataframe(sim)
head(dfsim, n=25)
browse(prgs, dfsim, n=6) # creates and opens a HTML file
files <- ... # get file names from somehere, e.g. list.files()
# load all expressions with at least `minlines` lines
prgs <- sourcecode(files, title=basename(files), minlines=0)
docs <- documents(prgs, type="names")
sim <- similarities(docs) # you may use alternatively tfidf()
sim <- same_file(sim) # do not compare expressions within one file
dfsim <- matrix2dataframe(sim)
head(dfsim, n=25)
browse(prgs, dfsim, n=6) # creates and opens a HTML file
The makers of the package SimilaR (R Source Code Similarity Evaluation) have provided some sample files for testing:
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs <- sourcecode(files, title=basename(files))
#>
#> /tmp/RtmpXjArjB/Rinsta29753d4af3/rscc/examples/aa.R
#> /tmp/RtmpXjArjB/Rinsta29753d4af3/rscc/examples/aa1.R
#> /tmp/RtmpXjArjB/Rinsta29753d4af3/rscc/examples/bucketSort1.R
#> /tmp/RtmpXjArjB/Rinsta29753d4af3/rscc/examples/bucketSort1_addLines.R
#> /tmp/RtmpXjArjB/Rinsta29753d4af3/rscc/examples/bucketSort1_variables.R
#> /tmp/RtmpXjArjB/Rinsta29753d4af3/rscc/examples/isPrime2.R
#> /tmp/RtmpXjArjB/Rinsta29753d4af3/rscc/examples/isPrime2_addLines.R
#> /tmp/RtmpXjArjB/Rinsta29753d4af3/rscc/examples/isPrime2_callReverse.R
#> /tmp/RtmpXjArjB/Rinsta29753d4af3/rscc/examples/kendall4.R
#> /tmp/RtmpXjArjB/Rinsta29753d4af3/rscc/examples/kendall4_variables.R
#> /tmp/RtmpXjArjB/Rinsta29753d4af3/rscc/examples/kombinuj1.R
#> /tmp/RtmpXjArjB/Rinsta29753d4af3/rscc/examples/kombinuj1_variables.R
#> /tmp/RtmpXjArjB/Rinsta29753d4af3/rscc/examples/kwantyle1.R
#> /tmp/RtmpXjArjB/Rinsta29753d4af3/rscc/examples/kwantyle1_variables.R
names(prgs)
#> [1] "aa.R" "aa1.R"
#> [3] "bucketSort1.R" "bucketSort1_addLines.R"
#> [5] "bucketSort1_variables.R" "isPrime2.R"
#> [7] "isPrime2_addLines.R" "isPrime2_callReverse.R"
#> [9] "kendall4.R" "kendall4_variables.R"
#> [11] "kombinuj1.R" "kombinuj1_variables.R"
#> [13] "kwantyle1.R" "kwantyle1_variables.R"
The parameter title
sets another title for the program
instead of files
.
The parameter silent=TRUE
suppresses the output of the
parsed files. If an error occurs during parsing, the file will not be
loaded and will not be included in the following steps.
If you want to consider expressions and not the whole R file, you
have to set the parameter minlines
. sourcecode
checks whether an expression in the source file has more than
minlines
lines. If so, the expression is kept for further
analysis. The name of the list items in prgs
is then
title[number]
. For example, you could access the expression
named prgs[["aa.R[1]"]]
.
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs <- sourcecode(files, title=basename(files), minlines=3, silent=TRUE)
names(prgs)
#> [1] "aa.R[1]" "aa1.R[1]"
#> [3] "bucketSort1.R[1]" "bucketSort1_addLines.R[1]"
#> [5] "bucketSort1_variables.R[1]" "isPrime2.R[1]"
#> [7] "isPrime2_addLines.R[1]" "isPrime2_callReverse.R[1]"
#> [9] "kendall4.R[1]" "kendall4_variables.R[1]"
#> [11] "kombinuj1.R[1]" "kombinuj1_variables.R[1]"
#> [13] "kwantyle1.R[1]" "kwantyle1_variables.R[1]"
The next step is to create documents, based on the names of variables, functions and operators from the parsed source codes.
docs <- documents(prgs)
# create term document frequency table
freq_table(docs)[1:8,1:8]
#> vword
#> vdoc asd asd1 asd2 asd3 asd4 asd5 asd6 asd7
#> aa.R[1] 1 0 0 0 0 0 0 0
#> aa1.R[1] 1 0 0 0 0 0 0 0
#> bucketSort1.R[1] 0 0 0 0 0 0 0 0
#> bucketSort1_addLines.R[1] 0 0 0 0 0 0 0 0
#> bucketSort1_variables.R[1] 0 0 0 0 0 0 0 0
#> isPrime2.R[1] 0 0 0 0 0 0 0 0
#> isPrime2_addLines.R[1] 0 0 0 0 0 0 0 0
#> isPrime2_callReverse.R[1] 0 1 1 1 1 1 1 1
With the type
parameter you can distinguish between
different types of names and with ignore.case
the names are
either treated case-insensitive or not.
type
With the type
parameter you can distinguish between
different types of names:
cat(as.character(prgs[[1]])) # source code
#> asd <- function(x) {
#> for (i in x) {
#> cat(i)
#> x[i] <- 3
#> }
#> }
all.vars(prgs[[1]]) # type="v", default
#> [1] "asd" "i" "x"
all.names(prgs[[1]]) # type="n"
#> [1] "<-" "asd" "function" "{" "for" "i"
#> [7] "x" "{" "cat" "i" "<-" "["
#> [13] "x" "i"
setdiff(all.names(prgs[[1]]), all.vars(prgs[[1]])) # type="f"
#> [1] "<-" "function" "{" "for" "cat" "["
minlen
and ignore.case
With the parameter minlen
you can exclude names that are
shorter than minlen
. The default is minlen=2
because the name of an index variable in loops often consists of only
one letter, for example for (i in 1:n)
.
ignore.case
is either TRUE
or
FALSE
. If TRUE
(default), then names will
case-insensitive, otherwise case-sensitive.
In cluster analysis has been developed several similarity coefficients. They use words and basically calculate the intersection of words in documents over the union of words.
To calculate similarity coefficients between all source text segments based on the names used:
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs <- sourcecode(files, title=basename(files), silent=TRUE)
docs <- documents(prgs)
similarities(docs)[1:8,1:8]
#> aa.R aa1.R bucketSort1.R bucketSort1_addLines.R
#> aa.R 1 1 0.0000000 0.0000000
#> aa1.R 1 1 0.0000000 0.0000000
#> bucketSort1.R 0 0 1.0000000 1.0000000
#> bucketSort1_addLines.R 0 0 1.0000000 1.0000000
#> bucketSort1_variables.R 0 0 0.1428571 0.1428571
#> isPrime2.R 0 0 0.0000000 0.0000000
#> isPrime2_addLines.R 0 0 0.0000000 0.0000000
#> isPrime2_callReverse.R 0 0 0.0000000 0.0000000
#> bucketSort1_variables.R isPrime2.R isPrime2_addLines.R
#> aa.R 0.0000000 0.0000000 0.0000000
#> aa1.R 0.0000000 0.0000000 0.0000000
#> bucketSort1.R 0.1428571 0.0000000 0.0000000
#> bucketSort1_addLines.R 0.1428571 0.0000000 0.0000000
#> bucketSort1_variables.R 1.0000000 0.0000000 0.0000000
#> isPrime2.R 0.0000000 1.0000000 1.0000000
#> isPrime2_addLines.R 0.0000000 1.0000000 1.0000000
#> isPrime2_callReverse.R 0.0000000 0.1111111 0.1111111
#> isPrime2_callReverse.R
#> aa.R 0.0000000
#> aa1.R 0.0000000
#> bucketSort1.R 0.0000000
#> bucketSort1_addLines.R 0.0000000
#> bucketSort1_variables.R 0.0000000
#> isPrime2.R 0.1111111
#> isPrime2_addLines.R 0.1111111
#> isPrime2_callReverse.R 1.0000000
This calculates a matrix of Jaccard coefficients based on the variable names.
If the Jaccard coefficient is used then each entry can be interpreted as the percentage:
$$\frac{\text{number of words used in both documents}}{\text{number of all words used in both documents}}.$$ The interpretation will be different if another similarity coefficient is used! But in any case, a higher similarity coefficient corresponds to a larger proportion of variable names in both files (or expressions).
coeff
(similarity)With the parameter coeff
a certain similarity
coefficient can be calculated (default: jaccard
).
If you specify two sets with unique names set1
,
set2
and one set setfull
with predefined
names, four numbers will be calculated (default:
setfull <- unique(c(set1,set2))
):
inset1 <- setfull %in% unique(set1)
inset2 <- setfull %in% unique(set2)
p <- length(setfull)
n11 <- sum(inset1 & inset2)
n10 <- sum(inset1 & !inset2)
n01 <- sum(!inset1 & inset2)
n00 <- sum(!inset1 & !inset2)
The following coefficients can be calculated:
braun = n11/max(n01+n11, n10+n11)
,dice = 2*n11/(n01+n10+2*n11)
,jaccard = n11/(n01+n10+n11)
(default),kappa = 1/(1+p/2*(n01+n10)/(n00*n11-n01*n10))
,kulczynski = n11/(n01+n10)
,matching = (n00+n11)/p
,ochiai = n11/sqrt((n11+n10)*(n11+n10))
,phi = (n11*n00-n10*n01)/sqrt((n11+n10)*(n11+n10)*(n00+n10)*(n00+n10))
,russelrao = n11/p
,simpson = n11/min(n01+n11, n10+n11)
,sneath = n11/(n11+2*n01+2*n10)
,tanimoto = (n11+n00)/(n11+2*n01+2*n10+n00)
, andyule = (n11*n00-n01*n10)/(n11*n00-n01*n10)
.If a coefficient name is not found or a NaN
is generated
then a zero is returned.
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs <- sourcecode(files, title=basename(files), silent=TRUE)
docs <- documents(prgs)
similarities(docs, coeff="m")[1:8,1:8]
#> aa.R aa1.R bucketSort1.R bucketSort1_addLines.R
#> aa.R 1 1 0.0000000 0.0000000
#> aa1.R 1 1 0.0000000 0.0000000
#> bucketSort1.R 0 0 1.0000000 1.0000000
#> bucketSort1_addLines.R 0 0 1.0000000 1.0000000
#> bucketSort1_variables.R 0 0 0.1428571 0.1428571
#> isPrime2.R 0 0 0.0000000 0.0000000
#> isPrime2_addLines.R 0 0 0.0000000 0.0000000
#> isPrime2_callReverse.R 0 0 0.0000000 0.0000000
#> bucketSort1_variables.R isPrime2.R isPrime2_addLines.R
#> aa.R 0.0000000 0.0000000 0.0000000
#> aa1.R 0.0000000 0.0000000 0.0000000
#> bucketSort1.R 0.1428571 0.0000000 0.0000000
#> bucketSort1_addLines.R 0.1428571 0.0000000 0.0000000
#> bucketSort1_variables.R 1.0000000 0.0000000 0.0000000
#> isPrime2.R 0.0000000 1.0000000 1.0000000
#> isPrime2_addLines.R 0.0000000 1.0000000 1.0000000
#> isPrime2_callReverse.R 0.0000000 0.1111111 0.1111111
#> isPrime2_callReverse.R
#> aa.R 0.0000000
#> aa1.R 0.0000000
#> bucketSort1.R 0.0000000
#> bucketSort1_addLines.R 0.0000000
#> bucketSort1_variables.R 0.0000000
#> isPrime2.R 0.1111111
#> isPrime2_addLines.R 0.1111111
#> isPrime2_callReverse.R 1.0000000
Another possibility to find similar source codes as cosine of angles of term frequency–inverse document frequency:
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs <- sourcecode(files, title=basename(files), silent=TRUE)
docs <- documents(prgs)
tfidf(docs)[1:8,1:8]
#>
#> aa.R aa1.R bucketSort1.R bucketSort1_addLines.R
#> aa.R 1 1 0.0000000 0.0000000
#> aa1.R 1 1 0.0000000 0.0000000
#> bucketSort1.R 0 0 1.0000000 1.0000000
#> bucketSort1_addLines.R 0 0 1.0000000 1.0000000
#> bucketSort1_variables.R 0 0 0.3128966 0.3128966
#> isPrime2.R 0 0 0.0000000 0.0000000
#> isPrime2_addLines.R 0 0 0.0000000 0.0000000
#> isPrime2_callReverse.R 0 0 0.0000000 0.0000000
#>
#> bucketSort1_variables.R isPrime2.R
#> aa.R 0.0000000 0.0000000
#> aa1.R 0.0000000 0.0000000
#> bucketSort1.R 0.3128966 0.0000000
#> bucketSort1_addLines.R 0.3128966 0.0000000
#> bucketSort1_variables.R 1.0000000 0.0000000
#> isPrime2.R 0.0000000 1.0000000
#> isPrime2_addLines.R 0.0000000 1.0000000
#> isPrime2_callReverse.R 0.0000000 0.2021137
#>
#> isPrime2_addLines.R isPrime2_callReverse.R
#> aa.R 0.0000000 0.0000000
#> aa1.R 0.0000000 0.0000000
#> bucketSort1.R 0.0000000 0.0000000
#> bucketSort1_addLines.R 0.0000000 0.0000000
#> bucketSort1_variables.R 0.0000000 0.0000000
#> isPrime2.R 1.0000000 0.2021137
#> isPrime2_addLines.R 1.0000000 0.2021137
#> isPrime2_callReverse.R 0.2021137 1.0000000
The attribute tfidf
of the result matrix contains the
term frequency–inverse document frequency matrix.
Since the matrix of coefficients can not easily be overviewed we have must transform them for visualisation.
matrix2dataframe
With matrix2dataframe
you transform the matrix of
coefficients to a data frame where the first column is the row index,
the second column the column index and the third column the the matrix
entry.
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs <- sourcecode(files, title=basename(files), silent=TRUE)
docs <- documents(prgs)
simm <- similarities(docs, coeff="m")
simdf <- matrix2dataframe(simm)
head(simdf, 10)
#> row col matching
#> 1 aa.R aa1.R 1.0000000
#> 2 bucketSort1.R bucketSort1_addLines.R 1.0000000
#> 3 isPrime2.R isPrime2_addLines.R 1.0000000
#> 4 kombinuj1.R kombinuj1_variables.R 0.3000000
#> 5 kwantyle1.R kwantyle1_variables.R 0.2000000
#> 6 bucketSort1.R bucketSort1_variables.R 0.1428571
#> 7 bucketSort1_addLines.R bucketSort1_variables.R 0.1428571
#> 8 kendall4.R kendall4_variables.R 0.1428571
#> 9 isPrime2.R isPrime2_callReverse.R 0.1111111
#> 10 isPrime2_addLines.R isPrime2_callReverse.R 0.1111111
As you can see the lines are ordered by decreasing values. The
parameter decreasing=FALSE
will order it by ascending
order.
The output can be interpreted line by line:
aa.R
and
aa1.R
are identical. Each variable name used in
aa.R
is also used in aa1.R
and vice
versa.bucketSort1_addLines.R
and bucketSort1.R
are identical.isPrime2_addLines.R
and
isPrime2.R
are identical.kombinuj1_variables.R
and kombinuj1.R
are not identical. In both
files, the variable names overlap by 30%.In case of a symmetric similarity matrix only the upper triangle is considered otherwise the whole matrix.
You may use a stripchart
to see all similarity
coefficients:
as_igraph
In a second step you can plot the coefficients in a diagram, where thicker edges correspond to higher similarity coefficients.
library("igraph")
#>
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:stats':
#>
#> decompose, spectrum
#> The following object is masked from 'package:base':
#>
#> union
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs <- sourcecode(files, title=basename(files), silent=TRUE)
docs <- documents(prgs, type="n", minlen=3)
simm <- similarities(docs)
graph <- as_igraph(simm, diag=FALSE)
# color all edges wit a large similarity coefficients in red
E(graph)$color <- ifelse(E(graph)$weight>0.4, "red", "grey")
plot(graph, edge.width=1+3*E(graph)$weight)
box()
The package igraph is used for
the graphical representation. In as_igraph
the function
igraph::graph_from_adjacency_matrix
is used. In case of a
symmetric coefficient matrix, undirected graphs are used, otherwise a
directed graph is used.
same_file
If you are only interested in the differences between files, you can set the similarities between expressions to some predefined value (default: zero) if they are in the same file. The use case here is to detect plagiarism in different files.
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs <- sourcecode(files, title=basename(files), silent=TRUE, minlines=1)
docs <- documents(prgs)
simm <- similarities(docs)
simm[1:3,1:3]
#> aa.R[1] aa1.R[1] bucketSort1.R[1]
#> aa.R[1] 1 1 0
#> aa1.R[1] 1 1 0
#> bucketSort1.R[1] 0 0 1
simm <- same_file(simm)
simm[1:3,1:3]
#> aa.R[1] aa1.R[1] bucketSort1.R[1]
#> aa.R[1] 0 1 0
#> aa1.R[1] 1 0 0
#> bucketSort1.R[1] 0 0 0
The last step is to compare the relevant source codes. The command
browse
creates and opens an HTML page with the source codes
side by side.