My journey from R to Julia

A very brief introduction to Julia for epidemiologists

Julia
R
Python
Data science
Programming
Epidemiology
Scientific computing
Author
Affiliation

California Department of Public Health

Published

January 14, 2023

Note

My history with R

For 15 years, as volunteer adjunct faculty, I taught “Applied Epidemiology using R” at the UC Berkeley School of Public Health. I started teaching this course in the early 2000s when most people were not interested in R. Eventually R’s popularity grew, and so did my course enrollment. I taught basic R programming applied to epidemiologic concepts. Students worked on a project of their choosing. Because I had students from multiple disciplines, their projects were often very innovative and I learned a lot from them.

Over the 15 years I witnessed the emergence of “data science” and students’ ingenuity and creativeness in their projects. Naturally, the course evolved too. I introduced Bayesian networks as a unifying framework to introduce probablistic dependence, causal graphs, and decision networks (for decision analysis) [1].

I started dabbling in Python. Unfortunately, the COVID-19 pandemic interrupted my teaching because I was the health officer of San Francisco and director of the Population Health Division. Working on the pandemic response left me little time to conduct analyses or to learn Python.1

I cannot remember when, but eventually I discovered Julia—a programming language designed for scientific computing with the intuition of Python or R, but with the speed of C++. I fell in love with Julia and I gave up on learning Python. I did not, and do not, have the time to maintain core competency in more than 1.5 programming languages.2 As I learned more Julia, I became convinced, that for me, learning Julia was a better long term investment than sticking with R.

I use programming to explore or test my epidemiologic intuition, to learn new methods, and to visualize and analyze data. I have a personal interest in Bayesian networks, decision networks, causal inference, Markov decision processes, and agent-based modeling.3

Here are some key Julia features that I believe epidemiologists will value:

  • multiple dispatch (see below)
  • composite types
  • just-in-time compiling
  • speed (very very fast)
  • community (mostly computer science, mathematics, engineering)
  • 1-based indexing like R and MATLAB (Python is 0-based indexing)
  • Capabilities similar to dpylr in R 4, 5
  • Pipe operator similar to %>% in R
  • general programming language (like Python)

Julia enables more intuitive programming. For example, in R, we try to avoid loops because they are very inefficient. In Julia, loops are efficient because they compile before execution. This promotes programming that is more natural.

Below I demonstrate multiple dispatch with a trivial example. R is single dispatch.

Multiple dispatch example in Julia

I was the founding developer of the R ‘epitools’ package.6 I developed functions for basic epidemiologic analyses (eg, 2x2 tables), several from Rothman’s “Modern Epidemiology” textbook. For example, if I wanted to create a function to calculate an odds ratio for a 2x2 contingency table, the data could be provided in several ways:

  • four integer counts,
  • two proportions, or
  • a 2x2 table (matrix)
  • two vectors with categorical data

If I wanted to write one function to handle these possible data types as arguments, I would have to do a lot of processing and checking of data types in order to call the next function (nested or external). This requires much more work than is necessary. Let’s do this is Julia using multiple dispatch.

In 2003, we published a study that provided evidence that drinking unfiltered municipal tap water was associated with developing cryptosporidiosis among patients with advanced HIV disease [2]. Here is a contingency table from this paper.

Exposure Case Control
Highest 12 6
Intermediate 35 64
Lowest 2 29

Let’s calculate the unadjusted odds ratio comparing the highest exposure category to the lowest exposure category. Here is the 2x2 table for this calculation.

Exposure Case Control
Highest 12 6
Lowest 2 29

For an appropriately structured table, for example,

Exposure Case Control
Highest a b
Lowest c d

the odds ratio is the cross-product:

\[ OR = \frac{a d}{b c} \]

For a case-control design, the odds ratio is the ratio of the exposure odds.

\[ OR = \frac{p_1/(1 - p_1)}{p_0/(1 - p_0)} \]

We will create three functions that can receive three different types of arguments to calculate an odds ratio.

  • four integer counts,
  • two proportions, or
  • a 2x2 table (matrix)

The catch is that the three functions will have the same name: oddsratio. This is possible in Julia because of multiple dispatch. In contrast, R is single dispatch.

## Function 1
function oddsratio(a::Int, b::Int, c::Int, d::Int)
    or = (a * d) / (b * c)
    return or
end 
oddsratio (generic function with 1 method)

Let’s test the oddsratio function by passing four integers from our 2x2 table.

oddsratio(12, 6, 2, 29)
29.0

Here is the second function to handle arguments that are proportions; for example, the exposure odds comparing cases to controls.

## Function 2
function oddsratio(p1::Float64, p0::Float64)
    or = ((p1)/(1 - p1)) / ((p0)/(1 - p0))
    return or
end
oddsratio (generic function with 2 methods)

Let’s test the oddsratio function by passing two proportions from our 2x2 table.

prop1 = 12 / (12 + 2) # probability of exposure among cases
prop0 = 6 / (6 + 29) # probability of exposure among controls
oddsratio(prop1, prop0)
28.999999999999982

Finally, here is the third function to handle an argument that is a 2x2 table (matrix).

## Function 3
function oddsratio(x::Matrix{Int})
    or = (x[1, 1] * x[2, 2]) / (x[1, 2] * x[2, 1])
    return or
end
oddsratio (generic function with 3 methods)

Let’s test the oddsratio function by passing our 2x2 table.

tab = [12 6; 2 29]
2×2 Matrix{Int64}:
 12   6
  2  29
oddsratio(tab)
29.0

This is called multiple dispatch. The oddsratio function has three methods and can handle multiple data types. Notice how easy that was.

methods(oddsratio)
# 3 methods for generic function oddsratio from Main:
  • oddsratio(x::Matrix{Int64}) in Main at In[6]:2
  • oddsratio(p1::Float64, p0::Float64) in Main at In[4]:2
  • oddsratio(a::Int64, b::Int64, c::Int64, d::Int64) in Main at In[2]:2

Wow! Now anyone, including me, can add new methods to the oddsratio function without disrupting previous methods. This is very powerful. Other languages accomplish this using object-oriented programming. Julia is not an object-oriented programming lanuage.

Summary

In this blog entry I summarized why I switched from R to Julia. I illustrated how multiple dispatch works with functions. Notice how easy it was for me to create an oddsratio function with three methods to handle different argument data types (integer counts, proportions, and a matrix).

I enjoy Julia and you will too. You can also run R or Python from Julia without skipping a beat.

I will be posting simple examples that highlight Julia features applied to basic epidemiology or epidemiologic programming.

Appendix

Julia is very flexible in the creation of functions. Here are the same three functions in an abbreviated form.

## Function 1
oddsratio(a::Int, b::Int, c::Int, d::Int) = (a * d) / (b * c)
oddsratio (generic function with 3 methods)
## Function 2
oddsratio(p1::Float64, p0::Float64) = ((p1)/(1 - p1)) / ((p0)/(1 - p0))
oddsratio (generic function with 3 methods)
## Function 3
oddsratio(x::Matrix{Int}) = (x[1, 1] * x[2, 2]) / (x[1, 2] * x[2, 1])
oddsratio (generic function with 3 methods)

From my biased perspective, Julia has a simplicity and elegance that is lacking in R.

References

1.
Aragón TJ. Population health thinking using Bayesian networks. University of California, eScholarship [Internet]. 2018; Available from: https://escholarship.org/uc/item/8000r5m5
2.
Aragón TJ, Novotny S, Enanoria W, Vugia DJ, Khalakdina A, Katz MH. Endemic cryptosporidiosis and exposure to municipal tap water in persons with acquired immunodeficiency syndrome (AIDS): A case-control study. BMC Public Health [Internet]. 2003 Jan;3(1). Available from: https://doi.org/10.1186/1471-2458-3-2

Footnotes

  1. I did not enjoy Python.↩︎

  2. I still use R occassionally; that is the “.5”.↩︎

  3. The more I study the more I am humbled by the enormous talent out there. I am only stratching the surface.↩︎

  4. See https://bkamins.github.io/julialang/2020/07/03/dplyr-vs-df.html↩︎

  5. See https://juliadata.github.io/DataFramesMeta.jl/stable/dplyr/↩︎

  6. See https://cran.r-project.org/web/packages/epitools/index.html↩︎