Julia for Data Analysis

“A book review for population health data scientists”

Book review
Julia
R
Python
Data science
Programming
Epidemiology
Scientific computing
Author
Affiliation

California Department of Public Health

Published

February 20, 2023

1 Overview

Julia is an open-source, general-purpose programming language for scientific computing that is well suited to data science. I taught R programming for several years at the UC Berkeley School of Public Health, and I recently posted a blog entry on why I switched from R to Julia (“My Journey from R to Julia”). In this blog entry I review the book “Julia for Data Analysis” by Bogumił Kamiński [1]. In short, this is an outstanding book that I recommend without reservation (5/5 stars). This review is a working blog that I will update with highlights from the book.

For data science, Julia is exploding in popularity, and there are numerous outstanding resources for learning it, including books, videos, and blog posts. I purchased the print and eBook option, which includes an online “liveBook” that is easy to read. However, nothing beats having a book in hand to read and mark up. The author provides a GitHub repository with code and data files.

The author, Bogumił Kamiński, is a core developer of the DataFrames.jl package. He is an associate professor and head of the Decision Support and Analysis Unit at the SGH Warsaw School of Economics, as well as an adjunct professor at the data science laboratory, Ryerson University, Toronto.

This book is perfect for population health data scientists who are already familiar with R or Python, or who already have basic proficiency with Julia but need an in-depth and systematic introduction to Julia for data science.

Here is the table of contents.

  1. Introduction
  2. Getting started with Julia
  3. Julia’s support for scaling projects
  4. Working with collections in Julia
  5. Advanced topics on handling collections
  6. Working with strings
  7. Handling time-series data and missing values
  8. First steps with data frames
  9. Getting data from a data frame
  10. Creating data frame objects
  11. Converting and grouping data frames
  12. Mutating and transforming data frames
  13. Advanced transformations of data frames
  14. Creating web services for sharing data analysis results

2 Book highlights

2.1 Chapter 1 Introduction

My blog posting “My Journey from R to Julia” is a good summary of what is covered in the Introduction. Here I will cover just one item, execution speed, and compare Julia to R.

We will use a for loop to sum the integers from 1 to 1,000,000,000 (1 billion) in random order, sampled without replacement.1 Here is the correct answer as a reference:

## In Julia
julia>  sum(1:1_000_000_000)
500000000500000000

## In R
> options(digits=20)
> sum(1:1000000000)
[1] 5.000000005e+17

By default, an R for loop would give the wrong answer (not shown) because R does these computations with 64-bit floating-point numbers, which cannot represent every integer this large exactly.2 To get the correct answer we need 64-bit integers, so I used the bit64 R package (below).

> require("bit64")
> n = 1000000000
> samp = sample(1:n, n, replace=FALSE)
> sum_n = function(x){
+     s = as.integer64(0)
+     for (i in x){
+         s = s + i
+     }
+     s
+ }
> system.time(x <- sum_n(samp))
     user    system   elapsed 
11094.792   101.378 11201.145 
> x
integer64
[1] 500000000500000000

To get the correct answer in R, the execution time was about 11,095 seconds. Okay, let’s try Julia.

julia> using StatsBase, BenchmarkTools
julia> n = 1_000_000_000;
julia> samp = sample(1:n, n, replace=false);
julia> function sum_n(x)
           s = 0
           for i in x
               s = s + i
           end
           return s
       end
sum_n (generic function with 1 method)

julia> @btime sum_n(samp)
  158.118 ms (1 allocation: 16 bytes)
500000000500000000

In Julia, it took about 158 milliseconds. R is about 70,168 times slower than Julia!3

Conclusion: Compared to R, Julia can handle large for loops for summation and give an accurate answer fast. To add integers correctly using a for loop, R requires the bit64 package and was about 70,168 times slower than Julia in this test. With R, we are taught to avoid for loops; now you know why. However, a for loop is a workhorse tool that we want available to us; hence, this is a huge advantage of Julia.
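The floating-point limit behind the R result can be seen directly in Julia. This is a minimal sketch, assuming a 64-bit system where Julia's default integer type is Int64; the variable names are mine and are not from the book.

```julia
# Float64 represents every integer exactly only up to 2^53
limit = maxintfloat(Float64)       # 9.007199254740992e15
@assert limit == 2.0^53

# Beyond that limit, adding 1 can be lost to rounding
lost = (limit + 1.0 == limit)      # true

# Int64 stays exact in this range
exact = 2^53 + 1                   # 9007199254740993

# The target sum is far above 2^53 but well within the Int64 range
target = 500_000_000_500_000_000   # sum(1:1_000_000_000)
above_float_limit = target > 2^53  # true
fits = target < typemax(Int64)     # true
```

Because the running total passes 2^53 long before the loop ends, a sum accumulated in 64-bit floats cannot be relied on to be exact, while the Int64 sum can.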

To see more benchmarks, visit “Which programming language is fastest?”.

2.2 Chapter 2 Getting started with Julia

2.2.1 Basic data types

If you are familiar with R or Python, you will feel comfortable with Julia. Here is a character:

## character; notice single quotation marks
'a'
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

A string is a collection of characters:

## string; notice double quotation marks
"Hello World"
"Hello World"

An array is a collection in brackets. A vector is a 1-dimensional array. Commas are used to create a vector. By default, the vector is displayed vertically, but it is not a “column vector”.

## vector
[1, 2, 3]
3-element Vector{Int64}:
 1
 2
 3

To display a vector horizontally, use the show function.4

## vector
show([1, 2, 3])
[1, 2, 3]

A matrix is a two-dimensional array. Numbers separated by spaces create one row of a matrix. This is not a vector.

## matrix
[1 2 3 4]
1×4 Matrix{Int64}:
 1  2  3  4

Separating matrix rows with a semicolon creates a multi-row matrix:

## matrix
[1 2 3 4; 6 7 8 9]
2×4 Matrix{Int64}:
 1  2  3  4
 6  7  8  9

The size function returns the dimensions of an array. Previously, we learned that a vector is a 1-dimensional array.

size([1, 2, 3])
(3,)

We see this vector is 1-dimensional ((3,)); in contrast,

size([1 2 3])
(1, 3)

this is a 1×3 matrix, which is 2-dimensional.
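To check the number of dimensions directly rather than reading it off the size tuple, here is a small sketch using only Base functions (ndims, vec, permutedims); the variable names are mine.

```julia
v = [1, 2, 3]      # vector: 1-dimensional
m = [1 2 3]        # 1×3 matrix: 2-dimensional

nd_v = ndims(v)    # 1
nd_m = ndims(m)    # 2

# vec flattens the matrix to a vector; the elements are unchanged
same_elements = (vec(m) == v)            # true

# a vector and a one-row matrix have different shapes, so they are not equal
same_shape = (v == m)                    # false

# permutedims turns the vector into a 1×3 matrix
as_row = (permutedims(v) == m)           # true
```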

A tuple is a collection in parentheses, and commas separate the elements. Above, the size function returned a tuple. Tuples are immutable; their elements cannot be changed. Immutability can help Julia generate faster code.

## tuple
(1, 2, 3)
(1, 2, 3)

Tuple elements can be named.

## named tuple
(four = 4, five = 5, six = 6)
(four = 4, five = 5, six = 6)
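Here is a short sketch of what immutability means in practice. It reuses the two tuples above; the try/catch wrapper is only there so the example runs to completion instead of stopping at the error.

```julia
t = (1, 2, 3)
nt = (four = 4, five = 5, six = 6)

first_elem = t[1]      # 1, indexing works as with vectors
by_name = nt.five      # 5, named tuple fields are accessed by name
by_position = nt[2]    # 5, or by position

# Assigning to an element is not allowed; Julia throws a MethodError
tuple_is_immutable = try
    t[1] = 10
    false
catch err
    err isa MethodError
end
```

If you need to change values, use a vector instead, or build a new tuple.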

The assignment operator (=) binds a value to a variable. Binding in Julia works as it does in Python, but not as it does in R.

x = [1, 2]
y = x
y
2-element Vector{Int64}:
 1
 2

If we change an element in x, the same change occurs in y because the same array is bound to both variables x and y.

x[2] = 99
## notice that y changes also
y
2-element Vector{Int64}:
  1
 99

This is not the case in R:

> x = c(1, 2)
> y = x
> x[2] = 99
> x
[1]  1 99
> y
[1] 1 2

To replicate the R experience in Julia we assign a copy of x to y.

x = [1, 2]
y = copy(x)
x[2] = 99
x
2-element Vector{Int64}:
  1
 99
y
2-element Vector{Int64}:
 1
 2
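To check whether two variables are bound to the same array, the === operator (object identity) is handy. This is a small sketch using only Base Julia; the variable z is my addition.

```julia
x = [1, 2]
y = x            # y is another name for the same array
z = copy(x)      # z is a new array with the same values

same_object_y = (y === x)    # true: identical object
same_object_z = (z === x)    # false: different object
equal_values_z = (z == x)    # true: equal values

x[2] = 99
y_second = y[2]              # 99, follows the change to x
z_second = z[2]              # 2, unaffected
```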

2.2.2 Basic functions

In my blog posting “My Journey from R to Julia” I demonstrated how to create a simple function to calculate the odds ratio using 3 methods with multiple dispatch.5 I will use the function that calculates the odds ratio using the cross-product of 4 integers, and build a more useful function to illustrate some features of Julia.

For an appropriately structured table, for example,

Exposure   Disease   No disease
Yes        a         b
No         c         d

the odds ratio is the cross-product:

\[ OR = \frac{a d}{b c} \]

Here is a simple Julia function to calculate the odds ratio:

function oddsratio(a, b, c, d)
    or = (a * d) / (b * c)
    return or
end
oddsratio (generic function with 1 method)

Because this is a simple function, it can also be created in an abbreviated form:

oddsratio(a, b, c, d) = (a * d) / (b * c)
oddsratio (generic function with 1 method)

Here is data from a case-control study [2]:

Exposure   Case   Control
Highest    12     6
Lowest     2      29

Let’s test the oddsratio function by passing four integers from our 2x2 table.

oddsratio(12, 6, 2, 29)
29.0
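As a reminder of how multiple dispatch fits in, here is a minimal sketch that adds a second method accepting a 2×2 matrix of counts. The matrix method is my illustration for this review and is not necessarily one of the methods from my earlier posting; it assumes the table is laid out as [a b; c d].

```julia
# Method 1: four cell counts (as in the text)
oddsratio(a, b, c, d) = (a * d) / (b * c)

# Method 2 (illustrative assumption): a 2×2 matrix laid out as [a b; c d]
function oddsratio(tab::AbstractMatrix)
    size(tab) == (2, 2) || throw(ArgumentError("expected a 2×2 table"))
    return oddsratio(tab[1, 1], tab[1, 2], tab[2, 1], tab[2, 2])
end

or_counts = oddsratio(12, 6, 2, 29)     # 29.0
or_matrix = oddsratio([12 6; 2 29])     # 29.0
```

Julia picks the method from the number and types of the arguments, so both calls use the same function name.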

The function arguments, a, b, c, and d are called positional arguments and are always required in the correct order.

We now add a keyword argument, which is optional. Keyword arguments are separated from the positional arguments by a semicolon (;). Any argument, positional or keyword, can be assigned a default value. We’ll create the oddsratio2 function to calculate a confidence interval using the Normal approximation. The keyword argument will have a default confidence level of 0.95.

using Distributions # to access standard normal distribution 
function oddsratio2(a, b, c, d; level = 0.95)
    zv = quantile(Normal(), 0.5*(1 + level))
    est = (a * d) / (b * c)
    log_or = log(est)
    se_log_or = sqrt((1/a) + (1/b) + (1/c) + (1/d))
    lcl = exp(log_or - zv * se_log_or)
    ucl = exp(log_or + zv * se_log_or)
    return (
        or = est, 
        confint = (lcl, ucl), 
        level = level
    )
end
oddsratio2 (generic function with 1 method)

By default, oddsratio2 will calculate the 95% confidence interval:

oddsratio2(12, 6, 2, 29)
(or = 29.0, confint = (5.110695577009899, 164.55685675804662), level = 0.95)

I can also calculate 99% confidence intervals:

results = oddsratio2(12, 6, 2, 29; level = 0.99)
(or = 29.0, confint = (2.9619778898301936, 283.93189661797715), level = 0.99)

And we can index elements of the named tuple.

results.confint
(2.9619778898301936, 283.93189661797715)
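For readers who want the formulas behind oddsratio2, this is my reading of the code above written out as equations, where z is the standard normal quantile at 0.5 (1 + level):

\[ SE[\ln(OR)] = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}} \]

\[ CI = \exp\left( \ln(OR) \pm z \cdot SE[\ln(OR)] \right) \]

As a rough check for the 95% interval with a = 12, b = 6, c = 2, d = 29: SE is about 0.8857, z is about 1.960, and ln(29) is about 3.3673, so the limits are approximately exp(3.3673 - 1.7360) = 5.11 and exp(3.3673 + 1.7360) = 164.56, which agrees with the output shown earlier.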

2.2.3 Anonymous functions

Functions can be arguments to functions. For example, we will create a times_two function and pass it to other functions.

times_two(x) = 2 * x
map(times_two, [1, 2, 8])
3-element Vector{Int64}:
  2
  4
 16
sum(times_two, [1, 2, 8])
22
using StatsBase
mean(times_two, [1, 2, 8])
7.333333333333333

Alternatively, we can pass the times_two function as an anonymous function; that is, a function without a name.

map(x -> 2 * x, [1, 2, 8])
3-element Vector{Int64}:
  2
  4
 16
sum(x -> 2 * x, [1, 2, 8])
22
mean(x -> 2 * x, [1, 2, 8])
7.333333333333333
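Two related forms are worth knowing. This is a small sketch, not taken from the book's text: the do block is another way to write an anonymous function as the first argument, and the dot syntax broadcasts a function over a collection. I load Statistics (a standard library) for mean.

```julia
using Statistics  # standard library; provides mean

# do-block form: the block becomes the first (function) argument of map
doubled = map([1, 2, 8]) do x
    2 * x
end

# broadcasting an anonymous function with the dot syntax
doubled_dot = (x -> 2 * x).([1, 2, 8])

# the do-block form works with any function that takes a function first
mean_doubled = mean([1, 2, 8]) do x
    2 * x
end
```

The do block is mostly useful when the anonymous function spans several lines.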

We can even calculate the odds ratio as an anonymous function.

map((a, b, c, d) -> (a * d)/(b * c), (12, 6, 2, 29)...)
29.0

Notice I used a trick. By default, the map function applies a function to each element of a collection (eg, [1, 2, 8]). However, for the odds ratio calculation, I need to map the arguments (a, b, c, d) to their values (12, 6, 2, 29), then calculate the odds ratio. Therefore, I used the splat operator (...) to break up the collection of values so that they could be mapped to their arguments first. Although I used a tuple (12, 6, 2, 29), an array [12, 6, 2, 29] also works.

map((a, b, c, d) -> (a * d)/(b * c), [12, 6, 2, 29]...)
29.0

The splat operator (...) converts [12, 6, 2, 29] to 12, 6, 2, 29, which is convenient when the values are already stored in a collection. In this case, with only four integers, I could have passed the integers without the splat operator.

map((a, b, c, d) -> (a * d)/(b * c), 12, 6, 2, 29)
29.0
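The splat operator also works without map, and map becomes more natural when there are several tables. Here is a small sketch; the name or_fn is mine, and the second table contains made-up numbers used only for illustration.

```julia
or_fn = (a, b, c, d) -> (a * d) / (b * c)

counts = (12, 6, 2, 29)
or_single = or_fn(counts...)      # same as or_fn(12, 6, 2, 29)

# One odds ratio per table; the second tuple is hypothetical data
tables = [(12, 6, 2, 29), (10, 5, 4, 8)]
ors = map(t -> or_fn(t...), tables)
```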

Chapter 2 also covers other topics including loops, conditional expressions, and scoping. Chapters 3 to 7 cover practical tools for processing and managing data in Julia. I will focus the remainder of this book review on data frames.

2.3 Chapters 8–13: Working with data frames

The author is a lead developer of the DataFrames.jl Julia package, so in these chapters he covers the package thoroughly and you will not be disappointed. I will cover the following:

  • Downloading a CSV data file from a website
  • Reading a CSV file into a data frame
  • Conducting a common analytic workflow

The data set we will use is from the National Health and Nutrition Examination Survey Data I Epidemiologic Follow-up Study (NHEFS). The version of the NHEFS data set we will use is from Professor Miguel Hernán’s textbook “Causal Inference: What If.”

import Downloads
using CSV, DataFrames

Downloads.download("https://www.hsph.harvard.edu/miguel-hernan/wp-content" *
                   "/uploads/sites/1268/2019/03/nhefs.csv", "nhefs.csv")
readlines("nhefs.csv")[1:6]
6-element Vector{String}:
 "seqn,qsmk,death,yrdth,modth,dad" ⋯ 465 bytes ⋯ "ax71,tax82,price71_82,tax71_82"
 "233,0,0,,,,175,96,0,42,1,19,2,7" ⋯ 172 bytes ⋯ "0977,0.4437866211,0.6403808594"
 "235,0,0,,,,123,80,0,36,0,18,2,9" ⋯ 169 bytes ⋯ "994141,0.5493164063,0.79296875"
 "244,0,0,,,,115,75,1,56,1,15,3,1" ⋯ 170 bytes ⋯ "5488,0.0561981201,0.3202514648"
 "245,0,1,85,2,14,148,78,0,68,1,1" ⋯ 174 bytes ⋯ "7031,0.0547943115,0.3049926758"
 "252,0,0,,,,118,77,0,40,0,18,2,1" ⋯ 168 bytes ⋯ "994141,0.5493164063,0.79296875"

We are using the Downloads, CSV, and DataFrames modules.6

The data file has a header (variable names). If there is no header, see p. 191 of the book.

nhefs = CSV.read("nhefs.csv", DataFrame)
nhefs[1:6,:]
6×64 DataFrame
Row seqn qsmk death yrdth modth dadth sbp dbp sex age race income marital school education ht wt71 wt82 wt82_71 birthplace smokeintensity smkintensity82_71 smokeyrs asthma bronch tb hf hbp pepticulcer colitis hepatitis chroniccough hayfever diabetes polio tumor nervousbreak alcoholpy alcoholfreq alcoholtype alcoholhowmuch pica headache otherpain weakheart allergies nerves lackpep hbpmed boweltrouble wtloss infection active exercise birthcontrol pregnancies cholesterol hightax82 price71 price82 tax71 tax82 price71_82 tax71_82
Int64 Int64 Int64 Int64? Int64? Int64? Int64? Int64? Int64 Int64 Int64 Int64? Int64 Int64 Int64 Float64 Float64 Float64? Float64? Int64? Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64? Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64? Int64? Int64? Float64? Float64? Float64? Float64? Float64? Float64?
1 233 0 0 missing missing missing 175 96 0 42 1 19 2 7 1 174.188 79.04 68.946 -10.094 47 30 -10 29 0 0 0 0 1 1 0 0 0 0 1 0 0 0 1 1 3 7 0 1 0 0 0 0 0 1 0 0 0 0 2 2 missing 197 0 2.18359 1.73999 1.10229 0.461975 0.443787 0.640381
2 235 0 0 missing missing missing 123 80 0 36 0 18 2 9 2 159.375 58.63 61.235 2.60497 42 20 -10 24 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 4 0 1 0 0 0 0 0 0 0 0 1 0 0 2 missing 301 0 2.34668 1.79736 1.36499 0.571899 0.549316 0.792969
3 244 0 0 missing missing missing 115 75 1 56 1 15 3 11 2 168.5 56.81 66.2245 9.41449 51 20 -14 26 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 3 4 missing 0 1 1 0 0 1 0 0 0 0 0 0 2 0 2 157 0 1.56958 1.51343 0.55127 0.230988 0.0561981 0.320251
4 245 0 1 85 2 14 148 78 0 68 1 15 3 5 1 170.188 59.42 64.4101 4.99012 37 3 4 53 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 2 3 4 0 0 1 1 0 0 0 0 0 0 0 1 2 2 missing 174 0 1.50659 1.4519 0.524902 0.219971 0.0547943 0.304993
5 252 0 0 missing missing missing 118 77 0 40 0 18 2 11 2 181.875 87.09 92.0793 4.98925 42 20 0 19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 1 2 0 1 0 0 0 0 0 0 1 0 0 1 1 2 missing 216 0 2.34668 1.79736 1.36499 0.571899 0.549316 0.792969
6 257 0 0 missing missing missing 141 83 1 43 1 11 4 9 2 162.188 99.0 103.419 4.41906 34 10 10 21 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 3 2 1 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 212 1 2.20996 2.02588 1.15479 0.747925 0.184082 0.406982

To see all the variable names we will take the vector of 64 names and reshape them into a 16 by 4 matrix for display purposes.

vn = names(nhefs)
reshape(vn, :, 4)
16×4 Matrix{String}:
 "seqn"       "wt71"               "hayfever"        "hbpmed"
 "qsmk"       "wt82"               "diabetes"        "boweltrouble"
 "death"      "wt82_71"            "polio"           "wtloss"
 "yrdth"      "birthplace"         "tumor"           "infection"
 "modth"      "smokeintensity"     "nervousbreak"    "active"
 "dadth"      "smkintensity82_71"  "alcoholpy"       "exercise"
 "sbp"        "smokeyrs"           "alcoholfreq"     "birthcontrol"
 "dbp"        "asthma"             "alcoholtype"     "pregnancies"
 "sex"        "bronch"             "alcoholhowmuch"  "cholesterol"
 "age"        "tb"                 "pica"            "hightax82"
 "race"       "hf"                 "headache"        "price71"
 "income"     "hbp"                "otherpain"       "price82"
 "marital"    "pepticulcer"        "weakheart"       "tax71"
 "school"     "colitis"            "allergies"       "tax82"
 "education"  "hepatitis"          "nerves"          "price71_82"
 "ht"         "chroniccough"       "lackpep"         "tax71_82"

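Notice that the names run down each column of the matrix. That is because reshape fills column by column. Here is a small sketch with six integers, using only Base functions; the variable names are mine.

```julia
v = collect(1:6)

# The colon lets Julia infer the number of rows (3); values fill down columns
by_col = reshape(v, :, 2)
# by_col == [1 4; 2 5; 3 6]

# To fill along rows instead, reshape to 2×3 and swap the dimensions
by_row = permutedims(reshape(v, 2, :))
# by_row == [1 2; 3 4; 5 6]
```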
2.4 The split-apply-combine workflow

We will apply a common workflow (Figure 1):

  1. split (stratify) the data by one or more variables
  2. apply a function to each stratum
  3. combine the results into a table

Figure 1: Split-apply-combine is a common workflow in data science.
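To make the three steps concrete before using DataFrames.jl, here is a minimal sketch in plain Julia. The records are hypothetical toy values, not NHEFS data, and the variable names are mine; the DataFrames.jl functions used later do the same job far more conveniently.

```julia
using Statistics  # standard library; provides mean

# Hypothetical toy records (not NHEFS data), one named tuple per person
rows = [
    (sex = 0, age = 40, death = 1),
    (sex = 0, age = 50, death = 0),
    (sex = 1, age = 30, death = 0),
    (sex = 1, age = 34, death = 0),
]
T = eltype(rows)

# 1. split: collect the rows for each value of sex
groups = Dict{Int, Vector{T}}()
for r in rows
    push!(get!(groups, r.sex, T[]), r)
end

# 2. apply and 3. combine: one summary row per stratum, sorted by key
strata_summary = [
    (sex = k,
     death_mean = mean(r.death for r in groups[k]),
     age_mean = mean(r.age for r in groups[k]))
    for k in sort(collect(keys(groups)))
]
```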

For split-apply-combine we will conduct the following analysis:

  • Stratified by sex and race, what is the proportion of deaths?
  • Stratified by sex and race, what is the mean age?

We start by creating a smaller data set with these four variables and describing them.

nhefs4 = nhefs[:,[:death,:sex,:age,:race]]
describe(nhefs4)
4×7 DataFrame
 Row  variable  mean      min    median   max    nmissing  eltype
      Symbol    Float64   Int64  Float64  Int64  Int64     DataType
   1  death     0.195212  0      0.0      1      0         Int64
   2  sex       0.509515  0      1.0      1      0         Int64
   3  age       43.9153   25     44.0     74     0         Int64
   4  race      0.131983  0      0.0      1      0         Int64

For “split” we use the groupby function, and for “apply” and “combine” we use the combine function. I also insert a column with a description of each stratum (eg, “white female”).

gdf = groupby(nhefs4, [:sex, :race])
results = combine(gdf,
    :death => mean,
    :age => mean)
insertcols!(results, 
    :strata => ["white male", "black male", "white female", "black female"])
4×5 DataFrame
 Row  sex    race   death_mean  age_mean  strata
      Int64  Int64  Float64     Float64   String
   1  0      0      0.239716    44.5262   white male
   2  0      1      0.276596    45.0638   black male
   3  1      0      0.141044    43.3808   white female
   4  1      1      0.190083    42.595    black female
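The strata labels above were typed in by hand, so they depend on the row order of the result. A safer option is to derive each label from the codes. This is a sketch with a hypothetical helper, stratum_label, which is my addition; the codes follow the data dictionary in the appendix (sex 0 = male, 1 = female; race 0 = white, 1 = black or other).

```julia
# Hypothetical helper: build a label from the sex and race codes.
# Race code 1 is "black or other" in the data dictionary; it is
# labelled "black" here only to match the labels used in the text.
function stratum_label(sex::Integer, race::Integer)
    sex_label = sex == 0 ? "male" : "female"
    race_label = race == 0 ? "white" : "black"
    return string(race_label, " ", sex_label)
end

labels = [stratum_label(s, r) for (s, r) in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

With DataFrames.jl, the helper could be applied row by row with transform!(results, [:sex, :race] => ByRow(stratum_label) => :strata).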

Conclusions:

  • In 1971, the mean age of males was slightly higher than the mean age of females.
  • By 1992, a higher proportion of males died.
  • By 1992, within sex strata, a higher proportion of Blacks died compared to whites.

3 Appendix: Data dictionary for NHEFS data file

Download the data dictionary, which is a Microsoft Excel XLS file.

Downloads.download("https://www.hsph.harvard.edu/miguel-hernan/wp-content" *
                   "/uploads/sites/1268/2012/10/NHEFS_Codebook.xls",
                   "NHEFS_Codebook.xls")
"NHEFS_Codebook.xls"

Then, open the file in MS Excel and manually save it as a CSV file. Although the Julia XLSX.jl package can read .xlsx files, it cannot read old .xls files (yet), so this is a workaround.

cb = CSV.read("NHEFS_Codebook.csv", DataFrame)
insertcols!(cb, 1, :row => 1:64)
cb[1:20,:]
20×3 DataFrame
Row row Variable name Description
Int64 String31 String
1 1 active IN YOUR USUAL DAY, HOW ACTIVE ARE YOU? IN 1971, 0:very active, 1:moderately active, 2:inactive
2 2 age AGE IN 1971
3 3 alcoholfreq HOW OFTEN DO YOU DRINK? IN 1971 0: Almost every day, 1: 2-3 times/week, 2: 1-4 times/month, 3: < 12 times/year, 4: No alcohol last year, 5: Unknown
4 4 alcoholhowmuch WHEN YOU DRINK, HOW MUCH DO YOU DRINK? IN 1971
5 5 alcoholpy HAVE YOU HAD 1 DRINK PAST YEAR? IN 1971, 1:EVER, 0:NEVER; 2:MISSING
6 6 alcoholtype WHICH DO YOU MOST FREQUENTLY DRINK? IN 1971 1: BEER, 2: WINE, 3: LIQUOR, 4: OTHER/UNKNOWN
7 7 allergies USE ALLERGIES MEDICATION IN 1971, 1:EVER, 0:NEVER
8 8 asthma DX ASTHMA IN 1971, 1:EVER, 0:NEVER
9 9 bithcontrol BIRTH CONTROL PILLS PAST 6 MONTHS? IN 1971 1:YES, 0:NO, 2:MISSING
10 10 birthplace CHECK STATE CODE - SECOND PAGE
11 11 boweltrouble USE BOWEL TROUBLE MEDICATION IN 1971, 1:EVER, 0:NEVER, ; 2:MISSING
12 12 bronch DX CHRONIC BRONCHITIS/EMPHYSEMA IN 1971, 1:EVER, 0:NEVER
13 13 cholesterol SERUM CHOLESTEROL (MG/100ML) IN 1971
14 14 chroniccough DX CHRONIC COUGH IN 1971, 1:EVER, 0:NEVER
15 15 colitis DX COLITIS IN 1971, 1:EVER, 0:NEVER
16 16 dadth DAY OF DEATH
17 17 dbp DIASTOLIC BLOOD PRESSURE IN 1982
18 18 death DEATH BY 1992, 1:YES, 0:NO
19 19 diabetes DX DIABETES IN 1971, 1:EVER, 0:NEVER, 2:MISSING
20 20 education AMOUNT OF EDUCATION BY 1971: 1: 8TH GRADE OR LESS, 2: HS DROPOUT, 3: HS, 4:COLLEGE DROPOUT, 5: COLLEGE OR MORE
cb[21:40,:]
20×3 DataFrame
Row row Variable name Description
Int64 String31 String
1 21 exercise IN RECREATION, HOW MUCH EXERCISE? IN 1971, 0:much exercise,1:moderate exercise,2:little or no exercise
2 22 hayfever DX HAY FEVER IN 1971, 1:EVER, 0:NEVER
3 23 hbp DX HIGH BLOOD PRESSURE IN 1971, 1:EVER, 0:NEVER, 2:MISSING
4 24 hbpmed USE HIGH BLOOD PRESSURE MEDICATION IN 1971, 1:EVER, 0:NEVER, ; 2:MISSING
5 25 headache USE HEADACHE MEDICATION IN 1971, 1:EVER, 0:NEVER
6 26 hepatitis DX HEPATITIS IN 1971, 1:EVER, 0:NEVER
7 27 hf DX HEART FAILURE IN 1971, 1:EVER, 0:NEVER
8 28 hightax82 LIVING IN A HIGHLY TAXED STATE IN 1982, High taxed state of residence=1, 0 otherwise
9 29 ht HEIGHT IN CENTIMETERS IN 1971
10 30 income TOTAL FAMILY INCOME IN 1971 11:<$1000, 12: 1000-1999, 13: 2000-2999, 14: 3000-3999, 15: 4000-4999, 16: 5000-5999, 17: 6000-6999, 18: 7000-9999, 19: 10000-14999, 20: 15000-19999, 21: 20000-24999, 22: 25000+
11 31 infection USE INFECTION MEDICATION IN 1971, 1:EVER, 0:NEVER
12 32 lackpep USELACK OF PEP MEDICATION IN 1971, 1:EVER, 0:NEVER
13 33 marital MARITAL STATUS IN 1971 1: Under 17, 2: Married, 3: Widowed, 4: Never married, 5: Divorced, 6: Separated, 8: Unknown
14 34 modth MONTH OF DEATH
15 35 nerves USE NERVES MEDICATION IN 1971, 1:EVER, 0:NEVER
16 36 nervousbreak DX NERVOUS BREAKDOWN IN 1971, 1:EVER, 0:NEVER
17 37 otherpain USE OTHER PAINS MEDICATION IN 1971, 1:EVER, 0:NEVER
18 38 pepticulcer DX PEPTIC ULCER IN 1971, 1:EVER, 0:NEVER
19 39 pica DO YOU EAT DIRT OR CLAY, STARCH OR OTHER NON STANDARD FOOD? IN 1971 1:EVER, 0:NEVER; 2:MISSING
20 40 polio DX POLIO IN 1971, 1:EVER, 0:NEVER
cb[41:64,:]
24×3 DataFrame
Row row Variable name Description
Int64 String31 String
1 41 pregnancies TOTAL NUMBER OF PREGNANCIES? IN 1971
2 42 price71 AVG TOBACCO PRICE IN STATE OF RESIDENCE 1971 (US$2008)
3 43 price71_82 DIFFERENCE IN AVG TOBACCO PRICE IN STATE OF RESIDENCE 1971-1982 (US$2008)
4 44 price82 AVG TOBACCO PRICE IN STATE OF RESIDENCE 1982 (US$2008)
5 45 qsmk QUIT SMOKING BETWEEN 1ST QUESTIONNAIRE AND 1982, 1:YES, 0:NO
6 46 race 0: WHITE 1: BLACK OR OTHER IN 1971
7 47 sbp SYSTOLIC BLOOD PRESSURE IN 1982
8 48 school HIGHEST GRADE OF REGULAR SCHOOL EVER IN 1971
9 49 seqn UNIQUE PERSONAL IDENTIFIER
10 50 sex 0: MALE 1: FEMALE
11 51 smokeintensity NUMBER OF CIGARETTES SMOKED PER DAY IN 1971
12 52 smkintensity 82_71 INCREASE IN NUMBER OF CIGARETTES/DAY BETWEEN 1971 and 1982
13 53 smokeyrs YEARS OF SMOKING
14 54 tax71 TOBACCO TAX IN STATE OF RESIDENCE 1971 (US$2008)
15 55 tax71_82 DIFFERENCE IN TOBACCO TAX IN STATE OF RESIDENCE 1971-1982 (US$2008)
16 56 tax82 TOBACCO TAX IN STATE OF RESIDENCE 1971 (US$2008)
17 57 tb DX TUBERCULOSIS IN 1971, 1:EVER, 0:NEVER
18 58 tumor DX MALIGNANT TUMOR/GROWTH IN 1971, 1:EVER, 0:NEVER
19 59 weakheart USE WEAK HEART MEDICATION IN 1971, 1:EVER, 0:NEVER
20 60 wt71 WEIGHT IN KILOGRAMS IN 1971
21 61 wt82 WEIGHT IN KILOGRAMS IN 1982
22 62 wt82_71 WEIGHT CHANGE IN KILOGRAMS
23 63 wtloss USE WEIGHT LOSS MEDICATION IN 1971, 1:EVER, 0:NEVER
24 64 yrdth YEAR OF DEATH

4 Updates from book author

  1. What is new in DataFrames.jl 1.5. Mar 24, 2023. Available from: https://bkamins.github.io/julialang/2023/03/24/df15.html
  2. Hunting for bugs in Julia for Data Analysis. Medium. March 3, 2023. Available from: https://medium.com/bkamins/hunting-for-bugs-in-julia-for-data-analysis-ed6f4d1ce6bd
  3. Errata and source code for the book are available from https://github.com/bkamins/JuliaForDataAnalysis

References

1. Kamiński B. Julia for data analysis. New York, NY: Manning Publications; 2023.
2. Aragón TJ, Novotny S, Enanoria W, Vugia DJ, Khalakdina A, Katz MH. Endemic cryptosporidiosis and exposure to municipal tap water in persons with acquired immunodeficiency syndrome (AIDS): A case-control study. BMC Public Health [Internet]. 2003 Jan;3(1). Available from: https://doi.org/10.1186/1471-2458-3-2

Footnotes

  1. I used RStudio on a 2021 MacBook Pro (M1 Max, 32 GB RAM) running macOS Ventura 13.2.1.

  2. “The reason for the difference is that Julia uses 64-bit integers and R uses 64-bit floats by default to do these computations.” Source: https://twitter.com/BogumilKaminski/status/1629968902456311818

  3. Notice that in Julia I can use 1000000000 or 1_000_000_000 for the number 1 billion.

  4. In the Pluto.jl package, vectors are displayed horizontally.

  5. To learn more visit https://freecontent.manning.com/using-multiple-dispatch-in-julia/.

  6. On the difference between using and import, see https://stackoverflow.com/questions/27086159/what-is-the-difference-between-using-and-import-in-julia-when-building-a-mod