library(nd.tidytuesday)
library(dplyr)
library(ggplot2)

Data import and inspection

The metadata table can be obtained with the datapasta package's addin, by copying the table available at

variable           class      description
rank               double     Popularity in their database of released passwords
password           character  Actual text of the password
category           character  What category does the password fall into?
value              double     Time to crack by online guessing
time_unit          character  Time unit to match with value
offline_crack_sec  double     Time to crack offline in seconds
rank_alt           double     Rank 2
strength           double     Strength = quality of password, where 10 is highest and 1 is lowest; please note that these are relative to these generally bad passwords
font_size          double     Used to create the graphic for KIB

The readr::read_csv function reports how the data was parsed; here it raised no errors or warnings.

The passwords dataset has 507 rows and 9 variables. It is not tidy: to use the value variable (time to crack by online guessing), we need to combine it with the time_unit information.
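As a minimal sketch of that step, value and time_unit can be combined into a single duration in seconds. The column name online_sec, the object names, and the conversion factors (a 30-day month and a 365-day year) are assumptions made for illustration; the rows are the first six rows of the dataset.

```r
library(dplyr)

# First six rows of the passwords data (values copied from the dataset).
psw_head <- tibble(
  password  = c("password", "123456", "12345678", "1234", "qwerty", "12345"),
  value     = c(6.91, 18.52, 1.29, 11.11, 3.72, 1.85),
  time_unit = c("years", "minutes", "days", "seconds", "days", "minutes")
)

# Seconds per unit. The month and year lengths are assumptions.
unit_to_sec <- c(
  seconds = 1,
  minutes = 60,
  hours   = 3600,
  days    = 86400,
  weeks   = 7 * 86400,
  months  = 30 * 86400,
  years   = 365 * 86400
)

# Look up the factor by unit name and scale the value.
psw_head <- psw_head %>%
  mutate(online_sec = value * unname(unit_to_sec[time_unit]))

psw_head
```

With every duration expressed in the same unit, the online times become directly comparable with offline_crack_sec.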

A glimpse of the passwords:

passwords %>% head() %>% pander::pander()
rank  password  category             value  time_unit  offline_crack_sec  rank_alt  strength  font_size
1     password  password-related     6.91   years      2.17               1         8         11
2     123456    simple-alphanumeric  18.52  minutes    1.11e-05           2         4         8
3     12345678  simple-alphanumeric  1.29   days       0.00111            3         4         8
4     1234      simple-alphanumeric  11.11  seconds    1.11e-07           4         4         8
5     qwerty    simple-alphanumeric  3.72   days       0.00321            5         8         11
6     12345     simple-alphanumeric  1.85   minutes    1.11e-06           6         4         8

Univariate study

There are 10 unique categories and 7 units of time. Each password appears only once in this dataset.
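Counts of this kind can be checked with dplyr::n_distinct. The sketch below runs on the first six rows only, so it returns smaller counts than the full dataset; the object names psw_head, counts and all_unique are hypothetical.

```r
library(dplyr)

# First six rows of the passwords data (values copied from the dataset).
psw_head <- tibble(
  password  = c("password", "123456", "12345678", "1234", "qwerty", "12345"),
  category  = c("password-related", rep("simple-alphanumeric", 5)),
  time_unit = c("years", "minutes", "days", "seconds", "days", "minutes")
)

# Number of distinct values in each column of interest.
counts <- psw_head %>%
  summarise(
    n_category  = n_distinct(category),
    n_time_unit = n_distinct(time_unit),
    n_password  = n_distinct(password)
  )

# TRUE when no password appears more than once.
all_unique <- !any(duplicated(psw_head$password))

counts
all_unique
```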

One third of the passwords are names; most are actually dictionary words.

Nearly 70% of the passwords considered are crackable online in less than a week, and 20% are guessable within a day.
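Shares like these follow from comparing the online cracking time, expressed in seconds, with a threshold. The sketch below uses only the first six passwords and assumes a 365-day year, so its proportions differ from the full-data figures quoted above; all object names are hypothetical.

```r
# Online cracking times in seconds for the first six passwords,
# converted assuming a 365-day year.
online_sec <- c(
  6.91 * 365 * 86400,  # password
  18.52 * 60,          # 123456
  1.29 * 86400,        # 12345678
  11.11,               # 1234
  3.72 * 86400,        # qwerty
  1.85 * 60            # 12345
)

day_sec  <- 86400
week_sec <- 7 * day_sec

# The mean of a logical vector is the proportion of TRUE values.
share_within_day  <- mean(online_sec < day_sec)
share_within_week <- mean(online_sec < week_sec)

share_within_day
share_within_week
```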

We can look at the character type distribution in the passwords.

library(stringr)

# Concatenate all characters of each element of `x` that match `pattern`.
extract_chars <- function(x, pattern) {
  purrr::map_chr(stringr::str_extract_all(string = x, pattern = pattern),
                 function(m) paste(m, collapse = ""))
}

# For each password, extract the characters of each type and count them.
passwd_type <- psw_clean %>% 
  mutate(password = as.character(password)) %>% 
  mutate(nb_char = nchar(password),
         
         num_part = extract_chars(password, "[:digit:]"),
         nchar_num = nchar(num_part),
         
         alpha_part = extract_chars(password, "[:alpha:]"),
         nchar_alpha = nchar(alpha_part),
         
         punct_part = extract_chars(password, "[:punct:]"),
         nchar_punct = nchar(punct_part),
         
         lower_part = extract_chars(password, "[:lower:]"),
         nchar_lower = nchar(lower_part),
         
         upper_part = extract_chars(password, "[:upper:]"),
         nchar_upper = nchar(upper_part)
  )

The passwords contain only numeric and alphabetic characters, and all alphabetic characters are lower case.

variable     mean  var   sd    min  1q  median  3q  max
nchar_num    0.54  2.79  1.67  0    0   0       0   9
nchar_alpha  5.66  3.88  1.97  0    6   6       7   8
nchar_punct  0     0     0     0    0   0       0   0
nchar_lower  5.66  3.88  1.97  0    6   6       7   8
nchar_upper  0     0     0     0    0   0       0   0
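When only the counts are needed, stringr::str_count gives them directly, without building the intermediate strings. This is an alternative sketch on the first six passwords, not the code that produced the table above; char_counts and num_summary are hypothetical names.

```r
library(dplyr)
library(stringr)

# First six passwords of the dataset.
passwords_head <- c("password", "123456", "12345678", "1234", "qwerty", "12345")

# Count the characters of each type in every password.
char_counts <- tibble(password = passwords_head) %>%
  mutate(
    nchar_num   = str_count(password, "[[:digit:]]"),
    nchar_alpha = str_count(password, "[[:alpha:]]"),
    nchar_punct = str_count(password, "[[:punct:]]"),
    nchar_lower = str_count(password, "[[:lower:]]"),
    nchar_upper = str_count(password, "[[:upper:]]")
  )

# Summary statistics for one count column, mirroring the table's columns.
num_summary <- char_counts %>%
  summarise(
    mean   = mean(nchar_num),
    var    = var(nchar_num),
    sd     = sd(nchar_num),
    min    = min(nchar_num),
    median = median(nchar_num),
    max    = max(nchar_num)
  )

char_counts
num_summary
```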

One can suspect a linear relationship between the online and offline cracking times.

This is confirmed by switching to a log-log scaled plot. The fitted formula is

\[\log(\text{offline}) = -18.322 + 0.987 \times \log(\text{online})\]
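A fit of this form can be sketched with lm on the log-transformed times. The example assumes natural logarithms, online times converted to seconds with a 365-day year, and only the first six passwords, so its coefficients will not match the full-data values exactly; the object names are hypothetical.

```r
# First six passwords: online time in seconds (assuming a 365-day year)
# and offline time in seconds as given in the dataset.
online_sec <- c(6.91 * 365 * 86400, 18.52 * 60, 1.29 * 86400,
                11.11, 3.72 * 86400, 1.85 * 60)
offline_sec <- c(2.17, 1.11e-05, 0.00111, 1.11e-07, 0.00321, 1.11e-06)

# Ordinary least squares on the natural-log scale.
fit <- lm(log(offline_sec) ~ log(online_sec))

# Intercept and slope of the log-log relationship.
coef(fit)
```

A slope close to 1 means the offline time is roughly proportional to the online time, with the intercept giving the log of the proportionality constant.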