Let's talk about backups
One of the projects I started earlier this year is automated backups of my personal files. This involves lots of photos and videos which have accumulated over the years, but also documents, source code, music and much more. By now, like many others, I have a Frankenstein solution of automatic uploads of photos and videos from my mobile, manual syncing of git repositories to GitHub, Bitbucket or GitLab, documents on Google Drive and manual synchronization of files via Dropbox and Drive. This works, but it doesn’t feel very robust, and I don’t feel in control of an ever-increasing amount of data.
The main problems I see are:
- data privacy: Should I give my private files to Google?
- missing automation: How to keep data up-to-date and consistent?
- difficult overview: What data do I have and where is it?
- slow uploads: Uploading a lot of data can take forever -_-
- lock-in effect: What if I want to move my data to a different provider?
In this post I want to discuss the issues outlined above to understand a bit better if it makes sense to implement another way of backing up personal data and what trade-offs that would involve.
Currently my main concern is data privacy. To be honest, this wasn’t always the case and I was using Dropbox and Google for many years. However, a couple of things have changed since I first hand-waved away my initial concerns about putting almost all my data into the hands of big cloud providers. For one, we all produce a lot more interconnected data on a daily basis than we did ten years ago. The photos I take now are geotagged and can be used to trace where I have been. Together with my calendar or social networks it should be pretty easy to make good guesses at what I was doing there and with whom. Furthermore, the messages I receive from or write to other people (e.g. via email and messengers) are mined for data to understand my current interests and predict future behaviour. I’m pretty sure the same is true for documents or other textual content I create and upload into the cloud.
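To make that concrete, here is a minimal sketch of how little it takes to join two such data sources. Everything in it is made up for illustration: the records, the field names and the `correlate` helper are my own assumptions and not any provider’s actual pipeline. It only shows that a timestamp is enough to link a geotagged photo to a calendar entry.

```python
from datetime import datetime

# Hypothetical, made-up records. Real photo metadata (EXIF) and calendar
# exports look different, but they carry the same kind of information.
photos = [
    {"file": "IMG_001.jpg", "taken": datetime(2018, 5, 12, 19, 30),
     "lat": 52.52, "lon": 13.40},
    {"file": "IMG_002.jpg", "taken": datetime(2018, 5, 13, 9, 0),
     "lat": 48.14, "lon": 11.58},
]
events = [
    {"title": "Dinner", "start": datetime(2018, 5, 12, 19, 0),
     "end": datetime(2018, 5, 12, 22, 0), "attendees": ["Alex"]},
]


def correlate(photos, events):
    """Pair each photo with every calendar event running when it was taken."""
    matches = []
    for photo in photos:
        for event in events:
            if event["start"] <= photo["taken"] <= event["end"]:
                matches.append({
                    "file": photo["file"],
                    "where": (photo["lat"], photo["lon"]),
                    "what": event["title"],
                    "with": event["attendees"],
                })
    return matches


for match in correlate(photos, events):
    print(match)
```

The second photo has no matching calendar entry, so it does not show up in the result. With more sources (messages, social networks) more of those gaps could presumably be filled in.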
In the age of big data all that information needs to be analyzed automatically to relate this multitude of data points and extract further knowledge. This means that companies invest a lot of effort in developing better classification and knowledge retrieval systems. Amongst others, this of course includes facial recognition and OCR for photos but also advanced processing of videos and natural language processing of text, just to name a few. Most of these systems nowadays use some kind of machine learning model to extract meta information and discover connections between the data. For instance Google advertises their Cloud Video Intelligence service like this:
Google Cloud Video Intelligence makes videos searchable, and discoverable, by extracting metadata with an easy to use REST API. […]
It quickly annotates videos stored in Google Cloud Storage, and helps you identify key entities (nouns) within your video; and when they occur within the video. […]
Since you can already search your Drive with natural language such as “find my budget spreadsheet from last December”, it’s probable they scan and classify all media content as well. A recent Motherboard article indicates this is already happening: sex workers found that their adult content was automatically disappearing from their Drive accounts without warning or explanation.
All of this got me thinking about what happens when there is an issue with the processing of the data that I upload. Will it be flagged for review when an algorithm thinks it may contain controversial content? Will a human operator be able (or required) to access my private files, maybe to comply with regulations in a foreign country? Will my provider grant that access to any third party, and would I be able to know this has happened? Should I trust my provider with my most intimate files such as photos of my family, confidential contracts or legal paperwork? What about content that concerns other people’s rights (e.g. photos and videos of them), who may not be OK with this data being uploaded and processed in the cloud? These are just some of the questions I have regarding the data privacy of my files.
Of course you could try to apply the “Nothing to hide” argument here, but I feel this kind of misses the point. I want to have complete control over my data, I want to explicitly grant access to my private photos and videos, and I believe I should be able to keep my documents and records to myself. Moreover, I should have the option to delete my data and know it is gone for good. Currently I’m not convinced all of that is given when storing (unencrypted) data at one of the major cloud storage providers. As former Google CEO Eric Schmidt said in an interview:
“If you have something that you don’t want anyone to know, then maybe you shouldn’t be doing it in the first place.”
In the context of backups I choose to read the “doing it” part of this quote as “uploading your data to a (sometimes unpaid) service which is generously offered by one of the big tech giants”. If you do not want them to use that data, then do not give it to them. I find this especially true for people who have the technical knowledge to try different options. However, there will always be a trade-off between privacy, security, functionality and convenience. Everybody will have to decide how much time and money they want to invest in pursuing these alternatives.
There is a lot more that could be said about data privacy, but this section must come to an end. Initially I planned this post to cover all the issues I outlined in the beginning, but it turns out this would significantly exceed the post length I am aiming for. Instead I decided to split this topic into multiple posts, each with a reading time of about five minutes. Next up will be the topic of automation. \ʕ◔ϖ◔ʔ/