
Collecting user data while protecting user privacy

Lots of companies want to collect data about their users. This is a good thing, generally; being data-driven is important, and it’s jolly hard to know where best to focus your efforts if you don’t know what your people are like. However, this sort of data collection also gives people a sense of disquiet; what are you going to do with that data about me? How do I get you to stop using it? What conclusions are you drawing from it? I’ve spoken about this sense of disquiet in the past, and you can watch (or read) that talk for a lot more detail about how and why people don’t like it.

So, what can we do about it? As I said, being data-driven is a good thing, and you can’t be data-driven if you haven’t got any data to be driven by. How do we enable people to collect data about you without compromising your privacy?

Well, there are some ways. Before I dive into them, though, a couple of brief asides: there are some people who believe that you shouldn’t be allowed to collect any data on your users whatsoever; that the mere act of wanting to do so is in itself a compromise of privacy. This is not addressed to those people. What I want is a way that both sides can get what they want: companies and projects can be data-driven, and users don’t get their privacy compromised. If what you want is that companies are banned from collecting anything… this is not for you. Most people are basically OK with the idea of data collection, they just don’t want to be victimised by it, now or in the future, and it’s that property that we want to protect.

Similarly, if you’re a company who wants to know everything about each individual one of your users so you can sell that data for money, or exploit it on a user-by-user basis, this isn’t for you either. Stop doing that.

Aggregation

The key point here is that, if you’re collecting data about a load of users, you’re usually doing so in order to look at it in aggregate; to draw conclusions about the general trends and the general distribution of your user base. And it’s possible to do that data collection in ways that maintain the aggregate properties of it while making it hard or impossible for the company to use it to target individual users. That’s what we want here: some way that the company can still draw correct conclusions from all the data when collected together, while preventing them from targeting individuals or knowing what a specific person said.

In the 1960s, Warner and Greenberg put together the randomised response technique for social science interviews. Basically, the idea here is that if you want to ask people questions about sensitive topics — have they committed a crime? what are their sexual preferences? — then you need to be able to draw aggregate conclusions about what percentages of people have done various things, but any one individual’s ballot shouldn’t be a confession that can be used against them. The technique varies a lot in exactly how it’s applied, but the basic concept is that for any question, there’s a random chance that the answerer should lie in their response. If some people lie in one direction (saying that they did a thing, when they didn’t), and the same proportion of people lie in the other direction (saying they didn’t do the thing when they did), then if you’ve got enough answerers, all the lies pretty much cancel out. So your aggregate statistics are still pretty much accurate — you know that X percent of people did the thing — but any one individual person’s response isn’t incriminating, because they might have been lying. This gives us the privacy protection we need for people, while preserving the aggregate properties that allow the survey-analysers to draw accurate conclusions.
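As a quick illustration of the binary version, here’s a small Python sketch (mine, not from the original talk): each respondent answers truthfully with probability p and lies otherwise, and because p is known, the surveyor can work backwards from the reported percentage to an estimate of the true one.

import random

def randomised_response(truth, p_truth=0.8):
    """Report the honest answer with probability p_truth, otherwise lie."""
    return truth if random.random() < p_truth else not truth

def estimate_true_rate(reported_yes_rate, p_truth=0.8):
    """Invert the known noise: reported = p*true + (1-p)*(1-true)."""
    return (reported_yes_rate - (1 - p_truth)) / (2 * p_truth - 1)

# Simulate 100,000 respondents, 30% of whom really did the sensitive thing.
population = [random.random() < 0.30 for _ in range(100_000)]
reports = [randomised_response(answer) for answer in population]

reported_rate = sum(reports) / len(reports)
print(f"reported yes rate:   {reported_rate:.3f}")                      # about 0.38
print(f"estimated true rate: {estimate_true_rate(reported_rate):.3f}")  # about 0.30

Any individual report might be a lie, but with enough respondents the estimate lands very close to the real proportion.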

It’s a bit like ticket inspectors on trains. Train companies realised a long time ago that you don’t need to put a ticket inspector on every single train. Instead, you can put inspectors on enough trains that the chance of fare-dodgers being caught is high enough that they don’t want to take the risk. This randomised response is similar; if you get a ballot from someone saying that they smoked marijuana, then you can’t know whether they were one of those who were randomly selected to lie about their answer, and therefore that answer isn’t incriminating, but the overall percentage of people who say they smoked will be roughly equal to the percentage of people who actually did.

A worked example

Let’s imagine you’re, say, an operating system vendor. You’d like to know what sorts of machines your users are installing on (Ubuntu are looking to do this as most other OSes already do), and so how much RAM those machines have would be a useful figure to know. (Lots of other stats would also be useful, of course, but we’ll just look at one for now while we’re explaining the process. And remember this all applies to any statistic you want to collect; it’s not particular to OS vendors, or RAM. If you want to know how often your users open your app, or what country they’re in, this process works too.)

So, we assume that the actual truth about how much RAM the users’ computers have looks something like this graph. Remember, the company does not know this. They want to know it, but they currently don’t.

So, how can they collect data to know this graph, without being able to tell how much RAM any one specific user has?

As described above, the way to do this is to randomise the responses. Let’s say that we tell 20% of users to lie about their answer, one category up or down. So if you’ve really got 8GB of RAM, then there’s an 80% chance you tell the truth, and a 20% chance you lie; 10% of users lie in a “downwards” direction, so they claim to have 4GB of RAM when they’ve actually got 8GB, and 10% of users lie in an “upwards” direction and claim to have 16GB. Obviously, we wouldn’t actually have the users lie — the software that collects this info would randomly either produce the correct information or not with the above probabilities, and people wouldn’t even know it was doing it; the deliberately incorrect data is only provided to the survey. (Your computer doesn’t lie to you about how much RAM it’s got, just the company.) What does that do to the graph data?
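Here’s a rough simulation of that scheme in Python (a sketch with a made-up RAM distribution standing in for the real graph; how lies behave at the ends of the scale isn’t specified above, so this version just clamps them to the nearest valid category):

import random
from collections import Counter

CATEGORIES = ["1GB", "2GB", "4GB", "8GB", "16GB", "32GB"]
# Invented "true" distribution that the company would like to learn.
TRUE_WEIGHTS = [5, 15, 30, 35, 10, 5]

def noisy_report(index):
    """80% truth, 10% one category down, 10% one category up (clamped at the edges)."""
    roll = random.random()
    if roll < 0.10:
        index -= 1
    elif roll < 0.20:
        index += 1
    return min(max(index, 0), len(CATEGORIES) - 1)

true_answers = random.choices(range(len(CATEGORIES)), weights=TRUE_WEIGHTS, k=100_000)
reported = [noisy_report(i) for i in true_answers]

true_counts, reported_counts = Counter(true_answers), Counter(reported)
for i, label in enumerate(CATEGORIES):
    print(f"{label:>5}: true {true_counts[i] / len(true_answers):6.1%}"
          f"   reported {reported_counts[i] / len(reported):6.1%}")

The reported histogram keeps roughly the same shape as the true one (peaks where the true data peaks, troughs where it troughs) even though no individual submission can be trusted.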

This graph shows the users who gave accurate information in green, and the deliberately incorrect answers in red. And the graph looks pretty much the same! Any one given user’s answers are unreliable and can’t be trusted, but the overall shape of the graph is pretty similar to the actual truth. There are still peaks at the most popular points, and still troughs at the unpopular ones. Each bar in the graph is reasonably accurate (accuracy figures are shown below each bar, and they’ll normally be around 90-95%, although because it’s random it may fluctuate a little for you). So our company can draw conclusions from this data, and they’ll be generally correct. They’ll have to take those conclusions with a small pinch of salt, because we’ve deliberately introduced inaccuracy into them, but the trends and the overall shape of the data will be good.

The key point here is that, although you can see in the graph which answers are truth and which are incorrect, the company can’t. They don’t get told whether an answer is truth or lies; they just get the information and no indication of how true it is. They’ll know the percentage chance that an answer is untrue, but they won’t know whether any one given answer is.

Can we be more inaccurate? Well, here’s a graph to play with. You can adjust what percentage of users’ computers lie about their survey results by dragging the slider, and see what that does to the data.

[Interactive slider in the original post: drag from 0% to 100% to set what proportion of submissions are deliberately incorrect; the default is 20%.]

Even if you make every single user lie about their values, the graph shape isn’t too bad. Lying tends to “flatten out” the graph: it makes tall peaks shorter and shallow troughs taller, and if every single person lies, things get flattened so much that the conclusions you draw are probably going to be wrong. But you can see from this that it ought to be possible to run the numbers and come up with a “lie” percentage which balances the company’s need for accurate information against the users’ need for their individual answers to be unreliable.

It is of course critical to this whole procedure that the lies cancel out, which means that they need to be evenly distributed. If everyone just makes up random answers then obviously this doesn’t work; answers have to start with the truth and then (maybe) lie in one direction or another.
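Because the probabilities of the different kinds of lie are fixed and known in advance, the flattening can also be undone: the lying step is just a linear transformation of the true histogram, so with enough data you can invert it. Here’s a sketch of that idea for the 80/10/10 scheme used above (the “reported” numbers are illustrative; with real, sampled data the recovered distribution would only be approximate):

import numpy as np

def noise_matrix(n_categories, p_truth=0.8):
    """M[i, j] = probability that a machine truly in category j reports category i."""
    p_lie = (1 - p_truth) / 2
    M = np.zeros((n_categories, n_categories))
    for j in range(n_categories):
        M[j, j] += p_truth
        M[max(j - 1, 0), j] += p_lie                   # lie one category down (clamped)
        M[min(j + 1, n_categories - 1), j] += p_lie    # lie one category up (clamped)
    return M

# Observed proportions from the survey, per RAM category.
reported = np.array([0.06, 0.155, 0.29, 0.32, 0.12, 0.055])

# Solve M @ true = reported for the underlying distribution.
estimated_true = np.linalg.solve(noise_matrix(len(reported)), reported)
print(np.round(estimated_true, 3))   # recovers roughly [0.05, 0.15, 0.3, 0.35, 0.1, 0.05]

This is essentially what “running the numbers” means: the company never learns who lied, but it can still correct its aggregate statistics for the known amount of lying.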

This is a fairly simple description of this whole process of introducing noise into the data, and data scientists would be able to bring much more learning to bear on this. For example, how much does it affect accuracy if user information can lie by more than one “step” in every direction? Do we make it so instead of n% truth and 100-n% lies, we distribute the lies normally across the graph with the centrepoint being the truth? Is it possible to do this data collection without flattening out the graph to such an extent? And the state of the data art has moved on since the 1960s, too: Dwork wrote an influential 2006 paper on differential privacy which goes into this in more detail. Obviously we’ll be collecting data on more than one number — someone looking for data on computers on which their OS is installed will want for example version info, network connectivity, lots of hardware stats, device vendor, and so on. And that’s OK, because it’s safe to collect this data now… so how do our accuracy figures change when there are lots of stats and not just one? There will be better statistical ways to quantify how inaccurate the results are than my simple single-bar percentage measure, and how to tweak the percentage-of-lying to give the best results for everyone. This whole topic seems like something that data scientists in various communities could really get their teeth into and provide great suggestions and help to companies who want to collect data in a responsible way.
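One concrete connection to that differential privacy work: for the simple yes/no version of randomised response, answering truthfully with probability p (for p above a half) satisfies ε-differential privacy with ε = ln(p / (1 - p)), which gives a principled dial for trading accuracy against privacy. A tiny sketch:

from math import log

def epsilon_for_truth_probability(p_truth):
    """Privacy parameter of simple binary randomised response (truth with probability p_truth)."""
    return log(p_truth / (1 - p_truth))

for p in (0.6, 0.7, 0.8, 0.9, 0.95):
    print(f"truth {p:.0%} of the time -> epsilon = {epsilon_for_truth_probability(p):.2f}")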

Of course, this applies to any data you want to collect. Do you want analytics on how often your users open your app? What times of day they do that? Which OS version they’re on? How long do they spend using it? All your data still works in aggregate, but the things you’re collecting aren’t so personally invasive, because you don’t know if a user’s records are lies. This needs careful thought — there has been plenty of research on deanonymising data and similar things, and the EFF’s Panopticlick project shows how a combination of data can be cross-referenced and that needs protecting against too, but that’s what data science is for; to tune the parameters used here so that individual privacy isn’t compromised while aggregate properties are preserved.

If a company is collecting info about you and they’re not actually interested in tying your submitted records to you (see previous point about how this doesn’t apply to companies who do want to do this, who are a whole different problem), then this in theory isn’t needed. They don’t have to collect IP addresses or usernames and record them against each submission, and indeed if they don’t want that information then they probably don’t do that. But there’s always a concern: what if they’re really doing that and lying about it? Well, this is how we alleviate that problem. Even if a company actually are trying to collect personally-identifiable data and they’re lying to us about doing that, it doesn’t matter, because we protect ourselves by — with a specific probability — lying back to them. And then everyone gets what they want. There’s a certain sense of justice in that.


Announcing lazymention: elegant outbound Webmention for static sites

This post also appeared on IndieNews.

Last night I hit publish on version 1.0.0 of a new project, lazymention! Whoohoo!

tl;dr: lazymention exists to add Webmention support to static sites!

To elaborate a little bit, I developed lazymention because I had a problem with this site: I wanted to send outbound Webmentions when I link to things, but my website is completely static. (Webmention, in case you didn't know, is a way to notify another website that you linked to them, so the other website can display some UI about your reply or whatever.) The page builds happen on my local machine, not on the server. One option would be to just send Webmentions from my local machine too, but this isn't really a good solution for a couple reasons. First, I couldn't do it automatically at build-time because the built pages wouldn't have been deployed to the server yet, so receivers of my Webmentions would reject the mentions due to the source being nonexistent. That meant that I would have to have a separate step, which wouldn't really be that big of a deal (lazymention requires pinging a server too) except for the second reason: I would need some way to keep track of where I'd already sent Webmentions to, and that would require synchronizing across computers. Probably the only decent way to do that would be to check it into Git, but having a program's data store checked in right next to the source code just feels kinda ugly. Plus, then it can't be shared with other people as a service.

So instead of doing it locally, I elected to build a server. Here's how it works: you mark up your stuff with h-feed and h-entry, and whenever anything happens (e.g. you publish a new blog post or whatever), you ping lazymention with the URL (either the feed or the post itself). lazymention will use your microformats2 markup to find the canonical location for a given post, then it will find all the links in the post and send Webmentions for them. And presto! You've just sent Webmentions for your blog. lazymention also records when it's sent mentions, so if you ping it again, nothing will happen unless you've updated your content. I'm also planning to add WebSub support to lazymention, and that'll work in the exact same way.

lazymention is super easy to get started with, especially because I've provided thorough documentation in the README. If you find anything that's confusing or missing, please let me know by filing an issue! I'd love to get it fixed. In fact, I'd be thrilled to hear about both positive and negative installation experiences.

Oh, and one more thing - lazymention is reusable in other applications. If you're writing a Node.js app and want to reuse its HTTP API, you can use its embedding API to get at the Express application and Router used internally. I'm not sure if people will actually find this useful, but I wrote it just for kicks anyway. See the embedding documentation for more!

Cheers, and happy mentioning! Elegant outbound Webmention for static sites is here.


Creating a Self-Hosted Alternative to Facebook Live using Nginx and Micropub

Facebook Live offers a seamless viewing experience for people to watch your livestream and then see an archived version after you're done broadcasting.

  • When you turn on your camera, a new Facebook post is created on your profile and indicates that you're broadcasting live.
  • When you stop broadcasting, Facebook automatically converts the video to an archived version and shows people the recording when they look at that post later.

I wanted to see if I could do this on my own website, without any third-party services involved. It turns out there is free software available to put this kind of thing together yourself!

The diagram below illustrates the various pieces involved. In this post, we'll walk through setting up each. In this setup, the streaming server is separate from your website. You can of course host both on the same server, but I found it was nicer to fiddle with the nginx settings on a separate server rather than recompiling and restarting nginx on my website's server.

Video Source

You should be able to use any RTMP client to stream video to the server! I've tested this setup with the following video sources:

  • Teradek Vidiu hardware encoder (connected to an HDMI switcher or camcorder)
  • On my Mac, I've used OBS, a cross-platform desktop application
  • On iOS, Larix Broadcaster (also available on Android)

The job of the video source is to perform the h.264 encoding and send the video stream to the RTMP endpoint on the streaming server. Once configured, starting the broadcast is as simple as starting the streaming device.
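If you want to test without a hardware encoder or a desktop app, ffmpeg itself can act as a video source. Something along these lines should work (the hostname, the rtmp application name, and the live stream name are placeholders that need to match the nginx configuration and player URL set up later in this post):

ffmpeg -re -i test-video.mp4 \
  -c:v libx264 -preset veryfast -b:v 2500k \
  -c:a aac -ar 44100 \
  -f flv rtmp://stream.example.com/rtmp/live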

Building the Streaming Server

Nginx with RTMP extension

The instructions below are a summary of this excellent guide.

  • Download build system dependencies
  • Download nginx source code
  • Download RTMP extension source code
  • Compile nginx with the extension

Download the build system dependencies

sudo apt-get install build-essential libpcre3 libpcre3-dev libssl-dev

Find the latest nginx source code at http://nginx.org/en/download.html

wget http://nginx.org/download/nginx-1.10.2.tar.gz

Download the rtmp module source

wget https://github.com/arut/nginx-rtmp-module/archive/master.zip

Unpack both and enter the nginx folder

tar -zxvf nginx-1.10.2.tar.gz
unzip master.zip
cd nginx-1.10.2

Build nginx with the rtmp module

./configure --with-http_ssl_module --add-module=../nginx-rtmp-module-master
make -j 4
sudo make install

Now you can start nginx!

sudo /usr/local/nginx/sbin/nginx

Configuration

The steps below will walk through the following. Comments are inline in the config files.

  • Set up the nginx configuration to accept RTMP input and output an HLS stream
  • Configure the event hooks to run the bash commands that will make Micropub requests and convert the final video to mp4
  • Set up the location blocks to make the recordings available via http
  • Ensure the folder locations we're using are writable by nginx (commands for this are shown after the config blocks below)

First, add the following server block inside the main http block.

server {
  server_name stream.example.com;

  # Define the web root where we'll put the player HTML/JS files
  root /web/stream.example.com/public;

  # Define the location for the HLS files
  location /hls {
    types {
      application/vnd.apple.mpegurl m3u8;
    }

    root /web/stream.example.com; # Will look for files in the /hls subdirectory

    add_header Cache-Control no-cache;

    # Allow cross-domain embedding of the files
    add_header Access-Control-Allow-Origin *;    
  }
}

Outside the main http block, add the following to set up the rtmp endpoint.

rtmp {
  # Enable HLS streaming
  hls on;
  # Define where the HLS files will be written. Viewers will be fetching these
  # files from the browser, so the `location /hls` above points to this folder as well
  hls_path /web/stream.example.com/hls;
  hls_fragment 5s;

  # Enable recording archived files of each stream
  record all;
  # This does not need to be publicly accessible since we'll convert and publish the files later
  record_path /web/stream.example.com/rec;
  record_suffix _%Y-%m-%d_%H-%M-%S.flv;
  record_lock on;

  # Define the two scripts that will run when recording starts and when it finishes
  exec_publish /web/stream.example.com/publish.sh;
  exec_record_done /web/stream.example.com/finished.sh $path $basename.mp4;

  access_log logs/rtmp_access.log combined;
  access_log on;

  server {
    listen 1935;
    chunk_size 4096;

    application rtmp {
      live on;
      record all;
    }
  }
}
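Next, make sure the directories referenced in the config exist and are writable by the nginx worker process. The exact user depends on your system (a from-source build often defaults to nobody, while www-data is common on Debian/Ubuntu packages; check the user directive in your nginx.conf), so treat these commands as a sketch:

sudo mkdir -p /web/stream.example.com/hls /web/stream.example.com/rec /web/stream.example.com/public/archive
sudo chown -R www-data:www-data /web/stream.example.com

# Once publish.sh and finished.sh (created below) are in place:
sudo chmod +x /web/stream.example.com/publish.sh /web/stream.example.com/finished.sh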

Starting Streaming

When a stream starts, the nginx extension will run the script defined by the exec_publish hook. We'll set up this script to create a new post on your website via Micropub. This post will contain the text "Streaming Live" and will include HTML with an iframe containing the <video> tag and the necessary Javascript to enable the video player.

The nginx extension takes care of building the HLS files that the player uses, and will broadcast the input stream to any client that connects.

Your server will need to support Micropub for this command to work. Micropub is a relatively simple protocol for creating and updating posts on your website. You can find Micropub plugins for various software, or write your own code to handle the request. For the purposes of this example, you will need to manually generate an access token and paste it into the scripts below.

Save the following as publish.sh

#!/bin/bash

file_root="/web/stream.example.com/rec"
web_root="http://stream.example.com"

micropub_endpoint=https://you.example.com/micropub
access_token=123123123

# Create the post via Micropub and save the URL
url=`curl -i $micropub_endpoint -H "Authorization: Bearer $access_token" \
  -H "Content-Type: application/json" \
  -d '{"type":"h-entry","properties":{"content":{"html":"<p>Streaming Live</p><iframe width=\"600\" height=\"340\" src=\"http://stream.example.com/live.html\"></iframe>"}}}' \
  | grep Location: | sed -En 's/^Location: (.+)/\1/p' | tr -d '\r\n'`

# Write the URL to a file
echo $url > $file_root/last-url.txt

When the Broadcast is Complete

When the source stops broadcasting, the nginx extension will run the script defined by the exec_record_done hook. This script will eventually update the post with the final mp4 video file so that it appears archived on your website.

  • Update the post to remove the iframe and replace it with a message saying the stream is over and the video is being converted
  • Do the conversion to mp4 (this may take a while depending on the length of the video)
  • Create a jpg thumbnail of the video
  • Update the post, removing the placeholder content and replacing it with the thumbnail and final mp4 file

Save the following as finished.sh

#!/bin/bash

input_file=$1
video_filename=$2
# Define the location that the publicly accessible mp4 files will be served from
output=/web/stream.example.com/public/archive/$2;

file_root="/web/stream.example.com/rec"
web_root="http://stream.example.com"

micropub_endpoint=https://you.example.com/micropub
access_token=123123123

# Find the URL of the last post created
url=`cat $file_root/last-url.txt`

# Replace the post with a message saying the stream has ended
curl $micropub_endpoint -H "Authorization: Bearer $access_token" \
  -H "Content-Type: application/json" \
  -d "{\"action\":\"update\",\"url\":\"$url\",\"replace\":{\"content\":\"<p>The live stream has ended. The archived version will be available here shortly.</p>\"}}"

# Convert the recorded stream to mp4 format, making it available via HTTP
/usr/bin/ffmpeg -y -i $input_file -acodec libmp3lame -ar 44100 -ac 1 -vcodec libx264 $output;
video_url="$web_root/archive/$video_filename"

# Generate a thumbnail and send it as the photo
ffmpeg -i $output -vf "thumbnail,scale=1920:1080" -frames:v 1 $output.jpg
photo_url="$web_root/archive/$video_filename.jpg"

# Replace the post with the video and thumbnail (Micropub update)
curl $micropub_endpoint -H "Authorization: Bearer $access_token" \
  -H "Content-Type: application/json" \
  -d "{\"action\":\"update\",\"url\":\"$url\",\"replace\":{\"content\":\"<p>The live stream has ended. The archived video can now be seen below.</p>\"},\"add\":{\"video\":\"$video_url\",\"photo\":\"$photo_url\"}}"

Note that your Micropub endpoint must support JSON updates, as well as recognizing the photo and video properties as URLs rather than file uploads. The filenames sent will be unique, so it's okay for your website to link directly to the URLs provided, but your endpoint may also want to download the video and serve it locally instead.

Web Player

We'll host the HLS video player on the streaming server, so that you don't have to worry about uploading this javascript to your website. We'll use video.js with the HLS plugin.

Create a file live.html in the web root and copy the following HTML.

<!DOCTYPE html>
<html>
<head>
  <link href="https://vjs.zencdn.net/5.8.8/video-js.css" rel="stylesheet">
  <style type="text/css">
    body {
      margin: 0;
      padding: 0;
    }
  </style>
</head>
<body>
  <video id="video-player" width="600" height="340" class="video-js vjs-default-skin" controls>
    <source src="http://stream.example.com/hls/live.m3u8" type="application/x-mpegURL">
  </video>

  <script src="https://vjs.zencdn.net/5.8.8/video.js"></script>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/videojs-contrib-hls/3.6.12/videojs-contrib-hls.js"></script>
  <script>
  var player = videojs('video-player');
  player.play();
  </script>
</body>
</html>

Now when you view live.html in your browser, it will load the streaming player and let you start playing the stream! This is the file that we'll be using in an iframe in posts on your website.

Setting up your Website

As previously mentioned, the scripts above use Micropub to create and update posts. If your website is a fully conformant Micropub endpoint, you shouldn't need to do anything special for this to work!

You will need to make sure that your website allows Micropub clients to create posts with HTML content. You will also need to ensure your endpoint supports the photo and video properties supplied as a URL. You can hotlink the URLs your endpoint receives instead of downloading the files if you want, or your endpoint can download a copy of the video and serve it locally.

Realtime Updates

To really make this shine, there are a few things you can do to enable realtime updates of your posts for viewers.

  • When your Micropub endpoint creates or updates a post, broadcast the HTML of the post on an nginx push-stream channel, and use Javascript on your home page to insert the post at the top of your feed.
  • Use WebSub (formerly known as PubSubHubbub) to publish updates of your home page to subscribers who may be reading your website from a reader.

Doing this will mean someone who has your home page open in a browser will see the new livestream appear at the top as soon as you start broadcasting, and they'll be able to see it change to the archived video when you're done. People following you in a reader will see the new post with the streaming player when the reader receives the WebSub notification!
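For the push-stream approach in the first bullet, the rough shape is a publisher location that your Micropub endpoint POSTs the rendered post HTML to, and a subscriber location that the JavaScript on your home page listens on. This sketch assumes nginx on your website's server was built with the nginx-push-stream-module (a separate module from the RTMP one), and the location names and channel parameter are just placeholders:

# In the http block
push_stream_shared_memory_size 32M;

server {
  # ... your website's existing server config ...

  # Your Micropub endpoint POSTs the post HTML here when it creates or updates a post
  location /streaming/publish {
    push_stream_publisher admin;
    push_stream_channels_path $arg_channel;
    allow 127.0.0.1;
    deny all;
  }

  # Browsers connect here (e.g. via EventSource) to receive those updates
  location /streaming/sub {
    push_stream_subscriber eventsource;
    push_stream_channels_path $arg_channel;
  }
}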

Publish Once, Syndicate Elsewhere

Since the nginx RTMP extension supports rebroadcasting the feed to other services, you can even configure it to also broadcast to Facebook Live or YouTube!

You'll need to find the RTMP endpoint for your Facebook or YouTube Live account, and configure a new block in your nginx settings.
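The RTMP module's push directive handles the rebroadcasting. Inside the application block from the streaming server config earlier, it looks something like this (the ingest URLs and stream keys are placeholders you get from Facebook Live or YouTube):

application rtmp {
  live on;
  record all;

  # Rebroadcast the incoming stream to other RTMP endpoints
  push rtmp://<facebook-live-ingest-url>/<stream-key>;
  push rtmp://<youtube-live-ingest-url>/<stream-key>;
}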

Doing this means you can use Facebook and YouTube as additional syndications of your live stream to increase your exposure, or treat them as an automatic backup of your videos!


New side project: Indie Map

I’m launching a new side project today! Indie Map is a public IndieWeb social graph and dataset. It’s a complete crawl of 2300 of the most active IndieWeb sites, sliced and diced and rolled up in a few useful ways.

The IndieWeb’s raison d’être is to do social networking on individual personal web sites instead of centralized silos. Some parts have been fairly straightforward to decentralize – publishing, reading, interacting – but others are more difficult. Social graph and data mining fall squarely in the latter camp, which is why the community hasn’t tackled them much so far. I hope this inspires us to do more!

Indie Map was announced at IndieWeb Summit 2017. Check out the slide deck and video (soon!) for more details. Also on IndieNews.