Carter Yagemann

Assistant Professor of Computer Science and Engineering at the Ohio State University with interests in automated vulnerability discovery, root cause analysis, exploit prevention, and cyber-physical security.

H&R Block "MyBlock" App + USA Government Website Analytics = PROFIT


I like data mining. For better or worse, it's the gold of the digital age. So when the USA government decided to make the analytical data for their public-facing websites available for download, I jumped at the opportunity. Thanks to this lovely data source, I can get insights into how popular various browsers and operating systems are, how frequently devices connect to USA government websites from foreign IP addresses, and more.

Sadly, the website only offers metrics for the past 30 days. Luckily, it's pretty easy to set up a Raspberry Pi or other small device to periodically fetch the freshest numbers and build a larger dataset. This is what I've been doing since August of 2016. If you're interested, send me an email and I'll be happy to share. After all, according to the government's website: "this website and its data are free for you to use without restriction."
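
As a rough illustration of how overlapping 30-day snapshots can be folded into one larger dataset, here is a minimal Python 3 sketch. It assumes each snapshot has already been reduced to date,user-agent,count rows (the layout the script later in this post parses). The names merge_snapshot and read_rows are hypothetical helpers, and the download step itself is left out.

```python
import csv
import io

def read_rows(text):
    """Parse date,user-agent,count rows, skipping headers and malformed lines."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) == 3 and row[2].isdigit():
            rows.append((row[0], row[1], int(row[2])))
    return rows

def merge_snapshot(dataset, rows):
    """Merge rows into dataset, keyed by (date, user_agent).

    Assumption: when the same key shows up in several snapshots, the most
    recently merged value wins, since an earlier download may have caught
    a day before its count was final.
    """
    for date, user_agent, count in rows:
        dataset[(date, user_agent)] = count
    return dataset

# Two made-up snapshots that overlap on 2018-01-08.
ua = 'HRB-MOBILE-IOS-PHONE-MYBLOCK-TOUCHID-6.1.0-Mozilla'
first = '2018-01-07,%s,10\n2018-01-08,%s,5\n' % (ua, ua)
second = '2018-01-08,%s,12\n2018-01-09,%s,7\n' % (ua, ua)

dataset = {}
merge_snapshot(dataset, read_rows(first))
merge_snapshot(dataset, read_rows(second))
print(len(dataset))  # 3 distinct (date, user-agent) keys
```

Keying on (date, user-agent) means a day that appears in several snapshots is only counted once, so the merged totals are not inflated by the overlap.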

Continuing my story, I was skimming over the most recent metrics when I noticed a funny browser user-agent:

HRB-MOBILE-IOS-PHONE-MYBLOCK-TOUCHID-6.1.0-Mozilla

With a quick search, I figured out that MyBlock is a mobile app offered by H&R Block. More interesting, though, is the juicy information H&R Block decided to embed in these user-agent strings. As we can see, they contain the name of the app, the version number, the OS (iOS or Android), the device form factor (phone or tablet), and in the case of iOS, even whether TouchID or FaceID was used. As a security researcher, I'm particularly interested in this last tidbit because people use H&R Block to file taxes and these user-agents started appearing January 7, 2018 (i.e., tax season). So how many people use the various authentication methods offered by Apple to protect their tax filing app? Let's find out!

The following is a small Python script I wrote to filter the data. The parsing and filtering leave much to be desired, but I didn't want to spend too much time on such a simple task:

#!/usr/bin/env python2
# Note: this script uses print statements, so it needs Python 2.
import sys
import json

def parse_subtokens(tokens):
    """ Parses subtokens and returns a dictionary. If invalid, None is returned.

    We expect Android user agents to be in the form of:
        HRB MOBILE ANDROID [PHONE|TABLET] MYBLOCK [VERSION] <BROWSER>
    and iOS user agents to be in the form of:
        HRB MOBILE IOS [PHONE|TABLET] MYBLOCK <TOUCHID|FACEID> [VERSION] [BROWSER]
    where [] marks a required field and <> marks an optional one.
    """
    res = {}

    # Guard against short user agents before indexing into the tokens.
    if len(tokens) < 5 or tokens[0] != 'HRB':
        return None

    if tokens[1] != 'MOBILE':
        return None

    if tokens[2] != 'ANDROID' and tokens[2] != 'IOS':
        return None
    res['OS'] = tokens[2]

    if tokens[3] != 'PHONE' and tokens[3] != 'TABLET':
        return None
    res['DEVICE'] = tokens[3]

    if tokens[4] != 'MYBLOCK':
        return None
    res['APP'] = tokens[4]

    if tokens[2] == 'ANDROID':
        if len(tokens[5:]) == 1:
            res['BROWSER'] = 'N/A'
            res['VERSION'] = tokens[-1]
            res['AUTH'] = 'N/A'
        elif len(tokens[5:]) == 2:
            res['BROWSER'] = tokens[-1]
            res['VERSION'] = tokens[-2]
            res['AUTH'] = 'N/A'
        else:
            return None

    if tokens[2] == 'IOS':
        if len(tokens[5:]) == 2:
            res['BROWSER'] = tokens[-1]
            res['VERSION'] = tokens[-2]
            res['AUTH'] = 'N/A'
        elif len(tokens[5:]) == 3:
            res['BROWSER'] = tokens[-1]
            res['VERSION'] = tokens[-2]
            res['AUTH'] = tokens[-3]
        else:
            return None

    # Cleanups:
    #     1) Some versions of the Android app prefix 'v' onto version

    # startswith() is safe even if the version token is empty.
    if res['VERSION'].startswith('v'):
        res['VERSION'] = res['VERSION'][1:]

    assert len(res) == 6
    return res

def is_hrb(year, filter, line):
    """ Validate that a line should be parsed and added to the buckets.

    Specifically, entry should contain the right year, be a HRB user-agent,
    and contain the filter keyword if one was provided.
    """
    if line[:4] != year:
        return False
    if line[11:14] != 'HRB':
        return False
    if filter is not None and filter not in line:
        return False
    return True

if __name__ == '__main__':

    if len(sys.argv) < 3:
        # The filter keyword is optional; tax-year and filepath are required.
        print 'Usage:', sys.argv[0], '<tax-year>', '[filter]', '<filepath>'
        sys.exit(1)

    if len(sys.argv) == 3:
        filter = None
    else:
        filter = sys.argv[2]

    with open(sys.argv[-1], 'r') as ifile:
        data = [line.strip() for line in ifile if is_hrb(sys.argv[1], filter, line)]

    buckets = {
        'OS': {
            'IOS':     0,
            'ANDROID': 0,
        },
        'DEVICE': {
            'PHONE':   0,
            'TABLET':  0,
        },
        'APP': {
            'MYBLOCK': 0,
        },
        'AUTH': {
            'TOUCHID': 0,
            'FACEID':  0,
            'N/A':     0,
        },
        'VERSION': {},
        'BROWSER': {},
    }

    for line in data:
        tokens = line.split(',')
        if len(tokens) != 3:
            print 'WARNING: Cannot tokenize:', line
            continue

        subtokens = parse_subtokens(tokens[1].split('-'))
        if subtokens is None:
            print 'WARNING: Cannot subtokenize:', tokens[1].split('-')
            continue

        try:
            count = int(tokens[-1])
        except ValueError:
            print 'WARNING: Could not parse count from:', line
            continue

        buckets['OS'][subtokens['OS']]         += count
        buckets['DEVICE'][subtokens['DEVICE']] += count
        buckets['APP'][subtokens['APP']]       += count
        buckets['AUTH'][subtokens['AUTH']]     += count

        if subtokens['VERSION'] in buckets['VERSION']:
            buckets['VERSION'][subtokens['VERSION']] += count
        else:
            buckets['VERSION'][subtokens['VERSION']] = count

        if subtokens['BROWSER'] in buckets['BROWSER']:
            buckets['BROWSER'][subtokens['BROWSER']] += count
        else:
            buckets['BROWSER'][subtokens['BROWSER']] = count

    print json.dumps(buckets, indent=4)
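
For comparison, the same user-agent layout can also be matched with a single regular expression. The following is a minimal Python 3 sketch, not the script that produced the results in this post. UA_PATTERN and parse_user_agent are hypothetical names, and the pattern makes two assumptions of its own: versions are dotted numbers with an optional leading "v", and the authentication and browser fields are treated as optional on both platforms, which is slightly looser than parse_subtokens.

```python
import re

# Assumed layout, taken from the parse_subtokens docstring:
#   HRB-MOBILE-<OS>-<DEVICE>-MYBLOCK[-<AUTH>]-<VERSION>[-<BROWSER>]
UA_PATTERN = re.compile(
    r'^HRB-MOBILE-(?P<OS>ANDROID|IOS)-(?P<DEVICE>PHONE|TABLET)-(?P<APP>MYBLOCK)'
    r'(?:-(?P<AUTH>TOUCHID|FACEID))?'    # optional authentication keyword
    r'-v?(?P<VERSION>\d+(?:\.\d+)*)'     # version, dropping an optional 'v' prefix
    r'(?:-(?P<BROWSER>[^-]+))?$'         # optional trailing browser token
)

def parse_user_agent(user_agent):
    """Return a dict of the six fields, or None if the string does not match."""
    match = UA_PATTERN.match(user_agent)
    if match is None:
        return None
    # Fields that were absent come back as None; report them as 'N/A'.
    return {key: value or 'N/A' for key, value in match.groupdict().items()}

for sample in ('HRB-MOBILE-IOS-PHONE-MYBLOCK-TOUCHID-6.1.0-Mozilla',
               'HRB-MOBILE-ANDROID-PHONE-MYBLOCK-v6.1.0',
               'HRB-MOBILE-IOS-PHONE-TAXES-6.4-Mozilla'):
    print(sample, '->', parse_user_agent(sample))
```

The last sample is rejected because the pattern only accepts the MYBLOCK app name, mirroring the check in the script above.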

Results

So here's what I uncovered, listed in no particular order:

  • From January 7 through February 8, 232,248 requests were made by MyBlock apps.
  • 230,226 requests were made from phones while 2,022 were made from tablets; over 99% of the requests came from phones.
  • 0 requests were made by Android tablets.
  • Over 99% of requests were made by devices running iOS.
  • Two versions of the app appear in the dataset: 6.0.0 and 6.1.0.
  • Version 6.1.0 makes up over 99% of the requests.
  • The first requests made by version 6.1.0 occurred on January 13, six days after the first 6.0.0 request.
  • 100% of requests from Android devices were version 6.1.0.
  • The requests made from Android devices contain no information about authentication method or browser.
  • 100% of requests from iOS contain "Mozilla" in the user-agent.

And finally, the observations relevant to my question:

  • 170,816 requests used TouchID, 15,323 used FaceID, and 45,867 showed neither keyword; roughly 74%, 7%, and 20%, respectively.
  • 0% of requests for version 6.0.0 on iOS used FaceID.
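
The authentication shares can be re-derived from the three counts with a few lines of Python 3. This is only a sketch of the arithmetic, and it assumes the denominator is the sum of the three counts (232,006), which is slightly smaller than the 232,248 total requests reported earlier.

```python
# Counts copied from the observation above.
auth_counts = {'TouchID': 170816, 'FaceID': 15323, 'neither': 45867}

# Assumption: shares are taken over the sum of these three counts.
total = sum(auth_counts.values())

shares = {name: 100.0 * count / total for name, count in auth_counts.items()}

for name, share in shares.items():
    print('{}: {:.1f}%'.format(name, share))
```

Swapping in 232,248 as the denominator moves each share by less than a tenth of a percentage point, so the choice doesn't change the picture.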

Discussion

For the requests from iOS devices that didn't mention an authentication method in their user-agent, I assume the user typed in a password or PIN, though I haven't confirmed this. I also haven't looked into why all the iOS requests have "Mozilla" at the end of their user-agent. It's probably related to the browser framework used by the MyBlock app.

Judging by the fact that no requests from version 6.0.0 of the app used FaceID, it's possible that this feature wasn't implemented until 6.1.0, though this is just speculation.

Most interestingly, users appear to be comfortable using Apple's TouchID to protect their MyBlock app. It's also notable that people are comfortable using FaceID, considering that this feature is relatively new. At least among iOS users of this app, biometric authentication appears to be widely accepted.

It's also worth mentioning that while MyBlock doesn't appear to have been available during the 2017 tax season, another H&R Block app does appear:

HRB-MOBILE-IOS-PHONE-TAXES-6.4-Mozilla

This app seems to have two versions, 6.4 and 6.3, but the total number of requests is very low: only a few thousand. Another interesting finding is 13 requests made on April 26, 2017 with this user-agent:

HRB-MOBILE-IOS-PHONE-TAXES-nil-Mozilla

Perhaps this was a test version of the app?

Future Work

We still have two months to go in this year's tax season, so I'll be interested to check the numbers once the season closes. I'm also interested to see how many people continue to use this app outside of tax season and how these results will change in 2019.