But we haven't seen much in the way of "how would an ISP implement this?"
Which means it's up to us, implementers of a PCSO-state adjacent to the Bearpit, to look at how to build this. As proof we know what we are talking about, here is our 8TB JBOD storage infrastructure (running Xubuntu server 14) with a Top of Rack (ToR) switch consisting of a Netgear base station flashed to DD-WRT. The sticker is some recruitment from "Google reliability engineering".
The new bill requires ISPs to store an "Internet Connection Record (ICR)" of every connection, and allow government access to it on demand.
How would an ISP implement such a system?
First: What is an "Internet Connection Record (ICR)"? The bill has made up something which doesn't really exist and then told ISPs to maintain it.
It is probably a record of: the originating IP address/port of a TCP connection, the destination IP/port, and the duration of the connection. "Probably", because the bill uses the term "event".
7(9) In this Part “relevant communications data” means communications data which may be used to identify, or assist in identifying, any of the following—
(a) the sender or recipient of a communication (whether or not a person),
(b) the time or duration of a communication,
(c) the type, method or pattern, or fact, of communication,
(d) the telecommunication system (or any part of it) from, to or through which, or by means of which, a communication is or may be transmitted,
(e) the location of any such system, or
(f) the internet protocol address, or other identifier, of any apparatus to which a communication is transmitted for the purpose of obtaining access to, or running, a computer file or computer program.
In this subsection “identifier” means an identifier used to facilitate the transmission of a communication.
The GCHQ/Bristol university "Problem Book (redacted)" appears to cover this very problem in the "five alive" dataset of IP/IP communications (p70), with the dataset being a list of records of (start-time/8, source-IP/4, source port/2, dest-IP/4, dest port/2, protocol/2, [data size]/8). Using the byte sizes given by the /N values, that's 30 bytes per communication; with the move to IPv6 driven by mobile phones, you'd need 16 bytes per source and dest address, or 54 bytes/record. That's for every HTTP request, Skype call, BitTorrent block share, PS4 game setup. If they include DNS records, it's for every nslookup command issued.
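To make the record size concrete, here is a minimal sketch that lays out the fields listed above with Python's struct module. The field order and widths come from the (name/bytes) list; the IPv6 layout is our assumption (same fields, addresses widened to 16 bytes), and pack_ipv4 is a made-up helper name. With no padding the IPv6 layout packs to 54 bytes; alignment or an extra field would push it higher.

```python
import ipaddress
import struct

# "!" = network byte order, no padding between fields.
# Q = 8 bytes, H = 2 bytes, 4s / 16s = raw address bytes.
IPV4_ICR = struct.Struct("!Q4sH4sHHQ")    # start, src, sport, dst, dport, proto, size
IPV6_ICR = struct.Struct("!Q16sH16sHHQ")  # assumed IPv6 variant of the same record


def pack_ipv4(start, src, sport, dst, dport, proto, size):
    """Pack one IPv4 connection record into its fixed-width binary form."""
    return IPV4_ICR.pack(
        start,
        ipaddress.IPv4Address(src).packed, sport,
        ipaddress.IPv4Address(dst).packed, dport,
        proto, size,
    )


record = pack_ipv4(1446750000, "192.0.2.10", 51000, "198.51.100.7", 443, 6, 12345)
print(IPV4_ICR.size, IPV6_ICR.size, len(record))  # 30 54 30
```

The addresses are from the documentation ranges, not real hosts.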
That's a lot of data in a world of phones and home network connections.
How to store all of this?
We see a number of strategies.
The "fuck off Theresa" strategy
Here your ISP implements "for security reasons" an isolated system which collects ICRs, but for which the only way to retrieve them is a 1970s-era punched card reader you have to actually walk to. If HMG asks for something, you say "next Wednesday? Come on by. Bring your query on a prepared punched card and we'll have the printer ready for the output. Oh, and we'll bill you for the ink."
The problem here is on P135: "The Secretary of State may make regulations imposing specified obligations on relevant operators, or relevant operators of a specified description.", which includes "obligations to provide facilities or services of a specified description;". You could offer the "fuck off then" system, and they'd say "no, we want this". The ISP would not get a choice in the matter.
The TalkTalk fiasco
A Linux server running MySQL with a front end of an unpatched PHP web application accessible on the open internet over an unencrypted HTTP connection.
MySQL does have the lowest cost/TB of storage out there, and while a single MySQL server doesn't grow, you can scale via sharding, with each server storing the mass surveillance records of a few tens of thousands of users. The problems:
1. It doesn't handle the "list me everyone who used twitter between 9 and 9:30 pm" query without going to every single database, pushing out the same query to each (SELECT * FROM icr WHERE icr.endpoint = 'twitter.com' AND icr.time > '21:00' AND icr.time < '21:30'), and then somehow merging the results.
2. The people building it being utterly out of their depth, that same web UI will be accessible from your customer billing form. That'll be a massive security disaster, unless the people breaking in are a group like Anonymous, where it's more likely to be a contribution ("; DROP TABLE users") or an attack on public targets (SELECT * FROM icr WHERE user = 'Theresa May').
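For what the "ask every shard, then merge" step looks like, here is a minimal sketch using in-memory SQLite databases as stand-ins for the MySQL shards. The icr table layout, the HH:MM time strings and the helper names are all our assumptions for illustration. The query is parameterised, which is also the usual defence against the injection games in point 2.

```python
import sqlite3


def make_shard(rows):
    """Create one in-memory shard holding (user, endpoint, time) rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE icr (user TEXT, endpoint TEXT, time TEXT)")
    db.executemany("INSERT INTO icr VALUES (?, ?, ?)", rows)
    return db


def query_all_shards(shards, endpoint, start, end):
    """Run the same parameterised query on every shard and merge by time."""
    sql = ("SELECT user, endpoint, time FROM icr "
           "WHERE endpoint = ? AND time > ? AND time < ?")
    merged = []
    for db in shards:
        merged.extend(db.execute(sql, (endpoint, start, end)).fetchall())
    return sorted(merged, key=lambda row: row[2])


shards = [
    make_shard([("alice", "twitter.com", "21:05"), ("bob", "example.org", "21:10")]),
    make_shard([("carol", "twitter.com", "21:20"), ("dave", "twitter.com", "22:00")]),
]
hits = query_all_shards(shards, "twitter.com", "21:00", "21:30")
print(hits)
```

Every query costs one round trip per shard, so the work grows with the number of servers rather than with the size of the answer.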
The outsourced consultancy disaster
The design here is less a design than a process. Go to a global/national software consultancy (e.g. Capita). Give them lots of money. Wait. Eventually get some software that doesn't work very well, designed to run on very expensive hardware.
This is essentially how major government projects like Universal Credit and any NHS-wide computer systems turn out. The problem here is the "ocean boiling grand vision", along with an inability to adjust features to meet unrealistic deadlines, and consultants who are too excited by the cash to point out the project is doomed. Oh, and politicians who stand up in parliament, announce changes in plans and then keep pretending the project is on schedule.
With this strategy we'd have a police state that came in late and didn't work very well. But: we'd pay for it, either via taxes or ISPs, and it would be expensive.
Bring it to the Borg
The cloud approach. All ICRs are Netflow records grabbed off the Cisco switches and buffered locally. Regularly they are pushed up to Google cloud storage in bulk HTTPS PUT operations, so storing all your data in Google's server farms. We'd duplicate it across sites for Disaster Recovery ("DR"), and host it in the zero-CO2 datacentres ("DCs") to help meet the COP21 commitments which Osborne is trying to pretend weren't made.
Those initial datasets could be converted to a more compressed format, with some summary data pushed into the Google Bigtable database. Queries against the dataset, such as "find everyone who used twitter and facebook yesterday", would be Google BigQuery queries.
This architecture would work, provided whoever wrote the capture, upload and query code knew what they were doing. Google would handle datacentre ops, bill your ISP or HMG by the petabyte of data stored and for the CPU load of the import/cleanup phase and the query execution.
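The "buffer locally, push in bulk" part of the capture code could be sketched like this. Everything here is an assumption for illustration: the IcrBuffer name, the JSON-lines-plus-gzip batch format, and the upload callable, which stands in for whatever HTTPS PUT client you would really use.

```python
import gzip
import json


class IcrBuffer:
    """Collect records locally and hand them to `upload` in compressed batches."""

    def __init__(self, upload, batch_size=1000):
        self.upload = upload          # callable taking one bytes payload
        self.batch_size = batch_size
        self.pending = []

    def add(self, record):
        self.pending.append(record)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Compress the pending records as JSON lines and upload them."""
        if not self.pending:
            return
        lines = "\n".join(json.dumps(r) for r in self.pending)
        self.upload(gzip.compress(lines.encode("utf-8")))
        self.pending = []


uploaded = []                         # stand-in for the remote bucket
buf = IcrBuffer(uploaded.append, batch_size=2)
for port in (51000, 51001, 51002):
    buf.add({"src": "192.0.2.10", "sport": port, "dst": "198.51.100.7", "dport": 443})
buf.flush()                           # push the final partial batch
print(len(uploaded))                  # 2
```

A real version would also need retries and a disk-backed buffer, since anything held only in memory is lost on a restart.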
This is the one we'd build. No upfront hardware CAPEX; operational costs O(records)+O(queries). As you get 1TB of free processing a month, initial dev costs are relatively low too.
Would the government allow it? Unlikely. Google's nearest datacentres are in Ireland and Finland; all the data would be moved out of the UK and under the jurisdiction of others.
It's notable, however, that Amazon have a special datacentre in the US for Amazon GovCloud, which is where federal agencies can keep data and run code. We aren't aware of a UK equivalent. They could do this, though presumably after discussing tax arrangements.
[BTW, if someone went this way, as costs are directly proportional to the number of records, you could hurt the telcos and government costs by generating as many records as you can. If every UDP packet generated a record of sender and destination, you could run something on all willing participants' machines to create costs. Even at pennies per gigabyte, you can ramp up the bills if you leave this code running flat out overnight. Just a thought.]
The fallback design would be to use the open source Big Data stack: "Hadoop and Friends".
The netflow call records would be streamed into Apache Kafka, which would then store the data in Hadoop HDFS, or perhaps feed them through some initial streaming system for cleanup (Storm, Spark).
Summary information would go into a column database. Normally the checklist item would be HBase, but given this is a government project, the NSA-implemented Accumulo would win out.
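The cleanup-then-store step can be sketched in plain Python, standing in for the streaming job. This is not Kafka, Storm or Spark code; the record fields, the validity rule and the /icr/YYYY/MM/DD/HH directory scheme are all hypothetical choices of ours.

```python
from collections import defaultdict
from datetime import datetime, timezone

REQUIRED = ("start", "src", "sport", "dst", "dport")


def clean(records):
    """Cleanup stage: drop records with missing fields or an invalid port."""
    for r in records:
        if all(k in r for k in REQUIRED) and 0 < r["dport"] < 65536:
            yield r


def partition_path(record):
    """Map a record to an hourly directory (hypothetical layout), in UTC."""
    t = datetime.fromtimestamp(record["start"], tz=timezone.utc)
    return t.strftime("/icr/%Y/%m/%d/%H")


def partition(records):
    """Group the cleaned records by the directory they would be written to."""
    out = defaultdict(list)
    for r in clean(records):
        out[partition_path(r)].append(r)
    return dict(out)


stream = [
    {"start": 0, "src": "192.0.2.10", "sport": 51000, "dst": "198.51.100.7", "dport": 443},
    {"start": 3600, "src": "192.0.2.11", "sport": 51001, "dst": "198.51.100.7", "dport": 80},
    {"start": 3601, "src": "192.0.2.12"},  # malformed: dropped by clean()
]
print(sorted(partition(stream)))
```

Partitioning by hour means a "between 9 and 9:30" query only has to read one directory rather than the whole dataset.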
Assuming there's a way to submit remote queries, those queries need to be secured. We'd come to some arrangement with the government to say "here are the Kerberos credentials you need" on a USB stick, restrict access to SPNEGO-authenticated callers over HTTPS and rely on the NSA having no back doors in Accumulo that they aren't willing to tell GCHQ about.
Cost: upfront CAPEX of the datacentre, operational costs: site, power, staff, cooling. Ideally you'd host it somewhere with cool air and low cost zero-carbon power, which points us up at Scotland and NE England. Land, power and people are affordable, and before long you could have something on a par with Facebook Prineville, though not, notably, NSA Utah.
Being open source software, you don't actually have to pay anyone for it, if you take on the support costs yourselves. It's unlikely telcos will want to do this, so that needs to be covered too. Once you have enough Petabytes of data, the cost of this system is still likely to be less than with Google or Amazon.
Otherwise: software development costs atop this layer may be steep, especially if you don't have experience in this.
In fact: all of the strategies have software development costs, a cost which is the same for every telco, all implementing roughly the same application.
Which would leave the government in a very good position to walk up to all of them and say "we can provide the software for you: just run these servers with our code". That way they get to run precisely the software they want, and hook it up directly to their central systems in a way which they know will work.
They could even say "we'll host your servers somewhere and provide the software". That would give the government direct access to the nominally independent telco-hosted datasets, when they were all really in the same big room, running the same application.
There you have it then: four real ways. Two disasters, two viable: one where you hand off the operational issues to Google or Amazon in exchange for lots of cash, the other home-built on the open source Big Data stack for more upfront capital costs but lower long term storage & compute costs. And relying on GCHQ to provide the software layer.
The only one that would scale, technically and financially, without giving Google or Amazon the data, would be Hadoop+Accumulo, with code the telcos probably aren't up to writing themselves.
How much would this cost?
Hardware: Facebook-based open compute servers.
These are never publicly quoted, as you have to be shopping for $1M+ before the design starts to make sense; at that point you get off the web site and onto the phone.
Looking at the Penguin Computing icebreaker storage servers, holding 30 3.5" HDDs in a 2U system: with ~44 "rack units" of capacity in a normal rack, you get 22 per rack (we're leaving space for Top of Rack switches) == 660 disks.
With 4TB HDDs, you get 2.6 petabytes in a single rack. Heavy bastard with serious power budget, but compact enough that you can fit one in various telco sites, doing the initial storage/preprocess there before uploading to central facilities in off-peak times.
A 4TB HDD costs $115 from Amazon US; let's assume $75 for the OEM. You'd be paying $50K for those 660 disks. That's all. Servers: let's assume $1.5K for each chassis(*), so $33K, giving $82,500 per rack, excluding networking. Let's round that up to $100K.
We'd end up with a system whose pure hardware costs, the CAPEX, come in at $100K per 2.5 PB of storage. For $1M you've got 25 PB. At, say, 60 bytes per record (the IPv6 record size rounded up), we'd get about 42e12 records per rack. That's what's technically known as "a fuck of a lot of data". More subtly, if you look at the cost per record, it's about $0.0000000024. That's "a police state too cheap to meter" (**)
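Since this is the kind of sum that deserves more than a phone calculator, here is a minimal sketch that recomputes the per-rack figures from the inputs stated above. The constants are the article's own guesses, not quotes; it uses decimal units (1 PB = 10**15 bytes) and ignores replication, which on HDFS defaults to three copies and would cut the usable capacity accordingly.

```python
DISKS_PER_CHASSIS = 30
CHASSIS_PER_RACK = 22
TB_PER_DISK = 4
DISK_PRICE_USD = 75            # assumed OEM price per 4TB disk
CHASSIS_PRICE_USD = 1500       # assumed price per storage chassis
RACK_BUDGET_USD = 100_000      # hardware cost rounded up, as in the text
ROUNDED_CAPACITY_BYTES = 2_500 * 10**12   # 2.5 PB, the rounded-down capacity
BYTES_PER_RECORD = 60

disks = DISKS_PER_CHASSIS * CHASSIS_PER_RACK
raw_capacity_tb = disks * TB_PER_DISK
hardware_usd = disks * DISK_PRICE_USD + CHASSIS_PER_RACK * CHASSIS_PRICE_USD

records_per_rack = ROUNDED_CAPACITY_BYTES // BYTES_PER_RECORD
usd_per_record = RACK_BUDGET_USD / records_per_rack

print(f"disks per rack:   {disks}")
print(f"raw capacity:     {raw_capacity_tb} TB")
print(f"hardware cost:    ${hardware_usd:,}")
print(f"records per rack: {records_per_rack:.3e}")
print(f"cost per record:  ${usd_per_record:.2e}")
```

Changing BYTES_PER_RECORD or adding a replication factor shifts the per-record cost by small multiples, not by orders of magnitude.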
Which is what we are getting here. The big change isn't "the terrorists are using phones", or even "surveillance is getting harder than ever". It's "with the right technologies, we can record everything the population does for next to nothing"
Welcome to the future: privacy is over.
(*) Chassis costs depend very much on RAM and CPU; put in Intel Xeon parts and lots of RAM and you'd be paying $8K/node, plus more in the leccy bill. That's what you'd expect to pay for the more compute-centric nodes, rather than the cold data storage, which is what we've costed out.
(**) These numbers seem to come in too low. Either there's a major underestimate of something, or the author should have used a spreadsheet rather than the calculator on their phone.