Free TTC Mobile iPhone app @ bit.ly/ttcmobile (Support)

TTC dataset incorrect: scraping required for offline apps … this won’t be fun.

December 23rd, 2009

Can I get contact info for whoever’s making the TTC data available? I’ve a few requests/ideas/fixes to suggest … sigh. If anyone wants to phone me, my number’s (647) 801-LSTA (5782), or just email me directly.

Based on Kieran’s comment on my blog, at http://blog.lsta.me/?p=27#comment-25938534, I decided to compare the given TTC data with that of ttc.ca. (Kieran’s one of the guys behind myttc.ca)

The TTC dataset page says “Board periods change around every six weeks,” but the data last changed October 27, so it should’ve been updated around December 8th. It wasn’t, though I’ve no proof that routes have actually changed since Oct 27. The data might have been just as bad at launch, but I’m getting ahead of myself.
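The six-week arithmetic is easy to check. A quick sketch in Ruby, assuming the October 27 date falls in 2009 (the year of this post) and reading “around every six weeks” as exactly 42 days:

```ruby
require 'date'

# Assumptions: the dataset last changed on October 27, 2009, and a
# board period is taken as exactly six weeks (42 days).
last_update = Date.new(2009, 10, 27)
next_expected = last_update + 6 * 7

puts next_expected.strftime('%B %-d, %Y')
```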

So, it took me a bit longer than I thought to get this started, and so far I’ve only compared the ttc_routedetails txt file with the website. Already I’ve discovered two MAJOR problems with the TTC dataset as given on toronto.ca/open … and I’m thinking of saying, “screw it,” and making my own “official” data set instead. I can see how the myttc.ca guys got the idea ;-)

The first? It’s not up-to-date:

Code used to scrape ttc.ca is at: http://gist.github.com/262436

# Assumes db is an open SQLite3::Database handle holding both tables.
output = "Routes in ttc dataset that aren't on ttc.ca: "
db.execute("select routeid, branchcode from ttc_routedetails except select routeid, branchcode from ttcca_routedetails") { |r| output += r.join(" ") + ", " }
puts output.chop.chop

Routes in ttc dataset that aren't on ttc.ca: 112N E, 191N A, 191S A, 300E _, 300W _, 301E E, 301E W, 301E _, 301W E, 301W W, 301W _, 302N _, 302S _, 303N _, 303S _, 305E _, 305W _, 306E _, 306W _, 307E _, 307W _, 308E _, 308W _, 309E _, 309W _, 310N _, 310S _, 311N _, 311S _, 312E _, 312W _, 313N _, 313S _, 316N _, 316S _, 319E _, 319W _, 320N _, 320S _, 321E _, 321W _, 322E _, 322W _, 324N _, 324S _, 329N _, 329S _, 352E _, 352W _, 353E _, 353W _, 354E _, 354W _, 385E _, 385W _, 38E A, 38W A, 48E A, 48W A, 501E E, 501E G, 501E W, 501W E, 501W W, 73S _, 96E E, 96W E

And the opposite: there are routes on the ttc.ca website that aren’t in the dataset (okay, really just one, 39E G, once you account for the night-route naming explained below):

# Same comparison, reversed. Assumes the same open db handle.
output = "Routes on ttc.ca that aren't in dataset: "
db.execute("select routeid, branchcode from ttcca_routedetails except select routeid, branchcode from ttc_routedetails") { |r| output += r.join(" ") + ", " }
puts output.chop.chop

Routes on ttc.ca that aren't in dataset: 300E N, 300W N, 301E N, 301W N, 302N N, 302S N, 303N N, 303S N, 305E N, 305W N, 306E N, 306W N, 307E N, 307W N, 308E N, 308W N, 309E N, 309W N, 310N N, 310S N, 311N N, 311S N, 312E N, 312W N, 313N N, 313S N, 316N N, 316S N, 319E N, 319W N, 320N N, 320S N, 321E N, 321W N, 322E N, 322W N, 324N N, 324S N, 329N N, 329S N, 352E N, 352W N, 353E N, 353W N, 354E N, 354W N, 385E N, 385W N, 39E G

As you can see, on the website, night routes use the branch code N rather than _, probably for consistency (night routes are cross-listed under the regular route’s code).
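One way to keep the naming difference from drowning out the real mismatches is to normalize the branch code before comparing. A rough sketch on a handful of pairs copied from the two lists above; it assumes that `_` and `N` mean the same branch on the 300-series routes, which is my reading of the output rather than anything the TTC documents:

```ruby
# Sample (routeid, branchcode) pairs copied from the two lists above;
# a real comparison would read every row from both tables.
dataset = [%w[300E _], %w[300W _], %w[302N _], %w[73S _]]
website = [%w[300E N], %w[300W N], %w[302N N], %w[39E G]]

# Assumption: on 300-series routes, "_" in the dataset and "N" on ttc.ca
# name the same branch, so "_" is rewritten to "N" for those routes only.
def normalize(pairs)
  pairs.map do |routeid, branchcode|
    night = routeid.match?(/\A3\d\d[NSEW]\z/) && branchcode == '_'
    [routeid, night ? 'N' : branchcode]
  end
end

only_on_website = normalize(website) - normalize(dataset)
only_in_dataset = normalize(dataset) - normalize(website)

puts "Only on ttc.ca: #{only_on_website.map { |r| r.join(' ') }.join(', ')}"
puts "Only in dataset: #{only_in_dataset.map { |r| r.join(' ') }.join(', ')}"
```

With the night routes folded together, only the genuine differences in the sample are left on each side.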

This meshes perfectly with what Kieran said: “… Many stop & route names are outdated, missing, mismatched, or just plain incorrect …” (Now I can totally believe that, given the broken bits and pieces I’ve found so far.)

Who knows, then, if the actual stop times and stop names are correct? Certainly we’re missing those newfangled bike icons in our beloved dataset.

Anyway, my second point: even assuming the data were correct, it’s IMPOSSIBLE to re-create the ttc.ca website with it. Why? You need a list of stops in geographical order, and that’s exactly what’s missing right now, geography, or such an order. Instead you’ve a confusing mess of unique keys as routeid+branchcode, repetitive data everywhere, and seemingly little thought put into any of it. I would have to manually process the data into a more useful database format, and I know such a format must already exist somewhere for ttc.ca’s own use!
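To make “stops in geographical order” concrete, here is a minimal sketch of the kind of structure that would do the job: one ordered stop list per route and direction, with each stop recording which branches serve it. The struct names, the route id, the stop names and the branch letters are all placeholders, not real TTC data or the dataset’s actual schema.

```ruby
# Hypothetical structure, not the TTC's schema. All names are placeholders.
Stop = Struct.new(:name, :branches)

RouteDirection = Struct.new(:routeid, :stops) do
  # Stops served by one branch, still in geographical order.
  def stops_for(branch)
    stops.select { |s| s.branches.include?(branch) }.map(&:name)
  end

  # Which branches call at a given stop?
  def branches_at(stop_name)
    stop = stops.find { |s| s.name == stop_name }
    stop ? stop.branches : []
  end
end

# '999E' is a made-up route id; the array order is the geographical order.
route = RouteDirection.new('999E', [
  Stop.new('Stop 1', %w[A G]),
  Stop.new('Stop 2', %w[A G]),
  Stop.new('Stop 3', %w[G]),
])

puts route.stops_for('A').join(' -> ')
puts route.branches_at('Stop 3').join(', ')
```

Keeping the order on the route rather than on each branch means a single list answers both questions: where a branch goes, and what stops at a given corner.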

Give me XML output from ttc.ca pages and I’d be totally happy at this point. At least I can reliably parse out and create a database from whatever attributes are given in the XML, one file per route, say.

To go beyond the website, though, would be fantastic. Describing my iPhone app ideas, I wrote in my blog, “The goal is to ultimately show, video game-style, a live view of the city, like Google Maps, but with moving buses as the fake buses follow their schedules. Buses might turn red if reported late, yellow if too early, and green if on-time. Maybe I’ll experiment with colouring the roads instead, as if to say, the bus should be somewhere within this area, indicating whether it’s on-time or not, and by how much.”
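The colour rule is simple enough to sketch. The one-minute tolerance below is an assumption picked for illustration, since the quote doesn’t say how late counts as late:

```ruby
# Assumption: a bus within one minute of schedule counts as on time.
# The threshold is made up for illustration.
ON_TIME_TOLERANCE_MIN = 1

# delay_min > 0 means the bus is behind schedule, < 0 means ahead of it.
def bus_colour(delay_min)
  if delay_min > ON_TIME_TOLERANCE_MIN
    :red      # reported late
  elsif delay_min < -ON_TIME_TOLERANCE_MIN
    :yellow   # too early
  else
    :green    # on time
  end
end

[-4, 0, 6].each { |d| puts "#{d} min: #{bus_colour(d)}" }
```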

So what I’d really love to have, then, are the times given to bus drivers for their routes: to be able to say, trip #23 departs at 9:43 am from Station X and should arrive at 9:53 am at Stop Y. This is, after all, how Google works with such data. They ask, “What are the next 3 buses, and how long does the transit service think it will take to get to the destination stop on each?” The info, as then presented, is quite useful for people planning when to travel and which bus to take, and, within the context of a route planner, for picking the best route based on the shortest estimated times. The worst times I’ve had on a bus are when it arrives on time, then sits at a scheduled stop for 7 minutes when I expected to get somewhere in 3 minutes tops. (41 Keele, I’m looking at you.)
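With per-trip times, the “next 3 buses” question becomes a small query. A sketch under stated assumptions: the `Trip` struct is hypothetical, times are minutes after midnight, trip 23 is the 9:43 am to 9:53 am example from the paragraph above, and the other trips are invented.

```ruby
# Hypothetical trip records; times are minutes after midnight.
# Only trip 23 comes from the text, the rest are invented.
Trip = Struct.new(:number, :departs, :arrives) do
  def duration
    arrives - departs
  end
end

# Format minutes after midnight as H:MM.
def hhmm(minutes)
  format('%d:%02d', minutes / 60, minutes % 60)
end

# The next `count` trips departing at or after `now`, earliest first.
def next_trips(trips, now, count = 3)
  trips.select { |t| t.departs >= now }.sort_by(&:departs).first(count)
end

trips = [
  Trip.new(22, 9 * 60 + 28, 9 * 60 + 38),
  Trip.new(23, 9 * 60 + 43, 9 * 60 + 53),
  Trip.new(24, 9 * 60 + 58, 10 * 60 + 15), # invented: includes a 7-minute hold
  Trip.new(25, 10 * 60 + 13, 10 * 60 + 23),
  Trip.new(26, 10 * 60 + 28, 10 * 60 + 38),
]

next_trips(trips, 9 * 60 + 40).each do |t|
  puts "Trip ##{t.number}: departs #{hhmm(t.departs)}, arrives #{hhmm(t.arrives)} (#{t.duration} min)"
end
```

Because each trip carries its own arrival time, a scheduled hold shows up as a longer duration instead of surprising the rider.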

Ideally, there’d be a way to cross-reference such route times with bus numbers, internally if not publicly. App writers could then ask people for feedback about the bus ride they happen to be on, and tell others how late or early bus #whatever is. I know, at that point, it’s too micro-managed, but honestly, if I mis-read the schedule and think a bus is late when it’s really on time, or if 3 buses arrive in a row at some stop, I want to be able to get confirmation of that, and tell others waiting for the bus that it was at that stop at that time. Call it a stop-gap until we get GPS in 2 years: people could report traffic or incidents live through an app and include the trip or bus #, so that those trips (and future ones) can have a delay message appended, as an advanced @ttcu_community, so to speak.

And as I said before, I’d be more than happy to try to merge together some of these data sets, as they’d be more useful that way, but it’s hard to do with inaccurate data. I’d so prefer a raw dump of whatever’s on ttc.ca (as HTML, even!) with scheduled update notifications and the promise of more details to emerge eventually. This is assuming, of course, that the schedules given to bus drivers don’t differ significantly from those on stops … I can imagine they do, however, as “FS” can hide quite a bit of accuracy for trip #s.

So right now, it looks like I have to put my plans for Symbian/BlackBerry/etc. apps on hold, simply because the data isn’t there for offline reference, and none of those platforms can easily support ttc.ca the way Safari on iPhone does (except perhaps Android??), so my current app’s online-only JavaScript-injection technique is less useful.

As for the iPhone/iPod Touch, I could perhaps write a “Downloads” section, where you can hit a button and have it download every HTML page off ttc.ca and every image, without hammering the server too hard. Still, I can imagine the TTC not liking such “updates” from EVERY individual iPod Touch and iPhone, which is why it’s critical that better data distribution methods exist for both packaging the data AND notifying us developers of the latest changes, since we take the rap for incorrect info in our apps. And ultimately, regardless of who’s to blame, people will have a bad experience on the TTC, which is what we’re both trying to prevent here.
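“Without hammering the server” mostly means pausing between requests. A minimal sketch of that pacing, written in Ruby to match the scraping code rather than in anything iPhone-specific. The `download_all` helper, the injected `fetch` callable, and the delay value are all assumptions for illustration, not a TTC guideline or part of the app:

```ruby
# Hypothetical sketch: `fetch` is any callable that takes a URL and returns
# the page body. It is injected so the pacing can be tried without
# touching ttc.ca. The default delay is an arbitrary choice.
def download_all(urls, fetch:, delay_seconds: 2.0)
  pages = {}
  urls.each_with_index do |url, i|
    sleep(delay_seconds) if i.positive? # pause between requests, not before the first
    pages[url] = fetch.call(url)
  end
  pages
end

# Stand-in fetcher so the example runs offline; a real one could wrap
# Net::HTTP.get(URI(url)) from Ruby's standard library.
fake_fetch = ->(url) { "<html>page for #{url}</html>" }

pages = download_all(%w[page-1 page-2 page-3], fetch: fake_fetch, delay_seconds: 0.1)
puts "Downloaded #{pages.size} pages"
```

Even a polite per-device crawl multiplies by every installed copy of the app, which is the argument for a packaged download instead.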

Here’s hoping someone at the TTC listens …


View Comments
  • >You need a list of stops in geographical order, and that’s exactly what’s missing right now, geography, or such an order.

    Hey Louis,

    Isn't that what the 'routestopid' and 'routestopdirection' fields in the 'routestop' table are for? They seemed to do the trick for me to put some of the routes I'm familiar with in geographical order when I poked around the dataset.
  • Louis
    I haven't inspected yet, but based on the descriptions, I think not. What I'm looking for is stops by routeid, not stops by branchcode, e.g. being able to say, at this stop, what branches are there? Now if the routestopid is not per branch, and therefore not unique, then it might do, but if that's true, why call it routestopid? That implies that it won't change, but it would have to unless it's entirely relative. I guess what it comes down to is I dislike having to do so much processing to get the data into a usable database format, simply because the data is so repetitive yet not separated enough. Ideally it'd be all relational or all document-based, in JSON or XML per route. Instead we've this odd, useless mixture of the two that's poorly tab-delimited.

    Thanks for the suggestion, though, I'll check the dataset when I get back home to see if that's how it works.