Trusting the Black Box

The most important obstacle to wide adoption of self-driving cars is trust. Like nuclear energy, self-driving cars will be forever stigmatized if a single gruesome, highly publicized incident occurs. People feel a sense of control when they drive; they feel they can always react to what other drivers are doing. Self-driving cars necessarily take that control away, and some of the resulting accidents would have been preventable if a person were driving. It is not surprising that 56% of people polled in a 2017 Pew survey said they would not want to ride in a self-driving car, and 72% of those cited safety concerns and lack of trust as their reason. The technology in self-driving cars requires hundreds of machine learning, computer vision, and robotics experts to develop. However, trust, not technology, will be the primary factor in whether they become the crown jewel of an automated Second Industrial Revolution or go the way of Google Glass.

Why Every Data Scientist Should Know Command Line Tools

The UNIX command line is great for basic data processing tasks because it has very low overhead. If you have a file with millions of rows, performing basic operations in a higher-level language often means reading the entire data file into memory first, which can take an unacceptably long time. With the command line, you can work through an entire file without worrying about the task taking hours, because the standard tools process their input as a stream and never need to hold the whole file in memory.
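The memory point can be illustrated with a minimal sketch. Command line tools typically read their input one line at a time; the Python below imitates that streaming pattern for comparison, assuming a CSV with a header row. The function name `stream_column_sum`, the `amount` column, and the sample data are all hypothetical, chosen only for illustration.

```python
import csv
import io


def stream_column_sum(lines, column):
    """Sum one numeric CSV column while holding only one row in memory."""
    reader = csv.DictReader(lines)  # reads lazily, one row per iteration
    total = 0.0
    rows = 0
    for row in reader:
        total += float(row[column])
        rows += 1
    return rows, total


# Hypothetical in-memory data standing in for a large file on disk.
sample = io.StringIO("id,amount\n1,2.5\n2,4.0\n3,1.5\n")
rows, total = stream_column_sum(sample, "amount")
print(rows, total)
```

On the command line, roughly the same computation could be written as `awk -F, 'NR>1 {s+=$2} END {print NR-1, s}' file.csv`, where `file.csv` is a placeholder name. In both cases memory use stays flat no matter how many rows the file has.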

Oversampling (Or why there’s no Democratic conspiracy in the polls)

One common complaint I’ve heard around the internet, mostly from people who think Clinton “rigged the election”, is that the polls were wrong due to oversampling. To people who do not know statistics, “oversampling” sounds like a conspiracy. It seems to imply that the Clinton campaign intentionally sampled too many black people and Hispanics in order to make it look like she had a greater chance of winning. However, this is a fundamental misunderstanding of what oversampling is. Oversampling isn’t a way for pollsters to blind themselves about demographics and how they vote. In fact, it’s precisely the opposite.
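A small worked example may help show why oversampling does not tilt the headline number. The figures below are hypothetical, not taken from any real poll: a group that makes up 10% of the population is deliberately sampled at 30% so that its own estimate is more precise, and each group is then weighted back to its population share before the overall figure is computed.

```python
# Hypothetical figures for illustration only; not taken from any real poll.
population_share = {"group_a": 0.90, "group_b": 0.10}
respondents = {"group_a": 700, "group_b": 300}  # group_b is oversampled
support = {"group_a": 0.40, "group_b": 0.80}    # share backing a candidate


def unweighted_estimate(respondents, support):
    """Pool all respondents as-is, ignoring that group_b is over-represented."""
    total = sum(respondents.values())
    return sum(respondents[g] * support[g] for g in respondents) / total


def weighted_estimate(population_share, support):
    """Weight each group's result by its true share of the population."""
    return sum(population_share[g] * support[g] for g in population_share)


naive = unweighted_estimate(respondents, support)
weighted = weighted_estimate(population_share, support)
print(round(naive, 2), round(weighted, 2))
```

Under these assumed numbers the pooled figure would overstate support, while the weighted figure reflects the population. The extra respondents in the smaller group only tighten the estimate for that group.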

The Question Concerning Technology (in the Statistics classroom)

In the past thirty years, cheap, ubiquitous computing power has allowed the field of statistics to address a wide variety of questions that previously would have been impractical. Any quantity without a closed-form expression would have been virtually impossible to calculate by hand, and problems involving a large number of coefficients would have been unthinkable to solve. But computing has also made it easier than ever for someone with little statistical understanding to use a statistical package, treat a procedure like a black box, and obtain a p-value without understanding the assumptions inherent in that procedure. How should the field of statistics teach technology to address both the increasing importance of computers and the dangers inherent in using them blindly?