Uncommon advice on becoming a data scientist in the public interest

Civic data scientist Alex Engler shares his insights for those aspiring to work with data in the public sector.

Especially as you enter the field of data science, it is easy to convince yourself that you need to learn neural networks, TensorFlow, Kubernetes or whatever is featured on our field’s many absurd aggregator websites (*cough* Data Science Central *cough*). Search online, and nearly everything you find about how to become a data scientist will push you towards some set of technical skills, often with a heavy emphasis on programming frameworks and machine learning. Unfortunately, for those of you who want to work in data science for the public interest, this is bad advice.

I’ve spent much of the last decade as a civic data scientist, including for DC’s local government, for the Congressional Research Service and at think tanks such as the Urban Institute. I’ve also been teaching applied data science for policy problems for eight years at some of the field’s leading academic institutions. In these roles, I’ve spent quite a bit of time talking to governments, non-profits and other employers about what they need from their data scientists. It is entirely clear from these discussions that most junior data science roles at non-profits and policy research institutions don’t need the cutting edge of data science technical skills.

Some of those skills will be important, but in the public sector, a meaningful contextual understanding of your data and domain is essential. So, I want to articulate a different set of learning priorities to be successful as a young data scientist in the public interest—priorities that are driven by the specifics of a policy domain.

The idea: develop depth in a domain over breadth in data science

The good news about learning data science is that there is an enormous amount of free information on the internet—I didn’t blink when I saw a list of 200 free books on R. Yet this abundance makes it much harder to decide which skills actually matter. My advice is straightforward: In the early days, don’t try to learn everything. Instead, focus on a specific area of policy or service delivery, and let that drive every other decision you make.

Pick a domain you’re interested in, and then be the best data scientist you can be in that area, whether it be health policy, agricultural systems or conflict studies. Practically, this means spending more time understanding relevant datasets, more time understanding the data collection processes used in your field, more time reading relevant research, and more time doing applied work (I’ll revisit some of these shortly). Of course, all of this comes at a cost: reducing time spent learning a more diverse set of technical skills and statistical methods. In the short term, this is a worthwhile trade-off.

At most public service organisations, just having the best technical skills will rarely make you the most qualified applicant for a job. The best candidates are typically going to have a strong overlap of content knowledge and technical skills. So, this strategy is meant to make you highly competitive for a small number of jobs, rather than generally employable. This might sound risky, but done well, it should give you more control over what you do. You’ll be able to take your pick of a few jobs in your field, as compared to a general-purpose programmer and statistician, who will have to send out more, and less well-targeted, applications.

The most common response that this advice elicits from students is that they don’t know what domain to pick, since they don’t have much experience in anything specific or aren’t yet very passionate about any one area. That’s understandable, but I still think you should choose something anyway. First, you aren’t locked into your choice forever. There is plenty of overlap between various fields of public policy, social science and non-profit service delivery. Whatever you choose can get you started and will likely stay valuable in the long-term.

Second, in my experience, it’s unusual to be especially passionate about any specific issue until you’ve worked on it extensively. For the typical public servant, passion isn’t automatic; it’s earned. Dive into the papers, the debates, and the datasets in a subject and give it some time—you’ll be surprised how quickly you learn to care about the obscure nuances of your field.

The execution: concrete suggestions for projects, datasets, methods, and tools

Concretely, this means you should spend time on applied projects, learning R or Python to analyse real data, and let your field and your questions determine the methods you learn.

Spend your time on applied projects in your domain
Working on applied data projects, in which you might conduct a complete data analysis, build a new dataset or develop a data tool, is extraordinarily valuable. You should opt into this kind of work (in classes, internships or independently) as often as possible. You can learn about relevant data, increase your knowledge of the field, and improve your technical skills all at the same time. Your ability to conduct an applied project from start to finish is also the strongest signal you can send to an employer. It’s not just a claim that you can do the job, it is the job.

Knowing where to begin can be intimidating, so don’t be afraid to start simple. Downloading a dataset, performing basic data cleaning, merging it with another dataset and then doing descriptive analysis with some charts is a great start. For most civic data science jobs, I would much rather see a clear and informative descriptive analysis of a relevant dataset than a winning submission to a Kaggle competition.
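That starter workflow fits in a few lines of Python with pandas. The tables, column names and values below are invented for illustration; in a real project they would come from downloaded files via `pd.read_csv`.

```python
import pandas as pd

# Two small, invented tables standing in for downloaded datasets --
# in practice these would come from pd.read_csv("...") calls.
spending = pd.DataFrame({
    "state": ["VA", "MD", "DC", "DC"],
    "per_pupil_spend": [12500, 14300, None, 22500],
})
outcomes = pd.DataFrame({
    "state": ["VA", "MD", "DC"],
    "grad_rate": [0.89, 0.87, 0.69],
})

# Basic cleaning: drop rows with missing values, then any duplicate keys.
spending = spending.dropna().drop_duplicates(subset="state")

# Merge the two tables on their shared key.
merged = spending.merge(outcomes, on="state", how="inner")

# Simple descriptive analysis: summary statistics for the merged data.
print(merged.describe())

# A chart would be the natural next step, e.g.:
# merged.plot.scatter(x="per_pupil_spend", y="grad_rate")  # requires matplotlib
```

Even this much, a clean merge and a summary table, already demonstrates the core loop of an applied project.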

To go further, you can compare your analysis to research papers and other analyses that used the same data—why did they make the choices they did? What methods did they use to learn more? Early on, you might replicate the work of academics or data journalists, or instead take a project from a similar dataset and perform the same process on your data. As you do this, you’ll get better at asking your own questions, and you’ll soon find yourself with plenty of project ideas.

Knowing your data as a marquee skill
Understanding the universe of data that is relevant to your field is an invaluable skill. Take it as seriously as learning to code.

First, try to build an understanding of what data is out there. Start a list of datasets used in analyses of your field and add to it over time. Also give government open data websites and Google Dataset Search a try. Work with this data as much as possible and keep notes on what you learn. This also means reading the documentation—one handy thing about government and non-profit data is that it frequently comes with pretty good documentation. As a rule, you do not understand any dataset until you have read its documentation.

You should also look out for public critiques and comparisons. You might be surprised by how much is written on announcing, critiquing and comparing different public datasets (these links are from empirical conflict studies). For proprietary data, do informational interviews with data analysts in your field and ask them about the datasets they are working with.

You should also deepen your understanding of data collection processes. While you can develop good questions about your data’s collection from working with it, you must go elsewhere to understand its origins.

Look at the forms (oh how they change) and surveys that generate government administrative data. Consider getting in touch with the curators, especially once you have questions from your exploratory analysis. In some cases, you may even be able to directly observe the data collection process. This might mean going door-to-door for a survey or volunteering as a monitor for some process (e.g. voting or standardised testing).

This is a lot of effort, but it will improve your work and distinguish you as a young data scientist. I think most senior data scientists would agree that understanding the shortcomings of specific datasets is a compelling and rare signal of analytic maturity.

Let your domain choose the statistical methods that you learn
There are too many statistical methods. Learning them all is not a remotely achievable goal. Even worse, aiming to learn as many as possible will undermine your effectiveness as an applied data scientist. Instead, I’d argue for two goals. First, understand the broad typology of data science methods and what questions they can answer. Second, learn how to learn, and continually develop your skills while performing applied data science projects.

I’ve written before about the scope of data science methods that are useful in policy analysis, but only one is universally necessary: causal inference. You need to understand causal inference methods. Even if you don’t use them yourself, you need this knowledge base to read the relevant social science in almost any field. After that, you want to know enough about methods to know what questions they are designed to answer. Is there a meaningful prediction I can make with supervised machine learning? Are there latent groups that I could discover with unsupervised learning? Is there an intervention I could simulate outcomes for?
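To give causal inference a concrete flavour, here is a toy difference-in-differences calculation, one of the workhorse designs you will meet constantly in applied social science. The scenario and every number below are invented.

```python
# A toy difference-in-differences estimate. All numbers are invented.
# Mean outcome (say, an employment rate) by group and period:
treated_before, treated_after = 0.60, 0.68   # city that enacted a policy
control_before, control_after = 0.58, 0.61   # comparison city, no change

# The treated group's change, minus the change the control group
# experienced anyway, is the estimated effect of the policy.
did = (treated_after - treated_before) - (control_after - control_before)
print(f"Estimated policy effect: {did:.2f}")
```

The arithmetic is trivial; the hard part, and the reason this knowledge base matters, is defending the assumption that the two groups would have trended in parallel absent the policy.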

What else you learn in depth should be driven by your domain. For instance, if you want to study international development, there is a long history of randomised experiments, and a more recent prominence of satellite imagery. If you want to study political science, social media platforms have made an understanding of network analysis and natural language processing far more valuable. Personal devices are creating trace data that adds a critical geospatial component to many social science questions, especially combined with survey research. Academic research and Twitter are great ways to get exposed to what’s going on in your field and can help you narrow down the wide world of data science into a manageable number of methods to learn.

A brief aside about language choice
On programming languages, the options are clear: choose either Python or R. I prefer R and think it’s better for causal inference methods / econometrics and data visualisation, which are both important here. That said, Python is a perfectly good choice, and is more versatile in working with the internet and some engineering tasks. These differences aren’t important enough to dwell on.

More important than which language is that you obtain depth in one. Learning both is valuable in the middle- and long-term, but in the early days, your ability to execute on projects is paramount. Since any given project is likely to be mostly in one language, your ability to contribute will be much higher if you have a robust set of skills ready for that language. This isn’t an argument against being a ‘polyglot’ programmer, or having breadth in a range of frameworks, but that should be a later goal for aspiring data scientists.

Wrapping up

None of this is an argument against undirected learning or eventually developing breadth in data science methods—both of which I endorse. I simply want to offer a framing for how to prioritise what you learn in the early stages of your civic data science career.

There are some skills you can’t avoid (the Linux command line, git, and don’t skimp on communication skills like data visualisation, literate programming and writing), but otherwise, use a domain to help you focus your efforts. In terms of getting your first civic data science job, you’ll have an easy and cohesive narrative to market yourself, and you will benefit from having demonstrated interest in a domain, rather than the all-too-common claim (“I am very passionate about…”) that haunts cover letters.

What’s more, your analysis—and your impact—will be all the better for it too.

Alex C. Engler (@alexcengler)

Image: Pietro Jeng, source: Unsplash