The Fetishization of Management

I’ve been working in Corporate America for a little over ten years, and I’ve realized that its most toxic aspect is the fetishization of management. As a society, we have essentially come to believe that there is a class of individuals who possess a skill called management that transcends whatever field they work in. In highly technical fields, this leads to a fairly toxic corporate culture, as mid-level and upper-level managers devote more energy to playing politics than to addressing the problems at hand. There’s also a great deal of insecurity involved because of impostor syndrome: these managers are deathly afraid of looking stupid and out of their depth, and thus they tend to kick down at the more knowledgeable people below them.

In the field of Data Science, I see a lot of groups in Pharma run by individuals who I believe have absolutely no business running a data science group. I’ve seen far too many cases where they believe that if they hire the right people and shout “AI” every so often, it passes for leadership.

For the most part, I have been relatively lucky in the groups that I’ve been a part of. Most of my managers have been very competent in the field I work in, and I generally believe that if I were hit by a bus, most of them would be able to pick up the slack on my projects. And in the one case where I reported to a manager who couldn’t have replaced me if I were hit by a bus, the man had sufficient humility to acknowledge that I knew the subject better than he did, and he did his best to keep distractions out of my way and to get me what I needed.

Management itself is hard, and it’s something that’s not really taught in school, or at least is, in most cases, taught poorly. I would even argue that getting an MBA from Harvard doesn’t solve this. Critically, a manager has to do the following:

  1. Establish the goal
  2. Provide an idea of how the goal should be reached (note, this does not have to be the correct path; it’s sufficient provided that it provokes thought in those who report to you)
  3. Know when to call out the bullshit that comes across your desk.

Fundamentally, you cannot manage people if you don’t know what you’re managing them for. The inability to do this has, I think, paralyzed a large number of American companies that have relied on MBAs to manage technical work they could not do on their own, and this is one of the contributing factors to off-shoring. Rather than looking at a problem and having the vision to solve it, they would rather outsource it to someone else, taking the credit without having to do the difficult part of actually fixing the problem. And it is this that is slowly eroding the competitive advantage of US firms.

The latest casualty of this is Intel. Under its current CEO, Bob Swan, Intel has not released a major new chip architecture and its foundry process has fallen far behind, despite a huge financial war chest to draw on. I believe a large part of this is because Bob Swan is a finance guy, not a semiconductor engineer. Contrast this with three companies that are currently eating Intel’s lunch: AMD on the CPU side, Nvidia in anything related to HPC, and TSMC on the foundry side. Aside from being run by Taiwanese people (Yay Taiwan), they’re all run by electrical engineers (Lisa Su, Jensen Huang, and C.C. Wei), and so when it comes to chip design and the manufacturing process, they’re able to identify the primary problems their firms have to overcome, form a basic idea of how to get there, and, when someone tries to sell them something, smell the bullshit a mile away.

The Pros and Cons of UBI

UBI, or universal basic income, has been promoted as one of the policies that might help us transition to the post-industrial age. With more and more work being done by robots and ML-enhanced software (I am loath to use the term AI), the number of people required to produce the same amount of goods and services is falling. Because of this, there is an increasingly desperate competition for a smaller and smaller number of well-paying jobs. On a purely macroeconomic level, UBI is a mechanism that provides a floor for aggregate demand. By circulating money back into the economy, it allows individuals to buy goods and services. Without UBI, it is very possible that economic growth would end, not because of a lack of innovation, but because of a lack of demand.

An optimist such as Andrew Yang would suggest that with UBI there would be a significant revaluing of the American economy, because people would have the freedom to pursue things that really interest them. They would feel free to pursue community service, art, and self-edification. People like Andrew Yang believe that a system like UBI would allow people to become different versions of Plato’s ideal philosopher.

As an aside, Aristotle and Plato both believed that labor was something that had value but did not increase value, and they argued instead for a class of individuals who would be free from traditional work to pursue the “virtues.” Other Greek philosophers, such as Epicurus, believed that one should work for the love of the craft rather than for money. It should be noted that most of the advances in philosophy, government, art and mathematics in the ancient world arose from the minds of these men of leisure, i.e. individuals who had the time and the resources to devote to thought rather than labor. Those who argue the virtues of UBI essentially believe that we all have this potential within us.

The pessimists, on the other hand, believe that if individuals are given sufficient income to maintain a basic standard of living, they will essentially stop working. And frankly, the adage that idle hands are the devil’s workshop in many cases rings true. They can point to the fact that lottery winners in many cases go bankrupt, spending most of their money on drugs, alcohol, consumerism, etc. And for a great deal of the population this is also true, as evidenced by the number of individuals in middle America who are collecting Social Security Disability and addicted to opioids.

I believe that this is also the fundamental divide within this country: between those who are optimistic about the inherent goodness of mankind, and those who are pessimistic and believe that people without an external motivating factor will not do the right thing. The scary part is that both camps are correct for a certain subset of the population. The truly scary thing about this thought is that with the self-sorting that has occurred within this country, we’ve basically amplified the problem.

That being said, I’m on the side of the optimists, and I believe that people, when given the opportunity to do something of interest, will take it. I’m just not blind to the possibility that they won’t.

On Grit

Ever since reading Angela Duckworth’s book Grit, I’ve been bothered by three aspects of her work. The first problem I have is a criticism of her methodology, namely that she uses a very homogeneous population in which to evaluate the impact of grit. The second problem is more with the interpretation of her work by others, namely that “success” as defined by those in power can be achieved through sufficient grit, without considering the barriers one has to overcome. Finally, in how she writes the book, she never visits the dark side of grit or enumerates the situations in which grit is harmful.

The basic premise behind Dr. Duckworth’s book Grit is that hard work is one of the most powerful factors explaining an individual’s success in any endeavor. However, most of her research focuses on students who have been accepted to an Ivy League school or West Point. What this misses is that the mere admissions process at these two institutions greatly homogenizes the incoming student body in qualities such as intelligence, creativity and/or physical fitness. Essentially, the one factor she claims predicts success is the only one that has not been actively selected for as part of the admissions process.

Once she has established that, within a carefully controlled population, individuals show a difference in outcome due to grit, the next question is the effect size of grit once the other factors are no longer controlled for. For instance, if you look at the players who make the NBA, one of the factors that allowed Kobe Bryant to be successful was the amount of work he was willing to put into his craft. However, this misses the people who never made it into the NBA, such as people like me. The question then arises whether an individual standing 5’9” with the same amount of “grit” as Kobe Bryant would have a better chance of making the NBA than an individual standing 6’4” with an average level of grit. Likewise, the individuals admitted to UPenn are some of the smartest in their graduating high school classes, so differences in grit probably stand out. But if you compared individuals admitted to Ivy League schools against those at less selective schools and controlled for grit, it would not be surprising if factors like IQ and parental income played a bigger role in metrics such as income. By choosing a deliberately narrow population, Dr. Duckworth essentially minimizes all of the other factors that may play a role in an individual’s success.
This fallacy is what gives rise to the sentiment that people should pull themselves up by their bootstraps. Individuals who feel that way most likely come from an environment in which there is very little variability among their peers with respect to education or socioeconomic class, and so they miss seeing all of the barriers that others may face.

The second criticism of the concept of grit is one that drove me batty as a primary school student: busywork. Yes, people with more grit are more likely to tough it out through busywork, and they are probably rewarded for it academically, but it raises the question: when you’re asked to do something stupid, is it better to find another way of doing it, or to do it blindly? My mantra, as someone who has worked quite a bit in Corporate America, is that innovation comes from lazy people, not necessarily hard-working people. However, this puts the onus on managers to motivate and, more importantly, to understand when it’s time to pull a project away from the person who was key in starting it so that someone else can polish it up. As a manager, I’ve understood that there needs to be a mix of individuals on a given team, and I’ll generally try to find the bright but lazy individual to brainstorm a solution, but pull him or her away from a project at the point at which they begin to lose interest and I need a higher level of fit and finish. The reason this is not more popular is that most managers are terrible and don’t want to do the work of managing. It bears remembering that every invention we see around us at one time required someone being too lazy to do something manually.

Finally, I think there needs to be a better treatment of the dark side of grit. My roommate from college is probably the grittiest individual I’ve ever known. He set his mind on being a professor, and at 39 he’s still struggling to earn his PhD. He has sacrificed everything in his life, from relationships to finances, to achieve his dream of being a paid intellectual. In contrast, I also have a PhD, and I earned it in 4.5 years, in large part due to my willingness to bail on my first advisor, who wanted me as cheap labor. He’s always focused on the “what” and not the “why,” and with a lot of gritty individuals I see that as a common thread. They blindly pursue a goal without asking “why?”, and it is these individuals who give excuses like, “I was just following orders.”

Why Open Source is Good

Delving deeper into Apache Phoenix, I had been looking for ways of loading data into my system faster. Our current system writes data through the JDBC connection, which is slower than using the bulk loader (this is true for any database system). So I had been playing around with the various command line tools that ship with Apache Phoenix, and realized that none of them worked if the table names were case sensitive.

When writing the application, we used camel case everywhere. I didn’t much care what convention we picked, nor do I really care about conventions like that, given the state of intellisense in modern editors. And so we happily marched along, double quoting tables and other identifiers, and everything seemed to work.

When playing with the map-reduce tool for doing deferred index creation, I ran into an issue: how exactly do I specify my table name?

${HBASE_HOME}/bin/hbase org.apache.phoenix.mapreduce.index.IndexTool --schema MY_SCHEMA --data-table MY_TABLE --index-table ASYNC_IDX --output-path ASYNC_IDX_HFILES

In any case, every combination of "myTable", \"myTable\", '\"myTable\"' failed. Now, the wonderful thing about Open Source Software is that the code is available on GitHub, so I can download the code and actually step through it to figure out why it is or isn’t working.

In summary, for anyone who is at all interested, there are two bugs. The first is that the Apache Commons CLI library does not properly handle the presence of quotes within a command line parameter. The second is that quote removal is done in two places in the map-reduce code, so even if the name had been properly double quoted through the CLI parser, it would still have failed.
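To illustrate the second bug, here is a minimal sketch, in Python with made-up helper names (the actual Phoenix code is Java and differs in detail), of why removing quotes twice mangles a case-sensitive identifier:

```python
def strip_quotes(identifier):
    """Remove one layer of surrounding double quotes, if present."""
    if len(identifier) >= 2 and identifier.startswith('"') and identifier.endswith('"'):
        return identifier[1:-1]
    return identifier

def normalize(identifier):
    """SQL-style normalization: a quoted identifier keeps its case,
    an unquoted one is upper-cased."""
    if len(identifier) >= 2 and identifier.startswith('"') and identifier.endswith('"'):
        return identifier[1:-1]
    return identifier.upper()

# The first removal strips the quotes that marked the name as case sensitive...
once = strip_quotes('"myTable"')   # 'myTable'
# ...so the second pass sees an unquoted name and upper-cases it,
# which no longer matches the case-sensitive table "myTable".
twice = normalize(once)            # 'MYTABLE'
```

With only one removal pass, the quoted name would survive with its case intact; the second pass is what breaks it.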

So my solution was to disable those two lines, take the table name as case sensitive directly from the CLI, and poof, the code works. At this point I could contribute back by issuing a PR for the benefit of the world, and I admit I haven’t done it, because I need a way of introducing the change without breaking someone else’s workflow (and I don’t know that workflow). But for internal purposes, we have fixed an issue that affected the functioning of our system. And we didn’t have to wait on someone else’s release schedule to do it.

Which Congressional Districts are Competitive

For the 2018 elections, the districts in pink are those that are strongly Republican based on voter registration; the ones in light blue are strongly Democratic. The ones in dark purple are currently held by Republicans but have a Partisan Voter Index (PVI) under 4, which means that with a good ground campaign, flipping them would be well within the realm of possibility. There are a total of 33 nominally competitive seats, which is sufficient to flip control of the House of Representatives from Republican to Democratic.
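The coloring rule for the map can be sketched as follows (the function and field names are illustrative, not from the actual dataset):

```python
def color_district(pvi_party, pvi_value, holder):
    """Toy version of the map's coloring rule.

    pvi_party: which party the district leans toward ('R' or 'D')
    pvi_value: the magnitude of the PVI lean
    holder:    the party currently holding the seat
    """
    # Dark purple: Republican-held seats with a PVI lean under 4,
    # i.e. the nominally competitive seats.
    if holder == 'R' and pvi_value < 4:
        return 'dark purple'
    # Otherwise, color by the registration lean.
    return 'pink' if pvi_party == 'R' else 'light blue'
```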

– Update

Blindly grabbing data from Wikipedia is dumb. The previous map had old data from the 114th Congress; the current map is of the 115th Congress. I’ve also added each Representative’s name, and listed the seats open due to special elections in green. I left Montana as red even though it’s technically open, because while Zinke is expected to be appointed Secretary of the Interior, the chances of flipping Montana are probably really low.

Failures in Software Engineering

And why your organization can’t do large development projects.

These are lessons I learned from running my first large software development project, the obstacles I smacked head first into, and what we can learn organizationally. I can write about this in a somewhat optimistic manner because, at the end, we did manage to put out a pretty damn good system, though in hindsight we could probably have done so in much less time.

The entries are numbered, though the order carries no significance. This post is the first entry only because it contains lessons that I learned.

  1. The wrong person is running the show and he wrote code for his PhD.

Having me as the technical lead for this project was probably not the best idea in the world. It was my first time running a project of this complexity (basically, one that I could not have finished working alone in six months’ time). Despite actually having a computer science degree, the technical term for people like me is coding weenie, not software engineer. The difference is that while I can program, I treat coding as a tool to solve a problem, and I do not take kindly to people telling me what the requirements are, what buttons to put where, and how exactly they want to see the output. If I think SDTM is a terrible standard for storing and querying data, thought up by statisticians to make their narrow use case easier, I will fight you every step of the way if you tell me you want an API that returns variables as sliced and diced SDTM domains. I will fight you every step of the way if you give me a requirement that is ambiguous and doesn’t make sense, rather than just making it so (e.g., you want me to automatically join two variables, but you won’t tell me what the join conditions are). A proper software engineer knows not to fight this, but to implement what you ask for, and then, when it fails, to point to your requirements document and say it does exactly what you told him to do.

Anyway, coding weenies spend too much time fighting for what’s right rather than realizing that the power of software is iteration: fixing things when they don’t perform to expectation. This is also why software projects end up being late. But sometimes we must realize that fighting for the perfect solution also wastes a whole lot of time, and it’s better to be late with the customer understanding it’s their fault and sharing the responsibility, rather than having it all fall on you. Joel Spolsky has warned about hiring PhDs, and while I disagree with him on that (I have a PhD), the mentality of build it, ship it, fix it later does have some merit.

As an aside, I disagree with him that PhDs aren’t about getting things done. PhDs are all about getting things done. It’s just that we don’t want you to tell us how to get it done. And if you do, you had better be right, because we can see through your bullshit.

The second thing in this category is that anyone who has written code for their PhD has a very high tolerance for bad code. Not so much because they are bad coders, but because we have to build on the work of others, and not everyone who works in computational whatever has formal training in coding. As part of my PhD, I saw some truly nasty code, and in the spirit of getting things done, doing the minimal work to turn it into a library, or worse, piping its output through a Perl parser and into the input stream of my own code, were acceptable ways of solving the problem. And if there was an error in the original code, I would wade through some god-awful code in order to fix it. So when I see terrible code, provided that I can verify it works properly through unit tests and whatnot, I’ll let it slide. Code used to solve a problem for someone’s dissertation is basically spit, snot, twine and a bit of Perl, mixed with a few nuggets of really elegant code.

Code quality, however, is damned important once you start adding people to the project who aren’t PhDs and don’t have such a tolerance for terrible code. You start adding outside developers, and once the project gets big enough that you can’t fit it inside your head, you’re going to be in trouble. In the first iteration of this project, code quality was pretty terrible because timelines were compressed and, well, get it done and out. We treated version two as a chance to learn the lessons of version one, and ended up rewriting the entire thing from scratch.

The takeaway lesson is that if you’re going to run a large software project, make sure the person running it has some experience running software projects first. And while I know we all have to get experience somewhere, and sink or swim is a pretty effective teacher, if you must use a newbie, don’t pick the opinionated guy with the PhD. If you must use the guy with the PhD, stay out of his way: tell him the problem and let him solve it however he wishes.

When Joel Spolsky argues that PhDs aren’t going to be successful as software developers, he’s right. In most environments, the guy with the PhD should be treated as a resource you hand a problem to when everyone else is stuck. However, there are certain environments and companies where dudes with PhDs are phenomenally successful, such as that company Google many of you might have heard of. It’s because they defined the problem and solved it their way, with minimal orders (only advice). I would bet that if you made Brin and Page, right out of their PhDs, write inventory management software for Walmart, they’d be bashing their heads against their desks every day, pissing everyone off, and probably giving some project manager heartburn as they missed deadlines and ignored requirements documents.

In the end, I feel blessed that the organization I worked for didn’t fire my butt at the end of release one, but allowed it to limp along and gave me a second shot at the problem, this time with minimal interference, and I think we did something pretty neat.

Apache Phoenix and UDFs

In one of the projects at work, we’ve been using Apache Phoenix, a SQL layer on top of HBase. It’s a pretty awesome system, though at the same time it’s been a somewhat frustrating experience, given some of the issues the documentation glosses over. The issues we encountered were inconsistent use of secondary indexes and the lack of documentation around important functionality such as User-Defined Functions (UDFs). Now, I’ll be the first to admit that perhaps the documentation isn’t necessarily bad, but it assumes a working knowledge of Java and the Hadoop ecosystem that we did not have when we started the project. Thinking back on it, the success of platforms such as PHP or MongoDB, despite everyone’s gripes about them, is due to the fact that their documentation assumed nothing about a coder’s background and allowed coders of diverse backgrounds to get started, whereas systems such as HBase and Phoenix have lost out on user mindshare because they are significantly harder to get up and running. While this is mainly a gripe post, I hope there’s sufficient information here to make your life a lot easier than mine was.

A brief description of our project: we were building a data warehouse to ingest and store clinical data. While clinical data is generally tabular, different sets of clinical data will have different sets of columns. Before someone tells me to use a key-value schema, it’s not quite that simple: different data sources can require different numbers of columns to define a primary key, depivoting/pivoting is in general an expensive operation, and there’s no guarantee the data you get isn’t already pivoted or depivoted for you. We had a preliminary SQL Server solution that stored all of the data in a KV schema, but it croaked when trying to get more than 50MB of clinical data back out of the system in a somewhat standard format. Oh, and the data we were ingesting would sometimes be given in incremental ODM form, in which updates, deletes and restores were given in the context of primary keys. Oh, and this was supposed to work on clinical data streaming into us, and given how the business process works in our industry, we only have an incomplete definition of the data we are supposed to be getting. Finally, we were supposed to be able to generate standards-compliant output before all of the data had been received, which ruled out writing a custom ETL script through tools like Informatica, and to update transformations on the fly without reloading all of the data from scratch.

Now, given the hard-won work we’ve put into it, I actually do think HBase/Phoenix is pretty awesome in terms of having a scalable NoSQL system that still allows for standard query paths. Now, to describe the issues:

  1. Apache Phoenix and HBase support dynamic columns, i.e. columns that don’t have to be specified at table creation time. However, indices don’t get used if a dynamic column appears in the query. Therefore, if we have a secondary index built on staticCol2:
select "staticCol1" from "table" where "staticCol2"='value'

will use the index. However, a query using a dynamic column

select "staticCol1", "dynamicColumn" from "table"("dynamicColumn" varchar) where "staticCol2"='value' 

will not use the index, even though both queries hit the same table and filter on the same indexed column. Even index hints do not force the system to use the index.

  2. Our solution to this problem was to make a static varchar column and pack a JSON object into it. Probably not the best solution, but sufficient for our purposes. For this, we needed a user defined function that takes as input the name of that static column and the key within the JSON object. For simplicity, we assumed that our JSON object was a flat dictionary.
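The behavior we wanted from the UDF (which we called JSONCOLUMN) can be sketched in Python; the real implementation is a Java UDF, and the flat string-to-string dictionary is our simplifying assumption:

```python
import json

def json_column(packed_value, key):
    """Emulate the JSONCOLUMN UDF: given the varchar contents of the
    static column (a JSON-encoded flat dictionary) and a key, return
    the corresponding value, or None (SQL NULL) if it is absent."""
    if packed_value is None:
        return None
    return json.loads(packed_value).get(key)

json_column('{"a": "hello"}', 'a')   # 'hello'
json_column('{"a": "hello"}', 'b')   # None, i.e. NULL in SQL
```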

So, how exactly do you write a UDF?

Well, the documentation isn’t terribly helpful. It has code snippets, but what imports do I need? It points you to a blog post for an actual example. Now, the first problem: the example uses libraries from Salesforce (which makes sense, since Salesforce came up with Apache Phoenix). But where do I get those libraries? Furthermore, since the project has been rolled under the Apache umbrella, I’m pretty sure there should be standard Apache libraries, not Salesforce libraries, that I need to use. A bunch of google-fu took me to another blog post, this time in Chinese (which, though I am Chinese, I cannot read), and I got the imports I needed:

import java.sql.SQLException;
import java.util.List;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.phoenix.expression.Expression;
import org.apache.phoenix.parse.FunctionParseNode.Argument;
import org.apache.phoenix.parse.FunctionParseNode.BuiltInFunction;
import org.apache.phoenix.schema.SortOrder;
import org.apache.phoenix.schema.tuple.Tuple;
import org.apache.phoenix.schema.types.PDataType;
import org.apache.phoenix.schema.types.PVarchar;
import org.apache.phoenix.util.StringUtil;
import org.apache.phoenix.expression.function.*;

All right, so now I can write the function. But what if I want to include other libraries? I haven’t used Java since I was an undergrad, I hated it then (preferring C/C++), and I never really got the hang of ant. I’d love it if, in a central place, they told me how to use build tools like mvn, how to create a jar, and all that fun stuff. But they don’t. So, googling a bit more, I manage to figure out how to include libraries in the POM file, and the damn thing builds.

Now how do I get it into my system? The instructions are:

  • After compiling your code to a jar, you need to deploy the jar into the HDFS. It would be better to add the jar to HDFS folder configured for hbase.dynamic.jars.dir.

This is not entirely clear unless you already know what you’re doing with HBase. Something like the following would have been helpful:

/usr/lib/hadoop/hadoop-2.7.2/bin/hdfs dfs -mkdir /hbase/udf/
#Remove all previous udf files (if the package changes for whatever reason)
/usr/lib/hadoop/hadoop-2.7.2/bin/hdfs dfs -rm /hbase/udf/*
/usr/lib/hadoop/hadoop-2.7.2/bin/hdfs dfs -copyFromLocal /path/to/the/uber/jar /hbase/udf

Finally, now that I have it in there, how do I test it? This actually took me the longest time, since

select JSONCOLUMN('{\"a\":\"hello\"}', 'a') 

should have worked (select 1+1 does). But it doesn’t. It doesn’t give me an error; it just tells me that the function isn’t found. So I’m trying to figure out why it can’t see my function. It turns out that what you really need to do is the following:

select JSONCOLUMN("ColumnName", 'jsonname') from "sources"

You need to apply it to the result of an actual query that returns rows. This is not made clear anywhere in the documentation. Yes, I wrote simpler functions that just added 1 to a given numeric value, and they all failed with the same cryptic message. At this point the frustration is over: our Phoenix system uses indexes, is fast, and handles a rather flexible schema.

However, my gripe is that with relatively simple addenda to the documentation, the path to the happy place would have been much easier to find.

Hello

Hello World….

The first thing we try to say when we try out a new system. This blog will detail my trials and tribulations as a data scientist who also does a fair amount of software development for the pharmaceutical industry. For those who are curious, this site is served as static HTML, with most of the templating done through Hexo, a Node.js package that takes Markdown and converts it to HTML.