Monday, October 31, 2016

Examining Postgraduate Theses

I've examined a number of postgraduate theses by this point in my career. These are Doctoral and Master's theses from New Zealand and overseas institutions. While most of those theses have been a real pleasure to review, some have been real horrors. Even the ones I enjoyed examining often had errors in them. The errors that appear, though, tend to be the same kinds of errors. That is, candidates for higher degrees are making the same errors even though they are from different institutions. So, I've written this post to discuss these errors and how to avoid them: it describes what I look for in a thesis, and what I don't want to see in a thesis. Since I've mostly examined Doctoral and Master's theses, these are the focus of this post.

The Examination Process

This varies a bit according to the institution, but the general structure is the same. The process usually goes something like this:
  • I am asked if I am interested in examining the thesis. I usually get sent the candidate's name, the title of the thesis and the thesis abstract.
  • I say "Yes, I am interested" although I occasionally say "No" if the thesis is outside of my field of expertise.
  • Some time later, I receive an examination pack. This usually contains things like:
    • The thesis to be examined
    • Institutional guidelines for examiners
    • A marking sheet, where I make my recommendation and comments
    • Other forms for payment of the honorarium and tax 
  • I read the thesis several times, making comments on the pages each time.
  • Using my comments, I write a report on the thesis and make a recommendation.
  • There is sometimes an oral exam / viva held later, although more and more institutions seem to be moving away from those now.

Examiners usually get paid an honorarium, and as a New Zealand resident the honorarium I receive from a New Zealand institution has always had tax deducted from it. The size of the honorarium varies between institutions, but if you consider the time it takes to examine a thesis, it comes out at substantially less than minimum wage.

I prefer to receive the thesis as a PDF, as I find the examination process much easier when done electronically. I usually load the PDF onto my tablet, so I can get some examination work done during my daily commute on the train. When I submit my report and recommendation, I send the marked-up PDF with it.

Clarity

A thesis should be clear. Don't leave the hard mental work to the reader of the thesis! Lay everything out for them, especially why you are doing this work. There is presumably a reason for doing the work you did, apart from "your supervisor told you to do it". The motivation for pursuing the research in the thesis should be laid out clearly and as early as possible.

The literature review should be relevant to the topic of the thesis. I don't want to have to wade through pages of literature review that don't have anything to do with the thesis. Or, to put it another way, don't "stuff" your literature review; it just annoys the examiner. A good question to ask about any part of the literature review is "why is this in the thesis?".

The literature review should be to-the-point. Anyone examining the thesis will have been carefully chosen and they will be experts in the field. Spending pages reviewing or describing material that a third-year student in the field should know is a waste of time and space. Better to just cite the key relevant papers and move on.

The literature review should also be critical. What are the holes in the literature? What is wrong with what has been previously published? What could have been done better? The work in a thesis should build upon what has gone before. It is incredibly rare that a thesis introduces an entirely new field.

If a thesis isn't clear, then it won't pass the examination. If you're lucky, then the examiner will give you enough detail to fix it and will allow another examination. If not, then you will fail.

Citations

Any statement that is made should be backed up by data, or logical argument, or citation. In a thesis, most statements will be backed up by citation. This is especially true of the literature review.

Citations should be formatted correctly. Citations that are in-text are usually done something like (Smith, 1999). When the citation is referred to directly, it is something like "as in Smith (1999)". This is also the form used when the citation leads the sentence, for example "Smith (1999) said that...". While most authors nowadays will be using reference-management software, you should know how to use the software and not rely on the defaults.

This might seem like a small thing, but every time a reader comes across an incorrectly-formatted citation, it can break the flow of their reading. Break the flow of reading enough and the reader gets frustrated. That's not what you want when the reader is an examiner with the power to make the last three (or more) years of your life irrelevant.

Typos

Typos are a fact of life. Everyone makes mistakes while writing, but there are some things you can do to reduce the number of mistakes that make it through to the examiner.

Firstly, use a spell-checker. These are so straightforward to use now that there is no excuse for any incorrectly spelt words to appear in a thesis that is going for examination. However, relying on a spell-checker is also dangerous. Spell-checkers only tell you if a word is spelt incorrectly; they won't tell you if it is the wrong word to use. So, proof-reading is still essential.

Grammatical errors should also be checked for. While English has its quirks, these quirks must be known and dealt with. Small errors in grammar can completely change the meaning of a sentence. A common error is using incorrect tenses. For example, experiments reported in the thesis have already been done. They are in the past, so use the past tense to refer to them.

Tables and Figures

Tables and figures are one of the most effective ways of presenting data, provided they are used appropriately and carefully. There are some common mistakes that you must avoid in tables and figures.

Firstly, do not use unnecessary precision in a table. If the table is presenting the area of city blocks, then presenting areas to the square millimetre is excessively precise.

Secondly, every column, at least, should be labelled. There are exceptions, of course, but it is important to consider whether the table could be understood without the labels. The rows and columns should also be in a logical structure, with related values grouped together.

The caption of a table or figure should be stand-alone, and should explain what the table or figure is showing. For tables, that means that column labels need to be described or defined. That is, the reader should be able to interpret the table or figure without having to refer to the main text of the work. This is because a table or figure often ends up being displayed on a different page to the explanation of the table, and having to flip back-and-forth between pages while trying to understand presented data is annoying. This can lead to some long captions.

For plots of data, be careful with legends and labels. These should be informative, not just some default like "x" on the x-axis. Again, the goal is clarity, as the purpose of a plot is to communicate to the reader.

There is, in my opinion, almost no situation under which a 3D plot makes sense. The 3D bar-charts in MS Excel are particularly bad and should not be used under any circumstances. 3D plots serve no purpose other than to show that the author knows how to make them. They do not make data clearer, and they are harder to interpret accurately.

Do not use line plots for discrete data. For example, a school has three terms per year, and students may commence their studies at the start of any of the terms. If we were to plot the number of students who commenced in each term across a period of several years, we would use a scatter plot, because the quantity being plotted (number of students) is discrete. A line plot would imply that the number of students who commenced in a particular term is different halfway through the term than it is at the start of the term. Since we've already established that students commence at the start of the term, this is plainly incorrect.

If presenting several different series of values on the same plot, then distinguish between them by making the point markers clearly different shapes. Do not rely on colour for this! There are two reasons why. Firstly, a non-trivial portion of the population are sufficiently colour-deficient that they will not be able to perceive the difference, especially between red and green. Secondly, a thesis will likely be printed in greyscale, which completely hides the colours.

Check the cross-referencing to tables and figures. I once examined a thesis where all of the cross-referencing was incorrect - the cross-references in the text referred to figures and tables that did not exist - which made the results all but impossible to interpret. If you use a package like LaTeX to write your thesis, and carefully check the error messages when compiling your document, this is not an issue. For other writing software, like MS Word, you need to be a bit more careful.

Finally, do not use the word "plot" in a caption for a plot, or "table" in the caption for a table. I know what a table is, and I know what a plot is. I don't need to be told.

Equations

Used properly, equations are an effective way of communicating complex concepts. However, it is very easy for equations to become opaque and uninformative. To avoid this, equations must be laid out carefully and consistently. Again, LaTeX is good for this kind of thing, as its equation tools are very powerful.

Every variable in an equation should be defined somewhere, ideally following the first equation in which it is used. Similarly, variables should not be re-used. A table of symbols can be helpful.

Experiments

You must understand your data. What process created it? What are the variables? What do the variables mean? What are the ranges of the variables? What are the scales of the variables? Are they nominal, ordinal, interval? Remember, just because something is expressed as a number, doesn't mean you can do arithmetic with it. Some statistics are invalid for some kinds of data, so a working knowledge of measurement theory and statistics is essential.
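As a rough illustration of why arithmetic on nominal data is meaningless, the sketch below codes the same labels as numbers in two arbitrary ways. The labels and both codings are invented for this example. The mean changes with the coding, while the mode (a statistic that is valid for nominal data) does not.

```python
from statistics import mean, mode

# Hypothetical nominal data: the categories have no natural order.
labels = ["cat", "dog", "dog", "bird"]

# Two equally arbitrary ways of turning the categories into numbers.
coding_1 = {"bird": 0, "cat": 1, "dog": 2}
coding_2 = {"dog": 0, "bird": 1, "cat": 2}

# The "average category" depends entirely on the arbitrary coding...
mean_1 = mean([coding_1[x] for x in labels])
mean_2 = mean([coding_2[x] for x in labels])

# ...whereas the most common category is the same however it is coded.
most_common = mode(labels)

print(mean_1, mean_2, most_common)
```

If a statistic changes when the labels are merely renamed or renumbered, it is not telling you anything about the data.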

Some data sets will have hidden biases. These biases will influence any model that is built using the data and must therefore be accounted for. Remember, if you are using biased data to build a model, you will end up with a biased model.

The data must be represented in a logical way. Some models, such as neural networks, can only handle discrete values like class labels if they are represented orthogonally, that is, with a separate input or output for each value (often called one-hot encoding).
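One common reading of an orthogonal representation is one-hot encoding, where each class gets its own position in a vector. The sketch below is a minimal version of that idea using only the standard library. The function name and the example labels are made up for illustration.

```python
def one_hot_encode(labels):
    """Map each distinct label to an orthogonal (one-hot) vector.

    Returns the sorted list of classes and one vector per input label.
    """
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    vectors = []
    for label in labels:
        vector = [0] * len(classes)
        vector[index[label]] = 1  # exactly one position is set per label
        vectors.append(vector)
    return classes, vectors


# Hypothetical class labels.
classes, encoded = one_hot_encode(["red", "green", "blue", "green"])
print(classes)  # ['blue', 'green', 'red']
print(encoded)  # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Encoding the same labels as 0, 1, 2 on a single input would instead imply an order and a distance between classes that does not exist.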

When evaluating the accuracy of a classification model, you must give some thought to the distribution of classes in the data set. If 90 % of the data in the data set are from one class, then it is quite simple to create a model that is 90 % accurate: it just classifies every example as the most common class. A simple percentage accuracy is not, therefore, very useful for evaluating the performance of your model.
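The 90 % example can be checked with a few lines of code. The sketch below builds a made-up data set with a 90/10 class split, "trains" a model that always predicts the most common class, and compares plain accuracy with balanced accuracy (the mean of the per-class recalls). Balanced accuracy is only one of several possible alternatives and is used here purely as an illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of examples classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so every class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        members = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in members) / len(members))
    return sum(recalls) / len(recalls)


# Hypothetical data set: 90 examples of class "a", 10 of class "b".
y_true = ["a"] * 90 + ["b"] * 10

# A "model" that ignores its input and always predicts the majority class.
majority_class = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_class] * len(y_true)

print(accuracy(y_true, y_pred))           # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

At a minimum, report the accuracy of this trivial baseline next to the accuracy of your own model, so the reader can see how much the model has actually learned.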

A single partitioning of data is not going to give an accurate estimate of performance of any model. The standard approach, therefore, is to cross-validate over the data set, with a separate, independent, validation set held out (note that some sources call this the test set - the name given doesn't matter, as long as you use such an independent data set). If the data set is too small to use cross-validation, then jackknife over the data set instead. Or, you can bootstrap the data. The point is, there are several different approaches that can be used to produce statistically reliable results. These approaches are so simple, and well-known, that I consider not using them to be sufficient reason to reject a thesis: the candidate plainly does not have sufficient skill in the field to qualify for a higher degree.
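As a sketch of the partitioning described above, the code below first sets aside an independent held-out set and then splits the remaining examples into k folds for cross-validation. The function name, the fold count, the held-out fraction and the seed are all arbitrary choices made for this illustration. In practice an established library would normally be used.

```python
import random


def holdout_and_folds(n_examples, k=5, holdout_fraction=0.2, seed=0):
    """Return (held-out indices, list of (train, test) index splits).

    The held-out set is never used during cross-validation.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    indices = list(range(n_examples))
    rng.shuffle(indices)

    n_holdout = int(n_examples * holdout_fraction)
    holdout = indices[:n_holdout]
    remaining = indices[n_holdout:]

    # Deal the remaining indices into k folds of near-equal size.
    folds = [remaining[i::k] for i in range(k)]

    splits = []
    for i in range(k):
        test_fold = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test_fold))
    return holdout, splits


holdout, splits = holdout_and_folds(100)
for train, test_fold in splits:
    # Train on `train`, evaluate on `test_fold`, and record the score here.
    print(len(train), len(test_fold))
```

The model is trained and evaluated once per split, and the scores are then summarised across the folds. The held-out set is only touched once, at the very end.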

The set-up of experiments must be described in detail, including the parameters of any algorithms used. The goal is for all experiments to be reproducible. The description should also include reasons for selecting any particular algorithm. There is always a reason for selecting an algorithm, even if it is really "Because my supervisor told me to use it". If a candidate can't justify their choice of algorithm, then I do wonder whether they understand the state of the art enough to qualify as a professional researcher.

If the thesis is presenting a new or improved algorithm, then it must be compared to existing algorithms. The choice of algorithms compared to should be justified. It is very easy to find an algorithm that performs so badly that it makes a new algorithm look good by comparison. Be clear about why an algorithm was selected.

All results should be subjected to an appropriate statistical analysis. Statistics show us what the numbers are trying to say. Statistics allow us to separate reality from our own prejudices. A good working knowledge of statistics is, therefore, extremely important.
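Which analysis is appropriate depends on the experiment, so the following is only one possible sketch: an exact paired sign-flip permutation test for comparing two algorithms that were evaluated on the same cross-validation folds. The function name and the per-fold scores are invented for this example and are not results from any real experiment.

```python
from itertools import product


def paired_permutation_p_value(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired differences.

    Under the null hypothesis the sign of each paired difference is
    arbitrary, so every sign pattern is enumerated and we count how often
    the total difference is at least as extreme as the observed one.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    at_least_as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        permuted = abs(sum(s * d for s, d in zip(signs, diffs)))
        if permuted >= observed - 1e-12:  # small tolerance for rounding
            at_least_as_extreme += 1
    return at_least_as_extreme / total


# Hypothetical per-fold accuracies for two algorithms on the same folds.
new_algorithm = [0.84, 0.86, 0.83, 0.88, 0.85]
old_algorithm = [0.80, 0.85, 0.81, 0.84, 0.83]
print(paired_permutation_p_value(new_algorithm, old_algorithm))
```

Enumerating every sign pattern is only practical for a small number of pairs. With many folds or repeated runs, a random sample of sign patterns, or a standard test from a statistics package, would be the usual choice.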

The thesis should interpret the results for the reader. In other words, the thesis should explicitly answer the question "What do the results mean?". This interpretation, of course, must be done within the context of the statistical analysis. The results should also be compared to the literature where possible. Don't leave this interpretation up to the examiner! The examiner might not interpret things the way that you intended.

Response

When the examiners' reports are received by the institution, they will be collated and made available to the candidate. The candidate always has a right of response. Don't be afraid to disagree with an examiner! Examiners are human: they make mistakes, or they might have missed something in the literature. If a candidate does disagree, however, then they should have a solid justification for disagreeing. The candidate will have to convince the examiner that they were mistaken, and that means using facts or logical argument. Personally, I am quite prepared to be proven wrong on anything I write in an examiner's report. But I will only be swayed by a convincing, well-reasoned argument based on either logic or data. If a candidate tries to bullsh*t me, then I will not react well.

If a viva is to be held, then the examiners' reports will be made available to the candidate well before. A viva is a way of demonstrating that the candidate really does know what they are talking about in the thesis, and that they are able to handle questions on their own. It is also an opportunity for the examiners to clarify any lingering issues from the examination. I never try to make a candidate feel uncomfortable or upset in the viva, and I don't understand examiners who do that. It is not an opportunity for an examiner to show off how clever they are, or to exercise their limited power over another person. It is the last step of the examination, and it should be carried out in a professional and collegial manner.

Summary

A postgraduate degree represents a substantial investment of time and effort on the part of the candidate and their supervisor. It behooves all involved in that process to minimise the chances that the effort will be wasted. Putting in the effort to avoid the common issues I have identified above will help to achieve this.