Like the NZISM, the NZAIM should delineate the different types of algorithm and AI techniques

The NZISM defines common cybersecurity procedures across information technology systems. It separately identifies specific controls for particular types of systems, such as software, email systems, networks, and gateways. The NZAIM should adopt this unified yet specific approach.

Guidance should distinguish between techniques only as precisely as is practical to minimise the compliance burden. A useful organising principle is the evaluation paradigm under which a system is assessed. The methods an agency uses to ensure system performance largely determine both the operational guidance relevant to development and the monitoring and governance arrangements needed on an ongoing basis. Importantly, these methods tend to be consistent within a given evaluation paradigm, enabling common best-practice guidance, assurance and governance approaches to be applied across a diverse range of techniques.

For example, ACC’s use of algorithms and AI spans from the most basic to the most advanced. When a claim is sent to ACC, business rules search the claim’s free-text accident description¹ for specific terms to code information into structured data. Such algorithms are simple to develop and evaluate using traditional testing. However, they still require continuous validation. Here, manual validation is required to identify new terms not covered by existing rules, such as “collided with Flamingo [scooter]” or “Tesla autopilot crashed into a pole”. ACC can substitute more complex algorithms, such as fuzzy matching, but the evaluation methods remain the same.
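As a minimal sketch of this kind of business rule, the Python below searches free-text descriptions for terms and queues anything no rule recognises for manual validation. The `RULES` table and the `code_description` and `triage` functions are hypothetical names and terms chosen for illustration; they are not ACC's actual rules.

```python
import re

# Hypothetical keyword rules: each structured code maps to terms searched for
# in the free-text accident description. Illustrative only, not ACC's rules.
RULES = {
    "fall": ["fell", "slipped", "tripped"],
    "vehicle": ["car", "motorbike", "crashed"],
    "sport": ["rugby", "netball", "tackle"],
}


def code_description(text: str) -> set[str]:
    """Return every structured code whose terms appear as whole words."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {code for code, terms in RULES.items() if words & set(terms)}


def triage(descriptions: list[str]) -> tuple[dict[str, set[str]], list[str]]:
    """Split descriptions into coded ones and ones queued for manual review."""
    coded, uncoded = {}, []
    for text in descriptions:
        codes = code_description(text)
        if codes:
            coded[text] = codes
        else:
            uncoded.append(text)
    return coded, uncoded


claims = [
    "Slipped on wet stairs at home",
    "Tackle during rugby game",
    "Collided with Flamingo scooter",
]
coded, uncoded = triage(claims)
print(coded)    # descriptions matched by at least one rule
print(uncoded)  # descriptions needing manual validation and possibly new rules
```

The uncoded queue is the point of the sketch: the rules themselves are easy to test, but new terms only surface when unmatched descriptions are reviewed on an ongoing basis.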

If a claim is successfully coded by these rules, ACC immediately processes the claim using a supervised model (logistic regression) to determine auto acceptance¹ or hold for manual assessment. Here, the rules are automatically generated by optimising a discrete output (accept or hold) based on past data characteristics. These rules are also straightforward to assess: the model is applied to data it has not seen before to measure overall accuracy, and the results are compared across protected attributes like sex and age to monitor fairness. ACC can substitute more complex models in this system, but the evaluation methods remain the same.
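The evaluation described here can be sketched in a few lines of Python. The example assumes that predictions have already been produced for held-out claims; the data values, the `age_band` attribute and the function names are hypothetical. Comparing accuracy across groups is only one possible fairness measure, and which measure to use remains a normative choice.

```python
from collections import defaultdict


def accuracy(actual: list[str], predicted: list[str]) -> float:
    """Share of held-out cases where the prediction matches the known outcome."""
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)


def accuracy_by_group(actual, predicted, groups) -> dict[str, float]:
    """Accuracy computed separately for each value of a protected attribute."""
    by_group = defaultdict(lambda: ([], []))
    for a, p, g in zip(actual, predicted, groups):
        by_group[g][0].append(a)
        by_group[g][1].append(p)
    return {g: accuracy(a, p) for g, (a, p) in by_group.items()}


# Hypothetical held-out claims that the model did not see during development.
actual = ["accept", "accept", "hold", "accept", "hold", "accept", "hold", "accept"]
predicted = ["accept", "accept", "hold", "hold", "hold", "hold", "accept", "accept"]
age_band = ["under 65", "under 65", "under 65", "under 65", "65+", "65+", "65+", "65+"]

print(accuracy(actual, predicted))                     # overall accuracy
print(accuracy_by_group(actual, predicted, age_band))  # accuracy per age band
```

In this made-up data the model is less accurate for the older age band than overall, which is the kind of gap that fairness monitoring is meant to surface.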

Any accepted claimant with more complex needs may need to contact ACC over the phone. Their calls will likely be transcribed² by a generative AI model with millions to billions of times more complexity than the previously mentioned algorithms. The open-ended nature of generative AI makes it harder to empirically measure performance in the context in which it is deployed than is the case for non-generative AI. Instead, evaluation relies on expert and user acceptance of sample answers for a defined use case. ACC can increase the system’s flexibility and utility, such as augmenting the call transcript with its knowledge base, or changing to a different model architecture with improved language translation, but the evaluation methods remain the same. To remain flexible to technological developments, guidance should not descend into finer technical distinctions beyond the evaluation method.

4.1 Action: Adopt a taxonomy that recognises commonalities between algorithms, traditional and generative AI

Given these common controls across algorithms and AI, I recommend that AI system guidance cover traditional algorithms. This coverage is not evident in existing AI-specific guidance. It is not solely a matter of academic precision, but a practical safeguard. Traditional automations outside the modern understanding of AI still have the same potential for beneficence (e.g. freeing up workers from menial, repetitive tasks, standardising and de-biasing processes) and maleficence (e.g. Robodebt³, Dutch childcare benefits scandal⁴) as AI. Promoting this understanding ensures that governance and monitoring efforts appropriately cover all systems with the potential for material impact, not only those labelled as AI. Furthermore, the same legal obligations apply to any impactful decision-making system, regardless of whether it is a traditional algorithm or an AI system.

A similar risk arises when conflating AI with generative AI. AI systems have long been deployed across government using traditional predictive techniques, as outlined in Section 5. Focusing efforts on newer generative AI systems, currently limited to low-impact administrative uses, while ignoring the monitoring of established predictive AI systems, risks diverting oversight away from systems that already make consequential decisions about users of government services.

Figure 2 illustrates a categorisation that recognises the similarities and differences among these evaluation paradigms, visualising the nesting and overlaps among the categories. Below, I propose definitions for key categories of algorithms and AI, and discuss:

what other categories of techniques fall into that category (subsets)
what other categories share techniques in that category (intersections)
examples of techniques in that category
methods for evaluation of model suitability, both during design and operation
challenges universally associated with designing the systems included within that category. Subcategories inherit the challenges of their parent category. For example, the issues identified for algorithms (automation bias, auditability, monitoring and evaluation) extend to all techniques mentioned in this paper.

Figure 2: Overview of relevant categories for algorithmic and AI techniques with distinct evaluation and interpretation considerations. Algorithms form the superset of all analytical techniques discussed in this paper, within which AI constitutes a subset. Two of the most relevant paradigms in AI are shown. The first is machine learning (ML), the main set of the most used AI techniques. The second is evolutionary computation (EC), a distinct type of AI used for optimisation and simulation. Relevant types of ML are also shown: supervised learning, unsupervised learning, reinforcement learning. Two other categories employ a mix of those three techniques in an advanced, novel manner: deep learning (which itself is not an evaluation paradigm but a type of model), and generative AI. Goal-driven optimisation is a distinct evaluation paradigm.

Algorithms

Methodical set of instructions that can be executed by a machine. Algorithms are typically authored by humans but increasingly can be written by generative AI.

Subsets (non-exhaustive): goal-driven optimisation, artificial intelligence

Examples: any logic that runs on machines, from a simple business rule to categorise an email, to a generative AI model powering advanced features in existing software.

Evaluation: varies by technique. At a high level, algorithms are validated using traditional software testing frameworks. Test frameworks typically distinguish between verifying the outputs of the system and validating the outcomes the system sought to achieve. Tests are typically manually written but increasingly can be written by advanced AI.

Challenges:

wide definition makes it hard to track for governance purposes
lack of continuous monitoring risks performance degradation as the operating environment changes from when the rules were developed (concept drift) or as the system encounters new data it cannot properly handle (data drift)
lack of documentation may deny individuals their official information rights to rules that lead to, or the reasons for, decisions or recommendations
poor auditability without an audit trail required for all public records, including those generated by an algorithm
automation bias may result in uncritical human deference to automatic predictions
appeals, judicial review and Ombudsman investigations may be taken against a decision, recommendation or action performed by an algorithm to test whether it was made lawfully. A degree of reversibility may be required.
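The drift challenge listed above can be monitored with simple statistics. One commonly used measure is the population stability index (PSI), which compares the share of inputs falling into each category at development time with the share seen in operation. The sketch below is illustrative only: the category names and counts are hypothetical, and any alert threshold would be a choice for the operating agency.

```python
import math
from collections import Counter


def shares(values: list[str], categories: list[str], floor: float = 1e-6) -> list[float]:
    """Proportion of values in each category, floored to avoid log(0)."""
    counts = Counter(values)
    return [max(counts[c] / len(values), floor) for c in categories]


def population_stability_index(baseline: list[str], current: list[str]) -> float:
    """PSI = sum((current - baseline) * ln(current / baseline)) over categories."""
    categories = sorted(set(baseline) | set(current))
    base = shares(baseline, categories)
    curr = shares(current, categories)
    return sum((c - b) * math.log(c / b) for b, c in zip(base, curr))


# Hypothetical input mix when the rules were developed versus today.
baseline = ["fall"] * 50 + ["vehicle"] * 30 + ["sport"] * 20
shifted = ["fall"] * 30 + ["vehicle"] * 30 + ["sport"] * 20 + ["e-scooter"] * 20

print(population_stability_index(baseline, baseline))  # identical mix gives zero
print(population_stability_index(baseline, shifted))   # new category raises the index
```

A measure like this flags that inputs have changed; deciding whether the rules still produce the right outcomes remains a manual judgement.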

Predesigned / handcrafted algorithms

Algorithms consisting of predefined logic, rules or ‘symbols’ (as in the traditional category of symbolic AI) for reaching an output. These algorithms are typically manually designed by humans but may be designed by large language models through linguistic likelihood, rather than mathematical derivation. This category excludes data-derived systems that may themselves consist of rules or symbols, such as decision tree classifiers.

Examples: expert systems, business rules, predefined decision trees

Evaluation: as for algorithms

Challenges: relative simplicity and explainability may come at the expense of accuracy.

Goal-driven optimisation (GDO)

Algorithms that search for the best solution to a defined problem, where an exhaustive search becomes exponentially difficult. They typically use simulations to model a solution’s interaction with its environment. The “best solution” may be the end intervention itself, such as a set of school bus routes that maximise coverage and minimise resource use. It may also be the best strategy for how actors behave and react to an intervention.

Actors⁵ optimise their behaviour based on their own motivation (e.g. economic gain, travel time minimisation, personal health), reaching equilibrium against others in the simulation.

Subsets: reinforcement learning, evolutionary computation.

Examples: navigation application planning, supply chain optimisation, school bus route generation (Ministry of Education’s School Route Transport Optimiser), transport system modelling (Ministry of Transport’s Monty), large population models (PHF Science’s ALMA)

Evaluation: the output is necessarily the ‘best’ given the problem and search constraints, but further validation is required to ensure the desired outcome was met (cf. impact evaluation in policy design)

Challenges:

defining the true problem, constraints and objectives is difficult: it requires translating end-user requirements into measurable equations, which can often be incomplete, poorly framed or framed from a deficit basis, and can change over time
representativeness of simulated population and behaviour to actual population
computationally expensive, often using AI methods to more quickly find a sufficiently (not always the most) effective solution

Artificial intelligence (AI)

Machine-based systems that infer (based on implicit or explicit objectives) from the input they receive how to generate outputs (predictions, content, recommendations, decisions, actions).

Subsets: machine learning, evolutionary computation

Examples: from basic linear regressions or probability models that predict numbers or outcomes; to advanced deep neural networks that can generate complex nuanced text and images, or guide decision-making of simulated actors.

Evaluation / challenges: varies by technique

Machine learning (ML)

AI models that are automatically developed using existing observations, understanding and approximating how that data contributes to an output.

Subsets: supervised learning, unsupervised learning, reinforcement learning, deep learning, generative AI

Evaluation: self-evaluates its output at each iteration of model development using the objective it was given, but further manual evaluation is typically required to assess whether the objective has produced the right outcome (cf. impact evaluation in policy design)

Challenges:

data accuracy – predictions are only as reliable as the underlying data, with risks of systemic error (e.g. police non-completion of callout assessment correlated with victim ethnicity), human error (e.g. staff assume data points instead of asking the subject), and faulty assumptions (e.g. using outputs as proxy for outcomes).
data robustness – data must be representative of the population and operating environment to which a model will be applied (e.g. by oversampling under-represented groups, discarding stale data).
confounding variables – hidden factors influencing variables may result in misleading conclusions, such as home environment stability confounding the relationship between school attendance and educational achievement.
fairness and equity – predictions often reflect the biases within the data and the wider system where the data originates from; as it is difficult to satisfy all the different definitions of fairness, a normative decision is required for what kind of equity is desired and optimised for.
accountability – as no human was directly responsible for generating the decision-making process, responsibility and liability are less clear.

Supervised learning

ML models that infer how to generate outputs based on the relationship of previously observed inputs with an associated output.

Subsets: self-supervised learning (associated output comes from the input data, typically used in generating large complex sequences like languages, images)

Examples: classifiers (decision trees; gradient boosting, e.g. StatsNZ International Migration predictions; logistic regression, e.g. ACC Probability of Accept / RoC*RoI), regressions (linear regression)

Evaluation: measured based on how well the model performs on past but previously unseen examples of data.

Challenges:

generalisation – careful testing methodology required to ensure patterns are learnt (i.e. not just examples memorised) that can be accurately applied outside of training and for extreme edge cases.
validity of targets – assumes target outputs are correct and consistent, which may require adjusting for environmental or legislative changes since the data was captured
suitability of targets – critically assessing whether the chosen predicted output (e.g. risk of reimprisonment and reconviction) aligns with broader desired outcomes (e.g. releasing an offender will not result in future societal adversities), where the output variable may introduce confounding (e.g. reimprisonment may be less likely in less policed areas).

Unsupervised learning

ML models that infer how to generate outputs from the input they receive without explicitly knowing what outputs to generate (from past examples), based only on underlying patterns in the input data.⁶

Subsets: semi-supervised learning (a small, labelled dataset provides the starting point for unsupervised learning on a larger unlabelled dataset)

Examples: anomaly detection (e.g. fraud and abuse detection), clustering (e.g. cohort detection for targeted policy interventions)

Evaluation: measured by how well the model captures structure and patterns in the data (e.g. separation of clusters, reconstruction quality). Developing robust objective metrics is more difficult during model development because, unlike supervised learning, there is no known ground truth. A semi-supervised approach (labelling a small dataset) may be used for known concepts like fraud and abuse.
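As one illustration of a cluster-separation measure, the sketch below computes the silhouette score, a standard metric that compares how close each point is to its own cluster with how close it is to the nearest other cluster. The points and labels are made up, and scoring singleton clusters (or a single overall cluster) as zero is a convention adopted here for simplicity.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette: near 1 means tight, well-separated clusters."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Collect distances from p to every other point, grouped by cluster.
        dists = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                dists.setdefault(lab, []).append(math.dist(p, q))
        means = {lab: sum(d) / len(d) for lab, d in dists.items()}
        if own not in means or len(means) < 2:
            scores.append(0.0)  # singleton cluster, or only one cluster overall
            continue
        a = means[own]                                           # cohesion
        b = min(m for lab, m in means.items() if lab != own)     # separation
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Two visibly separate groups of hypothetical observations.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]   # clusters follow the real structure
poor_labels = [0, 1, 0, 1, 0, 1]   # clusters cut across the real structure

print(silhouette_score(points, good_labels))
print(silhouette_score(points, poor_labels))
```

A high score shows only that the clusters are internally consistent; it does not confirm that they correspond to a concept of interest such as fraud, which is why labelled samples are still useful.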

Challenges:

fairness and bias – despite seeming inherently unbiased by learning from underlying patterns, these models are still susceptible to learning systemic biases embedded in the input data, or statistical biases from a lack of sufficient representation.
imbalanced data – patterns of interest (e.g. fraud and abuse) may be overshadowed by patterns of normal behaviour, making detection or pre-definition difficult.

Self-supervised learning

ML models that infer how to generate outputs by reproducing input data rather than relying on an independently provided target. As the target output comes from the input data itself, this ML paradigm combines the verifiability of supervised learning (withheld data can be used as ground truth) and the scalability of unsupervised learning (no manual labelling is needed, which also eliminates bias from proxy labels).

Examples: LLM pre-training, large population models (e.g. PHF Science’s ALMA)

Evaluation: measured by how well the model recreates previously unseen input data.

Challenges: fairness and bias as with unsupervised learning

Generative AI (GenAI)

Deep learning (DL) models that create new data (text, images, video, music, speech) by learning and mimicking the underlying patterns within existing data.

Examples: large language models (e.g. GPT-5, Claude Sonnet), orchestrating LLMs (e.g. Microsoft Copilot Analyst, GitHub Copilot), diffusion models (e.g. Midjourney, OpenAI Sora)

Evaluation: as with all ML techniques, but output evaluation is much more difficult due to the near-infinite and unpredictable nature of its output. The tasks these models have been trained for (e.g. generally predicting the next likely word in a sequence) may be different from the tasks they are used for in practice (e.g. analysing policy intervention options, performing calculations on tables pasted into chat). Outcome evaluation is therefore more important and is typically done manually (and is always manual when using off-the-shelf models that the deployer does not train further).

Challenges:

hallucination – production of plausible but false content, exacerbated by the potential for deceptive confidence: delivering false information as authoritatively and fluently as faithful information; and the subtlety of potential errors given its optimisation for linguistic likelihood rather than factuality.
homogenisation – generating overly uniform content leading to the marginalisation of underrepresented ideas and identities by regressing to the mean.
user hijacking – manipulating the system with inputs that result in undesirable behaviour (e.g. prompt injection, data poisoning)
supply chain provenance – lack of transparency around the reliability of constituent components of a GenAI system (e.g. are datasets and pre-trained models accurate, representative, and obtained legally and ethically).
- Provenance is especially important if outputs may derive from:
  - protected mātauranga Māori (e.g. language, art, knowledge) as permission must be sought from the authors or kaitiaki of such materials
  - tapu materials which should never be used to create new outputs
generation of dangerous content – including dangerous weapons; dangerous, violent or hateful content; misinformation and disinformation; offensive cyber-attacks; obscene harmful imagery
unauthorised data integration or deanonymisation – leakage of sensitive information from training data or inputs, both explicitly (e.g. publicly available social media profiles) and implicitly (e.g. semantic cues that imply personal attributes)
content attribution – as part of the general challenge of auditability and recordkeeping.
environmental impact – training models requires vast computational resources (and thus energy or water for cooling); the impact is smaller for end use of pre-trained models, but cumulative resource use can be substantial at scale.

Reinforcement learning (RL)

AI models that learn how actors in an environment should effectively act (known as an actor’s policy, e.g. writing the next best word, performing the next best economic transaction, waiting or administering a medical treatment) rather than training on past examples.

Examples: large language models (RL from human feedback), macroeconomic models (e.g. Salesforce AI Economist), dynamic healthcare treatment personalised to patient characteristics, needs and real-time diagnostics

Challenges: generalisability – genuinely robust solutions are hard to distinguish from fortuitous solutions that happen to perform well under the conditions encountered in a particular training run. This instability is worsened as each actor’s actions directly shape the data it learns from, creating correlations that amplify noise.

Evolutionary computation (EC)

AI models that mimic the trial-and-error, survival-of-the-fittest nature of biological evolution. Unlike traditional machine learning, which primarily optimises solutions mathematically, evolutionary computation randomly adjusts candidate solutions and retains a subset of the best ones. These methods can be especially useful when conventional ML reaches a dead end and cannot find a mathematical way to improve a solution, or when there is no clear mathematical formulation for how to improve it. Evolutionary computation maintains a pool of potentially effective candidates and evaluates them using externally defined measures of “fitness” that are not constrained by the internal structure of the solution.
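The mechanism described above (randomly adjust candidates, score them with an external fitness measure, retain the best subset) can be sketched as follows. This is a deliberately simple illustration rather than any agency's model: the `evolve` function, its parameters and the toy `fitness` function are all assumptions made for the example.

```python
import random


def evolve(fitness, pool, generations=50, keep=5, offspring=4, step=0.5, seed=0):
    """Return the fittest candidate found after repeated mutation and selection."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    for _ in range(generations):
        # Randomly adjust each candidate to create several variations of it.
        children = [
            [gene + rng.gauss(0, step) for gene in parent]
            for parent in pool
            for _ in range(offspring)
        ]
        # Retain only the fittest subset of parents and children combined.
        pool = sorted(pool + children, key=fitness, reverse=True)[:keep]
    return pool[0]


def fitness(candidate):
    """Externally defined score: higher is better, with the best point at (3, -1)."""
    x, y = candidate
    return -((x - 3) ** 2 + (y + 1) ** 2)


start = [[0.0, 0.0]]
best = evolve(fitness, start)
print(best, fitness(best))
```

The search never inspects how `fitness` is calculated; it only compares scores. This is why the approach still works when there is no clear mathematical formulation for improving a solution. Keeping parents in the pool means the best score cannot get worse between generations, which is one simple guard against the genetic instability noted under the challenges for this category.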

Examples: actor behaviour in simulations (e.g. MATSim in Ministry of Transport’s Monty traffic model, where simulated transport system users determine their own daily travel plans based on historical data and ‘compete’ with other users until everyone reaches an optimised equilibrium that balances everyone’s travel objectives without pre-defining behaviour, which is beneficial for modelling infrastructure change), resource allocation and scheduling

Challenges: genetic instability – careful model design is often required to ensure beneficial traits are not lost to excessive variation, reducing the likelihood of settling on a truly effective solution.

4.2 Action: A new NZAIM should recognise the different technical challenges of different AI paradigms

Evidently, there are common challenges and considerations that apply to all algorithmic and AI techniques. For example, the risks of automation bias, concept drift, and output-outcome misalignment can be realised with any algorithm, regardless of the technique used. These risks are rooted in the broader system and environment where these algorithms operate, rather than in the algorithms themselves. However, these challenges often appear and are addressed differently depending on the technique. For example, automation bias is relatively straightforward to tackle with a single prediction from supervised learning, compared to the active critical thinking needed to evaluate the open-ended output of generative AI.

These nuances require more detailed technical guidance than current one-size-fits-all approaches. This is not just about technical accuracy, but a user-focused need to reduce perceived compliance efforts, especially for non-generative AI, which already has decades of experience in trustworthy development. For example, while there are standard methods for assessing the performance and fairness of supervised machine learning, evaluating generative AI is far less straightforward and often depends on manual assessment in the specific context in which it is used. A new NZAIM must be designed accordingly, offering methodologies and controls appropriate for certain or all types of systems.

Like the NZISM, the NZAIM should firstly provide broadly applicable controls to all systems, then prescribe controls based on the risks unique to the evaluation paradigm used by an AI model.

4.3 Action: Consider how to promote trustworthy AI use in simulations and goal-driven optimisation

Using this more precise taxonomy uncovers an opportunity for goal-driven optimisation to enhance the rigour of intervention appraisals and evaluations, thereby supporting more robust policy decisions, if not the rigour of intervention design itself. Instead of recreating past responses to interventions (as in supervised learning) or hypothesising policy impacts from emergent linguistic patterns (as in large language models), GDO models directly compute the effects of an intervention within a virtual environment.

This use case is not hypothetical, as agencies such as the Ministry of Transport use GDO (co-evolutionary algorithms) to simulate changes in traveller behaviour in response to a given transport system intervention. Vendors like PHF Science have developed new GDO models that lower the computational barriers to precisely and representatively simulating five million New Zealanders and their reactions to interventions.

Promoting effective and trustworthy AI use and decision-making with GDO requires system-wide coordination, as multiple agencies contribute data to these models and explore their use in isolation. Issues that a system lead agency can address include:

Source data improvements: simulation actors are trained by replicating patterns from historical data and often rely on data sources that the interested agency does not collect, such as StatsNZ’s Census data. The impending redevelopment of StatsNZ’s census and survey programme presents a timely opportunity to consider how effective existing data sources are in simulations and GDO, and which new (or extensions to existing) datasets can have the greatest impact for a wide range of user agencies.
Streamlining procurement: agencies currently commission bespoke GDO solutions tailored to their needs. The vendor market could be evaluated to see if vendor(s) can meet a wide range of agencies’ requirements by offering domain-agnostic models that provide flexibility across agencies’ data and operational environments without creating bespoke model architectures for each use case.
Encouraging use cases: existing communities of practice and networks can more intentionally promote this category of AI, which is currently limited to agencies responsible for explicit system design, such as the transport system. Any agency can reconceptualise their operations – or even policy remit – as an explicit system to be simulated, such as customer journey optimisation or regulatory impact management, with greater detail than traditional coarse appraisal models like Treasury’s CBAx.

ML models predict to decide, generative models mimic to create, GDO models simulate to solve. GDO is already being used by agencies like MoT. System leadership can promote use and streamline development of GDO models.

References

1. ACC, August 2018. Statistical models to improve ACC claims approval and registration process. https://s3.ap-southeast-2.amazonaws.com/nzdoctor.files/production/public/2018-08/claims-approval-technical-summary_0.pdf
2. ACC, March 2025. ACC Privacy Impact Assessment (PIA) - Agent Copilot. https://www.acc.co.nz/assets/corporate-documents/Privacy-Impact-Assessment-Agent-Copilot.pdf
3. Australian federal government data matching algorithm between one’s declared income to the social service agency and actual income known to the tax agency to calculate welfare overpayments. This algorithm made invalid assumptions about income, and its validity was never tested. This scheme triggered parliamentary inquiries and a Royal Commission report, which became a political issue in the election where the responsible government lost.
4. Dutch government risk-scoring algorithm to determine the risk of childcare benefit fraud. Rules were manually developed by humans, factoring in protected attributes like dual nationality. Rules and outputs were not made available to flagged households, and decisions could not be contested. Staff succumbed to automation bias, treating the outputs as evidence of fraud rather than a signal to investigate actual evidence of fraud. Consequently, parliamentary opposition moved a motion of no confidence in the government, forcing the serving government to resign.
5. I avoid using the term ‘agents’ and defining ‘agentic AI’ as a category. Agency describes a philosophical capacity to act with intention, rather than a description of technical evaluation. This taxonomy instead distinguishes between orchestrating LLMs and GDO models. LLM-orchestrated ‘AI agents’ (such as Microsoft Copilot Researcher or GitHub Copilot) only reason the next likely action based on probabilistic likelihood, as explained in Box 4.1.9. Even though they can perform actions, orchestrating LLMs lack the independent intent to measure and satisfy a goal beyond prompt completion. In contrast, GDO models learn how to act within an environment, real or virtual, guided by a defined mathematical objective. For example, if an ‘agent’ is tasked with devising an individual’s unique injury rehabilitation plan, an orchestrating LLM would generate a plan mimicking prior examples with similar circumstances. A GDO model would simulate the patient’s unique biomedical, psychological and social risk factors and calculate how specific treatments and programmes optimise the probability of returning to independence. Either approach is valuable in different situations: LLM orchestration provides a user-friendly natural language interface that draws on historical patterns, while GDO provides systemic mathematical fidelity that models the actual effect of a plan.
6. This framework differentiates between supervised and unsupervised learning models to recognise the findings of a review by the Australian National Audit Office (ANAO) into the Australian Tax Office’s (ATO) use of AI. The ATO’s use of AI comprised 71% unsupervised learning models, by virtue of their role in identifying non-compliance where patterns of concern may be hard to detect, poorly defined, or emerge from novel, unanticipated behaviour. Presumably due to the challenges of measuring these models’ performance, the ATO did not develop ways to measure any of the models’ performance. Adaptive identification of non-compliance will be desirable for – if not already used by – many government agencies here. Therefore, standard guidance should recognise the increased risk around the development of unsupervised non-compliance models.


