Data and modelling
Responsible training data collection and modelling practices support useful, secure and fair artificial intelligence systems.
Data is the ‘fuel’ that makes artificial intelligence (AI) work. It is used to train the AI model so it can, for example, spot patterns, make predictions and choices, create content, or perform another function for which it was built. The better and more fit-for-purpose the data, the better the AI will work.
Sometimes, even AI developers don’t fully understand how AI systems make decisions – especially with GenAI. That’s why it’s important to prioritise data quality from the start, so outputs can be as reliable as possible.
Data and modelling processes are primarily relevant for those building or developing AI models. However, those using or deploying existing, already-trained AI solutions should be aware of the considerations relating to data and modelling – to better understand whether the system is right for its intended use and fits with business values. This is increasingly relevant as more GenAI applications are released.
IT and Cybersecurity discusses the importance of ensuring good cybersecurity practice to protect AI training data and model integrity.
Fit-for-purpose data
Training data and datasets should be good quality – accurate and ‘clean’, complete, lawfully obtained or accessed, structured appropriately (including to support transparency and explainability), relevant and representative of the environment that the AI model may function in.
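The qualities above can be checked mechanically before training begins. The sketch below illustrates two of the simplest checks – completeness (missing values) and duplicate records; the field names, example records and thresholds are assumptions made for illustration only, and real pipelines would use dedicated data-quality tooling.

```python
# Illustrative data-quality checks on a small training dataset.
# Field names and example values are placeholders for this sketch.

records = [
    {"region": "Auckland", "income": 52000},
    {"region": "Otago", "income": None},      # incomplete record
    {"region": "Auckland", "income": 52000},  # exact duplicate of row 1
]

def missing_rate(rows, field):
    """Fraction of rows where the given field has no value."""
    return sum(1 for r in rows if r.get(field) is None) / len(rows)

def duplicate_count(rows):
    """Number of rows that exactly duplicate an earlier row."""
    seen, dupes = set(), 0
    for r in rows:
        key = tuple(sorted(r.items()))
        if key in seen:
            dupes += 1
        seen.add(key)
    return dupes

print(f"Missing income: {missing_rate(records, 'income'):.0%}")  # 33%
print(f"Duplicate records: {duplicate_count(records)}")          # 1
```

Checks like these flag obvious gaps early; assessing relevance and representativeness needs human judgement about the environment the model will operate in.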
Poor quality training data doesn’t just cost money – it can lead to mistakes, upset customers and damage business reputation.
Improper data collection and processing can mean potential privacy and confidentiality breaches, intellectual property rights violations, or potential data sovereignty and human rights impacts. Consequences may be reputational, financial, punitive, or obstructive (such as via compliance, cease and desist, or asset freezing orders).
AI datasets and systems are likely to have varying degrees of bias (for example, reflecting patterns in historical decision-making that may be based on inappropriate or harmful biases, representation, or how methodologies are implemented). Special attention needs to be paid to potential bias when training data contains information about people – and especially when model outputs and associated decision-making can directly impact individuals, families and whānau and wider communities. AI systems can amplify inaccurate or unreliable outputs, unfairness, discrimination and/or harm to individuals (including those potentially not receiving services due to system discrimination), and/or damage to business reputation.
Tips
Scenario B: BigBuild
BigBuild is a New Zealand construction firm looking to speed up their recruitment processes. They decide to trial a CV screening AI tool to scan applications and rank them based on desired skills, experience and other key words. Those that fall below a certain threshold are filtered out and do not progress further.
Following a risk assessment, BigBuild puts in place some risk mitigations, including being transparent with applicants about AI involvement in the shortlisting process, and implementing a human reviewer for all shortlisted applications.
After deployment, 50 candidates apply for a position at BigBuild. They are advised that AI is being used. The human reviewer notices that, despite an equal number of women applying, only 3 of the 10 shortlisted applications were from women. This is not in line with BigBuild’s goals for gender diversity in hiring. BigBuild also receives correspondence from rejected female candidates questioning the AI system’s results – and, on evaluation, their applications were at least as strong as those that were shortlisted.
On investigation, BigBuild finds that the AI model was trained on historical ‘successful hire’ data – which underrepresented women, likely due to a mixture of historical human bias and societal factors. Therefore, despite the AI system being fed ‘gender blind’ CVs, it viewed factors and language more common on men’s CVs (such as ‘executed’) as positive. Meanwhile, it assessed similar factors that appear more often on women’s CVs (for example, language like ‘supported’ and ‘co-ordinated’) negatively. Career gaps (which may have been taken as maternity leave), and certain hobbies, roles or institutions (such as ‘volleyball’ or ‘girls’ school’) also contributed to the system being more likely to rank female candidates lower.
To address the issue, BigBuild apologises to impacted candidates and pauses use of the tool while it works with the provider to retrain the model using more representative, bias-audited data. It also introduces a diverse set of ‘success profiles’ representing a range of career pathways and experiences, and lowers the shortlisting threshold so more applications move to human review.
BigBuild regularly monitors and evaluates the tool’s performance throughout the trial, and continues to provide feedback to the developer as necessary.
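The disparity BigBuild’s human reviewer spotted can be quantified with a simple fairness check. The sketch below uses the figures from the scenario (50 applicants split evenly by gender, 10 shortlisted, 3 of them women) and the widely cited ‘four-fifths’ rule of thumb for adverse impact – an illustrative heuristic only, not a legal standard in New Zealand.

```python
# Minimal fairness check of the kind a human reviewer could run over
# shortlisting results. Figures come from the BigBuild scenario; the 0.8
# threshold is the illustrative 'four-fifths' adverse-impact heuristic.

def selection_rate(shortlisted, applied):
    """Proportion of a group's applicants who were shortlisted."""
    return shortlisted / applied

def disparate_impact_ratio(rate_a, rate_b):
    """Ratio of the lower selection rate to the higher one."""
    low, high = sorted([rate_a, rate_b])
    return low / high

women_rate = selection_rate(3, 25)  # 0.12
men_rate = selection_rate(7, 25)    # 0.28
ratio = disparate_impact_ratio(women_rate, men_rate)

print(f"Women: {women_rate:.0%}, Men: {men_rate:.0%}, ratio: {ratio:.2f}")
if ratio < 0.8:
    print("Possible adverse impact - review the model and its training data.")
```

A check like this only flags a disparity; understanding its cause (as BigBuild did, by tracing it to biased training data) still requires investigation.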
Legal and ethical data
Collecting and processing data ethically, in ways that respect people’s rights and privacy, helps protect against legal risk and keep reputations strong. Some types of data need extra care.
Sensitive, proprietary, or personal information
If any data includes proprietary, confidential or personal information, special consideration needs to be given to risks of it being accidentally shared – which could result in trust or privacy breaches, legal liabilities, commercial harm and reputational damage. Such data could include confidential business information (such as access keys, source code, or billing details), trade secrets, customer data or personal information. AI systems learning from this data could retain patterns and relationships within it, and may surface information you don’t want disclosed – or disclose it to people who are not supposed to be able to access it.
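One practical mitigation is to redact obvious personal identifiers before text enters a training set. The sketch below is illustrative only: the regular expressions are simplistic assumptions that will miss many real-world cases, and production pipelines should use dedicated PII-detection tooling rather than hand-rolled patterns.

```python
# Illustrative pre-processing pass that redacts obvious personal
# identifiers (email addresses and NZ-style phone numbers) before text
# is used for training. The patterns are deliberately simple examples.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(
    r"\+?64[\s-]?\d[\s-]?\d{3}[\s-]?\d{4}"      # +64-style numbers
    r"|\(?0\d\)?[\s-]?\d{3}[\s-]?\d{4}"         # (0x) xxx xxxx numbers
)

def redact_pii(text: str) -> str:
    """Replace obvious personal identifiers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

record = "Contact Jane at jane.doe@example.co.nz or (09) 555 1234."
print(redact_pii(record))  # Contact Jane at [EMAIL] or [PHONE].
```

Redaction reduces – but does not eliminate – the risk that a trained model memorises and later surfaces personal information.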
Even use of personal information that is already ‘publicly available’ may be considered unethical, illegal and/or damage reputation in some contexts. For example, ‘web scraping’ practices can be seen as intrusive, and could erode customer trust and threaten customer privacy. These considerations are important to both an AI system’s underlying training data, and to any data supplied by a user to an AI system. See Use and outputs.
Tips
Ownership and intellectual property rights
Training data and models can be sourced from proprietary datasets, open data platforms, or public content (including via web scraping practices). Each source may come with specific licensing agreements that describe who may access them, how they can be used, and/or how they must be labelled or attributed. It is good practice to disclose these details about the source(s) of training datasets, including any intellectual property rights licences entered into if applicable.
The ‘black box’ nature of some AI systems puts pressure on expectations and understanding around intellectual property protections when it comes to training data. Open-source datasets may be protected by intellectual property rights and have certain conditions that need to be met in order to copy and/or use those datasets (for example Creative Commons licences), including attribution requirements, or restrictions on derivations or commercial use. Similarly, terms of use restrictions in publicly available or proprietary datasets might prohibit web scraping or use for AI training.
* GenAI, in particular, has been largely reliant on copyright works for its development. Fairly attributing and compensating creators and authors of copyright works can support continued creation, sharing, and availability of new works to support ongoing training and refinement of AI models and systems.
Some datasets will include proprietary databases - where authorisation is needed to access and copy information from those databases, which may only be granted for specific purposes. For example, publishers of medical journals may have created a database of medical articles and provide licenced access to academics for research purposes, but not for other purposes (such as for software developers training AI tools). Besides raising questions around potential infringement of IP rights, it may also involve a breach of contract (terms and conditions related to access to the database).
While this section focuses on data and modelling, ownership and intellectual property considerations around GenAI outputs is included in Use and outputs.
The World Intellectual Property Organisation also has GenAI specific guidance.
Generative AI: Navigating Intellectual Property [PDF 1.5KB](external link) World Intellectual Property Organisation
Options to ethically source datasets including copyright works
Developers benefit from certainty around training data origins and that it is both legally and ethically sourced. Creators benefit from acknowledgement and remuneration for using their works – which in turn incentivises continued creation and sharing.
Increasingly various and dynamic options are becoming available to businesses to obtain permissions to use third-party proprietary datasets, including copyright works in training, refining, or prompting AI models. The table below outlines some of these.
Note these are examples only, and not necessarily endorsed by MBIE or Government more widely.
Options
Directly license copyright works
AI developers are increasingly striking partnerships with traditional publishers and media entities to license their extensive content libraries, in order to foster innovation and grow together.
Consider creating opportunities to partner directly with media libraries, publishers, iwi, and other content creators, rightsholders and aggregators.
Access a collective license
AI developers can access traditional ways to license the use of copyright works through collective licensing schemes offered by various copyright management organisations.
Collective licences can also be used to obtain permission to use overseas works, vastly increasing the volume and variety of copyright works available to AI developers. There are examples of overseas collective licensing options in:
- the United Kingdom – the Copyright Licensing Agency’s GenAI licence permissions, and Text and Data Mining Licensing; and Authors’ Licensing and Collecting Society’s AI Licences
- Australia – the Copyright Agency’s extended Annual Business Licence for AI tools
- the United States – Copyright Clearance Centre’s Collective Licensing Solution for Content Usage in Internal AI Systems, and AI Systems Training License (United States)
For New Zealand copyright works and business licence solutions, Copyright Licensing New Zealand intends to release a collective licensing scheme later in 2025 to partner AI developers with New Zealand rightsholders.
Use fair marketplaces
New marketplaces are emerging for creators and rightsholders to directly license their creative works for AI training, ensuring permission and remuneration. US examples include Created by Humans and RHEI.
Choose a fairly-trained and commercially safe AI model
Examples of AI models that exemplify trustworthy AI and exclusively use licensed datasets include, but are not limited to:
- Te Hiku – which used ethically sourced archival footage and audio to design an automatic speech recognition model that can transcribe te reo Māori with 92% accuracy. This has been used to run Kaituhi, an automatic bilingual transcription service. Kaituhi(external link)
- Pro Rata AI – which enables attribution of contributing content and share revenues on a per-use basis. The process looks at the output of the GenAI content, analyses where the outputted content came from, and then shares half of the revenue with the rightsholders (similar to how content distributors like Spotify or YouTube compensate rightsholders on their platforms).
- Adobe Firefly Video Model - which is trained exclusively on licensed content, and only public domain content where copyright has expired.
You can also check the internet for reporting on AI model infringement claims; infringement-checking tools are also emerging.
Tips
Māori and other indigenous data
Māori data refers broadly to digital or digitisable data, information or knowledge that is about, from or connected to Māori. It includes data about people, language, population, place, culture and environment.
Producing, using or handling Māori data in your organisation may warrant special considerations. AI systems can enable misrepresentation, misappropriation or misuse of data and mātauranga Māori and other Indigenous knowledge. This can mean inappropriate commodification of that data, disregard for indigenous protocols around that data, or reinforcement of stereotypes which perpetuate inequality and harm.
Businesses using Māori data can avoid its misuse or exploitation by putting in place appropriate safeguards, consultation and cultural considerations.
Guidance from the Centre of Data, Ethics and Innovation provides further detail on Māori data and AI.
Māori data and AI guidance for business(external link) — Data.govt.nz
Tips
Model efficacy
Those building and developing AI models will want to be clear about exactly what they want the model to do. Developers can work to minimise the risk of misleading, biased or inaccurate outputs or decision-making, or of cybersecurity weaknesses, at the outset – while there is still opportunity to improve or correct the model before potential harm is caused.
The type of system architecture that is the best fit for an AI system will primarily depend on what the system needs to learn from the data in order to solve the business problem.
To make sure AI systems work properly, they should be tested often and actively monitored. The type/s of testing will depend on the objectives of the system, and can include a focus on accuracy, privacy, or explainability (understanding how the AI system makes decisions) for example.
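An accuracy-focused test of the kind mentioned above can be as simple as scoring the model on a held-out set it never saw during training, and alerting when the score falls below an agreed threshold. In this sketch the labels, outputs and the 0.9 threshold are all illustrative assumptions.

```python
# Minimal held-out accuracy check. The example labels, model outputs
# and the 0.9 threshold are placeholders for illustration.

def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Held-out examples the model never trained on (dummy data).
held_out_labels = ["approve", "decline", "approve", "approve"]
model_outputs   = ["approve", "decline", "decline", "approve"]

score = accuracy(model_outputs, held_out_labels)
print(f"Held-out accuracy: {score:.0%}")  # 75%
if score < 0.9:
    print("Below threshold - review before relying on outputs.")
```

Running a check like this on a schedule – not just at launch – is what turns one-off testing into the active monitoring the guidance recommends.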
AI systems can be adjusted to improve how they work. Refinement methods such as fine-tuning, constrained sampling, and post-processing filters (for example, for grammar correction, offensive content, or filtering out personal information) can be used to improve outcomes.
Before model release, developers or deployers should be satisfied that it performs adequately and is reliable and safe. A number of vendors have guidance and tools to support continuous evaluation of performance and responsibility metrics before systems go live.
Tips
* Indicates content specific to GenAI