Data and modelling
Responsible training data collection and modelling practices support useful, secure and fair artificial intelligence systems.
Data is the ‘fuel’ that makes artificial intelligence (AI) work. It is used to train the AI model so it can, for example, spot patterns, make predictions and choices, create content, or perform another function for which it was built. The better and more fit-for-purpose the data, the better the AI will work.
Sometimes, even AI developers don’t fully understand how AI systems make decisions – especially with GenAI. That’s why it’s important to prioritise data quality from the start, so outputs can be as reliable as possible.
Data and modelling processes are primarily relevant for those building or developing AI models. However, those using or deploying existing, already-trained AI solutions should be aware of the considerations relating to data and modelling – to better understand whether the system is right for its intended use and fits with business values. This is increasingly relevant as more GenAI applications are released.
IT and Cybersecurity discusses the importance of ensuring good cybersecurity practice to protect AI training data and model integrity.
Fit-for-purpose data
Training data and datasets should be good quality – accurate and ‘clean’, complete, lawfully obtained or accessed, structured appropriately (including to support transparency and explainability), relevant and representative of the environment that the AI model may function in.
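The qualities above can be checked mechanically before training begins. The sketch below illustrates two of the simplest checks – completeness (missing values) and duplicate records; the field names, example records and thresholds are assumptions made for illustration only, and real pipelines would use dedicated data-quality tooling.

```python
# Illustrative data-quality checks on a small training dataset.
# Field names and example values are placeholders for this sketch.

records = [
    {"region": "Auckland", "income": 52000},
    {"region": "Otago", "income": None},      # incomplete record
    {"region": "Auckland", "income": 52000},  # exact duplicate of row 1
]

def missing_rate(rows, field):
    """Fraction of rows where the given field has no value."""
    return sum(1 for r in rows if r.get(field) is None) / len(rows)

def duplicate_count(rows):
    """Number of rows that exactly duplicate an earlier row."""
    seen, dupes = set(), 0
    for r in rows:
        key = tuple(sorted(r.items()))
        if key in seen:
            dupes += 1
        seen.add(key)
    return dupes

print(f"Missing income: {missing_rate(records, 'income'):.0%}")  # 33%
print(f"Duplicate records: {duplicate_count(records)}")          # 1
```

Checks like these flag obvious gaps early; assessing relevance and representativeness needs human judgement about the environment the model will operate in.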
Poor quality training data doesn’t just cost money – it can lead to mistakes, upset customers and damage business reputation.
Improper data collection and processing can mean potential privacy and confidentiality breaches, intellectual property rights violations, or potential data sovereignty and human rights impacts. Consequences may be reputational, financial, punitive, or obstructive (such as via compliance, cease and desist, or asset freezing orders).
AI datasets and systems are likely to have varying degrees of bias (for example, reflecting patterns in historical decision-making that may be based on inappropriate or harmful biases, representation, or how methodologies are implemented). Special attention needs to be paid to potential bias when training data contains information about people – and especially when model outputs and associated decision-making can directly impact individuals, families and whānau and wider communities. AI systems can amplify inaccurate or unreliable outputs, unfairness, discrimination and/or harm to individuals (including those potentially not receiving services due to system discrimination), and/or damage to business reputation.
Tips
Scenario B: BigBuild
BigBuild is a New Zealand construction firm looking to speed up their recruitment processes. They decide to trial a CV screening AI tool to scan applications and rank them based on desired skills, experience and other key words. Those that fall below a certain threshold are filtered out and do not progress further.
Following a risk assessment, BigBuild puts in place some risk mitigations, including being transparent with applicants about AI involvement in the shortlisting process, and implementing a human reviewer for all shortlisted applications.
After deployment, 50 candidates apply for a position at BigBuild. They are advised that AI is being used. The human reviewer notices that, despite an equal number of women applying, only 3 of the 10 shortlisted applications were from women. This is not in line with BigBuild’s goals for gender diversity in hiring. BigBuild also receives correspondence from rejected female candidates questioning the AI system’s results – and, on evaluation, their applications were at least as strong as those that were shortlisted.
On investigation, BigBuild finds that the AI model was trained on historical ‘successful hire’ data – which underrepresented women, likely due to a mixture of historical human bias and societal factors. Therefore, despite the AI system being fed ‘gender blind’ CVs, it viewed factors and language more common on men’s CVs (such as ‘executed’) as positive. Meanwhile, it assessed similar factors that appear more often on women’s CVs (for example, language like ‘supported’ and ‘co-ordinated’) negatively. Career gaps (which may have been taken as maternity leave), and certain hobbies, roles or institutions (such as ‘volleyball’ or ‘girls’ school’) also contributed to the system being more likely to rank female candidates lower.
To address the issue, BigBuild apologises to impacted candidates and pauses use of the tool while it works with the provider to retrain the model using more representative, bias-audited data. It also introduces a diverse set of ‘success profiles’ representing a range of career pathways and experiences, and lowers the shortlisting threshold so more applications move to human review.
BigBuild regularly monitors and evaluates the tool’s performance throughout the trial, and continues to provide feedback to the developer as necessary.
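The disparity BigBuild’s human reviewer spotted can be quantified with a simple fairness check. The sketch below uses the figures from the scenario (50 applicants split evenly by gender, 10 shortlisted, 3 of them women) and the widely cited ‘four-fifths’ rule of thumb for adverse impact – an illustrative heuristic only, not a legal standard in New Zealand.

```python
# Minimal fairness check of the kind a human reviewer could run over
# shortlisting results. Figures come from the BigBuild scenario; the 0.8
# threshold is the illustrative 'four-fifths' adverse-impact heuristic.

def selection_rate(shortlisted, applied):
    """Proportion of a group's applicants who were shortlisted."""
    return shortlisted / applied

def disparate_impact_ratio(rate_a, rate_b):
    """Ratio of the lower selection rate to the higher one."""
    low, high = sorted([rate_a, rate_b])
    return low / high

women_rate = selection_rate(3, 25)  # 0.12
men_rate = selection_rate(7, 25)    # 0.28
ratio = disparate_impact_ratio(women_rate, men_rate)

print(f"Women: {women_rate:.0%}, Men: {men_rate:.0%}, ratio: {ratio:.2f}")
if ratio < 0.8:
    print("Possible adverse impact - review the model and its training data.")
```

A check like this only flags a disparity; understanding its cause (as BigBuild did, by tracing it to biased training data) still requires investigation.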
Legal and ethical data
Collecting and processing data ethically, in ways that respect people’s rights and privacy, helps protect against legal risk and keep reputations strong. Some types of data need extra care.
Sensitive, proprietary, or personal information
If any data includes proprietary, confidential or personal information, special consideration needs to be given to risks of it being accidentally shared – which could result in trust or privacy breaches, legal liabilities, commercial harm and reputational damage. Such data could include confidential business information (such as access keys, source code, or billing details), trade secrets, customer data or personal information. AI systems learning from this data could retain patterns and relationships within it, and may surface information you don’t want disclosed – or disclose it to people who are not supposed to be able to access it.
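One practical mitigation is to redact obvious personal identifiers before text enters a training set. The sketch below is illustrative only: the regular expressions are simplistic assumptions that will miss many real-world cases, and production pipelines should use dedicated PII-detection tooling rather than hand-rolled patterns.

```python
# Illustrative pre-processing pass that redacts obvious personal
# identifiers (email addresses and NZ-style phone numbers) before text
# is used for training. The patterns are deliberately simple examples.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(
    r"\+?64[\s-]?\d[\s-]?\d{3}[\s-]?\d{4}"      # +64-style numbers
    r"|\(?0\d\)?[\s-]?\d{3}[\s-]?\d{4}"         # (0x) xxx xxxx numbers
)

def redact_pii(text: str) -> str:
    """Replace obvious personal identifiers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

record = "Contact Jane at jane.doe@example.co.nz or (09) 555 1234."
print(redact_pii(record))  # Contact Jane at [EMAIL] or [PHONE].
```

Redaction reduces – but does not eliminate – the risk that a trained model memorises and later surfaces personal information.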
Even use of personal information that is already ‘publicly available’ may be considered unethical, illegal and/or damage reputation in some contexts. For example, ‘web scraping’ practices can be seen as intrusive, and could erode customer trust and threaten customer privacy. These considerations are important to both an AI system’s underlying training data, and to any data supplied by a user to an AI system. See Use and outputs.
Tips
Ownership and intellectual property rights
Training data and models can be sourced from proprietary datasets, open data platforms, or public content (including via web scraping practices). Each source may come with specific licensing agreements that describe who may access them, how they can be used, and/or how they must be labelled or attributed. It is good practice to disclose these details about the source(s) of training datasets, including any intellectual property rights licences entered into if applicable.
The ‘black box’ nature of some AI systems puts pressure on expectations and understanding around intellectual property protections when it comes to training data. Open-source datasets may be protected by intellectual property rights and have certain conditions that need to be met in order to copy and/or use those datasets (for example Creative Commons licences), including attribution requirements, or restrictions on derivations or commercial use. Similarly, terms of use restrictions in publicly available or proprietary datasets might prohibit web scraping or use for AI training.
* GenAI, in particular, has been largely reliant on copyright works for its development. Fairly attributing and compensating creators and authors of copyright works can support continued creation, sharing, and availability of new works to support ongoing training and refinement of AI models and systems.
Some datasets will include proprietary databases - where authorisation is needed to access and copy information from those databases, which may only be granted for specific purposes. For example, publishers of medical journals may have created a database of medical articles and provide licenced access to academics for research purposes, but not for other purposes (such as for software developers training AI tools). Besides raising questions around potential infringement of IP rights, it may also involve a breach of contract (terms and conditions related to access to the database).
While this section focuses on data and modelling, ownership and intellectual property considerations around GenAI outputs is included in Use and outputs.
The World Intellectual Property Organisation also has GenAI specific guidance.
Generative AI: Navigating Intellectual Property [PDF 1.5KB](external link) World Intellectual Property Organisation
Options to ethically source datasets including copyright works
Developers benefit from certainty around training data origins and that it is both legally and ethically sourced. Creators benefit from acknowledgement and remuneration for using their works – which in turn incentivises continued creation and sharing.
Increasingly various and dynamic options are becoming available to businesses to obtain permissions to use third-party proprietary datasets, including copyright works in training, refining, or prompting AI models. The table below outlines some of these.
Note these are examples only, and not necessarily endorsed by MBIE or Government more widely.
Options
Directly license copyright works
AI developers are increasingly striking partnerships with traditional publishers and media entities to license their extensive content libraries, in order to foster innovation and grow together.
Consider creating opportunities to partner directly with media libraries, publishers, iwi, and other content creators, rightsholders and aggregators.
Access a collective license
AI developers can access traditional ways to license the use of copyright works through collective licensing schemes offered by various copyright management organisations.
Collective licences can also be used to obtain permission to use overseas works, vastly increasing the volume and variety of copyright works available to AI developers. There are examples of overseas collective licensing options in:
- the United Kingdom – the Copyright Licensing Agency’s GenAI licence permissions, and Text and Data Mining Licensing; and Authors’ Licensing and Collecting Society’s AI Licences
- Australia – the Copyright Agency’s extended Annual Business Licence for AI tools
- the United States – Copyright Clearance Centre’s Collective Licensing Solution for Content Usage in Internal AI Systems, and AI Systems Training License (United States)
For New Zealand copyright works and business licence solutions, Copyright Licensing New Zealand intends to release a collective licensing scheme later in 2025 to partner AI developers with New Zealand rightsholders.
Use fair marketplaces
New marketplaces are emerging for creators and rightsholders to directly license their creative works for AI training, ensuring permission and remuneration. US examples include Created by Humans and RHEI.
Choose a fairly-trained and commercially safe AI model
Examples of AI models that exemplify trustworthy AI and exclusively use licensed datasets include, but are not limited to:
- Te Hiku – which used ethically sourced archival footage and audio to design an automatic speech recognition model that can transcribe te reo Māori with 92% accuracy. This has been used to run Kaituhi, an automatic bilingual transcription service. Kaituhi(external link)
- Pro Rata AI – which enables attribution of contributing content and share revenues on a per-use basis. The process looks at the output of the GenAI content, analyses where the outputted content came from, and then shares half of the revenue with the rightsholders (similar to how content distributors like Spotify or YouTube compensate rightsholders on their platforms).
- Adobe Firefly Video Model - which is trained exclusively on licensed content, and only public domain content where copyright has expired.
You can also check the internet for reporting on AI model infringement claims; infringement-checking tools are also emerging.
Tips
Māori and other indigenous data
Māori data refers broadly to digital or digitisable data, information or knowledge that is about, from or connected to Māori. It includes data about people, language, population, place, culture and environment.
Producing, using or handling Māori data in your organisation may warrant special considerations. AI systems can enable misrepresentation, misappropriation or misuse of data and mātauranga Māori and other Indigenous knowledge. This can mean inappropriate commodification of that data, disregard for indigenous protocols around that data, or reinforcement of stereotypes which perpetuate inequality and harm.
Businesses using Māori data can avoid its misuse or exploitation by putting in place appropriate safeguards, consultation and cultural considerations.
Guidance from the Centre of Data, Ethics and Innovation provides further detail on Māori data and AI.
Māori data and AI guidance for business(external link) — Data.govt.nz
Tips
Model efficacy
Those building and developing AI models will want to be clear about exactly what they want the model to do. Developers can work to minimise the risk of misleading, biased or inaccurate outputs or decision-making, or of cybersecurity weaknesses, at the outset – while there is still opportunity to improve or correct the model before potential harm is caused.
The type of system architecture that is the best fit for an AI system will primarily depend on what the system needs to learn from the data in order to solve the business problem.
To make sure AI systems work properly, they should be tested often and actively monitored. The type/s of testing will depend on the objectives of the system, and can include a focus on accuracy, privacy, or explainability (understanding how the AI system makes decisions) for example.
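An accuracy-focused test of the kind mentioned above can be as simple as scoring the model on a held-out set it never saw during training, and alerting when the score falls below an agreed threshold. In this sketch the labels, outputs and the 0.9 threshold are all illustrative assumptions.

```python
# Minimal held-out accuracy check. The example labels, model outputs
# and the 0.9 threshold are placeholders for illustration.

def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Held-out examples the model never trained on (dummy data).
held_out_labels = ["approve", "decline", "approve", "approve"]
model_outputs   = ["approve", "decline", "decline", "approve"]

score = accuracy(model_outputs, held_out_labels)
print(f"Held-out accuracy: {score:.0%}")  # 75%
if score < 0.9:
    print("Below threshold - review before relying on outputs.")
```

Running a check like this on a schedule – not just at launch – is what turns one-off testing into the active monitoring the guidance recommends.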
AI systems can be adjusted to improve how they work. Refinement methods such as fine-tuning, constrained sampling, and post-processing filters (for example, for grammar correction, offensive content, or filtering out personal information) can be used to improve outcomes.
Before model release, developers or deployers should be satisfied that it performs adequately and is reliable and safe. A number of vendors have guidance and tools to support continuous evaluation of performance and responsibility metrics before systems go live.
Tips
* Indicates content specific to GenAI