A comprehensive guide to configuring and using Large Language Models (LLMs) in your CrewAI projects
CrewAI integrates with multiple LLM providers through LiteLLM, giving you the flexibility to choose the right model for your specific use case. This guide will help you understand how to configure and use different LLM providers in your CrewAI projects.
Large Language Models (LLMs) are the core intelligence behind CrewAI agents. They enable agents to understand context, make decisions, and generate human-like responses. Here’s what you need to know:
Large Language Models are AI systems trained on vast amounts of text data. They power the intelligence of your CrewAI agents, enabling them to understand and generate human-like text.
The context window determines how much text an LLM can process at once. Larger windows (e.g., 128K tokens) allow for more context but may be more expensive and slower.
Temperature (0.0 to 1.0) controls response randomness. Lower values (e.g., 0.2) produce more focused, deterministic outputs, while higher values (e.g., 0.8) increase creativity and variability.
Each LLM provider (e.g., OpenAI, Anthropic, Google) offers different models with varying capabilities, pricing, and features. Choose based on your needs for accuracy, speed, and cost.
There are different places in CrewAI code where you can specify the model to use. Once you specify the model you are using, you will need to provide the configuration (like an API key) for each of the model providers you use. See the provider configuration examples section for your provider.
The simplest way to get started. Set the model in your environment directly, through an `.env` file or in your app code. If you used `crewai create` to bootstrap your project, it will be set already.
Never commit API keys to version control. Use environment files (.env) or your system’s secret management.
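As a sketch, a minimal `.env` file for the default OpenAI setup might look like this (the key value is a placeholder):

```sh
MODEL=gpt-4o-mini
OPENAI_API_KEY=sk-your-key-here
```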
Create a YAML file to define your agent configurations. This method is great for version control and team collaboration:
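A sketch of such a file (the agent name and field values are illustrative; the `llm` field follows the `provider/model` string convention used throughout this guide):

```yaml
# agents.yaml
researcher:
  role: Research Specialist
  goal: Conduct thorough research on assigned topics
  backstory: A meticulous analyst with deep domain expertise
  llm: openai/gpt-4o-mini
```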
The YAML configuration allows you to:
For maximum flexibility, configure LLMs directly in your Python code:
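A sketch of direct configuration; the parameter values here are illustrative, and each parameter is explained below:

```python
from crewai import LLM  # requires the crewai package

llm = LLM(
    model="openai/gpt-4o",  # provider/model string
    temperature=0.2,        # lower = more deterministic output
    timeout=120,            # seconds to wait for a response
    max_tokens=4000,        # cap on response length
    top_p=0.9,              # nucleus sampling, alternative to temperature
    frequency_penalty=0.1,  # discourage word repetition
    presence_penalty=0.1,   # encourage new topics
    seed=42,                # best-effort reproducibility
)
```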
Parameter explanations:
- `temperature`: Controls randomness (0.0-1.0)
- `timeout`: Maximum wait time for a response
- `max_tokens`: Limits response length
- `top_p`: Alternative to temperature for sampling
- `frequency_penalty`: Reduces word repetition
- `presence_penalty`: Encourages new topics
- `response_format`: Specifies output structure
- `seed`: Ensures consistent outputs

CrewAI supports a multitude of LLM providers, each offering unique features, authentication methods, and model capabilities. In this section, you’ll find detailed examples that help you select, configure, and optimize the LLM that best fits your project’s needs.
OpenAI
Set the following environment variables in your `.env` file:
Example usage in your CrewAI project:
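A usage sketch, assuming `OPENAI_API_KEY` is set in your environment (the agent definition is illustrative):

```python
from crewai import Agent, LLM

llm = LLM(model="openai/gpt-4o", temperature=0.7)

agent = Agent(
    role="Research Analyst",
    goal="Summarize key findings",
    backstory="Concise and methodical",
    llm=llm,
)
```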
OpenAI is one of the leading providers of LLMs with a wide range of models and features.
| Model | Context Window | Best For |
|---|---|---|
| GPT-4 | 8,192 tokens | High-accuracy tasks, complex reasoning |
| GPT-4 Turbo | 128,000 tokens | Long-form content, document analysis |
| GPT-4o & GPT-4o-mini | 128,000 tokens | Cost-effective large context processing |
| o3-mini | 200,000 tokens | Fast reasoning, complex reasoning |
| o1-mini | 128,000 tokens | Fast reasoning, complex reasoning |
| o1-preview | 128,000 tokens | Fast reasoning, complex reasoning |
| o1 | 200,000 tokens | Fast reasoning, complex reasoning |
Meta-Llama
Meta’s Llama API provides access to Meta’s family of large language models.
Set the following environment variables in your `.env` file:
Example usage in your CrewAI project:
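A usage sketch; the model ID is taken from the table below, and the key is assumed to be set as `LLAMA_API_KEY` in your environment:

```python
from crewai import LLM

llm = LLM(model="meta_llama/Llama-3.3-70B-Instruct")
```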
All models listed at https://llama.developer.meta.com/docs/models/ are supported.
| Model ID | Input context length | Output context length | Input Modalities | Output Modalities |
|---|---|---|---|---|
| meta_llama/Llama-4-Scout-17B-16E-Instruct-FP8 | 128k | 4028 | Text, Image | Text |
| meta_llama/Llama-4-Maverick-17B-128E-Instruct-FP8 | 128k | 4028 | Text, Image | Text |
| meta_llama/Llama-3.3-70B-Instruct | 128k | 4028 | Text | Text |
| meta_llama/Llama-3.3-8B-Instruct | 128k | 4028 | Text | Text |
Anthropic
Example usage in your CrewAI project:
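A usage sketch, assuming `ANTHROPIC_API_KEY` is set in your environment (the model ID is illustrative):

```python
from crewai import LLM

llm = LLM(model="anthropic/claude-3-5-sonnet-20241022")
```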
Google (Gemini API)
Set your API key in your `.env` file. If you need a key, or need to find an existing key, check AI Studio.
Example usage in your CrewAI project:
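A usage sketch, assuming `GEMINI_API_KEY` is set in your environment (the model ID is illustrative, taken from the table below):

```python
from crewai import LLM

llm = LLM(model="gemini/gemini-2.0-flash")
```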
Google offers a range of powerful models optimized for different use cases.
| Model | Context Window | Best For |
|---|---|---|
| gemini-2.5-flash-preview-04-17 | 1M tokens | Adaptive thinking, cost efficiency |
| gemini-2.5-pro-preview-05-06 | 1M tokens | Enhanced thinking and reasoning, multimodal understanding, advanced coding, and more |
| gemini-2.0-flash | 1M tokens | Next generation features, speed, thinking, and realtime streaming |
| gemini-2.0-flash-lite | 1M tokens | Cost efficiency and low latency |
| gemini-1.5-flash | 1M tokens | Balanced multimodal model, good for most tasks |
| gemini-1.5-flash-8B | 1M tokens | Fastest, most cost-efficient, good for high-frequency tasks |
| gemini-1.5-pro | 2M tokens | Best performing, wide variety of reasoning tasks including logical reasoning, coding, and creative collaboration |
The full list of models is available in the Gemini model docs.
The Gemini API also allows you to use your API key to access Gemma models hosted on Google infrastructure.
| Model | Context Window |
|---|---|
| gemma-3-1b-it | 32k tokens |
| gemma-3-4b-it | 32k tokens |
| gemma-3-12b-it | 32k tokens |
| gemma-3-27b-it | 128k tokens |
Google (Vertex AI)
Get credentials from your Google Cloud Console, save them to a JSON file, then load them with the following code:
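The original snippet here loads the service-account key file and serializes it back to a JSON string; a sketch of that step, wrapped in a helper for clarity (the file path is a placeholder):

```python
import json

def load_vertex_credentials(file_path: str) -> str:
    """Load a Google Cloud service-account key file and return it
    as a JSON string, ready to hand to the LLM configuration."""
    with open(file_path, "r") as f:
        return json.dumps(json.load(f))

# vertex_credentials_json = load_vertex_credentials("path/to/vertexai-service-account.json")
```

The resulting JSON string is typically passed to CrewAI's `LLM` via its `vertex_credentials` parameter.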
Example usage in your CrewAI project:
Google offers a range of powerful models optimized for different use cases:
| Model | Context Window | Best For |
|---|---|---|
| gemini-2.5-flash-preview-04-17 | 1M tokens | Adaptive thinking, cost efficiency |
| gemini-2.5-pro-preview-05-06 | 1M tokens | Enhanced thinking and reasoning, multimodal understanding, advanced coding, and more |
| gemini-2.0-flash | 1M tokens | Next generation features, speed, thinking, and realtime streaming |
| gemini-2.0-flash-lite | 1M tokens | Cost efficiency and low latency |
| gemini-1.5-flash | 1M tokens | Balanced multimodal model, good for most tasks |
| gemini-1.5-flash-8B | 1M tokens | Fastest, most cost-efficient, good for high-frequency tasks |
| gemini-1.5-pro | 2M tokens | Best performing, wide variety of reasoning tasks including logical reasoning, coding, and creative collaboration |
Azure
Example usage in your CrewAI project:
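A sketch, assuming LiteLLM's Azure variables (`AZURE_API_KEY`, `AZURE_API_BASE`, `AZURE_API_VERSION`) are set in your environment; the deployment name is a placeholder:

```python
from crewai import LLM

llm = LLM(model="azure/your-deployment-name")
```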
AWS Bedrock
Example usage in your CrewAI project:
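A usage sketch. Bedrock authenticates through your standard AWS credentials (e.g. `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_DEFAULT_REGION`); the model ID below is illustrative:

```python
from crewai import LLM

llm = LLM(model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0")
```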
Before using Amazon Bedrock, make sure you have `boto3` installed in your environment.
Amazon Bedrock is a managed service that provides access to multiple foundation models from top AI companies through a unified API, enabling secure and responsible AI application development.
| Model | Context Window | Best For |
|---|---|---|
Amazon Nova Pro | Up to 300k tokens | High-performance, model balancing accuracy, speed, and cost-effectiveness across diverse tasks. |
Amazon Nova Micro | Up to 128k tokens | High-performance, cost-effective text-only model optimized for lowest latency responses. |
Amazon Nova Lite | Up to 300k tokens | High-performance, affordable multimodal processing for images, video, and text with real-time capabilities. |
Claude 3.7 Sonnet | Up to 128k tokens | High-performance, best for complex reasoning, coding & AI agents |
Claude 3.5 Sonnet v2 | Up to 200k tokens | State-of-the-art model specialized in software engineering, agentic capabilities, and computer interaction at optimized cost. |
Claude 3.5 Sonnet | Up to 200k tokens | High-performance model delivering superior intelligence and reasoning across diverse tasks with optimal speed-cost balance. |
Claude 3.5 Haiku | Up to 200k tokens | Fast, compact multimodal model optimized for quick responses and seamless human-like interactions |
Claude 3 Sonnet | Up to 200k tokens | Multimodal model balancing intelligence and speed for high-volume deployments. |
Claude 3 Haiku | Up to 200k tokens | Compact, high-speed multimodal model optimized for quick responses and natural conversational interactions |
Claude 3 Opus | Up to 200k tokens | Most advanced multimodal model excelling at complex tasks with human-like reasoning and superior contextual understanding. |
Claude 2.1 | Up to 200k tokens | Enhanced version with expanded context window, improved reliability, and reduced hallucinations for long-form and RAG applications |
Claude | Up to 100k tokens | Versatile model excelling in sophisticated dialogue, creative content, and precise instruction following. |
Claude Instant | Up to 100k tokens | Fast, cost-effective model for everyday tasks like dialogue, analysis, summarization, and document Q&A |
Llama 3.1 405B Instruct | Up to 128k tokens | Advanced LLM for synthetic data generation, distillation, and inference for chatbots, coding, and domain-specific tasks. |
Llama 3.1 70B Instruct | Up to 128k tokens | Powers complex conversations with superior contextual understanding, reasoning and text generation. |
Llama 3.1 8B Instruct | Up to 128k tokens | Advanced state-of-the-art model with language understanding, superior reasoning, and text generation. |
Llama 3 70B Instruct | Up to 8k tokens | Powers complex conversations with superior contextual understanding, reasoning and text generation. |
Llama 3 8B Instruct | Up to 8k tokens | Advanced state-of-the-art LLM with language understanding, superior reasoning, and text generation. |
Titan Text G1 - Lite | Up to 4k tokens | Lightweight, cost-effective model optimized for English tasks and fine-tuning with focus on summarization and content generation. |
Titan Text G1 - Express | Up to 8k tokens | Versatile model for general language tasks, chat, and RAG applications with support for English and 100+ languages. |
Cohere Command | Up to 4k tokens | Model specialized in following user commands and delivering practical enterprise solutions. |
Jurassic-2 Mid | Up to 8,191 tokens | Cost-effective model balancing quality and affordability for diverse language tasks like Q&A, summarization, and content generation. |
Jurassic-2 Ultra | Up to 8,191 tokens | Model for advanced text generation and comprehension, excelling in complex tasks like analysis and content creation. |
Jamba-Instruct | Up to 256k tokens | Model with extended context window optimized for cost-effective text generation, summarization, and Q&A. |
Mistral 7B Instruct | Up to 32k tokens | This LLM follows instructions, completes requests, and generates creative text. |
Mistral 8x7B Instruct | Up to 32k tokens | An MOE LLM that follows instructions, completes requests, and generates creative text. |
Amazon SageMaker
Example usage in your CrewAI project:
Mistral
Set the following environment variables in your `.env` file:
Example usage in your CrewAI project:
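A usage sketch, assuming `MISTRAL_API_KEY` is set in your environment (the model ID is illustrative):

```python
from crewai import LLM

llm = LLM(model="mistral/mistral-large-latest")
```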
Nvidia NIM
Set the following environment variables in your `.env` file:
Example usage in your CrewAI project:
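A usage sketch; the API key is commonly exposed as `NVIDIA_NIM_API_KEY` via LiteLLM, and the model ID below is illustrative, taken from the table that follows:

```python
from crewai import LLM

llm = LLM(model="nvidia_nim/meta/llama3-70b-instruct")
```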
Nvidia NIM provides a comprehensive suite of models for various use cases, from general-purpose tasks to specialized applications.
| Model | Context Window | Best For |
|---|---|---|
nvidia/mistral-nemo-minitron-8b-8k-instruct | 8,192 tokens | State-of-the-art small language model delivering superior accuracy for chatbot, virtual assistants, and content generation. |
nvidia/nemotron-4-mini-hindi-4b-instruct | 4,096 tokens | A bilingual Hindi-English SLM for on-device inference, tailored specifically for Hindi Language. |
nvidia/llama-3.1-nemotron-70b-instruct | 128k tokens | Customized for enhanced helpfulness in responses |
nvidia/llama3-chatqa-1.5-8b | 128k tokens | Advanced LLM to generate high-quality, context-aware responses for chatbots and search engines. |
nvidia/llama3-chatqa-1.5-70b | 128k tokens | Advanced LLM to generate high-quality, context-aware responses for chatbots and search engines. |
nvidia/vila | 128k tokens | Multi-modal vision-language model that understands text/img/video and creates informative responses |
nvidia/neva-22 | 4,096 tokens | Multi-modal vision-language model that understands text/images and generates informative responses |
nvidia/nemotron-mini-4b-instruct | 8,192 tokens | General-purpose tasks |
nvidia/usdcode-llama3-70b-instruct | 128k tokens | State-of-the-art LLM that answers OpenUSD knowledge queries and generates USD-Python code. |
nvidia/nemotron-4-340b-instruct | 4,096 tokens | Creates diverse synthetic data that mimics the characteristics of real-world data. |
meta/codellama-70b | 100k tokens | LLM capable of generating code from natural language and vice versa. |
meta/llama2-70b | 4,096 tokens | Cutting-edge large language AI model capable of generating text and code in response to prompts. |
meta/llama3-8b-instruct | 8,192 tokens | Advanced state-of-the-art LLM with language understanding, superior reasoning, and text generation. |
meta/llama3-70b-instruct | 8,192 tokens | Powers complex conversations with superior contextual understanding, reasoning and text generation. |
meta/llama-3.1-8b-instruct | 128k tokens | Advanced state-of-the-art model with language understanding, superior reasoning, and text generation. |
meta/llama-3.1-70b-instruct | 128k tokens | Powers complex conversations with superior contextual understanding, reasoning and text generation. |
meta/llama-3.1-405b-instruct | 128k tokens | Advanced LLM for synthetic data generation, distillation, and inference for chatbots, coding, and domain-specific tasks. |
meta/llama-3.2-1b-instruct | 128k tokens | Advanced state-of-the-art small language model with language understanding, superior reasoning, and text generation. |
meta/llama-3.2-3b-instruct | 128k tokens | Advanced state-of-the-art small language model with language understanding, superior reasoning, and text generation. |
meta/llama-3.2-11b-vision-instruct | 128k tokens | Advanced state-of-the-art small language model with language understanding, superior reasoning, and text generation. |
meta/llama-3.2-90b-vision-instruct | 128k tokens | Advanced state-of-the-art small language model with language understanding, superior reasoning, and text generation. |
google/gemma-7b | 8,192 tokens | Cutting-edge text generation model for text understanding, transformation, and code generation. |
google/gemma-2b | 8,192 tokens | Cutting-edge text generation model for text understanding, transformation, and code generation. |
google/codegemma-7b | 8,192 tokens | Cutting-edge model built on Google’s Gemma-7B specialized for code generation and code completion. |
google/codegemma-1.1-7b | 8,192 tokens | Advanced programming model for code generation, completion, reasoning, and instruction following. |
google/recurrentgemma-2b | 8,192 tokens | Novel recurrent architecture based language model for faster inference when generating long sequences. |
google/gemma-2-9b-it | 8,192 tokens | Cutting-edge text generation model for text understanding, transformation, and code generation. |
google/gemma-2-27b-it | 8,192 tokens | Cutting-edge text generation model for text understanding, transformation, and code generation. |
google/gemma-2-2b-it | 8,192 tokens | Cutting-edge text generation model for text understanding, transformation, and code generation. |
google/deplot | 512 tokens | One-shot visual language understanding model that translates images of plots into tables. |
google/paligemma | 8,192 tokens | Vision language model adept at comprehending text and visual inputs to produce informative responses. |
mistralai/mistral-7b-instruct-v0.2 | 32k tokens | This LLM follows instructions, completes requests, and generates creative text. |
mistralai/mixtral-8x7b-instruct-v0.1 | 8,192 tokens | An MOE LLM that follows instructions, completes requests, and generates creative text. |
mistralai/mistral-large | 4,096 tokens | Creates diverse synthetic data that mimics the characteristics of real-world data. |
mistralai/mixtral-8x22b-instruct-v0.1 | 8,192 tokens | Creates diverse synthetic data that mimics the characteristics of real-world data. |
mistralai/mistral-7b-instruct-v0.3 | 32k tokens | This LLM follows instructions, completes requests, and generates creative text. |
nv-mistralai/mistral-nemo-12b-instruct | 128k tokens | Most advanced language model for reasoning, code, multilingual tasks; runs on a single GPU. |
mistralai/mamba-codestral-7b-v0.1 | 256k tokens | Model for writing and interacting with code across a wide range of programming languages and tasks. |
microsoft/phi-3-mini-128k-instruct | 128K tokens | Lightweight, state-of-the-art open LLM with strong math and logical reasoning skills. |
microsoft/phi-3-mini-4k-instruct | 4,096 tokens | Lightweight, state-of-the-art open LLM with strong math and logical reasoning skills. |
microsoft/phi-3-small-8k-instruct | 8,192 tokens | Lightweight, state-of-the-art open LLM with strong math and logical reasoning skills. |
microsoft/phi-3-small-128k-instruct | 128K tokens | Lightweight, state-of-the-art open LLM with strong math and logical reasoning skills. |
microsoft/phi-3-medium-4k-instruct | 4,096 tokens | Lightweight, state-of-the-art open LLM with strong math and logical reasoning skills. |
microsoft/phi-3-medium-128k-instruct | 128K tokens | Lightweight, state-of-the-art open LLM with strong math and logical reasoning skills. |
microsoft/phi-3.5-mini-instruct | 128K tokens | Lightweight multilingual LLM powering AI applications in latency bound, memory/compute constrained environments |
microsoft/phi-3.5-moe-instruct | 128K tokens | Advanced LLM based on Mixture of Experts architecture to deliver compute efficient content generation |
microsoft/kosmos-2 | 1,024 tokens | Groundbreaking multimodal model designed to understand and reason about visual elements in images. |
microsoft/phi-3-vision-128k-instruct | 128k tokens | Cutting-edge open multimodal model excelling in high-quality reasoning from images. |
microsoft/phi-3.5-vision-instruct | 128k tokens | Cutting-edge open multimodal model excelling in high-quality reasoning from images. |
databricks/dbrx-instruct | 12k tokens | A general-purpose LLM with state-of-the-art performance in language understanding, coding, and RAG. |
snowflake/arctic | 1,024 tokens | Delivers high efficiency inference for enterprise applications focused on SQL generation and coding. |
aisingapore/sea-lion-7b-instruct | 4,096 tokens | LLM to represent and serve the linguistic and cultural diversity of Southeast Asia |
ibm/granite-8b-code-instruct | 4,096 tokens | Software programming LLM for code generation, completion, explanation, and multi-turn conversion. |
ibm/granite-34b-code-instruct | 8,192 tokens | Software programming LLM for code generation, completion, explanation, and multi-turn conversion. |
ibm/granite-3.0-8b-instruct | 4,096 tokens | Advanced Small Language Model supporting RAG, summarization, classification, code, and agentic AI |
ibm/granite-3.0-3b-a800m-instruct | 4,096 tokens | Highly efficient Mixture of Experts model for RAG, summarization, entity extraction, and classification |
mediatek/breeze-7b-instruct | 4,096 tokens | Creates diverse synthetic data that mimics the characteristics of real-world data. |
upstage/solar-10.7b-instruct | 4,096 tokens | Excels in NLP tasks, particularly in instruction-following, reasoning, and mathematics. |
writer/palmyra-med-70b-32k | 32k tokens | Leading LLM for accurate, contextually relevant responses in the medical domain. |
writer/palmyra-med-70b | 32k tokens | Leading LLM for accurate, contextually relevant responses in the medical domain. |
writer/palmyra-fin-70b-32k | 32k tokens | Specialized LLM for financial analysis, reporting, and data processing |
01-ai/yi-large | 32k tokens | Powerful model trained on English and Chinese for diverse tasks including chatbot and creative writing. |
deepseek-ai/deepseek-coder-6.7b-instruct | 2k tokens | Powerful coding model offering advanced capabilities in code generation, completion, and infilling |
rakuten/rakutenai-7b-instruct | 1,024 tokens | Advanced state-of-the-art LLM with language understanding, superior reasoning, and text generation. |
rakuten/rakutenai-7b-chat | 1,024 tokens | Advanced state-of-the-art LLM with language understanding, superior reasoning, and text generation. |
baichuan-inc/baichuan2-13b-chat | 4,096 tokens | Support Chinese and English chat, coding, math, instruction following, solving quizzes |
Local NVIDIA NIM Deployed using WSL2
NVIDIA NIM enables you to run powerful LLMs locally on your Windows machine using WSL2 (Windows Subsystem for Linux). This approach allows you to leverage your NVIDIA GPU for private, secure, and cost-effective AI inference without relying on cloud services. Perfect for development, testing, or production scenarios where data privacy or offline capabilities are required.
Here is a step-by-step guide to setting up a local NVIDIA NIM model:
1. Follow the installation instructions from the NVIDIA website.
2. Install the local model. For Llama 3.1-8b, follow the model-specific instructions.
3. Configure your CrewAI local models:
Groq
Set the following environment variables in your `.env` file:
Example usage in your CrewAI project:
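A usage sketch, assuming `GROQ_API_KEY` is set in your environment (the model ID is illustrative, matching the table below):

```python
from crewai import LLM

llm = LLM(model="groq/llama-3.1-70b-versatile")
```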
| Model | Context Window | Best For |
|---|---|---|
| Llama 3.1 70B/8B | 131,072 tokens | High-performance, large context tasks |
| Llama 3.2 Series | 8,192 tokens | General-purpose tasks |
| Mixtral 8x7B | 32,768 tokens | Balanced performance and context |
IBM watsonx.ai
Set the following environment variables in your `.env` file:
Example usage in your CrewAI project:
Ollama (Local LLMs)
Pull and run the model locally first: `ollama run llama3`
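Then point CrewAI at the local Ollama server; a sketch, assuming Ollama's default port:

```python
from crewai import LLM

llm = LLM(
    model="ollama/llama3",
    base_url="http://localhost:11434",  # Ollama's default endpoint
)
```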
Fireworks AI
Set the following environment variables in your `.env` file:
Example usage in your CrewAI project:
Perplexity AI
Set the following environment variables in your `.env` file:
Example usage in your CrewAI project:
Hugging Face
Set the following environment variables in your `.env` file:
Example usage in your CrewAI project:
SambaNova
Set the following environment variables in your `.env` file:
Example usage in your CrewAI project:
| Model | Context Window | Best For |
|---|---|---|
| Llama 3.1 70B/8B | Up to 131,072 tokens | High-performance, large context tasks |
| Llama 3.1 405B | 8,192 tokens | High-performance and output quality |
| Llama 3.2 Series | 8,192 tokens | General-purpose, multimodal tasks |
| Llama 3.3 70B | Up to 131,072 tokens | High-performance and output quality |
| Qwen2 family | 8,192 tokens | High-performance and output quality |
Cerebras
Set the following environment variables in your `.env` file:
Example usage in your CrewAI project:
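A usage sketch, assuming `CEREBRAS_API_KEY` is set in your environment (the model ID is illustrative):

```python
from crewai import LLM

llm = LLM(model="cerebras/llama3.1-70b")
```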
Cerebras features:
Open Router
Set the following environment variables in your `.env` file:
Example usage in your CrewAI project:
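A usage sketch, assuming `OPENROUTER_API_KEY` is set in your environment (the model ID is illustrative; OpenRouter routes to the upstream provider named in the ID):

```python
from crewai import LLM

llm = LLM(model="openrouter/deepseek/deepseek-r1")
```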
Open Router models:
Nebius AI Studio
Set the following environment variables in your `.env` file:
Example usage in your CrewAI project:
Nebius AI Studio features:
CrewAI supports streaming responses from LLMs, allowing your application to receive and process outputs in real-time as they’re generated.
Enable streaming by setting the `stream` parameter to `True` when initializing your LLM:
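A minimal sketch (model string illustrative):

```python
from crewai import LLM

# Chunks are delivered as they are generated instead of one final payload
llm = LLM(model="openai/gpt-4o", stream=True)
```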
When streaming is enabled, responses are delivered in chunks as they’re generated, creating a more responsive user experience.
All LLM events in CrewAI include agent and task information, allowing you to track and filter LLM interactions by specific agents or tasks:
This feature is particularly useful for:
CrewAI supports structured responses from LLM calls by allowing you to define a `response_format` using a Pydantic model. This enables the framework to automatically parse and validate the output, making it easier to integrate the response into your application without manual post-processing.

For example, you can define a Pydantic model to represent the expected response structure and pass it as the `response_format` when instantiating the LLM. The model will then be used to convert the LLM output into a structured Python object.
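A sketch using a hypothetical `Dog` schema (model string illustrative):

```python
from pydantic import BaseModel
from crewai import LLM

class Dog(BaseModel):
    name: str
    age: int
    breed: str

# The LLM's output will be parsed and validated against the Dog schema
llm = LLM(model="openai/gpt-4o", response_format=Dog)
```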
Learn how to get the most out of your LLM configuration:
Context Window Management
CrewAI includes smart context management features:
Best practices for context management:
Performance Optimization
Token Usage Optimization
Choose the right context window for your task:
Best Practices
Remember to regularly monitor your token usage and adjust your configuration as needed to optimize costs and performance.
Drop Additional Parameters
CrewAI internally uses LiteLLM for LLM calls, which allows you to drop additional parameters that are not needed for your specific use case. This can help simplify your code and reduce the complexity of your LLM configuration.

For example, if you don’t need to send the `stop` parameter, you can simply omit it from your LLM call:
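A sketch using the `additional_drop_params` option (model string illustrative):

```python
from crewai import LLM

# additional_drop_params strips listed parameters before LiteLLM
# forwards the request, useful for providers that reject them
llm = LLM(
    model="o3-mini",
    additional_drop_params=["stop"],
)
```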
Most authentication issues can be resolved by checking API key format and environment variable names.
Always include the provider prefix in model names
Use larger context models for extensive tasks
A comprehensive guide to configuring and using Large Language Models (LLMs) in your CrewAI projects
CrewAI integrates with multiple LLM providers through LiteLLM, giving you the flexibility to choose the right model for your specific use case. This guide will help you understand how to configure and use different LLM providers in your CrewAI projects.
Large Language Models (LLMs) are the core intelligence behind CrewAI agents. They enable agents to understand context, make decisions, and generate human-like responses. Here’s what you need to know:
Large Language Models are AI systems trained on vast amounts of text data. They power the intelligence of your CrewAI agents, enabling them to understand and generate human-like text.
The context window determines how much text an LLM can process at once. Larger windows (e.g., 128K tokens) allow for more context but may be more expensive and slower.
Temperature (0.0 to 1.0) controls response randomness. Lower values (e.g., 0.2) produce more focused, deterministic outputs, while higher values (e.g., 0.8) increase creativity and variability.
Each LLM provider (e.g., OpenAI, Anthropic, Google) offers different models with varying capabilities, pricing, and features. Choose based on your needs for accuracy, speed, and cost.
There are different places in CrewAI code where you can specify the model to use. Once you specify the model you are using, you will need to provide the configuration (like an API key) for each of the model providers you use. See the provider configuration examples section for your provider.
The simplest way to get started. Set the model in your environment directly, through an .env
file or in your app code. If you used crewai create
to bootstrap your project, it will be set already.
Never commit API keys to version control. Use environment files (.env) or your system’s secret management.
The simplest way to get started. Set the model in your environment directly, through an .env
file or in your app code. If you used crewai create
to bootstrap your project, it will be set already.
Never commit API keys to version control. Use environment files (.env) or your system’s secret management.
Create a YAML file to define your agent configurations. This method is great for version control and team collaboration:
The YAML configuration allows you to:
For maximum flexibility, configure LLMs directly in your Python code:
Parameter explanations:
temperature
: Controls randomness (0.0-1.0)timeout
: Maximum wait time for responsemax_tokens
: Limits response lengthtop_p
: Alternative to temperature for samplingfrequency_penalty
: Reduces word repetitionpresence_penalty
: Encourages new topicsresponse_format
: Specifies output structureseed
: Ensures consistent outputsCrewAI supports a multitude of LLM providers, each offering unique features, authentication methods, and model capabilities. In this section, you’ll find detailed examples that help you select, configure, and optimize the LLM that best fits your project’s needs.
OpenAI
Set the following environment variables in your .env
file:
Example usage in your CrewAI project:
OpenAI is one of the leading providers of LLMs with a wide range of models and features.
Model | Context Window | Best For |
---|---|---|
GPT-4 | 8,192 tokens | High-accuracy tasks, complex reasoning |
GPT-4 Turbo | 128,000 tokens | Long-form content, document analysis |
GPT-4o & GPT-4o-mini | 128,000 tokens | Cost-effective large context processing |
o3-mini | 200,000 tokens | Fast reasoning, complex reasoning |
o1-mini | 128,000 tokens | Fast reasoning, complex reasoning |
o1-preview | 128,000 tokens | Fast reasoning, complex reasoning |
o1 | 200,000 tokens | Fast reasoning, complex reasoning |
Meta-Llama
Meta’s Llama API provides access to Meta’s family of large language models.
The API is available through the Meta Llama API.
Set the following environment variables in your .env
file:
Example usage in your CrewAI project:
All models listed here https://llama.developer.meta.com/docs/models/ are supported.
Model ID | Input context length | Output context length | Input Modalities | Output Modalities |
---|---|---|---|---|
meta_llama/Llama-4-Scout-17B-16E-Instruct-FP8 | 128k | 4028 | Text, Image | Text |
meta_llama/Llama-4-Maverick-17B-128E-Instruct-FP8 | 128k | 4028 | Text, Image | Text |
meta_llama/Llama-3.3-70B-Instruct | 128k | 4028 | Text | Text |
meta_llama/Llama-3.3-8B-Instruct | 128k | 4028 | Text | Text |
Anthropic
Example usage in your CrewAI project:
Google (Gemini API)
Set your API key in your .env
file. If you need a key, or need to find an
existing key, check AI Studio.
Example usage in your CrewAI project:
Google offers a range of powerful models optimized for different use cases.
Model | Context Window | Best For |
---|---|---|
gemini-2.5-flash-preview-04-17 | 1M tokens | Adaptive thinking, cost efficiency |
gemini-2.5-pro-preview-05-06 | 1M tokens | Enhanced thinking and reasoning, multimodal understanding, advanced coding, and more |
gemini-2.0-flash | 1M tokens | Next generation features, speed, thinking, and realtime streaming |
gemini-2.0-flash-lite | 1M tokens | Cost efficiency and low latency |
gemini-1.5-flash | 1M tokens | Balanced multimodal model, good for most tasks |
gemini-1.5-flash-8B | 1M tokens | Fastest, most cost-efficient, good for high-frequency tasks |
gemini-1.5-pro | 2M tokens | Best performing, wide variety of reasoning tasks including logical reasoning, coding, and creative collaboration |
The full list of models is available in the Gemini model docs.
The Gemini API also allows you to use your API key to access Gemma models hosted on Google infrastructure.
Model | Context Window |
---|---|
gemma-3-1b-it | 32k tokens |
gemma-3-4b-it | 32k tokens |
gemma-3-12b-it | 32k tokens |
gemma-3-27b-it | 128k tokens |
Google (Vertex AI)
Get credentials from your Google Cloud Console and save it to a JSON file, then load it with the following code:
Example usage in your CrewAI project:
Google offers a range of powerful models optimized for different use cases:
Model | Context Window | Best For |
---|---|---|
gemini-2.5-flash-preview-04-17 | 1M tokens | Adaptive thinking, cost efficiency |
gemini-2.5-pro-preview-05-06 | 1M tokens | Enhanced thinking and reasoning, multimodal understanding, advanced coding, and more |
gemini-2.0-flash | 1M tokens | Next generation features, speed, thinking, and realtime streaming |
gemini-2.0-flash-lite | 1M tokens | Cost efficiency and low latency |
gemini-1.5-flash | 1M tokens | Balanced multimodal model, good for most tasks |
gemini-1.5-flash-8B | 1M tokens | Fastest, most cost-efficient, good for high-frequency tasks |
gemini-1.5-pro | 2M tokens | Best performing, wide variety of reasoning tasks including logical reasoning, coding, and creative collaboration |
Azure
Example usage in your CrewAI project:
AWS Bedrock
Example usage in your CrewAI project:
Before using Amazon Bedrock, make sure you have boto3 installed in your environment.
Amazon Bedrock is a managed service that provides access to multiple foundation models from top AI companies through a unified API, enabling secure and responsible AI application development.
Model | Context Window | Best For |
---|---|---|
Amazon Nova Pro | Up to 300k tokens | High-performance model balancing accuracy, speed, and cost-effectiveness across diverse tasks. |
Amazon Nova Micro | Up to 128k tokens | High-performance, cost-effective text-only model optimized for lowest latency responses. |
Amazon Nova Lite | Up to 300k tokens | High-performance, affordable multimodal processing for images, video, and text with real-time capabilities. |
Claude 3.7 Sonnet | Up to 128k tokens | High-performance, best for complex reasoning, coding & AI agents |
Claude 3.5 Sonnet v2 | Up to 200k tokens | State-of-the-art model specialized in software engineering, agentic capabilities, and computer interaction at optimized cost. |
Claude 3.5 Sonnet | Up to 200k tokens | High-performance model delivering superior intelligence and reasoning across diverse tasks with optimal speed-cost balance. |
Claude 3.5 Haiku | Up to 200k tokens | Fast, compact multimodal model optimized for quick responses and seamless human-like interactions |
Claude 3 Sonnet | Up to 200k tokens | Multimodal model balancing intelligence and speed for high-volume deployments. |
Claude 3 Haiku | Up to 200k tokens | Compact, high-speed multimodal model optimized for quick responses and natural conversational interactions |
Claude 3 Opus | Up to 200k tokens | Most advanced multimodal model, excelling at complex tasks with human-like reasoning and superior contextual understanding. |
Claude 2.1 | Up to 200k tokens | Enhanced version with expanded context window, improved reliability, and reduced hallucinations for long-form and RAG applications |
Claude | Up to 100k tokens | Versatile model excelling in sophisticated dialogue, creative content, and precise instruction following. |
Claude Instant | Up to 100k tokens | Fast, cost-effective model for everyday tasks like dialogue, analysis, summarization, and document Q&A |
Llama 3.1 405B Instruct | Up to 128k tokens | Advanced LLM for synthetic data generation, distillation, and inference for chatbots, coding, and domain-specific tasks. |
Llama 3.1 70B Instruct | Up to 128k tokens | Powers complex conversations with superior contextual understanding, reasoning and text generation. |
Llama 3.1 8B Instruct | Up to 128k tokens | Advanced state-of-the-art model with language understanding, superior reasoning, and text generation. |
Llama 3 70B Instruct | Up to 8k tokens | Powers complex conversations with superior contextual understanding, reasoning and text generation. |
Llama 3 8B Instruct | Up to 8k tokens | Advanced state-of-the-art LLM with language understanding, superior reasoning, and text generation. |
Titan Text G1 - Lite | Up to 4k tokens | Lightweight, cost-effective model optimized for English tasks and fine-tuning with focus on summarization and content generation. |
Titan Text G1 - Express | Up to 8k tokens | Versatile model for general language tasks, chat, and RAG applications with support for English and 100+ languages. |
Cohere Command | Up to 4k tokens | Model specialized in following user commands and delivering practical enterprise solutions. |
Jurassic-2 Mid | Up to 8,191 tokens | Cost-effective model balancing quality and affordability for diverse language tasks like Q&A, summarization, and content generation. |
Jurassic-2 Ultra | Up to 8,191 tokens | Model for advanced text generation and comprehension, excelling in complex tasks like analysis and content creation. |
Jamba-Instruct | Up to 256k tokens | Model with extended context window optimized for cost-effective text generation, summarization, and Q&A. |
Mistral 7B Instruct | Up to 32k tokens | This LLM follows instructions, completes requests, and generates creative text. |
Mistral 8x7B Instruct | Up to 32k tokens | A mixture-of-experts (MoE) LLM that follows instructions, completes requests, and generates creative text. |
Amazon SageMaker
Example usage in your CrewAI project:
Mistral
Set the following environment variables in your .env
file:
Example usage in your CrewAI project:
Nvidia NIM
Set the following environment variables in your .env
file:
Example usage in your CrewAI project:
Nvidia NIM provides a comprehensive suite of models for various use cases, from general-purpose tasks to specialized applications.
Model | Context Window | Best For |
---|---|---|
nvidia/mistral-nemo-minitron-8b-8k-instruct | 8,192 tokens | State-of-the-art small language model delivering superior accuracy for chatbots, virtual assistants, and content generation. |
nvidia/nemotron-4-mini-hindi-4b-instruct | 4,096 tokens | A bilingual Hindi-English SLM for on-device inference, tailored specifically for the Hindi language. |
nvidia/llama-3.1-nemotron-70b-instruct | 128k tokens | Customized for enhanced helpfulness in responses |
nvidia/llama3-chatqa-1.5-8b | 128k tokens | Advanced LLM to generate high-quality, context-aware responses for chatbots and search engines. |
nvidia/llama3-chatqa-1.5-70b | 128k tokens | Advanced LLM to generate high-quality, context-aware responses for chatbots and search engines. |
nvidia/vila | 128k tokens | Multi-modal vision-language model that understands text/img/video and creates informative responses |
nvidia/neva-22 | 4,096 tokens | Multi-modal vision-language model that understands text/images and generates informative responses |
nvidia/nemotron-mini-4b-instruct | 8,192 tokens | General-purpose tasks |
nvidia/usdcode-llama3-70b-instruct | 128k tokens | State-of-the-art LLM that answers OpenUSD knowledge queries and generates USD-Python code. |
nvidia/nemotron-4-340b-instruct | 4,096 tokens | Creates diverse synthetic data that mimics the characteristics of real-world data. |
meta/codellama-70b | 100k tokens | LLM capable of generating code from natural language and vice versa. |
meta/llama2-70b | 4,096 tokens | Cutting-edge large language AI model capable of generating text and code in response to prompts. |
meta/llama3-8b-instruct | 8,192 tokens | Advanced state-of-the-art LLM with language understanding, superior reasoning, and text generation. |
meta/llama3-70b-instruct | 8,192 tokens | Powers complex conversations with superior contextual understanding, reasoning and text generation. |
meta/llama-3.1-8b-instruct | 128k tokens | Advanced state-of-the-art model with language understanding, superior reasoning, and text generation. |
meta/llama-3.1-70b-instruct | 128k tokens | Powers complex conversations with superior contextual understanding, reasoning and text generation. |
meta/llama-3.1-405b-instruct | 128k tokens | Advanced LLM for synthetic data generation, distillation, and inference for chatbots, coding, and domain-specific tasks. |
meta/llama-3.2-1b-instruct | 128k tokens | Advanced state-of-the-art small language model with language understanding, superior reasoning, and text generation. |
meta/llama-3.2-3b-instruct | 128k tokens | Advanced state-of-the-art small language model with language understanding, superior reasoning, and text generation. |
meta/llama-3.2-11b-vision-instruct | 128k tokens | Advanced vision-language model for image reasoning, captioning, and visual question answering. |
meta/llama-3.2-90b-vision-instruct | 128k tokens | Advanced vision-language model for image reasoning, captioning, and visual question answering. |
google/gemma-7b | 8,192 tokens | Cutting-edge text generation model for text understanding, transformation, and code generation. |
google/gemma-2b | 8,192 tokens | Cutting-edge text generation model for text understanding, transformation, and code generation. |
google/codegemma-7b | 8,192 tokens | Cutting-edge model built on Google’s Gemma-7B specialized for code generation and code completion. |
google/codegemma-1.1-7b | 8,192 tokens | Advanced programming model for code generation, completion, reasoning, and instruction following. |
google/recurrentgemma-2b | 8,192 tokens | Novel recurrent architecture based language model for faster inference when generating long sequences. |
google/gemma-2-9b-it | 8,192 tokens | Cutting-edge text generation model for text understanding, transformation, and code generation. |
google/gemma-2-27b-it | 8,192 tokens | Cutting-edge text generation model for text understanding, transformation, and code generation. |
google/gemma-2-2b-it | 8,192 tokens | Cutting-edge text generation model for text understanding, transformation, and code generation. |
google/deplot | 512 tokens | One-shot visual language understanding model that translates images of plots into tables. |
google/paligemma | 8,192 tokens | Vision language model adept at comprehending text and visual inputs to produce informative responses. |
mistralai/mistral-7b-instruct-v0.2 | 32k tokens | This LLM follows instructions, completes requests, and generates creative text. |
mistralai/mixtral-8x7b-instruct-v0.1 | 8,192 tokens | A mixture-of-experts (MoE) LLM that follows instructions, completes requests, and generates creative text. |
mistralai/mistral-large | 4,096 tokens | Creates diverse synthetic data that mimics the characteristics of real-world data. |
mistralai/mixtral-8x22b-instruct-v0.1 | 8,192 tokens | Creates diverse synthetic data that mimics the characteristics of real-world data. |
mistralai/mistral-7b-instruct-v0.3 | 32k tokens | This LLM follows instructions, completes requests, and generates creative text. |
nv-mistralai/mistral-nemo-12b-instruct | 128k tokens | Most advanced language model for reasoning, code, multilingual tasks; runs on a single GPU. |
mistralai/mamba-codestral-7b-v0.1 | 256k tokens | Model for writing and interacting with code across a wide range of programming languages and tasks. |
microsoft/phi-3-mini-128k-instruct | 128K tokens | Lightweight, state-of-the-art open LLM with strong math and logical reasoning skills. |
microsoft/phi-3-mini-4k-instruct | 4,096 tokens | Lightweight, state-of-the-art open LLM with strong math and logical reasoning skills. |
microsoft/phi-3-small-8k-instruct | 8,192 tokens | Lightweight, state-of-the-art open LLM with strong math and logical reasoning skills. |
microsoft/phi-3-small-128k-instruct | 128K tokens | Lightweight, state-of-the-art open LLM with strong math and logical reasoning skills. |
microsoft/phi-3-medium-4k-instruct | 4,096 tokens | Lightweight, state-of-the-art open LLM with strong math and logical reasoning skills. |
microsoft/phi-3-medium-128k-instruct | 128K tokens | Lightweight, state-of-the-art open LLM with strong math and logical reasoning skills. |
microsoft/phi-3.5-mini-instruct | 128K tokens | Lightweight multilingual LLM powering AI applications in latency bound, memory/compute constrained environments |
microsoft/phi-3.5-moe-instruct | 128K tokens | Advanced LLM based on Mixture of Experts architecture to deliver compute efficient content generation |
microsoft/kosmos-2 | 1,024 tokens | Groundbreaking multimodal model designed to understand and reason about visual elements in images. |
microsoft/phi-3-vision-128k-instruct | 128k tokens | Cutting-edge open multimodal model excelling in high-quality reasoning from images. |
microsoft/phi-3.5-vision-instruct | 128k tokens | Cutting-edge open multimodal model excelling in high-quality reasoning from images. |
databricks/dbrx-instruct | 12k tokens | A general-purpose LLM with state-of-the-art performance in language understanding, coding, and RAG. |
snowflake/arctic | 1,024 tokens | Delivers high efficiency inference for enterprise applications focused on SQL generation and coding. |
aisingapore/sea-lion-7b-instruct | 4,096 tokens | LLM to represent and serve the linguistic and cultural diversity of Southeast Asia |
ibm/granite-8b-code-instruct | 4,096 tokens | Software programming LLM for code generation, completion, explanation, and multi-turn conversation. |
ibm/granite-34b-code-instruct | 8,192 tokens | Software programming LLM for code generation, completion, explanation, and multi-turn conversation. |
ibm/granite-3.0-8b-instruct | 4,096 tokens | Advanced Small Language Model supporting RAG, summarization, classification, code, and agentic AI |
ibm/granite-3.0-3b-a800m-instruct | 4,096 tokens | Highly efficient Mixture of Experts model for RAG, summarization, entity extraction, and classification |
mediatek/breeze-7b-instruct | 4,096 tokens | Creates diverse synthetic data that mimics the characteristics of real-world data. |
upstage/solar-10.7b-instruct | 4,096 tokens | Excels in NLP tasks, particularly in instruction-following, reasoning, and mathematics. |
writer/palmyra-med-70b-32k | 32k tokens | Leading LLM for accurate, contextually relevant responses in the medical domain. |
writer/palmyra-med-70b | 32k tokens | Leading LLM for accurate, contextually relevant responses in the medical domain. |
writer/palmyra-fin-70b-32k | 32k tokens | Specialized LLM for financial analysis, reporting, and data processing |
01-ai/yi-large | 32k tokens | Powerful model trained on English and Chinese for diverse tasks including chatbot and creative writing. |
deepseek-ai/deepseek-coder-6.7b-instruct | 2k tokens | Powerful coding model offering advanced capabilities in code generation, completion, and infilling |
rakuten/rakutenai-7b-instruct | 1,024 tokens | Advanced state-of-the-art LLM with language understanding, superior reasoning, and text generation. |
rakuten/rakutenai-7b-chat | 1,024 tokens | Advanced state-of-the-art LLM with language understanding, superior reasoning, and text generation. |
baichuan-inc/baichuan2-13b-chat | 4,096 tokens | Support Chinese and English chat, coding, math, instruction following, solving quizzes |
Local NVIDIA NIM Deployed using WSL2
NVIDIA NIM enables you to run powerful LLMs locally on your Windows machine using WSL2 (Windows Subsystem for Linux). This approach allows you to leverage your NVIDIA GPU for private, secure, and cost-effective AI inference without relying on cloud services. Perfect for development, testing, or production scenarios where data privacy or offline capabilities are required.
Here is a step-by-step guide to setting up a local NVIDIA NIM model:
Follow installation instructions from NVIDIA Website
Install the local model. For Llama 3.1-8b, follow the instructions
Configure your crewai local models:
Groq
Set the following environment variables in your .env
file:
Example usage in your CrewAI project:
Model | Context Window | Best For |
---|---|---|
Llama 3.1 70B/8B | 131,072 tokens | High-performance, large context tasks |
Llama 3.2 Series | 8,192 tokens | General-purpose tasks |
Mixtral 8x7B | 32,768 tokens | Balanced performance and context |
IBM watsonx.ai
Set the following environment variables in your .env
file:
Example usage in your CrewAI project:
Ollama (Local LLMs)
ollama run llama3
Fireworks AI
Set the following environment variables in your .env
file:
Example usage in your CrewAI project:
Perplexity AI
Set the following environment variables in your .env
file:
Example usage in your CrewAI project:
Hugging Face
Set the following environment variables in your .env
file:
Example usage in your CrewAI project:
SambaNova
Set the following environment variables in your .env
file:
Example usage in your CrewAI project:
Model | Context Window | Best For |
---|---|---|
Llama 3.1 70B/8B | Up to 131,072 tokens | High-performance, large context tasks |
Llama 3.1 405B | 8,192 tokens | High-performance and output quality |
Llama 3.2 Series | 8,192 tokens | General-purpose, multimodal tasks |
Llama 3.3 70B | Up to 131,072 tokens | High-performance and output quality |
Qwen2 family | 8,192 tokens | High-performance and output quality |
Cerebras
Set the following environment variables in your .env
file:
Example usage in your CrewAI project:
Cerebras features:
Open Router
Set the following environment variables in your .env
file:
Example usage in your CrewAI project:
Open Router models:
Nebius AI Studio
Set the following environment variables in your .env
file:
Example usage in your CrewAI project:
Nebius AI Studio features:
CrewAI supports streaming responses from LLMs, allowing your application to receive and process outputs in real-time as they’re generated.
Enable streaming by setting the stream
parameter to True
when initializing your LLM:
When streaming is enabled, responses are delivered in chunks as they’re generated, creating a more responsive user experience.
All LLM events in CrewAI include agent and task information, allowing you to track and filter LLM interactions by specific agents or tasks:
This feature is particularly useful for:
CrewAI supports structured responses from LLM calls by allowing you to define a response_format
using a Pydantic model. This enables the framework to automatically parse and validate the output, making it easier to integrate the response into your application without manual post-processing.
For example, you can define a Pydantic model to represent the expected response structure and pass it as the response_format
when instantiating the LLM. The model will then be used to convert the LLM output into a structured Python object.
Learn how to get the most out of your LLM configuration:
Context Window Management
CrewAI includes smart context management features:
Best practices for context management:
Performance Optimization
Token Usage Optimization
Choose the right context window for your task:
Best Practices
Remember to regularly monitor your token usage and adjust your configuration as needed to optimize costs and performance.
Drop Additional Parameters
CrewAI internally uses Litellm for LLM calls, which allows you to drop additional parameters that are not needed for your specific use case. This can help simplify your code and reduce the complexity of your LLM configuration.
For example, if you don’t need to send the stop
parameter, you can simply omit it from your LLM call:
Most authentication issues can be resolved by checking API key format and environment variable names.
Always include the provider prefix in model names
Use larger context models for extensive tasks