ollama
llms: ollama
An ollama object interfaces with a running ollama server. ollama is a scalar handle class object, which allows communication with an ollama server running either locally or across a network. The heavy lifting of interfacing with the ollama API is done by the compiled __ollama__ function, which should not be called directly.
An ollama interface object should be considered a session to the ollama server, holding any user-defined settings along with the chat history (if opted for) and other custom parameters to be passed to the LLM model during inference. You can initialize several ollama interface objects pointing to the same ollama server and use them concurrently to implement more complex schemes such as RAG, custom tooling, etc.
Source Code: ollama
Displays the models that are currently loaded in ollama server’s memory.
Specifies the inference mode that the ollama interface object will use to send requests to the ollama server. Currently, only 'query' and 'chat' modes are implemented.
Specifies the network IP address and port to which the ollama interface object is connected.
Displays the models that are currently available on the ollama server.
Contains various metrics about the last processed request and the response returned from the ollama server.
Contains a cell array with the history of user prompts, images (if any), and model responses for a given chat session. The first column contains character vectors with the user's prompts, the second column contains a nested cell array with any images attached to the corresponding user prompt (otherwise it is empty), and the third column contains the model's responses. By default, chatHistory is an empty cell array, and it is only populated while in 'chat' mode.
The name of the model that will be used for generating the response to the next user request. This is empty upon construction and it must be specified before requesting any inference from the ollama server.
The time in seconds that the ollama interface object will wait for a server response before closing the connection with an error.
The time in seconds that the ollama interface object will wait for a request to be successfully sent to the server before closing the connection with an error.
A structure containing fields as optional parameters to be passed to a model for inference at runtime. By default, this is an empty structure, in which case the model utilizes its default parameters as specified in the respective model file in the ollama server. See the setOptions method for more information about the custom parameters you can specify.
ollama: llm = ollama (serverURL)
ollama: llm = ollama (serverURL, model)
ollama: llm = ollama (serverURL, model, mode)
llm = ollama (serverURL) creates an ollama interface object, which allows communication with an ollama server accessible at serverURL, which must be a character vector specifying a uniform resource locator (URL). If serverURL is empty or ollama is called without any input arguments, then it defaults to http://localhost:11434.
llm = ollama (serverURL, model) also specifies the active model of the ollama interface llm, which will be used for inference. model must be a character vector specifying an existing model at the ollama server. If the requested model is not available, a warning is emitted and no model is set active, which is the default behavior when ollama is called with fewer arguments. An active model is mandatory before starting any communication with the ollama server. Use the listModels class method to see all the models available in the server instance that llm is interfacing with. Use the loadModel method to set an active model in an ollama interface object that has already been created.
llm = ollama (serverURL, model, mode) also specifies the inference mode of the ollama interface. mode can be specified as 'query', for generating responses to single prompts, 'chat', for starting a conversation with a model by retaining the entire chat history during inference, or 'embed' (unimplemented), for generating embeddings for given prompts. By default, the ollama interface is initialized in query mode.
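For instance, a minimal sketch of the constructor syntaxes described above is shown below; the remote address and the 'llama3.2' model name are placeholders and must be replaced with values valid for your own ollama server.

```
llm = ollama ();                                  # connect to http://localhost:11434
llm = ollama ("http://192.168.1.10:11434");       # remote server, no active model yet
llm = ollama ("http://localhost:11434", "llama3.2", "chat");  # server, model, and mode
```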
See also: fig2base64
ollama: list = listModels (llm)
ollama: list = listModels (llm, outtype)
ollama: listModels (…)
list = listModels (llm) returns a cell array of character vectors in list with the names of the models that are available on the ollama server that llm interfaces with. This is equivalent to accessing the availableModels property with the syntax list = llm.availableModels.
list = listModels (llm, outtype) also specifies the data type of the output argument list. outtype must be a character vector with any of the following options:
'cellstr' (default) returns list as a cell array of character vectors. Use this option to see available models for selecting an active model for inference.
'json' returns list as a character vector containing the JSON string response returned from the ollama server. Use this option if you want to access all the details about the models available in the ollama server.
'table' returns list as a table with the most important information about the available models in specific table variables.
listModels (…)
will display the output requested according
to the previous syntaxes to the standard output instead of returning it
to an output argument. This syntax is not valid for the 'json'
option, which requires an output argument.
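A brief usage sketch of the syntaxes above, assuming llm is an ollama interface object created as shown earlier:

```
models = listModels (llm);        # cell array of available model names
listModels (llm, 'table');        # print a summary table in the command window
info = listModels (llm, 'json');  # full details as a JSON character vector
```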
ollama: list = listRunningModels (llm)
ollama: list = listRunningModels (llm, outtype)
ollama: listRunningModels (…)
list = listRunningModels (llm) returns a cell array of character vectors in list with the names of the models that are currently loaded in memory at the ollama server that llm interfaces with. This is equivalent to accessing the runningModels property with the syntax list = llm.runningModels.
list = listRunningModels (llm, outtype) also specifies the data type of the output argument list. outtype must be a character vector with any of the following options:
'cellstr' (default) returns list as a cell array of character vectors. Use this option to see which models are currently running on the ollama server for better memory management.
'json' returns list as a character vector containing the JSON string response returned from the ollama server. Use this option if you want to access all the details about currently running models.
'table' returns list as a table with the most important information about the currently running models in specific table variables.
listRunningModels (…) will display the output requested according to the previous syntaxes to the standard output instead of returning it to an output argument. This syntax is not valid for the 'json' option, which requires an output argument.
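For example, assuming llm interfaces a running server, you might check memory usage like this:

```
running = listRunningModels (llm);   # cell array of models loaded in memory
listRunningModels (llm, 'table');    # print details in the command window
```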
ollama: copyModel (llm, source, target)
copyModel (llm, source, target) copies the model specified by source into a new model named after target in the ollama server interfaced by llm. Both source and target must be character vectors, and source must specify an existing model in the ollama server. If successful, the available models in the llm.availableModels property are updated; otherwise, an error is returned.
Alternatively, source may also be an integer scalar value indexing an existing model in llm.availableModels.
ollama: deleteModel (llm, target)
deleteModel (llm, target) deletes the model specified by target in the ollama server interfaced by llm. target can be either a character vector with the name of the model or an integer scalar value indexing an existing model in llm.availableModels. If successful, the available models in the llm.availableModels property are updated; otherwise, an error is returned.
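A short sketch of model management with copyModel and deleteModel; the model names are placeholders, so pick names that actually appear in llm.availableModels:

```
copyModel (llm, 'llama3.2', 'llama3.2-copy');   # duplicate an existing model
deleteModel (llm, 'llama3.2-copy');             # delete it again by name ...
## deleteModel (llm, 3);                        # ... or by index into llm.availableModels
```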
ollama: loadModel (llm, target)
loadModel (llm, target) loads the model specified by target in the ollama server interfaced by llm. This syntax is equivalent to assigning a value to the activeModel property, as in llm.activeModel = target. If successful, the specified model is also set as the active model for inference in the llm.activeModel property. target can be either a character vector with the name of the model or an integer scalar value indexing an existing model in llm.availableModels.
You can load multiple models concurrently, limited only by the hardware specifications of the ollama server that llm interfaces with. However, since each newly loaded model is also set as the active model for inference, keep in mind that only a single model can be active at a time for a given ollama interface object. The active model for inference will always be the latest loaded model.
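For example, assuming 'llama3.2' is listed in llm.availableModels:

```
loadModel (llm, 'llama3.2');     # load by name and set as the active model
loadModel (llm, 2);              # or load by index into llm.availableModels
llm.activeModel = 'llama3.2';    # equivalent property assignment
```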
ollama: unloadModel (llm, target)
unloadModel (llm, target) unloads the model specified by target from the memory of the ollama server interfaced by llm. target can be either a character vector with the name of the model or an integer scalar value indexing an existing model in llm.availableModels. Use this method to free resources on the ollama server. By default, the ollama server unloads any idle model from memory after five minutes, unless otherwise instructed.
If the model you unload is also the active model in the ollama interface object, then the activeModel property is also cleared. You need to set an active model again before requesting further inference.
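A minimal sketch, again using the placeholder model name 'llama3.2':

```
unloadModel (llm, 'llama3.2');   # free server memory; clears activeModel if it was active
loadModel (llm, 'llama3.2');     # set an active model again before requesting inference
```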
ollama: pullModel (llm, target)
pullModel (llm, target) downloads the model specified by target from the ollama library into the ollama server interfaced by llm. If successful, the model is appended to the list of available models in the llm.availableModels property. target must be a character vector.
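For example, to download a model from the ollama library (the model name is a placeholder; any model published in the library works):

```
pullModel (llm, 'llama3.2');     # download the model into the server
listModels (llm)                 # it now appears among the available models
```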
ollama: setOptions (llm, name, value)
setOptions (llm, name, value) sets custom options to be passed to the ollama server in order to tailor the behavior of the model according to specific needs. The options must be specified as name, value paired arguments, where name is a character vector naming the option to be customized, and value can be either a numeric or logical scalar depending on the values each option requires.
The following options may be customized in any order as paired input arguments.
| name | value | description |
|---|---|---|
| 'num_keep' | integer | Specifies how many of the most recent tokens or responses should be kept in memory for generating the next output. Higher values can improve the relevance of the generated text by providing more context. |
| 'seed' | integer | Controls the randomness of token selection during text generation so that similar responses are reproduced for the same requests. |
| 'num_predict' | integer | Specifies the maximum number of tokens to predict when generating text. |
| 'top_k' | integer | Limits the number of possible choices for each next token when generating responses by specifying how many of the most likely options to consider. |
| 'top_p' | double | Sets the cumulative probability for nucleus sampling. It must be in the range [0, 1]. |
| 'min_p' | double | Adjusts the sampling threshold in accordance with the model's confidence. Specifically, it scales the probability threshold based on the top token's probability, allowing the model to focus on high-confidence tokens when certain, and to consider a broader range of tokens when less confident. It must be in the range [0, 1]. |
| 'typical_p' | double | Controls how conventional or creative the responses from a language model will be. A higher typical_p value results in more expected and standard responses, while a lower value allows for more unusual and creative outputs. It must be in the range [0, 1]. |
| 'repeat_last_n' | integer | Defines how far back the model looks to avoid repetition. |
| 'temperature' | double | Controls the randomness of the generated output by determining how the model leverages the raw likelihoods of the tokens under consideration for the next words in a sequence. It ranges from 0 to 2, with higher values corresponding to more chaotic output. |
| 'repeat_penalty' | double | Adjusts the penalty for repeated phrases; higher values discourage repetition. |
| 'presence_penalty' | double | Controls the diversity of the generated text by penalizing new tokens based on whether they appear in the text so far. |
| 'frequency_penalty' | double | Controls how often the same words should be repeated in the generated text. |
| 'penalize_newline' | logical | Discourages the model from generating newlines in its responses. |
| 'numa' | logical | Allows for non-uniform memory access to enhance performance. This can significantly improve processing speeds on multi-CPU systems. |
| 'num_ctx' | integer | Sets the context window length (in tokens), determining how much previous text the model considers. This should be kept in mind especially in chat sessions. |
| 'num_batch' | integer | Controls the number of input samples processed in a single batch during model inference. Reducing this value can help prevent out-of-memory (OOM) errors when working with large models. |
| 'num_gpu' | integer | Specifies the number of GPU devices to use for computation. |
| 'main_gpu' | integer | Specifies which GPU device to use for inference. |
| 'use_mmap' | logical | Allows for memory-mapped file access, which can improve performance by enabling faster loading of model weights from disk. |
| 'num_thread' | integer | Specifies the number of threads to use during model generation, allowing you to optimize performance based on your CPU's capabilities. |
Custom options are preserved in the ollama interface object for all subsequent requests for inference until they are altered or reset to the model's default value by removing them. To remove a custom option, pass an empty value to the name, value paired argument, as in setOptions (llm, 'seed', []).
Use the showOptions method to display any custom options that may be currently set in the ollama interface object. Alternatively, you can retrieve the custom options as a structure through the options property, as in opts = llm.options, where each field in opts refers to a custom option. If no custom options are set, then opts is an empty structure.
You can also set or clear a single custom option with direct assignment to the options property of the ollama interface object by passing the name, value paired argument as a 2-element cell array. The equivalent syntax of setOptions (llm, 'seed', []) is llm.options = {'seed', []}.
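The sketch below sets a few custom options, inspects them, and removes one of them again. It assumes that several name, value pairs can be passed in a single call, as the paired-argument description above suggests; otherwise call setOptions once per option.

```
setOptions (llm, 'temperature', 0.2, 'seed', 42, 'num_ctx', 8192);
showOptions (llm);               # display the custom options currently in effect
opts = llm.options;              # retrieve them as a structure
llm.options = {'seed', []};      # remove the 'seed' option again
```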
ollama: showOptions (llm)
showOptions (llm) displays any custom options that may be specified in the ollama interface object llm.
ollama: query (llm, prompt)
ollama: query (llm, prompt, image)
ollama: txt = query (…)
ollama: query (llm)
query (llm, prompt) uses the "api/generate" API endpoint to make a request to the ollama server interfaced by llm to generate text based on the user's input specified in prompt, which must be a character vector. When no output argument is requested, query prints the response text in the standard output (command window) with a custom display method so that words are not split between lines depending on the terminal size. If an output argument is requested, the text is returned as a character vector and nothing gets displayed in the terminal.
query (llm, prompt, image) also specifies an image or multiple images to be passed to the model along with the user's prompt. For a single image, image must be a character vector specifying either the filename of an image or a base64 encoded image. query distinguishes between the two by scanning image for a period character ('.'), which is commonly used as a separator between base-filename and extension, but it is an invalid character for base64 encoded strings. For multiple images, image must be a cell array of character vectors explicitly containing either multiple filenames or multiple base64 encoded string representations of images.
txt = query (…)
returns the generated text to the
output argument txt instead of displaying it to the terminal for
any of the previous syntaxes.
query (llm) does not make a request to the ollama server, but it sets the 'mode' property in the ollama interface object llm to 'query' for future requests. Use this syntax to switch from another interface mode to query mode without making a request to the server.
An alternative method of calling the query method is by using direct subscripted reference to the ollama interface object llm, as long as it is already set in query mode. The table below lists the equivalent syntaxes.
| method call | object subscripted reference |
|---|---|
| query (llm, prompt) | llm(prompt) |
| query (llm, prompt, image) | llm(prompt, image) |
| txt = query (llm, prompt) | txt = llm(prompt) |
| txt = query (llm, prompt, image) | txt = llm(prompt, image) |
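A short query-mode sketch; the prompts, the image filename, and the assumption that the active model can handle images are all placeholders for your own setup.

```
query (llm);                                        # make sure llm is in query mode
txt = query (llm, "Summarize GNU Octave in one sentence.");
query (llm, "What is shown in this image?", "photo.png");   # attach an image file
txt = llm ("Summarize GNU Octave in one sentence.");        # equivalent subscripted call
```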
ollama: chat (llm, prompt)
ollama: chat (llm, prompt, image)
ollama: txt = chat (…)
ollama: chat (llm)
chat (llm, prompt) uses the "api/chat" API endpoint to make a request to the ollama server interfaced by llm to generate text based on the user's input specified in prompt along with all previous requests and responses made by the user and models during the same chat session, which is stored in the 'chatHistory' property of the ollama interface object llm. prompt must be a character vector. When no output argument is requested, chat prints the response text in the standard output (command window) with a custom display method so that words are not split between lines depending on the terminal size. If an output argument is requested, the text is returned as a character vector and nothing gets displayed in the terminal. In either case, the response text is appended to the chat history, which can be displayed with the showHistory method or returned as a cell array from llm.chatHistory. If you want to start a new chat session, you can either clear the chat history with the clearHistory method or create a new ollama interface object.
chat (llm, prompt, image) also specifies an image or multiple images to be passed to the model along with the user's prompt. For a single image, image must be a character vector specifying either the filename of an image or a base64 encoded image. chat distinguishes between the two by scanning image for a period character ('.'), which is commonly used as a separator between base-filename and extension, but it is an invalid character for base64 encoded strings. For multiple images, image must be a cell array of character vectors, which can contain both multiple filenames and multiple base64 encoded string representations of images. Any images supplied along with a prompt during a chat session are also stored in the chat history.
txt = chat (…)
returns the generated text to the
output argument txt instead of displaying it to the terminal for
any of the previous syntaxes.
chat (llm) does not make a request to the ollama server, but it sets the 'mode' property in the ollama interface object llm to 'chat' for future requests. Use this syntax to switch from another interface mode to chat mode without making a request to the server. Switching to chat mode does not clear any existing chat history in llm.
An alternative method of calling the chat method is by using direct subscripted reference to the ollama interface object llm, as long as it is already set in chat mode. The table below lists the equivalent syntaxes.
| method call | object subscripted reference |
|---|---|
| chat (llm, prompt) | llm(prompt) |
| chat (llm, prompt, image) | llm(prompt, image) |
| txt = chat (llm, prompt) | txt = llm(prompt) |
| txt = chat (llm, prompt, image) | txt = llm(prompt, image) |
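A short chat-mode sketch with placeholder prompts; each request and response is appended to llm.chatHistory.

```
chat (llm);                                    # switch to chat mode
chat (llm, "Hi!  Please keep your answers short.");
chat (llm, "What is GNU Octave?");
txt = llm ("Who maintains it?");               # equivalent subscripted call in chat mode
showHistory (llm);                             # review the whole session so far
```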
ollama: showStats (llm)
showStats (llm) displays the response statistics of the last response returned from the ollama server interfaced by llm. The type of request (e.g. query, chat, embed) does not alter the displayed statistics, which include the timing and token-count metrics reported by the ollama server.
ollama: showHistory (llm)
ollama: showHistory (llm, 'all')
ollama: showHistory (llm, 'last')
ollama: showHistory (llm, 'first')
ollama: showHistory (llm, idx)
showHistory (llm)
displays the entire chat history stored in
the ollama interface object llm. The chat history is displayed in
chronological order alternating between user’s requests and the model’s
responses. For any user’s request that contained images, the filenames
or the number of images (in case of base64 encoded images) are also
listed below the corresponding request and before the subsequent
response.
showHistory (llm, 'all') is exactly the same as showHistory (llm).
showHistory (llm, 'last') displays only the last user-model interaction of the current chat session.
showHistory (llm, 'first') displays only the first user-model interaction of the current chat session.
showHistory (llm, idx) displays the user-model interactions specified by idx, which must be a scalar or a vector of integer values indexing the rows of the cell array comprising the chatHistory property in llm.
showHistory is explicitly used for displaying the chat history and does not return any output argument. If you want to retrieve the chat history in a cell array, you can access the chatHistory property directly, as in hdata = llm.chatHistory.
ollama: clearHistory (llm)
ollama: clearHistory (llm, 'all')
ollama: clearHistory (llm, 'last')
ollama: clearHistory (llm, 'first')
ollama: clearHistory (llm, idx)
clearHistory (llm)
deletes the entire chat history in the
ollama interface object llm. Use this method to initialize a new
chat session.
clearHistory (llm, 'all') is exactly the same as clearHistory (llm).
clearHistory (llm, 'last') deletes the last user-model interaction from the current chat session. Use this option if you want to rephrase or modify the last request without clearing the entire chat history.
clearHistory (llm, 'first') removes only the first user-model interaction from the current chat session. Use this option if you want to discard the initial user-model interaction in order to experiment with the model's context size.
clearHistory (llm, idx) deletes the user-model interactions specified by idx, which must be a scalar or a vector of integer values indexing the rows of the cell array comprising the chatHistory property in llm.
Note that selectively deleting user-model interactions from the chat history also removes any images that may be integrated with the selected requests.
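A brief sketch of managing an ongoing chat session with showHistory and clearHistory:

```
showHistory (llm, 'last');    # inspect the most recent user-model interaction
clearHistory (llm, 'last');   # drop it, e.g. to rephrase the last request
clearHistory (llm);           # wipe the history and start a new chat session
```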