=======================================
Oracle Data Analytics Accelerator (DAX)
=======================================

DAX is a coprocessor which resides on the SPARC M7 (DAX1) and M8
(DAX2) processor chips, and has direct access to the CPU's L3 caches
as well as physical memory. It can perform several operations on data
streams with various input and output formats.  A driver provides a
transport mechanism and has limited knowledge of the various opcodes
and data formats. A user space library provides high level services
and translates these into low level commands which are then passed
into the driver and subsequently the Hypervisor and the coprocessor.
The library is the recommended way for applications to use the
coprocessor, and the driver interface is not intended for general use.
This document describes the general flow of the driver, its
structures, and its programmatic interface. It also provides example
code sufficient to write user or kernel applications that use DAX
functionality.

The user library is open source and available at:

    https://oss.oracle.com/git/gitweb.cgi?p=libdax.git

The Hypervisor interface to the coprocessor is described in detail in
the accompanying document, dax-hv-api.txt, which is a plain text
excerpt of the (Oracle internal) "UltraSPARC Virtual Machine
Specification" version 3.0.20+15, dated 2017-09-25.


High Level Overview
===================

A coprocessor request is described by a Command Control Block
(CCB). The CCB contains an opcode and various parameters. The opcode
specifies what operation is to be done, and the parameters specify
options, flags, sizes, and addresses.  The CCB (or an array of CCBs)
is passed to the Hypervisor, which handles queueing and scheduling of
requests to the available coprocessor execution units. A status code
returned indicates if the request was submitted successfully or if
there was an error.  One of the addresses given in each CCB is a
pointer to a "completion area", which is a 128 byte memory block that
is written by the coprocessor to provide execution status. No
interrupt is generated upon completion; the completion area must be
polled by software to find out when a transaction has finished, but
the M7 and later processors provide a mechanism to pause the virtual
processor until the completion status has been updated by the
coprocessor. This is done using the monitored load and mwait
instructions, which are described in more detail later.  The DAX
coprocessor was designed so that after a request is submitted, the
kernel is no longer involved in the processing of it.  The polling is
done at the user level, which results in almost zero latency between
completion of a request and resumption of execution of the requesting
thread.


Addressing Memory
=================

The kernel does not have access to physical memory in the Sun4v
architecture, as there is an additional level of memory virtualization
present. This intermediate level is called "real" memory, and the
kernel treats this as if it were physical.  The Hypervisor handles the
translations between real memory and physical so that each logical
domain (LDOM) can have a partition of physical memory that is isolated
from that of other LDOMs.  When the kernel sets up a virtual mapping,
it specifies a virtual address and the real address to which it should
be mapped.

The DAX coprocessor can only operate on physical memory, so before a
request can be fed to the coprocessor, all the addresses in a CCB must
be converted into physical addresses. The kernel cannot do this since
it has no visibility into physical addresses. So a CCB may contain
either the virtual or real addresses of the buffers or a combination
of them. An "address type" field is available for each address that
may be given in the CCB. In all cases, the Hypervisor will translate
all the addresses to physical before dispatching to hardware. Address
translations are performed using the context of the process initiating
the request.


The Driver API
==============

An application makes requests to the driver via the write() system
call, and gets results (if any) via read(). The completion areas are
made accessible via mmap(), and are read-only for the application.

The request may either be an immediate command or an array of CCBs to
be submitted to the hardware.

Each open instance of the device is exclusive to the thread that
opened it, and must be used by that thread for all subsequent
operations. The driver open function creates a new context for the
thread and initializes it for use.  This context contains pointers and
values used internally by the driver to keep track of submitted
requests. The completion area buffer is also allocated, and this is
large enough to contain the completion areas for many concurrent
requests.  When the device is closed, any outstanding transactions are
flushed and the context is cleaned up.

On a DAX1 system (M7), the device will be called "oradax1", while on a
DAX2 system (M8) it will be "oradax2". If an application requires one
or the other, it should simply attempt to open the appropriate
device. Only one of the devices will exist on any given system, so the
name can be used to determine what the platform supports.

The immediate commands are CCB_DEQUEUE, CCB_KILL, and CCB_INFO. For
all of these, success is indicated by a return value from write()
equal to the number of bytes given in the call. Otherwise -1 is
returned and errno is set.
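
The return convention for immediate commands can be wrapped in a small
helper. The following is only a sketch: the name dax_send_immediate is
hypothetical, the command buffer is passed as opaque bytes rather than
the driver's own command structure, and mapping a short write to -EIO
is an assumption, since only "full length" and "-1" are described above.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Send one immediate command buffer to an open descriptor.
 * Returns 0 when write() reports the full length, otherwise a negative
 * errno value. A short write is not described in this document, so it
 * is reported as -EIO here (an assumption).
 */
int dax_send_immediate(int fd, const void *cmd, size_t len)
{
        ssize_t ret = write(fd, cmd, len);

        if (ret == (ssize_t)len)
                return 0;
        if (ret < 0)
                return -errno;
        return -EIO;
}
```

A caller would pass the descriptor returned by open() on the DAX device
together with a filled-in command structure and its size.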

CCB_DEQUEUE
-----------

Tells the driver to clean up resources associated with past
requests. Since no interrupt is generated upon the completion of a
request, the driver must be told when it may reclaim resources.  No
further status information is returned, so the user should not
subsequently call read().

CCB_KILL
--------

Kills a CCB during execution. The CCB is guaranteed to not continue
executing once this call returns successfully. On success, read() must
be called to retrieve the result of the action.

CCB_INFO
--------

Retrieves information about a currently executing CCB. Note that some
Hypervisors might return 'notfound' when the CCB is in 'inprogress'
state. To ensure a CCB in the 'notfound' state will never be executed,
CCB_KILL must be invoked on that CCB. Upon success, read() must be
called to retrieve the details of the action.

Submission of an array of CCBs for execution
---------------------------------------------

A write() whose length is a multiple of the CCB size is treated as a
submit operation. The file offset is treated as the index of the
completion area to use, and may be set via lseek() or using the
pwrite() system call. If -1 is returned then errno is set to indicate
the error. Otherwise, the return value is the length of the array that
was actually accepted by the coprocessor. If the accepted length is
equal to the requested length, then the submission was completely
successful and there is no further status needed; hence, the user
should not subsequently call read(). Partial acceptance of the CCB
array is indicated by a return value less than the requested length,
and read() must be called to retrieve further status information.  The
status will reflect the error caused by the first CCB that was not
accepted, and status_data will provide additional data in some cases.
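
The three possible outcomes of a submit can be told apart from the
write() or pwrite() return value alone. The sketch below illustrates
this; the enum and function names are hypothetical, and it assumes the
returned length is a byte count and that a CCB occupies 64 bytes (eight
64-bit words).

```c
#include <stddef.h>
#include <sys/types.h>

#define DAX_CCB_BYTES 64   /* eight 64-bit words per CCB */

enum dax_submit_outcome {
        DAX_SUBMIT_ERROR,     /* -1 returned; errno holds the cause */
        DAX_SUBMIT_PARTIAL,   /* short acceptance; read() has the status */
        DAX_SUBMIT_COMPLETE   /* whole array accepted; do not call read() */
};

/* Classify the return value of a submit write()/pwrite() */
enum dax_submit_outcome dax_classify_submit(ssize_t ret, size_t requested)
{
        if (ret < 0)
                return DAX_SUBMIT_ERROR;
        if ((size_t)ret < requested)
                return DAX_SUBMIT_PARTIAL;
        return DAX_SUBMIT_COMPLETE;
}

/* Number of whole CCBs accepted, or 0 if the call failed */
size_t dax_ccbs_accepted(ssize_t ret)
{
        return ret < 0 ? 0 : (size_t)ret / DAX_CCB_BYTES;
}
```

Only the DAX_SUBMIT_PARTIAL case calls for a follow-up read(); in the
error case errno already describes the failure.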

MMAP
----

The mmap() function provides access to the completion area allocated
in the driver.  Note that the completion area is not writeable by the
user process, and the mmap call must not specify PROT_WRITE.


Completion of a Request
=======================

The first byte in each completion area is the command status which is
updated by the coprocessor hardware. Software may take advantage of
new M7/M8 processor capabilities to efficiently poll this status byte.
First, a "monitored load" is achieved via a Load from Alternate Space
(ldxa, lduba, etc.) with ASI 0x84 (ASI_MONITOR_PRIMARY).  Second, a
"monitored wait" is achieved via the mwait instruction (a write to
%asr28). This instruction is like pause in that it suspends execution
of the virtual processor for the given number of nanoseconds, but in
addition will terminate early when one of several events occurs. If the
block of data containing the monitored location is modified, then the
mwait terminates. This causes software to resume execution immediately
(without a context switch or kernel to user transition) after a
transaction completes. Thus the latency between transaction completion
and resumption of execution may be just a few nanoseconds.
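
The shape of the polling loop can be sketched in portable C. This is
not the monitored load described above: a plain volatile read stands in
for the load with ASI 0x84, and a bounded spin stands in for mwait, so
it shows only the control flow. The function name and the max_polls
limit are hypothetical.

```c
#include <stdint.h>

/*
 * Poll the first byte of a completion area until it becomes non-zero or
 * max_polls reads have been made. Returns the status byte, or 0 if the
 * request still appears to be in progress.
 */
uint8_t dax_poll_status(const volatile uint8_t *completion_area,
                        unsigned long max_polls)
{
        unsigned long i;

        for (i = 0; i < max_polls; i++) {
                uint8_t status = completion_area[0];

                if (status != 0)
                        return status;
                /* On M7/M8 an mwait would be issued here instead of spinning. */
        }
        return 0;
}
```

Bounding the loop lets the caller fall back to CCB_INFO or CCB_KILL when
a request does not complete in the expected time.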


Application Life Cycle of a DAX Submission
==========================================

 - open dax device
 - call mmap() to get the completion area address
 - allocate a CCB and fill in the opcode, flags, parameters, addresses, etc.
 - submit CCB via write() or pwrite()
 - go into a loop executing monitored load + monitored wait and
   terminate when the command status indicates the request is complete
   (CCB_KILL or CCB_INFO may be used any time as necessary)
 - perform a CCB_DEQUEUE
 - call munmap() for completion area
 - close the dax device


Memory Constraints
==================

The DAX hardware operates only on physical addresses. Therefore, it is
not aware of virtual memory mappings and the discontiguities that may
exist in the physical memory that a virtual buffer maps to. There is
no I/O TLB or any scatter/gather mechanism. All buffers, whether input
or output, must reside in a physically contiguous region of memory.

The Hypervisor translates all addresses within a CCB to physical
before handing off the CCB to DAX. The Hypervisor determines the
virtual page size for each virtual address given, and uses this to
program a size limit for each address. This prevents the coprocessor
from reading or writing beyond the bound of the virtual page, even
though it is accessing physical memory directly. A simpler way of
saying this is that a DAX operation will never "cross" a virtual page
boundary. If an 8k virtual page is used, then the data is strictly
limited to 8k. If a user's buffer is larger than 8k, then a larger
page size must be used, or the transaction size will be truncated to
8k.
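
Before submitting a request, an application can check whether a buffer
stays inside one page. The helpers below are a sketch with hypothetical
names; they assume the page size is a power of two and is already known
to the caller (how to discover the page size backing a given mapping is
outside the scope of this example).

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes from addr to the end of the page containing it.
 * page_size must be a power of two. */
size_t dax_bytes_to_page_end(uintptr_t addr, size_t page_size)
{
        return page_size - (size_t)(addr & (page_size - 1));
}

/* Non-zero if the len bytes starting at addr lie within a single page */
int dax_fits_in_page(uintptr_t addr, size_t len, size_t page_size)
{
        return len > 0 && len <= dax_bytes_to_page_end(addr, page_size);
}
```

For example, with 8k pages a buffer of 8192 bytes fits only if it starts
exactly on a page boundary.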

Huge pages. A user may allocate huge pages using standard interfaces.
Memory buffers residing on huge pages may be used to achieve much
larger DAX transaction sizes, but the rules must still be followed,
and no transaction will cross a page boundary, even a huge page.  A
major caveat is that Linux on SPARC presents 8MB as one of the huge
page sizes. SPARC does not actually provide an 8MB hardware page size,
and this size is synthesized by pasting together two 4MB pages. The
reasons for this are historical, and it creates an issue because only
half of this 8MB page can actually be used for any given buffer in a
DAX request, and it must be either the first half or the second half;
it cannot be a 4MB chunk in the middle, since that crosses a
(hardware) page boundary. Note that this entire issue may be hidden by
higher level libraries.
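
One way a library might hide the 8MB huge page issue is to split a
large buffer into pieces that each stay inside one 4MB hardware page.
The following sketch uses hypothetical names and only computes the
piece lengths; whether a given DAX operation can be split into
independent requests like this depends on the command and is an
assumption here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hardware page size behind an 8MB huge page */
#define HW_PAGE_4MB ((size_t)4 << 20)

/* Length of the next piece of the buffer that stays inside one
 * 4MB hardware page */
size_t dax_next_chunk(uintptr_t addr, size_t remaining)
{
        size_t room = HW_PAGE_4MB - (size_t)(addr & (HW_PAGE_4MB - 1));

        return remaining < room ? remaining : room;
}

/* Number of separate requests needed to cover len bytes starting at addr */
size_t dax_count_chunks(uintptr_t addr, size_t len)
{
        size_t n = 0;

        while (len > 0) {
                size_t chunk = dax_next_chunk(addr, len);

                addr += chunk;
                len -= chunk;
                n++;
        }
        return n;
}
```

A 4MB buffer that starts 2MB into an 8MB huge page therefore needs two
requests of 2MB each, matching the "chunk in the middle" case above.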
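

CCB Structure
-------------
A CCB is an array of 8 64-bit words. Several of these words provide
command opcodes, parameters, flags, etc., and the rest are addresses
for the completion area, output buffer, and various inputs::

   struct ccb {
       u64   control;
       u64   completion;
       u64   input0;
       u64   access;
       u64   input1;
       u64   op_data;
       u64   output;
       u64   table;
   };

See libdax/common/sys/dax1/dax1_ccb.h for a detailed description of
each of these fields, and see dax-hv-api.txt for a complete description
of the Hypervisor API available to the guest OS (i.e., the Linux kernel).

The first word (control) is examined by the driver for the following:
 - CCB version, which must be consistent with hardware version
 - Opcode, which must be one of the documented allowable commands
 - Address types, which must be set to "virtual" for all the addresses
   given by the user, thereby ensuring that the application can
   only access memory that it owns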


Example Code
============

The DAX is accessible to both user and kernel code.  The kernel code
can make hypercalls directly while the user code must use wrappers
provided by the driver. The setup of the CCB is nearly identical for
both; the only difference is in preparation of the completion area. An
example of user code is given now, with kernel code afterwards.

In order to program using the driver API, the file
arch/sparc/include/uapi/asm/oradax.h must be included.

First, the proper device must be opened. For M7 it will be
/dev/oradax1 and for M8 it will be /dev/oradax2. The simplest
procedure is to attempt to open both, as only one will succeed::

        fd = open("/dev/oradax1", O_RDWR);
        if (fd < 0)
                fd = open("/dev/oradax2", O_RDWR);
        if (fd < 0) {
                /* No DAX found */
        }

Next, the completion area must be mapped::

      completion_area = mmap(NULL, DAX_MMAP_LEN, PROT_READ, MAP_SHARED, fd, 0);
      if (completion_area == MAP_FAILED) {
              /* bail out */
      }

All input and output buffers must be fully contained in one hardware
page since, as explained above, the DAX is strictly constrained by
virtual page boundaries.  In addition, the output buffer must be
64-byte aligned and its size must be a multiple of 64 bytes because
the coprocessor writes in units of cache lines.
0282 
0283 Next, the completion area must be mapped::
0284 
0285       completion_area = mmap(NULL, DAX_MMAP_LEN, PROT_READ, MAP_SHARED, fd, 0);
0286 
0287 All input and output buffers must be fully contained in one hardware
0288 page, since as explained above, the DAX is strictly constrained by
0289 virtual page boundaries.  In addition, the output buffer must be
0290 64-byte aligned and its size must be a multiple of 64 bytes because
0291 the coprocessor writes in units of cache lines.
0292 
0293 This example demonstrates the DAX Scan command, which takes as input a
0294 vector and a match value, and produces a bitmap as the output. For
0295 each input element that matches the value, the corresponding bit is
0296 set in the output.
0297 
0298 In this example, the input vector consists of a series of single bits,
0299 and the match value is 0. So each 0 bit in the input will produce a 1
0300 in the output, and vice versa, which produces an output bitmap which
0301 is the input bitmap inverted.
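
Because the expected result is simply the inverted input, it can be
computed in software and compared with the coprocessor output when
testing. The helper below is a sketch with a hypothetical name; it
handles whole bytes only, since the bit ordering within a trailing
partial byte is not described in this document.

```c
#include <stddef.h>
#include <stdint.h>

/* Compute the expected Scan output for a match value of 0 on 1-bit
 * elements: every input bit is inverted. Whole bytes only. */
void scan_zero_reference(const uint8_t *input, uint8_t *expected,
                         size_t nbytes)
{
        size_t i;

        for (i = 0; i < nbytes; i++)
                expected[i] = (uint8_t)~input[i];
}
```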

For details of all the parameters and bits used in this CCB, please
refer to section 36.2.1.3 of the DAX Hypervisor API document, which
describes the Scan command in detail::

        ccb->control =       /* Table 36.1, CCB Header Format */
                  (2L << 48)     /* command = Scan Value */
                | (3L << 40)     /* output address type = primary virtual */
                | (3L << 34)     /* primary input address type = primary virtual */
                             /* Section 36.2.1, Query CCB Command Formats */
                | (1 << 28)     /* 36.2.1.1.1 primary input format = fixed width bit packed */
                | (0 << 23)     /* 36.2.1.1.2 primary input element size = 0 (1 bit) */
                | (8 << 10)     /* 36.2.1.1.6 output format = bit vector */
                | (0 <<  5)     /* 36.2.1.3 First scan criteria size = 0 (1 byte) */
                | (31 << 0);    /* 36.2.1.3 Disable second scan criteria */

        ccb->completion = 0;    /* Completion area address, to be filled in by driver */

        ccb->input0 = (unsigned long) input; /* primary input address */

        ccb->access =       /* Section 36.2.1.2, Data Access Control */
                  (2 << 24)    /* Primary input length format = bits */
                | (nbits - 1); /* number of bits in primary input stream, minus 1 */

        ccb->input1 = 0;       /* secondary input address, unused */

        ccb->op_data = 0;      /* scan criteria (value to be matched) */

        ccb->output = (unsigned long) output;   /* output address */

        ccb->table = 0;        /* table address, unused */

The CCB submission is a write() or pwrite() system call to the
driver. If the call fails, then a read() must be used to retrieve the
status::

        if (pwrite(fd, ccb, 64, 0) != 64) {
                struct ccb_exec_result status;
                read(fd, &status, sizeof(status));
                /* bail out */
        }

After a successful submission of the CCB, the completion area may be
polled to determine when the DAX is finished. Detailed information on
the contents of the completion area can be found in section 36.2.2 of
the DAX HV API document::

        while (1) {
                /* Monitored Load */
                __asm__ __volatile__("lduba [%1] 0x84, %0\n"
                                     : "=r" (status)
                                     : "r"  (completion_area));

                if (status)          /* 0 indicates command in progress */
                        break;

                /* MWAIT */
                __asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::);    /* 1000 ns */
        }

A completion area status of 1 indicates successful completion of the
CCB and validity of the output bitmap, which may be used immediately.
All other non-zero values indicate error conditions which are
described in section 36.2.2::

        if (completion_area[0] != 1) {  /* section 36.2.2, 1 = command ran and succeeded */
                /* completion_area[0] contains the completion status */
                /* completion_area[1] contains an error code, see 36.2.2 */
        }

After the completion area has been processed, the driver must be
notified that it can release any resources associated with the
request. This is done via the dequeue operation::

        struct dax_command cmd;
        cmd.command = CCB_DEQUEUE;
        if (write(fd, &cmd, sizeof(cmd)) != sizeof(cmd)) {
                /* bail out */
        }

Finally, normal program cleanup should be done, i.e., unmapping
completion area, closing the dax device, freeing memory etc.

Kernel example
--------------

The only difference in using the DAX in kernel code is the treatment
of the completion area. Unlike user applications which mmap the
completion area allocated by the driver, kernel code must allocate its
own memory to use for the completion area, and this address and its
type must be given in the CCB::

        ccb->control |=      /* Table 36.1, CCB Header Format */
                (3L << 32);     /* completion area address type = primary virtual */

        ccb->completion = (unsigned long) completion_area;   /* Completion area address */

The dax submit hypercall is made directly. The flags used in the
ccb_submit call are documented in the DAX HV API in section 36.3.1.

::

  #include <asm/hypervisor.h>

        hv_rv = sun4v_ccb_submit((unsigned long)ccb, 64,
                                 HV_CCB_QUERY_CMD |
                                 HV_CCB_ARG0_PRIVILEGED | HV_CCB_ARG0_TYPE_PRIMARY |
                                 HV_CCB_VA_PRIVILEGED,
                                 0, &bytes_accepted, &status_data);

        if (hv_rv != HV_EOK) {
                /* hv_rv is an error code, status_data contains */
                /* potential additional status, see 36.3.1.1 */
        }

After the submission, the completion area polling code is identical to
that in user land::

        while (1) {
                /* Monitored Load */
                __asm__ __volatile__("lduba [%1] 0x84, %0\n"
                                     : "=r" (status)
                                     : "r"  (completion_area));

                if (status)          /* 0 indicates command in progress */
                        break;

                /* MWAIT */
                __asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::);    /* 1000 ns */
        }

        if (completion_area[0] != 1) {  /* section 36.2.2, 1 = command ran and succeeded */
                /* completion_area[0] contains the completion status */
                /* completion_area[1] contains an error code, see 36.2.2 */
        }

The output bitmap is ready for consumption immediately after the
completion status indicates success.

Excerpt from UltraSPARC Virtual Machine Specification
=====================================================

 .. include:: dax-hv-api.txt
    :literal: